Dataset source: https://www.kesci.com
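Column | Meaning
PassengerId | Passenger ID
Survived | Whether the passenger survived (0 = No, 1 = Yes)
Pclass | Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
Name | Passenger name
Sex | Sex
Age | Age
SibSp | Number of siblings and spouses aboard
Parch | Number of parents and children aboard
Ticket | Ticket number
Fare | Fare
Cabin | Cabin number
Embarked | Port of embarkation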
1. Import the required packages and load the dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load the prepared Titanic dataset
dataframe = pd.read_csv('train.csv')
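2. Inspect the data type of each feature
# Show a summary of the data (info() prints directly and returns None, so no print() is needed)
dataframe.info()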
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
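The non-null counts above can also be expressed as a share of missing values per column, which makes it easier to decide which columns to drop. The following is a minimal sketch on a tiny made-up frame (the `sample` data is an assumption standing in for `train.csv`, not the real dataset):

```python
import numpy as np
import pandas as pd

# Tiny made-up sample standing in for train.csv (values are illustrative only)
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})

# isnull() gives a boolean frame; the column mean is the fraction of missing values
missing_ratio = sample.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```

Applied to the real `dataframe`, the same expression would rank Cabin first, since only 204 of its 891 values are present.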
3. Handle missing values
# The summary shows that Cabin has only 204 non-null values out of 891, so too little of it is usable; drop the column
dataframe = dataframe.drop(["Cabin"], axis=1)
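# Age (177 missing) and Embarked (2 missing) also have missing values, though far fewer than Cabin. Options: fill with the mean, fill with the median, or drop the affected rows
# Here we simply drop the rows with missing values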
dataframe = dataframe.dropna()
Data columns (total 11 columns):
PassengerId 712 non-null int64
Survived 712 non-null int64
Pclass 712 non-null int64
Name 712 non-null object
Sex 712 non-null object
Age 712 non-null float64
SibSp 712 non-null int64
Parch 712 non-null int64
Ticket 712 non-null object
Fare 712 non-null float64
Embarked 712 non-null object
dtypes: float64(2), int64(5), object(4)
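Dropping rows removes 179 of the 891 passengers. As an alternative, the missing values could be filled in, as the comment in this step mentions. The sketch below is not what this walkthrough does; it assumes median imputation for Age and most-frequent-value imputation for Embarked, and uses a small made-up `sample` frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only (not taken from train.csv)
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", np.nan, "S", "S"],
})

filled = sample.copy()
# Numeric column: replace missing ages with the median of the observed ages
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
# Categorical column: replace missing ports with the most frequent port
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

print(filled)
```

This keeps every row, at the cost of adding values that were never observed, so any conclusion about Age should be read with that in mind.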
4. Plot each feature against survival
4.1 Overall survival counts
# Start with the distribution of survivors and non-survivors
dataframe.Survived.value_counts().plot(kind='bar')
plt.title("Distribution of survival (0=No 1=Yes)")

4.2 Number of passengers in each class
# Number of passengers in each class
dataframe.Pclass.value_counts().plot(kind='bar')
plt.title("Number Of Passengers PClass")

4.3 Number of passengers with n siblings or spouses aboard
# Count passengers by the number of siblings or spouses aboard
dataframe.SibSp.value_counts().plot(kind='bar')
plt.title("Passengers with siblings or spouse")

4.4 Survival versus age
# Next, the relationship between age and survival
plt.scatter(dataframe.Age, dataframe.Survived, alpha=0.1)
plt.title("Age Distribution v/s Survived")
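Because Survived only takes the values 0 and 1, the scatter plot collapses into two horizontal bands and is hard to read. One possible complement is to bin the ages and compute the survival rate per bin. The sketch below assumes arbitrary bin edges and uses a small made-up `sample` frame, so the numbers it prints say nothing about the real dataset:

```python
import pandas as pd

# Made-up ages and outcomes for illustration only
sample = pd.DataFrame({
    "Age": [4, 9, 25, 31, 40, 62, 70, 18],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# Assumed bin edges; each interval is right-inclusive, e.g. (12, 18]
bins = [0, 12, 18, 40, 60, 80]
labels = ["0-12", "13-18", "19-40", "41-60", "61-80"]
sample["AgeGroup"] = pd.cut(sample["Age"], bins=bins, labels=labels)

# Mean of a 0/1 column is the survival rate; observed=True skips empty bins
rate_by_age = sample.groupby("AgeGroup", observed=True)["Survived"].mean()
print(rate_by_age)
```

On the real data, `rate_by_age.plot(kind='bar')` would give a bar per age group, in the same style as the other plots in this section.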

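4.5 Survival by sex
# Survival of male passengers (sort_index() keeps the bar order 0, 1, matching the female plot)
dataframe.Survived[dataframe.Sex == 'male'].value_counts().sort_index().plot(kind='bar')
plt.title("Analyzing Male Passengers: Not Survived And Survived")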

# Survival of female passengers
dataframe.Survived[dataframe.Sex == 'female'].value_counts().sort_index().plot(kind = 'bar', color='pink')
plt.title("Analyzing Female Passengers: Survived And Not Survived")

4.6 Survival by passenger class
# First, the third-class (lower-class) passengers
dataframe.Survived[dataframe.Pclass == 3].value_counts().sort_index().plot(kind = 'bar', color = 'green')
plt.title("Analyzing Low Class Passengers: Not Survived And Survived")

# Treat classes 1 and 2 together as the upper class
dataframe.Survived[dataframe.Pclass != 3].value_counts().sort_index().plot(kind = 'bar', color = 'red')
plt.title("Analyzing High Class Passengers: Not Survived And Survived")

4.7 Survival by class and sex
fig = plt.figure()
# Each subplot combines both conditions into one boolean mask; chaining two filters relies on index realignment
ax1 = fig.add_subplot(221)
dataframe.Survived[(dataframe.Sex == 'female') & (dataframe.Pclass != 3)].value_counts().sort_index().plot(kind='bar', ax=ax1, label="highclass_female", color='pink')
ax1.legend(loc='best')
ax2 = fig.add_subplot(222)
dataframe.Survived[(dataframe.Sex == 'female') & (dataframe.Pclass == 3)].value_counts().sort_index().plot(kind='bar', ax=ax2, alpha=0.5, label="lowclass_female", color='pink')
ax2.legend(loc='best')
ax3 = fig.add_subplot(223)
dataframe.Survived[(dataframe.Sex == 'male') & (dataframe.Pclass != 3)].value_counts().sort_index().plot(kind='bar', ax=ax3, label="highclass_male", color='blue')
ax3.legend(loc='best')
ax4 = fig.add_subplot(224)
dataframe.Survived[(dataframe.Sex == 'male') & (dataframe.Pclass == 3)].value_counts().sort_index().plot(kind='bar', ax=ax4, alpha=0.5, label="lowclass_male", color='blue')
ax4.legend(loc='best')
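The four bar charts show raw counts, which are hard to compare when the groups differ in size. A possible complement is a table of survival rates per class and sex. The sketch below uses `groupby` on a small made-up `sample` frame (an assumption for illustration, not the real data):

```python
import pandas as pd

# Made-up passengers for illustration only
sample = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3, 1, 3],
    "Sex": ["female", "male", "female", "female", "male", "male", "female", "male"],
    "Survived": [1, 0, 1, 0, 0, 1, 1, 0],
})

# Mean of the 0/1 Survived column per (class, sex) group is the survival rate;
# unstack() turns Sex into columns, leaving NaN where a group has no rows
rate_table = sample.groupby(["Pclass", "Sex"])["Survived"].mean().unstack()
print(rate_table)
```

On the real `dataframe`, the same expression would need to run before Sex is re-encoded as numbers in step 5, or the column labels would be 0 and 1 instead of the names.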

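5. Use corr() for a rough estimate of how survival correlates with each feature
# Encode Sex as a number (female = 1, male = 0); Embarked is left unencoded because it is not used below
# map() also gives the column a numeric dtype, which corr() expects
dataframe["Sex"] = dataframe["Sex"].map({"female": 1, "male": 0})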
print('Sex:', dataframe['Survived'].corr(dataframe['Sex']))
print('Pclass:', dataframe['Survived'].corr(dataframe['Pclass']))
print('Age:', dataframe['Survived'].corr(dataframe['Age']))
Sex: 0.5367616233485025
Pclass: -0.35646158844523856
Age: -0.08244586804341386
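By default, `Series.corr()` computes the Pearson correlation coefficient, which measures only linear association. To make the numbers above less of a black box, the sketch below computes the coefficient by hand and compares it with pandas. The `sample` values are made up for illustration and are unrelated to the results printed above:

```python
import numpy as np
import pandas as pd

# Made-up 0/1 data for illustration only
sample = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1],
    "Sex": [1, 0, 1, 1, 0, 0],
})

x = sample["Survived"].to_numpy(dtype=float)
y = sample["Sex"].to_numpy(dtype=float)

# Pearson r = sum((x - mean_x) * (y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
dx = x - x.mean()
dy = y - y.mean()
manual_r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

pandas_r = sample["Survived"].corr(sample["Sex"])
print(manual_r, pandas_r)
```

A value near 0, such as the one for Age above, therefore only rules out a strong linear relationship; it does not show that age has no effect on survival.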