2020-07-20

Original

Dataset source: https://www.kesci.com

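Field	Description
PassengerId	Passenger ID
Survived	Whether the passenger survived (0 = No, 1 = Yes)
Pclass	Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
Name	Passenger name
Sex	Sex
Age	Age
SibSp	Number of siblings and spouses aboard
Parch	Number of parents and children aboard
Ticket	Ticket number
Fare	Fare
Cabin	Cabin number
Embarked	Port of embarkation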

1. Import the required packages and load the dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the prepared Titanic dataset
dataframe = pd.read_csv('train.csv')

2. Check the data type of each feature

# Show dataset information (info() prints its summary itself and returns None)
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
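
The non-null counts above can also be turned into explicit missing-value counts. The following is only a minimal sketch on a small made-up frame (the name `sample` and its values are assumptions, not rows from train.csv); on the real data the same two calls would be applied to `dataframe`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv; the values are made up
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# Number of missing values in each column
missing_counts = sample.isnull().sum()

# Share of missing values in each column (mean of a boolean mask)
missing_ratio = sample.isnull().mean()

print(missing_counts)
print(missing_ratio)
```

A column with a high missing share, like Cabin here, is a candidate for dropping, while columns with only a few gaps can be filled or filtered row by row.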

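3. Handle missing values

# Cabin has only 204 non-null values out of 891, so its share of valid data is small and the column can be dropped
dataframe = dataframe.drop(["Cabin"], axis=1)

# 'Age' (177 missing) and 'Embarked' (2 missing) also have missing values, but fewer. Options: fill with the mean, fill with the median, or drop the rows
# Here we simply drop the rows that contain missing values
dataframe = dataframe.dropna()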

Data columns (total 11 columns):
PassengerId    712 non-null int64
Survived       712 non-null int64
Pclass         712 non-null int64
Name           712 non-null object
Sex            712 non-null object
Age            712 non-null float64
SibSp          712 non-null int64
Parch          712 non-null int64
Ticket         712 non-null object
Fare           712 non-null float64
Embarked       712 non-null object
dtypes: float64(2), int64(5), object(4)
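
Dropping rows reduces the data from 891 to 712 entries. For comparison, the median option mentioned in the comments could look roughly like the sketch below. It runs on a made-up frame (`sample` and `filled` are hypothetical names), and filling Embarked with its most frequent value is an assumption added here, not something the post does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the real Titanic rows
sample = pd.DataFrame({
    "Age": [20.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "S", None, "C"],
})

filled = sample.copy()

# Fill missing ages with the median of the observed ages
filled["Age"] = filled["Age"].fillna(filled["Age"].median())

# Fill missing ports with the most frequent port (assumed strategy)
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

print(filled)
```

This keeps every row, at the cost of introducing estimated values, so the later statistics would differ somewhat from the ones shown in this post.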

4. Plot each feature against survival

4.1 Overall survival counts

# First, look at the distribution of survival
dataframe.Survived.value_counts().plot(kind='bar')
plt.title("Distribution of survival (0=No, 1=Yes)")
plt.show()
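
The bar chart shows raw counts. If proportions are easier to read, `value_counts(normalize=True)` gives the share of each outcome. A small sketch on a made-up series (the values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical survival flags; not taken from train.csv
survived = pd.Series([0, 0, 1, 0, 1], name="Survived")

# Absolute counts per outcome, ordered by label (0 first, then 1)
counts = survived.value_counts().sort_index()

# Share of each outcome
shares = survived.value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```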

4.2 Number of passengers in each class

# Number of passengers in each class
dataframe.Pclass.value_counts().plot(kind='bar')
plt.title("Number of Passengers by Pclass")
plt.show()

4.3 Number of passengers with n siblings or spouses aboard

# Number of passengers with n siblings or spouses aboard
dataframe.SibSp.value_counts().plot(kind='bar')
plt.title("Passengers with siblings or spouse")
plt.show()

4.4 Relationship between survival and age

# Next, the relationship between survival and age
plt.scatter(dataframe.Age, dataframe.Survived, alpha=0.1)
plt.title("Age Distribution v/s Survived")
plt.show()
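
Because Survived only takes the values 0 and 1, the scatter plot collapses into two horizontal bands. One possible complement is to bin ages and compare survival rates per bin. The sketch below uses made-up data, and the bin edges and labels are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data; not the real Titanic rows
sample = pd.DataFrame({
    "Age": [4, 9, 25, 30, 45, 70],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# Assumed bins: (0, 12], (12, 60], (60, 100]
age_group = pd.cut(sample["Age"], bins=[0, 12, 60, 100],
                   labels=["child", "adult", "senior"])

# The mean of a 0/1 column is the survival rate within each group
rate_by_age = sample.groupby(age_group, observed=False)["Survived"].mean()

print(rate_by_age)
```

Calling `rate_by_age.plot(kind='bar')` would then give a chart in the same style as the other plots in this section.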

4.5 Relationship between sex and survival

# Survival among male passengers
dataframe.Survived[dataframe.Sex == 'male'].value_counts().sort_index().plot(kind='bar')
plt.title("Analyzing Male Passengers: Not Survived And Survived")
plt.show()

# Survival among female passengers
dataframe.Survived[dataframe.Sex == 'female'].value_counts().sort_index().plot(kind = 'bar', color='pink')
plt.title("Analyzing Female Passengers: Not Survived And Survived")
plt.show()

4.6 Differences in survival between classes

# Next, the poorer third-class passengers
dataframe.Survived[dataframe.Pclass == 3].value_counts().sort_index().plot(kind = 'bar', color = 'green')
plt.title("Analyzing Low Class Passengers: Not Survived And Survived")
plt.show()

# Treat classes 1 and 2 together as the wealthy group
dataframe.Survived[dataframe.Pclass != 3].value_counts().sort_index().plot(kind = 'bar', color = 'red')
plt.title("Analyzing High Class Passengers: Not Survived And Survived")
plt.show()

4.7 Survival by class and sex

fig = plt.figure()

# Female passengers in classes 1 and 2
ax1 = fig.add_subplot(221)
dataframe.Survived[(dataframe.Sex == 'female') & (dataframe.Pclass != 3)].value_counts().sort_index().plot(kind='bar', ax=ax1, label="highclass_female", color='pink')
ax1.legend(loc='best')

# Female passengers in class 3
ax2 = fig.add_subplot(222)
dataframe.Survived[(dataframe.Sex == 'female') & (dataframe.Pclass == 3)].value_counts().sort_index().plot(kind='bar', ax=ax2, alpha=0.5, label="lowclass_female", color='pink')
ax2.legend(loc='best')

# Male passengers in classes 1 and 2
ax3 = fig.add_subplot(223)
dataframe.Survived[(dataframe.Sex == 'male') & (dataframe.Pclass != 3)].value_counts().sort_index().plot(kind='bar', ax=ax3, label="highclass_male", color='blue')
ax3.legend(loc='best')

# Male passengers in class 3
ax4 = fig.add_subplot(224)
dataframe.Survived[(dataframe.Sex == 'male') & (dataframe.Pclass == 3)].value_counts().sort_index().plot(kind='bar', ax=ax4, alpha=0.5, label="lowclass_male", color='blue')
ax4.legend(loc='best')

plt.show()
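
The four panels compare counts, which makes groups of different sizes hard to compare directly. As a rough complement, survival rates for the same four groups can be put in one table with `groupby`. The sketch below uses made-up rows, and the helper column `ClassGroup` is a hypothetical name introduced only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the real Titanic rows
sample = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "female", "male"],
    "Pclass": [1, 3, 1, 3, 2, 3],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# Same split as the plots: classes 1 and 2 together, class 3 on its own
sample["ClassGroup"] = np.where(sample["Pclass"] == 3, "low", "high")

# Rows: sex, columns: class group, values: survival rate
rate_table = sample.groupby(["Sex", "ClassGroup"])["Survived"].mean().unstack()

print(rate_table)
```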

5. Use corr() to roughly estimate the correlation between survival and each feature

# Encode sex as a number (female = 1, male = 0) so that corr() can use it
# Port of embarkation is not used below, so it is left unencoded
dataframe["Sex"] = dataframe["Sex"].map({"female": 1, "male": 0})

print('Sex:', dataframe['Survived'].corr(dataframe['Sex']))
print('Pclass:', dataframe['Survived'].corr(dataframe['Pclass']))
print('Age:', dataframe['Survived'].corr(dataframe['Age']))

Sex: 0.5367616233485025
Pclass: -0.35646158844523856
Age: -0.08244586804341386
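
Instead of calling `Series.corr` once per feature, `DataFrame.corr()` computes the Pearson correlation between all numeric columns at once, and the Survived column of that matrix holds the same kind of values as above. A minimal sketch on made-up numbers (the frame `sample` is an assumption, so its results are unrelated to the figures printed above):

```python
import pandas as pd

# Hypothetical toy data with Sex already encoded (female = 1, male = 0)
sample = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0],
    "Sex": [1, 0, 1, 0, 0, 1],
    "Pclass": [1, 3, 2, 3, 1, 2],
    "Age": [29.0, 40.0, 22.0, 35.0, 50.0, 19.0],
})

# Pearson correlation of every other column with Survived, sorted ascending
corr_with_survived = sample.corr()["Survived"].drop("Survived").sort_values()

print(corr_with_survived)
```

These coefficients only measure linear association, so a value near zero, such as the one for Age above, does not rule out a non-linear relationship.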

Python

Machine Learning

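Author: 李延松
Published: 2020-07-20 17:17
License: free to repost, non-commercial, no derivatives, attribution required (Creative Commons 3.0 license)
Reposting on WeChat official accounts: please add the QR code of the author's official account at the end of the article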

Titanic Dataset Exercise 01