
Titanic Dataset Exercise 01


Dataset source: https://www.kesci.com

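Field        Description
PassengerId  Passenger ID
Survived     Whether the passenger survived (0 = No, 1 = Yes)
Pclass       Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
Name         Passenger name
Sex          Sex
Age          Age
SibSp        Number of siblings and spouses aboard
Parch        Number of parents and children aboard
Ticket       Ticket number
Fare         Ticket fare
Cabin        Cabin number
Embarked     Port of embarkation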

1. Import the required packages and load the dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the prepared Titanic dataset
dataframe = pd.read_csv('train.csv')

2. Check the data type of each feature

# Show the dataset summary (info() prints directly and returns None, so no print() is needed)
dataframe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
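Reading the missing values off the info() output means subtracting each non-null count from 891 by hand (177 for Age, 687 for Cabin, 2 for Embarked). A minimal sketch of getting the same numbers directly is below. The `toy` frame and its values are made up for illustration and merely stand in for train.csv.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv (values are invented for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan, "Q"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.46],
})

# Number of missing values per column
missing_count = toy.isnull().sum()
# Share of missing values per column (the mean of a boolean mask)
missing_ratio = toy.isnull().mean()

summary = pd.DataFrame({"missing": missing_count, "ratio": missing_ratio})
print(summary.sort_values("missing", ascending=False))
```

Running the same two lines on the real `dataframe` should make it easy to see which columns are worth keeping.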

3. Handle missing values

# The summary shows that Cabin has only 204 non-null values out of 891 (about 23%), so the column can simply be dropped
dataframe = dataframe.drop(["Cabin"], axis=1)

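# 'Age' and 'Embarked' also have missing values, though far fewer than Cabin (177 and 2).
# Options: fill with the mean, fill with the median, or drop the rows.
# Here we simply drop the rows that contain missing values
dataframe = dataframe.dropna()

# Check the result (output abridged)
dataframe.info()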
Data columns (total 11 columns):
PassengerId    712 non-null int64
Survived       712 non-null int64
Pclass         712 non-null int64
Name           712 non-null object
Sex            712 non-null object
Age            712 non-null float64
SibSp          712 non-null int64
Parch          712 non-null int64
Ticket         712 non-null object
Fare           712 non-null float64
Embarked       712 non-null object
dtypes: float64(2), int64(5), object(4)
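Dropping rows shrinks the data from 891 to 712 passengers. The comment above also mentions filling with the mean or median; a minimal sketch of that alternative follows. It is not what this exercise does, the `toy` data is invented, and using the most frequent value for Embarked is my own assumption rather than something the original states.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Age and Embarked columns (values are invented)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", "S", np.nan, "Q"],
})

filled = toy.copy()
# Numeric column: fill gaps with the median of the observed ages
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
# Categorical column: fill gaps with the most frequent port (an assumption)
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

print(filled)
```

Imputation keeps every row, at the cost of pulling the Age distribution toward its centre, so which approach is better depends on what the data is used for later.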

4. Plot each feature against survival

4.1 Overall survival counts

# First, look at the distribution of survivors
dataframe.Survived.value_counts().plot(kind='bar')
plt.title("Distribution of survival (0=No, 1=Yes)")

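The bar chart shows raw counts. If a proportion is easier to read, `value_counts(normalize=True)` gives it directly. A small sketch on an invented `survived` series (not the real column):

```python
import pandas as pd

# Invented 0/1 survival flags standing in for dataframe.Survived
survived = pd.Series([0, 1, 1, 0, 0, 1, 0, 0], name="Survived")

# Raw counts per outcome, ordered 0 then 1
counts = survived.value_counts().sort_index()
# The same outcomes expressed as shares of all passengers
rates = survived.value_counts(normalize=True).sort_index()

print(counts)
print(rates)
```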

4.2 Number of passengers in each class

# Number of passengers per class
dataframe.Pclass.value_counts().plot(kind='bar')
plt.title("Number Of Passengers PClass")


4.3 Number of passengers with n siblings or spouses aboard

# Number of passengers with n siblings or spouses aboard
dataframe.SibSp.value_counts().plot(kind='bar')
plt.title("Passengers with siblings or spouse")


4.4 Survival versus age

# Next, the relationship between survival and age
plt.scatter(dataframe.Age, dataframe.Survived, alpha=0.1)
plt.title("Age Distribution v/s Survived")

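Because Survived only takes the values 0 and 1, the scatter plot collapses into two horizontal bands and is hard to read. One possible alternative is to bin ages with `pd.cut` and compare the survival rate per bin. The bin edges, the labels, and the `toy` data below are all arbitrary choices of mine for illustration.

```python
import pandas as pd

# Invented ages and outcomes standing in for the real columns
toy = pd.DataFrame({
    "Age": [4, 9, 15, 23, 31, 38, 47, 62],
    "Survived": [1, 1, 0, 0, 1, 0, 0, 0],
})

# Hypothetical age bands; intervals are right-closed: (0, 12], (12, 18], ...
bins = [0, 12, 18, 40, 80]
labels = ["child", "teen", "adult", "senior"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# The mean of a 0/1 column is the survival rate within each band
rate_by_age = toy.groupby("AgeGroup", observed=False)["Survived"].mean()
print(rate_by_age)
```

The resulting series can be passed to `.plot(kind='bar')` in the same way as the other charts in this section.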

4.5 Survival by sex

# Survival among male passengers
dataframe.Survived[dataframe.Sex == 'male'].value_counts().sort_index().plot(kind='bar')
plt.title("Analyzing Male Passengers: Survived And Not Survived")


# Survival among female passengers
dataframe.Survived[dataframe.Sex == 'female'].value_counts().sort_index().plot(kind = 'bar', color='pink')
plt.title("Analyzing Female Passengers: Survived And Not Survived")

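The two charts have to be compared by eye. A compact way to put both sexes side by side is a `groupby` with a few aggregates, sketched here on invented data:

```python
import pandas as pd

# Invented rows standing in for the Sex and Survived columns
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male",
            "male", "female", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
})

# count = passengers, sum = survivors, mean = survival rate
summary = toy.groupby("Sex")["Survived"].agg(["count", "sum", "mean"])
print(summary)
```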

4.6 Survival by passenger class

# Now look at the third-class (lower-class) passengers
dataframe.Survived[dataframe.Pclass == 3].value_counts().sort_index().plot(kind = 'bar', color = 'green')
plt.title("Analyzing Low Class Passengers: Not Survived And Survived")


# Treat classes 1 and 2 together as upper class
dataframe.Survived[dataframe.Pclass != 3].value_counts().sort_index().plot(kind = 'bar', color = 'red')
plt.title("Analyzing High Class Passengers: Not Survived And Survived")

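Merging classes 1 and 2 hides any difference between them. As a rough sketch, `pd.crosstab` with `normalize="index"` shows the share of each outcome for all three classes at once. The `toy` values are invented and say nothing about the real data.

```python
import pandas as pd

# Invented rows standing in for the Pclass and Survived columns
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Each row sums to 1: share of non-survivors (0) and survivors (1) per class
share = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(share)
```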

4.7 Survival by class and sex

fig = plt.figure()

# Build each boolean mask once and combine them with & / ~ instead of chaining two [] filters
is_female = dataframe.Sex == 'female'
is_male = dataframe.Sex == 'male'
is_low = dataframe.Pclass == 3

# Upper-class (classes 1 and 2) female passengers
ax1 = fig.add_subplot(221)
dataframe.Survived[is_female & ~is_low].value_counts().sort_index().plot(kind='bar', ax=ax1, label="highclass_female", color='pink')
ax1.legend(loc='best')

# Lower-class (class 3) female passengers
ax2 = fig.add_subplot(222)
dataframe.Survived[is_female & is_low].value_counts().sort_index().plot(kind='bar', ax=ax2, alpha=0.5, label="lowclass_female", color='pink')
ax2.legend(loc='best')

# Upper-class (classes 1 and 2) male passengers
ax3 = fig.add_subplot(223)
dataframe.Survived[is_male & ~is_low].value_counts().sort_index().plot(kind='bar', ax=ax3, label="highclass_male", color='blue')
ax3.legend(loc='best')

# Lower-class (class 3) male passengers
ax4 = fig.add_subplot(224)
dataframe.Survived[is_male & is_low].value_counts().sort_index().plot(kind='bar', ax=ax4, alpha=0.5, label="lowclass_male", color='blue')
ax4.legend(loc='best')

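The four panels can also be condensed into a single table of survival rates. The sketch below uses `pivot_table` on invented data; the `ClassGroup` column and its "high"/"low" labels are hypothetical names that mirror the split used above (class 3 versus classes 1 and 2).

```python
import numpy as np
import pandas as pd

# Invented rows standing in for the real dataset
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "female",
            "male", "male", "male", "male"],
    "Pclass": [1, 2, 3, 3, 1, 2, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Hypothetical helper column: class 3 is "low", classes 1 and 2 are "high"
toy["ClassGroup"] = np.where(toy["Pclass"] == 3, "low", "high")

# Survival rate for every sex / class-group combination
rates = toy.pivot_table(index="Sex", columns="ClassGroup",
                        values="Survived", aggfunc="mean")
print(rates)
```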

5. Use corr() for a rough estimate of how survival correlates with each feature

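# Encode sex as a number (female = 1, male = 0) so that corr() can use it.
# map() returns an integer column; assigning through .loc would leave the dtype as object.
# Embarked is not encoded in this exercise.
dataframe["Sex"] = dataframe["Sex"].map({"female": 1, "male": 0})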

print('Sex:', dataframe['Survived'].corr(dataframe['Sex']))
print('Pclass:', dataframe['Survived'].corr(dataframe['Pclass']))
print('Age:', dataframe['Survived'].corr(dataframe['Age']))
Sex: 0.5367616233485025
Pclass: -0.35646158844523856
Age: -0.08244586804341386
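Rather than calling `Series.corr` once per feature, `DataFrame.corr()` computes all pairwise Pearson coefficients in one go. A minimal sketch on invented data follows, with Sex already encoded as female = 1 and male = 0 as above. The numbers it prints have nothing to do with the real results.

```python
import pandas as pd

# Invented, already-numeric rows standing in for the real dataset
toy = pd.DataFrame({
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
    "Sex":      [1, 1, 1, 1, 0, 0, 0, 0],
    "Pclass":   [1, 1, 2, 3, 1, 2, 3, 3],
})

# Pairwise Pearson correlation between all numeric columns
corr = toy.corr()

# Keep only the column of interest, strongest relationships first
print(corr["Survived"].drop("Survived").sort_values(key=abs, ascending=False))
```

Pearson correlation only captures linear relationships, so a value near zero (as with Age in the real data) does not by itself mean the feature is irrelevant.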
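  • Author: 李延松
  • Published: 2020-07-20 17:17
  • License: free to repost; non-commercial, no derivatives, attribution required (Creative Commons 3.0 license)
  • WeChat official account reposts: please add the author's official-account QR code at the end of the post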
