Through a real-estate company project, let's experience the end-to-end workflow a data scientist goes through on a machine learning project:
Note: we only practice part of the process here. Data source: the 1990 California census. The mind map below gives an overview of the concrete steps we will follow.
Many websites host public datasets: Kaggle, AWS, KESCI, and so on. I downloaded the dataset to disk first and then read it in.
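If you would rather fetch the dataset programmatically, here is a minimal sketch, assuming the copy hosted in Aurélien Géron's handson-ml2 repository (the local paths are chosen to match the read_csv call below):
import os
import tarfile
import urllib.request

HOUSING_URL = ("https://raw.githubusercontent.com/ageron/handson-ml2/"
               "master/datasets/housing/housing.tgz")
HOUSING_PATH = os.path.join("datasets", "housing")

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)  # download the archive
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)      # unpack housing.csv
fetch_housing_data()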
%matplotlib inline
import numpy as np
import pandas as pd
import sklearn.linear_model
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
# Load the dataset
housing = pd.read_csv("./datasets/housing/housing.csv", thousands=',')
For example: check the dataset's types, counts, statistics, and visualizations to get an overall feel for the data and make the later steps easier.
# View the first 5 rows of the dataset
housing.head(5)
Note: the dataset has 10 features: (1) longitude, (2) latitude, (3) housing_median_age, (4) total_rooms, (5) total_bedrooms, (6) population, (7) households, (8) median_income, (9) median_house_value, and (10) ocean_proximity.
# View a brief summary of the dataset
housing.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude 20640 non-null float64
latitude 20640 non-null float64
housing_median_age 20640 non-null float64
total_rooms 20640 non-null float64
total_bedrooms 20433 non-null float64
population 20640 non-null float64
households 20640 non-null float64
median_income 20640 non-null float64
median_house_value 20640 non-null float64
ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
Note: the first 9 features are float type. total_bedrooms has only 20433 non-null values, which means 207 missing values; we should pay close attention to them in later processing. The last feature, ocean_proximity, is object type, most likely text, and needs further inspection.
# Check how many categories exist in ocean_proximity
housing["ocean_proximity"].value_counts()
Output:
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: ocean_proximity, dtype: int64
# Show basic statistics
housing.describe()
# Quickly inspect the distribution of each numeric feature with histograms
housing.hist(bins=50, figsize=(15,15))
plt.show()
Analysis:
Why create a test set at this stage rather than keep exploring the full dataset? Because the human brain is a remarkable pattern-detection system, which also means it overfits easily: if you browse the test data, you may well stumble on some seemingly interesting pattern and choose a particular kind of machine learning model because of it. When you later estimate the generalization error on that same test set, the estimate will be too optimistic, and the launched system will perform worse than expected. This is called "data snooping bias".
np.random.seed(42)
# Option 1: split the dataset by pure random sampling
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
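A quick usage sketch of this helper (the seed above keeps the shuffle reproducible; without it, each run would produce a different test set and the model would eventually see the whole dataset):
train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), "train +", len(test_set), "test")  # 16512 train + 4128 test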
# Option 2: stratified sampling
# 1. First bin median income into five strata: 1, 2, 3, 4, 5
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
# 2. Then sample randomly within each stratum, preserving the proportions
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
# Inspect the new income_cat feature created for stratification
housing["income_cat"].value_counts()
Output:
3 7236
2 6581
4 3639
5 2362
1 822
Name: income_cat, dtype: int64
# View the proportion of each income category in the test set
strat_test_set["income_cat"].value_counts() / len(strat_test_set)
Output:
3 0.350533
2 0.318798
4 0.176357
5 0.114583
1 0.039729
Name: income_cat, dtype: float64
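To see why stratification matters, here is a small comparison sketch (it reuses the split_train_test helper defined above, so housing must still contain income_cat at this point):
def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

train_set, test_set = split_train_test(housing, 0.2)
compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()
# The stratified test set tracks the overall proportions far more closely
compare_props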
# After the stratified split, the income_cat feature can be dropped (from both splits)
strat_train_set = strat_train_set.drop("income_cat", axis=1)
strat_test_set = strat_test_set.drop("income_cat", axis=1)
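Later steps reload these splits from disk, but the save itself is not shown in the original; a minimal sketch is assumed here, with file names chosen to match the read_csv calls below:
strat_train_set.to_csv('./strat_train_set.csv')  # the index is written as column 0
strat_test_set.to_csv('./strat_test_set.csv')    # hence index_col=0 when reloading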
Since this case includes geographic features (longitude and latitude), we can use them to visualize the shape of the region:
# Choose the x/y columns and the transparency; point size reflects population, color reflects price
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
s=housing["population"]/100, label="population", figsize=(10,7),
c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
sharex=False)
plt.legend()
# Combine the visualization with a map image
import matplotlib.image as mpimg
california_img=mpimg.imread('./images/end_to_end_project/california.png')
ax = housing.plot(kind="scatter", x="longitude", y="latitude", figsize=(10,7),
s=housing['population']/100, label="Population",
c="median_house_value", cmap=plt.get_cmap("jet"),
colorbar=False, alpha=0.4,
)
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5,
cmap=plt.get_cmap("jet"))
# Add a colorbar (ticks are normalized to the 0-1 range of the image)
prices = housing["median_house_value"]
tick_values = np.linspace(prices.min(), prices.max(), 11)
cbar = plt.colorbar(ticks=tick_values/prices.max())
cbar.ax.set_yticklabels(["$%dk"%(round(v/1000)) for v in tick_values], fontsize=14)
cbar.set_label('Median House Value', fontsize=16)
plt.legend(fontsize=16)
plt.show()
# Data correlations:
# Pearson correlation coefficient: 1 = perfect positive, -1 = perfect negative, 0 = none
corr_matrix = housing.corr()  # in pandas >= 2.0, pass numeric_only=True, since ocean_proximity is text
# Correlation of each feature with the median house value
corr_matrix["median_house_value"].sort_values(ascending=False)
Output:
median_house_value 1.000000
median_income 0.687160
income_cat 0.642274
total_rooms 0.135097
housing_median_age 0.114110
Unnamed: 0 0.067723
households 0.064506
total_bedrooms 0.047689
population -0.026920
longitude -0.047432
latitude -0.142724
Name: median_house_value, dtype: float64
# Visualize correlations with a scatter matrix
from pandas.plotting import scatter_matrix
# Show only a subset of the features
attributes = ["median_house_value", "median_income", "total_rooms",
"housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
# The most promising feature is median income
housing.plot(kind="scatter", x="median_income", y="median_house_value",
alpha=0.1)
plt.axis([0, 16, 0, 550000])
Note: median income correlates strongly with the median house value: the upward trend is clear and the points are not too scattered. However, the plot reveals a horizontal line at the $500,000 price cap. To keep the algorithm from learning to reproduce this artifact, we may want to handle such districts later.
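One way to handle this, shown only as a hedged sketch (we do not apply it in this walkthrough), is to drop the capped districts before training:
# Assumption: the cap value in this dataset is 500001
housing_uncapped = housing[housing["median_house_value"] < 500001].copy()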
Feature combination comes up all the time in data preparation, but in this case we only experiment with it and do not actually feed the results to our machine learning algorithm.
# Compute new features from existing ones
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]
# Re-examine the correlations
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
Output:
median_house_value 1.000000
median_income 0.687160
income_cat 0.642274
rooms_per_household 0.146285
total_rooms 0.135097
housing_median_age 0.114110
Unnamed: 0 0.067723
households 0.064506
total_bedrooms 0.047689
population_per_household -0.021985
population -0.026920
longitude -0.047432
latitude -0.142724
bedrooms_per_room -0.259984
Name: median_house_value, dtype: float64
# Plot the new rooms_per_household feature against the house value
housing.plot(kind="scatter", x="rooms_per_household", y="median_house_value",
alpha=0.2)
plt.axis([0, 5, 0, 520000])
plt.show()
Note: on inspection, the effect of the new rooms_per_household feature is not pronounced; of the three derived features, bedrooms_per_room (correlation -0.26) is the most informative, more so than the raw counts it was built from.
# Reload the saved training set
strat_train_set = pd.read_csv('./strat_train_set.csv')
strat_train_set.head()
# Separate the features from the labels (drop() returns a copy; strat_train_set keeps median_house_value)
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()
# Inspect the rows with missing values
housing[housing.isnull().any(axis=1)]
Note: there are three approaches to handling missing values, illustrated below:
# A few sample rows to illustrate
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
# Option 1: drop rows with missing total_bedrooms (without inplace=True, the original DataFrame is unchanged!)
sample_incomplete_rows.dropna(subset=["total_bedrooms"])
# Option 2: drop the whole column
sample_incomplete_rows.drop("total_bedrooms", axis=1)
# Option 3: fill in the missing values (here with the column median)
median = housing["total_bedrooms"].median()
sample_incomplete_rows["total_bedrooms"].fillna(median, inplace=True)
# Inspect the filled rows: total_bedrooms is now populated
sample_incomplete_rows
from sklearn.impute import SimpleImputer
# Drop the text feature first
housing_num = housing.drop("ocean_proximity", axis=1)
# strategy="median": replace missing values with each column's median
imputer = SimpleImputer(strategy="median")
imputer.fit(housing_num)
print(imputer.statistics_)
# Verify against the medians computed directly with pandas
print(housing_num.median().values)
# Transform housing_num; the result X is a plain NumPy array
X = imputer.transform(housing_num)
# Put X back into a pandas DataFrame
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
index=housing.index)
# Check the rows that previously had missing values
housing_tr.loc[sample_incomplete_rows.index.values]
Output:
array([ 1.0341e+04, -1.1851e+02, 3.4260e+01, 2.9000e+01, 2.1195e+03,
4.3300e+02, 1.1640e+03, 4.0800e+02, 3.5409e+00])
array([ 1.0341e+04, -1.1851e+02, 3.4260e+01, 2.9000e+01, 2.1195e+03,
4.3300e+02, 1.1640e+03, 4.0800e+02, 3.5409e+00])
# Inspect the text feature ocean_proximity
housing_cat = housing[["ocean_proximity"]]
housing_cat.head(10)
# First, encode the text categories as ordinal integers
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]
ordinal_encoder.categories_
Output:
array([[0.],
[0.],
[4.],
[1.],
[0.],
[1.],
[0.],
[1.],
[0.],
[0.]])
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
dtype=object)]
# Convert the categorical feature to one-hot encoding
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder(sparse=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot
Output:
array([[1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1.],
...,
[0., 1., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0.]])
Note: the data processed so far has not been saved!
Note: we now start a fresh notebook.
# Load the original training-set split
strat_train_set = pd.read_csv('./strat_train_set.csv', index_col=0)
strat_train_set.head()
# View a summary of the training set
strat_train_set.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16512 entries, 17606 to 15775
Data columns (total 10 columns):
longitude 16512 non-null float64
latitude 16512 non-null float64
housing_median_age 16512 non-null float64
total_rooms 16512 non-null float64
total_bedrooms 16354 non-null float64
population 16512 non-null float64
households 16512 non-null float64
median_income 16512 non-null float64
median_house_value 16512 non-null float64
ocean_proximity 16512 non-null object
dtypes: float64(9), object(1)
memory usage: 1.4+ MB
# Separate the training set features from the labels
housing = strat_train_set.drop("median_house_value", axis=1)  # drop() returns a copy; the original keeps the column
housing_labels = strat_train_set["median_house_value"].copy()
housing_num = housing.drop("ocean_proximity", axis=1)
from sklearn.base import BaseEstimator, TransformerMixin
# Column indices
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6
# Custom transformer
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        # Always add the first two derived columns
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        # Optionally add a third
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
# Apply the transformer, adding two derived columns
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
housing_extra_attribs = pd.DataFrame(
housing_extra_attribs,
columns=list(housing.columns)+["rooms_per_household", "population_per_household"],
index=housing.index)
housing_extra_attribs.head()
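As an aside, the same stateless transformation can be written without a custom class, using sklearn's FunctionTransformer; a minimal sketch (add_extra_features is a hypothetical helper mirroring transform above):
from sklearn.preprocessing import FunctionTransformer

def add_extra_features(X, add_bedrooms_per_room=True):
    rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
    population_per_household = X[:, population_ix] / X[:, households_ix]
    if add_bedrooms_per_room:
        bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
        return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
    return np.c_[X, rooms_per_household, population_per_household]

attr_adder_ft = FunctionTransformer(add_extra_features, validate=False,
                                    kw_args={"add_bedrooms_per_room": False})
housing_extra_attribs = attr_adder_ft.fit_transform(housing.values)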
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
# Build the numeric pipeline
num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),  # fill missing values with the median
        ('attribs_adder', CombinedAttributesAdder()),   # add the derived features
        ('std_scaler', StandardScaler()),               # standardize: zero mean, unit variance
    ])
# Run the data through num_pipeline
housing_num_tr = num_pipeline.fit_transform(housing_num)
# Inspect the processed data
housing_num_tr
Output:
array([[-1.15604281, 0.77194962, 0.74333089, ..., -0.31205452,
-0.08649871, 0.15531753],
[-1.17602483, 0.6596948 , -1.1653172 , ..., 0.21768338,
-0.03353391, -0.83628902],
[ 1.18684903, -1.34218285, 0.18664186, ..., -0.46531516,
-0.09240499, 0.4222004 ],
...,
[ 1.58648943, -0.72478134, -1.56295222, ..., 0.3469342 ,
-0.03055414, -0.52177644],
[ 0.78221312, -0.85106801, 0.18664186, ..., 0.02499488,
0.06150916, -0.30340741],
[-1.43579109, 0.99645926, 1.85670895, ..., -0.22852947,
-0.09586294, 0.10180567]])
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
# The second stage one-hot encodes the text column
full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),    # the numeric pipeline above
        ("cat", OneHotEncoder(), cat_attribs), # one-hot encoding
    ])
# Apply the full pipeline
housing_prepared = full_pipeline.fit_transform(housing)
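The ColumnTransformer output is a plain array with no column names; if you want to map columns back to names, here is a sketch under the pipeline structure above (the one-hot column names come from the fitted encoder):
extra_attribs = ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
cat_one_hot_attribs = list(full_pipeline.named_transformers_["cat"].categories_[0])
all_columns = num_attribs + extra_attribs + cat_one_hot_attribs
print(len(all_columns))  # 16 columns, matching housing_prepared.shape[1]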
import pickle
# Save the training set features
with open('./housing_prepared.pickle', 'wb') as list_file:
    pickle.dump(housing_prepared, list_file)
# Save the training set labels
with open('./housing_labels.pickle', 'wb') as list_file:
    pickle.dump(housing_labels, list_file)
import pickle
# Load the processed training set features
with open('./housing_prepared.pickle', 'rb') as list_file:
    housing_prepared = pickle.load(list_file)
print(housing_prepared.shape)
# Load the processed training set labels
with open('./housing_labels.pickle', 'rb') as list_file:
    housing_labels = pickle.load(list_file)
print(housing_labels.shape)
Output:
(16512, 16)
(16512,)
from sklearn.linear_model import LinearRegression
# Linear regression:
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
from sklearn.ensemble import RandomForestRegressor
# Random forest:
forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
forest_reg.fit(housing_prepared, housing_labels)
from sklearn.svm import SVR
# Support vector machine:
svm_reg = SVR(kernel="linear")
svm_reg.fit(housing_prepared, housing_labels)
Let's first compare the three models with a quick root-mean-square error (RMSE) check on the training set:
from sklearn.metrics import mean_squared_error
# Linear regression RMSE:
housing_predictions = lin_reg.predict(housing_prepared)           # predictions from the model
lin_mse = mean_squared_error(housing_labels, housing_predictions) # MSE between labels and predictions
lin_rmse = np.sqrt(lin_mse)                                       # take the square root to get RMSE
print('Linear regression RMSE:', lin_rmse)
# Random forest RMSE:
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
print('Random forest RMSE:', forest_rmse)
# SVM RMSE:
housing_predictions = svm_reg.predict(housing_prepared)
svm_mse = mean_squared_error(housing_labels, housing_predictions)
svm_rmse = np.sqrt(svm_mse)
print('SVM RMSE:', svm_rmse)
Linear regression RMSE: 68628.19819848923
Random forest RMSE: 18603.515021376355
SVM RMSE: 111094.6308539982
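As a small aside: in scikit-learn 0.22 and later, mean_squared_error can return the RMSE directly, so the manual square root is optional:
# Assumes scikit-learn >= 0.22, where squared=False returns the RMSE
lin_rmse = mean_squared_error(housing_labels, lin_reg.predict(housing_prepared),
                              squared=False)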
Note: the SVM's error is far worse than the other two models, so we rule it out immediately. To analyze the remaining two further, we compare them with cross-validation:
from sklearn.model_selection import cross_val_score
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
# Cross-validation for linear regression:
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
# Cross-validation for the random forest:
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
Output:
Scores: [66782.73843989 66960.118071 70347.95244419 74739.57052552
68031.13388938 71193.84183426 64969.63056405 68281.61137997
71552.91566558 67665.10082067]
Mean: 69052.46136345083
Standard deviation: 2731.6740017983466
Scores: [49519.80364233 47461.9115823 50029.02762854 52325.28068953
49308.39426421 53446.37892622 48634.8036574 47585.73832311
53490.10699751 50021.5852922 ]
Mean: 50182.303100336096
Standard deviation: 2097.0810550985693
Note: the comparison shows that the random forest model is the better fit for this dataset. (Its training RMSE of roughly 18,604 is far below its cross-validation mean of roughly 50,182, however, which indicates the forest still overfits the training set.)
To optimize the model further and find the best hyperparameters, there are generally three approaches: grid search, randomized search, and ensemble methods.
Note: here we choose grid search.
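For reference, a minimal sketch of the randomized-search alternative (not used in this walkthrough; the sampling distributions are assumptions):
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}
rnd_search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                                param_distributions=param_distribs,
                                n_iter=10, cv=5,
                                scoring='neg_mean_squared_error',
                                random_state=42)
# rnd_search.fit(housing_prepared, housing_labels)  # then inspect rnd_search.best_params_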
from sklearn.model_selection import GridSearchCV
# Candidate hyperparameters:
param_grid = [
{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
{'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor(random_state=42)
# Grid-search for the best hyperparameters
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
scoring='neg_mean_squared_error',
return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
Output:
GridSearchCV(cv=5, error_score='raise-deprecating',
estimator=RandomForestRegressor(bootstrap=True, criterion='mse',
max_depth=None,
max_features='auto',
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators='warn', n_jobs=None,
oob_score=False, random_state=42,
verbose=0, warm_start=False),
iid='warn', n_jobs=None,
param_grid=[{'max_features': [2, 4, 6, 8],
'n_estimators': [3, 10, 30]},
{'bootstrap': [False], 'max_features': [2, 3, 4],
'n_estimators': [3, 10]}],
pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
scoring='neg_mean_squared_error', verbose=0)
# The best hyperparameters found
grid_search.best_params_
# The model corresponding to the best hyperparameters
grid_search.best_estimator_
Output:
{'max_features': 8, 'n_estimators': 30}
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features=8, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=30,
n_jobs=None, oob_score=False, random_state=42, verbose=0,
warm_start=False)
# Inspect every combination tried by the grid search
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
Output:
63669.05791727153 {'max_features': 2, 'n_estimators': 3}
55627.16171305252 {'max_features': 2, 'n_estimators': 10}
53384.57867637289 {'max_features': 2, 'n_estimators': 30}
60965.99185930139 {'max_features': 4, 'n_estimators': 3}
52740.98248528835 {'max_features': 4, 'n_estimators': 10}
50377.344409590376 {'max_features': 4, 'n_estimators': 30}
58663.84733372485 {'max_features': 6, 'n_estimators': 3}
52006.15355973719 {'max_features': 6, 'n_estimators': 10}
50146.465964159885 {'max_features': 6, 'n_estimators': 30}
57869.25504027614 {'max_features': 8, 'n_estimators': 3}
51711.09443660957 {'max_features': 8, 'n_estimators': 10}
49682.25345942335 {'max_features': 8, 'n_estimators': 30}
62895.088889905004 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54658.14484390074 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59470.399594730654 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52725.01091081235 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
57490.612956065226 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51009.51445842374 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
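Before moving to the test set, it is often worth checking which features the tuned model actually relies on; a sketch assuming the pipeline built above:
feature_importances = grid_search.best_estimator_.feature_importances_
extra_attribs = ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
cat_one_hot_attribs = list(full_pipeline.named_transformers_["cat"].categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)  # most important first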
# Load the test set
strat_test_set = pd.read_csv('./strat_test_set.csv', index_col=0)
strat_test_set.head()
# Evaluate the final model:
final_model = grid_search.best_estimator_
# Separate the features from the labels
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
# First run the test features through the full preparation pipeline
X_test_prepared = full_pipeline.transform(X_test)
# Then predict with the tuned random forest
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse
Output:
47730.22690385927
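A point estimate alone can be misleading; as a hedged extra step, a 95% confidence interval for this generalization error can be sketched with scipy:
from scipy import stats
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
interval = np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                                    loc=squared_errors.mean(),
                                    scale=stats.sem(squared_errors)))
print(interval)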
(The remaining steps are omitted.)