用Python研究了三千套房子，揭秘是什么抬高了房价？

您的位置：
门户
>> 文章精选
>> 软件开发专栏
>> Python
>> 查看资讯

用Python研究了三千套房子，揭秘是什么抬高了房价？

发表于：2018-1-24 10:00

字体：大中小 | 上一篇 | 下一篇 | 我要投稿

作者：佚名来源：Python中文社区

　　5、构建基准模型

from sklearn import cross_validation

from sklearn import linear_model

from sklearn.learning_curve import learning_curve

from sklearn.metrics import explained_variance_score

from sklearn.grid_search import GridSearchCV

from sklearn.model_selection import cross_val_score

from sklearn.ensemble import RandomForestRegressor

y = y.values #转换成array数组

X = train_df.values #转换成array数组

cv = cross_validation.ShuffleSplit(len(X),n_iter=3,test_size=0.2)

print('岭回归交叉验证结果：')

for train_index,test_index in cv:

ridge = linear_model.Ridge(alpha=1).fit(X,y)

print('train_score:{0:.3f},test_score:{1:.3f}\n'.format(ridge.score(X[train_index],y[train_index]), ridge.score(X[test_index],y[test_index])))

print('随机森林交叉验证结果：')

for train_index,test_index in cv:

rf = RandomForestRegressor().fit(X,y)

print('train_score:{0:.3f},test_score:{1:.3f}\n'.format(rf.score(X[train_index],y[train_index]), rf.score(X[test_index],y[test_index])))

　　哇！好意外啊，这两个模型的结果表现都不错，但是随机森林的结果似乎更好，下面来看看学习曲线情况。

　　我们采用的是默认的参数，没有调优处理，得到的两个基准模型都存在过拟合现象。下面，我们开始着手参数的调整，希望能够改善模型的过拟合现象。

　　6、参数调优

　　岭回归正则项缩放系数alpha调整

alphas =[0.01,0.1,1,10,20,50,100,300]

test_scores = []

for alp in alphas:

clf = linear_model.Ridge(alp)

test_score = -cross_val_score(clf,X,y,cv=10,scoring='neg_mean_squared_error')

test_scores.append(np.mean(test_score))

import matplotlib.pyplot as plt

%matplotlib inline

plt.plot(alphas,test_scores)

plt.title('alpha vs test_score')

　　alpha在10-20附近均方误差最小

　　随机森林参数调优

　　随机森林算法，本篇中主要调整三个参数：maxfeatures,maxdepth,n_estimators

#随机森林的深度参数

max_depth=[2,4,6,8,10]

test_scores_depth = []

for depth in max_depth:

clf = RandomForestRegressor(max_depth=depth)

test_score_depth = -cross_val_score(clf,X,y,cv=10,scoring='neg_mean_squared_error')

test_scores_depth.append(np.mean(test_score_depth))

#随机森林的特征个数参数

max_features =[.1, .3, .5, .7, .9, .99]

test_scores_feature = []

for feature in max_features:

clf = RandomForestRegressor(max_features=feature)

test_score_feature = -cross_val_score(clf,X,y,cv=10,scoring='neg_mean_squared_error')

test_scores_feature.append(np.mean(test_score_feature))

#随机森林的估计器个位数参数

n_estimators =[10,50,100,200,500]

test_scores_n = []

for n in n_estimators:

clf = RandomForestRegressor(n_estimators=n)

test_score_n = -cross_val_score(clf,X,y,cv=10,scoring='neg_mean_squared_error')

test_scores_n.append(np.mean(test_score_n))

　　随机森林的各项参数来看，深度位于8，选择特征个数比例为0.5，估计器个数为500时，效果最好。下面分别利用上述得到的最优参数分别重新训练，看一下学习曲线，过拟合现象是否得到缓解？

　　再回想一下，我们最初的基线模型学习曲线的形状，是不是得到了一定程度的缓解？OK，下面我们采用模型融合技术，对数据进行预测。

#预测

ridge = linear_model.Ridge(alpha=10).fit(X,y)

rf = RandomForestRegressor(n_estimators=500,max_depth=8,max_features=.5).fit(X,y)

y_ridge = np.expm1(ridge.predict(test_df.values))

y_rf = np.expm1(rf.predict(test_df.values))

y_final = (y_ridge + y_rf)/2

　　本篇房价预测的模型搭建已经完成。同样，再梳理一边思路：

　　一、本篇用到的房价数据集存在比较多的数据缺失，且分类变量十分多。在预处理阶段需要将训练集和测试集合并，进行缺失填充和one-hot独热变量处理，保证数据处理过程的一致性，在数据缺失填充过程中，需要综合考虑特征的实际描述和数据的分布，选择合适的填充方式填充；

　　二、为防止数据变量不统一带来的模型准确率下降，将数值型特征进行标准化处理，数据处理完成后，按照数据合并的方式，再还原到训练集和测试集；

　　三、先构建岭回归和随机森林基准模型，进行三折交叉验证，绘制学习曲线，存在明显的过拟合现象；

　　四、接下来分别对两个基准模型进行参数调优，获得使得均方误差最小的参数，返回到训练集进行训练；

　　五、采用并行模型融合的方式，计算两个模型预测结果的均值作为测试集的预测结果。

上文内容不用于商业目的，如涉及知识产权问题，请权利人联系博为峰小编(021-64471599-8017)，我们将立即处理。

22/2<12

《2023软件测试行业现状调查报告》独家发布~

搜索风云榜

测试技术了解

2023测试行业调查报告

挣点稿费

AI与软件测试

文章资料精选