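- Predict house prices with a public housing dataset; the data contains 8 features and 1 target value.
- Expand the features with polynomial terms up to degree 2.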
Code example
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# 1. Load the public dataset
data = fetch_california_housing()
print('California housing dataset description:')
print(data.DESCR)  # 20,640 rows, 8 features; the target is the house value
np.set_printoptions(threshold=1000)
print('California housing features:')
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
X = pd.DataFrame(data.data, columns=data.feature_names)  # wrap the features in a DataFrame
print(X)
print('California housing target values:')
y = data.target  # target for each row: median house value in units of $100,000
print(y)

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Build a polynomial regression Pipeline: standardization, polynomial expansion, linear regression
model = Pipeline([
    ("scaler", StandardScaler()),  # zero mean, unit variance
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),  # terms up to degree 2
    ("linear", LinearRegression()),  # linear regression
])

# 4. Fit the model
model.fit(X_train, y_train)

# 5. Predict
y_pred = model.predict(X_test)

# 6. Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean squared error (MSE): {mse:.4f}")
print(f"Coefficient of determination (R²): {r2:.4f}")

# 7. Inspect the generated polynomial features
poly_feature_names = model.named_steps["poly"].get_feature_names_out(X.columns)
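print("Polynomial features:")
print(poly_feature_names)  # 8 (original) + 8 (squares) + 28 (pairwise products) = 44

# 8. Inspect the fitted coefficients
linear = model.named_steps['linear']
print("Polynomial coefficients:")
print(linear.coef_)  # also 44 coefficients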
print(linear.intercept_)
Output
California housing dataset description:
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. rubric:: References

- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
  Statistics and Probability Letters, 33:291-297, 1997.

California housing features:
       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
0      8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23
1      8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22
2      7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24
3      5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25
4      3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25
...       ...       ...       ...        ...         ...       ...       ...        ...
20635  1.5603      25.0  5.045455   1.133333       845.0  2.560606     39.48    -121.09
20636  2.5568      18.0  6.114035   1.315789       356.0  3.122807     39.49    -121.21
20637  1.7000      17.0  5.205543   1.120092      1007.0  2.325635     39.43    -121.22
20638  1.8672      18.0  5.329513   1.171920       741.0  2.123209     39.43    -121.32
20639  2.3886      16.0  5.254717   1.162264      1387.0  2.616981     39.37    -121.24

[20640 rows x 8 columns]
California housing target values:
[4.526 3.585 3.521 ... 0.923 0.847 0.894]
Mean squared error (MSE): 0.4643
Coefficient of determination (R²): 0.6457
Polynomial features:
['MedInc' 'HouseAge' 'AveRooms' 'AveBedrms' 'Population' 'AveOccup'
 'Latitude' 'Longitude' 'MedInc^2' 'MedInc HouseAge' 'MedInc AveRooms'
 'MedInc AveBedrms' 'MedInc Population' 'MedInc AveOccup'
 'MedInc Latitude' 'MedInc Longitude' 'HouseAge^2' 'HouseAge AveRooms'
 'HouseAge AveBedrms' 'HouseAge Population' 'HouseAge AveOccup'
 'HouseAge Latitude' 'HouseAge Longitude' 'AveRooms^2'
 'AveRooms AveBedrms' 'AveRooms Population' 'AveRooms AveOccup'
 'AveRooms Latitude' 'AveRooms Longitude' 'AveBedrms^2'
 'AveBedrms Population' 'AveBedrms AveOccup' 'AveBedrms Latitude'
 'AveBedrms Longitude' 'Population^2' 'Population AveOccup'
 'Population Latitude' 'Population Longitude' 'AveOccup^2'
 'AveOccup Latitude' 'AveOccup Longitude' 'Latitude^2'
 'Latitude Longitude' 'Longitude^2']
Polynomial coefficients:
[ 0.93594011  0.13205802 -0.38759869  0.53020674  0.04051346 -1.78126342
 -1.27267893 -1.1676299  -0.11222558  0.03784584  0.17978116 -0.1201516
  0.11142996 -0.09883978 -0.66721635 -0.58616928  0.0332914  -0.01624672
  0.05234485  0.0360252  -0.27866746 -0.2767792  -0.25281254  0.06040245
 -0.10958604 -0.15473981  0.57792376  0.54353082  0.47907069  0.04954482
  0.24209969 -0.40169311 -0.48876332 -0.4228783   0.00195178  0.32361526
  0.03280047  0.01523969  0.00769438  0.50676749  0.36713809  0.2632096
  0.4351273   0.15301617]
1.956590491804413
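The coefficients and intercept printed above apply to the standardized, polynomial-expanded features, not to the raw columns. As a rough illustration, the sketch below rebuilds a pipeline of the same shape on small synthetic data and reproduces `predict` by hand. The synthetic data and the names `X_demo`, `y_demo`, and `demo_model` are assumptions made so that the example runs without downloading the housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption, not the housing dataset):
# 200 rows, 3 features, and a target with quadratic structure plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (
    1.5 * X_demo[:, 0] ** 2
    - X_demo[:, 1] * X_demo[:, 2]
    + rng.normal(scale=0.1, size=200)
)

# Same three steps as the pipeline in the code example.
demo_model = Pipeline([
    ("scaler", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("linear", LinearRegression()),
])
demo_model.fit(X_demo, y_demo)

# Reproduce predict() step by step: scale, expand, then apply
# prediction = expanded_features @ coef_ + intercept_.
scaled = demo_model.named_steps["scaler"].transform(X_demo[:5])
expanded = demo_model.named_steps["poly"].transform(scaled)
linear = demo_model.named_steps["linear"]
manual_pred = expanded @ linear.coef_ + linear.intercept_

# The manual result should agree with the pipeline's own prediction.
print(np.allclose(manual_pred, demo_model.predict(X_demo[:5])))
```

Because the expansion runs after `StandardScaler`, each coefficient describes the effect of a term built from standardized features, so its size is not directly comparable to an effect measured in the original units.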