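General data-processing steps
1. Identify X (the features) and Y (the target).
2. Identify which variables are continuous and which are categorical.
3. Split the dataset: 70% training set, 30% test set.
4. Build the model.
5. Train the model, then test it.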


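I. Encoding discrete features
Encoding a discrete feature falls into one of two cases:
1. The values have no inherent order, e.g. color: [red, blue]. Use one-hot encoding.
2. The values have an inherent order, e.g. size: [X, XL, XXL]. Map them to numbers, e.g. {X: 1, XL: 2, XXL: 3}.
Suppose we have the following dataset: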
    import pandas as pd

    df = pd.DataFrame([
        ['green', 'M', 10.1, 'class1'],
        ['red', 'L', 13.5, 'class2'],
        ['blue', 'XL', 15.3, 'class1']])
    df.columns = ['color', 'size', 'price', 'class label']
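1. pandas makes it easy to one-hot encode discrete features. For a discrete feature whose values are ordered, a direct mapping is enough, e.g. {'XL': 3, 'L': 2, 'M': 1}.
    # Ordinal feature: map each size to a number that preserves the order
    size_mapping = {
        'XL': 3,
        'L': 2,
        'M': 1}
    df['size'] = df['size'].map(size_mapping)

    # Class labels: assign an integer index to each distinct label
    # (sorted so that the mapping is the same on every run)
    class_mapping = {label: idx for idx, label in enumerate(sorted(set(df['class label'])))}
    df['class label'] = df['class label'].map(class_mapping)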

2. One-hot encoding with get_dummies
    pd.get_dummies(df)
    pd.get_dummies(data=df, columns=['column_name', '..', ...])
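As a minimal sketch of the two cases together, the snippet below rebuilds a small frame like the one above (without the class label column, and with the column name `price` as an assumption), maps the ordered `size` column to integers, and one-hot encodes only the unordered `color` column. The variable name `encoded` is illustrative.

```python
import pandas as pd

# Small illustrative frame; the values mirror the example dataset above
df = pd.DataFrame(
    [["green", "M", 10.1], ["red", "L", 13.5], ["blue", "XL", 15.3]],
    columns=["color", "size", "price"],
)

# Ordered feature: numeric mapping keeps M < L < XL
df["size"] = df["size"].map({"M": 1, "L": 2, "XL": 3})

# Unordered feature: one indicator column per color
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)
```

Passing `columns=` restricts the encoding to the listed columns, so numeric columns such as `size` and `price` are left untouched.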

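II. Splitting into a training set and a test set
    # X and Y must be separated beforehand
    from sklearn.model_selection import train_test_split

    # 70% training set, 30% test set.
    # Mind the order of the returned values: x_train, x_test, y_train, y_test
    x_train, x_test, y_train, y_test = train_test_split(xdata, ydata, test_size=0.3)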
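A self-contained sketch of the split, assuming a toy `xdata`/`ydata` pair built with NumPy. The `random_state=0` argument is an addition that is not in the original call; it only makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples with 2 features each, and a binary target
xdata = np.arange(40).reshape(20, 2)
ydata = np.array([0, 1] * 10)

# The return order is x_train, x_test, y_train, y_test
x_train, x_test, y_train, y_test = train_test_split(
    xdata, ydata, test_size=0.3, random_state=0
)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```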
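III. Model selection
1. Using the classification model Logistic regression as an example
① Define the model
    from sklearn.linear_model import LogisticRegression
    lr = LogisticRegression(C=0.1, max_iter=100)

    # Reference signature, as documented in an older scikit-learn release
    # (some defaults, such as solver, differ in current versions):
    # class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)
    # penalty  : regularization term, L1 or L2; default: L2
    # C        : inverse of the regularization strength (C = 1/λ); a smaller C means a stronger penalty
    # max_iter : maximum number of iterations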
② Train the model
③ Predict
④ Return the predicted probabilities
    lr.predict_proba(x_test)
    array([[ 0.81104664, 0.18895336],
           [ 0.7089903 , 0.2910097 ],
           [ 0.72999523, 0.27000477],
           ...,
           [ 0.79589777, 0.20410223],
           [ 0.84381244, 0.15618756],
           [ 0.81695779, 0.18304221]])
⑤ Accuracy
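Steps ② to ⑤ have no code of their own above, so here is one plausible sketch of them. It assumes synthetic data from `make_classification` in place of the real dataset, and it uses the standard estimator methods `fit`, `predict`, `predict_proba` and `score`. The original post may have computed accuracy differently.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the real X and Y
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

lr = LogisticRegression(C=0.1, max_iter=100)
lr.fit(x_train, y_train)             # ② train the model
y_pred = lr.predict(x_test)          # ③ predict class labels
proba = lr.predict_proba(x_test)     # ④ one probability column per class
accuracy = lr.score(x_test, y_test)  # ⑤ mean accuracy on the test set
print(y_pred[:5], proba[:2], accuracy)
```

Each row of `proba` sums to 1; column 0 is the probability of class 0 and column 1 the probability of class 1.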
IV. Choosing the optimal parameters
① Import Exhaustive Grid Search
    from sklearn.model_selection import GridSearchCV

    params = {  # parameters to search over
        'C': [1, 0.1, 0.01],
        'max_iter': [10, 100, 200]
    }
    gs = GridSearchCV(lr, params, cv=5, scoring='f1')
    gs.fit(x_train, y_train)  # train one model per parameter combination

    # Reference signature, as documented in an older scikit-learn release:
    # class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score=True)
    # estimator  : the model
    # param_grid : the parameter values to try
    # cv         : how the data is split for cross-validation (e.g. the number of folds)
    # scoring    : the evaluation metric
② View all the models that were built
    # Note: grid_scores_ was removed in scikit-learn 0.20; current versions expose gs.cv_results_ instead
    gs.grid_scores_
    [mean: 0.38408, std: 0.01678, params: {'C': 1, 'max_iter': 10},
     mean: 0.38758, std: 0.02156, params: {'C': 1, 'max_iter': 100},
     mean: 0.38758, std: 0.02156, params: {'C': 1, 'max_iter': 200},
     ...]
③ Return the best parameters
    gs.best_params_
    {'C': 0.01, 'max_iter': 100}
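Recent scikit-learn releases no longer provide `grid_scores_`, so the sketch below reads the same information (mean score, standard deviation, parameters) from `cv_results_`. It assumes synthetic data from `make_classification` and a slightly smaller grid than the one above, so the scores will not match the figures shown.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data standing in for x_train / y_train
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

params = {"C": [1, 0.1, 0.01], "max_iter": [100, 200]}
gs = GridSearchCV(LogisticRegression(), params, cv=5, scoring="f1")
gs.fit(X, y)

# One entry per parameter combination: mean and std of the cross-validated F1
results = gs.cv_results_
for mean, std, p in zip(
    results["mean_test_score"], results["std_test_score"], results["params"]
):
    print(f"mean: {mean:.5f}, std: {std:.5f}, params: {p}")

print(gs.best_params_, gs.best_score_)
```

With the default `refit=True`, `gs` is refitted on the full training data using the best parameters, which is why `gs.predict_proba` can be called directly afterwards.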
V. Model evaluation metrics
① Model selection => Model evaluation: quantifying the quality of predictions => Classification metrics
② Import the package
    from sklearn.metrics import precision_recall_curve

    # Use the column holding the probability of class 1
    pre, recall, thre = precision_recall_curve(y_test, gs.predict_proba(x_test)[:, 1])

    # Reference signature:
    # sklearn.metrics.precision_recall_curve(y_true, probas_pred, pos_label=None, sample_weight=None)
    #
    # Returns 3 values
    # precision : array, shape = [n_thresholds + 1]
    #     Precision values such that element i is the precision of predictions with score >= thresholds[i] and the last element is 1.
    # recall : array, shape = [n_thresholds + 1]
    #     Decreasing recall values such that element i is the recall of predictions with score >= thresholds[i] and the last element is 0.
    # thresholds : array, shape = [n_thresholds]
    #     A sample whose score is at or above the threshold is classified as 1, otherwise as 0.
    #     Raising the threshold generally raises precision.
    #
    # Inputs
    # y_true : array, shape = [n_samples]       # true labels of the test set
    # probas_pred : array, shape = [n_samples]  # predicted probabilities
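To make the meaning of the three returned arrays concrete, the sketch below runs `precision_recall_curve` on four hand-written scores (illustrative values, not from the dataset above) and recomputes the precision at each threshold by hand.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Four samples: true labels and the predicted probability of class 1
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

pre, recall, thre = precision_recall_curve(y_true, scores)

# Element i of pre is the precision obtained by predicting 1 whenever score >= thre[i]
for i, t in enumerate(thre):
    predicted_positive = scores >= t
    true_positive = np.sum(predicted_positive & (y_true == 1))
    manual_precision = true_positive / np.sum(predicted_positive)
    print(t, pre[i], manual_precision)
```

`pre` and `recall` each have one more element than `thre`: the final pair (precision 1, recall 0) is appended so that the curve has an end point and has no matching threshold.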
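VI. Data standardization
① Preprocessing => StandardScaler
② First fit the scaler on the training set
    from sklearn.preprocessing import StandardScaler

    sds = StandardScaler()
    sds.fit(x_train)  # learns the mean and standard deviation of each column
③ Then apply transform, which returns the new training array. fit alone does not change the data!
    sds_train = sds.transform(x_train)
④ Standardize the test set with the same fitted scaler
    sds_test = sds.transform(x_test)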
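Notes:
1. Standardize only the columns whose values are on a much larger scale than the rest; columns that do not need standardizing must be set aside beforehand.
2. Models that rely on gradient descent, distance computations and the like need standardized inputs.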
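The fit-then-transform sequence can be sketched on a tiny array, assuming made-up values in which the second column is on a much larger scale than the first. The scaler learns its statistics from the training set only and reuses them on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: column 1 is roughly 1000 times larger than column 0
x_train = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
x_test = np.array([[2.0, 4000.0]])

sds = StandardScaler()
sds.fit(x_train)                    # learn per-column mean and standard deviation
sds_train = sds.transform(x_train)  # each training column now has mean 0 and std 1
sds_test = sds.transform(x_test)    # the test set reuses the training statistics

print(sds.mean_, sds.scale_)
print(sds_train)
print(sds_test)
```

Calling `sds.fit_transform(x_train)` is equivalent to calling `fit` and then `transform` on the training set. The test set should only ever be passed to `transform`.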
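VII. Plotting the comparison
    import matplotlib.pyplot as plt
    plt.style.use("classic")
    # Jupyter magic; omit this line outside a notebook
    %matplotlib inline

    # pre, recall: curve for the model trained on the original data.
    # pre2, recall2: not defined in this section; presumably the same curve
    # for the model trained on the standardized data.
    plt.plot(recall, pre, label='no sds')
    plt.plot(recall2, pre2, label='sds')
    plt.xlabel("recall")
    plt.ylabel("precision")
    plt.legend()
    plt.grid()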

