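ML / FE / FS: Feature engineering and data preprocessing: feature selection with filter, wrapper, and embedded methods (RF/SF) on the mushroom dataset (binary classification), a worked case study
A comprehensive case study that predicts whether a mushroom is poisonous (binary classification) using several feature-selection techniques (PCC_SVMC, chi2_RF, MIC, DiC, single-feature RF, RFE_RLasso/RF, SF_ETreesC)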
# 1. Define the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 class 8124 non-null object
1 cap-shape 8124 non-null object
2 cap-surface 8124 non-null object
3 cap-color 8124 non-null object
4 bruises 8124 non-null object
5 odor 8124 non-null object
6 gill-attachment 8124 non-null object
7 gill-spacing 8124 non-null object
8 gill-size 8124 non-null object
9 gill-color 8124 non-null object
10 stalk-shape 8124 non-null object
11 stalk-root 8124 non-null object
12 stalk-surface-above-ring 8124 non-null object
13 stalk-surface-below-ring 8124 non-null object
14 stalk-color-above-ring 8124 non-null object
15 stalk-color-below-ring 8124 non-null object
16 veil-type 8124 non-null object
17 veil-color 8124 non-null object
18 ring-number 8124 non-null object
19 ring-type 8124 non-null object
20 spore-print-color 8124 non-null object
21 population 8124 non-null object
22 habitat 8124 non-null object
dtypes: object(23)
memory usage: 1.4+ MB
None
class cap-shape cap-surface ... spore-print-color population habitat
0 p x s ... k s u
1 e x s ... n n g
2 e b s ... n n m
3 p x y ... k s u
4 e x s ... n a g
[5 rows x 23 columns]
(8124, 23)
# 2. Feature engineering / data preprocessing
# 2.1. Percentage of missing values per feature
percent_missing
class 0.0
cap-shape 0.0
cap-surface 0.0
cap-color 0.0
bruises 0.0
odor 0.0
gill-attachment 0.0
gill-spacing 0.0
gill-size 0.0
gill-color 0.0
stalk-shape 0.0
stalk-root 0.0
stalk-surface-above-ring 0.0
stalk-surface-below-ring 0.0
stalk-color-above-ring 0.0
stalk-color-below-ring 0.0
veil-type 0.0
veil-color 0.0
ring-number 0.0
ring-type 0.0
spore-print-color 0.0
population 0.0
habitat 0.0
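The table above can be produced with a couple of pandas calls. The sketch below is only an illustration on a tiny hypothetical frame (the column values are made up, not taken from the mushroom file). Note that `isnull()` only sees real NaN/None cells: if a file encodes missing values with a placeholder such as `?`, every column reports 0.0 until that placeholder is replaced with NaN.

```python
import pandas as pd

# Toy frame standing in for the mushroom data (hypothetical values).
df = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["p", "a", None, "n"],   # one genuinely missing cell
    "habitat": ["u", "g", "m", "u"],
})

# Share of missing cells per column, expressed in percent.
percent_missing = df.isnull().mean() * 100
missing_df = pd.DataFrame({"percent_missing": percent_missing})
print(missing_df)
```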
# 2.2. Separate features and label
# 2.3. Inspect the target label

e 4208
p 3916
Name: class, dtype: int64
# 2.4. Encode the features
The feature set is one-hot encoded and the label is encoded with LabelEncoder (LE).
df_X2dum_cols 117 Index(['cap-shape_b', 'cap-shape_c', 'cap-shape_f', 'cap-shape_k',
'cap-shape_s', 'cap-shape_x', 'cap-surface_f', 'cap-surface_g',
'cap-surface_s', 'cap-surface_y',
...
'population_s', 'population_v', 'population_y', 'habitat_d',
'habitat_g', 'habitat_l', 'habitat_m', 'habitat_p', 'habitat_u',
'habitat_w'],
dtype='object', length=117)
cap-shape_b cap-shape_c cap-shape_f ... habitat_p habitat_u habitat_w
0 0 0 0 ... 0 1 0
1 0 0 0 ... 0 0 0
2 1 0 0 ... 0 0 0
3 0 0 0 ... 0 1 0
4 0 0 0 ... 0 0 0
[5 rows x 117 columns]
[1 0 0 ... 0 1 0]
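A minimal sketch of this encoding step, assuming `pd.get_dummies` for the features and `LabelEncoder` for the label (the original code is not shown, so both are assumptions; the toy frame is hypothetical). Each categorical column expands into one dummy column per category, which is how 22 features become 117 columns above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature of the mushroom frame.
df = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["p", "a", "l", "p"],
    "bruises": ["t", "t", "f", "f"],
})
df_X = df.drop(columns="class")
y = df["class"]

# One dummy column per (feature, category) pair, named "<feature>_<category>".
df_X2dum = pd.get_dummies(df_X, dtype=int)

# LabelEncoder sorts the classes, so 'e' maps to 0 and 'p' maps to 1.
y_le = LabelEncoder().fit_transform(y)

print(len(df_X2dum.columns), list(df_X2dum.columns))
print(y_le)
```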
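# 2.5. Standardize the data (zero mean, unit variance; the source labels this as scaling to -1~1, but the values below exceed that range)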
df_X2dum2stanard
[[-0.24272523 -0.02219484 -0.79620985 ... -0.40484176 4.59086996
-0.15558197]
[-0.24272523 -0.02219484 -0.79620985 ... -0.40484176 -0.21782364
-0.15558197]
[ 4.11988487 -0.02219484 -0.79620985 ... -0.40484176 -0.21782364
-0.15558197]
...
[-0.24272523 -0.02219484 1.2559503 ... -0.40484176 -0.21782364
-0.15558197]
[-0.24272523 -0.02219484 -0.79620985 ... -0.40484176 -0.21782364
-0.15558197]
[-0.24272523 -0.02219484 -0.79620985 ... -0.40484176 -0.21782364
-0.15558197]]
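The values printed above (for example 4.59) look like z-scores, so the sketch below assumes `StandardScaler`; this is a guess based on the output and on the variable name, not something the source states. It shows on a toy matrix why a rare dummy value ends up well outside [-1, 1].

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy dummy matrix: column 0 is 1 in only one of four rows.
X = np.array([[0, 1], [0, 0], [1, 0], [0, 0]], dtype=float)

# Subtract each column's mean and divide by its standard deviation.
X_std = StandardScaler().fit_transform(X)
print(X_std)
```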
# 3. Feature selection + model training and evaluation
# 3.1. Split the dataset
# 3.2. Model training and evaluation
# T1. LoR, LinearSVC, DTC, and RF models
LinearSVC time_cost: 0.140625
LogisticRegression auc_s: 1.0
LogisticRegression time_cost: 0.0625
DecisionTreeClassifier auc_s: 1.0
DecisionTreeClassifier time_cost: 0.0625
RandomForestClassifier auc_s: 1.0
RandomForestClassifier time_cost: 1.546875
[[1274 0]
[ 0 1164]]
precision recall f1-score support
0 1.00 1.00 1.00 1274
1 1.00 1.00 1.00 1164
accuracy 1.00 2438
macro avg 1.00 1.00 1.00 2438
weighted avg 1.00 1.00 1.00 2438
LinearSVC time_cost: 0.21875
LogisticRegression auc_s: 1.0
[[1274 0]
[ 0 1164]]
precision recall f1-score support
0 1.00 1.00 1.00 1274
1 1.00 1.00 1.00 1164
accuracy 1.00 2438
macro avg 1.00 1.00 1.00 2438
weighted avg 1.00 1.00 1.00 2438
LogisticRegression time_cost: 0.125
DecisionTreeClassifier auc_s: 1.0
[[1274 0]
[ 0 1164]]
precision recall f1-score support
0 1.00 1.00 1.00 1274
1 1.00 1.00 1.00 1164
accuracy 1.00 2438
macro avg 1.00 1.00 1.00 2438
weighted avg 1.00 1.00 1.00 2438
DecisionTreeClassifier time_cost: 0.0625
RandomForestClassifier auc_s: 1.0
[[1274 0]
[ 0 1164]]
precision recall f1-score support
0 1.00 1.00 1.00 1274
1 1.00 1.00 1.00 1164
accuracy 1.00 2438
macro avg 1.00 1.00 1.00 2438
weighted avg 1.00 1.00 1.00 2438
RandomForestClassifier time_cost: 1.5625
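The log above follows a simple pattern: split, fit each model, then report AUC, elapsed time, and a confusion matrix. The sketch below reproduces that pattern on synthetic data. The 30% test share is inferred from the support of 2438 out of 8124 rows; the timer, the model settings, and the data are assumptions.

```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 6)).astype(float)
y = X[:, 0].astype(int)  # label copies column 0, so the toy task is separable

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

results = {}
for model in (LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=0)):
    name = type(model).__name__
    t0 = time.process_time()
    model.fit(X_tr, y_tr)
    # AUC needs scores, so use the predicted probability of class 1.
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    cm = confusion_matrix(y_te, model.predict(X_te))
    results[name] = (auc, cm)
    print(name, "auc_s:", round(auc, 4))
    print(name, "time_cost:", time.process_time() - t0)
    print(cm)
```

`LinearSVC` has no `predict_proba`, which may be why the log reports only a time cost for it; its `decision_function` output could be passed to `roc_auc_score` instead.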
# Visualize the tree structure of the decision tree model

# 3.3. Feature selection
# T1. Filter methods: typically the SelectKBest selector, e.g. PCC_SVMC / chi2_RF
'''
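Univariate feature selection is a statistical approach that picks the features most strongly related to the label.
scikit-learn provides different scoring functions depending on the task:
Classification: chi2, f_classif, mutual_info_classif
Regression: f_regression, mutual_info_regression
Core idea: score the importance of each feature individually, then keep the most relevant ones for modeling.
(1) SelectKBest, SelectPercentile: both score every feature with a univariate statistic and then keep the top-scoring ones.
SelectKBest keeps the k highest-scoring features; SelectPercentile keeps the highest-scoring given percentage of features.
(2) SelectFpr, SelectFdr, SelectFwe: these select features based on the false positive rate, the false discovery rate, and the family-wise error rate, respectively.
SelectFpr controls the false positive rate, SelectFdr controls the false discovery rate, and SelectFwe controls the family-wise error rate.
(3) GenericUnivariateSelect: a general-purpose univariate selector that lets you configure both the scoring function and the selection strategy.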
'''
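As a concrete illustration of the notes above, here is a minimal `SelectKBest` sketch with the `chi2` score on synthetic binary data. The data and the choice of `k=2` are assumptions made for the example; `chi2` requires non-negative features, which dummy columns satisfy.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.integers(0, 2, size=(200, 6))
X[:, 2] = y  # column 2 is a perfect copy of the label

# Score every column against y independently, then keep the best k.
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
X_new = selector.transform(X)

print(np.round(selector.scores_, 3))
print(selector.get_support())
```

`SelectPercentile(chi2, percentile=...)` is a drop-in replacement when a share of the features is easier to reason about than a fixed count.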
# T1.0. Variance-threshold filtering (looks at each feature alone): for [discrete] variables, drop columns whose variance is below 0.2
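No output is shown for this step, so the following is only a sketch of how it could look with scikit-learn's `VarianceThreshold`. For a 0/1 dummy column the variance is p(1 - p), so a cutoff of 0.2 removes constant columns and columns where one value is rare. The toy matrix is hypothetical.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [0, 1, 0],
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
])
# Column variances: 0 (constant), 0.25 (balanced), 5/36 ~ 0.139 (rare 1).

vt = VarianceThreshold(threshold=0.2).fit(X)
X_kept = vt.transform(X)

print(np.round(vt.variances_, 3))
print(vt.get_support())
```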
# T1.1. Select features with the Pearson correlation coefficient (PCC), then test and evaluate with an SVM model

FS_by_corr-------------------------------
10 ['odor_n', 'odor_f', 'stalk-surface-above-ring_k', 'stalk-surface-below-ring_k', 'ring-type_p', 'gill-size_n', 'gill-size_b', 'gill-color_b', 'bruises_t', 'bruises_f']
LinearSVC time_cost: 0.03125
[[1248 26]
[ 46 1118]]
precision recall f1-score support
0 0.96 0.98 0.97 1274
1 0.98 0.96 0.97 1164
accuracy 0.97 2438
macro avg 0.97 0.97 0.97 2438
weighted avg 0.97 0.97 0.97 2438
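The selected list above contains complementary dummy pairs (gill-size_n with gill-size_b, bruises_t with bruises_f), which is what ranking by the absolute Pearson correlation would produce. The sketch below assumes that ranking rule and uses synthetic data; the column names `f0`..`f7` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
y = pd.Series(rng.integers(0, 2, size=300), name="target")
X = pd.DataFrame(rng.integers(0, 2, size=(300, 8)),
                 columns=[f"f{i}" for i in range(8)])
X["f3"] = y.values        # perfectly correlated with the label
X["f5"] = 1 - y.values    # perfectly anti-correlated (complementary dummy)

# Rank columns by |Pearson r| with the label and keep the top k.
corr = X.corrwith(y).abs().sort_values(ascending=False)
top_cols = corr.index[:2].tolist()

clf = LinearSVC().fit(X[top_cols], y)
print(len(top_cols), top_cols)
print("train accuracy:", clf.score(X[top_cols], y))
```

Keeping both halves of a complementary pair adds no information for a linear model, so deduplicating such pairs before counting toward k may be worth considering.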
# T1.2. Select features automatically with the chi-squared test (chi2), then test and evaluate with an RF model
FS_by_chi2-------------------------------
10 ['odor_n', 'odor_f', 'stalk-surface-above-ring_k', 'stalk-surface-below-ring_k', 'gill-color_b', 'gill-size_n', 'spore-print-color_h', 'ring-type_l', 'ring-type_p', 'bruises_t']
RandomForestClassifier auc_s: 0.9933
RandomForestClassifier time_cost: 1.6875
FS_by_chi2 time_cost: 1.765625
RandomForestClassifier auc_s: 0.9933
[[1248 26]
[ 31 1133]]
precision recall f1-score support
0 0.98 0.98 0.98 1274
1 0.98 0.97 0.98 1164
accuracy 0.98 2438
macro avg 0.98 0.98 0.98 2438
weighted avg 0.98 0.98 0.98 2438
# T1.3. Use MIC (maximal information coefficient) to measure the dependence between each feature and the label
FS_by_MIC_SelectKBest-------------------------------
MIC_value P_value
odor_n 0.528778 None
odor_f 0.357168 None
stalk-surface-above-ring_k 0.284429 None
stalk-surface-below-ring_k 0.270560 None
gill-color_b 0.269398 None
... ... ...
cap-shape_c 0.000519 None
cap-shape_f 0.000248 None
stalk-root_b 0.000226 None
stalk-surface-above-ring_y 0.000194 None
veil-type_p 0.000000 None
[117 rows x 2 columns]
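The source does not show how the MIC column was computed. As a stand-in, the sketch below ranks features with scikit-learn's `mutual_info_classif`, which is a related but different statistic from the maximal information coefficient; treat it as an assumption. It also shows why a constant column such as veil-type_p scores 0: a variable with a single value carries no information about the label.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = pd.DataFrame(rng.integers(0, 2, size=(500, 5)), columns=list("abcde"))
X["c"] = y   # fully determines the label
X["e"] = 0   # constant column

# discrete_features=True treats every column as categorical.
mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)
mi_df = (pd.DataFrame({"MI_value": mi}, index=X.columns)
           .sort_values("MI_value", ascending=False))
print(mi_df)
```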
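# T1.4. Use DiC (labelled Distance Correlation, computed with SciPy) to score each feature against the label
# The values below lie in [0, 2], and gill-size_b + gill-size_n sum to 2, so they appear to be the correlation distance 1 - r: values far from 1 in either direction mark strongly related features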
FS_by_dcorr_byscipy-------------------------------
distance_corr_byscipy
odor_n 1.785557
ring-type_p 1.540469
gill-size_b 1.540024
bruises_t 1.501530
stalk-surface-above-ring_s 1.491314
... ...
gill-size_n 0.459976
stalk-surface-below-ring_k 0.426476
stalk-surface-above-ring_k 0.412342
odor_f 0.376158
veil-type_p NaN
[117 rows x 1 columns]
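To read the column above, it helps to check what `scipy.spatial.distance.correlation` returns. Assuming that is the function behind `distance_corr_byscipy` (the name suggests it, but the source does not confirm it), the value is 1 minus the Pearson coefficient, so 0 means perfectly correlated, 2 means perfectly anti-correlated, and 1 means uncorrelated. A constant column has zero variance, which is consistent with the NaN reported for veil-type_p.

```python
import numpy as np
from scipy.spatial.distance import correlation
from scipy.stats import pearsonr

y = np.array([1, 0, 0, 1, 0, 1, 1, 0], dtype=float)
x_pos = y.copy()   # identical to the label, r = +1
x_neg = 1 - y      # complementary dummy, r = -1
x_mix = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)

d_pos = correlation(x_pos, y)
d_neg = correlation(x_neg, y)
d_mix = correlation(x_mix, y)
r_mix = pearsonr(x_mix, y)[0]

print(d_pos, d_neg)
print(d_mix, 1 - r_mix)  # the two numbers agree
```

Sorting by `abs(1 - d)` rather than by `d` would put odor_n and odor_f next to each other at the top, in line with the other rankings in this post.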
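# T1.5. Fit an RF model on each feature alone and rank features by the mean cross-validated score (the source calls it ACC; the negative values below suggest R² instead, since accuracy cannot be negative)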
FS_by_RF_CVS-------------------------------
importance
odor_n 0.6105
odor_f 0.3875
stalk-surface-above-ring_k 0.3442
stalk-surface-below-ring_k 0.3253
gill-color_b 0.2970
... ...
stalk-surface-above-ring_y -0.0003
stalk-root_b -0.0003
cap-color_p -0.0002
cap-shape_c -0.0001
cap-surface_g -0.0001
[117 rows x 1 columns]
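A sketch of the one-feature-at-a-time scoring, under the assumption that the score is a cross-validated R² from a random forest regressor (this would explain both the small negative values and the magnitudes above, but the original code is not shown). The data, fold count, and forest size are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = pd.DataFrame(rng.integers(0, 2, size=(300, 4)),
                 columns=["f0", "f1", "f2", "f3"])
X["f1"] = y  # the only informative column

scores = {}
for col in X.columns:
    model = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=0)
    # Fit on this single column only and average R^2 over the folds.
    cv_scores = cross_val_score(model, X[[col]], y, scoring="r2", cv=3)
    scores[col] = round(cv_scores.mean(), 4)

ranking = pd.Series(scores, name="importance").sort_values(ascending=False)
print(ranking)
```

Uninformative columns hover around zero and can dip below it, because R² is negative whenever the model predicts worse than the fold mean.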
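# T2. Wrapper methods: typically RFE, e.g. RFE_RF
# T2.1. Select features automatically with recursive feature elimination (RFE) on an RF model, then test and evaluate (the log below reports a DecisionTreeClassifier for the evaluation): very slow, about 2 minutes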
FS_by_RFEonRF-------------------------------
DecisionTreeClassifier auc_s: 0.9987
DecisionTreeClassifier time_cost: 0.015625
overall_accuracy RFE_on_RF: 0.9860541427399507
10 ['odor_f', 'odor_n', 'gill-size_b', 'gill-size_n', 'gill-color_b', 'stalk-shape_t', 'stalk-surface-above-ring_k', 'stalk-surface-below-ring_k', 'ring-type_p', 'spore-print-color_h']
RFE_on_RF, time_cost 119.78125
DecisionTreeClassifier auc_s: 0.9987
[[1274 0]
[ 34 1130]]
precision recall f1-score support
0 0.97 1.00 0.99 1274
1 1.00 0.97 0.99 1164
accuracy 0.99 2438
macro avg 0.99 0.99 0.99 2438
weighted avg 0.99 0.99 0.99 2438
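RFE refits the estimator repeatedly, dropping the weakest feature(s) each round, which is why it took about two minutes above. A minimal sketch with an RF estimator on synthetic data follows; the `step`, forest size, and data are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 8))
y = X[:, 1] | X[:, 4]  # only columns 1 and 4 matter

rf = RandomForestClassifier(n_estimators=50, random_state=0)
# step=1 removes one feature per round until two remain.
rfe = RFE(estimator=rf, n_features_to_select=2, step=1).fit(X, y)

selected = np.flatnonzero(rfe.support_).tolist()
print(selected)
print(rfe.ranking_)  # 1 marks a kept feature; larger means dropped earlier
```

A larger `step` trades some selection granularity for fewer refits, which is one way to cut the run time on 117 columns.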
# T3. Embedded methods: typically SelectFromModel, e.g. Lasso / RF / SF_ETreesC
'''
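SelectFromModel is a meta-transformer that selects features based on importance weights.
It works with any scikit-learn estimator that exposes a coef_ or feature_importances_ attribute after fitting.
Compared with RFE, SelectFromModel is a less robust solution: it simply drops the less important features against a computed threshold, with no iterative optimization involved.
ETree vs. RF: extremely randomized trees (ETree) can yield lower variance and therefore a lower risk of overfitting. In ETree the training samples are drawn without replacement (scikit-learn's default is bootstrap=False, so each tree sees the whole training set), and split thresholds are picked at random.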
'''
# T3.1. Cross-validate a Lasso regularizer and visualize the feature importances

FS_by_LassoCV_coef-------------------------------
LassoCV_model.alpha_: 0.0003964898084478883
LassoCV_model.score: 0.9971840741918596
44 ['odor_n', 'odor_l', 'odor_a', 'stalk-root_r', 'stalk-surface-above-ring_y', 'stalk-color-above-ring_c', 'ring-type_f', 'gill-size_b', 'odor_m', 'habitat_w', 'cap-color_c', 'gill-attachment_a', 'spore-print-color_u', 'cap-shape_s', 'stalk-color-below-ring_n', 'cap-color_n', 'cap-surface_f', 'ring-number_n', 'stalk-surface-above-ring_f', 'spore-print-color_n', 'stalk-surface-below-ring_f', 'stalk-color-below-ring_c', 'veil-color_o', 'gill-spacing_w', 'gill-attachment_f', 'gill-size_n', 'gill-spacing_c', 'cap-color_w', 'stalk-surface-above-ring_k', 'stalk-color-below-ring_y', 'population_c', 'veil-color_y', 'stalk-color-above-ring_y', 'ring-number_o', 'cap-surface_g', 'spore-print-color_w', 'spore-print-color_h', 'odor_c', 'odor_p', 'odor_y', 'odor_s', 'odor_f', 'spore-print-color_r', 'stalk-surface-below-ring_y']
-------------------------------
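The 44 features listed above are presumably the columns whose Lasso coefficient is non-zero at the cross-validated `alpha_`. The sketch below shows that selection rule on synthetic regression data (the data and fold count are assumptions). `LassoCV.score` returns R², so the 0.997 above is an R² on the 0/1 label, not an accuracy.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only columns 0 and 3 drive the target; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

model = LassoCV(cv=5, random_state=0).fit(X, y)

# The L1 penalty pushes irrelevant coefficients to exactly zero.
selected = np.flatnonzero(model.coef_ != 0).tolist()
print("alpha_:", model.alpha_)
print("score (R^2):", model.score(X, y))
print(len(selected), selected)
```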
# T3.2. Tree models (RF): select features by RF feature importance, then test and evaluate

FS_by_importance-------------------------------
10 odor_n 0.126408
odor_f 0.068441
gill-size_b 0.065297
gill-size_n 0.057966
stalk-surface-above-ring_k 0.044860
spore-print-color_h 0.044733
gill-color_b 0.039469
stalk-surface-below-ring_k 0.038445
ring-type_p 0.034700
bruises_f 0.026444
dtype: float64
10 ['odor_n', 'odor_f', 'gill-size_b', 'gill-size_n', 'stalk-surface-above-ring_k', 'spore-print-color_h', 'gill-color_b', 'stalk-surface-below-ring_k', 'ring-type_p', 'bruises_f']
RandomForestClassifier auc_s: 0.9933
RandomForestClassifier time_cost: 1.09375
RandomForestClassifier auc_s: 0.9933
[[1248 26]
[ 31 1133]]
precision recall f1-score support
0 0.98 0.98 0.98 1274
1 0.98 0.97 0.98 1164
accuracy 0.98 2438
macro avg 0.98 0.98 0.98 2438
weighted avg 0.98 0.98 0.98 2438
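A sketch of the importance-based top-k selection, assuming the impurity-based `feature_importances_` of a fitted `RandomForestClassifier` sorted in descending order, as the printed Series suggests. The data and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(0, 2, size=(300, 6)),
                 columns=[f"f{i}" for i in range(6)])
y = (X["f2"] & X["f4"]).to_numpy()  # label depends on f2 and f4 only

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances are normalized to sum to 1 across all columns.
importances = (pd.Series(rf.feature_importances_, index=X.columns)
                 .sort_values(ascending=False))
top_cols = importances.index[:2].tolist()
print(importances)
print(len(top_cols), top_cols)
```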
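# T3.3. Select features automatically with the general-purpose built-in selector SF (SelectFromModel, based on ETC), then test and evaluate with an RF model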


FS_by_SFMonETC-------------------------------
train_X (5686, 117)
after selected_train_fit_X (5686, 29)
Feature Importance
27 odor_n 0.140039
35 gill-size_b 0.075933
24 odor_f 0.065969
57 stalk-surface-above-ring_k 0.054334
36 gill-size_n 0.048152
.. ... ...
43 gill-color_o 0.000010
103 spore-print-color_y 0.000000
83 veil-color_n 0.000000
82 veil-type_p 0.000000
95 spore-print-color_b 0.000000
[117 rows x 2 columns]
after selected_train_X (5686, 28)
28 ['bruises_f', 'bruises_t', 'odor_c', 'odor_f', 'odor_l', 'odor_n', 'odor_p', 'gill-spacing_c', 'gill-spacing_w', 'gill-size_b', 'gill-size_n', 'gill-color_b', 'stalk-shape_e', 'stalk-shape_t', 'stalk-root_b', 'stalk-root_c', 'stalk-root_e', 'stalk-surface-above-ring_k', 'stalk-surface-above-ring_s', 'stalk-surface-below-ring_f', 'stalk-surface-below-ring_k', 'stalk-surface-below-ring_s', 'ring-type_p', 'spore-print-color_h', 'spore-print-color_n', 'spore-print-color_w', 'population_v', 'habitat_g']
RandomForestClassifier auc_s: 1.0
RandomForestClassifier time_cost: 1.1875
10
odor_n 0.193514
odor_f 0.110004
gill-size_b 0.078928
gill-size_n 0.070207
stalk-surface-above-ring_k 0.067094
gill-color_b 0.050591
spore-print-color_h 0.047754
stalk-surface-below-ring_k 0.046781
ring-type_p 0.035519
bruises_f 0.025825
dtype: float64
28 [ 5 3 9 10 17 11 23 20 22 0 1 8 14 25 7 26 18 15 13 12 16 21 6 2
4 27 19 24]
RandomForestClassifier auc_s: 1.0
[[1274 0]
[ 0 1164]]
precision recall f1-score support
0 1.00 1.00 1.00 1274
1 1.00 1.00 1.00 1164
accuracy 1.00 2438
macro avg 1.00 1.00 1.00 2438
weighted avg 1.00 1.00 1.00 2438
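For reference, a minimal `SelectFromModel` sketch on top of `ExtraTreesClassifier`. With `threshold="mean"` (also the default for estimators without an L1 penalty), every feature whose importance is at least the average importance is kept, so the number of survivors is decided by the data rather than fixed in advance, as with the 28 columns above. The data and forest size here are assumptions.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 10))
y = X[:, 0] | X[:, 7]  # only columns 0 and 7 matter

etc = ExtraTreesClassifier(n_estimators=100, random_state=0)
sfm = SelectFromModel(etc, threshold="mean").fit(X, y)

X_sel = sfm.transform(X)
kept = np.flatnonzero(sfm.get_support()).tolist()

print("before:", X.shape, "after:", X_sel.shape)
print(kept)
print(np.round(sfm.estimator_.feature_importances_, 4))
```

Passing `max_features=k` to `SelectFromModel` caps the number of kept columns when a fixed budget is needed.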