ML / FE / FS: Feature engineering and data preprocessing: feature selection with filter, wrapper, and embedded methods (RF/SF) on the mushroom dataset (binary classification), a complete worked case

 处女座的程序猿, published 2023-04-16 in Shanghai

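A complete case that applies several feature selection techniques (PCC_SVMC/chi2_RF/MIC/DiC/single-feature RF, RFE_RLasso/RF/SF_ETreesC) to predict whether a mushroom is poisonous (binary classification)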


# 1. Define the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring    8124 non-null   object
 15  stalk-color-below-ring    8124 non-null   object
 16  veil-type                 8124 non-null   object
 17  veil-color                8124 non-null   object
 18  ring-number               8124 non-null   object
 19  ring-type                 8124 non-null   object
 20  spore-print-color         8124 non-null   object
 21  population                8124 non-null   object
 22  habitat                   8124 non-null   object
dtypes: object(23)
memory usage: 1.4+ MB
None
  class cap-shape cap-surface  ... spore-print-color population habitat
0     p         x           s  ...                 k          s       u
1     e         x           s  ...                 n          n       g
2     e         b           s  ...                 n          n       m
3     p         x           y  ...                 k          s       u
4     e         x           s  ...                 n          a       g

[5 rows x 23 columns]
(8124, 23)

# 2. Feature engineering / data preprocessing

# 2.1. Percentage of missing values per feature

                          percent_missing
class                                 0.0
cap-shape                             0.0
cap-surface                           0.0
cap-color                             0.0
bruises                               0.0
odor                                  0.0
gill-attachment                       0.0
gill-spacing                          0.0
gill-size                             0.0
gill-color                            0.0
stalk-shape                           0.0
stalk-root                            0.0
stalk-surface-above-ring              0.0
stalk-surface-below-ring              0.0
stalk-color-above-ring                0.0
stalk-color-below-ring                0.0
veil-type                             0.0
veil-color                            0.0
ring-number                           0.0
ring-type                             0.0
spore-print-color                     0.0
population                            0.0
habitat                               0.0
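
A minimal sketch of how a table like the one above can be produced, assuming the data sits in a pandas DataFrame (the toy frame `df` below is hypothetical, not the real file). It also checks for a placeholder string: in the UCI release of this dataset, missing `stalk-root` values are coded as `?`, which `isnull()` does not count, so a 0.0 in the table does not by itself rule out missing data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the mushroom data.
df = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["f", "n", None, "n"],
    "stalk-root": ["?", "b", "?", "c"],
})

# Percentage of missing (NaN/None) cells per column.
missing_df = pd.DataFrame({"percent_missing": df.isnull().mean() * 100})
print(missing_df)

# Treat the "?" placeholder as missing and recompute.
missing_with_placeholder = df.replace("?", np.nan).isnull().mean() * 100
print(missing_with_placeholder)
```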

# 2.2. Separate features and label

# 2.3. Inspect the target label

e    4208
p    3916
Name: class, dtype: int64

# 2.4. Feature encoding

The feature columns are one-hot encoded (OneHotEncoding); the label is label-encoded (LE).

df_X2dum_cols 117 Index(['cap-shape_b', 'cap-shape_c', 'cap-shape_f', 'cap-shape_k',
       'cap-shape_s', 'cap-shape_x', 'cap-surface_f', 'cap-surface_g',
       'cap-surface_s', 'cap-surface_y',
       ...
       'population_s', 'population_v', 'population_y', 'habitat_d',
       'habitat_g', 'habitat_l', 'habitat_m', 'habitat_p', 'habitat_u',
       'habitat_w'],
      dtype='object', length=117)

   cap-shape_b  cap-shape_c  cap-shape_f  ...  habitat_p  habitat_u  habitat_w
0            0            0            0  ...          0          1          0
1            0            0            0  ...          0          0          0
2            1            0            0  ...          0          0          0
3            0            0            0  ...          0          1          0
4            0            0            0  ...          0          0          0

[5 rows x 117 columns]
[1 0 0 ... 0 1 0]
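
The encoding step can be sketched as follows. This is an illustration on a hypothetical four-row frame, assuming `pandas.get_dummies` for the one-hot step and scikit-learn's `LabelEncoder` for the label; the original code may use different calls. `LabelEncoder` sorts the classes, so `e` maps to 0 and `p` to 1, which agrees with the first label values printed above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data with the same kind of single-letter categories.
X = pd.DataFrame({"odor": ["f", "n", "n", "a"],
                  "bruises": ["t", "f", "t", "f"]})
y = pd.Series(["p", "e", "e", "p"], name="class")

# One 0/1 column per (feature, category) pair.
X_dum = pd.get_dummies(X, dtype=int)

# Map the string label to integers (classes are sorted alphabetically).
le = LabelEncoder()
y_enc = le.fit_transform(y)

print(list(X_dum.columns))
print(y_enc, list(le.classes_))
```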

# 2.5. Data standardization (z-scores: zero mean and unit variance per column, so values are not confined to -1~1)

df_X2dum2stanard 
 [[-0.24272523 -0.02219484 -0.79620985 ... -0.40484176  4.59086996
  -0.15558197]
 [-0.24272523 -0.02219484 -0.79620985 ... -0.40484176 -0.21782364
  -0.15558197]
 [ 4.11988487 -0.02219484 -0.79620985 ... -0.40484176 -0.21782364
  -0.15558197]
 ...
 [-0.24272523 -0.02219484  1.2559503  ... -0.40484176 -0.21782364
  -0.15558197]
 [-0.24272523 -0.02219484 -0.79620985 ... -0.40484176 -0.21782364
  -0.15558197]
 [-0.24272523 -0.02219484 -0.79620985 ... -0.40484176 -0.21782364
  -0.15558197]]
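
The printed values go well beyond 1 in magnitude, which is consistent with z-score standardization rather than scaling into a fixed range. A small sketch, assuming scikit-learn's `StandardScaler` (the original code is not shown, so this is an assumption), shows why a rare dummy column produces large positive values: for a 0/1 column with mean p, the value 1 maps to sqrt((1 - p) / p).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical dummy matrix: column 0 is rare (one 1 in eight rows),
# column 1 is balanced.
X = np.array([[1, 0], [0, 1], [0, 0], [0, 1],
              [0, 0], [0, 1], [0, 0], [0, 1]], dtype=float)

X_std = StandardScaler().fit_transform(X)

p = X[:, 0].mean()                      # share of ones in the rare column
expected_z = np.sqrt((1 - p) / p)       # z-score of a 1 in that column
print(X_std[0, 0], expected_z)
```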

# 3. Feature selection + model training and evaluation

# 3.1. Split the dataset

# 3.2. Model training and evaluation

# T1. LoR, LinearSVC, DTC, and RF models

LinearSVC time_cost:  0.140625
LogisticRegression auc_s:  1.0
LogisticRegression time_cost:  0.0625
DecisionTreeClassifier auc_s:  1.0
DecisionTreeClassifier time_cost:  0.0625
RandomForestClassifier auc_s:  1.0
RandomForestClassifier time_cost:  1.546875

[[1274    0]
 [   0 1164]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1274
           1       1.00      1.00      1.00      1164

    accuracy                           1.00      2438
   macro avg       1.00      1.00      1.00      2438
weighted avg       1.00      1.00      1.00      2438

LinearSVC time_cost:  0.21875
LogisticRegression auc_s:  1.0
[[1274    0]
 [   0 1164]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1274
           1       1.00      1.00      1.00      1164

    accuracy                           1.00      2438
   macro avg       1.00      1.00      1.00      2438
weighted avg       1.00      1.00      1.00      2438

LogisticRegression time_cost:  0.125
DecisionTreeClassifier auc_s:  1.0
[[1274    0]
 [   0 1164]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1274
           1       1.00      1.00      1.00      1164

    accuracy                           1.00      2438
   macro avg       1.00      1.00      1.00      2438
weighted avg       1.00      1.00      1.00      2438

DecisionTreeClassifier time_cost:  0.0625
RandomForestClassifier auc_s:  1.0
[[1274    0]
 [   0 1164]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1274
           1       1.00      1.00      1.00      1164

    accuracy                           1.00      2438
   macro avg       1.00      1.00      1.00      2438
weighted avg       1.00      1.00      1.00      2438

RandomForestClassifier time_cost:  1.5625
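
A hedged sketch of the split-train-evaluate loop behind logs like the ones above. It runs on synthetic data from `make_classification`, not the mushroom file, and `test_size=0.3` is an assumption (it is consistent with the 2438 test rows reported out of 8124). Computing AUC from hard predictions is also an assumption about how `auc_s` was obtained.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = [LinearSVC(),
          LogisticRegression(max_iter=1000),
          DecisionTreeClassifier(random_state=0),
          RandomForestClassifier(n_estimators=100, random_state=0)]

results = {}
for model in models:
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    results[type(model).__name__] = {
        "auc": roc_auc_score(y_te, y_pred),      # AUC on hard labels
        "cm": confusion_matrix(y_te, y_pred),
        "time_cost": time.perf_counter() - start,
    }

for name, res in results.items():
    print(name, "auc_s:", round(res["auc"], 4), "time_cost:", res["time_cost"])
```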

# Visualize the tree structure of the decision tree model

# 3.3. Feature selection

# T1. Filter methods: SelectKBest is the usual selector, PCC_SVMC/chi2_RF

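'''
Univariate feature selection is a statistical approach that keeps the features with the strongest relationship to the label.
Different scoring functions are available depending on the task:
Classification: chi2, f_classif, mutual_info_classif
Regression: f_regression, mutual_info_regression
Core idea: score the importance of each feature on its own, then keep the most relevant features for modeling.
(1) SelectKBest, SelectPercentile: both use univariate statistics to score each feature and then select by score.
SelectKBest keeps the k highest-scoring features, while SelectPercentile keeps the highest-scoring given percentage of features.
(2) SelectFpr, SelectFdr, SelectFwe: these select features based on the false positive rate, the false discovery rate, and the family-wise error rate, respectively.
SelectFpr controls the false positive rate, SelectFdr controls the false discovery rate, and SelectFwe controls the family-wise error rate.
(3) GenericUnivariateSelect: a general-purpose univariate selector that lets you choose both the scoring function and the selection strategy.
'''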
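
As a small illustration of the SelectKBest idea described in the note above, the sketch below scores hypothetical 0/1 features with `chi2` and keeps the two best. The data is synthetic: columns 0 and 3 are built from the label on purpose, so they should win.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.integers(0, 2, size=(200, 6))   # six random 0/1 features
X[:, 0] = y                             # copy of the label
X[:, 3] = 1 - y                         # complement of the label

# chi2 requires non-negative features, which 0/1 dummies satisfy.
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
selected = selector.get_support(indices=True)

print(selected)
print(np.round(selector.scores_, 2))
```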

# T1.0. Variance-threshold filtering (looks at each feature on its own): for [discrete] variables, remove columns whose variance is below 0.2
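
A minimal sketch of this step, assuming scikit-learn's `VarianceThreshold` (the original code is not shown). For a 0/1 column with mean p the variance is p(1 - p), so a threshold of 0.2 keeps only columns whose share of ones lies roughly between 0.28 and 0.72. The matrix below is a hypothetical example.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical dummy matrix:
# col 0 is constant (variance 0), col 1 is balanced (0.25),
# col 2 has five ones out of six (5/36, about 0.139).
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [1, 0, 1],
              [1, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])

vt = VarianceThreshold(threshold=0.2).fit(X)
kept = vt.get_support(indices=True)

print(vt.variances_)
print(kept)
```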

# T1.1. Select features by Pearson correlation coefficient (PCC) to reduce dimensionality, then test and evaluate with an SVM model

FS_by_corr-------------------------------
10 ['odor_n', 'odor_f', 'stalk-surface-above-ring_k', 'stalk-surface-below-ring_k', 'ring-type_p', 'gill-size_n', 'gill-size_b', 'gill-color_b', 'bruises_t', 'bruises_f']
LinearSVC time_cost:  0.03125
[[1248   26]
 [  46 1118]]
              precision    recall  f1-score   support

           0       0.96      0.98      0.97      1274
           1       0.98      0.96      0.97      1164

    accuracy                           0.97      2438
   macro avg       0.97      0.97      0.97      2438
weighted avg       0.97      0.97      0.97      2438
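
One way to pick the top-k features by Pearson correlation is sketched below. The helper `top_k_by_pearson` is hypothetical, and ranking by absolute value is an assumption, although it would explain why the list above contains complementary pairs such as `gill-size_n`/`gill-size_b` and `bruises_t`/`bruises_f`: the two dummies of a binary attribute have the same |r| and carry the same information.

```python
import numpy as np
import pandas as pd

def top_k_by_pearson(X: pd.DataFrame, y: pd.Series, k: int) -> list:
    """Return the k column names with the largest |Pearson r| against y."""
    corr = X.corrwith(y)                 # NaN for constant columns
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()

rng = np.random.default_rng(0)
y = pd.Series(rng.integers(0, 2, size=200), name="class")
X = pd.DataFrame({
    "f_t": y,                            # hypothetical dummy equal to the label
    "f_f": 1 - y,                        # its complement
    "noise": rng.integers(0, 2, size=200),
    "const": 1,                          # constant column, r is undefined
})

top2 = top_k_by_pearson(X, y, k=2)
print(top2)
```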

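# T1.2. Select features automatically with the chi-squared test (chi2) to reduce dimensionality, then test and evaluate with an RF model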

FS_by_chi2-------------------------------
10 ['odor_n', 'odor_f', 'stalk-surface-above-ring_k', 'stalk-surface-below-ring_k', 'gill-color_b', 'gill-size_n', 'spore-print-color_h', 'ring-type_l', 'ring-type_p', 'bruises_t']
RandomForestClassifier auc_s:  0.9933
RandomForestClassifier time_cost:  1.6875
FS_by_chi2 time_cost:  1.765625
RandomForestClassifier auc_s:  0.9933
[[1248   26]
 [  31 1133]]
              precision    recall  f1-score   support

           0       0.98      0.98      0.98      1274
           1       0.98      0.97      0.98      1164

    accuracy                           0.98      2438
   macro avg       0.98      0.98      0.98      2438
weighted avg       0.98      0.98      0.98      2438

# T1.3. Use MIC (maximal information coefficient) to score the dependence between each feature and the label

FS_by_MIC_SelectKBest-------------------------------
                            MIC_value P_value
odor_n                       0.528778    None
odor_f                       0.357168    None
stalk-surface-above-ring_k   0.284429    None
stalk-surface-below-ring_k   0.270560    None
gill-color_b                 0.269398    None
...                               ...     ...
cap-shape_c                  0.000519    None
cap-shape_f                  0.000248    None
stalk-root_b                 0.000226    None
stalk-surface-above-ring_y   0.000194    None
veil-type_p                  0.000000    None

[117 rows x 2 columns]
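
The log label `FS_by_MIC_SelectKBest` and the empty `P_value` column suggest, as an assumption, that the scores come from mutual information through a SelectKBest-style selector, since mutual information scorers return no p-values. A sketch with scikit-learn's `mutual_info_classif` on synthetic 0/1 data shows the two extremes seen in the table: a constant column such as `veil-type_p` scores 0, and a column that determines the label scores the label's entropy (in nats).

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
X = np.column_stack([
    y,                               # copy of the label
    rng.integers(0, 2, size=400),    # unrelated noise
    np.ones(400, dtype=int),         # constant column
])

# discrete_features=True treats every column as categorical.
mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)
print(np.round(mi, 6))
```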

# T1.4. Use DiC (distance correlation) to score the relationship between each feature and the label

FS_by_dcorr_byscipy-------------------------------
                            distance_corr_byscipy
odor_n                                   1.785557
ring-type_p                              1.540469
gill-size_b                              1.540024
bruises_t                                1.501530
stalk-surface-above-ring_s               1.491314
...                                           ...
gill-size_n                              0.459976
stalk-surface-below-ring_k               0.426476
stalk-surface-above-ring_k               0.412342
odor_f                                   0.376158
veil-type_p                                   NaN

[117 rows x 1 columns]
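
The scores above range from 0 to 2 and sit in a column named `distance_corr_byscipy`, which suggests (an assumption, since the code is not shown) that they are SciPy's correlation distance, 1 - Pearson r, rather than the distance correlation statistic. Read that way, values far from 1 in either direction mark a strong linear relationship, complementary dummies sum to 2 (as `gill-size_b` 1.540024 and `gill-size_n` 0.459976 do), and a constant column has an undefined r. The sketch below checks the first two properties on hypothetical vectors.

```python
import numpy as np
from scipy.spatial.distance import correlation

# Hypothetical label and 0/1 feature.
y = np.array([1, 0, 0, 1, 0, 1, 1, 0], dtype=float)
f = np.array([1, 0, 0, 1, 0, 0, 1, 0], dtype=float)

d_f = correlation(f, y)          # correlation distance of the dummy
d_not_f = correlation(1 - f, y)  # same for the complementary dummy
r = np.corrcoef(f, y)[0, 1]      # plain Pearson correlation

print(d_f, d_not_f, r)
```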

# T1.5. Fit an RF model on each feature individually and compute the mean ACC

FS_by_RF_CVS-------------------------------
                           importance
odor_n                         0.6105
odor_f                         0.3875
stalk-surface-above-ring_k     0.3442
stalk-surface-below-ring_k     0.3253
gill-color_b                   0.2970
...                               ...
stalk-surface-above-ring_y    -0.0003
stalk-root_b                  -0.0003
cap-color_p                   -0.0002
cap-shape_c                   -0.0001
cap-surface_g                 -0.0001

[117 rows x 1 columns]
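
A sketch of the per-feature scoring loop, assuming `cross_val_score` (the log label `FS_by_RF_CVS` hints at it) on hypothetical toy data. It scores with accuracy, as the heading states. The slightly negative values in the table above cannot be accuracies, so the original run may have used a different scorer; that part is not recoverable from the output.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
y = np.array([0, 1] * 30)
X = pd.DataFrame({"copy": y,                              # determines the label
                  "noise": rng.integers(0, 2, size=60)})  # unrelated

scores = {}
for col in X.columns:
    model = RandomForestClassifier(n_estimators=20, random_state=0)
    # X[[col]] keeps a 2-D frame with a single feature.
    cv_scores = cross_val_score(model, X[[col]], y, cv=3, scoring="accuracy")
    scores[col] = cv_scores.mean()

ranking = pd.Series(scores).sort_values(ascending=False)
print(ranking)
```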

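# T2. Wrapper methods: RFE is the usual choice, e.g. RFE_RF

# T2.1. Use recursive feature elimination (RFE) to select features automatically and reduce dimensionality, then test and evaluate with an RF model: very time-consuming, about 2 minutes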


FS_by_RFEonRF-------------------------------
DecisionTreeClassifier auc_s:  0.9987

DecisionTreeClassifier time_cost:  0.015625
overall_accuracy RFE_on_RF:  0.9860541427399507
10 ['odor_f', 'odor_n', 'gill-size_b', 'gill-size_n', 'gill-color_b', 'stalk-shape_t', 'stalk-surface-above-ring_k', 'stalk-surface-below-ring_k', 'ring-type_p', 'spore-print-color_h']
RFE_on_RF, time_cost 119.78125
DecisionTreeClassifier auc_s:  0.9987
[[1274    0]
 [  34 1130]]
              precision    recall  f1-score   support

           0       0.97      1.00      0.99      1274
           1       1.00      0.97      0.99      1164

    accuracy                           0.99      2438
   macro avg       0.99      0.99      0.99      2438
weighted avg       0.99      0.99      0.99      2438
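
Recursive feature elimination refits the estimator after every removal, which is why the run above took about two minutes. A small sketch with scikit-learn's `RFE` on synthetic data follows; the estimator settings and `n_features_to_select=3` are illustrative assumptions. With `step=1`, twelve features, and three kept, the first feature eliminated ends up with rank 10.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data with 12 features.
X, y = make_classification(n_samples=300, n_features=12, n_informative=3,
                           n_redundant=0, random_state=0)

rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
          n_features_to_select=3, step=1).fit(X, y)

selected = rfe.get_support(indices=True)   # indices of the kept features
print(selected)
print(rfe.ranking_)                        # 1 = kept, larger = removed earlier
```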

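# T3. Embedded methods: SelectFromModel is the usual choice, e.g. Lasso/RF/SF_ETreesC

'''
SelectFromModel is a meta-transformer that selects features based on importance weights.
It works with any scikit-learn model that exposes a coef_ or feature_importances_ attribute after fitting.
Compared with RFE, SelectFromModel is a less robust solution: it simply drops the features whose importance falls below a computed threshold, with no iterative refitting.
ETree vs. RF: extremely randomized trees (ETree) can yield lower variance (and therefore a lower risk of overfitting). By default in scikit-learn, each ETree tree is fit on the full training set rather than on a bootstrap resample, and split thresholds are drawn at random.
'''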
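
The threshold behaviour described in the note above can be sketched with `SelectFromModel` wrapped around an `ExtraTreesClassifier`. The data is synthetic and the settings are illustrative assumptions; `threshold="mean"` keeps the features whose importance is at least the mean importance, in a single pass.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data with 12 features.
X, y = make_classification(n_samples=300, n_features=12, n_informative=3,
                           n_redundant=0, random_state=0)

sfm = SelectFromModel(ExtraTreesClassifier(n_estimators=100, random_state=0),
                      threshold="mean").fit(X, y)
X_sel = sfm.transform(X)

importances = sfm.estimator_.feature_importances_
print(X.shape, "->", X_sel.shape)
print(np.round(importances, 4), sfm.threshold_)
```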

# T3.1. Use cross-validated Lasso regularization (LassoCV) and visualize feature importance

FS_by_LassoCV_coef-------------------------------
LassoCV_model.alpha_:  0.0003964898084478883
LassoCV_model.score:  0.9971840741918596
44 ['odor_n', 'odor_l', 'odor_a', 'stalk-root_r', 'stalk-surface-above-ring_y', 'stalk-color-above-ring_c', 'ring-type_f', 'gill-size_b', 'odor_m', 'habitat_w', 'cap-color_c', 'gill-attachment_a', 'spore-print-color_u', 'cap-shape_s', 'stalk-color-below-ring_n', 'cap-color_n', 'cap-surface_f', 'ring-number_n', 'stalk-surface-above-ring_f', 'spore-print-color_n', 'stalk-surface-below-ring_f', 'stalk-color-below-ring_c', 'veil-color_o', 'gill-spacing_w', 'gill-attachment_f', 'gill-size_n', 'gill-spacing_c', 'cap-color_w', 'stalk-surface-above-ring_k', 'stalk-color-below-ring_y', 'population_c', 'veil-color_y', 'stalk-color-above-ring_y', 'ring-number_o', 'cap-surface_g', 'spore-print-color_w', 'spore-print-color_h', 'odor_c', 'odor_p', 'odor_y', 'odor_s', 'odor_f', 'spore-print-color_r', 'stalk-surface-below-ring_y']
-------------------------------
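
A hedged sketch of Lasso-based selection on synthetic data: `LassoCV` chooses `alpha_` by cross-validation, and the features with non-zero coefficients are kept. Because Lasso is a regressor, its `score` is R² on the numeric label, not classification accuracy. The feature names `f0`..`f11` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LassoCV

# Synthetic stand-in data; the 0/1 label is treated as a numeric target.
X, y = make_classification(n_samples=300, n_features=12, n_informative=3,
                           n_redundant=0, random_state=0)
names = [f"f{i}" for i in range(X.shape[1])]

lasso = LassoCV(cv=5, random_state=0).fit(X, y)

# Features whose coefficient was not shrunk to exactly zero.
kept = [name for name, coef in zip(names, lasso.coef_) if coef != 0]
r2 = lasso.score(X, y)

print("alpha_:", lasso.alpha_)
print("R^2:", r2)
print(len(kept), kept)
```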

# T3.2. Tree models: select features by RF feature importance to reduce dimensionality, then test and evaluate

FS_by_importance-------------------------------
10 odor_n                        0.126408
odor_f                        0.068441
gill-size_b                   0.065297
gill-size_n                   0.057966
stalk-surface-above-ring_k    0.044860
spore-print-color_h           0.044733
gill-color_b                  0.039469
stalk-surface-below-ring_k    0.038445
ring-type_p                   0.034700
bruises_f                     0.026444
dtype: float64
10 ['odor_n', 'odor_f', 'gill-size_b', 'gill-size_n', 'stalk-surface-above-ring_k', 'spore-print-color_h', 'gill-color_b', 'stalk-surface-below-ring_k', 'ring-type_p', 'bruises_f']
RandomForestClassifier auc_s:  0.9933
RandomForestClassifier time_cost:  1.09375
RandomForestClassifier auc_s:  0.9933
[[1248   26]
 [  31 1133]]
              precision    recall  f1-score   support

           0       0.98      0.98      0.98      1274
           1       0.98      0.97      0.98      1164

    accuracy                           0.98      2438
   macro avg       0.98      0.98      0.98      2438
weighted avg       0.98      0.98      0.98      2438
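
Importance-based selection with a random forest can be sketched as below: fit once, sort `feature_importances_`, and keep the top k columns. The data is synthetic, the column names are hypothetical, and k=3 is chosen only for the illustration (the run above keeps 10).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with named columns.
X_arr, y = make_classification(n_samples=300, n_features=12, n_informative=3,
                               n_redundant=0, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(12)])

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across features.
importance = pd.Series(rf.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)

top3 = importance.head(3).index.tolist()
X_top = X[top3]                      # reduced feature matrix

print(importance.head(3))
print(X_top.shape)
```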

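# T3.3. Use the general-purpose built-in selector SF (SelectFromModel, based on ETC) to select features automatically and reduce dimensionality, then test and evaluate with an RF model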

FS_by_SFMonETC-------------------------------
train_X (5686, 117)
after selected_train_fit_X (5686, 29)
                        Feature  Importance
27                       odor_n    0.140039
35                  gill-size_b    0.075933
24                       odor_f    0.065969
57   stalk-surface-above-ring_k    0.054334
36                  gill-size_n    0.048152
..                          ...         ...
43                 gill-color_o    0.000010
103         spore-print-color_y    0.000000
83                 veil-color_n    0.000000
82                  veil-type_p    0.000000
95          spore-print-color_b    0.000000

[117 rows x 2 columns]
after selected_train_X (5686, 28)
28 ['bruises_f', 'bruises_t', 'odor_c', 'odor_f', 'odor_l', 'odor_n', 'odor_p', 'gill-spacing_c', 'gill-spacing_w', 'gill-size_b', 'gill-size_n', 'gill-color_b', 'stalk-shape_e', 'stalk-shape_t', 'stalk-root_b', 'stalk-root_c', 'stalk-root_e', 'stalk-surface-above-ring_k', 'stalk-surface-above-ring_s', 'stalk-surface-below-ring_f', 'stalk-surface-below-ring_k', 'stalk-surface-below-ring_s', 'ring-type_p', 'spore-print-color_h', 'spore-print-color_n', 'spore-print-color_w', 'population_v', 'habitat_g']
RandomForestClassifier auc_s:  1.0
RandomForestClassifier time_cost:  1.1875
10 
odor_n                        0.193514
odor_f                        0.110004
gill-size_b                   0.078928
gill-size_n                   0.070207
stalk-surface-above-ring_k    0.067094
gill-color_b                  0.050591
spore-print-color_h           0.047754
stalk-surface-below-ring_k    0.046781
ring-type_p                   0.035519
bruises_f                     0.025825
dtype: float64
28 [ 5  3  9 10 17 11 23 20 22  0  1  8 14 25  7 26 18 15 13 12 16 21  6  2
  4 27 19 24]
RandomForestClassifier auc_s:  1.0
[[1274    0]
 [   0 1164]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1274
           1       1.00      1.00      1.00      1164

    accuracy                           1.00      2438
   macro avg       1.00      1.00      1.00      2438
weighted avg       1.00      1.00      1.00      2438
