
Feature Selection in R

 拓端数据 2020-07-19

Variable Selection Methods

All Possible Regressions

    library(olsrr)

    # Fit the full model, then evaluate every subset of its predictors.
    # In newer olsrr releases this function is named ols_step_all_possible().
    model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
    ols_all_subset(model)
    ## # A tibble: 15 x 6
    ##    Index     N Predictors        `R-Square` `Adj. R-Square` `Mallow's Cp`
    ##
    ##  1     1     1 wt                   0.75283         0.74459      12.48094
    ##  2     2     1 disp                 0.71834         0.70895      18.12961
    ##  3     3     1 hp                   0.60244         0.58919      37.11264
    ##  4     4     1 qsec                 0.17530         0.14781     107.06962
    ##  5     5     2 hp wt                0.82679         0.81484       2.36900
    ##  6     6     2 wt qsec              0.82642         0.81444       2.42949
    ##  7     7     2 disp wt              0.78093         0.76582       9.87910
    ##  8     8     2 disp hp              0.74824         0.73088      15.23312
    ##  9     9     2 disp qsec            0.72156         0.70236      19.60281
    ## 10    10     2 hp qsec              0.63688         0.61183      33.47215
    ## 11    11     3 hp wt qsec           0.83477         0.81706       3.06167
    ## 12    12     3 disp hp wt           0.82684         0.80828       4.36070
    ## 13    13     3 disp wt qsec         0.82642         0.80782       4.42934
    ## 14    14     3 disp hp qsec         0.75420         0.72786      16.25779
    ## 15    15     4 disp hp wt qsec      0.83514         0.81072       5.00000

The plot method displays the fit of every model produced by all possible regressions.

    model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
    k <- ols_all_subset(model)
    plot(k)
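To show what the all-possible-regressions table contains, here is a minimal base-R sketch of the same enumeration. It is not olsrr's implementation: the helper `fit_subset` and the column names are hypothetical, and Mallows' Cp is computed with the textbook formula Cp = SSE_p / MSE_full - (n - 2p), which is assumed to be what the package reports.

```r
# Base-R sketch of all possible regressions on mtcars (no olsrr required).
predictors <- c("disp", "hp", "wt", "qsec")
n <- nrow(mtcars)

# Residual mean square of the full model, used as the variance estimate in Cp.
full <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
mse_full <- sum(resid(full)^2) / df.residual(full)

# Hypothetical helper: fit one subset and return its statistics as one row.
fit_subset <- function(vars) {
  fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)
  s <- summary(fit)
  p <- length(vars) + 1  # number of coefficients, including the intercept
  data.frame(
    n_pred = length(vars),
    predictors = paste(vars, collapse = " "),
    r_square = s$r.squared,
    adj_r_square = s$adj.r.squared,
    cp = sum(resid(fit)^2) / mse_full - (n - 2 * p),
    stringsAsFactors = FALSE
  )
}

# Enumerate every non-empty subset: 4 + 6 + 4 + 1 = 15 models.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)
all_models <- do.call(rbind, lapply(subsets, fit_subset))

# Sort by model size, then by decreasing R-squared.
all_models <- all_models[order(all_models$n_pred, -all_models$r_square), ]
rownames(all_models) <- NULL
print(all_models, digits = 5)
```

With p predictors there are 2^p - 1 candidate models, so this brute-force approach is only practical for a small number of predictors.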


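Best Subset Regression

Select the subset of predictors that does best on some well-defined objective criterion, such as the largest R² or the smallest MSE, Mallows' Cp, or AIC.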

    # In newer olsrr releases this function is named ols_step_best_subset().
    model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
    ols_best_subset(model)
    ## Best Subsets Regression
    ## ------------------------------
    ## Model Index    Predictors
    ## ------------------------------
    ##      1         wt
    ##      2         hp wt
    ##      3         hp wt qsec
    ##      4         disp hp wt qsec
    ## ------------------------------
    ##
    ##                                        Subsets Regression Summary
    ## ----------------------------------------------------------------------------------------------------------
    ##                    Adj.      Pred
    ## Model  R-Square  R-Square  R-Square     C(p)       AIC      SBIC       SBC    MSEP     FPE     HSP     APC
    ## ----------------------------------------------------------------------------------------------------------
    ##   1      0.7528    0.7446    0.7087  12.4809  166.0294   74.2916  170.4266  9.8972  9.8572  0.3199  0.2801
    ##   2      0.8268    0.8148    0.7811   2.3690  156.6523   66.5755  162.5153  7.4314  7.3563  0.2402  0.2091
    ##   3      0.8348    0.8171     0.782   3.0617  157.1426   67.7238  164.4713  7.6140  7.4756  0.2461  0.2124
    ##   4      0.8351    0.8107     0.771   5.0000  159.0696   70.0408  167.8640  8.1810  7.9497  0.2644  0.2259
    ## ----------------------------------------------------------------------------------------------------------
    ## AIC: Akaike Information Criteria
    ## SBIC: Sawa's Bayesian Information Criteria
    ## SBC: Schwarz Bayesian Criteria
    ## MSEP: Estimated error of prediction, assuming multivariate normality
    ## FPE: Final Prediction Error
    ## HSP: Hocking's Sp
    ## APC: Amemiya Prediction Criteria

Plot

    model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
    k <- ols_best_subset(model)
    plot(k)
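The idea behind best subset regression can be sketched in base R as follows: for each model size, keep the subset with the largest R², then compare the winners across sizes with a criterion that penalises model size. This is an illustrative sketch, not olsrr's algorithm; `best_of_size` is a hypothetical helper, and AIC (via `stats::AIC`) is just one possible choice for the final comparison.

```r
# Base-R sketch of best subset regression on mtcars (no olsrr required).
predictors <- c("disp", "hp", "wt", "qsec")

# Hypothetical helper: among all subsets of size k, keep the one with the
# largest R-squared and report a few criteria for it.
best_of_size <- function(k) {
  candidates <- combn(predictors, k, simplify = FALSE)
  fits <- lapply(candidates, function(vars) {
    lm(reformulate(vars, response = "mpg"), data = mtcars)
  })
  r2 <- vapply(fits, function(f) summary(f)$r.squared, numeric(1))
  i <- which.max(r2)
  data.frame(
    size = k,
    predictors = paste(candidates[[i]], collapse = " "),
    r_square = r2[i],
    adj_r_square = summary(fits[[i]])$adj.r.squared,
    aic = AIC(fits[[i]]),
    stringsAsFactors = FALSE
  )
}

best_subsets <- do.call(rbind, lapply(seq_along(predictors), best_of_size))
print(best_subsets, digits = 5)

# R-squared never decreases as predictors are added, so it cannot be used to
# compare models of different sizes; here AIC picks the overall winner.
overall_best <- best_subsets$predictors[which.min(best_subsets$aic)]
overall_best
```

Within one model size every candidate has the same number of parameters, so ranking by R² there is equivalent to ranking by residual sum of squares.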



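Stepwise Forward Regression

Build a regression model from a set of candidate predictors by entering predictors one at a time based on their p-values, until no remaining variable qualifies for entry. The model passed in should include all candidate predictors. If details is set to TRUE, each step is displayed.

Variable Selection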

    # Stepwise forward regression; the surgical data set ships with olsrr.
    # In newer olsrr releases this function is named ols_step_forward_p().
    library(olsrr)
    model <- lm(y ~ ., data = surgical)
    ols_step_forward(model)
    ## We are selecting variables based on p value...
    ## 1 variable(s) added....
    ## 1 variable(s) added...
    ## 1 variable(s) added...
    ## 1 variable(s) added...
    ## 1 variable(s) added...
    ## No more variables satisfy the condition of penter: 0.3
    ## Forward Selection Method
    ##
    ## Candidate Terms:
    ##
    ## 1 . bcs
    ## 2 . pindex
    ## 3 . enzyme_test
    ## 4 . liver_test
    ## 5 . age
    ## 6 . gender
    ## 7 . alc_mod
    ## 8 . alc_heavy
    ##
    ## ------------------------------------------------------------------------------
    ##                               Selection Summary
    ## ------------------------------------------------------------------------------
    ##       Variable                     Adj.
    ## Step  Entered        R-Square    R-Square       C(p)         AIC        RMSE
    ## ------------------------------------------------------------------------------
    ##    1  liver_test       0.4545      0.4440    62.5119    771.8753    296.2992
    ##    2  alc_heavy        0.5667      0.5498    41.3681    761.4394    266.6484
    ##    3  enzyme_test      0.6590      0.6385    24.3379    750.5089    238.9145
    ##    4  pindex           0.7501      0.7297     7.5373    735.7146    206.5835
    ##    5  bcs              0.7809      0.7581     3.1925    730.6204    195.4544
    ## ------------------------------------------------------------------------------

    # Store the result and plot it.
    model <- lm(y ~ ., data = surgical)
    k <- ols_step_forward(model)
    ## We are selecting variables based on p value...
    ## 1 variable(s) added....
    ## 1 variable(s) added...
    ## 1 variable(s) added...
    ## 1 variable(s) added...
    ## 1 variable(s) added...
    ## No more variables satisfy the condition of penter: 0.3
    plot(k)
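The forward procedure described above can be sketched in base R as follows. This is a simplified illustration, not olsrr's code: `forward_select` is a hypothetical helper, the entry test is a partial F-test computed with `anova()`, and the threshold `penter = 0.3` is taken from the console output above. The sketch uses mtcars so that it runs without olsrr.

```r
# Base-R sketch of forward selection by p-value (hypothetical helper).
forward_select <- function(data, response, penter = 0.3) {
  candidates <- setdiff(names(data), response)
  selected <- character(0)
  # Start from the intercept-only model.
  fit <- lm(reformulate("1", response = response), data = data)

  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break

    # p-value of the partial F-test for adding each remaining predictor
    # to the current model.
    pvals <- vapply(remaining, function(v) {
      bigger <- lm(reformulate(c(selected, v), response = response),
                   data = data)
      anova(fit, bigger)[["Pr(>F)"]][2]
    }, numeric(1))

    # Stop when even the best candidate fails the entry threshold.
    if (min(pvals) > penter) break

    selected <- c(selected, remaining[which.min(pvals)])
    fit <- lm(reformulate(selected, response = response), data = data)
  }

  list(selected = selected, model = fit)
}

result <- forward_select(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")], "mpg")
result$selected
summary(result$model)$r.squared
```

Forward selection is greedy: a variable that has entered is never removed, so the final model is not guaranteed to match the best subset of the same size.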

 
