
ctree() in R

 脑系科数据科学 2018-07-09

Reposted from the CSDN blog of Daeyeon7

Overview

Besides the traditional decision tree algorithm (rpart), the conditional inference tree (ctree) is another commonly used tree-based classification algorithm. The difference between the two lies in how the splitting variable is chosen: a conditional inference tree selects it based on significance tests, rather than by maximizing an information measure (rpart uses the Gini index).

Procedure
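The examples below assume that trainset and testset have already been prepared. A minimal sketch of one common setup, assuming the churn data shipped with the C50 package (the 70/30 split and the seed are assumptions for illustration, not taken from the original post):

library(C50)
data(churn)   # loads churnTrain and churnTest
# drop the columns that do not appear in the model's input list below
churnTrain = churnTrain[, !names(churnTrain) %in% c("state", "area_code", "account_length")]
set.seed(2)   # arbitrary seed, for reproducibility only
ind = sample(2, nrow(churnTrain), replace = TRUE, prob = c(0.7, 0.3))
trainset = churnTrain[ind == 1, ]
testset = churnTrain[ind == 2, ]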

Build the classifier by calling the ctree() function from the party package:

library(zoo)     # used by party's dependency chain; loaded explicitly here
library(party)
ctree.model = ctree(churn ~ ., data = trainset)  # fit the conditional inference tree
ctree.model                                      # print the fitted tree
 Conditional inference tree with 18 terminal nodes

Response:  churn 
Inputs:  international_plan, voice_mail_plan, number_vmail_messages, total_day_minutes, total_day_calls, total_day_charge, total_eve_minutes, total_eve_calls, total_eve_charge, total_night_minutes, total_night_calls, total_night_charge, total_intl_minutes, total_intl_calls, total_intl_charge, number_customer_service_calls 
Number of observations:  2315 

1) international_plan == {no}; criterion = 1, statistic = 173.582
  2) number_customer_service_calls <= 3; criterion = 1, statistic = 133.882
    3) total_day_minutes <= 259.3; criterion = 1, statistic = 232.371
      4) total_eve_minutes <= 258.7; criterion = 1, statistic = 39.065
        5)*  weights = 1544 
      4) total_eve_minutes > 258.7
        6) total_day_minutes <= 222.9; criterion = 1, statistic = 47.453
          7)*  weights = 209 
        6) total_day_minutes > 222.9
          8) voice_mail_plan == {yes}; criterion = 1, statistic = 20
            9)*  weights = 8 
          8) voice_mail_plan == {no}
            10)*  weights = 28 
    3) total_day_minutes > 259.3
      11) voice_mail_plan == {no}; criterion = 1, statistic = 46.262
        12) total_eve_charge <= 14.09; criterion = 1, statistic = 37.877
          13)*  weights = 21 
        12) total_eve_charge > 14.09
          14) total_night_minutes <= 178.3; criterion = 1, statistic = 19.789
            15)*  weights = 23 
          14) total_night_minutes > 178.3
            16)*  weights = 60 
      11) voice_mail_plan == {yes}
        17)*  weights = 34 
  2) number_customer_service_calls > 3
    18) total_day_minutes <= 159.4; criterion = 1, statistic = 34.903
      19) total_eve_minutes <= 233.2; criterion = 0.991, statistic = 11.885
        20) voice_mail_plan == {no}; criterion = 0.99, statistic = 11.683
          21)*  weights = 40 
        20) voice_mail_plan == {yes}
          22)*  weights = 7 
      19) total_eve_minutes > 233.2
        23)*  weights = 16 
    18) total_day_minutes > 159.4
      24)*  weights = 96 
1) international_plan == {yes}
  25) total_intl_charge <= 3.51; criterion = 1, statistic = 35.28
    26) total_intl_calls <= 2; criterion = 1, statistic = 28.013
      27)*  weights = 40 
    26) total_intl_calls > 2
      28) number_customer_service_calls <= 3; criterion = 0.957, statistic = 8.954
        29) total_day_minutes <= 271.5; criterion = 1, statistic = 25.328
          30) total_eve_charge <= 25.82; criterion = 0.987, statistic = 11.167
            31)*  weights = 116 
          30) total_eve_charge > 25.82
            32)*  weights = 7 
        29) total_day_minutes > 271.5
          33)*  weights = 11 
      28) number_customer_service_calls > 3
        34)*  weights = 14 
  25) total_intl_charge > 3.51
    35)*  weights = 41 
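In this printout, criterion is reported as 1 minus the p-value of the association test at that node (so criterion = 1 means the p-value is essentially 0), statistic is the value of the test statistic, and a node marked with * is a terminal node whose weights give the number of training observations it holds. To see which terminal node individual observations fall into, the type = "node" option of party's predict method can be used, for example:

# terminal node assignment for the first few training observations
head(predict(ctree.model, newdata = trainset, type = "node"))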

Visualizing the conditional inference tree

plot(ctree.model)

(Figure: the conditional inference tree drawn by plot(ctree.model))
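With 18 terminal nodes the default plot is fairly crowded. plot() for party trees also accepts type = "simple", which replaces the bar plots in the terminal nodes with the predicted class and its probability:

plot(ctree.model, type = "simple")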
By reducing the number of predictors and re-plotting, we obtain a simplified conditional inference tree:

daycharge.model = ctree(churn ~ total_day_charge, data = trainset)
plot(daycharge.model)

(Figure: the inference tree obtained with total_day_charge as the only splitting variable)
The resulting plot shows, for every internal node, the splitting variable and its p-value; the split conditions appear on the left and right branches, and each terminal node reports the number of samples n along with the estimated probability of each class. The figure shows that when total_day_charge exceeds 48.18, the light gray region of node 9 is larger than the dark gray region, which means customers whose daily charge exceeds 48.18 are very likely to churn (class label yes).
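This reading of the plot can be checked numerically by tabulating churn among the training customers above the threshold (the cutoff 48.18 is taken from the figure):

# churn proportions for customers with total_day_charge above 48.18
prop.table(table(trainset$churn[trainset$total_day_charge > 48.18]))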
Evaluating the predictive power of the conditional inference tree

ctree.predict = predict(ctree.model, testset)  # predicted class labels for the test set
table(ctree.predict, testset$churn)            # classification table: predictions vs. truth

ctree.predict yes  no
          yes  99  15
          no   42 862
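The overall accuracy can already be read off this table as the proportion of cases on the diagonal:

tab = table(ctree.predict, testset$churn)
sum(diag(tab)) / sum(tab)   # (99 + 862) / 1018, about 0.944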

The same evaluation can be done with the confusionMatrix() function from the caret package:

library(lattice)   # lattice and ggplot2 are loaded as caret dependencies
library(ggplot2)
library(caret)
confusionMatrix(table(ctree.predict, testset$churn))
Confusion Matrix and Statistics


ctree.predict yes  no
          yes  99  15
          no   42 862

               Accuracy : 0.944           
                 95% CI : (0.9281, 0.9573)
    No Information Rate : 0.8615          
    P-Value [Acc > NIR] : < 2.2e-16       

                  Kappa : 0.7449          
 Mcnemar's Test P-Value : 0.0005736       

            Sensitivity : 0.70213         
            Specificity : 0.98290         
         Pos Pred Value : 0.86842         
         Neg Pred Value : 0.95354         
             Prevalence : 0.13851         
         Detection Rate : 0.09725         
   Detection Prevalence : 0.11198         
      Balanced Accuracy : 0.84251         

       'Positive' Class : yes   
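With yes as the positive class, the headline numbers follow directly from the confusion table and can be reproduced by hand:

sens = 99 / (99 + 42)    # sensitivity 0.70213: churners correctly flagged
spec = 862 / (862 + 15)  # specificity 0.98290: non-churners correctly kept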

Call the treeresponse() function to output the estimated class probabilities:

tr = treeresponse(ctree.model, newdata = testset[1:5,])
tr
[[1]]
[1] 0.03497409 0.96502591

[[2]]
[1] 0.02586207 0.97413793

[[3]]
[1] 0.02586207 0.97413793

[[4]]
[1] 0.02586207 0.97413793

[[5]]
[1] 0.03497409 0.96502591
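Each list element is the estimated class distribution at the terminal node the observation falls into. The two values follow the order of levels(trainset$churn), so here the first entry would be P(yes) and the second P(no), assuming the levels are ordered yes, no; check levels() on your own data. A class label can then be recovered by taking the more probable class:

prob = do.call(rbind, tr)                # stack the list into a 5 x 2 matrix
colnames(prob) = levels(trainset$churn)  # columns follow the factor levels
colnames(prob)[max.col(prob)]            # most probable class for each row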

This section first used the predict function to obtain labels (class assignments) for the test dataset, then called the table function to generate the classification table, and finally evaluated prediction performance with the confusionMatrix function built into the caret package. Besides predict, the treeresponse function can also be used to estimate class probabilities; a sample is then typically labeled with the class that has the higher probability.
The example uses the first five records of the test set testset to obtain estimated class probabilities; calling treeresponse returns the concrete probability values for these five samples, from which each sample's class label can be determined.
