
Weka Development [47]: Stacking Source Code Analysis

 lzqkean 2013-07-22

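I copied the following explanation from the web. It is not from an authoritative paper; I copied it only because it is clearly organized and simple.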

Stacked generalization (or stacking) (Wolpert 1992) is a different way of combining multiple models, that introduces the concept of a meta learner. Although an attractive idea, it is less widely used than bagging and boosting. Unlike bagging and boosting, stacking may be (and normally is) used to combine models of different types. The procedure is as follows:

1. Split the training set into two disjoint sets.

2. Train several base learners on the first part.

3. Test the base learners on the second part.

4. Using the predictions from 3) as the inputs, and the correct responses as the outputs, train a higher level learner.

Note that steps 1) to 3) are the same as cross-validation, but instead of using a winner-takes-all approach, the base learners are combined, possibly non-linearly.
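The quoted procedure can be sketched in plain Java without Weka. This is only an illustration under stated assumptions: the `Learner` interface, the two toy learners, and `buildMetaData` are hypothetical names invented here, and the split is a simple cut at index `split`. The sketch covers steps 1 to 3 and produces the table (base predictions plus the correct response) that any higher-level learner could be trained on in step 4.

```java
import java.util.Arrays;

final class HoldoutStackingSketch {

    /** Hypothetical minimal learner for 1-D regression (not a Weka type). */
    interface Learner {
        void fit(double[] x, double[] y);
        double predict(double x);
    }

    /** Always predicts the mean of the training responses. */
    static class MeanLearner implements Learner {
        private double mean;
        public void fit(double[] x, double[] y) {
            double sum = 0;
            for (double v : y) sum += v;
            mean = sum / y.length;
        }
        public double predict(double x) { return mean; }
    }

    /** Fits y = slope * x by least squares through the origin. */
    static class SlopeLearner implements Learner {
        private double slope;
        public void fit(double[] x, double[] y) {
            double sxy = 0, sxx = 0;
            for (int i = 0; i < x.length; i++) {
                sxy += x[i] * y[i];
                sxx += x[i] * x[i];
            }
            slope = sxx == 0 ? 0 : sxy / sxx;
        }
        public double predict(double x) { return slope * x; }
    }

    /**
     * Steps 1-3: train the base learners on items [0, split) and predict
     * items [split, n). Each returned row holds one prediction per base
     * learner followed by the correct response.
     */
    static double[][] buildMetaData(Learner[] base, double[] x, double[] y, int split) {
        double[] trainX = Arrays.copyOfRange(x, 0, split);
        double[] trainY = Arrays.copyOfRange(y, 0, split);
        for (Learner learner : base) {
            learner.fit(trainX, trainY);
        }
        double[][] meta = new double[x.length - split][base.length + 1];
        for (int i = split; i < x.length; i++) {
            for (int k = 0; k < base.length; k++) {
                meta[i - split][k] = base[k].predict(x[i]);
            }
            meta[i - split][base.length] = y[i];
        }
        return meta;
    }

    public static void main(String[] args) {
        Learner[] base = { new MeanLearner(), new SlopeLearner() };
        double[][] meta = buildMetaData(base,
                new double[] {1, 2, 3, 4}, new double[] {2, 4, 6, 8}, 2);
        for (double[] row : meta) {
            System.out.println(Arrays.toString(row));
        }
        // prints [3.0, 6.0, 6.0] and then [3.0, 8.0, 8.0]
    }
}
```

Step 4 would then fit the meta learner on the rows of `meta`, using the last column as the target.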

I will skip the first part of buildClassifier, since we have seen it many times before. The important lines are these:

// Create meta level
generateMetaLevel(newData, random);

// Rebuilt all the base classifiers on the full training data
for (int i = 0; i < m_Classifiers.length; i++) {
    getClassifier(i).buildClassifier(newData);
}

The for loop at the bottom retrains all base classifiers on the full dataset, so the most important part is the generateMetaLevel function called above it.

protected void generateMetaLevel(Instances newData, Random random)
    throws Exception {

    Instances metaData = metaFormat(newData);
    m_MetaFormat = new Instances(metaData, 0);
    for (int j = 0; j < m_NumFolds; j++) {
        Instances train = newData.trainCV(m_NumFolds, j, random);

        // Build base classifiers
        for (int i = 0; i < m_Classifiers.length; i++) {
            getClassifier(i).buildClassifier(train);
        }

        // Classify test instances and add to meta data
        Instances test = newData.testCV(m_NumFolds, j);
        for (int i = 0; i < test.numInstances(); i++) {
            metaData.add(metaInstance(test.instance(i)));
        }
    }

    m_MetaClassifier.buildClassifier(metaData);
}

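In outline, the code first derives the format of metaData from newData and stores an empty copy of it as m_MetaFormat. It then splits the original dataset newData by cross-validation (m_NumFolds folds, ten in my run) into train and test sets. train is handed to m_Classifiers for training; each test instance is then classified, and the base classifiers' predictions, together with the instance's true class value, are added to metaData as one row. Finally, m_MetaClassifier is trained on metaData.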
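The key property of the cross-validation loop in generateMetaLevel is that every instance contributes exactly one meta row, and that row comes from base models that never saw the instance. A minimal plain-Java sketch of this idea follows. It assumes contiguous folds and uses a trivial mean predictor as a stand-in for the base classifiers; `foldBounds` and `outOfFoldPredictions` are hypothetical names, and Weka's own trainCV/testCV may differ in details such as randomization.

```java
import java.util.Arrays;

final class OutOfFoldSketch {

    /** Returns {first, endExclusive} of one fold when n items are cut into contiguous folds. */
    static int[] foldBounds(int n, int numFolds, int fold) {
        int base = n / numFolds;
        int rem = n % numFolds; // the first `rem` folds get one extra item
        int first = fold * base + Math.min(fold, rem);
        int size = base + (fold < rem ? 1 : 0);
        return new int[] { first, first + size };
    }

    /** For each item, the prediction of a mean predictor trained on all other folds. */
    static double[] outOfFoldPredictions(double[] y, int numFolds) {
        double[] predictions = new double[y.length];
        for (int fold = 0; fold < numFolds; fold++) {
            int[] bounds = foldBounds(y.length, numFolds, fold);

            // "Train" on everything outside the held-out fold
            double sum = 0;
            int count = 0;
            for (int i = 0; i < y.length; i++) {
                if (i < bounds[0] || i >= bounds[1]) {
                    sum += y[i];
                    count++;
                }
            }
            double mean = count == 0 ? 0.0 : sum / count;

            // "Test" on the held-out fold only
            for (int i = bounds[0]; i < bounds[1]; i++) {
                predictions[i] = mean;
            }
        }
        return predictions;
    }

    public static void main(String[] args) {
        double[] y = {1, 2, 3, 4, 5, 6};
        System.out.println(Arrays.toString(outOfFoldPredictions(y, 3)));
        // prints [4.5, 4.5, 3.5, 3.5, 2.5, 2.5]
    }
}
```

Because the meta rows are built from held-out predictions, the meta classifier sees how the base models behave on unseen data rather than on data they have memorized.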

protected Instances metaFormat(Instances instances) throws Exception {
    FastVector attributes = new FastVector();
    Instances metaFormat;

    for (int k = 0; k < m_Classifiers.length; k++) {
        Classifier classifier = (Classifier) getClassifier(k);
        String name = classifier.getClass().getName();
        if (m_BaseFormat.classAttribute().isNumeric()) {
            attributes.addElement(new Attribute(name));
        } else {
            for (int j = 0; j < m_BaseFormat.classAttribute().numValues(); j++) {
                attributes.addElement(new Attribute(name + ":" +
                    m_BaseFormat.classAttribute().value(j)));
            }
        }
    }
    attributes.addElement(m_BaseFormat.classAttribute().copy());
    metaFormat = new Instances("Meta format", attributes, 0);
    metaFormat.setClassIndex(metaFormat.numAttributes() - 1);
    return metaFormat;
}

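instances is the level-0 training set. The meta format needs attributes to hold the classification results produced later. If the class attribute of m_BaseFormat is numeric, one attribute is added per base classifier (m_Classifiers.length in total); if it is nominal, each base classifier contributes as many attributes as the level-0 class attribute has values. The class attribute itself is added last and becomes the class attribute of metaFormat.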
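The attribute layout described above can be reproduced with plain strings. The sketch below is illustrative only: `metaAttributeNames` is a hypothetical helper, and passing `null` or an empty array for `classValues` is a convention assumed here to signal a numeric class.

```java
import java.util.ArrayList;
import java.util.List;

final class MetaFormatSketch {

    /**
     * Lists the meta-level attribute names in order.
     * classValues == null or empty means the class is numeric (assumed convention).
     */
    static List<String> metaAttributeNames(String[] classifierNames,
                                           String[] classValues,
                                           String classAttributeName) {
        List<String> names = new ArrayList<>();
        boolean numericClass = classValues == null || classValues.length == 0;
        for (String classifier : classifierNames) {
            if (numericClass) {
                // one prediction column per base classifier
                names.add(classifier);
            } else {
                // one probability column per class value
                for (String value : classValues) {
                    names.add(classifier + ":" + value);
                }
            }
        }
        names.add(classAttributeName); // the class attribute always comes last
        return names;
    }

    public static void main(String[] args) {
        List<String> names = metaAttributeNames(
                new String[] {"weka.classifiers.rules.ZeroR"},
                new String[] {"Iris-setosa", "Iris-versicolor", "Iris-virginica"},
                "class");
        for (String name : names) {
            System.out.println(name);
        }
    }
}
```

With K base classifiers and C class values this gives K * C + 1 attributes for a nominal class and K + 1 for a numeric one.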

Below is the result I got when testing with iris.arff and a single base classifier:

@relation 'Meta format'

@attribute weka.classifiers.rules.ZeroR:Iris-setosa numeric
@attribute weka.classifiers.rules.ZeroR:Iris-versicolor numeric
@attribute weka.classifiers.rules.ZeroR:Iris-virginica numeric
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}

@data

protected Instance metaInstance(Instance instance) throws Exception {

    double[] values = new double[m_MetaFormat.numAttributes()];
    Instance metaInstance;
    int i = 0;
    for (int k = 0; k < m_Classifiers.length; k++) {
        Classifier classifier = getClassifier(k);
        if (m_BaseFormat.classAttribute().isNumeric()) {
            values[i++] = classifier.classifyInstance(instance);
        } else {
            double[] dist = classifier.distributionForInstance(instance);
            for (int j = 0; j < dist.length; j++) {
                values[i++] = dist[j];
            }
        }
    }
    values[i] = instance.classValue();
    metaInstance = new Instance(1, values);
    metaInstance.setDataset(m_MetaFormat);
    return metaInstance;
}

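values holds the classification results. If the class attribute is numeric, each base classifier's prediction is stored directly. If it is nominal, the class distribution is computed first and the probability of each class value is appended to values. The instance's true class value goes in the last slot. The new instance is then attached to the m_MetaFormat dataset and returned.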
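The way metaInstance packs its values array can be shown with plain arrays. This is a sketch under assumptions: `metaValues` is a hypothetical helper, and each inner array stands for one base classifier's output, either a single prediction (numeric class) or a class distribution (nominal class). The numbers in main are made up for illustration.

```java
import java.util.Arrays;

final class MetaInstanceSketch {

    /** Concatenates every base classifier's output and appends the true class value. */
    static double[] metaValues(double[][] perClassifierOutput, double classValue) {
        int total = 1; // one slot for the class value
        for (double[] output : perClassifierOutput) {
            total += output.length;
        }
        double[] values = new double[total];
        int i = 0;
        for (double[] output : perClassifierOutput) {
            for (double v : output) {
                values[i++] = v;
            }
        }
        values[i] = classValue; // the class value always occupies the last slot
        return values;
    }

    public static void main(String[] args) {
        // Two hypothetical base classifiers on a three-class problem
        double[][] distributions = {
            {0.7, 0.2, 0.1},
            {0.1, 0.8, 0.1}
        };
        System.out.println(Arrays.toString(metaValues(distributions, 1.0)));
        // prints [0.7, 0.2, 0.1, 0.1, 0.8, 0.1, 1.0]
    }
}
```

The column order matches the attribute order of the meta format: all columns of the first classifier, then the second, and so on, with the class last.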

 

 

 
