
Weka Development [34]: LWL Source Code Analysis

 lzqkean 2013-07-22

After reading 笨笨's blog, I was reminded that Andrew Ng once said that his advisor Michael Jordan (who shares a name with the basketball star) liked the LWL algorithm best.

         Locally weighted learning. Uses an instance-based algorithm to assign instance weights which are then used by a specified WeightedInstancesHandler. Can do classification (e.g. using naive Bayes) or regression (e.g. using linear regression)

In other words, Locally weighted learning (LWL) uses an instance-based algorithm to assign weights to the training instances, which are then handed to a specified WeightedInstancesHandler; it can be used for either classification or regression.

I also took a rough look at how it is implemented.

LWL extends the SingleClassifierEnhancer class. In its constructor:

public LWL() {
    m_Classifier = new weka.classifiers.trees.DecisionStump();
}

The default base classifier is DecisionStump, which was already introduced in the AdaBoost article.
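As a quick orientation before digging into the source, here is a minimal usage sketch, not taken from the article: the class name LWLUsageSketch, the file iris.arff and the parameter values are placeholders of my own, but setClassifier and setKNN are the usual LWL options, and any WeightedInstancesHandler (NaiveBayes here) can replace the default DecisionStump.

import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.LWL;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LWLUsageSketch {
    public static void main(String[] args) throws Exception {
        // placeholder dataset; any ARFF file with a class attribute works
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        LWL lwl = new LWL();
        lwl.setClassifier(new NaiveBayes()); // swap in another WeightedInstancesHandler
        lwl.setKNN(10);                      // weight only the 10 nearest neighbours
        lwl.buildClassifier(data);

        // class distribution predicted for the first instance
        double[] dist = lwl.distributionForInstance(data.instance(0));
        System.out.println(java.util.Arrays.toString(dist));
    }
}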

Next comes the buildClassifier function:

public void buildClassifier(Instances instances) throws Exception {

    if (!(m_Classifier instanceof WeightedInstancesHandler)) {
       throw new IllegalArgumentException("Classifier must be a "
              + "WeightedInstancesHandler!");
    }

    // can classifier handle the data?
    getCapabilities().testWithFail(instances);

    // remove instances with missing class
    instances = new Instances(instances);
    instances.deleteWithMissingClass();

    // only class? -> build ZeroR model
    if (instances.numAttributes() == 1) {
       m_ZeroR = new weka.classifiers.rules.ZeroR();
       m_ZeroR.buildClassifier(instances);
       return;
    } else {
       m_ZeroR = null;
    }

    m_Train = new Instances(instances, 0, instances.numInstances());

    m_NNSearch.setInstances(m_Train);
}

If the base classifier cannot make use of instance weights, an exception is thrown; the code further down will make clear why that is required. The remaining lines simply keep a copy of all the instances as the training set and hand it to m_NNSearch. For m_NNSearch, see my (Koala's) earlier post "Weka Development [18]: Finding the K Nearest Neighbours". A quick way to check whether a given base classifier would pass the test is sketched below.
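This is a minimal sketch of my own, not part of the LWL source: it runs the same instanceof test that buildClassifier uses. Both DecisionStump (the default) and NaiveBayes implement WeightedInstancesHandler, so both print true.

import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.DecisionStump;
import weka.core.WeightedInstancesHandler;

public class WeightedHandlerCheck {
    public static void main(String[] args) {
        Classifier[] candidates = { new DecisionStump(), new NaiveBayes() };
        for (Classifier c : candidates) {
            // the same gate LWL.buildClassifier uses to reject unsuitable base learners
            System.out.println(c.getClass().getSimpleName() + ": "
                    + (c instanceof WeightedInstancesHandler));
        }
    }
}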

public void updateClassifier(Instance instance) throws Exception {

    if (m_Train == null) {
       throw new Exception("No training instance structure set!");
    } else if (m_Train.equalHeaders(instance.dataset()) == false) {
       throw new Exception("Incompatible instance types");
    }
    if (!instance.classIsMissing()) {
       m_NNSearch.update(instance);
       m_Train.add(instance);
    }
}

LWL implements UpdateableClassifier, which means it can learn incrementally, although that is almost trivially true here: since LWL only builds a model at prediction time, updating just means appending the instance to m_Train and updating m_NNSearch. m_NNSearch.update drags in quite a lot of code, so I will leave it for later, when I look at the distance classes. A short incremental-use sketch follows.
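A minimal sketch of incremental use, again with iris.arff as a placeholder path and names of my own choosing: build on the first half of the data, then stream the rest in one instance at a time through updateClassifier.

import weka.classifiers.lazy.LWL;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LWLIncrementalSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        int half = data.numInstances() / 2;
        LWL lwl = new LWL();
        lwl.buildClassifier(new Instances(data, 0, half)); // initial batch
        for (int i = half; i < data.numInstances(); i++) {
            lwl.updateClassifier(data.instance(i));        // incremental updates
        }
        System.out.println(java.util.Arrays.toString(
                lwl.distributionForInstance(data.instance(0))));
    }
}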

Next is distributionForInstance; let's walk through it piece by piece:

m_NNSearch.addInstanceInfo(instance);

int k = m_Train.numInstances();
if ((!m_UseAllK && (m_kNN < k))
       && !(m_WeightKernel == INVERSE || m_WeightKernel == GAUSS)) {
    k = m_kNN;
}

If the user has not asked to use all the training data, the requested number of neighbours m_kNN is smaller than the number of training instances, and the kernel is neither INVERSE nor GAUSS, then k is cut down to the user-specified m_kNN. For example, with 150 training instances, m_kNN = 10 and a LINEAR kernel, k becomes 10; with a GAUSS kernel all 150 are kept, since that kernel decays smoothly with distance and needs no hard cut-off.

Instances neighbours = m_NNSearch.kNearestNeighbours(instance, k);
double distances[] = m_NNSearch.getDistances();

// IF LinearNN has skipped so much that <k neighbours are remaining.
if (k > distances.length)
    k = distances.length;

kNearestNeighbours returns the k neighbours of instance, and getDistances, as the Javadoc puts it, "Returns the distances of the k nearest neighbours. The kNearestNeighbours or nearestNeighbour needs to be called first for this to work." Put bluntly, this is not a well-designed API, because callers of the class have to know the required call order. The if that follows handles the case where LinearNN has skipped so many instances that fewer than k neighbours remain, which is exactly the situation the comment above refers to. A small stand-alone sketch of this search API is shown below.
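A small stand-alone sketch (again assuming a placeholder iris.arff file) of the neighbour-search API that LWL relies on; note that getDistances only returns meaningful values after kNearestNeighbours has been called.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.neighboursearch.LinearNNSearch;

public class NNSearchSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        LinearNNSearch search = new LinearNNSearch(data);
        // find the 5 nearest neighbours of the first instance ...
        Instances neighbours = search.kNearestNeighbours(data.instance(0), 5);
        // ... and only then ask for their distances
        double[] distances = search.getDistances();
        System.out.println(neighbours.numInstances() + " neighbours, first distance "
                + distances[0]);
    }
}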

// Determine the bandwidth
double bandwidth = distances[k - 1];

// Check for bandwidth zero
if (bandwidth <= 0) {
    // if the kth distance is zero than give all instances the same
    // weight
    for (int i = 0; i < distances.length; i++)
       distances[i] = 1;
} else {
    // Rescale the distances by the bandwidth
    for (int i = 0; i < distances.length; i++)
       distances[i] = distances[i] / bandwidth;
}

This determines the bandwidth, which is simply the k-th (largest) of the selected distances. If bandwidth = 0, the k nearest neighbours all sit exactly on top of the instance being classified, so every distance is set to 1. Otherwise the distances are divided by the bandwidth, which maps them into the (0, 1] interval, as in the sketch below.
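A tiny numerical sketch of the rescaling step, using made-up distances rather than anything computed by Weka:

public class BandwidthRescaleSketch {
    public static void main(String[] args) {
        double[] distances = {0.5, 1.0, 2.0}; // made-up distances, k = 3
        double bandwidth = distances[distances.length - 1]; // k-th distance = 2.0
        for (int i = 0; i < distances.length; i++) {
            distances[i] /= bandwidth; // -> 0.25, 0.5, 1.0, all in (0, 1]
        }
        System.out.println(java.util.Arrays.toString(distances));
    }
}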

// Pass the distances through a weighting kernel
for (int i = 0; i < distances.length; i++) {
    switch (m_WeightKernel) {
    case LINEAR:
       distances[i] = 1.0001 - distances[i];
       break;
    case EPANECHNIKOV:
       distances[i] = 3 / 4D * (1.0001 - distances[i] * distances[i]);
       break;
    case TRICUBE:
       distances[i] = Math
              .pow((1.0001 - Math.pow(distances[i], 3)), 3);
       break;
    case CONSTANT:
       // System.err.println("using constant kernel");
       distances[i] = 1;
       break;
    case INVERSE:
       distances[i] = 1.0 / (1.0 + distances[i]);
       break;
    case GAUSS:
       distances[i] = Math.exp(-distances[i] * distances[i]);
       break;
    }
}

The distances are then passed through a weighting kernel, which is a rather nice idea; see section 2.5 of Pattern Recognition and Machine Learning. Roughly speaking, k-nearest-neighbours fixes the number of neighbours and lets that number determine the size of the region around the target instance, whereas a kernel fixes a region (and a weighting profile) around the target instance and lets that determine how much each neighbour contributes.

GAUSS is the most commonly used kernel; see the post 《关于LWL (Local weighting Learning )的一些问题》 on 笨笨's blog for more. The sketch below shows what the LINEAR and GAUSS kernels do to a few already-rescaled distances.
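This is a minimal sketch of my own, not taken from the Weka source: it applies the LINEAR and GAUSS formulas from the switch above to three made-up, already-rescaled distances, so you can see that nearby points end up with weights close to 1 and far points close to 0.

public class KernelWeightSketch {
    public static void main(String[] args) {
        double[] rescaled = {0.25, 0.5, 1.0}; // made-up distances already divided by the bandwidth
        for (double d : rescaled) {
            double linear = 1.0001 - d;       // LINEAR kernel
            double gauss = Math.exp(-d * d);  // GAUSS kernel
            System.out.printf("d=%.2f  linear=%.4f  gauss=%.4f%n", d, linear, gauss);
        }
    }
}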

// Set the weights on the training data
double sumOfWeights = 0, newSumOfWeights = 0;
for (int i = 0; i < distances.length; i++) {
    double weight = distances[i];
    Instance inst = (Instance) neighbours.instance(i);
    sumOfWeights += inst.weight();
    newSumOfWeights += inst.weight() * weight;
    inst.setWeight(inst.weight() * weight);
    // weightedTrain.add(newInst);
}

// Rescale weights
for (int i = 0; i < neighbours.numInstances(); i++) {
    Instance inst = neighbours.instance(i);
    inst.setWeight(inst.weight() * sumOfWeights / newSumOfWeights);
}

 

// Create a weighted classifier
m_Classifier.buildClassifier(neighbours);

// Return the classifier's predictions
return m_Classifier.distributionForInstance(instance);

The neighbours of the target instance are given their new kernel weights and then rescaled so that their total weight stays the same as before. This is exactly what distinguishes LWL from other classifiers: for each query it trains the base classifier only on the neighbours, weighted by their distances, and then simply delegates to the base classifier's distributionForInstance. A small numeric check of the rescaling step is sketched below.
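A tiny stand-alone sketch with made-up weights, showing that the factor sumOfWeights / newSumOfWeights restores the original total weight after the kernel weights have been applied:

public class WeightRescaleSketch {
    public static void main(String[] args) {
        double[] weights = {1.0, 1.0, 1.0};   // made-up original instance weights
        double[] kernel = {0.9, 0.5, 0.1};    // made-up kernel weights from the previous step
        double sumOfWeights = 0, newSumOfWeights = 0;
        for (int i = 0; i < weights.length; i++) {
            sumOfWeights += weights[i];
            newSumOfWeights += weights[i] * kernel[i];
            weights[i] *= kernel[i];
        }
        for (int i = 0; i < weights.length; i++) {
            weights[i] *= sumOfWeights / newSumOfWeights; // -> 1.8, 1.0, 0.2, summing to 3 again
        }
        System.out.println(java.util.Arrays.toString(weights));
    }
}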

 

 

 

 
