For background I recommend Wikipedia's "Simple linear regression" article, or Chapter 3 of Pattern Recognition and Machine Learning, although if you only want to understand this one algorithm, that much reading is not really necessary. Let's first look at classifyInstance to see what the whole procedure is ultimately computing:

public double classifyInstance(Instance inst) throws Exception {
    if (m_attribute == null) {
        return m_intercept;
    } else {
        if (inst.isMissing(m_attribute.index())) {
            throw new Exception("SimpleLinearRegression: No missing values!");
        }
        return m_intercept + m_slope * inst.value(m_attribute.index());
    }
}

If m_attribute is null, the method simply returns m_intercept, the intercept of the line; with no attribute chosen, you can think of it as plugging x = 0 into the line so that only the intercept remains. Otherwise the prediction is m_intercept + m_slope * value, a plain linear function whose slope is m_slope.
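Stripped of the Weka plumbing, prediction here is just evaluating a line. A minimal standalone sketch (the class and method names are mine, not Weka's):

```java
// Illustrative sketch, not Weka code: the same linear rule as classifyInstance.
public class LinePredictor {
    private final double intercept; // plays the role of m_intercept
    private final double slope;     // plays the role of m_slope

    public LinePredictor(double intercept, double slope) {
        this.intercept = intercept;
        this.slope = slope;
    }

    // Mirrors the non-missing branch: y = intercept + slope * x
    public double predict(double x) {
        return intercept + slope * x;
    }

    public static void main(String[] args) {
        LinePredictor p = new LinePredictor(1.0, 2.0);
        System.out.println(p.predict(3.0)); // 1.0 + 2.0 * 3.0 = 7.0
    }
}
```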
Next comes buildClassifier, whose code is also very simple. First, a bit of explanation copied from the Wikipedia article: Suppose there are n data points {(xi, yi), i = 1, ..., n}; the goal is to find the equation of the straight line y = α + βx which would provide a "best" fit for the data points.
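This least-squares fit has a well-known closed form. As a standalone sketch (hypothetical names, not Weka's code), it can be computed directly:

```java
// Illustrative sketch, not Weka code: closed-form simple linear regression.
public class OlsFit {
    // Returns {intercept, slope} for the least-squares line through (x, y).
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double xMean = 0, yMean = 0;
        for (int i = 0; i < n; i++) { xMean += x[i]; yMean += y[i]; }
        xMean /= n;
        yMean /= n;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - xMean) * (y[i] - yMean); // S_xy
            sxx += (x[i] - xMean) * (x[i] - xMean); // S_xx
        }
        double slope = sxy / sxx;                   // beta-hat
        double intercept = yMean - slope * xMean;   // alpha-hat
        return new double[] { intercept, slope };
    }

    public static void main(String[] args) {
        // Points lying exactly on y = 2x + 1
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};
        double[] p = fit(x, y);
        System.out.println(p[0] + " " + p[1]); // 1.0 2.0
    }
}
```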
Using simple calculus it can be shown that the values of α and β that minimize the objective function Q are:

    β̂ = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²  = Cov(x, y) / Var(x)
    α̂ = ȳ − β̂·x̄

Substituting the above expressions back into y = α + βx gives the fitted line. That is all the theory we need; the derivation and further properties are in the article. The algorithm Weka implements simply tries every attribute, fits a simple regression to each, and keeps the intercept and slope obtained from the best one:

for (int i = 0; i < insts.numAttributes(); i++) {
    if (i != insts.classIndex()) {
        m_attribute = insts.attribute(i);
        // Compute slope and intercept
        // (yMean, the mean of the class attribute, is computed earlier
        // in buildClassifier and is not shown in this excerpt)
        double xMean = insts.meanOrMode(i);
        double sumWeightedXDiffSquared = 0;
        double sumWeightedYDiffSquared = 0;
        m_slope = 0;
        for (int j = 0; j < insts.numInstances(); j++) {
            Instance inst = insts.instance(j);
            if (!inst.isMissing(i) && !inst.classIsMissing()) {
                double xDiff = inst.value(i) - xMean;
                double yDiff = inst.classValue() - yMean;
                double weightedXDiff = inst.weight() * xDiff;
                double weightedYDiff = inst.weight() * yDiff;
                m_slope += weightedXDiff * yDiff;
                sumWeightedXDiffSquared += weightedXDiff * xDiff;
                sumWeightedYDiffSquared += weightedYDiff * yDiff;
            }
        }

        // Skip attribute if not useful
        if (sumWeightedXDiffSquared == 0) {
            continue;
        }
        double numerator = m_slope;
        m_slope /= sumWeightedXDiffSquared;
        m_intercept = yMean - m_slope * xMean;

        // Compute sum of squared errors
        double msq = sumWeightedYDiffSquared - m_slope * numerator;
        // Check whether this is the best attribute
        if (msq < minMsq) {
            minMsq = msq;
            chosen = i;
            chosenSlope = m_slope;
            chosenIntercept = m_intercept;
        }
    }
}

The only distraction here is the instance weight; just read every weight as 1. The division m_slope /= sumWeightedXDiffSquared is exactly the first form of the β̂ formula above, and the line computing m_intercept is the α̂ formula verbatim.
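One more detail worth checking: msq is obtained as sumWeightedYDiffSquared - m_slope * numerator instead of a second pass over the residuals. This is the standard identity SSE = S_yy − β̂·S_xy. A quick unweighted sanity check (illustrative only, not Weka code):

```java
// Illustrative sketch: the SSE shortcut agrees with summing squared residuals.
public class SseIdentity {
    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2.1, 3.9, 6.2, 7.8, 10.1};
        int n = x.length;
        double xMean = 0, yMean = 0;
        for (int i = 0; i < n; i++) { xMean += x[i]; yMean += y[i]; }
        xMean /= n;
        yMean /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double xd = x[i] - xMean, yd = y[i] - yMean;
            sxy += xd * yd;
            sxx += xd * xd;
            syy += yd * yd;
        }
        double slope = sxy / sxx;
        double intercept = yMean - slope * xMean;
        double shortcut = syy - slope * sxy; // what the Weka loop calls msq
        double direct = 0;                   // explicit sum of squared residuals
        for (int i = 0; i < n; i++) {
            double r = y[i] - (intercept + slope * x[i]);
            direct += r * r;
        }
        System.out.println(Math.abs(shortcut - direct) < 1e-9);
    }
}
```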
Incidentally, Wikipedia also has a Chinese version of the article, but it uses the general multivariate notation and I won't copy those formulas here; search Wikipedia for 线性回归 if you want them. Whenever msq < minMsq, a better attribute has been found, and it is recorded:

// Set parameters
if (chosen == -1) {
    if (!m_suppressErrorMessage)
        System.err.println("----- no useful attribute found");
    m_attribute = null;
    m_attributeIndex = 0;
    m_slope = 0;
    m_intercept = yMean;
} else {
    m_attribute = insts.attribute(chosen);
    m_attributeIndex = chosen;
    m_slope = chosenSlope;
    m_intercept = chosenIntercept;
}

This stores the best attribute, its index, and the corresponding slope and intercept. Note the fallback above: if no useful attribute was found, the slope is set to 0 and the intercept to yMean, so the model becomes a line parallel to the x-axis; among all such horizontal lines, y = yMean is the one with the smallest squared error.
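That last claim, that the class mean minimizes squared error among constant predictors, is easy to check numerically. An illustrative sketch (not Weka code):

```java
// Illustrative sketch: among constant predictions c, the mean minimizes SSE.
public class MeanFallback {
    // Sum of squared errors when every instance is predicted as the constant c.
    public static double sse(double[] y, double c) {
        double s = 0;
        for (double v : y) s += (v - c) * (v - c);
        return s;
    }

    public static void main(String[] args) {
        double[] y = {1, 2, 6};
        double yMean = (1 + 2 + 6) / 3.0; // 3.0
        // SSE at the mean is no worse than at nearby constants.
        System.out.println(sse(y, yMean) <= sse(y, yMean + 0.5)
                        && sse(y, yMean) <= sse(y, yMean - 0.5));
    }
}
```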