Weka Development [33]: SimpleLinearRegression Source Code Analysis

lzqkean 2013-07-22


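I recommend Wikipedia's article on Simple Linear Regression, or Chapter 3 of Pattern Recognition and Machine Learning, although the book is more than you need if you only want to understand this one algorithm.

Let's look at classifyInstance first, mainly to see what the whole algorithm ultimately has to compute: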

public double classifyInstance(Instance inst) throws Exception {
  if (m_attribute == null) {
    // No attribute was selected: predict the constant intercept
    return m_intercept;
  } else {
    if (inst.isMissing(m_attribute.index())) {
      throw new Exception("SimpleLinearRegression: No missing values!");
    }
    // Linear prediction: intercept + slope * attribute value
    return m_intercept + m_slope * inst.value(m_attribute.index());
  }
}

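If m_attribute is null, the method returns m_intercept. The intercept is the constant term of the line, so when no attribute has been selected the attribute's contribution is simply treated as 0. Otherwise the prediction is m_intercept + m_slope * value, which is a linear function of the attribute value; m_slope is the slope. A missing value for the selected attribute is not handled and causes an exception.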
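To make the two branches concrete, here is a minimal standalone sketch of the same prediction rule. It is not Weka code: the class LinePredictor and its fields are hypothetical names, a boolean flag stands in for the m_attribute == null check, and NaN stands in for a missing value.

```java
// Hypothetical stand-in for the prediction rule of classifyInstance; not Weka code.
final class LinePredictor {
    private final boolean hasAttribute; // false plays the role of m_attribute == null
    private final double intercept;
    private final double slope;

    LinePredictor(boolean hasAttribute, double intercept, double slope) {
        this.hasAttribute = hasAttribute;
        this.intercept = intercept;
        this.slope = slope;
    }

    // Predicts y for attribute value x; NaN is used here to mean "missing".
    double predict(double x) {
        if (!hasAttribute) {
            return intercept; // constant model: every input gets the same prediction
        }
        if (Double.isNaN(x)) {
            throw new IllegalArgumentException("No missing values!");
        }
        return intercept + slope * x;
    }
}
```

With an intercept of 1 and a slope of 2, predict(3.0) evaluates 1 + 2 * 3; with hasAttribute set to false the input is ignored entirely.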

The code of buildClassifier is very simple.

First, some explanation copied from Wikipedia:

Suppose there are n data points {y_i, x_i}, where i = 1, 2, …, n. The goal is to find the equation of the straight line

$$y = \alpha + \beta x$$

which would provide a “best” fit for the data points. Here “best” is understood in the least-squares sense: the line that minimizes the sum of squared residuals of the linear regression model. In other words, the numbers α and β solve the following minimization problem:

$$\min_{\alpha,\,\beta} Q(\alpha, \beta), \qquad Q(\alpha, \beta) = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$

Using simple calculus it can be shown that the values of α and β that minimize the objective function Q are

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\operatorname{Cov}[x, y]}{\operatorname{Var}[x]}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$$

Substituting the above expressions for α̂ and β̂ back into the line equation gives the fitted line:

$$y = \bar{y} + \hat{\beta}\,(x - \bar{x})$$
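The closed-form estimates, slope = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² and intercept = ȳ − slope · x̄, can be sketched in plain Java as follows. This is an unweighted illustration under stated assumptions (equal-length, non-empty arrays and a non-constant x), and the class name LeastSquaresSketch is made up for this example.

```java
// Sketch of the closed-form least-squares fit for one predictor (unweighted).
final class LeastSquaresSketch {
    // Returns {intercept, slope}. Assumes x and y have equal, non-zero length
    // and that x is not constant (otherwise the denominator sxx is 0).
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double xMean = 0;
        double yMean = 0;
        for (int i = 0; i < n; i++) {
            xMean += x[i];
            yMean += y[i];
        }
        xMean /= n;
        yMean /= n;

        double sxy = 0; // sum of (x_i - xMean) * (y_i - yMean)
        double sxx = 0; // sum of (x_i - xMean)^2
        for (int i = 0; i < n; i++) {
            double xDiff = x[i] - xMean;
            sxy += xDiff * (y[i] - yMean);
            sxx += xDiff * xDiff;
        }
        double slope = sxy / sxx;
        double intercept = yMean - slope * xMean;
        return new double[] {intercept, slope};
    }

    public static void main(String[] args) {
        // Four points lying exactly on the line y = 1 + 2x.
        double[] line = fit(new double[] {0, 1, 2, 3}, new double[] {1, 3, 5, 7});
        System.out.println("intercept=" + line[0] + ", slope=" + line[1]);
    }
}
```

Because the sample points lie exactly on y = 1 + 2x, the fit recovers that intercept and slope.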

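That is all we need. The properties of these estimators are also covered on Wikipedia, so read them there if you are interested. The algorithm implemented in Weka searches all attributes for the single best one, and uses the intercept and slope obtained from that attribute as the final model.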

// yMean, minMsq, chosen, chosenSlope and chosenIntercept are declared
// before this loop (not shown in this excerpt)
for (int i = 0; i < insts.numAttributes(); i++) {
  if (i != insts.classIndex()) {
    m_attribute = insts.attribute(i);

    // Compute slope and intercept
    double xMean = insts.meanOrMode(i);
    double sumWeightedXDiffSquared = 0;
    double sumWeightedYDiffSquared = 0;
    m_slope = 0;
    for (int j = 0; j < insts.numInstances(); j++) {
      Instance inst = insts.instance(j);
      if (!inst.isMissing(i) && !inst.classIsMissing()) {
        double xDiff = inst.value(i) - xMean;
        double yDiff = inst.classValue() - yMean;
        double weightedXDiff = inst.weight() * xDiff;
        double weightedYDiff = inst.weight() * yDiff;
        // Accumulate the three weighted sums
        m_slope += weightedXDiff * yDiff;
        sumWeightedXDiffSquared += weightedXDiff * xDiff;
        sumWeightedYDiffSquared += weightedYDiff * yDiff;
      }
    }

    // Skip attribute if not useful
    if (sumWeightedXDiffSquared == 0) {
      continue;
    }
    double numerator = m_slope;
    m_slope /= sumWeightedXDiffSquared;
    m_intercept = yMean - m_slope * xMean;

    // Compute sum of squared errors
    double msq = sumWeightedYDiffSquared - m_slope * numerator;

    // Check whether this is the best attribute
    if (msq < minMsq) {
      minMsq = msq;
      chosen = i;
      chosenSlope = m_slope;
      chosenIntercept = m_intercept;
    }
  }
}

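The only distraction here is the weight; just treat it as 1. The slope, computed by m_slope /= sumWeightedXDiffSquared, is the first expression after the equals sign in the formula for β̂ (the sum of products of the x and y deviations divided by the sum of squared x deviations), and the intercept, yMean - m_slope * xMean, is exactly the formula for α̂.

While writing this I remembered that Wikipedia also has a Chinese version. Its formula uses the multivariate notation, which I will not copy here again; search Wikipedia for "线性回归" (linear regression) to find it. If msq < minMsq, a better attribute has been found, so it is recorded.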
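The expression used for msq in the loop can be checked with a short derivation. The shorthand S_xx, S_xy, S_yy and the weights w_i are notation introduced here for the three accumulated sums (sumWeightedXDiffSquared, numerator, and sumWeightedYDiffSquared); the sums run over the instances that pass the missing-value check.

```latex
\begin{aligned}
S_{xx} &= \sum_i w_i (x_i - \bar{x})^2, \quad
S_{xy} = \sum_i w_i (x_i - \bar{x})(y_i - \bar{y}), \quad
S_{yy} = \sum_i w_i (y_i - \bar{y})^2, \\
\hat{\beta} &= \frac{S_{xy}}{S_{xx}}, \qquad
\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}, \\
\sum_i w_i \,(y_i - \hat{\alpha} - \hat{\beta} x_i)^2
  &= \sum_i w_i \,\bigl[(y_i - \bar{y}) - \hat{\beta}\,(x_i - \bar{x})\bigr]^2 \\
  &= S_{yy} - 2\hat{\beta}\, S_{xy} + \hat{\beta}^2 S_{xx} \\
  &= S_{yy} - \hat{\beta}\, S_{xy}.
\end{aligned}
```

The last step uses β̂ · S_xx = S_xy. So msq is the weighted sum of squared errors of the fitted line, which matches the comment in the code even though the variable name suggests a mean.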

// Set parameters
if (chosen == -1) {
  if (!m_suppressErrorMessage)
    System.err.println("----- no useful attribute found");
  m_attribute = null;
  m_attributeIndex = 0;
  m_slope = 0;
  m_intercept = yMean;
} else {
  m_attribute = insts.attribute(chosen);
  m_attributeIndex = chosen;
  m_slope = chosenSlope;
  m_intercept = chosenIntercept;
}

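This part records the best attribute: the index of the chosen attribute, its slope, and its intercept. Note the first branch: if no useful attribute was found, the slope is set to 0 and the intercept to yMean, which gives a horizontal line parallel to the x-axis. Among all horizontal lines, this is of course the one with the smallest msq.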
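Putting the pieces together, the selection loop and the fallback to the mean can be sketched without Weka. This is an illustration under simplifying assumptions, not the library's implementation: all weights are 1, there are no missing values, the data is passed as plain arrays, and the names BestAttributeSketch and Model are hypothetical.

```java
// Standalone sketch of the attribute-selection loop; hypothetical names, not Weka code.
final class BestAttributeSketch {
    // chosen == -1 means no useful attribute was found: predict the mean of y.
    static final class Model {
        final int chosen;
        final double slope;
        final double intercept;

        Model(int chosen, double slope, double intercept) {
            this.chosen = chosen;
            this.slope = slope;
            this.intercept = intercept;
        }
    }

    static double mean(double[] v) {
        double sum = 0;
        for (double d : v) {
            sum += d;
        }
        return sum / v.length;
    }

    // columns[a][j] is the value of attribute a for instance j; y[j] is the target.
    static Model fit(double[][] columns, double[] y) {
        double yMean = mean(y);
        double minSse = Double.MAX_VALUE;
        int chosen = -1;
        double chosenSlope = 0;          // fallback: horizontal line
        double chosenIntercept = yMean;  // fallback: predict the mean
        for (int a = 0; a < columns.length; a++) {
            double xMean = mean(columns[a]);
            double sxy = 0, sxx = 0, syy = 0;
            for (int j = 0; j < y.length; j++) {
                double xDiff = columns[a][j] - xMean;
                double yDiff = y[j] - yMean;
                sxy += xDiff * yDiff;
                sxx += xDiff * xDiff;
                syy += yDiff * yDiff;
            }
            if (sxx == 0) {
                continue; // constant attribute: no line can be fitted
            }
            double slope = sxy / sxx;
            double sse = syy - slope * sxy; // sum of squared errors of this fit
            if (sse < minSse) {
                minSse = sse;
                chosen = a;
                chosenSlope = slope;
                chosenIntercept = yMean - slope * xMean;
            }
        }
        return new Model(chosen, chosenSlope, chosenIntercept);
    }
}
```

Given one constant column, one column that determines y exactly, and one noisy column, the sketch picks the exact column; given only constant columns it returns chosen == -1 with slope 0 and the mean of y as the intercept.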
