Weka Development [31]: SimpleLinearRegression Source Code Analysis - quweiprotoss - Koala's blog

 lzqkean 2013-07-22

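For background, I recommend the Wikipedia article on Simple Linear Regression, or Chapter 3 of Pattern Recognition and Machine Learning, although the latter is overkill if you only want to understand this one algorithm.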


public double classifyInstance(Instance inst) throws Exception {
    if (m_attribute == null) {
        // No attribute was selected during training: predict the constant intercept
        return m_intercept;
    } else {
        if (inst.isMissing(m_attribute.index())) {
            throw new Exception(
                "SimpleLinearRegression: No missing values!");
        }
        // Linear prediction: intercept + slope * attribute value
        return m_intercept + m_slope * inst.value(m_attribute.index());
    }
}



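If m_attribute is null, the method returns m_intercept. The intercept is the constant term of the line, so when no attribute was selected the slope term is effectively treated as 0. Otherwise the prediction is m_intercept + m_slope * value, which is a linear function; m_slope is the slope.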
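The prediction rule described above can be sketched without Weka. This is only an illustration under stated assumptions: the class name `LinearPredictionSketch`, the use of a negative index to stand in for `m_attribute == null`, and the use of `NaN` to mark a missing value are all choices made for this sketch, not taken from the source.

```java
public class LinearPredictionSketch {

    /**
     * Predicts with a one-attribute linear model.
     * attributeIndex < 0 plays the role of m_attribute == null (assumption of this sketch).
     * A NaN entry plays the role of a missing value (assumption of this sketch).
     */
    public static double predict(int attributeIndex, double intercept,
                                 double slope, double[] row) {
        if (attributeIndex < 0) {
            // No attribute selected: the model is the constant intercept
            return intercept;
        }
        double value = row[attributeIndex];
        if (Double.isNaN(value)) {
            throw new IllegalArgumentException("No missing values!");
        }
        return intercept + slope * value;
    }

    public static void main(String[] args) {
        double[] row = {3.0};
        System.out.println(predict(0, 1.0, 2.0, row));   // 1 + 2 * 3
        System.out.println(predict(-1, 4.5, 2.0, row));  // constant fallback
    }
}
```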



Suppose there are n data points {y_i, x_i}, where i = 1, 2, …, n. The goal is to find the equation of the straight line

$$ y = \alpha + \beta x $$

which would provide a "best" fit for the data points. Here "best" is understood in the least-squares sense: the line that minimizes the sum of squared residuals of the linear regression model. In other words, the numbers α and β solve the following minimization problem:

$$ \min_{\alpha,\,\beta} Q(\alpha, \beta), \qquad Q(\alpha, \beta) = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 $$

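Using simple calculus it can be shown that the values of α and β that minimize the objective function Q are

$$ \hat\beta = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\operatorname{Cov}[x, y]}{\operatorname{Var}[x]} = r_{xy}\,\frac{s_y}{s_x}, \qquad \hat\alpha = \bar{y} - \hat\beta\,\bar{x} $$

where x̄ and ȳ are the sample means, r_xy is the sample correlation between x and y, and s_x and s_y are the sample standard deviations.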
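The calculus step can be sketched as follows. This is a standard derivation written out for convenience, using the same Q, α and β as the text; it is not quoted from the source.

```latex
% Objective: Q(\alpha,\beta) = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2
\begin{align*}
\frac{\partial Q}{\partial \alpha}
  &= -2 \sum_{i=1}^{n} (y_i - \alpha - \beta x_i) = 0
  &&\Longrightarrow\quad \hat\alpha = \bar{y} - \hat\beta\,\bar{x}, \\
\frac{\partial Q}{\partial \beta}
  &= -2 \sum_{i=1}^{n} x_i\,(y_i - \alpha - \beta x_i) = 0 .
\end{align*}
% Substitute \alpha = \bar{y} - \beta\bar{x} into the second condition:
\begin{align*}
\sum_{i=1}^{n} x_i \bigl[(y_i - \bar{y}) - \beta (x_i - \bar{x})\bigr] &= 0 .
\end{align*}
% Because \sum_i \bigl[(y_i - \bar{y}) - \beta (x_i - \bar{x})\bigr] = 0,
% x_i may be replaced by (x_i - \bar{x}) without changing the sum:
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
  = \beta \sum_{i=1}^{n} (x_i - \bar{x})^2
  \quad\Longrightarrow\quad
  \hat\beta = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
                   {\sum_{i=1}^{n} (x_i - \bar{x})^2} .
\end{align*}
```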

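Substituting the above expressions for the estimated intercept and slope into the fitted line f = α̂ + β̂x gives

$$ f = \bar{y} + \hat\beta\,(x - \bar{x}), \qquad\text{equivalently}\qquad \frac{f - \bar{y}}{s_y} = r_{xy}\,\frac{x - \bar{x}}{s_x} $$

where r_xy is the sample correlation between x and y, and s_x and s_y are the sample standard deviations.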


// Excerpt from buildClassifier; yMean, minMsq, chosen, chosenSlope and
// chosenIntercept are declared before this loop.
for (int i = 0; i < insts.numAttributes(); i++) {
    if (i != insts.classIndex()) {
        m_attribute = insts.attribute(i);

        // Compute slope and intercept
        double xMean = insts.meanOrMode(i);
        double sumWeightedXDiffSquared = 0;
        double sumWeightedYDiffSquared = 0;
        m_slope = 0;
        for (int j = 0; j < insts.numInstances(); j++) {
            Instance inst = insts.instance(j);
            if (!inst.isMissing(i) && !inst.classIsMissing()) {
                double xDiff = inst.value(i) - xMean;
                double yDiff = inst.classValue() - yMean;
                double weightedXDiff = inst.weight() * xDiff;
                double weightedYDiff = inst.weight() * yDiff;
                // Accumulate the weighted sums of cross products and squares
                m_slope += weightedXDiff * yDiff;
                sumWeightedXDiffSquared += weightedXDiff * xDiff;
                sumWeightedYDiffSquared += weightedYDiff * yDiff;
            }
        }

        // Skip attribute if not useful
        if (sumWeightedXDiffSquared == 0) {
            continue;
        }
        double numerator = m_slope;
        m_slope /= sumWeightedXDiffSquared;
        m_intercept = yMean - m_slope * xMean;

        // Compute sum of squared errors
        double msq = sumWeightedYDiffSquared - m_slope * numerator;

        // Check whether this is the best attribute
        if (msq < minMsq) {
            minMsq = msq;
            chosen = i;
            chosenSlope = m_slope;
            chosenIntercept = m_intercept;
        }
    }
}




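The only distraction here is the weight; just treat it as 1. The slope computation, m_slope /= sumWeightedXDiffSquared, uses the first expression after the equals sign in the formula for β̂ given earlier, and the intercept uses exactly the same formula as the one given earlier. Only at this point did I remember that Wikipedia also has a Chinese version of the article, so here are the formulas once more:

$$ \hat\beta = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat\alpha = \bar{y} - \hat\beta\,\bar{x} $$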
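The line `msq = sumWeightedYDiffSquared - m_slope * numerator` may look unexplained, so here is a short derivation. The symbols w_i, S_xy, S_xx and S_yy are shorthand introduced for this note: w_i is the instance weight, and the three sums correspond to `numerator`, `sumWeightedXDiffSquared` and `sumWeightedYDiffSquared` in the code. The means x̄ and ȳ are whatever `meanOrMode` returned.

```latex
\begin{align*}
S_{xy} &= \sum_i w_i (x_i - \bar{x})(y_i - \bar{y}), &
S_{xx} &= \sum_i w_i (x_i - \bar{x})^2, &
S_{yy} &= \sum_i w_i (y_i - \bar{y})^2, \\
\hat\beta &= \frac{S_{xy}}{S_{xx}}, &
\hat\alpha &= \bar{y} - \hat\beta\,\bar{x} .
\end{align*}
% Weighted sum of squared errors of the fitted line:
\begin{align*}
\sum_i w_i \bigl(y_i - \hat\alpha - \hat\beta x_i\bigr)^2
  &= \sum_i w_i \bigl[(y_i - \bar{y}) - \hat\beta (x_i - \bar{x})\bigr]^2 \\
  &= S_{yy} - 2\hat\beta S_{xy} + \hat\beta^{2} S_{xx} \\
  &= S_{yy} - \hat\beta S_{xy}
  \qquad\text{since } \hat\beta^{2} S_{xx} = \hat\beta S_{xy} .
\end{align*}
```

With every weight equal to 1 this reduces to the unweighted formulas, which is why the weights can be ignored when reading the code.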


// Set parameters
if (chosen == -1) {
    // No attribute had any variance: fall back to predicting the class mean
    if (!m_suppressErrorMessage)
        System.err.println("----- no useful attribute found");
    m_attribute = null;
    m_attributeIndex = 0;
    m_slope = 0;
    m_intercept = yMean;
} else {
    // Keep the attribute with the smallest squared error
    m_attribute = insts.attribute(chosen);
    m_attributeIndex = chosen;
    m_slope = chosenSlope;
    m_intercept = chosenIntercept;
}
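
The whole procedure (fit one line per attribute, keep the attribute with the smallest squared error, fall back to the class mean when no attribute is useful) can be sketched without Weka. This is a simplified illustration, not Weka's implementation: the class `SimpleLinearRegressionSketch`, its `Model` holder and the plain `double[][]` input are hypothetical, all weights are taken to be 1, and missing values are not handled.

```java
public class SimpleLinearRegressionSketch {

    /** Hypothetical result holder: attribute is -1 when no column was useful. */
    public static final class Model {
        public final int attribute;
        public final double slope;
        public final double intercept;

        Model(int attribute, double slope, double intercept) {
            this.attribute = attribute;
            this.slope = slope;
            this.intercept = intercept;
        }

        public double predict(double[] row) {
            return attribute < 0 ? intercept : intercept + slope * row[attribute];
        }
    }

    /** Fits one line per column of x and keeps the column with the smallest squared error. */
    public static Model fit(double[][] x, double[] y) {
        int n = y.length;
        double yMean = 0;
        for (double v : y) yMean += v;
        yMean /= n;

        int chosen = -1;
        double chosenSlope = 0;
        double chosenIntercept = yMean;   // fallback: predict the mean
        double minSse = Double.MAX_VALUE;
        int numAttributes = n == 0 ? 0 : x[0].length;

        for (int a = 0; a < numAttributes; a++) {
            double xMean = 0;
            for (double[] row : x) xMean += row[a];
            xMean /= n;

            double sxy = 0, sxx = 0, syy = 0;
            for (int j = 0; j < n; j++) {
                double xDiff = x[j][a] - xMean;
                double yDiff = y[j] - yMean;
                sxy += xDiff * yDiff;
                sxx += xDiff * xDiff;
                syy += yDiff * yDiff;
            }
            if (sxx == 0) continue;       // constant column, not useful

            double slope = sxy / sxx;
            double sse = syy - slope * sxy;   // same identity as msq in the excerpt
            if (sse < minSse) {
                minSse = sse;
                chosen = a;
                chosenSlope = slope;
                chosenIntercept = yMean - slope * xMean;
            }
        }
        return new Model(chosen, chosenSlope, chosenIntercept);
    }

    public static void main(String[] args) {
        // Column 0 satisfies y = 2x + 1 exactly, column 1 is noisy, column 2 is constant.
        double[][] x = { {1, 2, 7}, {2, 1, 7}, {3, 4, 7}, {4, 3, 7} };
        double[] y = {3, 5, 7, 9};
        Model m = fit(x, y);
        System.out.println("attribute=" + m.attribute
            + " slope=" + m.slope + " intercept=" + m.intercept);
        // prints "attribute=0 slope=2.0 intercept=1.0"
    }
}
```

Keeping the best candidate in local variables and building the model only at the end mirrors the excerpt, where m_slope and m_intercept are scratch values inside the loop and are assigned their final values in the "Set parameters" step.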





