
Weka Development [37]: ChiSquareAttributeEval Source Code Analysis

 lzqkean 2013-07-22

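         The core chi-square code lives in buildEvaluator, and most of buildEvaluator is identical to the version in InfoGainAttributeEval: it simply tallies, for each class value, how many times each value of each attribute occurs, and stores the tallies in counts. The lines that differ are these: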

// Compute chi-squared values
m_ChiSquareds = new double[data.numAttributes()];
for (int i = 0; i < data.numAttributes(); i++) {
    if (i != classIndex) {
        // Drop empty rows/columns, then compute chi-squared without Yates' correction
        m_ChiSquareds[i] = ContingencyTables.chiVal(
                ContingencyTables.reduceMatrix(counts[i]), false);
    }
}
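To make the shape of counts[i] concrete, here is a minimal sketch of how a single attribute's contingency table could be tallied. The class ContingencyCountsSketch and the method countTable are hypothetical names used only for illustration, and the sketch assumes nominal values already encoded as indices, with no missing values and no instance weights (Weka's own code handles more than this).

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Weka: tallies one attribute's contingency table. */
final class ContingencyCountsSketch {

    /**
     * attrValues[n] is the attribute-value index of instance n and
     * classValues[n] is its class index.
     * Rows of the result are attribute values, columns are class values.
     */
    static double[][] countTable(int[] attrValues, int[] classValues,
                                 int numAttrValues, int numClasses) {
        double[][] table = new double[numAttrValues][numClasses];
        for (int n = 0; n < attrValues.length; n++) {
            table[attrValues[n]][classValues[n]] += 1.0;
        }
        return table;
    }

    public static void main(String[] args) {
        int[] attr = {0, 0, 1, 1, 1, 2};
        int[] cls  = {0, 1, 0, 0, 1, 1};
        // Six instances, an attribute with 3 values, a class with 2 values
        System.out.println(Arrays.deepToString(countTable(attr, cls, 3, 2)));
    }
}
```

A table like this, one per attribute, is what gets passed through reduceMatrix and chiVal.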

         The reduceMatrix method it calls looks like this:

/**
 * Reduces a matrix by deleting all zero rows and columns.
 */
public static double[][] reduceMatrix(double[][] matrix) {

    int row, col, currCol, currRow, nrows, ncols,
        nonZeroRows = 0, nonZeroColumns = 0;
    double[] rtotal, ctotal;
    double[][] newMatrix;

    nrows = matrix.length;
    ncols = matrix[0].length;
    rtotal = new double[nrows];
    ctotal = new double[ncols];
    // Accumulate row and column totals
    for (row = 0; row < nrows; row++) {
        for (col = 0; col < ncols; col++) {
            rtotal[row] += matrix[row][col];
            ctotal[col] += matrix[row][col];
        }
    }
    // Count rows and columns whose total is greater than zero
    for (row = 0; row < nrows; row++) {
        if (Utils.gr(rtotal[row], 0)) {
            nonZeroRows++;
        }
    }
    for (col = 0; col < ncols; col++) {
        if (Utils.gr(ctotal[col], 0)) {
            nonZeroColumns++;
        }
    }
    // Copy only the cells that sit in a non-zero row and a non-zero column
    newMatrix = new double[nonZeroRows][nonZeroColumns];
    currRow = 0;
    for (row = 0; row < nrows; row++) {
        if (Utils.gr(rtotal[row], 0)) {
            currCol = 0;
            for (col = 0; col < ncols; col++) {
                if (Utils.gr(ctotal[col], 0)) {
                    newMatrix[currRow][currCol] = matrix[row][col];
                    currCol++;
                }
            }
            currRow++;
        }
    }
    return newMatrix;
}

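         rtotal and ctotal hold the sum of all elements in each row and in each column, respectively; nonZeroRows and nonZeroColumns are the numbers of rows and columns whose sum is non-zero. Rows and columns whose elements are all 0 are deleted, which yields the new matrix newMatrix.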
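The effect of reduceMatrix is easiest to see on a small input. The following is a standalone sketch, not Weka's code: the class ReduceMatrixSketch and its reduce method are hypothetical, and it assumes a plain `> 0` comparison where Weka uses Utils.gr (which, as far as I know, compares with a small tolerance).

```java
import java.util.Arrays;

/** Hypothetical standalone sketch of the reduceMatrix idea; not Weka's implementation. */
final class ReduceMatrixSketch {

    static double[][] reduce(double[][] m) {
        int nrows = m.length, ncols = m[0].length;
        double[] rtotal = new double[nrows];
        double[] ctotal = new double[ncols];
        for (int r = 0; r < nrows; r++) {
            for (int c = 0; c < ncols; c++) {
                rtotal[r] += m[r][c];
                ctotal[c] += m[r][c];
            }
        }
        // Assumption: plain "> 0" stands in for Utils.gr(x, 0)
        int keptRows = (int) Arrays.stream(rtotal).filter(t -> t > 0).count();
        int keptCols = (int) Arrays.stream(ctotal).filter(t -> t > 0).count();
        double[][] out = new double[keptRows][keptCols];
        int currRow = 0;
        for (int r = 0; r < nrows; r++) {
            if (rtotal[r] <= 0) continue;          // skip all-zero row
            int currCol = 0;
            for (int c = 0; c < ncols; c++) {
                if (ctotal[c] > 0) {               // skip all-zero column
                    out[currRow][currCol++] = m[r][c];
                }
            }
            currRow++;
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] m = {
            {3, 0, 1},
            {0, 0, 0},   // all-zero row
            {2, 0, 4}    // the middle column is all zero as well
        };
        System.out.println(Arrays.deepToString(reduce(m)));
    }
}
```

Here the 3x3 input shrinks to the 2x2 matrix {{3, 1}, {2, 4}}, so the degrees of freedom computed later in chiVal reflect only the values that actually occur.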

/**
 * Computes chi-squared statistic for a contingency table.
 */
public static double chiVal(double[][] matrix, boolean useYates) {

    int df, nrows, ncols, row, col;
    double[] rtotal, ctotal;
    double expect = 0, chival = 0, n = 0;
    boolean yates = true;

    nrows = matrix.length;
    ncols = matrix[0].length;
    rtotal = new double[nrows];
    ctotal = new double[ncols];
    // Accumulate row totals, column totals and the grand total
    for (row = 0; row < nrows; row++) {
        for (col = 0; col < ncols; col++) {
            rtotal[row] += matrix[row][col];
            ctotal[col] += matrix[row][col];
            n += matrix[row][col];
        }
    }
    df = (nrows - 1) * (ncols - 1);
    if ((df > 1) || (!useYates)) {
        yates = false;
    } else if (df <= 0) {
        return 0;
    }
    chival = 0.0;
    // Sum the per-cell contributions
    for (row = 0; row < nrows; row++) {
        if (Utils.gr(rtotal[row], 0)) {
            for (col = 0; col < ncols; col++) {
                if (Utils.gr(ctotal[col], 0)) {
                    expect = (ctotal[col] * rtotal[row]) / n;
                    chival += chiCell(matrix[row][col], expect, yates);
                }
            }
        }
    }
    return chival;
}

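         rtotal, ctotal and n are, respectively, the sum of the elements in a row, the sum of the elements in a column, and the sum of all elements. expect is just (A + C) * (A + B) / N. Next, look at the code in chiCell: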

/**
 * Computes chi-value for one cell in a contingency table.
 */
private static double chiCell(double freq, double expected, boolean yates) {

    // Cell in empty row and column?
    if (Utils.smOrEq(expected, 0)) {
        return 0;
    }

    // Compute difference between observed and expected value
    double diff = Math.abs(freq - expected);
    if (yates) {

        // Apply Yates' correction if wanted
        diff -= 0.5;

        // The difference should never be negative
        if (diff < 0) {
            diff = 0;
        }
    }

    // Return chi-value for the cell
    return (diff * diff / expected);
}
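Taken together, chiVal and chiCell compute the statistic below. The symbols are my own notation, not the source's: O_{ij} is the observed count matrix[row][col], R_i and C_j are the row and column totals (rtotal, ctotal), N is the grand total n, and c is the continuity correction, which the code applies only when useYates is true and the table has exactly one degree of freedom.

```latex
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
\chi^2 = \sum_{i}\sum_{j}
  \frac{\bigl(\max\bigl(0,\; \lvert O_{ij} - E_{ij}\rvert - c\bigr)\bigr)^2}{E_{ij}},
\qquad
c =
\begin{cases}
  0.5 & \text{if Yates' correction is applied} \\
  0   & \text{otherwise}
\end{cases}
```

For a 2x2 table with first row (A, B) and first column (A, C), the top-left cell gives E_{11} = (A + B)(A + C) / N, the expression quoted in the text.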

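         For yates, see the Wikipedia entry "Yates' correction for continuity". It is meant for small samples, where it keeps statistical significance from being overestimated, though it can overcorrect.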

         diff * diff / expected is simply the formula, so there is not much to say about it. In chiVal, the chiCell values of all the cells are added up.
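As a quick numeric check of that summation, here is a minimal standalone sketch that follows the same steps as chiVal and chiCell. ChiSquaredSketch and chiSquared are hypothetical names, plain comparisons replace Weka's Utils.gr and Utils.smOrEq, and the 2x2 table is made-up illustrative data.

```java
/** Hypothetical standalone sketch of the chiVal/chiCell logic; not Weka's implementation. */
final class ChiSquaredSketch {

    static double chiSquared(double[][] m, boolean useYates) {
        int nrows = m.length, ncols = m[0].length;
        double[] rtotal = new double[nrows];
        double[] ctotal = new double[ncols];
        double n = 0;
        for (int r = 0; r < nrows; r++) {
            for (int c = 0; c < ncols; c++) {
                rtotal[r] += m[r][c];
                ctotal[c] += m[r][c];
                n += m[r][c];
            }
        }
        int df = (nrows - 1) * (ncols - 1);
        if (df <= 0) {
            return 0;                       // a single row or column carries no information
        }
        boolean yates = useYates && df == 1; // correction only for one degree of freedom
        double chi = 0;
        for (int r = 0; r < nrows; r++) {
            for (int c = 0; c < ncols; c++) {
                double expected = rtotal[r] * ctotal[c] / n;
                if (expected <= 0) continue; // cell in an empty row or column
                double diff = Math.abs(m[r][c] - expected);
                if (yates) {
                    diff = Math.max(0, diff - 0.5);
                }
                chi += diff * diff / expected;
            }
        }
        return chi;
    }

    public static void main(String[] args) {
        double[][] table = {{10, 20}, {30, 40}};  // illustrative counts only
        System.out.println(chiSquared(table, false));
        System.out.println(chiSquared(table, true));
    }
}
```

For this table the expected counts are 12, 18, 28 and 42, so every cell has |observed - expected| = 2. The uncorrected statistic is 4 * (1/12 + 1/18 + 1/28 + 1/42) = 200/252, roughly 0.794, and with Yates' correction each difference shrinks to 1.5, giving 112.5/252, roughly 0.446.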

         For the details, have a look at what Jasper wrote; his write-up is quite interesting:

http://www./zhenandaci/archive/2008/08/31/225966.html

         As for more authoritative material, I read some in the past, but I can no longer find it myself.

 
