The core chi-square code is in buildEvaluator, and most of buildEvaluator is identical to the code in InfoGainAttributeEval, because all it does is count, for each class value, how many times each value of each attribute occurs, and store the result in counts. The few lines that differ are these:

    // Compute chi-squared values
    m_ChiSquareds = new double[da
    for (int i = 0; i < da
      if (i != classIndex) {
        m_ChiSquareds[i] = ContingencyTables.chiVal(ContingencyTables
            .reduceMatrix(counts[i]), false);
      }
    }

The reduceMatrix method it calls looks like this:

    /**
     * Reduces a matrix by deleting all zero rows and columns.
     */
    public static double[][] reduceMatrix(double[][] matrix) {
        int row, col, currCol, currRow, nrows, ncols,
            nonZeroRows = 0, nonZeroColumns = 0;
        double[] rtotal, ctotal;
        double[][] newMatrix;

        nrows = matrix.length;
        ncols = matrix[0].length;
        rtotal = new double[nrows];
        ctotal = new double[ncols];
        for (row = 0; row < nrows; row++) {
            for (col = 0; col < ncols; col++) {
                rtotal[row] += matrix[row][col];
                ctotal[col] += matrix[row][col];
            }
        }
        for (row = 0; row < nrows; row++) {
            if (Utils.gr(rtotal[row], 0)) {
                nonZeroRows++;
            }
        }
        for (col = 0; col < ncols; col++) {
            if (Utils.gr(ctotal[col], 0)) {
                nonZeroColumns++;
            }
        }
        newMatrix = new double[nonZeroRows][nonZeroColumns];
        currRow = 0;
        for (row = 0; row < nrows; row++) {
            if (Utils.gr(rtotal[row], 0)) {
                currCol = 0;
                for (col = 0; col < ncols; col++) {
                    if (Utils.gr(ctotal[col], 0)) {
                        newMatrix[currRow][currCol] = matrix[row][col];
                        currCol++;
                    }
                }
                currRow++;
            }
        }
        return newMatrix;
    }

rtotal and ctotal hold the sum of all elements in each row and each column, respectively. nonZeroRows and nonZeroColumns count the rows and columns whose total is greater than zero. Rows and columns whose elements are all zero are dropped, which gives a new matrix, newMatrix.

    /**
     * Computes chi-squared statistic for a contingency table.
     */
    public static double chiVal(double[][] matrix, boolean useYates) {
        int df, nrows, ncols, row, col;
        double[] rtotal, ctotal;
        double expect = 0, chival = 0, n = 0;
        boolean yates = true;

        nrows = matrix.length;
        ncols = matrix[0].length;
        rtotal = new double[nrows];
        ctotal = new double[ncols];
        for (row = 0; row < nrows; row++) {
            for (col = 0; col < ncols; col++) {
                rtotal[row] += matrix[row][col];
                ctotal[col] += matrix[row][col];
                n += matrix[row][col];
            }
        }
        df = (nrows - 1) * (ncols - 1);
        if ((df > 1) || (!useYates)) {
            yates = false;
        } else if (df <= 0) {
            return 0;
        }
        chival = 0.0;
        for (row = 0; row < nrows; row++) {
            if (Utils.gr(rtotal[row], 0)) {
                for (col = 0; col < ncols; col++) {
                    if (Utils.gr(ctotal[col], 0)) {
                        expect = (ctotal[col] * rtotal[row]) / n;
                        chival += chiCell(matrix[row][col], expect, yates);
                    }
                }
            }
        }
        return chival;
    }

rtotal, ctotal and n are the sum of the elements in a row, the sum of the elements in a column, and the sum of all elements, respectively. expect is simply (A+C)*(A+B)/N. Now look at the code in chiCell:

    /**
     * Computes chi-value for on
     */
    private static double chiCell(double freq, double expected, boolean yates) {
        // Cell in empty row and column?
        if (Utils.smOrEq(expected, 0)) {
            return 0;
        }

        // Compute difference between observed and expected value
        double diff = Math.abs(freq - expected);
        if (yates) {

            // Apply Yates' correction if wanted
            diff -= 0.5;

            // The difference should never be negative
            if (diff < 0) {
                diff = 0;
            }
        }

        // Return chi-value for the cell
        return (diff * diff / expected);
    }

For yates, see the Wikipedia entry "Yates' correction for continuity". The correction is applied to small samples to avoid overestimating statistical significance, and it can overcorrect.

diff * diff / expected is just the formula, so there is not much to explain. chiVal then adds up the values that chiCell returns for all cells.

For the details, have a look at Jasper's write-up, which is quite interesting: http://www./zhenandaci/archive/2008/08/31/225966.html

There is a more authoritative reference that I read a while ago, but I can no longer find it myself.
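To make the arithmetic in chiVal and chiCell concrete, here is a minimal standalone sketch that recomputes the statistic for a small 2x2 table, with and without the Yates correction. It does not call Weka: the class name ChiSquareSketch and the methods cell and chiSquared are hypothetical names chosen for this illustration, and plain comparisons stand in for Utils.gr and Utils.smOrEq.

```java
public class ChiSquareSketch {

    // Contribution of one cell: (|observed - expected| [- 0.5])^2 / expected.
    public static double cell(double observed, double expected, boolean yates) {
        if (expected <= 0) {
            return 0; // cell sits in an empty row or column
        }
        double diff = Math.abs(observed - expected);
        if (yates) {
            // Continuity correction; the difference is never allowed below zero.
            diff = Math.max(0, diff - 0.5);
        }
        return diff * diff / expected;
    }

    // Sums the per-cell contributions over the whole contingency table.
    public static double chiSquared(double[][] table, boolean yates) {
        int rows = table.length;
        int cols = table[0].length;
        double[] rowTotal = new double[rows];
        double[] colTotal = new double[cols];
        double n = 0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                rowTotal[r] += table[r][c];
                colTotal[c] += table[r][c];
                n += table[r][c];
            }
        }
        if (n <= 0) {
            return 0; // empty table, nothing to compare
        }
        double chi = 0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                // Expected count under independence: row total * column total / n.
                double expected = rowTotal[r] * colTotal[c] / n;
                chi += cell(table[r][c], expected, yates);
            }
        }
        return chi;
    }

    public static void main(String[] args) {
        // Rows are attribute values, columns are class values (made-up counts).
        double[][] table = { {10, 20}, {30, 40} };
        System.out.println(chiSquared(table, false)); // about 0.7937 (50/63)
        System.out.println(chiSquared(table, true));  // about 0.4464
    }
}
```

For this table the expected counts are 12, 18, 28 and 42, and every observed count differs from its expected count by 2, so the uncorrected statistic is 4/12 + 4/18 + 4/28 + 4/42. With the correction each difference shrinks to 1.5, which is why the second value is smaller.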