
Weka Development [35]: StringToWordVector Source Code Analysis (1)

 lzqkean 2013-07-22

         Recently I have been using wvtool to compute tf-idf, but it requires its input to be files, while my data consists of a very large number of very short snippets. I tried generating 3 million files: building the dictionary could not finish even after more than ten hours, my disk is only 100 GB and filled up right away, and deleting all those small files again took countless hours. So I thought that if the tool could work line by line instead of file by file, with a user-defined function for parsing each line, it would be much faster, simply because there would be far fewer I/O operations. I meant to write such a thing myself, but my computer is slow (with me using it, few computers ever feel fast), it froze as soon as the data was loaded, my patience ran out, and I lost the mood to finish it.
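For the line-by-line idea, here is a minimal sketch of my own (not from the original post) showing how a Weka Instances object could be built directly from lines of text, assuming the Weka 3.6-era API (FastVector, Instance) used throughout this post; the file name lines.txt and the way the class label is obtained are hypothetical:

FastVector atts = new FastVector(2);
atts.addElement(new Attribute("text", (FastVector) null)); // a string attribute
FastVector classVals = new FastVector(2);
classVals.addElement("pos");
classVals.addElement("neg");
atts.addElement(new Attribute("@@class@@", classVals));
Instances data = new Instances("lines", atts, 0);
data.setClassIndex(1);

BufferedReader br = new BufferedReader(new FileReader("lines.txt"));
String line;
while ((line = br.readLine()) != null) {
    double[] vals = new double[2];
    vals[0] = data.attribute(0).addStringValue(line); // one instance per line, no per-file I/O
    vals[1] = classVals.indexOf("pos");               // hypothetical: the real label must come from elsewhere
    data.add(new Instance(1.0, vals));
}
br.close();
// data can now be fed to StringToWordVector just like the output of TextDirectoryLoader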

         The code below was borrowed from Huang Shaoli and his colleagues. At the time I only knew that such a thing existed; I did not bother to google how it was actually meant to be used and simply took it as it was. If you only use Weka, this is good enough. I also wrote my own code, based on Wang Yi's earlier code, that builds a VSM model with wvtool, produces a libsvm data set, and can further generate an .arff file (the converters found online cannot produce the real attribute names). I think I have written that code three times, once in C and twice in Java, each time believing it would be the last.

/**
 * Preprocess the data set and write it out in Arff format
 * @param dataDir  directory holding the original documents
 * @param desTi    destination file to write to
 * @throws Exception
 */
public void priProcessData(String dataDir, String desTi) throws Exception {

    // Load every document under dataDir as an instance with a string attribute
    TextDirectoryLoader tdl = new TextDirectoryLoader();
    tdl.setDirectory(new File(dataDir));
    Instances ins = tdl.getDataSet();
    ins.setClassIndex(0);

    // Convert the string attribute into a word vector space of term weights
    StringToWordVector filter = new StringToWordVector();
    filter.setUseStoplist(true);
    filter.setTFTransform(true);
    filter.setIDFTransform(true);
    LovinsStemmer stemmer = new LovinsStemmer();
    filter.setStemmer(stemmer);
    filter.setMinTermFreq(5);
    filter.setWordsToKeep(500);
    filter.setInputFormat(ins);
    Instances newtrain = Filter.useFilter(ins, filter);

    BufferedWriter bw = new BufferedWriter(new FileWriter(new File(desTi)));
    bw.write(newtrain.toString());
    bw.flush();
    bw.close();
}

The Javadoc of weka.filters.unsupervised.attribute.StringToWordVector says: "Converts String attributes into a set of attributes representing word occurrence (depending on the tokenizer) information from the text contained in the strings. The set of words (attributes) is determined by the first batch filtered (typically training data)." In other words, each String attribute is turned into a set of attributes that record word occurrence information extracted from the text in the strings, and the word set itself is fixed by the first batch that passes through the filter.
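To make the "first batch" behaviour concrete, a minimal usage sketch of my own (train and test are assumed to be Instances objects containing a string attribute): the dictionary is built from the training data, and the test data is then mapped onto that same dictionary:

StringToWordVector filter = new StringToWordVector();
filter.setInputFormat(train);                         // the first batch (train) fixes the word set
Instances trainVec = Filter.useFilter(train, filter); // builds the dictionary
Instances testVec  = Filter.useFilter(test, filter);  // reuses the dictionary, adds no new words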

public boolean input(Instance instance) throws Exception {

    if (getInputFormat() == null) {
       throw new IllegalStateException("No input instance format defined");
    }
    if (m_NewBatch) {
       resetQueue();
       m_NewBatch = false;
    }
    if (isFirstBatchDone()) {
       FastVector fv = new FastVector();
       int firstCopy = convertInstancewoDocNorm(instance, fv);
       Instance inst = (Instance) fv.elementAt(0);
       if (m_filterType != FILTER_NONE) {
           normalizeInstance(inst, firstCopy);
       }
       push(inst);
       return true;
    } else {
       bufferInput(instance);
       return false;
    }
}

         m_NewBatch indicates whether a new batch is starting. It is at this point, so resetQueue is called.

/** The output instance queue */
private Queue m_OutputQueue = null;

protected void resetQueue() {
    m_OutputQueue = new Queue();
}

         This resets m_OutputQueue.

         isFirstBatchDone looks like this:

/** True if the first batch has been done */
protected boolean m_FirstBatchDone = false;

/**
 * Returns true if the first batch of instances got processed. Necessary
 * for supervised filters, which "learn" from the first batch and then
 * shouldn't get updated with subsequent calls of batchFinished().
 */
public boolean isFirstBatchDone() {
    return m_FirstBatchDone;
}

         It returns true if the first batch of instances has already been processed. This matters for supervised filters, which "learn" from the first batch and should not be updated by later calls to batchFinished(). At this point m_FirstBatchDone is still false, so bufferInput is executed:

protected void bufferInput(Instance instance) {

    if (instance != null) {
       copyValues(instance, true);
       m_InputFormat.add(instance);
    }
}

         There is not much to say about copyValues and add, so let us move on: batchFinished calls the determineDictionary function:

/**
 * a file containing stopwords for using others than the default Rainbow
 * ones.
 */
private File m_Stopwords = new File(System.getProperty("user.dir"));

*********************************************************************

// initialize stopwords
Stopwords stopwords = new Stopwords();
if (getUseStoplist()) {
    try {
       if (getStopwords().exists() && !getStopwords().isDirectory())
           stopwords.read(getStopwords());
    } catch (Exception e) {
       e.printStackTrace();
    }
}

This initializes the stopword list; the Stopwords class already ships with several hundred built-in stopwords.
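If the built-in list is not enough, the filter can be pointed at a custom stopword file, which is exactly the path read by the code above. A small sketch of my own (reusing the filter object from the first code block; the file name is hypothetical):

filter.setUseStoplist(true);                       // otherwise the stopword file is ignored
filter.setStopwords(new File("my_stopwords.txt")); // file in the format expected by weka.core.Stopwords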

// Operate on a per-class basis if class attribute is set
int classInd = getInputFormat().classIndex();
int values = 1;
if (!m_doNotOperateOnPerClassBasis && (classInd != -1)) {
    values = getInputFormat().attribute(classInd).numValues();
}

// TreeMap dictionaryArr [] = new TreeMap[values];
TreeMap[] dictionaryArr = new TreeMap[values];
for (int i = 0; i < values; i++) {
    dictionaryArr[i] = new TreeMap();
}

         Here values is the number of class values, and dictionaryArr holds one counting dictionary per class.
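The per-class behaviour can be switched off on the filter; a one-line sketch of my own:

filter.setDoNotOperateOnPerClassBasis(true); // build a single dictionary over all classes instead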

private void determineSelectedRange() {

    Instances inputFormat = getInputFormat();

    // Calculate the default set of fields to convert
    if (m_SelectedRange == null) {
       StringBuffer fields = new StringBuffer();
       for (int j = 0; j < inputFormat.numAttributes(); j++) {
           if (inputFormat.attribute(j).type() == Attribute.STRING)
              fields.append((j + 1) + ",");
       }
       m_SelectedRange = new Range(fields.toString());
    }
    m_SelectedRange.setUpper(inputFormat.numAttributes() - 1);

    // Prevent the user from converting non-string fields
    StringBuffer fields = new StringBuffer();
    for (int j = 0; j < inputFormat.numAttributes(); j++) {
       if (m_SelectedRange.isInRange(j)
              && inputFormat.attribute(j).type() == Attribute.STRING)
           fields.append((j + 1) + ",");
    }
    m_SelectedRange.setRanges(fields.toString());
    m_SelectedRange.setUpper(inputFormat.numAttributes() - 1);
}

         If no range has been selected, every attribute of type String is assumed to be converted into the word vector. If a range was selected, only the attributes in that range that really are of type String are kept.
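The range itself is set on the filter; a short sketch of my own (the index shown is hypothetical):

filter.setAttributeIndices("first-last"); // default: consider all attributes (non-string ones are dropped here)
// or, for example, only the first attribute:
filter.setAttributeIndices("1");          // 1-based indices, matching the (j + 1) above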

// Tokenize all training text into an orderedMap of "words".
long pruneRate = Math.round((m_PeriodicPruningRate / 100.0)
       * getInputFormat().numInstances());

This computes the pruning interval: m_PeriodicPruningRate is a percentage of the number of instances, so pruneRate is the number of instances to process between two pruning passes.
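As a concrete (hypothetical) example of the numbers involved:

filter.setPeriodicPruning(10.0); // m_PeriodicPruningRate = 10
// With 3,000,000 instances this gives pruneRate = round(0.10 * 3,000,000) = 300,000,
// i.e. rare words are pruned from the dictionaries every 300,000 documents.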

// Iterate through all relevant string attributes of the current
// instance
Hashtable h = new Hashtable();
for (int j = 0; j < instance.numAttributes(); j++) {
    if (m_SelectedRange.isInRange(j)
           && (instance.isMissing(j) == false)) {

       // Get tokenizer
       m_Tokenizer.tokenize(instance.stringValue(j));

       // Iterate through tokens, perform stemming, and remove
       // stopwords
       // (if required)
       while (m_Tokenizer.hasMoreElements()) {
           String word = ((String) m_Tokenizer.nextElement())
                  .intern();

           if (this.m_lowerCaseTokens == true)
              word = word.toLowerCase();

           word = m_Stemmer.stem(word);

           if (this.m_useStoplist == true)
              if (stopwords.is(word))
                  continue;

           if (!(h.contains(word)))
              h.put(word, new Integer(0));

           Count count = (Count) dictionaryArr[vInd].get(word);
           if (count == null) {
              dictionaryArr[vInd].put(word, new Count(1));
           } else {
              count.count++;
           }
       }
    }
}

         For every instance in the loop we iterate over its attributes; each attribute that is to be turned into the word vector is first split into tokens by the Tokenizer. Every token is then stemmed and, if the stoplist is enabled, dropped when it is a stopword. Finally, in the dictionary of the corresponding class (dictionaryArr[vInd]), the word is inserted with a count of 1 if it has not been seen before, otherwise its occurrence count is incremented.
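Which tokenizer, stemmer and case handling feed this loop are all configurable on the filter; a small sketch of my own (not from the post):

filter.setTokenizer(new WordTokenizer()); // split on whitespace/punctuation delimiters
filter.setLowerCaseTokens(true);          // sets m_lowerCaseTokens checked above
filter.setStemmer(new LovinsStemmer());   // the m_Stemmer used above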

// updating the docCount for the words that have occurred in this
// instance(document).
Enumeration e = h.keys();
while (e.hasMoreElements()) {
    String word = (String) e.nextElement();
    Count c = (Count) dictionaryArr[vInd].get(word);
    if (c != null) {
       c.docCount++;
    } else
       System.err
              .println("Warning: A word should definitely be in the "
                     + "dictionary.Please check the code");
}

         For each of these words docCount is incremented by 1. It was not incremented inside the token loop because a word may occur several times within one instance; for example, if a word appears three times in one document, count goes up by 3 but docCount only by 1.

if (pruneRate > 0) {
    if (i % pruneRate == 0 && i > 0) {
       for (int z = 0; z < values; z++) {
           Vector d = new Vector(1000);
           Iterator it = dictionaryArr[z].keySet().iterator();
           while (it.hasNext()) {
              String word = (String) it.next();
              Count count = (Count) dictionaryArr[z].get(word);
              if (count.count <= 1) {
                  d.add(word);
              }
           }
           Iterator iter = d.iterator();
           while (iter.hasNext()) {
              String word = (String) iter.next();
              dictionaryArr[z].remove(word);
           }
       }
    }
}

         Here you can see that pruneRate is the number of instances after which a pruning pass is run. This saves memory, but it also has a drawback: a word that occurs only rarely in the early documents keeps getting pruned, so even if it becomes frequent later, its early occurrences are lost. That is presumably also why the dictionaries are kept per class; otherwise the characteristic words of a class with few samples would simply be pruned away. The procedure itself is straightforward: every pruneRate instances, all words whose count is at most 1 are removed from each class dictionary. The words to delete are first collected in the vector d and then removed in a second pass, which I think makes the logic clearer.

// Figure out the minimum required word frequency
int totalsize = 0;
int prune[] = new int[values];
for (int z = 0; z < values; z++) {
    totalsize += dictionaryArr[z].size();

    int array[] = new int[dictionaryArr[z].size()];
    int pos = 0;
    Iterator it = dictionaryArr[z].keySet().iterator();
    while (it.hasNext()) {
       String word = (String) it.next();
       Count count = (Count) dictionaryArr[z].get(word);
       array[pos] = count.count;
       pos++;
    }

    // sort the array
    sortArray(array);
    if (array.length < m_WordsToKeep) {
       // if there aren't enough words, set the threshold to
       // minFreq
       prune[z] = m_minTermFreq;
    } else {
       // otherwise set it to be at least minFreq
       prune[z] = Math.max(m_minTermFreq, array[array.length
              - m_WordsToKeep]);
    }
}
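To make the threshold computation above concrete, a small worked example of my own (hypothetical numbers):

// Per-class word counts after sortArray (ascending), assumed values:
int[] array = {1, 1, 2, 3, 5, 8, 13};
int m_WordsToKeep = 3;
int m_minTermFreq = 2;
int threshold = Math.max(m_minTermFreq, array[array.length - m_WordsToKeep]);
// threshold == Math.max(2, array[4]) == 5: only words occurring at least 5 times in this
// class survive, which keeps roughly the m_WordsToKeep most frequent words (more if tied).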
