Word co-occurrence implemented in Hadoop
2018-03-22
package wco;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Driver: configures and submits the word co-occurrence job.
public class WCo extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {

        if (args.length != 2) {
            System.out.printf("Usage: hadoop jar wco.WCo <input dir> <output dir>\n");
            return -1;
        }

        Job job = new Job(getConf());
        job.setJarByClass(WCo.class);
        job.setJobName("WordCoOccurrence");

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(WCoMapper.class);
        job.setReducerClass(WCoReducer.class);

        // Both the mapper and the reducer emit (Text, IntWritable) pairs.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new WCo(), args);
        System.exit(exitCode);
    }
}
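
Once the classes are compiled and packaged, the job is submitted in the usual way, with the input and output directories as the two arguments (the jar name wco.jar and the paths here are illustrative, not from the post):

hadoop jar wco.jar wco.WCo input output

Note that the output directory must not already exist: FileOutputFormat refuses to overwrite an existing path.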
The core of the algorithm is simply to take each word together with the word that follows it as the key, attach a count as the value, and run an ordinary wordcount, so that the co-occurrence frequency of word pairs can be used to cluster the text. Most write-ups online reach for k-means, but in practice the algorithm should follow the requirements: k-means or fuzzy k-means is not automatically the classy choice, and wordcount is not automatically the shabby one. A sketch of the mapper and reducer the driver registers follows below.
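
The post never shows WCoMapper and WCoReducer, so here is a minimal sketch of what they could look like, assuming they live in the same wco package as the driver and use the new org.apache.hadoop.mapreduce API. The lowercase/split tokenization and the comma separator in the key are my own choices, not the author's:

// WCoMapper.java -- emits each adjacent word pair as "left,right" with a count of 1.
package wco;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WCoMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenization is an assumption: lowercase, split on non-word characters.
        String[] words = value.toString().toLowerCase().split("\\W+");
        for (int i = 0; i < words.length - 1; i++) {
            if (words[i].isEmpty() || words[i + 1].isEmpty()) {
                continue;
            }
            // The preceding word and the following word together form the key.
            pair.set(words[i] + "," + words[i + 1]);
            context.write(pair, ONE);
        }
    }
}

// WCoReducer.java -- sums the counts for each word pair, exactly like wordcount.
package wco;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WCoReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}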