Word co-occurrence implemented in Hadoop
2018-03-22
package wco;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Driver: configures and submits the word co-occurrence job.
public class WCo extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {

        if (args.length != 2) {
            System.out.printf("Usage: hadoop jar wco.WCo <input dir> <output dir>\n");
            return -1;
        }

        Job job = new Job(getConf());
        job.setJarByClass(WCo.class);
        job.setJobName("WordCoOccurrence");

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(WCoMapper.class);
        job.setReducerClass(WCoReducer.class);

        // Both the mapper and the reducer emit (Text, IntWritable) pairs.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new WCo(), args);
        System.exit(exitCode);
    }
}
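
Once the classes are compiled and packaged, the job is submitted in the usual way, with the input and output directories as the two arguments (the jar name wco.jar and the paths here are illustrative, not from the post):

hadoop jar wco.jar wco.WCo input output

Note that the output directory must not already exist: FileOutputFormat refuses to overwrite an existing path.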
The core of the algorithm is simply to take each word together with the word that follows it as the key, attach a count as the value, and run an ordinary wordcount, so that the co-occurrence frequency of word pairs can be used to cluster the text. Most write-ups online reach for k-means, but in practice the algorithm should follow the requirements: k-means or fuzzy k-means is not automatically the classy choice, and wordcount is not automatically the shabby one. A sketch of the mapper and reducer the driver registers follows below.
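
The post never shows WCoMapper and WCoReducer, so here is a minimal sketch of what they could look like, assuming they live in the same wco package as the driver and use the new org.apache.hadoop.mapreduce API. The lowercase/split tokenization and the comma separator in the key are my own choices, not the author's:

// WCoMapper.java -- emits each adjacent word pair as "left,right" with a count of 1.
package wco;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WCoMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenization is an assumption: lowercase, split on non-word characters.
        String[] words = value.toString().toLowerCase().split("\\W+");
        for (int i = 0; i < words.length - 1; i++) {
            if (words[i].isEmpty() || words[i + 1].isEmpty()) {
                continue;
            }
            // The preceding word and the following word together form the key.
            pair.set(words[i] + "," + words[i + 1]);
            context.write(pair, ONE);
        }
    }
}

// WCoReducer.java -- sums the counts for each word pair, exactly like wordcount.
package wco;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WCoReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}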