MapReduce Beginner Programming: Word Count

 青语csl 2020-05-15

Lab No.:

Lab Project: MapReduce Programming Beginner Practice - Word Count

Lab Grade:


I. Lab Objectives

1) Understand the basic methods of MapReduce programming;

2) Master solving the word count problem with the MapReduce programming model.

II. Lab Contents

1) Create a new project in Eclipse and add the JAR packages required for MapReduce/Hadoop development;

2) Write a MapReduce-based word count program (containing a map method, a reduce method, and a main method);

3) Package the word count program into a JAR file;

4) Start Hadoop, create the input folder on HDFS, and upload the files to be processed into it (do not pre-create the output folder: the job creates it itself and will fail if it already exists);

5) Run the word count program (the JAR file);

6) View the results (saved in the output folder).

III. Lab Procedure and Analysis

Prerequisites: Eclipse and CentOS 7

(1) Download and install the Hadoop-Eclipse-Plugin

i. Download the plugin to the desktop

ii. Unpack it and copy hadoop-eclipse-plugin-2.6.0.jar from the release folder into the plugins directory under the Eclipse installation directory, then run the eclipse -clean command so the plugin takes effect (a command sketch follows)
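A minimal sketch of these two steps, assuming the archive was saved to the desktop and Eclipse lives under /usr/lib/eclipse (the archive name and both paths are assumptions; adjust them to your machine):

    cd ~/Desktop
    unzip hadoop2x-eclipse-plugin-master.zip        # archive name is an assumption
    sudo cp hadoop2x-eclipse-plugin-master/release/hadoop-eclipse-plugin-2.6.0.jar \
            /usr/lib/eclipse/plugins/
    /usr/lib/eclipse/eclipse -clean                 # relaunch so Eclipse rescans its plugins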

(2) Configure the Hadoop-Eclipse-Plugin

i. Start Hadoop

ii. Start Eclipse and choose Preferences under the Window menu

iii. Switch to the Map/Reduce development perspective:

Window -> Perspective -> Open Perspective -> Other

iv. Establish a connection to the Hadoop cluster: click the Map/Reduce Locations panel in Eclipse, right-click inside the panel, and choose New Hadoop Location
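When filling in the New Hadoop Location dialog, the DFS Master host and port must match the fs.defaultFS value in the cluster's core-site.xml. For a typical pseudo-distributed setup that entry looks like the following (hdfs://localhost:9000 is an assumption; check your own file):

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>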

(3) Operate on the files in HDFS from within Eclipse; the results can be viewed under the output folder

(4) Create a MapReduce project

(5) Write the code

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
//      String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        String[] otherArgs = new String[]{"input", "output"}; // set the input and output paths directly
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // the reducer doubles as a combiner, pre-aggregating counts on the map side
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // every argument except the last is an input path; the last is the output path
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // Reducer (and combiner): sums all the 1s emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Mapper: tokenizes each input line on whitespace and emits (word, 1) per token
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
}
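As a quick sanity check of the data flow, suppose (purely hypothetically) the input holds the single line "Hello Hadoop Hello World". The mapper emits (Hello, 1), (Hadoop, 1), (Hello, 1), (World, 1); the combiner/reducer sums the values per key, and since keys are sorted before the reduce, the output file would read:

    Hadoop	1
    Hello	2
    World	1

(tab-separated key/value pairs, as written by the default TextOutputFormat.)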

(6) Package the code

i. Compile the Java program

ii. Package the generated .class files (both steps are sketched below)
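Both steps can be done from the command line; here is a sketch, assuming WordCount.java sits in the current directory (the classes directory and the JAR name are illustrative):

    # compile against the Hadoop libraries; `hadoop classpath` prints them
    mkdir -p classes
    javac -classpath "$(hadoop classpath)" -d classes WordCount.java
    # bundle the generated .class files into WordCount.jar
    jar -cvf WordCount.jar -C classes .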

(7) Create the input files
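A sketch of preparing the input on HDFS (the local file name is illustrative; relative HDFS paths resolve under /user/<your-username>):

    # create the input folder in the HDFS home directory
    hdfs dfs -mkdir -p input
    # upload a local text file into it
    hdfs dfs -put ~/wordfile.txt input
    # confirm the upload
    hdfs dfs -ls input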

(8) Run the JAR file

View the run results:
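With the JAR packaged as above, the job can be launched and its output printed like so (remember that the output folder must not exist before the run):

    # run the job; the input/output paths are hardcoded in main()
    hadoop jar WordCount.jar org.apache.hadoop.examples.WordCount
    # print the result files
    hdfs dfs -cat output/part-r-*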

IV. Lab Summary (Reflections)

Problems encountered during this lab:

(1) The NameNode was stuck in safe mode

Solution: take the NameNode out of safe mode
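This is done with the stock dfsadmin command:

    # check the current state, then force the NameNode out of safe mode
    hdfs dfsadmin -safemode get
    hdfs dfsadmin -safemode leave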

(2) Files could not be uploaded to HDFS, and files that did upload had no content

Solution: restart HDFS

When the uploads first failed, a Baidu search said the cause was that the firewalls on the DataNode and NameNode had not been turned off; after running a string of commands, I found the firewall had long since been disabled. Later, after I managed to upload a file through Eclipse, I found that no matter what I did, the file would not store any content. At that point, every fix Baidu could offer me was still "turn off the firewall", each attached to yet another hundred-word wall of commands. I thought I was in for a fight to the death with the firewall that day. Then, on impulse, I simply restarted HDFS, and the results appeared. What a marvelous day!
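For the record, restarting HDFS on a pseudo-distributed install is just the two stock scripts under $HADOOP_HOME/sbin:

    stop-dfs.sh     # stops the NameNode, DataNode(s), and SecondaryNameNode
    start-dfs.sh    # starts them again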
