
Configuring Dependencies in IntelliJ IDEA ---- Hadoop in a Plain Java Project

 dzh1121 2015-04-24
  1. Create a plain Java project.

  2. Add the dependencies.

    1. File -> Project Structure... -> Modules -> Dependencies -> + -> Library... -> Java

    2. Select every folder under /usr/local/Cellar/hadoop/2.5.2/libexec/share/hadoop except httpfs. (On Ubuntu the path is /usr/local/hadoop/share/hadoop/.)

    3. The Name can be anything, e.g. "common"; click OK.

    4. + -> Jars or directories...

    5. Select /usr/local/Cellar/hadoop/2.5.2/libexec/share/hadoop/common/lib. (On Ubuntu the path is /usr/local/hadoop/share/hadoop/common/lib.)

    At this point the Dependencies list should contain two new entries in total: one "common" library and one "lib" directory. (A quick check that the classes actually resolve is sketched after this list.)

  3. Edit Artifacts in Project Structure to add a configuration for generating the jar.
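One quick way to confirm the dependency setup is to compile and run a small class that touches the Hadoop API. This is a minimal sketch, not part of the original project; the class name is arbitrary.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.VersionInfo;

public class HadoopClasspathCheck {
    public static void main(String[] args) {
        // If the Hadoop jars added in the steps above are on the module classpath,
        // this compiles and prints the library version.
        Configuration conf = new Configuration();
        System.out.println("Hadoop version: " + VersionInfo.getVersion());
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS", "(not set)"));
    }
}

If this compiles and runs, the project is ready for the MapReduce code below.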

Generating the HelloHadoop jar

Generating the jar is also fairly simple:
1. Choose the menu File->Project Structure to open the Project Structure dialog.
2. Select Artifacts on the left and click the "+" button at the top.
3. In the popup, choose Jar->From modules with dependencies....
4. Select the class to launch, then click OK.
5. After applying, the dialog closes. In IDEA choose the menu Build->Build Artifacts, then Build or Rebuild to generate the jar; the generated jar file is placed under out/artifacts in the project directory.

(This should be enough on Linux, but it does not work on Mac OS X, because on OS X the filesystem is case-insensitive by default. One workaround is to switch to Maven to build the project.)



With the project set up, we can now write the code, compile it, and package it.



Writing the WordCount MapReduce program

The official example code is used directly here:

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class); // Note: this line must be added, otherwise Hadoop cannot find the corresponding class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

Note that the line job.setJarByClass(WordCount.class); must be added to the official code; without it Hadoop cannot locate the job's classes when the job runs on the cluster. A detailed explanation can be found here.


Running the HelloHadoop jar

Copy the generated HelloHadoop.jar to the name node of the Hadoop cluster.
For testing purposes, a simple test data file wctest.txt was written:

this is hadoop test string
hadoop hadoop
test test
string string string

Upload the test file to HDFS:

[hdfs@172-22-195-15 data]$ hdfs dfs -mkdir /user/chenbiaolong/wc_test_input
[hdfs@172-22-195-15 data]$ hdfs dfs -put wctest.txt /user/chenbiaolong/wc_test_input

cd to the directory containing the jar and run the HelloHadoop jar:

[hdfs@172-22-195-15 code]$ cd WorkCount/
[hdfs@172-22-195-15 WorkCount]$ ls
HelloHadoop.jar
[hdfs@172-22-195-15 WorkCount]$ hadoop jar HelloHadoop.jar WordCount /user/chenbiaolong/wc_test_input /user/chenbiaolong/wc_test_output
15/03/26 15:54:19 INFO impl.TimelineClientImpl: Timeline service address: http://:8188/ws/v1/timeline/
15/03/26 15:54:19 INFO client.RMProxy: Connecting to ResourceManager at /172.22.195.17:8050
15/03/26 15:54:20 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/03/26 15:54:20 INFO input.FileInputFormat: Total input paths to process : 1
15/03/26 15:54:21 INFO mapreduce.JobSubmitter: number of splits:1
15/03/26 15:54:21 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1427255014010_0005
15/03/26 15:54:21 INFO impl.YarnClientImpl: Submitted application application_1427255014010_0005
15/03/26 15:54:21 INFO mapreduce.Job: The url to track the job: http://172-22-195-17.com:8088/proxy/application_1427255014010_0005/
15/03/26 15:54:21 INFO mapreduce.Job: Running job: job_1427255014010_0005
15/03/26 15:54:28 INFO mapreduce.Job: Job job_1427255014010_0005 running in uber mode : false
15/03/26 15:54:28 INFO mapreduce.Job: map 0% reduce 0%
15/03/26 15:54:34 INFO mapreduce.Job: map 100% reduce 0%
15/03/26 15:54:41 INFO mapreduce.Job: map 100% reduce 100%
15/03/26 15:54:42 INFO mapreduce.Job: Job job_1427255014010_0005 completed successfully
15/03/26 15:54:43 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=150
FILE: Number of bytes written=225815
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=210
HDFS: Number of bytes written=37
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Rack-local map tasks=1
Total time spent by all maps in occupied slots (ms)=4133
Total time spent by all reduces in occupied slots (ms)=4793
Total time spent by all map tasks (ms)=4133
Total time spent by all reduce tasks (ms)=4793
Total vcore-seconds taken by all map tasks=4133
Total vcore-seconds taken by all reduce tasks=4793
Total megabyte-seconds taken by all map tasks=16928768
Total megabyte-seconds taken by all reduce tasks=19632128
Map-Reduce Framework
Map input records=4
Map output records=12
Map output bytes=120
Map output materialized bytes=150
Input split bytes=137
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=150
Reduce input records=12
Reduce output records=5
Spilled Records=24
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=91
CPU time spent (ms)=3040
Physical memory (bytes) snapshot=1466998784
Virtual memory (bytes) snapshot=8678326272
Total committed heap usage (bytes)=2200961024
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=73
File Output Format Counters
Bytes Written=37
[hdfs@172-22-195-15 WorkCount]$

The results are written to /user/chenbiaolong/wc_test_output:

[hdfs@172-22-195-15 WorkCount]$ hdfs dfs -ls /user/chenbiaolong/wc_test_output
Found 2 items
-rw-r--r-- 3 hdfs hdfs 0 2015-03-26 15:54 /user/chenbiaolong/wc_test_output/_SUCCESS
-rw-r--r-- 3 hdfs hdfs 37 2015-03-26 15:54 /user/chenbiaolong/wc_test_output/part-r-00000
[hdfs@172-22-195-15 WorkCount]$ hdfs dfs -cat /user/chenbiaolong/wc_test_output/part-r-00000
hadoop 3
is 1
string 4
test 3
this 1
[hdfs@172-22-195-15 WorkCount]$

As the output shows, we obtained the correct result.
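Incidentally, the job log above warns that "Hadoop command-line option parsing not performed" and suggests implementing the Tool interface and launching the job with ToolRunner. Below is a rough sketch of such a driver; it assumes it can reuse the Map and Reduce classes from the WordCount code above, and the class name WordCountDriver is made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any generic options parsed by ToolRunner
        Job job = Job.getInstance(getConf(), "wordcount");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.Map.class);
        job.setReducerClass(WordCount.Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips generic options (-D, -files, ...) before calling run()
        System.exit(ToolRunner.run(new Configuration(), new WordCountDriver(), args));
    }
}

It is run the same way with hadoop jar, passing the driver class name instead of WordCount; generic options such as -D key=value are then parsed before run() is called, and the warning disappears.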
