Hadoop WordCount Example


Published: 2017-12-20 10:18


Author: senselyan | Source: Jianshu (简书)


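　　Environment: Ubuntu 14, with JAVA_HOME and HADOOP_HOME configured.

　　For environment setup, see: Installing Hadoop on Ubuntu

　　1. Write WordCount.java

　　The file contains a Mapper class and a Reducer class.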

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: splits each input line into tokens and emits (word, 1).
    public static class WordCountMap extends
            Mapper<LongWritable, Text, Text, IntWritable> {

        private final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer token = new StringTokenizer(line);
            while (token.hasMoreTokens()) {
                word.set(token.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class WordCountReduce extends
            Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");

        // Types of the final (reducer) output key and value.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(WordCountMap.class);
        job.setReducerClass(WordCountReduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // args[0]: input path; args[1]: output path (must not already exist).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Block until the job finishes, printing progress to the console.
        job.waitForCompletion(true);
    }
}
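　　To make the data flow easier to follow, here is a minimal sketch in plain Java (no Hadoop dependency) that mimics the three stages of the job: map, shuffle (group by key), and reduce. The class name LocalWordCountSketch and its helper methods are hypothetical and exist only for illustration; a real job distributes these stages across tasks, which this sketch does not attempt to model.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical local illustration of the map -> shuffle -> reduce flow.
public class LocalWordCountSketch {

    // Map step: emit one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer token = new StringTokenizer(line);
        while (token.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(token.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle step: collect the emitted values per key, ordered by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                  .add(pair.getValue());
        }
        return groups;
    }

    // Reduce step: sum all values that arrived for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> groups = shuffle(map("hello world hello"));
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            System.out.println(group.getKey() + "\t" + reduce(group.getValue()));
        }
    }
}
```

　　For the sample line "hello world hello", the map step emits three pairs, the shuffle step produces two groups (hello and world), and the reduce step sums them to 2 and 1.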

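　　2. Compile WordCount.java

　　Syntax:

　　javac

　　-classpath [jar path 1]:[jar path 2]

　　-d [output directory for the compiled classes] [path to the .java file]

　　Files:

　　Java source file:

　　/opt/data/hadoop/WordCount/WordCount.java

　　Directory for the class files:

　　/opt/data/hadoop/WordCount/class

　　Command: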

　　> javac -classpath /opt/hadoop-1.2.1/hadoop-core-1.2.1.jar:/opt/hadoop-1.2.1/lib/commons-cli-1.2.jar -d class/ WordCount.java

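　　Compiled files (written to class/): WordCount.class, WordCount$WordCountMap.class, WordCount$WordCountReduce.class

　　3. Package (run inside the class/ directory, where the compiled classes are)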

　　> jar -cvf wordcount.jar *.class

　　4. Submit the job

　　Files:

　　Two input files:

　　/opt/data/hadoop/WordCount/input/file1

　　/opt/data/hadoop/WordCount/input/file2

　　file1:

　　hello world hello hadoop hadoop file system hadoop java api hello java

　　file2:

　　new file hadoop file hadoop new world hadoop free home hadoop free school

　　a. Create the input directory on HDFS

　　> hadoop fs -mkdir input_wordcount

　　b. Upload the files to HDFS

　　> hadoop fs -put input/* input_wordcount/

　　c. Submit the job

　　> hadoop jar class/wordcount.jar WordCount input_wordcount output_wordcount

　　d. View the result

　　> hadoop fs -cat output_wordcount/part-r-00000

　　Result:

api 1

file 3

free 2

hadoop 7

hello 3

home 1

java 2

new 2

school 1

system 1

world 2
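　　The result above can be checked without a cluster. The following is a minimal sketch in plain Java that counts the words of the two input files as quoted earlier; the class name ExpectedWordCount and its methods are hypothetical, and the contents of file1 and file2 are embedded as strings rather than read from HDFS. A TreeMap keeps the keys sorted, which for these lowercase ASCII words gives the same order as the job output.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical local check of the expected WordCount result.
public class ExpectedWordCount {

    // Contents of the two input files, as listed in the article.
    static final String FILE1 =
            "hello world hello hadoop hadoop file system hadoop java api hello java";
    static final String FILE2 =
            "new file hadoop file hadoop new world hadoop free home hadoop free school";

    // Count every whitespace-separated token across all given texts.
    static Map<String, Integer> count(String... texts) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String text : texts) {
            StringTokenizer token = new StringTokenizer(text);
            while (token.hasMoreTokens()) {
                counts.merge(token.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Print one "word count" line per key, in sorted key order.
        for (Map.Entry<String, Integer> entry : count(FILE1, FILE2).entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

　　Running it prints the same eleven lines as the job, from "api 1" to "world 2".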

　　Appendix: full command-line session:

root@senselyan-virtual-machine: hadoop jar class/wordcount.jar WordCount input_wordcount output_wordcount

17/12/17 16:33:07 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

17/12/17 16:33:07 INFO input.FileInputFormat: Total input paths to process : 2

17/12/17 16:33:07 INFO util.NativeCodeLoader: Loaded the native-hadoop library

17/12/17 16:33:07 WARN snappy.LoadSnappy: Snappy native library not loaded

17/12/17 16:33:07 INFO mapred.JobClient: Running job: job_201712171254_0001

17/12/17 16:33:08 INFO mapred.JobClient: map 0% reduce 0%

17/12/17 16:33:17 INFO mapred.JobClient: map 100% reduce 0%

17/12/17 16:33:25 INFO mapred.JobClient: map 100% reduce 33%

17/12/17 16:33:27 INFO mapred.JobClient: map 100% reduce 100%

17/12/17 16:33:28 INFO mapred.JobClient: Job complete: job_201712171254_0001

17/12/17 16:33:28 INFO mapred.JobClient: Counters: 29

17/12/17 16:33:28 INFO mapred.JobClient: Job Counters

17/12/17 16:33:28 INFO mapred.JobClient: Launched reduce tasks=1

17/12/17 16:33:28 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=13623

17/12/17 16:33:28 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0

17/12/17 16:33:28 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0

17/12/17 16:33:28 INFO mapred.JobClient: Launched map tasks=2

17/12/17 16:33:28 INFO mapred.JobClient: Data-local map tasks=2

17/12/17 16:33:28 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=9900

17/12/17 16:33:28 INFO mapred.JobClient: File Output Format Counters

17/12/17 16:33:28 INFO mapred.JobClient: Bytes Written=83

17/12/17 16:33:28 INFO mapred.JobClient: FileSystemCounters

17/12/17 16:33:28 INFO mapred.JobClient: FILE_BYTES_READ=301

17/12/17 16:33:28 INFO mapred.JobClient: HDFS_BYTES_READ=383

17/12/17 16:33:28 INFO mapred.JobClient: FILE_BYTES_WRITTEN=156859

17/12/17 16:33:28 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=83

17/12/17 16:33:28 INFO mapred.JobClient: File Input Format Counters

17/12/17 16:33:28 INFO mapred.JobClient: Bytes Read=147

17/12/17 16:33:28 INFO mapred.JobClient: Map-Reduce Framework

17/12/17 16:33:28 INFO mapred.JobClient: Map output materialized bytes=307

17/12/17 16:33:28 INFO mapred.JobClient: Map input records=11

17/12/17 16:33:28 INFO mapred.JobClient: Reduce shuffle bytes=307

17/12/17 16:33:28 INFO mapred.JobClient: Spilled Records=50

17/12/17 16:33:28 INFO mapred.JobClient: Map output bytes=245

17/12/17 16:33:28 INFO mapred.JobClient: Total committed heap usage (bytes)=350224384

17/12/17 16:33:28 INFO mapred.JobClient: CPU time spent (ms)=2510

17/12/17 16:33:28 INFO mapred.JobClient: Combine input records=0

17/12/17 16:33:28 INFO mapred.JobClient: SPLIT_RAW_BYTES=236

17/12/17 16:33:28 INFO mapred.JobClient: Reduce input records=25

17/12/17 16:33:28 INFO mapred.JobClient: Reduce input groups=11

17/12/17 16:33:28 INFO mapred.JobClient: Combine output records=0

17/12/17 16:33:28 INFO mapred.JobClient: Physical memory (bytes) snapshot=615907328

17/12/17 16:33:28 INFO mapred.JobClient: Reduce output records=11

17/12/17 16:33:28 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2537697280

17/12/17 16:33:28 INFO mapred.JobClient: Map output records=25

root@senselyan-virtual-machine: hadoop fs -ls output_wordcount

Found 3 items

-rw-r--r-- 3 root supergroup 0 2017-12-17 16:33 /user/root/output_wordcount/_SUCCESS

drwxr-xr-x - root supergroup 0 2017-12-17 16:33 /user/root/output_wordcount/_logs

-rw-r--r-- 3 root supergroup 83 2017-12-17 16:33 /user/root/output_wordcount/part-r-00000

root@senselyan-virtual-machine: hadoop fs -cat output_wordcount/part-r-00000

Warning: $HADOOP_HOME is deprecated.

api 1

file 3

free 2

hadoop 7

hello 3

home 1

java 2

new 2

school 1

system 1

world 2
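　　Several counters in the job log can be derived from the input by hand, which is a useful sanity check when reading such logs. The sketch below is plain Java with a hypothetical class name, CounterCheck. It rests on a few assumptions: the inputs are the two files quoted earlier, the output uses the default TextOutputFormat layout of key, tab, value, newline, a Text key shorter than 128 bytes is serialized with a one-byte length prefix, and an IntWritable takes four bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical cross-check of selected job counters against the input data.
public class CounterCheck {

    static final String[] INPUTS = {
        "hello world hello hadoop hadoop file system hadoop java api hello java",
        "new file hadoop file hadoop new world hadoop free home hadoop free school"
    };

    // Word -> total count over all inputs (what the reducer writes out).
    static Map<String, Integer> counts() {
        Map<String, Integer> counts = new TreeMap<>();
        for (String input : INPUTS) {
            StringTokenizer token = new StringTokenizer(input);
            while (token.hasMoreTokens()) {
                counts.merge(token.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // "Map output records": the mapper emits one pair per token.
    static int mapOutputRecords() {
        int records = 0;
        for (int count : counts().values()) {
            records += count;
        }
        return records;
    }

    // "Reduce input groups": one group per distinct word.
    static int reduceInputGroups() {
        return counts().size();
    }

    // "Map output bytes": per pair, 1 length byte + key bytes + 4 value bytes
    // (assumed serialization sizes, see the lead-in).
    static int mapOutputBytes() {
        int bytes = 0;
        for (Map.Entry<String, Integer> entry : counts().entrySet()) {
            int keyBytes = entry.getKey().getBytes(StandardCharsets.UTF_8).length;
            bytes += (1 + keyBytes + 4) * entry.getValue();
        }
        return bytes;
    }

    // "Bytes Written": one "key<TAB>value<NEWLINE>" line per distinct word.
    static int bytesWritten() {
        int bytes = 0;
        for (Map.Entry<String, Integer> entry : counts().entrySet()) {
            String line = entry.getKey() + "\t" + entry.getValue() + "\n";
            bytes += line.getBytes(StandardCharsets.UTF_8).length;
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println("Map output records=" + mapOutputRecords());
        System.out.println("Reduce input groups=" + reduceInputGroups());
        System.out.println("Map output bytes=" + mapOutputBytes());
        System.out.println("Bytes Written=" + bytesWritten());
    }
}
```

　　Under these assumptions the sketch yields 25, 11, 245, and 83, matching the Map output records, Reduce input groups, Map output bytes, and Bytes Written values in the log above.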
