Hadoop WordCount Example

Published: 2017-12-20 10:18


 Author: senselyan    Source: Jianshu

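  Environment: Ubuntu 14, with JAVA_HOME and HADOOP_HOME set
  For the environment setup, see: Installing Hadoop on Ubuntu
  1. Write WordCount.java
  The file contains the Mapper class, the Reducer class, and the main method that configures the job.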
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: input is (byte offset of the line, line text); output is (word, 1).
    public static class WordCountMap extends
            Mapper<LongWritable, Text, Text, IntWritable> {
        // Reused across calls to avoid allocating a new object per token.
        private final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // StringTokenizer splits on whitespace; tokens are case-sensitive.
            StringTokenizer token = new StringTokenizer(line);
            while (token.hasMoreTokens()) {
                word.set(token.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: receives (word, [1, 1, ...]) and writes (word, total count).
    public static class WordCountReduce extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the input path, args[1] the output path.
        if (args.length != 2) {
            System.err.println("Usage: WordCount <input path> <output path>");
            System.exit(2);
        }
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");
        // Key/value types written by the job (here also the mapper's output types).
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(WordCountMap.class);
        job.setReducerClass(WordCountReduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Block until the job finishes; return a non-zero exit code on failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
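  To make the data flow of the job easier to follow, here is a rough plain-Java sketch of the three phases (map, shuffle, reduce) that runs without Hadoop. The class LocalWordCount and its methods are hypothetical names invented for this illustration. The TreeMap only approximates the sorted grouping that the framework performs between the map and reduce phases, and the two sample lines are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java walk-through of map -> shuffle -> reduce; no Hadoop classes involved.
final class LocalWordCount {

    // Map phase: like WordCountMap, emit one entry per token (the value 1 is implied).
    static List<String> map(String line) {
        List<String> emitted = new ArrayList<>();
        StringTokenizer token = new StringTokenizer(line); // splits on whitespace
        while (token.hasMoreTokens()) {
            emitted.add(token.nextToken());
        }
        return emitted;
    }

    // Shuffle phase (approximation): group the 1s by word, with keys kept sorted.
    static Map<String, List<Integer>> shuffle(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String w : words) {
            List<Integer> ones = grouped.get(w);
            if (ones == null) {
                ones = new ArrayList<>();
                grouped.put(w, ones);
            }
            ones.add(1);
        }
        return grouped;
    }

    // Reduce phase: like WordCountReduce, sum all values of one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run the three phases over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<String> mapOutput = new ArrayList<>();
        for (String line : lines) {
            mapOutput.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapOutput).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("hello world hello hadoop");
        lines.add("hadoop file system");
        // Prints: file 1, hadoop 2, hello 2, system 1, world 1 (one tab-separated pair per line)
        for (Map.Entry<String, Integer> e : wordCount(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

  In the real job the map calls run in parallel across input splits and the grouping is done by the framework, so this sketch only mirrors the logic, not the execution model.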
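  2. Compile WordCount.java
  Syntax:
  javac
  -classpath [jar path 1]:[jar path 2]
  -d [output directory for the class files] [path to the .java file]
  Files:
  Java source file:
  /opt/data/hadoop/WordCount/WordCount.java
  Class file directory:
  /opt/data/hadoop/WordCount/class
  Command (run from /opt/data/hadoop/WordCount; the class/ directory should already exist, since older javac releases do not create it):
  > javac -classpath /opt/hadoop-1.2.1/hadoop-core-1.2.1.jar:/opt/hadoop-1.2.1/lib/commons-cli-1.2.jar -d class/ WordCount.java
  Files after compilation (in class/):
  WordCount.class
  WordCount$WordCountMap.class
  WordCount$WordCountReduce.class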
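  The javac syntax in this step is an argument list with a colon-separated classpath. As a small illustration, the sketch below assembles that list from its parts. JavacCommand and build are hypothetical helper names, and the hard-coded ":" assumes a Unix-like system such as the Ubuntu machine used here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Assembles the javac argument list used in this step (illustration only).
final class JavacCommand {

    // Result: javac -classpath <jar1>:<jar2>... -d <outDir> <source>
    static List<String> build(List<String> classpath, String outDir, String source) {
        List<String> cmd = new ArrayList<>();
        cmd.add("javac");
        cmd.add("-classpath");
        cmd.add(String.join(":", classpath)); // ':' separates entries on Unix-like systems
        cmd.add("-d");
        cmd.add(outDir);
        cmd.add(source);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build(
                Arrays.asList(
                        "/opt/hadoop-1.2.1/hadoop-core-1.2.1.jar",
                        "/opt/hadoop-1.2.1/lib/commons-cli-1.2.jar"),
                "class/",
                "WordCount.java");
        // Prints the same command line as in the article.
        System.out.println(String.join(" ", cmd));
    }
}
```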
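  3. Package
  Run this inside the class/ directory, so that the jar is created as class/wordcount.jar (the path used when submitting the job) and the class files sit at the root of the jar:
  > jar -cvf wordcount.jar *.class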
  4. Submit the job
  Files:
  Two input files:
  /opt/data/hadoop/WordCount/input/file1
  /opt/data/hadoop/WordCount/input/file2
  file1:
  hello world hello hadoop hadoop file system hadoop java api hello java
  file2:
  new file hadoop file hadoop new world hadoop free home hadoop free school
  a. Create the input directory in HDFS
  > hadoop fs -mkdir input_wordcount
  b. Upload the files to HDFS
  > hadoop fs -put input/* input_wordcount/
  c. Submit the job (the output directory output_wordcount must not exist yet, otherwise the job fails)
  > hadoop jar class/wordcount.jar WordCount input_wordcount output_wordcount
  d. View the result
  > hadoop fs -cat output_wordcount/part-r-00000
  Result:
api     1
file    3
free    2
hadoop  7
hello   3
home    1
java    2
new     2
school  1
system  1
world   2
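  The commands in step 4 use relative HDFS paths such as input_wordcount and output_wordcount. The directory listing in the appendix shows the output under /user/root/, which suggests that relative paths are resolved against the HDFS home directory of the submitting user. A minimal sketch of that rule follows; HdfsPathSketch and resolve are hypothetical names for illustration and not part of the Hadoop API, and the /user/<name> layout is assumed from the listing.

```java
// Illustrates how a relative HDFS path maps to an absolute one (assumed rule).
final class HdfsPathSketch {

    // Assumption: the home directory is /user/<user>, as the appendix listing suggests.
    static String resolve(String user, String path) {
        if (path.startsWith("/")) {
            return path; // already absolute, nothing to resolve
        }
        return "/user/" + user + "/" + path;
    }

    public static void main(String[] args) {
        // Prints: /user/root/output_wordcount/part-r-00000
        System.out.println(resolve("root", "output_wordcount/part-r-00000"));
        // Prints: /tmp/data
        System.out.println(resolve("root", "/tmp/data"));
    }
}
```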
  Appendix: command-line session
root@senselyan-virtual-machine: hadoop jar class/wordcount.jar WordCount input_wordcount output_wordcount
17/12/17 16:33:07 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
17/12/17 16:33:07 INFO input.FileInputFormat: Total input paths to process : 2
17/12/17 16:33:07 INFO util.NativeCodeLoader: Loaded the native-hadoop library
17/12/17 16:33:07 WARN snappy.LoadSnappy: Snappy native library not loaded
17/12/17 16:33:07 INFO mapred.JobClient: Running job: job_201712171254_0001
17/12/17 16:33:08 INFO mapred.JobClient:  map 0% reduce 0%
17/12/17 16:33:17 INFO mapred.JobClient:  map 100% reduce 0%
17/12/17 16:33:25 INFO mapred.JobClient:  map 100% reduce 33%
17/12/17 16:33:27 INFO mapred.JobClient:  map 100% reduce 100%
17/12/17 16:33:28 INFO mapred.JobClient: Job complete: job_201712171254_0001
17/12/17 16:33:28 INFO mapred.JobClient: Counters: 29
17/12/17 16:33:28 INFO mapred.JobClient:   Job Counters
17/12/17 16:33:28 INFO mapred.JobClient:     Launched reduce tasks=1
17/12/17 16:33:28 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=13623
17/12/17 16:33:28 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
17/12/17 16:33:28 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
17/12/17 16:33:28 INFO mapred.JobClient:     Launched map tasks=2
17/12/17 16:33:28 INFO mapred.JobClient:     Data-local map tasks=2
17/12/17 16:33:28 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=9900
17/12/17 16:33:28 INFO mapred.JobClient:   File Output Format Counters
17/12/17 16:33:28 INFO mapred.JobClient:     Bytes Written=83
17/12/17 16:33:28 INFO mapred.JobClient:   FileSystemCounters
17/12/17 16:33:28 INFO mapred.JobClient:     FILE_BYTES_READ=301
17/12/17 16:33:28 INFO mapred.JobClient:     HDFS_BYTES_READ=383
17/12/17 16:33:28 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=156859
17/12/17 16:33:28 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=83
17/12/17 16:33:28 INFO mapred.JobClient:   File Input Format Counters
17/12/17 16:33:28 INFO mapred.JobClient:     Bytes Read=147
17/12/17 16:33:28 INFO mapred.JobClient:   Map-Reduce Framework
17/12/17 16:33:28 INFO mapred.JobClient:     Map output materialized bytes=307
17/12/17 16:33:28 INFO mapred.JobClient:     Map input records=11
17/12/17 16:33:28 INFO mapred.JobClient:     Reduce shuffle bytes=307
17/12/17 16:33:28 INFO mapred.JobClient:     Spilled Records=50
17/12/17 16:33:28 INFO mapred.JobClient:     Map output bytes=245
17/12/17 16:33:28 INFO mapred.JobClient:     Total committed heap usage (bytes)=350224384
17/12/17 16:33:28 INFO mapred.JobClient:     CPU time spent (ms)=2510
17/12/17 16:33:28 INFO mapred.JobClient:     Combine input records=0
17/12/17 16:33:28 INFO mapred.JobClient:     SPLIT_RAW_BYTES=236
17/12/17 16:33:28 INFO mapred.JobClient:     Reduce input records=25
17/12/17 16:33:28 INFO mapred.JobClient:     Reduce input groups=11
17/12/17 16:33:28 INFO mapred.JobClient:     Combine output records=0
17/12/17 16:33:28 INFO mapred.JobClient:     Physical memory (bytes) snapshot=615907328
17/12/17 16:33:28 INFO mapred.JobClient:     Reduce output records=11
17/12/17 16:33:28 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2537697280
17/12/17 16:33:28 INFO mapred.JobClient:     Map output records=25
root@senselyan-virtual-machine: hadoop fs -ls output_wordcount
Found 3 items
-rw-r--r--   3 root supergroup          0 2017-12-17 16:33 /user/root/output_wordcount/_SUCCESS
drwxr-xr-x   - root supergroup          0 2017-12-17 16:33 /user/root/output_wordcount/_logs
-rw-r--r--   3 root supergroup         83 2017-12-17 16:33 /user/root/output_wordcount/part-r-00000
root@senselyan-virtual-machine: hadoop fs -cat output_wordcount/part-r-00000
Warning: $HADOOP_HOME is deprecated.
api     1
file    3
free    2
hadoop  7
hello   3
home    1
java    2
new     2
school  1
system  1
world   2
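
  Several counters in the job log above can be cross-checked against the two sample files. The sketch below recomputes "Map output records", "Map output bytes", "Reduce input groups" and "Bytes Written" in plain Java. CounterCheck is a hypothetical class written for this check. The byte sizes rest on assumptions: each Text key is taken to serialize as one length byte plus its UTF-8 bytes (which should hold for short words), each IntWritable as 4 bytes, and each output line as key, tab, value, newline.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Recomputes some job counters from the sample input (illustration only).
final class CounterCheck {

    static final String FILE1 =
            "hello world hello hadoop hadoop file system hadoop java api hello java";
    static final String FILE2 =
            "new file hadoop file hadoop new world hadoop free home hadoop free school";

    // Word -> count over all texts, sorted by word.
    static Map<String, Integer> counts(String... texts) {
        Map<String, Integer> result = new TreeMap<>();
        for (String text : texts) {
            StringTokenizer token = new StringTokenizer(text);
            while (token.hasMoreTokens()) {
                String word = token.nextToken();
                Integer old = result.get(word);
                result.put(word, old == null ? 1 : old + 1);
            }
        }
        return result;
    }

    // "Map output records": one (word, 1) pair per token.
    static int mapOutputRecords(Map<String, Integer> counts) {
        int records = 0;
        for (int c : counts.values()) {
            records += c;
        }
        return records;
    }

    // "Map output bytes", assuming 1 length byte + UTF-8 key bytes + 4 value bytes per record.
    static int mapOutputBytes(Map<String, Integer> counts) {
        int bytes = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int perRecord = 1 + e.getKey().getBytes(StandardCharsets.UTF_8).length + 4;
            bytes += perRecord * e.getValue();
        }
        return bytes;
    }

    // "Bytes Written", assuming one "key<TAB>value<NEWLINE>" line per word.
    static int bytesWritten(Map<String, Integer> counts) {
        int bytes = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String line = e.getKey() + "\t" + e.getValue() + "\n";
            bytes += line.getBytes(StandardCharsets.UTF_8).length;
        }
        return bytes;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = counts(FILE1, FILE2);
        System.out.println("Map output records = " + mapOutputRecords(c)); // 25
        System.out.println("Map output bytes   = " + mapOutputBytes(c));   // 245
        System.out.println("Reduce input groups = " + c.size());           // 11
        System.out.println("Bytes Written      = " + bytesWritten(c));     // 83
    }
}
```

  Under these assumptions the four values match the log (25, 245, 11 and 83). Counters such as "Map input records" and "Bytes Read" depend on how the input files are split into lines, so they are not reproduced here.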
