9. Format the HDFS file system
Format the HDFS file system with the following command:
hdfs namenode -format
Starting Hadoop
Start HDFS:
start-dfs.sh
Start YARN:
start-yarn.sh
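By default, the HDFS and YARN web consoles listen on ports 50070 and 8088, respectively (Hadoop 3.x moved the HDFS NameNode console to port 9870).
If everything went well, jps lists the running Hadoop services. On my machine the output is: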
29117 NameNode
29675 ResourceManager
29278 DataNode
30002 NodeManager
30123 Jps
29469 SecondaryNameNode
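The check against the jps listing can also be automated. The following is a minimal sketch, not part of Hadoop: the class name JpsCheck and the method missingDaemons are hypothetical, and the list of expected daemons is an assumption taken from the single-node output shown above.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class JpsCheck {

    // Assumption: the five daemons of the single-node setup listed above.
    static final List<String> EXPECTED = Arrays.asList(
            "NameNode", "SecondaryNameNode", "DataNode",
            "ResourceManager", "NodeManager");

    // Returns the expected daemons that do not appear in the given jps output.
    public static Set<String> missingDaemons(String jpsOutput) {
        Set<String> missing = new LinkedHashSet<>(EXPECTED);
        for (String line : jpsOutput.split("\\R")) {
            // Each jps line has the form "<pid> <main class name>".
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length == 2) {
                missing.remove(parts[1].trim());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The jps output quoted in this article.
        String output = String.join("\n",
                "29117 NameNode", "29675 ResourceManager", "29278 DataNode",
                "30002 NodeManager", "30123 Jps", "29469 SecondaryNameNode");
        Set<String> missing = missingDaemons(output);
        System.out.println(missing.isEmpty()
                ? "All expected daemons are running"
                : "Missing daemons: " + missing);
    }
}
```

Names are compared exactly, so a running SecondaryNameNode is not mistaken for a NameNode.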
Running a Hadoop job
The well-known WordCount example below shows how to use Hadoop.
1. Prepare the program package
The WordCount source code follows.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
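To see what TokenizerMapper and IntSumReducer compute together without a cluster, the same logic can be sketched in plain Java. This is only an illustration under stated assumptions: the class LocalWordCount and its count method are hypothetical, nothing here uses the Hadoop API, and a sorted in-memory map stands in for the shuffle and sort that Hadoop performs between the two phases.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {

    // Mirrors the job's logic: split each line on whitespace (map step),
    // then add up one occurrence per token under its word (reduce step).
    public static Map<String, Integer> count(Iterable<String> lines) {
        // A TreeMap keeps the words sorted, loosely like reducer input keys.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not the article's input files.
        Map<String, Integer> counts = count(Arrays.asList(
                "Hadoop YARN Hadoop", "YARN MapReduce Hadoop"));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Tab-separated, one word per line.
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

As in the real job, tokens are case-sensitive and keep any attached punctuation, so "The", "the" and "the." count as three different words.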
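Compile the code and package it (bin/hadoop is relative to the Hadoop installation directory; plain hadoop also works if it is on the PATH, as in the later steps):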
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
bin/hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
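wc.jar is the packaged Hadoop MapReduce program.
2. Prepare the input files
Our Hadoop MapReduce program reads its input files from HDFS and writes its output back to HDFS. This article uses wordcount/input and wordcount/output as the test program's input and output directories. Both are relative paths, which HDFS resolves against the current user's home directory (/user/<username>).
Create the input directory on HDFS: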
hdfs dfs -mkdir -p wordcount/input
Prepare some text files as test data. This article uses the following two files:
File 1: input1
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. The project includes these modules: Hadoop Common: The common utilities that support the other Hadoop modules. Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data. Hadoop YARN: A framework for job scheduling and cluster resource management. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
File 2: input2
Apache Hadoop 2.6.0 is a minor release in the 2.x.y release line, building upon the previous stable release 2.4.1. Here is a short overview of the major features and improvements. Common Authentication improvements when using an HTTP proxy server. This is useful when accessing WebHDFS via a proxy server. A new Hadoop metrics sink that allows writing directly to Graphite. Specification work related to the Hadoop Compatible Filesystem (HCFS) effort. HDFS Support for POSIX-style filesystem extended attributes. See the user documentation for more details. Using the OfflineImageViewer, clients can now browse an fsimage via the WebHDFS API. The NFS gateway received a number of supportability improvements and bug fixes. The Hadoop portmapper is no longer required to run the gateway, and the gateway is now able to reject connections from unprivileged ports. The SecondaryNameNode, JournalNode, and DataNode web UIs have been modernized with HTML5 and Javascript. YARN YARN's REST APIs now support write/modify operations. Users can submit and kill applications through REST APIs. The timeline store in YARN, used for storing generic and application-specific information for applications, supports authentication through Kerberos. The Fair Scheduler supports dynamic hierarchical user queues, user queues are created dynamically at runtime under any specified parent-queue.
Copy both files to wordcount/input:
hdfs dfs -copyFromLocal input* wordcount/input/
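3. Run the program
Run the program on Hadoop. The output directory must not exist yet, because the job refuses to overwrite an existing one: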
hadoop jar wc.jar WordCount wordcount/input wordcount/output
The results are written to wordcount/output. List the output directory:
hdfs dfs -ls wordcount/output
View the results:
hdfs dfs -cat wordcount/output/part-r-00000
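With the default text output format, each line of part-r-00000 holds a word and its count separated by a tab, ordered by word. To rank the words by frequency instead, the lines can be post-processed. The following is a minimal sketch: the class TopWords and its top method are hypothetical names, and the sample lines in main are made up for illustration, not real output of the job above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class TopWords {

    // Parses "word<TAB>count" lines and returns the n most frequent words.
    public static List<Map.Entry<String, Integer>> top(List<String> lines, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // skip blank or malformed lines
            }
            String word = line.substring(0, tab);
            Integer count = Integer.valueOf(line.substring(tab + 1).trim());
            entries.add(new SimpleEntry<String, Integer>(word, count));
        }
        // Highest count first; the sort is stable, so ties keep input order.
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        // Hypothetical sample lines in the same layout as part-r-00000.
        List<String> sample = Arrays.asList("Hadoop\t9", "YARN\t4", "the\t12");
        for (Map.Entry<String, Integer> e : top(sample, 2)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

To run it on real results, the output of the hdfs dfs -cat command above could be saved to a local file and its lines passed to top.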