Original article: http://blog.csdn.net/nsrainbow/article/details/43426061 . For the latest lessons and a better reading experience, please follow the original author's blog.
With Pig's grunt shell you can work with the HDFS space just as if you were in a Linux shell (an example follows below).
Moreover, Pig provides the Pig Latin language: you can write something that reads much like a stored procedure, and Pig translates it into a MapReduce job and runs it for you, so you no longer have to write the raw Java code yourself.
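To give a first impression of what Pig Latin looks like, here is a minimal word-count sketch (the classic MapReduce example). It is only an illustration; the input path /user/pig/input.txt is a hypothetical file of plain text lines:

-- load every line of the input as a single chararray field
lines   = LOAD '/user/pig/input.txt' AS (line:chararray);
-- split each line into words and flatten the bag of tokens
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- group identical words together and count each group
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
DUMP counts;

The equivalent hand-written MapReduce job would need a mapper class, a reducer class and a driver; here it is five statements.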
Install Pig:

$ yum install pig

Add the following line to /etc/profile so Pig can find the MapReduce libraries:

export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

Then run source so the setting takes effect:
[root@host1 impala]# source /etc/profile
[root@host1 impala]# echo $HADOOP_MAPRED_HOME
/usr/lib/hadoop-mapreduce
Start the grunt shell:

$ pig

A pile of log output scrolls by, ending with:
2015-02-02 08:29:03,302 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt>

Try the ls command; it lists the files under the current user's home directory:
grunt> ls
hdfs://mycluster/user/root/.staging    <dir>
hdfs://mycluster/user/root/employee    <dir>
hdfs://mycluster/user/root/people      <dir>

You can also cd up one level and ls again:
grunt> cd ..
grunt> ls
hdfs://mycluster/user/cloudera     <dir>
hdfs://mycluster/user/history      <dir>
hdfs://mycluster/user/hive         <dir>
hdfs://mycluster/user/root         <dir>
hdfs://mycluster/user/test3        <dir>
hdfs://mycluster/user/test_hive    <dir>

See? Isn't that a lot nicer than typing out the whole hdfs dfs -ls / command every time?
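Besides ls and cd, the grunt shell understands most of the usual file commands: pwd, cat, cp, mv, rm, mkdir, copyFromLocal, copyToLocal and so on. A few illustrative calls, assuming a hypothetical file named data.txt in the current HDFS directory:

grunt> pwd
grunt> cat data.txt
grunt> cp data.txt data_backup.txt
grunt> rm data_backup.txt
grunt> copyToLocal data.txt /tmp/data.txt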
Here is a small example: use Pig to pick the WARN lines out of a log file. First prepare a local file /root/logs with the following content:

(Dec 10 01:22:11 NetworkManager: <INFO> hello world [start]
(Dec 10 03:56:43 NetworkManager: <WARN> Oops! There is an error!
(Dec 10 04:10:18 NetworkManager: <WARN> Please check the database ...
(Dec 10 05:22:11 NetworkManager: <INFO> hello world [end]

Upload it to HDFS from the grunt shell:

grunt> cd hdfs://mycluster/
grunt> cd user
grunt> mkdir pig
grunt> cd pig
grunt> copyFromLocal /root/logs logs

(The same upload can also be done from an ordinary shell with: hdfs dfs -put /root/logs /user/pig)

Now filter out the WARN lines:

grunt> messages = LOAD '/user/pig/logs';
grunt> warns = FILTER messages BY $0 MATCHES '.*WARN+.*';
grunt> DUMP warns;

The result of the run:
((Dec 10 03:56:43 NetworkManager: <WARN> Oops! There is an error!)
((Dec 10 04:10:18 NetworkManager: <WARN> Please check the database ...)

The whole job is nothing more than these three statements:

messages = LOAD '/user/pig/logs';
warns = FILTER messages BY $0 MATCHES '.*WARN+.*';
DUMP warns;

A MapReduce task that would normally require a lot of Java code has been boiled down to three lines. That is how concise it is.
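DUMP only prints the relation to the console. If you want to keep the filtered lines, you can STORE the relation back into HDFS instead. A small sketch, assuming the warns relation from above and a hypothetical, not yet existing output directory /user/pig/warns:

grunt> STORE warns INTO '/user/pig/warns' USING PigStorage();
grunt> ls /user/pig/warns

The output directory will contain the usual part-* files produced by the underlying MapReduce job.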
Pig Latin has many more powerful features, such as matrix calculations and interface definitions; see the official Pig documentation for details.

Pig can also write data straight into HBase. For this example, prepare a file /root/customers with the following content:

4000001,Kristina,Chung,55,Pilot
4000002,Paige,Chen,74,Teacher
4000003,Sherri,Melton,34,Firefighter
4000004,Gretchen,Hill,66,Computer hardware engineer
4000005,Karen,Puckett,74,Lawyer
4000006,Patrick,Song,42,Veterinarian
4000007,Elsie,Hamilton,43,Pilot
4000008,Hazel,Bender,63,Carpenter

and upload it to the /user/pig directory on HDFS:
grunt> cd /user/pig
grunt> copyFromLocal /root/customers ./customers

Create the target table in the HBase shell:

hbase(main):001:0> create 'customers', 'customers_data'

Then write the following Pig script and save it as /root/Load_HBase_Customers.pig:

raw_data = LOAD 'hdfs:/user/pig/customers' USING PigStorage(',') AS (
    custno:chararray,
    firstname:chararray,
    lastname:chararray,
    age:int,
    profession:chararray);
STORE raw_data INTO 'hbase://customers' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
    'customers_data:firstname customers_data:lastname customers_data:age customers_data:profession');

In this example the first column, custno, is used as the HBase rowkey.
Run the Load_HBase_Customers.pig script, putting the HBase and ZooKeeper client jars on Pig's classpath (adjust the jar versions to whatever you actually have installed):

$ PIG_CLASSPATH=/usr/lib/hbase/hbase-client-0.98.6-cdh5.2.1.jar:/usr/lib/zookeeper/zookeeper-3.4.5-cdh5.3.0.jar /usr/bin/pig /root/Load_HBase_Customers.pig
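Going the other way, the same HBaseStorage class can be used as a loader to read the table back into Pig. A minimal sketch (run with the same PIG_CLASSPATH as above), assuming the customers table from this example; the '-loadKey true' option returns the rowkey as the first field, and the column values come back as text here:

-- read the rowkey plus four columns from the customers table
data = LOAD 'hbase://customers'
       USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
           'customers_data:firstname customers_data:lastname customers_data:age customers_data:profession',
           '-loadKey true')
       AS (custno:chararray, firstname:chararray, lastname:chararray, age:chararray, profession:chararray);
DUMP data;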
When the job finishes, check the result in the HBase shell:

hbase(main):001:0> scan 'customers'

Pig can also work with Hive through HCatalog. HCatalog frees data consumers from having to know where or how their data is stored. It relies on Hive's metastore service, so other services such as Pig can use HCatalog to access the tables defined in the Hive metastore.
Create a table in Hive:

hive> CREATE TABLE occupations (code STRING, description STRING, salary INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Prepare a data file /root/occupations.txt:

11-0000,Management occupations,96150
11-1011,Chief executives,151370
11-1021,General and operations managers,103780
11-1031,Legislators,33880
11-2011,Advertising and promotions managers,91100

Load it into the table:

hive> LOAD DATA LOCAL INPATH '/root/occupations.txt' INTO TABLE occupations;

Now read the table from Pig through HCatalog and compute the average salary:

occ_data = LOAD 'occupations' USING org.apache.hcatalog.pig.HCatLoader();
salaries = GROUP occ_data ALL;
out = FOREACH salaries GENERATE AVG(occ_data.salary);
DUMP out;

and look at the computed average salary in the output.
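Note that HCatLoader is not on Pig's classpath by default. If the four Pig Latin statements above are saved into a script, say /root/avg_salary.pig (a hypothetical name), one common way to run it is with Pig's -useHCatalog switch, which pulls in the HCatalog and Hive metastore jars for you:

$ pig -useHCatalog /root/avg_salary.pig

If your Pig build does not support -useHCatalog, the alternative is to add the HCatalog and Hive jars to PIG_CLASSPATH by hand, as was done for HBase above.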
In fact, this whole script is equivalent to this single Hive statement:
SELECT AVG(salary) FROM occupations;

Does that make Pig look even more cumbersome than Hive? Not really; this example is only meant to show how Pig and Hive interact, and is not a comparison of the two.
That is all for this introduction to Pig; for the more advanced features, please explore the Pig documentation on your own.