Overview
A bucketed table hashes its data and distributes the rows across several files for storage.
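This hash-then-modulo placement can be sketched in a few lines of Python. The `java_string_hashcode` helper mimics Java's `String.hashCode`, which matches how Hive hashes the string ids in this experiment; `bucket_for` is an illustrative helper, not a Hive API:

```python
def java_string_hashcode(s: str) -> int:
    """Mimic Java's String.hashCode (matches Hive's hashing of the string ids here)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep Java's 32-bit overflow semantics
    return h - 0x100000000 if h >= 0x80000000 else h  # convert to signed 32-bit

def bucket_for(value: str, num_buckets: int) -> int:
    # Hive-style placement: mask to a non-negative int, then modulo the bucket count.
    return (java_string_hashcode(value) & 0x7FFFFFFF) % num_buckets

# The six sample rows land in buckets like this (0-based; bucket 0 is
# "bucket 1" in TABLESAMPLE terms):
for row in ["1", "2", "3", "4", "5", "6"]:
    print(row, "->", bucket_for(row, 6))
```

Note that `"6"` hashes to 54 and 54 % 6 = 0, which is why the `6` rows show up later under `tablesample (bucket 1 out of 6 on id)`.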
When data is loaded into a bucketed table, Hive hashes the bucketing column, takes the value modulo the number of buckets, and writes each row to the corresponding file. Physically, each bucket is one file in the table's (or partition's) directory, and a job produces as many buckets (output files) as it has reduce tasks.

Purpose
Bucketed tables exist specifically for sampling queries; they are a specialized feature, not tables for everyday data storage. Create and use one only when you need to sample data.

Experiment
Create the table
[22:39:03]hive (zmgdb)> create table bucket_t1(id string)
[22:39:26]          > clustered by(id) into 6 buckets;
[22:39:27]OK
[22:39:27]Time taken: 0.546 seconds

clustered by specifies the column to bucket by: Hive hashes id and distributes the rows (effectively at random) across the 6 buckets.
-----------------------------
Prepare the data
[root@hello110 data]# vi bucket_test
1
2
3
4
5
6
.............
.........

Load the data
The correct way to load: insert from an ordinary table that stores the data day to day.

[21:27:45]hive (zmgdb)> create table t2(id string);
[21:27:45]OK
[21:27:45]Time taken: 0.073 seconds
[21:28:24]hive (zmgdb)> load data local inpath '/data/bucket_test' into table t2;
[21:28:24]Loading data to table zmgdb.t2
[21:28:25]OK
Insert from the ordinary table
[22:39:47]hive (zmgdb)> insert overwrite table bucket_t1 select id from t2;
Hive launches a MapReduce job:

[22:39:48]WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
[22:39:48]Query ID = hadoop_20160922063946_34bf30c4-3f23-43e9-ad8f-edd5ee214948
[22:39:48]Total jobs = 1
[22:39:48]Launching Job 1 out of 1
[22:39:48]Number of reduce tasks determined at compile time: 6
[22:39:48]In order to change the average load for a reducer (in bytes):
[22:39:48]  set hive.exec.reducers.bytes.per.reducer=<number>
[22:39:48]In order to limit the maximum number of reducers:
[22:39:48]  set hive.exec.reducers.max=<number>
[22:39:48]In order to set a constant number of reducers:
[22:39:48]  set mapreduce.job.reduces=<number>
[22:39:51]Starting Job = job_1474497386931_0001, Tracking URL = http://hello110:8088/proxy/application_1474497386931_0001/
[22:39:51]Kill Command = /home/hadoop/app/hadoop-2.7.2/bin/hadoop job -kill job_1474497386931_0001
[22:39:59]Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 6
[22:39:59]2016-09-22 06:39:59,419 Stage-1 map = 0%, reduce = 0%
[22:40:06]2016-09-22 06:40:05,828 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.63 sec
[22:40:12]2016-09-22 06:40:12,347 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 3.48 sec
[22:40:16]2016-09-22 06:40:15,739 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 5.4 sec
[22:40:17]2016-09-22 06:40:16,807 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 7.52 sec
[22:40:19]2016-09-22 06:40:18,929 Stage-1 map = 100%, reduce = 83%, Cumulative CPU 11.35 sec
[22:40:20]2016-09-22 06:40:19,991 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.19 sec
[22:40:21]MapReduce Total cumulative CPU time: 13 seconds 190 msec
[22:40:21]Ended Job = job_1474497386931_0001
[22:40:21]Loading data to table zmgdb.bucket_t1
[22:40:22]MapReduce Jobs Launched:
[22:40:22]Stage-Stage-1: Map: 1 Reduce: 6 Cumulative CPU: 13.19 sec HDFS Read: 25355 HDFS Write: 1434 SUCCESS
[22:40:22]Total MapReduce CPU Time Spent: 13 seconds 190 msec
[22:40:22]OK
[22:40:22]id
[22:40:22]Time taken: 34.91 seconds

The wrong way to load: load data straight from a file.
hive (zmgdb)> create table bucket_t2 like bucket_t1;
OK
Time taken: 0.707 seconds
hive (zmgdb)> load data local inpath '/data/bucket_test' into table bucket_t2;
Loading data to table zmgdb.bucket_t2
OK
Time taken: 1.485 seconds

No MapReduce job is launched, so the data is never hashed; it is simply copied in as-is and cannot serve the purpose of sampling queries. Comparing select * from each table shows the rows in different orders: bucket_t2 keeps the order of the original data file because no hashing was done.

Query
select * from bucket_table tablesample(bucket x out of y on column);

tablesample is the sampling clause.

Syntax: TABLESAMPLE(BUCKET x OUT OF y ON column). y must be a multiple or a factor of the table's total bucket count, and Hive decides the sampling fraction from y. For example, if the table is split into 64 buckets: with y=32, (64/32)=2 buckets are sampled; with y=128, (64/128)=1/2 of one bucket is sampled. x indicates which bucket sampling starts from. For example, with 32 buckets in total, tablesample(bucket 3 out of 16) samples 32/16=2 buckets: bucket 3 and bucket 3+16=19. With y=64, half of bucket 3 would be sampled.

[22:44:31]hive (zmgdb)> select * from bucket_t1 tablesample (bucket 1 out of 6 on id);
[22:44:31]OK
[22:44:31]bucket_t1.id
[22:44:31]6
[22:44:31]iu
[22:44:31]0
[22:44:31]6
[22:44:31]hj
[22:44:31]6
[22:44:31]6
[22:44:31]51
[22:44:31]
[22:44:31]
[22:44:31]r
[22:44:31]99
[22:44:31]0
[22:44:31]57
[22:44:31]loo
[22:44:31]r
[22:44:31]r
[22:44:31]r
[22:44:31]60
[22:44:31]66
[22:44:31]75
[22:44:31]6
[22:44:31]84
[22:44:31]x
[22:44:31]24
[22:44:31]93
[22:44:31]99
[22:44:31]105
[22:44:31]f
[22:44:31]r
[22:44:31]114
[22:44:31]0
[22:44:31]123
[22:44:31]129
[22:44:31]132
[22:44:31]x
[22:44:31]138
[22:44:31]141
[22:44:31]147
[22:44:31]33
[22:44:31]150
[22:44:31]156
[22:44:31]r
[22:44:31]f
[22:44:31]39
[22:44:31]15
[22:44:31]r
[22:44:31]ddd
[22:44:31]
[22:44:31]06
[22:44:31]hj
[22:44:31]f
[22:44:31]l
[22:44:31]f
[22:44:31]f
[22:44:31]f
[22:44:31]f
[22:44:31]42
[22:44:31]f
[22:44:31]r
[22:44:31]r
[22:44:31]f
[22:44:31]f
[22:44:31]r
[22:44:31]48
[22:44:31]6
[22:44:31]Time taken: 0.142 seconds, Fetched: 66 row(s)

[22:44:43]hive (zmgdb)> select * from bucket_t1 tablesample (bucket 1 out of 60 on id);
[22:44:43]OK
[22:44:43]bucket_t1.id
[22:44:43]
[22:44:43]
[22:44:43]loo
[22:44:43]x
[22:44:43]114
[22:44:43]132
[22:44:43]x
[22:44:43]150
[22:44:43]ddd
[22:44:43]
[22:44:43]Time taken: 0.064 seconds, Fetched: 10 row(s)
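The bucket-selection arithmetic above (sample total/y buckets, starting at bucket x and stepping by y) can be sketched as follows; `sampled_buckets` is an illustrative helper, not part of Hive:

```python
def sampled_buckets(total: int, x: int, y: int) -> list:
    """Buckets read by TABLESAMPLE(BUCKET x OUT OF y), numbered from 1.

    Assumes y is a factor of the total bucket count. When y exceeds the
    total (e.g. y=64 on 32 buckets), Hive instead reads a fraction
    (total/y) of a single bucket, which this sketch does not model.
    """
    assert total % y == 0, "y must be a factor of the bucket count here"
    # total/y buckets: x, x+y, x+2y, ...
    return list(range(x, total + 1, y))

print(sampled_buckets(32, 3, 16))  # buckets 3 and 19, as in the example above
print(sampled_buckets(6, 1, 6))    # just the first of bucket_t1's 6 buckets
```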