Hive连接JOIN用例详解
SQL里面通常都会用Join来连接两个表,做复杂的关联查询。比如用户表和订单表,能通过join得到某个用户购买的产品;或者某个产品被购买的人群....
Hive也支持这样的操作,而且由于Hive底层运行在hadoop上,因此有很多地方可以进行优化。比如小表到大表的连接操作、小表进行缓存、大表进行避免缓存等等...
下面就来看看hive里面的连接操作吧!其实跟SQL还是差不多的...
数据准备:创建数据-->创建表-->导入数据
首先创建两个原始数据的文件,这两个文件分别有三列,第一列是id、第二列是名称、第三列是另外一个表的id。通过第二列可以明显的看到两个表做连接查询的结果:
[xingoo@localhosttmp]$cataa.txt
1a3
2b4
3c1
[xingoo@localhosttmp]$catbb.txt
1xxx2
2yyy3
3zzz5
接下来创建两个表,需要注意的是表的字段分隔符为空格,另一个表可以直接基于当前的表创建。
hive>createtableaa
>(astring,bstring,cstring)
>rowformatdelimited
>fieldsterminatedby'''';
OK
Timetaken:0.19seconds
hive>createtablebblikeaa;
OK
Timetaken:0.188seconds
查看两个表的结构:
hive>describeaa;
OK
astring
bstring
cstring
Timetaken:0.068seconds,Fetched:3row(s)
hive>describebb;
OK
astring
bstring
cstring
Timetaken:0.045seconds,Fetched:3row(s)
下面可以基于本地的文件,导入数据
hive>loaddatalocalinpath''/usr/tmp/aa.txt''overwriteintotableaa;
Loadingdatatotabletest.aa
OK
Timetaken:0.519seconds
hive>loaddatalocalinpath''/usr/tmp/bb.txt''overwriteintotablebb;
Loadingdatatotabletest.bb
OK
Timetaken:0.321seconds
内连接
内连接即基于on语句,仅列出表1和表2符合连接条件的数据。
hive>selectfromaaajoinbbbona.c=b.a;
WARNING:Hive-on-MRisdeprecatedinHive2andmaynotbeavailableinthefutureversions.Considerusingadifferentexecutionengine(i.e.spark,tez)orusingHive1.Xreleases.
QueryID=root_20160824161233_f9ecefa2-e5d7-416d-8d90-e191937e7313
Totaljobs=1
SLF4J:ClasspathcontainsmultipleSLF4Jbindings.
SLF4J:Foundbindingin[jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J:Foundbindingin[jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J:Seehttp://www.slf4j.org/codes.html#multiple_bindingsforanexplanation.
SLF4J:Actualbindingisoftype[org.apache.logging.slf4j.Log4jLoggerFactory]
2016-08-2416:12:44Startingtolaunchlocaltasktoprocessmapjoin;maximummemory=518979584
2016-08-2416:12:47Dumptheside-tablefortag:0withgroupcount:3intofile:file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-12-33_145_337836390845333215-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable
2016-08-2416:12:47Uploaded1Fileto:file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-12-33_145_337836390845333215-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable(332bytes)
2016-08-2416:12:47Endoflocaltask;TimeTaken:3.425sec.
Executioncompletedsuccessfully
MapredLocaltasksucceeded
LaunchingJob1outof1
Numberofreducetasksissetto0sincethere''snoreduceoperator
Jobrunningin-process(localHadoop)
2016-08-2416:12:50,222Stage-3map=100%,reduce=0%
EndedJob=job_local944389202_0007
MapReduceJobsLaunched:
Stage-Stage-3:HDFSRead:1264HDFSWrite:90SUCCESS
TotalMapReduceCPUTimeSpent:0msec
OK
3c11xxx2
1a33zzz5
Timetaken:17.083seconds,Fetched:2row(s)
左连接
左连接是显示左边的表的所有数据,如果有右边表与之对应,则显示;否则显示null
ive>selectfromaaaleftouterjoinbbbona.c=b.a;
WARNING:Hive-on-MRisdeprecatedinHive2andmaynotbeavailableinthefutureversions.Considerusingadifferentexecutionengine(i.e.spark,tez)orusingHive1.Xreleases.
QueryID=root_20160824161637_6d540592-13fd-4f59-a2cf-0a91c0fc9533
Totaljobs=1
SLF4J:ClasspathcontainsmultipleSLF4Jbindings.
SLF4J:Foundbindingin[jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J:Foundbindingin[jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J:Seehttp://www.slf4j.org/codes.html#multiple_bindingsforanexplanation.
SLF4J:Actualbindingisoftype[org.apache.logging.slf4j.Log4jLoggerFactory]
2016-08-2416:16:48Startingtolaunchlocaltasktoprocessmapjoin;maximummemory=518979584
2016-08-2416:16:51Dumptheside-tablefortag:1withgroupcount:3intofile:file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-16-37_813_4572869866822819707-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable
2016-08-2416:16:51Uploaded1Fileto:file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-16-37_813_4572869866822819707-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable(338bytes)
2016-08-2416:16:51Endoflocaltask;TimeTaken:2.634sec.
Executioncompletedsuccessfully
MapredLocaltasksucceeded
LaunchingJob1outof1
Numberofreducetasksissetto0sincethere''snoreduceoperator
Jobrunningin-process(localHadoop)
2016-08-2416:16:53,843Stage-3map=100%,reduce=0%
EndedJob=job_local1670258961_0008
MapReduceJobsLaunched:
Stage-Stage-3:HDFSRead:1282HDFSWrite:90SUCCESS
TotalMapReduceCPUTimeSpent:0msec
OK
1a33zzz5
2b4NULLNULLNULL
3c11xxx2
Timetaken:16.048seconds,Fetched:3row(s)
右连接
类似左连接,同理。
hive>selectfromaaarightouterjoinbbbona.c=b.a;
WARNING:Hive-on-MRisdeprecatedinHive2andmaynotbeavailableinthefutureversions.Considerusingadifferentexecutionengine(i.e.spark,tez)orusingHive1.Xreleases.
QueryID=root_20160824162227_5d0f0090-1a9b-4a3f-9e82-e93c4d180f4b
Totaljobs=1
SLF4J:ClasspathcontainsmultipleSLF4Jbindings.
SLF4J:Foundbindingin[jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J:Foundbindingin[jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J:Seehttp://www.slf4j.org/codes.html#multiple_bindingsforanexplanation.
SLF4J:Actualbindingisoftype[org.apache.logging.slf4j.Log4jLoggerFactory]
2016-08-2416:22:37Startingtolaunchlocaltasktoprocessmapjoin;maximummemory=518979584
2016-08-2416:22:40Dumptheside-tablefortag:0withgroupcount:3intofile:file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-22-27_619_7820027359528638029-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile20--.hashtable
2016-08-2416:22:40Uploaded1Fileto:file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-22-27_619_7820027359528638029-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile20--.hashtable(332bytes)
2016-08-2416:22:40Endoflocaltask;TimeTaken:2.368sec.
Executioncompletedsuccessfully
MapredLocaltasksucceeded
LaunchingJob1outof1
Numberofreducetasksissetto0sincethere''snoreduceoperator
Jobrunningin-process(localHadoop)
2016-08-2416:22:43,060Stage-3map=100%,reduce=0%
EndedJob=job_local2001415675_0009
MapReduceJobsLaunched:
Stage-Stage-3:HDFSRead:1306HDFSWrite:90SUCCESS
TotalMapReduceCPUTimeSpent:0msec
OK
3c11xxx2
NULLNULLNULL2yyy3
1a33zzz5
Timetaken:15.483seconds,Fetched:3row(s)
全连接
相当于表1和表2的数据都显示,如果没有对应的数据,则显示Null.
hive>selectfromaaafullouterjoinbbbona.c=b.a;
WARNING:Hive-on-MRisdeprecatedinHive2andmaynotbeavailableinthefutureversions.Considerusingadifferentexecutionengine(i.e.spark,tez)orusingHive1.Xreleases.
QueryID=root_20160824162252_c71b2fae-9768-4b9a-b5ad-c06d7cdb60fb
Totaljobs=1
LaunchingJob1outof1
Numberofreducetasksnotspecified.Estimatedfrominputdatasize:1
Inordertochangetheaverageloadforareducer(inbytes):
sethive.exec.reducers.bytes.per.reducer=
Inordertolimitthemaximumnumberofreducers:
sethive.exec.reducers.max=
Inordertosetaconstantnumberofreducers:
setmapreduce.job.reduces=
Jobrunningin-process(localHadoop)
2016-08-2416:22:54,111Stage-1map=100%,reduce=100%
EndedJob=job_local1766586034_0010
MapReduceJobsLaunched:
Stage-Stage-1:HDFSRead:4026HDFSWrite:270SUCCESS
TotalMapReduceCPUTimeSpent:0msec
OK
3c11xxx2
NULLNULLNULL2yyy3
1a33zzz5
2b4NULLNULLNULL
Timetaken:1.689seconds,Fetched:4row(s)
左半开连接
这个比较特殊,SEMI-JOIN仅仅会显示表1的数据,即左边表的数据。但是效率会比左连接快,因为他会先拿到表1的数据,然后在表2中查找,只要查找到结果立马就返回数据。
hive>selectfromaaaleftsemijoinbbbona.c=b.a;
WARNING:Hive-on-MRisdeprecatedinHive2andmaynotbeavailableinthefutureversions.Considerusingadifferentexecutionengine(i.e.spark,tez)orusingHive1.Xreleases.
QueryID=root_20160824162327_e7fc72a7-ef91-4d39-83bc-ff8159ea8816
Totaljobs=1
SLF4J:ClasspathcontainsmultipleSLF4Jbindings.
SLF4J:Foundbindingin[jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J:Foundbindingin[jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J:Seehttp://www.slf4j.org/codes.html#multiple_bindingsforanexplanation.
SLF4J:Actualbindingisoftype[org.apache.logging.slf4j.Log4jLoggerFactory]
2016-08-2416:23:37Startingtolaunchlocaltasktoprocessmapjoin;maximummemory=518979584
2016-08-2416:23:41Dumptheside-tablefortag:1withgroupcount:3intofile:file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-23-27_008_3026796648107813784-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile31--.hashtable
2016-08-2416:23:41Uploaded1Fileto:file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-23-27_008_3026796648107813784-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile31--.hashtable(317bytes)
2016-08-2416:23:41Endoflocaltask;TimeTaken:3.586sec.
Executioncompletedsuccessfully
MapredLocaltasksucceeded
LaunchingJob1outof1
Numberofreducetasksissetto0sincethere''snoreduceoperator
Jobrunningin-process(localHadoop)
2016-08-2416:23:43,798Stage-3map=100%,reduce=0%
EndedJob=job_local521961878_0011
MapReduceJobsLaunched:
Stage-Stage-3:HDFSRead:1366HDFSWrite:90SUCCESS
TotalMapReduceCPUTimeSpent:0msec
OK
1a3
3c1
Timetaken:16.811seconds,Fetched:2row(s)
笛卡尔积
笛卡尔积会针对表1和表2的每条数据做连接...
hive>selectfromaajoinbb;
Warning:MapJoinMAPJOIN[9][bigTable=?]intask''Stage-3:MAPRED''isacrossproduct
WARNING:Hive-on-MRisdeprecatedinHive2andmaynotbeavailableinthefutureversions.Considerusingadifferentexecutionengine(i.e.spark,tez)orusingHive1.Xreleases.
QueryID=root_20160824162449_20e4b5ec-768f-48cf-a840-7d9ff360975f
Totaljobs=1
SLF4J:ClasspathcontainsmultipleSLF4Jbindings.
SLF4J:Foundbindingin[jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J:Foundbindingin[jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J:Seehttp://www.slf4j.org/codes.htmwww.wang027.coml#multiple_bindingsforanexplanation.
SLF4J:Actualbindingisoftype[org.apache.logging.slf4j.Log4jLoggerFactory]
2016-08-2416:25:00Startingtolaunchlocaltasktoprocessmapjoin;maximummemory=518979584
2016-08-2416:25:02Dumptheside-tablefortag:0withgroupcount:1intofile:file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-24-49_294_2706432574075169306-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile40--.hashtable
2016-08-2416:25:02Uploaded1Fileto:file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-24-49_294_2706432574075169306-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile40--.hashtable(305bytes)
2016-08-2416:25:02Endoflocaltask;TimeTaken:2.892sec.
Executioncompletedsuccessfully
MapredLocaltasksucceeded
LaunchingJob1outof1
Numberofreducetasksissetto0sincethere''snoreduceoperator
Jobrunningin-process(localHadoop)
2016-08-2416:25:05,677Stage-3map=100%,reduce=0%
EndedJob=job_local2068422373_0012
MapReducewww.baiyuewang.netJobsLaunched:
Stage-Stage-3:HDFSRead:1390HDFSWrite:90SUCCESS
TotalMapReduceCPUTimeSpent:0msec
OK
1a31xxx2
2b41xxx2
3c11xxx2
1a32yyy3
2b42yyy3
3c12yyy3
1a33zzz5
2b43zzz5
3c13zzz5
上面就是hive中的连接查询,其实与SQL一样的。
|
|