配色: 字号:
Hive连接JOIN用例详解
2016-12-13 | 阅:  转:  |  分享 
  
Hive连接JOIN用例详解

SQL里面通常都会用Join来连接两个表,做复杂的关联查询。比如用户表和订单表,能通过join得到某个用户购买的产品;或者某个产品被购买的人群....



Hive也支持这样的操作,而且由于Hive底层运行在hadoop上,因此有很多地方可以进行优化。比如小表到大表的连接操作、小表进行缓存、大表进行避免缓存等等...



下面就来看看hive里面的连接操作吧!其实跟SQL还是差不多的...



数据准备:创建数据-->创建表-->导入数据



首先创建两个原始数据的文件,这两个文件分别有三列,第一列是id、第二列是名称、第三列是另外一个表的id。通过第二列可以明显的看到两个表做连接查询的结果:



[xingoo@localhosttmp]$cataa.txt

1a3

2b4

3c1

[xingoo@localhosttmp]$catbb.txt

1xxx2

2yyy3

3zzz5

接下来创建两个表,需要注意的是表的字段分隔符为空格,另一个表可以直接基于当前的表创建。



hive>createtableaa

>(astring,bstring,cstring)

>rowformatdelimited

>fieldsterminatedby'''';

OK

Timetaken:0.19seconds

hive>createtablebblikeaa;

OK

Timetaken:0.188seconds

查看两个表的结构:



hive>describeaa;

OK

astring

bstring

cstring

Timetaken:0.068seconds,Fetched:3row(s)

hive>describebb;

OK

astring

bstring

cstring

Timetaken:0.045seconds,Fetched:3row(s)

下面可以基于本地的文件,导入数据



hive>loaddatalocalinpath''/usr/tmp/aa.txt''overwriteintotableaa;

Loadingdatatotabletest.aa

OK

Timetaken:0.519seconds

hive>loaddatalocalinpath''/usr/tmp/bb.txt''overwriteintotablebb;

Loadingdatatotabletest.bb

OK

Timetaken:0.321seconds

内连接



内连接即基于on语句,仅列出表1和表2符合连接条件的数据。



hive>selectfromaaajoinbbbona.c=b.a;

WARNING:Hive-on-MRisdeprecatedinHive2andmaynotbeavailableinthefutureversions.Considerusingadifferentexecutionengine(i.e.spark,tez)orusingHive1.Xreleases.

QueryID=root_20160824161233_f9ecefa2-e5d7-416d-8d90-e191937e7313

Totaljobs=1

SLF4J:ClasspathcontainsmultipleSLF4Jbindings.

SLF4J:Foundbindingin[jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J:Foundbindingin[jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J:Seehttp://www.slf4j.org/codes.html#multiple_bindingsforanexplanation.

SLF4J:Actualbindingisoftype[org.apache.logging.slf4j.Log4jLoggerFactory]

2016-08-2416:12:44Startingtolaunchlocaltasktoprocessmapjoin;maximummemory=518979584

2016-08-2416:12:47Dumptheside-tablefortag:0withgroupcount:3intofile:file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-12-33_145_337836390845333215-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable

2016-08-2416:12:47Uploaded1Fileto:file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-12-33_145_337836390845333215-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable(332bytes)

2016-08-2416:12:47Endoflocaltask;TimeTaken:3.425sec.

Executioncompletedsuccessfully

MapredLocaltasksucceeded

LaunchingJob1outof1

Numberofreducetasksissetto0sincethere''snoreduceoperator

Jobrunningin-process(localHadoop)

2016-08-2416:12:50,222Stage-3map=100%,reduce=0%

EndedJob=job_local944389202_0007

MapReduceJobsLaunched:

Stage-Stage-3:HDFSRead:1264HDFSWrite:90SUCCESS

TotalMapReduceCPUTimeSpent:0msec

OK

3c11xxx2

1a33zzz5

Timetaken:17.083seconds,Fetched:2row(s)

左连接



左连接是显示左边的表的所有数据,如果有右边表与之对应,则显示;否则显示null



ive>selectfromaaaleftouterjoinbbbona.c=b.a;

WARNING:Hive-on-MRisdeprecatedinHive2andmaynotbeavailableinthefutureversions.Considerusingadifferentexecutionengine(i.e.spark,tez)orusingHive1.Xreleases.

QueryID=root_20160824161637_6d540592-13fd-4f59-a2cf-0a91c0fc9533

Totaljobs=1

SLF4J:ClasspathcontainsmultipleSLF4Jbindings.

SLF4J:Foundbindingin[jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J:Foundbindingin[jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J:Seehttp://www.slf4j.org/codes.html#multiple_bindingsforanexplanation.

SLF4J:Actualbindingisoftype[org.apache.logging.slf4j.Log4jLoggerFactory]

2016-08-2416:16:48Startingtolaunchlocaltasktoprocessmapjoin;maximummemory=518979584

2016-08-2416:16:51Dumptheside-tablefortag:1withgroupcount:3intofile:file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-16-37_813_4572869866822819707-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable

2016-08-2416:16:51Uploaded1Fileto:file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-16-37_813_4572869866822819707-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable(338bytes)

2016-08-2416:16:51Endoflocaltask;TimeTaken:2.634sec.

Executioncompletedsuccessfully

MapredLocaltasksucceeded

LaunchingJob1outof1

Numberofreducetasksissetto0sincethere''snoreduceoperator

Jobrunningin-process(localHadoop)

2016-08-2416:16:53,843Stage-3map=100%,reduce=0%

EndedJob=job_local1670258961_0008

MapReduceJobsLaunched:

Stage-Stage-3:HDFSRead:1282HDFSWrite:90SUCCESS

TotalMapReduceCPUTimeSpent:0msec

OK

1a33zzz5

2b4NULLNULLNULL

3c11xxx2

Timetaken:16.048seconds,Fetched:3row(s)

右连接



类似左连接,同理。



hive>selectfromaaarightouterjoinbbbona.c=b.a;

WARNING:Hive-on-MRisdeprecatedinHive2andmaynotbeavailableinthefutureversions.Considerusingadifferentexecutionengine(i.e.spark,tez)orusingHive1.Xreleases.

QueryID=root_20160824162227_5d0f0090-1a9b-4a3f-9e82-e93c4d180f4b

Totaljobs=1

SLF4J:ClasspathcontainsmultipleSLF4Jbindings.

SLF4J:Foundbindingin[jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J:Foundbindingin[jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J:Seehttp://www.slf4j.org/codes.html#multiple_bindingsforanexplanation.

SLF4J:Actualbindingisoftype[org.apache.logging.slf4j.Log4jLoggerFactory]

2016-08-2416:22:37Startingtolaunchlocaltasktoprocessmapjoin;maximummemory=518979584

2016-08-2416:22:40Dumptheside-tablefortag:0withgroupcount:3intofile:file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-22-27_619_7820027359528638029-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile20--.hashtable

2016-08-2416:22:40Uploaded1Fileto:file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-22-27_619_7820027359528638029-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile20--.hashtable(332bytes)

2016-08-2416:22:40Endoflocaltask;TimeTaken:2.368sec.

Executioncompletedsuccessfully

MapredLocaltasksucceeded

LaunchingJob1outof1

Numberofreducetasksissetto0sincethere''snoreduceoperator

Jobrunningin-process(localHadoop)

2016-08-2416:22:43,060Stage-3map=100%,reduce=0%

EndedJob=job_local2001415675_0009

MapReduceJobsLaunched:

Stage-Stage-3:HDFSRead:1306HDFSWrite:90SUCCESS

TotalMapReduceCPUTimeSpent:0msec

OK

3c11xxx2

NULLNULLNULL2yyy3

1a33zzz5

Timetaken:15.483seconds,Fetched:3row(s)

全连接



相当于表1和表2的数据都显示,如果没有对应的数据,则显示Null.



hive>selectfromaaafullouterjoinbbbona.c=b.a;

WARNING:Hive-on-MRisdeprecatedinHive2andmaynotbeavailableinthefutureversions.Considerusingadifferentexecutionengine(i.e.spark,tez)orusingHive1.Xreleases.

QueryID=root_20160824162252_c71b2fae-9768-4b9a-b5ad-c06d7cdb60fb

Totaljobs=1

LaunchingJob1outof1

Numberofreducetasksnotspecified.Estimatedfrominputdatasize:1

Inordertochangetheaverageloadforareducer(inbytes):

sethive.exec.reducers.bytes.per.reducer=

Inordertolimitthemaximumnumberofreducers:

sethive.exec.reducers.max=

Inordertosetaconstantnumberofreducers:

setmapreduce.job.reduces=

Jobrunningin-process(localHadoop)

2016-08-2416:22:54,111Stage-1map=100%,reduce=100%

EndedJob=job_local1766586034_0010

MapReduceJobsLaunched:

Stage-Stage-1:HDFSRead:4026HDFSWrite:270SUCCESS

TotalMapReduceCPUTimeSpent:0msec

OK

3c11xxx2

NULLNULLNULL2yyy3

1a33zzz5

2b4NULLNULLNULL

Timetaken:1.689seconds,Fetched:4row(s)

左半开连接



这个比较特殊,SEMI-JOIN仅仅会显示表1的数据,即左边表的数据。但是效率会比左连接快,因为他会先拿到表1的数据,然后在表2中查找,只要查找到结果立马就返回数据。



hive>selectfromaaaleftsemijoinbbbona.c=b.a;

WARNING:Hive-on-MRisdeprecatedinHive2andmaynotbeavailableinthefutureversions.Considerusingadifferentexecutionengine(i.e.spark,tez)orusingHive1.Xreleases.

QueryID=root_20160824162327_e7fc72a7-ef91-4d39-83bc-ff8159ea8816

Totaljobs=1

SLF4J:ClasspathcontainsmultipleSLF4Jbindings.

SLF4J:Foundbindingin[jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J:Foundbindingin[jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J:Seehttp://www.slf4j.org/codes.html#multiple_bindingsforanexplanation.

SLF4J:Actualbindingisoftype[org.apache.logging.slf4j.Log4jLoggerFactory]

2016-08-2416:23:37Startingtolaunchlocaltasktoprocessmapjoin;maximummemory=518979584

2016-08-2416:23:41Dumptheside-tablefortag:1withgroupcount:3intofile:file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-23-27_008_3026796648107813784-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile31--.hashtable

2016-08-2416:23:41Uploaded1Fileto:file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-23-27_008_3026796648107813784-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile31--.hashtable(317bytes)

2016-08-2416:23:41Endoflocaltask;TimeTaken:3.586sec.

Executioncompletedsuccessfully

MapredLocaltasksucceeded

LaunchingJob1outof1

Numberofreducetasksissetto0sincethere''snoreduceoperator

Jobrunningin-process(localHadoop)

2016-08-2416:23:43,798Stage-3map=100%,reduce=0%

EndedJob=job_local521961878_0011

MapReduceJobsLaunched:

Stage-Stage-3:HDFSRead:1366HDFSWrite:90SUCCESS

TotalMapReduceCPUTimeSpent:0msec

OK

1a3

3c1

Timetaken:16.811seconds,Fetched:2row(s)

笛卡尔积



笛卡尔积会针对表1和表2的每条数据做连接...



hive>selectfromaajoinbb;

Warning:MapJoinMAPJOIN[9][bigTable=?]intask''Stage-3:MAPRED''isacrossproduct

WARNING:Hive-on-MRisdeprecatedinHive2andmaynotbeavailableinthefutureversions.Considerusingadifferentexecutionengine(i.e.spark,tez)orusingHive1.Xreleases.

QueryID=root_20160824162449_20e4b5ec-768f-48cf-a840-7d9ff360975f

Totaljobs=1

SLF4J:ClasspathcontainsmultipleSLF4Jbindings.

SLF4J:Foundbindingin[jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J:Foundbindingin[jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J:Seehttp://www.slf4j.org/codes.htmwww.wang027.coml#multiple_bindingsforanexplanation.

SLF4J:Actualbindingisoftype[org.apache.logging.slf4j.Log4jLoggerFactory]

2016-08-2416:25:00Startingtolaunchlocaltasktoprocessmapjoin;maximummemory=518979584

2016-08-2416:25:02Dumptheside-tablefortag:0withgroupcount:1intofile:file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-24-49_294_2706432574075169306-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile40--.hashtable

2016-08-2416:25:02Uploaded1Fileto:file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-24-49_294_2706432574075169306-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile40--.hashtable(305bytes)

2016-08-2416:25:02Endoflocaltask;TimeTaken:2.892sec.

Executioncompletedsuccessfully

MapredLocaltasksucceeded

LaunchingJob1outof1

Numberofreducetasksissetto0sincethere''snoreduceoperator

Jobrunningin-process(localHadoop)

2016-08-2416:25:05,677Stage-3map=100%,reduce=0%

EndedJob=job_local2068422373_0012

MapReducewww.baiyuewang.netJobsLaunched:

Stage-Stage-3:HDFSRead:1390HDFSWrite:90SUCCESS

TotalMapReduceCPUTimeSpent:0msec

OK

1a31xxx2

2b41xxx2

3c11xxx2

1a32yyy3

2b42yyy3

3c12yyy3

1a33zzz5

2b43zzz5

3c13zzz5

上面就是hive中的连接查询,其实与SQL一样的。

献花(0)
+1
(本文系thedust79首藏)