Using Sqoop

Sqoop is essentially a command-line tool; unlike HDFS or MapReduce, there is no deep theory behind it.



We can view Sqoop's command options with the sqoop help command:



16/11/13 20:10:17 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.

The most frequently used of these are import and export.



1. codegen



Maps a relational database table to a Java source file, the compiled Java class, and a jar packaging them; the generated Java class has a field for each column of the table. The generated class and jar files are needed again later by other tools (the merge command in section 9, for example).

[Figure: codegen command options]







Example:



sqoop codegen --connect jdbc:mysql://localhost:3306/test --table order_info --outdir /home/xiaosi/test/ --username root --password root

The example above generates Java code from the order_info table of the test database; --outdir specifies where the generated Java source is placed.



The output is as follows:



16/11/13 21:50:34 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Enter password:
16/11/13 21:50:38 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
16/11/13 21:50:38 INFO tool.CodeGenTool: Beginning code generation
16/11/13 21:50:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_info` AS t LIMIT 1
16/11/13 21:50:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_info` AS t LIMIT 1
16/11/13 21:50:38 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/hadoop-2.7.2
Note: /tmp/sqoop-xiaosi/compile/ea41fe40e1f12f6b052ad9fe4a5d9710/order_info.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/11/13 21:50:39 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-xiaosi/compile/ea41fe40e1f12f6b052ad9fe4a5d9710/order_info.jar

We can also use --bindir to specify the output directory for the compiled class file and the packaged jar:
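The original post does not show the command for this run; judging from the paths in the log and the explanation below, it was presumably along these lines (a sketch, not verbatim):

sqoop codegen --connect jdbc:mysql://localhost:3306/test --table order_info --bindir /home/xiaosi/data --outdir /home/xiaosi/test --username root -P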



16/11/13 21:53:55 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Enter password:
16/11/13 21:53:58 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
16/11/13 21:53:58 INFO tool.CodeGenTool: Beginning code generation
16/11/13 21:53:58 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_info` AS t LIMIT 1
16/11/13 21:53:58 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_info` AS t LIMIT 1
16/11/13 21:53:58 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/hadoop-2.7.2
Note: /home/xiaosi/data/order_info.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/11/13 21:53:59 INFO orm.CompilationManager: Writing jar file: /home/xiaosi/data/order_info.jar

This run places the compiled class file (order_info.class) and the packaged jar (order_info.jar) under /home/xiaosi/data, while the Java source file (order_info.java) goes to /home/xiaosi/test.



2. create-hive-table



This command already appeared in the previous post [Sqoop导入与导出]; it creates a Hive table whose structure matches that of a relational database table.

[Figure: create-hive-table command options]







Example:



sqoop create-hive-table --connect jdbc:mysql://localhost:3306/test --table employee --username root --password root --fields-terminated-by ','



3. eval



The eval command lets Sqoop run SQL statements against a relational database and print the results on the console. Before importing with a tool like import, you can use it to verify in advance that the relevant SQL statement is correct.



3.1 Evaluating a SELECT query



With eval we can evaluate any type of SQL query. Take the order_info table of the test database as an example:



sqoop eval --connect jdbc:mysql://localhost:3306/test --username root --query "select * from order_info limit 3" -P

The output is as follows:



16/11/13 22:25:19 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Enter password:
16/11/13 22:25:22 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
------------------------------------------------------------
| id              | order_time | business |
------------------------------------------------------------
| 358574046793404 | 2016-04-05 | flight   |
| 358574046794733 | 2016-08-03 | hotel    |
| 358574050631177 | 2016-05-08 | vacation |
------------------------------------------------------------



3.2 Evaluating an INSERT statement



The eval tool works for DML statements as well as queries, which means we can also use it to run an INSERT. The following command inserts a new row into the order_info table of the test database:



sqoop eval --connect jdbc:mysql://localhost:3306/test --username root --query "insert into order_info(id, order_time, business) values('358574050631166', '2016-11-13', 'hotel')" -P

The output is as follows:



16/11/13 22:29:42 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Enter password:
16/11/13 22:29:44 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
16/11/13 22:29:44 INFO tool.EvalSqlTool: 1 row(s) updated.

If the command succeeds, the number of updated rows is shown on the console. Alternatively, we can query MySQL for the row we just inserted:

mysql> select * from order_info where id = "358574050631166";
+-----------------+------------+----------+
| id              | order_time | business |
+-----------------+------------+----------+
| 358574050631166 | 2016-11-13 | hotel    |
+-----------------+------------+----------+
1 row in set (0.00 sec)



4. export



Exports data from HDFS into a relational database.

[Figure: export command options]











Example:



Here is an example of employee data stored in an HDFS file:



hadoop fs -text /user/xiaosi/employee/* | less
yoona,qunar,创新事业部
xiaosi,qunar,创新事业部
jim,ali,淘宝
kom,ali,淘宝
lucy,baidu,搜索
jim,ali,淘宝

Before exporting data from HDFS to a relational database, a table must be created in the database to receive the data:



CREATE TABLE `employee` (
  `name`    varchar(255) DEFAULT NULL,
  `company` varchar(255) DEFAULT NULL,
  `depart`  varchar(255) DEFAULT NULL
);

Now run the export:



sqoop export --connect jdbc:mysql://localhost:3306/test --table employee --export-dir /user/xiaosi/employee --username root -m 1 --fields-terminated-by ',' -P

The output is as follows:



16/11/13 23:40:49 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/11/13 23:40:49 INFO mapreduce.Job: Running job: job_local611430785_0001
16/11/13 23:40:49 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/11/13 23:40:49 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.sqoop.mapreduce.NullOutputCommitter
16/11/13 23:40:49 INFO mapred.LocalJobRunner: Waiting for map tasks
16/11/13 23:40:49 INFO mapred.LocalJobRunner: Starting task: attempt_local611430785_0001_m_000000_0
16/11/13 23:40:49 INFO mapred.Task: Using ResourceCalculatorProcessTree: []
16/11/13 23:40:49 INFO mapred.MapTask: Processing split: Paths:/user/xiaosi/employee/part-m-00000:0+120
16/11/13 23:40:49 INFO Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file
16/11/13 23:40:49 INFO Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start
16/11/13 23:40:49 INFO Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length
16/11/13 23:40:49 INFO mapreduce.AutoProgressMapper: Auto-progress thread is finished. keepGoing=false
16/11/13 23:40:49 INFO mapred.LocalJobRunner:
16/11/13 23:40:49 INFO mapred.Task: Task:attempt_local611430785_0001_m_000000_0 is done. And is in the process of committing
16/11/13 23:40:49 INFO mapred.LocalJobRunner: map
16/11/13 23:40:49 INFO mapred.Task: Task 'attempt_local611430785_0001_m_000000_0' done.
16/11/13 23:40:49 INFO mapred.LocalJobRunner: Finishing task: attempt_local611430785_0001_m_000000_0
16/11/13 23:40:49 INFO mapred.LocalJobRunner: map task executor complete.
16/11/13 23:40:50 INFO mapreduce.Job: Job job_local611430785_0001 running in uber mode : false
16/11/13 23:40:50 INFO mapreduce.Job:  map 100% reduce 0%
16/11/13 23:40:50 INFO mapreduce.Job: Job job_local611430785_0001 completed successfully
16/11/13 23:40:50 INFO mapreduce.Job: Counters: 20
    File System Counters
        FILE: Number of bytes read=22247825
        FILE: Number of bytes written=22732498
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=126
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=12
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Map-Reduce Framework
        Map input records=6
        Map output records=6
        Input split bytes=136
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=0
        Total committed heap usage (bytes)=245366784
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
16/11/13 23:40:50 INFO mapreduce.ExportJobBase: Transferred 126 bytes in 2.3492 seconds (53.6344 bytes/sec)
16/11/13 23:40:50 INFO mapreduce.ExportJobBase: Exported 6 records.

After the export completes, we can query the employee table in MySQL:

mysql> select name, company from employee;
+--------+---------+
| name   | company |
+--------+---------+
| yoona  | qunar   |
| xiaosi | qunar   |
| jim    | ali     |
| kom    | ali     |
| lucy   | baidu   |
| jim    | ali     |
+--------+---------+
6 rows in set (0.00 sec)



5. import



Imports data from a database table into HDFS or Hive.

[Figure: import command options]















Example:



sqoop import --connect jdbc:mysql://localhost:3306/test --target-dir /user/xiaosi/data/order_info --query 'select * from order_info where $CONDITIONS' -m 1 --username root -P

The command above imports the result set of a query into HDFS; the storage path is set by --target-dir. The --query option cannot be combined with --table, and the token $CONDITIONS must appear in the WHERE clause: Sqoop replaces it at run time with the split condition for each map task.
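With -m 1, Sqoop simply substitutes a trivial condition (1=1) for $CONDITIONS, as the log below shows. With several map tasks a split column must be supplied so each mapper receives its own range; a sketch (the choice of id as --split-by column and the target path are illustrative, not from the original):

sqoop import --connect jdbc:mysql://localhost:3306/test --target-dir /user/xiaosi/data/order_info_parallel --query 'select * from order_info where $CONDITIONS' --split-by id -m 4 --username root -P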



The output is as follows:



16/11/14 12:08:50 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/11/14 12:08:50 INFO mapreduce.Job: Running job: job_local127577466_0001
16/11/14 12:08:50 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/11/14 12:08:50 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/11/14 12:08:50 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
16/11/14 12:08:50 INFO mapred.LocalJobRunner: Waiting for map tasks
16/11/14 12:08:50 INFO mapred.LocalJobRunner: Starting task: attempt_local127577466_0001_m_000000_0
16/11/14 12:08:50 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/11/14 12:08:50 INFO mapred.Task: Using ResourceCalculatorProcessTree: []
16/11/14 12:08:50 INFO db.DBInputFormat: Using read commited transaction isolation
16/11/14 12:08:50 INFO mapred.MapTask: Processing split: 1=1 AND 1=1
16/11/14 12:08:50 INFO db.DBRecordReader: Working on split: 1=1 AND 1=1
16/11/14 12:08:50 INFO db.DBRecordReader: Executing query: select * from order_info where (1=1) AND (1=1)
16/11/14 12:08:50 INFO mapreduce.AutoProgressMapper: Auto-progress thread is finished. keepGoing=false
16/11/14 12:08:50 INFO mapred.LocalJobRunner:
16/11/14 12:08:51 INFO mapred.Task: Task:attempt_local127577466_0001_m_000000_0 is done. And is in the process of committing
16/11/14 12:08:51 INFO mapred.LocalJobRunner:
16/11/14 12:08:51 INFO mapred.Task: Task attempt_local127577466_0001_m_000000_0 is allowed to commit now
16/11/14 12:08:51 INFO output.FileOutputCommitter: Saved output of task 'attempt_local127577466_0001_m_000000_0' to hdfs://localhost:9000/user/xiaosi/data/order_info/_temporary/0/task_local127577466_0001_m_000000
16/11/14 12:08:51 INFO mapred.LocalJobRunner: map
16/11/14 12:08:51 INFO mapred.Task: Task 'attempt_local127577466_0001_m_000000_0' done.
16/11/14 12:08:51 INFO mapred.LocalJobRunner: Finishing task: attempt_local127577466_0001_m_000000_0
16/11/14 12:08:51 INFO mapred.LocalJobRunner: map task executor complete.
16/11/14 12:08:51 INFO mapreduce.Job: Job job_local127577466_0001 running in uber mode : false
16/11/14 12:08:51 INFO mapreduce.Job:  map 100% reduce 0%
16/11/14 12:08:51 INFO mapreduce.Job: Job job_local127577466_0001 completed successfully
16/11/14 12:08:51 INFO mapreduce.Job: Counters: 20
    File System Counters
        FILE: Number of bytes read=22247784
        FILE: Number of bytes written=22732836
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=0
        HDFS: Number of bytes written=3710
        HDFS: Number of read operations=4
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=3
    Map-Reduce Framework
        Map input records=111
        Map output records=111
        Input split bytes=87
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=0
        Total committed heap usage (bytes)=245366784
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=3710
16/11/14 12:08:51 INFO mapreduce.ImportJobBase: Transferred 3.623 KB in 2.5726 seconds (1.4083 KB/sec)
16/11/14 12:08:51 INFO mapreduce.ImportJobBase: Retrieved 111 records.

We can inspect the imported data at the HDFS path given by --target-dir:



hadoop fs -text /user/xiaosi/data/order_info/* | less
358574046793404,2016-04-05,flight
358574046794733,2016-08-03,hotel
358574050631177,2016-05-08,vacation
358574050634213,2015-04-28,train
358574050634692,2016-04-05,tuan
358574050650524,2015-07-26,hotel
358574050654773,2015-01-23,flight
358574050668658,2015-01-23,hotel
358574050730771,2016-11-06,train
358574050731241,2016-05-08,car
358574050743865,2015-01-23,vacation
358574050767666,2015-04-28,train
358574050767971,2015-07-26,flight
358574050808288,2016-05-08,hotel
358574050816828,2015-01-23,hotel
358574050818220,2015-04-28,car
358574050821877,2013-08-03,flight

Here is another example:



sqoop import --connect jdbc:mysql://localhost:3306/test --table order_info --columns "business,id,order_time" -m 1 --username root -P

A new directory named order_info (matching the table name) appears under /user/xiaosi/ on HDFS, with contents like:



flight,358574046793404,2016-04-05
hotel,358574046794733,2016-08-03
vacation,358574050631177,2016-05-08
train,358574050634213,2015-04-28
tuan,358574050634692,2016-04-05



6. import-all-tables



Imports every table of a database into HDFS, with each table in its own directory.

[Figure: import-all-tables command options]
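The original gives no example for this command; a minimal sketch, reusing the test database from the earlier sections (note that --table, --columns, and --query cannot be used with import-all-tables):

sqoop import-all-tables --connect jdbc:mysql://localhost:3306/test --username root -m 1 -P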







7. list-databases



This command lists all database names on a relational database server:



sqoop list-databases --connect jdbc:mysql://localhost:3306 --username root -P

The output is as follows:



16/11/14 14:30:11 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Enter password:
16/11/14 14:30:14 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
hive_db
mysql
performance_schema
phpmyadmin
test



8. list-tables



This command lists all table names in a given database:



sqoop list-tables --connect jdbc:mysql://localhost:3306/test --username root -P

The output is as follows:



16/11/14 14:32:08 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Enter password:
16/11/14 14:32:10 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
PageView
book
bookID
cc
city_click
country
country2
cup
employee
flightOrder
hotel_book_info
hotel_info
order_info
stu
stu2
stu3
stuInfo
student



9. merge



Merges two datasets on HDFS into one, removing duplicate records in the process.

[Figure: merge command options]











For example, suppose the HDFS path /user/xiaosi/old holds a dataset from an earlier import, and /user/xiaosi/new holds a dataset imported later, both with the schema (id, name). After the merge, each id appears only once, with its values taken from the newer dataset. Run the following command:



sqoop merge --new-data /user/xiaosi/new/part-m-00000 --onto /user/xiaosi/old/part-m-00000 --target-dir /user/xiaosi/final --jar-file /home/xiaosi/test/testmerge.jar --class-name testmerge --merge-key id

Note:



Within a single dataset, multiple rows must not share the same primary key, otherwise data loss can occur.
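The --jar-file and --class-name arguments refer to a record class describing the dataset's schema, normally produced by an earlier codegen run. A sketch of such a run (the source table name is a placeholder, since the original does not name it):

sqoop codegen --connect jdbc:mysql://localhost:3306/test --table <source_table> --class-name testmerge --bindir /home/xiaosi/test --username root -P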







10. metastore



Records metadata about Sqoop jobs. If no metastore instance is started, the metadata is stored in the default directory ~/.sqoop; to change the storage location, edit the configuration file sqoop-site.xml.
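A sketch of the relevant sqoop-site.xml properties (property names as documented for Sqoop 1.4.x; the path value is illustrative):

<property>
  <name>sqoop.metastore.server.location</name>
  <value>/home/xiaosi/.sqoop/shared-metastore.db</value>
</property>
<property>
  <name>sqoop.metastore.server.port</name>
  <value>16000</value>
</property>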







Start a metastore instance:



sqoop metastore

The output is as follows:



16/11/14 14:44:40 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
16/11/14 14:44:40 WARN hsqldb.HsqldbMetaStore: The location for metastore data has not been explicitly set. Placing shared metastore files in /home/xiaosi/.sqoop/shared-metastore.db
[Server@52308be6]: [Thread[main,5,main]]: checkRunning(false) entered
[Server@52308be6]: [Thread[main,5,main]]: checkRunning(false) exited
[Server@52308be6]: [Thread[main,5,main]]: setDatabasePath(0,file:/home/xiaosi/.sqoop/shared-metastore.db)
[Server@52308be6]: [Thread[main,5,main]]: checkRunning(false) entered
[Server@52308be6]: [Thread[main,5,main]]: checkRunning(false) exited
[Server@52308be6]: [Thread[main,5,main]]: setDatabaseName(0,sqoop)
[Server@52308be6]: [Thread[main,5,main]]: putPropertiesFromString(): [hsqldb.write_delay=false]
[Server@52308be6]: [Thread[main,5,main]]: checkRunning(false) entered
[Server@52308be6]: [Thread[main,5,main]]: checkRunning(false) exited
[Server@52308be6]: Initiating startup sequence...
[Server@52308be6]: Server socket opened successfully in 3 ms.
[Server@52308be6]: Database [index=0, id=0, db=file:/home/xiaosi/.sqoop/shared-metastore.db, alias=sqoop] opened sucessfully in 153 ms.
[Server@52308be6]: Startup sequence completed in 157 ms.
[Server@52308be6]: 2016-11-14 14:44:40.414 HSQLDB server 1.8.0 is online
[Server@52308be6]: To close normally, connect and execute SHUTDOWN SQL
[Server@52308be6]: From command line, use [Ctrl]+[C] to abort abruptly
16/11/14 14:44:40 INFO hsqldb.HsqldbMetaStore: Server started on port 16000 with protocol HSQL
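With the metastore online (port 16000, per the log above), saved jobs can be stored in it rather than in the local ~/.sqoop directory, using the --meta-connect option; a sketch (the job name and import arguments are illustrative):

sqoop job --meta-connect jdbc:hsqldb:hsql://localhost:16000/sqoop --create importOrderInfo -- import --connect jdbc:mysql://localhost:3306/test --table order_info -m 1 --username root -P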





11. job



This command creates a saved Sqoop job. The job does not run when it is defined; it must be executed explicitly. The point is to make Sqoop commands reusable.

[Figure: job command options]









Example:



sqoop job --create listTablesJob -- list-tables --connect jdbc:mysql://localhost:3306/test --username root -P

The command above defines a job that lists all the tables in the test database.
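Besides --create and --exec (used next), the job tool can list, inspect, and delete saved jobs:

sqoop job --list
sqoop job --show listTablesJob
sqoop job --delete listTablesJob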



sqoop job --exec listTablesJob

This executes the job we defined; the output is as follows:



16/11/14 19:51:44 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Enter password:
16/11/14 19:51:47 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
PageView
book
bookID
cc
city_click
country
country2
cup
employee
flightOrder
hotel_book_info
hotel_info
order_info
stu
stu2
stu3
stuInfo
student

Note:



The -- separator and list-tables (the Sqoop command the job runs) must not be written together; there has to be a space between them.
