Druid Data Ingestion and Query Examples

码农书馆 2020-06-28

1. Prerequisites

1.1 Environment

    Java 8
    Linux, Mac OS X, or another Unix-like operating system (Windows is not supported)
    8 GB of RAM
    2 vCPUs

 

1.2 Download and extract Druid

    Download: https://www.apache.org/dyn/closer.cgi?path=/incubator/druid/0.13.0-incubating/apache-druid-0.13.0-incubating-bin.tar.gz
    Extract: tar -xzf apache-druid-0.13.0-incubating-bin.tar.gz
    Enter the directory: cd apache-druid-0.13.0-incubating

 

1.3 Download and extract ZooKeeper

    Druid relies on Apache ZooKeeper for distributed coordination. From the Druid root directory, download and extract ZooKeeper by running:

    curl https://archive.apache.org/dist/zookeeper/zookeeper-3.4.11/zookeeper-3.4.11.tar.gz -o zookeeper-3.4.11.tar.gz
    tar -xzf zookeeper-3.4.11.tar.gz
    mv zookeeper-3.4.11 zk

 

1.4 Start Druid

    From the Druid root directory, run:

MacBook-Air-3:apache-druid-0.13.0-incubating g2$ bin/supervise -c quickstart/tutorial/conf/tutorial-cluster.conf

    This brings up instances of ZooKeeper and the Druid services, all running on the local machine, for example:

[Tue Dec 25 16:11:35 2018] Running command[zk], logging to[/Users/g2/myresource/druid/apache-druid-0.13.0-incubating/var/sv/zk.log]: bin/run-zk quickstart/tutorial/conf
[Tue Dec 25 16:11:35 2018] Running command[coordinator], logging to[/Users/g2/myresource/druid/apache-druid-0.13.0-incubating/var/sv/coordinator.log]: bin/run-druid coordinator quickstart/tutorial/conf
[Tue Dec 25 16:11:35 2018] Running command[broker], logging to[/Users/g2/myresource/druid/apache-druid-0.13.0-incubating/var/sv/broker.log]: bin/run-druid broker quickstart/tutorial/conf
[Tue Dec 25 16:11:35 2018] Running command[historical], logging to[/Users/g2/myresource/druid/apache-druid-0.13.0-incubating/var/sv/historical.log]: bin/run-druid historical quickstart/tutorial/conf
[Tue Dec 25 16:11:35 2018] Running command[overlord], logging to[/Users/g2/myresource/druid/apache-druid-0.13.0-incubating/var/sv/overlord.log]: bin/run-druid overlord quickstart/tutorial/conf
[Tue Dec 25 16:11:35 2018] Running command[middleManager], logging to[/Users/g2/myresource/druid/apache-druid-0.13.0-incubating/var/sv/middleManager.log]: bin/run-druid middleManager quickstart/tutorial/conf

    All persistent state, such as the cluster metadata store and the segments for the services, is kept in the var directory under apache-druid-0.13.0-incubating. Service logs are located in var/sv.
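
    A quick way to confirm everything came up is to poll each service's /status endpoint. This is a minimal sketch, assuming the quickstart's default ports (coordinator 8081, broker 8082, historical 8083, overlord 8090, middleManager 8091):

    # Poll each Druid service's /status endpoint; a healthy service
    # responds with a small JSON document containing its version.
    for port in 8081 8082 8083 8090 8091; do
      curl -s "http://localhost:${port}/status" > /dev/null \
        && echo "port ${port}: up" \
        || echo "port ${port}: not responding"
    done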

 

2. Loading data

    The sample data is located at quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz under the Druid package root. It consists of Wikipedia page-edit events stored as JSON objects in a text file.

    The sample data contains the following columns; an example event looks like this:

{
    "time": "2015-09-12T00:47:47.870Z",
    "channel": "#vi.wikipedia",
    "cityName": null,
    "comment": "clean up using [[Project:AWB|AWB]]",
    "countryIsoCode": null,
    "countryName": null,
    "isAnonymous": false,
    "isMinor": false,
    "isNew": false,
    "isRobot": true,
    "isUnpatrolled": false,
    "metroCode": null,
    "namespace": "Main",
    "page": "Atractus duboisi",
    "regionIsoCode": null,
    "regionName": null,
    "user": "ThitxongkhoiAWB",
    "delta": 18,
    "added": 18,
    "deleted": 0
}

2.1 Loading data from Kafka (streaming)

2.1.1 Kafka setup

(1) Download Kafka

    curl -O https://archive.apache.org/dist/kafka/0.10.2.0/kafka_2.11-0.10.2.0.tgz
    tar -xzf kafka_2.11-0.10.2.0.tgz
    cd kafka_2.11-0.10.2.0

(2) Start Kafka

MacBook-Air-3:kafka_2.11-0.10.2.0 g2$ ./bin/kafka-server-start.sh config/server.properties
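
    Kafka 0.10.2 registers its broker in ZooKeeper; here it simply reuses the ZooKeeper instance that bin/supervise started on the default port 2181. As a quick sanity check before starting the broker (assuming nc is available), ZooKeeper answers the four-letter command ruok with imok when it is healthy:

    # Should print "imok" if the quickstart ZooKeeper is listening on 2181
    echo ruok | nc localhost 2181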

(3) List topics

MacBook-Air-3:kafka_2.11-0.10.2.0 g2$ ./bin/kafka-topics.sh --zookeeper localhost:2181 --list

(4) Create a topic

MacBook-Air-3:kafka_2.11-0.10.2.0 g2$ ./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wikipedia

(5) Delete a topic

MacBook-Air-3:kafka_2.11-0.10.2.0 g2$ ./bin/kafka-topics.sh --zookeeper localhost:2181 --delete --topic wikipedia
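
    To verify later that events are actually arriving on the topic, a console consumer can be attached; a quick check, not part of the original tutorial:

    # Print everything on the wikipedia topic from the earliest offset
    ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic wikipedia --from-beginning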

2.1.2 Druid operations

      Once the data is in Kafka, Druid needs a datasource spec file that tells it how to ingest the stream: the timestamp format, the dimension columns, the metric columns, the pre-aggregation granularity, and so on.

(1) Start the supervisor that ingests from Kafka into Druid

MacBook-Air-3:apache-druid-0.13.0-incubating g2$ curl -XPOST -H'Content-Type: application/json' -d @quickstart/tutorial/wikipedia-kafka-supervisor.json http://localhost:8090/druid/indexer/v1/supervisor

    Note: if the supervisor is created successfully, the response contains the supervisor ID (which matches the dataSource name); in our case we should see {"id":"wikipedia"}.
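
    To verify, the Overlord's supervisor API can be queried; a small check assuming the quickstart's Overlord port 8090:

    # List the IDs of all running supervisors
    curl http://localhost:8090/druid/indexer/v1/supervisor
    # Detailed status of this supervisor, including its Kafka offsets
    curl http://localhost:8090/druid/indexer/v1/supervisor/wikipedia/status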

    The spec file (datasource definition): wikipedia-kafka-supervisor.json

{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "wikipedia",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "time",
          "format": "auto"
        },
        "dimensionsSpec": {
          "dimensions": [
            "channel",
            "cityName",
            "comment",
            "countryIsoCode",
            "countryName",
            "isAnonymous",
            "isMinor",
            "isNew",
            "isRobot",
            "isUnpatrolled",
            "metroCode",
            "namespace",
            "page",
            "regionIsoCode",
            "regionName",
            "user",
            { "name": "added", "type": "long" },
            { "name": "deleted", "type": "long" },
            { "name": "delta", "type": "long" }
          ]
        }
      }
    },
    "metricsSpec" : [],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "DAY",
      "queryGranularity": "NONE",
      "rollup": false
    }
  },
  "tuningConfig": {
    "type": "kafka",
    "reportParseExceptions": false
  },
  "ioConfig": {
    "topic": "wikipedia",
    "replicas": 2,
    "taskDuration": "PT10M",
    "completionTimeout": "PT20M",
    "consumerProperties": {
      "bootstrap.servers": "localhost:9092"
    }
  }
}
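
    When the stream is no longer needed, the supervisor can be stopped through the same API; in 0.13 the endpoint is shutdown (later Druid releases renamed it terminate):

    # Stop the wikipedia supervisor and its indexing tasks
    curl -XPOST http://localhost:8090/druid/indexer/v1/supervisor/wikipedia/shutdown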

(2) Extract wikiticker-2015-09-12-sampled.json.gz

    MacBook-Air-3:tutorial g2$ pwd
    /Users/g2/myresource/druid/apache-druid-0.13.0-incubating/quickstart/tutorial
    MacBook-Air-3:tutorial g2$ gunzip -k wikiticker-2015-09-12-sampled.json.gz

    Sample record:

{
    "time": "2015-09-12T00:48:02.596Z",
    "channel": "#es.wikipedia",
    "cityName": "Mexico City",
    "comment": "Cambio en la redacción del texto y correción en sintaxis",
    "countryIsoCode": "MX",
    "countryName": "Mexico",
    "isAnonymous": true,
    "isMinor": false,
    "isNew": false,
    "isRobot": false,
    "isUnpatrolled": false,
    "metroCode": null,
    "namespace": "Main",
    "page": "Mathis Bolly",
    "regionIsoCode": "DIF",
    "regionName": "Mexico City",
    "user": "189.217.75.123",
    "delta": -67,
    "added": 0,
    "deleted": 67
}

(3) In the Kafka directory, run the following commands (to write the data into Kafka)

    MacBook-Air-3:kafka_2.11-0.10.2.0 g2$ export KAFKA_OPTS="-Dfile.encoding=UTF-8"
    MacBook-Air-3:kafka_2.11-0.10.2.0 g2$ ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia < /Users/g2/myresource/druid/apache-druid-0.13.0-incubating/quickstart/tutorial/wikiticker-2015-09-12-sampled.json
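
    Additional events can be appended later by piping them into the same console producer. For example (a hypothetical extra event with the same schema; fields omitted here are simply ingested as null dimensions):

    # Hypothetical extra event; the supervisor picks it up from the topic
    echo '{"time":"2015-09-12T01:00:00.000Z","channel":"#en.wikipedia","page":"Druid","user":"demo","added":1,"deleted":0,"delta":1}' | ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia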

     The data is now in Kafka and the supervisor ingests it continuously; next we can query it in Druid.

2.2 Loading a file (batch)

2.2.1 Prepare the data and the ingestion task spec: wikipedia-index.json

{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "channel",
              "cityName",
              "comment",
              "countryIsoCode",
              "countryName",
              "isAnonymous",
              "isMinor",
              "isNew",
              "isRobot",
              "isUnpatrolled",
              "metroCode",
              "namespace",
              "page",
              "regionIsoCode",
              "regionName",
              "user",
              { "name": "added", "type": "long" },
              { "name": "deleted", "type": "long" },
              { "name": "delta", "type": "long" }
            ]
          },
          "timestampSpec": {
            "column": "time",
            "format": "iso"
          }
        }
      },
      "metricsSpec" : [],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-09-12/2015-09-13"],
        "rollup" : false
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "local",
        "baseDir" : "quickstart/tutorial/",
        "filter" : "wikiticker-2015-09-12-sampled.json.gz"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index",
      "targetPartitionSize" : 5000000,
      "maxRowsInMemory" : 25000,
      "forceExtendableShardSpecs" : true
    }
  }
}
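
    The next step uses the bin/post-index-task helper, but the same spec can also be POSTed directly to the Overlord's task API (port 8090 in the quickstart) and the returned task ID polled by hand:

    # Submit the indexing task directly; the response contains the task ID
    curl -X POST -H 'Content-Type: application/json' -d @quickstart/tutorial/wikipedia-index.json http://localhost:8090/druid/indexer/v1/task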

2.2.2 Load the batch data

MacBook-Air-3:apache-druid-0.13.0-incubating g2$ bin/post-index-task --file quickstart/tutorial/wikipedia-index.json
Beginning indexing data for wikipedia
Task started: index_wikipedia_2018-12-25T10:00:45.744Z
Task log:     http://localhost:8090/druid/indexer/v1/task/index_wikipedia_2018-12-25T10:00:45.744Z/log
Task status:  http://localhost:8090/druid/indexer/v1/task/index_wikipedia_2018-12-25T10:00:45.744Z/status
Task index_wikipedia_2018-12-25T10:00:45.744Z still running...
Task index_wikipedia_2018-12-25T10:00:45.744Z still running...
Task index_wikipedia_2018-12-25T10:00:45.744Z still running...
Task index_wikipedia_2018-12-25T10:00:45.744Z still running...
Task index_wikipedia_2018-12-25T10:00:45.744Z still running...
Task finished with status: SUCCESS
Completed indexing data for wikipedia. Now loading indexed data onto the cluster...
wikipedia loading complete! You may now query your data
MacBook-Air-3:apache-druid-0.13.0-incubating g2$

    The data is now loaded into Druid and we can move on to querying it.
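
    One way to confirm the datasource is available is to ask the Coordinator (port 8081 in the quickstart) for the datasources it knows about; the response should include "wikipedia":

    curl http://localhost:8081/druid/coordinator/v1/datasources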

 

3. Querying data

3.1 Native JSON queries

3.1.1 Example query request

{
  "queryType" : "topN",
  "dataSource" : "wikipedia",
  "intervals" : ["2015-09-12/2015-09-13"],
  "granularity" : "all",
  "dimension" : "page",
  "metric" : "count",
  "threshold" : 15,
  "aggregations" : [
    {
      "type" : "count",
      "name" : "count"
    }
  ]
}

3.1.2 Run the query:

MacBook-Air-3:apache-druid-0.13.0-incubating g2$ curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-top-pages.json http://localhost:8082/druid/v2?pretty

3.1.3 Result

[ {
  "timestamp" : "2015-09-12T00:46:58.771Z",
  "result" : [ {
    "count" : 33,
    "page" : "Wikipedia:Vandalismusmeldung"
  }, {
    "count" : 28,
    "page" : "User:Cyde/List of candidates for speedy deletion/Subpage"
  }, {
    "count" : 27,
    "page" : "Jeremy Corbyn"
  }, {
    "count" : 21,
    "page" : "Wikipedia:Administrators' noticeboard/Incidents"
  }, {
    "count" : 20,
    "page" : "Flavia Pennetta"
  }, {
    "count" : 18,
    "page" : "Total Drama Presents: The Ridonculous Race"
  }, {
    "count" : 18,
    "page" : "User talk:Dudeperson176123"
  }, {
    "count" : 18,
    "page" : "Wikipédia:Le Bistro/12 septembre 2015"
  }, {
    "count" : 17,
    "page" : "Wikipedia:In the news/Candidates"
  }, {
    "count" : 17,
    "page" : "Wikipedia:Requests for page protection"
  }, {
    "count" : 16,
    "page" : "Utente:Giulio Mainardi/Sandbox"
  }, {
    "count" : 16,
    "page" : "Wikipedia:Administrator intervention against vandalism"
  }, {
    "count" : 15,
    "page" : "Anthony Martial"
  }, {
    "count" : 13,
    "page" : "Template talk:Connected contributor"
  }, {
    "count" : 12,
    "page" : "Chronologie de la Lorraine"
  } ]
} ]
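
    Native queries are not limited to topN. As a sketch, the hourly deletion total computed with SQL in section 3.3.2 below could also be written as a native timeseries query (save it to a file and POST it to the same /druid/v2 endpoint):

    {
      "queryType" : "timeseries",
      "dataSource" : "wikipedia",
      "intervals" : ["2015-09-12/2015-09-13"],
      "granularity" : "hour",
      "aggregations" : [
        { "type" : "longSum", "name" : "LinesDeleted", "fieldName" : "deleted" }
      ]
    }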

 

3.2 Druid SQL queries

    Druid also supports a SQL dialect for querying. Let's run a SQL query that is equivalent to the native JSON query shown above.

3.2.1 The SQL request in JSON form

{
  "query":"SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE \"__time\" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10"
}

3.2.2 Run the query

MacBook-Air-3:apache-druid-0.13.0-incubating g2$ curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-top-pages-sql.json http://localhost:8082/druid/v2/sql

3.2.3 Result

[
    {
        "page": "Wikipedia:Vandalismusmeldung",
        "Edits": 33
    },
    {
        "page": "User:Cyde/List of candidates for speedy deletion/Subpage",
        "Edits": 28
    },
    {
        "page": "Jeremy Corbyn",
        "Edits": 27
    },
    {
        "page": "Wikipedia:Administrators' noticeboard/Incidents",
        "Edits": 21
    },
    {
        "page": "Flavia Pennetta",
        "Edits": 20
    },
    {
        "page": "Total Drama Presents: The Ridonculous Race",
        "Edits": 18
    },
    {
        "page": "User talk:Dudeperson176123",
        "Edits": 18
    },
    {
        "page": "Wikipédia:Le Bistro/12 septembre 2015",
        "Edits": 18
    },
    {
        "page": "Wikipedia:In the news/Candidates",
        "Edits": 17
    },
    {
        "page": "Wikipedia:Requests for page protection",
        "Edits": 17
    }
]

 

3.3 SQL client (dsql)

    For convenience, the Druid package includes a SQL command-line client at bin/dsql under the package root. Run bin/dsql and you should see the following prompt:

MacBook-Air-3:apache-druid-0.13.0-incubating g2$ bin/dsql
Welcome to dsql, the command-line client for Druid SQL.
Type "\h" for help.
dsql>

3.3.1 Example 1

dsql> SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10;
┌──────────────────────────────────────────────────────────┬───────┐
│ page                                                     │ Edits │
├──────────────────────────────────────────────────────────┼───────┤
│ Wikipedia:Vandalismusmeldung                             │    33 │
│ User:Cyde/List of candidates for speedy deletion/Subpage │    28 │
│ Jeremy Corbyn                                            │    27 │
│ Wikipedia:Administrators' noticeboard/Incidents          │    21 │
│ Flavia Pennetta                                          │    20 │
│ Total Drama Presents: The Ridonculous Race               │    18 │
│ User talk:Dudeperson176123                               │    18 │
│ Wikipédia:Le Bistro/12 septembre 2015                    │    18 │
│ Wikipedia:In the news/Candidates                         │    17 │
│ Wikipedia:Requests for page protection                   │    17 │
└──────────────────────────────────────────────────────────┴───────┘
Retrieved 10 rows in 0.35s.

3.3.2 Example 2: Timeseries

dsql> SELECT FLOOR(__time to HOUR) AS HourTime, SUM(deleted) AS LinesDeleted FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY FLOOR(__time to HOUR);
┌──────────────────────────┬──────────────┐
│ HourTime                 │ LinesDeleted │
├──────────────────────────┼──────────────┤
│ 2015-09-12T00:00:00.000Z │         1761 │
│ 2015-09-12T01:00:00.000Z │        16208 │
│ 2015-09-12T02:00:00.000Z │        14543 │
│ 2015-09-12T03:00:00.000Z │        13101 │
│ 2015-09-12T04:00:00.000Z │        12040 │
│ 2015-09-12T05:00:00.000Z │         6399 │
│ 2015-09-12T06:00:00.000Z │         9036 │
│ 2015-09-12T07:00:00.000Z │        11409 │
│ 2015-09-12T08:00:00.000Z │        11616 │
│ 2015-09-12T09:00:00.000Z │        17509 │
│ 2015-09-12T10:00:00.000Z │        19406 │
│ 2015-09-12T11:00:00.000Z │        16284 │
│ 2015-09-12T12:00:00.000Z │        18672 │
│ 2015-09-12T13:00:00.000Z │        30520 │
│ 2015-09-12T14:00:00.000Z │        18025 │
│ 2015-09-12T15:00:00.000Z │        26399 │
│ 2015-09-12T16:00:00.000Z │        24759 │
│ 2015-09-12T17:00:00.000Z │        19634 │
│ 2015-09-12T18:00:00.000Z │        17345 │
│ 2015-09-12T19:00:00.000Z │        19305 │
│ 2015-09-12T20:00:00.000Z │        22265 │
│ 2015-09-12T21:00:00.000Z │        16394 │
│ 2015-09-12T22:00:00.000Z │        16379 │
│ 2015-09-12T23:00:00.000Z │        15289 │
└──────────────────────────┴──────────────┘
Retrieved 24 rows in 0.25s.

dsql>

3.3.3 GroupBy

dsql> SELECT channel, SUM(added) FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY channel ORDER BY SUM(added) DESC LIMIT 5;
┌───────────────┬─────────┐
│ channel       │ EXPR$1  │
├───────────────┼─────────┤
│ #en.wikipedia │ 3045299 │
│ #it.wikipedia │  711011 │
│ #fr.wikipedia │  642555 │
│ #ru.wikipedia │  640698 │
│ #es.wikipedia │  634670 │
└───────────────┴─────────┘
Retrieved 5 rows in 0.13s.
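
    The EXPR$1 header is just the autogenerated name Druid SQL assigns to an unaliased expression; adding an alias gives a readable column header (and the alias can be reused in ORDER BY), e.g.:

    dsql> SELECT channel, SUM(added) AS Added FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY channel ORDER BY Added DESC LIMIT 5;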

3.3.4 EXPLAIN PLAN FOR

dsql> EXPLAIN PLAN FOR SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10;
PLAN
----
DruidQueryRel(query=[{"queryType":"topN","dataSource":{"type":"table","name":"wikipedia"},"virtualColumns":[],"dimension":{"type":"default","dimension":"page","outputName":"d0","outputType":"STRING"},"metric":{"type":"numeric","metric":"a0"},"threshold":10,"intervals":{"type":"intervals","intervals":["2015-09-12T00:00:00.000Z/2015-09-13T00:00:00.001Z"]},"filter":null,"granularity":{"type":"all"},"aggregations":[{"type":"count","name":"a0"}],"postAggregations":[],"context":{},"descending":false}], signature=[{d0:STRING, a0:LONG}])
Retrieved 1 row in 0.09s.

 

    Reference: https://druid.apache.org/docs/latest/tutorials/index.html
