Druid Data Ingestion and Query Examples

码农书馆 2020-06-28

1. Prerequisites

1.1 Environment

    Java 8
    Linux, Mac OS X, or another Unix-like operating system (Windows is not supported)
    8 GB of RAM
    2 vCPUs

 

1.2 Download and extract Druid

    Download: https://www.apache.org/dyn/closer.cgi?path=/incubator/druid/0.13.0-incubating/apache-druid-0.13.0-incubating-bin.tar.gz
    Extract: tar -xzf apache-druid-0.13.0-incubating-bin.tar.gz
    Enter the directory: cd apache-druid-0.13.0-incubating

 

1.3 Download and extract ZooKeeper

    Druid relies on Apache ZooKeeper for distributed coordination. From the Druid root directory, download and extract ZooKeeper by running:

    curl https://archive.apache.org/dist/zookeeper/zookeeper-3.4.11/zookeeper-3.4.11.tar.gz -o zookeeper-3.4.11.tar.gz
    tar -xzf zookeeper-3.4.11.tar.gz
    mv zookeeper-3.4.11 zk

 

1.4 Start Druid

    From the Druid root directory, run:

MacBook-Air-3:apache-druid-0.13.0-incubating g2$ bin/supervise -c quickstart/tutorial/conf/tutorial-cluster.conf

    This brings up instances of ZooKeeper and the Druid services, all running on the local machine, for example:

[Tue Dec 25 16:11:35 2018] Running command[zk], logging to[/Users/g2/myresource/druid/apache-druid-0.13.0-incubating/var/sv/zk.log]: bin/run-zk quickstart/tutorial/conf
[Tue Dec 25 16:11:35 2018] Running command[coordinator], logging to[/Users/g2/myresource/druid/apache-druid-0.13.0-incubating/var/sv/coordinator.log]: bin/run-druid coordinator quickstart/tutorial/conf
[Tue Dec 25 16:11:35 2018] Running command[broker], logging to[/Users/g2/myresource/druid/apache-druid-0.13.0-incubating/var/sv/broker.log]: bin/run-druid broker quickstart/tutorial/conf
[Tue Dec 25 16:11:35 2018] Running command[historical], logging to[/Users/g2/myresource/druid/apache-druid-0.13.0-incubating/var/sv/historical.log]: bin/run-druid historical quickstart/tutorial/conf
[Tue Dec 25 16:11:35 2018] Running command[overlord], logging to[/Users/g2/myresource/druid/apache-druid-0.13.0-incubating/var/sv/overlord.log]: bin/run-druid overlord quickstart/tutorial/conf
[Tue Dec 25 16:11:35 2018] Running command[middleManager], logging to[/Users/g2/myresource/druid/apache-druid-0.13.0-incubating/var/sv/middleManager.log]: bin/run-druid middleManager quickstart/tutorial/conf

    All persistent state, such as the cluster metadata store and the segments for the services, is kept in the var directory under apache-druid-0.13.0-incubating. Service logs are located in var/sv.
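
    A quick way to confirm everything came up is to poll each service's /status endpoint. This is a minimal sketch, assuming the quickstart's default ports (coordinator 8081, broker 8082, historical 8083, overlord 8090, middleManager 8091):

    # Poll each Druid service's /status endpoint; a healthy service
    # responds with a small JSON document containing its version.
    for port in 8081 8082 8083 8090 8091; do
      curl -s "http://localhost:${port}/status" > /dev/null \
        && echo "port ${port}: up" \
        || echo "port ${port}: not responding"
    done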

 

2. Loading data

    The sample data is located at quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz under the Druid package root. It consists of Wikipedia page-edit events stored as JSON objects in a text file.

    The sample data contains the following columns; an example event looks like this:

{
    "time": "2015-09-12T00:47:47.870Z",
    "channel": "#vi.wikipedia",
    "cityName": null,
    "comment": "clean up using [[Project:AWB|AWB]]",
    "countryIsoCode": null,
    "countryName": null,
    "isAnonymous": false,
    "isMinor": false,
    "isNew": false,
    "isRobot": true,
    "isUnpatrolled": false,
    "metroCode": null,
    "namespace": "Main",
    "page": "Atractus duboisi",
    "regionIsoCode": null,
    "regionName": null,
    "user": "ThitxongkhoiAWB",
    "delta": 18,
    "added": 18,
    "deleted": 0
}

2.1 Loading data from Kafka (streaming)

2.1.1 Kafka setup

(1) Download Kafka

    curl -O https://archive.apache.org/dist/kafka/0.10.2.0/kafka_2.11-0.10.2.0.tgz
    tar -xzf kafka_2.11-0.10.2.0.tgz
    cd kafka_2.11-0.10.2.0

(2) Start Kafka

MacBook-Air-3:kafka_2.11-0.10.2.0 g2$ ./bin/kafka-server-start.sh config/server.properties
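
    Kafka 0.10.2 registers its broker in ZooKeeper; here it simply reuses the ZooKeeper instance that bin/supervise started on the default port 2181. As a quick sanity check before starting the broker (assuming nc is available), ZooKeeper answers the four-letter command ruok with imok when it is healthy:

    # Should print "imok" if the quickstart ZooKeeper is listening on 2181
    echo ruok | nc localhost 2181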

(3) List topics

MacBook-Air-3:kafka_2.11-0.10.2.0 g2$ ./bin/kafka-topics.sh --zookeeper localhost:2181 --list

(4) Create a topic

MacBook-Air-3:kafka_2.11-0.10.2.0 g2$ ./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wikipedia

(5) Delete a topic

MacBook-Air-3:kafka_2.11-0.10.2.0 g2$ ./bin/kafka-topics.sh --zookeeper localhost:2181 --delete --topic wikipedia
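
    To verify later that events are actually arriving on the topic, a console consumer can be attached; a quick check, not part of the original tutorial:

    # Print everything on the wikipedia topic from the earliest offset
    ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic wikipedia --from-beginning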

2.1.2 Druid operations

      Once the data is in Kafka, Druid needs a datasource spec file that tells it how to ingest the stream: the timestamp format, the dimension columns, the metric columns, the pre-aggregation granularity, and so on.

(1) Start the supervisor that ingests from Kafka into Druid

MacBook-Air-3:apache-druid-0.13.0-incubating g2$ curl -XPOST -H'Content-Type: application/json' -d @quickstart/tutorial/wikipedia-kafka-supervisor.json http://localhost:8090/druid/indexer/v1/supervisor

    Note: if the supervisor is created successfully, the response contains the supervisor ID (which matches the dataSource name); in our case we should see {"id":"wikipedia"}.
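
    To verify, the Overlord's supervisor API can be queried; a small check assuming the quickstart's Overlord port 8090:

    # List the IDs of all running supervisors
    curl http://localhost:8090/druid/indexer/v1/supervisor
    # Detailed status of this supervisor, including its Kafka offsets
    curl http://localhost:8090/druid/indexer/v1/supervisor/wikipedia/status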

    The spec file (datasource definition): wikipedia-kafka-supervisor.json

{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "wikipedia",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "time",
          "format": "auto"
        },
        "dimensionsSpec": {
          "dimensions": [
            "channel",
            "cityName",
            "comment",
            "countryIsoCode",
            "countryName",
            "isAnonymous",
            "isMinor",
            "isNew",
            "isRobot",
            "isUnpatrolled",
            "metroCode",
            "namespace",
            "page",
            "regionIsoCode",
            "regionName",
            "user",
            { "name": "added", "type": "long" },
            { "name": "deleted", "type": "long" },
            { "name": "delta", "type": "long" }
          ]
        }
      }
    },
    "metricsSpec" : [],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "DAY",
      "queryGranularity": "NONE",
      "rollup": false
    }
  },
  "tuningConfig": {
    "type": "kafka",
    "reportParseExceptions": false
  },
  "ioConfig": {
    "topic": "wikipedia",
    "replicas": 2,
    "taskDuration": "PT10M",
    "completionTimeout": "PT20M",
    "consumerProperties": {
      "bootstrap.servers": "localhost:9092"
    }
  }
}
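
    When the stream is no longer needed, the supervisor can be stopped through the same API; in 0.13 the endpoint is shutdown (later Druid releases renamed it terminate):

    # Stop the wikipedia supervisor and its indexing tasks
    curl -XPOST http://localhost:8090/druid/indexer/v1/supervisor/wikipedia/shutdown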

(2) Extract wikiticker-2015-09-12-sampled.json.gz

    MacBook-Air-3:tutorial g2$ pwd
    /Users/g2/myresource/druid/apache-druid-0.13.0-incubating/quickstart/tutorial
    MacBook-Air-3:tutorial g2$ gunzip -k wikiticker-2015-09-12-sampled.json.gz

    Sample record:

{
    "time": "2015-09-12T00:48:02.596Z",
    "channel": "#es.wikipedia",
    "cityName": "Mexico City",
    "comment": "Cambio en la redacción del texto y correción en sintaxis",
    "countryIsoCode": "MX",
    "countryName": "Mexico",
    "isAnonymous": true,
    "isMinor": false,
    "isNew": false,
    "isRobot": false,
    "isUnpatrolled": false,
    "metroCode": null,
    "namespace": "Main",
    "page": "Mathis Bolly",
    "regionIsoCode": "DIF",
    "regionName": "Mexico City",
    "user": "189.217.75.123",
    "delta": -67,
    "added": 0,
    "deleted": 67
}

(3) In the Kafka directory, run the following commands (to write the data into Kafka)

    MacBook-Air-3:kafka_2.11-0.10.2.0 g2$ export KAFKA_OPTS="-Dfile.encoding=UTF-8"
    MacBook-Air-3:kafka_2.11-0.10.2.0 g2$ ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia < /Users/g2/myresource/druid/apache-druid-0.13.0-incubating/quickstart/tutorial/wikiticker-2015-09-12-sampled.json
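
    Additional events can be appended later by piping them into the same console producer. For example (a hypothetical extra event with the same schema; fields omitted here are simply ingested as null dimensions):

    # Hypothetical extra event; the supervisor picks it up from the topic
    echo '{"time":"2015-09-12T01:00:00.000Z","channel":"#en.wikipedia","page":"Druid","user":"demo","added":1,"deleted":0,"delta":1}' | ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia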

     The data is now in Kafka and the supervisor ingests it continuously; next we can query it in Druid.

2.2 Loading a file (batch)

2.2.1 Prepare the data and the ingestion task spec: wikipedia-index.json

{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "channel",
              "cityName",
              "comment",
              "countryIsoCode",
              "countryName",
              "isAnonymous",
              "isMinor",
              "isNew",
              "isRobot",
              "isUnpatrolled",
              "metroCode",
              "namespace",
              "page",
              "regionIsoCode",
              "regionName",
              "user",
              { "name": "added", "type": "long" },
              { "name": "deleted", "type": "long" },
              { "name": "delta", "type": "long" }
            ]
          },
          "timestampSpec": {
            "column": "time",
            "format": "iso"
          }
        }
      },
      "metricsSpec" : [],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-09-12/2015-09-13"],
        "rollup" : false
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "local",
        "baseDir" : "quickstart/tutorial/",
        "filter" : "wikiticker-2015-09-12-sampled.json.gz"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index",
      "targetPartitionSize" : 5000000,
      "maxRowsInMemory" : 25000,
      "forceExtendableShardSpecs" : true
    }
  }
}
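
    The next step uses the bin/post-index-task helper, but the same spec can also be POSTed directly to the Overlord's task API (port 8090 in the quickstart) and the returned task ID polled by hand:

    # Submit the indexing task directly; the response contains the task ID
    curl -X POST -H 'Content-Type: application/json' -d @quickstart/tutorial/wikipedia-index.json http://localhost:8090/druid/indexer/v1/task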

2.2.2 Load the batch data

MacBook-Air-3:apache-druid-0.13.0-incubating g2$ bin/post-index-task --file quickstart/tutorial/wikipedia-index.json
Beginning indexing data for wikipedia
Task started: index_wikipedia_2018-12-25T10:00:45.744Z
Task log:     http://localhost:8090/druid/indexer/v1/task/index_wikipedia_2018-12-25T10:00:45.744Z/log
Task status:  http://localhost:8090/druid/indexer/v1/task/index_wikipedia_2018-12-25T10:00:45.744Z/status
Task index_wikipedia_2018-12-25T10:00:45.744Z still running...
Task index_wikipedia_2018-12-25T10:00:45.744Z still running...
Task index_wikipedia_2018-12-25T10:00:45.744Z still running...
Task index_wikipedia_2018-12-25T10:00:45.744Z still running...
Task index_wikipedia_2018-12-25T10:00:45.744Z still running...
Task finished with status: SUCCESS
Completed indexing data for wikipedia. Now loading indexed data onto the cluster...
wikipedia loading complete! You may now query your data
MacBook-Air-3:apache-druid-0.13.0-incubating g2$

    The data is now loaded into Druid and we can move on to querying it.
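
    One way to confirm the datasource is available is to ask the Coordinator (port 8081 in the quickstart) for the datasources it knows about; the response should include "wikipedia":

    curl http://localhost:8081/druid/coordinator/v1/datasources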

 

3. Querying data

3.1 Native JSON queries

3.1.1 Example query request

{
  "queryType" : "topN",
  "dataSource" : "wikipedia",
  "intervals" : ["2015-09-12/2015-09-13"],
  "granularity" : "all",
  "dimension" : "page",
  "metric" : "count",
  "threshold" : 15,
  "aggregations" : [
    {
      "type" : "count",
      "name" : "count"
    }
  ]
}

3.1.2 Run the query:

MacBook-Air-3:apache-druid-0.13.0-incubating g2$ curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-top-pages.json http://localhost:8082/druid/v2?pretty

3.1.3 Result

[ {
  "timestamp" : "2015-09-12T00:46:58.771Z",
  "result" : [ {
    "count" : 33,
    "page" : "Wikipedia:Vandalismusmeldung"
  }, {
    "count" : 28,
    "page" : "User:Cyde/List of candidates for speedy deletion/Subpage"
  }, {
    "count" : 27,
    "page" : "Jeremy Corbyn"
  }, {
    "count" : 21,
    "page" : "Wikipedia:Administrators' noticeboard/Incidents"
  }, {
    "count" : 20,
    "page" : "Flavia Pennetta"
  }, {
    "count" : 18,
    "page" : "Total Drama Presents: The Ridonculous Race"
  }, {
    "count" : 18,
    "page" : "User talk:Dudeperson176123"
  }, {
    "count" : 18,
    "page" : "Wikipédia:Le Bistro/12 septembre 2015"
  }, {
    "count" : 17,
    "page" : "Wikipedia:In the news/Candidates"
  }, {
    "count" : 17,
    "page" : "Wikipedia:Requests for page protection"
  }, {
    "count" : 16,
    "page" : "Utente:Giulio Mainardi/Sandbox"
  }, {
    "count" : 16,
    "page" : "Wikipedia:Administrator intervention against vandalism"
  }, {
    "count" : 15,
    "page" : "Anthony Martial"
  }, {
    "count" : 13,
    "page" : "Template talk:Connected contributor"
  }, {
    "count" : 12,
    "page" : "Chronologie de la Lorraine"
  } ]
} ]
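
    Native queries are not limited to topN. As a sketch, the hourly deletion total computed with SQL in section 3.3.2 below could also be written as a native timeseries query (save it to a file and POST it to the same /druid/v2 endpoint):

    {
      "queryType" : "timeseries",
      "dataSource" : "wikipedia",
      "intervals" : ["2015-09-12/2015-09-13"],
      "granularity" : "hour",
      "aggregations" : [
        { "type" : "longSum", "name" : "LinesDeleted", "fieldName" : "deleted" }
      ]
    }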

 

3.2 Druid SQL queries

    Druid also supports a SQL dialect for querying. Let's run a SQL query that is equivalent to the native JSON query shown above.

3.2.1 The SQL request in JSON form

{
  "query":"SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE \"__time\" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10"
}

3.2.2 Run the query

MacBook-Air-3:apache-druid-0.13.0-incubating g2$ curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-top-pages-sql.json http://localhost:8082/druid/v2/sql

3.2.3 Result

[
    {
        "page": "Wikipedia:Vandalismusmeldung",
        "Edits": 33
    },
    {
        "page": "User:Cyde/List of candidates for speedy deletion/Subpage",
        "Edits": 28
    },
    {
        "page": "Jeremy Corbyn",
        "Edits": 27
    },
    {
        "page": "Wikipedia:Administrators' noticeboard/Incidents",
        "Edits": 21
    },
    {
        "page": "Flavia Pennetta",
        "Edits": 20
    },
    {
        "page": "Total Drama Presents: The Ridonculous Race",
        "Edits": 18
    },
    {
        "page": "User talk:Dudeperson176123",
        "Edits": 18
    },
    {
        "page": "Wikipédia:Le Bistro/12 septembre 2015",
        "Edits": 18
    },
    {
        "page": "Wikipedia:In the news/Candidates",
        "Edits": 17
    },
    {
        "page": "Wikipedia:Requests for page protection",
        "Edits": 17
    }
]

 

3.3 SQL client (dsql)

    For convenience, the Druid package includes a SQL command-line client at bin/dsql under the package root. Run bin/dsql and you should see the following prompt:

MacBook-Air-3:apache-druid-0.13.0-incubating g2$ bin/dsql
Welcome to dsql, the command-line client for Druid SQL.
Type "\h" for help.
dsql>

3.3.1 Example 1

dsql> SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10;
┌──────────────────────────────────────────────────────────┬───────┐
│ page                                                     │ Edits │
├──────────────────────────────────────────────────────────┼───────┤
│ Wikipedia:Vandalismusmeldung                             │    33 │
│ User:Cyde/List of candidates for speedy deletion/Subpage │    28 │
│ Jeremy Corbyn                                            │    27 │
│ Wikipedia:Administrators' noticeboard/Incidents          │    21 │
│ Flavia Pennetta                                          │    20 │
│ Total Drama Presents: The Ridonculous Race               │    18 │
│ User talk:Dudeperson176123                               │    18 │
│ Wikipédia:Le Bistro/12 septembre 2015                    │    18 │
│ Wikipedia:In the news/Candidates                         │    17 │
│ Wikipedia:Requests for page protection                   │    17 │
└──────────────────────────────────────────────────────────┴───────┘
Retrieved 10 rows in 0.35s.

3.3.2 Example 2: Timeseries

dsql> SELECT FLOOR(__time to HOUR) AS HourTime, SUM(deleted) AS LinesDeleted FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY FLOOR(__time to HOUR);
┌──────────────────────────┬──────────────┐
│ HourTime                 │ LinesDeleted │
├──────────────────────────┼──────────────┤
│ 2015-09-12T00:00:00.000Z │         1761 │
│ 2015-09-12T01:00:00.000Z │        16208 │
│ 2015-09-12T02:00:00.000Z │        14543 │
│ 2015-09-12T03:00:00.000Z │        13101 │
│ 2015-09-12T04:00:00.000Z │        12040 │
│ 2015-09-12T05:00:00.000Z │         6399 │
│ 2015-09-12T06:00:00.000Z │         9036 │
│ 2015-09-12T07:00:00.000Z │        11409 │
│ 2015-09-12T08:00:00.000Z │        11616 │
│ 2015-09-12T09:00:00.000Z │        17509 │
│ 2015-09-12T10:00:00.000Z │        19406 │
│ 2015-09-12T11:00:00.000Z │        16284 │
│ 2015-09-12T12:00:00.000Z │        18672 │
│ 2015-09-12T13:00:00.000Z │        30520 │
│ 2015-09-12T14:00:00.000Z │        18025 │
│ 2015-09-12T15:00:00.000Z │        26399 │
│ 2015-09-12T16:00:00.000Z │        24759 │
│ 2015-09-12T17:00:00.000Z │        19634 │
│ 2015-09-12T18:00:00.000Z │        17345 │
│ 2015-09-12T19:00:00.000Z │        19305 │
│ 2015-09-12T20:00:00.000Z │        22265 │
│ 2015-09-12T21:00:00.000Z │        16394 │
│ 2015-09-12T22:00:00.000Z │        16379 │
│ 2015-09-12T23:00:00.000Z │        15289 │
└──────────────────────────┴──────────────┘
Retrieved 24 rows in 0.25s.

dsql>

3.3.3 GroupBy

dsql> SELECT channel, SUM(added) FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY channel ORDER BY SUM(added) DESC LIMIT 5;
┌───────────────┬─────────┐
│ channel       │ EXPR$1  │
├───────────────┼─────────┤
│ #en.wikipedia │ 3045299 │
│ #it.wikipedia │  711011 │
│ #fr.wikipedia │  642555 │
│ #ru.wikipedia │  640698 │
│ #es.wikipedia │  634670 │
└───────────────┴─────────┘
Retrieved 5 rows in 0.13s.
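
    The EXPR$1 header is just the autogenerated name Druid SQL assigns to an unaliased expression; adding an alias gives a readable column header (and the alias can be reused in ORDER BY), e.g.:

    dsql> SELECT channel, SUM(added) AS Added FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY channel ORDER BY Added DESC LIMIT 5;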

3.3.4 EXPLAIN PLAN FOR

dsql> EXPLAIN PLAN FOR SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10;
PLAN
----
DruidQueryRel(query=[{"queryType":"topN","dataSource":{"type":"table","name":"wikipedia"},"virtualColumns":[],"dimension":{"type":"default","dimension":"page","outputName":"d0","outputType":"STRING"},"metric":{"type":"numeric","metric":"a0"},"threshold":10,"intervals":{"type":"intervals","intervals":["2015-09-12T00:00:00.000Z/2015-09-13T00:00:00.001Z"]},"filter":null,"granularity":{"type":"all"},"aggregations":[{"type":"count","name":"a0"}],"postAggregations":[],"context":{},"descending":false}], signature=[{d0:STRING, a0:LONG}])
Retrieved 1 row in 0.09s.

 

    Reference: https://druid.apache.org/docs/latest/tutorials/index.html
