elasticsearch casual notes (8): analysis, Chinese word segmentation, and the ik plugin

 风_宇星 2014-12-29

Let's start with a standard analyzer, configured as follows:

curl -XPUT localhost:9200/local -d '{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "stem" : {
                    "tokenizer" : "standard",
                    "filter" : ["standard", "lowercase", "stop", "porter_stem"]
                }
            }
        }
    },
    "mappings" : {
        "article" : {
            "dynamic" : true,
            "properties" : {
                "title" : {
                    "type" : "string",
                    "analyzer" : "stem"
                }
            }
        }
    }
}'

index: local

type: article

default analyzer: stem (filters: lowercase, stop words, etc.)

field: title

Test:

# Sample Analysis
curl -XGET localhost:9200/local/_analyze?analyzer=stem -d '{Fight for your life}'
curl -XGET localhost:9200/local/_analyze?analyzer=stem -d '{Bruno fights Tyson tomorrow}'
  
# Index Data
curl -XPUT localhost:9200/local/article/1 -d'{"title": "Fight for your life"}'
curl -XPUT localhost:9200/local/article/2 -d'{"title": "Fighting for your life"}'
curl -XPUT localhost:9200/local/article/3 -d'{"title": "My dad fought a dog"}'
curl -XPUT localhost:9200/local/article/4 -d'{"title": "Bruno fights Tyson tomorrow"}'
  
# search on the title field, which is stemmed on index and search
curl -XGET localhost:9200/local/_search?q=title:fight
  
# searching on _all will not do any stemming, unless the mapping is also configured to stem it...
curl -XGET localhost:9200/local/_search?q=fight
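Because title is stemmed at both index and search time, the title:fight query above should match documents 1, 2, and 4: Porter stemming is suffix-based, so "fights" and "fighting" reduce to "fight", but the irregular "fought" is not reduced. This can be checked directly against the analyzer (a sketch assuming the node and index created above are running; host and port are assumptions):

```shell
# Compare how the stem analyzer normalizes regular vs. irregular forms.
# "fights" and "fighting" should both come back as "fight";
# "fought" should pass through unchanged.
curl -XGET 'localhost:9200/local/_analyze?analyzer=stem&pretty' -d 'fights fighting fought'
```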

For example:

Fight for your life

The tokens are as follows (note that the stop filter drops "for", so position 2 is skipped):

{"tokens":[
  {"token":"fight","start_offset":1,"end_offset":6,"type":"<ALPHANUM>","position":1},
  {"token":"your","start_offset":11,"end_offset":15,"type":"<ALPHANUM>","position":3},
  {"token":"life","start_offset":16,"end_offset":20,"type":"<ALPHANUM>","position":4}
]}
Deploying the ik analyzer:

1) Copy the ik analyzer plugin (the es build) into ./plugins/analyzerIK/

2) Add the following to elasticsearch.yml:

index.analysis.analyzer.ik.type : "ik"

3) Create ./config/ik under the config directory, containing:

IKAnalyzer.cfg.xml

main.dic

quantifier.dic

ext.dic

stopword.dic
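A minimal IKAnalyzer.cfg.xml can be sketched from the plugin's conventional format; the `ext_dict` / `ext_stopwords` entry keys below follow the common IK convention, so verify them against the version you deploy:

```shell
# Create the config directory and write a minimal IK configuration.
# main.dic and quantifier.dic ship with the plugin; ext.dic and
# stopword.dic are the user-supplied dictionaries referenced here.
mkdir -p config/ik
cat > config/ik/IKAnalyzer.cfg.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- custom dictionary, one word per line -->
    <entry key="ext_dict">ext.dic</entry>
    <!-- custom stop words, one word per line -->
    <entry key="ext_stopwords">stopword.dic</entry>
</properties>
EOF
```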

 

Delete the previously created index and reconfigure it as follows:

curl -XPUT localhost:9200/local -d '{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "ik" : {
                    "tokenizer" : "ik"
                }
            }
        }
    },
    "mappings" : {
        "article" : {
            "dynamic" : true,
            "properties" : {
                "title" : {
                    "type" : "string",
                    "analyzer" : "ik"
                }
            }
        }
    }
}'

  

Test:

    "text":"中华人民共和国国歌" 
{
  "tokens" : [ {
    "token" : "text",
    "start_offset" : 12,
    "end_offset" : 16,
    "type" : "ENGLISH",
    "position" : 1
  }, {
    "token" : "中华人民共和国",
    "start_offset" : 19,
    "end_offset" : 26,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "国歌",
    "start_offset" : 26,
    "end_offset" : 28,
    "type" : "CN_WORD",
    "position" : 3
  } ]
}
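The output above can be reproduced with a request along these lines (a hypothetical reconstruction; it requires a running node with the plugin installed, and host/port are assumptions):

```shell
# Ask the ik analyzer to tokenize the sample sentence.
# Passing bare text instead of a JSON body avoids the spurious
# "text" token seen above, since 1.x _analyze reads raw text.
curl -XGET 'localhost:9200/local/_analyze?analyzer=ik&pretty' -d '中华人民共和国国歌'
```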

  

 ---------------------------------------

If we want the finest-grained tokenization, configure elasticsearch.yml as follows:

index:
  analysis:
    analyzer:
      ik:
          alias: [ik_analyzer]
          type: org.elasticsearch.index.analysis.IkAnalyzerProvider
      ik_smart:
          type: ik
          use_smart: true
      ik_max_word:
          type: ik
          use_smart: false
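After restarting the node, the two aliases registered above can be exercised side by side (hypothetical commands; they require a running cluster, and host/port are assumptions):

```shell
# Coarse-grained: ik_smart keeps the longest dictionary matches.
curl -XGET 'localhost:9200/local/_analyze?analyzer=ik_smart&pretty' -d '中华人民共和国国歌'

# Fine-grained: ik_max_word emits all overlapping dictionary words,
# producing the long token list shown in the test below.
curl -XGET 'localhost:9200/local/_analyze?analyzer=ik_max_word&pretty' -d '中华人民共和国国歌'
```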

  

Test:

    "text":"中华人民共和国国歌" 
{
  "tokens" : [ {
    "token" : "text",
    "start_offset" : 12,
    "end_offset" : 16,
    "type" : "ENGLISH",
    "position" : 1
  }, {
    "token" : "中华人民共和国",
    "start_offset" : 19,
    "end_offset" : 26,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "中华人民",
    "start_offset" : 19,
    "end_offset" : 23,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "中华",
    "start_offset" : 19,
    "end_offset" : 21,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "华人",
    "start_offset" : 20,
    "end_offset" : 22,
    "type" : "CN_WORD",
    "position" : 5
  }, {
    "token" : "人民共和国",
    "start_offset" : 21,
    "end_offset" : 26,
    "type" : "CN_WORD",
    "position" : 6
  }, {
    "token" : "人民",
    "start_offset" : 21,
    "end_offset" : 23,
    "type" : "CN_WORD",
    "position" : 7
  }, {
    "token" : "共和国",
    "start_offset" : 23,
    "end_offset" : 26,
    "type" : "CN_WORD",
    "position" : 8
  }, {
    "token" : "共和",
    "start_offset" : 23,
    "end_offset" : 25,
    "type" : "CN_WORD",
    "position" : 9
  }, {
    "token" : "国",
    "start_offset" : 25,
    "end_offset" : 26,
    "type" : "CN_CHAR",
    "position" : 10
  }, {
    "token" : "国歌",
    "start_offset" : 26,
    "end_offset" : 28,
    "type" : "CN_WORD",
    "position" : 11
  } ]
}

  
