elasticsearch casual notes (8): analysis, Chinese word segmentation, and the ik plugin

 风_宇星 2014-12-29

Let's start with a standard analyzer, configured as follows:

curl -XPUT localhost:9200/local -d '{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "stem" : {
                    "tokenizer" : "standard",
                    "filter" : ["standard", "lowercase", "stop", "porter_stem"]
                }
            }
        }
    },
    "mappings" : {
        "article" : {
            "dynamic" : true,
            "properties" : {
                "title" : {
                    "type" : "string",
                    "analyzer" : "stem"
                }
            }
        }
    }
}'

index: local

type: article

default analyzer: stem (filters: lowercase, stop words, etc.)

field: title

Test:

# Sample Analysis
curl -XGET localhost:9200/local/_analyze?analyzer=stem -d '{Fight for your life}'
curl -XGET localhost:9200/local/_analyze?analyzer=stem -d '{Bruno fights Tyson tomorrow}'
  
# Index Data
curl -XPUT localhost:9200/local/article/1 -d'{"title": "Fight for your life"}'
curl -XPUT localhost:9200/local/article/2 -d'{"title": "Fighting for your life"}'
curl -XPUT localhost:9200/local/article/3 -d'{"title": "My dad fought a dog"}'
curl -XPUT localhost:9200/local/article/4 -d'{"title": "Bruno fights Tyson tomorrow"}'
  
# search on the title field, which is stemmed on index and search
curl -XGET localhost:9200/local/_search?q=title:fight
  
# searching on _all will not do any stemming, unless the mapping is also configured to stem it...
curl -XGET localhost:9200/local/_search?q=fight
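Because title is stemmed at both index and search time, the title:fight query above should match documents 1, 2, and 4: Porter stemming is suffix-based, so "fights" and "fighting" reduce to "fight", but the irregular "fought" is not reduced. This can be checked directly against the analyzer (a sketch assuming the node and index created above are running; host and port are assumptions):

```shell
# Compare how the stem analyzer normalizes regular vs. irregular forms.
# "fights" and "fighting" should both come back as "fight";
# "fought" should pass through unchanged.
curl -XGET 'localhost:9200/local/_analyze?analyzer=stem&pretty' -d 'fights fighting fought'
```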

For example:

Fight for your life

The tokens are as follows (note that the stop filter drops "for", so position 2 is skipped):

{"tokens":[
  {"token":"fight","start_offset":1,"end_offset":6,"type":"<ALPHANUM>","position":1},
  {"token":"your","start_offset":11,"end_offset":15,"type":"<ALPHANUM>","position":3},
  {"token":"life","start_offset":16,"end_offset":20,"type":"<ALPHANUM>","position":4}
]}
Deploying the ik analyzer:

1) Copy the ik analyzer plugin (the es build) into ./plugins/analyzerIK/

2) Add the following to elasticsearch.yml:

index.analysis.analyzer.ik.type : "ik"

3) Create ./config/ik under the config directory, containing:

IKAnalyzer.cfg.xml

main.dic

quantifier.dic

ext.dic

stopword.dic
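A minimal IKAnalyzer.cfg.xml can be sketched from the plugin's conventional format; the `ext_dict` / `ext_stopwords` entry keys below follow the common IK convention, so verify them against the version you deploy:

```shell
# Create the config directory and write a minimal IK configuration.
# main.dic and quantifier.dic ship with the plugin; ext.dic and
# stopword.dic are the user-supplied dictionaries referenced here.
mkdir -p config/ik
cat > config/ik/IKAnalyzer.cfg.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- custom dictionary, one word per line -->
    <entry key="ext_dict">ext.dic</entry>
    <!-- custom stop words, one word per line -->
    <entry key="ext_stopwords">stopword.dic</entry>
</properties>
EOF
```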

 

Delete the previously created index and reconfigure it as follows:

curl -XPUT localhost:9200/local -d '{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "ik" : {
                    "tokenizer" : "ik"
                }
            }
        }
    },
    "mappings" : {
        "article" : {
            "dynamic" : true,
            "properties" : {
                "title" : {
                    "type" : "string",
                    "analyzer" : "ik"
                }
            }
        }
    }
}'

  

Test:

    "text":"中华人民共和国国歌" 
{
  "tokens" : [ {
    "token" : "text",
    "start_offset" : 12,
    "end_offset" : 16,
    "type" : "ENGLISH",
    "position" : 1
  }, {
    "token" : "中华人民共和国",
    "start_offset" : 19,
    "end_offset" : 26,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "国歌",
    "start_offset" : 26,
    "end_offset" : 28,
    "type" : "CN_WORD",
    "position" : 3
  } ]
}
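The output above can be reproduced with a request along these lines (a hypothetical reconstruction; it requires a running node with the plugin installed, and host/port are assumptions):

```shell
# Ask the ik analyzer to tokenize the sample sentence.
# Passing bare text instead of a JSON body avoids the spurious
# "text" token seen above, since 1.x _analyze reads raw text.
curl -XGET 'localhost:9200/local/_analyze?analyzer=ik&pretty' -d '中华人民共和国国歌'
```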

  

 ---------------------------------------

If we want the finest-grained tokenization, configure elasticsearch.yml as follows:

index:
  analysis:
    analyzer:
      ik:
          alias: [ik_analyzer]
          type: org.elasticsearch.index.analysis.IkAnalyzerProvider
      ik_smart:
          type: ik
          use_smart: true
      ik_max_word:
          type: ik
          use_smart: false
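After restarting the node, the two aliases registered above can be exercised side by side (hypothetical commands; they require a running cluster, and host/port are assumptions):

```shell
# Coarse-grained: ik_smart keeps the longest dictionary matches.
curl -XGET 'localhost:9200/local/_analyze?analyzer=ik_smart&pretty' -d '中华人民共和国国歌'

# Fine-grained: ik_max_word emits all overlapping dictionary words,
# producing the long token list shown in the test below.
curl -XGET 'localhost:9200/local/_analyze?analyzer=ik_max_word&pretty' -d '中华人民共和国国歌'
```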

  

Test:

    "text":"中华人民共和国国歌" 
{
  "tokens" : [ {
    "token" : "text",
    "start_offset" : 12,
    "end_offset" : 16,
    "type" : "ENGLISH",
    "position" : 1
  }, {
    "token" : "中华人民共和国",
    "start_offset" : 19,
    "end_offset" : 26,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "中华人民",
    "start_offset" : 19,
    "end_offset" : 23,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "中华",
    "start_offset" : 19,
    "end_offset" : 21,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "华人",
    "start_offset" : 20,
    "end_offset" : 22,
    "type" : "CN_WORD",
    "position" : 5
  }, {
    "token" : "人民共和国",
    "start_offset" : 21,
    "end_offset" : 26,
    "type" : "CN_WORD",
    "position" : 6
  }, {
    "token" : "人民",
    "start_offset" : 21,
    "end_offset" : 23,
    "type" : "CN_WORD",
    "position" : 7
  }, {
    "token" : "共和国",
    "start_offset" : 23,
    "end_offset" : 26,
    "type" : "CN_WORD",
    "position" : 8
  }, {
    "token" : "共和",
    "start_offset" : 23,
    "end_offset" : 25,
    "type" : "CN_WORD",
    "position" : 9
  }, {
    "token" : "国",
    "start_offset" : 25,
    "end_offset" : 26,
    "type" : "CN_CHAR",
    "position" : 10
  }, {
    "token" : "国歌",
    "start_offset" : 26,
    "end_offset" : 28,
    "type" : "CN_WORD",
    "position" : 11
  } ]
}

  
