Elasticsearch: Configuring IK as the Default Analyzer and Using the Java AnalyzeRequestBuilder

xiaoxiao2021-02-27 322

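Summary: Originally published at www.bysocket.com by 「泥瓦匠BYSocket」. Reposting is welcome; please keep this summary. Thanks!

"Spring, summer, autumn, winter: having lost you, how do I get through the four seasons?" (folk song lyric)

Outline:

    1. What is Elasticsearch-analysis-ik
    2. Configuring IK as the default
    3. Getting tokenization results with AnalyzeRequestBuilder
    4. Summary

Environment: JDK 7 or 8, Maven 3.0+, Elasticsearch 2.3.2, Elasticsearch-analysis-ik 1.9.2

Tech stack: Spring Boot 1.5+, Spring-data-elasticsearch 2.1.0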

Preface

In the article "Elasticsearch and the elasticsearch-head plugin: installation guide" (http://www.bysocket.com/?p=1744) I used Elasticsearch 5.3.x. Here I switch to Elasticsearch 2.3.2 because of the version matrix at https://github.com/spring-projects/spring-data-elasticsearch/wiki/Spring-Data-Elasticsearch---Spring-Boot---version-matrix :

    Spring Boot (x)    Spring Data Elasticsearch (y)    Elasticsearch (z)
    x <= 1.3.5         y <= 1.3.4                       z <= 1.7.2*
    x >= 1.4.x         2.0.0 <= y < 5.0.0**             2.0.0 <= z < 5.0.0**

    *  You only need to change the corresponding version numbers in the pom file.
    ** The next ES version will bring major changes.

As the matrix shows, 5.3.x falls outside the range in the second row. This article therefore explains how to configure IK as the default analyzer in Elasticsearch 2.3.2.
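The range check in the second matrix row can be sketched in a few lines of Java. This is only an illustration of the reasoning above: `EsVersionCheck` and its methods are hypothetical names, not part of Spring Data Elasticsearch, and the sketch assumes plain numeric `major.minor.patch` version strings.

```java
// Hypothetical helper for illustration only; not part of any library.
final class EsVersionCheck {

    /** Parses "major.minor.patch" into three integers; missing parts count as 0. */
    static int[] parse(String version) {
        String[] parts = version.split("\\.");
        int[] numbers = new int[3];
        for (int i = 0; i < numbers.length && i < parts.length; i++) {
            numbers[i] = Integer.parseInt(parts[i]);
        }
        return numbers;
    }

    /** Compares two versions numerically, component by component. */
    static int compare(String left, String right) {
        int[] a = parse(left);
        int[] b = parse(right);
        for (int i = 0; i < 3; i++) {
            if (a[i] != b[i]) {
                return a[i] < b[i] ? -1 : 1;
            }
        }
        return 0;
    }

    /** True when 2.0.0 <= esVersion < 5.0.0, the ES range in the second matrix row. */
    static boolean inSecondRowRange(String esVersion) {
        return compare(esVersion, "2.0.0") >= 0 && compare(esVersion, "5.0.0") < 0;
    }

    public static void main(String[] args) {
        // The ES version used in this article, then a 5.3.x release.
        System.out.println("2.3.2 in range: " + inSecondRowRange("2.3.2"));
        System.out.println("5.3.0 in range: " + inSecondRowRange("5.3.0"));
    }
}
```

Comparing component by component avoids the pitfall of comparing version strings lexically, where "10.0.0" would sort before "2.0.0".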

1. What is Elasticsearch-analysis-ik

To understand Elasticsearch-analysis-ik, start with IK Analyzer. IK Analyzer is an open-source word-segmentation framework built on Lucene. Official site: https://code.google.com/p/ik-analyzer/ .

Elasticsearch-analysis-ik is the plugin that integrates IK Analyzer into Elasticsearch, and it supports custom dictionaries. GitHub: https://github.com/medcl/elasticsearch-analysis-ik

Supported features:

    Analyzer:  ik_smart or ik_max_word
    Tokenizer: ik_smart or ik_max_word

2. Configuring IK as the default

The Elasticsearch-analysis-ik project page lists which plugin version matches which ES version:

    IK version    ES version
    master        5.x -> master
    5.3.2         5.3.2
    5.2.2         5.2.2
    5.1.2         5.1.2
    1.10.1        2.4.1
    1.9.5         2.3.5
    1.8.1         2.2.1
    1.7.0         2.1.1
    1.5.0         2.0.0
    1.2.6         1.0.0
    1.2.5         0.90.x
    1.1.3         0.20.x
    1.0.0         0.16.2 -> 0.19.0

This article uses Elasticsearch-analysis-ik 1.9.2, which supports Elasticsearch 2.3.2. Download it from https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v1.9.2/elasticsearch-analysis-ik-1.9.2.zip and then install it: unzip the file and copy its contents into elasticsearch-2.3.2/plugins/ik.

    cd elasticsearch-2.3.2/plugins
    mkdir ik
    cp ...

Add the following to elasticsearch-2.3.2/config/elasticsearch.yml:

    index.analysis.analyzer.default.tokenizer : "ik_max_word"
    index.analysis.analyzer.default.type : "ik"

This sets the default analyzer type to ik and its tokenizer to ik_max_word. Then restart ES.

To verify that IK was installed successfully, request:

    localhost:9200/_analyze?analyzer=ik&pretty=true&text=泥瓦匠的博客是bysocket.com

This returns the following result:

    {
      "tokens" : [
        { "token" : "泥瓦匠", "start_offset" : 0, "end_offset" : 3, "type" : "CN_WORD", "position" : 0 },
        { "token" : "泥", "start_offset" : 0, "end_offset" : 1, "type" : "CN_WORD", "position" : 1 },
        { "token" : "瓦匠", "start_offset" : 1, "end_offset" : 3, "type" : "CN_WORD", "position" : 2 },
        { "token" : "匠", "start_offset" : 2, "end_offset" : 3, "type" : "CN_WORD", "position" : 3 },
        { "token" : "博客", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 4 },
        { "token" : "bysocket.com", "start_offset" : 8, "end_offset" : 20, "type" : "LETTER", "position" : 5 },
        { "token" : "bysocket", "start_offset" : 8, "end_offset" : 16, "type" : "ENGLISH", "position" : 6 },
        { "token" : "com", "start_offset" : 17, "end_offset" : 20, "type" : "ENGLISH", "position" : 7 }
      ]
    }

If ES is installed in a Docker container, remember to expose the corresponding ports.
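The verification request above contains Chinese text in the query string, so a client that builds the URL in code should percent-encode it. Below is a minimal sketch using only the JDK. `IkAnalyzeHttp`, `buildAnalyzeUrl` and `extractTokens` are hypothetical names introduced for illustration, and the regex-based extraction is a shortcut that assumes token values contain no escaped quotes; a real client would more likely use a JSON library.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Elasticsearch or IK.
final class IkAnalyzeHttp {

    // Matches "token" : "value" pairs. The closing quote right after the word
    // "token" keeps the pattern from matching the enclosing "tokens" key.
    private static final Pattern TOKEN =
            Pattern.compile("\"token\"\\s*:\\s*\"([^\"]*)\"");

    /** Builds the _analyze URL from the article, percent-encoding values as UTF-8. */
    static String buildAnalyzeUrl(String hostAndPort, String analyzer, String text) {
        try {
            return "http://" + hostAndPort + "/_analyze?analyzer="
                    + URLEncoder.encode(analyzer, "UTF-8")
                    + "&pretty=true&text="
                    + URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Collects the "token" values from an _analyze response body, in order. */
    static List<String> extractTokens(String json) {
        List<String> tokens = new ArrayList<String>();
        Matcher matcher = TOKEN.matcher(json);
        while (matcher.find()) {
            tokens.add(matcher.group(1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // \u535a\u5ba2 is the word for "blog" used in the article's sample text.
        System.out.println(buildAnalyzeUrl("localhost:9200", "ik", "\u535a\u5ba2"));
        String sample = "{ \"tokens\" : [ { \"token\" : \"bysocket\" }, { \"token\" : \"com\" } ] }";
        System.out.println(extractTokens(sample));
    }
}
```

The sketch only builds the URL and parses a response string; sending the request is left to whichever HTTP client the project already uses.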

3. Getting tokenization results with AnalyzeRequestBuilder

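With IK configured as the default in ES, tokenization results are available over REST/HTTP. How do we get the same results from Spring Boot using the spring-data-elasticsearch client dependency? First, add the dependency to pom.xml: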

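    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
    </dependency>

Configure the ES address in application.properties:

    # ES
    spring.data.elasticsearch.repositories.enabled = true
    spring.data.elasticsearch.cluster-nodes = 127.0.0.1:9300

Then create a method that takes the search text as input and returns the list of tokenized terms:

    import java.util.ArrayList;
    import java.util.List;

    import org.elasticsearch.action.admin.indices.analyze.AnalyzeAction;
    import org.elasticsearch.action.admin.indices.analyze.AnalyzeRequestBuilder;
    import org.elasticsearch.action.admin.indices.analyze.AnalyzeResponse;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.data.elasticsearch.core.ElasticsearchTemplate;

    @Autowired
    private ElasticsearchTemplate elasticsearchTemplate;

    /**
     * Calls ES and returns the terms produced by IK tokenization.
     *
     * @param searchContent the text to tokenize
     * @return the list of terms
     */
    private List<String> getIkAnalyzeSearchTerms(String searchContent) {
        // Run IK tokenization against the given index
        AnalyzeRequestBuilder ikRequest = new AnalyzeRequestBuilder(elasticsearchTemplate.getClient(),
                AnalyzeAction.INSTANCE, "indexName", searchContent);
        ikRequest.setTokenizer("ik");
        List<AnalyzeResponse.AnalyzeToken> ikTokenList = ikRequest.execute().actionGet().getTokens();

        // Copy each term into the result list (a plain loop, so this also compiles on JDK 7)
        List<String> searchTermList = new ArrayList<>();
        for (AnalyzeResponse.AnalyzeToken ikToken : ikTokenList) {
            searchTermList.add(ikToken.getTerm());
        }

        return searchTermList;
    }

Here indexName is the name of an index set up in ES. The code obtains the Client from the container-injected ElasticsearchTemplate bean, then uses AnalyzeRequestBuilder to run the analysis request and reads the result as a list of AnalyzeResponse.AnalyzeToken.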
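Because ik_max_word produces overlapping tokens (the sample response earlier contains both the full word and its single characters), a caller may want to tidy the returned term list before using it for search. The following is one possible post-processing sketch, not something the article's method does: `SearchTermFilter` and `filterTerms` are hypothetical names, and the minimum-length rule is an assumption chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the article's code.
final class SearchTermFilter {

    /**
     * Drops null terms and terms shorter than minLength (counted in code points),
     * and removes duplicates while keeping first-seen order.
     */
    static List<String> filterTerms(List<String> terms, int minLength) {
        Set<String> kept = new LinkedHashSet<String>();
        for (String term : terms) {
            if (term != null && term.codePointCount(0, term.length()) >= minLength) {
                kept.add(term);
            }
        }
        return new ArrayList<String>(kept);
    }

    public static void main(String[] args) {
        List<String> raw = new ArrayList<String>();
        raw.add("bysocket.com");
        raw.add("bysocket");
        raw.add("com");
        raw.add("c");
        raw.add("bysocket");
        // Keeps terms of at least two characters, each only once.
        System.out.println(filterTerms(raw, 2));
    }
}
```

A LinkedHashSet is used so that the order ES returned the tokens in is preserved. Whether single-character terms should really be dropped depends on the search use case.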

4. Summary

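Once IK is configured as the default analyzer, DSL queries sent to ES use IK tokenization automatically. If you need a custom dictionary, for example for niche, domain-specific vocabulary, see the Elasticsearch-analysis-ik GitHub page for details.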

Recommended open-source project: "springboot-learning-example"

Hands-on Spring Boot learning examples: best practices for Spring Boot beginners and for consolidating core techniques.

Follow my WeChat official account to get blog updates. Blog: http://www.bysocket.com/ GitHub: https://github.com/JeffLi1993

