[[analysis-custom-analyzer]]
=== Create a custom analyzer

When the built-in analyzers do not fulfill your needs, you can create a
`custom` analyzer which uses the appropriate combination of:

* zero or more <<analysis-charfilters, character filters>>
* a <<analysis-tokenizers,tokenizer>>
* zero or more <<analysis-tokenfilters,token filters>>.

[discrete]
=== Configuration

The `custom` analyzer accepts the following parameters:

[horizontal]
`type`::
Analyzer type. Accepts <<analysis-analyzers, built-in analyzer types>>. For
custom analyzers, use `custom` or omit this parameter.

`tokenizer`::
A built-in or customised <<analysis-tokenizers,tokenizer>>. (Required)

`char_filter`::
An optional array of built-in or customised
<<analysis-charfilters, character filters>>.

`filter`::
An optional array of built-in or customised
<<analysis-tokenfilters, token filters>>.

`position_increment_gap`::
When indexing an array of text values, Elasticsearch inserts a fake "gap"
between the last term of one value and the first term of the next value to
ensure that a phrase query doesn't match two terms from different array
elements. Defaults to `100`. See <<position-increment-gap>> for more, and
the sketch below.
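
For example, the gap can be set directly on a `custom` analyzer. Here is a
minimal sketch (the index name `my-index-000002` and analyzer name
`no_gap_analyzer` are illustrative); with a gap of `0`, a phrase query can
match terms from neighbouring array elements:

[source,console]
--------------------------------
PUT my-index-000002
{
  "settings": {
    "analysis": {
      "analyzer": {
        "no_gap_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "position_increment_gap": 0 <1>
        }
      }
    }
  }
}
--------------------------------
<1> Hypothetical setting for illustration; removes the default gap of `100`
positions between array values.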

[discrete]
=== Example configuration

Here is an example that combines the following:

Character Filter::
* <<analysis-htmlstrip-charfilter,HTML Strip Character Filter>>

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
* <<analysis-asciifolding-tokenfilter,ASCII-Folding Token Filter>>

[source,console]
--------------------------------
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom", <1>
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
--------------------------------
<1> For `custom` analyzers, use a `type` of `custom` or omit the `type`
parameter.

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "vu",
      "start_offset": 16,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
----------------------------

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ is, this, deja, vu ]
---------------------------

The previous example used a tokenizer, token filters, and character filters with
their default configurations, but it is possible to create configured versions
of each and to use them in a custom analyzer.

Here is a more complicated example that combines the following:

Character Filter::
* <<analysis-mapping-charfilter,Mapping Character Filter>>, configured to replace `:)` with `_happy_` and `:(` with `_sad_`

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>, configured to split on punctuation characters

Token Filters::
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>>, configured to use the pre-defined list of English stop words

Here is an example:

[source,console]
--------------------------------------------------
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": { <1>
          "char_filter": [
            "emoticons"
          ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": { <2>
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": { <3>
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": { <4>
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
--------------------------------------------------
<1> Defines the custom analyzer `my_custom_analyzer` for the index. This
analyzer uses a custom tokenizer, character filter, and token filter that
are defined later in the request. This analyzer also omits the `type`
parameter, which defaults to `custom`.
<2> Defines the custom `punctuation` tokenizer.
<3> Defines the custom `emoticons` character filter.
<4> Defines the custom `english_stop` token filter.

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "i'm",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "_happy_",
      "start_offset": 6,
      "end_offset": 8,
      "type": "word",
      "position": 2
    },
    {
      "token": "person",
      "start_offset": 9,
      "end_offset": 15,
      "type": "word",
      "position": 3
    },
    {
      "token": "you",
      "start_offset": 21,
      "end_offset": 24,
      "type": "word",
      "position": 5
    }
  ]
}
----------------------------

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ i'm, _happy_, person, you ]
---------------------------
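
To put the analyzer to work, reference it from a field mapping so it is
applied at index and search time. Here is a minimal sketch using the update
mapping API (the `content` field name is illustrative):

[source,console]
--------------------------------------------------
PUT my-index-000001/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_custom_analyzer" <1>
    }
  }
}
--------------------------------------------------
<1> Hypothetical field for illustration; any `text` field in the index can
reference `my_custom_analyzer` by name.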