[[analysis-custom-analyzer]]
=== Custom Analyzer

When the built-in analyzers do not fulfill your needs, you can create a
`custom` analyzer which uses the appropriate combination of:

* zero or more <<analysis-charfilters, character filters>>
* a <<analysis-tokenizers,tokenizer>>
* zero or more <<analysis-tokenfilters,token filters>>.

[float]
=== Configuration

The `custom` analyzer accepts the following parameters:

[horizontal]
`tokenizer`::

    A built-in or customised <<analysis-tokenizers,tokenizer>>.
    (Required)

`char_filter`::

    An optional array of built-in or customised
    <<analysis-charfilters, character filters>>.

`filter`::

    An optional array of built-in or customised
    <<analysis-tokenfilters, token filters>>.

`position_increment_gap`::

    When indexing an array of text values, Elasticsearch inserts a fake "gap"
    between the last term of one value and the first term of the next value to
    ensure that a phrase query doesn't match two terms from different array
    elements. Defaults to `100`. See <<position-increment-gap>> for more.
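
For example, the gap could be raised on a particular analyzer so that terms
from different array entries are kept even further apart. The following is a
minimal sketch rather than part of the configuration reference above; the
index and analyzer names are illustrative only:

[source,js]
--------------------------------
PUT my_array_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_gapped_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "position_increment_gap": 500
        }
      }
    }
  }
}
--------------------------------
// CONSOLE
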
[float]
=== Example configuration

Here is an example that combines the following:

Character Filter::
* <<analysis-htmlstrip-charfilter,HTML Strip Character Filter>>

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
* <<analysis-asciifolding-tokenfilter,ASCII-Folding Token Filter>>

[source,js]
--------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
--------------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "vu",
      "start_offset": 16,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ is, this, deja, vu ]
---------------------------

The previous example used a tokenizer, token filters, and character filters with
their default configurations, but it is possible to create configured versions
of each and to use them in a custom analyzer.

Here is a more complicated example that combines the following:

Character Filter::
* <<analysis-mapping-charfilter,Mapping Character Filter>>, configured to replace `:)` with `_happy_` and `:(` with `_sad_`

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>, configured to split on punctuation characters

Token Filters::
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>>, configured to use the pre-defined list of English stop words

Here is an example:

[source,js]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
            "emoticons" <1>
          ],
          "tokenizer": "punctuation", <1>
          "filter": [
            "lowercase",
            "english_stop" <1>
          ]
        }
      },
      "tokenizer": {
        "punctuation": { <1>
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": { <1>
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": { <1>
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
--------------------------------------------------
// CONSOLE

<1> The `emoticons` character filter, `punctuation` tokenizer and
    `english_stop` token filter are custom implementations which are defined
    in the same index settings.

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "i'm",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "_happy_",
      "start_offset": 6,
      "end_offset": 8,
      "type": "word",
      "position": 2
    },
    {
      "token": "person",
      "start_offset": 9,
      "end_offset": 15,
      "type": "word",
      "position": 3
    },
    {
      "token": "you",
      "start_offset": 21,
      "end_offset": 24,
      "type": "word",
      "position": 5
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ i'm, _happy_, person, you ]
---------------------------
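
Once defined, a custom analyzer is normally put to work by referencing it from
a `text` field in the index mappings. The following is an illustrative sketch
rather than part of the examples above; the field name is invented, and a
single `doc` mapping type is assumed for versions of Elasticsearch that still
require mapping types:

[source,js]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "my_field": {
          "type": "text",
          "analyzer": "my_custom_analyzer"
        }
      }
    }
  }
}
--------------------------------------------------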