
[[analysis-pathhierarchy-tokenizer]]
=== Path Hierarchy Tokenizer

The `path_hierarchy` tokenizer takes a hierarchical value like a filesystem
path, splits on the path separator, and emits a term for each component in the
tree.

[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/one/two/three"
}
---------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "/one",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "/one/two",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    },
    {
      "token": "/one/two/three",
      "start_offset": 0,
      "end_offset": 14,
      "type": "word",
      "position": 0
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above text would produce the following terms:

[source,text]
---------------------------
[ /one, /one/two, /one/two/three ]
---------------------------

[float]
=== Configuration

The `path_hierarchy` tokenizer accepts the following parameters:

[horizontal]
`delimiter`::
    The character to use as the path separator. Defaults to `/`.

`replacement`::
    An optional replacement character to use for the delimiter.
    Defaults to the `delimiter`.

`buffer_size`::
    The number of characters read into the term buffer in a single pass.
    Defaults to `1024`. The term buffer will grow by this size until all the
    text has been consumed. It is advisable not to change this setting.

`reverse`::
    If set to `true`, emits the tokens in reverse order. Defaults to `false`.

`skip`::
    The number of initial tokens to skip. Defaults to `0`.
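
These parameters can also be tried out directly with the `_analyze` API by
defining the tokenizer inline instead of in index settings. The request below
is a minimal sketch (it assumes your Elasticsearch version accepts an inline
custom tokenizer definition in `_analyze`):

[source,js]
----------------------------
POST _analyze
{
  "tokenizer": {
    "type": "path_hierarchy",
    "delimiter": "/",
    "reverse": true
  },
  "text": "/one/two/three"
}
----------------------------

With `reverse` enabled, the tokens represent the trailing parts of the path
rather than the leading parts, as in the reversed example at the end of this
page.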

[float]
=== Example configuration

In this example, we configure the `path_hierarchy` tokenizer to split on `-`
characters, and to replace them with `/`. The first two tokens are skipped:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-",
          "replacement": "/",
          "skip": 2
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "one-two-three-four-five"
}
----------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "/three",
      "start_offset": 7,
      "end_offset": 13,
      "type": "word",
      "position": 0
    },
    {
      "token": "/three/four",
      "start_offset": 7,
      "end_offset": 18,
      "type": "word",
      "position": 0
    },
    {
      "token": "/three/four/five",
      "start_offset": 7,
      "end_offset": 23,
      "type": "word",
      "position": 0
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ /three, /three/four, /three/four/five ]
---------------------------

If we were to set `reverse` to `true`, it would produce the following:

[source,text]
---------------------------
[ one/two/three/, two/three/, three/ ]
---------------------------
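
One way to get that output is to add `"reverse": true` to the tokenizer
definition used above. The settings below are a minimal sketch of that variant
(the index name `my_reversed_index` is just a placeholder; the analyzer and
tokenizer names follow the example above):

[source,js]
----------------------------
PUT my_reversed_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-",
          "replacement": "/",
          "skip": 2,
          "reverse": true
        }
      }
    }
  }
}
----------------------------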