[[analysis-pathhierarchy-tokenizer]]
=== Path Hierarchy Tokenizer

The `path_hierarchy` tokenizer takes a hierarchical value like a filesystem
path, splits on the path separator, and emits a term for each component in the
tree.

[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/one/two/three"
}
---------------------------
// CONSOLE
/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "/one",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "/one/two",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    },
    {
      "token": "/one/two/three",
      "start_offset": 0,
      "end_offset": 14,
      "type": "word",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////

The above text would produce the following terms:

[source,text]
---------------------------
[ /one, /one/two, /one/two/three ]
---------------------------
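
The prefix-per-component behavior can be sketched in a few lines of Python. This is an illustrative simplification of the documented output, not the actual Lucene implementation:

```python
def path_prefixes(text, delimiter="/"):
    """Emit one term per path component: each term is the path up to
    and including that component (a sketch, not the real tokenizer)."""
    tokens, end = [], 0
    while True:
        # Find the next delimiter after the current component.
        end = text.find(delimiter, end + 1)
        if end == -1:
            tokens.append(text)  # the final term is the whole path
            return tokens
        tokens.append(text[:end])

print(path_prefixes("/one/two/three"))
# ['/one', '/one/two', '/one/two/three']
```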

[float]
=== Configuration

The `path_hierarchy` tokenizer accepts the following parameters:

[horizontal]
`delimiter`::

    The character to use as the path separator. Defaults to `/`.

`replacement`::

    An optional replacement character to use for the delimiter.
    Defaults to the `delimiter`.

`buffer_size`::

    The number of characters read into the term buffer in a single pass.
    Defaults to `1024`. The term buffer will grow by this size until all the
    text has been consumed. It is advisable not to change this setting.

`reverse`::

    If set to `true`, emits the tokens in reverse order. Defaults to `false`.

`skip`::

    The number of initial tokens to skip. Defaults to `0`.

[float]
=== Example configuration

In this example, we configure the `path_hierarchy` tokenizer to split on `-`
characters, and to replace them with `/`. The first two tokens are skipped:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-",
          "replacement": "/",
          "skip": 2
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "one-two-three-four-five"
}
----------------------------
// CONSOLE
/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "/three",
      "start_offset": 7,
      "end_offset": 13,
      "type": "word",
      "position": 0
    },
    {
      "token": "/three/four",
      "start_offset": 7,
      "end_offset": 18,
      "type": "word",
      "position": 0
    },
    {
      "token": "/three/four/five",
      "start_offset": 7,
      "end_offset": 23,
      "type": "word",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ /three, /three/four, /three/four/five ]
---------------------------

If we were to set `reverse` to `true`, it would produce the following:

[source,text]
---------------------------
[ one/two/three/, two/three/, three/ ]
---------------------------
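
The combined effect of `delimiter`, `replacement`, `skip`, and `reverse` can be approximated in Python. This is a sketch inferred from the documented examples above, not the real implementation; edge cases such as repeated delimiters and `buffer_size` handling are ignored:

```python
def path_hierarchy(text, delimiter="/", replacement=None, skip=0, reverse=False):
    """Approximate the tokenizer's documented behavior (illustrative only)."""
    if replacement is None:
        replacement = delimiter  # replacement defaults to the delimiter
    parts = text.split(delimiter)
    leading = bool(parts) and parts[0] == ""
    if leading:
        parts = parts[1:]  # the text began with the delimiter itself
    tokens = []
    if not reverse:
        # Forward: growing prefixes. The delimiter preceding the first kept
        # component is retained, matching the offsets in the examples above.
        lead = replacement if (skip > 0 or leading) else ""
        for i in range(skip + 1, len(parts) + 1):
            tokens.append(lead + replacement.join(parts[skip:i]))
    else:
        # Reverse: shrinking suffixes. `skip` drops trailing components, and
        # each token keeps the delimiter that followed its last component.
        kept = parts[:len(parts) - skip] if skip else parts
        tail = replacement if skip else ""
        for i in range(len(kept)):
            tokens.append(replacement.join(kept[i:]) + tail)
    return tokens

print(path_hierarchy("one-two-three-four-five",
                     delimiter="-", replacement="/", skip=2))
# ['/three', '/three/four', '/three/four/five']
print(path_hierarchy("one-two-three-four-five",
                     delimiter="-", replacement="/", skip=2, reverse=True))
# ['one/two/three/', 'two/three/', 'three/']
```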

[float]
=== Detailed Examples

See <<analysis-pathhierarchy-tokenizer-examples, detailed examples here>>.