
[[analysis-pathhierarchy-tokenizer]]
=== Path Hierarchy Tokenizer

The `path_hierarchy` tokenizer takes a hierarchical value like a filesystem
path, splits on the path separator, and emits a term for each component in the
tree.

[float]
=== Example output
[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/one/two/three"
}
---------------------------
/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "/one",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "/one/two",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    },
    {
      "token": "/one/two/three",
      "start_offset": 0,
      "end_offset": 14,
      "type": "word",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////
The above text would produce the following terms:

[source,text]
---------------------------
[ /one, /one/two, /one/two/three ]
---------------------------
[float]
=== Configuration

The `path_hierarchy` tokenizer accepts the following parameters:

[horizontal]
`delimiter`::

    The character to use as the path separator. Defaults to `/`.

`replacement`::

    An optional replacement character to use for the delimiter.
    Defaults to the `delimiter`.

`buffer_size`::

    The number of characters read into the term buffer in a single pass.
    Defaults to `1024`. The term buffer will grow by this size until all the
    text has been consumed. It is advisable not to change this setting.

`reverse`::

    If set to `true`, emits the tokens in reverse order.
    Defaults to `false`.

`skip`::

    The number of initial tokens to skip. Defaults to `0`.
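The interaction between `delimiter`, `replacement`, and `skip` can be sketched
with a short, self-contained simulation. This is our own toy function, not the
tokenizer's API: the real implementation is a streaming Lucene tokenizer that
also tracks character offsets.

```python
def path_hierarchy(text, delimiter="/", replacement=None, skip=0):
    """Rough, in-memory sketch of the forward path_hierarchy token stream.

    Illustrative only; it just reproduces the example outputs on this page.
    """
    if replacement is None:
        replacement = delimiter
    parts = [p for p in text.split(delimiter) if p]
    kept = parts[skip:]
    # A token keeps the delimiter that precedes its first component, so
    # skipped text (or a leading delimiter) leaves a leading replacement char.
    lead = replacement if (skip > 0 or text.startswith(delimiter)) else ""
    return [lead + replacement.join(kept[:i + 1]) for i in range(len(kept))]

print(path_hierarchy("/one/two/three"))
# ['/one', '/one/two', '/one/two/three']
print(path_hierarchy("one-two-three-four-five",
                     delimiter="-", replacement="/", skip=2))
# ['/three', '/three/four', '/three/four/five']
```

Each call returns the growing chain of path prefixes that the tokenizer would
emit for that input.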
[float]
=== Example configuration

In this example, we configure the `path_hierarchy` tokenizer to split on `-`
characters, and to replace them with `/`. The first two tokens are skipped:

[source,console]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-",
          "replacement": "/",
          "skip": 2
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "one-two-three-four-five"
}
----------------------------
/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "/three",
      "start_offset": 7,
      "end_offset": 13,
      "type": "word",
      "position": 0
    },
    {
      "token": "/three/four",
      "start_offset": 7,
      "end_offset": 18,
      "type": "word",
      "position": 0
    },
    {
      "token": "/three/four/five",
      "start_offset": 7,
      "end_offset": 23,
      "type": "word",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////
The above example produces the following terms:

[source,text]
---------------------------
[ /three, /three/four, /three/four/five ]
---------------------------

If we were to set `reverse` to `true`, it would produce the following:

[source,text]
---------------------------
[ one/two/three/, two/three/, three/ ]
---------------------------
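The reverse output above can be reproduced with a short standalone sketch
(variable names are ours, for illustration only). With `reverse` enabled,
tokens are anchored at the end of the path, `skip` drops components from the
end, and the delimiter trails each token rather than leading it:

```python
# Reproduce the reverse=true example from this page.
parts = "one-two-three-four-five".split("-")  # delimiter is "-"
kept = parts[:len(parts) - 2]                 # skip=2 drops "four" and "five"
# Each token is a suffix chain of the kept components, joined with the
# replacement character "/" and keeping the trailing delimiter.
tokens = ["/".join(kept[i:]) + "/" for i in range(len(kept))]
print(tokens)
# ['one/two/three/', 'two/three/', 'three/']
```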
[float]
=== Detailed Examples

See <<analysis-pathhierarchy-tokenizer-examples, detailed examples here>>.