
[[analysis-thai-tokenizer]]
=== Thai tokenizer
++++
<titleabbrev>Thai</titleabbrev>
++++

The `thai` tokenizer segments Thai text into words, using the Thai
segmentation algorithm included with Java. Text in other languages is
treated the same as by the
<<analysis-standard-tokenizer,`standard` tokenizer>>.

WARNING: This tokenizer may not be supported by all JREs. It is known to work
with Sun/Oracle and OpenJDK. If your application needs to be fully portable,
consider using the {plugins}/analysis-icu-tokenizer.html[ICU Tokenizer] instead.
[discrete]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "thai",
  "text": "การที่ได้ต้องแสดงว่างานดี"
}
---------------------------
/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "การ",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "ที่",
      "start_offset": 3,
      "end_offset": 6,
      "type": "word",
      "position": 1
    },
    {
      "token": "ได้",
      "start_offset": 6,
      "end_offset": 9,
      "type": "word",
      "position": 2
    },
    {
      "token": "ต้อง",
      "start_offset": 9,
      "end_offset": 13,
      "type": "word",
      "position": 3
    },
    {
      "token": "แสดง",
      "start_offset": 13,
      "end_offset": 17,
      "type": "word",
      "position": 4
    },
    {
      "token": "ว่า",
      "start_offset": 17,
      "end_offset": 20,
      "type": "word",
      "position": 5
    },
    {
      "token": "งาน",
      "start_offset": 20,
      "end_offset": 23,
      "type": "word",
      "position": 6
    },
    {
      "token": "ดี",
      "start_offset": 23,
      "end_offset": 25,
      "type": "word",
      "position": 7
    }
  ]
}
----------------------------

/////////////////////
The above sentence would produce the following terms:

[source,text]
---------------------------
[ การ, ที่, ได้, ต้อง, แสดง, ว่า, งาน, ดี ]
---------------------------
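The segmentation itself comes from the word `BreakIterator` that ships with Java, which is also why the JRE warning above applies. A minimal standalone sketch of that underlying API (the class name `ThaiSegmentation` is illustrative, and the exact splits depend on the Thai dictionary bundled with the JRE):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ThaiSegmentation {

    /** Splits text at the word boundaries reported for the Thai locale. */
    public static List<String> segment(String text) {
        // On JREs that bundle a Thai dictionary, the word instance for
        // the "th" locale performs dictionary-based segmentation.
        BreakIterator words = BreakIterator.getWordInstance(new Locale("th"));
        words.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next();
             end != BreakIterator.DONE;
             start = end, end = words.next()) {
            tokens.add(text.substring(start, end));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Same sample sentence as the _analyze request above.
        System.out.println(segment("การที่ได้ต้องแสดงว่างานดี"));
    }
}
```

Because the boundaries partition the input, concatenating the segments always reproduces the original string, even on a JRE whose iterator cannot split Thai and returns the text as a single segment.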
[discrete]
=== Configuration

The `thai` tokenizer is not configurable.