
[[analysis-thai-tokenizer]]
=== Thai Tokenizer

The `thai` tokenizer segments Thai text into words, using the Thai
segmentation algorithm included with Java. Text in other languages is
treated the same as it is by the
<<analysis-standard-tokenizer,`standard` tokenizer>>.

WARNING: This tokenizer may not be supported by all JREs. It is known to work
with Sun/Oracle and OpenJDK. If your application needs to be fully portable,
consider using the {plugins}/analysis-icu-tokenizer.html[ICU Tokenizer] instead.
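
The JRE facility involved here can be exercised directly. The sketch below
(an illustration, not the tokenizer's actual implementation) uses
`java.text.BreakIterator` with a Thai locale, the standard JDK API for
locale-aware word segmentation. Token boundaries come from the JRE's Thai
dictionary, so results may differ between JRE vendors, which is exactly the
portability caveat in the warning above.

[source,java]
---------------------------
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ThaiSegmentDemo {

    // Split `text` into word tokens using the JRE's Thai word
    // BreakIterator. `ThaiSegmentDemo` and `segment` are names
    // invented for this example.
    public static List<String> segment(String text) {
        BreakIterator words = BreakIterator.getWordInstance(new Locale("th"));
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next();
             end != BreakIterator.DONE;
             start = end, end = words.next()) {
            String token = text.substring(start, end).trim();
            if (!token.isEmpty()) {   // skip whitespace-only segments
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // On a JRE with Thai support this prints word-level tokens;
        // on one without, the string may come back in larger chunks.
        System.out.println(segment("การที่ได้ต้องแสดงว่างานดี"));
    }
}
---------------------------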

[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "tokenizer": "thai",
  "text": "การที่ได้ต้องแสดงว่างานดี"
}
---------------------------
// CONSOLE
/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "การ",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "ที่",
      "start_offset": 3,
      "end_offset": 6,
      "type": "word",
      "position": 1
    },
    {
      "token": "ได้",
      "start_offset": 6,
      "end_offset": 9,
      "type": "word",
      "position": 2
    },
    {
      "token": "ต้อง",
      "start_offset": 9,
      "end_offset": 13,
      "type": "word",
      "position": 3
    },
    {
      "token": "แสดง",
      "start_offset": 13,
      "end_offset": 17,
      "type": "word",
      "position": 4
    },
    {
      "token": "ว่า",
      "start_offset": 17,
      "end_offset": 20,
      "type": "word",
      "position": 5
    },
    {
      "token": "งาน",
      "start_offset": 20,
      "end_offset": 23,
      "type": "word",
      "position": 6
    },
    {
      "token": "ดี",
      "start_offset": 23,
      "end_offset": 25,
      "type": "word",
      "position": 7
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above sentence would produce the following terms:

[source,text]
---------------------------
[ การ, ที่, ได้, ต้อง, แสดง, ว่า, งาน, ดี ]
---------------------------

[float]
=== Configuration

The `thai` tokenizer is not configurable.