
[[analysis-thai-tokenizer]]
=== Thai tokenizer
++++
<titleabbrev>Thai</titleabbrev>
++++

The `thai` tokenizer segments Thai text into words, using the Thai
segmentation algorithm included with Java. Text in other languages is
treated the same as by the
<<analysis-standard-tokenizer,`standard` tokenizer>>.

WARNING: This tokenizer may not be supported by all JREs. It is known to work
with Sun/Oracle and OpenJDK. If your application needs to be fully portable,
consider using the {plugins}/analysis-icu-tokenizer.html[ICU Tokenizer] instead.
[discrete]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "thai",
  "text": "การที่ได้ต้องแสดงว่างานดี"
}
---------------------------
/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "การ",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "ที่",
      "start_offset": 3,
      "end_offset": 6,
      "type": "word",
      "position": 1
    },
    {
      "token": "ได้",
      "start_offset": 6,
      "end_offset": 9,
      "type": "word",
      "position": 2
    },
    {
      "token": "ต้อง",
      "start_offset": 9,
      "end_offset": 13,
      "type": "word",
      "position": 3
    },
    {
      "token": "แสดง",
      "start_offset": 13,
      "end_offset": 17,
      "type": "word",
      "position": 4
    },
    {
      "token": "ว่า",
      "start_offset": 17,
      "end_offset": 20,
      "type": "word",
      "position": 5
    },
    {
      "token": "งาน",
      "start_offset": 20,
      "end_offset": 23,
      "type": "word",
      "position": 6
    },
    {
      "token": "ดี",
      "start_offset": 23,
      "end_offset": 25,
      "type": "word",
      "position": 7
    }
  ]
}
----------------------------

/////////////////////
The above sentence would produce the following terms:

[source,text]
---------------------------
[ การ, ที่, ได้, ต้อง, แสดง, ว่า, งาน, ดี ]
---------------------------
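The segmentation itself comes from the word `BreakIterator` that ships with Java, which is also why the JRE warning above applies. A minimal standalone sketch of that underlying API (the class name `ThaiSegmentation` is illustrative, and the exact splits depend on the Thai dictionary bundled with the JRE):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ThaiSegmentation {

    /** Splits text at the word boundaries reported for the Thai locale. */
    public static List<String> segment(String text) {
        // On JREs that bundle a Thai dictionary, the word instance for
        // the "th" locale performs dictionary-based segmentation.
        BreakIterator words = BreakIterator.getWordInstance(new Locale("th"));
        words.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next();
             end != BreakIterator.DONE;
             start = end, end = words.next()) {
            tokens.add(text.substring(start, end));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Same sample sentence as the _analyze request above.
        System.out.println(segment("การที่ได้ต้องแสดงว่างานดี"));
    }
}
```

Because the boundaries partition the input, concatenating the segments always reproduces the original string, even on a JRE whose iterator cannot split Thai and returns the text as a single segment.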
[discrete]
=== Configuration

The `thai` tokenizer is not configurable.