[[analysis-standard-analyzer]]
=== Standard Analyzer

The `standard` analyzer is the default analyzer which is used if none is
specified. It provides grammar based tokenization (based on the Unicode Text
Segmentation algorithm, as specified in
http://unicode.org/reports/tr29/[Unicode Standard Annex #29]) and works well
for most languages.

[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "dog's",
      "start_offset": 45,
      "end_offset": 50,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 10
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above sentence would produce the following terms:

[source,text]
---------------------------
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
---------------------------

[float]
=== Configuration

The `standard` analyzer accepts the following parameters:

[horizontal]
`max_token_length`::

    The maximum token length. If a token is seen that exceeds this length then
    it is split at `max_token_length` intervals. Defaults to `255`.

`stopwords`::

    A pre-defined stop words list like `_english_` or an array containing a
    list of stop words. Defaults to `\_none_`.

`stopwords_path`::

    The path to a file containing stop words.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
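Stop words can also be supplied inline as an array rather than as the name of
a pre-defined list. A minimal sketch, where the index name `my_array_index`,
the analyzer name, and the word list are illustrative:

[source,js]
----------------------------
PUT my_array_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_array_analyzer": {
          "type": "standard",
          "stopwords": [ "the", "over", "a" ]
        }
      }
    }
  }
}
----------------------------
// CONSOLE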
[float]
=== Example configuration

In this example, we configure the `standard` analyzer to have a
`max_token_length` of 5 (for demonstration purposes), and to use the
pre-defined list of English stop words:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
----------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "jumpe",
      "start_offset": 24,
      "end_offset": 29,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "d",
      "start_offset": 29,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "dog's",
      "start_offset": 45,
      "end_offset": 50,
      "type": "<ALPHANUM>",
      "position": 10
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 11
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
---------------------------

[float]
=== Definition

The `standard` analyzer consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

If you need to customize the `standard` analyzer beyond the configuration
parameters, then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built-in
`standard` analyzer and you can use it as a starting point:

[source,js]
----------------------------------------------------
PUT /standard_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"       <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n  - compare_analyzers: {index: standard_example, first: standard, second: rebuilt_standard}\nendyaml\n/]
<1> You'd add any token filters after `lowercase`.
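For instance, the following sketch adds the
<<analysis-stop-tokenfilter,Stop Token Filter>> after `lowercase`. Since the
`stop` filter defaults to the `_english_` stop word list, this approximates
the built-in analyzer configured with `stopwords: _english_`. The index and
analyzer names here are illustrative:

[source,js]
----------------------------------------------------
PUT /standard_stop_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard_with_stop": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stop"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE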