[[analysis-pattern-analyzer]]
=== Pattern analyzer
++++
<titleabbrev>Pattern</titleabbrev>
++++

The `pattern` analyzer uses a regular expression to split the text into terms.
The regular expression should match the *token separators*, not the tokens
themselves.
The regular expression defaults to `\W+` (one or more non-word characters).

[WARNING]
.Beware of Pathological Regular Expressions
========================================

The pattern analyzer uses
https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java Regular Expressions].

A badly written regular expression could run very slowly or even throw a
StackOverflowError and cause the node it is running on to exit suddenly.

Read more about https://www.regular-expressions.info/catastrophic.html[pathological regular expressions and how to avoid them].

========================================

[discrete]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "word",
      "position": 3
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "word",
      "position": 4
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "word",
      "position": 5
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "word",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "word",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "word",
      "position": 8
    },
    {
      "token": "dog",
      "start_offset": 45,
      "end_offset": 48,
      "type": "word",
      "position": 9
    },
    {
      "token": "s",
      "start_offset": 49,
      "end_offset": 50,
      "type": "word",
      "position": 10
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "word",
      "position": 11
    }
  ]
}
----------------------------

/////////////////////

The above sentence would produce the following terms:

[source,text]
---------------------------
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
---------------------------

[discrete]
=== Configuration

The `pattern` analyzer accepts the following parameters:

[horizontal]
`pattern`::

    A https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java regular expression], defaults to `\W+`.

`flags`::

    Java regular expression https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#field.summary[flags].
    Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"`.

`lowercase`::

    Whether terms should be lowercased. Defaults to `true`.

`stopwords`::

    A pre-defined stop words list like `_english_` or an array containing a
    list of stop words.
    Defaults to `_none_`.

`stopwords_path`::

    The path to a file containing stop words.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.

[discrete]
=== Example configuration

In this example, we configure the `pattern` analyzer to split email addresses
on non-word characters or on underscores (`\W|_`), and to lower-case the result:

[source,console]
----------------------------
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type":      "pattern",
          "pattern":   "\\W|_", <1>
          "lowercase": true
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
----------------------------

<1> The backslashes in the pattern need to be escaped when specifying the
    pattern as a JSON string.

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "john",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "smith",
      "start_offset": 5,
      "end_offset": 10,
      "type": "word",
      "position": 1
    },
    {
      "token": "foo",
      "start_offset": 11,
      "end_offset": 14,
      "type": "word",
      "position": 2
    },
    {
      "token": "bar",
      "start_offset": 15,
      "end_offset": 18,
      "type": "word",
      "position": 3
    },
    {
      "token": "com",
      "start_offset": 19,
      "end_offset": 22,
      "type": "word",
      "position": 4
    }
  ]
}
----------------------------

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ john, smith, foo, bar, com ]
---------------------------

[discrete]
==== CamelCase tokenizer

The following more complicated example splits CamelCase text into tokens:

[source,console]
--------------------------------------------------
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}

GET my-index-000001/_analyze
{
  "analyzer": "camel",
  "text": "MooseX::FTPClass2_beta"
}
--------------------------------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "moose",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "x",
      "start_offset": 5,
      "end_offset": 6,
      "type": "word",
      "position": 1
    },
    {
      "token": "ftp",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "class",
      "start_offset": 11,
      "end_offset": 16,
      "type": "word",
      "position": 3
    },
    {
      "token": "2",
      "start_offset": 16,
      "end_offset": 17,
      "type": "word",
      "position": 4
    },
    {
      "token": "beta",
      "start_offset": 18,
      "end_offset": 22,
      "type": "word",
      "position": 5
    }
  ]
}
----------------------------

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ moose, x, ftp, class, 2, beta ]
---------------------------

The regex above is easier to understand as:

[source,regex]
--------------------------------------------------
  ([^\p{L}\d]+)                 # swallow non letters and numbers,
| (?<=\D)(?=\d)                 # or non-number followed by number,
| (?<=\d)(?=\D)                 # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
  (?=\p{Lu})                    #   followed by upper case,
| (?<=\p{Lu})                   # or upper case
  (?=\p{Lu}                     #   followed by upper case
    [\p{L}&&[^\p{Lu}]]          #   then lower case
  )
--------------------------------------------------

[discrete]
=== Definition

The `pattern` analyzer consists of:

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>

Token Filters::
*  <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
*  <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

If you need to customize the `pattern` analyzer beyond the configuration
parameters, you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. The following request recreates the
built-in `pattern` analyzer, which you can use as a starting point for further
customization:

[source,console]
----------------------------------------------------
PUT /pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type":       "pattern",
          "pattern":    "\\W+" <1>
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase"       <2>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// TEST[s/\n$/\nstartyaml\n  - compare_analyzers: {index: pattern_example, first: pattern, second: rebuilt_pattern}\nendyaml\n/]
<1> The default pattern is `\W+`, which splits on non-word characters.
This is where you'd change it.
<2> You'd add other token filters after `lowercase`.
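
To get a feel for the tokenizer-plus-filters chain outside of a cluster, the following Python sketch roughly emulates it: split on the separator pattern, lowercase, then drop stop words. It is only an illustration and not how Elasticsearch implements the analyzer. `pattern_analyze` is a hypothetical helper, and Python's `re` module stands in for Java regular expressions, so only simple patterns behave the same way. The CamelCase pattern shown earlier relies on Java syntax such as `\p{L}` and `&&`, which `re` does not support.

```python
import re


def pattern_analyze(text, pattern=r"\W+", lowercase=True, stopwords=()):
    """Hypothetical helper: a rough emulation of the `pattern` analyzer chain."""
    # Assumption: re.ASCII keeps \W limited to ASCII word characters, which is
    # close to Java's default behaviour for simple patterns.
    separator = re.compile(pattern, re.ASCII)

    # The pattern matches the separators, so splitting on it yields the tokens.
    # Empty strings show up when the text starts or ends with a separator, or
    # when two separators are adjacent, so they are filtered out.
    # Note: re.split also returns the text of capturing groups, so this sketch
    # is only meant for patterns without them.
    tokens = [token for token in separator.split(text) if token]

    # Lowercase first, then remove stop words, mirroring the filter order
    # listed in the definition above.
    if lowercase:
        tokens = [token.lower() for token in tokens]
    stop = set(stopwords)
    return [token for token in tokens if token not in stop]


if __name__ == "__main__":
    print(pattern_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
    print(pattern_analyze("John_Smith@foo-bar.com", pattern=r"\W|_"))
```

For the two sample texts used on this page, the sketch yields the same terms as the `_analyze` responses shown earlier. Offsets, positions, and token types are not modelled.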