[[analysis-pattern-analyzer]]
=== Pattern Analyzer

An analyzer of type `pattern` that can flexibly separate text into terms
via a regular expression.

The following settings can be set for a `pattern` analyzer type:

[horizontal]
`lowercase`::   Whether terms should be lowercased. Defaults to `true`.
`pattern`::     The regular expression pattern. Defaults to `\W+`.
`flags`::       The regular expression flags.
`stopwords`::   A list of stopwords to initialize the stop filter with.
                Defaults to an 'empty' stopword list. Check
                <<analysis-stop-analyzer,Stop Analyzer>> for more details.

*IMPORTANT*: The regular expression should match the *token separators*,
not the tokens themselves.

Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"`. Check
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary[Java
Pattern API] for more details about `flags` options.

[float]
==== Pattern Analyzer Examples

In order to try out these examples, you should delete the `test` index
before running each example.

[float]
===== Whitespace tokenizer

[source,js]
--------------------------------------------------
DELETE test

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace": {
          "type": "pattern",
          "pattern": "\\s+"
        }
      }
    }
  }
}

GET /test/_analyze?analyzer=whitespace&text=foo,bar baz
# "foo,bar", "baz"
--------------------------------------------------
// AUTOSENSE

[float]
===== Non-word character tokenizer

[source,js]
--------------------------------------------------
DELETE test

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "nonword": {
          "type": "pattern",
          "pattern": "[^\\w]+"
        }
      }
    }
  }
}

GET /test/_analyze?analyzer=nonword&text=foo,bar baz
# "foo,bar baz" becomes "foo", "bar", "baz"

GET /test/_analyze?analyzer=nonword&text=type_1-type_4
# "type_1","type_4"
--------------------------------------------------
// AUTOSENSE

[float]
===== CamelCase tokenizer

[source,js]
--------------------------------------------------
DELETE test

PUT /test?pretty=1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}

GET /test/_analyze?analyzer=camel&text=MooseX::FTPClass2_beta
# "moose","x","ftp","class","2","beta"
--------------------------------------------------
// AUTOSENSE

The regex above is easier to understand as:

[source,js]
--------------------------------------------------
  ([^\p{L}\d]+)                 # swallow non letters and numbers,
| (?<=\D)(?=\d)                 # or non-number followed by number,
| (?<=\d)(?=\D)                 # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
  (?=\p{Lu})                    #   followed by upper case,
| (?<=\p{Lu})                   # or upper case
  (?=\p{Lu}                     #   followed by upper case
    [\p{L}&&[^\p{Lu}]]          #   then lower case
  )
--------------------------------------------------
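
[float]
===== Checking a separator pattern in plain Java

Because the `pattern` and `flags` settings use Java regular expression syntax, it can help to try a separator pattern with `java.util.regex` before putting it into index settings. The following is a minimal sketch, not the analyzer's actual implementation: it assumes the analyzer splits on every match of the pattern, drops empty pieces, and lowercases the rest (the `lowercase` default). The class name `PatternAnalyzerSketch` and the helpers `parseFlags` and `tokenize` are hypothetical names chosen for illustration, and only a handful of flag names are mapped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class PatternAnalyzerSketch {

    // Hypothetical helper: turns a pipe-separated string such as
    // "CASE_INSENSITIVE|COMMENTS" into a java.util.regex.Pattern bitmask.
    // Only a few flag names are handled in this sketch.
    public static int parseFlags(String flags) {
        int mask = 0;
        if (flags == null || flags.isEmpty()) {
            return mask;
        }
        for (String name : flags.split("\\|")) {
            switch (name.trim()) {
                case "CASE_INSENSITIVE": mask |= Pattern.CASE_INSENSITIVE; break;
                case "COMMENTS":         mask |= Pattern.COMMENTS; break;
                case "MULTILINE":        mask |= Pattern.MULTILINE; break;
                case "DOTALL":           mask |= Pattern.DOTALL; break;
                case "UNICODE_CASE":     mask |= Pattern.UNICODE_CASE; break;
                default:
                    throw new IllegalArgumentException("Unknown flag: " + name);
            }
        }
        return mask;
    }

    // Assumed behaviour: the regex matches the separators, so the terms are
    // the non-empty pieces between matches, lowercased.
    public static List<String> tokenize(String regex, String flags, String text) {
        Pattern separator = Pattern.compile(regex, parseFlags(flags));
        List<String> terms = new ArrayList<>();
        for (String piece : separator.split(text)) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String camel = "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)"
                + "|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})"
                + "|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])";

        System.out.println(tokenize("\\W+", "", "foo,bar baz"));      // [foo, bar, baz]
        System.out.println(tokenize("\\s+", "", "foo,bar baz"));      // [foo,bar, baz]
        System.out.println(tokenize("[^\\w]+", "", "type_1-type_4")); // [type_1, type_4]
        System.out.println(tokenize(camel, "", "MooseX::FTPClass2_beta"));
        // [moose, x, ftp, class, 2, beta]
        System.out.println(tokenize("x", "CASE_INSENSITIVE", "fooXbarxbaz"));
        // [foo, bar, baz]
    }
}
```

The printed lists line up with the `_analyze` results in the examples on this page. The last call illustrates why `flags` matters: without `CASE_INSENSITIVE`, the separator `x` would not match the upper-case `X`.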