[[analysis-pattern-analyzer]]
=== Pattern Analyzer

An analyzer of type `pattern` that can flexibly separate text into terms
via a regular expression. It accepts the following settings:

[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`lowercase` |Should terms be lowercased or not. Defaults to `true`.
|`pattern` |The regular expression pattern, defaults to `\W+`.
|`flags` |The regular expression flags.
|`stopwords` |A list of stopwords to initialize the stop filter with.
Defaults to an 'empty' stopword list added[1.0.0.RC1, Previously
defaulted to the English stopwords list]. Check
<<analysis-stop-analyzer,Stop Analyzer>> for more details.
|=======================================================================

*IMPORTANT*: The regular expression should match the *token separators*,
not the tokens themselves.

Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"`. Check
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary[Java
Pattern API] for more details about `flags` options.

[float]
==== Pattern Analyzer Examples

In order to try out these examples, you should delete the `test` index
before running each example:

[source,js]
--------------------------------------------------
    curl -XDELETE localhost:9200/test
--------------------------------------------------

[float]
===== Whitespace tokenizer

[source,js]
--------------------------------------------------
    curl -XPUT 'localhost:9200/test' -d '
    {
        "settings":{
            "analysis": {
                "analyzer": {
                    "whitespace":{
                        "type": "pattern",
                        "pattern":"\\s+"
                    }
                }
            }
        }
    }'

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=whitespace' -d 'foo,bar baz'
    # "foo,bar", "baz"
--------------------------------------------------

[float]
===== Non-word character tokenizer

[source,js]
--------------------------------------------------
    curl -XPUT 'localhost:9200/test' -d '
    {
        "settings":{
            "analysis": {
                "analyzer": {
                    "nonword":{
                        "type": "pattern",
                        "pattern":"[^\\w]+"
                    }
                }
            }
        }
    }'

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'foo,bar baz'
    # "foo,bar baz" becomes "foo", "bar", "baz"

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'type_1-type_4'
    # "type_1", "type_4"
--------------------------------------------------
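[float]
===== Case-preserving tokenizer with stopwords

The `lowercase` and `stopwords` settings from the table above can be
combined with a custom `pattern`. The following is an illustrative
sketch rather than one of the original examples: the analyzer name
`csv`, the stopword list and the sample text are made up, and the
expected output assumes the stop filter behaviour described in
<<analysis-stop-analyzer,Stop Analyzer>>.

[source,js]
--------------------------------------------------
    # illustrative example: the "csv" analyzer and the sample text are
    # assumptions, not taken from the examples above
    curl -XPUT 'localhost:9200/test' -d '
    {
        "settings":{
            "analysis": {
                "analyzer": {
                    "csv":{
                        "type": "pattern",
                        "pattern":"[,\\s]+",
                        "lowercase": false,
                        "stopwords": ["and", "or"]
                    }
                }
            }
        }
    }'

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=csv' -d 'Foo, bar and baz'
    # "Foo", "bar", "baz"
--------------------------------------------------

A `flags` value would be supplied in the same way, as a pipe-separated
string such as `"CASE_INSENSITIVE|COMMENTS"`.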
[float]
===== CamelCase tokenizer

[source,js]
--------------------------------------------------
    curl -XPUT 'localhost:9200/test?pretty=1' -d '
    {
        "settings":{
            "analysis": {
                "analyzer": {
                    "camel":{
                        "type": "pattern",
                        "pattern":"([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
                    }
                }
            }
        }
    }'

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=camel' -d '
        MooseX::FTPClass2_beta
    '
    # "moose","x","ftp","class","2","beta"
--------------------------------------------------

The regex above is easier to understand as:

[source,js]
--------------------------------------------------
      ([^\\p{L}\\d]+)                  # swallow non letters and numbers,
    | (?<=\\D)(?=\\d)                  # or non-number followed by number,
    | (?<=\\d)(?=\\D)                  # or number followed by non-number,
    | (?<=[ \\p{L} && [^\\p{Lu}]])     # or lower case
      (?=\\p{Lu})                      #   followed by upper case,
    | (?<=\\p{Lu})                     # or upper case
      (?=\\p{Lu}                       #   followed by upper case
        [\\p{L}&&[^\\p{Lu}]]           #   then lower case
      )
--------------------------------------------------