[[analysis-pattern-analyzer]]
=== Pattern Analyzer
An analyzer of type `pattern` that can flexibly separate text into terms
via a regular expression. It accepts the following settings:
[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`lowercase` |Should terms be lowercased or not. Defaults to `true`.

|`pattern` |The regular expression pattern, defaults to `\W+`.

|`flags` |The regular expression flags.

|`stopwords` |A list of stopwords to initialize the stop filter with.
Defaults to an 'empty' stopword list coming[1.0.0.RC1, Previously
defaulted to the English stopwords list]. Check
<<analysis-stop-analyzer,Stop Analyzer>> for more details.
|=======================================================================
*IMPORTANT*: The regular expression should match the *token separators*,
not the tokens themselves.

Flags should be pipe-separated, eg `"CASE_INSENSITIVE|COMMENTS"`. Check
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary[Java
Pattern API] for more details about `flags` options.
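The separator-matching behaviour can be sketched in Python. This is only a rough approximation of the analyzer (the hypothetical `pattern_analyze` helper below is not part of Elasticsearch, and Python's `\W` differs slightly from Java's in its Unicode handling), but it shows the key idea: the regex matches what lies *between* tokens.

```python
import re

def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    """Rough sketch of the pattern analyzer: the regex matches the
    token *separators*, and everything between matches is a token."""
    tokens = [t for t in re.split(pattern, text) if t]
    return [t.lower() for t in tokens] if lowercase else tokens

print(pattern_analyze("The QUICK brown fox!"))
# ['the', 'quick', 'brown', 'fox']
```

With the default `\W+` pattern, punctuation and whitespace are swallowed as separators and the remaining runs of word characters come out lowercased.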
[float]
==== Pattern Analyzer Examples

In order to try out these examples, you should delete the `test` index
before running each example:

[source,js]
--------------------------------------------------
curl -XDELETE localhost:9200/test
--------------------------------------------------
[float]
===== Whitespace tokenizer

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/test' -d '
{
    "settings":{
        "analysis": {
            "analyzer": {
                "whitespace":{
                    "type": "pattern",
                    "pattern":"\\\\s+"
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=whitespace' -d 'foo,bar baz'
# "foo,bar", "baz"
--------------------------------------------------
[float]
===== Non-word character tokenizer

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/test' -d '
{
    "settings":{
        "analysis": {
            "analyzer": {
                "nonword":{
                    "type": "pattern",
                    "pattern":"[^\\\\w]+"
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'foo,bar baz'
# "foo,bar baz" becomes "foo", "bar", "baz"

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'type_1-type_4'
# "type_1","type_4"
--------------------------------------------------
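The non-word split can be reproduced with Python's `re.split` (a sketch, not the actual implementation; note that Python's `\w` is Unicode-aware by default, which differs slightly from Java's default behaviour):

```python
import re

# Split on runs of non-word characters. Underscores count as word
# characters, so "type_1" survives as a single token.
for text in ["foo,bar baz", "type_1-type_4"]:
    print([t for t in re.split(r"[^\w]+", text) if t])
# ['foo', 'bar', 'baz']
# ['type_1', 'type_4']
```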
[float]
===== CamelCase tokenizer

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/test?pretty=1' -d '
{
    "settings":{
        "analysis": {
            "analyzer": {
                "camel":{
                    "type": "pattern",
                    "pattern":"([^\\\\p{L}\\\\d]+)|(?<=\\\\D)(?=\\\\d)|(?<=\\\\d)(?=\\\\D)|(?<=[\\\\p{L}&&[^\\\\p{Lu}]])(?=\\\\p{Lu})|(?<=\\\\p{Lu})(?=\\\\p{Lu}[\\\\p{L}&&[^\\\\p{Lu}]])"
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=camel' -d '
MooseX::FTPClass2_beta
'
# "moose","x","ftp","class","2","beta"
--------------------------------------------------
The regex above is easier to understand as:

[source,js]
--------------------------------------------------

  ([^\\p{L}\\d]+)          # swallow non letters and numbers,
| (?<=\\D)(?=\\d)          # or non-number followed by number,
| (?<=\\d)(?=\\D)          # or number followed by non-number,
| (?<=[\\p{L}&&[^\\p{Lu}]]) # or lower case
  (?=\\p{Lu})              #   followed by upper case,
| (?<=\\p{Lu})             # or upper case
  (?=\\p{Lu}               #   followed by upper case
    [\\p{L}&&[^\\p{Lu}]]   #   then lower case
  )
--------------------------------------------------
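As a rough check of these rules, an ASCII-only approximation can be run through Python's `re.split`. Python's `re` module lacks Java's `\p{L}` classes and `&&` character-class intersection, so `[a-z]` and `[A-Z]` stand in for the lower- and upper-case letter classes; this is an illustration of the splitting rules, not the analyzer itself:

```python
import re

# ASCII approximation of the CamelCase pattern: split on runs of
# non-alphanumerics, digit boundaries, lower-to-upper boundaries,
# and upper-then-upper-then-lower boundaries ("FTPClass" -> FTP|Class).
CAMEL = (r"([^a-zA-Z\d]+)"
         r"|(?<=\D)(?=\d)"
         r"|(?<=\d)(?=\D)"
         r"|(?<=[a-z])(?=[A-Z])"
         r"|(?<=[A-Z])(?=[A-Z][a-z])")

parts = re.split(CAMEL, "MooseX::FTPClass2_beta")
# re.split interleaves tokens with the captured separators, so take
# every second element and lowercase, as the analyzer would.
tokens = [p.lower() for p in parts[::2] if p]
print(tokens)
# ['moose', 'x', 'ftp', 'class', '2', 'beta']
```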