- [[analysis-pattern-analyzer]]
- === Pattern Analyzer
- An analyzer of type `pattern` that can flexibly separate text into terms
- via a regular expression. A `pattern` analyzer accepts the following
- settings:
- [horizontal]
- `lowercase`:: Whether terms should be lowercased. Defaults to `true`.
- `pattern`:: The regular expression pattern. Defaults to `\W+`.
- `flags`:: The regular expression flags.
- `stopwords`:: A list of stopwords to initialize the stop filter with.
- Defaults to an empty stopword list. See
- <<analysis-stop-analyzer,Stop Analyzer>> for more details.
- *IMPORTANT*: The regular expression should match the *token separators*,
- not the tokens themselves.
- Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"`. See the
- http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary[Java
- Pattern API] for more details about the `flags` options.
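As a rough illustration of the two rules above (the pattern matches separators, and flags are pipe-separated names), the following Java sketch splits text on a separator pattern and turns a flag string into the bitmask that `java.util.regex.Pattern` expects. The class `PatternAnalyzerSketch` and its methods `parseFlags` and `analyze` are hypothetical names invented for this example. It assumes the analyzer behaves roughly like `Pattern.split` followed by optional lowercasing; it is not the analyzer's actual implementation, and stopword filtering is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class PatternAnalyzerSketch {

    // Hypothetical helper: converts "CASE_INSENSITIVE|COMMENTS" into an int
    // bitmask by looking up the public constants declared on Pattern.
    static int parseFlags(String flags) {
        int mask = 0;
        if (flags == null || flags.isEmpty()) {
            return mask;
        }
        for (String name : flags.split("\\|")) {
            try {
                mask |= Pattern.class.getField(name.trim()).getInt(null);
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("Unknown regex flag: " + name, e);
            }
        }
        return mask;
    }

    // Assumption: terms are whatever is left between separator matches.
    // Empty pieces are dropped and terms are optionally lowercased.
    static List<String> analyze(String text, String pattern, String flags, boolean lowercase) {
        Pattern separator = Pattern.compile(pattern, parseFlags(flags));
        List<String> terms = new ArrayList<>();
        for (String term : separator.split(text)) {
            if (term.isEmpty()) {
                continue;
            }
            terms.add(lowercase ? term.toLowerCase(Locale.ROOT) : term);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Default-style settings: split on runs of non-word characters.
        System.out.println(analyze("Foo,Bar baz", "\\W+", "", true));               // [foo, bar, baz]
        // The separator "x" also matches "X" once CASE_INSENSITIVE is set.
        System.out.println(analyze("fooXbarxbaz", "x", "CASE_INSENSITIVE", false)); // [foo, bar, baz]
    }
}
```

Note that a pattern such as `\w+`, which matches the tokens rather than the separators, would leave only the punctuation and whitespace behind as "terms" in this sketch.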
- [float]
- ==== Pattern Analyzer Examples
- To try out these examples, delete the `test` index before running each
- one.
- [float]
- ===== Whitespace tokenizer
- [source,js]
- --------------------------------------------------
- PUT test
- {
-   "settings": {
-     "analysis": {
-       "analyzer": {
-         "whitespace": {
-           "type": "pattern",
-           "pattern": "\\s+"
-         }
-       }
-     }
-   }
- }
- GET _cluster/health?wait_for_status=yellow
- GET test/_analyze?analyzer=whitespace&text=foo,bar baz
- # "foo,bar", "baz"
- --------------------------------------------------
- // AUTOSENSE
- [float]
- ===== Non-word character tokenizer
- [source,js]
- --------------------------------------------------
- PUT test
- {
-   "settings": {
-     "analysis": {
-       "analyzer": {
-         "nonword": {
-           "type": "pattern",
-           "pattern": "[^\\w]+"
-         }
-       }
-     }
-   }
- }
- GET _cluster/health?wait_for_status=yellow
- GET test/_analyze?analyzer=nonword&text=foo,bar baz
- # "foo,bar baz" becomes "foo", "bar", "baz"
- GET test/_analyze?analyzer=nonword&text=type_1-type_4
- # "type_1","type_4"
- --------------------------------------------------
- // AUTOSENSE
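The two results above can be reproduced outside the cluster with plain `java.util.regex`. The sketch below is an approximation under the assumption that the analyzer's tokenization matches `Pattern.split`; the class name `NonWordSplitSketch` and the helper `split` are made up for illustration. It also shows that `[^\w]+` splits the same way as the default `\W+`, and that `_` counts as a word character, which is why `type_1` stays in one piece.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class NonWordSplitSketch {

    // Hypothetical helper: split text on every match of the separator regex.
    static String[] split(String regex, String text) {
        return Pattern.compile(regex).split(text);
    }

    public static void main(String[] args) {
        // The comma and the space are both non-word characters.
        System.out.println(Arrays.toString(split("[^\\w]+", "foo,bar baz")));   // [foo, bar, baz]

        // The hyphen separates terms, the underscore does not.
        System.out.println(Arrays.toString(split("[^\\w]+", "type_1-type_4"))); // [type_1, type_4]

        // "[^\w]+" and "\W+" describe the same set of separators.
        System.out.println(Arrays.equals(
            split("[^\\w]+", "type_1-type_4"),
            split("\\W+", "type_1-type_4")));                                   // true
    }
}
```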
- [float]
- ===== CamelCase tokenizer
- [source,js]
- --------------------------------------------------
- PUT test?pretty=1
- {
-   "settings": {
-     "analysis": {
-       "analyzer": {
-         "camel": {
-           "type": "pattern",
-           "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
-         }
-       }
-     }
-   }
- }
- GET _cluster/health?wait_for_status=yellow
- GET test/_analyze?analyzer=camel&text=MooseX::FTPClass2_beta
- # "moose","x","ftp","class","2","beta"
- --------------------------------------------------
- // AUTOSENSE
- The regex above is easier to understand as:
- [source,js]
- --------------------------------------------------
-   ([^\p{L}\d]+)                 # swallow non letters and numbers,
- | (?<=\D)(?=\d)                 # or non-number followed by number,
- | (?<=\d)(?=\D)                 # or number followed by non-number,
- | (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
-   (?=\p{Lu})                    #   followed by upper case,
- | (?<=\p{Lu})                   # or upper case
-   (?=\p{Lu}                     #   followed by upper case
-     [\p{L}&&[^\p{Lu}]]          #   then lower case
-   )
- --------------------------------------------------
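The commented form above can be compiled as written if the `COMMENTS` flag is set. The following Java sketch is one way to check that the single-line and commented patterns split the sample text identically. The class `CamelCaseSplitSketch`, its constants, and the `tokens` helper are hypothetical, and the sketch assumes the analyzer's output corresponds to `Pattern.split` plus lowercasing; it does not go through Elasticsearch itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class CamelCaseSplitSketch {

    // The single-line pattern from the "camel" analyzer settings.
    static final String COMPACT =
        "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)"
        + "|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})"
        + "|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])";

    // The same pattern in commented form. It only compiles with
    // Pattern.COMMENTS, which ignores whitespace and "#" comments.
    static final String VERBOSE = String.join("\n",
        "  ([^\\p{L}\\d]+)               # swallow non letters and numbers,",
        "| (?<=\\D)(?=\\d)               # or non-number followed by number,",
        "| (?<=\\d)(?=\\D)               # or number followed by non-number,",
        "| (?<=[\\p{L}&&[^\\p{Lu}]])     # or lower case",
        "  (?=\\p{Lu})                   #   followed by upper case,",
        "| (?<=\\p{Lu})                  # or upper case",
        "  (?=\\p{Lu}                    #   followed by upper case",
        "    [\\p{L}&&[^\\p{Lu}]]        #   then lower case",
        "  )");

    // Hypothetical helper: split on the pattern, drop empty pieces, lowercase.
    static List<String> tokens(String regex, int flags, String text) {
        List<String> out = new ArrayList<>();
        for (String part : Pattern.compile(regex, flags).split(text)) {
            if (!part.isEmpty()) {
                out.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "MooseX::FTPClass2_beta";
        List<String> compact = tokens(COMPACT, 0, text);
        List<String> verbose = tokens(VERBOSE, Pattern.COMMENTS, text);
        System.out.println(compact);                 // [moose, x, ftp, class, 2, beta]
        System.out.println(compact.equals(verbose)); // true
    }
}
```

In analyzer settings, the commented form would presumably need `"flags": "COMMENTS"` alongside the pattern, with the newlines escaped inside the JSON string.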