[[analysis-pattern-analyzer]]
=== Pattern Analyzer

An analyzer of type `pattern` that can flexibly separate text into terms
via a regular expression. The following settings can be set for a
`pattern` analyzer type:

[cols="<,<",options="header",]
|===================================================================
|Setting |Description
|`lowercase` |Should terms be lowercased or not. Defaults to `true`.
|`pattern` |The regular expression pattern, defaults to `\W+`.
|`flags` |The regular expression flags.
|===================================================================

*IMPORTANT*: The regular expression should match the *token separators*,
not the tokens themselves.
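
For example, a pattern that matches a literal comma splits on the commas
themselves and leaves everything between them intact as terms. A minimal
sketch against the same `test` index used below (the `comma` analyzer
name is illustrative):

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/test' -d '
{
    "settings": {
        "analysis": {
            "analyzer": {
                "comma": {
                    "type": "pattern",
                    "pattern": ","
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=comma' -d 'foo,bar baz'
# "foo", "bar baz" -- the space is kept, because only commas separate terms
--------------------------------------------------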

Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"`. Check
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary[Java
Pattern API] for more details about `flags` options.
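
As a rough sketch of how these settings fit together (the `commented`
analyzer name and its pattern are illustrative, not part of the examples
below), `flags` and `lowercase` might be combined like this:

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/test' -d '
{
    "settings": {
        "analysis": {
            "analyzer": {
                "commented": {
                    "type": "pattern",
                    "lowercase": false,
                    "flags": "CASE_INSENSITIVE|COMMENTS",
                    "pattern": "\\W+ # break on non-word characters"
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=commented' -d 'Foo Bar'
# "Foo", "Bar" -- case is preserved because lowercase is false
--------------------------------------------------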

[float]
==== Pattern Analyzer Examples

In order to try out these examples, you should delete the `test` index
before running each example:

[source,js]
--------------------------------------------------
curl -XDELETE localhost:9200/test
--------------------------------------------------

[float]
===== Whitespace tokenizer

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/test' -d '
{
    "settings": {
        "analysis": {
            "analyzer": {
                "whitespace": {
                    "type": "pattern",
                    "pattern": "\\s+"
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=whitespace' -d 'foo,bar baz'
# "foo,bar", "baz"
--------------------------------------------------

[float]
===== Non-word character tokenizer

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/test' -d '
{
    "settings": {
        "analysis": {
            "analyzer": {
                "nonword": {
                    "type": "pattern",
                    "pattern": "[^\\w]+"
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'foo,bar baz'
# "foo,bar baz" becomes "foo", "bar", "baz"

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'type_1-type_4'
# "type_1-type_4" becomes "type_1", "type_4"
--------------------------------------------------

[float]
===== CamelCase tokenizer

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/test?pretty=1' -d '
{
    "settings": {
        "analysis": {
            "analyzer": {
                "camel": {
                    "type": "pattern",
                    "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=camel' -d '
MooseX::FTPClass2_beta
'
# "moose", "x", "ftp", "class", "2", "beta"
--------------------------------------------------

The regex above is easier to understand as:

[source,js]
--------------------------------------------------
  ([^\p{L}\d]+)                  # swallow non letters and numbers,
| (?<=\D)(?=\d)                  # or non-number followed by number,
| (?<=\d)(?=\D)                  # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])     # or lower case
  (?=\p{Lu})                     #   followed by upper case,
| (?<=\p{Lu})                    # or upper case
  (?=\p{Lu}                      #   followed by upper case
    [\p{L}&&[^\p{Lu}]]           #   then lower case
  )
--------------------------------------------------
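
As a quick check of the upper-upper-lower and letter-digit branches
(assuming the `camel` analyzer created above), an input such as
`HTTPServer2` should split like this:

[source,js]
--------------------------------------------------
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=camel' -d 'HTTPServer2'
# "http", "server", "2"
--------------------------------------------------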