
[[analysis-pattern-analyzer]]
=== Pattern Analyzer

An analyzer of type `pattern` that can flexibly separate text into terms
via a regular expression.

The following settings can be set for a `pattern` analyzer type:
[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`lowercase` |Should terms be lowercased or not. Defaults to `true`.
|`pattern` |The regular expression pattern. Defaults to `\W+`.
|`flags` |The regular expression flags.
|`stopwords` |A list of stopwords to initialize the stop filter with.
Defaults to an _empty_ stopword list. coming[1.0.0.RC1, Previously
defaulted to the English stopwords list]
|=======================================================================
*IMPORTANT*: The regular expression should match the *token separators*,
not the tokens themselves.

Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"`. Check
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary[Java
Pattern API] for more details about `flags` options.
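The separator-versus-token distinction can be sketched with Python's `re` module (an approximation of the analyzer's behaviour, not the Elasticsearch implementation, and Python regex syntax differs slightly from Java's):

```python
import re

text = "foo,bar baz"

# Correct: the pattern matches the *separators*, so splitting yields the tokens.
print(re.split(r"\W+", text))   # ['foo', 'bar', 'baz']

# Wrong: a pattern that matches the tokens themselves leaves only the separators.
print(re.split(r"\w+", text))   # ['', ',', ' ', '']
```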
[float]
==== Pattern Analyzer Examples

In order to try out these examples, you should delete the `test` index
before running each example:

[source,js]
--------------------------------------------------
curl -XDELETE localhost:9200/test
--------------------------------------------------
[float]
===== Whitespace tokenizer

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/test' -d '
{
    "settings":{
        "analysis": {
            "analyzer": {
                "whitespace":{
                    "type": "pattern",
                    "pattern":"\\s+"
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=whitespace' -d 'foo,bar baz'
# "foo,bar", "baz"
--------------------------------------------------
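The same behaviour can be approximated locally with Python's `re` module (a sketch of what the analyzer does, assuming Python's `\s` is close enough to Java's here):

```python
import re

def whitespace_analyze(text):
    # Split on runs of whitespace, then lowercase (the analyzer's default),
    # dropping any empty leading/trailing pieces.
    return [t.lower() for t in re.split(r"\s+", text) if t]

print(whitespace_analyze("foo,bar baz"))   # ['foo,bar', 'baz']
```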
[float]
===== Non-word character tokenizer

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/test' -d '
{
    "settings":{
        "analysis": {
            "analyzer": {
                "nonword":{
                    "type": "pattern",
                    "pattern":"[^\\w]+"
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'foo,bar baz'
# "foo,bar baz" becomes "foo", "bar", "baz"

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'type_1-type_4'
# "type_1","type_4"
--------------------------------------------------
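Again as a rough Python sketch of the same split (not the Elasticsearch implementation):

```python
import re

def nonword_analyze(text):
    # [^\w]+ splits on runs of non-word characters; "_" counts as a word
    # character, which is why "type_1" survives as a single term.
    return [t.lower() for t in re.split(r"[^\w]+", text) if t]

print(nonword_analyze("foo,bar baz"))      # ['foo', 'bar', 'baz']
print(nonword_analyze("type_1-type_4"))    # ['type_1', 'type_4']
```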
[float]
===== CamelCase tokenizer

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/test?pretty=1' -d '
{
    "settings":{
        "analysis": {
            "analyzer": {
                "camel":{
                    "type": "pattern",
                    "pattern":"([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=camel' -d '
MooseX::FTPClass2_beta
'
# "moose","x","ftp","class","2","beta"
--------------------------------------------------
The regex above is easier to understand as:

[source,js]
--------------------------------------------------
  ([^\\p{L}\\d]+)                 # swallow non letters and numbers,
| (?<=\\D)(?=\\d)                 # or non-number followed by number,
| (?<=\\d)(?=\\D)                 # or number followed by non-number,
| (?<=[ \\p{L} && [^\\p{Lu}]])    # or lower case
  (?=\\p{Lu})                     #   followed by upper case,
| (?<=\\p{Lu})                    # or upper case
  (?=\\p{Lu}                      #   followed by upper case
    [\\p{L}&&[^\\p{Lu}]]          #   then lower case
  )
--------------------------------------------------
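The same idea can be sketched in Python, with the caveat that Python's `re` module has no character-class intersection (`[\p{L}&&[^\p{Lu}]]` is Java-only), so the ASCII classes `[a-z]` and `[A-Z]` stand in for "lower case letter" and "upper case letter" below. This is an illustration of the splitting logic, not the analyzer itself:

```python
import re

# ASCII approximation of the Java camelCase pattern above.
CAMEL = re.compile(
    r"(?:[^a-zA-Z\d]+)"           # swallow non letters and numbers,
    r"|(?<=\D)(?=\d)"             # or non-number followed by number,
    r"|(?<=\d)(?=\D)"             # or number followed by non-number,
    r"|(?<=[a-z])(?=[A-Z])"       # or lower case followed by upper case,
    r"|(?<=[A-Z])(?=[A-Z][a-z])"  # or upper case followed by upper then lower
)

def camel_analyze(text):
    # Split on the pattern (including zero-width lookaround matches,
    # supported by re.split since Python 3.7), then lowercase.
    return [t.lower() for t in CAMEL.split(text) if t]

print(camel_analyze("MooseX::FTPClass2_beta"))
# ['moose', 'x', 'ftp', 'class', '2', 'beta']
```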