[[analysis-pattern-analyzer]]
=== Pattern Analyzer

An analyzer of type `pattern` that can flexibly separate text into terms
via a regular expression. The following settings can be set for a
`pattern` analyzer type:

[cols="<,<",options="header",]
|===================================================================
|Setting |Description
|`lowercase` |Should terms be lowercased or not. Defaults to `true`.

|`pattern` |The regular expression pattern, defaults to `\W+`.

|`flags` |The regular expression flags.

|`stopwords` |A list of stopwords to initialize the stop filter with.
Defaults to an 'empty' stopword list coming[1.0.0.RC1, Previously
defaulted to the English stopwords list]. Check
<<analysis-stop-analyzer,Stop Analyzer>> for more details.
|===================================================================

*IMPORTANT*: The regular expression should match the *token separators*,
not the tokens themselves.

Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"`. Check
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary[Java
Pattern API] for more details about `flags` options.
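
To see how these settings fit together, here is a minimal sketch. The index name
`my_index`, the analyzer name `csv_pattern`, the comma pattern and the stopword
list are purely illustrative, and the flags are included only to show the
pipe-separated syntax:

[source,js]
--------------------------------------------------
# illustrative sketch: index and analyzer names are hypothetical
curl -XPUT 'localhost:9200/my_index' -d '
{
    "settings":{
        "analysis": {
            "analyzer": {
                "csv_pattern":{
                    "type": "pattern",
                    "pattern": ",",
                    "flags": "CASE_INSENSITIVE|COMMENTS",
                    "lowercase": true,
                    "stopwords": ["and", "is"]
                }
            }
        }
    }
}'

curl 'localhost:9200/my_index/_analyze?pretty=1&analyzer=csv_pattern' -d 'foo,AND,Bar'
# "foo", "bar"
--------------------------------------------------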

[float]
==== Pattern Analyzer Examples

In order to try out these examples, you should delete the `test` index
before running each example:

[source,js]
--------------------------------------------------
curl -XDELETE localhost:9200/test
--------------------------------------------------

[float]
===== Whitespace tokenizer

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/test' -d '
{
    "settings":{
        "analysis": {
            "analyzer": {
                "whitespace":{
                    "type": "pattern",
                    "pattern": "\\s+"
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=whitespace' -d 'foo,bar baz'
# "foo,bar", "baz"
--------------------------------------------------

[float]
===== Non-word character tokenizer

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/test' -d '
{
    "settings":{
        "analysis": {
            "analyzer": {
                "nonword":{
                    "type": "pattern",
                    "pattern": "[^\\w]+"
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'foo,bar baz'
# "foo,bar baz" becomes "foo", "bar", "baz"

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'type_1-type_4'
# "type_1-type_4" becomes "type_1", "type_4"
--------------------------------------------------

[float]
===== CamelCase tokenizer

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/test?pretty=1' -d '
{
    "settings":{
        "analysis": {
            "analyzer": {
                "camel":{
                    "type": "pattern",
                    "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=camel' -d '
MooseX::FTPClass2_beta
'
# "moose","x","ftp","class","2","beta"
--------------------------------------------------

The regex above is easier to understand as:

[source,js]
--------------------------------------------------
  ([^\\p{L}\\d]+)                 # swallow non letters and numbers,
| (?<=\\D)(?=\\d)                 # or non-number followed by number,
| (?<=\\d)(?=\\D)                 # or number followed by non-number,
| (?<=[ \\p{L} && [^\\p{Lu}]])    # or lower case
  (?=\\p{Lu})                     #   followed by upper case,
| (?<=\\p{Lu})                    # or upper case
  (?=\\p{Lu}                      #   followed by upper case
    [\\p{L}&&[^\\p{Lu}]]          #   then lower case
  )
--------------------------------------------------
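
Because whitespace in the pattern is ignored when the `COMMENTS` flag is set, a
more readable, spaced-out version of the same pattern can also be passed directly
via the `flags` setting described above. The following is only a sketch of that
idea (delete the `test` index again first, as above), not part of the original
example:

[source,js]
--------------------------------------------------
# sketch: the camel pattern from above, spaced out for readability and
# compiled with the COMMENTS flag so the extra whitespace is ignored
curl -XPUT 'localhost:9200/test?pretty=1' -d '
{
    "settings":{
        "analysis": {
            "analyzer": {
                "camel":{
                    "type": "pattern",
                    "flags": "COMMENTS",
                    "pattern": "([^\\p{L}\\d]+) | (?<=\\D)(?=\\d) | (?<=\\d)(?=\\D) | (?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu}) | (?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=camel' -d 'MooseX::FTPClass2_beta'
# "moose","x","ftp","class","2","beta"
--------------------------------------------------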