[[analysis-pattern-analyzer]]
=== Pattern Analyzer

An analyzer of type `pattern` that can flexibly separate text into terms
via a regular expression. The following settings can be set for a
`pattern` analyzer:
[horizontal]
`lowercase`:: Whether terms should be lowercased. Defaults to `true`.
`pattern`:: The regular expression pattern. Defaults to `\W+`.
`flags`:: The regular expression flags.
`stopwords`:: A list of stopwords to initialize the stop filter with.
Defaults to an _empty_ stopword list. Check
<<analysis-stop-analyzer,Stop Analyzer>> for more details.
*IMPORTANT*: The regular expression should match the *token separators*,
not the tokens themselves.

Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"`. Check
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary[Java
Pattern API] for more details about `flags` options.
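
As a rough mental model (a simplified sketch, not the actual Lucene
implementation), a `pattern` analyzer splits the input on the regex,
lowercases the resulting terms, and drops stopwords. The helper name below
is hypothetical:

```python
import re

# Hypothetical helper mimicking the `pattern` analyzer defaults:
# split on the regex, optionally lowercase, then filter stopwords.
def pattern_analyze(text, pattern=r"\W+", lowercase=True, stopwords=()):
    terms = [t for t in re.split(pattern, text) if t]
    if lowercase:
        terms = [t.lower() for t in terms]
    return [t for t in terms if t not in stopwords]

print(pattern_analyze("The QUICK brown fox"))
# ['the', 'quick', 'brown', 'fox']
```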
[float]
==== Pattern Analyzer Examples

In order to try out these examples, you should delete the `test` index
before running each example.
[float]
===== Whitespace tokenizer

[source,js]
--------------------------------------------------
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace": {
          "type": "pattern",
          "pattern": "\\s+"
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

GET test/_analyze?analyzer=whitespace&text=foo,bar baz
# "foo,bar", "baz"
--------------------------------------------------
// AUTOSENSE
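
The same split can be reproduced in plain Python to see why the comma
survives: `\s+` matches only whitespace, so `foo,bar` stays a single term.

```python
import re

# Split on runs of whitespace only; punctuation stays inside terms.
print([t.lower() for t in re.split(r"\s+", "foo,bar baz") if t])
# ['foo,bar', 'baz']
```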
[float]
===== Non-word character tokenizer

[source,js]
--------------------------------------------------
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "nonword": {
          "type": "pattern",
          "pattern": "[^\\w]+"
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

GET test/_analyze?analyzer=nonword&text=foo,bar baz
# "foo,bar baz" becomes "foo", "bar", "baz"

GET test/_analyze?analyzer=nonword&text=type_1-type_4
# "type_1","type_4"
--------------------------------------------------
// AUTOSENSE
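
In plain Python the same pattern shows why the underscore survives: `\w`
includes `_`, so `[^\w]+` never splits on it.

```python
import re

nonword = r"[^\w]+"  # any run of non-word characters is a separator

print([t for t in re.split(nonword, "foo,bar baz") if t])
# ['foo', 'bar', 'baz']
print([t for t in re.split(nonword, "type_1-type_4") if t])
# ['type_1', 'type_4']
```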
[float]
===== CamelCase tokenizer

[source,js]
--------------------------------------------------
PUT test?pretty=1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

GET test/_analyze?analyzer=camel&text=MooseX::FTPClass2_beta
# "moose","x","ftp","class","2","beta"
--------------------------------------------------
// AUTOSENSE
The regex above is easier to understand as:

[source,js]
--------------------------------------------------

  ([^\p{L}\d]+)                 # swallow non letters and numbers,
| (?<=\D)(?=\d)                 # or non-number followed by number,
| (?<=\d)(?=\D)                 # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
  (?=\p{Lu})                    #   followed by upper case,
| (?<=\p{Lu})                   # or upper case
  (?=\p{Lu}                     #   followed by upper case
    [\p{L}&&[^\p{Lu}]]          #   then lower case
  )
--------------------------------------------------
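
Python's stdlib `re` has no Java-style character-class intersection
(`[\p{L}&&[^\p{Lu}]]`), so the rule set cannot be copied verbatim, but an
ASCII-only approximation (using `[a-z]`/`[A-Z]` in place of the Unicode
letter classes) reproduces the example output. This sketch requires
Python 3.7+, which allows splitting on zero-width lookaround matches:

```python
import re

# ASCII-only approximation of the camelCase pattern above.
camel = re.compile(
    r"(?:[^a-zA-Z\d]+)"           # non letters and numbers,
    r"|(?<=\D)(?=\d)"             # or non-digit followed by digit,
    r"|(?<=\d)(?=\D)"             # or digit followed by non-digit,
    r"|(?<=[a-z])(?=[A-Z])"       # or lower case followed by upper case,
    r"|(?<=[A-Z])(?=[A-Z][a-z])"  # or upper case followed by upper+lower.
)

print([t.lower() for t in camel.split("MooseX::FTPClass2_beta") if t])
# ['moose', 'x', 'ftp', 'class', '2', 'beta']
```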