[[analysis-fingerprint-analyzer]]
=== Fingerprint analyzer
++++
<titleabbrev>Fingerprint</titleabbrev>
++++

The `fingerprint` analyzer implements a
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[fingerprinting algorithm]
which is used by the OpenRefine project to assist in clustering.

Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated and concatenated into a single token. If a stopword list is
configured, stop words will also be removed.

[discrete]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
---------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "and consistent godel is said sentence this yes",
      "start_offset": 0,
      "end_offset": 52,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////

The above sentence would produce the following single term:

[source,text]
---------------------------
[ and consistent godel is said sentence this yes ]
---------------------------
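
Because the terms are sorted and deduplicated, neither word order nor
repetition in the input affects the result. For example, analyzing the
(illustrative) text `zebra apple Apple zebra` with the `fingerprint`
analyzer:

[source,console]
---------------------------
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "zebra apple Apple zebra"
}
---------------------------

should produce the following single term:

[source,text]
---------------------------
[ apple zebra ]
---------------------------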

[discrete]
=== Configuration

The `fingerprint` analyzer accepts the following parameters:

[horizontal]
`separator`::

    The character to use to concatenate the terms. Defaults to a space.
    See the example following this list.

`max_output_size`::

    The maximum token size to emit. Defaults to `255`. Tokens larger than
    this size will be discarded.

`stopwords`::

    A pre-defined stop words list like `_english_` or an array containing a
    list of stop words. Defaults to `_none_`.

`stopwords_path`::

    The path to a file containing stop words.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
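
For example, to join the terms with a `+` and cap the fingerprint at 100
characters, you could configure a custom `fingerprint` analyzer as follows
(the index and analyzer names here are only illustrative):

[source,console]
----------------------------
PUT my-index-000002
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_plus_analyzer": {
          "type": "fingerprint",
          "separator": "+",
          "max_output_size": 100
        }
      }
    }
  }
}

POST my-index-000002/_analyze
{
  "analyzer": "my_plus_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------

This should emit the single term
`and+consistent+godel+is+said+sentence+this+yes`; had the joined fingerprint
exceeded `max_output_size`, the token would have been discarded and nothing
emitted.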

[discrete]
=== Example configuration

In this example, we configure the `fingerprint` analyzer to use the
pre-defined list of English stop words:

[source,console]
----------------------------
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "consistent godel said sentence yes",
      "start_offset": 0,
      "end_offset": 52,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////

The above example produces the following term:

[source,text]
---------------------------
[ consistent godel said sentence yes ]
---------------------------

[discrete]
=== Definition

The `fingerprint` analyzer consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters (in order)::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-asciifolding-tokenfilter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
* <<analysis-fingerprint-tokenfilter>>

If you need to customize the `fingerprint` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify it,
usually by adding token filters. The following example recreates the built-in
`fingerprint` analyzer; you can use it as a starting point for further
customization:

[source,console]
----------------------------------------------------
PUT /fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: fingerprint_example, first: fingerprint, second: rebuilt_fingerprint}\nendyaml\n/]
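
For instance, to bring stop word removal back into the rebuilt analyzer, you
could add a <<analysis-stop-tokenfilter,Stop Token Filter>> before the
`fingerprint` filter. The sketch below (with illustrative index and filter
names) should be equivalent to configuring the built-in analyzer with
`"stopwords": "_english_"`, but the custom form lets you insert additional
filters at any position in the chain:

[source,console]
----------------------------------------------------
PUT /fingerprint_stop_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "rebuilt_fingerprint_with_stop": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "english_stop",
            "fingerprint"
          ]
        }
      }
    }
  }
}
----------------------------------------------------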