[[analysis-fingerprint-analyzer]]
=== Fingerprint Analyzer

The `fingerprint` analyzer implements a
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[fingerprinting algorithm]
which is used by the OpenRefine project to assist in clustering.

Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated and concatenated into a single token. If a stopword list is
configured, stop words will also be removed.
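The steps above can be sketched in Python. This is an illustrative approximation, not the actual Lucene implementation; `unicodedata` normalization stands in for the ASCII-folding step:

[source,python]
----------------------------
import unicodedata

def fingerprint(text, separator=" ", stopwords=()):
    # Lowercase, then fold extended characters to ASCII: "Gödel" -> "godel"
    folded = unicodedata.normalize("NFKD", text.lower())
    folded = folded.encode("ascii", "ignore").decode("ascii")
    # Split on non-alphanumeric characters and drop any configured stop words
    words = "".join(c if c.isalnum() else " " for c in folded).split()
    terms = [t for t in words if t not in stopwords]
    # Sort, deduplicate, and concatenate into a single token
    return separator.join(sorted(set(terms)))

print(fingerprint("Yes yes, Gödel said this sentence is consistent and."))
# and consistent godel is said sentence this yes
----------------------------

Because the terms are sorted and deduplicated, word order and repetition in the input do not affect the resulting token, which is what makes the fingerprint useful for clustering near-duplicate strings.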
[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
---------------------------
// CONSOLE
/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "and consistent godel is said sentence this yes",
      "start_offset": 0,
      "end_offset": 52,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////
The above sentence would produce the following single term:

[source,text]
---------------------------
[ and consistent godel is said sentence this yes ]
---------------------------
[float]
=== Configuration

The `fingerprint` analyzer accepts the following parameters:

[horizontal]
`separator`::

    The character to use to concatenate the terms. Defaults to a space.

`max_output_size`::

    The maximum token size to emit. Defaults to `255`. Tokens larger than
    this size will be discarded.

`stopwords`::

    A pre-defined stop words list like `_english_` or an array containing a
    list of stop words. Defaults to `_none_`.

`stopwords_path`::

    The path to a file containing stop words.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
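The interplay of `separator` and `max_output_size` can be illustrated with a small Python sketch. This is a hypothetical helper, not Elasticsearch code; the key point, taken from the parameter description above, is that an oversized token is discarded entirely rather than truncated:

[source,python]
----------------------------
def emit_fingerprint(terms, separator=" ", max_output_size=255):
    # Concatenate the sorted, deduplicated terms with the separator
    token = separator.join(sorted(set(terms)))
    # A token larger than max_output_size is discarded, not truncated
    return token if len(token) <= max_output_size else None

print(emit_fingerprint(["said", "godel"], separator="+"))      # godel+said
print(emit_fingerprint(["said", "godel"], max_output_size=5))  # None
----------------------------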
[float]
=== Example configuration

In this example, we configure the `fingerprint` analyzer to use the
pre-defined list of English stop words:
[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------
// CONSOLE
/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "consistent godel said sentence yes",
      "start_offset": 0,
      "end_offset": 52,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////
The above example produces the following term:

[source,text]
---------------------------
[ consistent godel said sentence yes ]
---------------------------
[float]
=== Definition

The `fingerprint` analyzer consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters (in order)::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-asciifolding-tokenfilter,ASCII Folding Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
* <<analysis-fingerprint-tokenfilter,Fingerprint Token Filter>>
If you need to customize the `fingerprint` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. The following example recreates the
built-in `fingerprint` analyzer, and you can use it as a starting point
for further customization:
[source,js]
----------------------------------------------------
PUT /fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: fingerprint_example, first: fingerprint, second: rebuilt_fingerprint}\nendyaml\n/]