[[analysis-fingerprint-analyzer]]
=== Fingerprint Analyzer

The `fingerprint` analyzer implements a
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[fingerprinting algorithm]
which is used by the OpenRefine project to assist in clustering.

Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated, and concatenated into a single token. If a stopword list is
configured, stop words will also be removed.

[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
---------------------------
// CONSOLE

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "and consistent godel is said sentence this yes",
      "start_offset": 0,
      "end_offset": 52,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////

The above sentence would produce the following single term:

[source,text]
---------------------------
[ and consistent godel is said sentence this yes ]
---------------------------

[float]
=== Configuration

The `fingerprint` analyzer accepts the following parameters:

[horizontal]
`separator`::

    The character to use to concatenate the terms. Defaults to a space.

`max_output_size`::

    The maximum token size to emit. Defaults to `255`. Tokens larger than
    this size will be discarded.

`stopwords`::

    A pre-defined stop words list like `_english_` or an array containing a
    list of stop words. Defaults to `_none_`.

`stopwords_path`::

    The path to a file containing stop words.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
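
The `separator` and `max_output_size` parameters can be combined in a custom
analyzer of type `fingerprint`. The following is a minimal sketch; the index
and analyzer names are illustrative:

[source,js]
----------------------------
PUT another_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "plus_fingerprint_analyzer": {
          "type": "fingerprint",
          "separator": "+",
          "max_output_size": 100
        }
      }
    }
  }
}
----------------------------

With this configuration, the terms in the fingerprint are joined with `+`
instead of a space, so the example sentence above would produce
`and+consistent+godel+is+said+sentence+this+yes`, and any fingerprint longer
than 100 characters would be discarded rather than truncated.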

[float]
=== Example configuration

In this example, we configure the `fingerprint` analyzer to use the
pre-defined list of English stop words:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------
// CONSOLE

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "consistent godel said sentence yes",
      "start_offset": 0,
      "end_offset": 52,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////

The above example produces the following term:

[source,text]
---------------------------
[ consistent godel said sentence yes ]
---------------------------

[float]
=== Definition

The `fingerprint` analyzer consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters (in order)::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-asciifolding-tokenfilter,ASCII Folding Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
* <<analysis-fingerprint-tokenfilter,Fingerprint Token Filter>>

If you need to customize the `fingerprint` analyzer beyond its configuration
parameters, you must recreate it as a `custom` analyzer and modify it,
usually by adding token filters. The following example recreates the
built-in `fingerprint` analyzer, which you can use as a starting point for
further customization; a sketch of one such customization follows it:

[source,js]
----------------------------------------------------
PUT /fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: fingerprint_example, first: fingerprint, second: rebuilt_fingerprint}\nendyaml\n/]
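
For example, to enable stop word removal in the rebuilt analyzer (the
built-in analyzer leaves its stop filter disabled by default), you could
place a `stop` token filter before the `fingerprint` filter. This is a
minimal sketch; the index, analyzer, and filter names are illustrative:

[source,js]
----------------------------------------------------
PUT /fingerprint_custom_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "rebuilt_fingerprint_with_stop": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "english_stop",
            "fingerprint"
          ]
        }
      }
    }
  }
}
----------------------------------------------------

Because `english_stop` runs before the `fingerprint` filter, stop words are
removed before the remaining terms are sorted, deduplicated, and
concatenated, matching the built-in analyzer configured with
`"stopwords": "_english_"`.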