[[analysis-fingerprint-analyzer]]
=== Fingerprint Analyzer

The `fingerprint` analyzer implements a
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[fingerprinting algorithm]
which is used by the OpenRefine project to assist in clustering.

Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated and concatenated into a single token. If a stopword list is
configured, stop words will also be removed.

[float]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
---------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "and consistent godel is said sentence this yes",
      "start_offset": 0,
      "end_offset": 52,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////

The above sentence would produce the following single term:

[source,text]
---------------------------
[ and consistent godel is said sentence this yes ]
---------------------------

[float]
=== Configuration

The `fingerprint` analyzer accepts the following parameters:

[horizontal]
`separator`::

    The character to use to concatenate the terms. Defaults to a space.

`max_output_size`::

    The maximum token size to emit. Defaults to `255`. Tokens larger than
    this size will be discarded.

`stopwords`::

    A pre-defined stop words list like `_english_` or an array containing a
    list of stop words. Defaults to `_none_`.

`stopwords_path`::

    The path to a file containing stop words.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
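
As a sketch of the remaining parameters, the following request configures a
fingerprint analyzer with a custom separator and a smaller output limit. The
index and analyzer names here are illustrative placeholders, not part of the
examples elsewhere on this page:

[source,console]
----------------------------
PUT my_separator_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_plus_analyzer": {
          "type": "fingerprint",
          "separator": "+",
          "max_output_size": 100
        }
      }
    }
  }
}
----------------------------

With this configuration, terms are joined with `+` instead of a space, and any
fingerprint longer than 100 characters is discarded rather than emitted.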

[float]
=== Example configuration

In this example, we configure the `fingerprint` analyzer to use the
pre-defined list of English stop words:

[source,console]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "consistent godel said sentence yes",
      "start_offset": 0,
      "end_offset": 52,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////

The above example produces the following term:

[source,text]
---------------------------
[ consistent godel said sentence yes ]
---------------------------

[float]
=== Definition

The `fingerprint` analyzer consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters (in order)::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-asciifolding-tokenfilter,ASCII Folding Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
* <<analysis-fingerprint-tokenfilter,Fingerprint Token Filter>>

If you need to customize the `fingerprint` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify it,
usually by adding token filters. The following recreates the built-in
`fingerprint` analyzer, and you can use it as a starting point for further
customization:

[source,console]
----------------------------------------------------
PUT /fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: fingerprint_example, first: fingerprint, second: rebuilt_fingerprint}\nendyaml\n/]
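
To spot-check that the rebuilt analyzer matches the built-in one, you can run
it against the sample sentence from the start of this page. This request is
only an illustrative sketch and is not part of the test directive above:

[source,console]
----------------------------------------------------
POST /fingerprint_example/_analyze
{
  "analyzer": "rebuilt_fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------------------------------

This should return the same single token,
`and consistent godel is said sentence this yes`, that the built-in
`fingerprint` analyzer produced in the first example.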