[[analysis-fingerprint-analyzer]]
=== Fingerprint Analyzer

The `fingerprint` analyzer implements a
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[fingerprinting algorithm]
which is used by the OpenRefine project to assist in clustering.

Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated and concatenated into a single token. If a stopword list is
configured, stop words will also be removed.
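The steps above can be sketched in Python. This is a rough approximation of the analyzer's behavior, not the actual Lucene implementation; in particular, the word-character regex is an assumption about how tokenization splits the input.

```python
import re
import unicodedata

def fingerprint(text, separator=" ", stopwords=()):
    """Rough sketch of the fingerprint steps above (not the actual Lucene code)."""
    # Lowercase, then fold extended characters to their ASCII equivalents.
    folded = unicodedata.normalize("NFKD", text.lower())
    folded = folded.encode("ascii", "ignore").decode("ascii")
    # Split into terms (word characters only -- an assumption about tokenization),
    # drop stop words, then sort, deduplicate, and concatenate into one token.
    terms = re.findall(r"\w+", folded)
    terms = sorted(set(t for t in terms if t not in stopwords))
    return separator.join(terms)

print(fingerprint("Yes yes, Gödel said this sentence is consistent and."))
# and consistent godel is said sentence this yes
```

Because the terms are sorted and deduplicated, strings that differ only in word order, repetition, or accents collapse to the same fingerprint, which is what makes the output useful for clustering.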
[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters (in order)::
1. <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
2. <<analysis-asciifolding-tokenfilter,ASCII Folding Token Filter>>
3. <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
4. <<analysis-fingerprint-tokenfilter,Fingerprint Token Filter>>
[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
---------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "and consistent godel is said sentence this yes",
      "start_offset": 0,
      "end_offset": 52,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////
The above sentence would produce the following single term:

[source,text]
---------------------------
[ and consistent godel is said sentence this yes ]
---------------------------
[float]
=== Configuration

The `fingerprint` analyzer accepts the following parameters:

[horizontal]
`separator`::

    The character to use to concatenate the terms. Defaults to a space.

`max_output_size`::

    The maximum token size to emit. Defaults to `255`. Tokens larger than
    this size will be discarded.

`stopwords`::

    A pre-defined stop words list like `_english_` or an array containing a
    list of stop words. Defaults to `\_none_`.

`stopwords_path`::

    The path to a file containing stop words.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
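The effect of each parameter can be illustrated with a small Python sketch. This approximates the analyzer rather than reproducing it, and the tiny stop word set is an illustrative stand-in for the full `_english_` list.

```python
import re
import unicodedata

# Tiny illustrative stop word set -- a stand-in for the full `_english_` list.
STOPWORDS = {"and", "is", "this"}

def fingerprint(text, separator=" ", max_output_size=255, stopwords=()):
    """Approximate the analyzer's parameter handling (not the actual code)."""
    folded = unicodedata.normalize("NFKD", text.lower())
    folded = folded.encode("ascii", "ignore").decode("ascii")
    terms = sorted(set(t for t in re.findall(r"\w+", folded) if t not in stopwords))
    token = separator.join(terms)
    # Tokens larger than max_output_size are discarded outright, not truncated.
    return token if len(token) <= max_output_size else None

text = "Yes yes, Gödel said this sentence is consistent and."
print(fingerprint(text, separator="+"))        # terms joined with '+' instead of spaces
print(fingerprint(text, stopwords=STOPWORDS))  # consistent godel said sentence yes
print(fingerprint(text, max_output_size=10))   # None: the token exceeds 10 chars
```

Note that `max_output_size` drops the whole token rather than truncating it, so a document whose fingerprint exceeds the limit produces no output at all.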
[float]
=== Example configuration

In this example, we configure the `fingerprint` analyzer to use the
pre-defined list of English stop words:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "consistent godel said sentence yes",
      "start_offset": 0,
      "end_offset": 52,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces the following term:

[source,text]
---------------------------
[ consistent godel said sentence yes ]
---------------------------