[[analysis-fingerprint-analyzer]]
=== Fingerprint Analyzer

The `fingerprint` analyzer implements a
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[fingerprinting algorithm]
which is used by the OpenRefine project to assist in clustering.

The `fingerprint` analyzer is composed of a <<analysis-standard-tokenizer>> and four
token filters (in this order): <<analysis-lowercase-tokenfilter>>, <<analysis-asciifolding-tokenfilter>>,
<<analysis-stop-tokenfilter>> and <<analysis-fingerprint-tokenfilter>>.
Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated and concatenated into a single token. If a stopword list is
configured, stop words will also be removed. For example, the sentence:

____
"Yes yes, Gödel said this sentence is consistent and."
____

will be transformed into the token: `"and consistent godel is said sentence this yes"`

Notice how the words are all lowercased, the umlaut in "gödel" has been normalized
to "godel", punctuation has been removed, and "yes" has been deduplicated.
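
The transformation above can be sketched in a few lines of Python. This is a rough
approximation for illustration only, not the actual analyzer: `re.findall(r"\w+", ...)`
stands in for the standard tokenizer, and Unicode NFKD decomposition approximates
ASCII folding.

```python
import re
import unicodedata

def fingerprint(text, separator=" ", stopwords=()):
    # Split on runs of word characters, roughly like the standard tokenizer,
    # and lowercase everything up front.
    tokens = re.findall(r"\w+", text.lower())
    # Approximate ASCII folding: decompose accented characters and drop
    # the non-ASCII combining marks (e.g. "gödel" -> "godel").
    folded = (
        unicodedata.normalize("NFKD", t).encode("ascii", "ignore").decode("ascii")
        for t in tokens
    )
    # Remove stop words, then sort and deduplicate before concatenating.
    kept = sorted({t for t in folded if t and t not in stopwords})
    return separator.join(kept)

fingerprint("Yes yes, Gödel said this sentence is consistent and.")
# → "and consistent godel is said sentence this yes"
```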
The `fingerprint` analyzer has the following configurable settings:

[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`separator` | The character that separates the tokens after concatenation.
Defaults to a space.
|`max_output_size` | The maximum token size to emit. Defaults to `255`. See
<<analysis-fingerprint-tokenfilter-max-size>>.
|`preserve_original` | If `true`, emits both the original and folded version of
tokens that contain extended characters. Defaults to `false`.
|`stopwords` | A list of stop words to use. Defaults to an empty list (`_none_`).
|`stopwords_path` | A path (either relative to the `config` location, or absolute) to a
stopwords file. Each stop word should be on its own line (separated by a line
break). The file must be UTF-8 encoded.
|=======================================================================
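
Putting the settings together, a custom `fingerprint` analyzer might be declared in
the index settings like the sketch below. The index name `my-index`, the analyzer
name `my_fingerprint`, and the specific values are hypothetical, chosen only to
show each setting in use:

```
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint": {
          "type": "fingerprint",
          "separator": "+",
          "max_output_size": 100,
          "stopwords": [ "and", "is", "the" ]
        }
      }
    }
  }
}
```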