[[analysis-fingerprint-analyzer]]
=== Fingerprint Analyzer

The `fingerprint` analyzer implements a
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[fingerprinting algorithm]
which is used by the OpenRefine project to assist in clustering.

Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated and concatenated into a single token.  If a stopword list is
configured, stop words will also be removed.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters (in order)::
1. <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
2. <<analysis-asciifolding-tokenfilter,ASCII Folding Token Filter>>
3. <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
4. <<analysis-fingerprint-tokenfilter,Fingerprint Token Filter>>

[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
---------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "and consistent godel is said sentence this yes",
      "start_offset": 0,
      "end_offset": 52,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above sentence would produce the following single term:

[source,text]
---------------------------
[ and consistent godel is said sentence this yes ]
---------------------------

[float]
=== Configuration

The `fingerprint` analyzer accepts the following parameters:

[horizontal]
`separator`::

    The character to use to concatenate the terms.  Defaults to a space.

`max_output_size`::

    The maximum token size to emit.  Defaults to `255`.  Tokens larger than
    this size will be discarded.

`stopwords`::

    A pre-defined stop words list like `_english_` or an array containing a
    list of stop words.  Defaults to `\_none_`.

`stopwords_path`::

    The path to a file containing stop words.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.

[float]
=== Example configuration

In this example, we configure the `fingerprint` analyzer to use the
pre-defined list of English stop words:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "consistent godel said sentence yes",
      "start_offset": 0,
      "end_offset": 52,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces the following term:

[source,text]
---------------------------
[ consistent godel said sentence yes ]
---------------------------
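The analysis chain described above (lowercase, ASCII-fold, optionally drop stop words, then sort, deduplicate, and join) can be sketched in plain Python. This is an illustrative approximation, not Elasticsearch's implementation: the `unicodedata` NFKD round-trip here is only a rough stand-in for the ASCII Folding Token Filter, and the hand-picked stop word set in the second call is a small subset standing in for `_english_`.

```python
import unicodedata

def fingerprint(text, separator=" ", stopwords=frozenset()):
    """Approximate the fingerprint analyzer: lowercase, ASCII-fold,
    remove stop words, sort, deduplicate, and concatenate."""
    # Lowercase, then strip extended characters (rough ASCII folding)
    folded = unicodedata.normalize("NFKD", text.lower())
    folded = folded.encode("ascii", "ignore").decode("ascii")
    # Split on any non-alphanumeric character and drop stop words
    tokens = [t for t in "".join(c if c.isalnum() else " " for c in folded).split()
              if t not in stopwords]
    # Sort, deduplicate, and join into a single token
    return separator.join(sorted(set(tokens)))

print(fingerprint("Yes yes, Gödel said this sentence is consistent and."))
# → and consistent godel is said sentence this yes

# With a hand-picked subset of English stop words (stand-in for `_english_`):
print(fingerprint("Yes yes, Gödel said this sentence is consistent and.",
                  stopwords={"and", "is", "this"}))
# → consistent godel said sentence yes
```

Because the output depends only on the set of terms, differently ordered or duplicated variants of the same text ("Gödel said yes" and "yes, said Gödel") collapse to the same fingerprint, which is what makes the algorithm useful for clustering.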