[[analysis-fingerprint-analyzer]]
=== Fingerprint analyzer
++++
<titleabbrev>Fingerprint</titleabbrev>
++++

The `fingerprint` analyzer implements a
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[fingerprinting algorithm]
which is used by the OpenRefine project to assist in clustering.

Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated and concatenated into a single token. If a stopword list is
configured, stop words will also be removed.

[discrete]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
---------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "and consistent godel is said sentence this yes",
      "start_offset": 0,
      "end_offset": 52,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////

The above sentence would produce the following single term:

[source,text]
---------------------------
[ and consistent godel is said sentence this yes ]
---------------------------

[discrete]
=== Configuration

The `fingerprint` analyzer accepts the following parameters:

[horizontal]
`separator`::

    The character to use to concatenate the terms. Defaults to a space.

`max_output_size`::

    The maximum token size to emit. Defaults to `255`. Tokens larger than
    this size will be discarded.

`stopwords`::

    A pre-defined stop words list like `_english_` or an array containing a
    list of stop words. Defaults to `_none_`.

`stopwords_path`::

    The path to a file containing stop words.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
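For example, the following request (the index and analyzer names here are
only illustrative) defines a fingerprint analyzer that joins terms with `+`
and discards any output token longer than 100 characters:

[source,console]
----------------------------
PUT my-index-000002
{
  "settings": {
    "analysis": {
      "analyzer": {
        "plus_fingerprint_analyzer": {
          "type": "fingerprint",
          "separator": "+",
          "max_output_size": 100
        }
      }
    }
  }
}

POST my-index-000002/_analyze
{
  "analyzer": "plus_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------

Given the example sentence above, this analyzer should emit
`and+consistent+godel+is+said+sentence+this+yes` as its single term.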
[discrete]
=== Example configuration

In this example, we configure the `fingerprint` analyzer to use the
pre-defined list of English stop words:

[source,console]
----------------------------
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "consistent godel said sentence yes",
      "start_offset": 0,
      "end_offset": 52,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////

The above example produces the following term:

[source,text]
---------------------------
[ consistent godel said sentence yes ]
---------------------------

[discrete]
=== Definition

The `fingerprint` analyzer consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters (in order)::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-asciifolding-tokenfilter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
* <<analysis-fingerprint-tokenfilter>>

If you need to customize the `fingerprint` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. The following would recreate the
built-in `fingerprint` analyzer, and you can use it as a starting point for
further customization:

[source,console]
----------------------------------------------------
PUT /fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// TEST[s/\n$/\nstartyaml\n  - compare_analyzers: {index: fingerprint_example, first: fingerprint, second: rebuilt_fingerprint}\nendyaml\n/]
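For example, to rebuild the analyzer with stop word removal enabled, you
could add the `stop` token filter to the chain, in the position shown in the
definition above. The index and analyzer names below are illustrative, and
the `stop` filter falls back to its default English stop word list; treat
this as a sketch rather than a drop-in replacement for the built-in analyzer:

[source,console]
----------------------------------------------------
PUT /fingerprint_stop_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint_with_stop": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "stop",
            "fingerprint"
          ]
        }
      }
    }
  }
}
----------------------------------------------------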