[[analysis-fingerprint-analyzer]]
=== Fingerprint analyzer
++++
<titleabbrev>Fingerprint</titleabbrev>
++++

The `fingerprint` analyzer implements a
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[fingerprinting algorithm]
which is used by the OpenRefine project to assist in clustering.

Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated and concatenated into a single token. If a stopword list is
configured, stop words will also be removed.

[discrete]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
---------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "and consistent godel is said sentence this yes",
      "start_offset": 0,
      "end_offset": 52,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////

The above sentence would produce the following single term:

[source,text]
---------------------------
[ and consistent godel is said sentence this yes ]
---------------------------

[discrete]
=== Configuration

The `fingerprint` analyzer accepts the following parameters:

[horizontal]
`separator`::

    The character to use to concatenate the terms. Defaults to a space.

`max_output_size`::

    The maximum token size to emit. Defaults to `255`. Tokens larger than
    this size will be discarded.

`stopwords`::

    A pre-defined stop words list like `_english_` or an array containing a
    list of stop words. Defaults to `_none_`.

`stopwords_path`::

    The path to a file containing stop words.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
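For example, the following request (the index and analyzer names here are
only illustrative) defines a fingerprint analyzer that joins terms with `+`
and discards any output token longer than 100 characters:

[source,console]
----------------------------
PUT my-index-000002
{
  "settings": {
    "analysis": {
      "analyzer": {
        "plus_fingerprint_analyzer": {
          "type": "fingerprint",
          "separator": "+",
          "max_output_size": 100
        }
      }
    }
  }
}

POST my-index-000002/_analyze
{
  "analyzer": "plus_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------

Given the example sentence above, this analyzer should emit
`and+consistent+godel+is+said+sentence+this+yes` as its single term.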
[discrete]
=== Example configuration

In this example, we configure the `fingerprint` analyzer to use the
pre-defined list of English stop words:

[source,console]
----------------------------
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "consistent godel said sentence yes",
      "start_offset": 0,
      "end_offset": 52,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////

The above example produces the following term:

[source,text]
---------------------------
[ consistent godel said sentence yes ]
---------------------------

[discrete]
=== Definition

The `fingerprint` analyzer consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters (in order)::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-asciifolding-tokenfilter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
* <<analysis-fingerprint-tokenfilter>>

If you need to customize the `fingerprint` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. The following would recreate the
built-in `fingerprint` analyzer, and you can use it as a starting point for
further customization:

[source,console]
----------------------------------------------------
PUT /fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// TEST[s/\n$/\nstartyaml\n  - compare_analyzers: {index: fingerprint_example, first: fingerprint, second: rebuilt_fingerprint}\nendyaml\n/]
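For example, to rebuild the analyzer with stop word removal enabled, you
could add the `stop` token filter to the chain, in the position shown in the
definition above. The index and analyzer names below are illustrative, and
the `stop` filter falls back to its default English stop word list; treat
this as a sketch rather than a drop-in replacement for the built-in analyzer:

[source,console]
----------------------------------------------------
PUT /fingerprint_stop_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint_with_stop": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "stop",
            "fingerprint"
          ]
        }
      }
    }
  }
}
----------------------------------------------------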