
Synonyms Overview Documentation (#98202)

Carlos Delgado 2 years ago
commit
c596f121b4

+ 76 - 113
docs/reference/analysis/tokenfilters/synonym-graph-tokenfilter.asciidoc

@@ -4,7 +4,7 @@
 <titleabbrev>Synonym graph</titleabbrev>
 ++++
 
-The `synonym_graph` token filter allows to easily handle synonyms,
+The `synonym_graph` token filter lets you easily handle <<search-with-synonyms,synonyms>>,
 including multi-word synonyms correctly during the analysis process.
 
 In order to properly handle multi-word synonyms this token filter
@@ -19,37 +19,58 @@ only. If you want to apply synonyms during indexing please use the
 standard <<analysis-synonym-tokenfilter,synonym token filter>>.
 ===============================
 
-Synonyms are configured using a configuration file.
-Here is an example:
+[discrete]
+[[analysis-synonym-graph-define-synonyms]]
+==== Define synonyms sets
 
-[source,console]
---------------------------------------------------
-PUT /test_index
-{
-  "settings": {
-    "index": {
-      "analysis": {
-        "analyzer": {
-          "search_synonyms": {
-            "tokenizer": "whitespace",
-            "filter": [ "graph_synonyms" ]
-          }
-        },
-        "filter": {
-          "graph_synonyms": {
-            "type": "synonym_graph",
-            "synonyms_path": "analysis/synonym.txt"
-          }
-        }
-      }
+include::synonyms-format.asciidoc[]
+
+[discrete]
+[[analysis-synonym-graph-configure-sets]]
+==== Configure synonyms sets
+
+Synonyms can be configured using the <<synonyms-store-synonyms-api,synonyms API>>, a <<synonyms-store-synonyms-file,synonyms file>>, or directly <<synonyms-store-synonyms-inline,inlined>> in the token filter configuration.
+See <<synonyms-store-synonyms,store your synonyms set>> for more details on each option.
+
+Use the `synonyms_set` configuration option to provide a synonyms set created with the synonyms management APIs:
+
+[source,JSON]
+----
+  "filter": {
+    "synonyms_filter": {
+      "type": "synonym_graph",
+      "synonyms_set": "my-synonym-set",
+      "updateable": true
     }
   }
-}
---------------------------------------------------
+----
+
+Use `synonyms_path` to provide a synonyms file:
+
+[source,JSON]
+----
+  "filter": {
+    "synonyms_filter": {
+      "type": "synonym_graph",
+      "synonyms_path": "analysis/synonym-set.txt"
+    }
+  }
+----
 
-The above configures a `search_synonyms` filter, with a path of
-`analysis/synonym.txt` (relative to the `config` location). The
-`search_synonyms` analyzer is then configured with the filter.
+The above configures a `synonym_graph` filter, with a path of
+`analysis/synonym-set.txt` (relative to the `config` location).
+
+Use `synonyms` to define inline synonyms:
+
+[source,JSON]
+----
+  "filter": {
+    "synonyms_filter": {
+      "type": "synonym_graph",
+      "synonyms": ["pc => personal computer", "computer, pc, laptop"]
+    }
+  }
+----
 
 Additional settings are:
 
@@ -99,103 +120,45 @@ stop word.
 
 [discrete]
 [[synonym-graph-tokenizer-ignore_case-deprecated]]
-==== `tokenizer` and `ignore_case` are deprecated
+===== `tokenizer` and `ignore_case` are deprecated
 
 The `tokenizer` parameter controls the tokenizers that will be used to
 tokenize the synonym. This parameter is for backwards compatibility with indices that were created before 6.0.
 The `ignore_case` parameter works with `tokenizer` parameter only.
 
-Two synonym formats are supported: Solr, WordNet.
-
 [discrete]
-==== Solr synonyms
-
-The following is a sample format of the file:
-
-[source,synonyms]
---------------------------------------------------
-include::{es-test-dir}/cluster/config/analysis/synonym.txt[]
---------------------------------------------------
-
-You can also define synonyms for the filter directly in the
-configuration file (note use of `synonyms` instead of `synonyms_path`):
-
-[source,console]
---------------------------------------------------
-PUT /test_index
-{
-  "settings": {
-    "index": {
-      "analysis": {
-        "filter": {
-          "synonym": {
-            "type": "synonym_graph",
-            "synonyms": [
-              "lol, laughing out loud",
-              "universe, cosmos"
-            ]
-          }
+[[analysis-synonym-graph-analizers-configure]]
+==== Configure analyzers with synonym graph token filters
+
+To apply synonyms, you need to include a synonym graph token filter in an analyzer:
+
+[source,JSON]
+----
+      "analyzer": {
+        "my_analyzer": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": ["lowercase", "synonyms_filter"]
         }
       }
-    }
-  }
-}
---------------------------------------------------
-
-However, it is recommended to define large synonyms set in a file using
-`synonyms_path`, because specifying them inline increases cluster size unnecessarily.
+----
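+
+As a fuller sketch, the filter and analyzer fragments above could be combined into a single index creation request.
+The index name `my-index` is hypothetical and the rules reuse the inline example.
+Because this filter is meant for search analyzers, `my_analyzer` would typically be referenced as the `search_analyzer` of a field:
+
+[source,console]
+----
+PUT /my-index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "synonyms_filter": {
+          "type": "synonym_graph",
+          "synonyms": ["pc => personal computer", "computer, pc, laptop"]
+        }
+      },
+      "analyzer": {
+        "my_analyzer": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": ["lowercase", "synonyms_filter"]
+        }
+      }
+    }
+  }
+}
+----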
 
 [discrete]
-==== WordNet synonyms
+[[analysis-synonym-graph-token-order]]
+===== Token filters ordering
 
-Synonyms based on https://wordnet.princeton.edu/[WordNet] format can be
-declared using `format`:
+Order is important for your token filters.
+Text will be processed first through filters preceding the synonym filter before being processed by the synonym filter.
 
-[source,console]
---------------------------------------------------
-PUT /test_index
-{
-  "settings": {
-    "index": {
-      "analysis": {
-        "filter": {
-          "synonym": {
-            "type": "synonym_graph",
-            "format": "wordnet",
-            "synonyms": [
-              "s(100000001,1,'abstain',v,1,0).",
-              "s(100000001,2,'refrain',v,1,0).",
-              "s(100000001,3,'desist',v,1,0)."
-            ]
-          }
-        }
-      }
-    }
-  }
-}
---------------------------------------------------
+In the above example, text will be lowercased by the `lowercase` filter before being processed by the `synonyms_filter`.
+This means that all the synonyms defined there need to be in lowercase, or the synonyms filter won't find them.
 
-Using `synonyms_path` to define WordNet synonyms in a file is supported
-as well.
+The synonym rules should not contain words that are removed by a filter that appears later in the chain (like a `stop` filter).
+Removing a term from a synonym rule means there will be no matching for it at query time.
 
-[discrete]
-==== Parsing synonym files
-
-Elasticsearch will use the token filters preceding the synonym filter
-in a tokenizer chain to parse the entries in a synonym file. So, for example, if a
-synonym filter is placed after a stemmer, then the stemmer will also be applied
-to the synonym entries. Because entries in the synonym map cannot have stacked
-positions, some token filters may cause issues here. Token filters that produce
-multiple versions of a token may choose which version of the token to emit when
-parsing synonyms, e.g. `asciifolding` will only produce the folded version of the
-token. Others, e.g. `multiplexer`, `word_delimiter_graph` or `ngram` will throw an
-error.
-
-If you need to build analyzers that include both multi-token filters and synonym
-filters, consider using the <<analysis-multiplexer-tokenfilter,multiplexer>> filter,
-with the multi-token filters in one branch and the synonym filter in the other.
-
-WARNING: The synonym rules should not contain words that are removed by
-a filter that appears after in the chain (a `stop` filter for instance).
-Removing a term from a synonym rule breaks the matching at query time.
+Because entries in the synonym map cannot have stacked positions, some token filters may cause issues here.
+Token filters that produce multiple versions of a token may choose which version of the token to emit when parsing synonyms.
+For example, `asciifolding` will only produce the folded version of the token.
+Others, like `multiplexer`, `word_delimiter_graph` or `ngram`, will throw an error.
 
+If you need to build analyzers that include both multi-token filters and synonym filters, consider using the <<analysis-multiplexer-tokenfilter,multiplexer>> filter, with the multi-token filters in one branch and the synonym filter in the other.

+ 77 - 110
docs/reference/analysis/tokenfilters/synonym-tokenfilter.asciidoc

@@ -4,41 +4,61 @@
 <titleabbrev>Synonym</titleabbrev>
 ++++
 
-The `synonym` token filter allows to easily handle synonyms during the
-analysis process. Synonyms are configured using a configuration file.
-Here is an example:
+The `synonym` token filter lets you easily handle <<search-with-synonyms,synonyms>> during the
+analysis process.
 
-[source,console]
---------------------------------------------------
-PUT /test_index
-{
-  "settings": {
-    "index": {
-      "analysis": {
-        "analyzer": {
-          "synonym": {
-            "tokenizer": "whitespace",
-            "filter": [ "synonym" ]
-          }
-        },
-        "filter": {
-          "synonym": {
-            "type": "synonym",
-            "synonyms_path": "analysis/synonym.txt"
-          }
-        }
-      }
+[discrete]
+[[analysis-synonym-define-synonyms]]
+==== Define synonyms sets
+
+include::synonyms-format.asciidoc[]
+
+[discrete]
+[[analysis-synonym-configure-sets]]
+==== Configure synonyms sets
+
+Synonyms can be configured using the <<synonyms-store-synonyms-api,synonyms API>>, a <<synonyms-store-synonyms-file,synonyms file>>, or directly <<synonyms-store-synonyms-inline,inlined>> in the token filter configuration.
+See <<synonyms-store-synonyms,store your synonyms set>> for more details on each option.
+
+Use the `synonyms_set` configuration option to provide a synonyms set created with the synonyms management APIs:
+
+[source,JSON]
+----
+  "filter": {
+    "synonyms_filter": {
+      "type": "synonym",
+      "synonyms_set": "my-synonym-set",
+      "updateable": true
     }
   }
-}
---------------------------------------------------
+----
+
+Use `synonyms_path` to provide a synonyms file:
+
+[source,JSON]
+----
+  "filter": {
+    "synonyms_filter": {
+      "type": "synonym",
+      "synonyms_path": "analysis/synonym-set.txt"
+    }
+  }
+----
 
 The above configures a `synonym` filter, with a path of
-`analysis/synonym.txt` (relative to the `config` location). The
-`synonym` analyzer is then configured with the filter.
+`analysis/synonym-set.txt` (relative to the `config` location).
 
-This filter tokenizes synonyms with whatever tokenizer and token filters
-appear before it in the chain.
+Use `synonyms` to define inline synonyms:
+
+[source,JSON]
+----
+  "filter": {
+    "synonyms_filter": {
+      "type": "synonym",
+      "synonyms": ["pc => personal computer", "computer, pc, laptop"]
+    }
+  }
+----
 
 Additional settings are:
 
@@ -90,98 +110,45 @@ stop word.
 
 [discrete]
 [[synonym-tokenizer-ignore_case-deprecated]]
-==== `tokenizer` and `ignore_case` are deprecated
+===== `tokenizer` and `ignore_case` are deprecated
 
 The `tokenizer` parameter controls the tokenizers that will be used to
 tokenize the synonym. This parameter is for backwards compatibility with indices that were created before 6.0.
 The `ignore_case` parameter works with `tokenizer` parameter only.
 
-Two synonym formats are supported: Solr, WordNet.
-
 [discrete]
-==== Solr synonyms
-
-The following is a sample format of the file:
-
-[source,synonyms]
---------------------------------------------------
-include::{es-test-dir}/cluster/config/analysis/synonym.txt[]
---------------------------------------------------
-
-You can also define synonyms for the filter directly in the
-configuration file (note use of `synonyms` instead of `synonyms_path`):
-
-[source,console]
---------------------------------------------------
-PUT /test_index
-{
-  "settings": {
-    "index": {
-      "analysis": {
-        "filter": {
-          "synonym": {
-            "type": "synonym",
-            "synonyms": [
-              "i-pod, i pod => ipod",
-              "universe, cosmos"
-            ]
-          }
+[[analysis-synonym-analizers-configure]]
+==== Configure analyzers with synonym token filters
+
+To apply synonyms, you need to include a synonym token filter in an analyzer:
+
+[source,JSON]
+----
+      "analyzer": {
+        "my_analyzer": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": ["lowercase", "synonyms_filter"]
         }
       }
-    }
-  }
-}
---------------------------------------------------
-
-However, it is recommended to define large synonyms set in a file using
-`synonyms_path`, because specifying them inline increases cluster size unnecessarily.
+----
 
 [discrete]
-==== WordNet synonyms
+[[analysis-synonym-token-order]]
+===== Token filters ordering
 
-Synonyms based on https://wordnet.princeton.edu/[WordNet] format can be
-declared using `format`:
+Order is important for your token filters.
+Text will be processed first through filters preceding the synonym filter before being processed by the synonym filter.
 
-[source,console]
---------------------------------------------------
-PUT /test_index
-{
-  "settings": {
-    "index": {
-      "analysis": {
-        "filter": {
-          "synonym": {
-            "type": "synonym",
-            "format": "wordnet",
-            "synonyms": [
-              "s(100000001,1,'abstain',v,1,0).",
-              "s(100000001,2,'refrain',v,1,0).",
-              "s(100000001,3,'desist',v,1,0)."
-            ]
-          }
-        }
-      }
-    }
-  }
-}
---------------------------------------------------
+In the above example, text will be lowercased by the `lowercase` filter before being processed by the `synonyms_filter`.
+This means that all the synonyms defined there need to be in lowercase, or the synonyms filter won't find them.
 
-Using `synonyms_path` to define WordNet synonyms in a file is supported
-as well.
+The synonym rules should not contain words that are removed by a filter that appears later in the chain (like a `stop` filter).
+Removing a term from a synonym rule means there will be no matching for it at query time.
 
-[discrete]
-=== Parsing synonym files
-
-Elasticsearch will use the token filters preceding the synonym filter
-in a tokenizer chain to parse the entries in a synonym file. So, for example, if a
-synonym filter is placed after a stemmer, then the stemmer will also be applied
-to the synonym entries. Because entries in the synonym map cannot have stacked
-positions, some token filters may cause issues here. Token filters that produce
-multiple versions of a token may choose which version of the token to emit when
-parsing synonyms, e.g. `asciifolding` will only produce the folded version of the
-token. Others, e.g. `multiplexer`, `word_delimiter_graph` or `ngram` will throw an
-error.
-
-If you need to build analyzers that include both multi-token filters and synonym
-filters, consider using the <<analysis-multiplexer-tokenfilter,multiplexer>> filter,
-with the multi-token filters in one branch and the synonym filter in the other.
+Because entries in the synonym map cannot have stacked positions, some token filters may cause issues here.
+Token filters that produce multiple versions of a token may choose which version of the token to emit when parsing synonyms.
+For example, `asciifolding` will only produce the folded version of the token.
+Others, like `multiplexer`, `word_delimiter_graph` or `ngram`, will throw an error.
+
+If you need to build analyzers that include both multi-token filters and synonym filters, consider using the <<analysis-multiplexer-tokenfilter,multiplexer>> filter, with the multi-token filters in one branch and the synonym filter in the other.

+ 44 - 0
docs/reference/analysis/tokenfilters/synonyms-format.asciidoc

@@ -0,0 +1,44 @@
+Synonyms in a synonyms set are defined using *synonym rules*.
+Each synonym rule contains words that are synonyms.
+
+You can use two formats to define synonym rules: Solr and WordNet.
+
+[discrete]
+===== Solr format
+
+This format uses two different definitions:
+
+* Equivalent synonyms: Define groups of words that are equivalent. Words are separated by commas. Example:
++
+[source,synonyms]
+----
+ipod, i-pod, i pod
+computer, pc, laptop
+----
+* Explicit mappings: Matches a group of words to other words. Words on the left hand side of the rule definition are expanded into all the possibilities described on the right hand side. Example:
++
+[source,synonyms]
+----
+personal computer => pc
+sea biscuit, sea biscit => seabiscuit
+----
+
+[discrete]
+===== WordNet format
+
+https://wordnet.princeton.edu/[WordNet] defines synonyms sets spanning multiple lines. Each line contains the following information:
+
+* Synonyms set numeric identifier
+* Ordinal of the synonym in the synonyms set
+* Synonym word
+* Word type identifier: Noun (n), verb (v), adjective (a) or adverb (r).
+* Depth of the word in the synonym net
+
+The following example defines a synonym set for the words "come", "advance" and "approach":
+
+[source,synonyms]
+----
+s(100000002,1,'come',v,1,0).
+s(100000002,2,'advance',v,1,0).
+s(100000002,3,'approach',v,1,0).
+----
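+
+To have a token filter parse rules in this format, set its `format` option to `wordnet`.
+The following is a minimal sketch that reuses the rules above inline; the filter name `synonyms_filter` is only illustrative.
+A file referenced through `synonyms_path` can use the WordNet format as well:
+
+[source,JSON]
+----
+  "filter": {
+    "synonyms_filter": {
+      "type": "synonym",
+      "format": "wordnet",
+      "synonyms": [
+        "s(100000002,1,'come',v,1,0).",
+        "s(100000002,2,'advance',v,1,0).",
+        "s(100000002,3,'approach',v,1,0)."
+      ]
+    }
+  }
+----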

+ 139 - 0
docs/reference/search/search-your-data/search-with-synonyms.asciidoc

@@ -0,0 +1,139 @@
+[[search-with-synonyms]]
+== Search with synonyms
+
+Synonyms are words or phrases that have the same or similar meaning.
+They are an important aspect of search, as they can improve the search experience and increase the scope of search results.
+
+Synonyms allow you to:
+
+* *Improve search relevance* by finding relevant documents that use different terms to express the same concept.
+* Make *domain-specific vocabulary* more user-friendly, allowing users to use search terms they are more familiar with.
+* *Define common misspellings and typos* to transparently handle common mistakes.
+
+Synonyms are grouped together using *synonyms sets*.
+You can have as many synonyms sets as you need.
+
+In order to use synonyms sets in {es}, you need to:
+
+* <<synonyms-store-synonyms>>
+* <<synonyms-synonym-token-filters>>
+
+[discrete]
+[[synonyms-store-synonyms]]
+=== Store your synonyms set
+
+Your synonyms sets need to be stored in {es} so your analyzers can refer to them.
+There are three ways to store your synonyms sets:
+
+[discrete]
+[[synonyms-store-synonyms-api]]
+==== Synonyms API
+
+You can use the <<synonyms-apis,synonyms APIs>> to manage synonyms sets.
+This is the most flexible approach, as it allows you to dynamically define and modify synonyms sets.
+
+Changes in your synonyms sets will automatically reload the associated analyzers.
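+
+As a sketch, a synonyms set could be created or replaced with a request like the following.
+The set name `my-synonym-set` and the rule identifiers are hypothetical; check the <<synonyms-apis,synonyms APIs>> reference for the exact request format:
+
+[source,console]
+----
+PUT _synonyms/my-synonym-set
+{
+  "synonyms_set": [
+    {
+      "id": "rule-1",
+      "synonyms": "pc => personal computer"
+    },
+    {
+      "id": "rule-2",
+      "synonyms": "computer, pc, laptop"
+    }
+  ]
+}
+----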
+
+[discrete]
+[[synonyms-store-synonyms-file]]
+==== Synonyms File
+
+You can store your synonyms set in a file.
+
+A synonyms set file needs to be uploaded to all your cluster nodes, and be located in the configuration directory for your {es} distribution.
+If you're using {ess}, you can upload synonyms files using {cloud}/ec-custom-bundles.html[custom bundles].
+
+An example synonyms file:
+
+[source,synonyms]
+--------------------------------------------------
+include::{es-test-dir}/cluster/config/analysis/synonym.txt[]
+--------------------------------------------------
+
+To update an existing synonyms set, upload new files to your cluster.
+Synonyms set files must be kept in sync on every cluster node.
+
+When a synonyms set is updated, search analyzers that use it need to be refreshed using the <<indices-reload-analyzers,reload search analyzers API>>.
+
+This manual syncing and reloading makes this approach less flexible than using the <<synonyms-store-synonyms-api,synonyms API>>.
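+
+For example, once the updated file is in place on every node, a reload request for a hypothetical index named `my-index` might look like this.
+The reload only picks up changes for token filters configured with `"updateable": true`:
+
+[source,console]
+----
+POST /my-index/_reload_search_analyzers
+----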
+
+[discrete]
+[[synonyms-store-synonyms-inline]]
+==== Inline
+
+You can test your synonyms by adding them directly inline in your token filter definition.
+
+[WARNING]
+======
+Inline synonyms are not recommended for production usage.
+A large number of inline synonyms increases cluster size unnecessarily and can lead to performance issues.
+======
+
+[discrete]
+[[synonyms-synonym-token-filters]]
+=== Configure synonyms token filters and analyzers
+
+Once your synonyms sets are created, you can start configuring your token filters and analyzers to use them.
+
+{es} uses synonyms as part of the <<analysis-overview,analysis process>>.
+You can use two types of <<analysis-tokenfilters,token filter>> to include synonyms:
+
+* <<analysis-synonym-graph-tokenfilter>>: Recommended, as it correctly handles multi-word synonyms (for example, "hurriedly" and "in a hurry").
+* <<analysis-synonym-tokenfilter>>: Not recommended if you need to use multi-word synonyms.
+
+Check the documentation for each synonym token filter for configuration details and instructions on adding it to an analyzer.
+
+[discrete]
+[[synonyms-test-analyzer]]
+==== Test your analyzer
+
+You can test an analyzer configuration without modifying your index settings.
+Use the <<indices-analyze,analyze API>> to test your analyzer chain:
+
+[source,console]
+----
+GET /_analyze
+{
+  "tokenizer": "standard",
+  "filter" : [
+    "lowercase",
+    {
+      "type": "synonym_graph",
+      "synonyms": ["pc => personal computer", "computer, pc, laptop"]
+    }
+  ],
+  "text" : "Check how PC synonyms work"
+}
+----
+
+[discrete]
+[[synonyms-apply-synonyms]]
+==== Apply synonyms at index or search time
+
+Analyzers can be applied at <<analysis-index-search-time,index time or search time>>.
+
+You need to decide when to apply your synonyms:
+
+* Index time: Synonyms are applied when the documents are indexed into {es}. This is a less flexible alternative, as changes to your synonyms require <<docs-reindex,reindexing>>.
+* Search time: Synonyms are applied when a search is executed. This is a more flexible approach, which doesn't require reindexing. If token filters are configured with `"updateable": true`, search analyzers can be <<indices-reload-analyzers,reloaded>> when you make changes to your synonyms.
+
+Synonyms sets created using the <<synonyms-store-synonyms-api,synonyms API>> can only be used at search time.
+
+You can specify the analyzer that contains your synonyms set as a <<specify-search-analyzer,search time analyzer>> or as an <<specify-index-time-analyzer,index time analyzer>>.
+
+The following example adds `my_analyzer` as a search analyzer to the `title` field in an index mapping:
+
+[source,JSON]
+----
+  "mappings": {
+    "properties": {
+      "title": {
+        "type": "text",
+        "search_analyzer": "my_analyzer"
+      }
+    }
+  }
+----
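+
+Putting the pieces together, a complete index definition for search-time synonyms might look like the following sketch.
+It assumes a synonyms set named `my-synonym-set` was already created with the synonyms API; the index name `my-index` and the explicit `standard` index analyzer are illustrative choices.
+Note that `updateable` is set on the token filter, not on the field:
+
+[source,console]
+----
+PUT /my-index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "synonyms_filter": {
+          "type": "synonym_graph",
+          "synonyms_set": "my-synonym-set",
+          "updateable": true
+        }
+      },
+      "analyzer": {
+        "my_analyzer": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": ["lowercase", "synonyms_filter"]
+        }
+      }
+    }
+  },
+  "mappings": {
+    "properties": {
+      "title": {
+        "type": "text",
+        "analyzer": "standard",
+        "search_analyzer": "my_analyzer"
+      }
+    }
+  }
+}
+----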
+
+

+ 1 - 0
docs/reference/search/search-your-data/search-your-data.asciidoc

@@ -529,6 +529,7 @@ include::search-across-clusters.asciidoc[]
 include::search-multiple-indices.asciidoc[]
 include::search-shard-routing.asciidoc[]
 include::search-template.asciidoc[]
+include::search-with-synonyms.asciidoc[]
 include::sort-search-results.asciidoc[]
 include::knn-search.asciidoc[]
 include::semantic-search.asciidoc[]

+ 1 - 1
docs/reference/synonyms/apis/put-synonym-rule.asciidoc

@@ -36,7 +36,7 @@ Synonym rule identifier to create or update.
 `synonyms`::
 (Required, string)
 The synonym rule definition.
-This needs to be in <<_solr_synonyms>> format. Some examples are:
+This needs to be in <<analysis-synonym-graph-define-synonyms,Solr format>>. Some examples are:
 
 * "i-pod, i pod => ipod",
 * "universe, cosmos"

+ 1 - 2
docs/reference/synonyms/apis/put-synonyms-set.asciidoc

@@ -45,8 +45,7 @@ In case a synonym rule id is not specified, an identifier will be created automa
 
 `synonyms`::
 (Required, string)
-The synonym rule. This needs to be in <<_solr_synonyms>> format.
-Some examples are:
+The synonym rule. This needs to be in <<analysis-synonym-graph-define-synonyms,Solr format>>. Some examples are:
 * "i-pod, i pod => ipod",
 * "universe, cosmos"
 

+ 7 - 7
docs/src/test/cluster/config/analysis/synonym.txt

@@ -1,24 +1,24 @@
 # Blank lines and lines starting with pound are comments.
 
-# Explicit mappings match any token sequence on the LHS of "=>"
-# and replace with all alternatives on the RHS.  These types of mappings
-# ignore the expand parameter in the schema.
+# Explicit mappings match any token sequence on the left hand side of "=>"
+# and replace with all alternatives on the right hand side.
+# These types of mappings ignore the expand parameter in the schema.
 # Examples:
 i-pod, i pod => ipod
 sea biscuit, sea biscit => seabiscuit
 
 # Equivalent synonyms may be separated with commas and give
 # no explicit mapping.  In this case the mapping behavior will
-# be taken from the expand parameter in the schema.  This allows
-# the same synonym file to be used in different synonym handling strategies.
+# be taken from the expand parameter in the token filter configuration.
+# This allows the same synonym file to be used in different synonym handling strategies.
 # Examples:
 ipod, i-pod, i pod
 foozball , foosball
 universe , cosmos
 lol, laughing out loud
 
-# If expand==true, "ipod, i-pod, i pod" is equivalent
-# to the explicit mapping:
+# If expand==true in the synonym token filter configuration,
+# "ipod, i-pod, i pod" is equivalent to the explicit mapping:
 ipod, i-pod, i pod => ipod, i-pod, i pod
 # If expand==false, "ipod, i-pod, i pod" is equivalent
 # to the explicit mapping: