
Update the how-to section of the docs for 7.0: (#37717)

 - new `rank_feature`/`script_score` queries
 - new `index_phrases`/`index_prefixes` options
 - disabling `_field_names` doesn't help anymore
 - adaptive replica selection is on by default
Adrien Grand, 6 years ago
Commit 8905d77ca4

+ 0 - 7
docs/reference/how-to/indexing-speed.asciidoc

@@ -114,13 +114,6 @@ The default is `10%` which is often plenty: for example, if you give the JVM
 10GB of memory, it will give 1GB to the index buffer, which is enough to host
 two shards that are heavily indexing.
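+
+As an illustrative sketch, this buffer can be raised through the
+`indices.memory.index_buffer_size` node setting in `elasticsearch.yml`
+(the `20%` value below is just an example, not a recommendation):
+
+[source,yaml]
+--------------------------------------------------
+# Give 20% of the heap to the indexing buffer instead of the default 10%
+indices.memory.index_buffer_size: 20%
+--------------------------------------------------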
 
-[float]
-=== Disable `_field_names`
-
-The <<mapping-field-names-field,`_field_names` field>> introduces some
-index-time overhead, so you might want to disable it if you never need to
-run `exists` queries.
-
 [float]
 === Additional optimizations
 

+ 3 - 3
docs/reference/how-to/recipes.asciidoc

@@ -3,9 +3,9 @@
 
 This section includes a few recipes to help with common problems:
 
-* <<mixing-exact-search-with-stemming>>
-* <<consistent-scoring>>
+* <<mixing-exact-search-with-stemming,Mixing exact search with stemming>>
+* <<consistent-scoring,Getting consistent scores>>
+* <<static-scoring-signals,Incorporating static relevance signals into the score>>
 
 include::recipes/stemming.asciidoc[]
 include::recipes/scoring.asciidoc[]
-

+ 124 - 2
docs/reference/how-to/recipes/scoring.asciidoc

@@ -60,8 +60,8 @@ request do not have similar index statistics and relevancy could be bad.
 
 If you have a small dataset, the easiest way to work around this issue is to
 index everything into an index that has a single shard
-(`index.number_of_shards: 1`). Then index statistics will be the same for all
-documents and scores will be consistent.
+(`index.number_of_shards: 1`), which is the default. Then index statistics
+will be the same for all documents and scores will be consistent.
 
 Otherwise the recommended way to work around this issue is to use the
 <<dfs-query-then-fetch,`dfs_query_then_fetch`>> search type. This will make
@@ -78,3 +78,125 @@ queries, beware that gathering statistics alone might not be cheap since all
 terms have to be looked up in the terms dictionaries in order to look up
 statistics.
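+
+For instance, a `dfs_query_then_fetch` request might look like this (the index
+name and query are illustrative):
+
+[source,js]
+--------------------------------------------------
+GET index/_search?search_type=dfs_query_then_fetch
+{
+    "query": {
+        "match": { "body": "elasticsearch" }
+    }
+}
+--------------------------------------------------
+// CONSOLE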
 
+[[static-scoring-signals]]
+=== Incorporating static relevance signals into the score
+
+Many domains have static signals that are known to be correlated with relevance.
+For instance https://en.wikipedia.org/wiki/PageRank[PageRank] and url length are
+two commonly used features for web search in order to tune the score of web
+pages independently of the query.
+
+There are two main queries that allow combining static score contributions with
+textual relevance, e.g. as computed with BM25:
+
+* <<query-dsl-script-score-query,`script_score` query>>
+* <<query-dsl-rank-feature-query,`rank_feature` query>>
+
+For instance imagine that you have a `pagerank` field that you wish to
+combine with the BM25 score so that the final score is equal to
+`score = bm25_score + pagerank / (10 + pagerank)`.
+
+With the <<query-dsl-script-score-query,`script_score` query>> the query would
+look like this:
+
+//////////////////////////
+
+[source,js]
+--------------------------------------------------
+PUT index
+{
+    "mappings": {
+        "properties": {
+            "body": {
+                "type": "text"
+            },
+            "pagerank": {
+                "type": "long"
+            }
+        }
+    }
+}
+--------------------------------------------------
+// CONSOLE
+// TEST
+
+//////////////////////////
+
+[source,js]
+--------------------------------------------------
+GET index/_search
+{
+    "query" : {
+        "script_score" : {
+            "query" : {
+                "match": { "body": "elasticsearch" }
+            },
+            "script" : {
+                "source" : "_score * saturation(doc['pagerank'].value, 10)" <1>
+            }
+        }
+    }
+}
+--------------------------------------------------
+// CONSOLE
+//TEST[continued]
+<1> `pagerank` must be mapped as a <<number,numeric>> field
+
+while with the <<query-dsl-rank-feature-query,`rank_feature` query>> it would
+look as follows:
+
+//////////////////////////
+
+[source,js]
+--------------------------------------------------
+PUT index
+{
+    "mappings": {
+        "properties": {
+            "body": {
+                "type": "text"
+            },
+            "pagerank": {
+                "type": "rank_feature"
+            }
+        }
+    }
+}
+--------------------------------------------------
+// CONSOLE
+// TEST
+
+//////////////////////////
+
+[source,js]
+--------------------------------------------------
+GET _search
+{
+    "query" : {
+        "bool" : {
+            "must": {
+                "match": { "body": "elasticsearch" }
+            },
+            "should": {
+                "rank_feature": {
+                    "field": "pagerank", <1>
+                    "saturation": {
+                        "pivot": 10
+                    }
+                }
+            }
+        }
+    }
+}
+--------------------------------------------------
+// CONSOLE
+<1> `pagerank` must be mapped as a <<rank-feature,`rank_feature`>> field
+
+While both options would return similar scores, there are trade-offs:
+the <<query-dsl-script-score-query,`script_score` query>> provides a lot of
+flexibility, enabling you to combine the text relevance score with static
+signals as you prefer. On the other hand, the
+<<query-dsl-rank-feature-query,`rank_feature` query>> only exposes a couple of
+ways to incorporate static signals into the score. However, it relies on the
+<<rank-feature,`rank_feature`>> and <<rank-features,`rank_features`>> fields,
+which index values in a special way that allows the
+<<query-dsl-rank-feature-query,`rank_feature` query>> to skip over
+non-competitive documents and get the top matches of a query faster.
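+
+As a usage sketch (the document and `pagerank` value below are illustrative),
+values for a <<rank-feature,`rank_feature`>> field are indexed like any other
+field:
+
+[source,js]
+--------------------------------------------------
+PUT index/_doc/1
+{
+    "body": "Elasticsearch is a distributed search engine",
+    "pagerank": 8
+}
+--------------------------------------------------
+// CONSOLE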

+ 14 - 9
docs/reference/how-to/search-speed.asciidoc

@@ -395,15 +395,6 @@ be able to cope with `max_failures` node failures at once at most, then the
 right number of replicas for you is
 `max(max_failures, ceil(num_nodes / num_primaries) - 1)`.
 
-[float]
-=== Turn on adaptive replica selection
-
-When multiple copies of data are present, elasticsearch can use a set of
-criteria called <<search-adaptive-replica,adaptive replica selection>> to select
-the best copy of the data based on response time, service time, and queue size
-of the node containing each copy of the shard. This can improve query throughput
-and reduce latency for search-heavy applications.
-
 === Tune your queries with the Profile API
 
 You can also analyse how expensive each component of your queries and 
@@ -419,3 +410,17 @@ Some caveats to the Profile API are that:
  - the Profile API as a debugging tool adds significant overhead to search execution and can also have a very verbose output
  - given the added overhead, the resulting took times are not reliable indicators of actual took time, but can be used comparatively between clauses for relative timing differences
  - the Profile API is best for exploring possible reasons behind the most costly clauses of a query but isn't intended for accurately measuring absolute timings of each clause 
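+
+As an illustrative sketch, profiling is enabled by setting `profile` to `true`
+on a search request (the index name and query here are made up):
+
+[source,js]
+--------------------------------------------------
+GET index/_search
+{
+    "profile": true,
+    "query": {
+        "match": { "body": "elasticsearch" }
+    }
+}
+--------------------------------------------------
+// CONSOLE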
+
+[float]
+=== Faster phrase queries with `index_phrases`
+
+The <<text,`text`>> field has an <<index-phrases,`index_phrases`>> option that
+indexes 2-shingles and is automatically leveraged by query parsers to run phrase
+queries that don't have a slop. If your use-case involves running lots of phrase
+queries, this can speed up queries significantly.
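+
+For example (the index name and field are illustrative), `index_phrases` is
+enabled in the field mapping:
+
+[source,js]
+--------------------------------------------------
+PUT index
+{
+    "mappings": {
+        "properties": {
+            "body": {
+                "type": "text",
+                "index_phrases": true
+            }
+        }
+    }
+}
+--------------------------------------------------
+// CONSOLE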
+
+[float]
+=== Faster prefix queries with `index_prefixes`
+
+The <<text,`text`>> field has an <<index-prefixes,`index_prefixes`>> option that
+indexes prefixes of all terms and is automatically leveraged by query parsers to
+run prefix queries. If your use-case involves running lots of prefix queries,
+this can speed up queries significantly.
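+
+For example (the index name and field are illustrative), `index_prefixes` can
+be enabled with its default settings by passing an empty object:
+
+[source,js]
+--------------------------------------------------
+PUT index
+{
+    "mappings": {
+        "properties": {
+            "body": {
+                "type": "text",
+                "index_prefixes": { }
+            }
+        }
+    }
+}
+--------------------------------------------------
+// CONSOLE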