
Term Stats documentation (#115933) (#116167)

* Term Stats documentation

* Update docs/reference/reranking/learning-to-rank-model-training.asciidoc

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>

* Fix query example.

---------

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
(cherry picked from commit 0416812456af0a763a5f43f9ab6813221ea6e4d8)

Co-authored-by: Aurélien FOUCRET <aurelien.foucret@gmail.com>
Liam Thompson 11 months ago
commit 7b39d3db52

+ 10 - 3
docs/reference/query-dsl/script-score-query.asciidoc

@@ -62,10 +62,17 @@ multiplied by `boost` to produce final documents' scores. Defaults to `1.0`.
 ===== Use relevance scores in a script
 
 Within a script, you can
-{ref}/modules-scripting-fields.html#scripting-score[access] 
+{ref}/modules-scripting-fields.html#scripting-score[access]
 the `_score` variable which represents the current relevance score of a
 document.
 
+[[script-score-access-term-statistics]]
+===== Use term statistics in a script
+
+Within a script, you can
+{ref}/modules-scripting-fields.html#scripting-term-statistics[access]
+the `_termStats` variable, which provides statistical information about the terms used in the child query of the `script_score` query.
+
 [[script-score-predefined-functions]]
 ===== Predefined functions
 You can use any of the available {painless}/painless-contexts.html[painless
@@ -147,7 +154,7 @@ updated since update operations also update the value of the `_seq_no` field.
 
 
 [[decay-functions-numeric-fields]]
 ====== Decay functions for numeric fields
-You can read more about decay functions 
+You can read more about decay functions
 {ref}/query-dsl-function-score-query.html#function-decay[here].
 
 * `double decayNumericLinear(double origin, double scale, double offset, double decay, double docValue)`
@@ -233,7 +240,7 @@ The `script_score` query calculates the score for
 every matching document, or hit. There are faster alternative query types that
 can efficiently skip non-competitive hits:
 
-* If you want to boost documents on some static fields, use the 
+* If you want to boost documents on some static fields, use the
  <<query-dsl-rank-feature-query, `rank_feature`>> query.
  * If you want to boost documents closer to a date or geographic point, use the
  <<query-dsl-distance-feature-query, `distance_feature`>> query.

+ 25 - 12
docs/reference/reranking/learning-to-rank-model-training.asciidoc

@@ -38,11 +38,21 @@ Feature extractors are defined using templated queries. https://eland.readthedoc
 from eland.ml.ltr import QueryFeatureExtractor
 
 feature_extractors=[
-    # We want to use the score of the match query for the title field as a feature:
+    # We want to use the BM25 score of the match query for the title field as a feature:
     QueryFeatureExtractor(
         feature_name="title_bm25",
         query={"match": {"title": "{{query}}"}}
     ),
+    # We want to use the number of matched terms in the title field as a feature:
+    QueryFeatureExtractor(
+        feature_name="title_matched_term_count",
+        query={
+            "script_score": {
+                "query": {"match": {"title": "{{query}}"}},
+                "script": {"source": "return _termStats.matchedTermsCount();"},
+            }
+        },
+    ),
     # We can use a script_score query to get the value
     # of the field rating directly as a feature:
     QueryFeatureExtractor(
@@ -54,19 +64,13 @@ feature_extractors=[
             }
         },
     ),
-    # We can execute a script on the value of the query
-    # and use the return value as a feature:
-    QueryFeatureExtractor(
-        feature_name="query_length",
+    # We extract the number of unique terms in the query as a feature:
+    QueryFeatureExtractor(
+        feature_name="query_term_count",
         query={
             "script_score": {
-                "query": {"match_all": {}},
-                "script": {
-                    "source": "return params['query'].splitOnToken(' ').length;",
-                    "params": {
-                        "query": "{{query}}",
-                    }
-                },
+                "query": {"match": {"title": "{{query}}"}},
+                "script": {"source": "return _termStats.uniqueTermsCount();"},
             }
         },
     ),
@@ -74,6 +78,15 @@ feature_extractors=[
 ----
 // NOTCONSOLE
 
+[NOTE]
+.Term statistics as features
+===================================================
+
+It is very common for an LTR model to leverage raw term statistics as features.
+To extract this information, you can use the {ref}/modules-scripting-fields.html#scripting-term-statistics[term statistics feature] provided as part of the <<query-dsl-script-score-query,`script_score`>> query.
+
+===================================================
+
 Once the feature extractors have been defined, they are wrapped in an `eland.ml.ltr.LTRModelConfig` object for use in later training steps:
 
 [source,python]
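
The two `_termStats` extractors added in this file share one shape: a `script_score` query wrapping a templated `match` query, with a one-line Painless script. The following is a minimal Python sketch of that pattern. The helper name `term_stats_feature_query` is hypothetical and not part of eland; the returned dictionaries correspond only to the `query` argument passed to `QueryFeatureExtractor` in the diff above.

```python
def term_stats_feature_query(field, script_source):
    """Build a script_score query body (hypothetical helper, not an eland API).

    The child match query tells _termStats which field and terms to inspect;
    "{{query}}" is the template placeholder used in the diff above.
    """
    return {
        "script_score": {
            "query": {"match": {field: "{{query}}"}},
            "script": {"source": script_source},
        }
    }


# Query bodies equivalent to the two extractors added in this commit.
matched_term_count_query = term_stats_feature_query(
    "title", "return _termStats.matchedTermsCount();"
)
unique_term_count_query = term_stats_feature_query(
    "title", "return _termStats.uniqueTermsCount();"
)
```

Each dictionary could then be passed as `query=` to a `QueryFeatureExtractor`, alongside a `feature_name` such as `title_matched_term_count`.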

+ 0 - 7
docs/reference/reranking/learning-to-rank-search-usage.asciidoc

@@ -61,10 +61,3 @@ When exposing pagination to users, `window_size` should remain constant as each
 ====== Negative scores
 
 Depending on how your model is trained, it’s possible that the model will return negative scores for documents. While negative scores are not allowed from first-stage retrieval and ranking, it is possible to use them in the LTR rescorer.
-
-[discrete]
-[[learning-to-rank-rescorer-limitations-term-statistics]]
-====== Term statistics as features
-
-We do not currently support term statistics as features, however future releases will introduce this capability.
-

+ 73 - 0
docs/reference/scripting/fields.asciidoc

@@ -80,6 +80,79 @@ GET my-index-000001/_search
 }
 -------------------------------------
 
+[discrete]
+[[scripting-term-statistics]]
+=== Accessing term statistics of a document within a script
+
+Scripts used in a <<query-dsl-script-score-query,`script_score`>> query have access to the `_termStats` variable, which provides statistical information about the terms in the child query.
+
+In the following example, `_termStats` is used within a <<query-dsl-script-score-query,`script_score`>> query to retrieve the average term frequency for the terms `quick`, `brown`, and `fox` in the `text` field:
+
+[source,console]
+-------------------------------------
+PUT my-index-000001/_doc/1?refresh
+{
+  "text": "quick brown fox"
+}
+
+PUT my-index-000001/_doc/2?refresh
+{
+  "text": "quick fox"
+}
+
+GET my-index-000001/_search
+{
+  "query": {
+    "script_score": {
+      "query": { <1>
+        "match": {
+          "text": "quick brown fox"
+        }
+      },
+      "script": {
+        "source": "_termStats.termFreq().getAverage()" <2>
+      }
+    }
+  }
+}
+-------------------------------------
+
+<1> Child query used to infer the field and the terms considered in term statistics.
+
+<2> The script calculates the average term frequency for the terms in the query using `_termStats`.
+
+`_termStats` provides access to the following functions for working with term statistics:
+
+- `uniqueTermsCount`: Returns the total number of unique terms in the query. This value is the same across all documents.
+- `matchedTermsCount`: Returns the count of query terms that matched within the current document.
+- `docFreq`: Provides document frequency statistics for the terms in the query, indicating how many documents contain each term. This value is consistent across all documents.
+- `totalTermFreq`: Provides the total frequency of terms across all documents, representing how often each term appears in the entire corpus. This value is consistent across all documents.
+- `termFreq`: Returns the frequency of query terms within the current document, showing how often each term appears in that document.
+
+[NOTE]
+.Functions returning aggregated statistics
+===================================================
+
+The `docFreq`, `termFreq` and `totalTermFreq` functions return objects that represent statistics across all terms of the child query.
+
+These statistics objects support the following methods:
+
+- `getAverage()`: Returns the average value of the metric.
+- `getMin()`: Returns the minimum value of the metric.
+- `getMax()`: Returns the maximum value of the metric.
+- `getSum()`: Returns the sum of the metric values.
+- `getCount()`: Returns the count of terms included in the metric calculation.
+
+===================================================
+
+
+[NOTE]
+.Painless language required
+===================================================
+
+The `_termStats` variable is only available when using the <<modules-scripting-painless, Painless>> scripting language.
+
+===================================================
 
 
 [discrete]
 [[modules-scripting-doc-vals]]
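
To make the aggregated statistics described in `docs/reference/scripting/fields.asciidoc` concrete, the following Python sketch emulates them on the two example documents from the diff. It is an illustration only, not Elasticsearch's implementation: it assumes whitespace tokenization in place of the real analyzer, and it assumes that a query term absent from a document contributes a frequency of 0. All function names are hypothetical stand-ins for the `_termStats` functions.

```python
# Illustrative emulation only: whitespace tokenization stands in for the
# real analyzer, and a query term missing from a document is assumed to
# contribute a frequency of 0.
DOCS = {1: "quick brown fox", 2: "quick fox"}
QUERY_TERMS = "quick brown fox".split()


def summarize(values):
    """Mimic the getAverage/getMin/getMax/getSum/getCount methods."""
    return {
        "average": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "count": len(values),
    }


def term_freq(doc_text, terms):
    """Per-term frequency within one document (stand-in for termFreq)."""
    tokens = doc_text.split()
    return [tokens.count(term) for term in terms]


def doc_freq(corpus, terms):
    """Number of documents containing each term (stand-in for docFreq)."""
    return [sum(term in text.split() for text in corpus.values()) for term in terms]


def matched_terms_count(doc_text, terms):
    """Number of query terms present in the document."""
    return sum(1 for freq in term_freq(doc_text, terms) if freq > 0)


for doc_id, text in DOCS.items():
    stats = summarize(term_freq(text, QUERY_TERMS))
    print(doc_id, stats["average"], matched_terms_count(text, QUERY_TERMS))
```

In this emulation, document 1 matches all three query terms, while document 2 matches two of them and therefore has a lower average term frequency. The per-term document frequencies do not depend on which document is being scored, which mirrors the statement above that `docFreq` is consistent across all documents.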