advanced-scripting.asciidoc 6.0 KB

[[modules-advanced-scripting]]
=== Advanced text scoring in scripts

experimental[The functionality described on this page is considered experimental and may be changed or removed in a future release]

Text features, such as the term frequency or document frequency of a specific
term, can be accessed in scripts with the `_index` variable. This is useful
if, for example, you want to implement your own scoring model with a script
inside a <<query-dsl-function-score-query,function score query>>.
Statistics over the document collection are computed *per shard*, not per
index.

Note that the `_index` variable is not supported in the Painless language, but it is defined when using the Groovy language.

[float]
=== Nomenclature:

[horizontal]
`df`::
    document frequency. The number of documents a term appears in. Computed
    per field.

`tf`::
    term frequency. The number of times a term appears in a field in one specific
    document.

`ttf`::
    total term frequency. The number of times this term appears in all
    documents, that is, the sum of `tf` over all documents. Computed per
    field.

`df` and `ttf` are computed per shard and therefore these numbers can vary
depending on the shard the current document resides in.

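
To make the three definitions concrete, here is a small standalone Groovy sketch that computes them by hand for a toy "shard". The three tokenized documents and the term `quick` are made up for illustration; nothing here uses the `_index` variable.

[source,groovy]
---------------------------------------------------------
// A toy shard: three documents, each a list of tokens from the same field
// (hypothetical data, for illustration only)
docs = [
    ['quick', 'brown', 'fox'],
    ['quick', 'quick', 'dog'],
    ['lazy', 'dog']
]
term = 'quick'

// tf: occurrences of the term in each individual document
tfPerDoc = docs.collect { tokens -> tokens.count(term) }

// df: number of documents that contain the term at least once
df = tfPerDoc.count { it > 0 }

// ttf: sum of tf over all documents in the shard
ttf = tfPerDoc.sum()

assert tfPerDoc == [1, 2, 0]
assert df == 2
assert ttf == 3
---------------------------------------------------------

If the same documents were spread over two shards, each shard would report its own `df` and `ttf` for `quick`, which is why the values seen by a script depend on where the current document lives.
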
[float]
=== Shard statistics:

`_index.numDocs()`::

    Number of documents in shard.

`_index.maxDoc()`::

    Maximal document number in shard.

`_index.numDeletedDocs()`::

    Number of deleted documents in shard.

[float]
=== Field statistics:

Field statistics can be accessed with a subscript operator like this:
`_index['FIELD']`.

`_index['FIELD'].docCount()`::

    Number of documents containing the field `FIELD`. Does not take deleted documents into account.

`_index['FIELD'].sumttf()`::

    Sum of `ttf` over all terms that appear in field `FIELD` in all documents.

`_index['FIELD'].sumdf()`::

    The sum of `df` values over all terms that appear in field `FIELD` in all
    documents.

Field statistics are computed per shard and therefore these numbers can vary
depending on the shard the current document resides in.
The number of terms in a field cannot be accessed using the `_index` variable. See <<token-count>> for how to do that.

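
As one possible use of the field statistics, the sketch below estimates the average number of tokens per document for a field on the current shard, a quantity that length-normalizing scoring models typically need. The field name `my_field` is hypothetical, and treating non-positive values as "unavailable" is a defensive assumption rather than documented behaviour.

[source,groovy]
---------------------------------------------------------
// Field-level statistics for the (hypothetical) field 'my_field'
fieldStats = _index['my_field'];
docCount = fieldStats.docCount();
sumTtf = fieldStats.sumttf();

// Assumption: guard against an empty shard or a statistic that is not available
if (docCount <= 0 || sumTtf <= 0) {
    return 0.0d;
}

// Average number of term occurrences per document that has the field
return sumTtf / (double) docCount;
---------------------------------------------------------
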
[float]
=== Term statistics:

Term statistics for a field can be accessed with a subscript operator like
this: `_index['FIELD']['TERM']`. This will never return `null`, even if the term or field does not exist.
If you do not need the term frequency, call `_index['FIELD'].get('TERM', 0)`
to avoid unnecessary initialization of the frequencies. The flag only has an
effect if you set the <<index-options,`index_options`>> to `docs`.

`_index['FIELD']['TERM'].df()`::

    `df` of term `TERM` in field `FIELD`. Will be returned, even if the term
    is not present in the current document.

`_index['FIELD']['TERM'].ttf()`::

    The sum of term frequencies of term `TERM` in field `FIELD` over all
    documents. Will be returned, even if the term is not present in the
    current document.

`_index['FIELD']['TERM'].tf()`::

    `tf` of term `TERM` in field `FIELD`. Will be 0 if the term is not present
    in the current document.

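
As a rough illustration of how term and shard statistics can be combined into a custom score, the following sketch computes a simple TF-IDF-style value for one term. The field name `my_field`, the term `foo`, and the particular smoothing and square-root damping are assumptions chosen for the example, not a formula prescribed by Elasticsearch.

[source,groovy]
---------------------------------------------------------
// Statistics for the (hypothetical) term 'foo' in the field 'my_field'
termInfo = _index['my_field']['foo'];
tf = termInfo.tf();        // occurrences in the current document, 0 if absent
df = termInfo.df();        // documents on this shard that contain the term
n = _index.numDocs();      // documents on this shard

// Smoothed inverse document frequency; the +1 terms avoid division by zero
idf = Math.log((n + 1.0d) / (df + 1.0d)) + 1.0d;

// Dampen the raw term frequency so repeated terms do not dominate the score
return Math.sqrt((double) tf) * idf;
---------------------------------------------------------

Because `df` and `numDocs()` are per-shard values, the same document could receive a slightly different score if it were stored on another shard.
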
[float]
=== Term positions, offsets and payloads:

If you need information on the positions of terms in a field, call
`_index['FIELD'].get('TERM', flag)` where flag can be

[horizontal]
`_POSITIONS`:: if you need the positions of the term
`_OFFSETS`:: if you need the offsets of the term
`_PAYLOADS`:: if you need the payloads of the term
`_CACHE`:: if you need to iterate over all positions several times

The iterator uses the underlying Lucene classes to iterate over positions. For efficiency reasons, you can only iterate over positions once. If you need to iterate over the positions several times, set the `_CACHE` flag.

You can combine the flags with a `|` if you need more than one kind of
information. For example, the following will return an object holding the
positions and payloads, as well as all statistics:

`_index['FIELD'].get('TERM', _POSITIONS | _PAYLOADS)`

Positions can be accessed with an iterator that returns an object
(`POS_OBJECT`) holding position, offsets and payload for each term position.

`POS_OBJECT.position`::

    The position of the term.

`POS_OBJECT.startOffset`::

    The start offset of the term.

`POS_OBJECT.endOffset`::

    The end offset of the term.

`POS_OBJECT.payload`::

    The payload of the term.

`POS_OBJECT.payloadAsInt(missingValue)`::

    The payload of the term converted to integer. If the current position has
    no payload, the `missingValue` will be returned. Call this only if you
    know that your payloads are integers.

`POS_OBJECT.payloadAsFloat(missingValue)`::

    The payload of the term converted to float. If the current position has no
    payload, the `missingValue` will be returned. Call this only if you know
    that your payloads are floats.

`POS_OBJECT.payloadAsString()`::

    The payload of the term converted to string. If the current position has
    no payload, `null` will be returned. Call this only if you know that your
    payloads are strings.

Example: sums up all payloads for the term `foo`.

[source,groovy]
---------------------------------------------------------
// Request the payloads of the term 'foo' in the field 'my_field'
termInfo = _index['my_field'].get('foo', _PAYLOADS);
score = 0;
// Each iteration yields one position of the term in the current document
for (pos in termInfo) {
    // Positions without a payload contribute the missing value 0
    score = score + pos.payloadAsInt(0);
}
return score;
---------------------------------------------------------

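
Positions can be used in the same way. The sketch below, which again assumes a hypothetical field `my_field` and term `foo`, scores a document higher the earlier the term first appears. The `1 / (1 + position)` decay is an arbitrary choice for illustration.

[source,groovy]
---------------------------------------------------------
// Request only the positions of the term 'foo' in the field 'my_field'
termInfo = _index['my_field'].get('foo', _POSITIONS);

// Track the smallest position seen; -1 means the term was not found
firstPos = -1;
for (pos in termInfo) {
    if (firstPos < 0 || pos.position < firstPos) {
        firstPos = pos.position;
    }
}

// No occurrence in this document: contribute nothing to the score
if (firstPos < 0) {
    return 0.0d;
}

// Earlier occurrences yield a larger score (arbitrary decay function)
return 1.0d / (1.0d + firstPos);
---------------------------------------------------------

The loop runs over the positions only once, so the `_CACHE` flag is not needed here.
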
[float]
=== Term vectors:

The `_index` variable can only be used to gather statistics for single terms. If you want to use information on all terms in a field, you must store the term vectors (see <<term-vector>>). To access them, call
`_index.termVectors()` to get a
https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/index/Fields.html[Fields]
instance. This object can then be used as described in the https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/index/Fields.html[Lucene documentation] to iterate over the fields and then, for each field, over each term in the field.
The method will return `null` if the term vectors were not stored.

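
A minimal sketch of such an iteration is shown below. It counts the distinct terms of the current document in a hypothetical field `my_field`. It assumes the Lucene 4.x API from the linked Javadoc, where `Terms.iterator` takes a reusable `TermsEnum` argument; later Lucene releases may expect a no-argument `iterator()` call instead. Whether these Lucene classes are reachable from a script may also depend on how scripting is configured.

[source,groovy]
---------------------------------------------------------
// Term vectors of the current document; null if none were stored
fields = _index.termVectors();
if (fields == null) {
    return 0;
}

// Terms of the (hypothetical) field 'my_field'; null if the field has none
terms = fields.terms('my_field');
if (terms == null) {
    return 0;
}

// Assumption: Lucene 4.x signature, passing null means "do not reuse an enum"
termsEnum = terms.iterator(null);

// next() returns null once all terms of the field have been visited
distinctTerms = 0;
while (termsEnum.next() != null) {
    distinctTerms++;
}
return distinctTerms;
---------------------------------------------------------
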