|
@@ -97,22 +97,38 @@ similarity has the following option:
|
|
|
Type name: `classic`
|
|
|
|
|
|
[float]
|
|
|
-[[drf]]
|
|
|
+[[dfr]]
|
|
|
==== DFR similarity
|
|
|
|
|
|
Similarity that implements the
|
|
|
-http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
|
|
|
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
|
|
|
from randomness] framework. This similarity has the following options:
|
|
|
|
|
|
[horizontal]
|
|
|
`basic_model`::
|
|
|
- Possible values: `be`, `d`, `g`, `if`, `in`, `ine` and `p`.
|
|
|
+ Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelBE.html[`be`],
|
|
|
+ {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelD.html[`d`],
|
|
|
+ {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelG.html[`g`],
|
|
|
+ {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIF.html[`if`],
|
|
|
+ {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIn.html[`in`],
|
|
|
+ {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIne.html[`ine`] and
|
|
|
+ {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelP.html[`p`].
|
|
|
+
|
|
|
+`be`, `d` and `p` should be avoided in practice as they might return scores that
|
|
|
+are equal to 0 or infinite with terms that do not meet the expected random
|
|
|
+distribution.
|
|
|
|
|
|
`after_effect`::
|
|
|
- Possible values: `no`, `b` and `l`.
|
|
|
+ Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffect.NoAfterEffect.html[`no`],
|
|
|
+ {lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffectB.html[`b`] and
|
|
|
+ {lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffectL.html[`l`].
|
|
|
|
|
|
`normalization`::
|
|
|
- Possible values: `no`, `h1`, `h2`, `h3` and `z`.
|
|
|
+ Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/Normalization.NoNormalization.html[`no`],
|
|
|
+ {lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH1.html[`h1`],
|
|
|
+ {lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH2.html[`h2`],
|
|
|
+ {lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH3.html[`h3`] and
|
|
|
+ {lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationZ.html[`z`].
|
|
|
|
|
|
All options but the first option need a normalization value.
|
|
|
|
|
@@ -127,7 +143,14 @@ model.
|
|
|
This similarity has the following options:
|
|
|
|
|
|
[horizontal]
|
|
|
-`independence_measure`:: Possible values `standardized`, `saturated`, `chisquared`.
|
|
|
+`independence_measure`:: Possible values:
|
|
|
+ {lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceStandardized.html[`standardized`],
|
|
|
+ {lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceSaturated.html[`saturated`] and
|
|
|
+ {lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceChiSquared.html[`chisquared`].
|
|
|
+
|
|
|
+When using this similarity, it is highly recommended to remove stop words to get
|
|
|
+good relevance. Also beware that terms whose frequency is less than the expected
|
|
|
+frequency will get a score equal to 0.
|
|
|
|
|
|
Type name: `DFI`
|
|
|
|
|
@@ -135,15 +158,19 @@ Type name: `DFI`
|
|
|
[[ib]]
|
|
|
==== IB similarity.
|
|
|
|
|
|
-http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/IBSimilarity.html[Information
|
|
|
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/IBSimilarity.html[Information
|
|
|
based model] . The algorithm is based on the concept that the information content in any symbolic 'distribution'
|
|
|
sequence is primarily determined by the repetitive usage of its basic elements.
|
|
|
For written texts this challenge would correspond to comparing the writing styles of different authors.
|
|
|
This similarity has the following options:
|
|
|
|
|
|
[horizontal]
|
|
|
-`distribution`:: Possible values: `ll` and `spl`.
|
|
|
-`lambda`:: Possible values: `df` and `ttf`.
|
|
|
+`distribution`:: Possible values:
|
|
|
+ {lucene-core-javadoc}/org/apache/lucene/search/similarities/DistributionLL.html[`ll`] and
|
|
|
+ {lucene-core-javadoc}/org/apache/lucene/search/similarities/DistributionSPL.html[`spl`].
|
|
|
+`lambda`:: Possible values:
|
|
|
+ {lucene-core-javadoc}/org/apache/lucene/search/similarities/LambdaDF.html[`df`] and
|
|
|
+ {lucene-core-javadoc}/org/apache/lucene/search/similarities/LambdaTTF.html[`ttf`].
|
|
|
`normalization`:: Same as in `DFR` similarity.
|
|
|
|
|
|
Type name: `IB`
|
|
@@ -152,19 +179,23 @@ Type name: `IB`
|
|
|
[[lm_dirichlet]]
|
|
|
==== LM Dirichlet similarity.
|
|
|
|
|
|
-http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html[LM
|
|
|
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/LMDirichletSimilarity.html[LM
|
|
|
Dirichlet similarity] . This similarity has the following options:
|
|
|
|
|
|
[horizontal]
|
|
|
`mu`:: Default to `2000`.
|
|
|
|
|
|
+The scoring formula in the paper assigns negative scores to terms that have
|
|
|
+fewer occurrences than predicted by the language model, which is not allowed in
|
|
|
+Lucene, so such terms get a score of 0.
|
|
|
+
|
|
|
Type name: `LMDirichlet`
|
|
|
|
|
|
[float]
|
|
|
[[lm_jelinek_mercer]]
|
|
|
==== LM Jelinek Mercer similarity.
|
|
|
|
|
|
-http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html[LM
|
|
|
+{lucene-core-javadoc}/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html[LM
|
|
|
Jelinek Mercer similarity] . The algorithm attempts to capture important patterns in the text, while leaving out noise. This similarity has the following options:
|
|
|
|
|
|
[horizontal]
|