Docs: Language analyzers
Clarified the use of stem_exclusion and the keyword_marker
token filter

Closes #6613
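
As a rough illustration of the relationship this change documents, the sketch below builds index settings for a `custom` analyzer and only includes a `keyword_marker` filter when there are words to protect from stemming. The filter and analyzer names (`my_stop`, `my_keywords`, `my_stemmer`, `my_english`) and the helper function are hypothetical, and the filter chain is a simplified assumption rather than the exact definition of any built-in analyzer.

```python
import json


def build_custom_settings(stem_exclusion=None):
    """Build settings for a hypothetical custom analyzer.

    `stem_exclusion` is a list of lowercase words that should not be
    stemmed. When it is empty or None, the keyword_marker filter is
    left out entirely, as the NOTE in the docs recommends.
    """
    # Assumed, simplified filter definitions (names are illustrative).
    filters = {
        "my_stop": {"type": "stop", "stopwords": "_english_"},
        "my_stemmer": {"type": "stemmer", "language": "english"},
    }
    chain = ["lowercase", "my_stop"]

    if stem_exclusion:
        # keyword_marker must run before the stemmer so that the
        # marked tokens are skipped by it.
        filters["my_keywords"] = {
            "type": "keyword_marker",
            "keywords": list(stem_exclusion),
        }
        chain.append("my_keywords")

    chain.append("my_stemmer")

    return {
        "settings": {
            "analysis": {
                "filter": filters,
                "analyzer": {
                    "my_english": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": chain,
                    }
                },
            }
        }
    }


def build_builtin_settings(stem_exclusion):
    """Same intent using a built-in analyzer's stem_exclusion parameter."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_english": {
                        "type": "english",
                        "stem_exclusion": list(stem_exclusion),
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_custom_settings(["skies"]), indent=2))
```

Calling `build_custom_settings()` with no words yields a chain without `keyword_marker`, while `build_builtin_settings(["skies"])` shows the shorter form available when the built-in analyzer is used directly.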

Clinton Gormley, 11 years ago
parent
commit
e4baa56f4b
1 changed file with 109 additions and 116 deletions
docs/reference/analysis/analyzers/lang-analyzer.asciidoc (+109 -116)

@@ -36,19 +36,40 @@ following types are supported:
 <<turkish-analyzer,`turkish`>>,
 <<thai-analyzer,`thai`>>.
 
+==== Configuring language analyzers
+
+===== Stopwords
+
 All analyzers support setting custom `stopwords` either internally in
 the config, or by using an external stopwords file by setting
 `stopwords_path`. Check <<analysis-stop-analyzer,Stop Analyzer>> for
 more details.
 
+===== Excluding words from stemming
+
+The `stem_exclusion` parameter allows you to specify an array
+of lowercase words that should not be stemmed.  Internally, this
+functionality is implemented by adding the
+<<analysis-keyword-marker-tokenfilter,`keyword_marker` token filter>>
+with the `keywords` set to the value of the `stem_exclusion` parameter.
+
 The following analyzers support setting custom `stem_exclusion` list:
 `arabic`, `armenian`, `basque`, `catalan`, `bulgarian`, `catalan`,
 `czech`, `finnish`, `dutch`, `english`, `finnish`, `french`, `galician`,
 `german`, `irish`, `hindi`, `hungarian`, `indonesian`, `italian`, `norwegian`,
 `portuguese`, `romanian`, `russian`, `sorani`, `spanish`, `swedish`, `turkish`.
 
+==== Reimplementing language analyzers
+
+The built-in language analyzers can be reimplemented as `custom` analyzers
+(as described below) in order to customize their behaviour.
+
+NOTE: If you do not intend to exclude words from being stemmed (the
+equivalent of the `stem_exclusion` parameter above), then you should remove
+the `keyword_marker` token filter from the custom analyzer configuration.
+
 [[arabic-analyzer]]
-==== `arabic` analyzer
+===== `arabic` analyzer
 
 The `arabic` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -89,12 +110,11 @@ The `arabic` analyzer could be reimplemented as a `custom` analyzer as follows:
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[armenian-analyzer]]
-==== `armenian` analyzer
+===== `armenian` analyzer
 
 The `armenian` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -134,12 +154,11 @@ The `armenian` analyzer could be reimplemented as a `custom` analyzer as follows
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[basque-analyzer]]
-==== `basque` analyzer
+===== `basque` analyzer
 
 The `basque` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -179,12 +198,11 @@ The `basque` analyzer could be reimplemented as a `custom` analyzer as follows:
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[brazilian-analyzer]]
-==== `brazilian` analyzer
+===== `brazilian` analyzer
 
 The `brazilian` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -224,12 +242,11 @@ The `brazilian` analyzer could be reimplemented as a `custom` analyzer as follow
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[bulgarian-analyzer]]
-==== `bulgarian` analyzer
+===== `bulgarian` analyzer
 
 The `bulgarian` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -269,12 +286,11 @@ The `bulgarian` analyzer could be reimplemented as a `custom` analyzer as follow
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[catalan-analyzer]]
-==== `catalan` analyzer
+===== `catalan` analyzer
 
 The `catalan` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -319,12 +335,11 @@ The `catalan` analyzer could be reimplemented as a `custom` analyzer as follows:
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[chinese-analyzer]]
-==== `chinese` analyzer
+===== `chinese` analyzer
 
 The `chinese` analyzer cannot be reimplemented as a `custom` analyzer
 because it depends on the ChineseTokenizer and ChineseFilter classes,
@@ -333,7 +348,7 @@ deprecated in Lucene 4 and the `chinese` analyzer will be replaced
 with the <<analysis-standard-analyzer>> in Lucene 5.
 
 [[cjk-analyzer]]
-==== `cjk` analyzer
+===== `cjk` analyzer
 
 The `cjk` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -367,7 +382,7 @@ The `cjk` analyzer could be reimplemented as a `custom` analyzer as follows:
     or `stopwords_path` parameters.
 
 [[czech-analyzer]]
-==== `czech` analyzer
+===== `czech` analyzer
 
 The `czech` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -407,12 +422,11 @@ The `czech` analyzer could be reimplemented as a `custom` analyzer as follows:
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[danish-analyzer]]
-==== `danish` analyzer
+===== `danish` analyzer
 
 The `danish` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -452,12 +466,11 @@ The `danish` analyzer could be reimplemented as a `custom` analyzer as follows:
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[dutch-analyzer]]
-==== `dutch` analyzer
+===== `dutch` analyzer
 
 The `dutch` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -507,12 +520,11 @@ The `dutch` analyzer could be reimplemented as a `custom` analyzer as follows:
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[english-analyzer]]
-==== `english` analyzer
+===== `english` analyzer
 
 The `english` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -557,12 +569,11 @@ The `english` analyzer could be reimplemented as a `custom` analyzer as follows:
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[finnish-analyzer]]
-==== `finnish` analyzer
+===== `finnish` analyzer
 
 The `finnish` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -602,12 +613,11 @@ The `finnish` analyzer could be reimplemented as a `custom` analyzer as follows:
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[french-analyzer]]
-==== `french` analyzer
+===== `french` analyzer
 
 The `french` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -655,12 +665,11 @@ The `french` analyzer could be reimplemented as a `custom` analyzer as follows:
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[galician-analyzer]]
-==== `galician` analyzer
+===== `galician` analyzer
 
 The `galician` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -700,12 +709,11 @@ The `galician` analyzer could be reimplemented as a `custom` analyzer as follows
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[german-analyzer]]
-==== `german` analyzer
+===== `german` analyzer
 
 The `german` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -746,12 +754,11 @@ The `german` analyzer could be reimplemented as a `custom` analyzer as follows:
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[greek-analyzer]]
-==== `greek` analyzer
+===== `greek` analyzer
 
 The `greek` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -795,12 +802,11 @@ The `greek` analyzer could be reimplemented as a `custom` analyzer as follows:
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[hindi-analyzer]]
-==== `hindi` analyzer
+===== `hindi` analyzer
 
 The `hindi` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -842,12 +848,11 @@ The `hindi` analyzer could be reimplemented as a `custom` analyzer as follows:
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[hungarian-analyzer]]
-==== `hungarian` analyzer
+===== `hungarian` analyzer
 
 The `hungarian` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -887,13 +892,12 @@ The `hungarian` analyzer could be reimplemented as a `custom` analyzer as follow
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 
 [[indonesian-analyzer]]
-==== `indonesian` analyzer
+===== `indonesian` analyzer
 
 The `indonesian` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -933,12 +937,11 @@ The `indonesian` analyzer could be reimplemented as a `custom` analyzer as follo
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[irish-analyzer]]
-==== `irish` analyzer
+===== `irish` analyzer
 
 The `irish` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -987,12 +990,11 @@ The `irish` analyzer could be reimplemented as a `custom` analyzer as follows:
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[italian-analyzer]]
-==== `italian` analyzer
+===== `italian` analyzer
 
 The `italian` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -1042,12 +1044,11 @@ The `italian` analyzer could be reimplemented as a `custom` analyzer as follows:
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[norwegian-analyzer]]
-==== `norwegian` analyzer
+===== `norwegian` analyzer
 
 The `norwegian` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -1087,12 +1088,11 @@ The `norwegian` analyzer could be reimplemented as a `custom` analyzer as follow
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[persian-analyzer]]
-==== `persian` analyzer
+===== `persian` analyzer
 
 The `persian` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -1134,7 +1134,7 @@ The `persian` analyzer could be reimplemented as a `custom` analyzer as follows:
     or `stopwords_path` parameters.
 
 [[portuguese-analyzer]]
-==== `portuguese` analyzer
+===== `portuguese` analyzer
 
 The `portuguese` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -1174,12 +1174,11 @@ The `portuguese` analyzer could be reimplemented as a `custom` analyzer as follo
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[romanian-analyzer]]
-==== `romanian` analyzer
+===== `romanian` analyzer
 
 The `romanian` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -1219,13 +1218,12 @@ The `romanian` analyzer could be reimplemented as a `custom` analyzer as follows
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 
 [[russian-analyzer]]
-==== `russian` analyzer
+===== `russian` analyzer
 
 The `russian` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -1265,12 +1263,11 @@ The `russian` analyzer could be reimplemented as a `custom` analyzer as follows:
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[sorani-analyzer]]
-==== `sorani` analyzer
+===== `sorani` analyzer
 
 The `sorani` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -1311,12 +1308,11 @@ The `sorani` analyzer could be reimplemented as a `custom` analyzer as follows:
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[spanish-analyzer]]
-==== `spanish` analyzer
+===== `spanish` analyzer
 
 The `spanish` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -1356,12 +1352,11 @@ The `spanish` analyzer could be reimplemented as a `custom` analyzer as follows:
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[swedish-analyzer]]
-==== `swedish` analyzer
+===== `swedish` analyzer
 
 The `swedish` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -1401,12 +1396,11 @@ The `swedish` analyzer could be reimplemented as a `custom` analyzer as follows:
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[turkish-analyzer]]
-==== `turkish` analyzer
+===== `turkish` analyzer
 
 The `turkish` analyzer could be reimplemented as a `custom` analyzer as follows:
 
@@ -1451,12 +1445,11 @@ The `turkish` analyzer could be reimplemented as a `custom` analyzer as follows:
 ----------------------------------------------------
 <1> The default stopwords can be overridden with the `stopwords`
     or `stopwords_path` parameters.
-<2> Words can be excluded from stemming with the `stem_exclusion`
-    parameter. This filter should be removed if there are no words 
-    to exclude.
+<2> This filter should be removed unless there are words which should
+    be excluded from stemming.
 
 [[thai-analyzer]]
-==== `thai` analyzer
+===== `thai` analyzer
 
 The `thai` analyzer could be reimplemented as a `custom` analyzer as follows: