
Docs: Document how to rebuild analyzers (#30498)

Adds documentation for how to rebuild all the built in analyzers and
tests for that documentation using the mechanism added in #29535.

Closes #29499
Nik Everett 7 years ago
parent
commit
9881bfaea5

+ 43 - 14
docs/reference/analysis/analyzers/fingerprint-analyzer.asciidoc

@@ -9,20 +9,6 @@ Input text is lowercased, normalized to remove extended characters, sorted,
 deduplicated and concatenated into a single token.  If a stopword list is
 configured, stop words will also be removed.
 
-[float]
-=== Definition
-
-It consists of:
-
-Tokenizer::
-* <<analysis-standard-tokenizer,Standard Tokenizer>>
-
-Token Filters (in order)::
-1. <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
-2. <<analysis-asciifolding-tokenfilter>>
-3. <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
-4. <<analysis-fingerprint-tokenfilter>>
-
 [float]
 === Example output
 
@@ -149,3 +135,46 @@ The above example produces the following term:
 ---------------------------
 [ consistent godel said sentence yes ]
 ---------------------------
+
+[float]
+=== Definition
+
+The `fingerprint` analyzer consists of:
+
+Tokenizer::
+* <<analysis-standard-tokenizer,Standard Tokenizer>>
+
+Token Filters (in order)::
+* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
+* <<analysis-asciifolding-tokenfilter>>
+* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
+* <<analysis-fingerprint-tokenfilter>>
+
+If you need to customize the `fingerprint` analyzer beyond the configuration
+parameters then you need to recreate it as a `custom` analyzer and modify
+it, usually by adding token filters. This would recreate the built-in
+`fingerprint` analyzer and you can use it as a starting point for further
+customization:
+
+[source,js]
+----------------------------------------------------
+PUT /fingerprint_example
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "rebuilt_fingerprint": {
+          "tokenizer": "standard",
+          "filter": [
+            "lowercase",
+            "asciifolding",
+            "fingerprint"
+          ]
+        }
+      }
+    }
+  }
+}
+----------------------------------------------------
+// CONSOLE
+// TEST[s/\n$/\nstartyaml\n  - compare_analyzers: {index: fingerprint_example, first: fingerprint, second: rebuilt_fingerprint}\nendyaml\n/]
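Once the `fingerprint_example` index above exists, a quick way to see the
rebuilt analyzer in action is the `_analyze` API. This is only a sketch for
experimentation; the sample text is illustrative, and both `fingerprint` and
`rebuilt_fingerprint` should return the same single token for it:

[source,js]
----------------------------------------------------
POST /fingerprint_example/_analyze
{
  "analyzer": "rebuilt_fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and." <1>
}
----------------------------------------------------
<1> Any sample text works here; the response contains one token holding the
    lowercased, sorted, deduplicated fingerprint of the input.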

+ 37 - 8
docs/reference/analysis/analyzers/keyword-analyzer.asciidoc

@@ -4,14 +4,6 @@
 The `keyword` analyzer is a ``noop'' analyzer which returns the entire input
 string as a single token.
 
-[float]
-=== Definition
-
-It consists of:
-
-Tokenizer::
-* <<analysis-keyword-tokenizer,Keyword Tokenizer>>
-
 [float]
 === Example output
 
@@ -57,3 +49,40 @@ The above sentence would produce the following single term:
 === Configuration
 
 The `keyword` analyzer is not configurable.
+
+[float]
+=== Definition
+
+The `keyword` analyzer consists of:
+
+Tokenizer::
+* <<analysis-keyword-tokenizer,Keyword Tokenizer>>
+
+If you need to customize the `keyword` analyzer then you need to
+recreate it as a `custom` analyzer and modify it, usually by adding
+token filters. You should usually prefer the
+<<keyword, Keyword type>> when you want strings that are not split
+into tokens, but if you do need this analyzer, this would recreate the
+built-in `keyword` analyzer and you can use it as a starting point
+for further customization:
+
+[source,js]
+----------------------------------------------------
+PUT /keyword_example
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "rebuilt_keyword": {
+          "tokenizer": "keyword",
+          "filter": [         <1>
+          ]
+        }
+      }
+    }
+  }
+}
+----------------------------------------------------
+// CONSOLE
+// TEST[s/\n$/\nstartyaml\n  - compare_analyzers: {index: keyword_example, first: keyword, second: rebuilt_keyword}\nendyaml\n/]
+<1> You'd add any token filters here.
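As a minimal sketch of that kind of customization (the
`keyword_example_lowercase` index and `rebuilt_keyword_lowercase` analyzer
names are made up for this example), the request below keeps the whole input
as one token but lowercases it by adding the `lowercase` token filter:

[source,js]
----------------------------------------------------
PUT /keyword_example_lowercase
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_keyword_lowercase": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"         <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> With this filter in place, `New York` is indexed as the single term
    `new york`.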

+ 48 - 13
docs/reference/analysis/analyzers/pattern-analyzer.asciidoc

@@ -19,19 +19,6 @@ Read more about http://www.regular-expressions.info/catastrophic.html[pathologic
 
 ========================================
 
-
-[float]
-=== Definition
-
-It consists of:
-
-Tokenizer::
-* <<analysis-pattern-tokenizer,Pattern Tokenizer>>
-
-Token Filters::
-*  <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
-*  <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
-
 [float]
 === Example output
 
@@ -378,3 +365,51 @@ The regex above is easier to understand as:
     [\p{L}&&[^\p{Lu}]]          #   then lower case
   )
 --------------------------------------------------
+
+[float]
+=== Definition
+
+The `pattern` analyzer consists of:
+
+Tokenizer::
+* <<analysis-pattern-tokenizer,Pattern Tokenizer>>
+
+Token Filters::
+*  <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
+*  <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
+
+If you need to customize the `pattern` analyzer beyond the configuration
+parameters then you need to recreate it as a `custom` analyzer and modify
+it, usually by adding token filters. This would recreate the built-in
+`pattern` analyzer and you can use it as a starting point for further
+customization:
+
+[source,js]
+----------------------------------------------------
+PUT /pattern_example
+{
+  "settings": {
+    "analysis": {
+      "tokenizer": {
+        "split_on_non_word": {
+          "type":       "pattern",
+          "pattern":    "\\W+" <1>
+        }
+      },
+      "analyzer": {
+        "rebuilt_pattern": {
+          "tokenizer": "split_on_non_word",
+          "filter": [
+            "lowercase"       <2>
+          ]
+        }
+      }
+    }
+  }
+}
+----------------------------------------------------
+// CONSOLE
+// TEST[s/\n$/\nstartyaml\n  - compare_analyzers: {index: pattern_example, first: pattern, second: rebuilt_pattern}\nendyaml\n/]
+<1> The default pattern is `\W+`, which splits on non-word characters,
+and this is where you'd change it.
+<2> You'd add other token filters after `lowercase`.
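For instance, the `pattern` parameter marked above is where you would swap in
your own regex. The sketch below (the index, tokenizer, and analyzer names are
hypothetical) splits on commas instead of non-word characters, which suits
simple comma-separated values:

[source,js]
----------------------------------------------------
PUT /pattern_example_csv
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_comma": {
          "type":       "pattern",
          "pattern":    ","     <1>
        }
      },
      "analyzer": {
        "rebuilt_pattern_csv": {
          "tokenizer": "split_on_comma",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> A literal comma replaces the default `\W+` pattern.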

+ 34 - 8
docs/reference/analysis/analyzers/simple-analyzer.asciidoc

@@ -4,14 +4,6 @@
 The `simple` analyzer breaks text into terms whenever it encounters a
 character which is not a letter. All terms are lower cased.
 
-[float]
-=== Definition
-
-It consists of:
-
-Tokenizer::
-* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>
-
 [float]
 === Example output
 
@@ -127,3 +119,37 @@ The above sentence would produce the following terms:
 === Configuration
 
 The `simple` analyzer is not configurable.
+
+[float]
+=== Definition
+
+The `simple` analyzer consists of:
+
+Tokenizer::
+* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>
+
+If you need to customize the `simple` analyzer then you need to recreate
+it as a `custom` analyzer and modify it, usually by adding token filters.
+This would recreate the built-in `simple` analyzer and you can use it as
+a starting point for further customization:
+
+[source,js]
+----------------------------------------------------
+PUT /simple_example
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "rebuilt_simple": {
+          "tokenizer": "lowercase",
+          "filter": [         <1>
+          ]
+        }
+      }
+    }
+  }
+}
+----------------------------------------------------
+// CONSOLE
+// TEST[s/\n$/\nstartyaml\n  - compare_analyzers: {index: simple_example, first: simple, second: rebuilt_simple}\nendyaml\n/]
+<1> You'd add any token filters here.
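Because the `simple` analyzer is nothing more than the `lowercase` tokenizer,
you can also experiment with token filters directly through the `_analyze` API
without creating an index first. The request below is a sketch of that; the
`asciifolding` filter is just an example of one you might add:

[source,js]
----------------------------------------------------
POST /_analyze
{
  "tokenizer": "lowercase",
  "filter":    [ "asciifolding" ], <1>
  "text":      "Déjà vu"
}
----------------------------------------------------
<1> This inline `filter` list plays the same role as the empty list in the
    `rebuilt_simple` example above; here it folds `déjà` to `deja`.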

+ 41 - 13
docs/reference/analysis/analyzers/standard-analyzer.asciidoc

@@ -7,19 +7,6 @@ Segmentation algorithm, as specified in
 http://unicode.org/reports/tr29/[Unicode Standard Annex #29]) and works well
 for most languages.
 
-[float]
-=== Definition
-
-It consists of:
-
-Tokenizer::
-* <<analysis-standard-tokenizer,Standard Tokenizer>>
-
-Token Filters::
-* <<analysis-standard-tokenfilter,Standard Token Filter>>
-* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
-* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
-
 [float]
 === Example output
 
@@ -276,3 +263,44 @@ The above example produces the following terms:
 ---------------------------
 [ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
 ---------------------------
+
+[float]
+=== Definition
+
+The `standard` analyzer consists of:
+
+Tokenizer::
+* <<analysis-standard-tokenizer,Standard Tokenizer>>
+
+Token Filters::
+* <<analysis-standard-tokenfilter,Standard Token Filter>>
+* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
+* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
+
+If you need to customize the `standard` analyzer beyond the configuration
+parameters then you need to recreate it as a `custom` analyzer and modify
+it, usually by adding token filters. This would recreate the built-in
+`standard` analyzer and you can use it as a starting point:
+
+[source,js]
+----------------------------------------------------
+PUT /standard_example
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "rebuilt_standard": {
+          "tokenizer": "standard",
+          "filter": [
+            "standard",
+            "lowercase"       <1>
+          ]
+        }
+      }
+    }
+  }
+}
+----------------------------------------------------
+// CONSOLE
+// TEST[s/\n$/\nstartyaml\n  - compare_analyzers: {index: standard_example, first: standard, second: rebuilt_standard}\nendyaml\n/]
+<1> You'd add any token filters after `lowercase`.
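As an example of what adding token filters after `lowercase` looks like in
practice, the sketch below (the index and analyzer names are made up) appends
an `asciifolding` filter so accented characters are folded to their ASCII
equivalents:

[source,js]
----------------------------------------------------
PUT /standard_example_folded
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard_folded": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "asciifolding"      <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> Extra filters go after `lowercase`, in the order you want them applied.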

+ 47 - 11
docs/reference/analysis/analyzers/stop-analyzer.asciidoc

@@ -5,17 +5,6 @@ The `stop` analyzer is the same as the <<analysis-simple-analyzer,`simple` analy
 but adds support for removing stop words.  It defaults to using the
 `_english_` stop words.
 
-[float]
-=== Definition
-
-It consists of:
-
-Tokenizer::
-* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>
-
-Token filters::
-* <<analysis-stop-tokenfilter,Stop Token Filter>>
-
 [float]
 === Example output
 
@@ -239,3 +228,50 @@ The above example produces the following terms:
 ---------------------------
 [ quick, brown, foxes, jumped, lazy, dog, s, bone ]
 ---------------------------
+
+[float]
+=== Definition
+
+The `stop` analyzer consists of:
+
+Tokenizer::
+* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>
+
+Token filters::
+* <<analysis-stop-tokenfilter,Stop Token Filter>>
+
+If you need to customize the `stop` analyzer beyond the configuration
+parameters then you need to recreate it as a `custom` analyzer and modify
+it, usually by adding token filters. This would recreate the built-in
+`stop` analyzer and you can use it as a starting point for further
+customization:
+
+[source,js]
+----------------------------------------------------
+PUT /stop_example
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "english_stop": {
+          "type":       "stop",
+          "stopwords":  "_english_" <1>
+        }
+      },
+      "analyzer": {
+        "rebuilt_stop": {
+          "tokenizer": "lowercase",
+          "filter": [
+            "english_stop"          <2>
+          ]
+        }
+      }
+    }
+  }
+}
+----------------------------------------------------
+// CONSOLE
+// TEST[s/\n$/\nstartyaml\n  - compare_analyzers: {index: stop_example, first: stop, second: rebuilt_stop}\nendyaml\n/]
+<1> The default stopwords can be overridden with the `stopwords`
+    or `stopwords_path` parameters.
+<2> You'd add any token filters after `english_stop`.
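To make the stopword override concrete, the sketch below (the index, filter,
and analyzer names are invented for the example) swaps the `_english_` default
for a short explicit list:

[source,js]
----------------------------------------------------
PUT /stop_example_custom
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type":       "stop",
          "stopwords":  [ "and", "is", "the" ] <1>
        }
      },
      "analyzer": {
        "rebuilt_stop_custom": {
          "tokenizer": "lowercase",
          "filter": [
            "my_stop"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> An explicit list replaces `_english_`; alternatively, `stopwords_path`
    points at a file of stop words.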

+ 34 - 8
docs/reference/analysis/analyzers/whitespace-analyzer.asciidoc

@@ -4,14 +4,6 @@
 The `whitespace` analyzer breaks text into terms whenever it encounters a
 whitespace character.
 
-[float]
-=== Definition
-
-It consists of:
-
-Tokenizer::
-* <<analysis-whitespace-tokenizer,Whitespace Tokenizer>>
-
 [float]
 === Example output
 
@@ -120,3 +112,37 @@ The above sentence would produce the following terms:
 === Configuration
 
 The `whitespace` analyzer is not configurable.
+
+[float]
+=== Definition
+
+The `whitespace` analyzer consists of:
+
+Tokenizer::
+* <<analysis-whitespace-tokenizer,Whitespace Tokenizer>>
+
+If you need to customize the `whitespace` analyzer then you need to
+recreate it as a `custom` analyzer and modify it, usually by adding
+token filters. This would recreate the built-in `whitespace` analyzer
+and you can use it as a starting point for further customization:
+
+[source,js]
+----------------------------------------------------
+PUT /whitespace_example
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "rebuilt_whitespace": {
+          "tokenizer": "whitespace",
+          "filter": [         <1>
+          ]
+        }
+      }
+    }
+  }
+}
+----------------------------------------------------
+// CONSOLE
+// TEST[s/\n$/\nstartyaml\n  - compare_analyzers: {index: whitespace_example, first: whitespace, second: rebuilt_whitespace}\nendyaml\n/]
+<1> You'd add any token filters here.
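A quick sketch of checking the rebuilt analyzer with the `_analyze` API (the
sample text is arbitrary) shows that, unlike most of the other built-in
analyzers, terms keep their original case because no `lowercase` filter is
involved:

[source,js]
----------------------------------------------------
POST /whitespace_example/_analyze
{
  "analyzer": "rebuilt_whitespace",
  "text": "The QUICK Brown-Fox"
}
----------------------------------------------------

This should return the terms `The`, `QUICK`, and `Brown-Fox`; whitespace is
the only thing that splits the text.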