|
@@ -1,137 +1,202 @@
|
|
|
[[analysis-keep-types-tokenfilter]]
|
|
|
-=== Keep Types Token Filter
|
|
|
+=== Keep types token filter
|
|
|
+++++
|
|
|
+<titleabbrev>Keep types</titleabbrev>
|
|
|
+++++
|
|
|
|
|
|
-A token filter of type `keep_types` that only keeps tokens with a token type
|
|
|
-contained in a predefined set.
|
|
|
+Keeps or removes tokens of a specific type. For example, you can use this filter
|
|
|
+to change `3 quick foxes` to `quick foxes` by keeping only `<ALPHANUM>`
|
|
|
+(alphanumeric) tokens.
|
|
|
|
|
|
+[NOTE]
|
|
|
+.Token types
|
|
|
+====
|
|
|
+Token types are set by the <<analysis-tokenizers,tokenizer>> when converting
|
|
|
+characters to tokens. Token types can vary between tokenizers.
|
|
|
|
|
|
-[float]
|
|
|
-=== Options
|
|
|
-[horizontal]
|
|
|
-types:: a list of types to include (default mode) or exclude
|
|
|
-mode:: if set to `include` (default) the specified token types will be kept,
|
|
|
-if set to `exclude` the specified token types will be removed from the stream
|
|
|
+For example, the <<analysis-standard-tokenizer,`standard`>> tokenizer can
|
|
|
+produce a variety of token types, including `<ALPHANUM>`, `<HANGUL>`, and
|
|
|
+`<NUM>`. Simpler analyzers, like the
|
|
|
+<<analysis-lowercase-tokenizer,`lowercase`>> tokenizer, only produce the `word`
|
|
|
+token type.
|
|
|
|
|
|
-[float]
|
|
|
-=== Settings example
|
|
|
+Certain token filters can also add token types. For example, the
|
|
|
+<<analysis-synonym-tokenfilter,`synonym`>> filter can add the `<SYNONYM>` token
|
|
|
+type.
|
|
|
+====
|
|
|
|
|
|
-You can set it up like:
|
|
|
+This filter uses Lucene's
|
|
|
+https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/core/TypeTokenFilter.html[TypeTokenFilter].
|
|
|
+
|
|
|
+[[analysis-keep-types-tokenfilter-analyze-include-ex]]
|
|
|
+==== Include example
|
|
|
+
|
|
|
+The following <<indices-analyze,analyze API>> request uses the `keep_types`
|
|
|
+filter to keep only `<NUM>` (numeric) tokens from `1 quick fox 2 lazy dogs`.
|
|
|
|
|
|
[source,console]
|
|
|
--------------------------------------------------
|
|
|
-PUT /keep_types_example
|
|
|
+GET _analyze
|
|
|
{
|
|
|
- "settings" : {
|
|
|
- "analysis" : {
|
|
|
- "analyzer" : {
|
|
|
- "my_analyzer" : {
|
|
|
- "tokenizer" : "standard",
|
|
|
- "filter" : ["lowercase", "extract_numbers"]
|
|
|
- }
|
|
|
- },
|
|
|
- "filter" : {
|
|
|
- "extract_numbers" : {
|
|
|
- "type" : "keep_types",
|
|
|
- "types" : [ "<NUM>" ]
|
|
|
- }
|
|
|
- }
|
|
|
- }
|
|
|
+ "tokenizer": "standard",
|
|
|
+ "filter": [
|
|
|
+ {
|
|
|
+ "type": "keep_types",
|
|
|
+ "types": [ "<NUM>" ]
|
|
|
}
|
|
|
+ ],
|
|
|
+ "text": "1 quick fox 2 lazy dogs"
|
|
|
}
|
|
|
--------------------------------------------------
|
|
|
|
|
|
-And test it like:
|
|
|
+The filter produces the following tokens:
|
|
|
|
|
|
-[source,console]
|
|
|
+[source,text]
|
|
|
--------------------------------------------------
|
|
|
-POST /keep_types_example/_analyze
|
|
|
-{
|
|
|
- "analyzer" : "my_analyzer",
|
|
|
- "text" : "this is just 1 a test"
|
|
|
-}
|
|
|
+[ 1, 2 ]
|
|
|
--------------------------------------------------
|
|
|
-// TEST[continued]
|
|
|
-
|
|
|
-The response will be:
|
|
|
|
|
|
+/////////////////////
|
|
|
[source,console-result]
|
|
|
--------------------------------------------------
|
|
|
{
|
|
|
"tokens": [
|
|
|
{
|
|
|
"token": "1",
|
|
|
- "start_offset": 13,
|
|
|
- "end_offset": 14,
|
|
|
+ "start_offset": 0,
|
|
|
+ "end_offset": 1,
|
|
|
+ "type": "<NUM>",
|
|
|
+ "position": 0
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "token": "2",
|
|
|
+ "start_offset": 12,
|
|
|
+ "end_offset": 13,
|
|
|
"type": "<NUM>",
|
|
|
"position": 3
|
|
|
}
|
|
|
]
|
|
|
}
|
|
|
--------------------------------------------------
|
|
|
+/////////////////////
|
|
|
|
|
|
-Note how only the `<NUM>` token is in the output.
|
|
|
-
|
|
|
-[discrete]
|
|
|
-=== Exclude mode settings example
|
|
|
+[[analysis-keep-types-tokenfilter-analyze-exclude-ex]]
|
|
|
+==== Exclude example
|
|
|
|
|
|
-If the `mode` parameter is set to `exclude` like in the following example:
|
|
|
+The following <<indices-analyze,analyze API>> request uses the `keep_types`
|
|
|
+filter to remove `<NUM>` tokens from `1 quick fox 2 lazy dogs`. Note the `mode`
|
|
|
+parameter is set to `exclude`.
|
|
|
|
|
|
[source,console]
|
|
|
--------------------------------------------------
|
|
|
-PUT /keep_types_exclude_example
|
|
|
+GET _analyze
|
|
|
{
|
|
|
- "settings" : {
|
|
|
- "analysis" : {
|
|
|
- "analyzer" : {
|
|
|
- "my_analyzer" : {
|
|
|
- "tokenizer" : "standard",
|
|
|
- "filter" : ["lowercase", "remove_numbers"]
|
|
|
- }
|
|
|
- },
|
|
|
- "filter" : {
|
|
|
- "remove_numbers" : {
|
|
|
- "type" : "keep_types",
|
|
|
- "mode" : "exclude",
|
|
|
- "types" : [ "<NUM>" ]
|
|
|
- }
|
|
|
- }
|
|
|
- }
|
|
|
+ "tokenizer": "standard",
|
|
|
+ "filter": [
|
|
|
+ {
|
|
|
+ "type": "keep_types",
|
|
|
+ "types": [ "<NUM>" ],
|
|
|
+ "mode": "exclude"
|
|
|
}
|
|
|
+ ],
|
|
|
+ "text": "1 quick fox 2 lazy dogs"
|
|
|
}
|
|
|
--------------------------------------------------
|
|
|
|
|
|
-And we test it like:
|
|
|
+The filter produces the following tokens:
|
|
|
|
|
|
-[source,console]
|
|
|
+[source,text]
|
|
|
--------------------------------------------------
|
|
|
-POST /keep_types_exclude_example/_analyze
|
|
|
-{
|
|
|
- "analyzer" : "my_analyzer",
|
|
|
- "text" : "hello 101 world"
|
|
|
-}
|
|
|
+[ quick, fox, lazy, dogs ]
|
|
|
--------------------------------------------------
|
|
|
-// TEST[continued]
|
|
|
-
|
|
|
-The response will be:
|
|
|
|
|
|
+/////////////////////
|
|
|
[source,console-result]
|
|
|
--------------------------------------------------
|
|
|
{
|
|
|
"tokens": [
|
|
|
{
|
|
|
- "token": "hello",
|
|
|
- "start_offset": 0,
|
|
|
- "end_offset": 5,
|
|
|
+ "token": "quick",
|
|
|
+ "start_offset": 2,
|
|
|
+ "end_offset": 7,
|
|
|
"type": "<ALPHANUM>",
|
|
|
- "position": 0
|
|
|
- },
|
|
|
+ "position": 1
|
|
|
+ },
|
|
|
{
|
|
|
- "token": "world",
|
|
|
- "start_offset": 10,
|
|
|
- "end_offset": 15,
|
|
|
+ "token": "fox",
|
|
|
+ "start_offset": 8,
|
|
|
+ "end_offset": 11,
|
|
|
"type": "<ALPHANUM>",
|
|
|
"position": 2
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "token": "lazy",
|
|
|
+ "start_offset": 14,
|
|
|
+ "end_offset": 18,
|
|
|
+ "type": "<ALPHANUM>",
|
|
|
+ "position": 4
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "token": "dogs",
|
|
|
+ "start_offset": 19,
|
|
|
+ "end_offset": 23,
|
|
|
+ "type": "<ALPHANUM>",
|
|
|
+ "position": 5
|
|
|
}
|
|
|
]
|
|
|
}
|
|
|
--------------------------------------------------
|
|
|
+/////////////////////
|
|
|
+
|
|
|
+[[analysis-keep-types-tokenfilter-configure-parms]]
|
|
|
+==== Configurable parameters
|
|
|
+
|
|
|
+`types`::
|
|
|
+(Required, array of strings)
|
|
|
+List of token types to keep or remove.
|
|
|
+
|
|
|
+`mode`::
|
|
|
+(Optional, string)
|
|
|
+Indicates whether to keep or remove the specified token types.
|
|
|
+Valid values are:
|
|
|
+
|
|
|
+`include`:::
|
|
|
+(Default) Keep only the specified token types.
|
|
|
+
|
|
|
+`exclude`:::
|
|
|
+Remove the specified token types.
|
|
|
+
|
|
|
+[[analysis-keep-types-tokenfilter-customize]]
|
|
|
+==== Customize and add to an analyzer
|
|
|
+
|
|
|
+To customize the `keep_types` filter, duplicate it to create the basis
|
|
|
+for a new custom token filter. You can modify the filter using its configurable
|
|
|
+parameters.
|
|
|
+
|
|
|
+For example, the following <<indices-create-index,create index API>> request
|
|
|
+uses a custom `keep_types` filter to configure a new
|
|
|
+<<analysis-custom-analyzer,custom analyzer>>. The custom `keep_types` filter
|
|
|
+keeps only `<ALPHANUM>` (alphanumeric) tokens.
|
|
|
+
|
|
|
+[source,console]
|
|
|
+--------------------------------------------------
|
|
|
+PUT keep_types_example
|
|
|
+{
|
|
|
+ "settings": {
|
|
|
+ "analysis": {
|
|
|
+ "analyzer": {
|
|
|
+ "my_analyzer": {
|
|
|
+ "tokenizer": "standard",
|
|
|
+ "filter": [ "extract_alpha" ]
|
|
|
+ }
|
|
|
+ },
|
|
|
+ "filter": {
|
|
|
+ "extract_alpha": {
|
|
|
+ "type": "keep_types",
|
|
|
+ "types": [ "<ALPHANUM>" ]
|
|
|
+ }
|
|
|
+ }
|
|
|
+ }
|
|
|
+ }
|
|
|
+}
|
|
|
+--------------------------------------------------
|