@@ -11,11 +11,13 @@ experimental[]
Finds the structure of a text file. The text file must contain data that is
suitable to be ingested into {es}.

+
[[ml-find-file-structure-request]]
==== {api-request-title}

`POST _ml/find_file_structure`

+
[[ml-find-file-structure-prereqs]]
==== {api-prereq-title}

@@ -23,6 +25,7 @@ suitable to be ingested into {es}.
`monitor` cluster privileges to use this API. See
<<security-privileges>>.

+
[[ml-find-file-structure-desc]]
==== {api-description-title}

@@ -55,41 +58,42 @@ specify the `explain` query parameter. It causes an `explanation` to appear in
the response, which should help in determining why the returned structure was
chosen.

+
[[ml-find-file-structure-query-parms]]
==== {api-query-parms-title}

`charset`::
- (string) Optional. The file's character set. It must be a character set that
+ (Optional, string) The file's character set. It must be a character set that
is supported by the JVM that {es} uses. For example, `UTF-8`, `UTF-16LE`,
`windows-1252`, or `EUC-JP`. If this parameter is not specified, the structure
finder chooses an appropriate character set.

`column_names`::
- (string) Optional. If you have set `format` to `delimited`, you can specify
+ (Optional, string) If you have set `format` to `delimited`, you can specify
the column names in a comma-separated list. If this parameter is not specified,
the structure finder uses the column names from the header row of the file. If
the file does not have a header row, columns are named "column1", "column2",
"column3", etc.

`delimiter`::
- (string) Optional. If you have set `format` to `delimited`, you can specify
+ (Optional, string) If you have set `format` to `delimited`, you can specify
the character used to delimit the values in each row. Only a single character
is supported; the delimiter cannot have multiple characters. If this parameter
is not specified, the structure finder considers the following possibilities:
comma, tab, semi-colon, and pipe (`|`).

`explain`::
- (boolean) Optional. If this parameter is set to `true`, the response includes
+ (Optional, boolean) If this parameter is set to `true`, the response includes
a field named `explanation`, which is an array of strings that indicate how
the structure finder produced its result. The default value is `false`.

`format`::
- (string) Optional. The high level structure of the file. Valid values are
+ (Optional, string) The high level structure of the file. Valid values are
`ndjson`, `xml`, `delimited`, and `semi_structured_text`. If this parameter is
not specified, the structure finder chooses one.

`grok_pattern`::
- (string) Optional. If you have set `format` to `semi_structured_text`, you can
+ (Optional, string) If you have set `format` to `semi_structured_text`, you can
specify a Grok pattern that is used to extract fields from every message in
the file. The name of the timestamp field in the Grok pattern must match what
is specified in the `timestamp_field` parameter. If that parameter is not
@@ -98,20 +102,20 @@ chosen.
a Grok pattern.

`has_header_row`::
- (boolean) Optional. If you have set `format` to `delimited`, you can use this
+ (Optional, boolean) If you have set `format` to `delimited`, you can use this
parameter to indicate whether the column names are in the first row of the
file. If this parameter is not specified, the structure finder guesses based
on the similarity of the first row of the file to other rows.

`line_merge_size_limit`::
- (unsigned integer) Optional. The maximum number of characters in a message
+ (Optional, unsigned integer) The maximum number of characters in a message
when lines are merged to form messages while analyzing semi-structured files.
The default is `10000`. If you have extremely long messages you may need to
increase this, but be aware that this may lead to very long processing times
if the way to group lines into messages is misdetected.

`lines_to_sample`::
- (unsigned integer) Optional. The number of lines to include in the structural
+ (Optional, unsigned integer) The number of lines to include in the structural
analysis, starting from the beginning of the file. The minimum is 2; the
default is `1000`. If the value of this parameter is greater than the number
of lines in the file, the analysis proceeds (as long as there are at least two
@@ -127,7 +131,7 @@ to request analysis of 100000 lines to achieve some variety.
--

`quote`::
- (string) Optional. If you have set `format` to `delimited`, you can specify
+ (Optional, string) If you have set `format` to `delimited`, you can specify
the character used to quote the values in each row if they contain newlines or
the delimiter character. Only a single character is supported. If this
parameter is not specified, the default value is a double quote (`"`). If your
@@ -135,18 +139,18 @@ to request analysis of 100000 lines to achieve some variety.
argument to a character that does not appear anywhere in the sample.

`should_trim_fields`::
- (boolean) Optional. If you have set `format` to `delimited`, you can specify
+ (Optional, boolean) If you have set `format` to `delimited`, you can specify
whether values between delimiters should have whitespace trimmed from them. If
this parameter is not specified and the delimiter is pipe (`|`), the default
value is `true`. Otherwise, the default value is `false`.

`timeout`::
- (time) Optional. Sets the maximum amount of time that the structure analysis
- make take. If the analysis is still running when the timeout expires then it
- will be aborted. The default value is 25 seconds.
+ (Optional, <<time-units,time units>>) Sets the maximum amount of time that the
+ structure analysis may take. If the analysis is still running when the
+ timeout expires then it will be aborted. The default value is 25 seconds.

`timestamp_field`::
- (string) Optional. The name of the field that contains the primary timestamp
+ (Optional, string) The name of the field that contains the primary timestamp
of each record in the file. In particular, if the file were ingested into an
index, this is the field that would be used to populate the `@timestamp` field.
+
@@ -159,16 +163,16 @@ also specified.
For structured file formats, if you specify this parameter, the field must exist
within the file.

-If this parameter is not specified, the structure finder makes a decision about which
-field (if any) is the primary timestamp field. For structured file formats, it
-is not compulsory to have a timestamp in the file.
+If this parameter is not specified, the structure finder makes a decision about
+which field (if any) is the primary timestamp field. For structured file
+formats, it is not compulsory to have a timestamp in the file.
--

`timestamp_format`::
- (string) Optional. The Java time format of the timestamp field in the file. +
+ (Optional, string) The Java time format of the timestamp field in the file.
+
--
-NOTE: Only a subset of Java time format letter groups are supported:
+Only a subset of Java time format letter groups are supported:

* `a`
* `d`
@@ -206,6 +210,20 @@ structure finder does not consider by default.
If this parameter is not specified, the structure finder chooses the best
format from a built-in set.

+The following table provides the appropriate `timestamp_format` values for some example timestamps:
+
+|===
+| Timestamp format | Presentation
+
+| yyyy-MM-dd HH:mm:ssZ | 2019-04-20 13:15:22+0000
+| EEE, d MMM yyyy HH:mm:ss Z | Sat, 20 Apr 2019 13:15:22 +0000
+| dd.MM.yy HH:mm:ss.SSS | 20.04.19 13:15:22.285
+|===
+
+See
+https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html[the Java date/time format documentation]
+for more information about date and time format syntax.
+
--

[[ml-find-file-structure-request-body]]
@@ -219,6 +237,9 @@ size, which defaults to 100 Mb.
[[ml-find-file-structure-examples]]
==== {api-examples-title}

+[[ml-find-file-structure-example-nld-json]]
+===== Ingesting newline-delimited JSON
+
Suppose you have a newline-delimited JSON file that contains information about
some books. You can send the contents to the `find_file_structure` endpoint:

@@ -527,6 +548,10 @@ If the request does not encounter errors, you receive the following result:
may provide clues that the data needs to be cleaned or transformed prior
to use by other {ml} functionality.

+
+[[ml-find-file-structure-example-nyc]]
+===== Finding the structure of NYC yellow cab example data
+
The next example shows how it's possible to find the structure of some New York
City yellow cab trip data. The first `curl` command downloads the data, the
first 20000 lines of which are then piped into the `find_file_structure`
@@ -543,7 +568,7 @@ curl -s "s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-06.csv" | head
--
NOTE: The `Content-Type: application/json` header must be set even though in
this case the data is not JSON. (Alternatively the `Content-Type` can be set
-to any other supported by Elasticsearch, but it must be set.)
+to any other supported by {es}, but it must be set.)
--

If the request does not encounter errors, you receive the following result:
@@ -1333,6 +1358,10 @@ If the request does not encounter errors, you receive the following result:
necessary to supply the timezone they relate to. `need_client_timezone`
will be `false` for timestamp formats that include the timezone.

+
+[[ml-find-file-structure-example-timeout]]
+===== Setting the timeout parameter
+
If you try to analyze a lot of data then the analysis will take a long time.
If you want to limit the amount of processing your {es} cluster performs for
a request, use the `timeout` query parameter. The analysis will be aborted and
@@ -1375,6 +1404,10 @@ and the timeout is measured from the time this endpoint starts to process the
data.
--

+
+[[ml-find-file-structure-example-eslog]]
+===== Analyzing {es} log files
+
This is an example of analyzing {es}'s own log file:

[source,js]
@@ -1523,6 +1556,10 @@ this:
and recognizable fields that appear in every analyzed message. In this case
the only field that was recognized beyond the timestamp was the log level.

+
+[[ml-find-file-structure-example-grok]]
+===== Specifying `grok_pattern` as a query parameter
+
If you recognize more fields than the simple `grok_pattern` produced by the
structure finder unaided then you can resubmit the request specifying a more
advanced `grok_pattern` as a query parameter and the structure finder will