5 years ago · 659b4ceb97
--- a/docs/reference/ml/anomaly-detection/apis/find-file-structure.asciidoc
+++ b/docs/reference/ml/anomaly-detection/apis/find-file-structure.asciidoc
@@ -11,11 +11,13 @@ experimental[]
 
				 Finds the structure of a text file. The text file must contain data that is
			
 
				 suitable to be ingested into {es}.
			
 
				 
			
 
				+
			
 
				 [[ml-find-file-structure-request]]
			
 
				 ==== {api-request-title}
			
 
				 
			
 
				 `POST _ml/find_file_structure`
			
 
				 
			
 
				+
			
 
				 [[ml-find-file-structure-prereqs]]
			
 
				 ==== {api-prereq-title}
			
 
				 
			
@@ -23,6 +25,7 @@ suitable to be ingested into {es}.
 
				 `monitor` cluster privileges to use this API. See
			
 
				 <<security-privileges>>.
			
 
				 
			
 
				+
			
 
				 [[ml-find-file-structure-desc]]
			
 
				 ==== {api-description-title}
			
 
				 
			
@@ -55,41 +58,42 @@ specify the `explain` query parameter. It causes an `explanation` to appear in
 
				 the response, which should help in determining why the returned structure was
			
 
				 chosen.
			
 
				 
			
 
				+
			
 
				 [[ml-find-file-structure-query-parms]]
			
 
				 ==== {api-query-parms-title}
			
 
				 
			
 
				 `charset`::
			
 
				-  (string) Optional. The file's character set. It must be a character set that
			
 
				+  (Optional, string) The file's character set. It must be a character set that
			
 
				   is supported by the JVM that {es} uses. For example, `UTF-8`, `UTF-16LE`,
			
 
				   `windows-1252`, or `EUC-JP`. If this parameter is not specified, the structure
			
 
				   finder chooses an appropriate character set.
			
 
				 
			
 
				 `column_names`::
			
 
				-  (string) Optional. If you have set `format` to `delimited`, you can specify
			
 
				+  (Optional, string) If you have set `format` to `delimited`, you can specify
			
 
				   the column names in a comma-separated list. If this parameter is not specified,
			
 
				   the structure finder uses the column names from the header row of the file. If
			
 
				   the file does not have a header row, columns are named "column1", "column2",
			
 
				   "column3", etc.
			
 
				 
			
 
				 `delimiter`::
			
 
				-  (string) Optional. If you have set `format` to `delimited`, you can specify
			
 
				+  (Optional, string) If you have set `format` to `delimited`, you can specify
			
 
				   the character used to delimit the values in each row. Only a single character
			
 
				   is supported; the delimiter cannot have multiple characters. If this parameter
			
 
				   is not specified, the structure finder considers the following possibilities:
			
 
				   comma, tab, semi-colon, and pipe (`|`).
			
 
				 
			
 
				 `explain`::
			
 
				-  (boolean) Optional. If this parameter is set to `true`, the response includes
			
 
				+  (Optional, boolean) If this parameter is set to `true`, the response includes
			
 
				   a field named `explanation`, which is an array of strings that indicate how
			
 
				   the structure finder produced its result. The default value is `false`.
			
 
				 
			
 
				 `format`::
			
 
				-  (string) Optional. The high level structure of the file. Valid values are
			
 
				+  (Optional, string) The high-level structure of the file. Valid values are
			
 
				   `ndjson`, `xml`, `delimited`, and `semi_structured_text`. If this parameter is
			
 
				   not specified, the structure finder chooses one.
			
 
				 
			
 
				 `grok_pattern`::
			
 
				-  (string) Optional. If you have set `format` to `semi_structured_text`, you can
			
 
				+  (Optional, string) If you have set `format` to `semi_structured_text`, you can
			
 
				   specify a Grok pattern that is used to extract fields from every message in
			
 
				   the file. The name of the timestamp field in the Grok pattern must match what
			
 
				   is specified in the `timestamp_field` parameter. If that parameter is not
			
@@ -98,20 +102,20 @@ chosen.
 
				   a Grok pattern.
			
 
				 
			
 
				 `has_header_row`::
			
 
				-  (boolean) Optional. If you have set `format` to `delimited`, you can use this
			
 
				+  (Optional, boolean) If you have set `format` to `delimited`, you can use this
			
 
				   parameter to indicate whether the column names are in the first row of the
			
 
				   file. If this parameter is not specified, the structure finder guesses based
			
 
				   on the similarity of the first row of the file to other rows.
			
 
				 
			
 
				 `line_merge_size_limit`::
			
 
				-  (unsigned integer) Optional. The maximum number of characters in a message
			
 
				+  (Optional, unsigned integer) The maximum number of characters in a message
			
 
				   when lines are merged to form messages while analyzing semi-structured files.
			
 
				   The default is `10000`. If you have extremely long messages you may need to
			
 
				   increase this, but be aware that this may lead to very long processing times
			
 
				   if the way to group lines into messages is misdetected.
			
 
				 
			
 
				 `lines_to_sample`::
			
 
				-  (unsigned integer) Optional. The number of lines to include in the structural
			
 
				+  (Optional, unsigned integer) The number of lines to include in the structural
			
 
				   analysis, starting from the beginning of the file. The minimum is 2; the
			
 
				   default is `1000`. If the value of this parameter is greater than the number
			
 
				   of lines in the file, the analysis proceeds (as long as there are at least two
			
@@ -127,7 +131,7 @@ to request analysis of 100000 lines to achieve some variety.
 
				 --
			
 
				 
			
 
				 `quote`::
			
 
				-  (string) Optional. If you have set `format` to `delimited`, you can specify
			
 
				+  (Optional, string) If you have set `format` to `delimited`, you can specify
			
 
				   the character used to quote the values in each row if they contain newlines or
			
 
				   the delimiter character. Only a single character is supported. If this
			
 
				   parameter is not specified, the default value is a double quote (`"`). If your
			
@@ -135,18 +139,18 @@ to request analysis of 100000 lines to achieve some variety.
 
				   argument to a character that does not appear anywhere in the sample.
			
 
				 
			
 
				 `should_trim_fields`::
			
 
				-  (boolean) Optional. If you have set `format` to `delimited`, you can specify
			
 
				+  (Optional, boolean) If you have set `format` to `delimited`, you can specify
			
 
				   whether values between delimiters should have whitespace trimmed from them. If
			
 
				   this parameter is not specified and the delimiter is pipe (`|`), the default
			
 
				   value is `true`. Otherwise, the default value is `false`.
			
 
				 
			
 
				 `timeout`::
			
 
				-  (time) Optional. Sets the maximum amount of time that the structure analysis
			
 
				-  make take. If the analysis is still running when the timeout expires then it
			
 
				-  will be aborted. The default value is 25 seconds.
			
 
				+  (Optional, <<time-units,time units>>) Sets the maximum amount of time that the 
			
 
				+  structure analysis may take. If the analysis is still running when the
			
 
				+  timeout expires then it will be aborted. The default value is 25 seconds.
			
 
				 
			
 
				 `timestamp_field`::
			
 
				-  (string) Optional. The name of the field that contains the primary timestamp
			
 
				+  (Optional, string) The name of the field that contains the primary timestamp
			
 
				   of each record in the file. In particular, if the file were ingested into an
			
 
				   index, this is the field that would be used to populate the `@timestamp` field.
			
 
				 +
			
@@ -159,16 +163,16 @@ also specified.
 
				 For structured file formats, if you specify this parameter, the field must exist
			
 
				 within the file.
			
 
				 
			
 
				-If this parameter is not specified, the structure finder makes a decision about which
			
 
				-field (if any) is the primary timestamp field. For structured file formats, it
			
 
				-is not compulsory to have a timestamp in the file.
			
 
				+If this parameter is not specified, the structure finder makes a decision about 
			
 
				+which field (if any) is the primary timestamp field. For structured file 
			
 
				+formats, it is not compulsory to have a timestamp in the file.
			
 
				 --
			
 
				 
			
 
				 `timestamp_format`::
			
 
				-  (string) Optional. The Java time format of the timestamp field in the file. +
			
 
				+  (Optional, string) The Java time format of the timestamp field in the file.
			
 
				 +
			
 
				 --
			
 
				-NOTE: Only a subset of Java time format letter groups are supported:
			
 
				+Only a subset of Java time format letter groups are supported:
			
 
				 
			
 
				 * `a`
			
 
				 * `d`
			
@@ -206,6 +210,20 @@ structure finder does not consider by default.
 
				 If this parameter is not specified, the structure finder chooses the best
			
 
				 format from a built-in set.
			
 
				 
			
 
				+The following table provides the appropriate `timestamp_format` values for some example timestamps:
			
 
				+
			
 
				+|===
			
 
				+| Timestamp format           | Example timestamp
			
 
				+
			
 
				+| yyyy-MM-dd HH:mm:ssZ       | 2019-04-20 13:15:22+0000
			
 
				+| EEE, d MMM yyyy HH:mm:ss Z | Sat, 20 Apr 2019 13:15:22 +0000    
			
 
				+| dd.MM.yy HH:mm:ss.SSS      | 20.04.19 13:15:22.285
			
 
				+|===
			
 
				+
			
 
				+See 
			
 
				+https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html[the Java date/time format documentation]
			
 
				+for more information about date and time format syntax.
			
 
				+
			
 
				 --
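
The query parameters described above are passed in the URL, so values such as a `timestamp_format` that contain spaces or colons need to be URL-encoded. The following is a minimal sketch of building such a URL with the Python standard library. The helper name `find_file_structure_url` and the host `http://localhost:9200` are assumptions made for illustration; only the endpoint path and the parameter names come from this page.

```python
from urllib.parse import urlencode


def find_file_structure_url(base_url, **params):
    """Return the find_file_structure URL with query parameters appended.

    Hypothetical helper for illustration; it only builds a string and
    does not contact a cluster.
    """
    # Send booleans as lowercase true/false, the form used on this page.
    encoded = {
        name: (str(value).lower() if isinstance(value, bool) else value)
        for name, value in params.items()
    }
    query = urlencode(encoded)  # percent-encodes reserved characters
    endpoint = base_url.rstrip("/") + "/_ml/find_file_structure"
    return endpoint + ("?" + query if query else "")


# Assumed local cluster address; replace with your own.
url = find_file_structure_url(
    "http://localhost:9200",
    format="delimited",
    delimiter="|",
    has_header_row=True,
    timestamp_format="dd.MM.yy HH:mm:ss.SSS",
)
print(url)
```

The pipe delimiter and the colons in the timestamp format are percent-encoded, and the space becomes `+`, which is the usual form encoding for query strings.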
			
 
				 
			
 
				 [[ml-find-file-structure-request-body]]
			
@@ -219,6 +237,9 @@ size, which defaults to 100 Mb.
 
				 [[ml-find-file-structure-examples]]
			
 
				 ==== {api-examples-title}
			
 
				 
			
 
				+[[ml-find-file-structure-example-nld-json]]
			
 
				+===== Ingesting newline-delimited JSON
			
 
				+
			
 
				 Suppose you have a newline-delimited JSON file that contains information about
			
 
				 some books. You can send the contents to the `find_file_structure` endpoint:
			
 
				 
			
@@ -527,6 +548,10 @@ If the request does not encounter errors, you receive the following result:
 
				      may provide clues that the data needs to be cleaned or transformed prior
			
 
				      to use by other {ml} functionality.
			
 
				 
			
 
				+
			
 
				+[[ml-find-file-structure-example-nyc]]
			
 
				+===== Finding the structure of NYC yellow cab example data
			
 
				+
			
 
				 The next example shows how it's possible to find the structure of some New York
			
 
				 City yellow cab trip data. The first `curl` command downloads the data, the
			
 
				 first 20000 lines of which are then piped into the `find_file_structure`
			
@@ -543,7 +568,7 @@ curl -s "s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-06.csv" | head
 
				 --
			
 
				 NOTE: The `Content-Type: application/json` header must be set even though in
			
 
				 this case the data is not JSON. (Alternatively the `Content-Type` can be set
			
 
				-to any other supported by Elasticsearch, but it must be set.)
			
 
				+to any other content type supported by {es}, but it must be set.)
			
 
				 --
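
The same request shape (a sample of the first lines of the file, posted with an explicit `Content-Type` header) can be sketched in Python. This is an illustration under stated assumptions, not part of the API: the function name `build_sample_request`, the in-memory sample data, and the host `http://localhost:9200` are hypothetical, and the request object is only constructed, never sent.

```python
import io
from itertools import islice
from urllib.request import Request


def build_sample_request(lines, url, max_lines=20000):
    """Build a POST request whose body is the first max_lines lines.

    Hypothetical helper; `lines` is any iterable of text lines, such as
    an open file.
    """
    body = "".join(islice(lines, max_lines)).encode("utf-8")
    # The Content-Type header must be set even when the body is not JSON.
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Tiny in-memory stand-in for a delimited file (assumed data).
sample = io.StringIO("a,b\n1,2\n3,4\n")
req = build_sample_request(
    sample,
    "http://localhost:9200/_ml/find_file_structure",
    max_lines=2,
)
print(req.get_method(), req.data)
```

Passing the resulting object to `urllib.request.urlopen` would send it to a running cluster at the assumed address.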
			
 
				 
			
 
				 If the request does not encounter errors, you receive the following result:
			
@@ -1333,6 +1358,10 @@ If the request does not encounter errors, you receive the following result:
 
				      necessary to supply the timezone they relate to. `need_client_timezone`
			
 
				      will be `false` for timestamp formats that include the timezone.
			
 
				 
			
 
				+
			
 
				+[[ml-find-file-structure-example-timeout]]
			
 
				+===== Setting the timeout parameter
			
 
				+
			
 
				 If you try to analyze a lot of data then the analysis will take a long time.
			
 
				 If you want to limit the amount of processing your {es} cluster performs for
			
 
				 a request, use the `timeout` query parameter. The analysis will be aborted and
			
@@ -1375,6 +1404,10 @@ and the timeout is measured from the time this endpoint starts to process the
 
				 data.
			
 
				 --
			
 
				 
			
 
				+
			
 
				+[[ml-find-file-structure-example-eslog]]
			
 
				+===== Analyzing {es} log files
			
 
				+
			
 
				 This is an example of analyzing {es}'s own log file:
			
 
				 
			
 
				 [source,js]
			
@@ -1523,6 +1556,10 @@ this:
 
				     and recognizable fields that appear in every analyzed message. In this case
			
 
				     the only field that was recognized beyond the timestamp was the log level.
			
 
				 
			
 
				+
			
 
				+[[ml-find-file-structure-example-grok]]
			
 
				+===== Specifying `grok_pattern` as a query parameter
			
 
				+
			
 
				 If you recognize more fields than the simple `grok_pattern` produced by the
			
 
				 structure finder unaided then you can resubmit the request specifying a more
			
 
				 advanced `grok_pattern` as a query parameter and the structure finder will