|
@@ -20,13 +20,13 @@ experimental[]
|
|
|
[[ml-put-dfanalytics-prereq]]
|
|
|
== {api-prereq-title}
|
|
|
|
|
|
-If the {es} {security-features} are enabled, you must have the following
|
|
|
+If the {es} {security-features} are enabled, you must have the following
|
|
|
built-in roles and privileges:
|
|
|
|
|
|
* `machine_learning_admin`
|
|
|
* source indices: `read`, `view_index_metadata`
|
|
|
* destination index: `read`, `create_index`, `manage` and `index`
|
|
|
-
|
|
|
+
|
|
|
For more information, see <<built-in-roles>>, <<security-privileges>>, and
|
|
|
{ml-docs-setup-privileges}.
|
|
|
|
|
@@ -34,13 +34,13 @@ For more information, see <<built-in-roles>>, <<security-privileges>>, and
|
|
|
NOTE: The {dfanalytics-job} remembers which roles the user who created it had at
|
|
|
the time of creation. When you start the job, it performs the analysis using
|
|
|
those same roles. If you provide
|
|
|
-<<http-clients-secondary-authorization,secondary authorization headers>>,
|
|
|
+<<http-clients-secondary-authorization,secondary authorization headers>>,
|
|
|
those credentials are used instead.
|
|
|
|
|
|
[[ml-put-dfanalytics-desc]]
|
|
|
== {api-description-title}
|
|
|
|
|
|
-This API creates a {dfanalytics-job} that performs an analysis on the source
|
|
|
+This API creates a {dfanalytics-job} that performs an analysis on the source
|
|
|
indices and stores the outcome in a destination index.
|
|
|
|
|
|
If the destination index does not exist, it is created automatically when you
|
|
@@ -48,7 +48,7 @@ start the job. See <<start-dfanalytics>>.
|
|
|
|
|
|
If you supply only a subset of the {regression} or {classification} parameters,
|
|
|
{ml-docs}/hyperparameters.html[hyperparameter optimization] occurs. It
|
|
|
-determines a value for each of the undefined parameters.
|
|
|
+determines a value for each of the undefined parameters.
|
|
|
|
|
|
|
|
|
[[ml-put-dfanalytics-path-params]]
|
|
@@ -63,8 +63,8 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=job-id-data-frame-analytics-def
|
|
|
== {api-request-body-title}
|
|
|
|
|
|
`allow_lazy_start`::
|
|
|
-(Optional, boolean)
|
|
|
-Specifies whether this job can start when there is insufficient {ml} node
|
|
|
+(Optional, boolean)
|
|
|
+Specifies whether this job can start when there is insufficient {ml} node
|
|
|
capacity for it to be immediately assigned to a node. The default is `false`; if
|
|
|
a {ml} node with capacity to run the job cannot immediately be found, the API
|
|
|
returns an error. However, this is also subject to the cluster-wide
|
|
@@ -88,7 +88,7 @@ one of the following types of analysis: {classification}, {oldetection}, or
|
|
|
The configuration information necessary to perform
|
|
|
{ml-docs}/dfa-classification.html[{classification}].
|
|
|
+
|
|
|
-TIP: Advanced parameters are for fine-tuning {classanalysis}. They are set
|
|
|
+TIP: Advanced parameters are for fine-tuning {classanalysis}. They are set
|
|
|
automatically by hyperparameter optimization to give the minimum validation
|
|
|
error. It is highly recommended to use the default values unless you fully
|
|
|
understand the function of these parameters.
|
|
@@ -105,28 +105,32 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=class-assignment-objective]
|
|
|
+
|
|
|
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=dependent-variable]
|
|
|
+
|
|
|
-The data type of the field must be numeric (`integer`, `short`, `long`, `byte`),
|
|
|
+The data type of the field must be numeric (`integer`, `short`, `long`, `byte`),
|
|
|
categorical (`ip` or `keyword`), or boolean. There must be no more than 30
|
|
|
-different values in this field.
|
|
|
+different values in this field.
|
|
|
|
|
|
`eta`::::
|
|
|
-(Optional, double)
|
|
|
+(Optional, double)
|
|
|
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=eta]
|
|
|
|
|
|
`feature_bag_fraction`::::
|
|
|
-(Optional, double)
|
|
|
+(Optional, double)
|
|
|
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=feature-bag-fraction]
|
|
|
|
|
|
+`feature_processors`::::
|
|
|
+(Optional, list)
|
|
|
+include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=dfas-feature-processors]
|
|
|
+
|
|
|
`gamma`::::
|
|
|
-(Optional, double)
|
|
|
+(Optional, double)
|
|
|
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=gamma]
|
|
|
|
|
|
`lambda`::::
|
|
|
-(Optional, double)
|
|
|
+(Optional, double)
|
|
|
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=lambda]
|
|
|
|
|
|
`max_trees`::::
|
|
|
-(Optional, integer)
|
|
|
+(Optional, integer)
|
|
|
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=max-trees]
|
|
|
|
|
|
`num_top_classes`::::
|
|
@@ -138,11 +142,11 @@ categories, the API reports all category probabilities. Defaults to 2.
|
|
|
`num_top_feature_importance_values`::::
|
|
|
(Optional, integer)
|
|
|
Advanced configuration option. Specifies the maximum number of
|
|
|
-{ml-docs}/ml-feature-importance.html[{feat-imp}] values per document to return.
|
|
|
+{ml-docs}/ml-feature-importance.html[{feat-imp}] values per document to return.
|
|
|
By default, it is zero and no {feat-imp} calculation occurs.
|
|
|
|
|
|
`prediction_field_name`::::
|
|
|
-(Optional, string)
|
|
|
+(Optional, string)
|
|
|
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=prediction-field-name]
|
|
|
|
|
|
`randomize_seed`::::
|
|
@@ -164,27 +168,27 @@ The configuration information necessary to perform
|
|
|
[%collapsible%open]
|
|
|
=====
|
|
|
`compute_feature_influence`::::
|
|
|
-(Optional, boolean)
|
|
|
+(Optional, boolean)
|
|
|
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=compute-feature-influence]
|
|
|
-
|
|
|
-`feature_influence_threshold`::::
|
|
|
-(Optional, double)
|
|
|
+
|
|
|
+`feature_influence_threshold`::::
|
|
|
+(Optional, double)
|
|
|
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=feature-influence-threshold]
|
|
|
|
|
|
`method`::::
|
|
|
(Optional, string)
|
|
|
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=method]
|
|
|
-
|
|
|
+
|
|
|
`n_neighbors`::::
|
|
|
(Optional, integer)
|
|
|
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=n-neighbors]
|
|
|
-
|
|
|
+
|
|
|
`outlier_fraction`::::
|
|
|
-(Optional, double)
|
|
|
+(Optional, double)
|
|
|
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=outlier-fraction]
|
|
|
-
|
|
|
+
|
|
|
`standardization_enabled`::::
|
|
|
-(Optional, boolean)
|
|
|
+(Optional, boolean)
|
|
|
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=standardization-enabled]
|
|
|
//End outlier_detection
|
|
|
=====
|
|
@@ -194,7 +198,7 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=standardization-enabled]
|
|
|
The configuration information necessary to perform
|
|
|
{ml-docs}/dfa-regression.html[{regression}].
|
|
|
+
|
|
|
-TIP: Advanced parameters are for fine-tuning {reganalysis}. They are set
|
|
|
+TIP: Advanced parameters are for fine-tuning {reganalysis}. They are set
|
|
|
automatically by hyperparameter optimization to give the minimum validation
|
|
|
error. It is highly recommended to use the default values unless you fully
|
|
|
understand the function of these parameters.
|
|
@@ -217,20 +221,24 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=eta]
|
|
|
(Optional, double)
|
|
|
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=feature-bag-fraction]
|
|
|
|
|
|
+`feature_processors`::::
|
|
|
+(Optional, list)
|
|
|
+include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=dfas-feature-processors]
|
|
|
+
|
|
|
`gamma`::::
|
|
|
-(Optional, double)
|
|
|
+(Optional, double)
|
|
|
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=gamma]
|
|
|
|
|
|
`lambda`::::
|
|
|
-(Optional, double)
|
|
|
+(Optional, double)
|
|
|
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=lambda]
|
|
|
|
|
|
`loss_function`::::
|
|
|
(Optional, string)
|
|
|
-The loss function used during {regression}. Available options are `mse` (mean
|
|
|
-squared error), `msle` (mean squared logarithmic error), `huber` (Pseudo-Huber
|
|
|
-loss). Defaults to `mse`. Refer to
|
|
|
-{ml-docs}/dfa-regression.html#dfa-regression-lossfunction[Loss functions for {regression} analyses]
|
|
|
+The loss function used during {regression}. Available options are `mse` (mean
|
|
|
+squared error), `msle` (mean squared logarithmic error), `huber` (Pseudo-Huber
|
|
|
+loss). Defaults to `mse`. Refer to
|
|
|
+{ml-docs}/dfa-regression.html#dfa-regression-lossfunction[Loss functions for {regression} analyses]
|
|
|
to learn more.
|
|
|
|
|
|
`loss_function_parameter`::::
|
|
@@ -238,13 +246,13 @@ to learn more.
|
|
|
A positive number that is used as a parameter to the `loss_function`.
|
|
|
|
|
|
`max_trees`::::
|
|
|
-(Optional, integer)
|
|
|
+(Optional, integer)
|
|
|
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=max-trees]
|
|
|
|
|
|
`num_top_feature_importance_values`::::
|
|
|
(Optional, integer)
|
|
|
Advanced configuration option. Specifies the maximum number of
|
|
|
-{ml-docs}/ml-feature-importance.html[{feat-imp}] values per document to return.
|
|
|
+{ml-docs}/ml-feature-importance.html[{feat-imp}] values per document to return.
|
|
|
By default, it is zero and no {feat-imp} calculation occurs.
|
|
|
|
|
|
`prediction_field_name`::::
|
|
@@ -266,31 +274,31 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=training-percent]
|
|
|
//Begin analyzed_fields
|
|
|
`analyzed_fields`::
|
|
|
(Optional, object)
|
|
|
-Specify `includes` and/or `excludes` patterns to select which fields will be
|
|
|
-included in the analysis. The patterns specified in `excludes` are applied last,
|
|
|
-therefore `excludes` takes precedence. In other words, if the same field is
|
|
|
-specified in both `includes` and `excludes`, then the field will not be included
|
|
|
+Specify `includes` and/or `excludes` patterns to select which fields will be
|
|
|
+included in the analysis. The patterns specified in `excludes` are applied last,
|
|
|
+therefore `excludes` takes precedence. In other words, if the same field is
|
|
|
+specified in both `includes` and `excludes`, then the field will not be included
|
|
|
in the analysis.
|
|
|
+
|
|
|
--
|
|
|
[[dfa-supported-fields]]
|
|
|
The supported fields for each type of analysis are as follows:
|
|
|
|
|
|
-* {oldetection-cap} requires numeric or boolean data to analyze. The algorithms
|
|
|
-don't support missing values therefore fields that have data types other than
|
|
|
-numeric or boolean are ignored. Documents where included fields contain missing
|
|
|
-values, null values, or an array are also ignored. Therefore the `dest` index
|
|
|
+* {oldetection-cap} requires numeric or boolean data to analyze. The algorithms
|
|
|
+don't support missing values therefore fields that have data types other than
|
|
|
+numeric or boolean are ignored. Documents where included fields contain missing
|
|
|
+values, null values, or an array are also ignored. Therefore the `dest` index
|
|
|
may contain documents that don't have an {olscore}.
|
|
|
-* {regression-cap} supports fields that are numeric, `boolean`, `text`,
|
|
|
-`keyword`, and `ip`. It is also tolerant of missing values. Fields that are
|
|
|
-supported are included in the analysis, other fields are ignored. Documents
|
|
|
-where included fields contain an array with two or more values are also
|
|
|
-ignored. Documents in the `dest` index that don’t contain a results field are
|
|
|
+* {regression-cap} supports fields that are numeric, `boolean`, `text`,
|
|
|
+`keyword`, and `ip`. It is also tolerant of missing values. Fields that are
|
|
|
+supported are included in the analysis, other fields are ignored. Documents
|
|
|
+where included fields contain an array with two or more values are also
|
|
|
+ignored. Documents in the `dest` index that don’t contain a results field are
|
|
|
not included in the {reganalysis}.
|
|
|
* {classification-cap} supports fields that are numeric, `boolean`, `text`,
|
|
|
-`keyword`, and `ip`. It is also tolerant of missing values. Fields that are
|
|
|
+`keyword`, and `ip`. It is also tolerant of missing values. Fields that are
|
|
|
supported are included in the analysis, other fields are ignored. Documents
|
|
|
-where included fields contain an array with two or more values are also ignored.
|
|
|
+where included fields contain an array with two or more values are also ignored.
|
|
|
Documents in the `dest` index that don’t contain a results field are not
|
|
|
included in the {classanalysis}. {classanalysis-cap} can be improved by mapping
|
|
|
ordinal variable values to a single number. For example, in the case of age ranges,
|
|
@@ -312,7 +320,7 @@ analysis. You do not need to add fields with unsupported data types to
|
|
|
|
|
|
`includes`:::
|
|
|
(Optional, array)
|
|
|
-An array of strings that defines the fields that will be included in the
|
|
|
+An array of strings that defines the fields that will be included in the
|
|
|
analysis.
|
|
|
//End analyzed_fields
|
|
|
====
|
|
@@ -332,16 +340,16 @@ The default value is `1`. Using more threads may decrease the time
|
|
|
necessary to complete the analysis at the cost of using more CPU.
|
|
|
Note that the process may use additional threads for operational
|
|
|
functionality other than the analysis itself.
|
|
|
-
|
|
|
+
|
|
|
`model_memory_limit`::
|
|
|
(Optional, string)
|
|
|
-The approximate maximum amount of memory resources that are permitted for
|
|
|
-analytical processing. The default value for {dfanalytics-jobs} is `1gb`. If
|
|
|
-your `elasticsearch.yml` file contains an `xpack.ml.max_model_memory_limit`
|
|
|
-setting, an error occurs when you try to create {dfanalytics-jobs} that have
|
|
|
-`model_memory_limit` values greater than that setting. For more information, see
|
|
|
+The approximate maximum amount of memory resources that are permitted for
|
|
|
+analytical processing. The default value for {dfanalytics-jobs} is `1gb`. If
|
|
|
+your `elasticsearch.yml` file contains an `xpack.ml.max_model_memory_limit`
|
|
|
+setting, an error occurs when you try to create {dfanalytics-jobs} that have
|
|
|
+`model_memory_limit` values greater than that setting. For more information, see
|
|
|
<<ml-settings>>.
|
|
|
-
|
|
|
+
|
|
|
`source`::
|
|
|
(object)
|
|
|
The configuration of how to source the analysis data. It requires an `index`.
|
|
@@ -355,7 +363,7 @@ Optionally, `query` and `_source` may be specified.
|
|
|
It can be a single index or index pattern as well as an array of indices or
|
|
|
patterns.
|
|
|
+
|
|
|
-WARNING: If your source indices contain documents with the same IDs, only the
|
|
|
+WARNING: If your source indices contain documents with the same IDs, only the
|
|
|
document that is indexed last appears in the destination index.
|
|
|
|
|
|
`query`:::
|
|
@@ -376,7 +384,7 @@ included in the analysis.
|
|
|
`includes`::::
|
|
|
(array) An array of strings that defines the fields that will be included in the
|
|
|
destination.
|
|
|
-
|
|
|
+
|
|
|
`excludes`::::
|
|
|
(array) An array of strings that defines the fields that will be excluded from
|
|
|
the destination.
|
|
@@ -389,8 +397,8 @@ the destination.
|
|
|
[[ml-put-dfanalytics-example-preprocess]]
|
|
|
=== Preprocessing actions example
|
|
|
|
|
|
-The following example shows how to limit the scope of the analysis to certain
|
|
|
-fields, specify excluded fields in the destination index, and use a query to
|
|
|
+The following example shows how to limit the scope of the analysis to certain
|
|
|
+fields, specify excluded fields in the destination index, and use a query to
|
|
|
filter your data before analysis.
|
|
|
|
|
|
[source,console]
|
|
@@ -403,7 +411,7 @@ PUT _ml/data_frame/analytics/model-flight-delays-pre
|
|
|
],
|
|
|
"query": { <2>
|
|
|
"range": {
|
|
|
- "DistanceKilometers": {
|
|
|
+ "DistanceKilometers": {
|
|
|
"gt": 0
|
|
|
}
|
|
|
}
|
|
@@ -428,7 +436,7 @@ PUT _ml/data_frame/analytics/model-flight-delays-pre
|
|
|
},
|
|
|
"analyzed_fields": { <5>
|
|
|
"includes": [],
|
|
|
- "excludes": [
|
|
|
+ "excludes": [
|
|
|
"FlightNum"
|
|
|
]
|
|
|
},
|
|
@@ -438,29 +446,29 @@ PUT _ml/data_frame/analytics/model-flight-delays-pre
|
|
|
// TEST[skip:setup kibana sample data]
|
|
|
|
|
|
<1> Source index to analyze.
|
|
|
-<2> This query filters out entire documents that will not be present in the
|
|
|
+<2> This query filters out entire documents that will not be present in the
|
|
|
destination index.
|
|
|
-<3> The `_source` object defines fields in the dataset that will be included or
|
|
|
-excluded in the destination index.
|
|
|
-<4> Defines the destination index that contains the results of the analysis and
|
|
|
-the fields of the source index specified in the `_source` object. Also defines
|
|
|
+<3> The `_source` object defines fields in the data set that will be included or
|
|
|
+excluded in the destination index.
|
|
|
+<4> Defines the destination index that contains the results of the analysis and
|
|
|
+the fields of the source index specified in the `_source` object. Also defines
|
|
|
the name of the `results_field`.
|
|
|
-<5> Specifies fields to be included in or excluded from the analysis. This does
|
|
|
-not affect whether the fields will be present in the destination index, only
|
|
|
+<5> Specifies fields to be included in or excluded from the analysis. This does
|
|
|
+not affect whether the fields will be present in the destination index, only
|
|
|
affects whether they are used in the analysis.
|
|
|
|
|
|
-In this example, we can see that all the fields of the source index are included
|
|
|
-in the destination index except `FlightDelay` and `FlightDelayType` because
|
|
|
-these are defined as excluded fields by the `excludes` parameter of the
|
|
|
-`_source` object. The `FlightNum` field is included in the destination index,
|
|
|
-however it is not included in the analysis because it is explicitly specified as
|
|
|
+In this example, we can see that all the fields of the source index are included
|
|
|
+in the destination index except `FlightDelay` and `FlightDelayType` because
|
|
|
+these are defined as excluded fields by the `excludes` parameter of the
|
|
|
+`_source` object. The `FlightNum` field is included in the destination index,
|
|
|
+however it is not included in the analysis because it is explicitly specified as
|
|
|
an excluded field by the `excludes` parameter of the `analyzed_fields` object.
|
|
|
|
|
|
|
|
|
[[ml-put-dfanalytics-example-od]]
|
|
|
=== {oldetection-cap} example
|
|
|
|
|
|
-The following example creates the `loganalytics` {dfanalytics-job}, the analysis
|
|
|
+The following example creates the `loganalytics` {dfanalytics-job}, the analysis
|
|
|
type is `outlier_detection`:
|
|
|
|
|
|
[source,console]
|
|
@@ -524,7 +532,7 @@ The API returns the following result:
|
|
|
[[ml-put-dfanalytics-example-r]]
|
|
|
=== {regression-cap} examples
|
|
|
|
|
|
-The following example creates the `house_price_regression_analysis`
|
|
|
+The following example creates the `house_price_regression_analysis`
|
|
|
{dfanalytics-job}, the analysis type is `regression`:
|
|
|
|
|
|
[source,console]
|
|
@@ -537,7 +545,7 @@ PUT _ml/data_frame/analytics/house_price_regression_analysis
|
|
|
"dest": {
|
|
|
"index": "house_price_predictions"
|
|
|
},
|
|
|
- "analysis":
|
|
|
+ "analysis":
|
|
|
{
|
|
|
"regression": {
|
|
|
"dependent_variable": "price"
|
|
@@ -613,7 +621,7 @@ PUT _ml/data_frame/analytics/student_performance_mathematics_0.3
|
|
|
[[ml-put-dfanalytics-example-c]]
|
|
|
=== {classification-cap} example
|
|
|
|
|
|
-The following example creates the `loan_classification` {dfanalytics-job}, the
|
|
|
+The following example creates the `loan_classification` {dfanalytics-job}, the
|
|
|
analysis type is `classification`:
|
|
|
|
|
|
[source,console]
|