|
@@ -53,82 +53,200 @@ If the destination index already exists, then it will be use as is. This makes
|
|
|
it possible to set up the destination index in advance with custom settings
|
|
|
and mappings.
|
|
|
|
|
|
-[[ml-put-dfanalytics-supported-fields]]
|
|
|
-===== Supported fields
|
|
|
+[discrete]
|
|
|
+[[ml-hyperparam-optimization]]
|
|
|
+===== Hyperparameter optimization
|
|
|
+
|
|
|
+If you don't supply {regression} or {classification} parameters, _hyperparameter
|
|
|
+optimization_ occurs, which sets a value for the undefined parameters. The
|
|
|
+starting point is calculated for data-dependent parameters by examining the loss
|
|
|
+on the training data. Subject to the size constraint, this operation provides an
|
|
|
+upper bound on the improvement in validation loss.
|
|
|
+
|
|
|
+The number of optimization rounds is fixed and depends on the number of
|
|
|
+parameters being optimized. The optimization starts with random search, then
|
|
|
+continues with Bayesian optimization that targets maximum expected
|
|
|
+improvement. If you override any parameters,
|
|
|
+//TBD: What is meant by overriding them? Explicitly setting the parameter instead of letting it take the default?
|
|
|
+the optimization calculates the value of the remaining parameters accordingly
|
|
|
+and uses the value you provided for the overridden parameter. The number of
|
|
|
+rounds is reduced accordingly. The validation error is estimated in each round
|
|
|
+by using 4-fold cross-validation.
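+
+For example, a minimal sketch of a {regression} job that sets `eta` explicitly
+and leaves the remaining parameters to hyperparameter optimization might look
+like the following. The job ID, index names, dependent variable, and `eta`
+value are illustrative assumptions, not recommended settings:
+
+[source,console]
+--------------------------------------------------
+PUT _ml/data_frame/analytics/house_price_regression <1>
+{
+  "source": {
+    "index": "houses"
+  },
+  "dest": {
+    "index": "houses_predictions"
+  },
+  "analysis": {
+    "regression": {
+      "dependent_variable": "price",
+      "eta": 0.05 <2>
+    }
+  }
+}
+--------------------------------------------------
+// TEST[skip:illustrative index and field names]
+
+<1> The job ID, the `houses` and `houses_predictions` indices, and the `price`
+field are hypothetical names used only for this sketch.
+<2> An arbitrary example value. Because `eta` is set explicitly, it is used as
+provided and only the remaining undefined parameters are optimized.
+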
|
|
|
|
|
|
-====== {oldetection-cap}
|
|
|
+[[ml-put-dfanalytics-path-params]]
|
|
|
+==== {api-path-parms-title}
|
|
|
|
|
|
-{oldetection-cap} requires numeric or boolean data to analyze. The algorithms
|
|
|
-don't support missing values therefore fields that have data types other than
|
|
|
-numeric or boolean are ignored. Documents where included fields contain missing
|
|
|
-values, null values, or an array are also ignored. Therefore the `dest` index
|
|
|
-may contain documents that don't have an {olscore}.
|
|
|
+`<data_frame_analytics_id>`::
|
|
|
+(Required, string)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=job-id-data-frame-analytics-define]
|
|
|
|
|
|
+[[ml-put-dfanalytics-request-body]]
|
|
|
+==== {api-request-body-title}
|
|
|
|
|
|
-====== {regression-cap}
|
|
|
+`allow_lazy_start`::
|
|
|
+(Optional, boolean)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=allow-lazy-start]
|
|
|
+
|
|
|
+`analysis`::
|
|
|
+(Required, object)
|
|
|
+The analysis configuration, which contains the information necessary to perform
|
|
|
+one of the following types of analysis: {classification}, {oldetection}, or
|
|
|
+{regression}.
|
|
|
+//include::{docdir}/ml/ml-shared.asciidoc[tag=analysis]
|
|
|
+
|
|
|
+`analysis`.`classification`:::
|
|
|
+(Required^*^, object)
|
|
|
+The configuration information necessary to perform
|
|
|
+{ml-docs}/dfa-classification.html[{classification}].
|
|
|
++
|
|
|
+--
|
|
|
+TIP: Advanced parameters are for fine-tuning {classanalysis}. They are set
|
|
|
+automatically by <<ml-hyperparam-optimization,hyperparameter optimization>>
|
|
|
+to give minimum validation error. It is highly recommended to use the default
|
|
|
+values unless you fully understand the function of these parameters.
|
|
|
+
|
|
|
+--
|
|
|
+
|
|
|
+`analysis`.`classification`.`dependent_variable`::::
|
|
|
+(Required, string)
|
|
|
++
|
|
|
+--
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=dependent-variable]
|
|
|
|
|
|
-{regression-cap} supports fields that are numeric, `boolean`, `text`, `keyword`,
|
|
|
-and `ip`. It is also tolerant of missing values. Fields that are supported are
|
|
|
-included in the analysis, other fields are ignored. Documents where included
|
|
|
-fields contain an array with two or more values are also ignored. Documents in
|
|
|
-the `dest` index that don’t contain a results field are not included in the
|
|
|
- {reganalysis}.
|
|
|
+The data type of the field must be numeric (`integer`, `short`, `long`, `byte`),
|
|
|
+categorical (`ip`, `keyword`, `text`), or boolean.
|
|
|
+--
|
|
|
|
|
|
+`analysis`.`classification`.`eta`::::
|
|
|
+(Optional, double)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
|
|
|
|
|
|
-====== {classification-cap}
|
|
|
+`analysis`.`classification`.`feature_bag_fraction`::::
|
|
|
+(Optional, double)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=feature-bag-fraction]
|
|
|
|
|
|
-{classification-cap} supports fields that are numeric, `boolean`, `text`,
|
|
|
-`keyword`, and `ip`. It is also tolerant of missing values. Fields that are
|
|
|
-supported are included in the analysis, other fields are ignored. Documents
|
|
|
-where included fields contain an array with two or more values are also ignored.
|
|
|
-Documents in the `dest` index that don’t contain a results field are not
|
|
|
-included in the {classanalysis}.
|
|
|
+`analysis`.`classification`.`maximum_number_trees`::::
|
|
|
+(Optional, integer)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=maximum-number-trees]
|
|
|
|
|
|
-{classanalysis-cap} can be improved by mapping ordinal variable values to a
|
|
|
-single number. For example, in case of age ranges, you can model the values as
|
|
|
-"0-14" = 0, "15-24" = 1, "25-34" = 2, and so on.
|
|
|
+`analysis`.`classification`.`gamma`::::
|
|
|
+(Optional, double)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
|
|
|
|
|
|
+`analysis`.`classification`.`lambda`::::
|
|
|
+(Optional, double)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
|
|
|
|
|
|
-[[ml-put-dfanalytics-path-params]]
|
|
|
-==== {api-path-parms-title}
|
|
|
+`analysis`.`classification`.`num_top_classes`::::
|
|
|
+(Optional, integer)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=num-top-classes]
|
|
|
|
|
|
-`<data_frame_analytics_id>`::
|
|
|
+`analysis`.`classification`.`prediction_field_name`::::
|
|
|
+(Optional, string)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=prediction-field-name]
|
|
|
+
|
|
|
+`analysis`.`classification`.`randomize_seed`::::
|
|
|
+(Optional, long)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=randomize-seed]
|
|
|
+
|
|
|
+`analysis`.`classification`.`training_percent`::::
|
|
|
+(Optional, integer)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=training-percent]
|
|
|
+
|
|
|
+`analysis`.`outlier_detection`:::
|
|
|
+(Required^*^, object)
|
|
|
+The configuration information necessary to perform
|
|
|
+{ml-docs}/dfa-outlier-detection.html[{oldetection}].
|
|
|
+
|
|
|
+`analysis`.`outlier_detection`.`compute_feature_influence`::::
|
|
|
+(Optional, boolean)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=compute-feature-influence]
|
|
|
+
|
|
|
+`analysis`.`outlier_detection`.`feature_influence_threshold`::::
|
|
|
+(Optional, double)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=feature-influence-threshold]
|
|
|
+
|
|
|
+`analysis`.`outlier_detection`.`method`::::
|
|
|
+(Optional, string)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=method]
|
|
|
+
|
|
|
+`analysis`.`outlier_detection`.`n_neighbors`::::
|
|
|
+(Optional, integer)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=n-neighbors]
|
|
|
+
|
|
|
+`analysis`.`outlier_detection`.`outlier_fraction`::::
|
|
|
+(Optional, double)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=outlier-fraction]
|
|
|
+
|
|
|
+`analysis`.`outlier_detection`.`standardization_enabled`::::
|
|
|
+(Optional, boolean)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=standardization-enabled]
|
|
|
+
|
|
|
+`analysis`.`regression`:::
|
|
|
+(Required^*^, object)
|
|
|
+The configuration information necessary to perform
|
|
|
+{ml-docs}/dfa-regression.html[{regression}].
|
|
|
++
|
|
|
+--
|
|
|
+TIP: Advanced parameters are for fine-tuning {reganalysis}. They are set
|
|
|
+automatically by <<ml-hyperparam-optimization,hyperparameter optimization>>
|
|
|
+to give minimum validation error. It is highly recommended to use the default
|
|
|
+values unless you fully understand the function of these parameters.
|
|
|
+
|
|
|
+--
|
|
|
+
|
|
|
+`analysis`.`regression`.`dependent_variable`::::
|
|
|
(Required, string)
|
|
|
-include::{docdir}/ml/ml-shared.asciidoc[tag=job-id-data-frame-analytics-define]
|
|
|
++
|
|
|
+--
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=dependent-variable]
|
|
|
|
|
|
-[[ml-put-dfanalytics-request-body]]
|
|
|
-==== {api-request-body-title}
|
|
|
+The data type of the field must be numeric.
|
|
|
+--
|
|
|
|
|
|
-`analysis`::
|
|
|
-(Required, object)
|
|
|
-include::{docdir}/ml/ml-shared.asciidoc[tag=analysis]
|
|
|
+`analysis`.`regression`.`eta`::::
|
|
|
+(Optional, double)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
|
|
|
+
|
|
|
+`analysis`.`regression`.`feature_bag_fraction`::::
|
|
|
+(Optional, double)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=feature-bag-fraction]
|
|
|
+
|
|
|
+`analysis`.`regression`.`maximum_number_trees`::::
|
|
|
+(Optional, integer)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=maximum-number-trees]
|
|
|
+
|
|
|
+`analysis`.`regression`.`gamma`::::
|
|
|
+(Optional, double)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
|
|
|
+
|
|
|
+`analysis`.`regression`.`lambda`::::
|
|
|
+(Optional, double)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
|
|
|
+
|
|
|
+`analysis`.`regression`.`prediction_field_name`::::
|
|
|
+(Optional, string)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=prediction-field-name]
|
|
|
+
|
|
|
+`analysis`.`regression`.`training_percent`::::
|
|
|
+(Optional, integer)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=training-percent]
|
|
|
+
|
|
|
+`analysis`.`regression`.`randomize_seed`::::
|
|
|
+(Optional, long)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=randomize-seed]
|
|
|
|
|
|
`analyzed_fields`::
|
|
|
(Optional, object)
|
|
|
include::{docdir}/ml/ml-shared.asciidoc[tag=analyzed-fields]
|
|
|
|
|
|
-[source,console]
|
|
|
---------------------------------------------------
|
|
|
-PUT _ml/data_frame/analytics/loganalytics
|
|
|
-{
|
|
|
- "source": {
|
|
|
- "index": "logdata"
|
|
|
- },
|
|
|
- "dest": {
|
|
|
- "index": "logdata_out"
|
|
|
- },
|
|
|
- "analysis": {
|
|
|
- "outlier_detection": {
|
|
|
- }
|
|
|
- },
|
|
|
- "analyzed_fields": {
|
|
|
- "includes": [ "request.bytes", "response.counts.error" ],
|
|
|
- "excludes": [ "source.geo" ]
|
|
|
- }
|
|
|
-}
|
|
|
---------------------------------------------------
|
|
|
-// TEST[setup:setup_logdata]
|
|
|
+`analyzed_fields`.`excludes`:::
|
|
|
+(Optional, array)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=analyzed-fields-excludes]
|
|
|
|
|
|
+`analyzed_fields`.`includes`:::
|
|
|
+(Optional, array)
|
|
|
+include::{docdir}/ml/ml-shared.asciidoc[tag=analyzed-fields-includes]
|
|
|
|
|
|
`description`::
|
|
|
(Optional, string)
|
|
@@ -146,15 +264,9 @@ include::{docdir}/ml/ml-shared.asciidoc[tag=model-memory-limit-dfa]
|
|
|
(object)
|
|
|
include::{docdir}/ml/ml-shared.asciidoc[tag=source-put-dfa]
|
|
|
|
|
|
-`allow_lazy_start`::
|
|
|
-(Optional, boolean)
|
|
|
-include::{docdir}/ml/ml-shared.asciidoc[tag=allow-lazy-start]
|
|
|
-
|
|
|
-
|
|
|
[[ml-put-dfanalytics-example]]
|
|
|
==== {api-examples-title}
|
|
|
|
|
|
-
|
|
|
[[ml-put-dfanalytics-example-preprocess]]
|
|
|
===== Preprocessing actions example
|
|
|
|