123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564 |
- [role="xpack"]
- [testenv="platinum"]
- [[put-dfanalytics]]
- === Create {dfanalytics-jobs} API
- [subs="attributes"]
- ++++
- <titleabbrev>Create {dfanalytics-jobs}</titleabbrev>
- ++++
- Instantiates a {dfanalytics-job}.
- experimental[]
- [[ml-put-dfanalytics-request]]
- ==== {api-request-title}
- `PUT _ml/data_frame/analytics/<data_frame_analytics_id>`
- [[ml-put-dfanalytics-prereq]]
- ==== {api-prereq-title}
- If the {es} {security-features} are enabled, you must have the following
- built-in roles and privileges:
- * `machine_learning_admin`
- * `kibana_admin` (UI only)
- * source index: `read`, `view_index_metadata`
- * destination index: `read`, `create_index`, `manage` and `index`
- * cluster: `monitor` (UI only)
-
- For more information, see <<security-privileges>> and <<built-in-roles>>.
- +
- --
- NOTE: It is possible that secondary authorization headers are supplied in the
- request. If this is the case, the secondary authorization headers are used
- instead of the primary headers.
- --
- [[ml-put-dfanalytics-desc]]
- ==== {api-description-title}
- This API creates a {dfanalytics-job} that performs an analysis on the source
- index and stores the outcome in a destination index.
- The destination index will be automatically created if it does not exist. The
- `index.number_of_shards` and `index.number_of_replicas` settings of the source
- index will be copied over the destination index. When the source index matches
- multiple indices, these settings will be set to the maximum values found in the
- source indices.
- The mappings of the source indices are also attempted to be copied over
- to the destination index, however, if the mappings of any of the fields don't
- match among the source indices, the attempt will fail with an error message.
- If the destination index already exists, then it will be use as is. This makes
- it possible to set up the destination index in advance with custom settings
- and mappings.
- [discrete]
- [[ml-hyperparam-optimization]]
- ===== Hyperparameter optimization
- If you don't supply {regression} or {classification} parameters, _hyperparameter
- optimization_ occurs, which sets a value for the undefined parameters. The
- starting point is calculated for data dependent parameters by examining the loss
- on the training data. Subject to the size constraint, this operation provides an
- upper bound on the improvement in validation loss.
- A fixed number of rounds is used for optimization which depends on the number of
- parameters being optimized. The optimization starts with random search, then
- Bayesian optimization is performed that is targeting maximum expected
- improvement. If you override any parameters by explicitely setting it, the
- optimization calculates the value of the remaining parameters accordingly and
- uses the value you provided for the overridden parameter. The number of rounds
- are reduced respectively. The validation error is estimated in each round by
- using 4-fold cross validation.
- [[ml-put-dfanalytics-path-params]]
- ==== {api-path-parms-title}
- `<data_frame_analytics_id>`::
- (Required, string)
- include::{docdir}/ml/ml-shared.asciidoc[tag=job-id-data-frame-analytics-define]
- [role="child_attributes"]
- [[ml-put-dfanalytics-request-body]]
- ==== {api-request-body-title}
- `allow_lazy_start`::
- (Optional, boolean)
- include::{docdir}/ml/ml-shared.asciidoc[tag=allow-lazy-start]
- //Begin analysis
- `analysis`::
- (Required, object)
- The analysis configuration, which contains the information necessary to perform
- one of the following types of analysis: {classification}, {oldetection}, or
- {regression}.
- +
- .Properties of `analysis`
- [%collapsible%open]
- ====
- //Begin classification
- `classification`:::
- (Required^*^, object)
- The configuration information necessary to perform
- {ml-docs}/dfa-classification.html[{classification}].
- +
- TIP: Advanced parameters are for fine-tuning {classanalysis}. They are set
- automatically by <<ml-hyperparam-optimization,hyperparameter optimization>>
- to give minimum validation error. It is highly recommended to use the default
- values unless you fully understand the function of these parameters.
- +
- .Properties of `classification`
- [%collapsible%open]
- =====
- `class_assignment_objective`::::
- (Optional, string)
- include::{docdir}/ml/ml-shared.asciidoc[tag=class-assignment-objective]
- `dependent_variable`::::
- (Required, string)
- +
- include::{docdir}/ml/ml-shared.asciidoc[tag=dependent-variable]
- +
- The data type of the field must be numeric (`integer`, `short`, `long`, `byte`),
- categorical (`ip`, `keyword`, `text`), or boolean.
- `eta`::::
- (Optional, double)
- include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
- `feature_bag_fraction`::::
- (Optional, double)
- include::{docdir}/ml/ml-shared.asciidoc[tag=feature-bag-fraction]
- `gamma`::::
- (Optional, double)
- include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
- `lambda`::::
- (Optional, double)
- include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
- `max_trees`::::
- (Optional, integer)
- include::{docdir}/ml/ml-shared.asciidoc[tag=max-trees]
- `num_top_classes`::::
- (Optional, integer)
- include::{docdir}/ml/ml-shared.asciidoc[tag=num-top-classes]
- `num_top_feature_importance_values`::::
- (Optional, integer)
- Advanced configuration option. Specifies the maximum number of
- {ml-docs}/dfa-classification.html#dfa-classification-feature-importance[feature
- importance] values per document to return. By default, it is zero and no feature importance
- calculation occurs.
- `prediction_field_name`::::
- (Optional, string)
- include::{docdir}/ml/ml-shared.asciidoc[tag=prediction-field-name]
- `randomize_seed`::::
- (Optional, long)
- include::{docdir}/ml/ml-shared.asciidoc[tag=randomize-seed]
- `training_percent`::::
- (Optional, integer)
- include::{docdir}/ml/ml-shared.asciidoc[tag=training-percent]
- //End classification
- =====
- //Begin outlier_detection
- `outlier_detection`:::
- (Required^*^, object)
- The configuration information necessary to perform
- {ml-docs}/dfa-outlier-detection.html[{oldetection}]:
- +
- .Properties of `outlier_detection`
- [%collapsible%open]
- =====
- `compute_feature_influence`::::
- (Optional, boolean)
- include::{docdir}/ml/ml-shared.asciidoc[tag=compute-feature-influence]
-
- `feature_influence_threshold`::::
- (Optional, double)
- include::{docdir}/ml/ml-shared.asciidoc[tag=feature-influence-threshold]
- `method`::::
- (Optional, string)
- include::{docdir}/ml/ml-shared.asciidoc[tag=method]
-
- `n_neighbors`::::
- (Optional, integer)
- include::{docdir}/ml/ml-shared.asciidoc[tag=n-neighbors]
-
- `outlier_fraction`::::
- (Optional, double)
- include::{docdir}/ml/ml-shared.asciidoc[tag=outlier-fraction]
-
- `standardization_enabled`::::
- (Optional, boolean)
- include::{docdir}/ml/ml-shared.asciidoc[tag=standardization-enabled]
- //End outlier_detection
- =====
- //Begin regression
- `regression`:::
- (Required^*^, object)
- The configuration information necessary to perform
- {ml-docs}/dfa-regression.html[{regression}].
- +
- TIP: Advanced parameters are for fine-tuning {reganalysis}. They are set
- automatically by <<ml-hyperparam-optimization,hyperparameter optimization>>
- to give minimum validation error. It is highly recommended to use the default
- values unless you fully understand the function of these parameters.
- +
- .Properties of `regression`
- [%collapsible%open]
- =====
- `dependent_variable`::::
- (Required, string)
- +
- include::{docdir}/ml/ml-shared.asciidoc[tag=dependent-variable]
- +
- The data type of the field must be numeric.
- `eta`::::
- (Optional, double)
- include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
- `feature_bag_fraction`::::
- (Optional, double)
- include::{docdir}/ml/ml-shared.asciidoc[tag=feature-bag-fraction]
- `gamma`::::
- (Optional, double)
- include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
- `lambda`::::
- (Optional, double)
- include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
- `max_trees`::::
- (Optional, integer)
- include::{docdir}/ml/ml-shared.asciidoc[tag=max-trees]
- `num_top_feature_importance_values`::::
- (Optional, integer)
- Advanced configuration option. Specifies the maximum number of
- {ml-docs}/dfa-regression.html#dfa-regression-feature-importance[feature importance]
- values per document to return. By default, it is zero and no feature importance
- calculation occurs.
- `prediction_field_name`::::
- (Optional, string)
- include::{docdir}/ml/ml-shared.asciidoc[tag=prediction-field-name]
- `randomize_seed`::::
- (Optional, long)
- include::{docdir}/ml/ml-shared.asciidoc[tag=randomize-seed]
- `training_percent`::::
- (Optional, integer)
- include::{docdir}/ml/ml-shared.asciidoc[tag=training-percent]
- =====
- //End regression
- ====
- //End analysis
- //Begin analyzed_fields
- `analyzed_fields`::
- (Optional, object)
- include::{docdir}/ml/ml-shared.asciidoc[tag=analyzed-fields]
- +
- .Properties of `analyzed_fields`
- [%collapsible%open]
- ====
- `excludes`:::
- (Optional, array)
- include::{docdir}/ml/ml-shared.asciidoc[tag=analyzed-fields-excludes]
- `includes`:::
- (Optional, array)
- include::{docdir}/ml/ml-shared.asciidoc[tag=analyzed-fields-includes]
- //End analyzed_fields
- ====
- `description`::
- (Optional, string)
- include::{docdir}/ml/ml-shared.asciidoc[tag=description-dfa]
- `dest`::
- (Required, object)
- include::{docdir}/ml/ml-shared.asciidoc[tag=dest]
-
- `model_memory_limit`::
- (Optional, string)
- include::{docdir}/ml/ml-shared.asciidoc[tag=model-memory-limit-dfa]
-
- `source`::
- (object)
- include::{docdir}/ml/ml-shared.asciidoc[tag=source-put-dfa]
- [[ml-put-dfanalytics-example]]
- ==== {api-examples-title}
- [[ml-put-dfanalytics-example-preprocess]]
- ===== Preprocessing actions example
- The following example shows how to limit the scope of the analysis to certain
- fields, specify excluded fields in the destination index, and use a query to
- filter your data before analysis.
- [source,console]
- --------------------------------------------------
- PUT _ml/data_frame/analytics/model-flight-delays-pre
- {
- "source": {
- "index": [
- "kibana_sample_data_flights" <1>
- ],
- "query": { <2>
- "range": {
- "DistanceKilometers": {
- "gt": 0
- }
- }
- },
- "_source": { <3>
- "includes": [],
- "excludes": [
- "FlightDelay",
- "FlightDelayType"
- ]
- }
- },
- "dest": { <4>
- "index": "df-flight-delays",
- "results_field": "ml-results"
- },
- "analysis": {
- "regression": {
- "dependent_variable": "FlightDelayMin",
- "training_percent": 90
- }
- },
- "analyzed_fields": { <5>
- "includes": [],
- "excludes": [
- "FlightNum"
- ]
- },
- "model_memory_limit": "100mb"
- }
- --------------------------------------------------
- // TEST[skip:setup kibana sample data]
- <1> The source index to analyze.
- <2> This query filters out entire documents that will not be present in the
- destination index.
- <3> The `_source` object defines fields in the dataset that will be included or
- excluded in the destination index. In this case, `includes` does not specify any
- fields, so the default behavior takes place: all the fields of the source index
- will included except the ones that are explicitly specified in `excludes`.
- <4> Defines the destination index that contains the results of the analysis and
- the fields of the source index specified in the `_source` object. Also defines
- the name of the `results_field`.
- <5> Specifies fields to be included in or excluded from the analysis. This does
- not affect whether the fields will be present in the destination index, only
- affects whether they are used in the analysis.
- In this example, we can see that all the fields of the source index are included
- in the destination index except `FlightDelay` and `FlightDelayType` because
- these are defined as excluded fields by the `excludes` parameter of the
- `_source` object. The `FlightNum` field is included in the destination index,
- however it is not included in the analysis because it is explicitly specified as
- excluded field by the `excludes` parameter of the `analyzed_fields` object.
- [[ml-put-dfanalytics-example-od]]
- ===== {oldetection-cap} example
- The following example creates the `loganalytics` {dfanalytics-job}, the analysis
- type is `outlier_detection`:
- [source,console]
- --------------------------------------------------
- PUT _ml/data_frame/analytics/loganalytics
- {
- "description": "Outlier detection on log data",
- "source": {
- "index": "logdata"
- },
- "dest": {
- "index": "logdata_out"
- },
- "analysis": {
- "outlier_detection": {
- "compute_feature_influence": true,
- "outlier_fraction": 0.05,
- "standardization_enabled": true
- }
- }
- }
- --------------------------------------------------
- // TEST[setup:setup_logdata]
- The API returns the following result:
- [source,console-result]
- ----
- {
- "id": "loganalytics",
- "description": "Outlier detection on log data",
- "source": {
- "index": ["logdata"],
- "query": {
- "match_all": {}
- }
- },
- "dest": {
- "index": "logdata_out",
- "results_field": "ml"
- },
- "analysis": {
- "outlier_detection": {
- "compute_feature_influence": true,
- "outlier_fraction": 0.05,
- "standardization_enabled": true
- }
- },
- "model_memory_limit": "1gb",
- "create_time" : 1562265491319,
- "version" : "8.0.0",
- "allow_lazy_start" : false
- }
- ----
- // TESTRESPONSE[s/1562265491319/$body.$_path/]
- // TESTRESPONSE[s/"version" : "8.0.0"/"version" : $body.version/]
- [[ml-put-dfanalytics-example-r]]
- ===== {regression-cap} examples
- The following example creates the `house_price_regression_analysis`
- {dfanalytics-job}, the analysis type is `regression`:
- [source,console]
- --------------------------------------------------
- PUT _ml/data_frame/analytics/house_price_regression_analysis
- {
- "source": {
- "index": "houses_sold_last_10_yrs"
- },
- "dest": {
- "index": "house_price_predictions"
- },
- "analysis":
- {
- "regression": {
- "dependent_variable": "price"
- }
- }
- }
- --------------------------------------------------
- // TEST[skip:TBD]
- The API returns the following result:
- [source,console-result]
- ----
- {
- "id" : "house_price_regression_analysis",
- "source" : {
- "index" : [
- "houses_sold_last_10_yrs"
- ],
- "query" : {
- "match_all" : { }
- }
- },
- "dest" : {
- "index" : "house_price_predictions",
- "results_field" : "ml"
- },
- "analysis" : {
- "regression" : {
- "dependent_variable" : "price",
- "training_percent" : 100
- }
- },
- "model_memory_limit" : "1gb",
- "create_time" : 1567168659127,
- "version" : "8.0.0",
- "allow_lazy_start" : false
- }
- ----
- // TESTRESPONSE[s/1567168659127/$body.$_path/]
- // TESTRESPONSE[s/"version": "8.0.0"/"version": $body.version/]
- The following example creates a job and specifies a training percent:
- [source,console]
- --------------------------------------------------
- PUT _ml/data_frame/analytics/student_performance_mathematics_0.3
- {
- "source": {
- "index": "student_performance_mathematics"
- },
- "dest": {
- "index":"student_performance_mathematics_reg"
- },
- "analysis":
- {
- "regression": {
- "dependent_variable": "G3",
- "training_percent": 70, <1>
- "randomize_seed": 19673948271 <2>
- }
- }
- }
- --------------------------------------------------
- // TEST[skip:TBD]
- <1> The `training_percent` defines the percentage of the data set that will be
- used for training the model.
- <2> The `randomize_seed` is the seed used to randomly pick which data is used
- for training.
- [[ml-put-dfanalytics-example-c]]
- ===== {classification-cap} example
- The following example creates the `loan_classification` {dfanalytics-job}, the
- analysis type is `classification`:
- [source,console]
- --------------------------------------------------
- PUT _ml/data_frame/analytics/loan_classification
- {
- "source" : {
- "index": "loan-applicants"
- },
- "dest" : {
- "index": "loan-applicants-classified"
- },
- "analysis" : {
- "classification": {
- "dependent_variable": "label",
- "training_percent": 75,
- "num_top_classes": 2
- }
- }
- }
- --------------------------------------------------
- // TEST[skip:TBD]
|