123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298 |
- [role="xpack"]
- [testenv="platinum"]
- [[ml-dfanalytics-resources]]
- === {dfanalytics-cap} job resources
- {dfanalytics-cap} resources relate to APIs such as <<put-dfanalytics>> and
- <<get-dfanalytics>>.
- [discrete]
- [[ml-dfanalytics-properties]]
- ==== {api-definitions-title}
- `analysis`::
- (object) The type of analysis that is performed on the `source`. For example:
- `outlier_detection` or `regression`. For more information, see
- <<dfanalytics-types>>.
-
- `analyzed_fields`::
- (Optional, object) Specify `includes` and/or `excludes` patterns to select
- which fields will be included in the analysis. If `analyzed_fields` is not set,
- only the relevant fields will be included. For example, all the numeric fields
- for {oldetection}. For the supported field types, see <<ml-put-dfanalytics-supported-fields>>.
- Also see the <<explain-dfanalytics>> which helps understand field selection.
-
- `includes`:::
- (Optional, array) An array of strings that defines the fields that will be included in
- the analysis.
-
- `excludes`:::
- (Optional, array) An array of strings that defines the fields that will be excluded
- from the analysis.
-
- [source,console]
- --------------------------------------------------
- PUT _ml/data_frame/analytics/loganalytics
- {
- "source": {
- "index": "logdata"
- },
- "dest": {
- "index": "logdata_out"
- },
- "analysis": {
- "outlier_detection": {
- }
- },
- "analyzed_fields": {
- "includes": [ "request.bytes", "response.counts.error" ],
- "excludes": [ "source.geo" ]
- }
- }
- --------------------------------------------------
- // TEST[setup:setup_logdata]
- `description`::
- (Optional, string) A description of the job.
- `dest`::
- (object) The destination configuration of the analysis.
-
- `index`:::
- (Required, string) Defines the _destination index_ to store the results of
- the {dfanalytics-job}.
-
- `results_field`:::
- (Optional, string) Defines the name of the field in which to store the
- results of the analysis. Default to `ml`.
- `id`::
- (string) The unique identifier for the {dfanalytics-job}. This identifier can
- contain lowercase alphanumeric characters (a-z and 0-9), hyphens, and
- underscores. It must start and end with alphanumeric characters. This property
- is informational; you cannot change the identifier for existing jobs.
-
- `model_memory_limit`::
- (string) The approximate maximum amount of memory resources that are
- permitted for analytical processing. The default value for {dfanalytics-jobs}
- is `1gb`. If your `elasticsearch.yml` file contains an
- `xpack.ml.max_model_memory_limit` setting, an error occurs when you try to
- create {dfanalytics-jobs} that have `model_memory_limit` values greater than
- that setting. For more information, see <<ml-settings>>.
- `source`::
- (object) The configuration of how to source the analysis data. It requires an `index`.
- Optionally, `query` and `_source` may be specified.
-
- `index`:::
- (Required, string or array) Index or indices on which to perform the
- analysis. It can be a single index or index pattern as well as an array of
- indices or patterns.
-
- `query`:::
- (Optional, object) The {es} query domain-specific language
- (<<query-dsl,DSL>>). This value corresponds to the query object in an {es}
- search POST body. All the options that are supported by {es} can be used,
- as this object is passed verbatim to {es}. By default, this property has
- the following value: `{"match_all": {}}`.
- `_source`:::
- (Optional, object) Specify `includes` and/or `excludes` patterns to select
- which fields will be present in the destination. Fields that are excluded
- cannot be included in the analysis.
-
- `includes`::::
- (array) An array of strings that defines the fields that will be included in
- the destination.
-
- `excludes`::::
- (array) An array of strings that defines the fields that will be excluded
- from the destination.
- [[dfanalytics-types]]
- ==== Analysis objects
- {dfanalytics-cap} resources contain `analysis` objects. For example, when you
- create a {dfanalytics-job}, you must define the type of analysis it performs.
- [discrete]
- [[oldetection-resources]]
- ==== {oldetection-cap} configuration objects
- An `outlier_detection` configuration object has the following properties:
- `compute_feature_influence`::
- (boolean) If `true`, the feature influence calculation is enabled. Defaults to
- `true`.
-
- `feature_influence_threshold`::
- (double) The minimum {olscore} that a document needs to have in order to
- calculate its {fiscore}. Value range: 0-1 (`0.1` by default).
- `method`::
- (string) Sets the method that {oldetection} uses. If the method is not set
- {oldetection} uses an ensemble of different methods and normalises and
- combines their individual {olscores} to obtain the overall {olscore}. We
- recommend to use the ensemble method. Available methods are `lof`, `ldof`,
- `distance_kth_nn`, `distance_knn`.
-
- `n_neighbors`::
- (integer) Defines the value for how many nearest neighbors each method of
- {oldetection} will use to calculate its {olscore}. When the value is not set,
- different values will be used for different ensemble members. This helps
- improve diversity in the ensemble. Therefore, only override this if you are
- confident that the value you choose is appropriate for the data set.
-
- `outlier_fraction`::
- (double) Sets the proportion of the data set that is assumed to be outlying prior to
- {oldetection}. For example, 0.05 means it is assumed that 5% of values are real outliers
- and 95% are inliers.
-
- `standardization_enabled`::
- (boolean) If `true`, then the following operation is performed on the columns
- before computing outlier scores: (x_i - mean(x_i)) / sd(x_i). Defaults to
- `true`. For more information, see
- https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)[this wiki page about standardization].
- [discrete]
- [[regression-resources]]
- ==== {regression-cap} configuration objects
- [source,console]
- --------------------------------------------------
- PUT _ml/data_frame/analytics/house_price_regression_analysis
- {
- "source": {
- "index": "houses_sold_last_10_yrs" <1>
- },
- "dest": {
- "index": "house_price_predictions" <2>
- },
- "analysis":
- {
- "regression": { <3>
- "dependent_variable": "price" <4>
- }
- }
- }
- --------------------------------------------------
- // TEST[skip:TBD]
- <1> Training data is taken from source index `houses_sold_last_10_yrs`.
- <2> Analysis results will be output to destination index
- `house_price_predictions`.
- <3> The regression analysis configuration object.
- <4> Regression analysis will use field `price` to train on. As no other
- parameters have been specified it will train on 100% of eligible data, store its
- prediction in destination index field `price_prediction` and use in-built
- hyperparameter optimization to give minimum validation errors.
- [float]
- [[regression-resources-standard]]
- ===== Standard parameters
- include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
- +
- --
- The data type of the field must be numeric.
- --
- include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]
- include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]
- include::{docdir}/ml/ml-shared.asciidoc[tag=randomize_seed]
- [float]
- [[regression-resources-advanced]]
- ===== Advanced parameters
- Advanced parameters are for fine-tuning {reganalysis}. They are set
- automatically by <<ml-hyperparameter-optimization,hyperparameter optimization>>
- to give minimum validation error. It is highly recommended to use the default
- values unless you fully understand the function of these parameters. If these
- parameters are not supplied, their values are automatically tuned to give
- minimum validation error.
- include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
- include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]
- include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]
- include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
- include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
- [discrete]
- [[classification-resources]]
- ==== {classification-cap} configuration objects
-
-
- [float]
- [[classification-resources-standard]]
- ===== Standard parameters
-
- include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
- +
- --
- The data type of the field must be numeric or boolean.
- --
-
- `num_top_classes`::
- (Optional, integer) Defines the number of categories for which the predicted
- probabilities are reported. It must be non-negative. If it is greater than the
- total number of categories (in the {version} version of the {stack}, it's two)
- to predict then we will report all category probabilities. Defaults to 2.
-
- include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]
- include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]
- include::{docdir}/ml/ml-shared.asciidoc[tag=randomize_seed]
- [float]
- [[classification-resources-advanced]]
- ===== Advanced parameters
- Advanced parameters are for fine-tuning {classanalysis}. They are set
- automatically by <<ml-hyperparameter-optimization,hyperparameter optimization>>
- to give minimum validation error. It is highly recommended to use the default
- values unless you fully understand the function of these parameters. If these
- parameters are not supplied, their values are automatically tuned to give
- minimum validation error.
- include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
- include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]
- include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]
- include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
- include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
- [[ml-hyperparameter-optimization]]
- ===== Hyperparameter optimization
- If you don't supply {regression} or {classification} parameters, hyperparameter
- optimization will be performed by default to set a value for the undefined
- parameters. The starting point is calculated for data dependent parameters by
- examining the loss on the training data. Subject to the size constraint, this
- operation provides an upper bound on the improvement in validation loss.
- A fixed number of rounds is used for optimization which depends on the number of
- parameters being optimized. The optimization starts with random search, then
- Bayesian optimization is performed that is targeting maximum expected
- improvement. If you override any parameters, then the optimization will
- calculate the value of the remaining parameters accordingly and use the value
- you provided for the overridden parameter. The number of rounds are reduced
- respectively. The validation error is estimated in each round by using 4-fold
- cross validation.
|