
[DOCS] Adds regression analytics resources and examples to the data frame analytics APIs and the evaluation API (#46176)

* [DOCS] Adds regression analytics resources and examples to the data frame analytics APIs.
Co-Authored-By: Benjamin Trent <ben.w.trent@gmail.com>
Co-Authored-By: Tom Veasey <tveasey@users.noreply.github.com>
István Zoltán Szabó, 6 years ago
parent
commit
14227106b0

+ 125 - 7
docs/reference/ml/df-analytics/apis/dfanalyticsresources.asciidoc

@@ -12,7 +12,8 @@
 
 `analysis`::
   (object) The type of analysis that is performed on the `source`. For example: 
-  `outlier_detection`. For more information, see <<dfanalytics-types>>.
+  `outlier_detection` or `regression`. For more information, see 
+  <<dfanalytics-types>>.
   
 `analyzed_fields`::
   (object) You can specify both `includes` and/or `excludes` patterns. If 
@@ -98,15 +99,13 @@ PUT _ml/data_frame/analytics/loganalytics
 ==== Analysis objects
 
 {dfanalytics-cap} resources contain `analysis` objects. For example, when you
-create a {dfanalytics-job}, you must define the type of analysis it performs. 
-Currently, `outlier_detection` is the only available type of analysis, however, 
-other types will be added, for example `regression`.
-  
+create a {dfanalytics-job}, you must define the type of analysis it performs.
+
 [discrete]
 [[oldetection-resources]]
 ==== {oldetection-cap} configuration objects 
 
-An {oldetection} configuration object has the following properties:
+An `outlier_detection` configuration object has the following properties:
 
 `compute_feature_influence`::
   (boolean) If `true`, the feature influence calculation is enabled. Defaults to 
@@ -123,7 +122,7 @@ An {oldetection} configuration object has the following properties:
   recommend to use the ensemble method. Available methods are `lof`, `ldof`, 
   `distance_kth_nn`, `distance_knn`.
   
-`n_neighbors`::
+  `n_neighbors`::
   (integer) Defines the value for how many nearest neighbors each method of 
   {oldetection} will use to calculate its {olscore}. When the value is not set, 
   different values will be used for different ensemble members. This helps 
@@ -140,3 +139,122 @@ An {oldetection} configuration object has the following properties:
   before computing outlier scores: (x_i - mean(x_i)) / sd(x_i). Defaults to 
   `true`. For more information, see 
   https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)[this wiki page about standardization].
+
+
+[discrete]
+[[regression-resources]]
+==== {regression-cap} configuration objects
+
+[source,console]
+--------------------------------------------------
+PUT _ml/data_frame/analytics/house_price_regression_analysis
+{
+  "source": {
+    "index": "houses_sold_last_10_yrs" <1>
+  },
+  "dest": {
+    "index": "house_price_predictions" <2>
+  },
+  "analysis": {
+    "regression": { <3>
+      "dependent_variable": "price" <4>
+    }
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]
+
+<1> Training data is taken from source index `houses_sold_last_10_yrs`.
+<2> Analysis results will be output to destination index 
+`house_price_predictions`.
+<3> The regression analysis configuration object.
+<4> Regression analysis will use the `price` field to train on. As no other 
+parameters have been specified, it will train on 100% of eligible data, store 
+its predictions in the destination index field `price_prediction`, and use 
+in-built hyperparameter optimization to give minimum validation error.
+
+
+[float]
+[[regression-resources-standard]]
+===== Standard parameters
+
+`dependent_variable`::
+  (Required, string) Defines which field of the {dataframe} is to be predicted. 
+  This parameter is supplied by field name and must match one of the fields in 
+  the index being used to train. If this field is missing from a document, then 
+  that document will not be used for training, but a prediction with the trained 
+  model will be generated for it. The data type of the field must be numeric. It 
+  is also known as the continuous target variable.
+
+`prediction_field_name`::
+ (Optional, string) Defines the name of the prediction field in the results. 
+ Defaults to `<dependent_variable>_prediction`.
+ 
+`training_percent`::
+ (Optional, integer) Defines what percentage of the eligible documents will be 
+ used for training. Documents that are ignored by the analysis (for example 
+ those that contain arrays) are not included in the calculation of the 
+ percentage used. Defaults to `100`.
+
+
+[float]
+[[regression-resources-advanced]]
+===== Advanced parameters
+
+Advanced parameters are for fine-tuning {reganalysis}. It is highly recommended 
+to use the default values unless you fully understand the function of these 
+parameters. If these parameters are not supplied, their values are set 
+automatically by <<ml-hyperparameter-optimization,hyperparameter optimization>> 
+to give minimum validation error.
+
+`eta`::
+ (Optional, double) The shrinkage applied to the weights. Smaller values result 
+ in larger forests which have better generalization error. However, the smaller 
+ the value the longer the training will take. For more information, see 
+ https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article] 
+ about shrinkage.
+ 
+`feature_bag_fraction`::
+ (Optional, double) Defines the fraction of features that will be used when 
+ selecting a random bag for each candidate split. 
+ 
+`maximum_number_trees`::
+ (Optional, integer) Defines the maximum number of trees the forest is allowed 
+ to contain. The maximum value is 2000.
+
+`gamma`::
+ (Optional, double) Regularization parameter to prevent overfitting on the 
+ training dataset. Multiplies a linear penalty associated with the size of 
+ individual trees in the forest. The higher the value, the more training will 
+ prefer smaller trees. The smaller this parameter, the larger individual trees 
+ will be and the longer training will take.
+ 
+`lambda`::
+ (Optional, double) Regularization parameter to prevent overfitting on the 
+ training dataset. Multiplies an L2 regularisation term which applies to leaf 
+ weights of the individual trees in the forest. The higher the value the more 
+ training will attempt to keep leaf weights small. This makes the prediction 
+ function smoother at the expense of potentially not being able to capture 
+ relevant relationships between the features and the {depvar}. The smaller this 
+ parameter, the larger individual trees will be and the longer training will 
+ take.
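+
+As a sketch only, these advanced parameters can be supplied explicitly, which 
+takes them out of hyperparameter optimization. The job name and the values 
+below are arbitrary placeholders, not tuned recommendations:
+
+[source,console]
+--------------------------------------------------
+PUT _ml/data_frame/analytics/house_price_regression_tuned
+{
+  "source": { "index": "houses_sold_last_10_yrs" },
+  "dest": { "index": "house_price_predictions" },
+  "analysis": {
+    "regression": {
+      "dependent_variable": "price",
+      "eta": 0.05,
+      "feature_bag_fraction": 0.7,
+      "maximum_number_trees": 500,
+      "gamma": 1.0,
+      "lambda": 1.0
+    }
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]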
+
+
+[[ml-hyperparameter-optimization]]
+===== Hyperparameter optimization
+
+If you don't supply {regression} parameters, hyperparameter optimization will be 
+performed by default to set a value for the undefined parameters. The starting 
+point is calculated for data dependent parameters by examining the loss on the 
+training data. Subject to the size constraint, this operation provides an upper 
+bound on the improvement in validation loss.
+
+A fixed number of rounds is used for optimization which depends on the number of 
+parameters being optimized. The optimization starts with a random search, then 
+Bayesian optimization is performed, targeting maximum expected improvement. If 
+you override any parameters, the optimization calculates the values of the 
+remaining parameters accordingly and uses the values you provided for the 
+overridden parameters. The number of rounds is reduced accordingly. The 
+validation error is estimated in each round by using 4-fold cross-validation.

+ 55 - 15
docs/reference/ml/df-analytics/apis/evaluate-dfanalytics.asciidoc

@@ -27,15 +27,11 @@ information, see {stack-ov}/security-privileges.html[Security privileges] and
 [[ml-evaluate-dfanalytics-desc]]
 ==== {api-description-title}
 
-This API evaluates the executed analysis on an index that is already annotated
-with a field that contains the results of the analytics (the `ground truth`)
-for each {dataframe} row.
+The API packages together commonly used evaluation metrics for various types of 
+machine learning features. This has been designed for use on indexes created by 
+{dfanalytics}. Evaluation requires both a ground truth field and an analytics 
+result field to be present.
 
-Evaluation is typically done by calculating a set of metrics that capture various aspects of the quality of the results over the data for which you have the
-`ground truth`.
-
-For different types of analyses different metrics are suitable. This API
-packages together commonly used metrics for various analyses.
 
 [[ml-evaluate-dfanalytics-request-body]]
 ==== {api-request-body-title}
@@ -45,15 +41,20 @@ packages together commonly used metrics for various analyses.
   performed.
 
 `query`::
-  (Optional, object) Query used to select data from the index.
-  The {es} query domain-specific language (DSL). This value corresponds to the query
-  object in an {es} search POST body. By default, this property has the following
-  value: `{"match_all": {}}`.
+  (Optional, object) A query clause that retrieves a subset of data from the 
+  source index. See <<query-dsl>>.
 
 `evaluation`::
-  (Required, object) Defines the type of evaluation you want to perform. For example: 
-  `binary_soft_classification`. See <<ml-evaluate-dfanalytics-resources>>.
-  
+  (Required, object) Defines the type of evaluation you want to perform. See 
+  <<ml-evaluate-dfanalytics-resources>>.
++
+--
+Available evaluation types:
+
+* `binary_soft_classification`
+* `regression`
+--
+
+
 ////
 [[ml-evaluate-dfanalytics-results]]
 ==== {api-response-body-title}
@@ -74,6 +75,8 @@ packages together commonly used metrics for various analyses.
 [[ml-evaluate-dfanalytics-example]]
 ==== {api-examples-title}
 
+===== Binary soft classification
+
 [source,console]
 --------------------------------------------------
 POST _ml/data_frame/_evaluate
@@ -131,3 +134,40 @@ The API returns the following results:
   }
 }
 ----
+
+
+===== {regression-cap}
+
+[source,console]
+--------------------------------------------------
+POST _ml/data_frame/_evaluate
+{
+  "index": "house_price_predictions", <1>
+  "query": {
+      "bool": {
+        "filter": [
+          { "term":  { "ml.is_training": false } } <2>
+        ]
+      }
+  },
+  "evaluation": {
+    "regression": { 
+      "actual_field": "price", <3>
+      "predicted_field": "ml.price_prediction", <4>
+      "metrics": {
+        "r_squared": {},
+        "mean_squared_error": {}
+      }
+    }
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]
+
+<1> The output destination index from a {dfanalytics} {reganalysis}.
+<2> In this example, a test/train split (`training_percent`) was defined for the 
+{reganalysis}. This query limits evaluation to the test split only.
+<3> The ground truth value for the actual house price. This is required in order 
+to evaluate results.
+<4> The predicted value for house price calculated by the {reganalysis}.

+ 45 - 10
docs/reference/ml/df-analytics/apis/evaluateresources.asciidoc

@@ -12,7 +12,19 @@ Evaluation configuration objects relate to the <<evaluate-dfanalytics>>.
 `evaluation`::
   (object) Defines the type of evaluation you want to perform. The value of this 
   object can be different depending on the type of evaluation you want to 
-  perform. For example, it can contain <<binary-sc-resources>>.
+  perform.
++
+--
+Available evaluation types:
+
+* `binary_soft_classification`
+* `regression`
+--
+  
+`query`::
+  (object) A query clause that retrieves a subset of data from the source index. 
+  See <<query-dsl>>. The evaluation only applies to those documents of the index 
+  that match the query.
+
 
 [[binary-sc-resources]]
 ==== Binary soft classification configuration objects
@@ -27,18 +39,18 @@ probability whether each row is an outlier.
 ===== {api-definitions-title}
 
 `actual_field`::
-  (string) The field of the `index` which contains the `ground 
-  truth`. The data type of this field can be boolean or integer. If the data 
-  type is integer, the value has to be either `0` (false) or `1` (true).
+  (string) The field of the `index` which contains the `ground truth`. 
+  The data type of this field can be boolean or integer. If the data type is 
+  integer, the value has to be either `0` (false) or `1` (true).
 
 `predicted_probability_field`::
-  (string) The field of the `index` that defines the probability of whether the 
-  item belongs to the class in question or not. It's the field that contains the 
-  results of the analysis.
+  (string) The field of the `index` that defines the probability of 
+  whether the item belongs to the class in question or not. It's the field that 
+  contains the results of the analysis.
 
 `metrics`::
-  (object) Specifies the metrics that are used for the evaluation. Available 
-  metrics:
+  (object) Specifies the metrics that are used for the evaluation. 
+  Available metrics:
   
   `auc_roc`::
     (object) The AUC ROC (area under the curve of the receiver operating 
@@ -60,4 +72,27 @@ probability whether each row is an outlier.
     (`tp` - true positive, `fp` - false positive, `tn` - true negative, `fn` - 
     false negative) are calculated.
     Default value is {"at": [0.25, 0.50, 0.75]}.
-  
+
+    
+[[regression-evaluation-resources]]
+==== {regression-cap} evaluation objects
+
+{regression-cap} evaluation assesses the results of a {regression} analysis 
+which outputs predicted values.
+
+
+[discrete]
+[[regression-evaluation-resources-properties]]
+===== {api-definitions-title}
+
+`actual_field`::
+  (string) The field of the `index` which contains the `ground truth`. The data 
+  type of this field must be numerical.
+  
+`predicted_field`::
+  (string) The field in the `index` that contains the predicted value, 
+  in other words the results of the {regression} analysis.
+  
+`metrics`::
+  (object) Specifies the metrics that are used for the evaluation. Available 
+  metrics are `r_squared` and `mean_squared_error`.
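+
+Put together, a hypothetical evaluation request body using these objects might 
+look like the following; the index and field names are illustrative and mirror 
+the {reganalysis} example elsewhere in these docs:
+
+[source,console]
+--------------------------------------------------
+POST _ml/data_frame/_evaluate
+{
+  "index": "house_price_predictions",
+  "evaluation": {
+    "regression": {
+      "actual_field": "price",
+      "predicted_field": "ml.price_prediction",
+      "metrics": {
+        "r_squared": {},
+        "mean_squared_error": {}
+      }
+    }
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]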

+ 65 - 1
docs/reference/ml/df-analytics/apis/put-dfanalytics.asciidoc

@@ -121,6 +121,9 @@ and mappings.
 [[ml-put-dfanalytics-example]]
 ==== {api-examples-title}
 
+[[ml-put-dfanalytics-example-od]]
+===== {oldetection-cap} example
+
 The following example creates the `loganalytics` {dfanalytics-job}, the analysis 
 type is `outlier_detection`:
 
@@ -170,4 +173,65 @@ The API returns the following result:
 }
 ----
 // TESTRESPONSE[s/1562265491319/$body.$_path/]
-// TESTRESPONSE[s/"version": "8.0.0"/"version": $body.version/]
+// TESTRESPONSE[s/"version": "8.0.0"/"version": $body.version/]
+
+
+[[ml-put-dfanalytics-example-r]]
+===== {regression-cap} example
+
+The following example creates the `house_price_regression_analysis` 
+{dfanalytics-job}, the analysis type is `regression`:
+
+[source,console]
+--------------------------------------------------
+PUT _ml/data_frame/analytics/house_price_regression_analysis
+{
+  "source": {
+    "index": "houses_sold_last_10_yrs"
+  },
+  "dest": {
+    "index": "house_price_predictions"
+  },
+  "analysis": {
+    "regression": {
+      "dependent_variable": "price"
+    }
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]
+
+
+The API returns the following result:
+
+[source,console-result]
+----
+{
+  "id" : "house_price_regression_analysis",
+  "source" : {
+    "index" : [
+      "houses_sold_last_10_yrs"
+    ],
+    "query" : {
+      "match_all" : { }
+    }
+  },
+  "dest" : {
+    "index" : "house_price_predictions",
+    "results_field" : "ml"
+  },
+  "analysis" : {
+    "regression" : {
+      "dependent_variable" : "price",
+      "training_percent" : 100
+    }
+  },
+  "model_memory_limit" : "1gb",
+  "create_time" : 1567168659127,
+  "version" : "8.0.0"
+}
+----
+// TESTRESPONSE[s/1567168659127/$body.$_path/]
+// TESTRESPONSE[s/"version": "8.0.0"/"version": $body.version/]
+