
[DOCS] Adds classification type DFA API docs and ml-shared.asciidoc (#48241)

István Zoltán Szabó 6 years ago
parent commit 6c3fed8d4d

+ 74 - 54
docs/reference/ml/df-analytics/apis/dfanalyticsresources.asciidoc

@@ -18,13 +18,14 @@
 `analyzed_fields`::
   (object) You can specify `includes` patterns, `excludes` patterns, or both. 
   If `analyzed_fields` is not set, only the relevant fields will be included. For 
-  example all the numeric fields for {oldetection}.
-  
-  `analyzed_fields.includes`:::
+  example, all the numeric fields for {oldetection}. For the supported field 
+  types, see <<ml-put-dfanalytics-supported-fields>>.
+    
+  `includes`:::
     (array) An array of strings that defines the fields that will be included in 
     the analysis.
-    
-  `analyzed_fields.excludes`:::
+      
+  `excludes`:::
     (array) An array of strings that defines the fields that will be excluded 
     from the analysis.
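
As an illustrative sketch (the job, index, and field names here are hypothetical), an `analyzed_fields` object that excludes two fields might look like this:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression
{
  "source": { "index": "houses" },
  "dest": { "index": "houses-predicted" },
  "analysis": {
    "regression": { "dependent_variable": "price" }
  },
  "analyzed_fields": {
    "includes": [],
    "excludes": ["agent_id", "listing_url"]
  }
}
--------------------------------------------------
// TEST[skip:TBD]

An empty `includes` array means no explicit include filter is applied; the `excludes` patterns then remove `agent_id` and `listing_url` from the otherwise eligible fields.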
  
@@ -179,23 +180,15 @@ hyperparameter optimization to give minimum validation errors.
 [[regression-resources-standard]]
 ===== Standard parameters
 
-`dependent_variable`::
-  (Required, string) Defines which field of the document is to be predicted. 
-  This parameter is supplied by field name and must match one of the fields in 
-  the index being used to train. If this field is missing from a document, then 
-  that document will not be used for training, but a prediction with the trained 
-  model will be generated for it. The data type of the field must be numeric. It 
-  is also known as continuous target variable.
-
-`prediction_field_name`::
- (Optional, string) Defines the name of the prediction field in the results. 
- Defaults to `<dependent_variable>_prediction`.
- 
-`training_percent`::
- (Optional, integer) Defines what percentage of the eligible documents that will 
- be used for training. Documents that are ignored by the analysis (for example 
- those that contain arrays) won’t be included in the calculation for used 
- percentage. Defaults to `100`.
+include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
++
+--
+The data type of the field must be numeric.
+--
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]
 
 
 [float]
@@ -209,46 +202,73 @@ values unless you fully understand the function of these parameters. If these
 parameters are not supplied, their values are automatically tuned to give 
 minimum validation error.
 
-`eta`::
- (Optional, double) The shrinkage applied to the weights. Smaller values result 
- in larger forests which have better generalization error. However, the smaller 
- the value the longer the training will take. For more information, see 
- https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article] 
- about shrinkage.
+include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
+
+
+[discrete]
+[[classification-resources]]
+==== {classification-cap} configuration objects 
 
-`feature_bag_fraction`::
- (Optional, double) Defines the fraction of features that will be used when 
- selecting a random bag for each candidate split. 
 
-`maximum_number_trees`::
- (Optional, integer) Defines the maximum number of trees the forest is allowed 
- to contain. The maximum value is 2000.
-
-`gamma`::
- (Optional, double) Regularization parameter to prevent overfitting on the 
- training dataset. Multiplies a linear penalty associated with the size of 
- individual trees in the forest. The higher the value the more training will 
- prefer smaller trees. The smaller this parameter the larger individual trees 
- will be and the longer train will take.
+[float]
+[[classification-resources-standard]]
+===== Standard parameters
+ 
+include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
++
+--
+The data type of the field must be numeric or boolean.
+--
+  
+`num_top_classes`::
+  (Optional, integer) Defines the number of categories for which the predicted 
+  probabilities are reported. It must be non-negative. If it is greater than the 
+  total number of categories to predict (two in the {version} version of the 
+  {stack}), all category probabilities are reported. Defaults to `2`.
 
-`lambda`::
- (Optional, double) Regularization parameter to prevent overfitting on the 
- training dataset. Multiplies an L2 regularisation term which applies to leaf 
- weights of the individual trees in the forest. The higher the value the more 
- training will attempt to keep leaf weights small. This makes the prediction 
- function smoother at the expense of potentially not being able to capture 
- relevant relationships between the features and the {depvar}. The smaller this 
- parameter the larger individual trees will be and the longer train will take.
+include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]
+
+
+[float]
+[[classification-resources-advanced]]
+===== Advanced parameters
+
+Advanced parameters are for fine-tuning {classanalysis}. If these parameters 
+are not supplied, they are set automatically by 
+<<ml-hyperparameter-optimization,hyperparameter optimization>> to give minimum 
+validation error. It is highly recommended to use the default values unless you 
+fully understand the function of these parameters.
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
 
 
 [[ml-hyperparameter-optimization]]
 ===== Hyperparameter optimization
 
-If you don't supply {regression} parameters, hyperparameter optimization will be 
-performed by default to set a value for the undefined parameters. The starting 
-point is calculated for data dependent parameters by examining the loss on the 
-training data. Subject to the size constraint, this operation provides an upper 
-bound on the improvement in validation loss.
+If you don't supply {regression} or {classification} parameters, hyperparameter 
+optimization will be performed by default to set a value for the undefined 
+parameters. The starting point is calculated for data dependent parameters by 
+examining the loss on the training data. Subject to the size constraint, this 
+operation provides an upper bound on the improvement in validation loss.
 
 A fixed number of rounds is used for optimization, which depends on the number of 
 parameters being optimized. The optimization starts with random search, then 

+ 49 - 0
docs/reference/ml/df-analytics/apis/put-dfanalytics.asciidoc

@@ -67,6 +67,26 @@ an array with two or more values are also ignored. Documents in the `dest` index
 that don’t contain a results field are not included in the {reganalysis}.
 
 
+====== {classification-cap}
+
+{classification-cap} supports fields that are numeric, boolean, text, keyword, 
+and ip. It is also tolerant of missing values. Supported fields are included in 
+the analysis; other fields are ignored. Documents where included fields contain 
+an array with two or more values are also ignored. Documents in the `dest` 
+index that don’t contain a results field are not included in the 
+{classanalysis}.
+
+{classanalysis-cap} can be improved by mapping each value of an ordinal 
+variable to a single number. For example, in the case of age ranges, you can 
+model the values as "0-14" = 0, "15-24" = 1, "25-34" = 2, and so on.
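
One way to perform such a mapping (a sketch only; the pipeline, field, and value names are hypothetical) is an ingest pipeline with a script processor that converts the range labels to ordinal numbers before the data is analyzed:

[source,console]
--------------------------------------------------
PUT _ingest/pipeline/age-range-to-ordinal
{
  "processors": [
    {
      "script": {
        "source": "ctx.age_ordinal = params.mapping[ctx.age_range]",
        "params": {
          "mapping": { "0-14": 0, "15-24": 1, "25-34": 2 }
        }
      }
    }
  ]
}
--------------------------------------------------
// TEST[skip:TBD]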
+
+Fields that are highly correlated with the `dependent_variable` should be 
+excluded from the analysis. For example, if you use a multi-value field as the 
+`dependent_variable`, {es} maps it as both text and keyword, which results in 
+two fields (`field` and `field.keyword`). You must exclude the field with the 
+text mapping to get exact results from the analysis.
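
As a sketch of this exclusion (the field name `customer_notes` is hypothetical), you can list the text-mapped field in `analyzed_fields.excludes` so that only its keyword sub-field is analyzed:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loan_classification_excludes
{
  "source": { "index": "loan-applicants" },
  "dest": { "index": "loan-applicants-classified" },
  "analysis": {
    "classification": { "dependent_variable": "label" }
  },
  "analyzed_fields": {
    "excludes": ["customer_notes"]
  }
}
--------------------------------------------------
// TEST[skip:TBD]

Here `customer_notes` (the text mapping) is excluded, while `customer_notes.keyword` remains available to the analysis.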
+
+
 [[ml-put-dfanalytics-path-params]]
 ==== {api-path-parms-title}
 
@@ -154,6 +174,7 @@ that don’t contain a results field are not included in the {reganalysis}.
 [[ml-put-dfanalytics-example]]
 ==== {api-examples-title}
 
+
 [[ml-put-dfanalytics-example-od]]
 ===== {oldetection-cap} example
 
@@ -303,3 +324,31 @@ PUT _ml/data_frame/analytics/student_performance_mathematics_0.3
 
 
 <1> The `training_percent` defines the percentage of the data set that will be used 
 for training the model.
+
+
+[[ml-put-dfanalytics-example-c]]
+===== {classification-cap} example
+
+The following example creates the `loan_classification` {dfanalytics-job}; the 
+analysis type is `classification`:
+
+[source,console]
+--------------------------------------------------
+PUT _ml/data_frame/analytics/loan_classification
+{
+  "source" : {
+    "index": "loan-applicants"
+  },
+  "dest" : {
+    "index": "loan-applicants-classified"
+  },
+  "analysis" : {
+    "classification": {
+      "dependent_variable": "label",
+      "training_percent": 75,
+      "num_top_classes": 2
+    }
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]

+ 70 - 0
docs/reference/ml/ml-shared.asciidoc

@@ -0,0 +1,70 @@
+tag::dependent_variable[]
+`dependent_variable`::
+(Required, string) Defines which field of the document is to be predicted. 
+This parameter is supplied by field name and must match one of the fields in 
+the index being used to train. If this field is missing from a document, then 
+that document will not be used for training, but a prediction with the trained 
+model will be generated for it. In {regression}, it is also known as the 
+continuous target variable.
+end::dependent_variable[]
+
+
+tag::eta[]
+`eta`::
+(Optional, double) The shrinkage applied to the weights. Smaller values result 
+in larger forests, which have better generalization error. However, the smaller 
+the value, the longer the training will take. For more information, see 
+https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this Wikipedia 
+article] about shrinkage.
+end::eta[]
+
+
+tag::feature_bag_fraction[]
+`feature_bag_fraction`::
+(Optional, double) Defines the fraction of features that will be used when 
+selecting a random bag for each candidate split. 
+end::feature_bag_fraction[]
+
+
+tag::gamma[]
+`gamma`::
+(Optional, double) Regularization parameter to prevent overfitting on the 
+training dataset. Multiplies a linear penalty associated with the size of 
+individual trees in the forest. The higher the value, the more training will 
+prefer smaller trees. The smaller this parameter, the larger individual trees 
+will be and the longer training will take.
+end::gamma[]
+
+
+tag::lambda[] 
+`lambda`::
+(Optional, double) Regularization parameter to prevent overfitting on the 
+training dataset. Multiplies an L2 regularization term which applies to leaf 
+weights of the individual trees in the forest. The higher the value, the more 
+training will attempt to keep leaf weights small. This makes the prediction 
+function smoother at the expense of potentially not being able to capture 
+relevant relationships between the features and the {depvar}. The smaller this 
+parameter, the larger individual trees will be and the longer training will 
+take.
+end::lambda[]
+
+
+tag::maximum_number_trees[]
+`maximum_number_trees`::
+(Optional, integer) Defines the maximum number of trees the forest is allowed 
+to contain. The maximum value is 2000.
+end::maximum_number_trees[]
+
+
+tag::prediction_field_name[]
+`prediction_field_name`::
+(Optional, string) Defines the name of the prediction field in the results. 
+Defaults to `<dependent_variable>_prediction`.
+end::prediction_field_name[]
+
+
+tag::training_percent[]
+`training_percent`::
+(Optional, integer) Defines what percentage of the eligible documents will be 
+used for training. Documents that are ignored by the analysis (for example, 
+those that contain arrays) won’t be included in the calculation for used 
+percentage. Defaults to `100`.
+end::training_percent[]