
[DOCS] Adds classification type DFA API docs and ml-shared.asciidoc (#48241)

István Zoltán Szabó 6 years ago
parent
commit
6c3fed8d4d

+ 74 - 54
docs/reference/ml/df-analytics/apis/dfanalyticsresources.asciidoc

@@ -18,13 +18,14 @@
 `analyzed_fields`::
   (object) You can specify both `includes` and/or `excludes` patterns. If 
   `analyzed_fields` is not set, only the relevant fields will be included. For 
-  example all the numeric fields for {oldetection}.
-  
-  `analyzed_fields.includes`:::
+  example, all the numeric fields for {oldetection}. For the supported field 
+  types, see <<ml-put-dfanalytics-supported-fields>>.
+    
+  `includes`:::
     (array) An array of strings that defines the fields that will be included in 
     the analysis.
-    
-  `analyzed_fields.excludes`:::
+      
+  `excludes`:::
     (array) An array of strings that defines the fields that will be excluded 
     from the analysis.
   
@@ -179,23 +180,15 @@ hyperparameter optimization to give minimum validation errors.
 [[regression-resources-standard]]
 ===== Standard parameters
 
-`dependent_variable`::
-  (Required, string) Defines which field of the document is to be predicted. 
-  This parameter is supplied by field name and must match one of the fields in 
-  the index being used to train. If this field is missing from a document, then 
-  that document will not be used for training, but a prediction with the trained 
-  model will be generated for it. The data type of the field must be numeric. It 
-  is also known as continuous target variable.
-
-`prediction_field_name`::
- (Optional, string) Defines the name of the prediction field in the results. 
- Defaults to `<dependent_variable>_prediction`.
- 
-`training_percent`::
- (Optional, integer) Defines what percentage of the eligible documents that will 
- be used for training. Documents that are ignored by the analysis (for example 
- those that contain arrays) won’t be included in the calculation for used 
- percentage. Defaults to `100`.
+include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
++
+--
+The data type of the field must be numeric.
+--
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]
 
 
 [float]
@@ -209,46 +202,73 @@ values unless you fully understand the function of these parameters. If these
 parameters are not supplied, their values are automatically tuned to give 
 minimum validation error.
 
-`eta`::
- (Optional, double) The shrinkage applied to the weights. Smaller values result 
- in larger forests which have better generalization error. However, the smaller 
- the value the longer the training will take. For more information, see 
- https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article] 
- about shrinkage.
+include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
+
+
+[float]
+[[classification-resources]]
+==== {classification-cap} configuration objects 
  
-`feature_bag_fraction`::
- (Optional, double) Defines the fraction of features that will be used when 
- selecting a random bag for each candidate split. 
  
-`maximum_number_trees`::
- (Optional, integer) Defines the maximum number of trees the forest is allowed 
- to contain. The maximum value is 2000.
-
-`gamma`::
- (Optional, double) Regularization parameter to prevent overfitting on the 
- training dataset. Multiplies a linear penalty associated with the size of 
- individual trees in the forest. The higher the value the more training will 
- prefer smaller trees. The smaller this parameter the larger individual trees 
- will be and the longer train will take.
+[float]
+[[classification-resources-standard]]
+===== Standard parameters
+ 
+include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
++
+--
+The data type of the field must be numeric or boolean.
+--
+  
+`num_top_classes`::
+  (Optional, integer) Defines the number of categories for which the predicted 
+  probabilities are reported. It must be non-negative. If it is greater than the 
+  total number of categories to predict (two in the {version} version of the 
+  {stack}), all category probabilities are reported. Defaults to `2`.
  
-`lambda`::
- (Optional, double) Regularization parameter to prevent overfitting on the 
- training dataset. Multiplies an L2 regularisation term which applies to leaf 
- weights of the individual trees in the forest. The higher the value the more 
- training will attempt to keep leaf weights small. This makes the prediction 
- function smoother at the expense of potentially not being able to capture 
- relevant relationships between the features and the {depvar}. The smaller this 
- parameter the larger individual trees will be and the longer train will take.
+include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]
+
+
+[float]
+[[classification-resources-advanced]]
+===== Advanced parameters
+
+Advanced parameters are for fine-tuning {classanalysis}. They are set 
+automatically by <<ml-hyperparameter-optimization,hyperparameter optimization>> 
+to give minimum validation error. It is highly recommended to use the default 
+values unless you fully understand the function of these parameters. If these 
+parameters are not supplied, their values are automatically tuned to give 
+minimum validation error.
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
 
 
 [[ml-hyperparameter-optimization]]
 ===== Hyperparameter optimization
 
-If you don't supply {regression} parameters, hyperparameter optimization will be 
-performed by default to set a value for the undefined parameters. The starting 
-point is calculated for data dependent parameters by examining the loss on the 
-training data. Subject to the size constraint, this operation provides an upper 
-bound on the improvement in validation loss.
+If you don't supply {regression} or {classification} parameters, hyperparameter 
+optimization will be performed by default to set a value for the undefined 
+parameters. The starting point is calculated for data dependent parameters by 
+examining the loss on the training data. Subject to the size constraint, this 
+operation provides an upper bound on the improvement in validation loss.
 
 A fixed number of rounds is used for optimization which depends on the number of 
 parameters being optimized. The optimization starts with random search, then 

+ 49 - 0
docs/reference/ml/df-analytics/apis/put-dfanalytics.asciidoc

@@ -67,6 +67,26 @@ an array with two or more values are also ignored. Documents in the `dest` index
 that don’t contain a results field are not included in the {reganalysis}.
 
 
+====== {classification-cap}
+
+{classification-cap} supports fields that are numeric, boolean, text, keyword, 
+and ip. It is also tolerant of missing values. Supported fields are included in 
+the analysis; other fields are ignored. Documents where included fields contain 
+an array with two or more values are also ignored. Documents in the `dest` index 
+that don’t contain a results field are not included in the {classanalysis}.
+
+{classanalysis-cap} can be improved by mapping ordinal variable values to a 
+single number. For example, in the case of age ranges, you can model the values 
+as "0-14" = 0, "15-24" = 1, "25-34" = 2, and so on.
+
+Fields that are highly correlated with the `dependent_variable` should be 
+excluded from the analysis. For example, if your `dependent_variable` is a 
+multi-field, {es} maps it both as `text` and `keyword`, which results in two 
+fields (`field` and `field.keyword`). To get exact results from the analysis, 
+you must exclude the field with the `text` mapping.
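+
+For example, if a hypothetical `customer_feedback` field is mapped both as 
+`text` and `keyword`, the `text` variant can be excluded with `analyzed_fields` 
+(the job and field names here are assumptions):
+
+[source,console]
+--------------------------------------------------
+PUT _ml/data_frame/analytics/feedback_classification
+{
+  "source" : {
+    "index": "customer-feedback"
+  },
+  "dest" : {
+    "index": "customer-feedback-classified"
+  },
+  "analysis" : {
+    "classification": {
+      "dependent_variable": "label"
+    }
+  },
+  "analyzed_fields": {
+    "excludes": [ "customer_feedback" ] <1>
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]
+
+<1> Only the `text` mapping (`customer_feedback`) is excluded; 
+`customer_feedback.keyword` remains available to the analysis.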
+
+
 [[ml-put-dfanalytics-path-params]]
 ==== {api-path-parms-title}
 
@@ -154,6 +174,7 @@ that don’t contain a results field are not included in the {reganalysis}.
 [[ml-put-dfanalytics-example]]
 ==== {api-examples-title}
 
+
 [[ml-put-dfanalytics-example-od]]
 ===== {oldetection-cap} example
 
@@ -303,3 +324,31 @@ PUT _ml/data_frame/analytics/student_performance_mathematics_0.3
 
 <1> The `training_percent` defines the percentage of the data set that will be used 
 for training the model.
+
+
+[[ml-put-dfanalytics-example-c]]
+===== {classification-cap} example
+
+The following example creates the `loan_classification` {dfanalytics-job}; the 
+analysis type is `classification`:
+
+[source,console]
+--------------------------------------------------
+PUT _ml/data_frame/analytics/loan_classification
+{
+  "source" : {
+    "index": "loan-applicants"
+  },
+  "dest" : {
+    "index": "loan-applicants-classified"
+  },
+  "analysis" : {
+    "classification": {
+      "dependent_variable": "label",
+      "training_percent": 75,
+      "num_top_classes": 2
+    }
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]

+ 70 - 0
docs/reference/ml/ml-shared.asciidoc

@@ -0,0 +1,70 @@
+tag::dependent_variable[]
+`dependent_variable`::
+(Required, string) Defines which field of the document is to be predicted. 
+This parameter is supplied by field name and must match one of the fields in 
+the index being used to train. If this field is missing from a document, then 
+that document will not be used for training, but a prediction with the trained 
+model will be generated for it. It is also known as the continuous target variable.
+end::dependent_variable[]
+
+
+tag::eta[]
+`eta`::
+(Optional, double) The shrinkage applied to the weights. Smaller values result 
+in larger forests which have better generalization error. However, the smaller 
+the value the longer the training will take. For more information, see 
+https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article] 
+about shrinkage.
+end::eta[]
+
+
+tag::feature_bag_fraction[]
+`feature_bag_fraction`::
+(Optional, double) Defines the fraction of features that will be used when 
+selecting a random bag for each candidate split. 
+end::feature_bag_fraction[]
+
+
+tag::gamma[]
+`gamma`::
+(Optional, double) Regularization parameter to prevent overfitting on the 
+training dataset. Multiplies a linear penalty associated with the size of 
+individual trees in the forest. The higher the value, the more training will 
+prefer smaller trees. The smaller this parameter, the larger individual trees 
+will be and the longer training will take.
+end::gamma[]
+
+
+tag::lambda[] 
+`lambda`::
+(Optional, double) Regularization parameter to prevent overfitting on the 
+training dataset. Multiplies an L2 regularization term which applies to leaf 
+weights of the individual trees in the forest. The higher the value, the more 
+training will attempt to keep leaf weights small. This makes the prediction 
+function smoother at the expense of potentially not being able to capture 
+relevant relationships between the features and the {depvar}. The smaller this 
+parameter, the larger individual trees will be and the longer training will take.
+end::lambda[]
+
+
+tag::maximum_number_trees[]
+`maximum_number_trees`::
+(Optional, integer) Defines the maximum number of trees the forest is allowed 
+to contain. The maximum value is 2000.
+end::maximum_number_trees[]
+
+
+tag::prediction_field_name[]
+`prediction_field_name`::
+(Optional, string) Defines the name of the prediction field in the results. 
+Defaults to `<dependent_variable>_prediction`.
+end::prediction_field_name[]
+
+
+tag::training_percent[]
+`training_percent`::
+(Optional, integer) Defines what percentage of the eligible documents will be 
+used for training. Documents that are ignored by the analysis (for example, 
+those that contain arrays) won’t be included in the calculation of the 
+percentage used. Defaults to `100`.
+end::training_percent[]