
[DOCS] Adds classification type DFA API docs and ml-shared.asciidoc (#48241)

István Zoltán Szabó 6 years ago
parent commit 6c3fed8d4d

+ 74 - 54
docs/reference/ml/df-analytics/apis/dfanalyticsresources.asciidoc

@@ -18,13 +18,14 @@
 `analyzed_fields`::
   (object) You can specify `includes` patterns, `excludes` patterns, or both. 
   If `analyzed_fields` is not set, only the relevant fields will be included. For 
-  example all the numeric fields for {oldetection}.
-  
-  `analyzed_fields.includes`:::
+  example, all the numeric fields for {oldetection}. For the supported field 
+  types, see <<ml-put-dfanalytics-supported-fields>>.
+    
+  `includes`:::
     (array) An array of strings that defines the fields that will be included in 
     the analysis.
-    
-  `analyzed_fields.excludes`:::
+      
+  `excludes`:::
     (array) An array of strings that defines the fields that will be excluded 
     from the analysis.
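
As an illustrative sketch (the job, index, and field names here are hypothetical), an `analyzed_fields` object that excludes two fields might look like this:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression
{
  "source": { "index": "houses" },
  "dest": { "index": "houses-predicted" },
  "analysis": {
    "regression": { "dependent_variable": "price" }
  },
  "analyzed_fields": {
    "includes": [],
    "excludes": ["agent_id", "listing_url"]
  }
}
--------------------------------------------------
// TEST[skip:TBD]

An empty `includes` array means no explicit include filter is applied; the `excludes` patterns then remove `agent_id` and `listing_url` from the otherwise eligible fields.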
  
@@ -179,23 +180,15 @@ hyperparameter optimization to give minimum validation errors.
 [[regression-resources-standard]]
 ===== Standard parameters
 
-`dependent_variable`::
-  (Required, string) Defines which field of the document is to be predicted. 
-  This parameter is supplied by field name and must match one of the fields in 
-  the index being used to train. If this field is missing from a document, then 
-  that document will not be used for training, but a prediction with the trained 
-  model will be generated for it. The data type of the field must be numeric. It 
-  is also known as continuous target variable.
-
-`prediction_field_name`::
- (Optional, string) Defines the name of the prediction field in the results. 
- Defaults to `<dependent_variable>_prediction`.
- 
-`training_percent`::
- (Optional, integer) Defines what percentage of the eligible documents that will 
- be used for training. Documents that are ignored by the analysis (for example 
- those that contain arrays) won’t be included in the calculation for used 
- percentage. Defaults to `100`.
+include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
++
+--
+The data type of the field must be numeric.
+--
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]
 
 
 [float]
@@ -209,46 +202,73 @@ values unless you fully understand the function of these parameters. If these
 parameters are not supplied, their values are automatically tuned to give 
 minimum validation error.
 
-`eta`::
- (Optional, double) The shrinkage applied to the weights. Smaller values result 
- in larger forests which have better generalization error. However, the smaller 
- the value the longer the training will take. For more information, see 
- https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article] 
- about shrinkage.
+include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
+
+
+[discrete]
+[[classification-resources]]
+==== {classification-cap} configuration objects 
 
-`feature_bag_fraction`::
- (Optional, double) Defines the fraction of features that will be used when 
- selecting a random bag for each candidate split. 
 
-`maximum_number_trees`::
- (Optional, integer) Defines the maximum number of trees the forest is allowed 
- to contain. The maximum value is 2000.
-
-`gamma`::
- (Optional, double) Regularization parameter to prevent overfitting on the 
- training dataset. Multiplies a linear penalty associated with the size of 
- individual trees in the forest. The higher the value the more training will 
- prefer smaller trees. The smaller this parameter the larger individual trees 
- will be and the longer train will take.
+[float]
+[[classification-resources-standard]]
+===== Standard parameters
+ 
+include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
++
+--
+The data type of the field must be numeric or boolean.
+--
+  
+`num_top_classes`::
+  (Optional, integer) Defines the number of categories for which the predicted 
+  probabilities are reported. It must be non-negative. If it is greater than the 
+  total number of categories to predict (two in the {version} version of the 
+  {stack}), all category probabilities are reported. Defaults to `2`.
 
-`lambda`::
- (Optional, double) Regularization parameter to prevent overfitting on the 
- training dataset. Multiplies an L2 regularisation term which applies to leaf 
- weights of the individual trees in the forest. The higher the value the more 
- training will attempt to keep leaf weights small. This makes the prediction 
- function smoother at the expense of potentially not being able to capture 
- relevant relationships between the features and the {depvar}. The smaller this 
- parameter the larger individual trees will be and the longer train will take.
+include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]
+
+
+[float]
+[[classification-resources-advanced]]
+===== Advanced parameters
+
+Advanced parameters are for fine-tuning {classanalysis}. If these parameters 
+are not supplied, they are set automatically by 
+<<ml-hyperparameter-optimization,hyperparameter optimization>> to give minimum 
+validation error. It is highly recommended to use the default values unless you 
+fully understand the function of these parameters.
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
 
 
 [[ml-hyperparameter-optimization]]
 ===== Hyperparameter optimization
 
-If you don't supply {regression} parameters, hyperparameter optimization will be 
-performed by default to set a value for the undefined parameters. The starting 
-point is calculated for data dependent parameters by examining the loss on the 
-training data. Subject to the size constraint, this operation provides an upper 
-bound on the improvement in validation loss.
+If you don't supply {regression} or {classification} parameters, hyperparameter 
+optimization will be performed by default to set a value for the undefined 
+parameters. The starting point is calculated for data dependent parameters by 
+examining the loss on the training data. Subject to the size constraint, this 
+operation provides an upper bound on the improvement in validation loss.
 
 A fixed number of rounds is used for optimization, which depends on the number of 
 parameters being optimized. The optimization starts with random search, then 

+ 49 - 0
docs/reference/ml/df-analytics/apis/put-dfanalytics.asciidoc

@@ -67,6 +67,26 @@ an array with two or more values are also ignored. Documents in the `dest` index
 that don’t contain a results field are not included in the {reganalysis}.
 
 
+====== {classification-cap}
+
+{classification-cap} supports fields that are numeric, boolean, text, keyword, 
+and ip. It is also tolerant of missing values. Supported fields are included in 
+the analysis; other fields are ignored. Documents where included fields contain 
+an array with two or more values are also ignored. Documents in the `dest` 
+index that don’t contain a results field are not included in the 
+{classanalysis}.
+
+{classanalysis-cap} can be improved by mapping each value of an ordinal 
+variable to a single number. For example, in the case of age ranges, you can 
+model the values as "0-14" = 0, "15-24" = 1, "25-34" = 2, and so on.
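
One way to perform such a mapping (a sketch only; the pipeline, field, and value names are hypothetical) is an ingest pipeline with a script processor that converts the range labels to ordinal numbers before the data is analyzed:

[source,console]
--------------------------------------------------
PUT _ingest/pipeline/age-range-to-ordinal
{
  "processors": [
    {
      "script": {
        "source": "ctx.age_ordinal = params.mapping[ctx.age_range]",
        "params": {
          "mapping": { "0-14": 0, "15-24": 1, "25-34": 2 }
        }
      }
    }
  ]
}
--------------------------------------------------
// TEST[skip:TBD]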
+
+Fields that are highly correlated with the `dependent_variable` should be 
+excluded from the analysis. For example, if you use a multi-value field as the 
+`dependent_variable`, {es} maps it as both text and keyword, which results in 
+two fields (`field` and `field.keyword`). You must exclude the field with the 
+text mapping to get exact results from the analysis.
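
As a sketch of this exclusion (the field name `customer_notes` is hypothetical), you can list the text-mapped field in `analyzed_fields.excludes` so that only its keyword sub-field is analyzed:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loan_classification_excludes
{
  "source": { "index": "loan-applicants" },
  "dest": { "index": "loan-applicants-classified" },
  "analysis": {
    "classification": { "dependent_variable": "label" }
  },
  "analyzed_fields": {
    "excludes": ["customer_notes"]
  }
}
--------------------------------------------------
// TEST[skip:TBD]

Here `customer_notes` (the text mapping) is excluded, while `customer_notes.keyword` remains available to the analysis.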
+
+
 [[ml-put-dfanalytics-path-params]]
 ==== {api-path-parms-title}
 
@@ -154,6 +174,7 @@ that don’t contain a results field are not included in the {reganalysis}.
 [[ml-put-dfanalytics-example]]
 ==== {api-examples-title}
 
+
 [[ml-put-dfanalytics-example-od]]
 ===== {oldetection-cap} example
 
@@ -303,3 +324,31 @@ PUT _ml/data_frame/analytics/student_performance_mathematics_0.3
 
 
 <1> The `training_percent` defines the percentage of the data set that will be used 
 for training the model.
+
+
+[[ml-put-dfanalytics-example-c]]
+===== {classification-cap} example
+
+The following example creates the `loan_classification` {dfanalytics-job}; the 
+analysis type is `classification`:
+
+[source,console]
+--------------------------------------------------
+PUT _ml/data_frame/analytics/loan_classification
+{
+  "source" : {
+    "index": "loan-applicants"
+  },
+  "dest" : {
+    "index": "loan-applicants-classified"
+  },
+  "analysis" : {
+    "classification": {
+      "dependent_variable": "label",
+      "training_percent": 75,
+      "num_top_classes": 2
+    }
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]

+ 70 - 0
docs/reference/ml/ml-shared.asciidoc

@@ -0,0 +1,70 @@
+tag::dependent_variable[]
+`dependent_variable`::
+(Required, string) Defines which field of the document is to be predicted. 
+This parameter is supplied by field name and must match one of the fields in 
+the index being used to train. If this field is missing from a document, then 
+that document will not be used for training, but a prediction with the trained 
+model will be generated for it. In {regression}, it is also known as the 
+continuous target variable.
+end::dependent_variable[]
+
+
+tag::eta[]
+`eta`::
+(Optional, double) The shrinkage applied to the weights. Smaller values result 
+in larger forests, which have better generalization error. However, the smaller 
+the value, the longer the training will take. For more information, see 
+https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this Wikipedia 
+article] about shrinkage.
+end::eta[]
+
+
+tag::feature_bag_fraction[]
+`feature_bag_fraction`::
+(Optional, double) Defines the fraction of features that will be used when 
+selecting a random bag for each candidate split. 
+end::feature_bag_fraction[]
+
+
+tag::gamma[]
+`gamma`::
+(Optional, double) Regularization parameter to prevent overfitting on the 
+training dataset. Multiplies a linear penalty associated with the size of 
+individual trees in the forest. The higher the value, the more training will 
+prefer smaller trees. The smaller this parameter, the larger individual trees 
+will be and the longer training will take.
+end::gamma[]
+
+
+tag::lambda[] 
+`lambda`::
+(Optional, double) Regularization parameter to prevent overfitting on the 
+training dataset. Multiplies an L2 regularization term which applies to leaf 
+weights of the individual trees in the forest. The higher the value, the more 
+training will attempt to keep leaf weights small. This makes the prediction 
+function smoother at the expense of potentially not being able to capture 
+relevant relationships between the features and the {depvar}. The smaller this 
+parameter, the larger individual trees will be and the longer training will 
+take.
+end::lambda[]
+
+
+tag::maximum_number_trees[]
+`maximum_number_trees`::
+(Optional, integer) Defines the maximum number of trees the forest is allowed 
+to contain. The maximum value is 2000.
+end::maximum_number_trees[]
+
+
+tag::prediction_field_name[]
+`prediction_field_name`::
+(Optional, string) Defines the name of the prediction field in the results. 
+Defaults to `<dependent_variable>_prediction`.
+end::prediction_field_name[]
+
+
+tag::training_percent[]
+`training_percent`::
+(Optional, integer) Defines what percentage of the eligible documents will be 
+used for training. Documents that are ignored by the analysis (for example, 
+those that contain arrays) won’t be included in the calculation for used 
+percentage. Defaults to `100`.
+end::training_percent[]