@@ -12,7 +12,8 @@
`analysis`::
(object) The type of analysis that is performed on the `source`. For example:
- `outlier_detection`. For more information, see <<dfanalytics-types>>.
+ `outlier_detection` or `regression`. For more information, see
+ <<dfanalytics-types>>.

`analyzed_fields`::
(object) You can specify `includes` and/or `excludes` patterns. If
@@ -98,15 +99,13 @@ PUT _ml/data_frame/analytics/loganalytics
==== Analysis objects

{dfanalytics-cap} resources contain `analysis` objects. For example, when you
-create a {dfanalytics-job}, you must define the type of analysis it performs.
-Currently, `outlier_detection` is the only available type of analysis, however,
-other types will be added, for example `regression`.
-
+create a {dfanalytics-job}, you must define the type of analysis it performs.
+
[discrete]
[[oldetection-resources]]
==== {oldetection-cap} configuration objects

-An {oldetection} configuration object has the following properties:
+An `outlier_detection` configuration object has the following properties:

`compute_feature_influence`::
(boolean) If `true`, the feature influence calculation is enabled. Defaults to
@@ -123,7 +122,7 @@ An {oldetection} configuration object has the following properties:
recommend using the ensemble method. Available methods are `lof`, `ldof`,
`distance_kth_nn`, `distance_knn`.

`n_neighbors`::
(integer) Defines how many nearest neighbors each method of {oldetection}
will use to calculate its {olscore}. When the value is not set,
different values will be used for different ensemble members. This helps
@@ -140,3 +139,122 @@ An {oldetection} configuration object has the following properties:
before computing outlier scores: (x_i - mean(x_i)) / sd(x_i). Defaults to
`true`. For more information, see
https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)[this wiki page about standardization].
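+
+For reference, a minimal job that sets these properties explicitly might look
+like the following sketch. The job and index names are illustrative, and it
+assumes the standardization property described above is named
+`standardization_enabled`:
+
+[source,console]
+--------------------------------------------------
+PUT _ml/data_frame/analytics/example_outlier_detection
+{
+  "source": {
+    "index": "my_index"
+  },
+  "dest": {
+    "index": "my_index_outliers"
+  },
+  "analysis": {
+    "outlier_detection": {
+      "method": "lof",
+      "n_neighbors": 5,
+      "compute_feature_influence": true,
+      "standardization_enabled": true
+    }
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]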
+
+
+[discrete]
+[[regression-resources]]
+==== {regression-cap} configuration objects
+
+[source,console]
+--------------------------------------------------
+PUT _ml/data_frame/analytics/house_price_regression_analysis
+{
+  "source": {
+    "index": "houses_sold_last_10_yrs" <1>
+  },
+  "dest": {
+    "index": "house_price_predictions" <2>
+  },
+  "analysis": {
+    "regression": { <3>
+      "dependent_variable": "price" <4>
+    }
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]
+
+<1> Training data is taken from the source index `houses_sold_last_10_yrs`.
+<2> Analysis results are output to the destination index
+`house_price_predictions`.
+<3> The regression analysis configuration object.
+<4> Regression analysis uses the `price` field for training. As no other
+parameters are specified, it trains on 100% of the eligible data, stores its
+prediction in the destination index field `price_prediction`, and uses built-in
+hyperparameter optimization to minimize validation error.
+
+
+[float]
+[[regression-resources-standard]]
+===== Standard parameters
+
+`dependent_variable`::
+ (Required, string) Defines which field of the {dataframe} is to be predicted.
+ This parameter is supplied by field name and must match one of the fields in
+ the index being used to train. If this field is missing from a document, that
+ document is not used for training, but a prediction with the trained model is
+ still generated for it. The data type of the field must be numeric. This field
+ is also known as the continuous target variable.
+
+`prediction_field_name`::
+ (Optional, string) Defines the name of the prediction field in the results.
+ Defaults to `<dependent_variable>_prediction`.
+
+`training_percent`::
+ (Optional, integer) Defines what percentage of the eligible documents will be
+ used for training. Documents that are ignored by the analysis (for example
+ those that contain arrays) are not included in the calculation of the used
+ percentage. Defaults to `100`.
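+
+For example, the following sketch (reusing the index names from the example
+above; the job name and the `predicted_price` field name are illustrative)
+trains on half of the eligible documents and writes its predictions to a
+custom field:
+
+[source,console]
+--------------------------------------------------
+PUT _ml/data_frame/analytics/house_price_regression_custom
+{
+  "source": {
+    "index": "houses_sold_last_10_yrs"
+  },
+  "dest": {
+    "index": "house_price_predictions"
+  },
+  "analysis": {
+    "regression": {
+      "dependent_variable": "price",
+      "prediction_field_name": "predicted_price",
+      "training_percent": 50
+    }
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]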
+
+
+[float]
+[[regression-resources-advanced]]
+===== Advanced parameters
+
+Advanced parameters are for fine-tuning {reganalysis}. If they are not
+supplied, <<ml-hyperparameter-optimization,hyperparameter optimization>> sets
+their values automatically to give minimum validation error. It is highly
+recommended to use the default values unless you fully understand the function
+of these parameters.
+
+`eta`::
+ (Optional, double) The shrinkage applied to the weights. Smaller values result
+ in larger forests which have better generalization error. However, the smaller
+ the value the longer the training will take. For more information, see
+ https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article]
+ about shrinkage.
+
+`feature_bag_fraction`::
+ (Optional, double) Defines the fraction of features that will be used when
+ selecting a random bag for each candidate split.
+
+`maximum_number_trees`::
+ (Optional, integer) Defines the maximum number of trees the forest is allowed
+ to contain. The maximum value is 2000.
+
+`gamma`::
+ (Optional, double) Regularization parameter to prevent overfitting on the
+ training dataset. Multiplies a linear penalty associated with the size of
+ individual trees in the forest. The higher the value, the more training will
+ prefer smaller trees. The smaller this parameter, the larger individual trees
+ will be and the longer training will take.
+
+`lambda`::
+ (Optional, double) Regularization parameter to prevent overfitting on the
+ training dataset. Multiplies an L2 regularization term which applies to leaf
+ weights of the individual trees in the forest. The higher the value, the more
+ training will attempt to keep leaf weights small. This makes the prediction
+ function smoother at the expense of potentially not being able to capture
+ relevant relationships between the features and the {depvar}. The smaller this
+ parameter, the larger individual trees will be and the longer training will
+ take.
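+
+The following sketch shows how these parameters are supplied. The job name is
+illustrative and the values are arbitrary placeholders, not recommendations:
+
+[source,console]
+--------------------------------------------------
+PUT _ml/data_frame/analytics/house_price_regression_tuned
+{
+  "source": {
+    "index": "houses_sold_last_10_yrs"
+  },
+  "dest": {
+    "index": "house_price_predictions"
+  },
+  "analysis": {
+    "regression": {
+      "dependent_variable": "price",
+      "eta": 0.02,
+      "feature_bag_fraction": 0.6,
+      "maximum_number_trees": 500,
+      "gamma": 0.8,
+      "lambda": 1.5
+    }
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]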
+
+
+[[ml-hyperparameter-optimization]]
+===== Hyperparameter optimization
+
+If you don't supply {regression} parameters, hyperparameter optimization is
+performed by default to set a value for the undefined parameters. The starting
+point is calculated for data-dependent parameters by examining the loss on the
+training data. Subject to the size constraint, this operation provides an upper
+bound on the improvement in validation loss.
+
+A fixed number of rounds is used for optimization which depends on the number
+of parameters being optimized. The optimization starts with random search, then
+Bayesian optimization is performed targeting maximum expected improvement. If
+you override any parameters, the optimization uses the values you provide and
+tunes only the remaining parameters, reducing the number of rounds accordingly.
+The validation error is estimated in each round by using 4-fold
+cross-validation.
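+
+For instance, in the following sketch (the job name and value are
+illustrative) only `maximum_number_trees` is overridden, so the optimization
+tunes the remaining advanced parameters:
+
+[source,console]
+--------------------------------------------------
+PUT _ml/data_frame/analytics/house_price_regression_partial
+{
+  "source": {
+    "index": "houses_sold_last_10_yrs"
+  },
+  "dest": {
+    "index": "house_price_predictions"
+  },
+  "analysis": {
+    "regression": {
+      "dependent_variable": "price",
+      "maximum_number_trees": 1000
+    }
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]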