[role="xpack"]
[testenv="platinum"]
[[ml-dfanalytics-resources]]
=== {dfanalytics-cap} job resources

{dfanalytics-cap} resources relate to APIs such as <<put-dfanalytics>> and
<<get-dfanalytics>>.

[discrete]
[[ml-dfanalytics-properties]]
==== {api-definitions-title}

`analysis`::
(object) The type of analysis that is performed on the `source`. For example:
`outlier_detection` or `regression`. For more information, see
<<dfanalytics-types>>.

`analyzed_fields`::
(object) You can specify `includes` patterns, `excludes` patterns, or both. If
`analyzed_fields` is not set, only the relevant fields will be included. For
example, all the numeric fields for {oldetection}.

`analyzed_fields.includes`:::
(array) An array of strings that defines the fields that will be included in
the analysis.

`analyzed_fields.excludes`:::
(array) An array of strings that defines the fields that will be excluded
from the analysis.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loganalytics
{
  "source": {
    "index": "logdata"
  },
  "dest": {
    "index": "logdata_out"
  },
  "analysis": {
    "outlier_detection": {
    }
  },
  "analyzed_fields": {
    "includes": [ "request.bytes", "response.counts.error" ],
    "excludes": [ "source.geo" ]
  }
}
--------------------------------------------------
// TEST[setup:setup_logdata]

`description`::
(Optional, string) A description of the job.

`dest`::
(object) The destination configuration of the analysis.

`index`:::
(Required, string) Defines the _destination index_ to store the results of
the {dfanalytics-job}.

`results_field`:::
(Optional, string) Defines the name of the field in which to store the
results of the analysis. Defaults to `ml`.

`id`::
(string) The unique identifier for the {dfanalytics-job}. This identifier can
contain lowercase alphanumeric characters (a-z and 0-9), hyphens, and
underscores. It must start and end with alphanumeric characters. This property
is informational; you cannot change the identifier for existing jobs.

`model_memory_limit`::
(string) The approximate maximum amount of memory resources that are
permitted for analytical processing. The default value for {dfanalytics-jobs}
is `1gb`. If your `elasticsearch.yml` file contains an
`xpack.ml.max_model_memory_limit` setting, an error occurs when you try to
create {dfanalytics-jobs} that have `model_memory_limit` values greater than
that setting. For more information, see <<ml-settings>>.

`source`::
(object) The source configuration, consisting of an `index` and optionally a
`query` object.

`index`:::
(Required, string or array) Index or indices on which to perform the
analysis. It can be a single index or index pattern as well as an array of
indices or patterns.

`query`:::
(Optional, object) The {es} query domain-specific language
(<<query-dsl,DSL>>). This value corresponds to the query object in an {es}
search POST body. All the options that are supported by {es} can be used,
as this object is passed verbatim to {es}. By default, this property has
the following value: `{"match_all": {}}`.
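
For example, the following request combines several of these properties: it
restricts the `source` with a `query`, sets a custom `results_field` in `dest`,
and raises `model_memory_limit`. The job ID, the query clause, and all values
shown are illustrative assumptions rather than part of this reference.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loganalytics_filtered
{
  "description": "Outlier detection on large requests only",
  "source": {
    "index": "logdata",
    "query": {
      "range": { "request.bytes": { "gt": 0 } }
    }
  },
  "dest": {
    "index": "logdata_filtered_out",
    "results_field": "ml_results"
  },
  "analysis": {
    "outlier_detection": {}
  },
  "model_memory_limit": "2gb"
}
--------------------------------------------------
// TEST[skip:illustrative example]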

[[dfanalytics-types]]
==== Analysis objects

{dfanalytics-cap} resources contain `analysis` objects. For example, when you
create a {dfanalytics-job}, you must define the type of analysis it performs.

[discrete]
[[oldetection-resources]]
==== {oldetection-cap} configuration objects

An `outlier_detection` configuration object has the following properties:

`compute_feature_influence`::
(boolean) If `true`, the feature influence calculation is enabled. Defaults to
`true`.

`feature_influence_threshold`::
(double) The minimum {olscore} that a document needs to have in order to
calculate its {fiscore}. Value range: 0-1 (`0.1` by default).

`method`::
(string) Sets the method that {oldetection} uses. If the method is not set,
{oldetection} uses an ensemble of different methods and normalizes and
combines their individual {olscores} to obtain the overall {olscore}. We
recommend using the ensemble method. Available methods are `lof`, `ldof`,
`distance_kth_nn`, and `distance_knn`.

`n_neighbors`::
(integer) Defines how many nearest neighbors each method of {oldetection} uses
to calculate its {olscore}. When the value is not set, different values are
used for different ensemble members, which helps improve diversity in the
ensemble. Therefore, only override this if you are confident that the value you
choose is appropriate for the data set.

`outlier_fraction`::
(double) Sets the proportion of the data set that is assumed to be outlying
prior to {oldetection}. For example, 0.05 means it is assumed that 5% of values
are real outliers and 95% are inliers.

`standardize_columns`::
(boolean) If `true`, then the following operation is performed on the columns
before computing outlier scores: (x_i - mean(x_i)) / sd(x_i). Defaults to
`true`. For more information, see
https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)[this wiki page about standardization].
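
As an illustration, the following request sets these properties explicitly when
creating a job. The job ID and all values are arbitrary assumptions for the
sake of the example, not recommendations.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loganalytics_custom
{
  "source": {
    "index": "logdata"
  },
  "dest": {
    "index": "logdata_custom_out"
  },
  "analysis": {
    "outlier_detection": {
      "method": "distance_knn",
      "n_neighbors": 10,
      "compute_feature_influence": true,
      "feature_influence_threshold": 0.2,
      "outlier_fraction": 0.05,
      "standardize_columns": true
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative example]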

[discrete]
[[regression-resources]]
==== {regression-cap} configuration objects

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_analysis
{
  "source": {
    "index": "houses_sold_last_10_yrs" <1>
  },
  "dest": {
    "index": "house_price_predictions" <2>
  },
  "analysis": {
    "regression": { <3>
      "dependent_variable": "price" <4>
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]

<1> Training data is taken from source index `houses_sold_last_10_yrs`.
<2> Analysis results will be output to destination index
`house_price_predictions`.
<3> The regression analysis configuration object.
<4> Regression analysis uses the field `price` to train on. As no other
parameters have been specified, it trains on 100% of eligible data, stores its
prediction in the destination index field `price_prediction`, and uses built-in
hyperparameter optimization to give minimum validation error.

[float]
[[regression-resources-standard]]
===== Standard parameters

`dependent_variable`::
(Required, string) Defines which field of the {dataframe} is to be predicted.
This parameter is supplied by field name and must match one of the fields in
the index being used to train. If this field is missing from a document, then
that document is not used for training, but a prediction with the trained
model is generated for it. The data type of the field must be numeric. It
is also known as the continuous target variable.

`prediction_field_name`::
(Optional, string) Defines the name of the prediction field in the results.
Defaults to `<dependent_variable>_prediction`.

`training_percent`::
(Optional, integer) Defines what percentage of the eligible documents will
be used for training. Documents that are ignored by the analysis (for example
those that contain arrays) are not included in the calculation of this
percentage. Defaults to `100`.
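
For instance, the house price example above could set both optional standard
parameters as follows. The job ID and the values `predicted_price` and `80` are
illustrative assumptions, not defaults or recommendations.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_custom
{
  "source": {
    "index": "houses_sold_last_10_yrs"
  },
  "dest": {
    "index": "house_price_predictions_custom"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "prediction_field_name": "predicted_price",
      "training_percent": 80
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative example]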

[float]
[[regression-resources-advanced]]
===== Advanced parameters

Advanced parameters are for fine-tuning {reganalysis}. If they are not
supplied, they are set automatically by
<<ml-hyperparameter-optimization,hyperparameter optimization>> to give minimum
validation error. It is highly recommended to use the default values unless you
fully understand the function of these parameters.

`eta`::
(Optional, double) The shrinkage applied to the weights. Smaller values result
in larger forests which have better generalization error. However, the smaller
the value the longer the training will take. For more information, see
https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article]
about shrinkage.

`feature_bag_fraction`::
(Optional, double) Defines the fraction of features that will be used when
selecting a random bag for each candidate split.

`maximum_number_trees`::
(Optional, integer) Defines the maximum number of trees the forest is allowed
to contain. The maximum value is 2000.

`gamma`::
(Optional, double) Regularization parameter to prevent overfitting on the
training dataset. Multiplies a linear penalty associated with the size of
individual trees in the forest. The higher the value, the more training will
prefer smaller trees. The smaller this parameter, the larger individual trees
will be and the longer training will take.

`lambda`::
(Optional, double) Regularization parameter to prevent overfitting on the
training dataset. Multiplies an L2 regularization term which applies to leaf
weights of the individual trees in the forest. The higher the value, the more
training will attempt to keep leaf weights small. This makes the prediction
function smoother at the expense of potentially not being able to capture
relevant relationships between the features and the {depvar}. The smaller this
parameter, the larger individual trees will be and the longer training will
take.
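
If you do override any of them, advanced parameters are supplied inside the
same `regression` object as the standard parameters. The request below is only
a sketch; the job ID and every numeric value are arbitrary assumptions rather
than tuned or recommended settings.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_tuned
{
  "source": {
    "index": "houses_sold_last_10_yrs"
  },
  "dest": {
    "index": "house_price_predictions_tuned"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "eta": 0.05,
      "feature_bag_fraction": 0.7,
      "maximum_number_trees": 500,
      "gamma": 1.0,
      "lambda": 1.0
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative example]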

[[ml-hyperparameter-optimization]]
===== Hyperparameter optimization

If you don't supply {regression} parameters, hyperparameter optimization is
performed by default to set a value for the undefined parameters. The starting
point is calculated for data-dependent parameters by examining the loss on the
training data. Subject to the size constraint, this operation provides an upper
bound on the improvement in validation loss.

A fixed number of rounds is used for optimization which depends on the number
of parameters being optimized. The optimization starts with a random search,
then Bayesian optimization is performed, targeting maximum expected
improvement. If you override any parameters, then the optimization calculates
the value of the remaining parameters accordingly and uses the value you
provided for the overridden parameter. The number of rounds is reduced
correspondingly. The validation error is estimated in each round by using
4-fold cross-validation.