[role="xpack"]
[testenv="platinum"]
[[ml-dfanalytics-resources]]
=== {dfanalytics-cap} job resources

{dfanalytics-cap} resources relate to APIs such as <<put-dfanalytics>> and
<<get-dfanalytics>>.

[discrete]
[[ml-dfanalytics-properties]]
==== {api-definitions-title}

`analysis`::
(object) The type of analysis that is performed on the `source`. For example:
`outlier_detection` or `regression`. For more information, see
<<dfanalytics-types>>.

`analyzed_fields`::
(Optional, object) Specify `includes` and/or `excludes` patterns to select
which fields are included in the analysis. If `analyzed_fields` is not set,
only the relevant fields are included; for example, all the numeric fields
for {oldetection}. For the supported field types, see
<<ml-put-dfanalytics-supported-fields>>. See also <<explain-dfanalytics>>,
which helps you understand field selection.

`includes`:::
(Optional, array) An array of strings that defines the fields that are
included in the analysis.

`excludes`:::
(Optional, array) An array of strings that defines the fields that are
excluded from the analysis.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loganalytics
{
  "source": {
    "index": "logdata"
  },
  "dest": {
    "index": "logdata_out"
  },
  "analysis": {
    "outlier_detection": {
    }
  },
  "analyzed_fields": {
    "includes": [ "request.bytes", "response.counts.error" ],
    "excludes": [ "source.geo" ]
  }
}
--------------------------------------------------
// TEST[setup:setup_logdata]

`description`::
(Optional, string) A description of the job.

`dest`::
(object) The destination configuration of the analysis.

`index`:::
(Required, string) Defines the _destination index_ to store the results of
the {dfanalytics-job}.

`results_field`:::
(Optional, string) Defines the name of the field in which to store the
results of the analysis. Defaults to `ml`.

`id`::
(string) The unique identifier for the {dfanalytics-job}. This identifier can
contain lowercase alphanumeric characters (a-z and 0-9), hyphens, and
underscores. It must start and end with alphanumeric characters. This property
is informational; you cannot change the identifier for existing jobs.

`model_memory_limit`::
(string) The approximate maximum amount of memory resources that are
permitted for analytical processing. The default value for {dfanalytics-jobs}
is `1gb`. If your `elasticsearch.yml` file contains an
`xpack.ml.max_model_memory_limit` setting, an error occurs when you try to
create {dfanalytics-jobs} that have `model_memory_limit` values greater than
that setting. For more information, see <<ml-settings>>.

`source`::
(object) The configuration of how to source the analysis data. It requires an
`index`. Optionally, `query` and `_source` may be specified.

`index`:::
(Required, string or array) Index or indices on which to perform the
analysis. It can be a single index or index pattern as well as an array of
indices or patterns.

`query`:::
(Optional, object) The {es} query domain-specific language
(<<query-dsl,DSL>>). This value corresponds to the query object in an {es}
search POST body. All the options that are supported by {es} can be used,
as this object is passed verbatim to {es}. By default, this property has
the following value: `{"match_all": {}}`.

`_source`:::
(Optional, object) Specify `includes` and/or `excludes` patterns to select
which fields are present in the destination. Fields that are excluded
cannot be included in the analysis.

`includes`::::
(array) An array of strings that defines the fields that are included in
the destination.

`excludes`::::
(array) An array of strings that defines the fields that are excluded
from the destination.

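
The following request is a minimal sketch of how the `source`, `dest`, and
`model_memory_limit` properties described above might be combined. The job ID,
index names, field names, query, and memory limit are hypothetical values
chosen for illustration only.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/weblogs_outliers
{
  "description": "Illustrative job with hypothetical indices and fields",
  "source": {
    "index": [ "weblogs-2019", "weblogs-2020-*" ], <1>
    "query": {
      "range": { "response.status": { "gte": 500 } } <2>
    },
    "_source": {
      "excludes": [ "client.ip" ] <3>
    }
  },
  "dest": {
    "index": "weblogs_outliers_out",
    "results_field": "analysis_results" <4>
  },
  "analysis": {
    "outlier_detection": {
    }
  },
  "model_memory_limit": "500mb" <5>
}
--------------------------------------------------
// TEST[skip:illustrative example with hypothetical indices]

<1> `index` is given as an array that mixes a single index and an index pattern.
<2> The `query` object is passed verbatim to {es}; it replaces the default
`{"match_all": {}}`.
<3> Fields excluded through `_source` are not present in the destination and
cannot be included in the analysis.
<4> Results are stored under `analysis_results` instead of the default `ml`.
<5> The request fails if this value is greater than a configured
`xpack.ml.max_model_memory_limit`.
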
[[dfanalytics-types]]
==== Analysis objects

{dfanalytics-cap} resources contain `analysis` objects. For example, when you
create a {dfanalytics-job}, you must define the type of analysis it performs.

[discrete]
[[oldetection-resources]]
==== {oldetection-cap} configuration objects

An `outlier_detection` configuration object has the following properties:

`compute_feature_influence`::
(boolean) If `true`, the feature influence calculation is enabled. Defaults to
`true`.

`feature_influence_threshold`::
(double) The minimum {olscore} that a document needs to have in order to
calculate its {fiscore}. Value range: 0-1 (`0.1` by default).

`method`::
(string) Sets the method that {oldetection} uses. If the method is not set,
{oldetection} uses an ensemble of different methods, then normalizes and
combines their individual {olscores} to obtain the overall {olscore}. We
recommend using the ensemble method. Available methods are `lof`, `ldof`,
`distance_kth_nn`, and `distance_knn`.

`n_neighbors`::
(integer) Defines how many nearest neighbors each method of {oldetection}
uses to calculate its {olscore}. When the value is not set, different values
are used for different ensemble members, which helps improve diversity in the
ensemble. Therefore, only override this if you are confident that the value
you choose is appropriate for the data set.

`outlier_fraction`::
(double) Sets the proportion of the data set that is assumed to be outlying
prior to {oldetection}. For example, 0.05 means it is assumed that 5% of
values are real outliers and 95% are inliers.

`standardization_enabled`::
(boolean) If `true`, then the following operation is performed on the columns
before computing outlier scores: (x_i - mean(x_i)) / sd(x_i). Defaults to
`true`. For more information, see
https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)[this wiki page about standardization].

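
As a sketch of how these properties appear in a request, the following example
overrides the ensemble with a single method. The job ID, destination index,
and all property values are hypothetical and are not recommendations; as noted
above, leaving `method` and `n_neighbors` unset is the recommended approach.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loganalytics_lof
{
  "source": {
    "index": "logdata"
  },
  "dest": {
    "index": "logdata_lof_out"
  },
  "analysis": {
    "outlier_detection": {
      "method": "lof", <1>
      "n_neighbors": 20, <2>
      "outlier_fraction": 0.05, <3>
      "feature_influence_threshold": 0.2, <4>
      "standardization_enabled": true
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative example with hypothetical values]

<1> Uses only the `lof` method instead of the ensemble.
<2> Every document is scored against its 20 nearest neighbors.
<3> Assumes that 5% of the values are real outliers.
<4> The {fiscore} is calculated only for documents whose {olscore} is at
least 0.2.
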
[discrete]
[[regression-resources]]
==== {regression-cap} configuration objects

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_analysis
{
  "source": {
    "index": "houses_sold_last_10_yrs" <1>
  },
  "dest": {
    "index": "house_price_predictions" <2>
  },
  "analysis": {
    "regression": { <3>
      "dependent_variable": "price" <4>
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]

<1> Training data is taken from the source index `houses_sold_last_10_yrs`.
<2> Analysis results are written to the destination index
`house_price_predictions`.
<3> The regression analysis configuration object.
<4> Regression analysis trains on the `price` field. Because no other
parameters are specified, it trains on 100% of the eligible data, stores its
prediction in the `price_prediction` field of the destination index, and uses
built-in hyperparameter optimization to minimize the validation error.

[float]
[[regression-resources-standard]]
===== Standard parameters

include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
+
--
The data type of the field must be numeric.
--

include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]

include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]

include::{docdir}/ml/ml-shared.asciidoc[tag=randomize_seed]

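
The following sketch shows the standard parameters set explicitly. It assumes
that the parameter names match the tag names of the shared definitions
included above; the job ID, destination index, prediction field name, and
values are hypothetical.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_80
{
  "source": {
    "index": "houses_sold_last_10_yrs"
  },
  "dest": {
    "index": "house_price_predictions_80"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "prediction_field_name": "predicted_price", <1>
      "training_percent": 80, <2>
      "randomize_seed": 42 <3>
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative example with hypothetical values]

<1> Overrides the default prediction field name.
<2> Trains on 80% of the eligible documents instead of 100%.
<3> An arbitrary seed, set so that repeated runs select the same training
documents.
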
[float]
[[regression-resources-advanced]]
===== Advanced parameters

Advanced parameters are for fine-tuning {reganalysis}. If you do not supply
them, <<ml-hyperparameter-optimization,hyperparameter optimization>> sets
their values automatically to give the minimum validation error. It is highly
recommended to use the default values unless you fully understand the function
of these parameters.

include::{docdir}/ml/ml-shared.asciidoc[tag=eta]

include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]

include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]

include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]

include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]

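
For illustration only, the following sketch sets every advanced parameter
explicitly. It assumes that the parameter names match the tag names of the
shared definitions included above. The job ID, destination index, and values
are arbitrary placeholders, not tuning advice; check each value against the
parameter definitions before use.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_manual
{
  "source": {
    "index": "houses_sold_last_10_yrs"
  },
  "dest": {
    "index": "house_price_predictions_manual"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "eta": 0.1,
      "feature_bag_fraction": 0.5,
      "maximum_number_trees": 200,
      "gamma": 0.5,
      "lambda": 1.0
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative example with hypothetical values]
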
[discrete]
[[classification-resources]]
==== {classification-cap} configuration objects

[float]
[[classification-resources-standard]]
===== Standard parameters

include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
+
--
The data type of the field must be numeric or boolean.
--

`num_top_classes`::
(Optional, integer) Defines the number of categories for which the predicted
probabilities are reported. It must be non-negative. If it is greater than the
total number of categories to predict (two in the {version} version of the
{stack}), all category probabilities are reported. Defaults to 2.

include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]

include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]

include::{docdir}/ml/ml-shared.asciidoc[tag=randomize_seed]

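
A minimal sketch of a {classification} job that uses these parameters might
look as follows. The job ID, index names, and the boolean `approved` field are
hypothetical.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loan_approval_classification
{
  "source": {
    "index": "loan_applications"
  },
  "dest": {
    "index": "loan_approval_predictions"
  },
  "analysis": {
    "classification": {
      "dependent_variable": "approved", <1>
      "num_top_classes": 2, <2>
      "training_percent": 75
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative example with hypothetical indices]

<1> The field to predict; its data type must be numeric or boolean.
<2> Reports the predicted probabilities for both categories.
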
[float]
[[classification-resources-advanced]]
===== Advanced parameters

Advanced parameters are for fine-tuning {classanalysis}. If you do not supply
them, <<ml-hyperparameter-optimization,hyperparameter optimization>> sets
their values automatically to give the minimum validation error. It is highly
recommended to use the default values unless you fully understand the function
of these parameters.

include::{docdir}/ml/ml-shared.asciidoc[tag=eta]

include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]

include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]

include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]

include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]

[[ml-hyperparameter-optimization]]
===== Hyperparameter optimization

If you don't supply {regression} or {classification} parameters, hyperparameter
optimization is performed by default to set a value for the undefined
parameters. The starting point is calculated for data-dependent parameters by
examining the loss on the training data. Subject to the size constraint, this
operation provides an upper bound on the improvement in validation loss.

A fixed number of rounds is used for optimization; this number depends on the
number of parameters being optimized. The optimization starts with random
search, followed by Bayesian optimization that targets maximum expected
improvement. If you override any parameters, the optimization uses the values
you provided for the overridden parameters and calculates the values of the
remaining parameters accordingly. The number of rounds is reduced accordingly.
The validation error is estimated in each round by using 4-fold cross
validation.

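
As a sketch of a partial override, the following request fixes a single
advanced parameter and leaves the others undefined, so that hyperparameter
optimization sets them. The job ID, destination index, and the value of `eta`
are hypothetical, and the parameter name is assumed to match the tag name of
its shared definition.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_fixed_eta
{
  "source": {
    "index": "houses_sold_last_10_yrs"
  },
  "dest": {
    "index": "house_price_predictions_fixed_eta"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "eta": 0.05 <1>
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative example with hypothetical values]

<1> Only `eta` is overridden. The remaining advanced parameters are set by
the optimization, which runs fewer rounds because there is one parameter less
to optimize.
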