[role="xpack"]
[[get-trained-models-stats]]
= Get trained models statistics API
[subs="attributes"]
++++
<titleabbrev>Get trained models stats</titleabbrev>
++++

Retrieves usage information for trained models.

[[ml-get-trained-models-stats-request]]
== {api-request-title}

`GET _ml/trained_models/_stats` +

`GET _ml/trained_models/_all/_stats` +

`GET _ml/trained_models/<model_id>/_stats` +

`GET _ml/trained_models/<model_id>,<model_id_2>/_stats` +

`GET _ml/trained_models/<model_id_pattern*>,<model_id_2>/_stats`
[[ml-get-trained-models-stats-prereq]]
== {api-prereq-title}

Requires the `monitor_ml` cluster privilege. This privilege is included in the
`machine_learning_user` built-in role.

[[ml-get-trained-models-stats-desc]]
== {api-description-title}

You can get usage information for multiple trained models in a single API
request by using a comma-separated list of model IDs or a wildcard expression.
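
For instance, a wildcard request such as the following (the `regression-*`
pattern is illustrative) returns stats for every model whose ID starts with
`regression-`:

[source,console]
--------------------------------------------------
GET _ml/trained_models/regression-*/_stats
--------------------------------------------------
// TEST[skip:TBD]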
[[ml-get-trained-models-stats-path-params]]
== {api-path-parms-title}

`<model_id>`::
(Optional, string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=model-id-or-alias]

[[ml-get-trained-models-stats-query-params]]
== {api-query-parms-title}

`allow_no_match`::
(Optional, Boolean)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=allow-no-match-models]

`from`::
(Optional, integer)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=from-models]

`size`::
(Optional, integer)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=size-models]
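
As a sketch, `from` and `size` can page through a long list of models; the
values here are illustrative:

[source,console]
--------------------------------------------------
GET _ml/trained_models/_all/_stats?from=10&size=10
--------------------------------------------------
// TEST[skip:TBD]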
[role="child_attributes"]
[[ml-get-trained-models-stats-results]]
== {api-response-body-title}

`count`::
(integer)
The total number of trained model statistics that matched the requested ID
patterns. This value can be higher than the number of items in the
`trained_model_stats` array because the size of the array is restricted by the
supplied `size` parameter.

`trained_model_stats`::
(array)
An array of trained model statistics, which are sorted by the `model_id` value
in ascending order.
+
.Properties of trained model stats
[%collapsible%open]
====
`deployment_stats`:::
(list)
A collection of deployment stats if one of the provided `model_id` values
is deployed.
+
.Properties of deployment stats
[%collapsible%open]
=====
`allocation_status`:::
(object)
The detailed allocation status given the deployment configuration.
+
.Properties of allocation status
[%collapsible%open]
======
`allocation_count`:::
(integer)
The current number of nodes where the model is allocated.

`cache_size`:::
(<<byte-units,byte value>>)
The inference cache size (in memory outside the JVM heap) per node for the model.

`state`:::
(string)
The detailed allocation state related to the nodes.
+
--
* `starting`: Allocations are being attempted but no node currently has the model allocated.
* `started`: At least one node has the model allocated.
* `fully_allocated`: The deployment is fully allocated and satisfies the `target_allocation_count`.
--

`target_allocation_count`:::
(integer)
The desired number of nodes for model allocation.
======
`error_count`:::
(integer)
The sum of `error_count` for all nodes in the deployment.

`inference_count`:::
(integer)
The sum of `inference_count` for all nodes in the deployment.

`model_id`:::
(string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=model-id]

`nodes`:::
(array of objects)
The deployment stats for each node that currently has the model allocated.
+
.Properties of node stats
[%collapsible%open]
======
`average_inference_time_ms`:::
(double)
The average time for each inference call to complete on this node.
The average is calculated over the lifetime of the deployment.

`average_inference_time_ms_last_minute`:::
(double)
The average time for each inference call to complete on this node
in the last minute.

`error_count`:::
(integer)
The number of errors when evaluating the trained model.

`inference_count`:::
(integer)
The total number of inference calls made against this node for this model.

`last_access`:::
(long)
The epoch time stamp of the last inference call for the model on this node.

`node`:::
(object)
Information pertaining to the node.
+
.Properties of node
[%collapsible%open]
========
`attributes`:::
(object)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=node-attributes]

`ephemeral_id`:::
(string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=node-ephemeral-id]

`id`:::
(string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=node-id]

`name`:::
(string) The node name.

`transport_address`:::
(string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=node-transport-address]
========
`number_of_allocations`:::
(integer)
The number of allocations assigned to this node.

`number_of_pending_requests`:::
(integer)
The number of inference requests queued to be processed.

`peak_throughput_per_minute`:::
(integer)
The peak number of requests processed in a 1 minute period.

`routing_state`:::
(object)
The current routing state, and the reason for that state, for this allocation.
+
.Properties of routing_state
[%collapsible%open]
========
`reason`:::
(string)
The reason for the current state. Usually only populated when the
`routing_state` is `failed`.

`routing_state`:::
(string)
The current routing state.
+
--
* `starting`: The model is attempting to allocate on this node; inference calls are not yet accepted.
* `started`: The model is allocated and ready to accept inference requests.
* `stopping`: The model is being deallocated from this node.
* `stopped`: The model is fully deallocated from this node.
* `failed`: The allocation attempt failed; see the `reason` field for the potential cause.
--
========

`rejected_execution_count`:::
(integer)
The number of inference requests that were not processed because the
queue was full.

`start_time`:::
(long)
The epoch timestamp when the allocation started.

`threads_per_allocation`:::
(integer)
The number of threads used by each allocation during inference.
This value is limited by the number of hardware threads on the node;
it might therefore differ from the `threads_per_allocation` value in the
<<start-trained-model-deployment>> API.

`timeout_count`:::
(integer)
The number of inference requests that timed out before being processed.

`throughput_last_minute`:::
(integer)
The number of requests processed in the last 1 minute.
======
`number_of_allocations`:::
(integer)
The requested number of allocations for the trained model deployment.

`peak_throughput_per_minute`:::
(integer)
The peak number of requests processed in a 1 minute period for
all nodes in the deployment. This is calculated as the sum of
each node's `peak_throughput_per_minute` value.

`queue_capacity`:::
(integer)
The number of inference requests that may be queued before new requests are
rejected.

`rejected_execution_count`:::
(integer)
The sum of `rejected_execution_count` for all nodes in the deployment.
Individual nodes reject an inference request if the inference queue is full.
The queue size is controlled by the `queue_capacity` setting in the
<<start-trained-model-deployment>> API.

`reason`:::
(string)
The reason for the current deployment state.
Usually only populated when the model is not deployed to a node.

`start_time`:::
(long)
The epoch timestamp when the deployment started.

`state`:::
(string)
The overall state of the deployment. The values may be:
+
--
* `starting`: The deployment has recently started but is not yet usable as the model is not allocated on any nodes.
* `started`: The deployment is usable as at least one node has the model allocated.
* `stopping`: The deployment is preparing to stop and deallocate the model from the relevant nodes.
--

`threads_per_allocation`:::
(integer)
The number of threads per allocation used by the inference process.

`timeout_count`:::
(integer)
The sum of `timeout_count` for all nodes in the deployment.
=====
`inference_stats`:::
(object)
A collection of inference stats fields.
+
.Properties of inference stats
[%collapsible%open]
=====
`cache_miss_count`:::
(integer)
The number of times the model was loaded for inference and was not retrieved
from the cache. If this number is close to the `inference_count`, the cache
is not being used effectively. You can address this by increasing the cache
size or its time-to-live (TTL). See <<general-ml-settings>> for the
appropriate settings.

`failure_count`:::
(integer)
The number of failures when using the model for inference.

`inference_count`:::
(integer)
The total number of times the model has been called for inference.
This is across all inference contexts, including all pipelines.

`missing_all_fields_count`:::
(integer)
The number of inference calls where all the training features for the model
were missing.

`timestamp`:::
(<<time-units,time units>>)
The time when the statistics were last updated.
=====
`ingest`:::
(object)
A collection of ingest stats for the model across all nodes. The values are
summations of the individual node statistics. The format matches the `ingest`
section in <<cluster-nodes-stats>>.

`model_id`:::
(string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=model-id]

`model_size_stats`:::
(object)
A collection of model size stats fields.
+
.Properties of model size stats
[%collapsible%open]
=====
`model_size_bytes`:::
(integer)
The size of the model in bytes.

`required_native_memory_bytes`:::
(integer)
The amount of memory, in bytes, required to load the model.
=====

`pipeline_count`:::
(integer)
The number of ingest pipelines that currently refer to the model.
====
[[ml-get-trained-models-stats-response-codes]]
== {api-response-codes-title}

`404` (Missing resources)::
If `allow_no_match` is `false`, this code indicates that there are no
resources that match the request or only partial matches for the request.

[[ml-get-trained-models-stats-example]]
== {api-examples-title}

The following example gets usage information for all the trained models:

[source,console]
--------------------------------------------------
GET _ml/trained_models/_stats
--------------------------------------------------
// TEST[skip:TBD]
The API returns the following results:

[source,console-result]
----
{
  "count": 2,
  "trained_model_stats": [
    {
      "model_id": "flight-delay-prediction-1574775339910",
      "pipeline_count": 0,
      "inference_stats": {
        "failure_count": 0,
        "inference_count": 4,
        "cache_miss_count": 3,
        "missing_all_fields_count": 0,
        "timestamp": 1592399986979
      }
    },
    {
      "model_id": "regression-job-one-1574775307356",
      "pipeline_count": 1,
      "inference_stats": {
        "failure_count": 0,
        "inference_count": 178,
        "cache_miss_count": 3,
        "missing_all_fields_count": 0,
        "timestamp": 1592399986979
      },
      "ingest": {
        "total": {
          "count": 178,
          "time_in_millis": 8,
          "current": 0,
          "failed": 0
        },
        "pipelines": {
          "flight-delay": {
            "count": 178,
            "time_in_millis": 8,
            "current": 0,
            "failed": 0,
            "processors": [
              {
                "inference": {
                  "type": "inference",
                  "stats": {
                    "count": 178,
                    "time_in_millis": 7,
                    "current": 0,
                    "failed": 0
                  }
                }
              }
            ]
          }
        }
      }
    }
  ]
}
----
// NOTCONSOLE
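
As a minimal client-side sketch (not part of the API itself), the following
Python snippet parses an abridged copy of the sample response above and flags
models whose `cache_miss_count` is close to their `inference_count`, which
the `inference_stats` description notes is a sign the inference cache is not
being used effectively:

```python
import json

# Abridged sample response body from GET _ml/trained_models/_stats.
response = json.loads("""
{
  "count": 2,
  "trained_model_stats": [
    {"model_id": "flight-delay-prediction-1574775339910",
     "inference_stats": {"inference_count": 4, "cache_miss_count": 3}},
    {"model_id": "regression-job-one-1574775307356",
     "inference_stats": {"inference_count": 178, "cache_miss_count": 3}}
  ]
}
""")

def cache_miss_ratio(stats):
    """Fraction of inference calls that had to reload the model."""
    inference_count = stats["inference_count"]
    if inference_count == 0:
        return 0.0
    return stats["cache_miss_count"] / inference_count

# A ratio near 1.0 suggests increasing the cache size or its TTL.
for model in response["trained_model_stats"]:
    ratio = cache_miss_ratio(model["inference_stats"])
    print(f"{model['model_id']}: cache miss ratio {ratio:.2f}")
```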