[role="xpack"]
[[start-trained-model-deployment]]
= Start trained model deployment API
[subs="attributes"]
++++
<titleabbrev>Start trained model deployment</titleabbrev>
++++

Starts a new trained model deployment.

[[start-trained-model-deployment-request]]
== {api-request-title}

`POST _ml/trained_models/<model_id>/deployment/_start`

[[start-trained-model-deployment-prereq]]
== {api-prereq-title}

Requires the `manage_ml` cluster privilege. This privilege is included in the
`machine_learning_admin` built-in role.

[[start-trained-model-deployment-desc]]
== {api-description-title}

Currently only `pytorch` models are supported for deployment. Once deployed,
the model can be used by the <<inference-processor,{infer-cap} processor>>
in an ingest pipeline or directly in the <<infer-trained-model>> API.
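
For example, once a deployment has started, the model can be evaluated
directly with the <<infer-trained-model>> API. The following is a minimal
sketch; the `my_model` ID and the `text_field` input field name are
placeholders that depend on your model's configuration:

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_model/_infer
{
  "docs": [
    {
      "text_field": "The fox jumped over the dog"
    }
  ]
}
--------------------------------------------------
// TEST[skip:hypothetical model ID]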

Scaling inference performance can be achieved by setting the parameters
`number_of_allocations` and `threads_per_allocation`, as shown in the sketch
below.

Increasing `threads_per_allocation` means more threads are used when an
inference request is processed on a node. This can improve inference speed for
certain models and may also improve throughput.

Increasing `number_of_allocations` means more inference requests can be
processed in parallel, which improves throughput. Each model allocation uses
the number of threads defined by `threads_per_allocation`.

Model allocations are distributed across {ml} nodes. All allocations assigned
to a node share the same copy of the model in memory. To avoid thread
oversubscription, which is detrimental to performance, model allocations are
distributed in such a way that the total number of used threads does not
surpass the node's allocated processors.
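
For example, the following sketch (using a hypothetical `my_model`) starts a
deployment with four allocations of two threads each, for a total of
4 * 2 = 8 threads spread across the {ml} nodes:

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_model/deployment/_start?number_of_allocations=4&threads_per_allocation=2
--------------------------------------------------
// TEST[skip:hypothetical model ID]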

[[start-trained-model-deployment-path-params]]
== {api-path-parms-title}

`<model_id>`::
(Required, string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=model-id]

[[start-trained-model-deployment-query-params]]
== {api-query-parms-title}

`cache_size`::
(Optional, <<byte-units,byte value>>)
The inference cache size (in memory outside the JVM heap) per node for the
model. The default value is the size of the model as reported by the
`model_size_bytes` field in the <<get-trained-models-stats>> API. To disable
the cache, provide `0b`.

`number_of_allocations`::
(Optional, integer)
The total number of allocations this model is assigned across {ml} nodes.
Increasing this value generally increases the throughput. Defaults to 1.

`priority`::
(Optional, string)
The priority of the deployment. The default value is `normal`.
There are two priority settings:
+
--
* `normal`: Use this for deployments in production. The deployment allocations
are distributed so that node processors are not oversubscribed.
* `low`: Use this for testing model functionality. The intention is that these
deployments are not sent a high volume of input. The deployment is required to
have a single allocation with just one thread. Low priority deployments may be
assigned to nodes that already utilize all their processors but will be given a
lower CPU priority than normal deployments. Low priority deployments may be
unassigned in order to satisfy more allocations of normal priority deployments.
--

WARNING: Heavy usage of low priority deployments may impact performance of
normal priority deployments.

`queue_capacity`::
(Optional, integer)
Controls how many inference requests are allowed in the queue at a time.
Every machine learning node in the cluster where the model can be allocated
has a queue of this size; when the number of requests exceeds the total value,
new requests are rejected with a `429` error. Defaults to 1024. The maximum
allowed value is 1000000.

`threads_per_allocation`::
(Optional, integer)
Sets the number of threads used by each model allocation during inference. This
generally increases the speed per inference request. The inference process is a
compute-bound process; `threads_per_allocation` must not exceed the number of
available allocated processors per node. Defaults to 1. Must be a power of 2.
The maximum allowed value is 32.

`timeout`::
(Optional, time)
Controls the amount of time to wait for the model to deploy. Defaults to 20
seconds.

`wait_for`::
(Optional, string)
Specifies the allocation status to wait for before returning. Defaults to
`started`. The value `starting` indicates deployment is starting but not yet on
any node. The value `started` indicates the model has started on at least one
node. The value `fully_allocated` indicates the deployment has started on all
valid nodes.
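
These query parameters can be combined. For instance, the following sketch
(using a hypothetical `my_model`) starts a deployment with the inference
cache disabled and a larger request queue:

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_model/deployment/_start?cache_size=0b&queue_capacity=10000
--------------------------------------------------
// TEST[skip:hypothetical model ID]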

[[start-trained-model-deployment-example]]
== {api-examples-title}

The following example starts a new deployment for the
`elastic__distilbert-base-uncased-finetuned-conll03-english` trained model:

[source,console]
--------------------------------------------------
POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/deployment/_start?wait_for=started&timeout=1m
--------------------------------------------------
// TEST[skip:TBD]

The API returns the following results:

[source,console-result]
----
{
  "assignment": {
    "task_parameters": {
      "model_id": "elastic__distilbert-base-uncased-finetuned-conll03-english",
      "model_bytes": 265632637,
      "threads_per_allocation": 1,
      "number_of_allocations": 1,
      "queue_capacity": 1024,
      "priority": "normal"
    },
    "routing_table": {
      "uckeG3R8TLe2MMNBQ6AGrw": {
        "routing_state": "started",
        "reason": ""
      }
    },
    "assignment_state": "started",
    "start_time": "2022-11-02T11:50:34.766591Z"
  }
}
----
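
The following sketch starts a low priority deployment of a hypothetical
`my_model` for testing, waiting until the deployment has started on all valid
nodes:

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_model/deployment/_start?priority=low&wait_for=fully_allocated
--------------------------------------------------
// TEST[skip:hypothetical model ID]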