[role="xpack"]
[[start-trained-model-deployment]]
= Start trained model deployment API
[subs="attributes"]
++++
<titleabbrev>Start trained model deployment</titleabbrev>
++++

Starts a new trained model deployment.

preview::[]

[[start-trained-model-deployment-request]]
== {api-request-title}

`POST _ml/trained_models/<model_id>/deployment/_start`

[[start-trained-model-deployment-prereq]]
== {api-prereq-title}

Requires the `manage_ml` cluster privilege. This privilege is included in the
`machine_learning_admin` built-in role.

[[start-trained-model-deployment-desc]]
== {api-description-title}

Currently only `pytorch` models are supported for deployment. Once deployed,
the model can be used by the <<inference-processor,{infer-cap} processor>>
in an ingest pipeline or directly in the <<infer-trained-model>> API.

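For instance, once a deployment has started, the model can be called through
the <<infer-trained-model>> API. The request below is a minimal sketch: the
model ID `my_model` is a placeholder, and the `text_field` input field is an
assumption, since the expected input fields depend on the model:

[source,console]
----
POST _ml/trained_models/my_model/_infer
{
  "docs": [
    {
      "text_field": "The quick brown fox jumps over the lazy dog."
    }
  ]
}
----
// TEST[skip:TBD]
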
Scaling inference performance can be achieved by setting the parameters
`number_of_allocations` and `threads_per_allocation`.

Increasing `threads_per_allocation` means more threads are used when an
inference request is processed on a node. This can improve inference speed for
certain models and may also improve throughput.

Increasing `number_of_allocations` means more threads are used to process
multiple inference requests in parallel, which improves throughput. Each model
allocation uses a number of threads defined by `threads_per_allocation`.

Model allocations are distributed across {ml} nodes. All allocations assigned
to a node share the same copy of the model in memory. To avoid thread
oversubscription, which is detrimental to performance, model allocations are
distributed in such a way that the total number of threads used does not
surpass the node's allocated processors.

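For example, the following request (with the placeholder model ID `my_model`)
starts a deployment with four allocations of two threads each, so up to four
inference requests can be processed in parallel, each using two threads, for
eight threads in total across the hosting nodes:

[source,console]
----
POST _ml/trained_models/my_model/deployment/_start?number_of_allocations=4&threads_per_allocation=2
----
// TEST[skip:TBD]
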
[[start-trained-model-deployment-path-params]]
== {api-path-parms-title}

`<model_id>`::
(Required, string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=model-id]

[[start-trained-model-deployment-query-params]]
== {api-query-parms-title}

`cache_size`::
(Optional, <<byte-units,byte value>>)
The inference cache size (in memory outside the JVM heap) per node for the
model. Defaults to the same size as the model, as reported by
`model_size_bytes`. To disable the cache, specify `0b`.

`number_of_allocations`::
(Optional, integer)
The total number of allocations this model is assigned across {ml} nodes.
Increasing this value generally increases the throughput. Defaults to 1.

`queue_capacity`::
(Optional, integer)
Controls how many inference requests are allowed in the queue at a time. Every
{ml} node in the cluster where the model can be allocated has a queue of this
size; when the number of requests exceeds the total value, new requests are
rejected with a 429 error. Defaults to 1024.

`threads_per_allocation`::
(Optional, integer)
Sets the number of threads used by each model allocation during inference. This
generally increases the speed per inference request. Because inference is a
compute-bound process, `threads_per_allocation` must not exceed the number of
available allocated processors per node. Defaults to 1. Must be a power of 2.
The maximum allowed value is 32.

`timeout`::
(Optional, time)
Controls the amount of time to wait for the model to deploy. Defaults to 20
seconds.

`wait_for`::
(Optional, string)
Specifies the allocation status to wait for before returning. Defaults to
`started`. The value `starting` indicates deployment is starting but not yet on
any node. The value `started` indicates the model has started on at least one
node. The value `fully_allocated` indicates the deployment has started on all
valid nodes. See the combined example after this list.

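The query parameters can be combined. The following sketch (again with the
placeholder model ID `my_model`; the parameter values are illustrative only)
disables the inference cache, doubles the default queue size, and blocks until
the deployment has started on all valid nodes:

[source,console]
----
POST _ml/trained_models/my_model/deployment/_start?cache_size=0b&queue_capacity=2048&wait_for=fully_allocated
----
// TEST[skip:TBD]
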
[[start-trained-model-deployment-example]]
== {api-examples-title}

The following example starts a new deployment for the
`elastic__distilbert-base-uncased-finetuned-conll03-english` trained model:

[source,console]
--------------------------------------------------
POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/deployment/_start?wait_for=started&timeout=1m
--------------------------------------------------
// TEST[skip:TBD]

The API returns the following results:

[source,console-result]
----
{
    "assignment": {
        "task_parameters": {
            "model_id": "elastic__distilbert-base-uncased-finetuned-conll03-english",
            "model_bytes": 265632637,
            "threads_per_allocation" : 1,
            "number_of_allocations" : 1,
            "queue_capacity" : 1024
        },
        "routing_table": {
            "uckeG3R8TLe2MMNBQ6AGrw": {
                "routing_state": "started",
                "reason": ""
            }
        },
        "assignment_state": "started",
        "start_time": "2022-11-02T11:50:34.766591Z"
    }
}
----