@@ -23,11 +23,28 @@ Requires the `manage_ml` cluster privilege. This privilege is included in the

[[start-trained-model-deployment-desc]]
== {api-description-title}

-Currently only `pytorch` models are supported for deployment. When deployed,
-the model attempts allocation to every machine learning node. Once deployed
+Currently only `pytorch` models are supported for deployment. Once deployed
the model can be used by the <<inference-processor,{infer-cap} processor>>
in an ingest pipeline or directly in the <<infer-trained-model>> API.

+You can scale inference performance by setting the parameters
+`number_of_allocations` and `threads_per_allocation`.
+
+Increasing `threads_per_allocation` means more threads are used when
+an inference request is processed on a node. This can improve inference speed
+for certain models. It may also improve throughput.
+
+Increasing `number_of_allocations` means more threads are used to
+process multiple inference requests in parallel, resulting in improved
+throughput. Each model allocation uses the number of threads defined by
+`threads_per_allocation`.
+
+Model allocations are distributed across {ml} nodes. All allocations assigned
+to a node share the same copy of the model in memory. To avoid
+thread oversubscription, which is detrimental to performance, model allocations
+are distributed in such a way that the total number of threads used does not
+surpass the node's allocated processors.
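+For example, on a node with 8 allocated processors and `threads_per_allocation`
+set to 4, at most 2 allocations can be assigned to that node.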
+
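+As a sketch, assuming `number_of_allocations` and `threads_per_allocation` are
+passed as query parameters of the start deployment endpoint, and using the
+placeholder model ID `my_model`, a deployment with two allocations of two
+threads each could be started like this:
+
+[source,console]
+----
+POST _ml/trained_models/my_model/deployment/_start?number_of_allocations=2&threads_per_allocation=2
+----
+// TEST[skip:requires a deployed model]
+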
[[start-trained-model-deployment-path-params]]
== {api-path-parms-title}
@@ -40,21 +57,9 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=model-id]

`number_of_allocations`::
(Optional, integer)
-The number of model allocations on each node where the model is deployed.
-All allocations on a node share the same copy of the model in memory but use
-a separate set of threads to evaluate the model.
+The total number of allocations this model is assigned across {ml} nodes.
Increasing this value generally increases the throughput.
-If this setting is greater than the number of hardware threads
-it will automatically be changed to a value less than the number of hardware threads.
Defaults to 1.
-+
---
-[NOTE]
-=============================================
-If the sum of `threads_per_allocation` and `number_of_allocations` is greater
-than the number of hardware threads, the `threads_per_allocation` value is reduced.
-=============================================
---

`queue_capacity`::
(Optional, integer)
@@ -66,10 +71,8 @@ new requests are rejected with a 429 error. Defaults to 1024.
`threads_per_allocation`::
(Optional, integer)
Sets the number of threads used by each model allocation during inference. This generally increases
-the inference speed. The inference process is a compute-bound process; any number
-greater than the number of available hardware threads on the machine does not increase the
-inference speed. If this setting is greater than the number of hardware threads
-it will automatically be changed to a value less than the number of hardware threads.
+the speed per inference request. The inference process is a compute-bound process;
+`threads_per_allocation` must not exceed the number of allocated processors per node.
Defaults to 1. Must be a power of 2. Max allowed value is 32.

`timeout`::