@@ -23,11 +23,28 @@ Requires the `manage_ml` cluster privilege. This privilege is included in the
 [[start-trained-model-deployment-desc]]
 == {api-description-title}
 
-Currently only `pytorch` models are supported for deployment. When deployed,
-the model attempts allocation to every machine learning node. Once deployed
+Currently only `pytorch` models are supported for deployment. Once deployed,
 the model can be used by the <<inference-processor,{infer-cap} processor>>
 in an ingest pipeline or directly in the <<infer-trained-model>> API.
 
+Inference performance can be scaled by setting the parameters
+`number_of_allocations` and `threads_per_allocation`.
+
+Increasing `threads_per_allocation` means more threads are used when
+an inference request is processed on a node. This can improve inference speed
+for certain models. It may also improve throughput.
+
+Increasing `number_of_allocations` means more threads are used to
+process multiple inference requests in parallel, resulting in higher
+throughput. Each model allocation uses the number of threads defined by
+`threads_per_allocation`.
+
+Model allocations are distributed across {ml} nodes. All allocations assigned
+to a node share the same copy of the model in memory. To avoid
+thread oversubscription, which is detrimental to performance, model allocations
+are distributed so that the total number of threads used does not
+exceed the node's allocated processors.
+
 [[start-trained-model-deployment-path-params]]
 == {api-path-parms-title}
 
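Reviewer note: the two parameters introduced above are set when starting the deployment. A minimal sketch of such a request, assuming a hypothetical model ID `my_model` and illustrative parameter values (neither is part of this change):

[source,console]
----
POST _ml/trained_models/my_model/deployment/_start?number_of_allocations=4&threads_per_allocation=2
----

With these assumed values, the deployment uses at most 4 × 2 = 8 threads in total, spread across the {ml} nodes so that no node exceeds its allocated processors.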
	
	
		
			
@@ -40,21 +57,9 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=model-id]
 
 `number_of_allocations`::
 (Optional, integer)
-The number of model allocations on each node where the model is deployed.
-All allocations on a node share the same copy of the model in memory but use
-a separate set of threads to evaluate the model. 
+The total number of allocations this model is assigned across {ml} nodes.
 Increasing this value generally increases the throughput.
-If this setting is greater than the number of hardware threads
-it will automatically be changed to a value less than the number of hardware threads.
 Defaults to 1.
-+
---
-[NOTE]
-=============================================
-If the sum of `threads_per_allocation` and `number_of_allocations` is greater
-than the number of hardware threads, the `threads_per_allocation` value is reduced.
-=============================================
---
 
 `queue_capacity`::
 (Optional, integer)
@@ -66,10 +71,8 @@ new requests are rejected with a 429 error. Defaults to 1024.
 
 `threads_per_allocation`::
 (Optional, integer)
 Sets the number of threads used by each model allocation during inference. This generally increases
-the inference speed. The inference process is a compute-bound process; any number
-greater than the number of available hardware threads on the machine does not increase the
-inference speed. If this setting is greater than the number of hardware threads
-it will automatically be changed to a value less than the number of hardware threads.
+the speed per inference request. The inference process is a compute-bound process;
+`threads_per_allocation` must not exceed the number of allocated processors available per node.
 Defaults to 1. Must be a power of 2. Max allowed value is 32.
 
 `timeout`::
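Reviewer note: after starting a deployment, how the allocations were actually distributed can be inspected with the trained model statistics API. A hedged example, again assuming the placeholder model ID `my_model`:

[source,console]
----
GET _ml/trained_models/my_model/_stats
----

The deployment statistics in the response report, per node, how many allocations were assigned and how many threads they use, which makes it easy to confirm that no node has been oversubscribed.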