
[DOCS] Highlights inference caching behavior (#91608)

István Zoltán Szabó 2 years ago
parent
commit
612a7b673a

+ 10 - 9
docs/reference/ml/trained-models/apis/infer-trained-model-deployment.asciidoc

@@ -46,8 +46,8 @@ Controls the amount of time to wait for {infer} results. Defaults to 10 seconds.
 `docs`::
 (Required, array)
 An array of objects to pass to the model for inference. The objects should
-contain a field matching your configured trained model input. Typically, the field
-name is `text_field`. Currently, only a single value is allowed.
+contain a field matching your configured trained model input. Typically, the 
+field name is `text_field`. Currently, only a single value is allowed.
 
 ////
 [[infer-trained-model-deployment-results]]
@@ -62,8 +62,8 @@ name is `text_field`. Currently, only a single value is allowed.
 [[infer-trained-model-deployment-example]]
 == {api-examples-title}
 
-The response depends on the task the model is trained for. If it is a
-text classification task, the response is the score. For example:
+The response depends on the task the model is trained for. If it is a text 
+classification task, the response is the score. For example:
 
 [source,console]
 --------------------------------------------------
@@ -123,8 +123,8 @@ The API returns in this case:
 ----
 // NOTCONSOLE
 
-Zero-shot classification tasks require extra configuration defining the class labels.
-These labels are passed in the zero-shot inference config.
+Zero-shot classification tasks require extra configuration defining the class 
+labels. These labels are passed in the zero-shot inference config.
 
 [source,console]
 --------------------------------------------------
@@ -150,7 +150,8 @@ POST _ml/trained_models/model2/deployment/_infer
 --------------------------------------------------
 // TEST[skip:TBD]
 
-The API returns the predicted label and the confidence, as well as the top classes:
+The API returns the predicted label and the confidence, as well as the top 
+classes:
 
 [source,console-result]
 ----
@@ -204,8 +205,8 @@ POST _ml/trained_models/model2/deployment/_infer
 --------------------------------------------------
 // TEST[skip:TBD]
 
-When the input has been truncated due to the limit imposed by the model's `max_sequence_length`
-the `is_truncated` field appears in the response.
+When the input has been truncated due to the limit imposed by the model's 
+`max_sequence_length`, the `is_truncated` field appears in the response.
 
 [source,console-result]
 ----
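For reference, a minimal request matching the `docs` description above might look like the following sketch; the `model2` deployment name is reused from the examples in this file and the input text is purely illustrative:

[source,console]
--------------------------------------------------
POST _ml/trained_models/model2/deployment/_infer
{
  "docs": [
    {
      "text_field": "This is an example of the input text."
    }
  ]
}
--------------------------------------------------
// TEST[skip:TBD]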

+ 13 - 6
docs/reference/ml/trained-models/apis/infer-trained-model.asciidoc

@@ -6,7 +6,11 @@
 <titleabbrev>Infer trained model</titleabbrev>
 ++++
 
-Evaluates a trained model. The model may be any supervised model either trained by {dfanalytics} or imported.
+Evaluates a trained model. The model may be any supervised model either trained 
+by {dfanalytics} or imported.
+
+NOTE: For model deployments with caching enabled, results may be returned 
+directly from the {infer} cache.
 
 beta::[]
 
@@ -102,7 +106,8 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-nlp-fill-mask]
 =====
 `num_top_classes`::::
 (Optional, integer)
-Number of top predicted tokens to return for replacing the mask token. Defaults to `0`.
+Number of top predicted tokens to return for replacing the mask token. Defaults 
+to `0`.
 
 `results_field`::::
 (Optional, string)
@@ -272,7 +277,8 @@ The maximum amount of words in the answer. Defaults to `15`.
 
 `num_top_classes`::::
 (Optional, integer)
-The number the top found answers to return. Defaults to `0`, meaning only the best found answer is returned.
+The number of top found answers to return. Defaults to `0`, meaning only the
+best found answer is returned.
 
 `question`::::
 (Required, string)
@@ -368,7 +374,8 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=inference-config-text-classific
 
 `num_top_classes`::::
 (Optional, integer)
-Specifies the number of top class predictions to return. Defaults to all classes (-1).
+Specifies the number of top class predictions to return. Defaults to all classes 
+(-1).
 
 `results_field`::::
 (Optional, string)
@@ -879,8 +886,8 @@ POST _ml/trained_models/model2/_infer
 --------------------------------------------------
 // TEST[skip:TBD]
 
-When the input has been truncated due to the limit imposed by the model's `max_sequence_length`
-the `is_truncated` field appears in the response.
+When the input has been truncated due to the limit imposed by the model's 
+`max_sequence_length`, the `is_truncated` field appears in the response.
 
 [source,console-result]
 ----
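Because this commit highlights the {infer} cache, a hedged sketch of how the cached path might be exercised is shown below; the model ID and cache size are illustrative, `cache_size` is assumed to be passed as a query parameter of the start deployment API (see the next file), and the response body does not indicate whether a result came from the cache:

[source,console]
--------------------------------------------------
POST _ml/trained_models/model2/deployment/_start?cache_size=100mb

POST _ml/trained_models/model2/_infer
{
  "docs": [
    {
      "text_field": "This is an example of the input text."
    }
  ]
}
--------------------------------------------------
// TEST[skip:TBD]

Per the added NOTE, repeating the same `_infer` request against a deployment started with caching enabled may be answered directly from the {infer} cache rather than by re-running the model.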

+ 35 - 33
docs/reference/ml/trained-models/apis/start-trained-model-deployment.asciidoc

@@ -30,20 +30,20 @@ in an ingest pipeline or directly in the <<infer-trained-model>> API.
 Scaling inference performance can be achieved by setting the parameters
 `number_of_allocations` and `threads_per_allocation`.
 
-Increasing `threads_per_allocation` means more threads are used when
-an inference request is processed on a node. This can improve inference speed
-for certain models. It may also result in improvement to throughput.
+Increasing `threads_per_allocation` means more threads are used when an 
+inference request is processed on a node. This can improve inference speed for 
+certain models. It may also result in improvement to throughput.
 
-Increasing `number_of_allocations` means more threads are used to
-process multiple inference requests in parallel resulting in throughput
-improvement. Each model allocation uses a number of threads defined by
+Increasing `number_of_allocations` means more threads are used to process 
+multiple inference requests in parallel resulting in throughput improvement. 
+Each model allocation uses a number of threads defined by 
 `threads_per_allocation`.
 
-Model allocations are distributed across {ml} nodes. All allocations assigned
-to a node share the same copy of the model in memory. To avoid
-thread oversubscription which is detrimental to performance, model allocations
-are distributed in such a way that the total number of used threads does not
-surpass the node's allocated processors.
+Model allocations are distributed across {ml} nodes. All allocations assigned to 
+a node share the same copy of the model in memory. To avoid thread 
+oversubscription, which is detrimental to performance, model allocations are
+distributed in such a way that the total number of used threads does not surpass 
+the node's allocated processors.
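As a rough worked example of that constraint (hypothetical numbers, not from the docs): on a node with 8 allocated processors, a deployment using `threads_per_allocation: 4` can host at most two allocations on that node, because 2 × 4 = 8 threads exactly matches the allocated processors and a third allocation would oversubscribe them.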
 
 [[start-trained-model-deployment-path-params]]
 == {api-path-parms-title}
@@ -57,53 +57,55 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=model-id]
 
 `cache_size`::
 (Optional, <<byte-units,byte value>>)
-The inference cache size (in memory outside the JVM heap) per node for the model.
-The default value is the same size as the `model_size_bytes`. To disable the cache, `0b` can be provided.
+The inference cache size (in memory outside the JVM heap) per node for the 
+model. The default value is the size of the model as reported by the 
+`model_size_bytes` field in the <<get-trained-models-stats>>. To disable the 
+cache, `0b` can be provided.
 
 `number_of_allocations`::
 (Optional, integer)
 The total number of allocations this model is assigned across {ml} nodes.
-Increasing this value generally increases the throughput.
-Defaults to 1.
+Increasing this value generally increases the throughput. Defaults to 1.
 
 `priority`::
 (Optional, string)
 The priority of the deployment. The default value is `normal`.
-
 There are two priority settings:
 +
 --
-* `normal`: Use this for deployments in production.
-The deployment allocations are distributed so that node processors are not oversubscribed.
-* `low`: Use this for testing model functionality.
-The intention is that these deployments are not sent a high volume of input.
-The deployment is required to have a single allocation with just one thread.
-Low priority deployments may be assigned on nodes that already utilize all their processors
-but will be given a lower CPU priority than normal deployments. Low priority deployments may be unassigned in order
-to satisfy more allocations of normal priority deployments.
+* `normal`: Use this for deployments in production. The deployment allocations 
+are distributed so that node processors are not oversubscribed.
+* `low`: Use this for testing model functionality. The intention is that these 
+deployments are not sent a high volume of input. The deployment is required to 
+have a single allocation with just one thread. Low priority deployments may be 
+assigned on nodes that already utilize all their processors but will be given a 
+lower CPU priority than normal deployments. Low priority deployments may be 
+unassigned in order to satisfy more allocations of normal priority deployments.
 --
 
-WARNING: Heavy usage of low priority deployments may impact performance of normal
-priority deployments.
+WARNING: Heavy usage of low priority deployments may impact performance of 
+normal priority deployments.
 
 `queue_capacity`::
 (Optional, integer)
 Controls how many inference requests are allowed in the queue at a time.
 Every machine learning node in the cluster where the model can be allocated
 has a queue of this size; when the number of requests exceeds the total value,
-new requests are rejected with a 429 error. Defaults to 1024. Max allowed value is 1000000.
+new requests are rejected with a 429 error. Defaults to 1024. Max allowed value 
+is 1000000.
 
 `threads_per_allocation`::
 (Optional, integer)
-Sets the number of threads used by each model allocation during inference. This generally increases
-the speed per inference request. The inference process is a compute-bound process;
-`threads_per_allocations` must not exceed the number of available allocated processors per node.
-Defaults to 1. Must be a power of 2. Max allowed value is 32.
+Sets the number of threads used by each model allocation during inference. This 
+generally increases the speed per inference request. The inference process is a 
+compute-bound process; `threads_per_allocation` must not exceed the number of
+available allocated processors per node. Defaults to 1. Must be a power of 2. 
+Max allowed value is 32.
 
 `timeout`::
 (Optional, time)
-Controls the amount of time to wait for the model to deploy. Defaults
-to 20 seconds.
+Controls the amount of time to wait for the model to deploy. Defaults to 20 
+seconds.
 
 `wait_for`::
 (Optional, string)
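
To tie the parameters above together, a hedged example of starting a deployment is sketched below; the model ID and all values are illustrative, and the settings are assumed to be passed as query parameters:

[source,console]
--------------------------------------------------
POST _ml/trained_models/model2/deployment/_start?number_of_allocations=2&threads_per_allocation=4&queue_capacity=2048&cache_size=0b&priority=normal&timeout=30s
--------------------------------------------------
// TEST[skip:TBD]

With these illustrative values, each allocation uses 4 threads, roughly 2048 pending requests can be queued per node before new ones are rejected with a 429 error, and the inference cache is disabled by setting `cache_size` to `0b`.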