@@ -28,19 +28,19 @@ in an ingest pipeline or directly in the <<infer-trained-model>> API.
Scaling inference performance can be achieved by setting the parameters
`number_of_allocations` and `threads_per_allocation`.

-Increasing `threads_per_allocation` means more threads are used when an
-inference request is processed on a node. This can improve inference speed for
+Increasing `threads_per_allocation` means more threads are used when an
+inference request is processed on a node. This can improve inference speed for
certain models. It may also improve throughput.

-Increasing `number_of_allocations` means more threads are used to process
-multiple inference requests in parallel resulting in throughput improvement.
-Each model allocation uses a number of threads defined by
+Increasing `number_of_allocations` means more threads are used to process
+multiple inference requests in parallel, resulting in improved throughput.
+Each model allocation uses a number of threads defined by
`threads_per_allocation`.

-Model allocations are distributed across {ml} nodes. All allocations assigned to
-a node share the same copy of the model in memory. To avoid thread
-oversubscription which is detrimental to performance, model allocations are
-distributed in such a way that the total number of used threads does not surpass
+Model allocations are distributed across {ml} nodes. All allocations assigned to
+a node share the same copy of the model in memory. To avoid thread
+oversubscription, which is detrimental to performance, model allocations are
+distributed in such a way that the total number of used threads does not surpass
the node's allocated processors.
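
For example, a request along the following lines starts a deployment with two
allocations of four threads each, so the deployment can process two inference
requests in parallel using 2 * 4 = 8 threads in total. This is a sketch; the
model ID `my_model` is a placeholder, and the parameters are passed as query
string parameters:

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_model/deployment/_start?number_of_allocations=2&threads_per_allocation=4
--------------------------------------------------
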
[[start-trained-model-deployment-path-params]]
@@ -55,9 +55,9 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=model-id]

`cache_size`::
(Optional, <<byte-units,byte value>>)
-The inference cache size (in memory outside the JVM heap) per node for the
-model. The default value is the size of the model as reported by the
-`model_size_bytes` field in the <<get-trained-models-stats>>. To disable the
+The inference cache size (in memory outside the JVM heap) per node for the
+model. The default value is the size of the model as reported by the
+`model_size_bytes` field in the <<get-trained-models-stats>> API. To disable the
cache, `0b` can be provided.
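
As a sketch (again with the placeholder model ID `my_model`), the inference
cache could be disabled at deployment time like this:

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_model/deployment/_start?cache_size=0b
--------------------------------------------------
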
`number_of_allocations`::
@@ -71,17 +71,17 @@ The priority of the deployment. The default value is `normal`.
There are two priority settings:
+
--
-* `normal`: Use this for deployments in production. The deployment allocations
+* `normal`: Use this for deployments in production. The deployment allocations
are distributed so that node processors are not oversubscribed.
-* `low`: Use this for testing model functionality. The intention is that these
-deployments are not sent a high volume of input. The deployment is required to
-have a single allocation with just one thread. Low priority deployments may be
-assigned on nodes that already utilize all their processors but will be given a
-lower CPU priority than normal deployments. Low priority deployments may be
+* `low`: Use this for testing model functionality. The intention is that these
+deployments are not sent a high volume of input. The deployment is required to
+have a single allocation with just one thread. Low priority deployments may be
+assigned to nodes that already utilize all their processors, but will be given a
+lower CPU priority than normal deployments. Low priority deployments may be
unassigned in order to satisfy more allocations of normal priority deployments.
--

-WARNING: Heavy usage of low priority deployments may impact performance of
+WARNING: Heavy usage of low priority deployments may impact the performance of
normal priority deployments.
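
For instance, a deployment intended only for functional testing of a model
could be started with low priority (the model ID `my_model` is a placeholder):

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_model/deployment/_start?priority=low
--------------------------------------------------
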
`queue_capacity`::
@@ -89,20 +89,20 @@ normal priority deployments.
Controls how many inference requests are allowed in the queue at a time.
Every machine learning node in the cluster where the model can be allocated
has a queue of this size; when the number of requests exceeds the total value,
-new requests are rejected with a 429 error. Defaults to 1024. Max allowed value
+new requests are rejected with a 429 (Too Many Requests) error. Defaults to 1024. The max allowed value
is 1000000.
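
As an illustrative sketch, a deployment that allows a deeper per-node request
queue could be started as follows (`my_model` is a placeholder model ID and
`2048` an arbitrary example value):

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_model/deployment/_start?queue_capacity=2048
--------------------------------------------------
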
`threads_per_allocation`::
(Optional, integer)
-Sets the number of threads used by each model allocation during inference. This
-generally increases the speed per inference request. The inference process is a
-compute-bound process; `threads_per_allocations` must not exceed the number of
-available allocated processors per node. Defaults to 1. Must be a power of 2.
+Sets the number of threads used by each model allocation during inference. This
+generally increases the speed per inference request. The inference process is
+compute-bound; `threads_per_allocation` must not exceed the number of
+available allocated processors per node. Defaults to 1. Must be a power of 2.
Max allowed value is 32.

`timeout`::
(Optional, time)
-Controls the amount of time to wait for the model to deploy. Defaults to 20
+Controls the amount of time to wait for the model to deploy. Defaults to 30
seconds.
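
For example, to allow up to one minute for the deployment to start
(`my_model` is a placeholder model ID; `1m` is a standard time value):

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_model/deployment/_start?timeout=1m
--------------------------------------------------
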
`wait_for`::