
[ML] Adding pytorch oom to known issues (#110668)

* Adding pytorch oom to known issues

* Fixing section

* Updating text to exclude the pytorch version
Jonathan Buttner, 1 year ago (commit 3e7f7f4709)

+ 6 - 2
docs/reference/release-notes/8.13.0.asciidoc

@@ -28,6 +28,12 @@ If your cluster is running on ECK 2.12.1 and above, this may cause problems with
 To resolve this issue, perform a rolling restart on the non-master-eligible nodes once all Elasticsearch nodes
 are upgraded.
 
+* The `pytorch_inference` process used to run Machine Learning models can consume large amounts of memory.
+In environments where the available memory is limited, the OS Out of Memory Killer will kill the `pytorch_inference`
+process to reclaim memory. This can cause inference requests to fail.
+Elasticsearch will automatically restart the `pytorch_inference` process
+after it is killed, up to four times in 24 hours. (issue: {es-issue}110530[#110530])
+
 [[breaking-8.13.0]]
 [float]
 === Breaking changes
@@ -464,5 +470,3 @@ Search::
 * Upgrade to Lucene 9.9.0 {es-pull}102782[#102782]
 * Upgrade to Lucene 9.9.1 {es-pull}103387[#103387]
 * Upgrade to Lucene 9.9.2 {es-pull}104753[#104753]
-
-

+ 6 - 2
docs/reference/release-notes/8.13.1.asciidoc

@@ -13,6 +13,12 @@ If your cluster is running on ECK 2.12.1 and above, this may cause problems with
 To resolve this issue, perform a rolling restart on the non-master-eligible nodes once all Elasticsearch nodes
 are upgraded.
 
+* The `pytorch_inference` process used to run Machine Learning models can consume large amounts of memory.
+In environments where the available memory is limited, the OS Out of Memory Killer will kill the `pytorch_inference`
+process to reclaim memory. This can cause inference requests to fail.
+Elasticsearch will automatically restart the `pytorch_inference` process
+after it is killed, up to four times in 24 hours. (issue: {es-issue}110530[#110530])
+
 [[bug-8.13.1]]
 [float]
 === Bug fixes
@@ -45,5 +51,3 @@ Transform::
 
 Transform::
 * Raise loglevel of events related to transform lifecycle from DEBUG to INFO {es-pull}106602[#106602]
-
-

+ 6 - 2
docs/reference/release-notes/8.13.2.asciidoc

@@ -13,6 +13,12 @@ If your cluster is running on ECK 2.12.1 and above, this may cause problems with
 To resolve this issue, perform a rolling restart on the non-master-eligible nodes once all Elasticsearch nodes
 are upgraded.
 
+* The `pytorch_inference` process used to run Machine Learning models can consume large amounts of memory.
+In environments where the available memory is limited, the OS Out of Memory Killer will kill the `pytorch_inference`
+process to reclaim memory. This can cause inference requests to fail.
+Elasticsearch will automatically restart the `pytorch_inference` process
+after it is killed, up to four times in 24 hours. (issue: {es-issue}110530[#110530])
+
 [[bug-8.13.2]]
 [float]
 === Bug fixes
@@ -46,5 +52,3 @@ Packaging::
 Security::
 * Query API Key Information API support for the `typed_keys` request parameter {es-pull}106873[#106873] (issue: {es-issue}106817[#106817])
 * Query API Keys support for both `aggs` and `aggregations` keywords {es-pull}107054[#107054] (issue: {es-issue}106839[#106839])
-
-

+ 6 - 2
docs/reference/release-notes/8.13.3.asciidoc

@@ -20,6 +20,12 @@ If your cluster is running on ECK 2.12.1 and above, this may cause problems with
 To resolve this issue, perform a rolling restart on the non-master-eligible nodes once all Elasticsearch nodes
 are upgraded.
 
+* The `pytorch_inference` process used to run Machine Learning models can consume large amounts of memory.
+In environments where the available memory is limited, the OS Out of Memory Killer will kill the `pytorch_inference`
+process to reclaim memory. This can cause inference requests to fail.
+Elasticsearch will automatically restart the `pytorch_inference` process
+after it is killed, up to four times in 24 hours. (issue: {es-issue}110530[#110530])
+
 [[bug-8.13.3]]
 [float]
 === Bug fixes
@@ -52,5 +58,3 @@ Search::
 
 ES|QL::
 * ESQL: Introduce language versioning to REST API {es-pull}106824[#106824]
-
-

+ 6 - 2
docs/reference/release-notes/8.13.4.asciidoc

@@ -13,6 +13,12 @@ If your cluster is running on ECK 2.12.1 and above, this may cause problems with
 To resolve this issue, perform a rolling restart on the non-master-eligible nodes once all Elasticsearch nodes
 are upgraded.
 
+* The `pytorch_inference` process used to run Machine Learning models can consume large amounts of memory.
+In environments where the available memory is limited, the OS Out of Memory Killer will kill the `pytorch_inference`
+process to reclaim memory. This can cause inference requests to fail.
+Elasticsearch will automatically restart the `pytorch_inference` process
+after it is killed, up to four times in 24 hours. (issue: {es-issue}110530[#110530])
+
 [[bug-8.13.4]]
 [float]
 === Bug fixes
@@ -28,5 +34,3 @@ Snapshot/Restore::
 
 TSDB::
 * Fix tsdb codec when doc-values spread in two blocks {es-pull}108276[#108276]
-
-

+ 6 - 2
docs/reference/release-notes/8.14.0.asciidoc

@@ -22,6 +22,12 @@ If your cluster is running on ECK 2.12.1 and above, this may cause problems with
 To resolve this issue, perform a rolling restart on the non-master-eligible nodes once all Elasticsearch nodes
 are upgraded.
 
+* The `pytorch_inference` process used to run Machine Learning models can consume large amounts of memory.
+In environments where the available memory is limited, the OS Out of Memory Killer will kill the `pytorch_inference`
+process to reclaim memory. This can cause inference requests to fail.
+Elasticsearch will automatically restart the `pytorch_inference` process
+after it is killed, up to four times in 24 hours. (issue: {es-issue}110530[#110530])
+
 [[bug-8.14.0]]
 [float]
 === Bug fixes
@@ -356,5 +362,3 @@ Network::
 
 Packaging::
 * Update bundled JDK to Java 22 (again) {es-pull}108654[#108654]
-
-

+ 6 - 2
docs/reference/release-notes/8.14.1.asciidoc

@@ -14,6 +14,12 @@ If your cluster is running on ECK 2.12.1 and above, this may cause problems with
 To resolve this issue, perform a rolling restart on the non-master-eligible nodes once all Elasticsearch nodes
 are upgraded.
 
+* The `pytorch_inference` process used to run Machine Learning models can consume large amounts of memory.
+In environments where the available memory is limited, the OS Out of Memory Killer will kill the `pytorch_inference`
+process to reclaim memory. This can cause inference requests to fail.
+Elasticsearch will automatically restart the `pytorch_inference` process
+after it is killed, up to four times in 24 hours. (issue: {es-issue}110530[#110530])
+
 [[bug-8.14.1]]
 [float]
 === Bug fixes
@@ -42,5 +48,3 @@ Vector Search::
 
 Infra/Settings::
 * Add remove index setting command {es-pull}109276[#109276]
-
-

+ 6 - 0
docs/reference/release-notes/8.14.2.asciidoc

@@ -13,6 +13,12 @@ If your cluster is running on ECK 2.12.1 and above, this may cause problems with
 To resolve this issue, perform a rolling restart on the non-master-eligible nodes once all Elasticsearch nodes
 are upgraded.
 
+* The `pytorch_inference` process used to run Machine Learning models can consume large amounts of memory.
+In environments where the available memory is limited, the OS Out of Memory Killer will kill the `pytorch_inference`
+process to reclaim memory. This can cause inference requests to fail.
+Elasticsearch will automatically restart the `pytorch_inference` process
+after it is killed, up to four times in 24 hours. (issue: {es-issue}110530[#110530])
+
 [[bug-8.14.2]]
 [float]
 === Bug fixes

+ 8 - 0
docs/reference/release-notes/8.15.0.asciidoc

@@ -5,4 +5,12 @@ coming[8.15.0]
 
 Also see <<breaking-changes-8.15,Breaking changes in 8.15>>.
 
+[[known-issues-8.15.0]]
+[float]
+=== Known issues
 
+* The `pytorch_inference` process used to run Machine Learning models can consume large amounts of memory.
+In environments where the available memory is limited, the OS Out of Memory Killer will kill the `pytorch_inference`
+process to reclaim memory. This can cause inference requests to fail.
+Elasticsearch will automatically restart the `pytorch_inference` process
+after it is killed, up to four times in 24 hours. (issue: {es-issue}110530[#110530])
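
For context on the known issue added above, here is a minimal diagnostic sketch. It is not part of this commit: the host commands, kernel log wording, endpoint URL, and credentials are assumptions that would need to be adapted to the affected node and cluster. It shows one way an operator might confirm that the OS Out of Memory Killer terminated `pytorch_inference` and then inspect the restarted deployment through the trained models stats API.

[source,sh]
----
# Hypothetical sketch, not part of the change above.
# On the affected Elasticsearch node, look for an OOM kill of pytorch_inference
# in the kernel log; the exact message wording varies by kernel version.
dmesg -T | grep -iE 'out of memory|oom-kill' | grep pytorch_inference

# Check the state and memory usage of the restarted deployment through the
# trained models stats API; adjust the URL and credentials for your cluster.
curl -sk -u elastic "https://localhost:9200/_ml/trained_models/_stats?pretty"
----

The stats response reports model memory requirements and deployment state, which can help judge whether the node simply has too little memory available for the model, the situation the known issue describes.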