  1. [[remote-clusters-troubleshooting]]
  2. === Troubleshooting remote clusters
  3. ++++
  4. <titleabbrev>Troubleshooting</titleabbrev>
  5. ++++
  6. You may encounter several issues when setting up a remote cluster for {ccr} or
  7. {ccs}.
  8. [[remote-clusters-troubleshooting-general]]
  9. ==== General troubleshooting
  10. [[remote-clusters-troubleshooting-check-connection]]
  11. ===== Checking whether a remote cluster has connected successfully
  12. A successful call to the cluster settings update API for adding or updating
  13. remote clusters does not necessarily mean that the connection itself succeeded.
  14. Use the <<cluster-remote-info,remote cluster info API>> to verify that a local
  15. cluster is successfully connected to a remote cluster.
  16. include::remote-clusters-remote-info.asciidoc[]
  17. [[remote-clusters-troubleshooting-enable-server]]
  18. ===== Enabling the remote cluster server
  19. When using API key authentication, cross-cluster traffic happens on the remote
  20. cluster interface, instead of the transport interface. The remote cluster
  21. interface is not enabled by default. This means a node is not ready to accept
  22. incoming cross-cluster requests by default, while it is ready to send outgoing
  23. cross-cluster requests. Ensure you've enabled the remote cluster server on every
  24. node of the remote cluster. In `elasticsearch.yml`:
  25. * Set <<remote-cluster-network-settings,`remote_cluster_server.enabled`>> to
  26. `true`.
  27. * Configure the bind and publish address for remote cluster server traffic, for
  28. example using <<remote-cluster-network-settings,`remote_cluster.host`>>. Without
  29. configuring the address, remote cluster traffic may be bound to the local
  30. interface, and remote clusters running on other machines can't connect.
  31. * Optionally, configure the remote server port using
  32. <<remote_cluster.port,`remote_cluster.port`>> (defaults to `9443`).
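The settings above can be sketched as the following `elasticsearch.yml` fragment for a node of the remote cluster. This is a minimal example, not a complete configuration. The address `192.168.0.42` is only the example address used in the log excerpts on this page, so substitute an address that the local cluster can reach.
[source,yaml]
----
# elasticsearch.yml on every node of the remote cluster
# Enable the remote cluster server interface (disabled by default).
remote_cluster_server.enabled: true
# Bind and publish address for remote cluster server traffic (example value).
remote_cluster.host: 192.168.0.42
# Optional: 9443 is already the default port.
remote_cluster.port: 9443
----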
  33. [[remote-clusters-troubleshooting-common-issues]]
  34. ==== Common issues
  35. The following issues are listed in the order they may occur while setting up a
  36. remote cluster.
  37. [[remote-clusters-not-reachable]]
  38. ===== Remote cluster not reachable
  39. ====== Symptom
  40. A local cluster may not be able to reach a remote cluster for many reasons. For
  41. example, the remote cluster server may not be enabled, an incorrect host or port
  42. may be configured, or a firewall may be blocking traffic. When a remote cluster
  43. is not reachable, check the logs of the local cluster for a `connect_exception`.
  44. When the remote cluster is configured using proxy mode:
  45. [source,txt,subs=+quotes]
  46. ----
  47. [2023-06-28T16:36:47,264][WARN ][o.e.t.ProxyConnectionStrategy] [local-node] failed to open any proxy connections to cluster [my]
  48. org.elasticsearch.transport.ConnectTransportException: [][192.168.0.42:9443] *connect_exception*
  49. ----
  50. When the remote cluster is configured using sniff mode:
  51. [source,txt,subs=+quotes]
  52. ----
  53. [2023-06-28T16:38:37,731][WARN ][o.e.t.SniffConnectionStrategy] [local-node] fetching nodes from external cluster [my] failed
  54. org.elasticsearch.transport.ConnectTransportException: [][192.168.0.42:9443] *connect_exception*
  55. ----
  56. ====== Resolution
  57. * Check that the host and port for the remote cluster are correct.
  58. * Ensure the <<remote-clusters-troubleshooting-enable-server,remote cluster
  59. server is enabled>> on the remote cluster.
  60. * Ensure no firewall is blocking the communication.
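For reference, a remote cluster connection can also be declared statically in `elasticsearch.yml` on the local cluster. The fragment below is a sketch that assumes the cluster alias `my` and the address `192.168.0.42:9443` from the log excerpts above. If the connection was instead added through the cluster settings update API, check the same setting names there.
[source,yaml]
----
# elasticsearch.yml on the local cluster (alias and address are example values)
# Proxy mode:
cluster.remote.my.mode: proxy
cluster.remote.my.proxy_address: "192.168.0.42:9443"
# Sniff mode (alternative to the two lines above):
# cluster.remote.my.mode: sniff
# cluster.remote.my.seeds: ["192.168.0.42:9443"]
----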
  61. [[remote-clusters-troubleshooting-tls-trust]]
  62. ===== TLS trust not established
  63. TLS can be misconfigured on the local or the remote cluster. The result is that
  64. the local cluster does not trust the certificate presented by the remote
  65. cluster.
  66. ====== Symptom
  67. The local cluster logs `failed to establish trust with server`:
  68. [source,txt,subs=+quotes]
  69. ----
  70. [2023-06-29T09:40:55,465][WARN ][o.e.c.s.DiagnosticTrustManager] [local-node] *failed to establish trust with server* at [192.168.0.42]; the server provided a certificate with subject name [CN=remote_cluster], fingerprint [529de35e15666ffaa26afa50876a2a48119db03a], no keyUsage and no extendedKeyUsage; the certificate is valid between [2023-01-29T12:08:37Z] and [2032-08-29T12:08:37Z] (current time is [2023-06-28T23:40:55.464275Z], certificate dates are valid); the session uses cipher suite [TLS_AES_256_GCM_SHA384] and protocol [TLSv1.3]; the certificate has subject alternative names [DNS:localhost,DNS:localhost6.localdomain6,IP:127.0.0.1,IP:0:0:0:0:0:0:0:1,DNS:localhost4,DNS:localhost6,DNS:localhost.localdomain,DNS:localhost4.localdomain4,IP:192.168.0.42]; the certificate is issued by [CN=Elastic Auto RemoteCluster CA] but the server did not provide a copy of the issuing certificate in the certificate chain; this ssl context ([(shared) (with trust configuration: JDK-trusted-certs)]) is not configured to trust that issuer but trusts [97] other issuers
  71. sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
  72. ----
  73. The remote cluster logs `client did not trust this server's certificate`:
  74. [source,txt,subs=+quotes]
  75. ----
  76. [2023-06-29T09:40:55,478][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [remote-node] *client did not trust this server's certificate*, closing connection Netty4TcpChannel{localAddress=/192.168.0.42:9443, remoteAddress=/192.168.0.84:57305, profile=_remote_cluster}
  77. ----
  78. ====== Resolution
  79. Read the warn log message on the local cluster carefully to determine the exact
  80. cause of the failure. For example:
  81. * Is the remote cluster certificate not signed by a trusted CA? This is the most
  82. likely cause.
  83. * Is hostname verification failing?
  84. * Is the certificate expired?
  85. Once you know the cause, you should be able to fix it by adjusting the remote
  86. cluster related SSL settings on either the local cluster or the remote cluster.
  87. Often, the issue is on the local cluster. For example, fix it by configuring the
  88. necessary trusted CAs (`xpack.security.remote_cluster_client.ssl.certificate_authorities`).
  89. If you change the `elasticsearch.yml` file, the associated cluster needs to be
  90. restarted for the changes to take effect.
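As an illustration, trusting the remote cluster's CA on the local cluster might look like the following `elasticsearch.yml` fragment. The file name `remote-cluster-ca.crt` is a placeholder for the CA certificate that signed the remote cluster server certificate, and the relative path assumes the file has been copied into the {es} configuration directory.
[source,yaml]
----
# elasticsearch.yml on every node of the local cluster
xpack.security.remote_cluster_client.ssl.enabled: true
# Placeholder file name: the CA that issued the remote cluster server certificate.
xpack.security.remote_cluster_client.ssl.certificate_authorities: [ "remote-cluster-ca.crt" ]
----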
  91. [[remote-clusters-troubleshooting-api-key]]
  92. ==== API key authentication issues
  93. [[remote-clusters-troubleshooting-transport-port-api-key]]
  94. ===== Connecting to transport port when using API key authentication
  95. When using API key authentication, a local cluster should connect to a remote
  96. cluster's remote cluster server port (defaults to `9443`) instead of the
  97. transport port (defaults to `9300`). A misconfiguration can lead to a number of
  98. symptoms:
  99. ====== Symptom 1
  100. It's recommended to use different CAs and certificates for the transport
  101. interface and the remote cluster server interface. If this recommendation is
  102. followed, a remote cluster client node does not trust the server certificate
  103. presented by a remote cluster on the transport interface.
  104. The local cluster logs `failed to establish trust with server`:
  105. [source,txt,subs=+quotes]
  106. ----
  107. [2023-06-28T12:48:46,575][WARN ][o.e.c.s.DiagnosticTrustManager] [local-node] *failed to establish trust with server* at [192.168.0.42]; the server provided a certificate with subject name [CN=transport], fingerprint [c43e628be2a8aaaa4092b82d78f2bc206c492322], no keyUsage and no extendedKeyUsage; the certificate is valid between [2023-01-29T12:05:53Z] and [2032-08-29T12:05:53Z] (current time is [2023-06-28T02:48:46.574738Z], certificate dates are valid); the session uses cipher suite [TLS_AES_256_GCM_SHA384] and protocol [TLSv1.3]; the certificate has subject alternative names [DNS:localhost,DNS:localhost6.localdomain6,IP:127.0.0.1,IP:0:0:0:0:0:0:0:1,DNS:localhost4,DNS:localhost6,DNS:localhost.localdomain,DNS:localhost4.localdomain4,IP:192.168.0.42]; the certificate is issued by [CN=Elastic Auto Transport CA] but the server did not provide a copy of the issuing certificate in the certificate chain; this ssl context ([xpack.security.remote_cluster_client.ssl (with trust configuration: PEM-trust{/rcs2/ssl/remote-cluster-ca.crt})]) is not configured to trust that issuer, it only trusts the issuer [CN=Elastic Auto RemoteCluster CA] with fingerprint [ba2350661f66e46c746c1629f0c4b645a2587ff4]
  108. sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
  109. ----
  110. The remote cluster logs `client did not trust this server's certificate`:
  111. [source,txt,subs=+quotes]
  112. ----
  113. [2023-06-28T12:48:46,584][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [remote-node] *client did not trust this server's certificate*, closing connection Netty4TcpChannel{localAddress=/192.168.0.42:9309, remoteAddress=/192.168.0.84:60810, profile=default}
  114. ----
  115. ====== Symptom 2
  116. The CA and certificate can be shared between the transport and remote cluster
  117. server interface. Since a remote cluster client does not have a client
  118. certificate by default, the server will fail to verify the client certificate.
  119. The local cluster logs `Received fatal alert: bad_certificate`:
  120. [source,txt,subs=+quotes]
  121. ----
  122. [2023-06-28T12:43:30,705][WARN ][o.e.t.TcpTransport ] [local-node] exception caught on transport layer [Netty4TcpChannel{localAddress=/192.168.0.84:60738, remoteAddress=/192.168.0.42:9309, profile=_remote_cluster}], closing connection
  123. io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: *Received fatal alert: bad_certificate*
  124. ----
  125. The remote cluster logs `Empty client certificate chain`:
  126. [source,txt,subs=+quotes]
  127. ----
  128. [2023-06-28T12:43:30,772][WARN ][o.e.t.TcpTransport ] [remote-node] exception caught on transport layer [Netty4TcpChannel{localAddress=/192.168.0.42:9309, remoteAddress=/192.168.0.84:60783, profile=default}], closing connection
  129. io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: *Empty client certificate chain*
  130. ----
  131. ====== Symptom 3
  132. If the remote cluster client is configured for mTLS and provides a valid client
  133. certificate, the connection fails because the client does not send the expected
  134. authentication header.
  135. The local cluster logs `missing authentication`:
  136. [source,txt,subs=+quotes]
  137. ----
  138. [2023-06-28T13:04:52,710][WARN ][o.e.t.ProxyConnectionStrategy] [local-node] failed to open any proxy connections to cluster [my]
  139. org.elasticsearch.transport.RemoteTransportException: [remote-node][192.168.0.42:9309][cluster:internal/remote_cluster/handshake]
  140. Caused by: org.elasticsearch.ElasticsearchSecurityException: *missing authentication* credentials for action [cluster:internal/remote_cluster/handshake]
  141. ----
  142. This does not show up in the logs of the remote cluster.
  143. ====== Symptom 4
  144. If anonymous access is enabled on the remote cluster, it does not require
  145. authentication, and the message that the local cluster logs depends on the
  146. privileges of the anonymous user.
  147. If the anonymous user does not have the necessary privileges to make a
  148. connection, the local cluster logs `unauthorized`:
  149. [source,txt,subs=+quotes]
  150. ----
  151. org.elasticsearch.transport.RemoteTransportException: [remote-node][192.168.0.42:9309][cluster:internal/remote_cluster/handshake]
  152. Caused by: org.elasticsearch.ElasticsearchSecurityException: action [cluster:internal/remote_cluster/handshake] is *unauthorized* for user [anonymous_foo] with effective roles [reporting_user], this action is granted by the cluster privileges [cross_cluster_search,cross_cluster_replication,manage,all]
  153. ----
  154. If the anonymous user has the necessary privileges, for example if it is a superuser,
  155. the local cluster logs `requires channel profile to be [_remote_cluster],
  156. but got [default]`:
  157. [source,txt,subs=+quotes]
  158. ----
  159. [2023-06-28T13:09:52,031][WARN ][o.e.t.ProxyConnectionStrategy] [local-node] failed to open any proxy connections to cluster [my]
  160. org.elasticsearch.transport.RemoteTransportException: [remote-node][192.168.0.42:9309][cluster:internal/remote_cluster/handshake]
  161. Caused by: java.lang.IllegalArgumentException: remote cluster handshake action *requires channel profile to be [_remote_cluster], but got [default]*
  162. ----
  163. ====== Resolution
  164. Check the port number and ensure you are indeed connecting to the remote cluster
  165. server instead of the transport interface.
  166. [[remote-clusters-troubleshooting-no-api-key]]
  167. ===== Connecting without a cross-cluster API key
  168. A local cluster uses the presence of a cross-cluster API key to determine the
  169. model with which it connects to a remote cluster. If a cross-cluster API key is
  170. present, it uses API key based authentication. Otherwise, it uses certificate
  171. based authentication. You can check which model is being used with the <<cluster-remote-info,remote cluster info API>> on the local cluster:
  172. include::remote-clusters-remote-info.asciidoc[]
  173. Besides checking the response of the remote cluster info API, you can also check
  174. the logs.
  175. ====== Symptom 1
  176. If no cross-cluster API key is used, the local cluster uses the certificate
  177. based authentication method, and connects to the remote cluster using the TLS
  178. configuration of the transport interface. If the remote cluster uses a different
  179. TLS CA and certificate for the transport and remote cluster server interfaces (which
  180. is the recommendation), TLS verification will fail.
  181. The local cluster logs `failed to establish trust with server`:
  182. [source,txt,subs=+quotes]
  183. ----
  184. [2023-06-28T12:51:06,452][WARN ][o.e.c.s.DiagnosticTrustManager] [local-node] *failed to establish trust with server* at [<unknown host>]; the server provided a certificate with subject name [CN=remote_cluster], fingerprint [529de35e15666ffaa26afa50876a2a48119db03a], no keyUsage and no extendedKeyUsage; the certificate is valid between [2023-01-29T12:08:37Z] and [2032-08-29T12:08:37Z] (current time is [2023-06-28T02:51:06.451581Z], certificate dates are valid); the session uses cipher suite [TLS_AES_256_GCM_SHA384] and protocol [TLSv1.3]; the certificate has subject alternative names [DNS:localhost,DNS:localhost6.localdomain6,IP:127.0.0.1,IP:0:0:0:0:0:0:0:1,DNS:localhost4,DNS:localhost6,DNS:localhost.localdomain,DNS:localhost4.localdomain4,IP:192.168.0.42]; the certificate is issued by [CN=Elastic Auto RemoteCluster CA] but the server did not provide a copy of the issuing certificate in the certificate chain; this ssl context ([xpack.security.transport.ssl (with trust configuration: PEM-trust{/rcs2/ssl/transport-ca.crt})]) is not configured to trust that issuer, it only trusts the issuer [CN=Elastic Auto Transport CA] with fingerprint [bbe49e3f986506008a70ab651b188c70df104812]
  185. sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
  186. ----
  187. The remote cluster logs `client did not trust this server's certificate`:
  188. [source,txt,subs=+quotes]
  189. ----
  190. [2023-06-28T12:52:16,914][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [remote-node] *client did not trust this server's certificate*, closing connection Netty4TcpChannel{localAddress=/192.168.0.42:9443, remoteAddress=/192.168.0.84:60981, profile=_remote_cluster}
  191. ----
  192. ====== Symptom 2
  193. Even if TLS verification is not an issue, the connection fails due to missing
  194. credentials.
  195. The local cluster logs `Please ensure you have configured remote cluster credentials`:
  196. [source,txt,subs=+quotes]
  197. ----
  198. Caused by: java.lang.IllegalArgumentException: Cross cluster requests through the dedicated remote cluster server port require transport header [_cross_cluster_access_credentials] but none found. *Please ensure you have configured remote cluster credentials* on the cluster originating the request.
  199. ----
  200. This does not show up in the logs of the remote cluster.
  201. ====== Resolution
  202. Add the cross-cluster API key to {es} keystore on every node of the local
  203. cluster. Restart the local cluster to reload the keystore.
  204. [[remote-clusters-troubleshooting-wrong-api-key-type]]
  205. ===== Using the wrong API key type
  206. API key based authentication requires
  207. <<security-api-create-cross-cluster-api-key,cross-cluster API keys>>. It does
  208. not work with <<security-api-create-api-key,REST API keys>>.
  209. ====== Symptom
  210. The local cluster logs `authentication expected API key type of [cross_cluster]`:
  211. [source,txt,subs=+quotes]
  212. ----
  213. [2023-06-28T13:26:53,962][WARN ][o.e.t.ProxyConnectionStrategy] [local-node] failed to open any proxy connections to cluster [my]
  214. org.elasticsearch.transport.RemoteTransportException: [remote-node][192.168.0.42:9443][cluster:internal/remote_cluster/handshake]
  215. Caused by: org.elasticsearch.ElasticsearchSecurityException: *authentication expected API key type of [cross_cluster]*, but API key [agZXJocBmA2beJfq2yKu] has type [rest]
  216. ----
  217. This does not show up in the logs of the remote cluster.
  218. ====== Resolution
  219. Ask the remote cluster administrator to create and distribute a
  220. <<security-api-create-cross-cluster-api-key,cross-cluster API key>>. Replace the
  221. existing API key in the {es} keystore with this cross-cluster API key on every
  222. node of the local cluster. Restart the local cluster for keystore changes to
  223. take effect.
  224. [[remote-clusters-troubleshooting-non-valid-api-key]]
  225. ===== Invalid API key
  226. A cross-cluster API key can fail to authenticate, for example when its
  227. credentials are incorrect, or when it has been invalidated or has expired.
  228. ====== Symptom
  229. The local cluster logs `unable to authenticate`:
  230. [source,txt,subs=+quotes]
  231. ----
  232. [2023-06-28T13:22:58,264][WARN ][o.e.t.ProxyConnectionStrategy] [local-node] failed to open any proxy connections to cluster [my]
  233. org.elasticsearch.transport.RemoteTransportException: [remote-node][192.168.0.42:9443][cluster:internal/remote_cluster/handshake]
  234. Caused by: org.elasticsearch.ElasticsearchSecurityException: *unable to authenticate* user [agZXJocBmA2beJfq2yKu] for action [cluster:internal/remote_cluster/handshake]
  235. ----
  236. The remote cluster logs `Authentication using apikey failed`:
  237. [source,txt,subs=+quotes]
  238. ----
  239. [2023-06-28T13:24:38,744][WARN ][o.e.x.s.a.ApiKeyAuthenticator] [remote-node] *Authentication using apikey failed* - invalid credentials for API key [agZXJocBmA2beJfq2yKu]
  240. ----
  241. ====== Resolution
  242. Ask the remote cluster administrator to create and distribute a
  243. <<security-api-create-cross-cluster-api-key,cross-cluster API key>>. Replace the
  244. existing API key in the {es} keystore with this cross-cluster API key on every
  245. node of the local cluster. Restart the local cluster for keystore changes to
  246. take effect.
  247. [[remote-clusters-troubleshooting-insufficient-privileges]]
  248. ===== API key or local user has insufficient privileges
  249. The effective permission for a local user running requests on a remote cluster
  250. is determined by the intersection of the cross-cluster API key's privileges and
  251. the local user's `remote_indices` privileges.
  252. ====== Symptom
  253. Request failures due to insufficient privileges result in API responses like:
  254. [source,js,subs=+quotes]
  255. ----
  256. {
  257. "type": "security_exception",
  258. "reason": "action [indices:data/read/search] towards remote cluster is *unauthorized for user* [foo] with assigned roles [foo-role] authenticated by API key id [agZXJocBmA2beJfq2yKu] of user [elastic-admin] on indices [cd], this action is granted by the index privileges [read,all]"
  259. }
  260. ----
  261. // NOTCONSOLE
  262. This does not show up in any logs.
  263. ====== Resolution
  264. . Check that the local user has the necessary `remote_indices` privileges. Grant sufficient `remote_indices` privileges if necessary.
  265. . If permission is not an issue locally, ask the remote cluster administrator to
  266. create and distribute a
  267. <<security-api-create-cross-cluster-api-key,cross-cluster API key>>. Replace the
  268. existing API key in the {es} keystore with this cross-cluster API key on every
  269. node of the local cluster. Restart the local cluster for keystore changes to
  270. take effect.
  271. [[remote-clusters-troubleshooting-no-remote_indices-privileges]]
  272. ===== Local user has no `remote_indices` privileges
  273. This is a special case of insufficient privileges. In this case, the local user
  274. has no `remote_indices` privileges at all for the target remote cluster. {es}
  275. can detect that and issue a more explicit error response.
  276. ====== Symptom
  277. This results in API responses like:
  278. [source,js,subs=+quotes]
  279. ----
  280. {
  281. "type": "security_exception",
  282. "reason": "action [indices:data/read/search] towards remote cluster [my] is unauthorized for user [foo] with effective roles [] (assigned roles [foo-role] were not found) because *no remote indices privileges apply for the target cluster*"
  283. }
  284. ----
  285. // NOTCONSOLE
  286. ====== Resolution
  287. Grant sufficient `remote_indices` privileges to the local user.