Browse Source

Add readme for APM metrics (#103400)

this commit adds documentation about the APM metrics usage in Elasticsearch
Przemyslaw Gomulka 1 year ago
parent
commit
c39a59503c
1 changed files with 171 additions and 0 deletions
  1. 171 0
      modules/apm/METERING.md

+ 171 - 0
modules/apm/METERING.md

@@ -0,0 +1,171 @@
+# Metrics in Elasticsearch
+
+Elasticsearch has the metrics API available in server's (perhaps we should move to lib?) package
+`org.elasticsearch.telemetry.metric`.
+This package contains base classes/interfaces for creating and working with metrics.
+Please refer to the javadocs provided in these classes in that package for more details.
+The entry point for working with metrics is `MeterRegistry`.
+
+## Implementation
+We use elastic's apm-java-agent as an implementation of the API we expose.
+the implementation can be found in `:modules:apm`
+The apm-java-agent is responsible for buffering metrics and upon metrics_interval
+send them over to apm server.
+Metrics_interval is configured via a `tracing.apm.agent.metrics_interval` setting
+The agent also collects a number of JVM metrics.
+see https://www.elastic.co/guide/en/apm/agent/java/current/metrics.html#metrics-jvm
+
+
+## How to choose an instrument
+
+We support various instruments and might be adding more as we go.
+The choice of the right instrument is not always easy as often differences are subtle.
+The simplified algorithm could be as follows:
+
+1. You want to measure something (absolute value)
+    1. values are non-additive
+        1.  use a gauge
+        2.  Example: a cpu temperature
+    2. values are additive
+        1. use asynchronous counter
+        2. Example: total number of requests
+2. You want to count something
+    1. values are monotonously increasing
+        1. use a counter
+        2. Example: Recording a failed authentication count
+    2. values can be decreased
+        1. use UpDownCounter
+        2. Example: Number of orders in a queue
+3. You want to record a statistics
+    1. use a histogram
+        1. Example: A statistics about how long it took to access a value from cache
+
+refer to https://opentelemetry.io/docs/specs/otel/metrics/supplementary-guidelines/#instrument-selection
+for more details
+
+## How to name an instrument
+See the naming guidelines for metrics:
+https://docs.google.com/document/d/1jKxuaZi7QAMIRD_Eq3nonkYswVlQXW5TWllJEicoOtM/edit#heading=h.jxn90hx2ayic
+
+### Restarts and overflows
+if the instrument is correctly chosen, the apm server will be able to determine if the metrics
+were restarted (i.e. node was restarted) or there was a counter overflow
+(the metric in ES might use an int internally, but apm backend might have a long )
+
+## How to use an instrument
+There are 2 types of usages of an instrument depending on a type.
+- For synchronous instrument (counter/UpDownCounter) we need to register an instrument with
+  `MeterRegistry` and use the returned value to increment a value of that instrument
+```java
+  MeterRegistry registry;
+  LongCounter longCounter = registry.registerLongCounter("es.test.requests.count", "a test counter", "count");
+  longCounter.increment();
+  longCounter.incrementBy(1, Map.of("name", "Alice"));
+  longCounter.incrementBy(1, Map.of("name", "Bob"));
+```
+
+- For asynchronous instrument (gauge/AsynchronousCounter) we register an instrument
+  and have to provide a callback that will report the absolute measured value.
+  This callback has to be provided upon registration and cannot be changed.
+```java
+MeterRegistry registry;
+long someValue = 1;
+registry.registerLongGauge("es.test.cpu.temperature", "a test gauge", "celcius",
+() -> new LongWithAttributes(someValue, Map.of("cpuNumber", 1)));
+```
+
+If we don’t have access to ‘state’ that will be fetched on metric event (when callback is executed)
+we can use a utility LongGaugeMetric or LongGaugeMetric
+```java
+MeterRegistry meterRegistry ;
+LongGaugeMetric longGaugeMetric = LongGaugeMetric.create(meterRegistry, "es.test.gauge", "a test gauge", "total value");
+longGaugeMetric.set(123L);
+```
+### The use of attributes aka dimensions
+Each instrument can attach attributes to a reported value. This helps drilling down into the details
+of value that was reported during the metric event
+
+
+## Development
+
+### Fake http server
+
+The quickest way to verify that your metrics are working is to run `./gradlew run --with-apm-server`.
+This will run ES node (or nodes in serverless) and also start a fake http server that will act
+as an apm server. This fake http server will log all the http messages it receives from apm-agent
+
+### With APM server in cloud
+You can also run local ES node with an apm server in cloud.
+Create a new deployment in cloud, then click the 'hamburger :)' on the left, scroll to Observability and click APM under it.
+At the upper right corner there is `Add data` link, then scroll down to `ApmAgents` section and pick Java
+There you should be able to see `elastic.apm.secret_token` and `elastic.apm.server_url. You will use them in the next step.
+
+Next you should create a file `apm_server_ess.gradle`
+in a different directory than your elasticsearch checkout (so that branch changes don't remove it)
+The content of the file:
+```
+rootProject {
+  if (project.name == 'elasticsearch') {
+    afterEvaluate {
+      testClusters.matching { it.name == "runTask" }.configureEach {
+        setting 'xpack.security.audit.enabled', 'true'
+        keystore 'tracing.apm.secret_token', 'REDACTED'
+        setting 'telemetry.metrics.enabled', 'true'
+
+      setting 'tracing.apm.agent.server_url', 'https://REDACTED:443'
+      }
+    }
+  }
+}
+```
+Use the secret_token and server_url (REDACTED) from previous step.
+
+you can run your local ES node with APM in ESS with this command
+`./gradlew run -I ../apm_enable_statefull.gradle`
+
+#### An init.d gradle setup
+
+Alternatively you can edit your `~/.gradle/init.d/apm.gradle`
+```groovy
+rootProject {
+    if (project.name == 'elasticsearch' && Boolean.getBoolean('metrics.enabled')) {
+        afterEvaluate {
+            testClusters.matching { it.name == "runTask" }.configureEach {
+                setting 'xpack.security.audit.enabled', 'true'
+                keystore 'tracing.apm.secret_token', 'TODO-REPLACE'
+                setting 'telemetry.metrics.enabled', 'true'
+                setting 'tracing.apm.agent.server_url', 'https://TODO-REPLACE-URL.apm.eastus2.staging.azure.foundit.no:443'
+            }
+        }
+    }
+}
+```
+
+The example use:
+```
+./gradlew :run -Dmetrics.enabled=true
+```
+
+
+
+
+#### Logging
+with any approach you took to run your ES with APM you will find apm-agent.json file
+in ES's logs directory. If there are any problems with connecting to APM you will see WARN/ERROR messages.
+We run apm-agent with logs at WARN level, so normally you should not see any logs there.
+
+When running ES in cloud, logs are being also indexed in a logging cluster, so you will be able to find them
+in kibana. The `datastream.dataset` is `elasticsearch.apm_agent`
+
+
+### Testing
+We currently provide a base `TestTelemetryPlugin` which should help you write an integration test.
+See an example `S3BlobStoreRepositoryTests`
+
+
+
+
+# Links and further reading
+https://opentelemetry.io/docs/specs/otel/metrics/supplementary-guidelines/
+
+https://www.elastic.co/guide/en/apm/guide/current/data-model-metrics.html