- [[search-aggregations-metrics-median-absolute-deviation-aggregation]]
- === Median Absolute Deviation Aggregation
- This `single-value` aggregation approximates the https://en.wikipedia.org/wiki/Median_absolute_deviation[median absolute deviation]
- of its search results.
- Median absolute deviation is a measure of variability. It is a robust
- statistic, meaning that it is useful for describing data that may have
- outliers, or may not be normally distributed. For such data it can be more
- descriptive than standard deviation.
- It is calculated as the median of each data point's absolute deviation from
- the median of the entire sample. That is, for a random variable X, the median
- absolute deviation is median(|median(X) - X~i~|).
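The definition above can be sketched as an exact computation in Python. This is only an illustration of the formula, not how the aggregation itself computes its result; the function name and the sample ratings are made up for this example.

```python
from statistics import median


def median_absolute_deviation(values):
    """Exact MAD: the median of |median(X) - x_i| over the whole sample."""
    center = median(values)
    # Absolute deviation of every data point from the sample median.
    return median(abs(center - v) for v in values)


# Hypothetical star ratings, chosen only for illustration.
ratings = [1, 1, 3, 5, 5]
mad = median_absolute_deviation(ratings)
print(mad)  # 2
```

The sample median is 3, the absolute deviations are 2, 2, 0, 2 and 2, and their median is 2.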
- ==== Example
- Assume our data represents product reviews on a one to five star scale.
- Such reviews are usually summarized as a mean, which is easily understandable
- but doesn't describe the reviews' variability. Estimating the median absolute
- deviation can provide insight into how much reviews vary from one another.
- In this example we have a product which has an average rating of
- 3 stars. Let's look at its ratings' median absolute deviation to determine
- how much they vary:
- [source,console]
- ---------------------------------------------------------
- GET reviews/_search
- {
-   "size": 0,
-   "aggs": {
-     "review_average": {
-       "avg": {
-         "field": "rating"
-       }
-     },
-     "review_variability": {
-       "median_absolute_deviation": {
-         "field": "rating" <1>
-       }
-     }
-   }
- }
- ---------------------------------------------------------
- // TEST[setup:reviews]
- <1> `rating` must be a numeric field
- The resulting median absolute deviation of `2` tells us that there is a fair
- amount of variability in the ratings. Reviewers must have diverse opinions about
- this product.
- [source,js]
- ---------------------------------------------------------
- {
-   ...
-   "aggregations": {
-     "review_average": {
-       "value": 3.0
-     },
-     "review_variability": {
-       "value": 2.0
-     }
-   }
- }
- ---------------------------------------------------------
- // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
- ==== Approximation
- The naive implementation of calculating median absolute deviation stores the
- entire sample in memory, so this aggregation instead calculates an
- approximation. It uses the https://github.com/tdunning/t-digest[TDigest data structure]
- to approximate the sample median and the median of deviations from the sample
- median. For more about the approximation characteristics of TDigests, see
- <<search-aggregations-metrics-percentile-aggregation-approximation>>.
- The tradeoff between resource usage and accuracy of a TDigest's quantile
- approximation, and therefore the accuracy of this aggregation's approximation
- of median absolute deviation, is controlled by the `compression` parameter. A
- higher `compression` setting provides a more accurate approximation at the
- cost of higher memory usage. For more about the characteristics of the TDigest
- `compression` parameter see
- <<search-aggregations-metrics-percentile-aggregation-compression>>.
- [source,console]
- ---------------------------------------------------------
- GET reviews/_search
- {
-   "size": 0,
-   "aggs": {
-     "review_variability": {
-       "median_absolute_deviation": {
-         "field": "rating",
-         "compression": 100
-       }
-     }
-   }
- }
- ---------------------------------------------------------
- // TEST[setup:reviews]
- The default `compression` value for this aggregation is `1000`. At this
- compression level, this aggregation is usually within 5% of the exact result,
- but the observed accuracy depends on the sample data.
- ==== Script
- This metric aggregation supports scripting. In our example above, product
- reviews are on a scale of one to five. If we want to rescale them to a
- ten-point scale, we can use scripting.
- To provide an inline script:
- [source,console]
- ---------------------------------------------------------
- GET reviews/_search
- {
-   "size": 0,
-   "aggs": {
-     "review_variability": {
-       "median_absolute_deviation": {
-         "script": {
-           "lang": "painless",
-           "source": "doc['rating'].value * params.scaleFactor",
-           "params": {
-             "scaleFactor": 2
-           }
-         }
-       }
-     }
-   }
- }
- ---------------------------------------------------------
- // TEST[setup:reviews]
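To see what such a scaling script does to the result, the following Python sketch computes the exact median absolute deviation before and after multiplying each value by a scale factor. The helper `mad` and the sample ratings are assumptions made for illustration, and the aggregation itself returns an approximation rather than this exact value.

```python
from statistics import median


def mad(values):
    """Exact median absolute deviation of a sample."""
    center = median(values)
    return median(abs(center - v) for v in values)


# Hypothetical ratings on the one-to-five scale.
ratings = [1, 1, 3, 5, 5]
scale_factor = 2  # plays the role of params.scaleFactor in the script

# Apply the same per-document transformation as the script.
scaled = [r * scale_factor for r in ratings]

original_mad = mad(ratings)
scaled_mad = mad(scaled)
```

Multiplying every value by a positive constant multiplies the exact median absolute deviation by that same constant, so here `scaled_mad` is twice `original_mad`.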
- To provide a stored script:
- [source,console]
- ---------------------------------------------------------
- GET reviews/_search
- {
-   "size": 0,
-   "aggs": {
-     "review_variability": {
-       "median_absolute_deviation": {
-         "script": {
-           "id": "my_script",
-           "params": {
-             "field": "rating"
-           }
-         }
-       }
-     }
-   }
- }
- ---------------------------------------------------------
- // TEST[setup:reviews,stored_example_script]
- ==== Missing value
- The `missing` parameter defines how documents that are missing a value should be
- treated. By default they are ignored, but it is also possible to treat them
- as if they had a value.
- Let's be optimistic and assume some reviewers loved the product so much that
- they forgot to give it a rating. We'll assign them five stars:
- [source,console]
- ---------------------------------------------------------
- GET reviews/_search
- {
-   "size": 0,
-   "aggs": {
-     "review_variability": {
-       "median_absolute_deviation": {
-         "field": "rating",
-         "missing": 5
-       }
-     }
-   }
- }
- ---------------------------------------------------------
- // TEST[setup:reviews]
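The effect of `missing` can be illustrated with a small exact computation in Python. This is a sketch under assumptions: the helper `mad`, the sample data, and the use of `None` to stand for a document without a rating are all hypothetical, and the aggregation itself returns an approximation.

```python
from statistics import median


def mad(values):
    """Exact median absolute deviation of a sample."""
    center = median(values)
    return median(abs(center - v) for v in values)


# Hypothetical ratings; None stands for a document with no rating.
ratings = [1, 2, 3, None, 4, None]

# Default behaviour: documents without a value are ignored.
ignored = [r for r in ratings if r is not None]

# With "missing": 5, those documents count as five-star ratings.
filled = [5 if r is None else r for r in ratings]

mad_ignored = mad(ignored)
mad_filled = mad(filled)
```

For this sample the result changes from 1.0 to 1.5, which shows that the choice of `missing` value can shift both the median and the deviations around it.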