[[search-aggregations-pipeline-serialdiff-aggregation]]
=== Serial differencing aggregation
++++
<titleabbrev>Serial differencing</titleabbrev>
++++

Serial differencing is a technique in which an earlier value of a time series is subtracted from the current value,
at a chosen time lag or period. For example, the differenced data point is f(x) = f(x~t~) - f(x~t-n~), where n is the period being used.

A period of 1 is equivalent to a derivative with no time normalization: it is simply the change from one point to the
next. Single periods are useful for removing constant, linear trends.
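
The formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how Elasticsearch implements it; the function name `serial_diff` and the choice to emit `None` for the first `lag` points are assumptions made for the example.

```python
from typing import List, Optional


def serial_diff(values: List[float], lag: int = 1) -> List[Optional[float]]:
    """Return values[t] - values[t - lag] for each t (illustrative helper)."""
    if lag < 1:
        raise ValueError("lag must be a positive, non-zero integer")
    out: List[Optional[float]] = []
    for t, current in enumerate(values):
        # The first `lag` points have no earlier value to subtract,
        # so this sketch leaves them empty (an assumption of the example).
        out.append(current - values[t - lag] if t >= lag else None)
    return out


series = [10.0, 12.0, 15.0, 19.0, 24.0]
first_difference = serial_diff(series, lag=1)  # change from one point to the next
lag_two = serial_diff(series, lag=2)           # change relative to two points back
```

With a period of 1, each output is simply the step between neighboring points, which is why a constant, linear trend collapses to a constant.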

Single periods are also useful for transforming data into a stationary series. In this example, the Dow Jones is
plotted over ~250 days. The raw data is not stationary, which makes it difficult to use with some techniques.

By calculating the first difference, we de-trend the data (that is, remove a constant, linear trend). The
data becomes a stationary series: the first difference is randomly distributed around zero and does not appear to
exhibit any pattern or behavior. The transformation reveals that the dataset follows a random walk: each value is the
previous value +/- a random amount. This insight guides the selection of further tools for analysis.

[[serialdiff_dow]]
.Dow Jones plotted and made stationary with first-differencing
image::images/pipeline_serialdiff/dow.png[]

Larger periods can be used to remove seasonal / cyclic behavior. In this example, a population of lemmings was
synthetically generated with a sine wave + constant linear trend + random noise. The sine wave has a period of 30 days.

The first-difference removes the constant trend, leaving just a sine wave. The 30th-difference is then applied to the
first-difference to remove the cyclic behavior, leaving a stationary series which is amenable to other analysis.

[[serialdiff_lemmings]]
.Lemmings data plotted and made stationary with 1st and 30th difference
image::images/pipeline_serialdiff/lemmings.png[]
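
The two-step differencing described for the lemmings data can be reproduced on a small synthetic series. The sketch below is a rough illustration only: the slope and amplitude are made-up values, and the random noise is left out so that the result is deterministic.

```python
import math
from typing import List


def difference(values: List[float], lag: int) -> List[float]:
    """Lagged difference, dropping the first `lag` points that have no predecessor."""
    return [values[t] - values[t - lag] for t in range(lag, len(values))]


PERIOD = 30        # days per cycle, as in the lemmings example
SLOPE = 0.5        # hypothetical trend: units gained per day
AMPLITUDE = 10.0   # hypothetical size of the seasonal swing

# Sine wave + constant linear trend (noise omitted for this sketch).
population = [
    SLOPE * day + AMPLITUDE * math.sin(2 * math.pi * day / PERIOD)
    for day in range(120)
]

# First difference: the linear trend collapses to a constant offset, the cycle remains.
first = difference(population, lag=1)

# 30th difference of the first difference: each point cancels against
# the same phase of the cycle one period earlier.
seasonal = difference(first, lag=PERIOD)

largest_residual = max(abs(v) for v in seasonal)
```

Without noise, `largest_residual` is zero up to floating-point error. With noise added, the result would instead scatter around zero, which is the stationary series the text describes.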

==== Syntax

A `serial_diff` aggregation looks like this in isolation:

[source,js]
--------------------------------------------------
{
  "serial_diff": {
    "buckets_path": "the_sum",
    "lag": 7
  }
}
--------------------------------------------------
// NOTCONSOLE

[[serial-diff-params]]
.`serial_diff` Parameters
[options="header"]
|===
|Parameter Name |Description |Required |Default Value
|`buckets_path` |Path to the metric of interest (see <<buckets-path-syntax, `buckets_path` Syntax>> for more details) |Required |
|`lag` |The historical bucket to subtract from the current value. For example, a lag of 7 subtracts the value 7 buckets ago from
the current value. Must be a positive, non-zero integer |Optional |`1`
|`gap_policy` |Determines what should happen when a gap in the data is encountered. |Optional |`insert_zeros`
|`format` |Format to apply to the output value of this aggregation |Optional | `null`
|===

`serial_diff` aggregations must be embedded inside of a `histogram` or `date_histogram` aggregation:

[source,console]
--------------------------------------------------
POST /_search
{
  "size": 0,
  "aggs": {
    "my_date_histo": {                  <1>
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "day"
      },
      "aggs": {
        "the_sum": {
          "sum": {
            "field": "lemmings"     <2>
          }
        },
        "thirtieth_difference": {
          "serial_diff": {                <3>
            "buckets_path": "the_sum",
            "lag" : 30
          }
        }
      }
    }
  }
}
--------------------------------------------------
<1> A `date_histogram` named "my_date_histo" is constructed on the "timestamp" field, with one-day intervals
<2> A `sum` metric is used to calculate the sum of a field. This could be any metric (sum, min, max, etc)
<3> Finally, we specify a `serial_diff` aggregation which uses "the_sum" metric as its input.

Serial differences are built by first specifying a `histogram` or `date_histogram` over a field. You can then optionally
add normal metrics, such as a `sum`, inside of that histogram. Finally, the `serial_diff` is embedded inside the histogram.
The `buckets_path` parameter is then used to "point" at one of the sibling metrics inside of the histogram (see
<<buckets-path-syntax>> for a description of the syntax for `buckets_path`).
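
The nesting described above (histogram, then a metric, then a `serial_diff` that points at that metric) can also be assembled programmatically. The following is a minimal sketch that only builds the request body as a Python dictionary; the helper name `build_serial_diff_body` and its parameters are hypothetical, and sending the body to a cluster is left out.

```python
import json
from typing import Any, Dict


def build_serial_diff_body(
    date_field: str,
    metric_field: str,
    lag: int = 1,
    diff_name: str = "thirtieth_difference",
) -> Dict[str, Any]:
    """Build a search body with a serial_diff nested in a date_histogram (hypothetical helper)."""
    if lag < 1:
        raise ValueError("lag must be a positive, non-zero integer")
    return {
        "size": 0,
        "aggs": {
            "my_date_histo": {
                # Step 1: the enclosing histogram over the date field.
                "date_histogram": {"field": date_field, "calendar_interval": "day"},
                "aggs": {
                    # Step 2: a sibling metric computed per bucket.
                    "the_sum": {"sum": {"field": metric_field}},
                    # Step 3: the serial_diff, pointing at the sibling metric by name.
                    diff_name: {"serial_diff": {"buckets_path": "the_sum", "lag": lag}},
                },
            }
        },
    }


body = build_serial_diff_body("timestamp", "lemmings", lag=30)
payload = json.dumps(body, indent=2)  # JSON text suitable for a POST /_search request
```

Because `buckets_path` refers to the metric by its name, the value passed there has to match the key of the sibling aggregation, here `the_sum`.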