grouping.asciidoc 4.5 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105
  1. [role="xpack"]
  2. [testenv="basic"]
  3. [[sql-functions-grouping]]
  4. === Grouping Functions
  5. Functions for creating special __grouping__s (also known as _bucketing_); as such these need to be used
  6. as part of the <<sql-syntax-group-by, grouping>>.
  7. [[sql-functions-grouping-histogram]]
  8. ==== `HISTOGRAM`
  9. .Synopsis:
  10. [source, sql]
  11. ----
  12. HISTOGRAM(
  13. numeric_exp, <1>
  14. numeric_interval) <2>
  15. HISTOGRAM(
  16. date_exp, <3>
  17. date_time_interval) <4>
  18. ----
  19. *Input*:
  20. <1> numeric expression (typically a field). If this field contains only `null`
  21. values, the function returns `null`. Otherwise, the function ignores `null`
  22. values in this field.
  23. <2> numeric interval. If `null`, the function returns `null`.
  24. <3> date/time expression (typically a field). If this field contains only `null`
  25. values, the function returns `null`. Otherwise, the function ignores `null`
  26. values in this field.
  27. <4> date/time <<sql-functions-datetime-interval, interval>>. If `null`, the
  28. function returns `null`.
  29. *Output*: non-empty buckets or groups of the given expression divided according to the given interval
  30. *Description*: The histogram function takes all matching values and divides them into buckets with fixed size matching the given interval, using (roughly) the following formula:
  31. [source, sql]
  32. ----
  33. bucket_key = Math.floor(value / interval) * interval
  34. ----
  35. [NOTE]
  36. The histogram in SQL does *NOT* return empty buckets for missing intervals as the traditional <<search-aggregations-bucket-histogram-aggregation, histogram>> and <<search-aggregations-bucket-datehistogram-aggregation, date histogram>>. Such behavior does not fit conceptually in SQL which treats all missing values as `null`; as such the histogram places all missing values in the `null` group.
  37. `Histogram` can be applied on either numeric fields:
  38. [source, sql]
  39. ----
  40. include-tagged::{sql-specs}/docs/docs.csv-spec[histogramNumeric]
  41. ----
  42. or date/time fields:
  43. [source, sql]
  44. ----
  45. include-tagged::{sql-specs}/docs/docs.csv-spec[histogramDateTime]
  46. ----
  47. Expressions inside the histogram are also supported as long as the
  48. return type is numeric:
  49. [source, sql]
  50. ----
  51. include-tagged::{sql-specs}/docs/docs.csv-spec[histogramNumericExpression]
  52. ----
  53. Do note that histograms (and grouping functions in general) allow custom expressions but cannot have any functions applied to them in the `GROUP BY`. In other words, the following statement is *NOT* allowed:
  54. [source, sql]
  55. ----
  56. include-tagged::{sql-specs}/docs/docs.csv-spec[expressionOnHistogramNotAllowed]
  57. ----
  58. as it requires two groupings (one for histogram followed by a second for applying the function on top of the histogram groups).
  59. Instead one can rewrite the query to move the expression on the histogram _inside_ of it:
  60. [source, sql]
  61. ----
  62. include-tagged::{sql-specs}/docs/docs.csv-spec[histogramDateTimeExpression]
  63. ----
  64. [IMPORTANT]
  65. When the histogram in SQL is applied on **DATE** type instead of **DATETIME**, the interval specified is truncated to
  66. the multiple of a day. E.g.: for `HISTOGRAM(CAST(birth_date AS DATE), INTERVAL '2 3:04' DAY TO MINUTE)` the interval
  67. actually used will be `INTERVAL '2' DAY`. If the interval specified is less than 1 day, e.g.:
  68. `HISTOGRAM(CAST(birth_date AS DATE), INTERVAL '20' HOUR)` then the interval used will be `INTERVAL '1' DAY`.
  69. [IMPORTANT]
  70. All intervals specified for a date/time HISTOGRAM will use a <<search-aggregations-bucket-datehistogram-aggregation,fixed interval>>
  71. in their `date_histogram` aggregation definition, with the notable exceptions of `INTERVAL '1' YEAR`, `INTERVAL '1' MONTH` and `INTERVAL '1' DAY` where a calendar interval is used.
  72. The choice for a calendar interval was made for having a more intuitive result for YEAR, MONTH and DAY groupings. In the case of YEAR, for example, the calendar intervals consider a one year
  73. bucket as the one starting on January 1st that specific year, whereas a fixed interval one-year-bucket considers one year as a number
  74. of milliseconds (for example, `31536000000ms` corresponding to 365 days, 24 hours per day, 60 minutes per hour etc.). With fixed intervals,
  75. the day of February 5th, 2019 for example, belongs to a bucket that starts on December 20th, 2018 and {es} (and implicitly {es-sql}) would
  76. have returned the year 2018 for a date that's actually in 2019. With calendar interval this behavior is more intuitive, having the day of
  77. February 5th, 2019 actually belonging to the 2019 year bucket.
  78. [IMPORTANT]
  79. Histogram in SQL cannot be applied on **TIME** type.
  80. E.g.: `HISTOGRAM(CAST(birth_date AS TIME), INTERVAL '10' MINUTES)` is currently not supported.