[[search-aggregations-bucket-adjacency-matrix-aggregation]]
=== Adjacency matrix aggregation
++++
<titleabbrev>Adjacency matrix</titleabbrev>
++++

A bucket aggregation returning a form of {wikipedia}/Adjacency_matrix[adjacency matrix].
The request provides a collection of named filter expressions, similar to the `filters` aggregation
request.
Each bucket in the response represents a non-empty cell in the matrix of intersecting filters.

Given filters named `A`, `B` and `C` the response would return buckets with the following names:

[options="header"]
|=======================
|  h|A  h|B  h|C
h|A |A  |A&B |A&C
h|B |   |B   |B&C
h|C |   |    |C
|=======================

The intersecting buckets, e.g. `A&C`, are labelled using a combination of the two filter names separated by
the ampersand character. Note that the response does not also include a `C&A` bucket, as this would be the
same set of documents as `A&C`. The matrix is said to be _symmetric_, so we only return half of it. To do this, we sort
the filter name strings and always use the one that sorts first in a pair as the value to the left of the `&` separator.

An alternative `separator` parameter can be passed in the request if clients wish to use a separator string
other than the default ampersand.
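
The naming rule can be illustrated with a short client-side sketch. The `bucket_keys` helper is hypothetical and only mimics the labelling described above; it is not how Elasticsearch computes buckets. It lists every key the matrix could contain, whereas a real response omits empty cells.

```python
from itertools import combinations


def bucket_keys(filter_names, separator="&"):
    """Return every bucket key the matrix could contain (hypothetical helper)."""
    # Sort first so the lower-sorting name always ends up left of the separator.
    names = sorted(filter_names)
    # Diagonal cells: one bucket per filter on its own.
    keys = list(names)
    # Upper triangle only: combinations() yields each unordered pair once,
    # so "C&A" is never produced alongside "A&C".
    keys += [f"{a}{separator}{b}" for a, b in combinations(names, 2)]
    return keys


print(bucket_keys(["C", "A", "B"]))
print(bucket_keys(["C", "A", "B"], separator="|"))
```

The second call shows the effect of a custom separator on the pair labels; single-filter keys are unaffected.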

Example:

[source,console,id=adjacency-matrix-aggregation-example]
--------------------------------------------------
PUT /emails/_bulk?refresh
{ "index" : { "_id" : 1 } }
{ "accounts" : ["hillary", "sidney"]}
{ "index" : { "_id" : 2 } }
{ "accounts" : ["hillary", "donald"]}
{ "index" : { "_id" : 3 } }
{ "accounts" : ["vladimir", "donald"]}

GET emails/_search
{
  "size": 0,
  "aggs" : {
    "interactions" : {
      "adjacency_matrix" : {
        "filters" : {
          "grpA" : { "terms" : { "accounts" : ["hillary", "sidney"] }},
          "grpB" : { "terms" : { "accounts" : ["donald", "mitt"] }},
          "grpC" : { "terms" : { "accounts" : ["vladimir", "nigel"] }}
        }
      }
    }
  }
}
--------------------------------------------------

In the above example, we analyse email messages to see which groups of individuals
have exchanged messages.
We will get counts for each group individually and also a count of messages for pairs
of groups that have recorded interactions.

Response:

[source,console-result]
--------------------------------------------------
{
  "took": 9,
  "timed_out": false,
  "_shards": ...,
  "hits": ...,
  "aggregations": {
    "interactions": {
      "buckets": [
        {
          "key":"grpA",
          "doc_count": 2
        },
        {
          "key":"grpA&grpB",
          "doc_count": 1
        },
        {
          "key":"grpB",
          "doc_count": 2
        },
        {
          "key":"grpB&grpC",
          "doc_count": 1
        },
        {
          "key":"grpC",
          "doc_count": 1
        }
      ]
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 9/"took": $body.took/]
// TESTRESPONSE[s/"_shards": \.\.\./"_shards": $body._shards/]
// TESTRESPONSE[s/"hits": \.\.\./"hits": $body.hits/]
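
To see where these counts come from, the following sketch replays the example on the client side. It is an illustration only, not the server-side implementation: the `adjacency_counts` helper is hypothetical, and each `terms` filter is modelled as a set that matches when a document shares at least one value with it.

```python
from itertools import combinations

# The three documents indexed by the bulk request in the example.
docs = [
    {"accounts": ["hillary", "sidney"]},
    {"accounts": ["hillary", "donald"]},
    {"accounts": ["vladimir", "donald"]},
]

# The named filters from the request, modelled as sets of accepted terms.
filters = {
    "grpA": {"hillary", "sidney"},
    "grpB": {"donald", "mitt"},
    "grpC": {"vladimir", "nigel"},
}


def adjacency_counts(docs, filters, separator="&"):
    """Count documents per filter and per pair of filters (illustrative only)."""
    counts = {}
    for doc in docs:
        accounts = set(doc["accounts"])
        # Names of all filters this document matches, sorted so that pair
        # keys always put the lower-sorting name on the left.
        matched = sorted(name for name, terms in filters.items() if accounts & terms)
        for name in matched:
            counts[name] = counts.get(name, 0) + 1
        for a, b in combinations(matched, 2):
            key = f"{a}{separator}{b}"
            counts[key] = counts.get(key, 0) + 1
    # Pairs that never co-occur (here grpA and grpC) are simply absent.
    return dict(sorted(counts.items()))


print(adjacency_counts(docs, filters))
# {'grpA': 2, 'grpA&grpB': 1, 'grpB': 2, 'grpB&grpC': 1, 'grpC': 1}
```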

==== Usage
On its own, this aggregation can provide all of the data required to create an undirected weighted graph.
However, when used with child aggregations such as a `date_histogram`, the results can provide the
additional levels of data required to perform {wikipedia}/Dynamic_network_analysis[dynamic network analysis],
where examining interactions _over time_ becomes important.
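
As a rough sketch of the first point, the buckets from a response can be split into weighted nodes and weighted undirected edges. The `buckets_to_graph` helper is hypothetical, and it assumes that no filter name contains the separator string.

```python
def buckets_to_graph(buckets, separator="&"):
    """Split bucket keys into node weights and undirected edge weights."""
    nodes, edges = {}, {}
    for bucket in buckets:
        parts = bucket["key"].split(separator)
        if len(parts) == 1:
            # Single-filter bucket: weight of the node itself.
            nodes[parts[0]] = bucket["doc_count"]
        elif len(parts) == 2:
            # Pair bucket: keys are already ordered, so (a, b) is unique per edge.
            edges[(parts[0], parts[1])] = bucket["doc_count"]
        else:
            # Only reachable if a filter name contains the separator.
            raise ValueError(f"unexpected bucket key: {bucket['key']!r}")
    return nodes, edges


# Buckets copied from the example response.
buckets = [
    {"key": "grpA", "doc_count": 2},
    {"key": "grpA&grpB", "doc_count": 1},
    {"key": "grpB", "doc_count": 2},
    {"key": "grpB&grpC", "doc_count": 1},
    {"key": "grpC", "doc_count": 1},
]

nodes, edges = buckets_to_graph(buckets)
print(nodes)
print(edges)
```

The resulting dictionaries can be handed to any graph library; keeping edges keyed by an ordered pair mirrors the half-matrix that the aggregation returns.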

==== Limitations
For N filters, the matrix can contain up to N(N+1)/2 buckets (roughly N²/2), which can be costly.
The circuit breaker settings prevent results from producing too many buckets. To avoid excessive disk seeks,
the `indices.query.bool.max_clause_count` setting is used to limit the number of filters.
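
The growth in bucket count is easy to check with a small calculation. This sketch counts the N single-filter cells plus one cell per unordered pair of filters; `max_buckets` is a hypothetical helper, and the result is an upper bound because empty cells are not returned.

```python
def max_buckets(num_filters):
    """Upper bound on buckets: N single-filter cells plus N*(N-1)/2 pairs."""
    return num_filters * (num_filters + 1) // 2


for n in (3, 10, 100):
    print(n, max_buckets(n))
# 3 6
# 10 55
# 100 5050
```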