[[general-recommendations]]
== General recommendations

[float]
[[large-size]]
=== Don't return large result sets

Elasticsearch is designed as a search engine, which makes it very good at
returning the top documents that match a query. However, it is not as good
for workloads that fall into the database domain, such as retrieving all
documents that match a particular query. If you need to do this, make sure to
use the <<search-request-scroll,Scroll>> API.
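
For instance, here is a minimal sketch of a scrolled search. The index name,
the `1m` keep-alive, and the page size are placeholders, not recommendations:

[source,console]
----
POST /my-index/_search?scroll=1m
{
  "size": 1000,
  "query": {
    "match_all": {}
  }
}
----

Each response carries a `_scroll_id`; passing it to the scroll endpoint
retrieves the next batch of documents, until a response comes back empty:

[source,console]
----
POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<_scroll_id from the previous response>"
}
----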

[float]
[[maximum-document-size]]
=== Avoid large documents

Given that the default <<modules-http,`http.max_content_length`>> is set to
100MB, Elasticsearch will refuse to index any document that is larger than
that. You might decide to increase that particular setting, but Lucene still
has a limit of about 2GB.
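
If you do decide to raise the limit, note that it is a static node setting
configured in `elasticsearch.yml`; the value below is only an illustration:

[source,yaml]
----
http.max_content_length: 200mb
----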

Even without considering hard limits, large documents are usually not
practical. Large documents put more stress on the network, memory usage and
disk, even for search requests that do not request the `_source`, since
Elasticsearch needs to fetch the `_id` of the document in all cases, and the
cost of getting this field is higher for large documents due to how the
filesystem cache works. Indexing such documents can use an amount of memory
that is a multiple of the original size of the document. Proximity search
(phrase queries for instance) and <<search-request-highlighting,highlighting>>
also become more expensive since their cost directly depends on the size of
the original document.

It is sometimes useful to reconsider what the unit of information should be.
For instance, the fact that you want to make books searchable doesn't
necessarily mean that a document should consist of a whole book. It might be a
better idea to use chapters or even paragraphs as documents, and then have a
property in these documents that identifies which book they belong to. This
not only avoids the issues with large documents, it also makes the search
experience better. For instance, if a user searches for the two words `foo`
and `bar`, a match across different chapters is probably very poor, while a
match within the same paragraph is likely good.
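
As a sketch of what such a paragraph-level document might look like (the
`paragraphs` index and the field names are hypothetical):

[source,console]
----
PUT /paragraphs/_doc/1
{
  "book_id": "moby-dick",
  "chapter": 1,
  "paragraph": 1,
  "text": "Call me Ishmael. Some years ago..."
}
----

A search for `foo` and `bar` then naturally ranks paragraphs that contain both
words close together above books that merely mention them in distant chapters.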

[float]
[[sparsity]]
=== Avoid sparsity

The data structures behind Lucene, which Elasticsearch relies on in order to
index and store data, work best with dense data, i.e. when all documents have
the same fields. This is especially true for fields that have norms enabled
(which is the case for `text` fields by default) or doc values enabled (which
is the case for numerics, `date`, `ip` and `keyword` by default).

The reason is that Lucene internally identifies documents with so-called doc
ids, which are integers between 0 and the total number of documents in the
index. These doc ids are used for communication between the internal APIs of
Lucene: for instance, searching on a term with a `match` query produces an
iterator of doc ids, and these doc ids are then used to retrieve the value of
the `norm` in order to compute a score for these documents. The way this
`norm` lookup is currently implemented is by reserving one byte per document.
The `norm` value for a given doc id can then be retrieved by reading the byte
at index `doc_id`. While this is very efficient and gives Lucene quick access
to the `norm` values of every document, it has the drawback that documents
that do not have a value still require one byte of storage.

In practice, this means that if an index has `M` documents, norms will require
`M` bytes of storage *per field*, even for fields that only appear in a small
fraction of the documents of the index. The problem is very similar with doc
values, although slightly more complex because doc values can be encoded in
multiple ways depending on the type of field and on the actual data that the
field stores. In case you are wondering: `fielddata`, which was used in
Elasticsearch pre-2.0 before being replaced with doc values, also suffered
from this issue, except that the impact was only on the memory footprint
since `fielddata` was not explicitly materialized on disk.
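
As a back-of-the-envelope illustration of that `M` bytes per field figure: in
an index of 10 million documents, norms cost roughly 10MB of storage for each
field that has them enabled, even if the field only occurs in a few thousand
of those documents.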

Note that even though the most notable impact of sparsity is on storage
requirements, it also has an impact on indexing speed and search speed, since
these bytes for documents that do not have a field still need to be written
at index time and skipped over at search time.

It is totally fine to have a minority of sparse fields in an index. But beware
that if sparsity becomes the rule rather than the exception, then the index
will not be as efficient as it could be.

This section has mostly focused on `norms` and `doc_values` because those are
the two features that are most affected by sparsity. Sparsity also affects the
efficiency of the inverted index (used to index `text`/`keyword` fields) and
dimensional points (used to index `geo_point` and numerics), but to a lesser
extent.

Here are some recommendations that can help avoid sparsity:

[float]
==== Avoid putting unrelated data in the same index

You should avoid putting documents that have totally different structures into
the same index in order to avoid sparsity. It is often better to put these
documents into different indices; you could also consider giving fewer shards
to these smaller indices since they will contain fewer documents overall.

Note that this advice does not apply if you need to use parent/child relations
between your documents, since this feature is only supported on documents that
live in the same index.
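
As an illustration, a smaller, more homogeneous index could be created with a
single shard; the index name and shard count here are placeholders:

[source,console]
----
PUT /small-index
{
  "settings": {
    "index.number_of_shards": 1
  }
}
----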

[float]
==== Normalize document structures

Even if you really need to put different kinds of documents in the same index,
maybe there are opportunities to reduce sparsity. For instance, if all
documents in the index have a timestamp field but some call it `timestamp` and
others call it `creation_date`, it would help to rename it so that all
documents have the same field name for the same data.
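
One way to do such a rename without changing every client is an ingest
pipeline with a `rename` processor. A sketch, where the pipeline name and
field names are assumptions:

[source,console]
----
PUT /_ingest/pipeline/normalize-timestamp
{
  "processors": [
    {
      "rename": {
        "field": "creation_date",
        "target_field": "timestamp",
        "ignore_missing": true
      }
    }
  ]
}
----

Documents indexed with the `pipeline=normalize-timestamp` request parameter
then all end up with the same `timestamp` field.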

[float]
==== Avoid types

Types might sound like a good way to store multiple tenants in a single index.
They are not: given that types store everything in a single index, having
multiple types with different fields in that index will also cause problems
due to sparsity, as described above. If your types do not have very similar
mappings, you might want to consider moving them to dedicated indices.
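
On versions that still support types, one way to split a type out is the
reindex API; the index and type names below are placeholders:

[source,console]
----
POST /_reindex
{
  "source": {
    "index": "shared-index",
    "type": "tenant_a"
  },
  "dest": {
    "index": "tenant-a"
  }
}
----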

[float]
==== Disable `norms` and `doc_values` on sparse fields

If none of the above recommendations apply in your case, you might want to
check whether you actually need `norms` and `doc_values` on your sparse
fields. `norms` can be disabled if producing scores is not necessary on a
field; this is typically true for fields that are only used for filtering.
`doc_values` can be disabled on fields that are used neither for sorting nor
for aggregations. Beware that this decision should not be made lightly, since
these parameters cannot be changed on a live index: you would have to reindex
if you later realize that you need `norms` or `doc_values` after all.
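
For instance, here is a sketch of a mapping that disables these features on
two hypothetical sparse fields. It uses the typeless mapping syntax; on
versions that require mapping types, nest `properties` under the type name:

[source,console]
----
PUT /my-index
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "norms": false
      },
      "status": {
        "type": "keyword",
        "doc_values": false
      }
    }
  }
}
----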