[[general-recommendations]]
== General recommendations

[float]
[[large-size]]
=== Don't return large result sets

Elasticsearch is designed as a search engine, which makes it very good at
getting back the top documents that match a query. However, it is not as good
for workloads that fall into the database domain, such as retrieving all
documents that match a particular query. If you need to do this, make sure to
use the <<search-request-scroll,Scroll>> API.

[float]
[[maximum-document-size]]
=== Avoid large documents

Given that the default <<modules-http,`http.max_content_length`>> is set to
100MB, Elasticsearch will refuse to index any document that is larger than
that. You might decide to increase that particular setting, but Lucene still
has a limit of about 2GB.

Even without considering hard limits, large documents are usually not
practical. Large documents put more stress on network, memory usage and disk,
even for search requests that do not request the `_source`, since
Elasticsearch needs to fetch the `_id` of the document in all cases, and the
cost of getting this field is bigger for large documents due to how the
filesystem cache works. Indexing such a document can use an amount of memory
that is a multiple of the original size of the document. Proximity search
(phrase queries for instance) and <<search-request-highlighting,highlighting>>
also become more expensive since their cost directly depends on the size of
the original document.

It is sometimes useful to reconsider what the unit of information should be.
For instance, the fact that you want to make books searchable doesn't
necessarily mean that a document should consist of a whole book. It might be a
better idea to use chapters or even paragraphs as documents, and then have a
property in these documents that identifies which book they belong to. This
not only avoids the issues with large documents, it also makes the search
experience better. For instance, if a user searches for two words `foo` and
`bar`, a match across different chapters is probably very poor, while a match
within the same paragraph is likely good.
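For example, here is a minimal sketch of this approach; the index, type and
field names (`library`, `paragraph`, `book_id`, ...) are made up for the
illustration:

[source,js]
--------------------------------------------------
PUT library/paragraph/1
{
  "book_id":   "moby-dick",
  "chapter":   1,
  "paragraph": 1,
  "content":   "Call me Ishmael. Some years ago..."
}
--------------------------------------------------

Searches then return individual paragraphs, and the `book_id` field can be
used to filter or aggregate matches per book.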
[float]
[[sparsity]]
=== Avoid sparsity

The data structures behind Lucene, which Elasticsearch relies on in order to
index and store data, work best with dense data, i.e. when all documents have
the same fields. This is especially true for fields that have norms enabled
(which is the case for `text` fields by default) or doc values enabled (which
is the case for numerics, `date`, `ip` and `keyword` by default).

The reason is that Lucene internally identifies documents with so-called doc
ids, which are integers between 0 and the total number of documents in the
index. These doc ids are used for communication between the internal APIs of
Lucene: for instance searching on a term with a `match` query produces an
iterator of doc ids, and these doc ids are then used to retrieve the value of
the `norm` in order to compute a score for these documents. The way this
`norm` lookup is currently implemented is by reserving one byte per document.
The `norm` value for a given doc id can then be retrieved by reading the byte
at index `doc_id`. While this is very efficient and helps Lucene quickly
access the `norm` values of every document, it has the drawback that documents
that do not have a value still require one byte of storage.

In practice, this means that if an index has `M` documents, norms will require
`M` bytes of storage *per field*, even for fields that only appear in a small
fraction of the documents of the index. The problem is very similar with doc
values, although it is slightly more complex because doc values can be encoded
in several ways depending on the type of the field and on the actual data that
the field stores. In case you wonder: `fielddata`, which was used in
Elasticsearch pre-2.0 before being replaced with doc values, also suffered
from this issue, except that the impact was only on the memory footprint since
`fielddata` was not explicitly materialized on disk.

Note that even though the most notable impact of sparsity is on storage
requirements, it also has an impact on indexing speed and search speed, since
these bytes for documents that do not have a field still need to be written at
index time and skipped over at search time.

It is totally fine to have a minority of sparse fields in an index. But beware
that if sparsity becomes the rule rather than the exception, then the index
will not be as efficient as it could be.

This section mostly focused on `norms` and `doc_values` because those are the
two features that are most affected by sparsity. Sparsity also affects the
efficiency of the inverted index (used to index `text`/`keyword` fields) and
dimensional points (used to index `geo_point` and numerics), but to a lesser
extent.

Here are some recommendations that can help avoid sparsity:

[float]
==== Avoid putting unrelated data in the same index

You should avoid putting documents that have totally different structures into
the same index in order to avoid sparsity. It is often better to put these
documents into different indices; you could also consider giving fewer shards
to these smaller indices since they will contain fewer documents overall.

Note that this advice does not apply if you need to use parent/child relations
between your documents, since this feature is only supported on documents that
live in the same index.

[float]
==== Normalize document structures

Even if you really need to put different kinds of documents in the same index,
maybe there are opportunities to reduce sparsity. For instance, if all
documents in the index have a timestamp field but some call it `timestamp` and
others call it `creation_date`, it would help to rename it so that all
documents have the same field name for the same data.

[float]
==== Avoid types

Types might sound like a good way to store multiple tenants in a single index.
They are not: given that types store everything in a single index, having
multiple types that have different fields in a single index will also cause
problems due to sparsity as described above. If your types do not have very
similar mappings, you might want to consider moving them to a dedicated index.
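For example, rather than storing `tweet` and `user` documents as two types of
a shared index, a sketch like the following (index, type and field names are
hypothetical) gives each kind of document a dedicated index that only contains
the fields it actually uses, and gives fewer shards to the smaller index as
suggested above:

[source,js]
--------------------------------------------------
PUT tweets
{
  "settings": { "number_of_shards": 5 },
  "mappings": {
    "tweet": {
      "properties": {
        "content": { "type": "text" },
        "user_id": { "type": "keyword" },
        "posted":  { "type": "date" }
      }
    }
  }
}

PUT users
{
  "settings": { "number_of_shards": 1 },
  "mappings": {
    "user": {
      "properties": {
        "name":  { "type": "text" },
        "email": { "type": "keyword" }
      }
    }
  }
}
--------------------------------------------------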
[float]
==== Disable `norms` and `doc_values` on sparse fields

If none of the above recommendations apply in your case, you might want to
check whether you actually need `norms` and `doc_values` on your sparse
fields. `norms` can be disabled if producing scores is not necessary on a
field; this is typically true for fields that are only used for filtering.
`doc_values` can be disabled on fields that are used neither for sorting nor
for aggregations. Beware that this decision should not be made lightly since
these parameters cannot be changed on a live index, so you would have to
reindex if you realize that you need `norms` or `doc_values` after all.
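For instance, here is a minimal mapping sketch (the index, type and field
names are made up for the illustration) that disables `norms` on a sparse
`text` field that never contributes to scoring, and `doc_values` on a sparse
`keyword` field that is only used for filtering, never for sorting or
aggregations:

[source,js]
--------------------------------------------------
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "description": {
          "type": "text",
          "norms": false
        },
        "internal_tag": {
          "type": "keyword",
          "doc_values": false
        }
      }
    }
  }
}
--------------------------------------------------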