[[tune-for-indexing-speed]]
== Tune for indexing speed

[discrete]
=== Use bulk requests

Bulk requests yield much better performance than single-document index
requests. To find the optimal size of a bulk request, run a benchmark on a
single node with a single shard. First try to index 100 documents at once,
then 200, then 400, and so on, doubling the number of documents in a bulk
request in every benchmark run. When the indexing speed starts to plateau, you
have reached the optimal size of a bulk request for your data. In case of a
tie, it is better to err in the direction of too few rather than too many
documents. Beware that overly large bulk requests might put the cluster under
memory pressure when many of them are sent concurrently, so it is advisable to
avoid going beyond a few tens of megabytes per request even if larger requests
seem to perform better.
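
The doubling procedure above can be sketched roughly as follows. This is only
an illustration: `find_bulk_size`, `measure`, `max_docs`, and the 10%
`min_gain` plateau threshold are hypothetical names and values, not part of
any Elasticsearch client. `measure` stands for whatever benchmark you run to
get documents per second for a given bulk size.

```python
from typing import Callable


def find_bulk_size(
    measure: Callable[[int], float],
    start: int = 100,
    max_docs: int = 100_000,
    min_gain: float = 0.10,
) -> int:
    """Double the bulk size until throughput stops improving noticeably.

    measure(size) is assumed to run one benchmark and return docs/second.
    """
    best_size = start
    best_rate = measure(start)
    size = start * 2
    while size <= max_docs:
        rate = measure(size)
        # Treat a gain below min_gain as a plateau and keep the smaller
        # size, erring toward too few rather than too many documents.
        if rate < best_rate * (1 + min_gain):
            break
        best_size, best_rate = size, rate
        size *= 2
    return best_size
```

The threshold is a judgment call. A real benchmark should also track the
request size in bytes, since the memory-pressure guidance above is expressed
in megabytes rather than document counts.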

[discrete]
[[multiple-workers-threads]]
=== Use multiple workers/threads to send data to Elasticsearch

A single thread sending bulk requests is unlikely to be able to max out the
indexing capacity of an Elasticsearch cluster. To use all resources of the
cluster, send data from multiple threads or processes. In addition to making
better use of the resources of the cluster, this should help reduce the cost
of each fsync.

Make sure to watch for `TOO_MANY_REQUESTS (429)` response codes
(`EsRejectedExecutionException` with the Java client), which is how
Elasticsearch tells you that it cannot keep up with the current indexing rate.
When this happens, pause indexing briefly before trying again, ideally with
randomized exponential backoff.
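
A minimal sketch of randomized exponential backoff is shown below. The names
`backoff_delays` and `send_with_backoff`, the `base` and `cap` values, and the
assumption that `send` is a callable returning an HTTP status code are all
illustrative. They do not correspond to any Elasticsearch client API.

```python
import random
import time
from typing import Callable, Iterator


def backoff_delays(
    retries: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> Iterator[float]:
    """Yield one randomized delay (seconds) per retry, doubling the ceiling."""
    for attempt in range(retries):
        yield rng() * min(cap, base * 2 ** attempt)


def send_with_backoff(
    send: Callable[[], int],
    retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() again after a pause for as long as it returns 429."""
    status = send()
    for delay in backoff_delays(retries):
        if status != 429:
            break
        sleep(delay)
        status = send()
    return status
```

Randomizing each delay keeps several workers that were rejected at the same
moment from retrying in lockstep.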

As with sizing bulk requests, only testing can tell what the optimal number of
workers is. Test this by progressively increasing the number of workers until
either I/O or CPU is saturated on the cluster.
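
As a rough illustration of sending bulk requests from several workers, the
sketch below splits documents into bulk-sized chunks and hands them to a
thread pool. `chunked`, `index_concurrently`, and the `send_bulk` callable are
hypothetical. `send_bulk` stands in for whatever function actually issues one
bulk request in your client.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, Sequence


def chunked(docs: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Split docs into consecutive slices of at most `size` documents."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]


def index_concurrently(
    docs: Sequence[dict],
    send_bulk: Callable[[Sequence[dict]], None],
    bulk_size: int = 500,
    workers: int = 4,
) -> int:
    """Send bulk-sized chunks from a pool of threads; return docs sent."""

    def send(batch: Sequence[dict]) -> int:
        send_bulk(batch)
        return len(batch)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(send, chunked(docs, bulk_size)))
```

Rerunning the same load with increasing `workers` values while watching I/O
and CPU on the cluster is one way to carry out the test described above.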

[discrete]
=== Unset or increase the refresh interval

The operation that consists of making changes visible to search - called a
<<indices-refresh,refresh>> - is costly, and calling it often while there is
ongoing indexing activity can hurt indexing speed.

include::{es-repo-dir}/indices/refresh.asciidoc[tag=refresh-interval-default]

This is the optimal configuration if you have no or very little search traffic
(e.g. less than one search request every 5 minutes) and want to optimize for
indexing speed. This behavior aims to automatically optimize bulk indexing in
the default case when no searches are performed. To opt out of this behavior,
set the refresh interval explicitly.

On the other hand, if your index receives regular search requests, this
default behavior means that Elasticsearch will refresh your index every
second. If you can afford to increase the amount of time between when a
document gets indexed and when it becomes visible, increasing the
<<index-refresh-interval-setting,`index.refresh_interval`>> to a larger value,
e.g. `30s`, might help improve indexing speed.
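
For illustration, the request that raises the refresh interval could be built
as below. The helper name `refresh_interval_request` and the index name
`my-index` are hypothetical. The method, path, and body follow the update
index settings API, and the sketch only constructs the request rather than
sending it.

```python
import json
from typing import Tuple


def refresh_interval_request(index: str, interval: str = "30s") -> Tuple[str, str, str]:
    """Return (method, path, JSON body) for an update index settings call."""
    body = {"index": {"refresh_interval": interval}}
    return "PUT", f"/{index}/_settings", json.dumps(body)


method, path, body = refresh_interval_request("my-index")
print(method, path, body)
```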

[discrete]
=== Disable replicas for initial loads

If you have a large amount of data that you want to load all at once into
Elasticsearch, it may be beneficial to set `index.number_of_replicas` to `0` in
order to speed up indexing. Having no replicas means that losing a single node
may incur data loss, so it is important that the data lives elsewhere so that
this initial load can be retried in case of an issue. Once the initial load is
finished, you can set `index.number_of_replicas` back to its original value.

If `index.refresh_interval` is configured in the index settings, it may further
help to unset it during this initial load and set it back to its original
value once the initial load is finished.
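
The before-and-after settings for an initial load might look like the sketch
below. `initial_load_bodies` is a hypothetical helper, and the `original`
values passed in are examples only. It relies on the update index settings API
treating a `null` value as "reset this setting to its default", which is how
`index.refresh_interval` is unset here.

```python
import json
from typing import Dict, Tuple


def initial_load_bodies(original: Dict[str, object]) -> Tuple[str, str]:
    """Return JSON settings bodies to apply before and after the load.

    `original` holds the index's current number_of_replicas and
    refresh_interval so they can be restored afterwards.
    """
    during = {"index": {"number_of_replicas": 0, "refresh_interval": None}}
    after = {
        "index": {
            "number_of_replicas": original["number_of_replicas"],
            "refresh_interval": original["refresh_interval"],
        }
    }
    # None serializes to JSON null, which resets the setting to its default.
    return json.dumps(during, sort_keys=True), json.dumps(after, sort_keys=True)


before_load, after_load = initial_load_bodies(
    {"number_of_replicas": 1, "refresh_interval": "30s"}  # example values
)
```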

[discrete]
=== Disable swapping

You should make sure that the operating system is not swapping out the Java
process by <<setup-configuration-memory,disabling swapping>>.

[discrete]
=== Give memory to the filesystem cache

The filesystem cache is used to buffer I/O operations. Make sure to give at
least half the memory of the machine running Elasticsearch to the filesystem
cache.

[discrete]
=== Use auto-generated ids

When indexing a document that has an explicit id, Elasticsearch needs to check
whether a document with the same id already exists within the same shard, which
is a costly operation and gets even more costly as the index grows. By using
auto-generated ids, Elasticsearch can skip this check, which makes indexing
faster.
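
In a bulk request, leaving `_id` out of the action line is what lets
Elasticsearch generate the id. The sketch below builds a newline-delimited
bulk body both ways. `bulk_payload`, its `id_field` parameter, and the `logs`
index name are hypothetical helpers for illustration.

```python
import json
from typing import Iterable, Optional


def bulk_payload(index: str, docs: Iterable[dict], id_field: Optional[str] = None) -> str:
    """Build an NDJSON bulk body; omit _id unless id_field is given."""
    lines = []
    for doc in docs:
        action = {"_index": index}
        if id_field is not None:
            # Explicit id: Elasticsearch must check the shard for an
            # existing document with this id.
            action["_id"] = doc[id_field]
        lines.append(json.dumps({"index": action}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


auto_ids = bulk_payload("logs", [{"msg": "a"}, {"msg": "b"}])
```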

[discrete]
=== Use faster hardware

If indexing is I/O-bound, consider increasing the size of the filesystem cache
(see above) or using faster storage. Elasticsearch generally creates individual
files with sequential writes. However, indexing involves writing multiple files
concurrently, and a mix of random and sequential reads too, so SSD drives tend
to perform better than spinning disks.

Stripe your index across multiple SSDs by configuring a RAID 0 array. Remember
that this increases the risk of failure, since the failure of any one SSD
destroys the index. However, this is typically the right tradeoff to make:
optimize single shards for maximum performance, and then add replicas across
different nodes so there's redundancy for any node failures. You can also use
<<snapshot-restore,snapshot and restore>> to back up the index for further
insurance.

Directly-attached (local) storage generally performs better than remote storage
because it is simpler to configure well and avoids communication overhead.
With careful tuning it is sometimes possible to achieve acceptable performance
using remote storage too. Benchmark your system with a realistic workload to
determine the effects of any tuning parameters. If you cannot achieve the
performance you expect, work with the vendor of your storage system to identify
the problem.

[discrete]
=== Indexing buffer size

If your node is doing only heavy indexing, be sure
<<indexing-buffer,`indices.memory.index_buffer_size`>> is large enough to give
at most 512 MB indexing buffer per shard doing heavy indexing (beyond that
indexing performance does not typically improve). Elasticsearch takes that
setting (a percentage of the Java heap or an absolute byte size) and uses it
as a shared buffer across all active shards. Very active shards will naturally
use this buffer more than shards that are performing lightweight indexing.

The default is `10%`, which is often plenty: for example, if you give the JVM
10GB of memory, it will give 1GB to the index buffer, which is enough to host
two shards that are heavily indexing.
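
The arithmetic in that example can be written out as a small calculation. The
function name `heavy_shards_supported` and its parameters are hypothetical. It
simply divides the shared buffer by the 512 MB per-shard figure quoted above.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def heavy_shards_supported(
    heap_bytes: int,
    buffer_percent: int = 10,
    per_shard_bytes: int = 512 * MIB,
) -> int:
    """How many heavily indexing shards can each get per_shard_bytes."""
    buffer_bytes = heap_bytes * buffer_percent // 100
    return buffer_bytes // per_shard_bytes


# 10% of a 10 GiB heap is 1 GiB, which covers two 512 MiB shares.
print(heavy_shards_supported(10 * GIB))  # prints 2
```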

[discrete]
=== Use {ccr} to prevent searching from stealing resources from indexing

Within a single cluster, indexing and searching can compete for resources. By
setting up two clusters, configuring <<xpack-ccr,{ccr}>> to replicate data from
one cluster to the other one, and routing all searches to the cluster that has
the follower indices, search activity will no longer steal resources from
indexing on the cluster that hosts the leader indices.

[discrete]
=== Avoid hot spotting

<<hotspotting,Hot spotting>> can occur when node resources, shards, or requests
are not evenly distributed. {es} maintains cluster state by syncing it across
nodes, so continually hot spotted nodes can cause overall cluster performance
degradation.

[discrete]
=== Additional optimizations

Many of the strategies outlined in <<tune-for-disk-usage>> also
provide an improvement in the speed of indexing.