[[tune-for-indexing-speed]]
== Tune for indexing speed

[float]
=== Use bulk requests

Bulk requests yield much better performance than single-document index
requests. To find the optimal size of a bulk request, run a benchmark on a
single node with a single shard. First try to index 100 documents at once,
then 200, then 400, and so on, doubling the number of documents in a bulk
request in every benchmark run. When the indexing speed starts to plateau, you
have reached the optimal size of a bulk request for your data. In case of a
tie, it is better to err in the direction of too few rather than too many
documents. Beware that overly large bulk requests might put the cluster under
memory pressure when many of them are sent concurrently, so it is advisable to
avoid going beyond a few tens of megabytes per request even if larger requests
seem to perform better.
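
The doubling benchmark described above could be scripted roughly as follows.
This is a minimal sketch, not an official tool: the helper names
`doubling_sizes` and `pick_bulk_size` and the 5% `min_gain` threshold are
assumptions made for illustration, and the throughput numbers must come from
your own benchmark runs.

```python
def doubling_sizes(start=100, max_docs=12800):
    """Yield bulk sizes start, 2*start, 4*start, ... up to max_docs."""
    size = start
    while size <= max_docs:
        yield size
        size *= 2


def pick_bulk_size(throughput_by_size, min_gain=0.05):
    """Return the smallest bulk size after which throughput plateaus.

    throughput_by_size maps bulk size -> measured documents per second.
    min_gain is a hypothetical threshold: doubling the size must improve
    throughput by at least this fraction, otherwise we call it a plateau.
    """
    sizes = sorted(throughput_by_size)
    for smaller, larger in zip(sizes, sizes[1:]):
        gain = throughput_by_size[larger] / throughput_by_size[smaller] - 1.0
        if gain < min_gain:
            return smaller  # on a tie, err toward fewer documents
    return sizes[-1]  # no plateau observed in the tested range
```

Feed `pick_bulk_size` one measurement per size produced by `doubling_sizes`,
and still cap the result so that a single request stays within a few tens of
megabytes.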

[float]
=== Use multiple workers/threads to send data to Elasticsearch

A single thread sending bulk requests is unlikely to be able to max out the
indexing capacity of an Elasticsearch cluster. To use all resources of the
cluster, send data from multiple threads or processes. In addition to making
better use of the resources of the cluster, this should help reduce the cost
of each fsync.

Make sure to watch for `TOO_MANY_REQUESTS (429)` response codes
(`EsRejectedExecutionException` with the Java client), which is the way that
Elasticsearch tells you that it cannot keep up with the current indexing rate.
When this happens, pause indexing for a while before trying again, ideally
with randomized exponential backoff.

As with sizing bulk requests, only testing can tell what the optimal number of
workers is. Progressively increase the number of workers until either I/O or
CPU is saturated on the cluster.
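
One way to handle `429` responses with randomized exponential backoff is
sketched below. It assumes a hypothetical `send` callable that performs one
bulk request and returns the HTTP status code. The base delay, cap and attempt
count are illustrative values, not recommendations from Elasticsearch.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Random delay in seconds between 0 and min(cap, base * 2**attempt)."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


def send_with_backoff(send, max_attempts=6, sleep=time.sleep, rng=random):
    """Call send() until it returns a status other than 429.

    send is a hypothetical callable that issues one bulk request and
    returns its HTTP status code. Returns the last status seen.
    """
    for attempt in range(max_attempts):
        status = send()
        if status != 429:
            return status
        if attempt + 1 < max_attempts:
            # The cluster cannot keep up: wait before retrying.
            sleep(backoff_delay(attempt, rng=rng))
    return 429
```

Each worker thread or process would wrap its own bulk calls this way, so that
the workers do not all retry at the same moment.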

[float]
=== Increase the refresh interval

The default <<dynamic-index-settings,`index.refresh_interval`>> is `1s`, which
forces Elasticsearch to create a new segment every second.
Increasing this value (to, say, `30s`) allows larger segments to flush and
decreases future merge pressure.

[float]
=== Disable refresh and replicas for initial loads

If you need to load a large amount of data at once, you should disable refresh
by setting `index.refresh_interval` to `-1` and set `index.number_of_replicas`
to `0`. This will temporarily put your index at risk since the loss of any shard
will cause data loss, but at the same time indexing will be faster since
documents will be indexed only once. Once the initial loading is finished, you
can set `index.refresh_interval` and `index.number_of_replicas` back to their
original values.
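
The two settings changes above could be prepared as request bodies for the
update index settings API (`PUT /<index>/_settings`), as in this minimal
sketch. The helper names are hypothetical, and the restore values shown (`1s`
and one replica) are only assumed defaults: record and restore the values your
index actually used.

```python
import json


def bulk_load_settings():
    """Settings body that disables refresh and replicas for an initial load."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def restore_settings(refresh_interval="1s", number_of_replicas=1):
    """Settings body that puts back the values saved before the load.

    The defaults here are assumptions; pass the values your index had.
    """
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": number_of_replicas,
        }
    }


# Serialize either body and send it with the HTTP client of your choice.
load_body = json.dumps(bulk_load_settings())
restore_body = json.dumps(restore_settings())
```

Send `load_body` before the load starts and `restore_body` once it has
finished.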

[float]
=== Disable swapping

You should make sure that the operating system is not swapping out the Java
process by <<setup-configuration-memory,disabling swapping>>.

[float]
=== Give memory to the filesystem cache

The filesystem cache is used to buffer I/O operations. Make sure to give at
least half the memory of the machine running Elasticsearch to the filesystem
cache.

[float]
=== Use auto-generated ids

When indexing a document that has an explicit id, Elasticsearch needs to check
whether a document with the same id already exists within the same shard, which
is a costly operation and gets even more costly as the index grows. When you
use auto-generated ids, Elasticsearch can skip this check, which makes indexing
faster.
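
In a bulk request, leaving `_id` out of the action line is what lets
Elasticsearch generate the id itself. The sketch below builds such a
newline-delimited body. It assumes the target index is named in the request
URL (for example `POST /<index>/_bulk`) so the action line can stay empty.
Depending on your Elasticsearch version, a type may also have to be given in
the URL or in the action line. The helper name `bulk_body` is hypothetical.

```python
import json


def bulk_body(docs):
    """Build a bulk request body whose action lines carry no `_id`."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))  # no "_id": auto-generated
        lines.append(json.dumps(doc))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)
```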

[float]
=== Use faster hardware

If indexing is I/O bound, you should investigate giving more memory to the
filesystem cache (see above) or buying faster drives. In particular, SSDs are
known to perform better than spinning disks. Always use local storage; remote
filesystems such as `NFS` or `SMB` should be avoided. Also beware of
virtualized storage such as Amazon's `Elastic Block Store`. Virtualized
storage works very well with Elasticsearch, and it is appealing since it is so
fast and simple to set up, but it is also unfortunately inherently slower on an
ongoing basis when compared to dedicated local storage. If you put an index on
`EBS`, be sure to use provisioned IOPS, otherwise operations could be quickly
throttled.

Stripe your index across multiple SSDs by configuring a RAID 0 array. Remember
that it will increase the risk of failure since the failure of any one SSD
destroys the index. However, this is typically the right tradeoff to make:
optimize single shards for maximum performance, and then add replicas across
different nodes so there's redundancy for any node failures. You can also use
<<modules-snapshots,snapshot and restore>> to back up the index for further
insurance.

[float]
=== Indexing buffer size

If your node is doing only heavy indexing, be sure
<<indexing-buffer,`indices.memory.index_buffer_size`>> is large enough to give
at most 512 MB indexing buffer per shard doing heavy indexing (beyond that,
indexing performance does not typically improve). Elasticsearch takes that
setting (a percentage of the Java heap or an absolute byte-size), and
uses it as a shared buffer across all active shards. Very active shards will
naturally use this buffer more than shards that are performing lightweight
indexing.

The default is `10%`, which is often plenty: for example, if you give the JVM
10GB of memory, it will give 1GB to the index buffer, which is enough to host
two shards that are heavily indexing.
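
The arithmetic behind that example can be checked with a few lines of Python.
This is only a back-of-the-envelope sketch. The function name is hypothetical,
and it assumes the buffer is set as a percentage of the heap and that each
heavily indexing shard gets the full 512 MB.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def heavy_shards_supported(heap_bytes, buffer_percent=10, per_shard_bytes=512 * MB):
    """Number of heavily indexing shards that can each get per_shard_bytes."""
    buffer_bytes = heap_bytes * buffer_percent // 100  # shared index buffer
    return buffer_bytes // per_shard_bytes


shards = heavy_shards_supported(10 * GB)  # 10% of 10GB is 1GB, so 2 shards
```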

[float]
=== Disable `_field_names`

The <<mapping-field-names-field,`_field_names` field>> introduces some
index-time overhead, so you might want to disable it if you never need to
run `exists` queries.

[float]
=== Additional optimizations

Many of the strategies outlined in <<tune-for-disk-usage>> also
provide an improvement in the speed of indexing.