[[tune-for-indexing-speed]]
== Tune for indexing speed

[float]
=== Use bulk requests

Bulk requests will yield much better performance than single-document index
requests. In order to know the optimal size of a bulk request, you should run
a benchmark on a single node with a single shard. First try to index 100
documents at once, then 200, then 400, etc., doubling the number of documents
in a bulk request in every benchmark run. When the indexing speed starts to
plateau, you know you have reached the optimal size of a bulk request for your
data. In case of a tie, it is better to err in the direction of too few rather
than too many documents. Beware that overly large bulk requests might put the
cluster under memory pressure when many of them are sent concurrently, so
it is advisable to avoid going beyond a few tens of megabytes per request
even if larger requests seem to perform better.

[float]
[[multiple-workers-threads]]
=== Use multiple workers/threads to send data to Elasticsearch

A single thread sending bulk requests is unlikely to be able to max out the
indexing capacity of an Elasticsearch cluster. In order to use all resources
of the cluster, you should send data from multiple threads or processes. In
addition to making better use of the resources of the cluster, this should
help reduce the cost of each fsync.

Make sure to watch for `TOO_MANY_REQUESTS (429)` response codes
(`EsRejectedExecutionException` with the Java client), which is the way that
Elasticsearch tells you that it cannot keep up with the current indexing rate.
When it happens, you should pause indexing a bit before trying again, ideally
with randomized exponential backoff.

Similarly to sizing bulk requests, only testing can tell what the optimal
number of workers is. This can be tested by progressively increasing the
number of workers until either I/O or CPU is saturated on the cluster.

[float]
=== Unset or increase the refresh interval

The operation that consists of making changes visible to search - called a
<<indices-refresh,refresh>> - is costly, and calling it often while there is
ongoing indexing activity can hurt indexing speed.

By default, Elasticsearch runs this operation every second, but only on
indices that have received one search request or more in the last 30 seconds.
This is the optimal configuration if you have no or very little search traffic
(e.g. less than one search request every 5 minutes) and want to optimize for
indexing speed. This behavior aims to automatically optimize bulk indexing in
the default case when no searches are performed. In order to opt out of this
behavior, set the refresh interval explicitly.

On the other hand, if your index experiences regular search requests, this
default behavior means that Elasticsearch will refresh your index every
second. If you can afford to increase the amount of time between when a document
gets indexed and when it becomes visible, increasing the
<<dynamic-index-settings,`index.refresh_interval`>> to a larger value, e.g.
`30s`, might help improve indexing speed.

[float]
=== Disable refresh and replicas for initial loads

If you need to load a large amount of data at once, you should disable refresh
by setting `index.refresh_interval` to `-1` and set `index.number_of_replicas`
to `0`. This will temporarily put your index at risk since the loss of any shard
will cause data loss, but at the same time indexing will be faster since
documents will be indexed only once. Once the initial loading is finished, you
can set `index.refresh_interval` and `index.number_of_replicas` back to their
original values.

[float]
=== Disable swapping

You should make sure that the operating system is not swapping out the Java
process by <<setup-configuration-memory,disabling swapping>>.

[float]
=== Give memory to the filesystem cache

The filesystem cache will be used in order to buffer I/O operations. You should
make sure to give at least half the memory of the machine running Elasticsearch
to the filesystem cache.

[float]
=== Use auto-generated ids

When indexing a document that has an explicit id, Elasticsearch needs to check
whether a document with the same id already exists within the same shard, which
is a costly operation and gets even more costly as the index grows. By using
auto-generated ids, Elasticsearch can skip this check, which makes indexing
faster.

[float]
=== Use faster hardware

If indexing is I/O bound, you should investigate giving more memory to the
filesystem cache (see above) or buying faster drives. In particular, SSDs
are known to perform better than spinning disks. Always use local storage;
remote filesystems such as `NFS` or `SMB` should be avoided. Also beware of
virtualized storage such as Amazon's `Elastic Block Store`. Virtualized
storage works very well with Elasticsearch, and it is appealing since it is so
fast and simple to set up, but it is also unfortunately inherently slower on an
ongoing basis when compared to dedicated local storage. If you put an index on
`EBS`, be sure to use provisioned IOPS, otherwise operations could be quickly
throttled.

Stripe your index across multiple SSDs by configuring a RAID 0 array. Remember
that it will increase the risk of failure since the failure of any one SSD
destroys the index. However, this is typically the right tradeoff to make:
optimize single shards for maximum performance, and then add replicas across
different nodes so there's redundancy for any node failures. You can also use
<<modules-snapshots,snapshot and restore>> to back up the index for further
insurance.

[float]
=== Indexing buffer size

If your node is doing only heavy indexing, be sure
<<indexing-buffer,`indices.memory.index_buffer_size`>> is large enough to give
at most 512 MB indexing buffer per shard doing heavy indexing (beyond that
indexing performance does not typically improve). Elasticsearch takes that
setting (a percentage of the Java heap or an absolute byte-size), and
uses it as a shared buffer across all active shards. Very active shards will
naturally use this buffer more than shards that are performing lightweight
indexing.

The default is `10%` which is often plenty: for example, if you give the JVM
10 GB of memory, it will give 1 GB to the index buffer, which is enough to host
two shards that are heavily indexing.

[float]
=== Additional optimizations

Many of the strategies outlined in <<tune-for-disk-usage>> also
provide an improvement in the speed of indexing.
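
To illustrate the retry advice in <<multiple-workers-threads>>, here is a
minimal Python sketch of randomized exponential backoff on `429` responses.
It is not tied to any Elasticsearch client: `send_bulk` is a hypothetical
callable that sends one bulk request and returns the HTTP status code, and the
`base`, `cap` and `max_retries` defaults are assumptions chosen for
illustration, not recommended values.

```python
import random
import time


def backoff_delays(base=0.5, cap=30.0, max_retries=6, rng=random.random):
    """Yield randomized ("full jitter") exponential backoff delays in seconds."""
    for attempt in range(max_retries):
        # The upper bound doubles on each attempt but never exceeds `cap`.
        upper = min(cap, base * (2 ** attempt))
        # Pick a random delay in [0, upper) so workers do not retry in lockstep.
        yield rng() * upper


def send_with_backoff(send_bulk, payload, sleep=time.sleep, **backoff_kwargs):
    """Call send_bulk(payload) until it stops returning 429 or retries run out.

    `send_bulk` is a hypothetical callable returning an HTTP status code.
    The last status seen is returned, which may still be 429.
    """
    status = send_bulk(payload)
    for delay in backoff_delays(**backoff_kwargs):
        if status != 429:
            break
        sleep(delay)  # pause indexing a bit before trying again
        status = send_bulk(payload)
    return status
```

Each worker would wrap its own bulk calls this way. Randomizing the delay
matters when many workers are rejected at once, since otherwise they would all
retry at the same moment.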
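
The doubling benchmark described under "Use bulk requests" can be sketched as
a small search loop. This is only an outline under stated assumptions:
`measure` is a hypothetical callable that runs one benchmark with the given
number of documents per bulk request and returns the indexing rate in
documents per second, and `max_size` and `min_gain` are illustrative
parameters rather than values from this guide.

```python
def find_bulk_size(measure, start=100, max_size=51200, min_gain=0.05):
    """Double the bulk size until throughput stops improving by `min_gain`.

    `measure(size)` is a hypothetical callable that benchmarks one bulk size
    and returns documents indexed per second.
    """
    best_size, best_rate = start, measure(start)
    size = start * 2
    while size <= max_size:
        rate = measure(size)
        if rate < best_rate * (1 + min_gain):
            # Plateau or tie: keep the smaller size, erring toward fewer
            # documents per request.
            break
        best_size, best_rate = size, rate
        size *= 2
    return best_size
```

A real `measure` would also need to check the request size in bytes, since the
guidance on staying within a few tens of megabytes per request applies
regardless of what the throughput numbers suggest.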