|
[[elasticsearch-intro]]
== What is {es}?
_**You know, for search (and analysis)**_

{es} is the distributed search and analytics engine at the heart of
the {stack}. {ls} and {beats} facilitate collecting, aggregating, and
enriching your data and storing it in {es}. {kib} enables you to
interactively explore, visualize, and share insights into your data and manage
and monitor the stack. {es} is where the indexing, search, and analysis
magic happens.

{es} provides near real-time search and analytics for all types of data. Whether you
have structured or unstructured text, numerical data, or geospatial data,
{es} can efficiently store and index it in a way that supports fast searches.
You can go far beyond simple data retrieval and aggregate information to discover
trends and patterns in your data.
And as your data and query volume grow, the
distributed nature of {es} enables your deployment to grow seamlessly right
along with it.

While not _every_ problem is a search problem, {es} offers speed and flexibility
to handle data in a wide variety of use cases:

* Add a search box to an app or website
* Store and analyze logs, metrics, and security event data
* Use machine learning to automatically model the behavior of your data in real
  time
* Automate business workflows using {es} as a storage engine
* Manage, integrate, and analyze spatial information using {es} as a geographic
  information system (GIS)
* Store and process genetic data using {es} as a bioinformatics research tool

We’re continually amazed by the novel ways people use search. But whether
your use case is similar to one of these, or you're using {es} to tackle a new
problem, the way you work with your data, documents, and indices in {es} is
the same.

[[documents-indices]]
=== Data in: documents and indices

{es} is a distributed document store. Instead of storing information as rows of
columnar data, {es} stores complex data structures that have been serialized
as JSON documents. When you have multiple {es} nodes in a cluster, stored
documents are distributed across the cluster and can be accessed immediately
from any node.

When a document is stored, it is indexed and fully searchable in <<near-real-time,near real-time>>--within 1 second. {es} uses a data structure called an
inverted index that supports very fast full-text searches. An inverted index
lists every unique word that appears in any document and identifies all of the
documents each word occurs in.

An index can be thought of as an optimized collection of documents and each
document is a collection of fields, which are the key-value pairs that contain
your data. By default, {es} indexes all data in every field and each indexed
field has a dedicated, optimized data structure.
For example, text fields are
stored in inverted indices, and numeric and geo fields are stored in BKD trees.
The ability to use the per-field data structures to assemble and return search
results is what makes {es} so fast.

{es} also has the ability to be schema-less, which means that documents can be
indexed without explicitly specifying how to handle each of the different fields
that might occur in a document. When dynamic mapping is enabled, {es}
automatically detects and adds new fields to the index. This default
behavior makes it easy to index and explore your data--just start
indexing documents and {es} will detect and map booleans, floating point and
integer values, dates, and strings to the appropriate {es} data types.

Ultimately, however, you know more about your data and how you want to use it
than {es} can. You can define rules to control dynamic mapping and explicitly
define mappings to take full control of how fields are stored and indexed.

Defining your own mappings enables you to:

* Distinguish between full-text string fields and exact value string fields
* Perform language-specific text analysis
* Optimize fields for partial matching
* Use custom date formats
* Use data types such as `geo_point` and `geo_shape` that cannot be automatically
detected

It’s often useful to index the same field in different ways for different
purposes. For example, you might want to index a string field as both a text
field for full-text search and as a keyword field for sorting or aggregating
your data. Or, you might choose to use more than one language analyzer to
process the contents of a string field that contains user input.

The analysis chain that is applied to a full-text field during indexing is also
used at search time.
When you query a full-text field, the query text undergoes
the same analysis before the terms are looked up in the index.

[[search-analyze]]
=== Information out: search and analyze

While you can use {es} as a document store and retrieve documents and their
metadata, the real power comes from being able to easily access the full suite
of search capabilities built on the Apache Lucene search engine library.

{es} provides a simple, coherent REST API for managing your cluster and indexing
and searching your data. For testing purposes, you can easily submit requests
directly from the command line or through the Developer Console in {kib}. From
your applications, you can use the
https://www.elastic.co/guide/en/elasticsearch/client/index.html[{es} client]
for your language of choice: Java, JavaScript, Go, .NET, PHP, Perl, Python
or Ruby.

[discrete]
[[search-data]]
==== Searching your data

The {es} REST APIs support structured queries, full text queries, and complex
queries that combine the two. Structured queries are
similar to the types of queries you can construct in SQL. For example, you
could search the `gender` and `age` fields in your `employee` index and sort the
matches by the `hire_date` field. Full-text queries find all documents that
match the query string and return them sorted by _relevance_--how good a
match they are for your search terms.

In addition to searching for individual terms, you can perform phrase searches,
similarity searches, and prefix searches, and get autocomplete suggestions.

Have geospatial or other numerical data that you want to search? {es} indexes
non-textual data in optimized data structures that support
high-performance geo and numerical queries.

You can access all of these search capabilities using {es}'s
comprehensive JSON-style query language (<<query-dsl, Query DSL>>).
You can also
construct <<sql-overview, SQL-style queries>> to search and aggregate data
natively inside {es}, and JDBC and ODBC drivers enable a broad range of
third-party applications to interact with {es} via SQL.

[discrete]
[[analyze-data]]
==== Analyzing your data

{es} aggregations enable you to build complex summaries of your data and gain
insight into key metrics, patterns, and trends. Instead of just finding the
proverbial “needle in a haystack”, aggregations enable you to answer questions
like:

* How many needles are in the haystack?
* What is the average length of the needles?
* What is the median length of the needles, broken down by manufacturer?
* How many needles were added to the haystack in each of the last six months?

You can also use aggregations to answer more subtle questions, such as:

* What are your most popular needle manufacturers?
* Are there any unusual or anomalous clumps of needles?

Because aggregations leverage the same data structures used for search, they are
also very fast. This enables you to analyze and visualize your data in real time.
Your reports and dashboards update as your data changes so you can take action
based on the latest information.

What’s more, aggregations operate alongside search requests. You can search
documents, filter results, and perform analytics at the same time, on the same
data, in a single request. And because aggregations are calculated in the
context of a particular search, you’re not just displaying a count of all
size 70 needles, you’re displaying a count of the size 70 needles
that match your users' search criteria--for example, all size 70 _non-stick
embroidery_ needles.

[discrete]
[[more-features]]
===== But wait, there’s more

Want to automate the analysis of your time series data? You can use
{ml-docs}/ml-overview.html[machine learning] features to create accurate
baselines of normal behavior in your data and identify anomalous patterns.
With
machine learning, you can detect:

* Anomalies related to temporal deviations in values, counts, or frequencies
* Statistical rarity
* Unusual behaviors for a member of a population

And the best part? You can do this without having to specify algorithms, models,
or other data science-related configurations.

[[scalability]]
=== Scalability and resilience: clusters, nodes, and shards
++++
<titleabbrev>Scalability and resilience</titleabbrev>
++++

{es} is built to be always available and to scale with your needs. It does this
by being distributed by nature. You can add servers (nodes) to a cluster to
increase capacity and {es} automatically distributes your data and query load
across all of the available nodes. No need to overhaul your application: {es}
knows how to balance multi-node clusters to provide scale and high availability.
The more nodes, the merrier.

How does this work? Under the covers, an {es} index is really just a logical
grouping of one or more physical shards, where each shard is actually a
self-contained index. By distributing the documents in an index across multiple
shards, and distributing those shards across multiple nodes, {es} can ensure
redundancy, which both protects against hardware failures and increases
query capacity as nodes are added to a cluster. As the cluster grows (or shrinks),
{es} automatically migrates shards to rebalance the cluster.

There are two types of shards: primaries and replicas. Each document in an index
belongs to one primary shard.
A replica shard is a copy of a primary shard.
Replicas provide redundant copies of your data to protect against hardware
failure and increase capacity to serve read requests
like searching or retrieving a document.

The number of primary shards in an index is fixed at the time that an index is
created, but the number of replica shards can be changed at any time, without
interrupting indexing or query operations.

[discrete]
[[it-depends]]
==== It depends...

There are a number of performance considerations and trade-offs with respect
to shard size and the number of primary shards configured for an index. The more
shards, the more overhead there is simply in maintaining those indices. The
larger the shard size, the longer it takes to move shards around when {es}
needs to rebalance a cluster.

Querying lots of small shards makes the processing per shard faster, but more
queries means more overhead, so querying a smaller
number of larger shards might be faster. In short...it depends.

As a starting point:

* Aim to keep the average shard size between a few GB and a few tens of GB. For
  use cases with time-based data, it is common to see shards in the 20GB to 40GB
  range.

* Avoid the gazillion shards problem. The number of shards a node can hold is
  proportional to the available heap space. As a general rule, the number of
  shards per GB of heap space should be less than 20.

The best way to determine the optimal configuration for your use case is
through https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing[testing with your own data and queries].

[discrete]
[[disaster-ccr]]
==== In case of disaster

For performance reasons, the nodes within a cluster need to be on the same
network. Balancing shards in a cluster across nodes in different data centers
simply takes too long. But high-availability architectures demand that you avoid
putting all of your eggs in one basket. In the event of a major outage in one
location, servers in another location need to be able to take over.
Seamlessly.
The answer? {ccr-cap} (CCR).

CCR provides a way to automatically synchronize indices from your primary cluster
to a secondary remote cluster that can serve as a hot backup. If the primary
cluster fails, the secondary cluster can take over. You can also use CCR to
create secondary clusters to serve read requests in geo-proximity to your users.

{ccr-cap} is active-passive. The index on the primary cluster is
the active leader index and handles all write requests. Indices replicated to
secondary clusters are read-only followers.

[discrete]
[[admin]]
==== Care and feeding

As with any enterprise system, you need tools to secure, manage, and
monitor your {es} clusters. Security, monitoring, and administrative features
that are integrated into {es} enable you to use {kibana-ref}/introduction.html[{kib}]
as a control center for managing a cluster. Features like <<rollup-overview,
data rollups>> and <<index-lifecycle-management, index lifecycle management>>
help you intelligently manage your data over time.
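
When you plan or review a cluster, it can help to turn the starting-point guidance from <<it-depends,It depends...>> into a quick back-of-the-envelope check. The following Python sketch is only an illustration, not an {es} API. The function names are hypothetical, the 30 GB default is an assumed midpoint of the 20GB to 40GB range mentioned for time-based data, and the heap check treats "fewer than 20 shards per GB of heap" as a strict per-node limit without modeling how shards are actually allocated.

```python
import math


def primary_shard_count(index_size_gb: float, target_shard_gb: float = 30.0) -> int:
    """Estimate how many primary shards keep each shard near the target size.

    The 30 GB default is an assumption, not a recommendation from the text.
    """
    if index_size_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    # Round up so that no shard exceeds the target; always keep at least one.
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def total_shard_count(primaries: int, replicas_per_primary: int = 1) -> int:
    """Count the primaries plus every replica copy of each primary."""
    return primaries * (1 + replicas_per_primary)


def within_heap_guideline(shards_on_node: int, heap_gb: float,
                          shards_per_heap_gb: int = 20) -> bool:
    """Check the 'fewer than 20 shards per GB of heap' rule of thumb for one node."""
    return shards_on_node < heap_gb * shards_per_heap_gb


if __name__ == "__main__":
    # Hypothetical numbers: 600 GB of data, one replica per primary,
    # and all shards placed on a single node with 4 GB of heap.
    primaries = primary_shard_count(600)
    total = total_shard_count(primaries, replicas_per_primary=1)
    print(primaries, total, within_heap_guideline(total, heap_gb=4))
```

Treat the result as a first estimate only. Because the number of primary shards is fixed when an index is created, it is worth validating the estimate by testing with your own data and queries.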
 |