  1. [[elasticsearch-intro]]
  2. == What is {es}?
  3. _**You know, for search (and analysis)**_
  4. {es} is the distributed search and analytics engine at the heart of
  5. the {stack}. {ls} and {beats} facilitate collecting, aggregating, and
  6. enriching your data and storing it in {es}. {kib} enables you to
  7. interactively explore, visualize, and share insights into your data and manage
  8. and monitor the stack. {es} is where the indexing, search, and analysis
  9. magic happens.
  10. {es} provides near real-time search and analytics for all types of data. Whether you
  11. have structured or unstructured text, numerical data, or geospatial data,
  12. {es} can efficiently store and index it in a way that supports fast searches.
  13. You can go far beyond simple data retrieval and aggregate information to discover
  14. trends and patterns in your data. And as your data and query volume grows, the
  15. distributed nature of {es} enables your deployment to grow seamlessly right
  16. along with it.
  17. While not _every_ problem is a search problem, {es} offers speed and flexibility
  18. to handle data in a wide variety of use cases:
  19. * Add a search box to an app or website
  20. * Store and analyze logs, metrics, and security event data
  21. * Use machine learning to automatically model the behavior of your data in real
  22. time
  23. * Automate business workflows using {es} as a storage engine
  24. * Manage, integrate, and analyze spatial information using {es} as a geographic
  25. information system (GIS)
  26. * Store and process genetic data using {es} as a bioinformatics research tool
  27. We’re continually amazed by the novel ways people use search. But whether
  28. your use case is similar to one of these, or you're using {es} to tackle a new
  29. problem, the way you work with your data, documents, and indices in {es} is
  30. the same.
  31. [[documents-indices]]
  32. === Data in: documents and indices
  33. {es} is a distributed document store. Instead of storing information as rows of
  34. columnar data, {es} stores complex data structures that have been serialized
  35. as JSON documents. When you have multiple {es} nodes in a cluster, stored
  36. documents are distributed across the cluster and can be accessed immediately
  37. from any node.
  38. When a document is stored, it is indexed and fully searchable in <<near-real-time,near real-time>>--within 1 second. {es} uses a data structure called an
  39. inverted index that supports very fast full-text searches. An inverted index
  40. lists every unique word that appears in any document and identifies all of the
  41. documents each word occurs in.
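As a rough illustration of the idea, the following sketch builds a toy inverted index in plain Python. It is not how {es} or Lucene implements the structure; the sample documents and the `build_inverted_index` helper are hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each unique lowercase word to the sorted IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick dog",
}
inverted = build_inverted_index(docs)
```

Looking up `inverted["brown"]` returns the IDs of every document that contains the word, without scanning the documents themselves.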
  42. An index can be thought of as an optimized collection of documents and each
  43. document is a collection of fields, which are the key-value pairs that contain
  44. your data. By default, {es} indexes all data in every field and each indexed
  45. field has a dedicated, optimized data structure. For example, text fields are
  46. stored in inverted indices, and numeric and geo fields are stored in BKD trees.
  47. The ability to use the per-field data structures to assemble and return search
  48. results is what makes {es} so fast.
  49. {es} also has the ability to be schema-less, which means that documents can be
  50. indexed without explicitly specifying how to handle each of the different fields
  51. that might occur in a document. When dynamic mapping is enabled, {es}
  52. automatically detects and adds new fields to the index. This default
  53. behavior makes it easy to index and explore your data--just start
  54. indexing documents and {es} will detect and map booleans, floating point and
  55. integer values, dates, and strings to the appropriate {es} data types.
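The detection step can be pictured with a small sketch. This is a simplification for illustration only, not the actual dynamic mapping code: the `guess_field_type` helper is hypothetical, and date detection is approximated here with Python's ISO 8601 parser rather than the date formats {es} is configured to recognize.

```python
from datetime import datetime


def guess_field_type(value):
    """Return a plausible field type name for a JSON value (simplified sketch)."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        try:
            # Stand-in for date detection: accept ISO 8601 strings only.
            datetime.fromisoformat(value)
            return "date"
        except ValueError:
            return "text"
    raise TypeError(f"unsupported value in this sketch: {value!r}")


# Hypothetical document with one field of each kind.
doc = {"active": True, "age": 42, "score": 3.14,
       "hired": "2015-01-01", "bio": "hello world"}
guessed = {field: guess_field_type(value) for field, value in doc.items()}
```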
  56. Ultimately, however, you know more about your data and how you want to use it
  57. than {es} can. You can define rules to control dynamic mapping and explicitly
  58. define mappings to take full control of how fields are stored and indexed.
  59. Defining your own mappings enables you to:
  60. * Distinguish between full-text string fields and exact value string fields
  61. * Perform language-specific text analysis
  62. * Optimize fields for partial matching
  63. * Use custom date formats
  64. * Use data types such as `geo_point` and `geo_shape` that cannot be automatically
  65. detected
  66. It’s often useful to index the same field in different ways for different
  67. purposes. For example, you might want to index a string field as both a text
  68. field for full-text search and as a keyword field for sorting or aggregating
  69. your data. Or, you might choose to use more than one language analyzer to
  70. process the contents of a string field that contains user input.
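A minimal sketch of such an explicit mapping, written as the body of a create index request, might look like the following. The field names `title` and `location` and the sub-field name `raw` are hypothetical; the body is built as a Python dictionary so it can be serialized and sent with any HTTP client.

```python
import json

# Hypothetical mapping: "title" is indexed twice, once as analyzed text for
# full-text search and once as an exact-value keyword for sorting and
# aggregations. "location" uses a type that cannot be detected automatically.
mapping = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},
            },
            "location": {"type": "geo_point"},
        }
    }
}

request_body = json.dumps(mapping, indent=2)
```

With this mapping, a query would target `title` for full-text matching and `title.raw` for sorting or aggregating.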
  71. The analysis chain that is applied to a full-text field during indexing is also
  72. used at search time. When you query a full-text field, the query text undergoes
  73. the same analysis before the terms are looked up in the index.
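To see why this matters, consider a toy analyzer that only lowercases text and splits it on non-alphanumeric characters. The `analyze` and `matches` helpers below are hypothetical and far simpler than a real analysis chain, but they show that a query only finds indexed terms when both sides pass through the same steps.

```python
import re


def analyze(text):
    """Toy analyzer: lowercase the text and split it on non-alphanumeric characters."""
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]


# Terms produced at index time for a hypothetical document.
indexed_terms = set(analyze("The Quick Brown Fox!"))


def matches(query):
    """Analyze the query the same way, then look its terms up in the indexed terms."""
    return all(term in indexed_terms for term in analyze(query))
```

The raw string `"QUICK"` is not among the indexed terms, but `matches("QUICK fox")` succeeds because the query text is analyzed before the lookup.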
  74. [[search-analyze]]
  75. === Information out: search and analyze
  76. While you can use {es} as a document store and retrieve documents and their
  77. metadata, the real power comes from being able to easily access the full suite
  78. of search capabilities built on the Apache Lucene search engine library.
  79. {es} provides a simple, coherent REST API for managing your cluster and indexing
  80. and searching your data. For testing purposes, you can easily submit requests
  81. directly from the command line or through the Developer Console in {kib}. From
  82. your applications, you can use the
  83. https://www.elastic.co/guide/en/elasticsearch/client/index.html[{es} client]
  84. for your language of choice: Java, JavaScript, Go, .NET, PHP, Perl, Python,
  85. or Ruby.
  86. [discrete]
  87. [[search-data]]
  88. ==== Searching your data
  89. The {es} REST APIs support structured queries, full-text queries, and complex
  90. queries that combine the two. Structured queries are
  91. similar to the types of queries you can construct in SQL. For example, you
  92. could search the `gender` and `age` fields in your `employee` index and sort the
  93. matches by the `hire_date` field. Full-text queries find all documents that
  94. match the query string and return them sorted by _relevance_&mdash;how good a
  95. match they are for your search terms.
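The structured example above could be expressed as a search request body along these lines. This is a sketch under assumptions: the filter values are made up, and it assumes `gender` is mapped as an exact-value field and `age` as a numeric field in the `employee` index.

```python
import json

# Body for a search against the `employee` index, for example sent to
# the `/employee/_search` endpoint. The values "female", 30, and 40 are
# hypothetical.
search_body = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"gender": "female"}},
                {"range": {"age": {"gte": 30, "lte": 40}}},
            ]
        }
    },
    "sort": [{"hire_date": {"order": "desc"}}],
}

payload = json.dumps(search_body)
```

Because both clauses sit in a `filter` context, they select documents without contributing to a relevance score, which mirrors how a SQL `WHERE` clause behaves.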
  96. In addition to searching for individual terms, you can perform phrase searches,
  97. similarity searches, and prefix searches, and get autocomplete suggestions.
  98. Have geospatial or other numerical data that you want to search? {es} indexes
  99. non-textual data in optimized data structures that support
  100. high-performance geo and numerical queries.
  101. You can access all of these search capabilities using {es}'s
  102. comprehensive JSON-style query language (<<query-dsl, Query DSL>>). You can also
  103. construct <<sql-overview, SQL-style queries>> to search and aggregate data
  104. natively inside {es}, and JDBC and ODBC drivers enable a broad range of
  105. third-party applications to interact with {es} via SQL.
  106. [discrete]
  107. [[analyze-data]]
  108. ==== Analyzing your data
  109. {es} aggregations enable you to build complex summaries of your data and gain
  110. insight into key metrics, patterns, and trends. Instead of just finding the
  111. proverbial “needle in a haystack”, aggregations enable you to answer questions
  112. like:
  113. * How many needles are in the haystack?
  114. * What is the average length of the needles?
  115. * What is the median length of the needles, broken down by manufacturer?
  116. * How many needles were added to the haystack in each of the last six months?
  117. You can also use aggregations to answer more subtle questions, such as:
  118. * What are your most popular needle manufacturers?
  119. * Are there any unusual or anomalous clumps of needles?
  120. Because aggregations leverage the same data structures used for search, they are
  121. also very fast. This enables you to analyze and visualize your data in real time.
  122. Your reports and dashboards update as your data changes so you can take action
  123. based on the latest information.
  124. What’s more, aggregations operate alongside search requests. You can search
  125. documents, filter results, and perform analytics at the same time, on the same
  126. data, in a single request. And because aggregations are calculated in the
  127. context of a particular search, you’re not just displaying a count of all
  128. size 70 needles, you’re displaying a count of the size 70 needles
  129. that match your users' search criteria--for example, all size 70 _non-stick
  130. embroidery_ needles.
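A single request that searches, filters, and aggregates might be sketched as follows. The field names `description`, `size`, `manufacturer`, and `length` are hypothetical, and the sketch assumes `manufacturer` is mapped as an exact-value field so it can be used in a `terms` aggregation.

```python
import json

# One request body: a full-text match, a filter on size, and a per-manufacturer
# breakdown of the median needle length, computed over the matching documents only.
request = {
    "query": {
        "bool": {
            "must": [{"match": {"description": "non-stick embroidery"}}],
            "filter": [{"term": {"size": 70}}],
        }
    },
    "aggs": {
        "by_manufacturer": {
            "terms": {"field": "manufacturer"},
            "aggs": {
                "median_length": {
                    "percentiles": {"field": "length", "percents": [50]}
                }
            },
        }
    },
}

payload = json.dumps(request)
```

The response would carry both the matching documents and one bucket per manufacturer, each with its document count and median length.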
  131. [discrete]
  132. [[more-features]]
  133. ===== But wait, there’s more
  134. Want to automate the analysis of your time series data? You can use
  135. {ml-docs}/ml-ad-overview.html[machine learning] features to create accurate
  136. baselines of normal behavior in your data and identify anomalous patterns. With
  137. machine learning, you can detect:
  138. * Anomalies related to temporal deviations in values, counts, or frequencies
  139. * Statistical rarity
  140. * Unusual behaviors for a member of a population
  141. And the best part? You can do this without having to specify algorithms, models,
  142. or other data science-related configurations.
  143. [[scalability]]
  144. === Scalability and resilience: clusters, nodes, and shards
  145. ++++
  146. <titleabbrev>Scalability and resilience</titleabbrev>
  147. ++++
  148. {es} is built to be always available and to scale with your needs. It does this
  149. by being distributed by nature. You can add servers (nodes) to a cluster to
  150. increase capacity and {es} automatically distributes your data and query load
  151. across all of the available nodes. No need to overhaul your application: {es}
  152. knows how to balance multi-node clusters to provide scale and high availability.
  153. The more nodes, the merrier.
  154. How does this work? Under the covers, an {es} index is really just a logical
  155. grouping of one or more physical shards, where each shard is actually a
  156. self-contained index. By distributing the documents in an index across multiple
  157. shards, and distributing those shards across multiple nodes, {es} can ensure
  158. redundancy, which both protects against hardware failures and increases
  159. query capacity as nodes are added to a cluster. As the cluster grows (or shrinks),
  160. {es} automatically migrates shards to rebalance the cluster.
  161. There are two types of shards: primaries and replicas. Each document in an index
  162. belongs to one primary shard. A replica shard is a copy of a primary shard.
  163. Replicas provide redundant copies of your data to protect against hardware
  164. failure and increase capacity to serve read requests
  165. like searching or retrieving a document.
  166. The number of primary shards in an index is fixed at the time that an index is
  167. created, but the number of replica shards can be changed at any time, without
  168. interrupting indexing or query operations.
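One way to see why the primary shard count is fixed is to sketch how a document could be assigned to a shard by hashing a routing value and taking it modulo the number of primaries. The sketch below is a simplification: `pick_primary_shard` is a hypothetical helper, and `zlib.crc32` stands in for the hash function {es} actually uses.

```python
import zlib


def pick_primary_shard(routing_value, num_primary_shards):
    """Pick a shard by hashing the routing value modulo the primary shard count."""
    # zlib.crc32 is a stand-in hash chosen for this sketch only.
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


# Hypothetical document IDs used as routing values.
doc_ids = [f"doc-{i}" for i in range(100)]

placement_5 = {d: pick_primary_shard(d, 5) for d in doc_ids}
placement_6 = {d: pick_primary_shard(d, 6) for d in doc_ids}

# Documents that would be looked up on a different shard if the count changed.
moved = [d for d in doc_ids if placement_5[d] != placement_6[d]]
```

Changing the divisor sends most documents to a different shard, so lookups for existing documents would miss. Replicas are copies of whole shards and do not take part in this calculation, which is why their number can change freely.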
  169. [discrete]
  170. [[it-depends]]
  171. ==== It depends...
  172. There are a number of performance considerations and trade-offs with respect
  173. to shard size and the number of primary shards configured for an index. The more
  174. shards, the more overhead there is simply in maintaining those indices. The
  175. larger the shard size, the longer it takes to move shards around when {es}
  176. needs to rebalance a cluster.
  177. Querying lots of small shards makes the processing per shard faster, but more
  178. queries mean more overhead, so querying a smaller
  179. number of larger shards might be faster. In short...it depends.
  180. As a starting point:
  181. * Aim to keep the average shard size between a few GB and a few tens of GB. For
  182. use cases with time-based data, it is common to see shards in the 20 GB to 40 GB
  183. range.
  184. * Avoid the gazillion shards problem. The number of shards a node can hold is
  185. proportional to the available heap space. As a general rule, the number of
  186. shards per GB of heap space should be less than 20.
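These rules of thumb translate into simple arithmetic. The helpers below are hypothetical and only restate the guidance above; they are a starting point for estimates, not a substitute for testing.

```python
import math

# Rule of thumb from the text: fewer than 20 shards per GB of heap.
MAX_SHARDS_PER_GB_HEAP = 20


def max_shards_for_heap(heap_gb):
    """Upper bound on the number of shards a node should hold for a given heap size."""
    return heap_gb * MAX_SHARDS_PER_GB_HEAP


def primary_shards_needed(data_gb, target_shard_gb):
    """Primary shards required to keep each shard at or below the target size."""
    return math.ceil(data_gb / target_shard_gb)
```

For example, a node with 30 GB of heap should stay below `max_shards_for_heap(30)`, which is 600 shards, and 1000 GB of data at a 40 GB target shard size calls for `primary_shards_needed(1000, 40)`, which is 25 primary shards.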
  187. The best way to determine the optimal configuration for your use case is
  188. through https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing[
  189. testing with your own data and queries].
  190. [discrete]
  191. [[disaster-ccr]]
  192. ==== In case of disaster
  193. For performance reasons, the nodes within a cluster need to be on the same
  194. network. Balancing shards in a cluster across nodes in different data centers
  195. simply takes too long. But high-availability architectures demand that you avoid
  196. putting all of your eggs in one basket. In the event of a major outage in one
  197. location, servers in another location need to be able to take over. Seamlessly.
  198. The answer? {ccr-cap} (CCR).
  199. CCR provides a way to automatically synchronize indices from your primary cluster
  200. to a secondary remote cluster that can serve as a hot backup. If the primary
  201. cluster fails, the secondary cluster can take over. You can also use CCR to
  202. create secondary clusters to serve read requests in geo-proximity to your users.
  203. {ccr-cap} is active-passive. The index on the primary cluster is
  204. the active leader index and handles all write requests. Indices replicated to
  205. secondary clusters are read-only followers.
  206. [discrete]
  207. [[admin]]
  208. ==== Care and feeding
  209. As with any enterprise system, you need tools to secure, manage, and
  210. monitor your {es} clusters. Security, monitoring, and administrative features
  211. that are integrated into {es} enable you to use {kibana-ref}/introduction.html[{kib}]
  212. as a control center for managing a cluster. Features like <<xpack-rollup,
  213. data rollups>> and <<index-lifecycle-management, index lifecycle management>>
  214. help you intelligently manage your data over time.