|
@@ -0,0 +1,281 @@
|
|
|
+# Distributed Area Team Internals
|
|
|
+
|
|
|
+(Summary, brief discussion of our features)
|
|
|
+
|
|
|
+# Networking
|
|
|
+
|
|
|
+### ThreadPool
|
|
|
+
|
|
|
+(We have many thread pools, what and why)
|
|
|
+
|
|
|
+### ActionListener
|
|
|
+
|
|
|
+`ActionListener`s are a means off injecting logic into lower layers of the code. They encapsulate a block of code that takes a response
|
|
|
+value -- the `onResponse()` method --, and then that block of code (the `ActionListener`) is passed into a function that will eventually
|
|
|
+execute the code (call `onResponse()`) when a response value is available. `ActionListener`s are used to pass code down to act on a result,
|
|
|
+rather than lower layers returning a result back up to be acted upon by the caller. One of three things can happen to a listener: it can be
|
|
|
+executed in the same thread — e.g. `ActionListener.run()` --; it can be passed off to another thread to be executed; or it can be added to
|
|
|
+a list someplace, to eventually be executed by some service. `ActionListener`s also define `onFailure()` logic, in case an error is
|
|
|
+encountered before a result can be formed.
|
|
|
+
|
|
|
+This pattern is often used in the transport action layer with the use of the
|
|
|
+[ChannelActionListener]([url](https://github.com/elastic/elasticsearch/blob/8.12/server/src/main/java/org/elasticsearch/action/support/ChannelActionListener.java))
|
|
|
+class, which wraps a `TransportChannel` produced by the transport layer. `TransportChannel` implementations can hold a reference to a Netty
|
|
|
+channel with which to pass the response back to the network caller. Netty has a many-to-one association of network callers to channels, so
|
|
|
+a call taking a long time generally won't hog resources: it's cheap. A transport action can take hours to respond and that's alright,
|
|
|
+barring caller timeouts.
|
|
|
+
|
|
|
+(TODO: add useful starter references and explanations for a range of Listener classes. Reference the Netty section.)
|
|
|
+
|
|
|
+### REST Layer
|
|
|
+
|
|
|
+(including how REST and Transport layers are bound together through the ActionModule)
|
|
|
+
|
|
|
+### Transport Layer
|
|
|
+
|
|
|
+### Chunk Encoding
|
|
|
+
|
|
|
+#### XContent
|
|
|
+
|
|
|
+### Performance
|
|
|
+
|
|
|
+### Netty
|
|
|
+
|
|
|
+(long running actions should be forked off of the Netty thread. Keep short operations to avoid forking costs)
|
|
|
+
|
|
|
+### Work Queues
|
|
|
+
|
|
|
+# Cluster Coordination
|
|
|
+
|
|
|
+(Sketch of important classes? Might inform more sections to add for details.)
|
|
|
+
|
|
|
+(A NodeB can coordinate a search across several other nodes, when NodeB itself does not have the data, and then return a result to the caller. Explain this coordinating role)
|
|
|
+
|
|
|
+### Node Roles
|
|
|
+
|
|
|
+### Master Nodes
|
|
|
+
|
|
|
+### Master Elections
|
|
|
+
|
|
|
+(Quorum, terms, any eligibility limitations)
|
|
|
+
|
|
|
+### Cluster Formation / Membership
|
|
|
+
|
|
|
+(Explain joining, and how it happens every time a new master is elected)
|
|
|
+
|
|
|
+#### Discovery
|
|
|
+
|
|
|
+### Master Transport Actions
|
|
|
+
|
|
|
+### Cluster State
|
|
|
+
|
|
|
+#### Master Service
|
|
|
+
|
|
|
+#### Cluster State Publication
|
|
|
+
|
|
|
+(Majority concensus to apply, what happens if a master-eligible node falls behind / is incommunicado.)
|
|
|
+
|
|
|
+#### Cluster State Application
|
|
|
+
|
|
|
+(Go over the two kinds of listeners -- ClusterStateApplier and ClusterStateListener?)
|
|
|
+
|
|
|
+#### Persistence
|
|
|
+
|
|
|
+(Sketch ephemeral vs persisted cluster state.)
|
|
|
+
|
|
|
+(what's the format for persisted metadata)
|
|
|
+
|
|
|
+# Replication
|
|
|
+
|
|
|
+(More Topics: ReplicationTracker concepts / highlights.)
|
|
|
+
|
|
|
+### What is a Shard
|
|
|
+
|
|
|
+### Primary Shard Selection
|
|
|
+
|
|
|
+(How a primary shard is chosen)
|
|
|
+
|
|
|
+#### Versioning
|
|
|
+
|
|
|
+(terms and such)
|
|
|
+
|
|
|
+### How Data Replicates
|
|
|
+
|
|
|
+(How an index write replicates across shards -- TransportReplicationAction?)
|
|
|
+
|
|
|
+### Consistency Guarantees
|
|
|
+
|
|
|
+(What guarantees do we give the user about persistence and readability?)
|
|
|
+
|
|
|
+# Locking
|
|
|
+
|
|
|
+(rarely use locks)
|
|
|
+
|
|
|
+### ShardLock
|
|
|
+
|
|
|
+### Translog / Engine Locking
|
|
|
+
|
|
|
+### Lucene Locking
|
|
|
+
|
|
|
+# Engine
|
|
|
+
|
|
|
+(What does Engine mean in the distrib layer? Distinguish Engine vs Directory vs Lucene)
|
|
|
+
|
|
|
+(High level explanation of how translog ties in with Lucene)
|
|
|
+
|
|
|
+(contrast Lucene vs ES flush / refresh / fsync)
|
|
|
+
|
|
|
+### Refresh for Read
|
|
|
+
|
|
|
+(internal vs external reader manager refreshes? flush vs refresh)
|
|
|
+
|
|
|
+### Reference Counting
|
|
|
+
|
|
|
+### Store
|
|
|
+
|
|
|
+(Data lives beyond a high level IndexShard instance. Continue to exist until all references to the Store go away, then Lucene data is removed)
|
|
|
+
|
|
|
+### Translog
|
|
|
+
|
|
|
+(Explain checkpointing and generations, when happens on Lucene flush / fsync)
|
|
|
+
|
|
|
+(Concurrency control for flushing)
|
|
|
+
|
|
|
+(VersionMap)
|
|
|
+
|
|
|
+#### Translog Truncation
|
|
|
+
|
|
|
+#### Direct Translog Read
|
|
|
+
|
|
|
+### Index Version
|
|
|
+
|
|
|
+### Lucene
|
|
|
+
|
|
|
+(copy a sketch of the files Lucene can have here and explain)
|
|
|
+
|
|
|
+(Explain about SearchIndexInput -- IndexWriter, IndexReader -- and the shared blob cache)
|
|
|
+
|
|
|
+(Lucene uses Directory, ES extends/overrides the Directory class to implement different forms of file storage.
|
|
|
+Lucene contains a map of where all the data is located in files and offsites, and fetches it from various files.
|
|
|
+ES doesn't just treat Lucene as a storage engine at the bottom (the end) of the stack. Rather ES has other information that
|
|
|
+works in parallel with the storage engine.)
|
|
|
+
|
|
|
+#### Segment Merges
|
|
|
+
|
|
|
+# Recovery
|
|
|
+
|
|
|
+(All shards go through a 'recovery' process. Describe high level. createShard goes through this code.)
|
|
|
+
|
|
|
+(How is the translog involved in recovery?)
|
|
|
+
|
|
|
+### Create a Shard
|
|
|
+
|
|
|
+### Local Recovery
|
|
|
+
|
|
|
+### Peer Recovery
|
|
|
+
|
|
|
+### Snapshot Recovery
|
|
|
+
|
|
|
+### Recovery Across Server Restart
|
|
|
+
|
|
|
+(partial shard recoveries survive server restart? `reestablishRecovery`? How does that work.)
|
|
|
+
|
|
|
+### How a Recovery Method is Chosen
|
|
|
+
|
|
|
+# Data Tiers
|
|
|
+
|
|
|
+(Frozen, warm, hot, etc.)
|
|
|
+
|
|
|
+# Allocation
|
|
|
+
|
|
|
+(AllocationService runs on the master node)
|
|
|
+
|
|
|
+(Discuss different deciders that limit allocation. Sketch / list the different deciders that we have.)
|
|
|
+
|
|
|
+### APIs for Balancing Operations
|
|
|
+
|
|
|
+(Significant internal APIs for balancing a cluster)
|
|
|
+
|
|
|
+### Heuristics for Allocation
|
|
|
+
|
|
|
+### Cluster Reroute Command
|
|
|
+
|
|
|
+(How does this command behave with the desired auto balancer.)
|
|
|
+
|
|
|
+# Autoscaling
|
|
|
+
|
|
|
+(Reactive and proactive autoscaling. Explain that we surface recommendations, how control plane uses it.)
|
|
|
+
|
|
|
+(Sketch / list the different deciders that we have, and then also how we use information from each to make a recommendation.)
|
|
|
+
|
|
|
+# Snapshot / Restore
|
|
|
+
|
|
|
+(We've got some good package level documentation that should be linked here in the intro)
|
|
|
+
|
|
|
+(copy a sketch of the file system here, with explanation -- good reference)
|
|
|
+
|
|
|
+### Snapshot Repository
|
|
|
+
|
|
|
+### Creation of a Snapshot
|
|
|
+
|
|
|
+(Include an overview of the coordination between data and master nodes, which writes what and when)
|
|
|
+
|
|
|
+(Concurrency control: generation numbers, pending generation number, etc.)
|
|
|
+
|
|
|
+(partial snapshots)
|
|
|
+
|
|
|
+### Deletion of a Snapshot
|
|
|
+
|
|
|
+### Restoring a Snapshot
|
|
|
+
|
|
|
+### Detecting Multiple Writers to a Single Repository
|
|
|
+
|
|
|
+# Task Management / Tracking
|
|
|
+
|
|
|
+(How we identify operations/tasks in the system and report upon them. How we group operations via parent task ID.)
|
|
|
+
|
|
|
+### What Tasks Are Tracked
|
|
|
+
|
|
|
+### Tracking A Task Across Threads
|
|
|
+
|
|
|
+### Tracking A Task Across Nodes
|
|
|
+
|
|
|
+### Kill / Cancel A Task
|
|
|
+
|
|
|
+### Persistent Tasks
|
|
|
+
|
|
|
+# Cross Cluster Replication (CCR)
|
|
|
+
|
|
|
+(Brief explanation of the use case for CCR)
|
|
|
+
|
|
|
+(Explain how this works at a high level, and details of any significant components / ideas.)
|
|
|
+
|
|
|
+### Cross Cluster Search
|
|
|
+
|
|
|
+# Indexing / CRUD
|
|
|
+
|
|
|
+(Explain that the Distributed team is responsible for the write path, while the Search team owns the read path.)
|
|
|
+
|
|
|
+(Generating document IDs. Same across shard replicas, \_id field)
|
|
|
+
|
|
|
+(Sequence number: different than ID)
|
|
|
+
|
|
|
+### Reindex
|
|
|
+
|
|
|
+### Locking
|
|
|
+
|
|
|
+(what limits write concurrency, and how do we minimize)
|
|
|
+
|
|
|
+### Soft Deletes
|
|
|
+
|
|
|
+### Refresh
|
|
|
+
|
|
|
+(explain visibility of writes, and reference the Lucene section for more details (whatever makes more sense explained there))
|
|
|
+
|
|
|
+# Server Startup
|
|
|
+
|
|
|
+# Server Shutdown
|
|
|
+
|
|
|
+### Closing a Shard
|
|
|
+
|
|
|
+(this can also happen during shard reallocation, right? This might be a standalone topic, or need another section about it in allocation?...)
|