Distributed commit protocols
Last updated
Ensure that an operation is executed by all the members in a group, or by none.
Reliable multicast: one message is delivered to all members.
Distributed transaction: each local transaction must be successful.
The client that initiates the computation acts as a coordinator, and processes that need to take part in the commit are participants.
Phase 1a: The coordinator sends a VOTE-REQUEST to the participants (also called the pre-write).
Phase 1b: When a participant receives the VOTE-REQUEST, it responds with a VOTE-COMMIT or VOTE-ABORT. If it votes abort, it also aborts its local computation.
Phase 2a: The coordinator collects all the votes; if all are commits, it sends a GLOBAL-COMMIT, otherwise it sends a GLOBAL-ABORT.
Phase 2b: Each participant waits for a GLOBAL-COMMIT or GLOBAL-ABORT and processes accordingly.
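The two decision rules above can be sketched as follows; this is a minimal illustration of the voting logic, and the function and message names are mine, not from any real library.

```python
def participant_vote(can_commit):
    """Phase 1b: reply to the VOTE-REQUEST; a negative vote also aborts locally."""
    return "VOTE-COMMIT" if can_commit else "VOTE-ABORT"

def coordinator_decide(votes):
    """Phase 2a: commit only if every collected vote is a commit."""
    if all(v == "VOTE-COMMIT" for v in votes):
        return "GLOBAL-COMMIT"
    return "GLOBAL-ABORT"

votes = [participant_vote(True), participant_vote(True), participant_vote(False)]
decision = coordinator_decide(votes)  # a single VOTE-ABORT forces GLOBAL-ABORT
```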
Failure handling: a participant fails while in state S and recovers back to state S; what it must then do depends on S.
INIT: There is no problem; the participant did not yet know the protocol had started, so it can safely abort locally.
READY: The participant was waiting for a commit or an abort. After recovering, it needs to know which state transition to make -> the coordinator's decision must be kept (logged) so it can be retrieved.
ABORT: Simply make the abort transition idempotent, meaning it is safe to re-remove the results from the workspace.
COMMIT: The commit transition can also be made idempotent, meaning it is safe to re-copy the workspace to permanent storage.
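A per-state recovery handler is then a simple dispatch; this is a hypothetical sketch, with the state names taken from the notes and the action strings purely illustrative.

```python
def recover(state, logged_decision=None):
    """What a participant does when it recovers back into `state`."""
    if state == "INIT":
        return "local-abort"                 # protocol never started here
    if state == "READY":
        return logged_decision               # must be retrieved from the coordinator's log
    if state == "ABORT":
        return "discard-workspace"           # idempotent: safe to redo
    if state == "COMMIT":
        return "copy-workspace-to-storage"   # idempotent: safe to redo
    raise ValueError(f"unknown state: {state}")
```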
When a distributed commit is necessary, having participants keep their results in a temporary workspace allows recovery in case of failures.
The main problem is that the coordinator's final decision may not be available.
For a participant P in the READY state, define a maximum time t to wait for a decision from the coordinator. On timeout, P should discover what the other participants know.
A participant can't always decide locally; it may depend on other processes.
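What a timed-out READY participant can conclude from one peer's state can be sketched as a small decision table (cooperative termination); the function name is illustrative.

```python
def resolve_from_peer(peer_state):
    """Decision a blocked READY participant can draw from one peer's state."""
    if peer_state == "COMMIT":
        return "GLOBAL-COMMIT"   # the coordinator must have decided commit
    if peer_state == "ABORT":
        return "GLOBAL-ABORT"    # the coordinator must have decided abort
    if peer_state == "INIT":
        return "GLOBAL-ABORT"    # a peer that never voted rules out commit
    return None                  # peer is also READY: still blocked
```

The `None` case is exactly why a participant cannot always decide locally: if every reachable peer is also READY, all of them stay blocked until the coordinator recovers.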
When a failure occurs, it's necessary to bring the process back to a non-failed state.
Forward Error Recovery: Finding a new state where the system can continue to operate.
Backward Error Recovery: Bring the system back to a state without errors.
Backward Error Recovery is the most common, but it needs recovery points to be in place.
The recovery of distributed systems is difficult because the processes need to cooperate to identify a consistent global state from which they can all recover.
For each received message, it must be possible to demonstrate that it has been sent.
Assuming processes create checkpoints regularly, the recovery line is the most recent consistent set of checkpoints.
Each process creates a checkpoint after a globally coordinated action.
Simple solution:
Use a two-phase blocking protocol:
The coordinator sends a checkpoint request multicast message.
When a participant receives that message, it creates a checkpoint, holds any further application messages, and reports back that the checkpoint has been created.
When all checkpoints have been confirmed to the coordinator, it broadcasts that all checkpoints have been created, and normal sending resumes.
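The two phases above can be sketched as follows; the Participant class and method names are illustrative, assuming each participant blocks its outgoing application messages between the checkpoint and the global confirmation.

```python
class Participant:
    def __init__(self):
        self.checkpoints = 0
        self.blocked = False

    def take_checkpoint(self):
        self.checkpoints += 1
        self.blocked = True    # hold application messages until phase 2
        return True            # ack to the coordinator

    def resume(self):
        self.blocked = False

def coordinated_checkpoint(participants):
    # Phase 1: multicast the request and collect acknowledgements.
    acks = [p.take_checkpoint() for p in participants]
    # Phase 2: once every checkpoint is confirmed, broadcast completion.
    if all(acks):
        for p in participants:
            p.resume()
        return True
    return False

group = [Participant(), Participant()]
ok = coordinated_checkpoint(group)
```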
It is possible to limit the scope to only the processes that depend on the coordinator's recovery.
If checkpoints are created at the wrong instants, the recovery line can end up at the very beginning of the computation. This situation is called cascading rollback (the domino effect).
Each process creates checkpoints independently of other processes, with the risk of cascading rollback.
Let CPi(m) be the m-th checkpoint of process Pi and INTi(m) the interval between CPi(m-1) and CPi(m).
When process Pi sends a message in the interval INTi(m), it piggybacks (i, m) on it.
When process Pj receives the message in the interval INTj(n), it records a dependency INTi(m) -> INTj(n).
The INTi(m) -> INTj(n) dependency is kept alongside the checkpoint CPj(n).
If process Pi rolls back to CPi(m-1), Pj also has to roll back to CPj(n-1).
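The piggybacking scheme above can be sketched as follows; the Process class is illustrative, tracking only the current interval index and the recorded dependencies.

```python
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.interval = 1  # currently inside INT_i(1), before CP_i(1)
        self.deps = {}     # interval index n -> set of (i, m) dependencies

    def send(self):
        return (self.pid, self.interval)       # piggyback (i, m)

    def receive(self, tag):
        # record INT_i(m) -> INT_j(n), to be stored with checkpoint CP_j(n)
        self.deps.setdefault(self.interval, set()).add(tag)

    def checkpoint(self):
        self.interval += 1                     # CP_j(n) taken, now in INT_j(n+1)

p1, p2 = Process(1), Process(2)
p2.receive(p1.send())  # dependency INT_1(1) -> INT_2(1): if P1 rolls back
p2.checkpoint()        # past CP_1(1), P2 must roll back past CP_2(1) too
```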
Instead of taking frequent checkpoints (costly), it tries to replay the messages and behaviors since the last checkpoint -> keeps a log of messages.
A piecewise deterministic execution model is assumed.
The execution of each process can be considered a sequence of state intervals.
Each interval between states begins with a non-deterministic event (ex. message reception).
Execution in the interval is deterministic.
If non-deterministic events are registered (to repeat later), the deterministic execution model obtained allows the system to repeat all actions in a complete manner.
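The replay idea can be shown with a toy sketch: here the "nondeterministic events" are simply values fed to a deterministic computation, and replaying the log reproduces the exact pre-crash state. The function is illustrative.

```python
def run(initial_state, events, log):
    """Deterministic execution driven by nondeterministic events, which are logged."""
    state = initial_state
    for e in events:
        log.append(e)           # record the nondeterministic event
        state = state * 2 + e   # deterministic computation between events
    return state

log = []
before_crash = run(0, [3, 5, 7], log)
replayed = run(0, list(log), [])  # replaying the log yields the same state
```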
When should messages be logged?
Orphan processes case:
Process Q receives m1 and m2.
Assume that m2 was not logged.
After m1 and m2 have been delivered, Q sends m3 to process R.
Process R receives, and delivers, m3. If Q crashes and recovers, m2 cannot be replayed, so m3 is never resent: R has become an orphan.
Notation:
DEP(m): processes to which m has been delivered, plus their causal dependents. If a message m* is caused by the delivery of m, and m* was delivered to Q, then Q ∈ DEP(m).
COPY(m): processes that hold a copy of m, but have not (yet) logged it reliably.
FAIL: set of processes that crashed.
Characterization:
Q is an orphan <=> Q ∈ DEP(m) and COPY(m) ⊆ FAIL.
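This characterization maps directly onto Python sets; the encoding below is a direct transcription of the condition, with illustrative names.

```python
def is_orphan(q, m, dep, copy, fail):
    """Q is an orphan iff Q ∈ DEP(m) and COPY(m) ⊆ FAIL for some message m."""
    return q in dep[m] and copy[m] <= fail   # <= is the subset test on sets

dep  = {"m2": {"Q", "R"}}   # R depends on m2 (it delivered m3, caused by m2)
copy = {"m2": {"Q"}}        # only Q ever held m2, and never logged it
fail = {"Q"}                # Q crashed
orphaned = is_orphan("R", "m2", dep, copy, fail)  # True: m2 is lost forever
```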
Pessimistic logging: for each unstable message m, there is at most one dependent process, i.e., |DEP(m)| ≤ 1.
Consequence:
In a pessimistic protocol, an unstable message has to be made stable before its holder may send the next message.
Optimistic logging: for each unstable message m, it is ensured that if COPY(m) ⊆ FAIL, then eventually DEP(m) ⊆ FAIL as well.
Consequence:
To ensure that DEP(m) ⊆ FAIL, each orphan process Q is usually rolled back until Q ∉ DEP(m).
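The rollback rule can be sketched on the same set notation: for every lost message, surviving dependents are rolled back out of DEP(m). The function is a simplified illustration and its names are mine.

```python
def roll_back_orphans(dep, copy, fail):
    """Restore DEP(m) ⊆ FAIL for every message m whose copies all crashed."""
    for m in dep:
        if copy[m] <= fail:   # every copy of m crashed: m is lost
            dep[m] &= fail    # surviving dependents roll back past the delivery of m
    return dep

dep  = {"m": {"Q", "R"}}
copy = {"m": {"P"}}
fail = {"P", "R"}
dep = roll_back_orphans(dep, copy, fail)  # Q is rolled back; DEP(m) ⊆ FAIL holds
```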