Failure Detection
How can a process crash be reliably detected?
General model:
Each process has its own failure detection module.
A process P probes another process, Q, for a reaction.
If Q reacts -> Q is considered available (by P).
If Q does not react within a timeout t -> Q is suspected to have failed (by P).
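The probing model above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `probe`, the `send_probe` callback, and the use of an in-memory `queue.Queue` as P's reply channel are all hypothetical stand-ins for a real network exchange.

```python
import queue
from typing import Callable


def probe(send_probe: Callable[[], None],
          reply_box: "queue.Queue[str]",
          timeout: float) -> str:
    """P probes Q once and classifies Q based on whether a reply arrives in time.

    send_probe: hypothetical callback that delivers the probe to Q.
    reply_box:  hypothetical channel on which Q's reaction arrives at P.
    timeout:    the value t from the notes, in seconds.
    """
    send_probe()
    try:
        # Block for at most `timeout` seconds waiting for Q's reaction.
        reply_box.get(timeout=timeout)
    except queue.Empty:
        # No reaction within t: P can only suspect Q, not know it crashed.
        return "suspected"
    return "available"


# A live Q reacts immediately; a crashed Q never reacts.
live_box: "queue.Queue[str]" = queue.Queue()
crashed_box: "queue.Queue[str]" = queue.Queue()

live_result = probe(lambda: live_box.put("ack"), live_box, timeout=0.05)
crashed_result = probe(lambda: None, crashed_box, timeout=0.05)
```

Note that the result for the silent process is "suspected" rather than "failed": without a bound on message delay, a slow Q and a crashed Q look the same to P.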
In a synchronous system:
A suspicion of a crash is in fact a certainty, because message delays are bounded.
In practice:
If P does not receive a heartbeat from Q within a timeout t: P suspects Q.
If Q sends a message later on (which is received by P):
P stops being suspicious of Q.
P increases the timeout t, since the earlier suspicion shows that t was too short.
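The heartbeat scheme with an adaptive timeout could look roughly like the sketch below. It is not a definitive implementation: the class name `HeartbeatDetector`, the fixed `increment` added to t, and the explicit `now` timestamps (used instead of a real clock so the behaviour is easy to trace) are all assumptions made for illustration. The notes only say that t increases, not by how much.

```python
class HeartbeatDetector:
    """P's view of a single process Q, driven by explicit timestamps."""

    def __init__(self, timeout: float, increment: float = 1.0) -> None:
        self.timeout = timeout          # the value t from the notes
        self.increment = increment      # assumed step by which t grows
        self.last_heartbeat = 0.0       # time of the last heartbeat from Q
        self.suspected = False

    def on_heartbeat(self, now: float) -> None:
        """Called when P receives a heartbeat (or any message) from Q."""
        if self.suspected:
            # Q was wrongly suspected: clear the suspicion and relax t.
            self.suspected = False
            self.timeout += self.increment
        self.last_heartbeat = now

    def check(self, now: float) -> bool:
        """Return True if P currently suspects Q."""
        if now - self.last_heartbeat > self.timeout:
            self.suspected = True
        return self.suspected


detector = HeartbeatDetector(timeout=2.0, increment=1.0)
detector.on_heartbeat(now=0.0)
early = detector.check(now=1.0)     # heartbeat is recent: no suspicion
late = detector.check(now=3.5)      # silent for longer than t: suspicion
detector.on_heartbeat(now=4.0)      # late message arrives: t grows to 3.0
```

Growing t after each false suspicion makes the detector less likely to suspect a slow but live Q again, at the cost of taking longer to notice a real crash.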