Failure Detection

How can a process crash be reliably detected?

General model:

  • Each process has its own failure detection module.

  • A process P probes another process, Q, for a reaction.

  • If Q reacts -> Q is considered available (by P).

  • If Q does not react within a timeout t -> Q is suspected (by P) to have failed.
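The model above can be sketched as a small per-process module. This is a minimal illustration, not a definitive implementation: the class name ProbeFailureDetector and the probe callback (assumed to return True if the probed process reacted within the given timeout) are hypothetical names introduced here.

```python
from typing import Callable, Set

# Hypothetical probe signature: probe(timeout_seconds) -> True if the
# probed process reacted in time, False otherwise.
Probe = Callable[[float], bool]


class ProbeFailureDetector:
    """Failure detection module local to one process P (illustrative sketch)."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout            # the timeout t from the notes
        self.suspected: Set[str] = set()  # processes P currently suspects

    def check(self, name: str, probe: Probe) -> bool:
        """Probe process `name`; return True if P considers it available."""
        if probe(self.timeout):
            # Q reacted in time: P considers Q available.
            self.suspected.discard(name)
            return True
        # No reaction within t: P suspects Q (this is only a suspicion).
        self.suspected.add(name)
        return False
```

Note that the result is always relative to P: another process running its own module may reach a different conclusion about Q.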

In a synchronous system:

  • Message delays and processing times are bounded, so if t exceeds that bound, a suspicion of a crash is in fact a certainty.

In practice:

If P does not receive a heartbeat from Q within the timeout t, P suspects Q.

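If P later receives a message from Q, the suspicion was wrong (Q was slow, not crashed):

  • P stops suspecting Q.

  • P increases the value of t, so that a similarly slow Q is less likely to be falsely suspected again.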
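The heartbeat scheme with a growing timeout could look roughly like the sketch below. The names HeartbeatDetector, on_heartbeat, tick and the fixed increment are assumptions made for illustration; the notes do not say by how much t should grow. Time is passed in explicitly so the logic stays deterministic.

```python
class HeartbeatDetector:
    """P's view of a single process Q, driven by heartbeats (illustrative sketch)."""

    def __init__(self, timeout: float, increment: float, now: float = 0.0) -> None:
        self.timeout = timeout        # the timeout t from the notes
        self.increment = increment    # assumed step by which t grows
        self.last_heartbeat = now     # time of the last message from Q
        self.suspected = False

    def on_heartbeat(self, now: float) -> None:
        """Called when P receives a heartbeat (or any message) from Q."""
        if self.suspected:
            # The suspicion was wrong: Q was only slow. Clear it and relax t.
            self.suspected = False
            self.timeout += self.increment
        self.last_heartbeat = now

    def tick(self, now: float) -> bool:
        """Called periodically by P; returns True if Q is currently suspected."""
        if now - self.last_heartbeat > self.timeout:
            self.suspected = True
        return self.suspected


# Usage: Q goes silent, is suspected, then turns out to be alive.
d = HeartbeatDetector(timeout=1.0, increment=0.5)
d.tick(1.5)          # no heartbeat for 1.5 > 1.0 -> Q becomes suspected
d.on_heartbeat(2.0)  # late heartbeat: suspicion cleared, t grows to 1.5
```

A fixed increment is only one option; multiplying t by a constant factor is another common choice.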
