Reliable RPCs

What can go wrong?

The client cannot locate the server.
The request message from the client to the server gets lost.
The server crashes after receiving the request.
The response from the server gets lost.
The client crashes after sending the request.

Solutions

Server cannot be located: report the failure to the client.
Lost request: re-send.
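The two remedies above can be combined in one client-side loop. The following is a minimal sketch, not a real RPC library: `LossyChannel` and `call_with_retry` are hypothetical names, and a lost request is simulated by raising `TimeoutError`.

```python
class LossyChannel:
    """Hypothetical stand-in for a network that drops the first few requests."""

    def __init__(self, drop_first_n):
        self.drop_remaining = drop_first_n
        self.attempts = 0

    def call(self, request):
        self.attempts += 1
        if self.drop_remaining > 0:
            self.drop_remaining -= 1
            raise TimeoutError("no reply before the timeout")
        return f"reply to {request}"


def call_with_retry(channel, request, max_attempts=3):
    """Re-send the request until a reply arrives or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return channel.call(request)
        except TimeoutError:
            if attempt == max_attempts:
                raise  # give up and report the failure to the caller


channel = LossyChannel(drop_first_n=2)
result = call_with_retry(channel, "read(x)")  # succeeds on the third attempt

# A channel that never answers ends with the failure being reported.
dead = LossyChannel(drop_first_n=10)
try:
    call_with_retry(dead, "read(x)")
    reported = False
except TimeoutError:
    reported = True
```

Blind re-sending is only safe when executing the request more than once does no harm, since the client cannot tell a lost request from a lost response.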

Server crashes

Problem

Case (a) is the normal one. Cases (b) and (c), in which the server crashes at different points while handling the request, need different solutions, yet the client cannot tell which of them occurred.

Two approaches

At-least-once semantics: the server guarantees that the operation is executed at least once.
At-most-once semantics: the server guarantees that the operation is executed at most once.
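The difference between the two guarantees shows up in what happens when no reply arrives. The sketch below is a toy model under stated assumptions: the `Server` class, the fault labels `"reply_lost"` and `"crash_before"`, and the idea of expressing each guarantee as a retry policy are all illustrative choices, not something the notes prescribe.

```python
class Server:
    """Hypothetical server that counts how often the operation runs."""

    def __init__(self):
        self.executions = 0

    def handle(self, fault):
        if fault == "crash_before":
            raise TimeoutError("server crashed before executing")
        self.executions += 1  # the operation is executed here
        if fault == "reply_lost":
            raise TimeoutError("executed, but the reply never arrived")
        return "done"


def at_least_once(server, faults):
    """Re-send until a reply arrives: the operation may run more than once."""
    for fault in faults:
        try:
            return server.handle(fault)
        except TimeoutError:
            continue  # no reply: try again
    return None


def at_most_once(server, faults):
    """Send a single time and never retry: the operation may not run at all."""
    try:
        return server.handle(faults[0])
    except TimeoutError:
        return None  # report the failure instead of retrying


a = Server()
reply_a = at_least_once(a, ["reply_lost", None])  # executed twice

b = Server()
reply_b = at_most_once(b, ["crash_before"])  # executed zero times
```

Neither policy yields exactly one execution in every failure scenario.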

Transparent recovery from server failures

Why is fully transparent recovery impossible?

Three distinct events can occur at the server:

M: send the completion message.
P: complete the processing of the document.
C: crash.

Six possible orderings (events in parentheses never take place):

M->P->C: Crash after reporting completion and after processing.
M->C(->P): Crash after reporting completion, but before processing.
P->M->C: Crash after processing and after reporting completion.
P->C(->M): Processing took place, then the crash, before completion was reported.
C(->P->M): Crash before any action.
C(->M->P): Crash before any action.
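One way to see why no recovery scheme is fully transparent is to enumerate the six orderings against what a client might do after the crash. The sketch below rests on assumptions that are not in the notes: the four client reissue strategies, the rule that the client only knows whether M arrived, and the rule that a reissued request is processed exactly once by the recovered server.

```python
# Each string lists the events in order; everything after "C" never happens.
ORDERINGS = ["MPC", "MCP", "PMC", "PCM", "CPM", "CMP"]


def happened_before_crash(ordering):
    """Return the set of events that took place before the crash."""
    return set(ordering[: ordering.index("C")])


# Assumed client strategies: decide whether to reissue the request,
# based only on whether the completion message M was received.
STRATEGIES = {
    "always": lambda got_m: True,
    "never": lambda got_m: False,
    "only when M received": lambda got_m: got_m,
    "only when M not received": lambda got_m: not got_m,
}


def outcome(ordering, reissue):
    """Count how many times the document ends up being processed."""
    done = happened_before_crash(ordering)
    count = int("P" in done) + int(reissue("M" in done))
    return {0: "ZERO", 1: "OK", 2: "DUP"}[count]


table = {
    name: [outcome(o, decide) for o in ORDERINGS]
    for name, decide in STRATEGIES.items()
}

for name, row in table.items():
    print(f"{name:26}", row)
```

Under these assumptions every strategy produces at least one `ZERO` or `DUP`, so none of them processes the document exactly once in all six orderings.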

Lost response

All the client notices is that it is not receiving an answer. It has no way of knowing the cause: a lost request, a server crash, or a lost response.

Solution (partial)

Design the server so that its operations are idempotent: executing an operation repeatedly has the same effect as executing it once.

Pure read operations.
Strict overwrite operations.

Many operations are inherently non-idempotent, such as many bank transactions.
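The property can be checked by comparing the state after one execution with the state after two. This is a small sketch with a hypothetical `KeyValueStore`; the `increment` operation is an assumed example of an update whose result depends on the current state.

```python
class KeyValueStore:
    """Hypothetical server state with three kinds of operations."""

    def __init__(self):
        self.data = {"x": 0}

    def read(self, key):
        # Pure read: repeating it changes nothing.
        return self.data[key]

    def overwrite(self, key, value):
        # Overwrite: the final state does not depend on the old value.
        self.data[key] = value

    def increment(self, key, amount):
        # Depends on the current value, so repeating it changes the result.
        self.data[key] += amount


def state_after(repeats, operation, *args):
    """Apply the named operation `repeats` times to a fresh store."""
    store = KeyValueStore()
    for _ in range(repeats):
        getattr(store, operation)(*args)
    return store.data


read_is_idempotent = state_after(1, "read", "x") == state_after(2, "read", "x")
overwrite_is_idempotent = (
    state_after(1, "overwrite", "x", 5) == state_after(2, "overwrite", "x", 5)
)
increment_is_idempotent = (
    state_after(1, "increment", "x", 5) == state_after(2, "increment", "x", 5)
)
```

A client can safely re-send the first two kinds of request after a timeout, but re-sending the third may apply the change twice.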

Client crashes

Problem

The server keeps working and holding resources for a client that no longer exists (an orphan computation).

Solutions

When the client recovers, it kills the orphan itself.

When the client recovers, it broadcasts a new epoch number -> each server kills that client's orphans.

Require every computation to finish within at most T time units. Older ones are simply removed.
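The broadcast-on-recovery approach can be sketched on the server side as follows. This is an illustration under assumptions: the `Server` class, its method names, and the use of a per-client integer (called `epoch` in the code) that grows with every client restart are hypothetical choices, and real killing of computations is reduced to removing entries from a list.

```python
class Server:
    """Hypothetical server that tracks running computations per client."""

    def __init__(self):
        self.current_epoch = {}  # client id -> highest epoch seen so far
        self.running = []        # (client id, epoch, task name)

    def start(self, client, epoch, task):
        """Start a computation unless it comes from a dead incarnation."""
        if epoch < self.current_epoch.get(client, epoch):
            return False  # stale request from before the client's restart
        self.current_epoch[client] = epoch
        self.running.append((client, epoch, task))
        return True

    def on_new_epoch(self, client, epoch):
        """Handle the broadcast a client sends after it recovers."""
        self.current_epoch[client] = epoch
        # Kill every computation started by an earlier incarnation.
        self.running = [
            entry for entry in self.running
            if not (entry[0] == client and entry[1] < epoch)
        ]


server = Server()
server.start("A", 1, "render")
server.start("B", 1, "index")
server.on_new_epoch("A", 2)                 # client A crashed and recovered
accepted_new = server.start("A", 2, "render-again")
accepted_old = server.start("A", 1, "late-duplicate")
```

After the broadcast only client B's task and client A's new task remain, and a late request carrying the old number is ignored.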

Non-trivial broadcast

Reliable communication in the presence of failed processes
- Communication is reliable when it guarantees that a message is received and subsequently delivered by all non-failing members of the group.

Difficulty
- Agreement on who is in the group is needed before a received message can be delivered.
