[Bernstein09] Section 4.3. Client Recovery

Stefen 2010-06-04


4.3. Client Recovery

An important reason to use queuing instead of direct TP is to address certain client and server failure situations. In this section, we systematically explore the various failure situations that can arise. We do this from a client’s perspective, to determine what a client should do in each case.

We will assume the request-reply model of Figure 4.5. That is, a client runs Transaction 1 to construct and submit a request, and later runs Transaction 3 to receive and process the reply. Its goal is to get exactly-once behavior: Transaction 2 executes exactly once, and its reply is processed in Transaction 3 exactly once.

Let us assume that there is no failure of the client, the communications between the client and the queues, or the queues themselves. In this case, the client’s behavior is pretty straightforward. It submits a request. Since there are no failures between the client and the request queue, the client receives an acknowledgment that the request is successfully enqueued. The client then waits for a reply. If it is waiting too long, then there is presumably a problem with the server—it is down, disconnected, or busy—and the client can take appropriate action, such as sending a message to a system administrator. The important point is that there is no ambiguity about the state of the request. It’s either in the request queue, in the reply queue, or being processed.

Suppose the client fails or loses connectivity to the queues, or the queues fail. This could happen for a variety of reasons, such as the failure of the client application or machine, the failure of the machine that stores the queues, a network failure, or a burst of traffic that causes one of these components to be overloaded and therefore unresponsive due to processing delays. At some point, the failed or unresponsive components recover and are running normally again, so the client can communicate with the queues. At this point the client needs to run recovery actions to resynchronize with the queues. What exactly should it do?

To keep things simple, let’s assume that the client processes one request at a time. That is, it processes the reply to each request before it submits another request, so it has at most one request outstanding. In that case, at the time the client recovers, there are four possible states of the last request it submitted:

A. Transaction 1 did not run and commit. Either it didn’t run at all, or it aborted. Either way, the request was not submitted. The client should resubmit the request (if possible) or else continue with a new request.
B. Transaction 1 committed but Transaction 2 did not. So the request was submitted, but it hasn’t executed yet. The client must wait until the reply is produced and then process it.
C. Transaction 2 committed but Transaction 3 did not. The request was submitted and executed, but the client hasn’t processed the reply yet. The client can process the reply right away.
D. Transaction 3 committed. The request was submitted and executed, and the client already processed the reply. So the client’s last request is done, and the client can continue with a new request.

To determine what recovery action to take, the client needs to figure out which of the four states it is in.

If each client has a private reply queue, it can make some headway in this analysis. Since the client processes one request at a time, the reply queue either is empty or has one reply in it. So, if the reply queue is nonempty, then the system must be in state C, and the client should go ahead and process the reply. If not, it could be in states A, B, or D.

To disambiguate these states, some additional state information needs to be stored somewhere. If the client has access to persistent storage that supports transaction semantics, it can use that storage for state information. The client marks each request with a globally unique identifier (ID) and stores the request in persistent storage before enqueuing it in the request queue (see LastRequest in Transaction 0 in Figure 4.6). In persistent storage the client also keeps the IDs of the last request it enqueued and the last reply it dequeued, denoted LastEnqueuedID and LastDequeuedID, respectively. It updates these IDs as part of Transactions 1 and 3, which enqueue a request and dequeue a reply, as shown in Figure 4.6. In that figure, the expression R.ID denotes the ID of request R.

Figure 4.6. Client Maintains Request State. The client stores the ID of the last request it enqueued and the last reply it dequeued, in Transactions 1 and 3, respectively.
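
The bookkeeping that the caption describes can be illustrated with a minimal sketch. It is not a reproduction of Figure 4.6. It makes several assumptions that are not in the text: a plain dictionary stands in for transactional persistent storage, two in-memory deques stand in for the request and reply queues, the reply carries the ID of the request it answers, and the function names are hypothetical. Because nothing here is actually atomic or durable, the sketch shows only which values each transaction reads and writes.

```python
import uuid
from collections import deque

# Stand-ins (assumptions): a dict plays the role of transactional persistent
# storage; two deques play the role of the request and reply queues.
store = {"LastRequest": None, "LastEnqueuedID": None, "LastDequeuedID": None}
request_queue = deque()
reply_queue = deque()


def transaction_0(payload):
    """Construct a request, tag it with a unique ID, and persist it."""
    request = {"ID": str(uuid.uuid4()), "payload": payload}
    store["LastRequest"] = request
    return request


def transaction_1():
    """Enqueue LastRequest and record its ID as LastEnqueuedID."""
    request = store["LastRequest"]
    request_queue.append(request)
    store["LastEnqueuedID"] = request["ID"]


def transaction_2():
    """Server side: dequeue a request, execute it, and enqueue the reply.

    The reply reuses the request's ID (an assumption of this sketch).
    """
    request = request_queue.popleft()
    reply_queue.append({"ID": request["ID"], "result": request["payload"].upper()})


def transaction_3():
    """Dequeue the reply and record its ID as LastDequeuedID."""
    reply = reply_queue.popleft()
    store["LastDequeuedID"] = reply["ID"]
    return reply
```

In a real deployment, each of these functions would have to run as a transaction spanning both the storage system and the queue manager, so that the queue operation and the ID update commit or abort together.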

At recovery time, the client reads LastRequest, LastEnqueuedID, and LastDequeuedID from persistent storage. It uses them to analyze the state of LastRequest as follows:

1. If LastRequest.ID ≠ LastEnqueuedID, then the system must be in state A. That is, the last request that the client constructed was not successfully submitted to the request queue. Either the client failed before running Transaction 1, or Transaction 1 aborted because of the client failure or some other error. The client can either resubmit the request or delete it, depending on the behavior expected by the end user.
2. If LastRequest.ID = LastDequeuedID, then the client dequeued (and presumably processed) the reply to the last request the client submitted, so the system is in state D. In this case, the request ID has helped the client match up the last request with its reply, in addition to helping it figure out which state it is in.
3. If the reply queue is nonempty, the client should dequeue the reply and process it (i.e., state C). Notice that in this case, LastRequest.ID = LastEnqueuedID and LastRequest.ID ≠ LastDequeuedID, so the previous two cases do not apply.
4. Otherwise, the client should wait until the reply appears before dequeuing it (i.e., state B).
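
The case analysis above can be written as a small decision function. The following is a sketch only: the names State, ClientRecord, and classify are hypothetical, IDs are assumed to be strings (or None when nothing has been recorded yet), and the caller is assumed to have already checked whether the private reply queue is empty. The tests are applied in the same order as the cases in the text, which matters because the third and fourth cases rely on the first two having been ruled out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    A = "request not submitted"
    B = "submitted, not yet executed"
    C = "executed, reply not yet processed"
    D = "reply processed"


@dataclass
class ClientRecord:
    """Values read from persistent storage at recovery time."""
    last_request_id: Optional[str]
    last_enqueued_id: Optional[str]
    last_dequeued_id: Optional[str]


def classify(record: ClientRecord, reply_queue_nonempty: bool) -> State:
    # Case 1: the last constructed request never reached the request queue.
    if record.last_request_id != record.last_enqueued_id:
        return State.A
    # Case 2: the reply to the last request was already dequeued.
    if record.last_request_id == record.last_dequeued_id:
        return State.D
    # Case 3: submitted, reply not yet dequeued, and a reply is waiting.
    if reply_queue_nonempty:
        return State.C
    # Case 4: submitted, but no reply has been produced yet.
    return State.B
```

For example, classify(ClientRecord("r1", "r1", "r0"), reply_queue_nonempty=False) returns State.B, so the client should wait for the reply. If all three IDs are None (a client that has never constructed a request), the function returns State.D, which in this sketch simply means there is nothing to recover.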

This recovery procedure assumes that the client uses a persistent storage system that supports transaction semantics. This is a fairly strong assumption. The client may not have such storage available. Even if the client does have it, the application developer may want to avoid using it for performance reasons. That is, since the queue manager and persistent storage are independent resource managers, the two-phase commit protocol is needed for Transactions 1 and 3, which incurs some cost.

This cost can be avoided by storing the state information in the queue manager itself. For example, the client could store LastEnqueuedID and LastDequeuedID in a separate queue dedicated to this purpose. Alternatively, the queue manager could maintain LastEnqueuedID and LastDequeuedID as the state of a persistent session between the client and the queue manager. The client signs up with the queue manager by opening a session. The session information is recorded in the queue manager’s persistent storage, so the queue manager can remember that the client is connected. If the client loses connectivity with the queue manager and later attempts to reconnect, the queue manager finds the session in its persistent storage and re-establishes it instead of creating a new one. Since the session state includes the request and reply IDs, the client can ask for them as input to its recovery activity.
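
To illustrate why keeping the session state inside the queue manager avoids two-phase commit, here is a sketch in which a single SQLite database plays the role of the queue manager. This is an assumption made purely for illustration, not a description of any real queuing product: the schema, the table names, and the functions open_queue_manager, enqueue_request, and session_state are all hypothetical. Because the queue and the session state live in the same resource manager, one local transaction covers both updates. Only the enqueue side (Transaction 1) is shown; the dequeue side would update last_dequeued_id in the same way.

```python
import sqlite3


def open_queue_manager(path=":memory:"):
    """Create the request queue and per-client session tables (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS request_queue (id TEXT PRIMARY KEY, body TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS session ("
        "client TEXT PRIMARY KEY, last_enqueued_id TEXT, last_dequeued_id TEXT)"
    )
    conn.commit()
    return conn


def enqueue_request(conn, client, request_id, body):
    """Record LastEnqueuedID and enqueue the request in one local transaction."""
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute("INSERT OR IGNORE INTO session (client) VALUES (?)", (client,))
        conn.execute(
            "UPDATE session SET last_enqueued_id = ? WHERE client = ?",
            (request_id, client),
        )
        conn.execute(
            "INSERT INTO request_queue (id, body) VALUES (?, ?)", (request_id, body)
        )


def session_state(conn, client):
    """Return (LastEnqueuedID, LastDequeuedID) for use during client recovery."""
    row = conn.execute(
        "SELECT last_enqueued_id, last_dequeued_id FROM session WHERE client = ?",
        (client,),
    ).fetchone()
    return row if row is not None else (None, None)
```

If the insert into request_queue fails, the update to the session row is rolled back with it, so LastEnqueuedID never names a request that is not in the queue. After reconnecting, the client would call session_state to obtain the two IDs for its recovery analysis.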

The recovery scenario that we just described is based on the assumption that the client waits for a reply to each request before submitting another one. That is, the client never has more than one request outstanding. What if this assumption doesn’t hold? In that case, it is not enough for the system to maintain the ID of the last request enqueued and the last reply dequeued. Rather, it needs to remember enough information to help the client resolve the state of all outstanding requests. For example, it could retain the ID of every request that has not been processed and the ID of the last n replies the client has dequeued. Periodically, the client can tell the queue manager the IDs of recently dequeued replies for which it has a persistent record, thereby freeing the queue manager from maintaining that information. Many variations of this type of scheme are possible.
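
One of the many possible variations can be sketched as follows. The class SessionState and its method names are hypothetical, and the sketch interprets an outstanding request as one that has been enqueued but whose reply the client has not yet dequeued, which is an assumption rather than something the text specifies. The queue manager would keep one such object per client session in its persistent storage; here it is an ordinary in-memory object.

```python
from collections import OrderedDict


class SessionState:
    """Per-client state for a client with several requests in flight."""

    def __init__(self, n):
        self.n = n                     # how many dequeued reply IDs to remember
        self.outstanding = set()       # enqueued, reply not yet dequeued
        self.dequeued = OrderedDict()  # recently dequeued reply IDs, oldest first

    def on_enqueue(self, request_id):
        self.outstanding.add(request_id)

    def on_dequeue_reply(self, request_id):
        self.outstanding.discard(request_id)
        self.dequeued[request_id] = True
        # Keep only the last n dequeued reply IDs.
        while len(self.dequeued) > self.n:
            self.dequeued.popitem(last=False)

    def acknowledge(self, request_ids):
        """Client reports IDs it has recorded persistently; forget them here."""
        for request_id in request_ids:
            self.dequeued.pop(request_id, None)

    def status(self, request_id):
        if request_id in self.outstanding:
            return "outstanding"
        if request_id in self.dequeued:
            return "reply dequeued"
        return "unknown"  # never enqueued, acknowledged, or aged out
```

At recovery time the client would call status for each request ID it finds in its own records. The "unknown" result is where the choice of n matters: if n is too small, the ID of a reply that the client dequeued but did not yet record persistently could age out before the client acknowledges it.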

This scenario assumes that after a client processes a reply, it no longer needs to know anything about that request’s state. For example, suppose a client runs two requests. It submits Request₁, the server processes Request₁ and sends Reply₁, and the client processes Reply₁. Then the client submits Request₂, the server processes Request₂ and sends Reply₂, and the client processes Reply₂. At this point, the client can find out about the state of Request₂, but not about Request₁, at least not using the recovery procedure just described.

Finding out the state of old requests is clearly desirable functionality. Indeed, it’s functionality that we often depend on in our everyday lives, such as finding out whether we paid for a shipment that hasn’t arrived or whether we were credited for mileage on an old flight. However, this functionality usually is not offered by a queuing system or queued transaction protocols like the ones we have been discussing. Rather, if it is offered, it needs to be supported by the application as another transaction type—a lookup function for old requests. To support this type of lookup function, the application needs to maintain a record of requests that it already processed. In financial systems, these records are needed in any case, to support the auditability required by accounting rules. However, even when they’re not required, they’re often maintained as a convenience to customers.