
[Bernstein09] 2.6. Scalability

 Stefen 2010-06-24

2.6. Scalability

Scaling up a TP system to handle high load involves two activities. First, one can tune and grow each individual server system to handle the maximum possible load. And second, one can distribute the load across multiple interconnected server systems. The decision of which approach to take depends on cost-performance as well as other goals, such as availability, security, and manageability. In this section we focus on the mechanisms that enable both forms of scaling: scaling up an individual server and scaling out across multiple servers.

Scaling up a Server

The throughput of a server system is ultimately dependent on its hardware configuration; that is, on the speed of its processors, on the size and speed of main memory and secondary storage, and on the bandwidth and latency of its interconnects. Software too plays a major role. For a given hardware configuration, there are two techniques that are commonly used to get the most benefit from that configuration: caching and resource pooling.

Caching

A cache is an area of memory containing data whose permanent home is elsewhere, such as in secondary storage or on a remote server system. Ideally, the cache contains data that frequently is accessed by programs running on the system containing the cache and that is much cheaper to access than its permanent home. Thus, the expense of retrieving the data from its permanent home is amortized across a large number of accesses. This greatly improves performance over a system in which every access to the data requires accessing the data’s permanent home.

Since many components of a TP system need to access data that is not local to the component, caches are heavily used in TP. A web browser may cache pages of frequently accessed web sites, to avoid having to go to the web site every time those pages are requested. A web server may cache information that is needed to service popular requests, such as the information displayed in response to a client’s initial request to access the web site. A proxy server, which sits between the network and the web server, also offers this caching functionality. Some large data items, such as images, may be cached at a third-party’s web site that is closer to the end user and can therefore offer higher speed access to the information. A server running a TP application may cache popular data whose permanent home is a remote database system. And the database system itself may cache data whose permanent home is secondary storage.

The official copy of a data item in its permanent home may be updated shortly after that copy was read and put into a cache. Therefore, once data has been put into a cache (somewhere other than its permanent home), it potentially is no longer up to date. Thus, a cache is most appropriate for data that is infrequently updated. For example, in a TP system, it is probably useful to cache catalog information, since its content changes slowly. But it may not be useful to cache the latest bid in an auction that will close shortly, since it may be updated very frequently.

The implementation of a cache requires a fast way to look up entries. This usually is done using a hash algorithm that maps the identifier of a data item to its memory location in the cache. It also requires a replacement algorithm, which selects an item to remove from the cache to make room for a new item that has to be inserted. A commonly used replacement algorithm is least-recently-used, which replaces the item whose last access was longest in the past among all items in the cache. There is a large repertoire of cache replacement algorithms used in practice. However, coverage of these algorithms is beyond the scope of this book.
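
To make the lookup and replacement machinery concrete, here is a minimal sketch of a least-recently-used cache in Python. It is only an illustration of the technique, not the implementation used by any particular product; the hash-based lookup comes from the built-in dictionary, and the recency order is tracked with an OrderedDict.

from collections import OrderedDict

class LRUCache:
    # A hash table (the dictionary) gives fast lookup by data-item identifier;
    # the ordering tracks recency so the least-recently-used item can be evicted.
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()        # identifier -> cached value, ordered by recency

    def get(self, key):
        if key not in self.items:
            return None                   # cache miss: caller fetches from the permanent home
        self.items.move_to_end(key)       # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least-recently-used item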

Sometimes, items in the cache are invalidated before they need to be replaced. Invalidation is done if it is known that the item is unlikely to be fresh enough to be used. When a server stores an item in its cache, it may include an invalidation time that the cache manager enforces. For example, a web server may add an invalidation time 10 minutes in the future when it caches a copy of a headline news banner, thereby ensuring it refreshes the headline from the news server at least that often.
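
A cache manager can enforce such invalidation times with very little machinery. The following sketch (Python, with illustrative names) stores an expiration time with each entry and treats expired entries as misses, which forces a refresh from the data's permanent home.

import time

class TTLCache:
    def __init__(self):
        self.items = {}                   # identifier -> (value, expiration time)

    def put(self, key, value, ttl_seconds):
        self.items[key] = (value, time.time() + ttl_seconds)

    def get(self, key):
        entry = self.items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del self.items[key]           # invalidate the stale copy
            return None                   # caller refreshes from the permanent home
        return value

# For example, cache a headline news banner for at most 10 minutes:
# cache.put("headline_banner", banner_html, ttl_seconds=600)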

Alternatively, the server that is the data’s permanent home may keep track of which caches have a copy of that data. After the server processes an update for that data, it can issue an invalidation message to every cache that has a copy, which tells the cache to invalidate its copy of that data. This helps to ensure that the caches are coherent; that is, that a data item has the same value in all caches that currently hold the data item. Clearly, there are limits to cache coherence due to variance in the time it takes for each cache to receive an invalidation message and process it.

A cache may be updatable. Each update to a data item in the cache must be propagated back to the data item’s permanent home. Sometimes, this must be done explicitly by the client of the cache. That is, it stores the updated data item in both the cache and the data item’s permanent home. If the cache manager knows how to map each cached data item to its permanent home, then the client may only need to update the cache and the cache manager propagates the update to the data item’s permanent home. If the cache manager propagates the update immediately as part of the operation to update the cache, then the cache is called write-through. If it propagates the update lazily, potentially long after the cache was updated, then the cache is called write-back.

Clearly, cache coherence is affected by the way that updates to the cache are propagated. For example, if the data item’s server uses invalidation messages to notify caches when the item has changed, then a write-through cache will yield better cache coherence than a write-back cache. But this better coherence has a cost. Usually, the cost of the write-back cache is lower, since multiple updates to the same cached data item within a short time period incur only one write-back, and a write-back can batch multiple updates in a single message to the data’s permanent home.
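
The difference between the two propagation policies can be sketched as follows. The store object stands for the data item's permanent home and is a hypothetical interface, not the API of any real product.

class WriteThroughCache:
    def __init__(self, store):
        self.store = store                # the data item's permanent home (hypothetical interface)
        self.items = {}

    def put(self, key, value):
        self.items[key] = value
        self.store.write(key, value)      # propagate immediately, as part of the cache update

class WriteBackCache:
    def __init__(self, store):
        self.store = store
        self.items = {}
        self.dirty = set()                # identifiers updated since the last write-back

    def put(self, key, value):
        self.items[key] = value
        self.dirty.add(key)               # propagate lazily; repeated updates cost one write-back

    def flush(self):
        # A single batched message could carry all dirty items to the permanent home.
        for key in self.dirty:
            self.store.write(key, self.items[key])
        self.dirty.clear()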

Since caching mechanisms are complex and important, they are built into many types of products, notably transactional middleware and database systems. There are main memory database systems that are intended to be used for cached data. Some operate as a conventional transactional resource manager, such as Oracle’s TimesTen, McObject’s eXtremeDB, and Raima’s RDM. Others are designed specifically for caching, for example, by offering the application explicit control of when to invalidate cached data or write back updated cached data to its home location. Examples include Danga Interactive’s memcached, Oracle’s Coherence, and Microsoft’s project codenamed “Velocity.”

Resource Pooling

Another case where caching can improve performance is when a resource is costly to create and relatively inexpensive to access. Sessions are one such resource. Consider an application that requires the use of a database system. The server process that runs this application needs a session with a database system for each transaction currently running. However, each transaction typically needs the session only for a fraction of a second. Therefore, the server process can maintain a pool (i.e., cache) of sessions. Each transaction is given exclusive use of one of the sessions while it is running. After it commits or aborts, the session is returned to the pool. Thus, sessions are serially reused by different transactions.
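
The following sketch shows the idea of a session pool in Python. The open_session factory and run_transaction function are placeholders for whatever database API the application actually uses.

import queue

class SessionPool:
    def __init__(self, open_session, size):
        self.pool = queue.Queue()
        for _ in range(size):
            self.pool.put(open_session())   # create the sessions once, up front

    def acquire(self):
        return self.pool.get()              # blocks if all sessions are in use

    def release(self, session):
        self.pool.put(session)              # return the session for serial reuse

# Typical use around a single transaction:
# session = pool.acquire()
# try:
#     run_transaction(session)              # placeholder for the application's transaction logic
# finally:
#     pool.release(session)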

Process threads are another example of resources that can be pooled and serially reused. The process has a fixed number of threads. When it receives an RPC, the RPC is assigned to a thread in which to execute. After the RPC finishes executing and returns to its caller, the thread is returned to the thread pool.
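
In Python, for instance, the standard library's ThreadPoolExecutor provides exactly this behavior: a fixed number of threads, each assigned to an incoming call and returned to the pool when the call completes. The handle_rpc function below is only a placeholder.

from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=8)   # the process has a fixed number of threads

def handle_rpc(request):
    # Placeholder for executing the call and producing a reply.
    return "reply to " + str(request)

# future = executor.submit(handle_rpc, incoming_request)   # assign the call to an idle thread
# reply = future.result()                                  # the thread returns to the pool afterward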

A third example is server classes, which avoid the overhead of frequent process creation. Like the threads in a thread pool, each process in a server class is assigned an incoming call. After the call completes, the process is returned to the server class for reuse by another call.

Scaling Out a System

One way to scale up a system is to add more machines. This is called scale-out. There are two approaches to scale-out, partitioning and replication, which offer different ways of distributing the workload across the machines.

Partitioning

One way to distribute the workload is to partition the application and its data into different types of work and assign each type to a different machine. For example, in a bank, one might assign the credit card application to one system, the loan application to a second system, and the checking account application to a third system. When a request arrives, it is directed to the system that supports the relevant application. This can be done by storing the mapping between applications and servers in a registry service and looking up the mapping for each request, as was described in Section 2.4.

Partitioning by application type is an effective technique. However, it is an incomplete solution if an application needs to scale up beyond the capabilities of a single machine. Then the application itself needs to be partitioned. A common way to do this is range partitioning, where different copies of the server handle different ranges of an input parameter. For example, a debit-credit application dealing with retail banking might be split into five servers, each of which handles a range of account numbers (see Figure 2.16). The database that supports each of these servers can be local to the system that supports those account numbers. So the first group of account numbers is stored on the same computer as the application program that supports those account numbers, and so on.

Figure 2.16. Parameter-based Routing. The router application forwards each request to the appropriate server based on the account number parameter in the request.


When the system is organized in this way, a routing function needs to forward each request to the correct server based not only on the identifier of the request type, but also on one or more of the input parameters. In the example, it would be the account number. This is called parameter-based routing.

Range partitioning can be implemented directly by the application, by having the application support the routing function. Many systems provide built-in support. For example, range partitioning and parameter-based routing are supported by many high-function database systems and some transactional middleware products.
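
As an illustration of application-level support, the sketch below routes requests by account number across five servers, assuming (as Figure 2.16 suggests) that the 100,000 accounts are split into five ranges of 20,000. The server names are placeholders.

import bisect

# Upper bound of each account-number range, as in Figure 2.16.
RANGE_UPPER_BOUNDS = [19999, 39999, 59999, 79999, 99999]
SERVERS = ["server1", "server2", "server3", "server4", "server5"]

def route(account_number):
    # Parameter-based routing: choose the server from the account-number parameter.
    index = bisect.bisect_left(RANGE_UPPER_BOUNDS, account_number)
    return SERVERS[index]

# route(25000) returns "server2", the server whose local database holds
# accounts 20,000 through 39,999.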

Partitioning schemes all suffer from the problem of load balancing, especially when servers are partitioned statically. Usually, the workload varies over time. For example, in the system shown in Figure 2.16 there may be a burst of activity for accounts in the 20,000 to 39,999 range, thereby overloading the second server. This problem may arise frequently if the load is correlated with value ranges. For example, if account numbers are correlated with geographical regions, then a peak period in one time zone will cause its partition’s servers to be more heavily loaded than those of other partitions. An overloaded partition will perform poorly. It doesn’t help that other servers may be less heavily loaded, because they don’t have the data required to service requests in the 20,000 to 39,999 range.

One way to reduce the frequency of such overload situations is to use hash partitioning, where a hash function is used to map each parameter value to a server partition. A well-designed hash function will, with very high probability, spread the load evenly across servers. It therefore is less likely to exhibit load-balancing problems than range partitioning. Hash partitioning commonly is used not only for partitioning a database but also for partitioning a large-scale cache that is spread across many servers.
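
A minimal sketch of hash partitioning follows: a deterministic hash of the parameter value chooses the partition, so consecutive account numbers are spread across servers rather than clustered in one range. The partition count is illustrative.

import hashlib

NUM_PARTITIONS = 5

def hash_partition(account_number):
    # Use a deterministic hash so every router maps a given account to the same partition.
    digest = hashlib.md5(str(account_number).encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# A burst of activity on accounts 20,000 through 39,999 is now spread across
# all five partitions instead of overloading a single range partition.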

One solution to load balancing is automatic reconfiguration. That is, when a partition becomes overloaded, it automatically is split into two partitions and the routing function is updated accordingly. The decision to split a partition should be based on workload trends, not a short-term spike in load. If a partition is split based on a temporary load spike, the split partitions will be underutilized after the spike dissipates.

Another solution to load balancing is to use table-lookup partitioning, where a mapping table explicitly maps each input parameter value to a particular server. There is a significant cost to maintaining all this mapping information when a lot of parameter values are present, though this can be mitigated with the use of some network switches, such as Layer 7 switches. This partitioning approach offers some significant benefits over range and hash partitioning. One benefit is fine-grained control over reconfiguration. When a server overflows, a new server can be allocated and newly added parameter values can be assigned to the new server. Another benefit is that different parameter values can be explicitly assigned levels of service. For example, a bank may offer two levels of service to checking accounts, depending on their minimum monthly balance. This account-type information can be stored in the mapping table and the account stored at a server that supports the appropriate level of service. A third benefit is that users can be upgraded to a new release of an application one by one. By contrast, with range or hash partitioning, the application would not know whether to access a user’s data in the partition using the old or new format. Thus, all the parameter values (e.g., accounts) in a partition would be inaccessible while the upgrade was in progress.
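
A sketch of table-lookup partitioning follows. The mapping table and its entries are illustrative; in practice the table would itself live in a registry service or be pushed into a Layer 7 switch.

# Each parameter value is explicitly mapped to a server, and the entry can also
# carry per-account attributes such as a service level or application version.
partition_table = {
    10001: {"server": "server1", "service_level": "standard", "app_version": 1},
    10002: {"server": "server7", "service_level": "premium",  "app_version": 2},
}

def lookup_route(account_number):
    return partition_table[account_number]["server"]

def reassign(account_number, new_server):
    # Fine-grained reconfiguration: move one account to a newly allocated server.
    partition_table[account_number]["server"] = new_server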

Whatever partitioning scheme is used, configuring a system with the right amount of server capacity is important. Servers need to be configured for peak load, not average load, to ensure that they can offer good response time even in periods of high load. The more extra capacity (or headroom) that each system offers relative to its expected peak load, the less likely it will become overloaded.

Partitioning Sessions

Partitioning also helps scale-out when communication sessions are required. In a two-tier architecture, if there are many clients and each client requires a session with every server, the result is a polynomial explosion in the number of sessions. For example, if there are 100,000 clients and each one has to connect to all 500 servers, then each server would have 100,000 sessions, resulting in 50,000,000 sessions overall (see Figure 2.17). Each session consumes some main memory and requires some setup time. When there are too many sessions, this session overhead can be troublesome. It can limit the ability of the server system to scale out by adding more servers.

Figure 2.17. Polynomial Explosion in Two-Tier Model. If there are f front-end programs and t transaction servers, then there are f × t sessions.


The total number of sessions can be greatly reduced by inserting a routing layer between the clients and servers that partitions the set of clients. Each router process connects to a subset of the clients and to all the servers. Thus, each client can still send messages to all servers, at the cost of an extra message hop through a router. See Figure 2.18.

Figure 2.18. Multilevel Routing. By introducing routers in between clients and servers, the overall number of sessions is greatly reduced, compared to the two-tier model of Figure 2.17.


Now say you have 10 routers between the clients and servers, and each client is connected to one router. Each of the 10 routers would have 10,000 sessions to its clients and 500 sessions to all the servers, resulting in 10,500 sessions per router, or 105,000 sessions overall, which is a huge reduction from the 50,000,000 sessions required without the routing layer.
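
The arithmetic behind these numbers, using the notation of Figure 2.17 plus r for the number of routers:

f, t, r = 100_000, 500, 10       # front ends (clients), transaction servers, routers

two_tier_sessions = f * t        # every client connects to every server: 50,000,000
routed_sessions = f + r * t      # each client connects to one router, and each
                                 # router connects to all servers: 100,000 + 5,000 = 105,000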

Grouping clients by routers can be based on geographical considerations. For example, all the clients on a given local area network might be serviced by the same router. More complex groupings may be needed for fault tolerance reasons. For example, the ATMs at a bank branch may be split across two routers over two separate communication lines, so the failure of one router still leaves half of the ATMs operating.

Replication

Another way to distribute workload is to replicate server processes and assign the replicas to different systems. The replicas are identical, in the sense that they can all process the same kinds of requests. This works especially well if the processes are stateless. In that case, each request can be assigned to any of the replicas, even if a previous request from the same client was processed by a different replica.

As in the partitioning approach, it is desirable to balance the load across the replicated servers. This can be done by having each request randomly choose a server to process the request, sometimes called spraying the requests across the servers. It can be done by the client that issues the request, or it can be done in a server system. For example, a network router that connects a server system to the Internet might have built-in load balancing functionality to forward messages based on round robin, least number of active connections, or fastest response time.
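
Either the client or a router can implement this spraying with a few lines of code; the sketch below shows both a random choice and a round-robin choice over a set of illustrative replica names.

import itertools
import random

replicas = ["server1", "server2", "server3"]
round_robin = itertools.cycle(replicas)

def choose_replica_random():
    return random.choice(replicas)      # spray requests at random

def choose_replica_round_robin():
    return next(round_robin)            # rotate through the replicas in turn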

Even if each client sends the same number of requests to each server, the load may not be distributed evenly, because one server may receive requests that require more time to service than those received by another server. To avoid this unbalanced load, each client can put requests into a queue that is shared by all servers, and each server dequeues a request whenever it is idle. Thus, each server obtains new work if and only if it has additional capacity to process it. The main disadvantage of this approach is the overhead of managing the queue. It needs to be accessible to all the servers, and clients and servers need to synchronize their accesses to the shared queue. We will have a lot more to say about queuing mechanisms in Chapter 4.
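
A minimal sketch of the shared-queue approach, using Python's thread-safe queue as a stand-in for the shared, synchronized queue described above; the process function is a placeholder for the server's application logic.

import queue
import threading

shared_requests = queue.Queue()         # one queue, accessible to all servers

def process(request, server_id):
    pass                                # placeholder for servicing the request

def server_loop(server_id):
    # A server dequeues a request only when it is idle, so new work flows to
    # whichever server currently has spare capacity.
    while True:
        request = shared_requests.get()
        process(request, server_id)
        shared_requests.task_done()

# for i in range(3):
#     threading.Thread(target=server_loop, args=(i,), daemon=True).start()
# shared_requests.put("debit account 12345")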

Replication interacts with caching. Suppose a server is replicated and a client process C issues a request r that accesses one of those server replicas, S. To process r, S may access remote data, which S saves in a cache. For example, C may be a web browser running on a desktop machine and S may be a web server at a site that has a large number of web servers running. The request may access a web page, which is cached by S. A given client often issues many similar requests. If C issues a request r′ that is similar to r and hence accesses the same data as r, then it would be cheaper to process r′ at S rather than at a different server replica that has not cached the data required by r′. In this case, we say that C has cache affinity for S. Although C can still access any of the server replicas, it performs better when accessing S than any of the other server replicas.

A more extreme example of affinity occurs when a server replica S is maintaining shared state with respect to a client C. In this case, it is essential that all requests from C be serviced by S, so the request has access to the shared state. Notice that this problem does not arise if C maintains the shared state. That is, if C includes the shared state with every request, then any server replica can process a request, because the server replicas are stateless with respect to C.

When replicas contain updatable data, updates must be propagated to all replicas to keep them identical. A common configuration is to require that all updates be applied to a primary replica, which forwards those updates to the other read-only replicas. This offers simpler synchronization than immediately broadcasting updates to all replicas, but introduces delay by passing all updates through the primary replica. Synchronization algorithms for replicated data are covered in Chapter 9.

Replication is a common feature of database systems. It can also be used to implement cache coherence. If a replicated cache is updatable, then a replication mechanism can be used to propagate updates from one cache to all the others.

Replication also is used to improve availability. If one replica is unavailable, then its workload can be handled by other replicas. Techniques for using replicas in this way are also covered in Chapter 9.

