
Chapter 8. Service Outages

You will learn about the following in this chapter:

  • The seven most common types of service outages and their causes

  • How to schedule maintenance for minimum business disruption

  • Best practices for performing maintenance within scheduled outage times

  • Assessing service performance for compliance with service level agreements

  • Effective procedures for responding to and resolving service outages

  • Analyzing the root causes of service outages

Information technology is the lifeblood of most organizations. Revenues, production, scheduling, sales, and many other business functions rely on fully functional IT services. Service outages can represent tremendous losses—in revenues, personnel time, customer goodwill, and other important business commodities. As such, outages—even those that are necessary for routine maintenance or system repair—are considered by most to be the great evil of information technology.

As a system administrator, your company expects you to rise up and quickly resolve every outage that rears its ugly head—whether it occurs during regular 9-to-5 weekday working hours, or at 3:00 a.m. on Saturday. System administrators are on-call heroes who must faithfully respond to outage notifications and quickly get systems back up and running, in order to minimize losses to the organization. All too often, in fact, system administrators are recognized more for their performance during outages than for the time and energy they invest in designing and implementing system infrastructures that suffer a minimum of such outages.

This chapter discusses some common types of Unix system service outages and how they apply to you, as the Unix system administrator, and your business. You learn about the metrics surrounding outages, including maintenance windows and service level agreements. And you learn the most effective procedures for dealing appropriately with outages and how to use each outage as a learning tool that can help you minimize similar outages in the future.

Types of Outages

After you've spent years as a system administrator and are looking back at all the service outages you've dealt with in your organization during that time, you'll probably discover that they can be divided into two groups: those that are preventable, and those that aren't. Preventable outages are those caused by human error or those you can see coming based on current monitoring data. Human error, although impossible to eliminate entirely, can be minimized with procedures that remove guesswork on the part of the administrator. Creating procedures for routine tasks like removing a server from a rack or rebooting a production server can prevent the occasional human mishap.

Other outages can be prevented because you can see them coming long before they happen. For example, a server with a disk at 50% capacity and usage increasing by 10% of the total capacity per week is likely headed for disaster in five weeks; you can see it coming well ahead of time. This is why proactive monitoring is so important: You're looking for potential problems rather than reacting to them. You can read more about proactive monitoring in Chapter 6, “Monitoring Services.” Graphing these trends over time can help you monitor your system for developing problems.
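To make the arithmetic concrete, a small script along the following lines could turn a filesystem's current usage into a rough "weeks until full" estimate. It is only a sketch: the filesystem name and the 10%-per-week growth figure are placeholders you would replace with numbers taken from your own usage graphs.

#!/bin/sh
# Rough estimate of how long until a filesystem fills up, assuming a
# constant growth rate.  FS and GROWTH_PER_WEEK are example values.
FS=/export/home
GROWTH_PER_WEEK=10    # percent of total capacity consumed per week

# Grab the "capacity" column from df output (column 5 on most systems).
USED=$(df -k "$FS" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
WEEKS=$(( (100 - USED) / GROWTH_PER_WEEK ))

echo "$FS is ${USED}% full; at ${GROWTH_PER_WEEK}%/week it has roughly $WEEKS week(s) left."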

But within those two broad outage categories are several more specific categories of outages based on the cause, duration, and extent of the outage. There is much more to an outage than the unavailability of a service. Although to your users they may all look the same, a system administrator needs to know more than just “the server is down” to assess the severity of each outage; proper categorization of an outage may determine how quickly you are required to respond, who in your organization you need to inform about the outage, and if you should notify users. There are an infinite number of causes for an outage, but they do fall into the following categories:

  • Scheduled maintenance

  • Unscheduled outages

  • Degraded service

  • Partial service outages

  • Complete service outages

  • Distributed service outages

  • Third-party outages

The sections that follow look more carefully at each of these types of service outage and some of the special demands each can place upon you as a system administrator.

Scheduled Maintenance

Routine maintenance is common on any system, whether it's a computer or a car. Some routine maintenance occurs with new technology or software releases. Just as you change the oil in your car every 3,000 miles, you patch or upgrade your operating systems as new patches and releases become available. Some routine maintenance is unexpected, yet still predictable. For example, you don't know when a disk, memory, or processor will go bad on your server, but you know that at some point you'll need to replace them, just as you know that eventually you'll need to replace your car's tires.

At the same time, you know a tire blow-out can cause a nasty accident. To avoid such surprises, you watch your tires for signs of wear, and you note their “wear guarantees” and the number of miles they've logged. That way, you don't depend on a blow-out to tell you that your tires are ready for replacement. The same is true of your Unix network. With proper logging and monitoring, you can anticipate and avoid many software and hardware meltdowns related to age and overuse. And you can schedule the replacement of those components for a time that causes you and your business the least disruption.

Planning for Routine Maintenance Outages

Scheduled, routine maintenance rarely has to be a critical “show stopper” for your IT department or your business. Routine maintenance issues won't bring down an entire system. Without question, processor, disk, and memory failures can have a dire impact on a system, but that's why smart administrators use logging to provide a “heads up” for impending failures. The administrator can then schedule a time to fix the problems on the system and make the appropriate announcements so nobody is taken by surprise, especially the users.

Scheduled maintenance puts unique responsibilities on the system administrator. Because this kind of maintenance (and the outage it requires) occurs at the administrator's discretion, he or she must gauge the severity of the problem and choose the best time to take action. You can find detailed information about monitoring logs and other metrics throughout both Chapter 6 and Chapter 11, “Performance Tuning and Capacity Planning.”

Don't Procrastinate Routine Maintenance

There are some inherent problems that come with the freedom of scheduling routine maintenance at your discretion, such as procrastination. Don't allow yourself to let a minor problem slip deeper and deeper into your pile of “must do” tasks until you eventually forget about the problem completely. Most neglected problems force themselves back into your attention when they escalate into full-blown outages. As you become aware of system maintenance needs, schedule a time to perform the maintenance and fix developing problems. Then stick with the schedule.


Scheduling Routine Maintenance Outages

Scheduled maintenance creates an outage in order to prevent an outage. If you think that sounds like nonsense, think again: Patching a server to avoid a potential problem typically involves rebooting the server, which causes a short outage. If your company is busily using a service, the employees are likely to consider the patching outage an unnecessary loss of productive time.

The justification for scheduling such outages can be difficult for nontechnical management to understand, as they may take the “if it ain't broke, don't fix it” attitude, especially in high-availability environments. To enlist the support of management, do your research and have your information ready to explain to them how the process you'll perform during this outage will prevent longer, more costly outages in the future.

If you're scheduling an outage to upgrade some part of the Unix system, make sure reluctant managers understand the benefits the upgrade will bring to the business. For example, replacing a slow, aging Web server with a new, top-of-the-line multiprocessor Web server may involve an outage while the switch is made. Managers are likely to accept the inconvenience of the outage, however, when they understand the benefits of the upgrade, such as the ability to handle more concurrent connections.

When you are actually scheduling your maintenance, you need to choose a time that minimizes the impact to your users but still allows you to perform the required work. The following guidelines will help you choose the appropriate time:

  • When possible, schedule work during your maintenance window, which should coincide with the periods of least usage for your services. Maintenance windows are described in detail in the “Maintenance Windows” section later in this chapter.

  • Urgent maintenance, such as applying a patch to stabilize a crashing system, should be performed as soon as possible. You may want to obtain the approval of your management on this issue.

  • If you require a support engineer from a vendor to perform the maintenance, ensure that your support contract covers the times you need the engineer to be present. Also verify that the engineer can meet your scheduling requirements.

  • Coordinate your maintenance schedule around the availability of your own staff. If it is impossible for the required staff to be available at the time you need them, you may need to reschedule for a time when those resources are available.

Commit to a Scheduled Maintenance Time

Scheduled maintenance is the only type of outage with a fixed time limit. System administrators don't typically release announcements that a critical server will be taken down at 2:00 a.m. without also announcing when the server will be put back into service. After all, the point of a scheduled maintenance is that users and management know when to expect a service to be down and how long the outage will last. Maintenance windows can help enforce time limits; you learn more about them in the section titled “Maintenance Windows.”


Perhaps the most important thing a system administrator can do to eliminate excess resistance to scheduled maintenance outages is to provide management with a clear and accurate plan for when the maintenance will take place and how long the outage will last. When management learns to trust your ability to schedule maintenance outages so that they cause minimal disruption to the business and then keep the outage “on schedule,” they'll be more likely to stop second-guessing you on this issue. You learn more about maintaining maintenance schedules in “Working Within the Window,” later in this chapter.

Unscheduled Outages

An unscheduled outage is any outage that occurs without warning. Even the most watchful and careful system administrator can expect to experience unscheduled outages on his or her network. The causes of unscheduled outages take many forms, such as a configuration glitch, hardware failure, human error, or even a building going up in flames.

Human error is one common cause of unscheduled outages. Some examples of human error that could cause an outage are as follows:

  • Typing the wrong command as root

  • Disconnecting the wrong cable from a server

  • Making a typo in a configuration file

  • Experimenting with new software or configurations on a production server

Most human error can be prevented if the system administrator and other technical staff take extreme care when working with or around production systems. Double-check all changes that you make, or, even better, use a change management system to approve and communicate your changes ahead of time. See Chapter 14, “Internal Communication,” for more details on change management.

Whatever the cause of an unscheduled outage, the events that prompt the maintenance are unexpected. Most unscheduled outages are out of the system administrator's control, as is the amount of downtime they cause. Minimizing the downtime incurred by an unscheduled outage is one of the many challenges a system administrator faces.

Document Outage Procedures

When you experience and resolve an unscheduled outage, document the procedures you took to solve the problem. You are likely to experience the outage more than once, and having documentation up front will help improve the response time for subsequent outages.


Real-World Example: Human Error Outage

An administrator at an ISP was decommissioning an old server in a densely populated rack, and he needed to unplug the server's power cable. The server's cables weren't labeled, and the administrator mistakenly pulled out the power cable for a Network Appliance filer that served email and Web site data for 60,000 users. This accident caused a massive outage. Needless to say, it was heart-stopping for the administrator to hear the critical filer's fans stop spinning! The filer was plugged back in and the administrator verified that everything was functioning properly before continuing to remove the old server. The outage lasted only 5 minutes, but it did not go unnoticed by the system's 60,000 users. This outage didn't have to occur; had the administrator properly labeled the cables and/or taken the time necessary to trace the proper cable in a densely populated rack that contained business-critical hardware, he could have avoided this embarrassing and potentially costly outage. In fact, the next day the administrator organized a cable labeling project to prevent this kind of mistake from happening again. However, even in well-designed systems, human error can bring everything crashing down and is the cause of more unscheduled outages than anyone is willing to admit.


Partial Service Outages

Some services provide only a single function. For example, POP has the sole purpose of allowing a user to download mail from a mail server. A system administrator easily can determine when a single-function service is down and the repercussions of that outage for the business. In the POP example, an outage means users get an error message when they try to download their mail.

Other services provide more functionality and, therefore, can present a more complex diagnostic and maintenance issue. A Web server, in addition to providing simple static Web pages, can also support file uploading; CGI scripts can process forms, and code on the back end can interface with databases. Add SSL to this mix, and you have an entirely separate secure Web server. Each Web server has its own unique functions, turning the simple HTTP protocol into a complex Internet service.

When somebody reports a Web server outage, therefore, the system administrator may have no clear idea what is causing the outage or what services the outage has taken down with it. Is the entire Web site down, or is just one part of the site down? Is just one script failing, or is a database on the back end down?

Complex services (such as this Web server example) can experience partial service outages where the service as a whole is up and running, but parts of the service are failing. For example, a banking Web site might be available on the Internet, but the page that lets you check your account balances produces an error. For these complex services, the sooner you understand what part of the service is down, the better you'll be able to resolve the problem and end the partial outage. You need users to give detailed problem reports that help pinpoint the nonfunctioning part of the service—and the responsibility for helping users supply this information lies with your help desk. Help desk staff should ask users to describe exactly what they were doing when they experienced the problem, as well as any error messages they received. The more detailed the problem report, the easier it will be for you to track down a small problem with a large service.
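When a vague "the Web site is down" report comes in, a quick probe of both the site's front page and the specific page users are complaining about can tell you whether you are facing a complete or a partial outage. The following sketch uses curl; the URLs are invented for illustration.

#!/bin/sh
# Probe the front page and the reportedly broken page; compare the results.
# A 200 on the first and a 500 on the second suggests a partial outage.
for url in http://www.example.com/ \
           http://www.example.com/accounts/balance; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
    echo "$url -> HTTP $code"
done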

Automatically Generated Error Reports

Many applications and operating systems provide a mechanism for automatically generating detailed error reports and sending them to the vendor's help desk. Although no Unix operating systems offer this yet, Network Appliance filers (network file servers) provide an “autosupport” functionality in which system status is periodically emailed to Network Appliance support, including after every reboot or crash. This report helps the support staff gather evidence that you would normally have to provide yourself, as well as allowing them to see potential problems before they cause an outage.


Complete Outages and Degraded Service

A complete service outage is the nightmare all system administrators fear. In a complete service outage, a service is 100% unavailable to its users. These outages are all too often the cause of 3:00 a.m. pages on a Saturday.

Not all outages are as catastrophic as those just described. Sometimes problems just cause a service to become degraded. Much as a power brownout causes lights to dim but not fail, users can still use a degraded service; it just doesn't perform as well as it normally would.

A POP3 mail service example can help illustrate degraded service outages. Users of these servers are used to clicking on a “Retrieve Mail” button on their clients and receiving their new mail within seconds. However, during an outage in which the mail server is overloaded with incoming mail, users experience delays of up to two minutes before receiving any messages. Another two-minute delay separates subsequent message retrievals. The mail service is technically working, but it is painfully slow and not practical to use.

Other examples of degraded service include the following:

  • Slow Web server response due to excessive network traffic

  • A single failed Web server in a pool of servers, causing a fraction of HTTP requests to fail

  • An application causing excessive paging, slowing down other services on the same server

Service Monitors Detect Degraded Services

As you learned in Chapter 6, service monitors can detect degraded service using timeouts. If, in the POP3 example just given, the system administrator had used NetSaint's POP3 service monitor and set its timeout to 30 seconds, the administrator would have received an alert for that two-minute delay.
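You can also run the same kind of check by hand while investigating a complaint. The sketch below assumes the check_pop plugin that ships with the NetSaint/Nagios plugin collection is installed; the path and hostname are examples only.

# Time-limited POP3 check: exits non-zero if the server does not answer
# within 30 seconds, mirroring the monitor timeout described above.
/usr/local/netsaint/libexec/check_pop -H mail.example.com -t 30
echo "exit status: $?"    # 0 = OK, 1 = warning, 2 = critical/timeout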


Degraded service, while not a complete service outage, still causes significant problems for end users. Degraded service problems can be among the most difficult to diagnose, because the service is still functioning; you have to find the parts that are failing, investigate the causes, and fix the problems.

Degraded service usually indicates that one or more parts of the service are under stress. In the earlier example, the POP3 service was slow. The logical place to start looking for problems in that situation is in the server itself. Many subsystems can be stressed on a Unix server, including CPU, disk, network, and memory, and the system administrator must examine them all to find the problem. Chapter 11 discusses such diagnostic examinations in detail.

Distributed Service Outages

A running joke among system administrators offers this definition of a distributed service:

A distributed service is one in which a server you have never heard of, in a place you've never been, can cause the machine on your desktop to crash.

As Homer Simpson likes to say, “It's funny because it's true.” Distributed service outages differ from other kinds of outages in one important way: Although most outages on server A result from a failure on server A, a distributed service outage can occur when a failure on server B causes a failure on server A.

A distributed service (all jokes aside) is one that resides on a remote system but is critical to the functioning of another system. DNS (domain name system) and NFS (network file system) are good examples of distributed services. DNS provides critical name resolution for most Internet applications, but it resides on remote servers. NFS serves file systems to remote clients, some of which are critical to the operation of those clients. If DNS servers become unavailable, Internet applications will grind to a halt. A failed NFS server housing shared applications could render all of those applications unavailable to its clients.

To better understand the dynamics of distributed server outages, consider the example of the NFS file server. Instead of installing gigabytes of software on each system in your infrastructure, you can install the software once on an NFS file server and share it with the other servers. Now imagine that you've installed user shells on that NFS server—shells such as bash and tcsh that aren't available by default on your other servers. If the file server ever becomes unavailable, so do the shells. If the shells are unavailable, users can't log in to any of your servers. A failure on the NFS file server has caused an outage on your other servers.

Perhaps the most frustrating aspect of a distributed outage is that the remote server that is failing may be under somebody else's control. You might be responsible for Web servers, but maybe an entirely different team in your company administers the DNS servers that just went down. Although you have done nothing wrong and your servers are functioning normally, somebody else's servers that you rely on can cause an outage in your system, and there's nothing you can do about it but complain and wait.

Real-World Example: A Distributed File System

AFS is a large-scale distributed file system often found at universities. Files are stored in volumes located on any number of file servers anywhere on the network. At one university, the popular email program Pine was installed in an AFS volume, along with all of the other shared applications that over 60,000 students, staff, and faculty used every day. Pine was the university's primary email program at the time. The AFS file servers were running on aging hardware, and as such were constantly crashing due to ever-increasing load. The administrators responsible for the machines running Pine were fed up with the outages, especially when the volume housing Pine became unavailable. To resolve this problem, the administrators installed Pine locally on each server, making the most popular application available regardless of AFS volume unavailability. Working with multiple copies of the program was less convenient, especially during upgrades. But the administrators preferred that small hassle to repeatedly explaining outages that weren't caused by any factor within their control.


Distributed outages are very difficult to prevent, since the whole point of distributed services is to offload services onto other systems, and you are often not in control of those systems. However, there are several steps you can take to minimize the impact a distributed service outage has on your own systems, as follows:

  • Ensure that there are redundant resources providing each service—especially for critical services like DNS and file servers (a resolv.conf sketch follows this list).

  • Document the dependents of each distributed service so you know what will be affected when a service fails.

  • Deploy locally any “problem services” that fail too often, so you can control these services yourself. See the previous “Real-World Example” for an example.
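To the first point, DNS redundancy on the client side is often as simple as listing more than one name server. A minimal sketch of /etc/resolv.conf might look like the following; the domain and addresses are placeholders, and the options line assumes a resolver that honors timeout and attempts settings.

# /etc/resolv.conf with redundant name servers (example addresses)
search example.com
nameserver 192.168.1.10
nameserver 192.168.2.10
options timeout:2 attempts:2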

Third-Party Outages

Third-party outages occur when a system owned by another organization fails and causes a failure on one of your systems. Though similar to distributed outages, third-party outages differ in one important way. A distributed outage actively involves the use of a service on the remote system that causes the failure. In third-party outages, the system administrators whose systems suffer the outage don't even know the remote system exists, and they certainly don't use any services on it.

The classic example of a third-party outage is a backbone failure. Everyone depends on backbones, which typically run at speeds between 45Mbps (DS3) and 2.4Gbps (OC-48), to connect networks around the world to form the Internet. Multiple backbones provide several different high-speed paths between any two machines on the Internet; the Internet wouldn't function without them.

When a major backbone goes down, everyone in the country knows it! No traffic can get from one part of the backbone to another until routers eventually remove the route to the failed backbone and find other ways to route packets to their destinations. The most obvious symptom of this problem is a loss of connectivity to servers you access every day, especially those across the country. If your company suffers this kind of third-party outage, many of your customers will lose connectivity to your services. It's scary to know that a piece of hardware you never asked to use could cause such a major outage for your organization, but that's the nature of the shared network called the Internet.

Some examples of third-party outages include the following:

  • A major Internet backbone outage (such as a backhoe cutting into fiber optic cabling underground) prevents you from reaching certain sites.

  • A client's mail server is down, preventing you from sending email to them.

  • The router that terminates your T1 line at your ISP fails, causing your organization to lose all connectivity to the Internet.

Third-party outages are out of your control as a system administrator. The most important thing you can do is to report any outage to the third party and keep track of any tickets that the third party opens for you. Report these tickets to your help desk and explain the situation to help desk staff so they can adequately update your users who call in to report the problem. Users should know that the problem is out of your control, but that it has been reported and is being worked on. Check back periodically with the third party to verify progress is being made on resolving the outage.

Maintenance Windows

One of your first responsibilities as a Unix system administrator is to specify your organization's maintenance window—the time reserved for routine scheduled maintenance tasks such as rebooting routers, upgrading servers, adding disk drives, and so on. Maintenance windows specify a time when service is not guaranteed so that administrators have time to fix minor problems or upgrade servers.

Routine work such as hardware racking and application installation can be done outside of the maintenance window. But if you are planning to do any work that requires system downtime, or even if you are planning to do work that has only a slight chance of bringing something down, do it during the maintenance window. You'll save yourself a lot of trouble if something does go wrong.

You need to consider three factors when choosing a maintenance window: time of least usage, maximum maintenance time, and business requirements. The following sections discuss these factors in detail.

Time of Least Usage

Common sense dictates that you don't want to bring down your systems when all of your users are using their services. The best times for a maintenance window are during the low points of system usage. By routinely monitoring your services, you can easily determine the hours during which they receive the least usage. Throughout Chapter 6, you will find many of the tools you can use to do this. MRTG is one such tool.

Figure 8.1 shows an MRTG graph of Internet traffic at a fictitious company. The graph clearly indicates that the low point of usage for this system is at about 5:00 a.m., and that makes the perfect time around which to specify a maintenance window. Graphing your own system use can help you determine the best time for your maintenance window, as well.

Figure 8.1. An rrdtool (an MRTG-like application) graph of network bandwidth usage clearly shows that this organization's optimum maintenance window is between 3:00 a.m. and 7:00 a.m.
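If you already collect traffic data into an RRD file, a graph like Figure 8.1 can be produced with a command along these lines. The RRD file name and the data source name ("in") are assumptions; substitute whatever your collector actually records.

# Graph the last week of inbound traffic to look for the daily low point.
rrdtool graph /var/www/html/traffic-week.png \
    --start -1w --end now \
    --title "Internet traffic, last 7 days" \
    DEF:in=/var/rrd/internet-traffic.rrd:in:AVERAGE \
    LINE2:in#0000ff:"inbound bits/sec"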


Track Usage over Time

Note that you shouldn't choose a maintenance window based solely on one day's usage logs; look at trends of usage over a week or two, and find the average usage lows. Also take all of your services into account and look for common low points that you can take advantage of.


Different types of businesses have different trends in high and low usage points. ISPs usually peak around 8:00 p.m., when everyone is home checking their mail and surfing the Web, and have low points around 4:00 a.m. Universities tend to have a lot of night owl students, so their usage may peak around 10:00 p.m., with the least usage at 4:00 a.m. Regular 9-to-5 businesses peak around 1:00 p.m., with a small dip around noon for lunch. Minimum usage is between 6:00 p.m. and 6:00 a.m. International business complicates the analysis even further. Users in London might be using your service heavily while everyone in the United States is still sleeping.

Only a thorough analysis of your data can tell you the low usage time for your own system, but determining when that time occurs is critical for assigning an effective maintenance window. You need to understand the daily operations of your business, in order to specify the most effective (and least intrusive) maintenance window for everyone.

Maximum Maintenance Time

After you've discovered the time of least usage for your services, you need to decide how much time to allow for maintenance. Allow yourself enough time to fix the most complex of problems without extending maintenance time into periods of significant usage. Typical maintenance windows last anywhere between 2 and 6 hours, with 3 to 4 hours being the norm.

Leave Back-out Time Within Your Window

Always include enough time in your maintenance window to back out any changes you made before the window expires. Not every maintenance job is successful, and you want to allow yourself enough time to clean up any mistakes you made and regroup. You should reserve at least 25% at the end of your maintenance window for this back-out time. If the processes you are performing are unfamiliar to you, allocate even more time to account for the learning curve.


Business Requirements

Your business may have specific requirements that will play a role in determining your optimum maintenance windows. Client contracts may guarantee that services will be available during certain hours—sometimes client contracts even specify the maintenance window for you. To make matters even more complicated, different contracts could specify different maintenance windows—a situation that becomes a real nightmare when working on shared systems, such as a router.

Beyond contractual requirements, some systems operations may depend on services being up at certain times. If a bank generates monthly statements between 12:00 a.m. and 6:00 a.m. on the last day of each month, you can't fix servers at that time. Remember to take your backup schedules into account, as well; don't interfere with backup infrastructure without either disabling or moving the backup schedule for that day.

One very effective method of coordinating all of this information is to keep a simple calendar and post each event, including maintenance windows, scheduled outages, and uptimes required by service level agreements. Recording events on a paper calendar might work for small environments, but an electronic calendar works best for larger organizations. Calendaring software comes standard with the GUI in most operating systems. These calendars immediately catch conflicts and warn you, for example, if scheduled maintenance occurs in a time frame that a client has required your services to be available.
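Even without full calendaring software, a small script can catch the most obvious conflicts. The sketch below flags a proposed maintenance window that overlaps a period an SLA reserves; the dates and times are hypothetical, and the epoch conversion assumes GNU date.

#!/bin/sh
# Flag a maintenance window that overlaps an SLA-protected period.
MAINT_START=$(date -d "2002-02-28 03:00" +%s)
MAINT_END=$(date -d "2002-02-28 07:00" +%s)
SLA_START=$(date -d "2002-02-28 00:00" +%s)      # monthly statement run
SLA_END=$(date -d "2002-02-28 06:00" +%s)

# Two intervals overlap if each one starts before the other ends.
if [ "$MAINT_START" -lt "$SLA_END" ] && [ "$SLA_START" -lt "$MAINT_END" ]; then
    echo "CONFLICT: proposed window overlaps a period reserved by an SLA"
fi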

Working Within the Window

After you've established your maintenance window, you should honor its boundaries and perform only routine scheduled maintenance within that window. If you start to bring services down before or after the window, you will likely affect users who expect the services to be up, and lose the trust of those users as well as your management.

One problem with maintenance windows is that your work often unintentionally runs past the end of the window into the normal operating hours of your services. One way to stay within your window is to set a maximum time for instituting the scheduled changes, after which you will back out the changes and end the outage, regardless of circumstances. This is a tricky game to play, however; you must balance the need to complete the change with the need to stay on schedule. If you are only running 10 minutes behind schedule, it might not be worth it to back out all of your work; but if you are running an hour behind, it makes sense to back out because your users will definitely notice the longer outage.

Ultimately, your management should make these kinds of decisions, especially if the outage affects a large portion of your user base. If you do back out because of time constraints, regroup and figure out where the bottlenecks occurred before trying again; don't repeat the same mistakes you made the first time.

Monitoring Compliance with Service Level Agreements

Customers expect a certain level of service from their providers, especially those customers who have signed contracts specifying those levels. These agreements, called service level agreements (SLAs), call for administrators to closely monitor the uptime of their services, as contracts depend on those numbers. An SLA can be specified for any number of metrics, though the two most common metrics are uptime and response time.

Monitoring Uptime Compliance

The most important measure of service is uptime. Simply put, for what length of time can your users access and use your services? In Chapter 6, you learned of the difference between availability and usability. The distinction between these two conditions is important when monitoring uptime and even more important when measuring it.

While a service may be available for use, it is not considered “up” unless users can use the service as they normally would. A mail server that accepts user connections but fails with a “permission denied” error is a system that is available, but not usable. Uptime includes only usable service hours.

Uptime is often measured as a percentage of the total time the service could and should be usable. Some businesses like to report uptime once a month, others once per year. In any case, the goal most organizations strive for is 99.9% uptime, or “three nines.” Even this amount of uptime assumes that all downtime will be used for short-lived routine maintenance.

A ratio of 99.9% uptime works out to be just under 9 hours of downtime per year. Assuming you have no unscheduled outages, this amounts to 2 major 4-hour maintenance windows per year, or 8 short 1-hour outages. Throw some random outages in there, and you will ultimately have less time for patching, upgrades, and whatever else you do during maintenance windows.

Other organizations go for the gold and try to reach “the five nines,” or 99.999% uptime. This uptime percentage allows only about 5 minutes of downtime per year, a lofty goal for sure, but not completely out of reach, as you learn in the discussion of high availability in Chapter 10, “Providing High Availability in Your Unix System.”

Deciding Uptime Requirements

Your services' uptime requirements should be dictated by your organization's management and the clients for which you have service level agreements. However, every service requires downtime for routine maintenance; for example, if you need one hour per month of downtime for patches or upgrades, communicate that to your management so no contracts are signed with an SLA that allows less than one hour of downtime per month.


Reporting uptime can be tricky. Accurate reporting requires constant monitoring of all of your services, without failures in the monitoring system. In addition, the granularity of your monitoring intervals becomes more critical as your uptime demands increase. For example, if you monitor each of your services once every 15 minutes, you'll miss many outages that last less than 15 minutes. Furthermore, the length of every outage reported by the system can be off by as much as 30 minutes. For example, a 17-minute outage and a 43-minute outage occurring in a system with 15-minute interval monitoring might both appear as 15-minute failures, as shown in Figure 8.2. That's a discrepancy of 26 minutes—time for which you do not know the status of your service. A 30-minute uncertainty is unacceptable in a five-nines environment, where the total allowed downtime is only about 5 minutes per year.

Figure 8.2. You can't always tell the difference between a 17-minute outage and a 43-minute outage if the monitoring interval is 15 minutes. The gray area represents the outage length of 15 minutes that would be reported by the monitoring software.


A monitoring interval of 1 minute might be more appropriate in this environment; in that case it is much easier to tell the difference between a 1-minute outage and a 5-minute outage. In addition, with that much data, it's much easier to prove your uptime to clients.

The one rule you should take away from this section is that the monitoring interval for a service should be well below the amount of downtime its SLA allows. These smaller intervals let you report actual downtime with more precision, as well as detect short-lived failures that otherwise would go unnoticed.
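A quick back-of-the-envelope calculation makes the rule concrete. The sketch below converts an uptime percentage into a yearly downtime budget and compares it with the monitoring interval; the 99.9% figure and 15-minute interval are only examples.

#!/bin/sh
# Convert an uptime SLA into minutes of allowed downtime per year, then
# check whether the monitoring interval is fine-grained enough.
SLA=99.9              # promised uptime, in percent
INTERVAL_MIN=15       # monitoring interval, in minutes

awk -v sla="$SLA" -v iv="$INTERVAL_MIN" 'BEGIN {
    budget = (100 - sla) / 100 * 365 * 24 * 60
    printf "%g%% uptime allows about %.0f minutes of downtime per year\n", sla, budget
    if (iv >= budget)
        print "WARNING: one missed check could consume the entire downtime budget"
}'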

Netcool's Reporting Functionality

Netcool contains powerful SLA reporting functionality. Netcool can report on historical service levels from a database and even notify you when current service levels cross a predetermined threshold such as 99% availability per hour.


Monitoring Response Time Compliance

The second most tangible aspect of any network service is its response time. How long does a service take to perform and respond to a user's requests? As it plays such a large part in the user experience, most companies dedicate large chunks of time to optimizing their services' response times. Chapter 6 introduced several representative monitoring tools that were able to provide response time statistics, including Netcool and NetSaint.

Response time failures and other timeouts usually qualify as downtime when measuring service levels and should be recorded as such. If you are lucky enough to actually be involved in a service level contract negotiation, look for this clause and verify that you can perform at the levels that are specified. If the contract expects a Web site to respond within 5 seconds for every request, make sure your systems can meet that requirement! Both Chapter 6 and Chapter 11 present information that can help you determine whether your services can perform as requested, and if not, whether they can be tuned to do so.
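A simple spot check of response time is often enough to tell whether you are anywhere near such a clause. The sketch below times a single request with curl and compares it against a hypothetical 5-second limit; the URL is an example.

#!/bin/sh
# Time one HTTP request and compare it against a 5-second response clause.
t=$(curl -s -o /dev/null -w '%{time_total}' http://www.example.com/)
echo "response time: ${t}s"
awk -v t="$t" 'BEGIN { if (t > 5.0) print "SLA MISS: slower than 5 seconds" }'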

Observing Production Values

This isn't a book on morality, but every system administrator should know and obey his or her own set of production values. Production values are the rules that minimize risk on production servers and can prevent outages from occurring in the first place. Production values may differ from organization to organization, and even from person to person, but they should all include your department's commitment to honor these basic promises:

  • To use production servers appropriately

  • To announce all maintenance

  • To watch logs and monitors

  • To respond quickly to outages

Establishing and honoring production values is essential to building credibility and respect for you and your IT department. To better understand the issues involved in each of the basic values listed here, read the sections that follow.

Using Production Servers Appropriately

Systems are often broken down into three categories: development, staging, and production. Each of these system types has a specific use, and the production system is the most critical to the business of your organization. Your first production value should include a commitment to use the production servers as they should be used, to protect their service to your organization.

To use production servers wisely, you need to understand how all three system categories are used. Development systems are used for testing and developing new services. It doesn't matter if development systems are up or down; a business's financial well-being doesn't depend on those systems (although some developers might complain).

Staging systems are where new services are migrated for testing in a production environment before actual deployment onto production systems. Staging systems usually are designed to look exactly like the production environment, so people can get a good idea of how services will behave in production. Not everyone has or can afford a separate staging environment; in that case, development systems often play this role.

Production systems are the key to your business. They are where the final versions of your services are deployed and made available to your users. Their uptime is critical to your organization's success.

It's important to use these systems appropriately. Installing a production Web server on a development server is a bad idea. You're probably not monitoring that server, and it may not have the capacity to handle your production load. At the same time, you shouldn't develop on a production machine. Systems in production have one purpose, and that is to serve users. Developing on those systems takes away vital resources, such as CPU and memory, from your production applications; the loss of those resources can cause production applications to underperform. Even worse, you might overwrite a configuration file and cause a complete service failure.

You can drastically minimize service outages simply by using production systems appropriately. Do your development and testing on development systems, and let the production environment do its noble job of servicing your users.

Announcing All Maintenance

Users usually don't know and don't care about the day-to-day work you perform on your systems, but they do care if the services they use go down without warning. When you are faced with maintenance, no matter how minor, that could potentially cause an outage, you should announce that maintenance to your users.

Your announcement should specify what you are doing in high-level layman's terms and give users an accurate estimate of when the work will be done. An email like this would be appropriate:

From: Chad Admin
To: Widget Users
Subject: Maintenance

On Sunday February 3 at 2 AM, Widget system administrators will be replacing a
failing disk in the disk array that stores the data for the Widget application.
The work should take no longer than 30 minutes, and no downtime is expected.

Thank you,

The IT Staff



A follow-up email documenting the success or failure of the maintenance would be appropriate as well. You should also think about what means of communication to use to make these announcements; an email will deliver the announcement to each user, whether he or she goes looking for it or not. Making sure all users are informed is essential for critical situations. Less critical work can be posted on a Web site or a newsgroup so users aren't force-fed useless information, but can still be informed about upcoming issues. Chapter 15, “Interacting with Users,” discusses the use of these and other forums for communicating information to your users.

Grabbing Users' Attention

Emphasize the urgency of maintenance announcements by crafting attention-grabbing subjects. Using words like “URGENT,” “WARNING,” “NOTICE,” and “OUTAGE” in all capitals will cause most users to pay attention to your announcements, when they would otherwise ignore or delete them. Only use these words for announcing downtime or other urgent events, though—you don't ever want to cry wolf in these situations.


Watching Logs and Monitors

You could have the most verbose logging in recorded history and the most precise monitoring that today's technology can offer, but they'll do you no good if you don't pay attention to their output. When your monitoring system notifies you that there's a problem, even with the most minor parts of a system, take it seriously and investigate further. Even a minor problem can be an indication of a greater problem.

You should never rely solely on your monitoring system to reveal system problems; review your logs daily to note any anomalies. System logs contain more information than could possibly be understood by log analyzers such as logsurfer (documented in the section titled “Log Monitoring” in Chapter 6), and it's up to you to look for any anomalies that you haven't configured your software to detect. Log analyzer programs are invaluable tools, but they are useless without your configuration. Take some time every day to look at the logs for your critical systems and become familiar with their contents. After you get to know the usual contents of a log file, it is much easier to pick problems out of the thousands of familiar log messages.
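One way to make the daily review faster is to compare today's log against yesterday's and look only at what changed. The sketch below is deliberately crude; the log paths follow a common Linux layout (Solaris keeps its logs under /var/adm), and the normalization only strips timestamps and PIDs.

#!/bin/sh
# Crude "what changed since yesterday" pass over syslog.
normalize() {
    # Strip timestamps and PIDs so identical messages group together.
    sed -e 's/^[A-Z][a-z][a-z] [ 0-9][0-9] [0-9:]*//' -e 's/\[[0-9]*\]//g'
}
normalize < /var/log/messages.0 | sort | uniq -c | sort -rn > /tmp/log.yesterday
normalize < /var/log/messages   | sort | uniq -c | sort -rn > /tmp/log.today
diff /tmp/log.yesterday /tmp/log.today | head -40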

Tweak Log Analyzers Along the Way

As you discover the log patterns that correspond to new problems, reconfigure your log analysis tools to recognize those new patterns (a simple process with tools such as logsurfer). These periodic reconfigurations save you time and effort when dealing with recurring problems.


Responding Quickly to Outages

During an outage every minute counts, especially when service levels are involved. When you receive notification of an outage or potential outage, respond quickly. Not only will it reduce the total length of the outage, but people (and clients) are less likely to notice short-lived problems. This is why you should institute proximity requirements for all on-call staff (see Chapter 5, “Support Administration”). Someone who is no more than 30 minutes from your data center is probably going to respond to an emergency faster than someone who is 2 hours away visiting family.

Provide Remote Access for Administrators

Many outages are software-related and can be solved remotely, eliminating the need for traveling to the data center. To take advantage of this capability, however, all of your on-call administrators must have some kind of remote access—a dial-up ISP, ISDN, DSL, or a cable modem. If remote access becomes a job requirement, your organization should pay for this access.


Outage Procedures

Some administrators joke that they'd need to start ripping cables out of their data centers and put them back an hour later to be recognized for outstanding performance during an “outage.” Right or wrong, a Unix network's outages (or lack thereof) often are the metric with which the Unix system administrator is judged. What was your uptime this year? Did you meet your clients' service levels? Remember that time your mail server was down? Both clients and management care about outage issues, so it's worth your time to craft procedures that will help you minimize outages and their downtimes.

Escalation procedures exist to move a problem up the chain of command until eventually someone in the chain can solve the problem. Procedures shouldn't end at that point though. Developing procedures for handling outages will ensure that nobody misses critical tasks such as handling communication and updating trouble tickets. The actual procedures should be very specific to your organization, but the general guidelines discussed in the following sections can help you get started.

Assigning Problems to Appropriate Staff

The help desk may assign a problem to your group, but not everyone in the group, including the on-call person, is the best fit for every problem. Your group probably has a variety of expertise: some of you may be senior administrators, some junior. Others may know more about Linux than Solaris. Still others may have extensive experience in operating systems but won't touch hardware with a 10-foot pole.

Know your IT staff's strengths and weaknesses and use that information to assign problems to the right person. Even if you are the on-call person, don't spend 2 hours on a problem that another member of your group can fix in 2 minutes. Keep a contact list for your team and ask for assistance when necessary. On-call duty shouldn't mean that you have to solve every problem, but you should certainly be responsible for orchestrating the problem-solving process in the most efficient way possible.

Maintain Ongoing Communication

It's only natural when dealing with a difficult problem to focus all of your energy on solving the problem, while blocking out all other external stimuli. This may speed up your own problem-solving process, but it leaves everyone else in your organization wondering what's going on. Always keep the communication lines open and send back as much information as possible to the help desk, your team, and your managers if necessary.

They in turn can keep other parties informed, like clients and senior management. Periodic check-ins can help facilitate this communication. During long outages, checking in with the help desk every hour or so is a good practice. These periodic check-ins also keep you informed of how serious the outage is perceived to be from the users' point of view.

Use a Headset to Let You Keep Working

Purchase a phone headset to use during emergencies at the data center, so you can work and talk at the same time. The headset is especially helpful when talking to technical support, who usually ask you to type commands.


Of course, after the problem is resolved or you've come to a crossroads (maybe you need to order parts or wait for vendor support), contact the help desk immediately and update the status of the problem. This can also be done with a trouble ticket system like Remedy or req, which streamlines the whole process for you.

Maintain Activity Logs

You may remember and understand everything about an outage the instant you finish working on it, but it's a good bet that you'll forget about 50% of what you did the next day. Keeping a detailed activity log will help you document the entire problem solving process, including command output, vendor contacts, and timelines.

All good trouble ticket management systems provide some sort of logging functionality (Remedy likes to get personal and calls it a diary). These logs are invaluable tools for both future reference and for analysis of a problem that just occurred. A very simple but typical log entry might look like this:

Wed Jan 30 2002 21:32:00 brian: Users experiencing slow response time on the mail server "goat". I am working on the problem.

Wed Jan 30 2002 21:44:00 brian: Logs indicate a failing disk (see below). Will call vendor to replace the disk.

Jan 30 13:35:58 goat scsi: [ID 107833 kern.warning] WARNING: /pci@1f,2000/SUNW,ifp@1/ssd@w21000020374f91d8,0 (ssd15):
Jan 30 13:35:58 goat    Error for Command: write(10)    Error Level: Retryable
Jan 30 13:35:58 goat scsi: [ID 107833 kern.notice]    Requested Block: 12214112    Error Block: 12214112
Jan 30 13:35:58 goat scsi: [ID 107833 kern.notice]    Vendor: SEAGATE    Serial Number: LS934473
Jan 30 13:35:58 goat scsi: [ID 107833 kern.notice]    Sense Key: Aborted Command
Jan 30 13:35:58 goat scsi: [ID 107833 kern.notice]    ASC: 0x47 (scsi parity error), ASCQ: 0x0, FRU: 0x3
Wed Jan 30 2002 21:56:00 brian: Vendor thinks it's a bad Gigabit card. Will send new card and disk by 10:30 AM tomorrow. Our case number is 234763.

Thu Jan 31 2002 10:45:00 brian: Disk & card received. Sending steve to replace the card during maintenance window tonight.

Fri Feb 1 2002 02:35:00 steve: Card replaced successfully. Errors seem to have disappeared. Will ship new disk and bad card back to vendor.



This log shows the progress of the problem resolution process, including important log data. This data will be very useful in the future, as it was assumed that goat had a disk problem when it was really the gigabit card that was failing; that assumption can be avoided next time now that this data has been logged.

Reference the Activity Logs During Outages

You should consult old activity logs when new outages occur. There may be tips and tricks in those logs that can save you time and effort when dealing with identical or similar problems.


Remain Calm

It is difficult to remain calm in highly visible outage situations, but you can't debug a problem and execute highly technical processes while you're running around like a chicken with its head cut off. As an old coworker once said, “You get ice water in your veins,” meaning that as you experience various outages and problems over the years, you become more and more calm even in the most dire of situations—your blood no longer boils at the mention of the word “outage.”

The more panicked you are during an outage, the more likely you are to make a mistake, possibly worsening the situation. What's worse is that panic is contagious—if you are running around your office or data center screaming, other members of your staff are likely to start doing the same.

Lead by example and stay calm; analyze the problems you need to solve, and take things one step at a time. If other administrators around you are panicking, ask them politely to leave, as they only add to the problem at hand. Nontechnical coworkers are likely to stop by and ask what's going on; it is very easy to get angry at them for bothering you during an outage, but instead simply ask them to go back to their desks and let you do your job. Your managers may need to be told this as well; you cannot possibly remedy a major outage with a manager looking over your shoulder reminding you how much money the outage is costing the company. Just ask everyone to leave you alone so you can fix the problem.

Root Cause Analysis

All problems, no matter how complex, have a root cause. A root cause is where a problem originated—the spark that caused the fire. Sometimes finding the root cause of a problem is easy. In the example of a problem discussed in the activity log text listed in the preceding section, the root cause of the slow response time on goat was a failing gigabit card. Sometimes it's not so easy; often the problem must be traced back further through many steps to find out what action truly caused the problem.

For example, a user might call into your help desk saying that she isn't receiving any of the email her friends are sending to her. Upon further investigation, you find out that the file system housing her mailbox is full. You reclaim some space, and mail begins flowing into the system again. What is the root cause of the problem? Was it the full file system? That certainly caused the user's problem, but what caused the full file system? Maybe you were suddenly sent a massive amount of spam that filled up the mailboxes on your system. In that case, the spammers are to blame, and you can block them from any further access. The root cause was the spammers, and you remedied the problem by altering your SMTP rule set to deny them access to your mail systems.
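In an investigation like this one, a minute spent looking at where the space actually went usually points you toward the real cause rather than a symptom. A quick sketch, assuming user mailboxes live under /var/mail:

# See how full the mail filesystem is, then find the largest mailboxes.
df -k /var/mail
du -sk /var/mail/* | sort -rn | head -10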

Perhaps instead the full file system was the result of a gradual increase in usage over the past few weeks. Administrators were either not aware of or ignored the trend of increasing disk usage. The root cause of the problem in this case is the ignorance of the administrators, which could be remedied by implementing new monitoring procedures or configuring a software monitor to report excessive disk usage before it becomes critical. This example perfectly demonstrates a situation in which proactive monitoring (discussed at length in Chapter 6) is more effective than reactive monitoring.

What can root cause analysis do for you beyond assigning blame? It helps you identify the real causes of your problems rather than the immediate causes or symptoms. While you can deal with immediate causes as they are found, eliminating root causes can make their resulting problems disappear forever, as there is no more seed from which outages can spawn.

In general, identifying a root cause requires you to trace the development of a problem from its symptoms back to the event or condition that set the problem into motion. After determining what caused the actual symptoms of the problem, you must determine what caused those problems, and so on, until you finally reach a problem for which there is no cause that you can remedy—that will be the root cause of the problem. In essence, you are creating a genealogy of the problem, tracing its roots back to the beginning.

Avoid Band-Aid Solutions

Band-Aid solutions are those that mask the symptoms of a problem but do not actually eliminate its cause. In the spamming scenario mentioned previously, for example, adding more disk space would prevent the mailbox file system from filling up, but only temporarily. This is a Band-Aid solution; to truly solve the problem, the root cause—in this case, spammers—must be found and remedied.


Summary

Managing outages can be a challenging part of a system administrator's job; often, these outages can be scary and overwhelming. However, you can't let these outages control your daily life or take over your IT department. As you resolve problems over the years, you learn how to better analyze new problems and fix them using previous experience. Your calm will eventually overtake your anxiety, and you will be handling outage situations with composure you didn't know you had.

This chapter introduced the most common types of outages that can occur and how to manage them. Some outages are created on purpose to service hardware and software; these scheduled outages should be performed within a designated maintenance window. When outages do occur, it is important to accurately measure how long they last to monitor your compliance with service level agreements (SLAs). Your actions during an outage are important as well; setting production values such as responding to problem reports in a timely manner and documenting outage procedures will help you and other administrators deal with problems more effectively. Finally, when an outage is resolved, you should perform a root cause analysis to determine the true cause of the problem and fix it; this goes a long way toward eradicating those outages from your infrastructure for good, and fewer outages make for a happier system administrator!

