Chapter 8. Service Outages

You will learn about the following in this chapter:

- The common types of service outages and their causes
- Maintenance windows and service level agreements
- Procedures for handling outages and preventing their recurrence
Information technology is the lifeblood of most organizations. Revenues, production, scheduling, sales, and many other business functions rely on fully functional IT services. Service outages can represent tremendous losses in revenues, personnel time, customer goodwill, and other important business commodities. As such, outages, even those that are necessary for routine maintenance or system repair, are considered by most to be the great evil of information technology.

As a system administrator, your company expects you to rise up and quickly resolve every outage that rears its ugly head, whether it occurs during regular 9-to-5 weekday working hours or at 3:00 a.m. on a Saturday. System administrators are on-call heroes who must faithfully respond to outage notifications and quickly get systems back up and running, in order to minimize losses to the organization. All too often, in fact, system administrators are recognized more for their performance during outages than for the time and energy they invest in designing and implementing system infrastructures that suffer a minimum of such outages.

This chapter discusses some common types of Unix system service outages and how they apply to you, as the Unix system administrator, and your business. You learn about the metrics surrounding outages, including maintenance windows and service level agreements. And you learn the most effective procedures for dealing appropriately with outages and how to use each outage as a learning tool that can help you minimize similar outages in the future.

Types of Outages

After you've spent years as a system administrator and are looking back at all the service outages you've dealt with in your organization during that time, you'll probably discover that they can be divided into two groups: those that are preventable, and those that aren't. Preventable outages are ones that are either caused by human error or ones that you can see coming based on current monitoring data.
Human error, although impossible to prevent entirely, can be minimized with procedures that eliminate guessing on the part of an administrator. Creating procedures for routine tasks like removing a server from a rack or rebooting a production server can prevent the occasional human mishap.

Other outages can be prevented because you can see them coming long before they happen. For example, a server with a disk at 50% capacity and usage increasing by 10% of the total capacity per week is likely headed for disaster in five weeks; you can see it coming well ahead of time. This is why proactive monitoring is so important: You're looking for potential problems rather than reacting to them. You can read more about proactive monitoring in Chapter 6, "Monitoring Services." Graphing these trends over time can help you monitor your system for developing problems.

But within those two broad outage categories are several more specific categories of outages based on the cause, duration, and extent of the outage. There is much more to an outage than the unavailability of a service. Although to your users they may all look the same, a system administrator needs to know more than just "the server is down" to assess the severity of each outage; proper categorization of an outage may determine how quickly you are required to respond, who in your organization you need to inform about the outage, and whether you should notify users. There are an infinite number of causes for an outage, but they fall into the following categories:

- Scheduled maintenance
- Unscheduled outages
- Partial service outages
- Complete outages and degraded service
- Distributed service outages
- Third-party outages
The sections that follow look more carefully at each of these types of service outage and some of the special demands each can place upon you as a system administrator.

Scheduled Maintenance

Routine maintenance is common on any system, whether it's a computer or a car. Some routine maintenance occurs with new technology or software releases: Just as you change the oil in your car every 3,000 miles, you patch or upgrade your operating systems as new patches and releases become available. Some routine maintenance is unexpected, yet still predictable. For example, you don't know when a disk, memory module, or processor will go bad on your server, but you know that at some point you'll need to replace them, just as you know that eventually you'll need to replace your car's tires.

At the same time, you know a tire blow-out can cause a nasty accident. To avoid such surprises, you watch your tires for signs of wear, and you note their "wear guarantees" and the number of miles they've logged. That way, you don't depend on a blow-out to tell you that your tires are ready for replacement. The same is true of your Unix network. With proper logging and monitoring, you can anticipate and avoid many software and hardware meltdowns related to age and overuse. And you can schedule the replacement of those components for a time that causes you and your business the least disruption.

Planning for Routine Maintenance Outages

Scheduled, routine maintenance rarely has to be a critical "show stopper" for your IT department or your business. Routine maintenance issues needn't bring down an entire system. Without question, processor, disk, and memory failures can have a dire impact on a system, but that's why smart administrators use logging to provide a "heads up" for impending failures. The administrator can then schedule a time to fix the problems and make the appropriate announcements so nobody is taken by surprise, especially the users.
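The arithmetic behind this kind of "heads up" trend watching is simple enough to sketch. The function and figures below are illustrative only, using the disk-growth numbers from the example earlier in this chapter rather than output from any real monitoring package:

```python
# Hypothetical numbers from the earlier example: a disk at 50% capacity,
# with usage growing by 10 percentage points of total capacity per week.
def weeks_until_full(used_pct, growth_pct_per_week):
    """Project how many weeks remain before a disk fills up."""
    if growth_pct_per_week <= 0:
        return None  # usage is flat or shrinking; no projected fill date
    return (100.0 - used_pct) / growth_pct_per_week

weeks = weeks_until_full(50.0, 10.0)
print(f"Projected weeks until full: {weeks:.0f}")  # 5 weeks, as in the text
```

In practice the growth rate would come from graphing your monitoring data over several weeks, but the point stands: a simple linear projection is often enough warning to schedule the fix on your own terms.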
Scheduled maintenance puts unique responsibilities on the system administrator. Because this kind of maintenance (and the outage it requires) occurs at the administrator's discretion, he or she must gauge the severity of the problem and choose the best time to take action. You can find detailed information about monitoring logs and other metrics throughout both Chapter 6 and Chapter 11, "Performance Tuning and Capacity Planning."

Don't Procrastinate Routine Maintenance

Scheduling Routine Maintenance Outages

Scheduled maintenance creates an outage in order to prevent an outage. If you think that sounds like nonsense, think again: Patching a server to avoid a potential problem typically involves rebooting the server, which causes a short outage. If your company is busily using a service, the employees are likely to consider the patching outage an unnecessary loss of productive time. The justification for scheduling such outages can be difficult for nontechnical management to understand; they may take the "if it ain't broke, don't fix it" attitude, especially in high-availability environments. To enlist the support of management, do your research and have your information ready so you can explain how the process you'll perform during this outage will prevent longer, more costly outages in the future.

If you're scheduling an outage to upgrade some part of the Unix system, make sure reluctant managers understand the benefits the upgrade will bring to the business. For example, replacing a slow, aging Web server with a new, top-of-the-line multiprocessor Web server may involve an outage while the switch is made. Managers are likely to accept the inconvenience of the outage, however, when they understand the benefits of the upgrade, such as the ability to handle more concurrent connections. When you actually schedule your maintenance, you need to choose a time that minimizes the impact on your users but still allows you to perform the required work.
The following guidelines will help you choose the appropriate time:
Commit to a Scheduled Maintenance Time
Perhaps the most important thing a system administrator can do to eliminate excess resistance to scheduled maintenance outages is to provide management with a clear and accurate plan for when the maintenance will take place and how long the outage will last. When management learns to trust your ability to schedule maintenance outages so that they cause minimal disruption to the business, and to keep the outage "on schedule," they'll be more likely to stop second-guessing you on this issue. You learn more about maintaining maintenance schedules in "Working Within the Window," later in this chapter.

Unscheduled Outages

An unscheduled outage is any outage that occurs without warning. Even the most watchful and careful system administrator can expect to experience unscheduled outages on his or her network. The causes of unscheduled outages take many forms, such as a configuration glitch, hardware failure, human error, or even a building going up in flames. Human error is one common cause of unscheduled outages. Some examples of human error that could cause an outage are as follows:
Most human error can be prevented if the system administrator and other technical staff take extreme care when working with or around production systems. Double-check all changes that you make or, even better, use a change management system to approve and communicate your changes ahead of time. See Chapter 14, "Internal Communication," for more details on change management.

Whatever the cause of an unscheduled outage, the events that prompt the maintenance are unexpected. Most unscheduled outages are out of the system administrator's control, as is the amount of downtime they cause. Minimizing the downtime incurred by an unscheduled outage is one of the many challenges a system administrator faces.

Document Outage Procedures

Real-World Example: Human Error Outage

Partial Service Outages

Some services provide only a single function. For example, POP has the sole purpose of allowing a user to download mail from a mail server. A system administrator can easily determine when a single-function service is down and the repercussions of that outage for the business. In the POP example, an outage means users get an error message when they try to download their mail.

Other services provide more functionality and, therefore, can present a more complex diagnostic and maintenance issue. A Web server, in addition to serving simple static Web pages, can also support file uploading, CGI scripts that process forms, and back-end code that interfaces with databases. Add SSL to this mix, and you have an entirely separate secure Web server. Each Web server has its own unique functions, turning the simple HTTP protocol into a complex Internet service. When somebody reports a Web server outage, therefore, the system administrator may have no clear idea what is causing the outage or what services the outage has taken down with it. Is the entire Web site down, or is just one part of the site down? Is just one script failing, or is a database on the back end down?
Complex services (such as this Web server example) can experience partial service outages, where the service as a whole is up and running but parts of the service are failing. For example, a banking Web site might be available on the Internet, but the page that lets you check your account balances produces an error. For these complex services, the sooner you understand what part of the service is down, the better you'll be able to resolve the problem and end the partial outage.

You need users to give detailed problem reports that help pinpoint the nonfunctioning part of the service, and the responsibility for helping users supply this information lies with your help desk. Help desk staff should ask users to describe exactly what they were doing when they experienced the problem, as well as any error messages they received. The more detailed the problem report, the easier it will be for you to track down a small problem within a large service.

Automatically Generated Error Reports

Complete Outages and Degraded Service

A complete service outage is the nightmare all system administrators fear. In a complete service outage, a service is 100% unavailable to its users. These outages are all too often the cause of 3:00 a.m. pages on a Saturday.

Not all outages are as catastrophic as those just described. Sometimes problems just cause a service to become degraded. Much as a power brownout causes lights to dim but not fail, users can still use a degraded service; it just doesn't perform as well as it normally would.

A POP3 mail service example can help illustrate degraded service. Users of these servers are used to clicking a "Retrieve Mail" button in their clients and receiving their new mail within seconds. However, during an outage in which the mail server is overloaded with incoming mail, users experience delays of up to two minutes before receiving any messages. Another two-minute delay separates subsequent message retrievals.
While the mail service is working, it is painfully slow and not practical to use. Other examples of degraded service include the following:
Service Monitors Detect Degraded Services
Degraded service, while not a complete service outage, still causes significant problems for end users. Degraded service problems can be among the most difficult to diagnose, because the system is functioning; you have to find the parts that are failing, investigate the causes, and fix the problems. Degraded service usually indicates that one or more parts of the service are under stress. In the earlier example, the POP3 service was slow. The logical place to start looking for problems in that situation is the server itself. Many subsystems can be stressed on a Unix server, including CPU, disk, network, and memory, and the system administrator must examine them all to find the problem. Chapter 11 discusses such diagnostic examinations in detail.

Distributed Service Outages

A running joke among system administrators offers this definition of a distributed service: one in which the failure of a computer you didn't even know existed can keep you from getting any work done.
As Homer Simpson likes to say, "It's funny because it's true." Distributed service outages differ from other kinds of outages in one important way: Although most outages on server A result from a failure on server A, a distributed service outage can occur when a failure on server B causes a failure on server A. A distributed service (all jokes aside) is one that resides on a remote system but is critical to the functioning of another system. DNS (domain name system) and NFS (network file system) are good examples of distributed services. DNS provides critical name resolution for most Internet applications, but it resides on remote servers. NFS serves file systems to remote clients, some of which are critical to the operation of those clients. If DNS servers become unavailable, Internet applications will grind to a halt. A failed NFS server housing shared applications could render all of those applications unavailable to its clients.

To better understand the dynamics of distributed service outages, consider the example of the NFS file server. Instead of installing gigabytes of software on each system in your infrastructure, you can install the software once on an NFS file server and share it with the other servers. Now imagine that you've installed user shells on that NFS server, shells such as bash and tcsh that aren't available by default on your other servers. If the file server ever becomes unavailable, so do the shells. If the shells are unavailable, users can't log in to any of your servers. A failure on the NFS file server has caused an outage on your other servers.

Perhaps the most frustrating aspect of a distributed outage is that the remote server that is failing may be under somebody else's control. You might be responsible for Web servers, but maybe an entirely different team in your company administers the DNS servers that just went down.
Although you have done nothing wrong and your servers are functioning normally, somebody else's servers that you rely on can cause an outage in your system, and there's nothing you can do about it but complain and wait.

Real-World Example: A Distributed File System

Distributed outages are very difficult to prevent, since the whole point of distributed services is to offload services onto other systems, and you are often not in control of those systems. However, there are several steps you can take to minimize the impact a distributed service outage has on your own systems, as follows:
Third-Party Outages

Third-party outages occur when a system owned by another organization fails and causes a failure on one of your systems. Though similar to distributed outages, third-party outages differ in one important way: A distributed outage actively involves the use of a service on the remote system that causes the failure. In a third-party outage, the system administrators whose systems suffer the outage don't even know the remote system exists, and they certainly don't use any services on it.

The classic example of a third-party outage is a backbone failure. Everyone depends on backbones, which typically run at speeds between 45Mbps (DS3) and 2.4Gbps (OC-48), to connect networks around the world to form the Internet. Multiple backbones provide several different high-speed paths between any two machines on the Internet; the Internet wouldn't function without them. When a major backbone goes down, everyone in the country knows it! No traffic can get from one part of the backbone to another until routers eventually remove the route to the failed backbone and find other ways to route packets to their destinations. The most obvious symptom of this problem is a loss of connectivity to servers you access every day, especially those across the country. If your company suffers this kind of third-party outage, many of your customers will lose connectivity to your services. It's scary to know that a piece of hardware you never asked to use could cause such a major outage for your organization, but that's the nature of the shared network called the Internet. Some examples of third-party outages include the following:
Third-party outages are out of your control as a system administrator. The most important thing you can do is report any outage to the third party and keep track of any tickets that the third party opens for you. Report these tickets to your help desk and explain the situation to the staff so they can adequately update your users who call in to report the problem. Users should know that the problem is out of your control, but that it has been reported and is being worked on. Check back periodically with the third party to verify that progress is being made on resolving the outage.

Maintenance Windows

One of your first responsibilities as a Unix system administrator is to specify your organization's maintenance window: the time reserved for routine scheduled maintenance tasks such as rebooting routers, upgrading servers, adding disk drives, and so on. Maintenance windows specify a time when service is not guaranteed, so that administrators have time to fix minor problems or upgrade servers. Routine work such as hardware racking and application installation can be done outside of the maintenance window. But if you are planning any work that requires system downtime, or even work that has only a slight chance of bringing something down, do it during the maintenance window. You'll save yourself a lot of trouble if something does go wrong.

You need to consider three factors when choosing a maintenance window: time of least usage, maximum maintenance time, and business requirements. The following sections discuss these factors in detail.

Time of Least Usage

Common sense dictates that you don't want to bring down your systems when all of your users are using their services. The best times for a maintenance window are during the low points of system usage. By routinely monitoring your services, you can easily determine the hours during which they receive the least usage.
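As a rough illustration of what that determination looks like with data in hand, here is a sketch that picks the quietest hour from a day's worth of request counts. The numbers are invented; real figures would come from your monitoring tools:

```python
# Illustrative only: hourly request counts of the sort a traffic grapher
# might report, indexed by hour of day (0-23).
hourly_requests = {
    0: 420, 1: 310, 2: 250, 3: 180, 4: 150, 5: 170, 6: 400, 7: 900,
    8: 2100, 9: 3500, 10: 3900, 11: 3700, 12: 3200, 13: 4100, 14: 3800,
    15: 3600, 16: 3300, 17: 2600, 18: 1900, 19: 1500, 20: 1400, 21: 1100,
    22: 800, 23: 600,
}

# The hour with the fewest requests is the natural center of a
# maintenance window.
quietest_hour = min(hourly_requests, key=hourly_requests.get)
print(f"Least usage at {quietest_hour}:00 "
      f"({hourly_requests[quietest_hour]} requests)")
```

With this sample data, the quietest hour is 4:00 a.m., so a window centered there would disturb the fewest users. Averaging several weeks of data, rather than a single day, guards against picking an unrepresentative night.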
Throughout Chapter 6, you will find many of the tools you can use to do this; MRTG is one such tool. Figure 8.1 shows a graph of Internet traffic at a fictitious company. The graph clearly indicates that the low point of usage for this system is at about 5:00 a.m., which makes it the perfect time around which to specify a maintenance window. Graphing your own system's use can help you determine the best time for your maintenance window as well.

Figure 8.1. An rrdtool (an MRTG-like application) graph of network bandwidth usage clearly shows that this organization's optimum maintenance window is between 3:00 a.m. and 7:00 a.m.

Track Usage over Time

Different types of businesses have different trends in high and low usage points. ISPs usually peak around 8:00 p.m., when everyone is home checking their mail and surfing the Web, and have low points around 4:00 a.m. Universities tend to have a lot of night-owl students, so their usage may peak around 10:00 p.m., with the least usage at 4:00 a.m. Regular 9-to-5 businesses peak around 1:00 p.m., with a small dip around noon for lunch; minimum usage is between 6:00 p.m. and 6:00 a.m. International business complicates the analysis even further: Users in London might be using your service heavily while everyone in the United States is still sleeping. Only a thorough analysis of your data can tell you the low-usage time for your own system, but determining when that time occurs is critical for assigning an effective maintenance window. You need to understand the daily operations of your business in order to specify the most effective (and least intrusive) maintenance window for everyone.

Maximum Maintenance Time

After you've discovered the time of least usage for your services, you need to decide how much time to allow for maintenance. Allow yourself enough time to fix the most complex of problems without extending maintenance time into periods of significant usage.
Typical maintenance windows last anywhere between 2 and 6 hours, with 3 to 4 hours being the norm.

Leave Back-out Time Within Your Window

Business Requirements

Your business may have specific requirements that will play a role in determining your optimum maintenance windows. Client contracts may guarantee that services will be available during certain hours; sometimes client contracts even specify the maintenance window for you. To make matters more complicated, different contracts could specify different maintenance windows, a situation that becomes a real nightmare when working on shared systems such as a router.

Beyond contractual requirements, some systems operations may depend on services being up at certain times. If a bank generates monthly statements between 12:00 a.m. and 6:00 a.m. on the last day of each month, you can't fix servers at that time. Remember to take your backup schedules into account as well; don't interfere with backup infrastructure without either disabling or moving the backup schedule for that day.

One very effective method of coordinating all of this information is to keep a simple calendar and post each event, including maintenance windows, scheduled outages, and uptimes required by service level agreements. Recording events on a paper calendar might work for small environments, but an electronic calendar works best for larger organizations. Calendaring software comes standard with the GUI in most operating systems. These calendars immediately catch conflicts and warn you, for example, if scheduled maintenance falls in a time frame during which a client has required your services to be available.

Working Within the Window

After you've established your maintenance window, you should honor its boundaries and perform only routine scheduled maintenance within that window.
If you start to bring services down before or after the window, you will likely affect users who expect the services to be up, and you'll lose the trust of those users as well as your management.

One problem with maintenance windows is that work often unintentionally runs past the end of the window into the normal operating hours of your services. One way to stay within your window is to set a maximum time for instituting the scheduled changes, after which you will back out the changes and end the outage, regardless of circumstances. This is a tricky game to play, however; you must balance the need to complete the change against the need to stay on schedule. If you are only running 10 minutes behind schedule, it might not be worth backing out all of your work; but if you are running an hour behind, it makes sense to back out, because your users will definitely notice the longer outage. Ultimately, your management should make these kinds of decisions, especially if the outage affects a large portion of your user base. If you do back out because of time constraints, regroup and figure out where the bottlenecks occurred before trying again; don't repeat the same mistakes you made the first time.

Monitoring Compliance with Service Level Agreements

Customers expect a certain level of service from their providers, especially those customers who have signed contracts specifying those levels. These contracts, called service level agreements (SLAs), call for administrators to closely monitor the uptime of their services, as the contracts depend on those numbers. An SLA can be specified for any number of metrics, though the two most common are uptime and response time.

Monitoring Uptime Compliance

The most important measure of service is uptime. Simply put, for what length of time can your users access and use your services? In Chapter 6, you learned the difference between availability and usability.
The distinction between these two conditions is important when monitoring uptime and even more important when measuring it. While a service may be available, it is not considered "up" unless users can use it as they normally would. A mail server that accepts user connections but fails with a "permission denied" error is available but not usable. Uptime includes only usable service hours.

Uptime is often measured as a percentage of the total time the service could and should be usable. Some businesses like to report uptime once a month, others once per year. In any case, a common goal is 99.9% uptime, or "three nines." This amount of uptime assumes that nearly all downtime will be used for short-lived routine maintenance. A ratio of 99.9% uptime works out to just under 9 hours of downtime per year. Assuming you have no unscheduled outages, this amounts to 2 major 4-hour maintenance windows per year, or 8 short 1-hour outages. Throw some random outages in there, and you will ultimately have less time for patching, upgrades, and whatever else you do during maintenance windows. Other organizations go for the gold and try to reach "the five nines," or 99.999% uptime. This uptime percentage allows approximately 5 minutes of downtime per year, a lofty goal for sure, but not completely out of reach, as you learn in the discussion of high availability in Chapter 10, "Providing High Availability in Your Unix System."

Reporting uptime can be tricky. Accurate reporting requires constant monitoring of all of your services, without failures in the monitoring system. In addition, the granularity of your monitoring intervals becomes more critical as your uptime demands increase. For example, if you monitor each of your services once every 15 minutes, you'll miss many outages that last less than 15 minutes.
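To get a feel for how many short outages a coarse polling interval misses entirely, consider this small simulation. The outage counts and lengths are invented for illustration; an outage is "missed" if no poll happens to land inside it:

```python
import random

# Rough simulation: outages of random length occur at random times of day,
# while a monitor polls every 15 minutes. An outage is detected only if at
# least one poll falls within it.
random.seed(42)
POLL_INTERVAL = 15  # minutes

missed = detected = 0
for _ in range(10_000):
    start = random.uniform(0, 24 * 60)   # outage start, minutes into the day
    length = random.uniform(1, 14)       # outages shorter than one interval
    # Polls occur at 0, 15, 30, ...; find the first poll after the outage
    # starts and see whether the outage is still in progress then.
    first_poll_after_start = ((start // POLL_INTERVAL) + 1) * POLL_INTERVAL
    if first_poll_after_start <= start + length:
        detected += 1
    else:
        missed += 1

print(f"Short outages missed entirely: {missed / (missed + detected):.0%}")
```

With outages uniformly between 1 and 14 minutes long, roughly half escape a 15-minute poll cycle altogether, which is exactly why coarse intervals inflate your apparent uptime.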
Furthermore, every outage reported by the system will have an uncertainty of up to about 30 minutes. For example, a 17-minute outage and a 43-minute outage occurring in a system with 15-minute interval monitoring might both appear as 15-minute failures, as shown in Figure 8.2. That's a discrepancy of 26 minutes: time for which you do not know the status of your service. A 30-minute uncertainty is unacceptable in a five-nines environment, where the allowed downtime is only about 5 minutes per year.

Figure 8.2. You can't always tell the difference between a 17-minute outage and a 43-minute outage if the monitoring interval is 15 minutes. The gray area represents the outage length of 15 minutes that would be reported by the monitoring software.

A monitoring interval of 1 minute might be more appropriate in this environment; in that case it is much easier to tell the difference between a 1-minute outage and a 5-minute outage. In addition, with that much data, it's much easier to prove your uptime to clients. The one rule you should take away from this section is that the monitoring interval for a service should be smaller than the downtime its SLA allows. These smaller intervals allow you to report actual downtime with more precision, as well as detect short-lived failures that would otherwise go unnoticed.

Netcool's Reporting Functionality

Monitoring Response Time Compliance

The second most tangible aspect of any network service is its response time: How long does a service take to perform and respond to a user's request? Because response time plays such a large part in the user experience, most companies dedicate large chunks of time to optimizing their services' response times. Chapter 6 introduced several monitoring tools that can provide response time statistics, including Netcool and NetSaint. Response time failures and other timeouts usually qualify as downtime when measuring service levels and should be recorded as such.
If you are lucky enough to actually be involved in a service level contract negotiation, look for this clause and verify that you can perform at the levels that are specified. If the contract expects a Web site to respond within 5 seconds for every request, make sure your systems can meet that requirement! Both Chapter 6 and Chapter 11 present information that can help you determine whether your services can perform as requested and, if not, whether they can be tuned to do so.

Observing Production Values

This isn't a book on morality, but every system administrator should know and obey his or her own set of production values. Production values are the rules that minimize risk on production servers and can prevent outages from occurring in the first place. Production values may differ from organization to organization, and even from person to person, but they should all include your department's commitment to honor these basic promises:

- Use production servers only for production work
- Announce all maintenance
- Watch your logs and monitors
Establishing and honoring production values is essential to establishing credibility and respect for you and your IT department. To better understand the issues involved in each of the basic values listed here, read the sections that follow.

Using Production Servers Appropriately

Systems are often broken down into three categories: development, staging, and production. Each of these system types has a specific use, and the production system is the most critical to the business of your organization. Your first production value should include a commitment to use production servers as they should be used, to protect their service to your organization. To use production servers wisely, you need to understand how all three system categories are used.

Development systems are used for testing and developing new services. It doesn't matter if development systems are up or down; a business's financial well-being doesn't depend on those systems (although some developers might complain).

Staging systems are where new services are migrated for testing in a production-like environment before actual deployment onto production systems. Staging systems usually are designed to look exactly like the production environment, so people can get a good idea of how services will behave in production. Not everyone has or can afford a separate staging environment; in that case, development systems often play this role.

Production systems are the key to your business. They are where the final versions of your services are deployed and made available to your users. Their uptime is critical to your organization's success. It's important to use these systems appropriately. Installing a production Web server on a development server is a bad idea: You're probably not monitoring that server, and it may not have the capacity to handle your production load. At the same time, you shouldn't develop on a production machine. Systems in production have one purpose, and that is to serve users.
Developing on those systems takes away vital resources, such as CPU and memory, from your production applications; the loss of those resources can cause production applications to underperform. Even worse, you might overwrite a configuration file and cause a complete service failure. You can drastically minimize service outages simply by using production systems appropriately. Do your development and testing on development systems, and let the production environment do its noble job of servicing your users.

Announcing All Maintenance

Users usually don't know and don't care about the day-to-day work you perform on your systems, but they do care if the services they use go down without warning. When you face maintenance work that could potentially cause an outage, no matter how minor, you should announce that maintenance to your users. Your announcement should specify what you are doing in high-level layman's terms and give users an accurate estimate of when the work will be done. An email like this would be appropriate:
From: Chad Admin

A follow-up email documenting the success or failure of the maintenance would be appropriate as well. You should also think about which means of communication to use for these announcements; an email delivers the announcement to each user, whether he or she goes looking for it or not. Making sure all users are informed is essential in critical situations. Less critical work can be posted on a Web site or a newsgroup so users aren't force-fed useless information but can still be informed about upcoming issues. Chapter 15, “Interacting with Users,” discusses the use of these and other forums for communicating information to your users.

Grabbing Users' Attention

Watching Logs and Monitors

You could have the most verbose logging in recorded history and the most precise monitoring that today's technology can offer, but you'll gain nothing from them if you don't pay attention to their output. When your monitoring system notifies you that there's a problem, even with the most minor parts of a system, take it seriously and investigate further. Even a minor problem can be an indication of a greater one. You should never rely solely on your monitoring system to reveal system problems; review your logs daily to note any anomalies. System logs contain far more information than log analyzers such as logsurfer (documented in the section titled “Log Monitoring” in Chapter 6) can be configured to catch, and it's up to you to look for any anomalies that you haven't configured your software to detect. Log analyzer programs are invaluable tools, but they are useless without your configuration. Take some time every day to look at the logs for your critical systems and become familiar with their contents. After you get to know the usual contents of a log file, it is much easier to pick problems out of the thousands of familiar log messages.
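As a sketch of this daily review, the following shell function prints only the log lines that are not matched by a list of patterns you have already examined and deemed routine, so unfamiliar messages stand out. The file paths in the usage comment are illustrative assumptions, not fixed locations:

```shell
#!/bin/sh
# Daily log review helper (a sketch): show every line in a log file
# that is NOT matched by a list of "known harmless" grep patterns.
# Anything it prints is an anomaly worth a closer look.

# Usage: scan_log <logfile> <pattern-file>
# The pattern file holds one grep pattern per line describing
# messages you have already reviewed and consider routine.
scan_log() {
    grep -v -f "$2" "$1"
}

# Example invocation (paths are assumptions for illustration):
# scan_log /var/log/messages /usr/local/etc/known-patterns
```

Run from cron each morning, a helper like this shrinks thousands of familiar messages down to the handful that actually deserve attention; as you review new messages, you add them to the pattern file, tuning the filter over time.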
Tweak Log Analyzers Along the Way

Responding Quickly to Outages

During an outage every minute counts, especially when service levels are involved. When you receive notification of an outage or potential outage, respond quickly. Not only will a fast response reduce the total length of the outage, but people (and clients) are less likely to notice short-lived problems. This is why you should institute proximity requirements for all on-call staff (see Chapter 5, “Support Administration”). Someone who is no more than 30 minutes from your data center is probably going to respond to an emergency faster than someone who is 2 hours away visiting family.

Provide Remote Access for Administrators

Outage Procedures

Some administrators joke that they'd need to start ripping cables out of their data centers and put them back an hour later to be recognized for outstanding performance during an “outage.” Right or wrong, a Unix network's outages (or lack thereof) are often the metric by which the Unix system administrator is judged. What was your uptime this year? Did you meet your clients' service levels? Remember that time your mail server was down? Both clients and management care about outage issues, so it's worth your time to craft procedures that will help you minimize outages and their downtimes. Escalation procedures exist to move a problem up the chain of command until someone in the chain can solve it. Procedures shouldn't end at that point, though. Developing procedures for handling outages will ensure that nobody misses critical tasks such as handling communication and updating trouble tickets. The actual procedures should be very specific to your organization, but the general guidelines discussed in the following sections can help you get started.

Assigning Problems to Appropriate Staff

While the help desk may assign a problem to your group, not everyone, including the on-call person, is the best fit for every problem.
Your group probably has a variety of expertise; some of you may be senior administrators, some junior. Others may know more about Linux than Solaris. Still others may have extensive experience with operating systems but won't touch hardware with a 10-foot pole. Know your IT staff's strengths and weaknesses and use that information to assign problems to the right person. Even if you are the on-call person, don't spend 2 hours on a problem that another member of your group can fix in 2 minutes. Keep a contact list for your team and ask for assistance when necessary. On-call duty shouldn't mean that you have to solve every problem, but you should certainly be responsible for orchestrating the problem-solving process in the most efficient way possible.

Maintain Ongoing Communication

It's only natural when dealing with a difficult problem to focus all of your energy on solving it while blocking out all other external stimuli. This may speed up your own problem-solving process, but it leaves everyone else in your organization wondering what's going on. Always keep the communication lines open and send back as much information as possible to the help desk, your team, and your managers if necessary. They in turn can keep other parties informed, like clients and senior management. Periodic check-ins can help facilitate this communication. During long outages, checking in with the help desk every hour or so is a good practice. These periodic check-ins also keep you informed of how serious the outage is perceived to be from the users' point of view.

Use a Headset to Let You Keep Working

Of course, after the problem is resolved or you've come to a crossroads (maybe you need to order parts or wait for vendor support), contact the help desk immediately and update the status of the problem. This can also be done with a trouble ticket system like Remedy or req, which streamlines the whole process for you.
Maintain Activity Logs

You may remember and understand everything about an outage the instant you finish working on it, but it's a good bet that you'll forget about 50% of what you did by the next day. Keeping a detailed activity log will help you document the entire problem-solving process, including command output, vendor contacts, and timelines. All good trouble ticket management systems provide some sort of logging functionality (Remedy likes to get personal and calls it a diary). These logs are invaluable tools both for future reference and for analysis of a problem that just occurred. A very simple but typical log entry might look like this:
Wed Jan 30 2002 21:32:00 brian: Users experiencing slow response time on the mail server "goat". I am working on the problem.

This log shows the progress of the problem-resolution process, including important log data. This data will be very useful in the future: it was assumed that goat had a disk problem when it was really the gigabit card that was failing, and that assumption can be avoided next time now that the data has been logged.

Reference the Activity Logs During Outages
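If you don't have a trouble ticket system at hand, even a flat file beats relying on memory. The following minimal sketch (the log path in the example is an assumption, not a standard location) appends timestamped, attributed entries in roughly the style of the entry above:

```shell
#!/bin/sh
# Flat-file activity logger (a sketch): append one timestamped,
# attributed line per entry to a per-incident log file.

# Usage: log_entry <logfile> <message...>
log_entry() {
    logfile=$1
    shift
    # Record who said what, and when; create the file if needed
    echo "$(date '+%a %b %e %Y %H:%M:%S') ${USER:-unknown}: $*" >> "$logfile"
}

# Example (path is an illustrative assumption):
# log_entry /var/adm/outages/goat.log "Users report slow response on goat; investigating."
```

Because each entry is a single line with a timestamp and a name, the file stays easy to grep through months later when a similar symptom appears.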
Remain Calm

It is difficult to remain calm in highly visible outage situations, but you can't debug a problem and execute highly technical processes while you're running around like a chicken with its head cut off. As an old coworker once said, “You get ice water in your veins,” meaning that as you experience various outages and problems over the years, you become more and more calm even in the most dire of situations—your blood no longer boils at the mention of the word “outage.” The more panicked you are during an outage, the more likely you are to make a mistake, possibly worsening the situation. What's worse is that panic is contagious—if you are running around your office or data center screaming, other members of your staff are likely to start doing the same. Lead by example and stay calm; analyze the problems you need to solve, and take things one step at a time. If other administrators around you are panicking, ask them politely to leave, as they only add to the problem at hand. Nontechnical coworkers are likely to stop by and ask what's going on; it is easy to get angry at them for bothering you during an outage, but instead simply ask them to go back to their desks and let you do your job. Even your managers may need to be told this; you cannot possibly remedy a major outage with a manager looking over your shoulder reminding you how much money the outage is costing the company. Just ask everyone to leave you alone so you can fix the problem.

Root Cause Analysis

All problems, no matter how complex, have a root cause. A root cause is where a problem originated—the spark that caused the fire. Sometimes finding the root cause of a problem is easy. In the example discussed in the activity log text in the preceding section, the root cause of the slow response time on goat was a failing gigabit card.
Sometimes it's not so easy; often the problem must be traced back through many steps to find out what action truly caused it. For example, a user might call into your help desk saying that she isn't receiving any of the email her friends are sending her. Upon further investigation, you find out that the file system housing her mailbox is full. You reclaim some space, and mail begins flowing into the system again. What is the root cause of the problem? Was it the full file system? That certainly caused the user's problem, but what caused the full file system? Maybe you were suddenly sent a massive amount of spam that filled up the mailboxes on your system. In that case, the spammers are to blame, and you can block them from any further access. The root cause was the spammers, and you remedied the problem by altering your SMTP rule set to deny them access to your mail systems. Perhaps instead the full file system was the result of a gradual increase in usage over the past few weeks, and administrators were either not aware of or ignored the trend of increasing disk usage. The root cause of the problem in this case is the administrators' inattention, which could be remedied by implementing new monitoring procedures or configuring a software monitor to report excessive disk usage before it becomes critical. This example perfectly demonstrates a situation in which proactive monitoring (discussed at length in Chapter 6) is more effective than reactive monitoring. What can root cause analysis do for you beyond assigning blame? It helps you identify the real causes of your problems rather than the immediate causes or symptoms. While you can deal with immediate causes as they are found, eliminating root causes can make their resulting problems disappear forever, as there is no more seed from which outages can spawn.
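As a sketch of the kind of proactive disk monitor just mentioned, the following shell function warns when any filesystem crosses a usage threshold, so a slowly filling mail spool is caught weeks before it causes an outage. The threshold and the mail recipient in the usage comment are illustrative assumptions:

```shell
#!/bin/sh
# Proactive disk-usage check (a sketch): print a warning line for
# every filesystem whose usage exceeds a given percentage.

# Usage: check_disks <percent-threshold>
# Silent when everything is under the threshold, so cron only sends
# mail when there is something to report.
check_disks() {
    # df -P gives stable, POSIX-defined columns; awk skips the header
    # line and strips the '%' sign before comparing numerically
    df -P | awk -v max="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 > max)
            printf "WARNING: %s is %s%% full (mounted on %s)\n", $1, use, $6
    }'
}

# Run hourly from cron, mailing any output (address is an assumption):
# check_disks 85 | mail -s "disk usage warning" admin
```

Because the function prints nothing when all is well, piping its output to mail from cron yields alerts only on trouble, which is exactly the proactive notice the administrators in the example above were missing.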
In general, identifying a root cause requires you to trace the development of a problem from its symptoms back to the event or condition that set the problem in motion. After determining what caused the actual symptoms of the problem, you must determine what caused those causes, and so on, until you finally reach a cause that has no further remediable cause behind it—that is the root cause of the problem. In essence, you are creating a genealogy of the problem, tracing its roots back to the beginning.

Avoid Band-Aid Solutions

Summary

Managing outages can be a challenging part of a system administrator's job; often, these outages can be scary and overwhelming. However, you can't let them control your daily life or take over your IT department. As you resolve problems over the years, you learn how to better analyze new problems and fix them using previous experience. Your calm will eventually overtake your anxiety, and you will be handling outage situations with composure you didn't know you had. This chapter introduced the most common types of outages that can occur and how to manage them. Some outages are created on purpose to service hardware and software; these scheduled outages should be performed within a designated maintenance window. When outages do occur, it is important to accurately measure how long they last to monitor your compliance with service level agreements (SLAs). Your actions during an outage are important as well; setting production values such as responding to problem reports in a timely manner and documenting outage procedures will help you and other administrators deal with problems more effectively. Finally, when an outage is resolved, you should perform a root cause analysis to determine the true cause of the problem and fix it; doing so goes a long way toward eradicating those outages from your infrastructure for good, and fewer outages make for a happier system administrator!