I once worked with an organisation that had a mature, load-balanced farm of server systems for its corporate directory. Users started reporting that, infrequently, they had to authenticate twice, or that they had to resend mail to internal staff. Our management console showed the directory service online and healthy. There was no pattern to users’ query failures, but calls to the helpdesk were growing in frequency. It took a one-time school administrator to point us in the right direction.
“That management console,” he said, “doesn’t know everything. Try diagnosing the problem as if the management system’s not there.” So we did it the old-fashioned way, tracking from the bottom up instead of from the top down. Lo and behold, we found that one server in the farm had a sick hard drive controller that was intermittently garbling read data.
Business continuity and failure-recovery strategies are based on the assumption that the most expensive failures are the most obvious ones: one or more systems, services, or devices dying outright. But a recovery system built on that assumption is, by design, not looking for lesser signs of trouble.
Some smaller problems compound over time or thrive simply because they aren’t being watched. Whether these little issues go unnoticed because there is nobody left to look out for them or because they don’t seem important enough to monitor, the small stuff can wind up costing more to repair than the big problems you fear most.
By the time one of these creeping, under-the-radar conditions trips the alarm bell, it may have left a trail of damage. In the case of the server with the sick hard drive, the controller didn’t realise there was a problem, so its host didn’t know, and no alert went out. We found that if there had been an alert, it wouldn’t have been heard. The management system was configured (or misconfigured) to listen for alerts only from the master directory server and the load balancer. It didn’t see anything behind the load balancer.
That made the problem difficult to diagnose, which is often the case with failures that start out small. What’s the solution? You need to adjust your administrative practices so you’ll see costly small problems coming.
Through a foggy windshield
Administrators routinely loosen management systems’ alarm thresholds so that they’ll send out fewer alerts. Some of the staffers who made those adjustments to your systems are probably gone now, leaving you uncertain about disabled or misconfigured monitoring settings. Before you do anything else, you’ll need to restore alert defaults and tune them to more realistic thresholds, which will bury you under management alerts for a while. Is that an enjoyable process? No, and that’s why I’d get vendors to handle as much of it as possible.
If you think that being bombarded with too much information is rough, it’s a joy compared to seeing nothing at all. A management system that’s tuned for quiet operation is a great source of calm, but it’s a false comfort. These systems simply aren’t aware of the status of some elements of your operation. You either need to plug these invisible assets into your management system or cook up some other way to track their status. Choose one or the other, because it’s in these dark places that costly troubles fester. Strolling up to a console whenever a user complains is not an effective solution.
Most enterprise products are equipped for management. But not everything is made to the enterprise standard. Products designed to adapt to small and medium businesses default to independent management. A business with two routers and eight servers is not going to spring for a copy of OpenView. Instead, a company that size will use Telnet, X Window, or Terminal Services to keep things tweaked. By now, I think everything can be managed from a Web browser, but every device has its own interface style.
I have yet to see a router or a managed switch, even in the consumer price range, that lacks basic SNMP capabilities. If a box isn’t remotely manageable by default, you can likely configure it so that your management system can monitor it, if not change its settings. With no reporting, or too little of it, you’ll only learn about trouble when it’s serious enough to affect a monitored device living downstream. By then, it can be a major challenge to trace the problem back to its true origin.
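Even if your management system ignores those devices, a scheduled job that walks the standard error counters beats nothing. As a sketch, here is how you might parse captured `snmpwalk` output for the IF-MIB `ifInErrors` counter and flag interfaces whose error counts grew between polls. The output format assumed here is net-snmp’s typical `IF-MIB::ifInErrors.N = Counter32: V` style, and the function names are my own, not any tool’s API:

```python
import re

# Typical net-snmp output line: "IF-MIB::ifInErrors.3 = Counter32: 42"
LINE_RE = re.compile(r"ifInErrors\.(\d+)\s*=\s*Counter32:\s*(\d+)")

def parse_if_in_errors(snmpwalk_output):
    """Map interface index -> ifInErrors counter from snmpwalk text."""
    counters = {}
    for match in LINE_RE.finditer(snmpwalk_output):
        counters[int(match.group(1))] = int(match.group(2))
    return counters

def rising_error_counters(previous, current):
    """Return interfaces whose error counter grew between two polls."""
    return {idx: current[idx] - previous.get(idx, 0)
            for idx in current
            if current[idx] > previous.get(idx, 0)}
```

Run something like this from a cron job against each switch and router, and a creeping error counter becomes an early warning instead of a surprise.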
Self-managing and locally managed systems and software (they watch themselves and sometimes watch each other) are also potential trouble sources. These assets do a wonderful job of quickly replacing dead systems and peripherals with living ones, but they may not be very forthcoming about the cumulative conditions leading up to the failure. Indeed, these types of systems often fail over when something stops responding to a ping or stops sending its periodic heartbeat. So you get a fast response to failure, but it’s up to you to figure out why it happened. There are innumerable potential causes for an unheard ping or an unsent heartbeat. If it isn’t a blown power supply or a dead network link, some combination of mysterious conditions may be responsible. Eliminate the mystery by turning on whatever logging and alerting the systems will permit. Once reviewing that output is an established habit, you’ll learn to spot and fix the cumulative problems that lead to a fail-over.
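One way to keep a fail-over from erasing its own history is to have the watcher log every missed beat before it declares failure. A minimal sketch of that idea, with a hypothetical class name and thresholds of my own choosing:

```python
import time

class HeartbeatWatcher:
    """Track heartbeats and keep a history of misses, not just dead-or-alive."""

    def __init__(self, interval, misses_to_fail=3):
        self.interval = interval            # expected seconds between beats
        self.misses_to_fail = misses_to_fail
        self.last_beat = None
        self.miss_log = []                  # timestamp of every missed check

    def beat(self, now=None):
        """Record a heartbeat from the watched system."""
        self.last_beat = time.time() if now is None else now

    def check(self, now=None):
        """Return 'ok', 'degraded', or 'failed'; log every miss on the way."""
        now = time.time() if now is None else now
        if self.last_beat is None:
            return "ok"                     # nothing expected yet
        missed = int((now - self.last_beat) // self.interval)
        if missed == 0:
            return "ok"
        self.miss_log.append(now)           # the record a bare fail-over loses
        return "failed" if missed >= self.misses_to_fail else "degraded"
```

The point of the miss log is that when the fail-over finally fires, you have the trail of “degraded” checks leading up to it rather than a single dead-or-alive verdict.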
Dim light is better than none
I have a pet peeve about opaque applications that just give you a green light when they’re running, and say nothing when they die or restart. There are only two ways to watch software like that: Act like a client or use OS-level tools.
Simulating a client is one step above the simple dead-or-alive heartbeat. For instance, if you want to keep tabs on a mail service that lacks a suitable monitoring interface, you can use a tool or management system interface that periodically requests a mail service connection.
I find basic “knock, knock” tests such as this inadequate because they don’t diagnose less obvious conditions. Instead of a basic request/accept test on a service, you’d do better to use a simulated client to push one typical transaction through to a verifiably successful termination. Each test yields insight into the service’s current state, and if the service deteriorates, a record of periodic tests will give you a place to start for diagnostics.
Did the send and receive times drift farther apart? Were there sporadic delivery failures (bounces)? When you view the messages’ expanded headers, do you see anything unusual?
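Those questions suggest what a periodic-test record needs to capture: send and receive times, plus bounce status. Here is a small sketch of such a record keeper that flags latency drift between the earliest and most recent tests. The class name, window size, and drift factor are illustrative choices, not any standard:

```python
class TransactionLog:
    """Record periodic end-to-end test results; flag latency drift and bounces."""

    def __init__(self):
        self.results = []   # list of (send_time, receive_time, bounced)

    def record(self, send_time, receive_time, bounced=False):
        self.results.append((send_time, receive_time, bounced))

    def latencies(self):
        return [recv - send for send, recv, _ in self.results]

    def drifting(self, window=5, factor=2.0):
        """True if recent average latency is `factor` times the baseline."""
        lats = self.latencies()
        if len(lats) < 2 * window:
            return False                    # not enough history to judge
        baseline = sum(lats[:window]) / window
        recent = sum(lats[-window:]) / window
        return baseline > 0 and recent >= factor * baseline

    def bounce_count(self):
        return sum(1 for _, _, bounced in self.results if bounced)
```

Fed by whatever simulated-client test you run, a record like this turns “mail feels slow lately” into a number you can act on before users call.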
If there isn’t any practical way to watch a given application, your best alternative is to watch the operating system beneath it. Microsoft’s Performance Monitor is probably the best known OS monitoring tool, and it’s one you should run often, if not constantly, on every Windows server in your shop.
I can’t count all of the ways to watch, log, and diagnose Unix. Every Unix kernel is accompanied by a symbol map that lists the name and location of the running system’s relevant settings and statistics. There are probably hundreds of tools that dig up and analyse this data. You should be using some of them.
Watching OS-level statistics comes with a catch: You need to learn what you’re looking at. The major statistics, such as memory use and network errors, are self-explanatory. Colleagues and online forums (especially vendors’ own forums and knowledge bases) are excellent resources. Learn to drill down to more detailed statistics when broader ones indicate a potential problem.
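That drill-down habit can be wired into the monitoring itself: poll a coarse statistic cheaply, and gather the expensive detail only when the coarse number crosses a threshold. A sketch of the pattern, with both readers left as stand-ins for whatever your OS tools actually expose:

```python
def watch(broad_reader, detail_reader, threshold):
    """Poll a coarse statistic; drill into detail only when it looks bad.

    broad_reader()  -> one number (e.g. percent of memory in use)
    detail_reader() -> a dict of finer-grained statistics
    Both readers are placeholders for real OS-level probes.
    """
    value = broad_reader()
    if value < threshold:
        return {"status": "ok", "value": value}
    # Broad number crossed the line: capture detail while the condition exists.
    return {"status": "investigate", "value": value, "detail": detail_reader()}
```

The design point is that detail gets captured at the moment the broad statistic misbehaves, which is exactly when it is most useful and hardest to reproduce later.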
Check the system log
To close, I’ll speak of the unspeakable. There has not been, and may never be, a fancy automated tool that rivals a system log and a pair of trained eyes. Log examinations are always informative.
Viewing logs, I have identified systems with chip-killed (disabled) memory, an ominously rising number of soft (corrected) disk errors, and SCSI buses that had to reduce their speed for stability. Logs should be on your regular reading list, or on a designated employee’s. I’ll understand if you try everything else first. I often do, because I want to believe that technology has evolved beyond plain text log files.
The best way to spot little cracks in your IT operation’s bedrock may well be a text editor. Using one should be part of a set of practices aimed at catching smaller, easily fixed problems before they grow into big ones.
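A text editor finds these cracks one session at a time; a short script can pre-screen the logs between readings. A sketch with illustrative patterns for the kinds of conditions mentioned above (real driver messages vary by OS and vendor, so treat these regexes as placeholders to replace with the strings your own systems emit):

```python
import re

# Illustrative patterns only; adjust to the messages your systems actually log.
WATCHLIST = {
    "soft_disk_error": re.compile(r"corrected (read|write) error", re.I),
    "scsi_slowdown": re.compile(r"scsi.*reduc\w+ (speed|rate)", re.I),
    "memory_disabled": re.compile(r"memory.*(disabled|chipkill)", re.I),
}

def scan_log(lines):
    """Count watchlist hits per category across an iterable of log lines."""
    hits = {name: 0 for name in WATCHLIST}
    for line in lines:
        for name, pattern in WATCHLIST.items():
            if pattern.search(line):
                hits[name] += 1
    return hits
```

A nonzero count is not a diagnosis; it is an invitation to open the file in that text editor and read the surrounding lines.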