In a darkened network operations centre, operators and engineers mill about, performing their routine morning rituals, drinking coffee, reading e-mails and checking log files.
A large display dominates the scene, providing a topology map that changes colour with the status of each node. An idle event browser is displayed next to the map. All of a sudden, the event browser goes berserk, scrolling faster than the human eye can follow; the map turns a solid red, and the phones start ringing nonstop.
The operators spring into action, frantically banging away on every available keyboard in an effort to determine what went wrong. The operations manager bursts in and bellows, "What's going on?" The lead operator barks back, "We don't know yet; we're working on it."
In exasperation, the manager looks up at the ceiling, sighs deeply and asks, "Why can't I just get a tool that tells me what's wrong?" The operator hunches back toward his console and whispers to himself, "That's a myth, boss, that's a myth."
This scenario is played out on a daily basis in corporations that increasingly rely upon their networks to provide business services. However, it doesn't have to be like this. Event correlation is not a myth.
Sophisticated network-management technologies are available today that do, in fact, provide network managers with the ability to pinpoint the source of network problems. However, selecting the right products, getting them to work together and configuring them properly is a daunting task.
The first step is to understand how event correlation works. An effective event correlation system is built upon three levels:
1. The object level, which focuses on isolating problems specific to a particular device.
2. The network level, which focuses on how nodes in the network are connected to each other and the impact each node has on its neighbours.
3. The service level, which is concerned with applications that use the network and how failures at the object, network and/or service levels impact the performance of a particular service.
Most organisations do not achieve effective event correlation because they are unable to establish the relationships among these three levels.
Object level. At the object level, conditions are monitored on an individual object, and information is processed to isolate the root cause of a problem pertaining only to that object (or node in the network).
When a problem occurs, the event correlation engine (ECE) needs to ask a complex set of questions, the answers to which eventually lead to a problem determination.
For example, let's say a router generates an SNMP trap, informing the ECE that an interface has just gone down. The ECE needs to verify some base level information - that the interface is really down and that it's not supposed to be down. Then the ECE will start asking additional questions. Are the I/O buffers overflowing? If so, is CPU utilisation maxed out? From these pieces of information and an intimate knowledge of the network, the problem can be identified - the router is overutilised and needs to be upgraded.
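The decision chain just described can be sketched as a simple rule function. This is an illustrative sketch only; the metric names and thresholds are hypothetical, not taken from any particular product or MIB.

```python
# Illustrative object-level correlation rule for a LinkDown trap.
# Metric names and the 95% CPU threshold are hypothetical.

def diagnose_interface_down(metrics):
    """Walk the question chain: is it really down, should it be down,
    and do buffer/CPU readings point to overutilisation?"""
    if not metrics["interface_oper_down"]:
        return "false alarm: interface is actually up"
    if metrics["interface_admin_down"]:
        return "no action: interface is administratively down"
    if metrics["io_buffer_overflows"] and metrics["cpu_utilisation"] >= 95:
        return "router overutilised: recommend upgrade"
    return "interface failure: dispatch technician"

# Overflowing buffers plus a pegged CPU point to overutilisation.
verdict = diagnose_interface_down({
    "interface_oper_down": True,
    "interface_admin_down": False,
    "io_buffer_overflows": True,
    "cpu_utilisation": 98,
})
# → "router overutilised: recommend upgrade"
```

In a real deployment each branch would be backed by SNMP polls of the device rather than a precomputed dictionary.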
Network level. The network level is concerned with how nodes are related to each other and how the failure of one (or more) will affect the rest of the network. This is done by examining the connections between the nodes and constructing a database of those connections. The idea is to determine a set of parent/child relationships for every node being monitored. Naturally, this will be a many-to-many relationship, because a node can have multiple children and multiple parents. With this information, the complete path to any node is known, and it becomes possible to recognise that a large stream of alarms is the result of a single failed node.
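The parent/child database and the alarm-suppression logic it enables can be sketched in a few lines. The node names are invented for illustration; any alarmed node whose parent is also alarmed is suppressed, leaving only the candidate root causes.

```python
# Sketch of a network-level relationship database: a many-to-many map
# of parents (upstream nodes) for each monitored node. Names are invented.

from collections import defaultdict

parents = defaultdict(set)      # child -> set of parent nodes

def add_link(parent, child):
    parents[child].add(parent)

def root_causes(alarmed_nodes):
    """Suppress alarms for nodes whose parent is also alarmed; what
    remains are the candidate root causes of the alarm storm."""
    alarmed = set(alarmed_nodes)
    return {n for n in alarmed
            if not any(p in alarmed for p in parents[n])}

# A core router feeds two switches, which feed the servers.
add_link("core-router", "switch-a")
add_link("core-router", "switch-b")
add_link("switch-a", "server-1")
add_link("switch-b", "server-2")

# The whole tree goes red, but only one node is the real problem.
storm = root_causes(["core-router", "switch-a", "switch-b",
                     "server-1", "server-2"])
# → {'core-router'}
```

With many-to-many relationships, a node alarms through only when every path to it is accounted for; this sketch keeps the simpler any-parent rule for clarity.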
At first sight, it may seem that this information is available in the topology map. However, topology maps can miss things such as alternate routes to or from a router, multiple IP addresses associated with a single router, and the Hot Standby Router Protocol (HSRP), where two routers act as one.
The only effective method of overcoming these shortcomings is to start with the topology information and then refine it by retrieving additional information. The four ways to do that are:
1. Retrieve the information from the router/switch configurations.
2. Develop scripts or programs to access the command-line interface.
3. Leverage existing databases (if they exist).
4. Manually populate the database from existing network documentation.
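As a sketch of the first option, a script can walk a saved router configuration and pull out the interfaces and their addresses, catching the multiple-address cases a topology map misses. The Cisco-like syntax below is purely illustrative; a real parser must match the vendor's actual configuration format.

```python
# Hypothetical sketch: refine the topology database from a saved router
# configuration. The configuration syntax shown is illustrative only.

SAMPLE_CONFIG = """\
interface Serial0
 ip address 10.1.1.1 255.255.255.252
interface Ethernet0
 ip address 192.168.10.1 255.255.255.0
"""

def interfaces_with_addresses(config_text):
    """Return (interface, ip_address) pairs found in the config text."""
    results, current = [], None
    for line in config_text.splitlines():
        if line.startswith("interface "):
            current = line.split(maxsplit=1)[1]
        elif line.strip().startswith("ip address ") and current:
            results.append((current, line.split()[2]))
    return results
```

The same loop structure extends to secondary addresses or HSRP groups once the relevant configuration lines are known.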
Service level. At the service level (or application level), problems are not really problems at all; they are symptoms of some other problem: the failure of a node (at the object level), a connection (at the network level) or a subordinate service such as the Domain Name System (at the service level).
The ECE needs to recognise that a service has failed and then map the symptom to the actual problem. Knowing the logical relationships and the dependencies among the various network nodes is the key at the service level. Unfortunately, each service is unique and can have a complicated set of dependencies.
There are tools that can help, such as Application Scanner from Ganymede Software. Realistically, the process of finding all the relationships will be an iterative one.
After determining the dependencies, it will then be possible to use tools to measure the performance of the service. There are three basic approaches to measuring application performance:
1. Use simulated transactions.
2. Use agents on every user desktop.
3. Use agents on the servers.
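The first approach, simulated transactions, amounts to timing a synthetic request against the negotiated response-time target. The sketch below is generic: `fetch_page` is a stand-in for whatever client library actually retrieves the page, and the 10-second target matches the example that follows.

```python
# Sketch of a simulated transaction: time a synthetic request against a
# negotiated response-time target. fetch_page is a caller-supplied stand-in
# for the real client library.

import time

SLA_SECONDS = 10.0   # hypothetical negotiated target

def check_response_time(fetch_page, url):
    start = time.monotonic()
    fetch_page(url)
    elapsed = time.monotonic() - start
    return elapsed, elapsed <= SLA_SECONDS

# Simulated fetch that completes in a fraction of a second.
elapsed, ok = check_response_time(lambda url: time.sleep(0.05),
                                  "http://intranet/home")
```

A breach of the target (`ok` false) would be forwarded to the ECE as a service-level symptom, not treated as a diagnosis in itself.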
Here's how an event correlation engine would work at the service level. Assume an interface is down, and suppose the marketing department has negotiated a 10-second response time for each Web page to be displayed. Assume, too, that the Web server makes SQL queries to a Sybase server, which happens to be on the other side of the failed interface.
The ECE needs to notice that the Web page has not been displayed within the 10-second window, check through the dependencies, discover that the Web server depends on a Sybase database, and determine that the Sybase server is unavailable because the link is down.
Without accurate relationship information, the ECE would be unable to diagnose the problem.
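That dependency walk can be sketched directly. The component names below come from the example above, and the recursive search simply follows the chain until it reaches something that is itself down.

```python
# Sketch of the service-level mapping: each service lists what it depends
# on, and the ECE walks the chain until it reaches a failed component.
# All names follow the Web-page/Sybase example above.

depends_on = {
    "marketing-web-page": ["web-server"],
    "web-server": ["sybase-server"],
    "sybase-server": ["router-link"],
}

def find_root_cause(symptom, is_down):
    """Depth-first walk from a failing service toward a failed component."""
    for dep in depends_on.get(symptom, []):
        deeper = find_root_cause(dep, is_down)
        if deeper:
            return deeper
        if is_down(dep):
            return dep
    return None

failed = {"router-link"}
root = find_root_cause("marketing-web-page", lambda n: n in failed)
# → 'router-link'
```

If the relationship table were missing the `sybase-server` entry, the walk would stop short, which is exactly the failure mode the article warns about.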
Putting it all together. Here's what you need to deploy event correlation:
1. Whether it's autodiscovery with a management platform, such as Network Node Manager or NetView 6000, or a manually populated database, there needs to be some way for node information to be added, deleted, and modified. Accurate node information is the mortar and stone of the network management system.
2. Next, there needs to be an ECE. From the conceptual point of view, this is a nebulous thing where event information goes in, gets processed and out pops the root cause of the problem. In reality, it will more than likely be a set of tools working closely together.
3. There needs to be some method of notification. Typical notifications go to technical support staff or end users and take the form of e-mails, pages, events in a browser, sirens, flashing lights or all of the above. There also needs to be tool-to-tool notification, typically SNMP traps or User Datagram Protocol (UDP)/TCP socket connections.
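A UDP socket notification of the tool-to-tool kind is only a few lines. The host, port and message format below are illustrative; a production feed would add sequencing or acknowledgements, since plain UDP is fire-and-forget.

```python
# Minimal sketch of tool-to-tool notification over a UDP socket.
# Host, port and message format are invented for illustration.

import socket

def send_alarm(host, port, message):
    """Fire-and-forget alarm datagram to a peer management tool."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(message.encode("utf-8"), (host, port))
```

The receiving tool (the MOM, for instance) would bind the agreed port and decode each datagram as it arrives.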
4. The network management system should be divided into components that perform particular tasks. At the top of the system would be the Manager of Managers (MOM). The MOM will receive alarm information, consolidate the information, and provide an enterprise level view of the alarm conditions. More importantly, this will be where the service-level relationship database is maintained and used. The MOM has the big picture and is best-suited to perform the service-level correlation. There are a number of tools available today that can fill this role, such as Micromuse NetCool, Tivoli Event Console or Boole & Babbage Command Post.
5. In a typical enterprise network, there will be at least three ECEs feeding the MOM with alarm information. Each needs to be customised with models for the nodes that it manages. Each is responsible for one layer of the network infrastructure: one for the backbone, one for the routers/switches and one for the servers. Examples of ECEs available today include Veritas' NerveCenter and SMARTS' InCharge.
Some additional tips. Be very wary of off-the-shelf software packages that claim to do everything without needing any maintenance. These may provide a quick solution, but they may not provide the advanced functionality needed for the long term.
In addition, these 'do-it-all' products tend to lock the solution into one technology base, and as the technology solution grows, the development cycle gets slower and slower. Generally speaking, the highest-quality functionality with the quickest turnaround time will be achieved by integrating best-of-breed point products.
Establishing the relationships is the key to success, and maintaining the relationships requires a database. However, don't over-engineer the database. Do not attempt to make it into more than it needs to be. Keep it as simple as possible. It could be a relational SQL database, or it could be a simple flat file.
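For many networks the flat-file option really is enough. A sketch of one such format, one "parent child" pair per line (the format itself is invented):

```python
# Sketch of the simple flat-file option: one "parent child" pair per line,
# loaded into an in-memory relationship map. The file format is invented.

from collections import defaultdict

SAMPLE = """\
# parent  child
core-router switch-a
core-router switch-b
switch-a server-1
"""

def load_relationships(text):
    """Return a child -> set-of-parents map from the flat file."""
    parents = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            parent, child = line.split()
            parents[child].add(parent)
    return parents
```

Migrating later from a flat file to a relational SQL database only means replacing this loader; the rest of the system keeps consuming the same map.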
Customisation is another key factor in long-term success. Ensure that the selected products are customisable, especially when it comes to the models. It is paramount that models can be customised and enhanced to match a specific series of events. Each network is unique, complete with its own set of quirks. A generic model may provide a good template, but it will almost always need to be adjusted for subtle nuances.
Finally, learn how to crawl before trying to walk, and learn to walk before trying to run. It would be a serious mistake to attempt the service-level correlation before first conquering object-level and then network-level correlation. These are the foundations on which the house is built, and without them, the structure will surely crumble.