Data warehouses and data marts are still struggling for respect. Well-intentioned projects aimed at leveraging the power of corporate information can often end up as failures, skidding over budget and past deadlines. This puts IT managers in a tough spot. Rush the job and you get a slow, resource-hungry application. Take your time, and you risk coming under fire.
And then there are performance issues that arise from the nature of the beast. That is, even if by some freak of circumstance every original assumption made about a data warehouse/mart turned out to be correct, performance degradation would still occur. As analysts bang away at a data mart and the amount of data grows organically, usage and performance requirements change.
Now that the data warehouse/mart industry has matured a bit, some standard techniques are emerging that will boost virtually any decision-support system. At what point should you start your tuning? Experts say that performance data collection should start immediately. In particular, you should be using database analysis tools to log what data gets most used. Equally important is what data is hardly touched and -- hardest to quantify -- which queries fail to be answered.
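The logging the experts describe can be sketched in a few lines. This is a minimal illustration, not any particular product's auditing facility: it assumes a hypothetical `query_log` table recording each query's target table, elapsed time, and whether it succeeded, using SQLite as a stand-in database.

```python
import sqlite3

# In-memory stand-in for a real query log; the table and column
# names are illustrative assumptions, not a specific product's schema.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE query_log (
    table_name TEXT, elapsed_secs REAL, succeeded INTEGER)""")
db.executemany("INSERT INTO query_log VALUES (?, ?, ?)", [
    ("sales_fact",   12.0, 1),
    ("sales_fact",   45.0, 1),
    ("sales_fact",  900.0, 0),   # a failed or abandoned query
    ("returns_fact",  8.0, 1),
])

# Most-heavily-used tables: candidates for later tuning work.
hot = db.execute("""SELECT table_name, COUNT(*) AS hits
                    FROM query_log GROUP BY table_name
                    ORDER BY hits DESC""").fetchall()

# Failed queries: the signal the article calls hardest to quantify.
failures = db.execute(
    "SELECT COUNT(*) FROM query_log WHERE succeeded = 0").fetchone()[0]

print(hot[0])    # ('sales_fact', 3)
print(failures)  # 1
```

The same aggregation also surfaces the rarely touched tables (the bottom of the `hot` list), which are candidates for trimming.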
"One of the things that is fairly straightforward is they analyse the kinds of data accessed," explained Rick Roy, vice president and general manager at M&I Data Services, a system integrator of decision support systems for the banking financial industry. "What isn't straightforward is when users are making requests and the mart can't satisfy it. Then you need to find out why not."
Roy suggests training users to contact you if they have a query that has failed or took too long to answer. Database managers agree.
"When going through the process, if users get an unacceptable response time, we sit down with the user and ask them 'is this something you're going to do on a regular basis?'," said Terry McDonald, information resource manager at Hoffman Enclosures, a manufacturer of electrical enclosures.
Hoffman Enclosures uses a 10GB mart built on an AS/400 using DataTracker, data mart software from Silvon Software, a data mart vendor and system integrator.
Currently the mart performs sales and marketing analysis, although McDonald says the plans are to expand it for financial and manufacturing queries. Users are trained that acceptable response times, even for ad hoc queries, range from 30 seconds to two minutes. Users who get response times of 15 minutes are required to contact the IT team.
"We are constantly working on keeping the performance lined up with users' needs, and to encourage them to come to us when having performance problems, rather than looking for alternative solutions," McDonald said. "We want to show them we are responsive and we do make changes [to the system] as needed."
After about a year of information gathering, you have reached the magic time for your first major tuning, experts say.
"We've found that in most cases of implementation, at 12 to 18 months the customer goes back into the thing and says, 'let's reimplement'," observed John Hughes, vice president of marketing at Silvon Software.
It takes approximately this long for IT managers to gather enough information on how users are accessing the system, and it takes about this long for users to fully understand what they really need from the system -- as opposed to what they actually asked IT to provide when the system was built.
Probably the biggest immediate boost you can give your system is to reduce the amount of unnecessary data that you are loading, storing, or having to sort through when querying. In fact, experts agree that "too much data" is the number one design mistake made when building decision support systems, particularly marts.
"Most people try to incorporate too many things into a data mart," said Jeff Wingo, data mart designer at Axiom, a data mart system integrator. "A lot of times designers are just thinking really far ahead. They may eventually want to integrate with an accounting system or telemarketing system, so they start to design that into their mart.
"Instead of completing it and making it useful, this lengthens the time and complexity."
Similarly, both Wingo and Roy agree that with data warehouses, one trend is to cull a mart from the sweet spot -- the data that is accessed most frequently. To do this, the warehouse becomes the gold source of scrubbed data. The warehouse can then feed a series of department-specific or application-specific marts. Queries to the warehouse itself would be restricted to the few people who need an enterprise-wide view across departments, such as the chief information officer.
"The warehouse provides a common ground for data," said Ed Connelly, a systems architect at Axiom. "It should be unbiased, just the raw data that hasn't been formatted by, say, applying the accounting department's algorithms."
Let's look at loading
When trying to slim down a plump warehouse, simply dumping unwanted data solves only half the problem. Even more important is loading, users say. "When we built the data mart, everybody wanted everything," said Todd Greenwood, database marketing manager at Harvard Business School Publishing. "We now realise that all of that's not needed. So we can go and redo our extraction routines so the data mart takes a subset of information."
This non-profit division of Harvard University uses Analytix, a Windows NT-based data mart from Customer Insight, for a sales and marketing mart. The 5GB system supports approximately 10 direct users among approximately 300 employees. Those 10 primarily run ad hoc queries for this book and multimedia publisher.
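The kind of extraction rework Greenwood describes can be sketched as follows. All table and column names here are hypothetical, and SQLite stands in for both the source system and the mart; the point is simply that the revised routine pulls a subset of columns and rows rather than "everything".

```python
import sqlite3

# Source system holding more columns and rows than the mart needs;
# all names here are invented for illustration.
src = sqlite3.connect(":memory:")
src.execute("""CREATE TABLE orders (
    order_id INTEGER, customer TEXT, region TEXT,
    amount REAL, internal_notes TEXT)""")
src.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", [
    (1, "Acme", "EAST", 100.0, "call back"),
    (2, "Bilt", "WEST", 250.0, "n/a"),
    (3, "Acme", "EAST",  75.0, ""),
])

# The redone extraction routine takes only the subset the analysts
# actually use: fewer columns, and only the region this mart serves.
mart = sqlite3.connect(":memory:")
mart.execute(
    "CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
rows = src.execute(
    "SELECT order_id, customer, amount FROM orders WHERE region = 'EAST'")
mart.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows.fetchall())

n = mart.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(n)  # 2
```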
In addition to cutting the unnecessary volume dumped and stored in the system, another way to cut data is to give it a reasonable expiration date.
"The biggest problem I see is the period of time they want the data to be in the warehouse," said Mike Beronick, senior project manager of data warehouses at M&I Data Services. "If you structure it so that you're adding other relations and other stuff as needed, over time, you're going to lose performance. The experience of getting the same file over time will be slower."
One technical red flag of an overweight database is the complexity of its data schema, experts say. If you're contemplating moving to a snowflake schema so that you can add more dimensions to your data, chances are you're making the system more complex than it needs to be, particularly in the case of a mart.
With a warehouse, a snowflake may eventually become necessary, but it should be viewed only as a last resort, says Jagdish Mirani, product marketing manager of data warehousing at Oracle. A star schema is the norm because such schemas represent the basic dimensions of most business queries. In other words, you may have a dimension called "customer" about which users want to know more details, such as address or account number. Should users say they need to analyse "address" for more detail, a snowflake schema could be the solution.
"But data models are something you don't want to change on a daily basis," Mirani warned.
Instead, both mart and warehouse managers should take advantage of some of the less complex ways to tune their systems. King among tuning techniques is the summary table. Query speed can be significantly enhanced for common queries if the data is compiled into a summary table. However, other areas may take the performance hit, warns Hughes. He believes overall performance is a balance of three items: query response time, disk consumption, and the "batch window" of time it takes to load data.
Emphasising one of these means trading off on the others. For instance, you can significantly speed up your response time with summary tables, but that requires more disk space and more time for updates to load.
You can reduce disk consumption by using fewer summary tables, but response time may suffer. If you are loading changes nightly, you may not have the time to build significant numbers of summary tables and still have the system available for work in the morning.
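The summary-table trade-off described above can be sketched in a few lines. This is a generic illustration (invented table names, SQLite as a stand-in): the aggregate is precomputed during the batch window, at the cost of extra disk and load time, so that the common query reads a handful of pre-aggregated rows instead of scanning the detail.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales_fact (region TEXT, month TEXT, amount REAL)")
db.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", [
    ("EAST", "1998-01", 100.0), ("EAST", "1998-01", 50.0),
    ("WEST", "1998-01", 200.0), ("EAST", "1998-02", 75.0),
])

# Built during the nightly batch window: extra disk space and load
# time in exchange for faster common queries.
db.execute("""CREATE TABLE sales_by_region_month AS
              SELECT region, month, SUM(amount) AS total, COUNT(*) AS n
              FROM sales_fact GROUP BY region, month""")

# The routine query now reads the summary instead of the detail rows.
total = db.execute("""SELECT total FROM sales_by_region_month
                      WHERE region = 'EAST' AND month = '1998-01'"""
                   ).fetchone()[0]
print(total)  # 150.0
```

Each summary table added shifts the balance Hughes describes: faster responses for the queries it covers, more disk consumed, and a longer batch window to refresh it.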
Limiting the number of people who have access to ad hoc query capabilities -- and then training them well -- will also increase performance. This is the suggestion of David Roquemore, a supervisor of information management.
"A lot of users just want to see a report. They don't care how it got there, they just want to view data," Roquemore explained. To that end, we use canned reports whenever and wherever possible, and limit ad hoc query capabilities to identified "power users", he says.
Anytime the same query is asked routinely, performance can be enhanced by issuing a pre-determined report. Power users who are trained on how to structure ad hoc queries can then go in and drill down for more information, if needed.
Along the same lines are performance problems that crop up as a decision support system, particularly a mart, is asked to do too many things.
"Keep the mart intent on what it is doing, and have it do one thing," M&I Data Services' Beronick said.
In fact, the smartest long-term performance tuning you can do doesn't even require you to point and click: develop a plan for integrating multiple data marts, says Tom Hammergren, product manager for data warehousing solutions at Sybase, and author of Data Warehousing: Building the Corporate Knowledge Base (International Thomson Computer Press).
"It's sometimes called Lego marts. It's an enterprise warehouse of marts that should just plug together," Hammergren said.
Such easy integration is accomplished by creating a model by which all marts will be built. That is, the company should standardise on basic definitions of commonly-used data, such as customer or product, and then "each group should be able to extend the definition" so that it supplies the data they need, Hammergren explains. For example, the "customer" field will be identical from system to system when it comes to name, address, and telephone number.
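Hammergren's shared-model idea can be sketched as a common core definition that each department mart extends. Everything here is a hypothetical illustration (the column set, the helper function, SQLite as the mart database); the point is that the core columns are generated from one place, so every mart agrees on them.

```python
import sqlite3

# The shared model: every mart carries these core customer columns
# identically. Column names are illustrative assumptions.
CUSTOMER_CORE = [("customer_id", "INTEGER PRIMARY KEY"),
                 ("name", "TEXT"), ("address", "TEXT"), ("phone", "TEXT")]

def create_customer_dim(db, extra_columns=()):
    """Build a customer dimension from the shared core plus
    department-specific extensions."""
    cols = CUSTOMER_CORE + list(extra_columns)
    ddl = ", ".join(f"{name} {typ}" for name, typ in cols)
    db.execute(f"CREATE TABLE customer_dim ({ddl})")

# Two department marts extend the same definition differently.
sales_mart = sqlite3.connect(":memory:")
create_customer_dim(sales_mart, [("sales_territory", "TEXT")])

finance_mart = sqlite3.connect(":memory:")
create_customer_dim(finance_mart, [("credit_limit", "REAL")])

def columns(db):
    return [r[1] for r in db.execute("PRAGMA table_info(customer_dim)")]

# The core columns match, so the marts can "plug together".
shared = columns(sales_mart)[:4]
print(shared)                               # ['customer_id', 'name', 'address', 'phone']
print(columns(finance_mart)[:4] == shared)  # True
```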
Using Lego marts is a strategy that works well, says Mark Kaminski, senior vice president of information technology at CompuSearch Micro Marketing Data Systems.
CompuSearch uses Sybase IQ as the basis for a series of marts that it uses for its flagship service, marketing demographic data and reports. Kaminski says smaller databases mean much faster query response times. For CompuSearch users, three minutes is too long.
Of course, the system isn't the only part of the equation that needs ongoing tuning. Database managers are well-advised to "tune" their users as well. This means managing expectations of what is and isn't acceptable performance and conducting regular interviews with users to ensure they report performance problems.
In addition, your team should also learn as much as it can about the business side of the system.
And, if all else fails, throw hardware at it. This will not cure a poor design, but it is definitely the solution if the CPU or memory is being taxed. Again, careful auditing of your system should point the way.
Performance tips for data marts
Don't use a snowflake schema. This indicates that you have got too much data and too much complexity, and it is a schema that is nearly impossible to fine-tune or alter at a later date.
Limit ad hoc queries to your power users, and make sure they are trained on how to perform them.
Train users on what acceptable response times are and what to do if the system fails to meet them.
Aim for response times of less than two minutes to keep users from pulling support away from the project and building their own department marts.
Audit systems should monitor performance on a daily or weekly basis; no less often than that.
Include an analysis of the network while monitoring performance.
Use canned reports for information that is routinely queried.