MPP and SMP architectures are in a dogfight for top gun honours among large-scale servers. We'll show you what's pushing the envelope and how to ride the upward trajectory of this huge market.
As fans of combat aviation know, Top Gun is the arena where navy fighter pilots compete for the special designation "best of the best". Capturing the essence of the contest in the hit movie Top Gun, pilot Pete "Maverick" Mitchell leads navigator Nick "Goose" Bradshaw in a preflight chant: "I feel the need . . . the need for speed."
When solutions integrators are confronted by high-end server-system offerings, they often experience a similar rush of top gun bravado.
Instead of Eagles, Tomcats and Stealth fighters, the discussion focuses on technologies like symmetric multiprocessing (SMP), non-uniform memory access (NUMA), and massively parallel processing (MPP).
The names have changed, but the chant remains the same. This is where server technology is pushing the edges of the speed envelope -- and market demand has never been greater.
Sales hot buttons
According to Dan Olds, program manager in Sun Microsystems' data centre and high-performance computing product group, one major sales driver behind large-server purchases is cost-cutting. The trend towards consolidating servers, for example, is about lowering the total cost of ownership (TCO), he says.
"In a hardware-to-hardware comparison, it's cheaper to buy many smaller systems," Olds noted. "In distributed platforms, 75 per cent of TCO is related to personnel costs."
Compare that to the mainframe world, where personnel costs account for only 30 per cent of TCO. By consolidating multiple smaller servers into a single, more capable server, enterprise customers can realise a significant drop in TCO.
Another general trend behind high-end server sales is the growing need to support very large databases.
In fact, says Steve Wanless, marketing manager at Sequent Computer Systems, most of his company's large-scale systems are sold into the "horizontal market" of Web-based access to big databases.
Dave Poole, director of enterprise servers at Digital, also points to the increasing demands made on traditional applications such as online transaction processing (OLTP) and decision support systems (DSS).
High speed and high availability for OLTP, along with the dynamic load-balancing and shared-memory requirements for DSS, are the sales hot-buttons here, he says.
Basically, two classes of large-scale server architecture compete for the market title "best of the best": MPP and SMP.
In the former scheme, each node maintains its own copy of the operating system, application software, and stored data. The isolation of each node from its peer node accounts for MPP's nickname, "shared nothing".
SMP systems, on the other hand, are known as "shared everything", because they share memory and disk among CPUs, have a single address space (shared global memory), run a single copy of the operating system, and share a single copy of the application.
MPP servers, such as the IBM SP2, consist of multiple nodes, each comprising a processor, memory, and in some cases disk, linked through a high-speed system interconnect -- either a bus or a switch. In theory, such systems are more scalable and fault-tolerant than SMP, with capacity increased by adding more nodes.
But as integrators well know, being the best in the marketplace isn't only about performance. It's also about cost-effectiveness -- which is where MPP has sputtered. The usual explanations include the difficulty of developing applications to exploit MPP and the high administrative overhead for those apps not written specifically for parallel-node architecture.
The bus stops here
So why hasn't SMP blown MPP out of the sky? It's largely because of the inherent non-scalability of SMP's main bus interconnect. Running out of bus bandwidth requires a "forklift upgrade" to the next largest server.
Sequent's Wanless observes that "enterprise customers need to buy infrastructure to support a maximum number of processors up front -- a very expensive approach".
It's important, then, that integrators sell a bus architecture that meets large-scale needs. Many firms specify buses designed to last through the addition of at least three to four processors. There is, however, an overall system limit of 12 to 16 processors, based on current SMP architectures, notes Digital's Poole.
Sun Microsystems disagrees. With a combination of high-bandwidth bus technology and operating system enhancements, the vendor fields a 24-processor SMP server -- the Ultra Enterprise 6000. But Sun hasn't stopped there. Courtesy of Cray Research technology acquired from Silicon Graphics in 1996, Sun's Ultra Enterprise 10000 (aka Starfire) utilises a switched interconnect to support 64-processor SMP.
In this case, the main barrier to widespread acceptance is price. One Starfire system sells for more than $1 million. As such, an alternative solution for many integrators, particularly those serving the middle market, may be clustered four-way SMP servers.
Clustering versus NUMA
"With the low purchase price of a four-way SMP server, you can sell multiple servers, configure them in a cluster, and get the fault tolerance and failover capabilities of MPP," explained Bob Holbrook, vice president of product management at Tandem Computer, a Compaq company. Holbrook's comments reflect the popular trend of using clustering technology to circumvent the limits of SMP and to "scale beyond the box". Clustered SMP involves linking two or more SMP systems, or supernodes, via a high-speed system interconnect, thereby enabling the sharing of disks and data between cluster members.
Serious scalability via server clusters is only now becoming available through new techniques for low-latency interconnects, such as memory channel.
When it comes to compensating for SMP's scalability limits, however, not all large-scale server makers are ready to cede the "best" title to clustering. In fact, argues Sequent's Wanless, clustering actually exacerbates SMP performance problems while increasing management complexity.
He proposes NUMA, a type of distributed shared-memory architecture that uses a series of shared memory segments rather than a single, centralised physical memory. In NUMA-based servers, the access time to a memory block varies according to location of the memory segment containing the block, with respect to the processor that needs access.
Not coincidentally, Sequent's NUMA-Q servers are among the first to embody this technology. The "Q" refers to NUMA with quads, a processor/memory hardware innovation based on blocks of four Intel Pentium Pro processors.
Clearly NUMA is hot, with most major SMP vendors now developing their own variants of the technology for release in 1998. Meanwhile, the speed and performance of Top Gun systems continue to climb.
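The effect of non-uniform access time is easy to put into numbers. The following is a minimal sketch rather than a model of any vendor's machine: the function name and both latency figures are hypothetical, chosen only to show how the average access time rises as a larger share of requests has to travel to a remote memory segment.

```python
def average_access_ns(local_ns: float, remote_ns: float, local_fraction: float) -> float:
    """Expected memory access time when only part of the traffic stays local."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    # Weighted average of the local and the remote access time.
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies, for illustration only (not measured figures).
LOCAL_NS = 200.0
REMOTE_NS = 2000.0

for fraction in (1.0, 0.9, 0.5):
    avg = average_access_ns(LOCAL_NS, REMOTE_NS, fraction)
    print(f"{fraction:.0%} local accesses -> about {avg:.0f} ns on average")
```

Under these assumed figures, even a modest share of remote accesses dominates the average, which is why NUMA designs work hard to keep data close to the processor that uses it.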
All vendors agree that the large-scale servers of this year and 1999 will feature an increasing number of capabilities associated with today's mainframes. Given the cost and management baggage that comes with distributed-server architecture, integrators may safely plot customer interest in these systems along a trajectory running up and to the right.
Jon William Toigo is an independent writer and consultant specialising in business-automation solutions. He can be reached at: firstname.lastname@example.org or through his Web site: www.intnet.net/public/dolphin
The upsides and the downsides
MPP servers carry high software costs, while SMP performance eventually hits a bus wall. Choose your markets and apps carefully.
Despite high application-development costs for massively parallel processing (MPP) servers, as well as high administrative overhead for apps not written specifically for parallel-node architecture, some markets are quite receptive to MPP solutions.
Scientific users, for example, aren't as concerned with high software costs because specialty apps (such as simulated nuclear warhead explosions) must be hand-coded anyway.
Certain commercial apps, too, such as large-scale data mining and retrieval, are facilitated using fairly large MPP nodes connected by switching fabric, adds Mashey. But when apps need to go between nodes, the programming model changes.
SMP redresses this and other MPP shortcomings by using a common or shared memory as well as a disk I/O subsystem to service multiple CPUs. As more CPUs are added, administrative intervention is not required.
However, the trade-off is this: since memory is shared between CPUs, each CPU cache must be kept up to date with the memory changes of other CPUs. As processors are added, some loss of speed and efficiency accrues.
SMP architecture uses a bus for messaging between CPUs and moving data in and out of memory and disk. As CPUs are added, proportionally less bandwidth is available to each CPU.
Similarly, as more system resources are added, available bandwidth on the SMP bus is sucked up until latency increases and performance degrades across all processors.
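A back-of-the-envelope sketch makes the bus wall concrete. It assumes the simplest possible case: a fixed bus bandwidth divided evenly among the CPUs, ignoring cache-coherence traffic and I/O, which would only make matters worse. The function name and the 500MBps figure are illustrative assumptions, not the specification of any particular server.

```python
def per_cpu_share_mbps(bus_mbps: float, cpus: int) -> float:
    """Bandwidth available to each CPU if the bus is shared evenly."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return bus_mbps / cpus


# Hypothetical shared-bus bandwidth, for illustration only.
BUS_MBPS = 500.0

for cpus in (1, 2, 4, 8, 16):
    share = per_cpu_share_mbps(BUS_MBPS, cpus)
    print(f"{cpus:2d} CPUs -> {share:6.2f} MBps per CPU")
```

Each doubling of the processor count halves the share available to every CPU, so a bus that is comfortable at four processors can be starved at 16.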
A large-scale server lexicon
Bandwidth. Maximum rate at which an interconnect (such as a computer system bus or network) can propagate data once the data enters the interconnect, usually measured in MBps. The bandwidth can be increased by making the interconnect wider or by increasing the frequency of the interconnect so that more data is transferred per second. See also latency, with which bandwidth is often confused.
Big-bus SMP. A type of symmetric multiprocessing (SMP) architecture that relies on a single, shared bus among multiple processing elements (CPUs, memory, and input/output devices). Big-bus SMP machines offer good scalability up to a few dozen processors (24 to 36).
Cache. Because the speed with which CPUs access data is so crucial to system performance, adding a small amount of very fast memory close to each CPU yields increased performance. This fast memory holds copies of the most recently accessed data. It is called cache, after the French "cacher", which means to hide. The workings of caches are hidden from the software in many respects.
Cache coherence. In multiprocessor systems, though there's only one memory and therefore only one location for each datum in memory, the individual CPUs have caches of the most recently used data. Multiple caches mean that multiple copies of the same datum can now simultaneously exist. Cache coherence relies on a hardware and/or software mechanism that guarantees that all outstanding copies of a datum are kept identical, and if any one processor updates a copy, all other copies are deleted first.
Cache hit. When a processor finds a needed data item in its cache.
Cache miss. When a processor does not find a needed data item in its cache. A request to retrieve the datum must then be issued to the next level cache or main memory. In the meantime, the processor stalls waiting for the request to complete. Cache misses are grouped into several categories: capacity miss, compulsory miss, conflict miss and coherence miss. The first three occur in both single processors and multiprocessors; the fourth applies to multiprocessors only.
Capacity miss. A miss that occurs because the cache is not large enough to hold all the needed memory blocks.
ccNUMA. (Cache-coherent non-uniform memory access.) A variation of the NUMA architecture that features distributed memory tied together to form a single address space. Cache coherence is performed in hardware, not software.
Cluster. A collection of interrelated whole computers (nodes) that are utilised as a single, unified computing resource. The nodes of a cluster run independent copies of the operating system and application(s) but share other computing resources in common, such as a pool of storage.
Coherence miss. A cache miss based on the need to keep the contents of a memory block the same when it is shared by the caches attached to more than one processor. It applies to multiprocessors only. Example: processor A changes the contents of a block that is also held in processor B's cache. When this occurs, processor B's cache must throw out its copy of the block because it is now obsolete. The next time processor B reads that block, a coherence cache miss will occur.
Coherent. At any one time, there is only one possible value uniquely held or shared among CPUs for every datum in memory.
COMA. (Cache-only memory architecture.) A rival to ccNUMA that uses multiple levels of large caches rather than a single large memory. Data coherency is maintained by hardware.
Compulsory miss. The first time a block of memory is referenced, it will not be in the cache. A compulsory cache miss occurs on the first reference of the block.
Conflict miss. A cache miss that occurs because the portion of the cache assigned to a region of memory is not large enough to hold all of that region's blocks. It applies to certain organisations of caches only.
Directory-based cache coherence. A mechanism that preserves data coherency by keeping the sharing status and location of each data block in one and only one place for each data block. The place the sharing status is kept is called a directory (or it may be distributed across several directories). The SCI standard implements directory-based cache coherency in a NUMA-Q machine. Compare to snooping cache coherence.
Failover cluster. A cluster of computers with specialised software that automatically moves active applications from one machine to the other in the event of an outage involving one node in the cluster.
Global shared memory. A term frequently used with MPPs to describe the collection of memories from each cell or node in the system. The system has the appearance of one shared memory by using a software layer to fetch data from remote memories on other nodes.
L1 cache. The first cache (Level 1) searched by the processor for data or code.
L2 cache. The second cache (Level 2) in the cache hierarchy searched by the processor for data or code. It is only searched if the L1 cache fails to find the requested data.
Latency. The length of time required to retrieve data, starting when the initial request is made and ending when the request is satisfied. It is usually much more difficult and expensive to decrease the latency than it is to increase the bandwidth. See also bandwidth.
Locality. The tendency of multiple code and data accesses to stay within a given address space. Spatial locality describes the closeness of the addresses of multiple accesses to each other. Temporal locality is the extent to which references over time request the same address(es).
Manageability. How easy it is to manage a computer system or collection of systems. It includes such diverse tasks as administration, backup, security, performance tuning, disaster recovery, and fault recovery.
Node. A single member of a cluster. A node is a whole computer, running its own copy of the operating system and applications. Compare with cluster.
Quad. The building block of Sequent's NUMA-Q architecture, consisting of four Pentium Pro processors, two PCI buses with seven PCI slots, memory, and a 500MBps system bus.
Replicated memory cluster (RMC). A shared-memory cluster, in which a memory replication or memory transfer mechanism between nodes, and a software lock traffic interconnect, maintain coherency between copies of shared memory distributed across nodes.
Remote quad memory. The portion of the single physical memory that resides on quads other than the one quad that contains the processors seeking memory access.
Shared-memory model. A logical architecture for parallel computing, in which multiple processors run a single copy of the operating system.
Snooping cache coherence. A mechanism in which every cache that has a copy of a memory block also has a copy of the sharing status of the block. It employs hardware support known as a snoopy bus. Compare to directory-based cache coherence.
Snoopy bus. The hardware bus used by most big-bus SMP machines. All CPUs "snoop" all activity on the bus and check to see if transmissions on it affect CPU cache contents. Each CPU is responsible for tracking the contents of only its own cache.
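Several of the entries above (cache hit, compulsory miss, coherence miss, snooping cache coherence, snoopy bus) can be seen working together in a toy simulation. This is a minimal sketch under loud assumptions: the class and method names are hypothetical, the caches are unbounded (so capacity and conflict misses never occur), writes go straight through to memory, and a write simply invalidates every other copy. Real hardware protocols are considerably more elaborate.

```python
class SnoopyBus:
    """Shared bus: every write is visible to all attached caches."""

    def __init__(self):
        self.memory = {}   # address -> value (main memory)
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast_write(self, writer, address):
        # Every cache except the writer snoops the write and drops its copy.
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_invalidate(address)


class Cache:
    """Unbounded per-CPU cache that tracks only its own contents."""

    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.lines = {}            # address -> cached value
        self.invalidated = set()   # addresses lost to another CPU's write
        bus.attach(self)

    def read(self, address):
        if address in self.lines:
            return self.lines[address], "hit"
        # A miss: classify it, then fetch the block from main memory.
        kind = "coherence miss" if address in self.invalidated else "compulsory miss"
        self.invalidated.discard(address)
        value = self.bus.memory.get(address, 0)
        self.lines[address] = value
        return value, kind

    def write(self, address, value):
        self.bus.broadcast_write(self, address)  # other copies are deleted first
        self.invalidated.discard(address)
        self.lines[address] = value
        self.bus.memory[address] = value         # write-through, for simplicity

    def snoop_invalidate(self, address):
        if address in self.lines:
            del self.lines[address]
            self.invalidated.add(address)


bus = SnoopyBus()
cpu_a = Cache("A", bus)
cpu_b = Cache("B", bus)

print("A reads:", cpu_a.read(0x10))   # first reference by A
print("B reads:", cpu_b.read(0x10))   # first reference by B
print("B reads:", cpu_b.read(0x10))   # now cached by B
cpu_a.write(0x10, 42)                 # B's copy is invalidated via the bus
print("B reads:", cpu_b.read(0x10))   # B must refetch the block
print("B reads:", cpu_b.read(0x10))   # cached again
```

The sequence mirrors the example in the coherence miss entry: processor A changes a block that processor B also holds, B's copy is thrown out, and B's next read of that block misses.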