In 2007, scientists will begin smashing protons and ions together in a massive, multinational experiment to understand what the universe looked like tiny fractions of a second after the Big Bang. The particle accelerator used in this test will release a vast flood of data on a scale unlike anything seen before, and for that scientists will need a computing grid of equally great capability.
The Large Hadron Collider (LHC), which is being built near Geneva, will be a circular structure about 27km in circumference. It will produce data in the neighborhood of 1.5Gbps, or as many as 10 petabytes of data annually, 1000 times bigger than the Library of Congress' print collection. The data flows will probably begin in earnest during 2008.
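A rough check of those figures: a sustained 1.5Gbps works out to roughly 6 petabytes a year, so the 10-petabyte figure presumably allows for higher peak rates. A back-of-envelope sketch (the constant-rate assumption is ours, not the project's):

```python
# Back-of-envelope: convert a sustained link rate to an annual volume.
# 1.5 Gbps is the figure quoted for the LHC's data output.
GBPS = 1.5
bytes_per_second = GBPS * 1e9 / 8            # bits -> bytes
seconds_per_year = 365 * 24 * 3600
petabytes_per_year = bytes_per_second * seconds_per_year / 1e15
print(round(petabytes_per_year, 1))          # roughly 5.9 PB at a constant rate
```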
As part of this effort, which is costing about $US6.3 billion, scientists are building a grid using 100,000 CPUs, mostly PCs and workstations, available at university and research labs in the US, Europe, Japan, Taiwan and other locations. Scientists need the grid both to harness raw computing power for the computational demands and to give researchers a single view of this dispersed data.
This latter goal - creating a centralised view of data that may be located in Europe, the US or somewhere else - is the key research problem.
Grid like no other
Centralising the data virtually, or creating what is called a data grid, means extending the capability of existing databases, such as Oracle 10g and MySQL, to scale to these extraordinary data volumes. And it requires new tools for coordinating data requests across the grid in order to synchronise multiple, disparate databases. "It's all about pushing the envelope in terms of scale and robustness," the leader of GridPP, a UK-based scientific grid initiative that is also part of the international effort to develop the grid middleware tools, Tony Doyle, said.
Researchers believe that improving the ability of a grid to handle petabyte-scale data, split up among multiple sites, will benefit not only the scientific community but also mainstream commercial enterprises.
"If this works, it will spawn companies that will just set up clusters to provide grid computing to other people," GridPP Collaboration Board chair, Steve Lloyd, said.
GridPP is working with the international team to develop the grid the LHC will use.
CERN, the European laboratory for particle physics, is leading the LHC and its grid effort. The data produced by the particle accelerator will be distributed to nine other major global computing centres, according to grid technical leader at one of the major project sites in France, Fabio Hernandez.
As part of a backup plan, each of the 10 centres will hold two-tenths of the total data, so that each one will be responsible for its own 10 per cent plus a duplicate of the 10 per cent held by another centre, Hernandez said.
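One way to arrange that duplication is a ring, where each centre also holds a copy of its neighbour's share. The ring ordering below is an illustrative assumption; the article only says each centre backs up one other centre's tenth:

```python
# Sketch of the duplication scheme: 10 centres, 10 data shards, each
# centre holding its own shard plus a copy of the next centre's shard.
CENTRES = 10

def holdings(centre: int) -> set[int]:
    """Shards held by a centre: its own tenth plus its neighbour's."""
    return {centre, (centre + 1) % CENTRES}

# Every shard exists at exactly two centres, so the loss of any one
# centre loses no data.
for shard in range(CENTRES):
    owners = [c for c in range(CENTRES) if shard in holdings(c)]
    assert len(owners) == 2
```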
In total, some 150 universities and research labs worldwide will be connected to this system, all providing some degree of processing capability. The operation will run on versions of the Linux operating system on clusters of Intel and AMD processors.
Developing the grid involves a combination of efforts. In April, the LHC team conducted a test, distributing the data to 10 major sites at a total rate of 600Mbps. Much of the work was low level, such as adjusting network card parameters and firewall configurations. "It was important to prove that we can maintain the processes for an extended period ... almost without human attendance," Hernandez said.
Access all areas
This meant ensuring that network interconnects were tuned and synchronised and that there was sufficient security and monitoring, as well as staffing and automation, at the respective data gathering sites, he said.
The more difficult aspect is providing simultaneous access to the data by as many as 1000 physicists working around the world.
"You cannot ... predict what the users will want in any given moment," Hernandez said.
One limiting factor that's getting a lot of attention from the roughly 100 developers working on the grid worldwide has been the capabilities of resource brokers - the middleware that submits the jobs and distributes the work.
If the processing wasn't effectively routed, databases could crash under heavy loads, Doyle said. There was also a need to ensure that the system had no single point of failure.
This involved keeping track of the data: it could be in one place while the CPU resources capable of processing it were in another. Metadata, which describes what the data is about and where it lives, would play a critical role.
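At its simplest, that metadata is a catalogue mapping each dataset to the sites holding a replica, so work can be sent to the data rather than petabytes moved to the work. A minimal sketch, with invented dataset and site names:

```python
# Toy metadata catalogue: dataset name -> sites holding a replica.
# All names here are illustrative assumptions.
catalogue = {
    "run2008/muons": ["cern", "lyon"],
    "run2008/jets": ["ral"],
}

def sites_for(dataset: str) -> list[str]:
    """Return the sites where a job over this dataset could run."""
    return catalogue.get(dataset, [])

# A job over "run2008/muons" can be scheduled at either replica site;
# an unknown dataset yields no candidate sites.
```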
Doyle said these were some of the big challenges.
"The most important thing is to show that the grid model can be used to process real data in a scientific context - and data distributed all over the world," Hernandez said.
"I think it's the most important lesson we are going to learn."