Confessions of the World's Largest Switcher
by Daniel H. Steinberg
It's a shame that Apple no longer runs the "Switch" campaign on television. Dr. Srinidhi Varadarajan would make an excellent spokesperson for moving to the Mac. Just as Ellen Feiss' switcher story was the hit at Macworld Expo, so has been Dr. Varadarajan's presentation at the O'Reilly Mac OS X Conference, where he received a standing ovation.
His ad might go something like this. "I was in the market for a new machine. I was hoping to get ten teraflops by the end of the year. I'd never used a Mac and had been looking at Dells and IBMs. Then Apple released the G5 on June 23. A week later I bought 1,100 duals online at the Apple Store. I'm Srinidhi Varadarajan and I build Supercomputers at Virginia Tech."
Goals in Building Virginia Tech's G5 Supercomputer
The timing was right to make a big move at Virginia Tech. Varadarajan explained that they had a new dean and a new program in Computational Sciences and Engineering (CSE). They also had experience building a previous, smaller cluster. The goals were pretty straightforward: To build a world-class program, you need to provide world-class resources. This included creating high-end computing facilities and high-performance networking capabilities to tie the computational facilities into national computational grids.
In addition, there were political goals to get communication and cooperation across department lines. Most universities have subcultures and pockets that don't speak to each other. Varadarajan asks how you get people to talk across these different fiefdoms and explains that another of the goals of the project was to get everyone on the same team by providing support for both experimental and production research. The cooperation was evident in the speed with which the project was accepted and supported. Conference co-chair Derrick Story asked how long the project took from start to finish. The answer was surprising. The project was started in March and April of this year. Within a month it had everyone's backing. Money was raised in April and May and the cluster was launched in ramping-up mode in September.
In addition to these goals, there were also architectural ideals. Varadarajan wanted a high-performance supercomputer based on a 64-bit processor and never looked at 32-bit. In addition, he felt that clusters imply gigabit Ethernet. You need a high-performance interconnect with high bandwidth and ultra-low latency. He also wanted to offer the cluster as a service, which meant he needed connections to Internet1, Internet2, and soon, into the NLR (National Lambda Rail). The NLR is a proposed high-speed network used to support research institutions.
In addition, the project had usage goals--to provide easy access for new investigators and exploratory research. The access policy was open door. Varadarajan explained that he didn't want to shut people out just because they don't have a grant. Often, you need results in order to get a grant. He also wanted to support multi-site research activities. Finally, for a premium, he wanted to support on-demand access to computational cycles. For example, an external customer may ask for a given amount of computing power within a given amount of time. This required being able to checkpoint and store the current state of the system so that currently running applications wouldn't be lost.
Dude, You Need a Mac
A prime constraint in designing the Supercomputer was cost. Academia has small budgets, so the focus was on high price/performance. Competing installations include DOE (Department of Energy) installations, which can afford to pay top dollar. The Virginia Tech team wanted the same performance but for bottom dollar. The cost was more than just the machines. The existing facilities would need to be upgraded with cooling systems and power distribution. And they would have to account for the cost of the cables, memory, and back-up power. Varadarajan's team built one of the cheapest world-class Supercomputers. He laughed that "The fact that it's running is a big deal in itself."
He looked at various architecture options and was in the process of buying Dells when the deal fell through. He also worked with IBM and AMD and couldn't get the price to match. The budgets were coming in at $9 to $12 million. The IBM with a PowerPC 970 was a first choice but the earliest delivery date would have been January 2004. Varadarajan said that you can't design a Supercomputer and wait that long for delivery. "You wouldn't buy a car and leave it at the dealer for a year and a half. We wanted a short three-month build cycle and could not wait six months."
On June 23 Apple announced the G5. Varadarajan said that contrary to rumors, it was the first that they had heard about it as well. On June 26 they told Apple they were interested in placing a "fairly large order". A day later he flew to California and met with Apple. One of their first questions was how long he'd been a Mac owner. Varadarajan said he never had one. Twenty-four hours later Apple committed. Starting on September 5, the G5s arrived in Virginia. An audience member asked if he'd made the purchase through the Apple Store. Varadarajan smiled and said that actually, yes, he had.
Performance and Power
Varadarajan said that a lot of people get the math wrong when calculating the performance of the machines. Each G5 processor has two double-precision floating-point units. Each unit is capable of one fused multiply-add operation per cycle, which counts as 2 flops, so a processor delivers 4 flops per cycle. At 2 GHz this corresponds to 8 GFlops per processor, so each dual G5 can deliver a peak of 16 GFlops of double-precision performance. That is more than a modern Cray node.
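The arithmetic behind those peak figures can be written out as a short Python sketch. The constants come from the numbers quoted in the talk; the function names are illustrative, and the cluster-wide total assumes all 1,100 machines contribute at the theoretical peak.

```python
# Theoretical peak double-precision throughput, from the figures in the talk.
FPUS_PER_CPU = 2     # two double-precision floating-point units per G5
FLOPS_PER_FMA = 2    # a fused multiply-add counts as one multiply plus one add
CLOCK_GHZ = 2.0      # 2 GHz clock
CPUS_PER_NODE = 2    # each machine is a dual-processor G5
NODES = 1100         # number of machines ordered


def peak_gflops_per_cpu(clock_ghz=CLOCK_GHZ):
    """Flops per cycle (FPUs x flops per FMA) times cycles per nanosecond."""
    return FPUS_PER_CPU * FLOPS_PER_FMA * clock_ghz


def peak_gflops_per_node(clock_ghz=CLOCK_GHZ):
    """Peak for one dual-processor machine."""
    return CPUS_PER_NODE * peak_gflops_per_cpu(clock_ghz)


def peak_tflops_cluster(nodes=NODES, clock_ghz=CLOCK_GHZ):
    """Peak for the whole cluster, converted from GFlops to TFlops."""
    return nodes * peak_gflops_per_node(clock_ghz) / 1000.0


if __name__ == "__main__":
    print("per processor (GFlops):", peak_gflops_per_cpu())
    print("per node (GFlops):     ", peak_gflops_per_node())
    print("cluster (TFlops):      ", peak_tflops_cluster())
```

Under these assumptions the cluster's theoretical peak works out to about 17.6 TFlops. Measured benchmark results are always lower than this ceiling.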
The primary communications architecture is built on InfiniBand. Each node has an InfiniBand card with two ports, connecting it into the network at 20 Gbps full-duplex bandwidth. Each node holds an open connection to every other node, and there is the potential to hold 150K connections per node. This translates into very low latency--less than 10 microseconds.
The computers and cables are just one piece of the infrastructure. Varadarajan also needed a large enough building to house the cluster, with a raised floor, environmental controls, fire suppression, and round-the-clock controlled access. In addition, the power needs include 1.5 MW of power coming in from two substations with back-up UPS and finally, a back-up diesel generator.
If you've ever sat with a TiBook in your lap, you understand that there is a further significant issue. As hot as a G4 runs, a G5 runs hotter. With a traditional air-conditioning setup, the calculations showed that instead of emptying out the air three times an hour as would be typical, they would need to empty the air three times per minute. Computers tend to cool front to back. So the plan was to arrange the computers in rows back to back and pull the hot air out of the hot aisle. This would have required wind velocity under the floor of more than 60 miles per hour and still would have resulted in some hot spots. They decided instead to use a refrigerator-like system. Chillers cool water to 40 degrees to 50 degrees, which is then used to chill refrigerant, which is piped into a matrix of copper pipes. Effectively, you have a distributed refrigerator.
The computers ran with few customizations. The volunteers started the computers, connected the InfiniBand card, restarted the computers, and cabled them up. The machines are currently running stock Mac OS X 10.2.7. An audience member asked if they use Software Update. Varadarajan said no but that there are plans to Pantherize the system in the next few weeks. This will require an install and a recompile of some of the code. Custom code included InfiniBand drivers and a parallel communication library known as MVAPICH, developed in Dr. Dhabaleswar Panda's lab at Ohio State University. This library had to be ported from Linux to Mac OS X. The PCI-X timing was changed to increase InfiniBand performance to 870 MBps. Also, message caching and dynamic memory management were added for improved scientific application performance.
The LINPACK benchmark solves a very large system of linear equations, involving dense matrix operations. The main phase is LU decomposition (Gaussian elimination with partial row pivoting). The G5 cluster solved a system of equations at N=500K. The team realized that the only way to improve the benchmark score is to improve the numerical libraries. This boils down to the BLAS libraries. The core routine--matrix multiply (GEMM)--was optimized by Kazushige Goto. The current impressive benchmark results are due to a mix of Goto's libraries and Apple's vecLib framework.
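To make the benchmark's main phase concrete, here is a minimal pure-Python sketch of Gaussian elimination with partial row pivoting on a small dense system. It is a toy illustration of the technique, not the benchmark code or the optimized BLAS routines used on the cluster. The function names are made up for this example, and the operation count uses only the leading-order term.

```python
def solve_dense(a, b):
    """Solve a*x = b by Gaussian elimination with partial row pivoting."""
    n = len(a)
    # Work on an augmented copy [a | b] so the caller's data is untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for k in range(n):
        # Partial pivoting: choose the row with the largest entry in column k.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        if m[p][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[p] = m[p], m[k]
        # Eliminate column k from every row below the pivot row.
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def matrix_bytes(n, bytes_per_entry=8):
    """Memory for an n-by-n matrix of double-precision values."""
    return n * n * bytes_per_entry


def approx_lu_flops(n):
    """Leading-order operation count for LU factorization: (2/3) * n^3."""
    return 2.0 * n ** 3 / 3.0


if __name__ == "__main__":
    print(solve_dense([[2, 1], [1, 3]], [3, 5]))
    print(matrix_bytes(500_000), "bytes for the N=500K matrix")
    print(approx_lu_flops(500_000), "flops, to leading order")
```

At N=500K the matrix alone takes 2 x 10^12 bytes in double precision. That is why a problem of this size has to be spread across the memory of many nodes, and why the inner matrix-multiply kernel dominates the running time.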
Varadarajan reported that "our latest numbers are 9.555 tera and we still have more tricks left. We are hoping for another 10 percent boost to become the first academic machine to cross 10 tera. The last ratings put us at number three worldwide." During the question-and-answer period at the end, an audience member from the Lawrence Livermore National Laboratory introduced himself as coming from the institution that had the Supercomputer that the Virginia Tech cluster had just passed. He asked whether the details of the Supercomputer would be published. The reply was that in addition to documentation and papers, the plans are to return the changes to MVAPICH to the open source project so that it would be freely available. There are also plans to open source the caching code and Varadarajan expects that Mellanox's code will be available.
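For a sense of scale, the reported number can be set against the theoretical peak. The sketch below assumes 1,100 nodes at a peak of 16 GFlops each, which is derived from figures quoted elsewhere in the talk rather than stated as a total. The names are illustrative.

```python
NODES = 1100                  # machines in the cluster
PEAK_GFLOPS_PER_NODE = 16.0   # dual 2 GHz G5, theoretical peak
MEASURED_TFLOPS = 9.555       # latest LINPACK number quoted in the talk
TARGET_TFLOPS = 10.0          # the threshold the team hopes to cross


def peak_tflops():
    """Theoretical peak of the whole cluster (derived, not quoted)."""
    return NODES * PEAK_GFLOPS_PER_NODE / 1000.0


def efficiency(measured=MEASURED_TFLOPS):
    """Fraction of theoretical peak achieved on the benchmark."""
    return measured / peak_tflops()


def boosted(measured=MEASURED_TFLOPS, boost=0.10):
    """Result after the hoped-for fractional improvement."""
    return measured * (1.0 + boost)


def boost_needed(measured=MEASURED_TFLOPS, target=TARGET_TFLOPS):
    """Smallest fractional improvement that reaches the target."""
    return target / measured - 1.0


if __name__ == "__main__":
    print("efficiency:   {:.1%}".format(efficiency()))
    print("with 10%:     {:.3f} TFlops".format(boosted()))
    print("boost needed: {:.2%}".format(boost_needed()))
```

Under these assumptions the cluster is running at a little over half of its theoretical peak, and a boost of under 5 percent would be enough to cross 10 teraflops.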
Varadarajan said that they are getting requests for clones. "Expect to see a lot more G5 clusters."
Daniel H. Steinberg is the editor for the new series of Mac Developer titles for the Pragmatic Programmers. He writes feature articles for Apple's ADC web site and is a regular contributor to Mac DevCenter. He has presented at Apple's Worldwide Developer Conference, Macworld, MacHack and other Mac developer conferences.