20 Mar 2008
It seems just yesterday that 10Mbit 10BASE-T Ethernet networks were the norm, and the workstation wonks I worked with years ago at US Navy CINCPACFLT in Pearl Harbor, Hawaii jockeyed to have high-speed ATM fiber run to their offices. Sure, this was the age when dual bonded ISDN lines represented the state of the art in home Internet connectivity, but who really needed that much bandwidth? What did we have to transfer? Email? Usenet posts? Gopher pages?
Then, slowly over the course of the next 5-10 years, network vendors upgraded their parts at negligible marginal cost, and 100Mbit began to pervade enterprise networks of all sizes, democratizing fast file transfers and streaming multimedia. 100Mbit seemed pretty fast. The Next Big Thing, Gigabit Ethernet, seemed a pipe so big it couldn’t be saturated and certainly not worth the exorbitant prices the hardware was going for at the time.
This week, I’ve been helping out one of our Lab49 teams working at a big investment bank to run performance tests on a large distributed cache deployed on a 96-node farm, all connected over InfiniBand. Interestingly, when we ran these brutal performance tests on a homegrown Gigabit blade setup with eight nodes, it was pretty trivial to saturate the network while the CPU idled along at about 3% utilization. Running them on the 96-node InfiniBand system, though, the same tests pegged the CPUs while the network drank mint juleps on the veranda during the first warm day of spring.
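The harness we used was proprietary, but the shape of a saturation test is simple: blast fixed-size payloads at a sink as fast as you can and measure how many bytes arrive per second. Here’s a toy single-machine sketch in Python, with loopback TCP standing in for the real fabric (all names and sizes here are mine, not the actual benchmark’s):

```python
import socket
import threading
import time

def run_throughput_test(payload_size=64 * 1024, duration=0.5):
    """Blast fixed-size payloads over a local TCP socket and report MB/s.

    A toy stand-in for a distributed-cache benchmark: the real tests ran
    across a farm of nodes; this just demonstrates the measurement loop.
    """
    server = socket.socket()
    server.bind(("127.0.0.1", 0))   # any free port
    server.listen(1)
    port = server.getsockname()[1]

    received = [0]

    def sink():
        # Drain the socket and count bytes until the sender hangs up.
        conn, _ = server.accept()
        while True:
            chunk = conn.recv(payload_size)
            if not chunk:
                break
            received[0] += len(chunk)
        conn.close()

    t = threading.Thread(target=sink)
    t.start()

    client = socket.create_connection(("127.0.0.1", port))
    payload = b"x" * payload_size
    deadline = time.time() + duration
    sent = 0
    while time.time() < deadline:
        client.sendall(payload)
        sent += len(payload)
    client.close()
    t.join()
    server.close()

    mb_per_s = received[0] / duration / 1e6
    return sent, received[0], mb_per_s
```

On a Gigabit link a loop like this saturates the wire long before the CPU breaks a sweat; on InfiniBand, the serialization and copy overhead on the host becomes the bottleneck instead.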
The sad thing about this situation is that Gigabit is only now getting sufficient uptake in enterprise NOCs that it is worthwhile to upgrade the NICs out at the clients. The last mile, just like with 100Mbit, is taking a while. Despite the fact that Gigabit just hasn’t been with us that long, it’s pretty clear to me that, with the snowballing of HPC, CEP, real-time messaging, and P2P network services (not to mention HD audio and video), Gigabit will be led out to pasture before it ever really gets a chance to race.
We may all still be using Gigabit at the desktop for years to come (or not, if 802.11n and offspring steal the show), but between InfiniBand and all the activity we’ve been seeing now in 10Gbit (as evident during the SC07 conference in Reno, NV last November), Gigabit just isn’t “it” anymore. Just like pet rocks in the late ’70s, Gigabit is a temporary salve for a social ill that ultimately requires a more vital solution. I may not yet know clearly what it is, but I know it ain’t Gigabit.
31 May 2007
For years now, the Tastes Great, Less Filling argument pervading the high-performance computing debate has been Scaling Up, Scaling Out. I’m not sure who penned it originally, but I remember first hearing it back in 1999 while Microsoft was trying to position Windows as a compelling enterprise server platform in the face of multi-way Big Iron from Sun, HP, and others, especially during a period where dual processor Intel configurations were exorbitantly expensive. The Scaling Out story warned of the high costs, low ROI of Scaling Up and beckoned with perfectly elastic scalability. Scaling Up promised better performance potential for particularly ravenous applications and a less ungainly programming and administration model.
Even though, just like that old Miller Lite commercial, the argument ultimately had much more to do with politics and economics than it did with high-performance computing, it still continues to have legs. Over the last five years, just as Scaling Out had seemingly all but usurped the spotlight, the silent multicore revolution has helped Scaling Up elbow its way back to the stage. Scaled-out grids are secretly scaling back up in the natural course of IT departments undergoing periodic server upgrades. The current pricing structure of server hardware is compelling enterprises to specify 64-bit multi-core systems, even though a majority of enterprises have only 32-bit serial applications to run on them. Enterprise grids are getting both bigger and beefier. And though the argument between Scaling Up and Scaling Out lives on and remains an amusing debate for well-heeled parallelogians and gridmakers, the real issues affecting high-performance scalability are seeping out elsewhere.
Earlier this week, I was speaking with Jeff Wierer, Senior Product Manager on the Microsoft High-Performance Computing team. He had recently been in London for a Compute Cluster Server User Group Meeting, and the scalability issue on everyone’s mind was physical infrastructure. Specifically, electrical power.
As many of you may already know, London is moving up fast as a world financial capital. Lab49‘s London office has been rained upon with fascinating projects from key financial services customers demanding algorithmic trading, computational finance, real-time data visualization, and high-performance computing. London’s financial institutions have been bulking up their computing horsepower and looking ahead to rich times. Unfortunately, there’s one little snag.
Jeff told me that some of his customers in London have heard anecdotal reports that the National Grid in the UK is either unable or unwilling to provide the additional physical infrastructure required to support the concentration of power demand coming from the burgeoning financial center in the city. Short of ripping out and starting over, the National Grid may be stuck advising its customers to be happy with the power they got (despite ominous blackouts).
While I haven’t been able to find independent confirmation of this yet, it seems entirely possible given the recent power shortages on the East Coast, the West Coast, and in Europe at large, as well as the privatization of European electrical distribution, which limited the amount of capital investment available to upgrade power infrastructure. (There are no bigger pockets than the pockets of Big Government…)
In the face of massive metropolitan and regional power crises, the argument between Scaling Up or Scaling Out is irrelevant (and rather pedantic, in my opinion). The real issue is between Scaling In and Scaling Abroad. And like Tastes Great, Less Filling, it really isn’t a debate at all because both are ultimately necessary ingredients of truly scalable architectures.
Scaling In is about putting hardware on a power diet. Let’s just assume that our cores are running at 100%, 100% of the time. Fun power management ideas like Intel SpeedStep, hard drive spindown, or monitor sleeping that reduce power consumption for workstations and laptops aren’t going to bail out institutional data centers. That will require new, more efficient processors and cooling systems, better rack and blade designs, distributed flash-based memories, and a slew of as yet unimagined inventions that software engineers can’t help out with even if they wanted to.
Scaling Abroad is about scaling out to multiple geographical locations. This is a notion commonly attributed to high-availability, business continuity, and disaster recovery, but it’s clearly also a scalability issue now, particularly at the scale of the largest grids (for example, Amazon, Google, and Microsoft). If Google housed all of its computing power in Mountain View, the Governator would come down and force Larry Page and Sergey Brin to spelunk through Baja to search for the blown fuse that fritzed California.
At one point during my tenure at Ellington Management Group, air-conditioning and server airflow were the biggest obstacles to growing our cluster. As our server density grew from workstations to rackmount servers, from single-processor machines to dual-proc with Hyper-Threading, our server room became a sauna. At first, we bought a bunch of digital thermometers, stuck them around the server room with double-sided tape, and made rounds periodically to check that there were no dangerous hot-spots. After we missed a couple of rounds over a weekend and carbonized some silicon, we hooked up SNMP monitors to the built-in motherboard and case thermometers and bought a slew of fans and portable air conditioning units. We ultimately did the right thing and upgraded the HVAC, but not without having to do reconstructive surgery on our office space, since our office had neither adequate amperage nor ducting/venting to handle an appropriately sized cooler.
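The vendor-specific SNMP plumbing aside, the logic we were effectively running by hand on those rounds amounts to classifying sensor readings against a pair of thresholds. A minimal sketch of that check (the sensor names and temperature limits here are invented for illustration, not what we actually used):

```python
def find_hot_spots(readings, warn_c=45.0, critical_c=60.0):
    """Classify temperature readings (sensor name -> degrees C) into
    ok / warn / critical buckets so the dangerous racks stand out."""
    report = {"ok": [], "warn": [], "critical": []}
    for sensor, temp_c in readings.items():
        if temp_c >= critical_c:
            report["critical"].append(sensor)
        elif temp_c >= warn_c:
            report["warn"].append(sensor)
        else:
            report["ok"].append(sensor)
    return report

# e.g. readings pulled from SNMP or walked by hand with a clipboard
snapshot = {"rack-1": 38.0, "rack-2": 50.0, "rack-3": 65.0}
status = find_hot_spots(snapshot)
```

Trivial logic, but the point of wiring it to SNMP was exactly that it kept running over the weekends we didn’t.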
Seemed like a tough problem at the time.
Seems quaint now.
17 Apr 2007
One of the more interesting, semi-futurist ideas floated by the morning panel discussion at STREET#GRID 2007 yesterday was the idea that job schedulers would begin to use the hardware monitoring capabilities of modern blade computers to influence task assignments. Kevin Pleiter, Emerging Business Solutions Executive for IBM’s Financial Services Sector, imagined a toolset that allowed job schedulers to take into account whether a particular blade or rack was running too hot, was disk-bound, or was drawing too much power, etc.
9 Apr 2007
From ACM Queue:
PeakStream Founder and CTO Matthew Papakipos explains how the PeakStream Virtual Machine provides automatic parallelization of programs written in C/C++ so that developers can focus on their application logic — and not the intricate details of parallelizing the application — and ultimately improve the performance of HPC applications when running on multi-core processors.
Matt is a friend of a friend, and I spoke with him several months ago about his company and their premier product, PeakStream Platform. PeakStream provides enabling technologies that allow developers to more quickly take advantage of general-purpose computation on graphical hardware (also known as GPGPU). Current NVIDIA and ATI graphics cards have incredible power to perform certain types of mathematically intensive calculations in a streaming, data-parallel fashion; however, taking advantage of that power requires some deep programmer sophistication. While efforts such as HLSL, Cg, CUDA, and CTM have made GPGPU programming more accessible, a cursory look at the documentation for these technologies proves that it still isn’t anywhere near easy.
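To make the data-parallel model concrete without tying it to any real GPGPU API: a stream program applies a small kernel independently to each element of its input arrays, and that independence is exactly what lets the GPU fan the work out across hundreds of hardware threads. A toy Python sketch of the model (this is the programming model in miniature, not PeakStream’s actual API):

```python
def stream_map(kernel, *streams):
    """Apply `kernel` elementwise across parallel input streams.

    On a GPU, each element would be computed by a separate hardware
    thread; here we just loop, which is the whole point -- the program
    is written once, and the runtime decides how to parallelize it.
    """
    return [kernel(*elems) for elems in zip(*streams)]

# A fused multiply-add kernel over three arrays: result[i] = a[i]*b[i] + c[i]
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
c = [0.5, 0.5, 0.5]
result = stream_map(lambda x, y, z: x * y + z, a, b, c)
```

The hard part that HLSL, Cg, CUDA, and CTM expose to the programmer is everything this sketch hides: memory transfers to and from the card, thread-block geometry, and the kernels’ restricted instruction set.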
3 Apr 2007
Penny Crosman has posted an article on Wall Street & Technology entitled “The High-Speed Arms Race on Wall Street Is Leading Firms to Tap High-Performance Computing” [via InsideHPC]. The article does a great job showing the breadth of architectural choices that might be considered in addressing high-performance computing needs and gives Steve Ballmer plenty of opportunity to talk about how Microsoft is breaking into the scene.
On the other hand, it completely glosses over the incredible difficulty in programming for GPGPU and the IBM-Sony-Toshiba Cell Processor. For giggles, try reading the Programming Guide for NVIDIA CUDA. Once you’re done rolling on the floor, try reading the recent article from Dr. Dobb’s on “Programming the Cell Processor”, whose subtitle should have been “How We Got 20x Performance Speed-Up By Writing 20x Lines of Code”.
As they say, it’s all fun and games until someone gets an RFP.