From Blade to CUDA (2012)
July 2012

This article is written in the context of HPC (High Performance Computing).  HPC refers to a high level of computing power applied to a single application.  Many scientific research, engineering analysis, digital content creation, and finite element analysis applications are of this type.  Take bio-molecular behaviour simulation as an example.  In 2011 the fastest HPC system took 1 day to simulate 69ns of behaviour.  Do scientists want to simulate a significantly longer period?  Definitely.  Can they use 2 or more computers to run the same application?  Yes, and this has been the approach for decades.  To give an idea of scale, an HPC system is the equivalent of a million departmental servers.
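To show why a longer simulation period matters, here is a small sketch that scales the 69ns-per-day rate quoted above to longer targets.  The target durations (1 microsecond and 1 millisecond) are hypothetical values chosen for illustration, and the sketch assumes the rate stays constant.

```python
# Rate quoted in the article: 69 ns of simulated behaviour per day of computing.
NS_PER_DAY = 69.0


def days_to_simulate(target_ns, ns_per_day=NS_PER_DAY):
    """Return the wall-clock days needed to simulate target_ns nanoseconds."""
    return target_ns / ns_per_day


# Hypothetical targets, for illustration only.
one_microsecond_days = days_to_simulate(1_000)       # 1 us = 1,000 ns
one_millisecond_days = days_to_simulate(1_000_000)   # 1 ms = 1,000,000 ns

print(f"1 microsecond: {one_microsecond_days:.1f} days")
print(f"1 millisecond: {one_millisecond_days / 365:.1f} years")
```

At that rate one microsecond of behaviour takes about two weeks, and one millisecond takes roughly 40 years, which is why more computing power is always in demand.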

The blade server was the de facto format for HPC up to 2010.  How much space would be needed to accommodate a million departmental servers?  How many cables would run across such a football field of computers?  A blade system integrates a large number of servers into the space of a cabinet.  A small blade system takes only 7 rack units (7U) of a 19” wide cabinet to provide the computing power of 40 Xeon processors.  Instead of cables running around to connect 20 or 40 Xeon servers, blade servers use printed circuits inside the enclosure to save space and cables.  The server nodes use 10G Ethernet or InfiniBand to interconnect among themselves, and an integrated Gigabit Ethernet switch to communicate with clients outside the blade enclosure.

The above approach seemed the best way forward until 2011, when a company that did not traditionally supply central processing units (CPUs) turned up with a newish scheme called CUDA (Compute Unified Device Architecture).  Nvidia has been a market leader in graphics processors, and its GeForce series of graphics cards is well established in the market.  Nvidia turned graphics processing units (GPUs) to general application computing and created CUDA compilers so that a program running on the CPU can send parcels of computing work to the GPU.  The CPU and GPU are connected by PCI Express, a high speed serial bus on the motherboard.  Version 3 runs at 8 GT/s (just under 8Gbps of data) per lane in each direction, so a 16-lane slot carries close to 16GB/s each way.  It is not just the speed of the interconnection between processors that beats the blade approach.  The high count of cores in the GPU matters most.  To give an idea, as of July 2012 a NZ$900 GTX680 card provides 1536 cores.  HPC does not use GTX cards, which are for consumers.  HPC uses Tesla, which is another Nvidia brand.
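As a rough check on the PCI Express link between CPU and GPU, the sketch below works out the one-way bandwidth of a 16-lane version 3 slot.  The signalling rate (8 GT/s per lane) and the 128b/130b line encoding are taken from the PCI Express 3.0 specification rather than from this article, and protocol overhead beyond the line encoding is ignored.

```python
# PCI Express 3.0 figures (assumed from the PCIe specification, not from the article).
TRANSFERS_PER_SECOND = 8e9        # 8 GT/s per lane, one bit per transfer
ENCODING_EFFICIENCY = 128 / 130   # 128b/130b line encoding
LANES = 16                        # a full-width x16 graphics slot


def pcie_bandwidth_gbytes(lanes=LANES):
    """Return the usable one-way bandwidth in GB/s for the given lane count."""
    bits_per_second = TRANSFERS_PER_SECOND * ENCODING_EFFICIENCY * lanes
    return bits_per_second / 8 / 1e9


x16_gbytes = pcie_bandwidth_gbytes()
print(f"PCIe 3.0 x16, one direction: {x16_gbytes:.2f} GB/s")
```

This is an upper bound on the link itself; real transfers between CPU memory and GPU memory would normally achieve somewhat less.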

CUDA-based HPC rose to fame when the Tianhe (Sky River) computer in China was rated the fastest computer in the world.  This computer had fewer than 1/15th of the runner-up's CPU count in Xeon CPUs, plus about 1/30th of that count in Tesla GPUs.  This implies that each Tesla provided roughly 30 times the performance of a server CPU for typical HPC applications.
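The reasoning behind the "30 times" figure can be sketched as simple arithmetic.  The sketch assumes, purely for illustration, that both machines deliver the same total performance and that every CPU in either machine contributes equally; neither assumption is a measured fact, and the variable names are hypothetical.

```python
# Counts expressed as fractions of the runner-up's CPU count (taken as 1.0).
RUNNER_UP_CPUS = 1.0
TIANHE_CPU_FRACTION = 1 / 15   # "less than 1/15th" in the article, used as an upper bound
TIANHE_GPU_FRACTION = 1 / 30


def gpu_to_cpu_ratio(cpu_fraction, gpu_fraction, total=RUNNER_UP_CPUS):
    """Return how many CPUs one GPU must match for the two machines to be equal.

    Solves: total = cpu_fraction + ratio * gpu_fraction, in units of one CPU.
    """
    return (total - cpu_fraction) / gpu_fraction


ratio = gpu_to_cpu_ratio(TIANHE_CPU_FRACTION, TIANHE_GPU_FRACTION)
print(f"Each GPU must do the work of about {ratio:.0f} CPUs")
```

Under these assumptions the ratio comes out at 28, which is consistent with the round figure of 30 once you allow for Tianhe having somewhat fewer CPUs than 1/15th and for it outperforming the runner-up.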

Nvidia alerted Compucon to the new supercomputing ranking in November 2010.  The alert marked the beginning of Compucon's journey into CUDA.  Compucon spent 12 months understanding CUDA and Quadro, which is another Nvidia brand for professional graphics processing.  When Nvidia released Tesla 2075 cards, Compucon spent another 3 months finding out how Tesla could help professional graphics processing, and gained confidence from benchmarking Adobe Premiere Pro CS6 with PPBM6 in June 2012.  We feel that it is time to promote CUDA HPC systems instead of blades.

The latest CUDA HPC system platform as at July 2012 is designed and developed by Supermicro in the US, based on the Intel C602 chipset (we state this so that Compucon does not steal the credit).  The system is 1U rack mounted and takes very little space.  It runs a Xeon processor and up to two Tesla cards, and provides for 6 hard disks that can be set up as RAID 0 for speed, RAID 1 for redundancy, or RAID 10 for both.  Compucon will configure and build this model with, as the base, a 6-core 2GHz Xeon with 15MB of L3 cache based on the Sandy Bridge micro-architecture and one Tesla card with 448 Fermi CUDA cores and 6GB of GPU memory.  Options include a second Tesla card and up to 256GB of ECC main memory.
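To illustrate the trade-off between the disk array options, the sketch below compares usable capacity for arrays 0, 1, and 10.  The 2TB disk size is a hypothetical figure, not part of the system specification, and array 1 is modelled as a single mirror set in which every disk holds the same data.

```python
def usable_capacity_tb(level, disks, disk_tb):
    """Return usable capacity in TB for a simple RAID 0, 1, or 10 array."""
    if level == 0:
        # Striping: all disks hold unique data, no redundancy.
        return disks * disk_tb
    if level == 1:
        # Mirroring: every disk holds a copy of the same data.
        return disk_tb
    if level == 10:
        # Striping across mirrored pairs: half of the disks hold unique data.
        if disks < 4 or disks % 2 != 0:
            raise ValueError("RAID 10 needs an even number of disks, at least 4")
        return (disks // 2) * disk_tb
    raise ValueError("only RAID levels 0, 1, and 10 are modelled")


DISK_TB = 2.0  # hypothetical disk size, for illustration only

for level, disks in [(0, 6), (1, 2), (10, 6)]:
    capacity = usable_capacity_tb(level, disks, DISK_TB)
    print(f"RAID {level} with {disks} disks: {capacity:.0f} TB usable")
```

With six such disks, array 0 would give the full 12TB with no protection, while array 10 would give 6TB and survive the loss of one disk in each mirrored pair.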

This model has many appeals: space, cost, performance, and electricity consumption.  It costs about 1/10th of the price of a blade server.  All these appeals come with one condition: this HPC system is only good for applications that have been compiled with CUDA.  This is not really a problem because CUDA is being taught in over 500 universities around the world and is seen as a new standard for HPC today.