May 2011 - White Papers 1
CAD[1] and CAE[2] are specialist professional fields. Correspondingly, a CEW[3] should be professionally built, as opposed to the amateur approach of adding a graphics card to a generic computer. Some CAD applications are CPU[4] intensive and some are GPU[5] intensive. This split emerged only recently and is still evolving. This paper examines the state of hardware technology for CAD applications and helps professional designers and engineers appraise the effectiveness of the CEWs on the market.

Hardware for CPU-Intensive Applications

Computer hardware is designed to support software applications, and it is a common but simplistic view that higher-spec hardware will make all software applications perform better. Until recently, the CPU was indeed the only device that performed computation for software applications. Other processors embedded in a PC or workstation were dedicated to their parent devices, such as a graphics adapter card for display, a TCP-offloading card for network interfacing, and a RAID[10] algorithm chip for hard disk redundancy or capacity extension. However, the CPU is no longer the only processor for software computation, as the next section explains.

Legacy software applications still depend on the CPU for computation. That is, the common view holds for software applications that have not taken advantage of other types of processors. We have done some benchmarking and believe that applications like Maya 03 are CPU intensive.

For CPU-intensive applications to run faster, the general rule is to have the highest CPU frequency, more CPU cores, more main memory, and perhaps ECC memory (see below).

Legacy software was not designed for parallel processing. We should therefore check carefully with the software vendor before expecting multi-core CPUs to deliver higher performance.
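
The caution about multi-core CPUs can be illustrated with Amdahl's law, which bounds the overall speedup by the share of work that can run in parallel. The following is a minimal sketch, not a benchmark: the function name `amdahl_speedup` and the parallel fractions used are hypothetical values chosen for illustration, and real applications will also be limited by memory, I/O and licensing.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal overall speedup when only part of the work scales with core count."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part runs at the same speed; the parallel part is divided across cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical parallel fractions (assumptions, not measurements of any product).
    for fraction in (0.0, 0.5, 0.9):
        for cores in (1, 2, 4, 6):
            speedup = amdahl_speedup(fraction, cores)
            print(f"parallel={fraction:.0%} cores={cores}: {speedup:.2f}x")
```

Under these assumptions, a purely serial legacy application gains nothing from extra cores, and an application that is 50% parallel gains only 1.6x on four cores, which is why the vendor's answer matters more than the core count.
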
Irrespective of multi-threading support, running multiple instances of the same application will raise total output, but this is not the same as multi-threading a single application.

ECC stands for Error-Correcting Code, that is, error detection and correction. A memory module transmits in words of 64 bits. ECC memory modules incorporate electronic circuits that detect a single-bit error and correct it, but they are not able to correct two bit errors occurring in the same word. Non-ECC memory modules do not check at all; the system continues to work unless a bit error violates pre-defined rules for processing.

How often do single-bit errors occur nowadays? How damaging would a single-bit error be? Consider this quotation from Wikipedia in May 2011: “Recent tests give widely varying error rates with over 7 orders of magnitude difference, ranging from 10^-10 to 10^-17 errors/bit-hour, roughly one bit error per hour per gigabyte of memory to one bit error per century per gigabyte of memory.” For scale, one gigabyte holds about 8.6 × 10^9 bits, so 10^-10 errors/bit-hour is a little under one error per hour per gigabyte, and 10^-17 errors/bit-hour is about one error per 1.2 × 10^7 hours per gigabyte, which is closer to a thousand years than to a century.

Hardware for GPU-Intensive Applications

The GPU[5] has now developed far enough to gain the prefix GP, for General Purpose. To be exact, GPGPU stands for General Purpose computation on Graphics Processing Units. A GPU has many cores that can be used to accelerate a wide range of applications. According to GPGPU.org, a central resource for GPGPU news and information, developers who port their applications to the GPU often achieve speedups of orders of magnitude compared to optimized CPU[4] implementations.

Many software applications have been updated to capitalize on the newfound potential of the GPU. CATIA 03, Ensight 04 and Solidworks 02 are examples of such applications. As a result, these applications are far more sensitive to GPU resources than to CPU resources. That is, to run such applications optimally, we should invest in the GPU rather than the CPU for a CEW[3].

According to its own website, the new Abaqus product suite from SIMULIA (a Dassault Systemes brand) leverages the GPU to run CAE[2] simulations twice as fast as on a traditional CPU.
By April 2011, Nvidia had released 6 cards in the new Quadro Fermi family. In ascending order of power and cost, they are the 400, 600, 2000, 4000, 5000 and 6000. According to Nvidia, Fermi delivers up to 6 times the tessellation performance of the previous family, called Quadro FX. We shall equip our CEW[3] with Fermi to achieve optimum price/performance combinations.

The potential contribution of the GPU[5] to performance depends on another issue: CUDA compliance.

State of CUDA Developments

According to Wikipedia, CUDA (Compute Unified Device Architecture) is a parallel computing architecture developed by Nvidia. CUDA is the computing engine in Nvidia GPUs that is accessible to software developers through variants of industry-standard programming languages. For example, programmers use C for CUDA (C with Nvidia extensions and certain restrictions), compiled through a PathScale Open64 C compiler, to code algorithms for execution on the GPU. (The latest stable version is 3.2, released to software developers in September 2010.)

The GPGPU website has a preview of an interview with John Humphrey of EM Photonics, a pioneer in GPU computing and developer of the CUDA-accelerated linear algebra library. Here is an extract of the preview:

“CUDA allows for very direct expression of exactly how you want the GPU to perform a given unit of work. Ten years ago I was doing FPGA work, where the great promise was the automatic conversion of high level languages to hardware logic. Needless to say, the huge abstraction meant the result wasn't good.”

The Quadro Fermi family implements CUDA compute capability 2.1, whereas Quadro FX implemented compute capability 1.3. The newer version provides a significantly richer feature set. For example, Quadro FX did not support “floating point atomic additions on 32-bit words in shared memory” whereas Fermi does. Other notable improvements are:
Bulk storage is an essential part of a CEW[3], both for real-time processing and for archiving for later retrieval. Hard disks with a SATA[7] interface keep getting larger in capacity and cheaper over time, but they are not getting faster or physically smaller. To get faster and smaller drives, we have to select hard disks with SAS[8] interfaces, at a major compromise in capacity and price.

RAID[10] has been around for decades. It provides redundancy, expands a volume well beyond the confines of one physical hard disk, and speeds up sequential reading and writing and, in particular, random writing. We can deploy SAS RAID to address the capacity issue, but the hardware price will go up further.

SSD[9] has recently appeared as a bright star on the horizon. It has not replaced HDD[6] because of its high price, the limited longevity of NAND[11] memory, and the immaturity of controller technology. However, it has recently found a place as a RAID cache, with two important benefits not achievable by other means. The first is faster random reads. The second is a low cost point when used in conjunction with SATA HDDs.

Intel has been shipping Sandy Bridge CPUs[4] and chipsets that are stable and bug free since March 2011. System computation performance is over 20% higher than that of the previous generation, called Westmere. The top CPU model comes in 4 editions that are officially capable of over-clocking to over 4 GHz, as long as CPU power consumption stays within the designed thermal limit, called TDP (Thermal Design Power). The 6-core edition with official over-clocking is expected around June 2011.

Foreseeable Future

Semiconductor manufacturing technology has advanced to 22 × 10^-9 meters (22 nanometers) in 2011 and is heading towards 18 nanometers in 2012. Smaller means more: we will get more cores and more power from a new CPU or GPU[5] made with advancing nanotechnology.
The current laboratory probe limit is 10^-18 meters, and this sets the headroom for semiconductor technologists.

While the GPU[5] and CUDA are having a big impact on performance computing, the dominant CPU[4] manufacturers are not resting on their laurels. They have started to integrate their own GPUs into the CPU. However, the level of integration is a far cry from the CUDA world, and integrated GPUs will not displace CUDA for design and engineering computing in the foreseeable future. This means our current practice, as described above, will remain the prevailing approach for accelerating CAD[1], CAE[2] and CEW[3].

Notes:
1. CAD - Computer Aided Design
2. CAE - Computer Aided Engineering
3. CEW - Computer aided design and Engineering Workstation
4. CPU - Central Processing Unit
5. GPU - Graphics Processing Unit
6. HDD - Hard Disk Drive
7. SATA - Serial AT Attachment
8. SAS - Serial Attached SCSI
9. SSD - Solid State Disk
10. RAID - Redundant Array of Inexpensive Disks
11. NAND - flash memory based on “Not AND” logic gates

