CUDA Hardware Architecture
February 2013
The hardware architecture of a GPGPU can be described in terms of its processing units and its memory hierarchy. The architecture tries to satisfy four goals that can conflict with one another:

o        Maximum computation performance
o        Minimum power consumption
o        Lowest cost
o        Greatest ease of programming

Memory has been the weak link of every computing system based on von Neumann’s stored-program scheme, so any improvement in memory performance tends to improve overall computing performance. The Kepler GK110 GPU is the latest Tesla hardware release from Nvidia as of February 2013, and this article looks at what it improves.

The GK110 has three levels of memory hierarchy: L1 cache, L2 cache and Global Memory. L1 is closest to the individual GPU processors, whereas Global Memory is linked to the main memory of the CPU via the PCI Express bus. The L1 and L2 caches are made of fast, expensive Static Random Access Memory (SRAM). Global Memory is made of Dynamic Random Access Memory (DRAM), specifically GDDR5 (Graphics Double Data Rate version 5). GDDR5 is derived from DDR3, with improvements to bandwidth and operating voltage among other things. It is faster than the DDR3 normally used for the main memory of the CPU, but a lot slower than the L1 and L2 caches of the GPU.
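
To see why the link between CPU main memory and Global Memory matters, the following minimal sketch estimates how long it takes to move a buffer at a given bandwidth. The function name `transfer_seconds` and both bandwidth figures are illustrative assumptions chosen as round numbers; they are not taken from this article and are not measured K20 values.

```python
def transfer_seconds(n_bytes, bandwidth_bytes_per_s):
    """Idealised time to move n_bytes at a sustained bandwidth (no latency)."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return n_bytes / bandwidth_bytes_per_s


# Assumed, illustrative bandwidths in bytes per second (not from the article).
ASSUMED_PCIE_BW = 8e9      # host main memory <-> Global Memory over PCI Express
ASSUMED_DEVICE_BW = 200e9  # GPU <-> Global Memory (GDDR5)

payload = 1 << 30  # 1 GiB buffer

pcie_time = transfer_seconds(payload, ASSUMED_PCIE_BW)
device_time = transfer_seconds(payload, ASSUMED_DEVICE_BW)

print(f"over PCI Express: {pcie_time:.4f} s")
print(f"within the card:  {device_time:.4f} s")
```

Under these assumed figures the PCI Express copy dominates, which is why data is usually kept on the card for as many kernel launches as possible.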

A K20 card of the GK110 generation, with a PCI Express interface, has 13 multiprocessors of 192 cores each. It draws 225 W at full load, and this TDP (Thermal Design Power) limit is the same as in the previous two generations of hardware. Because the power budget has stayed constant, “performance per watt” has risen at the same rate as maximum performance across the three generations of hardware. Applications compiled for Fermi will gain performance on Kepler without recompilation, and will gain further if the source code is revised to take advantage of new Kepler architectural features.
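
The arithmetic behind these figures can be sketched as follows. The multiprocessor count, cores per multiprocessor and TDP come from the paragraph above; the function names and the relative performance numbers in the usage lines are hypothetical and serve only to show that, at a fixed TDP, the gain in performance per watt equals the gain in performance.

```python
MULTIPROCESSORS = 13    # from the article
CORES_PER_MP = 192      # from the article
TDP_WATTS = 225         # from the article


def total_cores(multiprocessors, cores_per_mp):
    """Total number of cores on the card."""
    return multiprocessors * cores_per_mp


def perf_per_watt_gain(old_perf, new_perf, old_tdp, new_tdp):
    """Ratio of new to old performance per watt."""
    return (new_perf * old_tdp) / (old_perf * new_tdp)


print(total_cores(MULTIPROCESSORS, CORES_PER_MP))  # 13 * 192 = 2496

# Hypothetical relative performance: the new generation is 3x the old one.
# With the same TDP on both sides, the per-watt gain is also 3x.
print(perf_per_watt_gain(1.0, 3.0, TDP_WATTS, TDP_WATTS))
```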


A loop in the application source code typically forms one unit of work, called a kernel, that is identified for parallel execution on the GPGPU. A kernel is assigned to the hardware multiprocessors via a grid, which consists of equally sized blocks of threads. Blocks are mapped to multiprocessors, and threads are mapped to cores in scheduling units called warps. All threads of a block run the same code on different data, and the threads of a warp execute in step with each other, implementing a SIMD (Single Instruction Multiple Data) approach.
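
The mapping from a loop to a grid of blocks and threads can be illustrated with a small host-side simulation. This is a sketch in plain Python, not CUDA code: the names `launch_config`, `global_index`, `run_kernel` and `saxpy_kernel` are hypothetical, and the threads are executed one after another rather than in parallel. The warp size of 32 threads is the value used by Nvidia GPUs.

```python
WARP_SIZE = 32  # threads per warp on Nvidia GPUs


def launch_config(n_elements, threads_per_block):
    """Return (blocks in the grid, warps per block) for a 1-D launch."""
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    warps_per_block = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    return blocks, warps_per_block


def global_index(block_idx, block_dim, thread_idx):
    """Loop index handled by one thread, as in blockIdx * blockDim + threadIdx."""
    return block_idx * block_dim + thread_idx


def run_kernel(kernel, n_elements, threads_per_block, *args):
    """Sequentially emulate every thread of every block in the grid."""
    blocks, _ = launch_config(n_elements, threads_per_block)
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            i = global_index(block_idx, threads_per_block, thread_idx)
            if i < n_elements:  # the last block may have idle threads
                kernel(i, *args)


def saxpy_kernel(i, a, x, y, out):
    """Body of the original loop: one iteration per thread."""
    out[i] = a * x[i] + y[i]


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0] * 5
out = [0.0] * 5
run_kernel(saxpy_kernel, len(x), 4, 2.0, x, y, out)
print(out)                       # [12.0, 14.0, 16.0, 18.0, 20.0]
print(launch_config(1000, 256))  # (4, 8): 4 blocks, 8 warps per block
```

Each thread runs the same kernel body on a different element, which is the same-code, different-data pattern described above; on real hardware the threads of a warp would execute together rather than in sequence.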