
CUDA Software Programming
February 2013
Compilers for popular languages such as C, C++ and Fortran support CUDA GPUs by incorporating the appropriate library calls.
CUDA libraries include CUFFT (Fast Fourier Transform), CUBLAS (Basic Linear Algebra Subprograms), CURAND (random number generation) and others; the full list can be found at https://developer.nvidia.com/

Memory has been the weak link of every computing system based on von Neumann's stored-program scheme, so any improvement in memory performance improves overall computing performance.  The Kepler GK110 GPU is the latest Tesla hardware released by Nvidia as of February 2013, and we will find out what it improves.
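To illustrate how such library calls look in practice, here is a minimal sketch that uses CUBLAS to compute a single-precision AXPY (y = a·x + y) on the GPU.  It assumes the CUDA toolkit is installed and a CUDA-capable GPU is present; error checking is omitted for brevity.

```cuda
// Sketch: y = a*x + y via the CUBLAS v2 API.
// Compile with:  nvcc saxpy.cu -lcublas
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const int n = 4;
    float a = 2.0f;
    float hx[] = {1, 2, 3, 4}, hy[] = {10, 20, 30, 40};

    // Copy the operands to device (GPU) memory.
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    // The library call replaces a hand-written kernel.
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &a, dx, 1, dy, 1);   // y = a*x + y on the GPU
    cublasDestroy(handle);

    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%g ", hy[i]);  // 12 24 36 48
    printf("\n");
    cudaFree(dx); cudaFree(dy);
    return 0;
}
```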

CUDA code is compiled with the NVCC compiler.  NVCC separates the host (CPU) code from the device (GPU) code, which it compiles to an intermediate representation called PTX (Parallel Thread eXecution).  PTX is further compiled into machine code for the target GPU hardware.
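The split can be seen in even the smallest CUDA program, sketched below (assuming the CUDA toolkit is installed); the comments show the NVCC invocations that expose each stage.

```cuda
// Sketch: a minimal kernel, and the NVCC commands for each stage:
//
//   nvcc -ptx hello.cu        // stop after the PTX stage (emits hello.ptx)
//   nvcc -o hello hello.cu    // full compile: host code + PTX -> GPU binary
#include <cstdio>

__global__ void scale(float *x, float s) {
    x[threadIdx.x] *= s;          // device (GPU) code: compiled to PTX
}

int main(void) {                  // host (CPU) code: compiled as ordinary C++
    float h[4] = {1, 2, 3, 4}, *d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    scale<<<1, 4>>>(d, 10.0f);    // launch 1 block of 4 threads
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("%g %g %g %g\n", h[0], h[1], h[2], h[3]);  // 10 20 30 40
    cudaFree(d);
    return 0;
}
```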

Nvidia Tesla also complies with the cross-vendor OpenCL (Open Computing Language) standard, which is maintained by the Khronos Group and supported by Intel, AMD and ARM among others.  OpenCL is not expected to produce binary code as efficient as NVCC's for Nvidia GPUs, since it lacks the CUDA libraries and PTX instruction set that NVCC exploits.


The OpenACC programming standard (http://www.openacc-standard.org/) aims to bring OpenMP-style (Open Multi-Processing) high-level directives to CUDA for existing or legacy application software.  It is supported by PGI's Fortran and C compilers, and version 1 of the standard was issued in 2012.  Programmers simply add hints known as "directives" to the original source to identify which areas of code to accelerate, and the compiler takes care of the rest.  By exposing parallelism to the compiler, directives let the compiler do the detailed work of mapping the computation onto the accelerator.  As with OpenCL, we would not expect the final code to be as efficient as native CUDA, but the incentive here is to re-use legacy software on new CUDA GPUs.


A loop in the application source code constitutes one parcel of work, called a kernel, identified for parallel execution on the GPGPU.  A kernel is assigned to a set of hardware multiprocessors via a grid, which consists of symmetric blocks of threads.  Blocks are mapped to multiprocessors, and threads are mapped to cores in scheduling units called warps.  Each block of threads runs the same code on different data, in synchronisation, implementing a SIMD (Single Instruction, Multiple Data) approach.
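The grid/block/thread hierarchy can be sketched as follows: each thread computes one element of the loop, and the host picks a block size that is a multiple of the warp size (32) with enough blocks to cover all elements.  Error checking is omitted.

```cuda
// Sketch: how a loop becomes a kernel mapped onto a grid of blocks.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                    // guard: the grid may overshoot n
        c[i] = a[i] + b[i];
}

// Host-side launch: blocks are scheduled onto multiprocessors, and
// threads within a block execute in warps of 32.
void launch(const float *a, const float *b, float *c, int n) {
    int threadsPerBlock = 256;    // a multiple of the warp size
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    add<<<blocks, threadsPerBlock>>>(a, b, c, n);
}
```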

Whilst PTX is designed for CUDA hardware, a group at the Georgia Institute of Technology has developed a framework called Ocelot to convert PTX code to run on four different non-CUDA hardware targets.  Ocelot is a dynamic compilation environment for PTX code on heterogeneous systems, which allows extensive analysis of the PTX code and its migration to other platforms.

There is also an attempt, Swan, to port CUDA code to OpenCL; see http://multiscalelab.org/swan.  As of January 2013, the last version noted on the website dates from December 2010, suggesting there is not enough interest or incentive in this kind of code porting.