February 2013
Current programming efforts require a good understanding of the GPU memory hierarchy and the GPU execution model in order to fully exploit the GPU's capacity and capability for maximum application performance. Optimisation is a matter of finding the best trade-off between utilisation of resources and the limitations imposed by the architecture. The challenge to programmers spans the following areas:
o GPU memory management
o Kernel allocation
o CPU and GPU coordination
When hardware with a new architecture is released, manual programming effort is required to gain access to its new performance potential. Over time, compilers incorporate the architectural improvements as extensions. This process appears to be irreplaceable for as long as human brains outperform computers in this kind of thinking.
There have been many attempts to free software programmers from the complexity of hardware so that they can focus on algorithms for solving problems. If we look beyond parallel computing we find similar attempts everywhere; the best known are the TCP/IP stack and the OSI seven-layer model. Those attempts were visionary, and the interfaces between layers are defined for compliance by product vendors. Commercial engineering design applications such as MathWorks Simulink and National Instruments LabVIEW provide graphical user interfaces and tools that relieve design engineers of some programming effort.
CUDA-CHILL, presented by Rudy [1] in 2010, is an effort to automate the optimisation process of parallel programming. It uses a complete scripting language to describe composable compiler transformations that can be written, shared and reused by non-expert application and library developers. It consolidates contributions from 66 previous research efforts in the academic community. Its performance was compared against the CUBLAS 2.0 library for a range of square matrix sizes up to 8192 elements, and it was claimed to match or outperform CUBLAS.
OpenCL is supposed to be a step towards standardising development for GPGPU computing. However, there is nothing to stop innovative vendors from developing vertically integrated, self-serving technology ecosystems, and CUDA is one example. Similarly, Microsoft DirectCompute is an attempt at standardisation, but it requires hardware to comply with DirectX GPU criteria within the Microsoft ecosystem.
An exercise (by Wolfe, quoted in section 2.3 of Rudy [1]) took several days of work to improve a 7-line loop for a simple single-precision matrix multiplication kernel from 1.7 to 208 GFLOPS on a GTX 280 in 2008.
As Nicolescu summarised the situation in his review of Microsoft Research Accelerator v2 [2], the more effort we put into programming optimisation, the more performance gain we obtain; less effort, such as relying only on high-level optimisation, leads to less gain.
1. Gabe Rudy, CUDA-CHILL: A Programming Language Interface for GPGPU Optimisations and Code Generations, MSc thesis, University of Utah, August 2010.
2. Radu Nicolescu, Many Cores - Parallel Programming on GPU, University of Auckland, 13 March 2012.
February 2013
CUDA hardware offers many dimensions of parallelism through the arrangement of multiprocessors and the cores within each. Some of this parallelism is controlled by hardware, but some is left to software to optimise.
The K20 adds a new parallel dimension called Hyper-Q. The K20 can execute up to 32 kernels, launched from different CPU processes, simultaneously, which increases the temporal occupancy of the GPU. Previous-generation CUDA hardware such as Fermi has only a single work connection to the CPU. The multiple-connection feature is hardware based. How much it improves utilisation of the CPU and GPU depends on the individual scenario, but the key point is that it removes the CPU-GPU connection as a potential bottleneck.
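As an illustration, here is a minimal sketch (the kernel name busy_kernel and other identifiers are purely illustrative) of submitting independent kernels on separate CUDA streams; on Kepler, Hyper-Q's multiple hardware work queues can accept these submissions concurrently instead of serialising them through a single connection.

#include <cuda_runtime.h>

// Hypothetical kernel used only to keep the GPU busy.
__global__ void busy_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int nStreams = 8, n = 1 << 20;
    float *buf[nStreams];
    cudaStream_t stream[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaMalloc(&buf[s], n * sizeof(float));
        cudaStreamCreate(&stream[s]);
        // Each launch goes to its own stream; with Hyper-Q these streams
        // are fed to the GPU through separate hardware work queues.
        busy_kernel<<<(n + 255) / 256, 256, 0, stream[s]>>>(buf[s], n);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(buf[s]);
    }
    return 0;
}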
Another new architectural feature of the K20 is Dynamic Parallelism: the ability to launch new grids from within the GPU. Grids can be launched:
o Dynamically: Based on run-time data.
o Independently: Each thread can launch a different grid.
o Simultaneously: From multiple threads at once.
This reduces coordination with the CPU over the PCI Express bus and shifts it to within the GPU. Internal GPU memory transfers are more than 10 times faster than global memory transfers over the PCI Express lanes.
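A minimal sketch of Dynamic Parallelism follows, with illustrative names (parent, child, chunkSize); it assumes a device of compute capability 3.5 and compilation with relocatable device code (nvcc -arch=sm_35 -rdc=true example.cu -lcudadevrt).

__global__ void child(float *chunk, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] *= 2.0f;
}

// Each parent thread inspects run-time data (the size of its chunk) and
// launches its own child grid directly from the GPU, with no round trip
// to the CPU over PCI Express.
__global__ void parent(float *data, const int *chunkSize, int stride, int numChunks) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c < numChunks) {
        int n = chunkSize[c];                 // decided at run time
        int blocks = (n + 255) / 256;
        child<<<blocks, 256>>>(data + c * stride, n);
    }
}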
CUDA SDK (software development kit) version 5.0 supports the new K20 features mentioned above.
The Titan supercomputer team at Oak Ridge National Laboratory published early experience with the K20X and CUDA 5 at SC12 in November 2012, covering five applications. Gain is defined as the speed-up of an Opteron CPU paired with a K20X GPU relative to the CPU alone; gains ranged from 1.8 to 7.8.
February 2013
The hardware architecture of a GPGPU can be described in terms of its processing units and memory hierarchy. The architecture attempts to address four desires which can be in conflict:
o Maximum computation performance
o Minimum power consumption
o Lowest cost
o Friendliest for programming
Memory has been the weak link of every computing system based on von Neumann's stored-program scheme, so any improvement in memory performance improves overall computing performance. The Kepler GK110 GPU is the latest Tesla hardware released by Nvidia as of February 2013, and we will see what the improvement is.
The GPU has three levels of memory hierarchy: L1 cache, L2 cache and Global Memory. L1 is closest to the individual GPU processors, whereas Global Memory is linked to the main memory of the CPU via the PCI Express bus. The L1 and L2 caches are made of fast but expensive Static Random Access Memory (SRAM). Global Memory is made of Dynamic Random Access Memory (DRAM), specifically GDDR5 (Graphics Double Data Rate version 5). GDDR5 is derived from DDR3 but improves on its bandwidth and voltage, among other things. It is faster than the DDR3 normally used for the CPU's main memory but much slower than the GPU's L1 and L2 caches.
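The following sketch, with hypothetical variable names, illustrates the path just described: data starts in CPU main memory and must be copied into GPU global memory over PCI Express before any kernel can use it, which is why minimising such transfers matters.

#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);    // CPU main memory (DDR3)
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data = nullptr;
    cudaMalloc(&d_data, bytes);                // GPU Global Memory (GDDR5)

    // Host-to-device copy travels over the PCI Express bus.
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    // ... launch kernels that read and write d_data ...
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    free(h_data);
    return 0;
}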
A K20 card of the GK110 generation with a PCI Express interface consists of 13 multiprocessors, each with 192 cores. It consumes 225 W at full load, and this TDP (Thermal Design Power) limit is the same as for the previous two generations of hardware. This implies that performance per watt has increased at the same rate as maximum performance over the three generations of hardware. Applications compiled for Fermi gain performance on Kepler without recompilation; they gain further if the source code is revised to take advantage of the new Kepler architectural features.
A loop in the application source code constitutes one parcel of work, called a kernel, identified for parallel computation on the GPGPU. A kernel is assigned to a set of hardware multiprocessors via a grid, which consists of uniform blocks of threads. Blocks are mapped to multiprocessors, and threads are mapped to cores in scheduling units called warps. Each block of threads runs the same code on different data, in synchronisation, implementing a SIMD (Single Instruction Multiple Data) approach.
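As a minimal sketch of this mapping (the kernel name scale and its parameters are illustrative), a loop that multiplies an array by a constant becomes a kernel, and the launch configuration defines the grid of thread blocks:

// The loop body becomes a kernel; every thread computes its own global
// index from its block and thread IDs and handles one element.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i];   // same instruction, different data
}

// Host-side launch: n threads arranged as a grid of 256-thread blocks.
// Blocks are scheduled onto multiprocessors; threads run on cores in warps of 32.
// scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);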
February 2013
Popular compilers for C, C++ and Fortran support CUDA GPUs by incorporating the appropriate library calls.
CUDA libraries include CUFFT (Fast Fourier Transform), CUBLAS (Basic Linear Algebra Subprograms), CURAND (random number generation), etc. The full list can be found at https://developer.nvidia.com/.
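For example, a matrix multiplication can be handed to CUBLAS instead of being hand-written. The sketch below assumes a handle created earlier with cublasCreate and square matrices already resident in GPU memory; d_A, d_B and d_C are hypothetical device pointers in the column-major layout CUBLAS expects.

#include <cublas_v2.h>

// Computes C = alpha*A*B + beta*C for n-by-n matrices on the GPU.
void gemm_example(cublasHandle_t handle, const float *d_A, const float *d_B,
                  float *d_C, int n) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,
                &alpha, d_A, n, d_B, n,
                &beta, d_C, n);
}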
CUDA code is compiled with the NVCC compiler. NVCC separates CPU code from GPU code; the GPU code is compiled to PTX (Parallel Thread eXecution), an intermediate form that is further compiled to map onto the GPU hardware.
Nvidia Tesla complies with OpenCL (Open Computing Language), a cross-vendor standard maintained by the Khronos Group and supported by Intel, AMD and ARM, among others. OpenCL is not expected to produce binary code as efficient as NVCC's for Nvidia GPUs because it lacks the CUDA libraries and PTX instructions available to NVCC.
The OpenACC programming standard (http://www.openacc-standard.org/) attempts to apply OpenMP-style high-level directives to generate CUDA from existing or legacy application software. It is supported by PGI, applies to Fortran and C compilers, and issued version 1 in 2012. Programmers simply add hints known as "directives" to the original source to identify which areas of code to accelerate, and the compiler takes care of the rest. By exposing parallelism to the compiler, directives allow the compiler to do the detailed work of mapping the computation onto the accelerator, as in the sketch below. We would not expect the final code to be as efficient as native CUDA, but the incentive here is to reuse legacy software on new CUDA GPUs.
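A minimal sketch of the directive approach on a hypothetical legacy C loop follows; the function name and the exact data clauses a real code would need are illustrative only.

// The directive asks the compiler to offload the loop to the accelerator
// and to handle copying the named arrays between host and device memory.
void vec_add(const float *restrict a, const float *restrict b,
             float *restrict c, int n) {
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}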
Whilst PTX is designed for CUDA hardware, a group at the Georgia Institute of Technology has developed a framework called Ocelot to run PTX code on four different non-CUDA hardware targets. Ocelot is a dynamic compilation environment for PTX on heterogeneous systems, allowing extensive analysis of the PTX code and its migration to other platforms.
There is an attempt, called Swan, to port CUDA code to OpenCL (see http://multiscalelab.org/swan). As of January 2013, the last version noted on the website dates from December 2010; presumably there is not enough interest or incentive in code porting.