|High Level Compilation|
Current programming efforts require a good understanding of the GPU memory hierarchy and the GPU execution model in order to fully exploit the GPU's capacity and capability for maximum application performance. Optimisation is a search for the best trade-off between utilisation of resources and the limitations imposed by the architecture. The challenge to programmers spans the following areas:
o GPU memory management
o Kernel allocation
o CPU and GPU coordination
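These three challenge areas surface even in a minimal CUDA host program. The sketch below (a hypothetical example with error handling elided) touches each of them: explicit device memory management via `cudaMalloc`/`cudaFree`, kernel allocation through the programmer-chosen grid and block configuration, and CPU-GPU coordination via `cudaMemcpy` and `cudaDeviceSynchronize`.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Hypothetical kernel: scale each element of a vector by 2 on the GPU.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_x = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    // 1. GPU memory management: device memory is allocated explicitly.
    float *d_x;
    cudaMalloc(&d_x, bytes);

    // 2. CPU/GPU coordination: input must be copied to the device.
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

    // 3. Kernel allocation: the programmer chooses the grid/block sizes,
    //    a tuning decision the compiler does not make automatically.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_x, n);

    // Coordination again: wait for the GPU, then copy results back.
    cudaDeviceSynchronize();
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    free(h_x);
    return 0;
}
```

Every one of these steps is the programmer's responsibility in CUDA; it is exactly this burden that the high-level compilation efforts discussed below aim to automate.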
When hardware of a new architecture is released, manual programming effort is required to gain access to its new performance potential. Over time, compilers eventually incorporate the architectural improvements as compiler extensions. This process appears irreplaceable as long as human brains outperform computers in thinking.
There have been many attempts to free software programmers from the complexity of hardware so that they can focus on algorithms for solving problems. Looking beyond parallel computing, we find similar attempts everywhere; among the best known are the TCP/IP stack and the OSI seven-layer model, both visionary efforts in which the interfaces between layers are defined for compliance by product vendors. Commercial engineering design applications such as MathWorks Simulink and National Instruments LabVIEW provide graphical user interfaces and tools that relieve design engineers of some programming effort.
CUDA-CHiLL by Rudy [1] is a 2010 effort to automate the optimisation process of parallel programming. It uses a complete scripting language to describe composable compiler transformations that can be written, shared and reused by non-expert application and library developers, consolidating contributions from 66 previous research efforts of the academic community. Its performance was compared against the CUBLAS 2.0 library for a range of square matrix sizes up to 8192 elements, and it was claimed to match or outperform CUBLAS.
OpenCL is intended as a step towards standardising development for GPGPU computing. However, there is nothing to stop innovative vendors from developing vertically integrated, self-serving technology ecosystems, and CUDA is one example. Similarly, Microsoft DirectCompute is an attempt at standardisation, but it requires hardware to comply with DirectX GPU criteria within the Microsoft ecosystem.
An exercise (by Wolfe, quoted in [1], section 2.3) took several days of work to improve a simple seven-line single-precision matrix multiplication kernel from 1.7 to 208 GFLOPS on a GTX 280 in 2008.
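To make the scale of that effort concrete, a naive single-precision matrix multiplication kernel of the kind such an exercise starts from fits in a handful of lines; the days of work go into staging tiles in shared memory, unrolling loops, and tuning launch parameters. The following is a hedged sketch of the naive starting point, not Wolfe's actual code:

```cuda
// Naive SGEMM-style kernel for square n-by-n row-major matrices:
// C = A * B. Each thread computes one output element.
// A sketch of the simple starting point, not the optimised version.
__global__ void matmul(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}
```

Because every operand is re-read from slow global memory, a kernel like this achieves only a small fraction of peak throughput; closing the gap to hundreds of GFLOPS is precisely the hand-optimisation burden that systems such as CUDA-CHiLL try to automate.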
As Nicolescu summarised the situation in his review of Microsoft Research Accelerator v2 [2], the more effort we put into programming optimisation, the more performance we gain; less effort, such as relying on high-level optimisation alone, leads to less gain.
1. Gabe Rudy, CUDA-CHiLL: A Programming Language Interface for GPGPU Optimisations and Code Generation, MSc thesis, University of Utah, August 2010
2. Radu Nicolescu, Many Cores - Parallel Programming on GPU, University of Auckland, 13 March 2012