
CUDA Hardware Computation Hierarchy
June 2013

This note assumes prior reading on the CUDA processor hierarchy and memory hierarchy, as it attempts to link the two together for computational work. The content is based on various online resources as of June 2013.

Kernel, Grid, Block and Thread
o A kernel is a loop of application code that the programmer has identified for execution on the CUDA GPU.  The programmer is free to choose the number of blocks for a kernel.  Once the application has been compiled and is executing, the CUDA GPU assigns each block to an SM (streaming multiprocessor), as shown in the figure below, which assumes a GPU with only 4 SMs.  Threads execute in parallel within a block, and blocks execute in parallel within a kernel's grid.  Kernels are launched sequentially.  If there are several kernels in the application, there will be several grids on the GPU.  A short launch sketch follows this list.

o All kernels have access to global memory, so successive kernels can share data through it.

o A grid can have 1 or 2 dimensions (newer devices also allow 3) and a block can have 1, 2 or 3 dimensions.  These dimensions, or coordinates, are abstractions intended purely to assist the programmer.  The programmer specifies the grid size (number of rows and columns of blocks) and block size (rows, columns and layers of threads) to match the problem being addressed.
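
To make the launch mechanics concrete, here is a minimal sketch, assuming a hypothetical 1024 x 1024 matrix and made-up kernel names (fillKernel, scaleKernel): it specifies a 2-dimensional grid of 2-dimensional blocks and launches two kernels in sequence that share data through global memory.

#include <cuda_runtime.h>

#define WIDTH  1024
#define HEIGHT 1024

// First kernel: each thread writes one element of a matrix held in global memory.
__global__ void fillKernel(float *data)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < HEIGHT && col < WIDTH)
        data[row * WIDTH + col] = (float)(row + col);
}

// Second kernel: reads what the first kernel left behind in global memory.
__global__ void scaleKernel(float *data, float factor)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < HEIGHT && col < WIDTH)
        data[row * WIDTH + col] *= factor;
}

int main()
{
    float *d_data;
    cudaMalloc(&d_data, WIDTH * HEIGHT * sizeof(float));   // global memory

    // Block size: 16 x 16 = 256 threads; grid size: enough blocks to cover the matrix.
    dim3 block(16, 16);
    dim3 grid((WIDTH + block.x - 1) / block.x,
              (HEIGHT + block.y - 1) / block.y);

    // Kernel launches on the same stream run one after the other; the second
    // kernel sees the data the first wrote to global memory.
    fillKernel<<<grid, block>>>(d_data);
    scaleKernel<<<grid, block>>>(d_data, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}

The block and grid sizes here are arbitrary; the point is only that the dim3 coordinates are chosen by the programmer to mirror the 2-dimensional shape of the problem.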

Warp and Thread
o The warp is the real unit of lockstep execution.  One warp holds up to 32 threads.  If some threads in a warp take the "then" branch and others go in the "else" direction, they can no longer operate in lockstep and some threads must wait.  This situation is called thread divergence and is undesirable.  The Oxford paper referenced says the Nvidia CUDA Compiler applies voting or prediction to choose which path to take, though what this involves in practice is not made clear; it appears to be the best Nvidia had done up to the time of that paper (December 2012).  A sketch contrasting a divergent and a warp-aligned branch appears after this list.

o Accesses to global or shared memory are issued per half-warp (each SM has 16 load/store units), so while one half-warp waits on memory the other half can continue to run, using time more effectively.

o Programmers are encouraged to launch a large number of warps so that memory latency is hidden as much as possible and the maximum parallel execution effect is achieved.  The second sketch below shows such a launch together with a coalesced access pattern.
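
The following sketch, with made-up kernel names, contrasts a branch whose condition splits the threads of a warp (divergent) with one whose condition is uniform across each 32-thread warp (no divergence).

// Odd and even lanes of the SAME warp take different paths, so the warp must
// execute both paths one after the other (thread divergence).
__global__ void divergentKernel(int *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        out[tid] = 1;   // "then" path
    else
        out[tid] = 2;   // "else" path
}

// All 32 threads of a given warp evaluate this condition the same way,
// so no warp is forced to execute both paths.
__global__ void warpAlignedKernel(int *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)
        out[tid] = 1;
    else
        out[tid] = 2;
}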
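
As a rough illustration of the last two points, the sketch below (hypothetical sizes and names) uses a coalesced access pattern, in which consecutive threads of a half-warp touch consecutive addresses, and launches enough blocks that each SM has many resident warps to switch between while others wait on memory.

#include <cuda_runtime.h>

__global__ void copyKernel(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Consecutive threads of a (half-)warp read and write consecutive
    // addresses, so the hardware can combine them into a few wide transactions.
    if (tid < n)
        out[tid] = in[tid];
}

int main()
{
    const int n = 1 << 22;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // 256 threads per block = 8 warps per block; with thousands of blocks the
    // scheduler always has other warps ready to run while some are waiting on
    // memory, so the latency is hidden.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    copyKernel<<<blocks, threads>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}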