GPGPU - Programmability

Programmability

In principle, any boolean function can be built-up from a functionally complete set of logic operators. An early example of general purpose computing with a GPU involved performing additions by using an early stream processor called a blitter to invoke a special sequence of logical operations on bit vectors. Such methods are seldom used today as modern GPUs now include support for more advanced mathematical operations including addition, multiplication, and often certain transcendental functions.

The programmability of the pipelines have trended according to Microsoft’s DirectX specification, with DirectX 8 introducing Shader Model 1.1, DirectX 8.1 Pixel Shader Models 1.2, 1.3 and 1.4, and DirectX 9 defining Shader Model 2.x and 3.0. Each shader model increased the programming model flexibilities and capabilities, ensuring the conforming hardware follows suit. The DirectX 10 specification introduces Shader Model 4.0 which unifies the programming specification for vertex, geometry (“Geometry Shaders” are new to DirectX 10) and fragment processing allowing for a better fit for unified shader hardware, thus providing one computational pool of programmable resource. Common formats are:

8 bits per pixel – Sometimes palette mode, where each value is an index in a table with the real color value specified in one of the other formats. Sometimes three bits for red, three bits for green, and two bits for blue.
16 bits per pixel – Usually allocated as five bits for red, six bits for green, and five bits for blue.
24 bits per pixel – eight bits for each of red, green, and blue
32 bits per pixel – eight bits for each of red, green, blue, and alpha

For early fixed-function or limited programmability graphics (i.e. up to and including DirectX 8.1-compliant GPUs) this was sufficient because this is also the representation used in displays. This representation does have certain limitations, however. Given sufficient graphics processing power even graphics programmers would like to use better formats, such as floating point data formats, to obtain effects such as high dynamic range imaging. Many GPGPU applications require floating point accuracy, which came with graphics cards conforming to the DirectX 9 specification.

DirectX 9 Shader Model 2.x suggested the support of two precision types: full and partial precision. Full precision support could either be FP32 or FP24 (floating point 32- or 24-bit per component) or greater, while partial precision was FP16. ATI’s R300 series of GPUs supported FP24 precision only in the programmable fragment pipeline (although FP32 was supported in the vertex processors) while Nvidia’s NV30 series supported both FP16 and FP32; other vendors such as S3 Graphics and XGI supported a mixture of formats up to FP24.

Shader Model 3.0 altered the specification, increasing full precision requirements to a minimum of FP32 support in the fragment pipeline. ATI’s Shader Model 3.0 compliant R5xx generation (Radeon X1000 series) supports just FP32 throughout the pipeline while Nvidia’s NV4x and G7x series continued to support both FP32 full precision and FP16 partial precisions. Although not stipulated by Shader Model 3.0, both ATI and Nvidia’s Shader Model 3.0 GPUs introduced support for blendable FP16 render targets, more easily facilitating the support for High Dynamic Range Rendering.

The implementations of floating point on Nvidia GPUs are mostly IEEE compliant; however, this is not true across all vendors. This has implications for correctness which are considered important to some scientific applications. While 64-bit floating point values (double precision float) are commonly available on CPUs, these are not universally supported on GPUs; some GPU architectures sacrifice IEEE compliance while others lack double-precision altogether. There have been efforts to emulate double-precision floating point values on GPUs; however, the speed tradeoff negates any benefit to offloading the computation onto the GPU in the first place.

Most operations on the GPU operate in a vectorized fashion: one operation can be performed on up to four values at once. For instance, if one color is to be modulated by another color , the GPU can produce the resulting color in one operation. This functionality is useful in graphics because almost every basic data type is a vector (either 2-, 3-, or 4-dimensional). Examples include vertices, colors, normal vectors, and texture coordinates. Many other applications can put this to good use, and because of their higher performance, vector instructions (SIMD) have long been available on CPUs.

In 2002, James Fung et al developed OpenVIDIA at University of Toronto, and demonstrated this work, which was later published in 2003, 2004, and 2005, in conjunction with a collaboration between University of Toronto and nVIDIA. In November 2006 Nvidia launched CUDA, an SDK and API that allows using the C programming language to code algorithms for execution on Geforce 8 series GPUs. OpenCL, an open standard defined by the Khronos Group provides a cross-platform GPGPU platform that additionally supports data parallel compute on CPUs. OpenCL is actively supported on Intel, AMD, Nvidia and ARM platforms. GPGPU compared, for example, to traditional floating point accelerators such as the 64-bit CSX700 boards from ClearSpeed that are used in today's supercomputers, current top-end GPUs from AMD and Nvidia emphasize single-precision (32-bit) computation; double-precision (64-bit) computation executes more slowly.