
C++ AMP

Overview

  • C++ AMP is a GPGPU API – it allows you to define functions (kernels) that take some input, perform an expensive calculation on the GPU, and return the output to the CPU. The GPU supports fast arithmetic across many SIMD-like cores – an NVidia Tesla supports 512 cores, compared to the paltry 10 cores available on a CPU today; even Intel’s Knights Corner will only support 60 cores next year. It is suitable only for certain classes of problems (i.e. data-parallel algorithms) and not for others (e.g. algorithms with branching, recursion, or other complex flow control).
  • Caveat – you pay a high cost for transferring the input data from the CPU to the GPU and the results back to the CPU, so the computation itself has to be long enough to justify the transfer overhead.
  • DirectX 11 offers the DirectCompute API for GPGPU – this requires you to code in HLSL (a C-like language for expressing pixel, vertex, and tessellation shaders for graphics pipelines). C++ AMP abstracts away from that and is part of Visual C++ – you don’t need to use a different compiler or learn a different syntax.
  • The C++ AMP programming model includes multidimensional arrays, indexing, memory transfer, tiling, and a mathematical function library. C++ AMP language extensions and compiler restrictions enable you to control how data is moved from the CPU to the GPU and back, which enables you to control the performance impact of moving the data back and forth.
  • Note: A competitor of C++ AMP is OpenCL, an open cross-vendor standard that plays a similar role to NVidia’s CUDA.
  • System Requirements – compile time: Visual Studio 2011 developer preview; runtime: DirectX 11.
  • The C++ AMP Math Library provides support for double-precision functions.

Canonical Example – Matrix Addition

#include <amp.h>
#include <iostream>

using namespace concurrency;
using namespace std;

void CampMethod() {
    int aCPP[] = {1, 2, 3, 4, 5};
    int bCPP[] = {6, 7, 8, 9, 10};
    int sumCPP[5] = {0, 0, 0, 0, 0};

    // Create C++ AMP wrappers for GPU transport – in this case 1D vectors of int.
    array_view<int, 1> a(5, aCPP);
    array_view<int, 1> b(5, bCPP);
    array_view<int, 1> sum(5, sumCPP);

    parallel_for_each(
        // Define the compute domain, which is the set of threads that are created.
        sum.grid,
        // Define the lambda expression to run on each thread on the accelerator – pass by value.
        [=](index<1> idx) mutable restrict(direct3d)
        {
            sum[idx] = a[idx] + b[idx];
        }
    );

    // Print the results. The expected output is "7, 9, 11, 13, 15".
    for (int i = 0; i < 5; i++) {
        cout << sum[i] << "\n";
    }
}
  • This example uses C++ arrays to construct three C++ AMP array_view objects. You supply four values to construct an array_view object: the data values, the rank, the element type, and the length of the array_view object in each dimension. The rank and element type are passed as template parameters; the data and lengths are passed as constructor parameters.
  • The parallel_for_each function provides the mechanism for iterating through the data elements, or compute domain. In this example, the compute domain is specified by sum.grid. The code that you want to execute is contained in a lambda expression, or kernel function. The restrict(direct3d) modifier ensures that the code runs only on hardware, or an accelerator, that complies with the C++ AMP requirements.
  • The index class variable, idx, is declared with a rank of one to match the rank of the array_view objects. It is used to access the individual elements of the array_view objects.

Shaping and Indexing Data: index, extent, and grid

  • You must define the data values and declare the shape of the data before you can run the kernel code. All data is defined to be an array (rectangular), and you can define the array to have any rank (number of dimensions). The data can be any size in any of the dimensions. If you use an array_view object, the origin can use non-zero index values. For convenience, the runtime library has specific types and functions for 3-dimensional arrays.
  • index Class
    • The index Class specifies a location in the array or array_view object by encapsulating the offset from the origin in each dimension into one object.
    • The following example creates a one-dimensional index that specifies the third element in a one-dimensional array_view object. The index is used to print the third element in the array_view object. The output is 3.
int aCPP[] = {1, 2, 3, 4, 5};

array_view<int, 1> a(5, aCPP);

index<1> idx(2);

cout << a[idx];
// Output: 3.
  • The following example creates a two-dimensional index that specifies the element where the row = 1 and the column = 2 in a two-dimensional array_view object. The first parameter in the index constructor is the row component, and the second parameter is the column component. The value in the cell at index [1,2] is 6.
int aCPP[] = {1, 2, 3, 4, 5, 6};

// 2x3 2D matrix created from the array input – 2 rows, 3 columns:
// 1 2 3
// 4 5 6
array_view<int, 2> a(2, 3, aCPP);

index<2> idx(1, 2);

cout << a[idx];
// Output: 6

The following example creates a three-dimensional index that specifies the element where the depth = 0, the row = 1, and the column = 3 in a three-dimensional array_view object. Notice that the first parameter is the depth component, the second parameter is the row component, and the third parameter is the column component. The output is 8.

int aCPP[] = {
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

// 3D matrix: length is 4 in the x dimension, 3 in the y dimension, and 2 in the z dimension.
array_view<int, 3> a(2, 3, 4, aCPP);

// Specifies the element at x = 3, y = 1, z = 0.
index<3> idx(0, 1, 3);

cout << a[idx] << "\n";
// Output: 8.
  • extent Class
    • The extent class specifies the length of the data in each dimension of the array or array_view object.
int aCPP[] = {
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

// 3D matrix – 3 rows, 4 columns, depth is 2.
array_view<int, 3> a(2, 3, 4, aCPP);

cout << "The number of columns is " << a.extent[2] << "\n";
cout << "The number of rows is " << a.extent[1] << "\n";
cout << "The depth is " << a.extent[0] << "\n";

You can construct an array or array_view object by using an extent object in the constructor.

int aCPP[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
              13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24};

extent<3> e(2, 3, 4);
array_view<int, 3> a(e, aCPP);
  • grid Class
    • The grid Class specifies an extent at an index. It enables you to specify the set of threads to be created and to conveniently access a subset of your data by defining an extent at a specific location. The array / array_view class exposes a grid object that is defined to have the index at the origin of the array and the extent of the whole array.
int aCPP[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
              1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

// Length is 4 in the x dimension, 3 in the y dimension, and 2 in the z dimension.
array_view<int, 3> a(2, 3, 4, aCPP);

Moving Data to the Accelerator: array and array_view

  • Two data containers used to move data to the accelerator are defined in the runtime library: the array class and the array_view class. The fundamental difference between the two is that the array class creates a deep copy of the data on the accelerator when the object is constructed, and must be captured by reference in the lambda ([&], or [=, &a] for a single variable). The array_view class is a wrapper that copies the data only when the kernel function accesses it, and is captured by value in the lambda ([=]).
  • array Class
    • When an array object is constructed, a deep copy of the data is created on the accelerator. The kernel function modifies the copy on the accelerator. When the execution of the kernel function is finished, you must manually copy the data back to the host. The following example multiplies each element in a vector by 10. After the kernel function is finished, the vector conversion operator is used to copy the data back into the vector object.
vector<int> data(5);

for (int count = 0; count < 5; count++)
{
    data[count] = count;
}

// Deep-copy the data into an array on the accelerator.
array<int, 1> a(5, data.begin(), data.end());

parallel_for_each(
    a.grid,
    // a is explicitly captured by reference. Other variables would be
    // captured by value.
    [=, &a](index<1> idx) restrict(direct3d)
    {
        a[idx] = a[idx] * 10;
    }
);

// Copy the results back to the host via the vector conversion operator.
data = a;

for (int i = 0; i < 5; i++)
{
    cout << data[i] << "\n";
}
  • array_view Class
    • The array_view class has nearly the same members as the array class, but the underlying behavior is not the same. Data passed to the array_view constructor is not replicated on the GPU as it is with an array constructor. Instead, the data is copied to the array_view object when the kernel function executes. Therefore, if you create two array_view objects that use the same data, both array_view objects refer to the same memory space, and you must synchronize any multithreaded access yourself. Additionally, your kernel function must be dense: that is, it must update every element of the array_view object in order to update the array_view data.

Executing Code over Data: parallel_for_each

  • The parallel_for_each function defines the code that you want to run on the accelerator against the data in the array or array_view object.
  • The parallel_for_each method takes two arguments, a compute domain and a lambda expression.
  • The compute domain is a grid object or a tiled_grid object that defines the set of threads to create for parallel execution. One thread is generated for each element in the compute domain. In this case, the grid object is one-dimensional and has five elements. Therefore, five threads are started. Each thread has access to all the elements in the compute domain.
  • The lambda expression defines the code to run on each thread. The capture clause, [=], specifies that the body of the lambda expression accesses all captured variables by value. In this example, the parameter list creates a one-dimensional index variable named idx. The value of idx.x is 0 in the first thread and increases by one in each subsequent thread.
  • The mutable keyword enables the body of a lambda expression to modify variables that are captured by value, in this case the sum variable.
  • The restrict(direct3d) modifier, or restriction clause, ensures compatibility with hardware targets, enables specialization for hardware targets, and enables code-generation optimizations. The limitations on functions that carry the restrict modifier are described in the restrict keyword section below.
  • The lambda expression can include the code to execute or it can call a separate kernel function. The kernel function must include the restrict(direct3d) modifier.
#include <amp.h>
#include <iostream>

using namespace concurrency;
using namespace std;

void AddElements(
    index<1> idx,
    array_view<int, 1> sum,
    array_view<int, 1> a,
    array_view<int, 1> b
    ) restrict(direct3d)
{
    sum[idx] = a[idx] + b[idx];
}

void AddArraysWithFunction() {
    int aCPP[] = {1, 2, 3, 4, 5};
    int bCPP[] = {6, 7, 8, 9, 10};
    int sumCPP[5] = {0, 0, 0, 0, 0};

    array_view<int, 1> a(5, aCPP);
    array_view<int, 1> b(5, bCPP);
    array_view<int, 1> sum(5, sumCPP);

    parallel_for_each(
        sum.grid,
        [=](index<1> idx) mutable restrict(direct3d)
        {
            AddElements(idx, sum, a, b);
        }
    );

    for (int i = 0; i < 5; i++) {
        cout << sum[i] << "\n";
    }
}

Simplifying & Accelerating Code: Tiles & Barriers

  • Tiling divides an array or array_view object into equal rectangular subsets, or tiles. For each thread, you have access to the global location of a data element relative to the whole array or array_view, and access to the local location relative to the tile. Using the local index value simplifies your code because you don’t have to write the code that translates index values from global to local. To use tiling, you call the grid::tile method on the compute domain in the parallel_for_each method, and you use a tiled_index object in the lambda expression.
  • The code has to access and keep track of values across the tile. You use the tile_static keyword and the tile_barrier::wait method to accomplish this. A variable that is declared with the tile_static keyword has a scope that spans an entire tile, and an instance of the variable is created for each tile. You must handle synchronization of tile-thread access to the variable. The tile_barrier::wait method stops execution of the current thread until all the threads in the tile have reached the call, so you can safely accumulate values across the tile by using tile_static variables. When all the threads in the tile have finished, you can complete any computations that require access to all the values.
  • The tile_static keyword is used to declare a variable that can be accessed by all threads in a tile. The lifetime of the variable starts when execution reaches the point of declaration and ends when the kernel function returns.
  • The following code example uses the sample data below and replaces each value in a tile with the average of the values in that tile.
int sampledata[] = {
    2, 2, 9, 7, 1, 4,
    4, 4, 8, 8, 3, 4,
    1, 5, 1, 2, 5, 2,
    6, 8, 3, 2, 7, 2};

// The tiles are:
// 2 2 9 7 1 4
// 4 4 8 8 3 4
//
// 1 5 1 2 5 2
// 6 8 3 2 7 2

// Averages – create an initial zero matrix.
int averagedata[] = {
    0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0,
};

array_view<int, 2> sample(4, 6, sampledata);
array_view<int, 2> average(4, 6, averagedata);

parallel_for_each(
    // Create threads for sample.grid and divide the grid into 2 x 2 tiles.
    sample.grid.tile<2, 2>(),
    [=](tiled_index<2, 2> idx) mutable restrict(direct3d)
    {
        // Create a 2 x 2 array to hold the values in this tile.
        tile_static int nums[2][2];

        // Copy the values for the tile into the 2 x 2 array.
        nums[idx.local.y][idx.local.x] = sample[idx.global];

        // Wait until all the threads in the tile have filled in the
        // 2 x 2 array, then find the average.
        idx.barrier.wait();

        int sum = nums[0][0] + nums[0][1] + nums[1][0] + nums[1][1];

        // Copy the average into the array_view.
        average[idx.global] = sum / 4;
    }
);

for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 6; j++) {
        cout << average(i, j) << " ";
    }
    cout << "\n";
}

// Output:
// 3 3 8 8 3 3
// 3 3 8 8 3 3
// 5 5 2 2 4 4
// 5 5 2 2 4 4

Creating a C++ AMP Application

  • To create a project
    • File > New > Project > Visual C++ > Win32 Console Application
    • Type a name in the Name box.
    • In Solution Explorer, delete stdafx.h, targetver.h, and stdafx.cpp from the project.
    • Open the .cpp file and put in your C++ AMP code.
  • Clear the Precompiled Header check box.
    • In Solution Explorer, Project > Properties > Configuration Properties > C/C++ > Precompiled Headers – for the Precompiled Header property, select Not Using Precompiled Headers.
  • Build > Build Solution.

Debugging a C++ AMP Application

  • Debugging the CPU Code
    • In Solution Explorer, Project > Properties > Configuration Properties > Debugging. Verify that Local Windows Debugger is selected.
  • Debugging the GPU Code
    • In Solution Explorer, Project > Properties > Configuration Properties > Debugging. In the Debugger to launch list, select GPU C++ Direct3D Compute Debugger.
  • To use the GPU Threads window
    • To open the GPU Threads window, on the menu bar, choose Debug, Windows, GPU Threads.
    • The window shows the total number of active and blocked (by a barrier) GPU threads.
    • There may be multiple tiles allocated for a computation, where each tile contains a number of threads.
    • Because you are debugging locally on the software emulator (reference rasterizer), there will be an active GPU thread emulated by each core of your CPU.
    • A yellow arrow points to the row that includes the current thread. You can select a row and choose Switch To Thread.
    • The Call Stack window always displays the call stack of the current GPU thread.
  • To use the Parallel Stacks window
    • To open the Parallel Stacks window, on the menu bar, choose Debug, Windows, Parallel Stacks.
    • You can use the Parallel Stacks window to simultaneously inspect the stack frames of multiple GPU threads.
    • You can inspect the properties of a GPU thread that are available in the GPU Threads window in the DataTip of the Parallel Stacks window.
    • Use the Parallel Watch window to inspect the values of an expression across multiple threads at the same time – enter the expressions whose values you want to inspect across all GPU threads (via the Add Watch column). You can filter and sort expressions.
    • You can export the content of the Parallel Watch window to Excel by choosing the Excel button.
    • You can flag specific GPU threads by flagging them in the GPU Threads window, the Parallel Watch window, or the DataTip in the Parallel Stacks window.
    • You can group, freeze (suspend), and thaw (resume) GPU threads the same way you do with CPU threads – from either the GPU Threads window or the Parallel Watch window.

restrict keyword

  • The restriction modifier is applied to function declarations. It enforces restrictions on the code in the function and the behavior of the function in applications that use the C++ AMP runtime. The restrict clause takes the following forms:
    • restrict(cpu) The function can run only on the host CPU. (default)
    • restrict(direct3d) The function can run only on the Direct3D target and cannot run on the CPU.
  • The following are not allowed in restrict(direct3d) code (i.e. you cannot use them on the GPU):
    • Recursion.
    • Variables declared with the volatile keyword.
    • Virtual functions.
    • Pointers to functions.
    • Pointers to member functions.
    • Pointers in structures.
    • Pointers to pointers.
    • goto statements.
    • Labeled statements.
    • try, catch, or throw statements.
    • Global variables.
    • Static variables. Use tile_static Keyword instead.
    • dynamic_cast casts.
    • The typeid operator.
    • asm declarations.
    • Varargs.

Learn More

I’ll be presenting on C++ AMP on Monday, December 5, 2011 from 5:30 PM to 7:30 PM at Microsoft Israel – see you there http://vcppamp-dec11.eventbrite.com/  !!

This article is part of the GWB Archives. Original Author: Josh Reuben
