Echelon Blog - NFHN Reader

SyCL

January 3, 2015 - 4 comments

The fourth post in my C++ accelerator library series is about SyCL. SyCL is a royality-free provisional specification from the Khronos group. It is spear-headed by Edinburgh, UK based Codeplay Software. Its aim is to simplify programming many-core vector co-processors and multi-core vector processors by offering a modern C++ interface on top of OpenCL. The main selling point of SyCL is similar to that of C++ AMP: single-source development and inline kernels. That means kernel code can be embedded within C++ code. To that effect, kernel code can also be templated, something that is not possible using standard OpenCL¹.

SyCL builds upon a new abstraction – the command_group. It groups “lists of OpenCL commands that are necessary to perform all the work required to correctly process host data on a device using a kernel”. When acting in concert with the command_queue type it also represents an novel attempt of expressing a dataflow graph. I suspect it is also a means for a SyCL compiler to find and extract the code necessary to generate the proper device kernel. In addition it is supposed to simplify memory management. This is what SyCL looks like:

std::vector h_a(LENGTH);             // a vector
std::vector h_r(LENGTH);             // r vector
{
  // Device buffers
  buffer d_a(h_a);
  buffer d_r(h_r);
  queue myQueue;
  command_group(myQueue, [&]()    
  {
    // Data accessors
    auto a = d_a.get_access();
    auto r = d_r.get_access();
    // Kernel
    parallel_for(count, kernel_functor([ = ](id<> item) {
      int i = item.get_global(0);
      r[i] += a[i];
    }));
  });
}

For each command_group a kernel is launched and for each kernel the user specifies how memory should be accessed through the accessor type. The runtime can thus infer which blocks of memory should be copied when and where.

Hence, a SyCL user does not have to issue explicit copy instructions but informs the runtime about the type of memory access the kernel requires (read, write, read_write, discard_write, discard_read_write). The runtime then ensures that buffers containing the correct data are available once the kernel starts and that the result is available in host memory once it is required. Besides simplifying memory management for the user, I can guess what other problem SyCL tries to solve here: zero-copy. That is to say, with such an interface, code does not have to be changed to perform efficiently on zero-copy architectures. Memory copies are omitted on zero-copy systems but the same interface can be used and the user code does not have to change.

Between the accessor abstraction and host memory is another abstraction: the buffer or image. It seems to be a concept representing data that can be used by SyCL kernels that is stored in some well defined but unknown location.

The buffer or image in turn use the storage concept. It encapsulates the “ownership of shared data” and can be reimplemented by the user. I could not determine from browsing the specification where allocation, deallocation and copy occurs. It must be either in the accessor, the buffer or the storage. Oddly enough, storage is not a template argument of the buffer concept but instead is passed in as an optional reference. I wonder how a user can implement a custom version of storage with this kind of interface.

Kernels are implemented with the help of parallel_for meta-functions. Hierarchies can be expressed by nesting parallel_for, parallel_for_workitem and parallel_for_workgroup. Kernel lambda functions are passed an instance of the id<> type. This object can be used by each instance of the kernel to access its location in the global and local work group. There are also barrier functions that synchronize workgroups to facilitate communication.

To summarize, I applaud the plan to create a single-source OpenCL programming environment. With standard C++11 or even C++14 however, for this to work an additional compiler or a non-standard compiler extension is required. Such a compiler must extract the parallel_for meta-functions and generate code that can execute on the target device. As of the publication of this article, such a compiler does not exist. There is however a test-bed implementation of SyCL that runs only on the CPU.

In addition, I fear that the implicit memory management and command_group abstractions might be to inflexible and convoluted for SyCl to become a valuable foundation one can build even fancier things on.

An update:
Andrew Richards (Founder and CEO of Codeplay) adds that they are working on simplifying storage objects. He also believes that the runtime will have enough information best performing scheduling and memory access. Codeplay is going to use a Clang C++ compiler to extract kernels into OpenCL binary kernels but there is no release date yet.

Sorry, the comment form is closed at this time.