Raising the Bar for Cross-Platform Heterogeneous Compute
On the eve of IWOCL 2026, the Khronos® OpenCL Working Group has released OpenCL™ 3.1, which brings widely deployed, field-proven capabilities, including SPIR-V ingestion, into the core specification, expanding the functionality that developers can rely on across conformant implementations.
The new specification arrives in a growing OpenCL ecosystem, with implementations from multiple silicon vendors, particularly in mobile and embedded markets, and with higher-level frameworks including SYCL™ and chipStar increasingly targeting OpenCL as an acceleration backend. The open-source compiler and runtime ecosystem around OpenCL also continues to mature, with layered implementations of OpenCL over Vulkan and DirectX 12 widening OpenCL’s cross-platform availability, including on platforms without native drivers.
OpenCL Evolution Methodology
Every feature now mandated by OpenCL 3.1 has already been deployed as an extension or optional capability. This is by design. The OpenCL Working Group evolves the specification by proving features in the field as extensions first, watching how they get used across multiple implementations, refining them based on developer feedback, and only then graduating them into the core specification.
Features mandated by OpenCL 3.1 will be reliably available across all conformant OpenCL 3.1 implementations, so application code that targets 3.1 no longer needs per-feature capability checks or fallback paths for them.
Mandated SPIR-V Ingestion
Every conformant OpenCL 3.1 implementation will be required to consume SPIR-V™ kernels — a feature that has been one of the most requested by developers.
SPIR-V™ is Khronos’s portable intermediate representation, produced by a wide range of open-source compilers, including Clang/LLVM, the SPIR-V LLVM Translator, and the newer SPIR-V LLVM backend. Beyond enabling source language flexibility, SPIR-V also allows kernels to be distributed in pre-compiled, optimized intermediate form rather than as source — protecting kernel IP, reducing application startup times, and enabling ahead-of-time specialization.
OpenCL 3.1 additionally requires support for the SPIR-V query extension, which enables applications to enumerate the SPIR-V capabilities, extensions, and versions that a device supports, simplifying the adoption of new SPIR-V features as they become available.
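As a small illustration of working with pre-compiled SPIR-V, the sketch below checks the module header before the binary is handed to the runtime (for example through `clCreateProgramWithIL`). It relies only on the SPIR-V physical layout: a five-word header whose first word is the magic number `0x07230203` and whose second word encodes the version. The helper name `spirv_module_version` is hypothetical and is not part of OpenCL; it is a sanity check under those assumptions, not a validator.

```c
#include <stddef.h>
#include <stdint.h>

/* SPIR-V modules begin with a five-word header; word 0 is the magic number. */
#define SPIRV_MAGIC        0x07230203u
#define SPIRV_HEADER_WORDS 5

/*
 * Hypothetical helper (not an OpenCL API): returns 1 and fills in the SPIR-V
 * major/minor version if the buffer looks like a SPIR-V module in host
 * endianness, and returns 0 otherwise.
 */
static int spirv_module_version(const uint32_t *words, size_t byte_len,
                                unsigned *major, unsigned *minor)
{
    if (words == NULL || major == NULL || minor == NULL)
        return 0;
    /* The module must hold at least the header and be a whole number of words. */
    if (byte_len < SPIRV_HEADER_WORDS * sizeof(uint32_t) ||
        byte_len % sizeof(uint32_t) != 0)
        return 0;
    /* A mismatch means either not SPIR-V or a module of the opposite endianness. */
    if (words[0] != SPIRV_MAGIC)
        return 0;
    /* Version word layout: 0x00 | major | minor | 0x00. */
    *major = (words[1] >> 16) & 0xffu;
    *minor = (words[1] >> 8) & 0xffu;
    return 1;
}
```

An application could compare the extracted version against the SPIR-V versions the device reports before creating the program, and fall back to another module if the device does not list it.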
"Mandatory SPIR-V ingestion is the most consequential change in OpenCL 3.1. SPIR-V has become the natural compilation target for a growing class of higher-level languages and frameworks, including SYCL, chipStar, and a wide range of domain-specific compilers. Making ingestion a guaranteed capability across every conformant implementation removes the last remaining barrier for these tools to fully commit to OpenCL as a runtime. Combined with the working group's extensions-first methodology, which has ensured that every feature mandated in 3.1 is already shipping in the field today, OpenCL 3.1 strengthens the dependable, portable runtime substrate that modern heterogeneous compute needs," said Neil Trevett, OpenCL Working Group Chair.
Building Blocks for AI and HPC Workloads
Several features essential to HPC and AI kernels are also now mandatory in the core OpenCL 3.1 specification:
Subgroups, including shuffles, rotations, and an expanded set of supported data types. A fundamental building block for tuned reductions, scans, and matrix kernels.
Integer dot products, including saturating and accumulating variants, together with extended bit operations: Both map directly to dedicated hardware instructions on a wide range of modern silicon, and both are common building blocks for matrix multiplications and the low-precision arithmetic central to inference workloads.
A new query for the suggested local work-group size. This gives applications and profilers a runtime hint for the optimal work-group size for a given kernel and device, eliminating the need for manual tuning or repeated size calculations across multiple enqueues and improving performance predictability on diverse hardware.
A standard device UUID query, matching Vulkan’s VkPhysicalDeviceIDProperties::deviceUUID. This allows applications to correlate the same physical device across APIs, which is essential for multi-device systems and for external memory-sharing scenarios that span OpenCL and Vulkan.
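To make the integer dot product item above concrete, here is a host-side reference model in plain C of a packed 4x8-bit signed dot product with saturating accumulation. It is meant to mirror what we understand a built-in such as `dot_acc_sat_4x8packed_ss_int` from the `cl_khr_integer_dot_product` extension to compute; the function names `ref_dot_acc_sat_4x8packed_ss`, `pack4`, and `lane_s8` are hypothetical, and the code is a sketch for checking kernel results, not the specification's definition.

```c
#include <stdint.h>

/* Pack four signed 8-bit lanes into one 32-bit word, lane 0 in the low byte. */
static uint32_t pack4(int8_t x0, int8_t x1, int8_t x2, int8_t x3)
{
    return (uint32_t)(uint8_t)x0 |
           ((uint32_t)(uint8_t)x1 << 8) |
           ((uint32_t)(uint8_t)x2 << 16) |
           ((uint32_t)(uint8_t)x3 << 24);
}

/* Extract lane i (0..3) and sign-extend it without relying on
 * implementation-defined narrowing conversions. */
static int32_t lane_s8(uint32_t v, int i)
{
    int32_t byte = (int32_t)((v >> (8 * i)) & 0xffu);
    return byte >= 128 ? byte - 256 : byte;
}

/*
 * Hypothetical reference model: acc + sum(a[i] * b[i]) over four signed
 * 8-bit lanes, clamped to the int32_t range instead of wrapping.
 */
static int32_t ref_dot_acc_sat_4x8packed_ss(uint32_t a, uint32_t b, int32_t acc)
{
    int64_t sum = acc; /* 64-bit intermediate cannot overflow here */
    for (int i = 0; i < 4; ++i)
        sum += (int64_t)lane_s8(a, i) * lane_s8(b, i);
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}
```

For example, lanes (1, -2, 3, -4) and (5, 6, -7, 8) give a dot product of -60, so an accumulator of 10 yields -50. An inner loop of a quantized matrix multiply repeats this operation once per packed word.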
Streamlining Development
OpenCL 3.1 also includes refinements that improve everyday development:
Developers can use new language features without relying on extensions. This means cleaner, more portable kernel code that compiles reliably across all conformant implementations without vendor-specific extension guards.
The OpenCL C printf implementation now supports z (size_t) and t (ptrdiff_t) length modifiers. This closes a long-standing portability gap with standard C, allowing device-side debug output to correctly format pointer-sized and difference-type values without casts or format string workarounds.
CL_DEVICE_HOST_UNIFIED_MEMORY has clarified semantics and can now be used to distinguish integrated from discrete GPUs. Applications can now reliably use this flag to select memory allocation strategies at runtime — for example, skipping explicit buffer copies on integrated GPUs where host and device share the same physical memory.
Local memory kernel arguments may be set to zero to indicate no local memory is needed. This enables kernels that opportunistically use local memory to be dispatched without a separate code path for configurations where none is required.
Observing that an event is CL_COMPLETE is now a synchronization point, removing the previous need for an explicit wait. This eliminates a subtle correctness hazard in which code polling an event’s status could race against memory visibility, making event-driven synchronization both simpler and formally safe.
The memory model’s “inclusive scopes” rule has been relaxed so that scopes no longer need to match exactly. This means a finer-grained scope can now satisfy a coarser-grained synchronization requirement.
Although individually small, these changes collectively eliminate long-standing friction points in OpenCL development.
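As an illustration of the printf change listed above, the sketch below uses the same `z` and `t` length modifiers in standard host-side C, which is the behavior the list says OpenCL C now matches. The helper name `format_sizes` is hypothetical. In a kernel, the analogous use would be something like `printf("gid=%zu\n", get_global_id(0));`, since `get_global_id` returns `size_t`.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: formats an element count (size_t) and a pointer
 * difference (ptrdiff_t) using the z and t length modifiers, so no casts
 * to a fixed-width type are needed. Returns snprintf's result.
 */
static int format_sizes(char *buf, size_t buf_len, size_t count, ptrdiff_t delta)
{
    return snprintf(buf, buf_len, "count=%zu delta=%td", count, delta);
}
```

Without these modifiers, portable code has to cast such values to a fixed type like `unsigned long` before printing, which is the workaround the change removes.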
Implementations in Progress
OpenCL 3.1 has been released with multiple implementations in flight from silicon vendors including Arm, Imagination, Intel, and Qualcomm, together with open-source implementations including Rusticl as part of the Mesa project, PoCL, and CLVK, spanning desktop, mobile, and embedded markets across Windows, Linux, and Android.
Layered implementations are an increasingly important part of how OpenCL is made available across platforms. OpenCLOn12 layers OpenCL over DirectX 12, providing OpenCL on Windows PCs and cloud instances. CLVK, ANGLE, and Rusticl layer OpenCL over Vulkan and Zink, covering Android and the Mesa ecosystem. These layered approaches continue to evolve and play a key role in ensuring broad OpenCL availability across platforms, including when a native driver may not be available.
What’s Next
The extension pipeline that drove OpenCL 3.1 remains active, setting the stage for future core releases. Today’s extensions are a strong indicator of what may become tomorrow’s core specification. Extensions currently in flight include:
Command Buffers for low-overhead replayable workloads. By recording a fixed sequence of commands once and replaying it many times, Command Buffers eliminate the per-submission host overhead that limits throughput in inference serving, simulation loops, and other high-frequency dispatch scenarios.
Unified Shared Memory for simplified pointer-based memory management. USM replaces explicit buffer objects and copy commands with standard pointer semantics, making it significantly easier to port existing CPU code to GPU and to integrate OpenCL into frameworks that assume a unified address space.
Cooperative Matrix operations for high-performance matrix multiplication. These operations map directly to the hardware matrix engines found in modern AI accelerators and GPUs, enabling the dense GEMM performance that is central to both neural network inference and HPC workloads such as molecular dynamics and climate simulation.
New AI data types covering low-precision formats, along with improvements to external memory sharing and image tiling controls. Low-precision types such as int4 and fp8 reduce memory bandwidth and compute cost for AI inference workloads, while the external memory and tiling improvements make it easier to interoperate with Vulkan, DirectX 12, and platform media pipelines.
Beyond extensions, the working group is actively exploring OpenCL’s role as a substrate for higher-level programming models, in safety-critical markets, and on emerging device classes including NPUs and RISC-V accelerators.
Two Takeaways
OpenCL is widely deployed and actively evolving. OpenCL’s implementation ecosystem spans native and layered approaches across all major platforms, and the working group has an active roadmap of new functionality in development.
OpenCL 3.1 brings significant, proven functionality into the core specification, most notably mandatory SPIR-V ingestion, meaningfully expanding what developers can rely on across every conformant implementation and laying the groundwork for the next wave of language and compiler innovation built on OpenCL.
Feedback from the developer community drove OpenCL 3.1, and continues to drive what comes next. File issues and proposals on the OpenCL specification GitHub, and join the conversation on the Khronos Discord. If you’re at IWOCL 2026, come talk!