cuda-utils

Collection of utilities for CUDA programming.

Tested on an NVIDIA Tesla T4 cloud GPU with the following configuration.

  • Compute Capability: 7.5
  • Max Threads per Block: 1024
  • Max Grid Size: 2147483647 x 65535 x 65535
  • Shared Memory per Block: 48 KB
  • Total Global Memory: 14.56 GB
  • Memory Clock Rate: 5.00 GHz
  • Memory Bus Width: 256 bits
  • Warp Size: 32

Refer to the source for the full implementations.

API Reference

Error handling

  • CUDA_CHECK(call) - Wraps CUDA runtime API calls with automatic error checking. Prints error and exits on failure.
  • CUDA_CHECK_KERNEL() - Checks kernel launch and execution errors. Includes device synchronization.
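
A minimal sketch of how such macros are typically written; the library's exact expansion may differ, but the structure below matches the described behavior:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Wraps a runtime API call; prints the error string and exits on failure.
    #define CUDA_CHECK(call)                                        \
      do {                                                          \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
          fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                  cudaGetErrorString(err_), __FILE__, __LINE__);    \
          exit(EXIT_FAILURE);                                       \
        }                                                           \
      } while (0)

    // Checks both launch errors and asynchronous execution errors.
    #define CUDA_CHECK_KERNEL()                                     \
      do {                                                          \
        CUDA_CHECK(cudaGetLastError());      /* launch errors */    \
        CUDA_CHECK(cudaDeviceSynchronize()); /* execution errors */ \
      } while (0)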

Thread management

  • __device__ int get_global_id() - Returns the global thread index across the entire grid.
  • __device__ int get_global_size() - Returns the total number of threads in the grid.
  • __device__ bool is_thread_valid(int n) - Bounds check for thread-to-data mapping; true when the global index is below n.
  • __device__ int get_warp_id() - Returns the warp index within the current block.
  • __device__ int get_lane_id() - Returns the thread's lane index within its warp (0-31).
  • __device__ bool is_warp_leader() - True if the thread is the first in its warp.
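
These helpers are likely thin wrappers over the built-in index variables. A plausible sketch, assuming 1D launches (the actual implementations may also fold in the y/z dimensions):

    __device__ int get_global_id()   { return blockIdx.x * blockDim.x + threadIdx.x; }
    __device__ int get_global_size() { return gridDim.x * blockDim.x; }
    __device__ bool is_thread_valid(int n) { return get_global_id() < n; }
    __device__ int get_warp_id()  { return threadIdx.x / warpSize; }
    __device__ int get_lane_id()  { return threadIdx.x % warpSize; }
    __device__ bool is_warp_leader() { return get_lane_id() == 0; }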

Warp reductions

  • __device__ float warp_reduce_sum(float val) - Shuffle-based warp sum reduction.
  • __device__ float warp_reduce_max(float val) - Shuffle-based warp max reduction.
  • __device__ float warp_reduce_min(float val) - Shuffle-based warp min reduction.
  • __device__ int warp_reduce_sum(int val) - Integer version of warp sum.
  • __device__ int warp_reduce_max(int val) - Integer version of warp max.
  • __device__ int warp_reduce_min(int val) - Integer version of warp min.
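
Shuffle-based reductions exchange values directly between registers, with no shared memory. A sketch of the sum variant as it is commonly written (assumes a full 32-thread warp; the repository's version may handle masks differently):

    __device__ float warp_reduce_sum(float val) {
      // Halve the exchange distance each step: 16, 8, 4, 2, 1.
      for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
      return val;  // lane 0 ends up holding the full warp sum
    }

The max/min and integer variants follow the same loop with fmaxf/fminf or max/min in place of +=.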

Block reductions

  • template<typename T> __device__ T block_reduce_sum(T val) - Two-stage block reduction using shared memory. Requires synchronization.
  • template<typename T> __device__ T block_reduce_max(T val) - Block max reduction. Result valid only in thread 0.
  • template<typename T> __device__ T segmented_reduce(T* data, int* segments, int n, T (*op)(T, T)) - Reduction within segments defined by segment IDs. Only reduces elements with matching segment IDs.
  • template<typename T, int NUM_REDUCTIONS> __device__ void multi_reduce(T* values, T outputs[NUM_REDUCTIONS]) - Performs multiple independent reductions in parallel. Results valid only in thread 0.
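
A plausible two-stage implementation of block_reduce_sum, as suggested by the description (warp-level shuffles, then a shared-memory pass over the per-warp partials):

    template <typename T>
    __device__ T block_reduce_sum(T val) {
      __shared__ T partial[32];               // one slot per warp; covers 1024 threads
      int lane = threadIdx.x % warpSize;
      int warp = threadIdx.x / warpSize;

      for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);

      if (lane == 0) partial[warp] = val;     // each warp leader publishes its sum
      __syncthreads();

      // The first warp reduces the per-warp partials.
      int num_warps = (blockDim.x + warpSize - 1) / warpSize;
      if (warp == 0) {
        val = (lane < num_warps) ? partial[lane] : T(0);
        for (int offset = warpSize / 2; offset > 0; offset >>= 1)
          val += __shfl_down_sync(0xffffffff, val, offset);
      }
      return val;                             // meaningful only in thread 0
    }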

Memory operations

  • template<typename T> __device__ void coalesced_load(T* shared_mem, const T* global_mem, int n) - Coalesced load from global to shared memory using all block threads.

  • template<typename T> __device__ void coalesced_store(T* global_mem, const T* shared_mem, int n) - Coalesced store from shared to global memory.
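
The point of these helpers is the access pattern: consecutive threads touch consecutive addresses, so each warp's accesses coalesce into as few memory transactions as possible. A minimal sketch of the load (the __syncthreads placement is an assumption):

    template <typename T>
    __device__ void coalesced_load(T* shared_mem, const T* global_mem, int n) {
      for (int i = threadIdx.x; i < n; i += blockDim.x)
        shared_mem[i] = global_mem[i];        // stride-by-blockDim keeps warps coalesced
      __syncthreads();                        // make the data visible to the whole block
    }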

Atomic operations

  • __device__ float atomic_add_float(float* address, float val) - Atomic add for single-precision float.
  • __device__ double atomic_add_double(double* address, double val) - Atomic add for double-precision using a CAS loop (see the sketch after this list).
  • __device__ float atomic_max_float(float* address, float val) - Atomic max for float using CAS loop.
  • __device__ float atomic_min_float(float* address, float val) - Atomic min for float using CAS loop.
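
The double-precision add is presumably the classic compare-and-swap loop from the CUDA C++ Programming Guide, needed on devices without a native double atomicAdd:

    __device__ double atomic_add_double(double* address, double val) {
      unsigned long long int* addr = (unsigned long long int*)address;
      unsigned long long int old = *addr, assumed;
      do {
        assumed = old;
        // Reinterpret the bits, add, and attempt the swap; retry if another
        // thread modified the value in the meantime.
        old = atomicCAS(addr, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
      } while (assumed != old);
      return __longlong_as_double(old);
    }

atomic_max_float and atomic_min_float can use the same loop shape with fmaxf/fminf and the __float_as_int/__int_as_float intrinsics.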

Vector operations

  • __device__ float3 make_float3_uniform(float val) - Creates float3 with all components equal to val.
  • __device__ float3 operator+(const float3& a, const float3& b) - Vector addition.
  • __device__ float3 operator-(const float3& a, const float3& b) - Vector subtraction.
  • __device__ float3 operator*(const float3& a, float s) - Vector-scalar multiplication.
  • __device__ float dot(const float3& a, const float3& b) - Dot product.
  • __device__ float length(const float3& v) - Vector magnitude.
  • __device__ float3 normalize(const float3& v) - Unit vector using fast reciprocal square root.
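
A plausible sketch of the two less obvious entries; rsqrtf is the fast reciprocal square root intrinsic the description refers to:

    __device__ float dot(const float3& a, const float3& b) {
      return a.x * b.x + a.y * b.y + a.z * b.z;
    }

    __device__ float3 normalize(const float3& v) {
      float inv = rsqrtf(dot(v, v));          // fast 1/sqrt, slightly less accurate
      return make_float3(v.x * inv, v.y * inv, v.z * inv);
    }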

Utility functions

  • template<typename T> __device__ void swap(T& a, T& b) - Generic swap function.
  • template<typename T> __device__ T lerp(T a, T b, float t) - Linear interpolation.
  • __device__ float smoothstep(float e0, float e1, float x) - Hermite interpolation for smooth transitions.
  • __device__ bool approx_equal(float a, float b, float epsilon = 1e-6f) - Floating-point equality comparison with tolerance.
  • __device__ int next_power_of_two(int x) - Returns smallest power of 2 >= x.
  • __device__ bool is_power_of_two(int x) - Tests if x is power of 2.
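
The power-of-two helpers are standard bit tricks; a sketch of likely implementations:

    // Smallest power of 2 >= x (32-bit ints): smear the highest set bit
    // rightward, then add one.
    __device__ int next_power_of_two(int x) {
      x--;                                    // so exact powers of two map to themselves
      x |= x >> 1;  x |= x >> 2;  x |= x >> 4;
      x |= x >> 8;  x |= x >> 16;
      return x + 1;
    }

    // A power of two has exactly one set bit, so x & (x - 1) clears it to zero.
    __device__ bool is_power_of_two(int x) {
      return x > 0 && (x & (x - 1)) == 0;
    }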

Parallel algorithms

  • template<int BLOCK_SIZE> __device__ void block_scan_exclusive(float* data) - In-place exclusive prefix sum. Data must be in shared memory.
  • __device__ void grid_stride_loop(int n, void (*func)(int)) - Grid-striding loop pattern for large arrays.
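
The grid-stride pattern decouples the launch size from the data size: each thread starts at its global index and jumps by the total thread count until the array is exhausted. Written out directly in a kernel (a standalone illustration, not the library's helper):

    __global__ void scale(float* data, int n, float s) {
      for (int i = blockIdx.x * blockDim.x + threadIdx.x;
           i < n;
           i += gridDim.x * blockDim.x) {
        data[i] *= s;                         // correct for any n and any launch size
      }
    }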

Launch configuration

  • dim3 get_grid_size(int n, int block_size) - Computes the grid size via ceiling division, clamped to hardware grid-dimension limits.
  • dim3 get_launch_params(int n, int& block_size) - Determines optimal block size and grid dimensions.
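
A sketch of what get_grid_size presumably does: ceiling division, capped so oversized inputs still launch (and can then be paired with a grid-stride loop):

    dim3 get_grid_size(int n, int block_size) {
      int blocks = (n + block_size - 1) / block_size;  // DIV_CEIL
      if (blocks > 65535) blocks = 65535;              // MAX_GRID_SIZE cap (assumed)
      return dim3(blocks, 1, 1);
    }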

Memory management

  • template<typename T> T* cuda_malloc(size_t n) - Type-safe device memory allocation.
  • template<typename T> void cuda_free(T* ptr) - Safe device memory deallocation.
  • template<typename T> void cuda_memcpy_h2d(T* dst, const T* src, size_t n) - Host to device memory copy.
  • template<typename T> void cuda_memcpy_d2h(T* dst, const T* src, size_t n) - Device to host memory copy.
  • template<typename T> void cuda_memset(T* ptr, int value, size_t n) - Device memory initialization.
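
These are thin, type-safe wrappers around the runtime API; note that n is an element count, not a byte count. A plausible sketch of two of them, built on CUDA_CHECK:

    template <typename T>
    T* cuda_malloc(size_t n) {
      T* ptr = nullptr;
      CUDA_CHECK(cudaMalloc((void**)&ptr, n * sizeof(T)));
      return ptr;
    }

    template <typename T>
    void cuda_memcpy_h2d(T* dst, const T* src, size_t n) {
      CUDA_CHECK(cudaMemcpy(dst, src, n * sizeof(T), cudaMemcpyHostToDevice));
    }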

RAII memory management

  • cuda_ptr<T> - RAII wrapper for device memory with move semantics. Non-copyable. Methods:

    • cuda_ptr(size_t n): Allocate n elements
    • T* get(): Raw pointer access
    • void reset(size_t n): Reallocate with new size
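
Hypothetical usage (the fill kernel is an illustration, not part of the library); the destructor releases the device memory automatically:

    __global__ void fill(float* p, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) p[i] = 1.0f;
    }

    void demo() {
      const int n = 1 << 20;
      cuda_ptr<float> buf(n);                       // cudaMalloc happens here
      fill<<<get_grid_size(n, 256), 256>>>(buf.get(), n);
      CUDA_CHECK_KERNEL();
    }                                               // ~cuda_ptr frees the memory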

Device information

  • void print_cuda_device_info() - Prints comprehensive device properties and capabilities.
  • void cuda_device_sync() - Explicit device synchronization.

Constants

  • WARP_SIZE (32) - Threads per warp.
  • MAX_THREADS_PER_BLOCK (1024) - Maximum threads per block.
  • MAX_GRID_SIZE (65535) - Maximum grid dimension (a conservative cap; the x dimension supports far larger values, as listed above).

Utility macros

  • DIV_CEIL(a, b) - Integer ceiling division.
  • MIN(a, b), MAX(a, b) - Min/max without function call overhead.
  • CLAMP(x, min_val, max_val) - Clamp value to range.
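
These are standard C idioms; plausible definitions (the library's may differ in detail):

    #define MIN(a, b) ((a) < (b) ? (a) : (b))
    #define MAX(a, b) ((a) > (b) ? (a) : (b))
    #define DIV_CEIL(a, b) (((a) + (b) - 1) / (b))   // e.g. DIV_CEIL(1000, 256) == 4
    #define CLAMP(x, min_val, max_val) (MAX((min_val), MIN((x), (max_val))))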

License

Apache License 2.0.