RightNow Tile
CUDA SIMT to cuTile Python Transpiler
Transform your CUDA kernels for NVIDIA Blackwell GPUs
Live Demo • Quick Start • Features • Patterns • Discord
What is RightNow Tile?
RightNow Tile is a production-grade transpiler that converts traditional CUDA SIMT (Single Instruction, Multiple Threads) kernels into cuTile Python code — NVIDIA's new tile-based programming model optimized for Blackwell GPUs (compute capability 10.x+).
Part of the RightNow AI ecosystem — a code editor built for GPU kernel development.
Why cuTile?
NVIDIA's cuTile represents a paradigm shift in GPU programming:
| Traditional CUDA | cuTile |
|---|---|
| Thread-centric programming | Tile-centric programming |
| Manual memory coalescing | Automatic tile-based loads |
| Complex index calculations | Declarative tile operations |
| Low-level synchronization | High-level tile semantics |
RightNow Tile bridges the gap — take your existing CUDA kernels and transform them for next-gen hardware.
Quick Start
```bash
# Clone the repository
git clone https://github.com/RightNow-AI/RightNow-Tile.git
cd RightNow-Tile

# Install dependencies
npm install

# Start development server
npm run dev
```
Open http://localhost:3000 and start transpiling!
Features
Intelligent Pattern Detection
Automatically identifies 18 computational patterns with 60+ variant-specific optimizations:
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Your CUDA │ ──► │ Pattern Match │ ──► │ Optimized │
│ Kernel │ │ + Analysis │ │ cuTile Code │
└─────────────────┘ └──────────────────┘ └─────────────────┘
9-Stage Transpilation Pipeline
CUDA Source
│
▼
┌──────────────┐
│ 1. Extractor │ Parse kernel signatures, parameters, memory accesses
└──────┬───────┘
▼
┌──────────────┐
│ 2. Parser │ Recognize 150+ CUDA intrinsics & index patterns
└──────┬───────┘
▼
┌──────────────┐
│ 3. Semantic │ Detect reductions, dependencies, race conditions
└──────┬───────┘
▼
┌──────────────┐
│ 4. Memory │ Analyze coalescing, bank conflicts, access patterns
└──────┬───────┘
▼
┌──────────────┐
│ 5. Pattern │ Match against 18 patterns with confidence scoring
└──────┬───────┘
▼
┌──────────────┐
│ 6. IR Build │ Generate intermediate representation with config
└──────┬───────┘
▼
┌──────────────┐
│ 7. Optimize │ Select optimal tile sizes & configurations
└──────┬───────┘
▼
┌──────────────┐
│ 8. CodeGen │ Apply variant-specific templates
└──────┬───────┘
▼
┌──────────────┐
│ 9. Validate │ Verify correctness & generate diagnostics
└──────┬───────┘
│
▼
cuTile Python
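For orientation, here is a minimal TypeScript sketch of how these nine stages chain together. The type names and stage bodies below are hypothetical stubs for illustration only; the real implementation lives under `lib/` and is described in Project Structure below.

```typescript
// Illustrative sketch of the 9-stage flow. Stage bodies are stubbed no-ops;
// the actual stages are implemented in lib/ (extractor, parser, analyzers, ...).
interface PatternMatch {
  archetype: string;   // e.g. 'gemm', 'attention'
  variant: string;     // e.g. 'flash_attention_v2'
  confidence: number;  // 0-1
}

interface PipelineState {
  source: string;
  pattern?: PatternMatch;
  tileCode?: string;
  diagnostics: string[];
}

type Stage = (state: PipelineState) => PipelineState;

const stages: Stage[] = [
  (s) => s, // 1. Extractor: kernel signature, parameters, memory accesses
  (s) => s, // 2. Parser: CUDA intrinsics & index patterns
  (s) => s, // 3. Semantic: reductions, dependencies, race conditions
  (s) => s, // 4. Memory: coalescing, bank conflicts, access patterns
  (s) => ({ ...s, pattern: { archetype: 'elementwise', variant: 'simple', confidence: 1 } }), // 5. Pattern match
  (s) => s, // 6. IR Build: intermediate representation with config
  (s) => s, // 7. Optimize: tile sizes & launch configuration
  (s) => ({ ...s, tileCode: '# generated cuTile Python' }), // 8. CodeGen: variant-specific template
  (s) => s, // 9. Validate: correctness checks & diagnostics
];

export function transpileSketch(source: string): PipelineState {
  const initial: PipelineState = { source, diagnostics: [] };
  return stages.reduce((state, stage) => stage(state), initial);
}
```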
Modern Developer Experience
- Monaco Editor — VS Code-quality editing with syntax highlighting
- Real-time Transpilation — See results instantly
- Dark/Light Themes — Easy on the eyes
- Expandable Output — Full-screen code view
- One-Click Copy — Get your code ready to deploy
Supported Patterns
Core Compute Patterns
| Pattern | Variants | Use Cases | Confidence |
|---|---|---|---|
| GEMM | naive, tiled, register_blocked | Matrix multiplication, deep learning | High |
| Reduction | tree, warp_shuffle, multi_block, segmented | Sum, max, min, dot product | High |
| Scan | inclusive, exclusive, segmented | Prefix sum, stream compaction | High |
| Stencil | 1d_3pt, 1d_5pt, 2d_5pt, 2d_9pt, 3d | Image processing, PDE solvers | High |
| Elementwise | simple, vectorized | Point-wise operations | High |
ML/Deep Learning Patterns
| Pattern | Variants | Use Cases | Confidence |
|---|---|---|---|
| Attention | flash_attention, flash_attention_v2, multi_head, causal, cross | Transformer models | High |
| Normalization | layernorm, rmsnorm, batchnorm, groupnorm, instancenorm | Neural network layers | High |
| Convolution | conv1d, conv2d, conv3d, depthwise, grouped, winograd, im2col | CNNs, signal processing | High |
| Pooling | max_pool_2d, avg_pool_2d, global_avg, global_max, adaptive | Feature downsampling | High |
| Embedding | lookup, embedding_bag, positional | NLP, recommender systems | Medium |
LLM/Transformer-Specific Patterns
| Pattern | Variants | Use Cases | Confidence |
|---|---|---|---|
| RoPE | standard, neox, cached | Rotary position embeddings | High |
| KV Cache | append, paged, prefix, gqa | LLM inference optimization | High |
| Quantization | int8, int4, fp8, dequantize | Model compression | Medium |
| Fused | matmul_activation, matmul_bias_activation, layernorm_residual | Kernel fusion | Medium |
Specialized Patterns
| Pattern | Variants | Use Cases | Confidence |
|---|---|---|---|
| FFT | radix2, radix4, radix8, inverse, real | Signal processing | High |
| Sparse | spmv_csr, spmv_csr_warp, spmv_coo, spmv_ell, spmm, sddmm | Sparse matrix operations | Medium |
| Histogram | atomic, privatized, multipass, weighted, 2d | Data distribution, statistics | Medium |
| Sorting | bitonic, bitonic_shared, radix, merge | Parallel sorting | Medium |
Example
Input: CUDA SIMT Kernel
```cuda
__global__ void vectorAdd(float* a, float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
```
Output: cuTile Python
```python
import cuda_tile as ct
import cupy

TILE_SIZE = 256

@ct.kernel
def vector_add(a, b, c, n: ct.Constant[int], tile_size: ct.Constant[int]):
    """
    Elementwise kernel - auto-transpiled from CUDA
    Original: vectorAdd
    Confidence: 100%
    """
    pid = ct.bid(0)

    # Load input tiles
    a_tile = ct.load(a, index=(pid,), shape=(tile_size,))
    b_tile = ct.load(b, index=(pid,), shape=(tile_size,))

    # Compute
    result = a_tile + b_tile

    # Store result
    ct.store(c, index=(pid,), tile=result)


def launch_vector_add(a, b, c):
    """Launch the vector_add kernel"""
    n = a.shape[0]
    grid = (ct.cdiv(n, TILE_SIZE), 1, 1)
    stream = cupy.cuda.get_current_stream()
    ct.launch(stream, grid, vector_add, (a, b, c, n, TILE_SIZE))
```
Flash Attention Example
Input: Flash Attention CUDA Kernel
```cuda
__global__ void flash_attention_kernel(
    float* Q, float* K, float* V, float* O,
    int seq_len, int head_dim, float scale
) {
    // Complex multi-phase attention implementation
    // with online softmax and tiled matrix multiply
    ...
}
```
Output: cuTile Python (Flash Attention)
```python
import cuda_tile as ct
import cupy

BLOCK_Q = 64
BLOCK_KV = 64

@ct.kernel
def flash_attention(
    Q, K, V, O,
    seq_len_q: ct.Constant[int],
    seq_len_kv: ct.Constant[int],
    head_dim: ct.Constant[int],
    scale: ct.Constant[float],
    block_q: ct.Constant[int],
    block_kv: ct.Constant[int]
):
    """
    Flash Attention kernel - auto-transpiled from CUDA
    Confidence: 95%
    Variant: flash_attention_v2
    """
    block_q_idx = ct.bid(0)
    head_idx = ct.bid(1)

    # Initialize output accumulator and softmax stats
    acc = ct.zeros((block_q, head_dim), dtype=ct.float32)
    m_i = ct.full((block_q,), float('-inf'), dtype=ct.float32)
    l_i = ct.zeros((block_q,), dtype=ct.float32)

    # Load Q tile (stays in registers)
    q_tile = ct.load(Q, index=(head_idx, block_q_idx), shape=(block_q, head_dim))

    # Iterate over K,V blocks with online softmax
    for block_kv_idx in range(0, ct.cdiv(seq_len_kv, block_kv)):
        k_tile = ct.load(K, index=(head_idx, block_kv_idx), shape=(block_kv, head_dim))
        v_tile = ct.load(V, index=(head_idx, block_kv_idx), shape=(block_kv, head_dim))

        # QK^T with scaling
        qk = ct.tile_matmul(q_tile, ct.transpose(k_tile)) * scale

        # Online softmax update
        m_ij = ct.reduce(qk, op=ct.max, axis=1)
        m_new = ct.maximum(m_i, m_ij)
        alpha = ct.exp(m_i - m_new)
        acc = acc * alpha[:, None]
        l_i = l_i * alpha

        p = ct.exp(qk - m_new[:, None])
        l_ij = ct.reduce(p, op=ct.sum, axis=1)
        l_i = l_i + l_ij

        # Accumulate output
        acc = acc + ct.tile_matmul(p, v_tile)
        m_i = m_new

    # Normalize and store
    out = acc / l_i[:, None]
    ct.store(O, index=(head_idx, block_q_idx), tile=out)
```
API Usage
Use the transpiler programmatically:
```typescript
import { transpile } from './lib/transpiler';

const result = await transpile(cudaCode);

// Access results
result.tileCode             // Generated cuTile Python code
result.pattern.archetype    // Detected pattern (e.g., 'attention', 'gemm')
result.pattern.confidence   // Confidence score (0-1)
result.pattern.variant      // Specific variant (e.g., 'flash_attention_v2')
result.validation.isValid   // Validation status
result.diagnostics          // Warnings and suggestions
result.memoryAnalysis       // Memory access analysis
result.semanticAnalysis     // Semantic analysis results
```
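The confidence and validation fields documented above can gate automated workflows. A small illustrative example (the 0.8 threshold is an arbitrary choice, not a value the transpiler prescribes):

```typescript
import { transpile } from './lib/transpiler';

// Only accept output that validated cleanly and matched a pattern with
// reasonable confidence; otherwise surface the diagnostics and skip.
export async function transpileIfConfident(cudaCode: string): Promise<string | null> {
  const result = await transpile(cudaCode);

  if (!result.validation.isValid || result.pattern.confidence < 0.8) {
    console.warn(`Low-confidence ${result.pattern.archetype} match (${result.pattern.confidence})`);
    console.warn(result.diagnostics);
    return null;
  }
  return result.tileCode;
}
```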
REST API
```bash
curl -X POST http://localhost:3000/api/transpile \
  -H "Content-Type: application/json" \
  -d '{"code": "__global__ void add(float* a, float* b, float* c, int n) { ... }"}'
```
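The endpoint can also be called from Node.js or the browser with fetch. This sketch assumes the route returns the transpiler result object as JSON; confirm the exact response shape against the handler in app/api/transpile/.

```typescript
// Assumption: the API responds with the same result object the library returns.
const cudaSource = `__global__ void add(float* a, float* b, float* c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];
}`;

const response = await fetch('http://localhost:3000/api/transpile', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ code: cudaSource }),
});

const result = await response.json();
console.log(result.tileCode);
```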
Project Structure
rightnow-tile/
├── app/
│ ├── api/transpile/ # REST API endpoint
│ ├── components/ # React components
│ │ ├── ScientificVisualization.tsx
│ │ ├── ThemeProvider.tsx
│ │ └── ThemeToggle.tsx
│ ├── page.tsx # Main UI
│ └── globals.css # Styling
├── lib/
│ ├── ast/ # AST extraction & semantic analysis
│ │ ├── extractor.ts # Kernel parsing
│ │ ├── semantic-analyzer.ts
│ │ ├── memory-analyzer.ts
│ │ ├── phase-analyzer.ts # Multi-phase kernel detection
│ │ └── types.ts # 18 archetypes, 60+ variants
│ ├── parser/
│ │ └── intrinsics.ts # 150+ CUDA intrinsics
│ ├── patterns/ # Pattern matchers (18 patterns)
│ │ └── matchers/
│ │ ├── attention.ts # Flash Attention, MHA
│ │ ├── fused.ts # Fused kernels
│ │ ├── fft.ts # FFT variants
│ │ ├── gemm.ts # Matrix multiply
│ │ ├── reduction.ts # Reductions
│ │ ├── scan.ts # Prefix sums
│ │ ├── stencil.ts # Stencil patterns
│ │ ├── sparse.ts # Sparse matrix ops
│ │ ├── histogram.ts # Histogram
│ │ ├── convolution.ts # CNN convolutions
│ │ ├── sorting.ts # Sorting algorithms
│ │ ├── pooling.ts # Pooling layers
│ │ ├── normalization.ts # Norm layers
│ │ ├── embedding.ts # Embeddings
│ │ ├── rope.ts # Rotary embeddings
│ │ ├── kvcache.ts # KV cache ops
│ │ ├── quantization.ts # Quantization
│ │ └── elementwise.ts
│ ├── ir/ # Intermediate representation
│ │ ├── builder.ts # 11 specialized IR types
│ │ ├── optimizer.ts
│ │ └── types.ts
│ ├── codegen/ # Code generation
│ │ ├── generator.ts # Routes to all 18 archetypes
│ │ └── templates/ # 14 template files
│ │ ├── attention.ts
│ │ ├── fused.ts
│ │ ├── sparse.ts
│ │ ├── histogram.ts
│ │ ├── convolution.ts
│ │ ├── sorting.ts
│ │ ├── pooling.ts
│ │ ├── normalization.ts
│ │ ├── embedding.ts
│ │ ├── rope.ts
│ │ ├── kvcache.ts
│ │ ├── quantization.ts
│ │ ├── reduction.ts
│ │ └── stencil.ts
│ ├── validation/ # Validation & diagnostics
│ └── transpiler.ts # Main entry point
├── docs/ # Documentation
└── public/ # Static assets
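Each file under lib/patterns/matchers/ scores kernels against one archetype and reports a variant plus a confidence value. As a rough illustration of that contract, the interfaces below are hypothetical and do not reproduce the project's actual types in lib/ast/types.ts:

```typescript
// Hypothetical matcher shape for illustration; the real definitions live in
// lib/ast/types.ts and lib/patterns/matchers/ and may differ.
interface KernelAnalysis {
  intrinsics: string[];       // CUDA intrinsics recognized by the parser
  usesSharedMemory: boolean;  // from the memory analyzer
  hasReductionLoop: boolean;  // from the semantic analyzer
}

interface PatternMatch {
  variant: string;            // e.g. 'warp_shuffle'
  confidence: number;         // 0-1 score used to rank candidate archetypes
}

interface PatternMatcher {
  archetype: string;                                    // e.g. 'reduction'
  match(analysis: KernelAnalysis): PatternMatch | null; // null when the kernel does not fit
}
```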
Tech Stack
- Framework: Next.js 16 with Turbopack
- Language: TypeScript 5.9
- UI: React 19, Tailwind CSS, Framer Motion
- Editor: Monaco Editor
- Target: NVIDIA cuTile
Requirements
- Node.js 18+
- npm or yarn
- For running generated code: NVIDIA Blackwell GPU (compute capability 10.x+)
Production Deployment
```bash
# Build for production
npm run build

# Start production server
npm start
```
Deploy to Vercel, AWS, or any Node.js hosting platform.
Contributing
We welcome contributions! Here's how to get started:
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Development
```bash
# Run development server
npm run dev

# Type checking
npx tsc --noEmit

# Build
npm run build
```
Roadmap
- Support for 18 CUDA patterns with 60+ variants
- Flash Attention and Transformer-specific patterns
- LLM inference patterns (RoPE, KV Cache, Quantization)
- Comprehensive convolution support (Winograd, im2col)
- Batch transpilation for multiple kernels
- Performance benchmarking comparisons
- VS Code extension integration
- CLI tool for CI/CD pipelines
- CUDA to Triton transpilation
License
This project is licensed under the MIT License — see the LICENSE file for details.
Links
Made with ♥ by RightNow AI
