Axiom is an open-source, high-performance C++ tensor library that brings NumPy and PyTorch simplicity to native code.
Axiom offers:

- **Python-familiar API**: Axiom's C++ API closely follows NumPy and PyTorch. Operator overloading, method chaining, and identical function names mean that if you know NumPy or PyTorch, you already know Axiom.
- **Lazy computation**: Operations build a computation graph and are materialized only when needed; the graph compiler automatically fuses operations and reuses buffers for maximum throughput (see the sketch after this list).
- **Multi-device**: Operations can run on any supported device — currently CPU and Metal GPU, with more coming soon 🤫. Every operation, not just matmul, runs on the GPU with the same API.
- **Unified memory**: On Apple Silicon, CPU and GPU tensors share the same physical memory. Switching between `.cpu()` and `.gpu()` is a zero-copy device-tag change — no `memcpy`, no latency.
- **High performance**: SIMD vectorization, BLAS acceleration, and aggressive parallelization.
- **Cross-platform**: macOS, Linux, and Windows.
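To make the lazy model concrete, here is a minimal sketch using the `ops::` calls from the Quick Start below. The fusion comments describe the intent claimed above; the exact kernel boundaries are internal to the graph compiler.

```cpp
#include <axiom/axiom.hpp>
using namespace axiom;

int main() {
    auto x = Tensor::randn({1024, 1024});

    // Each call only extends the computation graph; no kernels run yet.
    auto y = ops::add(x, x);   // deferred
    auto z = ops::relu(y);     // deferred; a fusion candidate with the add

    // Reading a value forces materialization: at this point the compiler
    // can emit the add and relu as one fused pass and reuse buffers.
    float v = z.item<float>({0, 0});
    (void)v;
    return 0;
}
```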
Axiom is designed for researchers and engineers who need NumPy-level ergonomics with native performance. Learn more in the documentation, or see the Usage Guide for a comprehensive API showcase.
## Why Axiom?

**Axiom is intuitive.**
```cpp
// NumPy: x = np.where(x > 0, x, 0)
x = Tensor::where(x > 0, x, 0);

// NumPy: y = x.reshape(2, -1).T
auto y = x.reshape({2, -1}).T();

// PyTorch: z = F.softmax(scores, dim=-1)
auto z = scores.softmax(-1);
```
**Axiom is fast.**
3500+ GFLOPS on an M4 Pro, outperforming Eigen and PyTorch. See docs/BENCHMARKS.md for full results.
**Axiom is expressive.**
```cpp
// Masking — just like NumPy
auto clamped = x.where(x > 0, 0.0f);

// Broadcasting
auto grid = col_vec + row_vec;  // {3,1} + {1,4} → {3,4}

// Zero-copy slicing
auto patch = image[{Slice(0, 64), Slice(0, 64)}];

// Einops
auto nchw = img.rearrange("b h w c -> b c h w");
auto pooled = x.reduce("b h w c -> b c", "mean");
```
Complex transformations, readable code.
**Axiom is reliable.**

- Comprehensive test suites covering all operations
- CI/CD pipeline testing CPU and GPU paths
- Cross-platform validation on macOS, Linux, and Windows
- NaN/Inf guards and shape assertions that catch errors early
- Deterministic behavior across platforms and runs
Production-ready from day one.
## Download
```sh
git clone https://github.com/frikallo/axiom.git
cd axiom && make release
```
Or fetch directly in CMake:
```cmake
include(FetchContent)
FetchContent_Declare(axiom
  GIT_REPOSITORY https://github.com/frikallo/axiom.git
  GIT_TAG main)
FetchContent_MakeAvailable(axiom)

target_link_libraries(your_target Axiom::axiom)
```
## Quick Start
```cpp
#include <axiom/axiom.hpp>

using namespace axiom;

int main() {
    // Create tensors — just like NumPy
    auto a = Tensor::randn({3, 4});
    auto b = Tensor::ones({4, 5});

    // Chain operations fluently
    auto result = (a.relu() + 1.0f).matmul(b).softmax(-1);

    // Lazy by default — nothing executes until needed
    auto c = ops::add(a, a);            // Deferred
    auto d = ops::relu(c);              // Still deferred
    float val = d.item<float>({0, 0});  // NOW it runs

    // Zero-copy slicing
    auto row = a[{0}];
    auto block = a.slice({Slice(0, 2), Slice(1, 3)});

    // Masking
    auto positive = a.where(a > 0, 0.0f);

    // Einops
    auto img = Tensor::randn({2, 224, 224, 3});
    auto nchw = img.rearrange("b h w c -> b c h w");

    // Linear algebra
    auto [U, S, Vt] = linalg::svd(a);

    return 0;
}
```
GPU acceleration? Just change the device. Every operation runs on Metal—no code changes required:
```cpp
// CPU version
auto x = Tensor::randn({1024, 1024}, DType::Float32, Device::CPU);

// GPU version - same API, 10-20x faster on Apple Silicon
auto x = Tensor::randn({1024, 1024}, DType::Float32, Device::GPU);

// Everything just works: matmul, softmax, reductions, broadcasting, indexing...
auto result = x.matmul(x.T()).softmax(-1).sum({1});  // All on GPU

// On Apple Silicon, device transfers are zero-copy — no memcpy overhead
auto cpu_result = result.cpu();  // Instant: same underlying memory
```
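The snippet above constructs tensors directly on each device. Per the unified-memory feature above, you can also retag an existing tensor with `.gpu()` and `.cpu()`. A minimal sketch, assuming (as the Quick Start suggests) that tensors default to the CPU device:

```cpp
#include <axiom/axiom.hpp>
using namespace axiom;

int main() {
    auto a = Tensor::randn({512, 512});    // created on the CPU (assumed default)
    auto g = a.gpu();                      // zero-copy retag: later ops run on Metal
    auto s = g.matmul(g.T()).softmax(-1);  // executes on the GPU
    auto back = s.cpu();                   // zero-copy retag back to the CPU
    return 0;
}
```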
No other C++ tensor library offers this. Eigen, Armadillo, Blaze—all CPU-only. With Axiom, you get the same clean API with full GPU acceleration on macOS.
## Building from Source
```sh
# Clone
git clone https://github.com/frikallo/axiom.git
cd axiom

# Build (release mode)
make release

# Run tests
make test

# Install system-wide
sudo cmake --install build

# Optional: Build with OpenMP
cmake -B build -DCMAKE_BUILD_TYPE=Release -DAXIOM_USE_OPENMP=ON
cmake --build build
```
## Contributing

Contributions are welcome! Please ensure:

- Code follows the project style (`make format`)
- All tests pass (`make test`)
- New features include tests
- Documentation is updated
See CONTRIBUTING.md for detailed guidelines.
## License
Axiom is licensed under the MIT License. You are free to use, modify, and distribute Axiom in both open-source and proprietary projects.
See LICENSE for the full license text.
## Citation
If Axiom is useful in your research, please cite:
```bibtex
@misc{axiom2025,
  title={Axiom: High-Performance Tensor Library for C++},
  author={Noah Kay},
  year={2025},
  url={https://github.com/frikallo/axiom}
}
```
