fast-containers

Performance-focused header-only container library. Currently primarily contains a fast B+tree implementation.


High-performance header-only container library for C++23 on x86-64.

What's Included

  • B+Tree (kressler::fast_containers::btree) - Cache-friendly B+tree with SIMD search and hugepage support
  • Dense Map (kressler::fast_containers::dense_map) - Fixed-size sorted array used internally by btree nodes
  • Hugepage Allocators - Pooling allocators that reduce TLB misses and allocation overhead
    • HugePageAllocator - Single-size allocator for uniform allocations
    • MultiSizeHugePageAllocator - Multi-size pooling for variable-sized allocations (e.g., Abseil btree)
    • PolicyBasedHugePageAllocator - Advanced control with shared pools

Why This Library?

The B+tree implementation provides significant performance improvements over industry-standard alternatives for large trees. For some workloads, we've observed:

  • vs Abseil B+tree: 2-5× faster across insert/find/erase operations
  • vs std::map: 2-5× faster across insert/find/erase operations

See benchmark results for detailed performance analysis.

Important qualifications:

  • Performance advantages are most significant for large tree sizes where TLB misses and allocation costs dominate
  • Benchmarks currently focus on 10M element trees; smaller tree sizes have not been comprehensively tested
  • Results are specific to the tested configurations (8-byte keys, 32-byte and 256-byte values)

Key advantages over Abseil's btree:

  • Hugepage allocator integration: 3-5× speedup from reduced TLB misses and pooled allocations
  • SIMD-accelerated search: 3-10% faster node searches using AVX2 instructions
  • Tunable node sizes: Optimize cache behavior for your specific key/value sizes

Notes

Work in progress: I don't have plans for major changes to the B+tree currently, but I am actively cleaning up the implementation.

Platforms: This library is built and tested only on Linux, on x86-64 CPUs with AVX2 support. In theory it could be built for Windows, though that hasn't been tested. The SIMD implementations are x86-64 specific, and timing in the custom benchmarks relies on the x86-64 rdtscp instruction.

History/Motivations: This project started as an exploration of using AI agents for software development. Based on experience tuning systems using Abseil's B+tree, I was curious whether performance could be improved through SIMD instructions, a customized allocator, and tunable node sizes. Claude proved surprisingly adept at helping implement this quickly, and the resulting B+tree showed compelling performance improvements, so I'm making it available here.


Quick Start

Installation

Prerequisites:

  • C++23 compiler (GCC 14+, Clang 19+)
  • CMake 3.30+
  • AVX2-capable CPU (Intel Haswell 2013+, AMD Excavator 2015+)

Include in your project:

  1. Add as a git submodule:

    git submodule add https://github.com/kressler/fast-containers.git third_party/fast-containers
  2. Link in CMakeLists.txt:

    add_subdirectory(third_party/fast-containers)
    target_link_libraries(your_target PRIVATE fast_containers::fast_containers)
  3. Include headers:

    #include <fast_containers/btree.hpp>
    #include <fast_containers/hugepage_allocator.hpp>

Usage

Basic B+tree Example

#include <fast_containers/btree.hpp>
#include <cstdint>
#include <iostream>

int main() {
  // Create a btree mapping int64_t keys to int32_t values
  // Using defaults: auto-computed node sizes, Linear search
  using Tree = kressler::fast_containers::btree<int64_t, int32_t>;

  Tree tree;

  // Insert key-value pairs
  tree.insert(42, 100);
  tree.insert(17, 200);
  tree.insert(99, 300);

  // Find a value
  auto it = tree.find(42);
  if (it != tree.end()) {
    std::cout << "Found: " << it->second << std::endl;  // Prints: 100
  }

  // Iterate over all elements (sorted by key)
  for (const auto& [key, value] : tree) {
    std::cout << key << " -> " << value << std::endl;
  }

  // Erase an element
  tree.erase(17);

  // Check size
  std::cout << "Size: " << tree.size() << std::endl;  // Prints: 2
}

B+tree with Hugepage Allocator (Recommended for Performance)

#include <fast_containers/btree.hpp>
#include <fast_containers/hugepage_allocator.hpp>
#include <cstdint>
#include <cassert>

int main() {
  // Use the hugepage allocator for 3-5× performance improvement
  // Allocator type must match the btree's value_type (std::pair<Key, Value>)
  using Allocator = kressler::fast_containers::HugePageAllocator<
      std::pair<int64_t, int32_t>>;

  using Tree = kressler::fast_containers::btree<
    int64_t,                                 // Key type
    int32_t,                                 // Value type
    96,                                      // Leaf node size
    128,                                     // Internal node size
    std::less<int64_t>,                      // Comparator
    kressler::fast_containers::SearchMode::SIMD,  // SIMD search
    Allocator                                // Hugepage allocator
  >;

  // Tree will default-construct the allocator (256MB initial pool, 64MB growth)
  // The btree automatically creates separate pools for leaf and internal nodes
  Tree tree;

  // Insert 10 million elements - hugepages reduce TLB misses
  for (int64_t i = 0; i < 10'000'000; ++i) {
    tree.insert(i, i * 2);
  }

  // Find operations are much faster with hugepages
  auto it = tree.find(5'000'000);
  assert(it != tree.end() && it->second == 10'000'000);
}

Advanced: Policy-Based Allocator for Shared Pools

For multiple trees or fine-grained control over pool sizes, use PolicyBasedHugePageAllocator:

#include <fast_containers/btree.hpp>
#include <fast_containers/policy_based_hugepage_allocator.hpp>
#include <cstdint>

int main() {
  // Create separate pools for leaf and internal nodes with custom sizes
  auto leaf_pool = std::make_shared<kressler::fast_containers::HugePagePool>(
      512 * 1024 * 1024, true);  // 512MB for leaves
  auto internal_pool = std::make_shared<kressler::fast_containers::HugePagePool>(
      256 * 1024 * 1024, true);  // 256MB for internals

  // Create policy that routes types to appropriate pools
  kressler::fast_containers::TwoPoolPolicy policy{leaf_pool, internal_pool};

  // Create allocator with the policy
  using Allocator = kressler::fast_containers::PolicyBasedHugePageAllocator<
      std::pair<int64_t, int32_t>,
      kressler::fast_containers::TwoPoolPolicy>;

  Allocator alloc(policy);

  using Tree = kressler::fast_containers::btree<
    int64_t, int32_t, 96, 128, std::less<int64_t>,
    kressler::fast_containers::SearchMode::SIMD, Allocator>;

  // Multiple trees can share the same pools
  Tree tree1(alloc);
  Tree tree2(alloc);

  // Both trees share leaf_pool for leaves and internal_pool for internals
  tree1.insert(1, 100);
  tree2.insert(2, 200);
}

Advanced: Multi-Size Hugepage Allocator for Variable-Sized Allocations

For containers that allocate variable-sized objects (like absl::btree_map), use MultiSizeHugePageAllocator:

#include <fast_containers/multi_size_hugepage_allocator.hpp>
#include <absl/container/btree_map.h>
#include <array>
#include <cstdint>

int main() {
  // absl::btree_map allocates different-sized nodes (leaf vs internal)
  // MultiSizeHugePageAllocator routes allocations to size-class-specific pools

  using ValueType = std::array<std::byte, 32>;
  using Allocator = kressler::fast_containers::MultiSizeHugePageAllocator<
      std::pair<const int64_t, ValueType>>;

  // Helper function creates allocator with default settings
  // - 64MB initial size per size class
  // - Hugepages enabled
  // - 64MB growth size per size class
  auto alloc = kressler::fast_containers::make_multi_size_hugepage_allocator<
      std::pair<const int64_t, ValueType>>();

  // Create absl::btree_map with hugepage allocator
  absl::btree_map<int64_t, ValueType, std::less<int64_t>, Allocator> tree(alloc);

  // Insert 1 million elements - multiple size classes created automatically
  for (int64_t i = 0; i < 1'000'000; ++i) {
    tree[i] = ValueType{};
  }

  // Find operations benefit from reduced TLB misses
  auto it = tree.find(500'000);
}

How it works:

  • Allocations are routed to size classes based on requested size
  • Size classes: 0-512B (64B alignment), 513-2048B (256B alignment), 2049+B (power-of-2)
  • Each size class maintains its own HugePagePool with uniform-sized blocks
  • Pools created on-demand as different sizes are requested
  • Provides 2-3× performance improvement for absl::btree_map over standard allocator

When to use each allocator:

  • HugePageAllocator: Simple, automatic separate pools per type (recommended for our btree)
  • MultiSizeHugePageAllocator: Variable-sized allocations (e.g., absl::btree_map, other STL containers with allocator support)
  • PolicyBasedHugePageAllocator: Fine-grained control, shared pools across trees, custom pool sizes

API Overview

The btree class provides an API similar to std::map:

Insertion:

  • std::pair<iterator, bool> insert(const Key& key, const Value& value)
  • std::pair<iterator, bool> emplace(Args&&... args)
  • Value& operator[](const Key& key)

Lookup:

  • iterator find(const Key& key)
  • const_iterator find(const Key& key) const
  • iterator lower_bound(const Key& key)
  • iterator upper_bound(const Key& key)
  • std::pair<iterator, iterator> equal_range(const Key& key)

Removal:

  • size_type erase(const Key& key)
  • iterator erase(iterator pos)

Iteration:

  • iterator begin() / const_iterator begin() const
  • iterator end() / const_iterator end() const
  • Range-based for loops supported

Capacity:

  • size_type size() const
  • bool empty() const
  • void clear()

Other:

  • void swap(btree& other) noexcept
  • key_compare key_comp() const
  • value_compare value_comp() const

Template Parameters

template <
  typename Key,
  typename Value,
  std::size_t LeafNodeSize = default_leaf_node_size<Key, Value>(),
  std::size_t InternalNodeSize = default_internal_node_size<Key>(),
  typename Compare = std::less<Key>,
  SearchMode SearchModeT = SearchMode::Linear,
  typename Allocator = std::allocator<std::pair<Key, Value>>
>
class btree;

Parameters:

  • Key, Value: The key and value types

  • LeafNodeSize: Number of key-value pairs per leaf node

    • Default: Auto-computed heuristic targeting ~2KB (32 cache lines)
    • Formula: 2048 / (sizeof(Key) + sizeof(Value)), rounded to multiple of 8, clamped to [8, 64]
    • Manual tuning: Larger values (64-96) for small values, smaller values (8-16) for large values
  • InternalNodeSize: Number of child pointers per internal node

    • Default: Auto-computed heuristic targeting ~1KB (16 cache lines)
    • Formula: 1024 / (sizeof(Key) + sizeof(void*)), rounded to multiple of 8, clamped to [16, 64]
    • Generally leave at default (stores only 8-byte pointers)
  • Compare: Comparison function (must satisfy ComparatorCompatible<Key, Compare>)

    • Default: std::less<Key>
    • Also supports std::greater<Key> for descending order
  • SearchMode: How to search within a node

    • Default: SearchMode::Linear (scalar linear search)
    • SearchMode::SIMD: AVX2-accelerated search (3-10% faster, requires AVX2 CPU and SIMD-compatible keys: int32_t, uint32_t, int64_t, uint64_t, float, double)
    • SearchMode::Binary: Binary search
  • Allocator: Memory allocation strategy

    • Default: std::allocator<std::pair<Key, Value>>
    • Recommended for performance: HugePageAllocator<std::pair<Key, Value>> for working sets >1GB (3-5× faster)
      • Automatically creates separate pools for leaf and internal nodes via rebind
      • Default: 256MB initial pool, 64MB growth per pool
      • Requires hugepages configured: sudo sysctl -w vm.nr_hugepages=<num_pages>
      • Falls back to regular pages if unavailable
    • For variable-sized allocations: MultiSizeHugePageAllocator<std::pair<Key, Value>>
      • Routes allocations to size-class-specific pools
      • Use with absl::btree_map or other containers that allocate different-sized objects
      • Provides 2-3× speedup for Abseil btree over standard allocator
    • Advanced control: PolicyBasedHugePageAllocator<std::pair<Key, Value>, TwoPoolPolicy>
      • Fine-grained control over pool sizes
      • Share pools across multiple trees
      • Separate pools for leaf and internal nodes with custom sizes

Performance

Benchmarks comparing against Abseil's btree_map and std::map are available in results/btree_benchmark_results.md.

Performance Highlights (8-byte keys, 32-byte values, 10M elements)

Our btree with hugepages (btree_8_32_96_128_simd_hp):

  • INSERT P99.9: 1,023 ns
  • FIND P99.9: 864 ns
  • ERASE P99.9: 1,086 ns

Our btree with standard allocator (btree_8_32_96_128_simd):

  • INSERT P99.9: 3,155 ns (3.1× slower than with hugepages)
  • FIND P99.9: 950 ns (1.1× slower than with hugepages)
  • ERASE P99.9: 1,323 ns (1.2× slower than with hugepages)

vs. Abseil btree with hugepages (absl_8_32_hp using MultiSizeHugePageAllocator):

  • INSERT P99.9: 1,401 ns (37% slower)
  • FIND P99.9: 1,190 ns (38% slower)
  • ERASE P99.9: 1,299 ns (20% slower)

vs. Abseil btree with standard allocator (absl_8_32):

  • INSERT P99.9: 3,287 ns (3.2× slower)
  • FIND P99.9: 1,342 ns (55% slower)
  • ERASE P99.9: 1,679 ns (55% slower)

vs. std::map (map_8_32):

  • INSERT P99.9: 3,587 ns (3.5× slower)
  • FIND P99.9: 2,312 ns (2.7× slower)
  • ERASE P99.9: 2,253 ns (2.1× slower)

Key Insights

Hugepage allocators provide massive performance improvements:

  • Our btree: 2-3× faster with hugepages vs. standard allocator
  • Abseil btree: 2× faster with MultiSizeHugePageAllocator vs. standard allocator
  • Critical for large working sets (>1M elements)

Our implementation maintains significant advantages even with fair comparison:

  • 20-67% faster than Abseil btree even when both use hugepage allocators
  • Advantages from SIMD search, tunable node sizes, and optimized bulk transfers
  • Performance gap widens with larger values (256 bytes: 21-135% faster)

Performance varies by tree size:

  • Large trees (10M elements): Our btree dominates all metrics
  • Small trees (10K elements): Competition intensifies, std::map becomes viable for some workloads
  • See benchmark results for detailed analysis

The hugepage allocator is the single most important optimization, providing benefits by reducing TLB misses (helps find operations) and making allocations extremely cheap through pooling (helps insert/erase operations).


Building from Source

Using CMake Presets (Recommended)

# List available presets
cmake --list-presets

# Configure, build, and test in one workflow
cmake --preset release
cmake --build --preset release
ctest --preset release

# Common presets:
cmake --preset debug          # Debug build
cmake --preset release        # Release with AVX2 (default)
cmake --preset asan           # AddressSanitizer build
cmake --preset release-no-avx2  # Release without AVX2

Manual Build (Alternative)

# Clone with submodules
git clone --recursive https://github.com/kressler/fast-containers.git
cd fast-containers

# Configure
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release

# Build
cmake --build build

# Run tests
ctest --test-dir build --output-on-failure

Build Options

| Option | Default | Description |
|--------|---------|-------------|
| ENABLE_AVX2 | ON (Release), OFF (Debug) | Enable AVX2 SIMD optimizations |
| ENABLE_ASAN | OFF | Enable AddressSanitizer |
| ENABLE_ALLOCATOR_STATS | OFF | Enable allocator statistics |
| ENABLE_LTO | ON | Enable Link-Time Optimization |
| ENABLE_NUMA | Auto-detected | Enable NUMA support (requires libnuma) |

Development

Setup

  1. Clone with submodules:

    git clone --recursive https://github.com/kressler/fast-containers.git
    cd fast-containers
  2. One-time development setup:

    ./setup-dev.sh

    This installs pre-commit hooks and configures clang-tidy.

Code Quality Tools

Automatic formatting and checks (via pre-commit hook):

git commit  # Automatically formats code and runs clang-tidy

The pre-commit hook will:

  • Format all staged C++ files with clang-format
  • Check production headers with clang-tidy
  • Fail the commit if warnings are found
  • Auto-create build directories if missing

Manual formatting:

cmake --build build --target format

Manual static analysis:

cmake --build build --target clang-tidy
# Or manually:
clang-tidy-19 -p cmake-build-clang-tidy include/fast_containers/*.hpp

Requirements:

  • clang-format (for code formatting)
  • clang-tidy-19 (for static analysis)
  • cmake (to auto-create build directories)

Bypass hook (when needed):

git commit --no-verify  # Skips the formatting and clang-tidy checks

Development Workflow

  1. Make your changes

  2. Build and test:

    cmake --build build && ctest --test-dir build
  3. Commit (auto-formatted and checked):

    git add .
    git commit -m "Your changes"
    # Pre-commit hook runs automatically

Code Conventions

  • Follow Google C++ Style Guide (enforced by clang-format)
  • Use C++23 features
  • Write tests for new functionality using Catch2
  • Production code must be clang-tidy clean (enforced in CI and pre-commit)
  • Run cmake --build build --target format before submitting PRs

Project Structure

.
├── include/
│   └── fast_containers/         # Public header files
│       ├── btree.hpp, btree.ipp
│       ├── dense_map.hpp, dense_map.ipp
│       ├── hugepage_allocator.hpp
│       ├── multi_size_hugepage_allocator.hpp
│       ├── multi_size_hugepage_pool.hpp
│       ├── policy_based_hugepage_allocator.hpp
│       └── hugepage_pool.hpp
├── tests/                       # Unit tests (Catch2)
│   ├── test_btree.cpp
│   ├── test_dense_map.cpp
│   ├── test_hugepage_allocator.cpp
│   └── test_policy_based_allocator.cpp
├── src/
│   ├── benchmarks/              # Google Benchmark performance tests
│   │   ├── dense_map_search_benchmark.cpp
│   │   └── hugepage_allocator_benchmark.cpp
│   └── binary/                  # Standalone benchmark executables
│       ├── btree_benchmark.cpp
│       └── btree_stress.cpp
├── scripts/
│   └── interleaved_btree_benchmark.py  # A/B testing harness
├── results/
│   └── btree_benchmark_results.md      # Performance analysis
├── third_party/                 # Git submodules
│   ├── catch2/                  # Unit testing framework
│   ├── benchmark/               # Google Benchmark
│   ├── histograms/              # Latency histogram library
│   ├── abseil-cpp/              # Comparison baseline
│   ├── lyra/                    # Command-line parsing
│   └── unordered_dense/         # Dense hash map
├── hooks/                       # Git hooks (install with setup-dev.sh)
│   └── pre-commit               # Auto-format and clang-tidy
└── CMakeLists.txt               # Build configuration

Tech Stack

  • Language: C++23
  • Build System: CMake 3.30+
  • Testing: Catch2 v3.11.0
  • Code Formatting: clang-format (Google C++ Style)
  • Static Analysis: clang-tidy-19