fast-containers

Performance-focused header-only container library. Currently primarily contains a fast B+tree implementation.


High-performance header-only container library for C++23 on x86-64.

What's Included

  • B+Tree (kressler::fast_containers::btree) - Cache-friendly B+tree with SIMD search and hugepage support
  • Dense Map (kressler::fast_containers::dense_map) - Fixed-size sorted array used internally by btree nodes
  • Hugepage Allocators - Pooling allocators that reduce TLB misses and allocation overhead
    • HugePageAllocator - Single-size allocator for uniform allocations
    • MultiSizeHugePageAllocator - Multi-size pooling for variable-sized allocations (e.g., Abseil btree)
    • PolicyBasedHugePageAllocator - Advanced control with shared pools

Why This Library?

The B+tree implementation provides significant performance improvements over industry-standard alternatives for large trees. For some workloads, we've observed:

  • vs Abseil B+tree: 2-5× faster across insert/find/erase operations
  • vs std::map: 2-5× faster across insert/find/erase operations

See benchmark results for detailed performance analysis.

Important qualifications:

  • Performance advantages are most significant for large tree sizes where TLB misses and allocation costs dominate
  • Benchmarks currently focus on 10M element trees; smaller tree sizes have not been comprehensively tested
  • Results are specific to the tested configurations (8-byte keys, 32-byte and 256-byte values)

Key advantages over Abseil's btree:

  • Hugepage allocator integration: 3-5× speedup from reduced TLB misses and pooled allocations
  • SIMD-accelerated search: 3-10% faster node searches using AVX2 instructions
  • Tunable node sizes: Optimize cache behavior for your specific key/value sizes

Notes

Work in progress: I don't have plans for major changes to the B+tree currently, but I am actively cleaning up the implementation.

Platforms: This library is built and tested only on Linux, on x86-64 CPUs with AVX2 support. In theory it could be built for Windows, though that hasn't been tested. The SIMD implementations are x86-64 specific, and timing in the custom benchmarks relies on the x86-64 rdtscp instruction.

History/Motivations: This project started as an exploration of using AI agents for software development. Based on experience tuning systems using Abseil's B+tree, I was curious whether performance could be improved through SIMD instructions, a customized allocator, and tunable node sizes. Claude proved surprisingly adept at helping implement this quickly, and the resulting B+tree showed compelling performance improvements, so I'm making it available here.


Quick Start

Installation

Prerequisites:

  • C++23 compiler (GCC 14+, Clang 19+)
  • CMake 3.30+
  • AVX2-capable CPU (Intel Haswell 2013+, AMD Excavator 2015+)

Include in your project:

  1. Add as a git submodule:

    git submodule add https://github.com/kressler/fast-containers.git third_party/fast-containers
  2. Link in CMakeLists.txt:

    add_subdirectory(third_party/fast-containers)
    target_link_libraries(your_target PRIVATE fast_containers::fast_containers)
  3. Include headers:

    #include <fast_containers/btree.hpp>
    #include <fast_containers/hugepage_allocator.hpp>

Usage

Basic B+tree Example

#include <fast_containers/btree.hpp>
#include <cstdint>
#include <iostream>

int main() {
  // Create a btree mapping int64_t keys to int32_t values
  // Using defaults: auto-computed node sizes, Linear search
  using Tree = kressler::fast_containers::btree<int64_t, int32_t>;

  Tree tree;

  // Insert key-value pairs
  tree.insert(42, 100);
  tree.insert(17, 200);
  tree.insert(99, 300);

  // Find a value
  auto it = tree.find(42);
  if (it != tree.end()) {
    std::cout << "Found: " << it->second << std::endl;  // Prints: 100
  }

  // Iterate over all elements (sorted by key)
  for (const auto& [key, value] : tree) {
    std::cout << key << " -> " << value << std::endl;
  }

  // Erase an element
  tree.erase(17);

  // Check size
  std::cout << "Size: " << tree.size() << std::endl;  // Prints: 2
}

B+tree with Hugepage Allocator (Recommended for Performance)

#include <fast_containers/btree.hpp>
#include <fast_containers/hugepage_allocator.hpp>
#include <cstdint>
#include <cassert>

int main() {
  // Use the hugepage allocator for 3-5× performance improvement
  // Allocator type must match the btree's value_type (std::pair<Key, Value>)
  using Allocator = kressler::fast_containers::HugePageAllocator<
      std::pair<int64_t, int32_t>>;

  using Tree = kressler::fast_containers::btree<
    int64_t,                                 // Key type
    int32_t,                                 // Value type
    96,                                      // Leaf node size
    128,                                     // Internal node size
    std::less<int64_t>,                      // Comparator
    kressler::fast_containers::SearchMode::SIMD,  // SIMD search
    Allocator                                // Hugepage allocator
  >;

  // Tree will default-construct the allocator (256MB initial pool, 64MB growth)
  // The btree automatically creates separate pools for leaf and internal nodes
  Tree tree;

  // Insert 10 million elements - hugepages reduce TLB misses
  for (int64_t i = 0; i < 10'000'000; ++i) {
    tree.insert(i, i * 2);
  }

  // Find operations are much faster with hugepages
  auto it = tree.find(5'000'000);
  assert(it != tree.end() && it->second == 10'000'000);
}

Advanced: Policy-Based Allocator for Shared Pools

For multiple trees or fine-grained control over pool sizes, use PolicyBasedHugePageAllocator:

#include <fast_containers/btree.hpp>
#include <fast_containers/policy_based_hugepage_allocator.hpp>
#include <cstdint>

int main() {
  // Create separate pools for leaf and internal nodes with custom sizes
  auto leaf_pool = std::make_shared<kressler::fast_containers::HugePagePool>(
      512 * 1024 * 1024, true);  // 512MB for leaves
  auto internal_pool = std::make_shared<kressler::fast_containers::HugePagePool>(
      256 * 1024 * 1024, true);  // 256MB for internals

  // Create policy that routes types to appropriate pools
  kressler::fast_containers::TwoPoolPolicy policy{leaf_pool, internal_pool};

  // Create allocator with the policy
  using Allocator = kressler::fast_containers::PolicyBasedHugePageAllocator<
      std::pair<int64_t, int32_t>,
      kressler::fast_containers::TwoPoolPolicy>;

  Allocator alloc(policy);

  using Tree = kressler::fast_containers::btree<
    int64_t, int32_t, 96, 128, std::less<int64_t>,
    kressler::fast_containers::SearchMode::SIMD, Allocator>;

  // Multiple trees can share the same pools
  Tree tree1(alloc);
  Tree tree2(alloc);

  // Both trees share leaf_pool for leaves and internal_pool for internals
  tree1.insert(1, 100);
  tree2.insert(2, 200);
}

Advanced: Multi-Size Hugepage Allocator for Variable-Sized Allocations

For containers that allocate variable-sized objects (like absl::btree_map), use MultiSizeHugePageAllocator:

#include <fast_containers/multi_size_hugepage_allocator.hpp>
#include <absl/container/btree_map.h>
#include <array>
#include <cstdint>

int main() {
  // absl::btree_map allocates different-sized nodes (leaf vs internal)
  // MultiSizeHugePageAllocator routes allocations to size-class-specific pools

  using ValueType = std::array<std::byte, 32>;
  using Allocator = kressler::fast_containers::MultiSizeHugePageAllocator<
      std::pair<const int64_t, ValueType>>;

  // Helper function creates allocator with default settings
  // - 64MB initial size per size class
  // - Hugepages enabled
  // - 64MB growth size per size class
  auto alloc = kressler::fast_containers::make_multi_size_hugepage_allocator<
      std::pair<const int64_t, ValueType>>();

  // Create absl::btree_map with hugepage allocator
  absl::btree_map<int64_t, ValueType, std::less<int64_t>, Allocator> tree(alloc);

  // Insert 1 million elements - multiple size classes created automatically
  for (int64_t i = 0; i < 1'000'000; ++i) {
    tree[i] = ValueType{};
  }

  // Find operations benefit from reduced TLB misses
  auto it = tree.find(500'000);
}

How it works:

  • Allocations are routed to size classes based on requested size
  • Size classes: 0-512B (64B alignment), 513-2048B (256B alignment), 2049+B (power-of-2)
  • Each size class maintains its own HugePagePool with uniform-sized blocks
  • Pools created on-demand as different sizes are requested
  • Provides 2-3× performance improvement for absl::btree_map over standard allocator

When to use each allocator:

  • HugePageAllocator: Simple, automatic separate pools per type (recommended for our btree)
  • MultiSizeHugePageAllocator: Variable-sized allocations (e.g., absl::btree_map, other STL containers with allocator support)
  • PolicyBasedHugePageAllocator: Fine-grained control, shared pools across trees, custom pool sizes

API Overview

The btree class provides an API similar to std::map:

Insertion:

  • std::pair<iterator, bool> insert(const Key& key, const Value& value)
  • std::pair<iterator, bool> emplace(Args&&... args)
  • Value& operator[](const Key& key)

Lookup:

  • iterator find(const Key& key)
  • const_iterator find(const Key& key) const
  • iterator lower_bound(const Key& key)
  • iterator upper_bound(const Key& key)
  • std::pair<iterator, iterator> equal_range(const Key& key)

Removal:

  • size_type erase(const Key& key)
  • iterator erase(iterator pos)

Iteration:

  • iterator begin() / const_iterator begin() const
  • iterator end() / const_iterator end() const
  • Range-based for loops supported

Capacity:

  • size_type size() const
  • bool empty() const
  • void clear()

Other:

  • void swap(btree& other) noexcept
  • key_compare key_comp() const
  • value_compare value_comp() const

Template Parameters

template <
  typename Key,
  typename Value,
  std::size_t LeafNodeSize = default_leaf_node_size<Key, Value>(),
  std::size_t InternalNodeSize = default_internal_node_size<Key>(),
  typename Compare = std::less<Key>,
  SearchMode SearchModeT = SearchMode::Linear,
  typename Allocator = std::allocator<std::pair<Key, Value>>
>
class btree;

Parameters:

  • Key, Value: The key and value types

  • LeafNodeSize: Number of key-value pairs per leaf node

    • Default: Auto-computed heuristic targeting ~2KB (32 cache lines)
    • Formula: 2048 / (sizeof(Key) + sizeof(Value)), rounded to multiple of 8, clamped to [8, 64]
    • Manual tuning: Larger values (64-96) for small values, smaller values (8-16) for large values
  • InternalNodeSize: Number of child pointers per internal node

    • Default: Auto-computed heuristic targeting ~1KB (16 cache lines)
    • Formula: 1024 / (sizeof(Key) + sizeof(void*)), rounded to multiple of 8, clamped to [16, 64]
    • Generally leave at default (stores only 8-byte pointers)
  • Compare: Comparison function (must satisfy ComparatorCompatible<Key, Compare>)

    • Default: std::less<Key>
    • Also supports std::greater<Key> for descending order
  • SearchMode: How to search within a node

    • Default: SearchMode::Linear (scalar linear search)
    • SearchMode::SIMD: AVX2-accelerated search (3-10% faster, requires AVX2 CPU and SIMD-compatible keys: int32_t, uint32_t, int64_t, uint64_t, float, double)
    • SearchMode::Binary: Binary search
  • Allocator: Memory allocation strategy

    • Default: std::allocator<std::pair<Key, Value>>
    • Recommended for performance: HugePageAllocator<std::pair<Key, Value>> for working sets >1GB (3-5× faster)
      • Automatically creates separate pools for leaf and internal nodes via rebind
      • Default: 256MB initial pool, 64MB growth per pool
      • Requires hugepages configured: sudo sysctl -w vm.nr_hugepages=<num_pages>
      • Falls back to regular pages if unavailable
    • For variable-sized allocations: MultiSizeHugePageAllocator<std::pair<Key, Value>>
      • Routes allocations to size-class-specific pools
      • Use with absl::btree_map or other containers that allocate different-sized objects
      • Provides 2-3× speedup for Abseil btree over standard allocator
    • Advanced control: PolicyBasedHugePageAllocator<std::pair<Key, Value>, TwoPoolPolicy>
      • Fine-grained control over pool sizes
      • Share pools across multiple trees
      • Separate pools for leaf and internal nodes with custom sizes

Performance

Benchmarks comparing against Abseil's btree_map and std::map are available in results/btree_benchmark_results.md.

Performance Highlights (8-byte keys, 32-byte values, 10M elements)

Our btree with hugepages (btree_8_32_96_128_simd_hp):

  • INSERT P99.9: 1,023 ns
  • FIND P99.9: 864 ns
  • ERASE P99.9: 1,086 ns

Our btree with standard allocator (btree_8_32_96_128_simd):

  • INSERT P99.9: 3,155 ns (3.1× slower than with hugepages)
  • FIND P99.9: 950 ns (1.1× slower than with hugepages)
  • ERASE P99.9: 1,323 ns (1.2× slower than with hugepages)

vs. Abseil btree with hugepages (absl_8_32_hp using MultiSizeHugePageAllocator):

  • INSERT P99.9: 1,401 ns (37% slower)
  • FIND P99.9: 1,190 ns (38% slower)
  • ERASE P99.9: 1,299 ns (20% slower)

vs. Abseil btree with standard allocator (absl_8_32):

  • INSERT P99.9: 3,287 ns (3.2× slower)
  • FIND P99.9: 1,342 ns (55% slower)
  • ERASE P99.9: 1,679 ns (55% slower)

vs. std::map (map_8_32):

  • INSERT P99.9: 3,587 ns (3.5× slower)
  • FIND P99.9: 2,312 ns (2.7× slower)
  • ERASE P99.9: 2,253 ns (2.1× slower)

Key Insights

Hugepage allocators provide massive performance improvements:

  • Our btree: 2-3× faster with hugepages vs. standard allocator
  • Abseil btree: 2× faster with MultiSizeHugePageAllocator vs. standard allocator
  • Critical for large working sets (>1M elements)

Our implementation maintains significant advantages even with fair comparison:

  • 20-67% faster than Abseil btree even when both use hugepage allocators
  • Advantages from SIMD search, tunable node sizes, and optimized bulk transfers
  • Performance gap widens with larger values (256 bytes: 21-135% faster)

Performance varies by tree size:

  • Large trees (10M elements): Our btree dominates all metrics
  • Small trees (10K elements): Competition intensifies, std::map becomes viable for some workloads
  • See benchmark results for detailed analysis

The hugepage allocator is the single most important optimization, providing benefits by reducing TLB misses (helps find operations) and making allocations extremely cheap through pooling (helps insert/erase operations).


Building from Source

Using CMake Presets (Recommended)

# List available presets
cmake --list-presets

# Configure, build, and test in one workflow
cmake --preset release
cmake --build --preset release
ctest --preset release

# Common presets:
cmake --preset debug          # Debug build
cmake --preset release        # Release with AVX2 (default)
cmake --preset asan           # AddressSanitizer build
cmake --preset release-no-avx2  # Release without AVX2

Manual Build (Alternative)

# Clone with submodules
git clone --recursive https://github.com/kressler/fast-containers.git
cd fast-containers

# Configure
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release

# Build
cmake --build build

# Run tests
ctest --test-dir build --output-on-failure

Build Options

| Option | Default | Description |
|--------|---------|-------------|
| ENABLE_AVX2 | ON (Release), OFF (Debug) | Enable AVX2 SIMD optimizations |
| ENABLE_ASAN | OFF | Enable AddressSanitizer |
| ENABLE_ALLOCATOR_STATS | OFF | Enable allocator statistics |
| ENABLE_LTO | ON | Enable Link-Time Optimization |
| ENABLE_NUMA | Auto-detected | Enable NUMA support (requires libnuma) |

Development

Setup

  1. Clone with submodules:

    git clone --recursive https://github.com/kressler/fast-containers.git
    cd fast-containers
  2. One-time development setup:

    ./setup-dev.sh

    This installs pre-commit hooks and configures clang-tidy.

Code Quality Tools

Automatic formatting and checks (via pre-commit hook):

git commit  # Automatically formats code and runs clang-tidy

The pre-commit hook will:

  • Format all staged C++ files with clang-format
  • Check production headers with clang-tidy
  • Fail the commit if warnings are found
  • Auto-create build directories if missing

Manual formatting:

cmake --build build --target format

Manual static analysis:

cmake --build build --target clang-tidy
# Or manually:
clang-tidy-19 -p cmake-build-clang-tidy include/fast_containers/*.hpp

Requirements:

  • clang-format (for code formatting)
  • clang-tidy-19 (for static analysis)
  • cmake (to auto-create build directories)

Bypass hook (when needed):

git commit --no-verify  # Skips the formatting and clang-tidy checks

Development Workflow

  1. Make your changes

  2. Build and test:

    cmake --build build && ctest --test-dir build
  3. Commit (auto-formatted and checked):

    git add .
    git commit -m "Your changes"
    # Pre-commit hook runs automatically

Code Conventions

  • Follow Google C++ Style Guide (enforced by clang-format)
  • Use C++23 features
  • Write tests for new functionality using Catch2
  • Production code must be clang-tidy clean (enforced in CI and pre-commit)
  • Run cmake --build build --target format before submitting PRs

Project Structure

.
├── include/
│   └── fast_containers/         # Public header files
│       ├── btree.hpp, btree.ipp
│       ├── dense_map.hpp, dense_map.ipp
│       ├── hugepage_allocator.hpp
│       ├── multi_size_hugepage_allocator.hpp
│       ├── multi_size_hugepage_pool.hpp
│       ├── policy_based_hugepage_allocator.hpp
│       └── hugepage_pool.hpp
├── tests/                       # Unit tests (Catch2)
│   ├── test_btree.cpp
│   ├── test_dense_map.cpp
│   ├── test_hugepage_allocator.cpp
│   └── test_policy_based_allocator.cpp
├── src/
│   ├── benchmarks/              # Google Benchmark performance tests
│   │   ├── dense_map_search_benchmark.cpp
│   │   └── hugepage_allocator_benchmark.cpp
│   └── binary/                  # Standalone benchmark executables
│       ├── btree_benchmark.cpp
│       └── btree_stress.cpp
├── scripts/
│   └── interleaved_btree_benchmark.py  # A/B testing harness
├── results/
│   └── btree_benchmark_results.md      # Performance analysis
├── third_party/                 # Git submodules
│   ├── catch2/                  # Unit testing framework
│   ├── benchmark/               # Google Benchmark
│   ├── histograms/              # Latency histogram library
│   ├── abseil-cpp/              # Comparison baseline
│   ├── lyra/                    # Command-line parsing
│   └── unordered_dense/         # Dense hash map
├── hooks/                       # Git hooks (install with setup-dev.sh)
│   └── pre-commit               # Auto-format and clang-tidy
└── CMakeLists.txt               # Build configuration

Tech Stack

  • Language: C++23
  • Build System: CMake 3.30+
  • Testing: Catch2 v3.11.0
  • Code Formatting: clang-format (Google C++ Style)
  • Static Analysis: clang-tidy-19