Fast KNN that doesn't make you choose between exact answers and production speed. Drop-in for scikit-learn. SIMD-accelerated. Binary Multi-Index Hashing that beats Faiss's own MIH at matched recall — and finds 512-bit near-duplicates ~40× faster than Faiss's brute-force scan at 100% recall.
PyNear is a metric-space nearest-neighbour library with a C++ core: exact VP-Trees up to ~256-D, IVF-Flat for approximate float search at 512–1024-D (embeddings, RAG), and Multi-Index Hashing / IVF-Binary for Hamming search on binary descriptors — one small, NumPy-only API with a scikit-learn drop-in and pre-built wheels (pip install pynear).
Table of Contents
- Introduction
- Why PyNear?
- Installation
- Quick start
- Migrating from scikit-learn
- Features
- Demos
- Benchmarks
- Real-World Benchmark — SIFT1M Binary
- Development
Introduction
Search, recommendation, deduplication, and retrieval-augmented generation all reduce to the same primitive: turn an item — an image, an audio clip, a document, a face — into a descriptor (a fixed-length vector or bit-string), then find the descriptors nearest to it. Similar items map to nearby points, so "find similar" becomes "find nearest neighbours."
The right way to search depends on the data, and PyNear gives you one API for all three regimes instead of forcing every problem through the same tool:
- Low-to-mid dimensions (a few up to ~256-D) — exact tree search wins. A VP-Tree prunes by distance to vantage points and returns the true nearest neighbours, no recall loss, no tuning.
- High-dimensional float vectors (512–1024-D embeddings) — exact pruning collapses (the curse of dimensionality), so IVF-Flat trades a sliver of recall for large speed-ups.
- Binary descriptors (ORB, BRIEF, perceptual hashes, SimHash) — Hamming distance plus Multi-Index Hashing uses the pigeonhole principle to find near-duplicates without scanning the whole dataset.
What people build with it:
- Image / video deduplication & copy detection — perceptual-hash / ORB
descriptors +
MIHBinaryIndex. - Audio fingerprinting (Shazam-style) — spectrogram-peak descriptors + Hamming search.
- Semantic & RAG retrieval — text/image embeddings +
IVFFlatCosineIndex. - Classic ML — drop-in
KNeighborsClassifier/Regressorbacked by VP-Trees.
New to nearest-neighbour search? See docs/intro.md for a gentle, jargon-free introduction — or the deep dive, The shared recipe behind image search, Shazam, and RAG.
Why PyNear?
| PyNear | Faiss | Annoy | scikit-learn | |
|---|---|---|---|---|
| Metric agnostic | ✅ L2, L1, L∞, cosine, Hamming | L2 / IP / cosine | L2 / cosine / Hamming | L2 / others |
| Binary / Hamming approx | ✅ MIH + IVF, faster than Faiss MIH | ✅ MIH + IVF | ❌ | ❌ |
| scikit-learn drop-in | ✅ adapter classes | ❌ | ❌ | — |
| Zero native deps | ✅ NumPy only | ❌ compiled lib + optional GPU | ❌ | ❌ |
Installation
Requires Python 3.8+ and NumPy ≥ 1.21.2. Pre-built wheels are available for Linux, macOS (x86-64 and Apple Silicon), and Windows — no compiler needed.
Quick start
PyNear's two headline indices: exact VP-Trees for low-to-mid dimensions, and Multi-Index Hashing for binary descriptors.
Low-dimensional exact search (VPTreeL2Index)
VP-Trees partition points by distance to a vantage point, so they prune whole branches in any metric space and return exact neighbours — no recall loss, no tuning — and stay effective up to ~256-D. The same API backs L2, L1, L∞, cosine, and Hamming.
import numpy as np import pynear # 100,000 vectors in 32-D data = np.random.rand(100_000, 32).astype(np.float32) index = pynear.VPTreeL2Index() index.set(data) # KNN search — returns (indices, distances) per query, sorted nearest-first queries = np.random.rand(10, 32).astype(np.float32) indices, distances = index.searchKNN(queries, k=5) # 1-NN shortcut (slightly faster than searchKNN with k=1) nn_indices, nn_distances = index.search1NN(queries)
High-dimensional binary descriptors (MIHBinaryIndex)
MIHBinaryIndex is pynear's flagship for binary descriptors (ORB, BRIEF,
AKAZE, perceptual hashes, SimHash). Multi-Index Hashing splits each d-bit
descriptor into m sub-strings and hashes them; by the pigeonhole
principle, any neighbour within radius Hamming bits is guaranteed to be
found. On wide descriptors it retrieves near-duplicates ~40× faster than
Faiss's brute-force scan at 100% recall — and faster than Faiss's own MIH.
import numpy as np import pynear # 1M × 512-bit descriptors (64 bytes each) db = np.random.randint(0, 256, size=(1_000_000, 64), dtype=np.uint8) queries = np.random.randint(0, 256, size=(100, 64), dtype=np.uint8) mih = pynear.MIHBinaryIndex(m=8) # 8 sub-tables of 64 bits (m=4 for 128/256-bit) mih.set(db) indices, distances = mih.searchKNN(queries, k=10, radius=8) # radius: any true neighbour within this Hamming distance is guaranteed found # (pigeonhole). Increase for higher recall on noisier data.
When you'd rather cap the cost per query than reason about a radius,
IVFFlatBinaryIndex scans a fixed number of clusters instead:
ivf = pynear.IVFFlatBinaryIndex(nlist=512, nprobe=16) ivf.set(db) indices, distances = ivf.searchKNN(queries, k=10) ivf.set_nprobe(32) # trade speed for recall at runtime
Choosing between MIH and IVFFlat:
MIHBinaryIndex |
IVFFlatBinaryIndex |
|
|---|---|---|
| Best for | Near-duplicate retrieval (small Hamming radius) | General approximate Hamming KNN |
| d=512, N=1M query time (near-duplicate) | 0.008 ms | 1.82 ms |
| Recall guarantee | Exact for distance ≤ radius (pigeonhole) | Probabilistic (depends on nprobe) |
| Recall control | radius parameter |
nprobe parameter |
Recommended m |
d/8 bytes (e.g. m=8 for 512-bit) | — |
For wide float vectors (512-D–1024-D embeddings, e.g. text / RAG) reach for
IVFFlatL2Index / IVFFlatCosineIndex. Every index type and its tuning knobs
are covered in docs/README.md.
Migrating from scikit-learn
PyNear provides adapter classes that implement the same interface as
sklearn.neighbors.NearestNeighbors, KNeighborsClassifier, and
KNeighborsRegressor. Changing the import is all that is required in most
cases:
# Before from sklearn.neighbors import KNeighborsClassifier clf = KNeighborsClassifier(n_neighbors=5, metric='euclidean') # After — identical API, backed by a VP-Tree from pynear.sklearn_adapter import PyNearKNeighborsClassifier clf = PyNearKNeighborsClassifier(n_neighbors=5, metric='euclidean')
All three adapters follow the standard scikit-learn workflow:
from pynear.sklearn_adapter import ( PyNearNearestNeighbors, PyNearKNeighborsClassifier, PyNearKNeighborsRegressor, ) # Unsupervised neighbour lookup nn = PyNearNearestNeighbors(n_neighbors=5, metric='euclidean') nn.fit(X_train) distances, indices = nn.kneighbors(X_query) # Classification clf = PyNearKNeighborsClassifier(n_neighbors=5, weights='distance') clf.fit(X_train, y_train) clf.predict(X_test) # class labels clf.predict_proba(X_test) # per-class probabilities clf.score(X_test, y_test) # accuracy # Regression reg = PyNearKNeighborsRegressor(n_neighbors=5, weights='uniform') reg.fit(X_train, y_train) reg.predict(X_test) # predicted values reg.score(X_test, y_test) # R²
Supported metrics: euclidean / l2, manhattan / l1, chebyshev / linf, cosine, hamming
Supported weights: uniform, distance (inverse-distance-weighted)
Note: Input arrays are cast to
float32(oruint8for Hamming) before indexing. scikit-learn usesfloat64internally, so very small numerical differences may appear at the precision boundary, but nearest-neighbour results are identical for all practical datasets.
Features
Available indices
Exact indices — always return the true k nearest neighbours:
| Index | Distance | Data type | Notes |
|---|---|---|---|
VPTreeL2Index |
L2 (Euclidean) | float32 |
SIMD-accelerated |
VPTreeL1Index |
L1 (Manhattan) | float32 |
SIMD-accelerated |
VPTreeChebyshevIndex |
L∞ (Chebyshev) | float32 |
SIMD-accelerated |
VPTreeCosineIndex |
Cosine | float32 |
L2-normalised internally; SIMD-accelerated |
VPTreeBinaryIndex |
Hamming | uint8 |
Hardware popcount |
BKTreeBinaryIndex |
Hamming | uint8 |
Threshold / range search |
Approximate indices — trade a small recall budget for large speed gains; tunable via n_probe / radius:
| Index | Distance | Data type | Notes |
|---|---|---|---|
IVFFlatL2Index |
L2 (Euclidean) | float32 |
BLAS SGEMV inner scan; best for 512-D – 1024-D |
IVFFlatCosineIndex |
Cosine | float32 |
Spherical K-Means + BLAS SGEMV; ideal for text embeddings |
IVFFlatBinaryIndex |
Hamming | uint8 |
Binary K-Means IVF; faster build than Faiss binary IVF |
MIHBinaryIndex |
Hamming | uint8 |
Multi-Index Hashing; ~40× faster than Faiss brute-force on 512-bit near-duplicates, and faster than Faiss's own MIH |
All VPTree and IVFFlat indices support searchKNN(queries, k).
BKTreeBinaryIndex supports find_threshold(queries, threshold) for range queries.
Set n_probe = n_clusters on IVFFlatL2Index to make it exact.
See docs/approximate.md for a full guide on measuring
recall and tuning n_probe for your dataset.
Why approximate search? The curse of dimensionality
Tree pruning loses traction as dimensionality grows: in high-$n$ spaces, nearly all points concentrate in a thin shell near the boundary and distances between any two points become almost equal, leaving the tree nothing to prune. That's why exact tree search offers diminishing returns beyond
Full derivation, with volume integrals and a numerical illustration →
Pickle serialisation
All VPTree and IVFFlat indices are pickle-serialisable — save a built index to disk and reload it without rebuilding:
import pickle, numpy as np, pynear data = np.random.rand(20_000, 32).astype(np.float32) index = pynear.VPTreeL2Index() index.set(data) blob = pickle.dumps(index) index2 = pickle.loads(blob)
Tree inspection
####################
# [VPTree state]
Num Data Points: 100
Total Memory: 8000 bytes
####################
[+] Root Level:
Depth: 0
Height: 14
Num Sub Nodes: 100
...
Note:
to_string()traverses the whole tree — use it for debugging only.
Demos
Two interactive desktop demos ship in demo/ and run with a single command:
pip install PySide6 python demo/point_cloud.py # KNN Explorer — hover over 1M points to find neighbours python demo/voronoi.py # Voronoi diagram — drag seed points, watch cells reshape live
- KNN Explorer — scatter up to 1 million 2-D points and hover to see k nearest neighbours highlighted in real time. Supports zoom, pan, and configurable point size.
- Voronoi Diagram — every canvas pixel is coloured by its nearest seed point. Add, drag, and remove seeds; the diagram redraws live using pynear's batch 1-NN.
See docs/demos.md for full details.
Benchmarks
pynear is evaluated against Faiss, scikit-learn, and Annoy across L2 / L1 / Hamming and dimensionalities from 2-D to 1024-D.
- Binary / Hamming: see the SIFT1M results below
and the reproducible, thread-matched pynear vs Faiss comparison
— ~40× faster than Faiss's brute-force
IndexBinaryFlaton 512-bit near-duplicates, and faster than Faiss's ownIndexBinaryMultiHashat matched recall. - Full report: the formal benchmark PDF covers exact and approximate modes across L2 / L1 / Hamming with the recall–latency Pareto analysis. (Its binary-descriptor numbers are superseded by the thread-matched results/faiss_comparison.md.)
Quick standalone run:
Real-World Benchmark — SIFT1M Binary
Performance of pynear's approximate Hamming-distance indices on the INRIA TEXMEX SIFT1M dataset: 1,000,000 × 128-dim float SIFT descriptors sign-quantised to 128-bit binary (16 bytes/descriptor). Ground truth computed by exact brute-force Hamming k-NN over 500 queries, k=10. Machine: Intel(R) Core(TM) Ultra 9 285K.
The baseline below is a naive numpy scan. For the apples-to-apples comparison
against Faiss's optimised brute-force (IndexBinaryFlat) and Faiss's own
Multi-Index Hashing, see
results/faiss_comparison.md.
| Index | Configuration | Build (s) | ms / query | QPS | Recall@10 |
|---|---|---|---|---|---|
| numpy brute-force (naive) | N=1,000,000 | — | 50.1 | 20 | 1.000 |
| IVFFlatBinaryIndex | nlist=500, nprobe=31 | 6.24 | 1.47 | 679 | 0.825 |
| IVFFlatBinaryIndex | nlist=500, nprobe=62 | 6.24 | 2.85 | 351 | 0.842 |
| IVFFlatBinaryIndex | nlist=500, nprobe=125 | 6.24 | 5.65 | 177 | 0.845 |
| IVFFlatBinaryIndex | nlist=500, nprobe=250 | 6.24 | 10.74 | 93 | 0.845 |
| IVFFlatBinaryIndex | nlist=500, nprobe=500 | 6.24 | 20.95 | 48 | 0.845 |
| MIHBinaryIndex | m=8, radius=4 | 2.81 | 0.09 | 10825 | 0.585 |
| MIHBinaryIndex | m=8, radius=8 | 2.81 | 0.97 | 1031 | 0.829 |
| MIHBinaryIndex | m=8, radius=12 | 2.81 | 0.95 | 1053 | 0.829 |
| MIHBinaryIndex | m=8, radius=16 | 2.81 | 4.73 | 211 | 0.842 |
| MIHBinaryIndex | m=8, radius=24 | 2.81 | 12.37 | 81 | 0.844 |
| MIHBinaryIndex | m=8, radius=32 | 2.81 | 19.79 | 51 | 0.843 |
| MIHBinaryIndex | m=8, radius=48 | 2.81 | 36.34 | 28 | 0.843 |
Recall@10 is the standard
|returned ∩ true| / k, measured against a fixed exact-Hamming ground truth. Because Hamming distances are integers, the 10-th-nearest boundary is often tied, so even an exact scan can score below 1.0 against this reference — the value reflects tie-breaking, not missed neighbours.
Key takeaways:
IVFFlatBinaryIndex(nprobe=125) reaches Recall@10=0.845 at 177 QPS (9× faster than the naive numpy scan).MIHBinaryIndex(radius=4) is the lowest-latency single configuration at 10825 QPS (Recall@10=0.585).- MIH's real advantage shows on wide descriptors (256–512-bit) and small-radius / near-duplicate retrieval. On narrow 128-bit data at high recall, an optimised brute-force scan can outperform it — pick the index to the workload.
Reproduce:
python demo_binary.py· add--smallfor a 10 K quick test ·--n-gt-queries Nto adjust evaluation size.
Development
Building and installing locally
Running tests
Debugging C++ code on Unix
CMake build files are provided for building and running C++ tests independently:
Tests are built in Debug mode by default, so you can debug with GDB:
gdb ./build/tests/vptree-testsDebugging C++ code on Windows
Install CMake (py -m pip install cmake) and pybind11 (py -m pip install pybind11), then:
mkdir build cd build cmake ..\pynear
You may need to pass extra arguments, for example:
cmake ..\pynear -G "Visual Studio 17 2022" -A x64 ^ -DPYTHON_EXECUTABLE="C:\Program Files\Python312\python.exe" ^ -Dpybind11_DIR="C:\Program Files\Python312\Lib\site-packages\pybind11\share\cmake\pybind11"
Build and run vptree-tests.exe from the generated solution.
Formatting code
Star history
If pynear saved you time, consider starring the repo — it's the cheapest way to support the project and helps others discover it.

