ANN (Approximate Nearest Neighbor) search is a crucial part of any AI infrastructure. It dramatically enhances the capabilities of LLMs with RAG (Retrieval-Augmented Generation). Initially, there were specialized vector databases optimized for ANN. Now, practically all database management systems provide this functionality, including general-purpose ones like PostgreSQL, MariaDB, and YDB, to name just a few. That’s why performance comparison is essential: you need to know which DBMS delivers the best performance at a given target recall.
One of the most well-known and widely adopted benchmarks is ann-benchmarks. It has 5.5K stars on GitHub and almost 1K forks. There are recent posts comparing pgvector and Qdrant, pgvector and Pinecone, MariaDB and others, etc. In this post, we demonstrate why many published ann-benchmarks results can be unreliable. To illustrate this, we use pgvector and the cohere-wikipedia-22-12 dataset (which is widely used in published results we reference), but the described issues are applicable to other implementations and datasets.
After fixing several major methodological issues in ann-benchmarks and its forks (detailed later in this section), we achieved nearly 20x QPS improvement in pgvector as reported by the benchmark on the client side, with even lower latencies than those measured using the original benchmark version. Here are the results obtained using the fork done by Tiger Data for their Postgres vs. Qdrant and pgvector vs. Pinecone comparisons, using the cohere-wikipedia-22-12-10M-angular dataset with 500K test vectors (in both cases, the parameters are m=24, ef_construction=200, ef_search=20):
Algorithm      Recall     QPS     P50 (ms)  P95 (ms)  P99 (ms)  P999 (ms)
--------------------------------------------------------------------------
PGVectorHNSW   0.8926   1149.1      12.8      75.8     124.5     210.3
PGVectorHNSW   0.8601   1095.5      12.8      22.6      98.6     228.5
PGVectorHNSW   0.8032   1123.8      12.4      21.1      85.2     225.7

And the results from our fork with fixes:
Algorithm      Recall      QPS     P50 (ms)  P95 (ms)  P99 (ms)  P999 (ms)
--------------------------------------------------------------------------
PGVector       0.8926   18195.7      6.3      11.4      15.8      29.0
PGVector       0.8601   20002.7      5.7      10.0      14.0      25.7
PGVector       0.8032   22019.3      5.3       8.6      12.0      22.5

Importantly, neither the database query execution path nor the server-side configuration was modified. The ~20x QPS difference comes entirely from the benchmark's client-side implementation details, measurement methodology, and reporting logic, not from changes to the database itself. Postgres configuration and cluster setup are identical and therefore out of scope. However, if you're interested, we use the same setup as in our TPC-C benchmarking, with the following changes:
- Switched to Ubuntu 22.04.5 (Linux version 5.15.0-164-generic).
- Updated Postgres from 16 to 17.
- Used pgvector v0.8.1.
Long story short, the main issues responsible for this discrepancy are summarized below:
- The vanilla version calculates QPS as 1.0 / best_search_time, where best_search_time is actually the sum of all latencies divided by the number of queries, i.e. the average latency. That's debatable in the case of a single thread and incorrect when requests run in parallel: QPS is defined as the number of completed requests per unit of wall-clock time, whereas 1.0 / average_latency ignores concurrency and assumes strictly sequential execution. As soon as multiple requests overlap in time, this formula no longer represents throughput and produces misleadingly low QPS values.
- In some ann-benchmarks forks, the QPS calculation is changed to request_count / wall_time, which looks correct. However, the total time measured by the benchmark is inaccurate because it includes time spent on creating new processes (if applicable), connecting to the database, and opening sessions. Combined with short benchmark runs, this makes QPS much lower than it actually is. For example, consider pgvector with 5K test requests: the total time might be 4.9 s while the query execution part is just 0.6 s, so you get an incorrect 1K QPS instead of the actual 8.3K (see the sketch after this list).
- The forks we have seen do not try to deal with the thundering herd: parallel queries are started at the same time, without jitter. Again, with short benchmark executions, this seriously affects the results.
- Finally, many DBMS “adapters” (“algorithms” in ANN terminology) use threading instead of multiprocessing. But this is Python, and the GIL is still there. Those that use multiprocessing are still affected by the first three issues.
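To make the first two points concrete, here is a small numeric sketch. The numbers are illustrative (roughly matching the pgvector example above), not measured values:

# Illustrative numbers only: 16 parallel workers, 10 ms per query.
workers = 16
avg_latency = 0.010     # seconds
request_count = 5000

# Vanilla ann-benchmarks: QPS = 1.0 / average latency.
# This ignores concurrency, so 16 overlapping workers still report ~100 QPS.
qps_vanilla = 1.0 / avg_latency                      # 100.0

# Actual throughput: completed requests per unit of wall-clock time.
wall_time = request_count * avg_latency / workers    # ~3.1 s
qps_actual = request_count / wall_time               # ~1600.0

# Even request_count / wall_time is misleading if wall_time also includes
# forking, connecting, and opening sessions (e.g. 0.6 s of queries inside a
# 4.9 s measured interval on a short run).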
Ann-benchmarks pitfalls
Ann-benchmarks is a toolset/framework for benchmarking ANN search algorithms. It generates datasets, provides execution infrastructure for algorithms, calculates recall, QPS, and latencies, and produces plots. Dataset generation is excellent: with a single line of code, we added the cohere-wikipedia-22-12-10M-angular dataset containing 10M vectors and 5K randomly chosen test vectors. We use cohere-wikipedia because it is widely used, including in some of the published results mentioned above.
When you add a dataset, ann-benchmarks downloads the data and performs an exact k-NN search over the full dataset. The exact nearest neighbors (IDs and distances) are saved as part of the generated dataset and later used to calculate the recall of ANN search algorithms. Exact search is brute force: it takes a very long time even for medium-sized datasets with a small number of test vectors. For example, it took us 4-5 days to generate cohere-wikipedia-22-12-100M-angular-1M (a 100M dataset with 1M test vectors). That might be why most existing datasets are small, while bigger ones have only a few test vectors.
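For illustration, here is a minimal sketch of what such a brute-force pass looks like for an angular dataset (our own simplified code, not the exact ann-benchmarks implementation; the distance convention here is 1 minus cosine similarity, and only the ordering matters for recall):

import numpy as np

def exact_knn_angular(corpus, queries, k=100, block=1024):
    # Brute force: every query is compared against every corpus vector,
    # so the cost grows as O(N * Q * dim).
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    queries = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    ids, dists = [], []
    for start in range(0, len(queries), block):
        sims = queries[start:start + block] @ corpus.T   # cosine similarities
        top = np.argsort(-sims, axis=1)[:, :k]           # k nearest per query
        ids.append(top)
        dists.append(1.0 - np.take_along_axis(sims, top, axis=1))
    return np.vstack(ids), np.vstack(dists)

With 100M corpus vectors and 1M test vectors, this is on the order of 10^14 dot products, which explains why ground-truth generation takes days.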
Back to cohere-wikipedia-22-12-10M-angular: 5K test vectors might be just fine for calculating recall. The main problem is that ann-benchmarks runs the test data only once, and if you have a performant database, the execution is very short (e.g., around 0.2 seconds). Using more test vectors does not change recall, but it stabilizes throughput and latency measurements by amortizing fixed overheads. Running the same small test set multiple times and selecting the best run does not address this issue: it does not increase the total amount of work performed and therefore does not amortize fixed overheads. As a result, throughput and latency measurements remain dominated by startup and coordination costs rather than steady-state query execution.
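A back-of-the-envelope model shows the effect of test set size on what the benchmark reports. The numbers below are the illustrative ones from the pgvector example above (~4.3 s of fixed overhead and a true rate of ~8.3K QPS):

overhead = 4.3      # seconds spent forking, connecting, and opening sessions
true_qps = 8300.0   # steady-state query throughput

def reported_qps(num_queries):
    query_time = num_queries / true_qps
    return num_queries / (overhead + query_time)

print(round(reported_qps(5_000)))    # ~1020: the fixed overhead dominates
print(round(reported_qps(500_000)))  # ~7750: overhead is mostly amortized

# Re-running the same 5K test set several times and keeping the best run does
# not help: each run pays the overhead again, so the reported number stays ~1K.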
Let’s describe this more precisely. Ann-benchmarks provides a batch mode that submits all queries to the algorithm implementations at once. This is likely why database developers started to add their databases as ANN algorithms and use batch mode to execute queries in parallel. To be executed in batch mode, an algorithm implements two functions:
batch_query(self, all_test_vectors)
get_batch_results(self)
Then, it is executed as follows:
start = time.time()
algo.batch_query(X, count)   # all work, including any setup, happens inside this call
total = time.time() - start  # measured time used for throughput reporting
results = algo.get_batch_results()
batch_latencies = algo.get_batch_latencies()
It’s up to the algorithm to decide whether it should run many async requests, run threads, fork, or something else. The key insight is that the measured time includes time spent on forking (if applicable), connecting to the database, and opening sessions. As stated above, executions might be short, so this extra time can dramatically decrease the calculated QPS.
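To make this concrete, here is a simplified sketch of what a multiprocessing-based adapter might look like. This is not our actual pgvector module: the psycopg2 driver, DSN, table name items, and column name embedding are placeholders.

import multiprocessing as mp
import time

import psycopg2  # assumed client library; a real adapter may use a different driver


def _worker(worker_id, queries, count, out_queue):
    # Connecting and opening the session happen here, inside batch_query,
    # so they are included in the interval measured by the benchmark.
    conn = psycopg2.connect("dbname=ann user=postgres")  # placeholder DSN
    cur = conn.cursor()
    ids, latencies = [], []
    for q in queries:
        vec = "[" + ",".join(str(float(x)) for x in q) + "]"
        t0 = time.perf_counter()
        cur.execute(
            "SELECT id FROM items ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec, count),  # placeholder table and column names
        )
        ids.append([row[0] for row in cur.fetchall()])
        latencies.append(time.perf_counter() - t0)
    conn.close()
    out_queue.put((worker_id, ids, latencies))


class PGVectorBatchSketch:
    def __init__(self, workers=16):
        self.workers = workers
        self.ids, self.latencies = [], []

    def batch_query(self, X, count):
        out_queue = mp.Queue()
        step = (len(X) + self.workers - 1) // self.workers
        chunks = [X[i * step:(i + 1) * step] for i in range(self.workers)]
        procs = [mp.Process(target=_worker, args=(i, chunks[i], count, out_queue))
                 for i in range(self.workers)]
        for p in procs:
            p.start()  # process creation is also inside the measured interval
        parts = sorted(out_queue.get() for _ in procs)  # restore original query order
        for p in procs:
            p.join()
        self.ids = [r for _, chunk_ids, _ in parts for r in chunk_ids]
        self.latencies = [l for _, _, chunk_lat in parts for l in chunk_lat]

    def get_batch_results(self):
        return self.ids

    def get_batch_latencies(self):
        return self.latencies

Note that even in this shape, forking and connection setup still land inside batch_query and therefore inside the measured interval.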
In principle, an algorithm could mitigate this by warming up a connection pool ahead of time. However, ann-benchmarks does not provide a clear lifecycle hook or explicit separation between initialization and the measured execution phase: algorithm parameters that may affect connection setup are applied between construction and batch execution. As a result, there is no well-defined point at which a connection pool can be safely initialized outside the measured interval. This appears to be a limitation of the benchmark framework rather than a fundamental constraint, and could potentially be addressed with changes to its execution model.
Moreover, even if you fix the time calculation as we did, the thundering herd is still an issue. According to our runs, QPS is 3x lower (and latencies are way higher) on short runs despite a warm cache. Adding some jitter on short runs helps, but if you have many threads and a short execution, you can’t afford a good spread (at least without deeper reworking of the way ann-benchmarks works).
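A minimal way to add such jitter (the 50 ms bound below is an arbitrary choice of ours, not a value from ann-benchmarks) is a small random delay before each worker issues its first query:

import random
import time

def staggered_start(max_jitter=0.050):
    # Spread worker start times over up to 50 ms so that all parallel workers
    # do not hit the database at exactly the same moment.
    time.sleep(random.uniform(0.0, max_jitter))

# The catch described above: on a short run the jitter itself becomes a visible
# fraction of the measured wall time, so it has to stay small, which limits how
# well it can actually spread the load.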
Fork pitfalls
Now, let’s review some published results.
Tiger Data’s pgvector vs. Qdrant and pgvector vs. Pinecone comparisons seem to be based on their fork. They have fixed the QPS calculation bug, but they still use a small number of test vectors, and the measured total time is highly inaccurate. Moreover, they use threading in the pgvector and Pinecone modules. Unfortunately, all this renders the reported throughput numbers unreliable.
Let’s have a look at MariaDB’s fork. There are multiple posts based on their ann-benchmarks fork: this and that, to name a few. Their MariaDB module uses multiprocessing, but they keep the original, incorrect QPS calculation. We also again see datasets with a small number of test vectors. Unfortunately, the numbers are not reliable here either.
Conclusion
We hope that our findings will help improve benchmarking of ANN search, and benchmarking in general. Performance is a great area, and it’s cool to have fair and correct measurements. We have created a PR that upstreams a proper QPS calculation. Still, algorithms have to be updated to measure accurate wall time and must switch to multiprocessing (or async/await) if they don’t use it. Our fork with all the described fixes and a proper pgvector implementation is available here. Generating datasets with plenty of test vectors is time-consuming, so we share some pre-generated cohere-wikipedia-22-12 datasets:
> aws s3 --no-sign-request --endpoint-url=https://storage.yandexcloud.net ls s3://ann-data/
PRE /
2025-12-23 19:05:38 288852793344 cohere-wikipedia-22-12-100M-angular-100K.hdf5
2025-08-11 18:00:38 288708793344 cohere-wikipedia-22-12-100M-angular.hdf5
2025-08-11 17:58:38 288708793344 cohere-wikipedia-22-12-100M-euclidean.hdf5
2025-11-22 12:03:48 33056008192 cohere-wikipedia-22-12-10M-angular-500K.hdf5
2025-11-22 12:05:50 30953608192 cohere-wikipedia-22-12-10M-angular-50K.hdf5
2025-08-11 13:43:36 28426737041 cohere-wikipedia-22-12-10M-angular.hdf5.gz
2025-11-22 12:01:38 4702728192 cohere-wikipedia-22-12-10k-angular-1M.hdf5
2025-11-22 12:01:39 31195392 cohere-wikipedia-22-12-10k-angular.hdf5
2025-08-11 13:43:49 2877862286 cohere-wikipedia-22-12-1M-angular.hdf5.gz
2025-11-22 12:01:17 153603078144 cohere-wikipedia-22-12-50M-angular.hdf5

As for us, we’re still evaluating the option of using ann-benchmarks to compare YDB against other databases. Stay tuned for updates and our results.