The Piecewise Geometric Model index (PGM-index) is a data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes while providing the same worst-case query time guarantees.
Unlike traditional tree-based indexes that are blind to the possible regularity present in the input data, the PGM-index exploits a learned mapping between the indexed keys and their location in memory. The succinctness of this mapping, coupled with a peculiar recursive construction algorithm, makes the PGM-index a data structure that dominates traditional indexes by orders of magnitude in space while still offering the best query and update time performance.
In addition, the PGM-index offers compression, distribution-awareness, and multi-criteria adaptability, which makes it suitable for addressing the increasing demand for big data systems that adapt to the rapidly changing constraints imposed by the wide range of modern devices and applications.
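To make the idea of a learned mapping between keys and positions concrete, here is a minimal sketch that fits a single straight line to a sorted array, records the largest prediction error as `epsilon`, and then answers lookups by searching only a small window around the predicted position. The names `LinearModel`, `fit` and `lookup` are hypothetical and chosen for illustration; the PGM-index itself uses many such segments arranged recursively, so this is a simplification and not the library's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical single-segment "learned" mapping from a key to its position.
struct LinearModel {
    double first_key = 0.0;   // smallest indexed key
    double slope = 0.0;       // positions per unit of key
    std::size_t n = 0;        // number of indexed keys
    std::size_t epsilon = 0;  // largest |predicted - true| position over all keys

    // Predicted position of `key`, clamped to [0, n - 1].
    std::size_t predict(std::int64_t key) const {
        double p = slope * (static_cast<double>(key) - first_key);
        if (p <= 0.0) return 0;
        if (p >= static_cast<double>(n - 1)) return n - 1;
        return static_cast<std::size_t>(p);
    }
};

// Fit a line through the first and last key and measure its maximum error.
LinearModel fit(const std::vector<std::int64_t>& keys) {
    assert(!keys.empty() && std::is_sorted(keys.begin(), keys.end()));
    LinearModel m;
    m.n = keys.size();
    m.first_key = static_cast<double>(keys.front());
    double span = static_cast<double>(keys.back()) - m.first_key;
    m.slope = span > 0.0 ? static_cast<double>(m.n - 1) / span : 0.0;
    for (std::size_t i = 0; i < m.n; ++i) {
        std::size_t p = m.predict(keys[i]);
        m.epsilon = std::max(m.epsilon, p > i ? p - i : i - p);
    }
    return m;
}

// Index of the first element >= key, found by a binary search restricted
// to the window [predicted - epsilon, predicted + epsilon + 2).
std::size_t lookup(const std::vector<std::int64_t>& keys,
                   const LinearModel& m, std::int64_t key) {
    std::size_t p = m.predict(key);
    std::size_t lo = p > m.epsilon ? p - m.epsilon : 0;
    std::size_t hi = std::min(keys.size(), p + m.epsilon + 2);
    auto it = std::lower_bound(keys.begin() + lo, keys.begin() + hi, key);
    return static_cast<std::size_t>(it - keys.begin());
}
```

With a single line the error can be large on skewed data, which is why a piecewise model with a fixed, user-chosen error bound is the more useful design: the search window then has a size that does not depend on the number of keys.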
Features
Learned
It is one of the first results on learned indexes, which achieve their performance by capturing the distribution of the input data.
Optimal
It is the first learned index with provably optimal time and space complexity guarantees. This makes it resistant to adversarial inputs and queries.
Memory efficient
It always consumes less space than traditional tree-based indexes, often orders of magnitude less. If this is not enough, there is even a compressed version.
Fast construction
Its construction, based on a single scan of the input data, matches the efficiency of traditional indexes even on gigabytes of data.
Tunable
It can be tailored to various storage devices and memory hierarchies and auto-tuned to any given constraints on memory usage and query time.
Flexible
It can be beneficial in various applications, from databases to geographic information systems and search engines, as it supports several kinds of queries, from point to multidimensional.
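The single-scan construction mentioned under "Fast construction" can be illustrated with a short sketch. The code below is not the algorithm used by the PGM-index, which computes an optimal piecewise linear approximation; it swaps in a simpler greedy "shrinking cone" heuristic that also needs only one pass and also guarantees that every key's predicted position is within `epsilon` of its true position. The names `Segment`, `build_segments` and `max_error` are hypothetical, and the sketch assumes strictly increasing keys that are small enough to be represented exactly as `double`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// One linear segment: position ~= first_pos + slope * (key - first_key).
struct Segment {
    std::int64_t first_key;  // first key covered by the segment
    std::size_t first_pos;   // position of first_key in the sorted array
    double slope;            // positions per unit of key
};

// Greedy single-scan segmentation (assumption: NOT the optimal algorithm).
// For each segment we keep the interval [lo, hi] of slopes that predict all
// keys seen so far within epsilon, and close the segment when it gets empty.
std::vector<Segment> build_segments(const std::vector<std::int64_t>& keys,
                                    double epsilon) {
    assert(epsilon >= 0.0);
    std::vector<Segment> segments;
    const std::size_t n = keys.size();
    std::size_t start = 0;
    while (start < n) {
        double lo = 0.0;
        double hi = std::numeric_limits<double>::infinity();
        std::size_t i = start + 1;
        for (; i < n; ++i) {
            double dx = static_cast<double>(keys[i]) -
                        static_cast<double>(keys[start]);
            assert(dx > 0.0);  // keys must be strictly increasing
            double dy = static_cast<double>(i - start);
            double new_lo = std::max(lo, (dy - epsilon) / dx);
            double new_hi = std::min(hi, (dy + epsilon) / dx);
            if (new_lo > new_hi) break;  // no slope fits key i: close segment
            lo = new_lo;
            hi = new_hi;
        }
        // A segment holding a single key has no slope constraint; use 0.
        double slope = std::isinf(hi) ? 0.0 : (lo + hi) / 2.0;
        segments.push_back({keys[start], start, slope});
        start = i;
    }
    return segments;
}

// Largest |predicted - true| position over all keys, for checking the bound.
double max_error(const std::vector<Segment>& segments,
                 const std::vector<std::int64_t>& keys) {
    double worst = 0.0;
    std::size_t s = 0;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        while (s + 1 < segments.size() && segments[s + 1].first_pos <= i) ++s;
        const Segment& seg = segments[s];
        double predicted = static_cast<double>(seg.first_pos) +
            seg.slope * (static_cast<double>(keys[i]) -
                         static_cast<double>(seg.first_key));
        worst = std::max(worst, std::fabs(predicted - static_cast<double>(i)));
    }
    return worst;
}
```

A greedy pass like this may produce more segments than an optimal segmentation would, so it trades some space for simplicity.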
Computational complexity
Let $n$ be the number of keys, and $B$ be the page size of the machine.
| | PGM-index | B-tree | Self-balancing BST† | Skip list | Sorted array |
|---|---|---|---|---|---|
| Predecessor query§ (static case) | $O(\log_B n)$ | $O(\log_B n)$ | $O(\log n)$ | $O(\log n)$ w.h.p. | $O(\log n)$ |
| Predecessor query (dynamic case#) | $O(\log^2_B n)$ | $O(\log_B n)$ | $O(\log n)$ | $O(\log n)$ w.h.p. | $O(\log n)$ |
| Insert/delete | $O(\log_B n)$ amortised | $O(\log_B n)$ | $O(\log n)$ | $O(\log n)$ w.h.p. | $O(n)$ |
| Index space in words | $O(\frac{n}{B^2})$ w.h.p.‡ | $O(\frac{n}{B})$ | $O(n)$ | $O(n)$ w.h.p. | $O(1)$ |
For more information, visit the computational complexity page.
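To get a feel for the bounds in the table, one can plug in illustrative values while ignoring the constant factors hidden by the $O$-notation. The values $n = 2^{30}$ and $B = 2^{10}$ below are assumptions chosen for round arithmetic, not measurements:

$$
\begin{aligned}
\log_B n &= \frac{\log_2 n}{\log_2 B} = \frac{30}{10} = 3, &\qquad \log_2 n &= 30, \\
\frac{n}{B^2} &= \frac{2^{30}}{2^{20}} = 2^{10}, &\qquad \frac{n}{B} &= \frac{2^{30}}{2^{10}} = 2^{20}.
\end{aligned}
$$

Under these assumed values, the $\log_B n$ terms are ten times smaller than the $\log n$ terms, and the $\frac{n}{B^2}$ space term is $2^{10}$ times smaller than $\frac{n}{B}$. Actual running times and sizes depend on the constants and, for the PGM-index space bound, on the data (the bound holds w.h.p.).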
Running example
```cpp
#include <vector>
#include <cstdlib>
#include <iostream>
#include <algorithm>
#include "pgm/pgm_index.hpp"

int main() {
    // Generate some random data
    std::vector<int> data(1000000);
    std::generate(data.begin(), data.end(), std::rand);
    data.push_back(42);
    std::sort(data.begin(), data.end());

    // Construct the PGM-index
    const int epsilon = 128; // space-time trade-off parameter
    pgm::PGMIndex<int, epsilon> index(data);

    // Query the PGM-index: search() returns an approximate range,
    // which is then refined with a binary search on the data
    auto q = 42;
    auto range = index.search(q);
    auto lo = data.begin() + range.lo;
    auto hi = data.begin() + range.hi;
    std::cout << *std::lower_bound(lo, hi, q);

    return 0;
}
```
Read more about the C++ API here.
Publications
- Paolo Ferragina and Giorgio Vinciguerra. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. PVLDB, 13(8): 1162-1175, 2020.
- Paolo Ferragina, Fabrizio Lillo, and Giorgio Vinciguerra. Why are learned indexes so effective? In: Proceedings of the 37th International Conference on Machine Learning (ICML 2020).
- Paolo Ferragina, Fabrizio Lillo, and Giorgio Vinciguerra. On the performance of learned data structures. Theoretical Computer Science, 2021.
Cite us
If you use the library, please link to this website and cite the following paper:
Paolo Ferragina and Giorgio Vinciguerra. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. PVLDB, 13(8): 1162-1175, 2020.
@article{Ferragina:2020pgm,
Author = {Paolo Ferragina and Giorgio Vinciguerra},
Title = {The {PGM-index}: a fully-dynamic compressed learned index with provable worst-case bounds},
Year = {2020},
Volume = {13},
Number = {8},
Pages = {1162--1175},
Doi = {10.14778/3389133.3389135},
Url = {https://pgm.di.unipi.it},
Issn = {2150-8097},
Journal = {{PVLDB}}}
Some interesting uses of the PGM-index
- LeMonHash. A monotone minimal perfect hash function that uses the PGM-index in its design.
- PyGM. A Python package of sorted containers that uses the PGM-index for efficient query performance and memory usage.
- Manticore. An open-source fast database that uses the PGM-index in its column-oriented storage library.
We would love to hear whether you have used our code in your projects. We will list the most interesting applications of the PGM-index here!
Contribute
There are many ways to contribute to this project. To mention a few:
- Engineering the support for insertions and deletions (done!).
- Making the index SIMD aware. For example, you could set the error to the SIMD register width and use vector instructions to traverse the levels of the index with no branches.
- Adding support for concurrent and batch queries.
Feel free to submit issues and pull requests in the GitHub repository.
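As a starting point for the SIMD idea listed above, here is a small sketch of a branch-free search inside a fixed-size window. Instead of a binary search, it counts how many keys in the window are smaller than the query, which gives the same position as `std::lower_bound` on a sorted window. The name `rank_in_window` and the window size `kWindow` are hypothetical; the code is portable C++ rather than explicit vector intrinsics, and whether a compiler actually turns the loop into vector instructions depends on the compiler and its flags.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical window size, e.g. eight 32-bit lanes of a 256-bit register.
constexpr std::size_t kWindow = 8;

// Number of keys in the sorted window that are strictly smaller than q,
// i.e. the offset of the first key >= q. The loop has a fixed trip count
// and no data-dependent branch: each comparison contributes 0 or 1.
std::size_t rank_in_window(const std::array<std::int32_t, kWindow>& window,
                           std::int32_t q) {
    assert(std::is_sorted(window.begin(), window.end()));
    std::size_t count = 0;
    for (std::size_t i = 0; i < kWindow; ++i) {
        count += static_cast<std::size_t>(window[i] < q);
    }
    return count;
}
```

Tying the window size to the error bound is what makes this attractive: if the approximate range returned by each level fits in one register, the final search at that level becomes a handful of comparisons with a predictable cost.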