Lectures in Advanced Data Structures (6.851)

L01 Feb. 7 [+] Temporal: class overview, pointer machine, partial persistence, full persistence, confluent persistence, functional Scribe Notes [src] Erik's Notes Video Persistent data structures

The first lecture is about “persistence” (which corresponds to the “branching universe” model of time travel). On the one hand, we'd like to remember all past versions of our data structure (“partial persistence”). On the other hand, we'd like to be able to modify past versions of our data structure, forking off a new version (“full persistence”). Surprisingly, both of these goals are achievable with only a constant-factor overhead, in a fairly general model called the “bounded-degree pointer machine”. A bigger challenge is the ability to merge different versions into one version, which can e.g. allow you to double your data structure's size in one operation (“confluent persistence”). It's still unsolved whether this is possible with small overhead, but I'll describe what's known.

[Details and references]

This lecture overviews the nine subjects of the course: integer and string data structures, persistent and dynamic data structures, data structures that takes memory-hierarchy into account and data structures that uses a minimal amount of space, the problem of whether there exists an optimal binary search tree and the studying of hashing, and geometric data structures. The first lecture covers persistent data structures. First we introduce the pointer machine model, and four levels of persistence: partial, full, confluent, and functional. Then we show that any pointer-machine data structure with no more than O(1) pointers to any nodes (in any version) can be made partially persistent with O(1) amortized multiplicative overhead & O(1) space per change [Driscoll, Sarnak, Sleator, Tarjan - JCSS 1989] . Then we show that this applies to full persistence as well [Driscoll, Sarnak, Sleator, Tarjan - JCSS 1989] with the help of an order file maintenance data structure Dietz & Sleator - STOC, 1987] We briefly mention a de-amortization technique for partial persistence [Brodal - NJC 1996] with its full-persistence version remaining open. For conflict persistence, in the general transformation scenario, given a version v in the version DAG, we define e(v) = 1+ log(#paths from root to v), which measures the version DAG's deviation from a tree. Confluent persistence with O(e(v)) per operation is possible [Fiat & Kaplan - J.Alg 2003] . A data structure with O(lg n) or lower cost per op remains open. For disjoin transformation, O(lg n) overhead is possible [Callette, Iacono, Langerman - SODA 2012] . For functional persistence, we show a data structure for balanced BST with O(lg n) per op [Okasaki-book 2003] , a data structure for link-cut tree with the same bound [Demaine, Langerman, Price] , one for deques with concatenation in O(1) per op [Kaplan,Okasaki, Tarjan - SICOMP 2000] and update and search in O(lg n) per op [Brodal, Makris, Tsichlas, - ESA 2006] , and one for tries with local navigation and subtree copy/delete and O(1) fingers maintained to "present" [Demaine, Langerman, Price] . What's beyond the scope of this lecture includes a log factor separation for functional persistence from pointer machine [Pippinger - TPLS 1997], the open problem of proving bigger separation for both functional and confluent persistence, general transformations for functional and confluent persistence, lists with split and concatenate, and arrays with copy and paste.

L02 Feb. 9 [+] Temporal: partial retroactivity, full retroactivity, nonoblivious retroactivity Scribe Notes [src] Erik's Notes Video Retroactive data structures

Today's lecture is our second and final lecture on time travel, or more precisely, temporal data structures. Here we will study retroactive data structures, which mimic the "plastic timeline" model of time travel. For example, in Back to the Future, Marty goes back in time, makes changes, and returns to the present that resulted from those changes and all the intervening years. In a partially retroactive data structures, the user can go back in time, perform an additional operation, return to the present that results, and query the resulting data structure. In this way, we maintain a single (changing) timeline, consisting of the sequence of update operations. Retroactive data structures can add, or remove, an update at any time, not just the end (present). A fully retroactive data structure can furthermore query the data structure at any time in the past. Finally, a third type of retroactivity, called "nonobvious", puts queries on the timeline as well, and reports the first query whose answer changed. As you might expect, retroactivity is hard to obtain in general, but good bounds are known for several data structures of interest.

[Details and references]

This lecture introduces the retroactive data structure and a new computation model, the cell probe model. Retroactive data structure maintains a linear timeline and allows updates to be performed at any time [Demaine, Iacono, Langerman — 2003 / T. Alg 2007] . Partial retroactivity only permit queries at the present time, while full retroactivity allows queries at any time. A third kind of retroactivity, non-oblivious retroactivity, lets the user put a series of queries on the time line, and whenever an update is performed, the data structure reports the first query whose value has changed. For partial and full retroactivity, there are two easy cases. The first one concerns commutative and invertible updates. The second, being slightly more complex, concerns decomposable search problem [Bentley & Saxe — J. Alg 1980], which is solved using a segment tree. The harder situation with general transformation [Demaine, Iacono, Langerman — 2003 / T. Alg 2007] can be solved naively using rollback. Concerning general transformation's lower bound, it has been proven that Ω(r) overhead can be necessary [Frandsen, Frandsen, Mitlersen — I&C 2001]. The same paper also proves a Ω((r/lg r)^1/2) lower bound for cell-probe model. Cell probe model is "a model of computation where the cost of a computation is measured by the total number of memory accesses to a random access memory with log n bits cell size. All other computations are not counted and are considered to be free." [NIST] A better lower bound for cell probe model, such as Ω(r/poly lg r), is open. With the help of a geometric representation, we show a partially retroactive priority queue with O(lg n) insert and delete-min [Demaine, Iacono, Langerman — 2003 / T. Alg 2007] . Other data structures mentioned include queue, deque, union-find, a fully retroactive priority queue with O(m^1/2 lg m) per op and the successor problem [Giora & Kaplan — T. Alg 2009]. A better fully persistent priority queue is open. The best successor problem solution requires fractional cascading from lecture 3 and the van Emde Boas data structure from lecture 11. For non-oblivious retroactivity [Acar, Blelloch, Tangwongsan — CMU TR 2007], a priority queue is shown with the help of a geometric representaiton.

L03 Feb. 23 [+] Geometric: point location via persistence, dynamic via retroactive; orthogonal range queries, range trees, layered range trees, dynamizing augmentation via weight balance, fractional cascading Scribe Notes [src] Erik's Notes Video In this lecture, the first of two about geometric data structures, we'll talk about two major problems—point location and range searching—and tie them to several major data structural techniques: persistence, retroactivity, dynamization of augmentation through weight balance, and fractional cascading.

In (planar) point location, we need to preprocess a planar map to support queries of the form "which face contains this point?" This problem is useful for figuring out where you clicked in a GUI, or for figuring out what city/country/etc. contain your GPS coordinates. The static (preprocessing) version of this problem can be solved optimally (in logarithmic time per query) by applying persistence to a classic sweep-line technique. The dynamic version, where the map can be modified, can be solved for orthogonal maps in optimal logarithmic time per operation using retroactive successor from Lecture 2.

In (orthogonal) range searching, we need to preprocess n points in d dimensions to support queries of the form "which points are in this box?" Simple range trees solve this problem in O(log^d n) query time and O(n log^d−1 n) space. A cool crosslinking technique (called layered range trees) reduce the query time to O(log^d−1 n). A general technique for dynamizing augmented search trees, called weight-balanced trees, allows us to additionally support insertion and deletion of points in O(log^d n) time per update. The crosslinking technique is a part of a more general technique called fractional cascading, which can further improve query time by another logarithmic factor.

L04 Feb. 28 [+] Geometric: O(log n) 3D orthogonal range searching via fractional cascading; kinetic data structures Scribe Notes [src] Erik's Notes Video In this lecture, we continue (and finish) the theme of geometric data structures.

First, we'll finish our coverage of fractional cascading from Lecture 3 by illustrating a really cool application: 3D orthogonal range searching in O(log n) time. This gives us the second log factor improvement mentioned in Lecture 3, giving O(log^d−2 n) for d dimensions.

Second, we'll cover a style of data structures for moving data, e.g., where 2D points move at known velocity and acceleration. Although the main study here is in 2D (and 3D), we'll focus on two 1D problems that are well-understood and fit in a lecture: kinetic predecessor and kinetic priority queues.

L05 Mar. 1 [+] Dynamic optimality: binary search trees, analytic bounds, splay trees, geometric view, greedy algorithm Scribe Notes [src] Erik's Notes Video Have you ever wondered how to represent a binary search tree on n nodes as a set of n points in 2D? Today's lecture will reveal a surprising new (2009) such connection, which provides an approach to tackling the venerable dynamic optimality conjecture.

The dynamic optimality conjecture is one of the oldest and biggest open problems in data structures, asking a fundamental question: is there one best binary search tree? Traditional wisdom is that “splay trees” are such a tree (up to constant factors), but we'll see a new “greedy” tree data structure that we think is more obviously best—though we still can't prove it.

Dynamic optimality remains an active and exciting area of research, and this lecture and the next focus on the current state-of-the-art.

L06 Mar. 6 [+] Dynamic optimality: independent rectangle, Wilber, and Signed Greedy lower bounds; key-independent optimality; O(lg lg n)-competitive Tango trees Scribe Notes [src] Erik's Notes Video In this lecture, we'll see the best binary search tree we know, in the sense of achieving the best known competitive ratio of O(lg lg n) compared to the offline optimal. To analyze these “Tango trees”, we compare against a lower bound. Specifically, we describe a Signed Greedy algorithm that, for a given access sequence, computes a number of node touches that every BST in the world must perform when given that access sequence. Indeed, the Signed Greedy algorithm adds points row by row just like the Greedy algorithm we saw last lecture. The only catch is that Signed Greedy doesn't actually satisfy the point set, so it doesn't correspond to a BST… yet it is tantalizingly close to Greedy, which we saw last lecture corresponds to an online BST, suggesting that the two bounds are within a constant factor of each other (making Greedy dynamically optimal). While this gap hasn't been closed yet, we can still use the lower bound to analyze Tango trees, and get the exponential improvement over the trivial O(lg n) competitive ratio held by any balanced binary search tree. L07 Mar. 8 [+] Memory hierarchy: models, cache-oblivious B-trees Scribe Notes [src] Erik's Notes Video The topic of the next three lectures is cache-efficient data structures. A classic result here is that B-trees are good at exploiting that data is transferred in blocks between cache and main memory, and between main memory and disk, and so on. B-trees achieve O(log_B N) insert/delete/predecessor/ successor for N items and memory block transfers of size B. What's more recent and surprising is that you can achieve the same performance even if you don't know what B is, or in other words, simultaneously for all architectures with all values of B. This is a result in the “cache-oblivious” model, and we'll see how to do it over the next two lectures. Cache-oblivious data structures show some exciting practical promise, too, as they form the foundation of a database startup, Tokutek. L08 Mar. 13 [+] Memory hierarchy: ordered-file maintenance, list labeling, order queries, cache-oblivious priority queues Scribe Notes [src] Erik's Notes Video This lecture continues our theme of cache-oblivious data structures.

First we'll finally cover the black box we used last lecture to obtain cache-oblivious B-trees: maintain an ordered file with O(1)-size gaps in O(lg² N) moves per insert/delete in the middle of the file. As an extra bonus, we'll see how to maintain a linked list subject to order queries in O(1) time per insert/delete, which is a black box we used back in Lecture 1 to linearize the version tree when implementing full persistence.

Second we'll cover cache-oblivious priority queues, supporting insert and delete-min in what's usually subconstant time per operation (!), O((1/B) log_M/B (N/B)). This will be the first time we see how to adapt to the cache size M, not just the block size B, in a cache-oblivious data structure.

L09 Mar. 15 [+] Memory hierarchy: distribution sweeping via lazy funnelsort; cache-oblivious orthogonal 2D range searching: batched and online Scribe Notes [src] Erik's Notes Video Our third and final lecture on memory hierarchies is a fun crossover between cache-oblivious data structures and geometric data structures. We'll start with an optimal cache-oblivious sorting algorithm (something we left as a black box in Lecture 8), called Lazy Funnelsort, though we'll skip the analysis, as it's similar to the priority queue. Then we'll see how to combine that algorithm with the sweepline technique from computational geometry to solve batched geometric problems. Specifically, we'll see how to solve batched orthogonal 2D range searching in the sorting bound using this “distribution sweeping” technique. Finally, we'll cover a series of algorithms for regular online orthogonal 2D range searching, including an impressive linear-space cache-oblivious data structure for 2-sided (quarterplane) range queries, as well as the trick for shaving off a log log factor of space from a regular 4-sided range search data structure. L10 Mar. 20 [+] Dictionaries: universal, k-wise independent, simple tabulation hashing; chaining, dynamic perfect hashing, linear probing, cuckoo hashing Scribe Notes [src] Erik's Notes Video Today's lecture gives a fairly thorough overview of different approaches to hashing. First, we talk about different hash functions and their properties, from basic universality to k-wise independence, to a simple but effective hash function called simple tabulation. Then we talk about different approaches to using these hash functions in a data structure: basic chaining, (dynamic) perfect hashing, classic but tricky-to-analyze linear probing, and the more recent cuckoo hashing. L11 Mar. 22 [+] Integer: models, predecessor problem, van Emde Boas, x-fast and y-fast trees, indirection Scribe Notes [src] Erik's Notes Video

This lecture marks our full entry into integer data structures (though hashing was also one), as well as our first of three lectures on the predecessor problem: supporting insert, delete, and predecessor/successor on a set of n w-bit integers. I'll give an overview of what's known about this problem, as well as the appropriate models of computation, and then proceed to the first main result: data structures achieving O(lg w) time per operation. When w is polylogarithmic in n (a common case), this bound is O(lg lg n), an exponential improvement over binary search trees. The first such data structure is called (and by) van Emde Boas, and it uses Θ(2^w) space. A little addition of hashing reduces the space to optimal Θ(n). We'll also see simpler ways to achieve the same bounds, using a simple tree view, indirection, and structures called x-fast and y-fast trees by Willard (who was actually the first to get Θ(n) space).

L12 Apr. 3 [+] Integer: fusion trees: sketching, parallel comparison, most significant set bit Scribe Notes [src] Erik's Notes Video

This lecture describes a tour-de-force in integer data structures, called fusion trees (so named because they were published a year after the 1989 cold-fusion "scandal", and were perhaps just as surprising--though more correct). Fusion trees solve predecessor and successor among n w-bit integers in O(log_w n) time per operation on the word RAM. The basic idea is to build a B-tree with branching factor w^ε. The tricky part is to build a w^ε-size node supporting constant-time predecessor/successor.

We'll see three major techniques for doing so. First, sketching reduces the w^1+ε bits in a node down to w “essential” bits, by reducing each word down to w^1−ε “essential” bits. The impressive part is sketching a word in constant time using a single multiplication. Second, parallel comparison lets us compare all of these sketches with a single query in constant time. Third, we'll see how to find the most significant set bit of a w-bit word in constant time on a word RAM, by using most of the fusion techniques again; this problem has tons of applications beyond fusion trees as well. (As a result, most significant set bit is a built-in instruction on most architectures; see e.g. StackOverflow.)

L13 Apr. 5 [+] Integer: predecessor lower bound via round elimination Scribe Notes [src] Erik's Notes Video This lecture is our first (of two) about integer data structure lower bounds. In particular, we'll prove that the min of van Emde Boas and fusion trees is an optimal (static) predecessor data structure up to a log log factor, assuming polynomial space. The exact trade-off (up to constant factors) is known, but messier; we'll see the bound but not prove it. In particular, it is known that the log log factor improvements are actually impossible to achieve with roughly linear space, making the min of van Emde Boas and fusion trees actually optimal.

These lower bounds hold in the powerful cell-probe model of computation, where we just count the number of words read by the query algorithm. The proof uses a beautiful technique called round elimination, in a communication-complexity view of data structures, and is actually quite clean and simple given an appropriate lemma.

L14 Apr. 10 [+] Integer: sorting in linear time for w = Ω(lg^2+ε n), priority queues Scribe Notes [src] Erik's Notes Video This lecture is about the state-of-the-art in sorting and priority queues on a word RAM. An equivalence by Thorup shows that any sorting algorithm can be transformed into a priority queue with operations taking 1/nth the time to sort. So these are really one and the same problem.

The best results we know for sorting in linear time (and thus for constant-time priority queues) is when w = O(lg n) and when w = Ω(lg^2+ε n). The first result is just radix sort. The second result is the main topic of the lecture: a fancy word-RAM algorithm called signature sorting. It uses a combination of hashing, merge sort, and parallel sorting networks.

The range of w in between lg and lg^2+ε remains unsolved. The best algorithm so far runs in O(n √lg lg n) expected time.

L15 Apr. 12 [+] Static trees: least common ancestor, range minimum queries, level ancestor Scribe Notes [src] Erik's Notes Video

It's amazing what you can do in constant time using only linear space. This lecture is about such data structures for jumping around (static) rooted trees. Given two nodes, we'll see how to find their least common ancestor in constant time. Given a node and a number k, we'll see how to find the kth ancestor of the node in constant time. Both data structures use a nice collection of ideas, including a powerful trick we haven't used before: lookup tables for small inputs.

Along the way, we'll solve another seemingly unrelated problem: preprocess an array of numbers to support queries of the form: what's the smallest number in a given interval of the array? Again, we'll show that constant-time queries are possible using only linear space.

L16 Apr. 19 [+] Strings: suffix tree, suffix array, linear-time construction for large alphabets, suffix tray, document retrieval Scribe Notes [src] Erik's Notes Video This lecture is about efficient data structures for searching in static strings. Think of the strings we're searching in as (large) files, or entire disks. First we'll see how to store a bunch of strings in linear space subject to searching for a string of length P in O(P) time, and answering predecessor queries for a string of length P in O(P + lg Σ) time, where Σ is the size of the alphabet. Then we'll see how to store one string (or more) in linear space subject to determining whether it has a given substring of length P in O(P) time, counting the number of such substring occurrences in O(P) time, and listing k occurrences in O(P + k) time. Finally, we'll see how to build this data structure in linear time. We'll make use of several tools developed in previous lectures: weight-balanced binary search trees (similar to L3), leaf trimming (L15), Cartesian trees (similar to L15), range minimum queries (L15), least common ancestor (L15), and level ancestor (L15). L17 Apr. 24 [+] Succinct: rank, select, tries Scribe Notes [src] Erik's Notes Video This lecture is the first of two about succinct data structures—data structures using very close to the minimum amount of space (“just the data”). In particular, we'll see how to store an n-node binary trie in 2n + o(n) bits (which is optimal up to the little-oh term) and still be able to navigate to a node's left child / right child / parent in constant time per operation, as well as be able to compute the size of the subtree rooted at a node in constant time. Along the way, we'll see how to store an n-bit string in n + o(n) bits (which is again optimal up to little-oh) and be able to compute the number of 1 bits left of a given bit, and find the ith 1 bit, in constant time per operation. These data structures form the basis of many succinct data structures. L18 Apr. 26 [+] Succinct: compact suffix arrays and trees Scribe Notes [src] Erik's Notes Video This lecture continues our theme of succinct data structures. Building on our expertise with succinct tries, we'll continue on to the practical problem of indexing text with suffix arrays/trees, where several compact and some succinct results are known, with reasonably small (though suboptimal) slowdowns in query time. We'll cover a simple version (not the best) of the historically first such result, which is a compact suffix array that incurs a log^ε overhead in queries. Then we'll see how to turn any suffix array into a suffix tree with very little (succinct) additional space overhead. We'll get to use our friends (rank, select, balanced parentheses) from last lecture. L19 May 1 [+] Dynamic graphs: link-cut trees, heavy-light decomposition Scribe Notes [src] Erik's Notes Video This lecture is about a cool data structure for maintaining rooted trees (potentially very unbalanced) in O(log n) time per operation. The operations include linking two trees together by adding an edge, and cutting an edge to split a tree into two trees, so the data structure is called link-cut trees. This result is our first solving a dynamic graph problem (where in general we can insert and delete edges); next lecture will see other solutions for trees and generalization from trees to undirected graphs. Link-cut trees have specific advantages in that they can represent information over root-to-node paths; for example, they can easily compute the min/max/sum of values stored in nodes or edges along any root-to-node paths. Most importantly, link-cut trees introduce two powerful tree decompositions: preferred-path decomposition (which we already used in Tango trees) and heavy-light decomposition. As we will cover them, link-cut trees also demonstrate how a surprisingly “care-free” use of splay trees is actually still efficient. L20 May 3 [+] Dynamic graphs: Euler tour trees, decremental connectivity in trees in O(1), fully dynamic connectivity in O(lg² n), survey Scribe Notes [src] Erik's Notes Video

This lecture continues our theme of dynamic graphs. Beyond surveying results for several such problems, we'll focus on dynamic connectivity, where you can insert and/or delete edges, and the query is to determine whether two vertices are in the same connected component (i.e., have a path between them). We'll cover a few different results for this problem:

Link-cut trees (from last lecture) already solve trees in O(lg n).
Euler-tour trees are a simpler way to solve trees in O(lg n), which allow us to aggregate over subtrees instead of paths.
If we just insert or just delete edges, we can solve trees in O(1), using our friend leaf trimming plus some simple bit-vector tricks.
For general graphs, we'll see how to support updates in O(lg² n) and queries in O(lg n / lg lg n).

This will be our culmination of data structures for dynamic graphs; the next (final) lecture is about matching lower bounds.

L21 May 8 [+] Dynamic graphs: Ω(lg n) lower bound for dynamic connectivity Scribe Notes [src] Erik's Notes Video

This lecture covers our second of two lower bounds. This one is work by Mihai Pătraşcu (former 6.851 TA) and myself. We'll show that maintaining a graph subject to edge insertion, edge deletion, and connectivity queries (are v & w connected by a path?) requires Ω(lg n) time per operation, even if the graph is just a bunch of paths. This in particular proves optimality of the logarithmic dynamic tree data structures, and shows that the O(lg² n) data structure we saw for general graphs is pretty good. The lower bound introduces a new but very simple technique, which at the time was the first “truly logarithmic” lower bound for a natural problem, yet the whole proof is relatively clean.

L22 May 19 [+] History of memory models: idealized 2-level, red-blue pebble game, external memory, HMM, BT, (U)MH, cache oblivious [bonus lecture] Erik's Notes Video

Ever wonder where the external-memory and cache-oblivious models of Lectures 7–9 come from? It turns out that they both have a natural history. The external-memory model combines two older models: Floyd's idealized 2-level model of 1972, which models blocks but not cache; and Hong and Kung's red-blue pebble game of 1981, which models cache but not blocks. The cache-oblivious model follows a series of attempts to model multilevel memory hierarchies, most notably the HMM model of 1987, where it is possible to prove that a single algorithm runs optimally on all possible memory hierarchies.

This lecture is not an official part of the class, but rather was part of a STOC 2012 tutorial on Algorithms for Memory-Sensitive Computing, organized by Michael Bender and Martin Farach-Colton. Given the relevance of the material, we are including it here. Note that the format differs from the usual blackboard lecture: it uses slides.

All scribe notes in one PDF