What's good about offset pagination; designing parallel cursor-based web APIs
To point out the obvious: generally API providers don't particularly want you to parallelize your requests (they even implement rate limiting to make it harder on purpose). If they wanted to make it easy to get all the results, they would allow you to access the data without pagination - just download all the data in one go.
A certain level of parallelism is generally within the realm of good API citizenship. Even naive rate limiting schemes tend to permit a certain number of concurrent requests (as they well should, since even browsers may perform concurrent requests without any developer intervention).
Rate limiting and pagination aren’t (necessarily) about making full data consumption more difficult. They’re more often about optimizing common use cases and general quality of service.
Edit to add: in certain circles (eg those of us who take REST and HATEOAS as baseline HTTP API principles), parallelism is often not just expected but encouraged. A service can provide efficient, limited subsets of a full representation and allow clients to retrieve as little or as much of the full representation as they see fit.
One thing that frequently bugs me is APIs limiting the number of items per page for reasons of efficiency. I can perfectly understand low limits for other reasons, like not helping people scrape your data.
But limiting for efficiency is usually done in a way that I would call a cargo cult: first, the number of items per "page" is usually a number one would pick for a displayed page, in the range of 10 to 20. This is inefficient for the general case: the useful data transmitted is often about the same size as the request and response headers themselves. So if the API isn't strictly for display purposes, pick a number of items per page that strikes a useful balance between not transmitting too much useless data and keeping query and response overhead low. Paginate in chunks of 100 kB or more.
In terms of computation and backend load, pagination can be as expensive for a one-page query as for a full query. Usually this occurs when the query doesn't directly hit an index or similar data structure, so a full sweep over all the data cannot be avoided. So think and benchmark before you paginate, and maybe add an index here and there.
Strongly agree. I have an API I work on where if you ask for a couple of gigabytes of data, it'll send it to you, because if you're asking for it, that's what you want. It gets streamed out, and the docs warn you that you will get exactly what you asked for, so you may want to chunk up on your side (for this particular API there is a trivial way for clients to do that), or if you can handle a full stream, go nuts.
Pagination would just complicate things. I think with most APIs, intended as APIs (i.e., not just an endpoint primarily meant to feed a front-end page), you're better off thinking of your default as "I'm going to just stream everything they ask for", and look for reasons why that won't work, rather than start from the presumption that everything must be paginated from the beginning.
Don't get me wrong; there are plenty of solid reasons to paginate. You may discover one applies to your API. But if you can leave it out, it's often simpler for both the producer and the consumer. Wait until you find the need for it. Plus, if that happens, you'll have a better understanding of the actual problem you need to solve and better solutions may reveal themselves.
In my experience, pagination for a one-page query is rarely the same cost in real-world scenarios.
In very simple cases, like a single table sql query, absolutely - databases effectively have to compute the full result, sort it, and return a window. There's almost no reason to paginate here, at an API level, unless the consumer wants only a subset (say, bandwidth limitations). Sending it all at once can be a huge benefit for those that will use it all, it's both simpler and faster for all parties.
But in most real-world cases, there are at least two additional details that can add significant response time: joins (when not involved in sorting) and additional data-gathering needed to fully build the response (e.g. getting data from other systems, internal or external). Joined data is not typically loaded prior to computing limit/offset since it may be a massive waste, and external data is effectively the same issue but with far higher latency.
And that's before getting into other practical issues, e.g. systems that can't process the response stream as it comes in - a subset will load-and-return faster than the whole content in all cases, so e.g. a website loading some json can show initial UI faster while loading more in the background. Streaming is often possible and that'll negate a lot of the downsides, but it's far less common than processing a request only after it completes.
Strongly disagree. I've seen too many cases of API users that are overfetching for no reason. I don't mind providing a bulk API, but that is a very different use case that regular endpoints shouldn't have to support.
My intent with pagination is always to prevent problems with the open-ended size of a data set. 1,000 items at once is usually not a problem, but 10x, 100x, etc. is a big problem for transferring over the wire.
> If they wanted to make it easy to get all the results
Speaking from experience...we want to make it easy but also want to keep it performant. Getting the data all in one go is generally not performant and is easy to abuse as an API consumer. For example, always asking for all of the data rather than maintaining a cursor and secondary index (which is so much more performant for everyone involved).
We provide (internal) access to data where we provide interactive access via GraphQL-based APIs and bulk access via CSV or RDF dumps - I feel like dump files are grossly undervalued these days.
I agree. I am going to reflect on this and see if there's a way to support dump files long term in our app. We sorta support it today, but it's an ad hoc implementation since an export can range from a few hundred of a thing to tens of millions of a thing.
Is there any good literature or patterns on supporting dumps in the tens of millions or larger?
I wrote a sheets plug-in that uses our cursor API to provide a full dump into a spreadsheet. Our professional services team is in love with it, so I bet they'd love generic data export capability.
"Is there any good literature or patterns on supporting dumps in the tens of millions or larger?"
The two main things you need are: 1. HTTP is a streaming protocol. You don't need to fully manifest a response in memory before you send it. If your framework forces that, bypass it for this particular call. (If you can't bypass it... your framework choice is now a problem for you.)
2. You presumably have some sort of JSON encoder in your language. As long as it doesn't have some sort of hard-coded "close the stream once we send this JSON" behavior (and if so, file a bug because a JSON encoder has no business doing that), all you have to do is ensure that the right bytes go out on the wire. You, again, don't have to fully manifest the reply in memory before you encode it. Something like:
    stream.Write("[")
    needComma = False
    for item in toBeSerialized:
        if needComma:
            stream.Write(",")
        json.Write(stream, item)
        needComma = True
    stream.Write("]")

A lot of times when you're emitting gigabytes of JSON it's still in lots of little chunks where each individual chunk isn't a memory problem on its own, so doing something like this can be very memory-efficient, especially if "toBeSerialized" is itself something like a cursor coming out of a DB where it itself is not manifesting in memory. (Newlines are also a good idea if your JSON encoder isn't naturally doing it already; helps debugging a lot for very little cost.)

JSON objects can be more annoying; you may need to manually serialize one key/value pair at a time, which is fiddlier. Protip: whenever possible, use the JSON encoder in your language; there is no shame or anything in using the JSON encoder to emit strings corresponding to the keys of your object. Much like HTML, you need to be very careful writing things directly to the stream; it really always should go through the encoder. I even send constant strings through the encoder just to make the code look right.
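For what it's worth, a rough Go version of that loop might look like this (a sketch, assuming the rows come from a database/sql cursor; the Item type and its columns are placeholders):

    type Item struct {
        ID   int64  `json:"id"`
        Name string `json:"name"`
    }

    // Stream a JSON array element by element; only one row is held in memory at a time.
    func streamItems(w http.ResponseWriter, rows *sql.Rows) error {
        w.Header().Set("Content-Type", "application/json")
        enc := json.NewEncoder(w) // Encode appends a newline after each value, which helps debugging
        if _, err := io.WriteString(w, "[\n"); err != nil {
            return err
        }
        first := true
        for rows.Next() {
            var item Item
            if err := rows.Scan(&item.ID, &item.Name); err != nil {
                return err
            }
            if !first {
                if _, err := io.WriteString(w, ","); err != nil {
                    return err
                }
            }
            if err := enc.Encode(item); err != nil { // one element straight out to the wire
                return err
            }
            first = false
        }
        if _, err := io.WriteString(w, "]"); err != nil {
            return err
        }
        return rows.Err()
    }

(Needs database/sql, encoding/json, io, and net/http.) The encoder is fed one row at a time while the cursor drives the loop, so nothing is buffered beyond the current element.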
The last little tidbit is that the HTTP software stack will tend to fight you on the matter of keeping long-lived connections open. There can be a lot of places that have timeouts you may want to extend. If this gets too big you may need to do something other than HTTP. You may also need to consider detecting failures (hashing or something) and the ability to restart. (Although don't underestimate modern bandwidth and the speed you can iterate through SELECT-type queries; definitely check into the virtues of just retrying. 10GB/year of extra bandwidth and processing power is still cheaper than a developer even designing a solution to that problem, let alone implementing and testing it.)
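If you happen to be in Go, one concrete knob for the timeout part is a per-request write deadline via http.NewResponseController (in the standard library since Go 1.20); the one-hour figure here is just an example:

    // Extend the write deadline for this one long-running dump without
    // loosening the server-wide WriteTimeout for every other handler.
    func dumpHandler(w http.ResponseWriter, r *http.Request) {
        rc := http.NewResponseController(w)
        if err := rc.SetWriteDeadline(time.Now().Add(1 * time.Hour)); err != nil {
            log.Printf("set write deadline: %v", err) // connection type doesn't support it; carry on
        }
        // ... stream the response as above ...
    }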
Oh, and if you can use HTTP, be sure you're gzip'ing. It's dirt cheap nowadays on the CPU; only in extreme situations of bandwidth abundance and CPU shortage can it be worth skipping. My rule-of-thumb on JSON shrinking is about 15:1. CSVs don't quite shrink that much but they still shrink down pretty well.
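In Go that's a few lines when the client advertises support (a sketch; real content negotiation with q-values is more involved):

    // Wrap the response in gzip when the client accepts it.
    func maybeGzip(w http.ResponseWriter, r *http.Request) io.WriteCloser {
        if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
            return nopCloser{w}
        }
        w.Header().Set("Content-Encoding", "gzip")
        return gzip.NewWriter(w) // Close() it when the dump finishes so the trailer gets flushed
    }

    type nopCloser struct{ io.Writer }

    func (nopCloser) Close() error { return nil }

Hand the returned writer to the JSON encoder instead of writing to w directly.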
IMO it's better to use JSONL[1].
Also back in the day IBM had XML for Logging format, where every separate line was an XML fragment [2].
Most markup languages suffer from having a root element, which prevents efficient logging or streaming.
JSONL is often better if you can spec that from scratch, but, sadly, still has somewhat poor support in a lot of environments that support JSON. I've had the most problems interacting with people using environments that are "helpful" and aren't really programming languages, and I usually end up having to offer them a plain array anyhow.
This can also be helpful when you're outputting, say, a 10MB JSON object, which isn't necessarily that large anymore, on a server where you'd like to not have to allocate 10MB of RAM to it. You can stream out a plain ol' JSON object without the resource usage.
Sadly, it is indeed easier to convince environments to stream JSON out than it is to stream it in.
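For what it's worth, the JSONL flavor of the streaming loop is even simpler than the array version, since there's no bracket/comma bookkeeping (a sketch; Record and the channel feeding it are placeholders):

    // Record stands in for whatever row type you're dumping.
    type Record struct {
        ID   int64  `json:"id"`
        Name string `json:"name"`
    }

    // One JSON document per line; json.Encoder already appends the newline.
    func streamJSONL(w io.Writer, records <-chan Record) error {
        enc := json.NewEncoder(w)
        for rec := range records {
            if err := enc.Encode(rec); err != nil { // writes {...}\n
                return err
            }
        }
        return nil
    }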
Thanks for this. It seems obvious reading it but I haven't thought about it this way before. I'm definitely going to explore these concepts!
That’s the point. Running multiple paginated queries in parallel is essentially circumventing the API provider’s intent to limit the number of items requested at one time.
If your API doesn’t support a single client making ~6 concurrent requests, let me tell you about browsers since the last millennium.
It’s not about what the API provider can physically support, it’s just about being a polite client. They probably can support 6 concurrent requests without it causing a problem. The point is that if they use offset or page number pagination with a maximum limit of 100, that’s probably a clue that you shouldn’t preemptively grab the first 100 pages in parallel.
Right, that’s my point. Don’t be overzealous obviously. But some concurrency should be expected (and in my opinion designed for and encouraged). There’s a huge gulf between grabbing 100 pages and grabbing the number of resources a browser grabs by default.
It’s not hard to imagine scenarios where an API provider might reasonably prefer n concurrent requests of m batch size over a single request of n * m batch size.
I can more than imagine it, I design for it. Concurrent all the things. The vast majority of GET workloads are either embarrassingly parallel or so badly designed that they fall over with even the gentlest touch (which I’m sad to say I’ve encountered; services falling over from sequential requests!).
Edit to add: this isn’t some wild idea, recent network and web standards have made increasing concurrency both automatic (HTTP/2 and beyond) and tunable (various prefetch APIs).
A few thoughts:
1) AWS DynamoDB has parallel scan functionality for this exact use case. https://docs.aws.amazon.com/amazondynamodb/latest/developerg...
2) A typical database already internally maintains an approximately balanced b-tree for every index. Therefore, it should in principle be cheap for the database to return a list of keys that approximately divide the keyrange into N similarly large ranges, even if the key distribution is very uneven. Is somebody aware of a way this information could be obtained in a query in e.g. Postgres?
3) The term 'cursor pagination' is sometimes used for different things, either referring to an in-database concept of cursor, or sometimes as an opaque pagination token. Therefore, for the concept described in the article, I have come to prefer the term keyset pagination, as described in https://www.citusdata.com/blog/2016/03/30/five-ways-to-pagin.... The term keyset pagination makes it clear that we are paginating using conditions on a set of columns that form a unique key for the table.
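To make the keyset idea concrete, here's a minimal sketch against Postgres (the items table and its unique index on (created_at, id) are assumptions for the example; the row-value comparison is Postgres syntax):

    // Fetch the next page strictly after the last (created_at, id) pair the client saw.
    const nextPageSQL = `
        SELECT id, created_at, payload
          FROM items
         WHERE (created_at, id) > ($1, $2)
         ORDER BY created_at, id
         LIMIT 100`

    func nextPage(db *sql.DB, lastCreatedAt time.Time, lastID int64) (*sql.Rows, error) {
        return db.Query(nextPageSQL, lastCreatedAt, lastID)
    }

Because the predicate and the ORDER BY both follow the unique key, the database can seek into the index at the cursor position instead of counting past an offset.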
>a way where this information could be obtained in a query
There's no standard way because index implementation details are hidden for a reason.
>in e.g. postgres
You can query pg_stats view (histogram_bounds column in particular) after statistics are collected.
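For example (a sketch; the table and column names are placeholders, and histogram_bounds is an anyarray, so it goes through a text cast before parsing):

    // Approximate keyrange split points from the planner's statistics (run ANALYZE first).
    const splitSQL = `
        SELECT histogram_bounds::text
          FROM pg_stats
         WHERE schemaname = 'public'
           AND tablename  = 'events'
           AND attname    = 'id'`

    func splitPoints(db *sql.DB) (string, error) {
        var bounds string // e.g. "{1,1042,2097,...}"; parse into boundaries for parallel range scans
        err := db.QueryRow(splitSQL).Scan(&bounds)
        return bounds, err
    }

The bounds come from sampling, so they're only approximate, but that's usually fine for carving the keyspace into similarly sized chunks to scan in parallel.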
I believe data export and/or backup should be a separate API, which is low priority and ensures consistency.
Here we just see regular APIs being abused for data export. I'm rather surprised the author did not face rate limiting.
Coming from a REST perspective, I wouldn’t implement a separate API, I would use HTTP semantics (eg headers or, if truly necessary, query params) on the resource listing to indicate the export/sync intention. Likely with an Accept header. If pagination is still preferred/required, the service could return an ETag or some other continuation token which, when provided in subsequent requests, could be used to indicate the consistent snapshot being requested. Since this is entirely optional, clients could use this mechanism to opt into stable/parallelizable requests (as I described with less specificity in another sub thread).
At this point, if these requests are expensive, you have an opportunity to use a very simple (and optimistic) cache for good-faith API users, relegate rate limiting to preventing abuse of cache creation (which should be even easier to detect than just overzealous parallelism), and even use the same or similar semantics to implement deltas for subsequent export/sync.
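A minimal sketch of the shape I mean, in Go (the media type, the choice of headers, and newSnapshotToken are all made up for illustration, not any standard):

    // Clients opting into an export representation get a snapshot token back as an
    // ETag and can replay it so subsequent pages stay consistent.
    func listHandler(w http.ResponseWriter, r *http.Request) {
        wantExport := strings.Contains(r.Header.Get("Accept"), "application/vnd.example.export+json")
        snapshot := r.Header.Get("If-Match") // reuse an earlier snapshot if the client sent one
        if snapshot == "" {
            snapshot = newSnapshotToken() // hypothetical: pins queries to a point in time
        }
        w.Header().Set("ETag", snapshot)
        if wantExport {
            // stream the bulk representation for this snapshot
            return
        }
        // otherwise serve the normal paginated listing against the same snapshot
    }

Rate limiting then only has to police snapshot creation, not every page fetch against an existing snapshot.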
I can hardly imagine a consistent, integral paginated view of the data without creating a snapshot. It would be a manual MVCC implementation or something. A separate API seems a much simpler solution to me.
I think keeping temporal history and restricting paginated results to the data at the point in time where the first page was retrieved would be a pretty decent way to solve offset-based interfaces (regardless of the complexity of making the query implementation efficient). Data with a lot of churn could churn on, but clients would see a consistent view until they return to the point of entry.
Obviously this has some potential caveats if that churn is also likely to quickly invalidate data, or revoke sensitive information. Time limits for historical data retrieval can be imposed to help mitigate this. And individual records can be revised (eg with bitemporal modeling) without altering the set of referenced records.
I think for most use cases, as a user I'd rather see the newest items in a list than have consistent pagination. If I forget to manually refresh, I might miss out on important new items.
Why do you think it is important for users to have temporal consistency?
Well I’ll use a recent example I encountered that was actually very frustrating. I was looking for a font to use for a logo for a personal project. The site I was using (won’t name and shame, and I can’t recall the site now anyway) had no sorting options, items were ordered by whatever “popularity” formula they use. As I paginated, many of the fonts I’d previously viewed would appear on subsequent pages, often in a different order. It was frustrating not just because I could tell that I was probably missing fonts that were being bumped up to previous pages, but also because it made me doubt my mental model of my own browsing history: “Did I navigate back too far? Did I forget a tangential click and end up on a different search path?”
It’s not a great UX. And in some ways I suspect that my own views were at least partially causing it, which made me more hesitant to even click on anything unless I was sure it was worth the disruption.
This doesn't sound like a "temporal consistency" problem, but rather an inconsistent and opaque ordering issue.
Huh? It’s absolutely a temporal consistency problem. The ordering was absolutely clear and consistent (most->least popular at time of request). But the popularity scores were changing so rapidly that “at time of request” makes the ordering unstable. If the ordering was determined by popularity at the time I accessed page 1, the ordering would have been stable.
Sure, that popularity score would be stale. But who cares?
Think of it this way. Suppose you’re viewing your Twitter timeline in recent order, and suppose the pagination (lazy loaded on scroll) worked this way, and suppose you have new Tweets arriving at the same rate you scroll. What you would see is the same page of Tweets repeat forever (well, until you hit the pagination cap).
This is why people come up with solutions like cursors. But what I was suggesting is that you can instead continue to use offsets (for the benefits discussed in the article like parallelism and predictability) if you paginate on the state of the data at the time you began (edit: or on the state of your sorting criteria at the time you began, which allows for the mitigations I described upthread).
That’s not to suggest that once you begin a pagination, you’ll forever access stale data. It’s to suggest that a set of pagination requests can be treated as a session accessing a stable snapshot.
This can also be totally transparent to the client, and entirely optional (eg pagination is performed with an offset and an optional validAt token).
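Sketched out in Go (everything here — the validAt parameter, the popularity_snapshots table, the schema, the package-level db — is invented just to show the shape of it):

    // Offset pagination pinned to the popularity snapshot in effect when the session began.
    func listFonts(w http.ResponseWriter, r *http.Request) {
        validAt, err := time.Parse(time.RFC3339, r.URL.Query().Get("validAt"))
        if err != nil {
            validAt = time.Now().UTC() // first page: pin the session to "now"
        }
        offset, _ := strconv.Atoi(r.URL.Query().Get("offset"))
        rows, err := db.Query(`
            SELECT f.id, f.name
              FROM fonts f
              JOIN popularity_snapshots p ON p.font_id = f.id
             WHERE p.taken_at = (
                   SELECT max(taken_at) FROM popularity_snapshots WHERE taken_at <= $1)
             ORDER BY p.score DESC, f.id
             LIMIT 50 OFFSET $2`, validAt, offset)
        // ... render rows and echo validAt back so the client repeats it on the next page ...
        _ = rows
        _ = err
    }

Every page in the session ranks against the same snapshot, so items stop leapfrogging between pages even while the live scores keep churning.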
Ok, thanks for the explanation. I hadn't expected the popularity score to be so unstable. Means that a lot of users are concurrently scoring the fonts, and that the averages are continuously being recalculated. Unexpected.
But you're right, if the dataset is continuously changing at high frequency, pagination makes no sense.
I think we’re almost on the same page (heh). Pagination may still make sense for a variety of reasons. Even if there’s no meaningful server optimization gain, clients may benefit from a reduced response size (mobile, expensive data plans, low power devices). That sort of thing is where ensuring consistency (for the sake of brevity I’ll repeat, at a point in time, but there are other ways to allow clients to negotiate this) at the request level over multiple requests is useful, even if the underlying data is changing much faster than the client is consuming it.
It’s worth noting here that this isn’t just applicable to paginated lists. It can also be used where you want to let the client optionally, concurrently access related resources. It can be used for client-side concurrent transactional writes. It’s a barrel of monkeys.
For what it’s worth, I wouldn’t assume their volume of traffic was necessarily the reason the data was in such flux. It could be (and I strongly suspect) that their popularity algorithm just stinks (eg weighting 100% of 1 view over 90% of 100 views). Even so, a snapshot in time is probably a much easier lift/gain for a flailing algorithm than really digging into the data and analytics to get an optimal stable sort without it:
1. Take a snapshot of the query and throw it in a dumb cache.
2. Rate limit cache key creation rather than cache hit requests.
3. Throw the caches out (and repopulate if reasonable) every [vaguely not gonna break the bank age duration].
4. Forget about any further optimization unless you reallllly need it.
5. Document the problems with the suboptimal solution, have a retro so your junior devs can develop better instincts, get them onto the next user story.
6. Put a “should we improve this” on the backlog of backlogs.
7. Profit.
Pagination of an immutable collection is one thing and can be parallelized. Pagination of a mutable collection (e.g. a database table), on the other hand, is risky since two requests might return intersecting data if new data was added between the requests being executed.
True result sets require relative page tokens and a synchronization mechanism if the software demands it.
Intersecting data is fine provided there's a unique ID for each result that can be used to de-duplicate them.
Ideally I'd want a system that guarantees at-least-once delivery of every item. I can handle duplicates just fine, what I want to avoid is an item being missed out entirely due to the way I break up the data.
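The de-duplication itself is a few lines if every item carries a stable ID (a sketch; the Item type and page slice are assumed):

    // Keep the first copy of each ID and drop replays from overlapping pages.
    seen := make(map[int64]bool)
    var merged []Item
    for _, item := range page {
        if seen[item.ID] {
            continue
        }
        seen[item.ID] = true
        merged = append(merged, item)
    }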
It's more than just de-duplicating, tho. Imagine you query a dataset and get something like a page count and a chunk size. That page count cannot be trusted if the dataset is mutable. If an item is inserted at the beginning of the set, you're going to miss the last item.
Pagination is hard
For dynamic usecase, DynamoDB has implemented pagination with something called lastEvaluatedKey - https://docs.aws.amazon.com/amazondynamodb/latest/developerg...
This is different from LIMIT in an RDBMS
Wouldn’t this pattern solve the complexity you are talking about?
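For reference, the shape of that loop with the AWS SDK for Go v2 is roughly this (the table name is a placeholder; filtering and error handling trimmed):

    // Follow LastEvaluatedKey until DynamoDB stops returning one.
    func scanAll(ctx context.Context, client *dynamodb.Client) ([]map[string]types.AttributeValue, error) {
        var items []map[string]types.AttributeValue
        var startKey map[string]types.AttributeValue
        for {
            out, err := client.Scan(ctx, &dynamodb.ScanInput{
                TableName:         aws.String("my-table"), // placeholder
                ExclusiveStartKey: startKey,
            })
            if err != nil {
                return nil, err
            }
            items = append(items, out.Items...)
            if len(out.LastEvaluatedKey) == 0 {
                return items, nil // no more pages
            }
            startKey = out.LastEvaluatedKey
        }
    }

The same mechanism plus Segment/TotalSegments is what the parallel scan feature mentioned elsewhere in the thread builds on.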
That's one way, for sure. You can do this with IDs, dates, etc.
It's important here that "created" is an immutable attribute. Otherwise you could get issues where the same item appears on multiple lists (or doesn't appear at all) because its attributes changed during the scanning process.
I think you could accomplish something similar with token pagination by requesting a number of items that will result in multiple "pages" for your user interface. Then as the user iterates through you can request additional items. This isn't parallelizing, but provides the same low-latency user experience.
From the code sample in the article I didn’t know you could append to a slice from within a go func
As long as you use the mutex locks
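Something along these lines (a sketch of the general pattern, not the article's exact code; fetchPage and Item are placeholders):

    // Fan out page fetches and collect results into one slice under a mutex.
    var (
        mu      sync.Mutex
        results []Item
        wg      sync.WaitGroup
    )
    for page := 0; page < numPages; page++ {
        wg.Add(1)
        go func(page int) {
            defer wg.Done()
            items := fetchPage(page) // hypothetical: fetch one page's worth of items
            mu.Lock()
            results = append(results, items...)
            mu.Unlock()
        }(page)
    }
    wg.Wait()

Without the lock, concurrent appends to the shared slice are a data race; an alternative is giving each goroutine its own pre-allocated slot in the slice and skipping the mutex entirely.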
Of course. I see that now it’s so obvious not sure why I didn’t see that earlier.
> it uses offsets for pagination... understood to be bad practice by today’s standards. Although convenient to use, offsets are difficult to keep performant in the backend
This is funny. Using offsets is known to be bad practice because.... it’s hard to do.
Look I’m just a UI guy so what do I know. But this argument gets old because I’m sorry, but people want a paginated list and to know how many pages are in the list. Clicking “next page” 10 times instead of clicking to page 10 is bullshit, and users know it.
No, what is bullshit is having the option to go to page 10 in the first place. If the user does that then the UI is already broken. What is needed is good filter abilities.
The hardest part is to be able to parallelize the sorts in the backend and keep the working sets reasonably sized.
If you ask for "first 50 items after key X", you just need to keep a priority queue of size 50 on each BE node and merge them before returning them (I'm assuming a distributed backend). It doesn't matter on which page you are.
But if you specify "first 50 items after element N" it gets really tricky, each BE shard needs to sort the first N elements, and it can use some trickery to avoid doing a naive merge (see: https://engineering.medallia.com/blog/posts/sorting-and-pagi... ).
You can at most save some transfer over the network.
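A sketch of the easy case ("first k items after key X" across shards) with plain slice merging instead of a real priority queue; the Shard interface, Item type, and ordering by ID are assumptions:

    // Each shard returns up to k items with ID > after, already sorted by ID.
    type Item struct{ ID int64 }

    type Shard interface {
        Fetch(afterID int64, limit int) []Item
    }

    // Merge the per-shard results and keep the global first k.
    func firstKAfter(shards []Shard, after int64, k int) []Item {
        var all []Item
        for _, s := range shards {
            all = append(all, s.Fetch(after, k)...) // each shard's working set stays at k
        }
        sort.Slice(all, func(i, j int) bool { return all[i].ID < all[j].ID })
        if len(all) > k {
            all = all[:k]
        }
        return all
    }

The "first k after element N" case is the hard one: each shard has to account for up to N elements before it knows which of its rows fall in the window, which is what the linked Medallia post digs into.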
How do people even figure out they need to be on page 10? Sounds like a filter of some kind (e.g. before a date) would be better, except they've trained themselves to use a number of pages as an approximation.
Either way, 10 pages isn't so bad but tens of thousands can become troublesome as explained on https://shopify.engineering/pagination-relative-cursors
Maybe they've been there before and happen to remember "page 10".
Maybe someone else has been there before and told them "Go to page ten".
Maybe you know that there are 20 pages, and are looking to find the median, or are just informally playing around, looking at the distribution.
Same as you'd do with an interesting page of a book. I don't think I'd stop using page numbers in dead tree books if they somehow came with filters.
I can't find one right now, but I feel like there must be an algorithm that can identify the key to use for each page, yet cheaper than a full sort (ie cheaper than offset pagination in general). Of course, a skip list would work for a secondary index.
This is the best one I've been able to come up with: https://engineering.medallia.com/blog/posts/sorting-and-pagi...