We switched to cursor-based pagination

moderntreasury.com

120 points by qin 3 years ago · 117 comments

simonw 3 years ago

This post is not about database cursors. It's about the style of pagination where you have a ?_next=xxx link to get to the next page, where the xxx bit encodes details about the last item on the current page such that the next page can show everything that comes after that record.

This is also sometimes known as keyset pagination. My favourite technical explanation of that is here: https://use-the-index-luke.com/no-offset
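
A minimal sketch of what that looks like as SQL, with illustrative table/column names (the cursor encodes the sort key of the last row on the current page):

    -- keyset/seek pagination: return rows strictly after the last one seen.
    -- (created_at, id) must be a unique, indexed sort key; the row-value
    -- comparison works in Postgres and MySQL 8+, and can be expanded into
    -- OR conditions elsewhere.
    SELECT id, created_at, title
      FROM items
     WHERE (created_at, id) > (:last_created_at, :last_id)
     ORDER BY created_at, id
     LIMIT 20;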

  • paulmd 3 years ago

    Sounds great on paper but my experience is that key-based indexing is broken on SOLR and likely other lucene engine DBs. If you are evaluating conditions on child objects inside a parent document, the child objects get split and stored as a separate document that is joined… and if that child document is in a different block/file on disk, then it won’t necessarily be inside the range being scanned, and so you will be missing some results that meet your logical criteria.

    Possibly just an implementation error in the BlockJoinParser but it did not occur with numeric pagination.

    In order to work with that, you need to “flatten” your json, like with the @JsonUnwrapped annotation, and some structures (like arrays) may become problematic and/or require significant lexical mapping of queries to the dataset.

    • fizx 3 years ago

      parent-child in lucene is a hack. It only works when the children immediately follow the parent, and that never happens when it's in a different file because you updated the child. It's a minor miracle you didn't notice the error with numeric pagination.

  • clementmas 3 years ago

    Most users don't care but I very much prefer to have URLs with `/?page=123` instead of `?_next=xxx`. I should get over it but meanwhile I'll stick to offset pagination with deferred join (see aarondf's comment).

  • jlg23 3 years ago

    This post is about "database cursors" and "keyset pagination". In practice, these terms refer to the same thing, one seen bottom-up, the other seen top-down. Implementation-wise, one saves the state of the cursor in the pagination [parameters] and resumes reading from the DB with an equivalent cursor.

    • nextaccountic 3 years ago

      > these terms refer to the same thing

      No, cursor is this https://en.wikipedia.org/wiki/Cursor_(databases) https://www.postgresql.org/docs/current/plpgsql-cursors.html

      I once did pagination using database cursors, which is something different from keyset pagination: the server would keep a cursor open and keep fetching more data from the same query. This enabled the system to have an interface similar to offset pagination (you get the first page, then the second page, etc.) but without running a new query for each page and discarding the first n-1 pages each time.

      The downside is that it makes the server stateful, and doesn't scale (you would need to keep hundreds of cursors open if you had hundreds of simultaneous users)

      • orf 3 years ago

        Words can have two meanings. Cursor pagination and key set pagination do indeed refer to the same thing.

        Database cursors are a different thing.

        • quietbritishjim 3 years ago

          You are agreeing with the parent comment with a tone of disagreement.

          They never said what "cursor pagination" is or used the term at all. In this chain of comments, yours is the first one to use the term.

          The comment you're replying to, like the original comment in this chain, merely said that this post is not about using "database cursors" for pagination. Which is correct, and you agreed that this is a different thing.

        • mbreese 3 years ago

          Yes words can have two meanings, but given the similarity of context between these two meanings, calling this cursor pagination is not a great idea. It screams of someone that didn’t know about database cursors (which is only one implementation method) trying to describe a method for web site pagination. I’m not blaming the author here for this, as they likely know the difference. But for a new developer trying to Google this, it will be very confusing.

aarondf 3 years ago

There are ways to mitigate (although not eliminate) the slowdown of offset/limit pagination on later pages. The technique is called a "deferred join" and it is most effective in MySQL. The basic idea is to paginate as little data as necessary, and then do a self-join to get the rest of the data for a single page.

You can read more about it here: https://aaronfrancis.com/2022/efficient-pagination-using-def... or here https://planetscale.com/blog/fastpage-faster-offset-paginati....

There are libraries for Laravel (https://github.com/hammerstonedev/fast-paginate) and Rails (https://github.com/planetscale/fast_page) as well!

Cursor based pagination is wonderful, but sometimes you're stuck with offset/limit for whatever reason. Might as well make it fast.
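
For illustration, the general shape of a deferred-join query looks something like this (table/column names are made up; this is the pattern, not code from the linked libraries):

    -- the inner query pages over a narrow (ideally covering) index to find
    -- only the ids for the requested page; the outer join then fetches the
    -- full rows for just those ids
    SELECT items.*
      FROM items
     INNER JOIN (
            SELECT id
              FROM items
             WHERE category = 'books'
             ORDER BY created_at DESC
             LIMIT 20 OFFSET 10000
           ) AS page USING (id)
     ORDER BY items.created_at DESC;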

  • barrkel 3 years ago

    To be clear, this technique (which it seems I independently discovered in 2015) mostly only works in MySQL because other databases usually have planners which are smart enough to not pull everything in eagerly.

    MySQL is fairly predictable, though, so when you understand that it wants to nested-loop join all your rows before evaluating predicates on the parent table, it's a predictable win to stop it doing that.

    The technique is still applicable even when you have no joins, because MySQL will materialize rows with every selected column before evaluating the unindexed portion of the predicate, and the order by.

    • aarondf 3 years ago

      Hey this is a great explanation. Most people just say "this makes no sense, can never work"

hamandcheese 3 years ago

Cursor based pagination doesn’t actually solve the described pitfalls if your results are ordered by anything mutable, though.

A result you haven’t yet seen can be mutated to sort before your current cursor, likewise a result that you’ve already seen can be mutated to sort after the current cursor, causing you to see it twice.

Cursor based pagination does minimize the issue somewhat, because only the mutated rows are possibly affected, not unrelated records that lie on page boundaries. But depending on the use case I'm not sure that is worth the added complexity of cursor based pagination (it does get a bit tricky once you have any non-trivial sort clause).

  • simonw 3 years ago

    The only approach I can think of which might be able to handle mutability of the rows used for sorting would be to support paginating through a snapshot: provide a mechanism whereby a full snapshot of the query at a specific point in time is captured such that the user can then paginate through that snapshot.

    Expensive to implement, so this would only work for situations where you can afford to spend significant storage and computation resources to keep a specific user happy.

    • kibwen 3 years ago

      That might be even worse (or just as bad) for users, as now they won't see any updates to the underlying data set even if they want to, and will presumably need to perform some explicit action to get a new snapshot.

      Personally, if you care about users not missing any item in a query, you just can't use pagination at all, and you have to give them every item in the query in a single huge dump (running the query again would be the "explicit action" mentioned above that gets the user new data). Conversely, if you use pagination, users are free to assume that they might miss some items unless they already expect the underlying data to be immutable.

      • donavanm 3 years ago

        Not quite. The API action and the previous responses pagination token is what denotes the consistent list “snapshot.” They get a current view of the datastore every time they start a new paginated API operation, eg without passing an existing token.

        Edit: you might think of the list results like a materialized view. It works really well with natural ordering on something like item create time and you can pass that in as a query param.

    • mikedouglas 3 years ago

      This theoretically should be possible with MVCC, right? It's not an area I've explored and I could immediately see some issues with resource clean-up, but I could imagine it being possible with most modern DBs.

    • mbreese 3 years ago

      You could also store timestamps as part of the records. Then your query will always be consistent if you add an extra clause of tstamp < query_tstamp. So long as you store both the query_tstamp and the last record, you’ll get consistent results without needing to store individual cursors/snapshots for each user. (You’ll still have more CPU time per query, but that’s kind of a given here).

      You probably also need to switch from deleting records to adding an archive bit (or timestamp).

      This gets complicated fast.
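
      For illustration, the per-page query under that scheme might look roughly like this (names are made up; query_tstamp is captured once and reused for every page of the same query):

          -- only rows that already existed, and weren't archived, as of the
          -- original query time, paged with a keyset on (created_at, id)
          SELECT id, created_at, amount
            FROM records
           WHERE created_at < :query_tstamp
             AND (archived_at IS NULL OR archived_at >= :query_tstamp)
             AND (created_at, id) > (:last_created_at, :last_id)
           ORDER BY created_at, id
           LIMIT 20;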

      • moduspol 3 years ago

        This also presumes that your records always are processed in order, right?

        We struggled with this at a prior job because records within the most recent ~10 minutes could be delayed for any reason. We'd be collecting data from thousands of IoT devices and any of them could have momentarily lost a network connection for a few minutes, been rebooted, etc. So it'd be totally normal for end users to see records from 30 seconds ago, but then a few minutes later, see older records appear in between, because there were minor delays in some of them getting sent and processed.

        We were leaning toward just putting a bound on the range that the end user could see (for example, only show records that are at least ten minutes old) to reduce the variability, but that gave the appearance that our system took ten minutes to process all data. It was a tricky one to solve from a user expectation perspective.

      • indecisive_user 3 years ago

        I don't see how that solves the issue. A record can be updated for multiple reasons unrelated to the current database query, and any update at all to the record would hide it using the timestamp approach, regardless of whether the update would actually affect which page the data is on.

        • mbreese 3 years ago

          I didn’t say it was a great approach, but if you used the timestamp as part of the PK and treated every record as immutable (any update resulted in a new row), it would work.

          I’m not actually a fan of this approach as I think it causes too many scaling issues. If I were to do this, I’d either keep a database cursor in my session or store the full list of PKs in my session for the current query. This way you’d have the order of results as they existed at the time of the query. You still have to deal with the UPDATE/DELETE issue, but I guess this just depends on how much you care about consistent query pagination.

          I’m not that much of a stickler and most datasets don’t change fast enough to make it an issue.

andy800 3 years ago

Why does every web site default to aggressively paginate their information? Pagination sucks, it's a waste of time and of clicks, and should be a last resort. Sure, when Google returns millions of search results, paginate. But:

> For instance, if you have 40 entries and specify that there should be 10 items per page

40 entries??? Just show them all to me. My browser has a scrollbar, CTRL-F is much faster than your search box.

"but not everyone has a reliable fast connection" -- yes, which is a good reason to deliver more data per http request than breaking it up and requiring lots of slow requests.

"but the database load" -- half the queries, each returning 2x data, is almost always going to be easier on a RDBMS. If it's not then you probably need to rethink your schema.

  • ilammy 3 years ago

    > Why does every web site default to aggressively paginate their information?

    You get to see more ads while flipping through pages.

    Timing metrics for loading smaller pages make marketing happy.

    Timing metrics for time spent on the website make marketing happy.

    • andy800 3 years ago

      Perhaps, for content-based sites. Not if you're a financial institution and visitors to your site are likely looking to find a specific transaction. Not if you are an insurance provider and visitors are trying to find a service provider. If you are a retailer selling stuff, you don't want browsers, you want buyers. I will concede that Amazon does pretty well and they paginate product listings, but I think they use a lot more intelligence to deliver high-value results than typical retailers, including very large ones.

  • VWWHFSfQ 3 years ago

    > Why does every web site default to aggressively paginate their information?

    Because data retrieval costs money and resources. Unbounded API responses are susceptible to denial-of-service. Well-architected APIs will always cap the number of results they return per call.

    • andy800 3 years ago

      Nothing needs to be "unbounded." Set the cap at 100 or 200 records instead of 10-20. Better yet, plenty of results can be cached. A custom search might require a database query, but Newegg's initial listing of MicroSD cards, graphics cards, hard drives, etc, -- queries that are fulfilled identically hundreds of times per day -- can be cached.

      • pwinnski 3 years ago

        At which point you're not actually arguing against pagination, you're suggesting a higher page size.

        I agree, page sizes are usually far too short by default! Especially when I'm on a desktop computer and a fiber connection. I think people using phones and cellular connections probably appreciate short page sizes, so it would be nice if sites had different page sizes for different clients.

    • friendzis 3 years ago

      Do you perhaps conflate API level pagination with display pagination?

  • notriddle 3 years ago

    How many people actually read past entry five?

    • Beltalowda 3 years ago

      Your comment is half-way down the page. I read it.

      It really depends on what kind of site we're talking about; there's loads of times I want to see loads of results. Sometimes I don't, but there's no harm in showing me more than I want, either.

  • hamandcheese 3 years ago

    I’ve seen instances where the issue is not database performance, but web server performance. Serializing a graphql response in graphql-ruby, for example, can easily take 100+ms and grows linearly with number of records in a request.

    In the general case this isn’t a good excuse, there’s no reason it should take that long. But you don’t always get to pick your technology.

  • phendrenad2 3 years ago

    I'd guess that most people have written some bad query that only worked for small limits, so they started superstitiously making their limits very low.

  • Aeolun 3 years ago

    And this kind of thinking is how you end up with 2mb per request.

    It’ll be like Jira, pulling in literal megabytes of data and still slow.

    • Beltalowda 3 years ago

      I don't think the problem there is the pagination (or lack thereof). This page has 103 comments right now and is 204K (31K compressed).

      HN is pretty minimalistic, but the point is that text data isn't really what makes requests very large. For reference, 2MB is enough to include an entire book and then some. All of Dune is about 1.7M uncompressed in epub format (and ~1.1M compressed as an epub file). HN threads, Jira discussions, etc. can get pretty large, but I have not seen any yet that have started to approach the size of Dune.

      Images might add more data, but those can be loaded on-demand when scrolled into view pretty easily.

    • andy800 3 years ago

      No, relying on mega-frameworks and non-optimized graphic images is the express train to large request sizes. Don't forget video.

      Indeed.com's search results are almost all text. Maybe a company avatar. No reason it can't provide 100+ job listings per page rather than 10-20.

    • justsomeuser 3 years ago

      I don't think that is a bad thing, considering that a YouTube video in HD would be 500MB and people flick through them like web pages.

      I agree with OP that if you are on a powerful machine with fast internet it would be better just to load a massive HTML document of list items.

feike 3 years ago

Reminds me of Markus Winand, who hands out stickers at database conferences banning OFFSET.

His site is a great resource for anyone wanting to take a deeper dive on SQL performance:

https://use-the-index-luke.com/sql/partial-results/fetch-nex...

  • ReactiveJelly 3 years ago

    So basically do it like Reddit?

    https://old.reddit.com/?count=25&after=t3_wtpvdp

    I noticed Reddit's pagination has that "after" parameter, which points to the last post on the current page.

    It glitches out if the last item is deleted by moderators, but otherwise it works smoothly.

    • djbusby 3 years ago

      On Reddit I frequently see the "next" page having the same posts as the previous page. Not all the same but many of the same. Like, maybe after is being respected but the sorting is different or something.

      • radiojasper 3 years ago

        You took some time to read the page and while reading it, the homepage changed and some new posts got added to the top. Therefore some posts get moved to the 2nd page.

      • saghm 3 years ago

        I see that on Hacker News a decent amount as well when going through the top stories across multiple pages. My assumption has always been that the order changes between when I load the page and when I move to the next one (which sometimes is not for another several minutes).

      • pwinnski 3 years ago

        The problem with Reddit is that the sorting isn't by time, so the content order changes.

        The only solution I can think of for that is to track which individual posts have been shown to which users, which is quite a lot of work to do.

    • MatmaRex 3 years ago
  • wootest 3 years ago
    • pvorb 3 years ago

      Do you know if this is specific to MySQL or does it also apply to other RDBMS like PostgreSQL?

  • croes 3 years ago

    Can't he just use rowversion?

    No need for row value syntax and it works with MS SQL Server

  • dinkledunk 3 years ago

    how to jump to an arbitrary page?

    • nextaccountic 3 years ago

      You can do that with postgres histograms https://www.citusdata.com/blog/2016/03/30/five-ways-to-pagin... - go to the section "Keyset with Estimated Bookmarks"

      > As we saw, plain keyset pagination offers no facility to jump a certain percentage into the results except through client guesswork. However the PostgreSQL statistics collector maintains per-column histograms of value distribution. We can use these estimates in conjunction with limits and small offsets to get fast random-access pagination through a hybrid approach.
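
      As a sketch, the hybrid approach boils down to something like this (table/column names are made up; the histogram comes from Postgres's statistics collector, so the jump is approximate):

          -- 1) read the planner's histogram for the sort column
          SELECT histogram_bounds
            FROM pg_stats
           WHERE tablename = 'items' AND attname = 'created_at';

          -- 2) seek from the bucket boundary nearest the requested percentile,
          --    correcting with a small offset inside the bucket if needed
          SELECT id, created_at
            FROM items
           WHERE created_at >= :bucket_boundary
           ORDER BY created_at
           LIMIT 20 OFFSET :small_offset;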

    • squeaky-clean 3 years ago

      Jumping to a specific page is a bit of an ambiguous / undefined term in this case. Like asking for a specific page in a book that's still being written. Maybe today the plot twist occurred on page 100, but then the author decides chapter 1 needs more backstory, and now the plot twist happens on page 115.

      Unless you can guarantee your data is static, or that the sorting order cannot be mutated and only append later values, the concept of what data belongs in which page could be changing every millisecond.

    • jsmith99 3 years ago

      Do you really need to jump to an arbitrary page and land on the exact item? For many applications an approximate jump is fine. If your column is fairly uniformly distributed you can guess the index for any arbitrary page.

      • dinkledunk 3 years ago

        Yes, my business users will feel like they don't have sufficient access to their data if they can't.

        > If your column is fairly uniformly distributed you can guess the index for any arbitrary page.

        I don't think that'll work in a multi-tenancy situation with complex filters.

        • waspight 3 years ago

          I bet your users do not always know what is best for them.

          • eurasiantiger 3 years ago

            Some people thoroughly enjoy a linear saccade search! See for example any social media app.

            It definitely isn’t in the users’ best interest to have any method of scrolling through a lot of records.

    • MatmaRex 3 years ago

      You don't, but instead you can jump to an arbitrary place in the results. For example, you could show results starting from the letter P, or show results starting from 2022-04-02.

    • hnuser847 3 years ago

      Spoiler: you can’t.

      • thaumasiotes 3 years ago

        But the entire concept is that this is an adaptation to the fact that data may be added to or removed from the database. If that's true, there would be no benefit in jumping to a specific page - there's no guarantee that that page will display any particular data.

jvolkman 3 years ago

For anyone curious about this approach, I've posted an excerpt [1] of the shared Python + SQLAlchemy + Postgres code we use to handle pagination at my company.

The name "cursor pagination" is super confusing given that "cursor" is an overloaded term in databases. I always call this "token pagination", given that the APIs I've seen usually call the value a token.

[1] https://gist.github.com/jvolkman/b8c0e3d05929a1506c99fbc9474...

drdec 3 years ago

I was hoping to read about how they handled cleaning up cursors, what the effect on the server was when having a bunch of open, long-running cursors, etc. Unfortunately the article only treated the subject at a superficial level.

So, anyone here implement pagination via cursors? What do you find to be the drawbacks and how do you mitigate them?

  • imbusy111 3 years ago

    You probably assume they are talking about database cursors. The cursor is just a record ID in this article. There is no long-term storage of database cursors on the server. Assuming you can sort all your data, the next query just returns all records after the given record ID, plus the record with that ID.

    One corner case would be if the cursor record is deleted. I don't see it mentioned how they handle it.

    • erikpukinskis 3 years ago

      Does that mean you have to scan the entire results set to get the right page? So if I am on page 100, I have to query pages 1-99 and discard them?

      Or is there a trick here I’m missing?

      • nerdponx 3 years ago

        I think the point is that clients are not even given the option to think in terms of "page numbers".

        What's the use case for needing the 100th page of a query result, that also doesn't allow you to cache the 100th page locally to retrieve it later?

      • imbusy111 3 years ago

        There are no pages anymore. You fetch a record by ID and next N records.

        • erikpukinskis 3 years ago

          Right, but in your database query doesn’t that require you to get a monster result set and filter it down in your application layer?

          There’s no SQL clause for “WHERE IT WOULD COME AFTER id=ah73d IN THE RESULT SET” as far as I’m aware.

  • erikpukinskis 3 years ago

    Yah, another downside of cursor-based pagination is: what happens when the record the cursor refers to is deleted?

    Do you just crash and ask the user to start over?

    Do you have to nudge open cursors on every delete?

    • simonw 3 years ago

      My implementation of cursors works by encoding the primary ID of the last row on the page, along with additional information corresponding to the sort order if that's needed.

      That way it doesn't matter if the record is deleted - I can still return the next page by showing records that come after that provided cursor value.

      There's an example on this page: https://latest.datasette.io/fixtures/sortable?_sort=sortable

      Since the table is sorted by the "sortable" column, the next page link includes this:

          ?_next=15%2Cg%2Cz
      
      15 is the last value for "sortable" on the page, then g,z are the compound primary key for that last row.
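
      Decoded, that cursor becomes a row-value comparison on the sort column plus the compound primary key, roughly like this (a sketch; the primary key column names are illustrative):

          SELECT *
            FROM sortable
           WHERE (sortable, pk1, pk2) > (15, 'g', 'z')
           ORDER BY sortable, pk1, pk2
           LIMIT 20;
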
      • erikpukinskis 3 years ago

        Interesting… what happens if new records are added with that 15 value? Do you need an implied secondary sort with the record created time?

        Also what if there are more than $page_size records with that 15 value?

        • simonw 3 years ago

          Yes, if you want to support new records being added it's up to you to include something like a created date as part of your specified sort order.

          More than page_size records with that value works fine - that's why the primary key is included as a tie-breaker.

          • erikpukinskis 3 years ago

            Doesn’t that take you back to the original problem though: if you delete the primary key the query fails?

            Would the created_at be a better option than the primary key? That can’t be deleted or changed, and that would give you a stable secondary sort?

    • hamandcheese 3 years ago

      Instead of the cursor being an ID, it could directly encode whatever column(s) you are sorting by. Then you don’t have to locate any record in particular, you can always return records that sort after the cursor.

  • christophilus 3 years ago

    It’s not a cursor as in a relational database cursor. It’s a cursor, as in an ID of an item in a stably ordered set. There’s no long-running anything to be worried about.

fein 3 years ago

I have a few main requirements for what I consider easy to use pagination.

1. I can set how many items per page.

2. I can get to an arbitrary page either through a path/ query param or at least close with a pagination row that contains a way to jump around. If an item gets removed, whatever I was looking for should still be in the same vicinity.

3. As a result of #1 and #2, I can go back and find items I saw previously because their placement is at some fairly reliable position on the page I remember it being on or around.

You know, like how a physical book or catalogue with pages works.

Please stop trying to improve on this and making usability more difficult. I hate infinity scroll and I hate not being able to pick my page in an intuitive fashion.

  • nemothekid 3 years ago

    >I can get to an arbitrary page either through a path/ query param or at least close with a pagination row that contains a way to jump around

    I've had quite a few heated discussions on this point. The problem is, once your dataset gets large enough, this use case is incredibly difficult to scale; not impossible (Google does it with millions of search results), but prohibitively expensive compared to how often the need to jump to a specific page actually arises.

    Now I always try to stick to cursor based pagination as the default in order to prevent people from building workflows on top of offsets.

    • robertknight 3 years ago

      > Google does it with millions of search results

      Google cheats, but in a way that very few users will notice. You can only access the first 20 pages of search results. Depending on user behavior, this is one way to offer navigation via page number while limiting worst-case cost.

      • zaroth 3 years ago

        I doubt that 21st page actually exists, even on the back end.

        I would bet the top-line number of results is some sort of extrapolated estimation thing, a hand-wavey way to represent how deep the hypothetical result set is for that query.

        • Khoth 3 years ago

          It definitely is an estimate. If you search for a rare enough term that there aren't 20 pages of results, it can still estimate thousands of results until you call its bluff by going to a later page, at which point the estimate is replaced by the actual number.

  • int0x2e 3 years ago

    I totally agree with you from a usability standpoint. While it is possible to make this work well, for larger-scale services it is rarely without costs, and for the most part people don't seem to care as much as you'd think.

    While it can be frustrating, I doubt much revenue (relatively speaking) was lost due to this issue, which means for most apps and services, it will likely not get done.

    (I'm not disputing the merit of your arguments, just explaining why this will rarely get done well in real life...)

debrice 3 years ago

The pagination con given in the article is wrong: switching to streaming isn't what fixes the dropping issue; removing the sorting is likely why no records are being dropped (you wouldn't drop records using a numerically sorted ID or creation date).

Pagination is incredibly useful to human. If I tell you I found something on page 15 you can relate to it, something I cannot do with infinite scroll.

  • indecisive_user 3 years ago

    If a user has to go to page 15 to find something useful to them then I would argue that's a bigger failure of the UX/filtering than it is a success of pagination.

    • Aeolun 3 years ago

      Books work this way. It's very relatable to many people. Conversely, few people understand why they cannot just jump to page 15.

erikpukinskis 3 years ago

Worth noting that another solution to the problems with offset cursors is to do updates. When you add/remove rows you just update the results set and then there’s no issue with missing or duplicated rows.

Not easy to do of course, but it’s one direction you can go.

wruza 3 years ago

Why is offset-based pagination a thing at all? It sucks when a feed is very active, and it sucks when you link to a page or visit search results that may turn out to be hundreds of pages away in a week. I always wondered why nobody does the obvious (order_col > k, ORDER BY order_col, LIMIT n). It's not a two-year-old mistake, it's two decades old, and sites still do that by default.

  • Aeolun 3 years ago

    Because normal people enjoy referring to ‘item 4 on page 3’ instead of ‘search for item 57533898-2’.

    At least I do. Passing numbers under 10 is easy. Passing identifiers is hell.

bjourne 3 years ago

Ime, concurrent updates to the data set isn't a problem in practice and nobody cares if they occasionally get duplicate entries. The cluttered and sometimes useless urls cursors cause are, again ime, a much bigger usability problem. Sure, naively implemented queries suffer if the user types ?page=123456 in the url but such problems are quite easy to fix.

  • nickkell 3 years ago

    How do you fix those problems then? Let’s say you have a “last page” button in the UI for example

    • dmacedo 3 years ago

      I've used the conditional that if the current page is greater than the last page, just return the last, and likewise return the first for negative pages. If records are updated/deleted and the last page changed, then you'll just get the results of whatever the "new last page" is.

      At scale you might care about the duplicate or up-to-date records. But cursor-based doesn't solve the problem if a results page is left open for "too long" and stuff was added behind your cursor (or after it, if navigating backwards).

      It's as if making things less intuitive (per the article's reference to book pages) somehow makes it easier, as long as you don't think about any of the pitfalls.

      My suggestion is to just use pages, and optimise for the right order (i.e. sequential IDs, creation date, alphabetical, etc.) that makes sense for your data.

      If you REALLY must care if results have changed, some delta being stored would be best (like some timestamp that allows the server side to indicate "hey, your results are out of date, from 7 days ago, maybe you left that page/API response unused for too long")

    • bjourne 3 years ago

      Don't have a last page button. :) Or limit the number of results to, say, 1000, which is trivial for an rdbms to handle. Or precompute the result set's rankings and transform the offset-limit query into a "where rank >= 991 and rank <= 1000" query.
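
      The precomputed-rank variant might look something like this (names are made up; rank is assigned once when the result set is built, and indexed):

          SELECT id, title, rank
            FROM search_results
           WHERE query_id = :query_id
             AND rank BETWEEN 991 AND 1000
           ORDER BY rank;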

mike_hock 3 years ago

TL;DR: The headline. TFA doesn't really add any information.

  • G3rn0ti 3 years ago

    Well, except that TFA explains what DB cursors are and why they are faster than page offsets (because they skip DB entries).

    • jsmith99 3 years ago

      I'm pretty sure they aren't using actual SQL cursors as that wouldn't scale well so it's probably not a 'DB cursor'. It's a shame as actual cursors are the native database way to do pagination.

    • thaumasiotes 3 years ago

      > TFA explains what DB cursors are and why they are faster than page offsets (because they skip DB entries)

      Except that no such information appears in the article. Here's the entirety of the explanation:

      >> Once you’ve been returned a cursor, you can provide it in your requests to act as a farther-along starting point within your data entries. The server is then able to efficiently skip all entries that come before your specified cursor value

    • theteapot 3 years ago

      TFA isn't talking about that kind of cursor.

    • theogravity 3 years ago

      It needs to talk about how to actually implement it. Most articles like this one mention its existence and why it's good, but generally stop there.

      • G3rn0ti 3 years ago

        Postgres supports cursors and documents them very well:

        https://www.postgresql.org/docs/current/plpgsql-cursors.html

        Basically, you declare a cursor like that:

        DECLARE cname CURSOR FOR <select statement>;

        You can pass the cursor's name "cname" around (and even store it on the client side, though best encrypted) and obtain the "next" slice of the query's data on demand like this:

        FETCH NEXT FROM cname;

        Not sure you really gain that much performance on everyday queries but with a large number of rows it might.
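
        Put together, a minimal session might look like this (a sketch; without WITH HOLD the cursor only exists inside the transaction):

            BEGIN;

            DECLARE cname CURSOR FOR
                SELECT id, created_at FROM items ORDER BY created_at, id;

            FETCH 20 FROM cname;   -- first page
            FETCH 20 FROM cname;   -- next page resumes where the last FETCH stopped

            CLOSE cname;
            COMMIT;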

camgunz 3 years ago

I think the challenge is always the stateless nature of HTTP. It's pretty hard to keep a connection to the API server that has your database cursor.

Everything else has big tradeoffs. You can make sure your rows are sortable by a specific column, but that means your searches must all be sorted by that column.

You can save the results of a search and page through it, but that's a lot of I/O.

You can not do any of the above, but then you run the risk of non-deterministic pagination (I went back and because someone added/deleted a row, that page is slightly different now). You also can't really link to it.

These things might all be fine. In my experience though, you run into people who really want to do search in way A, and will use the tradeoffs for way B/C/D to argue against it, as though A has no tradeoffs.

WesleyJohnson 3 years ago

At my current job, our intranet site has lackluster performance due, in part, to limit/offset pagination. Unfortunately, the business treats the "reports" we author like glorified spreadsheets and wants the ability to filter on any column and order by any column. It makes it near impossible to tune/optimize.

The legacy system we're replacing used cursor pagination in a lot of areas and was perfectly acceptable to them, but now that we're going web based - it's not. Unfortunately, really - it seems vastly superior.

  • yrgulation 3 years ago

    “It makes it near impossible to tune/optimize.”

    I recommend using elastic search or a nosql database to optimise performance. Relational databases can be slow for this use case.

    • WesleyJohnson 3 years ago

      Thanks, I'll have to search for some case studies on the topic to help me understand the approach. We use ElasticSearch in one small area to search for products across multiple points of metadata and I've always wondered if we were underutilizing ES in general.

    • jitl 3 years ago

      What kind of NoSQL database are you thinking about? What strategy would you take with that database to optimize this problem?

      • yrgulation 3 years ago

        This is hilarious - i am sharing my knowledge and getting downvoted for it.

        Anyway, the gist of it is that you store data in denormalised documents whereby searchable columns become keys of a single entry. The storage is a secondary storage, not the main data source. You write data in both - sync in your relational db, async via a queue or whatever works best for your infrastructure. Searches are then made against it. If you go for es you can rank results, assuming filtering is done using free text search. I prefer es, the flavour of nosql doesn't matter, but es is great for free text search.

        • nerdponx 3 years ago

          You said in a sibling comment that denormalized analytics tables are a hack, but I don't see how this is any less of a hack. It's literally the same technique, except now you're adding substantial operational complexity with a whole extra database server (versus just extra tables in the same database). And it does not at all solve the problem of needing to figure out which fields to denormalize and which ones to leave in separate documents/collections.

          And even if you do decide that it makes sense to split your analytics data into a separate database system entirely, document-oriented databases "ain't it".

          I have very little experience with Elasticsearch, although I'm surprised to hear it being recommended for anything other than full text search. GP was talking about spreadsheet-style filtering, not full-text search.

          But I do have a good amount of hands-on experience in production with Mongo, and I can say for sure that it is absolutely not a good option for an analytics database, and that I would much rather deal with traditional joins and filters. Even using it as a primary "application database" for a web app was a constant headache from beginning to end. I never thought I would miss MySQL so much, and I would never consider using Mongo again unless I had the very specific use case of storing a high volume of semi-structured data with no consistent schema, and even then I would convert the data to a more traditional tabular/relational format for use in a data warehouse if the business analysts or data scientists needed to work with it.

          • yrgulation 3 years ago

            I hate mongodb with a passion. It may very well be that we are misunderstanding the OP's use case and making assumptions. My assumptions are: 1) many types of reports (as such, many tables; es can create docs on the fly), 2) reports are made of many rows (otherwise why compare them with spreadsheets). My second assumption is that once you add pagination you can no longer Ctrl-F for content, you need full text search.

            For a set of reports with consistent column names and values made of aggregate or static data what you are proposing works fine - since you can just increase counters or cache values as data comes in.

            But for a use case where different types of reports have varying columns you can just dump everything into es documents and run basic aggregate queries. Or you can precompute data when making inserts.

            Anyway, i am assuming too much about the use case, my bad. I’d have to hear more about it to defend my point.

      • nerdponx 3 years ago

        This sounds pretty typical for "analytics" workloads, which relational databases handle just fine. Maybe by "noSQL" they just meant something with column-oriented storage? But even that seems like it might be overkill, compared to setting up a couple of denormalized "analytics tables" to avoid the cost of complicated joins.

        • yrgulation 3 years ago

          Yeah with all due respect but hacks like these are a bit amateurish. I heard of a dude i think at intuit building their queues in a relational db because they work “just fine”. Prompted a giggle or two. Use the right tool for the task at hand, dont do clever hacks as they bite back later on.

          • jvolkman 3 years ago

            "works just fine" might be a perfectly reasonable tradeoff if it avoids adding additional architectural complexity.

            • yrgulation 3 years ago

              Depends what you define as complexity. What i described is trivial. Unless the userbase and or data are small.

hirundo 3 years ago

My sympathies to the coders downstream of this large change for the amount of work required. I would like to add cursor based pagination to the APIs I manage. It would not be an option to go to our clients and explain that offset-pagination will be removed. There seems to be very little cost in supporting both.

  • redditor98654 3 years ago

    AWS does the pagination thing for all LIST APIs and they encode the token in a way that clients cannot deconstruct anything out of it. For them it is just an opaque string. In fact it is cryptographically encrypted and not something simple like a Base64 encoding of some JSON data.

    And the token has enough metadata to reconstruct the query from where it left off in the previous call and perform additional security checks so that it cannot be used by other customers to get results from a different account.

    I have actually migrated databases from Aurora to DynamoDb behind the scenes where customers continued to call these APIs with these tokens and the calls would flip to the new DB and resume from the last place. No downtime or "maintenance window" which would be a non-starter at AWS anyway.

    I recommend this for any external service. For internal services, if you work closely enough with your users, maybe you can opt for something more transparent.

  • mgkimsal 3 years ago

    I didn't see anything specific about timelines. I would expect you'd want to offer both for some length of time, with some deprecation notice and a sunset date for the older approach. But perhaps they see use cases in their logs showing that the current offset approach is used minimally already, and it's better to switch now vs. later if/when adoption is higher?
