Downsides of Caching
Never implement a cache that you can't completely disable with the flip of a switch somewhere. Otherwise, when wrong behavior is observed in the system, you don't have an easy way to rule out the cached copies. And worse, as you make "changes" to the system code or the data, you may not be able to tell the difference between a change that had no effect on the problem and a change that was hidden by a stale cache.
This is one of the things that drives me crazy about some of Apple's technologies. For instance, somebody at Apple decided long ago that all application HTML help pages should be cached. The "off switch" for this cache remains a bit of black magic but it's something like "rm -Rf com.apple.help.DamnedNearEverything" followed by "killall UnnecessaryBackgroundHelpProcess" every damned time you modify a page or else the help system might show you an older version of the content that you just "changed".
How I handle kill switches is to first make sure that all code related to caching is just that - caching code. If it gets tied anywhere into the data-saving logic or, even worse, the actual business logic, you are screwed. I tend to manage this by keeping caching as part of a strategy pattern, so I can enable or disable it using config parameters when starting the app. End-to-end testing with casper/selenium always runs with the cache turned off unless I specifically want to test the cache, which, now that I think about it, I never actually have.
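Roughly this shape, as a sketch (the NoOpCache/InMemoryCache names and the CACHE_ENABLED variable are made up for illustration, not from any particular framework):

import os


class NoOpCache:
    """Kill switch flipped: every lookup is a miss, nothing is stored."""
    def get(self, key):
        return None

    def set(self, key, value):
        pass


class InMemoryCache:
    """Deliberately simple dict-backed cache, for illustration only."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def make_cache():
    # The config parameter is the kill switch; business logic only ever
    # sees the get/set interface and never knows which strategy is live.
    if os.environ.get("CACHE_ENABLED", "1") == "0":
        return NoOpCache()
    return InMemoryCache()

The end-to-end tests can then just export CACHE_ENABLED=0 and exercise exactly the same code paths.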
I agree with pretty much everything in this post, though I would add one more thing. It's not so much a downside of caching as a misuse: Application-level caches should never cache local data. Cache network responses. Cache the results of computations. Don't cache files or disk reads. Operating systems already implement disk caches, and they do a better job of it than you. That's in addition to a modern computer's numerous hardware caches. For example, take this code:
...
FILE *fp = fopen("example.txt", "r");
char dest;
size_t bytes_read = fread(&dest, 1, 1, fp);  /* read the file's first byte */
if (bytes_read == 1)
    putchar(dest);
...
Think of how many caches likely contain the first byte of example.txt. There's the internal cache on the hard disk or SSD. There's the OS's filesystem cache in RAM. There's your copy (dest) in RAM, and also in L3, L2, and L1 cache. (These aren't inclusive on modern Intel CPUs. I'm just talking about likelihood.) Implementing your own software RAM cache puts you well into diminishing returns. The increased complexity simply isn't worth it.

I wouldn't say "never". Let's say you have a local file with 1e6 words. The file can be updated at some point. Your service gets a word in a request and needs to return "is this word in the list".
Do you really want to read the file every time a request comes in? No, you're going to read it once and store it in an indexed set for quick lookup. You just cached a local data file.
It's about the benefit vs. not caching. Not about local/remote.
Your example is quite valid, and I would probably implement something similar to solve the same problem. But it's not a cache. Caches can miss. Caches have a replacement policy. If it contains the complete, authoritative copy of the data, it's a memory-backed data store.
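For contrast, a cache in the strict sense looks more like this toy sketch (the capacity of 2 is arbitrary): fixed capacity, least-recently-used eviction, and lookups that can simply fail:

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None                        # a miss: fall back to the source of truth
        self._entries.move_to_end(key)         # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)      # "a" gets evicted
print(cache.get("a"))  # None - a miss, unlike the complete in-memory word set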
Re replacement policy - that's why I mentioned that the file can be changed. You'll need an mtime/inode/time check on each request / periodically.
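Something like this sketch, say (the file name and the per-request check are placeholders; you could just as easily check on a timer):

import os

class WordList:
    def __init__(self, path):
        self.path = path
        self._mtime = None
        self._words = set()

    def _refresh(self):
        mtime = os.stat(self.path).st_mtime
        if mtime != self._mtime:        # first use, or the file changed: reload
            with open(self.path) as f:
                self._words = {line.strip() for line in f}
            self._mtime = mtime

    def contains(self, word):
        self._refresh()
        return word in self._words

words = WordList("words.txt")
print(words.contains("caching"))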
A cache can miss? I don't think that's required. It can miss in a general sense, as in it needs to be lazy loaded. But I'd still call it caching if you're getting a single value. For example, you can still cache the whole front page with a server-side push into the cache. You can't miss in that case either.
But yeah, that's just details. The line between a cache and an in-memory data structure is rather vague.
I wouldn't call that a cache. It's certainly computed state, of the kind that can be thrown away and recomputed a la Crash-Only programming, but in my experience a "cache" is supposed to be transparent: to decorate a file/device/socket/RPC service/object/whatever and expose the same semantics as that whatever, but more performantly. Your indexed set doesn't expose the same semantics as the data file it was constructed from.
I don't think that's very good advice in a heavily-loaded shared hosting environment. A disk read could easily stall for tens of seconds, just because the kernel whimsically decided to throw out the cache (or because your server crowded its memory container). I actually don't want any server touching a disk while it's serving. Everything should be read before service begins and never again.
Your proposed solution (read from disk on startup and never again) is really a memory-backed data store, not a cache. Caches can miss.
But let's analyze your example. If disk reads take tens of seconds and memory usage is high enough to purge the kernel's disk cache, nothing can save you. Had your process read in everything at the start, it would be using even more memory. Given the same load, one of two things will happen:
1. If you have swap enabled, parts of your process's memory will be swapped-out. Accessing "memory" in this case would cause a page fault and tens of seconds of delay.
2. If you have swap disabled, the OOM-killer will reap your process. When it respawns, it's going to read lots of stuff from disk... and disk reads take tens of seconds. Oops.
Even if an application-level data cache improved performance on heavily-loaded shared hosts, the added costs of software development and maintenance far exceed the cost of better hardware. Hardware is cheap. Developers are expensive.
Here's an example. You have a 100MB C++ executable that needs 4GB for its own various purposes and 20GB of data that it's serving. The machine has 64GB of memory. If you allocate 24.1GB of memory to the container for this service, disable swap, and mlock the binary and the data files, nothing will go wrong.
On the same machine is a batch process which is reading a 1TB file and writing another 1TB file. If your serving process was reliant on the OS page cache, it would find that its pages were routinely evicted in favor of the batch process.
You're right about swap; that's why only a crank would enable it. The moment at which swap was a reasonable solution was already behind us 20 years ago.
In that example, I'm pretty sure forgoing containers and mlock would result in similar performance while using less memory. Process startup time would also be significantly improved. (If there's such high contention for disk I/O, reading 20GB on startup is going to take a very long time.)
The kernel's page cache eviction strategy is smarter than naïve LRU. On the first read, a page is placed in the inactive file list. If it's read again, it's moved to the active file list. Pages in the inactive file list are purged before the active file list.[1] So large sequential reads may cause disk contention, but they won't massacre the file cache.
This I/O situation isn't uncommon. Consumer systems also have big batch jobs that can pollute file caches: large copies, rsyncs, backup software (Déjà Dup, Time Machine, etc). They don't solve this with containers, limits, and mlock()ing. Some programs add a couple calls to fadvise(), using the FADV_NOREUSE or FADV_DONTNEED flags.[2] But for the most part, doing nothing yields excellent performance. Operating systems are pretty good at their job.
1. https://www.kernel.org/doc/gorman/html/understand/understand...
2. This is handy for applications like bittorrent, where multiple reads of the same page are possible, but caching isn't desired.
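For what it's worth, the fadvise() call from footnote 2 is available from Python too. A sketch (the file name is made up, and os.posix_fadvise is Unix-only):

import os

# Read a large file sequentially without leaving its pages in the OS page cache.
fd = os.open("big-batch-input.dat", os.O_RDONLY)
try:
    while True:
        chunk = os.read(fd, 1 << 20)     # 1 MiB at a time
        if not chunk:
            break
        # ... process chunk ...
    # Tell the kernel we won't need the cached pages for this file again.
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
finally:
    os.close(fd)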
If only O_STREAMING had made it to the kernel... https://lwn.net/Articles/12100/
In other words, you're advocating using more memory in a shared environment just so your server can (try to) be faster? What happens if everyone else does the same? Everyone ends up competing for the limited amount of memory, and no one wins.
"I'll grab all the memory I can so others can't use it" is a horrible way to think, as anyone who has attempted to simultaneously use multiple applications written with this mindset will know. One takes most of the memory, forcing other apps into swap, and then the opposite happens when you start working with one of the others, accompanied by massive swapping slowdowns.
Can I suggest that if you've got problems including "stalls for tens of seconds on disk reads" you are almost certainly better off directing available resources towards fixing your hosting problems, rather than going down the cache rabbit hole on a hosting platform that's not really suitable for production use?
(With caveats for zero-resource projects of course, but even for those I strongly suspect that for many people, paying $5 or $10 per month for "less crap hosting" is probably a better solution than prematurely optimising by adding caching and all its inherent complexity to a fundamentally broken platform)
There's nothing you or I can do about the trend. They put more and more cores into a machine and the same number of disks (current-model Xeon servers have 72 threads and 1 or 2 disks), which guarantees that, at some point, the disk is highly oversubscribed.
Sure - the "race to the bottom" for hosting prices inevitably means there's going to be options like GoDaddy offering "a year's worth of webhosting for $5" which clusters 400 WordPress and Drupal sites onto a single RaspberryPi or similar, but you don't _have_ to go there.
I can understand if you're an open source developer who gets paid in Uzbeki Som or Nigerian Naira, the calculation of "do I spend a day or two putting caching in place" versus "do I spend an extra $50 or $100 per year on hosting" might lean very much the other way, but I suspect for the vast majority of HN readers, the prudent approach is "pay a hundred or two dollars a year for hosting before bothering to implement complex caching strategies".
In that sort of environment, I wouldn't be surprised if your app's internal cache ended up paged out anyway...
I don't allow swap on my machines, and I mlock executable pages, so I'd personally be surprised if anything was paged out.
Great article. I'll add one: caching doesn't address underlying performance issues, just "sweeps them under the rug".
I've seen many devs jump to caching before investing time in understanding what is really causing performance problems (I was one of them for a time, of course). Modern web stacks can scream without any caching at all.
Years ago, a talk by Rasmus Lerdorf really opened my eyes up to this idea. [1] He takes a vanilla PHP app (Wordpress, I think) and dramatically increases its throughput by identifying and tweaking a few performance bottlenecks like slow SSL connections. One of the best lines: "Real Performance is Architecture Driven"
[1] I think it was a variation of this one: https://vimeo.com/13768954
Good article which touches on real issues that a lot of developers won't really appreciate themselves until it happens to them (unless they have a strong background in distributed systems theory, and maybe even not then). A little strange that it doesn't use the word "consistency" even once though. :)
By dropping a cache into an existing system, you're weakening consistency in the name of performance. At best, your strongly-consistent system has started taking on eventually-consistent properties (but maybe not even eventual depending on how you invalidate/expire what's in your cache). Eventual consistency can help you scale, but reasoning about it is really hard.
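To make the weakened consistency concrete, here's a toy read path with a TTL cache (the names and the 60-second TTL are arbitrary, and load_from_db stands in for whatever reads the source of truth). Once a value is cached, a write to the backing store isn't visible until the entry expires:

import time

TTL_SECONDS = 60
_cache = {}  # key -> (value, expiry timestamp)

def read(key, load_from_db):
    now = time.monotonic()
    hit = _cache.get(key)
    if hit is not None and hit[1] > now:
        return hit[0]                        # possibly stale for up to TTL_SECONDS after a write
    value = load_from_db(key)                # miss or expired: go to the source of truth
    _cache[key] = (value, now + TTL_SECONDS)
    return value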
In some sense caching as described by OP is a tool to implement CAP theorem tradeoffs, and Eric Brewer described the reality of trading off the C (consistency) for A/P (availability/partition-tolerance) better than I ever could:
Another aspect of CAP confusion is the hidden cost of
forfeiting consistency, which is the need to know the
system’s invariants. The subtle beauty of a consistent
system is that the invariants tend to hold even when the
designer does not know what they are. Consequently, a
wide range of reasonable invariants will work just fine.
Conversely, when designers choose A, which requires
restoring invariants after a partition, they must be
explicit about all the invariants, which is both
challenging and prone to error. At the core, this is the
same concurrent updates problem that makes multithreading
harder than sequential programming.

I agree with all the main points here: caching adds a significantly complex component to the system. You should only do it if you absolutely must pull data closer to a consumer. Adding caching "to pick up quick wins" is always dumb.
With that in mind, I do think most of the pitfalls listed here can be avoided with well-understood tools and techniques. There's no real need to be running your cache in-process with your GC'd implementation language. Cache refilling can be a complex challenge for large scale sites, but I expect that a majority of systems can live with slower responses while the cache refills organically from traffic.
The points about testing and reproducible behavior are dead on - no equivocation needed there. As always keeping it as simple as possible should be a priority.
Fundamentally there's no need, but in-memory caching may still be the right choice. As always, there are tradeoffs. Standing up a separate cache component incurs non-trivial costs. Your service now has a new "unit of management" - a new thing you need to deploy, monitor, and scale. It's a separate thing which might go down unless it's provisioned for sufficient load, and you need to be careful about unwittingly introducing a new bottleneck or failure mode in your system. These are all solvable problems, but solving them comes at a cost.

"There's no real need to be running your cache in-process with your GC'd implementation language."

You can totally argue that engineers should be forced to think about and address these issues up front with more rigor, and in a perfect world I think I'd agree. :)
The article can probably be succinctly summarized as "Premature optimization is the root of all evil". Most of the author's points are valid, in that caching adds more complexity.
That said, caching is absolutely critical to almost every piece of software ever. Even if explicit caching isn't used, a wide variety of caches are likely still being depended upon, including CPU caches (L1, L2, L3), OS filesystem caching, DNS caching, ARP caching, and so on.
Caching certainly adds complexity but it's also one of the best patterns for solving a wide range of performance problems. I would recommend developers spend more time learning and understanding the complexities so that they can make use of caching correctly and without applying it as a premature optimization.
What bugs me is not so much caching as redundant caching.
I've seen applications that have 5 redundant caches, if not more (on-disk cache, OS cache, VM OS cache, stdlib cache, programmer-visible cache). And then you end up killing the actually-important caches (CPU caches, etc) from the amount of redundant copying required...
Caching is the classic memory-time tradeoff, everything from memoizing a function, to DNS caching, to storing a web resource that didn't change.
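The function-memoization end of that spectrum is nearly free in most languages; in Python, for instance (illustrative only):

from functools import lru_cache

@lru_cache(maxsize=None)   # trade memory for time: each distinct n is computed once
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(200))            # fast, because every intermediate result is cached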
I think that, if a cache is combined with a push indicating a change, then it's basically a local "eventually consistent replica" which catches up as soon as there is a connection to the source of truth.
Seriously, many times you are READING data which changes rarely (read: every X minutes / hours / days). So, in the meantime, every code path that will need access to the data may as well look in the local snapshot first.
The question about consistency is an interesting one. The client's view of the authoritative server state may be slightly out of date, when the user issues a request. If certain events happened in the meantime that affect the user's view, then the action can just be kicked back to the user, to be resolved. But 90%+ of the time, the view depends on 10 things that "change rarely", so a cache is a great improvement.
Related issues involve batching / throttling / waiting for already-sent requests to complete.
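A sketch of that "eventually consistent replica" shape (the push side is hand-waved here; in practice on_change_pushed would be wired to a message-queue subscription or similar):

class LocalReplica:
    def __init__(self, fetch):
        self._fetch = fetch          # callable that reads from the source of truth
        self._values = {}

    def get(self, key):
        if key not in self._values:  # lazy load on first use
            self._values[key] = self._fetch(key)
        return self._values[key]

    def on_change_pushed(self, key):
        # The source of truth announced a change; the next read catches up.
        self._values.pop(key, None)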
PS: That was quick. I posted this and literally 10 seconds later it got a downvote.
Caching is also bad in distributed systems, because by definition you're creating tail latency: the cache miss case. In a distributed system, you're more likely to hit the worst case in one component, so the cache may not buy you any end user benefit. It might just make performance more difficult to debug.
A cache can still be useful to reduce load and increase capacity... but latency becomes more complex.
That's kinda weird reasoning. Are you saying there's no benefit to an improvement of median latency, if the tail latency remains long? I would disagree. I also would point out that not all systems that can benefit from a cache are latency-sensitive.
Not that there's no benefit, but just that it's more complicated in a distributed system.
Certainly caching is vital to many distributed systems, but it has to be done from a systems perspective. In my experience a lot of caches are just slapped on top of individual components without much thought, and without even some basic monitoring of what the hit rate is. I think it helps to actually measure what the cache is doing for you -- but this is more work than adding the cache itself.
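The measuring part can be as cheap as two counters wrapped around the lookup (a sketch, assuming you already have something with get/set and a loader for misses):

class InstrumentedCache:
    def __init__(self, cache, load):
        self._cache = cache   # anything with get/set
        self._load = load     # fallback to the real source of truth
        self.hits = 0
        self.misses = 0

    def get(self, key):
        value = self._cache.get(key)
        if value is not None:
            self.hits += 1
            return value
        self.misses += 1
        value = self._load(key)
        self._cache.set(key, value)
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0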
And I agree with another poster in that I've seen many systems with caches papering over severe and relatively obvious performance problems in the underlying code.
I was thinking of this Google publication which outlines some problems with latency variability: http://www.barroso.org/publications/TheTailAtScale.pdf
Interestingly they didn't seem to list caches as one of the causes; they list shared resources, cron jobs, queuing, garbage collection, power saving features, etc.
I read some scribbling by some nerd working on distributed systems. The problem he mentioned is when you take a task and parallelize it, and then hand off the pieces to a bunch of workers, you aren't done until the last worker finishes. In that case long tail latencies can bite you rather hard. If 99 out of a hundred workers finish their bit in 50-100us and one of them stalls out for 10ms, you gained nothing over a single worker.
Implementing a cache isn't hard conceptually. If an object is modified, flush it from the cache. If an object isn't cached, build it.
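In code, that conceptually simple version is plain cache-aside (a sketch; the dict-backed "database" and build_object are stand-ins):

cache = {}
database = {}                        # stand-in for the real data store

def build_object(object_id):
    return database[object_id]       # placeholder for the real assembly work

def get_object(object_id):
    obj = cache.get(object_id)
    if obj is None:                  # not cached: build it and remember it
        obj = build_object(object_id)
        cache[object_id] = obj
    return obj

def save_object(object_id, obj):
    database[object_id] = obj        # write to the source of truth
    cache.pop(object_id, None)       # modified: flush it from the cache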
The problem is that implementing caching is a bit of a canary in a coal mine. If there are problems with the architecture, then trying to add caching into the mix will make things much more difficult.
I wouldn't say adding a cache upfront to the parts which you know will be heavily read (or at least adding hooks to make it easier to implement later) is a waste of time or "Premature Optimisation". The 80-20 rule is alive and well; just use your judgement.
It occurs to me that sharding shares most of these disadvantages. It avoids the problem of "you no longer read from your source of truth", but the overall complexity and set of failure modes looks strikingly similar.
I wonder how many sleepless nights have been caused by combining the two.
I have worked with a couple of systems that used very coarse-grained sharding at the application level. I did not notice these drawbacks. I have not worked with one that did auto-sharding on the back end; that might be trickier.
Some people may benefit from the academic exercise: How many caches are utilized in serving a typical website? (Assume a LAMP-like stack; if your number is less than about seven keep thinking)
Sometimes it's better to fix the database in production rather than add another database to production.
We need smart tools that can automatically make programs utilize a cache. That way, we can have the best of both worlds.