Ask HN: What is your go to performance optimization?
I'm curious what back-pocket optimizations you have from your past experiences which you pull out when a workload costs too much in the cloud. For me it's:
* disable lower c-states
* enable adaptive interrupt coalescing in the network driver
* mount volumes with noatime

Reduce how often something runs and the amount of work it does when it does run. Sometimes things are done too often, or they process unnecessary data. I once had a project where a SAP system was crawling and causing company-wide stoppages. We found a job that ran every minute of every day and processed a table that contained a few thousand tasks. This was something that could be done once per hour, and only during business hours. Furthermore, it was re-processing thousands of records each time it ran; in reality, once a record had been processed, it could be deleted from the table. We emptied out the table and scheduled the job to run hourly, and the whole company noticed an immediate improvement (a rough sketch of the pattern appears below).

This pattern happens a lot. Someone builds a polling system that hits the server once a second to see if a task finished. A cron job runs every 5 minutes. All data is reprocessed in a daily job instead of using a 24-hour cutoff. The world is filled with computers doing useless work, and mostly no one notices.

How it usually goes for me: I ask the business how often something should be updated, and they say "real time" (not going into details about what real time really means). It is hard to explain that processing all the data all the time, so everything is always fresh, takes forever. After a couple of months it turns out they never open their "super important dashboard", or do so once in six months. Great: after a year of bogging everything down, I can clean up and make the jobs run once a day instead of once a minute, because now that is acceptable.

Why take the unspecific "real time" answer as gospel? Just propose daily initially (or weekly, or monthly), listen to the protests, and see if there are good reasons why it needs to be more often. Then pick a suitable interval that fulfills their needs and that you can guarantee. If you offer pink fluffy unicorns for free, people will always pick them, without thinking.

Well, it is not clear from my initial post, but I do propose daily or whatever else rather than just "every SeCoND BecAuSE we NeeD iT NOW!!!", but the people clicking around the system expect stuff to happen "right away", and no amount of explanation works. Bonus points for trying to explain that you want to implement "eventual consistency". Keep in mind I added {"real time" (not going into details what real time means really)} in the text to indicate that I am not some junior whining around, but someone with a deep understanding of computing...

Just an alternative related to polling: we had a task that checked every minute what could be executed. Instead of doing that, we scheduled the executing code to run at that time (if it wasn't scheduled already) through the service bus. Since events can be planned ahead, easy peasy, and it avoided a microservice named "scheduler", a DB, and a serverless function... (Hoboy)
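Not from the thread itself, just a minimal sketch of that "touch only what's new, and run it less often" pattern, using Python's built-in sqlite3. The task_queue table, its columns, and handle() are hypothetical stand-ins:

```
import sqlite3

def process_pending(conn: sqlite3.Connection) -> int:
    """Everything still in the table is unprocessed; handled rows get deleted."""
    rows = conn.execute("SELECT id, payload FROM task_queue").fetchall()
    for task_id, payload in rows:
        handle(payload)  # stand-in for the real per-record work
        conn.execute("DELETE FROM task_queue WHERE id = ?", (task_id,))
    conn.commit()
    return len(rows)

def handle(payload: str) -> None:
    pass  # hypothetical business logic

# Scheduled hourly (cron or the platform's job scheduler) instead of every
# minute: same outcome, a tiny fraction of the work.
```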
Push less data through wires. The memory hierarchy is so stark on modern hardware that the 30-year-old adage "the fastest code is the code you don't run" is maybe less important than "the fastest code is the code that doesn't spend much time talking to the memory controller." And it's even worse once we start talking about accessing memory that's on an entirely different computer. Serialization/deserialization, IPC, network calls, and all those other things we do with reckless abandon in modern service-oriented and distributed applications are just unbelievably expensive.

Last year I took a slow, heavily parallelized batch job and improved its throughput by 60% by getting rid of both scale-out and multithreading and taking it all down to a single thread. Everyone expected it to be slower because we were using a small fraction of the CPU cores, but in truth it was faster, because the time saved on memory fences, data copying, and network I/O was just that great. And then the performance gains kept coming: having simplified things to that extent, I was in a much better position to judiciously re-introduce parallelism in ways that weren't so wasteful.

I've seen that happen a lot as JSON (XHR) and ORMs became more common... certain queries would return way too much data from related (auto-fetched) records and it was just slow AF on remote computers. Another common one is just poor query performance from a database: lack of appropriate indexes, or other relatively easy optimizations. Similarly, finding a method of caching that's just bad (an in-memory database queried with SQL instead of a dictionary). It isn't so bad for one call, but very bad when a given request (login) makes over 200 calls to this cache for configuration settings. It wasn't a problem per request, but it was in aggregate.

> Push less data through wires.

In the systems I work on this has been a big one. In SQL people are pretty good about not writing `select *` in production code, but when querying directory servers, redis, mongodb, etc. people get sloppy. When a system is small, it's enticing to pull in lots of data and work with it in code instead of writing real queries. This doesn't scale.

Unless you use an ORM, in which case I'm used to seeing it be all SELECT * all the time. And then you get an entire generation of engineers who've never known any other way to talk to a database going around complaining about how this Miata is so slow, when really it's just that nobody ever taught them how to shift out of first gear.

Your example can really go either way. I've done the exact opposite, with crushing success: I've spent the last couple of years identifying and resolving N+1 problems in a Django codebase. https://planetscale.com/blog/what-is-n-1-query-problem-and-h... Aside from the performance gains, it's very satisfying to go from 1,000+ inefficient DB queries to 1-2 optimized queries.

That's a well-written article. For developers newer to relational databases, there's a heuristic at the beginning which I remember hearing elsewhere and keep in mind when I'm doing query work: "You might expect that many small queries would be fast and one large, complex query will be slow. This is rarely the case. In practice, the opposite is true."

It's a great heuristic. A big part of why it works out that way is that the query planner can only optimize the query you give it. If you give it a bunch of small queries, it can only make relatively inconsequential micro-optimizations. One big query gives it a lot more degrees of freedom and opportunities to make big gains. Here's another great resource for getting more out of relational databases: https://use-the-index-luke.com/
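To make the N+1-versus-one-query point concrete, here is a rough sketch (not the Django case above) using Python's built-in sqlite3 and made-up users/posts tables:

```
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
""")

def titles_n_plus_one():
    # The N+1 shape: one query for the users, then one more query per user.
    out = {}
    for user_id, name in conn.execute("SELECT id, name FROM users"):
        out[name] = [title for (title,) in conn.execute(
            "SELECT title FROM posts WHERE user_id = ?", (user_id,))]
    return out

def titles_joined():
    # One round trip; the planner sees the whole problem at once.
    out = {}
    rows = conn.execute(
        "SELECT u.name, p.title "
        "FROM users u LEFT JOIN posts p ON p.user_id = u.id")
    for name, title in rows:
        out.setdefault(name, [])
        if title is not None:
            out[name].append(title)
    return out
```

In ORM terms this is roughly what Django's select_related/prefetch_related do for you, which is why fixing an N+1 there is usually a one-line change.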
I realize you're asking about performance optimizations, but since you put it in the context of a workload's cloud bill being too large, I'll chip in and say that by far the largest impact on cost I've seen over the years comes from simply rightsizing the infrastructure the workloads run on. What I see more than anything is applications reserving 10x more CPU or memory than they actually use. In some cases the fix is amortizing resource usage over time by asynchronously consuming some kind of queue, where the extreme reservation is driven by a temporary usage spike (a downstream client doing some batch processing, for example).

Easy trick for making joins 50x faster: don't use Postgres, and give your tables a primary key that groups related items together. A lot of people don't know that an ordinary database index doesn't order the actual rows on disk; it's just a B-tree of pointers. With a clustered index that matches the table's query pattern, the rows really are ordered on disk. Most DBs load data in 8 KiB pages, so if you query 100 rows of 100 bytes each and they're not stored together, you may need to load nearly 1 MiB of pages even though the query result is only about 10 KB. Clustering speeds up joins and range queries 50x or more, causes fewer cache evictions, etc. You can do this in just about any database except Postgres, which doesn't have the ability to keep rows sorted on disk.

Although it isn't automatic, doesn't the Postgres CLUSTER command reorder the rows on disk? Or am I misunderstanding something?

It does, but it's a bit of a problem when the table is large and keeps getting new rows, since by default it locks the table and is a slow operation.

Oh, and it also does compression, which helps quite a bit with network storage (like cloud disks).

CockroachDB or YugabyteDB kind of solve for sortedness by primary key, since they use RocksDB variants / LSM trees underneath.
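For a concrete (if miniature) version of the clustered-primary-key idea: SQLite's WITHOUT ROWID tables store rows in primary-key order, so a composite key keeps one user's rows physically adjacent. Table and column names here are invented:

```
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        user_id    INTEGER,
        created_at TEXT,
        payload    TEXT,
        PRIMARY KEY (user_id, created_at)
    ) WITHOUT ROWID
""")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(42, "2024-01-03", "a"), (42, "2024-01-10", "b"), (7, "2024-01-05", "c")],
)

# All of user 42's January events come back from one contiguous slice of the
# clustered B-tree instead of scattered point reads across unrelated pages.
rows = conn.execute(
    "SELECT created_at, payload FROM events"
    " WHERE user_id = ? AND created_at BETWEEN ? AND ?",
    (42, "2024-01-01", "2024-01-31"),
).fetchall()
print(rows)  # [('2024-01-03', 'a'), ('2024-01-10', 'b')]
```

The same idea shows up as clustered indexes in SQL Server, index-organized tables in Oracle, and InnoDB's primary-key clustering in MySQL.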
The fastest code is code that isn't executed. Related: do only what is necessary.

This works on so many levels and is my magic trick for making software faster. It's a pity computers are so fast and have so much memory that people can get away with not caring about minimalism.

Also the most secure.

I've seen memoization improve performance by enormous amounts, even for simple functions that do a few simple calculations before returning a result. Another go-to of mine is to take conditionals out of loops when they really only need to be checked once. For example:

```
for foo in whatever:
    for bar in foo:
        if len(foo) > some_value:
            do_something(bar)
```

Can become:

```
for foo in whatever:
    if len(foo) > some_value:
        for bar in foo:
            do_something(bar)
```

This example is trivial and wouldn't gain much, but imagine if `len(foo)` were a more computationally expensive function. You'd only need to call it once per iteration of `foo` instead of on every iteration of `foo` * `bar`.

I wouldn't even really call that loop example an "optimization"; it is just the obviously more efficient implementation. Yet I see it constantly. That pattern of `for foo in whatever: for bar in foo` is everywhere. People don't even think about it... they just write the looping part, then start thinking about the logic below. It's such a common thing I'm surprised compilers and interpreters don't just optimize it away :shrug:

Eliminate complicated stacks/frameworks (Docker, npm, React, multi-stores, clouds, etc.) and use the simplest alternative available (Solid, single `exe`s, htmx + Tailwind, only Postgres, normal hosting, etc.). Improve the data (structure) first if possible.

@njit on any numpy for loop that requires a recursive calculation. So many times I have been able to pull something out of my behind that got rid of the recursive element so the whole thing could be vectorized. Usually involves ranks and groups.

Also: less logging. Seen this pop up way too many times.

Log, but keep it in a buffer. On an unhandled exception or crash, emit all the logs relating to the exception (such as per request/RPC).

Agreed. If you're on AWS, CloudWatch can get really expensive really quickly.

On the back-end, application-level programming side:

* Look for O(N^x), that is, nested loops, even when they are not necessarily expressed as loops at the language level.
* If possible, get rid of ORMs in favor of raw SQL. Not because ORMs are very bad, but because almost nobody bothers to learn them; they often start causing issues under any non-trivial amount of load.
* Study data access patterns and figure out where and what composite indexes might help. I say composite indexes because I assume regular indexes are more or less always there, often even too many of them.

Especially with the last one I have achieved impressive results without any kind of impressive effort, just by setting aside some time to understand the code.

In my case, it was database optimizations that reduced the overall costs significantly:

* Instead of writing big complex queries with nested SELECTs, I split them into smaller bite-sized chunks that could be cached
* Better caching strategies: reducing how many caches were flushed when a change was made
* Tweaking the indexes on database tables to improve WHERE clauses
* Storing intermediate calculations in the database (for example, the number of posts a user has can be stored on the user table instead of being counted each time)

Once I had optimized the database, I could reduce the size of the DB and the server, as they no longer needed to work / wait as much.

This is a surprisingly uncommon technique, but: think deeply about what you're making the computer do, and ask it to do fewer things by being smarter about what you ask it to do. I'd say 95% of the time most of your order-of-magnitude gains will come from the above.

- Remove locks (lock-free algorithms)
- Delete as much code as possible

If the workload does not benefit from the cloud, then I just run it locally, because hardware is usually much cheaper and much faster than cloud.

umm, this is kinda tongue-in-cheek:

```
#include <time.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int i = 0;
    time_t timep;

    /*
     * ok so now we are printing something
     **/
    printf("Greetings!\n");

    /*
     * this is a for loop from 0..9
     **/
    for (i = 0; i < 10; i++) {
        time(&timep);
        localtime(&timep);
    }

    printf("Godspeed, dear friend!\n");
    return 0;
}
```

now, the canonical

```
$> gcc tz-test.c -o obj/tz-test
```

now do this

```
$> unset TZ
$> strace -ff ./obj/tz-test 2>&1 | grep 'local' | wc
     10      77     851
$> export TZ=:/etc/localtime
$> strace -ff ./obj/tz-test 2>&1 | grep 'local' | wc
      1       5      59
```

moral: always set TZ to keep localtime(3) from stat'ing /etc/localtime on every call :o)

lol, never knew about this, thanks.

Has this actually slowed down someone's code?

Assuming you gate things like merges on test results... remove end-to-end tests. Replace them with contract tests and service-level functional tests. Much faster feedback, and at the same time better coverage. The only serious problem with the approach is that it upsets the magical thinkers in your org. Often those folks are managers.

Making sane DB indices and constraints. It's amazing how often people just don't add indices even when the access pattern is clear from the outset. "Premature optimization is the root of all evil!" OK, so when are we actually going to add that index? (Answer: never.)

Usually, just checking access patterns and data structures (like a list being used where a set should be). Also, avoiding code that keeps pointers to many small pieces scattered all over memory, which leads to a bad cache-miss score. Finally, just good old profiling.

On 32-bit it remains -fomit-frame-pointer for me. On native compiles, -march=native.

I'm conflicted about this one. -march=native, definitely. Have you seen substantial gains from omitting frame pointers?

Meta rules that are often ignored:

1. Establish what is good enough
2. Measure, don't guess
3. Fix the biggest bottleneck first
4. Measure after fixing

And some general things:

5. Avoid micro-benchmarks (i.e. anything not at the whole-system level)
6. Be careful with synthetic data
7. Know your general estimates (e.g. cache, memory, disk, network speeds)

Profiling (e.g. Pyroscope) for better understanding (+ load tests).
More performant libraries with the same interface.
DB optimizations (e.g. indexing, denormalization, tuning, connection pooling).

Use explicit huge pages.

Better than transparent ones?