The worst month of the year in the Northern Hemisphere is no doubt February. But this February we got some technical vitamin D, as Carolina Cloud collaborated on a project with Apexomic, a bioinformatics consulting company based in Stevenage, UK. They offer bespoke data analytics for the life sciences sector, bringing more than 12 years of experience to the table with expertise in both biology and business. Derek and I felt a connection back to our East Anglian past (where it all began, in a number of ways).
Apexomic typically operate with on-prem compute, so this was a perfect opportunity to collaborate on a small R&D project generating peptide designs using BoltzGen. Briefly, BoltzGen is a molecular design tool that generates novel peptide sequences to bind target proteins; while it can technically run on a CPU, its generative design stage relies on repeated neural network inference, so in practice it requires a GPU for production-scale performance.
Apexomic had been running this on dual 10-core Xeon CPUs with 64GiB of RAM, hooked up to two GTX 1080s with 8GiB of VRAM each. This setup was producing around 2,000 peptide structure designs per day, but the goal was 30,000 peptide designs over the weekend, ready by Monday morning. This grabbed our attention: this was real scientific AI, not just some LLM benchmark. We were confident our RTX 5090, a far newer chip with over 4x the CUDA core count and double the VRAM at 32GiB, could help out.
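The deadline maths is quick to sanity-check. A minimal back-of-envelope sketch, using only the figures quoted above (the variable names and the assumed ~2.5-day weekend window are ours):

```python
# Feasibility check for the 30k-design deadline on the local machine alone.
LOCAL_DESIGNS_PER_DAY = 2000   # observed local throughput
TARGET_DESIGNS = 30_000        # required by Monday morning
WEEKEND_DAYS = 2.5             # roughly Friday evening to Monday morning (our assumption)

days_needed_locally = TARGET_DESIGNS / LOCAL_DESIGNS_PER_DAY
shortfall = days_needed_locally / WEEKEND_DAYS

print(f"Local machine would need {days_needed_locally:.0f} days")  # 15 days
print(f"That is {shortfall:.0f}x longer than the weekend allows")  # 6x
```

So the local rig was roughly a factor of six short of the deadline: hence the call.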
Apexomic provisioned a 20 vCPU, 32GiB container on our AMD EPYC 7742 with the GPU. Cheminformatics pipelines are notoriously hybrid CPU-GPU workloads, and BoltzGen is no exception: the pipeline consists of five steps, each using varying proportions of CPU and GPU compute. We started with an initial run of just 50 peptide structure designs.
Local: 25:17 minutes
Cloud: 26:25 minutes
Disappointing. The 5090 gave no speedup, with a little added container overhead. But monitoring GPU and CPU usage with nvtop and htop showed that, while the 1080s were both maxed out, the 5090 peaked at only 75% compute usage and spent most of its time hovering around 50%. Lots of capacity was being left on the table.
So we did some digging. The first BoltzGen step, Design, is the most critical: it is where the GPU does the heavy lifting and the peptides are actually designed, so clearly this was the step to focus on. Then we noticed the optional CLI flag --diffusion_batch_size. This controls how many peptide diffusion trajectories are generated in parallel per forward pass; increasing it lets the GPU process bigger batches simultaneously without changing the underlying algorithm. It was defaulting to 1: the GPU had to load all of its weights just to design one peptide at a time, which seemed fine for the 1080s but left the 5090 malnourished. We increased it to 25 and benchmarked on 4,000 peptide structures. Lo and behold, utilization of both the 5090's cores and its VRAM improved, resulting in a roughly 5x speedup of the Design step compared to the local machine:
==========================================================
Step                       Local 1080 machine   Cloud 5090 container
Step 1: Design                      19,839s              4,067s
Step 2: Inverse Folding                754s              1,004s
Step 3: Folding                     47,774s             32,570s
Step 4: Analysis                       897s             14,118s
Step 5: Filtering                       15s                 13s
==========================================================
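The batching gain fits a simple fixed-overhead amortization picture: each forward pass pays a roughly constant cost (weight loads, kernel launches) regardless of how many trajectories it carries, so bigger batches spread that cost out. A toy sketch of this model, with made-up per-pass figures (only the 19,839s and 4,067s Design times are real measurements from the benchmark above):

```python
import math

def design_time(n_peptides: int, batch_size: int,
                overhead_s: float = 4.0, per_peptide_s: float = 1.0) -> float:
    """Toy model: each forward pass costs a fixed overhead plus per-peptide work.
    The overhead and per-peptide costs here are illustrative, not measured."""
    passes = math.ceil(n_peptides / batch_size)
    return passes * overhead_s + n_peptides * per_peptide_s

t_batch1 = design_time(4000, 1)    # 4000 passes: overhead dominates
t_batch25 = design_time(4000, 25)  # 160 passes: overhead amortized
print(f"toy-model speedup: {t_batch1 / t_batch25:.1f}x")       # 4.3x

# The measured Design-step speedup from the benchmark above:
print(f"measured Design speedup: {19839 / 4067:.1f}x")         # 4.9x
```

The toy numbers were chosen to land near the observed ratio; the point is the shape of the curve, not the constants.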
A side note: we suspect there is a --diffusion_batch_size that is too high, hitting one of two ceilings. Either an OOM error, where the batch fills up too much VRAM, or time-slicing, where the memory fits but there is more compute than the GPU can handle at once; in that case the slowdown simply becomes linear in the number of peptides. But that's an experiment for another day.
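The OOM ceiling at least is easy to estimate if you know the memory footprint. A rough sketch, where every per-trajectory figure is a made-up placeholder (we have not profiled BoltzGen's actual VRAM usage):

```python
# Estimating the OOM ceiling on --diffusion_batch_size.
# All figures below except the 5090's VRAM are placeholders, not measurements.
VRAM_GIB = 32.0            # RTX 5090 VRAM
MODEL_WEIGHTS_GIB = 6.0    # placeholder: resident model weights and baseline activations
PER_TRAJECTORY_GIB = 0.5   # placeholder: extra VRAM each parallel trajectory needs

# Past this batch size the batch no longer fits in VRAM and we expect OOM;
# the compute (time-slicing) ceiling would typically kick in earlier or later
# depending on SM occupancy, which this sketch does not model.
oom_ceiling = int((VRAM_GIB - MODEL_WEIGHTS_GIB) // PER_TRAJECTORY_GIB)
print(f"OOM ceiling at batch ~{oom_ceiling}")  # 52 with these placeholder numbers
```

Watching VRAM in nvtop while stepping the batch size up would pin down the real per-trajectory cost.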
Those of you waiting for the plot twist might notice that Step 4, Analysis, is now lagging on our cloud machine, exposed by our success with Step 1. Analysis is a CPU-dominated step, so the immediate thought was to throw more CPU cores at it. The question arose: "Can you re-provision my instance with more vCPUs without me having to reinstall everything?" Our answer: YES! The latest feature added to Carolina Cloud is the ability to resize your container seamlessly. That means no data loss, no reinstallation of packages, no environment rebuild; just add or remove resources using the UI or API.
We resized the Apexomic container to 64 vCPUs and 36GiB of RAM and reran the 4,000 peptides. The Analysis step fell to 4,321s: dramatically better, though still slower than the local Xeons at 897s. Content with that final boost, we unleashed 30,000 peptide design structures over the weekend, and by Monday morning Apexomic had exactly the results they needed, beamed by SFTP right under the Atlantic to their local servers.
Here are my core takeaways from this project, in no particular order:
This is very much NOT an LLM-only, GPU-dominated world. This project was a true case of scientific AI requiring a balanced system of GPU and CPU usage.
Live resizing of the container unlocked the last speedup needed to hit go on the big 30k peptide design run. This saved the time we might otherwise have spent reinstalling packages and rebuilding the environment in a whole new container. Pipelines are like dates: they rarely reveal the extent of their needs over the first coffee. Carolina Cloud gives you a second chance; resize shamelessly and pretend it was the plan all along.
Sometimes good enough is… good enough. We were a bit surprised that the Analysis step was still slower on the cloud machine, even with the more powerful AMD EPYC 7742 and more cores. Our guess is that this comes down to the nature of the workload in BoltzGen: perhaps sustained clock speeds, memory bandwidth under load, and available RAM (36GiB vs 64GiB locally) mattered more than raw core count. We could have spent another week tuning memory, profiling cache misses, and diving into the BoltzGen code for multiprocessing defaults, all for the last 15% of optimization. But by Friday the 30k peptide run was feasible, and setting it off as-is likely got the results sooner. Don't wait for perfect to RSVP before starting the party.
Our key principles at Carolina Cloud are simplicity, cost and support, and this collaboration especially highlighted our support capabilities: rapid iteration and direct technical communication between us and the team at Apexomic, with no ticket-based abstraction. They brought deep domain expertise and clear project and performance goals, while Derek and I provided hardware-level insight and scientific programming experience. A true recipe for success.