Cloud On My Side: LTT Folding@Home Month VII


Introduction

Folding@home is a distributed computing project that harnesses the power of volunteers' computers to simulate protein folding, helping researchers better understand diseases like Alzheimer's, Huntington's, various cancers, and viruses like SARS-CoV-2. By running distributed simulations on tens of thousands of computers worldwide, Folding@Home contributed to 226 scientific research papers by 2021. It was also the first exascale computer in 2020, two years before Frontier at Oak Ridge National Laboratory. Folding users often self-organize into teams to experience community, help new folks join up, and promote folding through friendly competition.

As someone who recently had a close family member diagnosed with cancer, this cause has become increasingly personal to me. When the Linus Tech Tips (LTT) team announced their 7th Folding Month competition in September 2024, I knew it was time to get off the sidelines. I'd heard about Folding@Home in the past but had never taken the initiative to learn how to optimize my computer for folding or avoid potential pitfalls. (Foreshadowing is a literary device in which a writer gives an advance hint of what is to come later in the story.) Please, dear reader, do not mistake me for an expert on Folding@Home or the cloud; I'm merely curious and/or foolish enough to give it a try and share what I've found.

In this note, I'll share my journey, from my initial setup and the challenges I faced to the cloud computing solutions that led to the moniker "KevTheConqueror" by day 13 of the event. Along the way, I'll touch on the technical details of the different "folding rigs" I've dabbled with, share performance metrics and cost optimization strategies, and reflect on the lessons I learned and still hope to learn.

💡

Some folks on the LTT folding team have asked for a guide to Folding@Home in the cloud, which is forthcoming but not included in this post. Sign up if you'd like to be notified when the guide is published.


Getting Started

I began my folding journey with my personal gaming computer, eager to put its beefy GPU to work for a good cause. We'll keep track of the different system configurations and their observed performance, measured in points per day (PPD), as we go. PPD will vary based on which projects are available, so I'll share the PPD I observed in a specific project, #16525. This project from the University of Wisconsin-Madison targets Alzheimer's research and seems to best utilize high-end GPUs at the time of writing. Actual PPD for a given day will be lower as the system assigns a mix of available projects that take less advantage of the GPU and award fewer points overall.

Rig Gaming PC
OS Windows 11
CPU 8-core AMD Ryzen 5800X3D
Memory 64GB DDR4 3200
GPU Nvidia RTX 4090 AD102
Network 1Gb
Cost/GPU/hr $0.056 (electricity)
PPD/GPU (16525) 31,000,000
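
For reference, here's roughly how that electricity figure works out. The wattage and utility rate below are illustrative assumptions rather than measurements, chosen to land on the $0.056/hr in the table above:

```python
# Back-of-the-envelope electricity cost per GPU-hour.
# Both numbers are assumptions for illustration, not measured values.
gpu_power_watts = 400   # assumed average board power while folding
rate_per_kwh = 0.14     # assumed utility rate in $/kWh

cost_per_gpu_hour = (gpu_power_watts / 1000) * rate_per_kwh
print(f"${cost_per_gpu_hour:.3f}/GPU/hr")  # -> $0.056/GPU/hr
```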

I'd let the first few days of the folding month pass me by (life got in the way), but on day four I was ready to begin, heartened by the hope that the Nvidia RTX 4090 would help me catch up quickly. After installing the required client software, enabling my GPU, and leaving CPU folding turned off (it provides a minuscule PPD in comparison, and my CPU runs hot in a small form factor case), I finally clicked "Fold" for the first time! However, I quickly ran into a mysterious issue: the folding client would download a new fraction of a project called a Work Unit (WU), begin protein folding, and then quickly fail, only to download a new WU and repeat the cycle. Examining the folding client logs, I was met with the following message:

⚠️

Error: Particle coordinate is nan

Excuse me? What do you mean my particle coordinate is nan? As a newbie, this error left me puzzled. I let the client continue trying additional WUs while I searched for possible solutions (this would quickly come back to bite me). Common recommendations were to check system stability for issues like overheating or unstable overclocks. "But my system is top-notch!", I thought to myself. Forgoing CPU folding meant my system ran chilly, as Nvidia's 4090 FE has a monstrous 3-slot heatsink that keeps temps in check. Yes, it was overclocked, but it was a stable overclock that I'd validated in MSI Afterburner, synthetic benchmarks, and many games. I was a fool: it was the overclock. Immediately upon resetting the overclock, the folding client began producing stable WUs without issue.

However, the damage was done.

In addition to awarding points for completed work units, Folding@Home awards bonus points to incentivize users to complete work units quickly and reliably. That reliability threshold is an 80% success rate, and allowing my system to insta-fail over 6,000 WUs while I diagnosed the overclocking issue badly damaged my standing within the bonus structure. My PC was completing dozens of WUs per day, but at the non-bonus rate. It seemed the folding month would be long over by the time my reliability recovered to an 80% success rate and I was again eligible for bonus awards. This is no small bonus, either: losing it equated to roughly a 90% reduction in points per WU for my particular rig. I was humbled, as my overclock seemed to have cost me any chance of performing well in the folding month competition, but I decided to keep at it for the scientific contributions and resolved to find a way to catch up if at all possible.
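
For context, the bonus in question is the Quick Return Bonus. As I understand the points FAQ, the base credit for a WU is multiplied by a speed factor derived from a project-specific constant and how quickly the WU is returned, but only when a passkey is in use and at least 80% of recent WUs have completed successfully. A rough sketch (the constants are project-specific and illustrative):

```python
import math

def wu_credit(base_points: float, k: float, deadline_days: float,
              days_to_complete: float, bonus_eligible: bool) -> float:
    """Approximate Folding@Home credit for a single work unit.

    Rough sketch of the Quick Return Bonus (QRB): bonus_eligible means a
    passkey is set and >=80% of recent WUs completed successfully.
    k, deadline_days, and base_points are all set per project.
    """
    if not bonus_eligible:
        return base_points  # flat base credit only
    speed_factor = math.sqrt(k * deadline_days / days_to_complete)
    return base_points * max(1.0, speed_factor)

# A fast GPU returning a WU in hours earns many multiples of the base credit;
# lose bonus eligibility and the same WU pays out only base_points.
```

With the multiplier gone, a 4090 is credited at the same flat base rate as far slower hardware, which is where the roughly 90% drop in my case came from.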

Thankfully, this was not so difficult to resolve. Bonus status is attached to a construct called a passkey, a unique user identifier. However, point totals (both in the official Folding stats and the LTT Folding Month competition) are accumulated to a username, not a passkey. Inherent in this setup is the implication that one could sign up for a new folding account with a new email address to obtain a new passkey with a clean bonus record while keeping the same username, since the folding system does not require a unique username on signup. Thankfully, this worked, and I was quickly up and running with much-improved production on my PC after days of delays and issues, reliably churning out high-quality protein folding for science. Still, I felt a desire to make up for lost time, both because of my 3-day delay in joining the competition and the overclocking issues.


Enlisting the help of someone else's computer

Determined to catch up, I turned to TensorDock, a cloud computing platform that specializes in providing GPU instances for machine learning and high-performance computing workloads. Why TensorDock and not a traditional cloud provider? For the most part, cloud providers will only rent you a professional GPU, like Nvidia's H100 or perhaps the cheaper Nvidia RTX 6000 if you're lucky, likely due to Nvidia's driver EULA preventing gaming cards from being used in data centers:

"2.8 You agree that GeForce or Titan SOFTWARE: (i) is licensed for use only on GeForce or Titan hardware products you own, and (ii) is not licensed for datacenter deployment."

High-end data center GPUs and their associated cloud instances are much too costly to rent for a fun little folding competition, and Folding@Home doesn't need the additional GPU memory, tensor cores, or other optimizations that make these data center-class GPUs exponentially more expensive than gaming-class GPUs like the Nvidia RTX 4090. TensorDock, on the other hand, is a marketplace where you can rent someone else's 4090 system. For context, an on-demand H100 instance will run you at least $5/GPU/hr in AWS, while on-demand 4090 instances via TensorDock were roughly $0.38/GPU/hr at the time of the competition. Clearly, one of these is more viable for messing around with for science and trying to win a competition.

With TensorDock, I also saved time: there's no VPC to set up and no security groups or policies, simply a VM with SSH access and a public IP. They even provide an Ubuntu Server 22.04 image with updated Nvidia drivers and CUDA pre-installed (required for folding). Equipped with multiple Nvidia RTX 4090s per instance, I octupled my folding output overnight. The Linux-based instances also provided slightly better points per day (PPD) than my Windows machine, likely due to reduced overhead. One interesting note on the rig details: because TensorDock is a marketplace matching renters with owners rather than a cloud provider, instance configuration and pricing are inconsistent and can swing significantly, though pricing is locked in for as long as you keep instance resources allocated.

Rig              TensorDock 1              TensorDock 2
OS               Ubuntu Server 22.04       Ubuntu Server 22.04
CPU              8-vCPU AMD EPYC 7282      8-vCPU AMD EPYC 7282
Memory           16GB DDR4 3200            16GB DDR4 3200
GPU              3x Nvidia RTX 4090 AD102  4x Nvidia RTX 4090 AD102
Network          10Gb                      10Gb
Cost/GPU/hr      $0.38                     $0.35
PPD/GPU (16525)  32,000,000                32,000,000

⚠️

The TensorDock systems displayed a consistently slow upload of finished work unit results, despite their claimed 10Gb network connection.

So you want to fold from the command line?

Setting up the Folding@Home client on these instances was not exactly a breeze as support and documentation are limited. The official Folding@Home FAQ has this to say about running on Linux systems without a desktop environment:

"Note: There is no install guide or support in the forum for this type of expert-only installation."

Working with my esteemed colleagues at Oxide can certainly make me feel non-technical at times, but this was a good reminder that everything's relative. In this context, I've become an "expert" doing an "expert-only installation" by daring to work from a Linux command line. This guide from Gorgon on the LTT forum, written for folding on Ubuntu with a desktop environment installed, helped me start in the right direction, and I quickly found the config.json file where I could specify particulars like my account and passkey in the absence of a client GUI. His guide also references a separate guide on folding with Ubuntu Server but, frankly, I clicked away when I saw it was based on Ubuntu 16.04 from 2016. This was also the beginning of my desire to write up my own guide in the future, both for folding on Ubuntu Server and for folding using cloud instances.
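
For the curious, here's the sort of thing I mean by editing config.json directly. Treat this as a hypothetical sketch: the file location and field names ("user", "team", "passkey") vary by client version, so check them against your own installation rather than copying blindly.

```python
import json
from pathlib import Path

# Hypothetical path and field names; adjust for your client version and layout.
config_path = Path("/etc/fahclient/config.json")

config = json.loads(config_path.read_text())
config.update({
    "user": "KevTheConqueror",    # username that accumulates points
    "team": 0,                    # your team number
    "passkey": "<your-passkey>",  # ties WUs to your bonus record
})
config_path.write_text(json.dumps(config, indent=2))
# Restart the folding client service afterwards so it picks up the changes.
```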


AWS Spot Instances Strategy

While TensorDock boosted my folding performance, I still wanted to see if I could optimize costs and performance further with AWS Spot Instances. Spot Instances allow you to bid on unused EC2 capacity at a steep discount, up to 90% off the on-demand price. My optimism was bolstered by success with TensorDock, and I looked forward to the challenge of deploying spot instances for the very first time. Also tantalizing was AWS' addition of Nvidia L40S-backed instances just three months prior. I feel this GPU is an under-appreciated sleeper in Nvidia's lineup, where the AI-crunching H100 and B100 get all the glory. Tuned for AI fine-tuning and inference while retaining graphics capability, it's a versatile option for customers who don't need to train the next ChatGPT but do need something easy to power and cool in a traditional data center. These instances were appealing for a few reasons:

  • It shares the same AD102 GPU die as the Nvidia RTX 4090, promising similar or better performance
  • Far cheaper than H100 instances, as low as $0.19/hr
  • Available with minimal CPU and memory (little is needed for folding) compared to most other GPU instances

Rig AWS g6e.xlarge
OS Ubuntu Server 22.04
CPU 4-vCPU AMD EPYC 7R13
Memory 32GB DDR4 3200
GPU Nvidia L40S AD102
Network 20Gb
Cost/GPU/hr $0.1861
PPD/GPU (16525) 34,000,000

To streamline the process of spinning up new instances, I attempted to follow this AWS blog from the height of the pandemic when they documented this precise use case. However, I quickly ran into a few issues with the provided CloudFormation template:

  • It uses an Amazon Linux-based AMI. I have no real experience with RPM-based Linux distros, which would have limited my ability to configure or troubleshoot if something went wrong or needed to be updated.
  • The AMI is generally outdated: key packages such as the Nvidia drivers, CUDA, and the folding client, to name a few, are all old versions. It would likely be easier to start from scratch than to bring everything up to date.
  • It uses many different instance types with various GPUs, not including the L40S. This would lead to lower performance and a mix of possible costs.
  • It spreads instances over many regions, making costs even more variable.

However, I resolved to tinker with the template, with favorable results after completing the changes below (a rough sketch of the resulting spot request follows the list):

  • Create and point the template to my own Ubuntu Server 22.04 AMI with the latest drivers, CUDA, and Folding@Home client pre-installed. I used a low-cost on-demand g6.xlarge instance to set up this "golden image."
    • This initially caused an issue where the Folding@Home server saw all of my instances as a single client and got extremely flustered, resulting in only one instance folding at a time. This GitHub issue, where the same problem arose from cloned machines, pointed me toward a solution: I booted up my manually configured instance, deleted the folding client.db file, and then took a new snapshot for an updated AMI version. This way, on boot, each instance generates a new client.db and is recognized correctly by the server-side folding software.
  • Restrict the template to only deploy in the cheapest availability zone: us-east-1c.
  • Restrict the template to only deploy g6e.xlarge instances.
  • Set a spot instance cost cap to prevent unexpected price fluctuations from blowing my budget (I also had billing alarms configured).
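
Pulling those tweaks together, here's a rough boto3 sketch of the spot request the modified template effectively makes. The AMI ID, key pair name, and price cap are placeholders, and note that the "stop" interruption behavior requires a persistent spot request.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder: the golden Ubuntu 22.04 AMI
    InstanceType="g6e.xlarge",         # 1x L40S, 4 vCPUs, 32GB
    MinCount=1,
    MaxCount=1,
    KeyName="folding-key",             # placeholder SSH key pair
    Placement={"AvailabilityZone": "us-east-1c"},  # the consistently cheapest AZ
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "MaxPrice": "0.25",                      # cost cap, placeholder value
            "SpotInstanceType": "persistent",        # required for stop-on-interrupt
            "InstanceInterruptionBehavior": "stop",  # resume WUs when capacity returns
        },
    },
)
print(response["Instances"][0]["InstanceId"])
```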

Still, no dice

A final hurdle remained: my instances were failing to provision at the last moment. What could have gone wrong? What had I overlooked? Well, you see, I'd used the slightly different g6.xlarge (non-"e") instance via typical on-demand provisioning to do my initial testing and AMI preparation. I figured a spot instance wasn't suitable for this step, since it could be interrupted at any time. What I didn't realize, however, was that while we, the unwashed masses, are allowed to provision "g" series instances on-demand in AWS, we do not have the privilege of deploying those same instances via spot instance requests by default.

These spot instance requests are heavily restricted by the AWS quota system, with a default account quota of zero and the option to open a support case explaining why you'd like to use their unused capacity and how many vCPUs' worth of instances you'd like to request. I proceeded to request 40 vCPUs, which equates to a limit of up to 10 GPUs via g6e.xlarge instances (4 vCPUs each), and was approved quite quickly. I may sound salty because this slowed down the progress I was excited to make, but this system is in place for a good reason: we've all read stories of folks getting over their skis in the public cloud and racking up an unexpected, massive bill. AWS takes these steps to ensure folks ramp up gradually when dabbling with more expensive or convoluted services, and I appreciate that.

nvidia-smi output of an L40S running project 16525 showing 100% utilization
Project 16525 makes the Nvidia L40S in this spot instance work hard for its points!

With that out of the way, I bid strategically for g6e.xlarge instances in us-east-1c and provisioned instances at just under 19 cents per hour—a 90% discount compared to on-demand pricing and up to a 50% discount compared to Nvidia RTX 4090 GPUs on TensorDock. This allowed me to run more instances for longer, maximizing my folding output while keeping costs in check. While spot instances can be interrupted at any time by other customers willing to deploy on-demand or reserved instances, I've noticed only very minor disruption to my instances, maybe one every day or two for an hour or so; they've been remarkably consistent for the price. Setting interruption behavior to shut down instead of terminate also helped some spot instances resume work units when new capacity became available.
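
If you want to sanity-check availability-zone pricing yourself before committing, a query like the following (the lookback window is arbitrary, and pagination is ignored for brevity) will print the most recent g6e.xlarge spot price per AZ:

```python
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Pull the last week of g6e.xlarge Linux spot prices.
history = ec2.describe_spot_price_history(
    InstanceTypes=["g6e.xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.now(timezone.utc) - timedelta(days=7),
)

# Keep only the newest price seen for each availability zone.
latest = {}
for entry in sorted(history["SpotPriceHistory"],
                    key=lambda e: e["Timestamp"], reverse=True):
    latest.setdefault(entry["AvailabilityZone"], entry["SpotPrice"])

for zone, price in sorted(latest.items()):
    print(f"{zone}: ${price}/hr")
```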

Graph showing the past 3 months' pricing for AWS g6e.xlarge instances by availability zone, with 1c notably cheaper than the other options
Sidebar: does anyone know why it's consistently so much cheaper in 1c? Spot placement scores were similar for all availability zones.

Competition Success and Lessons Learned

The combination of my initial Nvidia RTX 4090, TensorDock's multi-GPU instances, and cost-optimized AWS Spot Instances has me on the path to victory in the LTT Folding Month VII competition. But more than just the numbers, this journey taught me valuable lessons:

  1. Stability is key: Prioritizing stability over peak performance was crucial in maintaining a high work unit success rate and earning bonus points. Slow and steady truly does win the race in distributed computing. I'll also be using Folding@Home (without my passkey enabled) to test GPU overclocks in the future.
  2. Public cloud is great for spot workloads: Platforms like TensorDock and AWS offer incredible flexibility, enabling individuals to make an impact. By leveraging spot instances and automating deployment, I achieved performance rivaling entire folding teams. For long-term enterprise workloads (if you're spending ~$40K or more per month), an Oxide Cloud Computer is likely more cost-effective and easier to use.
  3. The folding community is incredible: From the LTT forum to the Folding@home forum, I was constantly amazed by the dedication, knowledge, and generosity of the folding community. Their support and collaboration were instrumental in my success.
  4. Every contribution counts: Whether folding on a single GPU or a fleet of cloud instances, every completed work unit brings us one step closer to understanding and treating diseases. No contribution is too small, and every bit of progress is worth celebrating.

What I'm looking forward to

  1. Testing public IPv6 addresses for the AWS instances to further optimize costs; AWS recently started charging for public IPv4 addresses (good on them).
  2. Writing a guide for others to replicate my AWS spot instance configuration and Ubuntu Server generally.
  3. Learning how to connect my instances to this popular Folding@Home database to populate the Nvidia L40S average PPD for everyone to view.
  4. Even more research papers aided by the Folding@Home project.
  5. Regular participation in LTT folding events as a new member of the community.

Folding@Home donor statistics showing a top-1,000 all-time rank
How my Folding@Home production skyrocketed with the help of cloud computing

Conclusion

My experience with Folding@home has been a roller coaster, from frustrating early setbacks to taking the lead in the LTT competition. It reminded me of the optimism required to persevere, the importance of stability and reliability, and the potential for individuals to contribute to groundbreaking scientific research.

I hope that by sharing my story, I can inspire others to get involved and make a difference. Whether you're a seasoned folder or new to the world of distributed computing, there's a place for you in the folding community. Start small, embrace your natural curiosity, and don't be afraid to experiment.