Show HN: TPI – Terraform provider for ML and self-recovering spot-instances

12 points by dmpetrov 4 years ago · 4 comments

dmpetrov (OP) 4 years ago

Hey all, we are launching Terraform Provider Iterative (TPI).

It was designed for machine learning (ML/AI) teams and optimizes CPU/GPU expenses:

1. Spot instance auto-recovery (if an instance is evicted or terminated) with data and checkpoint synchronization

2. Auto-terminate instances when ML training is finished - you won't forget to terminate your expensive GPU instance for a week :)

3. Familiar Terraform commands and config (HCL)

The secret sauce is the auto-recovery logic: it is built on cloud auto-scaling groups and does not require any monitoring service to run (another cost saving!). The cloud provider recovers the instance for you. TPI just unifies auto-scaling groups across AWS, Azure, GCP and Kubernetes. Yeah, it was tricky to unify all of them :)
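
Recovery only pays off if the training script picks up from its last checkpoint when it restarts on a fresh instance. Here is a minimal Python sketch of that pattern, assuming the checkpoint file sits in a directory that gets synchronized. The function names, the JSON checkpoint format, and the `stop_after` parameter (which simulates an eviction) are all hypothetical and not part of TPI:

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "loss": None}


def save_checkpoint(path, state):
    """Write atomically so an eviction mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename over the old checkpoint


def train(path, total_epochs, stop_after=None):
    """Run (or resume) training; stop_after simulates a spot eviction."""
    state = load_checkpoint(path)
    ran = 0
    while state["epoch"] < total_epochs:
        if stop_after is not None and ran >= stop_after:
            break  # simulated eviction: the process stops here
        state["epoch"] += 1
        state["loss"] = 1.0 / state["epoch"]  # placeholder for a real training step
        save_checkpoint(path, state)
        ran += 1
    return state
```

Calling `train("checkpoint.json", 5, stop_after=2)` and then `train("checkpoint.json", 5)` continues from epoch 2 instead of starting over, which is the behavior a restarted instance needs.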

We'd love to hear your feedback!

toisanji 4 years ago

Awesome, this project is from the team behind Data Version Control (DVC) and CML. I’ll give it a try!

dmpetrov (OP) 4 years ago

Yes, it's from the DVC/CML team! We started TPI as a "computational backend" for the CML project (CI/CD for ML), but then we realized that it could be useful as an independent tool.

ogazitt 4 years ago

Auto-scaling for ML workloads, integrated with the TF workflow - very cool!
