Linum v2 text-to-video models
Today, we're launching two open-weight, 2B-parameter text-to-video models capable of generating 2-5 second clips at up to 720p. Download them to your local rig and hack away!
What is Linum?
As of January 2026, we are a team of two brothers operating as a tiny-yet-powerful AI research lab. We train our own generative media models from scratch, starting with text-to-video.
If this is v2, what was v1?
We started working on Linum in the Fall of 2022 (~3 years ago). We had just wound down our last attempt at a startup and were in-between things. With all that free time, we were able to kick back and enjoy our favorite hobby: watching movies. (We saw a couple of great movies that fall, from Barbarian to Tár, Bones and All, and Moonage Daydream. If you haven't checked these out, I'd recommend them!)
Growing up, we went to a school focused on the performing arts. I played bass, but my brother Manu spent his time making shorts in video production with his buddies. (Shout out to our alma mater, Pacific Collegiate School.) When Stable Diffusion sent the world into a frenzy that fall, Manu wondered if these models could be used to help directors storyboard more effectively. With that seed of an idea, we applied to Y Combinator and were accepted a week later for the W23 batch.
Talking to filmmakers, we learned pretty quickly that storyboarding was too niche, so we took a look at AI video. (Everyone has a different creative process; a bunch of directors skip storyboarding altogether during pre-production, e.g., building a shot list from photos of action figures posed against their laptop instead.) With such a wide range of preferences and limited budgets, we nixed storyboarding as a product. Back then, there were no generative video models. Instead, folks were algorithmically warping and interpolating AI images to simulate video. We shipped tools that helped musicians generate set visuals and music videos with these techniques, but by the end of the batch it was clear this was also a dead end. We didn't see a clear path from these psychedelic shorts to fully realized stories. So we shifted gears again, this time setting out to build our own text-to-video model.
For Linum v1, we bootstrapped off of Stable Diffusion XL. We extended the model to generate video by doubling its parameters and training on open-source video datasets. (We inflated the U-Net's 2D convolutions into factorized 3D convolutions and added new attention weights for time.) This way, we transformed the text-to-image model into a 180p, 1-second Discord GIF bot.
Some v1 model GIFs.
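For the curious, here's a minimal PyTorch sketch of what that kind of inflation looks like: a pretrained 2D convolution is wrapped with a depthwise temporal convolution initialized to the identity, so the inflated network starts out behaving exactly like the image model on each frame (temporal attention layers are added in a similar spirit). This is a generic illustration of the technique, not our v1 code; the class and parameter names are ours.

```python
import torch
import torch.nn as nn


class FactorizedConv3d(nn.Module):
    """Wrap a pretrained 2D conv with a depthwise temporal conv (a (2+1)D factorization)."""

    def __init__(self, conv2d: nn.Conv2d, temporal_kernel: int = 3):
        super().__init__()
        self.spatial = conv2d  # pretrained image-model weights, applied per frame
        channels = conv2d.out_channels
        self.temporal = nn.Conv1d(
            channels, channels, kernel_size=temporal_kernel,
            padding=temporal_kernel // 2, groups=channels,
        )
        # Identity init: a single centered tap of 1 and zero bias, so the inflated
        # layer reproduces the image model's output before any video training.
        with torch.no_grad():
            self.temporal.weight.zero_()
            self.temporal.weight[:, :, temporal_kernel // 2] = 1.0
            self.temporal.bias.zero_()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape                                   # (batch, channels, time, H, W)
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        x = self.spatial(x)                                       # 2D conv on every frame
        _, c2, h2, w2 = x.shape
        x = x.reshape(b, t, c2, h2, w2).permute(0, 3, 4, 2, 1)    # (b, H, W, C, T)
        x = x.reshape(b * h2 * w2, c2, t)
        x = self.temporal(x)                                      # mix information across frames
        x = x.reshape(b, h2, w2, c2, t).permute(0, 3, 4, 1, 2)    # back to (b, C, T, H, W)
        return x


# Usage: inflate one layer of a pretrained image U-Net.
pretrained = nn.Conv2d(320, 320, kernel_size=3, padding=1)
inflated = FactorizedConv3d(pretrained)
video = torch.randn(1, 320, 8, 24, 24)        # 8 frames of 24x24 feature maps
print(inflated(video).shape)                  # torch.Size([1, 320, 8, 24, 24])
```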
Extending an off-the-shelf image model into a video model is too hacky of a solution to work long term. The VAE bundled with an image model doesn't know how to handle video, so generation quality is kneecapped from the start. If you don't have the original image dataset that the model was trained on, it's really hard to smoothly transition to video generation, given how different these two distributions are. It costs a lot for a model to unlearn and relearn. At that point, you're better off building a model from scratch, with full control of every component of the model and dataset. So that's exactly what we did with Linum v2: build it from soup to nuts. (This release uses the Wan 2.1 VAE. We built our own temporal VAE in the Fall of 2024, which we'll open-source separately, but the Wan VAE was smaller than ours and worked just as well. We adopted theirs to save on data embedding costs.)
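To make the difference concrete, here's a back-of-the-envelope sketch of what a temporal VAE buys you over a per-frame image VAE. The defaults below (4x temporal and 8x spatial downsampling into 16 latent channels) are in the ballpark of Wan-style video VAEs, but treat them as illustrative rather than a spec.

```python
def latent_shape(frames: int, height: int, width: int,
                 t_down: int = 4, s_down: int = 8, z_channels: int = 16):
    """Rough latent-grid size for a causal temporal VAE.

    Assumes the first frame gets its own latent frame and each following group
    of `t_down` frames shares one, with `s_down`x spatial downsampling.
    Defaults are illustrative, not the spec of any particular VAE.
    """
    t_latent = 1 + (frames - 1) // t_down
    return (z_channels, t_latent, height // s_down, width // s_down)


# An ~5-second, 16 fps, 720p clip (81 frames at 1280x720):
print(latent_shape(81, 720, 1280))   # (16, 21, 90, 160)

# A per-frame image VAE with the same spatial factor would give 81 latent frames
# instead of 21: roughly 4x more latent tokens to embed, store, and diffuse over.
```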
Turns out, it's really hard to train a foundation model from scratch with just two people. You own every part of a process that usually takes half a dozen PhDs and several dozen people (at least). On data, you have to manage procurement (if you sell high-quality video data, hit us up at hello@linum.ai), training and deploying VLMs for data filtering (not to mention hundreds of hours spent manually labeling images and video for their aesthetic strengths and weaknesses), and captioning pipelines for 100+ years of video footage. On compute, you have to benchmark providers (let me tell you, an H100 from provider X doesn't work as well as an H100 from provider Y; and don't get me started on reliability across providers), negotiate prices, and then keep your cluster operational. On research, you have to read the constant influx of new papers, figure out how to sift the semi-true from the bullsh*t, and then run experiments on a reasonable budget to draw conclusions.
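As a rough picture of what "VLMs for data filtering and captioning" means in practice, here's a skeleton of one pass over a clip list. The `vlm_score` and `vlm_caption` callables stand in for whatever models you deploy; the names, fields, and threshold are hypothetical and not Linum's actual pipeline.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable


@dataclass
class Clip:
    path: Path
    aesthetic_score: float = 0.0
    caption: str = ""


def filter_and_caption(
    clips: Iterable[Clip],
    vlm_score: Callable[[Path], float],     # e.g. a VLM rating aesthetics in [0, 1]
    vlm_caption: Callable[[Path], str],     # e.g. a VLM writing a dense caption
    min_score: float = 0.6,                 # hypothetical cutoff
) -> list[Clip]:
    kept = []
    for clip in clips:
        clip.aesthetic_score = vlm_score(clip.path)
        if clip.aesthetic_score < min_score:
            continue                          # drop weak footage before the pricier captioning pass
        clip.caption = vlm_caption(clip.path) # captions become the text side of training pairs
        kept.append(clip)
    return kept
```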
It's taken us two years. But, we're really excited to ship a model that's truly ours.
Where do we go from here?
We believe that access to financing is the limiting reagent for narrative filmmaking. It costs a lot of money to make a movie, and it's really hard to raise money to make your movie. If we can reduce the cost of production by an order of magnitude, we can enable a new generation of filmmakers to get off the ground.
Specifically, we're interested in improving the accessibility of animation. (Indie animation like Flow is a testament to creativity and willpower. But at a price point of $3-4M, that's still too expensive to make animated filmmaking truly accessible to anyone.) We view generative video models as "inverted rendering engines". Traditional animation software like Blender models physics from the ground up. As of today, this is the better approach to modeling the real world, but it creates software that is very hard to use. (Existing animation software is functionally rich but semantically poor; i.e., you can do anything, but it's really hard to do something.) In contrast, generative video models learn lossy, often-inaccurate physics, but offer the possibility of creating more semantically meaningful controls through training.
We believe that by building better text-to-video models (more aesthetically pleasing, more physically plausible) we can support high-quality animation while developing much more intuitive creative tools. This should make it easier to go 0 -> 1 and open the doors for a new band of storytellers.
Linum v2 is a huge stepping stone for us, but truthfully we have a long way to go to realize this vision (just take a look at some of the flawed generations below).
Over the next few months, we're going to start by addressing issues with physics, aesthetics, and deformations within this tiny 2B model footprint through post-training (and a couple of other ideas). (Today's checkpoints are raw, just pretrained weights.) From there, we'll be working on speed enhancements through popular techniques like CFG and timestep distillation. (Right now, it takes ~15 minutes to generate a 720p, 5-second clip on an H100 with 50 steps.) And most importantly, we'll be working on audio capabilities and model scaling.
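For context on where that time goes: vanilla classifier-free guidance runs the model twice per denoising step, so a 50-step sample costs roughly 100 forward passes. CFG distillation folds the guided prediction into a single pass, and timestep distillation shrinks the step count itself. Below is a generic sketch of the un-distilled step, not our sampler; the guidance scale and names are illustrative.

```python
import torch


@torch.no_grad()
def cfg_denoise_step(model, x_t, t, text_emb, null_emb, guidance_scale=5.0):
    """One denoising step with vanilla classifier-free guidance.

    Two forward passes per step; at 50 steps that's ~100 model evaluations per
    clip. CFG distillation bakes the guided prediction into one pass, and
    timestep distillation reduces the number of steps.
    """
    eps_cond = model(x_t, t, text_emb)     # prediction conditioned on the prompt
    eps_uncond = model(x_t, t, null_emb)   # prediction with a null / empty prompt
    # Push the sample toward the prompt-conditioned direction.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


# Example with a dummy model standing in for the video backbone:
model = lambda x, t, emb: torch.randn_like(x)
x_t = torch.randn(1, 16, 21, 90, 160)      # a latent video grid (see the VAE sketch above)
out = cfg_denoise_step(model, x_t, torch.tensor(999), None, None)
```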
We're going to be blogging through everything we've done so far and everything that we're going to be working through. So, if you're interested in the sort of writing that sits at the intersection of applied research and engineering, subscribe to Field Notes.
Acknowledgments
Special thank you to our investors and infrastructure partners for their continued support.