What machines ask of video
Video systems serve a small set of recognizable workloads. Most production data pipelines around video are some combination of:
Full-Clip Playback
A simple query that touches every frame in the file, making it the most decode-heavy of any access pattern and an exact analog to human-oriented linear playback.
Thumbnail Extraction
One representative frame at time t, often the first or middle frame of a clip, for indexing and previews. Because the overhead is low, thumbnails are often pre-materialized, but there are still cases where very sparse single-frame reads are required without knowing the frame index up front.
Deterministic Evaluation Samples
K frames evenly spaced across the whole clip for reproducible scoring during evaluation — linspace(0, duration, K). Spreading the samples across the whole file stresses the decoder at many disjoint positions rather than within a single window.
Down-Rate Playback
Every Nth source frame at a target cadence (e.g. 10 fps from a 30 fps source) — common preprocessing for video-LLM ingest where the source frame rate is more than the model needs.
Event-Aligned Clip
Frames before and after an anchor t at a target sampling rate. The event at time t typically comes from a label, lidar pulse, action timestamp, or other annotation.
Scene-Boundary Detection
The first frame of every scene. This is another pre-processing stage that is typically materialized ahead of time, but the scene-detection parameters sometimes need tuning after the fact, for example to handle a fade-out. This access pattern finds the first frame of each scene by measuring the amount of change between adjacent frames, and the result often aligns with where a video encoder would itself insert keyframes.
Scattered Timestamps
K independent timestamps randomly distributed across the clip. This is the contrasting case to dense window queries: in the worst case, every output frame depends on a different keyframe.
Despite their differences, all of these access patterns reduce to a query over a set of (file, timestamp, output_shape) tuples, possibly aligned across modalities and possibly batched across many files. The job of a multimodal warehouse is to efficiently serve the required access patterns from compressed bytes all the way to on-device RGB tensors, all while keeping within reasonable limits of storage, compute, and networking cost.
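As a rough illustration of that reduction, the sketch below turns three of the patterns above into plain timestamp lists and then into (file, timestamp, output_shape) tuples. The function names, the file name clip.mp4, and the 224×224 output shape are assumptions made for this example, not any particular reader's API. Times are exact fractions of a second so that a cadence such as 1/10 s does not accumulate floating-point drift.

```python
import math
from fractions import Fraction

def evenly_spaced(duration, k):
    """K timestamps from 0 to duration inclusive, like linspace(0, duration, K)."""
    if k == 1:
        return [Fraction(0)]
    return [Fraction(duration) * i / (k - 1) for i in range(k)]

def down_rate(duration, target_fps):
    """Timestamps at a fixed integer target cadence covering [0, duration)."""
    count = math.ceil(Fraction(duration) * target_fps)
    return [Fraction(i, target_fps) for i in range(count)]

def event_aligned(t, n_before, n_after, fps):
    """n_before samples before anchor t and n_after from t onward, at integer fps.
    Times may fall outside the clip; clamping is left to the caller."""
    return [Fraction(t) + Fraction(i, fps) for i in range(-n_before, n_after)]

# Hypothetical query set for one file: every pattern ends up as the same tuple shape.
timestamps = event_aligned(t=2, n_before=8, n_after=8, fps=10)
queries = [("clip.mp4", ts, (224, 224)) for ts in timestamps]
print(len(queries), float(timestamps[0]), float(timestamps[-1]))
```

Scattered timestamps and scene-boundary results fit the same shape; only the way the timestamp list is produced changes.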
Playback isn't the antagonist here. It's the workload that existing video formats and tools were built for, and the warehouse inherits all of that machinery. The interesting question is how to use the same machinery for a differently shaped demand.
Why H.264 in MP4?
This deep dive uses H.264 inside MP4 because it is the mainstream baseline many datasets already contain: common in production video collections, broadly supported by tools, and widely hardware-accelerated through NVDEC, VideoToolbox, AMD VCN, and Intel Quick Sync hardware decode paths. It is also old enough that its container/codec split is well understood. The point is not that H.264 is the only interesting format. It is that the warehouse problem shows up in the ordinary format people already have.
H.264 also keeps the mechanics legible. The MP4 container exposes the timing and byte-address side; the H.264 bitstream exposes the prediction graph; and that split is close enough to H.265 and AV1 that the same planning vocabulary carries over. Newer codecs mostly make the constants harsher: they spend more decode complexity to save delivery bandwidth, which is the right tradeoff for playback and a dangerous default when a warehouse pays for decode again on every sampled frame.
AV2 makes the direction clear. VideoLAN's dav2d note describes AV2 decoding as “roughly five times” AV1's complexity and says that software on today's machines will struggle to hit real-time without architecture-specific optimization.
Tracing a query
We'll follow the access pattern selected above through the rest of the chapter. Right now that is event-aligned clip: 8 before, 8 after @ 10 fps · 16 frames over 1.60s · t = 0.00s.
That query is relatively simple, but there are many stages to go through before the model sees a usable tensor. The deep dive will walk through each of the following stages in enough detail to help you build an intuition for the costs involved.
timestamps -> presentation frames -> compressed samples (PTS -> DTS) -> decode closure -> byte ranges -> decoder bitstream -> decoded YUV surfaces -> RGB / crop / resize / normalize -> tensors
Timestamps are not frame numbers
The first edge in the plan is timestamps -> presentation frames. Sometimes that looks like arithmetic: if the clip is truly fixed-rate and the query lands exactly on the frame grid, t × fps gets you the right displayed frame. The trouble is that a reader cannot assume those conditions. The query arrives in seconds because upstream systems speak in seconds: labels, camera clocks, subtitles, scene changes, and sampling policies. The file answers in frame identities.
A frame number is only meaningful after you say which timeline it belongs to. “The 480th displayed frame” is a presentation index. “The frame at 16 s” is a time query. “All frames every 100 ms” is a sampling policy. Those can line up in a tidy clip, but they are not the same request.
The container assigns each presented frame a presentation timestamp (PTS) and a duration, both stored as integer ticks on the track's timescale. For constant-frame-rate video, those intervals usually form a tidy grid. For VFR video, dropped-frame pipelines, and screen recordings, they may not. Either way, the reader has to compare the query timestamp to the actual PTS intervals, not to a rounded fps label.
The selection rule is part of the workload, not something the container decides for you. Some readers pick the nearest PTS. Some pick the frame whose interval contains t. A scene-cut query may ask for the first frame after each cut. The important thing is not that one policy is universally right; it is that the policy is explicit before later stages talk about bytes or tensors.
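A minimal sketch of two such policies, assuming the reader has already loaded sorted presentation timestamps and durations as integer ticks. The track below is invented for illustration (a 30000-tick timescale where one frame is held for three nominal frame times), and the function names are hypothetical.

```python
from bisect import bisect_right
from fractions import Fraction

def to_ticks(t_seconds, timescale):
    """Convert seconds to (possibly fractional) ticks on the track timescale."""
    return Fraction(t_seconds) * timescale

def frame_containing(pts, durations, timescale, t_seconds):
    """Index of the frame whose [pts, pts + duration) interval contains t, else None."""
    tick = to_ticks(t_seconds, timescale)
    i = bisect_right(pts, tick) - 1
    if i < 0 or tick >= pts[i] + durations[i]:
        return None
    return i

def frame_nearest(pts, timescale, t_seconds):
    """Index of the frame whose PTS is closest to t (ties go to the earlier frame)."""
    tick = to_ticks(t_seconds, timescale)
    i = bisect_right(pts, tick)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(pts)]
    return min(candidates, key=lambda j: abs(pts[j] - tick))

# Made-up variable-frame-rate track: frame 2 is held for three nominal frame times.
timescale = 30000
pts = [0, 1001, 2002, 5005, 6006]          # sorted presentation timestamps, in ticks
durations = [1001, 1001, 3003, 1001, 1001]

t = Fraction(14, 100)                      # query at 0.14 s
print(frame_containing(pts, durations, timescale, t))
print(frame_nearest(pts, timescale, t))
print(int(t * Fraction(30000, 1001)))      # naive t * fps using the nominal 29.97 label
```

On this track the three answers disagree, which is the reason to name the policy explicitly instead of relying on t × fps.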
MP4 answers these questions with index blocks. Some describe timing, some mark random-access frames, some describe codec setup, and some point from samples to bytes. In a classic MP4, most of that metadata sits in the movie index; in a fragmented MP4, smaller metadata runs can appear throughout the file beside the media data.
This structure is what tooling caches as a frame map. ffprobe can print a normalized view of frame PTS, duration, keyframe flags, and packet positions. TorchCodec can cache the same kind of timing data as frame_mappings. Those maps make timestamp lookup cheap and deterministic. They identify which presentation frames the query wants. The later sections explain the cost of fetching and decoding those frames.
Compression shifts work to decode
Video compression exploits two kinds of structure. Spatial redundancy is what every image codec starts from: pixels close together in a frame tend to have similar values. Frequency-domain transforms — DCT in JPEG and H.264, wavelets in JPEG 2000 — decorrelate that spatial redundancy into a sparse set of coefficients that quantize and entropy-code well. Temporal redundancy is video-specific: regions of a frame (blocks) look like regions in nearby frames, so the codec stores motion + residual per block instead of another full picture. That second kind is what makes video dramatically smaller than per-frame image compression.
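To make the spatial half concrete, here is a small sketch that transforms one smooth row of pixels, quantizes the coefficients, and reconstructs the row. It uses a floating-point 1-D DCT-II for readability; H.264 itself applies an integer approximation of the DCT to 2-D blocks, and the row values and quantizer step here are arbitrary choices for the example.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        out.append(scale * sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, v in enumerate(x)))
    return out

def idct(c):
    """Inverse of dct() (an orthonormal DCT-III)."""
    n = len(c)
    return [sum(math.sqrt((1 if k == 0 else 2) / n) * ck
                * math.cos(math.pi * (i + 0.5) * k / n)
                for k, ck in enumerate(c))
            for i in range(n)]

row = [100, 102, 104, 106, 108, 110, 112, 114]   # a smooth 8-pixel gradient
step = 8                                          # hypothetical quantizer step

coeffs = dct(row)
quantized = [round(c / step) for c in coeffs]     # small coefficients round to 0
restored = idct([q * step for q in quantized])    # dequantize and invert

print(quantized)
print([round(v) for v in restored])
```

Most quantized coefficients come out as zero for a smooth input, which is what makes the later entropy-coding stage effective; the price is a small reconstruction error that grows with the step size.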
The two sliders below pair the spatial and temporal halves of H.264 on the active video. Drag the JPEG quality knob to feel the DCT-quantize tradeoff; drag the motion slider to watch block matches reconstruct the next frame from a nearby one.
The third major piece is entropy coding. After spatial and temporal prediction have reduced the signal, the encoder still has motion vectors, residual coefficients, block modes, and other syntax symbols to store. H.264 uses CABAC or CAVLC to spend fewer bits on likely symbols and more bits on unlikely ones. This stage is not another visual prediction step; it is statistical coding over the syntax left behind by prediction and quantization.
Spatial prediction reduces the amount that must be stored within a frame. Temporal prediction reduces the amount that must be stored across frames. Entropy coding squeezes the remaining syntax into fewer bits. The temporal part is what changes the shape of a read: if frame 101 is described as a change from frame 100, the decoder needs frame 100 to make sense of frame 101.
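The dependency in that last sentence can be sketched in one dimension. The example below, with made-up pixel rows, searches a reference row for the best match to a block of the current row, then stores only a motion offset and a residual. Sum of absolute differences is used as the matching cost; real encoders use more elaborate searches and cost functions.

```python
def best_match(prev, block, start, search=3):
    """Offset into prev that best predicts block (minimum sum of absolute differences)."""
    best = None
    for mv in range(-search, search + 1):
        lo = start + mv
        if lo < 0 or lo + len(block) > len(prev):
            continue  # candidate would fall outside the reference row
        sad = sum(abs(a - b) for a, b in zip(prev[lo:lo + len(block)], block))
        if best is None or sad < best[1]:
            best = (mv, sad)
    return best

prev_row = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]   # decoded reference "frame"
cur_row = [10, 10] + prev_row[:-2]                      # same content shifted right by 2

start = 4
block = cur_row[start:start + 4]
mv, sad = best_match(prev_row, block, start)
predicted = prev_row[start + mv:start + mv + 4]
residual = [c - p for c, p in zip(block, predicted)]

# The encoder stores (mv, residual); the decoder needs prev_row to rebuild the block.
rebuilt = [p + r for p, r in zip(predicted, residual)]
print(mv, residual, rebuilt == block)
```

The stored data is tiny, but it is meaningless without the reference row, which is exactly the read-shape consequence described above.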
The frame graph
If every frame were predicted only from the previous one, reading any frame would mean decoding from the beginning of the file. That would compress well, but seeking would be miserable. Video codecs need reset points because humans also jump around in videos. Those reset points are I-frames: frames coded from their own bytes, without depending on earlier pictures.
Between those reset points, P-frames recover compression by predicting from earlier decoded frames. Then B-frames go one step further: they can interpolate from surrounding decoded frames, often using both an earlier and a later picture. Those choices turn a timeline into a dependency graph. At frame-level granularity, frames are nodes, and each reference is an edge from the dependent frame back to the frame that supplies decoded data.
[Figure: frame graph. Selected: DTS 3. Closure: DTS 0, 1, 2, 3.]
The diagram shows an interesting result. The B-frame at playback position 3 references data from the later P-frame at position 4. This dependency requires the later P-frame to be decoded first. That is why frame decode order diverges from display order, and why compressed video samples carry both DTS (decode timestamp) and PTS (presentation timestamp).
The real bitstream is messier than “P means previous frame, B means previous and next.” H.264 carries reference lists: ordered sets of decoded pictures that a frame is allowed to predict from. A P-frame chooses one reference per predicted block from list 0. A B-frame may choose from list 0, list 1, or combine one from each. The diagrams below collapse that block-level machinery into one node per frame and one edge per frame-level reference.
A frame's decode closure is the set of frames that must be decoded to produce it: the frame itself plus every frame reachable by following reference edges. In other words, it is the transitive closure of that frame's references in the frame graph. The graph below starts with the combined closure for the selected access pattern; click any node to inspect that frame's individual closure.
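A minimal sketch of the closure computation, assuming the frame-level reference graph has already been parsed into a dictionary. The five-frame GOP below is hypothetical and keyed by display position; it mirrors the case where a B-frame needs a later P-frame, so the decode order differs from the display order.

```python
def decode_closure(refs, targets):
    """All frames needed to decode the targets: the targets themselves plus
    everything reachable by following reference edges."""
    closure, stack = set(), list(targets)
    while stack:
        frame = stack.pop()
        if frame not in closure:
            closure.add(frame)
            stack.extend(refs[frame])
    return closure

# Hypothetical GOP: frame -> frames it predicts from (frame-level edges only).
refs = {
    0: [],        # I-frame
    4: [0],       # P-frame
    2: [0, 4],    # B-frame that is itself used as a reference
    1: [0, 2],    # B-frame
    3: [2, 4],    # B-frame that needs the later P-frame
}
decode_order = [0, 4, 2, 1, 3]                    # DTS order for this GOP
dts = {frame: i for i, frame in enumerate(decode_order)}

closure = decode_closure(refs, [3])
feed = sorted(closure, key=dts.get)               # samples for the decoder, in DTS order
print(feed)
```

Passing several targets at once returns the union of their closures, which is the quantity a batch planner cares about.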
Once the closure is known, the codec question is answered: these are the compressed samples required for the requested frame. The next problem is addressing. The reader still has to find where those samples live in the file.
From samples to byte ranges
At this point, the query is no longer abstract. The selected access pattern names output frames, and the frame graph expands those into a closure of compressed samples. The container's job is to turn those sample identities into byte offsets and sizes.
The blue spans are the bytes that a closure-aware reader would request from storage before invoking the decoder. Adjacent samples collapse into one range. Scattered samples stay scattered. The counts in the figure are computed from the selected MP4's real sample table and, where the frame graph is available, its parsed H.264 references.
Object stores make every independent range pay latency and request overhead; SSDs and operating-system caches prefer nearby bytes. Good readers coalesce adjacent sample ranges and deliberately overfetch a little when one larger read is cheaper than many small ones, especially across a batch of clips.
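Coalescing with a tolerated gap can be sketched as a single pass over sorted ranges. The max_gap parameter and the sample offsets below are assumptions for illustration; a real reader would pick the gap from measured request latency and bandwidth.

```python
def coalesce(ranges, max_gap=0):
    """Merge (offset, size) ranges whose gap is at most max_gap bytes.
    A positive max_gap deliberately overfetches the bytes in between."""
    merged = []
    for offset, size in sorted(ranges):
        end = offset + size
        if merged and offset - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)   # extend the previous range
        else:
            merged.append([offset, end])              # start a new range
    return [(start, end - start) for start, end in merged]

# Made-up (offset, size) pairs for the samples in one decode closure.
samples = [(0, 100), (100, 50), (400, 80), (500, 20), (5000, 300)]

print(coalesce(samples))                 # only exactly adjacent samples merge
print(coalesce(samples, max_gap=256))    # small gaps are read through
```

With a gap of 256 bytes the five sample reads become two requests at the cost of 270 extra bytes, a trade that usually favors the larger read on object storage.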
The decoder boundary
A decoder turns a stream of compressed samples into decoded pictures. Software decoders do that on CPU cores. Hardware decoders — NVDEC on NVIDIA, VideoToolbox on Apple, VCN on AMD, Quick Sync on Intel — use fixed-function video blocks built for this exact bitstream work. NVDEC is not CUDA cores running a kernel; it is separate hardware sitting beside the CUDA cores.
compressed samples
  -> bitstream parse                    # slice headers, NAL units
  -> entropy decode (CABAC / CAVLC)     # serial, stateful
  -> inverse quantization               # parallel within frame
  -> inverse transform                  # parallel within frame
  -> prediction + reconstruction        # uses decoded references
  -> deblocking + loop filters
  -> decoded YUV surface
Entropy decode is the hard serial step. CABAC and CAVLC are adaptive: each symbol depends on context produced by previous symbols. Hardware makes that serial machine fast, but it does not make one bitstream arbitrarily parallel.
Each stream's reference frames live in the decoder's decoded-picture buffer, and later frames depend on that state. A datacenter GPU may expose only a handful of hardware decode engines (roughly five to seven on modern NVIDIA chips), so throughput comes from keeping those engines fed with independent closures, not from splitting one closure across CUDA cores.
In our own H100 CUVID runs, hardware decode has become visible: sparse H.264 workloads can drive decoder utilization to the ceiling. Treat that as a warning about where the bottleneck can move, not as a universal throughput law. The exact limit depends on the codec profile, resolution, bitrate, GOP shape, batching, scheduler behavior, and the placement of post-decode CUDA work relative to the model. The planner's defensible claim is narrower: fewer decoded closure frames and tighter byte ranges reduce the work handed to whichever decoder path is available.
Video surfaces and color conversion
Decoders do not usually output RGB tensors. They output video surfaces, commonly NV12 for 8-bit 4:2:0 video or P010 for 10-bit 4:2:0 video. Both are YCbCr layouts: luma is stored separately from lower-resolution chroma.
The conversion exists because RGB is a display-oriented representation, not a compression-friendly one. In camera video, the red, green, and blue channels carry a lot of the same brightness structure. A luma/chroma transform pulls that shared brightness signal into Y and leaves color-difference signals in the chroma planes. That separation lets prediction, transform coding, and quantization spend bits according to what each component actually contributes.
The lower chroma resolution is the next economy. Humans are much more sensitive to luma edges and texture than to high-frequency color detail, so most camera videos store fewer chroma samples than luma samples. In 4:2:0, each chroma plane has half the width and half the height of the luma plane: one chroma sample covers a 2×2 block of luma samples. That cuts the chroma sample count to a quarter while preserving the brightness detail that carries most of the visible structure.
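The arithmetic is easy to check. This sketch counts samples for a 4:2:0 frame and compares tightly packed surface sizes; it assumes no row or height padding, which real decoder surfaces usually add, and the function names are made up for the example.

```python
def plane_sizes_420(width, height):
    """Sample counts for one 4:2:0 frame: (luma plane, each chroma plane)."""
    chroma_w, chroma_h = (width + 1) // 2, (height + 1) // 2
    return width * height, chroma_w * chroma_h

def surface_bytes(width, height, bytes_per_sample):
    """Tightly packed size of the Y plane plus both chroma planes."""
    luma, chroma = plane_sizes_420(width, height)
    return (luma + 2 * chroma) * bytes_per_sample

w, h = 1920, 1080
luma, chroma = plane_sizes_420(w, h)
nv12 = surface_bytes(w, h, 1)      # 8-bit samples
p010 = surface_bytes(w, h, 2)      # 10-bit samples stored in 16-bit words
rgb8 = w * h * 3                   # packed 8-bit RGB, for comparison
print(luma, chroma, nv12, p010, rgb8)
```

An 8-bit 4:2:0 surface is half the size of the packed RGB frame it expands into, so the color conversion step also doubles the memory traffic per frame.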
YUV/YCbCr is not inherently worse than RGB. The losses usually come from compression, quantization, and chroma subsampling, not from the name of the color transform. But converting a subsampled YCbCr surface into RGB requires chroma upsampling, a color matrix, range handling, and rounding. The exact choice matters when the model sees the result as numeric input.
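A per-sample sketch shows where those choices enter. It derives the conversion from the luma weights (BT.709 by default, BT.601 optionally), handles limited versus full range, and uses nearest-neighbor chroma upsampling for a 4:2:0 frame. This is an illustrative reference, not how a GPU pipeline would implement it, and production readers often use smoother chroma filters.

```python
def ycbcr_to_rgb(y, cb, cr, kr=0.2126, kb=0.0722, full_range=False):
    """Convert one 8-bit Y'CbCr sample to 8-bit RGB.
    Defaults are the BT.709 luma weights; pass kr=0.299, kb=0.114 for BT.601."""
    if full_range:
        yn, pb, pr = y / 255, (cb - 128) / 255, (cr - 128) / 255
    else:  # limited ("video") range: Y in 16..235, chroma in 16..240
        yn, pb, pr = (y - 16) / 219, (cb - 128) / 224, (cr - 128) / 224
    kg = 1 - kr - kb
    r = yn + 2 * (1 - kr) * pr
    b = yn + 2 * (1 - kb) * pb
    g = (yn - kr * r - kb * b) / kg
    # Round and clamp to the 8-bit output range.
    return tuple(min(255, max(0, round(255 * c))) for c in (r, g, b))

def frame_to_rgb(y_plane, cb_plane, cr_plane, **options):
    """4:2:0 planes to rows of RGB tuples, using nearest-neighbor chroma upsampling."""
    return [[ycbcr_to_rgb(y, cb_plane[r // 2][c // 2], cr_plane[r // 2][c // 2], **options)
             for c, y in enumerate(row)]
            for r, row in enumerate(y_plane)]

print(ycbcr_to_rgb(235, 128, 128))                      # read as limited range
print(ycbcr_to_rgb(235, 128, 128, full_range=True))     # same bytes read as full range
print(ycbcr_to_rgb(128, 90, 200))                       # BT.709 weights
print(ycbcr_to_rgb(128, 90, 200, kr=0.299, kb=0.114))   # BT.601 weights
```

The same three bytes produce different RGB values depending on the range flag and the matrix, so a reader that guesses either one feeds the model slightly shifted inputs.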
Models mostly want RGB because the surrounding ecosystem does: pretrained image backbones, augmentation libraries, and normalization constants are conventionally defined for RGB image tensors. The video reader therefore still has work to do after decoding: YCbCr-to-RGB, crop, resize, normalize, and assemble batches. On a GPU pipeline, those steps run on CUDA cores and move memory across host, device, and sometimes PCIe boundaries. If scheduled poorly, post-decode work competes directly with the model.
From timestamps to tensors
Now we come back to the full query flow:
timestamps -> presentation frames -> compressed samples (PTS -> DTS) -> decode closure -> byte ranges -> decoder bitstream -> decoded YUV surfaces -> RGB / crop / resize / normalize -> tensors
A general library such as TorchCodec or FFmpeg exposes this through a decoder-shaped API. It resolves timestamps or frame indices using demuxer state or a timing cache such as TorchCodec's frame_mappings, seeks to an appropriate keyframe, feeds compressed samples in decode order, decodes forward until the requested presentation frame appears, and returns the requested frame or tensor. That is the right abstraction for broad compatibility: the decoder owns the hidden work.
The gap is that general-purpose libraries usually pay at the GOP level. Frame mappings make timestamp lookup cheap, but they do not make a P-frame or B-frame independently decodable. The common path still seeks to a sync sample, decodes forward, discards intermediate frames, and treats each request like a small playback.
A good warehouse reader can do better because it owns the batch. It can cache the MP4 sample index and timestamp mapping, parse the H.264 reference graph into a closure index, plan the union of closures for all requested outputs, coalesce the sample byte ranges, feed compressed samples in DTS order, and materialize RGB tensors only for output frames.
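The difference between the two strategies can be counted on a toy GOP. The sketch below compares a seek-and-decode-forward read with a closure-planned read for a single target frame. The nine-frame hierarchical-B structure is hypothetical, and the comparison ignores any decoder constraints on skipping samples.

```python
def forward_decode(decode_order, sync, target):
    """Frames a seek-and-decode-forward reader touches: from the last sync
    sample at or before the target (in decode order) through the target."""
    pos = decode_order.index(target)
    start = max(i for i in range(pos + 1) if decode_order[i] in sync)
    return decode_order[start:pos + 1]

def closure_decode(refs, decode_order, targets):
    """Frames a closure-aware reader touches, listed in decode order."""
    needed, stack = set(), list(targets)
    while stack:
        frame = stack.pop()
        if frame not in needed:
            needed.add(frame)
            stack.extend(refs[frame])
    return [frame for frame in decode_order if frame in needed]

# Hypothetical GOP keyed by display position: I0, P8, and a B-frame pyramid between.
refs = {0: [], 8: [0], 4: [0, 8], 2: [0, 4], 6: [4, 8],
        1: [0, 2], 3: [2, 4], 5: [4, 6], 7: [6, 8]}
decode_order = [0, 8, 4, 2, 1, 3, 6, 5, 7]
sync = {0}                                   # only the I-frame is a sync sample

general = forward_decode(decode_order, sync, 7)
planned = closure_decode(refs, decode_order, [7])
print(len(general), len(planned))
```

Here the forward path decodes the whole GOP to reach frame 7, while the planned path decodes only the five frames on its reference chain.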
We now introduce Spiral's source reader. On an existing H.264 MP4, it can plan sparse byte ranges around the actual samples required by the requested frames, choose accelerated decode paths, such as NVDEC, when available, and schedule range fetches, bitstream assembly, decode engines, post-decode CUDA work, and model consumption as a single pipeline. The win comes from tighter decode closures and better scheduling, not from a different interpretation of MP4.
There is still a ceiling. A source reader can avoid unnecessary work around the given graph, but it cannot make that graph simpler. If the requested frames sit behind long reference chains or scattered samples, the best reader still has to pay those closures. The closure and range counts above make that distinction explicit: better source reads help, but repeated access patterns eventually require a file to be written for the query.
Why not flatten it?
If video takes this much machinery to read well, the obvious question is why not flatten it into something easier: raw RGB tensors, one JPEG per frame, short-GOP encodes, fixed clips, thumbnails, embeddings. Those are all valid materialized views when they match a frequent query.
They are not replacements for the source. Flattening buys simpler access by spending storage, egress, compute, quality, or generality. The compression ratio is the reason we are willing to do the planning work.
There is also a third move: pre-materialize a derived view. Thumbnails and latents are not alternate encodings of the source so much as cached answers to expected questions. That can be excellent when the query is stable: a fixed thumbnail policy, a fixed embedding model, a fixed scene-cut detector. It decays when the question changes. A new model invalidates latents; a new UI may need middle-frame thumbnails instead of I-frame thumbnails; a scene-search workflow may want cuts rather than fixed posters.
The database question is not whether to flatten everything. It is which views are stable enough to materialize, which queries should read the source directly, and which repeated access patterns justify a new encode.
Designing files for repeated queries
A smart source reader plans around a given frame graph. If the warehouse also controls the writer, it can change the graph. For repeated machine views, the first goal is to ensure that the useful frames land on smaller, more bounded decode closures rather than long, irregular reference paths.
That imposed structure is not free. The encoder gives up some predictive freedom, so the compression ratio can worsen. The bet is the same one databases make with indexes, clustering, and materialized views: spend some storage or sacrifice encoding efficiency so that a repeated query becomes predictable and cheap to execute.
Control the frame graph
Start with the retained cadence, such as 30 fps to 10 fps. The encoder can make those retained frames cheap anchors, then fill the gaps with B-frame refinement layers. Exact cadences matter because every third frame can belong to the 10 fps retained layer, while the intervening frames remain full-rate refinement.
Exact cadence ladder
Exact-cadence layers are assigned from presentation position, but P-frames are sparse anchors every 12 display frames. Everything between anchors is a B-frame refinement tree. For 30 -> 10 fps, the retained frames (every third display position) can be B-frames that serve as references (B-refs); the prefix property is that their closure stays inside the retained cadence.
[Figure: exact-cadence ladder. Selected frame 6 · view layer 0 · DTS 2 · closure frames 0, 12, 6.]
This is more precise than “use shorter GOPs.” The encoder controls which frames are cheap anchors, which frames are enhancement layers, and how much closure cost a down-rate query pays.
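One way to picture such a ladder is the hypothetical construction below: anchors at display frames 0 and 12, retained frames arranged as a B-frame tree that references only other retained frames, and the remaining frames referencing their nearest retained neighbors. This is an assumed structure for illustration, not a description of any specific encoder, and it requires the anchor gap to be a multiple of the cadence.

```python
def build_ladder(anchor_gap=12, cadence=3):
    """Hypothetical frame graph for one anchor interval, keyed by display position."""
    refs = {0: [], anchor_gap: [0]}            # I-frame and P-frame anchors

    def split(lo, hi):
        # Place a retained B-ref at the midpoint of the retained grid, then recurse.
        slots = (hi - lo) // cadence
        if slots < 2:
            return
        mid = lo + (slots // 2) * cadence
        refs[mid] = [lo, hi]
        split(lo, mid)
        split(mid, hi)

    split(0, anchor_gap)
    for frame in range(anchor_gap + 1):
        if frame % cadence:                    # refinement frame between retained frames
            lo = frame - frame % cadence
            refs[frame] = [lo, lo + cadence]
    return refs

def closure(refs, frame):
    """The frame plus everything reachable through its references."""
    needed, stack = set(), [frame]
    while stack:
        f = stack.pop()
        if f not in needed:
            needed.add(f)
            stack.extend(refs[f])
    return needed

refs = build_ladder()
retained = {f for f in refs if f % 3 == 0}
print(sorted(closure(refs, 6)))
print(all(closure(refs, f) <= retained for f in retained))   # prefix property check
```

In this layout, a 10 fps read never touches a refinement frame, while a full-rate read pays for the retained layer plus one extra level.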
Repack samples for prefixes
The second knob is the physical sample order. Repacking writes the compressed samples so the low-rate ladder appears at the start of the file. A down-rate read can then fetch a prefix range instead of collecting scattered samples across the whole object.
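[Figure: prefix view. Each sample is labeled with its display index and view layer (v0 or v1).]
Display-order samples: 0 (v0), 1 (v1), 2 (v1), 3 (v0), 4 (v1), 5 (v1), 6 (v0), 7 (v1), 8 (v1), 9 (v0), 10 (v1), 11 (v1), 12 (v0).
Prefix-packed sample order: 0, 3, 6, 9, 12 (v0, a prefix of 5 of 13 samples), then 1, 2, 4, 5, 7, 8, 10, 11 (v1).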
The same idea is the physical version of the source-reader facts above. Ranges can drop because the retained view is prefix-readable in byte order. Closure size can drop because the GOP is bounded by design. The file is shaped around the access pattern, and the reader can extract the view with minimal work.
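A short sketch of the repacking step, assuming a two-layer view where every third display frame belongs to layer 0. The sample sizes are invented, and samples are kept in display-index order within each layer for simplicity; a real writer may order them differently.

```python
def prefix_pack(n_frames, cadence):
    """Physical sample order with the retained layer first, plus the prefix length."""
    def layer(frame):
        return 0 if frame % cadence == 0 else 1
    order = sorted(range(n_frames), key=lambda f: (layer(f), f))
    prefix_len = sum(1 for f in order if layer(f) == 0)
    return order, prefix_len

order, prefix_len = prefix_pack(n_frames=13, cadence=3)

# Made-up compressed sizes: retained frames are larger than refinement frames.
sizes = {f: (4000 if f % 3 == 0 else 1500) for f in range(13)}
prefix_bytes = sum(sizes[f] for f in order[:prefix_len])
total_bytes = sum(sizes.values())

print(order)
print(prefix_len, prefix_bytes, total_bytes)
```

A down-rate read then becomes a single range request for the first prefix_bytes bytes of the sample data instead of five scattered reads.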
The decision framework, expressed as the tradeoff every database makes:
| Access pattern | Layout choice |
|---|---|
| Unknown / one-off | Plan against the source file. No transcoding. |
| Repeated / shaped | Transcode to a layout that matches the shape. |
| Mixed / discovery-led | Keep source; precompute derived modalities. |
A footnote on what the codec spec already wants to do. H.264 has a scalability extension, Scalable Video Coding (SVC), defined as Annex G of the H.264 standard, that puts spatial and temporal pyramids inside a single bitstream. One file can carry multiple resolutions and frame rates as nested sub-bitstreams, with the decoder extracting whichever sub-bitstream a query needs. In principle, this is exactly what a multimodal warehouse wants. In practice, NVDEC and most consumer decoders implement only the non-scalable profiles; an SVC bitstream falls back to software decode and pays back every cycle the hardware was supposed to save. The layout work above ends up living at the warehouse layer rather than the codec layer.
Producing these layouts means a custom encoder that controls the frame graph, slice ordering, and sample byte order to fit a known retention pattern. The output is standard (non-scalable) H.264 — any decoder can read it — but the structure is shaped around the physical plan rather than the playback timeline. Spiral's transcoder is one realization of this.