Long-Form Video Understanding: Bottlenecks and Design Choices - Part 1

Recently I have been hearing and reading seemingly contradictory opinions on understanding long-form video (from tens of minutes to several hours), such as:

“Sweeping through the whole video is necessary - we should focus on making that as efficient as possible” vs. “there are many clever tricks to selectively retrieve - let’s explore those.”
“We should just keep improving MLLMs until they can handle everything” vs. “agents are the future of video understanding - let’s build more agent swarms.”

My thesis is that these views are not really disagreeing about what is true - they are making different tradeoffs about where to spend a limited “budget”. Unlike a text document or a few dozen images, a two-hour video exceeds the memory and compute budget unless it is deliberately compressed, sampled, or retrieved from. The contradictions above reflect different design choices for solving this core challenge, and there is no consensus yet on a universally optimal design. Quite the contrary - two distinct axes of tradeoff are being actively explored:

The memory axis. When you cannot afford to attend to everything, do you throw information away and lean on adaptive retrieval - or do you keep all the information and compress the attention/KV-cache? Two different answers to the same memory ceiling.
The compute axis. When you cannot yet compute the answer accurately in one pass, do you buy accuracy with an agentic system that runs many inferences - or do you internalize that agentic behavior into the model itself, so it's a learned, native capability rather than an external orchestration loop?

The third bottleneck is, as always, evaluation. The problem is that some benchmarks do not control for the complexity and dependency of the tasks they bundle together, which makes “approach A beats approach B” claims much weaker than they look:

Complexity isn’t controlled. Even the benchmarks marketed as “long” rarely exceed an hour, while real production workloads often run for hours.
Dependency isn’t controlled. Plenty of “video” questions are anchored on a single frame or a few seconds, or are answerable from the transcript/subtitle alone, with no real long-range understanding required.

To keep this writeup focused and readable, I’ll use the rest of it to survey the design choices along these two axes, and leave evaluation and benchmarks for long-form video understanding to a separate future writeup (Part 2).

Sidebar clarification: technically one could push a two-hour video into a long-context model like Gemini. But it does not work reliably for tasks that genuinely require long-form temporal understanding, e.g., “what is the story arc for the character in a blue coat who first appears between 20:02 and 20:22?” While it’s hard to know the full details of these closed models, it is reasonable to conjecture that they lean on subtitles/ASR, metadata, or a handful of frames for most questions. As shown in recent benchmarks built from movies and TV shows (InfiniBench, Ataallah et al. 2025), models can score pretty well on certain tasks purely from subtitles and metadata (related to the aforementioned issue of “dependency isn’t controlled”). In summary, “feed the whole video to a frontier model” is not a silver bullet for long-form video understanding.
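A back-of-envelope calculation shows why dense sampling runs into the budget. This is a minimal sketch: the sampling rates, the 256 tokens per frame, and the 1M-token context window are illustrative assumptions, not measurements of Gemini or any other model.

```python
def visual_token_count(duration_s: float, fps: float, tokens_per_frame: int) -> int:
    """Number of visual tokens produced by sampling a video at a fixed rate."""
    return round(duration_s * fps) * tokens_per_frame


# All numbers below are illustrative assumptions.
TWO_HOURS_S = 2 * 60 * 60
TOKENS_PER_FRAME = 256        # hypothetical per-frame token count
CONTEXT_WINDOW = 1_000_000    # hypothetical context length in tokens

for fps in (0.1, 1.0, 4.0):
    n = visual_token_count(TWO_HOURS_S, fps, TOKENS_PER_FRAME)
    ratio = n / CONTEXT_WINDOW
    print(f"{fps:>4} fps: {n:>10,} tokens ({ratio:.2f}x the assumed window)")
```

Under these assumptions, 1 frame per second over two hours is 7,200 frames × 256 tokens = 1,843,200 visual tokens, which already exceeds the assumed window. Fitting inside it means sampling sparsely or shrinking the per-frame representation.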

Of course, solutions are not as simple as keep nothing vs. keep everything - there is a full spectrum, running from aggressively throwing information away to keeping all of it and paying the cost somewhere else.

Instead of uniform sampling, previous work has proposed selecting only what matters:

Lightweight learned selectors of frames (M-LLM Based Video Frame Selection, Hu et al. 2025)
Training-free key clip selection that keeps short coherent segments instead of isolated frames (From Frames to Clips, Sun et al. 2025)
RL for samplers (Temporal Sampling Policy Optimization, Tang et al. 2025) where an event-aware “temporal agent” is trained for keyframe selection
Reasoning-driven sampling that traverses coarse summaries, refines its focus, and halts once it has enough evidence (LongVideo-R1, Qiu et al. 2026)
Joint RL training of the sampler and the model (MSJoE, Tan et al. 2026)
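The contrast between uniform sampling and selection can be sketched in a few lines. This is a minimal illustration, not the method of any paper above: it assumes frame and query embeddings already live in a shared space (toy 2-D vectors here), and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def uniform_sample(num_frames, k):
    """Baseline: indices of k evenly spaced frames."""
    if k >= num_frames:
        return list(range(num_frames))
    step = num_frames / k
    return [int(i * step + step / 2) for i in range(k)]


def select_top_k(frame_embs, query_emb, k):
    """Indices of the k frames most similar to the query, in temporal order."""
    ranked = sorted(
        range(len(frame_embs)),
        key=lambda i: cosine(frame_embs[i], query_emb),
        reverse=True,
    )
    return sorted(ranked[:k])


# Toy embeddings (assumed): frames 0, 2 and 4 resemble the query.
frames = [[1, 0], [0, 1], [0.9, 0.1], [0, 1], [1, 0.2], [0, 1]]
query = [1, 0]
print("uniform: ", uniform_sample(len(frames), 2))
print("selected:", select_top_k(frames, query, 2))
```

With the same budget of two frames, the uniform baseline picks by position only, while the selector picks by relevance to the question.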

There is a second, somewhat orthogonal design choice here: most selectors are “query-conditioned”, meaning they select frames or clips based on the user question, while others are “query-agnostic”. For example, Attend Before Attention / AutoGaze (Shi et al. 2026) does patch-level pre-encoder selection, trained to keep the minimal set of patches that still reconstructs each frame within an error budget. Which one fits depends on the application: query-conditioned selection tends to win on accuracy, but query-agnostic selection is the only option when there is no query up front (e.g., building an index).
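To make the “minimal set within an error budget” objective concrete, here is a toy, hand-written stand-in. It assumes each patch comes with a known, additive cost of dropping it (the hypothetical `drop_errors`, e.g. its residual against the previous frame). AutoGaze learns its selection, so this only illustrates the objective, not the method.

```python
def select_patches(drop_errors, error_budget):
    """Keep as few patches as possible while the summed error of the
    dropped patches stays within error_budget (errors assumed additive)."""
    # Keeping the costliest-to-drop patches first minimizes how many we keep.
    order = sorted(range(len(drop_errors)), key=lambda i: drop_errors[i], reverse=True)
    remaining = sum(drop_errors)  # error if every patch were dropped
    kept = []
    for i in order:
        if remaining <= error_budget:
            break
        kept.append(i)
        remaining -= drop_errors[i]
    return sorted(kept)


# Hypothetical per-patch errors: static background patches cost ~0 to drop.
errors = [0.0, 0.5, 0.05, 0.3, 0.0, 0.1]
print(select_patches(errors, error_budget=0.2))
```

No query appears anywhere in the function, which is what makes this kind of selection usable at indexing time.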

Instead of hard-dropping frames, we can shrink the representation itself - and there's more than one way to do that:

Compression: Hierarchical Differential Distillation / ViLAMP (Cheng et al. 2025) proposed a “mixed precision” style approach to keep keyframes intact and compress the rest at the patch level
Pooling: LVC (Wang et al. 2025) studied retrofitting long-form video understanding capabilities onto existing VLMs by query-weighted pooling, collapsing windows of densely-sampled frames into a handful of “pseudo-frames”
Cheaper encoding: LiteFrame (Kim et al. 2026) points out that, with aggressive “post-hoc” visual token reduction (after feature extraction), the bottleneck moves to the per-frame vision encoder, so it distills a more efficient vision encoder for a better latency-accuracy tradeoff. (Note: this line of work helps whether you keep or discard, since it sits upstream of both.)
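The pooling idea can be illustrated with a short sketch. It assumes each frame is already a feature vector and that relevance is a plain dot product with a query vector; the actual weighting in LVC may differ, so treat this as one plausible reading rather than the paper's implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def query_weighted_pool(frame_feats, query, window):
    """Collapse each window of frame features into one pseudo-frame,
    weighting frames by their (assumed) dot-product relevance to the query."""
    pseudo_frames = []
    for start in range(0, len(frame_feats), window):
        chunk = frame_feats[start:start + window]
        scores = [sum(f * q for f, q in zip(feat, query)) for feat in chunk]
        weights = softmax(scores)
        dim = len(chunk[0])
        pseudo_frames.append(
            [sum(w * feat[d] for w, feat in zip(weights, chunk)) for d in range(dim)]
        )
    return pseudo_frames


# Four toy frames pooled into two pseudo-frames.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
print(query_weighted_pool(feats, query=[1.0, 1.0], window=2))
```

The token count drops by a factor of `window`, while frames that match the query contribute more to each pseudo-frame than frames that do not.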

Another school of thought holds that we shouldn’t drop anything, because whatever you discard early may well be what the question turns out to need later. InternVideo3 (Yan et al. 2026) explicitly rejects “aggressive frame subsampling, retrieval, or summarization” and instead proposes an attention re-parameterization (Multimodal Multi-head Latent Attention, M2LA) that compresses the KV-cache while preserving the full multimodal token stream. It's the clearest video-native instance of the keep-everything bet, potentially inspired by the broader line of work on efficient long-context attention (latent / multi-head latent attention).
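A rough size comparison shows why caching a latent helps. The sketch below follows the generic latent-attention idea (store one low-dimensional latent per token per layer instead of full keys and values); it is not a description of M2LA's internals, and every dimension is a hypothetical placeholder.

```python
def kv_cache_bytes(num_tokens, num_layers, values_per_token_per_layer, bytes_per_value=2):
    """Total cache size, assuming 16-bit values by default."""
    return num_tokens * num_layers * values_per_token_per_layer * bytes_per_value


# Hypothetical model and sequence dimensions, for illustration only.
NUM_TOKENS = 1_000_000
NUM_LAYERS = 32
NUM_HEADS = 32
HEAD_DIM = 128
LATENT_DIM = 512

# Standard multi-head attention caches a key and a value vector per head.
full_bytes = kv_cache_bytes(NUM_TOKENS, NUM_LAYERS, 2 * NUM_HEADS * HEAD_DIM)
# A latent-style cache stores one compressed vector per token per layer.
latent_bytes = kv_cache_bytes(NUM_TOKENS, NUM_LAYERS, LATENT_DIM)

print(f"full KV cache:   {full_bytes / 2**30:8.1f} GiB")
print(f"latent KV cache: {latent_bytes / 2**30:8.1f} GiB")
print(f"reduction:       {full_bytes // latent_bytes}x")
```

With these placeholder dimensions the per-token cache shrinks from 2 × 32 × 128 = 8,192 values to 512 values per layer, a 16x reduction, and no token is dropped from the sequence.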

Whether you keep or discard, speculative decoding can losslessly speed up a model’s token-by-token generation. ParallelVLM (Kong et al. 2026) - following SpecVLM (Ji et al. 2025) - parallelizes the draft-then-verify pipeline to speed up decoding.
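For readers new to the draft-then-verify idea, here is a minimal greedy version. It is a sequential toy and does not include the parallelization that ParallelVLM adds; `toy_target` and `toy_draft` are hypothetical stand-ins for real models, each mapping a token list to its next token.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Reference: plain one-token-at-a-time greedy decoding."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy draft-then-verify decoding; output matches greedy_decode."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # Draft: the cheap model proposes up to k tokens autoregressively.
        draft = []
        for _ in range(min(k, limit - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # Verify: the target checks each drafted position
        # (a real system does this in one batched forward pass).
        accepted = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # fix the first mismatch, drop the rest
                break
            accepted.append(tok)
        tokens.extend(accepted)
    return tokens


def toy_target(tokens):
    """Hypothetical target model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def toy_draft(tokens):
    """Hypothetical draft model: agrees with the target except after a 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


print(speculative_decode(toy_target, toy_draft, [0], max_new_tokens=12))
```

The speed-up comes from accepting several drafted tokens per target verification, and the output is unchanged because every accepted token is one the target would have produced anyway.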

Streaming is where keep-it-all is not an option: frames keep arriving forever, so the memory budget is bounded while the input is not. A body of work (CurveStream, Wang et al. 2026; FluxMem, Xie et al. 2026; VAM, Li et al. 2026; OASIS, Liang et al. 2026) has converged on “hierarchical memory”: maintain a fixed budget and route incoming frames into keep vs. discard vs. not-sure buckets. Some failure modes are worth noting: memory built from the model’s own earlier narrations can compound errors, and naive retrieval into the context window can contaminate reasoning. Put simply, bad memory is worse than less memory.
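The routing pattern can be sketched as a small bounded buffer. This is far simpler than any of the cited systems: it assumes each frame arrives with an importance score in [0, 1] from some upstream scorer, and the thresholds and eviction rule are hypothetical choices.

```python
from dataclasses import dataclass, field


@dataclass
class StreamingMemory:
    budget: int                   # max frames held across both buckets
    keep_threshold: float = 0.7   # assumed cut-off for confident keeps
    drop_threshold: float = 0.3   # assumed cut-off for confident discards
    kept: list = field(default_factory=list)     # (score, frame_id) pairs
    pending: list = field(default_factory=list)  # the "not sure" bucket

    def add(self, frame_id, score):
        """Route one frame; returns the bucket it was routed to.
        A routed frame can still be evicted if it is the weakest item."""
        if score < self.drop_threshold:
            return "discard"
        bucket = self.kept if score >= self.keep_threshold else self.pending
        bucket.append((score, frame_id))
        self._evict()
        return "keep" if bucket is self.kept else "not_sure"

    def _evict(self):
        # Over budget: drop the weakest "not sure" frame first,
        # and only then the weakest confidently kept frame.
        while len(self.kept) + len(self.pending) > self.budget:
            source = self.pending if self.pending else self.kept
            source.remove(min(source))


memory = StreamingMemory(budget=3)
for frame_id, score in enumerate([0.9, 0.1, 0.5, 0.8, 0.95, 0.85]):
    print(frame_id, memory.add(frame_id, score))
print(sorted(fid for _, fid in memory.kept))
```

However long the stream runs, the buffer never holds more than `budget` frames, which is the property that makes the streaming setting tractable.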

The compute axis boils down to this: long-form video understanding needs step-by-step computation (e.g., decompose the query → retrieve evidence → reason through the evidence → final answer), and we have to decide where and how that computation happens.

The first option is the classic one: turn the video into text (video captions, audio transcription) or structured symbols (object tracks, bounding boxes), then invoke a text LLM to reason over these intermediate artifacts. Although the video representation here is not adaptive (the artifacts are fixed after the initial step), this option is pretty robust in practice and is always worth including as a baseline. As mentioned above, subtitles/ASR alone are very competitive on some tasks. In addition, ObjectMLLM (Tang et al. 2025) found that explicit object structure remains necessary and that the best way to feed it is as plain text rather than distributed visual embeddings.
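A text-only baseline of this kind fits in a few lines. The sketch below assumes the video has already been turned into timestamped caption or ASR segments, uses naive word overlap for retrieval, and stops at building the prompt; the segment data, function names, and prompt wording are all hypothetical, and the LLM call itself is left out.

```python
import re


def words(text):
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve_segments(segments, question, top_k=2):
    """Pick the top_k (start_s, text) segments by word overlap with the
    question, returned in temporal order."""
    q_words = words(question)
    ranked = sorted(segments, key=lambda seg: len(q_words & words(seg[1])), reverse=True)
    return sorted(ranked[:top_k])


def build_prompt(segments, question, top_k=2):
    """Format the retrieved evidence and the question for a text LLM."""
    lines = [
        f"[{start_s // 60:02d}:{start_s % 60:02d}] {text}"
        for start_s, text in retrieve_segments(segments, question, top_k)
    ]
    return "Evidence:\n" + "\n".join(lines) + f"\nQuestion: {question}\nAnswer:"


# Hypothetical caption track for a two-hour video.
captions = [
    (0, "A man enters the station"),
    (1202, "A woman in a blue coat appears"),
    (3600, "The woman in the blue coat boards a train"),
    (5000, "Credits roll"),
]
print(build_prompt(captions, "What happens to the woman in the blue coat?"))
```

Nothing here looks at pixels, which is exactly why it is a useful baseline: if it scores well on a benchmark, the benchmark is not testing much visual understanding.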

The second option is an external agent loop. Typically, the model is given tools - crop the video, retrieve a clip, run a detector - and then observes, reasons, acts, and repeats across rounds. What differs across works is how the agent's policy is obtained.

Prompted / training-free orchestration - the policy is hand-built or zero-shot prompted:

VideoMind (Bhatnagar et al. 2026) - a single MLLM plays multiple roles: decomposes the query into sub-queries and switches between operational modes (multi-scale temporal search vs. single-frame visual detail), allocating compute on the fly.
Deep Video Discovery (Zhang et al. 2025) - first indexes the video into a multi-granular, searchable database (segmented clips → captions/embeddings), then lets an LLM agent autonomously search and retrieve over it with tools, rather than following a fixed retrieval procedure.
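The observe-reason-act loop shared by these systems can be sketched as follows. This is a generic skeleton, not the architecture of VideoMind or Deep Video Discovery: the policy here is a hand-written script standing in for a prompted MLLM, and the tool, caption index, and action format are all hypothetical.

```python
def run_agent(policy, tools, question, max_rounds=5):
    """Run the loop until the policy answers or the round budget runs out."""
    history = []  # (action, observation) pairs seen so far
    for _ in range(max_rounds):
        action = policy(question, history)  # reason: choose the next step
        if action["type"] == "answer":
            return action["text"], history
        observation = tools[action["tool"]](**action["args"])  # act, then observe
        history.append((action, observation))
    return None, history  # budget exhausted without an answer


# Hypothetical pre-built index: (start_s, end_s) -> caption.
CAPTIONS = {(1200, 1230): "a woman in a blue coat enters the station"}


def retrieve_clip(start_s, end_s):
    """Hypothetical tool: captions of indexed clips overlapping the range."""
    return [text for (s, e), text in CAPTIONS.items() if s < end_s and e > start_s]


def scripted_policy(question, history):
    """Hand-built stand-in for a prompted model: retrieve once, then answer."""
    if not history:
        return {"type": "tool", "tool": "retrieve_clip",
                "args": {"start_s": 1202, "end_s": 1222}}
    _, observation = history[-1]
    return {"type": "answer", "text": "; ".join(observation) or "no evidence found"}


answer, trace = run_agent(scripted_policy, {"retrieve_clip": retrieve_clip},
                          "Who appears around 20:02?")
print(answer, len(trace))
```

Everything interesting lives in `policy`: swapping the script for a prompted or trained model changes the behavior without touching the loop.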

Trained policy - a clear recent trend: instead of fixing the scaffold by hand, train the agent's policy. The recipe usually starts with an SFT cold start (imitate expert traces to teach a tool's format and semantics), then a policy-optimization step - and there are two flavors of that step:

Trajectory preference optimization - VideoExplorer (Yuan et al. 2025) intertwines planning, temporal grounding, and re-perception in one loop; after SFT, it applies trajectory-level DPO (TDPO) to reward faithful full trajectories and penalize flawed reasoning paths.
Online RL (GRPO) - LongVT (Yang et al. 2026) trains a native "crop-and-re-inspect" tool for a global-skim-then-local-zoom loop; VideoSeeker (Zhao et al. 2026) trains instance-level view/crop tools driven by visual prompts. Both belong to a fast-growing cluster (see also LongVideo-R1, mentioned above).
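For context on the GRPO step, its core signal is a group-relative advantage: several rollouts are sampled for the same question, and each rollout's reward is normalized against the group. The sketch below shows that commonly described normalization only; the exact variants used by the papers above are not specified here, and the `eps` term is an assumed numerical-stability detail.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward by the mean and standard deviation
    of its group (all rollouts sampled for the same question)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical rollouts for one question: two succeeded, two failed.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# If every rollout gets the same reward, there is no learning signal.
print(group_relative_advantages([0.5, 0.5, 0.5]))
```

Because the baseline is the group mean rather than a learned value function, the reward design carries most of the weight in these recipes.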

It’s worth noting that the training recipe is finicky here. Base models don’t use tools on their own, so SFT is needed to ground the tool. An explicit “use the tool” reward yields little gain once SFT has grounded the tool, and naive recall-based grounding rewards get hacked easily (an IoU-style reward is better). After SFT, RL mostly helps learn a better agent policy for the same external loop, e.g., fewer wasted inference calls.
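The reward-hacking point is easy to see with temporal intervals. In this minimal sketch, intervals are `(start_s, end_s)` pairs and the numbers are hypothetical; it only compares the two reward shapes and does not reproduce any paper's exact reward.

```python
def overlap(pred, gt):
    """Length in seconds of the intersection of two (start_s, end_s) intervals."""
    return max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))


def recall_reward(pred, gt):
    """Fraction of the ground-truth interval covered by the prediction."""
    return overlap(pred, gt) / (gt[1] - gt[0])


def iou_reward(pred, gt):
    """Intersection over union: also penalizes predicting too much."""
    inter = overlap(pred, gt)
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


gt = (1202, 1222)            # the relevant 20 seconds
whole_video = (0, 7200)      # degenerate policy: "crop" the entire video
tight_guess = (1200, 1225)   # a genuinely localized prediction

for name, pred in [("whole video", whole_video), ("tight guess", tight_guess)]:
    print(f"{name}: recall={recall_reward(pred, gt):.3f} iou={iou_reward(pred, gt):.3f}")
```

Selecting the whole video earns a perfect recall of 1.0 while its IoU is 20 / 7200, under 0.01; the tight guess gets the same recall but an IoU of 0.8, so only the IoU-style reward separates the two.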

Option 3 takes the Option 2 trend one step further. The RL-trained agents above still run an explicit external loop - RL just gives them a better policy for it. Option 3 asks: once the policy lives in the weights, why keep the external loop at all? It collapses the step-by-step computation into the MLLM's own forward pass, so the model learns the agentic behavior rather than relying on orchestration wrapped around it. As with everything else, there is a spectrum of how far you take this:

Latent reasoning - instead of emitting explicit tool calls or text, just do the intermediate reasoning in continuous hidden states. In theory this can be efficient and fully end-to-end, but in practice supervision, training, and generalization may get very challenging - and that’s perhaps why I couldn’t find a good recent example for long-form video understanding.
Internalized discrete operations - instead of executing operations externally, keep the operations internal and interpretable. ATLAS (Guo et al. 2026) is a good recent example: it represents each visual operation as a single discrete “functional token” - no external tool call, no context-switching, yet still an interpretable trace. However, it is done on image reasoning - not long-form video yet.
Internalized full reasoning loop - InternVideo3 (Yan et al. 2026), mentioned above in the memory section, formulates “Multimodal Contextual Reasoning”: observe → reason → act → update, inside one context rather than an external loop. It’s worth noting that this is made possible by the attention trick (M2LA). The model can still call tools, but the agent loop is now a native property of the model’s context, not an external scaffold.

Frontier labs might be converging on this last approach. For example, Kimi K2.5 (Kimi Team 2026) starts from a native multimodal model and trains its agentic behavior in - its multi-agent “swarm” orchestrator is RL-trained rather than hand-built.

However, one question still remains unsettled: does internalizing the agent loop into the model always beat external agent loops? Or are we just hiding the same computation behind a different interface (now the model itself)? Does a fundamentally different approach exist, yet to be fully explored?

This last approach also illustrates that the memory and compute axes are never fully independent of each other: InternVideo3 pairs “Multimodal Contextual Reasoning” with “Multimodal Multi-head Latent Attention”. The third bottleneck, evaluation, is deeply entangled with everything discussed here too - I will cover that in a future writeup (Part 2).
