Foresight
Can AI benefit from "imagining" the future before making decisions?
This research project explored whether AI systems could improve their reasoning by generating video predictions and checking them against reality, much as humans often visualize outcomes before acting.
Result: Not yet. Current models aren't capable of this. We document what works and what doesn't, and propose tracking this capability as a benchmark for future AI systems.
The Idea (Plain English)
When you're about to pour coffee, you might briefly imagine the liquid filling the cup. If you imagined it overflowing, you'd pour less. This "mental simulation" helps you make better decisions.
We asked: Can AI do something similar?
The plan was:
- Show the AI an image (a cup, a ball, etc.)
- Ask "What happens if I push this?"
- Have it generate a video prediction of the outcome
- Compare that prediction to what actually happens
- Use the difference to improve future predictions
If this worked, AI systems could catch their own mistakes by noticing when their predictions look wrong—just like you'd notice if your mental image of pouring coffee showed it going sideways instead of down.
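A minimal sketch of that loop, with hypothetical interfaces (`vlm`, `video_model`, `perceptual_distance`, `env`) standing in for the real components the experiments used (Qwen2.5-VL, LTX-Video, LPIPS):

```python
# Hypothetical sketch of the imagine-then-verify loop described above.
# The component interfaces are assumptions, not the repo's actual API.

def imagine_and_verify(image, question, vlm, video_model, perceptual_distance, env):
    # 1. The VLM proposes an answer from the current image.
    answer = vlm.answer(image, question)

    # 2. A video model "imagines" what should happen next given that answer.
    predicted_frames = video_model.generate(image, condition=answer)

    # 3. The environment (or a held-out clip) shows what actually happens.
    actual_frames = env.observe()

    # 4. Compare prediction to reality; a large error flags a likely mistake.
    error = perceptual_distance(predicted_frames, actual_frames)
    return answer, error
```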
What We Found
Summary Table
| Phase | Question | Result | Key Finding |
|---|---|---|---|
| 1. Reconstruction | Can we decode images from AI's internal representation? | ✅ Passed | Hybrid approach (DINOv2 + VLM) preserves spatial info |
| 2. Bridging | Can we connect the language model to a video generator? | ✅ Passed | Small 10M adapter works better than large 100M one |
| 3. Prediction | Can the AI predict what happens next? | ❌ Failed | 7 architectures tested—none beat just copying the input |
| 4. Verification | Does comparing predictions to reality help? | ❌ Failed | Perceptual similarity doesn't indicate correctness |
Detailed Results
| Metric | What It Measures | Achieved | Needed | Status |
|---|---|---|---|---|
| Spatial IoU | Position accuracy | 0.837 | > 0.60 | ✅ |
| LPIPS | Visual quality | 0.162 | < 0.35 | ✅ |
| Prediction vs Copy | Can it predict better than copying? | -4.5% | > 0% | ❌ |
| LPIPS-Correctness correlation | Does visual error indicate wrong answer? | 0.106 | > 0.30 | ❌ |
| Self-correction rate | Can it fix mistakes with feedback? | 7.4% | > 15% | ❌ |
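For reference, spatial IoU is the standard intersection-over-union between predicted and ground-truth object regions. A minimal sketch for axis-aligned boxes follows; the exact matching protocol used in the experiments is not shown here.

```python
# Minimal IoU sketch for axis-aligned boxes given as (x1, y1, x2, y2) in pixels.
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(box_iou((10, 10, 50, 50), (20, 20, 60, 60)))  # ~0.39
```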
The Key Failures
1. VLMs Can't Predict the Future
We tested seven approaches to making the language model predict what happens next in a video, including:
- Single frame input
- Multiple frames input
- Temporal transformers
- Contrastive learning
- Pixel-level feedback
- Fine-tuning the model
All of them performed worse than simply copying the current frame as the "prediction." The language model understands what's in an image, but it cannot predict what will change.
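Concretely, the copy baseline treats the current frame itself as the "prediction" and asks whether the model's output gets any closer to the real future frame than that. A minimal sketch of the comparison, assuming the `lpips` package and frames as normalized tensors in `[-1, 1]` with shape `(1, 3, H, W)`:

```python
# Sketch of the copy-baseline check; interface details are assumptions.
import lpips

loss_fn = lpips.LPIPS(net="alex")

def beats_copy_baseline(current_frame, predicted_frame, future_frame):
    # Error of the model's prediction against the real future frame.
    pred_err = loss_fn(predicted_frame, future_frame).item()
    # Error of the trivial baseline: pretend nothing changes.
    copy_err = loss_fn(current_frame, future_frame).item()
    # A useful predictor should have strictly lower error than copying.
    return pred_err < copy_err
```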
2. Visual Similarity ≠ Semantic Correctness
Even when we used a video model to generate predictions (which looked reasonable), comparing them to reality using perceptual metrics (LPIPS) didn't help. Surprisingly, wrong predictions often looked MORE similar to reality than correct ones.
This means you can't use "does it look right?" to catch mistakes—the visual appearance doesn't indicate whether the prediction is semantically correct.
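The verification check behind this finding can be sketched as a simple correlation test: per-example perceptual error against a binary "was the answer correct" label. The numbers below are illustrative placeholders, not the project's data; the project's measured correlation was only ~0.11.

```python
# Does perceptual error (LPIPS) track answer correctness? Illustrative data only.
import numpy as np
from scipy.stats import pointbiserialr

lpips_scores = np.array([0.21, 0.34, 0.18, 0.40, 0.29])  # lower = "looks right"
is_correct   = np.array([1,    0,    1,    0,    1])     # ground-truth labels

r, p = pointbiserialr(is_correct, lpips_scores)
print(f"point-biserial r = {r:.2f} (p = {p:.3f})")
# A usable verifier would need |r| well above the ~0.3 threshold in the table above.
```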
What Did Work
Despite the negative results, we made useful discoveries:
| Finding | Why It Matters |
|---|---|
| Hybrid encoder (DINOv2 + VLM) preserves spatial information | Solves the problem of VLMs losing position data |
| VLMs understand generated video (93% retention) | Video models generate content VLMs can reason about |
| Small adapters work (10M beats 100M) | Efficient bridging between models is possible |
| Video Predicts → VLM Describes works | Use each model for what it's good at |
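For intuition, the bridging adapter referred to in the table can be as small as a two-layer MLP mapping frozen-VLM hidden states into the video model's latent space. The dimensions below are assumptions for illustration, not the exact configuration used in the experiments.

```python
# Sketch of a small bridging adapter; all dimensions are assumed, not the repo's config.
import torch.nn as nn

class BridgeAdapter(nn.Module):
    def __init__(self, vlm_dim=3584, video_dim=128, hidden=2560):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(vlm_dim),
            nn.Linear(vlm_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, video_dim),
        )

    def forward(self, vlm_tokens):        # (batch, seq, vlm_dim)
        return self.net(vlm_tokens)       # (batch, seq, video_dim)

adapter = BridgeAdapter()
# ~9.5M parameters with these assumed dimensions, i.e. the "small adapter" regime.
print(sum(p.numel() for p in adapter.parameters()) / 1e6, "M params")
```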
Benchmark Proposal: VideoReason
We're releasing this as a benchmark to track when this approach becomes viable. As video models improve, these capabilities may emerge.
Tasks to track:
- Future frame prediction accuracy
- Action understanding in generated vs real video
- Verification metric correlation with correctness
- Self-correction success rate
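One way a benchmark submission could be recorded is a small result record per model. The field names below are illustrative, not a finalized schema; the baseline values are the 2026 numbers from the tables above.

```python
# Illustrative result record for tracking VideoReason over time.
from dataclasses import dataclass

@dataclass
class VideoReasonResult:
    model: str
    frame_prediction_vs_copy: float      # relative gain; negative = worse than copying
    action_understanding_retention: float  # accuracy on generated vs real video
    verification_correlation: float      # perceptual error vs answer correctness
    self_correction_rate: float          # fraction of errors fixed with feedback

baseline_2026 = VideoReasonResult(
    model="Qwen2.5-VL-7B + LTX-Video",
    frame_prediction_vs_copy=-0.045,
    action_understanding_retention=0.93,
    verification_correlation=0.106,
    self_correction_rate=0.074,
)
```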
Why track this? Video generation is improving rapidly. The capabilities we found lacking in 2026 may emerge in future systems. A standardized benchmark helps identify when "visual imagination" becomes useful for AI reasoning.
Prerequisites
To reproduce the experiments, you'll need accounts with these services:
| Service | Purpose | Sign Up |
|---|---|---|
| Modal | GPU compute for experiments (A100s) | modal.com |
| Hugging Face | Model downloads (Qwen2.5-VL, LTX-Video) | huggingface.co |
| Weights & Biases | Experiment tracking and logging | wandb.ai |
Dataset: Something-Something v2
The experiments use the Something-Something v2 dataset for action prediction. This must be downloaded manually:
- Go to Qualcomm AI Datasets
- Request access and download the dataset
- Extract to `data/something-something-v2/`
The dataset contains ~220K videos of humans performing 174 different actions (pushing, pulling, dropping, etc.).
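A quick sanity check after extraction might look like the following. The annotation filenames are assumptions about the standard Something-Something v2 release and may differ from the layout this repo expects.

```python
# Rough dataset sanity check; filenames below are assumptions.
import json
from pathlib import Path

root = Path("data/something-something-v2")
labels = json.loads((root / "labels.json").read_text())
train = json.loads((root / "train.json").read_text())

print(f"{len(labels)} action classes, {len(train)} training clips")
# Expect roughly 174 classes drawn from the ~220K total videos.
```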
Setup
```bash
# 1. Clone and install dependencies
git clone https://github.com/a1j9o94/foresight.git
cd foresight
uv sync

# 2. Copy environment template
cp .env.example .env
# Edit .env with your API keys (WANDB_API_KEY, HF_TOKEN)

# 3. Configure Modal secrets
modal secret create wandb-api-key WANDB_API_KEY=<your-wandb-key>
modal secret create huggingface-secret HF_TOKEN=<your-hf-token>

# 4. Download models (first time only, ~20GB)
uv run modal run infra/modal/app.py::download_models

# 5. Verify setup
uv run modal run infra/modal/app.py::smoke_test
```
Quick Start
```bash
# Run demo locally (shows the working parts)
cd demo/backend && uvicorn main:app --reload --port 8000
cd demo/frontend && bun run dev
# Open http://localhost:3000

# Run experiments on Modal GPUs
uv run modal run infra/modal/app.py::run_experiment --experiment-id <id>

# Test experiment harness without GPU
uv run modal run infra/modal/app.py::run_experiment --experiment-id c1-vlm-latent-sufficiency --stub-mode
```
Demo
- Video Walkthrough: https://youtu.be/YJxDt_zCrUI
- Live Demo: https://foresight-demo-kappa.vercel.app
- Backend API: https://foresight-demo.fly.dev
Project Structure
```
foresight/
├── paper/           # Research paper (LaTeX)
├── research/        # Experiment results & findings
│   ├── FINDINGS.md  # Summary of all results
│   └── experiments/ # Per-experiment details
├── infra/modal/     # GPU experiment infrastructure
│   └── handlers/    # Experiment implementations
├── demo/            # Live demo (React + FastAPI)
├── packages/        # Modular code packages
└── configs/         # Model/training configs
```
Tools & Models Used
| Component | Tool | Notes |
|---|---|---|
| Vision-Language Model | Qwen2.5-VL-7B | Frozen, used for encoding |
| Visual Encoder | DINOv2-ViT-L | Spatial feature extraction |
| Video Generation | LTX-Video | Real-time video synthesis |
| Perceptual Metric | LPIPS | Learned perceptual similarity |
| GPU Compute | Modal | A100-80GB for experiments |
| Experiment Tracking | Weights & Biases | Metrics and artifacts |
| Package Manager | uv | Fast Python packaging |
| Frontend | React + TypeScript + Bun | Demo UI |
| Backend | FastAPI | Demo API |
Documentation
- Research Findings - Detailed experiment results
- Paper - Full writeup with citations
- CLAUDE.md - Development guide
Citation
If you use this work, please cite:
```bibtex
@misc{obleton2026foresight,
  title={Foresight: Can Video Prediction Ground Language Model Reasoning? A Negative Result and Benchmark Proposal},
  author={Adrian Obleton},
  year={2026},
  url={https://github.com/a1j9o94/foresight},
  note={Research prototype and benchmark}
}
```
License
Research prototype - released for academic use. See paper for full methodology.