[Paper Review] VGGT-Ω: Scaling Feed-Forward 3D Reconstruction (CVPR 2026 Oral)
Paper: VGGT-Ω: Scaling Feed-Forward 3D Reconstruction
Venue: CVPR 2026 (Oral Presentation)
Authors: Jianyuan Wang, Minghao Chen, Shangzhan Zhang, Nikita Karaev, Johannes Schönberger, Patrick Labatut, Piotr Bojanowski, David Novotny, Andrea Vedaldi, Christian Rupprecht
Affiliations: Visual Geometry Group (VGG), University of Oxford + Meta AI
arXiv: 2605.15195
GitHub: facebookresearch/vggt-omega
One-Line Summary
The successor to VGGT. A 1B-parameter feed-forward model that cuts memory by 70% via Register Attention, trains on 15× more data, and dramatically improves 3D reconstruction for both static and dynamic scenes.

1. Background and Problem Statement
VGGT (CVPR 2025 Best Paper) successfully applied the foundation model philosophy to 3D reconstruction. However, practical deployment revealed several structural limitations.
Limitations of VGGT:
- Global Attention memory explosion: In the global layers of VGGT's alternating scheme, all tokens across all N frames attend to one another. With K tokens per frame, this costs O(N²K²), which grows quadratically with frame count.
- No dynamic scene support: Training data was predominantly static scenes, making the model ill-equipped for scenes with moving objects.
- Multi-head complexity: Three specialized heads (Camera, DPT, Tracking) limit scalability.
- High-resolution conv bottleneck: The DPT head’s high-resolution convolution layers create memory and speed bottlenecks.
VGGT-Ω’s central question:
“If the right architectural choices are combined with sufficiently large data, does the quality of 3D reconstruction models scale predictably?”
2. Core Ideas
Three contributions working in concert.
① Register Attention
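Global Self-Attention is replaced by bottleneck communication through register tokens. Inter-frame information exchange happens exclusively through registers; image tokens only perform within-frame attention. Complexity drops from O(N²K²) to O(N²R² + NK²), where R ≪ K is the number of register tokens per frame. The first term is the cross-frame attention over all N·R registers, and the second is the within-frame attention repeated for each of the N frames.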
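To make the two complexity expressions concrete, here is a minimal sketch that counts query-key pairs per layer as a rough proxy for attention cost. The patch size of 16 (which gives K = 39 × 26 = 1014 at 624×416) and R = 16 registers per frame are assumptions chosen for illustration; this review does not state the paper's actual values. The within-frame term is counted as N(K+R)², which reduces to roughly NK² when R ≪ K.

```python
def global_attention_pairs(n_frames, k_tokens):
    """Query-key pairs when every token of every frame attends to every other."""
    total_tokens = n_frames * k_tokens
    return total_tokens * total_tokens


def register_attention_pairs(n_frames, k_tokens, r_registers):
    """Query-key pairs for within-frame attention plus cross-frame register attention."""
    within_frame = n_frames * (k_tokens + r_registers) ** 2
    cross_frame = (n_frames * r_registers) ** 2
    return within_frame + cross_frame


# Hypothetical values, not taken from the paper.
PATCH = 16
K = (624 // PATCH) * (416 // PATCH)  # image tokens per frame
R = 16                               # register tokens per frame

for n in (10, 100, 500):
    g = global_attention_pairs(n, K)
    r = register_attention_pairs(n, K, R)
    print(f"N={n:4d}  global={g:.3e}  register={r:.3e}  ratio={g / r:.1f}x")
```

Under these assumed values the within-frame term dominates at all three frame counts, so the register variant grows roughly linearly in N while the global variant grows quadratically.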
② Single Dense Prediction Head
VGGT’s three specialized heads (Camera, DPT, Tracking) are consolidated into a single multi-task Dense Prediction Head. High-resolution convolutional layers are removed. The simpler design is more amenable to scaling.
③ Massive, Diverse Training
- 15× more supervised data compared to VGGT
- New self-supervised learning protocol for unlabeled video
- Newly built dynamic scene annotation pipeline
3. Architecture
Overall Pipeline
N input images
→ patch tokenizer → image tokens (K tokens per frame)
→ register tokens (R tokens per frame, R ≪ K) appended
→ Transformer layers (Register Attention)
    ├─ Within-frame Self-Attention: image + register tokens (K+R)
    └─ Cross-frame Self-Attention: register tokens only (R × N frames)
→ Single Dense Prediction Head (multi-task supervision)
    → camera extrinsics (rotation + translation)
    → camera intrinsics (FoV)
    → depth maps + confidence maps
    → scene registers (reusable for downstream tasks)

Register Attention in Detail
The critical bottleneck in VGGT’s Alternating-Attention was the Global Self-Attention step (all tokens across all frames attending to one another). VGGT-Ω decouples this into two stages:
- Within-frame attention: Each frame’s image tokens and register tokens attend together. Spatial information is compressed into the registers.
- Cross-register attention: Only register tokens from all frames attend globally. Scene-wide 3D structure is exchanged at this compressed bottleneck.
Image tokens can only access cross-frame information by routing through registers. The registers act as per-frame scene aggregators.
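The two stages described above can be sketched in a few lines of NumPy. This is an illustration of the information flow only, not the authors' implementation: it uses single-head attention with no learned projections, normalization, or MLP, and the function names and tensor sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    """Projection-free single-head self-attention. tokens: (batch, T, D)."""
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores) @ tokens


def register_attention_layer(image_tokens, register_tokens):
    """image_tokens: (N, K, D), register_tokens: (N, R, D)."""
    n, k, d = image_tokens.shape
    r = register_tokens.shape[1]

    # Stage 1: within-frame attention over the K + R tokens of each frame.
    frame = np.concatenate([image_tokens, register_tokens], axis=1)
    frame = frame + self_attention(frame)  # residual connection
    image_out, regs = frame[:, :k], frame[:, k:]

    # Stage 2: cross-frame attention over the N * R register tokens only.
    flat = regs.reshape(1, n * r, d)
    flat = flat + self_attention(flat)
    return image_out, flat.reshape(n, r, d)


rng = np.random.default_rng(0)
images = rng.normal(size=(3, 8, 4))     # N=3 frames, K=8 tokens, D=4
registers = rng.normal(size=(3, 2, 4))  # R=2 registers per frame
img_out, reg_out = register_attention_layer(images, registers)
print(img_out.shape, reg_out.shape)
```

Within a single layer of this sketch, the image tokens of one frame depend only on that frame, while its registers already depend on every frame. Cross-frame information reaches the image tokens in the next layer, when they attend to their own updated registers.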

Memory Efficiency
Register Attention cuts memory by 70% relative to Global Self-Attention. Combined with the removal of high-resolution convolutional layers, VGGT-Ω requires approximately 30% of VGGT’s memory.
| # Frames | GPU Memory (GB) | Resolution |
|---|---|---|
| 1 | 6.02 | 624×416 |
| 10 | 6.67 | 624×416 |
| 25 | 7.80 | 624×416 |
| 50 | 9.66 | 624×416 |
| 100 | 13.37 | 624×416 |
| 200 | 20.82 | 624×416 |
| 300 | 28.26 | 624×416 |
| 500 | 43.15 | 624×416 |
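The table is close to a straight line in the number of frames. The sketch below fits an ordinary least-squares line to the eight reported points and uses it to estimate how many frames fit in a given memory budget. The helper names are hypothetical, and applying the fit beyond 500 frames or to other resolutions is an assumption, not something the table demonstrates.

```python
import math

frames = [1, 10, 25, 50, 100, 200, 300, 500]
memory_gb = [6.02, 6.67, 7.80, 9.66, 13.37, 20.82, 28.26, 43.15]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(frames, memory_gb)


def predicted_memory_gb(n_frames):
    return slope * n_frames + intercept


def max_frames(budget_gb):
    """Largest frame count whose predicted memory stays within the budget."""
    if budget_gb < predicted_memory_gb(1):
        return 0
    return math.floor((budget_gb - intercept) / slope)


print(f"per-frame cost: {slope * 1024:.1f} MB, fixed cost: {intercept:.2f} GB")
print("frames that fit in 24 GB:", max_frames(24.0))
```

The fit suggests a fixed cost of a little under 6 GB plus a small, constant cost per additional frame, which matches the linear-in-N behaviour expected once cross-frame attention is restricted to registers.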
4. Training
Data Composition
VGGT-Ω uses 15× more supervised training data than VGGT. Two major expansions:
New: Dynamic Scene Annotation Pipeline
- An automated pipeline for annotating dynamic scenes with moving objects
- Overcomes the limitation of existing datasets being dominated by static scenes
New: Self-Supervised Learning
- A self-supervised learning protocol for unlabeled video data
- Enables learning 3D structure from internet-scale video without manual annotation

5. Experimental Results
Main Benchmark Results



Dynamic Scene Results
VGGT-Ω is the first in this line of work to directly target dynamic scenes. On the Sintel benchmark, it achieves a 77% improvement in camera estimation over the previous best.



6. Ablation Study

7. Downstream Use of Scene Registers
A side benefit of Register Attention is that the learned scene registers are reusable. During 3D reconstruction training, the registers learn to encode a compact spatial representation of each frame's scene, and these representations transfer to other spatial understanding tasks.
Confirmed downstream applications:
- Vision-Language-Action (VLA) models: Scene registers from VGGT-Ω benefit robot manipulation and other VLA tasks requiring spatial understanding.
- Language Alignment: The VGGT-Omega-1B-256-Text-Alignment checkpoint outputs text-aligned embeddings for cross-modal retrieval.
This suggests that 3D reconstruction is not just a geometry task but a powerful and scalable proxy task for spatial understanding.


8. Comparison with VGGT
| Aspect | VGGT | VGGT-Ω |
|---|---|---|
| Cross-frame attention | Global Self-Attention (all tokens) | Register Attention (registers only) |
| Prediction heads | Camera + DPT + Tracking (3 heads) | Single Dense Prediction Head |
| Dynamic scenes | Not supported | Supported |
| Training data | ~17 datasets | ~15× larger + self-supervised |
| GPU memory (100 frames) | ~21 GB (336×518) | 13.37 GB (624×416; higher resolution, less memory) |
| Sintel camera estimation | Baseline | +77% improvement |
| Scene register reuse | None | Transfers to VLA / language alignment |
| Scaling validation | — | Scaling laws empirically demonstrated |
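The GPU memory row compares two different resolutions, so a quick calculation helps put the numbers on the same footing. The sketch below uses only the figures in the table; the VGGT value is the approximate "~21 GB", so the ratios are rough. Dividing total memory by pixel count also folds in fixed costs such as model weights, which is a simplification.

```python
# Figures from the comparison table (100 frames each).
vggt = {"resolution": (336, 518), "memory_gb": 21.0}    # "~21 GB" is approximate
omega = {"resolution": (624, 416), "memory_gb": 13.37}


def pixels(cfg):
    a, b = cfg["resolution"]
    return a * b


pixel_ratio = pixels(omega) / pixels(vggt)              # pixels per frame, VGGT-Ω vs VGGT
memory_ratio = omega["memory_gb"] / vggt["memory_gb"]   # total memory, VGGT-Ω vs VGGT
per_pixel_ratio = memory_ratio / pixel_ratio            # memory per input pixel

print(f"pixels per frame: {pixel_ratio:.2f}x")
print(f"total memory:     {memory_ratio:.2f}x")
print(f"memory per pixel: {per_pixel_ratio:.2f}x")
```

By this rough measure, VGGT-Ω processes about 1.5× as many pixels per frame while using under two thirds of the memory.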
9. Limitations
- High frame counts at high resolution still require substantial memory (500 frames at 624×416 need 43.15 GB)
- Self-supervised signal may be noisier than supervised annotations
- The text-aligned model is limited to a low 256px resolution
- HuggingFace model access requires manual approval
10. Summary
VGGT-Ω empirically validates that scaling laws apply to 3D reconstruction. The core message:
“Register Attention + single head + massive data = predictable scaling”
If VGGT proved the foundation model paradigm could work for 3D vision, VGGT-Ω confirms that it scales according to the same laws observed in NLP and 2D vision. Reduce architectural complexity (Register Attention, single head), increase data (15× supervised + self-supervised), and performance improves predictably — while the learned representations transfer to other spatial understanding tasks.
Scaling laws for 3D reconstruction are no longer a hypothesis.