[Paper Review] VGGT: Visual Geometry Grounded Transformer (CVPR 2025 Best Paper)
Paper: VGGT: Visual Geometry Grounded Transformer
Venue: CVPR 2025 (Best Paper Award)
Authors: Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, David Novotny
Affiliations: Visual Geometry Group (VGG), University of Oxford + Meta AI
arXiv: 2503.11651
GitHub: facebookresearch/vggt
One-Line Summary
Feed N images in a single forward pass and get camera parameters + depth maps + point clouds + 3D tracking all at once in under 0.2 seconds — with a 1.2B-parameter Transformer.
1. Background and Problem Statement
The traditional pipeline for 3D reconstruction follows a sequential structure: SfM (Structure from Motion) → MVS (Multi-View Stereo) → Bundle Adjustment. Each stage is decoupled and requires iterative optimization, making reconstruction of a single scene take anywhere from tens of seconds to several minutes.
Recent learning-based methods such as DUSt3R and MASt3R tried to change this paradigm by processing image pairs and directly predicting 3D point maps. However, they still have notable limitations:
- Pairwise processing: Only two images are handled at a time; the full scene is assembled by merging pairwise results in a post-processing step.
- Global alignment required: An optimization step is mandatory to register all pairwise point maps into a unified coordinate system.
- Slow speed: 7–9 seconds per scene.
VGGT asks a simple question: “Can we do all of this at once?”
2. Core Idea
VGGT’s central claim is straightforward:
A large Transformer trained on sufficiently diverse 3D data can solve the main 3D vision tasks (camera pose, depth, point maps, and tracking) simultaneously, with minimal geometry-specific inductive biases and no mandatory post-processing.
This directly applies the foundation model philosophy, proven in NLP and 2D vision, to 3D reconstruction. The design principles are threefold:
- Process all input images simultaneously (no sequential or pairwise ordering).
- Single forward pass — no iterative optimization.
- Multi-task output — camera poses, depth, point maps, and tracking, all at once.
3. Architecture
Overall Pipeline
N input images
→ DINOv2 patch tokenizer → image tokens (K tokens per image)
→ camera token (1 per image) + register tokens (4 per image) appended
→ Alternating-Attention Transformer (24 layers, dim=1024, 16 heads)
├─→ Camera Head → intrinsic & extrinsic camera parameters
├─→ DPT Head → depth maps, point maps, uncertainty maps
└─→ Tracking Head → 2D point correspondences across frames
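To make the token bookkeeping in the pipeline concrete, here is a minimal sketch that counts the tokens each frame contributes. It assumes a patch size of 14 (the DINOv2 ViT-L/14 setting, which this review does not state explicitly) and uses hypothetical helper names; the 336×518 resolution is the one quoted in the speed profile.

```python
def tokens_per_frame(height, width, patch_size=14, camera_tokens=1, register_tokens=4):
    """Count the tokens one frame contributes to the Transformer input."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    # K in the pipeline diagram: one token per non-overlapping patch.
    patch_tokens = (height // patch_size) * (width // patch_size)
    return patch_tokens + camera_tokens + register_tokens


def total_tokens(num_frames, height, width):
    """Sequence length seen by a global attention layer over all frames."""
    return num_frames * tokens_per_frame(height, width)


print(tokens_per_frame(336, 518))  # 24 * 37 patches + 5 extra tokens = 893
print(total_tokens(10, 336, 518))  # 8930
```

The global sequence length grows linearly with the number of frames, which is why the cost of the global attention layers dominates for long inputs.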

Alternating-Attention (Key Design Choice)
The 24 Transformer layers alternate between two types of attention:
- Even layers → Frame-wise Self-Attention: Attention is restricted to tokens within the same image, preserving spatial context for each frame.
- Odd layers → Global Self-Attention: All tokens across all frames attend to one another, enabling multi-view correspondence and 3D reasoning.
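The difference between the two attention types can be illustrated with explicit attention masks. This is only a sketch: the 0-based layer indexing and the function names are assumptions, and a practical implementation would more likely reshape the token tensor per frame than build a dense mask.

```python
def layer_schedule(num_layers=24):
    """Even-indexed layers are frame-wise, odd-indexed layers are global (0-based, assumed)."""
    return ["frame" if i % 2 == 0 else "global" for i in range(num_layers)]


def attention_mask(num_frames, tokens_per_frame, kind):
    """Return mask[q][k] == True where query token q may attend to key token k."""
    total = num_frames * tokens_per_frame
    if kind == "global":
        # Every token sees every token in every frame.
        return [[True] * total for _ in range(total)]
    if kind == "frame":
        # A token only sees tokens that belong to the same frame.
        return [
            [q // tokens_per_frame == k // tokens_per_frame for k in range(total)]
            for q in range(total)
        ]
    raise ValueError(f"unknown attention kind: {kind}")


def attended_pairs(mask):
    """Number of (query, key) pairs allowed by the mask."""
    return sum(sum(row) for row in mask)
```

With N frames of P tokens each, the frame-wise mask allows N·P² pairs while the global mask allows (N·P)² pairs, so the frame-wise layers are cheaper by a factor of N.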
Cross-attention is not used. The ablation below validates this design choice (ETH3D Overall, lower is better):

Camera Head
The camera tokens pass through 4 additional self-attention layers and are projected linearly to produce a 9-dimensional vector per image:
- Quaternion rotation (4D) + translation (3D) + field of view FoV (2D)
The first image’s camera is always fixed to the identity (serving as the world coordinate origin).
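As an illustration of how such a 9-dimensional vector could be turned into usable camera parameters, the sketch below decodes it into a rotation matrix, a translation, and focal lengths in pixels. Several details are assumptions rather than facts from the paper or the released code: the component order follows the list above, the quaternion is ordered (w, x, y, z), the FoV is given in radians as (vertical, horizontal), and the principal point sits at the image center.

```python
import math


def quaternion_to_rotation(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix (list of rows)."""
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)  # normalize so the result is a valid rotation
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def decode_camera(vec9, height, width):
    """Split a 9-D camera vector into extrinsics and pinhole intrinsics (assumed layout)."""
    if len(vec9) != 9:
        raise ValueError("expected 4 quaternion + 3 translation + 2 FoV values")
    quaternion, translation = vec9[0:4], vec9[4:7]
    fov_vertical, fov_horizontal = vec9[7:9]
    return {
        "R": quaternion_to_rotation(quaternion),
        "t": list(translation),
        # Pinhole relation: half the image size equals f * tan(FoV / 2).
        "fx": (width / 2) / math.tan(fov_horizontal / 2),
        "fy": (height / 2) / math.tan(fov_vertical / 2),
        "cx": width / 2,
        "cy": height / 2,
    }
```

Under these assumptions, the first frame would decode from the identity quaternion (1, 0, 0, 0) and a zero translation, which yields the identity pose.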
DPT Head
Image tokens are upsampled back to dense maps via DPT (Dense Prediction Transformer) upsampling:
- Depth map \(D_i \in \mathbb{R}^{H \times W}\)
- Point map \(P_i \in \mathbb{R}^{3 \times H \times W}\) (world coordinates per pixel)
- Tracking feature map \(T_i \in \mathbb{R}^{C \times H \times W}\)
- Uncertainty map (aleatoric uncertainty)
Tracking Head
Adapted from the CoTracker2 architecture. Given a query point in a reference frame, it predicts the 2D correspondences in all other frames via the tracking feature maps. No assumption is made about frame ordering.
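The core idea of matching a query point through feature maps can be sketched with a plain correlation lookup. This is a deliberately simplified stand-in, not the CoTracker2-style head itself, which refines tracks with a learned module; the function names are hypothetical and the maps are stored as H × W × C nested lists for readability.

```python
def correlate(query_feature, feature_map):
    """Dot-product similarity between one query feature and every pixel of an H x W x C map."""
    return [
        [sum(q * f for q, f in zip(query_feature, pixel)) for pixel in row]
        for row in feature_map
    ]


def best_match(query_feature, feature_map):
    """Return the (row, col) whose feature is most similar to the query feature."""
    scores = correlate(query_feature, feature_map)
    return max(
        ((r, c) for r in range(len(scores)) for c in range(len(scores[r]))),
        key=lambda rc: scores[rc[0]][rc[1]],
    )


def track_point(query_rc, reference_map, other_maps):
    """Sample the query feature in the reference frame and match it in every other frame."""
    row, col = query_rc
    query_feature = reference_map[row][col]
    # Each frame is matched independently, so the order of other_maps does not matter.
    return [best_match(query_feature, feature_map) for feature_map in other_maps]
```

Because every frame is matched against the reference independently, the sketch shares the property noted above: it makes no assumption about frame ordering.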
4. Training
Loss Function
\[\mathcal{L} = \mathcal{L}_{\text{camera}} + \mathcal{L}_{\text{depth}} + \mathcal{L}_{\text{pmap}} + 0.05 \times \mathcal{L}_{\text{track}}\]
- \(\mathcal{L}_{\text{camera}}\): Huber loss on quaternion rotation, translation, and FoV
- \(\mathcal{L}_{\text{depth}}, \mathcal{L}_{\text{pmap}}\): Uncertainty-weighted L1 + gradient smoothness − α·log(uncertainty)
- \(\mathcal{L}_{\text{track}}\): L2 on 2D correspondence locations + visibility BCE
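The way these terms combine can be sketched in a few lines. The sketch is a simplification under stated assumptions: the gradient term is omitted, the losses operate on flat lists of per-pixel scalars, `alpha=1.0` and `delta=1.0` are placeholder values not taken from the paper, and all function names are hypothetical. Only the 0.05 tracking weight comes from the formula above.

```python
import math


def huber(error, delta=1.0):
    """Huber penalty: quadratic near zero, linear for large errors."""
    magnitude = abs(error)
    if magnitude <= delta:
        return 0.5 * magnitude * magnitude
    return delta * (magnitude - 0.5 * delta)


def uncertainty_weighted_l1(pred, target, sigma, alpha=1.0):
    """Mean over pixels of sigma * |pred - target| - alpha * log(sigma).

    sigma is the per-pixel value of the predicted uncertainty map (must be > 0).
    The log term keeps the model from driving sigma toward zero everywhere.
    """
    terms = [
        s * abs(p - t) - alpha * math.log(s)
        for p, t, s in zip(pred, target, sigma)
    ]
    return sum(terms) / len(terms)


def total_loss(l_camera, l_depth, l_pmap, l_track, track_weight=0.05):
    """Weighted sum of the four task losses, with tracking down-weighted."""
    return l_camera + l_depth + l_pmap + track_weight * l_track
```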
Training Data (~17 datasets)
| Category | Datasets |
|---|---|
| Object-centric | Co3Dv2, BlendedMVS, DL3DV, Synthetic Assets |
| Outdoor scenes | MegaDepth, WildRGB, Mapillary |
| Indoor | ScanNet, HyperSim, Replica, Habitat |
| Synthetic | Kubric, Virtual KITTI, Aria Synthetic Environments, MVS-Synth |
| Video tracking | PointOdyssey |
Training Configuration
- AdamW, LR 0.0002, cosine schedule, 8K warmup steps
- 160K total iterations, 64× A100 GPUs, 9 days
- bfloat16 + gradient checkpointing
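The learning-rate schedule implied by these settings can be sketched as follows. The shape of the warmup (linear from zero) and the final value of the cosine decay (zero at the last step) are assumptions; the review only gives the peak rate, the warmup length, and the total number of iterations.

```python
import math


def learning_rate(step, peak_lr=2e-4, warmup_steps=8_000, total_steps=160_000):
    """Linear warmup to peak_lr, then cosine decay to zero (both shapes are assumed)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, clamped to [0, 1].
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


for step in (0, 4_000, 8_000, 84_000, 160_000):
    print(step, learning_rate(step))
```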
5. Experimental Results
Camera Pose Estimation (AUC@30, higher is better)


Point Cloud Reconstruction — ETH3D (Chamfer distance, lower is better)

Notably, unprojecting the predicted depth maps with the predicted camera parameters (0.677) outperforms using the point map head alone (0.709).
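Combining depth and camera parameters amounts to unprojecting each pixel into 3D. A minimal sketch under assumed conventions: a pinhole camera, extrinsics that map world to camera coordinates as x_cam = R · x_world + t, and hypothetical function names. The released code may use different conventions.

```python
def unproject_pixel(u, v, depth, fx, fy, cx, cy, rotation, translation):
    """Lift pixel (u, v) with its depth to world coordinates.

    Assumes x_cam = R @ x_world + t, so x_world = R^T @ (x_cam - t).
    """
    x_cam = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    shifted = [x_cam[i] - translation[i] for i in range(3)]
    # R is orthonormal, so multiplying by its transpose inverts the rotation.
    return [sum(rotation[r][c] * shifted[r] for r in range(3)) for c in range(3)]


def depth_to_point_map(depth_map, fx, fy, cx, cy, rotation, translation):
    """Unproject an H x W depth map into an H x W grid of 3-D world points."""
    return [
        [
            unproject_pixel(u, v, depth, fx, fy, cx, cy, rotation, translation)
            for u, depth in enumerate(row)
        ]
        for v, row in enumerate(depth_map)
    ]
```

For the first frame, whose camera is fixed to the identity, the unprojected points are already in world coordinates.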

DTU Multi-View Depth Estimation (Chamfer distance, lower is better)

Without any camera information, VGGT achieves performance comparable to MASt3R, which has access to ground-truth cameras.
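For readers unfamiliar with the metric, the following sketch shows one common way to compute a Chamfer-style score from accuracy and completeness. It is an assumption that the benchmarks average the two terms this way; real evaluation protocols also align the clouds first and use spatial indexing instead of this brute-force O(N·M) search.

```python
import math


def nearest_distance(point, cloud):
    """Euclidean distance from point to its nearest neighbor in cloud."""
    return min(math.dist(point, other) for other in cloud)


def accuracy(predicted, ground_truth):
    """Mean distance from each predicted point to the nearest ground-truth point."""
    return sum(nearest_distance(p, ground_truth) for p in predicted) / len(predicted)


def completeness(predicted, ground_truth):
    """Mean distance from each ground-truth point to the nearest predicted point."""
    return accuracy(ground_truth, predicted)


def chamfer_overall(predicted, ground_truth):
    """Average of accuracy and completeness (lower is better)."""
    return 0.5 * (accuracy(predicted, ground_truth) + completeness(predicted, ground_truth))
```

A prediction that covers only part of the scene can have perfect accuracy but poor completeness, which is why both directions are averaged.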
2-View Matching — ScanNet-1500 (AUC, higher is better)

A general-purpose model surpasses a dedicated matching specialist.

Dynamic Point Tracking — TAP-Vid (AJ metric, higher is better)

Using VGGT features as a backbone substantially improves CoTracker2 across all benchmarks.

Speed Profile (H100 GPU, 336×518 resolution)

6. Comparison with DUSt3R
| Aspect | DUSt3R / MASt3R | VGGT |
|---|---|---|
| Input granularity | Image pairs | 1 to hundreds of images simultaneously |
| Processing paradigm | Pairwise → global alignment | All views processed jointly |
| Post-processing | Required (global alignment, BA) | Optional (competitive without it) |
| Attention structure | Cross-attention (between pairs) | Alternating self-attention |
| Pose accuracy (RealEstate10K, AUC@30, higher is better) | 76.4 | 85.3 (feed-forward) / 93.5 (with BA) |
| Point cloud accuracy (ETH3D, Chamfer distance, lower is better) | 0.826 | 0.709 |
| Speed | ~7–9 seconds | ~0.2 seconds (roughly 35–45× faster) |
The fundamental difference: DUSt3R aggregates results from multiple pairs after the fact, whereas VGGT reasons over all views jointly from the very beginning.
7. Ablation: The Effect of Multi-Task Learning
Impact on performance when individual tasks are removed from training:

Notably, camera supervision has the greatest influence on point cloud quality. This suggests that learning to predict camera parameters forces the model to internalize a consistent 3D coordinate system, which positively propagates to all other prediction heads.
8. Limitations
- Fisheye and panoramic cameras are not supported
- Performance degrades under extreme rotational inputs
- Vulnerable to large non-rigid deformations
- Memory scales linearly with frame count (200 frames = 40.6 GB)
- Not specifically trained for monocular reconstruction

The authors note that these limitations can be addressed with additional fine-tuning.
9. Summary
VGGT is a compelling demonstration that the foundation model paradigm can be successfully applied to 3D computer vision. The core message is clear:
“Geometric inductive biases < Large-scale data + Large-scale models”
Rather than decomposing the pipeline into specialized stages, a single Transformer with sufficient data and a well-designed attention structure can handle camera estimation, depth, point maps, and tracking in one pass, and on the benchmarks above it does so faster and more accurately than pairwise pipelines.
If DUSt3R exposed the limits of “aggregate pairwise results,” VGGT makes a strong case that “reason over everything together from the start” is the better approach.