[Paper Review] VGGT: Visual Geometry Grounded Transformer (CVPR 2025 Best Paper)

Paper: VGGT: Visual Geometry Grounded Transformer
Venue: CVPR 2025 (Best Paper Award)
Authors: Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, David Novotny
Affiliations: Visual Geometry Group (VGG), University of Oxford + Meta AI
arXiv: 2503.11651
GitHub: facebookresearch/vggt


One-Line Summary

Feed N images in a single forward pass and get camera parameters, depth maps, point maps, and point tracks all at once, in about 0.14 seconds for 10 frames on an H100, with a 1.2B-parameter Transformer.


1. Background and Problem Statement

The traditional pipeline for 3D reconstruction follows a sequential structure: feature matching → SfM (Structure from Motion, which includes Bundle Adjustment) → MVS (Multi-View Stereo). Each stage is decoupled and requires iterative optimization, so reconstructing a single scene takes anywhere from tens of seconds to several minutes.

Recent learning-based methods DUSt3R and MASt3R attempted to change this paradigm by processing image pairs and directly predicting 3D point maps. However, they still have notable limitations:

  • Pairwise processing: Only two images are handled at a time; the full scene is assembled by merging pairwise results in a post-processing step.
  • Global alignment required: An optimization step is mandatory to register all pairwise point maps into a unified coordinate system.
  • Slow speed: 7–9 seconds per scene.

VGGT asks a simple question: “Can we do all of this at once?”


2. Core Idea

VGGT’s central claim is straightforward:

A large Transformer trained on sufficiently diverse 3D data can predict the key 3D attributes of a scene (cameras, depth, point maps, and tracks) simultaneously, with minimal geometry-specific inductive biases and without mandatory post-processing.

This directly applies the foundation model philosophy, proven in NLP and 2D vision, to 3D reconstruction. The design principles are threefold:

  1. Process all input images simultaneously (no sequential or pairwise ordering).
  2. Single forward pass — no iterative optimization.
  3. Multi-task output — camera poses, depth, point maps, and tracking, all at once.

3. Architecture

Overall Pipeline

N input images
    → DINOv2 patch tokenizer → image tokens (K tokens per image)
    → camera token (1 per image) + register tokens (4 per image) appended
    → Alternating-Attention Transformer (24 layers, dim=1024, 16 heads)
    ├─→ Camera Head   → intrinsic & extrinsic camera parameters
    ├─→ DPT Head      → depth maps, point maps, uncertainty maps
    └─→ Tracking Head → 2D point correspondences across frames

Figure 2: VGGT architecture overview. Input images are tokenized via DINOv2 into patch tokens, combined with per-image camera and register tokens, and processed by the Alternating-Attention Transformer. Three prediction heads output camera parameters, dense maps (depth/point/uncertainty), and tracking features respectively.
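To make the token layout in the pipeline concrete, here is a minimal sketch of the token count per image. It assumes a patch size of 14 (the DINOv2 ViT/14 variants) and uses the 336×518 resolution from the speed profile; the function names are hypothetical and not taken from the VGGT code.

```python
def tokens_per_image(height, width, patch_size=14,
                     num_camera_tokens=1, num_register_tokens=4):
    """Count the tokens one image contributes to the Transformer."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    # K in the pipeline diagram: one token per non-overlapping patch.
    patch_tokens = (height // patch_size) * (width // patch_size)
    return patch_tokens + num_camera_tokens + num_register_tokens


def total_tokens(num_images, height, width):
    """Sequence length seen by a global-attention layer (assumed patch size 14)."""
    return num_images * tokens_per_image(height, width)


print(tokens_per_image(336, 518))   # 24 * 37 patches + 5 special tokens = 893
print(total_tokens(10, 336, 518))   # 8930
```

Under these assumptions, global attention over 10 frames already sees close to 9K tokens, which is why frame count dominates cost.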

Alternating-Attention (Key Design Choice)

The 24 Transformer layers alternate between two types of attention:

  • Even layers → Frame-wise Self-Attention: Attention is restricted to tokens within the same image, preserving spatial context for each frame.
  • Odd layers → Global Self-Attention: All tokens across all frames attend to one another, enabling multi-view correspondence and 3D reasoning.
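The difference between the two layer types can be sketched as an attention mask. This is an illustration only: it follows the even/odd split described above, the function names are hypothetical, and a real implementation could instead reshape tokens so that each frame becomes its own batch element rather than building a mask.

```python
def layer_kind(layer_index):
    """Attention type for a layer, following the even/odd split in the text."""
    return "frame" if layer_index % 2 == 0 else "global"


def attention_mask(num_frames, tokens_per_frame, layer_index):
    """Return mask[q][k], True when query token q may attend to key token k."""
    total = num_frames * tokens_per_frame
    frame_of = [i // tokens_per_frame for i in range(total)]
    if layer_kind(layer_index) == "global":
        # Every token sees every token in every frame.
        return [[True] * total for _ in range(total)]
    # Frame-wise: block-diagonal mask, one block per image.
    return [[frame_of[q] == frame_of[k] for k in range(total)]
            for q in range(total)]


mask_frame = attention_mask(num_frames=2, tokens_per_frame=3, layer_index=0)
mask_global = attention_mask(num_frames=2, tokens_per_frame=3, layer_index=1)
for row in mask_frame:
    print("".join("x" if allowed else "." for allowed in row))
```

The frame-wise mask is block-diagonal, so its cost grows linearly with the number of frames, while the global mask grows quadratically.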

Cross-attention is not used. The ablation below validates this design choice (ETH3D Overall, lower is better):

Table 9: Attention mechanism ablation. Alternating-attention (0.709) outperforms both cross-attention (1.061) and global self-attention only (0.827) on ETH3D Overall.

Camera Head

The camera tokens pass through 4 additional self-attention layers and are projected linearly to produce a 9-dimensional vector per image:

  • Quaternion rotation (4D) + translation (3D) + field of view FoV (2D)

The first image’s camera is always fixed to the identity (serving as the world coordinate origin).
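As a rough illustration of what the 9-dimensional vector encodes, the sketch below decodes it into a rotation matrix, a translation, and a pinhole intrinsics matrix. Several things here are assumptions rather than the released code: the element order (quaternion, translation, FoV), the scalar-first quaternion convention, FoV given in radians as (vertical, horizontal), and a principal point at the image center.

```python
import math


def quat_to_rotmat(w, x, y, z):
    """Convert a quaternion (scalar-first, an assumed convention) to a 3x3 matrix."""
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / norm, x / norm, y / norm, z / norm
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def decode_camera(vec9, height, width):
    """Split [quat(4), trans(3), fov_h, fov_w] into R, t and a pinhole K."""
    if len(vec9) != 9:
        raise ValueError("expected a 9-dimensional camera vector")
    quat, trans, (fov_h, fov_w) = vec9[0:4], vec9[4:7], vec9[7:9]
    # Focal length in pixels from the field of view (radians).
    fy = (height / 2) / math.tan(fov_h / 2)
    fx = (width / 2) / math.tan(fov_w / 2)
    # Principal point assumed to sit at the image center.
    K = [[fx, 0.0, width / 2], [0.0, fy, height / 2], [0.0, 0.0, 1.0]]
    return quat_to_rotmat(*quat), list(trans), K


# The first frame: identity rotation and zero translation.
R, t, K = decode_camera([1, 0, 0, 0, 0, 0, 0, math.pi / 2, math.pi / 2], 336, 518)
```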

DPT Head

Image tokens are upsampled back to dense maps via DPT (Dense Prediction Transformer) upsampling:

  • Depth map \(D_i \in \mathbb{R}^{H \times W}\)
  • Point map \(P_i \in \mathbb{R}^{3 \times H \times W}\) (world coordinates per pixel)
  • Tracking feature map \(T_i \in \mathbb{R}^{C \times H \times W}\)
  • Uncertainty map (aleatoric uncertainty)

Tracking Head

Adapted from the CoTracker2 architecture. Given a query point in a reference frame, it predicts the 2D correspondences in all other frames via the tracking feature maps. No assumption is made about frame ordering.
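To give a feel for how a tracking feature map can answer a query, here is a deliberately simplified sketch: a single nearest-feature lookup by dot-product correlation. This is not the CoTracker2-style head used in VGGT, which refines its estimates iteratively; the H×W×C nested-list layout and the function names are assumptions chosen for readability.

```python
def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def match_query(query_feature, target_features):
    """Return (row, col) of the location whose feature best matches the query.

    target_features is an H x W grid of C-dimensional feature vectors.
    """
    best_score, best_pos = None, None
    for r, row in enumerate(target_features):
        for c, feature in enumerate(row):
            score = dot(query_feature, feature)
            if best_score is None or score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos


# A 2 x 2 feature map with 2-dimensional features.
target = [[[1.0, 0.0], [0.0, 1.0]],
          [[0.5, 0.5], [-1.0, 0.0]]]
print(match_query([0.0, 1.0], target))  # (0, 1)
```

Because the lookup treats every target frame independently, it makes no assumption about frame ordering, which matches the property noted above.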


4. Training

Loss Function

\[\mathcal{L} = \mathcal{L}_{\text{camera}} + \mathcal{L}_{\text{depth}} + \mathcal{L}_{\text{pmap}} + 0.05 \times \mathcal{L}_{\text{track}}\]
  • \(\mathcal{L}_{\text{camera}}\): Huber loss on quaternion rotation, translation, and FoV
  • \(\mathcal{L}_{\text{depth}}, \mathcal{L}_{\text{pmap}}\): Uncertainty-weighted L1 + uncertainty-weighted gradient-matching term − α·log(uncertainty)
  • \(\mathcal{L}_{\text{track}}\): L2 on 2D correspondence locations + visibility BCE
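Reading the depth bullet literally, one way to write the depth term is shown below. The notation is an assumption made for this sketch: \(\Sigma_i\) is the predicted uncertainty map for image \(i\), \(D_i^{\text{gt}}\) is the ground-truth depth, \(\odot\) is the element-wise product, \(\nabla\) is the image gradient, and the log term is summed over pixels.

\[\mathcal{L}_{\text{depth}} = \sum_{i=1}^{N} \left( \left\| \Sigma_i \odot (D_i - D_i^{\text{gt}}) \right\|_1 + \left\| \Sigma_i \odot (\nabla D_i - \nabla D_i^{\text{gt}}) \right\|_1 - \alpha \log \Sigma_i \right)\]

The first two terms push the model to lower \(\Sigma_i\) wherever its error is large, while the \(-\alpha \log \Sigma_i\) term penalizes driving \(\Sigma_i\) toward zero everywhere. The point map term would take the same form with \(P_i\) in place of \(D_i\).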

Training Data (~17 datasets)

| Category | Datasets |
|---|---|
| Object-centric | Co3Dv2, BlendMVS, DL3DV, Synthetic Assets |
| Outdoor scenes | MegaDepth, WildRGB, Mapillary |
| Indoor | ScanNet, HyperSim, Replica, Habitat |
| Synthetic | Kubric, Virtual KITTI, Aria Synthetic Environments, MVS-Synth |
| Video tracking | PointOdyssey |

Training Configuration

  • AdamW, LR 0.0002, cosine schedule, 8K warmup steps
  • 160K total iterations, 64× A100 GPUs, 9 days
  • bfloat16 + gradient checkpointing
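The learning-rate schedule in the first bullet can be sketched as follows. The peak rate, warmup length, and total iterations come from the list above; the linear warmup shape and the decay to exactly zero are assumptions, and the function name is hypothetical.

```python
import math


def learning_rate(step, peak_lr=2e-4, warmup_steps=8_000, total_steps=160_000):
    """Linear warmup (assumed) followed by cosine decay to zero (assumed)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * min(progress, 1.0)))


for step in (0, 4_000, 8_000, 84_000, 160_000):
    print(step, learning_rate(step))
```

With these settings the rate reaches its peak at step 8,000, is at half the peak at the midpoint of the decay (step 84,000), and ends at zero.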

5. Experimental Results

Camera Pose Estimation (AUC@30, higher is better)

Table 1: Multi-view camera pose estimation results (RealEstate10K, CO3Dv2). VGGT feed-forward (85.3) significantly outperforms the previous best VGGSfM v2 (78.9) on RealEstate10K. With Bundle Adjustment, performance further improves to 93.5.
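For readers unfamiliar with the metric, AUC@30 summarizes how often the pose error stays below each angular threshold up to 30 degrees. The sketch below is one common discretization (integer thresholds), stated as an assumption; the exact evaluation code behind Table 1 may differ. Each entry of errors_deg is assumed to be the larger of the rotation error and the translation-direction error for one image pair.

```python
def pose_auc(errors_deg, max_threshold=30):
    """Mean accuracy over integer angular thresholds 1..max_threshold (degrees)."""
    if not errors_deg:
        raise ValueError("need at least one pose error")
    accuracies = []
    for threshold in range(1, max_threshold + 1):
        correct = sum(1 for error in errors_deg if error < threshold)
        accuracies.append(correct / len(errors_deg))
    return sum(accuracies) / max_threshold


print(pose_auc([0.0]))    # 1.0: a perfect pose counts at every threshold
print(pose_auc([15.0]))   # 0.5: correct only for thresholds 16..30
print(pose_auc([100.0]))  # 0.0: never within 30 degrees
```

The function returns a value in [0, 1]; the table reports the same quantity on a 0 to 100 scale.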

Figure 4: Qualitative camera pose estimation comparison. Camera trajectories predicted by VGGT align far more closely to ground truth than those of DUSt3R and MASt3R.

Point Cloud Reconstruction — ETH3D (Chamfer distance, lower is better)

Table 2: ETH3D 3D reconstruction results. VGGT achieves the best Accuracy, Completeness, and Overall scores while running roughly 35–45× faster than DUSt3R/MASt3R (~0.2s vs ~7–9s).
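The three scores in this table are related: Accuracy measures how far predicted points are from the ground truth, Completeness measures the reverse direction, and Overall averages the two. A brute-force sketch, assuming mean nearest-neighbor Euclidean distance and hypothetical function names (real evaluations use spatial indexing and typically align the point clouds first):

```python
import math


def mean_nearest_distance(source, target):
    """Average distance from each source point to its nearest target point."""
    return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)


def chamfer_metrics(predicted, ground_truth):
    """Return (accuracy, completeness, overall); lower is better for all three."""
    accuracy = mean_nearest_distance(predicted, ground_truth)
    completeness = mean_nearest_distance(ground_truth, predicted)
    return accuracy, completeness, (accuracy + completeness) / 2


predicted = [(0.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(chamfer_metrics(predicted, ground_truth))  # (0.0, 1.0, 0.5)
```

In the toy example the prediction is perfectly accurate but incomplete, because it misses one ground-truth point entirely.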

Notably, combining the depth map and camera parameters (0.677 Overall) outperforms using the point map head alone (0.709).
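Combining a depth map with camera parameters means unprojecting each pixel into 3D. The sketch below does this for a single pixel under stated assumptions: a pinhole camera, extrinsics that map world to camera coordinates as x_cam = R · x_world + t, and depth measured along the camera z-axis. The function name is hypothetical.

```python
def unproject_pixel(u, v, depth, fx, fy, cx, cy, R, t):
    """Lift pixel (u, v) with the given depth to world coordinates.

    Assumes x_cam = R @ x_world + t, so x_world = R^T @ (x_cam - t).
    """
    # Back-project through the pinhole model into the camera frame.
    x_cam = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    shifted = [x_cam[i] - t[i] for i in range(3)]
    # R is orthonormal, so its inverse is its transpose.
    return [sum(R[row][col] * shifted[row] for row in range(3)) for col in range(3)]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# A pixel at the principal point of the reference camera, 2 units away.
print(unproject_pixel(259, 168, 2.0, 259.0, 168.0, 259, 168,
                      identity, [0.0, 0.0, 0.0]))  # [0.0, 0.0, 2.0]
```

Applying this to every pixel of every frame yields a point cloud in the first camera's coordinate system, which is the same output space as the point map head.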

Figure 3: Qualitative 3D reconstruction comparison on ETH3D. VGGT's point clouds are more complete and geometrically accurate than those produced by DUSt3R.

DTU Multi-View Depth Estimation (Chamfer distance, lower is better)

Table 3: DTU multi-view depth estimation results. Without any camera information (Camera Known: ✗), VGGT (0.382) achieves performance comparable to MASt3R (0.374), which uses ground-truth cameras.

2-View Matching — ScanNet-1500 (AUC, higher is better)

Table 4: ScanNet-1500 image matching results. VGGT's Tracking Head (AUC@5: 33.9) surpasses the dedicated matching specialist RoMa (31.8) across all AUC thresholds.

A general-purpose model surpasses a dedicated matching specialist.

Figure 5: Additional qualitative comparisons on multi-view reconstruction and depth estimation across diverse scene types.

Dynamic Point Tracking — TAP-Vid (AJ metric, higher is better)

Table 5: TAP-Vid point tracking results. Replacing CoTracker2's backbone with VGGT features yields consistent gains across Kinetics (49.6→57.2), RGB-S (67.4→72.1), and DAVIS (61.8→64.7).

Using VGGT features as a backbone substantially improves CoTracker2 across all benchmarks.

Figure 6: Qualitative point tracking and 2D correspondence results. VGGT's Tracking Head maintains accurate correspondences across frames.

Speed Profile (H100 GPU, 336×518 resolution)

Table 6: Inference speed and GPU memory profile. At 10 frames, VGGT processes in 0.14s using 3.63 GB — near real-time. At 200 frames, it requires 8.75s and 40.63 GB.
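A quick back-of-the-envelope check on the two data points quoted from Table 6 shows how cost scales with frame count. With only two points this is a secant estimate, not a fitted scaling law, and the helper names are hypothetical.

```python
# frames -> (seconds, GB), the two rows quoted from Table 6
profile = {10: (0.14, 3.63), 200: (8.75, 40.63)}


def seconds_per_frame(frames):
    """Average inference time per frame at a given frame count."""
    return profile[frames][0] / frames


def memory_slope_gb_per_frame(frames_a, frames_b):
    """Extra GPU memory per additional frame between two measurements."""
    return (profile[frames_b][1] - profile[frames_a][1]) / (frames_b - frames_a)


print(seconds_per_frame(10))               # about 0.014 s per frame
print(seconds_per_frame(200))              # about 0.044 s per frame
print(memory_slope_gb_per_frame(10, 200))  # about 0.195 GB per frame
```

Per-frame time roughly triples between 10 and 200 frames, so runtime grows faster than linearly, while memory grows by a little under 0.2 GB per added frame.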

6. Comparison with DUSt3R

| Aspect | DUSt3R / MASt3R | VGGT |
|---|---|---|
| Input granularity | Image pairs | 1 to hundreds of images simultaneously |
| Processing paradigm | Pairwise → global alignment | All views processed jointly |
| Post-processing | Required (global alignment, BA) | Optional (competitive without it) |
| Attention structure | Cross-attention (between pairs) | Alternating self-attention |
| Pose accuracy (RealEstate10K, AUC@30, higher is better) | 76.4 | 85.3 (FF) / 93.5 (BA) |
| Point cloud accuracy (ETH3D Overall, lower is better) | 0.826 | 0.709 |
| Speed | ~7–9 seconds | ~0.2 seconds (roughly 35–45× faster) |

The fundamental difference: DUSt3R aggregates results from multiple pairs after the fact, whereas VGGT reasons over all views jointly from the very beginning.


7. Ablation: The Effect of Multi-Task Learning

Impact on performance when individual tasks are removed from training:

Table 7: Multi-task learning ablation. Removing camera supervision causes the largest drop in ETH3D performance (0.709 → 0.834), demonstrating that camera training is the most critical component for point cloud quality.

Notably, camera supervision has the greatest influence on point cloud quality. This suggests that learning to predict camera parameters forces the model to internalize a consistent 3D coordinate system, which positively propagates to all other prediction heads.


8. Limitations

  • Fisheye and panoramic cameras are not supported
  • Performance degrades under extreme rotational inputs
  • Vulnerable to large non-rigid deformations
  • Memory scales linearly with frame count (200 frames = 40.6 GB)
  • Not specifically trained for monocular reconstruction

Figure 7: VGGT failure cases. Reconstruction quality degrades on fisheye distortion, extreme rotations, and non-rigid deformations — inputs that fall outside the training distribution.

The authors note that these limitations can be addressed with additional fine-tuning.


9. Summary

VGGT is a compelling demonstration that the foundation model paradigm can be successfully applied to 3D computer vision. The core message is clear:

“Geometric inductive biases < Large-scale data + Large-scale models”

Rather than decomposing the pipeline into specialized stages, a single Transformer with sufficient data and a well-designed attention structure can solve several core 3D vision tasks simultaneously, and it does so faster and more accurately than the baselines compared here.

If DUSt3R exposed the limits of “aggregate pairwise results,” VGGT makes a strong case that “reason over everything together from the start” is the better approach, at least within the limitations listed above.

* Posts in this blog were written with the assistance of Claude Code.