[Paper Review] π³: Permutation-Equivariant Visual Geometry Learning (ICLR 2026)

Paper: π³: Permutation-Equivariant Visual Geometry Learning
Venue: ICLR 2026
Authors: Yifan Wang*, Jianjun Zhou*, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, Tong He
Affiliations: Shanghai Jiao Tong University, Shanghai AI Laboratory, Shanghai Innovation Institute, Zhejiang University, USTC, Fudan University
arXiv: 2507.13347
GitHub: yyfz/Pi3


One-Line Summary

A permutation-equivariant 3D reconstruction model that completely eliminates the fixed reference view relied on by prior methods, predicting camera poses and point maps independent of input image ordering.


1. Background and Problem Definition

Reference View Bias

Modern feed-forward 3D reconstruction methods — including DUSt3R, MASt3R, and VGGT — share a common inductive bias: the first image is used as the anchor (world coordinate origin).

This seemingly natural choice creates structural weaknesses:

  • Order dependence: Results change with input ordering, because whichever image comes first defines the coordinate frame.
  • Asymmetric processing: The reference view receives structural privilege through special tokens or fixed coordinate systems.
  • Failure modes: A blurry or partially occluded first image can destabilize the entire reconstruction or cause it to fail outright.

VGGT fixes the first camera as identity; DUSt3R/MASt3R use explicit reference views during pairwise processing. The π³ paper quantifies how much this matters.

When the same scene is fed to VGGT with different frame orderings, the standard deviation of DTU reconstruction accuracy reaches 0.033 cm. π³ reduces this to 0.003 cm, roughly a 10× reduction (0.033 / 0.003 ≈ 11).

π³’s Question

“Can accurate and stable 3D reconstruction be achieved without any reference view?”


2. Core Idea

Permutation Equivariance

The central claim of π³ is one mathematical property:

\[f(\sigma(\mathbf{I})) = \sigma(f(\mathbf{I})) \quad \forall \sigma \in S_N\]

Applying any permutation σ to the input image set should yield the same permutation applied to the output — every image is processed identically regardless of its position in the sequence.
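
The property can be checked numerically on toy functions. The sketch below is illustrative only: `symmetric_model` and `anchored_model` are hypothetical stand-ins (not π³ or VGGT), chosen to show that a function which treats every view alike passes the test, while one that expresses everything relative to the first view fails it.

```python
import numpy as np

def symmetric_model(views):
    """Toy order-agnostic model: each view is offset by the mean over all views."""
    return views - views.mean(axis=0, keepdims=True)

def anchored_model(views):
    """Toy reference-view model: every view is expressed relative to views[0]."""
    return views - views[0:1]

def is_permutation_equivariant(f, views, perm, atol=1e-9):
    """Check f(sigma(I)) == sigma(f(I)) for one permutation sigma."""
    return bool(np.allclose(f(views[perm]), f(views)[perm], atol=atol))

rng = np.random.default_rng(0)
views = rng.normal(size=(5, 8))    # N = 5 views, 8 features each
perm = np.array([3, 0, 4, 1, 2])   # a permutation that moves view 0

print(is_permutation_equivariant(symmetric_model, views, perm))  # True
print(is_permutation_equivariant(anchored_model, views, perm))   # False
```

The anchored function fails because its output depends on which view happens to sit in slot 0, which is exactly the reference-view bias described in Section 1.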

Two design choices enforce this:

  1. Affine-Invariant Camera Poses: Rotation is predicted in SO(3); translation is expressed without a global anchor point, using an affine-invariant representation.
  2. Scale-Invariant Local Point Maps: Per-view point maps are predicted in local camera coordinates. Scale ambiguity is absorbed into the invariant representation rather than resolved by a reference frame.

Combined with an architecture that contains no order-dependent components (Section 3), these two choices make the full pipeline permutation-equivariant.

Fundamental Differences from Prior Work

| Aspect | DUSt3R / MASt3R | VGGT | π³ |
|---|---|---|---|
| Reference view | Required (pairwise) | Required (first = identity) | None |
| Order dependence | High | High | None |
| Pose representation | Absolute coordinates | Absolute coordinates | Affine-invariant |
| Point map | World coordinates | World coordinates | Scale-invariant local |

3. Architecture

Full Pipeline

N input images (B × N × 3 × H × W)
    → DINOv2 patch encoder → image token sequences
    → Alternating-Attention Transformer (36 layers)
    ├─→ Camera Head    → affine-invariant camera poses (SE(3))
    ├─→ Point Map Head → scale-invariant local point maps
    └─→ Confidence Head → confidence maps

Alternating-Attention (36 Layers)

Similar to VGGT’s structure but with 36 layers instead of 24, and with all order-dependent components removed:

  • View-wise Self-Attention: Tokens within the same image attend to each other. Extracts per-frame spatial features.
  • Global Self-Attention: All tokens across all frames attend to each other. Learns cross-view 3D consistency.

Removed elements (to guarantee permutation equivariance):

  • Frame index positional embeddings (no order information injected)
  • Reference-view special tokens (no reference token exists)
  • Cross-attention between views (no asymmetric processing)
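
A minimal NumPy sketch of why removing frame-index embeddings matters. It is not the π³ implementation: the block below uses single-head attention with identity projections, and `frame_embed` is a hypothetical frame-index embedding added only to show how it breaks equivariance.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections; tokens: (..., T, D)."""
    scale = np.sqrt(tokens.shape[-1])
    scores = tokens @ np.swapaxes(tokens, -1, -2) / scale
    return softmax(scores) @ tokens

def alternating_block(x, frame_embed=None):
    """x: (N views, P patches, D channels). Returns an array of the same shape."""
    n, p, d = x.shape
    if frame_embed is not None:
        # Hypothetical frame-index embedding: ties each token to its slot in the sequence.
        x = x + frame_embed[:n, None, :]
    # View-wise self-attention: batched over views, tokens only see their own image.
    x = x + self_attention(x)
    # Global self-attention: all tokens from all views attend to each other.
    x = x + self_attention(x.reshape(1, n * p, d)).reshape(n, p, d)
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 16))
perm = np.array([2, 0, 3, 1])
embed = rng.normal(size=(4, 16))

no_index = bool(np.allclose(alternating_block(x[perm]), alternating_block(x)[perm]))
with_index = bool(np.allclose(alternating_block(x[perm], embed),
                              alternating_block(x, embed)[perm]))
print(no_index, with_index)  # True False
```

Both attention types are equivariant on their own; the only thing that breaks the property here is the embedding that tells each token which slot its view occupies.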

Camera Head

Predicts SE(3) matrices (4×4) in affine-invariant form. The first frame is NOT fixed as identity; only relative relationships between predicted poses are used as supervision.
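
The reason relative poses can be supervised without a reference frame is that they do not change when the whole scene is moved rigidly. The sketch below illustrates this under assumptions of my own: poses are treated as 4×4 camera-to-world matrices, and only the rigid part of the invariance is shown (the global scale is ignored). The geodesic angle is the standard rotation distance, matching the "geodesic rotation error" listed in the loss table; the helper names are hypothetical.

```python
import numpy as np

def rot_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def make_pose(rotation, translation):
    """Assemble a 4x4 rigid transform from a 3x3 rotation and a 3-vector."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose

def relative_pose(pose_i, pose_j):
    """Pose of camera j expressed in the frame of camera i."""
    return np.linalg.inv(pose_i) @ pose_j

def geodesic_angle(rot_a, rot_b):
    """Angle (radians) of the rotation that takes rot_a to rot_b."""
    cos = (np.trace(rot_a.T @ rot_b) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

pose_a = make_pose(rot_z(0.2), [0.0, 0.0, 0.0])
pose_b = make_pose(rot_z(0.5), [1.0, 0.0, 0.5])
world_shift = make_pose(rot_z(1.3), [4.0, -2.0, 7.0])  # arbitrary change of world frame

rel_before = relative_pose(pose_a, pose_b)
rel_after = relative_pose(world_shift @ pose_a, world_shift @ pose_b)

print(np.allclose(rel_before, rel_after))                  # True
print(geodesic_angle(pose_a[:3, :3], pose_b[:3, :3]))      # approximately 0.3
```

Because the relative pose is unchanged by `world_shift`, a loss built on pairwise relations never needs to know which camera, if any, sits at the origin.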

Point Map Head

Predicts 3D points per pixel in local camera coordinates — not world coordinates. This is the key: consistency is maintained regardless of which view is “first.”
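
A local point map can still be placed in a shared frame afterwards by applying that view's predicted pose. The following is a sketch under the assumption that the pose is a 4×4 camera-to-world matrix; `local_to_world` is a hypothetical helper, not a function from the Pi3 repository.

```python
import numpy as np

def local_to_world(points_local, cam_to_world):
    """points_local: (H, W, 3) points in camera coordinates; cam_to_world: (4, 4)."""
    rotation = cam_to_world[:3, :3]
    translation = cam_to_world[:3, 3]
    # Row-vector form of X_world = R @ X_local + t, applied to every pixel.
    return points_local @ rotation.T + translation

# A 2x2 "image" in which every pixel sees a point 2 units along the camera z-axis.
points_local = np.zeros((2, 2, 3))
points_local[..., 2] = 2.0

# Camera at x = 1 whose z-axis points along the world x-axis.
cam_to_world = np.eye(4)
cam_to_world[:3, :3] = np.array([[0.0, 0.0, 1.0],
                                 [0.0, 1.0, 0.0],
                                 [-1.0, 0.0, 0.0]])
cam_to_world[:3, 3] = [1.0, 0.0, 0.0]

points_world = local_to_world(points_local, cam_to_world)
print(points_world[0, 0])  # the point lands at x = 3 in the shared frame
```

The network output itself never depends on this step, which is why the choice of shared frame can be deferred until after inference.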

Pi3X Extension (December 2025)

A follow-up release added:

  • Convolutional Output Head: Eliminates grid-like artifacts from MLP upsampling
  • Conditional inputs: Optional injection of camera poses, intrinsics, and depth
  • Approximate metric scale: Reconstructions are produced at approximately metric scale

4. Training

Loss Function

\[\mathcal{L} = \mathcal{L}_{\text{point}} + \lambda_{\text{normal}} \mathcal{L}_{\text{normal}} + \lambda_{\text{conf}} \mathcal{L}_{\text{conf}} + \lambda_{\text{cam}} \mathcal{L}_{\text{cam}} + \lambda_{\text{trans}} \mathcal{L}_{\text{trans}}\]

| Loss | Role | Weight |
|---|---|---|
| \(\mathcal{L}_{\text{point}}\) | Scale-invariant point map L1 (after optimal scale alignment) | 1.0 |
| \(\mathcal{L}_{\text{normal}}\) | Surface normal angular error | 1.0 |
| \(\mathcal{L}_{\text{conf}}\) | Confidence map BCE | 0.05 |
| \(\mathcal{L}_{\text{cam}}\) | Geodesic rotation error | 0.1 |
| \(\mathcal{L}_{\text{trans}}\) | Scaled translation error | 100.0 |
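
"Optimal scale alignment" for an L1 loss has a simple closed form: the scale s minimizing Σ|s·a − b| is the weighted median of the ratios b/a with weights |a|. The sketch below implements that textbook result. It is an assumption that this matches the paper's solver; the actual implementation may differ (for example in per-point weighting), and the function names are hypothetical.

```python
import numpy as np

def optimal_l1_scale(pred, target):
    """argmin over s of sum |s * pred - target|, via a weighted median."""
    a = pred.ravel()
    b = target.ravel()
    keep = np.abs(a) > 1e-12          # terms with pred == 0 do not depend on s
    ratios = b[keep] / a[keep]
    weights = np.abs(a[keep])
    order = np.argsort(ratios)
    ratios, weights = ratios[order], weights[order]
    cumulative = np.cumsum(weights)
    # First ratio at which the cumulative weight reaches half of the total.
    idx = np.searchsorted(cumulative, 0.5 * cumulative[-1])
    return float(ratios[idx])

def scale_invariant_l1(pred, target):
    """Mean L1 error after aligning pred to target with the optimal scale."""
    scale = optimal_l1_scale(pred, target)
    return float(np.abs(scale * pred - target).mean()), scale

rng = np.random.default_rng(0)
pred = rng.normal(size=(32, 32, 3))                           # toy predicted point map
target = 2.5 * pred + 0.01 * rng.normal(size=pred.shape)      # same geometry, other scale

loss, scale = scale_invariant_l1(pred, target)
print(round(scale, 2), loss)
```

Because the scale is solved for rather than fixed by a reference frame, a prediction that is correct up to a global scale incurs only the residual noise as loss.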

Two-Stage Training

| Stage | Resolution | GPUs | Notes |
|---|---|---|---|
| Stage 1 | 224×224 (fixed) | 16× A100 | DINOv2 encoder frozen |
| Stage 2 | 100K–255K pixels (variable) | 64× A100 | Full network trained |

Each stage runs for 80 epochs of 800 iterations (64,000 iterations per stage). Total parameters: 959M, about 24% fewer than VGGT’s 1.26B.

Training Data (15 Datasets)

The training set includes CO3D, ScanNet, TartanAir, Habitat, and synthetic renderings, spanning indoor, outdoor, and dynamic scenes.


5. Experimental Results

Camera Pose Estimation

RealEstate10K (Zero-shot):

Method RTA@30 AUC@30
DUSt3R 76.1 67.7
MASt3R 76.4
VGGT 93.13 77.62
π³ 95.62 85.90

Sintel (lower is better):

| Method | ATE (↓) | RPE trans (↓) |
|---|---|---|
| VGGT | 0.167 | 0.062 |
| π³ | 0.074 | 0.040 |

π³ reduces ATE by about 56% relative to VGGT on Sintel (0.167 → 0.074).

Point Map Reconstruction

DTU Dataset (cm, lower is better):

| Method | Accuracy (↓) | Completion (↓) | Normal Consistency (↑) |
|---|---|---|---|
| DUSt3R | 1.620 | 2.241 | 0.640 |
| MASt3R | 1.406 | 2.015 | 0.662 |
| VGGT | 1.338 | 1.896 | 0.676 |
| π³ | 1.198 | 1.849 | 0.678 |

7-Scenes (Dense Views, cm):

| Method | Accuracy (↓) | Completion (↓) |
|---|---|---|
| VGGT | 0.022 | 0.026 |
| π³ | 0.016 | 0.022 |

Video Depth Estimation — KITTI

| Metric | π³ | VGGT | Gain |
|---|---|---|---|
| Abs Rel (↓) | 0.037 | 0.052 | 29% lower |
| δ<1.25 (↑) | 0.986 | 0.968 | +1.8 points |
| FPS (↑) | 57.4 | 43.2 | 33% faster |
| Parameters | 959M | 1.26B | 24% lighter |
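
The Gain column can be reproduced from the two value columns. A short arithmetic check, using only the numbers in the table above (`relative_change` is just a helper name):

```python
def relative_change(new, old):
    """Signed change of `new` relative to `old` (0.10 means +10%)."""
    return (new - old) / old

abs_rel_reduction = -relative_change(0.037, 0.052)   # lower is better, so flip the sign
fps_speedup = relative_change(57.4, 43.2)
param_saving = -relative_change(959e6, 1.26e9)
delta_gain_points = (0.986 - 0.968) * 100            # difference in percentage points

print(f"Abs Rel reduction: {abs_rel_reduction:.1%}")      # 28.8%
print(f"FPS speedup:       {fps_speedup:.1%}")            # 32.9%
print(f"Parameter saving:  {param_saving:.1%}")           # 23.9%
print(f"delta<1.25 gain:   {delta_gain_points:.1f} pts")  # 1.8 pts
```

Note that the δ<1.25 gain is an absolute difference in percentage points, while the other three rows are relative changes.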

Permutation Robustness

Standard deviation of DTU Accuracy across different frame orderings of the same scene:

| Method | std (↓) |
|---|---|
| VGGT | 0.033 |
| π³ | 0.003 |

π³ is roughly 10× more stable than VGGT (0.033 / 0.003 ≈ 11). This directly supports the permutation-equivariant design.
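
The measurement protocol can be sketched as follows: shuffle the frames several times, score each run, and report the standard deviation. The number of orderings, the seed, and both scoring functions below are assumptions for illustration; the two toy scorers stand in for a full "reconstruct, then evaluate accuracy" pipeline.

```python
import numpy as np

def ordering_std(score_fn, frames, num_orderings=10, seed=0):
    """Std of score_fn over random orderings of the same set of frames."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(num_orderings):
        perm = rng.permutation(len(frames))
        scores.append(score_fn([frames[i] for i in perm]))
    return float(np.std(scores))

def symmetric_score(frames):
    """Toy order-agnostic pipeline: the score depends only on the set of frames."""
    return float(np.mean(frames))

def anchored_score(frames):
    """Toy reference-view pipeline: the score depends on which frame comes first."""
    return abs(frames[0] - float(np.mean(frames)))

frames = [0.1, 0.4, 0.2, 0.9, 0.5, 0.3]   # stand-in "frame quality" values

print(ordering_std(symmetric_score, frames))  # effectively zero
print(ordering_std(anchored_score, frames))   # clearly non-zero
```

In a real evaluation, `score_fn` would run the model on the reordered frames and compute DTU accuracy against ground truth.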


6. Comparison with VGGT

| Aspect | VGGT | π³ |
|---|---|---|
| Architecture | 24-layer alternating attention | 36-layer alternating attention |
| Reference view | First frame = identity (required) | None |
| Order dependence | Present (std 0.033) | Negligible (std 0.003) |
| Pose representation | Absolute coordinates | Affine-invariant |
| Point map | World coordinates | Scale-invariant local |
| Parameters | 1.26B | 959M |
| Inference FPS (KITTI) | 43.2 | 57.4 |
| RealEstate10K AUC@30 | 77.62 | 85.90 |
| DTU Accuracy (cm) | 1.338 | 1.198 |
| Sintel ATE | 0.167 | 0.074 |

The key takeaway: π³ is smaller, faster, and more accurate than VGGT. The accuracy and stability gains are attributed to removing the reference-view bias; the ablation in Section 7 isolates the effect of the invariant representations.


7. Ablation: Effect of Affine/Scale Invariance

| Model | ETH3D Acc. | 7-Scenes Acc. | NRGBD Acc. |
|---|---|---|---|
| Baseline (no invariance) | 0.229 | 0.020 | 0.034 |
| + Scale-invariant point | 0.197 | 0.020 | 0.031 |
| + Affine-invariant camera (full) | 0.131 | 0.019 | 0.028 |

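Affine-invariant camera modeling is the dominant contributor: on ETH3D it lowers the error from 0.197 to 0.131, a larger step than the 0.229 → 0.197 drop from scale-invariant point maps. Scale-invariant geometry shows pronounced benefits on outdoor data, with more modest gains indoors (7-Scenes stays at 0.020, NRGBD moves from 0.034 to 0.031).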


8. Limitations

  • Transparent objects: Simplified light transport assumptions preclude handling of transparent or reflective surfaces.
  • Grid-like artifacts: MLP-based point cloud upsampling produces visible grid patterns in uncertain regions (partially addressed in Pi3X with convolutional heads).
  • Fine-grained detail: Falls short of diffusion-based reconstruction in high-frequency detail.
  • Dynamic scenes: No explicit handling of non-rigid motion beyond training data diversity.

9. Summary

π³ answers one question: “Is a reference view actually necessary?”

The answer is no. Removing the reference view and designing a permutation-equivariant architecture yielded a model that is smaller, faster, and more accurate, with 10× better stability under input reordering.

“The reference view was a convenience, not a necessity.”

If VGGT showed that “processing all views together at once” beats DUSt3R’s pairwise approach, π³ shows that “eliminating the hierarchy among views (reference frame)” is the next step forward.

The inductive biases of 3D reconstruction are being removed one by one. What comes next?

* Posts in this blog were written with the assistance of Claude Code.