[Paper Review] π³: Permutation-Equivariant Visual Geometry Learning (ICLR 2026)

Paper: π³: Permutation-Equivariant Visual Geometry Learning
Venue: ICLR 2026
Authors: Yifan Wang*, Jianjun Zhou*, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, Tong He
Affiliations: Shanghai Jiao Tong University, Shanghai AI Laboratory, Shanghai Innovation Institute, Zhejiang University, USTC, Fudan University
arXiv: 2507.13347
GitHub: yyfz/Pi3

One-Line Summary

A permutation-equivariant 3D reconstruction model that completely eliminates the fixed reference view relied on by prior methods, predicting camera poses and point maps independent of input image ordering.

1. Background and Problem Definition

Reference View Bias

Modern feed-forward 3D reconstruction methods — including DUSt3R, MASt3R, and VGGT — share a common inductive bias: the first image is used as the anchor (world coordinate origin).

This seemingly natural choice creates structural weaknesses:

Order dependence: Results change with input ordering. A blurry or partially occluded first image destabilizes the entire reconstruction.
Asymmetric processing: The reference view receives structural privilege through special tokens or fixed coordinate systems.
Failure modes: A suboptimal reference view can cause the entire reconstruction to fail.

VGGT fixes the first camera as identity; DUSt3R/MASt3R use explicit reference views during pairwise processing. π³ is the first work to quantify how much this matters.

When the same scene is fed to VGGT with different frame orderings, the standard deviation of reconstruction accuracy is 0.033 cm. π³ reduces this to 0.003 cm, roughly an order of magnitude more stable.

π³’s Question

“Can accurate and stable 3D reconstruction be achieved without any reference view?”

2. Core Idea

Permutation Equivariance

The central claim of π³ is one mathematical property:

\[f(\sigma(\mathbf{I})) = \sigma(f(\mathbf{I})) \quad \forall \sigma \in S_N\]

Applying any permutation σ to the input image set should yield the same permutation applied to the output — every image is processed identically regardless of its position in the sequence.
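
The property can be checked numerically on a toy function. The sketch below is illustrative only: `anchored` and `symmetric` are hypothetical stand-ins (not part of π³) for a reference-view model and a reference-free one, each acting on an array of N per-view feature vectors.

```python
import numpy as np

def anchored(views: np.ndarray) -> np.ndarray:
    """Toy reference-view model: expresses every view relative to view 0."""
    return views - views[0]

def symmetric(views: np.ndarray) -> np.ndarray:
    """Toy reference-free model: expresses every view relative to the set mean."""
    return views - views.mean(axis=0, keepdims=True)

def is_equivariant(f, views: np.ndarray, perm: np.ndarray) -> bool:
    """Check f(sigma(I)) == sigma(f(I)) for one permutation sigma."""
    return bool(np.allclose(f(views[perm]), f(views)[perm]))

rng = np.random.default_rng(0)
views = rng.normal(size=(5, 3))      # N = 5 views, 3 features each
perm = np.array([2, 0, 4, 1, 3])     # a permutation that moves view 0

print(is_equivariant(anchored, views, perm))   # False
print(is_equivariant(symmetric, views, perm))  # True
```

The anchored function fails because its output depends on which view happens to sit in slot 0; the symmetric one passes because the mean does not care about order.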

Two design choices enforce this:

Affine-Invariant Camera Poses: Rotation is predicted in SO(3); translation is expressed without a global anchor point, using an affine-invariant representation.
Scale-Invariant Local Point Maps: Per-view point maps are predicted in local camera coordinates. Scale ambiguity is absorbed into the invariant representation rather than resolved by a reference frame.

Both choices together make the full pipeline permutation-equivariant.

Fundamental Differences from Prior Work

Aspect	DUSt3R / MASt3R	VGGT	π³
Reference view	Required (pairwise)	Required (first = identity)	None
Order dependence	High	High	None
Pose representation	Absolute coordinates	Absolute coordinates	Affine-invariant
Point map	World coordinates	World coordinates	Scale-invariant local

3. Architecture

Full Pipeline

N input images (B × N × 3 × H × W)
    → DINOv2 patch encoder → image token sequences
    → Alternating-Attention Transformer (36 layers)
    ├─→ Camera Head    → affine-invariant camera poses (SE(3))
    ├─→ Point Map Head → scale-invariant local point maps
    └─→ Confidence Head → confidence maps

Alternating-Attention (36 Layers)

Similar to VGGT’s structure but with 36 layers instead of 24, and with all order-dependent components removed:

View-wise Self-Attention: Tokens within the same image attend to each other. Extracts per-frame spatial features.
Global Self-Attention: All tokens across all frames attend to each other. Learns cross-view 3D consistency.

Removed elements (to guarantee permutation equivariance):

Frame index positional embeddings (no order information injected)
Reference-view special tokens (no reference token exists)
Cross-attention between views (no asymmetric processing)
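
A minimal NumPy sketch of one alternating block, assuming single-head attention with identity projections (a simplification; the real model uses learned multi-head layers on DINOv2 tokens). The function names are hypothetical. It shows that, with no frame-index embedding and no reference token, permuting the views simply permutes the output.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections plus a residual."""
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return tokens + softmax(scores) @ tokens

def alternating_block(x):
    """x has shape (N, P, C): views, patches per view, channels."""
    n, p, c = x.shape
    x = self_attention(x)                        # view-wise: tokens attend within their own view
    x = self_attention(x.reshape(1, n * p, c))   # global: every token attends to every token
    return x.reshape(n, p, c)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 8))     # 4 views, 6 patches, 8 channels
perm = rng.permutation(4)

out = alternating_block(x)
out_perm = alternating_block(x[perm])
equivariant = bool(np.allclose(out_perm, out[perm]))
print(equivariant)
```

Adding a per-frame index embedding to `x` before the global step would break this check, which is the reason those components are removed.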

Camera Head

Predicts SE(3) matrices (4×4) in affine-invariant form. The first frame is NOT fixed as identity; only relative relationships between predicted poses are used as supervision.
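
Why relative supervision removes the need for a reference frame can be seen in a few lines. The sketch below assumes 4×4 camera-to-world matrices and covers only the rigid part of the ambiguity (the global scale factor is left out); the helper names are hypothetical.

```python
import numpy as np

def random_rotation(rng):
    """Random proper rotation (det = +1) from a QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))     # fix column signs
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]          # flip one axis to remove a reflection
    return q

def make_pose(rng):
    """Random 4x4 camera-to-world matrix."""
    T = np.eye(4)
    T[:3, :3] = random_rotation(rng)
    T[:3, 3] = rng.normal(size=3)
    return T

def relative_pose(T_i, T_j):
    """Pose of camera j expressed in the frame of camera i."""
    return np.linalg.inv(T_i) @ T_j

rng = np.random.default_rng(0)
poses = [make_pose(rng) for _ in range(3)]
G = make_pose(rng)                   # arbitrary change of world frame
moved = [G @ T for T in poses]

rel_before = relative_pose(poses[0], poses[1])
rel_after = relative_pose(moved[0], moved[1])
print(np.allclose(rel_before, rel_after))   # True
```

Since (G·Tᵢ)⁻¹(G·Tⱼ) = Tᵢ⁻¹Tⱼ, a loss built on pairwise relative poses is unchanged by the choice of world frame, so no view has to be pinned to the identity.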

Point Map Head

Predicts 3D points per pixel in local camera coordinates — not world coordinates. This is the key: consistency is maintained regardless of which view is “first.”
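
Local point maps become a joint point cloud only once they are combined with the predicted poses. A minimal sketch, assuming a 4×4 camera-to-world matrix per view (the function name is hypothetical):

```python
import numpy as np

def local_to_world(points_local, cam_to_world):
    """Map an (H, W, 3) point map from camera coordinates into the shared frame."""
    R = cam_to_world[:3, :3]
    t = cam_to_world[:3, 3]
    return points_local @ R.T + t    # row-vector form of R @ p + t

# Pose: 90-degree rotation about z, then a shift of 1 along x.
pose = np.eye(4)
pose[:3, :3] = np.array([[0.0, -1.0, 0.0],
                         [1.0,  0.0, 0.0],
                         [0.0,  0.0, 1.0]])
pose[:3, 3] = [1.0, 0.0, 0.0]

points = np.zeros((2, 2, 3))
points[..., 0] = 1.0                 # every pixel sees the point (1, 0, 2)
points[..., 2] = 2.0

world = local_to_world(points, pose)
print(world[0, 0])                   # [1. 1. 2.]
```

Because each view carries its own pose, the shared frame can be chosen after prediction, for example by re-expressing everything relative to whichever camera is convenient.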

Pi3X Extension (December 2025)

A follow-up release added:

Convolutional Output Head: Eliminates grid-like artifacts from MLP upsampling
Conditional inputs: Optional injection of camera poses, intrinsics, and depth
Approximate metric scale: Metric-scale reconstruction capability

4. Training

Loss Function

\[\mathcal{L} = \mathcal{L}_{\text{point}} + \lambda_{\text{normal}} \mathcal{L}_{\text{normal}} + \lambda_{\text{conf}} \mathcal{L}_{\text{conf}} + \lambda_{\text{cam}} \mathcal{L}_{\text{cam}} + \lambda_{\text{trans}} \mathcal{L}_{\text{trans}}\]

Loss	Role	Weight
\(\mathcal{L}_{\text{point}}\)	Scale-invariant point map L1 (after optimal scale alignment)	1.0
\(\mathcal{L}_{\text{normal}}\)	Surface normal angular error	1.0
\(\mathcal{L}_{\text{conf}}\)	Confidence map BCE	0.05
\(\mathcal{L}_{\text{cam}}\)	Geodesic rotation error	0.1
\(\mathcal{L}_{\text{trans}}\)	Scaled translation error	100.0
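
Two of these terms are easy to sketch. The code below is an illustration under stated assumptions, not the paper's implementation: the optimal scale is taken as the exact minimizer of an unweighted L1 objective (a weighted median), and the rotation term is the standard geodesic angle on SO(3). The per-point weighting and the solver actually used are not described in this review.

```python
import numpy as np

def l1_optimal_scale(pred, gt):
    """Scale s minimizing sum |s * pred - gt|: a weighted median of gt / pred."""
    p, g = pred.ravel(), gt.ravel()
    keep = np.abs(p) > 1e-12                  # entries with pred = 0 do not depend on s
    ratios, weights = g[keep] / p[keep], np.abs(p[keep])
    order = np.argsort(ratios)
    ratios, weights = ratios[order], weights[order]
    cdf = np.cumsum(weights)
    return ratios[np.searchsorted(cdf, 0.5 * cdf[-1])]

def scale_invariant_l1(pred, gt):
    """Mean L1 error after aligning pred to gt with the optimal scale."""
    s = l1_optimal_scale(pred, gt)
    return np.abs(s * pred - gt).mean()

def geodesic_rotation_error(R_pred, R_gt):
    """Angle in radians of the rotation that takes R_pred to R_gt."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

rng = np.random.default_rng(0)
pred_points = rng.normal(size=(16, 16, 3))
gt_points = 2.5 * pred_points                 # same geometry at a different scale
scale = l1_optimal_scale(pred_points, gt_points)

a = 0.3
R_z = np.array([[np.cos(a), -np.sin(a), 0.0],
                [np.sin(a),  np.cos(a), 0.0],
                [0.0,        0.0,       1.0]])
angle = geodesic_rotation_error(np.eye(3), R_z)
print(scale, angle)
```

A prediction that is correct up to a global scale incurs no point loss here, which is what makes the point term scale-invariant.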

Two-Stage Training

Stage	Resolution	GPU	Notes
Stage 1	224×224 fixed	16× A100	DINOv2 encoder frozen
Stage 2	100K–255K pixels variable	64× A100	Full network trained

80 epochs per stage, 800 iterations/epoch. Total parameters: 959M (24% lighter than VGGT’s 1.26B).

Training Data (15 Datasets)

The training mix includes CO3D, ScanNet, TartanAir, Habitat, and synthetic renderings, spanning indoor, outdoor, and dynamic scenes.

5. Experimental Results

Camera Pose Estimation

RealEstate10K (Zero-shot):

Method	RTA@30	AUC@30
DUSt3R	76.1	67.7
MASt3R	—	76.4
VGGT	93.13	77.62
π³	95.62	85.90

Sintel (lower is better):

Method	ATE (↓)	RPE trans (↓)
VGGT	0.167	0.062
π³	0.074	0.040

π³ reduces ATE by roughly 56% relative to VGGT on Sintel (0.167 → 0.074).

Point Map Reconstruction

DTU Dataset (cm; ↓ lower is better, ↑ higher is better):

Method	Accuracy (↓)	Completion (↓)	Normal Consistency (↑)
DUSt3R	1.620	2.241	0.640
MASt3R	1.406	2.015	0.662
VGGT	1.338	1.896	0.676
π³	1.198	1.849	0.678
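
For readers unfamiliar with the two distance columns, here is a brute-force sketch of the usual definitions: Accuracy is the mean distance from each predicted point to its nearest ground-truth point, and Completion is the reverse. The exact protocol behind the table (alignment, mean vs. median, outlier handling) is not stated in this review, so treat the code as an assumption.

```python
import numpy as np

def nearest_distances(src, dst):
    """For each point in src (M, 3), the distance to its nearest neighbour in dst (K, 3)."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.linalg.norm(diff, axis=-1).min(axis=1)

def accuracy(pred, gt):
    """How close predicted points are to the true surface."""
    return float(nearest_distances(pred, gt).mean())

def completion(pred, gt):
    """How well the prediction covers the true surface."""
    return float(nearest_distances(gt, pred).mean())

gt = np.array([[0.0, 0.0, 0.0],
               [1.0, 0.0, 0.0]])
pred = np.array([[0.0, 0.0, 0.0]])   # exact, but misses half of the surface

print(accuracy(pred, gt))     # 0.0
print(completion(pred, gt))   # 0.5
```

The toy case shows why both numbers are reported: a sparse but exact prediction scores perfectly on Accuracy and poorly on Completion. Real evaluations replace the O(M·K) distance matrix with a KD-tree.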

7-Scenes (Dense Views, cm):

Method	Accuracy (↓)	Completion (↓)
VGGT	0.022	0.026
π³	0.016	0.022

Video Depth Estimation — KITTI

Metric	π³	VGGT	Gain
Abs Rel (↓)	0.037	0.052	29% lower
δ<1.25 (↑)	0.986	0.968	+1.8 pts
FPS (↑)	57.4	43.2	33% faster
Parameters	959M	1.26B	24% lighter
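
The two depth metrics follow the standard definitions, sketched below. The sketch assumes strictly positive depths and skips the valid-pixel masking and scale alignment that a real evaluation would apply first.

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    return float(np.mean(np.abs(pred - gt) / gt))

def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels whose ratio max(pred/gt, gt/pred) is below the threshold."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < threshold))

gt_depth = np.array([1.0, 2.0, 4.0])
pred_depth = np.array([1.1, 2.0, 6.0])

print(abs_rel(pred_depth, gt_depth))          # about 0.2
print(delta_accuracy(pred_depth, gt_depth))   # 2 of 3 pixels pass
```

Abs Rel is an error (lower is better), while δ<1.25 is a fraction of pixels (higher is better), so the gap between 0.986 and 0.968 is a difference in percentage points.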

Permutation Robustness

Standard deviation of DTU Accuracy across different frame orderings of the same scene:

Method	std (↓)
VGGT	0.033
π³	0.003

π³ is roughly 10× more stable than VGGT under reordering (std 0.033 vs. 0.003), which supports the permutation-equivariant design.
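
The measurement itself is simple to reproduce in outline. The sketch below is a hypothetical harness, not the paper's evaluation code: `model` and `metric` are placeholder callables, and the two toy models stand in for a reference-anchored network and an order-free one.

```python
import numpy as np

def order_sensitivity(model, metric, views, n_trials=10, seed=0):
    """Standard deviation of a metric over random orderings of the same views."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_trials):
        perm = rng.permutation(len(views))
        pred = model(views[perm])
        # Undo the permutation so each prediction lines up with its original view.
        restored = np.empty_like(pred)
        restored[perm] = pred
        scores.append(metric(restored))
    return float(np.std(scores))

# Toy stand-ins: outputs relative to the first view vs. relative to the set mean.
anchored_model = lambda v: v - v[0]
symmetric_model = lambda v: v - v.mean(axis=0)
toy_metric = lambda out: float(np.abs(out).mean())

views = np.random.default_rng(1).normal(size=(5, 3))
std_anchored = order_sensitivity(anchored_model, toy_metric, views)
std_symmetric = order_sensitivity(symmetric_model, toy_metric, views)
print(std_anchored, std_symmetric)
```

In this toy setting the symmetric model's spread is zero up to floating-point noise, while the anchored model's score moves whenever a different view lands in the first slot.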

6. Comparison with VGGT

Aspect	VGGT	π³
Architecture	24-layer alternating attention	36-layer alternating attention
Reference view	First frame = identity (required)	None
Order dependence	Present (std 0.033)	Near zero (std 0.003)
Pose representation	Absolute coordinates	Affine-invariant
Point map	World coordinates	Scale-invariant local
Parameters	1.26B	959M
Inference FPS (KITTI)	43.2	57.4
RealEstate10K AUC@30	77.62	85.90
DTU Accuracy (cm)	1.338	1.198
Sintel ATE	0.167	0.074

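The key takeaway: π³ is smaller, faster, and more accurate than VGGT. The ablation in Section 7 ties the accuracy gains to removing the reference-view bias, while the size and speed advantages likely reflect the lighter network (959M vs. 1.26B parameters).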

7. Ablation: Effect of Affine/Scale Invariance

Model	ETH3D Acc.	7-Scenes Acc.	NRGBD Acc.
Baseline (no invariance)	0.229	0.020	0.034
+ Scale-invariant point	0.197	0.020	0.031
+ Affine-invariant camera (full)	0.131	0.019	0.028

Affine-invariant camera modeling is the dominant contributor. Scale-invariant geometry shows pronounced benefits on outdoor datasets, with more modest gains indoors.
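
The relative gains behind that reading can be computed directly from the table. The snippet only restates the numbers above as percentage reductions from the baseline to the full model.

```python
baseline = {"ETH3D": 0.229, "7-Scenes": 0.020, "NRGBD": 0.034}
full = {"ETH3D": 0.131, "7-Scenes": 0.019, "NRGBD": 0.028}

def relative_reduction(before, after):
    """Percentage drop in error going from `before` to `after`."""
    return 100.0 * (before - after) / before

reductions = {name: relative_reduction(baseline[name], full[name]) for name in baseline}
for name, value in reductions.items():
    print(f"{name}: {value:.1f}% lower error")
```

This gives roughly 43% on ETH3D, 5% on 7-Scenes, and 18% on NRGBD, so the benefit is concentrated on ETH3D and much smaller on the two indoor benchmarks.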

8. Limitations

Transparent objects: Simplified light transport assumptions preclude handling of transparent or reflective surfaces.
Grid-like artifacts: MLP-based point cloud upsampling produces visible grid patterns in uncertain regions (partially addressed in Pi3X with convolutional heads).
Fine-grained detail: Falls short of diffusion-based reconstruction in high-frequency detail.
Dynamic scenes: No explicit handling of non-rigid motion beyond training data diversity.

9. Summary

π³ answers one question: “Is a reference view actually necessary?”

The answer is no. Removing the reference view and designing a permutation-equivariant architecture yielded a model that is smaller, faster, and more accurate, with 10× better stability under input reordering.

“The reference view was a convenience, not a necessity.”

If VGGT showed that “processing all views together at once” beats DUSt3R’s pairwise approach, π³ shows that “eliminating the hierarchy among views (reference frame)” is the next step forward.

The inductive biases of 3D reconstruction are being removed one by one. What comes next?