[Paper Review] π³: Permutation-Equivariant Visual Geometry Learning (ICLR 2026)
Paper: π³: Permutation-Equivariant Visual Geometry Learning
Venue: ICLR 2026
Authors: Yifan Wang*, Jianjun Zhou*, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, Tong He
Affiliations: Shanghai Jiao Tong University, Shanghai AI Laboratory, Shanghai Innovation Institute, Zhejiang University, USTC, Fudan University
arXiv: 2507.13347
GitHub: yyfz/Pi3
One-Line Summary
A permutation-equivariant 3D reconstruction model that completely eliminates the fixed reference view relied on by prior methods, predicting camera poses and point maps independent of input image ordering.
1. Background and Problem Definition
Reference View Bias
Modern feed-forward 3D reconstruction methods — including DUSt3R, MASt3R, and VGGT — share a common inductive bias: the first image is used as the anchor (world coordinate origin).
This seemingly natural choice creates structural weaknesses:
- Order dependence: Results change with input ordering. A blurry or partially occluded first image destabilizes the entire reconstruction.
- Asymmetric processing: The reference view receives structural privilege through special tokens or fixed coordinate systems.
- Failure modes: A suboptimal reference view can cause the entire reconstruction to fail.
VGGT fixes the first camera as identity; DUSt3R/MASt3R use explicit reference views during pairwise processing. π³ is the first work to quantify how much this matters.
When the same scene is fed to VGGT with different frame orderings, the standard deviation of DTU reconstruction accuracy reaches 0.033 cm. π³ reduces this to 0.003 cm, roughly a 10× improvement in stability.
π³’s Question
“Can accurate and stable 3D reconstruction be achieved without any reference view?”
2. Core Idea
Permutation Equivariance
The central claim of π³ is one mathematical property:
\[f(\sigma(\mathbf{I})) = \sigma(f(\mathbf{I})) \quad \forall \sigma \in S_N\]
Applying any permutation σ to the input image set should yield the same permutation applied to the output: every image is processed identically regardless of its position in the sequence.
Two design choices enforce this:
- Affine-Invariant Camera Poses: Rotation is predicted in SO(3); translation is expressed without a global anchor point, using an affine-invariant representation.
- Scale-Invariant Local Point Maps: Per-view point maps are predicted in local camera coordinates. Scale ambiguity is absorbed into the invariant representation rather than resolved by a reference frame.
Both choices together make the full pipeline permutation-equivariant.
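The equivariance property can be checked numerically. The following is a minimal sketch, not the π³ network: `toy_model` and `anchored_model` are hypothetical stand-ins that map one feature vector per view to one output per view. The first mixes in an order-free summary (the mean over views); the second uses the first view as context, which mimics a reference-view bias.

```python
import numpy as np

def toy_model(views: np.ndarray) -> np.ndarray:
    """Hypothetical per-view predictor (not the π³ network).

    views: (N, D) array, one feature vector per view.
    Each output row depends on its own view plus the mean over all views.
    The mean does not change under reordering, so the map is equivariant.
    """
    context = views.mean(axis=0, keepdims=True)
    return np.tanh(views + context)

def anchored_model(views: np.ndarray) -> np.ndarray:
    """Same idea, but the context is the first view (a reference-view bias)."""
    return np.tanh(views + views[:1])

def is_equivariant(f, views: np.ndarray, perm: np.ndarray, tol: float = 1e-9) -> bool:
    """Check f(sigma(I)) == sigma(f(I)) for a single permutation sigma."""
    return bool(np.allclose(f(views[perm]), f(views)[perm], atol=tol))

rng = np.random.default_rng(0)
views = rng.normal(size=(5, 8))
perm = np.array([2, 0, 4, 1, 3])  # moves a different view into first position

print(is_equivariant(toy_model, views, perm))       # True
print(is_equivariant(anchored_model, views, perm))  # False
```

A single permutation is enough to expose the anchored variant here; a more thorough check would loop over several random permutations.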
Fundamental Differences from Prior Work
| Aspect | DUSt3R / MASt3R | VGGT | π³ |
|---|---|---|---|
| Reference view | Required (pairwise) | Required (first = identity) | None |
| Order dependence | High | High | None |
| Pose representation | Absolute coordinates | Absolute coordinates | Affine-invariant |
| Point map | World coordinates | World coordinates | Scale-invariant local |
3. Architecture
Full Pipeline
N input images (B × N × 3 × H × W)
→ DINOv2 patch encoder → image token sequences
→ Alternating-Attention Transformer (36 layers)
├─→ Camera Head → affine-invariant camera poses (SE(3))
├─→ Point Map Head → scale-invariant local point maps
└─→ Confidence Head → confidence maps
Alternating-Attention (36 Layers)
The structure resembles VGGT’s, but with 36 layers instead of 24 and with all order-dependent components removed. Each block alternates two attention types:
- View-wise Self-Attention: Tokens within the same image attend to each other. Extracts per-frame spatial features.
- Global Self-Attention: All tokens across all frames attend to each other. Learns cross-view 3D consistency.
Removed elements (to guarantee permutation equivariance):
- Frame index positional embeddings (no order information injected)
- Reference-view special tokens (no reference token exists)
- Cross-attention between views (no asymmetric processing)
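To illustrate why removing frame-index embeddings matters, here is a heavily simplified sketch of one alternating-attention block in NumPy. It assumes single-head attention with identity projections and omits the MLPs, normalization, and learned weights of a real transformer; the function names are illustrative and do not come from the π³ codebase. The second variant adds a per-frame index offset to show how order information breaks equivariance.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens: np.ndarray) -> np.ndarray:
    """Single-head self-attention with identity projections. tokens: (..., T, D)."""
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores) @ tokens

def alternating_block(x: np.ndarray) -> np.ndarray:
    """x: (N, P, D) = views, patch tokens per view, channels."""
    n, p, d = x.shape
    x = x + self_attention(x)            # view-wise: each view attends only to itself
    flat = x.reshape(1, n * p, d)
    flat = flat + self_attention(flat)   # global: all tokens of all views attend jointly
    return flat.reshape(n, p, d)

def block_with_frame_index(x: np.ndarray) -> np.ndarray:
    """Same block, but a frame-index offset is injected first (order-dependent)."""
    n = x.shape[0]
    frame_index = np.arange(n, dtype=x.dtype).reshape(n, 1, 1)
    return alternating_block(x + frame_index)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 8))
perm = np.array([2, 0, 3, 1])

# Without order information, permuting views just permutes the outputs.
print(np.allclose(alternating_block(x[perm]), alternating_block(x)[perm]))            # True
# With a frame-index embedding, the outputs depend on where each view sits.
print(np.allclose(block_with_frame_index(x[perm]), block_with_frame_index(x)[perm]))  # False
```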
Camera Head
Predicts SE(3) matrices (4×4) in affine-invariant form. The first frame is NOT fixed as identity; only relative relationships between predicted poses are used as supervision.
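One way to see why supervising only relative relationships removes the need for an anchor: the relative pose between two cameras does not depend on the choice of world frame. The sketch below assumes camera-to-world 4×4 matrices and reads "affine-invariant" as invariance to a global rigid transform plus a global scale; both the convention and the helper names are assumptions made for illustration, not details confirmed by this review.

```python
import numpy as np

def random_rotation(rng: np.random.Generator) -> np.ndarray:
    """Draw a random 3x3 rotation (orthogonal, determinant +1) via QR."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]  # flip one axis to turn a reflection into a rotation
    return q

def make_pose(rot: np.ndarray, trans: np.ndarray) -> np.ndarray:
    """Assemble a 4x4 rigid transform from a rotation and a translation."""
    pose = np.eye(4)
    pose[:3, :3] = rot
    pose[:3, 3] = trans
    return pose

def relative_poses(poses: np.ndarray) -> np.ndarray:
    """poses: (N, 4, 4). Returns (N, N, 4, 4) with rel[i, j] = inv(T_i) @ T_j."""
    inv = np.linalg.inv(poses)
    return inv[:, None] @ poses[None, :]

rng = np.random.default_rng(0)
poses = np.stack([make_pose(random_rotation(rng), rng.normal(size=3)) for _ in range(4)])

# Re-express every camera in a different world frame (global rigid transform).
world_change = make_pose(random_rotation(rng), rng.normal(size=3))
moved = world_change @ poses

# Rescale the scene: rotations are unchanged, translations are multiplied by s.
s = 3.0
scaled = poses.copy()
scaled[:, :3, 3] *= s

rel = relative_poses(poses)
rel_moved = relative_poses(moved)
rel_scaled = relative_poses(scaled)

print(np.allclose(rel, rel_moved))                              # True
print(np.allclose(rel_scaled[..., :3, :3], rel[..., :3, :3]))   # True
print(np.allclose(rel_scaled[..., :3, 3], s * rel[..., :3, 3])) # True
```

The last check shows that a global scale only rescales relative translations, which is consistent with the translation loss in Section 4 being described as a scaled error.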
Point Map Head
Predicts 3D points per pixel in local camera coordinates — not world coordinates. This is the key: consistency is maintained regardless of which view is “first.”
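Local point maps become a shared point cloud only when combined with the predicted poses. A minimal sketch of that step, assuming camera-to-world poses and an `(N, H, W, 3)` point-map layout (both are assumptions for illustration):

```python
import numpy as np

def local_to_world(points_local: np.ndarray, poses_c2w: np.ndarray) -> np.ndarray:
    """Apply X_world = R @ X_local + t independently for every view.

    points_local: (N, H, W, 3) per-pixel points in each camera's own frame.
    poses_c2w:    (N, 4, 4) camera-to-world transforms (assumed convention).
    """
    rot = poses_c2w[:, :3, :3]
    trans = poses_c2w[:, :3, 3]
    rotated = np.einsum("nij,nhwj->nhwi", rot, points_local)
    return rotated + trans[:, None, None, :]

n, h, w = 3, 2, 2
rng = np.random.default_rng(0)
points_local = rng.uniform(0.5, 2.0, size=(n, h, w, 3))

poses = np.tile(np.eye(4), (n, 1, 1))
poses[1, :3, 3] = [1.0, 0.0, 0.0]  # view 1 sits one unit along x

world = local_to_world(points_local, poses)

# Each view is handled on its own, so reordering views just reorders the output.
perm = np.array([2, 0, 1])
world_perm = local_to_world(points_local[perm], poses[perm])
print(np.allclose(world_perm, world[perm]))  # True
```

Because no view is special in this computation, the choice of "first" view never enters it.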
Pi3X Extension (December 2025)
A follow-up release added:
- Convolutional Output Head: Eliminates grid-like artifacts from MLP upsampling
- Conditional inputs: Optional injection of camera poses, intrinsics, and depth
- Approximate metric scale: Reconstructions can be recovered at approximately metric scale
4. Training
Loss Function
\[\mathcal{L} = \mathcal{L}_{\text{point}} + \lambda_{\text{normal}} \mathcal{L}_{\text{normal}} + \lambda_{\text{conf}} \mathcal{L}_{\text{conf}} + \lambda_{\text{cam}} \mathcal{L}_{\text{cam}} + \lambda_{\text{trans}} \mathcal{L}_{\text{trans}}\]
| Loss | Role | Weight |
|---|---|---|
| \(\mathcal{L}_{\text{point}}\) | Scale-invariant point map L1 (after optimal scale alignment) | 1.0 |
| \(\mathcal{L}_{\text{normal}}\) | Surface normal angular error | 1.0 |
| \(\mathcal{L}_{\text{conf}}\) | Confidence map BCE | 0.05 |
| \(\mathcal{L}_{\text{cam}}\) | Geodesic rotation error | 0.1 |
| \(\mathcal{L}_{\text{trans}}\) | Scaled translation error | 100.0 |
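The sketch below illustrates three pieces of this objective: an L1 point loss after optimal scale alignment, a geodesic rotation error, and the weighted sum using the weights from the table. It is a simplified illustration under stated assumptions: the scale is found with a weighted median, which is the exact minimizer of an unweighted L1 objective over a single scalar, while the paper's actual solver and any per-point weighting are not described in this review. All function names are hypothetical.

```python
import numpy as np

# Weights copied from the loss table.
WEIGHTS = {"point": 1.0, "normal": 1.0, "conf": 0.05, "cam": 0.1, "trans": 100.0}

def optimal_l1_scale(pred: np.ndarray, target: np.ndarray) -> float:
    """Return argmin_s sum_i |s * pred_i - target_i| via a weighted median.

    Each term equals |pred_i| * |s - target_i / pred_i|, so the minimizer is
    the median of the ratios target_i / pred_i weighted by |pred_i|.
    Assumes at least one entry of pred is nonzero.
    """
    x, y = pred.ravel(), target.ravel()
    mask = np.abs(x) > 1e-12          # zero entries do not depend on s
    ratios, weights = y[mask] / x[mask], np.abs(x[mask])
    order = np.argsort(ratios)
    ratios, weights = ratios[order], weights[order]
    cumulative = np.cumsum(weights)
    k = np.searchsorted(cumulative, 0.5 * cumulative[-1])
    return float(ratios[k])

def scale_invariant_l1(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean L1 error after aligning pred to target with the optimal scale."""
    s = optimal_l1_scale(pred, target)
    return float(np.mean(np.abs(s * pred - target)))

def geodesic_angle(rot_a: np.ndarray, rot_b: np.ndarray) -> float:
    """Rotation angle (radians) between two 3x3 rotation matrices."""
    cos = (np.trace(rot_a.T @ rot_b) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def total_loss(terms: dict) -> float:
    """Weighted sum of already-computed loss terms, keyed like WEIGHTS."""
    return sum(WEIGHTS[name] * terms[name] for name in WEIGHTS)

rng = np.random.default_rng(0)
pred = rng.uniform(0.5, 2.0, size=(4, 4, 3))
target = 2.5 * pred                      # same geometry at a different scale

theta = 0.3                              # rotation about the z axis
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]])

print(optimal_l1_scale(pred, target))    # close to 2.5
print(scale_invariant_l1(pred, target))  # close to 0
print(geodesic_angle(np.eye(3), rot_z))  # close to 0.3
```

The large weight on the translation term (100.0) suggests its raw magnitude is small relative to the other terms, though the review does not state the reason.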
Two-Stage Training
| Stage | Resolution | GPUs | Notes |
|---|---|---|---|
| Stage 1 | 224×224 (fixed) | 16× A100 | DINOv2 encoder frozen |
| Stage 2 | Variable, 100K–255K pixels | 64× A100 | Full network trained |
80 epochs per stage, 800 iterations/epoch. Total parameters: 959M (24% lighter than VGGT’s 1.26B).
Training Data (15 Datasets)
CO3D, ScanNet, TartanAir, Habitat, and synthetic renderings spanning indoor, outdoor, and dynamic scenes.
5. Experimental Results
Camera Pose Estimation
RealEstate10K (Zero-shot):
| Method | RTA@30 | AUC@30 |
|---|---|---|
| DUSt3R | 76.1 | 67.7 |
| MASt3R | — | 76.4 |
| VGGT | 93.13 | 77.62 |
| π³ | 95.62 | 85.90 |
Sintel (lower is better):
| Method | ATE (↓) | RPE trans (↓) |
|---|---|---|
| VGGT | 0.167 | 0.062 |
| π³ | 0.074 | 0.040 |
π³ reduces ATE relative to VGGT by about 56% on Sintel (0.167 → 0.074).
Point Map Reconstruction
DTU Dataset (cm, lower is better):
| Method | Accuracy (↓) | Completion (↓) | Normal Consistency (↑) |
|---|---|---|---|
| DUSt3R | 1.620 | 2.241 | 0.640 |
| MASt3R | 1.406 | 2.015 | 0.662 |
| VGGT | 1.338 | 1.896 | 0.676 |
| π³ | 1.198 | 1.849 | 0.678 |
7-Scenes (Dense Views, cm):
| Method | Accuracy (↓) | Completion (↓) |
|---|---|---|
| VGGT | 0.022 | 0.026 |
| π³ | 0.016 | 0.022 |
Video Depth Estimation — KITTI
| Metric | π³ | VGGT | Gain |
|---|---|---|---|
| Abs Rel (↓) | 0.037 | 0.052 | 29% lower |
| δ<1.25 (↑) | 0.986 | 0.968 | +1.8 pts |
| FPS (↑) | 57.4 | 43.2 | 33% faster |
| Parameters | 959M | 1.26B | 24% lighter |
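The "Gain" column mixes relative changes with an absolute difference, so it helps to recompute it from the raw numbers. A short check, using only values from the table (the helper name is illustrative):

```python
def relative_change(new: float, old: float) -> float:
    """Signed relative change of `new` with respect to `old`."""
    return (new - old) / old

abs_rel = relative_change(0.037, 0.052)   # negative: the error went down
fps = relative_change(57.4, 43.2)
params = relative_change(959e6, 1.26e9)   # negative: fewer parameters
delta_points = (0.986 - 0.968) * 100      # absolute gain in percentage points

print(f"Abs Rel:    {abs_rel:+.1%}")       # -28.8%
print(f"FPS:        {fps:+.1%}")           # +32.9%
print(f"Parameters: {params:+.1%}")        # -23.9%
print(f"delta<1.25: {delta_points:+.1f} percentage points")  # +1.8 percentage points
```

The Abs Rel, FPS, and parameter entries are relative changes, while the δ<1.25 entry is an absolute difference between two accuracy ratios.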
Permutation Robustness
Standard deviation of DTU Accuracy across different frame orderings of the same scene:
| Method | std (↓) |
|---|---|
| VGGT | 0.033 |
| π³ | 0.003 |
π³ is roughly 10× more stable than VGGT (0.033 vs. 0.003), which is the behavior the permutation-equivariant design is meant to produce.
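The measurement itself is simple to describe: evaluate the same scene under several frame orderings and take the standard deviation of the metric. The sketch below shows the shape of such a protocol with two toy error functions; `anchored_error` and `symmetric_error` are hypothetical stand-ins, not VGGT or π³, and the number of orderings and the per-view quality values are made up for illustration.

```python
import numpy as np

def ordering_std(metric_fn, num_views: int, num_orderings: int = 20, seed: int = 0) -> float:
    """Std of a metric over random frame orderings of the same scene."""
    rng = np.random.default_rng(seed)
    scores = [metric_fn(rng.permutation(num_views)) for _ in range(num_orderings)]
    return float(np.std(scores))

# Made-up per-view image quality in [0, 1]; view 1 is a poor frame.
quality = np.array([0.9, 0.2, 0.8, 0.95, 0.5, 0.7])

def anchored_error(order: np.ndarray) -> float:
    """Toy error that depends on the quality of whichever view comes first."""
    return 1.0 + 0.1 * (1.0 - quality[order[0]])

def symmetric_error(order: np.ndarray) -> float:
    """Toy error that depends on all views equally, so ordering is irrelevant."""
    return 1.0 + 0.1 * (1.0 - quality[order].mean())

std_anchored = ordering_std(anchored_error, num_views=len(quality))
std_symmetric = ordering_std(symmetric_error, num_views=len(quality))
print(std_anchored > std_symmetric)  # True
```

In the toy symmetric case the spread is zero up to floating-point rounding; a real model would still show a small nonzero spread, as the 0.003 reported for π³ does.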
6. Comparison with VGGT
| Aspect | VGGT | π³ |
|---|---|---|
| Architecture | 24-layer alternating attention | 36-layer alternating attention |
| Reference view | First frame = identity (required) | None |
| Order dependence | Present (std 0.033) | Negligible (std 0.003) |
| Pose representation | Absolute coordinates | Affine-invariant |
| Point map | World coordinates | Scale-invariant local |
| Parameters | 1.26B | 959M |
| Inference FPS (KITTI) | 43.2 | 57.4 |
| RealEstate10K AUC@30 | 77.62 | 85.90 |
| DTU Accuracy (cm) | 1.338 | 1.198 |
| Sintel ATE | 0.167 | 0.074 |
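The key takeaway: π³ is smaller, faster, and more accurate than VGGT. Removing the reference-view bias appears to be the main driver of these gains, although the two models also differ in depth (36 vs. 24 layers), so this is not a fully controlled comparison.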
7. Ablation: Effect of Affine/Scale Invariance
| Model | ETH3D Acc. | 7-Scenes Acc. | NRGBD Acc. |
|---|---|---|---|
| Baseline (no invariance) | 0.229 | 0.020 | 0.034 |
| + Scale-invariant point | 0.197 | 0.020 | 0.031 |
| + Affine-invariant camera (full) | 0.131 | 0.019 | 0.028 |
Affine-invariant camera modeling is the dominant contributor. Scale-invariant geometry shows pronounced benefits on outdoor datasets, with more modest gains indoors.
8. Limitations
- Transparent objects: Simplified light transport assumptions preclude handling of transparent or reflective surfaces.
- Grid-like artifacts: MLP-based point cloud upsampling produces visible grid patterns in uncertain regions (partially addressed in Pi3X with convolutional heads).
- Fine-grained detail: Falls short of diffusion-based reconstruction in high-frequency detail.
- Dynamic scenes: No explicit handling of non-rigid motion beyond training data diversity.
9. Summary
π³ answers one question: “Is a reference view actually necessary?”
The answer is no. Removing the reference view and designing a permutation-equivariant architecture yielded a model that is smaller, faster, and more accurate, with 10× better stability under input reordering.
“The reference view was a convenience, not a necessity.”
If VGGT showed that “processing all views together at once” beats DUSt3R’s pairwise approach, π³ shows that “eliminating the hierarchy among views (reference frame)” is the next step forward.
The inductive biases of 3D reconstruction are being removed one by one. What comes next?