[Paper Review] Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation (CVPR 2026)
Paper: Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
Venue: CVPR 2026
Authors: Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, Lu Qi
Affiliations: Insta360 Research Team, Wuhan University, UC Merced, UC San Diego
arXiv: 2512.16913
GitHub: Insta360-Research-Team/DAP
One-Line Summary
The first foundation model for panoramic depth estimation. Combines a 2M+ panoramic dataset, a three-stage pseudo-label pipeline, and a plug-and-play range mask head to achieve metric consistency across diverse indoor and outdoor scenes.

1. Background and Problem Statement
What is Panoramic Depth Estimation?
Panoramic depth estimation is the task of predicting per-pixel depth values from 360-degree (omnidirectional) images. Unlike standard pinhole images, panoramas are usually stored in Equirectangular Projection (ERP) format, which introduces severe distortion near the poles and a discontinuity at the left and right image boundaries, where the panorama wraps around.
Panoramic depth estimation is a key building block for VR/AR, indoor navigation, spatial understanding, and 360-degree 3D reconstruction.
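To make the ERP geometry concrete, here is a minimal sketch of how each ERP pixel maps to a viewing direction on the unit sphere, and how a depth map becomes a 3D point map. The axis convention (y up, z forward) and the reading of depth as radial distance along the ray are assumptions for illustration, not details taken from the paper. The function names are hypothetical.

```python
import numpy as np

def erp_ray_directions(height: int, width: int) -> np.ndarray:
    """Unit viewing direction for every ERP pixel, shape (H, W, 3)."""
    # Pixel centers mapped to longitude in (-pi, pi) and latitude in (-pi/2, pi/2).
    u = (np.arange(width) + 0.5) / width
    v = (np.arange(height) + 0.5) / height
    lon = (u - 0.5) * 2.0 * np.pi
    lat = (0.5 - v) * np.pi
    lon, lat = np.meshgrid(lon, lat)  # both (H, W)
    # Assumed convention: x right, y up, z forward.
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)

def erp_depth_to_points(depth: np.ndarray) -> np.ndarray:
    """Back-project an (H, W) radial depth map to an (H, W, 3) point map."""
    rays = erp_ray_directions(*depth.shape)
    return rays * depth[..., None]

# A constant depth of 2 m places every point on a sphere of radius 2.
points = erp_depth_to_points(np.full((4, 8), 2.0))
```

Because every row spans the full 360 degrees of longitude, rows near the top and bottom of the image cover far less of the sphere than rows near the equator. This is the source of the pole distortion described above.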
Limitations of Existing Methods
Applying general-purpose monocular depth models (DepthAnything V2, UniDepth, etc.) to panoramic images leads to a significant performance drop. Three fundamental reasons:
① Poor handling of geometric distortion: ERP’s nonlinear distortion — especially near the poles — is incompatible with pinhole camera assumptions, causing accumulated depth prediction errors.
② Distance range mismatch: No existing metric depth model handles both indoor (a few meters) and outdoor (hundreds of meters) ranges simultaneously. Existing panorama-specific models overfit to a particular range.
③ Data scarcity: Ground-truth per-pixel depth for panoramic images is extremely expensive to collect. No large-scale training dataset existed, so no prior work had achieved foundation-model-level generalization.
2. Core Ideas
Three contributions working in concert.
① Large-Scale Panoramic Dataset (DAP-2M)
A 2M+ sample panoramic dataset is constructed. Rather than collecting labeled data alone, a pseudo-label pipeline converts large volumes of unlabeled web images and generated images into training data, enabling scale.
② Three-Stage Pseudo-Label Pipeline
A training pipeline that progressively bridges the synthetic-to-real domain gap and the indoor-to-outdoor distance range gap. A PatchGAN-based discriminator filters out low-confidence pseudo-labels, ensuring only high-quality samples enter training.
③ Plug-and-Play Range Mask Head
To handle the large difference in depth range between indoor and outdoor scenes, an auxiliary head predicts a binary range mask that classifies each pixel by a distance threshold. This classification informs the metric depth prediction. The head is independently attachable to existing architectures — hence plug-and-play.
3. Dataset Construction: DAP-2M
Data Composition

| Source | Domain | GT Label | Samples |
|---|---|---|---|
| Structured3D | Indoor (synthetic) | ✓ | 18,298 |
| UE5 Synthetic (DAP-2M-Labeled) | Outdoor (synthetic) | ✓ | ~90,000 |
| Web / DiT360 (DAP-2M-Unlabeled) | Indoor + Outdoor | ✗ | ~1.9M |
| Total | | | ~2.1M |
Labeled data (108K):
- Structured3D: Indoor synthetic panoramas with pixel-level depth GT.
- UE5 Synthetic Outdoor: Panoramas rendered in Unreal Engine 5, filling the outdoor gap left by Structured3D.
Unlabeled data (1.9M):
- Web-collected real panoramas: Real 360-degree images scraped from the internet.
- DiT360-generated panoramas: Diverse synthetic scenes generated by a text-to-image diffusion model.
Three-Stage Pseudo-Label Pipeline

Stage 1 — Scene-Invariant Labeler
- Training data: the 108K labeled set (about 18K synthetic indoor from Structured3D + about 90K synthetic outdoor from UE5, all with GT)
- Goal: Build a base model that generalizes across indoor and outdoor scene types
- Result: Strong scene-type generalization, but a synthetic-to-real domain gap remains
Stage 2 — Realism-Invariant Labeler
- Apply Stage 1 model to generate pseudo-labels for 1.9M unlabeled images
- PatchGAN discriminator: Trained to distinguish synthetic vs. real domain predictions
- Top-scoring samples selected: 300K indoor + 300K outdoor = 600K high-confidence pseudo-labels
- Low-quality pseudo-labels are filtered out before they can corrupt training
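The Stage 2 selection step can be sketched as a per-domain top-k filter. This assumes each pseudo-labeled image has already been reduced to one scalar discriminator score where higher means more reliable. How the PatchGAN patch outputs are aggregated into that score is not described in this review, so the scoring itself is left out. `select_top_k_per_domain` is a hypothetical helper, not a function from the DAP codebase.

```python
import numpy as np

def select_top_k_per_domain(scores, domains, k_per_domain):
    """Return sorted indices of the k highest-scoring samples in each domain."""
    scores = np.asarray(scores, dtype=float)
    domains = np.asarray(domains)
    keep = []
    for domain, k in k_per_domain.items():
        idx = np.flatnonzero(domains == domain)           # samples of this domain
        order = np.argsort(-scores[idx], kind="stable")   # best score first
        keep.append(idx[order[:k]])
    return np.sort(np.concatenate(keep))

# Toy example; the review reports 300K indoor + 300K outdoor in the real pipeline.
scores = [0.9, 0.2, 0.7, 0.4, 0.8, 0.1]
domains = ["indoor", "indoor", "outdoor", "outdoor", "indoor", "outdoor"]
selected = select_top_k_per_domain(scores, domains, {"indoor": 2, "outdoor": 1})
```

Selecting per domain rather than globally keeps the indoor and outdoor shares balanced even if the discriminator scores one domain systematically higher.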
Stage 3 — Full DAP Training
- Training data: the full ~2.1M set, i.e. the 108K labeled samples, the 600K high-confidence pseudo-labeled samples, and the remaining unlabeled images used in semi-supervised fashion
- Initialized from Stage 2 model, trained with large-scale semi-supervised learning
- Final DAP foundation model
4. Architecture

Backbone: DINOv3-Large
DINOv3-Large is used as the encoder, providing powerful pre-trained visual representations. Its global and local feature capture capabilities are well-suited for panoramic imagery’s diverse spatial scales.
Distortion-Aware Depth Decoder
A decoder that explicitly accounts for ERP geometric distortion. Pixels near the poles subtend a small solid angle in 3D space despite appearing stretched in the image; the decoder corrects for this discrepancy.
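The solid-angle statement can be checked numerically. The sketch below computes the exact solid angle covered by each ERP pixel row. It illustrates the geometry only and is not the decoder's actual mechanism, which this review does not specify.

```python
import numpy as np

def erp_solid_angle(height: int, width: int) -> np.ndarray:
    """Solid angle in steradians covered by each ERP pixel, shape (H, W)."""
    # Latitudes of the upper and lower edge of every pixel row.
    lat_edges = np.linspace(np.pi / 2, -np.pi / 2, height + 1)
    # Exact area of each latitude band on the unit sphere.
    band = 2.0 * np.pi * (np.sin(lat_edges[:-1]) - np.sin(lat_edges[1:]))
    # Each band is shared equally by the pixels of that row.
    return np.repeat((band / width)[:, None], width, axis=1)

omega = erp_solid_angle(512, 1024)
pole_to_equator = omega[0, 0] / omega[256, 0]  # far below 1
```

The per-pixel values sum to 4π, the full sphere. A pixel in the top row covers only a small fraction of the solid angle of a pixel at the equator, even though both occupy the same area in the image.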
Plug-and-Play Range Mask Head
One of the key contributions. It first answers: “What is the depth range of this scene?” — then feeds that classification back into the depth prediction.
- Range mask M: Binary classification of pixels as near/far relative to threshold τ
- Threshold τ selection: 10m, 20m, 50m, and 100m were tested; 100m performs best overall
- Mask loss:
\[\mathcal{L}_{mask} = \|M - M_{gt}\|^2 + 0.5\,\mathcal{L}_{Dice}(M, M_{gt})\]
- Attachable to existing architectures without modification
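A minimal NumPy sketch of the mask loss above, under a few assumptions not confirmed by the source: pixels at or below τ are labeled 1, the squared-error term is averaged over pixels rather than summed, and the Dice term uses the common soft-Dice form with a small smoothing constant.

```python
import numpy as np

def range_mask_target(depth_gt: np.ndarray, tau: float = 100.0) -> np.ndarray:
    """Binary GT mask: 1 where depth <= tau (assumed polarity), else 0."""
    return (np.asarray(depth_gt) <= tau).astype(np.float64)

def dice_loss(pred: np.ndarray, target: np.ndarray, eps: float = 1e-6) -> float:
    """Soft Dice loss: 0 for perfect overlap, close to 1 for no overlap."""
    inter = np.sum(pred * target)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps))

def mask_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Squared error (mean over pixels, an assumption) plus 0.5 * Dice."""
    l2 = float(np.mean((pred - target) ** 2))
    return l2 + 0.5 * dice_loss(pred, target)

depth_gt = np.array([[2.0, 50.0], [150.0, 300.0]])
m_gt = range_mask_target(depth_gt)       # [[1, 1], [0, 0]]
loss_perfect = mask_loss(m_gt, m_gt)     # approximately 0
loss_inverted = mask_loss(1.0 - m_gt, m_gt)
```

The squared-error term penalizes each pixel independently, while the Dice term measures region overlap, which helps when one class covers only a small part of the image.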
5. Loss Functions
Six loss terms designed for panoramic imagery’s unique properties:
\[\mathcal{L}_{total} = M_{distort} \odot (\lambda_1 \mathcal{L}_{SILog} + \lambda_2 \mathcal{L}_{DF} + \lambda_3 \mathcal{L}_{grad} + \lambda_4 \mathcal{L}_{normal} + \lambda_5 \mathcal{L}_{pts} + \lambda_6 \mathcal{L}_{mask})\]
M_distort: ERP distortion weight map. Downweights pixels near the poles where distortion is severe, preventing high-distortion regions from dominating the loss.
| Loss | Weight | Purpose |
|---|---|---|
| ℒ_SILog | λ₁=1.0 | Scale-Invariant Log loss. Overall depth scale alignment |
| ℒ_DF | λ₂=0.4 | DF-Gram loss. Matches depth distribution statistics across 12 icosahedron patches (global geometric consistency) |
| ℒ_grad | λ₃=5.0 | Gradient loss. Preserves sharpness at boundaries and surface transitions |
| ℒ_normal | λ₄=2.0 | Normal loss. Matches surface normals derived from predicted depth to GT normals |
| ℒ_pts | λ₅=2.0 | Point-Cloud loss. Minimizes Euclidean distance between predicted and GT 3D point clouds |
| ℒ_mask | λ₆=2.0 | Supervised loss for the range mask head |
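As a concrete reference for the first row of the table, here is a sketch of the widely used scale-invariant log (SILog) loss. The review does not state which variant the paper uses, so the `variance_weight` default of 0.85 and the validity masking are assumptions borrowed from common practice.

```python
import numpy as np

def silog_loss(pred, gt, variance_weight: float = 0.85, eps: float = 1e-6) -> float:
    """Scale-invariant log loss over pixels with valid (positive) GT depth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > 0
    # Log-depth residual; predictions are clipped to keep the log finite.
    d = np.log(np.clip(pred[valid], eps, None)) - np.log(gt[valid])
    value = np.mean(d ** 2) - variance_weight * np.mean(d) ** 2
    # Clamp tiny negative values caused by floating-point rounding.
    return float(np.sqrt(max(value, 0.0)))

gt = np.array([[1.0, 2.0], [5.0, 10.0]])
loss_exact = silog_loss(gt, gt)                                  # 0 for a perfect prediction
loss_scaled = silog_loss(2.0 * gt, gt)                           # global scale error is penalized only partly
loss_scaled_invariant = silog_loss(2.0 * gt, gt, variance_weight=1.0)  # approximately 0
```

With `variance_weight=1.0` the loss ignores a global scale factor entirely. Values below 1 keep part of the scale error in the loss, which matters for a model that is meant to predict metric depth.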
ℒ_DF (DF-Gram) detail: The sphere is partitioned into 12 icosahedron patches. For each patch k, the Gram matrix of depth values is computed and compared with the GT Gram matrix. This enforces global geometric consistency while being relatively robust to ERP distortion.
\[\mathcal{L}_{DF} = \frac{1}{N}\sum_{k}\|D_{pred}^{(k)}\odot D_{pred}^{(k)T} - D_{gt}^{(k)}\odot D_{gt}^{(k)T}\|_F^2\]
6. Training Configuration
| Setting | Value |
|---|---|
| Input resolution | 512×1024 (ERP) |
| Optimizer | Adam |
| Backbone learning rate | 5e-6 |
| Decoder learning rate | 5e-5 |
| Hardware | NVIDIA H20 GPUs |
| Data augmentation | Color jittering, horizontal translation, horizontal flipping |
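The two geometric augmentations in the table suit ERP images because both keep the panorama valid: a horizontal translation with wrap-around is a rotation about the vertical axis, and a horizontal flip is a mirror of the scene. The sketch below applies them jointly to an image and its depth map. The flip probability of 0.5 and the uniform shift are assumptions, and color jittering is left out because it affects only the image.

```python
import numpy as np

def augment_erp(image: np.ndarray, depth: np.ndarray, rng: np.random.Generator):
    """Random circular horizontal shift and horizontal flip for an ERP pair."""
    width = image.shape[1]
    # Circular shift along the width axis: columns wrap around the seam.
    shift = int(rng.integers(0, width))
    image = np.roll(image, shift, axis=1)
    depth = np.roll(depth, shift, axis=1)
    # Mirror image and depth together so they stay pixel-aligned.
    if rng.random() < 0.5:
        image = image[:, ::-1]
        depth = depth[:, ::-1]
    return image, depth

rng = np.random.default_rng(0)
depth = np.arange(32, dtype=float).reshape(4, 8)
image = np.repeat(depth[..., None], 3, axis=2)  # toy image that mirrors the depth values
aug_image, aug_depth = augment_erp(image, depth, rng)
```

Vertical shifts or flips are not included: they would move the poles and break the ERP geometry.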
7. Experimental Results
Zero-Shot Benchmarks (Table 3)

Stanford2D3D (indoor, not seen during training):
| Method | AbsRel ↓ | RMSE ↓ | δ₁ ↑ |
|---|---|---|---|
| DepthAnything V2 | 0.1822 | 0.7340 | 0.7691 |
| UniDepth | 0.1654 | 0.6893 | 0.8012 |
| PanDA | 0.1135 | 0.4210 | 0.8901 |
| DAP | 0.0921 | 0.3820 | 0.9135 |
Matterport3D (indoor):
| Method | AbsRel ↓ | RMSE ↓ | δ₁ ↑ |
|---|---|---|---|
| DAP | 0.1186 | 0.7510 | 0.8518 |
Deep360 (outdoor):
| Method | AbsRel ↓ | RMSE ↓ | δ₁ ↑ |
|---|---|---|---|
| DAP | 0.0659 | 5.224 | 0.9525 |
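For reference, the three metrics in these tables follow the standard definitions sketched below. Benchmark protocols often add steps such as depth caps or scale alignment. Those are not described in this review and are not included here.

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """AbsRel, RMSE, and delta_1 over pixels with valid (positive) GT depth."""
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    # delta_1: share of pixels whose depth ratio is within a factor of 1.25.
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)
    return {"AbsRel": float(abs_rel), "RMSE": float(rmse), "delta1": float(delta1)}

gt = np.array([1.0, 2.0, 4.0, 0.0])    # last pixel has no GT and is ignored
pred = np.array([1.1, 2.0, 6.0, 3.0])
metrics = depth_metrics(pred, gt)
```

AbsRel and δ₁ are relative measures, while RMSE is in meters. This is why the outdoor RMSE values are much larger than the indoor ones even when AbsRel is lower.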
DAP-Test Benchmark (Table 4)

| Method | AbsRel ↓ | RMSE ↓ | δ₁ ↑ |
|---|---|---|---|
| DAC | 0.3197 | 8.799 | 0.5193 |
| Unik3D | 0.2517 | 10.56 | 0.6086 |
| DAP | 0.0781 | 6.804 | 0.9370 |
DAP vs. Unik3D:
- AbsRel: 0.2517 → 0.0781 (69% reduction)
- RMSE: 10.56 → 6.804 (35.6% reduction)
- δ₁: 0.6086 → 0.9370 (54.0% relative improvement)
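The relative changes can be recomputed directly from the Table 4 values with a few lines of Python. The helper name is hypothetical.

```python
def relative_change(baseline: float, value: float) -> float:
    """Signed change of value relative to baseline, in percent."""
    return (value - baseline) / baseline * 100.0

# (Unik3D, DAP) pairs copied from Table 4.
table4 = {
    "AbsRel": (0.2517, 0.0781),
    "RMSE": (10.56, 6.804),
    "delta1": (0.6086, 0.9370),
}

changes = {name: relative_change(unik3d, dap) for name, (unik3d, dap) in table4.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

Negative values are reductions (lower is better for AbsRel and RMSE). The positive δ₁ value is a relative gain, not a percentage-point difference.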
8. Ablation Studies
Component Analysis (Table 5)

| Distortion | Geometry | Sharpness | Stanford AbsRel ↓ | Stanford δ₁ ↑ | Deep360 AbsRel ↓ | Deep360 δ₁ ↑ |
|---|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 0.1166 | 0.8409 | 0.0942 | 0.8396 |
| ✓ | ✗ | ✗ | 0.1149 | 0.8440 | 0.0926 | 0.8423 |
| ✓ | ✓ | ✗ | 0.1112 | 0.8509 | 0.0880 | 0.8592 |
| ✓ | ✓ | ✓ | 0.1084 | 0.8576 | 0.0862 | 0.8719 |
Each component’s role:
- Distortion-awareness: M_distort weight map reduces loss contribution of high-distortion pole pixels
- Geometric consistency: ℒ_DF, ℒ_normal, ℒ_pts losses improve 3D structural accuracy
- Sharpness: ℒ_grad loss enforces sharp object boundaries and surface transitions
Range Mask Threshold Analysis (Table 6)

| Threshold τ | DAP-2M AbsRel ↓ | DAP-2M δ₁ ↑ | Deep360 AbsRel ↓ | Deep360 δ₁ ↑ |
|---|---|---|---|---|
| 10m | 0.0801 | 0.9315 | 0.0934 | 0.8493 |
| 20m | 0.0823 | 0.9164 | 0.0873 | 0.8668 |
| 50m | 0.0864 | 0.9104 | 0.0843 | 0.8594 |
| 100m | 0.0793 | 0.9353 | 0.0862 | 0.8719 |
| None | 0.0832 | 0.9042 | 0.0938 | 0.8411 |
Why 100m? The interpretation is that 100m is the practical boundary separating indoor and near-urban scenes from large-scale outdoor environments. In Table 6 it gives the best value on three of the four metrics; only Deep360 AbsRel favors 50m (0.0843 vs. 0.0862). Lower thresholds (10m, 20m) misclassify mid-range outdoor scenes. Removing the mask underperforms 100m on every metric, which supports keeping range classification in the model.
9. Comparison with Prior Work
| Aspect | Pinhole Depth Models (DAV2, etc.) | Panorama-Specific (PanDA, etc.) | DAP |
|---|---|---|---|
| Geometric distortion handling | ✗ (pinhole assumption) | ✓ | ✓ (distortion-aware) |
| Metric depth | △ (relative) | △ | ✓ (absolute metric) |
| Indoor + outdoor | △ | ✗ (mostly indoor) | ✓ |
| Training data scale | Large (pinhole) | Small | 2M+ panoramas |
| Zero-shot generalization | Low | Medium | High |
| Range adaptation | ✗ | ✗ | ✓ (mask head) |
10. Limitations
- Pseudo-label error propagation: The three-stage pipeline depends on Stage 1 quality. Errors in the initial labeler can accumulate into later stages.
- Extreme pole distortion: M_distort mitigates but does not fully resolve severe distortion at the poles of ERP images.
- High-resolution cost: Designed for 512×1024 input. Processing higher-resolution panoramas increases memory and latency substantially.
- Outdoor GT scarcity: All outdoor labeled training data comes from UE5 synthetic rendering. Validation against real LiDAR GT for outdoor scenes is limited.
11. Summary
DAP is the first work to successfully apply the foundation model paradigm to panoramic depth estimation. The core message:
“Data scale + domain-bridging pipeline + panorama-aware design = metric depth for any panorama”
Where pinhole depth models fail on panoramas due to geometric incompatibility, and where prior panorama-specific models overfit to indoor ranges, DAP overcomes both with a 2M+ diverse dataset, a staged pseudo-label pipeline, and a range mask head that elegantly handles the indoor-outdoor depth range discrepancy.
It is also noteworthy that this work comes from Insta360 — a 360-degree camera hardware company — signaling a clear path from research to real-world product deployment.