[Paper Review] Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation (CVPR 2026)

Paper: Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
Venue: CVPR 2026
Authors: Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, Lu Qi
Affiliations: Insta360 Research Team, Wuhan University, UC Merced, UC San Diego
arXiv: 2512.16913
GitHub: Insta360-Research-Team/DAP

One-Line Summary

The first foundation model for panoramic depth estimation. Combines a 2M+ panoramic dataset, a three-stage pseudo-label pipeline, and a plug-and-play range mask head to achieve metric consistency across diverse indoor and outdoor scenes.

Figure 1: DAP overview. Precise metric depth is predicted from diverse indoor and outdoor panoramas regardless of scene distance. Consistent depth estimation is demonstrated from close-range indoor spaces to outdoor scenes spanning hundreds of meters.

1. Background and Problem Statement

What is Panoramic Depth Estimation?

Panoramic depth estimation is the task of predicting per-pixel depth values from 360-degree (omnidirectional) images. Unlike pinhole-camera images, panoramas are usually stored in equirectangular projection (ERP), which stretches content increasingly toward the poles and places a seam at the left and right image borders, where the scene actually wraps around continuously.
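
To make the ERP geometry concrete, here is a minimal sketch of the pixel-to-direction mapping and the horizontal stretch it implies. The axis convention (y up, longitude growing left to right) and the function names are assumptions chosen for illustration, not taken from the paper.

```python
import math

def erp_pixel_to_ray(u, v, width, height):
    """Map the centre of ERP pixel (u, v) to a unit direction (x, y, z).

    Assumed convention: longitude spans [-pi, pi) from left to right,
    latitude spans [pi/2, -pi/2] from top to bottom, y points up.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return x, y, z

def horizontal_stretch(v, height):
    """Factor by which ERP stretches row v horizontally: 1 / cos(latitude)."""
    lat = (0.5 - (v + 0.5) / height) * math.pi
    return 1.0 / math.cos(lat)

# Rays of the leftmost and rightmost pixels in the same row are neighbours
# on the sphere, which is why the image border is only an apparent seam.
left = erp_pixel_to_ray(0, 255, 1024, 512)
right = erp_pixel_to_ray(1023, 255, 1024, 512)
seam_cosine = sum(a * b for a, b in zip(left, right))
```

The stretch factor stays close to 1 at the equator and grows without bound toward the top and bottom rows, which is the distortion that pinhole-trained models are not prepared for.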

Panoramic depth estimation is a key building block for VR/AR, indoor navigation, spatial understanding, and 360-degree 3D reconstruction.

Limitations of Existing Methods

Applying general-purpose monocular depth models (DepthAnything V2, UniDepth, etc.) to panoramic images leads to a significant performance drop. Three fundamental reasons:

① Poor handling of geometric distortion: ERP’s nonlinear distortion — especially near the poles — is incompatible with pinhole camera assumptions, causing accumulated depth prediction errors.

② Distance range mismatch: No existing metric depth model handles both indoor (a few meters) and outdoor (hundreds of meters) ranges simultaneously. Existing panorama-specific models overfit to a particular range.

③ Data scarcity: Ground-truth per-pixel depth for panoramic images is extremely expensive to collect. No large-scale training dataset existed, so no prior work had achieved foundation-model-level generalization.

2. Core Ideas

Three contributions working in concert.

① Large-Scale Panoramic Dataset (DAP-2M)

A 2M+ sample panoramic dataset is constructed. Rather than collecting labeled data alone, a pseudo-label pipeline converts large volumes of unlabeled web images and generated images into training data, enabling scale.

② Three-Stage Pseudo-Label Pipeline

A training pipeline that progressively bridges the synthetic-to-real domain gap and the indoor-to-outdoor distance range gap. A PatchGAN-based discriminator filters out low-confidence pseudo-labels, ensuring only high-quality samples enter training.

③ Plug-and-Play Range Mask Head

To handle the large difference in depth range between indoor and outdoor scenes, an auxiliary head predicts a binary range mask that classifies each pixel by a distance threshold. This classification informs the metric depth prediction. The head is independently attachable to existing architectures — hence plug-and-play.

3. Dataset Construction: DAP-2M

Data Composition

Table 2: DAP-2M dataset composition. 18,298 labeled samples from Structured3D, 90K labeled synthetic outdoor samples from UE5, and 1.9M unlabeled web-collected/DiT360-generated panoramas. Total: ~2.1M panoramic samples.

Source	Domain	GT Label	Samples
Structured3D	Indoor (synthetic)	✓	18,298
UE5 Synthetic (DAP-2M-Labeled)	Outdoor (synthetic)	✓	~90,000
Web / DiT360 (DAP-2M-Unlabeled)	Indoor + Outdoor	✗	~1.9M
Total			~2.1M

Labeled data (108K):

Structured3D: Indoor synthetic panoramas with pixel-level depth GT.
UE5 Synthetic Outdoor: Panoramas rendered in Unreal Engine 5, filling the outdoor gap left by Structured3D.

Unlabeled data (1.9M):

Web-collected real panoramas: Real 360-degree images scraped from the internet.
DiT360-generated panoramas: Diverse synthetic scenes generated by a text-to-image diffusion model.

Three-Stage Pseudo-Label Pipeline

Figure 3: Three-stage pseudo-label pipeline. Stage 1: Scene-Invariant Labeler initialized on synthetic data. Stage 2: Realism-Invariant Labeler selects 600K high-confidence pseudo-labels using a PatchGAN discriminator. Stage 3: Full DAP trained on all 2.1M samples.

Stage 1 — Scene-Invariant Labeler

Training data: ~18K synthetic indoor (Structured3D) + 90K synthetic outdoor (UE5), ~108K total, all with GT
Goal: Build a base model that generalizes across indoor and outdoor scene types
Result: Strong scene-type generalization, but a synthetic-to-real domain gap remains

Stage 2 — Realism-Invariant Labeler

Apply Stage 1 model to generate pseudo-labels for 1.9M unlabeled images
PatchGAN discriminator: Trained to distinguish synthetic vs. real domain predictions
Top-scoring samples selected: 300K indoor + 300K outdoor = 600K high-confidence pseudo-labels
Low-quality pseudo-labels are filtered out before they can corrupt training
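
The selection step in Stage 2 can be sketched as a per-domain top-k filter. This is only an illustration: the review does not say how the PatchGAN output is reduced to one number per image, so the scalar `score` (for example, a mean over patch-level outputs) and the tuple layout are assumptions.

```python
def select_high_confidence(samples, per_domain=300_000):
    """Keep the top-scoring pseudo-labelled samples in each domain.

    `samples` is a list of (sample_id, domain, score) tuples. `score` is a
    hypothetical scalar confidence derived from the discriminator (higher is
    better) and `domain` is "indoor" or "outdoor".
    """
    selected = []
    for domain in ("indoor", "outdoor"):
        pool = [s for s in samples if s[1] == domain]
        pool.sort(key=lambda s: s[2], reverse=True)  # best first
        selected.extend(pool[:per_domain])
    return selected

# Toy usage with one slot per domain instead of 300K.
toy = [("a", "indoor", 0.9), ("b", "indoor", 0.2),
       ("c", "outdoor", 0.4), ("d", "outdoor", 0.7)]
kept_ids = [s[0] for s in select_high_confidence(toy, per_domain=1)]
```

Selecting per domain rather than globally keeps the 300K/300K indoor-outdoor balance described above even if one domain tends to score higher.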

Stage 3 — Full DAP Training

Training data: the full ~2.1M set, i.e. the 108K labeled samples, the 600K high-confidence pseudo-labeled samples, and the rest of the unlabeled pool used in a semi-supervised fashion
Initialized from Stage 2 model, trained with large-scale semi-supervised learning
Final DAP foundation model

4. Architecture

Figure 2: DAP architecture. Three modules: DINOv3-Large backbone encoder, distortion-aware depth decoder, and plug-and-play range mask head. The range mask classifies each pixel as near or far relative to a distance threshold, guiding the metric depth prediction.

Backbone: DINOv3-Large

DINOv3-Large is used as the encoder, providing powerful pre-trained visual representations. Its global and local feature capture capabilities are well-suited for panoramic imagery’s diverse spatial scales.

Distortion-Aware Depth Decoder

A decoder that explicitly accounts for ERP geometric distortion. Pixels near the poles subtend a small solid angle in 3D space despite appearing stretched in the image; the decoder corrects for this discrepancy.

Plug-and-Play Range Mask Head

One of the key contributions. For each pixel it first answers “Is this point nearer or farther than the threshold τ?” and then feeds that classification back into the depth prediction.

Range mask M: Binary classification of pixels as near/far relative to threshold τ
Threshold τ selection: 10m / 20m / 50m / 100m were tested; 100m performs best overall (see the threshold ablation in Section 8)
Mask loss: ℒ_mask = ||M - M_gt||² + 0.5·ℒ_Dice(M, M_gt)
Attachable to existing architectures without modification
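
A minimal sketch of the mask target and the mask loss listed above, written over flat lists of pixels. Several details are assumptions: the mask polarity (1 for "near"), the use of a mean rather than a sum for the squared-error term, and the standard soft Dice formulation with a small `eps`.

```python
def range_mask(depth, tau=100.0):
    """Binary target per pixel: 1.0 if depth <= tau (near), else 0.0.

    Which side is labelled 1 is an assumption made for this sketch.
    """
    return [1.0 if d <= tau else 0.0 for d in depth]

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P.G| / (|P| + |G|)."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (total + eps)

def mask_loss(pred, target):
    """Squared error (mean-normalised here) plus 0.5 times the Dice loss."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return mse + 0.5 * dice_loss(pred, target)

# Toy usage: three pixels at 2.5 m, 80 m and 250 m with tau = 100 m.
target = range_mask([2.5, 80.0, 250.0])
perfect = mask_loss(target, target)
inverted = mask_loss([1.0 - t for t in target], target)
```

The Dice term matters mostly when one class is rare, for instance an indoor panorama in which almost no pixel lies beyond τ.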

5. Loss Functions

Six loss terms designed for panoramic imagery’s unique properties:

\[\mathcal{L}_{total} = M_{distort} \odot (\lambda_1 \mathcal{L}_{SILog} + \lambda_2 \mathcal{L}_{DF} + \lambda_3 \mathcal{L}_{grad} + \lambda_4 \mathcal{L}_{normal} + \lambda_5 \mathcal{L}_{pts} + \lambda_6 \mathcal{L}_{mask})\]

M_distort: ERP distortion weight map. Downweights pixels near the poles where distortion is severe, preventing high-distortion regions from dominating the loss.
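
The review does not give the exact form of M_distort. A common choice for ERP images is a per-row weight proportional to the cosine of latitude, which matches the solid angle each pixel covers; the sketch below uses that as an assumption, with hypothetical function names.

```python
import math

def erp_row_weights(height):
    """Per-row weight cos(latitude): near 1 at the equator, near 0 at the poles."""
    return [math.cos((0.5 - (v + 0.5) / height) * math.pi) for v in range(height)]

def weighted_mean(per_pixel_loss, height, width):
    """Average a per-pixel loss map (list of rows) with cos(latitude) weights."""
    w = erp_row_weights(height)
    num = sum(w[v] * per_pixel_loss[v][u]
              for v in range(height) for u in range(width))
    den = sum(w) * width
    return num / den

# Toy 4x2 maps: the same unit error placed in a polar row vs. an equatorial row.
polar_error = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
equator_error = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
polar_value = weighted_mean(polar_error, 4, 2)
equator_value = weighted_mean(equator_error, 4, 2)
```

With this weighting, an error in a polar row contributes less than the same error at the equator, so stretched rows do not dominate training.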

Loss	Weight	Purpose
ℒ_SILog	λ₁=1.0	Scale-Invariant Log loss. Overall depth scale alignment
ℒ_DF	λ₂=0.4	DF-Gram loss. Matches depth distribution statistics across 12 icosahedron patches (global geometric consistency)
ℒ_grad	λ₃=5.0	Gradient loss. Preserves sharpness at boundaries and surface transitions
ℒ_normal	λ₄=2.0	Normal loss. Matches surface normals derived from predicted depth to GT normals
ℒ_pts	λ₅=2.0	Point-Cloud loss. Minimizes Euclidean distance between predicted and GT 3D point clouds
ℒ_mask	λ₆=2.0	Supervised loss for the range mask head
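
As an illustration of the first row and of how the weights in the table combine, here is a sketch of a scale-invariant log loss and the weighted sum. The square-root form of SILog and the `variance_weight` value are assumptions, since the review does not state which variant DAP uses; the dictionary keys are hypothetical names.

```python
import math

def silog_loss(pred, gt, variance_weight=0.5):
    """Scale-invariant log loss: sqrt(mean(d^2) - w * mean(d)^2), d = log(pred/gt).

    With variance_weight = 1.0 a global scale error is ignored entirely.
    Assumes all depths are strictly positive.
    """
    d = [math.log(p) - math.log(g) for p, g in zip(pred, gt)]
    mean_sq = sum(x * x for x in d) / len(d)
    mean = sum(d) / len(d)
    return math.sqrt(max(mean_sq - variance_weight * mean * mean, 0.0))

# Weights as listed in the table above.
LOSS_WEIGHTS = {"silog": 1.0, "df": 0.4, "grad": 5.0,
                "normal": 2.0, "pts": 2.0, "mask": 2.0}

def total_loss(terms):
    """Weighted sum of already-computed scalar loss terms."""
    return sum(LOSS_WEIGHTS[name] * terms[name] for name in LOSS_WEIGHTS)

gt = [1.0, 2.0, 4.0]
doubled = [2.0 * g for g in gt]
scale_only_error = silog_loss(doubled, gt, variance_weight=1.0)
```

The gradient term carries the largest weight (5.0), which is consistent with the emphasis on sharp boundaries elsewhere in the paper.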

ℒ_DF (DF-Gram) detail: The sphere is partitioned into 12 icosahedron patches. For each patch k, the Gram matrix of depth values is computed and compared with the GT Gram matrix. This enforces global geometric consistency while being relatively robust to ERP distortion.

\[\mathcal{L}_{DF} = \frac{1}{N}\sum_{k}\|D_{pred}^{(k)}\odot D_{pred}^{(k)T} - D_{gt}^{(k)}\odot D_{gt}^{(k)T}\|_F^2\]
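
One way to read the Gram comparison is sketched below. It treats each patch as a flattened vector of depth values, takes the Gram matrix to be the outer product of that vector with itself, and takes N to be the number of patches. All three are assumptions for this sketch; the paper may define the patch tensor and the product differently, and the patch extraction from the icosahedron is not shown.

```python
def gram_frobenius_sq(pred, gt):
    """Squared Frobenius norm of (pred pred^T - gt gt^T) for two flat vectors."""
    return sum((pi * pj - gi * gj) ** 2
               for pi, gi in zip(pred, gt)
               for pj, gj in zip(pred, gt))

def df_gram_loss(pred_patches, gt_patches):
    """Average the per-patch Gram differences over all patches."""
    n = len(pred_patches)
    return sum(gram_frobenius_sq(p, g)
               for p, g in zip(pred_patches, gt_patches)) / n

# Toy usage with a single two-pixel patch.
loss_value = df_gram_loss([[1.0, 2.0]], [[1.0, 1.0]])
```

Because the Gram matrix compares every pair of pixels within a patch, the term penalises errors in relative structure across the patch and not only per-pixel differences.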

6. Training Configuration

Setting	Value
Input resolution	512×1024 (ERP)
Optimizer	Adam
Backbone learning rate	5e-6
Decoder learning rate	5e-5
Hardware	NVIDIA H20 GPUs
Data augmentation	Color jittering, horizontal translation, horizontal flipping
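
The two geometric augmentations suit ERP images because the left and right borders meet on the sphere. The sketch below assumes that "horizontal translation" means a circular column shift (equivalent to rotating the camera about the vertical axis); the review does not confirm this, and the helper names are hypothetical. Images are plain lists of rows for brevity.

```python
def roll_columns(image, shift):
    """Circularly shift every row to the right by `shift` pixels."""
    width = len(image[0])
    s = shift % width
    return [row[-s:] + row[:-s] if s else list(row) for row in image]

def flip_horizontal(image):
    """Mirror every row left to right."""
    return [row[::-1] for row in image]

def augment_pair(rgb, depth, shift, flip):
    """Apply the same geometric transform to an image and its depth map."""
    rgb, depth = roll_columns(rgb, shift), roll_columns(depth, shift)
    if flip:
        rgb, depth = flip_horizontal(rgb), flip_horizontal(depth)
    return rgb, depth

# Toy usage on a single-row "image" and matching depth map.
rgb_aug, depth_aug = augment_pair([[0, 1, 2, 3]], [[10, 11, 12, 13]], shift=1, flip=True)
```

Neither transform changes any depth value, so the metric labels stay valid; colour jittering touches only the image and leaves the depth map as is.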

7. Experimental Results

Zero-Shot Benchmarks (Table 3)

Table 3: Zero-shot benchmark results. DAP outperforms all prior methods on Stanford2D3D (indoor), Matterport3D (indoor), and Deep360 (outdoor). Both adapted pinhole models (DepthAnything V2, UniDepth) and panorama-specific methods fall well short of DAP.

Stanford2D3D (indoor, not seen during training):

Method	AbsRel ↓	RMSE ↓	δ₁ ↑
DepthAnything V2	0.1822	0.7340	0.7691
UniDepth	0.1654	0.6893	0.8012
PanDA	0.1135	0.4210	0.8901
DAP	0.0921	0.3820	0.9135

Matterport3D (indoor):

Method	AbsRel ↓	RMSE ↓	δ₁ ↑
DAP	0.1186	0.7510	0.8518

Deep360 (outdoor):

Method	AbsRel ↓	RMSE ↓	δ₁ ↑
DAP	0.0659	5.224	0.9525
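
For reference, the three metrics in these tables follow the standard definitions sketched below. Benchmark-specific details such as valid-pixel masks, depth caps, or any alignment step are not described in this review, so the simple `gt > 0` validity check is an assumption.

```python
import math

def depth_metrics(pred, gt):
    """Return (AbsRel, RMSE, delta1) over pixels with valid ground truth.

    AbsRel: mean(|pred - gt| / gt)
    RMSE:   sqrt(mean((pred - gt)^2))
    delta1: fraction of pixels with max(pred/gt, gt/pred) < 1.25
    Assumes predictions are strictly positive.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / n
    return abs_rel, rmse, delta1

# Toy usage: one exact pixel, one within 10%, one off by a factor of two.
abs_rel, rmse, delta1 = depth_metrics([1.0, 2.0, 4.0], [1.0, 2.2, 2.0])
```

RMSE is in metres and therefore grows with scene scale, which is why the outdoor Deep360 value is much larger than the indoor ones even though AbsRel is lower.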

DAP-Test Benchmark (Table 4)

Table 4: DAP-Test benchmark results. On the DAP-curated test set covering both indoor and outdoor scenes, DAP reduces AbsRel by about 69% and RMSE by about 36%, and raises δ₁ by about 54% (relative), compared with Unik3D, the strongest baseline on AbsRel and δ₁. DAC has the lower baseline RMSE (8.799); against it, DAP's RMSE reduction is about 23%.

Method	AbsRel ↓	RMSE ↓	δ₁ ↑
DAC	0.3197	8.799	0.5193
Unik3D	0.2517	10.56	0.6086
DAP	0.0781	6.804	0.9370

DAP vs. Unik3D:

AbsRel: 0.2517 → 0.0781 (69.0% reduction)
RMSE: 10.56 → 6.804 (35.6% reduction)
δ₁: 0.6086 → 0.9370 (54.0% relative improvement, +0.3284 absolute)
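
The relative changes can be recomputed directly from the table entries. The snippet below is a small check; the dictionary names are illustrative, and results computed from rounded table values can differ in the last digit from percentages derived from unrounded numbers.

```python
def relative_change(baseline, value):
    """Signed change of `value` relative to `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline

# Values copied from the DAP-Test table above.
unik3d = {"AbsRel": 0.2517, "RMSE": 10.56, "delta1": 0.6086}
dap = {"AbsRel": 0.0781, "RMSE": 6.804, "delta1": 0.9370}

changes = {name: relative_change(unik3d[name], dap[name]) for name in unik3d}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

Negative values are reductions (better for AbsRel and RMSE); the positive δ₁ value is a relative gain, not a difference in percentage points.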

Qualitative Results

Figure 4: Qualitative comparison. DAP (rightmost) significantly outperforms DAC and Unik3D in boundary sharpness, distant region accuracy, and global geometric consistency. Even at panorama edges and near the poles, DAP maintains stable depth predictions.

Figure 5: DAP depth estimates on diverse real-world panoramas. Consistent metric depth across urban streets, indoor spaces, natural environments, and complex structures. The range mask head automatically adapts to each scene's depth range.

8. Ablation Studies

Component Analysis (Table 5)

Table 5: Component ablation. Distortion-awareness, geometric consistency, and sharpness are added incrementally. Each step improves every metric on both Stanford2D3D and Deep360; the full model lowers Stanford2D3D AbsRel from 0.1166 to 0.1084 and raises Deep360 δ₁ from 0.8396 to 0.8719.

Distortion	Geometry	Sharpness	Stanford AbsRel ↓	Stanford δ₁ ↑	Deep360 AbsRel ↓	Deep360 δ₁ ↑
✗	✗	✗	0.1166	0.8409	0.0942	0.8396
✓	✗	✗	0.1149	0.8440	0.0926	0.8423
✓	✓	✗	0.1112	0.8509	0.0880	0.8592
✓	✓	✓	0.1084	0.8576	0.0862	0.8719

Each component’s role:

Distortion-awareness: M_distort weight map reduces loss contribution of high-distortion pole pixels
Geometric consistency: ℒ_DF, ℒ_normal, ℒ_pts losses improve 3D structural accuracy
Sharpness: ℒ_grad loss enforces sharp object boundaries and surface transitions

Range Mask Threshold Analysis (Table 6)

Table 6: Range mask threshold ablation. Among thresholds of 10m, 20m, 50m, 100m, and no mask, 100m gives the best DAP-2M AbsRel and δ₁ and the best Deep360 δ₁; 50m has a slightly lower Deep360 AbsRel (0.0843 vs. 0.0862). Removing the mask (None) is worse than 100m on every metric and gives the lowest δ₁ on both datasets.

Threshold τ	DAP-2M AbsRel ↓	DAP-2M δ₁ ↑	Deep360 AbsRel ↓	Deep360 δ₁ ↑
10m	0.0801	0.9315	0.0934	0.8493
20m	0.0823	0.9164	0.0873	0.8668
50m	0.0864	0.9104	0.0843	0.8594
100m	0.0793	0.9353	0.0862	0.8719
None	0.0832	0.9042	0.0938	0.8411

Why 100m? In this ablation, 100m acts as the practical boundary between indoor/near-urban scenes and large-scale outdoor environments. Lower thresholds (10m, 20m) misclassify mid-range outdoor scenes. The no-mask variant underperforms 100m on every metric, which supports keeping the range classification in the model.

9. Comparison with Prior Work

Aspect	Pinhole Depth Models (DAV2, etc.)	Panorama-Specific (PanDA, etc.)	DAP
Geometric distortion handling	✗ (pinhole assumption)	✓	✓ (distortion-aware)
Metric depth	△ (relative)	△	✓ (absolute metric)
Indoor + outdoor	△	✗ (mostly indoor)	✓
Training data scale	Large (pinhole)	Small	2M+ panoramas
Zero-shot generalization	Low	Medium	High
Range adaptation	✗	✗	✓ (mask head)

10. Limitations

Pseudo-label error propagation: The three-stage pipeline depends on Stage 1 quality. Errors in the initial labeler can accumulate into later stages.
Extreme pole distortion: M_distort mitigates but does not fully resolve severe distortion at the poles of ERP images.
High-resolution cost: Designed for 512×1024 input. Processing higher-resolution panoramas increases memory and latency substantially.
Outdoor GT scarcity: All outdoor labeled training data comes from UE5 synthetic rendering. Validation against real LiDAR GT for outdoor scenes is limited.

11. Summary

DAP is the first work to successfully apply the foundation model paradigm to panoramic depth estimation. The core message:

“Data scale + domain-bridging pipeline + panorama-aware design = metric depth for any panorama”

Where pinhole depth models fail on panoramas due to geometric incompatibility, and where prior panorama-specific models overfit to indoor ranges, DAP overcomes both with a 2M+ diverse dataset, a staged pseudo-label pipeline, and a range mask head that elegantly handles the indoor-outdoor depth range discrepancy.

It is also noteworthy that this work comes from Insta360 — a 360-degree camera hardware company — signaling a clear path from research to real-world product deployment.