EHAK

Nagahama Mangetsu Gwangalli Branch - Busan Ramen Spot

2026-06-09T00:00:00+00:00

Hello, this is Ehak.
Today's post is a review of a ramen spot I visited last weekend: [ Nagahama Mangetsu Gwangalli Branch ].

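Info

For the exact location, opening hours, and other details, please refer to the Naver Map above. :)

First off, Nagahama Mangetsu is well known as a Busan ramen shop selected for the Michelin Bib Gourmand! Anyone searching for reviews right now probably knows that already, haha.
The Haeundae branch is the original, and I had heard the wait was enormous back when it was the only location, so it stayed on my someday list.
Recently, though, a second location opened in Gwangalli, reportedly with shorter waits than the original, so my family finally made up our minds to go.

Waitlisting and reservations are handled through CatchTable.
The shop opens at 11 a.m., but on-site CatchTable registration is available from 10 a.m. From the 11 a.m. opening, you can reportedly register online as well.
We went on a Saturday and figured that if we were doing an open run anyway, we might as well register on site. We arrived around 9:40 and were first in line, which felt great. Once you are registered you can just stroll along Gwangalli Beach for a while and come back, haha.
By the time we were seated at 11, about 15 more parties were behind us, and turnover is fairly quick, so the wait at the Gwangalli branch seemed quite manageable if you arrive for opening!!

There is no dedicated parking in front of the shop, but as far as I know there are several lots nearby, so you can park and walk a short way.
We parked at a nearby lot too.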

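Table and Service

In front of each seat there is a card covering the shop's history, how to eat the ramen, and tips.
I will come back to this below, but it lists the noodle firmness levels along with plenty of handy tips, such as the kelp vinegar, the special spicy sauce, the garlic topping, and how to mix the gyoza dipping sauce, so it is worth reading while you wait for your ramen.

The sauces and toppings to go with the ramen are neatly set out.

As soon as you sit down, they hand you an apron, and eyeglass wipes if you wear glasses. Even if you end up not using them, small touches like this make a good impression.

The interior is all counter seating, yet it is really spacious and noticeably clean. It feels far more comfortable than most restaurants of a similar size, and that kept the whole meal relaxed.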

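Menu

This is Nagahama Mangetsu's main dish, the tonkotsu “Nagahama Ramen”.
A single bowl was 11,000 won, the ramen + gyoza set 18,000 won, extra chashu +4,000 won, and extra noodles +3,000 won. Even the single bowl feels inexpensive for a place people queue for.
The photos show the bowl with extra chashu. As the second photo shows, the chashu slices are really big, and they are delicious.
The shop's name is even engraved on the wooden tray the ramen is served on, haha.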

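A few more notes about the ramen:

When you first enter, you order at the kiosk and get a printed ticket.
Then you sit where you are shown and hand over the ticket, and the chef asks which noodle firmness you want. (I really liked that they ask first, rather than you having to speak up to change it.)
The extra noodles were honestly innovative(?). When you order extra noodles, they do not put more noodles in from the start or hand you noodles on the side. Instead, you eat your first bowl, tell the chef once you have finished the noodles, and they serve a whole second bowl with the extra noodles in fresh broth! (Adding more noodles to the original broth would inevitably change the taste, so I think this is a great approach. Getting a fresh bowl also just feels nice as a customer, haha.)

I went with the default (firm) noodles. I have not tried the other levels, but the default is a safe, solid choice. The noodles are thin but springy, and they suit the hearty tonkotsu broth. About halfway through I added a little of the special spicy sauce; it is not very hot, so try adding a bit partway through!

Nagahama Mangetsu serves two kinds of ramen, the tonkotsu above and yaki ramen, but tonkotsu is the main one. (I do not know much about the yaki ramen myself, but other reviews also treat tonkotsu as the main dish!)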

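There are only two gyoza in the photo, but the order of course comes as a longer row. We took the photo after splitting them among the family, haha.
We ordered one ramen + gyoza set and two single bowls. If you want gyoza, ordering the set is 1,000 won cheaper! It is worth double-checking at the kiosk on site just in case.
I normally do not care much for gyoza (at ramen shops that offer both gyoza and karaage as sides, I usually order the karaage), but these changed my mind, lol. The crispy-yet-soft texture is spot on. Definitely crispy, and yet soft... If it is your first visit, I recommend ordering the gyoza as a side.

The rice is not plain: as in the photo, it comes topped with meat, like the mini chashu rice bowls you find at most ramen shops.
At 1,500 won it is a real bargain.
The rice itself is unseasoned, which makes it perfect for mixing into the tonkotsu broth! If you are torn between ‘ramen with extra noodles’ and ‘ramen + rice’, I personally think ‘ramen + rice’ is the best combination. Extra noodles are good, but I think rice is a must after ramen, haha.

Partway through the meal, they bring a cream cheese dessert on the house. It was a nice palate cleanser at the end; although it is cheese, it tasted like milk pudding to me, and I enjoyed it.
They also sell it to take away. Personally I would not buy it separately, but it was a tasty dessert, haha.

That wraps up my review of Nagahama Mangetsu Gwangalli Branch, a place worth trying at least once even if you have to wait.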

[Paper Review] Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation (CVPR 2026)

2026-06-03T00:00:00+00:00

Paper: Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
Venue: CVPR 2026
Authors: Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, Lu Qi
Affiliations: Insta360 Research Team, Wuhan University, UC Merced, UC San Diego
arXiv: 2512.16913
GitHub: Insta360-Research-Team/DAP

One-Line Summary

The first foundation model for panoramic depth estimation. Combines a 2M+ panoramic dataset, a three-stage pseudo-label pipeline, and a plug-and-play range mask head to achieve metric consistency across diverse indoor and outdoor scenes.

Figure 1: DAP overview. Precise metric depth is predicted from diverse indoor and outdoor panoramas regardless of scene distance. Consistent depth estimation is demonstrated from close-range indoor spaces to outdoor scenes spanning hundreds of meters.

1. Background and Problem Statement

What is Panoramic Depth Estimation?

Panoramic depth estimation is the task of predicting per-pixel depth values from 360-degree (omnidirectional) images. Unlike standard pinhole cameras, these images are represented in Equirectangular Projection (ERP) format, which introduces severe distortion near the poles and discontinuities at the image boundaries.

Panoramic depth estimation is a key building block for VR/AR, indoor navigation, spatial understanding, and 360-degree 3D reconstruction.

Limitations of Existing Methods

Applying general-purpose monocular depth models (DepthAnything V2, UniDepth, etc.) to panoramic images leads to a significant performance drop. Three fundamental reasons:

① Poor handling of geometric distortion: ERP’s nonlinear distortion — especially near the poles — is incompatible with pinhole camera assumptions, causing accumulated depth prediction errors.

② Distance range mismatch: No existing metric depth model handles both indoor (a few meters) and outdoor (hundreds of meters) ranges simultaneously. Existing panorama-specific models overfit to a particular range.

③ Data scarcity: Ground-truth per-pixel depth for panoramic images is extremely expensive to collect. No large-scale training dataset existed, so no prior work had achieved foundation-model-level generalization.

2. Core Ideas

Three contributions working in concert.

① Large-Scale Panoramic Dataset (DAP-2M)

A 2M+ sample panoramic dataset is constructed. Rather than collecting labeled data alone, a pseudo-label pipeline converts large volumes of unlabeled web images and generated images into training data, enabling scale.

② Three-Stage Pseudo-Label Pipeline

A training pipeline that progressively bridges the synthetic-to-real domain gap and the indoor-to-outdoor distance range gap. A PatchGAN-based discriminator filters out low-confidence pseudo-labels, ensuring only high-quality samples enter training.

③ Plug-and-Play Range Mask Head

To handle the large difference in depth range between indoor and outdoor scenes, an auxiliary head predicts a binary range mask that classifies each pixel by a distance threshold. This classification informs the metric depth prediction. The head is independently attachable to existing architectures — hence plug-and-play.

3. Dataset Construction: DAP-2M

Data Composition

Table 2: DAP-2M dataset composition. 18,298 labeled samples from Structured3D, 90K labeled synthetic outdoor samples from UE5, and 1.9M unlabeled web-collected/DiT360-generated panoramas. Total: ~2.1M panoramic samples.

Source	Domain	GT Label	Samples
Structured3D	Indoor (synthetic)	✓	18,298
UE5 Synthetic (DAP-2M-Labeled)	Outdoor (synthetic)	✓	~90,000
Web / DiT360 (DAP-2M-Unlabeled)	Indoor + Outdoor	✗	~1.9M
Total			~2.1M

Labeled data (108K):

Structured3D: Indoor synthetic panoramas with pixel-level depth GT.
UE5 Synthetic Outdoor: Panoramas rendered in Unreal Engine 5, filling the outdoor gap left by Structured3D.

Unlabeled data (1.9M):

Web-collected real panoramas: Real 360-degree images scraped from the internet.
DiT360-generated panoramas: Diverse synthetic scenes generated by a text-to-image diffusion model.

Three-Stage Pseudo-Label Pipeline

Figure 3: Three-stage pseudo-label pipeline. Stage 1: Scene-Invariant Labeler initialized on synthetic data. Stage 2: Realism-Invariant Labeler selects 600K high-confidence pseudo-labels using a PatchGAN discriminator. Stage 3: Full DAP trained on all 2.1M samples.

Stage 1 — Scene-Invariant Labeler

Training data: about 20K synthetic indoor (the 18,298 Structured3D samples) + 90K synthetic outdoor UE5 (about 108K total, all with GT)
Goal: Build a base model that generalizes across indoor and outdoor scene types
Result: Strong scene-type generalization, but a synthetic-to-real domain gap remains

Stage 2 — Realism-Invariant Labeler

Apply Stage 1 model to generate pseudo-labels for 1.9M unlabeled images
PatchGAN discriminator: Trained to distinguish synthetic vs. real domain predictions
Top-scoring samples selected: 300K indoor + 300K outdoor = 600K high-confidence pseudo-labels
Low-quality pseudo-labels are filtered out before they can corrupt training
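
The selection step in Stage 2 can be pictured as a per-domain top-k filter. The sketch below is only an illustration under stated assumptions: the tuple layout, the function name `select_top_k_per_domain`, and the convention that a higher discriminator score means a more trustworthy pseudo-label are all hypothetical, not taken from the paper's code.

```python
from typing import Dict, List, Tuple

# (sample_id, domain, discriminator_score); this layout is an assumption.
Sample = Tuple[str, str, float]


def select_top_k_per_domain(samples: List[Sample], k_per_domain: int) -> Dict[str, List[str]]:
    """Keep the k highest-scoring pseudo-labeled samples in each domain."""
    by_domain: Dict[str, List[Tuple[float, str]]] = {}
    for sample_id, domain, score in samples:
        by_domain.setdefault(domain, []).append((score, sample_id))

    selected: Dict[str, List[str]] = {}
    for domain, scored in by_domain.items():
        scored.sort(reverse=True)  # highest score first
        selected[domain] = [sample_id for _, sample_id in scored[:k_per_domain]]
    return selected


# Toy data; the review reports k = 300K per domain in the real pipeline.
toy_samples = [
    ("a", "indoor", 0.9),
    ("b", "indoor", 0.2),
    ("c", "outdoor", 0.7),
    ("d", "outdoor", 0.8),
    ("e", "indoor", 0.5),
]
selected = select_top_k_per_domain(toy_samples, k_per_domain=2)
print(selected)
```

Selecting per domain rather than globally keeps the indoor/outdoor balance (300K + 300K) that the review describes.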

Stage 3 — Full DAP Training

Training data: 108K labeled + 600K pseudo-labeled + the remaining unlabeled panoramas used in a semi-supervised fashion, about 2.1M in total
Initialized from Stage 2 model, trained with large-scale semi-supervised learning
Final DAP foundation model

4. Architecture

Figure 2: DAP architecture. Three modules: DINOv3-Large backbone encoder, distortion-aware depth decoder, and plug-and-play range mask head. The range mask classifies each scene's depth range, guiding the metric depth prediction.

Backbone: DINOv3-Large

DINOv3-Large is used as the encoder, providing powerful pre-trained visual representations. Its global and local feature capture capabilities are well-suited for panoramic imagery’s diverse spatial scales.

Distortion-Aware Depth Decoder

A decoder that explicitly accounts for ERP geometric distortion. Pixels near the poles subtend a small solid angle in 3D space despite appearing stretched in the image; the decoder corrects for this discrepancy.

Plug-and-Play Range Mask Head

One of the key contributions. It first answers: “What is the depth range of this scene?” — then feeds that classification back into the depth prediction.

Range mask M: Binary classification of pixels as near/far relative to threshold τ
Threshold τ selection: Experiments with 10m / 20m / 50m / 100m — 100m is optimal
Mask loss: ℒ_mask = ||M - M_gt||² + 0.5·ℒ_Dice(M, M_gt)
Attachable to existing architectures without modification
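
As a rough illustration of the mask loss above, here is a minimal NumPy sketch. Several details are assumptions rather than facts from the paper: the squared-error term is averaged instead of summed, the Dice loss uses a small smoothing constant `eps`, and pixels at or below τ are labeled 1 (the review does not say which side of the threshold is the positive class).

```python
import numpy as np


def range_mask_gt(depth_m: np.ndarray, tau: float = 100.0) -> np.ndarray:
    """Binary GT mask: 1 where depth <= tau (labeling convention is an assumption)."""
    return (depth_m <= tau).astype(np.float32)


def dice_loss(pred: np.ndarray, target: np.ndarray, eps: float = 1e-6) -> float:
    """Soft Dice loss: 1 - 2|A∩B| / (|A| + |B|), with smoothing eps."""
    intersection = np.sum(pred * target)
    return float(1.0 - (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps))


def range_mask_loss(pred: np.ndarray, target: np.ndarray, dice_weight: float = 0.5) -> float:
    """Squared error (mean-reduced here) plus 0.5 * Dice, following the formula above."""
    l2 = float(np.mean((pred - target) ** 2))
    return l2 + dice_weight * dice_loss(pred, target)


depth = np.array([[2.0, 50.0], [150.0, 300.0]])  # toy depths in meters
gt_mask = range_mask_gt(depth, tau=100.0)
pred_mask = np.array([[0.9, 0.8], [0.3, 0.1]], dtype=np.float32)
print(range_mask_loss(pred_mask, gt_mask))
```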

5. Loss Functions

Six weighted loss terms designed for panoramic imagery’s unique properties (five depth and geometry terms plus the range mask loss):

\[\mathcal{L}_{total} = M_{distort} \odot (\lambda_1 \mathcal{L}_{SILog} + \lambda_2 \mathcal{L}_{DF} + \lambda_3 \mathcal{L}_{grad} + \lambda_4 \mathcal{L}_{normal} + \lambda_5 \mathcal{L}_{pts} + \lambda_6 \mathcal{L}_{mask})\]

M_distort: ERP distortion weight map. Downweights pixels near the poles where distortion is severe, preventing high-distortion regions from dominating the loss.
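
The review does not give the exact form of M_distort. A common choice for ERP weighting, used here purely as an assumption, is the cosine of each row's latitude, which is proportional to the solid angle an ERP pixel covers. The function name `erp_distortion_weights` is hypothetical.

```python
import numpy as np


def erp_distortion_weights(height: int, width: int) -> np.ndarray:
    """Per-pixel weights proportional to cos(latitude) for an ERP image."""
    # Latitude of each row center, from just below +pi/2 (top) to just above -pi/2 (bottom).
    row_centers = (np.arange(height) + 0.5) / height
    latitude = (0.5 - row_centers) * np.pi
    # cos(latitude) is close to 1 at the equator and approaches 0 at the poles.
    row_weight = np.cos(latitude)
    return np.repeat(row_weight[:, None], width, axis=1)


weights = erp_distortion_weights(512, 1024)
print(weights[0, 0], weights[256, 0])  # near-pole weight vs. near-equator weight
```

Multiplying a per-pixel loss map by such weights keeps the heavily stretched polar rows from dominating the total.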

Loss	Weight	Purpose
ℒ_SILog	λ₁=1.0	Scale-Invariant Log loss. Overall depth scale alignment
ℒ_DF	λ₂=0.4	DF-Gram loss. Matches depth distribution statistics across 12 icosahedron patches (global geometric consistency)
ℒ_grad	λ₃=5.0	Gradient loss. Preserves sharpness at boundaries and surface transitions
ℒ_normal	λ₄=2.0	Normal loss. Matches surface normals derived from predicted depth to GT normals
ℒ_pts	λ₅=2.0	Point-Cloud loss. Minimizes Euclidean distance between predicted and GT 3D point clouds
ℒ_mask	λ₆=2.0	Supervised loss for the range mask head
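
To make the point-cloud term more concrete, the sketch below back-projects an ERP depth map to 3D points and averages the Euclidean distance between predicted and GT points. It rests on assumptions not confirmed by the review: depth is treated as distance along the viewing ray, the axis convention (y up, z forward) is arbitrary, and the reduction is a plain mean.

```python
import numpy as np


def erp_depth_to_points(depth: np.ndarray) -> np.ndarray:
    """Back-project an (H, W) ERP depth map to (H, W, 3) points."""
    h, w = depth.shape
    lon = ((np.arange(w) + 0.5) / w - 0.5) * 2.0 * np.pi  # longitude in [-pi, pi)
    lat = (0.5 - (np.arange(h) + 0.5) / h) * np.pi        # latitude, top row near +pi/2
    lon, lat = np.meshgrid(lon, lat)                      # both (H, W)
    directions = np.stack(
        [np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)],
        axis=-1,
    )  # unit vectors on the sphere
    return directions * depth[..., None]


def point_cloud_loss(pred_depth: np.ndarray, gt_depth: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and GT 3D points."""
    diff = erp_depth_to_points(pred_depth) - erp_depth_to_points(gt_depth)
    return float(np.mean(np.linalg.norm(diff, axis=-1)))


gt = np.full((4, 8), 2.0)
points = erp_depth_to_points(gt)
print(point_cloud_loss(gt + 0.5, gt))
```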

ℒ_DF (DF-Gram) detail: The sphere is partitioned into 12 icosahedron patches. For each patch k, the Gram matrix of depth values is computed and compared with the GT Gram matrix. This enforces global geometric consistency while being relatively robust to ERP distortion.

\[\mathcal{L}_{DF} = \frac{1}{N}\sum_{k}\|D_{pred}^{(k)}\odot D_{pred}^{(k)T} - D_{gt}^{(k)}\odot D_{gt}^{(k)T}\|_F^2\]

6. Training Configuration

Setting	Value
Input resolution	512×1024 (ERP)
Optimizer	Adam
Backbone learning rate	5e-6
Decoder learning rate	5e-5
Hardware	NVIDIA H20 GPUs
Data augmentation	Color jittering, horizontal translation, horizontal flipping
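
For ERP images, horizontal translation is naturally a wrap-around shift, since the left and right edges are adjacent on the sphere. The sketch below assumes that interpretation (the review only lists "horizontal translation") and applies the same transform to image and depth; `augment_erp` is a hypothetical helper name.

```python
import numpy as np


def augment_erp(image: np.ndarray, depth: np.ndarray, shift_px: int, flip: bool):
    """Apply a circular horizontal shift and optional left-right flip to both inputs."""
    # Circular shift along the width axis is equivalent to rotating the camera in yaw.
    image = np.roll(image, shift_px, axis=1)
    depth = np.roll(depth, shift_px, axis=1)
    if flip:
        image = image[:, ::-1]
        depth = depth[:, ::-1]
    return image, depth


image = np.arange(2 * 4 * 3).reshape(2, 4, 3)        # toy (H, W, C) image
depth = np.arange(2 * 4, dtype=float).reshape(2, 4)  # toy (H, W) depth
shifted_image, shifted_depth = augment_erp(image, depth, shift_px=1, flip=False)
print(shifted_depth[0])
```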

7. Experimental Results

Zero-Shot Benchmarks (Table 3)

Table 3: Zero-shot benchmark results. DAP outperforms all prior methods on Stanford2D3D (indoor), Matterport3D (indoor), and Deep360 (outdoor). Both adapted pinhole models (DepthAnything V2, UniDepth) and panorama-specific methods fall well short of DAP.

Stanford2D3D (indoor, not seen during training):

Method	AbsRel ↓	RMSE ↓	δ₁ ↑
DepthAnything V2	0.1822	0.7340	0.7691
UniDepth	0.1654	0.6893	0.8012
PanDA	0.1135	0.4210	0.8901
DAP	0.0921	0.3820	0.9135
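
For reference, the three metrics in these tables have standard definitions: AbsRel is the mean of |pred - gt| / gt, RMSE is the root mean squared error, and δ₁ is the fraction of pixels with max(pred/gt, gt/pred) < 1.25. A minimal sketch follows; the masking of invalid pixels with `gt > 0` is an assumption, and the review does not state whether any scale alignment is applied before evaluation.

```python
import numpy as np


def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """AbsRel, RMSE, and delta_1 over pixels with valid (positive) GT depth."""
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)  # requires pred > 0
    delta1 = np.mean(ratio < 1.25)
    return {"abs_rel": float(abs_rel), "rmse": float(rmse), "delta1": float(delta1)}


gt_depth = np.array([1.0, 2.0, 4.0, 0.0])    # last pixel has no valid GT
pred_depth = np.array([1.1, 2.0, 6.0, 3.0])
metrics = depth_metrics(pred_depth, gt_depth)
print(metrics)
```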

Matterport3D (indoor):

Method	AbsRel ↓	RMSE ↓	δ₁ ↑
DAP	0.1186	0.7510	0.8518

Deep360 (outdoor):

Method	AbsRel ↓	RMSE ↓	δ₁ ↑
DAP	0.0659	5.224	0.9525

DAP-Test Benchmark (Table 4)

Table 4: DAP-Test benchmark results. On the DAP-curated test set covering both indoor and outdoor scenes, DAP achieves a 69% AbsRel reduction, a 35.6% RMSE reduction, and a 54.0% relative δ₁ increase over the previous best (Unik3D).

Method	AbsRel ↓	RMSE ↓	δ₁ ↑
DAC	0.3197	8.799	0.5193
Unik3D	0.2517	10.56	0.6086
DAP	0.0781	6.804	0.9370

DAP vs. Unik3D:

AbsRel: 0.2517 → 0.0781 (69% reduction)
RMSE: 10.56 → 6.804 (35.6% reduction)
δ₁: 0.6086 → 0.9370 (54.0% relative improvement)
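
The relative changes can be recomputed directly from the Table 4 values. This is just arithmetic on the rounded table entries, so the last digit may differ from figures derived from unrounded results.

```python
def relative_change_percent(baseline: float, new: float) -> float:
    """Signed relative change of `new` with respect to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


unik3d = {"AbsRel": 0.2517, "RMSE": 10.56, "delta1": 0.6086}
dap = {"AbsRel": 0.0781, "RMSE": 6.804, "delta1": 0.9370}

changes = {name: relative_change_percent(unik3d[name], dap[name]) for name in unik3d}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

With the table values this gives roughly -69.0% for AbsRel, -35.6% for RMSE, and +54.0% for δ₁.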

Qualitative Results

Figure 4: Qualitative comparison. DAP (rightmost) significantly outperforms DAC and Unik3D in boundary sharpness, distant region accuracy, and global geometric consistency. Even at panorama edges and near the poles, DAP maintains stable depth predictions.

Figure 5: DAP depth estimates on diverse real-world panoramas. Consistent metric depth across urban streets, indoor spaces, natural environments, and complex structures. The range mask head automatically adapts to each scene's depth range.

8. Ablation Studies

Component Analysis (Table 5)

Table 5: Component ablation. Distortion-awareness, geometric consistency, and sharpness are added incrementally. All three components contribute meaningfully on both Stanford2D3D and Deep360.

Distortion	Geometry	Sharpness	Stanford AbsRel ↓	Stanford δ₁ ↑	Deep360 AbsRel ↓	Deep360 δ₁ ↑
✗	✗	✗	0.1166	0.8409	0.0942	0.8396
✓	✗	✗	0.1149	0.8440	0.0926	0.8423
✓	✓	✗	0.1112	0.8509	0.0880	0.8592
✓	✓	✓	0.1084	0.8576	0.0862	0.8719

Each component’s role:

Distortion-awareness: M_distort weight map reduces loss contribution of high-distortion pole pixels
Geometric consistency: ℒ_DF, ℒ_normal, ℒ_pts losses improve 3D structural accuracy
Sharpness: ℒ_grad loss enforces sharp object boundaries and surface transitions

Range Mask Threshold Analysis (Table 6)

Table 6: Range mask threshold ablation. Among thresholds of 10m, 20m, 50m, 100m, and no mask, 100m gives the best AbsRel and δ₁ on DAP-2M and the best δ₁ on Deep360; only the Deep360 AbsRel is slightly lower at 50m (0.0843 vs. 0.0862). Removing the mask (None) is worse than 100m on every metric.

Threshold τ	DAP-2M AbsRel ↓	DAP-2M δ₁ ↑	Deep360 AbsRel ↓	Deep360 δ₁ ↑
10m	0.0801	0.9315	0.0934	0.8493
20m	0.0823	0.9164	0.0873	0.8668
50m	0.0864	0.9104	0.0843	0.8594
100m	0.0793	0.9353	0.0862	0.8719
None	0.0832	0.9042	0.0938	0.8411

Why 100m? The ablation suggests that 100m is a practical boundary separating indoor/near-urban scenes from large-scale outdoor environments. Lower thresholds (10m, 20m) misclassify mid-range outdoor scenes, and the no-mask variant underperforms 100m on every metric in Table 6, which supports range classification as a necessary component.

9. Comparison with Prior Work

Aspect	Pinhole Depth Models (DAV2, etc.)	Panorama-Specific (PanDA, etc.)	DAP
Geometric distortion handling	✗ (pinhole assumption)	✓	✓ (distortion-aware)
Metric depth	△ (relative)	△	✓ (absolute metric)
Indoor + outdoor	△	✗ (mostly indoor)	✓
Training data scale	Large (pinhole)	Small	2M+ panoramas
Zero-shot generalization	Low	Medium	High
Range adaptation	✗	✗	✓ (mask head)

10. Limitations

Pseudo-label error propagation: The three-stage pipeline depends on Stage 1 quality. Errors in the initial labeler can accumulate into later stages.
Extreme pole distortion: M_distort mitigates but does not fully resolve severe distortion at the poles of ERP images.
High-resolution cost: Designed for 512×1024 input. Processing higher-resolution panoramas increases memory and latency substantially.
Outdoor GT scarcity: All outdoor labeled training data comes from UE5 synthetic rendering. Validation against real LiDAR GT for outdoor scenes is limited.

11. Summary

DAP is the first work to successfully apply the foundation model paradigm to panoramic depth estimation. The core message:

“Data scale + domain-bridging pipeline + panorama-aware design = metric depth for any panorama”

Where pinhole depth models fail on panoramas due to geometric incompatibility, and where prior panorama-specific models overfit to indoor ranges, DAP overcomes both with a 2M+ diverse dataset, a staged pseudo-label pipeline, and a range mask head that elegantly handles the indoor-outdoor depth range discrepancy.

It is also noteworthy that this work comes from Insta360 — a 360-degree camera hardware company — signaling a clear path from research to real-world product deployment.

[Paper Review] Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation (CVPR 2026)

2026-06-03T00:00:00+00:00

Paper: Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
Venue: CVPR 2026
Authors: Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, Lu Qi
Affiliations: Insta360 Research Team, Wuhan University, UC Merced, UC San Diego
arXiv: 2512.16913
GitHub: Insta360-Research-Team/DAP

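One-Line Summary

The first foundation model for panoramic depth estimation. It combines a 2M+ panoramic dataset, a three-stage pseudo-label pipeline, and a plug-and-play range mask head to achieve metric consistency across diverse indoor and outdoor distance ranges.

Figure 1: DAP overview. Precise metric depth is predicted for diverse indoor and outdoor panoramas regardless of distance range, with consistent estimates from close-range indoor spaces to outdoor scenes spanning hundreds of meters or more.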

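1. Background and Problem Definition

What Is Panoramic Depth Estimation?

It is the task of predicting per-pixel depth from omnidirectional images captured by 360-degree cameras. Unlike standard pinhole images, these are represented in Equirectangular Projection (ERP) format, which causes severe distortion near the poles and discontinuities at the image boundaries.

Panoramic depth estimation plays a key role in VR/AR, indoor navigation, spatial understanding, 360-degree 3D reconstruction, and other areas.

Limitations of Existing Methods

General pinhole-camera depth models (DepthAnything V2, UniDepth, etc.) lose a lot of accuracy when applied directly to panoramic images. There are three fundamental reasons.

① Poor handling of geometric distortion: Treating ERP's nonlinear distortion (especially near the poles) under pinhole camera assumptions causes depth prediction errors to accumulate.

② Distance range mismatch: No metric depth model handles indoor (a few meters) and outdoor (hundreds of meters) scenes at the same time. Existing panorama-specific models overfit to a particular range.

③ Lack of data: Acquiring pixel-level depth GT for panoramic images is very expensive. Because no large-scale training data existed, no prior work had achieved foundation-model-level generalization.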

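2. Core Ideas

Three contributions work together.

① Large-Scale Panoramic Dataset (DAP-2M)

A panoramic dataset of 2M+ samples is built. Rather than simply gathering data, the authors design a pipeline that converts large volumes of unlabeled web images and generated images into pseudo-labeled data, which is what lets the dataset scale.

② Three-Stage Pseudo-Label Pipeline

A training pipeline that bridges, step by step, the synthetic-to-real domain gap and the indoor-to-outdoor distance range gap. To keep low-quality pseudo-labels from harming training, a PatchGAN-based discriminator selects only high-confidence samples.

③ Plug-and-Play Range Mask Head

To handle the large difference in depth range between indoor and outdoor scenes, an auxiliary head is added that classifies depth against a distance threshold (range threshold). The head first classifies the scene's depth range and then feeds that information into the metric depth prediction. It is plug-and-play: it can be attached independently to existing architectures.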

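3. Dataset Construction: DAP-2M

Data Composition

Table 2: DAP-2M dataset composition. It consists of 18,298 labeled Structured3D samples, 90K UE5 synthetic outdoor samples, and 1.9M unlabeled web-collected and DiT360-generated samples, for about 2.1M panoramic samples in total.

Source	Indoor/Outdoor	GT Label	Samples
Structured3D	Indoor	✓	18,298
UE5 Synthetic (DAP-2M-Labeled)	Outdoor	✓	~90,000
Web / DiT360-generated (DAP-2M-Unlabeled)	Indoor + Outdoor	✗	~1.9M
Total			~2.1M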

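Labeled data (108K):

Structured3D: Indoor synthetic data with pixel-level depth GT.
UE5 synthetic outdoor: Outdoor panoramas rendered with the Unreal Engine 5 simulator, covering the outdoor gap left by Structured3D, which has indoor GT only.

Unlabeled data (1.9M):

Web-collected real panoramas: Real 360-degree images collected from the internet.
DiT360-generated panoramas: Panoramas of diverse scenes produced by a text-to-image generative model.

Three-Stage Pseudo-Label Pipeline

Figure 3: Three-stage pseudo-label pipeline. Stage 1: the Scene-Invariant Labeler is initialized on synthetic data. Stage 2: the Realism-Invariant Labeler selects 600K high-confidence pseudo-labels with a PatchGAN discriminator. Stage 3: the final DAP is trained on the full 2.1M samples.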

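Stage 1: Scene-Invariant Labeler

Training data: about 20K synthetic indoor (the 18,298 Structured3D samples) + 90K UE5 synthetic outdoor (about 108K total, all with GT labels)
Goal: Train a base model that predicts stable depth regardless of whether the scene is indoor or outdoor
Result: Strong scene generalization from synthetic data, but a domain gap to real images remains

Stage 2: Realism-Invariant Labeler

The Stage 1 model generates pseudo-labels for the 1.9M unlabeled images
PatchGAN-based discriminator: trained to distinguish the synthetic domain from the real domain
Only the samples with the highest discriminator confidence scores are kept: 300K indoor + 300K outdoor = 600K high-confidence pseudo-labels
Low-quality pseudo-labels are blocked from entering training at the source

Stage 3: Final DAP Training

Training data: 108K labeled + 600K pseudo-labeled + additional unlabeled (self-supervised) → the full 2.1M
Initialized from the Stage 2 model and trained with large-scale semi-supervised learning
The final DAP foundation model is complete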

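4. Architecture

Figure 2: DAP architecture. It consists of three modules: a DINOv3-Large backbone, a distortion-aware depth decoder, and a plug-and-play range mask head. The range mask classifies the scene's distance range, and this information guides the metric depth prediction.

Backbone: DINOv3-Large

DINOv3-Large, which provides strong pre-trained visual representations, is used as the encoder. It is well suited to capturing both the global structure and the local detail of panoramic images.

Distortion-Aware Depth Decoder

A decoder that explicitly handles the geometric distortion of the ERP format. It corrects for the fact that pixels near the poles are spread wide in the image even though they actually cover a small solid angle.

Plug-and-Play Range Mask Head

One of the key contributions. It first classifies “What is the depth range of this scene?” and then feeds that information into the depth prediction.

Range mask M: binary classification of pixels as near/far relative to the distance threshold τ
Threshold τ selection: among 10m / 20m / 50m / 100m, 100m is empirically optimal
Mask loss: ℒ_mask = ||M - M_gt||² + 0.5·ℒ_Dice(M, M_gt)
Can be attached independently to existing architectures (plug-and-play)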

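5. Loss Functions

Six loss terms that reflect the properties of distorted panoramic images are combined.

M_distort: ERP distortion weight map. It assigns low weights to heavily distorted pixels near the poles so that high-distortion regions do not dominate the loss.

Loss	Weight	Role
ℒ_SILog	λ₁=1.0	Scale-Invariant Log loss. Aligns the overall depth scale
ℒ_DF	λ₂=0.4	DF-Gram loss. Matches depth distribution statistics per patch over 12 icosahedron patches (global geometric consistency)
ℒ_grad	λ₃=5.0	Gradient loss. Preserves sharpness at boundaries and surface transitions
ℒ_normal	λ₄=2.0	Normal loss. Matches surface normals derived from depth to GT normals (geometric accuracy)
ℒ_pts	λ₅=2.0	Point-Cloud loss. Minimizes Euclidean distance at the 3D point cloud level
ℒ_mask	λ₆=2.0	Supervised loss for the range mask head

ℒ_DF (DF-Gram) detail: The sphere is partitioned into 12 icosahedron patches, and the Gram matrix of each patch's depth values is compared with the GT. This enforces consistency of the global geometric structure while being less affected by ERP distortion.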

\[\mathcal{L}_{DF} = \frac{1}{N}\sum_{k}\|D_{pred}^{(k)}\odot D_{pred}^{(k)T} - D_{gt}^{(k)}\odot D_{gt}^{(k)T}\|_F^2\]

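6. Training Details

Setting	Value
Input resolution	512×1024 (ERP)
Optimizer	Adam
Backbone learning rate	5e-6
Decoder learning rate	5e-5
Training hardware	NVIDIA H20 GPU
Data augmentation	Color jittering, horizontal translation, horizontal flipping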

7. Experimental Results

Zero-Shot Benchmarks (Table 3)

Table 3: Zero-shot benchmark results. DAP outperforms all existing methods on the three datasets Stanford2D3D (indoor), Matterport3D (indoor), and Deep360 (outdoor). It is well ahead of both the general pinhole-based depth models (DepthAnything V2, UniDepth) and the panorama-specific models.

Stanford2D3D (indoor, not included in training):

Method	AbsRel ↓	RMSE ↓	δ₁ ↑
DepthAnything V2	0.1822	0.7340	0.7691
UniDepth	0.1654	0.6893	0.8012
PanDA	0.1135	0.4210	0.8901
DAP	0.0921	0.3820	0.9135

Matterport3D (indoor, not included in training):

Method	AbsRel ↓	RMSE ↓	δ₁ ↑
DAP	0.1186	0.7510	0.8518

Deep360 (outdoor):

Method	AbsRel ↓	RMSE ↓	δ₁ ↑
DAP	0.0659	5.224	0.9525

DAP-Test Benchmark (Table 4)

Table 4: DAP-Test benchmark results. On DAP's own evaluation set covering both indoor and outdoor scenes, DAP shows a large margin over the previous best method (Unik3D): a 69% AbsRel reduction, a 35.6% RMSE reduction, and a 54.0% relative δ₁ increase.

Method	AbsRel ↓	RMSE ↓	δ₁ ↑
DAC	0.3197	8.799	0.5193
Unik3D	0.2517	10.56	0.6086
DAP	0.0781	6.804	0.9370

DAP vs. Unik3D:

AbsRel: 0.2517 → 0.0781 (69% reduction)
RMSE: 10.56 → 6.804 (35.6% reduction)
δ₁: 0.6086 → 0.9370 (54.0% relative improvement)

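Qualitative Comparison

Figure 4: Qualitative comparison. Compared with DAC and Unik3D, DAP (far right) shows clearly better boundary sharpness, distant-region accuracy, and global geometric consistency. It also keeps depth predictions stable at the panorama edges and near the poles.

Figure 5: DAP depth estimates on diverse real-world panoramas. It outputs consistent metric depth across city streets, indoor spaces, natural environments, complex structures, and other scenes. The range mask can be seen adapting automatically to a suitable depth range for each scene.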

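8. Ablation

Component Analysis (Table 5)

Table 5: Component ablation. Performance on Stanford2D3D and Deep360 as the three elements, distortion awareness (Distortion), geometric consistency (Geometry), and sharpness (Sharpness), are added one at a time. All three contribute meaningfully.

Distortion	Geometry	Sharpness	Stanford AbsRel ↓	Stanford δ₁ ↑	Deep360 AbsRel ↓	Deep360 δ₁ ↑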
✗	✗	✗	0.1166	0.8409	0.0942	0.8396
✓	✗	✗	0.1149	0.8440	0.0926	0.8423
✓	✓	✗	0.1112	0.8509	0.0880	0.8592
✓	✓	✓	0.1084	0.8576	0.0862	0.8719

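Role of each component:

Distortion awareness: the M_distort weight map reduces the loss contribution of distorted pixels near the poles
Geometric consistency: the ℒ_DF (DF-Gram), ℒ_normal, and ℒ_pts losses improve 3D structural accuracy
Sharpness: the ℒ_grad loss sharpens object boundaries and surface transitions

Range Mask Threshold Analysis (Table 6)

Table 6: Range mask distance threshold ablation. Among 10m, 20m, 50m, 100m, and None (no mask), the 100m threshold gives the best results on DAP-2M and the best δ₁ on Deep360, while 50m has a slightly lower Deep360 AbsRel (0.0843 vs. 0.0862). The mask itself matters: 100m improves on None for every metric.

Threshold τ	DAP-2M AbsRel ↓	DAP-2M δ₁ ↑	Deep360 AbsRel ↓	Deep360 δ₁ ↑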
10m	0.0801	0.9315	0.0934	0.8493
20m	0.0823	0.9164	0.0873	0.8668
50m	0.0864	0.9104	0.0843	0.8594
100m	0.0793	0.9353	0.0862	0.8719
None	0.0832	0.9042	0.0938	0.8411

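Meaning of the 100m threshold: The experiments find that 100m is the boundary that practically separates indoor scenes (about 10m or less) from large-scale outdoor scenes (hundreds of meters). Thresholds that are too low (10m, 20m) fail to classify mid-range outdoor scenes correctly, and 100m is always better than using no threshold.

9. Comparison with Existing Methods

Aspect	Pinhole Depth Models (DAV2, etc.)	Panorama-Specific (PanDA, etc.)	DAP
Panoramic geometry handling	✗ (pinhole assumption)	✓	✓ (distortion-aware)
Metric depth	△ (relative)	△	✓ (absolute metric values)
Indoor + outdoor support	△	✗ (mostly indoor)	✓
Training data scale	Large (pinhole)	Small	2M+ panoramas
Zero-shot generalization	Low	Medium	High
Range adaptation	✗	✗	✓ (mask head)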

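10. Limitations

Pseudo-label error propagation: The three-stage pipeline depends on the quality of Stage 1. Errors in the initial labeler can accumulate in later stages.
Pole distortion: M_distort mitigates it, but quality can still degrade in severely distorted regions near the poles.
High-resolution processing cost: Designed around 512×1024 input. Higher-resolution panoramas put a burden on memory and speed.
No real outdoor GT: All GT-labeled outdoor training data is UE5 synthetic. Validation against real outdoor LiDAR GT is limited.

11. Summary

DAP is the first paper to successfully bring the foundation model paradigm to panoramic depth estimation. The core message:

“Data scale + domain-bridging pipeline + panorama-specific design = general-purpose indoor/outdoor metric depth”

Pinhole-based depth models could not be applied directly to panoramas, and panorama-specific models overfit to particular environments (mostly indoor). DAP overcomes both problems with 2M+ diverse samples and a staged pseudo-label pipeline. The range mask head is a practical idea that neatly resolves the indoor/outdoor depth range gap.

It is also worth noting that the work was led by Insta360, a 360-degree camera hardware company, which marks it as practical research with real product deployment in mind.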

[Paper Review] VGGT-Ω: Scaling Feed-Forward 3D Reconstruction (CVPR 2026 Oral)

2026-06-02T00:00:00+00:00

Paper: VGGT-Ω: Scaling Feed-Forward 3D Reconstruction
Venue: CVPR 2026 (Oral Presentation)
Authors: Jianyuan Wang, Minghao Chen, Shangzhan Zhang, Nikita Karaev, Johannes Schönberger, Patrick Labatut, Piotr Bojanowski, David Novotny, Andrea Vedaldi, Christian Rupprecht
Affiliations: Visual Geometry Group (VGG), University of Oxford + Meta AI
arXiv: 2605.15195
GitHub: facebookresearch/vggt-omega

One-Line Summary

The successor to VGGT. A 1B-parameter feed-forward model that cuts memory by 70% via Register Attention, trains on 15× more data, and dramatically improves 3D reconstruction for both static and dynamic scenes.

Figure 1: VGGT-Ω overview. A single forward pass predicts camera parameters, depth maps, and scene registers simultaneously, supporting both static and dynamic scenes. Performance scales predictably with model and data size.

1. Background and Problem Statement

VGGT (CVPR 2025 Best Paper) successfully applied the foundation model philosophy to 3D reconstruction. However, practical deployment revealed several structural limitations.

Limitations of VGGT:

Global Attention memory explosion: In odd layers, all tokens across all N frames attend to one another — O(N²K²) complexity that scales quadratically with frame count.
No dynamic scene support: Training data was predominantly static scenes, making the model ill-equipped for scenes with moving objects.
Multi-head complexity: Three specialized heads (Camera, DPT, Tracking) limit scalability.
High-resolution conv bottleneck: The DPT head’s high-resolution convolution layers create memory and speed bottlenecks.

VGGT-Ω’s central question:

“If the right architectural choices are combined with sufficiently large data, does the quality of 3D reconstruction models scale predictably?”

2. Core Ideas

Three contributions working in concert.

① Register Attention

Global Self-Attention is replaced by bottleneck communication through register tokens. Inter-frame information exchange happens exclusively through registers; image tokens only perform within-frame attention. Complexity drops from O(N²K²) to O(N²R² + NK²), where R ≪ K.
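
To get a feel for the complexity claim, the sketch below counts attention pairs for both schemes. The values of N, K, and R are hypothetical (the review does not state them), and the within-frame term is counted as N(K+R)², which matches O(NK²) when R ≪ K.

```python
def global_attention_pairs(n_frames: int, k_tokens: int) -> int:
    """All N*K tokens attend to one another: (N*K)^2 pairs."""
    return (n_frames * k_tokens) ** 2


def register_attention_pairs(n_frames: int, k_tokens: int, r_registers: int) -> int:
    """Within-frame attention over K+R tokens, plus cross-frame attention over N*R registers."""
    within_frame = n_frames * (k_tokens + r_registers) ** 2
    cross_register = (n_frames * r_registers) ** 2
    return within_frame + cross_register


# Hypothetical sizes chosen only for illustration.
N, K, R = 100, 1000, 16
global_pairs = global_attention_pairs(N, K)
register_pairs = register_attention_pairs(N, K, R)
print(global_pairs, register_pairs, round(global_pairs / register_pairs, 1))
```

With these made-up sizes the pair count drops by roughly two orders of magnitude; the actual memory saving also depends on implementation details that are not covered here.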

② Single Dense Prediction Head

VGGT’s three specialized heads (Camera, DPT, Tracking) are consolidated into a single multi-task Dense Prediction Head. High-resolution convolutional layers are removed. The simpler design is more amenable to scaling.

③ Massive, Diverse Training

15× more supervised data compared to VGGT
New self-supervised learning protocol for unlabeled video
Newly built dynamic scene annotation pipeline

3. Architecture

Overall Pipeline

N input images
    → patch tokenizer → image tokens (K tokens per frame)
    → register tokens (R tokens per frame, R ≪ K) appended
    → Transformer layers (Register Attention)
    │   ├─ Within-frame Self-Attention: image + register tokens (K+R)
    │   └─ Cross-frame Self-Attention: register tokens only (R × N frames)
    └─→ Single Dense Prediction Head (multi-task supervision)
           → camera extrinsics (rotation + translation)
           → camera intrinsics (FoV)
           → depth maps + confidence maps
           → scene registers (reusable for downstream tasks)

Figure 2: Architecture comparison between VGGT and VGGT-Ω. Global Self-Attention (left) is replaced by Register Attention (right). Image tokens no longer communicate directly across frames — all inter-frame information flows through the register tokens.

Register Attention in Detail

The critical bottleneck in VGGT’s Alternating-Attention was the Global Self-Attention step (all tokens across all frames attending to one another). VGGT-Ω decouples this into two stages:

Within-frame attention: Each frame’s image tokens and register tokens attend together. Spatial information is compressed into the registers.
Cross-register attention: Only register tokens from all frames attend globally. Scene-wide 3D structure is exchanged at this compressed bottleneck.

Image tokens can only access cross-frame information by routing through registers. The registers act as per-frame scene aggregators.
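
The two-stage flow described above can be sketched in NumPy as follows. This is a toy illustration, not the VGGT-Ω implementation: attention here is single-head with no learned projections, residuals, or normalization, and the function names are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens: np.ndarray) -> np.ndarray:
    """Single-head self-attention without learned weights, batched over axis 0."""
    scale = 1.0 / np.sqrt(tokens.shape[-1])
    scores = tokens @ np.swapaxes(tokens, -1, -2) * scale
    return softmax(scores) @ tokens


def register_attention_block(image_tokens: np.ndarray, register_tokens: np.ndarray):
    """image_tokens: (N, K, C); register_tokens: (N, R, C)."""
    n, k, c = image_tokens.shape
    r = register_tokens.shape[1]
    # Stage 1: within-frame attention over the K + R tokens of each frame.
    frame_tokens = self_attention(np.concatenate([image_tokens, register_tokens], axis=1))
    image_tokens, register_tokens = frame_tokens[:, :k], frame_tokens[:, k:]
    # Stage 2: cross-frame attention over the N * R register tokens only.
    all_registers = register_tokens.reshape(1, n * r, c)
    register_tokens = self_attention(all_registers).reshape(n, r, c)
    return image_tokens, register_tokens


rng = np.random.default_rng(0)
images = rng.normal(size=(3, 5, 4))     # 3 frames, 5 image tokens, 4 channels
registers = rng.normal(size=(3, 2, 4))  # 2 registers per frame
out_images, out_registers = register_attention_block(images, registers)
print(out_images.shape, out_registers.shape)
```

Within a single block, changing one frame's image tokens leaves the other frames' image tokens untouched and only affects them through the registers in later blocks, which is the bottleneck behavior the text describes.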

Figure 3: Register Attention mechanism. Register tokens (■) collect information from within-frame image tokens (within-frame attention), then exchange that information globally with registers from other frames (cross-register attention). Image tokens (○) have no direct cross-frame communication.

Memory Efficiency

Register Attention cuts memory by 70% relative to Global Self-Attention. Combined with the removal of high-resolution convolutional layers, VGGT-Ω requires approximately 30% of VGGT’s memory.

# Frames	GPU Memory (GB)	Resolution
1	6.02	624×416
10	6.67	624×416
25	7.80	624×416
50	9.66	624×416
100	13.37	624×416
200	20.82	624×416
300	28.26	624×416
500	43.15	624×416
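
The table looks close to linear in the number of frames. A quick least-squares fit over the reported values makes that visible; any extrapolation beyond the listed frame counts or to other resolutions is an assumption, not something the review reports.

```python
frames = [1, 10, 25, 50, 100, 200, 300, 500]
memory_gb = [6.02, 6.67, 7.80, 9.66, 13.37, 20.82, 28.26, 43.15]


def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


slope, intercept = linear_fit(frames, memory_gb)
max_residual = max(abs(slope * x + intercept - y) for x, y in zip(frames, memory_gb))
print(f"~{slope:.4f} GB per frame, ~{intercept:.2f} GB fixed, max residual {max_residual:.3f} GB")
```

The fit suggests roughly 0.07 GB per additional frame on top of a fixed cost of about 6 GB at 624×416.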

4. Training

Data Composition

VGGT-Ω uses 15× more supervised training data than VGGT. Two major expansions:

New: Dynamic Scene Annotation Pipeline

An automated pipeline for annotating dynamic scenes with moving objects
Overcomes the limitation of existing datasets being dominated by static scenes

New: Self-Supervised Learning

A self-supervised learning protocol for unlabeled video data
Enables learning 3D structure from internet-scale video without manual annotation

Figure 4: VGGT-Ω training pipeline. Three pillars: supervised training data (static + dynamic scenes), self-supervised learning (unlabeled video), and the dynamic scene annotation pipeline.

5. Experimental Results

Main Benchmark Results

Table 1: Main benchmark results on static scenes. VGGT-Ω surpasses VGGT and all prior methods by a large margin across camera pose estimation, point cloud reconstruction, and depth estimation benchmarks.

Figure 5: Qualitative 3D reconstruction comparison on static scenes. VGGT-Ω produces significantly more complete and geometrically accurate point clouds than VGGT.

Figure 7: Qualitative camera pose estimation comparison. VGGT-Ω's predicted camera trajectories align substantially closer to ground truth.

Dynamic Scene Results

VGGT-Ω is the first in this line of work to directly target dynamic scenes. On the Sintel benchmark, it achieves a 77% improvement over the previous best.

Table 2: Dynamic scene benchmark results. VGGT-Ω achieves a 77% improvement over the prior best on Sintel camera estimation, with strong results across other dynamic scene datasets.

Figure 6: Qualitative dynamic scene reconstruction. VGGT-Ω handles scenes with moving objects (people, vehicles, etc.) stably and accurately.

Figure 8: Qualitative depth estimation comparison. Fine object boundaries and surface details are preserved across both static and dynamic scenes.

6. Ablation Study

Table 3: Ablation study. Each contribution — Register Attention, single Dense Prediction Head, and self-supervised learning — is isolated to quantify its individual impact. All three components contribute meaningfully to final performance.

7. Downstream Use of Scene Registers

A hidden benefit of Register Attention is that the learned scene registers transfer to downstream tasks. During 3D reconstruction training, registers learn to encode compact spatial representations of each frame’s scene — and these representations transfer to other spatial understanding tasks.

Confirmed downstream applications:

Vision-Language-Action (VLA) models: Scene registers from VGGT-Ω benefit robot manipulation and other VLA tasks requiring spatial understanding.
Language Alignment: The VGGT-Omega-1B-256-Text-Alignment checkpoint outputs text-aligned embeddings for cross-modal retrieval.

This suggests that 3D reconstruction is not just a geometry task but a powerful and scalable proxy task for spatial understanding.

Figure 9: Downstream use of scene registers. Register representations learned via 3D reconstruction transfer to language alignment and VLA models, demonstrating the value of reconstruction as a pretraining objective.

Figure 10: Scaling curves. Performance on key benchmarks improves predictably as model size and training data grow — the first empirical demonstration of scaling laws in feed-forward 3D reconstruction.

8. Comparison with VGGT

Aspect	VGGT	VGGT-Ω
Cross-frame attention	Global Self-Attention (all tokens)	Register Attention (registers only)
Prediction heads	Camera + DPT + Tracking (3 heads)	Single Dense Prediction Head
Dynamic scenes	Not supported	Supported
Training data	~17 datasets	~15× larger + self-supervised
GPU memory (100 frames)	~21 GB (336×518)	13.37 GB (624×416, higher res, less memory)
Sintel camera estimation	Baseline	+77% improvement
Scene register reuse	None	Transfers to VLA / language alignment
Scaling validation	—	Scaling laws empirically demonstrated

9. Limitations

High frame counts at high resolution still require substantial memory (500 frames at 624px = 43.15 GB)
Self-supervised signal may be noisier than supervised annotations
Text-aligned model is limited to 256px low resolution
HuggingFace model access requires manual approval

10. Summary

VGGT-Ω empirically validates that scaling laws apply to 3D reconstruction. The core message:

“Register Attention + single head + massive data = predictable scaling”

If VGGT proved the foundation model paradigm could work for 3D vision, VGGT-Ω confirms that it scales according to the same laws observed in NLP and 2D vision. Reduce architectural complexity (Register Attention, single head), increase data (15× supervised + self-supervised), and performance improves predictably — while the learned representations transfer to other spatial understanding tasks.

Scaling laws for 3D reconstruction are no longer a hypothesis.

[Paper Review] VGGT-Ω: Scaling Feed-Forward 3D Reconstruction (CVPR 2026 Oral)

2026-06-02T00:00:00+00:00

Paper: VGGT-Ω: Scaling Feed-Forward 3D Reconstruction
Venue: CVPR 2026 (Oral Presentation)
Authors: Jianyuan Wang, Minghao Chen, Shangzhan Zhang, Nikita Karaev, Johannes Schönberger, Patrick Labatut, Piotr Bojanowski, David Novotny, Andrea Vedaldi, Christian Rupprecht
Affiliations: Visual Geometry Group (VGG), University of Oxford + Meta AI
arXiv: 2605.15195
GitHub: facebookresearch/vggt-omega

One-Line Summary

The successor to VGGT: a 1B-parameter feed-forward model that cuts memory by 70% with Register Attention, trains on 15× more data, and substantially improves 3D reconstruction on both static and dynamic scenes.

Figure 1: VGGT-Ω overview. A single forward pass jointly predicts camera parameters, depth maps, and scene registers, for both static and dynamic scenes. Following scaling laws, performance improves predictably as model and data scale grow.

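1. Background and Problem Definition

VGGT (CVPR 2025 Best Paper) successfully applied the foundation-model philosophy to 3D reconstruction and drew wide attention. In practical use, however, several structural limitations emerged.

Limitations of VGGT:

Memory blow-up from Global Attention: In the odd-numbered layers, all tokens of all N frames attend to each other, so the memory cost grows quadratically with the number of frames.
No dynamic scene support: Because the training data is dominated by static scenes, VGGT cannot handle scenes that contain moving objects.
Multi-head complexity: The specialized designs of the Camera Head, DPT Head, and Tracking Head constrain scalability.
High-resolution convolution bottleneck: The high-resolution convolutions in the DPT head are a memory and speed bottleneck.

VGGT-Ω's core question:

“When the right architectural choices are combined with sufficiently large data, does 3D reconstruction performance scale predictably?”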

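2. Core Ideas

Three contributions work together.

① Register Attention

Global Self-Attention is replaced with bottlenecked communication through register tokens. Cross-frame information is exchanged only via the registers, while image tokens attend only within their own frame. Attention complexity drops from O(N²K²) to O(N²R² + NK²), with R ≪ K.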
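To make the complexity claim concrete, the sketch below counts query-key pairs for both schemes. The token counts K and R used in the example are illustrative assumptions, not values from the paper, and the intra-frame term is counted over K + R tokens, which reduces to roughly NK² when R ≪ K.

```python
def global_attention_pairs(n_frames: int, k_tokens: int) -> int:
    """Query-key pairs when all N * K image tokens attend to each other."""
    return (n_frames * k_tokens) ** 2


def register_attention_pairs(n_frames: int, k_tokens: int, r_registers: int) -> int:
    """Query-key pairs under the two-step register scheme.

    Intra-frame: each frame attends over its own K + R tokens.
    Cross-frame: only the N * R register tokens attend to each other.
    """
    intra = n_frames * (k_tokens + r_registers) ** 2
    cross = (n_frames * r_registers) ** 2
    return intra + cross


if __name__ == "__main__":
    # Illustrative sizes only: K and R are assumptions, not values from the paper.
    n, k, r = 100, 1024, 16
    full = global_attention_pairs(n, k)
    reg = register_attention_pairs(n, k, r)
    print(f"global:   {full:,} pairs")
    print(f"register: {reg:,} pairs")
    print(f"ratio:    {full / reg:.1f}x fewer pairs")
```

The cross-frame term is still quadratic in N, but it is multiplied by R² rather than K², which is why the bottleneck pays off as long as the number of registers per frame stays small.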

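② Single Dense Prediction Head

VGGT's three specialized heads (Camera, DPT, Tracking) are merged into a single multi-task Dense Prediction Head, and the high-resolution convolution layers are removed. The simpler structure is better suited to scaling.

③ Large-Scale, Diverse Training

15× more supervised training data than VGGT
A self-supervised learning protocol for unlabeled video
A new annotation pipeline for dynamic scenes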

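3. Architecture

Full Pipeline

N input images
    → patch tokenizer → image tokens (K per frame)
    → add register tokens (R per frame, R ≪ K)
    → Transformer layers (Register Attention)
    │   ├─ intra-frame self-attention: image tokens + registers (K+R tokens)
    │   └─ cross-frame self-attention: register tokens only (R × N frames)
    └─→ single Dense Prediction Head (multi-task supervision)
           → camera extrinsics (rotation + translation)
           → camera intrinsics (FoV)
           → depth map + confidence map
           → scene registers (reusable downstream)

Figure 2: Architecture comparison of VGGT and VGGT-Ω. VGGT's Global Self-Attention (left) is replaced by Register Attention (right) in VGGT-Ω. Image tokens no longer communicate directly across frames; cross-frame information is exchanged only through the register tokens.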

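Register Attention in Detail

The Global Self-Attention step (every token of every frame attending to every other token), which was the main bottleneck of VGGT's Alternating-Attention, is split into two steps:

Intra-frame attention: The image tokens and register tokens of each frame attend together, compressing spatial information into the registers.
Cross-register attention: Only the register tokens of all frames attend globally, exchanging information about the 3D structure of the whole scene.

For an image token to obtain information from another frame, it must go through the registers. Each frame's registers act as its scene aggregator.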
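The two steps can be sketched in NumPy as follows. This is a toy illustration under strong assumptions, not the released implementation: a single head, identity query/key/value projections, and no residual connections, normalization, or MLP. The function names are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens: np.ndarray) -> np.ndarray:
    """Single-head self-attention with identity projections (toy simplification).

    tokens: (..., T, D). Attention is computed over the T axis.
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens


def register_attention_block(image_tokens: np.ndarray, registers: np.ndarray):
    """image_tokens: (N, K, D), registers: (N, R, D)."""
    n, k, _ = image_tokens.shape
    r = registers.shape[1]

    # Step 1: intra-frame attention over the K + R tokens of each frame.
    per_frame = np.concatenate([image_tokens, registers], axis=1)  # (N, K+R, D)
    per_frame = self_attention(per_frame)
    image_tokens, registers = per_frame[:, :k], per_frame[:, k:]

    # Step 2: cross-frame attention over the N * R register tokens only.
    flat = registers.reshape(1, n * r, -1)
    registers = self_attention(flat).reshape(n, r, -1)
    return image_tokens, registers
```

Within one block of this sketch, the image tokens of a frame are unaffected by the other frames. Cross-frame information reaches them only in the next block, through the registers that were updated in step 2.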

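Figure 3: Register Attention mechanism. Register tokens (■) collect information from the image tokens of their frame (intra-frame attention) and exchange cross-frame information through global attention among registers. Image tokens (○) have no direct cross-frame communication.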

Memory Efficiency

Frames	GPU memory (GB)	Resolution
1	6.02	624×416
10	6.67	624×416
25	7.80	624×416
50	9.66	624×416
100	13.37	624×416
200	20.82	624×416
300	28.26	624×416
500	43.15	624×416
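The table reads as almost exactly linear in the number of frames: roughly 0.07 GB per additional frame on top of a fixed cost of about 6 GB. A small least-squares fit over the table values shows this; the fit is only a reading of the reported numbers, not a claim about how memory behaves at other resolutions or on other hardware.

```python
# (frames, peak GPU memory in GB) at 624x416, copied from the table above.
MEASUREMENTS = [
    (1, 6.02), (10, 6.67), (25, 7.80), (50, 9.66),
    (100, 13.37), (200, 20.82), (300, 28.26), (500, 43.15),
]


def fit_line(points):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(MEASUREMENTS)

if __name__ == "__main__":
    print(f"fixed cost ~ {intercept:.2f} GB, per frame ~ {slope * 1000:.0f} MB")
    for frames, measured in MEASUREMENTS:
        predicted = intercept + slope * frames
        print(f"{frames:>4} frames: measured {measured:6.2f} GB, fit {predicted:6.2f} GB")
```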

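4. Training

Training Data Composition

VGGT-Ω uses 15× more supervised training data than VGGT. The expansion rests on two new components.

New: dynamic scene annotation pipeline

A pipeline that automatically annotates dynamic scene data containing moving objects
Overcomes the limits of existing datasets, which are dominated by static scenes

New: self-supervised learning

A self-supervised learning protocol that exploits unlabeled video data
Learns 3D structure from internet-scale unlabeled video

Figure 4: VGGT-Ω training pipeline. Three pillars are combined: supervised data (static + dynamic scenes), self-supervised learning (unlabeled video), and the dynamic scene annotation pipeline.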

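5. Experimental Results

Main Benchmark Results

Table 1: Main static scene benchmark results. VGGT-Ω substantially improves on the previous best results, including VGGT, across camera pose estimation, point cloud reconstruction, and depth estimation.

Figure 5: Qualitative comparison of static scene 3D reconstruction. VGGT-Ω produces more precise and more complete point clouds than VGGT.

Figure 7: Qualitative comparison of camera pose estimation. VGGT-Ω's camera trajectories align markedly more closely with ground truth.

Dynamic Scene Results

Performance is newly measured on dynamic scenes, which VGGT could not handle at all. On the Sintel benchmark, VGGT-Ω records a 77% improvement over the previous best.

Table 2: Dynamic scene benchmark results. VGGT-Ω clearly outperforms prior methods on dynamic scene datasets such as Sintel and TartanAir, with a 77% improvement over the previous best on Sintel.

Figure 6: Qualitative dynamic scene reconstruction. Reconstruction quality stays stable in scenes with moving objects (people, vehicles, etc.).

Figure 8: Qualitative depth estimation comparison. Fine boundaries and object details are well preserved in both static and dynamic scenes.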

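6. Ablation Study

Table 3: Ablation study. The contributions of Register Attention, the single Dense Prediction Head, and self-supervised learning are analyzed separately. All three contribute meaningfully to final performance.

7. Downstream Use of Registers

A hidden benefit of Register Attention is that the scene register tokens can be reused for downstream tasks. During 3D reconstruction training, the registers learn a compact representation of the scene's spatial structure, and this representation transfers to other spatial understanding tasks.

Confirmed downstream uses:

Vision-Language-Action (VLA) models: Scene registers are used in VLA models that need spatial understanding, such as robot manipulation
Language Alignment: The 256px checkpoint (VGGT-Omega-1B-256-Text-Alignment) supports text-image aligned embedding output

This suggests that 3D reconstruction is more than a way to improve metrics: it is a powerful pretraining task for spatial understanding.

Figure 9: Downstream use of scene registers. Shows how register representations learned through 3D reconstruction transfer to language alignment and VLA models.

Figure 10: Scaling curves. Performance on key benchmarks improves predictably as model size and training data grow, showing for the first time that scaling laws also hold in 3D reconstruction.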

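8. Comparison with VGGT

Aspect	VGGT	VGGT-Ω
Cross-frame attention	Global Self-Attention (all tokens)	Register Attention (registers only)
Prediction heads	Camera + DPT + Tracking (3 heads)	Single Dense Prediction Head
Dynamic scenes	Not supported	Supported
Training data	~17 datasets	~15× larger + self-supervised learning
GPU memory (100 frames)	~21 GB (336×518)	13.37 GB (624×416; less memory at higher resolution)
Sintel camera estimation	Baseline	+77% improvement
Scene register reuse	None	Transfers to VLA / language alignment
Scaling validation	—	Model and data scaling laws demonstrated empirically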

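9. Limitations

High-resolution processing (512px and above) with many frames still requires tens of GB of memory (500 frames at 624px = 43.15 GB)
The self-supervised signal may be less reliable than supervised annotations
The text-aligned model is limited to a low resolution of 256px
HuggingFace access requires an approval step (no immediate public access)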

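10. Summary

VGGT-Ω empirically supports the claim that scaling laws also apply to 3D reconstruction. The core message:

“Register Attention + single head + large-scale data = predictable scaling”

If VGGT proved that a foundation model for 3D vision was possible, VGGT-Ω confirms that possibility in the language of scaling laws. Reduce architectural complexity (Register Attention, single head), increase data (15× supervised + self-supervised), and performance improves predictably, while the learned representations transfer to other spatial understanding tasks.

The scaling laws established in NLP and 2D vision now work in 3D reconstruction as well.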

[Paper Review] WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild (CVPR 2025)

2026-05-25T00:00:00+00:00

Paper: WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild
Venue: CVPR 2025
Authors: Rolandos Alexandros Potamias, Jinglei Zhang, Jiankang Deng, Stefanos Zafeiriou
Affiliations: Imperial College London (Potamias, Deng, Zafeiriou), Shanghai Jiao Tong University (Zhang)
arXiv: 2409.12259
GitHub: rolpotamias/WiLoR

One-Line Summary

An end-to-end 3D hand reconstruction pipeline combining a real-time fully convolutional hand detector with a ViT-Large + multi-scale refinement module, achieving state-of-the-art on FreiHAND and HO3Dv2 while delivering 2–3× better temporal coherence than HaMeR, trained on 4.2M images including the new 2M+ in-the-wild WHIM dataset.

1. Background and Problem Definition

The Fragmented Two-Stage Pipeline

3D hand reconstruction systems have traditionally been developed as two separate stages: a hand detector finds the hand region, and a separate pose estimation model recovers the 3D mesh from the detected crop. This separation introduces three problems:

Detection bottleneck: False positives or misses from the upstream detector directly limit full-pipeline performance
Detection speed: Prior hand detectors (ContactHands: 3 FPS) are unsuitable for real-time applications
Temporal incoherence: Single-frame pose estimation causes inter-frame jitter in video

Additionally, prior methods have no refinement step after the initial MANO parameter estimate, so alignment in image space is inaccurate. Where HaMeR regresses MANO parameters directly from a single query token, WiLoR makes an initial prediction and then refines it with residuals computed from image-aligned features.

WiLoR’s Core Proposal

WiLoR addresses all three issues simultaneously:

Real-time hand detection: Fully convolutional detection network at 130+ FPS
High-fidelity 3D reconstruction: ViT-Large backbone with a multi-scale image-aligned refinement module
Large-scale in-the-wild data: 2M+ automatically annotated images (WHIM dataset)

2. Output Representation: MANO Hand Model

WiLoR also uses the parametric hand model MANO as its output space.

MANO parameters:

Pose \(\theta \in \mathbb{R}^{48}\): Hand joint rotations (16 joints × 3 axis-angle values, including the global wrist orientation)
Shape \(\beta \in \mathbb{R}^{10}\): Per-identity hand shape variables
Camera \(K_{cam}\): Weak-perspective camera parameters

The full output \(\{\theta, \beta, K_{cam}\}\) deterministically yields a 778-vertex mesh \(V_{3D}\) and 21 joint locations \(J_{3D}\).

3. Architecture Overview

WiLoR is an end-to-end pipeline of two networks:

Single RGB image
    → WiLoR-Det (hand detection network)
        → bounding boxes + left/right labels
    → hand crop extraction
    → WiLoR-Rec (3D reconstruction network)
        → initial MANO parameter estimation (ViT-L)
        → refinement module (multi-scale image alignment)
        → final MANO parameters (θ, β, K_cam)
        → MANO layer → 3D mesh + joint coordinates

4. Hand Detection Network (WiLoR-Det)

Architecture

WiLoR-Det is a real-time object detection architecture in the YOLOv8 family, specialized for hand detection.

Backbone: DarkNet — extracts last three feature maps \(\{C_3, C_4, C_5\}\)
Neck: PANet (Path Aggregation Network) — multi-scale feature fusion
Head: Three anchor-free detection heads (predict bounding boxes + left/right labels at each scale)

Provided in two sizes:

WiLoR-M: 25 MB, 138 FPS
WiLoR-S: 7 MB, 175 FPS

Detection Loss

\[\mathcal{L}_{det} = \lambda_0 \mathcal{L}_{BCE} + \lambda_1 \mathcal{L}_{DFL} + \lambda_2 \mathcal{L}_{CIoU} + \lambda_3 \mathcal{L}_{kpts}\]

Term	Weight	Role
\(\mathcal{L}_{BCE}\)	\(\lambda_0 = 0.5\)	Classification (hand presence + left/right)
\(\mathcal{L}_{DFL}\)	\(\lambda_1 = 1.5\)	Distribution focal loss (box coordinate distribution)
\(\mathcal{L}_{CIoU}\)	\(\lambda_2 = 15\)	Bounding box shape regression
\(\mathcal{L}_{kpts}\)	\(\lambda_3 = 10\)	Keypoint alignment

5. 3D Hand Reconstruction Network (WiLoR-Rec)

Backbone: ViT-Large

Fine-tuned from ViTPose pretrained weights
Hidden dimension 1,280
Input: image patch tokens \(\mathbf{T}_{img}\) concatenated with learnable tokens for pose \(\theta\), shape \(\beta\), and camera \(K_{cam}\)
Initial (coarse) MANO parameters estimated via MLP from ViT output tokens

WiLoR uses ViT-L rather than HaMeR’s ViT-H, but compensates through ViTPose pretraining and the refinement module.

Refinement Module

The refinement module is WiLoR-Rec's key differentiator. Rather than stopping at the initial MANO parameters, it extracts image-aligned features to predict residual corrections.

How it works:

Feature map generation: Deconvolutional layers upsample ViT image output tokens into multi-resolution feature maps \(\{F_0, F_1, \ldots, F_n\}\)
Mesh projection: Each vertex \(v\) of the initially estimated hand mesh is projected onto the image plane using the estimated camera \(K_{cam}\)
Bilinear sampling: Multi-scale features are bilinearly interpolated at the projected coordinates

\[p_v = \pi(v, K_{cam})\] \[f_v^0 = \text{bilinear\_sample}(\{F_i\}, p_v)\]

Residual prediction: Per-vertex features from mesh level \(M_l\) are aggregated to compute pose and shape residuals

\[\Delta\beta = \text{MLP}_\beta\left(\bigoplus_{v \in M_l} f_v^0\right), \quad \Delta\theta = \text{MLP}_\theta\left(\bigoplus_{v \in M_l} f_v^0\right)\]

The key insight is image alignment: rather than regressing from global image features, the model directly references local image features corresponding to estimated mesh positions, correcting errors through local context.
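The projection and sampling steps above can be sketched in NumPy. This is a minimal illustration, not WiLoR's code: the weak-perspective camera is assumed to be a scale plus a 2D translation that maps vertices into normalized [-1, 1] image coordinates, the function names are hypothetical, and the per-scale features are simply concatenated.

```python
import numpy as np


def project_weak_perspective(vertices: np.ndarray, scale: float, trans: np.ndarray) -> np.ndarray:
    """Drop z, then scale and translate. Returns (V, 2) points in normalized [-1, 1] (x, y)."""
    return scale * vertices[:, :2] + trans


def bilinear_sample(feature_map: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Sample a (C, H, W) feature map at (V, 2) normalized points. Returns (V, C)."""
    _, h, w = feature_map.shape
    # Map normalized coordinates to pixel coordinates and clamp to the map.
    x = np.clip((points[:, 0] + 1) * 0.5 * (w - 1), 0, w - 1)
    y = np.clip((points[:, 1] + 1) * 0.5 * (h - 1), 0, h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feature_map[:, y0, x0] * (1 - wx) + feature_map[:, y0, x1] * wx
    bottom = feature_map[:, y1, x0] * (1 - wx) + feature_map[:, y1, x1] * wx
    return (top * (1 - wy) + bottom * wy).T


def per_vertex_features(vertices, feature_maps, scale, trans):
    """Concatenate the features sampled from every scale at the projected vertices."""
    points = project_weak_perspective(vertices, scale, trans)
    return np.concatenate([bilinear_sample(f, points) for f in feature_maps], axis=1)
```

Because every vertex reads features at its own projected location, a misplaced fingertip samples features that do not look like a fingertip, which gives the residual MLPs a local signal to correct against.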

Reconstruction Loss

\[\mathcal{L}_{rec} = \lambda_{3D}\mathcal{L}_{3D} + \lambda_{2D}\mathcal{L}_{2D} + \lambda_{pose}\mathcal{L}_{MANO,\theta} + \lambda_{shape}\mathcal{L}_{MANO,\beta} + \mathcal{L}_{adv}\]

Term	Weight	Formula
\(\mathcal{L}_{3D}\)	0.05	\(\|V_{3D} - \hat{V}_{3D}\|_1\)
\(\mathcal{L}_{2D}\)	0.01	\(\|\pi(J_{3D}, K_{cam}) - \hat{J}_{2D}\|_1\)
\(\mathcal{L}_{MANO,\theta}\)	0.001	\(\|\theta - \hat{\theta}\|_2^2\)
\(\mathcal{L}_{MANO,\beta}\)	0.0005	\(\|\beta - \hat{\beta}\|_2^2\)
\(\mathcal{L}_{adv}\)	—	\(\|D(\theta, \beta) - 1\|_2\)

6. WHIM Dataset

Motivation and Scale

The core problem with prior in-the-wild training data is insufficient diversity. WiLoR builds the WHIM (Wild Hand In-the-wild Monocular) dataset via an automated annotation pipeline.

Scale: 2M+ in-the-wild hand images
Source: 1,400+ YouTube videos (sign language, cooking, sports, games, ego/exocentric viewpoints)
Annotations: 2D bounding boxes, left/right labels, 3D MANO parameters

Automated Annotation Pipeline

3D ground truth is generated through an automated fitting pipeline rather than manual annotation.

Stage 1 — Person detection:

ViTPose + AlphaPose with confidence threshold 0.65

Stage 2 — Hand detection ensemble:

Three detectors (MediaPipe, OpenPose, ContactHands) ensembled
Confidence-weighted bounding box fusion:

\[\hat{y} = \frac{\sum_i P(b_i | d_i) \cdot b_i}{\sum_i P(b_i | d_i)}\]
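The fusion formula is a confidence-weighted average of the boxes returned by the individual detectors. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) and that the boxes have already been matched to the same hand (the matching step is not described here):

```python
import numpy as np


def fuse_boxes(boxes: np.ndarray, confidences: np.ndarray) -> np.ndarray:
    """Confidence-weighted average of D detector boxes.

    boxes: (D, 4) as (x1, y1, x2, y2); confidences: (D,) non-negative weights.
    """
    boxes = np.asarray(boxes, dtype=float)
    confidences = np.asarray(confidences, dtype=float)
    total = confidences.sum()
    if total <= 0:
        raise ValueError("at least one detector must report a positive confidence")
    return (confidences[:, None] * boxes).sum(axis=0) / total
```

With equal confidences this reduces to the plain mean of the boxes; a detector with three times the confidence of another pulls the fused box three times as strongly towards its own prediction.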

Stage 3 — 3D MANO fitting: Optimized with three losses:

Reprojection loss: \(\mathcal{L}_{proj} = \|J_M - \pi(\hat{J}_s, K)\|_1\)
Biomechanical loss: \(\mathcal{L}_{BMC} = \mathcal{L}_{BL} + \mathcal{L}_{A}\) (bone length + joint angle constraints)
PCA prior loss: \(\mathcal{L}_{prior} = \|X - ([(X - \mu)U^T]U + \mu)\|_2\) (enforces natural hand shapes)

Explicitly embedding biomechanical constraints in the fitting loss prevents generation of physically implausible hand poses.
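The PCA prior can be read as a reconstruction error: project the pose onto a learned PCA basis, map it back, and penalize what is lost. The sketch below follows that reading, which is an assumption about how the terms of the formula group; the basis U is assumed to have orthonormal rows, and the function name is hypothetical.

```python
import numpy as np


def pca_prior_loss(x: np.ndarray, mean: np.ndarray, components: np.ndarray) -> float:
    """L2 distance between a pose vector and its PCA reconstruction.

    x, mean: (P,); components: (C, P) with orthonormal rows (the matrix U).
    """
    coeffs = (x - mean) @ components.T           # project onto the PCA basis
    reconstruction = coeffs @ components + mean  # map back to pose space
    return float(np.linalg.norm(x - reconstruction))
```

A pose that lies inside the span of the basis has zero loss, while any component outside it is penalized in proportion to its magnitude.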

7. Training Configuration

Detection Model

Optimizer: Adam, 200 epochs with 30-epoch early stopping
Learning rate: 0.01 → 1e-6 (linear decay)
Hardware: 2× RTX 4090, batch size 256, 3 weeks
Augmentation: Mosaic (0.7 probability), rotation [-60°, 60°], scale [0.5, 1]

Reconstruction Model

Optimizer: Adam, 1,000 epochs, learning rate 1e-5, weight decay 1e-4
Training data: 14 datasets, 4.2M images (55%+ more than prior methods)
7 existing datasets with 3D annotations (FreiHAND, HO-3D, InterHand2.6M, etc.) + 7 additional including WHIM

8. Experimental Results

FreiHAND Benchmark (Table 3)

Method	PA-MPJPE (mm) ↓	PA-MPVPE (mm) ↓	F@5mm ↑	F@15mm ↑
HaMeR	6.0	5.7	0.785	0.990
WiLoR	5.5	5.1	0.825	0.993

WiLoR surpasses HaMeR on all FreiHAND metrics: PA-MPJPE drops by 8.3% (6.0 → 5.5 mm), and F@5mm rises by 5.1% in relative terms (0.785 → 0.825, or 4.0 percentage points).
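Relative change and percentage-point change are easy to mix up for metrics such as F@5mm that live in [0, 1]. The short check below recomputes both from the table values; the helper names are hypothetical.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change in percent, signed (negative means the value went down)."""
    return (after - before) / before * 100.0


def point_change(before: float, after: float) -> float:
    """Absolute change in percentage points for metrics expressed as fractions in [0, 1]."""
    return (after - before) * 100.0


if __name__ == "__main__":
    # Values from the FreiHAND table above.
    print(f"PA-MPJPE: {relative_change(6.0, 5.5):+.1f}% (lower is better)")
    print(f"F@5mm:    {relative_change(0.785, 0.825):+.1f}% relative, "
          f"{point_change(0.785, 0.825):+.1f} points")
```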

HO3Dv2 Benchmark (Table 4)

Method	AUCⱼ ↑	PA-MPJPE (mm) ↓	AUCᵥ ↑	PA-MPVPE (mm) ↓	F@5mm ↑	F@15mm ↑
HaMeR	0.846	7.7	0.841	7.9	0.635	0.980
WiLoR	0.851	7.5	0.846	7.7	0.646	0.983

WiLoR improves on every metric of this hand-object interaction benchmark as well.

Hand Detection Benchmark (Table 1, COCO)

Method	Model Size	FPS ↑	AP@0.5 ↑	mAP ↑
ContactHands	819 MB	3	50.29	16.67
ViTDet	1,400 MB	1	41.64	13.21
WiLoR-S	7 MB	175	46.96	18.56
WiLoR-M	25 MB	138	62.48	25.97

WiLoR-M achieves 45× faster speed at 32× smaller size compared to ContactHands while outperforming it by 12.19%p in AP@0.5.

On the WHIM test set, WiLoR-M scores AP@0.5 96.06, mAP 53.79.

Temporal Coherence (Table 6)

Although inference is independent per frame, WiLoR is markedly more temporally coherent on video:

Method	MPFVE×100 ↓	MPFJE×100 ↓	Jitter ↓	RTE ↓
HaMeR	10.60	1.768	20.43	2.92
WiLoR	4.43	0.762	5.92	0.07

2.4× improvement in MPFVE, 3.4× improvement in Jitter over HaMeR — temporal coherence achieved without any explicit temporal modeling.

9. Ablation Study

Reconstruction Component Analysis (Table 5, FreiHAND)

Configuration	PA-MPJPE (mm) ↓	PA-MPVPE (mm) ↓	F@5mm ↑	F@15mm ↑
w. FastViT backbone	6.5	6.3	0.741	0.967
w/o ViTPose pretrain	5.9	5.7	0.795	0.989
w. Single-scale refinement	6.0	5.9	0.793	0.991
w/o Refinement module	6.1	5.8	0.795	0.991
w. FreiHAND only	6.1	5.8	0.793	0.990
Full model (WiLoR)	5.5	5.1	0.825	0.993

Key observations:

Backbone capacity matters: ViT-L over FastViT gives 1.0mm improvement in PA-MPJPE. Capacity difference directly translates to performance.
ViTPose pretraining is critical: Starting from ViTPose weights versus standard ViT-L gives an additional 0.4mm improvement. Domain similarity between hand pose and body pose makes transfer beneficial.
Refinement module effect: Without refinement 6.1mm → with refinement 5.5mm. Image-aligned residual prediction contributes 0.6mm gain.
Multi-scale matters: Single-scale (6.0mm) vs. multi-scale (5.5mm) gives 0.5mm additional improvement.
WHIM data contribution: FreiHAND-only training (6.1mm) vs. full data (5.5mm) shows 0.6mm gain. Out-of-domain in-the-wild data benefits even a controlled studio benchmark.

10. Limitations

Detection dependency: Reconstruction performance directly depends on upstream detection quality; detection failures propagate through the pipeline
Tight crop requirement: Optimal performance requires tight crops where the hand is sufficiently contained in the image
No explicit temporal modeling: Despite strong temporal coherence, the approach does not explicitly leverage temporal context and may produce errors under rapid motion
Noisy automatic annotations: WHIM’s 3D annotations are generated by an automated fitting pipeline and contain more noise than manual annotations
MANO representation constraints: Performance degrades on extreme hand deformations or tool manipulation scenarios that fall outside the MANO parameter space

11. Summary

WiLoR inherits HaMeR’s scaling hypothesis and extends it in two important directions.

First, end-to-end integration. Detection and reconstruction are unified into a single pipeline, and the detector’s own performance is simultaneously improved. WiLoR-M achieves better hand detection while being 45× faster.

Second, image-aligned refinement. Going beyond single-pass regression, WiLoR introduces a coarse-to-fine structure that projects initial predictions back into image space and corrects errors using locally aligned features. This design delivers 0.5–0.6mm quantitative gains and, notably, dramatic improvements in temporal coherence.

The WHIM dataset is WiLoR’s hidden infrastructure. The pipeline that constructs 2M+ automatic annotations from 1,400 YouTube videos — with biomechanical constraints embedded in the fitting process to minimize noise — enables the scale of in-the-wild training that makes the performance possible.

Comparing HaMeR and WiLoR:

	HaMeR	WiLoR
Reconstruction backbone	ViT-H	ViT-L + ViTPose
Refinement module	None	Multi-scale image-aligned
Detector	External dependency	Built-in (WiLoR-Det)
Training data	2.7M (10 datasets)	4.2M (14 datasets)
FreiHAND PA-MPJPE	6.0 mm	5.5 mm
Temporal coherence (Jitter)	20.43	5.92

“Unify detection and reconstruction, then return to the image to correct errors: the next step that WiLoR demonstrates.”

[Paper Review] WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild (CVPR 2025)

2026-05-25T00:00:00+00:00

Paper: WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild
Venue: CVPR 2025
Authors: Rolandos Alexandros Potamias, Jinglei Zhang, Jiankang Deng, Stefanos Zafeiriou
Affiliations: Imperial College London (Potamias, Deng, Zafeiriou), Shanghai Jiao Tong University (Zhang)
arXiv: 2409.12259
GitHub: rolpotamias/WiLoR

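One-Line Summary

An end-to-end 3D hand reconstruction pipeline that combines a real-time fully convolutional hand detector with a ViT-Large backbone and a multi-scale refinement module. Trained with more than 2M in-the-wild images (WHIM), it achieves state-of-the-art results on FreiHAND and HO3Dv2 and 2–3× better temporal coherence than HaMeR.

1. Background and Problem Definition

The Fragmented Pipeline of Prior Methods

3D hand reconstruction systems have traditionally been built as two separate stages: a hand detector first finds the hand region, and a separate pose estimation model then recovers the 3D mesh from the detected crop. This separation causes three problems.

Detection bottleneck: False positives or misses from the upstream detector limit full-pipeline performance
Detection speed: Prior hand detectors (ContactHands: 3 FPS) are unsuitable for real-time applications
Temporal incoherence: Single-frame pose estimation causes inter-frame jitter in video

In addition, prior methods have no refinement step after the initial MANO parameter estimate, so alignment in image space is inaccurate. Whereas HaMeR regresses directly from a single query token, WiLoR makes an initial prediction and then refines it with residuals based on image-aligned features.

WiLoR's Core Proposal

WiLoR addresses all three at once.

Real-time hand detection: A fully convolutional detection network running at 130+ FPS
High-fidelity 3D reconstruction: A ViT-Large-based reconstructor with a multi-scale image-aligned refinement module
Large-scale in-the-wild data: An automatically annotated dataset of more than 2M images (WHIM)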

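2. Output Representation: MANO Hand Model

WiLoR also uses the parametric hand model MANO as its output space.

MANO parameters:

Pose \(\theta \in \mathbb{R}^{48}\): Hand joint rotations (16 joints × 3 axis-angle values, including the global wrist orientation)
Shape \(\beta \in \mathbb{R}^{10}\): Per-identity hand shape variables
Camera \(K_{cam}\): Weak-perspective camera parameters

From the final output \(\{\theta, \beta, K_{cam}\}\), a 778-vertex mesh \(V_{3D}\) and 21 joint locations \(J_{3D}\) are computed deterministically.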

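3. Architecture Overview

WiLoR is an end-to-end pipeline of two networks.

Single RGB image
    → WiLoR-Det (hand detection network)
        → bounding boxes + left/right labels
    → hand crop extraction
    → WiLoR-Rec (3D reconstruction network)
        → initial MANO parameter estimation (ViT-L)
        → refinement module (multi-scale image alignment)
        → final MANO parameters (θ, β, K_cam)
        → MANO layer → 3D mesh + joint coordinates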

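4. Hand Detection Network (WiLoR-Det)

Architecture

WiLoR-Det specializes a real-time object detection architecture from the YOLOv8 family for hand detection.

Backbone: DarkNet, extracting the last three feature maps \(\{C_3, C_4, C_5\}\)
Neck: PANet (Path Aggregation Network) for multi-scale feature fusion
Head: Three anchor-free detection heads (predict bounding boxes + left/right labels at each scale)

Provided in two sizes:

WiLoR-M: 25 MB, 138 FPS
WiLoR-S: 7 MB, 175 FPS

Detection Loss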

\[\mathcal{L}_{det} = \lambda_0 \mathcal{L}_{BCE} + \lambda_1 \mathcal{L}_{DFL} + \lambda_2 \mathcal{L}_{CIoU} + \lambda_3 \mathcal{L}_{kpts}\]

Term	Weight	Role
\(\mathcal{L}_{BCE}\)	\(\lambda_0 = 0.5\)	Classification (hand presence + left/right)
\(\mathcal{L}_{DFL}\)	\(\lambda_1 = 1.5\)	Distribution focal loss (learns the box coordinate distribution)
\(\mathcal{L}_{CIoU}\)	\(\lambda_2 = 15\)	Bounding box shape regression
\(\mathcal{L}_{kpts}\)	\(\lambda_3 = 10\)	Keypoint alignment

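5. 3D Hand Reconstruction Network (WiLoR-Rec)

Backbone: ViT-Large

Fine-tuned from ViTPose pretrained weights
Hidden dimension 1,280
Input: image patch tokens \(\mathbf{T}_{img}\) together with learnable tokens for pose \(\theta\), shape \(\beta\), and camera \(K_{cam}\)
Initial (coarse) MANO parameters estimated via MLP from the ViT output tokens

WiLoR uses ViT-L rather than HaMeR's ViT-H, but compensates through ViTPose pretrained weights and the refinement module.

The refinement module is WiLoR-Rec's key differentiator. Rather than stopping at the initial MANO parameters, it extracts image-aligned features to predict residuals.

How it works:

Feature map generation: Deconvolutional layers upsample the ViT image output tokens into multi-resolution feature maps \(\{F_0, F_1, \ldots, F_n\}\)
Mesh projection: Each vertex \(v\) of the 3D hand mesh is projected onto the image plane using the initially estimated camera \(K_{cam}\)
Bilinear sampling: The multi-scale feature maps are sampled at the projected coordinates by bilinear interpolation

\[p_v = \pi(v, K_{cam})\] \[f_v^0 = \text{bilinear\_sample}(\{F_i\}, p_v)\]

Residual prediction: Per-vertex features from mesh level \(M_l\) are aggregated to compute pose and shape residuals

\[\Delta\beta = \text{MLP}_\beta\left(\bigoplus_{v \in M_l} f_v^0\right), \quad \Delta\theta = \text{MLP}_\theta\left(\bigoplus_{v \in M_l} f_v^0\right)\]

The core of this design is image alignment. Rather than regressing MANO from global image features alone, the model directly references the local image features at the estimated mesh positions to correct errors.

Reconstruction Loss

\[\mathcal{L}_{rec} = \lambda_{3D}\mathcal{L}_{3D} + \lambda_{2D}\mathcal{L}_{2D} + \lambda_{pose}\mathcal{L}_{MANO,\theta} + \lambda_{shape}\mathcal{L}_{MANO,\beta} + \mathcal{L}_{adv}\]

Term	Weight	Formula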
\(\mathcal{L}_{3D}\)	0.05	\(\|V_{3D} - \hat{V}_{3D}\|_1\)
\(\mathcal{L}_{2D}\)	0.01	\(\|\pi(J_{3D}, K_{cam}) - \hat{J}_{2D}\|_1\)
\(\mathcal{L}_{MANO,\theta}\)	0.001	\(\|\theta - \hat{\theta}\|_2^2\)
\(\mathcal{L}_{MANO,\beta}\)	0.0005	\(\|\beta - \hat{\beta}\|_2^2\)
\(\mathcal{L}_{adv}\)	—	\(\|D(\theta, \beta) - 1\|_2\)

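6. WHIM Dataset

Motivation and Scale

The core problem with prior in-the-wild training data is insufficient diversity. To address it, WiLoR builds the WHIM (Wild Hand In-the-wild Monocular) dataset with an automated pipeline.

Scale: 2M+ in-the-wild hand images
Source: 1,400+ YouTube videos (sign language, cooking, sports, games, ego/exocentric viewpoints)
Annotations: 2D bounding boxes, left/right labels, 3D MANO parameters

Automated Annotation Pipeline

3D ground truth is generated by an automated fitting pipeline rather than by human annotators.

Stage 1: Person detection

ViTPose + AlphaPose, keeping people detected with confidence 0.65 or higher

Stage 2: Hand detection ensemble

Three detectors (MediaPipe, OpenPose, ContactHands) ensembled
Bounding boxes fused by a confidence-weighted average:

\[\hat{y} = \frac{\sum_i P(b_i | d_i) \cdot b_i}{\sum_i P(b_i | d_i)}\]

Stage 3: 3D MANO fitting, optimized with three losses:

Reprojection loss: \(\mathcal{L}_{proj} = \|J_M - \pi(\hat{J}_s, K)\|_1\)
Biomechanical loss: \(\mathcal{L}_{BMC} = \mathcal{L}_{BL} + \mathcal{L}_{A}\) (bone length + joint angle constraints)
PCA prior loss: \(\mathcal{L}_{prior} = \|X - ([(X - \mu)U^T]U + \mu)\|_2\) (enforces natural hand shapes)

Including biomechanical constraints explicitly in the loss prevents physically implausible hand poses.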

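7. Training Configuration

Detection Model

Optimizer: Adam, 200 epochs (30-epoch early stopping)
Learning rate: 0.01 → 1e-6, linear decay
Hardware: 2× RTX 4090, batch size 256, 3 weeks of training
Augmentation: Mosaic (probability 0.7), rotation [-60°, 60°], scale [0.5, 1]

Reconstruction Model

Optimizer: Adam, 1,000 epochs, learning rate 1e-5, weight decay 1e-4
Training data: 14 datasets, 4.2M images (55%+ more than prior methods)
7 existing datasets with 3D annotations (FreiHAND, HO-3D, InterHand2.6M, etc.) + 7 additional including WHIM

8. Experimental Results

FreiHAND Benchmark (Table 3)

Method	PA-MPJPE (mm) ↓	PA-MPVPE (mm) ↓	F@5mm ↑	F@15mm ↑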
HaMeR	6.0	5.7	0.785	0.990
WiLoR	5.5	5.1	0.825	0.993

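WiLoR surpasses HaMeR on all FreiHAND metrics: PA-MPJPE drops by 8.3% (6.0 → 5.5 mm), and F@5mm rises by 5.1% in relative terms (0.785 → 0.825, or 4.0 percentage points).

HO3Dv2 Benchmark (Table 4)

Method	AUCⱼ ↑	PA-MPJPE (mm) ↓	AUCᵥ ↑	PA-MPVPE (mm) ↓	F@5mm ↑	F@15mm ↑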
HaMeR	0.846	7.7	0.841	7.9	0.635	0.980
WiLoR	0.851	7.5	0.846	7.7	0.646	0.983

The improvement is consistent on this hand-object interaction benchmark as well.

Hand Detection Benchmark (Table 1, COCO)

Method	Model Size	FPS ↑	AP@0.5 ↑	mAP ↑
ContactHands	819 MB	3	50.29	16.67
ViTDet	1,400 MB	1	41.64	13.21
WiLoR-S	7 MB	175	46.96	18.56
WiLoR-M	25 MB	138	62.48	25.97

WiLoR-M is 45× faster and 32× smaller than ContactHands while scoring 12.19 percentage points higher in AP@0.5.

On the WHIM test set, WiLoR-M scores AP@0.5 96.06 and mAP 53.79.

Temporal Coherence (Table 6)

Temporal coherence measured on video, even though inference is independent per frame:

Method	MPFVE×100 ↓	MPFJE×100 ↓	Jitter ↓	RTE ↓
HaMeR	10.60	1.768	20.43	2.92
WiLoR	4.43	0.762	5.92	0.07

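Even without a temporal smoothing module, WiLoR improves on HaMeR by 2.4× in MPFVE and 3.4× in Jitter. Achieving this temporal coherence without any explicit temporal modeling is notable.

9. Ablation Study

Reconstruction Component Analysis (Table 5, FreiHAND)

Configuration	PA-MPJPE (mm) ↓	PA-MPVPE (mm) ↓	F@5mm ↑	F@15mm ↑
w. FastViT backbone	6.5	6.3	0.741	0.967
w/o ViTPose pretrain	5.9	5.7	0.795	0.989
w. Single-scale refinement	6.0	5.9	0.793	0.991
w/o Refinement module	6.1	5.8	0.795	0.991
w. FreiHAND only	6.1	5.8	0.793	0.990
Full model (WiLoR)	5.5	5.1	0.825	0.993

Key observations:

Backbone choice matters: Using ViT-L instead of FastViT improves PA-MPJPE by 1.0mm. The capacity difference translates into a performance difference.
ViTPose pretraining is critical: Starting from ViTPose weights rather than standard ViT-L weights gives a further 0.4mm improvement. The domain similarity between hand pose and body pose favors transfer.
Refinement module effect: 6.1mm without refinement → 5.5mm with it. Image-aligned residual prediction contributes a 0.6mm gain.
Multi-scale matters: Multi-scale refinement (5.5mm) improves on single-scale refinement (6.0mm) by a further 0.5mm.
WHIM data contribution: Full data (5.5mm) improves on FreiHAND-only training (6.1mm) by 0.6mm. Out-of-domain data helps even on a studio benchmark.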

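10. Limitations

Detection dependency: Reconstruction performance depends directly on upstream detection quality; when detection fails or produces false positives, reconstruction fails too
Tight crop requirement: Optimal performance requires tight crops in which the hand is sufficiently contained in the image
No explicit temporal modeling: Temporal coherence on video is strong, but the approach does not use explicit temporal context and may produce errors under rapid motion
Noisy automatic 3D ground truth: WHIM's 3D annotations are generated by an automated fitting pipeline and contain more noise than manual annotations
MANO representation limits: Performance degrades on extreme hand deformations or tool manipulation that the MANO parameter space cannot represent

11. Summary

WiLoR inherits HaMeR's “scaling hypothesis” and adds two important directions.

First, end-to-end integration. Detection and reconstruction are combined into a single pipeline, and the detector's own performance is improved at the same time. WiLoR-M achieves more accurate hand detection while being 45× faster.

Second, image-aligned refinement. Going beyond single-forward-pass regression, WiLoR introduces a coarse-to-fine structure that projects the initial prediction onto the image and corrects errors with image-aligned features. This design yields 0.5–0.6mm of quantitative improvement and, qualitatively, excellent temporal coherence.

The WHIM dataset is WiLoR's hidden infrastructure. The pipeline builds 2M automatically annotated samples from 1,400 YouTube videos without manual annotation, and integrates biomechanical constraints into the fitting process to minimize noise.

Comparing HaMeR and WiLoR:

	HaMeR	WiLoR
Reconstruction backbone	ViT-H	ViT-L + ViTPose
Refinement module	None	Multi-scale image-aligned
Detector	External dependency	Built-in (WiLoR-Det)
Training data	2.7M (10 datasets)	4.2M (14 datasets)
FreiHAND PA-MPJPE	6.0 mm	5.5 mm
Temporal coherence (Jitter)	20.43	5.92

“Unify detection and reconstruction, then return to the image to correct errors: the next step that WiLoR demonstrates.”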

[Paper Review] HaMeR: Reconstructing Hands in 3D with Transformers (CVPR 2024)

2026-05-24T00:00:00+00:00

Paper: Reconstructing Hands in 3D with Transformers
Venue: CVPR 2024
Authors: Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, Jitendra Malik
Affiliations: UC Berkeley, NYU
arXiv: 2312.05251
GitHub: geopavlakos/hamer

One-Line Summary

A fully transformer-based 3D hand mesh recovery model built on a ViT-H backbone and Transformer decoder, achieving 2–3× better in-the-wild generalization over prior methods via 2.7M training examples and a new benchmark (HInt).

1. Background and Problem Definition

Monocular 3D Hand Mesh Recovery

The task is estimating the full 3D shape and pose of a hand from a single RGB image. Complete hand meshes are a core input for AR/VR, human-computer interaction, robotics, and medical analysis.

Common failure modes of prior methods:

Brittle CNN backbones: Limited receptive fields and inductive biases cause failure on in-the-wild images
Small studio datasets: Training on controlled, small-scale data that does not reflect real environments
Inability to handle occlusions and interactions: Performance collapses under hand-hand, hand-object interaction, or heavy occlusion
Limited diversity: Robust only to specific skin tones, lighting, and viewpoints

Scaling Philosophy

HaMeR’s approach rests on a simple premise:

“Recent developments in computer vision and NLP point to the direction where advances are achieved by simple, high capacity models, powered by huge amounts of data.”

Rather than complex architectural design or domain-specific inductive biases, HaMeR tests the hypothesis that scaling both model capacity and data simultaneously works for 3D hand reconstruction as well.

2. Output Representation: MANO Hand Model

HaMeR uses MANO, a parametric hand model, as its output space.

MANO parameters:

Pose \(\theta \in \mathbb{R}^{48}\): Hand joint rotations (16 joints × 3 axis-angle values, including the global wrist orientation)
Shape \(\beta \in \mathbb{R}^{10}\): Per-identity hand shape variables
Camera \(\pi\): Weak-perspective camera translation

The full output \(\Theta = \{\theta, \beta, \pi\}\) deterministically yields a 778-vertex mesh and 21 joint locations.

Two reasons for using MANO as output: a compact parameter space eases optimization, and only physically plausible hand shapes can be produced.

3. Architecture

Full Pipeline

Single RGB image (hand bounding box crop)
    → ViT-H image encoder → patch token sequence
    → Transformer decoder (single query token, cross-attends to all patch tokens)
    → MANO parameter regression (θ, β, π)
    → MANO layer → 3D mesh + joint coordinates

Vision Transformer Huge (ViT-H) Backbone

Splits the image into fixed-size patches and produces a token sequence
Global self-attention captures full image context simultaneously
Fine-tuned from ImageNet-21K pretrained weights

The key advantage of ViT-H over CNN backbones is the global receptive field. Every layer attends to the full image, making it easier to reason about occluded or truncated hand regions.

Transformer Decoder Head

A single query token performs cross-attention over all ViT-H output patch tokens. The query aggregates the full image information and regresses MANO parameters.

The design is deliberately simple: a single forward pass produces the final output without iterative refinement or multi-stage regression.
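A single-query decoder head can be sketched as one cross-attention step followed by a linear regressor. The sketch below is a simplification under stated assumptions: one head, identity key/value projections, a single layer, and a 3-value weak-perspective camera; the function names and the 61-dimensional output layout are hypothetical, not taken from the HaMeR code.

```python
import numpy as np


def single_query_cross_attention(query: np.ndarray, patch_tokens: np.ndarray) -> np.ndarray:
    """query: (D,), patch_tokens: (T, D) -> pooled feature (D,).

    Identity key/value projections are used for brevity; a real decoder
    would apply learned projections, multiple heads, and several layers.
    """
    d = query.shape[0]
    scores = patch_tokens @ query / np.sqrt(d)  # (T,) similarity to each patch
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over patches
    return weights @ patch_tokens               # convex combination of patch tokens


def regress_mano(pooled: np.ndarray, w: np.ndarray, b: np.ndarray):
    """Linear head split into pose (48), shape (10), camera (3, assumed)."""
    out = pooled @ w + b
    return out[:48], out[48:58], out[58:61]
```

Because there is only one query, the attention weights form a single distribution over image patches, so the head behaves like a learned, content-dependent pooling of the ViT features.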

4. Loss Functions

Three losses are jointly optimized.

3D Loss (datasets with 3D ground truth)

\[\mathcal{L}_{3D} = \|\theta - \theta^*\|_2^2 + \|\beta - \beta^*\|_2^2 + \|X - X^*\|_1\]

L2 loss on pose and shape parameters plus L1 loss on 3D joint coordinates.

2D Reprojection Loss

\[\mathcal{L}_{2D} = \|x - x^*\|_1\]

L1 loss between projected 2D joint coordinates and 2D keypoint annotations. Enables training on datasets that have only 2D annotations and no 3D ground truth.

Adversarial Loss (for 2D-only data)

\[\mathcal{L}_{adv} = \sum_k (D_k(\Theta) - 1)^2\]

Three discriminators are used:

Full shape discriminator: Judges whether the overall MANO parameters correspond to a natural hand
Full pose discriminator: Judges the plausibility of the full hand pose
Per-joint discriminator: Judges individual finger joint angle naturalness

The adversarial loss suppresses unnatural hand poses that arise when training without 3D supervision.
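The three losses can be sketched in NumPy as below. Several choices are assumptions: all terms have unit weight, the 2D loss is applied to every sample, and the adversarial term is applied only to samples without 3D ground truth, following the heading above (whether it is also applied to 3D-annotated samples is not stated here). The dictionary keys are hypothetical.

```python
import numpy as np


def loss_3d(theta, theta_gt, beta, beta_gt, joints3d, joints3d_gt) -> float:
    """Squared L2 on MANO pose and shape plus L1 on 3D joints."""
    return float(np.sum((theta - theta_gt) ** 2)
                 + np.sum((beta - beta_gt) ** 2)
                 + np.sum(np.abs(joints3d - joints3d_gt)))


def loss_2d(joints2d, joints2d_gt) -> float:
    """L1 between projected joints and 2D keypoint annotations."""
    return float(np.sum(np.abs(joints2d - joints2d_gt)))


def loss_adv(disc_scores) -> float:
    """Least-squares generator loss: push every discriminator output towards 1."""
    return float(np.sum((np.asarray(disc_scores, dtype=float) - 1.0) ** 2))


def total_loss(pred: dict, target: dict, disc_scores, has_3d: bool) -> float:
    """Unit-weight combination (an assumption); 2D-only samples skip the 3D term."""
    loss = loss_2d(pred["joints2d"], target["joints2d"])
    if has_3d:
        loss += loss_3d(pred["theta"], target["theta"], pred["beta"], target["beta"],
                        pred["joints3d"], target["joints3d"])
    else:
        loss += loss_adv(disc_scores)
    return loss
```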

5. Training Data Scaling

2.7M Training Examples

The training set is 4× larger than that of the FrankMocap baseline and combines ten heterogeneous datasets.

Datasets with 3D annotations:

Dataset	Characteristics
FreiHAND	Studio, single hand
HO-3D	Hand-object interaction
MTC (Panoptic Studio)	Multi-camera capture
RHD	Synthetic data
InterHand2.6M	Two-hand interaction
H2O3D	Hand-object interaction
DexYCB	Hand-object manipulation

Datasets with 2D annotations only:

Dataset	Characteristics
COCO WholeBody	Natural environments
Halpe	Person photography
MPII NZSL	Sign language

For 2D-only datasets, only the reprojection and adversarial losses are applied — no 3D loss. This allows in-the-wild data that lacks 3D ground truth to contribute to training.

6. HInt Dataset: A New In-the-Wild Benchmark

Limitations of Existing Benchmarks

Benchmarks like FreiHAND and HO3Dv2 are collected in controlled environments. They cannot adequately measure generalization to real-world conditions (egocentric video, hand-object interaction, varied lighting).

HInt (Hand Interactions in the Wild)

A new in-the-wild benchmark with 40,400 annotated hands.

Key features:

2D keypoints for 21 joints + per-keypoint occlusion labels (first dataset to provide these)
86.7% of hands are in contact scenarios
90.5% inter-annotator agreement on occlusion labels
94.6% of visible keypoints within 0.25× palm length across annotators

Three sources:

Source	Count	Characteristics
Hands23 (New Days of Hands)	12.0K	Third-person, natural environments
Epic-Kitchens VISOR	5.3K	Egocentric, kitchen settings
Ego4D	23.2K	Egocentric, diverse activities

Being the first large-scale in-the-wild hand dataset to provide occlusion labels is significant — it enables separate measurement of performance on occluded vs. visible joints.

7. Experimental Results

FreiHAND Benchmark (Table 1)

Method	PA-MPJPE (mm) ↓	PA-MPVPE (mm) ↓	F@5mm ↑	F@15mm ↑
I2L-MeshNet	7.4	7.6	0.681	0.973
MobRecon	5.7	5.8	0.784	0.987
HaMeR	6.0	5.7	0.785	0.990

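HaMeR is competitive with the state of the art on FreiHAND: it leads on PA-MPVPE and both F-scores, while MobRecon has the lower PA-MPJPE (5.7 vs. 6.0 mm). On this studio benchmark the differences between top methods are marginal.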

HO3Dv2 Benchmark (Table 2)

Method	AUCⱼ ↑	PA-MPJPE (mm) ↓	AUCᵥ ↑
HandOccNet	0.831	8.8	—
AMVUR	0.835	8.3	0.836
HaMeR	0.846	7.7	0.841

Best on all metrics for HO3Dv2, which includes hand-object interaction.

HInt Benchmark: PCK@0.05 (Table 3) — Core Result

Method	New Days	VISOR	Ego4D
FrankMocap	16.1%	16.8%	13.1%
HandOccNet (param)	9.1%	8.1%	7.7%
HaMeR	48.0%	43.0%	38.9%

HaMeR improves PCK@0.05 by roughly 2.6–3× over FrankMocap, the strongest listed baseline, on all three in-the-wild splits, and by about 5× over HandOccNet. This is the paper’s strongest claim.

Breakdown by occlusion status (VISOR):

Split	HaMeR
Visible joints	56.6%
Occluded joints	25.9%
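For readers unfamiliar with the metric, PCK counts a keypoint as correct when its 2D error is below a threshold fraction of a reference length. The sketch below normalizes by a hand bounding-box size, which is an assumption for illustration, and uses a mask to evaluate a subset of joints such as only the visible or only the occluded ones.

```python
import math

def pck(pred, gt, box_size, threshold=0.05, mask=None):
    """Fraction of selected keypoints whose error is within threshold * box_size."""
    if mask is None:
        mask = [True] * len(gt)
    limit = threshold * box_size
    hits = total = 0
    for (px, py), (gx, gy), keep in zip(pred, gt, mask):
        if not keep:
            continue  # joint excluded from this split
        total += 1
        if math.hypot(px - gx, py - gy) <= limit:
            hits += 1
    return hits / total if total else float("nan")

# Hypothetical toy data: three keypoints, a 100-pixel box, so the limit is 5 pixels.
pred = [(10.0, 10.0), (50.0, 50.0), (80.0, 80.0)]
gt = [(12.0, 10.0), (50.0, 60.0), (80.0, 84.0)]
all_joints = pck(pred, gt, box_size=100.0)
subset = pck(pred, gt, box_size=100.0, mask=[True, True, False])
```

Per-keypoint occlusion labels are what make the masked variant possible, which is how a visible/occluded breakdown like the one above can be reported.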

8. Ablation: Data Scale vs. Model Scale

Independent Contributions and Synergy (Table 5)

Config	Large Data	Large Model	New Days	VISOR	Ego4D
FrankMocap	✗	✗	16.1%	16.8%	13.1%
Base (ResNet50)	✗	✗	16.9%	17.5%	13.9%
+ Large data only	✓	✗	31.3%	29.9%	24.7%
+ Large model only	✗	✓	25.9%	24.1%	19.4%
HaMeR (both)	✓	✓	48.0%	43.0%	38.9%

Key observation (New Days, relative to the ResNet50 base at 16.9%): large data alone gives +14.4%p and a large model alone gives +9.0%p, but together they give +31.1%p. That is larger than the +23.4%p sum of the independent contributions, so data scale and model capacity amplify each other.
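The gains can be recomputed directly from the New Days column of Table 5. The dictionary keys below are just labels chosen for this sketch.

```python
# New Days PCK@0.05 values copied from Table 5 (percent).
new_days = {"base": 16.9, "data_only": 31.3, "model_only": 25.9, "both": 48.0}

gain_data = round(new_days["data_only"] - new_days["base"], 1)
gain_model = round(new_days["model_only"] - new_days["base"], 1)
gain_both = round(new_days["both"] - new_days["base"], 1)

# Gain beyond what adding the two individual gains would predict.
additive = round(gain_data + gain_model, 1)
synergy = round(gain_both - additive, 1)

print(gain_data, gain_model, gain_both, additive, synergy)
```

The difference between the combined gain and the additive prediction is the part attributable to the interaction between data scale and model scale.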

Effect of HInt Training Data (Table 4)

After fine-tuning with HInt’s training split:

Dataset	Without HInt	With HInt	Improvement
VISOR (all)	43.0%	56.5%	+13.5%p
VISOR (visible)	56.6%	66.5%	+9.9%p
VISOR (occluded)	25.9%	42.6%	+16.7%p
Ego4D (all)	38.9%	46.9%	+8.0%p

The gain on occluded joints (+16.7%p) is larger than on visible joints (+9.9%p), which suggests that HInt training data, including its occlusion labels, is especially helpful for occlusion handling.

9. Qualitative Generalization

Scenarios where HaMeR demonstrates robustness:

Egocentric and third-person viewpoints
Hand-hand and hand-object interactions with occlusion
Motion blur, diverse lighting conditions
Diverse skin tones
Non-standard appearances (gloves, robotic hands, illustrations)
Temporally smooth video output from per-frame inference (no temporal smoothing applied)

10. Limitations

Spurious detections: False positives from the upstream hand detector propagate through the pipeline
Left/right classification errors: Occasional misclassification of hand side
Extreme poses: Performance degrades on highly unusual finger configurations
Severe occlusion: Improved by HInt training but still challenging under complete occlusion
No temporal modeling: Single-frame approach with no explicit temporal consistency
No 3D GT for in-the-wild data: Only 2D PCK evaluation is possible; 3D quantification is unavailable in-the-wild

11. Summary

HaMeR makes one central claim: in 3D hand reconstruction, model and data scale matter more than architectural complexity.

Concretely:

A simple pipeline: ViT-H + Transformer decoder
2.7M training examples from 10 heterogeneous datasets
A new in-the-wild benchmark, HInt (40.4K hands with occlusion labels)

Together, these three elements yield roughly 2.6–3× better in-the-wild PCK than the strongest prior baseline. The synergy between data scale and model capacity is particularly striking: combining them outperforms the sum of their individual contributions.

HaMeR suggests that the scaling trends familiar from LLMs also carry over to 3D hand reconstruction.

“Instead of complex inductive biases — a large enough model with enough data. Hand reconstruction is no exception.”

[Paper Review] HaMeR: Reconstructing Hands in 3D with Transformers (CVPR 2024)

2026-05-24T00:00:00+00:00

Paper: Reconstructing Hands in 3D with Transformers
Venue: CVPR 2024
Authors: Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, Jitendra Malik
Affiliations: UC Berkeley, NYU
arXiv: 2312.05251
GitHub: geopavlakos/hamer

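One-Line Summary

A fully transformer-based 3D hand mesh recovery model built from a ViT-H backbone and a Transformer decoder. With 2.7M training examples and a new in-the-wild benchmark (HInt), it improves real-world generalization by 2–3× over prior work.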

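1. Background and Problem Definition

Monocular 3D Hand Mesh Recovery

The task is to estimate the 3D shape and pose of a hand from a single RGB image. The recovered hand mesh serves as a key input for applications such as AR/VR, human-computer interaction, robotics, and medical analysis.

Problems shared by existing methods:

Weak CNN backbones: Limited receptive fields and inductive biases lead to poor in-the-wild generalization
Small-scale studio data: Training on small datasets collected in controlled environments fails to reflect real-world conditions
Poor handling of occlusion and interaction: Performance drops sharply under hand-hand interaction, hand-object interaction, and extreme occlusion
Limited diversity: Robust only to particular skin tones, lighting, and viewpoints

The Scaling Philosophy

The core of HaMeR’s approach is simple.

“Recent advances in computer vision and NLP suggest that simple, high-capacity models trained on large amounts of data drive progress.”

In other words, HaMeR tests the hypothesis that scaling up model size and data size together, rather than designing complex architectures or domain-specific inductive biases, also works for 3D hand reconstruction.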

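2. Output Representation: The MANO Hand Model

HaMeR uses MANO, a parametric hand model, as its output space.

MANO parameters:

Pose \(\theta \in \mathbb{R}^{48}\): hand joint rotations (16 rotations × 3 axis-angle values: global orientation plus 15 joints)
Shape \(\beta \in \mathbb{R}^{10}\): per-person hand shape coefficients
Camera \(\pi\): weak-perspective camera translation

From the final output \(\Theta = \{\theta, \beta, \pi\}\), a 778-vertex mesh and 21 joint locations are computed deterministically.

There are two reasons for using MANO as the output. First, the compact parameter space makes learning easier. Second, only physically plausible hand shapes are produced.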
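As a small illustration of how compact this output space is, the sketch below splits a flat regressor output into the three parameter groups. The 48 and 10 dimensions come from the text; the 3-value camera vector, the class name, and the function name are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List

POSE_DIM, SHAPE_DIM = 48, 10
CAM_DIM = 3  # assumption: the camera is represented by three values

@dataclass
class ManoParams:
    theta: List[float]  # pose
    beta: List[float]   # shape
    cam: List[float]    # camera

def split_regressor_output(vec: List[float]) -> ManoParams:
    """Slice a flat output vector into pose, shape, and camera parts."""
    expected = POSE_DIM + SHAPE_DIM + CAM_DIM
    if len(vec) != expected:
        raise ValueError(f"expected {expected} values, got {len(vec)}")
    return ManoParams(
        theta=vec[:POSE_DIM],
        beta=vec[POSE_DIM:POSE_DIM + SHAPE_DIM],
        cam=vec[POSE_DIM + SHAPE_DIM:],
    )

params = split_regressor_output([0.0] * 61)
```

Under these assumptions the whole hand is described by 61 numbers, from which the MANO layer derives all 778 vertices and 21 joints.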

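3. Architecture

Full Pipeline

Single RGB image (cropped to the hand bounding box)
    → ViT-H image encoder → patch token sequence
    → Transformer decoder (single query token, cross-attends to all patch tokens)
    → MANO parameter regression (θ, β, π)
    → MANO layer → 3D mesh + joint coordinates

Vision Transformer Huge (ViT-H) Backbone

Splits the image into fixed-size patches to produce a token sequence
Global self-attention captures context from the whole image at once
Fine-tuned from ImageNet-21K pretrained weights

The key advantage of ViT-H over CNN backbones is its global receptive field. Because it can attend to the whole image from the very first layer, it is better suited to inferring the shape of occluded or hidden hand regions.

Transformer Decoder Head

A single query token cross-attends to all of ViT-H’s output patch tokens. The query token aggregates information from the whole image and regresses the MANO parameters.

The essence of this design is simplicity: it produces the final output in a single forward pass, with no multi-stage regression or iterative refinement.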

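4. Loss Functions

Three losses are optimized jointly.

3D Loss (datasets with 3D ground truth)

\[\mathcal{L}_{3D} = \|\theta - \theta^*\|_2^2 + \|\beta - \beta^*\|_2^2 + \|X - X^*\|_1\]

In addition to the L2 error on the pose and shape parameters, the L1 error on the ground-truth 3D joint coordinates is used as supervision.

2D Reprojection Loss

\[\mathcal{L}_{2D} = \|x - x^*\|_1\]

The L1 error between the 2D coordinates obtained by projecting the predicted 3D joints onto the image plane and the ground-truth 2D keypoints. It makes training possible on datasets that have only 2D annotations and no 3D ground truth.

Adversarial Loss (for 2D-only data)

\[\mathcal{L}_{adv} = \sum_k (D_k(\Theta) - 1)^2\]

Three kinds of discriminators are used:

Full shape discriminator: Judges whether the overall MANO parameters describe a natural hand
Full pose discriminator: Judges the naturalness of the full hand pose
Per-joint discriminator: Judges the naturalness of each finger joint angle

The adversarial loss suppresses the unrealistic hand poses that arise when training on 2D annotations alone, without 3D ground truth.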

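5. Training Data Scaling

2.7M Training Examples

The training set is four times the size of FrankMocap’s and mixes ten heterogeneous datasets.

Datasets with 3D annotations:

Dataset	Characteristics
FreiHAND	Studio, single hand
HO-3D	Hand-object interaction
MTC (Panoptic Studio)	Multi-camera capture
RHD	Synthetic data
InterHand2.6M	Two-hand interaction
H2O3D	Hand-object interaction
DexYCB	Hand-object manipulation

Datasets with 2D annotations only:

Dataset	Characteristics
COCO WholeBody	Natural environments
Halpe	Person photographs
MPII NZSL	Sign language

For the 2D-only datasets, only the reprojection and adversarial losses are applied, with no 3D loss. This lets in-the-wild data, for which 3D ground truth is hard to obtain, contribute to training.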

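6. HInt Dataset: A New In-the-Wild Benchmark

Limitations of Existing Benchmarks

Existing benchmarks such as FreiHAND and HO3Dv2 are collected in controlled environments, so they are poorly suited to measuring generalization to real-world conditions (egocentric video, hand-object interaction, varied lighting).

HInt (Hand Interactions in the Wild)

A new in-the-wild benchmark consisting of 40,400 annotated hands.

Key features:

2D keypoints for 21 joints plus occlusion labels (the first dataset to provide them)
86.7% of all hands are in contact scenarios
90.5% inter-annotator agreement on occlusion labels
94.6% of visible keypoints agree across annotators to within 0.25× the palm length

Three sources:

Source	Count	Characteristics
Hands23 (New Days of Hands)	12.0K	Third-person, natural environments
Epic-Kitchens VISOR	5.3K	Egocentric, kitchen settings
Ego4D	23.2K	Egocentric, diverse activities

What matters is that this is the first large-scale in-the-wild hand dataset to provide occlusion labels, which makes it possible to measure performance under occlusion separately.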

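7. Experimental Results

FreiHAND Benchmark (Table 1)

Method	PA-MPJPE (mm) ↓	PA-MPVPE (mm) ↓	F@5mm ↑	F@15mm ↑
I2L-MeshNet	7.4	7.6	0.681	0.973
MobRecon	5.7	5.8	0.784	0.987
HaMeR	6.0	5.7	0.785	0.990

On FreiHAND, HaMeR performs at the level of the best prior methods: it leads on PA-MPVPE and both F-scores, while MobRecon has the lower PA-MPJPE (5.7 vs. 6.0 mm). On studio data the margin over existing methods is small or nil.

HO3Dv2 Benchmark (Table 2)

Method	AUCⱼ ↑	PA-MPJPE (mm) ↓	AUCᵥ ↑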
HandOccNet	0.831	8.8	—
AMVUR	0.835	8.3	0.836
HaMeR	0.846	7.7	0.841

Best performance on every metric on HO3Dv2, which includes hand-object interaction.

HInt Benchmark, PCK@0.05 (Table 3): Core Result

Method	New Days	VISOR	Ego4D
FrankMocap	16.1%	16.8%	13.1%
HandOccNet (param)	9.1%	8.1%	7.7%
HaMeR	48.0%	43.0%	38.9%

HaMeR improves on the best prior method by 2–3×. This result is the strongest claim of the HaMeR paper.

Breakdown by occlusion status (VISOR):

Split	HaMeR
All visible joints	56.6%
All occluded joints	25.9%

8. Ablation: Data Scale vs. Model Scale

Independent Contributions and Synergy (Table 5)

Config	Large Data	Large Model	New Days	VISOR	Ego4D
FrankMocap	✗	✗	16.1%	16.8%	13.1%
Base (ResNet50)	✗	✗	16.9%	17.5%	13.9%
+ Large data only	✓	✗	31.3%	29.9%	24.7%
+ Large model only	✗	✓	25.9%	24.1%	19.4%
HaMeR (both)	✓	✓	48.0%	43.0%	38.9%

Worth noting (New Days, relative to the ResNet50 base): large data alone adds +14.4%p and a large model alone adds +9.0%p, but using both adds +31.1%p, more than the +23.4%p sum of the two. This shows that data scale and model scale amplify each other.

Effect of HInt Training Data (Table 4)

After additional fine-tuning on HInt’s training split:

Dataset	Without HInt	With HInt	Improvement
VISOR (all)	43.0%	56.5%	+13.5%p
VISOR (visible)	56.6%	66.5%	+9.9%p
VISOR (occluded)	25.9%	42.6%	+16.7%p
Ego4D (all)	38.9%	46.9%	+8.0%p

The gain on occluded joints (+16.7%p) is much larger than on visible joints (+9.9%p), which suggests that HInt’s occlusion labels contribute directly to better occlusion handling.

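9. Qualitative Generalization

Scenarios in which HaMeR shows robustness:

Egocentric and third-person footage
Hand-hand and hand-object interaction, and occlusion
Motion blur and varied lighting
Diverse skin tones
Non-standard appearances (gloves, robotic hands, illustrations, and so on)
Smooth output on video without any temporal smoothing (per-frame inference)

10. Limitations

False detections: False positives from the upstream hand detector affect the whole pipeline
Left/right classification errors: The hand side is sometimes misclassified
Extreme poses: Performance degrades on highly unnatural finger configurations
Severe occlusion: Improved by HInt training, but complete occlusion remains difficult
No temporal modeling: A single-frame approach with no explicit temporal consistency
No 3D ground truth: 3D quantitative evaluation is not possible on in-the-wild data (only 2D PCK)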

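11. Summary

HaMeR makes one central claim: in 3D hand reconstruction, model and data scale matter more than architectural complexity.

Concretely:

A simple pipeline: ViT-H + Transformer decoder
2.7M training examples mixed from 10 datasets
A new in-the-wild benchmark, HInt (40.4K hands, with occlusion labels)

Combined, these three elements achieve a 2–3× improvement over prior methods in in-the-wild settings. The synergy between data scale and model scale is especially impressive: using them together yields a larger gain than simply adding their individual contributions.

HaMeR is a case suggesting that LLM-style scaling trends also hold in 3D hand reconstruction.

“Instead of complex inductive biases, give a large enough model enough data. Hand reconstruction is no exception.”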

[Paper Review] π³: Permutation-Equivariant Visual Geometry Learning (ICLR 2026)

2026-05-23T00:00:00+00:00

Paper: π³: Permutation-Equivariant Visual Geometry Learning
Venue: ICLR 2026
Authors: Yifan Wang*, Jianjun Zhou*, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, Tong He
Affiliations: Shanghai Jiao Tong University, Shanghai AI Laboratory, Shanghai Innovation Institute, Zhejiang University, USTC, Fudan University
arXiv: 2507.13347
GitHub: yyfz/Pi3

One-Line Summary

A permutation-equivariant 3D reconstruction model that completely eliminates the fixed reference view relied on by prior methods, predicting camera poses and point maps independent of input image ordering.

1. Background and Problem Definition

Reference View Bias

Modern feed-forward 3D reconstruction methods — including DUSt3R, MASt3R, and VGGT — share a common inductive bias: the first image is used as the anchor (world coordinate origin).

This seemingly natural choice creates structural weaknesses:

Order dependence: Results change with input ordering. A blurry or partially occluded first image destabilizes the entire reconstruction.
Asymmetric processing: The reference view receives structural privilege through special tokens or fixed coordinate systems.
Failure modes: A suboptimal reference view can cause the entire reconstruction to fail.

VGGT fixes the first camera as identity; DUSt3R/MASt3R use explicit reference views during pairwise processing. π³ is the first work to quantify how much this matters.

When the same scene is fed to VGGT with different frame orderings, the standard deviation of reconstruction accuracy reaches 0.033 cm. π³ reduces this to 0.003 cm, roughly an order of magnitude (about 11×) more stable.

π³’s Question

“Can accurate and stable 3D reconstruction be achieved without any reference view?”

2. Core Idea

Permutation Equivariance

The central claim of π³ is one mathematical property:

\[f(\sigma(\mathbf{I})) = \sigma(f(\mathbf{I})) \quad \forall \sigma \in S_N\]

Applying any permutation σ to the input image set should yield the same permutation applied to the output — every image is processed identically regardless of its position in the sequence.
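The property can be checked numerically on toy functions. In the sketch below, each "view" is reduced to a single number; `toy_model` treats all views symmetrically, while `reference_model` anchors everything to the first view. Both functions are hypothetical stand-ins for illustration, not the actual networks.

```python
from itertools import permutations

def toy_model(views):
    """Symmetric: each output is the view's feature minus the mean over all views."""
    mean = sum(views) / len(views)
    return [v - mean for v in views]

def reference_model(views):
    """Order-dependent: every output is expressed relative to the first view."""
    return [v - views[0] for v in views]

def is_permutation_equivariant(f, views, tol=1e-9):
    """Check f(sigma(I)) == sigma(f(I)) for every permutation sigma."""
    base = f(views)
    for perm in permutations(range(len(views))):
        permuted_out = f([views[i] for i in perm])
        expected = [base[i] for i in perm]
        if any(abs(a - b) > tol for a, b in zip(permuted_out, expected)):
            return False
    return True

views = [1.0, 4.0, 7.0]
print(is_permutation_equivariant(toy_model, views))        # symmetric model
print(is_permutation_equivariant(reference_model, views))  # first-view anchor
```

The reference-anchored function fails the check because its outputs change meaning as soon as a different view lands in the first slot, which is the order dependence described above.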

Two design choices enforce this:

Affine-Invariant Camera Poses: Rotation is predicted in SO(3); translation is expressed without a global anchor point, using an affine-invariant representation.
Scale-Invariant Local Point Maps: Per-view point maps are predicted in local camera coordinates. Scale ambiguity is absorbed into the invariant representation rather than resolved by a reference frame.

Both choices together make the full pipeline permutation-equivariant.
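To illustrate what scale invariance means for point maps, the sketch below aligns a prediction to the ground truth with a single scalar before measuring the error. It uses the closed-form least-squares scale as a simplification; that choice, along with the function names and toy points, is an assumption for illustration, and the exact alignment used in training may differ.

```python
def optimal_scale(pred, gt):
    """Least-squares scalar s minimizing sum ||s * p - g||^2."""
    num = sum(p * g for P, G in zip(pred, gt) for p, g in zip(P, G))
    den = sum(p * p for P in pred for p in P)
    return num / den

def scale_invariant_l1(pred, gt):
    """Mean per-point L1 error after aligning pred to gt with one scalar."""
    s = optimal_scale(pred, gt)
    total = sum(abs(s * p - g) for P, G in zip(pred, gt) for p, g in zip(P, G))
    return total / len(pred)

# Hypothetical toy points: the prediction has the right shape at half the scale.
gt = [(1.0, 2.0, 3.0), (2.0, 0.0, 1.0)]
pred = [(0.5, 1.0, 1.5), (1.0, 0.0, 0.5)]
scale = optimal_scale(pred, gt)
err = scale_invariant_l1(pred, gt)
err_rescaled = scale_invariant_l1([tuple(10.0 * c for c in P) for P in pred], gt)
```

Rescaling the prediction by any positive factor leaves the error unchanged, so the network is never penalized for the global scale it cannot observe.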

Fundamental Differences from Prior Work

Aspect	DUSt3R / MASt3R	VGGT	π³
Reference view	Required (pairwise)	Required (first = identity)	None
Order dependence	High	High	None
Pose representation	Absolute coordinates	Absolute coordinates	Affine-invariant
Point map	World coordinates	World coordinates	Scale-invariant local

3. Architecture

Full Pipeline

N input images (B × N × 3 × H × W)
    → DINOv2 patch encoder → image token sequences
    → Alternating-Attention Transformer (36 layers)
    ├─→ Camera Head    → affine-invariant camera poses (SE(3))
    ├─→ Point Map Head → scale-invariant local point maps
    └─→ Confidence Head → confidence maps

Alternating-Attention (36 Layers)

Similar to VGGT’s structure but with 36 layers instead of 24, and with all order-dependent components removed:

View-wise Self-Attention: Tokens within the same image attend to each other. Extracts per-frame spatial features.
Global Self-Attention: All tokens across all frames attend to each other. Learns cross-view 3D consistency.

Removed elements (to guarantee permutation equivariance):

Frame index positional embeddings (no order information injected)
Reference-view special tokens (no reference token exists)
Cross-attention between views (no asymmetric processing)

Camera Head

Predicts SE(3) matrices (4×4) in affine-invariant form. The first frame is NOT fixed as identity; only relative relationships between predicted poses are used as supervision.
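Supervising only relative relationships works because a relative pose does not change when every camera is moved by the same global transform. The sketch below verifies this for rigid transforms built from a yaw angle and a translation; the helper names are assumptions for illustration, and the scale part of the ambiguity is not covered here.

```python
import math

def rigid(yaw, tx, ty, tz):
    """4x4 rigid transform: rotation about z by yaw, then translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, tx], [s, c, 0.0, ty], [0.0, 0.0, 1.0, tz], [0.0, 0.0, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def inverse_rigid(t):
    """Inverse of a rigid transform: rotation R^T, translation -R^T t."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    trans = [-sum(r_t[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [r_t[i] + [trans[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def relative_pose(t_i, t_j):
    """Pose of camera j expressed in the frame of camera i."""
    return matmul(inverse_rigid(t_i), t_j)

cam_a = rigid(0.3, 1.0, 0.0, 2.0)
cam_b = rigid(-0.7, 0.5, 1.5, 0.0)
world_shift = rigid(1.1, -4.0, 2.0, 3.0)  # arbitrary change of world frame

before = relative_pose(cam_a, cam_b)
after = relative_pose(matmul(world_shift, cam_a), matmul(world_shift, cam_b))
max_diff = max(abs(before[i][j] - after[i][j]) for i in range(4) for j in range(4))
```

Since the global transform cancels in the product, no camera needs to be designated as the origin for the supervision signal to be well defined.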

Point Map Head

Predicts 3D points per pixel in local camera coordinates — not world coordinates. This is the key: consistency is maintained regardless of which view is “first.”

Pi3X Extension (December 2025)

A follow-up release added:

Convolutional Output Head: Eliminates grid-like artifacts from MLP upsampling
Conditional inputs: Optional injection of camera poses, intrinsics, and depth
Approximate metric scale: Metric-scale reconstruction capability

4. Training

Loss Function

\[\mathcal{L} = \mathcal{L}_{\text{point}} + \lambda_{\text{normal}} \mathcal{L}_{\text{normal}} + \lambda_{\text{conf}} \mathcal{L}_{\text{conf}} + \lambda_{\text{cam}} \mathcal{L}_{\text{cam}} + \lambda_{\text{trans}} \mathcal{L}_{\text{trans}}\]

Loss	Role	Weight
\(\mathcal{L}_{\text{point}}\)	Scale-invariant point map L1 (after optimal scale alignment)	1.0
\(\mathcal{L}_{\text{normal}}\)	Surface normal angular error	1.0
\(\mathcal{L}_{\text{conf}}\)	Confidence map BCE	0.05
\(\mathcal{L}_{\text{cam}}\)	Geodesic rotation error	0.1
\(\mathcal{L}_{\text{trans}}\)	Scaled translation error	100.0
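The geodesic rotation error in the table is the angle of the relative rotation between prediction and ground truth, which follows from the trace identity \(\cos\phi = (\mathrm{tr}(R_a^\top R_b) - 1)/2\). A minimal sketch, with helper names chosen for this example:

```python
import math

def rot_z(angle):
    """3x3 rotation about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def geodesic_angle(r_a, r_b):
    """Angle in radians of the relative rotation R_a^T R_b."""
    # trace(R_a^T R_b) equals the sum of elementwise products.
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)

angle = geodesic_angle(rot_z(0.3), rot_z(0.5))
same = geodesic_angle(rot_z(0.3), rot_z(0.3))
```

Unlike an elementwise matrix difference, this error depends only on the relative rotation, so it is unaffected by how the rotations are parameterized.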

Two-Stage Training

Stage	Resolution	GPU	Notes
Stage 1	224×224 fixed	16× A100	DINOv2 encoder frozen
Stage 2	100K–255K pixels variable	64× A100	Full network trained

80 epochs per stage, 800 iterations/epoch. Total parameters: 959M (24% lighter than VGGT’s 1.26B).

Training Data (15 Datasets)

CO3D, ScanNet, TartanAir, Habitat, and synthetic renderings spanning indoor, outdoor, and dynamic scenes.

5. Experimental Results

Camera Pose Estimation

RealEstate10K (Zero-shot):

Method	RTA@30	AUC@30
DUSt3R	76.1	67.7
MASt3R	—	76.4
VGGT	93.13	77.62
π³	95.62	85.90

Sintel (lower is better):

Method	ATE (↓)	RPE trans (↓)
VGGT	0.167	0.062
π³	0.074	0.040

π³ reduces ATE by about 56% relative to VGGT on Sintel (0.167 → 0.074).

Point Map Reconstruction

DTU Dataset (cm, lower is better):

Method	Accuracy (↓)	Completion (↓)	Normal Consistency (↑)
DUSt3R	1.620	2.241	0.640
MASt3R	1.406	2.015	0.662
VGGT	1.338	1.896	0.676
π³	1.198	1.849	0.678

7-Scenes (Dense Views, cm):

Method	Accuracy (↓)	Completion (↓)
VGGT	0.022	0.026
π³	0.016	0.022

Video Depth Estimation — KITTI

Metric	π³	VGGT	Gain
Abs Rel (↓)	0.037	0.052	29%
δ<1.25 (↑)	0.986	0.968	+1.8 pts
FPS (↑)	57.4	43.2	33% faster
Parameters	959M	1.26B	24% lighter
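The Gain column can be reproduced from the raw numbers in the table. The helper names below are chosen for this sketch; note that the δ<1.25 gain is an absolute difference in percentage points, while the others are relative changes.

```python
def relative_reduction(new, old):
    """Fractional decrease from old to new (for lower-is-better quantities)."""
    return (old - new) / old

def relative_increase(new, old):
    """Fractional increase from old to new (for higher-is-better quantities)."""
    return (new - old) / old

abs_rel_gain = relative_reduction(0.037, 0.052)   # Abs Rel, pi3 vs. VGGT
fps_gain = relative_increase(57.4, 43.2)          # frames per second
param_gain = relative_reduction(959e6, 1.26e9)    # parameter count
delta_points = (0.986 - 0.968) * 100              # percentage points

print(f"{abs_rel_gain:.1%} {fps_gain:.1%} {param_gain:.1%} {delta_points:.1f} pts")
```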

Permutation Robustness

Standard deviation of DTU Accuracy across different frame orderings of the same scene:

Method	std (↓)
VGGT	0.033
π³	0.003

π³ is roughly 11× more stable than VGGT (0.003 vs. 0.033), which supports the permutation-equivariant design.

6. Comparison with VGGT

Aspect	VGGT	π³
Architecture	24-layer alternating attention	36-layer alternating attention
Reference view	First frame = identity (required)	None
Order dependence	Present (std 0.033)	None (std 0.003)
Pose representation	Absolute coordinates	Affine-invariant
Point map	World coordinates	Scale-invariant local
Parameters	1.26B	959M
Inference FPS (KITTI)	43.2	57.4
RealEstate10K AUC@30	77.62	85.90
DTU Accuracy (cm)	1.338	1.198
Sintel ATE	0.167	0.074

The key takeaway: π³ is smaller, faster, and more accurate than VGGT. Removing the reference view bias appears to be the main driver; the invariance ablation in the next section isolates its contribution to accuracy.

7. Ablation: Effect of Affine/Scale Invariance

Model	ETH3D Acc.	7-Scenes Acc.	NRGBD Acc.
Baseline (no invariance)	0.229	0.020	0.034
+ Scale-invariant point	0.197	0.020	0.031
+ Affine-invariant camera (full)	0.131	0.019	0.028

Affine-invariant camera modeling is the dominant contributor. Scale-invariant geometry shows pronounced benefits on outdoor datasets, with more modest gains indoors.

8. Limitations

Transparent objects: Simplified light transport assumptions preclude handling of transparent or reflective surfaces.
Grid-like artifacts: MLP-based point cloud upsampling produces visible grid patterns in uncertain regions (partially addressed in Pi3X with convolutional heads).
Fine-grained detail: Falls short of diffusion-based reconstruction in high-frequency detail.
Dynamic scenes: No explicit handling of non-rigid motion beyond training data diversity.

9. Summary

π³ answers one question: “Is a reference view actually necessary?”

The answer is no. Removing the reference view and designing a permutation-equivariant architecture yielded a model that is smaller, faster, and more accurate, with roughly an order of magnitude better stability under input reordering.

“The reference view was a convenience, not a necessity.”

If VGGT showed that “processing all views together at once” beats DUSt3R’s pairwise approach, π³ shows that “eliminating the hierarchy among views (reference frame)” is the next step forward.

The inductive biases of 3D reconstruction are being removed one by one. What comes next?