[Paper Review] WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild (CVPR 2025)

Paper: WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild
Venue: CVPR 2025
Authors: Rolandos Alexandros Potamias, Jinglei Zhang, Jiankang Deng, Stefanos Zafeiriou
Affiliations: Imperial College London (Potamias, Deng, Zafeiriou), Shanghai Jiao Tong University (Zhang)
arXiv: 2409.12259
GitHub: rolpotamias/WiLoR

One-Line Summary

An end-to-end 3D hand reconstruction pipeline combining a real-time fully convolutional hand detector with a ViT-Large + multi-scale refinement module, achieving state-of-the-art on FreiHAND and HO3Dv2 while delivering 2–3× better temporal coherence than HaMeR, trained on 4.2M images including the new 2M+ in-the-wild WHIM dataset.

1. Background and Problem Definition

The Fragmented Two-Stage Pipeline

3D hand reconstruction systems have traditionally been developed as two separate stages: a hand detector finds the hand region, and a separate pose estimation model recovers the 3D mesh from the detected crop. This separation introduces three problems:

Detection bottleneck: False positives or misses from the upstream detector directly limit full-pipeline performance
Detection speed: Prior hand detectors (ContactHands: 3 FPS) are unsuitable for real-time applications
Temporal incoherence: Single-frame pose estimation causes inter-frame jitter in video

Additionally, prior methods regress MANO parameters in a single pass with no refinement step, so the predicted mesh is often poorly aligned with the image. Where HaMeR regresses directly from a single query token, WiLoR first makes an initial prediction and then refines it with a residual predicted from image-aligned features.

WiLoR’s Core Proposal

WiLoR responds to these issues with three components:

Real-time hand detection: Fully convolutional detection network at 130+ FPS
High-fidelity 3D reconstruction: ViT-Large backbone with a multi-scale image-aligned refinement module
Large-scale in-the-wild data: 2M+ automatically annotated images (WHIM dataset)

2. Output Representation: MANO Hand Model

Like HaMeR, WiLoR uses the parametric hand model MANO as its output space.

MANO parameters:

Pose \(\theta \in \mathbb{R}^{48}\): Global wrist orientation plus finger joint rotations (16 joints × 3 values; PCA-based)
Shape \(\beta \in \mathbb{R}^{10}\): Per-identity hand shape variables
Camera \(K_{cam}\): Weak-perspective camera parameters

The full output \(\{\theta, \beta, K_{cam}\}\) deterministically yields a 778-vertex mesh \(V_{3D}\) and 21 joint locations \(J_{3D}\).
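To make the role of \(K_{cam}\) concrete, here is a minimal sketch of a weak-perspective projection. The parametrization as a scale plus a 2D translation, and the function name `weak_perspective_project`, are assumptions for illustration; the review does not specify how WiLoR encodes the camera.

```python
import numpy as np

# Output sizes quoted in the text above.
NUM_VERTICES, NUM_JOINTS = 778, 21
POSE_DIM, SHAPE_DIM = 48, 10  # 48 = 16 joints x 3 rotation values


def weak_perspective_project(points_3d, scale, trans_xy):
    """Project (N, 3) points to (N, 2): drop depth, then scale and shift.

    Assumed camera form: (scale, tx, ty). Depth has no effect on the result.
    """
    points_3d = np.asarray(points_3d, dtype=np.float64)
    return scale * points_3d[:, :2] + np.asarray(trans_xy, dtype=np.float64)


# Toy joints: only the first joint is non-zero.
joints_3d = np.zeros((NUM_JOINTS, 3))
joints_3d[0] = [0.1, -0.2, 0.5]
joints_2d = weak_perspective_project(joints_3d, scale=2.0, trans_xy=(0.5, 0.5))
```

The same projection is what lets a predicted mesh be compared with 2D keypoints, or used to look up image features at vertex locations.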

3. Architecture Overview

WiLoR is an end-to-end pipeline of two networks:

Single RGB image
    → WiLoR-Det (hand detection network)
        → bounding boxes + left/right labels
    → hand crop extraction
    → WiLoR-Rec (3D reconstruction network)
        → initial MANO parameter estimation (ViT-L)
        → refinement module (multi-scale image alignment)
        → final MANO parameters (θ, β, K_cam)
        → MANO layer → 3D mesh + joint coordinates
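
The "hand crop extraction" step that connects the two networks can be sketched as follows. This is an illustrative helper, not WiLoR's code: the name `extract_crop`, the square crop, the 1.2 padding factor, and zero-filling outside the image are all assumptions.

```python
import numpy as np


def extract_crop(image, bbox_xyxy, padding=1.2):
    """Cut a padded square crop around a box, zero-filling outside the image.

    image: (H, W) or (H, W, C) array. bbox_xyxy: (x1, y1, x2, y2) in pixels.
    """
    h, w = image.shape[:2]
    x1, y1, x2, y2 = bbox_xyxy
    side = int(round(max(x2 - x1, y2 - y1) * padding))
    left = int(round((x1 + x2) / 2 - side / 2))
    top = int(round((y1 + y2) / 2 - side / 2))

    crop = np.zeros((side, side) + image.shape[2:], dtype=image.dtype)
    # Part of the crop window that actually overlaps the image.
    sx0, sy0 = max(left, 0), max(top, 0)
    sx1, sy1 = min(left + side, w), min(top + side, h)
    if sx1 > sx0 and sy1 > sy0:
        crop[sy0 - top:sy1 - top, sx0 - left:sx1 - left] = image[sy0:sy1, sx0:sx1]
    return crop
```

A detector box of 40×40 pixels therefore becomes a 48×48 crop, which would then be resized to the reconstruction network's input resolution.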

4. Hand Detection Network (WiLoR-Det)

Architecture

WiLoR-Det is a real-time object detection architecture in the YOLOv8 family, specialized for hand detection.

Backbone: DarkNet — extracts last three feature maps \(\{C_3, C_4, C_5\}\)
Neck: PANet (Path Aggregation Network) — multi-scale feature fusion
Head: Three anchor-free detection heads (predict bounding boxes + left/right labels at each scale)

Provided in two sizes:

WiLoR-M: 25 MB, 138 FPS
WiLoR-S: 7 MB, 175 FPS

Detection Loss

\[\mathcal{L}_{det} = \lambda_0 \mathcal{L}_{BCE} + \lambda_1 \mathcal{L}_{DFL} + \lambda_2 \mathcal{L}_{CIoU} + \lambda_3 \mathcal{L}_{kpts}\]

Term	Weight	Role
\(\mathcal{L}_{BCE}\)	\(\lambda_0 = 0.5\)	Classification (hand presence + left/right)
\(\mathcal{L}_{DFL}\)	\(\lambda_1 = 1.5\)	Distribution focal loss (box coordinate distribution)
\(\mathcal{L}_{CIoU}\)	\(\lambda_2 = 15\)	Bounding box shape regression
\(\mathcal{L}_{kpts}\)	\(\lambda_3 = 10\)	Keypoint alignment
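
As a rough illustration of how these terms combine, the sketch below implements the standard CIoU loss for a single box pair and the weighted sum with the weights from the table. The function names and the dictionary layout are assumptions; the other three terms are passed in as precomputed numbers rather than implemented.

```python
import math

# Weights from the table above.
DET_WEIGHTS = {"bce": 0.5, "dfl": 1.5, "ciou": 15.0, "kpts": 10.0}


def ciou_loss(box_a, box_b, eps=1e-9):
    """Return 1 - CIoU for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1

    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    iou = inter / (wa * ha + wb * hb - inter + eps)

    # Squared centre distance, normalised by the enclosing box diagonal.
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term.
    v = (4.0 / math.pi ** 2) * (math.atan(wb / (hb + eps)) - math.atan(wa / (ha + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return 1.0 - (iou - rho2 / c2 - alpha * v)


def detection_loss(terms, weights=DET_WEIGHTS):
    """Weighted sum of already-computed loss terms, keyed like DET_WEIGHTS."""
    return sum(weights[name] * terms[name] for name in weights)
```

With these weights the box regression term dominates: a unit change in the CIoU term moves the total 30 times more than a unit change in the classification term.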

5. 3D Hand Reconstruction Network (WiLoR-Rec)

Backbone: ViT-Large

Fine-tuned from ViTPose pretrained weights
Hidden dimension 1,280
Input: image patch tokens \(\mathbf{T}_{img}\) concatenated with learnable tokens for pose \(\theta\), shape \(\beta\), and camera \(K_{cam}\)
Initial (coarse) MANO parameters estimated via MLP from ViT output tokens

WiLoR uses ViT-L rather than HaMeR’s ViT-H, but compensates through ViTPose pretraining and the refinement module.

Refinement Module

The refinement module is WiLoR-Rec’s key differentiator. Rather than stopping at the initial MANO parameters, it samples image-aligned features at the projected mesh and predicts residual corrections.

How it works:

Feature map generation: Deconvolutional layers upsample ViT image output tokens into multi-resolution feature maps \(\{F_0, F_1, \ldots, F_n\}\)
Mesh projection: Each vertex \(v\) of the initially estimated hand mesh is projected onto the image plane using the estimated camera \(K_{cam}\)
Bilinear sampling: Multi-scale features are bilinearly interpolated at the projected coordinates

\[p_v = \pi(v, K_{cam})\]
\[f_v^i = \text{bilinear\_sample}(F_i, p_v)\]

Residual prediction: Per-vertex features from mesh level \(M_l\) are aggregated to compute pose and shape residuals

\[\Delta\beta = \text{MLP}_\beta\left(\bigoplus_{v \in M_l} f_v^0\right), \quad \Delta\theta = \text{MLP}_\theta\left(\bigoplus_{v \in M_l} f_v^0\right)\]

The key insight is image alignment: rather than regressing from global image features, the model directly references local image features corresponding to estimated mesh positions, correcting errors through local context.
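
The projection, sampling, and residual steps can be sketched in NumPy as below. This is a toy version under stated assumptions, not the authors' implementation: projected vertices are given as normalized coordinates in [0, 1], the aggregation \(\bigoplus\) is read as concatenation, and a single random linear layer (`W_theta`) stands in for the learned MLP.

```python
import numpy as np


def bilinear_sample(feature_map, xy):
    """Sample a (C, H, W) map at continuous pixel coords xy (N, 2) -> (N, C)."""
    _, h, w = feature_map.shape
    x = np.clip(xy[:, 0], 0, w - 1)
    y = np.clip(xy[:, 1], 0, h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    top = feature_map[:, y0, x0] * (1 - wx) + feature_map[:, y0, x1] * wx
    bottom = feature_map[:, y1, x0] * (1 - wx) + feature_map[:, y1, x1] * wx
    return (top * (1 - wy) + bottom * wy).T


def sample_multiscale(feature_maps, uv01):
    """Sample every scale at the same normalized locations and concatenate."""
    feats = []
    for fmap in feature_maps:
        _, h, w = fmap.shape
        feats.append(bilinear_sample(fmap, uv01 * np.array([w - 1, h - 1])))
    return np.concatenate(feats, axis=1)


rng = np.random.default_rng(0)
feature_maps = [rng.normal(size=(8, 16, 16)), rng.normal(size=(4, 32, 32))]
uv01 = rng.uniform(size=(778, 2))              # projected vertices (assumed given)
per_vertex = sample_multiscale(feature_maps, uv01)

# Hypothetical stand-in for the pose MLP: one linear layer on the concatenation.
W_theta = rng.normal(size=(per_vertex.size, 48)) * 1e-3
theta_init = np.zeros(48)
theta_refined = theta_init + per_vertex.reshape(-1) @ W_theta
```

The point of the design is visible in `sample_multiscale`: the features fed to the residual predictor depend on where the current mesh estimate lands in the image, so a misplaced finger reads different features than a well-placed one.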

Reconstruction Loss

\[\mathcal{L}_{rec} = \lambda_{3D}\mathcal{L}_{3D} + \lambda_{2D}\mathcal{L}_{2D} + \lambda_{pose}\mathcal{L}_{MANO,\theta} + \lambda_{shape}\mathcal{L}_{MANO,\beta} + \mathcal{L}_{adv}\]

Term	Weight	Formula
\(\mathcal{L}_{3D}\)	0.05	\(\|V_{3D} - \hat{V}_{3D}\|_1\)
\(\mathcal{L}_{2D}\)	0.01	\(\|\pi(J_{3D}, K_{cam}) - \hat{J}_{2D}\|_1\)
\(\mathcal{L}_{MANO,\theta}\)	0.001	\(\|\theta - \hat{\theta}\|_2^2\)
\(\mathcal{L}_{MANO,\beta}\)	0.0005	\(\|\beta - \hat{\beta}\|_2^2\)
\(\mathcal{L}_{adv}\)	—	\(\|D(\theta, \beta) - 1\|_2\)
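
A compact sketch of how the table's terms add up is given below. The dictionary keys, the use of sums rather than means inside each norm, and passing the discriminator output as a plain number are assumptions made for illustration.

```python
import numpy as np

# Weights from the table above; the adversarial term has no listed weight
# and is added with coefficient 1, as in the formula.
REC_WEIGHTS = {"3d": 0.05, "2d": 0.01, "pose": 0.001, "shape": 0.0005}


def reconstruction_loss(pred, gt, disc_score, weights=REC_WEIGHTS):
    """Weighted sum of L1 vertex/keypoint terms, squared-L2 MANO terms, and adv."""
    terms = {
        "3d": np.abs(pred["verts"] - gt["verts"]).sum(),
        "2d": np.abs(pred["joints2d"] - gt["joints2d"]).sum(),
        "pose": np.square(pred["theta"] - gt["theta"]).sum(),
        "shape": np.square(pred["beta"] - gt["beta"]).sum(),
    }
    adversarial = abs(disc_score - 1.0)  # pushes D(theta, beta) towards 1
    return sum(weights[k] * terms[k] for k in weights) + adversarial
```

Note how small the MANO parameter weights are relative to the 3D term: direct parameter supervision acts more like a regularizer than the main training signal.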

6. WHIM Dataset

Motivation and Scale

The core problem with prior in-the-wild training data is insufficient diversity. WiLoR builds the WHIM (Wild Hand In-the-wild Monocular) dataset via an automated annotation pipeline.

Scale: 2M+ in-the-wild hand images
Source: 1,400+ YouTube videos (sign language, cooking, sports, games, ego/exocentric viewpoints)
Annotations: 2D bounding boxes, left/right labels, 3D MANO parameters

Automated Annotation Pipeline

3D ground truth is generated through an automated fitting pipeline rather than manual annotation.

Stage 1 — Person detection:

ViTPose + AlphaPose with confidence threshold 0.65

Stage 2 — Hand detection ensemble:

Three detectors (MediaPipe, OpenPose, ContactHands) ensembled
Confidence-weighted bounding box fusion:

\[\hat{y} = \frac{\sum_i P(b_i | d_i) \cdot b_i}{\sum_i P(b_i | d_i)}\]
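
The fusion formula is a confidence-weighted average of box coordinates. A minimal sketch, assuming the boxes from the different detectors have already been matched to the same hand and are given as (x1, y1, x2, y2):

```python
import numpy as np


def fuse_boxes(boxes, confidences):
    """Confidence-weighted average of K boxes, each (x1, y1, x2, y2)."""
    boxes = np.asarray(boxes, dtype=np.float64)        # (K, 4)
    conf = np.asarray(confidences, dtype=np.float64)   # (K,)
    return (conf[:, None] * boxes).sum(axis=0) / conf.sum()


fused = fuse_boxes([[0, 0, 10, 10], [2, 2, 12, 12]], [0.75, 0.25])
```

Here the fused box sits a quarter of the way from the first box to the second, because the second detector carries a quarter of the total confidence.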

Stage 3 — 3D MANO fitting: Optimized with three losses:

Reprojection loss: \(\mathcal{L}_{proj} = \|J_M - \pi(\hat{J}_s, K)\|_1\)
Biomechanical loss: \(\mathcal{L}_{BMC} = \mathcal{L}_{BL} + \mathcal{L}_{A}\) (bone length + joint angle constraints)
PCA prior loss: \(\mathcal{L}_{prior} = \|X - ([(X - \mu)U^T]U + \mu)\|_2\) (penalizes the distance between \(X\) and its PCA reconstruction, enforcing natural hand shapes)

Explicitly embedding biomechanical constraints in the fitting loss prevents generation of physically implausible hand poses.
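
The PCA prior can be read as the distance between a sample and its reconstruction from a low-dimensional PCA subspace. A minimal sketch, assuming \(U\) stores orthonormal principal directions as rows and \(X\) is a flat parameter vector:

```python
import numpy as np


def pca_prior_loss(x, mean, components):
    """Distance from x to its projection onto the PCA subspace.

    components: (k, d) array with orthonormal rows (assumed layout of U).
    """
    centered = x - mean
    reconstruction = (centered @ components.T) @ components + mean
    return np.linalg.norm(x - reconstruction)


# Toy subspace: the first two axes of a 3D space.
U = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
mu = np.zeros(3)
inside = pca_prior_loss(np.array([1.0, 2.0, 0.0]), mu, U)
outside = pca_prior_loss(np.array([1.0, 2.0, 3.0]), mu, U)
```

A sample that lies in the subspace incurs no penalty, while any component outside it is penalized by its full length, which is what steers the fit towards hand configurations seen in the PCA training data.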

7. Training Configuration

Detection Model

Optimizer: Adam, up to 200 epochs with early stopping (30-epoch patience)
Learning rate: 0.01 → 1e-6 (linear decay)
Hardware: 2× RTX 4090, batch size 256, 3 weeks
Augmentation: Mosaic (0.7 probability), rotation [-60°, 60°], scale [0.5, 1]
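
The learning-rate schedule above can be written as a one-line interpolation. This sketch assumes the rate is updated once per epoch and reaches 1e-6 exactly at the final epoch; the review does not state the step granularity.

```python
def linear_lr(epoch, total_epochs=200, lr_start=0.01, lr_end=1e-6):
    """Linearly interpolate the learning rate from lr_start to lr_end."""
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return lr_start + frac * (lr_end - lr_start)


schedule = [linear_lr(e) for e in range(200)]
```

If early stopping ends training before epoch 200, the run simply finishes at whatever rate the schedule has reached by then.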

Reconstruction Model

Optimizer: Adam, 1,000 epochs, learning rate 1e-5, weight decay 1e-4
Training data: 14 datasets, 4.2M images (about 55% more than HaMeR's 2.7M)
7 existing datasets with 3D annotations (FreiHAND, HO-3D, InterHand2.6M, etc.) + 7 additional including WHIM

8. Experimental Results

FreiHAND Benchmark (Table 3)

Method	PA-MPJPE (mm) ↓	PA-MPVPE (mm) ↓	F@5mm ↑	F@15mm ↑
HaMeR	6.0	5.7	0.785	0.990
WiLoR	5.5	5.1	0.825	0.993

WiLoR surpasses HaMeR on all FreiHAND metrics: PA-MPJPE drops by 8.3% (6.0 → 5.5 mm), and F@5mm rises by 4.0 percentage points (0.785 → 0.825, a 5.1% relative gain).
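
For readers unfamiliar with the "PA-" prefix in these metrics: the prediction is first aligned to the ground truth with the best similarity transform (scale, rotation, translation), and the error is measured afterwards. Below is a minimal sketch of that standard Procrustes alignment and the resulting PA-MPJPE; it is a generic implementation, not the benchmark's official evaluation script.

```python
import numpy as np


def procrustes_align(pred, gt):
    """Align pred (N, 3) to gt (N, 3) with the optimal similarity transform."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the cross-covariance matrix.
    u, s, vt = np.linalg.svd(p.T @ g)
    d = np.sign(np.linalg.det(u @ vt))      # guard against reflections
    fix = np.array([1.0, 1.0, d])
    rotation = u @ np.diag(fix) @ vt        # p @ rotation approximates g
    scale = (s * fix).sum() / np.square(p).sum()
    return scale * p @ rotation + mu_g


def pa_mpjpe(pred, gt):
    """Mean per-joint Euclidean error after Procrustes alignment."""
    aligned = procrustes_align(pred, gt)
    return np.linalg.norm(aligned - gt, axis=1).mean()
```

The result is in the units of the inputs, so joints given in metres need a factor of 1000 to match the millimetre values in the table. Because global scale, rotation, and translation are factored out, the metric measures articulation and shape accuracy only.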

HO3Dv2 Benchmark (Table 4)

Method	AUCⱼ ↑	PA-MPJPE (mm) ↓	AUCᵥ ↑	PA-MPVPE (mm) ↓	F@5mm ↑	F@15mm ↑
HaMeR	0.846	7.7	0.841	7.9	0.635	0.980
WiLoR	0.851	7.5	0.846	7.7	0.646	0.983

The improvement is consistent on this hand-object interaction benchmark as well (PA-MPJPE 7.7 → 7.5 mm, F@5mm 0.635 → 0.646).

Hand Detection Benchmark (Table 1, COCO)

Method	Model Size	FPS ↑	AP@0.5 ↑	mAP ↑
ContactHands	819 MB	3	50.29	16.67
ViTDet	1,400 MB	1	41.64	13.21
WiLoR-S	7 MB	175	46.96	18.56
WiLoR-M	25 MB	138	62.48	25.97

Compared to ContactHands, WiLoR-M is about 46× faster (138 vs. 3 FPS) and about 33× smaller (25 MB vs. 819 MB), while outperforming it by 12.19 points in AP@0.5 (62.48 vs. 50.29).

On the WHIM test set, WiLoR-M scores AP@0.5 96.06, mAP 53.79.

Temporal Coherence (Table 6)

Although inference runs independently on each frame, WiLoR's video output is markedly more stable than HaMeR's:

Method	MPFVE×100 ↓	MPFJE×100 ↓	Jitter ↓	RTE ↓
HaMeR	10.60	1.768	20.43	2.92
WiLoR	4.43	0.762	5.92	0.07

WiLoR reduces MPFVE by 2.4× and Jitter by about 3.5× relative to HaMeR, without any explicit temporal modeling.
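
The improvement factors can be recomputed directly from the table. The short check below only uses the numbers listed above; the dictionary keys are just labels.

```python
hamer = {"MPFVE x100": 10.60, "MPFJE x100": 1.768, "Jitter": 20.43, "RTE": 2.92}
wilor = {"MPFVE x100": 4.43, "MPFJE x100": 0.762, "Jitter": 5.92, "RTE": 0.07}

# All four metrics are lower-is-better, so the factor is HaMeR / WiLoR.
ratios = {name: hamer[name] / wilor[name] for name in hamer}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x lower than HaMeR")
```

The RTE ratio is much larger than the others, but its denominator (0.07) is small enough that the exact factor is sensitive to rounding in the reported value.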

9. Ablation Study

Reconstruction Component Analysis (Table 5, FreiHAND)

Configuration	PA-MPJPE (mm) ↓	PA-MPVPE (mm) ↓	F@5mm ↑	F@15mm ↑
w. FastViT backbone	6.5	6.3	0.741	0.967
w/o ViTPose pretrain	5.9	5.7	0.795	0.989
w. Single-scale refinement	6.0	5.9	0.793	0.991
w/o Refinement module	6.1	5.8	0.795	0.991
w. FreiHAND only	6.1	5.8	0.793	0.990
Full model (WiLoR)	5.5	5.1	0.825	0.993

Key observations:

Backbone capacity matters: ViT-L over FastViT gives 1.0mm improvement in PA-MPJPE. Capacity difference directly translates to performance.
ViTPose pretraining is critical: Starting from ViTPose weights versus standard ViT-L gives an additional 0.4mm improvement. Domain similarity between hand pose and body pose makes transfer beneficial.
Refinement module effect: Without refinement 6.1mm → with refinement 5.5mm. Image-aligned residual prediction contributes 0.6mm gain.
Multi-scale matters: Single-scale (6.0mm) vs. multi-scale (5.5mm) gives 0.5mm additional improvement.
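Training data contribution: FreiHAND-only training (6.1mm) vs. the full 14-dataset mix (5.5mm) shows a 0.6mm gain. Because this ablation removes every additional dataset at once, the gain reflects the whole mix, WHIM included, rather than WHIM alone. Even so, out-of-domain in-the-wild data benefits a controlled studio benchmark.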
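
The per-component gains quoted in these observations are simple differences against the full model's PA-MPJPE. A small check using only the Table 5 values above:

```python
FULL_MODEL_PA_MPJPE = 5.5  # mm

ablations = {
    "FastViT backbone": 6.5,
    "no ViTPose pretraining": 5.9,
    "single-scale refinement": 6.0,
    "no refinement module": 6.1,
    "FreiHAND only": 6.1,
}

# Gain (in mm) recovered by restoring each component to the full model.
gains = {name: round(value - FULL_MODEL_PA_MPJPE, 1) for name, value in ablations.items()}
```

Each ablation changes one factor at a time relative to the full model, so these gains are not additive: summing them would far exceed the total gap between the weakest variant and the full model.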

10. Limitations

Detection dependency: Reconstruction performance directly depends on upstream detection quality; detection failures propagate through the pipeline
Tight crop requirement: Optimal performance requires tight crops where the hand is sufficiently contained in the image
No explicit temporal modeling: Despite strong temporal coherence, the approach does not explicitly leverage temporal context and may produce errors under rapid motion
Noisy automatic annotations: WHIM’s 3D annotations are generated by an automated fitting pipeline and contain more noise than manual annotations
MANO representation constraints: Performance degrades on extreme hand deformations or tool manipulation scenarios that fall outside the MANO parameter space

11. Summary

WiLoR inherits HaMeR’s scaling hypothesis and extends it in two important directions.

First, end-to-end integration. Detection and reconstruction are unified into a single pipeline, and the detector’s own performance is simultaneously improved. WiLoR-M achieves better hand detection than ContactHands while being roughly 46× faster (138 vs. 3 FPS).

Second, image-aligned refinement. Going beyond single-pass regression, WiLoR introduces a coarse-to-fine structure that projects initial predictions back into image space and corrects errors using locally aligned features. This design delivers 0.5–0.6mm PA-MPJPE gains on FreiHAND in the ablation, and the full model is also far more temporally coherent than HaMeR.

The WHIM dataset is WiLoR’s hidden infrastructure. The pipeline builds 2M+ automatic annotations from 1,400+ YouTube videos, with biomechanical constraints embedded in the fitting process to limit noise, and this enables the scale of in-the-wild training behind the reported performance.

Comparing HaMeR and WiLoR:

	HaMeR	WiLoR
Reconstruction backbone	ViT-H	ViT-L + ViTPose
Refinement module	None	Multi-scale image-aligned
Detector	External dependency	Built-in (WiLoR-Det)
Training data	2.7M (10 datasets)	4.2M (14 datasets)
FreiHAND PA-MPJPE	6.0 mm	5.5 mm
Temporal coherence (Jitter)	20.43	5.92

“Unify detection and reconstruction, then return to the image to correct errors: that is WiLoR’s step beyond HaMeR.”