[Paper Review] WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild (CVPR 2025)
Paper: WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild
Venue: CVPR 2025
Authors: Rolandos Alexandros Potamias, Jinglei Zhang, Jiankang Deng, Stefanos Zafeiriou
Affiliations: Imperial College London (Potamias, Deng, Zafeiriou), Shanghai Jiao Tong University (Zhang)
arXiv: 2409.12259
GitHub: rolpotamias/WiLoR
One-Line Summary
An end-to-end 3D hand reconstruction pipeline combining a real-time fully convolutional hand detector with a ViT-Large + multi-scale refinement module, achieving state-of-the-art on FreiHAND and HO3Dv2 while delivering 2–3× better temporal coherence than HaMeR, trained on 4.2M images including the new 2M+ in-the-wild WHIM dataset.
1. Background and Problem Definition
The Fragmented Two-Stage Pipeline
3D hand reconstruction systems have traditionally been developed as two separate stages: a hand detector finds the hand region, and a separate pose estimation model recovers the 3D mesh from the detected crop. This separation introduces three problems:
- Detection bottleneck: False positives or misses from the upstream detector directly limit full-pipeline performance
- Detection speed: Prior hand detectors (ContactHands: 3 FPS) are unsuitable for real-time applications
- Temporal incoherence: Single-frame pose estimation causes inter-frame jitter in video
Additionally, prior methods lack a refinement step after initial MANO parameter estimation, making image-space alignment inaccurate. Where HaMeR uses a single query token for direct regression, WiLoR introduces an initial prediction → image-aligned feature residual refinement structure.
WiLoR’s Core Proposal
WiLoR's response has three parts:
- Real-time hand detection: Fully convolutional detection network at 130+ FPS
- High-fidelity 3D reconstruction: ViT-Large backbone with a multi-scale image-aligned refinement module
- Large-scale in-the-wild data: 2M+ automatically annotated images (WHIM dataset)
2. Output Representation: MANO Hand Model
Like HaMeR, WiLoR uses the parametric MANO hand model as its output space.
MANO parameters:
- Pose \(\theta \in \mathbb{R}^{48}\): Finger joint rotations (PCA-based)
- Shape \(\beta \in \mathbb{R}^{10}\): Per-identity hand shape variables
- Camera \(K_{cam}\): Weak-perspective camera parameters
The full output \(\{\theta, \beta, K_{cam}\}\) deterministically yields a 778-vertex mesh \(V_{3D}\) and 21 joint locations \(J_{3D}\).
3. Architecture Overview
WiLoR is an end-to-end pipeline of two networks:
Single RGB image
→ WiLoR-Det (hand detection network)
→ bounding boxes + left/right labels
→ hand crop extraction
→ WiLoR-Rec (3D reconstruction network)
→ initial MANO parameter estimation (ViT-L)
→ refinement module (multi-scale image alignment)
→ final MANO parameters (θ, β, K_cam)
→ MANO layer → 3D mesh + joint coordinates
4. Hand Detection Network (WiLoR-Det)
Architecture
WiLoR-Det is a real-time object detection architecture in the YOLOv8 family, specialized for hand detection.
- Backbone: DarkNet — extracts last three feature maps \(\{C_3, C_4, C_5\}\)
- Neck: PANet (Path Aggregation Network) — multi-scale feature fusion
- Head: Three anchor-free detection heads (predict bounding boxes + left/right labels at each scale)
Provided in two sizes:
- WiLoR-M: 25 MB, 138 FPS
- WiLoR-S: 7 MB, 175 FPS
Detection Loss
\[\mathcal{L}_{det} = \lambda_0 \mathcal{L}_{BCE} + \lambda_1 \mathcal{L}_{DFL} + \lambda_2 \mathcal{L}_{CIoU} + \lambda_3 \mathcal{L}_{kpts}\]

| Term | Weight | Role |
|---|---|---|
| \(\mathcal{L}_{BCE}\) | \(\lambda_0 = 0.5\) | Classification (hand presence + left/right) |
| \(\mathcal{L}_{DFL}\) | \(\lambda_1 = 1.5\) | Distribution focal loss (box coordinate distribution) |
| \(\mathcal{L}_{CIoU}\) | \(\lambda_2 = 15\) | Bounding box shape regression |
| \(\mathcal{L}_{kpts}\) | \(\lambda_3 = 10\) | Keypoint alignment |
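As a minimal sketch of how the weighted sum above combines, the snippet below applies the four weights from the table to a set of per-term loss values. The function name `detection_loss` and the example loss values are hypothetical and chosen only for illustration; they are not taken from the paper or its code.

```python
# Weights as listed in the table above. The per-term loss values used in the
# example are placeholders, not numbers reported by the paper.
DET_WEIGHTS = {"bce": 0.5, "dfl": 1.5, "ciou": 15.0, "kpts": 10.0}


def detection_loss(terms, weights=DET_WEIGHTS):
    """Return the weighted sum of the four detection loss terms."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)


# Hypothetical per-term values for a single training batch.
example_terms = {"bce": 0.2, "dfl": 0.1, "ciou": 0.05, "kpts": 0.03}
total = detection_loss(example_terms)
print(f"L_det = {total:.3f}")
```

With these weights, the box-regression and keypoint terms dominate the total even when their raw values are small, which matches the emphasis on localization accuracy.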
5. 3D Hand Reconstruction Network (WiLoR-Rec)
Backbone: ViT-Large
- Fine-tuned from ViTPose pretrained weights
- Hidden dimension 1,280
- Input: image patch tokens \(\mathbf{T}_{img}\) concatenated with learnable tokens for pose \(\theta\), shape \(\beta\), and camera \(K_{cam}\)
- Initial (coarse) MANO parameters estimated via MLP from ViT output tokens
WiLoR uses ViT-L rather than HaMeR’s ViT-H, but compensates through ViTPose pretraining and the refinement module.
Refinement Module
This is WiLoR-Rec’s key differentiator. Rather than stopping at initial MANO parameters, it extracts image-aligned features to predict residual corrections.
How it works:
- Feature map generation: Deconvolutional layers upsample ViT image output tokens into multi-resolution feature maps \(\{F_0, F_1, \ldots, F_n\}\)
- Mesh projection: Each vertex \(v\) of the initially estimated hand mesh is projected onto the image plane using the estimated camera \(K_{cam}\)
- Bilinear sampling: Multi-scale features are bilinearly interpolated at the projected coordinates
- Residual prediction: Per-vertex features from mesh level \(M_l\) are aggregated to compute pose and shape residuals
The key insight is image alignment: rather than regressing from global image features, the model directly references local image features corresponding to estimated mesh positions, correcting errors through local context.
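The projection and sampling steps above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the authors' implementation: the camera is assumed to be a weak-perspective triple `(scale, tx, ty)`, projected coordinates are assumed to be normalized to [0, 1], and each feature map has a single channel stored as a list of rows. All function names are hypothetical.

```python
import math


def project_weak_perspective(vertex, scale, tx, ty):
    """Project a 3D point (x, y, z) to 2D: drop depth, then scale and shift."""
    x, y, _ = vertex
    return scale * x + tx, scale * y + ty


def bilinear_sample(feature_map, u, v):
    """Sample a 2D grid (list of rows) at continuous coords (u = column, v = row)."""
    height, width = len(feature_map), len(feature_map[0])
    # Clamp so that points projected outside the map reuse the border value.
    u = min(max(u, 0.0), width - 1.0)
    v = min(max(v, 0.0), height - 1.0)
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = u - x0, v - y0
    top = (1 - wx) * feature_map[y0][x0] + wx * feature_map[y0][x1]
    bottom = (1 - wx) * feature_map[y1][x0] + wx * feature_map[y1][x1]
    return (1 - wy) * top + wy * bottom


def sample_multiscale(feature_maps, vertex, camera):
    """Collect one sampled feature per scale for a single mesh vertex."""
    u_norm, v_norm = project_weak_perspective(vertex, *camera)
    sampled = []
    for fmap in feature_maps:
        height, width = len(fmap), len(fmap[0])
        # Rescale the normalized coordinate to this map's resolution.
        sampled.append(bilinear_sample(fmap, u_norm * (width - 1), v_norm * (height - 1)))
    return sampled


coarse = [[0.0, 1.0], [2.0, 3.0]]
fine = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
features = sample_multiscale([coarse, fine], vertex=(0.5, 0.5, 0.2), camera=(1.0, 0.0, 0.0))
print(features)  # [1.5, 4.0]
```

In the real model each sampled value would be a feature vector, and the per-vertex features would feed a network that predicts the pose and shape residuals; that part is omitted here.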
Reconstruction Loss
\[\mathcal{L}_{rec} = \lambda_{3D}\mathcal{L}_{3D} + \lambda_{2D}\mathcal{L}_{2D} + \lambda_{pose}\mathcal{L}_{MANO,\theta} + \lambda_{shape}\mathcal{L}_{MANO,\beta} + \mathcal{L}_{adv}\]

| Term | Weight | Formula |
|---|---|---|
| \(\mathcal{L}_{3D}\) | 0.05 | \(\lVert V_{3D} - \hat{V}_{3D} \rVert_1\) |
| \(\mathcal{L}_{2D}\) | 0.01 | \(\lVert \pi(J_{3D}, K_{cam}) - \hat{J}_{2D} \rVert_1\) |
| \(\mathcal{L}_{MANO,\theta}\) | 0.001 | \(\lVert \theta - \hat{\theta} \rVert_2^2\) |
| \(\mathcal{L}_{MANO,\beta}\) | 0.0005 | \(\lVert \beta - \hat{\beta} \rVert_2^2\) |
| \(\mathcal{L}_{adv}\) | — | \(\lVert D(\theta, \beta) - 1 \rVert_2\) |
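To make the relative scale of the terms concrete, here is a small sketch that evaluates the reconstruction loss on flattened toy inputs. It assumes vertices, 2D joints, pose, and shape are each given as flat lists of floats, and it takes the adversarial term as a precomputed number because no weight is listed for it. The names `reconstruction_loss`, `l1`, and `squared_l2` are hypothetical.

```python
# Weights from the table above. The adversarial term has no listed weight,
# so it is added as-is.
REC_WEIGHTS = {"3d": 0.05, "2d": 0.01, "pose": 0.001, "shape": 0.0005}


def l1(pred, target):
    """Sum of absolute differences between two equal-length lists."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def squared_l2(pred, target):
    """Sum of squared differences between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def reconstruction_loss(pred, target, adv_term=0.0, weights=REC_WEIGHTS):
    """Weighted sum of the 3D, 2D, pose, and shape terms plus the adversarial term."""
    return (
        weights["3d"] * l1(pred["vertices"], target["vertices"])
        + weights["2d"] * l1(pred["joints_2d"], target["joints_2d"])
        + weights["pose"] * squared_l2(pred["pose"], target["pose"])
        + weights["shape"] * squared_l2(pred["shape"], target["shape"])
        + adv_term
    )


# Toy inputs: 48-dim pose and 10-dim shape as in the MANO section, tiny
# stand-ins for the vertex and joint arrays.
pred = {"vertices": [0.0, 1.0, 2.0], "joints_2d": [10.0, 20.0],
        "pose": [0.1] * 48, "shape": [1.0] * 10}
target = {"vertices": [0.0, 1.0, 1.0], "joints_2d": [12.0, 20.0],
          "pose": [0.0] * 48, "shape": [0.0] * 10}
rec_total = reconstruction_loss(pred, target)
print(f"L_rec = {rec_total:.5f}")
```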
6. WHIM Dataset
Motivation and Scale
The core problem with prior in-the-wild training data is insufficient diversity. WiLoR builds the WHIM (Wild Hand In-the-wild Monocular) dataset via an automated annotation pipeline.
- Scale: 2M+ in-the-wild hand images
- Source: 1,400+ YouTube videos (sign language, cooking, sports, games, ego/exocentric viewpoints)
- Annotations: 2D bounding boxes, left/right labels, 3D MANO parameters
Automated Annotation Pipeline
3D ground truth is generated through an automated fitting pipeline rather than manual annotation.
Stage 1 — Person detection:
- ViTPose + AlphaPose with confidence threshold 0.65
Stage 2 — Hand detection ensemble:
- Three detectors (MediaPipe, OpenPose, ContactHands) ensembled
- Their detections are merged by confidence-weighted bounding box fusion
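The exact fusion formula is not reproduced in this review, so the following is only one plausible reading: a confidence-weighted average of box coordinates. It assumes the boxes from the three detectors have already been matched to the same hand and are given as `(x1, y1, x2, y2)`. The function name `fuse_boxes` and the example numbers are hypothetical.

```python
def fuse_boxes(detections):
    """Fuse [(box, confidence), ...] into one box by confidence-weighted averaging.

    Each box is (x1, y1, x2, y2). Assumes all boxes refer to the same hand.
    """
    total_conf = sum(conf for _, conf in detections)
    if total_conf <= 0:
        raise ValueError("at least one detection needs positive confidence")
    return tuple(
        sum(box[i] * conf for box, conf in detections) / total_conf
        for i in range(4)
    )


# Hypothetical outputs from three detectors for the same hand.
detections = [
    ((10.0, 10.0, 50.0, 50.0), 0.9),
    ((14.0, 12.0, 54.0, 52.0), 0.6),
    ((8.0, 9.0, 48.0, 49.0), 0.5),
]
fused = fuse_boxes(detections)
print([round(c, 2) for c in fused])
```

Under this scheme, a detector that is more confident pulls the fused box toward its own estimate, and a detector with zero confidence is ignored.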
Stage 3 — 3D MANO fitting: Optimized with three losses:
- Reprojection loss: \(\mathcal{L}_{proj} = \|J_M - \pi(\hat{J}_s, K)\|_1\)
- Biomechanical loss: \(\mathcal{L}_{BMC} = \mathcal{L}_{BL} + \mathcal{L}_{A}\) (bone length + joint angle constraints)
- PCA prior loss: \(\mathcal{L}_{prior} = \|X - ([(X - \mu)U^T]U + \mu)\|_2\), the distance between \(X\) and its reconstruction from the PCA subspace (enforces natural hand shapes)
Explicitly embedding biomechanical constraints in the fitting loss prevents generation of physically implausible hand poses.
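The PCA prior term can be read as a reconstruction residual: project the sample onto the PCA subspace, map it back, and measure what was lost. The sketch below illustrates this with plain Python lists. It assumes the rows of `components` are orthonormal basis vectors (the rows of \(U\)); the function name and the toy basis are hypothetical.

```python
import math


def pca_prior_loss(x, mean, components):
    """L2 distance between x and its reconstruction from the PCA subspace.

    components: list of k orthonormal basis vectors (rows of U), each of len(x).
    """
    centered = [xi - mi for xi, mi in zip(x, mean)]
    # Coefficients in the subspace: (X - mu) U^T
    coeffs = [sum(c * u for c, u in zip(centered, row)) for row in components]
    # Reconstruction back in the original space: coeffs U + mu
    recon = [
        mi + sum(a * row[j] for a, row in zip(coeffs, components))
        for j, mi in enumerate(mean)
    ]
    return math.sqrt(sum((xi - ri) ** 2 for xi, ri in zip(x, recon)))


# Toy 3D example with a 2D subspace spanned by the first two axes.
mean = [0.0, 0.0, 0.0]
components = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
inside = pca_prior_loss([1.0, 2.0, 0.0], mean, components)
outside = pca_prior_loss([1.0, 2.0, 3.0], mean, components)
print(inside, outside)  # 0.0 3.0
```

A sample that lies in the subspace incurs no penalty, while any component outside it is penalized in full. This is how such a prior discourages configurations unlike those seen when the basis was fitted.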
7. Training Configuration
Detection Model
- Optimizer: Adam, up to 200 epochs, with early stopping at a patience of 30 epochs
- Learning rate: 0.01 → 1e-6 (linear decay)
- Hardware: 2× RTX 4090, batch size 256, 3 weeks
- Augmentation: Mosaic (0.7 probability), rotation [-60°, 60°], scale [0.5, 1]
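A linear decay from 0.01 to 1e-6 can be written as a one-line interpolation. The sketch below assumes the rate is updated once per epoch and reaches its final value on the last of the 200 epochs; the review does not say whether the schedule steps per epoch or per iteration, so treat that as an assumption. The function name `linear_lr` is hypothetical.

```python
def linear_lr(epoch, total_epochs=200, lr_start=0.01, lr_end=1e-6):
    """Linearly interpolate the learning rate from lr_start to lr_end."""
    if total_epochs < 2:
        return lr_start
    # Clamp the epoch index so out-of-range values stay on the schedule ends.
    progress = min(max(epoch, 0), total_epochs - 1) / (total_epochs - 1)
    return lr_start + (lr_end - lr_start) * progress


lr_first = linear_lr(0)
lr_last = linear_lr(199)
print(f"{lr_first:.6f} {lr_last:.6f}")  # 0.010000 0.000001
```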
Reconstruction Model
- Optimizer: Adam, 1,000 epochs, learning rate 1e-5, weight decay 1e-4
- Training data: 14 datasets, 4.2M images (55%+ more than prior methods)
- 7 existing datasets with 3D annotations (FreiHAND, HO-3D, InterHand2.6M, etc.) + 7 additional including WHIM
8. Experimental Results
FreiHAND Benchmark (Table 3)
| Method | PA-MPJPE (mm) ↓ | PA-MPVPE (mm) ↓ | F@5mm ↑ | F@15mm ↑ |
|---|---|---|---|---|
| HaMeR | 6.0 | 5.7 | 0.785 | 0.990 |
| WiLoR | 5.5 | 5.1 | 0.825 | 0.993 |
WiLoR surpasses HaMeR on all FreiHAND metrics: PA-MPJPE drops by 8.3% (6.0 → 5.5 mm), and F@5mm rises by 5.1% in relative terms (0.785 → 0.825, or 4.0 percentage points).
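Relative change and percentage-point change are easy to confuse, so here is the arithmetic behind the FreiHAND comparison, using only the numbers in the table above. The helper name `relative_change` is hypothetical.

```python
def relative_change(baseline, new):
    """Percent change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


# PA-MPJPE is lower-is-better, so flip the sign to report a reduction.
pa_mpjpe_reduction = -relative_change(6.0, 5.5)
# F@5mm is a fraction in [0, 1]; compare the relative gain with the gain in points.
f5_relative_gain = relative_change(0.785, 0.825)
f5_point_gain = (0.825 - 0.785) * 100.0

print(f"{pa_mpjpe_reduction:.1f}% {f5_relative_gain:.1f}% {f5_point_gain:.1f} points")
# 8.3% 5.1% 4.0 points
```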
HO3Dv2 Benchmark (Table 4)
| Method | AUCⱼ ↑ | PA-MPJPE (mm) ↓ | AUCᵥ ↑ | PA-MPVPE (mm) ↓ | F@5mm ↑ | F@15mm ↑ |
|---|---|---|---|---|---|---|
| HaMeR | 0.846 | 7.7 | 0.841 | 7.9 | 0.635 | 0.980 |
| WiLoR | 0.851 | 7.5 | 0.846 | 7.7 | 0.646 | 0.983 |
Consistent improvement on the hand-object interaction benchmark as well.
Hand Detection Benchmark (Table 1, COCO)
| Method | Model Size | FPS ↑ | AP@0.5 ↑ | mAP ↑ |
|---|---|---|---|---|
| ContactHands | 819 MB | 3 | 50.29 | 16.67 |
| ViTDet | 1,400 MB | 1 | 41.64 | 13.21 |
| WiLoR-S | 7 MB | 175 | 46.96 | 18.56 |
| WiLoR-M | 25 MB | 138 | 62.48 | 25.97 |
Compared with ContactHands, WiLoR-M is roughly 46× faster (138 vs. 3 FPS) and about 33× smaller (25 MB vs. 819 MB), while outperforming it by 12.19 percentage points in AP@0.5.
On the WHIM test set, WiLoR-M scores AP@0.5 96.06, mAP 53.79.
Temporal Coherence (Table 6)
Although inference is independent per frame, WiLoR is markedly more temporally coherent than HaMeR on video:
| Method | MPFVE×100 ↓ | MPFJE×100 ↓ | Jitter ↓ | RTE ↓ |
|---|---|---|---|---|
| HaMeR | 10.60 | 1.768 | 20.43 | 2.92 |
| WiLoR | 4.43 | 0.762 | 5.92 | 0.07 |
WiLoR improves MPFVE by about 2.4× (10.60 → 4.43) and Jitter by about 3.5× (20.43 → 5.92) over HaMeR, without any explicit temporal modeling.
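The improvement factors follow directly from the table: for lower-is-better metrics, the factor is the HaMeR value divided by the WiLoR value. A short sketch using only the reported numbers:

```python
# Values copied from the temporal coherence table above (lower is better).
hamer = {"MPFVE": 10.60, "MPFJE": 1.768, "Jitter": 20.43, "RTE": 2.92}
wilor = {"MPFVE": 4.43, "MPFJE": 0.762, "Jitter": 5.92, "RTE": 0.07}

# Improvement factor per metric: baseline divided by the new value.
ratios = {name: hamer[name] / wilor[name] for name in hamer}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The RTE column stands out: the same computation gives a factor above 40, far larger than for the other three metrics.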
9. Ablation Study
Reconstruction Component Analysis (Table 5, FreiHAND)
| Configuration | PA-MPJPE (mm) ↓ | PA-MPVPE (mm) ↓ | F@5mm ↑ | F@15mm ↑ |
|---|---|---|---|---|
| w. FastViT backbone | 6.5 | 6.3 | 0.741 | 0.967 |
| w/o ViTPose pretrain | 5.9 | 5.7 | 0.795 | 0.989 |
| w. Single-scale refinement | 6.0 | 5.9 | 0.793 | 0.991 |
| w/o Refinement module | 6.1 | 5.8 | 0.795 | 0.991 |
| w. FreiHAND only | 6.1 | 5.8 | 0.793 | 0.990 |
| Full model (WiLoR) | 5.5 | 5.1 | 0.825 | 0.993 |
Key observations:
- Backbone capacity matters: ViT-L over FastViT gives 1.0mm improvement in PA-MPJPE. Capacity difference directly translates to performance.
- ViTPose pretraining is critical: Starting from ViTPose weights versus standard ViT-L gives an additional 0.4mm improvement. Domain similarity between hand pose and body pose makes transfer beneficial.
- Refinement module effect: Without refinement 6.1mm → with refinement 5.5mm. Image-aligned residual prediction contributes 0.6mm gain.
- Multi-scale matters: Single-scale (6.0mm) vs. multi-scale (5.5mm) gives 0.5mm additional improvement.
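- Training data contribution: FreiHAND-only training (6.1mm) vs. the full data mix (5.5mm) shows a 0.6mm gain. This row measures the whole multi-dataset mix, of which WHIM is one part, rather than WHIM in isolation. Even so, out-of-domain in-the-wild data benefits a controlled studio benchmark.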
10. Limitations
- Detection dependency: Reconstruction performance directly depends on upstream detection quality; detection failures propagate through the pipeline
- Tight crop requirement: Optimal performance requires tight crops where the hand is sufficiently contained in the image
- No explicit temporal modeling: Despite strong temporal coherence, the approach does not explicitly leverage temporal context and may produce errors under rapid motion
- Noisy automatic annotations: WHIM’s 3D annotations are generated by an automated fitting pipeline and contain more noise than manual annotations
- MANO representation constraints: Performance degrades on extreme hand deformations or tool manipulation scenarios that fall outside the MANO parameter space
11. Summary
WiLoR inherits HaMeR’s scaling hypothesis and extends it in two important directions.
First, end-to-end integration. Detection and reconstruction are unified into a single pipeline, and the detector itself is improved at the same time. WiLoR-M detects hands more accurately than ContactHands while running roughly 46× faster.
Second, image-aligned refinement. Going beyond single-pass regression, WiLoR introduces a coarse-to-fine structure that projects initial predictions back into image space and corrects errors using locally aligned features. This design delivers 0.5–0.6mm quantitative gains and, notably, dramatic improvements in temporal coherence.
The WHIM dataset is WiLoR’s hidden infrastructure. The pipeline builds 2M+ automatic annotations from 1,400+ YouTube videos, with biomechanical constraints embedded in the fitting process to limit noise. This is what enables in-the-wild training at the scale behind the reported performance.
Comparing HaMeR and WiLoR:
| | HaMeR | WiLoR |
|---|---|---|
| Reconstruction backbone | ViT-H | ViT-L + ViTPose |
| Refinement module | None | Multi-scale image-aligned |
| Detector | External dependency | Built-in (WiLoR-Det) |
| Training data | 2.7M (10 datasets) | 4.2M (14 datasets) |
| FreiHAND PA-MPJPE | 6.0 mm | 5.5 mm |
| Temporal coherence (Jitter) | 20.43 | 5.92 |
“From detection through reconstruction as one, then return to the image to correct errors — WiLoR’s next step.”