[Paper Review] ExtPose: Robust and Coherent Pose Estimation by Extending ViTs (ICML 2025)
Paper: ExtPose: Robust and Coherent Pose Estimation by Extending ViTs
Venue: ICML 2025
Authors: Rongyu Chen*, Li’an Zhuo*, Linlin Yang, Qi Wang, Liefeng Bo, Bang Zhang, Angela Yao (* equal contribution)
Project: gloryyrolg.github.io/extpose
OpenReview: hm9FNEZZ6z
One-Line Summary
ExtPose extends image-based ViT models for 3D human and hand pose estimation (HPE) with skeleton image inputs (2D pose evidence) and cross-frame attention (temporal information), improving 2D alignment and temporal coherence at the same time without introducing extra parameters. It achieves 34.0mm PA-MPJPE on 3DPW (−23% vs. HMR2.0) and 4.9mm PA-MPJPE on FreiHAND (−18% vs. HaMeR).

1. Background and Problem Statement
Two Structural Limitations of ViT-based HPE
Vision Transformers have rapidly become the dominant backbone for 3D human and hand pose estimation (HPE), with models like HMR2.0 and HaMeR achieving remarkable benchmark numbers after large-scale pretraining. However, these ViT-based methods share two structural limitations.
① Absence of temporal information
Existing ViT HPE models are designed for single-image input. When applied to video, they cannot incorporate temporal information across frames, causing jittery, temporally incoherent predictions. Video-based methods add separate temporal modules on top of static image features, which prevents full spatiotemporal feature interaction at each transformer layer.
② Failure to maintain 2D pixel alignment
ViT-based 3D HPE models reason from global image context, while 2D pose estimators provide more accurate localized cues. As a result, ViT-based 3D HPE models repeatedly produce pixel-misaligned predictions such as wrist flips and incorrect global orientation. The problem is rooted not in the ViT architecture itself but in the nature of the 3D HPE task.
ExtPose’s Central Question
“Can we integrate 2D pose evidence and temporal information into a pretrained image ViT without introducing any additional parameters?”
2. Core Ideas
ExtPose follows an extension paradigm: rather than designing new modules or retraining from scratch, it extends the ViT’s existing attention mechanism in two orthogonal directions.
- 2D pose evidence integration: Takes skeleton images generated by an off-the-shelf 2D detector as a second input alongside RGB, fusing them via cross-modal attention.
- Temporal attention extension: Applies the same ViT attention mechanism across frames to model temporal relationships.
The key insight is representation unification. By representing 2D poses as skeleton images (same 3-channel visual format as RGB), the same ViT backbone processes both modalities with the same patch embedding, positional encoding, and self-attention. Parameters are shared across modalities and frames, so no additional parameters are introduced.
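The trade-off behind this sharing can be made concrete with a little arithmetic: the token sequence grows with the number of modalities and frames, while the patch-embedding weights stay the same. The sketch below is only illustrative; the input resolution, patch size, frame count and embedding width are assumptions chosen for the example, not values taken from the paper.

```python
def num_patch_tokens(height, width, patch):
    """Number of non-overlapping patch tokens for one image."""
    return (height // patch) * (width // patch)

# Hypothetical settings for illustration (not taken from the paper).
H, W, PATCH, T = 256, 192, 16, 16
EMBED_DIM = 1280  # hypothetical embedding width

m = num_patch_tokens(H, W, PATCH)   # tokens per image, M
tokens_image_only = m               # single RGB frame
tokens_dual_stream = 2 * m          # RGB + skeleton image
tokens_video = 2 * m * T            # both modalities over T frames

# Patch-embedding weights depend only on patch size and channel count,
# so a second 3-channel input reuses them unchanged.
patch_embed_params = PATCH * PATCH * 3 * EMBED_DIM + EMBED_DIM

print(tokens_image_only, tokens_dual_stream, tokens_video, patch_embed_params)
```

Under these assumptions the parameter count is identical in all three settings; only the attention sequence length changes.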
3. Unified 2D Pose Feature Extraction

Skeleton Images: The Key to Representation Unification
Three representations exist for 2D poses:
- 1D coordinate arrays: Joint coordinates as a vector. No spatial structure.
- Heatmaps \(H \in \mathbb{R}^{H \times W \times \lvert J \rvert}\): Gaussians placed at joint locations. Channel count varies across datasets; lacks explicit structural relationships.
- Skeleton images \(I_p \in \mathbb{R}^{H \times W \times 3}\): Stick-figure pose on a blank background.
ExtPose adopts skeleton images. Joints are rendered as distinctively colored circles, bones as gradient-colored connections reflecting the kinematic chain. Per-joint confidence (from Gu et al., 2024) is encoded via alpha blending, automatically suppressing unreliable detections.
Because skeleton images share the same 3-channel visual format as RGB:
- The same patch embedding, positional encoding, and self-attention (SA) encoder process both modalities.
- Extracted features maintain spatial consistency with image features from the pretrained ViT.
- The aligned feature space enables faster convergence than coordinate-based or heatmap representations.
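A minimal sketch of how such a skeleton image could be rasterized is shown below. It is not the authors' renderer: the black background, integer pixel coordinates, joint radius, colors, and the linear interpolation of color and confidence along each bone are all assumptions made for illustration. The one idea taken from the text is that confidence acts as the alpha value, so unreliable detections fade into the background.

```python
def render_skeleton(joints, bones, size=(64, 64), radius=2):
    """Rasterize joints and bones into an H x W x 3 image (floats in [0, 1]).

    joints: list of (x, y, confidence, (r, g, b)); colors are illustrative.
    bones:  list of (parent_index, child_index) pairs.
    """
    h, w = size
    img = [[[0.0, 0.0, 0.0] for _ in range(w)] for _ in range(h)]

    def blend(x, y, color, alpha):
        # Alpha blending: low-confidence detections fade into the background.
        if 0 <= x < w and 0 <= y < h:
            px = img[y][x]
            for c in range(3):
                px[c] = (1.0 - alpha) * px[c] + alpha * color[c]

    # Bones: interpolate position, color and confidence from parent to child.
    for a, b in bones:
        xa, ya, ca, cola = joints[a]
        xb, yb, cb, colb = joints[b]
        steps = max(abs(xb - xa), abs(yb - ya), 1)
        for s in range(steps + 1):
            t = s / steps
            x = round(xa + t * (xb - xa))
            y = round(ya + t * (yb - ya))
            color = [(1 - t) * cola[c] + t * colb[c] for c in range(3)]
            blend(x, y, color, (1 - t) * ca + t * cb)

    # Joints: filled circles drawn on top of the bones.
    for x0, y0, conf, color in joints:
        for y in range(y0 - radius, y0 + radius + 1):
            for x in range(x0 - radius, x0 + radius + 1):
                if (x - x0) ** 2 + (y - y0) ** 2 <= radius ** 2:
                    blend(x, y, color, conf)
    return img


# Two joints: a confident red one and a half-confident blue one.
joints = [(10, 10, 1.0, (1.0, 0.0, 0.0)), (40, 30, 0.5, (0.0, 0.0, 1.0))]
img = render_skeleton(joints, bones=[(0, 1)])
```

Because the output has the same H × W × 3 layout as an RGB crop, it can be passed through the same patch embedding without any adapter.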
4. Dual-Stream Cross-Modal Attention
The image stream \(\{F^I_i\}^M\) and 2D pose stream \(\{F^p_i\}^M\) are processed in parallel by the shared ViT backbone. Both streams attend to themselves and to each other:
Attention weights for image feature \(F^I_i\):
\[A^I_i = \text{Softmax}\left(\frac{[S^{I\text{-}I}_i;\; S^{I\text{-}p}_i]}{\sqrt{D}}\right)\]
where \(S^{I\text{-}I}_i = K^I Q^I_i\) captures within-image attention, and \(S^{I\text{-}p}_i = K^p Q^I_i\) captures cross-modal attention toward the pose stream.
Similarly, attention weights for pose feature \(F^p_i\):
\[A^p_i = \text{Softmax}\left(\frac{[K^I;\; K^p] Q^p_i}{\sqrt{D}}\right)\]
Updated features:
\[\Delta F^I_i = A^I_i \odot V^{Ip},\quad \Delta F^p_i = A^p_i \odot V^{Ip},\quad \text{where } V^{Ip} = [V^I;\; V^p]\]
This is implemented by simply concatenating the two streams before self-attention, so no new parameters are required. The dot-product-based attention naturally handles imperfect pose alignment, as it can gather information from anywhere in the spatial domain without requiring exact correspondence.
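The concatenation described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the token values are made up, and the Q/K/V projections are taken to be the identity purely to keep the example short (a real ViT block applies its learned, shared projections first).

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    weights = softmax([dot(k, query) / math.sqrt(d) for k in keys])
    out = [sum(w * v[c] for w, v in zip(weights, values))
           for c in range(len(values[0]))]
    return out, weights


# Toy tokens (M = 2 per stream, D = 2); identity projections are assumed.
img_tokens = [[1.0, 0.0], [0.0, 1.0]]
pose_tokens = [[1.0, 1.0], [0.5, -0.5]]

# Concatenating the streams yields keys [K^I; K^p] and values [V^I; V^p].
all_tokens = img_tokens + pose_tokens

delta_img0, a_img0 = attend(img_tokens[0], all_tokens, all_tokens)
delta_pose0, a_pose0 = attend(pose_tokens[0], all_tokens, all_tokens)

# a_img0[:2] are within-image weights, a_img0[2:] are cross-modal weights.
print(a_img0)
print(a_pose0)
```

A single softmax over the concatenated scores means within-modality and cross-modal weights compete directly, which is what lets an image token ignore a poorly aligned skeleton token.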
5. From Images to Videos: Temporal Attention Extension
No separate temporal module is needed. The same ViT self-attention is applied across frames:
3D spatiotemporal attention over all tokens \(\{\{F^t_i\}^M\}^T\):
\[F^t_i = F^t_i + \text{Attn}\left(F^t_i,\; \{\{F^t_i\}^M\}^T\right)\]
A token \(F^{t_1}_i\) can attend to the same spatial location \(F^{t_2}_i\) in a different frame, as well as to a different location \(F^{t_2}_j\,(j \neq i)\), naturally capturing motion.
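One way to picture the temporal extension is as a reshape: flatten the T × M tokens into one sequence, run the unchanged attention, and reshape back. The sketch below assumes identity Q/K/V projections and toy token values, and it omits modality handling and positional embeddings; it only shows that T = 1 reduces to ordinary per-image attention, while T > 1 lets tokens mix across frames.

```python
import math


def self_attention(tokens):
    """Residual self-attention over a flat token list (identity projections)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(k, q)) / math.sqrt(d) for k in tokens]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        mixed = [sum(e / z * v[c] for e, v in zip(exps, tokens)) for c in range(d)]
        out.append([x + y for x, y in zip(q, mixed)])  # F <- F + Attn(F, all)
    return out


def spatiotemporal_attention(video):
    """video: T frames x M tokens x D. Flatten, attend jointly, reshape back."""
    t, m = len(video), len(video[0])
    flat = [tok for frame in video for tok in frame]  # T*M tokens
    mixed = self_attention(flat)
    return [mixed[f * m:(f + 1) * m] for f in range(t)]


frame_a = [[1.0, 0.0], [0.0, 1.0]]
frame_b = [[0.0, 2.0], [2.0, 0.0]]

per_frame = [self_attention(frame_a), self_attention(frame_b)]  # image-only
joint = spatiotemporal_attention([frame_a, frame_b])            # T = 2
single = spatiotemporal_attention([frame_a])                    # T = 1
```

Since no weights are added, the same function serves both the image and the video setting.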
An attention mask \(M\) prevents temporal attention from crossing modality boundaries:
\[M = \begin{pmatrix} 0_{T \times M} & -\infty_{T \times M} \\ -\infty_{T \times M} & 0_{T \times M} \end{pmatrix}\]
Applying 3D attention to a pretrained image-based ViT zero-shot yields a “free lunch”: temporal coherence improves without additional training. Fine-tuning with temporal positional embeddings encoding frame location and motion ordering further improves performance.
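The modality mask above can be read as an additive bias on the attention scores. The sketch below is a literal rendering of that block matrix only: it assumes each block spans the T·M tokens of one modality and that image tokens come first in the sequence. How this mask is combined with the per-frame cross-modal attention is not spelled out in this review, so that part is not modelled.

```python
import math

NEG_INF = float("-inf")


def modality_mask(tokens_per_modality):
    """Additive mask: 0 inside each modality block, -inf across modalities."""
    n = tokens_per_modality
    return [[0.0 if (row < n) == (col < n) else NEG_INF
             for col in range(2 * n)]
            for row in range(2 * n)]


def masked_softmax(scores, mask_row):
    shifted = [s + b for s, b in zip(scores, mask_row)]
    m = max(shifted)
    exps = [math.exp(x - m) for x in shifted]  # exp(-inf) evaluates to 0.0
    z = sum(exps)
    return [e / z for e in exps]


T, M = 2, 3  # toy sizes; each block is assumed to span T*M tokens
mask = modality_mask(T * M)

# Attention weights of the first image token over all 2*T*M tokens.
scores = [0.1 * i for i in range(2 * T * M)]
weights = masked_softmax(scores, mask[0])
```

Adding `-inf` before the softmax drives the masked weights to exactly zero, so the remaining weights still sum to one.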
6. Loss Functions & Implementation Details
The total training loss combines four terms (the per-term weights are omitted here):
\[\mathcal{L} = \mathcal{L}_{joint} + \mathcal{L}_{param} + \mathcal{L}_{reproj} + \mathcal{L}_{adv}\]

| Loss | Definition | Description |
|---|---|---|
| \(\mathcal{L}_{joint}\) | \(\lVert \hat{J}_{3D} - J_{3D} \rVert_1\) | L1 3D joint position loss |
| \(\mathcal{L}_{param}\) | \(\lVert \hat{\theta} - \theta \rVert^2_2 + \lVert \hat{\beta} - \beta \rVert^2_2\) | SMPL parameter loss (when GT params available) |
| \(\mathcal{L}_{reproj}\) | \(\lVert \hat{J}_{2D} - J_{2D} \rVert_1\) | L1 2D reprojection loss |
| \(\mathcal{L}_{adv}\) | \(\lVert D(\hat{\theta}, \hat{\beta}) - 1 \rVert^2_2\) | Adversarial loss; discriminator \(D\) scores the predicted parameters |
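As a worked example of the four terms in the table, the sketch below evaluates them on tiny made-up vectors. The dictionary keys, the flattened-vector layout, the scalar discriminator score, and the unit weights are assumptions for illustration; the actual loss weights are not given in this review.

```python
def l1(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target))


def sq_l2(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def total_loss(pred, gt, disc_score, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms; the default weights are placeholders."""
    l_joint = l1(pred["j3d"], gt["j3d"])
    # The parameter loss only applies when ground-truth parameters exist.
    if "theta" in gt and "beta" in gt:
        l_param = sq_l2(pred["theta"], gt["theta"]) + sq_l2(pred["beta"], gt["beta"])
    else:
        l_param = 0.0
    l_reproj = l1(pred["j2d"], gt["j2d"])
    l_adv = (disc_score - 1.0) ** 2  # push D(predicted params) toward 1
    terms = (l_joint, l_param, l_reproj, l_adv)
    return sum(w * t for w, t in zip(weights, terms)), terms


# Made-up flattened predictions and targets.
pred = {"j3d": [0.1, 0.2, 0.3], "theta": [0.0, 0.5], "beta": [1.0], "j2d": [10.0, 20.0]}
gt = {"j3d": [0.0, 0.2, 0.5], "theta": [0.0, 0.0], "beta": [0.5], "j2d": [11.0, 18.0]}

total, terms = total_loss(pred, gt, disc_score=0.5)

# Same sample without ground-truth parameters: the parameter term drops out.
gt_no_params = {"j3d": gt["j3d"], "j2d": gt["j2d"]}
total_np, terms_np = total_loss(pred, gt_no_params, disc_score=0.5)
```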
Modality masking: During training, each modality (image or skeleton) is masked out entirely with 50% probability to train the model to extract features from each modality independently.
Implementation details:
- Optimizer: AdamW (lr=1e-5, β₁=0.9, β₂=0.999, weight decay=1e-3)
- Training: 50K iterations, batch size 32, 8× A100 GPUs
- Initialization: Fine-tuned from HMR2.0 (human) / HaMeR (hand)
- Attention: Flash Attention for memory and speed efficiency
7. Experimental Results
Human Pose Estimation: 3DPW (Table 1)

| Method | MPVPE↓ | MPJPE↓ | PA-MPJPE↓ |
|---|---|---|---|
| Image-based | | | |
| HMR2.0 (ICCV’23) | 82.2 | 69.8 | 44.4 |
| REFIT (ICCV’23) | 75.1 | 65.3 | 40.5 |
| ExtPose (T=1) | 68.9 | 55.6 | 35.5 |
| ExtPose† (T=1) | - | 36.7 | 25.4 |
| Video-based | | | |
| WHAM (CVPR’24) | 68.7 | 57.8 | 35.9 |
| ExtPose (T=16) | 67.5 | 54.2 | 34.0 |
At T=1, ExtPose improves PA-MPJPE from 44.4mm (HMR2.0) to 35.5mm (−20.0%) and is 12.3% better than REFIT (40.5mm). At T=16, it surpasses WHAM (which also uses 2D poses) by 5.3%. Using GT 2D poses (†) yields 25.4mm PA-MPJPE, approaching annotation error levels.
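The relative improvements quoted in this review can be checked directly against the table values. The helper below is a small convenience function written for this purpose (its name is not from the paper); the inputs are the PA-MPJPE numbers from the 3DPW table.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` lowers an error metric relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# PA-MPJPE values (mm) from the 3DPW table.
print(round(relative_reduction(44.4, 35.5), 1))  # HMR2.0 vs ExtPose (T=1):  20.0
print(round(relative_reduction(40.5, 35.5), 1))  # REFIT vs ExtPose (T=1):   12.3
print(round(relative_reduction(35.9, 34.0), 1))  # WHAM vs ExtPose (T=16):   5.3
print(round(relative_reduction(44.4, 34.0), 1))  # HMR2.0 vs ExtPose (T=16): 23.4
```

The same function applies to the hand benchmarks; for "higher is better" metrics such as F@5, swap the sign of the numerator.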
Hand Pose Estimation: FreiHAND (Table 2)

| Method | PA-MPJPE↓ | PA-MPVPE↓ | F@5↑ | F@15↑ |
|---|---|---|---|---|
| MeshGraphormer (ICCV’21) | 5.9 | 6.0 | 0.764 | 0.986 |
| MobRecon (CVPR’22) | 5.7 | 5.8 | 0.784 | 0.986 |
| HaMeR (CVPR’24) | 6.0 | 5.7 | 0.785 | 0.990 |
| ExtPose | 4.9 | 5.1 | 0.823 | 0.993 |
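ExtPose reduces PA-MPJPE by 18.3% relative to HaMeR (6.0mm → 4.9mm) and by 14.0% relative to MobRecon (5.7mm), the strongest prior PA-MPJPE in the table, and raises F@5 by 4.8% over HaMeR.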
Hand Pose Estimation: HO3D v2 (Table 3)

| Method | AUCJ↑ | PA-MPJPE↓ | AUCV↑ | PA-MPVPE↓ | F@5↑ | F@15↑ |
|---|---|---|---|---|---|---|
| HaMeR (CVPR’24) | 0.846 | 7.7 | 0.841 | 7.9 | 0.635 | 0.980 |
| ExtPose (T=1) | 0.858 | 7.0 | 0.850 | 7.5 | 0.660 | 0.985 |
| DeFormer (ICCV’23) | - | 9.4 | - | 9.1 | 0.546 | 0.963 |
| ExtPose (T=16) | 0.863 | 6.9 | 0.856 | 7.3 | 0.667 | 0.991 |
In the video setting (T=16), ExtPose achieves 26.6%↓ PA-MPJPE and +22.2%↑ F@5 over DeFormer (AUCJ is not reported for DeFormer in the table).

2D Alignment Evaluation: HInt Dataset (Table 4)

| Method | NEWDAYS @0.05↑ | @0.1↑ | @0.15↑ | VISOR @0.05↑ | @0.1↑ | @0.15↑ |
|---|---|---|---|---|---|---|
| HaMeR | 48.0 | 78.0 | 88.8 | 43.0 | 76.9 | 89.3 |
| ExtPose | 59.6 | 84.8 | 92.7 | 61.1 | 88.5 | 95.6 |
| ExtPose† | 84.6 | 97.9 | 99.4 | 83.3 | 98.2 | 99.6 |
| ViTPose | 66.5 | 86.5 | 93.1 | 70.8 | 90.6 | 96.2 |
At the @0.05 threshold, ExtPose improves over HaMeR by 24.2% on NEWDAYS (48.0 → 59.6) and 42.1% on VISOR (43.0 → 61.1) in relative terms. With GT 2D poses (†), ExtPose surpasses ViTPose’s 2D alignment performance, which suggests that the bottleneck is 2D detection quality, not the fusion architecture.
8. Ablation Studies
2D Pose Representation Comparison (Table 5)

| IMG | 2D Pose | PA-MPJPE↓ | PA-MPVPE↓ | F@5↑ | F@15↑ |
|---|---|---|---|---|---|
| - | 1D coords | 6.5 | 6.6 | 0.724 | 0.983 |
| - | Heatmap | 6.3 | 6.3 | 0.747 | 0.984 |
| - | Skel. image | 6.2 | 6.3 | 0.742 | 0.985 |
| ✓ | - (zeros) | 6.0 | 5.7 | 0.783 | 0.991 |
| ✓ | Skel. image | 4.9 | 5.1 | 0.823 | 0.993 |
Key findings:
- Image-only (zero 2D): Essentially no improvement over HaMeR (6.0mm PA-MPJPE in both cases), because the well-trained ViT backbone already has strong image features; precise 2D localization is needed for further gain.
- Lifting only (2D-only): Underperforms image-based HPE due to depth ambiguity and missing keypoint detections without RGB texture.
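- Skeleton image ≥ heatmap > 1D coords (2D-only rows): Skeleton images and heatmaps are close (6.2mm vs. 6.3mm PA-MPJPE) and both beat 1D coordinates (6.5mm). Visual-domain representations align better with ViT features and converge faster.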
- RGB + skeleton image: Synergy between the two modalities achieves the best performance (PA-MPJPE 4.9mm, −18%).
- With GT 2D poses: further improves to 3.7mm on FreiHAND.
Fusion Strategy Comparison (Table 6)

| Method | NEWDAYS @0.05 | @0.1 | @0.15 | VISOR @0.05 | @0.1 | @0.15 |
|---|---|---|---|---|---|---|
| HaMeR | 48.0 | 78.0 | 88.8 | 43.0 | 76.9 | 89.3 |
| Late Fusion | 50.5 | 82.4 | 92.5 | 52.5 | 87.1 | 95.6 |
| Channel Concat* | 56.3 | 83.6 | 92.2 | 55.9 | 87.3 | 95.3 |
| ControlNet* | 55.6 | 83.5 | 92.3 | 57.7 | 87.5 | 95.5 |
| From ViTPose init | 49.9 | 82.2 | 92.2 | 46.4 | 85.3 | 95.2 |
| Only Q, K | 50.0 | 81.9 | 92.3 | 49.1 | 85.3 | 95.1 |
| 1st Half blocks | 50.8 | 82.2 | 92.3 | 50.2 | 85.8 | 95.2 |
| ExtPose (full) | 59.6 | 84.8 | 92.7 | 61.1 | 88.5 | 95.6 |
Key findings:
- Channel Concat* and ControlNet* use extra parameters but underperform ExtPose — the attention-based fusion achieves more effective cross-modal information exchange at every layer.
- Initialization: Starting from ViTPose (2D-trained) hurts performance; initializing from the 3D HPE pretrained model (HaMeR) is essential.
- Training scope: Training only Q,K projectors or only the first half of ViT blocks significantly limits performance — full backbone fine-tuning is required.

Sequence Length Analysis (Table 7)

| Sequence Length L | MPVPE↓ | MPJPE↓ | PA-MPJPE↓ |
|---|---|---|---|
| 1 (image) | 68.9 | 55.6 | 35.5 |
| 8 | 68.2 | 55.1 | 34.7 |
| 16 | 67.5 | 54.2 | 34.0 |
| 32 | 67.9 | 54.8 | 34.3 |
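PA-MPJPE improves steadily from 35.5mm at L=1 to 34.0mm at L=16, then degrades slightly to 34.3mm at L=32, suggesting that excessively long sequences may introduce additional ambiguity. (L here is the number of frames, written T elsewhere in this review.)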
9. Convergence Speed & Efficiency


10. Human3.6M Results (Appendix)
| Method | T | MPJPE↓ | PA-MPJPE↓ |
|---|---|---|---|
| HMR2.0 (model-based) | 1 | 44.8 | 33.6 |
| ExtPose (model-based) | 16 | 43.5 | 27.2 |
Among model-based methods, ExtPose achieves 19.0% PA-MPJPE improvement over HMR2.0 with only T=16 frames, approaching competitive lifting methods that use 243 frames.
11. Summary
ExtPose’s core insight is the extension paradigm:
“Don’t add new modules — extend ViT’s attention as-is to a wider input space.”
Three design choices work together:
- Skeleton images: Represent 2D poses in the same visual domain as RGB so the same ViT backbone processes both. Outperforms 1D coordinates and heatmaps in both convergence speed and accuracy.
- Cross-modal attention: Image and skeleton streams attend to each other at every transformer layer. More effective than late fusion, channel concat, or ControlNet — and requires zero extra parameters.
- Cross-frame attention: The same ViT attention extended across the temporal axis. No separate temporal module; temporal coherence emerges naturally, with further improvement from fine-tuning with temporal positional embeddings.
Practical advantages of this paradigm:
- Fine-tuning only — no large-scale retraining required (upgrade HMR2.0 → ExtPose in 50K iterations)
- Universal applicability — same framework for full-body and hand pose estimation
- Both image and video settings achieve SOTA with the same model
- Consistent improvements: 3DPW −23%, FreiHAND −18%, HO3D v2 −26.6% (video)