[Paper Review] ExtPose: Robust and Coherent Pose Estimation by Extending ViTs (ICML 2025)

Paper: ExtPose: Robust and Coherent Pose Estimation by Extending ViTs
Venue: ICML 2025
Authors: Rongyu Chen*, Li’an Zhuo*, Linlin Yang, Qi Wang, Liefeng Bo, Bang Zhang, Angela Yao (* equal contribution)
Project: gloryyrolg.github.io/extpose
OpenReview: hm9FNEZZ6z

One-Line Summary

A framework that extends image-based ViT HPE with skeleton image inputs (2D pose evidence) and cross-frame attention (temporal information) — simultaneously improving 2D alignment and temporal coherence without introducing extra parameters. Achieves 34.0mm PA-MPJPE (−23%) on 3DPW and 4.9mm PA-MPJPE (−18%) on FreiHAND.

Figure 1: ExtPose extends and aids SOTA ViT-based HPE methods in generalization with auxiliary 2D detection. Even in extremely complex parkour scenarios, ExtPose (3rd column) reconstructs pixel-aligned meshes accurately, while HaMeR/HMR2.0/ViTPose-whole (2nd column) produce a completely flipped hand.

1. Background and Problem Statement

Two Structural Limitations of ViT-based HPE

Vision Transformers have rapidly become the dominant backbone for 3D human and hand pose estimation (HPE), with models like HMR2.0 and HaMeR achieving remarkable benchmark numbers after large-scale pretraining. However, these ViT-based methods share two structural limitations.

① Absence of temporal information

Existing ViT HPE models are designed for single-image input. When applied to video, they cannot incorporate temporal information across frames, causing jittery, temporally incoherent predictions. Video-based methods add separate temporal modules on top of static image features, which prevents full spatiotemporal feature interaction at each transformer layer.

② Failure to maintain 2D pixel alignment

ViT-based 3D HPE models reason from global image context, while 2D pose estimators provide more accurate localized cues. This gap causes ViT 3D HPE to repeatedly make pixel-misalignment errors such as wrist flips and incorrect global orientation. The problem is rooted not in the ViT architecture itself but in the nature of the 3D HPE task.

ExtPose’s Central Question

“Can we integrate 2D pose evidence and temporal information into a pretrained image ViT without introducing any additional parameters?”

2. Core Ideas

ExtPose follows an extension paradigm: rather than designing new modules or retraining from scratch, it extends the ViT’s existing attention mechanism in two orthogonal directions.

2D pose evidence integration: Takes skeleton images generated by an off-the-shelf 2D detector as a second input alongside RGB, fusing them via cross-modal attention.
Temporal attention extension: Applies the same ViT attention mechanism across frames to model temporal relationships.

The key insight is representation unification. By representing 2D poses as skeleton images (same 3-channel visual format as RGB), the same ViT backbone processes both modalities with the same patch embedding, positional encoding, and self-attention. Parameters are shared across modalities and frames, so no additional parameters are introduced.

3. Unified 2D Pose Feature Extraction

Figure 2: ExtPose framework overview. The skeleton image represents 2D poses (Sec. 4.3). A shared ViT backbone processes parallel image and 2D pose streams. Cross-modal attention (Sec. 4.4) fuses the two streams, and cross-frame attention (Sec. 4.5) captures video temporal context. Enhanced features are decoded into the 3D mesh.

Skeleton Images: The Key to Representation Unification

Three representations exist for 2D poses:

1D coordinate arrays: Joint coordinates as a vector. No spatial structure.
Heatmaps \(H \in \mathbb{R}^{H \times W \times \lvert J \rvert}\): Gaussians placed at joint locations. Channel count varies across datasets; lacks explicit structural relationships.
Skeleton images \(I_p \in \mathbb{R}^{H \times W \times 3}\): Stick-figure pose on a blank background.

ExtPose adopts skeleton images. Joints are rendered as distinctively colored circles, bones as gradient-colored connections reflecting the kinematic chain. Per-joint confidence (from Gu et al., 2024) is encoded via alpha blending, automatically suppressing unreliable detections.
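
The rendering step can be sketched as follows. This is a minimal illustration assuming NumPy; the function name `render_skeleton`, the palette, the circle radius, the bone width, and the choice of the lower endpoint confidence as a bone's alpha are all hypothetical and are not taken from the paper's renderer.

```python
import numpy as np

def render_skeleton(joints_xy, conf, bones, size=64, radius=2.0, width=1.0):
    """Rasterize 2D joints and bones into a (size, size, 3) float image in [0, 1]."""
    joints_xy = np.asarray(joints_xy, dtype=np.float32)
    conf = np.asarray(conf, dtype=np.float32)
    img = np.zeros((size, size, 3), dtype=np.float32)      # blank background
    ys, xs = np.mgrid[0:size, 0:size].astype(np.float32)
    pix = np.stack([xs, ys], axis=-1)                      # (H, W, 2) as (x, y)
    n = len(joints_xy)
    # One distinct color per joint (evenly spaced toy palette, an assumption).
    palette = np.stack([np.linspace(1.0, 0.2, n),
                        np.linspace(0.2, 1.0, n),
                        np.full(n, 0.5)], axis=1).astype(np.float32)

    # Bones: color is interpolated between the two endpoint colors.
    for a, b in bones:
        d = joints_xy[b] - joints_xy[a]
        t = np.clip(((pix - joints_xy[a]) @ d) / max(float(d @ d), 1e-8), 0.0, 1.0)
        closest = joints_xy[a] + t[..., None] * d          # nearest point on the segment
        mask = np.linalg.norm(pix - closest, axis=-1) <= width
        color = (1.0 - t[..., None]) * palette[a] + t[..., None] * palette[b]
        alpha = min(conf[a], conf[b])                      # assumed bone confidence
        img[mask] = (1.0 - alpha) * img[mask] + alpha * color[mask]

    # Joints: filled circles drawn on top, alpha-blended by detection confidence.
    for j in range(n):
        mask = np.linalg.norm(pix - joints_xy[j], axis=-1) <= radius
        img[mask] = (1.0 - conf[j]) * img[mask] + conf[j] * palette[j]
    return img
```

With a confidence of 0, a joint and the bones attached to it leave the background untouched, which is the suppression effect described above.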

Because skeleton images share the same 3-channel visual format as RGB:

The same patch embedding, positional encoding, and self-attention (SA) encoder process both modalities.
Extracted features maintain spatial consistency with image features from the pretrained ViT.
The aligned feature space enables faster convergence than coordinate-based or heatmap representations.
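
A toy illustration of the parameter sharing, assuming NumPy: one projection matrix and one positional table embed both the RGB crop and the skeleton image. The sizes, the random weights, and the names `patchify`, `W_embed`, and `pos` are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, 3) image into flattened non-overlapping patches."""
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    x = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
patch, dim = 16, 32                                   # toy sizes
W_embed = rng.normal(size=(patch * patch * 3, dim)) * 0.02   # one shared projection
pos = rng.normal(size=((64 // patch) ** 2, dim)) * 0.02      # one shared positional table

rgb = rng.random((64, 64, 3))                         # stand-in RGB crop
skel = np.zeros((64, 64, 3))                          # stand-in skeleton image
skel[20:44, 30:34] = [1.0, 0.2, 0.5]                  # a crude "bone"

tok_img = patchify(rgb, patch) @ W_embed + pos        # image tokens
tok_pose = patchify(skel, patch) @ W_embed + pos      # pose tokens, same weights
n_params = W_embed.size + pos.size                    # unchanged by the second modality
```

Both token sets have the same shape and live in the same embedding space, so the downstream transformer blocks need no modality-specific weights.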

4. Cross-Modal Attention

The image stream \(\{F^I_i\}^M\) and 2D pose stream \(\{F^p_i\}^M\) are processed in parallel by the shared ViT backbone. Both streams attend to themselves and to each other:

Attention weights for image feature \(F^I_i\):

\[A^I_i = \text{Softmax}\left(\frac{[S^{I\text{-}I}_i;\; S^{I\text{-}p}_i]}{\sqrt{D}}\right)\]

where \(S^{I\text{-}I}_i = K^I Q^I_i\) captures within-image attention, and \(S^{I\text{-}p}_i = K^p Q^I_i\) captures cross-modal attention toward the pose stream.

Similarly, attention weights for pose feature \(F^p_i\):

\[A^p_i = \text{Softmax}\left(\frac{[K^I;\; K^p] Q^p_i}{\sqrt{D}}\right)\]

Updated features:

\[\Delta F^I_i = A^I_i \odot V^{Ip},\quad \Delta F^p_i = A^p_i \odot V^{Ip},\quad \text{where } V^{Ip} = [V^I;\; V^p]\]

This is implemented by simply concatenating the two streams before self-attention — no new parameters are required. The dot-product-based attention naturally handles imperfect pose alignment, as it can gather information from anywhere in the spatial domain without requiring exact correspondence.
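
The equivalence between the block formulas and plain concatenation can be checked numerically. The sketch below assumes NumPy, a single attention head, random toy weights, and writes the weights-times-values step as an ordinary matrix product; the names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Ordinary single-head self-attention with shared projections."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
M, D = 6, 8                                            # toy token count and width
F_img, F_pose = rng.normal(size=(M, D)), rng.normal(size=(M, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))

# Fused path: concatenate the two streams, then run unmodified self-attention.
fused = self_attention(np.concatenate([F_img, F_pose]), Wq, Wk, Wv)
delta_img, delta_pose = fused[:M], fused[M:]

# Explicit path for the image stream, following the block formula.
Q_i = F_img @ Wq
K_i, K_p = F_img @ Wk, F_pose @ Wk
V_ip = np.concatenate([F_img @ Wv, F_pose @ Wv])           # [V^I; V^p]
S = np.concatenate([Q_i @ K_i.T, Q_i @ K_p.T], axis=1)     # [S^{I-I}; S^{I-p}]
explicit = softmax(S / np.sqrt(D)) @ V_ip
```

`delta_img` and `explicit` agree to numerical precision, which is why the fusion needs no weights beyond the existing Q, K, V projections.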

5. From Images to Videos: Temporal Attention Extension

No separate temporal module is needed. The same ViT self-attention is applied across frames:

3D spatiotemporal attention over all tokens \(\{\{F^t_i\}^M\}^T\):

\[F^t_i = F^t_i + \text{Attn}\left(F^t_i,\; \{\{F^t_i\}^M\}^T\right)\]

A token \(F^{t_1}_i\) can attend to the same spatial location \(F^{t_2}_i\) in a different frame, as well as to a different location \(F^{t_2}_j\,(j \neq i)\), naturally capturing motion.

An attention mask \(M\) prevents temporal attention from crossing modality boundaries:

\[M = \begin{pmatrix} 0_{T \times M} & -\infty_{T \times M} \\ -\infty_{T \times M} & 0_{T \times M} \end{pmatrix}\]

Applying 3D attention to a pretrained image-based ViT zero-shot yields a “free lunch”: temporal coherence improves without additional training. Fine-tuning with temporal positional embeddings encoding frame location and motion ordering further improves performance.
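
A minimal sketch of masked spatiotemporal attention, assuming NumPy, identity Q/K/V projections for brevity, and one reading of the mask above: tokens are ordered as all image tokens of the clip followed by all pose tokens, and the additive mask blocks attention between the two groups. The ordering and the toy sizes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, mask=None):
    """Single-head attention with identity projections (a real ViT learns Q, K, V)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    if mask is not None:
        scores = scores + mask                    # -inf entries get zero weight
    return softmax(scores) @ x

T, M, D = 4, 5, 8                                 # frames, tokens per frame, width
rng = np.random.default_rng(0)
img_tokens = rng.normal(size=(T * M, D))          # image tokens of all T frames
pose_tokens = rng.normal(size=(T * M, D))         # pose tokens of all T frames

n = T * M
mask = np.full((2 * n, 2 * n), -np.inf)
mask[:n, :n] = 0.0                                # image queries see image keys, any frame
mask[n:, n:] = 0.0                                # pose queries see pose keys, any frame

x = np.concatenate([img_tokens, pose_tokens])
out = x + attention(x, mask)                      # residual update, as in the formula
```

Under this mask the joint pass gives the same result as running spatiotemporal attention on each modality separately, while still using a single attention call.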

6. Loss Functions & Implementation Details

The total training loss combines four terms (shown here without their weighting coefficients):

\[\mathcal{L} = \mathcal{L}_{joint} + \mathcal{L}_{param} + \mathcal{L}_{reproj} + \mathcal{L}_{adv}\]

Loss	Definition	Description
\(\mathcal{L}_{joint}\)	\(\|\hat{J}_{3D} - J_{3D}\|_1\)	L1 3D joint position loss
\(\mathcal{L}_{param}\)	\(\|\hat{\theta} - \theta\|^2_2 + \|\hat{\beta} - \beta\|^2_2\)	SMPL parameter loss (when GT params available)
\(\mathcal{L}_{reproj}\)	\(\|\hat{J}_{2D} - J_{2D}\|_1\)	L1 2D reprojection loss
\(\mathcal{L}_{adv}\)	\(\|D(\theta, \beta) - 1\|^2_2\)	Adversarial loss with discriminator \(D\)
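
The four terms can be written out directly. The sketch below assumes NumPy, sum reduction for each norm, and an unweighted total as in the formula; the function name, the dictionary keys, and `disc_score` (the discriminator output on the predicted parameters) are hypothetical.

```python
import numpy as np

def hpe_loss(pred, gt, disc_score):
    """Unweighted sum of joint, parameter, reprojection, and adversarial terms."""
    l_joint = np.abs(pred["j3d"] - gt["j3d"]).sum()            # L1 on 3D joints
    l_param = (np.sum((pred["theta"] - gt["theta"]) ** 2)
               + np.sum((pred["beta"] - gt["beta"]) ** 2))      # squared L2 on pose and shape
    l_reproj = np.abs(pred["j2d"] - gt["j2d"]).sum()            # L1 on reprojected 2D joints
    l_adv = np.sum((np.asarray(disc_score) - 1.0) ** 2)         # least-squares generator term
    total = l_joint + l_param + l_reproj + l_adv
    return {"joint": l_joint, "param": l_param, "reproj": l_reproj,
            "adv": l_adv, "total": total}
```

When ground-truth SMPL parameters are unavailable for a sample, the parameter term would simply be skipped for that sample.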

Modality masking: During training, each modality (image or skeleton) is masked out entirely with 50% probability to train the model to extract features from each modality independently.

Implementation details:

Optimizer: AdamW (lr=1e-5, β₁=0.9, β₂=0.999, weight decay=1e-3)
Training: 50K iterations, batch size 32, 8× A100 GPUs
Initialization: Fine-tuned from HMR2.0 (human) / HaMeR (hand)
Attention: Flash Attention for memory and speed efficiency

7. Experimental Results

Human Pose Estimation: 3DPW (Table 1)

Table 1: Comparison with SOTA HPE methods on 3DPW. ExtPose outperforms all image-based and video-based methods. † indicates upper bound using GT 2D poses.

Method	MPVPE↓	MPJPE↓	PA-MPJPE↓
Image-based
HMR2.0 (ICCV’23)	82.2	69.8	44.4
REFIT (ICCV’23)	75.1	65.3	40.5
ExtPose (T=1)	68.9	55.6	35.5
ExtPose† (T=1)	-	36.7	25.4
Video-based
WHAM (CVPR’24)	68.7	57.8	35.9
ExtPose (T=16)	67.5	54.2	34.0

At T=1, ExtPose improves PA-MPJPE from 44.4mm (HMR2.0) to 35.5mm, a 20.0% reduction, and beats REFIT (40.5mm) by 12.3%. At T=16, it surpasses WHAM (which also uses 2D poses) by 5.3% (35.9mm to 34.0mm). Using GT 2D poses (†) yields 25.4mm PA-MPJPE, approaching annotation error levels.
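
For readers unfamiliar with the metrics: MPJPE is the mean Euclidean joint error, and PA-MPJPE is the same error after a similarity (Procrustes) alignment of the prediction to the ground truth. The sketch below is a standard SVD-based formulation assuming NumPy; it is not the benchmark's official evaluation code, and the function names are illustrative.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error for (J, 3) arrays."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def procrustes_align(pred, gt):
    """Align pred to gt with the best scale, rotation, and translation."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)
    d = np.sign(np.linalg.det(U @ Vt))        # reflection guard: keep det(R) = +1
    Dm = np.diag([1.0, 1.0, d])
    R = U @ Dm @ Vt                           # rotation for row vectors: p @ R ~ g
    scale = (S * np.diag(Dm)).sum() / (p ** 2).sum()
    return scale * p @ R + mu_g

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment."""
    return mpjpe(procrustes_align(pred, gt), gt)
```

Because the alignment removes global rotation, scale, and translation, PA-MPJPE isolates errors in the articulated pose itself, which is why it is lower than MPJPE in every row of the table.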

Hand Pose Estimation: FreiHAND (Table 2)

Table 2: SOTA comparison on FreiHAND. HaMeR is the ViT baseline. ExtPose achieves an 18.3% PA-MPJPE improvement over HaMeR (6.0mm to 4.9mm) despite FreiHAND's challenging self-occlusion scenarios.

Method	PA-MPJPE↓	PA-MPVPE↓	F@5↑	F@15↑
MeshGraphormer (ICCV’21)	5.9	6.0	0.764	0.986
MobRecon (CVPR’22)	5.7	5.8	0.784	0.986
HaMeR (CVPR’24)	6.0	5.7	0.785	0.990
ExtPose	4.9	5.1	0.823	0.993

18.3% PA-MPJPE↓ and +4.8% F@5↑ over HaMeR; 14.0% PA-MPJPE↓ over MobRecon (5.7mm), the best prior PA-MPJPE in the table.

Hand Pose Estimation: HO3D v2 (Table 3)

Table 3: SOTA comparison on HO3D v2 in image (T=1) and video (T=16) settings. ExtPose (T=16) achieves the best result on every reported metric.

Method	AUCJ↑	PA-MPJPE↓	AUCV↑	PA-MPVPE↓	F@5↑	F@15↑
Image-based
HaMeR (CVPR’24)	0.846	7.7	0.841	7.9	0.635	0.980
ExtPose (T=1)	0.858	7.0	0.850	7.5	0.660	0.985
Video-based
DeFormer (ICCV’23)	-	9.4	-	9.1	0.546	0.963
ExtPose (T=16)	0.863	6.9	0.856	7.3	0.667	0.991

In the video setting (T=16), ExtPose achieves 26.6%↓ PA-MPJPE and +22.2%↑ F@5 over DeFormer (no AUCJ is listed for DeFormer).
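
The relative figures quoted throughout this review follow from the table entries by simple arithmetic. A small helper (the name `rel_change` is illustrative) reproduces the HO3D v2 video comparison from the DeFormer and ExtPose (T=16) rows:

```python
def rel_change(baseline, new):
    """Signed relative change of `new` with respect to `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

pa_mpjpe_drop = rel_change(9.4, 6.9)       # PA-MPJPE, lower is better
f5_gain = rel_change(0.546, 0.667)         # F@5, higher is better
print(f"{pa_mpjpe_drop:.1f}% {f5_gain:+.1f}%")   # prints "-26.6% +22.2%"
```

The same helper applies to any pair of rows, for example `rel_change(44.4, 35.5)` for HMR2.0 versus ExtPose (T=1) on 3DPW.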

Figure 3: Qualitative comparison across challenging scenarios: motion blur, hand-object interaction (HOI), and egocentric 3D. ExtPose consistently produces superior reconstructions compared to ViTPose (2D) and HaMeR.

2D Alignment Evaluation: HInt Dataset (Table 4)

Table 4: 2D PCK comparison on HInt (NEWDAYS, VISOR sequences). ExtPose substantially closes the gap to ViTPose while greatly outperforming HaMeR. With GT 2D poses (†), ExtPose exceeds ViTPose.

Method	NEWDAYS @0.05↑	@0.1↑	@0.15↑	VISOR @0.05↑	@0.1↑	@0.15↑
HaMeR	48.0	78.0	88.8	43.0	76.9	89.3
ExtPose	59.6	84.8	92.7	61.1	88.5	95.6
ExtPose†	84.6	97.9	99.4	83.3	98.2	99.6
ViTPose	66.5	86.5	93.1	70.8	90.6	96.2

At the @0.05 threshold, ExtPose improves over HaMeR by +24.2% (NEWDAYS, 48.0 to 59.6) and +42.1% (VISOR, 43.0 to 61.1) in relative terms. With GT 2D poses, ExtPose surpasses ViTPose’s 2D alignment performance, which indicates that the bottleneck is 2D detection quality, not the fusion architecture.
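
PCK@τ counts the percentage of keypoints whose 2D error falls below τ times a reference size. The sketch below assumes NumPy and normalization by the hand bounding-box size; the exact normalization used by the benchmark is an assumption here, and the function name is illustrative.

```python
import numpy as np

def pck(pred_2d, gt_2d, bbox_size, threshold, visible=None):
    """Percentage of keypoints with error <= threshold * bbox_size.

    pred_2d, gt_2d: (N, J, 2) pixel coordinates; bbox_size: (N,) reference sizes;
    visible: optional (N, J) boolean mask of annotated keypoints.
    """
    err = np.linalg.norm(pred_2d - gt_2d, axis=-1)               # (N, J) pixel errors
    ok = err <= threshold * np.asarray(bbox_size, dtype=float)[:, None]
    if visible is not None:
        ok = ok[visible]
    return 100.0 * float(ok.mean())
```

Tighter thresholds such as @0.05 are the most sensitive to small misalignments, which is where the gap between HaMeR and ExtPose is largest in the table.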

8. Ablation Studies

2D Pose Representation Comparison (Table 5)

Table 5: Ablation of 2D pose representations and branch combinations on FreiHAND. Combining RGB image with skeleton image achieves the best performance.

IMG	2D Pose	PA-MPJPE↓	PA-MPVPE↓	F@5↑	F@15↑
-	1D coords	6.5	6.6	0.724	0.983
-	Heatmap	6.3	6.3	0.747	0.984
-	Skel. image	6.2	6.3	0.742	0.985
✓	- (zeros)	6.0	5.7	0.783	0.991
✓	Skel. image	4.9	5.1	0.823	0.993

Key findings:

Image-only (zero 2D): Essentially no change from HaMeR (PA-MPJPE stays at 6.0mm) because the well-trained ViT backbone already has strong image features; precise 2D localization is needed for further gain.
Lifting only (2D-only): Underperforms image-based HPE due to depth ambiguity and missing keypoint detections without RGB texture.
Skeleton image > heatmap > 1D coords: On PA-MPJPE (6.2 vs 6.3 vs 6.5mm), visual-domain representations align better with ViT features and converge faster; the heatmap is marginally ahead on F@5 (0.747 vs 0.742).
RGB + skeleton image: Synergy between the two modalities achieves the best performance (PA-MPJPE 4.9mm, −18%).
With GT 2D poses: further improves to 3.7mm on FreiHAND.

Fusion Strategy Comparison (Table 6)

Table 6: Ablation of fusion strategies and training configurations on HInt. * denotes additional parameters. ExtPose with full attention and full parameter training achieves the best 2D alignment.

Method	NEWDAYS @0.05	@0.1	@0.15	VISOR @0.05	@0.1	@0.15
HaMeR	48.0	78.0	88.8	43.0	76.9	89.3
Late Fusion	50.5	82.4	92.5	52.5	87.1	95.6
Channel Concat*	56.3	83.6	92.2	55.9	87.3	95.3
ControlNet*	55.6	83.5	92.3	57.7	87.5	95.5
From ViTPose init	49.9	82.2	92.2	46.4	85.3	95.2
Only Q, K	50.0	81.9	92.3	49.1	85.3	95.1
1st Half blocks	50.8	82.2	92.3	50.2	85.8	95.2
ExtPose (full)	59.6	84.8	92.7	61.1	88.5	95.6

Key findings:

Channel Concat* and ControlNet* use extra parameters but underperform ExtPose — the attention-based fusion achieves more effective cross-modal information exchange at every layer.
Initialization: Starting from ViTPose (2D-trained) falls well short of the full model (49.9 vs 59.6 on NEWDAYS @0.05); initializing from the 3D HPE pretrained model (HaMeR) is essential.
Training scope: Training only Q,K projectors or only the first half of ViT blocks significantly limits performance — full backbone fine-tuning is required.

Figure 4: Cross-modal attention visualization. Attention weights between image and 2D pose streams. Weights are not concentrated on the diagonal — each branch attends to distinct areas of the other, with notable spatial awareness at corners and boundaries that aids 3D localization.

Sequence Length Analysis (Table 7)

Table 7: Effect of sequence length L (the number of input frames, written T elsewhere in this review) on 3DPW performance. L=16 is optimal across MPVPE, MPJPE, and PA-MPJPE.

Sequence Length L	MPVPE↓	MPJPE↓	PA-MPJPE↓
1 (image)	68.9	55.6	35.5
8	68.2	55.1	34.7
16	67.5	54.2	34.0
32	67.9	54.8	34.3

Performance improves steadily from L=1 to L=16, then slightly degrades at L=32 — excessively long sequences may introduce additional ambiguity.

9. Convergence Speed & Efficiency

Figure 5: Convergence comparison of different 2D pose representations in hand experiments. ExtPose (skeleton image + full attention) converges fastest and most stably. ControlNet starts from a lower point due to zero initialization.

Figure 6: Efficiency-Accuracy tradeoff curve on 3DPW (latency vs PA-MPJPE). ExtPose achieves the best accuracy across all latency regimes. While the 2D pose branch roughly doubles the data processed, overall efficiency remains near real-time.

10. Human3.6M Results (Appendix)

Method	T	MPJPE↓	PA-MPJPE↓
HMR2.0 (model-based)	1	44.8	33.6
ExtPose (model-based)	16	43.5	27.2

Among model-based methods, ExtPose achieves 19.0% PA-MPJPE improvement over HMR2.0 with only T=16 frames, approaching competitive lifting methods that use 243 frames.

11. Summary

ExtPose’s core insight is the extension paradigm:

“Don’t add new modules — extend ViT’s attention as-is to a wider input space.”

Three design choices work together:

Skeleton images: Represent 2D poses in the same visual domain as RGB so the same ViT backbone processes both. Outperforms 1D coordinates and heatmaps in both convergence speed and accuracy.
Cross-modal attention: Image and skeleton streams attend to each other at every transformer layer. More effective than late fusion, channel concat, or ControlNet — and requires zero extra parameters.
Cross-frame attention: The same ViT attention extended across the temporal axis. No separate temporal module; temporal coherence emerges naturally, with further improvement from fine-tuning with temporal positional embeddings.

Practical advantages of this paradigm:

Fine-tuning only — no large-scale retraining required (upgrade HMR2.0 → ExtPose in 50K iterations)
Universal applicability — same framework for full-body and hand pose estimation
Both image and video settings achieve SOTA with the same model
Consistent PA-MPJPE improvements: 3DPW −23%, FreiHAND −18%, HO3D v2 −26.6% (video)