[Paper Review] DexVLG: Dexterous Vision-Language-Grasp Model at Scale (ICCV 2025)

Paper: DexVLG: Dexterous Vision-Language-Grasp Model at Scale
Venue: ICCV 2025 Spotlight
Authors: Jiawei He, Danshi Li, Xinqiang Yu, Zekun Qi, Wenyao Zhang, Jiayi Chen, Zhaoxiang Zhang, Zhizheng Zhang, Li Yi, He Wang
Affiliations: BAAI, Galbot, Tsinghua University, Peking University, CASIA, Shanghai Jiao Tong University, EIT
arXiv: 2507.02747
GitHub: jiaweihe1996/DexVLG


One-Line Summary

DexGraspNet 3.0 (170M part-aligned grasp poses on 174k objects) paired with the flow-matching-based VLM DexVLG enables language-aligned dexterous grasp pose generation from single-view RGBD input, achieving 87.7% simulation success rate and 80% real-world success rate.


dexvlg-fig1a

Figure 1a: DexVLG overview — given a natural language instruction and single-view RGBD input, the model generates dexterous grasp poses targeting the specified object part.

dexvlg-fig1b

Figure 1b: DexGraspNet 3.0 dataset diversity and qualitative grasp results from DexVLG.

1. Background

The Gap Between VLA Systems and Dexterous Hands

While Vision-Language-Action (VLA) systems have advanced rapidly in robot manipulation, most research focuses on simple parallel grippers. Dexterous hands (22 DoF for the Shadow Hand) offer far richer manipulation capability, but three factors have held back large-scale VLA research on them:

  • Data scarcity: Existing dexterous grasp datasets are small-scale (thousands to ~1M samples)
  • No semantic alignment: No existing dataset supports part-level language alignment
  • Complex configuration space: Automatically generating physically stable grasp poses in 22-DoF space is difficult

Limitations of Prior Datasets

Method Scale Semantic Part Alignment Language Condition
DexGraspNet 1.32M
DexGYS 50k
SemGrasp 50k Partial
Multi-GraspLLM 120k
DexGraspNet 3.0 (Ours) 170M

2. Core Ideas

  1. Large-scale part-aligned dataset: Automatically generate 170M grasp poses on 174k objects using GPT-4o and SAMesh, paired with part-level captions
  2. Part-aware initialization: Classify object parts into 4 geometric categories and set OBB (Oriented Bounding Box)-aligned initial hand poses
  3. LP-based physics energy optimization: Replace the equal-magnitude contact force assumption of standard DFC with linear programming
  4. Flow-matching VLM: End-to-end model combining Uni3D point cloud encoder, Florence-2 LLM, and flow-matching grasp generation head

3. DexGraspNet 3.0 Dataset Generation

The dataset is generated via a 5-stage pipeline.

dexvlg-fig2

Figure 2: DexGraspNet 3.0 dataset generation pipeline — object preparation, part-aware grasp generation, energy optimization, validation & captioning, and table-top scene generation.

Stage 1: Object Preparation and Part Segmentation

The pipeline starts from 800k+ Objaverse assets, which GPT-4o filters against 5 quality criteria. The retained objects are then processed with:

  • ManifoldPlus + CoACD: Watertight collision mesh and convex decomposition (threshold=0.4)
  • GPT-4o: Object size estimation; normalize to 20–50cm diagonal range
  • SAMesh: Zero-shot geometry-based semantic part segmentation
  • Set-of-Mark + GPT-4o: Multi-view rendering analysis for automatic part name labeling
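As a rough illustration of the size normalization step, the sketch below rescales a mesh so that its bounding-box diagonal matches an estimated size clamped to the 20–50cm range. The clamping rule, the use of an axis-aligned bounding box, and the function names are assumptions made for this example; the review does not state how the paper implements the normalization.

```python
import math

def bbox_diagonal(vertices):
    """Length of the axis-aligned bounding-box diagonal of a vertex list."""
    mins = [min(v[i] for v in vertices) for i in range(3)]
    maxs = [max(v[i] for v in vertices) for i in range(3)]
    return math.dist(mins, maxs)

def rescale_to_estimated_size(vertices, estimated_diag_m, lo=0.20, hi=0.50):
    """Scale vertices so the diagonal equals the estimate clamped to [lo, hi] meters.

    Assumption: the estimate (e.g. from GPT-4o) is given as a diagonal in meters.
    """
    target = min(max(estimated_diag_m, lo), hi)
    current = bbox_diagonal(vertices)
    if current == 0.0:
        raise ValueError("degenerate mesh: all vertices coincide")
    scale = target / current
    return [tuple(scale * c for c in v) for v in vertices]

# Unit cube whose estimated size (2 m) is larger than the allowed range.
cube = [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
scaled_cube = rescale_to_estimated_size(cube, estimated_diag_m=2.0)
```

Here the cube ends up with a 0.5 m diagonal because the estimate is clamped to the upper bound.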

Stage 2: Part-aware Dexterous Grasp Generation

Initial Shadow Hand (22 DoF) palm pose \(T \in \mathbb{R}^3, R \in SO(3)\) and joint angles \(\theta \in \mathbb{R}^{22}\) are set using OBB geometry. Object parts are classified into four categories with differentiated initialization strategies:

Category | Description | Initialization Strategy
Lid-like | Flat parts embedded in the object | Palm perpendicular to principal direction, 24-angle jitter
Disk-like | Protruding flat/disk parts | Palm rotated sideways
L-shaped | Thin, elongated interactive parts | Palm directly aligned to grasp point geometry
Shaft-like | Default/general parts | Palm aligned to part principal direction
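Several of these strategies depend on the part's principal direction. The sketch below estimates it as the dominant PCA axis of the part's points, using power iteration on the covariance matrix. This is only one plausible way to obtain an OBB axis; the OBB fitting procedure actually used in the paper is not described in this review, and `principal_direction` is a name made up for the example.

```python
import math

def principal_direction(points, iterations=50):
    """Dominant PCA axis of a 3D point set, found by power iteration."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - mean[i] for i in range(3)] for p in points]
    # 3x3 covariance matrix of the centered points.
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(3)]
           for i in range(3)]
    # Arbitrary start vector; it must not be orthogonal to the dominant axis.
    v = [1.0 / math.sqrt(3.0)] * 3
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate part: all points coincide
            break
        v = [x / norm for x in w]
    return v

# Synthetic shaft-like part: long along x, thin along y and z.
part_points = [(0.01 * i, 0.001 * ((i % 3) - 1), 0.0005 * (i % 2))
               for i in range(50)]
axis = principal_direction(part_points)
```

For this elongated point set the returned axis is close to the x direction (up to sign), which is the direction a shaft-like initialization would align the palm to.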

Two grasp modes:

  • Wrap: 7 contact candidates (5 fingertips + 2 palm points)
  • Pinch: 4 contact candidates (thumb, index, middle finger, palm)

Stage 3: Gradient-based Energy Optimization

Details in Section 4.

Stage 4: Grasp Validation and Captioning

Physics-based validation in Isaac Gym:

  • Penetration < 3mm, self-penetration < 3mm
  • Gravity counteraction test from 6 directions
  • Part-alignment check: contacted links must be closest to the target part
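The acceptance rule above can be written as a simple filter. The sketch assumes that a simulator has already produced the measurements; the `GraspCheck` record and its field names are hypothetical and only serve to make the thresholds explicit.

```python
from dataclasses import dataclass
from typing import List

PENETRATION_LIMIT_M = 0.003  # 3 mm, as listed above

@dataclass
class GraspCheck:
    """Hypothetical per-grasp measurements exported from the simulator."""
    penetration_m: float
    self_penetration_m: float
    held_under_gravity: List[bool]  # one flag per gravity direction (6 total)
    part_aligned: bool              # contacted links are closest to the target part

def accept(check: GraspCheck) -> bool:
    """Return True only if every validation criterion is satisfied."""
    return (
        check.penetration_m < PENETRATION_LIMIT_M
        and check.self_penetration_m < PENETRATION_LIMIT_M
        and len(check.held_under_gravity) == 6
        and all(check.held_under_gravity)
        and check.part_aligned
    )

good = GraspCheck(0.001, 0.0005, [True] * 6, True)
too_deep = GraspCheck(0.004, 0.0005, [True] * 6, True)
```

In this example `accept(good)` passes, while `accept(too_deep)` fails because the 4 mm penetration exceeds the 3 mm limit.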

Accepted grasps receive captions via template:

“Grasp the {part} of the {object} object, with contacts on {fingers}”
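Filling the template is a plain string substitution. In the sketch below, joining finger names with commas is an assumption; the review does not say how the `{fingers}` slot is formatted.

```python
def make_caption(part, obj, fingers):
    """Fill the caption template for one accepted grasp.

    Assumption: finger names are joined with ", ".
    """
    finger_text = ", ".join(fingers)
    return f"Grasp the {part} of the {obj} object, with contacts on {finger_text}"

caption = make_caption("handle", "mug", ["thumb", "index"])
```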

Stage 5: Table-top Scene Generation

Objects are dropped to generate stable resting poses, and each scene is rendered from 8 viewpoints (Intel RealSense D415). Grasps that collide with the table surface are filtered out to simulate real table-top environments.


4. Energy-based Optimization

The total energy is a weighted sum of four terms:

\[E = \omega_{FC} \cdot E_{FC} + \omega_{bar} \cdot E_{bar} + \omega_{dis} \cdot E_{dis} + \omega_{reg} \cdot E_{reg}\]

4-1. LP-based Differentiable Force Closure (\(E_{FC}\))

Standard DFC assumes all contact forces have equal magnitude — an unrealistic constraint that causes artifacts such as tilted fingers and drifted contacts. LP-based DFC replaces this with linear programming.

When the pose is stable (\(P < \tau_{FC}\)):

\[E_{FC} = \|G(f \odot c)\|^2\]

In the unstable initial phase:

\[E_{FC} = \|Gc\|^2\]

Where:

  • \(G\): grasp matrix mapping contact forces to net wrench
  • \(c\): contact normal vectors
  • \(f\): per-contact force magnitudes from LP (\(\max_i f_i = 1,\ f_i \geq 0\))
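To see why per-contact magnitudes matter, the sketch below evaluates \(\|G(f \odot c)\|^2\) for a toy three-contact grasp. The contact layout is invented for illustration, and the magnitudes \(f\) are hand-picked rather than solved by a linear program; the code only evaluates the energy, it does not implement the LP itself.

```python
def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def net_wrench(points, normals, magnitudes):
    """Net wrench G(f ⊙ c): summed forces f_i c_i and torques p_i x (f_i c_i)."""
    wrench = [0.0] * 6
    for p, c, f in zip(points, normals, magnitudes):
        torque = cross(p, c)
        for k in range(3):
            wrench[k] += f * c[k]
            wrench[3 + k] += f * torque[k]
    return wrench

def force_closure_energy(points, normals, magnitudes=None):
    """Squared norm of the net wrench; equal unit magnitudes give ||Gc||^2."""
    if magnitudes is None:
        magnitudes = [1.0] * len(points)
    return sum(w * w for w in net_wrench(points, normals, magnitudes))

# One contact pushes in +x; two contacts on the opposite side push in -x.
points = [(-1.0, 0.0, 0.0), (1.0, 0.5, 0.0), (1.0, -0.5, 0.0)]
normals = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]

equal_energy = force_closure_energy(points, normals)                     # 1.0
weighted_energy = force_closure_energy(points, normals, [1.0, 0.5, 0.5])  # 0.0
```

With equal magnitudes the two opposing contacts overpower the single one, so the residual wrench cannot vanish unless the contacts move. Allowing \(f = (1, 0.5, 0.5)\), which satisfies \(\max_i f_i = 1\) and \(f_i \geq 0\), balances the same contacts exactly. This is the kind of artifact the LP-based formulation is meant to avoid.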

4-2. Part-contact Energy (\(E_{bar}\))

Penalizes fingertips that come within \(d_{thr}\) of surface points outside the target part, which keeps contacts on the intended part:

\[E_{bar} = \sum_{n=1}^{5} \sum_{p_j \notin s_i} b(d(x_n, p_j),\ d_{thr})\]

Truncated barrier function:

\[b(d,\ d_{thr}) = \begin{cases} -(d - d_{thr})^2 \ln(d/d_{thr}) & 0 < d < d_{thr} \\ 0 & \text{otherwise} \end{cases}\]
  • \(x_n\): position of the \(n\)-th fingertip
  • \(p_j\): surface points outside the target part \(s_i\)
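The barrier and the double sum translate directly into code. The sketch below assumes \(d\) is the Euclidean distance between a fingertip and a surface point, and uses an illustrative threshold of 1 cm that is not taken from the paper.

```python
import math

def barrier(d, d_thr):
    """Truncated barrier: grows as d -> 0, vanishes for d >= d_thr or d <= 0."""
    if 0.0 < d < d_thr:
        return -((d - d_thr) ** 2) * math.log(d / d_thr)
    return 0.0

def part_contact_energy(fingertips, outside_points, d_thr):
    """Sum the barrier over all fingertips and all points outside the target part."""
    return sum(
        barrier(math.dist(x, p), d_thr)
        for x in fingertips
        for p in outside_points
    )

D_THR = 0.01  # illustrative threshold in meters (assumption)
far = barrier(0.02, D_THR)     # outside the threshold: no penalty
near = barrier(0.005, D_THR)   # inside the threshold: small penalty
closer = barrier(0.001, D_THR) # closer still: larger penalty
```

Because \(\ln(d/d_{thr})\) is negative inside the threshold, the leading minus sign makes the penalty positive, and the squared factor lets it fade smoothly to zero at \(d = d_{thr}\).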

4-3. Distance Energy (\(E_{dis}\))

\[E_{dis} = \sum_{n=1}^{N} d(x_n, O) + \omega_{palm} \left| d(x_{palm}, O) - d_0 \right|\]

Minimizes fingertip-to-object distance while encouraging the palm to maintain target distance \(d_0\).
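A minimal sketch of this term, using a sphere as a stand-in object so that the distance to the surface has a closed form. The sphere, the values of \(d_0\) and \(\omega_{palm}\), and the function names are all assumptions for illustration.

```python
import math

def sphere_distance(point, center, radius):
    """Unsigned distance from a point to the surface of a sphere."""
    return abs(math.dist(point, center) - radius)

def distance_energy(fingertips, palm, center, radius, d0, w_palm):
    """E_dis: fingertip-to-surface distances plus the palm offset penalty."""
    fingertip_term = sum(sphere_distance(x, center, radius) for x in fingertips)
    palm_term = abs(sphere_distance(palm, center, radius) - d0)
    return fingertip_term + w_palm * palm_term

center, radius = (0.0, 0.0, 0.0), 0.05
on_surface = [(0.05, 0.0, 0.0), (0.0, 0.05, 0.0)]
palm = (0.0, 0.0, 0.07)  # 2 cm above the surface

ideal = distance_energy(on_surface, palm, center, radius, d0=0.02, w_palm=1.0)
lifted = distance_energy([(0.06, 0.0, 0.0), (0.0, 0.05, 0.0)], palm,
                         center, radius, d0=0.02, w_palm=1.0)
```

The energy is approximately zero when every fingertip touches the surface and the palm sits at the target offset, and it grows by 1 cm when one fingertip is lifted 1 cm off the sphere.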

4-4. Regularization Energy (\(E_{reg}\))

\[E_{reg} = \omega_{limit} \cdot E_{limit} + \omega_{pen} \cdot E_{pen} + \omega_{spen} \cdot E_{spen} + \omega_{dir} \cdot E_{dir}\]
  • \(E_{limit}\): joint limit violation penalty (via cuRobo)
  • \(E_{pen}\): hand-object penetration penalty
  • \(E_{spen}\): hand self-collision penalty
  • \(E_{dir}\): directional alignment energy \(\displaystyle E_{dir} = \sum_{i=0}^{N} (1 - c_i \cdot N_i)\)

5. DexVLG Model Architecture

dexvlg-fig3

Figure 3: DexVLG model architecture — Uni3D point cloud encoder, Florence-2 LLM, and flow-matching grasp generation head.

5-1. Point Cloud Encoder

  • Backbone: Uni3D (pre-trained ViT-based 3D encoder)
  • Input: 10,000 colored points downsampled from single-view RGBD
  • Alignment: MLP projector maps encoded 3D features to language embedding space

5-2. Language Foundation Model

  • LLM: Florence-2 (Base or Large variant)
  • Point cloud features concatenated with language token embeddings
  • Language tokenizer frozen during training

5-3. Flow Matching-based Grasp Generation Head

Learns a velocity field \(v(X_t, t)\) that transports noise samples \(X_0\) to target grasp poses \(X_1\):

\[\min_v \mathbb{E}_{(t, X_0, X_1) \sim \gamma} \left\| \frac{d}{dt} X_t - v(X_t, t) \right\|^2\]
  • Conditioning: conditioned on LLM hidden states
  • Architecture: shares transformer blocks with the LLM
  • MLP Pose Decoder: outputs \(T \in \mathbb{R}^3\) (translation), \(R \in SO(3)\) (rotation), \(\theta \in \mathbb{R}^{22}\) (joint angles)
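The objective can be made concrete with a small sketch. It assumes the common straight-line path \(X_t = (1 - t) X_0 + t X_1\), for which \(\frac{d}{dt} X_t = X_1 - X_0\); the review does not state which interpolation path DexVLG uses, and the pose is treated here as a flat vector without any special handling of \(SO(3)\).

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 to target x1 at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Time derivative of the straight path, constant in t."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predicted_v, x0, x1):
    """Squared error between a predicted velocity and the path derivative."""
    return sum((p - u) ** 2 for p, u in zip(predicted_v, target_velocity(x0, x1)))

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with forward Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy check: for a single known target, the ideal field is (x1 - x) / (1 - t).
noise = [1.0, 1.0, 1.0]
target = [0.1, -0.2, 0.3]
ideal_field = lambda x, t: [(b - a) / (1.0 - t) for a, b in zip(x, target)]
sample = euler_sample(ideal_field, noise, steps=10)
```

In the real model the velocity would come from the network conditioned on the LLM hidden states; the closed-form `ideal_field` above exists only to show that integrating a correct velocity field carries noise to the target pose.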

5-4. Training

Single-stage, full-parameter fine-tuning on DexGraspNet 3.0.


6. Loss Functions & Implementation Details

Item | Detail
Training data | DexGraspNet 3.0 (170M grasp poses, 174k objects)
3D encoder | Uni3D (ViT-based, pre-trained)
LLM | Florence-2 Base / Large
Point count | 10,000 (single-view RGBD)
Robot hand | Shadow Hand (22 DoF)
Simulator | Isaac Gym
Camera | Intel RealSense D415 (8 viewpoints)
Robot kinematics library | cuRobo (used for the joint-limit penalty)

7. Experimental Results

Simulation Benchmarks

Three metrics are evaluated:

  • Suc (Simulation Success Rate): physical simulation grasp success rate
  • PTA (Part Touch Accuracy): accuracy of touching the target part
  • PGA (Part Grasp Accuracy): accuracy of grasping the target part

Baselines: DGN2.0* (retrained DexGraspNet 2.0), DGN2.0*+CLIP (augmented with CLIP text features)

dexvlg-tab1

Table 1: Simulation performance comparison on LVIS-Seen, LVIS-Unseen, and SamPart3D benchmarks. DexVLG significantly outperforms all baselines across all metrics.

DexVLG's scores on each benchmark are:

  • LVIS-Seen: Suc 87.7%, PTA 70.7%, PGA 62.1%
  • LVIS-Unseen: Suc 79.1%, PTA 68.2%, PGA 36.3%
  • SamPart3D: Suc 76.3%, PGA 52.0%

Strong zero-shot generalization holds across unseen objects (LVIS-Unseen) and a different part segmentation method (SamPart3D).

dexvlg-tab2

Table 2: Additional simulation experiment results.

dexvlg-tab3

Table 3: DexGraspNet 3.0 dataset statistics — number of objects, grasp poses, and captions.

dexvlg-tab4

Table 4: Dataset quality comparison against prior work — penetration (mm), self-penetration (mm), and Q1 stability metric.

Qualitative Results

dexvlg-fig4

Figure 4: Qualitative grasp results on diverse objects. DexVLG accurately grasps the specified part according to the language instruction.

8. Ablation Studies

Grasp Mode Comparison

dexvlg-tab5

Table 5: Performance comparison between Wrap and Pinch grasp modes.

Wrap grasps substantially outperform Pinch across all metrics (LVIS-Seen Suc: 87.7% vs 71.8%). Pinch offers more flexibility but significantly lower stability.

Denoising Paradigm Comparison

dexvlg-tab6

Table 6: Comparison of denoising paradigms — DDPM, DDIM, and Flow Matching.

Flow Matching achieves 75.3% success rate, far exceeding DDPM (51.9%) and DDIM (57.7%), demonstrating clear advantages of continuous-flow methods for pose generation.

Model Component Ablation

dexvlg-tab7

Table 7: Ablation on model scale, training data scale, and colored point clouds.

Key findings:

  • Large model + large-scale training data are critical for generalization performance, especially on unseen objects
  • Colored point clouds (vs. geometry-only) significantly improve results

dexvlg-fig5

Figure 5: Ablation analysis visualization.

9. Real-world Experiments

Hardware:

  • Shadow Hand (22 DoF) + UR10e robotic arm
  • Intel RealSense D415 camera (single-view)

The system achieves 80% success rate and 75% part accuracy on simple objects, demonstrating successful part-aligned grasping in real-world scenarios.

dexvlg-fig6

Figure 6: Real-world dexterous grasping experiments with Shadow Hand + UR10e. The system executes part-aligned grasps on diverse objects following language instructions.

Limitations:

  1. Grasp poses are generated without considering robot arm workspace constraints, requiring post-generation filtering for real-world deployment
  2. There is not yet an effective way to rank sampled grasps for VLM-based models, because of computational cost constraints

10. Summary

“We present DexVLG, a large vision-language model trained on DexGraspNet 3.0 — 170M dexterous grasp poses across 174k objects — achieving over 76% zero-shot execution success rate in simulation and 80% in real-world scenarios.”

Key contributions:

  1. DexGraspNet 3.0: The largest dexterous grasp dataset — 174k objects, 170M part-aligned grasp poses with semantic captions
  2. LP-based DFC: Replaces the equal-magnitude contact force assumption of standard DFC with linear programming for more realistic stability optimization
  3. Part-aware initialization: Classifies object parts into 4 geometric categories for OBB-aligned hand pose initialization
  4. DexVLG: Combines Uni3D + Florence-2 + flow-matching head to predict language-conditioned dexterous grasp poses from single-view RGBD
  5. Strong zero-shot generalization: Maintains high performance on unseen objects and across different part segmentation methods