[Paper Review] DexVLG: Dexterous Vision-Language-Grasp Model at Scale (ICCV 2025)

Paper: DexVLG: Dexterous Vision-Language-Grasp Model at Scale
Venue: ICCV 2025 Spotlight
Authors: Jiawei He, Danshi Li, Xinqiang Yu, Zekun Qi, Wenyao Zhang, Jiayi Chen, Zhaoxiang Zhang, Zhizheng Zhang, Li Yi, He Wang
Affiliations: BAAI, Galbot, Tsinghua University, Peking University, CASIA, Shanghai Jiao Tong University, EIT
arXiv: 2507.02747
GitHub: jiaweihe1996/DexVLG

One-Line Summary

DexGraspNet 3.0 (170M part-aligned grasp poses on 174k objects) paired with the flow-matching-based VLM DexVLG enables language-aligned dexterous grasp pose generation from single-view RGBD input, achieving 87.7% simulation success rate and 80% real-world success rate.

Figure 1a: DexVLG overview — given a natural language instruction and single-view RGBD input, the model generates dexterous grasp poses targeting the specified object part.

Figure 1b: DexGraspNet 3.0 dataset diversity and qualitative grasp results from DexVLG.

1. Background

The Gap Between VLA Systems and Dexterous Hands

While Vision-Language-Action (VLA) systems have advanced rapidly in robot manipulation, most research targets simple parallel grippers. Dexterous hands (22 DoF for the Shadow Hand) offer far richer manipulation capability, but three factors have held back large-scale VLA research on them:

Data scarcity: Existing dexterous grasp datasets are small-scale (tens of thousands to ~1.3M samples)
No semantic alignment: No existing dataset offers full part-level alignment together with language conditioning
Complex configuration space: Automatically generating physically stable grasp poses in 22-DoF space is difficult

Limitations of Prior Datasets

Dataset	Scale	Semantic Part Alignment	Language Conditioning
DexGraspNet	1.32M	✗	✗
DexGYS	50k	✗	✗
SemGrasp	50k	Partial	✗
Multi-GraspLLM	120k	Partial	✗
DexGraspNet 3.0 (this paper)	170M	✓	✓

2. Core Ideas

Large-scale part-aligned dataset: Automatically generate 170M grasp poses on 174k objects using GPT-4o and SAMesh, paired with part-level captions
Part-aware initialization: Classify object parts into 4 geometric categories and set OBB (Oriented Bounding Box)-aligned initial hand poses
LP-based physics energy optimization: Replace the equal-magnitude contact force assumption of standard DFC with linear programming
Flow-matching VLM: End-to-end model combining Uni3D point cloud encoder, Florence-2 LLM, and flow-matching grasp generation head

3. DexGraspNet 3.0 Dataset Generation

The dataset is generated via a 5-stage pipeline.

Figure 2: DexGraspNet 3.0 dataset generation pipeline — object preparation, part-aware grasp generation, energy optimization, validation & captioning, and table-top scene generation.

Stage 1: Object Preparation and Part Segmentation

The pipeline starts from 800k+ Objaverse assets, which GPT-4o filters against 5 quality criteria. The remaining objects are then processed with:

ManifoldPlus + CoACD: Watertight collision mesh and convex decomposition (threshold=0.4)
GPT-4o: Object size estimation; normalize to 20–50cm diagonal range
SAMesh: Zero-shot geometry-based semantic part segmentation
Set-of-Mark + GPT-4o: Multi-view rendering analysis for automatic part name labeling
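The size normalization step can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the "diagonal" is the axis-aligned bounding-box diagonal and that an out-of-range size estimate is simply clamped into 20–50 cm. The function names are hypothetical.

```python
import math

def bbox_diagonal(vertices):
    """Length of the axis-aligned bounding-box diagonal of a vertex list."""
    mins = [min(v[i] for v in vertices) for i in range(3)]
    maxs = [max(v[i] for v in vertices) for i in range(3)]
    return math.dist(mins, maxs)

def rescale_to_range(vertices, estimated_diag_m, lo=0.20, hi=0.50):
    """Scale vertices so the diagonal equals the estimate clamped to [lo, hi] metres.

    Clamping is an assumption; the review only states the 20-50 cm target range.
    """
    target = min(max(estimated_diag_m, lo), hi)
    scale = target / bbox_diagonal(vertices)
    return [tuple(scale * c for c in v) for v in vertices], target

# Unit cube with a (hypothetical) estimated real-world diagonal of 1.2 m.
cube = [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
scaled, target = rescale_to_range(cube, estimated_diag_m=1.2)
```

Here the 1.2 m estimate exceeds the upper bound, so the cube is rescaled to a 0.5 m diagonal.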

Stage 2: Part-aware Dexterous Grasp Generation

Initial Shadow Hand (22 DoF) palm pose \(T \in \mathbb{R}^3, R \in SO(3)\) and joint angles \(\theta \in \mathbb{R}^{22}\) are set using OBB geometry. Object parts are classified into four categories with differentiated initialization strategies:

Category	Description	Initialization Strategy
Lid-like	Flat parts embedded in the object	Palm perpendicular to principal direction, 24-angle jitter
Disk-like	Protruding flat/disk parts	Palm rotated sideways
L-shaped	Thin, elongated interactive parts	Palm directly aligned to grasp point geometry
Shaft-like	Default/general parts	Palm aligned to part principal direction
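Every strategy in the table depends on a principal direction of the part. One common way to obtain it is PCA over points sampled on the part surface. The sketch below assumes that approach (the review does not say how the paper fits the OBB) and uses power iteration to stay dependency-free.

```python
import math

def principal_direction(points, iters=100):
    """Dominant eigenvector of the 3x3 covariance of `points` (power iteration)."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - mean[i] for i in range(3)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(3)]
           for i in range(3)]
    v = [1.0, 1.0, 1.0]  # arbitrary start; must not be orthogonal to the answer
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical elongated part: points spread along x with small y/z thickness.
part = [(0.01 * k, 0.001 * (k % 3), 0.001 * (k % 2)) for k in range(50)]
axis = principal_direction(part)
```

For this synthetic part the returned unit vector lies almost exactly along the x axis, which is the direction a shaft-like initialization would align the palm to.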

Two grasp modes:

Wrap: 7 contact candidates (5 fingertips + 2 palm points)
Pinch: 4 contact candidates (thumb, index, middle finger, palm)

Stage 3: Gradient-based Energy Optimization

Details in Section 4.

Stage 4: Grasp Validation and Captioning

Physics-based validation in Isaac Gym:

Penetration < 3mm, self-penetration < 3mm
Gravity counteraction test from 6 directions
Part-alignment check: contacted links must be closest to the target part

Accepted grasps receive captions via template:

“Grasp the {part} of the {object} object, with contacts on {fingers}”
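A minimal sketch of the acceptance check and the caption template follows. The thresholds and the template come from the text above. Two things are assumptions: that a grasp must hold in all 6 gravity directions, and the comma-separated finger list. The function names are hypothetical.

```python
PEN_LIMIT_MM = 3.0           # penetration and self-penetration thresholds
NUM_GRAVITY_DIRECTIONS = 6   # gravity counteraction test directions

def passes_validation(pen_mm, self_pen_mm, directions_held, part_aligned):
    """Accept a grasp only if every check listed for Stage 4 passes.

    Requiring all 6 directions to hold is an assumption.
    """
    return (pen_mm < PEN_LIMIT_MM
            and self_pen_mm < PEN_LIMIT_MM
            and directions_held == NUM_GRAVITY_DIRECTIONS
            and part_aligned)

def make_caption(part, obj, fingers):
    """Fill the caption template; the finger-list formatting is an assumption."""
    return (f"Grasp the {part} of the {obj} object, "
            f"with contacts on {', '.join(fingers)}")

caption = make_caption("handle", "mug", ["thumb", "index finger"])
```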

Stage 5: Table-top Scene Generation

Objects are dropped onto a table to obtain stable resting poses, and each scene is rendered from 8 viewpoints with an Intel RealSense D415 camera model. Grasps that collide with the table surface are filtered out to reflect real table-top conditions.

4. Energy-based Optimization

The total energy is a weighted sum of four terms:

\[E = \omega_{FC} \cdot E_{FC} + \omega_{bar} \cdot E_{bar} + \omega_{dis} \cdot E_{dis} + \omega_{reg} \cdot E_{reg}\]

4-1. LP-based Differentiable Force Closure (\(E_{FC}\))

Standard DFC assumes all contact forces have equal magnitude — an unrealistic constraint that causes artifacts such as tilted fingers and drifted contacts. LP-based DFC replaces this with linear programming.

When the pose is stable (\(P < \tau_{FC}\)):

\[E_{FC} = \|G(f \odot c)\|^2\]

In the unstable initial phase:

\[E_{FC} = \|Gc\|^2\]

Where:

\(G\): grasp matrix mapping contact forces to net wrench
\(c\): contact normal vectors
\(f\): per-contact force magnitudes from LP (\(\max_i f_i = 1,\ f_i \geq 0\))
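The two forms of \(E_{FC}\) can be evaluated as below. The sketch assumes the textbook grasp matrix (net force plus torque about the object origin) and takes the force magnitudes \(f\) as given. The LP that produces them is not shown, and the function names are hypothetical.

```python
def cross(a, b):
    """3D cross product a x b."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def force_closure_energy(positions, normals, magnitudes=None):
    """Squared norm of the net wrench, i.e. ||G (f * c)||^2.

    Multiplying by G sums the contact forces and the torques p_i x force_i.
    With magnitudes=None every f_i is 1, which gives ||G c||^2.
    """
    if magnitudes is None:
        magnitudes = [1.0] * len(positions)
    wrench = [0.0] * 6
    for p, c, f in zip(positions, normals, magnitudes):
        force = [f * x for x in c]
        torque = cross(p, force)
        for k in range(3):
            wrench[k] += force[k]
            wrench[3 + k] += torque[k]
    return sum(w * w for w in wrench)

# Two antipodal contacts pushing toward each other along the x axis.
positions = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
normals = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
balanced = force_closure_energy(positions, normals)
unbalanced = force_closure_energy(positions, normals, magnitudes=[1.0, 0.5])
```

The antipodal pair cancels exactly under equal magnitudes, while unequal magnitudes leave a residual net force. Contacts that do not cancel under equal forces are where per-contact magnitudes from the LP can lower the residual.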

4-2. Part-contact Energy (\(E_{bar}\))

Penalizes fingertips that come within distance \(d_{thr}\) of surface points outside the target part, pushing contacts back onto the target part:

\[E_{bar} = \sum_{n=1}^{5} \sum_{p_j \notin s_i} b(d(x_n, p_j),\ d_{thr})\]

Truncated barrier function:

\[b(d,\ d_{thr}) = \begin{cases} -(d - d_{thr})^2 \ln(d/d_{thr}) & 0 < d < d_{thr} \\ 0 & \text{otherwise} \end{cases}\]

\(x_n\): position of the \(n\)-th fingertip
\(p_j\): surface points outside the target part \(s_i\)
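The barrier and the double sum translate directly into code. This sketch follows the formulas above literally; the only assumption is that \(d\) is the Euclidean distance between a fingertip and a sampled surface point.

```python
import math

def barrier(d, d_thr):
    """Truncated barrier: positive for 0 < d < d_thr, zero otherwise."""
    if 0.0 < d < d_thr:
        return -((d - d_thr) ** 2) * math.log(d / d_thr)
    return 0.0

def part_contact_energy(fingertips, off_part_points, d_thr):
    """Sum the barrier over every (fingertip, off-part surface point) pair."""
    return sum(barrier(math.dist(x, p), d_thr)
               for x in fingertips
               for p in off_part_points)

near = barrier(0.1, 1.0)   # fingertip very close to an off-part point
mid = barrier(0.5, 1.0)    # halfway inside the threshold
far = barrier(2.0, 1.0)    # beyond the threshold, no penalty
```

The penalty vanishes smoothly at \(d = d_{thr}\) and grows as the fingertip approaches an off-part point, which is what keeps the gradient-based optimizer from drifting onto neighbouring parts.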

4-3. Distance Energy (\(E_{dis}\))

\[E_{dis} = \sum_{n=1}^{N} d(x_n, O) + \omega_{palm} \left| d(x_{palm}, O) - d_0 \right|\]

Minimizes fingertip-to-object distance while encouraging the palm to maintain target distance \(d_0\).
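As a worked example, the distance energy can be computed as follows. The sketch approximates \(d(x, O)\) by the distance to the nearest sampled surface point, which is an assumption, and the values chosen for \(d_0\) and \(\omega_{palm}\) are purely illustrative.

```python
import math

def dist_to_object(x, surface_points):
    """Approximate d(x, O) by the nearest sampled surface point (assumption)."""
    return min(math.dist(x, p) for p in surface_points)

def distance_energy(fingertips, palm, surface_points, d0, w_palm):
    """Sum of fingertip distances plus the weighted palm-offset deviation."""
    finger_term = sum(dist_to_object(x, surface_points) for x in fingertips)
    palm_term = abs(dist_to_object(palm, surface_points) - d0)
    return finger_term + w_palm * palm_term

surface = [(0.0, 0.0, 0.0)]
fingertips = [(0.01, 0.0, 0.0), (0.0, 0.02, 0.0)]
palm = (0.0, 0.0, 0.05)
e_dis = distance_energy(fingertips, palm, surface, d0=0.03, w_palm=2.0)
```

With these numbers the fingertips contribute 0.01 + 0.02 and the palm, sitting 0.02 beyond its target offset, contributes 2.0 × 0.02, for a total of 0.07.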

4-4. Regularization Energy (\(E_{reg}\))

\[E_{reg} = \omega_{limit} \cdot E_{limit} + \omega_{pen} \cdot E_{pen} + \omega_{spen} \cdot E_{spen} + \omega_{dir} \cdot E_{dir}\]

\(E_{limit}\): joint limit violation penalty (via cuRobo)
\(E_{pen}\): hand-object penetration penalty
\(E_{spen}\): hand self-collision penalty
\(E_{dir}\): directional alignment energy \(\displaystyle E_{dir} = \sum_{i=0}^{N} (1 - c_i \cdot N_i)\)

5. DexVLG Model Architecture

Figure 3: DexVLG model architecture — Uni3D point cloud encoder, Florence-2 LLM, and flow-matching grasp generation head.

5-1. Point Cloud Encoder

Backbone: Uni3D (pre-trained ViT-based 3D encoder)
Input: 10,000 colored points downsampled from single-view RGBD
Alignment: MLP projector maps encoded 3D features to language embedding space

5-2. Language Foundation Model

LLM: Florence-2 (Base or Large variant)
Point cloud features concatenated with language token embeddings
Language tokenizer frozen during training

5-3. Flow Matching-based Grasp Generation Head

Learns a velocity field \(v(X_t, t)\) that transports noise samples \(X_0\) to target grasp poses \(X_1\):

\[\min_v \mathbb{E}_{(t, X_0, X_1) \sim \gamma} \left\| \frac{d}{dt} X_t - v(X_t, t) \right\|^2\]

Conditioning: LLM hidden states
Architecture: shares transformer blocks with the LLM
MLP Pose Decoder: outputs \(T \in \mathbb{R}^3\) (translation), \(R \in SO(3)\) (rotation), \(\theta \in \mathbb{R}^{22}\) (joint angles)
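The objective can be made concrete with a toy example. The sketch assumes the linear interpolation path \(X_t = (1 - t) X_0 + t X_1\), for which \(\frac{d}{dt} X_t = X_1 - X_0\); the review does not state which path the paper uses. It also treats a pose as a flat vector, ignoring the \(SO(3)\) structure of the rotation, and it samples with plain Euler steps.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to target x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def flow_matching_loss(v, x0, x1, t):
    """Squared error between v(X_t, t) and dX_t/dt = X_1 - X_0."""
    xt = interpolate(x0, x1, t)
    target = [b - a for a, b in zip(x0, x1)]
    pred = v(xt, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target))

def euler_sample(v, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        vel = v(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, vel)]
    return x

# Toy check: one target "pose" and the exact velocity field for this pair.
rng = random.Random(0)
x1 = [0.3, -0.1, 0.5]
x0 = [rng.gauss(0.0, 1.0) for _ in x1]

def ideal_v(x, t):
    return [b - a for a, b in zip(x0, x1)]

loss = flow_matching_loss(ideal_v, x0, x1, t=0.4)
sample = euler_sample(ideal_v, x0)
```

With the exact velocity field the loss is zero and Euler integration carries the noise sample onto the target. In the model, a learned network conditioned on the LLM hidden states plays the role of `ideal_v`.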

5-4. Training

Single-stage, full-parameter fine-tuning on DexGraspNet 3.0.

6. Loss Functions & Implementation Details

Item	Detail
Training data	DexGraspNet 3.0 (170M grasp poses, 174k objects)
3D encoder	Uni3D (ViT-based, pre-trained)
LLM	Florence-2 Base / Large
Point count	10,000 (single-view RGBD)
Robot hand	Shadow Hand (22 DoF)
Simulator	Isaac Gym
Camera	Intel RealSense D415 (8 viewpoints)
Robot kinematics/collision library	cuRobo

7. Experimental Results

Simulation Benchmarks

Three metrics are evaluated:

Suc (Simulation Success Rate): physical simulation grasp success rate
PTA (Part Touch Accuracy): accuracy of touching the target part
PGA (Part Grasp Accuracy): accuracy of grasping the target part

Baselines: DGN2.0* (retrained DexGraspNet 2.0), DGN2.0*+CLIP (augmented with CLIP text features)

Table 1: Simulation performance comparison on LVIS-Seen, LVIS-Unseen, and SamPart3D benchmarks. DexVLG significantly outperforms all baselines across all metrics.

DexVLG's reported results on each benchmark:

LVIS-Seen: Suc 87.7%, PTA 70.7%, PGA 62.1%
LVIS-Unseen: Suc 79.1%, PTA 68.2%, PGA 36.3%
SamPart3D: Suc 76.3%, PGA 52.0%

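Simulation success degrades only moderately on unseen objects (from 87.7% to 79.1% on LVIS-Unseen) and under a different part segmentation method (76.3% on SamPart3D). Part grasp accuracy is less robust: PGA falls from 62.1% to 36.3% on LVIS-Unseen.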

Table 2: Additional simulation experiment results.

Table 3: DexGraspNet 3.0 dataset statistics — number of objects, grasp poses, and captions.

Table 4: Dataset quality comparison against prior work — penetration (mm), self-penetration (mm), and Q1 stability metric.

Qualitative Results

Figure 4: Qualitative grasp results on diverse objects. DexVLG accurately grasps the specified part according to the language instruction.

8. Ablation Studies

Grasp Mode Comparison

Table 5: Performance comparison between Wrap and Pinch grasp modes.

Wrap grasps substantially outperform Pinch across all metrics (LVIS-Seen Suc: 87.7% vs 71.8%). Pinch offers more flexibility but significantly lower stability.

Denoising Paradigm Comparison

Table 6: Comparison of denoising paradigms — DDPM, DDIM, and Flow Matching.

Flow Matching achieves 75.3% success rate, far exceeding DDPM (51.9%) and DDIM (57.7%), demonstrating clear advantages of continuous-flow methods for pose generation.

Model Component Ablation

Table 7: Ablation on model scale, training data scale, and colored point clouds.

Key findings:

A larger model and larger-scale training data are both critical for generalization, especially on unseen objects
Colored point clouds improve results over geometry-only input

Figure 5: Ablation analysis visualization.

9. Real-world Experiments

Hardware:

Shadow Hand (22 DoF) + UR10e robotic arm
Intel RealSense D415 camera (single-view)

The system achieves 80% success rate and 75% part accuracy on simple objects, demonstrating successful part-aligned grasping in real-world scenarios.

Figure 6: Real-world dexterous grasping experiments with Shadow Hand + UR10e. The system executes part-aligned grasps on diverse objects following language instructions.

Limitations:

Grasp poses are generated without considering the robot arm's workspace constraints, so real-world deployment requires filtering them after generation
No effective method exists yet for ranking the samples produced by VLM-based models, because of computational cost

10. Summary

“We present DexVLG, a large vision-language model trained on DexGraspNet 3.0 — 170M dexterous grasp poses across 174k objects — achieving over 76% zero-shot execution success rate in simulation and 80% in real-world scenarios.”

Key contributions:

DexGraspNet 3.0: The largest dexterous grasp dataset — 174k objects, 170M part-aligned grasp poses with semantic captions
LP-based DFC: Replaces the equal-magnitude contact force assumption of standard DFC with linear programming for more realistic stability optimization
Part-aware initialization: Classifies object parts into 4 geometric categories for OBB-aligned hand pose initialization
DexVLG: Combines Uni3D + Florence-2 + flow-matching head to predict language-conditioned dexterous grasp poses from single-view RGBD
Strong zero-shot generalization: Maintains high performance on unseen objects and across different part segmentation methods