[Paper Review] DexVLG: Dexterous Vision-Language-Grasp Model at Scale (ICCV 2025)
Paper: DexVLG: Dexterous Vision-Language-Grasp Model at Scale
Venue: ICCV 2025 Spotlight
Authors: Jiawei He, Danshi Li, Xinqiang Yu, Zekun Qi, Wenyao Zhang, Jiayi Chen, Zhaoxiang Zhang, Zhizheng Zhang, Li Yi, He Wang
Affiliations: BAAI, Galbot, Tsinghua University, Peking University, CASIA, Shanghai Jiao Tong University, EIT
arXiv: 2507.02747
GitHub: jiaweihe1996/DexVLG
One-Line Summary
DexGraspNet 3.0 (170M part-aligned grasp poses on 174k objects) paired with the flow-matching-based VLM DexVLG enables language-aligned dexterous grasp pose generation from single-view RGBD input, achieving 87.7% simulation success rate and 80% real-world success rate.


1. Background
The Gap Between VLA Systems and Dexterous Hands
While Vision-Language-Action (VLA) systems have advanced rapidly in robot manipulation, most research focuses on simple parallel grippers. Dexterous hands (22 DoF for the Shadow Hand) offer far richer manipulation capability, but three factors have held back large-scale VLA research on them:
- Data scarcity: Existing dexterous grasp datasets are small-scale (thousands to ~1M samples)
- No semantic alignment: No existing dataset supports part-level language alignment
- Complex configuration space: Automatically generating physically stable grasp poses in 22-DoF space is difficult
Limitations of Prior Datasets
| Method | Scale | Semantic Part Alignment | Language Condition |
|---|---|---|---|
| DexGraspNet | 1.32M | ✗ | ✗ |
| DexGYS | 50k | ✗ | ✗ |
| SemGrasp | 50k | Partial | ✗ |
| Multi-GraspLLM | 120k | Partial | ✗ |
| DexGraspNet 3.0 (Ours) | 170M | ✓ | ✓ |
2. Core Ideas
- Large-scale part-aligned dataset: Automatically generate 170M grasp poses on 174k objects using GPT-4o and SAMesh, paired with part-level captions
- Part-aware initialization: Classify object parts into 4 geometric categories and set OBB (Oriented Bounding Box)-aligned initial hand poses
- LP-based physics energy optimization: Replace the equal-magnitude contact force assumption of standard DFC with linear programming
- Flow-matching VLM: End-to-end model combining Uni3D point cloud encoder, Florence-2 LLM, and flow-matching grasp generation head
3. DexGraspNet 3.0 Dataset Generation
The dataset is generated via a 5-stage pipeline.

Stage 1: Object Preparation and Part Segmentation
The pipeline starts from 800k+ Objaverse assets, filtered by GPT-4o using 5 quality criteria, and then applies the following tools:
- ManifoldPlus + CoACD: Watertight collision mesh and convex decomposition (threshold=0.4)
- GPT-4o: Object size estimation; normalize to 20–50cm diagonal range
- SAMesh: Zero-shot geometry-based semantic part segmentation
- Set-of-Mark + GPT-4o: Multi-view rendering analysis for automatic part name labeling
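The size normalization step can be illustrated with a small sketch. It assumes the estimated size is a bounding-box diagonal in meters and that estimates outside the 20–50 cm range are clamped into it; the function name `normalization_scale` and the clamping rule are hypothetical, since the exact procedure is not described here.

```python
import math


def normalization_scale(bbox_min, bbox_max, estimated_diag_m,
                        lo_m=0.20, hi_m=0.50):
    """Return the uniform scale that gives a mesh its target diagonal.

    Assumption: the size estimate is clamped into [lo_m, hi_m];
    the actual pipeline may normalize differently.
    """
    current_diag = math.dist(bbox_min, bbox_max)
    if current_diag <= 0.0:
        raise ValueError("degenerate bounding box")
    target_diag = min(max(estimated_diag_m, lo_m), hi_m)
    return target_diag / current_diag


# A mesh with a 5-unit diagonal whose estimated size (1.2 m) exceeds the range.
scale = normalization_scale((0, 0, 0), (3, 4, 0), estimated_diag_m=1.2)
```

Multiplying every vertex by `scale` would bring the example mesh to a 0.5 m diagonal.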
Stage 2: Part-aware Dexterous Grasp Generation
Initial Shadow Hand (22 DoF) palm pose \(T \in \mathbb{R}^3, R \in SO(3)\) and joint angles \(\theta \in \mathbb{R}^{22}\) are set using OBB geometry. Object parts are classified into four categories with differentiated initialization strategies:
| Category | Description | Initialization Strategy |
|---|---|---|
| Lid-like | Flat parts embedded in the object | Palm perpendicular to principal direction, 24-angle jitter |
| Disk-like | Protruding flat/disk parts | Palm rotated sideways |
| L-shaped | Thin, elongated interactive parts | Palm directly aligned to grasp point geometry |
| Shaft-like | Default/general parts | Palm aligned to part principal direction |
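To make "part principal direction" concrete, the sketch below extracts the dominant axis of a part's point set using power iteration on its covariance matrix. This assumes a PCA-style OBB, which is one common construction; the OBB fitting used in the paper may differ, and `principal_direction` is a hypothetical helper.

```python
import math


def principal_direction(points, iters=200):
    """Dominant axis of a 3D point set via power iteration on its covariance.

    Note: the fixed start vector (1, 1, 1) would fail only in the
    unlikely case that it is orthogonal to the dominant axis.
    """
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    centered = [[p[k] - mean[k] for k in range(3)] for p in points]
    cov = [[sum(q[i] * q[j] for q in centered) / n for j in range(3)]
           for i in range(3)]
    v = [1.0, 1.0, 1.0]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("points have no spread")
        v = [x / norm for x in w]
    return v


# A shaft-like part: long along x, thin along y and z.
shaft = [(-2, 0, 0), (2, 0, 0), (0, 0.1, 0), (0, -0.1, 0),
         (0, 0, 0.05), (0, 0, -0.05)]
axis = principal_direction(shaft)
```

For the shaft-like example, `axis` points along the x direction, which is the direction a shaft-like initialization would align the palm to.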
Two grasp modes:
- Wrap: 7 contact candidates (5 fingertips + 2 palm points)
- Pinch: 4 contact candidates (thumb, index, middle finger, palm)
Stage 3: Gradient-based Energy Optimization
Details in Section 4.
Stage 4: Grasp Validation and Captioning
Physics-based validation in Isaac Gym:
- Penetration < 3mm, self-penetration < 3mm
- Gravity counteraction test from 6 directions
- Part-alignment check: contacted links must be closest to the target part
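The three acceptance checks can be combined into a single predicate, sketched below. The inputs are assumed to come from a simulator run (the function does not call Isaac Gym), and the name `passes_validation` and its argument layout are hypothetical.

```python
def passes_validation(penetration_m, self_penetration_m,
                      holds_per_direction, part_aligned,
                      max_pen_m=0.003):
    """Apply the three acceptance checks to one simulated grasp.

    penetration_m / self_penetration_m: worst-case depths in meters.
    holds_per_direction: six booleans, one per gravity direction.
    part_aligned: True if the contacted links are closest to the target part.
    """
    if penetration_m >= max_pen_m or self_penetration_m >= max_pen_m:
        return False
    if len(holds_per_direction) != 6 or not all(holds_per_direction):
        return False
    return bool(part_aligned)


accepted = passes_validation(0.001, 0.0005, [True] * 6, True)
```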
Accepted grasps receive captions via template:
“Grasp the {part} of the {object} object, with contacts on {fingers}”
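As a minimal illustration, the template can be filled with a format string. How the finger names are joined is an assumption; only the template wording comes from the text above.

```python
def grasp_caption(part, obj, fingers):
    """Fill the caption template; joining fingers with ', ' is an assumption."""
    finger_text = ", ".join(fingers)
    return f"Grasp the {part} of the {obj} object, with contacts on {finger_text}"


caption = grasp_caption("handle", "mug", ["thumb", "index"])
```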
Stage 5: Table-top Scene Generation
Objects are dropped to generate stable resting poses, and each scene is rendered from 8 viewpoints (Intel RealSense D415). Table-surface collision filtering is applied to simulate real table-top environments.
4. Energy-based Optimization
The total energy is a weighted sum of four terms:
\[E = \omega_{FC} \cdot E_{FC} + \omega_{bar} \cdot E_{bar} + \omega_{dis} \cdot E_{dis} + \omega_{reg} \cdot E_{reg}\]
4-1. LP-based Differentiable Force Closure (\(E_{FC}\))
Standard DFC assumes all contact forces have equal magnitude, an unrealistic constraint that causes artifacts such as tilted fingers and drifted contacts. LP-based DFC instead solves a linear program for the per-contact force magnitudes.
When the pose is stable (\(P < \tau_{FC}\)):
\[E_{FC} = \|G(f \odot c)\|^2\]
In the unstable initial phase:
\[E_{FC} = \|Gc\|^2\]
Where:
- \(G\): grasp matrix mapping contact forces to net wrench
- \(c\): contact normal vectors
- \(f\): per-contact force magnitudes from LP (\(\max_i f_i = 1,\ f_i \geq 0\))
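The difference between the two forms of \(E_{FC}\) can be seen in a toy example. The sketch below evaluates the squared net wrench for three contacts on a unit sphere, first with equal magnitudes and then with hand-picked magnitudes that satisfy \(\max_i f_i = 1\). No LP is solved here: the magnitudes are chosen by hand for illustration, and the layout of the grasp matrix (force stacked on torque) is the standard one, assumed rather than taken from the paper.

```python
def cross(a, b):
    """3D cross product."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def force_closure_energy(positions, normals, magnitudes=None):
    """Squared norm of the net wrench produced by scaled contact normals.

    With magnitudes=None every contact pushes with unit force,
    which is the equal-magnitude form ||Gc||^2.
    """
    if magnitudes is None:
        magnitudes = [1.0] * len(normals)
    wrench = [0.0] * 6
    for x, c, f in zip(positions, normals, magnitudes):
        force = [f * ck for ck in c]
        torque = cross(x, force)
        for k in range(3):
            wrench[k] += force[k]
            wrench[3 + k] += torque[k]
    return sum(w * w for w in wrench)


# Three inward-pointing unit normals; contacts sit on a unit sphere.
normals = [(1.0, 0.0, 0.0), (-0.6, 0.8, 0.0), (-0.6, -0.8, 0.0)]
positions = [tuple(-ck for ck in c) for c in normals]

equal = force_closure_energy(positions, normals)
weighted = force_closure_energy(positions, normals, [1.0, 1 / 1.2, 1 / 1.2])
```

With equal magnitudes the forces leave a residual along x (energy about 0.04), while the unequal magnitudes cancel it. The pose is already balanced, yet the equal-magnitude energy would still push the optimizer to move the contacts.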
4-2. Part-contact Energy (\(E_{bar}\))
Penalizes fingertips that come within \(d_{thr}\) of surface points outside the target part, which keeps contacts on the intended part:
\[E_{bar} = \sum_{n=1}^{5} \sum_{p_j \notin s_i} b(d(x_n, p_j),\ d_{thr})\]
Truncated barrier function:
\[b(d,\ d_{thr}) = \begin{cases} -(d - d_{thr})^2 \ln(d/d_{thr}) & 0 < d < d_{thr} \\ 0 & \text{otherwise} \end{cases}\]
- \(x_n\): position of the \(n\)-th fingertip
- \(p_j\): surface points outside the target part \(s_i\)
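The barrier and the double sum translate directly into code. The sketch below assumes \(d\) is the Euclidean distance between a fingertip and a surface point; the function names are illustrative.

```python
import math


def barrier(d, d_thr):
    """Truncated barrier: grows as d -> 0, zero at and beyond d_thr."""
    if 0.0 < d < d_thr:
        return -((d - d_thr) ** 2) * math.log(d / d_thr)
    return 0.0


def part_contact_energy(fingertips, off_target_points, d_thr):
    """Sum the barrier over all (fingertip, off-target surface point) pairs."""
    return sum(barrier(math.dist(x, p), d_thr)
               for x in fingertips
               for p in off_target_points)


# One fingertip near one off-target point; the far point contributes nothing.
energy = part_contact_energy([(0.0, 0.0, 0.0)],
                             [(0.5, 0.0, 0.0), (3.0, 0.0, 0.0)],
                             d_thr=1.0)
```

Because both the barrier and its first derivative vanish at \(d = d_{thr}\), the penalty switches on smoothly, which suits gradient-based optimization.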
4-3. Distance Energy (\(E_{dis}\))
\[E_{dis} = \sum_{n=1}^{N} d(x_n, O) + \omega_{palm} \left| d(x_{palm}, O) - d_0 \right|\]
Minimizes fingertip-to-object distance while encouraging the palm to maintain target distance \(d_0\).
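A small numeric sketch of this term, using a sphere as a stand-in for the object \(O\) so that the point-to-object distance has a closed form. Treating \(d\) as an unsigned distance to the surface is an assumption, and all names below are illustrative.

```python
import math


def sphere_distance(x, center, radius):
    """Unsigned distance from point x to a sphere surface (stand-in object)."""
    return abs(math.dist(x, center) - radius)


def distance_energy(fingertips, palm, center, radius, d0, w_palm):
    """Fingertip distances plus the weighted palm-offset deviation."""
    fingertip_term = sum(sphere_distance(x, center, radius) for x in fingertips)
    palm_term = abs(sphere_distance(palm, center, radius) - d0)
    return fingertip_term + w_palm * palm_term


e_dis = distance_energy(
    fingertips=[(1.5, 0.0, 0.0), (0.0, 2.0, 0.0)],  # 0.5 and 1.0 from the surface
    palm=(0.0, 0.0, 1.3),                           # 0.3 from the surface
    center=(0.0, 0.0, 0.0), radius=1.0,
    d0=0.1, w_palm=2.0,
)
```

Here the fingertip term is 1.5 and the palm term is 2.0 × 0.2, giving roughly 1.9.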
4-4. Regularization Energy (\(E_{reg}\))
\[E_{reg} = \omega_{limit} \cdot E_{limit} + \omega_{pen} \cdot E_{pen} + \omega_{spen} \cdot E_{spen} + \omega_{dir} \cdot E_{dir}\]
- \(E_{limit}\): joint limit violation penalty (via cuRobo)
- \(E_{pen}\): hand-object penetration penalty
- \(E_{spen}\): hand self-collision penalty
- \(E_{dir}\): directional alignment energy \(\displaystyle E_{dir} = \sum_{i=0}^{N} (1 - c_i \cdot N_i)\)
5. DexVLG Model Architecture

5-1. Point Cloud Encoder
- Backbone: Uni3D (pre-trained ViT-based 3D encoder)
- Input: 10,000 colored points downsampled from single-view RGBD
- Alignment: MLP projector maps encoded 3D features to language embedding space
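The input preparation can be sketched as pinhole back-projection followed by downsampling. The camera intrinsics (`fx`, `fy`, `cx`, `cy`) and the use of uniform random sampling are assumptions for illustration; the sampling strategy actually used is not stated here.

```python
import random


def deproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection of one pixel to a camera-frame 3D point."""
    return ((u - cx) * depth_m / fx, (v - cy) * depth_m / fy, depth_m)


def rgbd_to_points(depth, color, fx, fy, cx, cy, n_points=10_000, seed=0):
    """Build colored points (x, y, z, r, g, b) and randomly downsample."""
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d > 0.0:  # zero depth is treated as an invalid pixel
                xyz = deproject(u, v, d, fx, fy, cx, cy)
                points.append(xyz + tuple(color[v][u]))
    if len(points) > n_points:
        points = random.Random(seed).sample(points, n_points)
    return points


# Tiny 2x2 frame with one invalid pixel.
depth = [[1.0, 0.0], [2.0, 1.5]]
color = [[(255, 0, 0), (0, 0, 0)], [(0, 255, 0), (0, 0, 255)]]
cloud = rgbd_to_points(depth, color, fx=600.0, fy=600.0, cx=0.5, cy=0.5)
```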
5-2. Language Foundation Model
- LLM: Florence-2 (Base or Large variant)
- Point cloud features concatenated with language token embeddings
- Language tokenizer frozen during training
5-3. Flow Matching-based Grasp Generation Head
Learns a velocity field \(v(X_t, t)\) that transports noise samples \(X_0\) to target grasp poses \(X_1\):
\[\min_v \mathbb{E}_{(t, X_0, X_1) \sim \gamma} \left\| \frac{d}{dt} X_t - v(X_t, t) \right\|^2\]
- Conditioning: conditioned on LLM hidden states
- Architecture: shares transformer blocks with the LLM
- MLP Pose Decoder: outputs \(T \in \mathbb{R}^3\) (translation), \(R \in SO(3)\) (rotation), \(\theta \in \mathbb{R}^{22}\) (joint angles)
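The objective and the sampling procedure can be illustrated on plain vectors. The sketch assumes the straight-line path \(X_t = (1 - t) X_0 + t X_1\), so that \(\frac{d}{dt} X_t = X_1 - X_0\); this rectified-flow style interpolation is a common choice but is an assumption here. The real head is conditioned on LLM hidden states and handles rotations, neither of which is modeled below.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from x0 (noise) to x1 (target pose)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def flow_matching_loss(velocity, x0, x1, t):
    """Squared error between dX_t/dt = x1 - x0 and the predicted velocity."""
    xt = interpolate(x0, x1, t)
    target = [b - a for a, b in zip(x0, x1)]
    pred = velocity(xt, t)
    return sum((g - p) ** 2 for g, p in zip(target, pred))


def sample(velocity, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with forward Euler."""
    x, h = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * h)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


target_pose = [0.1, -0.2, 0.3]


def oracle(x, t):
    """Exact velocity field when every noise sample maps to target_pose."""
    return [(b - a) / (1.0 - t) for a, b in zip(x, target_pose)]


noise = [1.0, 1.0, 1.0]
loss = flow_matching_loss(oracle, noise, target_pose, t=0.3)
generated = sample(oracle, noise)
```

With the oracle field the loss is zero and Euler integration lands on `target_pose`. A trained network stands in for `oracle` in practice, and the number of integration steps trades speed against accuracy.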
5-4. Training
Single-stage, full-parameter fine-tuning on DexGraspNet 3.0.
6. Loss Functions & Implementation Details
| Item | Detail |
|---|---|
| Training data | DexGraspNet 3.0 (170M grasp poses, 174k objects) |
| 3D encoder | Uni3D (ViT-based, pre-trained) |
| LLM | Florence-2 Base / Large |
| Point count | 10,000 (single-view RGBD) |
| Robot hand | Shadow Hand (22 DoF) |
| Simulator | Isaac Gym |
| Camera | Intel RealSense D415 (8 viewpoints) |
| Physics library | cuRobo |
7. Experimental Results
Simulation Benchmarks
Three metrics are evaluated:
- Suc (Simulation Success Rate): physical simulation grasp success rate
- PTA (Part Touch Accuracy): accuracy of touching the target part
- PGA (Part Grasp Accuracy): accuracy of grasping the target part
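Each metric can be read as a percentage over evaluation trials, as in the sketch below. The record fields (`success`, `touched_target`, `grasped_target`) are hypothetical, and the paper's exact definitions may differ, for example in whether PGA is conditioned on a successful grasp.

```python
def rate(flags):
    """Percentage of True values; 0.0 for an empty list."""
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def summarize(trials):
    """trials: dicts with boolean 'success', 'touched_target', 'grasped_target'."""
    return {
        "Suc": rate([t["success"] for t in trials]),
        "PTA": rate([t["touched_target"] for t in trials]),
        "PGA": rate([t["grasped_target"] for t in trials]),
    }


trials = [
    {"success": True, "touched_target": True, "grasped_target": True},
    {"success": True, "touched_target": True, "grasped_target": False},
    {"success": True, "touched_target": False, "grasped_target": False},
    {"success": False, "touched_target": False, "grasped_target": False},
]
metrics = summarize(trials)
```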
Baselines: DGN2.0* (retrained DexGraspNet 2.0), DGN2.0*+CLIP (augmented with CLIP text features)

DexVLG surpasses all baselines by large margins on every benchmark:
- LVIS-Seen: Suc 87.7%, PTA 70.7%, PGA 62.1%
- LVIS-Unseen: Suc 79.1%, PTA 68.2%, PGA 36.3%
- SamPart3D: Suc 76.3%, PGA 52.0%
Strong zero-shot generalization holds across unseen objects (LVIS-Unseen) and a different part segmentation method (SamPart3D).



Qualitative Results

8. Ablation Studies
Grasp Mode Comparison

Wrap grasps substantially outperform Pinch across all metrics (LVIS-Seen Suc: 87.7% vs 71.8%). Pinch offers more flexibility but significantly lower stability.
Denoising Paradigm Comparison

Flow Matching achieves 75.3% success rate, far exceeding DDPM (51.9%) and DDIM (57.7%), demonstrating clear advantages of continuous-flow methods for pose generation.
Model Component Ablation

Key findings:
- A larger model and large-scale training data are both critical for generalization, especially on unseen objects
- Colored point clouds (vs. geometry-only) significantly improve results

9. Real-world Experiments
Hardware:
- Shadow Hand (22 DoF) + UR10e robotic arm
- Intel RealSense D415 camera (single-view)
The system achieves an 80% success rate and 75% part accuracy on simple objects, demonstrating part-aligned grasping in real-world scenarios.

Limitations:
- Grasp poses are generated without considering robot arm workspace constraints, requiring post-generation filtering for real-world deployment
- No effective method for ranking sampled grasps is available for VLM-based models because of computational cost
10. Summary
“We present DexVLG, a large vision-language model trained on DexGraspNet 3.0 — 170M dexterous grasp poses across 174k objects — achieving over 76% zero-shot execution success rate in simulation and 80% in real-world scenarios.”
Key contributions:
- DexGraspNet 3.0: The largest dexterous grasp dataset — 174k objects, 170M part-aligned grasp poses with semantic captions
- LP-based DFC: Replaces the equal-magnitude contact force assumption of standard DFC with linear programming for more realistic stability optimization
- Part-aware initialization: Classifies object parts into 4 geometric categories for OBB-aligned hand pose initialization
- DexVLG: Combines Uni3D + Florence-2 + flow-matching head to predict language-conditioned dexterous grasp poses from single-view RGBD
- Strong zero-shot generalization: Maintains high performance on unseen objects and across different part segmentation methods