World-Value-Action Model
Implicit Planning for Vision-Language-Action Systems

Runze Li*1, Hongyin Zhang*1, Junxi Jin1, Qixin Zeng1, Zifeng Zhuang1, Yiqi Tang1, Shangke Lyu2, Donglin Wang1†
1Westlake University 2Nanjing University Suzhou Campus

*Equal contribution. †Corresponding author: wangdonglin@westlake.edu.cn

Demo tasks: Place Object in Drawer · Flatten Towel · Stack Bowls

Abstract

Vision-Language-Action models have become a promising route to embodied agents, but most existing approaches still optimize action prediction directly and struggle to reason over long-horizon outcomes. World-Value-Action Model (WAV) introduces a unified world-value-action formulation that learns a latent representation of future trajectories conditioned on observations and language. A learned world model predicts future states, a trajectory value function evaluates long-horizon utility, and action generation becomes latent-space inference that concentrates probability mass on high-value, dynamically feasible futures. Instead of planning directly in action space, WAV performs iterative inference in a compact latent trajectory space, which biases sampling toward feasible futures and improves long-horizon decision making in both simulation and real-world robotic manipulation.

Unified Backbone Multi-view diffusion transformer with video, value, and action experts.
Latent Planning Iterative trajectory-space inference before action execution.
Three-Stage Training Video adaptation, trajectory value learning, and action post-training.
Evaluation Closed-loop LIBERO benchmark plus real-world bimanual manipulation.

Method Overview

WAV overview

WAV combines instruction-conditioned video generation, trajectory value estimation, action prediction, and latent trajectory planning in one multi-view architecture. The model first imagines future visual rollouts, then scores candidate futures with a value expert, and finally decodes executable actions from optimized latent trajectory features.
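The imagine-score-decode loop above can be sketched as an iterative search over latent trajectory features. The snippet below is a minimal, hypothetical illustration of that idea using a cross-entropy-method-style refinement: sample candidate latents, score them with a value function, and concentrate sampling on the high-value candidates. Every function and constant here is a stand-in for illustration, not the paper's actual architecture or API.

```python
import numpy as np

# Hypothetical sketch of latent-trajectory planning in the spirit of WAV:
# iterate sampling in a compact latent space, scoring candidates with a
# value function, and refining toward high-value regions (CEM-style).
# value_fn, LATENT_DIM, and the goal latent are toy stand-ins.

rng = np.random.default_rng(0)
LATENT_DIM = 8  # assumed toy dimensionality, not the model's

def value_fn(z):
    """Toy trajectory value: higher when z is near a fixed 'goal' latent."""
    goal = np.ones(LATENT_DIM) / np.sqrt(LATENT_DIM)
    return -np.linalg.norm(z - goal, axis=-1)

def plan_latent(num_iters=5, num_samples=64, elite_frac=0.125):
    """Iteratively refine a sampling distribution over latent trajectories."""
    mean = np.zeros(LATENT_DIM)
    std = np.ones(LATENT_DIM)
    for _ in range(num_iters):
        # Sample candidate latent trajectories around the current mean.
        z = mean + std * rng.standard_normal((num_samples, LATENT_DIM))
        scores = value_fn(z)
        # Keep the top-scoring (elite) candidates and refit the distribution.
        elites = z[np.argsort(scores)[-int(num_samples * elite_frac):]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mean  # optimized latent trajectory feature

z_star = plan_latent()
# In the full system, an action expert would decode z_star into
# executable actions; here z_star is just the refined latent.
```

The design point this illustrates is that the search happens in the latent trajectory space rather than in raw action space, so the value function can rank whole imagined futures before any action is committed.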

Results

LIBERO Benchmark

WAV LIBERO results

WAV improves long-horizon closed-loop manipulation performance on LIBERO by reasoning over future trajectories rather than predicting actions purely reactively.

Real-World Evaluation

WAV real-world results

The same framework transfers to real bimanual robotics and maintains strong performance under clutter, noisy dynamics, and longer task horizons.

Real-World Comparisons

Task 1. Place Object in Drawer

Representative comparison between the baseline policy and WAV on a long-horizon drawer placement sequence.

Side-by-side videos: Baseline vs. WAV (3 runs).

Task 2. Flatten Towel

Deformable-object manipulation requires stable long-horizon correction and coordinated bimanual behavior.

Side-by-side videos: Baseline vs. WAV (3 runs).

Task 3. Stack Bowls

WAV preserves stronger visual grounding and multi-step consistency on cluttered tabletop rearrangement.

Side-by-side videos: Baseline vs. WAV (3 runs).

BibTeX

@article{li2026world,
  title={World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems},
  author={Li, Runze and Zhang, Hongyin and Jin, Junxi and Zeng, Qixin and Zhuang, Zifeng and Tang, Yiqi and Lyu, Shangke and Wang, Donglin},
  journal={arXiv preprint arXiv:2604.14732},
  year={2026}
}