World-Value-Action Model
Implicit Planning for Vision-Language-Action Systems

Runze Li*1, Hongyin Zhang*1, Junxi Jin1, Qixin Zeng1, Zifeng Zhuang1, Yiqi Tang1, Shangke Lyu2, Donglin Wang1†
1Westlake University 2Nanjing University Suzhou Campus

*Equal contribution. †Corresponding author: wangdonglin@westlake.edu.cn

Demo tasks: Place Object in Drawer · Flatten Towel · Stack Bowls

Abstract

Vision-Language-Action models have become a promising route to embodied agents, but most existing approaches still optimize action prediction directly and struggle to reason over long-horizon outcomes. World-Value-Action Model (WAV) introduces a unified world-value-action formulation that learns a latent representation of future trajectories conditioned on observations and language. A learned world model predicts future states, a trajectory value function evaluates long-horizon utility, and action generation becomes latent-space inference that concentrates probability mass on high-value, dynamically feasible futures. Instead of planning directly in action space, WAV performs iterative inference in a compact latent trajectory space, which biases sampling toward feasible futures and improves long-horizon decision making in both simulation and real-world robotic manipulation.

Unified Backbone Multi-view diffusion transformer with video, value, and action experts.
Latent Planning Iterative trajectory-space inference before action execution.
Three-Stage Training Video adaptation, trajectory value learning, and action post-training.
Evaluation Closed-loop LIBERO benchmark plus real-world bimanual manipulation.

Method Overview

WAV overview

WAV combines instruction-conditioned video generation, trajectory value estimation, action prediction, and latent trajectory planning in one multi-view architecture. The model first imagines future visual rollouts, then scores candidate futures with a value expert, and finally decodes executable actions from optimized latent trajectory features.
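The imagine-score-decode loop above can be sketched as an iterative search over latent trajectory features. The snippet below is a minimal, hypothetical illustration of that idea using a cross-entropy-method-style refinement: sample candidate latents, score them with a value function, and concentrate sampling on the high-value candidates. Every function and constant here is a stand-in for illustration, not the paper's actual architecture or API.

```python
import numpy as np

# Hypothetical sketch of latent-trajectory planning in the spirit of WAV:
# iterate sampling in a compact latent space, scoring candidates with a
# value function, and refining toward high-value regions (CEM-style).
# value_fn, LATENT_DIM, and the goal latent are toy stand-ins.

rng = np.random.default_rng(0)
LATENT_DIM = 8  # assumed toy dimensionality, not the model's

def value_fn(z):
    """Toy trajectory value: higher when z is near a fixed 'goal' latent."""
    goal = np.ones(LATENT_DIM) / np.sqrt(LATENT_DIM)
    return -np.linalg.norm(z - goal, axis=-1)

def plan_latent(num_iters=5, num_samples=64, elite_frac=0.125):
    """Iteratively refine a sampling distribution over latent trajectories."""
    mean = np.zeros(LATENT_DIM)
    std = np.ones(LATENT_DIM)
    for _ in range(num_iters):
        # Sample candidate latent trajectories around the current mean.
        z = mean + std * rng.standard_normal((num_samples, LATENT_DIM))
        scores = value_fn(z)
        # Keep the top-scoring (elite) candidates and refit the distribution.
        elites = z[np.argsort(scores)[-int(num_samples * elite_frac):]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mean  # optimized latent trajectory feature

z_star = plan_latent()
# In the full system, an action expert would decode z_star into
# executable actions; here z_star is just the refined latent.
```

The design point this illustrates is that the search happens in the latent trajectory space rather than in raw action space, so the value function can rank whole imagined futures before any action is committed.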

Results

LIBERO Benchmark

WAV LIBERO results

WAV improves long-horizon closed-loop manipulation performance on LIBERO by reasoning over future trajectories rather than predicting actions purely reactively.

Real-World Evaluation

WAV real-world results

The same framework transfers to real bimanual robotics and maintains strong performance under clutter, noisy dynamics, and longer task horizons.

Real-World Comparisons

Task 1. Place Object in Drawer

Representative comparison between the baseline policy and WAV on a long-horizon drawer placement sequence.

Side-by-side videos: Baseline vs. WAV (3 runs).

Task 2. Flatten Towel

Deformable-object manipulation requires stable long-horizon correction and coordinated bimanual behavior.

Side-by-side videos: Baseline vs. WAV (3 runs).

Task 3. Stack Bowls

WAV preserves stronger visual grounding and multi-step consistency on cluttered tabletop rearrangement.

Side-by-side videos: Baseline vs. WAV (3 runs).

BibTeX

@article{li2026world,
  title={World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems},
  author={Li, Runze and Zhang, Hongyin and Jin, Junxi and Zeng, Qixin and Zhuang, Zifeng and Tang, Yiqi and Lyu, Shangke and Wang, Donglin},
  journal={arXiv preprint arXiv:2604.14732},
  year={2026}
}