Vision-Language-Action models have become a promising route to embodied agents, but most existing approaches optimize action prediction directly and struggle to reason over long-horizon outcomes. The World-Value-Action Model (WAV) introduces a unified world-value-action formulation that learns a latent representation of future trajectories conditioned on observations and language. A learned world model predicts future states, a trajectory value function evaluates long-horizon utility, and action generation becomes latent-space inference that concentrates probability mass on high-value, dynamically feasible futures. Instead of planning directly in action space, WAV performs iterative inference in a compact latent trajectory space, which biases sampling toward feasible futures and improves long-horizon decision making in both simulation and real-world robotic manipulation.
WAV combines instruction-conditioned video generation, trajectory value estimation, action prediction, and latent trajectory planning in a single multi-view architecture. The model first imagines future visual rollouts, then scores candidate futures with a value expert, and finally decodes executable actions from the optimized latent trajectory features.
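The imagine-score-decode loop above can be sketched as iterative inference in latent space. The snippet below is a minimal illustration, not the paper's implementation: the learned world model, value expert, and action decoder are replaced by hypothetical toy linear maps, and the latent refinement is shown as a cross-entropy-style sampling loop that concentrates probability mass on high-value latent codes (all dimensions and hyperparameters are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 8   # dimensionality of the latent trajectory code (assumed)
ACTION_DIM = 4   # dimensionality of the decoded action (assumed)

# Toy stand-ins for the learned components (random linear maps).
W_world = rng.normal(size=(LATENT_DIM, LATENT_DIM))   # "world model"
w_value = rng.normal(size=LATENT_DIM)                 # "value expert"
W_action = rng.normal(size=(ACTION_DIM, LATENT_DIM))  # "action decoder"

def rollout(z):
    """Imagine a future state from a latent trajectory code."""
    return np.tanh(W_world @ z)

def value(z):
    """Score the imagined future's long-horizon utility."""
    return float(w_value @ rollout(z))

def plan(iters=20, pop=64, elite=8):
    """Iterative inference in latent space: resample around the
    highest-value candidates so probability mass concentrates on
    high-value, feasible latent trajectories."""
    mu, sigma = np.zeros(LATENT_DIM), np.ones(LATENT_DIM)
    for _ in range(iters):
        cand = rng.normal(mu, sigma, size=(pop, LATENT_DIM))
        scores = np.array([value(z) for z in cand])
        best = cand[np.argsort(scores)[-elite:]]  # keep the elite set
        mu = best.mean(axis=0)
        sigma = best.std(axis=0) + 1e-6
    return mu

z_star = plan()                 # optimized latent trajectory code
action = W_action @ z_star      # decode an executable action
```

The key design choice this illustrates is that the search happens over the compact latent code, not over raw action sequences, so every candidate the value expert scores corresponds to a trajectory the world model considers plausible.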
WAV improves long-horizon closed-loop manipulation performance on LIBERO by reasoning over future trajectories rather than predicting actions purely reactively.
The same framework transfers to real bimanual robots and maintains strong performance under clutter, noisy dynamics, and longer task horizons.
Representative comparison between the baseline policy and WAV on a long-horizon drawer placement sequence.
Baseline
WAV
Deformable-object manipulation requires stable long-horizon correction and coordinated bimanual behavior.
Baseline
WAV
WAV preserves stronger visual grounding and multi-step consistency on cluttered tabletop rearrangement.
Baseline
WAV
@article{li2026world,
  title={World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems},
  author={Li, Runze and Zhang, Hongyin and Jin, Junxi and Zeng, Qixin and Zhuang, Zifeng and Tang, Yiqi and Lyu, Shangke and Wang, Donglin},
  journal={arXiv preprint arXiv:2604.14732},
  year={2026}
}