Guanxing Lu, Chubin Zhang, Haonan Jiang, Yuheng Zhou, Zifeng Gao, Yansong Tang$^{\spadesuit}$, Ziwei Wang$^{\clubsuit}$
$^{\spadesuit}$: Corresponding; $^{\clubsuit}$: Project lead
GitHub: https://github.com/GuanxingLu/vlarl
First published on: 2025/4/13
TL;DR
> 🤗 *We want AI agents that can discover like we can, not which contain what we have discovered.* — Richard S. Sutton, "The Bitter Lesson"
Recent vision-language-action (VLA) models highlight a potential pathway toward generalist robots through exploitation-based imitation. However, better performance demands exponentially more high-quality demonstrations to imitate, making continuous improvement increasingly intractable. At the same time, robotic applications naturally require high precision. We believe the key to overcoming these challenges lies in transforming exploitation-based approaches into exploration-based ones, as exemplified by reinforcement learning (RL). In this blog, we seek a scalable path toward masterful and general robotic manipulation with efficient online reinforcement learning.
Proximal Policy Optimization (PPO) is a widely used policy gradient method in reinforcement learning that maximizes expected cumulative reward through direct policy optimization. The algorithm (and our single-file implementation) follows these key steps:
Rollout Phase: We first merge the LoRA weights into the base checkpoint and broadcast the merged weights to the inference engine. The agent then interacts with the environment according to its current policy $\pi_{\theta_{old}}$, generating sequences of states, actions, and rewards (i.e., trajectories). In an autoregressive model, the log-probability of an action sequence decomposes into a sum of token-level log-probabilities:
$\log \pi_{\theta_{old}}(a_t|s_t) = \sum_{i=1}^{n} \log \pi_{\theta_{old}}(\text{token}_i|\text{tokens}_{<i}, s_t)$
where $n=7$ is the number of degrees of freedom of OpenVLA's action space, with each degree of freedom discretized into one action token.
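To make this decomposition concrete, here is a minimal sketch of summing token-level log-probabilities into a single action log-probability. The tensor names and shapes are illustrative assumptions, not the exact interfaces of our codebase:

```python
import torch

def action_log_prob(logits: torch.Tensor, action_tokens: torch.Tensor) -> torch.Tensor:
    """Sum token-level log-probs into a single action log-prob.

    logits:        (batch, n_action_tokens, vocab_size) -- logits of the n = 7 action tokens
    action_tokens: (batch, n_action_tokens)             -- sampled discrete action tokens
    Returns:       (batch,) log pi(a_t | s_t)
    """
    log_probs = torch.log_softmax(logits, dim=-1)                           # per-token log-probs
    token_logp = log_probs.gather(-1, action_tokens.unsqueeze(-1)).squeeze(-1)
    return token_logp.sum(dim=-1)                                           # sum over the 7 action tokens
```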
Advantage Computation: For each state, the advantage is computed via Generalized Advantage Estimation (GAE):
$A_{t}=\sum_{l=0}^{\infty}(\gamma\lambda)^{l}(r_{t+l}+\gamma V(s_{t+l+1})-V(s_{t+l}))$
where $\lambda\in[0,1]$ is the GAE parameter that controls the trade-off between bias and variance, $\gamma$ is the discount factor, and $V(s)$ is the value function estimate for state $s$.
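Below is a minimal PyTorch sketch of the backward GAE recursion. The argument names, shapes, and done-masking convention are assumptions for illustration:

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a rollout of length T.

    rewards, dones: (T,) tensors; values: (T + 1,) tensor including a bootstrap value.
    Returns advantages (T,) and value targets (T,).
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t].float()
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]   # TD residual
        last_gae = delta + gamma * lam * not_done * last_gae                # recursive GAE accumulation
        advantages[t] = last_gae
    returns = advantages + values[:-1]                                      # regression targets for V(s)
    return advantages, returns
```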
Learning Phase: The PPO objective function utilizes importance sampling with clipping to ensure stable updates:
$L(\theta) = \mathbb{E}_t\left[\min\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}A_t, \text{clip}\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}, 1-\epsilon, 1+\epsilon\right)A_t\right)\right]$
where $\epsilon$ is the clipping parameter that restricts the ratio between the new and old policies, preventing excessive policy updates.
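As a minimal sketch, the clipped surrogate can be written as a loss to minimize; the function and argument names are illustrative assumptions:

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Clipped surrogate objective, returned as a loss to minimize.

    new_logp, old_logp, advantages: (batch,) tensors of action log-probs and advantages.
    """
    ratio = torch.exp(new_logp - old_logp)                             # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                       # maximize surrogate -> minimize negation
```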
The PPO infrastructure uses bfloat16 to fit the models in GPU memory. We allocate one dedicated GPU to the inference engine for accelerated rollout and the remaining N-1 GPUs for learning, orchestrated with Ray (https://github.com/ray-project/ray), following prior open-source RL systems. We implement multiple vectorized environments for parallel rollout, where each training GPU holds a subset of the environments. Distributed training is managed with PyTorch Fully Sharded Data Parallel (FSDP) (https://github.com/pytorch/pytorch) to stay consistent with the pretraining setup of OpenVLA (https://github.com/openvla/openvla).
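To make the GPU layout concrete, here is a minimal, hypothetical Ray sketch with one dedicated inference actor and N-1 learner actors. The class and method names, GPU counts, and call pattern are our illustrative assumptions, not the interfaces of the repository:

```python
import ray

# Hypothetical layout: 1 GPU for the rollout/inference actor, N - 1 GPUs for FSDP learners.
NUM_LEARNER_GPUS = 3  # e.g., N - 1 on a 4-GPU node

ray.init()

@ray.remote(num_gpus=1)
class InferenceWorker:
    """Holds the rollout policy on a dedicated GPU."""

    def load_weights(self, merged_state_dict):
        # Receive the merged (base + LoRA) weights broadcast from the learners.
        ...

    def generate_actions(self, observations):
        # Batched autoregressive decoding of the 7 action tokens per observation.
        ...

@ray.remote(num_gpus=1)
class LearnerWorker:
    """Runs a slice of the vectorized environments and FSDP-sharded PPO updates."""

    def collect_and_update(self, inference_handle):
        # Step local envs, query the inference actor for actions, then run PPO epochs.
        ...

inference = InferenceWorker.remote()
learners = [LearnerWorker.remote() for _ in range(NUM_LEARNER_GPUS)]
ray.get([w.collect_and_update.remote(inference) for w in learners])
```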
As PPO is highly sensitive to implementation details, we would like to share some tricks adopted in this project:
- We use an `all_reduce` operation to gather the environmental states across all workers for the inference engine (a minimal sketch of this pattern follows below).
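One way such an `all_reduce`-based gather can be implemented: each worker writes its local observations into its own slice of a zero-initialized global buffer, and a SUM all-reduce reconstructs the full batch on every rank, ready for the inference engine. The function name, shapes, and the assumption that every worker holds the same number of environments are ours, not the exact code in the repository; it also assumes an already-initialized torch.distributed process group:

```python
import torch
import torch.distributed as dist

def gather_env_states(local_obs: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """Gather per-worker environment observations via a single all_reduce.

    local_obs: (num_local_envs, *obs_shape) observations held by this worker.
    Returns:   (world_size * num_local_envs, *obs_shape) global batch, identical on all ranks.
    """
    num_local_envs = local_obs.shape[0]
    global_obs = torch.zeros(world_size * num_local_envs, *local_obs.shape[1:],
                             dtype=local_obs.dtype, device=local_obs.device)
    global_obs[rank * num_local_envs:(rank + 1) * num_local_envs] = local_obs
    dist.all_reduce(global_obs, op=dist.ReduceOp.SUM)   # zeros elsewhere, so the sum acts as a gather
    return global_obs
```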