Guanxing Lu, Chubin Zhang, Haonan Jiang, Yuheng Zhou, Zifeng Gao, Yansong Tang$^{\spadesuit}$, Ziwei Wang$^{\clubsuit}$
$^{\spadesuit}$: Corresponding; $^{\clubsuit}$: Project lead
GitHub: https://github.com/GuanxingLu/vlarl
First published on: 2025/4/13
TL;DR
> 🤗 *We want AI agents that can discover like we can, not which contain what we have discovered.* — Richard S. Sutton, "The Bitter Lesson"
Recent vision-language-action (VLA) models highlight a potential pathway toward generalist robots through exploitation-based imitation. However, better performance demands exponentially more high-quality demonstrations to imitate, making continuous improvement increasingly intractable. At the same time, robotic applications naturally require high precision. We believe the key to overcoming these challenges lies in transforming exploitation-based approaches into exploration-based ones, as exemplified by reinforcement learning (RL). In this blog, we seek a scalable path toward masterful and general robotic manipulation with efficient online reinforcement learning.
Proximal Policy Optimization (PPO) is a widely used policy gradient method in reinforcement learning that maximizes expected cumulative reward through direct policy optimization. The algorithm (and our single-file implementation) follows these key steps:
Rollout Phase: We first merge the LoRA weights into the base checkpoint and broadcast the merged weights to the inference engine. The agent then interacts with the environment according to its current policy $\pi_{\theta_{old}}$, generating sequences of states, actions, and rewards (i.e., trajectories). In an autoregressive model, the log-probability of an action sequence decomposes into a sum of token-level log-probabilities:
$\log \pi_{\theta_{old}}(a_t|s_t) = \sum_{i=1}^{n} \log \pi_{\theta_{old}}(\text{token}_i|\text{tokens}_{<i}, s_t)$
where $n=7$ is the number of degrees of freedom of OpenVLA's action space, with each degree of freedom discretized into one action token.
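To make this decomposition concrete, here is a minimal sketch of summing token-level log-probabilities into a single action log-probability. The tensor names and shapes are illustrative assumptions, not the exact interfaces of our codebase:

```python
import torch

def action_log_prob(logits: torch.Tensor, action_tokens: torch.Tensor) -> torch.Tensor:
    """Sum token-level log-probs into a single action log-prob.

    logits:        (batch, n_action_tokens, vocab_size) -- logits of the n = 7 action tokens
    action_tokens: (batch, n_action_tokens)             -- sampled discrete action tokens
    Returns:       (batch,) log pi(a_t | s_t)
    """
    log_probs = torch.log_softmax(logits, dim=-1)                           # per-token log-probs
    token_logp = log_probs.gather(-1, action_tokens.unsqueeze(-1)).squeeze(-1)
    return token_logp.sum(dim=-1)                                           # sum over the 7 action tokens
```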
Advantage Computation: For each state, the advantage is computed via Generalized Advantage Estimation (GAE):
$A_{t}=\sum_{l=0}^{\infty}(\gamma\lambda)^{l}(r_{t+l}+\gamma V(s_{t+l+1})-V(s_{t+l}))$
where $\lambda\in[0,1]$ is the GAE parameter that controls the trade-off between bias and variance, $\gamma$ is the discount factor, and $V(s)$ is the value function estimate for state $s$.
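Below is a minimal PyTorch sketch of the backward GAE recursion. The argument names, shapes, and done-masking convention are assumptions for illustration:

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a rollout of length T.

    rewards, dones: (T,) tensors; values: (T + 1,) tensor including a bootstrap value.
    Returns advantages (T,) and value targets (T,).
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t].float()
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]   # TD residual
        last_gae = delta + gamma * lam * not_done * last_gae                # recursive GAE accumulation
        advantages[t] = last_gae
    returns = advantages + values[:-1]                                      # regression targets for V(s)
    return advantages, returns
```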
Learning Phase: The PPO objective function utilizes importance sampling with clipping to ensure stable updates:
$L(\theta) = \mathbb{E}_t\left[\min\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}A_t, \text{clip}\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}, 1-\epsilon, 1+\epsilon\right)A_t\right)\right]$
where $\epsilon$ is the clipping parameter that restricts the ratio between the new and old policies, preventing excessive policy updates.
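As a minimal sketch, the clipped surrogate can be written as a loss to minimize; the function and argument names are illustrative assumptions:

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Clipped surrogate objective, returned as a loss to minimize.

    new_logp, old_logp, advantages: (batch,) tensors of action log-probs and advantages.
    """
    ratio = torch.exp(new_logp - old_logp)                             # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                       # maximize surrogate -> minimize negation
```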
The PPO infrastructure uses bfloat16 to fit the models in GPU memory. We allocate one dedicated GPU to the inference engine for accelerated rollout and the remaining N-1 GPUs for learning, orchestrated with Ray (https://github.com/ray-project/ray), following prior open-source RL systems. We implement multiple vectorized environments for parallel rollout, where each training GPU holds a subset of the environments. Distributed training is managed with PyTorch Fully Sharded Data Parallel (FSDP) (https://github.com/pytorch/pytorch) to stay consistent with the pretraining setup of OpenVLA (https://github.com/openvla/openvla).
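To make the GPU layout concrete, here is a minimal, hypothetical Ray sketch with one dedicated inference actor and N-1 learner actors. The class and method names, GPU counts, and call pattern are our illustrative assumptions, not the interfaces of the repository:

```python
import ray

# Hypothetical layout: 1 GPU for the rollout/inference actor, N - 1 GPUs for FSDP learners.
NUM_LEARNER_GPUS = 3  # e.g., N - 1 on a 4-GPU node

ray.init()

@ray.remote(num_gpus=1)
class InferenceWorker:
    """Holds the rollout policy on a dedicated GPU."""

    def load_weights(self, merged_state_dict):
        # Receive the merged (base + LoRA) weights broadcast from the learners.
        ...

    def generate_actions(self, observations):
        # Batched autoregressive decoding of the 7 action tokens per observation.
        ...

@ray.remote(num_gpus=1)
class LearnerWorker:
    """Runs a slice of the vectorized environments and FSDP-sharded PPO updates."""

    def collect_and_update(self, inference_handle):
        # Step local envs, query the inference actor for actions, then run PPO epochs.
        ...

inference = InferenceWorker.remote()
learners = [LearnerWorker.remote() for _ in range(NUM_LEARNER_GPUS)]
ray.get([w.collect_and_update.remote(inference) for w in learners])
```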
As PPO is highly sensitive to implementation details, we would like to share some tricks adopted in this project:
- We use an `all_reduce` operation to gather the environmental states across all workers for the inference engine (a minimal sketch of this pattern follows below).
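One way such an `all_reduce`-based gather can be implemented: each worker writes its local observations into its own slice of a zero-initialized global buffer, and a SUM all-reduce reconstructs the full batch on every rank, ready for the inference engine. The function name, shapes, and the assumption that every worker holds the same number of environments are ours, not the exact code in the repository; it also assumes an already-initialized torch.distributed process group:

```python
import torch
import torch.distributed as dist

def gather_env_states(local_obs: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """Gather per-worker environment observations via a single all_reduce.

    local_obs: (num_local_envs, *obs_shape) observations held by this worker.
    Returns:   (world_size * num_local_envs, *obs_shape) global batch, identical on all ranks.
    """
    num_local_envs = local_obs.shape[0]
    global_obs = torch.zeros(world_size * num_local_envs, *local_obs.shape[1:],
                             dtype=local_obs.dtype, device=local_obs.device)
    global_obs[rank * num_local_envs:(rank + 1) * num_local_envs] = local_obs
    dist.all_reduce(global_obs, op=dist.ReduceOp.SUM)   # zeros elsewhere, so the sum acts as a gather
    return global_obs
```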