Guanxing Lu, Chubin Zhang, Haonan Jiang, Yuheng Zhou, Zifeng Gao, Yansong Tang$^{\spadesuit}$, Ziwei Wang$^{\clubsuit}$

$^{\spadesuit}$: Corresponding author; $^{\clubsuit}$: Project lead

GitHub: https://github.com/GuanxingLu/vlarl

First published on: 2025/4/13

TL;DR

> We want AI agents that can discover like we can, not which contain what we have discovered. — Richard S. Sutton, "The Bitter Lesson"

Introduction

Recent vision-language-action (VLA) models highlight a potential pathway toward generalist robots through exploitation-based imitation. However, better performance demands exponentially more high-quality examples to imitate, making continuous improvement increasingly intractable. At the same time, robotic applications naturally demand high precision. We believe the key to overcoming these challenges lies in transforming exploitation-based approaches into exploration-based ones, as exemplified by reinforcement learning (RL). In this blog, we seek a scalable path toward masterful and general robotic manipulation via efficient online reinforcement learning.

Methodology

Reinforcement Learning

Proximal Policy Optimization (PPO) is a widely used policy gradient method in reinforcement learning that maximizes expected cumulative reward through direct policy optimization. The algorithm (and our single-file implementation) follows these key steps:

1. Collect rollouts by executing the current policy in the environment.
2. Estimate advantages for the collected transitions (e.g., with Generalized Advantage Estimation).
3. Update the policy for several epochs over minibatches by optimizing the clipped surrogate objective, typically alongside a value loss and an entropy bonus.
4. Repeat with the updated policy.
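For reference, the core of step 3 is the clipped surrogate objective. The snippet below is a minimal, generic PyTorch sketch of that loss; the function name and signature are illustrative, not the repository's actual API:

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate loss (returned as a quantity to minimize).

    logp_new:   log-probs of the taken actions under the current policy
    logp_old:   log-probs of the same actions under the rollout policy (detached)
    advantages: advantage estimates (e.g., from GAE), usually normalized per batch
    """
    ratio = torch.exp(logp_new - logp_old)                                 # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (minimum) surrogate and negate it for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```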

Infrastructure

The PPO infrastructure uses bfloat16 to fit the models into memory. We allocate one dedicated GPU for accelerated inference and the remaining N-1 GPUs for learning, coordinated with https://github.com/ray-project/ray, following the design of prior distributed RL systems. We implement multiple vectorized environments for parallel rollout, where each training GPU holds a subset of the environments. The distributed training process is managed with PyTorch Fully Sharded Data Parallel (FSDP, https://github.com/pytorch/pytorch) to stay consistent with the pretraining setup of https://github.com/openvla/openvla.
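A minimal sketch of how such a split can be wired with Ray and FSDP is shown below; the actor/learner structure, function names, and casting choices here are illustrative assumptions, not the repository's actual code:

```python
import ray
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

ray.init()

@ray.remote(num_gpus=1)
class InferenceActor:
    """Holds a copy of the policy on its own dedicated GPU and serves actions for rollouts."""

    def __init__(self, model_fn):
        self.model = model_fn().to("cuda").to(torch.bfloat16).eval()

    @torch.no_grad()
    def act(self, obs_batch):
        return self.model(obs_batch.to("cuda"))

    def load_weights(self, state_dict):
        # Periodically sync the latest learner weights into the inference copy.
        self.model.load_state_dict(state_dict)

def build_learner(model_fn):
    """Each of the remaining N-1 GPUs wraps its policy replica with FSDP for training."""
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    model = model_fn().to(torch.bfloat16)
    return FSDP(model, device_id=torch.cuda.current_device())
```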

Implementation

As PPO is highly sensitive to implementation details, we would like to share some of the tricks adopted in this project: