RT-2: Vision-Language-Action Models for Generalizable Robotic Control: A Comprehensive Review

Authors

  • Austin Zhou

DOI:

https://doi.org/10.56028/aetr.15.1.1423.2025

Keywords:

RT-2, Vision-Language Models, Action Tokenization, General-Purpose Robotics, Zero-Shot Learning

Abstract

Recent advances in large-scale vision-language models (VLMs) have opened new pathways for robotic learning and control. Google DeepMind’s Robotics Transformer 2 (RT-2) represents a transformative step by integrating pre-trained, internet-scale multimodal models with physical robotic systems. RT-2 recasts robot actions as text tokens, treating action prediction as another form of language output; through action tokenization and co-fine-tuning, knowledge from web-scale data transfers directly to low-level manipulation. This review presents a comprehensive analysis of RT-2’s architecture, training methodology, and empirical performance, highlighting its significant improvements in zero-shot generalization, emergent reasoning abilities, and real-world deployment capability. The paper also examines critical limitations, including physical precision, safety concerns, and computational requirements, and offers insights into future directions for scalable, embodied artificial intelligence. RT-2 stands at the forefront of general-purpose robotics and lays the foundation for more capable and adaptive human-robot interaction systems.
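
The abstract’s central idea, action tokenization, can be made concrete with a short sketch. The snippet below is a minimal illustration rather than RT-2’s actual implementation: the 256-bin discretization follows the published RT-2 paper, but the value range, the 8-dimensional action layout (termination flag, 6-DoF end-effector displacement, gripper), and the helper names tokenize_action and detokenize_action are assumptions made here for clarity.

import numpy as np

# Illustrative sketch of RT-2-style action tokenization.
# 256 bins follows the RT-2 paper; the [-1, 1] range, the
# 8-dimensional layout, and the function names are assumptions.
NUM_BINS = 256

def tokenize_action(action, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Discretize each continuous action dimension into a uniform bin
    and render the bin ids as a space-separated token string."""
    action = np.clip(np.asarray(action, dtype=np.float64), low, high)
    bins = np.round((action - low) / (high - low) * (num_bins - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def detokenize_action(tokens, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Invert tokenization: map bin ids back to continuous bin centers."""
    bins = np.array([int(t) for t in tokens.split()], dtype=np.float64)
    return low + bins / (num_bins - 1) * (high - low)

# Round trip: an 8-number string the VLM can emit alongside ordinary text.
action = [0.0, 0.12, -0.35, 0.80, 0.05, -0.60, 0.33, 1.0]
tokens = tokenize_action(action)       # e.g. "128 143 83 230 134 51 170 255"
recovered = detokenize_action(tokens)  # approximates `action` to bin width
print(tokens, recovered)

Because actions become plain number strings, the VLM can emit them as ordinary text; co-fine-tuning then mixes these action-labelled robot episodes with the original web-scale vision-language data, so the model retains its semantic knowledge while learning to produce valid action strings.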

Published

2025-11-20