RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
by Google DeepMind · Last verified 2026-03-17
RT-2 is a Vision-Language-Action (VLA) model that maps visual and language inputs directly to robotic actions. By co-fine-tuning a large vision-language model on both web-scale data and robotics data, it transfers knowledge from the internet to physical control, enabling robots to reason about and execute tasks involving novel objects and scenarios that never appeared in its robot training data. A minimal sketch of its action-tokenization scheme appears after the rating summary below.
https://arxiv.org/abs/2307.15818
Overall grade: B (Above Average)
Adoption: B+ · Quality: A+ · Freshness: B+ · Citations: B+ · Engagement: F
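The key mechanism behind this transfer is that the RT-2 paper represents each robot action as plain text: eight integers covering episode termination, the 6-DoF end-effector delta, and gripper extension, with each dimension quantized into 256 uniform bins, so the vision-language model can emit actions using its ordinary token vocabulary. The Python sketch below illustrates that binning scheme under stated assumptions: the eight dimensions and 256 bins follow the paper, but the per-dimension value ranges and all function names are placeholders invented for illustration.

```python
import numpy as np

# Sketch of RT-2-style action tokenization. The paper discretizes each of
# 8 action dimensions into 256 uniform bins and emits them as text tokens;
# the (low, high) ranges below are ASSUMED values for illustration only.

NUM_BINS = 256

ACTION_RANGES = {
    "terminate": (0.0, 1.0),     # episode-termination flag
    "dx":        (-0.05, 0.05),  # end-effector position deltas (m), assumed range
    "dy":        (-0.05, 0.05),
    "dz":        (-0.05, 0.05),
    "droll":     (-0.25, 0.25),  # end-effector rotation deltas (rad), assumed range
    "dpitch":    (-0.25, 0.25),
    "dyaw":      (-0.25, 0.25),
    "gripper":   (0.0, 1.0),     # gripper extension
}

def tokenize_action(action: dict) -> str:
    """Map a continuous action to a string of 8 bin indices."""
    tokens = []
    for name, (low, high) in ACTION_RANGES.items():
        x = float(np.clip(action[name], low, high))
        frac = (x - low) / (high - low)                 # normalize to [0, 1]
        bin_idx = min(int(frac * NUM_BINS), NUM_BINS - 1)
        tokens.append(str(bin_idx))
    return " ".join(tokens)

def detokenize_action(token_str: str) -> dict:
    """Invert tokenize_action, mapping each bin index to its bin center."""
    out = {}
    for (name, (low, high)), tok in zip(ACTION_RANGES.items(), token_str.split()):
        out[name] = low + (int(tok) + 0.5) / NUM_BINS * (high - low)
    return out

if __name__ == "__main__":
    a = {"terminate": 0.0, "dx": 0.01, "dy": -0.02, "dz": 0.0,
         "droll": 0.0, "dpitch": 0.1, "dyaw": 0.0, "gripper": 1.0}
    s = tokenize_action(a)
    print(s)                     # "0 153 76 128 128 179 128 255"
    print(detokenize_action(s))  # approximately recovers `a`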
Specifications
- License: Open Access
- Pricing: unknown
- Capabilities: end-to-end robotic control, visual reasoning, action generation, zero-shot generalization to new tasks, emergent reasoning capabilities, symbolic understanding, multi-stage semantic reasoning, transfer of web-scale knowledge, natural language instruction following
- Integrations:
- Use Cases:
- API Available: No
- Tags: robotics, vision-language-models, action-models, transfer-learning, google-research, foundation-models, embodied-ai, zero-shot-learning, generalist-robots, vla
- Added: 2026-03-17
- Completeness: 0.9%
Index Score
- Overall: 65.5
- Adoption: 72
- Quality: 90
- Freshness: 78
- Citations: 75
- Engagement: 0