RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
by Google DeepMind · Last verified 2026-03-17
RT-2 is a Vision-Language-Action (VLA) model that maps visual and language inputs directly to robotic actions. By co-fine-tuning a large vision-language model on both web-scale data and robotics data, it transfers knowledge from the internet to physical control, enabling robots to reason about and execute tasks involving novel objects and scenarios that never appeared in its robot training data. A minimal sketch of its action-tokenization scheme appears after the rating summary below.
https://arxiv.org/abs/2307.15818
Overall grade: B (Above Average)
Adoption: B+ · Quality: A+ · Freshness: B+ · Citations: B+ · Engagement: F
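The key mechanism behind this transfer is that the RT-2 paper represents each robot action as plain text: eight integers covering episode termination, the 6-DoF end-effector delta, and gripper extension, with each dimension quantized into 256 uniform bins, so the vision-language model can emit actions using its ordinary token vocabulary. The Python sketch below illustrates that binning scheme under stated assumptions: the eight dimensions and 256 bins follow the paper, but the per-dimension value ranges and all function names are placeholders invented for illustration.

```python
import numpy as np

# Sketch of RT-2-style action tokenization. The paper discretizes each of
# 8 action dimensions into 256 uniform bins and emits them as text tokens;
# the (low, high) ranges below are ASSUMED values for illustration only.

NUM_BINS = 256

ACTION_RANGES = {
    "terminate": (0.0, 1.0),     # episode-termination flag
    "dx":        (-0.05, 0.05),  # end-effector position deltas (m), assumed range
    "dy":        (-0.05, 0.05),
    "dz":        (-0.05, 0.05),
    "droll":     (-0.25, 0.25),  # end-effector rotation deltas (rad), assumed range
    "dpitch":    (-0.25, 0.25),
    "dyaw":      (-0.25, 0.25),
    "gripper":   (0.0, 1.0),     # gripper extension
}

def tokenize_action(action: dict) -> str:
    """Map a continuous action to a string of 8 bin indices."""
    tokens = []
    for name, (low, high) in ACTION_RANGES.items():
        x = float(np.clip(action[name], low, high))
        frac = (x - low) / (high - low)                 # normalize to [0, 1]
        bin_idx = min(int(frac * NUM_BINS), NUM_BINS - 1)
        tokens.append(str(bin_idx))
    return " ".join(tokens)

def detokenize_action(token_str: str) -> dict:
    """Invert tokenize_action, mapping each bin index to its bin center."""
    out = {}
    for (name, (low, high)), tok in zip(ACTION_RANGES.items(), token_str.split()):
        out[name] = low + (int(tok) + 0.5) / NUM_BINS * (high - low)
    return out

if __name__ == "__main__":
    a = {"terminate": 0.0, "dx": 0.01, "dy": -0.02, "dz": 0.0,
         "droll": 0.0, "dpitch": 0.1, "dyaw": 0.0, "gripper": 1.0}
    s = tokenize_action(a)
    print(s)                     # "0 153 76 128 128 179 128 255"
    print(detokenize_action(s))  # approximately recovers `a`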
Specifications
- License: Open Access
- Pricing: unknown
- Capabilities: end-to-end robotic control, visual reasoning, action generation, zero-shot generalization to new tasks, emergent reasoning capabilities, symbolic understanding, multi-stage semantic reasoning, transfer of web-scale knowledge, natural language instruction following
- Integrations:
- Use Cases:
- API Available: No
- Tags: robotics, vision-language-models, action-models, transfer-learning, google-research, foundation-models, embodied-ai, zero-shot-learning, generalist-robots, vla
- Added: 2026-03-17
- Completeness: 0.9%
Index Score
- Overall: 65.5
- Adoption: 72
- Quality: 90
- Freshness: 78
- Citations: 75
- Engagement: 0