Google DeepMind’s new RT-2

One of the first things you discover in the world of robotics is the complexity of simple tasks. Tasks that appear simple to humans involve countless variables that we take for granted. Robots don’t have that luxury.
That’s precisely why much of the industry focuses on repeatable tasks in structured environments. Thankfully, robotic learning has seen some game-changing breakthroughs in recent years, putting the industry on track to build and deploy more adaptable systems.
Last year, Google DeepMind’s robotics team showcased Robotics Transformer (RT-1), which trained its Everyday Robot systems to perform tasks like picking and placing objects and opening drawers. Trained on a database of 130,000 demonstrations, the system achieved a 97% success rate across “over 700” tasks, according to the team.
As artificial intelligence advances, we look to a future with more robots and automation than ever before. They already surround us: the robot vacuum that expertly navigates your home, a robot pet companion that entertains your furry friends, and robot lawnmowers that take over weekend chores. We appear to be inching towards living out The Jetsons in real life. But as smart as they seem, these robots have their limitations.
Google DeepMind unveiled RT-2, the first vision-language-action (VLA) model for robot control, which takes the robotics game several levels up. The system was trained on text and images from the internet, much like the large language models behind AI chatbots such as ChatGPT and Bing.

Our robots at home can perform the simple tasks they are programmed for: vacuum the floors, for example, and if the left-side sensor detects a wall, try to go around it. But traditional robotic control systems aren’t programmed to handle new situations and unexpected changes, and often they can’t perform more than one task at a time.
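To make that concrete, here is a minimal sketch of the kind of hard-coded, reactive control loop such a robot runs. The sensor flag and rule names are hypothetical stand-ins for illustration, not any real vacuum’s API.

```python
# A minimal sketch of hard-coded, reactive robot control.
# The sensor flag and rule names are hypothetical, not a real vacuum's API.

def vacuum_step(left_wall_detected: bool) -> str:
    # Every behavior is an explicit, pre-programmed rule. A situation the
    # programmer never anticipated (a new obstacle, a second task) has no
    # branch here, so the robot cannot improvise one.
    if left_wall_detected:
        return "turn_right"      # rule: steer around walls on the left
    return "drive_forward"       # rule: otherwise keep vacuuming ahead

print(vacuum_step(left_wall_detected=True))   # -> turn_right
print(vacuum_step(left_wall_detected=False))  # -> drive_forward
```

Every capability must be spelled out in advance this way, which is exactly why such systems struggle with anything outside their script.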
The DeepMind team adapted two existing models, Pathways Language and Image Model (PaLI-X) and Pathways Language Model Embodied (PaLM-E), to train RT-2. PaLI-X helps the model process visual data; it was trained on massive amounts of images and visual information paired with corresponding descriptions and labels from across the web. With PaLI-X, RT-2 can recognize different objects, understand surrounding scenes for context, and relate visual data to semantic descriptions.
PaLM-E helps RT-2 interpret language, so it can easily understand instructions and relate them to what is around it and what it’s currently doing.
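Put together, this is what makes RT-2 a vision-language-action model: it reads an image and an instruction and writes the resulting action as a string of text tokens, much the way a chatbot writes a reply. The sketch below illustrates how such a tokenized action string could be decoded back into motor commands; the field layout, 256-bin discretization, and function name are simplifying assumptions for illustration, not DeepMind’s exact encoding.

```python
# Illustrative sketch: decoding a VLA-style action string into motor
# commands. The token layout (terminate flag, six arm deltas, gripper)
# and 256-bin discretization are assumptions, not DeepMind's exact format.

def decode_action(token_string: str, num_bins: int = 256) -> dict:
    """Turn a string of discretized action tokens into motor commands."""

    def bin_to_value(b: int) -> float:
        # Map an integer bin back to a continuous value in [-1.0, 1.0].
        return 2.0 * b / (num_bins - 1) - 1.0

    tokens = [int(t) for t in token_string.split()]
    terminate, *arm_bins, gripper_bin = tokens
    return {
        "terminate": bool(terminate),
        # Six arm values: x, y, z translation and roll, pitch, yaw rotation.
        "arm_deltas": [round(bin_to_value(b), 3) for b in arm_bins],
        "gripper": round(bin_to_value(gripper_bin), 3),
    }

# A hypothetical model output for an instruction like "pick up the can":
print(decode_action("0 128 91 241 5 101 127 217"))
```

Because the action comes out as ordinary text, the same network that learned from web images and captions can drive the robot directly, with no separate low-level controller to translate its intent.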