GNM: A General Navigation Model to Drive Any Robot
Abstract
Learning provides a powerful tool for vision-based navigation, but the capabilities of learning-based policies are constrained by limited training data. If we could combine data from all available sources, including multiple kinds of robots, we could train more powerful navigation models. In this paper, we study how a general goal-conditioned model for vision-based navigation can be trained on data obtained from many distinct but structurally similar robots, and enable broad generalization across environments and embodiments. We analyze the necessary design decisions for effective data sharing across robots, including the use of temporal context and standardized action spaces, and demonstrate that an omnipolicy trained from heterogeneous datasets outperforms policies trained on any single dataset. We curate 60 hours of navigation trajectories from 6 distinct robots, and deploy the trained GNM on a range of new robots, including an underactuated quadrotor. We find that training on diverse data leads to robustness against degradation in sensing and actuation. Using a pre-trained navigation model with broad generalization capabilities can bootstrap applications on novel robots going forward, and we hope that the GNM represents a step in that direction. For more information on the datasets, code, and videos, please check out our project page https://sites.google.com/view/drive-any-robot.
Community
Introduces GNM (General Navigation Model): a single omnipolicy trained on large, diverse, heterogeneous datasets to control a variety of robots (including new/unseen ones) in different environments; a step toward general-purpose visual navigation models. Related to transfer learning, DroNet, etc.; uses topological graphs for high-level planning and image-goal policies for low-level control. Combines 9 datasets across different platforms, movement speeds, durations, and environments, and frames the problem as image-goal navigation. Uses a shared, transformed action space based on relative waypoints, yaw change, and mid-level actions, which is common to diverse robots; actions and temporal distances are normalized by the top speed of each robot. Learns an egocentric embodiment context from past observations to help generalise. The architecture has two MobileNetV2 CNN encoders: one for the context and current observations, and another for those together with the goal. All embeddings are passed to an FC layer, with separate heads for normalized actions/waypoints and temporal distances. Image-goal pairs are sampled as positives (from the same trajectory) and negatives (from different trajectories); the distance head is trained on both, while the action head is trained only on positives. The topological map (and training) setup follows ViNG. Results: outperforms single-robot policies (GS, RECON) across multiple robots; more data is better; ablations cover action space, architecture, and context; GNM is more robust to hardware faults and (ego) viewpoint changes. From UC Berkeley (Sergey Levine), Toyota.
Links: website (part of general navigation models), arXiv, GitHub
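The two data-side ideas in the summary above (normalizing actions by each robot's top speed, and sampling positive/negative image-goal pairs) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the `step_seconds`, `max_horizon`, and `p_negative` parameters, and the choice to label negatives with the maximum horizon are all assumptions made for the example.

```python
import random


def normalize_waypoint(dx, dy, top_speed, step_seconds=1.0):
    """Scale a relative waypoint (robot frame) by the distance the robot
    covers in one step at top speed, so fast and slow robots share units.
    `step_seconds` is a hypothetical parameter, not taken from the paper."""
    scale = top_speed * step_seconds
    if scale <= 0:
        raise ValueError("top_speed and step_seconds must be positive")
    return dx / scale, dy / scale


def sample_pair(trajectories, rng, max_horizon=20, p_negative=0.5):
    """Sample one (observation, goal) training pair.

    `trajectories` is a list of frame lists, each with at least 2 frames.
    Positives: goal comes later in the same trajectory; the distance label
    is the number of steps between them and the action head is trained.
    Negatives: goal comes from a different trajectory; only the distance
    head is trained. Labelling negatives with `max_horizon` is an
    assumption of this sketch.
    """
    traj_idx = rng.randrange(len(trajectories))
    traj = trajectories[traj_idx]
    obs_t = rng.randrange(len(traj) - 1)  # leave room for a later goal

    if len(trajectories) > 1 and rng.random() < p_negative:
        others = [i for i in range(len(trajectories)) if i != traj_idx]
        other = trajectories[rng.choice(others)]
        return {
            "obs": traj[obs_t],
            "goal": other[rng.randrange(len(other))],
            "distance": max_horizon,
            "train_action": False,
        }

    gap = rng.randint(1, min(max_horizon, len(traj) - 1 - obs_t))
    return {
        "obs": traj[obs_t],
        "goal": traj[obs_t + gap],
        "distance": gap,
        "train_action": True,
    }


if __name__ == "__main__":
    # A slow and a fast robot moving one "top-speed step" map to the same action.
    print(normalize_waypoint(0.5, 0.0, top_speed=0.5))
    print(normalize_waypoint(2.0, 0.0, top_speed=2.0))

    # Frames are stand-in (trajectory id, timestep) tuples instead of images.
    demo = [[("a", t) for t in range(10)], [("b", t) for t in range(5)]]
    print(sample_pair(demo, random.Random(0), max_horizon=4))
```

Dividing by top speed is what lets one action head serve robots of very different scales, and the `train_action` flag mirrors the note that only positives supervise the action head.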