Abstract

We show that large language models (LLMs) can be adapted to be generalizable policies for embodied visual tasks. Our approach, called Large LAnguage model Reinforcement Learning Policy (LLaRP), adapts a pre-trained frozen LLM to take as input text instructions and visual egocentric observations and output actions directly in the environment. Using reinforcement learning, we train LLaRP to see and act solely through environmental interactions. We show that LLaRP is robust to complex paraphrasings of task instructions and can generalize to new tasks that require novel optimal behavior. In particular, on 1,000 unseen tasks it achieves a 42% success rate, 1.7x the success rate of other common learned baselines or zero-shot applications of LLMs. Finally, to aid the community in studying language-conditioned, massively multi-task embodied AI problems, we release a novel benchmark, Language Rearrangement, consisting of 150,000 training and 1,000 testing tasks for language-conditioned rearrangement.

In this work, we show that LLMs can be adapted for Reinforcement Learning (RL) problems in Embodied AI using a method called Large LAnguage model Reinforcement Learning Policy (LLaRP). LLaRP solves diverse rearrangement tasks specified by natural-language instructions in unseen houses while operating from an egocentric RGB camera and interacting with the environment via an arm and mobile base. The method generalizes robustly to unseen objects and scenes; to novel ways of referring to objects, whether by description or by an explanation of an activity; and even to novel task descriptions involving a variable number of rearrangements, spatial descriptions, and conditional statements.

LLaRP takes as input the task instruction and all egocentric RGB frames observed so far in the current episode. The instruction is encoded with the LLM's token embeddings, and the visual observations are encoded with a vision encoder. The hidden outputs of the LLM are projected to action and value predictions. The observation encoder and action decoder MLPs are the only trained components; the LLM and vision encoder remain frozen. The agent learns with online RL (PPO) by interacting with the environment.
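
For concreteness, the sketch below shows how such a policy head could be wired up in PyTorch: a frozen LLM and frozen vision encoder, with only a small observation-encoder MLP and action/value heads trained by RL. The module names, layer sizes, and the HuggingFace-style inputs_embeds / last_hidden_state interface are assumptions for illustration, not the released implementation.

# Minimal sketch of a LLaRP-style policy head (PyTorch). The frozen LLM and
# vision encoder are passed in as opaque modules; all names and sizes here
# are illustrative assumptions, not the released code.
import torch
import torch.nn as nn

class LLaRPPolicy(nn.Module):
    def __init__(self, llm, vision_encoder, vis_dim, llm_dim, num_actions):
        super().__init__()
        self.llm = llm                        # frozen pre-trained LLM
        self.vision_encoder = vision_encoder  # frozen visual backbone
        for p in self.llm.parameters():
            p.requires_grad_(False)
        for p in self.vision_encoder.parameters():
            p.requires_grad_(False)

        # Trainable observation encoder: projects visual features into the
        # LLM's token-embedding space.
        self.obs_encoder = nn.Sequential(
            nn.Linear(vis_dim, llm_dim), nn.ReLU(), nn.Linear(llm_dim, llm_dim)
        )
        # Trainable action decoder and value head over the LLM hidden states.
        self.action_head = nn.Linear(llm_dim, num_actions)
        self.value_head = nn.Linear(llm_dim, 1)

    def forward(self, instr_embeds, rgb_frames):
        # instr_embeds: (B, T_text, llm_dim) instruction token embeddings
        # rgb_frames:   (B, T_obs, C, H, W) egocentric frames so far
        B, T = rgb_frames.shape[:2]
        with torch.no_grad():  # the visual backbone is never updated
            vis_feats = self.vision_encoder(rgb_frames.flatten(0, 1))
        obs_tokens = self.obs_encoder(vis_feats).view(B, T, -1)

        # Prepend instruction tokens and run the frozen LLM. Gradients still
        # flow through it to the observation encoder, but its weights stay fixed.
        seq = torch.cat([instr_embeds, obs_tokens], dim=1)
        hidden = self.llm(inputs_embeds=seq).last_hidden_state
        obs_hidden = hidden[:, instr_embeds.shape[1]:]
        return self.action_head(obs_hidden), self.value_head(obs_hidden)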

LLaRP Success Examples

In the new Language Rearrangement benchmark, we show examples of LLaRP successfully zero-shot generalizing to unseen instructions in unseen houses. The agent is trained on 150,000 basic rearrangement instructions and evaluated on instructions that test generalization with respect to different rearrangement concepts expressed in language. The policy takes as input the task instruction and the 1st-person RGB head camera (top right of the visualizations); the 3rd-person and top-down views are only for visualization and are not provided to the policy. In the videos, the LLaRP policy selects high-level action primitives like pick(apple), navigate(table), or close(fridge), which then execute the low-level joint control; a sketch of this action interface follows below.
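
As a hedged illustration of that interface, the snippet below enumerates (primitive, entity) pairs into a single discrete action space and decodes a policy output index into a primitive call. The primitive names mirror the examples above; the entity list and the decoding scheme are assumptions for illustration, not the benchmark's actual API.

# Illustrative sketch: mapping a discrete policy output onto parameterized
# high-level primitives. Entity names and ordering are hypothetical.
from itertools import product

PRIMITIVES = ["pick", "place", "navigate", "open", "close"]
ENTITIES = ["apple", "table", "fridge", "counter"]  # assumed entity set

# Flatten (primitive, entity) pairs into one discrete action space so the
# policy's action head can output a single index per step.
ACTION_SPACE = list(product(PRIMITIVES, ENTITIES))

def decode_action(action_index: int) -> str:
    """Turn the policy's discrete output into a primitive call string."""
    primitive, entity = ACTION_SPACE[action_index]
    return f"{primitive}({entity})"

if __name__ == "__main__":
    print(decode_action(8))  # -> "navigate(apple)" under this ordering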

Context

Describe a situation where a particular object fits.

Referring Expressions

Refer to objects by their visual appearance.

Instruction Rephrasing

Swap the order in which nouns appear in the instruction and substitute synonyms for verbs.

Irrelevant Instruction Text

Instructions that include irrelevant context.

Novel Objects

New entity/instruction pairs.

Multiple Objects

Find all instances of an object.

Conditional Instructions

Adjust behavior based on whether the conditional statement is true.

Spatial Relationships

Refer to receptacles indirectly by their location relative to other objects.

LLaRP Failure Examples

Context

Referring Expressions

Spatial Relationships

Irrelevant Instruction Text

Multiple Rearrangements

Multiple Objects

Conditional Instructions

BibTeX

@article{szot2023large,
  author    = {Szot, Andrew and Schwarzer, Max and Mazoure, Bogdan and Agrawal, Harsh and Talbott, Walter and Metcalf, Katherine and Mackraz, Natalie and Hjelm, Devon and Toshev, Alexander},
  title     = {Large Language Models as Generalizable Policies for Embodied Tasks},
  journal   = {preprint},
  year      = {2023},
}