The pipeline of policy adaptation from foundation model feedback (PAFF). When we adapt a trained policy to a new task, we first let the robot play: the policy continuously predicts and performs actions given a series of randomly generated language instructions. We record these demonstrations, including the visual observations and the policy's actions. After that, we let the model relabel: the vision-language foundation model relabels the demonstrations by retrieving the language instructions that match the recorded visual observations. We then fine-tune the policy on the accurately paired observations, instructions, and corresponding actions, all of which are collected automatically.
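To make the play-relabel-fine-tune loop concrete, the sketch below outlines one way the pipeline could be organized. It is a minimal illustration, not the paper's implementation: the interfaces `policy`, `vlm`, `env`, and `policy.finetune` are hypothetical placeholders, and the relabeling step assumes a CLIP-style image-text retrieval over a fixed pool of candidate instructions.

```python
import random

def paff_adapt(policy, vlm, env, candidate_instructions,
               num_rollouts=100, max_steps=50):
    """Play, relabel with a vision-language model, then fine-tune."""
    demos = []

    # 1) Play: the policy acts on randomly drawn language instructions while
    #    we record only the visual observations and the executed actions.
    for _ in range(num_rollouts):
        instruction = random.choice(candidate_instructions)
        obs = env.reset()
        for _ in range(max_steps):
            action = policy.act(obs, instruction)
            demos.append({"obs": obs, "action": action})
            obs, done = env.step(action)
            if done:
                break

    # 2) Relabel: the frozen vision-language model retrieves, for every
    #    recorded observation, the candidate instruction with the highest
    #    image-text similarity, yielding accurately paired data.
    text_emb = vlm.encode_text(candidate_instructions)       # (K, D)
    for demo in demos:
        img_emb = vlm.encode_image(demo["obs"])               # (D,)
        scores = img_emb @ text_emb.T                         # (K,)
        demo["instruction"] = candidate_instructions[int(scores.argmax())]

    # 3) Fine-tune: adapt the policy on the relabeled
    #    (observation, instruction, action) triples.
    policy.finetune(demos)
    return policy
```

The per-frame retrieval shown here is only the simplest variant; retrieval could equally be applied at the level of whole recorded demonstrations.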
We train a policy on simulation data and adapt it to the real world.
We train a policy to pack objects of different shapes into the brown box and to put blocks of different colors into bowls of different colors, then adapt it to put objects of different shapes into bowls of different colors.
We train a policy to pack certain objects into the brown box and adapt it to pack unseen objects.
We train a policy on seen environments and adapt it to a new environment with different textures and differently positioned static elements, such as the sliding door, the drawer, and the light button.