auto-improvement.github.io

- VLM generates tasks
- Uses an image-editing diffusion model to generate images of subgoals
- Goal-conditioned robot policy (GCBC)
- Obs: 256x256 RGB images
- Action space: delta EEF control at 5 Hz
- VLM for success detection
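A minimal sketch of how these pieces might compose into one autonomous episode: propose a task, render a subgoal image, roll out the GCBC, then label success. All component names and the env interface here are my placeholders, not the paper's actual API.

```python
import numpy as np

# Hypothetical stand-ins for the components above -- names and signatures are my guesses.
def propose_task(scene_image):                   # VLM proposes a language task from the scene
    return "put the spoon in the bowl"

def generate_subgoal_image(scene_image, task):   # image-editing diffusion model renders a subgoal
    return scene_image.copy()                    # placeholder: the real model edits the scene image

def gcbc_policy(obs_image, goal_image):          # goal-conditioned policy: images -> delta EEF
    return np.zeros(5)                           # e.g. [dx, dy, dz, dyaw, gripper]

def check_success(scene_image, task):            # VLM judges whether the task was completed
    return False

def autonomous_episode(env, max_steps=50):
    """One self-improvement episode; env is assumed to expose reset() -> image, step(a) -> image."""
    obs = env.reset()                            # 256x256 RGB observation
    task = propose_task(obs)
    goal = generate_subgoal_image(obs, task)
    traj = []
    for _ in range(max_steps):
        action = gcbc_policy(obs, goal)          # delta-EEF command, executed at ~5 Hz on the robot
        obs = env.step(action)
        traj.append((obs, action))
    return traj, task, check_success(obs, task)
```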
Inspo for making the amzn task autonomous: perhaps eliminate the whole move-and-pick process and just use the wrist camera to feel around in the bin?
It would be much more impressive if it were a fully autonomous industrial system.
Foundation model takeaway
- New perspective on how the generated images can be used
- Looking at the failures, the generated subgoal images seem quite informative, though sometimes the objects look deformed
- If I had to guess, the GCBC (i.e., images to delta EEFs) is at fault for most failures
- It'd be interesting if they tried different inputs, e.g. using a masked image of the EEF/RT trajectories to train the GCBC (see the sketch below)
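On that last bullet: one way (purely my guess, nothing from the paper) to feed EEF trajectories to the GCBC would be to rasterize the projected EEF waypoints into a binary mask channel and stack it with the RGB observation and goal image.

```python
import numpy as np

def rasterize_eef_trajectory(pixel_points, image_size=256, radius=3):
    """Render projected EEF waypoints as a binary mask the same size as the observation.

    pixel_points: (N, 2) array of (row, col) image coordinates of the EEF over time,
    assumed to come from a camera projection of the recorded trajectory.
    """
    mask = np.zeros((image_size, image_size), dtype=np.float32)
    for r, c in np.asarray(pixel_points, dtype=int):
        r0, r1 = max(r - radius, 0), min(r + radius + 1, image_size)
        c0, c1 = max(c - radius, 0), min(c + radius + 1, image_size)
        mask[r0:r1, c0:c1] = 1.0                 # square blob around each waypoint
    return mask

def make_gcbc_input(rgb_obs, goal_rgb, eef_pixels):
    """Stack observation, goal image, and trajectory mask into one policy input tensor."""
    mask = rasterize_eef_trajectory(eef_pixels, image_size=rgb_obs.shape[0])
    return np.concatenate([rgb_obs, goal_rgb, mask[..., None]], axis=-1)  # (256, 256, 7)
```

The extra channel would let the policy condition on where the EEF has been (or is planned to go) without changing the image-based interface.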