auto-improvement.github.io

- VLM generates tasks
- Uses an image-editing diffusion model to generate images of subgoals
- Goal-conditioned robot policy (GCBC)
- Obs: 256x256 RGB images
- Action space: delta EEF control at 5 Hz
- VLM for success detection
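A minimal sketch of how these pieces might compose into one autonomous episode: propose a task, render a subgoal image, roll out the GCBC, then label success. All component names and the env interface here are my placeholders, not the paper's actual API.

```python
import numpy as np

# Hypothetical stand-ins for the components above -- names and signatures are my guesses.
def propose_task(scene_image):                   # VLM proposes a language task from the scene
    return "put the spoon in the bowl"

def generate_subgoal_image(scene_image, task):   # image-editing diffusion model renders a subgoal
    return scene_image.copy()                    # placeholder: the real model edits the scene image

def gcbc_policy(obs_image, goal_image):          # goal-conditioned policy: images -> delta EEF
    return np.zeros(5)                           # e.g. [dx, dy, dz, dyaw, gripper]

def check_success(scene_image, task):            # VLM judges whether the task was completed
    return False

def autonomous_episode(env, max_steps=50):
    """One self-improvement episode; env is assumed to expose reset() -> image, step(a) -> image."""
    obs = env.reset()                            # 256x256 RGB observation
    task = propose_task(obs)
    goal = generate_subgoal_image(obs, task)
    traj = []
    for _ in range(max_steps):
        action = gcbc_policy(obs, goal)          # delta-EEF command, executed at ~5 Hz on the robot
        obs = env.step(action)
        traj.append((obs, action))
    return traj, task, check_success(obs, task)
```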
Inspo for making the amzn task autonomous: perhaps eliminate the whole move-and-pick process and just use the wrist camera to feel around in the bin?
It would be much more impressive if it were a fully autonomous industrial system.
Foundation model takeaway
- New perspective on how the generated images can be used
- Looking at the failures, the generated subgoal images seem quite informative, though sometimes the objects look deformed
- If I had to guess, the GCBC (i.e., images to delta EEFs) is at fault for most failures
- It'd be interesting if they tried different inputs, e.g. using a masked image of the EEF/RT trajectories to train the GCBC (see the sketch below)
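On that last bullet: one way (purely my guess, nothing from the paper) to feed EEF trajectories to the GCBC would be to rasterize the projected EEF waypoints into a binary mask channel and stack it with the RGB observation and goal image.

```python
import numpy as np

def rasterize_eef_trajectory(pixel_points, image_size=256, radius=3):
    """Render projected EEF waypoints as a binary mask the same size as the observation.

    pixel_points: (N, 2) array of (row, col) image coordinates of the EEF over time,
    assumed to come from a camera projection of the recorded trajectory.
    """
    mask = np.zeros((image_size, image_size), dtype=np.float32)
    for r, c in np.asarray(pixel_points, dtype=int):
        r0, r1 = max(r - radius, 0), min(r + radius + 1, image_size)
        c0, c1 = max(c - radius, 0), min(c + radius + 1, image_size)
        mask[r0:r1, c0:c1] = 1.0                 # square blob around each waypoint
    return mask

def make_gcbc_input(rgb_obs, goal_rgb, eef_pixels):
    """Stack observation, goal image, and trajectory mask into one policy input tensor."""
    mask = rasterize_eef_trajectory(eef_pixels, image_size=rgb_obs.shape[0])
    return np.concatenate([rgb_obs, goal_rgb, mask[..., None]], axis=-1)  # (256, 256, 7)
```

The extra channel would let the policy condition on where the EEF has been (or is planned to go) without changing the image-based interface.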