VideoAgent: Self-Improving Video Generation

1University of Waterloo, 2IIT Kharagpur, 3Google DeepMind, 4Stanford University, 5Georgia Institute of Technology, 6New York University

*Equal contribution

Self-improving video plans via vision-language model (VLM) feedback and real-world feedback from online interaction, applied to robotic manipulation in simulation, embodied navigation, real-world robot manipulation, and video generation tasks.

Abstract

Video generation has been used to generate visual plans for controlling robotic systems. Given an image observation and a language instruction, prior work generates video plans that are then converted to robot controls for execution. However, a major bottleneck in leveraging video generation for control lies in the quality of the generated videos, which often suffer from hallucinatory content and unrealistic physics, resulting in low task success when control actions are extracted from the generated videos. While scaling up the dataset and model size provides a partial solution, integrating external feedback is both natural and essential for grounding video generation in the real world. Motivated by this observation, we propose VideoAgent, which self-improves generated video plans based on external feedback. Instead of directly executing the generated video plan, VideoAgent first refines it through a novel procedure we call self-conditioning consistency, utilizing feedback from a pretrained vision-language model (VLM). As the refined video plan is executed, VideoAgent collects additional data from the environment to further improve video plan generation. Experiments on simulated robotic manipulation in Meta-World and iTHOR show that VideoAgent drastically reduces hallucination, thereby boosting the success rate of downstream manipulation tasks. We further show that VideoAgent can effectively refine real-robot videos, providing an early indicator that robotics can be an effective tool for grounding video generation in the physical world.
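For a procedural view of the pipeline, the sketch below summarizes the loop described in the abstract. It is a minimal sketch: every helper (generate_video_plan, vlm_feedback, refine_with_self_conditioning, video_to_actions, execute) is a hypothetical placeholder standing in for a component of the method, not a real API.

```python
# Minimal sketch of the VideoAgent loop, assuming hypothetical helpers.
# None of the functions below are a real API; each stands in for a
# component described in the abstract.

def video_agent_episode(image, instruction, video_model, vlm, env,
                        max_refinements=3):
    # 1. Propose an initial video plan from the observation and instruction.
    video = generate_video_plan(video_model, image, instruction)

    # 2. Refine the plan via self-conditioning consistency, guided by
    #    feedback from a pretrained vision-language model (VLM).
    for _ in range(max_refinements):
        feedback = vlm_feedback(vlm, video, instruction)
        if feedback.task_looks_solved:
            break
        video = refine_with_self_conditioning(video_model, video, feedback)

    # 3. Convert the refined plan to robot controls, execute it, and keep
    #    the resulting rollout as extra data for the next online iteration.
    actions = video_to_actions(video)
    rollout = execute(env, actions)
    return video, rollout
```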

Quantitative Results

Meta-World Results

Meta-World Results. Mean success rates of baselines and VideoAgent variants on 11 simulated robot manipulation tasks from Meta-World. The strongest variant, VideoAgent-Online-Replan, matches or outperforms every baseline on all tasks.

| Task | AVDC | AVDC-Replan | VideoAgent | VideoAgent-Online (Iter1) | VideoAgent-Online (Iter2) | VideoAgent-Online-Replan |
| Door Open | 30.7% | 72.0% | 40.0% | 41.3% | 44.0% | 80.0% |
| Door Close | 28.0% | 89.3% | 29.3% | 32.0% | 29.3% | 97.3% |
| Basketball | 21.3% | 37.3% | 13.3% | 17.3% | 18.7% | 40.0% |
| Shelf Place | 8.0% | 18.7% | 9.3% | 12.0% | 18.7% | 22.7% |
| Button Press | 34.7% | 60.0% | 38.7% | 45.3% | 46.7% | 72.0% |
| Button Press Topdown | 17.3% | 24.0% | 18.7% | 14.7% | 16.0% | 40.0% |
| Faucet Close | 12.0% | 53.3% | 46.7% | 38.7% | 49.3% | 58.7% |
| Faucet Open | 17.3% | 24.0% | 12.0% | 13.3% | 21.3% | 36.0% |
| Handle Press | 41.3% | 81.3% | 36.0% | 36.0% | 44.0% | 85.3% |
| Hammer | 0.0% | 8.0% | 0.0% | 0.0% | 1.3% | 8.0% |
| Assembly | 5.3% | 6.7% | 1.3% | 4.0% | 1.3% | 10.7% |
| Overall | 19.6% | 43.1% | 22.3% | 23.2% | 26.4% | 50.0% |

iTHOR Success Rates

| Room | AVDC Baseline | VideoAgent (Ours) |
| Kitchen | 26.7% | 28.3% |
| Living Room | 23.3% | 26.7% |
| Bedroom | 38.3% | 41.7% |
| Bathroom | 36.7% | 40.0% |
| Overall | 31.3% | 34.2% |

BridgeData-V2 Results

| Metric | AVDC | VideoAgent (Ours) |
| CLIP Score | 22.39 | 22.90 |
| Flow Consistency | 2.48 ± 0.00 | 2.59 ± 0.01 |
| Visual Quality | 1.97 ± 0.003 | 2.01 ± 0.003 |
| Temporal Consistency | 1.48 ± 0.01 | 1.55 ± 0.01 |
| Dynamic Degree | 3.08 ± 0.01 | 3.07 ± 0.02 |
| Text-to-Video Alignment | 2.26 ± 0.003 | 2.30 ± 0.03 |
| Factual Consistency | 2.02 ± 0.004 | 2.07 ± 0.01 |
| Average Video Score | 2.16 ± 0.01 | 2.20 ± 0.01 |
| Human Eval on Task Success | 42.0% | 64.0% |
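For reference, a CLIP score of the kind reported above can be computed as the mean per-frame image-text similarity over a generated video. Below is a small sketch using the Hugging Face transformers CLIP API; the checkpoint choice and the ×100 scaling convention are our assumptions, not necessarily the paper's exact evaluation setup.

```python
# Sketch: per-frame CLIP score for a generated video, averaged over frames.
# Assumes the Hugging Face `transformers` CLIP implementation; the checkpoint
# and the x100 scaling are assumptions, not the paper's exact setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(frames: list[Image.Image], instruction: str) -> float:
    inputs = processor(text=[instruction], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
        # Normalize embeddings, then take cosine similarity per frame.
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        sims = (img @ txt.T).squeeze(-1)  # one similarity per frame
    return 100.0 * sims.mean().item()     # conventional x100 scaling
```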

Qualitative Results

Meta-World Qualitative Results

Synthesized Videos

Robot Executions

Task: door-close

Task: door-open

Task: button-press-topdown

Task: hammer

iTHOR Qualitative Results

Robot Executions

Bridge Qualitative Results

Synthesized Videos from AVDC


Synthesized Videos with Refinement using VideoAgent


Task Description: Put Banana in Colander

Effect of Different Feedback

Effect of Different Feedback. Descriptive feedback from the VLM leads to higher improvement in task success.
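To make the comparison concrete, the snippet below contrasts what a binary query and a descriptive query to the VLM might look like. Both prompt templates and the query_vlm helper are illustrative assumptions, not the paper's exact prompts or interface.

```python
# Illustrative prompts contrasting binary vs. descriptive VLM feedback.
# `query_vlm` is a hypothetical helper wrapping a pretrained VLM; neither
# these templates nor the helper reflect the paper's exact interface.

BINARY_PROMPT = (
    "Does this video successfully complete the task: '{task}'? "
    "Answer yes or no."
)

DESCRIPTIVE_PROMPT = (
    "The video should complete the task: '{task}'. Describe any "
    "hallucinated objects, physically implausible motion, or missing "
    "steps, so the video can be regenerated to fix them."
)

def get_feedback(vlm, video_frames, task, descriptive=True):
    prompt = DESCRIPTIVE_PROMPT if descriptive else BINARY_PROMPT
    return query_vlm(vlm, video_frames, prompt.format(task=task))
```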

Effect of Refinement Iterations

Effect of Refinement Iterations. The success rate on downstream tasks generally increases with the number of refinement iterations.

Effect of Online Iterations

Effect of Online Iterations. The overall task success of VideoAgent increases as the number of online iterations increases.