- Visual Foresight (Goal Image Generation)
  Given a natural language instruction and RGB-D observations of the object and placement scene, a language-conditioned image editing model generates a goal image showing the object placed according to the instruction. This image acts as a semantic hypothesis of the desired final configuration.
- Object Flow Inference
  The generated goal image is grounded into metric 3D space to infer a feasible motion trajectory for the object.
  - Metric 3D Scene Reconstruction: Recover a metrically consistent 3D scene from the generated image by aligning estimated depth with the observed scene depth.
  - Contact Footprint Estimation: Identify the support region where the object should be placed by estimating the contact footprint between the generated object and the environment.
  - Object Geometry Alignment: Replace the generated object's geometry with the real object geometry to ensure physically consistent placement.
  - Collision-Aware Object Flow Optimization: Compute a trajectory that moves the object from the current gripper pose to the target placement configuration while enforcing collision avoidance and motion smoothness.
- Placement Execution
  The inferred object flow is executed by the aerial manipulator using trajectory tracking, transporting the grasped object to the predicted placement configuration.
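
The metric 3D scene reconstruction step aligns estimated depth with observed depth, but the text does not say how. One common option, shown here purely as an assumption, is a least-squares scale-and-shift fit over pixels where the sensor depth is valid. The function names `fit_scale_shift` and `to_metric`, and the convention that missing sensor readings are `None` or non-positive, are hypothetical. Depth maps are flattened to plain sequences to keep the sketch dependency-free.

```python
from typing import List, Optional, Sequence, Tuple


def fit_scale_shift(
    estimated: Sequence[float],
    observed: Sequence[Optional[float]],
) -> Tuple[float, float]:
    """Least-squares fit so that observed ~= scale * estimated + shift.

    Assumption (not from the source): pixels whose observed depth is
    None or non-positive are treated as missing and skipped.
    """
    pairs = [
        (e, o)
        for e, o in zip(estimated, observed)
        if o is not None and o > 0.0
    ]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two valid depth pairs")

    mean_e = sum(e for e, _ in pairs) / n
    mean_o = sum(o for _, o in pairs) / n
    var_e = sum((e - mean_e) ** 2 for e, _ in pairs)
    if var_e == 0.0:
        raise ValueError("estimated depth is constant; scale is undetermined")

    # Closed-form solution of the 1D linear regression.
    cov = sum((e - mean_e) * (o - mean_o) for e, o in pairs)
    scale = cov / var_e
    shift = mean_o - scale * mean_e
    return scale, shift


def to_metric(estimated: Sequence[float], scale: float, shift: float) -> List[float]:
    """Map relative depth estimates into the metric frame of the sensor."""
    return [scale * e + shift for e in estimated]
```

For example, with estimated depths `[0.2, 0.4, 0.6, 0.8]` and observed depths `[1.0, 1.5, None, 2.5]`, the fit recovers a scale of about 2.5 and a shift of about 0.5, so the pixel with no sensor reading maps to roughly 2.0 m. A robust variant (for instance, one that rejects outlier pixels) may be preferable on real data.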
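
The collision-aware object flow optimization step can be illustrated with a toy trajectory optimizer. This is a minimal sketch under strong assumptions, not the method used here: the trajectory is position-only (no orientation), the environment is a single spherical obstacle, and collision avoidance is a soft penalty minimized by plain gradient descent rather than a hard constraint. The function name `optimize_flow` and all parameters and weights are hypothetical.

```python
import math
from typing import List, Sequence


def optimize_flow(
    start: Sequence[float],
    goal: Sequence[float],
    obstacle_center: Sequence[float],
    obstacle_radius: float,
    margin: float = 0.05,
    n_waypoints: int = 20,
    w_smooth: float = 1.0,
    w_collision: float = 5.0,
    step: float = 0.05,
    iters: int = 500,
) -> List[List[float]]:
    """Gradient descent on a smoothness cost plus a soft collision penalty.

    Cost = w_smooth * sum_i |p_{i+1} - p_i|^2
         + w_collision * sum_i 0.5 * max(0, safe - dist_i)^2
    The endpoints stay fixed; only interior waypoints move.
    """
    # Straight-line initialisation from start to goal.
    path = [
        [s + (g - s) * i / (n_waypoints - 1) for s, g in zip(start, goal)]
        for i in range(n_waypoints)
    ]
    safe = obstacle_radius + margin

    for _ in range(iters):
        grads = [[0.0, 0.0, 0.0] for _ in path]
        for i in range(1, n_waypoints - 1):
            p = path[i]
            # Smoothness gradient with respect to waypoint i.
            for k in range(3):
                grads[i][k] += w_smooth * 2.0 * (
                    2.0 * p[k] - path[i - 1][k] - path[i + 1][k]
                )
            # Collision gradient: active only inside the safety radius.
            diff = [p[k] - obstacle_center[k] for k in range(3)]
            dist = math.sqrt(sum(c * c for c in diff))
            if dist < safe:
                if dist < 1e-9:
                    # Degenerate case: pick an arbitrary escape direction (+z).
                    diff, dist = [0.0, 0.0, 1.0], 1.0
                for k in range(3):
                    grads[i][k] -= w_collision * (safe - dist) * diff[k] / dist
        # Apply the update to interior waypoints only.
        for i in range(1, n_waypoints - 1):
            for k in range(3):
                path[i][k] -= step * grads[i][k]
    return path
```

Because the penalty is soft, the result trades clearance against smoothness and does not guarantee a collision-free path; a real system would typically check the final trajectory against the reconstructed scene geometry and use the aligned object geometry rather than a point model.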