AeroPlace-Flow: Language-Grounded Object Placement for Aerial Manipulators via Visual Foresight and Object Flow


AeroPlace-Flow Overview

Overview of AeroPlace-Flow. Given a natural language instruction and RGB-D observations of the object and placement scene, our method infers a collision-free object flow for aerial manipulation in three main steps. (1) Visual Foresight: a language-conditioned image editing model generates a goal image of the scene with the object placed according to the instruction. (2) Object Flow Extraction: the generated image is converted into a metrically consistent 3D scene, contact footprints are estimated, and the original object geometry is used to compute a collision-free object flow trajectory. (3) Placement Execution: the aerial manipulator tracks the inferred object flow to execute the placement. Bottom: hardware demonstrations of language-conditioned aerial placement tasks in diverse scenarios. *The cable connected to the drone only supplies power.

Abstract

Precise object placement remains underexplored in aerial manipulation, where most systems rely on predefined target coordinates and focus primarily on grasping and control. Specifying exact placement poses, however, is cumbersome in real-world settings, where users naturally communicate goals through language. In this work, we present AeroPlace-Flow, a training-free framework for language-grounded aerial object placement that unifies visual foresight with explicit 3D geometric reasoning and object flow. Given RGB-D observations of the object and the placement scene, along with a natural language instruction, AeroPlace-Flow first synthesizes a task-complete goal image using image editing models. The imagined configuration is then grounded into metric 3D space through depth alignment and object-centric reasoning, enabling the inference of a collision-aware object flow that transports the grasped object to a language- and contact-consistent placement configuration. The resulting motion is executed via standard trajectory tracking for an aerial manipulator. AeroPlace-Flow produces executable placement targets without requiring predefined poses or task-specific training. We validate our approach through extensive simulation and real-world experiments, demonstrating reliable language-conditioned placement across diverse aerial scenarios with an average success rate of 75% on hardware.

How AeroPlace-Flow Works

  1. Visual Foresight (Goal Image Generation)
    Given a natural language instruction and RGB-D observations of the object and placement scene, a language-conditioned image editing model generates a goal image showing the object placed according to the instruction. This image acts as a semantic hypothesis of the desired final configuration.
  2. Object Flow Inference
    The generated goal image is grounded into metric 3D space to infer a feasible motion trajectory for the object.
    • Metric 3D Scene Reconstruction: Recover a metrically consistent 3D scene from the generated image by aligning its estimated depth with the observed scene depth.
    • Contact Footprint Estimation: Identify the support region where the object should be placed by estimating the contact footprint between the generated object and the environment.
    • Object Geometry Alignment: Replace the generated object's geometry with the real object geometry to ensure physically consistent placement.
    • Collision-Aware Object Flow Optimization: Compute a trajectory that moves the object from the current gripper pose to the target placement configuration while enforcing collision avoidance and motion smoothness.
  3. Placement Execution
    The inferred object flow is executed by the aerial manipulator using trajectory tracking, transporting the grasped object to the predicted placement configuration.
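As a concrete illustration of the metric 3D reconstruction step, the depth alignment can be sketched as a least-squares fit of a global scale and shift that maps the monocular depth estimate of the generated image onto the observed metric scene depth. This is a minimal sketch; the function and variable names are ours, not the paper's implementation.

```python
import numpy as np

def align_depth(d_gen, d_obs, mask):
    """Fit scale s and shift t minimizing ||s * d_gen + t - d_obs||^2 over valid pixels.

    d_gen: relative depth estimated from the generated goal image.
    d_obs: observed metric depth of the scene.
    mask:  boolean map of pixels where both depths are valid.
    """
    x = d_gen[mask].ravel()
    y = d_obs[mask].ravel()
    # Solve the linear system [x, 1] @ [s, t]^T = y in the least-squares sense.
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * d_gen + t, s, t
```

In practice the valid mask would exclude pixels occluded or hallucinated by the image editing model, so the fit is driven by regions shared between the generated and observed scenes.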
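The contact footprint idea can likewise be sketched with a simple heuristic: the points of the placed object's cloud that lie within a small tolerance of the support surface define the contact region. This sketch assumes an axis-aligned support plane at a known height; the paper's actual estimator may differ.

```python
import numpy as np

def contact_footprint(obj_points, support_height, tol=0.01):
    """Axis-aligned 2D extent of the object points in contact with the plane z = support_height.

    obj_points: (N, 3) point cloud of the object in its placed pose.
    Returns (xy_min, xy_max) of the contact region, or None if no points touch the plane.
    """
    # Points within tol of the support plane are treated as contacts.
    contact = obj_points[np.abs(obj_points[:, 2] - support_height) < tol]
    if len(contact) == 0:
        return None
    xy = contact[:, :2]
    return xy.min(axis=0), xy.max(axis=0)
```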
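For the collision-aware object flow, one common formulation is gradient descent on a waypoint trajectory with a smoothness term and a hinge-style obstacle penalty. The following is a sketch under our own assumptions (point obstacles, fixed weights and step size), not the paper's exact optimizer.

```python
import numpy as np

def optimize_flow(start, goal, obstacles, radius=0.2, n_steps=20,
                  iters=200, lr=0.05, w_smooth=1.0, w_coll=5.0):
    """Optimize a waypoint trajectory from start to goal that stays clear of point obstacles."""
    # Initialize with straight-line interpolation between the endpoints.
    traj = np.linspace(start, goal, n_steps)
    for _ in range(iters):
        grad = np.zeros_like(traj)
        # Smoothness: gradient of 0.5 * sum ||x_{i+1} - x_i||^2 pulls each
        # interior waypoint toward the midpoint of its neighbors.
        grad[1:-1] += w_smooth * (2 * traj[1:-1] - traj[:-2] - traj[2:])
        # Collision: hinge penalty 0.5 * w_coll * max(radius - dist, 0)^2
        # pushes waypoints inside the safety radius away from each obstacle.
        for c in obstacles:
            d = traj - c
            dist = np.linalg.norm(d, axis=1, keepdims=True)
            pen = np.maximum(radius - dist, 0.0)
            grad += -w_coll * pen * d / np.maximum(dist, 1e-6)
        grad[0] = grad[-1] = 0.0  # keep the start and goal poses fixed
        traj -= lr * grad
    return traj
```

The full method optimizes an SE(3) object flow rather than just waypoint positions, but the structure is the same: a motion cost plus a collision cost, minimized with the endpoints pinned to the grasp and placement configurations.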

Video

Test Results

Place the purple object on the left side of the table.

Object Image: I_obj

Scene Image: I_scene

Visual Foresight: I_gen