AeroPlace-Flow: Language-Grounded Object Placement for Aerial Manipulators via Visual Foresight and Object Flow


AeroPlace-Flow Overview

Overview of AeroPlace-Flow. Given a natural language instruction and RGB-D observations of the object and placement scene, our method infers a collision-free object flow for aerial manipulation in three main steps. (1) Visual Foresight: a language-conditioned image editing model generates a goal image of the scene with the object placed according to the instruction. (2) Object Flow Extraction: the generated image is converted into a metrically consistent 3D scene, contact footprints are estimated, and the original object geometry is used to compute a collision-free object flow trajectory. (3) Placement Execution: the aerial manipulator tracks the inferred object flow to execute the placement. Bottom: hardware demonstrations of language-conditioned aerial placement tasks in diverse scenarios. *The cable connected to the drone only supplies power.

Abstract

Precise object placement remains underexplored in aerial manipulation, where most systems rely on predefined target coordinates and focus primarily on grasping and control. Specifying exact placement poses, however, is cumbersome in real-world settings, where users naturally communicate goals through language. In this work, we present AeroPlace-Flow, a training-free framework for language-grounded aerial object placement that unifies visual foresight with explicit 3D geometric reasoning and object flow. Given RGB-D observations of the object and the placement scene, along with a natural language instruction, AeroPlace-Flow first synthesizes a task-complete goal image using image editing models. The imagined configuration is then grounded into metric 3D space through depth alignment and object-centric reasoning, enabling the inference of a collision-aware object flow that transports the grasped object to a language- and contact-consistent placement configuration. The resulting motion is executed via standard trajectory tracking for an aerial manipulator. AeroPlace-Flow produces executable placement targets without requiring predefined poses or task-specific training. We validate our approach through extensive simulation and real-world experiments, demonstrating reliable language-conditioned placement across diverse aerial scenarios with an average success rate of 75% on hardware.

How AeroPlace-Flow Works

  1. Visual Foresight (Goal Image Generation)
    Given a natural language instruction and RGB-D observations of the object and placement scene, a language-conditioned image editing model generates a goal image showing the object placed according to the instruction. This image acts as a semantic hypothesis of the desired final configuration.
  2. Object Flow Inference
    The generated goal image is grounded into metric 3D space to infer a feasible motion trajectory for the object.
    • Metric 3D Scene Reconstruction: Recover a metrically consistent 3D scene from the generated image by aligning its estimated depth with the observed scene depth.
    • Contact Footprint Estimation: Identify the support region where the object should be placed by estimating the contact footprint between the generated object and the environment.
    • Object Geometry Alignment: Replace the generated object's geometry with the real object geometry to ensure physically consistent placement.
    • Collision-Aware Object Flow Optimization: Compute a trajectory that moves the object from the current gripper pose to the target placement configuration while enforcing collision avoidance and motion smoothness.
  3. Placement Execution
    The inferred object flow is executed by the aerial manipulator using trajectory tracking, transporting the grasped object to the predicted placement configuration.
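As a concrete illustration of the metric 3D reconstruction step, the depth alignment can be sketched as a least-squares fit of a global scale and shift that maps the monocular depth estimate of the generated image onto the observed metric scene depth. This is a minimal sketch; the function and variable names are ours, not the paper's implementation.

```python
import numpy as np

def align_depth(d_gen, d_obs, mask):
    """Fit scale s and shift t minimizing ||s * d_gen + t - d_obs||^2 over valid pixels.

    d_gen: relative depth estimated from the generated goal image.
    d_obs: observed metric depth of the scene.
    mask:  boolean map of pixels where both depths are valid.
    """
    x = d_gen[mask].ravel()
    y = d_obs[mask].ravel()
    # Solve the linear system [x, 1] @ [s, t]^T = y in the least-squares sense.
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * d_gen + t, s, t
```

In practice the valid mask would exclude pixels occluded or hallucinated by the image editing model, so the fit is driven by regions shared between the generated and observed scenes.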
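The contact footprint idea can likewise be sketched with a simple heuristic: the points of the placed object's cloud that lie within a small tolerance of the support surface define the contact region. This sketch assumes an axis-aligned support plane at a known height; the paper's actual estimator may differ.

```python
import numpy as np

def contact_footprint(obj_points, support_height, tol=0.01):
    """Axis-aligned 2D extent of the object points in contact with the plane z = support_height.

    obj_points: (N, 3) point cloud of the object in its placed pose.
    Returns (xy_min, xy_max) of the contact region, or None if no points touch the plane.
    """
    # Points within tol of the support plane are treated as contacts.
    contact = obj_points[np.abs(obj_points[:, 2] - support_height) < tol]
    if len(contact) == 0:
        return None
    xy = contact[:, :2]
    return xy.min(axis=0), xy.max(axis=0)
```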
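For the collision-aware object flow, one common formulation is gradient descent on a waypoint trajectory with a smoothness term and a hinge-style obstacle penalty. The following is a sketch under our own assumptions (point obstacles, fixed weights and step size), not the paper's exact optimizer.

```python
import numpy as np

def optimize_flow(start, goal, obstacles, radius=0.2, n_steps=20,
                  iters=200, lr=0.05, w_smooth=1.0, w_coll=5.0):
    """Optimize a waypoint trajectory from start to goal that stays clear of point obstacles."""
    # Initialize with straight-line interpolation between the endpoints.
    traj = np.linspace(start, goal, n_steps)
    for _ in range(iters):
        grad = np.zeros_like(traj)
        # Smoothness: gradient of 0.5 * sum ||x_{i+1} - x_i||^2 pulls each
        # interior waypoint toward the midpoint of its neighbors.
        grad[1:-1] += w_smooth * (2 * traj[1:-1] - traj[:-2] - traj[2:])
        # Collision: hinge penalty 0.5 * w_coll * max(radius - dist, 0)^2
        # pushes waypoints inside the safety radius away from each obstacle.
        for c in obstacles:
            d = traj - c
            dist = np.linalg.norm(d, axis=1, keepdims=True)
            pen = np.maximum(radius - dist, 0.0)
            grad += -w_coll * pen * d / np.maximum(dist, 1e-6)
        grad[0] = grad[-1] = 0.0  # keep the start and goal poses fixed
        traj -= lr * grad
    return traj
```

The full method optimizes an SE(3) object flow rather than just waypoint positions, but the structure is the same: a motion cost plus a collision cost, minimized with the endpoints pinned to the grasp and placement configurations.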

Video

Test Results

Place the purple object on the left side of the table.

Object Image: I_obj

Scene Image: I_scene

Visual Foresight: I_gen