Open-World Drone Active Tracking with Goal-Centered Rewards

South China University of Technology,

Institute for Super Robotics (Huangpu),

Pazhou Laboratory,

Peng Cheng Laboratory,

Sun Yat-sen University

NeurIPS 2025 (Poster)

Abstract

Drone Visual Active Tracking aims to autonomously follow a target object by controlling the motion system based on visual observations, providing a more practical solution for effective tracking in dynamic environments. However, accurate Drone Visual Active Tracking using reinforcement learning remains challenging due to the absence of a unified benchmark and the complexity of open-world environments with frequent interference. To address these issues, we pioneer a systematic solution. First, we propose DAT, the first open-world drone active air-to-ground tracking benchmark. It encompasses 24 city-scale scenes, featuring targets with human-like behaviors and high-fidelity dynamics simulation. DAT also provides a digital twin tool for unlimited scene generation. Additionally, we propose a novel reinforcement learning method called GC-VAT, which aims to improve drone target-tracking performance in complex scenarios. Specifically, we design a Goal-Centered Reward to provide precise feedback across viewpoints to the agent, enabling it to expand its perception and movement range through unrestricted perspectives. Inspired by curriculum learning, we introduce a Curriculum-Based Training strategy that progressively enhances tracking performance in complex environments. Experiments in the simulator and on real-world images demonstrate the superior performance of GC-VAT, achieving an approximately 400% improvement over SOTA methods in terms of the cumulative reward metric.
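To make the Goal-Centered Reward idea concrete, below is a minimal illustrative sketch, not the paper's exact formulation: the reward depends only on the target's distance from a goal point (e.g., the image or field-of-view center), so it gives consistent feedback regardless of the drone's viewpoint. The function name, arguments, and the linear decay are all assumptions for illustration.

```python
import math

def goal_centered_reward(target_xy, goal_xy, max_dist):
    """Illustrative goal-centered reward (an assumption, not the paper's
    exact formula): 1.0 when the target sits exactly at the goal point,
    decaying linearly to 0.0 once the target is max_dist away or farther.
    Because it uses only the target-to-goal distance, the same reward
    applies across arbitrary drone viewpoints."""
    dist = math.hypot(target_xy[0] - goal_xy[0], target_xy[1] - goal_xy[1])
    return max(0.0, 1.0 - dist / max_dist)
```

For example, a target at the goal point yields a reward of 1.0, a target halfway to `max_dist` yields 0.5, and a target at or beyond `max_dist` yields 0.0.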

Benchmark

  • 24 Scenarios

  • 24 Tracking Targets

  • Rich Perceptual Information

  • Complex Scenario Design

  • Realistic Traffic Management

  • Realistic Dynamics Simulation

  • 01 CityStreet

  • 02 Village

  • 03 Lake

  • 04 Downtown

  • 05 Farmland

  • 06 Desert

  • MOTORBIKE

  • SCOOTER

  • TRAILER

  • TRUCK

  • BUS

  • TESLAModel3

  • LINKOLNMKZ

  • RANGEROVER

  • BENZSprinter

  • TOYOTAPrius

  • BMWX5

  • CITROENCZero

  • PEDESTRIAN

  • SHRIMP

  • CREATE

  • SOJOURNER

  • MANTIS

  • BB_8

  • AIBOERS7

  • BIOLOIDDOG

  • FIREBIRD6

  • SCOUT

  • GHOSTDOG2

  • HOAP2

Performance

We conduct within-scene, cross-scene, and cross-domain tests. In cross-scene testing, the agent trained under daytime conditions in one environment is tested in different scenarios under the same weather. In cross-domain testing, it is evaluated in the same scene but under varying weather conditions.
1. Within-scene and cross-scene
Our GC-VAT performs significantly better than other methods, as shown in Tab. 1. In within-scene testing, the average CR improvement on six maps relative to the D-VAT method is 591% (35 → 242). Regarding the TSR, the average enhancement is 279% (0.19 → 0.72). In cross-scene testing, GC-VAT achieves a 376% (37 → 176) improvement in average CR and a 200% (0.19 → 0.57) improvement in average TSR compared to D-VAT.

2. Cross-domain
As shown in Tab. 2, our method significantly outperforms existing methods in cross-domain generalization. Specifically, GC-VAT demonstrates an average CR enhancement of 509% (35 → 213) relative to D-VAT and a TSR boost of 253% (0.19 → 0.67).
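The percentage gains above all follow the same relative-improvement convention, (ours − baseline) / baseline. A small helper (hypothetical, for illustration only) reproduces the reported numbers from the CR and TSR values:

```python
def relative_improvement(baseline, ours):
    """Percentage improvement over a baseline, rounded to the nearest
    integer: 100 * (ours - baseline) / baseline."""
    return round(100 * (ours - baseline) / baseline)

# Reproducing the reported within-scene CR gain over D-VAT:
relative_improvement(35, 242)  # → 591
```

The same call recovers the other figures, e.g. `relative_improvement(0.19, 0.72)` gives 279 for the within-scene TSR gain.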

Acknowledgements

Thanks to Dongqing Zhao, Xilang Zeng, and Li Wei for their support and help with this project.
