Perceptual uncertainty is a central challenge for heterogeneous robot teams operating in unstructured outdoor environments, where no single viewpoint affords reliable scene understanding. Perceptual uncertainty, arising from sources such as occlusions, manifests differently across robot viewpoints depending on scene structure. Detecting and resolving sources of perceptual uncertainty requires both scene-based contextual reasoning and capability-aware robot allocation. While vision-language models provide strong semantic priors for both, they are computationally prohibitive for onboard inference and lack calibrated uncertainty quantification.
We introduce Co-GLANCE, a real-time onboard perception and decision-making system for uncertainty resolution in heterogeneous robot teams. Co-GLANCE distills the semantic reasoning capabilities of a vision-language model into an end-to-end model for occlusion segmentation and robot allocation, eliminating the need for cloud-based inference. To quantify perceptual uncertainty, Co-GLANCE combines conformal prediction with selective abstention to provide statistically valid coverage guarantees for segmentation, robot allocation, and detection outputs. These calibrated uncertainty estimates directly trigger active perception, dispatching the most appropriate robot to acquire informative viewpoints and resolve uncertainty.
Across real-world scenarios, Co-GLANCE outperforms cloud-based vision-language model baselines in occlusion segmentation and robot allocation accuracy by 25% and 36%, respectively, while reducing per-frame inference latency by 350×. We also release an air-ground dataset for future research. Code, videos, and the dataset are available at: co-glance.github.io.
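To make the idea of conformal prediction with selective abstention concrete, here is a minimal sketch of generic split conformal prediction for a classification output such as robot allocation. It is not the Co-GLANCE implementation: the function names `calibrate_threshold` and `predict_or_abstain`, the nonconformity score (one minus the true-class probability), and the rule of abstaining unless exactly one class remains are all assumptions chosen for illustration.

```python
import math
import numpy as np


def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """Return the split-conformal score threshold for miscoverage level alpha."""
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    n = len(cal_labels)
    # Nonconformity score: 1 minus the probability given to the true class.
    scores = np.sort(1.0 - cal_probs[np.arange(n), cal_labels])
    # Finite-sample corrected rank: the ceil((n + 1)(1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return float("inf")  # too few calibration samples: keep every class
    return float(scores[k - 1])


def predict_or_abstain(probs, qhat):
    """Return (prediction_set, abstain) for one vector of class probabilities."""
    probs = np.asarray(probs, dtype=float)
    # Keep every class whose nonconformity score is within the threshold.
    pred_set = np.flatnonzero(1.0 - probs <= qhat)
    # Hypothetical abstention rule: commit only when exactly one class survives.
    # An abstention is where an active-perception request could be raised.
    abstain = len(pred_set) != 1
    return pred_set, abstain
```

Under the usual exchangeability assumption between calibration and test data, the prediction set contains the true class with probability at least `1 - alpha`. An ambiguous or empty set is treated here as a signal that another viewpoint is needed.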
An aerial robot and a ground robot coordinate through communication for active perception.
System overview: (1) perceptual uncertainty detection, (2) occlusion uncertainty, (3) resolution of high-uncertainty areas, (4) object detection, (5) detection uncertainty, and (6) uncertainty-driven active perception.
Perceptual uncertainty detection: (1) occlusion segmentation and robot allocation by VLM with self-review, (2) knowledge distillation, and (3) onboard inference using the distilled model.
Real air-ground data is costly to collect, requiring two robots operating outdoors simultaneously with synchronized sensing and metric localization across platforms. The Co-GLANCE dataset provides more than 4,000 synchronized aerial and ground RGB frames across semi-structured outdoor scenarios, recorded with a DJI Matrice 600 and a Boston Dynamics Spot.
Depending on the scenario, available streams include RGB, estimated depth, RTK GPS, and IMU data. Raw ROS 2 bags from both platforms are also released to support evaluation of perception and autonomy stacks beyond static image benchmarks.
| Scenario | Run | Frame Pairs |
|---|---|---|
| Construction | 1 | 118 |
| Construction | 2 | 326 |
| Construction | 3 | 280 |
| Construction | 4 | 485 |
| Construction | Total | 1,209 |
| Camouflage | 1 | 186 |
| Camouflage | 2 | 545 |
| Camouflage | 3 | 131 |
| Camouflage | Total | 862 |
The construction scenario contains four runs and 1,209 annotated frame pairs. The camouflage scenario contains three runs and 862 annotated frame pairs, with two camouflage-wearing individuals moving through visually occluded areas.
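As a quick consistency check, the per-run counts in the table can be tallied as below. The dictionary layout and variable names are illustrative only, and the frame total assumes that each frame pair contributes one aerial and one ground frame, which the text does not state explicitly.

```python
# Frame pairs per run, copied from the table above.
frame_pairs = {
    "Construction": [118, 326, 280, 485],
    "Camouflage": [186, 545, 131],
}

# Per-scenario totals and the overall number of annotated frame pairs.
totals = {name: sum(runs) for name, runs in frame_pairs.items()}
total_pairs = sum(totals.values())

# Assumption: one aerial frame plus one ground frame per pair.
total_frames = 2 * total_pairs

print(totals, total_pairs, total_frames)
```

The per-scenario sums match the table's totals of 1,209 and 862. Under the two-frames-per-pair assumption, the 2,071 annotated pairs correspond to 4,142 frames, in line with the "more than 4,000" figure quoted earlier.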
Qualitative examples showing Co-GLANCE compared with baselines across both real-world scenarios.
@inproceedings{co-glance,
title={Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teaming},
author={},
year={2026},
booktitle={},
}