WalkOCC | Vision and Autonomy Intelligence Lab

Visualization of Cross-Embodiment Inference

Coco Delivery Robot. A wheeled robot with a front-facing fisheye camera, approximately 40 cm tall, primarily used for last-mile food and parcel delivery on sidewalks.

Diverse Test Set Inference Visualization

Our proposed Sidewalk3D dataset captures diverse appearances across regions and times of day (daytime and nighttime), providing a challenging benchmark for urban sidewalk occupancy prediction.

Model Output Visualization

WalkOCC predicts not only 3D occupancy but also 2D depth and semantic segmentation. In the video, the first row shows pseudo-labels used for supervision, and the second row shows the model's inference results.

Automatic Pseudo-Label Generation

Pseudo-Label Generation. With pre-calibrated and time-synchronized sensors, we project 3D LiDAR points onto 2D images to inherit per-point semantic labels. We then generate dense occupancy pseudo-labels using the SurroundOcc
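The projection step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes a simple pinhole camera model (a fisheye camera would need its own distortion model), and the function name `inherit_labels`, its parameters, and the `ignore_label` value are hypothetical.

```python
import numpy as np

def inherit_labels(points_lidar, T_cam_lidar, K, seg_map, ignore_label=255):
    """Give each LiDAR point the semantic label of the pixel it projects to.

    points_lidar: (N, 3) points in the LiDAR frame.
    T_cam_lidar:  (4, 4) extrinsic transform, LiDAR frame -> camera frame.
    K:            (3, 3) pinhole intrinsics (assumption for this sketch).
    seg_map:      (H, W) integer 2D semantic segmentation of the image.
    Points behind the camera or outside the image keep `ignore_label`.
    """
    n = points_lidar.shape[0]
    # Homogeneous coordinates, then move the points into the camera frame.
    homog = np.hstack([points_lidar, np.ones((n, 1))])
    pts_cam = (T_cam_lidar @ homog.T).T[:, :3]

    labels = np.full(n, ignore_label, dtype=seg_map.dtype)

    # Only points with positive depth can be seen by the camera.
    idx = np.nonzero(pts_cam[:, 2] > 1e-6)[0]

    # Perspective projection to pixel coordinates.
    uvw = (K @ pts_cam[idx].T).T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)

    # Keep projections that land inside the image bounds.
    h, w = seg_map.shape
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    labels[idx[inside]] = seg_map[v[inside], u[inside]]
    return labels
```

The labeled points could then be accumulated across frames before densification. That step is not shown here because the source does not describe it in detail.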

High-Quality Manual Annotations for the Test Set

Refined LiDAR ground-truth examples. We visualize manually annotated global point clouds from three representative scenarios: tourist area (day), tourist area (night), and commercial district.

Long-Horizon Inference Visualization

Long-horizon demo on a wheeled-legged robot dog. The robot runs along a sidewalk in a residential area in Los Angeles.

WalkOCC Model Architecture

We present WalkOCC, a hybrid, ray-marching-based occupancy-learning framework for sidewalk occupancy prediction from a monocular RGB camera. Our approach consists of two key components: (i) a depth-aware lifting architecture that transforms front-view images into 3D semantic occupancy grids, and (ii) a hybrid training strategy that leverages both 2D and 3D supervision via a ray-marching-based 2D-3D consistency loss. Enforcing this consistency enables effective learning from large-scale 2D-only data while preserving geometric accuracy, which in turn improves prediction quality and cross-domain generalization.
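To make the idea of a ray-marching-based 2D-3D consistency loss concrete, here is a minimal NumPy sketch. The page does not give the exact formulation, so this uses a common volume-rendering-style accumulation as a stand-in: occupancy sampled along each camera ray is composited into an expected depth and a class distribution, which are then compared against 2D depth and segmentation labels. The function names, the L1 depth term, and the cross-entropy term are assumptions, and sampling the 3D grid along rays is assumed to happen upstream.

```python
import numpy as np

def render_along_rays(occ, sem_probs, depths):
    """Composite per-sample predictions along rays (near-to-far order).

    occ:       (R, S) occupancy probabilities in [0, 1] for R rays, S samples.
    sem_probs: (R, S, C) per-sample class probabilities.
    depths:    (S,) distance of each sample from the camera.
    """
    free = 1.0 - occ
    # Transmittance: probability that the ray reached sample i unobstructed.
    ones = np.ones((occ.shape[0], 1))
    trans = np.cumprod(np.concatenate([ones, free[:, :-1]], axis=1), axis=1)
    weights = trans * occ  # probability that the ray stops at sample i
    depth = (weights * depths[None, :]).sum(axis=1)
    sem = (weights[..., None] * sem_probs).sum(axis=1)
    return depth, sem

def consistency_loss(occ, sem_probs, depths, depth_2d, label_2d, eps=1e-8):
    """Hypothetical 2D-3D consistency: L1 on depth plus cross-entropy on class."""
    depth, sem = render_along_rays(occ, sem_probs, depths)
    depth_loss = np.abs(depth - depth_2d).mean()
    # Renormalize rendered class scores so each ray sums to (about) one.
    sem = sem / (sem.sum(axis=1, keepdims=True) + eps)
    picked = sem[np.arange(len(label_2d)), label_2d]
    sem_loss = -np.log(picked + eps).mean()
    return depth_loss + sem_loss
```

Because both terms need only per-pixel depth and segmentation targets, a loss of this shape can be evaluated on images that have no 3D annotation, which is the property the hybrid training strategy relies on.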

Dataset Distribution

Data distribution and representative scenes from Sidewalk3D. Our dataset spans diverse domains, geographic regions, and illumination conditions (day and night).

Sidewalk Occupancy Prediction via 2D-3D Consistency Learning with Pseudo Labels

Yukai Ma¹,², Jeo Lin¹,³, Liu Liu¹,⁴, Honglin He¹, Lulu Ricketts³, Brad Squicciarini³, Yong Liu², Bolei Zhou¹

¹ University of California, Los Angeles, ² Zhejiang University, ³ Coco Robotics, ⁴ Massachusetts Institute of Technology

Code | Paper