Joint Optimization for 4D Human-Scene Reconstruction in the Wild

ICLR 2026

Zhizheng Liu , Joe Lin , Wayne Wu , Bolei Zhou
University of California, Los Angeles

Overview

We propose JOSH (Joint Optimization of Scene Geometry and Human Motion), a method for 4D human-scene reconstruction in the wild. Given a web video captured by a single camera, JOSH jointly optimizes the global human motion, the surrounding environment, and the camera poses so that the human-scene interaction stays coherent. It initializes from local scene reconstruction and human mesh recovery, then jointly optimizes motion and scene under human-scene contact constraints. With this joint optimization, JOSH achieves state-of-the-art performance on both global human motion estimation and metric-scale scene reconstruction.
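To give a feel for why contact constraints can couple the human and the scene, here is a minimal toy sketch. It is not the JOSH objective. It assumes that the scene points at the contact locations are known only up to a global scale `s`, that the human contact points are metric but offset by an unknown translation `t`, and that contact means the two sets coincide. The function name `fit_scale_and_translation` and this whole setup are hypothetical, chosen only for illustration.

```python
import numpy as np


def fit_scale_and_translation(scene_pts, human_pts):
    """Toy contact fit (illustrative assumption, not the paper's loss).

    Solves, in the least-squares sense, for a scene scale s and a human
    translation t such that  s * scene_pts[i] == human_pts[i] + t.
    Both inputs are (n, 3) arrays of corresponding contact points.
    """
    scene_pts = np.asarray(scene_pts, dtype=float)
    human_pts = np.asarray(human_pts, dtype=float)
    n = scene_pts.shape[0]

    # Unknown vector x = [s, tx, ty, tz].
    # Each coordinate of each point gives one row: s * P_ik - t_k = h_ik.
    A = np.zeros((3 * n, 4))
    A[:, 0] = scene_pts.reshape(-1)
    A[:, 1:] = -np.tile(np.eye(3), (n, 1))
    b = human_pts.reshape(-1)

    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[0], x[1:]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scene = rng.normal(size=(6, 3))          # up-to-scale scene contact points
    s_true, t_true = 2.5, np.array([0.1, -0.2, 3.0])
    human = s_true * scene - t_true          # metric human contact points
    s_est, t_est = fit_scale_and_translation(scene, human)
    print("estimated scale:", s_est)
    print("estimated translation:", t_est)
```

In this toy setting the metric size of the human fixes the scale of the scene, and the scene in turn fixes where the human stands. That is the intuition behind optimizing both together rather than one after the other.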

We further design an end-to-end model, JOSH3R, that directly predicts the relative human transformation between two frames, enabling real-time inference at some cost in estimation accuracy.
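As a rough illustration of how per-pair relative transformations could be turned into a global trajectory, the sketch below chains 4x4 homogeneous transforms. The convention used here (global pose at frame k+1 equals global pose at frame k right-multiplied by the relative transform) and the helper name `accumulate_poses` are assumptions for this example. The page does not specify JOSH3R's output parameterization.

```python
import numpy as np


def accumulate_poses(relative_transforms, initial_pose=None):
    """Chain relative 4x4 transforms into global poses.

    Assumed convention (hypothetical): G[k+1] = G[k] @ T_rel[k].
    Returns an array of shape (len(relative_transforms) + 1, 4, 4).
    """
    pose = np.eye(4) if initial_pose is None else np.asarray(initial_pose, dtype=float)
    poses = [pose]
    for rel in relative_transforms:
        pose = pose @ np.asarray(rel, dtype=float)
        poses.append(pose)
    return np.stack(poses)


def make_transform(yaw_rad, translation):
    """Build a 4x4 transform from a rotation about z and a translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    T[:3, 3] = translation
    return T


if __name__ == "__main__":
    # Two identical steps: move 1 unit forward along x, then turn 90 degrees.
    step = make_transform(np.pi / 2, [1.0, 0.0, 0.0])
    trajectory = accumulate_poses([step, step])
    print(trajectory[:, :3, 3])  # global positions at each frame
```

Because each pose is built from the previous one, errors in the relative predictions accumulate over time, which is one plausible reason a pairwise model would trade some accuracy for speed.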


Results on Datasets

JOSH surpasses existing methods on both global human motion estimation and metric-scale scene reconstruction by a large margin, and shows strong potential for scalable training of end-to-end models on large collections of web videos.

Evaluation on Global Human Motion Estimation with the EMDB Dataset

Evaluation on Global Camera Trajectory Estimation with the SLOPER4D Dataset

Evaluation on 4D Human-Scene Reconstruction with the RICH Dataset

Interactive Demo on Web Video

Relevant Work

Zhizheng Liu, Joe Lin, Wayne Wu, Bolei Zhou. Learning to Generate Diverse Pedestrian Movements from Web Videos with Noisy Labels. ICLR 2025.
Comment: This work proposes a dataset and a model for context-aware pedestrian movement generation, built on pseudo-labels from web videos. JOSH can be used to extract higher-quality human and scene labels for pedestrian movement generation.

Reference

@article{liu2026joint,
    title={Joint Optimization for 4D Human-Scene Reconstruction in the Wild},
    author={Liu, Zhizheng and Lin, Joe and Wu, Wayne and Zhou, Bolei},
    journal={The Fourteenth International Conference on Learning Representations},
    year={2026}
}