Joint Optimization for 4D Human-Scene Reconstruction in the Wild
ICLR 2026
Zhizheng Liu, Joe Lin, Wayne Wu, Bolei Zhou
University of California, Los Angeles
Overview
We propose JOSH (Joint Optimization of Scene Geometry and Human Motion), a novel method for 4D human-scene reconstruction in the wild. Given a web video captured from a single camera, JOSH jointly optimizes the global human motion, the surrounding environment, and the camera poses so that the human-scene interaction is coherent. JOSH uses local scene reconstruction and human mesh recovery as initialization and then jointly optimizes motion and scene under human-scene contact constraints. With this joint optimization, JOSH achieves state-of-the-art performance in both global human motion estimation and metric-scale scene reconstruction.
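As a toy illustration of how a contact constraint can tie scene geometry to human scale, the sketch below assumes the scene points at contact locations are known only up to one global scale, while the corresponding human contact points are already metric, and solves for that scale in closed form. This is not JOSH's actual objective, which also optimizes camera poses and human motion; the function names `estimate_scale_from_contacts` and `mean_contact_distance` and the sample points are hypothetical.

```python
import math


def estimate_scale_from_contacts(scene_pts, human_pts):
    """Least-squares scale s minimizing sum_i ||s * p_i - h_i||^2.

    scene_pts: up-to-scale scene points at contact locations (assumed input).
    human_pts: metric human contact points in the same frame (assumed input).
    """
    if not scene_pts or len(scene_pts) != len(human_pts):
        raise ValueError("need two non-empty point lists of equal length")
    # Setting the derivative to zero gives s = sum(p . h) / sum(p . p).
    num = sum(p[k] * h[k] for p, h in zip(scene_pts, human_pts) for k in range(3))
    den = sum(p[k] ** 2 for p in scene_pts for k in range(3))
    if den == 0.0:
        raise ValueError("scene points are degenerate (all at the origin)")
    return num / den


def mean_contact_distance(scale, scene_pts, human_pts):
    """Mean distance between scaled scene points and human contact points."""
    total = 0.0
    for p, h in zip(scene_pts, human_pts):
        total += math.dist([scale * c for c in p], h)
    return total / len(scene_pts)


# Hypothetical data: the human contacts are the scene points scaled by 2.5.
scene_pts = [(0.1, 0.5, 1.0), (-0.1, 0.5, 1.1)]
human_pts = [(0.25, 1.25, 2.5), (-0.25, 1.25, 2.75)]

scale = estimate_scale_from_contacts(scene_pts, human_pts)
print("estimated scale:", scale)
print("contact distance before:", mean_contact_distance(1.0, scene_pts, human_pts))
print("contact distance after:", mean_contact_distance(scale, scene_pts, human_pts))
```

In this simplified setting the contact residual drops to numerical zero once the scale is recovered. A joint optimization would instead minimize such a residual together with other terms rather than solving for scale alone.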
We further design an end-to-end model, JOSH3R, to predict the relative human transformation directly between two frames, allowing real-time inference at some cost in estimation accuracy.
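The text above does not say how pairwise predictions are turned into a trajectory. One common approach is to chain them, sketched below under the assumption that each prediction is a 4x4 rigid transform giving the human's pose at frame t expressed in the coordinates of frame t-1. The helpers `make_transform`, `matmul4`, `accumulate_poses`, and `position` are hypothetical and are not part of JOSH3R.

```python
import math


def make_transform(yaw, translation):
    """Build a 4x4 rigid transform: rotation by `yaw` about z, then translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul4(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def accumulate_poses(relative_transforms):
    """Chain frame-to-frame transforms into poses relative to the first frame.

    Assumed convention: pose_t = pose_{t-1} @ rel_t.
    """
    pose = make_transform(0.0, (0.0, 0.0, 0.0))  # identity at frame 0
    poses = [pose]
    for rel in relative_transforms:
        pose = matmul4(pose, rel)
        poses.append(pose)
    return poses


def position(pose):
    """Extract the translation part of a 4x4 pose."""
    return (pose[0][3], pose[1][3], pose[2][3])


# Hypothetical predictions: step 1 m forward while turning 90 degrees,
# then step 1 m forward again in the new heading.
rels = [
    make_transform(math.pi / 2, (1.0, 0.0, 0.0)),
    make_transform(0.0, (1.0, 0.0, 0.0)),
]
trajectory = [position(p) for p in accumulate_poses(rels)]
print(trajectory)
```

Because each pose is a product of all earlier predictions, per-pair errors accumulate along the trajectory, which is one reason a chained estimate can trade accuracy for speed.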
Results on Datasets
JOSH surpasses existing methods on both global human motion estimation and metric-scale scene reconstruction by a large margin, and has high potential for scalable training of end-to-end models using extensive web videos.
Evaluation on Global Human Motion Estimation with the EMDB Dataset
Evaluation on Global Camera Trajectory Estimation with the SLOPER4D Dataset
Evaluation on 4D Human-Scene Reconstruction with the RICH Dataset
Interactive Demo on Web Video
Relevant Work

Reference
@inproceedings{liu2026joint,
  title={Joint Optimization for 4D Human-Scene Reconstruction in the Wild},
  author={Liu, Zhizheng and Lin, Joe and Wu, Wayne and Zhou, Bolei},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}
Comment: This work proposes a dataset and a model for context-aware pedestrian movement generation from pseudo-labels of web videos. JOSH can be used to extract higher-quality human and scene labels for pedestrian movement generation.