<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>Repository Collection: null</title>
    <link>https://scholar.dgist.ac.kr/handle/20.500.11750/12136</link>
    <description />
    <pubDate>Sat, 04 Apr 2026 19:03:49 GMT</pubDate>
    <dc:date>2026-04-04T19:03:49Z</dc:date>
    <item>
      <title>Scale-Invariant and View-Relational Representation Learning for Full Surround Monocular Depth</title>
      <link>https://scholar.dgist.ac.kr/handle/20.500.11750/60002</link>
      <description>Title: Scale-Invariant and View-Relational Representation Learning for Full Surround Monocular Depth
Author(s): Hwang, Kyumin; Choi, Wonhyeok; Han, Kiljoon; Choi, Wonjoon; Choi, Minwoo; Na, Yongcheon; Park, Minwoo; Im, Sunghoon
Abstract: Recent foundation models demonstrate strong generalization capabilities in monocular depth estimation. However, directly applying these models to Full Surround Monocular Depth Estimation (FSMDE) presents two major challenges: (1) high computational cost, which limits real-time performance, and (2) difficulty in estimating metric-scale depth, as these models are typically trained to predict only relative depth. To address these limitations, we propose a novel knowledge distillation strategy that transfers robust depth knowledge from a foundation model to a lightweight FSMDE network. Our approach leverages a hybrid regression framework that combines a knowledge distillation scheme, traditionally used in classification, with a depth binning module to enhance scale consistency. Specifically, we introduce a cross-interaction knowledge distillation scheme that distills the scale-invariant depth bin probabilities of a foundation model into the student network while guiding it to infer metric-scale depth bin centers from ground-truth depth. Furthermore, we propose view-relational knowledge distillation, which encodes structural relationships among adjacent camera views and transfers them to enhance cross-view depth consistency. Experiments on DDAD and nuScenes demonstrate the effectiveness of our method compared to conventional supervised methods and existing knowledge distillation approaches. Moreover, our method achieves a favorable trade-off between performance and efficiency, meeting real-time requirements. © 2016 IEEE.</description>
      <pubDate>Wed, 31 Dec 2025 15:00:00 GMT</pubDate>
      <guid isPermaLink="false">https://scholar.dgist.ac.kr/handle/20.500.11750/60002</guid>
      <dc:date>2025-12-31T15:00:00Z</dc:date>
    </item>
    <item>
      <title>Flow4D: Leveraging 4D Voxel Network for LiDAR Scene Flow Estimation</title>
      <link>https://scholar.dgist.ac.kr/handle/20.500.11750/58125</link>
      <description>Title: Flow4D: Leveraging 4D Voxel Network for LiDAR Scene Flow Estimation
Author(s): Kim, Jaeyeul; Woo, Jungwan; Shin, Ukcheol; Oh, Jean; Im, Sunghoon
Abstract: Understanding the motion states of the surrounding environment is critical for safe autonomous driving. These motion states can be accurately derived from scene flow, which captures the three-dimensional motion field of points. Existing LiDAR scene flow methods extract spatial features from each point cloud and then fuse them channel-wise, resulting in only implicit extraction of spatio-temporal features. Furthermore, they use a 2D Bird's Eye View and process only two frames, missing crucial spatial information along the Z-axis and the broader temporal context, which leads to suboptimal performance. To address these limitations, we propose Flow4D, which temporally fuses multiple point clouds after the 3D intra-voxel feature encoder, enabling more explicit extraction of spatio-temporal features through a 4D voxel network. However, while 4D convolution improves performance, it significantly increases the computational load. For further efficiency, we introduce the Spatio-Temporal Decomposition Block (STDB), which combines 3D and 1D convolutions instead of heavy 4D convolution. In addition, Flow4D further improves performance by using five frames to take advantage of richer temporal information. As a result, the proposed method achieves 45.9% higher performance than the state-of-the-art while running in real time, and won 1st place in the 2024 Argoverse 2 Scene Flow Challenge. © IEEE.</description>
      <pubDate>Mon, 31 Mar 2025 15:00:00 GMT</pubDate>
      <guid isPermaLink="false">https://scholar.dgist.ac.kr/handle/20.500.11750/58125</guid>
      <dc:date>2025-03-31T15:00:00Z</dc:date>
    </item>
    <item>
      <title>Semi-Supervised Domain Adaptation for LiDAR 3D Object Detection Using Self-Training and Knowledge Distillation</title>
      <link>https://scholar.dgist.ac.kr/handle/20.500.11750/47686</link>
      <description>Title: Semi-Supervised Domain Adaptation for LiDAR 3D Object Detection Using Self-Training and Knowledge Distillation
Author(s): Woo, Jungwan; Kim, Jaeyeul; Im, Sunghoon
Abstract: With the release of numerous open driving datasets, the demand for domain adaptation in perception tasks has increased, particularly when transferring knowledge from rich datasets to novel domains. However, it is difficult to handle shifts 1) in the sensor domain, caused by heterogeneous LiDAR sensors, and 2) in the environmental domain, caused by differing environmental factors. We overcome these domain differences in the semi-supervised setting with three-stage model parameter training. First, we pre-train the model on the source dataset with object scaling based on statistics of the object size. Then we fine-tune the partially frozen model weights with copy-and-paste augmentation, in which the 3D points inside box labels are copied from one scene and pasted into other scenes. Finally, we use a knowledge distillation method that updates the student network with a moving average of the teacher network, along with a self-training method based on pseudo labels. Test-time augmentation with varying z values is employed to predict the final results. Our method achieved 3rd place in the ECCV 2022 workshop challenge on 3D Perception for Autonomous Driving.</description>
      <pubDate>Mon, 31 Jul 2023 15:00:00 GMT</pubDate>
      <guid isPermaLink="false">https://scholar.dgist.ac.kr/handle/20.500.11750/47686</guid>
      <dc:date>2023-07-31T15:00:00Z</dc:date>
    </item>
    <item>
      <title>A Study on the Generality of Neural Network Structures for Monocular Depth Estimation</title>
      <link>https://scholar.dgist.ac.kr/handle/20.500.11750/47602</link>
      <description>Title: A Study on the Generality of Neural Network Structures for Monocular Depth Estimation
Author(s): Bae, Jinwoo; Hwang, Kyumin; Im, Sunghoon
Abstract: Monocular depth estimation has been widely studied, and significant improvements in performance have recently been reported. However, most previous works are evaluated on a few benchmark datasets, such as the KITTI datasets, and none provide an in-depth analysis of the generalization performance of monocular depth estimation. In this paper, we deeply investigate various backbone networks (e.g., CNN and Transformer models) with respect to the generalization of monocular depth estimation. First, we evaluate state-of-the-art models on both in-distribution and out-of-distribution datasets, the latter never seen during network training. Then, we investigate the internal properties of the representations from the intermediate layers of CNN- and Transformer-based models using synthetic texture-shifted datasets. Through extensive experiments, we observe that Transformers exhibit a strong shape bias, whereas CNNs have a strong texture bias. We also find that texture-biased models exhibit worse generalization performance for monocular depth estimation than shape-biased models. We demonstrate that similar trends are observed in real-world driving datasets captured under diverse environments. Lastly, we conduct a dense ablation study with the various backbone networks utilized in modern strategies. The experiments demonstrate that the intrinsic locality of CNNs and the self-attention of Transformers induce texture bias and shape bias, respectively.</description>
      <pubDate>Sun, 31 Mar 2024 15:00:00 GMT</pubDate>
      <guid isPermaLink="false">https://scholar.dgist.ac.kr/handle/20.500.11750/47602</guid>
      <dc:date>2024-03-31T15:00:00Z</dc:date>
    </item>
  </channel>
</rss>

