Deep Polycuboid Fitting for Compact 3D Representation of Indoor Scenes

POSTECH
3DV 2025

The left video shows a point cloud of a real-world space, manually scanned with an iPhone, which serves as the input to our method. The right video shows our method reconstructing the same space by fitting polycuboids and applying Blender's texture baking feature.

Abstract

This paper presents a novel framework for compactly representing a 3D indoor scene using a set of polycuboids through a deep learning-based fitting method. Indoor scenes mainly consist of man-made objects, such as furniture, which often exhibit rectilinear geometry. This property allows indoor scenes to be represented using combinations of polycuboids, providing a compact representation that benefits downstream applications like furniture rearrangement. Our framework takes a noisy point cloud as input and first detects six types of cuboid faces using a transformer network. Then, a graph neural network is used to validate the spatial relationships of the detected faces to form potential polycuboids. Finally, each polycuboid instance is reconstructed by forming a set of boxes based on the aggregated face labels. To train our networks, we introduce a synthetic dataset encompassing a diverse range of cuboid and polycuboid shapes that reflect the characteristics of indoor scenes. Our framework generalizes well to real-world indoor scene datasets, including Replica, ScanNet, and scenes captured with an iPhone. The versatility of our method is demonstrated through practical applications, such as virtual room tours and scene editing.

Key Idea Illustration

Key Idea

  • A polycuboid is composed of multiple cuboids. Thus, we infer cuboid face types and their spatial relationships from a point cloud to form polycuboid instances.
  • We use a Transformer network to detect faces in the input point cloud, classifying each point into one of six cuboid face types.
  • Detected faces are aggregated into individual polycuboid instances by inferring their spatial relationships using a Graph Neural Network (GNN).
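
To make the face-typing idea concrete, here is a minimal non-learned sketch. It assumes (purely for illustration) that the six cuboid face types correspond to the six axis-aligned outward normal directions, and classifies each point by its estimated normal; in the paper, a Transformer learns this per-point labeling directly from the noisy point cloud.

```python
import numpy as np

# Assumed label convention for illustration: the six cuboid face types
# are the axis-aligned outward normals +x, -x, +y, -y, +z, -z.
FACE_DIRS = np.array([
    [ 1, 0, 0], [-1, 0, 0],   # labels 0, 1: +x, -x
    [ 0, 1, 0], [ 0,-1, 0],   # labels 2, 3: +y, -y
    [ 0, 0, 1], [ 0, 0,-1],   # labels 4, 5: +z, -z
], dtype=float)

def classify_face_points(normals: np.ndarray) -> np.ndarray:
    """Assign each per-point normal to the closest of the six canonical
    face directions (a geometric stand-in for the learned classifier)."""
    scores = normals @ FACE_DIRS.T   # (N, 6) cosine-like similarities
    return scores.argmax(axis=1)     # (N,) face labels in 0..5
```

For example, a point whose normal is close to the world up-axis, `[0.1, 0.05, 0.99]`, is assigned the `+z` face label. The learned network replaces this hard geometric rule and is robust to the scan noise that makes normal estimation unreliable in practice.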

Framework


Given a noisy point cloud of an indoor scene, we first detect polycuboid faces by classifying each 3D point into one of the six cuboid face labels using a Transformer network, while simultaneously predicting the face center offsets. Detected faces are then grouped into polycuboid instances based on the spatial relationships inferred by a Graph Convolutional Network (GCN). Finally, each polycuboid instance is reconstructed into a compact rectilinear mesh by assembling cuboids consistent with the aggregated face information.
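
The face-grouping step above can be sketched as a graph problem. In the paper, a GCN predicts which pairs of detected faces belong to the same polycuboid; the sketch below assumes those pairwise decisions are already given as edges (a hypothetical input for illustration) and simply takes connected components with a union-find to form the instances.

```python
def group_faces(num_faces, same_instance_edges):
    """Group detected faces into polycuboid instances, given pairwise
    same-instance edges (here assumed precomputed; the paper's GCN
    infers them from the faces' spatial relationships)."""
    parent = list(range(num_faces))

    def find(i):
        # Find the root representative, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union the endpoints of every same-instance edge.
    for a, b in same_instance_edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Collect faces by root: each component is one polycuboid instance.
    groups = {}
    for f in range(num_faces):
        groups.setdefault(find(f), []).append(f)
    return list(groups.values())
```

For instance, `group_faces(5, [(0, 1), (1, 2), (3, 4)])` yields two instances, `[0, 1, 2]` and `[3, 4]`; each group of faces is then reconstructed into a rectilinear mesh in the final stage.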

Additional Results

The first video demonstrates how a fitted polycuboid can serve as a controller, enabling easy rearrangement of furniture. The second video shows how a polycuboid shape can be stylized using fine-resolution polycuboids.

BibTeX

@misc{lee2025deeppolycuboidfittingcompact,
      title={Deep Polycuboid Fitting for Compact 3D Representation of Indoor Scenes},
      author={Gahye Lee and Hyejeong Yoon and Jungeon Kim and Seungyong Lee},
      year={2025},
      eprint={2503.14912},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.14912},
}