Abstract

Automatic three-dimensional (3D) as-built reconstruction of non-Manhattan structures and multiroom buildings remains an industrywide challenge because of complex building environments and the high demand for volumetric, object-level models. Conventional approaches rely on multiple separate steps that extract geometric and semantic features independently and therefore cannot fully exploit object-level features. This paper develops an end-to-end, fully automatic, object-level reconstruction approach that converts point clouds of non-Manhattan and multiroom buildings into 3D models. A two-stage 3D object-detection method based on region-based convolutional neural networks (R-CNN) is proposed. Feature fusion between sparse 3D feature maps and two-dimensional (2D) bird's eye view (BEV) feature maps is investigated to improve the generality and efficiency of modeling building primitives. To address the difficulty of generating training labels for heavily overlapping building objects, a dual-channel network is developed in which one channel detects walls and the other detects the remaining categories. In the experiments, the approach achieved an overall detection accuracy of 85.79% and a localization accuracy of 79.03%, improvements of 12.75% and 5.71%, respectively, over the latest benchmarks. Reconstructing a single-story building with a mean footprint of 471.936 m² took 4.75 s on average. This computing efficiency exceeds that of most existing as-built modeling approaches and thus holds significant potential for future industrial applications.
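To make the bird's eye view (BEV) idea concrete, the following is a minimal sketch of how a 3D point cloud can be projected onto a 2D grid viewed from above. It is an illustration only and not the network described in this paper: the function name `bev_height_map`, the `cell` size parameter, and the choice of storing the maximum height per cell are all assumptions made for this example, whereas the proposed method fuses learned sparse 3D and 2D BEV feature maps.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres


def bev_height_map(points: List[Point], cell: float = 0.5) -> List[List[Optional[float]]]:
    """Rasterise a point cloud into a top-down grid (hypothetical helper).

    Each grid cell stores the maximum z value of the points that fall
    into it, or None when the cell is empty.
    """
    if not points:
        return []

    # Bounding box of the cloud in the horizontal plane.
    min_x = min(p[0] for p in points)
    max_x = max(p[0] for p in points)
    min_y = min(p[1] for p in points)
    max_y = max(p[1] for p in points)

    cols = int((max_x - min_x) / cell) + 1
    rows = int((max_y - min_y) / cell) + 1
    grid: List[List[Optional[float]]] = [[None] * cols for _ in range(rows)]

    for x, y, z in points:
        c = int((x - min_x) / cell)  # column index along x
        r = int((y - min_y) / cell)  # row index along y
        if grid[r][c] is None or z > grid[r][c]:
            grid[r][c] = z
    return grid


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 1.0), (0.1, 0.1, 2.5), (1.2, 0.0, 0.3)]
    print(bev_height_map(cloud, cell=0.5))
```

A real pipeline would typically use several feature channels per cell (for example point density or intensity) rather than a single height value. The single-channel grid here only shows why a BEV representation is compact: wall-like and other vertical structures collapse onto a 2D footprint that a 2D convolutional backbone can process efficiently.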