Abstract

The type and duration of construction workers’ activities are useful information for project management. Several studies have therefore used surveillance cameras and computer vision to automate the time-consuming process of gathering this information manually. However, the three-stage method they adopt, which consists of separate detection, tracking, and activity classification modules, is not fully optimized. In addition, the activity classification module is trained per clip (segment) on trimmed video clips and fails when applied to long, untrimmed construction videos. This paper aims to (1) investigate the benefits of a fully optimized method, such as you only watch once (YOWO), and of an untrimmed data set annotated per frame and per worker, over the previous approach for activity recognition of construction workers; (2) propose an improved version of YOWO, called YOWO53, to improve detection performance; (3) propose a semiautomatic data set annotation method; (4) conduct a sensitivity analysis to compare the performance of YOWO, YOWO53, and the three-stage method; and (5) conduct a case study to compute the percentage of time workers spend on different activities. YOWO53 improves on the detection recall of YOWO by up to 3% and on the classification accuracy of the three-stage method by 16.3%. Although YOWO53 has a lower inference speed, it is still sufficiently fast for productivity analysis.
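As an illustration of how per-frame, per-worker predictions could be turned into the activity percentages mentioned in aim (5), the following is a minimal sketch, not the procedure used in the paper. The function name `activity_percentages`, the `(frame_index, worker_id, activity)` tuple layout, and the activity labels are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def activity_percentages(frame_labels):
    """Return, for each worker, the share of labeled frames per activity.

    frame_labels: iterable of (frame_index, worker_id, activity) tuples,
    one tuple per worker per frame (hypothetical format).
    """
    counts = defaultdict(Counter)
    for _frame_index, worker_id, activity in frame_labels:
        counts[worker_id][activity] += 1

    percentages = {}
    for worker_id, activity_counts in counts.items():
        total = sum(activity_counts.values())
        percentages[worker_id] = {
            activity: 100.0 * n / total
            for activity, n in activity_counts.items()
        }
    return percentages


# Hypothetical predictions for one worker over four frames.
sample = [
    (0, "worker_1", "bricklaying"),
    (1, "worker_1", "bricklaying"),
    (2, "worker_1", "idle"),
    (3, "worker_1", "bricklaying"),
]
print(activity_percentages(sample))
# {'worker_1': {'bricklaying': 75.0, 'idle': 25.0}}
```

Because every frame carries its own label, the shares follow directly from frame counts; multiplying a count by the frame interval would give an approximate duration, assuming a constant frame rate.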