Abstract
Safety in the construction industry has always been a focus of attention. Existing methods of detecting workers’ unsafe behavior rely primarily on manual detection, which consumes significant time and money and inevitably produces omissions. Current automated techniques judge behavior using only the unsafe factors associated with the workers themselves, which makes it difficult to understand unsafe behaviors in complex scenes. To address these problems, this study proposed a method that automatically extracts workers’ unsafe behaviors by combining information from complex scenes: image captioning based on an attention mechanism. First, three different sets of image captioning models were constructed using convolutional neural networks (CNNs), which are widely used in AI. These models could extract key information from complex scenes. Then, two datasets dedicated to the construction domain were created for method validation. Finally, three sets of experiments were conducted by combining the datasets with the three sets of models. The results showed that the method could detect the worker’s job type and output the interaction between the worker and the target (the unsafe behavior) based on the environmental information in the construction images. This study introduced environmental information into the determination of workers’ unsafe behaviors for the first time, so that the model not only outputs the worker’s job type but also determines the worker’s behavior. This makes the model output better suited to ergonomic analysis.

Practical Applications
This study developed an intelligent solution that uses behavioral norms to determine whether a worker is behaving unsafely in complex scenarios.
The operator would not need prior construction safety knowledge, such as the rules on wearing a helmet, wearing a safety belt, or working at height. Instead, the operator would simply input the target image, and the model would combine the predefined behavioral norms, scene information, and other factors to determine what kind of behavior (including unsafe behavior) the image contained and output a simple description of it. Descriptions could also be set as fixed templates for easy management, such as "worker A is (not) wearing a helmet," and these descriptions would play a key role in daily management and project summaries. Using this method, managers could use the relevant equipment to automatically capture good practices or violations by anyone on site. The method would also enable efficient organization and record keeping, improving managers’ productivity.
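
The fixed-template descriptions mentioned above can be illustrated with a minimal sketch. The names `Observation`, `render_description`, and `flag_violations`, as well as the exact template wording, are hypothetical and are not taken from the study. The sketch only shows how a per-worker finding might be turned into a uniform sentence for record keeping, and it assumes the model's findings have already been reduced to a worker label, an equipment item, and a yes/no flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One hypothetical model finding for a single worker in an image."""
    worker_id: str   # e.g. "A"
    equipment: str   # e.g. "helmet" or "safety belt"
    wearing: bool    # True if the equipment was detected on the worker


def render_description(obs: Observation) -> str:
    """Fill a fixed sentence template from one observation."""
    negation = "" if obs.wearing else "not "
    return f"Worker {obs.worker_id} is {negation}wearing a {obs.equipment}."


def flag_violations(observations):
    """Return descriptions only for observations where equipment is missing."""
    return [render_description(o) for o in observations if not o.wearing]


if __name__ == "__main__":
    found = [
        Observation("A", "helmet", wearing=False),
        Observation("B", "safety belt", wearing=True),
    ]
    for obs in found:
        print(render_description(obs))
```

Keeping the wording fixed in this way would make the descriptions easy to filter, count, or group in daily reports and project summaries.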