| Summary: | Human-Object Interaction (HOI) recognition is an essential task in computer vision that entails recognising interactions between a human and the objects appearing in the same visual scene. Although Graph Neural Networks (GNNs) have recently shown impressive performance in HOI recognition, the recognition process generates many non-interactive human-object pairs and processes local features insufficiently, both of which impede improvements in recognition precision. Meanwhile, GNNs suffer from high computational complexity, lengthy training time and low convergence efficiency at the training stage. This thesis addresses these challenges by developing and evaluating graph-based deep-learning models and strategies: the Interactive Recognition-Graph Neural Network (IR-GNN) model, the Parallel Multi-Head Graph Attention Network (PMGAT) model, and the Graph Sampling-based Dynamic Edge Strategy (GraphSADES). The IR-GNN model enhances HOI recognition by detecting interactive human-object pairs, employing human posture features and graph-based methods to improve precision. Extensive experiments on the HICO-DET and V-COCO datasets show that, in detecting interactive human-object pairs, recall accuracy improves by 4.44% and 5.07%, respectively, compared to the Transferable Interactiveness Knowledge (TIN) method. In terms of HOI recognition precision, IR-GNN improves the mean Average Precision (mAP) by 1.99% and 2.61% on the two datasets, respectively, compared to existing methods. The PMGAT model, utilising the output of IR-GNN, integrates local, global, and semantic features through a multi-head attention mechanism. This parallel structure enhances recognition accuracy and reduces training time.
The results show that PMGAT outperforms the previous best method, ViPLO, with an average improvement of 0.61% in mAP using ViT backbone networks on V-COCO and HICO-DET and an average improvement of 0.65% in AP, while also reducing training time by 31.5%. The GraphSADES strategy is then applied to the PMGAT model, reducing computational complexity, measured in Floating Point Operations (FLOPs), by dynamically adjusting edge participation in graph networks. This strategy maintains high precision while reducing FLOPs by 40.12% and 39.89% on HICO-DET and by 39.81% and 39.56% on V-COCO with the ResNet-50 and ViT-B/16 backbones, respectively. Training time decreases by 14.20% and 12.02% on HICO-DET and by 10.26% and 16.91% on V-COCO, with the earliest convergence reached after 180 and 165 epochs, respectively. These models and strategies yield recognition methods that can advance HOI recognition and provide a foundation for further research and practical applications in robotics, video surveillance, and autonomous vehicles.
|