Summary: The ability to automatically generate captions for medical videos, particularly gastrointestinal (GI) tract endoscopy, has become increasingly important: precise interpretation of visual data underpins accurate diagnosis, effective treatment, and medical education. Despite advances in video captioning technology, existing models face significant challenges when applied to medical domains. Three key gaps persist. First, curated datasets for the medical domain, and for GI endoscopy in particular, are scarce. Second, traditional machine learning techniques have proven inadequate at capturing the complex visual and linguistic patterns present in medical videos. Third, the absence of evaluation metrics tailored to the medical context hinders accurate assessment of how clinically relevant generated captions are. This research addresses these gaps with three contributions: (1) GITVidSet, a dedicated GI endoscopy dataset tailored to medical video captioning; (2) GITVidCap, a deep learning video captioning model designed for complex medical patterns using ResNet152v2, GRU, and multi-head attention mechanisms; and (3) VSEQScore, a novel metric for evaluating the semantic and clinical relevance of generated captions. Trained on GITVidSet, GITVidCap outperforms existing approaches, achieving significant improvements in BLEU (0.75), ROUGE-L (0.61), and CIDEr (1.2) scores. VSEQScore likewise demonstrates its effectiveness, exceeding established metrics in precision (0.82), recall (0.78), and attention score (0.82). By improving the accuracy of diagnostic interpretation, the quality of medical documentation, and the educational resources available to healthcare professionals, this work contributes to better patient outcomes and advances knowledge in medical video analysis.
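To make the named architecture concrete, the following is a minimal Keras/TensorFlow sketch of a ResNet152V2 + GRU + multi-head attention captioning model of the general kind the abstract describes. All layer sizes, sequence lengths, and the vocabulary size are illustrative assumptions, not the published GITVidCap configuration.

```python
# Illustrative sketch only: an encoder-decoder video captioner combining a
# ResNet152V2 frame encoder, a GRU caption decoder, and multi-head attention.
# All dimensions below are placeholder assumptions, not GITVidCap's settings.
import tensorflow as tf
from tensorflow.keras import layers

NUM_FRAMES, VOCAB_SIZE, MAX_CAPTION_LEN, EMBED_DIM = 16, 5000, 30, 256

# Frozen ResNet152V2 backbone extracts one pooled feature vector per frame.
backbone = tf.keras.applications.ResNet152V2(
    include_top=False, weights="imagenet", pooling="avg")
backbone.trainable = False

frames = layers.Input(shape=(NUM_FRAMES, 224, 224, 3))      # sampled clip
frame_feats = layers.TimeDistributed(backbone)(frames)      # (T, 2048)
frame_feats = layers.Dense(EMBED_DIM)(frame_feats)          # project to 256

caption_in = layers.Input(shape=(MAX_CAPTION_LEN,), dtype="int32")
tokens = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(caption_in)

# GRU decoder runs over the (teacher-forced) caption token sequence.
dec = layers.GRU(EMBED_DIM, return_sequences=True)(tokens)

# Multi-head attention lets each decoding step attend over frame features.
ctx = layers.MultiHeadAttention(num_heads=4, key_dim=64)(dec, frame_feats)
logits = layers.Dense(VOCAB_SIZE, activation="softmax")(
    layers.Concatenate()([dec, ctx]))

model = tf.keras.Model([frames, caption_in], logits)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

At inference time a model of this shape would be decoded autoregressively, feeding each predicted token back into the caption input; the sketch shows only the teacher-forced training graph. No comparable sketch is given for VSEQScore, since the abstract does not define the metric's computation.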
|