A Review of the Applications of Self-Supervised Learning in Multimodal Models
DOI:
https://doi.org/10.56028/aetr.14.1.1491.2025

Keywords:
self-supervised learning; multimodal models; contrastive learning; masked modelling; generative learning; artificial intelligence.

Abstract
Multimodal models achieve richer semantic understanding and reasoning by jointly processing multi-source information such as images, text, speech, and video. Traditional multimodal models usually require large amounts of labelled data for training, which is costly and inefficient. Self-Supervised Learning (SSL) opens new possibilities for efficient training on multimodal tasks by designing pretext tasks that learn feature representations autonomously from unlabelled data, without manual annotation. This paper systematically reviews the main methods and key applications of self-supervised learning in multimodal models, including contrastive learning, masked modelling, and generative learning, and analyses the application of SSL in practical areas such as vision-language retrieval, audio-video understanding, and medical diagnosis. It also discusses the limitations of current research and outlines future directions, aiming to provide a theoretical foundation and practical reference for researchers in this field.