A Review of the Applications of Self-Supervised Learning in Multimodal Models
DOI:
https://doi.org/10.56028/aetr.14.1.1491.2025

Keywords:
self-supervised learning; multimodal models; contrastive learning; masked modelling; generative learning; artificial intelligence.

Abstract
Multimodal models achieve richer semantic understanding and reasoning by jointly processing multi-source information such as images, text, speech, and video. Traditional multimodal models usually require large amounts of labelled data for training, which is costly and inefficient. Self-Supervised Learning (SSL) opens new possibilities for efficient training on multimodal tasks by designing pretext tasks that learn feature representations autonomously from unlabelled data, without manual annotation. This paper systematically reviews the main methods and key applications of self-supervised learning in multimodal models, including contrastive learning, masked modelling, and generative learning, and analyses the application of SSL in practical areas such as vision-language retrieval, audio-video understanding, and medical diagnosis. It also discusses the limitations of current research and outlines future directions, aiming to provide a theoretical foundation and practical reference for researchers in this field.