Generating Missing Modalities: A Conditional Diffusion and Transformer Approach for Emotion Recognition
DOI:
https://doi.org/10.56028/aetr.15.1.1238.2025
Keywords:
Diffusion Model, Transformer, Generative Model, Incomplete Modality, Multimodal Emotion Classification.
Abstract
Multimodal models have significantly advanced emotion recognition by combining information from text, audio, and visual modalities. Many studies have pushed the boundaries of this field. However, missing modalities remain a major challenge, hindering a model’s ability to capture and integrate cross-modal interactions effectively. Moreover, conventional modality completion approaches often fail to preserve fine-grained details. To address these limitations, we propose a novel modality completion framework based on Conditional Diffusion and Transformer (CDTP). By incorporating three types of prompts and conditions, CDTP enables more detailed representations within and across modalities. Experiments and ablation studies demonstrate that our method substantially enhances emotion recognition performance and exhibits strong robustness in scenarios with missing modalities. The source code will be publicly available at https://github.com/cwzhang689/DPT.