Modeling and Application of Deep Multimodal Integration in Intelligent Perception Systems
DOI: https://doi.org/10.56028/aetr.15.1.1316.2025

Keywords: Multimodal Deep Fusion; Intelligent Perception System; DW-CMTransformer; Attention Mechanism.

Abstract
To address the limitations of single-modal perception in complex dynamic environments, such as insufficient robustness, and the challenges of multimodal fusion, including difficult semantic alignment and high computational overhead, this paper proposes a dynamically weighted cross-modal Transformer framework (DW-CMTransformer). Firstly, a modal reliability evaluation module based on information entropy quantifies, in real time, the confidence of heterogeneous sensors such as vision, LiDAR, and voice, and generates adaptive weights. Secondly, a hierarchical attention mechanism is designed: at the feature level, a cross-modal Transformer performs bidirectional semantic alignment of multimodal sequences, and at the decision level, single-modality outputs are weighted and fused, so that system performance remains stable when any single modality fails. Finally, knowledge distillation compresses a 40M-parameter teacher model to 5M parameters, increasing inference speed fivefold at a cost of only 0.7% mAP. Experiments on the nuScenes 3D object detection task show that DW-CMTransformer achieves an average mAP of 66.1%, which is 7.9% and 4.3% higher than the Late Fusion and MMFN (Multimodal Fusion Network) baselines, respectively, and that it suffers the smallest performance degradation when a single modality is missing. This study provides an efficient and scalable paradigm for deep multimodal fusion toward robust intelligent perception in edge-computing environments, and its results can transfer to critical fields such as medical diagnosis and industrial inspection.
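The entropy-based reliability weighting and decision-level fusion described in the abstract can be illustrated with a short sketch. This is a hypothetical minimal implementation, not the paper's code: the function names (entropy_confidence, fuse_decisions) and the normalized-entropy confidence formula are assumptions standing in for whatever exact weighting rule the full paper defines.

import torch
import torch.nn.functional as F

def entropy_confidence(logits: torch.Tensor) -> torch.Tensor:
    # Map per-modality logits of shape (batch, classes) to a confidence
    # in [0, 1]: low predictive entropy -> high confidence.
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # (batch,)
    max_entropy = torch.log(torch.tensor(float(logits.shape[-1])))
    return 1.0 - entropy / max_entropy  # 1 = fully confident, 0 = uniform

def fuse_decisions(modal_logits: list) -> torch.Tensor:
    # Weight each modality's logits by its entropy-derived confidence and
    # renormalize across modalities, so a failed or noisy modality (whose
    # output is near-uniform) is automatically down-weighted rather than
    # corrupting the fused decision.
    confidences = torch.stack([entropy_confidence(l) for l in modal_logits])  # (M, batch)
    weights = confidences / confidences.sum(dim=0, keepdim=True).clamp_min(1e-12)
    stacked = torch.stack(modal_logits)  # (M, batch, classes)
    return (weights.unsqueeze(-1) * stacked).sum(dim=0)  # (batch, classes)

# Toy usage: a confident camera branch and a degraded, near-uniform LiDAR branch.
camera = torch.tensor([[4.0, 0.1, 0.1]])
lidar = torch.tensor([[0.4, 0.3, 0.3]])
print(fuse_decisions([camera, lidar]))  # fused logits dominated by the camera

The design point this sketch captures is the abstract's robustness claim: because the adaptive weights are recomputed per input from the modalities' own output distributions, no retraining or manual reconfiguration is needed when a sensor degrades or drops out.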