A Wav2Vec2 and Multi-Head Attention Based Framework for Pronunciation Error Detection in Tibetan Mandarin Learners

Zhenye Gan; Ming Wang; Le Wei

doi:10.56028/aetr.15.1.291.2025

Authors

Zhenye Gan
Ming Wang
Le Wei

DOI:

https://doi.org/10.56028/aetr.15.1.291.2025

Keywords:

Tibetan Mandarin learners, MDD, self-supervised learning.

Abstract

To address the systematic pronunciation errors that native Tibetan speakers make when learning Mandarin Chinese, this paper proposes an end-to-end detection method that combines the self-supervised Wav2Vec 2.0 model with a multi-head self-attention (MHA) mechanism and a CTC decoder. The model is fine-tuned to adapt to the pronunciation characteristics of Tibetan speakers, and MHA is used to strengthen its ability to capture long-range dependencies. Experiments on a self-constructed dataset of Mandarin mispronunciations by Tibetan students show that the proposed model significantly outperforms both a traditional ASR model and the baseline Wav2Vec2-CTC system in detection accuracy (DAR) and F1 score, which validates its effectiveness in low-resource speech learning scenarios.
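
As an illustration of how a CTC output can be turned into a pronunciation-error decision, the following minimal sketch applies the standard greedy CTC rule (collapse consecutive repeats, then drop blanks) and compares the result with the canonical phoneme sequence. The abstract does not state which decoding strategy or alignment procedure the authors use, so greedy decoding, the `difflib` alignment, the blank id of 0, and the phoneme labels below are all assumptions made for this example.

```python
from difflib import SequenceMatcher

BLANK = 0  # assumed id of the CTC blank symbol (not specified in the abstract)


def ctc_greedy_decode(frame_ids, blank=BLANK):
    """Collapse consecutive repeated labels, then drop blanks (greedy CTC rule)."""
    decoded = []
    previous = None
    for label in frame_ids:
        # Emit a label only when it differs from the previous frame and is not blank.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


def flag_mispronunciations(canonical, recognized):
    """Return (operation, canonical_span, recognized_span) for each mismatch."""
    matcher = SequenceMatcher(a=canonical, b=recognized, autojunk=False)
    return [
        (tag, canonical[i1:i2], recognized[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical phoneme vocabulary and frame-level argmax output.
id_to_phone = {1: "z", 2: "i", 3: "d", 4: "ao", 5: "zh"}
frames = [0, 1, 1, 0, 2, 2, 0, 3, 4, 4, 0]
recognized = [id_to_phone[i] for i in ctc_greedy_decode(frames)]

canonical = ["zh", "i", "d", "ao"]  # hypothetical target phonemes
errors = flag_mispronunciations(canonical, recognized)
```

In this toy case the learner's "z" is flagged as a substitution for the canonical "zh". A real system would obtain the frame-level labels from the fine-tuned acoustic model rather than from a hand-written list.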
