
Composite Speech-Language Model for End-to-End Speech-to-Text Translation
Explore the innovative Composite Speech-Language Model (ComSL) for seamless speech-to-text translation. Leveraging existing pretrained models, ComSL eliminates the need for extensive pretraining while achieving superior performance in downstream tasks. Learn about its architecture, unified pretraining approach, and more in this comprehensive study.
ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation
Chenyang Le, Yao Qian, Long Zhou, Shujie Liu, Yanmin Qian, Michael Zeng, Xuedong Huang
Shanghai Jiao Tong University, Microsoft Research Asia
Speaker: Yu-Chen Kuan
OUTLINE: Introduction, Method, Experiments, Conclusion
Introduction
Unified speech-language pretraining has largely boosted E2E modeling for spoken language tasks. To mitigate the modality gap and achieve performance comparable to a cascaded module system, unified speech-language pretraining must leverage the same or a larger scale of data than that used in speech-only or language-only model pretraining.
Introduction
Proposed ComSL: a Composite Speech-Language Model that fully leverages existing pretrained models, eliminating the need for pretraining from scratch on large amounts of data, and can be directly fine-tuned for downstream tasks.
Introduction
Proposed cross-modality learning with speech-text mapping/matching on either representations or distributions, based only on the concatenation of paired speech and text. Unlike conventional approaches that use contrastive learning between modalities, it does not require external or internal aligners to force-align speech and text at the token or word level.
Problem formulation
ST training corpus: D_ST = {(s, x, y)}, where s = {s_1, s_2, ..., s_T} is the acoustic feature sequence, x = {x_1, x_2, ..., x_M} is the transcription of the speech, and y = {y_1, y_2, ..., y_N} is the translation label sequence. Multi-task learning (MTL) is formulated over these sequences, as sketched below.
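As a rough sketch (notation assumed, not copied verbatim from the paper), the three sequence-to-sequence tasks over a triple (s, x, y) can be written as conditional likelihoods:

ST:  P(y \mid s) = \prod_{n=1}^{N} P(y_n \mid y_{<n}, s)
ASR: P(x \mid s) = \prod_{m=1}^{M} P(x_m \mid x_{<m}, s)
MT:  P(y \mid x) = \prod_{n=1}^{N} P(y_n \mid y_{<n}, x)

MTL then combines the corresponding negative log-likelihood losses; the full combination is given under Total Training Loss below.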
ComSL (overall model overview figure)
Model Architecture
Speech Transformer blocks: initialized with the Whisper model. Adapter: two layers; each layer is composed of a feed-forward module and a 1-D convolution layer with stride two, achieving four times down-sampling of the speech encoder outputs in total (see the sketch below).
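A minimal PyTorch-style sketch of such an adapter. The hidden dimension, the exact feed-forward design, and the residual connection are assumptions for illustration, not taken from the released code:

```python
import torch
import torch.nn as nn

class AdapterLayer(nn.Module):
    """One adapter layer: feed-forward module + 1-D conv with stride 2 (2x down-sampling)."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )
        # Conv1d expects (batch, channels, time); stride 2 halves the time axis.
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, dim)
        x = x + self.ffn(x)                               # residual feed-forward
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # down-sample along time
        return x

class Adapter(nn.Module):
    """Two adapter layers => 4x down-sampling of the speech encoder outputs in total."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.layers = nn.ModuleList([AdapterLayer(dim), AdapterLayer(dim)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x
```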
Model Architecture
Language Transformer blocks: mBART model composed of 12 encoder layers and 12 decoder layers, with a model dimension of 1024 and 16 attention heads.
Training Strategies
Multi-task Learning: the model is jointly optimized with ASR, cross-modality learning, ST, and MT losses, detailed on the following slides.
ASR
ASR Loss (a sketch follows).
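A standard autoregressive cross-entropy form for the ASR loss, reconstructed from the notation above (assumed, not copied from the paper):

L_{ASR} = -\sum_{m=1}^{M} \log P(x_m \mid x_{<m}, s)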
Cross-modality learning (CML)
Minimizes the gap between the speech and text modalities. Unlike previous speech-text alignment approaches that rely on external forced-alignment methods, it intrinsically learns cross-modality information during model optimization. The CML training loss combines the components described below.
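A plausible composition of the CML loss from its components (MTP, STM, and ERM, introduced next); equal weighting is an assumption:

L_{CML} = L_{MTP} + L_{STM} + L_{ERM}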
Masked Token Prediction (MTP)
The text tokens in the concatenated input are randomly masked with probability p_mask. The decoder generates the whole sentence with teacher forcing, but the loss is added only on the masked tokens (a sketch follows).
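A minimal PyTorch-style sketch of masking the text tokens and computing cross-entropy only on the masked positions, assuming the decoder has already produced per-token logits under teacher forcing (function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def mask_text_tokens(tokens, mask_id, p_mask=0.15):
    """Randomly replace text tokens with a mask id; return the masked input and the mask."""
    mask = torch.rand(tokens.shape, device=tokens.device) < p_mask
    return torch.where(mask, torch.full_like(tokens, mask_id), tokens), mask

def mtp_loss(logits, targets, mask):
    """Teacher-forced cross-entropy restricted to the masked positions only.

    logits:  (batch, seq_len, vocab_size) decoder outputs
    targets: (batch, seq_len) ground-truth token ids
    mask:    (batch, seq_len) boolean mask produced by mask_text_tokens
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # (batch*seq, vocab)
        targets.reshape(-1),                   # (batch*seq,)
        reduction="none",
    ).reshape(targets.shape)
    # Average only over masked positions; clamp avoids division by zero.
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```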
Speech to Text Mapping (STM)
Similar to a regular sequence-to-sequence mapping task (such as ASR or ST), except that the encoder output hidden state is now conditioned not only on the input speech s but also on the masked ground-truth transcription x. STM uses two losses (sketched below).
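A plausible form of the two STM losses, predicting the transcription and the translation from the concatenated speech and masked-text input (notation assumed; \hat{x} denotes the masked ground-truth transcription):

L_{STM}^{asr} = -\sum_{m=1}^{M} \log P(x_m \mid x_{<m}, s, \hat{x})
L_{STM}^{st}  = -\sum_{n=1}^{N} \log P(y_n \mid y_{<n}, s, \hat{x})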
ASR vs. STM
ASR Loss vs. STM Loss: the ASR loss conditions the decoder only on the speech input, while the STM loss conditions it on both the speech and the masked ground-truth transcription (comparison figure).
Encoder Representation Matching (ERM)
Push the representation of the speech-only input closer to that of the concatenated input by adding an L2 loss between them. Since the variation at the final encoder output might be too large to learn well, the loss is added on the hidden states of the k-th Transformer block of the text encoder, a_k^s and a_k^{s,x}, with k = 4 (see the sketch below).
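A plausible form of the ERM loss under the notation above (assumed):

L_{ERM} = \left\| a_k^{s} - a_k^{s,x} \right\|_2^2, \quad k = 4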
ST: Decoder Distribution Matching (DDM)
The performance of MT is generally better than that of ST, especially when the input to MT is the ground-truth transcription, so the MT decoder distribution is used to guide the ST task. DDM in the ST loss (sketched below).
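A rough sketch of a DDM-augmented ST loss, mixing the usual cross-entropy with a divergence to the MT decoder distribution; the interpolation weight and the divergence direction are assumptions, not the paper's exact formula:

L_{ST}^{DDM} = (1-\lambda)\, L_{ST} + \lambda \sum_{n=1}^{N} \mathrm{KL}\left( P_{MT}(\cdot \mid y_{<n}, x) \,\|\, P_{ST}(\cdot \mid y_{<n}, s) \right)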
Regularization on the MT Output
The language Transformer blocks have been fine-tuned on MT tasks with our corpus before model composition. To prevent the MT task from overfitting during multi-task learning of the composite model, an additional language model, i.e., a fine-tuned mBART-50 model, is introduced, and its parameters are frozen during training.
Regularization on the MT Output
The operation for the MT task is similar to that for the ST task: a stronger task serves as a teacher to guide a relatively weaker one. MT Loss (sketched below).
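A rough sketch of the regularized MT loss, with the frozen fine-tuned mBART-50 acting as the teacher distribution; again, the interpolation weight and divergence direction are assumptions:

L_{MT}^{reg} = (1-\lambda)\, L_{MT} + \lambda \sum_{n=1}^{N} \mathrm{KL}\left( P_{mBART}(\cdot \mid y_{<n}, x) \,\|\, P_{MT}(\cdot \mid y_{<n}, x) \right)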
Total Training Loss
Multi-task Learning: the total loss combines all the task losses above (see the sketch below).
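A plausible total training loss combining the tasks described above (equal weighting is an assumption; the paper may use tuned weights):

L = L_{ST} + L_{ASR} + L_{MT} + L_{CML}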
Freezing Speech Encoder
The speech and language Transformer blocks are initialized from well-trained speech-only and text-only models, while the adapter is randomly initialized. The speech encoder is frozen for the first few epochs so that gradients flowing through the randomly initialized adapter do not disturb the well-trained speech representations.
Datasets
E2E ST Data: CoVoST 2, covering 21 languages into English and English into 15 languages; this work focuses on X–En, the non-English-to-English directions.
Pseudo ST Data: for some low-resource language pairs in CoVoST 2, unlabeled translation data are added in a self-training manner for language directions with fewer than 30 hours of recordings; Mozilla Common Voice is used to extract the data.
Configuration
ComSL Medium and ComSL Large: the speech Transformer blocks are initialized with Whisper models of different sizes. Medium (0.9B): 24 Transformer layers with 1024 hidden dimensions and 16 heads. Large (1.3B): 32 Transformer layers with 1280 hidden dimensions and 20 heads. Both use a two-layer convolution adapter and an mBART model initialized from the mbart-large-50-many-to-many-mmt checkpoint.
Training and Inference
The checkpoint with the highest BLEU score on the validation set is saved. Training takes about 3 days on 4*8 Nvidia Tesla V100 GPUs with 32 GB of memory. Inference uses beam search with beam size 5. Detokenized corpus-level BLEU scores are computed on the CoVoST 2 test set using sacreBLEU (a usage sketch follows).
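A minimal sketch of computing detokenized corpus-level BLEU with sacreBLEU; the example sentences are placeholders, not taken from CoVoST 2:

```python
import sacrebleu

# System outputs and references are plain detokenized strings.
hypotheses = ["the cat sat on the mat", "he went to the market"]
references = ["the cat is sitting on the mat", "he went to the market"]

# corpus_bleu takes one list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```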
Main Results
The 21 languages of the test set are divided into three groups: High-resource (High), Medium-resource (Med), and Low-resource (Low).
Ablation study on training tasks/losses (ComSL Medium)
Comparison among different CML methods
Contrastive-based methods (ConST and WACO) require an internal or external token-level forced aligner, which adds a burden to the training process.
Conclusion
Presented a Composite Speech and Language (ComSL) model to perform E2E speech-to-text translation tasks. Bridged the gap between speech and text representations through cross-modality learning tasks, in addition to other auxiliary tasks such as speech recognition and machine translation. Outperformed both the constituent speech model (Whisper) and the cascaded speech and language models (Whisper + mBART) used in a pipelined manner.