
Tuning Large Language Model for End-to-End Speech Translation with LST Approach
"Discover how a Large Language Model (LLM) coupled with the LST approach excels at End-to-End Speech Translation (E2E-ST) tasks. Learn about the model architecture, two-stage training process, and the advancements in large multimodal models. Explore the cascade strategy to improve scalability and integration of research outcomes."
Tuning Large Language Model for End-to-End Speech Translation
Hao Zhang, Nianwen Si, Yaqi Chen, Wenlin Zhang, Xukui Yang, Dan Qu, Xiaolin Jiao
University of Information Engineering, Zhengzhou 450000, China
Speaker: Yu-Chen Kuan
OUTLINE
Introduction
Method
Experiments
Results & Analysis
Conclusion
Introduction
Recently, large language models (LLMs) have experienced rapid advancements.
Simple fine-tuning on downstream tasks can approach or even surpass well-designed smaller models.
The authors propose LST, a large multimodal model that excels at the E2E-ST task.
LST adopts a cascade strategy to improve scalability and to facilitate the reuse of advanced research results from each single modality.
LST
Model Architecture
Speech frontend: extracts a semantic representation from the input speech signal.
LLM backend: the transformed representation serves as a soft prompt for the LLM, prompting it to generate the corresponding translation.
Model Architecture
Length adapter: the extracted representation typically exceeds a length of 500, so 1-dimensional convolutions are used to reduce the length.
Modality adapter: a simple linear layer that transforms the speech representation into the text embedding space.
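A minimal PyTorch sketch of how these pieces could fit together. Module names are illustrative, not code from the paper; the hyperparameters (two conv layers with kernel size 5, stride 2, padding 2, hidden dimension 1024, LLM embedding dimension 4096) are taken from the experimental setup slides later on:

```python
import torch
import torch.nn as nn

class LengthAdapter(nn.Module):
    """Stacked 1-D convolutions that shorten the speech representation."""
    def __init__(self, dim=1024, num_layers=2, kernel_size=5, stride=2, padding=2):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size, stride=stride, padding=padding)
            for _ in range(num_layers)
        )

    def forward(self, x):                  # x: (batch, time, dim)
        x = x.transpose(1, 2)              # Conv1d expects (batch, dim, time)
        for conv in self.convs:
            x = torch.relu(conv(x))        # each stride-2 layer roughly halves the length
        return x.transpose(1, 2)

class ModalityAdapter(nn.Module):
    """Single linear layer mapping speech features into the LLM embedding space."""
    def __init__(self, speech_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(speech_dim, llm_dim)

    def forward(self, x):
        return self.proj(x)

# The resulting tensor acts as a soft prompt: it would be concatenated with the
# embedded text prompt and fed to the LLM through its input embeddings.
def build_soft_prompt(speech_repr, length_adapter, modality_adapter):
    return modality_adapter(length_adapter(speech_repr))   # (batch, reduced_time, 4096)
```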
Two-Stage Training of LST
Modality adjustment (first stage): tune only the adapters to align the speech representation with the text embedding space.
Downstream task fine-tuning (second stage): freeze the parameters of the speech frontend and train both the adapters and the LLM to optimize performance on the E2E-ST task.
First-Stage Training
Keep the speech frontend and the LLM backend frozen; only the adapter parameters are trainable.
Main objectives: speech length reduction and modality transformation of the speech representation.
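A hedged sketch of the first-stage freezing, continuing the illustrative module names from the architecture sketch above (speech_frontend, length_adapter, modality_adapter, and llm are assumed attribute names, not code from the paper):

```python
def set_stage_one(model):
    """First stage: only the adapters receive gradients."""
    for p in model.speech_frontend.parameters():
        p.requires_grad = False           # frozen speech frontend
    for p in model.llm.parameters():
        p.requires_grad = False           # frozen LLM backend
    for p in model.length_adapter.parameters():
        p.requires_grad = True
    for p in model.modality_adapter.parameters():
        p.requires_grad = True
```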
First-Stage Training
For this stage, the choice of training data is flexible and can include either ASR or ST datasets.
When using an ASR dataset, the loss is computed on the transcription; when using an ST dataset, it is computed on the translation (see the formulas sketched below).
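The slide's loss formulas are not reproduced in this transcript. Assuming the standard autoregressive cross-entropy objective, conditioned on the speech input s through its soft prompt, they would take roughly this form, with x_1..x_M the transcription and y_1..y_N the translation:

```latex
\mathcal{L}_{\mathrm{ASR}} = -\sum_{i=1}^{M} \log P_{\theta}\left(x_i \mid x_{<i}, s\right)
\qquad
\mathcal{L}_{\mathrm{ST}} = -\sum_{i=1}^{N} \log P_{\theta}\left(y_i \mid y_{<i}, s\right)
```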
Second-Stage Training
Freeze the speech frontend and train both the adapters and the LLM backend.
Objective: enhance the model's capabilities on the specific downstream task.
The loss calculation remains the same as in the first stage.
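The corresponding second-stage switch, continuing the same illustrative sketch:

```python
def set_stage_two(model):
    """Second stage: adapters and LLM backend are trained; the speech frontend stays frozen."""
    for p in model.speech_frontend.parameters():
        p.requires_grad = False
    for p in model.llm.parameters():
        p.requires_grad = True
    for p in model.length_adapter.parameters():
        p.requires_grad = True
    for p in model.modality_adapter.parameters():
        p.requires_grad = True
```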
Datasets
ST dataset: MuST-C
ASR dataset: 960 hours of Librispeech
Experimental Setups
Speech frontend: CTC-finetuned wav2vec 2.0 large model, pre-trained on 53.2k hours of LibriVox audio and fine-tuned on the 960h Librispeech; the CTC projection head is discarded and the last encoder output is used as the speech representation.
Length adapter: two 1-dimensional convolutional layers with kernel size 5, stride 2, padding 2, and hidden dimension 1024.
Modality adapter: a simple linear layer.
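One plausible way to obtain such a frontend with Hugging Face Transformers; the checkpoint name is an assumption (a LibriVox/LV-60k pre-trained, Librispeech-960h CTC-finetuned wav2vec 2.0 large), not something taken from the paper:

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed checkpoint: wav2vec 2.0 large, LV-60k pre-training + 960h Librispeech CTC fine-tuning.
ckpt = "facebook/wav2vec2-large-960h-lv60-self"

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
# Wav2Vec2Model loads only the encoder, i.e. the CTC projection head is dropped.
frontend = Wav2Vec2Model.from_pretrained(ckpt).eval()

waveform = torch.randn(16000 * 10)       # 10 s of dummy 16 kHz audio
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    speech_repr = frontend(**inputs).last_hidden_state   # roughly (1, 499, 1024)
```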
Experimental Setups
LLM: LLaMA2, embedding dimension 4096.
Training: first stage for 6 epochs; second stage for 1 epoch (the model quickly overfits after one epoch); batch size 128.
Checkpoints are saved every 1000 steps in the first stage and every 100 steps in the second stage.
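Loading the backend could look like this; the exact LLaMA2 variant is an assumption (the reported 4096 embedding dimension matches the 7B model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint: the slides only say "LLaMA2"; the 7B base variant is a guess.
llm_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(llm_name)
llm = AutoModelForCausalLM.from_pretrained(llm_name)

print(llm.config.hidden_size)            # 4096, matching the reported embedding dimension
```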
Experimental Setups
During inference, the final checkpoint is used for evaluation; averaging multiple checkpoints does not yield better results.
Beam size: 4.
Metric: sacreBLEU (case-sensitive detokenized BLEU).
Training for the first and second stages takes 23h and 26h, respectively.
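Scoring with sacreBLEU might look like the minimal sketch below; corpus_bleu is case-sensitive and detokenized by default, and the hypothesis strings merely stand in for beam-search (beam size 4) outputs of the final checkpoint:

```python
import sacrebleu

# hyps: translations decoded from the final checkpoint with beam size 4
# refs: the corresponding reference translations (illustrative strings only)
hyps = ["Das ist ein Test.", "Hallo Welt."]
refs = ["Das ist ein Test.", "Hallo, Welt!"]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU = {bleu.score:.2f}")
```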
Results
Effects of Training Strategy
Both training stages are necessary: using only the first stage or only the second stage yields poor results.
In the first stage, using direct ST data yields better results than ASR data, even though the ASR data totals 960h while the ST data comprises only 400h.
Effects of LLM Backend
Supervised instruction fine-tuning (SFT) is not essential for the LLM backend.
The overall performance on the E2E-ST task hinges on the foundation LLM: the larger the LLM, the higher the final performance.
Results for Different Speech Lengths
The tst-COMMON sets are split into 5 groups according to the duration of the source speech.
Case Study
Conclusion
Proposed LST, a large multimodal model that excels at the E2E-ST task.
LST outperforms previous models and achieves a new state of the art.
Further exploration: using a more complex adapter, using a better LLM, and combining with other methods.