Advanced Speech Translation System Leveraging Large Language Models

LLaST: Improved End-to-end Speech Translation

Explore the enhanced end-to-end speech translation system empowered by large language models (LLMs). The system focuses on efficient utilization of LLMs for high-performance speech translation, covering model architecture design, training strategies, and data recipe. Dive into speech encoder techniques, adapter functions, and more for robust linguistic representations.

  • Speech translation
  • Language models
  • NLP tasks
  • LLaST
  • Model architecture


Presentation Transcript


  1. LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models. Xi Chen, Songyang Zhang, Qibing Bai, Kai Chen, Satoshi Nakamura. The Chinese University of Hong Kong, Shenzhen. Speaker: Yu-Chen Kuan

  2. OUTLINE: Introduction, Method, Experiments, Conclusion

  3. Introduction

  4. Introduction: Speech translation is intrinsically linked to NLP, and LLMs have demonstrated unprecedented capabilities across a variety of NLP tasks. The central question is how to most effectively harness the vast potential of LLMs to develop a high-performance ST system in an efficient manner, without compromising on quality or scalability.

  5. Introduction: This paper focuses on exploring best practices for constructing an effective speech translation system powered by LLMs, and proposes LLaST. It examines the LLM-based speech translation method in terms of model architecture design, training strategies, and data recipe.

  6. Method

  7. Problem setting: The speech translation dataset is D = {(S, Ysrc, Ytgt)}, where S is the source-language speech. Fa is the acoustic feature extraction operation, so the acoustic features of S are A = Fa(S), and F represents the entire ST system. The goal of speech translation is to generate the predicted target-language text Ytgt from the source speech S: Ŷtgt = F(Fa(S)).

  8. LLaST

  9. Speech Encoder: Generates robust linguistic representations, denoted as Zs = Fse(A), where Fse represents the speech encoder function. The paper investigates various options for the speech encoder, focusing on mHuBERT and Whisper.
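
As a rough illustration, here is a minimal sketch of computing Zs with a Whisper encoder, assuming the Hugging Face transformers API; the checkpoint name and feature pipeline details are assumptions and may differ from LLaST's implementation.

```python
# Minimal sketch: Zs = Fse(Fa(S)) with a Whisper encoder.
# Assumes the Hugging Face `transformers` API; LLaST's exact pipeline may differ.
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v2")
whisper = WhisperModel.from_pretrained("openai/whisper-large-v2")

def encode_speech(waveform, sampling_rate=16000):
    # A = Fa(S): log-Mel acoustic features from the raw waveform
    features = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    # Zs = Fse(A): contextual representations from the Whisper encoder
    with torch.no_grad():
        encoder_out = whisper.encoder(features.input_features)
    return encoder_out.last_hidden_state  # shape (1, T, 1280) for whisper-large-v2
```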

  10. Adapter: A bridge between the speech encoder and the LLM that aligns speech features more effectively with the LLM's representation space. It projects the extracted linguistic representations Zs into the embedding space of the LLM, yielding Hs. A 3-layer multilayer perceptron (MLP) is adopted as the adapter.
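
A sketch of such an adapter in PyTorch; the hidden sizes and GELU activation are assumptions, since the slide only specifies a 3-layer MLP mapping speech features into the LLM's embedding dimension.

```python
# Sketch of the 3-layer MLP adapter: projects Zs into the LLM embedding space (Hs).
# Hidden sizes and activation are assumptions; the slide only states "3-layer MLP".
import torch.nn as nn

class SpeechAdapter(nn.Module):
    def __init__(self, speech_dim=1280, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(speech_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, zs):
        # Hs: speech features aligned with the LLM's input-embedding space
        return self.mlp(zs)
```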

  11. Large Language Model: A speech-text prompt input for the LLM, denoted as Xq, is constructed, such as "Translate the French sentence into English". After tokenization and embedding, Xq is transformed into the LLM's input representation Hq. The LLM then generates translation predictions based on the concatenated speech-text features [Hs; Hq]. The entire process thus chains acoustic feature extraction, the speech encoder, the adapter, and the LLM.
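
A sketch of this concatenation step with a Hugging Face causal LM; the Llama-2 checkpoint, the ordering of Hs and Hq, and the placeholder tensor for Hs are illustrative assumptions only.

```python
# Sketch: embed the text prompt Xq, concatenate with the adapted speech features Hs,
# and obtain translation predictions from the LLM. Checkpoint and ordering are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder LLM
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

prompt = "Translate the French sentence into English."
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
hq = llm.get_input_embeddings()(prompt_ids)         # Hq: embedded text prompt
hs = torch.randn(1, 1500, llm.config.hidden_size)   # placeholder for Hs from the adapter
speech_text = torch.cat([hs, hq], dim=1)            # concatenated speech-text features
logits = llm(inputs_embeds=speech_text).logits      # next-token translation predictions
```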

  12. Speech-text prompt

  13. Optimization with Dual-LoRA Finetuning

  14. Optimization with Dual-LoRA Finetuning: LoRA is applied separately to both the speech encoder (S-LoRA) and the large language model (L-LoRA). Instruction-tuning is performed on the prediction tokens using the LLM's original auto-regressive training objective: for a target translation Ytgt of length N, p(Ytgt | Hs, Hq) = ∏ i=1..N p(yi | Hs, Hq, y<i). This allows LLaST to be tuned efficiently without extensive retraining.
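
A sketch of the Dual-LoRA setup with the peft library, reusing the whisper and llm objects from the earlier sketches; the target module names and alpha values are assumptions, while the ranks follow the hyperparameters reported on slide 21 (128 for S-LoRA, 512 for L-LoRA).

```python
# Sketch of Dual-LoRA finetuning with `peft`: S-LoRA on the speech encoder,
# L-LoRA on the LLM. Target modules and alpha values are assumptions.
from peft import LoraConfig, get_peft_model

s_lora = LoraConfig(r=128, lora_alpha=256,
                    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"])
l_lora = LoraConfig(r=512, lora_alpha=1024,
                    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

speech_encoder = get_peft_model(whisper.encoder, s_lora)  # S-LoRA on the Whisper encoder
llm = get_peft_model(llm, l_lora)                         # L-LoRA on the LLM
# Training minimizes the standard auto-regressive cross-entropy over the N target
# tokens y_1..y_N, conditioned on the concatenated speech-text features and y_<i.
```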

  15. Training with ASR Augmentation: ASR tasks are incorporated for data augmentation during training. The ASR prompt is simply modified to match the ST objective, e.g., "Transcribe the French sentence into English".
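
A small illustrative helper for the two prompt templates; the helper itself is hypothetical, and the wording follows the slide.

```python
# Illustrative prompt templates for ST and ASR-augmented samples; the helper is
# hypothetical, the wording follows the slide.
def build_prompt(task: str, src_lang: str, tgt_lang: str) -> str:
    verb = "Translate" if task == "st" else "Transcribe"
    return f"{verb} the {src_lang} sentence into {tgt_lang}"

print(build_prompt("st", "French", "English"))   # Translate the French sentence into English
print(build_prompt("asr", "French", "English"))  # Transcribe the French sentence into English
```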

  16. Inference: Prompts are constructed in the same format as in training, and decoding uses a beam search algorithm with a beam size of 5.
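
A sketch of beam-search decoding assuming the Hugging Face generate API; the function wrapper is illustrative, the beam size of 5 comes from the slide.

```python
# Sketch of inference: prompts are built as in training and decoded with beam search.
def translate_with_beam_search(llm, tokenizer, speech_text_embeds, beam_size=5):
    # speech_text_embeds: concatenated [Hs ; Hq] features from the previous slides
    output_ids = llm.generate(
        inputs_embeds=speech_text_embeds,
        num_beams=beam_size,          # beam size of 5 as reported
        max_new_tokens=256,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```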

  17. Experiments

  18. Dataset: CoVoST-2 (En→X: 15 languages; X→En: 21 languages). For monolingual experiments, six subsets with source languages translating to English are used, focusing on French→English. The multilingual setup employs the Fr→En, Es→En, De→En, It→En, Zh→En, and Ja→En subsets, plus three English-to-X subsets: En→Zh, En→Ja, and En→De.

  19. Model Architecture: A Whisper-large-v2 speech encoder (~1B parameters). The adaptor is a compact multilayer perceptron with three layers, taking 1280-dimensional inputs and adjusting its output dimension to match that of the subsequent LLM. The overall parameter count is predominantly determined by the LLM component; the models are denoted LLaST-2B, LLaST-8B, and LLaST-14B.

  20. Model Architecture

  21. Hyperparameters: Optimizer AdamW with a warmup-then-linear-decay learning rate schedule peaking at 0.0002. S-LoRA (Whisper LoRA) rank: 128; L-LoRA (LLM LoRA) rank: 512. The LLaST-8B and LLaST-14B models are trained on 32 NVIDIA A100 GPUs with a batch size of 32; the smaller LLaST-2B model is trained on 8 A100 GPUs with a batch size of 32.
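
A sketch of the optimizer and schedule described above, using the transformers scheduler helper; the warmup and total step counts are placeholders not reported on the slide.

```python
# Sketch: AdamW with a warmup-then-linear-decay schedule peaking at 2e-4.
# Warmup/total step counts are placeholders; the slide does not report them.
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(trainable_params, num_warmup_steps=1000, num_training_steps=100_000):
    optimizer = torch.optim.AdamW(trainable_params, lr=2e-4)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler
```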

  22. Comparisons with Other Models

  23. Choice of Speech Encoder

  24. Choice of Large Language Models

  25. Training with ASR Augmentation

  26. Multilingual Data Augmentation

  27. Dual-LoRA Optimization

  28. Different Size of Speech Encoder

  29. Conclusion

  30. Conclusion: Presents the development and analysis of LLaST, a novel speech translation model that harnesses LLMs. Integrating well-tuned speech encoders significantly improves speech-to-text translation performance; Dual-LoRA optimization leads to substantial gains in BLEU scores; increasing the scale of either the speech encoder or the LLM positively impacts performance; and incorporating ASR augmentation and multilingual training further enhances the model's performance.

  31. Thank You For Listening
