ZeroST: Innovative Speech Translation Framework


Explore the cutting-edge ZeroST framework, which bridges speech and text models for multilingual tasks, inspired by recent advancements in foundation models and BLIP-2. Learn about the key components like Q-Former and how ZeroST achieves zero-shot speech translation without the need for translation pairs during training.

  • Speech Translation
  • Multilingual Models
  • ZeroST Framework
  • Q-Former
  • BLIP-2


Presentation Transcript


  1. ZeroST: Zero-Shot Speech Translation Sameer Khurana, Chiori Hori, Antoine Laurent, Gordon Wichern, Jonathan Le Roux Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA Speaker : Yu-Chen Kuan

  2. OUTLINE: Introduction, ZeroST, Experiments, Conclusion 2

  3. Introduction

  4. Introduction In recent years, tremendous advancements have been made in multilingual speech foundation models (Whisper, MMS) and multilingual text foundation models (GPT). As the performance of foundation models improves, a logical next research step is to connect unimodal foundation models to perform multimodal tasks such as image captioning, speech-to-text translation, and multimodal retrieval. One recent research effort in this direction is bootstrapping language-image pre-training with frozen image encoders and large language models (BLIP-2) 4

  5. Introduction BLIP-2 bridges the representation gap between a pre-trained vision foundation model and a pre-trained LLM. The key idea in BLIP-2 is to translate the visual representations output by a pre-trained image encoder into text-like representations that can be ingested and processed by an LLM. BLIP-2 uses a Query Transformer (Q-Former) as the bridge between the image encoder and the LLM 5

  6. Q-Former The Q-Former is initialized with the pre-trained weights of BERT-base (Devlin et al., 2019), while the cross-attention layers are randomly initialized 6

  7. Introduction Inspired by BLIP-2, the authors propose a framework that connects a pre-trained multilingual speech foundation model with a pre-trained multilingual text translation model through a Q-Former. The Q-Former bridges the representation gap between the speech and text foundation models. The framework is referred to as Zero-Shot Speech Translation (ZeroST), since no translation pairs are provided during its training 7

  8. Illustration 8

  9. ZeroST

  10. Model Overview Three main modules: a speech foundation model, a Q-Former, and a text translation model 10

  11. ZeroST 11

  12. ZeroST 12

  13. Speech Foundation Model Whisper-large-v3 or pre-trained MMS. Whisper transformer encoder: 32 layers, embedding dimension of 1280, 630M parameters. MMS transformer encoder: 24 layers, embedding dimension of 1024, 300M parameters. A linear projection layer transforms the output of the speech encoder to match the Q-Former's embedding dimension before it is fed to the Q-Former 13
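
As a rough sketch of how the frozen encoder's output might be matched to the Q-Former's width (the layer sizes follow the slide, but the module and variable names are illustrative and not taken from the authors' code):

```python
import torch
import torch.nn as nn

# Dimensions from the slide: Whisper-large-v3 encoder width 1280; the target
# Q-Former/NLLB embedding width of 1024 is an assumption based on later slides.
SPEECH_DIM = 1280    # Whisper-large-v3 encoder output dimension (1024 for MMS)
QFORMER_DIM = 1024   # Q-Former / NLLB embedding dimension

# Linear projection that maps frozen speech-encoder features into the
# Q-Former's embedding space.
speech_proj = nn.Linear(SPEECH_DIM, QFORMER_DIM)

# speech_feats: (batch, frames, SPEECH_DIM) produced by the frozen encoder.
speech_feats = torch.randn(2, 1500, SPEECH_DIM)
qformer_input = speech_proj(speech_feats)   # (batch, frames, QFORMER_DIM)
```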

  14. Text Translation Model No-Language-Left-Behind (NLLB) is used as the text translation model: a transformer encoder-decoder trained on 200 x 200 text-to-text translation tasks. The 1.3B-parameter model is used. Encoder and decoder both have 24 layers. Model embedding size: 1024. Vocabulary: 256k BPE tokens 14

  15. Original Q-Former The Q-Former is initialized with the pre-trained weights of BERT-base (Devlin et al., 2019), while the cross-attention layers are randomly initialized 15

  16. Q-Former in Paper 16

  17. Q-Former Same architecture as the NLLB decoder, but the self-attention module in the Q-Former is bi-directional. The Q-Former's self-attention is initialized using the NLLB encoder's self-attention. The input of the Q-Former is a set of learnable embeddings referred to as queries; 256 is found to be the optimal number of queries. The Q-Former aims to output a representation close to the NLLB encoder's output, i.e., the NLLB decoder's input 17
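
A minimal PyTorch sketch of such a Q-Former, assuming 256 learnable queries and standard transformer decoder layers whose cross-attention attends to the projected speech features; the initialization from the NLLB encoder's self-attention is omitted, and all names and defaults are illustrative:

```python
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    """Bi-directional transformer decoder driven by learnable query embeddings.

    Illustrative only: the paper's Q-Former reuses the NLLB decoder architecture
    (24 layers) and initializes its self-attention from the NLLB encoder.
    """
    def __init__(self, num_queries=256, d_model=1024, n_layers=24, n_heads=16):
        super().__init__()
        # The learnable queries are the only "input sequence" of the Q-Former.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=4 * d_model, batch_first=True,
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, speech_feats):
        # speech_feats: (batch, frames, d_model), already projected.
        b = speech_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # No causal mask: self-attention over the queries is bi-directional,
        # and cross-attention reads the speech features as "memory".
        return self.decoder(tgt=q, memory=speech_feats)   # (batch, 256, d_model)

# Small instantiation for a quick shape check.
out = QFormerSketch(n_layers=2)(torch.randn(2, 1500, 1024))   # -> (2, 256, 1024)
```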

  18. Training Data Multilingual transcribed speech corpora: CommonVoice-v16.1 (CoVo), VoxPopuli (VP), and Multilingual Speech (MLS) 18

  19. Training Data CoVo: transcribed speech in 96 languages that intersect with the languages supported by Whisper. VP: 16 languages, English (en), German (de), French (fr), Spanish (es), Polish (pl), Italian (it), Romanian (ro), Hungarian (hu), Czech (cs), Dutch (nl), Finnish (fi), Croatian (hr), Slovak (sk), Slovene (sl), Estonian (et), and Lithuanian (lt). MLS: 8 languages, en, es, it, pl, Portuguese (pt), nl, de, fr 19

  20. Learning Process 20

  21. Learning Process Two-step learning process. In the first step, the NLLB encoder is used as the teacher to train the student Q-Former, reducing the representation gap between the Q-Former output and the NLLB text encoder output. In the second step, the NLLB decoder is used as the teacher to train the student Q-Former, removing the remaining representation gap between the Q-Former's output and the decoder's input. The first step is referred to as knowledge distillation (KD) and the second as negative log-likelihood (NLL) training; a schematic training loop is sketched below. Both steps use multilingual transcribed speech data for training 21
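
A schematic version of the two-step process, where `kd_loss` and `nll_loss` stand in for the losses described on the following slides and every module or argument name is a placeholder rather than the authors' API:

```python
def train_step1_kd(qformer, speech_encoder, nllb_encoder, kd_loss, opt, batch):
    """Step 1: distill the frozen NLLB text encoder (teacher) into the Q-Former."""
    speech, transcript = batch
    q = qformer(speech_encoder(speech))    # student: query embeddings Q
    t = nllb_encoder(transcript)           # teacher: token embeddings T (frozen)
    loss = kd_loss(q, t.detach())          # fine-grained + global KD losses
    opt.zero_grad(); loss.backward(); opt.step()
    return loss

def train_step2_nll(qformer, speech_encoder, nllb_decoder, nll_loss, opt, batch):
    """Step 2: train the Q-Former so the frozen NLLB decoder can generate y."""
    speech, transcript_ids = batch
    q = qformer(speech_encoder(speech))    # queries act as the "encoder output"
    loss = nll_loss(nllb_decoder, q, transcript_ids)   # teacher-forced NLL
    opt.zero_grad(); loss.backward(); opt.step()
    return loss
```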

  22. First Step: Knowledge Distillation 22

  23. First Step: Knowledge Distillation Trains the Q-Former on the task of speech-to-text retrieval. Given a tuple (x, y), where x is a speech waveform and y its corresponding transcript, the Q-Former transforms x ∈ R^S into a set of embeddings Q ∈ R^{q×d}, where q is the number of queries, and the NLLB text encoder transforms the corresponding transcript y into a set of embeddings T ∈ R^{m×d}, where m is the number of tokens in the transcript y. The number of queries q is fixed, while the number m of tokens is variable. Two KD losses are computed: fine-grained and global 23

  24. Fine-grained loss Q[i] ∈ R^d is the i-th query embedding and T[j] ∈ R^d is the j-th token embedding. Proj transforms the embeddings in Q and T via a linear projection followed by a Tanh non-linearity, and L2Norm normalizes the input embeddings by their L2 norm. The dot product Q[i]·T[j] then gives the cosine similarity between Q[i] and T[j] 24
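
The slide shows the fine-grained loss as an equation image; below is a hedged PyTorch sketch of one plausible form, using the ColBERT-style MaxSim matching referenced on the next slide (the exact reduction in the paper may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedKDLoss(nn.Module):
    """Fine-grained matching between query embeddings Q and token embeddings T."""
    def __init__(self, d=1024, d_proj=256):
        super().__init__()
        # Proj: linear projection followed by Tanh, applied to both Q and T.
        self.proj = nn.Sequential(nn.Linear(d, d_proj), nn.Tanh())

    def forward(self, Q, T):
        # Q: (batch, q, d) query embeddings; T: (batch, m, d) token embeddings.
        Qp = F.normalize(self.proj(Q), dim=-1)          # L2Norm(Proj(Q[i]))
        Tp = F.normalize(self.proj(T), dim=-1)          # L2Norm(Proj(T[j]))
        sim = torch.bmm(Qp, Tp.transpose(1, 2))         # (batch, q, m) cosine sims
        # MaxSim: best-matching query for every token, averaged over tokens.
        fine_sim = sim.max(dim=1).values.mean(dim=-1)   # (batch,)
        return (1.0 - fine_sim).mean()                  # similarity -> loss
```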

  25. ColBERT's text query-document retrieval model 25

  26. Global loss Average the embeddings in sets Q and T to get single embeddings q ∈ R^d and t ∈ R^d. L_Global is the cosine distance between q and t. The L2Norm and Proj layers perform the same operations as in the fine-grained KD loss. The final KD loss is computed as L_KD = λ(L_Global + L_Fine), where λ is a scaling factor (λ = 10) 26
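
A matching sketch of the global loss and the combined KD objective (the `proj` and `fine_grained_loss` arguments are the assumed counterparts of the modules in the sketch above):

```python
import torch.nn.functional as F

def global_kd_loss(Q, T, proj):
    """Cosine distance between the mean-pooled query and token embeddings."""
    q = F.normalize(proj(Q.mean(dim=1)), dim=-1)   # (batch, d_proj)
    t = F.normalize(proj(T.mean(dim=1)), dim=-1)   # (batch, d_proj)
    return (1.0 - (q * t).sum(dim=-1)).mean()      # 1 - cosine similarity

def kd_loss(Q, T, proj, fine_grained_loss, lam=10.0):
    # L_KD = lambda * (L_Global + L_Fine), with lambda = 10 as on the slide.
    return lam * (global_kd_loss(Q, T, proj) + fine_grained_loss(Q, T))
```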

  27. Adapter-I Two adapters are inserted in each layer of the speech encoder, one after the self-attention module and the other after the feed-forward module 27
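
The slide does not give the adapter internals; a common bottleneck-adapter form is sketched below, with the bottleneck width and residual connection as assumptions:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter inserted into a frozen speech-encoder layer."""
    def __init__(self, d_model=1280, bottleneck=256):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, x):
        # The frozen features pass through unchanged plus a small learned update.
        return x + self.up(self.act(self.down(x)))

# Per the slide, one such adapter would follow the self-attention output and a
# second one the feed-forward output inside every encoder layer.
```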

  28. Second Step: Negative-log-likelihood (NLL) training 28

  29. Negative Log-Likelihood Training Trains the Q-Former from the previous step so that the decoder generates the text transcription y corresponding to a speech waveform. Q is the set of query embeddings and m is the number of tokens in the transcript; the negative log-likelihood is L_NLL = −∑_{n=1}^{m} log p(y_n | y_{1:n−1}, Q). During training, the previous tokens y_{1:n−1} are the ground-truth tokens (teacher forcing). During inference, the model is conditioned on the tokens it generates 29
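
A sketch of the teacher-forced NLL computation, assuming a decoder interface that takes the previous ground-truth tokens plus the Q-Former queries (as encoder states) and returns per-position logits over the NLLB vocabulary; this interface is an assumption, not the NLLB API:

```python
import torch
import torch.nn.functional as F

def nll_loss(decoder, Q, transcript_ids):
    """Teacher-forced negative log-likelihood through the (frozen) decoder."""
    # Condition position n on the ground-truth tokens y_{1:n-1} (teacher forcing).
    decoder_input = transcript_ids[:, :-1]
    targets = transcript_ids[:, 1:]
    logits = decoder(input_ids=decoder_input, encoder_hidden_states=Q)
    # Average of -log p(y_n | y_{1:n-1}, Q) over positions and the batch.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```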

  30. Adapter-II New adapters (Adapter-II) are inserted in sequence with the adapters used in the KD step 30

  31. Experiments

  32. Evaluation Protocol The Europarl-ST benchmark is used for evaluating ZeroST. The spoken languages in Europarl-ST are X = {en, fr, de, it, es, pt, pl, ro, nl}. Each spoken language L ∈ X is paired with its text translations in the other languages X \ L, giving 72 translation tasks in total; an evaluation score is reported for each L ∈ X. The evaluation score for a language L is computed by averaging the BLEU-4 scores of the eight translation tasks L → L′, L′ ∈ X \ L 32
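
A small sketch of the per-language score, assuming `translations[L][Lp]` and `references[L][Lp]` hold system outputs and references for the task L → Lp (these data structures are illustrative; sacrebleu is used for BLEU-4):

```python
import sacrebleu

def language_score(L, langs, translations, references):
    """Average BLEU-4 over the eight tasks L -> Lp, Lp in X \\ L."""
    scores = []
    for Lp in langs:
        if Lp == L:
            continue
        bleu = sacrebleu.corpus_bleu(translations[L][Lp], [references[L][Lp]])
        scores.append(bleu.score)
    return sum(scores) / len(scores)
```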

  33. Training Detail The VP, MLS, and CoVo corpora are combined into a corpus referred to as Big. All models are trained on 8 A100 GPUs for 100k iterations, except the models trained on Big, which are trained on 64 GPUs for 400k iterations. Batch size: approximately 2.6 hours of transcribed speech, or 2.5 minutes per GPU. Adam optimizer with a learning rate of 1e-4. Three-phase learning rate scheduler with the setting [0.1, 0.4, 0.5]: the learning rate is warmed up to 1e-4 during the first 10% of the training iterations, remains constant for the next 40%, and decays for the remaining 50% 33
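
A sketch of the three-phase schedule as a `LambdaLR` (the linear shape of the warmup and decay phases is an assumption; only the 10%/40%/50% split is from the slide):

```python
import torch

def three_phase_lambda(total_steps, phases=(0.1, 0.4, 0.5)):
    """Warm up for 10% of steps, hold for 40%, then decay for the final 50%."""
    warm = int(phases[0] * total_steps)
    hold = int(phases[1] * total_steps)
    def lr_lambda(step):
        if step < warm:                                  # linear warmup to peak LR
            return step / max(1, warm)
        if step < warm + hold:                           # constant phase
            return 1.0
        decay = total_steps - warm - hold                # linear decay phase
        return max(0.0, 1.0 - (step - warm - hold) / max(1, decay))
    return lr_lambda

model = torch.nn.Linear(8, 8)                            # stand-in for the Q-Former
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=three_phase_lambda(100_000))
```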

  34. Q-Former Q-Simple: a Q-Former with no transformer layers. The queries are applied directly to the output of the speech encoder as Q = softmax(W Kᵀ)V, where W ∈ R^{256×d} holds the learnable queries, K ∈ R^{n×d} is the output of the pre-trained speech encoder, and V = K. Q-Lite: a bi-directional transformer decoder with four layers, an embedding size of 768, 4 attention heads in each layer, and a feed-forward layer dimension of 3072 34
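
Q-Simple reduces to attention pooling with learnable queries; a direct sketch of Q = softmax(W Kᵀ)V with V = K (variable names follow the slide, everything else is illustrative):

```python
import torch
import torch.nn as nn

class QSimple(nn.Module):
    """Q-Simple: no transformer layers, only learnable-query attention pooling."""
    def __init__(self, num_queries=256, d=1024):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_queries, d) * 0.02)   # W in R^{256 x d}

    def forward(self, K):
        # K: (batch, n, d) speech-encoder output; V = K.
        attn = torch.softmax(self.W @ K.transpose(1, 2), dim=-1)    # (batch, 256, n)
        return attn @ K                                             # Q: (batch, 256, d)

Q = QSimple()(torch.randn(2, 1500, 1024))   # -> (2, 256, 1024)
```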

  35. Q-Former Q-NLLB: same architecture as the NLLB decoder but with bi-directional self-attention. The self-attention module of Q-NLLB is initialized with the self-attention of the NLLB encoder 35

  36. Baselines and Topline The Whisper-NLLB cascade model transcribes speech waveforms using Whisper-large-v3, and the NLLB model then translates the transcripts into text in the target language. The NLLB topline uses the ground-truth text transcripts of the speech utterances in the Europarl-ST benchmark and translates them into the desired target language using the NLLB model 36
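
A hedged sketch of the Whisper-NLLB cascade using Hugging Face Transformers; the checkpoint names, language codes, and file path are assumptions and not taken from the paper's setup:

```python
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

# 1) Transcribe the speech waveform with Whisper.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
transcript = asr("utterance.wav")["text"]

# 2) Translate the transcript with NLLB (here: English -> French).
tok = AutoTokenizer.from_pretrained("facebook/nllb-200-1.3B", src_lang="eng_Latn")
nllb = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-1.3B")
inputs = tok(transcript, return_tensors="pt")
out = nllb.generate(
    **inputs,
    forced_bos_token_id=tok.convert_tokens_to_ids("fra_Latn"),   # target language
    max_new_tokens=256,
)
print(tok.batch_decode(out, skip_special_tokens=True)[0])
```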

  37. Issues Is NLL or KD training alone sufficient for ZeroST? Comparing different Q-Formers. Impact of training data size. Comparing speech encoders. Comparison with toplines 37

  38. Impact of Varying The Number of Queries 38

  39. Generating translations into unseen target languages The ZeroST model is trained on transcribed speech in the en, fr, de, it, and es languages of the VoxPopuli corpus. During inference, the ZeroST model performs the following 15 translation tasks: X = {en, fr, de, it, es} → Y = {pl, ro, nl}. Speech and text in the languages in X are seen during training, while the languages in Y are unseen during ZeroST training. The ZeroST model achieves an average BLEU-4 score of 13.5 on the 15 translation tasks 39

  40. Generating translations into unseen target languages The ZeroST model is trained on transcribed speech in the en, fr, de, it, and es languages of the VoxPopuli corpus. During inference, the ZeroST model performs the following 15 translation tasks: X = {en, fr, de, it, es} → Y = {pl, ro, nl}. Speech and text in the languages in X are seen during training, while the languages in Y are unseen during ZeroST training. The ZeroST model achieves an average BLEU-4 score of 13.5 on the 15 translation tasks, comparable with the 14.1 average BLEU-4 achieved by the ZeroST model trained on transcribed speech in all the languages of the VoxPopuli corpus, including the languages in Y 40

  41. Conclusion

  42. Conclusion The paper presents a promising approach to zero-shot speech-to-text translation. It proposes the ZeroST model, which connects a pre-trained multilingual speech foundation model with a transformer-based multilingual text-to-text translation model. A Q-Former bridges the gap between the foundation models and is trained using a two-step learning process that uses NLLB as the teacher. Zero-shot translation results on Europarl-ST verify the claim that zero-shot multilingual speech-to-text translation is possible using only multilingual transcribed speech data. The model achieves better results than a strong cascade and is comparable to the topline 42

  43. Thank You For Listening
