
End-to-End Speech Translation using XSTNet
A comprehensive overview of end-to-end speech translation with the XSTNet model. It discusses the challenges of training E2E ST models and introduces XSTNet's key features: support for both audio and text input, a shared Transformer module, and self-supervised audio representation learning. It also examines the problem formulation, the speech encoder configuration, and the encoder-decoder mechanism with modality and language indicators.
Presentation Transcript
End-to-end Speech Translation via Cross-modal Progressive Training
Rong Ye, Mingxuan Wang, Lei Li (ByteDance AI Lab, Shanghai, China)
Speaker: Yu-Chen Kuan
OUTLINE
Introduction, XSTNet, Experiments, Conclusion
Introduction
Training E2E ST models is challenging because parallel speech-text data is limited.
Common remedies: multi-task supervision with speech-transcript-translation triple data, and pre-training on external large-scale MT parallel text.
This work proposes the Cross Speech-Text Network (XSTNet) for end-to-end ST, which jointly trains the ST, ASR, and MT tasks.
XSTNet
Supports either audio or text input.
Shares a single Transformer module across tasks.
Uses a self-supervised pre-trained Wav2vec 2.0 representation of the audio.
Incorporates external large-scale MT data.
Trained with a progressive multi-task learning strategy under the pre-training and fine-tuning paradigm.
XSTNet (model architecture figure)
Problem Formulation
Speech-transcript-translation triples: D = {(s, x, y)}.
Derived pairs: D_ASR = {(s, x)}, D_ST = {(s, y)}, and D_MT = {(x, y)}.
External MT dataset: D_MT-ext = {(x', y')}, with |D_MT-ext| >> |D|.
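For concreteness, a toy Python sketch of how the three parallel pairs fall out of each triple (file names and sentences are placeholders, not from the paper's data):

```python
# D = {(s, x, y)}: speech, transcript, translation
triples = [("speech_001.wav", "This is a book.", "C'est un livre.")]

D_ASR = [(s, x) for s, x, y in triples]  # speech -> transcript
D_ST  = [(s, y) for s, x, y in triples]  # speech -> translation
D_MT  = [(x, y) for s, x, y in triples]  # transcript -> translation
```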
Speech Encoder
Wav2vec 2.0 (without fine-tuning): a self-supervised pre-trained contextual audio representation c = [c_1, ..., c_T].
Convolution layers match the lengths of the audio representation and text sequences: two 1-dimensional convolutional layers with stride 2 and GELU activation, reducing the time dimension by a factor of 4.
e_s = CNN(c), with e_s ∈ R^{d × T/4}, CNN kernel size 5, and hidden size d = 512.
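A minimal PyTorch sketch of such a subsampler, assuming the 768-dimensional wav2vec 2.0 base features; the module and variable names are illustrative, not taken from the paper's code:

```python
import torch
import torch.nn as nn

class ConvSubsampler(nn.Module):
    """Two stride-2 1-D conv layers with GELU, shrinking time by ~4x."""
    def __init__(self, in_dim=768, hidden_dim=512, kernel_size=5):
        super().__init__()
        self.conv1 = nn.Conv1d(in_dim, hidden_dim, kernel_size,
                               stride=2, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size,
                               stride=2, padding=kernel_size // 2)
        self.act = nn.GELU()

    def forward(self, c):
        # c: wav2vec 2.0 features, shape (batch, T, in_dim)
        x = c.transpose(1, 2)          # (batch, in_dim, T)
        x = self.act(self.conv1(x))    # (batch, hidden_dim, ~T/2)
        x = self.act(self.conv2(x))    # (batch, hidden_dim, ~T/4)
        return x.transpose(1, 2)       # e_s: (batch, ~T/4, d=512)
```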
Encoder-Decoder with Modality and Language Indicators
Different indicator tokens [src_tag] distinguish the three tasks and the audio/text inputs.
For audio input: an extra [audio] token with embedding e_[audio] ∈ R^d; the audio embedding e ∈ R^{d × (T/4+1)} is the concatenation of e_[audio] and e_s.
For text input: put the language-id symbol before the sentence, e.g. "[en] This is a book."
When decoding, the language-id symbol serves as the initial token to predict the output text.
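An illustrative sketch of attaching the indicators (tensor shapes follow the slide; function and variable names are hypothetical):

```python
import torch

def prepend_audio_tag(e_s, audio_tag_emb):
    """Concatenate the learned [audio] embedding in front of e_s."""
    # e_s: (batch, T/4, d); audio_tag_emb: (d,)
    batch = e_s.size(0)
    tag = audio_tag_emb.view(1, 1, -1).expand(batch, 1, -1)
    return torch.cat([tag, e_s], dim=1)   # (batch, T/4 + 1, d)

def tag_text(tokens, lang_id="[en]"):
    """Put the language-id symbol before a tokenized sentence."""
    return [lang_id] + tokens

# tag_text(["This", "is", "a", "book", "."])
#   -> ["[en]", "This", "is", "a", "book", "."]
```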
Progressive Multi-task Training
Large-scale MT pre-training: first pre-train the Transformer encoder-decoder module on external MT data.
Progressive Multi-task Training
Multi-task fine-tuning: combine the external MT data with the ST, ASR, and MT parallel data from the in-domain speech translation dataset, and jointly optimize the negative log-likelihood loss.
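Written out, the jointly optimized objective is the sum of the per-task negative log-likelihoods; this is a reconstruction from the slide's description, not a formula quoted from the paper:

```latex
\mathcal{L}(\theta) =
  -\sum_{(s,y)\in D_{ST}} \log p(y \mid s;\theta)
  -\sum_{(s,x)\in D_{ASR}} \log p(x \mid s;\theta)
  -\sum_{(x,y)\in D_{MT}\,\cup\,D_{MT\text{-}ext}} \log p(y \mid x;\theta)
```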
Dataset
ST datasets: MuST-C, Augmented LibriSpeech (En-Fr).
External MT datasets: WMT, OPUS100, OpenSubtitles.
Experimental Setups
SentencePiece subword units with a vocabulary size of 10k.
Select the checkpoint with the best BLEU on the dev set and average the last 10 checkpoints.
Beam size of 10 for decoding; BLEU computed with sacreBLEU.
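A short sketch of this tooling using the public sentencepiece and sacrebleu packages (file names and the example sentences are placeholders):

```python
import sentencepiece as spm
import sacrebleu

# Train a 10k-subword SentencePiece model on the text side of the data.
spm.SentencePieceTrainer.train(
    input="train.txt", model_prefix="spm10k", vocab_size=10000)

# Score detokenized hypotheses against references with sacreBLEU.
hyps = ["C'est un livre."]
refs = [["C'est un livre."]]
print(sacrebleu.corpus_bleu(hyps, refs).score)
```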
MuST-C (main results table)
MuST-C
Wav2vec2+Transformer model (abbreviated W-Transf.) with the same configuration as XSTNet.
Ablations: Wav2vec 2.0 vs. Fbank features (W-Transf. vs. the Transformer ST baseline); multi-task vs. ST-only training (XSTNet (Base) vs. W-Transf.); the effect of additional MT data.
The Influence of the Training Procedure
MT pre-training is effective.
Do not stop training on the data from the previous stage.
Multi-task fine-tuning is preferred.
Convergence Analysis
Progressive multi-task training converges faster.
Multi-task training generalizes better.
Conclusion
The Cross Speech-Text Network (XSTNet): a concise model that accepts bi-modal inputs and jointly trains the ST, ASR, and MT tasks.
A progressive multi-task training algorithm.
Significant improvement on the speech-to-text translation task compared with SOTA models.