End-to-End Speech Translation using XSTNet

A comprehensive overview of end-to-end speech translation using the XSTNet model. It discusses the challenges of training E2E ST models and introduces XSTNet's key features: support for both audio and text input, a shared Transformer module, self-supervised audio representation learning, and more. It also examines the problem formulation, the speech encoder configuration, and the encoder-decoder mechanism with modality and language indicators.

  • Speech Translation
  • XSTNet
  • End-to-End
  • Language Processing
  • AI


Presentation Transcript


  1. End-to-end Speech Translation via Cross-modal Progressive Training
     Rong Ye, Mingxuan Wang, Lei Li (ByteDance AI Lab, Shanghai, China)
     Speaker: Yu-Chen Kuan

  2. OUTLINE: Introduction, XSTNet, Experiments, Conclusion

  3. Introduction

  4. Introduction
     Training E2E ST models is challenging: parallel speech-text data is limited.
     Common remedies: multi-task supervision with speech-transcript-translation triple data, and pre-training with external large-scale MT parallel text data.
     This work proposes the Cross Speech-Text Network (XSTNet) for end-to-end ST, which jointly trains the ST, ASR and MT tasks.

  5. XSTNet

  6. XSTNet
     Supports either audio or text input
     Shares a single Transformer module across tasks
     Uses a self-supervised pre-trained wav2vec 2.0 representation of the audio
     Incorporates external large-scale MT data
     Progressive multi-task learning strategy with a pre-training and fine-tuning paradigm

  7. XSTNet

  8. Problem Formulation
     Speech-transcript-translation triples: D = {(s, x, y)}
     Derived pairs: D_ASR = {(s, x)}, D_ST = {(s, y)} and D_MT = {(x, y)}
     External MT dataset: D_MT-ext = {(x', y')}, with |D_MT-ext| >> |D|
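To make the data setup concrete, here is a minimal Python sketch (not the authors' code) of how the ASR, ST and MT pairs can be derived from the speech-transcript-translation triples; the tuple layout and the names are illustrative assumptions.

    from typing import List, Tuple

    Triple = Tuple[str, str, str]  # (speech_path, transcript, translation) -- assumed layout

    def split_triples(D: List[Triple]):
        """Derive the ASR, ST and MT pair datasets from the triple dataset D."""
        D_asr = [(s, x) for s, x, y in D]   # speech -> transcript
        D_st = [(s, y) for s, x, y in D]    # speech -> translation
        D_mt = [(x, y) for s, x, y in D]    # transcript -> translation
        return D_asr, D_st, D_mt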

  9. Speech Encoder
     wav2vec 2.0 (without fine-tuning): self-supervised pre-trained contextual audio representation c = [c_1, ..., c_T]
     Convolution layers: match the lengths of the audio representation and the text sequences
     Two 1-dimensional convolutional layers with stride 2 and GELU activation, reducing the time dimension by a factor of 4
     e_s = CNN(c), e_s ∈ R^{d × T/4}, with CNN kernel size 5 and hidden size d = 512
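A minimal PyTorch sketch of such a convolutional subsampler, assuming the 768-dimensional output of the wav2vec 2.0 base model and padding chosen to keep roughly T/4 frames; this illustrates the described configuration and is not the authors' implementation.

    import torch
    import torch.nn as nn

    class ConvSubsampler(nn.Module):
        # Two 1-D convolutions with kernel size 5, stride 2 and GELU,
        # shrinking the wav2vec 2.0 frame sequence by a factor of 4.
        def __init__(self, in_dim: int = 768, d: int = 512):
            super().__init__()
            self.conv1 = nn.Conv1d(in_dim, d, kernel_size=5, stride=2, padding=2)
            self.conv2 = nn.Conv1d(d, d, kernel_size=5, stride=2, padding=2)
            self.act = nn.GELU()

        def forward(self, c: torch.Tensor) -> torch.Tensor:
            # c: (batch, T, in_dim) contextual audio representation
            x = c.transpose(1, 2)            # (batch, in_dim, T)
            x = self.act(self.conv1(x))      # (batch, d, ~T/2)
            x = self.act(self.conv2(x))      # (batch, d, ~T/4)
            return x.transpose(1, 2)         # e_s: (batch, ~T/4, d)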

  10. Encoder-Decoder with Modality and Language Indicators
      Different indicator tokens [src_tag] distinguish the three tasks and the audio/text inputs.
      For audio input: an extra [audio] token with embedding e_[audio] ∈ R^d; the audio embedding e ∈ R^{d × (T/4+1)} is the concatenation of e_[audio] and e_s.
      For text input: the language-id symbol is placed before the sentence, e.g. "[en] This is a book."
      When decoding, the language-id symbol serves as the initial token to predict the output text.
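Below is a hedged sketch of how the [audio] and language-id indicators could be attached to the encoder input; the shared embedding table, vocabulary size and token ids are illustrative assumptions, not the paper's actual vocabulary.

    import torch
    import torch.nn as nn

    class IndicatorEmbedder(nn.Module):
        def __init__(self, vocab_size: int = 10000, d: int = 512, audio_id: int = 1):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d)  # shared token embedding table
            self.audio_id = audio_id                  # id of the special [audio] token (assumed)

        def audio_input(self, e_s: torch.Tensor) -> torch.Tensor:
            # Prepend the [audio] embedding to the subsampled audio features e_s.
            b = e_s.size(0)
            tag = torch.full((b, 1), self.audio_id, dtype=torch.long, device=e_s.device)
            return torch.cat([self.embed(tag), e_s], dim=1)   # (batch, T/4 + 1, d)

        def text_input(self, token_ids: torch.Tensor, lang_id: int) -> torch.Tensor:
            # Prepend the language-id symbol (e.g. "[en]") to the token sequence.
            b = token_ids.size(0)
            tag = torch.full((b, 1), lang_id, dtype=torch.long, device=token_ids.device)
            return self.embed(torch.cat([tag, token_ids], dim=1))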

  11. Progressive Multi-task Training
      Large-scale MT pre-training: first pre-train the Transformer encoder-decoder module using the external MT data.

  12. Progressive Multi-task Training
      Multi-task fine-tuning: combine the external MT data with the ST, ASR, and MT parallel data from the in-domain speech translation dataset, and jointly optimize the negative log-likelihood loss.
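A rough sketch of the joint objective: one optimization step sums the per-task negative log-likelihoods over a batch drawn from each data source (ST, ASR, MT, external MT). The model/batch interfaces and the padding id are assumptions made for illustration.

    import torch
    import torch.nn.functional as F

    def multitask_step(model, task_batches, pad_id: int = 0):
        # task_batches: dict mapping a task name ("st", "asr", "mt", "mt_ext")
        # to an (inputs, target_ids) pair drawn from that task's dataset.
        total = 0.0
        for task, (inputs, targets) in task_batches.items():
            logits = model(inputs, task=task)                 # (batch, len, vocab)
            nll = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                targets.reshape(-1),
                ignore_index=pad_id,                          # skip padding positions
            )
            total = total + nll
        return total  # jointly optimized negative log-likelihood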

  13. Experiments

  14. Datasets
      ST datasets: MuST-C, Augmented LibriSpeech (En-Fr)
      External MT datasets: WMT, OPUS100, OpenSubtitles

  15. Experimental Setup
      SentencePiece subword units with a vocabulary size of 10k
      Select the checkpoint with the best BLEU on the dev set and average the last 10 checkpoints
      Beam size of 10 for decoding
      BLEU computed with sacreBLEU
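For reference, a small sketch of this setup using the SentencePiece and sacreBLEU Python APIs; the file name and the example sentences are placeholders.

    import sentencepiece as spm
    import sacrebleu

    # Train a subword model with a 10k vocabulary (file name is a placeholder).
    spm.SentencePieceTrainer.train(
        input="train.en-de.txt", model_prefix="spm_10k", vocab_size=10000
    )

    # Score detokenized hypotheses against references with sacreBLEU.
    hypotheses = ["Das ist ein Buch."]
    references = [["Dies ist ein Buch."]]
    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(f"BLEU = {bleu.score:.2f}")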

  16. MuST-C

  17. MuST-C
      Wav2vec2+Transformer model (abbreviated W-Transf.) with the same configuration as XSTNet
      wav2vec 2.0 vs. Fbank features (W-Transf. vs. the Transformer ST baseline)
      Multi-task vs. ST-only (XSTNet (Base) vs. W-Transf.)
      Additional MT data

  18. Uncased-tokenized vs. Cased-detokenized BLEU

  19. Results on Auxiliary MT and ASR Tasks

  20. Comparison with Cascaded Baselines

  21. The Influence of Training Procedure

  22. The Influence of Training Procedure
      MT pre-training is effective
      Don't stop training on the data from the previous stage
      Multi-task fine-tuning is preferred

  23. Convergence Analysis

  24. Convergence Analysis
      Progressive multi-task training converges faster
      Multi-task training generalizes better

  25. Influence of Additional MT Data

  26. Conclusion

  27. Conclusion
      Cross Speech-Text Network (XSTNet): a concise model that accepts bi-modal inputs and jointly trains the ST, ASR and MT tasks
      Progressive multi-task training algorithm
      Significant improvement on the speech-to-text translation task compared with SOTA models

  28. Thank You For Listening
