Efficient Conformers via Sharing Experts for Speech Recognition


"Explore parameter-efficient Conformers employing Sparsely-Gated Experts for precise end-to-end speech recognition. Leveraging knowledge distillation and weight-sharing mechanisms, this approach achieves competitive performance with significantly reduced encoder parameters. Discover attention-based encoder-decoder models for acoustic feature sequences and Conformer-based Seq2Seq models for ASR text prediction. Delve into Conformer encoders, MoE modules, and parameter-sharing techniques for effective model optimization."

  • Conformers
  • Speech Recognition
  • Parameter-Efficient
  • Sparsely-Gated Experts
  • Seq2Seq Models




Presentation Transcript


  1. Parameter-Efficient Conformers via Sharing Sparsely-Gated Experts for End-to-End Speech Recognition. Ye Bai, Jie Li, Wenjing Han, Hao Ni, Kaituo Xu, Zhuo Zhang, Cheng Yi, Xiaorui Wang. Kuaishou Technology Co., Ltd, Beijing, China.

  2. Index Terms: parameter-efficient, sparsely-gated mixture-of-experts, Conformer (Transformer + CNN), cross-layer weight-sharing, knowledge distillation.

  3. Introduction: parameter-efficient Conformers + shared sparsely-gated experts + knowledge distillation = competitive performance with about 1/3 of the encoder parameters of the full-parameter model.

  4. Background: Conformer-based Seq2Seq Models for ASR. Attention-based encoder-decoder (AED) models: the encoder captures high-level representations from the acoustic features; the decoder predicts the text sequence token by token with the attention mechanism. The acoustic feature sequence is X = [x0, x1, x2, ..., xT-1] of length T; the text token sequence is Y = [y0, y1, y2, ..., yS] of length S+1, including the start-of-sentence symbol <sos> and the end-of-sentence symbol <eos>.

  5. Background: Conformer-based Seq2Seq Models for ASR. The model Trfm predicts the probability of each text token, and the model is trained with the maximum likelihood criterion:
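
The slide's equations are not reproduced in the transcript. Using the definitions from the previous slide, a standard AED formulation (the paper's exact notation may differ) is:

```latex
% Token probability predicted by the model Trfm (standard AED formulation; notation assumed)
P(y_s \mid y_{0:s-1}, X) = \mathrm{Trfm}(y_{0:s-1}, X)

% Maximum likelihood (cross-entropy) training criterion
\mathcal{L}_{\mathrm{MLE}} = -\sum_{s=1}^{S} \log P(y_s \mid y_{0:s-1}, X)
```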

  6. Conformer encoder

  7. Sharing Sparsely-Gated Experts: build a parameter-efficient model from sparsely-gated MoE modules (an illustrative sketch follows below).
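
A minimal PyTorch sketch of a sparsely-gated MoE FFN with top-k routing, assuming the standard router-plus-experts structure; all sizes (d_model, d_ff, num_experts, top_k) are illustrative defaults, not the paper's settings.

```python
# Illustrative sketch (not the authors' code) of a sparsely-gated MoE FFN:
# a router picks the top-k experts per frame and mixes their outputs by the gate weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparselyGatedFFN(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, num_experts=4, top_k=1):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)      # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                   # x: (batch, time, d_model)
        gate_probs = F.softmax(self.router(x), dim=-1)      # (batch, time, num_experts)
        top_w, top_idx = gate_probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (top_idx[..., k] == e)               # frames routed to expert e
                if mask.any():
                    out[mask] += top_w[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```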

  8. Parameter-Sharing for Conformers: the FFN modules use Swish activation functions; the multi-head self-attention module uses relative positional encodings; the convolution module is a time-depth-separable-style convolutional block with GLU and Swish activation functions (a block sketch follows below).
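
A compact sketch of a Conformer block matching this description (macaron Swish FFNs, self-attention, and a convolution module with GLU). Relative positional encoding is omitted for brevity, so this is illustrative rather than the authors' implementation.

```python
# Minimal Conformer block: half-step FFN -> MHSA -> convolution module -> half-step FFN,
# each wrapped in a residual connection. Sizes are illustrative defaults.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConformerBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024, kernel_size=15):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_ff),
                                  nn.SiLU(), nn.Linear(d_ff, d_model))          # Swish FFN
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)   # rel. PE omitted
        self.conv_norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, 1)                    # expand for GLU
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.pointwise2 = nn.Conv1d(d_model, d_model, 1)
        self.ffn2 = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_ff),
                                  nn.SiLU(), nn.Linear(d_ff, d_model))
        self.out_norm = nn.LayerNorm(d_model)

    def forward(self, x):                                    # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn1(x)
        q = self.attn_norm(x)
        a, _ = self.attn(q, q, q)
        x = x + a
        c = self.conv_norm(x).transpose(1, 2)                # -> (batch, d_model, time)
        c = F.glu(self.pointwise1(c), dim=1)                 # GLU gate
        c = F.silu(self.depthwise(c))                        # Swish after depthwise conv
        x = x + self.pointwise2(c).transpose(1, 2)
        x = x + 0.5 * self.ffn2(x)
        return self.out_norm(x)
```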

  9. Parameter-Sharing for Conformers
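
This slide is a figure that is not reproduced in the transcript. The sketch below only illustrates the general idea of cross-layer weight sharing named in the index terms: one set of block parameters reused for every encoder layer. The paper's actual sharing scheme (e.g. which modules are shared, and across how many layers) may differ.

```python
# Sketch of cross-layer weight sharing: the encoder reuses a single block's parameters
# across all layers instead of instantiating a new block per layer.
import torch.nn as nn

class SharedConformerEncoder(nn.Module):
    def __init__(self, block: nn.Module, num_layers: int = 12):
        super().__init__()
        self.block = block            # a single ConformerBlock shared by every layer
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.block(x)         # same weights applied repeatedly
        return x
```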

  10. Dynamic Routing for Mixture of Experts: to encourage balanced use of all the experts, a load-balancing loss is used as follows:
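
The loss itself is not reproduced in the transcript. A commonly used load-balancing loss for E experts over N frames (the standard form from sparsely-gated MoE work; the paper's exact expression may differ) is:

```latex
% Assumed standard load-balancing loss; g_e(x_n) is the router's gate probability for expert e.
\mathcal{L}_{\mathrm{balance}} = E \sum_{e=1}^{E} f_e \, p_e,
\qquad
f_e = \frac{1}{N}\sum_{n=1}^{N} \mathbb{1}\!\left\{\operatorname*{argmax}_{i} g_i(x_n) = e\right\},
\qquad
p_e = \frac{1}{N}\sum_{n=1}^{N} g_e(x_n)
```

Here f_e is the fraction of frames routed to expert e and p_e is the mean gate probability assigned to expert e; the loss is minimized when routing is uniform across experts.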

  11. Individual Routers and Normalization
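
The transcript carries no body text for this slide. Purely as an assumption about what "individual routers and normalization" could mean, the sketch below keeps a separate router per shared layer and normalizes the router input; this is not taken from the paper.

```python
# Hypothetical sketch: expert FFNs are shared across layers, but each layer keeps its own
# router, and the router input is layer-normalized before gating. Assumption, not the paper.
import torch.nn as nn

class PerLayerRouters(nn.Module):
    def __init__(self, d_model=256, num_experts=4, num_layers=12):
        super().__init__()
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(num_layers))
        self.routers = nn.ModuleList(nn.Linear(d_model, num_experts) for _ in range(num_layers))

    def gate_logits(self, x, layer_idx):
        # Each shared layer queries its own (unshared) router on normalized features.
        return self.routers[layer_idx](self.norms[layer_idx](x))
```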

  12. Distilling Knowledge from Hidden Embeddings: use knowledge distillation to transfer knowledge from a full-parameter model; minimize the L2 distance between the outputs of the shared-parameter encoder (student) and the full-parameter encoder (teacher):
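
The distillation loss is stated as an L2 distance between the two encoders' outputs; written out (with per-frame averaging assumed, since the slide's equation is not reproduced), it reads:

```latex
% Squared L2 distance between student and teacher encoder outputs at each frame t.
\mathcal{L}_{\mathrm{KD}} = \frac{1}{T}\sum_{t=0}^{T-1}
  \left\lVert \mathbf{h}^{\mathrm{student}}_{t} - \mathbf{h}^{\mathrm{teacher}}_{t} \right\rVert_{2}^{2}
```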

  13. Learning: the model is trained by minimizing the overall loss:
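
The overall loss is not reproduced in the transcript. Assuming it combines the terms introduced on the previous slides with interpolation weights (the λ symbols below are hypothetical), it has the form:

```latex
% Assumed composition of the previously named terms; the weighting scheme is an assumption.
\mathcal{L} = \mathcal{L}_{\mathrm{MLE}}
            + \lambda_{\mathrm{balance}} \, \mathcal{L}_{\mathrm{balance}}
            + \lambda_{\mathrm{KD}} \, \mathcal{L}_{\mathrm{KD}}
```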

  14. Relation to Prior Work: Conditional computation with mixture-of-experts (MoE) has been shown to be an effective way to scale the capacity of neural networks without a proportional increase in computation. Cross-layer weight sharing was first used in Transformers with adaptive computation time; ALBERT uses this technique to reduce the number of parameters of BERT.

  15. Experiments: 150 hours of speech for training, about 18 hours for development, and about 10 hours for test; 80-dimensional Mel filter-bank features (FBANK) as the input; a 2-layer CNN frontend; 4 experts for the second FFN (a frontend sketch follows below).
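
The slide names an 80-dimensional FBANK input and a 2-layer CNN but gives no further detail. The sketch below is a typical Conformer-style convolutional subsampling frontend; the strides, channel counts, and projection are assumed values, not taken from the paper.

```python
# Typical 2-layer CNN subsampling over 80-dim FBANK features (downsamples time by ~4).
import torch
import torch.nn as nn

class ConvSubsampling(nn.Module):
    def __init__(self, d_model=256, channels=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, stride=2), nn.ReLU(),
        )
        self.proj = nn.LazyLinear(d_model)      # flattens (channels x reduced freq) -> d_model

    def forward(self, fbank):                   # fbank: (batch, time, 80)
        x = self.conv(fbank.unsqueeze(1))       # -> (batch, channels, time', freq')
        b, c, t, f = x.shape
        return self.proj(x.permute(0, 2, 1, 3).reshape(b, t, c * f))
```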

  16. Experiments

  17. Experiments

  18. Thanks for listening.
