
Multi-Blank Transducers for Speech Recognition
An overview of the multi-blank approach for RNN-T speech recognition models: big blank symbols that explicitly model duration, a modified loss function, and the resulting inference speedups and accuracy gains over standard RNN-Ts.
MULTI-BLANK TRANSDUCERS FOR SPEECH RECOGNITION
Hainan Xu¹, Fei Jia¹, Somshubra Majumdar¹, Shinji Watanabe², Boris Ginsburg¹
¹NVIDIA, USA  ²Carnegie Mellon University, PA, USA
Speaker: Yu-Chen Kuan
OUTLINE
- Introduction
- Multi-blank RNN-T
- Logits Under-Normalization
- Experiments
- Analysis
- Conclusion and Future Work
Introduction
AED, CTC, and RNN-T are the most commonly used model types in ASR systems. Considerable research effort has gone into improving these models:
- for computational efficiency
- for more flexible training scenarios
- with different types of regularization methods
Introduction
This paper focuses on RNN-T models. RNN-T achieves strong performance on speech recognition, but it suffers from slow inference speed and is difficult to train due to its model structure and memory footprint.
Introduction
A less investigated area of RNN-Ts: the blank symbol and the loss function. The paper proposes a multi-blank method. Unlike standard RNN-Ts, which have a single blank symbol, it adds blank symbols that explicitly model duration and advance the input t dimension by two or more frames.
Blank symbol in RNN-T
RNN-T does not need alignment information during training. A label sequence can be augmented by adding an arbitrary number of blanks at any position in the sequence. For any input sequence, training maximizes the sum of probabilities over all augmented sequences of the correct labels.
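For reference, this is the standard RNN-T objective (Graves, 2012); the formula is not on the slide but follows directly from the description:

$$P(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{a} \in \mathcal{B}^{-1}(\mathbf{y})} P(\mathbf{a} \mid \mathbf{x})$$

where $\mathcal{B}$ removes all blanks from an alignment $\mathbf{a}$, so $\mathcal{B}^{-1}(\mathbf{y})$ is the set of all blank-augmented versions of the label sequence $\mathbf{y}$.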
Blank symbol in RNN-T
Typical RNN-T output: during inference, an RNN-T model emits at least one token per input frame, since advancing to the next frame requires a blank emission. These blank symbols are removed in post-processing to produce the final ASR output.
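As a toy illustration (not from the slides), the post-processing step is just a filter over the emitted token sequence; "<b>" is a made-up blank marker:

```python
# Toy example: removing blanks from a raw RNN-T emission sequence.
raw = ["<b>", "C", "<b>", "<b>", "A", "T", "<b>"]  # "<b>" = blank (illustrative)
text = "".join(tok for tok in raw if tok != "<b>")
assert text == "CAT"
```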
Multi-blank RNN-T
A typical RNN-T model generates more blank symbols than non-blanks during inference, so it spends a lot of computation generating tokens that never appear in the final output.
Multi-blank RNN-T
Introduces big blank symbols: blank symbols with explicitly defined durations. Once emitted, a big blank advances t by more than one frame, e.g. two or three. A multi-blank model can use an arbitrary number of blanks with different durations, described by a set N containing all possible blank durations; standard RNN-Ts correspond to N = {1}.
Forward-backward algorithm
For standard RNN-Ts, the forward weights $\alpha(t,u)$ and backward weights $\beta(t,u)$ are computed as:

$\alpha(t,u) = \alpha(t-1,u)\,\varnothing(t-1,u) + \alpha(t,u-1)\,y(t,u-1)$
$\beta(t,u) = \beta(t+1,u)\,\varnothing(t,u) + \beta(t,u+1)\,y(t,u)$

For multi-blank RNN-Ts, with a predefined N, the blank term sums over all blank durations $m \in N$:

$\alpha(t,u) = \sum_{m \in N} \alpha(t-m,u)\,\varnothing_m(t-m,u) + \alpha(t,u-1)\,y(t,u-1)$
$\beta(t,u) = \sum_{m \in N} \beta(t+m,u)\,\varnothing_m(t,u) + \beta(t,u+1)\,y(t,u)$

where $\varnothing_m(t,u)$ is the probability of emitting the duration-$m$ big blank at lattice position $(t,u)$, and $y(t,u)$ the probability of emitting the next correct label.
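A minimal NumPy sketch of the multi-blank forward recursion above; the tensor layout (per-duration blank log-probs indexed by a dict) and the function name are our assumptions, not the paper's implementation:

```python
import numpy as np

def multiblank_forward(log_label, log_blank, durations):
    """Forward variables alpha(t, u) for a multi-blank RNN-T lattice (sketch).

    log_label[t, u]    : log-prob of emitting the (u+1)-th correct label at (t, u)
    log_blank[m][t, u] : log-prob of emitting the duration-m big blank at (t, u)
    durations          : the set N of blank durations, e.g. {1, 2, 4, 8}
    """
    T, U = log_label.shape            # U = len(labels) + 1 lattice rows
    alpha = np.full((T, U), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U):
            if t == 0 and u == 0:
                continue
            terms = []
            for m in durations:       # horizontal moves: a blank of duration m
                if t - m >= 0:
                    terms.append(alpha[t - m, u] + log_blank[m][t - m, u])
            if u >= 1:                # vertical move: emit the next label
                terms.append(alpha[t, u - 1] + log_label[t, u - 1])
            alpha[t, u] = np.logaddexp.reduce(terms) if terms else -np.inf
    return alpha
```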
Model inference
With multi-blank models, when a big blank with duration m is emitted, the decoding loop increments t by exactly m. This allows inference to skip frames and thus become faster.
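A sketch of the corresponding greedy decoding loop; `decoder.initial_state`, `decoder.output`, `decoder.update`, and the `joint` call are hypothetical helpers for illustration, not NeMo's actual API:

```python
import torch

@torch.no_grad()
def multiblank_greedy_decode(encoder_out, decoder, joint, blank_ids, durations):
    """Greedy multi-blank decoding (sketch): a blank of duration m skips m frames.

    encoder_out : (T, D) encoder features
    blank_ids   : vocabulary ids of the blanks, aligned with `durations`
    """
    T = encoder_out.size(0)
    hyp, t = [], 0
    state = decoder.initial_state()                  # hypothetical helper
    while t < T:
        logits = joint(encoder_out[t], decoder.output(state))
        k = int(logits.argmax())
        if k in blank_ids:
            t += durations[blank_ids.index(k)]       # advance t by exactly m
        else:
            hyp.append(k)                            # label emission: t unchanged
            state = decoder.update(state, k)         # hypothetical helper
    return hyp
```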
Logits Under-Normalization
To prioritize emission of big blanks, the paper proposes a modified RNN-T loss function. The probability of the correct label or a blank emitted at location (t, u) is normally given by a log-softmax over the logits $z$:

$\log P(k \mid t,u) = z_k(t,u) - \log \sum_j \exp(z_j(t,u))$

The paper under-normalizes the logits by adding an extra term $\sigma > 0$ to the normalizer:

$\log \hat{P}(k \mid t,u) = z_k(t,u) - \log \sum_j \exp(z_j(t,u)) - \sigma$

$\sigma$ is chosen to be 0.05.
Logits Under-Normalization
The weight of a complete path $\pi$ becomes

$\hat{P}(\pi) = \prod_i \hat{P}(\pi_i) = P(\pi)\, e^{-\sigma |\pi|}$

and the RNN-T loss sums these path weights:

$L = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} \hat{P}(\pi) = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} P(\pi)\, e^{-\sigma |\pi|}$
Logits Under-Normalization
With the added term, the loss no longer sums over the path probabilities uniformly: the $e^{-\sigma|\pi|}$ factor penalizes longer paths and therefore prioritizes the emission of blanks with larger durations, which cover multiple frames and make the path shorter. Under-normalization has no effect on the original RNN-T, where all complete paths have the same length and are thus penalized equally.
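A minimal PyTorch sketch of this under-normalization, a constant shift of the log-softmax outputs so that probabilities sum to $e^{-\sigma} < 1$; the function name is ours:

```python
import torch

def under_normalized_log_softmax(logits, sigma=0.05):
    """Standard log-softmax shifted down by sigma (sketch).

    Every path of length L accumulates an extra -sigma * L in log space,
    which is exactly the exp(-sigma * |path|) penalty described above.
    """
    return torch.log_softmax(logits, dim=-1) - sigma
```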
Experiments
Evaluation uses a Conformer-RNN-T model with a stateless decoder; stateless-decoder models outperform LSTM-decoder models in both accuracy and speed. A sketch of the decoder follows below.
- Conformer encoder: a convolution layer at the beginning of the network performs subsampling on the input.
- Stateless decoder: its output is the concatenation of the embeddings of the last two context words.
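A rough sketch of such a stateless prediction network, assuming the output is simply the concatenated embeddings of the last two tokens; dimensions and names are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class StatelessDecoder(nn.Module):
    """Stateless RNN-T prediction network (sketch): no recurrent state,
    just a context window over the last two emitted tokens."""

    def __init__(self, vocab_size, emb_dim=320, context=2):
        super().__init__()
        # one extra id reserved as padding for "no previous token yet"
        self.embed = nn.Embedding(vocab_size + 1, emb_dim, padding_idx=vocab_size)
        self.context = context

    def forward(self, last_tokens):
        # last_tokens: (B, context) ids of the most recent non-blank tokens
        return self.embed(last_tokens).flatten(1)    # (B, context * emb_dim)
```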
Experiments
Byte-pair encoding is used as the text representation, with vocabulary size 1024. The baseline uses the standard blank only, i.e. N = {1}. $\sigma$ = 0.05.
Librispeech results
Models are trained on the full Librispeech dataset, augmented 3x using speed-perturbation factors of 0.9x, 1.0x, and 1.1x, with the conformer-rnnt-large configuration in NeMo (120M parameters). The optimal subsampling rate is 4X.
Librispeech results
For models with 8X and 16X subsampling, the subsampling at the beginning of the encoder directly reduces the computational cost of the later self-attention operations. In the paper's approach, by contrast, all speedup comes from the decoding loop.
German ASR results
RNN-T models with a stateless decoder, trained on 2,070 hours of audio data (813,000 utterances), using three big blank symbols with durations 2, 4, and 8 (N = {1, 2, 4, 8}).
- VoxPopuli dataset: speedups up to around 90%
- MLS dataset: around 140%
Efficient batched inference for multi-blank transducers
Multi-blank transducers are ideal for on-device speech recognition, but the method should also support batched inference to run on the server side. Exact batched inference is non-trivial to implement for multi-blank RNN-Ts: different utterances in the same batch might emit blanks with different durations, making it hard to fully parallelize the computation.
Efficient batched inference for multi-blank transducers
Inexact batched inference method: when different utterances in a batch emit blanks of different durations, increment t by the minimum of those durations. This allows better parallelization across utterances in the same batch.
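A tiny sketch of the inexact rule, assuming each batch element reports the duration of the blank it just emitted (0 when it emitted a label, so the shared time index never advances past a pending label emission):

```python
def shared_time_increment(emitted_durations):
    """Inexact batched step (sketch): the shared time index advances by the
    minimum blank duration emitted across the batch; an utterance that
    emitted a non-blank label reports 0, so t stays put until it finishes."""
    return min(emitted_durations)
```

For example, emitted durations [4, 2, 8] advance t by 2; the duration-4 and duration-8 utterances then revisit frames they could have skipped, which is where the extra decoding steps described below come from.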
Efficient batched inference for multi-blank transducers
Baseline: larger batch sizes speed up model inference, with diminishing returns; large batch sizes also waste more computation on padding.
Multi-blank: with larger batches, it is less likely that all utterances emit big blanks of large durations, so the decoder has to perform more decoding steps, since t advances by the minimum of those durations.
Efficient batched inference for multi-blank transducers
There are slight differences in WER across batch sizes: the inexact batched inference method is not equivalent to running utterances in non-batched mode and can introduce small perturbations in the ASR outputs.
Conclusion and Future Work
The design speeds up model inference and improves ASR accuracy. To prioritize the emission of big blanks, it under-normalizes the logits before the RNN-T loss computation. The method brings between +90% and +140% relative inference speedup on different datasets, while achieving better ASR accuracy. Future work: apply similar ideas to other ASR frameworks, e.g. CTC, and explore other modifications.
Progress
Aishell-1 (Transformer), CER (%):

Model                                   dev    test
CTC only baseline (spec_aug)            5.8    6.3
CTC only (no spec_aug)                  6.9    7.6
CTC only (spec_aug)                     5.8    6.2
CTC+Attention baseline (no spec_aug)    6.5    7.4
CTC+Attention (no spec_aug)             6.5    7.3
CTC+Attention (spec_aug)                5.3    5.8