Towards Fast and Accurate End-to-End ASR Model


Explore the advancements in RNN-T models for streaming ASR, focusing on recognition quality and latency. Learn about joint endpointing and decoding, penalties for early or late emissions, and the integration of non-streaming models for enhanced performance.

  • ASR
  • End-to-End
  • RNN-T
  • Streaming
  • Speech Recognition




Presentation Transcript


  1. TOWARDS FAST AND ACCURATE STREAMING END-TO-END ASR Bo Li, Shuo-yiin Chang, Tara N. Sainath, Ruoming Pang, Yanzhang He, Trevor Strohman, Yonghui Wu. Google LLC, USA. ICASSP 2020.

  2. Outline: 1. INTRODUCTION 2. RNN TRANSDUCER AND ENDPOINTER 3. EXPERIMENTAL SETUP 4. RESULTS

  3. INTRODUCTION Among E2E models, the RNN-T (recurrent neural network transducer) has shown promise for on-device streaming ASR. Critical metrics: recognition quality and latency. Recognition quality: WER (word error rate). Latency: the time difference between when the user stops speaking and when the system produces its final text hypothesis. Latency must be low enough that the system responds to the user quickly, yet high enough that it does not cut off the user's speech.

  4. INTRODUCTION EP (endpointer) model: decides whether the user has stopped speaking. VAD (voice activity detector): detects speech and filters out non-speech; it can be used to declare end-of-query (EOQ).

  5. INTRODUCTION Joint Endpointing and Decoding with End-to-end Models: the EOQ detector is folded into the RNN-T model by introducing a special token (</s>), signaling the end of speech, into the RNN-T's output vocabulary. Premature prediction of </s> may cause not only substitution errors but also deletions.

  6. INTRODUCTION In this work: 1. Penalties for emitting </s> too early or too late during training. 2. MWER (minimum word error rate) training. 3. Rescoring the RNN-T EP's hypotheses with a non-streaming model, namely Listen, Attend and Spell (LAS). This achieves an 18.7% relative WER reduction, together with 40 ms median and 160 ms 90th-percentile latency reductions, on a Voice Search task compared to the original RNN-T EP.

  7. RNN TRANSDUCER AND ENDPOINTER Input acoustic frames x = {x_1, ..., x_T} are log-mel filterbank energies (d = 512). The model directly predicts the word-piece token sequence y = {y_1, ..., y_U}, where the last label y_U is the special token </s>.

  8. RNN TRANSDUCER AND ENDPOINTER Early and Late Penalties: during training, for every input frame in {x_1, ..., x_T} and every label in {y_1, ..., y_U}, RNN-T computes a U × T matrix P_RNN-T(y|x), which is used in the training loss computation. The last label y_U is always </s>. MWER Training: we hence investigate MWER training with N-best hypotheses for the RNN-T EP model.
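The slide does not show the penalty formula itself; below is a minimal numpy sketch of one way such per-frame early/late penalties on the </s> log-posterior could be applied, assuming the reference end-of-speech frame t_eos is known (e.g. from a forced alignment). The names alpha_early and alpha_late and their default values are illustrative, not taken from the paper.

```python
import numpy as np

def penalize_eos_log_probs(eos_log_probs, t_eos, t_buffer=5,
                           alpha_early=1.0, alpha_late=1.0):
    """Discourage emitting </s> far from the reference end-of-speech frame.

    eos_log_probs: per-frame log P(</s> | x_1..x_t), shape (T,).
    t_eos: reference end-of-speech frame (assumed from a forced alignment).
    Frames more than t_buffer frames before/after t_eos get a penalty that
    grows linearly with the distance; the alpha scales are assumptions.
    """
    t = np.arange(eos_log_probs.shape[0])
    too_early = np.maximum(0, (t_eos - t_buffer) - t)  # frames too early
    too_late = np.maximum(0, t - (t_eos + t_buffer))   # frames too late
    return eos_log_probs - alpha_early * too_early - alpha_late * too_late
```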

  9. RNN TRANSDUCER AND ENDPOINTER Dataset: the test set consists of 14K Voice Search utterances, each less than 5.5 seconds long.

  10. RNN TRANSDUCER AND ENDPOINTER Listen, Attend and Spell Rescoring: LAS has been explored as a second-pass rescorer that can still fit within the on-device latency constraint. First pick the top-K hypotheses from the RNN-T decoder, then run the LAS model on each sequence in teacher-forcing mode to compute a score.
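To make the two-step procedure concrete, here is a minimal sketch; las_score_fn is a hypothetical stand-in for running the LAS decoder in teacher-forcing mode and summing the token log-probabilities, not the paper's actual API.

```python
import numpy as np

def las_rescore(top_k_hyps, encoder_features, las_score_fn):
    """Pick the top-K first-pass hypotheses, score each with LAS, keep the best.

    top_k_hyps: list of token-id sequences from the RNN-T beam search.
    las_score_fn(features, tokens): assumed callable returning the LAS
    teacher-forcing log-probability sum_u log P(y_u | y_<u, x).
    """
    scores = [las_score_fn(encoder_features, hyp) for hyp in top_k_hyps]
    best = int(np.argmax(scores))
    return top_k_hyps[best], scores[best]
```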

  11. EXPERIMENTAL SETUPS Modeling: the input waveforms are framed using a 32 ms window with a 10 ms shift. Normalized 128-dimensional log-mel features are extracted from frequencies spanning 125 Hz to 7.5 kHz. The input window size is 4 frames, consisting of 3 frames of left context and no future context.
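As a concrete illustration of the frontend, the sketch below stacks each 128-dimensional log-mel frame with its 3 left-context frames, yielding the 512-dimensional (d = 512) encoder input mentioned on slide 7; zero-padding the first frames is an assumption.

```python
import numpy as np

def stack_frames(logmel, left_context=3):
    """Stack each frame with its left context: a window of 4, no future frames.

    logmel: (T, 128) log-mel features from 32 ms windows at a 10 ms shift.
    Returns (T, 512) features, i.e. 4 x 128 dimensions per output frame.
    """
    T, d = logmel.shape
    padded = np.concatenate([np.zeros((left_context, d)), logmel], axis=0)
    # Output frame t sees input frames [t-3, t-2, t-1, t].
    return np.stack([padded[t : t + left_context + 1].reshape(-1)
                     for t in range(T)])
```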

  12. EXPERIMENTAL SETUPS Modeling: all LSTM layers in the model are unidirectional, with 2,048 units and a projection layer with 640 units. The RNN-T encoder consists of 8 LSTM layers, with a time-reduction layer after the second layer. The RNN-T decoder consists of a prediction network with 2 LSTM layers and a joint network with a single feedforward layer with 640 units.
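A structural sketch of these components in Keras follows; it mirrors the layer sizes on this slide but is not the TensorFlow Lingvo implementation. The concatenation-based time reduction and the extra blank label in the joint output are standard RNN-T choices assumed here, not details given on the slide.

```python
import tensorflow as tf

PROJ = 640  # projection size used throughout the model

def lstm_p(units=2048):
    """Unidirectional LSTM with 2,048 units plus a 640-unit projection."""
    return tf.keras.Sequential([
        tf.keras.layers.LSTM(units, return_sequences=True),
        tf.keras.layers.Dense(PROJ),
    ])

def build_encoder(num_layers=8, reduce_after=2):
    """8-layer encoder with a time-reduction layer after the second layer."""
    inp = tf.keras.Input(shape=(None, 512))  # stacked log-mel features
    x = inp
    for i in range(1, num_layers + 1):
        x = lstm_p()(x)
        if i == reduce_after:
            # Assumed time reduction: concatenate each pair of adjacent
            # frames (requires an even number of frames).
            x = tf.keras.layers.Lambda(lambda t: tf.reshape(
                t, [tf.shape(t)[0], tf.shape(t)[1] // 2, 2 * PROJ]))(x)
    return tf.keras.Model(inp, x)

def build_prediction_net(vocab_size=4096):
    """2-layer LSTM prediction network over previous non-blank labels."""
    inp = tf.keras.Input(shape=(None,), dtype=tf.int32)
    x = tf.keras.layers.Embedding(vocab_size, PROJ)(inp)
    for _ in range(2):
        x = lstm_p()(x)
    return tf.keras.Model(inp, x)

def build_joint(vocab_size=4096 + 1):  # word pieces (incl. </s>) plus blank
    """Single 640-unit feedforward joint over the encoder/prediction lattice."""
    enc = tf.keras.Input(shape=(None, PROJ))   # encoder output, T' frames
    pred = tf.keras.Input(shape=(None, PROJ))  # prediction net, U+1 steps
    lattice = tf.keras.layers.Dense(PROJ, activation="tanh")(
        enc[:, :, None, :] + pred[:, None, :, :])  # broadcast to T' x (U+1)
    logits = tf.keras.layers.Dense(vocab_size)(lattice)
    return tf.keras.Model([enc, pred], logits)
```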

  13. EXPERIMENTAL SETUPS Modeling: the additional LAS encoder consists of 2 LSTM layers. The LAS decoder consists of multi-head attention with 4 attention heads, which is fed into 2 LSTM layers. All models are trained on 8x8 Cloud TPU using the TensorFlow Lingvo toolkit to predict 4,096 word pieces, including the </s> token.

  14. EXPERIMENTAL SETUPS Inference: for RNN-T EPs, the endpoint decision is based on the posterior p(</s> | x_1, ..., x_t, y_0, ..., y_{t-1}), where the y's are the previous RNN-T labels. α: penalty term for the </s> posterior that modifies the ordering of hypotheses containing </s>. β: predefined threshold that determines whether </s> is allowed in the search beam.
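Read literally, the rule can be sketched as follows; whether β gates the raw or the penalized posterior, and the default values used here, are assumptions for illustration only.

```python
def eos_beam_decision(eos_posterior, alpha=1.0, beta=0.9):
    """Sketch of the slide's inference rule for the joint RNN-T EP.

    alpha penalizes the </s> posterior, re-ranking hypotheses that end in
    </s>; beta gates whether </s> may enter the search beam at all. Both
    default values are illustrative assumptions, not from the paper.
    Returns (allowed_in_beam, penalized_score).
    """
    penalized_score = alpha * eos_posterior
    allowed_in_beam = eos_posterior > beta
    return allowed_in_beam, penalized_score
```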

  15. RESULTS Baseline: an RNN-T model trained to predict 4,096 word pieces for the ASR task only (no </s>), with an external EOQ EP used at inference. We also trained a joint endpointing and recognition RNN-T EP model. Endpointing coverage (EOU) represents the percentage of the test data that actually receives an end-of-utterance signal from the endpointer model.

  16. RESULTS Early and Late Penalties: the granularity of time is the frame (specifically 60 ms in our setup). We experimented with t_buffer = {3, 5, 7}.

  17. RESULTS MWER Training: MWER training optimizes a sequence-level loss and penalizes WER when </s> is emitted too early.

  18. RESULTS LAS Rescoring: take the pre-MWER model E2 and add an additional encoder with two LSTM layers and an extra LAS decoder. These are trained with the cross-entropy (CE) loss while the RNN-T weights are frozen.

  19. RESULTS Analysis: we plotted the WER vs. latency (EP90) curves for the four models (B1, B2, E5, E8) in Figure 2 by varying the penalty scale α and threshold β. Lower curves are better.

  20. Additional information MWER Two-Pass End-to-End Speech Recognition. Tara N. Sainath, Ruoming Pang, David Rybach, Yanzhang He, Rohit Prabhavalkar, Wei Li, Mirko Visontai, Qiao Liang, Trevor Strohman, Yonghui Wu, Ian McGraw, Chung-Cheng Chiu. Google, Inc., USA. https://arxiv.org/pdf/1908.10992.pdf

  21. Additional information MWER Given input x and ground-truth transcript y*, LAS computes the probability P(y_m|x) for any given target sequence y_m with teacher-forcing (where m = r if y_m is given by RNN-T and m = l if y_m is given by LAS). First, we run a beam search with one of the decoders m from the two-pass model to get a set of hypotheses H_m = {h_1, ..., h_b}, where b is the beam size. To make the MWER training match decoding, the generation of H_m depends on the target decoding mode.

  22. Additional information MWER For a LAS decoder to be used in the 2nd beam search mode, we compute H_m by running beam search with the LAS decoder itself on x (m = l). For a LAS decoder to be used in the rescoring mode, on the other hand, we compute H_m(x) by running beam search with the first-pass RNN-T decoder (m = r). (In the 2nd beam search mode, the LAS decoder produces its output y_l from the encoder output e alone, ignoring y_r, the output of the RNN-T decoder.)

  23. Additional information MWER For each sequence y_i ∈ H_m, let W(y*, y_i) be the number of word errors of y_i. Let Ŵ(y*, H_m) = (1/b) Σ_{y_i ∈ H_m} W(y*, y_i) be the mean number of word errors over H_m. Let ŵ(y*, y_i) = W(y*, y_i) − Ŵ(y*, H_m) be the relative word error of y_i in H_m. Let P̂(y_i|x, H_m) = P(y_i|x) / Σ_{y_j ∈ H_m} P(y_j|x) represent the conditional probability the LAS decoder assigns to hypothesis y_i among all hypotheses in H_m.

  24. Additional information MWER The MWER loss is defined as: L_MWER(x, y*) = Σ_{y_i ∈ H_m(x)} P̂(y_i|x, H_m) · ŵ(y*, y_i). Train the LAS decoder to minimize a combination of the MWER loss and the maximum-likelihood cross-entropy loss: L_MWER(x, y*) + λ_MLE · L_CE(y*|x), where λ_MLE is a hyperparameter that we experimentally set to λ_MLE = 0.01, following prior work.
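Putting slides 23-24 together, a minimal numpy sketch of this loss over an N-best list could look like the following, where P̂ is computed as a softmax over the list's sequence log-probabilities.

```python
import numpy as np

def mwer_loss(hyp_log_probs, hyp_word_errors):
    """L_MWER = sum_i P_hat(y_i|x, H_m) * w_hat(y*, y_i) over the N-best list.

    hyp_log_probs[i]: log P(y_i | x) from the LAS decoder (teacher-forcing).
    hyp_word_errors[i]: W(y*, y_i), the number of word errors of y_i.
    """
    log_p = np.asarray(hyp_log_probs, dtype=np.float64)
    w = np.asarray(hyp_word_errors, dtype=np.float64)
    p_hat = np.exp(log_p - log_p.max())  # renormalize over H_m ...
    p_hat /= p_hat.sum()                 # ... i.e. a softmax of the scores
    w_hat = w - w.mean()                 # errors relative to the N-best mean
    return float(np.sum(p_hat * w_hat))

# Combined objective from this slide: mwer_loss(...) + 0.01 * ce_loss.
```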

  25. Additional information Rescoring Rescoring is a simple and widely used method to apply LMs to end-to-end ASR models. An n-best list is generated by beam search with the ASR model, then each hypothesis in the list is rescored using the LM: Score(X, y) = log p_ASR(y|X) + λ · Score_LM(y) + γ · |y|, where X denotes the acoustic features, y = (y_1, ..., y_L) denotes a hypothesis, and Score_LM(y) = Σ_{i=1}^{L} log p_LM(y_i|y_{<i}) = log p(y_1, ..., y_L).
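A direct transcription of this scoring rule as code; the parameter names lm_weight (λ) and length_reward (γ) and their default values are illustrative assumptions.

```python
import numpy as np

def rescoring_score(asr_log_prob, lm_token_log_probs,
                    lm_weight=0.5, length_reward=0.5):
    """Score(X, y) = log p_ASR(y|X) + lambda*Score_LM(y) + gamma*|y|.

    lm_token_log_probs: [log p_LM(y_i | y_<i) for each token]; their sum is
    Score_LM(y) = log p(y_1, ..., y_L), and the list length is |y|.
    """
    score_lm = float(np.sum(lm_token_log_probs))
    return (asr_log_prob + lm_weight * score_lm
            + length_reward * len(lm_token_log_probs))
```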
