Innovations in End-to-End Contextual Speech Recognition


Explore advances in automatic speech recognition that exploit dynamic contextual information at different levels, from shallow fusion to deep fusion. Dive into research on fine-grained contextual information that matches the ASR word-piece-level token output distribution, enhancing the CLAS model to better handle large numbers of contextual phrases.

  • Speech recognition
  • Contextual information
  • Deep fusion
  • CLAS model
  • Innovation


Presentation Transcript


  1. The 29th International DMS Conference on Visualization and Visual Languages. End-to-End Contextual Speech Recognition With Word-Piece-Level Token Selection. Zhibin Wu, Yang Zou, Jian Zhou, Min Wang, Xiaoqin Zeng. Hohai University, Nanjing, China. Speaker: Zhibin Wu.

  2. Contents: 01 Introduction and Motivation | 02 Fine-CLAS | 03 Experiments

  3. Part 01 Introduction and Motivation

  4. Introduction and Motivation. The utilization of dynamic contextual information in end-to-end automatic speech recognition has been an active research topic. Prior work falls roughly into two directions: (1) shallow fusion, e.g., on-the-fly (OTF) rescoring, class LMs, and token-passing decoders; (2) deep fusion, e.g., Contextual LAS (CLAS).

  5. Introduction and Motivation. For the first class, rescoring with an independently trained external language model runs counter to the benefit gained from jointly optimizing all components of a sequence-to-sequence model. [Figure: WER vs. number of phrases]

  6. Introduction and Motivation. As for the second class, although the fully neural contextual approach outperforms shallow fusion, it still suffers from a problem: performance drops significantly when handling hundreds or even thousands of contextual phrases, because many contextual phrases share similar pronunciations or repeated partial words. [Figure: WER for the Talk-To set with a CLAS model]

  7. Part 02 Fine-CLAS

  8. Fine-CLAS: CLAS. The original CLAS model only modelled contextual phrases at the level of phrase embeddings, so it cannot be extended to larger bias lists; its network architecture is shown below. We propose to fuse contextual information at two different levels, phrase-level embeddings and word-piece-level embeddings, so that the contextual bias module can focus on fine-grained contextual information that matches the ASR word-piece-level token output distribution. We have made improvements to this module.

  9. Fine-CLAS Architecture. Our paper proposes Fine-CLAS, which consists of a prefix tree, an encoder, a decoder, and two attention modules. The main network architecture is shown below.

  10. Fine-CLAS Architecture. The main contributions of Fine-CLAS are summarized as follows: (1) a transformation chain between word-piece-level embeddings is constructed so as to obtain the transfer relationship between word-piece-level bias embeddings; (2) a prefix tree is constructed and combined with historical information to select whether to enable each phrase in the context list, which can reduce the number of phrases and yield a smaller number of phrase-level embeddings and word-piece-level bias embeddings; (3) a word-piece-level token selection algorithm is designed to select the top-K phrases based on the biased weights and obtain the corresponding word-piece-level bias embeddings, which results in a series of word-piece-level embedding information.

  11. Fine-CLAS Methods. The Fine-CLAS model is built on the CLAS model by adding three modules that correspond to the following three approaches: (1) Prefix Tree Constraint; (2) Word-Piece-Level Token Selection; (3) Contextual Transformation Chain Construction.

  12. Fine-CLAS Technology. 1. Prefix Tree Constraint. Given the previously output word-piece tokens as queries, a certain history interval is selected and input to the bias module to find the phrases whose prefixes match, returning a binary vector $h_{bias} = [a_0, a_1, \ldots, a_N]$, $a_i \in \{0, 1\}$, which is used to filter the relevant phrases and is applied only to phrase-level attention at the inference stage. [Figure: an example of prefix tree search]
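
To make the mechanism concrete, here is a minimal Python sketch of how a prefix tree over word pieces could produce such a binary filter vector. The `PrefixTree` class, its methods, the `max_span` history window, and the toy phrases are our own illustration, not the paper's implementation.

```python
# Hypothetical sketch of the prefix-tree constraint: given the recent
# word-piece history, return a binary vector marking which bias phrases
# have a prefix matching that history. All names are illustrative.

class PrefixTree:
    def __init__(self, phrases):
        # phrases: list of phrases, each a sequence of word-piece tokens
        self.root = {}
        for idx, pieces in enumerate(phrases):
            node = self.root
            for piece in pieces:
                node = node.setdefault(piece, {})
            node.setdefault("_ids", []).append(idx)
        self.num_phrases = len(phrases)

    def _collect(self, node):
        # Gather all phrase ids reachable below this node.
        ids = list(node.get("_ids", []))
        for key, child in node.items():
            if key != "_ids":
                ids.extend(self._collect(child))
        return ids

    def filter_vector(self, history, max_span=4):
        # Try recent suffixes of the history as phrase prefixes and
        # enable every phrase consistent with at least one of them.
        h_bias = [0] * self.num_phrases
        for start in range(max(0, len(history) - max_span), len(history)):
            node, ok = self.root, True
            for piece in history[start:]:
                if piece not in node:
                    ok = False
                    break
                node = node[piece]
            if ok:
                for idx in self._collect(node):
                    h_bias[idx] = 1
        return h_bias  # a_i in {0, 1}, masks phrase-level attention

# Toy example: three contextual phrases split into word pieces.
tree = PrefixTree([["play", "_back"], ["play", "_er"], ["stop"]])
print(tree.filter_vector(["please", "play"]))  # -> [1, 1, 0]
```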

  13. Fine-CLAS Technology. 2. Word-Piece-Level Token Selection. It introduces word-piece-level context vectors that are concatenated and mapped to the decoder's output, thus matching the ASR token units (word pieces) as the output distribution and reducing the uncertainty of token prediction. The specific formulas are as follows:

  $Z = \{z_{nb}, z_1, \ldots, z_K\} = \mathrm{PhraseTopKSelection}(a_{t,nb}, a_{t,1}, \ldots, a_{t,N})$  (1)

  $(h^z_{k,1}, h^z_{k,2}, \ldots) = \mathrm{ContextualEnc}(z_k), \quad h^z_{nb,1} = \mathrm{ContextualEnc}(z_{nb})$  (2)

  $h^z_{nb,1} = h^z_{nb} + h^z_{nb,1}, \quad h^z_{k,i} = h^z_k + h^z_{k,i}$  (3)

  $K^z = V^z = [h^z_{nb,1}, h^z_{1,1}, h^z_{1,2}, \ldots, h^z_{K,1}, h^z_{K,2}]$  (4)
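
Below is a small NumPy sketch of the selection-and-fusion steps in Eqs. (1)-(4). The function names, the use of fixed arrays in place of the learned contextual encoder, and the toy dimensions are all assumptions made for illustration, not the paper's code.

```python
# Hypothetical sketch of word-piece-level token selection (eqs. 1-4),
# with plain arrays standing in for the learned contextual encoder.
import numpy as np

def phrase_top_k_selection(attn_weights, k):
    # Eq. (1): keep the K phrases with the largest bias-attention weight.
    return np.argsort(attn_weights)[::-1][:k]

def build_keys_values(phrase_emb, piece_embs, selected):
    # phrase_emb[j] : phrase-level embedding of phrase j       (eq. 2)
    # piece_embs[j] : word-piece-level embeddings of phrase j  (eq. 2)
    # Fuse both levels by adding the phrase embedding to each of its
    # word-piece embeddings (eq. 3), then concatenate everything into
    # one key/value memory for word-piece-level attention (eq. 4).
    # (The real model would also prepend a no-bias entry h_nb.)
    memory = []
    for j in selected:
        for h in piece_embs[j]:
            memory.append(phrase_emb[j] + h)
    kv = np.stack(memory)  # K^z = V^z = [h_{1,1}, h_{1,2}, ...]
    return kv, kv

# Toy example: 3 phrases with 2, 3, and 1 word pieces, embedding dim 4.
rng = np.random.default_rng(0)
phrase_emb = rng.normal(size=(3, 4))
piece_embs = [rng.normal(size=(n, 4)) for n in (2, 3, 1)]
sel = phrase_top_k_selection(np.array([0.1, 0.7, 0.2]), k=2)
K, V = build_keys_values(phrase_emb, piece_embs, sel)
print(sel, K.shape)  # [1 2] (4, 4): 3 pieces from phrase 1, 1 from phrase 2
```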

  14. Fine-CLAS Technology. 3. Contextual Transformation Chain Construction. Modelling this transfer may be helpful when the context consists of personalized entity names and proper names that are rare or unseen during training, as it allows us to recover the expected next token from the preceding subsequence. The original formula for the key-value pair selected by the word-piece-level token is

  $(k_l, v_l): \quad k_l = v_l = h^z_{k,i}$

  Accordingly, the memory entries of the key-value pairs constructed after the contextual transformation chain are two consecutive word-piece-level embedding vectors $h^z_{k,i}$ and $h^z_{k,i+1}$, as follows:

  $(k_l, v_l) = (h^z_{k,i}, h^z_{k,i+1})$
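
A minimal sketch of how such a shifted key-value memory could be built, assuming each phrase's word-piece embeddings are already computed; `build_transfer_memory` is a hypothetical helper, not the paper's code.

```python
# Hypothetical sketch of the contextual transformation chain: instead of
# key = value = h_{k,i}, each memory entry maps a word-piece embedding to
# the embedding of the NEXT piece in the same phrase, so attending to the
# current piece retrieves what should come next.
import numpy as np

def build_transfer_memory(piece_embs):
    keys, values = [], []
    for pieces in piece_embs:             # pieces: (n_k, d) per phrase
        for i in range(len(pieces) - 1):
            keys.append(pieces[i])        # k_l = h_{k,i}
            values.append(pieces[i + 1])  # v_l = h_{k,i+1}
    return np.stack(keys), np.stack(values)

rng = np.random.default_rng(0)
piece_embs = [rng.normal(size=(3, 4)), rng.normal(size=(2, 4))]
K, V = build_transfer_memory(piece_embs)
print(K.shape, V.shape)  # (3, 4) (3, 4): 2 transitions + 1 transition
```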

  15. Part 03 Experiments

  16. Experiments: Metrics. Our experiments are conducted on the LibriSpeech dataset. As LibriSpeech's test sets lack a bias list, we construct one by collecting, from the reference transcripts of the test sets, the words outside the 20,000 most common words of the training data, discarding short words of fewer than 5 letters. The resulting simulated bias lists for test-clean and test-other each contain around 1,000 phrases. A set of evaluation metrics is introduced that tracks three different aspects of ASR: (1) WER: overall word error rate, assessed over all words; (2) CER: overall character error rate, assessed over all characters; (3) U-WER: unbiased word error rate, assessed over the words not in the bias list.
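
For reference, a minimal sketch of the word-level metric, assuming the standard Levenshtein alignment; `edit_distance` and `wer` are illustrative helpers, not the evaluation tooling used in the paper (U-WER would apply the same computation restricted to reference words outside the bias list).

```python
# Classic dynamic-programming Levenshtein distance over token lists,
# and WER as edit distance divided by reference length.

def edit_distance(ref, hyp):
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1]

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

print(round(wer("turn on the player", "turn on the prayer"), 2))  # 0.25
```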

  17. Experiments: Metrics. Secondly, contextual bias is measured using the precision (P), recall (R), and F1-score (F1) of the biased phrases. In summary, we use six evaluation metrics to measure the performance of Fine-CLAS: (1) P: the proportion of predicted positives that are truly positive, $P = \frac{TP}{TP + FP}$; (2) R: the proportion of actual positives that are predicted correctly, $R = \frac{TP}{TP + FN}$; (3) F1: the harmonic mean of P and R, $F1 = \frac{2}{1/P + 1/R}$.
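
These formulas translate directly into code; the sketch below is a plain transcription with hypothetical TP/FP/FN counts, not the paper's scoring script.

```python
# Precision, recall, and F1 from TP/FP/FN counts; F1 is the harmonic
# mean of P and R, exactly as in the formulas above.

def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 / (1 / p + 1 / r)  # equivalently 2*p*r / (p + r)
    return p, r, f1

print(precision_recall_f1(tp=60, fp=4, fn=36))
# -> (0.9375, 0.625, 0.75)
```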

  18. Experiments: Results. Compared with CLAS, our improved Fine-CLAS model reduces WER by a further 5.37% and 2.10% (absolute) on test-clean and test-other, respectively, and achieves significant improvements in the other two metrics. This indicates that the ASR performance of our model is preferable.

  Model     | test-clean          | test-other
            | WER    CER    U-WER | WER    CER    U-WER
  AED       | 21.79  10.98  19.90 | 45.65  26.35  41.30
  CLAS      | 22.97  16.37  19.90 | 38.18  22.63  33.90
  Fine-CLAS | 17.60  10.63  16.00 | 36.08  20.95  32.70

  19. Experiments: Results. Compared to CLAS, our Fine-CLAS model achieves a further improvement of 1.10% and 2.10% (absolute) in F1-score on test-clean and test-other, respectively. This indicates that our model improves both the ASR performance and the effect of contextual bias.

  Model     | test-clean          | test-other
            | P      R      F1    | P      R      F1
  AED       | 97.10  37.80  54.40 | 82.40  15.60  26.20
  CLAS      | 92.90  59.80  72.80 | 88.80  29.40  44.10
  Fine-CLAS | 96.00  66.10  73.90 | 95.50  30.40  46.20

  20. Thanks for your attention
