Unlocking Zero-Shot Power of Instruction-Tuned Large Language Model in Speech Recognition

Discover how instruction-tuned large language models are leveraged for grammatical error correction in end-to-end speech recognition tasks. The study showcases the effectiveness of integrating linguistic information embedded in LLMs for ASR error correction, leading to promising performance improvements across various ASR tasks.

  • Speech Recognition
  • Large Language Models
  • Grammatical Error Correction
  • Instruction-Tuned LLMs
  • End-to-End ASR

Presentation Transcript


  1. HARNESSING THE ZERO-SHOT POWER OF INSTRUCTION-TUNED LARGE LANGUAGE MODEL IN END-TO-END SPEECH RECOGNITION Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi Department of Communications and Computer Engineering, Waseda University, Tokyo, Japan

  2. Introduction LLMs have the potential to solve different downstream tasks in a few-shot or even zero-shot manner. In this work, we aim to relate the end-to-end ASR task to grammatical error correction and leverage the linguistic information embedded within the LLM to improve the target speech task. Prior studies revealed the effectiveness of using instruction-tuned LLMs for ASR error correction as post-processing ("N-best T5: Robust ASR error correction using multiple input hypotheses and constrained decoding space"; "Can generative large language models perform ASR error correction?"). In contrast, this approach attempts to directly integrate the LLM's ability to correct grammatical errors into an end-to-end ASR formulation. Experiments on various ASR tasks demonstrate that the proposed approach delivers promising improvements in ASR performance.

  3. Methodology

  4. Instruction-Tuned LLM Llama2-Chat: a fine-tuned version of Llama2 that is optimized for dialogue use cases.

  5. Methodology

  6. Methodology

  7. Methodology Training. Step 1: Train a hybrid CTC-attention-based end-to-end ASR model. Step 2: Train the proposed model by training a new decoder from scratch using Llama2, along with the pre-trained encoder and CTC networks; only the parameters of the decoder network are updated. Inference: We apply the Viterbi approximation with respect to the posterior distribution.
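
A minimal PyTorch sketch of the Step 2 setup, assuming the frozen-vs-trainable split described above; the module definitions are illustrative placeholders, not the actual ESPnet or Llama2 components.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the paper's components: a pre-trained Conformer
# encoder, a pre-trained CTC head, a frozen instruction-tuned Llama2, and the
# new attention decoder that is trained from scratch in Step 2.
encoder = nn.Identity()                                      # pre-trained in Step 1
ctc_head = nn.Linear(256, 5000)                              # pre-trained in Step 1
llama2 = nn.Identity()                                       # frozen instruction-tuned LLM
decoder = nn.TransformerDecoderLayer(d_model=256, nhead=4)   # new, trainable

# Step 2: freeze everything except the decoder so that only its parameters
# are updated, as stated on the slide.
for frozen in (encoder, ctc_head, llama2):
    for p in frozen.parameters():
        p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in decoder.parameters() if p.requires_grad), betas=(0.9, 0.98)
)
```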

  8. Experiments Data: LibriSpeech (LS), using LS-100 for a lower-resource scenario and LS-960; TEDLIUM2 (TED2); and CoVoST2 (CV2). Transcriptions are normalized by default for ASR training, except for CV2, since unnormalized text can be crucial for the LLM to accurately capture linguistic information. Model: built with the ESPnet toolkit. Baseline model: a hybrid CTC-attention-based encoder-decoder model. Proposed model: the encoder network (attention heads, hidden size, feed-forward size, conv kernel size) consists of 2 CNN layers followed by 12 Conformer encoder blocks, with (4, 256, 1024, 31) for LS-100, TED2, and CV2, and (8, 512, 2048, 31) for LS-960; the decoder network (attention heads, hidden size, feed-forward size) consists of 6 Transformer decoder blocks, with (4, 256, 2048) for LS-100, TED2, and CV2, and (8, 512, 2048) for LS-960. To match the hidden dimensions of Llama2 and the decoder network, we applied a single linear layer to the Llama2 output.
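
The slide states that a single linear layer bridges the Llama2 and decoder hidden dimensions. A small sketch of that bridge follows; the concrete sizes are assumptions (4096 corresponds to the 7B Llama2 variant, 256 to the LS-100/TED2/CV2 decoder width, and 512 would be used for LS-960).

```python
import torch
import torch.nn as nn

# Single linear layer mapping Llama2 hidden states to the decoder width.
llama2_dim, decoder_dim = 4096, 256
bridge = nn.Linear(llama2_dim, decoder_dim)

# llm_hidden: hidden states from the frozen Llama2, shape (batch, length, 4096).
llm_hidden = torch.randn(2, 64, llama2_dim)
decoder_input = bridge(llm_hidden)  # shape (2, 64, 256), matched to the decoder
```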

  9. Experiments Training and decoding: The baseline model was trained for up to 50 epochs; the proposed model was trained for up to 50 epochs on LS-100 and 25 epochs on the other datasets. We use the Adam optimizer with Noam learning rate scheduling (warmup steps: 15k; peak learning rate: (0.0015, 0.002); regularization hyperparameters: default). Speech data were augmented using speed perturbation and adaptive SpecAugment. The CTC loss weight was set to 0.3 during baseline model training. Prompt: "You will be provided with a statement in quotes. Correct the wrong words and provide your revised version." The user input is enclosed within double quotation marks.
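
A sketch of how the prompt above might be assembled. The instruction text and the double-quoted user input come from the slide; wrapping them in Llama2-Chat's [INST] ... [/INST] template is an assumption, since the slide does not specify the chat formatting.

```python
# Instruction text taken from the slide.
INSTRUCTION = (
    "You will be provided with a statement in quotes. "
    "Correct the wrong words and provide your revised version."
)

def build_prompt(hypothesis: str) -> str:
    """Wrap an ASR hypothesis in the correction prompt (template is assumed)."""
    return f'[INST] {INSTRUCTION} "{hypothesis}" [/INST]'

print(build_prompt("i red the book yesterday"))
```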

  10. Result On all tasks other than LS-960, the proposed model outperformed the baseline model. On CV2, the proposed model demonstrated a notably higher gain: using unnormalized written-style text enables Llama2 to extract more precise linguistic information.

  11. Result A1: Without using Llama2 as its front end, there are improvements on the clean set but a slight decline in performance on the other set. A2: Removing the prompt still gives a slight improvement over the baseline, but it had a negative impact relative to the proposed model. A3: Wrong prompt: "You will be provided with a statement in quotes, and your task is to translate it into Japanese."

  12. Result Shallow fusion: incorporating the Llama2 probability into the joint decoding process, with the LM weight set to 0.5. Rescoring: using the Llama2 probability to rerank the top-10 hypotheses obtained from the joint decoding process; the scores from joint decoding and rescoring are combined, with a weight of 0.5 applied to the Llama2 score. Error correction: assessing the inherent ability of Llama2 for grammatical error correction, with responses generated using our prompt and instruction.
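
A sketch of the rescoring setup described above: rerank the top-10 hypotheses from joint decoding by adding the Llama2 score with weight 0.5. The `llama2_logprob` callable is a hypothetical scorer (e.g. the summed token log-likelihood of a hypothesis under Llama2) and is not defined here.

```python
from typing import Callable, List, Tuple

def rescore(
    nbest: List[Tuple[str, float]],          # (hypothesis, joint decoding score)
    llama2_logprob: Callable[[str], float],  # hypothetical Llama2 scorer
    llm_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Rerank an n-best list by adding the weighted Llama2 score."""
    combined = [(hyp, score + llm_weight * llama2_logprob(hyp))
                for hyp, score in nbest]
    return sorted(combined, key=lambda pair: pair[1], reverse=True)
```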

  13. Conclusion Proposed a novel integration of an instruction-tuned LLM and end-to-end ASR. Guided the LLM to perform grammatical error correction and leveraged the embedded linguistic information to enhance ASR performance. Future work: consider applying the proposed model to other speech tasks.
