
Speech to Text Technology: Evaluation and Guidelines
Explore the evaluation of speech-to-text models, guidelines for manual transcription, and error rates in transcription technologies. Learn about considerations and best practices in utilizing speech recognition technology effectively.
Presentation Transcript
Evaluating Different Speech to Text Transcription Models
Monica Puerto*, Elizabeth Nichols+, Curtiss Chapman+, Brian Sadacca*, Kevin Zajac+
Accenture Federal Services*, U.S. Census Bureau+
FedCASIC, April 2024
Disclaimer: This presentation is released to inform interested parties of research and to encourage discussion. The views expressed are those of the authors and not those of the U.S. Census Bureau. Transcript examples are heavily summarized. The presentation has been reviewed for disclosure avoidance and approved under CBDRB-FY24-CBSM002-047.
Objective: Understanding what to consider when using speech to text technology.
2020 Census Questionnaire Assistance (CQA)
About 4.7 million audio recordings
7,300 customer service representatives (CSRs)
95% of audio recordings were: English (87%), Spanish (<10%), Mandarin (<1%)
Speech to Text Pipeline for Evaluation
Guidelines for Manual Transcribers
We needed to follow certain conventions to match the machine transcription:
No use of punctuation
Proper use of accent marks
Spelling out the names of numbers and letters, e.g., "twenty twenty" instead of "2020"
Filler words: we established a shared document with our own standardized spellings of filler words, e.g., "umm", "hmm", "ahh"
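To make these conventions concrete, here is a minimal, illustrative normalization sketch in Python. The filler-word spellings and number mappings below are assumptions standing in for the team's shared document, not its actual contents.

```python
import re

# Hypothetical standardized spellings of filler words; the team's shared
# document defined its own list, which is not reproduced here.
FILLER_SPELLINGS = {"um": "umm", "hm": "hmm", "ah": "ahh", "uh": "uhh"}

# Illustrative subset of digit-to-word mappings.
NUMBER_WORDS = {"2020": "twenty twenty", "12": "twelve", "10": "ten"}

def normalize(text: str) -> str:
    """Apply the transcription conventions: spell out known numbers,
    strip punctuation (keeping accented letters), standardize fillers."""
    for digits, words in NUMBER_WORDS.items():
        text = re.sub(rf"\b{digits}\b", words, text)
    # \w matches Unicode word characters in Python 3, so accent marks survive.
    text = re.sub(r"[^\w\s]", "", text)
    tokens = [FILLER_SPELLINGS.get(tok.lower(), tok) for tok in text.split()]
    return " ".join(tokens)

print(normalize("Um, I completed the 2020 Census questionnaire."))
# -> "umm I completed the twenty twenty Census questionnaire"
```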
Evaluating Speech to Text: Model Output vs. Manual Transcription
Word Error Rate = (Substitutions + Insertions + Deletions) / Number of words in the manual (reference) transcript
WER = (S + I + D) / N
Manual transcription of agent: "Welcome to twenty twenty Census would you like information about the race categories"
Speech to text model transcription of agent: "Welcome to twenty twenty sensus would you like more information about response categories"
2 substitutions, 1 deletion, 1 insertion = 4 / 13 = 31% WER
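A minimal sketch of this WER calculation using a standard word-level Levenshtein alignment (the generic formula above, not the specific evaluation tooling used in the study):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER = (S + I + D) / N via Levenshtein alignment over words,
    where N is the number of words in the reference (manual) transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub_cost,  # substitution / match
                          d[i - 1][j] + 1,             # deletion
                          d[i][j - 1] + 1)             # insertion
    return d[len(ref)][len(hyp)] / len(ref)

manual = "welcome to twenty twenty census would you like information about the race categories"
model = "welcome to twenty twenty sensus would you like more information about response categories"
print(f"WER = {word_error_rate(manual, model):.0%}")   # -> WER = 31%
```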
Examples of Two Main Issues of Transcription Models: Hallucinations and Misspellings
Model output (errors preserved): i have already completed the twenty tweny senses . would you like more information about the response categories
Model output (errors preserved): okay so that would be a hispanic latino or spanish orgia includes all individuals who identify with one or more nationalities or ethnic groups originating in mexico portico cuba central and south american and other spanish cultures examples of these groups
Which Model Is Right for You? Areas where they differ:
1) Free (open source) vs. paid
2) Different modeling frameworks influence error types
3) Different training data can influence bias
4) Different languages supported
Takeaways from Speech to Text
English transcription models are much farther along than non-English ones: WER was 10% to 50% higher for the Spanish and Mandarin Chinese transcription models.
Consider languages like Mandarin Chinese, Japanese, Korean, and Vietnamese that can have more than one writing system (e.g., Traditional vs. Simplified characters). Can you specify the writing system in the model?
The agent side, with a headset, produced cleaner audio: on average 20-50% lower WER than the caller side, which had more background noise and more diverse speakers.
Some models, like OpenAI's Whisper, performed poorly with long pauses and would hallucinate.
Some models leave filler words like "umm" and "uhh" in the transcript, while others strip them out.
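As an illustration of the kind of configuration these takeaways point to, the open-source openai-whisper package lets you pin the transcription language and turn off conditioning on previous text, a commonly suggested mitigation for repetition and hallucination around long silences. This is a sketch under those assumptions (including the hypothetical file name), not the setup used in this evaluation.

```python
# pip install openai-whisper
import whisper

model = whisper.load_model("base")          # smaller checkpoint for illustration
result = model.transcribe(
    "agent_call.wav",                       # hypothetical audio file
    language="es",                          # pin the language instead of auto-detecting
    condition_on_previous_text=False,       # helps reduce repeated/hallucinated text after pauses
    no_speech_threshold=0.6,                # treat low-confidence segments as silence
)
print(result["text"])
```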
Ethics to consider when using transcribed text for machine learning processes:
What kind of decision is made from this data, and what margin of transcription error am I willing to accept?
Do I have enough data to be representative?
Is it impacting someone? If so, are there differences across speaker demographics?
Is there enough information to gather insights? Etc.
Different Ethical Scenarios
TOPIC MODELING. What: making sense of many transcribed calls, from millions of calls, to get an idea of the different types of questions callers have.
SCRIPT ADHERENCE. What: making sure customer service agents are reading approved survey language, such as race/origin questions, questions about census visits, and internet submission questions. Example of approved script: "Welcome to the 2020 Census. The 2020 Census questionnaire will take about 10 minutes to complete. All of the information that you provide will remain confidential. For accuracy of the data, I will need to read all of the questions exactly as written. Please give me your 12-digit Census ID. This ID can be found in the materials we mailed to you or left at your door."
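For the script-adherence scenario, a minimal sketch of one possible check is below: compare a transcribed agent turn against the approved script text with a simple similarity ratio. The normalized script excerpt and the 0.8 review threshold are illustrative assumptions, not the project's actual method.

```python
from difflib import SequenceMatcher

# Approved script text, normalized the same way as the transcripts (illustrative excerpt).
APPROVED_SCRIPT = (
    "welcome to the twenty twenty census the twenty twenty census questionnaire "
    "will take about ten minutes to complete"
)

def adherence_score(agent_turn: str, approved: str = APPROVED_SCRIPT) -> float:
    """Return a 0-1 similarity between a transcribed agent turn and the approved script."""
    return SequenceMatcher(None, agent_turn.lower(), approved).ratio()

turn = "welcome to the twenty twenty sensus the questionnaire will take about ten minutes"
score = adherence_score(turn)
print(f"similarity: {score:.2f}")
if score < 0.8:                 # illustrative threshold for flagging a call for manual review
    print("flag for review")
```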
Conclusion
To determine the best speech to text model for your use case, consider:
Optimal and representative audio quality
Model choice
Ethical considerations for each downstream task
Thank you!
Shoutout to the manual labeling team at Census: Lin Wang, Kai Yue Charm, Marcus Berger, Crystal Hernandez, Betsari Otero, Micah Harris, Ariana Mauras
Questions: Monica.Puerto@census.gov