
Learning Audio Segment Representations with Language Transfer of Audio Word2Vec
Explore Language Transfer of Audio Word2Vec: learning audio segment representations without target-language data. This approach enables a single model to be applied across many languages, addressing the challenges posed by multilingual audio data on the Internet, and shows how audio word vectors can serve as a universal model for spoken language processing.
Presentation Transcript
LANGUAGE TRANSFER OF AUDIO WORD2VEC: Learning Audio Segment Representations without Target Language Data. Speaker: Hung-Yi Lee. Authors: Chia-Hao Shen, Janet Y. Sung, Hung-Yi Lee.
Outline: Introduction; Training of Audio Word2Vec; Language Transfer; Application to Query-by-Example Spoken Term Detection (STD); Concluding Remarks.
Audio Word to Vector: a model maps each word-level audio segment to a vector, and the model is learned from lots of audio without annotation.
Audio Word to Vector: the audio segments corresponding to words with similar pronunciations are close to each other; e.g., utterances of 'dog' and 'dogs' lie near one another, as do 'never' and 'ever'. [Figure: embedding space with clustered utterances of dog, dogs, never, ever.]
Language Transfer: language X is not included in the training audio. The model is trained, unsupervised, on an audio collection without annotation. Can we train a universal model that can be applied even to unknown languages?
Language Transfer: why consider a universal model for all languages? If you want to apply the model to language X, why not simply train a model on audio of language X? Because many audio files are code-switched across several different languages, and because we want to apply the model to audio data on the Internet in hundreds of languages, the audio collection for model training may not cover all of them. It would therefore be beneficial to have a universal model.
Outline: Introduction; Training of Audio Word2Vec; Language Transfer; Application to Query-by-Example Spoken Term Detection; Concluding Remarks.
Audio Word to Vector: there are many approaches for segmenting audio into word-level segments. In the following discussion, we assume the segmentation has already been obtained.
Sequence-to-sequence Auto-encoder: we use a sequence-to-sequence autoencoder to turn an audio segment into a vector; the training is unsupervised. An RNN encoder reads the acoustic features x1, x2, x3, x4 of the audio segment, and its final state is the vector we want.
Sequence-to-sequence Auto-encoder: an RNN decoder then reconstructs acoustic features y1, y2, y3, y4 from that vector, targeting the input features x1, x2, x3, x4. The RNN encoder and decoder are jointly trained.
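As a concrete illustration, here is a minimal sketch of such a sequence-to-sequence autoencoder, assuming PyTorch; the hidden size and the zero-input decoding scheme are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a sequence-to-sequence autoencoder for audio segments.
import torch
import torch.nn as nn

class AudioWord2Vec(nn.Module):
    def __init__(self, feat_dim=39, hidden_dim=100):   # sizes are assumptions
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x):                   # x: (batch, time, feat_dim)
        _, z = self.encoder(x)              # z: (1, batch, hidden); the segment vector
        # Decoder reconstructs the sequence, conditioned on z as its initial
        # hidden state; zero inputs are fed at every step in this sketch.
        h, _ = self.decoder(torch.zeros_like(x), z)
        return self.out(h), z.squeeze(0)

model = AudioWord2Vec()
x = torch.randn(8, 50, 39)                  # 8 segments, 50 frames of 39-dim MFCC
y, z = model(x)
loss = nn.functional.mse_loss(y, x)         # unsupervised reconstruction loss
loss.backward()
```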
What does the machine learn? Text word2vec captures semantic relations, e.g., V(Rome) - V(Italy) + V(Germany) ≈ V(Berlin) and V(king) - V(queen) + V(aunt) ≈ V(uncle). Audio word2vec captures phonetic information: V(PEARL) - V(PEARLS) + V(GIRLS) = V(GIRL) and V(CAT) - V(CATS) + V(ITS) = V(IT). [Chung, Wu, Lee & Lee, Interspeech 16]
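A toy sketch of checking such an analogy by nearest-neighbour search, assuming the segment vectors are already collected in a dictionary; the random vectors here are placeholders for real encoder outputs.

```python
# Illustrative check of V(pearl) - V(pearls) + V(girls) ≈ V(girl).
import numpy as np

def nearest(query, emb):
    # cosine nearest neighbour among the stored vectors
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(emb, key=lambda w: cos(query, emb[w]))

rng = np.random.default_rng(0)
# placeholder vectors; in practice these come from the trained RNN encoder
emb = {w: rng.standard_normal(100) for w in ["pearl", "pearls", "girls", "girl"]}
target = emb["pearl"] - emb["pearls"] + emb["girls"]
print(nearest(target, emb))  # with well-trained embeddings this should be "girl"
```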
Outline: Introduction; Training of Audio Word2Vec; Language Transfer; Application to Query-by-Example Spoken Term Detection; Concluding Remarks.
Language Transfer: we train the sequence-to-sequence autoencoder on a source language with a large amount of data, then apply the RNN encoder to a new language. Training: the RNN encoder and decoder are both trained on the source language. Testing: the encoder trained on the source language produces the vector representation z for the target language.
Experimental Setup: 1-layer GRUs as encoder and decoder, trained with SGD; the initial learning rate was 1 and was decayed by a factor of 0.95 every 500 batches. Acoustic features: 39-dim MFCCs. We used forced alignment with reference transcriptions to obtain word boundaries, so the results are somewhat oracle; we address this issue in another ICASSP paper.
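A sketch of this front end and optimizer schedule, assuming librosa and PyTorch; the 13-coefficient-plus-deltas split of the 39 dimensions is a common convention and an assumption here, not a detail given on the slide.

```python
import librosa
import numpy as np
import torch

# 39-dim MFCCs: 13 coefficients plus deltas and delta-deltas (assumed split).
y, sr = librosa.load("segment.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
feat = np.vstack([mfcc,
                  librosa.feature.delta(mfcc),
                  librosa.feature.delta(mfcc, order=2)]).T   # (n_frames, 39)

# SGD schedule as stated on the slide: initial learning rate 1,
# decayed by a factor of 0.95 every 500 batches.
opt = torch.optim.SGD(model.parameters(), lr=1.0)   # `model` from the sketch above
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=500, gamma=0.95)
# call sched.step() once per training batch
```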
Experimental Setup - Corpus: English is the source language, while the other languages are target languages. English (LibriSpeech): 2.2M word-level audio segments for training and 250K audio segments for testing. French, German, Czech, and Spanish (GlobalPhone): 20K audio segments each for testing.
Phonetic Information: we measure the edit distance between phoneme sequences (e.g., 'ever' = EH V ER and 'never' = N EH V ER have edit distance 1) and compare it with the cosine similarity between the vectors the RNN encoder produces for the two segments.
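A small sketch of both quantities; the phoneme sequences are the example from the slide, and the Levenshtein implementation is standard rather than taken from the paper.

```python
import numpy as np

def edit_distance(a, b):
    # standard Levenshtein distance via dynamic programming
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i - 1, j] + 1,
                          d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a), len(b)]

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(edit_distance(["EH", "V", "ER"], ["N", "EH", "V", "ER"]))  # prints 1
```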
Phonetic Information: model trained on English, tested on English. [Figure: mean cosine similarity (with variance bars) against phoneme sequence edit distance, ranging from the same pronunciation to very different pronunciations.] The larger the phoneme sequence edit distance, the smaller the cosine similarity.
Phonetic Information: model trained on English, tested on other languages. [Figure: cosine similarity against phoneme sequence edit distance for each target language.] Audio Word2Vec still captures phonetic information even though the model has never heard the language.
Visualization: to visualize the embedding of each word (e.g., 'day'), we pass every utterance of the word through the RNN encoder, average the resulting vectors, and project the averages to 2-D.
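A sketch of this averaging-and-projection step, assuming scikit-learn's PCA as the 2-D projection (the slides do not name the projection method); `vecs_by_word` is a hypothetical container of encoder outputs.

```python
import numpy as np
from sklearn.decomposition import PCA

# hypothetical: per-word stacks of encoder vectors, (n_utterances, hidden_dim)
rng = np.random.default_rng(0)
vecs_by_word = {w: rng.standard_normal((5, 100)) for w in ["day", "night", "pearl"]}

# average the vectors of all utterances of each word, then project to 2-D
avg = {w: v.mean(axis=0) for w, v in vecs_by_word.items()}
words = list(avg)
points = PCA(n_components=2).fit_transform(np.stack([avg[w] for w in words]))
# plot `points` labelled by `words`
```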
Visualization: learned on English and applied to the other languages. [Figure: 2-D projections of word embeddings for French and German.]
Outline: Introduction; Training of Audio Word2Vec; Language Transfer; Application to Query-by-Example Spoken Term Detection (STD); Concluding Remarks.
Query-by-example Spoken Term Detection: the user speaks a query (e.g., 'ICASSP'), and the system computes the similarity between the spoken query and the audio files in the spoken content at the acoustic level to find where the query term occurs.
Query-by-example Spoken Term Detection: DTW-based approaches include Segmental DTW [Zhang, ICASSP 10] and Subsequence DTW [Anguera, ICME 13][Calvo, MediaEval 14]; slope constraints can also be added [Chan & Lee, Interspeech 10]. [Figure: DTW alignment between spoken query and utterance; the blue path is better than the green one.]
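For reference, a minimal plain DTW sketch of the kind these systems build on; it has no slope constraints or subsequence handling, for which see the cited papers.

```python
import numpy as np

def dtw(query, utt):
    # query: (n, d) and utt: (m, d) frame-level features; returns alignment cost
    n, m = len(query), len(utt)
    dist = np.linalg.norm(query[:, None, :] - utt[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # each cell extends the cheapest of the three predecessor paths
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[n, m]
```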
Query-by-example Spoken Term Detection with Audio Word2Vec is much faster than DTW. Off-line, the audio archive is divided into variable-length audio segments and each segment is converted to a vector by Audio Word to Vector; on-line, the spoken query is converted to a vector in the same way, and the similarity between vectors directly yields the search result.
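A sketch of this two-phase search; `encode`, `segments`, and `query_segment` are hypothetical names, and the encoder is stubbed with a random projection so the example runs end to end.

```python
import numpy as np

rng = np.random.default_rng(0)
proj = rng.standard_normal((39, 100))
def encode(seg):                        # stand-in for the trained RNN encoder
    return seg.mean(axis=0) @ proj      # seg: (n_frames, 39) MFCC matrix

segments = [rng.standard_normal((50, 39)) for _ in range(1000)]  # archive (hypothetical)
query_segment = rng.standard_normal((40, 39))

# Off-line: encode and L2-normalize every archive segment once.
archive = np.stack([encode(s) for s in segments])
archive /= np.linalg.norm(archive, axis=1, keepdims=True)

# On-line: encode the query; ranking is a single matrix-vector product,
# which is what makes this much cheaper than frame-level DTW.
q = encode(query_segment)
q /= np.linalg.norm(q)
ranking = np.argsort(-(archive @ q))    # indices of best-matching segments first
```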
Query-by-Example STD [Tu & Lee, ASRU 11][I.-F. Chen, Interspeech 13]. Baseline: Naïve Encoder.
Query-by-Example STD on English: 1K queries to retrieve 250K audio segments. [Figure: MAP of DTW, the Naïve Encoder, and Audio Word2Vec.] The evaluation measure is Mean Average Precision (MAP); the larger, the better.
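For clarity, a small sketch of computing MAP, assuming each query comes with a 0/1 relevance list over its ranked retrieval results (a hypothetical data layout, not the paper's evaluation code).

```python
import numpy as np

def average_precision(rel):
    # rel: 0/1 relevance of the ranked results for one query
    rel = np.asarray(rel)
    if not rel.any():
        return 0.0
    hits = np.cumsum(rel)
    # precision at each rank where a relevant segment was retrieved
    precision_at_hit = hits[rel == 1] / (np.flatnonzero(rel) + 1)
    return precision_at_hit.mean()

def mean_average_precision(ranked_relevance):
    return np.mean([average_precision(r) for r in ranked_relevance])

print(mean_average_precision([[1, 0, 1, 0], [0, 1, 0, 0]]))  # 0.6667
```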
Query-by-Example STD with Language Transfer: 1K queries to retrieve 20K audio segments. [Figure: MAP across language settings (FRE, GER, CZE, ESP); bars compare the Naïve Encoder with Audio Word2Vec trained on the target language (4K segments).] The performance of Audio Word2Vec is poor with such limited training data.
Query-by-Example STD with Language Transfer. [Figure: the same comparison with a third bar added for Audio Word2Vec trained on English (2.2M segments).] Audio Word2Vec learned on English can be directly applied to French and German.
Query-by-Example STD with Language Transfer. [Figure: a fourth bar added for Audio Word2Vec trained on English and fine-tuned on the target language (2K segments).] Fine-tuning the English model with target-language data is helpful.
Outline: Introduction; Training of Audio Word2Vec; Language Transfer; Application to Query-by-Example Spoken Term Detection; Concluding Remarks.
Concluding Remarks: we verified the language-transfer capability of Audio Word2Vec. Audio Word2Vec learned from English captures the phonetic information of other languages, and in query-by-example STD it outperformed the baselines on French and German.
SEGMENTAL AUDIO WORD2VEC Session: Spoken Language Acquisition and Retrieval Time: Wednesday, April 18, 16:00 - 18:00