
Speech Perception through Parseval's Theorem and Spectral Analysis
Explore the concept of speech perception by delving into Parseval's Theorem, cepstral distance, and spectral analysis. Learn about the basilar membrane frequency scales, mel, ERB filterbank coefficients, MFCC, and how these factors affect what spectrum people hear. Discover how the L2 norm of signals and Fourier transforms play a role in understanding differences in acoustic signals using KNN. Dive into low-pass liftering and smoothed spectra for analyzing vowel differences in speech signals.
ECE 417, Lecture 10: Speech Perception Mark Hasegawa-Johnson 10/3/2017
Content
Parseval's Theorem: Cepstral Distance = Spectral Distance
What spectrum do people hear? The basilar membrane
Frequency scales for hearing: mel, ERB
Filterbank coefficients and MFCC
Parseval's Theorem
The L2 norm of a signal equals the L2 norm of its Fourier transform (up to a constant that depends on how the transform is normalized).
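Parseval's Theorem: Examples
Fourier series:
$$\frac{1}{T}\int_0^{T} |x(t)|^2\,dt = \sum_{k=-\infty}^{\infty} |X_k|^2$$
DTFT:
$$\sum_{n=-\infty}^{\infty} |x[n]|^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(\omega)|^2\,d\omega$$
DFT:
$$\sum_{n=0}^{N-1} |x[n]|^2 = \frac{1}{N}\sum_{k=0}^{N-1} |X[k]|^2$$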
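The DFT form of the theorem is easy to check numerically. The sketch below assumes NumPy is available and uses an arbitrary random test signal; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed so the check is repeatable
x = rng.standard_normal(64)         # arbitrary real test signal, N = 64
X = np.fft.fft(x)                   # N-point DFT (unnormalized forward transform)

energy_time = np.sum(x ** 2)                    # sum over n of x[n]^2
energy_freq = np.sum(np.abs(X) ** 2) / len(x)   # (1/N) * sum over k of |X[k]|^2

print(energy_time, energy_freq)     # the two numbers agree up to rounding error
```

The factor 1/N appears because `np.fft.fft` applies no scaling in the forward direction.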
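Parseval's Theorem: DCT
$$\frac{1}{N}\left(c[0]^2 + 2\sum_{n=1}^{N-1} c[n]^2\right) = \sum_{k=0}^{N-1} S_k^2$$
where you remember that
$$S_k = \ln\left|X\!\left(\frac{(k+0.5)F_s}{N}\right)\right|, \qquad c[n] = \sum_{k=0}^{N-1} S_k \cos\left(\frac{\pi (k+0.5)\, n}{N}\right)$$
with $F_s$ the sampling frequency.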
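Parseval's Theorem: Vector Formulation
Suppose we define the vectors $\vec{v}$ and $\vec{S}$ as the cepstrum and the log spectrum, thus
$$\vec{v} = [v_0, \ldots, v_{N-1}]^T, \qquad \vec{S} = [S_0, \ldots, S_{N-1}]^T$$
where for convenience we'll say
$$v_n = \begin{cases} \dfrac{c[0]}{\sqrt{N}} & n = 0 \\[2ex] \dfrac{c[n]}{\sqrt{N/2}} & 1 \le n \le N-1 \end{cases}$$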
Parseval's Theorem: Vector Formulation
That way Parseval's theorem can be written very simply as
$$\sum_{n=0}^{N-1} v_n^2 = \sum_{k=0}^{N-1} S_k^2$$
or even more simply as
$$\|\vec{v}\|^2 = \|\vec{S}\|^2$$
i.e., the L2 norm of the cepstrum equals the L2 norm of the log spectrum.
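A quick numerical check of the vector form, as a minimal sketch. It assumes the unnormalized cosine transform c[n] = sum_k S_k cos(pi (k+0.5) n / N) and the scaling v_0 = c[0]/sqrt(N), v_n = c[n]/sqrt(N/2) for n >= 1. The function names and the random stand-in for a log spectrum are hypothetical.

```python
import numpy as np

def unnormalized_dct(S):
    """c[n] = sum_k S[k] * cos(pi * (k + 0.5) * n / N) for n = 0..N-1."""
    N = len(S)
    n = np.arange(N)[:, None]
    k = np.arange(N)[None, :]
    return np.cos(np.pi * (k + 0.5) * n / N) @ S

def scaled_cepstrum(c):
    """Scale c[n] so that the resulting vector v has the same L2 norm as S."""
    N = len(c)
    v = c / np.sqrt(N / 2.0)     # scaling for n >= 1
    v[0] = c[0] / np.sqrt(N)     # separate scaling for n = 0
    return v

rng = np.random.default_rng(1)
S = rng.standard_normal(32)      # made-up stand-in for a log magnitude spectrum
c = unnormalized_dct(S)
v = scaled_cepstrum(c)

print(np.sum(v ** 2), np.sum(S ** 2))   # equal up to rounding error
```

The scaling makes the transform from S to v orthonormal, which is why the norms match.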
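What it means for KNN
Suppose we have two acoustic signals $x(t)$ and $y(t)$, and we want to find out how different they sound. If they have static spectra, then a good measure of their difference is the L2 difference between their log spectra, which by Parseval's theorem equals the L2 difference between their cepstra:
$$\sum_{k=0}^{N-1}\left(\ln\left|X\!\left(\tfrac{(k+0.5)F_s}{N}\right)\right| - \ln\left|Y\!\left(\tfrac{(k+0.5)F_s}{N}\right)\right|\right)^2 = \|\vec{S}_x - \vec{S}_y\|^2 = \|\vec{v}_x - \vec{v}_y\|^2 = \sum_{n=0}^{N-1}\left(v_{x,n} - v_{y,n}\right)^2$$
where $\vec{S}_x, \vec{S}_y$ are the log spectrum vectors and $\vec{v}_x, \vec{v}_y$ the scaled cepstrum vectors of the two signals.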
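For a nearest-neighbor classifier this means the neighbor ranking is the same whether distances are computed on log spectra or on cepstra. The sketch below illustrates this with made-up random vectors and an orthonormal DCT-II matrix built by hand; none of the names or data come from the lecture.

```python
import numpy as np

def orthonormal_dct_matrix(N):
    """DCT-II matrix with orthonormal rows, so that C @ C.T is the identity."""
    n = np.arange(N)[:, None]
    k = np.arange(N)[None, :]
    C = np.sqrt(2.0 / N) * np.cos(np.pi * (k + 0.5) * n / N)
    C[0, :] = 1.0 / np.sqrt(N)
    return C

rng = np.random.default_rng(2)
N = 32
C = orthonormal_dct_matrix(N)

train_S = rng.standard_normal((5, N))   # five hypothetical training log spectra
query_S = rng.standard_normal(N)        # one hypothetical query log spectrum

train_v = train_S @ C.T                 # scaled cepstra of the training items
query_v = C @ query_S                   # scaled cepstrum of the query

d_spec = np.linalg.norm(train_S - query_S, axis=1)   # distances in the log-spectral domain
d_ceps = np.linalg.norm(train_v - query_v, axis=1)   # distances in the cepstral domain

nearest_spec = int(np.argmin(d_spec))
nearest_ceps = int(np.argmin(d_ceps))
print(nearest_spec, nearest_ceps)       # same index in both domains
```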
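Low-pass liftered L2 norm
If you want to know whether two signals are the same vowel, then you want to know how different their smoothed spectra are. Let $H(k)$ be your smoothing function. You smooth the log spectrum, then find the L2 distance:
$$\sum_{k=0}^{N-1}\left(H(k) * \ln\left|X\!\left(\tfrac{(k+0.5)F_s}{N}\right)\right| - H(k) * \ln\left|Y\!\left(\tfrac{(k+0.5)F_s}{N}\right)\right|\right)^2 = \sum_{k=0}^{N-1}\left(H(k) * S_{x,k} - H(k) * S_{y,k}\right)^2 = \sum_{n=0}^{N-1} h^2[n]\left(v_{x,n} - v_{y,n}\right)^2$$
where $*$ denotes convolution over $k$, and $h[n]$ is the cepstral window (lifter) that corresponds to the smoothing function $H(k)$.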
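Low-pass liftered L2 norm
In particular, suppose
$$h[n] = \begin{cases} 1 & 0 < n \le 15 \\ 0 & \text{otherwise} \end{cases}$$
Then
$$\sum_{k=0}^{N-1}\left(H(k) * \ln\left|X\!\left(\tfrac{(k+0.5)F_s}{N}\right)\right| - H(k) * \ln\left|Y\!\left(\tfrac{(k+0.5)F_s}{N}\right)\right|\right)^2 = \sum_{n=1}^{15}\left(v_{x,n} - v_{y,n}\right)^2$$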
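The equivalence between smoothing the log spectrum and windowing the cepstrum can be checked numerically. In this sketch the smoothing is implemented by zeroing cepstral coefficients outside n = 1..15 and transforming back, which is one way to realize such a smoothing function; the orthonormal DCT matrix, the random test vectors, and all names are assumptions for illustration.

```python
import numpy as np

def orthonormal_dct_matrix(N):
    """DCT-II matrix with orthonormal rows."""
    n = np.arange(N)[:, None]
    k = np.arange(N)[None, :]
    C = np.sqrt(2.0 / N) * np.cos(np.pi * (k + 0.5) * n / N)
    C[0, :] = 1.0 / np.sqrt(N)
    return C

N = 64
C = orthonormal_dct_matrix(N)

h = np.zeros(N)
h[1:16] = 1.0                     # lifter: keep n = 1..15, drop n = 0 and n > 15

def smooth(S):
    """Smooth a log spectrum by liftering its cepstrum and transforming back."""
    return C.T @ (h * (C @ S))

rng = np.random.default_rng(3)
S_x = rng.standard_normal(N)      # made-up log spectra for two signals
S_y = rng.standard_normal(N)
v_x, v_y = C @ S_x, C @ S_y       # their scaled cepstra

d2_smoothed_spectra = np.sum((smooth(S_x) - smooth(S_y)) ** 2)
d2_liftered_cepstra = np.sum((v_x[1:16] - v_y[1:16]) ** 2)
print(d2_smoothed_spectra, d2_liftered_cepstra)   # equal up to rounding error
```

In practice this means only 15 numbers per frame need to be stored and compared, rather than the whole smoothed spectrum.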
What spectrum do people hear? Basilar membrane
Basilar membrane of the cochlea = a bank of mechanical bandpass filters
Frequency scales for hearing: mel scale, ERB scale
Mel scale
The experiment: play tones A, B, C. Let the user adjust tone D until pitch(D) - pitch(C) sounds the same as pitch(B) - pitch(A).
Analysis: create a frequency scale $m(f)$ such that $m(D) - m(C) = m(B) - m(A)$.
Result:
$$m(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
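As a small sketch, the mel formula and its algebraic inverse can be written as follows. The function names are illustrative, and the inverse is derived from the formula rather than taken from the slides.

```python
import math

def hz_to_mel(f_hz):
    """m(f) = 2595 * log10(1 + f / 700), with f in Hz."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel, obtained by solving m(f) for f."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equal steps in Hz give shrinking steps in mel as frequency rises.
low_step = hz_to_mel(600.0) - hz_to_mel(500.0)
high_step = hz_to_mel(4100.0) - hz_to_mel(4000.0)
print(hz_to_mel(1000.0), low_step, high_step)
```

With these constants, 1000 Hz maps to very nearly 1000 mel, and a 100 Hz step near 4 kHz covers far fewer mels than the same step near 500 Hz.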
Critical bands
When two tones play at exactly the same frequency, users can't tell the difference between x(t) and x(t)+y(t) if y(t) is about 14 dB below x(t) (in other words, if the summed power is about 1.04 times the power of x(t) alone).
When x(t) and y(t) are at different frequencies, the masking power of x(t) is reduced.
Model: assume that the masking power of x(t) is reduced because x(t) is coming in through the tails of the bandpass filter centered at y(t).
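The power ratio implied by a level difference in dB follows directly from the definition of the decibel. This is a short sketch with a hypothetical helper name, assuming the two tones' powers simply add.

```python
def summed_power_ratio(db_below):
    """Power of x+y relative to x alone, when y is db_below dB weaker than x."""
    return 1.0 + 10.0 ** (-db_below / 10.0)

ratio_14 = summed_power_ratio(14.0)
print(round(ratio_14, 3))   # roughly 1.04 for a 14 dB difference
```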
ERB scale
The experiment: find out the widths, $B(f)$, of the critical-band filters centered at every frequency $f$.
Analysis: create a scale $e(f)$ such that $e(f + 0.5B(f)) - e(f - 0.5B(f)) = 1$ for all frequencies.
Result:
$$e(f) = 21.4 \log_{10}\left(1 + 0.00437 f\right)$$
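The defining property of the ERB scale can be checked numerically once a bandwidth function B(f) is chosen. The slides do not give B(f); the sketch below assumes the commonly used Glasberg and Moore fit B(f) = 24.7 (1 + 0.00437 f) Hz, and the function names are illustrative.

```python
import math

def hz_to_erb_number(f_hz):
    """e(f) = 21.4 * log10(1 + 0.00437 * f), with f in Hz."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)

def erb_bandwidth_hz(f_hz):
    """ASSUMPTION: Glasberg and Moore bandwidth fit, not stated in the slides."""
    return 24.7 * (1.0 + 0.00437 * f_hz)

steps = []
for f in (250.0, 1000.0, 4000.0):
    B = erb_bandwidth_hz(f)
    # One critical bandwidth centered at f should span about one ERB unit.
    steps.append(hz_to_erb_number(f + 0.5 * B) - hz_to_erb_number(f - 0.5 * B))

print(steps)   # each value is close to 1
```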
Mel filterbank coefficients: convert the spectrum from Hertz-frequency to mel-frequency
Goal: instead of computing
$$S_k = \ln\left|X\!\left(\frac{(k+0.5)F_s}{N}\right)\right|$$
we want
$$S_k = \ln\left|X(f_k)\right|$$
where the frequencies $f_k$ are uniformly spaced on a mel scale, i.e., $m(f_{k+1}) - m(f_k)$ is a constant across all $k$.
The problem with that idea: we don't want to just sample the spectrum. We want to summarize everything that's happening within a frequency band.
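Mel filterbank coefficients: convert the spectrum from Hertz-frequency to mel-frequency
The solution:
$$S_k = \ln \sum_{l=0}^{N/2-1} H_k(l) \left|X\!\left(\frac{l F_s}{N}\right)\right|$$
where $l F_s / N$ is the frequency in Hz of FFT bin $l$, and $H_k(l)$ is a triangular filter centered at $f_k$:
$$H_k(l) = \begin{cases} \dfrac{l F_s/N - f_{k-1}}{f_k - f_{k-1}} & f_{k-1} \le l F_s/N \le f_k \\[2ex] \dfrac{f_{k+1} - l F_s/N}{f_{k+1} - f_k} & f_k \le l F_s/N \le f_{k+1} \\[2ex] 0 & \text{otherwise} \end{cases}$$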
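A minimal sketch of such a triangular filterbank, assuming NumPy and band edges spaced uniformly in mel between 0 Hz and the Nyquist frequency. The function name, the number of filters, the FFT size, and the sampling rate are illustrative choices rather than values from the slides.

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters, n_fft, fs):
    """Return (H, f): H[k-1, l] is the triangle centered at f[k], sampled at l*fs/n_fft."""
    # num_filters + 2 band edges, uniformly spaced on the mel scale (assumed range 0..fs/2)
    f = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), num_filters + 2))
    bin_freqs = np.arange(n_fft // 2) * fs / n_fft        # FFT bins l = 0..N/2-1, in Hz
    H = np.zeros((num_filters, n_fft // 2))
    for k in range(1, num_filters + 1):
        rising = (bin_freqs - f[k - 1]) / (f[k] - f[k - 1])    # slope from f[k-1] up to f[k]
        falling = (f[k + 1] - bin_freqs) / (f[k + 1] - f[k])   # slope from f[k] down to f[k+1]
        H[k - 1] = np.clip(np.minimum(rising, falling), 0.0, None)  # zero outside the band
    return H, f

H, edges = mel_filterbank(num_filters=20, n_fft=512, fs=16000.0)
print(H.shape)   # (20, 256)
```

Taking the minimum of the rising and falling lines and clipping at zero reproduces the three-case definition without explicit branching.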
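MFCC: the full process
Divide the acoustic signal into frames.
Compute the magnitude FFT of each frame.
Filterbank coefficients:
$$S_k = \ln \sum_{l=0}^{N/2-1} H_k(l) \left|X\!\left(\frac{l F_s}{N}\right)\right|$$
MFCC:
$$c[n] = \sum_{k=0}^{N-1} S_k \cos\left(\frac{\pi (k+0.5)\, n}{N}\right)$$
where $N$ in this step counts the filterbank coefficients $S_k$ rather than FFT samples.
Liftering: keep only the first 12-15 MFCC coefficients, set the rest to zero.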
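The steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the lecture's implementation: the 25 ms frames with 10 ms hop, the Hamming window, the 512-point FFT, the 26 filters, the small floor inside the logarithm, and all function names are choices made here.

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters, n_fft, fs):
    """Triangular filters with edges uniformly spaced in mel from 0 to fs/2 (assumed range)."""
    f = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), num_filters + 2))
    bin_freqs = np.arange(n_fft // 2) * fs / n_fft
    H = np.zeros((num_filters, n_fft // 2))
    for k in range(1, num_filters + 1):
        rising = (bin_freqs - f[k - 1]) / (f[k] - f[k - 1])
        falling = (f[k + 1] - bin_freqs) / (f[k + 1] - f[k])
        H[k - 1] = np.clip(np.minimum(rising, falling), 0.0, None)
    return H

def mfcc(signal, fs, frame_len=400, hop=160, n_fft=512, num_filters=26, num_keep=13):
    signal = np.asarray(signal, dtype=float)
    # 1. Divide the signal into overlapping frames (Hamming window is an assumption).
    num_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 2. Magnitude FFT of each frame, keeping bins l = 0..N/2-1.
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))[:, : n_fft // 2]
    # 3. Log filterbank coefficients S_k (tiny floor avoids log(0); an assumption).
    H = mel_filterbank(num_filters, n_fft, fs)
    S = np.log(mag @ H.T + 1e-10)
    # 4. Cosine transform: c[n] = sum_k S_k cos(pi (k + 0.5) n / K), K = num_filters.
    n = np.arange(num_filters)[:, None]
    k = np.arange(num_filters)[None, :]
    c = S @ np.cos(np.pi * (k + 0.5) * n / num_filters).T
    # 5. Liftering: keep only the first num_keep coefficients.
    return c[:, :num_keep]

fs = 16000
t = np.arange(fs) / fs                                   # one second of audio
rng = np.random.default_rng(4)
test_signal = np.sin(2 * np.pi * 440.0 * t) + 0.01 * rng.standard_normal(fs)
feats = mfcc(test_signal, fs)
print(feats.shape)   # (98, 13): one 13-dimensional vector per frame
```

Dropping the higher coefficients is equivalent to setting them to zero when distances are computed, so the truncated vectors can be fed directly to a nearest-neighbor classifier.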
Summary
L2 distance(cepstra) = L2 distance(log magnitude spectra)
L2 distance(windowed cepstrum) = L2 distance(smoothed log magnitude spectrum)