
Analyzing Environmental Data with MATLAB or Python: Lecture Insights and Applications
Dive into the world of environmental data analysis using MATLAB or Python with a focus on spectral peak assessment, confidence intervals, and the Bootstrap Method. Explore topics such as probability, linear models, Fourier series, filters, and hypothesis testing. Learn how to assess the significance of spectral peaks and determine confidence intervals, allowing you to make informed decisions based on data analysis.
Environmental Data Analysis with MATLAB or Python, 3rd Edition, Lecture 26
SYLLABUS
Lecture 01  Intro; Using MATLAB or Python
Lecture 02  Looking At Data
Lecture 03  Probability and Measurement Error
Lecture 04  Multivariate Distributions
Lecture 05  Linear Models
Lecture 06  The Principle of Least Squares
Lecture 07  Prior Information
Lecture 08  Solving Generalized Least Squares Problems
Lecture 09  Fourier Series
Lecture 10  Complex Fourier Series
Lecture 11  Lessons Learned from the Fourier Transform
Lecture 12  Power Spectra
Lecture 13  Filter Theory
Lecture 14  Applications of Filters
Lecture 15  Factor Analysis and Cluster Analysis
Lecture 16  Empirical Orthogonal Functions and Clusters
Lecture 17  Covariance and Autocorrelation
Lecture 18  Cross-correlation
Lecture 19  Smoothing, Correlation and Spectra
Lecture 20  Coherence; Tapering and Spectral Analysis
Lecture 21  Interpolation and Gaussian Process Regression
Lecture 22  Linear Approximations and Non-Linear Least Squares
Lecture 23  Adaptable Approximations with Neural Networks
Lecture 24  Hypothesis Testing
Lecture 25  Hypothesis Testing continued; F-Tests
Lecture 26  Confidence Limits of Spectra, Bootstraps
purpose of the lecture: continue to develop a way to assess the significance of a spectral peak, and develop the Bootstrap Method of determining confidence intervals
Part 1 assessing the confidence level of a spectral peak
what does confidence in a spectral peak mean?
one possibility: an indefinitely long phenomenon. You observe a short time window (it looks noisy, with no obvious periodicities), you compute the p.s.d., and you detect a peak. You ask: would this peak still be there if I observed some other time window, or did it arise from random variation?
example
[Figure: a noisy time series d(t) over 0-1000 s, and the amplitude spectral densities (a.s.d.) of four short windows; a peak is detected (Y) in only one of the four windows, not (N) in the other three]
[Figure: the same time series and four windows, but here the peak is detected (Y) in all four windows]
Null Hypothesis The spectral peak can be explained by random variation in a time series that consists of nothing but random noise.
Easiest Case to Analyze
A random time series that is:
Normally-distributed
uncorrelated
zero mean
with variance that matches the power of the time series under consideration
So what is the probability density function p(s²) of the points in the power spectral density s²(f) of such a time series?
Chain of Logic, Part 1 The time series is Normally-distributed The Fourier Transform is a linear function of the time series Linear functions of Normally-distributed variables are Normally-distributed, so the Fourier Transform is Normally-distributed too For a complex FT, the real and imaginary parts are individually Normally-distributed
Chain of Logic, Part 2 The time series has zero mean The Fourier Transform is a linear function of the time series The mean of a linear function is the function of the mean value, so the mean of the FT is zero For a complex FT, the means of the real and imaginary parts are individually zero
Chain of Logic, Part 3 The time series is uncorrelated The Fourier Transform has [GᵀG]⁻¹ proportional to I So by the usual rules of error propagation, the Fourier Transform is uncorrelated too For a complex FT, the real and imaginary parts are uncorrelated
Chain of Logic, Part 4 The power spectral density is proportional to the sum of squares of the real and imaginary parts of the Fourier Transform The sum of squares of two uncorrelated Normally-distributed variables with zero mean and unit variance is chi-squared distributed with two degrees of freedom. Once the p.s.d. is scaled to have unit variance, it is chi-squared distributed with two degrees of freedom.
so s²/c is chi-squared distributed, where c is a yet-to-be-determined scaling factor
in the text, it is shown how to compute the scaling factor c from: σd², the variance of the data; Nf, the length of the p.s.d.; Δf, the frequency sampling; and σf², the variance of the taper, which adjusts for the effect of tapering
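The chain of logic above can be checked numerically. Below is a minimal sketch (assuming NumPy and SciPy; the variable names and the mean-2 scaling are illustrative, not from the text) that computes the periodogram of uncorrelated Normal noise and verifies that its values behave like a chi-squared variable with two degrees of freedom:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 4096                      # number of samples
d = rng.standard_normal(N)    # Normal, zero-mean, uncorrelated noise

# raw periodogram (no taper), excluding the zero and Nyquist frequencies
D = np.fft.rfft(d)
s2 = np.abs(D[1:-1]) ** 2

# scale so the spectral values have mean 2, like chi-squared with 2 dof
s2_scaled = 2.0 * s2 / np.mean(s2)

# compare the empirical 95% level with the chi-squared(2) distribution
emp95 = np.quantile(s2_scaled, 0.95)
the95 = stats.chi2.ppf(0.95, 2)
print(emp95, the95)   # the two 95% levels should be close
```

Since chi-squared with two degrees of freedom is just an exponential distribution with mean 2, the empirical 95% level should land near stats.chi2.ppf(0.95, 2), about 5.99.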
example 1: a completely random time series
[Figure: (A) tapered time series d(t) over 0-30 s, with the ±2σd levels marked; (B) its power spectral density s²(f) over 0-20 Hz, with the mean and 95% levels marked]
example 1: histogram of spectral values
[Figure: histogram of power spectral density values s²(f), with the mean and 95% levels marked]
example 2: random time series consisting of a 5 Hz cosine plus noise
[Figure: (A) tapered time series d(t) over 0-30 s, with the ±2σd levels marked; (B) its power spectral density s²(f) over 0-20 Hz, with the mean and 95% levels marked and a prominent peak at 5 Hz]
example 2: histogram of spectral values
[Figure: histogram of power spectral density values s²(f), with the mean, the 95% level, and the peak value marked]
probability of a spectral peak

MATLAB:
p = 2;                       % degrees of freedom
ppeak = chi2cdf(speak/c, p); % probability that the p.s.d. is less than the peak value speak, scaled by the factor c

Python:
p = 2;  # degrees of freedom
ppeak = stats.chi2.cdf(speak/c, p);
so how confident are we of a peak at 5 Hz? The chi-squared c.d.f. evaluates to 0.99994: the p.s.d. is predicted to be less than the level of the peak 99.994% of the time. But here we must be very careful.
two alternative Null Hypotheses

1. A peak of the observed amplitude at 5 Hz is caused by random variation. A peak of the observed amplitude or greater occurs only 1 - 0.99994 = 0.006% of the time, so this Null Hypothesis can be rejected with high certainty.

2. A peak of the observed amplitude somewhere in the p.s.d. is caused by random variation. This case is much more likely, since the p.s.d. has many frequency points (513 in this case): such a peak occurs 1 - (0.99994)^513 ≈ 3% of the time, so this Null Hypothesis can be rejected only with acceptable certainty.
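The two rejection levels above are simple arithmetic; a quick check using only the numbers quoted on the slides (0.99994 and 513 frequency points):

```python
p_single = 1.0 - 0.99994           # chance of a peak this large at one fixed frequency
p_anywhere = 1.0 - 0.99994 ** 513  # chance of such a peak anywhere among 513 points
print(p_single)     # about 0.00006, i.e. 0.006%
print(p_anywhere)   # about 0.03, i.e. 3%
```

Raising 0.99994 to the 513th power is the probability that none of the 513 independent spectral values exceeds the peak level; its complement is the "anywhere" probability.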
Part 2 The Bootstrap Method
The Issue What do you do when you have a statistic that can test a Null Hypothesis, but you don't know its probability density function?
If you could repeat the experiment many times, you could address the problem empirically:
perform the experiment
calculate the statistic, s
repeat
make a histogram of the s's
normalize the histogram into an empirical p.d.f.
The problem is that it's not usually possible to repeat an experiment many times over.
Bootstrap Method create approximate repeat datasets by randomly resampling (with duplications) the one existing data set
example of resampling

index:                     1    2    3    4    5    6
original data set:         1.4  2.1  3.8  3.1  1.5  1.7
random integers (1 to 6):  3    1    3    2    5    1
resampled data set:        3.8  1.4  3.8  2.1  1.5  1.4
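The resampling step is a one-line operation with NumPy. This sketch reproduces the toy example, using the particular draw of random integers from the slide rather than a fresh random draw:

```python
import numpy as np

original = np.array([1.4, 2.1, 3.8, 3.1, 1.5, 1.7])

# in practice the integers come from a random draw with replacement:
#   idx = np.random.randint(0, len(original), size=len(original))
# here we use the slide's draw (1-based 3,1,3,2,5,1 converted to 0-based)
idx = np.array([3, 1, 3, 2, 5, 1]) - 1
resampled = original[idx]
print(resampled)   # [3.8 1.4 3.8 2.1 1.5 1.4]
```

Note that some values (1.4, 3.8) are duplicated and others (3.1, 1.7) are dropped, which is exactly the "with duplications" property the Bootstrap Method relies on.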
interpretation of resampling
[Figure: resampling interpreted as mixing and sampling with duplication, drawing from the empirical distribution p̂(d) that approximates p(d)]
Example: what is p(b), where b is the slope of a linear fit?
[Figure: data d(i) plotted against time t, in hours, scattered about a straight line]
This is a good test case, because we know the answer: if the data are Normally-distributed and uncorrelated with variance σd², and given the linear problem d = Gm with m = [intercept, slope]ᵀ, then the slope is also Normally-distributed, with a variance that is the lower-right element of σd² [GᵀG]⁻¹.
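This comparison can be carried out end to end. The sketch below (NumPy assumed; the synthetic dataset, random seed, and all variable names are illustrative, not from the text) bootstraps the slope of a straight-line fit and compares its spread with the analytic standard deviation from the lower-right element of σd² [GᵀG]⁻¹:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
t = np.linspace(0.0, 10.0, N)
sigma_d = 1.0
d = 2.0 + 0.5 * t + sigma_d * rng.standard_normal(N)   # true slope is 0.5

G = np.column_stack((np.ones(N), t))                   # d = G m, m = [intercept, slope]
Cm = sigma_d**2 * np.linalg.inv(G.T @ G)               # analytic covariance of m
sd_analytic = np.sqrt(Cm[1, 1])                        # analytic std of the slope

Nr = 5000
slopes = np.empty(Nr)
for p in range(Nr):
    idx = rng.integers(0, N, size=N)                   # resample with duplication
    Gp = np.column_stack((np.ones(N), t[idx]))
    m = np.linalg.lstsq(Gp, d[idx], rcond=None)[0]     # least squares fit of a line
    slopes[p] = m[1]                                   # save the slope

sd_bootstrap = np.std(slopes)
print(sd_analytic, sd_bootstrap)   # the two estimates should closely agree
```

The agreement between the two printed values is the empirical justification for the bootstrap in this simple case, where the answer is known analytically.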
MATLAB
[Code shown on slides; the annotations read:]
resample the dataset (returns N random integers from 1 to N)
usual code for least squares fit of a line; save the slopes
histogram of slopes
integrate p(b) to P(b); find the 2.5% and 97.5% bounds
Python

for p in range(Nr):
    # resample the dataset (randint returns N random integers from 0 to N-1)
    rowindex = np.random.randint(N, size=N);
    t = eda_cvec(torig[rowindex, 0]);
    d = eda_cvec(dorig[rowindex, 0]);
    # usual code for least squares fit of a straight line
    G = np.zeros((N, M));
    G[:, 0:1] = np.ones((N, 1));
    G[:, 1:2] = t;
    GTG = np.matmul(G.T, G);
    mest = la.solve(GTG, np.matmul(G.T, d));
    slope[p, 0] = mest[1, 0];  # save the slope in a vector
Python

# histogram of slopes
Nhist = 1000;
slopemin = 0.2;
slopemax = 0.8;
ct, ed = np.histogram(slope, Nhist, (slopemin, slopemax));
Nc = len(ct);
Ne = len(ed);
slopehist = eda_cvec(ct);
centers = eda_cvec(0.5*(ed[1:Ne] + ed[0:Ne-1]));
Ds = centers[1, 0] - centers[0, 0];
norm = Ds * np.sum(slopehist);
pbootstrap = slopehist / norm;
# integrate p(b) to P(b); find the 2.5% and 97.5% bounds
Pbootstrap = eda_cvec(Ds * np.cumsum(pbootstrap));
ilovec = np.where(Pbootstrap >= 0.025);
ilox = ilovec[0];
ilo = ilox[0];
blo = centers[ilo];
ihivec = np.where(Pbootstrap >= 0.975);
ihix = ihivec[0];
ihi = ihix[0];
bhi = centers[ihi];
standard error propagation versus bootstrap
[Figure: p(b) from standard error propagation and from the bootstrap, plotted against slope b (about 0.50 to 0.56); the two 95% confidence intervals agree closely]
a more complicated example: p(r), where r is the CaO/Na2O ratio of the second varimax factor of the Atlantic Rock dataset
[Figure: bootstrap p(r) plotted against the CaO/Na2O ratio r (about 0.45 to 0.52), with the mean and the 95% confidence interval marked]