Understanding Linear Least Squares Method for Data Approximation

Dive into the linear least squares method (LLSQ) for data approximation, exploring the minimization of errors, determination of parameters, and practical applications in various contexts. Learn how to optimize parameter estimates and minimize residual errors efficiently.

  • Approximation
  • Optimization
  • Data Analysis
  • Linear Regression
  • Error Minimization


Presentation Transcript


  1. Linear least squares method (lLSQ)

  2. Approximation of many sample points: Minimization (Optimization)
  How do we determine the most plausible parameters a and b if the observed data (x1, y1), ..., (xn, yn) follow f(x) = a + bx? The errors ε_i should be considered: y_i = f(x_i) + ε_i.
  Fundamental idea: determine a and b so as to minimize (or maximize) a target function S, e.g., a residual error function.
  Minimax method: S = max_i |f(x_i) − y_i|
  Least-squares (LSQ) method: S = Σ_i (f(x_i) − y_i)² = Σ_i (a + bx_i − y_i)²
  dS/da = 2 Σ_i (a + bx_i − y_i) = 2an + 2b Σ x_i − 2 Σ y_i = 0
  dS/db = 2 Σ_i x_i (a + bx_i − y_i) = 2a Σ x_i + 2b Σ x_i² − 2 Σ x_i y_i = 0
  These two normal equations form a linear system in a and b. Even for f(x) = a + bx + cx² + ..., a single matrix operation gives the final solution.
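
  The two normal equations above can be solved directly. A minimal sketch, assuming NumPy and synthetic data (the values a = 1, b = 2 and the noise level are illustrative, not from the slide):

```python
import numpy as np

# Synthetic observations y_i = a + b*x_i + noise (a = 1.0, b = 2.0 assumed here)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 20)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)

n = x.size
# Normal equations from dS/da = 0 and dS/db = 0:
#   a*n      + b*sum(x)    = sum(y)
#   a*sum(x) + b*sum(x**2) = sum(x*y)
A = np.array([[n,       x.sum()],
              [x.sum(), (x**2).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])
a, b = np.linalg.solve(A, rhs)
print(f"a = {a:.4f}, b = {b:.4f}")   # close to 1 and 2
```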

  3. Minimax approximation
  Minimize max_{a ≤ x ≤ b} |g(x) − f(x)|.
  (Figure: comparison of minimax and LSQ approximations.)

  4. lLSQ: Polynomial
  f(x) = Σ_{k=0}^{N} a_k x^k,  S = Σ_{i=1}^{n} (y_i − Σ_{k=0}^{N} a_k x_i^k)²
  dS/da_l = −2 Σ_{i=1}^{n} x_i^l (y_i − Σ_k a_k x_i^k) = 0
  ⇒ Σ_{k=0}^{N} a_k Σ_{i=1}^{n} x_i^{k+l} = Σ_{i=1}^{n} y_i x_i^l   (l = 0, 1, ..., N)
  In matrix form:
  | n         Σx_i        Σx_i²   ...  Σx_i^N     | | a_0 |   | Σ y_i       |
  | Σx_i      Σx_i²       Σx_i³   ...  Σx_i^{N+1} | | a_1 | = | Σ y_i x_i   |
  | Σx_i²     Σx_i³       Σx_i⁴   ...  Σx_i^{N+2} | | a_2 |   | Σ y_i x_i²  |
  | ...                                           | | ... |   | ...         |
  | Σx_i^N    Σx_i^{N+1}  ...          Σx_i^{2N}  | | a_N |   | Σ y_i x_i^N |
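
  A minimal sketch of the polynomial normal equations, assuming NumPy; the function name and the quadratic test data are illustrative:

```python
import numpy as np

def polyfit_normal_equations(x, y, N):
    """Least-squares fit of f(x) = sum_k a_k x**k (k = 0..N) by solving
    the normal equations  sum_k a_k * sum_i x_i**(k+l) = sum_i y_i x_i**l."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    M = np.empty((N + 1, N + 1))
    rhs = np.empty(N + 1)
    for l in range(N + 1):
        rhs[l] = np.sum(y * x**l)
        for k in range(N + 1):
            M[l, k] = np.sum(x**(k + l))
    return np.linalg.solve(M, rhs)   # coefficients a_0 .. a_N

# Example: quadratic data
x = np.linspace(-1, 1, 15)
y = 0.5 - 1.5 * x + 2.0 * x**2
print(polyfit_normal_equations(x, y, 2))   # ~ [0.5, -1.5, 2.0]
```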

  5. lLSQ: General functions
  f(x) = Σ_{k=1}^{N} a_k f_k(x),  S = Σ_{i=1}^{n} (y_i − Σ_{k=1}^{N} a_k f_k(x_i))²
  dS/da_l = −2 Σ_{i=1}^{n} f_l(x_i) (y_i − Σ_k a_k f_k(x_i)) = 0
  ⇒ the normal equations:
  | Σ f_1(x_i)f_1(x_i)  Σ f_1(x_i)f_2(x_i)  ...  Σ f_1(x_i)f_N(x_i) | | a_1 |   | Σ y_i f_1(x_i) |
  | Σ f_2(x_i)f_1(x_i)  Σ f_2(x_i)f_2(x_i)  ...  Σ f_2(x_i)f_N(x_i) | | a_2 | = | Σ y_i f_2(x_i) |
  | ...                                                             | | ... |   | ...            |
  | Σ f_N(x_i)f_1(x_i)  Σ f_N(x_i)f_2(x_i)  ...  Σ f_N(x_i)f_N(x_i) | | a_N |   | Σ y_i f_N(x_i) |
  If f(x) is linear with respect to the fitting parameters, the final solution is obtained by one matrix operation, e.g., f(x) = a + b log x + c/x, which is linear in a, b, c even though non-linear in x.
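
  The same idea for arbitrary basis functions, sketched with NumPy. Building the design matrix A[i, k] = f_k(x_i) and calling a least-squares solver is equivalent to the normal equations above; the example basis and coefficients are assumed for illustration:

```python
import numpy as np

def linear_lsq(basis_funcs, x, y):
    """Fit f(x) = sum_k a_k f_k(x) for arbitrary basis functions f_k.
    Builds the design matrix and solves the least-squares problem;
    equivalent to the normal equations above, but numerically safer."""
    A = np.column_stack([f(x) for f in basis_funcs])   # A[i, k] = f_k(x_i)
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

# Example: f(x) = a + b*log(x) + c/x  (linear in a, b, c)
x = np.linspace(1.0, 10.0, 30)
y = 2.0 + 0.5 * np.log(x) - 3.0 / x
basis = [lambda t: np.ones_like(t), np.log, lambda t: 1.0 / t]
print(linear_lsq(basis, x, y))   # ~ [2.0, 0.5, -3.0]
```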

  6. Example of lLSQ: Lattice spacing of a triclinic lattice
  G_hkl = h a* + k b* + l c*,  1/d_hkl² = |G_hkl|²
       = S11 h² + S22 k² + S33 l² + 2 S12 hk + 2 S23 kl + 2 S31 lh
  S11 = a*² = b²c² sin²α / V²
  S22 = b*² = c²a² sin²β / V²
  S33 = c*² = a²b² sin²γ / V²
  S12 = a*·b* = abc² (cos α cos β − cos γ) / V²
  S23 = b*·c* = a²bc (cos β cos γ − cos α) / V²
  S31 = c*·a* = ab²c (cos γ cos α − cos β) / V²
  V² = a²b²c² (1 − cos²α − cos²β − cos²γ + 2 cos α cos β cos γ)
  The expression for 1/d_hkl² is a linear function of the S_ij, so:
  1. The S_ij are obtained by lLSQ.
  2. S_ij ⇒ reciprocal lattice parameters (a*, b*, c*, α*, β*, γ*).
  3. ⇒ lattice parameters (a, b, c, α, β, γ).
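
  A sketch of step 1 (fitting the S_ij by linear least squares), assuming NumPy; the reflection list and the cubic test cell are illustrative, not data from the slide:

```python
import numpy as np

def fit_Sij(hkl, d):
    """Fit S11..S31 in 1/d^2 = S11 h^2 + S22 k^2 + S33 l^2
    + 2 S12 hk + 2 S23 kl + 2 S31 lh by linear least squares."""
    hkl = np.asarray(hkl, dtype=float)
    h, k, l = hkl[:, 0], hkl[:, 1], hkl[:, 2]
    A = np.column_stack([h**2, k**2, l**2, 2*h*k, 2*k*l, 2*l*h])
    rhs = 1.0 / np.asarray(d, dtype=float)**2
    S, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return S   # [S11, S22, S33, S12, S23, S31]

# Trivial check with a cubic cell, a = 4.0 (then S11 = S22 = S33 = 1/a**2, rest 0)
hkl = [(1,0,0), (0,1,0), (0,0,1), (1,1,0), (0,1,1), (1,0,1), (1,1,1)]
d = [4.0 / np.sqrt(h*h + k*k + l*l) for h, k, l in hkl]
print(fit_Sij(hkl, d))
```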

  7. Transcendental equation

  8. Newton-Raphson method
  Solve f(x) = 0.  f(x0 + dx) ≈ f(x0) + dx f'(x0) ≈ 0
  ⇒ x1 = x0 + dx = x0 − f(x0) / f'(x0)
  f'(x0) can be substituted with a finite difference.
  Secant method: f'(x) ≈ (f(x_n) − f(x_{n−1})) / (x_n − x_{n−1}), which saves the calculation of f'(x).
  Variation to suppress divergence: x_{k+1} = x_k − f(x_k) / (f'(x_k) + λ),  λ: damping factor.
  (Figure: y = f(x) with the tangent at x0, illustrating x1 = x0 − f(x0) / f'(x0).)
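
  A minimal sketch of the damped iteration, assuming the formula as written on the slide; the test functions anticipate the examples on the following slides:

```python
import math

def newton_damped(f, fprime, x0, lam=0.0, tol=1e-12, max_iter=50):
    """Newton-Raphson with the damping factor from the slide:
    x_{k+1} = x_k - f(x_k) / (f'(x_k) + lam).  lam = 0 is plain NR."""
    x = x0
    for _ in range(max_iter):
        dx = -f(x) / (fprime(x) + lam)
        x += dx
        if abs(dx) < tol:
            break
    return x

# f(x) = exp(x) - 3x = 0 from the next slides (root near 0.619061)
f = lambda x: math.exp(x) - 3.0 * x
fp = lambda x: math.exp(x) - 3.0
print(newton_damped(f, fp, 0.0))                 # plain NR: ~0.619061

# f(x) = arctan(10x) from slides 11-12: plain NR diverges from x0 = 0.15,
# but lam = 1 stabilizes it (f' > 0 here, so adding lam always flattens the step)
g = lambda x: math.atan(10.0 * x)
gp = lambda x: 10.0 / (1.0 + (10.0 * x) ** 2)
print(newton_damped(g, gp, 0.15, lam=1.0))       # converges to the root x = 0
```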

  9. Effect of the damping factor
  f(x) = exp(x) − 3x = 0, initial x = 0. Exact root: 0.619061.

  Newton-Raphson (damping factor = 0):
  iter  x                   dx from previous
  1     0.5
  2     0.610059654958962   0.110059654958962
  3     0.61899677974154    0.00893712478257794
  4     0.619061283355313   6.4503613773092e-005
  5     0.619061286735945   3.38063244722622e-009
  6     0.619061286735945   -1.94296000199483e-016

  Newton-Raphson (damping factor = 0.1):
  iter  x                   dx from previous
  1     0.476190476190476
  2     0.597901649246081   0.121711173055605
  3     0.617090542717403   0.0191888934713221
  4     0.618900291486661   0.00180974876925825
  5     0.619048316423879   0.000148024937217564
  6     0.619060243007723   1.19265838440254e-005
  7     0.619061202754359   9.59746635487409e-007
  8     0.619061279978579   7.72242198569211e-008
  9     0.619061286192231   6.21365241490959e-009
  10    0.619061286692197   4.99965669237101e-010
  11    0.619061286732425   4.0228535713285e-011

  Newton-Raphson (damping factor = 1.0):
  iter  x                   dx from previous
  1     0.333333333333333
  2     0.485235618882813   0.15190228554948
  3     0.556317491275292   0.0710818723924794
  4     0.589692022113926   0.0333745308386341
  5     0.605333177012923   0.0156411548989961
  6     0.612649553494255   0.00731637648133212
  7     0.616067929129785   0.00341837563553035
  8     0.617664103982484   0.00159617485269905
  9     0.618409199563502   0.00074509558101794
  10    0.618756961315507   0.000347761752005284
  11    0.618919262817103   0.000162301501596124
  12    0.618995007056658   7.57442395542543e-005

  10. Effect of the damping factor: Convergence process
  f(x) = exp(x) − 3x = 0, initial x = 0. Exact root: 0.619061.
  (Figures: |dx| versus iteration on a logarithmic scale, comparing Newton-Raphson with damping factor 0, 0.1, and 1, Newton-Raphson with numerical differentiation, the secant method, and bisection.)
  NR: Newton-Raphson method; df: damping factor.

  11. A case where the Newton method fails: f(x) = tan⁻¹(10x)
  Initial x = 0.1: case of convergence.
  i   x          f(x)      df/dx     dx
  0   0.1        0.7854    5         -0.1571
  1   -0.05708   -0.5187   7.54257   0.06877
  2   0.011686   0.11633   9.86527   -0.0118
  3   -0.00011   -0.0011   9.99999   0.00011
  4   1.15E-10   1.2E-09   10        -1E-10
  (Figure: f(x) = tan⁻¹(10x) with the Newton iterates x0, x1, x2 marked.)

  12. A case where the Newton method fails: f(x) = tan⁻¹(10x)
  Initial x = 0.15: diverged (damping factor λ = 0).
  i   x          f(x)      df/dx     dx
  0   0.15       0.98279   3.07692   -0.3194
  1   -0.16941   -1.0375   2.58404   0.40152
  2   0.232112   1.164     1.56553   -0.7435
  3   -0.51141   -1.3777   0.36827   3.74095
  4   3.229546   1.53984   0.00958   -160.76
  5   -157.529   -1.5702   4E-06     389644
  6   389486.7   1.5708    1.1E-12   -1E+12

  x_{k+1} = x_k − f(x_k) / (f'(x_k) + λ),  λ: damping factor.
  Convergence is stabilized by choosing λ (here λ = 1):
  i   x          f(x)      df/dx     dx
  0   0.15       0.98279   3.07692   -0.2411
  1   -0.09106   -0.7387   5.46675   0.11422
  2   0.023161   0.2276    9.49088   -0.0217
  3   0.001466   0.01466   9.99785   -0.0013
  4   0.000133   0.00133   9.99998   -0.0001
  5   1.21E-05   0.00012   10        -1E-05
  6   1.1E-06    1.1E-05   10        -1E-06
  7   1E-07      1E-06     10        -9E-08
  8   9.09E-09   9.1E-08   10        -8E-09
  9   8.27E-10   8.3E-09   10        -8E-10
  (Figure: f(x) = tan⁻¹(10x) with the diverging iterates x0, x1, x2, x3 marked.)

  13. Non-linear (NL) optimization

  14. NL optimization of crystal structure: Illustrative approach
  Calculate the total energy by quantum calculations while varying a lattice parameter. Example: Si.
  Experimental (RT): a_C = 0.5431 nm, V_R = 270.5 a.u.³ (primitive cell)
  Optimized: a_C = 0.5472 nm, V_R = 276.67 a.u.³
  Fit near the minimum with a harmonic form, E ≈ E_min + (B0 V0 / 2)(V/V0 − 1)², giving B0 = 87.57 GPa (exp.: 97.88 GPa).
  (Figure: total energy (Ry) versus primitive-cell volume (a.u.³), roughly −1160.144 to −1160.137 Ry over 260-300 a.u.³.)
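
  A sketch of the fitting step, assuming NumPy; the E(V) points are synthetic placeholders (not the slide's Si data), B0 is taken as V0·d²E/dV² at the minimum, and the Ry/a.u.³ to GPa factor (≈14710) is an approximate constant:

```python
import numpy as np

# Fit E(V) points from total-energy calculations with a quadratic and extract
# V0 and the bulk modulus B0 = V0 * d2E/dV2 at V0.  Synthetic placeholder data:
V = np.array([260.0, 265.0, 270.0, 275.0, 280.0, 285.0, 290.0])   # a.u.^3
E = 1.1e-5 * (V - 276.7) ** 2 - 1160.143                          # Ry (assumed)

c2, c1, c0 = np.polyfit(V, E, 2)      # E ~ c2*V^2 + c1*V + c0
V0 = -c1 / (2.0 * c2)                 # volume at the energy minimum
B0 = V0 * 2.0 * c2                    # d2E/dV2 = 2*c2  (units: Ry / a.u.^3)
RY_PER_AU3_TO_GPA = 14710.5           # approximate conversion factor
print(f"V0 = {V0:.2f} a.u.^3, B0 = {B0 * RY_PER_AU3_TO_GPA:.1f} GPa")
```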

  15. Example: Deconvolution of a powder XRD peak
  Incorporate the Kα1 : Kα2 intensity ratio of 2 : 1.

  16. Profile models used for spectroscopy
  Lorentz function: I_L(x) = 1 / (1 + ((x − x0)/w)²),  w: half width at half maximum (HWHM)
  Gauss function: I_G(x) = exp(−(a(x − x0)/w)²),  a = √(ln 2) ≈ 0.83255461 so that w is the HWHM
  Voigt function: e.g., the observed profile is the convolution of the sample spectrum I_L(x) and the apparatus function I_G(x):
  I_V(x) = ∫ I_G(x′) I_L(x − x′) dx′ ∝ ∫ exp(−x′²) / (a² + (x − x′)²) dx′
  Pseudo-Voigt function: a simplified Voigt function
  I_PV(x) = f_G I_G(x) + (1 − f_G) I_L(x),  f_G: Gauss fraction
  (Figure: pseudo-Voigt profiles for f_G = 0, 0.3, 0.5, 0.7, 1.)
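
  A minimal sketch of these profile functions in NumPy; parameter values are illustrative:

```python
import numpy as np

def lorentz(x, x0, w):
    """Lorentzian with half width at half maximum w."""
    return 1.0 / (1.0 + ((x - x0) / w) ** 2)

def gauss(x, x0, w):
    """Gaussian of unit height with HWHM w  (a = sqrt(ln 2) ~ 0.83255461)."""
    a = np.sqrt(np.log(2.0))
    return np.exp(-((a * (x - x0) / w) ** 2))

def pseudo_voigt(x, x0, w, fG):
    """Pseudo-Voigt: linear mix of Gauss and Lorentz, fG = Gauss fraction."""
    return fG * gauss(x, x0, w) + (1.0 - fG) * lorentz(x, x0, w)

# Both components share the HWHM w, so the profile is 0.5 at x = x0 + w for any fG
for fG in (0.0, 0.3, 0.5, 0.7, 1.0):        # the fractions plotted on the slide
    print(fG, pseudo_voigt(1.0, 0.0, 1.0, fG))
```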

  17. Multiple-parameter Newton-Raphson method
  Extend to multiple parameters: minimize F(x_k), i.e., solve f_k = ∂F/∂x_k = 0.
  Iteration: f_k(x + Δx) ≈ f_k(x) + Σ_l (∂f_k/∂x_l) Δx_l = 0
  ⇒ x_{l,1} = x_{l,0} − (∂f_k/∂x_l)⁻¹ f_k = x_{l,0} − (F″_{kl})⁻¹ F′_k
  where F″_{kl} = ∂²F/∂x_k ∂x_l is the Hessian matrix.
  The Hessian matrix is not always positive definite (maximum, saddle point), so the Newton step does not always give a decreasing direction.
  Convert it to a positive definite matrix and suppress divergence:
  x_{l,1} = x_{l,0} − (F″ + λI)⁻¹ F′,  λ: damping factor.
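
  A minimal sketch of the damped multi-parameter Newton step, assuming NumPy; the quadratic test function and its analytic gradient/Hessian are illustrative:

```python
import numpy as np

def newton_damped(grad, hess, x0, lam=0.0, tol=1e-10, max_iter=100):
    """Multi-parameter Newton-Raphson step from the slide:
    x_new = x_old - (F'' + lam*I)^(-1) F', with damping factor lam."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(x) + lam * np.eye(x.size), grad(x))
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

# Quadratic test function F(x) = 5*x1^2 + x2^2 (used again on the SD slides)
grad = lambda x: np.array([10.0 * x[0], 2.0 * x[1]])
hess = lambda x: np.array([[10.0, 0.0], [0.0, 2.0]])

print(newton_damped(grad, hess, [0.7, 1.5]))            # one step to (0, 0)
print(newton_damped(grad, hess, [0.7, 1.5], lam=1.0))   # damped: smaller steps, same minimum
```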

  18. Steepest Descent (SD) method (2006)
  Search for the minimum using only first derivatives; the simplest of the derivative-based methods.
  SD: S² decreases along the direction −(∂f/∂x_i):  x_i^(j+1) = x_i^(j) − k (∂f/∂x_i)
  k may be a small constant step or determined by a line search method.
  Example (figure): S² = f(x) = 5x1² + x2², initial x1 = 0.7, x2 = 1.5
  Newton method: one cycle of calculation provides the final solution for quadratic problems.
  SD method:
    k = 0.3: diverged (not shown in the graph)
    k = 0.2, 0.15: converged, but oscillated
    k = 0.1: reached the final solution in one cycle of calculation
    k = 0.01: no oscillation, but slow convergence
  Problem: if S² is highly anisotropic, the SD direction can differ greatly from the direction toward the minimum
  ⇒ Conjugate Gradient (CG) method
  (Figure: contour plot of S² with the SD trajectories for each k and the Newton step.)
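
  A fixed-step SD sketch on the slide's test function, assuming NumPy; the step values echo the ones listed above:

```python
import numpy as np

def steepest_descent(grad, x0, k=0.1, n_iter=100):
    """Fixed-step steepest descent: x_(j+1) = x_(j) - k * grad f(x_(j))."""
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_iter):
        x = x - k * grad(x)
        path.append(x.copy())
    return np.array(path)

# The slide's example: S2 = f(x) = 5*x1^2 + x2^2, initial x1 = 0.7, x2 = 1.5
grad = lambda x: np.array([10.0 * x[0], 2.0 * x[1]])

for k in (0.15, 0.1, 0.01):
    final = steepest_descent(grad, [0.7, 1.5], k=k)[-1]
    print(f"k = {k}: x = {final}")
# k = 0.3 diverges (|1 - 10*k| > 1), 0.15 oscillates in x1, 0.1 zeroes x1 in a
# single step, and 0.01 converges slowly, as described on the slide
```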

  19. Steepest Descent method (2006)
  An effective approach: the step k is determined by a line search (direct search).
  (Figures: convergence trajectories without the direct search, i.e., fixed k = 0.01, 0.1, 0.15, 0.2 compared with the Newton step, and with the direct search.)
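
  A sketch of SD with the step chosen by a 1-D search at every iteration, assuming NumPy and SciPy (scipy.optimize.minimize_scalar stands in for whatever line-search routine the slide used):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def sd_line_search(f, grad, x0, n_iter=20):
    """Steepest descent where the step length k is chosen by a 1-D search
    along the SD direction at every iteration."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        d = -grad(x)
        k = minimize_scalar(lambda k: f(x + k * d)).x   # minimize f along d
        x = x + k * d
    return x

f    = lambda x: 5.0 * x[0] ** 2 + x[1] ** 2
grad = lambda x: np.array([10.0 * x[0], 2.0 * x[1]])
print(sd_line_search(f, grad, [0.7, 1.5]))   # close to (0, 0)
```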

  20. SD method in deep learning
  All batch data are divided into mini-batches, and SD is applied to each mini-batch.
  Example (figure): S² = f(x) = a x1² + b x2², a = 5, b = 1. 1000 batch data are generated with random numbers (note: the data were re-generated for different runs). Initial a = 0, b = 0.
  (Figures: convergence of the parameters for SD over all batch data, k = 0.1 and 0.01 compared with a direct search, and for mini-batch SD with k = 0.1, n_MB = 10; k = 0.05, n_MB = 10; k = 0.05, n_MB = 100; k = 0.1, n_MB = 100.)
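
  A mini-batch SD sketch under assumptions about the slide's setup (noise-free samples of y = 5 x1² + x2², mini-batch size n_MB, fixed step k); the random-number generation details are not from the slide:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed setup: generate 1000 samples of y = 5*x1^2 + 1*x2^2 and recover the
# coefficients a, b by mini-batch steepest descent starting from a = b = 0.
X = rng.uniform(-1.0, 1.0, size=(1000, 2))
y = 5.0 * X[:, 0] ** 2 + 1.0 * X[:, 1] ** 2

def minibatch_sd(X, y, k=0.1, n_mb=10, n_epochs=20):
    a = np.zeros(2)                              # parameters [a, b], initial 0, 0
    feats = X ** 2                               # model y_hat = a*x1^2 + b*x2^2
    for _ in range(n_epochs):
        order = rng.permutation(len(y))
        for idx in np.array_split(order, len(y) // n_mb):
            r = feats[idx] @ a - y[idx]          # residuals on this mini-batch
            grad = 2.0 * feats[idx].T @ r / len(idx)
            a -= k * grad                        # SD step on the mini-batch loss
    return a

print(minibatch_sd(X, y, k=0.1, n_mb=10))        # approaches [5, 1]
```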

  21. Conjugate Gradient (CG) method (2006)
  Vectors u and v that satisfy uᵀAv = 0 for a matrix A are conjugate with each other.
  For a quadratic function, repeated steps along conjugate directions find the minimum in a finite number of cycles if an exact line search is employed.
  1. Give an initial value x0.
  2. The initial direction d is determined by SD: d = −∇f. If the contour is a circle, one cycle of calculation reaches the minimum.
     (Conjugate vectors correspond to an ellipsoid-to-circle conversion: uᵀPᵀPv = uᵀAv = 0.)
  3. Find x_{k+1} using an appropriately chosen α_k: x_{k+1} = x_k + α_k d_k. α_k may be a small constant step or determined by a line search method.
  4. The search direction is updated by
     y_k = ∇f(x_{k+1}) − ∇f(x_k)
     d_{k+1} = −∇f(x_{k+1}) + (∇f(x_{k+1})ᵀ y_k / d_kᵀ y_k) d_k
  5. Repeat 3-4 until convergence.
  As the number of independent conjugate directions equals the number of parameters (n_param), the direction d_k must be reset by going back to step 2 at some interval (typically every n_param iterations; necessary for n_param = 2).
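
  A non-linear CG sketch following steps 1-5 and the direction update above, assuming NumPy and SciPy for the line search; the quadratic test function is illustrative:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient(f, grad, x0, n_iter=10):
    """Non-linear CG with the direction update from the slide:
    y_k = grad f(x_{k+1}) - grad f(x_k),
    d_{k+1} = -grad f(x_{k+1}) + (grad f(x_{k+1}).y_k / d_k.y_k) d_k,
    with alpha_k chosen by a 1-D line search."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g                                        # initial direction from SD
    for _ in range(n_iter):
        alpha = minimize_scalar(lambda a: f(x + a * d)).x
        x = x + alpha * d
        g_new = grad(x)
        if np.linalg.norm(g_new) < 1e-9:
            break
        y = g_new - g
        beta = g_new @ y / (d @ y)                # conjugacy coefficient
        d = -g_new + beta * d
        g = g_new
    return x

f    = lambda x: 5.0 * x[0] ** 2 + x[1] ** 2
grad = lambda x: np.array([10.0 * x[0], 2.0 * x[1]])
print(conjugate_gradient(f, grad, [0.7, 1.5]))    # 2 parameters -> ~2 CG steps
```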

  22. Marquardt method
  Minimize a square sum of m functions f_j(x_i) with N parameters:
  F(x_i) = Σ_{j=1}^{m} f_j(x_i)²
  Approximate f_j by a linear expansion with the Jacobian A_jk = ∂f_j/∂x_k:
  f_j(x_i + Δx_i) ≈ f_j(x_i) + Σ_k A_jk Δx_k
  F(x_i + Δx_i) ≈ F(x_i) + 2 Σ_{j,k} f_j A_jk Δx_k + Σ_{k,k′} (Σ_j A_jk A_jk′) Δx_k Δx_k′
  ∂F/∂Δx_k = 0  ⇒  2 Σ_j A_jk f_j + 2 Σ_{k′} (Σ_j A_jk A_jk′) Δx_k′ ≈ 0
  Gauss-Newton method:  Δx = −(AᵀA)⁻¹ Aᵀ f
  Levenberg-Marquardt method:  Δx = −(AᵀA + λI)⁻¹ Aᵀ f,  λ: damping factor
  λ can be chosen, e.g., in proportion to the diagonal of AᵀA:  Δx = −(AᵀA + λ diag(AᵀA))⁻¹ Aᵀ f
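
  A Levenberg-Marquardt sketch using the diagonal-scaled damping shown above; the accept/reject rule for adjusting λ and the exponential test model are common choices assumed here, not taken from the slide:

```python
import numpy as np

def levenberg_marquardt(residuals, jac, p0, lam=1e-3, n_iter=50):
    """Levenberg-Marquardt step from the slide:
    (A^T A + lam * diag(A^T A)) dp = -A^T f,  A_jk = d f_j / d p_k."""
    p = np.asarray(p0, dtype=float)
    for _ in range(n_iter):
        f = residuals(p)
        A = jac(p)
        AtA = A.T @ A
        dp = np.linalg.solve(AtA + lam * np.diag(np.diag(AtA)), -A.T @ f)
        if np.sum(residuals(p + dp) ** 2) < np.sum(f ** 2):
            p, lam = p + dp, lam * 0.5       # accept step, relax damping
        else:
            lam *= 10.0                      # reject step, increase damping
    return p

# Example: fit y = p0 * exp(-p1 * t) (synthetic data, true p = [2.0, 1.5])
t = np.linspace(0.0, 3.0, 30)
y = 2.0 * np.exp(-1.5 * t)
residuals = lambda p: p[0] * np.exp(-p[1] * t) - y
jac = lambda p: np.column_stack([np.exp(-p[1] * t),
                                 -p[0] * t * np.exp(-p[1] * t)])
print(levenberg_marquardt(residuals, jac, [1.0, 1.0]))   # ~[2.0, 1.5]
```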

  23. Features of NL optimization
  Newton-Raphson method: uses second derivatives (Hessian matrix). Fast convergence, but easily diverges; complex program.
  Steepest Descent: uses first derivatives only. Simple program; slower convergence than NR and CG.
  Conjugate Gradient: uses conjugate directions for an efficient search. Better convergence than NR, faster than SD; complex program.
  Marquardt: uses first derivatives of f_j(x_i). Simple program; slower convergence than NR.
  Simplex: trial and error with a pre-determined selection of next candidate parameters. Very slow, but good convergence.

  24. Simplex method (amoeba), Fortran version
  Simplex: a polyhedron formed by (n+1) vertices in an n-dimensional space.
  Minimize F(x_i):
  1. Take (n+1) initial values x_i (i = 1, 2, ..., n+1) and sort them so that F(x_i) > F(x_i′) for i < i′; x_h = x_1 (worst), x_l = x_{n+1} (best).
  2. Average all vertices except the worst one: x_G = Σ_{i=2}^{n+1} x_i / n.
  3. A new x is examined along the line x_1 - x_G by the following selections:
     (i)   Reflection:  x_R = (1 + α) x_G − α x_1   (α > 0, e.g., 1.0)
     (ii)  Expansion:   x_E = γ x_R + (1 − γ) x_G   (γ > 0, e.g., 2.0)
     (iii) Contraction: x_C = β x_1 + (1 − β) x_G   (0 < β < 1, e.g., 0.5)
     (iv)  Reduction:   x_RD = (x_1 + x_l) / 2
  4. Replace x_1 with the first x among (i)-(iv) that satisfies F(x) < F(x_1).
  5. Repeat 2-4.
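
  The same downhill-simplex (amoeba) idea is available in SciPy as the 'Nelder-Mead' method; this is not the slide's Fortran code, just a convenient way to experiment with the reflection/expansion/contraction behaviour described above:

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: 5.0 * x[0] ** 2 + x[1] ** 2          # same test function as before

result = minimize(f, x0=[0.7, 1.5], method='Nelder-Mead',
                  options={'xatol': 1e-8, 'fatol': 1e-8, 'disp': True})
print(result.x, result.nit)    # near (0, 0); note the fairly large iteration count
```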

  25. Simplex method (amoeba), CQ version (1986)
  Simplex: a polyhedron formed by (n+1) vertices in an n-dimensional space.
  Minimize F(x_i):
  1. Take (n+1) initial values x_i (i = 1, 2, ..., n+1) and sort them so that F(x_i) > F(x_i′) for i < i′.
  2. Average all vertices except the worst one: x_G = Σ_{i=2}^{n+1} x_i / n.
  3. A new x is examined along the line x_1 - x_G by the following selections, all of the form x = x_1 + α (x_G − x_1):
     (i)   Reflection:           x_R  = x_1 + α (x_G − x_1)   (e.g., α = 2)
     (ii)  Expansion:            x_E  = x_1 + α (x_G − x_1)   (e.g., α > 2)
     (iii) Reduction-Reflection: x_CR = x_1 + α (x_G − x_1)   (e.g., 1.0 < α < 2.0)
     (iv)  Reduction:            x_CW = x_1 + α (x_G − x_1)   (e.g., 0 < α < 1.0)
  4. Replace x_1 with the first x among (i)-(iv) that satisfies F(x) < F(x_1).
  5. Repeat 2-4.

  26. Notes for NL optimization
  There may be more than one solution.
  The final solution is not obtained by a one-step calculation; convergence must be confirmed.
  Confirm that the solution is the global minimum; NL optimization often falls into a local minimum.
  Check the residual error and the fitted parameter values.

  27. Features of NL optimization algorithms
  Methods differ in convergence speed and in the stability region of convergence (A: wide stability region, B: narrower). SD tends to progress quickly in the initial cycles, while the Newton-type methods converge fast in the later cycles.
  A:    Simplex
  A, B: Conjugate Gradient (CG)
  B:    Steepest Descent (SD)
  B:    Newton-Raphson method and quasi-Newton methods
        (Davidon-Fletcher-Powell (DFP), Broyden-Fletcher-Goldfarb-Shanno (BFGS))
