
Optimizing Gradient Descent Methods for Big Data Algorithms
Explore the concepts of smooth convex optimization and Gradient Descent methods for handling big data. Learn about the iterative process, analysis lemmas, and convergence theorems involved in minimizing functions efficiently. Improve your understanding of Nesterov's accelerated descent for enhanced convergence rates.
Presentation Transcript
CIS 700: Algorithms for Big Data
Lecture 8: Gradient Descent
Slides at http://grigory.us/big-data-class.html
Grigory Yaroslavtsev, http://grigory.us
Smooth Convex Optimization
Minimize f over x ∈ ℝⁿ:
f admits a minimizer x* (∇f(x*) = 0)
f is continuously differentiable and convex on ℝⁿ: ∀x, y: f(x) - f(y) ≤ ∇f(x)ᵀ(x - y)
f is β-smooth (∇f is β-Lipschitz): ∀x, y: ‖∇f(x) - ∇f(y)‖ ≤ β‖x - y‖
Example (least squares): f(x) = ½‖Ax - b‖₂², with ∇f(x) = Aᵀ(Ax - b); if A is invertible the minimizer is x* = A⁻¹b.
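As a quick numeric illustration (not part of the original slides), the minimal Python sketch below checks the convexity and β-smoothness inequalities above for the least-squares example; the random instance, names, and seed are illustrative assumptions.

```python
import numpy as np

# Spot-check of the definitions on f(x) = 0.5 * ||Ax - b||^2 (illustrative instance).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)
beta = np.linalg.eigvalsh(A.T @ A).max()   # smoothness constant: largest eigenvalue of A^T A

x, y = rng.standard_normal(5), rng.standard_normal(5)
# Convexity: f(x) - f(y) <= grad f(x)^T (x - y)
assert f(x) - f(y) <= grad(x) @ (x - y) + 1e-9
# beta-smoothness: ||grad f(x) - grad f(y)|| <= beta * ||x - y||
assert np.linalg.norm(grad(x) - grad(y)) <= beta * np.linalg.norm(x - y) + 1e-9
```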
Gradient Descent Method
Gradient descent method:
Start with an arbitrary x_1
Iterate x_{s+1} = x_s - η ∇f(x_s)
Thm. If η = 1/β then: f(x_t) - f(x*) ≤ 2β‖x_1 - x*‖² / (t + 3)
Convergence rate O(1/t); can be improved to O(1/t²) using Nesterov's accelerated descent.
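A minimal Python sketch of this method, not from the original slides: it runs gradient descent with step 1/β on an illustrative least-squares instance and compares the suboptimality against the theorem's bound. The instance, seed, and helper names are assumptions made for the example.

```python
import numpy as np

def gradient_descent(grad, x1, beta, num_steps):
    """Gradient descent with the fixed step size 1/beta from the slide."""
    x = x1.copy()
    for _ in range(num_steps):
        x = x - grad(x) / beta
    return x

# Toy least-squares instance (illustrative).
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)
beta = np.linalg.eigvalsh(A.T @ A).max()

x1 = np.zeros(5)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # minimizer of f
t = 200
x_t = gradient_descent(grad, x1, beta, t - 1)   # x_t is reached after t-1 updates from x_1
bound = 2 * beta * np.linalg.norm(x1 - x_star) ** 2 / (t + 3)
print(f(x_t) - f(x_star), "<=", bound)
```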
Gradient Descent: Analysis
Lemma 1: If f is β-smooth then ∀x, y: |f(y) - f(x) - ∇f(x)ᵀ(y - x)| ≤ (β/2)‖y - x‖²
Proof: f(y) - f(x) = ∫₀¹ ∇f(x + t(y - x))ᵀ(y - x) dt, so
|f(y) - f(x) - ∇f(x)ᵀ(y - x)| ≤ ∫₀¹ ‖∇f(x + t(y - x)) - ∇f(x)‖ · ‖y - x‖ dt ≤ ∫₀¹ βt‖y - x‖² dt = (β/2)‖y - x‖²
Convex and β-smooth is equivalent to: 0 ≤ f(x) - f(y) - ∇f(y)ᵀ(x - y) ≤ (β/2)‖x - y‖²
Gradient Descent: Analysis
Lemma 2: If f is convex and β-smooth then ∀x, y: f(x) - f(y) ≤ ∇f(x)ᵀ(x - y) - (1/2β)‖∇f(x) - ∇f(y)‖²
Cor: (∇f(x) - ∇f(y))ᵀ(x - y) ≥ (1/β)‖∇f(x) - ∇f(y)‖²; in particular, with y = x*: ∇f(x)ᵀ(x - x*) ≥ (1/β)‖∇f(x)‖²
Proof: Let φ(z) = f(z) - ∇f(x)ᵀz, so that ∇φ(z) = ∇f(z) - ∇f(x)
φ is convex, β-smooth and minimized at x:
‖∇φ(z_1) - ∇φ(z_2)‖ = ‖∇f(z_1) - ∇f(z_2)‖ ≤ β‖z_1 - z_2‖
Gradient Descent: Analysis
Lemma 2: If f is convex and β-smooth then ∀x, y: f(x) - f(y) ≤ ∇f(x)ᵀ(x - y) - (1/2β)‖∇f(x) - ∇f(y)‖²
Proof (continued), with φ(z) = f(z) - ∇f(x)ᵀz and ∇φ(z) = ∇f(z) - ∇f(x):
f(x) - f(y) - ∇f(x)ᵀ(x - y) = φ(x) - φ(y)
φ(x) = min_z φ(z) ≤ φ(y - (1/β)∇φ(y)) ≤ φ(y) - (1/2β)‖∇φ(y)‖² = φ(y) - (1/2β)‖∇f(y) - ∇f(x)‖²
Rearranging gives the claim.
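A small numeric spot-check of Lemma 2 and its corollary, not from the original slides; the quadratic instance and names are illustrative assumptions.

```python
import numpy as np

# f(x) = 0.5 * ||Ax - b||^2 is convex and beta-smooth with beta = lambda_max(A^T A).
rng = np.random.default_rng(2)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)
beta = np.linalg.eigvalsh(A.T @ A).max()

x, y = rng.standard_normal(5), rng.standard_normal(5)
# Lemma 2: f(x) - f(y) <= grad f(x)^T (x - y) - (1/2beta) ||grad f(x) - grad f(y)||^2
assert f(x) - f(y) <= grad(x) @ (x - y) - np.linalg.norm(grad(x) - grad(y)) ** 2 / (2 * beta) + 1e-9
# Corollary (co-coercivity): (grad f(x) - grad f(y))^T (x - y) >= (1/beta) ||grad f(x) - grad f(y)||^2
assert (grad(x) - grad(y)) @ (x - y) >= np.linalg.norm(grad(x) - grad(y)) ** 2 / beta - 1e-9
```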
Gradient Descent: Analysis
Gradient descent: x_{s+1} = x_s - (1/β)∇f(x_s)
Thm: f(x_t) - f(x*) ≤ 2β‖x_1 - x*‖² / (t + 3)
By Lemma 1: f(x_{s+1}) - f(x_s) ≤ ∇f(x_s)ᵀ(x_{s+1} - x_s) + (β/2)‖x_{s+1} - x_s‖² = -(1/2β)‖∇f(x_s)‖²
Let δ_s = f(x_s) - f(x*). Then δ_{s+1} ≤ δ_s - (1/2β)‖∇f(x_s)‖²
By convexity: δ_s ≤ ∇f(x_s)ᵀ(x_s - x*) ≤ ‖∇f(x_s)‖ · ‖x_s - x*‖ ≤ ‖∇f(x_s)‖ · ‖x_1 - x*‖
Lem: ‖x_s - x*‖ is decreasing with s.
Combining: δ_{s+1} ≤ δ_s - δ_s² / (2β‖x_1 - x*‖²)
Gradient Descent: Analysis
δ_{s+1} ≤ δ_s - ω δ_s², where ω = 1 / (2β‖x_1 - x*‖²)
Dividing by δ_s δ_{s+1}: 1/δ_s + ω δ_s/δ_{s+1} ≤ 1/δ_{s+1}, and since δ_{s+1} ≤ δ_s: 1/δ_{s+1} ≥ 1/δ_s + ω
So 1/δ_t ≥ 1/δ_1 + (t - 1)ω. By Lemma 1 applied at x*: δ_1 ≤ (β/2)‖x_1 - x*‖², i.e. 1/δ_1 ≥ 4ω
Hence 1/δ_t ≥ ω(t + 3), i.e. f(x_t) - f(x*) ≤ 2β‖x_1 - x*‖² / (t + 3)
Gradient Descent: Analysis
Lem: ‖x_s - x*‖ is decreasing with s.
By the corollary of Lemma 2 (with y = x*): ∇f(x_s)ᵀ(x_s - x*) ≥ (1/β)‖∇f(x_s)‖²
‖x_{s+1} - x*‖² = ‖x_s - (1/β)∇f(x_s) - x*‖²
= ‖x_s - x*‖² - (2/β)∇f(x_s)ᵀ(x_s - x*) + (1/β²)‖∇f(x_s)‖²
≤ ‖x_s - x*‖² - (1/β²)‖∇f(x_s)‖² ≤ ‖x_s - x*‖²
Nesterov's Accelerated Gradient Descent
Params: λ_0 = 0, λ_s = (1 + √(1 + 4λ_{s-1}²)) / 2, γ_s = (1 - λ_s) / λ_{s+1}
Accelerated Gradient Descent (x_1 = y_1):
y_{s+1} = x_s - (1/β)∇f(x_s)
x_{s+1} = (1 - γ_s) y_{s+1} + γ_s y_s
Optimal convergence rate O(1/t²)
Thm. If f is convex and β-smooth then: f(y_t) - f(x*) ≤ 2β‖x_1 - x*‖² / t²
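A minimal Python sketch of this update rule, not from the original slides; it follows the λ/γ schedule above and can be run on the same illustrative least-squares instance as the earlier gradient-descent sketch.

```python
import numpy as np

def nesterov_agd(grad, x1, beta, num_steps):
    """Accelerated gradient descent with the lambda/gamma schedule from the slide."""
    x, y = x1.copy(), x1.copy()            # x_1 = y_1
    lam = 1.0                              # lambda_1 (with lambda_0 = 0)
    for _ in range(num_steps):
        y_next = x - grad(x) / beta        # y_{s+1} = x_s - (1/beta) grad f(x_s)
        lam_next = (1.0 + np.sqrt(1.0 + 4.0 * lam ** 2)) / 2.0
        gamma = (1.0 - lam) / lam_next     # gamma_s = (1 - lambda_s) / lambda_{s+1}
        x = (1.0 - gamma) * y_next + gamma * y
        y, lam = y_next, lam_next
    return y                               # y_t, the point the theorem bounds
```

The extra memory cost over plain gradient descent is one stored iterate; the only change per step is the momentum-style combination of y_{s+1} and y_s.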
Accelerated Gradient Descent: Analysis
For any x, y:
f(x - (1/β)∇f(x)) - f(y)
= f(x - (1/β)∇f(x)) - f(x) + f(x) - f(y)
≤ -(1/2β)‖∇f(x)‖² + f(x) - f(y)   (by Lemma 1)
≤ -(1/2β)‖∇f(x)‖² + ∇f(x)ᵀ(x - y)   (by convexity)
Accelerated Gradient Descent: Analysis
f(x - (1/β)∇f(x)) - f(y) ≤ -(1/2β)‖∇f(x)‖² + ∇f(x)ᵀ(x - y)
Apply to x = x_s, y = y_s (note y_{s+1} = x_s - (1/β)∇f(x_s), so ∇f(x_s) = β(x_s - y_{s+1})):
f(y_{s+1}) - f(y_s) ≤ β(x_s - y_{s+1})ᵀ(x_s - y_s) - (β/2)‖x_s - y_{s+1}‖²   (1)
Apply to x = x_s, y = x*:
f(y_{s+1}) - f(x*) ≤ β(x_s - y_{s+1})ᵀ(x_s - x*) - (β/2)‖x_s - y_{s+1}‖²   (2)
Accelerated Gradient Descent: Analysis
(λ_s - 1) × (1) + (2), for δ_s = f(y_s) - f(x*):
λ_s δ_{s+1} - (λ_s - 1) δ_s ≤ β(x_s - y_{s+1})ᵀ(λ_s x_s - (λ_s - 1) y_s - x*) - (β λ_s / 2)‖x_s - y_{s+1}‖²
Multiply by λ_s and use λ_{s-1}² = λ_s² - λ_s:
λ_s² δ_{s+1} - λ_{s-1}² δ_s ≤ (β/2)(2 λ_s (x_s - y_{s+1})ᵀ(λ_s x_s - (λ_s - 1) y_s - x*) - ‖λ_s (x_s - y_{s+1})‖²)
It holds that (using 2aᵀb - ‖a‖² = ‖b‖² - ‖b - a‖²):
2 λ_s (x_s - y_{s+1})ᵀ(λ_s x_s - (λ_s - 1) y_s - x*) - ‖λ_s (x_s - y_{s+1})‖² = ‖λ_s x_s - (λ_s - 1) y_s - x*‖² - ‖λ_s y_{s+1} - (λ_s - 1) y_s - x*‖²
Accelerated Gradient Descent: Analysis
By definition of AGD: x_{s+1} = y_{s+1} + γ_s (y_s - y_{s+1}), so
λ_{s+1} x_{s+1} = λ_{s+1} y_{s+1} + (1 - λ_s)(y_s - y_{s+1})
λ_{s+1} x_{s+1} - (λ_{s+1} - 1) y_{s+1} = λ_s y_{s+1} - (λ_s - 1) y_s
Putting the last three facts together, for u_s = λ_s x_s - (λ_s - 1) y_s - x* we have:
λ_s² δ_{s+1} - λ_{s-1}² δ_s ≤ (β/2)(‖u_s‖² - ‖u_{s+1}‖²)
Adding up over s = 1 to s = t - 1:
λ_{t-1}² δ_t ≤ (β/2)‖u_1‖² = (β/2)‖x_1 - x*‖²
By induction λ_{t-1} ≥ t/2, so f(y_t) - f(x*) ≤ 2β‖x_1 - x*‖² / t². Q.E.D.
Constrained Convex Optimization
Non-convex optimization is NP-hard: e.g. the constraint x_i ∈ {0, 1} can be written as the quadratic equation x_i(1 - x_i) = 0
Knapsack: Minimize ∑_i c_i x_i
Subject to: ∑_i w_i x_i ≥ W, x_i ∈ {0, 1}
Convex optimization can often be solved by the ellipsoid algorithm in poly(n) time, but this is too slow in practice.
Convex multivariate functions
Convexity:
∀x, y ∈ ℝⁿ: f(x) ≥ f(y) + (x - y)ᵀ∇f(y)
∀x, y ∈ ℝⁿ, 0 ≤ λ ≤ 1: f(λx + (1 - λ)y) ≤ λf(x) + (1 - λ)f(y)
If higher derivatives exist:
f(y) = f(x) + ∇f(x)ᵀ(y - x) + ½(y - x)ᵀ∇²f(x)(y - x) + ...
where ∇²f(x) is the Hessian matrix: (∇²f)_{ij} = ∂²f / ∂x_i ∂x_j
f is convex iff its Hessian is positive semidefinite: zᵀ∇²f(x) z ≥ 0 for all z, x.
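A brief Python illustration of the Hessian criterion, not from the original slides; the quadratic instance is an assumption made for the example.

```python
import numpy as np

# For a quadratic f(x) = 0.5 x^T Q x + c^T x the Hessian is the constant matrix Q,
# so convexity amounts to Q being positive semidefinite (all eigenvalues >= 0).
rng = np.random.default_rng(3)
M = rng.standard_normal((5, 5))
Q = M.T @ M                                      # Q = M^T M is p.s.d. by construction
print(np.all(np.linalg.eigvalsh(Q) >= -1e-10))   # True
```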
Examples of convex functions
ℓ_p-norm is convex for 1 ≤ p ≤ ∞:
‖λx + (1 - λ)y‖_p ≤ ‖λx‖_p + ‖(1 - λ)y‖_p = λ‖x‖_p + (1 - λ)‖y‖_p
f(x) = log(e^{x_1} + e^{x_2} + ... + e^{x_n}):
max(x_1, ..., x_n) ≤ f(x) ≤ max(x_1, ..., x_n) + log n
f(x) = xᵀAx where A is a p.s.d. matrix (∇²f = 2A ⪰ 0)
Examples of constrained convex optimization:
(Linear equations with p.s.d. constraints): minimize ½xᵀAx - xᵀb (solution satisfies Ax = b)
(Least squares regression): minimize ‖Ax - b‖₂² = xᵀAᵀAx - 2(Ax)ᵀb + bᵀb
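A quick Python check of the log-sum-exp bounds quoted above, not from the original slides; the random vector and seed are illustrative.

```python
import numpy as np

# Check max(x_i) <= log(sum_i e^{x_i}) <= max(x_i) + log(n) on a random vector.
rng = np.random.default_rng(4)
x = rng.standard_normal(7)
lse = np.log(np.sum(np.exp(x)))
assert x.max() <= lse <= x.max() + np.log(len(x))
print(x.max(), lse, x.max() + np.log(len(x)))
```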
Constrained Convex Optimization
General formulation for convex f and a convex set K:
minimize: f(x)
subject to: x ∈ K
Example (SVMs):
Data: x_1, ..., x_n ∈ ℝᵈ labeled by y_1, ..., y_n ∈ {-1, 1} (spam / non-spam)
Find a linear model w:
wᵀx_i ≥ 1 if x_i is spam
wᵀx_i ≤ -1 if x_i is non-spam
∀i: 1 - y_i wᵀx_i ≤ 0
More robust version:
minimize: (1/n) ∑_i Loss(1 - y_i wᵀx_i) + λ‖w‖₂²
E.g. hinge loss: Loss(t) = max(0, t)
Another regularizer: λ‖w‖_1 (favors sparse solutions)
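A minimal Python sketch of the robust SVM objective above, not from the original slides; the helper names are my own, and because the hinge loss is not differentiable at the kink the second function returns a subgradient rather than a gradient.

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """(1/n) sum_i max(0, 1 - y_i w^T x_i) + lam * ||w||_2^2  (hinge loss + L2 regularizer)."""
    margins = 1.0 - y * (X @ w)
    return np.mean(np.maximum(0.0, margins)) + lam * np.dot(w, w)

def svm_subgradient(w, X, y, lam):
    """A subgradient of the objective above."""
    active = (1.0 - y * (X @ w) > 0).astype(float)    # terms with positive hinge loss
    return -(X.T @ (active * y)) / len(y) + 2.0 * lam * w
```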
Gradient Descent for Constrained Convex Optimization
(Projection): Π_K(y) = argmin_{z ∈ K} ‖z - y‖₂
Easy to compute for the unit ℓ₂-ball: Π(y) = y / ‖y‖₂ (when ‖y‖₂ > 1)
Let ‖∇f(x)‖₂ ≤ G for all x ∈ K, and max_{x, y ∈ K} ‖x - y‖₂ ≤ D
Let T = 4G²D² / ε²
Gradient descent (gradient + projection oracles):
Let η = D / (G√T)
Repeat for t = 0, ..., T:
y^{(t+1)} = x^{(t)} - η ∇f(x^{(t)})
x^{(t+1)} = projection of y^{(t+1)} onto K
Output x̄ = (1/T) ∑_t x^{(t)}
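A minimal Python sketch of projected gradient descent with the ℓ₂-ball projection mentioned above, not from the original slides; the step size η = D/(G√T) is left to the caller, and the function names are illustrative.

```python
import numpy as np

def project_l2_ball(y, radius=1.0):
    """Projection onto the Euclidean ball of the given radius (the 'easy' case on the slide)."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def projected_gradient_descent(grad, project, x0, eta, T):
    """Projected gradient descent; returns the averaged iterate x_bar as on the slide."""
    x = x0.copy()
    iterates = []
    for _ in range(T):
        y = x - eta * grad(x)        # gradient step
        x = project(y)               # project back onto the feasible set K
        iterates.append(x)
    return np.mean(iterates, axis=0)
```

Averaging the iterates (rather than returning the last one) is what the analysis on the next slides bounds.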
Gradient Descent for Constrained Convex Optimization
‖x^{(t+1)} - x*‖₂² ≤ ‖y^{(t+1)} - x*‖₂² = ‖x^{(t)} - η∇f(x^{(t)}) - x*‖₂²
= ‖x^{(t)} - x*‖₂² + η²‖∇f(x^{(t)})‖₂² - 2η ∇f(x^{(t)})ᵀ(x^{(t)} - x*)
Using the definition of G and convexity (f(x^{(t)}) - f(x*) ≤ ∇f(x^{(t)})ᵀ(x^{(t)} - x*)):
f(x^{(t)}) - f(x*) ≤ (1/2η)(‖x^{(t)} - x*‖₂² - ‖x^{(t+1)} - x*‖₂²) + (η/2)G²
Sum over t = 1, ..., T:
(1/T) ∑_{t=1}^{T} (f(x^{(t)}) - f(x*)) ≤ (1/2ηT)‖x^{(0)} - x*‖₂² + (η/2)G²
Gradient Descent for Constrained Convex Optimization
(1/T) ∑_{t=1}^{T} (f(x^{(t)}) - f(x*)) ≤ (1/2ηT)‖x^{(0)} - x*‖₂² + (η/2)G² ≤ D²/(2ηT) + ηG²/2
By convexity: f(x̄) - f(x*) ≤ (1/T) ∑_{t=1}^{T} (f(x^{(t)}) - f(x*))
Set η = D / (G√T): RHS ≤ DG/√T ≤ ε
Online Gradient Descent
Gradient descent works in a more general case:
f = sequence of convex functions f_1, f_2, ..., f_T
At step t we need to output x^{(t)} ∈ K
Let x* be the minimizer of ∑_t f_t(x)
Minimize regret: ∑_t f_t(x^{(t)}) - ∑_t f_t(x*)
The same analysis as before works in the online case.
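A short Python sketch of the online protocol, not from the original slides: the learner plays a point, then receives the gradient of the revealed function and takes a projected step. Names and the calling convention (a sequence of gradient callables) are illustrative assumptions.

```python
import numpy as np

def online_gradient_descent(grad_fns, project, x0, eta):
    """Online GD: play x^(t), then step along the gradient of the revealed f_t."""
    x = x0.copy()
    played = []
    for grad_t in grad_fns:          # grad_t(x) is the gradient of f_t at x
        played.append(x)
        x = project(x - eta * grad_t(x))
    return played                    # regret compares these plays against the best fixed x* in K
```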
Stochastic Gradient Descent
(Expected gradient oracle): returns g̃ such that E[g̃] = ∇f(x)
Example: for SVM, pick one term of the loss function at random.
Let g̃_i be the gradient returned at step i
Let f_i = g̃_iᵀ x be the function used in the i-th step of OGD
Let x̄ = (1/T) ∑_i x^{(i)} and let x* be the minimizer of f.
Stochastic Gradient Descent
Thm. E[f(x̄)] ≤ f(x*) + DG/√T, where G is an upper bound on the norm of any gradient output by the oracle.
E[f(x̄)] - f(x*) ≤ (1/T) E[∑_i (f(x^{(i)}) - f(x*))]   (convexity)
≤ (1/T) E[∑_i ∇f(x^{(i)})ᵀ(x^{(i)} - x*)]
= (1/T) E[∑_i g̃_iᵀ(x^{(i)} - x*)]   (grad. oracle)
= (1/T) E[∑_i (f_i(x^{(i)}) - f_i(x*))]
= (1/T) E[regret of OGD], which is always ≤ DG/√T
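A minimal Python sketch of SGD for the hinge-loss SVM from the earlier slide, not from the original slides: each step uses one randomly chosen term of the loss, which is an unbiased estimate of the full (sub)gradient. The ball radius, step size, seed, and names are illustrative assumptions.

```python
import numpy as np

def sgd_svm(X, y, lam, eta, T, radius=10.0, seed=0):
    """SGD for (1/n) sum_i max(0, 1 - y_i w^T x_i) + lam ||w||^2 with one random term per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    avg = np.zeros(d)
    for _ in range(T):
        i = rng.integers(n)                  # expected gradient oracle: one random example
        g = 2.0 * lam * w
        if 1.0 - y[i] * (X[i] @ w) > 0:
            g = g - y[i] * X[i]              # active hinge term contributes -y_i x_i
        w = w - eta * g
        norm = np.linalg.norm(w)             # keep iterates inside a bounded set K
        if norm > radius:
            w = w * (radius / norm)
        avg += w
    return avg / T                           # averaged iterate x_bar, as in the OGD analysis
```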