Advanced Study on Optimization and Stability in Neural Networks

Explore the intricacies of optimization and stability in neural networks through advanced readings in deep learning and vision. The content covers the stability of learning algorithms and its implications, the stability of stochastic gradient descent and its variants, failures of gradient-based methods, and more, providing a broader perspective on the challenges and advancements in this field.

  • Neural Networks
  • Optimization
  • Stability
  • Deep Learning
  • Vision




Presentation Transcript


  1. OPTIMIZATION AND STABILITY IN NEURAL NETWORKS. Advanced Reading in Deep Learning and Vision, Spring 2017. Dror Kaufmann, Nitzan Artzi.

  2. OUTLINE. Stability of a learning algorithm and its implications [1]. Stability of stochastic gradient descent [1]. Variants of SGD [1]. Failures of gradient-based methods [2]. Bonus: some notes on architecture [3]. References: [1] Train faster, generalize better: Stability of stochastic gradient descent. Hardt, Recht and Singer, 2016. [2] Failures of Deep Learning. Shalev-Shwartz, Shamir and Shammah, 2017. [3] On the Sample Complexity of End-to-end Training vs. Semantic Abstraction Training. Shalev-Shwartz and Shashua, 2016.

  3. THE CLASSIC APPROACH TO STATISTICAL LEARNING. Data comes in samples $z = (x, y) \sim \mathcal{D}$, from an unknown distribution $\mathcal{D}$. $S = (z_1, \dots, z_n)$ is our training set. Our goal is to minimize (in expectation) some loss function $f(w; z)$ with respect to $w$, using an algorithm $A(S)$. We usually look at the risk and the empirical risk: $R[w] = \mathbb{E}_{z \sim \mathcal{D}}\, f(w; z)$ and $R_S[w] = \frac{1}{n} \sum_{i=1}^{n} f(w; z_i)$.
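
The deck contains no code, but the two quantities are easy to see in a minimal numpy sketch; the 1-D linear model, the squared loss, and the particular distribution $\mathcal{D}$ below are illustration choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w, x, y):
    """f(w; z) for z = (x, y): squared loss of the 1-D linear model w*x."""
    return (w * x - y) ** 2

def sample(n):
    """Draw n samples z = (x, y) from a toy distribution D."""
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(scale=0.1, size=n)
    return x, y

w = 1.8                              # some fixed parameter value
xs, ys = sample(50)                  # training set S with n = 50
R_S = loss(w, xs, ys).mean()         # empirical risk R_S[w]
xt, yt = sample(1_000_000)           # a large fresh sample approximates D
R = loss(w, xt, yt).mean()           # Monte Carlo estimate of the risk R[w]
print(f"R_S[w] = {R_S:.4f}, R[w] ~ {R:.4f}")
```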

  4. GENERALIZATION AND OVERFITTING. The generalization gap is $\epsilon_{\text{gen}} \triangleq \mathbb{E}_{S, A}\left[ R[A(S)] - R_S[A(S)] \right]$.

  5. STABILITY. Def: A randomized algorithm $A$ is $\epsilon$-uniformly stable if for all data sets $S, S' \in Z^n$ such that $S$ and $S'$ differ in at most one example, we have $\sup_z \mathbb{E}_A\left[ f(A(S); z) - f(A(S'); z) \right] \le \epsilon$. Motivation: for an $\epsilon$-uniformly stable algorithm, $\left| \epsilon_{\text{gen}} \right| \le \epsilon$. Stable algorithms don't overfit!

  6. GRADIENT DESCENT. Update rule: $w_{t+1} = G_t(w_t) = w_t - \alpha_t\, g_t(w_t)$. In (non-stochastic) gradient descent, $g_t(w) = \nabla R_S[w]$. In stochastic gradient descent, $g_t(w) = \nabla_w f(w; z_{i_t})$, where the indices $i_t$ are selected randomly: uniformly from $[n]$, or as a random permutation of $[n]$.
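
Continuing the toy model above, a hedged sketch of both update rules; the step size, iteration count, and demo data are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad(w, x, y):
    """Gradient of f(w; z) = (w*x - y)^2 with respect to w."""
    return 2.0 * x * (w * x - y)

def gd(w, xs, ys, alpha=0.05, T=200):
    """Non-stochastic GD: g_t(w) is the gradient of the empirical risk."""
    for _ in range(T):
        w -= alpha * grad(w, xs, ys).mean()
    return w

def sgd(w, xs, ys, alpha=0.05, T=200, permute=False):
    """SGD: g_t(w) = grad f(w; z_{i_t}), i_t uniform on [n] or from a permutation."""
    n, order = len(xs), []
    for _ in range(T):
        if permute:
            if not order:                      # refill with a fresh permutation of [n]
                order = list(rng.permutation(n))
            i = order.pop()
        else:
            i = rng.integers(n)                # uniform selection from [n]
        w -= alpha * grad(w, xs[i], ys[i])
    return w

xs = rng.normal(size=50); ys = 2.0 * xs + rng.normal(scale=0.1, size=50)
print(gd(0.0, xs, ys), sgd(0.0, xs, ys), sgd(0.0, xs, ys, permute=True))
```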

  7. STABILITY OF SGD. Let $f(\cdot; z)$ be an $L$-Lipschitz and $\beta$-smooth function. For convex and strongly convex losses, SGD is guaranteed to be stable with the following parameters. Convex: $\epsilon = \frac{2L^2}{n} \sum_{t=1}^{T} \alpha_t$ (for $\alpha_t \le 2/\beta$). $\gamma$-strongly convex: $\epsilon = \frac{2L^2}{\gamma n}$ (for $\alpha_t \le 1/\beta$). Alas, neural networks are usually non-convex.

  8. NON-CONVEXITY OF NN. [Figure: a toy two-weight network illustrating the non-convexity of the loss surface.]

  9. PROPERTIES OF UPDATE RULES. Def: An update rule $w_{t+1} = G(w_t)$ is $\eta$-expansive if $\sup_{v, w} \frac{\|G(v) - G(w)\|}{\|v - w\|} \le \eta$. Def: An update rule is $\sigma$-bounded if $\sup_{w} \|w - G(w)\| \le \sigma$.
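
These definitions can be probed numerically. A sketch for the gradient update $G(w) = w - \alpha \nabla f(w)$ on an arbitrary smooth quadratic; note that random pairs only lower-bound the two suprema.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 0.1, 4.0

def G(w):
    """Gradient update for f(w) = (beta/2)||w||^2, whose gradient is beta*w."""
    return w - alpha * beta * w

pairs = rng.normal(size=(10_000, 2, 5))              # random (v, w) pairs in R^5
eta_hat = max(np.linalg.norm(G(v) - G(w)) / np.linalg.norm(v - w)
              for v, w in pairs)                     # empirical expansiveness
sigma_hat = max(np.linalg.norm(w - G(w)) for _, w in pairs)  # empirical boundedness
print(f"eta >= {eta_hat:.3f}, sigma >= {sigma_hat:.3f}")
```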

  10. GROWTH RECURSION. Lemma (2.5): Fix two arbitrary sequences of updates $G_1, \dots, G_T$ and $G'_1, \dots, G'_T$, with a shared starting point $w_0 = w'_0$. These define two sequences of steps: $w_{t+1} = G_t(w_t)$ and $w'_{t+1} = G'_t(w'_t)$. Then the deviation $\delta_t \triangleq \|w_t - w'_t\|$ satisfies the recursion: $\delta_{t+1} \le \eta\, \delta_t$ if $G_t = G'_t$ and $G_t$ is $\eta$-expansive; $\delta_{t+1} \le \hat\eta\, \delta_t + 2\sigma_t$, with $\hat\eta = \min(1, \eta)$, if $G_t$ and $G'_t$ are $\sigma_t$-bounded and $G_t$ is $\eta$-expansive.
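
A small sketch that unrolls the recursion numerically; which steps share an update, and the values of $\eta$ and $\sigma_t$, are arbitrary stand-ins.

```python
import numpy as np

def growth_recursion(T, eta, sigma, same_update):
    """Upper-bound delta_T by iterating Lemma 2.5.

    same_update[t] is True when G_t == G'_t (use delta <- eta * delta);
    otherwise both rules are sigma[t]-bounded (delta <- min(1, eta)*delta + 2*sigma[t]).
    """
    delta = 0.0                        # shared starting point: delta_0 = 0
    for t in range(T):
        if same_update[t]:
            delta = eta * delta
        else:
            delta = min(1.0, eta) * delta + 2.0 * sigma[t]
    return delta

rng = np.random.default_rng(3)
T = 100
same = rng.random(T) > 1.0 / 50        # differing example hit w.p. 1/n, n = 50
sigma = 0.05 / (1 + np.arange(T))      # e.g. sigma_t = alpha_t * L with alpha_t ~ c/t
print(growth_recursion(T, eta=1.02, sigma=sigma, same_update=same))
```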

  11. SGD IS STABLE: OUTLINE. Main ideas in the stability proof: In SGD we use only one example in each iteration, so early in training we are unlikely to have selected the differing example. Using the growth-recursion lemma, we can then bound the deviation between the outputs of the algorithm on the two data sets.
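
Both ideas show up in a coupled simulation: run SGD on $S$ and on $S'$ (differing in one example) with the same random indices and watch $\delta_t = |w_t - w'_t|$. The toy data and hyper-parameters are again illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
xs = rng.normal(size=n); ys = 2.0 * xs + rng.normal(scale=0.1, size=n)
xs2, ys2 = xs.copy(), ys.copy()
xs2[0], ys2[0] = rng.normal(), rng.normal()    # S' differs from S in example 0

def grad(w, x, y):
    return 2.0 * x * (w * x - y)

w = w2 = 0.0
for t in range(1, 201):
    i = rng.integers(n)                        # shared randomness for both runs
    alpha = 0.5 / t                            # step sizes alpha_t = c / t
    w  -= alpha * grad(w,  xs[i],  ys[i])
    w2 -= alpha * grad(w2, xs2[i], ys2[i])
    # delta_t stays exactly 0 until index 0 is first drawn (prob. 1/n per step)
print(abs(w - w2))
```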

  12. SGD IS STABLE: LEMMA. Lemma (~3.11): Let $f(\cdot; z) \in [0, 1]$ be an $L$-Lipschitz loss function, and let $S$ and $S'$ be two $n$-sample sets differing in a single example. Denote by $w_T$ and $w'_T$ the outputs of $T$ steps of SGD on $S$ and $S'$, respectively. Then, for every $z$ and every $t_0 \in [n]$: $\mathbb{E}\left| f(w_T; z) - f(w'_T; z) \right| \le \frac{t_0}{n} + L\, \mathbb{E}\left[ \delta_T \mid \delta_{t_0} = 0 \right]$, where we write $\Delta_T \triangleq \mathbb{E}\left[ \delta_T \mid \delta_{t_0} = 0 \right]$.

  13. SGD IS STABLE: LEMMA. Proof: $\mathbb{E}\left| f(w_T; z) - f(w'_T; z) \right| = \Pr\left(\delta_{t_0} = 0\right) \mathbb{E}\left[ \left| f(w_T; z) - f(w'_T; z) \right| \mid \delta_{t_0} = 0 \right] + \Pr\left(\delta_{t_0} \ne 0\right) \mathbb{E}\left[ \left| f(w_T; z) - f(w'_T; z) \right| \mid \delta_{t_0} \ne 0 \right] \le L\, \mathbb{E}\left[ \|w_T - w'_T\| \mid \delta_{t_0} = 0 \right] + \Pr\left(\delta_{t_0} \ne 0\right) = L\, \mathbb{E}\left[ \delta_T \mid \delta_{t_0} = 0 \right] + \Pr\left(\delta_{t_0} \ne 0\right)$, where the inequality uses the Lipschitz property on the first term and $f \in [0, 1]$ on the second.

  14. SGD IS STABLE: LEMMA. It is left to bound $\Pr\left(\delta_{t_0} \ne 0\right)$. Let $I$ be the iteration index at which SGD uses the differing sample for the first time. The deviation can be nonzero at step $t_0$ only if the differing sample has already been used, so $\Pr\left(\delta_{t_0} \ne 0\right) \le \Pr\left(I \le t_0\right) \le \sum_{t=1}^{t_0} \frac{1}{n} = \frac{t_0}{n}$ (uniform selection, union bound). Finally, $\mathbb{E}\left| f(w_T; z) - f(w'_T; z) \right| \le \frac{t_0}{n} + L\, \mathbb{E}\left[ \delta_T \mid \delta_{t_0} = 0 \right]$.

  15. SGD IS STABLE: MAIN THEOREM. Theorem (~3.12): Consider $f$ as before, and also $\beta$-smooth. Suppose we run SGD for $T$ steps with non-increasing step sizes $\alpha_t \le c/t$. Then the algorithm has uniform stability with $\epsilon_{\text{stab}} \le \frac{1 + \frac{1}{\beta c}}{n - 1} \left(2 c L^2\right)^{\frac{1}{\beta c + 1}}\, T^{\frac{\beta c}{\beta c + 1}}$.
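
The bound is easy to evaluate numerically; a sketch (the constants below are placeholders):

```python
import math

def sgd_stability_bound(T, n, L, beta, c):
    """Uniform-stability bound of Theorem 3.12 for step sizes alpha_t <= c/t."""
    q = beta * c / (beta * c + 1.0)          # exponent of T
    return (1.0 + 1.0 / (beta * c)) / (n - 1) * (2 * c * L**2) ** (1 - q) * T ** q

# The bound grows sublinearly in T and decays roughly as 1/n:
for T in (10**3, 10**4, 10**5):
    print(T, sgd_stability_bound(T, n=50_000, L=1.0, beta=10.0, c=0.1))
```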

  16. SGD IS STABLE: MAIN THEOREM. Proof: By the lemma, we have $\mathbb{E}\left| f(w_T; z) - f(w'_T; z) \right| \le \frac{t_0}{n} + L\, \mathbb{E}\left[ \delta_T \mid \delta_{t_0} = 0 \right]$. We will bound $\Delta_T = \mathbb{E}\left[ \delta_T \mid \delta_{t_0} = 0 \right]$ as a function of $t_0$, and then minimize over $t_0$. Claim: our SGD updates are $(1 + \alpha_t \beta)$-expansive and $\alpha_t L$-bounded. By the growth recursion, this implies $\delta_{t+1} \le (1 + \alpha_t \beta)\, \delta_t$ when both runs use the same sample, and $\delta_{t+1} \le \delta_t + 2 \alpha_t L$ otherwise.

  17. Hence $\Delta_{t+1} = \mathbb{E}\left[ \delta_{t+1} \mid \delta_{t_0} = 0 \right] = \mathbb{E}\left[ \delta_{t+1} \mid \delta_{t_0} = 0,\, z_{i_t} = z'_{i_t} \right] \Pr\left( z_{i_t} = z'_{i_t} \right) + \mathbb{E}\left[ \delta_{t+1} \mid \delta_{t_0} = 0,\, z_{i_t} \ne z'_{i_t} \right] \Pr\left( z_{i_t} \ne z'_{i_t} \right) \le \left(1 - \frac{1}{n}\right)(1 + \alpha_t \beta)\, \Delta_t + \frac{1}{n}\left( \Delta_t + 2 \alpha_t L \right) = \left(1 + \left(1 - \frac{1}{n}\right) \alpha_t \beta \right) \Delta_t + \frac{2 \alpha_t L}{n}$.

  18. $\Delta_{t+1} \le \left(1 + \left(1 - \frac{1}{n}\right) \alpha_t \beta \right) \Delta_t + \frac{2 \alpha_t L}{n}$. Substituting $\alpha_t = \frac{c}{t}$: $\Delta_{t+1} \le \left(1 + \left(1 - \frac{1}{n}\right) \frac{c \beta}{t} \right) \Delta_t + \frac{2 c L}{t n} \le \exp\left( \left(1 - \frac{1}{n}\right) \frac{c \beta}{t} \right) \Delta_t + \frac{2 c L}{t n}$, using $1 + x \le e^x$.

  19. RECAP. Trying to optimize the bound on $\epsilon_{\text{stab}}$: $\epsilon_{\text{stab}} \le \frac{t_0}{n} + L\, \mathbb{E}\left[ \delta_T \mid \delta_{t_0} = 0 \right]$, where we condition on $\delta_{t_0} = 0$. We have $\Delta_{t+1} \le \exp\left( \left(1 - \frac{1}{n}\right) \frac{c \beta}{t} \right) \Delta_t + \frac{2 c L}{t n}$. Unrolling the recursion from $t_0$ (where $\Delta_{t_0} = 0$) to $T$ yields: $\Delta_T \le \sum_{t = t_0 + 1}^{T} \left\{ \prod_{k = t + 1}^{T} \exp\left( \left(1 - \frac{1}{n}\right) \frac{\beta c}{k} \right) \right\} \frac{2 c L}{t n}$.

  20. SGD IS STABLE: MAIN THEOREM. $\Delta_T \le \sum_{t = t_0 + 1}^{T} \left\{ \prod_{k = t + 1}^{T} \exp\left( \left(1 - \frac{1}{n}\right) \frac{\beta c}{k} \right) \right\} \frac{2 c L}{t n} = \sum_{t = t_0 + 1}^{T} \exp\left( \left(1 - \frac{1}{n}\right) \beta c \sum_{k = t + 1}^{T} \frac{1}{k} \right) \frac{2 c L}{t n} \le \sum_{t = t_0 + 1}^{T} \exp\left( \left(1 - \frac{1}{n}\right) \beta c \ln\frac{T}{t} \right) \frac{2 c L}{t n} = \frac{2 c L}{n}\, T^{\left(1 - \frac{1}{n}\right) \beta c} \sum_{t = t_0 + 1}^{T} t^{-\left(1 - \frac{1}{n}\right) \beta c - 1}$.

  21. $\frac{2 c L}{n}\, T^{\left(1 - \frac{1}{n}\right) \beta c} \sum_{t = t_0 + 1}^{T} t^{-\left(1 - \frac{1}{n}\right) \beta c - 1} \le \frac{2 c L}{n}\, T^{\left(1 - \frac{1}{n}\right) \beta c} \cdot \frac{1}{\left(1 - \frac{1}{n}\right) \beta c}\, t_0^{-\left(1 - \frac{1}{n}\right) \beta c} \le \frac{2 L}{\beta (n - 1)} \left( \frac{T}{t_0} \right)^{\beta c}$. Which means $\epsilon_{\text{stab}} \le \frac{t_0}{n} + \frac{2 L^2}{\beta (n - 1)} \left( \frac{T}{t_0} \right)^{\beta c}$. Setting $t_0 = \left(2 c L^2\right)^{\frac{1}{\beta c + 1}}\, T^{\frac{\beta c}{\beta c + 1}}$ gives $\epsilon_{\text{stab}} \le \frac{1 + \frac{1}{\beta c}}{n - 1} \left(2 c L^2\right)^{\frac{1}{\beta c + 1}}\, T^{\frac{\beta c}{\beta c + 1}}$.

  22. STABILITY OF SGD: CONCLUSION. SGD is stable with $\epsilon_{\text{stab}} \lesssim \frac{T^{1 - \frac{1}{\beta c + 1}}}{n}$, even for non-convex problems. This stability ensures a low generalization error. However, the bound grows with $T$, so training for too long renders the bound useless.

  23. STABILITY-INDUCING OPERATIONS. Several known methods can be understood through this point of view: weight decay / $\ell_2$ regularization, gradient clipping, dropout, and model averaging.

  24. WEIGHT DECAY. Using weight decay, the update becomes $G(w) = (1 - \alpha \mu)\, w - \alpha \nabla f(w)$. This is equivalent to gradient descent on $g(w) = f(w) + \frac{\mu}{2} \|w\|^2$. (Lemma 4.2): The expansiveness changes to $1 + \alpha (\beta - \mu)$, that is, we can replace $\beta \to \beta - \mu$.
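
A sketch of the equivalence (the quadratic $f$, $\mu$, and $\alpha$ below are arbitrary):

```python
import numpy as np

alpha, mu = 0.1, 0.5
w = np.array([1.0, -2.0])

def grad_f(w):
    """Gradient of an arbitrary smooth f(w) = w1^2 + 3*w2^2."""
    return np.array([2.0 * w[0], 6.0 * w[1]])

step_decay = (1 - alpha * mu) * w - alpha * grad_f(w)    # weight-decay update
step_reg   = w - alpha * (grad_f(w) + mu * w)            # GD on f(w) + (mu/2)||w||^2
assert np.allclose(step_decay, step_reg)                 # identical updates
print("weight decay == l2-regularized GD step:", step_decay)
```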

  25. GRADIENT CLIPPING. Using gradient clipping, we introduce a bound on the gradient norm (usually by truncation). This directly lowers the Lipschitz parameter $L$, and $\epsilon_{\text{stab}} \propto L^{\frac{2}{\beta c + 1}}$.
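
A common norm-truncation sketch (the threshold is arbitrary); clipping enforces $\|g\| \le L_{\text{clip}}$, so the effective Lipschitz constant in the bound shrinks:

```python
import numpy as np

def clip_gradient(g, max_norm):
    """Truncate g to norm max_norm, leaving shorter gradients untouched."""
    norm = np.linalg.norm(g)
    return g if norm <= max_norm else g * (max_norm / norm)

g = np.array([3.0, 4.0])                 # ||g|| = 5
print(clip_gradient(g, max_norm=1.0))    # rescaled to norm 1
```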

  26. DROPOUT. With dropout, we zero out a random fraction of the gradient coordinates, so that an expected fraction $s$ survives. (Lemma 4.4): Dropout makes the update rule $s \alpha L$-bounded, that is, we can replace $L \to sL$.
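
A sketch of dropout applied to the gradient; the iid coordinate mask and the keep-probability $s$ are illustrative choices, one common instantiation of a dropout operator.

```python
import numpy as np

rng = np.random.default_rng(5)

def dropout_gradient(g, s):
    """Zero each coordinate independently, keeping it with probability s."""
    mask = rng.random(g.shape) < s
    return g * mask

g = np.ones(10)
# For this mask the expected squared norm is s * ||g||^2.
print(dropout_gradient(g, s=0.5))
```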

  27. MODEL AVERAGING. In model averaging we output $\bar{w}_T = \frac{1}{T} \sum_{t=1}^{T} w_t$. (Theorem 4.7): Under some assumptions, $\epsilon_{\text{stab}} \le \frac{\alpha T L^2}{n}$ (this is not necessarily an improvement).
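
A sketch of averaging the SGD iterates, here as a running mean so no history is stored (toy data and step size as before):

```python
import numpy as np

rng = np.random.default_rng(6)
xs = rng.normal(size=50); ys = 2.0 * xs + rng.normal(scale=0.1, size=50)

w, w_avg = 0.0, 0.0
for t in range(1, 201):
    i = rng.integers(len(xs))
    w -= 0.05 * 2.0 * xs[i] * (w * xs[i] - ys[i])   # SGD step on the squared loss
    w_avg += (w - w_avg) / t                         # running average of the iterates
print(w, w_avg)
```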

  28. GRADIENT DESCENT VS. SGD. In (non-stochastic) gradient descent, the stability guarantee no longer holds. [Figure: two panels comparing Gradient Descent and Stochastic Gradient Descent.]

  29. FAILURES OF GRADIENT-BASED METHODS. Until now we have seen powerful guarantees on SGD's stability and generalization performance. However, these guarantees don't ensure a small risk, which is our actual goal. We'll have a taste of some of the reasons that cause gradient-based methods to fail. Based on: [2] Failures of Deep Learning; Shalev-Shwartz, Shamir and Shammah, 2017.

  30. VARIANCE OF GRADIENTS. In this point of view: assume that the samples are labeled according to some target function $h$ from a class $\mathcal{H}$. Define the loss function $F_h(w) = \mathbb{E}_x\, \ell(w; x, h(x))$. Define the variance of the gradient: $\mathrm{Var}(\mathcal{H}, F, w) \triangleq \mathbb{E}_h \left\| \nabla F_h(w) - g(w) \right\|^2$, with $g(w) = \mathbb{E}_h \nabla F_h(w)$ and $h \sim \mathcal{H}$. Recall Chebyshev's inequality: $\Pr\left( |X - \mathbb{E} X| > t \right) \le \frac{\mathrm{Var}(X)}{t^2}$.

  31. VARIANCE OF GRADIENTS. For gradients: $\Pr\left( \left\| \nabla F_h(w) - g(w) \right\| > t \right) \le \frac{\mathrm{Var}(\mathcal{H}, F, w)}{t^2}$. However, $g(w) = \mathbb{E}_h \nabla F_h(w)$ is independent of $h$, that is, invariant to the labeling. Observation: a very small variance means the gradient holds almost no information about the target, so gradient-based methods will fail.

  32. VARIANCE APPLICATIONS. It's not surprising that for some piecewise-constant loss function the gradient is not informative (it is zero almost everywhere w.r.t. the Lebesgue measure). However, by [2] there exist (non-trivial) situations, such as learning parities, where one can show that the variance is exponentially small, so gradient-based methods are doomed to fail in those situations, even though smoothness and Lipschitzness hold. This is not a contradiction to the stability guarantees: stability controls the generalization gap, not the risk itself.
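
The parity case can be probed with a hedged Monte Carlo sketch (the tanh model, squared loss, and sample sizes are my choices, not from [2]): for random parity targets $h_v(x) = \prod_{i \in v} x_i$ on $x \in \{\pm 1\}^d$, estimate the variance of $\nabla F_{h_v}(w)$ across targets; it shrinks rapidly as $d$ grows, modulo Monte Carlo noise.

```python
import numpy as np

rng = np.random.default_rng(7)

def grad_F(w, X, labels):
    """Sample estimate of grad_w E_x (tanh(w.x) - h(x))^2."""
    p = np.tanh(X @ w)
    return (2.0 * (p - labels) * (1.0 - p ** 2)) @ X / len(X)

for d in (5, 10, 15):
    X = rng.choice([-1.0, 1.0], size=(20_000, d))   # x uniform on {-1, +1}^d
    w = 0.1 * rng.normal(size=d)                    # fixed query point w
    grads = []
    for _ in range(200):                            # random parity targets h_v
        v = rng.random(d) < 0.5                     # random subset of coordinates
        labels = np.prod(X[:, v], axis=1)           # h_v(x) = prod_{i in v} x_i
        grads.append(grad_F(w, X, labels))
    grads = np.array(grads)
    var = np.mean(np.sum((grads - grads.mean(axis=0)) ** 2, axis=1))
    print(f"d = {d:2d}: Var(H, F, w) ~ {var:.2e}")
```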

  33. BONUS: SOME NOTES ON ARCHITECTURE. So far we have only discussed the optimization side of the problem. Some error guarantees are architecture-related, regardless of the optimization method; to that end, in this section we only consider the validation of an already-trained network. Based on: [3] On the Sample Complexity of End-to-end Training vs. Semantic Abstraction Training; Shalev-Shwartz and Shashua, 2016.

  34. BONUS: SOME NOTES ON ARCHITECTURE. There are two approaches to NN architecture: end-to-end, and semantic abstraction. The end-to-end approach requires less preprocessing (labeling) than semantic abstraction. Semantic abstraction, with meaningful components, can lead to extremely high accuracy (as shown in [3]).

  35. VALIDATION: END-TO-END. Given a trained classification system $h$ with the zero-one loss and error rate $\epsilon$, that is, $\Pr_{(x, y) \sim \mathcal{D}}\left( h(x) = y \right) = 1 - \epsilon$, we would like to verify the accuracy. To that end, we must have at least $\frac{1}{\epsilon}$ validation samples (just to encounter a single failure). The desired error rate can be extremely small in some applications, and is therefore impossible to validate in the end-to-end approach. Decomposing the problem can result in high-accuracy guarantees.
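
The $1/\epsilon$ requirement in numbers. A sketch: to see at least one failure with confidence $1 - \delta$ you need roughly $m \ge \ln(1/\delta) / \epsilon$ samples (the $\epsilon$ values below are arbitrary examples):

```python
import math

def samples_to_see_a_failure(eps, delta=0.05):
    """Smallest m with (1 - eps)^m <= delta, approximately ln(1/delta)/eps."""
    return math.ceil(math.log(1.0 / delta) / eps)

for eps in (1e-2, 1e-6, 1e-9):
    print(f"eps = {eps:.0e}: need ~{samples_to_see_a_failure(eps):,} samples")
```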

  36. VALIDATION: SEMANTIC ABSTRACTION. Denote by $E$ the event of failure, and suppose the system fails only when two module-level events both occur, $E = E_1 \cap E_2$. Say $E_1$ is $\gamma$-approximately independent of $E_2$ if $\Pr\left(E_1 \mid E_2\right) \le \gamma \Pr E_1$. By the law of conditional probability: $\Pr E = \Pr\left(E_1 \mid E_2\right) \Pr E_2 \le \gamma\, \Pr E_1 \Pr E_2$. If we can guarantee that $\Pr E_i \le \epsilon$, then: $\Pr E \le \gamma \epsilon^2$.

  37. VALIDATION: SEMANTIC ABSTRACTION. Assume $E = \bigcap_{i=1}^{k} E_i$, where $E_i$ is $\gamma_i$-approximately independent of $E_1, \dots, E_{i-1}$. (Lemma 1): For every $\delta \in (0, 1)$, with probability $1 - \delta$, $\Pr E_i \le \hat{p}_i + \sqrt{\frac{2 \hat{p}_i \ln(k/\delta)}{m}} + \frac{4 \ln(k/\delta)}{m}$, where $\hat{p}_i$ is the empirical failure rate of module $i$ on $m$ validation samples. Which implies: $\Pr E \le \prod_{i=1}^{k} \gamma_i \left( \hat{p}_i + \sqrt{\frac{2 \hat{p}_i \ln(k/\delta)}{m}} + \frac{4 \ln(k/\delta)}{m} \right)$.
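
A sketch evaluating this bound for $k$ modules, following the reconstruction above; the empirical rates, $\gamma_i$, $m$, and $\delta$ below are made-up inputs.

```python
import math

def module_bound(p_hat, m, k, delta):
    """High-probability upper bound on Pr(E_i) from Lemma 1."""
    log_term = math.log(k / delta)
    return p_hat + math.sqrt(2.0 * p_hat * log_term / m) + 4.0 * log_term / m

def system_bound(p_hats, gammas, m, delta):
    """Bound on Pr(E) for E = intersection of the k module failure events."""
    k = len(p_hats)
    out = 1.0
    for p_hat, gamma in zip(p_hats, gammas):
        out *= gamma * module_bound(p_hat, m, k, delta)
    return out

# three modules, each validated on m = 10,000 samples:
print(system_bound(p_hats=[0.01, 0.02, 0.01], gammas=[1.0, 1.2, 1.1],
                   m=10_000, delta=0.05))
```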

  38. VALIDATION: SEMANTIC ABSTRACTION. $\Pr E \le \prod_{i=1}^{k} \gamma_i \left( \hat{p}_i + \sqrt{\frac{2 \hat{p}_i \ln(k/\delta)}{m}} + \frac{4 \ln(k/\delta)}{m} \right)$. The above guarantee is exponential in $k$: training and validating each module $E_i$ to failure rate $\epsilon$ results in a total failure rate of order $\epsilon^k$, with a validation complexity of only about $k/\epsilon$ samples. Note that we aren't required to observe even a single system-level failure.
