
Solver Optimization Methods for Constrained Problems in Machine Learning
Learn about efficient optimization methods for solving constrained problems in machine learning. Explore techniques for converting problems, introducing dual variables, and understanding the Lagrangian and dual problems. Discover the concept of duality and strong duality, leading to insights into solving complex ML optimization challenges.
Presentation Transcript
1. Solver Optimization for ML
2. Constrained Optimization Method 3: Creating a Dual Problem
Suppose we wish to solve $\min_x f(x)$ s.t. $g(x) \le 0$.
Trick: sneak this constraint into the objective. Construct a barrier (indicator) function $I(x)$ so that $I(x) = 0$ if $g(x) \le 0$ and $I(x) = \infty$ otherwise, and simply solve $\min_x \left[ f(x) + I(x) \right]$. It is easy to see that both problems have the same solution.
One very elegant way to construct such a barrier is the following: $I(x) = \max_{\lambda \ge 0} \lambda \cdot g(x)$.
Thus, we want to solve $\min_x \left[ f(x) + \max_{\lambda \ge 0} \lambda \cdot g(x) \right]$.
Hmm, we still have a constraint here, but a very simple one, i.e. $\lambda \ge 0$.
Let us see how to handle multiple constraints and equality constraints.
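As a quick sanity check of this construction, here is a minimal Python sketch. The finite grid of $\lambda$ values is an illustrative stand-in for "all $\lambda \ge 0$", and the probe values are invented:

```python
# The barrier I(x) = max_{lam >= 0} lam * g(x): the max is 0 when g(x) <= 0
# (the best lam is 0) and grows without bound when g(x) > 0.
def barrier_estimate(g_x, lam_grid):
    # Approximate the max over all lam >= 0 by a max over a finite grid
    return max(lam * g_x for lam in lam_grid)

lam_grid = [0.0, 1.0, 10.0, 100.0, 1000.0]
assert barrier_estimate(-2.0, lam_grid) == 0.0    # feasible point: barrier is 0
assert barrier_estimate(0.0, lam_grid) == 0.0     # boundary point: barrier is 0
assert barrier_estimate(0.5, lam_grid) == 500.0   # infeasible: grows with lam
```

Enlarging the grid makes the infeasible value as large as we like, which is how the finite max approximates the $\infty$ of the true indicator.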
3. A Few Cleanup Steps
Step 1: Convert your problem to a minimization problem: $\max_x f(x) \Rightarrow \min_x -f(x)$.
Step 2: Convert all inequality constraints to $\le$ constraints: $g(x) \ge 0 \Rightarrow -g(x) \le 0$.
Step 3: Convert all equality constraints to two inequality constraints: $h(x) = 0 \Rightarrow h(x) \le 0,\ -h(x) \le 0$.
Step 4: For each constraint we now have, introduce a new variable, e.g. if we have $n$ inequality constraints $g_1(x) \le 0, \ldots, g_n(x) \le 0$, introduce $n$ new variables $\alpha_1, \ldots, \alpha_n$.
These new variables are called dual variables, or sometimes Lagrange multipliers. The variables of the original optimization problem, e.g. $x$ in this case, are called the primal variables by comparison.
4. The Lagrangian
Consider the problem $\min_x f(x)$ s.t. $g_1(x) \le 0, \ldots, g_n(x) \le 0$.
$\mathcal{L}(x, \alpha) = f(x) + \sum_{i=1}^n \alpha_i g_i(x)$ is called the Lagrangian of the problem.
If $x$ violates even one constraint, we have $\max_{\alpha \ge 0} \mathcal{L}(x, \alpha) = \infty$.
If $x$ satisfies every single constraint, we have $\max_{\alpha \ge 0} \mathcal{L}(x, \alpha) = f(x)$.
Thus, the problem can be rewritten as $\min_x \max_{\alpha \ge 0} \left[ f(x) + \sum_{i=1}^n \alpha_i g_i(x) \right]$. This is just a nice way of rewriting the above problem.
5. The Dual Problem
The original optimization problem is also called the primal problem. Recall: variables of the original problem, e.g. $x$, are called primal variables.
Using the Lagrangian, we rewrote the primal problem as
$\min_x \max_{\alpha \ge 0} \left[ f(x) + \sum_{i=1}^n \alpha_i g_i(x) \right]$
The dual problem is obtained by simply switching the order of min/max:
$\max_{\alpha \ge 0} \min_x \left[ f(x) + \sum_{i=1}^n \alpha_i g_i(x) \right]$
In some cases, the dual problem is easier to solve than the primal.
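Swapping min and max is not free: in general it can only decrease the value (this is weak duality, $\max_\alpha \min_x \mathcal{L} \le \min_x \max_\alpha \mathcal{L}$). A minimal numeric sketch, with a random matrix standing in for $\mathcal{L}(x, \alpha)$ on finite choice sets:

```python
import numpy as np

# Weak duality in miniature: for any table of values L[x, a],
# max_a min_x L  <=  min_x max_a L.
rng = np.random.default_rng(0)
L = rng.normal(size=(6, 5))          # rows: choices of x, cols: choices of alpha
max_min = L.min(axis=0).max()        # dual-style value: max over a of (min over x)
min_max = L.max(axis=1).min()        # primal-style value: min over x of (max over a)
assert max_min <= min_max            # the minimax inequality always holds
```

The next slide discusses when the two values actually coincide.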
6. Duality
Let $(x^\ast, \alpha^\ast)$ be the solutions to the primal problem, i.e.
$(x^\ast, \alpha^\ast) = \arg\min_x \arg\max_{\alpha \ge 0} \left[ f(x) + \sum_{i=1}^n \alpha_i g_i(x) \right]$
Let $(\hat{x}, \hat{\alpha})$ be the solutions to the dual problem, i.e.
$(\hat{x}, \hat{\alpha}) = \arg\max_{\alpha \ge 0} \arg\min_x \left[ f(x) + \sum_{i=1}^n \alpha_i g_i(x) \right]$
Strong Duality: $x^\ast = \hat{x}$ if the original problem is convex and nice.
Complementary Slackness: $\hat{\alpha}_i \cdot g_i(\hat{x}) = 0$ for all constraints $i$.
(Note: it is not "compli­mentary" but "comple­mentary".)
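A tiny worked example makes both properties concrete. The 1-D problem below ($\min_x x^2$ s.t. $1 - x \le 0$, with optimum $x^\ast = 1$, $f(x^\ast) = 1$) is invented here for illustration; its inner minimization can be done by hand, giving the dual $g(\alpha) = \alpha - \alpha^2/4$:

```python
import numpy as np

# Toy convex problem: min_x x^2  s.t.  1 - x <= 0   (optimum x* = 1, f* = 1)
# Lagrangian: L(x, a) = x^2 + a*(1 - x).  The inner min over x gives x = a/2,
# so the dual objective is g(a) = a - a^2 / 4.
alphas = np.linspace(0.0, 5.0, 5001)
dual_vals = alphas - alphas**2 / 4.0
a_hat = alphas[np.argmax(dual_vals)]      # dual maximizer (should be 2)
x_hat = a_hat / 2.0                       # recovered primal point (should be 1)

assert abs(dual_vals.max() - 1.0) < 1e-6  # strong duality: dual optimum = primal optimum
assert abs(a_hat * (1.0 - x_hat)) < 1e-3  # complementary slackness: a_hat * g(x_hat) = 0
```

Here the constraint is active at the optimum ($g(\hat{x}) = 0$), so the multiplier is allowed to be nonzero; at an inactive constraint, slackness would force $\hat{\alpha}_i = 0$ instead.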
7. Hard SVM without a Bias
$\min_w \frac{1}{2}\|w\|_2^2$ such that $1 - y^i \langle w, x^i \rangle \le 0$ for all $i \in [n]$
We have $n$ constraints, so we need $n$ dual variables, i.e. $\alpha \in \mathbb{R}^n$.
Lagrangian: $\mathcal{L}(w, \alpha) = \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^n \alpha_i \left(1 - y^i \langle w, x^i \rangle\right)$
Primal problem: $\arg\min_w \arg\max_{\alpha \ge 0} \mathcal{L}(w, \alpha)$
Dual problem: $\arg\max_{\alpha \ge 0} \arg\min_w \mathcal{L}(w, \alpha)$
The dual problem can be greatly simplified!
8. Simplifying the Dual Problem
Note that the inner problem in the dual problem is
$\arg\max_{\alpha \ge 0} \arg\min_w \left[ \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^n \alpha_i \left(1 - y^i \langle w, x^i \rangle\right) \right]$
Since the inner problem is an unconstrained problem with a convex and differentiable objective, we can apply first-order optimality to solve it completely. If we set the gradient to zero, we get $w = \sum_{i=1}^n \alpha_i y^i x^i$. (Thus, once you have the optimal values of $\alpha$, use $w = \sum_{i=1}^n \alpha_i y^i x^i$ to get the optimal value of $w$.)
Substituting this back into the dual problem, we get
$\arg\max_{\alpha \ge 0} \left[ \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y^i y^j \langle x^i, x^j \rangle \right]$
This is actually the problem several solvers (e.g. libsvm, sklearn) solve.
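Since the only remaining constraint is $\alpha \ge 0$, even plain projected gradient ascent can solve this dual. Below is a minimal sketch on a toy four-point dataset; the data, step size, and iteration count are invented for illustration, and real solvers such as libsvm use far more refined methods:

```python
import numpy as np

# Tiny linearly separable data (no bias term)
X = np.array([[2.0, 1.0], [1.5, -1.0], [-2.0, 0.5], [-1.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

Z = y[:, None] * X                   # rows z_i = y_i * x_i
G = Z @ Z.T                          # G_ij = y_i y_j <x_i, x_j>
alpha = np.zeros(len(y))
for _ in range(5000):                # projected gradient ascent on the dual
    grad = 1.0 - G @ alpha           # gradient of sum(alpha) - 0.5 alpha' G alpha
    alpha = np.maximum(alpha + 0.01 * grad, 0.0)  # ascent step, project onto alpha >= 0

w = (alpha * y) @ X                  # recover primal: w = sum_i alpha_i y_i x_i
assert np.all(y * (X @ w) >= 1 - 1e-3)   # every margin constraint is satisfied
```

Note how the recovered $w$ satisfies all the hard-margin constraints, confirming the substitution $w = \sum_i \alpha_i y^i x^i$.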
9. Support Vectors!!
Recall: we have an $\alpha_i$ for every data point. After solving the dual problem, the data points for which $\alpha_i > 0$ are called support vectors. Usually we have far fewer than $n$ support vectors.
Recall: complementary slackness tells us that $\alpha_i \left(1 - y^i \langle w, x^i \rangle\right) = 0$, i.e. only those data points can become SVs for which $y^i \langle w, x^i \rangle = 1$, i.e. those at the margin.
The reason these are called support vectors has to do with a mechanical interpretation of these objects; we need to look at CSVM to understand that.
10. Dual for CSVM
Similar calculations (see course notes for a derivation) show that if we have a bias term $b$ as well as slack variables, then the dual looks like
$\arg\max_{\alpha} \left[ \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y^i y^j \langle x^i, x^j \rangle \right]$ s.t. $\alpha_i \in [0, C]$, and $\sum_{i=1}^n \alpha_i y^i = 0$
Reason for the name SVM: imagine that each data point $i$ is applying a force $\alpha_i$ on the hyperplane in the direction $y^i$. Then the total force on the hyperplane is equal to zero, since $\sum_{i=1}^n \alpha_i y^i = 0$. Also, the condition $w = \sum_{i=1}^n \alpha_i y^i x^i$ can be interpreted to mean that the total torque on the hyperplane is zero as well. Thus, support vectors mechanically support the hyperplane (they don't let it shift or rotate around), hence their name.
11. CSVM Dual Problem
If we have a bias $b$, then the dual problem looks like
$\arg\max_{\alpha} \left[ \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y^i y^j \langle x^i, x^j \rangle \right]$ s.t. $\alpha_i \in [0, C]$, and $\sum_{i=1}^n \alpha_i y^i = 0$
The constraint $\sum_{i=1}^n \alpha_i y^i = 0$ links all the $\alpha_i$ together: we cannot update a single $\alpha_i$ without disturbing all the others. A more involved algorithm, Sequential Minimal Optimization (SMO) by John Platt, is needed to solve the version with a bias; it updates two $\alpha_i$ at a time!
However, if we omit the bias (hide it inside the model vector), the dual is
$\arg\max_{\alpha} \left[ \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y^i y^j \langle x^i, x^j \rangle \right]$ s.t. $\alpha_i \in [0, C]$
We will see a method to solve this simpler version of the problem.
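Why the equality constraint forces paired updates can be seen in a couple of lines. This is not Platt's full SMO (which also clips the pair to the box $[0, C]$), only the constraint-preserving step direction, and the numbers are made up:

```python
import numpy as np

# The constraint sum_i alpha_i y_i = 0 survives if we move a pair (i, j)
# in compensating directions: d(alpha_i) = y_i * y_j * t, d(alpha_j) = -t,
# since the change in sum(alpha * y) is y_i*(y_i*y_j*t) - y_j*t = 0.
y = np.array([1.0, -1.0, 1.0, -1.0])
alpha = np.array([0.3, 0.3, 0.2, 0.2])     # satisfies sum(alpha * y) == 0
i, j, t = 0, 2, 0.05
alpha[i] += y[i] * y[j] * t                # step one multiplier ...
alpha[j] -= t                              # ... and compensate with the other
assert abs(float(np.dot(alpha, y))) < 1e-12  # constraint still holds exactly
```

Any single-coordinate move would break the constraint, which is exactly why SMO works on two multipliers at a time.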
12. Solvers for the SVM Problem
We can solve the SVM (no bias) by either solving the primal version
$\arg\min_w \left[ \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^n \left[1 - y^i \langle w, x^i \rangle\right]_+ \right]$
or the dual version
$\arg\max_{\alpha} \left[ \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y^i y^j \langle x^i, x^j \rangle \right]$ s.t. $\alpha_i \in [0, C]$
We may use gradient, coordinate, etc. methods to solve either.
For the primal, we may use sub-gradient descent (sub-gradient, since the primal objective is convex but non-differentiable), coordinate descent, etc.
For the dual, we may use (projected) gradient ascent (projected, since we have a constraint, albeit a simple one, in the dual), coordinate ascent, etc.
We will actually see how to do coordinate maximization for the dual. Since the optimization variable in the dual is $\alpha \in \mathbb{R}^n$, we will need to take one coordinate at a time, i.e. choose a different $i \in [n]$ at each time step.
Does this mean I need to choose one data point at each time step? Yes, coordinate ascent in the dual looks a lot like stochastic gradient descent in the primal! Both work with a single data point at a time.
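To make the primal route concrete, here is a minimal sub-gradient descent sketch for the no-bias primal objective. The toy data, the value of $C$, the step-size schedule, and the iteration budget are all illustrative choices:

```python
import numpy as np

# Sub-gradient descent on the (no-bias) primal CSVM objective:
#   f(w) = 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i <w, x_i>)
X = np.array([[2.0, 1.0], [1.5, -1.0], [-2.0, 0.5], [-1.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C, w = 10.0, np.zeros(2)
for t in range(1, 2001):
    margins = y * (X @ w)
    viol = margins < 1.0                 # points inside the margin contribute
    g = w - C * (y[viol] @ X[viol])      # a valid sub-gradient of f at w
    w -= g / (t + 10.0)                  # decaying step size
assert np.all(y * (X @ w) > 0)           # the training data ends up separated
```

The `max(0, ...)` hinge term has a kink at the margin, so at those points any convex combination of the one-sided gradients is a valid sub-gradient; the code simply uses the "violating points only" choice.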
13. SDCM for the CSVM Problem
We want to solve
$\arg\max_{\alpha} \left[ \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y^i y^j \langle x^i, x^j \rangle \right]$ s.t. $\alpha_i \in [0, C]$ for all $i \in [n]$
Concentrating on just the terms that involve $\alpha_i$, we get
$\arg\max_{\alpha_i \in [0, C]} \left[ \alpha_i - \frac{1}{2} \alpha_i^2 \|x^i\|_2^2 - \alpha_i y^i \sum_{j \ne i} \alpha_j y^j \langle x^j, x^i \rangle \right]$
Renaming $p = \|x^i\|_2^2$ and $q = y^i \sum_{j \ne i} \alpha_j y^j \langle x^j, x^i \rangle$, we get
$\arg\min_{\alpha_i \in [0, C]} \left[ \frac{1}{2} p \alpha_i^2 - \alpha_i (1 - q) \right]$
The solution is very simple: find the unrestricted minimum, i.e. $\alpha_i = (1 - q)/p$. If $(1 - q)/p \in [0, C]$, that is the solution; elif $(1 - q)/p < 0$, the solution is $0$; else the solution is $C$.
Warning: in general, finding an unconstrained solution and doing a projection step does not give a true solution. Indeed! In this special case, our objective had a nice property called unimodality, which is why this trick works; it won't work in general.
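The clipped update above fits in a two-line helper. The probe values in the asserts are arbitrary, chosen to exercise each of the three cases:

```python
# Closed-form coordinate update used by SDCM: restricted to alpha_i, the dual
# is a 1-D concave quadratic, so take the unconstrained optimum (1 - q) / p
# and clip it to the box [0, C].
def sdcm_update(p, q, C):
    # p = ||x_i||^2,  q = y_i * sum_{j != i} alpha_j y_j <x_j, x_i>
    return min(max((1.0 - q) / p, 0.0), C)

assert sdcm_update(p=2.0, q=0.0, C=1.0) == 0.5   # interior optimum kept as-is
assert sdcm_update(p=2.0, q=3.0, C=1.0) == 0.0   # negative optimum clipped to 0
assert sdcm_update(p=0.5, q=0.0, C=1.0) == 1.0   # large optimum clipped to C
```

Clipping is exact here precisely because a 1-D concave quadratic is unimodal: moving from the unconstrained peak toward the box can only decrease the objective monotonically.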
14. Speeding up SDCM Computations
All that is left is to find how to compute $p, q$ for our chosen $i$.
$p = \|x^i\|_2^2$ can be easily precomputed for all data points.
However, $q = y^i \sum_{j \ne i} \alpha_j y^j \langle x^j, x^i \rangle$ needs $\mathcal{O}(nd)$ time to compute, but only if done naively. Recall that we always have $w = \sum_{j=1}^n \alpha_j y^j x^j$ for the CSVM (even if we have bias and slack variables).
Thus, $y^i \sum_{j \ne i} \alpha_j y^j \langle x^j, x^i \rangle = y^i \langle w, x^i \rangle - \alpha_i \|x^i\|_2^2$
If we somehow had access to $w$, then computing $y^i \langle w, x^i \rangle$ would take $\mathcal{O}(d)$ time, and computing $\alpha_i \|x^i\|_2^2$ would take $\mathcal{O}(1)$ time. All we need to do is create (and update) the $w$ vector in addition to the $\alpha$ vector, and we would be able to find $q$ in just $\mathcal{O}(d)$ time.
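Putting the pieces together, here is a minimal SDCM loop that maintains $w$ alongside $\alpha$ so each update costs $\mathcal{O}(d)$. The toy data, the value of $C$, and the iteration count are illustrative choices (with this $C$, no box constraint is active, so the solution matches the hard-margin one):

```python
import numpy as np

# SDCM with the w-vector trick: keep w = sum_j alpha_j y_j x_j in sync with
# alpha, so q is computed in O(d) per update instead of O(nd).
X = np.array([[2.0, 1.0], [1.5, -1.0], [-2.0, 0.5], [-1.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, C = len(y), 10.0
p = (X * X).sum(axis=1)                 # precompute p_i = ||x_i||^2 for all i
alpha, w = np.zeros(n), np.zeros(2)
rng = np.random.default_rng(0)
for _ in range(2000):
    i = rng.integers(n)                 # stochastic choice of coordinate
    q = y[i] * (w @ X[i]) - alpha[i] * p[i]        # O(d): w gives the sum, then drop j = i
    new_ai = min(max((1.0 - q) / p[i], 0.0), C)    # clipped coordinate optimum
    w += (new_ai - alpha[i]) * y[i] * X[i]         # keep w consistent with alpha
    alpha[i] = new_ai
assert np.all(y * (X @ w) >= 1 - 1e-3)  # margin constraints hold at convergence
```

The `w +=` line is the whole trick: a single coordinate change of $\alpha$ perturbs $w$ by a rank-one term, so the invariant $w = \sum_j \alpha_j y^j x^j$ is maintained in $\mathcal{O}(d)$ time.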
15. Which Method to Choose?
Gradient Methods:
Primal Gradient Descent: $\mathcal{O}(nd)$ time per update
Dual Gradient Ascent: $\mathcal{O}(nd)$ time per update
Stochastic Gradient Methods:
Stochastic Primal Gradient Descent: $\mathcal{O}(d)$ time per update
Stochastic Dual Gradient Ascent: $\mathcal{O}(n)$ time per update
Coordinate Methods (take $\mathcal{O}(nd)$ time per update if done naively):
Stochastic Primal Coordinate Descent: $\mathcal{O}(n)$ time per update
Stochastic Dual Coordinate Maximization: $\mathcal{O}(d)$ time per update
Case 1, $n \gg d$: use SDCM or SPGD ($\mathcal{O}(d)$ time per update)
Case 2, $d \gg n$: use SDGA or SPCD ($\mathcal{O}(n)$ time per update)
(Can you work out the details of how to implement stochastic primal coordinate descent in $\mathcal{O}(n)$ time per update?)
Be careful not to get confused with similar-sounding terms. Coordinate Ascent takes a small step along one of the coordinates to increase the objective a bit. Coordinate Maximization instead tries to completely maximize the objective along a coordinate. Also be careful that some books/papers may call a method Coordinate Ascent even when it is really doing Coordinate Maximization. The terminology is unfortunately a bit non-standard.