Kernel Machines in Machine Learning: Overview and Applications

Explore the concept of kernel machines in machine learning, including feature maps, kernelization of algorithms, and the use of kernels in ML models such as Ordinary Least Squares, Ridge Regression, and Support Vector Machines, through examples and illustrations.

  • Machine Learning
  • Kernel Machines
  • Feature Maps
  • Ridge Regression
  • Kernelization

Presentation Transcript


  1. CSCE-633 Machine Learning, 13. Kernel Machines. Instructor: Guni Sharon. Based on a lecture by Kilian Weinberger and Joaquin Vanschoren.

  2. Announcements
     • Midterm on Tuesday, April 12, during class
     • Quiz 4: ML debugging and kernelization, due Wednesday, April 4
     • Assignment (P3): SVM, linear regression and kernelization, due April 12

  3. Feature Maps
     Linear models: ŷ = f(x) = w^T x = w_1 x_1 + ... + w_p x_p. They cannot fit non-linear patterns, so we add non-linear combinations of features.
     Feature map (or basis expansion): φ: x → φ(x), giving f(x) = w^T φ(x).
     E.g., the polynomial feature map of degree d maps [1, x_1, ..., x_p] to all polynomial terms up to degree d: [1, x_1, ..., x_p, x_1^2, ..., x_p^2, ..., x_1^d, ..., x_p^d, x_1 x_2, ..., x_{p-1} x_p].
     Example with p = 1, d = 3: f(x) = w_1 x_1 becomes f(x) = w_1 x_1 + w_2 x_1^2 + w_3 x_1^3.
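As a quick illustration (not from the slides), here is a minimal NumPy sketch of the p = 1, d = 3 case: the model stays linear in the weights w, but the fitted function is cubic in the original input.

```python
import numpy as np

def polynomial_feature_map(x, d=3):
    """Basis expansion for a scalar input: phi(x) = [1, x, x^2, ..., x^d]."""
    return np.array([x ** k for k in range(d + 1)])

# Fit f(x) = w^T phi(x) by ordinary least squares in the expanded feature space.
rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 20)
y_train = 2 * x_train ** 3 - x_train + 0.1 * rng.standard_normal(20)

Phi = np.stack([polynomial_feature_map(x) for x in x_train])  # shape (20, 4)
w, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
print(np.round(w, 2))  # roughly [0, -1, 0, 2]
```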

  4. Realization
     Kernel functions can efficiently compute the transformed dot product k(x_i, x_j) = φ(x_i)^T φ(x_j) for any observation pair i, j, even when mapping to an infinite-dimensional feature space (e.g., RBF).
     Don't rely on φ(x):
     1. Predict based on k(x_i, x_j), not on φ(x).
     2. Train α (one per training sample), not w:
     f(x) = w^T φ(x) = Σ_i α_i φ(x_i)^T φ(x) = Σ_i α_i k(x_i, x).

  5. Using kernels in ML
     To use kernels in ML algorithms we need to show that we can train and predict using only inner products of the observations. Then we can simply swap the inner product for the kernel function.
     For example, kernelizing 1-nearest-neighbor (Euclidean) is straightforward.
     Training: none.
     Predicting: ŷ = y_i where i = argmin_j ||x - x_j||^2 = argmin_j (x^T x - 2 x^T x_j + x_j^T x_j) = argmin_j (k(x, x) - 2 k(x, x_j) + k(x_j, x_j)).
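A minimal sketch of this kernelized 1-NN rule, assuming a generic kernel function k; linear_kernel is just an illustrative stand-in (any valid kernel can be swapped in):

```python
import numpy as np

def linear_kernel(a, b):
    """Plain inner product; with this kernel the rule reduces to ordinary 1-NN."""
    return float(np.dot(a, b))

def kernel_1nn_predict(x, X_train, y_train, k=linear_kernel):
    """Return the label of the training point minimizing
    k(x, x) - 2 k(x, x_j) + k(x_j, x_j), i.e. the kernelized squared distance."""
    dists = [k(x, x) - 2.0 * k(x, xj) + k(xj, xj) for xj in X_train]
    return y_train[int(np.argmin(dists))]
```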

  6. Using kernels in ML
     Ordinary Least Squares: argmin_w 0.5 ||Xw - y||^2. Squared loss, no regularization. Closed form: w = (X^T X)^{-1} X^T y.
     Ridge Regression: argmin_w 0.5 ||Xw - y||^2 + λ ||w||_2^2. Squared loss, ℓ2 regularization. Closed form: w = (X^T X + λI)^{-1} X^T y.
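A minimal sketch of the two closed forms above (using np.linalg.solve rather than an explicit inverse is a standard numerical choice, not something the slide specifies):

```python
import numpy as np

def ols_weights(X, y):
    """w = (X^T X)^{-1} X^T y"""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge_weights(X, y, lam=1.0):
    """w = (X^T X + lambda * I)^{-1} X^T y"""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```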

  7. From w to inner products
     Claim: the weight vector is always some linear combination of the training feature vectors: w = Σ_i α_i x_i = X^T α. This was proven last week.

  8. Kernelizing Ordinary Least Squares
     min_w 0.5 ||Xw - y||^2, where X is the n × p matrix whose rows are the training inputs x_1^T, ..., x_n^T and y is the column vector of targets.
     Setting the gradient to zero gives the normal equations X^T X w = X^T y, with closed form w = (X^T X)^{-1} X^T y.
     Substituting w = X^T α gives X^T X X^T α = X^T y, which is satisfied by α = (X X^T)^{-1} y, since then X^T X X^T α = X^T (X X^T)(X X^T)^{-1} y = X^T y.

  9. You can't do that!
     Objection: you can't define α = (X X^T)^{-1} y and then say that w = X^T α; obviously w ≠ X^T α?
     Actually, this is correct: X^T α is a vector and X^T is a matrix. Let's break it down.
     X is the n × p matrix whose rows are x_1^T, ..., x_n^T, so X^T = [x_1, ..., x_n] has the training points as its p-dimensional columns, and X^T α = Σ_i α_i x_i is a p-dimensional vector, exactly the linear combination claimed on slide 7.

  10. OK, so what is X X^T?
     X X^T = [x_1^T; ...; x_n^T][x_1, ..., x_n] = K, the n × n matrix with entries K_{i,j} = x_i^T x_j (or φ(x_i)^T φ(x_j) = k(x_i, x_j) after the feature map).
     Sanity check, Ordinary Least Squares: w = (X^T X)^{-1} X^T y corresponds to w = X^T α with α = (X X^T)^{-1} y = K^{-1} y.
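A minimal sketch of building K and solving for α, assuming K is invertible; the optional ridge term lam is an implementation safeguard, not part of the slide's derivation:

```python
import numpy as np

def gram_matrix(X, k):
    """K[i, j] = k(x_i, x_j) for the rows x_i of X."""
    n = X.shape[0]
    return np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

def fit_alpha_ols(K, y, lam=0.0):
    """alpha = (K + lambda * I)^{-1} y; lam = 0 recovers the slide's alpha = K^{-1} y."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)
```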

  11. What about predictions?
     We trained a kernelized linear (not linear anymore) regression model: α = (X X^T)^{-1} y = K^{-1} y.
     Can we use the trained α for prediction? This is our end game!
     Originally we had f(x) = w^T x, but we didn't train w; we trained α.
     f(x) = w^T φ(x) = Σ_i α_i φ(x_i)^T φ(x) = Σ_i α_i k(x_i, x). This is a linear model with n dimensions (one per training sample).
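A minimal sketch of prediction with the trained α: the stored training inputs and the kernel replace the explicit weight vector w (this continues the hypothetical helpers sketched above):

```python
def predict_kernel_regression(x, X_train, alpha, k):
    """f(x) = sum_i alpha_i * k(x_i, x)"""
    return float(sum(a * k(xi, x) for a, xi in zip(alpha, X_train)))
```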

  12. Kernelized Support Vector Machines
     Primal: min_{w,b} w^T w + C Σ_i ξ_i   s.t.   y_i (w^T x_i + b) ≥ 1 - ξ_i,  ξ_i ≥ 0.
     C is a hyper-parameter (bias vs. variance trade-off).
     Goal: reformulate the optimization problem with inner products and no w.
     Step 1: define the dual optimization problem.

  13. Duality principle in optimization
     Optimization problems may be viewed from either of two perspectives: the primal problem or the dual problem.
     The solution to the dual problem provides a lower bound on the solution of the primal (minimization) problem.
     For convex optimization problems, the duality gap is zero under a constraint qualification condition.
     (Figure: the duality gap.)

  14. The dual problem
     Form the Lagrangian of the minimization problem using nonnegative Lagrange multipliers, and solve for the primal variable values that minimize the Lagrangian.
     This expresses the primal variables as functions of the Lagrange multipliers, which are called dual variables, so the new problem is to maximize the objective function with respect to the dual variables, subject to the derived constraints.

  15. Kernelized Support Vector Machines
     Primal: min_{w,b} w^T w + C Σ_i ξ_i   s.t.   y_i (w^T x_i + b) ≥ 1 - ξ_i,  ξ_i ≥ 0.
     We won't derive the dual problem (it requires substantial background). Bottom line: the objective function is defined in terms of the alphas, the labels, and inner products (no weights).
     Dual: min_{α_1, ..., α_n} (1/2) Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j) - Σ_i α_i   s.t.   0 ≤ α_i ≤ C,  Σ_i α_i y_i = 0, where y_i ∈ {-1, +1}.
     In this case, we can show that w = Σ_i α_i y_i φ(x_i).
     Problem: b is not part of the dual optimization.
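A minimal sketch of the dual objective as a function of α; it only evaluates the objective, while actually solving the dual would require a QP solver (or, in practice, an off-the-shelf SVM implementation with a precomputed kernel matrix):

```python
import numpy as np

def svm_dual_objective(alpha, y, K):
    """0.5 * sum_{i,j} alpha_i alpha_j y_i y_j K[i, j] - sum_i alpha_i,
    to be minimized subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0
    (the constraints are not enforced here)."""
    v = alpha * y
    return 0.5 * v @ K @ v - alpha.sum()
```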

  16. Kernelized Support Vector Machines
     From the primal formulation we know (from a previous lecture) that only the support vectors satisfy the constraint with equality: y_i (w^T φ(x_i) + b) = 1.
     In the dual, these same training inputs are identified by their dual values: support vectors satisfy α_i > 0 (all other training inputs have α_i = 0).
     At test time you only need to compute the sum in f(x) over the support vectors; all inputs x_i with α_i = 0 can be discarded after training.
     This fact allows us to compute b in closed form.

  17. Kernelized Support Vector Machines
     Primal: support vectors have y_i (w^T φ(x_i) + b) = 1. Dual: support vectors have α_i > 0. The primal solution and the dual solution are identical.
     As a result, for any support vector i (α_i > 0): y_i (Σ_j α_j y_j k(x_j, x_i) + b) = 1, so
     b = 1/y_i - Σ_j α_j y_j k(x_j, x_i) = y_i - Σ_j α_j y_j k(x_j, x_i), since y_i ∈ {-1, +1} implies 1/y_i = y_i.
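A minimal sketch of recovering b; averaging over all support vectors (rather than using a single one) is a common stabilization choice, not something the slide prescribes:

```python
import numpy as np

def svm_bias(alpha, y, K, tol=1e-8):
    """b = y_i - sum_j alpha_j y_j k(x_j, x_i), averaged over support vectors i."""
    sv = np.flatnonzero(alpha > tol)
    return float(np.mean([y[i] - (alpha * y) @ K[:, i] for i in sv]))
```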

  18. Calculating the support vectors' margin
     SVM margin (see 9SVM.pptx, slide 8): γ(w, b) = min_i |w^T x_i + b| / ||w||_2.
     Once we add the SVM constraint min_i y_i (w^T x_i + b) = 1, this becomes γ(w, b) = 1 / ||w||_2.
     Kernelized SVM margin: with w = Σ_i α_i y_i φ(x_i), we get ||w||_2^2 = Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j), so γ = 1 / sqrt(Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j)).
     You are required to compute the kernelized SVM margin as part of Assignment (P3): SVM, linear regression and kernelization.
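A minimal sketch that directly translates the margin formula above (the assignment still asks for your own derivation):

```python
import numpy as np

def kernelized_svm_margin(alpha, y, K):
    """gamma = 1 / ||w||_2 with ||w||_2^2 = sum_{i,j} alpha_i alpha_j y_i y_j K[i, j]."""
    v = alpha * y
    return 1.0 / np.sqrt(v @ K @ v)
```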

  19. Kernel SVM = weighted K-NN
     K-NN with y ∈ {-1, +1}: h(x) = sign(Σ_i y_i δ_i(x)), where δ_i(x) = 1 if x_i is among the K nearest neighbors of x, else 0.
     Kernel SVM: h(x) = sign(Σ_i α_i y_i k(x_i, x) + b).
     Instead of considering the K nearest neighbors equally, kernel SVM considers all neighbors, scaled by a similarity measure (the kernel) and a unique learned scale per data point (α_i).
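A minimal sketch contrasting the two prediction rules; the dist and k arguments are illustrative stand-ins for a distance function and a kernel function:

```python
import numpy as np

def knn_predict(x, X_train, y_train, K_neighbors, dist):
    """Each of the K nearest neighbors votes with equal weight."""
    idx = np.argsort([dist(x, xi) for xi in X_train])[:K_neighbors]
    return np.sign(np.sum(y_train[idx]))

def kernel_svm_predict(x, X_train, y_train, alpha, b, k):
    """Every training point votes, weighted by its learned alpha_i and its kernel similarity to x."""
    score = sum(a * yi * k(xi, x) for a, yi, xi in zip(alpha, y_train, X_train)) + b
    return np.sign(score)
```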

  20. RBF Kernel
     k(x, z) = exp(-||x - z||_2^2 / σ^2).
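A minimal sketch of this kernel; σ is the bandwidth controlling how quickly similarity decays with distance:

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||_2^2 / sigma^2)"""
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(z)) ** 2) / sigma ** 2))
```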

  21. SVM with soft constraints (C hyperparameter)

  22. Kernelized SVM
     Pros:
     • Classification can be very efficient, because it uses only a subset of the training data: the support vectors.
     • Works very well on smaller data sets, on non-linear data sets, and in high-dimensional spaces.
     • Very effective when the number of dimensions is greater than the number of samples.
     • Can achieve high accuracy, sometimes performing even better than neural networks.
     • Not very sensitive to overfitting.
     Cons:
     • Training time is high for large data sets.
     • When the data set is noisier (i.e., the target classes overlap), SVM doesn't perform well.

  23. What did we learn?
     Kernel functions allow us to use powerful linear models to predict non-linear patterns.
     This requires representing the linear model through inner products x_i^T x_j, removing the weight vector w (we train α instead).
     Once we have such a representation, we simply swap x_i^T x_j for k(x_i, x_j) = φ(x_i)^T φ(x_j).

  24. What next?
     Class: Decision trees.
     Assignments: Assignment (P3): SVM, linear regression and kernelization, due April 12.
     Quizzes: Quiz 4: ML debugging and kernelization, due Wednesday, April 4.
