
Kernel Learning with a Million Kernels - Methods and Formulations
An overview of kernel learning methods and formulations: the GMKL primal formulation, kernel parameterizations and regularizers, the limitations of Projected Gradient Descent, and the SPG solution for large-scale kernel learning.
Presentation Transcript
Kernel Learning with a Million Kernels
Ashesh Jain (IIT Delhi), SVN Vishwanathan (Purdue University), Manik Varma (Microsoft Research India)
To appear in SIGKDD 2012.
Kernel Learning
The objective in kernel learning is to jointly learn both the SVM and kernel parameters from the training data.

Kernel parameterizations:
- Linear: $K_d = \sum_k d_k K_k$
- Non-linear: e.g. a product of base kernels, $K_d = \prod_k K_k^{d_k}$

Regularizers:
- Sparse $\ell_1$
- Sparse and non-sparse $\ell_{p \ge 1}$
- Log determinant
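As a concrete illustration, here is a minimal Python sketch (the helper names rbf_kernel, combined_kernel and lp_regularizer are our own, not from the paper) that builds a linearly parameterized kernel $K_d = \sum_k d_k K_k$ from RBF base kernels and evaluates an $\ell_p$ regularizer on the kernel weights.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gram matrix of an RBF base kernel exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq)

def combined_kernel(base_kernels, d):
    """Linear parameterization: K_d = sum_k d_k K_k."""
    return sum(dk * Kk for dk, Kk in zip(d, base_kernels))

def lp_regularizer(d, p=1.0, lam=1.0):
    """r(d) = lam * ||d||_p^p; p = 1 gives the sparse l1 regularizer."""
    return lam * np.sum(np.abs(d) ** p)

# Toy usage: three RBF base kernels with different bandwidths.
X = np.random.randn(20, 5)
base_kernels = [rbf_kernel(X, g) for g in (0.1, 1.0, 10.0)]
d = np.array([0.5, 0.3, 0.2])
K_d = combined_kernel(base_kernels, d)
print(K_d.shape, lp_regularizer(d, p=1.33))
```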
The GMKL Primal Formulation

$P = \min_{w,b,d} \; \tfrac{1}{2} w^\top w + \sum_i L\!\left(y_i,\; w^\top \phi_d(x_i) + b\right) + r(d)$
s.t. $d \ge 0$

where the kernel is parameterized by $d$, i.e. $K_d(x_i, x_j) = \phi_d^\top(x_i)\, \phi_d(x_j)$, and $\nabla_d r$ and $\nabla_d K_d$ exist and are continuous.
The GMKL Primal Formulation
The GMKL primal formulation for binary classification:

$P = \min_{w,b,d,\xi} \; \tfrac{1}{2} w^\top w + C \sum_i \xi_i + r(d)$
s.t. $y_i\!\left(w^\top \phi_d(x_i) + b\right) \ge 1 - \xi_i$, $\xi \ge 0$ and $d \ge 0$

Intermediate dual:

$D = \min_d \max_\alpha \; \mathbf{1}^\top \alpha - \tfrac{1}{2}\, \alpha^\top Y K_d Y \alpha + r(d)$
s.t. $\mathbf{1}^\top Y \alpha = 0$, $0 \le \alpha \le C$ and $d \ge 0$

where $Y = \operatorname{diag}(y)$.
Wrapper Method for Optimizing GMKL

Outer loop: $\min_d W(d)$ s.t. $d \ge 0$

Inner loop: $W(d) = \max_\alpha \; \mathbf{1}^\top \alpha - \tfrac{1}{2}\, \alpha^\top Y K_d Y \alpha + r(d)$
s.t. $\mathbf{1}^\top Y \alpha = 0$, $0 \le \alpha \le C$

The gradient of the outer objective is $\nabla_d W = -\tfrac{1}{2}\, \alpha^{*\top} Y\, (\nabla_d K_d)\, Y \alpha^* + \nabla_d r(d)$ [Chapelle et al., MLJ 2002], where $\alpha^*$ can be obtained using a standard SVM solver. The outer problem is optimized using Projected Gradient Descent (PGD).
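A minimal sketch of this wrapper, assuming scikit-learn's SVC with a precomputed kernel as the inner solver and an $\ell_1$ regularizer on $d$; the fixed outer step size is a simplification (the PGD baseline uses an Armijo line search), and the helper names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def svm_dual(K, y, C):
    """Solve the inner SVM on a precomputed kernel; return beta_i = y_i * alpha_i
    on the support vectors, their indices, and the dual objective value."""
    svm = SVC(C=C, kernel="precomputed").fit(K, y)
    sv = svm.support_
    beta = svm.dual_coef_.ravel()            # beta_i = y_i * alpha_i
    alpha = np.abs(beta)                     # alpha_i >= 0
    obj = alpha.sum() - 0.5 * beta @ K[np.ix_(sv, sv)] @ beta
    return beta, sv, obj

def pgd_wrapper(base_kernels, y, C=1.0, lam=1.0, step=0.01, iters=50):
    """Outer loop: projected gradient descent on d >= 0 for W(d) with r(d) = lam*||d||_1
    (a sketch of the PGD baseline, not the paper's implementation)."""
    d = np.ones(len(base_kernels)) / len(base_kernels)
    for _ in range(iters):
        K_d = sum(dk * Kk for dk, Kk in zip(d, base_kernels))
        beta, sv, _ = svm_dual(K_d, y, C)
        # dW/dd_k = -1/2 * alpha*^T Y K_k Y alpha* + d r(d)/d d_k
        grad = np.array([-0.5 * beta @ Kk[np.ix_(sv, sv)] @ beta + lam
                         for Kk in base_kernels])
        d = np.maximum(d - step * grad, 0.0)  # project back onto d >= 0
    return d
```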
Projected Gradient Descent
[Figure: PGD iterates x0 → x1 → x2 → x3 in the (d1, d2) plane, each step projected back onto the feasible set.]
PGD Limitations
PGD requires many function and gradient evaluations because:
- no step size information is available,
- the Armijo rule might reject many step size proposals, and
- inaccurate gradient values can lead to many tiny steps.
Noisy function and gradient values can cause PGD to converge to points far from the optimum. Solving SVMs to high precision to obtain accurate function and gradient values is very expensive. Repeated projection onto the feasible set can also be expensive.
SPG Solution - Spectral Step Length

Fit a quadratic approximation to the objective around the current iterate:

$f(x) \approx f(x_k) + \nabla f(x_k)^\top (x - x_k) + \tfrac{1}{2 s_k} \lVert x - x_k \rVert^2$

Spectral (Barzilai-Borwein) step length:

$s_{k+1} = \dfrac{\left\langle x_k - x_{k-1},\; x_k - x_{k-1} \right\rangle}{\left\langle x_k - x_{k-1},\; \nabla f(x_k) - \nabla f(x_{k-1}) \right\rangle}$

[Figures: the original function vs. its quadratic approximation, and contours showing the step from x0 to x1.]
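The step-length computation itself is a few lines; the sketch below implements the standard Barzilai-Borwein formula above, with illustrative safeguards (s_min, s_max) for when the curvature estimate is unusable.

```python
import numpy as np

def spectral_step_length(x_k, x_prev, g_k, g_prev, s_min=1e-10, s_max=1e10):
    """Barzilai-Borwein step length: <s, s> / <s, y>,
    where s = x_k - x_{k-1} and y = grad f(x_k) - grad f(x_{k-1})."""
    s = x_k - x_prev
    y = g_k - g_prev
    sy = s @ y
    if sy <= 0:                      # curvature estimate unusable: fall back
        return s_max
    return float(np.clip(s @ s / sy, s_min, s_max))
```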
SPG Solution - Non-Monotone Rule

Handling function and gradient noise. Non-monotone rule: accept a step of size $s_k$ along direction $p_k$ if

$f(x_k + s_k p_k) \;\le\; \max_{0 \le j \le M} f(x_{k-j}) + c\, s_k\, \nabla f(x_k)^\top p_k$

i.e. the new objective value only has to improve on the maximum of the last $M$ objective values, not on every previous one.

[Figure: f(x) vs. time (s) for the noisy objective, with SPG-GMKL settling near the global minimum.]
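A sketch of the corresponding acceptance test, keeping a history of recent objective values; the constants c and M are illustrative defaults, not the paper's settings.

```python
import numpy as np

def nonmonotone_accept(f_new, f_history, step, grad, direction, c=1e-4, M=10):
    """Accept the trial point if f_new is below the max of the last M objective
    values plus a sufficient-decrease term (non-monotone Armijo rule)."""
    reference = max(f_history[-M:])
    return f_new <= reference + c * step * float(grad @ direction)

# Usage inside a line search: shrink `step` until the test passes.
# f_history = [f(x_0), f(x_1), ...]; direction is typically the projected -grad.
```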
SPG Solution - SVM Precision Tuning
[Figure: contour plots comparing SPG and PGD optimization trajectories, annotated with wall-clock times (hr) at comparable points along each path.]
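The idea behind precision tuning is to solve the inner SVM only as accurately as the outer optimization currently needs. A hedged sketch, assuming the inner solver exposes an accuracy tolerance; the schedule below is illustrative and not the paper's exact rule.

```python
def tune_svm_precision(tol, step_rejected, tol_min=1e-6, shrink=0.1):
    """Tighten the inner SVM tolerance only when noisy objective values
    start causing the non-monotone line search to reject steps."""
    if step_rejected and tol > tol_min:
        tol = max(tol * shrink, tol_min)
    return tol
```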
SPG Advantages
SPG requires fewer function and gradient evaluations due to:
- the second-order spectral step length estimation, and
- the non-monotone line search criterion.
SPG is more robust to noisy function and gradient values due to the non-monotone line search criterion. SPG never needs to solve an SVM with high precision due to our precision tuning strategy. SPG needs to perform only a single projection per step.
SPG Algorithm
1: k ← 0
2: Initialize d_0 randomly
3: repeat
4:   α* ← SolveSVM(K_{d_k}, ε)
5:   s ← SpectralStepLength(d_k, d_{k−1}, ∇W(d_k), ∇W(d_{k−1}))
6:   p_k ← Project(d_k − s ∇W(d_k)) − d_k
7:   s_k ← NonMonotoneLineSearch(p_k)
8:   ε ← TuneSVMPrecision(ε)
9:   d_{k+1} ← d_k + s_k p_k
10: until converged
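Putting the pieces together, the following is a hedged end-to-end sketch of the SPG outer loop. It reuses the illustrative helpers from the earlier snippets (svm_dual, spectral_step_length, nonmonotone_accept), omits the precision-tuning schedule for brevity, and is not the authors' released SPG-GMKL code.

```python
import numpy as np

def spg_gmkl(base_kernels, y, C=1.0, lam=1.0, iters=50, M=10):
    """Minimal SPG outer loop over kernel weights d >= 0, using the illustrative
    helpers sketched above (svm_dual, spectral_step_length, nonmonotone_accept)."""
    n_k = len(base_kernels)
    d = np.ones(n_k) / n_k
    d_prev, g_prev = None, None
    f_history = []

    def objective_and_grad(d):
        K_d = sum(dk * Kk for dk, Kk in zip(d, base_kernels))
        beta, sv, dual_obj = svm_dual(K_d, y, C)
        f = dual_obj + lam * d.sum()                    # W(d) = SVM dual + r(d)
        g = np.array([-0.5 * beta @ Kk[np.ix_(sv, sv)] @ beta + lam
                      for Kk in base_kernels])
        return f, g

    f, g = objective_and_grad(d)
    f_history.append(f)
    for _ in range(iters):
        s = 1.0 if d_prev is None else spectral_step_length(d, d_prev, g, g_prev)
        p = np.maximum(d - s * g, 0.0) - d              # projected descent direction
        step = 1.0
        while True:                                     # non-monotone backtracking
            f_new, g_new = objective_and_grad(d + step * p)
            if nonmonotone_accept(f_new, f_history, step, g, p, M=M) or step < 1e-8:
                break
            step *= 0.5
        d_prev, g_prev = d, g
        d, f, g = d + step * p, f_new, g_new
        f_history.append(f)
    return d
```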
Results on Large Scale Data Sets

Sum of kernels subject to $\ell_1$ regularization:

Data Set    | #Train  | #Kernels
Adult-9     | 32,561  | 50
Cod-RNA     | 59,535  | 50
KDDCup04    | 50,000  | 50
Sonar       | 208     | 1 Million
Covertype   | 581,012 | 5

PGD vs. SPG training times (hrs): 35.84, 4.55, 25.17, 40.10, 31.77, 66.48, 4.42, 19.10, 42.20, 105.62, 64.46.

Product of kernels subject to $\ell_1$ regularization:

Data Set    | #Train  | #Kernels
Letter      | 20,000  | 16
Poker       | 25,010  | 10
Adult-8     | 22,696  | 42
Web-7       | 24,692  | 43
RCV1        | 20,242  | 50
Cod-RNA     | 59,535  | 8

PGD vs. SPG training times (hrs): 18.66, 0.67, 5.57, 0.49, 1.73, 0.88, 18.17, 3.45, 18.69, 2.29, 0.66, 0.96, 3.42, 1.33, 15.93, 8.99.
Results on Small Scale Data Sets

Sum of kernels subject to $\ell_1$ regularization. Training times in seconds on Wpbc, Breast-Cancer, Australian, Ionosphere and Sonar:

SimpleMKL (s): 400 ± 356.4, 676 ± 356.4, 383 ± 33.5, 1094 ± 621.6, 1247 ± 680.0, 1468 ± 1252.7
Shogun (s): 15 ± 7.7, 12 ± 1.2
PGD (s): 38 ± 17.6, 57 ± 85.1, 29 ± 7.1, 1392 ± 824.2
SPG (s): 6 ± 4.2, 5 ± 0.6, 10 ± 0.8, 39 ± 6.8, 273 ± 64.0, 107 ± 18.8, 935 ± 65.0
SPG Scaling Properties - Scaling with the Number of Training Points
[Figure: log(Time) vs. log(#Training Points) on Adult, comparing SPG and PGD for p = 1.0, 1.33 and 1.66.]
SPG Scaling Properties - Scaling with the Number of Kernels
[Figure: log(Time) vs. log(#Kernels) on Sonar, comparing SPG and PGD for p = 1.33 and 1.66.]
Acknowledgements
Kamal Gupta (IITD), Subhashis Banerjee (IITD), and the Computer Services Center at IIT Delhi.