
Discriminative Classifiers and Maximum A Posteriori in Machine Learning
Explore discriminative classifiers like Logistic Regression and SVMs, and delve into Maximum A Posteriori estimation for Bayesian inference. Learn about the different classification strategies, linear algebra concepts, feature space representation, and the dot product in machine learning.
Presentation Transcript
Discriminative classifiers: Logistic Regression, SVMs (CISC 5800, Professor Daniel Leeds)
Maximum A Posteriori: a quick review
Likelihood: $P(D \mid \theta) = \theta^{|H|} (1-\theta)^{|T|}$
Prior: $P(\theta) = \dfrac{\theta^{\beta_H - 1} (1-\theta)^{\beta_T - 1}}{B(\beta_H, \beta_T)}$
Posterior $\propto$ likelihood $\times$ prior $= P(D \mid \theta)\, P(\theta)$
Choose $\beta_H$ and $\beta_T$ to give the prior belief of Heads bias: higher $\beta_H$ means Heads more likely; higher $\beta_T$ means Tails more likely.
MAP estimate: $\operatorname{argmax}_\theta \left[ \log P(D \mid \theta) + \log P(\theta) \right] = \dfrac{|H| + \beta_H - 1}{|H| + \beta_H - 1 + |T| + \beta_T - 1}$
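As a worked illustration of the formula above, here is a minimal Python sketch; the function name and the symmetric Beta(2, 2) prior are illustrative assumptions, not from the slides:

```python
# Minimal sketch: MAP estimate of a coin's Heads bias under a Beta prior.
def map_heads_bias(num_heads, num_tails, beta_h=2, beta_t=2):
    # theta_MAP = (|H| + beta_H - 1) / (|H| + beta_H - 1 + |T| + beta_T - 1)
    numerator = num_heads + beta_h - 1
    denominator = numerator + num_tails + beta_t - 1
    return numerator / denominator

# 7 heads and 3 tails with a symmetric Beta(2, 2) prior:
print(map_heads_bias(7, 3))  # 0.666..., pulled toward 0.5 relative to the MLE of 0.7
```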
Estimate each P(Xi|Y) through MAP
MAP incorporates a prior for each value and each class. Note: both X and Y can take on multiple values (binary and beyond).
$\hat{P}(X_i = x_k \mid Y = y_j) = \dfrac{\#D\{X_i = x_k \wedge Y = y_j\} + (\beta_k - 1)}{\#D\{Y = y_j\} + \sum_m (\beta_m - 1)}$
$\hat{P}(Y = y_j) = \dfrac{\#D\{Y = y_j\} + (\beta_j - 1)}{|D| + \sum_m (\beta_m - 1)}$
$(\beta_j - 1)$: prior frequency of class j; $\sum_m (\beta_m - 1)$: prior frequencies of all classes.
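A minimal sketch of these smoothed-count estimates in Python; the function and variable names are assumptions for illustration:

```python
# Minimal sketch: MAP (pseudo-count) estimates for Naive Bayes parameters.
def map_conditional(xs, ys, x_value, y_value, beta=2):
    # P(X = x_value | Y = y_value) with a symmetric prior of beta per value
    num_x_values = len(set(xs))
    joint = sum(1 for x, y in zip(xs, ys) if x == x_value and y == y_value)
    in_class = sum(1 for y in ys if y == y_value)
    return (joint + beta - 1) / (in_class + num_x_values * (beta - 1))

def map_class_prior(ys, y_value, beta=2):
    # P(Y = y_value) with the same symmetric prior over classes
    num_classes = len(set(ys))
    return (ys.count(y_value) + beta - 1) / (len(ys) + num_classes * (beta - 1))
```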
Classification strategy: generative vs. discriminative
Generative (e.g., Bayes / Naive Bayes): identify a probability distribution for each class, then determine the class with maximum probability for a data example.
Discriminative (e.g., Logistic Regression): identify the boundary between classes, then determine which side of the boundary a new data example falls on.
Linear algebra: data features
Vector: a list of numbers; each number describes a data feature.
Matrix: a list of lists of numbers; the features for each data point.
Example: # of word occurrences in three documents:

Word       Doc 1  Doc 2  Doc 3
Wolf          12      0      8
Lion          16      2     10
Monkey        14      1     11
Broker         0     14      1
Analyst        1     10      0
Dividend       1     12      1
Feature space
Each data feature defines a dimension in space. [Figure: the three documents plotted by their "wolf" and "lion" counts; doc1 and doc3 lie near each other, far from doc2.]
The dot product
The dot product compares two vectors: $v \cdot u = v^T u = \sum_{i=1}^{n} v_i u_i$, for $v = (v_1, \ldots, v_n)$ and $u = (u_1, \ldots, u_n)$; in two dimensions, $v^T u = v_1 u_1 + v_2 u_2$.
The dot product, continued
$v \cdot u = \sum_{i=1}^{n} v_i u_i$
Magnitude of a vector: $|u| = \sqrt{\sum_i u_i^2}$ (the square root of the sum of the squares of the elements).
If $u$ has unit magnitude, $v \cdot u$ is the projection of $v$ onto $u$:
$(1.5, 1) \cdot (0.71, 0.71) = 0.71 \times 1.5 + 0.71 \times 1 = 1.07 + 0.71 = 1.78$
$(0, 0.5) \cdot (0.71, 0.71) = 0.71 \times 0 + 0.71 \times 0.5 = 0 + 0.35 = 0.35$
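The same computations, as a minimal Python sketch:

```python
import math

# Minimal sketch: dot product, vector magnitude, and projection onto a unit vector.
def dot(v, u):
    return sum(vi * ui for vi, ui in zip(v, u))

def magnitude(u):
    return math.sqrt(sum(ui * ui for ui in u))

u = (0.71, 0.71)           # approximately unit magnitude
print(dot((1.5, 1.0), u))  # 1.775 -- projection of (1.5, 1) onto u
print(dot((0.0, 0.5), u))  # 0.355 -- projection of (0, 0.5) onto u
```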
Separating boundary, defined by w
A separating hyperplane splits class 0 and class 1. The plane is defined by a vector w perpendicular to the plane.
Is data point x in class 0 or class 1?
$w^T x > 0$: class 1
$w^T x < 0$: class 0
From real-number projection to 0/1 label
Binary classification: 0 is class A, 1 is class B. The sigmoid function stands in for $p(y \mid x)$:
$g(h) = \dfrac{1}{1 + e^{-h}}$   [Figure: g(h) rises from 0 to 1 as h runs from -10 to 10, with g(0) = 0.5.]
$P(y = 0 \mid x; w) = 1 - g(w^T x) = \dfrac{e^{-w^T x}}{1 + e^{-w^T x}}$
$P(y = 1 \mid x; w) = g(w^T x) = \dfrac{1}{1 + e^{-w^T x}}$
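A minimal Python sketch of the sigmoid and the resulting class probability:

```python
import math

# Minimal sketch: the sigmoid squashes the real-valued projection w.x into (0, 1).
def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def p_class_1(w, x):
    # P(y = 1 | x; w) = g(w.x)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

print(sigmoid(0.0))  # 0.5: a point exactly on the boundary is maximally uncertain
```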
Learning parameters for classification
Similar to MLE for the Bayes classifier. Likelihood for data points $y_1, \ldots, y_n$ (different from the Bayesian likelihood):
If $y_i$ is in class A ($y_i = 0$), multiply in $(1 - g(x_i; w))$; if $y_i$ is in class B ($y_i = 1$), multiply in $g(x_i; w)$:
$\operatorname{argmax}_w P(Y \mid X; w) = \prod_i g(x_i; w)^{y_i} \left(1 - g(x_i; w)\right)^{1 - y_i}$
$\ell(Y \mid X; w) = \sum_i \left[ (1 - y_i) \log\left(1 - g(x_i; w)\right) + y_i \log g(x_i; w) \right] = \sum_i \left[ y_i \log \dfrac{g(x_i; w)}{1 - g(x_i; w)} + \log\left(1 - g(x_i; w)\right) \right]$
Learning parameters for classification
Substituting $g(h) = \dfrac{1}{1 + e^{-h}}$, so $1 - g(h) = \dfrac{e^{-h}}{1 + e^{-h}}$ and $\dfrac{g(h)}{1 - g(h)} = e^{h}$:
$\ell(Y \mid X; w) = \sum_i \left[ y_i \log \dfrac{g(x_i; w)}{1 - g(x_i; w)} + \log\left(1 - g(x_i; w)\right) \right] = \sum_i \left[ y_i w^T x_i + \log \dfrac{e^{-w^T x_i}}{1 + e^{-w^T x_i}} \right]$
Learning parameters for classification
Take the derivative with respect to each weight $w_j$, using the chain rule with $h = w^T x_i$ and $\frac{\partial h}{\partial w_j} = x_{ij}$:
$\dfrac{\partial}{\partial w_j} \ell(Y \mid X; w) = \dfrac{\partial}{\partial w_j} \sum_i \left[ y_i w^T x_i + \log \dfrac{e^{-w^T x_i}}{1 + e^{-w^T x_i}} \right] = \sum_i x_{ij} \left( y_i - 1 + \dfrac{e^{-w^T x_i}}{1 + e^{-w^T x_i}} \right) = \sum_i x_{ij} \left( y_i - g(w^T x_i) \right)$
Iterative gradient ascent
Here $y_i$ is the true data label and $g(w^T x_i)$ is the computed data label.
Begin with initial guessed weights w. For each data point $(y_i, x_i)$, update each weight $w_j$:
$w_j \leftarrow w_j + \epsilon\, x_{ij} \left( y_i - g(w^T x_i) \right)$
Choose the step size $\epsilon$ so the change is not too big or too small.
Intuition: if $y_i = 1$, $g(w^T x_i) = 0$, and $x_{ij} > 0$, make $w_j$ larger and push $w^T x_i$ to be larger; if $y_i = 0$, $g(w^T x_i) = 1$, and $x_{ij} > 0$, make $w_j$ smaller and push $w^T x_i$ to be smaller.
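Putting the update rule into a minimal training loop; the data layout, step size, and epoch count are illustrative assumptions:

```python
import math

# Minimal sketch: logistic regression trained by per-point gradient ascent,
# w_j += eps * x_ij * (y_i - g(w.x_i)).
def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def train_logistic(xs, ys, eps=0.1, epochs=100):
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            pred = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            for j in range(len(w)):
                w[j] += eps * x[j] * (y - pred)  # push w.x toward the true label
    return w

# Toy 1-D data, with a constant 1 appended to each point as a bias feature:
xs = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 1.0)]
ys = [0, 0, 1, 1]
print(train_logistic(xs, ys))
```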
Separating boundary, defined by w (review)
A separating hyperplane splits class 0 and class 1; the plane is defined by a vector w perpendicular to the plane.
Is data point x in class 0 or class 1?
$w^T x > 0$: class 1
$w^T x < 0$: class 0
But, where do we place the boundary?
Logistic regression maximizes $\ell(Y \mid X; w) = \sum_i \left[ y_i w^T x_i + \log \dfrac{e^{-w^T x_i}}{1 + e^{-w^T x_i}} \right]$
Each data point $x_i$ is considered when placing the boundary, so outlier data pulls the boundary towards it.
Max margin classifiers
Focus on the boundary points. Find the largest margin between boundary points on both sides. This works well in practice. We call the boundary points support vectors.
Classifying with additional dimensions
[Figure: one dataset with a linear separator; another with no linear separator.]
Mapping function(s)
Map from the low-dimensional space $x = (x_1, x_2)$ to a higher-dimensional space $\phi(x) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2)$.
N data points are guaranteed to be separable in a space of N-1 dimensions or more.
$w = \sum_i \alpha_i y_i \phi(x_i)$; classifying $x_j$: $\operatorname{sign}\left( \sum_i \alpha_i y_i \phi(x_i)^T \phi(x_j) + b \right)$
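A minimal sketch of this quadratic map; the XOR-style example (labels given by the sign of $x_1 x_2$) is an assumed illustration of data that only becomes linearly separable after mapping:

```python
# Minimal sketch: quadratic feature map phi(x) = (x1, x2, x1^2, x2^2, x1*x2).
def phi(x):
    x1, x2 = x
    return (x1, x2, x1 * x1, x2 * x2, x1 * x2)

# XOR-style points are not linearly separable in 2-D, but the fifth
# coordinate (x1*x2) separates them perfectly after mapping:
for point in [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]:
    print(point, '->', phi(point))
```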
Discriminative classifiers
Find a separator to minimize classification error: Logistic Regression, Support Vector Machines.
Logistic Regression review
Logistic function: $g(h) = \dfrac{1}{1 + e^{-h}}$
$P(y = 0 \mid x; w) = 1 - g(w^T x) = \dfrac{e^{-w^T x}}{1 + e^{-w^T x}}$, $P(y = 1 \mid x; w) = g(w^T x) = \dfrac{1}{1 + e^{-w^T x}}$
Maximize the likelihood $P(Y \mid X; w) = \prod_i g(x_i; w)^{y_i} \left(1 - g(x_i; w)\right)^{1 - y_i}$ over the data $D = \{(x_i, y_i)\}$; the likelihood here is $P(Y \mid X)$.
Update w: $w_j \leftarrow w_j + \epsilon \sum_i x_{ij} \left( y_i - g(w^T x_i) \right)$
MAP for discriminative classifier
MLE: $P(y = 1 \mid x; w) = g(w^T x)$, $P(y = 0 \mid x; w) = 1 - g(w^T x)$
MAP: $P(y = 1, w \mid x) \propto P(y = 1 \mid x; w)\, P(w)$ (different from the Bayesian posterior)
Priors P(w): L2 regularization minimizes all weights; L1 regularization minimizes the number of non-zero weights.
MAP: L2 regularization
Gaussian prior on each weight: $P(w) \propto \prod_j e^{-\frac{w_j^2}{2\sigma^2}}$
$P(y = 1, w \mid x) \propto P(y = 1 \mid x; w)\, P(w)$:
$\ell(Y, w \mid X) = \sum_i \left[ y_i w^T x_i + \log \dfrac{e^{-w^T x_i}}{1 + e^{-w^T x_i}} \right] - \sum_j \dfrac{w_j^2}{2\sigma^2}$
$\dfrac{\partial}{\partial w_j} \ell(Y, w \mid X) = \sum_i x_{ij} \left( y_i - g(w^T x_i) \right) - \dfrac{w_j}{\sigma^2}$
This prevents $w^T x$ from getting too large.
MAP: L1 regularization
Laplace prior on each weight: $P(w) \propto \prod_j e^{-\lambda |w_j|}$
$\ell(Y, w \mid X) = \sum_i \left[ y_i w^T x_i + \log \dfrac{e^{-w^T x_i}}{1 + e^{-w^T x_i}} \right] - \lambda \sum_j |w_j|$
$\dfrac{\partial}{\partial w_j} \ell(Y, w \mid X) = \sum_i x_{ij} \left( y_i - g(w^T x_i) \right) - \lambda\, \operatorname{sign}(w_j)$
This forces most dimensions to 0.
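A minimal sketch of a single gradient step with either penalty added; the function and parameter names are assumptions:

```python
# Minimal sketch: one gradient-ascent step with an optional L2 or L1 penalty.
def regularized_step(w, x, y, pred, eps=0.1, sigma2=None, lam=None):
    # pred is g(w.x) for this data point; w is a mutable list of weights
    for j in range(len(w)):
        grad = x[j] * (y - pred)
        if sigma2 is not None:
            grad -= w[j] / sigma2  # L2 / Gaussian prior: shrink every weight
        if lam is not None:
            sign = 1 if w[j] > 0 else -1 if w[j] < 0 else 0
            grad -= lam * sign     # L1 / Laplace prior: push weights to exactly 0
        w[j] += eps * grad
    return w
```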
Parameters for learning
$w_j \leftarrow w_j + \epsilon \left[ \sum_i x_{ij} \left( y_i - g(w^T x_i) \right) - \dfrac{w_j}{\sigma^2} \right]$
Regularization: selecting $\sigma$ influences the strength of your bias.
Gradient ascent: selecting $\epsilon$ influences the effect of individual data points in learning.
Bayesian: selecting $\beta_k$ indicates the strength of the class prior beliefs.
$\epsilon$, $\sigma$, and $\beta_k$ are parameters controlling our learning.
Multi-class logistic regression: class probability
Recall binary class: $P(y = 0 \mid x; w) = 1 - g(w^T x) = \dfrac{e^{-w^T x}}{1 + e^{-w^T x}}$, $P(y = 1 \mid x; w) = g(w^T x) = \dfrac{1}{1 + e^{-w^T x}}$
Multi-class, with m classes and weight vectors $w_1, \ldots, w_{m-1}$:
$P(y = k \mid x; w) = \dfrac{e^{w_k^T x}}{1 + \sum_{j=1}^{m-1} e^{w_j^T x}}$ for $k = 1, \ldots, m-1$
$P(y = m \mid x; w) = \dfrac{1}{1 + \sum_{j=1}^{m-1} e^{w_j^T x}}$
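A minimal sketch of these class probabilities, with the last class as the reference; the variable names are assumptions:

```python
import math

# Minimal sketch: multi-class probabilities from m-1 weight vectors; the
# m-th class is the reference, contributing the 1 in the denominator.
def class_probs(ws, x):
    scores = [math.exp(sum(wj * xj for wj, xj in zip(w, x))) for w in ws]
    denom = 1.0 + sum(scores)
    return [s / denom for s in scores] + [1.0 / denom]

probs = class_probs([(1.0, 0.0), (0.0, 1.0)], (0.5, 0.5))
print(probs, sum(probs))  # three probabilities summing to 1.0
```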
Multi-class logistic regression: likelihood
Recall binary class: $P(Y \mid X; w) = \prod_i g(x_i; w)^{y_i} \left(1 - g(x_i; w)\right)^{1 - y_i}$, where $g(w^T x_i) = P(y = 1 \mid x_i; w)$ estimates the probability of class 1.
Multi-class: $P(Y \mid X; w) = \prod_i \prod_k P(y = k \mid x_i; w)^{\delta(y_i = k)}$
$\ell(Y \mid X; w) = \sum_i \sum_k \delta(y_i = k) \log P(y = k \mid x_i; w)$
Multi-class logistic regression: update rule
Recall binary class: $w_j \leftarrow w_j + \epsilon \sum_i x_{ij} \left( y_i - g(w^T x_i) \right)$, where $g(w^T x_i) = P(y = 1 \mid x_i; w)$ estimates the probability of class 1.
Multi-class: $w_{k,j} \leftarrow w_{k,j} + \epsilon \sum_i x_{ij} \left( \delta(y_i = k) - P(y = k \mid x_i; w) \right)$
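A minimal sketch of one multi-class update pass, reusing the class_probs sketch above; all names are illustrative assumptions:

```python
# Minimal sketch: multi-class update
# w_kj += eps * x_ij * (delta(y_i = k) - P(y = k | x_i; w)).
def multiclass_step(ws, x, y, probs, eps=0.1):
    # ws: list of m-1 weight vectors (mutable lists); y: true class index
    # (0-based); probs: output of class_probs(ws, x) for this point
    for k, w in enumerate(ws):
        delta = 1.0 if y == k else 0.0
        for j in range(len(w)):
            w[j] += eps * x[j] * (delta - probs[k])
    return ws
```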
Logistic regression: how many parameters?
N features, M classes. Learn $w_k$ for each of M-1 classes: N x (M-1) parameters.
Actually, the boundary $w^T x = 0$ passes through the origin; it would be better to allow an offset from the origin, $w^T x + b$, giving N+1 parameters per class.
Total: (N+1) x (M-1) parameters.
Max margin classifiers (review)
Focus on the boundary points. Find the largest margin between boundary points on both sides. This works well in practice. We call the boundary points support vectors.
Maximum margin definitions
Classify as +1 if $w^T x + b \geq 1$; classify as -1 if $w^T x + b \leq -1$; undefined if $-1 < w^T x + b < 1$.
The margin region is bounded by the lines $w^T x + b = 1$, $w^T x + b = 0$, and $w^T x + b = -1$.
M is the margin width. $x^+$ is a +1 point closest to the boundary and $x^-$ is a -1 point closest to the boundary, with $x^+ = x^- + \lambda w$ and $\|x^+ - x^-\| = M$.
λ derivation
From $w^T x^- + b = -1$, $w^T x^+ + b = +1$, and $x^+ = x^- + \lambda w$:
$w^T (x^- + \lambda w) + b = +1$
$w^T x^- + \lambda w^T w + b = +1$
$-1 + \lambda w^T w = +1$
$\lambda = \dfrac{2}{w^T w}$
M derivation
$M = \|x^+ - x^-\| = \|\lambda w\| = \lambda \|w\| = \lambda \sqrt{w^T w} = \dfrac{2}{w^T w} \sqrt{w^T w} = \dfrac{2}{\sqrt{w^T w}}$
So maximizing the margin M is equivalent to minimizing $w^T w$.
Support vector machine (SVM) optimization
$\operatorname{argmax}_{w,b}\, M = \dfrac{2}{\sqrt{w^T w}}$ is equivalent to $\operatorname{argmin}_{w,b}\, w^T w$
subject to $w^T x + b \geq 1$ for x in class +1 and $w^T x + b \leq -1$ for x in class -1.
Optimization with constraints is handled with Lagrange multipliers $\alpha_i$ (which satisfy $\sum_i \alpha_i y_i = 0$), using gradient descent and matrix calculus.
Alternate SVM formulation
$w = \sum_i \alpha_i y_i x_i$, where the $y_i$ are the data labels (+1 or -1).
Support vectors are the $x_i$ with $\alpha_i > 0$.
To classify sample $x_j$, compute: $w^T x_j + b = \sum_i \alpha_i y_i x_i^T x_j + b$
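A minimal sketch of classification in this dual form; the function and variable names are assumptions:

```python
# Minimal sketch: dual-form SVM classification,
# sign( sum_i alpha_i * y_i * (x_i . x_j) + b ).
def svm_classify(alphas, ys, support_xs, b, x_new):
    score = b + sum(a * y * sum(si * xi for si, xi in zip(sx, x_new))
                    for a, y, sx in zip(alphas, ys, support_xs))
    return 1 if score > 0 else -1
```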
Support vector machine (SVM) optimization with slack variables
What if the data are not linearly separable? Add a slack variable $\varepsilon_i$ for each point:
$\operatorname{argmin}_{w,b}\; w^T w + C \sum_i \varepsilon_i$
subject to $w^T x_i + b \geq 1 - \varepsilon_i$ for $x_i$ in class +1, $w^T x_i + b \leq -1 + \varepsilon_i$ for $x_i$ in class -1, and $\varepsilon_i \geq 0$.
Each error $\varepsilon_i$ is penalized based on its distance from the separator.
Classifying with additional dimensions
Note: more dimensions make it easier to separate N training points: training error is minimized, but we may risk over-fitting.
[Figure: one dataset with a linear separator; another with no linear separator.]
Quadratic mapping function
$(x_1, x_2, x_3, x_4) \rightarrow (x_1, x_2, x_3, x_4, x_1^2, x_2^2, x_3^2, x_4^2, x_1 x_2, x_1 x_3, x_1 x_4, \ldots, x_2 x_4, x_3 x_4)$
N features $\rightarrow N + N + \dfrac{N(N-1)}{2}$ features: classifying via $\sum_i \alpha_i y_i \phi(x_i)^T \phi(x_j) + b$ with w learned directly in the quadratic space means on the order of $N^2$ values to learn.
Or, observe: $(u^T v + 1)^2 = u_1^2 v_1^2 + \cdots + u_N^2 v_N^2 + 2 u_1 u_2 v_1 v_2 + \cdots + 2 u_{N-1} u_N v_{N-1} v_N + 2 u_1 v_1 + \cdots + 2 u_N v_N + 1$, so the dot product in the quadratic feature space can be computed directly in the original space.
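A minimal sketch verifying that identity numerically, with the quadratic map scaled so its dot product matches $(u^T v + 1)^2$ exactly (the $\sqrt{2}$ scaling is an assumption that makes the identity exact):

```python
import itertools
import math

# Minimal sketch: (u.v + 1)^2 equals the dot product of scaled quadratic
# feature maps, so the kernel never needs to build phi explicitly.
def phi_quadratic(x):
    feats = [xi * xi for xi in x]
    feats += [math.sqrt(2) * x[i] * x[j]
              for i, j in itertools.combinations(range(len(x)), 2)]
    feats += [math.sqrt(2) * xi for xi in x]
    feats.append(1.0)
    return feats

u, v = (1.0, 2.0, 3.0), (0.5, -1.0, 2.0)
lhs = (sum(a * b for a, b in zip(u, v)) + 1) ** 2
rhs = sum(a * b for a, b in zip(phi_quadratic(u), phi_quadratic(v)))
print(lhs, rhs)  # both 30.25
```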
Kernels
Classifying $x_j$: $\operatorname{sign}\left( \sum_i \alpha_i y_i K(x_i, x_j) + b \right)$
Kernel trick: estimate a high-dimensional dot product with a function $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
E.g., the RBF kernel: $K(x_i, x_j) = \exp\left( -\dfrac{\|x_i - x_j\|^2}{2\sigma^2} \right)$
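A minimal sketch of the RBF kernel plugged into the dual-form classifier from above; the names are assumptions:

```python
import math

# Minimal sketch: RBF kernel as a stand-in for a dot product in a very
# high-dimensional feature space.
def rbf_kernel(xi, xj, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def kernel_classify(alphas, ys, support_xs, b, x_new, kernel=rbf_kernel):
    score = b + sum(a * y * kernel(sx, x_new)
                    for a, y, sx in zip(alphas, ys, support_xs))
    return 1 if score > 0 else -1
```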
The power of SVM (+kernels)
The boundary is defined by a few support vectors. This is caused by maximizing the margin, causes less overfitting, and is similar in effect to regularization.
Kernels keep the number of learned parameters in check.
Multi-class SVMs
Learn a boundary for class k vs. all other classes. Find the boundary that gives the highest margin for data point $x_i$.
Benefits of generative methods
Modeling $P(x \mid y)$ for each class can generate a non-linear boundary, e.g., Gaussians with multiple variances.