
Empirical Risk Minimization in Machine Learning
Explore the concept of empirical risk minimization in machine learning, which involves approximating expected risk using training data. Learn about Bayesian decision theory, empirical risk calculation, different empirical risks in regression, convergence issues, and bounds on minimizing empirical risk.
Bounding the error of misclassification (Usman Roshan)
Bayesian decision theory Find a classifier f(x) that minimizes the expected risk (the expected value of the loss): R[f] = \sum_{x \in X} \sum_{j=1}^{2} c(x, y_j, f(x))\, P(y_j, x), where c is the loss function and the inner sum runs over the two class labels. We want to find the f that minimizes this, but we don't have all data points, only training data, and we don't know P(y, x).
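To make the expected risk concrete, here is a minimal sketch (not from the slides) that evaluates R[f] for a toy discrete joint distribution P(y, x) under 0/1 loss; the distribution and classifier below are invented purely for illustration.

```python
import numpy as np

# Toy joint distribution P(y, x) over 3 feature values and 2 labels (made up for illustration).
# Rows: x in {0, 1, 2}; columns: y in {-1, +1}. Entries sum to 1.
P = np.array([[0.30, 0.10],
              [0.05, 0.25],
              [0.15, 0.15]])

labels = np.array([-1, +1])

def f(x):
    """A hypothetical classifier: predict +1 for x >= 1, else -1."""
    return 1 if x >= 1 else -1

def zero_one_loss(y_true, y_pred):
    return float(y_true != y_pred)

# Expected risk: R[f] = sum_x sum_j c(x, y_j, f(x)) P(y_j, x)
R = sum(zero_one_loss(labels[j], f(x)) * P[x, j]
        for x in range(P.shape[0]) for j in range(2))
print(f"Expected risk R[f] = {R:.3f}")  # 0.300 for this toy distribution
```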
Empirical risk Since we only have training data we can't calculate the expected risk (we don't even know P(x, y)). Solution: we approximate P(x, y) with the empirical distribution p_{emp}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x)\, \delta_{y_i}(y), where the delta function \delta_x(y) = 1 if x = y and 0 otherwise.
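As a quick sanity check (my own illustration, not from the slides), the empirical distribution simply puts mass 1/n on each training point, so expectations under p_emp reduce to sample averages:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)           # hypothetical training inputs
y = rng.choice([-1, 1], size=8)  # hypothetical training labels

# p_emp puts weight 1/n on each observed (x_i, y_i) pair.
n = len(x)
weights = np.full(n, 1.0 / n)

# The expectation of any function g(x, y) under p_emp is just its sample average.
g = lambda xi, yi: yi * xi
exp_under_pemp = np.sum(weights * g(x, y))
print(np.isclose(exp_under_pemp, np.mean(g(x, y))))  # True
```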
Empirical risk We can now define the empirical risk as R_{emp}[f] = \sum_{x \in X} \sum_{j=1}^{2} c(x, y_j, f(x))\, p_{emp}(y_j, x) = \frac{1}{n} \sum_{i=1}^{n} c(x_i, y_i, f(x_i)). Once the loss function is defined and the training data is given, we can find the f that minimizes this.
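A minimal sketch of the empirical risk under 0/1 loss; the data and threshold classifier here are invented for illustration:

```python
import numpy as np

def empirical_risk(f, X, y, loss):
    """R_emp[f] = (1/n) * sum_i loss(y_i, f(x_i))."""
    preds = np.array([f(xi) for xi in X])
    return np.mean([loss(yi, pi) for yi, pi in zip(y, preds)])

# Hypothetical 1-D training data and a simple threshold classifier.
X = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([-1, -1, 1, 1, 1, 1])
f = lambda xi: 1 if xi > 0 else -1
zero_one = lambda yi, pi: float(yi != pi)

print(empirical_risk(f, X, y, zero_one))  # 1/6, since one point is misclassified
```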
Different empirical risks
Linear regression: \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (w^T x_i + w_0) \right)^2
Logistic regression: \frac{1}{n} \sum_{i=1}^{n} \log\left( 1 + e^{-y_i (w^T x_i + w_0)} \right)
SVM: \frac{1}{n} \sum_{i=1}^{n} \max\left( 0,\, 1 - y_i (w^T x_i + w_0) \right)
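The three empirical risks above can be written down directly; here is a short sketch (with made-up data and weights) computing each one for a linear model:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))           # hypothetical feature matrix
y = rng.choice([-1.0, 1.0], size=20)   # hypothetical +/-1 labels
w, w0 = rng.normal(size=3), 0.0        # hypothetical weight vector and bias

margin = y * (X @ w + w0)              # y_i * (w^T x_i + w_0)

squared_risk  = np.mean((y - (X @ w + w0)) ** 2)        # linear regression
logistic_risk = np.mean(np.log1p(np.exp(-margin)))      # logistic regression
hinge_risk    = np.mean(np.maximum(0.0, 1.0 - margin))  # SVM (hinge loss)

print(squared_risk, logistic_risk, hinge_risk)
```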
Does it make sense to optimize empirical risk? Does R_{emp}(f) approach R(f) as we increase the sample size? Remember that f is our classifier: we are asking whether the empirical error of f approaches the expected error of f as we draw more samples. Yes, by the law of large numbers: the mean of a random sample approaches the true mean as the sample size increases. But how fast does it converge?
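A small simulation (my own, not part of the slides) illustrating the law-of-large-numbers claim for a fixed classifier: the empirical 0/1 risk converges to the true risk as n grows.

```python
import numpy as np

rng = np.random.default_rng(2)

# Fixed classifier: sign(x). Labels follow sign(x) but are flipped with
# probability 0.2, so the true risk R(f) of this classifier is exactly 0.2.
def sample(n):
    x = rng.normal(size=n)
    flip = rng.random(n) < 0.2
    y = np.where(flip, -np.sign(x), np.sign(x))
    return x, y

for n in [10, 100, 1000, 10000, 100000]:
    x, y = sample(n)
    r_emp = np.mean(np.sign(x) != y)
    print(f"n={n:>6}  R_emp(f)={r_emp:.3f}  (true risk 0.200)")
```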
Chernoff bounds Suppose we have i.i.d. trials X_i where each X_i = |f(x_i) - y_i|. Let \mu = E(X) be the true mean of X. Then P\left( \left| \frac{1}{n} \sum_i X_i - \mu \right| > \epsilon \right) < 2 e^{-2 n \epsilon^2}.
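To see what the bound buys us numerically, here is a small sketch (my own) evaluating 2·exp(-2nε²) for a few sample sizes, along with the n needed to push the deviation probability below a target δ:

```python
import numpy as np

eps, delta = 0.05, 0.05

# Chernoff/Hoeffding-style tail bound for the deviation of the sample mean.
for n in [100, 500, 1000, 5000]:
    bound = 2 * np.exp(-2 * n * eps**2)
    print(f"n={n:>5}  P(|mean - mu| > {eps}) < {bound:.4f}")

# Solving 2*exp(-2*n*eps^2) <= delta for n:
n_needed = int(np.ceil(np.log(2 / delta) / (2 * eps**2)))
print(f"n >= {n_needed} samples suffice for epsilon={eps}, delta={delta}")
```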
Convergence issues Applying the Chernoff bound to the empirical and expected risk gives us P\left( \left| R_{emp}(f) - R(f) \right| > \epsilon \right) < 2 e^{-2 n \epsilon^2}. Remember that this requires fixing f before looking at the data, so it does not cover the classifier we actually select by minimizing empirical risk. This is not too helpful on its own: we want a bound that also holds for the best function estimated from the data.
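A small simulation (my own illustration) of why the fixed-f bound is not enough: if we pick the best of many classifiers after seeing the data, its empirical risk can sit far from its true risk even though every individual classifier obeys the bound.

```python
import numpy as np

rng = np.random.default_rng(3)
n, M = 100, 10000   # n training points, M candidate classifiers

# Labels are pure coin flips, so EVERY classifier has true risk exactly 0.5.
y = rng.choice([-1, 1], size=n)

# M "classifiers" that ignore x and output fixed random predictions.
preds = rng.choice([-1, 1], size=(M, n))

emp_risks = np.mean(preds != y, axis=1)
best = emp_risks.min()
dev = 0.5 - best
print("true risk of every classifier: 0.5")
print(f"empirical risk of the selected (best) classifier: {best:.3f}")
print(f"fixed-f bound on P(deviation > {dev:.3f}): {2 * np.exp(-2 * n * dev**2):.2e}")
```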
Bound on empirical risk minimization In other words, we want to bound P\left( \sup_{f \in F} \left( R_{emp}(f) - R(f) \right) > \epsilon \right). With some work we can show that P\left( \sup_{f \in F} \left( R_{emp}(f) - R(f) \right) > \epsilon \right) < 2 N(F, 2n)\, e^{-2 n \epsilon^2}, where N(F, 2n) measures the size of the function space F: it is the maximum number of distinct labelings F can produce on 2n datapoints. Since there are at most 2^{2n} binary labelings of 2n points, the maximum value of N(F, 2n) is 2^{2n}.
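A quick numeric sketch (my own, using the constants as reconstructed above) of how the capacity term N(F, 2n) controls whether the uniform bound is meaningful: for a small function class the bound shrinks as n grows, while in the worst case N(F, 2n) = 2^{2n} it stays vacuous.

```python
import numpy as np

eps = 0.1

for n in [100, 500, 1000]:
    chernoff_exp = -2 * n * eps**2
    # A hypothetical small class realizing only 1000 distinct labelings.
    small_class = 2 * 1000 * np.exp(chernoff_exp)
    # Worst case: all 2^(2n) labelings are realizable; work in log space to avoid overflow.
    worst_case_log = np.log(2) + 2 * n * np.log(2) + chernoff_exp
    print(f"n={n:>4}  small-class bound={small_class:.3e}  "
          f"log(worst-case bound)={worst_case_log:.1f}  (>0 means vacuous)")
```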
Structural risk/regularized risk minimization We can rewrite the previous bound as follows: with probability at least 1 - \delta, R(f) < R_{emp}(f) + \sqrt{ \frac{8}{n} \left( \ln N(F, 2n) + \ln \frac{4}{\delta} \right) }. Compare to regularized risk minimization: the second term plays the role of the penalty on the complexity of the function class.
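A final sketch (my own) that evaluates this generalization bound and notes the analogy to a regularized objective; the empirical risk, covering-number value, and δ below are all invented for illustration.

```python
import numpy as np

def risk_upper_bound(r_emp, log_N, n, delta):
    """R(f) < R_emp(f) + sqrt((8/n) * (ln N(F,2n) + ln(4/delta)))."""
    return r_emp + np.sqrt((8.0 / n) * (log_N + np.log(4.0 / delta)))

n, delta = 10000, 0.05
r_emp = 0.08   # hypothetical empirical risk of the chosen classifier
log_N = 50.0   # hypothetical ln N(F, 2n) for the function class

print(f"R(f) < {risk_upper_bound(r_emp, log_N, n, delta):.3f} with prob >= {1 - delta}")

# Analogy to regularized risk minimization: there we minimize the empirical risk
# plus a complexity penalty, e.g. R_emp(f) + lambda * ||w||^2, and that penalty
# plays the same role as the capacity term sqrt((8/n)(ln N(F,2n) + ln(4/delta))).
```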