
Output Nonlinearities and Training Criteria in Machine Learning
Output Nonlinearities, Training Criteria and Bayesian Interpretation Yang Zhang
Skeleton
- MMSE training criterion
- 2-class classification: logistic nonlinearity
- Multiclass classification: output structure
- Multiclass classification: softmax nonlinearity
- Minimum cross entropy criterion
Types of learning problem
- Feature and explained-variable pairs $(X, Y) \sim p_{X,Y}$; $(X, Y)$ is a pair of random variables.
- $Y \in \{0, 1\}$: 2-class classification.
- $Y \in \{1, \dots, K\}$, each integer denoting a class: multiclass classification.
- $Y \in \mathbb{R}$: regression.
- $Y \in \mathbb{Z}$, each integer denoting a level: regression with a discrete explained variable.
- $\{(x_i, y_i)\}$: a set of training tokens drawn independently from $p_{X,Y}$.
Minimum Mean Squared Error
- Works generally well across learning tasks.
- Criterion: $E = \sum_i (z_i - y_i)^2$, where $z_i$ is the output of the classifier and $y_i$ is the label.
- $z_i = f(x_i)$, with $f \in \mathcal{F}$, the set of all functions representable by the neural network architecture.
- Training solves $\min_{f \in \mathcal{F}} \sum_i (f(x_i) - y_i)^2$.
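The criterion above can be sketched in a few lines. This is a minimal illustration, not the slides' code; the names `mmse_loss`, `z`, and `y` are chosen here for clarity.

```python
# A minimal sketch of the MMSE training criterion: the empirical loss is the
# sum of squared differences between classifier outputs z_i = f(x_i) and
# labels y_i.
def mmse_loss(z, y):
    """Sum of squared errors between outputs z and labels y."""
    return sum((zi - yi) ** 2 for zi, yi in zip(z, y))

# Example: three outputs against binary labels.
print(mmse_loss([0.9, 0.2, 0.7], [1, 0, 1]))  # ~ 0.01 + 0.04 + 0.09 = 0.14
```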
MMSE and Bayesian framework
- Given enough hidden nodes and training tokens, a neural network trained with the MMSE criterion gives, for a test sample $x$,
  $$f(x) = \mathbb{E}[Y \mid X = x].$$
- MMSE leads to the posterior expectation!
MMSE and Bayesian framework
- For the 2-class classification problem (with 0/1 labels):
  $$\mathbb{E}[Y \mid X = x] = 1 \cdot p_{Y|X}(1 \mid x) + 0 \cdot p_{Y|X}(0 \mid x) = p_{Y|X}(1 \mid x) = \Pr(Y = 1 \mid X = x).$$
- For the regression problem, the output is the posterior mean of $Y$.
MMSE and Bayesian framework — Proof
With enough training tokens the empirical sum approaches the expected loss, and with enough representation power the minimization over $\mathcal{F}$ approaches the minimization over all functions:
$$\min_{f \in \mathcal{F}} \sum_i (f(x_i) - y_i)^2 \;\longrightarrow\; \min_{\text{all } f} \iint \big(f(x) - y\big)^2 \, p_{X,Y}(x, y) \, dx \, dy.$$
Inserting $\pm\, \mathbb{E}[Y \mid X = x]$ and expanding the square:
$$= \min_{\text{all } f} \iint \big(f(x) - \mathbb{E}[Y \mid X = x] + \mathbb{E}[Y \mid X = x] - y\big)^2 \, p_{X,Y}(x, y) \, dx \, dy$$
$$= \min_{\text{all } f} \int \Big[ \big(f(x) - \mathbb{E}[Y \mid X = x]\big)^2 + \int \big(\mathbb{E}[Y \mid X = x] - y\big)^2 \, p_{Y|X}(y \mid x) \, dy$$
$$\qquad\qquad + \, 2 \big(f(x) - \mathbb{E}[Y \mid X = x]\big) \int \big(\mathbb{E}[Y \mid X = x] - y\big) \, p_{Y|X}(y \mid x) \, dy \Big] \, p_X(x) \, dx.$$
MMSE and Bayesian framework — Proof (cont'd)
The cross term vanishes:
$$2 \int \big(f(x) - \mathbb{E}[Y \mid X = x]\big) \Big[ \int \big(\mathbb{E}[Y \mid X = x] - y\big) \, p_{Y|X}(y \mid x) \, dy \Big] \, p_X(x) \, dx = 0,$$
since
$$\int \big(\mathbb{E}[Y \mid X = x] - y\big) \, p_{Y|X}(y \mid x) \, dy = \mathbb{E}[Y \mid X = x] - \mathbb{E}[Y \mid X = x] = 0.$$
MMSE and Bayesian framework — Proof (cont'd)
What remains is
$$\min_{\text{all } f} \int \big(f(x) - \mathbb{E}[Y \mid X = x]\big)^2 \, p_X(x) \, dx + \text{const}.$$
The optimum is $f(x) = \mathbb{E}[Y \mid X = x]$. This holds regardless of the output nonlinearity.
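The conclusion above can be checked numerically. The sketch below (an illustration, not from the slides) verifies that for a sample of draws of $Y$, the constant that minimizes the average squared error is the sample mean.

```python
import random

# Numerical check: the constant c minimizing the average of (y - c)^2 over a
# sample is the sample mean -- the squared loss picks out the expectation.
random.seed(0)
ys = [random.gauss(2.0, 1.0) for _ in range(10_000)]
mean_y = sum(ys) / len(ys)

def risk(c):
    """Average squared error of the constant predictor c."""
    return sum((y - c) ** 2 for y in ys) / len(ys)

# The risk at the mean is lower than at nearby constants.
assert risk(mean_y) < risk(mean_y + 0.1)
assert risk(mean_y) < risk(mean_y - 0.1)
```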
2-Class Classification
- If the training data set is not that large, the output may not be well behaved; it may not even lie within $[0, 1]$.
- We therefore need to choose an output nonlinearity that confines the output to $[0, 1]$.
Logistic Function
$$f(a) = \frac{1}{1 + e^{-a}}, \qquad f'(a) = \frac{e^{-a}}{(1 + e^{-a})^2}.$$
[Plots of the logistic function and its derivative over $a \in [-5, 5]$.]
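A short sketch of the two formulas above, also confirming the standard identity $f'(a) = f(a)\,(1 - f(a))$ (a known fact, stated here as an aside rather than taken from the slides):

```python
import math

def logistic(a):
    """Logistic function 1 / (1 + e^{-a})."""
    return 1.0 / (1.0 + math.exp(-a))

def logistic_deriv(a):
    """Derivative e^{-a} / (1 + e^{-a})^2, as on the slide."""
    return math.exp(-a) / (1.0 + math.exp(-a)) ** 2

# The derivative equals f(a) * (1 - f(a)) at every point.
for a in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    f = logistic(a)
    assert abs(logistic_deriv(a) - f * (1 - f)) < 1e-12

print(logistic(0.0))  # 0.5
```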
Logistic Function — Bayesian Interpretation
If the class-conditional densities $p_{X|Y}$ belong to the exponential family (with additional constraints), the posterior probability naturally takes the form
$$p_{Y|X}(1 \mid x) = \frac{1}{1 + e^{-(b + w^{\mathsf T} x)}}.$$
Proof: write the class-conditional densities as
$$p_{X|Y}(x \mid 0) = h(x) \exp\big(\eta_0^{\mathsf T} x + g(\eta_0)\big), \qquad p_{X|Y}(x \mid 1) = h(x) \exp\big(\eta_1^{\mathsf T} x + g(\eta_1)\big),$$
and let $\pi_0 = \Pr(Y = 0)$, $\pi_1 = \Pr(Y = 1)$. By Bayes' rule,
$$p_{Y|X}(1 \mid x) = \frac{\pi_1 \, p_{X|Y}(x \mid 1)}{\pi_0 \, p_{X|Y}(x \mid 0) + \pi_1 \, p_{X|Y}(x \mid 1)}.$$
Logistic Function — Bayesian Interpretation (cont'd)
Substituting the exponential-family form (the factor $h(x)$ cancels):
$$p_{Y|X}(1 \mid x) = \frac{\exp\big(\log \pi_1 + \eta_1^{\mathsf T} x + g(\eta_1)\big)}{\exp\big(\log \pi_0 + \eta_0^{\mathsf T} x + g(\eta_0)\big) + \exp\big(\log \pi_1 + \eta_1^{\mathsf T} x + g(\eta_1)\big)}.$$
Let
$$b = \log \frac{\pi_1}{\pi_0} + g(\eta_1) - g(\eta_0), \qquad w = \eta_1 - \eta_0.$$
Then
$$p_{Y|X}(1 \mid x) = \frac{1}{1 + e^{-(b + w^{\mathsf T} x)}}.$$
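As a concrete check of this result, the sketch below uses an assumed example (not from the slides): two 1-D Gaussian classes with equal variance, a classic exponential-family case. The posterior computed directly by Bayes' rule matches a logistic function of $x$ exactly.

```python
import math

# Assumed example: two Gaussian classes with means mu0, mu1, shared variance
# sigma^2, and priors pi0, pi1. All parameter values are arbitrary choices.
mu0, mu1, sigma, pi0, pi1 = -1.0, 2.0, 1.5, 0.4, 0.6

def gauss(x, mu):
    """Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def posterior(x):
    """Pr(Y = 1 | X = x) computed directly by Bayes' rule."""
    return pi1 * gauss(x, mu1) / (pi0 * gauss(x, mu0) + pi1 * gauss(x, mu1))

# The logistic parameters predicted by the derivation above.
w = (mu1 - mu0) / sigma ** 2
b = math.log(pi1 / pi0) + (mu0 ** 2 - mu1 ** 2) / (2 * sigma ** 2)

# Bayes' rule and the logistic form agree at every test point.
for x in [-2.0, 0.0, 1.0, 3.0]:
    logistic = 1.0 / (1.0 + math.exp(-(b + w * x)))
    assert abs(posterior(x) - logistic) < 1e-9
```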
Multi-class classification
- $y_i \in \{1, 2, \dots, K\}$, where $K$ is the total number of classes.
- Question: is $\sum_i (z_i - y_i)^2$ a good metric?
- No, because there is no reason to assume class 3 is closer to class 2 than to class 1.
Multi indicator output
- Encode each label $y_i \in \{1, 2, \dots, K\}$ as an indicator vector $y_i = (0, \dots, 0, 1, 0, \dots, 0)^{\mathsf T}$:
  $$y_{ik} = \begin{cases} 1 & y_i = k \\ 0 & \text{otherwise.} \end{cases}$$
- Accordingly, there should be $K$ output nodes $z_{i1}, z_{i2}, \dots, z_{iK}$.
[Diagram of a network with $K$ output nodes.]
Multi indicator output — MMSE
- Criterion: $E = \sum_i \sum_k (z_{ik} - y_{ik})^2$.
- Similarly, with enough training data and representation power, each output gives a posterior probability:
  $$z_k(x) = p_{Y_k|X}(1 \mid x) = \Pr(Y = k \mid X = x).$$
- Inherent constraint: $\sum_k z_k(x) = \sum_k \Pr(Y = k \mid X = x) = 1$.
- What if there is not enough training data?
Softmax function
Definition:
$$z_k = \frac{\exp(a_k)}{\sum_j \exp(a_j)}.$$
It is easily verified that $\sum_k z_k = 1$.
[Diagram of a network with softmax outputs.]
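The definition above can be sketched as follows. Subtracting $\max(a)$ before exponentiating is a standard numerical-stability trick added here; it does not change the result, since the shift cancels in the ratio.

```python
import math

# A sketch of the softmax function. The max-subtraction is for numerical
# stability only (an addition to the slides' formula; the ratio is unchanged).
def softmax(a):
    m = max(a)
    exps = [math.exp(ai - m) for ai in a]
    s = sum(exps)
    return [e / s for e in exps]

z = softmax([1.0, 2.0, 3.0])
assert abs(sum(z) - 1.0) < 1e-12   # outputs sum to one, as the slide notes
assert z[2] > z[1] > z[0]          # larger activation -> larger probability
```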
Softmax function
Why the softmax function? If $p_{X|Y}$ belongs to the exponential family (with additional constraints), the posterior probability naturally takes the form
$$p_{Y_k|X}(1 \mid x) = \frac{\exp\big(g_k(x)\big)}{\sum_j \exp\big(g_j(x)\big)}, \qquad g_k(x) = b_k + w_k^{\mathsf T} x.$$
You will prove it in your next homework.
Softmax function
A more complex structure means a more complex derivative:
$$z_k = \frac{\exp(a_k)}{\sum_j \exp(a_j)}.$$
Is there a good loss function that makes the derivative simpler?
Entropy
Definition: for a pmf $p(x)$,
$$H(p) = -\sum_x p(x) \log p(x).$$
Interpretation: the expected amount of information that must be received before narrowing down to one instance.
Minimum: $p(x) = \delta(x - x_0)$ for some $x_0$, giving $H(p) = 0$.
Maximum: the uniform distribution $p(x) = 1/K$, giving $H(p) = \log K$.
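A short sketch of the definition, checking the two extremes on the slide: a point mass has entropy $0$, and the uniform distribution over $K$ outcomes reaches the maximum $\log K$.

```python
import math

# Entropy of a pmf given as a list of probabilities; terms with p = 0 are
# skipped, following the convention 0 log 0 = 0.
def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Point mass: entropy 0.
assert entropy([1.0, 0.0, 0.0]) == 0.0

# Uniform distribution over K outcomes: entropy log K.
K = 4
assert abs(entropy([1.0 / K] * K) - math.log(K)) < 1e-12
```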
Cross Entropy
Definition: for pmfs $p(x)$ and $q(x)$,
$$H(p, q) = -\sum_x p(x) \log q(x).$$
Interpretation: the expected amount of information (encoded for another distribution $q(x)$) that must be received before narrowing down to one instance.
Minimum (over $q(x)$, given $p(x)$): $q(x) = p(x)$, where $H(p, q) = H(p)$.
Cross Entropy as a Loss Function
Take $p$ to be the indicator label $y$ and $q$ to be the network output $z$ (both pmfs):
$$E = H(y, z) = -\sum_k y_k \log z_k.$$
Questions:
- Does the minimum cross entropy criterion give the posterior probability as the optimal result?
- Does the minimum cross entropy criterion result in simpler derivatives?
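A sketch of the loss above. With a one-hot target, the sum collapses to $-\log$ of the output assigned to the true class, which is how the loss is usually computed in practice.

```python
import math

# Cross entropy between a target pmf p and an output pmf q; terms with
# p = 0 are skipped (convention 0 log q = 0).
def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0, 1, 0]         # one-hot label: the true class is 2
q = [0.2, 0.7, 0.1]   # network output (a pmf)

# With a one-hot target, the loss is -log of the true-class output.
assert abs(cross_entropy(p, q) - (-math.log(0.7))) < 1e-12

# Over q, the loss is minimized at q = p, where it equals H(p) = 0 here.
assert cross_entropy(p, p) == 0.0
```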
Cross Entropy and Bayesian Framework
Given enough hidden nodes and training tokens, a neural network trained with the minimum cross entropy criterion and the softmax function gives, for a test sample $x$,
$$z_k(x) = p_{Y_k|X}(1 \mid x) = \Pr(Y = k \mid X = x).$$
Proof: with enough training tokens and representation power, the empirical criterion approaches the expected loss, minimized over all output functions satisfying the softmax constraint:
$$\min_{f \in \mathcal{F}} -\sum_i \sum_k y_{ik} \log z_k(x_i) \;\longrightarrow\; \min_{\text{all } z:\, \sum_k z_k = 1} -\iint \sum_k y_k \log z_k(x) \, p_{X,Y}(x, y) \, dx \, dy.$$
Cross Entropy and Bayesian Framework — Proof (cont'd)
Marginalizing over the label,
$$\min_{\text{all } z:\, \sum_k z_k = 1} -\int \sum_k p_{Y|X}(k \mid x) \log z_k(x) \, p_X(x) \, dx.$$
For each $x$, this is a cross entropy in $z(x)$, so the optimal solution is
$$z_k(x) = p_{Y|X}(k \mid x).$$
Cross Entropy Back Prop.
$$E = -\sum_k y_k \log z_k, \qquad z_k = \frac{\exp(a_k)}{\sum_j \exp(a_j)}.$$
Chain rule:
$$\frac{\partial E}{\partial a_j} = \sum_k \frac{\partial E}{\partial z_k} \frac{\partial z_k}{\partial a_j}.$$
Cross Entropy Back Prop.
First term:
$$\frac{\partial E}{\partial z_k} = -\frac{y_k}{z_k}.$$
Cross Entropy Back Prop.
Second term:
$$\frac{\partial z_k}{\partial a_j} = z_k \big(\delta_{kj} - z_j\big),$$
i.e. $z_k (1 - z_k)$ when $j = k$ and $-z_k z_j$ when $j \ne k$.
Cross Entropy Back Prop.
Combining the two terms, and using $\sum_k y_k = 1$:
$$\frac{\partial E}{\partial a_j} = \sum_k \Big(-\frac{y_k}{z_k}\Big) z_k \big(\delta_{kj} - z_j\big) = -y_j + z_j \sum_k y_k = z_j - y_j.$$
Final result:
$$\frac{\partial E}{\partial a_j} = z_j - y_j.$$
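The final result can be verified with a finite-difference check (an illustration, not from the slides): for $E = -\sum_k y_k \log z_k$ with $z = \mathrm{softmax}(a)$, the numerical gradient matches $z_j - y_j$.

```python
import math

def softmax(a):
    """Softmax with max-subtraction for numerical stability."""
    m = max(a)
    exps = [math.exp(ai - m) for ai in a]
    s = sum(exps)
    return [e / s for e in exps]

def loss(a, y):
    """Cross entropy of softmax(a) against the one-hot label y."""
    return -sum(yk * math.log(zk) for yk, zk in zip(y, softmax(a)))

a = [0.3, -1.2, 2.0]   # arbitrary pre-softmax activations
y = [0, 0, 1]          # one-hot label
z = softmax(a)
eps = 1e-6

# Central finite differences agree with the analytic gradient z_j - y_j.
for j in range(3):
    ap = a[:]; ap[j] += eps
    am = a[:]; am[j] -= eps
    numeric = (loss(ap, y) - loss(am, y)) / (2 * eps)
    assert abs(numeric - (z[j] - y[j])) < 1e-6
```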
Summary

| Type | Training criterion | Output nonlinearity |
| --- | --- | --- |
| Regression | MMSE | — |
| 2-class classification | MMSE | Logistic |
| Multi-class classification | Min cross entropy | Softmax |