
Introduction to Machine Learning Applications and Regression Analysis
Explore the first machine learning lecture by Hung-yi Lee, which focuses on binary classification ("learning to say yes/no") with applications such as spam filtering, recommendation systems, malware detection, and stock prediction. The lecture introduces the function f behind a spam filter, uses regression to estimate the probability of "yes" from the frequency of certain words in an e-mail, and shows where plain regression breaks down, motivating logistic regression and, finally, neural networks.
Presentation Transcript
First Lecture of Machine Learning Hung-yi Lee
Learning to say yes/no Binary Classification
Learning to say yes/no (Binary Classification). Spam filtering: is an e-mail spam or not? Recommendation systems: recommend the product to the customer or not? Malware detection: is the software malicious or not? Stock prediction: will the future value of a stock increase or not with respect to its current value?
Example Application: Spam filtering. f : X → Y, where Y = {yes, no}. A spam e-mail x2 maps to f(x2) = yes, while a non-spam e-mail x1 maps to f(x1) = no. (http://spam-filter-review.toptenreviews.com/)
Example Application: Spam filtering. f : X → Y, Y = {yes, no}. What does the function f look like? y = f(x) = yes if P(yes | x) ≥ 0.5, and no if P(yes | x) < 0.5. How to estimate P(yes | x)?
Example Application: Spam filtering. To estimate P(yes | x), collect examples first. Some words frequently appear in spam, e.g., "free". Use the frequency of "free" to decide whether an e-mail is spam: estimate P(yes | xfree = k), where xfree is the number of times "free" appears in e-mail x. Training examples: x1 = "Earn free free ..." → yes (spam); x2 = "Win free ..." → yes (spam); x3 = "Talk ... Meeting ..." → no (not spam).
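A minimal sketch in Python of this counting step (the three e-mails are the slide's toy examples; the exact texts and the code itself are only illustrative):

```python
# Estimate P(yes | x_free = k) by counting how often spam occurs among
# training e-mails that contain the word "free" exactly k times.
from collections import defaultdict

# Hypothetical training set: (e-mail text, is_spam)
training = [
    ("Earn free free", True),   # x1: spam
    ("Win free", True),         # x2: spam
    ("Talk Meeting", False),    # x3: not spam
]

spam_count = defaultdict(int)   # spam e-mails with x_free = k
total_count = defaultdict(int)  # all e-mails with x_free = k

for text, is_spam in training:
    k = text.lower().split().count("free")  # x_free for this e-mail
    total_count[k] += 1
    if is_spam:
        spam_count[k] += 1

# P(yes | x_free = k) estimated as a simple ratio of counts
p_yes = {k: spam_count[k] / total_count[k] for k in total_count}
print(p_yes)  # {2: 1.0, 1: 1.0, 0: 0.0} for this tiny data set
```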
Regression. From the training data we can estimate, for example, p(yes | xfree = 0) = 0.1 and p(yes | xfree = 1) = 0.4 (the slide plots p(yes | xfree) against the frequency of "free", xfree, in an e-mail x). Problem: in the training data there is no e-mail containing 3 "free", so what if one day you receive an e-mail with 3 "free"?
Regression. Fit a line f(xfree) = w * xfree + b to the estimated probabilities, where f(xfree) is an estimate of p(yes | xfree), and store w and b.
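A rough sketch of this fitting step using the slide's two estimated probabilities; the use of np.polyfit (ordinary least squares) is an assumption for illustration, not necessarily the lecture's method:

```python
import numpy as np

x = np.array([0.0, 1.0])      # x_free values seen in training
p = np.array([0.1, 0.4])      # estimated p(yes | x_free) from the slide

# Least-squares fit of the line p ~ w * x_free + b
w, b = np.polyfit(x, p, deg=1)
print(w, b)                   # roughly w = 0.3, b = 0.1

# Extrapolate to unseen e-mails with more occurrences of "free"
print(w * 3 + b)              # 1.0  -- right at the edge of a valid probability
print(w * 6 + b)              # 1.9  -- no longer a valid probability (next slide)
```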
Regression. Problem with f(xfree) = w * xfree + b: the output of f is not constrained to lie between 0 and 1. What if one day you receive an e-mail with 6 "free"? The line can then output a value above 1, which is not a valid probability.
Logit. The probability to be spam, p = p(yes | xfree), is always between 0 and 1, while its logit, logit(p) = ln(p / (1 - p)), can take any real value. (The slide plots both p(yes | xfree) and logit(p) against xfree.)
Logit. f(xfree) = w * xfree + b, where f(xfree) is now an estimate of logit(p) rather than of p itself, so the regression is done on logit(p) instead of on p.
Logit. Store w and b. For an e-mail with xfree = 3: f(xfree = 3) = w * 3 + b = 1.5, so logit(p) = ln(p / (1 - p)) = 1.5, which gives p = 0.817. Since p > 0.5, the answer is yes (spam). In general, f(xfree) = w * xfree + b > 0 means ln(p / (1 - p)) > 0, i.e., p > 0.5, i.e., yes.
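A few lines of Python reproduce this worked example; inv_logit below is simply the inverse of logit(p) = ln(p / (1 - p)) (a sketch, not lecture code):

```python
import math

def inv_logit(z):
    """Invert logit(p) = ln(p / (1 - p)): p = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

z = 1.5                            # f(x_free = 3) = w * 3 + b = 1.5 (value from the slide)
p = inv_logit(z)
print(round(p, 3))                 # 0.817
print("yes" if p > 0.5 else "no")  # p > 0.5, so the e-mail is classified as spam
```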
Multiple Variables. Consider two words, "free" and "hello", and compute p(yes | xfree, xhello); as before, work with logit(p) = ln(p / (1 - p)). (The slide plots the data over the (xfree, xhello) plane.)
Multiple Variables. Regression: f(xfree, xhello) = w1 * xfree + w2 * xhello + b, an estimate of logit(p) = ln(p / (1 - p)).
Multiple Variables. Of course, we can consider all words {t1, t2, ..., tN} in a dictionary. Let p = P(yes | xt1, xt2, ..., xtN), where xti is the count of word ti in the e-mail. Then f(xt1, xt2, ..., xtN) = z = w1 * xt1 + w2 * xt2 + ... + wN * xtN + b = w · x + b, and z is used to approximate logit(p).
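As an illustration only (the dictionary, weights, and bias below are made up), z is a single dot product between the e-mail's word-count vector and w:

```python
import numpy as np

dictionary = ["free", "win", "meeting"]  # hypothetical tiny dictionary {t1, t2, t3}
w = np.array([1.2, 0.8, -1.0])           # hypothetical weights
b = -0.5                                 # hypothetical bias

def count_vector(text):
    """Turn an e-mail into its word-count feature vector x = (x_t1, ..., x_tN)."""
    words = text.lower().split()
    return np.array([words.count(t) for t in dictionary], dtype=float)

x = count_vector("Win free free")        # x = (2, 1, 0)
z = w @ x + b                            # z = 1.2*2 + 0.8*1 - 1.0*0 - 0.5 = 2.7
print(z)
```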
Logistic Regression. logit(p) = ln(p / (1 - p)) is approximated by z = w · x + b, where p = P(yes | xt1, ..., xtN). Problem: in the training data, the probability to be spam is always exactly 1 or 0 (for example, a spam-labelled e-mail x in which t1 appears 3 times, t2 appears 0 times, ..., tN appears 1 time has P(yes | x) = 1). If p = 1 or 0, then ln(p / (1 - p)) is +infinity or -infinity, so we cannot do regression on logit(p) directly.
Logistic Regression. Solve ln(p / (1 - p)) = z = w · x + b for p: p / (1 - p) = e^z, so p = e^z * (1 - p), so p * (1 + e^z) = e^z, so p = e^z / (1 + e^z) = 1 / (1 + e^(-z)). This is the sigmoid function σ(z) = 1 / (1 + e^(-z)).
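A quick numeric sanity check of this algebra (illustrative only): applying logit to sigmoid(z) should return z for any z:

```python
import math

def sigmoid(z):
    """p = 1 / (1 + e^(-z)), the solution of ln(p / (1 - p)) = z."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-2.0, 0.0, 1.5, 4.0):
    p = sigmoid(z)
    assert abs(math.log(p / (1 - p)) - z) < 1e-9  # logit(sigmoid(z)) == z
    print(z, round(p, 3))
```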
Logistic Regression. p = 1 / (1 + e^(-(w · x + b))). For a spam training e-mail x1, we want 1 / (1 + e^(-(w · x1 + b))) to be close to 1 (yes, spam); for a non-spam e-mail x2, we want 1 / (1 + e^(-(w · x2 + b))) to be close to 0 (no, not spam).
Logistic Regression. p = 1 / (1 + e^(-(w · x + b))). This is a neuron in a neural network: the features xt1, xt2, ..., xtN are weighted by w1, w2, ..., wN, a bias b is added to give z = w · x + b, and the output is σ(z) = 1 / (1 + e^(-z)); σ(z) ≥ 0.5 means yes, σ(z) < 0.5 means no.
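A minimal sketch of such a neuron in Python; the weights, bias, and count vectors below are invented for illustration:

```python
import numpy as np

def neuron(x, w, b):
    """One logistic-regression neuron: weighted sum, bias, then sigmoid."""
    z = np.dot(w, x) + b              # z = w1*x_t1 + ... + wN*x_tN + b
    return 1.0 / (1.0 + np.exp(-z))   # sigma(z), an estimate of P(yes | x)

w = np.array([1.2, 0.8, -1.0])        # hypothetical learned weights
b = -0.5                              # hypothetical bias
x_spam = np.array([2.0, 1.0, 0.0])    # word counts of a spam-looking e-mail
x_ham = np.array([0.0, 0.0, 3.0])     # word counts of a normal e-mail

for x in (x_spam, x_ham):
    p = neuron(x, w, b)
    print(round(float(p), 3), "yes" if p >= 0.5 else "no")
```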
More than saying yes/no Multiclass Classification
More than saying yes/no. Handwritten digit classification: this is multiclass classification.
More than saying yes/no. Handwritten digit classification. Simplify the question: is an image a "2" or not? Describe the characteristics of the input object with a feature vector x = (x1, x2, ..., xN): each pixel of the image corresponds to one dimension of the feature vector.
More than saying yes/no. Handwritten digit classification. Simplify the question: is an image a "2" or not? Feed the features x1, x2, ..., xN (plus the constant bias input 1) into a single neuron whose output answers "2 or not".
More than saying yes/no. Handwritten digit classification. Binary classification of "1", "2", "3": use three neurons, one per digit, with outputs y1 ("1 or not"), y2 ("2 or not"), and y3 ("3 or not"). If y2 is the max, then the image is a "2".
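A sketch of this one-neuron-per-class idea; the 28 x 28 image size, the random input, and the untrained random weights are assumptions for illustration only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N = 784                         # e.g. a 28 x 28 image flattened into N pixel features
x = rng.random(N)               # stand-in for the pixel feature vector

W = rng.normal(size=(3, N))     # hypothetical weights, one row per digit "1", "2", "3"
b = np.zeros(3)                 # hypothetical biases

y = sigmoid(W @ x + b)          # y[0], y[1], y[2]: "1 or not", "2 or not", "3 or not"
predicted_digit = int(np.argmax(y)) + 1   # the class whose neuron fires strongest
print(y, predicted_digit)
```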
Limitation of Logistic Regression. A single neuron computes z = w1 * x1 + w2 * x2 + b and a = σ(z); z ≥ 0 gives a ≥ 0.5 (yes), z < 0 gives a < 0.5 (no). Consider the following input/output table: x1 = 0, x2 = 0 → no; x1 = 0, x2 = 1 → yes; x1 = 1, x2 = 0 → yes; x1 = 1, x2 = 1 → no. No choice of w1, w2, b can reproduce this behaviour, as the check below suggests.
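The table is the XOR pattern. A brute-force search over a grid of (w1, w2, b) values (illustrative, not from the lecture) confirms that no single linear boundary reproduces it:

```python
import itertools
import numpy as np

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR truth table

def separates(w1, w2, b):
    """True if the rule 'z >= 0 means yes' matches every row of the table."""
    return all((w1 * x1 + w2 * x2 + b >= 0) == bool(y) for (x1, x2), y in data)

grid = np.linspace(-3, 3, 25)
found = any(separates(w1, w2, b) for w1, w2, b in itertools.product(grid, grid, grid))
print(found)  # False: a single neuron (logistic regression) cannot fit XOR
```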
So we need neural networks. A network has an input layer (x1, x2, ..., xN), hidden layers (Layer 1, Layer 2, ..., Layer L), and an output layer (y1, y2, ..., yM). "Deep" means many layers.
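As a minimal sketch of why layers help, the hand-picked (not learned) two-layer network below does compute the XOR pattern from the previous slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = np.array([[20.0, 20.0],   # hidden neuron 1: roughly "x1 OR x2"
               [20.0, 20.0]])  # hidden neuron 2: roughly "x1 AND x2"
b1 = np.array([-10.0, -30.0])
W2 = np.array([20.0, -20.0])   # output neuron: "OR and not AND" = XOR
b2 = -10.0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = sigmoid(W1 @ np.array([x1, x2]) + b1)  # layer 1 (hidden)
    y = sigmoid(W2 @ h + b2)                   # layer 2 (output)
    print((x1, x2), round(float(y), 3))        # ~0, ~1, ~1, ~0
```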
Thank you for listening!
More references: http://www.ccs.neu.edu/home/vip/teach/MLcourse/2_GD_REG_pton_NN/lecture_notes/logistic_regression_loss_function/logistic_regression_loss.pdf http://mathgotchas.blogspot.tw/2011/10/why-is-error-function-minimized-in.html https://cs.nyu.edu/~yann/talks/lecun-20071207-nonconvex.pdf http://www.cs.columbia.edu/~blei/fogm/lectures/glms.pdf http://grzegorz.chrupala.me/papers/ml4nlp/linear-classifiers.pdf