Machine Learning with MapReduce

Dive into the world of Machine Learning with MapReduce, specifically focusing on K-Means Clustering. Learn how to implement the K-Means algorithm using the MapReduce framework, along with the Map/Reduce design driver and the EM-Algorithm. Explore the process of mapping, reducing, and clustering data points efficiently to achieve accurate results in a distributed computing environment.

  • Machine Learning
  • MapReduce
  • K-Means Clustering
  • Data Processing
  • Distributed Computing


Presentation Transcript


  1. Machine Learning with MapReduce

  2. K-Means Clustering

  3. How to MapReduce K-Means?
     • Given K, assign the first K random points to be the initial cluster centers.
     • Assign subsequent points to the closest cluster using the supplied distance measure.
     • Compute the centroid of each cluster and iterate the previous step until the cluster centers converge within delta.
     • Run a final pass over the points to cluster them for output.

  4. K-Means Map/Reduce Design
     • Driver: runs multiple iteration jobs using mapper + combiner + reducer, then runs the final clustering job using only the mapper.
     • Mapper: Configure: single file containing encoded Clusters. Input: file split containing encoded Vectors. Output: Vectors keyed by nearest cluster.
     • Combiner: Input: Vectors keyed by nearest cluster. Output: cluster centroid Vectors keyed by cluster.
     • Reducer (singleton): Input: cluster centroid Vectors. Output: single file containing Vectors keyed by cluster.

  5. Mapper: the mapper holds the k centers in memory. For each input key-value pair (each input data point x), find the index of the closest of the k centers (call it iClosest) and emit (key, value) = (iClosest, x). Reducer(s): the input key is the index of a center and the value is an iterator over the input data points closest to that center. At each key, run through the iterator, average the corresponding input data points, and emit (index of center, new center).
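
  As a concrete illustration of this mapper/reducer logic, here is a minimal plain-Java sketch. It simulates one iteration in memory rather than using the Hadoop API; the class and method names (KMeansIteration, nearestCenter) are illustrative assumptions, not Mahout's actual classes.

     import java.util.*;

     // Minimal sketch of one K-Means MapReduce iteration, simulated in memory.
     public class KMeansIteration {

         // Mapper: for each point, emit (index of closest center, point).
         static Map<Integer, List<double[]>> map(List<double[]> points, double[][] centers) {
             Map<Integer, List<double[]>> emitted = new HashMap<>();
             for (double[] x : points) {
                 int iClosest = nearestCenter(x, centers);
                 emitted.computeIfAbsent(iClosest, k -> new ArrayList<>()).add(x);
             }
             return emitted;
         }

         // Reducer: for each center index, average the points assigned to it.
         static double[][] reduce(Map<Integer, List<double[]>> grouped, double[][] oldCenters) {
             double[][] newCenters = oldCenters.clone();
             for (Map.Entry<Integer, List<double[]>> e : grouped.entrySet()) {
                 double[] sum = new double[oldCenters[0].length];
                 for (double[] x : e.getValue())
                     for (int d = 0; d < x.length; d++) sum[d] += x[d];
                 for (int d = 0; d < sum.length; d++) sum[d] /= e.getValue().size();
                 newCenters[e.getKey()] = sum;
             }
             return newCenters;
         }

         // Squared Euclidean distance stands in for the supplied distance measure.
         static int nearestCenter(double[] x, double[][] centers) {
             int best = 0;
             double bestDist = Double.MAX_VALUE;
             for (int i = 0; i < centers.length; i++) {
                 double dist = 0;
                 for (int d = 0; d < x.length; d++)
                     dist += (x[d] - centers[i][d]) * (x[d] - centers[i][d]);
                 if (dist < bestDist) { bestDist = dist; best = i; }
             }
             return best;
         }
     }

  A driver would call map and then reduce once per iteration, stopping when every center moves by less than the convergence delta.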

  6. Improved version: calculate partial sums in the mappers. Mapper: the mapper holds the k centers in memory and runs through one input data point at a time (call it x). Find the index of the closest of the k centers (call it iClosest) and accumulate sums of the inputs, segregated into K groups depending on which center is closest. Emit: ( , partial sum), or Emit(index, partial sum). Reducer: accumulate the partial sums and emit, with or without the index.
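
  A sketch of the same idea with per-center partial sums (again with illustrative names, reusing nearestCenter from the previous sketch):

     import java.util.*;

     // Combiner-style variant: each mapper emits one partial sum and count per
     // center for its input split, instead of one record per point.
     class KMeansPartialSums {

         // Mapper with local aggregation; value layout: [sum_0 .. sum_{dim-1}, count].
         static Map<Integer, double[]> map(List<double[]> split, double[][] centers) {
             Map<Integer, double[]> partial = new HashMap<>();
             for (double[] x : split) {
                 int iClosest = KMeansIteration.nearestCenter(x, centers);
                 double[] acc = partial.computeIfAbsent(iClosest, k -> new double[x.length + 1]);
                 for (int d = 0; d < x.length; d++) acc[d] += x[d];
                 acc[x.length] += 1;                        // running count for this center
             }
             return partial;                                // emit (index, partial sum + count)
         }

         // Reducer for one center: add the partial sums and counts, then divide.
         static double[] reduce(List<double[]> partialsForOneCenter) {
             int dim = partialsForOneCenter.get(0).length - 1;
             double[] total = new double[dim + 1];
             for (double[] p : partialsForOneCenter)
                 for (int d = 0; d <= dim; d++) total[d] += p[d];
             double[] center = new double[dim];
             for (int d = 0; d < dim; d++) center[d] = total[d] / total[dim];
             return center;
         }
     }

  This cuts the intermediate data from one record per point to at most K records per map task, which is what the combiner in the earlier design accomplishes.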

  7. EM-Algorithm

  8. What is MLE? Given a sample X = {X_1, ..., X_n} and a vector of parameters θ, we define the likelihood of the data, P(X | θ), and the log-likelihood of the data, L(θ) = log P(X | θ). Given X, find θ_ML = arg max_θ L(θ).

  9. MLE (cont.) Often we assume that the X_i are independently and identically distributed (i.i.d.):
     θ_ML = arg max_θ L(θ)
          = arg max_θ log P(X | θ)
          = arg max_θ log P(X_1, ..., X_n | θ)
          = arg max_θ log Π_i P(X_i | θ)
          = arg max_θ Σ_i log P(X_i | θ)
     Depending on the form of P(x | θ), solving this optimization problem can be easy or hard.

  10. An easy case. Assume a coin has probability p of coming up heads and 1 − p of coming up tails. Observation: we toss the coin N times, and the result is a sequence of Hs and Ts containing m Hs. What is the value of p based on MLE, given this observation?

  11. An easy case (cont.)
     L(θ) = log P(X | p) = log ( p^m (1 − p)^(N − m) ) = m log p + (N − m) log(1 − p)
     dL/dp = m/p − (N − m)/(1 − p) = 0
     ⟹ p = m/N
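
  To make the result concrete (the numbers here are chosen purely for illustration and are not from the slides): suppose N = 10 tosses yield m = 7 heads. Then
     L(p) = log ( p^7 (1 − p)^3 ) = 7 log p + 3 log(1 − p),
     dL/dp = 7/p − 3/(1 − p) = 0 ⟹ 7(1 − p) = 3p ⟹ p = 7/10 = m/N.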

  12. Basic setting in EM. X is a set of observed data points; θ is a parameter vector. EM is a method to find θ_ML = arg max_θ L(θ) = arg max_θ log P(X | θ) when calculating P(X | θ) directly is hard, but calculating P(X, Y | θ) is much simpler, where Y is hidden data (or missing data).

  13. The basic EM strategy. Z = (X, Y). Z: complete data ("augmented data"); X: observed data ("incomplete data"); Y: hidden data ("missing data").

  14. The log-likelihood function. L is a function of θ, while holding X constant:
     L(θ | X) = L(θ) = P(X | θ)
     l(θ) = log L(θ) = log P(X | θ)
          = log Π_{i=1..n} P(x_i | θ)
          = Σ_{i=1..n} log P(x_i | θ)
          = Σ_{i=1..n} log Σ_y P(x_i, y | θ)

  15. The iterative approach for MLE.
     θ_ML = arg max_θ l(θ) = arg max_θ Σ_{i=1..n} log Σ_y P(x_i, y | θ)
     In many cases, we cannot find the solution directly. An alternative is to find a sequence θ^0, θ^1, ..., θ^t, ... such that l(θ^0) ≤ l(θ^1) ≤ ... ≤ l(θ^t) ≤ ....

  16.
     l(θ) − l(θ^t) = log P(X | θ) − log P(X | θ^t)
       = Σ_{i=1..n} log Σ_y P(x_i, y | θ) − Σ_{i=1..n} log Σ_y P(x_i, y | θ^t)
       = Σ_{i=1..n} log [ Σ_y P(x_i, y | θ) / Σ_{y'} P(x_i, y' | θ^t) ]
       = Σ_{i=1..n} log Σ_y [ ( P(x_i, y | θ^t) / Σ_{y'} P(x_i, y' | θ^t) ) · ( P(x_i, y | θ) / P(x_i, y | θ^t) ) ]
       = Σ_{i=1..n} log Σ_y [ P(y | x_i, θ^t) · P(x_i, y | θ) / P(x_i, y | θ^t) ]
       = Σ_{i=1..n} log E_{P(y | x_i, θ^t)} [ P(x_i, y | θ) / P(x_i, y | θ^t) ]
       ≥ Σ_{i=1..n} E_{P(y | x_i, θ^t)} [ log ( P(x_i, y | θ) / P(x_i, y | θ^t) ) ]    (Jensen's inequality)

  17. Jensen's inequality. If f is convex, then E[f(g(x))] ≥ f(E[g(x)]); if f is concave, then E[f(g(x))] ≤ f(E[g(x)]). log is a concave function, so E[log p(x)] ≤ log E[p(x)].

  18. Maximizing the lower bound.
     θ^(t+1) = arg max_θ Σ_{i=1..n} E_{P(y | x_i, θ^t)} [ log ( P(x_i, y | θ) / P(x_i, y | θ^t) ) ]
             = arg max_θ Σ_{i=1..n} Σ_y P(y | x_i, θ^t) log ( P(x_i, y | θ) / P(x_i, y | θ^t) )
             = arg max_θ Σ_{i=1..n} Σ_y P(y | x_i, θ^t) log P(x_i, y | θ)
             = arg max_θ Σ_{i=1..n} E_{P(y | x_i, θ^t)} [ log P(x_i, y | θ) ]
     The last expression is the Q function.

  19. The Q-function. Define the Q-function (a function of θ):
     Q(θ, θ^t) = E_{P(Y | X, θ^t)} [ log P(X, Y | θ) | X, θ^t ]
               = Σ_Y P(Y | X, θ^t) log P(X, Y | θ)
               = Σ_{i=1..n} Σ_y P(y | x_i, θ^t) log P(x_i, y | θ)
     Y is a random vector. X = (x_1, x_2, ..., x_n) is a constant (vector). θ^t is the current parameter estimate and is a constant (vector). θ is the normal variable (vector) that we wish to adjust. The Q-function is the expected value of the complete-data log-likelihood log P(X, Y | θ) with respect to Y, given X and θ^t.

  20. The inner loop of the EM algorithm.
     E-step: calculate Q(θ, θ^t) = Σ_{i=1..n} Σ_y P(y | x_i, θ^t) log P(x_i, y | θ)
     M-step: find θ^(t+1) = arg max_θ Q(θ, θ^t)

  21. L(θ) is non-decreasing at each iteration. The EM algorithm will produce a sequence θ^0, θ^1, ..., θ^t, .... It can be proved that l(θ^0) ≤ l(θ^1) ≤ ... ≤ l(θ^t) ≤ ....

  22. The inner loop of the Generalized EM algorithm (GEM).
     E-step: calculate Q(θ, θ^t) = Σ_{i=1..n} Σ_y P(y | x_i, θ^t) log P(x_i, y | θ)
     M-step: instead of requiring θ^(t+1) = arg max_θ Q(θ, θ^t), find any θ^(t+1) such that Q(θ^(t+1), θ^t) ≥ Q(θ^t, θ^t)

  23. Recap of the EM algorithm

  24. Idea #1: find the θ that maximizes the likelihood of the training data.
     θ_ML = arg max_θ L(θ) = arg max_θ log P(X | θ)

  25. Idea #2: find the θ^t sequence. There is no analytical solution, so take an iterative approach: find a sequence θ^0, θ^1, ..., θ^t, ... such that l(θ^0) ≤ l(θ^1) ≤ ... ≤ l(θ^t) ≤ ....

  26. Idea #3: find the θ^(t+1) that maximizes a tight lower bound of l(θ) − l(θ^t):
     l(θ) − l(θ^t) ≥ Σ_{i=1..n} E_{P(y | x_i, θ^t)} [ log ( P(x_i, y | θ) / P(x_i, y | θ^t) ) ]    (a tight lower bound)

  27. Idea #4: find the θ^(t+1) that maximizes the Q function, which is the lower bound of l(θ) − l(θ^t) up to a term constant in θ:
     θ^(t+1) = arg max_θ Σ_{i=1..n} E_{P(y | x_i, θ^t)} [ log ( P(x_i, y | θ) / P(x_i, y | θ^t) ) ]
             = arg max_θ Σ_{i=1..n} E_{P(y | x_i, θ^t)} [ log P(x_i, y | θ) ]
     The last expression is the Q function.

  28. The EM algorithm. Start with an initial estimate θ^0. Repeat until convergence:
     E-step: calculate Q(θ, θ^t) = Σ_{i=1..n} Σ_y P(y | x_i, θ^t) log P(x_i, y | θ)
     M-step: find θ^(t+1) = arg max_θ Q(θ, θ^t)
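
  To make the E-step and M-step concrete, here is a small, self-contained Java sketch of EM for a 1-D mixture of two Gaussians (one of the problem classes listed on the next slide). The hidden variable y_i is the component that generated x_i; the E-step computes the responsibilities P(y | x_i, θ^t) and the M-step maximizes Q in closed form. The class name, toy data, and iteration count are illustrative assumptions, not from the slides.

     import java.util.Arrays;

     // EM for a two-component 1-D Gaussian mixture, as an illustration of the
     // generic E-step / M-step loop above.
     public class TwoGaussianEM {
         public static void main(String[] args) {
             double[] x = {-2.1, -1.9, -2.3, -1.7, 2.0, 2.2, 1.8, 2.4};   // toy data
             double[] mu = {-1.0, 1.0}, sigma = {1.0, 1.0}, pi = {0.5, 0.5};
             double[][] r = new double[x.length][2];   // responsibilities P(y | x, theta^t)

             for (int iter = 0; iter < 50; iter++) {
                 // E-step: compute responsibilities for each point and component.
                 for (int i = 0; i < x.length; i++) {
                     double p0 = pi[0] * gaussian(x[i], mu[0], sigma[0]);
                     double p1 = pi[1] * gaussian(x[i], mu[1], sigma[1]);
                     r[i][0] = p0 / (p0 + p1);
                     r[i][1] = p1 / (p0 + p1);
                 }
                 // M-step: closed-form maximization of Q(theta, theta^t).
                 for (int k = 0; k < 2; k++) {
                     double nk = 0, sum = 0, sq = 0;
                     for (int i = 0; i < x.length; i++) { nk += r[i][k]; sum += r[i][k] * x[i]; }
                     mu[k] = sum / nk;
                     for (int i = 0; i < x.length; i++) sq += r[i][k] * (x[i] - mu[k]) * (x[i] - mu[k]);
                     sigma[k] = Math.sqrt(sq / nk);
                     pi[k] = nk / x.length;
                 }
             }
             System.out.println("means  = " + Arrays.toString(mu));
             System.out.println("sigmas = " + Arrays.toString(sigma));
             System.out.println("priors = " + Arrays.toString(pi));
         }

         // Gaussian density N(x; mu, sigma^2).
         static double gaussian(double x, double mu, double sigma) {
             double z = (x - mu) / sigma;
             return Math.exp(-0.5 * z * z) / (Math.sqrt(2 * Math.PI) * sigma);
         }
     }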

  29. Important classes of EM problems: products of multinomial (PM) models, exponential families, Gaussian mixtures.

  30. Probabilistic Latent Semantic Analysis (PLSA). PLSA is a generative model for the co-occurrence of documents d ∈ D = {d_1, ..., d_D} and terms w ∈ W = {w_1, ..., w_W}, with which it associates a latent variable z ∈ Z = {z_1, ..., z_Z}. The generative process: pick a document d with probability P(d), pick a latent class z with probability P(z | d), and generate a word w with probability P(w | z). (Figure: graphical model connecting the d, z, and w nodes, labeled P(d), P(z | d), and P(w | z).)

  31. Model. The generative process can be expressed by:
     P(d, w) = P(d) P(w | d), where P(w | d) = Σ_{z ∈ Z} P(w | z) P(z | d)
     Two independence assumptions: 1) each pair (d, w) is assumed to be generated independently, corresponding to the bag-of-words assumption; 2) conditioned on z, words w are generated independently of the specific document d.

  32. Model. Following the likelihood principle, we determine P(z), P(d | z), and P(w | z) by maximizing the log-likelihood function
     L = Σ_{d ∈ D} Σ_{w ∈ W} n(d, w) log P(d, w)
     where n(d, w) is the number of co-occurrences of d and w (the observed data), z is the unobserved data, and
     P(d, w) = Σ_{z ∈ Z} P(w | z) P(z | d) P(d) = Σ_{z ∈ Z} P(w | z) P(d | z) P(z).

  33. Maximum likelihood. Definition: we have a density function P(x | Θ) that is governed by the set of parameters Θ; e.g., P might be a set of Gaussians and Θ could be the means and covariances. We also have a data set X = {x_1, ..., x_N}, supposedly drawn from this distribution P, and we assume these data vectors are i.i.d. with P. Then the likelihood function is
     L(Θ | X) = P(X | Θ) = Π_{i=1..N} P(x_i | Θ)
     The likelihood is thought of as a function of the parameters Θ where the data X is fixed. Our goal is to find the Θ that maximizes L:
     Θ* = arg max_Θ L(Θ | X)

  34. Jensen's inequality (for the concave log function):
     log ( Σ_j a_j g(j) ) ≥ Σ_j a_j log g(j)
     provided Σ_j a_j = 1, a_j ≥ 0, and g(j) ≥ 0.

  35. Estimation using EM.
     max L = max Σ_{d ∈ D} Σ_{w ∈ W} n(d, w) log Σ_{z ∈ Z} P(z) P(w | z) P(d | z)    (difficult!)
     Idea: start with a guess θ^t, compute an easily computed lower bound B(θ; θ^t) to the log-likelihood, and maximize the bound instead. By Jensen's inequality:
     log Σ_z P(z) P(w | z) P(d | z) = log Σ_z P(z | w, d) · [ P(z) P(w | z) P(d | z) / P(z | w, d) ]
                                    ≥ Σ_z P(z | w, d) log [ P(z) P(w | z) P(d | z) / P(z | w, d) ]
     so
     max B = max Σ_{d ∈ D} Σ_{w ∈ W} n(d, w) Σ_z P(z | w, d) [ log P(z) P(w | z) P(d | z) − log P(z | w, d) ]

  36. (1) Solve P(w | z). We introduce a Lagrange multiplier τ_z with the constraint that Σ_w P(w | z) = 1, and solve the following equation:
     ∂/∂P(w | z) [ Σ_{d ∈ D} Σ_{w ∈ W} n(d, w) Σ_z P(z | d, w) [ log P(z) P(w | z) P(d | z) − log P(z | d, w) ] + Σ_z τ_z ( Σ_w P(w | z) − 1 ) ] = 0
     ⟹ Σ_{d ∈ D} n(d, w) P(z | d, w) / P(w | z) + τ_z = 0
     ⟹ P(w | z) = − Σ_{d ∈ D} n(d, w) P(z | d, w) / τ_z
     Since Σ_w P(w | z) = 1, we get τ_z = − Σ_{w ∈ W} Σ_{d ∈ D} n(d, w) P(z | d, w), so
     P(w | z) = Σ_{d ∈ D} n(d, w) P(z | d, w) / Σ_{w ∈ W} Σ_{d ∈ D} n(d, w) P(z | d, w)

  37. (2) Solve P(d | z). We introduce a Lagrange multiplier with the constraint that Σ_d P(d | z) = 1, and get the following result:
     P(d | z) = Σ_{w ∈ W} n(d, w) P(z | d, w) / Σ_{d ∈ D} Σ_{w ∈ W} n(d, w) P(z | d, w)

  38. (3) Solve P(z). We introduce a Lagrange multiplier τ with the constraint that Σ_z P(z) = 1, and solve the following equation:
     ∂/∂P(z) [ Σ_{d ∈ D} Σ_{w ∈ W} n(d, w) Σ_z P(z | d, w) [ log P(z) P(w | z) P(d | z) − log P(z | d, w) ] + τ ( Σ_z P(z) − 1 ) ] = 0
     ⟹ Σ_{d ∈ D} Σ_{w ∈ W} n(d, w) P(z | d, w) / P(z) + τ = 0
     ⟹ P(z) = − Σ_{d ∈ D} Σ_{w ∈ W} n(d, w) P(z | d, w) / τ
     Since Σ_z P(z) = 1 and Σ_z Σ_{d,w} n(d, w) P(z | d, w) = Σ_{d,w} n(d, w), we get
     P(z) = Σ_{d ∈ D} Σ_{w ∈ W} n(d, w) P(z | d, w) / Σ_{d ∈ D} Σ_{w ∈ W} n(d, w)

  39. (4) Solve P(z | d, w). We introduce Lagrange multipliers μ_{d,w} with the constraint that Σ_z P(z | d, w) = 1, and solve the following equation:
     ∂/∂P(z | d, w) [ Σ_{d ∈ D} Σ_{w ∈ W} n(d, w) Σ_z P(z | d, w) [ log P(z) P(w | z) P(d | z) − log P(z | d, w) ] + Σ_{d,w} μ_{d,w} ( Σ_z P(z | d, w) − 1 ) ] = 0
     ⟹ log P(z) P(w | z) P(d | z) − log P(z | d, w) − 1 + μ_{d,w} = 0    for all d, w
     ⟹ P(z | d, w) = P(z) P(w | z) P(d | z) · e^{μ_{d,w} − 1}
     Normalizing with Σ_z P(z | d, w) = 1 gives e^{1 − μ_{d,w}} = Σ_z P(z) P(w | z) P(d | z), so
     P(z | d, w) = P(z) P(w | z) P(d | z) / Σ_{z'} P(z') P(w | z') P(d | z')

  40. (4) Solve P(z | d, w), part 2. The same result follows directly from Bayes' rule:
     P(z | d, w) = P(d, w, z) / P(d, w) = P(d, w | z) P(z) / P(d, w) = P(w | z) P(d | z) P(z) / Σ_{z' ∈ Z} P(w | z') P(d | z') P(z')

  41. The final update equations.
     E-step:
       P(z | d, w) = P(w | z) P(d | z) P(z) / Σ_{z' ∈ Z} P(w | z') P(d | z') P(z')
     M-step:
       P(w | z) = Σ_{d ∈ D} n(d, w) P(z | d, w) / Σ_{w ∈ W} Σ_{d ∈ D} n(d, w) P(z | d, w)
       P(d | z) = Σ_{w ∈ W} n(d, w) P(z | d, w) / Σ_{d ∈ D} Σ_{w ∈ W} n(d, w) P(z | d, w)
       P(z) = Σ_{d ∈ D} Σ_{w ∈ W} n(d, w) P(z | d, w) / Σ_{d ∈ D} Σ_{w ∈ W} n(d, w)
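
  As a compact, runnable counterpart to these update equations (the coding-design slides that follow spell out the same updates array by array), here is an illustrative Java sketch. The array names follow the slides' coding design (p_dz_n, p_wz_n, p_z_n); the class and method names are assumptions.

     // One E-step/M-step pass of PLSA EM over a dense co-occurrence matrix.
     public class PlsaEm {
         // n[d][w] = co-occurrence count of document d and word w
         static void iterate(double[][] n, double[][] p_dz_n, double[][] p_wz_n, double[] p_z_n) {
             int D = n.length, W = n[0].length, Z = p_z_n.length;
             double[][] num_dz = new double[D][Z];
             double[][] num_wz = new double[W][Z];
             double[] num_z = new double[Z];
             double totalCounts = 0;

             for (int d = 0; d < D; d++)
                 for (int w = 0; w < W; w++) {
                     if (n[d][w] == 0) continue;
                     // E-step: P(z | d, w) proportional to P(w | z) P(d | z) P(z).
                     double[] pz_dw = new double[Z];
                     double norm = 0;
                     for (int z = 0; z < Z; z++) {
                         pz_dw[z] = p_wz_n[w][z] * p_dz_n[d][z] * p_z_n[z];
                         norm += pz_dw[z];
                     }
                     // M-step accumulation: weight each posterior by n(d, w).
                     for (int z = 0; z < Z; z++) {
                         double c = n[d][w] * pz_dw[z] / norm;
                         num_dz[d][z] += c;
                         num_wz[w][z] += c;
                         num_z[z] += c;
                     }
                     totalCounts += n[d][w];
                 }

             // M-step normalization, matching the final update equations above.
             for (int z = 0; z < Z; z++) {
                 for (int d = 0; d < D; d++) p_dz_n[d][z] = num_dz[d][z] / num_z[z];
                 for (int w = 0; w < W; w++) p_wz_n[w][z] = num_wz[w][z] / num_z[z];
                 p_z_n[z] = num_z[z] / totalCounts;
             }
         }
     }

  Each call performs one E-step/M-step pass; the caller would repeat it until the change in log-likelihood falls below a threshold, as described on the next slide.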

  42. Coding Design. Variables:
     double[][] p_dz_n // p(d|z), |D|*|Z|
     double[][] p_wz_n // p(w|z), |W|*|Z|
     double[] p_z_n // p(z), |Z|
     Processing:
     1. Read the dataset from file into ArrayList<DocWordPair> doc; // all the docs, where each DocWordPair holds (word_id, word_frequency_in_doc)
     2. Parameter initialization: assign each element of p_dz_n, p_wz_n, and p_z_n a random double value, satisfying Σ_d p_dz_n[d][z] = 1, Σ_w p_wz_n[w][z] = 1, and Σ_z p_z_n[z] = 1
     3. Estimation (iterative processing):
        1. Update p_dz_n, p_wz_n, and p_z_n
        2. Calculate the log-likelihood function to check whether |log-likelihood − old_log-likelihood| < threshold
     4. Output p_dz_n, p_wz_n, and p_z_n

  43. Coding Design. Update p_dz_n:
     For each doc d {
       For each word w included in d {
         denominator = 0;
         nominator = new double[Z];
         For each topic z {
           nominator[z] = p_dz_n[d][z] * p_wz_n[w][z] * p_z_n[z];
           denominator += nominator[z];
         } // end for each topic z
         For each topic z {
           P_z_condition_d_w = nominator[z] / denominator;
           nominator_p_dz_n[d][z] += tfwd * P_z_condition_d_w;   // tfwd: frequency of w in d
           denominator_p_dz_n[z] += tfwd * P_z_condition_d_w;
         } // end for each topic z
       } // end for each word w included in d
     } // end for each doc d
     For each doc d {
       For each topic z {
         p_dz_n_new[d][z] = nominator_p_dz_n[d][z] / denominator_p_dz_n[z];
       } // end for each topic z
     } // end for each doc d

  44. Coding Design. Update p_wz_n:
     For each doc d {
       For each word w included in d {
         denominator = 0;
         nominator = new double[Z];
         For each topic z {
           nominator[z] = p_dz_n[d][z] * p_wz_n[w][z] * p_z_n[z];
           denominator += nominator[z];
         } // end for each topic z
         For each topic z {
           P_z_condition_d_w = nominator[z] / denominator;
           nominator_p_wz_n[w][z] += tfwd * P_z_condition_d_w;
           denominator_p_wz_n[z] += tfwd * P_z_condition_d_w;
         } // end for each topic z
       } // end for each word w included in d
     } // end for each doc d
     For each word w {
       For each topic z {
         p_wz_n_new[w][z] = nominator_p_wz_n[w][z] / denominator_p_wz_n[z];
       } // end for each topic z
     } // end for each word w

  45. Coding Design. Update p_z_n:
     For each doc d {
       For each word w included in d {
         denominator = 0;
         nominator = new double[Z];
         For each topic z {
           nominator[z] = p_dz_n[d][z] * p_wz_n[w][z] * p_z_n[z];
           denominator += nominator[z];
         } // end for each topic z
         For each topic z {
           P_z_condition_d_w = nominator[z] / denominator;
           nominator_p_z_n[z] += tfwd * P_z_condition_d_w;
         } // end for each topic z
         denominator_p_z_n += tfwd;
       } // end for each word w included in d
     } // end for each doc d
     For each topic z {
       p_z_n_new[z] = nominator_p_z_n[z] / denominator_p_z_n;
     } // end for each topic z

  46. Apache Mahout: Industrial Strength Machine Learning. GraphLab.

  47. Current Situation
     • Large volumes of data are now available.
     • Platforms now exist to run computations over large datasets (Hadoop, HBase).
     • Sophisticated analytics are needed to turn data into information people can use.
     • There is an active research community and there are proprietary implementations of machine learning algorithms.
     • The world needs scalable implementations of ML under an open license: ASF.

  48. History of Mahout
     • Summer 2007: developers needed scalable ML; a mailing list was formed.
     • A community formed: Apache contributors, academia, and industry; lots of initial interest.
     • The project formed under Apache Lucene on January 25, 2008.

  49. Current Code Base
     • Matrix & Vector library: memory-resident sparse & dense implementations
     • Clustering: Canopy, K-Means, Mean Shift
     • Collaborative Filtering: Taste
     • Utilities: distance measures, parameters
