
Learning Machine Learning: Algorithms and Use Cases
Dive into the world of machine learning with a focus on scalable computing, supervised vs. unsupervised algorithms, use cases such as collaborative filtering and clustering, and an overview of clustering techniques including K-Means, Hierarchical Clustering, and Spectral Clustering.
Presentation Transcript
Scalable Machine Learning
CMSC 491: Hadoop-Based Distributed Computing
Spring 2016, Adam Shook

But What is Machine Learning?
Machine learning is programming computers to optimize a performance criterion using example data or past experience. Given a data set X, can we effectively predict Y by optimizing Z?
(Introduction to Machine Learning by E. Alpaydin)

Supervised vs. Unsupervised
- Supervised: algorithms trained on labeled examples. "I know these images are of cats and these are of dogs; tell me whether this image is a cat or a dog."
- Unsupervised: algorithms trained on unlabeled examples. "Group these images together by similarity," i.e., using some kind of distance function.

Use Cases
- Collaborative Filtering: takes users' behavior and, from that, tries to find items users might like
- Clustering: takes things and puts them into groups of related things
- Classification: learns from existing categories to determine what things in a category look like, and assigns unlabeled things to the (hopefully) correct category
- Frequent Itemset Mining: analyzes items in a group and identifies which items frequently appear together

Clustering
- Dirichlet Process Clustering: Bayesian mixture modeling
- K-Means Clustering: partition n observations into k clusters
- Fuzzy K-Means: soft clusters, where a point can be in more than one cluster
- Hierarchical Clustering: hierarchy of clusters built bottom-up or top-down
- Canopy Clustering: preprocess data before K-Means or Hierarchical Clustering

More Clustering
- Latent Dirichlet Allocation: cluster words into topics and documents into mixtures of topics
- Mean Shift Clustering: find modes or clusters in 2-dimensional space, where the number of clusters is unknown
- Minhash Clustering: quickly estimate similarity between two data sets
- Spectral Clustering: cluster points using eigenvectors of matrices derived from the data

Collaborative Filtering
- Distributed Item-based Collaborative Filtering: estimates a user's preference for one item by looking at his/her preferences for similar items
- Collaborative Filtering using Parallel Matrix Factorization: among the items a user has not yet seen, predict which items the user might prefer

Classification
- Bayesian: classify objects into binary categories
- Random Forests: method for classification and regression that constructs a multitude of decision trees
[Image: dog vs. cat classification example]

Frequent Itemset Mining
- Parallel FP-Growth Algorithm: analyzes items in a group and identifies which items appear together

Algorithm Examples
- K-Means Clustering using Mahout
- Alternating Least Squares (recommender) using Spark MLlib

APACHE MAHOUT
mahout \mə-ˈhau̇t\, noun: a keeper and driver of an elephant

Overview
- Goal: build a scalable machine learning library, scalable in both data volume and processing
- Began in 2008 as a subproject of Apache Lucene, then became a top-level Apache project in 2010
- No longer accepting Java MapReduce implementations, in favor of Spark MLlib
- Addresses issues commonly found in ML libraries, which often:
  - Lack community, scalability, documentation/examples, or Apache licensing
  - Are not well-tested
  - Are research-oriented rather than production-ready
  - Are not built on existing production-quality projects
- Active community

Technical Requirements
- Linux
- Java 1.6 or greater
- Maven
- Hadoop (although not all algorithms are implemented to work on Hadoop clusters)

Building Mahout for Hadoop 2
Check out Mahout trunk with git:
$ git clone https://github.com/apache/mahout.git
Build with Maven, giving it the proper Hadoop and HBase versions:
$ cd mahout
$ mvn install -DskipTests \
    -Dhadoop2 -Dhadoop2.version=2.6.0 \
    -Dhbase.version=1.0.0
$ cd ..
$ mv mahout /usr/share/491s15
# Edit .bashrc/.bash_profile to add a $MAHOUT_HOME variable,
# add $MAHOUT_HOME/bin to the PATH, and
# export HADOOP_CONF_DIR=/usr/share/491s15/hadoop/etc/hadoop

K-Means Clustering
[Figure: K-Means iterations. Points are assigned to the nearest of three centroids (c1, c2, c3); the centroids are then recomputed as the mean of their assigned points and the points are reassigned, repeating until the clusters stabilize.]

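Before turning to Mahout, a minimal single-machine sketch of the algorithm itself may help. This is plain Java rather than Mahout's distributed implementation; the sample points, k = 3, and the fixed iteration count are illustrative assumptions:

import java.util.Arrays;
import java.util.Random;

/** Minimal single-machine K-Means (Lloyd's algorithm) on 2-D points. */
public class KMeansSketch {
    public static void main(String[] args) {
        double[][] points = { {1,1}, {1.5,2}, {3,4}, {5,7}, {3.5,5}, {4.5,5}, {3.5,4.5} };
        int k = 3, iterations = 10;
        Random rnd = new Random(42);

        // Initialize centroids to k randomly chosen input points
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++)
            centroids[c] = points[rnd.nextInt(points.length)].clone();

        int[] assignment = new int[points.length];
        for (int iter = 0; iter < iterations; iter++) {
            // Assignment step: each point joins its nearest centroid
            for (int p = 0; p < points.length; p++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = sqDist(points[p], centroids[c]);
                    if (d < best) { best = d; assignment[p] = c; }
                }
            }
            // Update step: move each centroid to the mean of its points
            for (int c = 0; c < k; c++) {
                double sx = 0, sy = 0; int n = 0;
                for (int p = 0; p < points.length; p++)
                    if (assignment[p] == c) { sx += points[p][0]; sy += points[p][1]; n++; }
                if (n > 0) centroids[c] = new double[] { sx / n, sy / n };
            }
        }
        System.out.println("Assignments: " + Arrays.toString(assignment));
    }

    static double sqDist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }
}

Mahout essentially distributes these same two steps: mappers assign points to the nearest centroid, and reducers recompute the centroids.
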
K-Means Clustering Example
Let's cluster the Reuters data set together: a bunch (21,578, to be exact) of hand-classified news articles from the greatest year ever, 1987.
Steps:
1. Generate Sequence Files from the data
2. Generate vectors from the Sequence Files
3. Run k-means

K-Means Clustering: Convert the dataset into a Sequence File
Download and extract the SGML files:
$ wget http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
$ mkdir reuters-sgm
$ tar -xf reuters21578.tar.gz -C reuters-sgm/
Extract content from SGML to text files:
$ mahout org.apache.lucene.benchmark.utils.ExtractReuters \
    reuters-sgm/ reuters-out/
$ hdfs dfs -put reuters-out .  # Takes a while...
Use the seqdirectory tool to convert the text files into a Hadoop Sequence File:
$ mahout seqdirectory -i reuters-out \
    -o reuters-out-seqdir -c UTF-8 -chunk 5

Tangent: Writing to Sequence Files

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Say you have some documents array
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("testdata/part-00000");

// Keys and values are both Text: document ID -> document content
SequenceFile.Writer writer =
    new SequenceFile.Writer(fs, conf, path, Text.class, Text.class);
for (int i = 0; i < MAX_DOCS; ++i) {
    writer.append(new Text(documents[i].getId()),
                  new Text(documents[i].getContent()));
}
writer.close();

Original File
$ cat reut2-000.sgm-30.txt
26-FEB-1987 15:43:14.36
U.S. TAX WRITERS SEEK ESTATE TAX CURBS, RAISING 6.7 BILLION DLRS THRU 1991

Now, in the Sequence File
Key:    /reut2-000.sgm-30.txt
Value:  26-FEB-1987 15:43:14.36 U.S. TAX WRITERS SEEK ESTATE TAX CURBS, RAISING 6.7 BILLION DLRS THRU 1991
(* The value contains newline characters.)

K-Means Clustering: Generate Vectors from the Sequence Files
Steps:
1. Compute the dictionary
2. Assign integers to words
3. Compute feature weights
4. Create a vector for each document using the word-to-integer mapping and the feature weights
Or simply run seq2sparse:
$ mahout seq2sparse \
    -i reuters-out-seqdir/ \
    -o reuters-out-seqdir-sparse-kmeans

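The feature weights used for clustering here are TF-IDF scores (note that the kmeans step below reads the tfidf-vectors output directory). As a sketch of the standard definition, with tf(t, d) the count of term t in document d, N the number of documents, and df(t) the number of documents containing t:

\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \cdot \log \frac{N}{\mathrm{df}(t)}

The exact smoothing and normalization Mahout applies depend on the seq2sparse options; the raw vector on the next slide shows plain term frequencies before IDF weighting.
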
Document to Integers to Vector
One document of many! Original text:
26-FEB-1987 15:43:14.36 U.S. TAX WRITERS SEEK ESTATE TAX CURBS, RAISING 6.7 BILLION DLRS THRU 1991

Dictionary (term -> integer):
14.36 -> 2737    15 -> 2962       1991 -> 3960      26 -> 5405
43 -> 8361       6.7 -> 10882     billion -> 15528  curbs -> 19078
dlrs -> 20362    estate -> 21578  feb -> 22224      raising -> 33629
seek -> 35909    tax -> 38507     u.s -> 39687      writers -> 41511

Resulting vector (term ID : term frequency; "tax" appears twice):
{ 2737:1.0, 2962:1.0, 3960:1.0, 5405:1.0, 8361:1.0, 10882:1.0, 15528:1.0, 19078:1.0, 20362:1.0, 21578:1.0, 22224:1.0, 33629:1.0, 35909:1.0, 38507:2.0, 39687:1.0, 41511:1.0 }

After seq2sparse
Key:    /reut2-000.sgm-30.txt
Value:  {3960:1.0, 21578:1.0, 33629:1.0, 41511:1.0, 8361:1.0, 10882:1.0, 5405:1.0, 22224:1.0, 15528:1.0, 38507:2.0, 39687:1.0, 2737:1.0, 35909:1.0, 2962:1.0, 19078:1.0, 20362:1.0}

K-Means Clustering: Run the kmeans program
$ mahout kmeans \
    -i reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \
    -c reuters-kmeans-clusters \
    -o reuters-kmeans \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -cd 0.1 -x 10 -k 20
Key parameters:
- dm: distance measure
- cd: convergence delta
- x: maximum number of iterations
- k: number of clusters
- cl: (optional) create cluster assignments for the input documents after the centroids converge

Inspect the clusters
$ bin/mahout clusterdump \
    -i reuters-kmeans/clusters-*-final \
    -d reuters-out-seqdir-sparse-kmeans/dictionary.file-0 \
    -dt sequencefile -b 100 -n 10

Sample output (truncated):
{"identifier":"VL-316","r":[{"00":0.497},{"00.14":0.408},{"00.18":0.408},{"00.56":...
Top Terms:
    president  => 3.4944214993103375
    chief      => 3.3234287659025012
    executive  => 3.16472187060367
    officer    => 3.143776322498974
    chairman   => 2.5400053276308587
    vice       => 1.9913627557428164
    named      => 1.9851312619198411
    said       => 1.9030630459350324
    company    => 1.782354193948521
    names      => 1.4052995438811444

FAQs
- How do I get rid of useless words?
  Increase minSupport and/or decrease the document-frequency percentage (dfPercent). Use a StopwordsAnalyzer.
- How do I see the document-to-cluster assignments?
  Run the clustering process at the end of centroid generation using the -cl option.
- How do I choose an appropriate weighting?
  For long text, go with tf-idf. Use normalization if documents differ in length.
- How do I run this on a cluster?
  Set HADOOP_CONF_DIR to point to your Hadoop cluster's conf directory.
- How does this scale?
  Use a small value of k to partially cluster the data, then do a full clustering on each partial cluster.

FAQs
- How do I choose k?
  Figure it out based on the data you have, by trial and error. Or use Canopy Clustering with a distance threshold to estimate it, or use Spectral Clustering.
- How do I improve the similarity measurement?
  Not all features are equal: a small weight difference for certain feature types can create a large semantic difference. Use WeightedDistanceMeasure, or write a custom DistanceMeasure.

Recommendations
Help users find items they might like based on their historical preferences.
Based on an example by Sebastian Schelter in "Distributed Itembased Collaborative Filtering with Apache Mahout".

Recommendations
         Matrix   Alien   Inception
Alice      5        1         4
Bob        ?        2         5
Peter      4        3         2

Recommendations: Algorithm
- Neighborhood-based approach
- Works by finding similarly rated items in the user-item matrix (e.g., using cosine similarity, Pearson correlation, or the Tanimoto coefficient)
- Estimates a user's preference for an item by looking at his/her preferences for similar items

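As a concrete example, cosine similarity between items i and j, where each item is the vector of ratings it received and the sums run over users u who rated both items, is:

\mathrm{sim}(i,j) = \frac{\sum_{u} r_{u,i} \, r_{u,j}}{\sqrt{\sum_{u} r_{u,i}^2} \; \sqrt{\sum_{u} r_{u,j}^2}}

This is only a sketch of the textbook definition: plain cosine of non-negative ratings is never negative, so negative similarities like those appearing later in this example must come from a mean-centered, Pearson-style variant.
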
Recommendations: Prediction
Estimate Bob's preference for The Matrix:
1. Look at all items that (a) are similar to The Matrix and (b) have been rated by Bob => Alien, Inception
2. Estimate the unknown preference with a weighted sum

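Written out, the weighted sum in step 2 is the standard item-based prediction, where S is the set of similar items the user has rated and the absolute values in the denominator handle negative similarities:

\hat{r}_{u,i} = \frac{\sum_{j \in S} \mathrm{sim}(i,j) \, r_{u,j}}{\sum_{j \in S} \left| \mathrm{sim}(i,j) \right|}
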
Recommendations: MapReduce Phase 1, Map
Make the user the key:
(Alice, Matrix, 5)     -> Alice: (Matrix, 5)
(Alice, Alien, 1)      -> Alice: (Alien, 1)
(Alice, Inception, 4)  -> Alice: (Inception, 4)
(Bob, Alien, 2)        -> Bob: (Alien, 2)
(Bob, Inception, 5)    -> Bob: (Inception, 5)
(Peter, Matrix, 4)     -> Peter: (Matrix, 4)
(Peter, Alien, 3)      -> Peter: (Alien, 3)
(Peter, Inception, 2)  -> Peter: (Inception, 2)

Recommendations: MapReduce Phase 1, Reduce
Create an inverted index (one rating row per user):
Alice: (Matrix, 5) (Alien, 1) (Inception, 4)
Bob:   (Alien, 2) (Inception, 5)
Peter: (Matrix, 4) (Alien, 3) (Inception, 2)

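To make phase 1 concrete, here is a minimal Hadoop MapReduce sketch. It is illustrative rather than Mahout's actual implementation; the Phase1 class name and the comma-separated user,item,rating input format are assumptions:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Phase1 {
    // Map: re-key each (user, item, rating) triple by user
    public static class UserKeyMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split(",");   // user,item,rating
            ctx.write(new Text(f[0]), new Text("(" + f[1] + ", " + f[2] + ")"));
        }
    }

    // Reduce: concatenate each user's (item, rating) pairs into one row
    public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text user, Iterable<Text> pairs, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder row = new StringBuilder();
            for (Text p : pairs) {
                if (row.length() > 0) row.append(' ');
                row.append(p.toString());
            }
            ctx.write(user, new Text(row.toString()));
        }
    }
}
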
Recommendations: MapReduce Phase 2, Map
Isolate all co-occurred ratings (all cases where a user rated both items):
Alice: (Matrix, 5) (Alien, 1) (Inception, 4)  -> Matrix, Alien (5, 1); Matrix, Inception (5, 4); Alien, Inception (1, 4)
Bob:   (Alien, 2) (Inception, 5)              -> Alien, Inception (2, 5)
Peter: (Matrix, 4) (Alien, 3) (Inception, 2)  -> Matrix, Alien (4, 3); Matrix, Inception (4, 2); Alien, Inception (3, 2)

Recommendations: MapReduce Phase 2, Reduce
Compute similarities from the co-occurred ratings:
Matrix, Alien     (5, 1), (4, 3)         -> Matrix, Alien (-0.47)
Matrix, Inception (5, 4), (4, 2)         -> Matrix, Inception (0.47)
Alien, Inception  (1, 4), (2, 5), (3, 2) -> Alien, Inception (-0.63)

Recommendations: Calculate the Weighted Sum
Bob's estimated preference for The Matrix (note the denominator sums the absolute values of the similarities):
(-0.47 * 2 + 0.47 * 5) / (0.47 + 0.47) = 1.5

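A tiny sketch of that prediction step in plain Java; the hard-coded similarities and Bob's ratings are taken from the slides above:

/** Item-based CF prediction: weighted average of the user's ratings
 *  of similar items, weighted by (absolute) similarity. */
public class WeightedSum {
    static double predict(double[] similarities, double[] ratings) {
        double num = 0, den = 0;
        for (int j = 0; j < similarities.length; j++) {
            num += similarities[j] * ratings[j];
            den += Math.abs(similarities[j]);
        }
        return num / den;
    }

    public static void main(String[] args) {
        // sim(Matrix, Alien) = -0.47, sim(Matrix, Inception) = 0.47
        double[] sims = { -0.47, 0.47 };
        // Bob rated Alien = 2, Inception = 5
        double[] bobsRatings = { 2, 5 };
        System.out.println(predict(sims, bobsRatings)); // prints ~1.5
    }
}
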
Recommendations
         Matrix   Alien   Inception
Alice      5        1         4
Bob       1.5       2         5
Peter      4        3         2

Implementation in Spark
Alternating Least Squares (ALS):
- Accepts tuples of (user, product, rating) as training data
- Accepts tuples of (user, product) to predict their ratings
Example: https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html

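For reference, a Java sketch of that API, closely following the linked MLlib documentation example; the ratings file path, the hyperparameters (rank 10, 10 iterations, lambda 0.01), and the user/product IDs are illustrative assumptions:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

public class AlsExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ALSExample");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load ratings of the form "user,product,rating" (path is an assumption)
        JavaRDD<Rating> ratings = sc.textFile("ratings.csv").map(line -> {
            String[] f = line.split(",");
            return new Rating(Integer.parseInt(f[0]),
                              Integer.parseInt(f[1]),
                              Double.parseDouble(f[2]));
        });

        // Train a matrix factorization model: rank 10, 10 iterations, lambda 0.01
        MatrixFactorizationModel model =
            ALS.train(JavaRDD.toRDD(ratings), 10, 10, 0.01);

        // Predict one (user, product) rating, e.g. user 1 and product 0
        double predicted = model.predict(1, 0);
        System.out.println("Predicted rating: " + predicted);

        sc.stop();
    }
}
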
Implementations in Mahout: ItemSimilarityJob
Computes all item-to-item similarities.
Various configuration options:
- Similarity measure to use (cosine, Pearson correlation, etc.)
- Maximum number of similar items per item
- Maximum number of co-occurrences to consider
Input: CSV file (userID, itemID, value)
Output: pairs of itemIDs with their associated similarity

Implementations in Mahout: RecommenderJob
A distributed item-based recommender.
Various configuration options:
- Similarity measure to use
- Number of recommendations per user
- Filter out some users or items
Input: CSV file (userID, itemID, value)
Output: userIDs with recommended itemIDs and their scores

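One way to launch RecommenderJob programmatically is through Hadoop's ToolRunner. This is a sketch: the input/output paths and option values are assumptions, and the flag names follow the Mahout 0.x command line. ItemSimilarityJob from the previous slide can be driven the same way:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class RunRecommender {
    public static void main(String[] args) throws Exception {
        // Paths and option values here are illustrative assumptions
        ToolRunner.run(new Configuration(), new RecommenderJob(), new String[] {
            "--input", "ratings.csv",              // (userID,itemID,value) triples
            "--output", "recommendations",
            "--similarityClassname", "SIMILARITY_COSINE",
            "--numRecommendations", "10"
        });
    }
}
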
References
- http://mahout.apache.org
- http://spark.apache.org
- http://isabel-drost.de/hadoop/slides/collabMahout.pdf
- http://www.slideshare.net/OReillyOSCON/hands-on-mahout#
- http://www.slideshare.net/urilavi/intro-to-mahout