Parallel Recommender Algorithm Performance Comparison in Hadoop-Based Frameworks

Explore the performance of a parallel recommender algorithm across three Hadoop-based frameworks, focusing on the evaluation of highly scalable parallel frameworks and algorithms for recommendation systems. The study includes implementing collaborative filtering using MPJ Express integrated with Hadoop and benchmarking the performance against Mahout, Spark, and Giraph.

  • Recommender algorithm
  • Hadoop
  • Parallel computing
  • MPJ Express
  • Frameworks


Presentation Transcript


  1. Performance Comparison of a Parallel Recommender Algorithm across Three Hadoop-based Frameworks. HPML Conference, Lyon, Sept 2018. Christina P. A. Diedhiou; Supervisor: Dr. Bryan Carpenter; School of Computing

  2. Objectives of the Session
     • Aim and objectives of the research
     • Overview of Hadoop and MPJ Express
     • Overview of recommender systems
     • Implementation of ALSWR
     • Evaluation and comparison
     • Future work

  3. Aims and Objectives of the Research
     • Evaluate highly scalable parallel frameworks and algorithms for recommendation systems
     • Main focus: a Java message passing interface, MPJ Express, integrated with Hadoop
     • Evaluate MPJ Express, an open-source Java message passing library for parallel computing
     • Use MPJ Express to implement collaborative filtering on large datasets with the ALSWR algorithm
     • Benchmark the performance and measure the parallel speedup
     • Compare our results with other frameworks: Mahout, Spark, Giraph

  4. Hadoop
     • Hadoop: a framework that stores and processes voluminous amounts of data in a reliable and fault-tolerant manner; Hadoop 2 was released in 2014
     • YARN components:
       • Resource Manager: manages and allocates resources across the cluster
       • Node Manager: runs on every node and reports to the Resource Manager
       • Application Master: specific to each job; manages operations within containers and ensures there are enough containers

  5. MPJ Express
     • Open-source Java MPI-like library that allows application developers to write and execute parallel applications on multicore processors and compute clusters
     • Since 2015, MPJ Express has provided a YARN-based runtime

  6. Integration of MPJ Express with YARN
     mpjrun.sh -yarn -np 2 -dev niodev MPJApp.jar
     1) Submit the YARN application
     2) Request container allocation for the Application Master (AM)
     3) The AM generates a Container Launch Context (CLC) and allocates a container to each node
     4) Each mpj-yarn-wrapper sends the output and error streams of the program to the MPJYarnClient

  7. Recommender Systems
     • What is a recommender system? Software tools and techniques providing suggestions to users on items they might want or like
     • Example systems: Netflix, Google News, YouTube, Amazon
     • Main approaches: content-based, collaborative filtering, demographic

  8. Collaborative Filtering
     • Based on users' purchase or decision histories
     • Rationale: two individuals who share the same opinion on one item are likely to have similar tastes on another item

  9. Alternating Least Squares with Lambda Regularization (ALSWR)
     ALSWR is an iterative algorithm that alternates between fixing one of two factor matrices until convergence is reached.
     • Step 1: Initialize matrix M pseudorandomly
     • Step 2: Fix M; solve for U by minimizing the objective function (the sum of squared errors)
     • Step 3: Fix U; solve for M similarly
     • Steps 2 and 3 are repeated until a stopping criterion is satisfied
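The per-user update in step 2 can be sketched in plain Java. This is a minimal illustrative sketch, not the talk's actual implementation: the class and method names (`AlswrStep`, `solve`, `updateUser`) are invented here, and it shows only that each user's rank-k factor vector is found by solving a small k-by-k regularized normal-equation system built from that user's observed ratings.

```java
// Illustrative single-user ALSWR half-step (names are not from the
// original code).
public class AlswrStep {

    // Solve the k x k linear system A x = b by Gaussian elimination
    // with partial pivoting (A and b are modified in place).
    static double[] solve(double[][] A, double[] b) {
        int k = b.length;
        for (int col = 0; col < k; col++) {
            int piv = col;
            for (int r = col + 1; r < k; r++)
                if (Math.abs(A[r][col]) > Math.abs(A[piv][col])) piv = r;
            double[] tr = A[col]; A[col] = A[piv]; A[piv] = tr;
            double tb = b[col]; b[col] = b[piv]; b[piv] = tb;
            for (int r = col + 1; r < k; r++) {
                double f = A[r][col] / A[col][col];
                for (int c = col; c < k; c++) A[r][c] -= f * A[col][c];
                b[r] -= f * b[col];
            }
        }
        double[] x = new double[k];
        for (int r = k - 1; r >= 0; r--) {
            double s = b[r];
            for (int c = r + 1; c < k; c++) s -= A[r][c] * x[c];
            x[r] = s / A[r][r];
        }
        return x;
    }

    // Step 2 for one user: with the item factors M fixed (k rows, one
    // column per item), solve (M_I M_I^T + lambda * n_i * I) u_i = M_I r_i,
    // where I is the set of items this user rated and n_i = |I|.
    static double[] updateUser(double[][] M, int[] items, double[] ratings,
                               double lambda) {
        int k = M.length;
        double[][] A = new double[k][k];
        double[] b = new double[k];
        for (int t = 0; t < items.length; t++) {
            int j = items[t];
            for (int a = 0; a < k; a++) {
                b[a] += M[a][j] * ratings[t];
                for (int c = 0; c < k; c++) A[a][c] += M[a][j] * M[c][j];
            }
        }
        // The "weighted" regularization of ALSWR: lambda is scaled by
        // the number of ratings this user contributed.
        for (int a = 0; a < k; a++) A[a][a] += lambda * items.length;
        return solve(A, b);
    }
}
```

Step 3 is symmetric: fix U and solve the same kind of small system once per item.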

  10. Implementation of ALSWR
     • Step 1 uses the locally held matrix R, decomposed by rows and columns (users and items)
     • Step 2 updates the items (movies); between steps (1) and (2), all locally computed user elements are gathered and broadcast
     • Step 3 updates the users; between steps (2) and (3), all locally computed item elements must be gathered together and broadcast to the processing nodes
     • Communication between the nodes of the cluster is established by collective communication

  11. Collective Communication
     • Allgather: gathers elements from all processes and distributes the combined result back to every process
     • Allreduce: sums values over all processes, then distributes the result to all processes
     • Barrier: synchronisation; no process proceeds beyond the barrier until all processes have reached it
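The role of Allgather between the ALSWR steps can be illustrated in a single JVM. This sketch assumes the block-column decomposition described above: each "process" holds the factor columns it computed locally, and after the collective every process holds the concatenation of all blocks. In MPJ Express itself this would be a call in the mpiJava-style API (roughly `MPI.COMM_WORLD.Allgather(...)` on the flattened blocks); the class and method names below are illustrative only.

```java
// Single-JVM illustration of the Allgather data movement (names are
// illustrative, not from the original code).
public class AllgatherSketch {

    // localBlocks[p] is the flattened block of factor columns computed
    // by "process" p; the result is the full array that every process
    // would hold after the collective completes.
    static double[] allgather(double[][] localBlocks) {
        int total = 0;
        for (double[] block : localBlocks) total += block.length;
        double[] full = new double[total];
        int off = 0;
        for (double[] block : localBlocks) {
            System.arraycopy(block, 0, full, off, block.length);
            off += block.length;
        }
        return full;
    }
}
```

The order of concatenation follows process rank, which is why the block decomposition of U and M must be contiguous by rank for the gathered matrix to line up.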

  12. Experiments
     • Data: MovieLens user ratings (20+ million ratings); Yahoo Music user ratings (717+ million ratings)
     • Hardware: Linux cluster with 4 nodes, 16+ cores
     • Software: MPJ Express, Java, Hadoop 2
     • Algorithm: Alternating Least Squares with Weighted Regularization (ALSWR)
     • Data storage: Hadoop Distributed File System (HDFS)
     • Method: configure the nodes with Hadoop and YARN; add the dataset to HDFS; partition the dataset with MapReduce or MPJ code; run the ALSWR Java code
     • Experiments: sequential (1 process) vs. parallel speedup (many processes); comparison with Spark, Mahout, Giraph

  13. Results: MovieLens Data, MPJ vs Spark vs Mahout vs Giraph
     [Chart: time in minutes vs. number of processes on the MovieLens dataset, for MPJ Express, Spark, Mahout, and Giraph]
     • Good parallel speedup for MPJ Express; runtime decreases as the number of cores increases
     • No variance for Mahout from 4 cores and above
     • MPJ Express is on average 13.9 times faster than Mahout
     • MPJ Express is on average 1.4 times faster than Spark

  14. Results: MovieLens Data, a Closer Look at MPJ Express and Spark
     [Chart: time in minutes vs. number of processes for MPJ Express and Spark on the MovieLens dataset]
     [Chart: parallel speedup vs. number of processes for MPJ Express and Spark on the MovieLens dataset]
     • Constant progress, but parallel computation could still be improved

  15. Results: Yahoo Music Data, MPJ vs Spark
     Time in minutes on the Yahoo training dataset:

     Number of processes | 1   | 2   | 4    | 8     | 12    | 16
     MPJ Express         | 298 | 142 | 84.4 | 45.56 | 33.15 | 28.35
     Spark               | 417 | 217 | 136  | 65    | 54    | 55

     [Chart: parallel speedup vs. number of processes for MPJ Express and Spark on the Yahoo dataset]
     • No computation for Mahout due to the size of the data
     • Better parallel speedup achieved for MPJ Express and Spark

  16. Future Work
     • Further study of the Spark results
     • More experiments on Giraph
     • Assess and compare accuracy (RMSE)
     • Generate synthetic datasets to reach social-media scale

  17. Further Reading / References
     • Zhou, Y., Wilkinson, D., Schreiber, R., & Pan, R. (2008). Large-scale parallel collaborative filtering for the Netflix prize. Lecture Notes in Computer Science, 5034, 337-348. http://doi.org/10.1007/978-3-540-68880-8_32
     • Kabiljo, M., & Ilic, A. (2015). Recommending items to more than a billion people. Retrieved from https://www.reddit.com/r/MachineLearning/comments/38d7xu/recommending_items_to_more_than_a_billion_people/
     • Shafi, A. (2014). MPJ Express: an implementation of Message Passing Interface (MPI) in Java. http://www.powershow.com/view1/154baa-ZDc1Z/MPJ_Express_An_Implementation_of_Message_Passing_Interface_MPI_in_Java_powerpoint_ppt_presentation
     • http://mpj-express.org/
     • https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Purpose

  18. Contacts Christina P. A. Diedhiou christina.diedhiou@port.ac.uk Dr. Bryan Carpenter bryan.carpenter@port.ac.uk
