
Matrix Multiplication in Hadoop: Approach and Performance Insights
Explore the implementation of matrix multiplication in Hadoop using map-reduce parallelism, covering matrix representation, map-reduce algorithms, Java implementation details, and performance benchmarks, including the challenges of handling enormous matrices and the benefits of sparse matrix formats.
Presentation Transcript
Matrix Multiplication in Hadoop Siddharth Saraph
Matrix Multiplication. Matrices are very practical across the sciences, engineering, statistics, and more. Multiplication is a fundamental nontrivial matrix operation: simpler than something like matrix inversion, although the asymptotic time complexity is the same.
Matrix Multiplication Problem: some people want to use enormous matrices that cannot be handled on one machine, so we take advantage of map-reduce parallelism to approach the problem. Heuristics for scale: 10,000x10,000 means 100,000,000 entries; 100,000x100,000 means 10,000,000,000 entries (about 80 GB of doubles at 8 bytes each). In practice, such matrices are sparse.
First Step: Matrix Representation. How do we represent a matrix as input to a map-reduce job? A convenient format for sparse matrices is the coordinate list: each entry is a (row index, col index, value) triple (worked on the board). Entries with value 0 are omitted, and entries can appear in an arbitrary order. A parsing sketch follows.
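The coordinate-list record is simple enough to show in code. Here is a minimal parsing sketch, assuming whitespace-separated text lines of the form "rowIndex colIndex value"; the class name CooEntry is my own, not from the project:

```java
// One coordinate-list (COO) entry: (row index, col index, value).
// Zero-valued entries are simply never written, and line order is arbitrary.
public final class CooEntry {
  public final long row;
  public final long col;
  public final double value;

  public CooEntry(long row, long col, double value) {
    this.row = row;
    this.col = col;
    this.value = value;
  }

  // Parses a line such as "42 17 -3.5".
  public static CooEntry parse(String line) {
    String[] t = line.trim().split("\\s+");
    return new CooEntry(Long.parseLong(t[0]), Long.parseLong(t[1]),
                        Double.parseDouble(t[2]));
  }
}
```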
Second Step: Map-Reduce Algorithm. The simple entrywise method, and various related block methods in which the matrices are partitioned into smaller blocks and logically processed as blocks. There is an excess of notation and indices to keep track of, and it is easy to get lost. The algorithm itself was worked through on the chalkboard; a sketch of the entrywise method follows.
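Since the algorithm was presented on the chalkboard rather than on a slide, here is a minimal sketch of the simple entrywise method for C = A x B, not the author's actual code. It assumes coordinate-list lines carry a matrix tag ("A i k v" or "B k j v") and that the dimensions are passed through the job configuration under the hypothetical keys rows.a and cols.b:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class EntrywiseMatMul {

  // Map: replicate each A(i,k) across all columns j of B, and each B(k,j)
  // across all rows i of A, keyed by the output cell (i,j).
  public static class EntryMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      int rowsA = ctx.getConfiguration().getInt("rows.a", 0);  // hypothetical config keys
      int colsB = ctx.getConfiguration().getInt("cols.b", 0);
      String[] t = line.toString().split("\\s+");  // "A i k v" or "B k j v"
      if (t[0].equals("A")) {
        for (int j = 0; j < colsB; j++) {
          ctx.write(new Text(t[1] + "," + j), new Text("A," + t[2] + "," + t[3]));
        }
      } else {
        for (int i = 0; i < rowsA; i++) {
          ctx.write(new Text(i + "," + t[2]), new Text("B," + t[1] + "," + t[3]));
        }
      }
    }
  }

  // Reduce: for one output cell (i,j), join the A and B values on the shared
  // index k and sum the products.
  public static class SumReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text cell, Iterable<Text> vals, Context ctx)
        throws IOException, InterruptedException {
      Map<Integer, Double> a = new HashMap<>();
      Map<Integer, Double> b = new HashMap<>();
      for (Text v : vals) {
        String[] t = v.toString().split(",");
        (t[0].equals("A") ? a : b).put(Integer.parseInt(t[1]), Double.parseDouble(t[2]));
      }
      double sum = 0.0;
      for (Map.Entry<Integer, Double> e : a.entrySet()) {
        Double bv = b.get(e.getKey());
        if (bv != null) sum += e.getValue() * bv;
      }
      if (sum != 0.0) ctx.write(cell, new Text(Double.toString(sum)));
    }
  }
}
```

Note how the mapper replicates every entry of A once per column of B (and every entry of B once per row of A); this flood of intermediate data is one reason the block methods, which replicate whole blocks instead of single entries, are attractive.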
Implementation. Java, not Hadoop streaming. Why? This seemed like a more complex project that would require more control. Custom Key and Value classes, plus a custom Partitioner class for the block method to distribute keys to reducers (a sketch follows). Also an opportunity to learn Java.
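The slides do not show the Partitioner, so here is a hedged sketch of the idea, assuming keys are Text values of the form "blockRow,blockCol,..."; the real project used custom key classes, so treat this as illustrative only:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes every record belonging to the same output block to the same reducer,
// regardless of any extra fields the key may carry.
public class BlockPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numReducers) {
    String[] t = key.toString().split(",");
    int hash = 31 * Integer.parseInt(t[0]) + Integer.parseInt(t[1]);
    return (hash & Integer.MAX_VALUE) % numReducers;  // non-negative reducer index
  }
}
```

A partitioner like this would be registered on the job with job.setPartitionerClass(BlockPartitioner.class).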
Performance. Random matrix generator parameters: row dimension, column dimension, density; entries are doubles in (-1000, 1000). Many parameters to vary: matrix dimensions, maximum entry value, number of splits, number of reducers, matrix density. Sparse 1000x1000 at density 0.1, 6 splits, 12 reducers, 2.9 MB input: about 5 minutes. Sparse 5000x5000 at density 0.1, 20 splits, 20 reducers, 73 MB input: over 1 hour.
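The generator's parameters are listed but not its code; here is a hypothetical re-creation that emits coordinate-list lines to standard output (the class name and argument order are my own):

```java
import java.util.Random;

// Generates a random sparse matrix in coordinate-list format: each entry is
// kept with probability `density`, values roughly uniform in (-1000, 1000).
public class RandomCooGenerator {
  public static void main(String[] args) {
    int rows = Integer.parseInt(args[0]);
    int cols = Integer.parseInt(args[1]);
    double density = Double.parseDouble(args[2]);
    Random rng = new Random();
    for (int i = 0; i < rows; i++) {
      for (int j = 0; j < cols; j++) {
        if (rng.nextDouble() < density) {
          double v = rng.nextDouble() * 2000.0 - 1000.0;
          System.out.println(i + " " + j + " " + v);
        }
      }
    }
  }
}
```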
MATLAB Performance. Windows 7, MATLAB 2015a 64-bit. Engineering Library cluster, 4 GB RAM: a full (dense) 13,000x13,000 random matrix of doubles was about the largest that could fit in memory; multiplication time about 2 minutes. LaFortune cluster, 16 GB RAM: sparse 20,000x20,000 matrix at density 0.1; multiplication time about 2 minutes 30 seconds.
Improvements? A different matrix representation? Maybe there are better ways to represent sparse matrices than coordinate list format, such as compressed sparse row. Strassen's algorithm? It runs in roughly O(n^2.8) time, but gives benefits of only about 10% at matrix dimensions of a few thousand. A different algorithm altogether? A different platform, such as Spark?
Conclusion. What happened to the enormous matrices? Based on this project, I do not think Hadoop is a practical choice for implementing matrix multiplication. I did not find any implementation of matrix multiplication in Hadoop that provides a significant benefit over a local machine.