
Harnessing the Power of Random Forest in Distributed R
Explore Distributed R for building Random Forest models in a distributed environment: the features, advantages, and differences of the hpdRF_parallelForest and hpdRF_parallelTree algorithms; how data is distributed across machines, histograms are computed, and optimal splits are found; and how the interface resembles the randomForest function, with a few additional arguments and differences in output.
Random Forest in Distributed R
Arash Fard, Vishrut Gupta
Distributed R
Distributed R is a scalable, high-performance platform for the R language that can leverage the resources of multiple machines.
Easy to use:
  library(distributedR)
  distributedR_start()
GitHub page: https://github.com/vertica/DistributedR/
Coming soon: CRAN installation
Distributed R
- Standard master-worker framework
- Distributed data structures: darray (distributed array), dframe (distributed data frame), dlist (distributed list)
- Parallel execution: the foreach function is executed remotely
- The master is a normal R console; workers can run standard R packages
[Diagram: a DistR master coordinating Workers 1-4]
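To make the programming model concrete, here is a minimal sketch of creating a distributed array and operating on its partitions with foreach. It assumes the darray/splits/update/getpartition pattern from the Distributed R user guide; treat it as an illustration, not the presenters' code.

  library(distributedR)
  distributedR_start()

  # A 4x4 distributed array split into four 2x2 partitions
  A <- darray(dim = c(4, 4), blocks = c(2, 2))

  # Run a function remotely on each partition; splits() hands a partition
  # to a worker and update() writes the modified partition back
  foreach(i, 1:npartitions(A), function(a = splits(A, i), idx = i) {
    a <- matrix(idx, nrow(a), ncol(a))  # fill each partition with its index
    update(a)
  })

  getpartition(A, 1)  # fetch one partition back to the master
  distributedR_shutdown()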
Random Forest in Distributed R
hpdRF_parallelForest
- Great for small/medium-sized data
- Embarrassingly parallel: each worker builds a fraction of the trees
- Each worker needs the entire data
- Calls the randomForest package
- Very memory intensive; doesn't scale well
hpdRF_parallelTree
- Great for large data (1 GB and up)
- Not embarrassingly parallel
- Doesn't require all of the data to be on one worker
- Scales better than hpdRF_parallelForest
- Smaller output model
- Larger Distributed R overhead
- Approximate algorithm
hpdRF_parallelTree details
Distribute the data across machines, then, recursively on the leaf nodes (see the sketch below):
1. Compute local histograms
2. Combine them into global histograms and compute the optimal split
3. Workers work together to find the best split
4. Update the tree with the decision rule and create new leaf nodes
[Figure: scan feature X7 to create a histogram over bins 1-8, compute the best split (X7 > 5) from the histogram, and build the tree recursively]
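As an illustration of steps 1-3, here is a plain-R sketch (not the actual hpdRF_parallelTree code) of histogram-based split finding for one numeric feature at one node: bin the feature into nBins histogram buckets, accumulate per-class counts, and scan the bin boundaries for the split that minimizes weighted Gini impurity. The helper name is hypothetical.

  best_split_from_histogram <- function(x, y, nBins = 256) {
    # Step 1: local histogram of class counts per bin
    breaks <- seq(min(x), max(x), length.out = nBins + 1)
    bins   <- cut(x, breaks, include.lowest = TRUE)
    counts <- table(bins, y)   # nBins x nClasses
    # Step 2: in the distributed setting, each worker builds 'counts' on its
    # own partition and the per-worker tables are summed into a global histogram

    # Step 3: scan bin boundaries for the split minimizing weighted Gini impurity
    gini  <- function(n) { p <- n / sum(n); 1 - sum(p^2) }
    total <- colSums(counts)
    left  <- rep(0, ncol(counts))
    best  <- list(score = Inf, threshold = NA)
    for (b in 1:(nBins - 1)) {
      left  <- left + counts[b, ]
      right <- total - left
      if (sum(left) == 0 || sum(right) == 0) next
      score <- (sum(left) * gini(left) + sum(right) * gini(right)) / sum(total)
      if (score < best$score) best <- list(score = score, threshold = breaks[b + 1])
    }
    best
  }

For instance, best_split_from_histogram(iris$Petal.Length, iris$Species) finds a threshold just above 1.9, cleanly separating setosa from the other two species.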
How to use Random Forest in Distributed R
The interface is extremely similar to the randomForest function.
Some additional arguments:
- nBins: default value of 256
- nExecutors: no default value (controls how much parallelism in hpdRF_parallelForest)
- completeModel: default value of FALSE (decides whether to calculate OOB error)
Some output features are not there yet:
- Variable importance
- Proximity matrix
A call using these arguments is sketched below.
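A minimal sketch of such a call, using the argument names listed above (train_df and the argument values are placeholders; check the HPdclassifier documentation for the exact signature):

  library(HPdclassifier)
  model <- hpdrandomForest(response ~ ., data = train_df, ntree = 500,
                           nBins = 256,           # histogram resolution used for splits
                           nExecutors = 8,        # parallelism in hpdRF_parallelForest
                           completeModel = TRUE)  # also compute the OOB error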
MNIST dataset with 8.1M observations

  library(distributedR)
  library(HPdclassifier)
  distributedR_start()
  mnist_train <- read.csv("/mnt/mnist_train.csv", sep = "\t")
  mnist_test  <- read.csv("/mnt/mnist_test.csv", sep = "\t")
  model <- hpdrandomForest(response ~ ., mnist_train, ntree = 10)
  predictions <- predict(model, mnist_test)
  distributedR_shutdown()

Prediction accuracy of 99.7% with just 10 trees!
Using read.csv on the master is not recommended; load the data in parallel using Distributed R instead (see the sketch below).
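One way to do that parallel load, sketched here under the assumption that the CSV has been pre-split into one part file per partition (the part-file paths are hypothetical) and that dframe supports the flexible npartitions form from the Distributed R user guide:

  nparts <- 4
  dtrain <- dframe(npartitions = c(nparts, 1))
  foreach(i, 1:nparts, function(x = splits(dtrain, i), idx = i) {
    # each worker reads only its own shard of the file
    x <- read.csv(sprintf("/mnt/mnist_train_part%d.csv", idx), sep = "\t")
    update(x)
  })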
Scalability of hpdRF_parallelTree
Testing conditions:
- 1M observations
- 100 features
- 12 cores per machine
R's random forest takes about 106,260 seconds (~29 hours) on the larger machine.
Conclusions
- Distributed R: multi-core and distributed
- Random Forest in Distributed R: two parallel implementations optimized for different scenarios
Email: vishrut.gupta@hp.com
Appendix: Comparison with Other Implementations
Self-reported results on MNIST (8.1M observations):
- wiseRF: 8 min
- H2O: 19 min
- Spark Sequoia Forest: 6 min
- Spark MLlib: crashed
- Distributed R: 10 min
Distributed R is competitive.
Disclaimer: these results were run on different machines.