
Practical Machine Learning Pipelines with MLlib: Summary
"Explore the practical applications of Machine Learning Pipelines with MLlib presented at Spark Summit East 2015. Learn about Spark MLlib's mission, workflows, and examples like text classification. Understand how ML workflows and pipelines can streamline your machine learning processes."
Presentation Transcript
Practical Machine Learning Pipelines with MLlib
Joseph K. Bradley
March 18, 2015
Spark Summit East 2015
About Spark MLlib
Started in the UC Berkeley AMPLab; shipped with Spark 0.8. Currently (Spark 1.3): contributions from 50+ orgs and 100+ individuals.
Good coverage of algorithms: classification, regression, clustering, recommendation, feature extraction and selection, statistics, linear algebra, frequent itemsets.
MLlib's Mission
MLlib's mission is to make practical machine learning easy and scalable:
- Capable of learning from large-scale datasets
- Easy to build machine learning applications
How can we move beyond this list of algorithms and help users develop real ML workflows?
Outline
- ML workflows
- Pipelines
- Roadmap
Outline (current section: ML workflows)
Example: Text Classification
Goal: Given a text document, predict its topic.
Dataset: 20 Newsgroups, from the UCI KDD Archive.
Example features (a document): "Subject: Re: Lexan Polish? Suggest McQuires #1 plastic polish. It will help somewhat but nothing will remove deep scratches without making it worse than it already is. McQuires will do something..."
Label: 1 = about science, 0 = not about science.
(In general, features can be text, images, vectors, ...; labels can be CTR, inches of rainfall, ...)
Training & Testing
Training: given labeled data, an RDD of (features, label) pairs, learn a model. Example labeled documents: "Subject: Apollo Training: The Apollo astronauts also trained at (in) Meteor..." and "Subject: Re: Lexan Polish? Suggest McQuires #1 plastic polish. It will help...", each tagged with Label 0 or Label 1.
Testing/Production: given new unlabeled data, an RDD of features, use the model to make predictions. Example documents to score: "Subject: A demo of Nonsense: How can you lie about something that no one..." and "Subject: RIPEM FAQ: RIPEM is a program which performs Privacy Enhanced...".
Example ML Workflow (Training)
Load data (labels + plain text) → Extract features (labels + feature vectors) → Train model (labels + predictions) → Evaluate.
Pain point: the workflow creates many RDDs that must be explicitly unzipped & zipped:

    val labels: RDD[Double] = data.map(_.label)
    val features: RDD[Vector] = ...
    val predictions: RDD[Double] = ...
    labels.zip(predictions).map { case (label, pred) => if (label == pred) ... }
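To make the pain point concrete, here is a minimal sketch of this pre-Pipelines workflow in Scala. The LabeledDocument case class and the word-splitting featurizer are hypothetical stand-ins, not code from the talk; the MLlib calls (HashingTF, LogisticRegressionWithLBFGS) are the RDD-based API of that era.

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Hypothetical input record type.
    case class LabeledDocument(label: Double, text: String)

    def trainAndEvaluate(data: RDD[LabeledDocument]): Double = {
      // One RDD per intermediate result -- the pain point.
      val tf = new HashingTF(1000)
      val labels: RDD[Double] = data.map(_.label)
      val features: RDD[Vector] =
        data.map(doc => tf.transform(doc.text.split(" ").toSeq))

      // Zip labels and features back together to build training points.
      val training: RDD[LabeledPoint] =
        labels.zip(features).map { case (label, fv) => LabeledPoint(label, fv) }
      val model = new LogisticRegressionWithLBFGS().run(training)

      // Zip again to compare predictions against labels (accuracy).
      val predictions: RDD[Double] = features.map(model.predict)
      labels.zip(predictions).filter { case (l, p) => l == p }.count().toDouble /
        data.count()
    }

Every arrow in the workflow diagram becomes another RDD and another zip, which is exactly what Pipelines remove.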
Example ML Workflow (Training)
Same steps: Load data → Extract features → Train model → Evaluate.
Pain point: written as a script, the workflow is not modular and is difficult to re-use.
Example ML Workflow (Training vs. Testing/Production)
Training: Load data (labels + plain text) → Extract features (labels + feature vectors) → Train model (labels + predictions) → Evaluate.
Testing/Production: Load new data (plain text) → Extract features (feature vectors) → Predict using model (predictions) → Act on predictions.
The two workflows are almost identical.
Example ML Workflow (Training)
Pain point: parameter tuning. It is a key part of ML and involves training many models, for different splits of the data and for different sets of parameters.
Pain Points
- Create & handle many RDDs and data types
- Write as a script
- Tune parameters
Enter... Pipelines! (in Spark 1.2 & 1.3)
Outline (current section: Pipelines)
Key Concepts
- DataFrame: the ML dataset
- Abstractions: Transformers, Estimators & Evaluators
- Parameters: API & tuning
DataFrame: The ML Dataset
DataFrame = RDD + schema + DSL. Named columns with types: label: Double, text: String, words: Seq[String], features: Vector, prediction: Double.

    label | text          | words                | features
    ------+---------------+----------------------+----------------
    0     | "This is ..." | ["This", "is", ...]  | [0.5, 1.2, ...]
    0     | "When we ..." | ["When", ...]        | [1.9, -0.8, ...]
    1     | "Knuth was ..."| ["Knuth", ...]      | [0.0, 8.7, ...]
    0     | "Or you ..."  | ["Or", "you", ...]   | [0.1, -0.6, ...]
DataFrame: The ML Dataset
DataFrame = RDD + schema + DSL. The domain-specific language operates directly on named, typed columns:

    # Select science articles
    sciDocs = data.filter(data["label"] == 1)

    # Scale labels
    data["label"] * 0.5
DataFrame: The ML Dataset
DataFrame = RDD + schema + DSL, built for BIG data. Shipped with Spark 1.3, with APIs for Python, Java & Scala (+ R in development). Integration with Spark SQL brings data import/export and internal optimizations.
This addresses the pain point: create & handle many RDDs and data types.
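As a minimal sketch of how a labeled-text dataset becomes a DataFrame in the Spark 1.3 Scala API (the LabeledDocument case class and the sample rows are illustrative, not from the talk):

    import org.apache.spark.sql.SQLContext

    case class LabeledDocument(label: Double, text: String)

    val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext
    import sqlContext.implicits._

    // An RDD of case classes converts to a DataFrame with named, typed columns.
    val training = sc.parallelize(Seq(
      LabeledDocument(0.0, "This is ..."),
      LabeledDocument(1.0, "Knuth was ...")
    )).toDF()

    training.printSchema()            // label: double, text: string
    training.filter($"label" === 1)   // DSL query on a named column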
Abstractions
Training workflow: Load data → Extract features → Train model → Evaluate.
Abstraction: Transformer (Training)
def transform(DataFrame): DataFrame
"Extract features" is a Transformer: it maps a DataFrame with columns (label: Double, text: String) to one with columns (label: Double, text: String, features: Vector).
Abstraction: Estimator (Training)
def fit(DataFrame): Model
"Train model" is an Estimator: e.g., LogisticRegression fits a DataFrame with columns (label: Double, text: String, features: Vector) and produces a Model.
Abstraction: Evaluator (Training)
def evaluate(DataFrame): Double
"Evaluate" is an Evaluator: it computes a metric (accuracy, AUC, MSE, ...) from a DataFrame with columns (label: Double, text: String, features: Vector, prediction: Double).
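A minimal Scala sketch of these three abstractions working together, assuming training and test DataFrames that already have label and features columns (those DataFrames, and the choice of metric, are assumptions for illustration):

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

    // Estimator: fit() consumes a DataFrame and produces a Model.
    val lr = new LogisticRegression().setMaxIter(10)
    val model = lr.fit(training)

    // Model (a Transformer): transform() appends prediction columns.
    val predictions = model.transform(test)

    // Evaluator: evaluate() reduces a DataFrame of predictions to one metric.
    val evaluator = new BinaryClassificationEvaluator()  // default: areaUnderROC
    val auc: Double = evaluator.evaluate(predictions)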
Abstraction: Model (Testing/Production)
Model is a type of Transformer: def transform(DataFrame): DataFrame
"Predict using model" maps a DataFrame with columns (text: String, features: Vector) to one with columns (text: String, features: Vector, prediction: Double), ready for "Act on predictions".
(Recall) Abstraction: Estimator (Training)
def fit(DataFrame): Model
In the full training workflow (Load data → Extract features → Train model → Evaluate), LogisticRegression is the Estimator: it fits a DataFrame with columns (label: Double, text: String, features: Vector) and produces a Model.
Abstraction: Pipeline (Training)
Pipeline is a type of Estimator: def fit(DataFrame): Model
The entire training workflow (Load data → Extract features → Train model → Evaluate) becomes a single Pipeline that fits a DataFrame with columns (label: Double, text: String) and produces a PipelineModel.
Abstraction: PipelineModel (Testing/Production)
PipelineModel is a type of Transformer: def transform(DataFrame): DataFrame
The entire production workflow (Load data → Extract features → Predict using model → Act on predictions) maps a DataFrame with column (text: String) to one with columns (text: String, features: Vector, prediction: Double).
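A minimal sketch of production scoring with a fitted PipelineModel; it assumes the pipeline built in the demo that follows, plus a newDocs DataFrame containing only a text column (both assumptions for illustration):

    // pipeline: the Tokenizer -> HashingTF -> LogisticRegression pipeline below.
    val pipelineModel = pipeline.fit(training)

    // New, unlabeled data: a DataFrame with just a text column.
    val scored = pipelineModel.transform(newDocs)
    scored.select("text", "prediction").show()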
Abstractions: Summary

    Abstraction | Training         | Testing
    ------------+------------------+---------------------
    DataFrame   | Load data        | Load data
    Transformer | Extract features | Extract features
    Estimator   | Train model      | Predict using model
    Evaluator   | Evaluate         | Evaluate
Demo (Training)
The data schema grows as each stage runs:
- Load data → DataFrame (label: Double, text: String)
- Transformer: Tokenizer → adds words: Seq[String]
- Transformer: HashingTF → adds features: Vector
- Estimator: LogisticRegression → adds prediction: Double
- Evaluator: BinaryClassificationEvaluator
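The live demo is not in the transcript; the following Scala sketch reconstructs a pipeline with these exact stages using the spark.ml API of that era. The training and test DataFrames (with label and text columns, per the schema above) are assumed inputs.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Each stage names its input and output columns, chaining the schema.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)

    // The Pipeline (an Estimator) chains the stages.
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))

    // Fitting yields a PipelineModel (a Transformer).
    val model = pipeline.fit(training)

    // Evaluate predictions on held-out data.
    val evaluator = new BinaryClassificationEvaluator()
    println(evaluator.evaluate(model.transform(test)))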
Demo (Training)
The same workflow (Load data → Tokenizer → HashingTF → LogisticRegression → BinaryClassificationEvaluator), expressed as a Pipeline, addresses the pain point: write as a script.
Parameters
Standard, typed API with defaults, built-in doc, and autocomplete:

    > hashingTF.numFeatures
    org.apache.spark.ml.param.IntParam = numFeatures: number of features (default: 262144)
    > hashingTF.setNumFeatures(1000)
    > hashingTF.getNumFeatures
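Parameters can also be supplied at fit time. A minimal sketch, assuming the hashingTF, lr, pipeline, and training values from the demo above (the specific override values are illustrative):

    import org.apache.spark.ml.param.ParamMap

    // Set parameters directly on a stage...
    hashingTF.setNumFeatures(1000)

    // ...or collect overrides in a ParamMap and pass them to fit().
    val paramMap = ParamMap(hashingTF.numFeatures -> 10000, lr.regParam -> 0.1)
    val model = pipeline.fit(training, paramMap)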
Parameter Tuning
Given an Estimator, a parameter grid, and an Evaluator, find the best parameters. For the demo pipeline (Tokenizer → HashingTF → LogisticRegression), a CrossValidator with a BinaryClassificationEvaluator can search over hashingTF.numFeatures in {100, 1000, 10000} and lr.regParam in {0.01, 0.1, 0.5}.
Parameter Tuning
This setup (Estimator + parameter grid + Evaluator, searched by CrossValidator) addresses the pain point: tune parameters.
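A minimal sketch of that search, assuming the pipeline, hashingTF, and lr from the demo (the grid values mirror the slide; numFolds = 3 is an illustrative choice):

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    // The grid from the slide: 3 x 3 = 9 parameter combinations.
    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(100, 1000, 10000))
      .addGrid(lr.regParam, Array(0.01, 0.1, 0.5))
      .build()

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)

    // Trains models across folds for every combination and keeps the best.
    val cvModel = cv.fit(training)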
Pipelines: Recap
- DataFrame → addresses: create & handle many RDDs and data types
- Abstractions → addresses: write as a script
- Parameter API → addresses: tune parameters
Also: Python, Scala, Java APIs; schema validation; user-defined types*; feature metadata*; multi-model training optimizations*.
Inspirations: scikit-learn (DataFrame, Param API) and MLBase from the Berkeley AMPLab (ongoing collaborations).
* Groundwork done; full support WIP.
Outline (current section: Roadmap)
Roadmap
- spark.mllib: primary ML package
- spark.ml: high-level Pipelines API for algorithms in spark.mllib (experimental in Spark 1.2-1.3)
Near future: feature attributes, feature transformers, more algorithms under the Pipeline API.
Farther ahead: ideas from AMPLab MLBase (auto-tuning models), SparkR integration.
Recap: ML workflows; Pipelines (DataFrame, abstractions, parameter tuning); Roadmap.
Spark documentation: http://spark.apache.org/
Pipelines blog post: https://databricks.com/blog/2015/01/07
Thank you!