
Practical Machine Learning Pipelines with MLlib: Summary
"Explore the practical applications of Machine Learning Pipelines with MLlib presented at Spark Summit East 2015. Learn about Spark MLlib's mission, workflows, and examples like text classification. Understand how ML workflows and pipelines can streamline your machine learning processes."
Presentation Transcript
Practical Machine Learning Pipelines with MLlib
Joseph K. Bradley
March 18, 2015
Spark Summit East 2015
About Spark MLlib
Started in the UC Berkeley AMPLab; shipped with Spark 0.8. Currently (Spark 1.3): contributions from 50+ orgs and 100+ individuals.
Good coverage of algorithms: classification, regression, clustering, recommendation, feature extraction and selection, statistics, linear algebra, frequent itemsets.
MLlib's Mission
MLlib's mission is to make practical machine learning easy and scalable:
- Capable of learning from large-scale datasets
- Easy to build machine learning applications
How can we move beyond this list of algorithms and help users develop real ML workflows?
Outline
- ML workflows
- Pipelines
- Roadmap
Outline (current section: ML workflows)
Example: Text Classification
Goal: Given a text document, predict its topic.
Dataset: 20 Newsgroups, from the UCI KDD Archive.
Example features (a document): "Subject: Re: Lexan Polish? Suggest McQuires #1 plastic polish. It will help somewhat but nothing will remove deep scratches without making it worse than it already is. McQuires will do something..."
Label: 1 = about science, 0 = not about science.
(In general, features can be text, images, vectors, ...; labels can be CTR, inches of rainfall, ...)
Training & Testing
Training: given labeled data, an RDD of (features, label) pairs, learn a model. Example labeled documents: "Subject: Apollo Training: The Apollo astronauts also trained at (in) Meteor..." and "Subject: Re: Lexan Polish? Suggest McQuires #1 plastic polish. It will help...", each tagged with Label 0 or Label 1.
Testing/Production: given new unlabeled data, an RDD of features, use the model to make predictions. Example documents to score: "Subject: A demo of Nonsense: How can you lie about something that no one..." and "Subject: RIPEM FAQ: RIPEM is a program which performs Privacy Enhanced...".
Example ML Workflow (Training)
Load data (labels + plain text) → Extract features (labels + feature vectors) → Train model (labels + predictions) → Evaluate.
Pain point: the workflow creates many RDDs that must be explicitly unzipped & zipped:

    val labels: RDD[Double] = data.map(_.label)
    val features: RDD[Vector] = ...
    val predictions: RDD[Double] = ...
    labels.zip(predictions).map { case (label, pred) => if (label == pred) ... }
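To make the pain point concrete, here is a minimal sketch of this pre-Pipelines workflow in Scala. The LabeledDocument case class and the word-splitting featurizer are hypothetical stand-ins, not code from the talk; the MLlib calls (HashingTF, LogisticRegressionWithLBFGS) are the RDD-based API of that era.

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Hypothetical input record type.
    case class LabeledDocument(label: Double, text: String)

    def trainAndEvaluate(data: RDD[LabeledDocument]): Double = {
      // One RDD per intermediate result -- the pain point.
      val tf = new HashingTF(1000)
      val labels: RDD[Double] = data.map(_.label)
      val features: RDD[Vector] =
        data.map(doc => tf.transform(doc.text.split(" ").toSeq))

      // Zip labels and features back together to build training points.
      val training: RDD[LabeledPoint] =
        labels.zip(features).map { case (label, fv) => LabeledPoint(label, fv) }
      val model = new LogisticRegressionWithLBFGS().run(training)

      // Zip again to compare predictions against labels (accuracy).
      val predictions: RDD[Double] = features.map(model.predict)
      labels.zip(predictions).filter { case (l, p) => l == p }.count().toDouble /
        data.count()
    }

Every arrow in the workflow diagram becomes another RDD and another zip, which is exactly what Pipelines remove.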
Example ML Workflow (Training)
Same steps: Load data → Extract features → Train model → Evaluate.
Pain point: written as a script, the workflow is not modular and is difficult to re-use.
Example ML Workflow (Training vs. Testing/Production)
Training: Load data (labels + plain text) → Extract features (labels + feature vectors) → Train model (labels + predictions) → Evaluate.
Testing/Production: Load new data (plain text) → Extract features (feature vectors) → Predict using model (predictions) → Act on predictions.
The two workflows are almost identical.
Example ML Workflow (Training)
Pain point: parameter tuning. It is a key part of ML and involves training many models, for different splits of the data and for different sets of parameters.
Pain Points
- Create & handle many RDDs and data types
- Write as a script
- Tune parameters
Enter... Pipelines! (in Spark 1.2 & 1.3)
Outline (current section: Pipelines)
Key Concepts
- DataFrame: the ML dataset
- Abstractions: Transformers, Estimators & Evaluators
- Parameters: API & tuning
DataFrame: The ML Dataset
DataFrame = RDD + schema + DSL. Named columns with types: label: Double, text: String, words: Seq[String], features: Vector, prediction: Double.

    label | text          | words                | features
    ------+---------------+----------------------+----------------
    0     | "This is ..." | ["This", "is", ...]  | [0.5, 1.2, ...]
    0     | "When we ..." | ["When", ...]        | [1.9, -0.8, ...]
    1     | "Knuth was ..."| ["Knuth", ...]      | [0.0, 8.7, ...]
    0     | "Or you ..."  | ["Or", "you", ...]   | [0.1, -0.6, ...]
DataFrame: The ML Dataset
DataFrame = RDD + schema + DSL. The domain-specific language operates directly on named, typed columns:

    # Select science articles
    sciDocs = data.filter(data["label"] == 1)

    # Scale labels
    data["label"] * 0.5
DataFrame: The ML Dataset
DataFrame = RDD + schema + DSL, built for BIG data. Shipped with Spark 1.3, with APIs for Python, Java & Scala (+ R in development). Integration with Spark SQL brings data import/export and internal optimizations.
This addresses the pain point: create & handle many RDDs and data types.
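As a minimal sketch of how a labeled-text dataset becomes a DataFrame in the Spark 1.3 Scala API (the LabeledDocument case class and the sample rows are illustrative, not from the talk):

    import org.apache.spark.sql.SQLContext

    case class LabeledDocument(label: Double, text: String)

    val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext
    import sqlContext.implicits._

    // An RDD of case classes converts to a DataFrame with named, typed columns.
    val training = sc.parallelize(Seq(
      LabeledDocument(0.0, "This is ..."),
      LabeledDocument(1.0, "Knuth was ...")
    )).toDF()

    training.printSchema()            // label: double, text: string
    training.filter($"label" === 1)   // DSL query on a named column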
Abstractions
Training workflow: Load data → Extract features → Train model → Evaluate.
Abstraction: Transformer (Training)
def transform(DataFrame): DataFrame
"Extract features" is a Transformer: it maps a DataFrame with columns (label: Double, text: String) to one with columns (label: Double, text: String, features: Vector).
Abstraction: Estimator (Training)
def fit(DataFrame): Model
"Train model" is an Estimator: e.g., LogisticRegression fits a DataFrame with columns (label: Double, text: String, features: Vector) and produces a Model.
Abstraction: Evaluator (Training)
def evaluate(DataFrame): Double
"Evaluate" is an Evaluator: it computes a metric (accuracy, AUC, MSE, ...) from a DataFrame with columns (label: Double, text: String, features: Vector, prediction: Double).
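A minimal Scala sketch of these three abstractions working together, assuming training and test DataFrames that already have label and features columns (those DataFrames, and the choice of metric, are assumptions for illustration):

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

    // Estimator: fit() consumes a DataFrame and produces a Model.
    val lr = new LogisticRegression().setMaxIter(10)
    val model = lr.fit(training)

    // Model (a Transformer): transform() appends prediction columns.
    val predictions = model.transform(test)

    // Evaluator: evaluate() reduces a DataFrame of predictions to one metric.
    val evaluator = new BinaryClassificationEvaluator()  // default: areaUnderROC
    val auc: Double = evaluator.evaluate(predictions)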
Abstraction: Model (Testing/Production)
Model is a type of Transformer: def transform(DataFrame): DataFrame
"Predict using model" maps a DataFrame with columns (text: String, features: Vector) to one with columns (text: String, features: Vector, prediction: Double), ready for "Act on predictions".
(Recall) Abstraction: Estimator (Training)
def fit(DataFrame): Model
In the full training workflow (Load data → Extract features → Train model → Evaluate), LogisticRegression is the Estimator: it fits a DataFrame with columns (label: Double, text: String, features: Vector) and produces a Model.
Abstraction: Pipeline (Training)
Pipeline is a type of Estimator: def fit(DataFrame): Model
The entire training workflow (Load data → Extract features → Train model → Evaluate) becomes a single Pipeline that fits a DataFrame with columns (label: Double, text: String) and produces a PipelineModel.
Abstraction: PipelineModel (Testing/Production)
PipelineModel is a type of Transformer: def transform(DataFrame): DataFrame
The entire production workflow (Load data → Extract features → Predict using model → Act on predictions) maps a DataFrame with column (text: String) to one with columns (text: String, features: Vector, prediction: Double).
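A minimal sketch of production scoring with a fitted PipelineModel; it assumes the pipeline built in the demo that follows, plus a newDocs DataFrame containing only a text column (both assumptions for illustration):

    // pipeline: the Tokenizer -> HashingTF -> LogisticRegression pipeline below.
    val pipelineModel = pipeline.fit(training)

    // New, unlabeled data: a DataFrame with just a text column.
    val scored = pipelineModel.transform(newDocs)
    scored.select("text", "prediction").show()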
Abstractions: Summary

    Abstraction | Training         | Testing
    ------------+------------------+---------------------
    DataFrame   | Load data        | Load data
    Transformer | Extract features | Extract features
    Estimator   | Train model      | Predict using model
    Evaluator   | Evaluate         | Evaluate
Demo (Training)
The data schema grows as each stage runs:
- Load data → DataFrame (label: Double, text: String)
- Transformer: Tokenizer → adds words: Seq[String]
- Transformer: HashingTF → adds features: Vector
- Estimator: LogisticRegression → adds prediction: Double
- Evaluator: BinaryClassificationEvaluator
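The live demo is not in the transcript; the following Scala sketch reconstructs a pipeline with these exact stages using the spark.ml API of that era. The training and test DataFrames (with label and text columns, per the schema above) are assumed inputs.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Each stage names its input and output columns, chaining the schema.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)

    // The Pipeline (an Estimator) chains the stages.
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))

    // Fitting yields a PipelineModel (a Transformer).
    val model = pipeline.fit(training)

    // Evaluate predictions on held-out data.
    val evaluator = new BinaryClassificationEvaluator()
    println(evaluator.evaluate(model.transform(test)))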
Demo (Training)
The same workflow (Load data → Tokenizer → HashingTF → LogisticRegression → BinaryClassificationEvaluator), expressed as a Pipeline, addresses the pain point: write as a script.
Parameters
Standard, typed API with defaults, built-in doc, and autocomplete:

    > hashingTF.numFeatures
    org.apache.spark.ml.param.IntParam = numFeatures: number of features (default: 262144)
    > hashingTF.setNumFeatures(1000)
    > hashingTF.getNumFeatures
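Parameters can also be supplied at fit time. A minimal sketch, assuming the hashingTF, lr, pipeline, and training values from the demo above (the specific override values are illustrative):

    import org.apache.spark.ml.param.ParamMap

    // Set parameters directly on a stage...
    hashingTF.setNumFeatures(1000)

    // ...or collect overrides in a ParamMap and pass them to fit().
    val paramMap = ParamMap(hashingTF.numFeatures -> 10000, lr.regParam -> 0.1)
    val model = pipeline.fit(training, paramMap)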
Parameter Tuning
Given an Estimator, a parameter grid, and an Evaluator, find the best parameters. For the demo pipeline (Tokenizer → HashingTF → LogisticRegression), a CrossValidator with a BinaryClassificationEvaluator can search over hashingTF.numFeatures in {100, 1000, 10000} and lr.regParam in {0.01, 0.1, 0.5}.
Parameter Tuning
This setup (Estimator + parameter grid + Evaluator, searched by CrossValidator) addresses the pain point: tune parameters.
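A minimal sketch of that search, assuming the pipeline, hashingTF, and lr from the demo (the grid values mirror the slide; numFolds = 3 is an illustrative choice):

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    // The grid from the slide: 3 x 3 = 9 parameter combinations.
    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(100, 1000, 10000))
      .addGrid(lr.regParam, Array(0.01, 0.1, 0.5))
      .build()

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)

    // Trains models across folds for every combination and keeps the best.
    val cvModel = cv.fit(training)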
Pipelines: Recap
- DataFrame → addresses: create & handle many RDDs and data types
- Abstractions → addresses: write as a script
- Parameter API → addresses: tune parameters
Also: Python, Scala, Java APIs; schema validation; user-defined types*; feature metadata*; multi-model training optimizations*.
Inspirations: scikit-learn (DataFrame, Param API) and MLBase from the Berkeley AMPLab (ongoing collaborations).
* Groundwork done; full support WIP.
Outline (current section: Roadmap)
Roadmap
- spark.mllib: primary ML package
- spark.ml: high-level Pipelines API for algorithms in spark.mllib (experimental in Spark 1.2-1.3)
Near future: feature attributes, feature transformers, more algorithms under the Pipeline API.
Farther ahead: ideas from AMPLab MLBase (auto-tuning models), SparkR integration.
Recap: ML workflows; Pipelines (DataFrame, abstractions, parameter tuning); Roadmap.
Spark documentation: http://spark.apache.org/
Pipelines blog post: https://databricks.com/blog/2015/01/07
Thank you!