
Towards A Polyglot Framework for Factorized ML (VLDB21 Industry Track)
"Explore a polyglot framework for factorized machine learning discussed in the industry track at VLDB21. Learn about challenges and solutions in maintaining dev tools across multiple programming languages, with a focus on data normalization and optimization."
Presentation Transcript
VLDB21 Industry Track: Towards A Polyglot Framework for Factorized ML. David Justo, Shaoqing Yi, Lukas Stadler, Nadia Polikarpova, Arun Kumar.
A Polyglot Data Science World
Challenge: Maintaining dev tools across multiple PLs is hard to scale.
Morpheus
De-normalizing leads to redundancy: a JOIN turns a relational (multi-table) dataset into a single post-join dataset in which each dimension-table row is copied into every matching fact row. Any operator Op( ) applied to the post-join data then repeats the same computation over those redundant copies, once per duplicate.
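To make the redundancy concrete, here is a minimal numpy sketch (not the paper's code; all names are illustrative) of the rewrite that factorized ML systems in the Morpheus family apply: instead of materializing the join and then operating on it, the operation is pushed down to the base tables.

```python
import numpy as np

# Fact table S (6 rows, 2 features) joins dimension table R
# (2 rows, 3 features) through a foreign-key column k.
S = np.arange(12, dtype=float).reshape(6, 2)
R = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
k = np.array([0, 0, 1, 0, 1, 1])  # which row of R each row of S joins

# Materialized: the join physically copies each row of R three times.
T = np.hstack([S, R[k]])                 # 6 x 5 post-join matrix
w = np.ones(T.shape[1])
out_materialized = T @ w

# Factorized: compute R's contribution once per distinct key,
# then scatter it, rather than recomputing it per joined row.
w_S, w_R = w[:S.shape[1]], w[S.shape[1]:]
out_factorized = S @ w_S + (R @ w_R)[k]

assert np.allclose(out_materialized, out_factorized)
```

The factorized form performs the R-side multiply over 2 rows instead of 6; as the number of fact rows per dimension row grows, so do the savings.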
A development challenge: every new PL and every new optimization (Opt. 1, Opt. 2, Opt. 3, ...) multiplies the implementation work, a massive burden for a research team.
Key Idea: We need to disentangle the specification of rewrite rules from target-PL domain knowledge.
Trinity: A framework for developing and supporting polyglot (multi-language) factorized ML systems.
Contributions: to our knowledge, the first system to generalize factorized ML to multiple PLs at once; extends GraalVM's interoperability abstractions with matrix support; demonstrates performance improvements in 3 new, leaner prototypes for R, JS, and Python.
Agenda: Introduction, GraalVM, Architecture, Evaluation, Future Directions
Agenda (next: GraalVM)
GraalVM offers a shared interpreter across languages ("Your new PL here!"): guest languages plug into one runtime rather than each shipping its own toolchain.
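As a flavor of what the shared interpreter enables, here is a hedged sketch using GraalVM's polyglot API as exposed in GraalPy (GraalVM's Python). The `polyglot` module exists only under GraalVM, and the snippet is illustrative rather than taken from the talk.

```python
# Runs only under GraalPy (GraalVM's Python); the `polyglot`
# module is GraalVM-specific and is not available in CPython.
import polyglot

# Evaluate R code inside the same shared runtime and get back a
# handle that crosses the language boundary via GraalVM interop.
r_matrix = polyglot.eval(language="R", string="matrix(1:6, nrow = 2)")
print(r_matrix)
```

Because every guest language runs on the same interpreter infrastructure, supporting a new PL means plugging into this interop layer rather than rebuilding the whole toolchain.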
Agenda (next: Architecture)
Architecture overview: MorpheusDSL and MatrixLib cooperate around the Normalized Matrix. To execute an Op( ), MatrixLib dispatches on properties of the input ("Is the matrix sparse? Yes: call Foo(); No: call Bar()").
MorpheusDSL: Morpheus rewrite rules as AST nodes.
A MorpheusDSL rewrite: an expression in some PL, e.g. `5 * X` in R, is handed to MorpheusDSL, whose semantics rewrite the single multiplication into the corresponding multiplications over the normalized representation.
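A hedged Python sketch of the idea (class and method names are illustrative, not Trinity's actual API): the rewrite rule is stated once against a normalized-matrix abstraction, and each host PL only has to map its native operator onto that entry point.

```python
import numpy as np

class NormalizedMatrix:
    """Logical join of fact table S and dimension table R via keys k,
    kept unmaterialized. Illustrative stand-in for the abstraction
    that MorpheusDSL rewrites operate on."""
    def __init__(self, S, R, k):
        self.S, self.R, self.k = S, R, k

    # Rewrite rule, written once: scaling the (virtual) joined matrix
    # equals scaling each base table. Host PLs bind their own operator
    # (R's `*`, Python's `__mul__`) to this single entry point.
    def scalar_multiply(self, c):
        return NormalizedMatrix(c * self.S, c * self.R, self.k)

    __mul__ = __rmul__ = scalar_multiply

    def materialize(self):  # only for checking the rewrite
        return np.hstack([self.S, self.R[self.k]])
```

With this in place, an expression like `5 * nm` in the host language is routed through the shared rewrite instead of a per-language reimplementation.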
MatrixLib: A unified matrix interoperability abstraction.
MatrixLib allows first-class matrix interop. When MorpheusDSL invokes MatrixLib's multiplication with 5, each host language answers with its own native operator (R: "You mean %*%? Sure thing!"; Python: "You mean __mul__? Sure thing!"). MatrixLib also centralizes dispatch decisions such as "Is the matrix sparse? Yes: call Foo(); No: call Bar()".
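A minimal sketch of what a MatrixLib-style adapter could look like, assuming a Python backend (the method names are illustrative; the real MatrixLib is built on GraalVM's interop abstractions):

```python
import scipy.sparse as sp

class MatrixAdapter:
    """Uniform matrix interface that MorpheusDSL-style rewrites can
    call; each language binds these methods to its native operators
    (e.g. %*% in R, @ / __matmul__ in numpy)."""
    def __init__(self, tensor):
        self.tensor = tensor

    def scalar_multiply(self, c):
        return MatrixAdapter(c * self.tensor)

    def matmul(self, other):
        return MatrixAdapter(self.tensor @ other.tensor)

    def is_sparse(self):
        # The dispatch point from the diagram: sparse and dense
        # inputs can be routed to different backend routines.
        return sp.issparse(self.tensor)
```

The slide's dialogue ("You mean %*%?" / "You mean __mul__?") is exactly this mapping: one abstract multiplication request, answered by whatever operator the host language natively provides.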
Putting the layers together: MorpheusDSL supplies the rewrite rules, MatrixLib supplies the matrix interop, and each PL supplies its domain knowledge. Built-in interop, in-PL invocation, and MatrixLib interop connect these layers around the Normalized Matrix and the Op( ) calls made against it.
Agenda (next: Evaluation)
Evaluation. Algorithms: logistic regression, K-means clustering, linear regression, GNMF clustering. Baselines (in R): Morpheus, materialized. Setting: R language, Movies dataset (3-table join).
Model training time (lower is better). LogReg: 8.56x and 8.13x speed-ups (Morpheus and Trinity, respectively) over the 165 sec materialized baseline; K-means: 5.05x and 4.98x speed-ups over the 358 sec materialized baseline.
Model training time (lower is better), continued. LinReg: 7.55x and 7.49x speed-ups over the 132 sec materialized baseline; GNMF: 0.79x and 0.88x slow-downs against the 256 sec materialized baseline.
Aside: When is factorized ML slower? 1. Unexpected variance in the cost of LA operators. 2. Minimal redundancy introduced by the join.
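A back-of-the-envelope way to see point 2, as a hedged sketch (the formula and names are illustrative, not the paper's exact heuristic): compare the size of the materialized join with the combined size of the base tables. When the ratio is near 1, factorization has little redundant work to eliminate, and its bookkeeping overhead can dominate.

```python
def redundancy_ratio(n_s, d_s, n_r, d_r):
    """Size of the materialized join relative to the base tables.
    n_s, d_s: rows/features of fact table S; n_r, d_r: rows/features
    of dimension table R joined into every matching S row."""
    materialized = n_s * (d_s + d_r)
    normalized = n_s * d_s + n_r * d_r
    return materialized / normalized

# Many fact rows per dimension row: large redundancy, big savings.
print(redundancy_ratio(n_s=10**6, d_s=20, n_r=10**4, d_r=50))   # ~3.4
# Near one-to-one join: ratio ~1, so factorization may lose (GNMF case).
print(redundancy_ratio(n_s=10**4, d_s=20, n_r=10**4, d_r=5))    # 1.0
```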
Model training time, all algorithms: a combined chart comparing LogReg, K-means, LinReg, and GNMF across Morpheus, Trinity, and the materialized baseline.
Evaluation summary: Trinity (multi-PL) achieves speed-ups comparable to Morpheus (single-PL); Trinity's relative performance difference from Morpheus is small, no larger than 20%.
Takeaways: Trinity is a polyglot framework for factorized ML; it solves a developability challenge for today's polyglot data science landscape; and its performance is comparable to single-language Morpheus implementations.
Agenda (next: Future Directions)
Where to go next? 1. Remove the dependency on GraalVM. 2. Automatic PL-specific fine-tuning. 3. A user study on debuggability and productivity.
VLDB21 Industry Track: Towards A Polyglot Framework for Factorized ML. https://adalabucsd.github.io/morpheus.html
Backup Slides
Where to go next? Trinity plus automatic discovery of PL-aware performance knowledge (e.g. which branch of a dispatch like "Is the matrix sparse?" is cheap in a given language), and Trinity as a transpiler.
Where to go next? A user study on productivity and debuggability.
Options considered: FFIs (not PL-general, not efficient); a compiler (hard to extend); a shared runtime (challenge: how many PLs are supported).
JS and polyglot (R+Python) evaluation. Algorithm: linear regression. Setting: synthetic 2-table join dataset with TR = 10, FR = 5, n_S = 10^4, d_S = 20. Baseline: materialized.
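For reference, a hedged sketch of how such a synthetic 2-table join could be generated (the parameter names, the defaults, and the interpretations TR = n_S / n_R and FR = d_R / d_S are assumptions drawn from the factorized-ML literature, not the talk's actual generator):

```python
import numpy as np

def synthetic_join(tr, fr, n_r=10**3, d_s=20, seed=0):
    """Random 2-table join: fact table S, dimension table R, keys k.
    tr: tuple ratio n_S / n_R; fr: feature ratio d_R / d_S
    (assumed definitions; defaults are illustrative)."""
    rng = np.random.default_rng(seed)
    n_s, d_r = tr * n_r, fr * d_s
    S = rng.standard_normal((n_s, d_s))   # fact-table features
    R = rng.standard_normal((n_r, d_r))   # dimension-table features
    k = rng.integers(0, n_r, size=n_s)    # foreign-key column
    return S, R, k

S, R, k = synthetic_join(tr=10, fr=5)     # the slide's TR/FR setting
```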
Aside: GNMF analysis. In GraalVM's R, the addition operator had a higher cost than anticipated.