
Parallelization of Information Flow Queries and Dynamic Information Flow Tracking
Explore the parallelization of information flow queries and dynamic information flow tracking techniques as discussed by Andrew Quinn, David Devecsery, Peter Chen, and Jason Flinn. Dive into the complexities, limitations, and strategies for optimizing these processes effectively.
Presentation Transcript
JetStream: Cluster-scale Parallelization of Information Flow Queries. Andrew Quinn, David Devecsery, Peter Chen, and Jason Flinn.
Dynamic Information Flow Tracking: DIFT instruments an execution to track causality from sources (inputs) to sinks (outputs). Also known as taint tracking.
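The source-to-sink model can be sketched with a tiny, hypothetical taint-tracking wrapper (not JetStream's instrumentation; the `Tainted` class and the `read`/`write` helpers are invented for illustration): each value carries the set of input sources that influenced it, and every operation unions the sets of its operands.

```python
# Hypothetical sketch of taint tracking; not JetStream's binary-level instrumentation.
class Tainted:
    """A value plus the set of input sources that influenced it."""
    def __init__(self, value, sources=frozenset()):
        self.value = value
        self.sources = frozenset(sources)

    def __add__(self, other):
        # Propagation rule: a result depends on both operands' sources.
        return Tainted(self.value + other.value, self.sources | other.sources)

def read(name, value):
    """Source (input): label data entering the program with its origin."""
    return Tainted(value, {name})

def write(t):
    """Sink (output): report which sources flow to this output."""
    return t.sources

a = read("input1", 3)
b = read("input2", 4)
c = a + b                         # c is causally derived from both inputs
print(sorted(write(c)))           # ['input1', 'input2']
```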
DIFT for Debugging (animation over several slides): a query traces which inputs flow through a server to a particular output o1.
DIFT limitations (as seen in systems such as Arnold '14, X-Ray '12, TaintDroid '10, and Backtracker '03): overheads of ~100x; long-running queries; no native-code support; coarse-grained causality.
Parallelizing DIFT is HARD. Prior attempts include Speck (ASPLOS '08) and Parallel Lifeguards (ASPLOS '08); Ruwase et al. describe the problem as "embarrassingly sequential". Consider this trace, in which every step carries a sequential dependency:

    A = read()
    B = read()
    C = A + B
    D = X + Y
    E = C
    B = 0
    Z = A[D]
    F = E
    write(F)
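Walking the slide's trace sequentially shows why: each step reads the taint state left by the one before it, including taint kills (B = 0) and address-based dependencies (Z = A[D]). A hypothetical sketch over a simple taint map (the input names `in1`/`in2` are invented for illustration):

```python
# Hypothetical sketch: replaying the slide's trace over a taint map, where
# taint[x] is the set of inputs that location x currently depends on.
taint = {"X": set(), "Y": set()}          # X and Y start untainted

taint["A"] = {"in1"}                      # A = read()
taint["B"] = {"in2"}                      # B = read()
taint["C"] = taint["A"] | taint["B"]      # C = A + B
taint["D"] = taint["X"] | taint["Y"]      # D = X + Y
taint["E"] = set(taint["C"])              # E = C
taint["B"] = set()                        # B = 0   (taint kill: order matters)
taint["Z"] = taint["A"] | taint["D"]      # Z = A[D] (depends on array A and index D)
taint["F"] = set(taint["E"])              # F = E
print(sorted(taint["F"]))                 # ['in1', 'in2']: write(F) exposes both reads
```

Reordering any pair of these steps (say, the kill of B and the read of B in C = A + B) changes the result, which is what makes naive parallelization unsound.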
JetStream: local DIFT via epoch parallelism, aggregation via pipeline parallelism. Up to 21x speedup, and 2x faster than the original execution!
Outline: design of JetStream (local DIFT, aggregation); evaluation.
Debugging Query (figure): trace flows from inputs to outputs.
Local DIFT: time-slice the execution into epochs.
Local DIFT: leverage record and replay to calculate DIFT for each epoch in parallel.
Local DIFT: track the mapping between all intermediate locations.
Local DIFT: the full mapping is too expensive to calculate, so instead log operations and defer calculating relationships until aggregation.
DIFT within an epoch (animation over three slides). Fast forward: replay the execution until the start of the epoch. Analysis: log the epoch's operations (here, D = B + C and C = A[D]) as a graph over the locations A, B, C, and D.
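The analysis step sketched as code (hypothetical; JetStream logs binary-level operations, and location versioning is elided here): rather than eagerly computing each location's source set, the epoch worker appends one graph edge per data dependence and leaves the transitive reasoning to aggregation.

```python
# Hypothetical sketch: an epoch worker logs each instruction as graph edges
# instead of eagerly computing taint sets. (A real DIFT would version
# locations so that successive writes to the same name are distinguished.)
edges = []                       # (destination, source) pairs

def log_op(dst, srcs):
    """Record one executed operation as data-dependence edges."""
    for s in srcs:
        edges.append((dst, s))

log_op("D", ["B", "C"])          # D = B + C
log_op("C", ["A", "D"])          # C = A[D]: depends on array A and index D

print(edges)                     # [('D', 'B'), ('D', 'C'), ('C', 'A'), ('C', 'D')]
```

Logging is constant work per operation, which is what keeps the per-epoch pass cheap; the expensive source-to-sink reasoning is deferred.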
Local DIFT output (figure): per-epoch graphs spanning from inputs to outputs.
JetStream (recap): local DIFT via epoch parallelism; aggregation via pipeline parallelism.
Aggregation: calculate paths between sources and sinks. Many nodes are not on any path between a source and a sink, so use sequential information to prune work.
Forward Pass (animation): propagate the set of locations which are derived from a source (the derived locations) forward from the inputs.
Backward Pass (animation): propagate the set of locations which are used by a sink (the used locations) backward from the outputs.
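The two passes amount to intersecting two reachability sets over the logged graph: keep only nodes that are both derived from a source (forward pass) and used by a sink (backward pass), and prune everything else. A hypothetical sketch on a toy graph (node names invented for illustration):

```python
# Hypothetical sketch of the two-pass pruning: a node survives only if it is
# reachable from a source AND reaches a sink.
from collections import defaultdict

def reachable(starts, adj):
    """Simple depth-first reachability from a set of start nodes."""
    seen, stack = set(starts), list(starts)
    while stack:
        n = stack.pop()
        for m in adj[n]:
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen

# Toy dependence graph: edges point from a value to the values derived from it.
fwd = defaultdict(list, {"src": ["a"], "a": ["sink"], "x": ["y"]})
# Reverse edges for the backward pass.
bwd = defaultdict(list)
for u, vs in list(fwd.items()):
    for v in vs:
        bwd[v].append(u)

derived = reachable({"src"}, fwd)     # forward pass: derived from a source
used = reachable({"sink"}, bwd)       # backward pass: used by a sink
on_path = derived & used              # 'x' and 'y' are pruned
print(sorted(on_path))                # ['a', 'sink', 'src']
```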
Also in the paper: insights about why naive approaches fail; partitioning (it is challenging to predict the local DIFT time); pre-pruning (garbage collection of the graph).
Outline: evaluation.
Experimental Setup: CloudLab cluster of 32 machines, 1 to 128 cores. Queries: (1) sources: home directory, sinks: all; (2) sources: cookies, sinks: suspicious connections.
Two Different Scenarios: unexpected analysis (prioritize low record overhead) and expected analysis (periodic checkpoints; gather partitioning stats).
Scalability of Unexpected Analysis (graph: normalized speedup vs. number of cores, 1 to 100, for Ghostscript, OpenOffice, Gzip, Nginx, Evince, Firefox, and MongoDB, against an ideal line). Mean speedup: 13x.
Scalability of Expected Analysis (graph: normalized speedup vs. number of cores, 1 to 100, for the same benchmarks, against an ideal line). Mean speedup: 21x.
JetStream (recap): local DIFT via epoch parallelism; aggregation via pipeline parallelism. Up to 21x speedup, and 2x faster than the original execution!
Questions?