
Graph Sketching Techniques for Big Data Processing
Explore graph sketching methods for big data analysis, including sketching vectors and graphs, dynamic graph processing, distributed computing, and connectivity algorithms. Understand how linear sketches and parallel connectivity techniques are utilized in handling large-scale graph datasets efficiently.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
CIS 700: algorithms for Big Data Lecture 6: Graph Sketching Slides at http://grigory.us/big-data-class.html Grigory Yaroslavtsev http://grigory.us
Sketching Graphs? We know how to sketch vectors: ? ?? How about sketching graphs? ? ?,? ?? (adjacency matrix): ?? ??? Sketch columns of ?? ? = ? ,? = |?| ?(????(log?)) sketch per vertex / ?(?) total Check connectivity Check bipartiteness As always, space rather than dimension. Why?
Graph Streams Semi-streaming model: [Muthukrishnan 05; Feigenbaum, Kannan, McGregor, Suri, Zhang 05] Graph defined by the stream of edges ?1, ,?? Space ? ? , edges processed in order Connectivity is easy on ?(?) space for insertion-only Dynamic graphs: Stream of insertion/deletion updates + ??1, ??2, , ??? (assume sequence is correct) Resulting graph has edge ??if it wasn t deleted after the last insertion Linear sketching dynamic graphs: ??? ?= ??? MAe
Distributed Computing Linear sketches for distributed processing ? servers with o(m) memory: Send ?/? edges (?1, ,??) to each server Compute sketches ??1, ,??? locally Send sketches to a central server Compute ???= ? ? has to have a small representation (same issue as in streaming) ????
Connectivity Thm. Connectivity is sketchable in ?(?) space Framework: Take existing connectivity algorithm (Boruvka) Sketch ?? ??? Run Boruvka on ??? Important that the sketch is homomorphic w.r.t the algorithm
Part 1: Parallel Connectivity (Boruvka) Repeat until no edges left: For each vertex, select any incident edge Contract selected edges Lemma: process converges in ?(log?) steps
Part 2: Graph Representation ? 2 For a vertex ? let ?? be a vector in Non-zero entries for edges ?,? ???,? = 1 if ? > ? ???,? = 1 if j < ? Example: ?1= 1,1,1,1,0, ,0 ?2= 1,0,0,0,0,0,1,0,1, ,0 Lem: For any ? ? supp ? ??? = ?(?,? ?) 7 3 2 6 1 4 5 1,2 , 1,3 , 1,4 , 1,5 , 1,6 , 1,7 , 2,3 , 2.4 . 2,5 ,
Part 3: ?0-Sampling There is a distribution over ? ? ? with ? = ?(log2?) such w.p. 9/10 that ? ?: ?? ? ????(?) [Cormode, Muthukrishnan, Rozenbaum 05; Jowhari, Saglam, Tardos 11] Constant probability suffices still ? log? Boruvka iterations
Final Algorithm Construct log? 0-samplers for each ?? Run Boruvka on sketches: Use ?1?? to get an edge incident on a node ? For ? = 2 to ?: To get incident edge on a component ? ? use: ????= ?? ?? ? ? ? ? ? ???? ?? = ?(?,? ?) ? ?
K-Connectivity Graph is ?-connected is every cut has size ? Thm: There is a ?(??log3?)-size linear sketch for k-connectivity Generalization: There is an ?(?log5?/?2)- size linear sketch which allows to approximate all cuts in a graph up to error (1 ?)
K-connectivity Algorithm Algorithm for ?-connectivity: Let ?1 be a spanning forest of ?(?,?) For ? = 2, ,? Let ?? be a spanning forest of ?(?,? ?1 ?? 1) Lem: ?(?,?1+ + ??) is k-connected iff G(V,E) is. Trivial Consider a cut in ?(?, ?=1 ? :this cut didn t grow in step ? there is a cut in ?(?,?) of size < ? contradiction ? ??) of size < ?
K-connectivity Algorithm Construct ? independent linear sketches {?1??,?2?? ,????} for connectivity Run ?-connectivity algorithm on sketches: Use ?1?? to get a spanning forest ?1 of ? Use ?2?? ?2??1= ?2(?? ?1) to find ?2 Use ?3?? ?3??1 ?3??2= ?3(?? ?1 ?2) to find ?3
Bipartiteness Reduction: Given ? define ? where vertices ? (?1,?2); edges ?,? (?1,?2) & (?2,?1) Lem: # connected components doubles iff the graph is bipartite. Thm: ?(?log3?)-size linear sketch for k- connectivity (sketch ? (implicitly).)
Minimum Spanning Tree If ??= # connected components in a subgraph induced by edges of weight 1 + ??: ? ??? ? 1 + ??+ ???? 1 + ? ? ??? ?=0 ? 1 where ??= ( 1 + ??+1 1 + ?? cc(G) = #connected components of ? Round weights up to the nearest power of 1 + ? ?? subgraph with edges of weight 1 + ?? Edges taken by the Kruskal s algorithm: n cc(?0) edges of weight 1 ?? ?0 ??(?1) edges of weight (1 + ?) cc ?? 1 cc ?? edges of weight 1 + ??
Minimum Spanning Tree Let ? = log1+?? where ? = max edge weight Overall weight: ? ?? ?0 + 1 ? 1 ( 1 + ??+1 1 + ??) ??(??) ?1 + ??(?? ?? 1 ??(??)) = ? 1 + ??+ 0 Thm: 1 + ? -approx. MST weight can be computed with ? ? linear sketch for ? = ????(?)
MST: Single Linkage Clustering [Zahn 71] Clustering via MST (Single-linkage): k clusters: remove ? ? longest edges from MST Maximizes minimum intercluster distance [Kleinberg, Tardos]
Cut Sparsification Two problems: Approximating Min-Cut in the graph (up to 1 ?) Preserving all cuts in the graph (up to 1 ?) General cut sparsification framework: Sample each edge ? with probability ?? Assign sampled edges weights 1/?? Expected weight of each cut is preserved, but too many cuts can t take union bound
Cut Sparsification For an edge ? let ?? = weight of the minimum cut that contains ? ? = size of the Min-Cut in G Thm [Fung et al.]: If ? is an undirected weighted ? log2? ?? ?2 graph then if ?? min ,1 then the cut sparsification alg. preserves weights of all cuts up to (1 ?) ? log ? ? ?2,1 preserves Thm [Karger]: ?? min Min-Cut up to (1 ?)
Minimum Cut Algorithm: For ? = 0,1, ,2log? : Let ?? be the subgraph of ? where each edge is sampled with probability 1/2? 1 ?2 log? and ?? are Let ??= F1, ,?? where ? = ? forests constructed by the k-connectivity alg. Return 2??(??) where ? = min{? ? ?? < ?} ? log4? ?2 Space: ? , works for dynamic graph streams
Minimum Cut: Analysis Key property: If ?? has ? edges across a cut then ?? contains all such edges ??2 6 log ? 6 log ? ? = logmax 1, ? ? ?? min is approximating min-cut in ? up to (1 ?) ? = ? : By Chernoff bound # edges in ?? that crosses min-cut in ? is ? ??2,1 min cut in ?? 1 ?2log? ? w.h.p.
Cut Sparsification Algorithm: For ? = 0,1, ,2log? : Let ?? be the subgraph of ? where each edge is sampled with probability 1/2? Let ??= F1, ,?? where ? = ? constructed by the k-connectivity alg. For each edge ? let ??= min i:???? < ? . If ? ??? then add e to the sparsifier with weight 2?? 1 ?2 log2? and ?? are forests ? log5? ?2 Space: ? Analysis similar to the Min-Cut using [Fung et al.] , works for dynamic graph streams