Algorithms for Big Data: Streaming and Sublinear Time Algorithms

Algorithms for Big Data: Streaming  and Sublinear Time Algorithms
Slide Note
Embed
Share

Motivation behind big data research, exploration of sublinear time algorithms, and analysis of algorithms for diameter approximation and property testing. The talk delves into theoretical algorithms for big data problems, focusing on streaming and sublinear time algorithms, showcasing classical examples and methodologies.

  • Big Data
  • Algorithms
  • Sublinear Time
  • Streaming
  • Diameter Approximation

Uploaded on Feb 26, 2025 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Algorithms for Big Data: Streaming and Sublinear Time Algorithms Moran Feldman

  2. Motivation Big Data (huge data sets) Why? So What? The Internet Easy to collect data Easy to transfer data New equipment LHC Difficult to process all the data. Difficult to store all the data. 2

  3. What is this Talk About? Big data has motivated a lot of research (both CS and non-CS). In this talk we are interested in theoretical algorithms for big data problems. Sublinear time algorithms Streaming Algorithms We will see a few classical algorithms of each one of these kinds. 3

  4. Sublinear Time Algorithms

  5. Sublinear Time Algorithms Most algorithms read all their input. Require at least a linear time. We are interested in sublinear time algorithms. Cannot afford to read all its input. We will start with a simple example 5

  6. Diameter Approximation All distances are non-negative. d(u, v) = 0 u = v d(u, v) = d(v, u) d(u, v) + d(v, w) d(u, w) Instance A set P of points. A function d: P P R giving the distance between every pair of points. We assume d is a metric. u v z w Objective Approximate the diameter D of P. D = max ( P ) , d , u v u v 6

  7. Algorithm Trivial Algorithm Query the distance between every pair of points, and return the maximum one. Time complexity: O(P2). The size of the input. u v More Involved Algorithm 1. Fix an arbitrary point u. 2. Query the distance of every other point from u. 3. Return the maximum distance found. z w 7

  8. Analysis A square root of the size of the input. u Time Complexity Time O(P). v This is a rare example of a sublinear time deterministic algorithm. Guarantee Let d(v, w) be the diameter. By the triangle inequality: d(u, v) + d(u, w) d(v, w) = D. The algorithm outputs a value D such that: D/2 D D. z w 8

  9. Property testing We are interested in deciding whether an object has some property. Often depends on all the input: Is a list of numbers sorted? Are all the numbers in a set distinct? Is an image an half-plane? To get the right answer with a constant probability one has to read a constant fraction of the input. 9

  10. Property testing (cont.) The exact definition varies. Intuitively, changing a fraction Distinguish between two cases of of the object cannot make it have the property. Object is -far from having the property Object has the property Otherwise Does not matter! Answer Yes Answer No 10

  11. Testing List Sortedness More than n numbers has to be changed to make the list sorted. Instance A list of numbers of n numbers. Objective Test whether the list is sorted (ascending) or -far from being sorted. Trivial Algorithms Pick a uniformly random 1 i n 1, and test whether xi xi+1 . Fails with high probability for the -far instance: Fails with high probability for the -far instance: Pick uniformly random 1 i j n, and test whether xi xj . 1111100000 1032547698 11

  12. Algorithm [EKKRV00] 1. Pick a uniformly random i. 2. Run a binary search for xi. 3. Answer No if the binary search ends up at a point other than i (and Yes otherwise). x4 x2 x6 x1 x3 x5 x7 12

  13. Completeness Analysis We need to show that the algorithm always returns Yes when the input is sorted. 1. Pick a uniformly random i. 2. Run a binary search for xi. 3. Answer No if the binary search ends up at a point other than i (and Yes otherwise). Should never happen when the list is sorted. (and the elements are unique) 13

  14. Soundness Analysis We need to upper bound the probability that the algorithm returns Yes when the list is -far from being sorted. An index i is good if the algorithm returns Yes when it randomly chooses i. Clearly: Number of Good Indexes Probability of Yes = n Thus, we want to upper bound the number of good idexes in an -far list. 14

  15. Main Observation Lemma The elements at the good indexes form a sorted sub-list. Proof Let i < j be two good indexes. Let k be the index of their lowest common ancestor in the binary search tree. Since i and j are good indexes we get: xi < xk xk < xj x4 xk x2 x6 xj x1 x3 xi x5 x7 15

  16. Soundness Analysis (cont.) In an -far list: No (1- )n elements can form a sorted sub-list. There are less than (1- )n good elements. The algorithm answers Yes with probability less than: 1 - Can be improved by repetition. Repeating the algorithm -1 times yields an algorithm that: Always answer Yes for a sorted input. Answer No for an -far input with probability at least 1/e 0.367. Never fails with high probability for a -far input. Time complexity: O( -1 log n) 16

  17. Streaming Algorithms

  18. Motivation Two Scenarios: Disadvantages Poor random Advantages Cost effective: Hardware Energy Fast sequential access Can be stored Network traffic access speed Poor long term reliability: Data has to be copied occasionally. A network element (a) (b) Processes the traffic: For example, detects malicious activity. offsite: Backup Security Magnetic Tape Problem The element can store only a small fraction of the traffic. 18

  19. Streaming Model An answer based on the input Input stream Algorithm Main Issue Multiple Passes The algorithm should use little memory. Often polylogarithmic in the size of the input. Sometimes the algorithm is allowed multiple passes over the input. Appropriate for the magnetic tape motivation. 19

  20. Finding Frequent Elements Theorem [MG82] There is a streaming algorithm using O(k (log n + log m)) space which: Outputs a set of at most k 1 elements. The set contains every element with more than n/k appearances in the stream. Remarks A second pass can be used to detect the elements that really have more than n/k appearances. For simplicity, we present the algorithm for the case k = 2. Occasionally comes up in job interviews. 20

  21. Algorithm 1. Initialize: counter 0. 2. For each arriving element e do 3. If counter = 0 then 4. Set counter 1, candidate e. 5. ElseIf candidate = e then 6. Set counter counter + 1. 7. Else 8. Set counter counter - 1. 9. Return candidate. 21

  22. Analysis Immediate Observations The algorithm uses O(log n + log m) space. The algorithm outputs a single element. If there is no element that appears more than n/2 times, then we are done. Otherwise, let e1/2 be this element. Definition X is defined as follows: X = counter when the candidate is e1/2. X = -counter when the candidate is not e1/2. Lemma At every given time during the execution of the algorithm: X (appearances of e1/2 so far) (appearances of other elements so far). 22

  23. Proof of the Lemma Lemma At every given time during the execution of the algorithm: X (appearances of e1/2 so far) (appearances of other elements so far). Proof The proof is by induction. Trivially holds before the first element arrives. Assume it holds before the arrival of an element e, then: other out. Intuitively, we get an inequality because elements other than e1/2 might cancel each e1/2 is the candidate? Yes Both sides increase by 1 No Both sides increase by 1 and left side changes by 1. Yes e = e1/2 ? Right side decrease by 1 Both sides decrease by 1 No 23

  24. Warping Up the Proof After all the input is processed, we have: X (appearances of e1/2 so far) (appearances of other elements so far). X must be positive as well The right side is positive. e1/2 is the final candidate 24

  25. Streaming Algorithms for Graph Problems Streaming for Graph Problems The stream consists of the edges of the graph. Allows O(n polylog(n)) space. Sublinear in the length of the stream which can be (n2). (Trivial) Algorithm for Counting Connected Components 1. Initially each node is an independent connected component. 2. For each edge e that arrives, if the end points of e belong to different connected component, merge these connected components. 25

  26. Algorithm for Counting Connected Components Example Analysis Space complexity: O(n log n). It is enough to maintain the list of nodes in each connected component. 26

  27. Applications Immediate Application Determining whether a graph is connected. More Interesting Application Determining whether a graph is bipartite. Algorithm 1. Let n1 and n2be the number of connected components in G and G2. 2. G is bipartite if and only if 2n1 = n2. G G2 u u1 u2 v v1 v2 27

  28. Analysis Lemma The copies of the nodes of a connected component C of G form: Two connected components of G2 if C is bipartite. A single connected component of G2 if C is not bipartite. Proof There is never a path in G2 between copies of nodes which are not connected in G. If u and v are connected in G, then each copy of u in G2 is connected to some copy of v. This lemma implies that the algorithm is correct. The copies of the nodes of a connected component C of G form one or two connected components in G2. Moreover, in the later case each component contains exactly one copy of each node of C. 28

  29. Analysis (cont.) C is not bipartite Let v be a node on an odd cycle of C. The cycle becomes a path between v1 and v2 in G2. C is bipartite A path between v1 and v2 in G2 implies an odd cycle in G. Cannot exist since C is bipartite. Alternative view: G G2 a2 A B a b b1 v1 v v2 c c2 d d1 A1 B2 A2 B1 29

Related


More Related Content