Streaming and Sketching in Theoretical Computer Science

cmsc5706 topics in theoretical computer science n.w
1 / 33
Embed
Share

Explore the topics covered in Week 3 of CMSC5706, focusing on streaming and sketching in theoretical computer science. Learn about missing numbers, Count-Min sketch, lower bounds, and communication complexity in handling big data streams efficiently. Understand the challenges of processing data in a streaming fashion with limited space and quick update requirements. Discover algorithms and solutions to identify missing numbers and handle complex tasks using minimal space complexity.

  • Computer Science
  • Streaming
  • Sketching
  • Data Processing
  • Algorithms

Uploaded on | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. CMSC5706 Topics in Theoretical Computer Science Week 3: Streaming and Sketching Instructor: Shengyu Zhang 1

  2. Map Motivations and model Problem 1: Missing numbers Problem 2: Count-Min sketch Lower bounds Communication complexity 2

  3. Motivations Big mass of data. Data comes as a stream. Cannot see future data. Relatively small space. sketch Cannot store past data Need to process each item fast. Quick update time. Examples: Phone calls, Internet packets, satellite pictures, 3

  4. Problem 1: Missing numbers A set of numbers ? = {1,2, ,?} ? 1 of them come in a stream ?1,?2, ,?? 1; one number is missing. 3,25,6,19,1,10, Task: identify which one is missing. Using small space. 4

  5. A simple algorithm Maintain the sum of the input numbers. ??? = 0 for? = 1 to ? 1 ??? = ??? + ?? return ? ?+1 ??? 2 5

  6. Space complexity ??? is at most ? ?+1 during the algorithm. 2 ? ?+1 2 Thus it takes at most log2 bits to write it down. = ? log2? Space complexity:? log2? . Much smaller than storing the whole stream, which takes at least ?(?log?). 6

  7. More complicated Now the task gets harder. ? 2 of them come in a stream ?1,?2, ,?? 2, two numbers are missing. 3,25,6,19,1,10, Task: identify which two are missing. Using small space. 7

  8. First try Maintain the sum and product of the input numbers. ??? = 0; ??????? = 1 for? = 1 to ? 2 ??? = ??? + ?? ??????? = ??????? ?? ? =? ?+1 solve equations ? + ? = ?,? ? = ? return (?,?) ???, ? = ?!/??????? 2 8

  9. Problem and solution Issue: ??????? is at least ? 2 ! Thus even writing down the number needs log2? 2 ! = (?log?) bits. Too much compared to ?(log?) before. How to do? 9

  10. Improvement Note that we don t need to maintain product. We can maintain anything, as long as finally we can reconstruct the solution from the stored results. One summary that is much smaller than product: sum of squares. Recall: 12+ 22+ + ?2=? ?+1 2?+1 6 10

  11. Improvement Maintain the sum and sum of squares of the input numbers. ??? = 0; ??? = 0 for? = 1 to ? 2 ??? = ??? + ?? ??? = ??? + ?? 2 ? =? ?+1 solve equations ? + ? = ?,?2+ ?2= ?. return (?,?) ???, ? =? ?+1 2?+1 ??? 2 6 11

  12. Space complexity ??? is at most ? ?+1 2?+1 algorithm. during the 6 ? ?+1 2?+1 6 Thus it takes at most log2 ? log2? bits to write it down. = Space complexity:? log2? . 12

  13. Further question Now assume that the numbers are from an arbitrary set ? = {?1,?2, ,??}. ? ? of them come in a stream ?1,?2, ,?? ?; ? numbers are missing. Task: identify which ? are missing. Using small space. 13

  14. First try 2, , ??? ? of the input numbers. Maintain ???, ??? ???1= 0; ???2= 0; ; ????= 0 for? = 1to? ? for? = 1to? ????= ????+ ?? ? ? ?? ???1 ?? ?? ???= ?=1 ??? 2= ?=1 ? 2 ???2 solve system of equations ?= ?=1 ? ? ???? ??? return ?1, ,?? 14

  15. Space complexity ???? is at most ?(??) during the algorithm. Thus it takes at most O(?log2??) = ? ?2log2? bits to write it down. Space complexity:? ?2log2? . 15

  16. Problem 2: high frequency estimation Consider an array ? 1..? of size ?. Items like ?1,+ ,(?2, ), , ??,+ come in a stream. 3,+ ,(3,+),(2,+),(3, ), ? ? + + when (?,+) comes, and ? ? when (?, ) comes Assumption: ? ? 0 all the time. Task: Answer queries like what is ?[18] ? 16

  17. Approximation and error Unlike the previous algorithm, here deterministic algorithm needs a lot of space. But if we allow approximation: only estimate ?[?] up to certain precision error: algorithm fails with some small probability then we ll have an efficient randomized algorithm. 17

  18. Pick log(1/?) hash functions ?: ? ?/? uniformly at random from a family of pairwise independent hash functions. ?/? ?, so it s space efficient. For each ? [?], different ? s map it to different buckets . Idea: only maintain counters for buckets. ?/? ? 18

  19. Algorithm for? = 1tolog 1/? for? = 1to?/? ????? ?,? = 0 for? = 1to? if item ? is ?,+/ for? = 1to log 1/? ++/ count ?, ?? On query ?[?]: return ? ? = min count ?, ?? ? 19

  20. Guarantee At any time of query: Define ? = ??[?] Theorem. ? ? ?[?] ? ? ? ? + ? ? with probability 1 ?. 20

  21. Analysis ? ? ?[?] is easy: Any time when ? ? increases by 1, we increase count ?, ?? for each ?. Thus min also increases by 1. count ?, ?? ? Thus we never miss any increment. 21

  22. Analysis Next: ? ? ? ? + ? ? with prob. 1 ?. ???: the contribution of items other than ? to count j, ?? . Claim. ? ??? =? ? ?? . ? ? ? ? Proof. For each fixed item ? ?, the probability of ?? = ?? is ?/?. There are ? ? ? many items ? ? (counting multiplicity), thus ? ??? =? ? ? ? . ? 22

  23. Pr ?? > ? ? + ? ? Pr ? ? + ???> ? ? + ? ? , ? ? ? = ? ? + ??? by definition = > ? ? + ? ? min count ?, ?? ? ? ? + ???> ? ? + ? ? , ? Pr ? ? + ???> ? ? + ? ? , ? log 1/? = Pr ? ? + ???> ? ? + ? ? because different ? s are independently chosen. 23

  24. Pr ? ? + ???> ? ? + ? ? Recall: ? ??? =? = Pr ???> ? ? ? ?? ? ? ? ? By Markov s inequality, Pr ???> ? ? Pr ???> ?? ??? < 1/? Putting everything together, log1 1 ? ? Pr ? ? > ? ? + ? ? = ? 24

  25. Lower bounds Theorem. In order to estimate ? ? within an error of ? ? with probability 2/3, one needs to use 1 ? space. Proof. We will use one-way communication complexity. 25

  26. Communication complexity ? ? ?(?,?) Two parties, Alice and Bob, jointly compute a function ? on input (?,?). ? known only to Alice and ? only to Bob. Communication complexity: how many bits are needed to be exchanged?

  27. One-way communication complexity ? 0,1? ? [?] ? ?,? = ?? Theorem. Index function needs ? communication bits. even for randomized protocols.

  28. Lower bound Theorem. In order to estimate ? ? within an error of ? ? with probability 2/3, one needs to use 1 ? space. Proof. Given an Index problem input (?,?), with ? = 1/2?. Let ? be: ? ? = 2?? for ? = 1, ,?, and ? 0 = 2 ? ? :??= 0 . ? = 2? = 1/?. Thus ? ? = 1. 28

  29. If one can estimate ?[?] within error ? ? = 1 using space ?, then Alice can use this way to transmit the space to Bob. communication: ? bits. Bob then gets ? [?] which differ from ?[?] by 1. Bob can then determine whether ??= 0 or ??= 1. Thus the communication lower bound implies ? = ? = 1 ?, as desired. 29

  30. One thing left Pairwise independent hash family A family of functions ? = ? ? is pairwise independent if the following two conditions hold when we pick ? uniformly at random: ? ?, the random variable (?) is uniformly distributed in ? ?1 ?2 ?, the random variables ?1 and ?2 are independent,. 30

  31. Note that the condition is equivalent to the following. For any two different ?1 ?2 ?, and any ?1,?2 ?, it holds that ?? ? ?1 = ?1 ??? ?2 = ?2 = 1/ ?2 31

  32. Construction There is an easy construction of the pairwise independent hash function family. Let ? be a prime, and define ?,?? = ?? + ? ??? ? Define family ? = { ?,?:0 ?,? ? 1} Theorem. ? is a family of pairwise independent hash functions. 32

  33. It is enough to show that ?? ? ?1 = ?1 ??? ?2 = ?2 = 1/?2 For any ?1 ?2, ?1 and ?2, there is a unique pair (?,?) s.t. ?,??1 = ?1 ??? ?,??2 = ?2. Indeed, this is just ??1+ ? = ?1 ??? ? ??2+ ? = ?2 ??? ? which has a unique solution for (?,?) because ?1 ?2 1 1 0 due to ?1 ?2. 33

Related


More Related Content