Advanced Algorithm Techniques for Big Data Analysis

csce 689 special topics on algorithms n.w
1 / 43
Embed
Share

Explore the Misra-Gries algorithm for identifying frequent items in big data streams, along with its drawbacks and insights. Learn how the algorithm can be adapted for solving the (?,?,?) Frequent Items Problem efficiently.

  • Algorithms
  • Data Analysis
  • Big Data
  • Misra-Gries
  • Frequent Items

Uploaded on | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. CSCE 689: Special Topics on Algorithms for Big Data September 27, 2023

  2. Previously: Misra Gries Goal: Given a set ? of ? elements from [?] and a parameter ?, output the items from [?] that have frequency at least ? ? Initialize ? items ?1, ,??with count ?1, ,??= 0 For updates 1, ,?: If ??= ??for some ?, increment counter ??, i.e., ??= ??+ 1 Else if ??= 0 for some ?, set ??= ?? Else decrement all counters ??, i.e., ??= ?? 1for all ? [?]

  3. Misra Gries Claim: At the end of the stream of length ?, we report all items with frequency at least ? ? Intuition: If there are ? coordinates with frequency ? be tracked and reported, since we have ? counters If there are ? counters for the remaining ? Will have at most ? ?decrement operations, which is small enough so that frequent items are still stored ?, they will all 2coordinates with frequency at least ? 2, updates ?, we still have? 2

  4. Misra Gries Drawbacks: Misra-Gries may return false positives, i.e., items that are not frequent In fact, no algorithm using ?(?) space can output ONLY the items with frequency at least ? ? Intuition: Hard to decide whether coordinate has frequency ? ? ? 1 ?or

  5. Misra Gries Intuition: Hard to decide whether coordinate has frequency ? ? ? 1 ?or ?1= 2, ?2= 5, ?3= 4, ?4= 7, ?5= 1, ?6= 9, ?? ? ?+1= ?, ?? ? ?+2= ?, , ??= ? ? ? 1 times

  6. (?,?)-Frequent Items Problem Goal: Given a set ? of ? elements from [?], an accuracy parameter ? (0,1), and a parameter ?, output a list that includes: The items from [?] that have frequency at least ? No items with frequency less than 1 ? ? ? ?

  7. Misra Gries for (?,?)-Frequent Items Problem Initialize ? items ?1, ,?? with count ?1, ,??= 0 For updates 1, ,?: If ??= ?? for some ?, increment counter ??, i.e., ??= ??+ 1 Else if ??= 0 for some ?, set ??= ?? Else decrement all counters ??, i.e., ??= ?? 1for all ? [?]

  8. Misra Gries for (?,?)-Frequent Items Problem ? ? Set ? = Initialize ? items ?1, ,?? with count ?1, ,??= 0 For updates 1, ,?: If ??= ?? for some ?, increment counter ??, i.e., ??= ??+ 1 Else if ??= 0 for some ?, set ??= ?? Else decrement all counters ??, i.e., ??= ?? 1for all ? [?]

  9. Misra Gries for (?,?)-Frequent Items Problem Claim: For all estimated frequencies ?? by Misra-Gries, we have ?? ?? ?? ?? ? Intuition: Have a lot of counters, so relatively few decrements

  10. (?,?)-Frequent Items Problem Goal: Given a set ? of ? elements from [?], an accuracy parameter ? (0,1), and a parameter ?, output a list that includes: The items from [?] that have frequency at least ? No items with frequency less than 1 ? ? ? ?

  11. Misra Gries for (?,?)-Frequent Items Problem ? ? Set ? = Initialize ? items ?1, ,?? with count ?1, ,??= 0 For updates 1, ,?: If ??= ?? for some ?, increment counter ??, i.e., ??= ??+ 1 Else if ??= 0 for some ?, set ??= ?? Else decrement all counters ??, i.e., ??= ?? 1for all ? [?] Output coordinates ?? with ?? 1 ? ? ?

  12. Misra Gries for (?,?)-Frequent Items Problem Claim: For all estimated frequencies ?? by Misra-Gries, we have ?? ?? ?? ?? ? If ?? ? ?? ? Returning coordinates ?? with ?? 1 ? ? ? with ?? ? NO ? with ??< 1 ? ? ?, then ?? ?? ?? ? and if ??< 1 ? ? ?, then ??< ?? ? means: ? will be returned ?will be returned

  13. Misra Gries for (?,?)-Frequent Items Problem Summary: Misra-Gries can be used to solve the (?,?)-frequent items problem ? ? counters ? ? ?log? bits of space Misra-Gries uses ? Misra-Gries is a deterministic algorithm Misra-Gries never overestimates the true frequency

  14. Insertion-Deletion Streams Stream of length ? = ? Universe of size [?], underlying vector ? ?? Each update increases or decreases a coordinate in ? ?1 ?2 ?3 ?4 ?5 ?6 ?7 0 0 0 0 0 0 0 Decrease ?6 ?1 ?2 ?3 ?4 ?5 ?6 ?7 0 0 0 0 0 -1 0

  15. Insertion-Deletion Streams Database Management: In database management, insertion- deletion streams are used to track changes made to the database over time Transaction logs often utilize this concept to record insertions and deletions to ensure data integrity and support features like rollbacks and recovery

  16. Insertion-Deletion Streams Version Control Systems: Insertion-deletion streams track changes made to files, enabling users to see what has been added (inserted) or removed (deleted) in each version Crucial for collaboration and managing software development projects, central to version control systems

  17. Insertion-Deletion Streams Traffic Flow and Transportation Systems: Insertion-deletion streams are used to analyze traffic patterns and changes in transportation systems This helps in optimizing traffic flow, managing congestion, and improving transportation infrastructure

  18. Frequent Items on Insertion-Deletion Streams Misra-Gries on Insertion-Deletion Streams Increase ?1 Increase ?3 Increase ?2 Increase ?2 Decrease ?2 Decrease ?2 Decrease ?3

  19. CountMin Another algorithm for the (?,?)-frequent items problem Can be used on insertion-deletion streams Can be easily parallelized across multiple servers

  20. CountMin Initalization: Create ? buckets of counters and use a random hash function : ? [?] Algorithm: For each update ??, increment the counter ?? ?1 0 ?2 0 ?3 0 ?4 0 At the end of the stream, output the counter ?? as the estimate for ??

  21. CountMin ?1 0 ?2 0 ?3 0 ?4 0 1 ?5 0 ?6 0 ?7 0 ?1 0 ?2 0 ?3 0 ?4 0

  22. CountMin ?1 1 ?2 0 ?3 0 ?4 0 1 ?5 0 ?6 0 ?7 0 ? = 3? + 2 (mod 4) ?1 0 ?2 0 ?3 0 ?4 0

  23. CountMin ?1 1 ?2 0 ?3 0 ?4 0 1 ?5 0 ?6 0 ?7 0 ? = 3? + 2 (mod 4) ?1 0 ?2 0 ?3 0 ?4 0

  24. CountMin ?1 1 ?2 0 ?3 0 ?4 0 1 ?5 0 ?6 0 ?7 0 ? = 3? + 2 (mod 4) ?1 1 ?2 0 ?3 0 ?4 0

  25. CountMin ?1 1 ?2 0 ?3 0 ?4 0 3 ?5 0 ?6 0 ?7 0 ? = 3? + 2 (mod 4) ?1 1 ?2 0 ?3 0 ?4 0

  26. CountMin ?1 1 ?2 0 ?3 1 ?4 0 3 ?5 0 ?6 0 ?7 0 ? = 3? + 2 (mod 4) ?1 1 ?2 0 ?3 1 ?4 0

  27. CountMin ?1 1 ?2 0 ?3 1 ?4 0 2 ?5 0 ?6 0 ?7 0 ? = 3? + 2 (mod 4) ?1 1 ?2 0 ?3 1 ?4 0

  28. CountMin ?1 1 ?2 1 ?3 1 ?4 0 2 ?5 0 ?6 0 ?7 0 ? = 3? + 2 (mod 4) ?1 1 ?2 0 ?3 1 ?4 1

  29. CountMin ?1 1 ?2 1 ?3 1 ?4 0 1 ?5 0 ?6 0 ?7 0 ? = 3? + 2 (mod 4) ?1 1 ?2 0 ?3 1 ?4 1

  30. CountMin ?1 2 ?2 1 ?3 1 ?4 0 1 ?5 0 ?6 0 ?7 0 ? = 3? + 2 (mod 4) ?1 2 ?2 0 ?3 1 ?4 1

  31. CountMin ?1 2 ?2 1 ?3 1 ?4 0 5 ?5 0 ?6 0 ?7 0 ? = 3? + 2 (mod 4) ?1 2 ?2 0 ?3 1 ?4 1

  32. CountMin ?1 2 ?2 1 ?3 1 ?4 0 5 ?5 1 ?6 0 ?7 0 ? = 3? + 2 (mod 4) ?1 3 ?2 0 ?3 1 ?4 1

  33. CountMin ?1 2 ?2 1 ?3 1 ?4 0 ?5 1 ?6 0 ?7 0 What is the estimation for ?4? What about ?3? What about ?5? What about ?1? ? = 3? + 2 (mod 4) ?1 3 ?2 0 ?3 1 ?4 1

  34. CountMin Given a set ? of ? elements from [?], let ?? be the estimated frequency for ?? Claim: We always have ?? ?? for insertion-only streams Suppose ? = ? so that ??= ?? Note that ??counts the number ??of occurrences of any ? with ? = ? = (?), including ?? itself

  35. CountMin Suppose ? = ? so that ??= ?? Note that ??counts the number ??of occurrences of any ? with ? = ? = (?), including ?? itself ??= ?: ? =??? ?? since ? = ? ??= ??+ ? ?, with ?: ? =???

  36. CountMin Error Analysis ??= ??+ ? ?, with ?: ? =??? What is the expected error for ???

  37. CountMin Error Analysis ??= ??+ ? ?, with ?: ? =??? What is the expected error for ??? E ? ?, with ?: ? =??? ? ?E ?? ? ? = (?)

  38. CountMin Error Analysis ??= ??+ ? ?, with ?: ? =??? What is the expected error for ??? E ? ?, with ?: ? =??? ? ?E ?? ? ? = (?) = ? ?E ? ? = (?) ??

  39. CountMin Error Analysis ??= ??+ ? ?, with ?: ? =??? What is the expected error for ??? E ? ?, with ?: ? =??? ? ?E ?? ? ? = (?) = ? ?E ? ? = (?) ?? = ? ?Pr ? = (?) ??

  40. CountMin Error Analysis ??= ??+ ? ?, with ?: ? =??? What is the expected error for ??? E ? ?, with ?: ? =??? ? ?E ?? ? ? = (?) = ? ?E ? ? = (?) ?? = ? ?Pr ? = (?) ?? 1 ? ?? = ?1 ? = ? ?

  41. CountMin Error Analysis ??= ??+ ? ?, with ?: ? =??? What is the expected error for ??? E ? ?, with ?: ? =??? ? ?E ?? ? ? = (?) = ? ?E ? ? = (?) ?? = ? ?Pr ? = (?) ?? 1 ? ?? = ?1 ? = ? ? Set ? =9? ?, then the expected error is at most ? ?1 9?

  42. CountMin Error Analysis Set ? =9? also works for insertion-deletion streams) ?, then the expected error is at most ? ?1 9? (same analysis By Markov s inequality, the error for ?? is at most ? ?1 probability at least 2 3 3?with How to ensure accuracy for all ? [?]?

  43. CountMin Error Analysis By Markov s inequality, the error for ?? is at most ? ?1 probability at least 2 3 3?with How to ensure accuracy for all ? [?]? Repeat ?(log?) times to get estimates ?1, ,? for each ? [?] and set ??= min(e1, ,e ) (proof is by Chernoff bounds)

Related


More Related Content