
Probabilistic Algorithms & Data Structures Course Information
Learn about Comp.480/580: Probabilistic Algorithms and Data Structures course, taught by Instructor Anshumali Shrivastava at Rice University. Topics include prerequisites, grading criteria, project options, assignments, exams, and the importance of academic integrity. Prepare for engaging lectures, projects, and exams in this informative course.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Comp 480/580: Probabilistic Algorithms and Data Structures
About Instructor: Anshumali Shrivastava https://www.cs.rice.edu/~as143/ Email: anshumali@rice.edu Office Hours: After Tuesdays Class in DH 3118 TAs: TBA Office Hours will be announced soon. Location: DCH 1064 Meeting Hours: Tuesdays and Thursdays 2:30pm - 3:45pm Recordings will be made available after the class.
Important Links to Remember Course Web Page: https://www.cs.rice.edu/~as143/COMP480_580_Fall22/index.html Materials and slides Textbook (You wont need any) In case you like to have one: Read Probability and Computing: Randomized Algorithms and Probabilistic Analysis " by Michael Mitzenmacher and Eli Upfal. Canvas for Discussions. For announcements and discussions. Please sign up on Canvas for Comp 480. Those who are signed up for Comp 580, please sign up for 480 on Canvas.
Prerequisites COMP 182 or Equivalent Required. COMP 382 is useful but not required. Ability to manipulate arrays, linked lists, etc. will be needed for assignments. Familiarity with Basic File I/O. Basic Probability: Comfortable with Random Variables, Expectation and Variance.
Grading Term Project or Final Exam: 30% (Choose by 13th Sept via signup sheet). Assignments: 40% (Biweekly starting 1st Sept due in 2-week time). GENERALLY NO EXCEPTIONS TO DEADLINE. Mid-Term Exam: 20%. (1st Nov) Lecture Scribing: 10% (Scribe 1 Lecture Starting 30thAugust)
If You Choose Project over Finals Group of maximum 2 students. Propose 1-page abstract due 15thSept. (Iterate with me or TAs until approved) Deadline for Finalizing the Project: 9thSept. Checkpoint 1: 2-page Progress Report Due: 30thSept. (Literature Review and Proposed Idea) Mid-Term Checkpoint: 2-page Progress: 30thOct. (Preliminary Results) Final Checkpoint: Full Report (Around the Final day of Exam)
Assignments and Exams Individual only. Basic Blackboard discussions are allowed only for the coding part. Read about Plagiarism https://www.plagiarism.org/article/what-is-plagiarism https://gpsdocs.rice.edu/orientation/Plagiarism_Hewitt_document.pdf Rice Honor Code http://honor.rice.edu/
Scribing What is a scribe? Come to class and take notes. Short: Create lecture notes for the class. Longer Description: 2-3 page clear and concise description of what was taught in a class so that if one of your fellow student missed the class, he can read the scribe and get everything. Latex template available at: https://www.cs.rice.edu/~as143/COMP480_580_Fall22/scribe/scribe.zip First two lectures are scribed by me. See an example. By 25th Sept, choose one of the lectures to scribe. Link to Sign Up: https://docs.google.com/spreadsheets/d/1pe6uyY9YcMuTfqHTvsxs6xHX1-4cXkvChLZRTeiZFog/edit#gid=0 Scribe are due within a week.
Why Probabilistic (Randomized) Algorithms? Case Study 1: Design a Product Search Engine for Amazon.com Take a database D of statistically significant query strings observed in the past. (say around 50 million). Given a new user typed query q, find the closest string ? ? (in some distance) to q and return items purchased with it. Latency: 50 million distance computation per query. A cheap distance function it takes more than 400s or 6min. If you used edit distance, it will be hours. Latency Limit is roughly < 20ms. (20000x improvement needed) Ideas?
Why Probabilistic (Randomized) Algorithms? Case Study 2: Design Chrome Malicious URL Detector Lets say you work for Google, in the Chrome team, and you want to add a feature to the browser which notifies the user if the url he has entered is a malicious URL. Given a database of about known 1 million malicious URLs, the size of any dictionary for matching will be around 50MB (size of 1 million urls with 25 average string length). 50MB is too heavy for a browser so this cannot be locally done!! Anything locally should be something < 2MB memory. (25x improvement needed) Wait you cannot even compress the strings (Huffman Coding) to that size. Ideas?
Different Scale is a Different World!! Current algorithm is off by 1.5x factor. Reactive Approach: Start with standard approach and use tricks and tools to squeeze everything out and make it better. Similar ideas with smart implementation may do the trick. Current algorithm is off by 10x+ factor. Go to drawing board. You have the transcend the current level of thinking. Anything which is similar to current method is unlikely to work! We cannot solve our problems with the same thinking we used when we created them. - Albert Einstein
No Free Lunch (NFL) Theorem No Free Lunch (NFL) Theorem I want 10x+ improvements over existing ones. What am I trading off? Uncertainty is the refuge of hope. - Henri Frederic Amiel We will trade-off Certainty. (It is a resource!!)
Why Probabilistic (Randomized) Algorithms? Case Study 1: Design a Product Search Engine for Amazon.com Take a database D of statistically significant query strings observed in the past. (say around 50 million). Given a new user typed query q, find the closest string ? ? (in some distance) to q and return items purchased with it. Latency: 50 million distance computation per query. A cheap distance function it takes more than 400s or 6min. If you used edit distance, it will be hours. Latency Limit is roughly < 20ms. (20000x improvement needed) What if there is an algorithm that runs in 2ms and gives you a good enough ? ? and fails one out of million queries?
Why Probabilistic (Randomized) Algorithms? Case Study 2: Design Chrome Malicious URL Detector Lets say you work for Google, in the Chrome team, and you want to add a feature to the browser which notifies the user if the url he has entered is a malicious URL. Given a database of about known 1 million malicious URLs, the size of any dictionary for matching will be around 50MB (size of 1 million urls with 25 average string length). 50MB is too heavy for a browser so this cannot be locally done!! Anything locally should be something < 2MB memory but wait you cannot even compress the strings (Huffman Coding) to that size. What if there is an algorithm that only uses 2MB memory and notifies perfectly if the url is malicious, but once out a million matches it can also notify a correct url as malicious? (Is that good enough?)