
Internet Filter Bubble and Personalization Effects
Explore the impact of the Internet filter bubble and ubiquitous personalization on users. Alan Mislove, an Assistant Professor at CCIS, delves into the dangers and goals of personalization in web content. Discover methodologies for quantifying user experiences and shaping future online interactions.
Presentation Transcript
Quantifying the Internet Filter Bubble Alan Mislove Assistant Professor @ CCIS amislove@ccs.neu.edu
Personalization on the Web: Santa Barbara, California; Amherst, Massachusetts
Personalization is Ubiquitous: Search Results; Social Media; Goods and Services; Music, Movies, Media
Dangers of Personalization: Current Events, News, Information; Travel and Tourism? Relevant information might not be reachable. People are unaware of the personalization.
Personalization in the Press: "The Trouble With the Echo Chamber Online"; "Websites Vary Prices, Deals Based on Users' Information"
Goals of Our Work: 1. Measure the web to determine who personalizes, how much they personalize, and what user features drive personalization. 2. Develop systems to help users: reveal personalization (increase transparency) and remove personalization (pop the bubbles). Today's targets:
Outline: Methodology; Measuring Google Search (Real User Accounts, Synthetic User Accounts); Conclusions and Future Work
High-level Methodology: Compare, Measure, Difference. [Diagram: two search result pages listing www.a.com through www.e.com, compared side by side.] Challenges: 1. Choosing metrics 2. Controlling noise 3. Selecting queries
Evaluation Metrics: Jaccard Index: how many results are shared? Range [0, 1]. Edit Distance: how many results are reordered? Range [0, 10]. [Diagram: the result lists of Page 1 and Page 2 compared.]
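A minimal sketch of the two metrics, treating each result page as a ranked list of URLs. The slide does not say which edit-distance variant is used, so plain Levenshtein distance is assumed here; the function names and sample pages are hypothetical.

```python
from typing import Sequence


def jaccard_index(a: Sequence[str], b: Sequence[str]) -> float:
    """Share of distinct URLs common to both pages (0 = disjoint, 1 = same set)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty pages count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two ranked URL lists (assumed variant)."""
    prev = list(range(len(b) + 1))
    for i, url_a in enumerate(a, start=1):
        curr = [i]
        for j, url_b in enumerate(b, start=1):
            cost = 0 if url_a == url_b else 1
            curr.append(min(prev[j] + 1,          # delete url_a
                            curr[j - 1] + 1,      # insert url_b
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


# Hypothetical result pages, not taken from the study's data.
page_1 = ["www.a.com", "www.b.com", "www.c.com"]
page_2 = ["www.b.com", "www.a.com", "www.c.com"]  # same URLs, reordered
page_3 = ["www.a.com", "www.b.com", "www.d.com"]  # one URL replaced

print(jaccard_index(page_1, page_2), edit_distance(page_1, page_2))
print(jaccard_index(page_1, page_3), edit_distance(page_1, page_3))
```

The two metrics are complementary: the Jaccard index ignores order, so a pure reordering only shows up in the edit distance. With ten results per page, the edit distance falls in the slide's stated range of [0, 10].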
Controlling for Noise: Difference - Noise = Personalization. Controls: queries run at the same time; IP addresses in the same /24 (129.10.115.14, 129.10.115.15, 129.10.115.16); same Google IP address (74.125.225.67).
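As a rough illustration of the /24 control, the check below uses Python's standard `ipaddress` module to confirm that a set of client addresses falls in one /24 network. The helper names are hypothetical. The subtraction helper reads the slide as "difference minus noise", and clamping at zero is an added assumption.

```python
import ipaddress
from typing import Iterable


def same_slash_24(addresses: Iterable[str]) -> bool:
    """True if every IPv4 address falls inside the same /24 network."""
    networks = {
        ipaddress.ip_network(f"{addr}/24", strict=False) for addr in addresses
    }
    return len(networks) == 1


def personalization(difference: float, noise: float) -> float:
    """Difference left over after subtracting control noise (clamped at 0)."""
    return max(0.0, difference - noise)


clients = ["129.10.115.14", "129.10.115.15", "129.10.115.16"]
print(same_slash_24(clients))                     # client machines from the slide
print(same_slash_24(clients + ["74.125.225.67"])) # Google's address is elsewhere
```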
More Noise? Search for "healthcare" vs. search for "obama", then "healthcare". Problem: earlier queries may carry over into the results of subsequent queries.
Measuring Carry-Over: [Figure: Overlap in Results, Searching for "test" and "test" + "touring". Average Jaccard Index (0 to 1) vs. Time Between Queries (Minutes, 0 to 20).] 10 minute cutoff. In our tests, we wait 11 minutes between queries.
Experimental Queries: Two objectives: broad coverage, i.e. many topics; high impact, i.e. popular searches. 120 queries in 12 categories: news, politics, apparel, gadgets, health, etc.
Experimental Treatments: Questions we want to answer: To what extent is content personalized? What user features drive personalization? Real User Accounts: leverage real Google accounts with lots of history; measure personalization in real life. Synthetic User Accounts: create accounts that each vary by one feature; measure the impact of specific features.
Real User Experiment: Task on Amazon Mechanical Turk (AMT). 200 participants with US Google accounts. Each executed all 120 queries. Every query paired with two control queries, run from empty accounts (i.e. no history), as baseline results for comparison. [Diagram: user queries and control queries sent through an HTTP proxy.]
Results from Real Users: [Figure: Results Changed (%) vs. Search Result Rank (1 to 10), AMT Results compared with Control Results.] The difference between the AMT and control results is personalization. Top ranks are less personalized; lower ranks are more personalized. On average, AMT results have an 11.7% higher chance of differing than the controls. Most changes are due to location.
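One way a per-rank "results changed" curve like this could be computed is sketched below. It assumes each observation is a pair of ranked URL lists for the same query. The function name and sample data are hypothetical, and the study's exact definition may differ.

```python
from typing import Iterable, List, Sequence, Tuple


def percent_changed_by_rank(
    pairs: Iterable[Tuple[Sequence[str], Sequence[str]]], depth: int = 10
) -> List[float]:
    """For each rank, the percentage of pairs whose URLs differ at that rank."""
    changed = [0] * depth
    total = [0] * depth
    for results_a, results_b in pairs:
        for rank in range(depth):
            url_a = results_a[rank] if rank < len(results_a) else None
            url_b = results_b[rank] if rank < len(results_b) else None
            if url_a is None and url_b is None:
                continue  # neither page has a result at this rank
            total[rank] += 1
            if url_a != url_b:
                changed[rank] += 1
    return [100.0 * c / t if t else 0.0 for c, t in zip(changed, total)]


# Hypothetical pairs: (participant results, control results) for two queries.
pairs = [
    (["www.a.com", "www.b.com"], ["www.a.com", "www.c.com"]),
    (["www.a.com", "www.b.com"], ["www.a.com", "www.b.com"]),
]
print(percent_changed_by_rank(pairs, depth=2))
```

Running the same function on control-versus-control pairs gives the noise baseline; the gap between the two curves is what the slide attributes to personalization.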
What Causes Personalization? AMT results reveal extensive personalization. Next question: what user features drive this? Static features: gender, age, browser, operating system, location (IP address), logged in/out. Historical features: logged in/out, history of searches, history of search result clicks, browsing history. Methodology: use synthetic (fake) accounts.
Logged In/Out to Google: [Figure: two panels, Average Jaccard Index and Average Edit Distance, by Day (1 to 7), for the pairs No Cookies / No Cookies, Logged In / No Cookies, and Logged Out / No Cookies.] Same results, but in a different order.
Google IP Geolocation: [Figure: two panels, Jaccard Index and Average Edit Distance, by Day (1 to 7), for the location pairs MA / MA, CA / MA, UT / MA, IL / MA, and NC / MA.] On average, 1 different result, plus 1 pair of reordered results.
Building Search History: Goal: imitate the behavior of a demographic group by searching, browsing, and clicking search results. Dataset from Quantcast: 20 demographic groups and their browsing habits (age, gender, income, ethnicity, education).
Profile: Websites
Female: babycenter.com, avon.com, refinery29.com, tasteofhome.com
Male: Stackoverflow.com, fannation.com, southparkstudios.com
High income: Investors.com, weeklystandard.com, washingtonexminer.com
Low income: github.com, ebayclassifields.com, triplejack.com, citizenlink.com
Search History Experiment: [Figure: two panels, Jaccard Index and Edit Distance, by Day (1 to 7), for the income profiles $0-50K, $50-100K, $100-150K, >$150K, and No History.]
Personalization By Category: [Figure: CDF (%) of Edit Distance (0 to 10) for the query categories What Is, Books, Gadgets, Places, and Politics, with annotations marking the least and most personalized categories.]
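For readers unfamiliar with CDF plots, the sketch below shows how an empirical CDF over edit distances could be tabulated for one category. The function name and sample values are hypothetical and are not the study's data.

```python
from typing import List, Sequence


def empirical_cdf(distances: Sequence[int], max_distance: int = 10) -> List[float]:
    """Percentage of observations with edit distance <= x, for x = 0..max_distance."""
    if not distances:
        raise ValueError("need at least one observation")
    n = len(distances)
    return [
        100.0 * sum(1 for d in distances if d <= x) / n
        for x in range(max_distance + 1)
    ]


# Hypothetical edit distances for the queries of a single category.
sample = [0, 2, 2, 10]
print(empirical_cdf(sample))
```

A curve that reaches 100% at small edit distances indicates a category with little reordering; a curve that rises slowly indicates heavier personalization.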
Recap: First steps towards quantifying the Filter Bubble. Novel methodology for measuring personalization. 11.7% of results are personalized on Google Search; 15.8% of results are personalized on Bing. Primarily based on location and logged in/out status. Political and news queries see the most personalization. Data and code available: http://personalization.ccs.neu.edu
Future Work: Price Discrimination: Personalization of prices? Price discrimination (differential pricing): Amazon caught in 2000; small scale studies reveal this behavior on other sites. Steering (rank order of products): e.g. high priced items rank higher for some people. In the press: "Websites Vary Prices, Deals Based on Users' Information"; "On Orbitz, Mac Users Steered to Pricier Hotels".
Future Work: Popping Filter Bubbles: We now have methodology and infrastructure for continuous monitoring: observe if, when, and how algorithms are changing; create a clearing house for measurement data; increase transparency for users. Developing active defense measures: browser plug-ins to obfuscate users; leveraging social networks to get different points of view into personalized systems.
The Filter Bubble Team. Questions?