Predicting COVID-19 Cases Using Reddit Posts and Online Resources

Explore how Reddit posts can be used to predict changes in COVID-19 cases, comparing local subreddit data to control datasets like OxCGRT and Google movement data. Learn about the objectives, prior work, pipeline, feature selection, and selected words in this innovative approach.

  • COVID-19
  • Reddit
  • Prediction
  • Data Analysis
  • Epidemiology

Presentation Transcript


  1. Predicting COVID-19 cases using Reddit posts and other online resources
       • Felix Drinkall and Janet B. Pierrehumbert

  2. Objectives
       1. Predict changes in COVID-19 cases using local subreddit data
       2. Develop an unsupervised prediction method
            • Doesn't rely on a-priori information or subjective interpretations
       3. Compare local subreddit features to control datasets
            • Control - OxCGRT government response
            • Control - Google movement data
            • Target and control - The COVID Tracking Project
       4. Generalise across different regions
            • Washington - r/CoronavirusWA
            • California - r/CoronavirusCA
            • Texas - r/CoronaVirusTX
            • Florida - r/FloridaCoronavirus
       • Prediction target: Washington

  3. Prior work
       • Samaras et al. (2020) - "Comparing social media and Google to detect and predict severe epidemics", Nature Sci Rep 10
       • Datasets (Dec. 2018 - May 2019)
            • Epidemiological data: Greece, ECDC data
            • Social media data: Twitter, collected with the data miner Tweepy
            • Tracks frequency of Greek words for Influenza ( and )
       • Autoregressive model - ARIMA(X)
       • Social media data beat Google search data
       • Assumptions and drawbacks
            • Steady state in population language
            • Steady state in population knowledge
            • The virus responds to population sentiment in the same way as Influenza

  4. Pipeline

  5. Feature selection
       • Identify overrepresented words
            • Create a reference corpus that represents the rest of Reddit
                 • Randomly select a datetime from a weighted distribution of posts/day
                 • Retrieve the next 100 comments across the whole of Reddit
                 • Repeat until the reference Reddit corpus is equal in size to the subreddit
            • Term frequency ratio between corpora j and k, where j is the corpus of interest and k is the reference corpus
                 • Add-one smoothing
            • Top 50 selected from top 1000 words
            • Top 50 selected from all words
       • Relevant feature selection
            • Chi-squared test to reduce features to 25
            • Filters for features that are highly dependent on the target
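
The overrepresented-word step on this slide can be sketched roughly as follows. The exact formula is not legible in the transcript, so this is a minimal sketch that assumes the ratio compares add-one-smoothed relative frequencies in the subreddit (corpus j) and the reference corpus (corpus k). The function names, the tie-breaking rule and the toy corpora are hypothetical, not taken from the presentation.

```python
from collections import Counter


def term_frequency_ratios(target_tokens, reference_tokens):
    """Add-one-smoothed ratio of relative term frequencies (target / reference).

    Assumed form of the ratio; the slide only names the technique.
    """
    target_counts = Counter(target_tokens)
    reference_counts = Counter(reference_tokens)
    vocab = set(target_counts) | set(reference_counts)
    # Add-one smoothing: every vocabulary word gets one extra count per corpus,
    # so words missing from the reference corpus do not cause division by zero.
    target_total = sum(target_counts.values()) + len(vocab)
    reference_total = sum(reference_counts.values()) + len(vocab)
    return {
        word: ((target_counts[word] + 1) / target_total)
        / ((reference_counts[word] + 1) / reference_total)
        for word in vocab
    }


def top_overrepresented(target_tokens, reference_tokens, k=50):
    """Return the k words with the highest ratio (ties broken alphabetically)."""
    ratios = term_frequency_ratios(target_tokens, reference_tokens)
    return sorted(ratios, key=lambda word: (-ratios[word], word))[:k]


# Toy corpora standing in for tokenised subreddit and reference comments.
subreddit = "mask lockdown mask vaccine the the cases".split()
reference = "the the the game movie the cat".split()
print(top_overrepresented(subreddit, reference, k=3))
```

Words that are common everywhere (such as "the") get a ratio below 1, while words specific to the subreddit rise to the top. The chi-squared filter mentioned on the slide would then be applied to the resulting word-frequency features.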

  6. Selected words

  7. Control datasets: OxCGRT
       • OxCGRT (Oxford COVID-19 Government Response Tracker)
       • C - containment policies
            • School closure, workplace closure, cancel public events, restrictions on gatherings, public transport closure, lockdown level, internal and external restrictions
       • E - economic policies
            • Income support, debt relief, fiscal measures, international support
       • H - health system policies
            • Public info campaigns, testing, contact tracing, investment into healthcare and vaccines

  8. Control datasets: GCCMR
       • GCCMR (Google's COVID-19 Community Mobility Reports)
       • Uses Location History
       • Anonymised data
       • Tracks movement at key locations on Google Maps
       • Changes are compared to the baseline for that day of the week
       • The baseline is the median value from Jan 3rd - Feb 6th 2020
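
The day-of-week baseline comparison described on this slide can be illustrated with a small sketch. It assumes a mapping from dates to a raw visit count for one location category. Google publishes only the resulting percentage changes, so the raw values, the function names and the example numbers here are hypothetical.

```python
from datetime import date, timedelta
from statistics import median

BASELINE_START = date(2020, 1, 3)
BASELINE_END = date(2020, 2, 6)


def weekday_baselines(daily_values, start=BASELINE_START, end=BASELINE_END):
    """Median value per weekday (0 = Monday) over the baseline window."""
    by_weekday = {}
    for day, value in daily_values.items():
        if start <= day <= end:
            by_weekday.setdefault(day.weekday(), []).append(value)
    return {weekday: median(values) for weekday, values in by_weekday.items()}


def percent_change(day, value, baselines):
    """Percentage change of a day's value against the baseline for its weekday."""
    base = baselines[day.weekday()]
    return 100.0 * (value - base) / base


# Hypothetical visit counts: 100 on weekdays, 60 at weekends during the baseline.
history = {}
for offset in range(35):
    day = BASELINE_START + timedelta(days=offset)
    history[day] = 60 if day.weekday() >= 5 else 100

baselines = weekday_baselines(history)
print(percent_change(date(2020, 4, 6), 70, baselines))  # a Monday
print(percent_change(date(2020, 4, 4), 30, baselines))  # a Saturday
```

Comparing each day against the median for the same weekday keeps ordinary weekly rhythm (quieter weekends, for instance) from showing up as a change in mobility.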

  9. Defining a prediction target

  10. Defining a prediction target
        • Dataset source
             • The COVID Tracking Project: US regional COVID-19 data
        • Classification thresholds
             • Relative
             • Absolute
             • Both are defined on the 7-day moving average of the cases at time t
        • Labelling
             • Data labelled with a binary value dependent on whether caseload exceeds threshold
        • Varying thresholds
             • Performance through more extreme events can be analysed
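
The labelling step can be sketched as below. The threshold formulas themselves are not reproduced in the transcript, so the definitions here are assumptions: the relative variant tests the fractional change in the 7-day moving average over a look-ahead horizon, and the absolute variant tests the raw change. The `horizon` parameter, the function names and the example series are all hypothetical.

```python
def moving_average(values, window=7):
    """Trailing moving average; positions without a full window are None."""
    averages = []
    for i in range(len(values)):
        if i + 1 < window:
            averages.append(None)
        else:
            averages.append(sum(values[i + 1 - window:i + 1]) / window)
    return averages


def label_days(cases, threshold, horizon=7, relative=True, window=7):
    """Binary label per day: 1 if the change in the moving average over the
    next `horizon` days exceeds `threshold`, else 0 (None if undefined).

    Assumed definitions: relative = fractional change, absolute = raw change.
    """
    avg = moving_average(cases, window)
    labels = []
    for t in range(len(cases) - horizon):
        now, later = avg[t], avg[t + horizon]
        if now is None:
            labels.append(None)
            continue
        change = later - now
        if relative:
            if now == 0:
                # No meaningful fractional change from zero; treat any rise as exceeding.
                labels.append(int(change > 0))
                continue
            change = change / now
        labels.append(int(change > threshold))
    return labels


# Hypothetical daily case counts: a jump from 10 to 20 cases per day.
cases = [10] * 7 + [20] * 14
print(label_days(cases, threshold=0.5, relative=True))
print(label_days(cases, threshold=5, relative=False))
```

Raising `threshold` restricts the positive class to sharper rises, which is one way to study performance on the more extreme events the slide mentions.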

  11. Results: Performance
        • Reddit is the best-performing feature set
             • Performance with relative threshold: 0.808 (tied for 1st)
             • Performance with absolute threshold: 0.824
             • Very good at predicting extreme events
        • Combination has the highest performance
             • Complementary information

  12. Results: Feature selection
        • Subreddit feature set has the highest feature importance
        • Higher feature importance for absolute threshold
        • Feature importance increases for more extreme epidemiological events

  13. Generalising results
        • Results from Washington also seen in multiple other states
             • Improvement when including Reddit features
             • Strong isolated performance of Reddit features
        • Complementary information
             • GCCMR movement data shows the population's adherence to OxCGRT government policies
             • Reddit exposes the population's attitude to the rules, which could motivate rule breaking

  14. Conclusion
        • Prediction of COVID-19 using Reddit data is a viable task
        • Use of overrepresented words results in a robust unsupervised approach
             • No a-priori information needed
        • Subreddit data performs well in comparison to other control datasets
             • Very good short-term individual performance
             • Faster update speeds than other datasets
        • Can be generalised across different regions
        • Subreddit data complements other control datasets

  15. Work since submission
        • Embedding topic modelling to track concepts in the subreddit
             • Sentence embedding of comments
             • Density-based clustering of dimensionality-reduced embeddings
        • Reference papers
             • Sia et al. (EMNLP-2020) - Tired of topic models? Clusters of pretrained word embeddings make for fast and good topics too!
             • Aharoni et al. (ACL-2020) - Unsupervised domain clusters in pretrained language models.
             • Rother et al. (SemEval-2020) - Clustering on manifolds of contextualized embeddings to detect historical meaning shifts.
