Ever-Growing Feed Corpus: Understanding Language Change Over Time

Explore the Feed Corpus, a continuously updated collection built for studying language change over months and years. Learn about the pipeline stages: feed discovery via Twitter, feed validation, scheduling, feed crawling, and linguistic processing. Presented by Akshay Minocha, Siva Reddy, and Adam Kilgarriff of Lexical Computing Ltd.

  • Language Change
  • Corpus Study
  • Linguistic Processing
  • Feed Discovery
  • Data Collection


Presentation Transcript


  1. Feed Corpus: An Ever-Growing, Up-to-Date Corpus
     Akshay Minocha, Siva Reddy, Adam Kilgarriff
     Lexical Computing Ltd

  2. Introduction
     • We want to study language change over months and years.
     • Most web pages carry no information about when they were written.
     • Feed items are written and then posted, so the posting date tells us when.
     • By following the same feeds over time we hope for an identical genre mix, so that the only factor that changes is time.

  3. Method
     • Feed discovery
     • Feed validation
     • Feed scheduler
     • Feed crawler
     • Cleaning, de-duplication, linguistic processing

  4. Feed Discovery via Twitter
     • Tweets often contain links to posts on feeds: bloggers and newswires often tweet "see my new post at http..."
     • Twitter keyword searches: news, business, arts, games, regional, science, shopping, society, etc.
     • Retweets are ignored.
     • Searches run every 15 minutes (a polling sketch follows below).
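
A minimal polling sketch of this discovery step, in Python. The 2013-era search endpoint and its JSON response shape are assumptions here (that API has since been retired); the keyword list, the retweet filter, and the 15-minute interval come from the slide:

    import time
    import requests  # assumed HTTP helper; not named on the slides

    KEYWORDS = ["news", "business", "arts", "games", "regional",
                "science", "shopping", "society"]
    SEARCH_URL = "https://twitter.com/search"  # historical endpoint from slide 5

    def harvest_links(keyword):
        # Fetch one page of results and pull out links from non-retweets.
        params = {"q": f"{keyword} source:twitterfeed filter:links",
                  "lang": "en", "include_entities": 1, "rpp": 100}
        resp = requests.get(SEARCH_URL, params=params, timeout=30)
        links = []
        for tweet in resp.json().get("results", []):     # response shape assumed
            if tweet.get("text", "").startswith("RT "):  # ignore retweets
                continue
            for url in tweet.get("entities", {}).get("urls", []):
                links.append(url.get("expanded_url"))
        return links

    while True:  # poll every 15 minutes, as on the slide
        for kw in KEYWORDS:
            for link in harvest_links(kw):
                print(link)  # in the pipeline, each link goes on to feed validation
        time.sleep(15 * 60)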

  5. Sample Search
     Aim: to make the most out of the search results.
     https://twitter.com/search?q=news%20source%3Atwitterfeed%20filter%3Alinks&lang=en&include_entities=1&rpp=100
     • Query: news
     • Source: twitterfeed
     • Filter: links (to get only tweets that contain links)
     • Language: en (English)
     • Include entities: info like geo, user, etc.
     • rpp: results per page (maximum 100)
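
The percent-encoded query string above can be rebuilt and decoded with nothing beyond Python's standard library; this verifies that q decodes to the three search operators listed on the slide:

    from urllib.parse import urlencode, parse_qs, urlparse

    # Rebuild the search URL from the slide's parameters.
    params = {
        "q": "news source:twitterfeed filter:links",  # keyword plus operators
        "lang": "en",             # English tweets only
        "include_entities": "1",  # include geo, user, url entities
        "rpp": "100",             # results per page (maximum 100)
    }
    url = "https://twitter.com/search?" + urlencode(params)
    print(url)

    # Decoding the query string recovers the operators (the slide's %20 is a
    # space, %3A a colon; urlencode writes spaces as +, which decodes the same):
    q = parse_qs(urlparse(url).query)["q"][0]
    assert q == "news source:twitterfeed filter:links"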

  6. Feed Validation
     • Does the link lead directly to a feed? That is, does the page metadata contain
       type="application/rss+xml" or type="application/atom+xml"?
     • If yes: good, keep the feed.
     • If no: search for a feed in the domain of the link; failing that, search for a feed one step from the domain.
     • If still nothing: the link is blacklisted.
     A minimal validation sketch follows below.
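
A sketch of the direct-metadata check and the domain fallback, assuming requests and BeautifulSoup as helpers (the slides do not name the actual tools); the one-step-from-domain search is elided:

    import requests                    # assumed helpers; not named on the slides
    from bs4 import BeautifulSoup
    from urllib.parse import urlparse

    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def find_feed(url):
        # Does the page metadata declare a feed directly?
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        for link in soup.find_all("link", rel="alternate"):
            if link.get("type") in FEED_TYPES and link.get("href"):
                return link["href"]
        return None

    def validate(url):
        feed = find_feed(url)
        if feed:
            return feed                               # good: a direct feed link
        domain = "{0.scheme}://{0.netloc}/".format(urlparse(url))
        feed = find_feed(domain)                      # fall back to the domain
        if feed:
            return feed
        # (searching one step from the domain is omitted here)
        return None                                   # still nothing: blacklist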

  7. Scheduling
     Inputs:
     • frequency of update, averaged over the feed's last ten updates
     • yield rate: the ratio of raw data input to 'good text' output, as in SpiderLing (Suchomel and Pomikálek 2012)
     Output:
     • a priority level for checking the feed (see the sketch below)
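
The slides give the inputs and the output but not the scoring function itself, so the combination below is an assumption: feeds that update often and yield a high proportion of good text get checked first.

    from dataclasses import dataclass

    @dataclass
    class FeedStats:
        update_interval_hours: float  # average gap over the last ten updates
        yield_rate: float             # good-text output / raw input, as in SpiderLing

    def priority(stats: FeedStats) -> float:
        # Hypothetical scoring: frequent updaters with high yield rank highest.
        freshness = 1.0 / max(stats.update_interval_hours, 0.1)
        return freshness * stats.yield_rate

    feeds = [FeedStats(2.0, 0.40), FeedStats(24.0, 0.90), FeedStats(1.0, 0.05)]
    queue = sorted(feeds, key=priority, reverse=True)  # crawler works down this queue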

  8. Feed Crawler
     • Visit the feed at the top of the queue.
     • Is there new content? If yes:
       • Is it already in the corpus? (de-duplication with Onion: Pomikálek)
       • If not, clean it up (jusText: Pomikálek) and add it to the corpus.
     A sketch of this loop follows below.
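
A compact sketch of that loop using the feedparser and justext libraries; the exact-hash seen set is only a stand-in for Onion, which the pipeline actually uses and which also catches near-duplicates:

    import hashlib
    import feedparser, justext, requests

    seen = set()  # stand-in for Onion's near-duplicate detection

    def crawl(feed_url, corpus):
        for entry in feedparser.parse(feed_url).entries:      # visit the feed
            html = requests.get(entry.link, timeout=30).content
            # jusText strips boilerplate, keeping only 'good' paragraphs
            paragraphs = justext.justext(html, justext.get_stoplist("English"))
            text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
            digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
            if digest in seen:            # already in the corpus? skip it
                continue
            seen.add(digest)
            corpus.append((entry.link, text))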

  9. Prepare for Analysis
     • Lemmatise and POS-tag the text.
     • Load it into Sketch Engine (see the sketch below).
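
The slides do not say which tagger and lemmatiser the pipeline uses, so the sketch below substitutes NLTK, and emits the tab-separated one-token-per-line 'vertical' layout that Sketch Engine ingests:

    import nltk                              # substitute tools; the pipeline's
    from nltk.stem import WordNetLemmatizer  # actual tagger isn't named

    # one-time setup: nltk.download('punkt'),
    # nltk.download('averaged_perceptron_tagger'), nltk.download('wordnet')
    lemmatiser = WordNetLemmatizer()

    def to_vertical(text):
        # One token per line: word, POS tag, lemma, separated by tabs.
        lines = []
        for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
            wn_pos = {"J": "a", "V": "v", "R": "r"}.get(tag[0], "n")  # WordNet POS
            lines.append(f"{word}\t{tag}\t{lemmatiser.lemmatize(word.lower(), wn_pos)}")
        return "\n".join(lines)

    print(to_vertical("Feeds were crawled every fifteen minutes."))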

  10. Initial Run: February-March 2013
     • Raw: 1.36 billion English words
     • 300 million words after de-duplication and cleaning
     • 150,000+ feeds

  11. Future Work
     • Include "category tags".
     • Other languages: collection has started now; identification with langid.py (Lui and Baldwin 2012), as sketched below.
     • "No-typo" material: a copy-edited subset, so newspapers and business feeds qualify; personal blogs do not. Method: manual classification of the 100 highest-volume feeds.
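
langid.py is an off-the-shelf identifier, and its two-line API is all the language-identification step needs:

    import langid  # Lui and Baldwin (2012)

    # classify() returns (language code, confidence score)
    lang, score = langid.classify("Los corpus crecen cada quince minutos.")
    print(lang)  # 'es'

    # If the target languages are known in advance, restricting the
    # candidate set sharpens the results:
    langid.set_languages(["en", "es", "fr", "de"])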

  12. Thank You http://www.sketchengine.co.uk
