Ever-Growing Feed Corpus: Understanding Language Change Over Time

Explore the Feed Corpus, a continuously updated collection built for studying language change over months and years. Learn about the pipeline stages: feed discovery via Twitter, feed validation, scheduling, feed crawling, and linguistic processing. Presented by Akshay Minocha, Siva Reddy, and Adam Kilgarriff of Lexical Computing Ltd.

  • Language Change
  • Corpus Study
  • Linguistic Processing
  • Feed Discovery
  • Data Collection


Presentation Transcript


  1. Feed Corpus: An Ever-Growing, Up-to-Date Corpus
     Akshay Minocha, Siva Reddy, Adam Kilgarriff
     Lexical Computing Ltd

  2. Introduction
     • We want to study language change over months and years.
     • Most web pages carry no information about when they were written.
     • Feed items are written and then posted, so the posting date tells us when.
     • By following the same feeds over time we hope for an identical genre mix, so that the only factor that changes is time.

  3. Method
     • Feed discovery
     • Feed validation
     • Feed scheduler
     • Feed crawler
     • Cleaning, de-duplication, linguistic processing

  4. Feed Discovery via Twitter
     • Tweets often contain links to posts on feeds: bloggers and newswires often tweet "see my new post at http..."
     • Twitter keyword searches: news, business, arts, games, regional, science, shopping, society, etc.
     • Retweets are ignored.
     • Searches run every 15 minutes (a polling sketch follows below).
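
A minimal polling sketch of this discovery step, in Python. The 2013-era search endpoint and its JSON response shape are assumptions here (that API has since been retired); the keyword list, the retweet filter, and the 15-minute interval come from the slide:

    import time
    import requests  # assumed HTTP helper; not named on the slides

    KEYWORDS = ["news", "business", "arts", "games", "regional",
                "science", "shopping", "society"]
    SEARCH_URL = "https://twitter.com/search"  # historical endpoint from slide 5

    def harvest_links(keyword):
        # Fetch one page of results and pull out links from non-retweets.
        params = {"q": f"{keyword} source:twitterfeed filter:links",
                  "lang": "en", "include_entities": 1, "rpp": 100}
        resp = requests.get(SEARCH_URL, params=params, timeout=30)
        links = []
        for tweet in resp.json().get("results", []):     # response shape assumed
            if tweet.get("text", "").startswith("RT "):  # ignore retweets
                continue
            for url in tweet.get("entities", {}).get("urls", []):
                links.append(url.get("expanded_url"))
        return links

    while True:  # poll every 15 minutes, as on the slide
        for kw in KEYWORDS:
            for link in harvest_links(kw):
                print(link)  # in the pipeline, each link goes on to feed validation
        time.sleep(15 * 60)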

  5. Sample Search
     Aim: to make the most out of the search results.
     https://twitter.com/search?q=news%20source%3Atwitterfeed%20filter%3Alinks&lang=en&include_entities=1&rpp=100
     • Query: news
     • Source: twitterfeed
     • Filter: links (to get only tweets that contain links)
     • Language: en (English)
     • Include entities: info like geo, user, etc.
     • rpp: results per page (maximum 100)
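
The percent-encoded query string above can be rebuilt and decoded with nothing beyond Python's standard library; this verifies that q decodes to the three search operators listed on the slide:

    from urllib.parse import urlencode, parse_qs, urlparse

    # Rebuild the search URL from the slide's parameters.
    params = {
        "q": "news source:twitterfeed filter:links",  # keyword plus operators
        "lang": "en",             # English tweets only
        "include_entities": "1",  # include geo, user, url entities
        "rpp": "100",             # results per page (maximum 100)
    }
    url = "https://twitter.com/search?" + urlencode(params)
    print(url)

    # Decoding the query string recovers the operators (the slide's %20 is a
    # space, %3A a colon; urlencode writes spaces as +, which decodes the same):
    q = parse_qs(urlparse(url).query)["q"][0]
    assert q == "news source:twitterfeed filter:links"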

  6. Feed Validation
     • Does the link lead directly to a feed? That is, does the page metadata contain
       type="application/rss+xml" or type="application/atom+xml"?
     • If yes: good, keep the feed.
     • If no: search for a feed in the domain of the link; failing that, search for a feed one step from the domain.
     • If still nothing: the link is blacklisted.
     A minimal validation sketch follows below.
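
A sketch of the direct-metadata check and the domain fallback, assuming requests and BeautifulSoup as helpers (the slides do not name the actual tools); the one-step-from-domain search is elided:

    import requests                    # assumed helpers; not named on the slides
    from bs4 import BeautifulSoup
    from urllib.parse import urlparse

    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def find_feed(url):
        # Does the page metadata declare a feed directly?
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        for link in soup.find_all("link", rel="alternate"):
            if link.get("type") in FEED_TYPES and link.get("href"):
                return link["href"]
        return None

    def validate(url):
        feed = find_feed(url)
        if feed:
            return feed                               # good: a direct feed link
        domain = "{0.scheme}://{0.netloc}/".format(urlparse(url))
        feed = find_feed(domain)                      # fall back to the domain
        if feed:
            return feed
        # (searching one step from the domain is omitted here)
        return None                                   # still nothing: blacklist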

  7. Scheduling
     Inputs:
     • frequency of update, averaged over the feed's last ten updates
     • yield rate: the ratio of raw data input to 'good text' output, as in SpiderLing (Suchomel and Pomikálek 2012)
     Output:
     • a priority level for checking the feed (see the sketch below)
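
The slides give the inputs and the output but not the scoring function itself, so the combination below is an assumption: feeds that update often and yield a high proportion of good text get checked first.

    from dataclasses import dataclass

    @dataclass
    class FeedStats:
        update_interval_hours: float  # average gap over the last ten updates
        yield_rate: float             # good-text output / raw input, as in SpiderLing

    def priority(stats: FeedStats) -> float:
        # Hypothetical scoring: frequent updaters with high yield rank highest.
        freshness = 1.0 / max(stats.update_interval_hours, 0.1)
        return freshness * stats.yield_rate

    feeds = [FeedStats(2.0, 0.40), FeedStats(24.0, 0.90), FeedStats(1.0, 0.05)]
    queue = sorted(feeds, key=priority, reverse=True)  # crawler works down this queue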

  8. Feed Crawler
     • Visit the feed at the top of the queue.
     • Is there new content? If yes:
       • Is it already in the corpus? (de-duplication with Onion: Pomikálek)
       • If not, clean it up (jusText: Pomikálek) and add it to the corpus.
     A sketch of this loop follows below.
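
A compact sketch of that loop using the feedparser and justext libraries; the exact-hash seen set is only a stand-in for Onion, which the pipeline actually uses and which also catches near-duplicates:

    import hashlib
    import feedparser, justext, requests

    seen = set()  # stand-in for Onion's near-duplicate detection

    def crawl(feed_url, corpus):
        for entry in feedparser.parse(feed_url).entries:      # visit the feed
            html = requests.get(entry.link, timeout=30).content
            # jusText strips boilerplate, keeping only 'good' paragraphs
            paragraphs = justext.justext(html, justext.get_stoplist("English"))
            text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
            digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
            if digest in seen:            # already in the corpus? skip it
                continue
            seen.add(digest)
            corpus.append((entry.link, text))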

  9. Prepare for Analysis
     • Lemmatise and POS-tag the text.
     • Load it into Sketch Engine (see the sketch below).
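
The slides do not say which tagger and lemmatiser the pipeline uses, so the sketch below substitutes NLTK, and emits the tab-separated one-token-per-line 'vertical' layout that Sketch Engine ingests:

    import nltk                              # substitute tools; the pipeline's
    from nltk.stem import WordNetLemmatizer  # actual tagger isn't named

    # one-time setup: nltk.download('punkt'),
    # nltk.download('averaged_perceptron_tagger'), nltk.download('wordnet')
    lemmatiser = WordNetLemmatizer()

    def to_vertical(text):
        # One token per line: word, POS tag, lemma, separated by tabs.
        lines = []
        for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
            wn_pos = {"J": "a", "V": "v", "R": "r"}.get(tag[0], "n")  # WordNet POS
            lines.append(f"{word}\t{tag}\t{lemmatiser.lemmatize(word.lower(), wn_pos)}")
        return "\n".join(lines)

    print(to_vertical("Feeds were crawled every fifteen minutes."))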

  10. Initial Run: February-March 2013
     • Raw: 1.36 billion English words
     • 300 million words after de-duplication and cleaning
     • 150,000+ feeds

  11. Future Work
     • Include "category tags".
     • Other languages: collection has started now; identification with langid.py (Lui and Baldwin 2012), as sketched below.
     • "No-typo" material: a copy-edited subset, so newspapers and business feeds qualify; personal blogs do not. Method: manual classification of the 100 highest-volume feeds.
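
langid.py is an off-the-shelf identifier, and its two-line API is all the language-identification step needs:

    import langid  # Lui and Baldwin (2012)

    # classify() returns (language code, confidence score)
    lang, score = langid.classify("Los corpus crecen cada quince minutos.")
    print(lang)  # 'es'

    # If the target languages are known in advance, restricting the
    # candidate set sharpens the results:
    langid.set_languages(["en", "es", "fr", "de"])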

  12. Thank You http://www.sketchengine.co.uk
