
Techniques for Analyzing Corpus Linguistics Data
Learn about corpus linguistic techniques such as descriptive statistics, frequency analysis, concordance lines, collocate lists, key word identification, and more. Explore an example project analyzing student writing across disciplines and languages. Discover how to analyze tokens, types, STTR, word lengths, and top words in written English, along with tasks like finding word dispersion in text.
Presentation Transcript
Video 2: Corpus linguistic techniques
Maria Leedham
Full resource: https://www.ncrm.ac.uk/resources/online/all/?id=20855
Ways of analysing a corpus
Descriptive statistics on mean sentence length, variety of lexical items used (per text and for the whole corpus)
Produce frequency word or phrase lists
Get concordance lines and sort these in different ways
Look at collocate lists for insight into the data
Find key words and phrases compared to a reference corpus
Look at plot dispersions to see how a linguistic feature is used across whole texts/the corpus
Search for similar lexical items/chunks through semantic or Part of Speech tagging
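Getting concordance lines and sorting them can be sketched in a few lines of Python. This is a minimal illustration, not how any particular concordancer works: the function names `kwic` and `sort_by_right`, the character-based context window, and the sample sentence are all assumptions made for the example.

```python
import re

def kwic(text, node, width=30):
    """Return concordance lines as (left context, node, right context) tuples."""
    lines = []
    # Match the node as a whole word, ignoring case (an assumption of this sketch).
    pattern = r"\b" + re.escape(node) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines

def sort_by_right(lines):
    """Sort concordance lines alphabetically by the context right of the node."""
    return sorted(lines, key=lambda line: line[2].strip().lower())

# Invented sample text, for illustration only.
sample = "I enjoyed the module. At first I found it hard, but I am glad I tried."
for left, node, right in sort_by_right(kwic(sample, "I", width=20)):
    print(f"{left:>20} [{node}] {right}")
```

Sorting on the right-hand context groups lines by what follows the search word; sorting on `line[0]` instead would group them by what precedes it.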
EXAMPLE CADS PROJECT A: Student writing in different disciplines
Student writing in different disciplines
Aims: To uncover similarities and differences in student writing across disciplinary areas and by students with different first languages
RQs:
(i) How does student writing differ in Engineering and Business Studies?
(ii) How does student writing differ across students with first language Chinese or first language English?
Descriptive statistics
Tokens
Types
STTR (Standardised Type Token Ratio)
Mean word/sentence/paragraph lengths
Averages and per text
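These measures can be computed roughly as follows. The sketch assumes a very simple tokeniser (runs of letters, lower-cased) and a naive sentence split on ., ! and ?. It also assumes STTR is the mean type/token ratio over consecutive chunks of a fixed size, with any incomplete final chunk ignored. Corpus tools differ on all of these details, and the function names are hypothetical.

```python
import re

def tokenize(text):
    # Simplifying assumption: a token is a run of letters, optionally with
    # one internal apostrophe (e.g. "don't"), lower-cased.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def sttr(tokens, chunk_size=1000):
    """Mean type/token ratio (as a percentage) over consecutive full chunks."""
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size * 100)
    # Texts shorter than one chunk have no STTR under this definition.
    return sum(ratios) / len(ratios) if ratios else None

def describe(text, chunk_size=1000):
    """Per-text descriptive statistics."""
    tokens = tokenize(text)
    # Naive sentence split on ., ! and ? (an assumption of this sketch).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "tokens": len(tokens),
        "types": len(set(tokens)),
        "sttr": sttr(tokens, chunk_size),
        "mean_word_length": sum(map(len, tokens)) / len(tokens) if tokens else 0.0,
        "mean_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
    }

# Tiny invented text; a small chunk size is used only so that STTR is defined.
print(describe("The cat sat. The dog sat on the mat.", chunk_size=4))
```

Standardising over equal-sized chunks is what makes the ratio comparable across texts of different lengths, since a raw type/token ratio falls as a text gets longer.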
Top 20 words in written English (OEC)
1 the   2 be   3 to   4 of   5 and
6 a   7 in   8 that   9 have   10 I
11 it   12 for   13 not   14 on   15 with
16 he   17 as   18 you   19 do   20 at
WordList
TASK: What are the top 10 words in written English, taken from web sources (e.g. blog postings, internet sites)?
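A frequency word list of this kind can be approximated with Python's standard library. Note that the list above appears to count lemmas (be, have, do), whereas this sketch counts surface word forms, so "is" and "was" would be listed separately. The tokeniser and the function name `word_list` are assumptions made for illustration.

```python
import re
from collections import Counter

def word_list(text, top_n=10):
    """Return (rank, word, count, percentage of all tokens) for the top_n words."""
    # Simplifying assumption: tokens are lower-cased runs of letters.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    total = len(tokens)
    ranked = Counter(tokens).most_common(top_n)
    return [
        (rank, word, count, round(count / total * 100, 2))
        for rank, (word, count) in enumerate(ranked, start=1)
    ]

# Invented sample text, for illustration only.
for row in word_list("The cat and the dog and the bird", top_n=3):
    print(row)
```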
Finding where words are in the text: e.g. a plot dispersion for 'I'

N   File        Words   Hits   per 1,000   Dispersion
1   6005g.txt     782     40       51.15        0.790
2   0348c.txt     989     38       38.42        0.867
3   0202j.txt   4,108     31        7.55        0.626
4   0411c.txt   5,331     27        5.06        0.396
5   0200f.txt   4,813     26        5.40        0.673
6   0212d.txt   1,962     26       13.25        0.704
7   0169i.txt     739     25       33.83        0.730
8   0347g.txt     724     23       31.77        0.733
9   0342c.txt   1,015     23       22.66        0.808
10  0398e.txt   5,321     23        4.32       -0.017
11  6101k.txt     547     21       38.39        0.734
12  6101c.txt   2,887     21        7.27        0.042
[Plot column not reproduced]

TASK: Why might some of these undergraduate assignments (from varied disciplines) have so many instances of I at the END of the writing? Clue: what might students be asked to do at the end of their assignment?
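The hits and per 1,000 figures, and a rough text version of the plot, can be sketched as below. The dispersion statistic itself is not computed, because the slide does not say how it is calculated. The function names, the tokeniser, and the sample text are assumptions for illustration.

```python
import re

def hit_positions(text, node):
    """Return (token positions of node, total tokens); case-sensitive so 'I' != 'i'."""
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    return [i for i, tok in enumerate(tokens) if tok == node], len(tokens)

def per_thousand(hits, words):
    """Normalised frequency: hits per 1,000 running words."""
    return hits / words * 1000

def plot_line(positions, total, width=40):
    """One-line text plot: '|' marks a segment of the text containing a hit."""
    cells = ["."] * width
    for pos in positions:
        cells[min(pos * width // total, width - 1)] = "|"
    return "".join(cells)

# Invented text in which 'I' appears only in a reflective closing sentence.
text = "The report covers methods and results. " * 3 + "I learnt a lot and I enjoyed it."
positions, total = hit_positions(text, "I")
print(len(positions), "hits,", round(per_thousand(len(positions), total), 2), "per 1,000")
print(plot_line(positions, total, width=13))
```

As a check on the normalisation, 40 hits in 782 words gives `per_thousand(40, 782)`, about 51.15, which is the figure reported for 6005g.txt.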
When I began this Self awareness module, I learnt I am psychologically and physically boundary less, but prefer to stay in the same organisation; also, I am self directed and actively manage my career in line with personal values which plays an important role of choosing employment.
7014a (Chinese year 3 Business narrative recount)
As I got into this piece of work, I enjoyed doing it more than I expected. There's a great moment on each question where you press enter after putting in a lot of code or chasing a bug and it all just works.
6101l (English year 1 Computing exercise)
Key Words and key clusters
A word which is key occurs more often than would be expected by chance in comparison with the reference corpus.
Mike Scott, 2008
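One way to put a number on "more often than would be expected by chance" is a log-likelihood score comparing a word's frequency in the study corpus with its frequency in the reference corpus. The slide does not say which statistic is used, so log-likelihood is an assumption here, as are the function name and the counts in the example.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness score for one word across two corpora."""
    total = size_study + size_ref
    combined = freq_study + freq_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    score = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score

# Invented counts: a word with 40 hits in a 782-word text,
# against 10 hits in a hypothetical 10,000-word reference corpus.
print(round(log_likelihood(40, 782, 10, 10_000), 2))
```

The higher the score, the less likely the difference in frequency is due to chance. The score alone does not show direction, so the two relative frequencies still need comparing to see whether the word is over-used or under-used in the study corpus.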
Semantic domains in Engineering
TASK: Is anything unexpected shown?