
Efficiency of Text Coverage Evaluation in Vocabulary Learning
Explore the efficiency of evaluating word groups in text coverage for vocabulary learning, discussing the comparison between different sets of words and their impact on text coverage in various genres. The study proposes a new index called Text Covering Efficiency (TCE) to assess this aspect systematically and discusses its validation and implications.
1 How do we evaluate a group of words in gaining text coverage? Tatsuhiko Matsushita (University of Tokyo) Vocab@Vic 2013 Victoria University of Wellington
2 Text Covering Efficiency of the Grouped Words by Genre (Not Graded by Level) *Domain-unspecified
TCE: Text Covering Efficiency = Expected number of tokens of a lexeme in the tested group in a one-million-token text in the target domain.

Corpus   Tokens (M)     General          AW         LAD          LW        21K+         AKW      1K-05K      1K-10K
WIS rank range         1-20,000  682-20,000  682-20,000  682-20,000     20,001+          --     1-5,000    1-10,000
Lexemes in VDRJ          13,302       2,591       2,542       1,616      91,104      30,821       5,024      10,024
BCCWJ         32.82          56          42          21          28         0.2         0.6         177          92
BCCWJ-T        2.90          46          90          36          11         0.3         0.4         177          93
MC             1.13          61          10           6          67         0.1         0.6         184          95
BSB            2.30          59          28          15          41         0.2         0.8         178          93
UPC            2.10          58          29          12          46         0.2         0.4         177          93
UYN            5.68          48          80          44          11         0.2         0.4         177          94
TB             0.19          50          82          35          10         0.1         0.1         183          96
MTT-Ss         0.05          51          81          30          10         0.1         0.1         187          96
TIS            0.04          50          80          35          12         0.1         0.3         183          96
MTT-Bn         0.01          46          88          27           9         0.4         0.1         171          90
MTT-Tn         0.07          46          89          23          14         0.4         0.2         168          89
JS-Bn          0.72          41         103          26           7         0.3         0.2         163          86
JS-Tn          2.71          40         108          24           7         0.5         0.1         159          85

*WIS: Word Rankings for International Students *F-JLPT: The former Japanese Language Proficiency Test *VDRJ: Vocabulary Database for Reading Japanese *AW: Common Academic Words *LAD: Limited-academic-domain words *LW: Literary Words *AKW: Assumed Known Words (mostly proper nouns) *Ha: Humanities & Arts *Ss: Social Sciences *Tn: Technological Natural Sciences *Bn: Biological Natural Sciences
(This slide will appear again later in 5. Results and Discussion.)
3 Contents 1. Motives for the Study 2. Research Questions and Goals 3. Proposal of a New Index: Text Covering Efficiency (TCE) 4. Method of Validating TCE 5. Results and Discussion 6. Conclusion
4 1. Motives for the Study (1) How efficiently can we learn vocabulary? What words should learners learn first, second and next? Domain-specific words such as academic words (Coxhead, 2000) are often extracted for efficient vocabulary learning in a genre. Text coverage has been used for evaluating these groups of words (Coxhead, 2000; Hyland & Tse, 2007)
5 1. Motives for the Study (2) However, text coverage is not appropriate for comparing the efficiency of word groups when the groups contain different numbers of words. How can we compare the efficiency of a group of domain-specific words with that of other words? e.g. 1 How can we compare the efficiency of learning the AWL (Coxhead, 2000) with that of learning the UWL (Xue & Nation, 1984)? e.g. 2 How can we compare the efficiency of learning technical term lists in different genres? e.g. 3 How can we compare lists at different frequency levels in a genre, e.g. the sublists of the AWL? How many times more efficient is learning sublist 1 than learning sublist 2 for gaining text coverage in different genres? e.g. 4 To gain higher text coverage, at which stage should learners move from learning general words to learning domain-specific words?
6 1. Motives for the Study (3) For example, the table below (Hyland & Tse, 2007) does not show the difference in efficiency in gaining text coverage, because the numbers of words in the AWL and the GSL are different.
7 2. Research Questions and Goals (1) Research Questions 1. What index is appropriate for comparing the efficiency of word groups in gaining text coverage when the groups contain different numbers of words? 2. Are there any advantages to comparing the efficiency of word groups in gaining text coverage, other than deciding the most efficient order in which to learn words?
8 2. Research Questions and Goals (2) Goals 1. To propose an index: Text Covering Efficiency (TCE) 2. To show the validity and usefulness of TCE for a. deciding the most efficient order of words to learn b. analyzing lexical features of text genres by applying TCE to some groups of Japanese domain-specific words and other types of grouped words.
9 3. Proposal of a New Index: Text Covering Efficiency (TCE) (1) Problem: The numbers of words differ between the groups to be compared. Solution: Standardization. 1. Divide the text coverage (tokens) of a group of words by the number of words in the group. 2. Divide the quotient by the total number of tokens in the target text (domain). This adjusts for differences in text size and makes figures from differently sized texts comparable.
10 3. Proposal of a New Index: Text Covering Efficiency (TCE) (2) For the user's convenience, the figure is multiplied by 1,000,000. The result is the expected number of tokens of a word from the grouped words in a one-million-token text in the target domain. It is therefore comparable with a standardized frequency per million. In other words, TCE is the expected standardized frequency of a word in the group. Text Covering Efficiency (TCE) = the mean text coverage per one million tokens of the target text by a word from the grouped words.
11 3. Proposal of a New Index: Text Covering Efficiency (TCE) (3) The formula for TCE
$$E = \frac{T_g}{N_g \times T_t} \times 1{,}000{,}000 = \frac{T_g \times 1{,}000{,}000}{N_g \times T_t}$$
$E$: Text covering efficiency = Expected number of tokens (= text coverage) of a word in the tested group in a one-million-token text in the target domain
$T_g$: Number of tokens (= text coverage) of the tested group of words in the target text
$N_g$: Number of lexemes of the tested group of words
$T_t$: Number of tokens in the target text (text length)
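The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name, parameter names and the counts in the example are hypothetical and do not come from the slides.

```python
def text_covering_efficiency(group_tokens: int, group_lexemes: int, text_tokens: int) -> float:
    """Expected number of tokens per lexeme of the group in a one-million-token text."""
    if group_lexemes <= 0 or text_tokens <= 0:
        raise ValueError("group_lexemes and text_tokens must be positive")
    # Tokens covered by the group, divided by group size and text length,
    # then scaled to a one-million-token text.
    return group_tokens * 1_000_000 / (group_lexemes * text_tokens)


# Hypothetical counts (not from the slides): two word lists that both cover
# 40,000 tokens (2%) of a 2,000,000-token corpus but differ in size.
small_list_tce = text_covering_efficiency(40_000, 500, 2_000_000)
large_list_tce = text_covering_efficiency(40_000, 2_000, 2_000_000)

print(small_list_tce)  # 40.0 expected tokens per lexeme per million
print(large_list_tce)  # 10.0
```

Both hypothetical lists have the same text coverage, yet the smaller list is four times as efficient per word learned, which is the difference that raw text coverage hides.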
12 4. Method of Validating TCE (1) How can we validate an index such as TCE? By applying the index to actual data to check whether: 1. the results do not conflict with the findings of previous studies, and 2. the results show something that would not be clearly shown without the index. TCE was applied to several groups of Japanese words in different text genres.
13 4. Method of Validating TCE (2) Domain-Specific Words to Be Tested: (Japanese) Common Academic Words (CAW) (Matsushita, 2011); (Japanese) Limited-Academic-Domain Words (LAD); (Japanese) Literary Words (LW) (Matsushita, 2012). These word lists can be downloaded from Matsushita Laboratory for Language Learning: http://www17408ui.sakura.ne.jp/tatsum/English_top_Tatsu.html
14 4. Method of Validating TCE (3) Method of Extraction of the Domain-specific Words. Target Corpora: Technical texts in the four genres of Humanities, Social sciences, Technological natural sciences and Biological natural sciences. Reference Corpus: Balanced Contemporary Corpus of Written Japanese (BCCWJ), 2009 monitor version, excluding the target corpora part. Index: Log-likelihood Ratio (LLR). Criteria for extraction: 4-domain words and 3-domain words: CAW; 2-domain words and 1-domain words: LAD.
15 4. Method of Validating TCE (4) Test corpora: Text Genres Used for the Validation. JS-Bn: Journal articles on biological natural sciences. 0.72 million tokens. MTT-Bn: Technical texts in biological natural sciences. 0.01 million tokens. JS-Tn: Journal articles on technological natural sciences. 2.71 million tokens. MTT-Tn: Technical texts in technological natural sciences. 0.07 million tokens. MTT-Ss: Technical texts in social sciences. 0.05 million tokens. TB: Texts in social sciences for intermediate and advanced learners of Japanese. 0.19 million tokens. TIS: Texts in a textbook in international studies. Mainly social science texts. 0.04 million tokens. UYN: Newspaper texts. 5.68 million tokens. BSB: Texts from best seller books. Mainly composed of literary works. 2.10 million tokens. UPC: Literary texts. 2.30 million tokens. MC: Conversation texts. 1.13 million tokens.
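The extraction step can be illustrated with a short sketch. The slides name the log-likelihood ratio as the index but do not give the exact formula, the cut-off value, or the rule for deciding that a word is characteristic of a domain, so the two-corpus form used below (the one common in corpus keyword analysis) and the function names are assumptions. Only the final mapping from domain count to CAW or LAD comes from the slide.

```python
import math


def log_likelihood_ratio(freq_target: int, size_target: int,
                         freq_ref: int, size_ref: int) -> float:
    """Two-corpus log-likelihood statistic for one word (assumed variant)."""
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    if freq_target > 0:
        ll += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll


def classify_by_domain_count(n_domains: int):
    """Map the number of domains (0-4) in which a word was extracted to a list label."""
    if n_domains >= 3:
        return "CAW"  # 4-domain and 3-domain words
    if n_domains >= 1:
        return "LAD"  # 2-domain and 1-domain words
    return None       # not extracted in any domain


# Hypothetical counts: a word occurring 100 times in a 10,000-token target
# corpus and 100 times in a 1,000,000-token reference corpus.
print(log_likelihood_ratio(100, 10_000, 100, 1_000_000) > 0)  # True
print(classify_by_domain_count(3))  # CAW
```

The statistic is zero when a word has the same relative frequency in both corpora and grows as the frequencies diverge, so a real extraction would also need to check that the word is over-represented in the target corpus rather than under-represented.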
16 5. Results and Discussion (1) TCE of the Grouped Words by Genre (Not Graded by Level): the table shown on slide 2.
17 5. Results and Discussion (2) Ranking for TCE of the Grouped Words in Each Genre (Not Graded by Level) *Domain-unspecified

Corpus   Tokens (M)  General       AW      LAD       LW     21K+      AKW
BCCWJ         32.82        1        2        4        3        6        5
BCCWJ-T        2.90        2        1        3        4        6        5
MC             1.13        2        3        4        1        6        5
BSB            2.30        1        3        4        2        6        5
UPC            2.10        1        3        4        2        6        5
UYN            5.68        2        1        3        4        6        5
TB             0.19        2        1        3        4        6        5
MTT-Ss         0.05        2        1        3        4        6        5
TIS            0.04        2        1        3        4        6        5
MTT-Bn         0.01        2        1        3        4        5        6
MTT-Tn         0.07        2        1        3        4        5        6
JS-Bn          0.72        2        1        3        4        5        6
JS-Tn          2.71        2        1        3        4        5        6

Number of lexemes in VDRJ: General 13,302 (WIS 1-20,000); AW 2,591, LAD 2,542, LW 1,616 (WIS 682-20,000); 21K+ 91,104 (WIS 20,001+); AKW 30,821.
*WIS: Word Rankings for International Students *F-JLPT: The former Japanese Language Proficiency Test *VDRJ: Vocabulary Database for Reading Japanese *AW: Common Academic Words *LAD: Limited-academic-domain words *LW: Literary Words *AKW: Assumed Known Words (mostly proper nouns) *Ha: Humanities & Arts *Ss: Social Sciences *Tn: Technological Natural Sciences *Bn: Biological Natural Sciences
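The ranking on this slide follows directly from the TCE figures. As a small sketch, the snippet below ranks the six ungraded groups for biological journal articles; the values are the JS-Bn figures from the ungraded TCE table as read from the flattened transcript (an assumption about the column layout), and the function name is hypothetical. Ties are broken by input order here, which the slides do not specify.

```python
def rank_groups(tce_by_group: dict) -> dict:
    """Return {group: rank}, where rank 1 is the group with the highest TCE."""
    ordered = sorted(tce_by_group, key=tce_by_group.get, reverse=True)
    return {group: rank for rank, group in enumerate(ordered, start=1)}


# Ungraded TCE values for JS-Bn (biological natural science journal articles).
tce_js_bn = {"General": 41, "AW": 103, "LAD": 26, "LW": 7, "21K+": 0.3, "AKW": 0.2}

ranks_js_bn = rank_groups(tce_js_bn)
print(ranks_js_bn)
# {'AW': 1, 'General': 2, 'LAD': 3, 'LW': 4, '21K+': 5, 'AKW': 6}
```

Read in the row order General, AW, LAD, LW, 21K+, AKW, this gives 2 1 3 4 5 6, the pattern the slide reports for the natural science corpora.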
18 5. Results and Discussion (3) TCE of the Grouped Words by Level and Genre *Domain-unspecified
TCE: Text Covering Efficiency = Expected number of tokens of a lexeme in the tested group in a one-million-token text in the target domain.

Level   Group      Lexemes    BCCWJ  BCCWJ-T       MC      BSB      UPC      UYN       TB   MTT-Ss      TIS   MTT-Bn   MTT-Tn    JS-Bn    JS-Tn
--      Tokens (M)      --    32.82     2.90     1.13     2.30     2.10     5.68     0.19     0.05     0.04     0.01     0.07     0.72     2.71
Basic   General      1,027    640.3    551.1    716.0    671.8    672.8    530.6    585.8    623.0    572.9    571.1    564.0    495.6    481.2
Basic   AW              70    430.2    654.1    175.1    367.1    367.7    560.0    729.3    744.8    682.6    625.7    778.0    723.4    687.5
Basic   LAD             78     99.4    123.3     47.6     84.5     65.0    222.6    162.0    139.1    251.2    105.1     80.0     91.2     93.9
Basic   LW             142    149.1     72.1    474.1    201.5    248.0     55.0     74.1     79.6     87.9     68.4     93.5     44.1     46.3
Inter.  General      1,478     31.8     17.7     35.0     32.8     27.4     33.4     21.6     14.1     27.5     10.4     10.3     13.9     10.6
Inter.  AW           1,101     65.3    152.3     11.8     38.3     41.8    138.9    134.1    132.6    134.6    138.8    127.0    169.3    178.8
Inter.  LAD            704     49.3     80.2     14.4     35.8     29.7    102.9     85.0     75.5     78.9     58.6     38.9     51.4     37.6
Inter.  LW             446     41.8     13.9     72.4     62.0     63.5     16.7     11.7      8.6     14.9      6.6     18.1      9.5      8.9
Adv. 1  General      3,070      7.4      3.7      4.4      7.2      7.1      7.7      4.9      2.4      6.8      2.2      2.7      3.2      2.9
Adv. 1  AW             664      7.8     21.1      1.4      4.0      4.9     16.5     16.1     12.5     12.1     38.9     38.6     31.8     35.9
Adv. 1  LAD            788      9.3     21.0      2.2      5.2      5.2     19.7     15.0     13.7     13.1     13.5     17.0     16.6     19.9
Adv. 1  LW             483      8.5      1.7     15.5     13.2     15.2      3.1      1.2      0.7      1.6      3.3      2.6      1.1      1.5
Adv. 2  General      3,681      3.3      1.5      1.6      3.2      2.9      3.0      2.1      0.8      2.0      1.4      1.4      2.0      1.5
Adv. 2  AW             431      3.3      9.3      0.5      1.7      1.8      7.0      6.0      3.9      5.1      8.5     16.4     15.6     14.5
Adv. 2  LAD            543      4.4     11.6      1.1      2.5      2.0      8.6      6.0      3.1      4.9      9.8     18.2      9.7     15.5
Adv. 2  LW             345      4.1      0.4      2.1      6.1      6.7      0.6      0.2      0.1      0.3      0.2      0.5      0.6      0.3
S-Adv.  General      4,046      1.8      0.8      0.8      1.9      1.9      1.5      0.9      0.4      1.0      0.6      0.7      1.2      0.9
S-Adv.  AW             325      1.9      5.5      0.4      1.2      1.2      3.7      3.0      2.7      2.9      2.9     14.9      8.0     12.5
S-Adv.  LAD            429      2.6      8.2      0.6      1.5      1.1      4.3      3.9      2.1      1.6      5.5      5.6      8.3      8.0
S-Adv.  LW             200      2.6      0.2      0.7      3.9      4.5      0.3      0.1      0.1      0.6      0.0      0.1      0.8      0.1
21K+    21K+        91,104      0.2      0.3      0.1      0.2      0.2      0.2      0.1      0.1      0.1      0.4      0.4      0.3      0.5
AKW     AKW         30,821      0.6      0.4      0.6      0.8      0.4      0.4      0.1      0.1      0.3      0.1      0.2      0.2      0.1
1K-05K  All          5,024    176.7    176.6    184.2    177.7    177.4    177.3    183.2    186.6    182.9    171.1    167.8    163.1    159.0
1K-10K  All         10,024     92.5     92.8     94.7     92.6     92.5     94.0     95.6     96.2     95.6     90.2     88.9     86.2     84.6

WIS rank ranges: Basic 1-1,291 (General) and 682-1,291 (AW, LAD, LW); Inter. 1,292-5,000; Adv. 1 5,001-10,000; Adv. 2 10,001-15,000; S-Adv. 15,001-20,000; 21K+ 20,001+; 1K-05K 1-5,000; 1K-10K 1-10,000.
*WIS: Word Rankings for International Students *F-JLPT: The former Japanese Language Proficiency Test *VDRJ: Vocabulary Database for Reading Japanese *AW: Common Academic Words *LAD: Limited-academic-domain words *LW: Literary Words *AKW: Assumed Known Words (mostly proper nouns) *Ha: Humanities & Arts *Ss: Social Sciences *Tn: Technological Natural Sciences *Bn: Biological Natural Sciences
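One property that follows from the TCE formula is that the TCE of a combined group equals the lexeme-weighted mean of the TCEs of its non-overlapping sub-groups in the same text. The sketch below uses this as a consistency check between the by-level figures and the ungraded figure for General words in BCCWJ. The values are read from the flattened table (an assumption about its column layout), and the function name is hypothetical.

```python
def pooled_tce(lexeme_counts, tce_values):
    """Lexeme-weighted mean TCE of non-overlapping sub-groups in one text."""
    total_lexemes = sum(lexeme_counts)
    return sum(n * e for n, e in zip(lexeme_counts, tce_values)) / total_lexemes


# General words at the Basic, Inter., Adv. 1, Adv. 2 and S-Adv. levels.
general_lexemes = [1027, 1478, 3070, 3681, 4046]
general_tce_bccwj = [640.3, 31.8, 7.4, 3.3, 1.8]

pooled = pooled_tce(general_lexemes, general_tce_bccwj)
print(sum(general_lexemes))  # 13302, the lexeme count of the ungraded General group
print(round(pooled))         # 56, the ungraded General TCE reported for BCCWJ
```

The weighting by lexeme count matters: a plain average of the five level values would be dominated by the Basic level and would not reproduce the ungraded figure.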
19 5. Results and Discussion (4) TCE of the Grouped Words in Each Genre: Ranking for TCE of the Grouped Words in Each Genre *Domain-unspecified
[Table: ranking (1-22) by TCE of the 22 word groups (General, AW, LAD and LW at the Basic, Inter., Adv. 1, Adv. 2 and S-Adv. levels, plus 21K+ and AKW) within each of the 13 corpora (MC, BSB, UPC, BCCWJ, UYN, TB, MTT-Ss, TIS, MTT-Bn, MTT-Tn, BCCWJ-T, JS-Bn, JS-Tn). The individual cells cannot be recovered from this transcript.]
*TCE means the expected number of tokens of a lexeme in the tested group in a one-million-token text in the target domain.
*Numbers in bold show rankings higher than the expected ranking, i.e. 1-4 for Basic, 5-8 for Inter., 9-12 for Adv. 1, 13-16 for Adv. 2, 17-20 for S-Adv. and 21-22 for 21K+ and AKW. Italic numbers show rankings lower than the expected ranking.
*WIS: Word Rankings for International Students *F-JLPT: The former Japanese Language Proficiency Test *VDRJ: Vocabulary Database for Reading Japanese *AW: Common Academic Words *LAD: Limited-academic-domain words *LW: Literary Words *AKW: Assumed Known Words (mostly proper nouns) *Ha: Humanities & Arts *Ss: Social Sciences *Tn: Technological Natural Sciences *Bn: Biological Natural Sciences
20 Text Covering Efficiency (TCE) and its Rankings of the Grouped Words by Level and Genre (Detailed) *Domain-unspecified
Corpora shown: UYN (Newspaper, 5.68 million tokens) and JS-Bn (Bn journal articles, 0.72 million tokens).

Level   Group      Lexemes   UYN TCE  UYN rank  JS-Bn TCE  JS-Bn rank
Basic   General      1,027     530.6         2      495.6           2
Basic   Aca4D           31     667.0         1     1098.7           1
Basic   Aca3D           39     474.8         3      425.1           3
Basic   Aca2D           45     273.6         4      113.1           5
Basic   Aca1D_Ah        13     195.8         6       35.1          15
Basic   Aca1D_Ss         6     229.5         5       67.6          10
Basic   Aca1D_Tn         5      97.1        11       92.2           7
Basic   Aca1D_Bn         9      70.9        12       77.8           9
Basic   Lit            142      55.0        13       44.1          12
Inter.  General      1,478      33.4        18       13.9          26
Inter.  Aca4D          559     155.7         8      241.1           4
Inter.  Aca3D          542     121.5         9       95.4           6
Inter.  Aca2D          391     113.2        10       59.4          11
Inter.  Aca1D_Ah       104      48.8        15       30.6          16
Inter.  Aca1D_Ss       111     168.1         7       30.2          17
Inter.  Aca1D_Tn        46      39.6        16       39.4          14
Inter.  Aca1D_Bn        52      50.4        14       89.1           8
Inter.  Lit            446      16.7        22        9.5          30
Adv. 1  General      3,070       7.7        27        3.2          36
Adv. 1  Aca4D          212      15.2        23       41.4          13
Adv. 1  Aca3D          452      17.1        21       27.3          18
Adv. 1  Aca2D          429      20.1        20       21.7          22
Adv. 1  Aca1D_Ah       104       6.7        30        2.8          37
Adv. 1  Aca1D_Ss       127      39.1        17        5.3          34
Adv. 1  Aca1D_Tn        60       7.1        29       23.3          19
Adv. 1  Aca1D_Bn        68      12.3        24       20.3          23
Adv. 1  Lit            483       3.1        36        1.1          42
Adv. 2  General      3,681       3.0        37        2.0          39
Adv. 2  Aca4D          103       5.0        32       22.6          20
Adv. 2  Aca3D          328       7.6        28       13.4          27
Adv. 2  Aca2D          296       8.6        26        9.5          31
Adv. 2  Aca1D_Ah        71       1.7        41        2.2          38
Adv. 2  Aca1D_Ss        74      20.9        19        3.7          35
Adv. 2  Aca1D_Tn        48       3.4        35       17.0          25
Adv. 2  Aca1D_Bn        54       5.7        31       22.0          21
Adv. 2  Lit            345       0.6        44        0.6          45
S-Adv.  General      4,046       1.5        42        1.2          41
S-Adv.  Aca4D           56       2.6        39       10.3          28
S-Adv.  Aca3D          269       3.9        33        7.6          33
S-Adv.  Aca2D          232       3.5        34        9.6          29
S-Adv.  Aca1D_Ah        60       2.5        40        1.9          40
S-Adv.  Aca1D_Ss        55      12.2        25        0.8          43
S-Adv.  Aca1D_Tn        29       1.4        43        7.9          32
S-Adv.  Aca1D_Bn        53       2.9        38       17.6          24
S-Adv.  Lit            200       0.3        46        0.8          44
21K+    21K+        91,104       0.2        47        0.3          46
AKW     AKW         30,821       0.4        45        0.2          47
1K-05K  All          5,024     177.3        --      163.1          --
1K-10K  All         10,024      94.0        --       86.2          --

Aca4D and Aca3D make up AW; Aca2D and Aca1D make up LAD; Lit corresponds to LW. WIS rank ranges: Basic 1-1,291 (General) and 682-1,291 (others); Inter. 1,292-5,000; Adv. 1 5,001-10,000; Adv. 2 10,001-15,000; S-Adv. 15,001-20,000; 21K+ 20,001+; 1K-05K 1-5,000; 1K-10K 1-10,000.
21 TCE in Biological Natural Science Journal Articles by Type and Level of Grouped Words
TCE: Text Covering Efficiency = Expected number of tokens of a lexeme in the tested group in a one-million-token text in the target domain

Group           Basic   Inter.   Adv. 1   Adv. 2   S-Adv.
General         495.6     13.9      3.2      2.0      1.2
CAW (4D)       1098.7    241.1     41.4     22.6     10.3
CAW (3D)        425.1     95.4     27.3     13.4      7.6
LAD (1D-Bn)      77.8     89.1     20.3     22.0     17.6

22 TCE in Biological Natural Science Journal Articles by Type and Level of Grouped Words
TCE: Text Covering Efficiency = Expected number of tokens of a lexeme in the tested group in a one-million-token text in the target domain
[Chart: TCE (vertical axis, 0.0 to 300.0) of General, CAW (4D), CAW (3D) and LAD (1D-Bn) words at the Inter., Adv. 1, Adv. 2 and S-Adv. levels.]
23 5. Results and Discussion (5) TCE of the Grouped Words by Level and Genre (Detailed) *Domain-unspecified
TCE: Text Covering Efficiency = Expected number of tokens of a lexeme in the tested group in a one-million-token text in the target domain.

Level   Group      Lexemes    BCCWJ  BCCWJ-T       MC      BSB      UPC      UYN       TB   MTT-Ss      TIS   MTT-Bn   MTT-Tn    JS-Bn    JS-Tn
--      Tokens (M)      --    32.82     2.90     1.13     2.30     2.10     5.68     0.19     0.05     0.04     0.01     0.07     0.72     2.71
Basic   General      1,027    640.3    551.1    716.0    671.8    672.8    530.6    585.8    623.0    572.9    571.1    564.0    495.6    481.2
Basic   Aca4D           31    525.5    856.2    186.9    405.0    382.0    667.0    887.6    965.8    772.9    881.6   1177.6   1098.7   1069.0
Basic   Aca3D           39    354.5    493.5    165.7    337.0    356.3    474.8    603.5    569.1    610.7    422.3    460.3    425.1    384.3
Basic   Aca2D           45     96.5    143.8     41.0     73.9     58.8    273.6    195.0    178.3    302.1     47.9    113.4    113.1    119.2
Basic   Aca1D_Ah        13    122.7    102.4     43.4    124.5     91.4    195.8    159.4    153.5    366.8     55.3     55.6     35.1     67.1
Basic   Aca1D_Ss         6     72.4    119.4     20.7     47.3     38.5    229.5    141.9     62.6    126.5     12.0      4.5     67.6     20.4
Basic   Aca1D_Tn         5     92.0     90.6     52.8     73.0     60.7     97.1     51.4     39.5     33.2      0.0     37.5     92.2     97.4
Basic   Aca1D_Bn         9    102.5     71.9    101.8    110.9     78.1     70.9     75.6     28.5     34.3    583.4     22.3     77.8     53.1
Basic   Lit            142    149.1     72.1    474.1    201.5    248.0     55.0     74.1     79.6     87.9     68.4     93.5     44.1     46.3
Inter.  General      1,478     31.8     17.7     35.0     32.8     27.4     33.4     21.6     14.1     27.5     10.4     10.3     13.9     10.6
Inter.  Aca4D          559     81.8    198.9     13.8     47.5     56.2    155.7    173.8    174.8    181.7    197.8    198.4    241.1    271.0
Inter.  Aca3D          542     48.4    104.2      9.8     28.8     26.9    121.5     93.1     89.0     86.0     78.0     53.4     95.4     83.7
Inter.  Aca2D          391     47.4     82.7     13.7     33.3     24.9    113.2     89.0     79.7     89.3     79.8     44.7     59.4     48.2
Inter.  Aca1D_Ah       104     54.3     49.8     11.2     54.6     56.7     48.8     47.0     72.2     78.9     25.6      9.8     30.6     18.5
Inter.  Aca1D_Ss       111     57.4    125.8     16.5     31.5     26.4    168.1    167.6    114.7     81.0      4.5      4.1     30.2     15.3
Inter.  Aca1D_Tn        46     43.4     58.0     24.0     24.5     22.7     39.6     23.3      9.0     18.1     48.5    171.0     39.4     77.0
Inter.  Aca1D_Bn        52     41.6     44.5     12.9     35.7     24.8     50.4      9.9     25.1     49.7     89.9     10.8     89.1      9.1
Inter.  Lit            446     41.8     13.9     72.4     62.0     63.5     16.7     11.7      8.6     14.9      6.6     18.1      9.5      8.9
Adv. 1  General      3,070      7.4      3.7      4.4      7.2      7.1      7.7      4.9      2.4      6.8      2.2      2.7      3.2      2.9
Adv. 1  Aca4D          212      7.8     23.5      1.2      3.5      5.5     15.2     17.4     17.9     11.0     21.0     32.2     41.4     54.5
Adv. 1  Aca3D          452      7.8     20.0      1.4      4.3      4.7     17.1     15.5      9.9     12.6     47.3     41.6     27.3     27.2
Adv. 1  Aca2D          429      8.6     19.5      1.9      4.7      5.0     20.1     14.6     13.8     17.0     20.3     27.5     21.7     25.7
Adv. 1  Aca1D_Ah       104      9.1     12.2      2.9      7.1      8.6      6.7      4.4      3.6      5.9      4.1      2.1      2.8      9.0
Adv. 1  Aca1D_Ss       127     11.1     38.0      1.7      5.1      2.9     39.1     37.6     28.6     14.4      0.0      1.6      5.3      4.7
Adv. 1  Aca1D_Tn        60      9.6     16.5      3.2      3.9      4.7      7.1      2.7      2.0      3.6      1.2     14.5     23.3     49.3
Adv. 1  Aca1D_Bn        68     10.6     16.1      2.7      6.5      5.9     12.3      2.0     11.3      5.6     21.2      4.5     20.3      2.2
Adv. 1  Lit            483      8.5      1.7     15.5     13.2     15.2      3.1      1.2      0.7      1.6      3.3      2.6      1.1      1.5
Adv. 2  General      3,681      3.3      1.5      1.6      3.2      2.9      3.0      2.1      0.8      2.0      1.4      1.4      2.0      1.5
Adv. 2  Aca4D          103      3.2     10.1      0.5      1.5      2.0      5.0      5.0      4.8      5.5      5.6      7.7     22.6     18.4
Adv. 2  Aca3D          328      3.3      9.0      0.5      1.7      1.8      7.6      6.3      3.7      4.9      9.4     19.2     13.4     13.3
Adv. 2  Aca2D          296      3.9     10.2      0.9      2.0      1.9      8.6      5.0      4.5      6.3     14.1     23.2      9.5     14.3
Adv. 2  Aca1D_Ah        71      4.5      9.6      1.1      3.4      3.4      1.7      1.0      1.1      1.7      0.0      0.8      2.2      9.9
Adv. 2  Aca1D_Ss        74      5.5     21.2      1.5      2.9      1.3     20.9     21.8      3.2      7.1      0.0      1.8      3.7      1.0
Adv. 2  Aca1D_Tn        48      4.6     10.3      0.8      1.6      2.3      3.4      1.3      0.0      1.0      0.0     58.6     17.0     67.9
Adv. 2  Aca1D_Bn        54      5.1     10.5      1.8      3.8      1.6      5.7      1.0      0.4      2.2     21.3      0.2     22.0      2.9
Adv. 2  Lit            345      4.1      0.4      2.1      6.1      6.7      0.6      0.2      0.1      0.3      0.2      0.5      0.6      0.3
S-Adv.  General      4,046      1.8      0.8      0.8      1.9      1.9      1.5      0.9      0.4      1.0      0.6      0.7      1.2      0.9
S-Adv.  Aca4D           56      1.9      6.4      0.5      0.9      0.8      2.6      3.8      3.5      7.2      7.7     17.9     10.3     21.3
S-Adv.  Aca3D          269      1.9      5.4      0.4      1.2      1.3      3.9      2.8      2.5      2.0      1.9     14.3      7.6     10.7
S-Adv.  Aca2D          232      2.2      6.5      0.5      1.2      1.1      3.5      3.8      1.8      1.4      6.8      7.5      9.6     10.7
S-Adv.  Aca1D_Ah        60      2.4      5.3      0.8      2.6      0.8      2.5      0.5      0.7      1.6      0.0      0.0      1.9      2.9
S-Adv.  Aca1D_Ss        55      3.6     18.7      0.6      1.4      0.8     12.2     13.5      7.2      1.7      0.0      0.0      0.8      1.7
S-Adv.  Aca1D_Tn        29      3.1      9.0      0.6      1.4      1.7      1.4      0.4      0.7      0.0     19.8     14.8      7.9     19.4
S-Adv.  Aca1D_Bn        53      3.1      7.6      0.7      1.8      1.0      2.9      0.3      0.7      3.1      4.1      4.0     17.6      2.4
S-Adv.  Lit            200      2.6      0.2      0.7      3.9      4.5      0.3      0.1      0.1      0.6      0.0      0.1      0.8      0.1
21K+    21K+        91,104      0.2      0.3      0.1      0.2      0.2      0.2      0.1      0.1      0.1      0.4      0.4      0.3      0.5
AKW     AKW         30,821      0.6      0.4      0.6      0.8      0.4      0.4      0.1      0.1      0.3      0.1      0.2      0.2      0.1
1K-05K  All          5,024    176.7    176.6    184.2    177.7    177.4    177.3    183.2    186.6    182.9    171.1    167.8    163.1    159.0
1K-10K  All         10,024     92.5     92.8     94.7     92.6     92.5     94.0     95.6     96.2     95.6     90.2     88.9     86.2     84.6

Aca4D and Aca3D make up AW; Aca2D and Aca1D make up LAD; Lit corresponds to LW. WIS rank ranges: Basic 1-1,291 (General) and 682-1,291 (others); Inter. 1,292-5,000; Adv. 1 5,001-10,000; Adv. 2 10,001-15,000; S-Adv. 15,001-20,000; 21K+ 20,001+; 1K-05K 1-5,000; 1K-10K 1-10,000.
*WIS: Word Rankings for International Students *F-JLPT: The former Japanese Language Proficiency Test *VDRJ: Vocabulary Database for Reading Japanese *AW: Academic Words *Aca: Academic Vocabulary (AW & LAD) *LAD: Limited-academic-domain words *4D/3D/2D/1D: 4-/3-/2-/1-domain words *LW: Literary Words *AKW: Assumed Known Words (mostly proper nouns) *Ha: Humanities & Arts *Ss: Social Sciences *Tn: Technological Natural Sciences *Bn: Biological Natural Sciences
24 5. Results and Discussion (6) TCE of the Grouped Words by Level and Genre (Detailed): Ranking for TCE of the Grouped Words in Each Genre *Domain-unspecified
[Table: ranking (1-47) by TCE of the 47 word groups (General, Aca4D, Aca3D, Aca2D, Aca1D_Ah, Aca1D_Ss, Aca1D_Tn, Aca1D_Bn and Lit at the Basic, Inter., Adv. 1, Adv. 2 and S-Adv. levels, plus 21K+ and AKW) within each of the 13 corpora (MC, BSB, UPC, BCCWJ, UYN, TB, MTT-Ss, TIS, MTT-Bn, MTT-Tn, BCCWJ-T, JS-Bn, JS-Tn). The individual cells cannot be recovered from this transcript.]
*TCE means the expected number of tokens of a lexeme in the tested group in a one-million-token text in the target domain.
*Numbers in bold show rankings higher than the expected ranking, i.e. 1-9 for Basic, 10-18 for Inter., 19-27 for Adv. 1, 28-36 for Adv. 2, 37-45 for S-Adv. and 46-47 for 21K+ and AKW. Italic numbers show rankings lower than the expected ranking.
*WIS: Word Rankings for International Students *F-JLPT: The former Japanese Language Proficiency Test *VDRJ: Vocabulary Database for Reading Japanese *AW: Academic Words *Aca: Academic Vocabulary (AW & LAD) *LAD: Limited-academic-domain words *4D/3D/2D/1D: 4-/3-/2-/1-domain words *LW: Literary Words *AKW: Assumed Known Words (mostly proper nouns) *Ha: Humanities & Arts *Ss: Social Sciences *Tn: Technological Natural Sciences *Bn: Biological Natural Sciences
25 5. Results and Discussion (7) The results show that TCE clearly indicates the efficiency of word groups in gaining text coverage, and thus it is useful for deciding a more efficient learning/teaching order of words. These findings do not seem to conflict with previous studies. Lexical features of texts in different genres can also be examined by checking the TCE figures, e.g. Japanese newspaper texts have lexical features similar to those of academic texts in social sciences. The index reveals things that cannot be seen without it. For example, such an analysis allows you to say things like, "Learning the intermediate Japanese Common Academic Words is 6.2 times more efficient in covering Japanese social science texts than learning other words at the same level, and 8.3 times more efficient than learning the advanced Common Academic Words."
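The "times more efficient" statements are ratios of TCE values. The slide does not say which corpus the figures come from; as an assumption, the sketch below takes the TB (social science texts) column of the by-level table, which reproduces both figures when "other words at the same level" is read as intermediate General words and "advanced" as Adv. 1. The variable names are hypothetical.

```python
# TCE values assumed to come from the TB column of the by-level table.
tce_inter_aw = 134.1       # intermediate Common Academic Words
tce_inter_general = 21.6   # intermediate General words
tce_adv1_aw = 16.1         # Adv. 1 Common Academic Words

ratio_vs_general = tce_inter_aw / tce_inter_general
ratio_vs_adv1_aw = tce_inter_aw / tce_adv1_aw

print(round(ratio_vs_general, 1))  # 6.2
print(round(ratio_vs_adv1_aw, 1))  # 8.3
```

Because both TCE values in each ratio are computed on the same text, the text length cancels out and the ratio depends only on the tokens covered per lexeme.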
26 6. Conclusion TCE: Text Covering Efficiency = the mean text coverage per one million tokens of the target text by a word from the grouped words. TCE enables us to compare many different types of grouped words in many different genres. It therefore makes it easier to decide which words should be learned first in order to read texts in a given genre. TCE also enables us to examine the lexical features of texts in different genres.
27 Thank you.
28 References
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213-238.
Hyland, K., & Tse, P. (2007). Is there an "Academic Vocabulary"? TESOL Quarterly, 41(2), 235-253.
Matsushita, T. (2011). [Extracting and validating the Japanese Academic Word List]. [Proceedings of the Conference for Teaching Japanese as a Foreign Language, Spring 2011] (pp. 244-249).
Matsushita, T. (2012). [Extracting and validating the Japanese Literary Word List: A corpus-based approach]. The Ninth Symposium for Japanese Language Education and Japanese Studies, City University of Hong Kong, November 24, 2012.
Richards, B. J., & Malvern, D. D. (1997). Quantifying lexical diversity in the study of language development. Reading: University of Reading.
Xue, G., & Nation, I. S. P. (1984). A university word list. Language Learning and Communication, 3(2), 215-229.
29 Added note: Robustness of TCE. In addition, TCE is a robust index with which different lexical features in different genres can be clarified. As has been argued for the type-token ratio (TTR) (Richards & Malvern, 1997), the relationship between the number of tokens and the number of lexemes varies with text size. This is not a problem for TCE, because the formula uses the number of lexemes in the target group of words, not the number of lexemes occurring in the text. This is reasonable because learners generally do not know which words will occur in a particular text. For example, to evaluate the intermediate literary words as a source of text coverage, it is reasonable to divide the tokens by the number of lexemes in the intermediate literary word list, since these are the words a learner will learn before s/he reads the text.