
Measure Style in Student Essays: A Detailed Analysis
Explore a study on measuring style in student essays by Sandeep Mathias and Pushpak Bhattacharyya from IIT Bombay. The study delves into word choice, sentence fluency, and overall style assessment. Discover how goodness of a word/phrase is calculated and its significance in evaluating essays.
Presentation Transcript
Thank Goodness! A Way to Measure Style in Student Essays
Sandeep Mathias and Pushpak Bhattacharyya
Center for Indian Language Technology, Department of Computer Science, IIT Bombay
Outline
Introduction; What is goodness?; Experimental Setup; Results; Conclusion & Future Work
Introduction
Essay: A piece of text written in response to a topic or prompt.
Essay grading: Assigning a score to the essay, either holistically or on a particular trait such as style.
Automatic essay grading: Using machines to grade essays.
Related Work
Holistic scoring of essays using deep learning: Taghipour and Ng (2016), Dong and Zhang (2016), Dong and Zhang (2017), Tay et al. (2018).
Cross-domain essay scoring: Phandi et al. (2015), Dong and Zhang (2016).
Commercial systems: E-Rater.
Trait-specific essay scoring: Persing et al. (2010), Persing and Ng (2013), Persing and Ng (2014), Persing and Ng (2015), Persing and Ng (2016), Somasundaran et al. (2014), Taghipour (2017), etc.
Problem Definitions: Word Choice
Word choice is the quality of an essay that measures how precise its vocabulary is.
Good: Sally Yates expressed her concern about Michael Flynn's ties with Russia.
Bad: Sally Yates said that she was concerned about Michael Flynn's ties with Russia.
Problem Definitions: Sentence Fluency
Sentence fluency is the quality of an essay that measures how well-written its sentences are.
Good: I was told, by my guide, to prepare slides for the presentation. I should not have procrastinated the work till the last minute.
Bad: My guide said me to make slides for presentation. I mistakenly delayed till the end, and I now make last minute examples.
Problem Definitions: Style
Style is a measure of how well-written the essay is. It is a combination of word choice and sentence fluency.
Goodness of a word / phrase
The goodness of a word / phrase W is the average score of the essays in which it occurs, weighted by how often it occurs at each score. The score can be the style, word choice, or sentence fluency score of the essay.
$$\text{goodness}(W) = \frac{\sum_i i \cdot C_i(W)}{\sum_i C_i(W)}$$
where C_i(W) is the count of the word / phrase W in essays given a score of i.
Example: In the corpus, let the word "humourous" appear 5 times in essays with a score of 4, 8 times in essays with a score of 5, and only once in essays with a score of 6. Hence, goodness(humourous) = (4 x 5 + 5 x 8 + 6 x 1) / (5 + 8 + 1) = 66/14 = 4.71.
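The formula can be checked with a few lines of Python. This is only a sketch: the function name `goodness` and the dictionary layout (essay score mapped to occurrence count) are illustrative choices, not something given on the slides.

```python
from typing import Mapping


def goodness(counts_by_score: Mapping[int, int]) -> float:
    """Score-weighted average: sum(i * C_i(W)) / sum(C_i(W))."""
    total = sum(counts_by_score.values())
    if total == 0:
        raise ValueError("the word / phrase never occurs in the corpus")
    weighted = sum(score * count for score, count in counts_by_score.items())
    return weighted / total


# Counts from the slide's example: score -> number of occurrences.
humourous_counts = {4: 5, 5: 8, 6: 1}
print(round(goodness(humourous_counts), 2))  # 4.71
```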
Training & Testing
Training, first pass: For each word / phrase in an essay, assign its goodness score as the score of the essay.
Training, second pass: For each word / phrase in the corpus, the goodness is the mean score of all of its occurrences in the essays.
Testing: For each known word / phrase, assign its goodness score from the earlier calculated scores. Handle each unknown word / phrase
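The two training passes can be sketched as follows. The slides do not say how essays are tokenized or how phrases are extracted, so this sketch assumes each essay has already been turned into a list of word / phrase strings; `train_goodness` and `toy_corpus` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def train_goodness(essays: Iterable[Tuple[List[str], float]]) -> Dict[str, float]:
    """Build a goodness table from (tokens, essay_score) pairs."""
    occurrences: Dict[str, List[float]] = defaultdict(list)
    # First pass: every occurrence of a token inherits the score of its essay.
    for tokens, score in essays:
        for token in tokens:
            occurrences[token].append(score)
    # Second pass: goodness is the mean score over all occurrences.
    return {tok: sum(scores) / len(scores) for tok, scores in occurrences.items()}


# Toy data (not from the dataset): two essays with scores 3 and 5.
toy_corpus = [
    (["a", "joke", "a"], 3.0),
    (["cherish", "a", "joke"], 5.0),
]
table = train_goodness(toy_corpus)
print(table["joke"])  # 4.0
```

Because every occurrence counts separately, a token repeated inside one essay pulls its goodness toward that essay's score, which matches the count-weighted formula.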
Unknown Word Handling
Unknown word: not found in the training data, but present in GloVe. Find the most similar word in the training data (using cosine similarity) and assign its goodness score to the unknown word.
Spelling error: not found in the training data or in the GloVe word vectors. The goodness of a spelling error is 0.
Unknown phrase: not found in the training data. Its goodness is the mean goodness score of its constituent words.
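The three fallback rules might be combined as in the sketch below. A small hand-made dictionary of 2-dimensional vectors stands in for GloVe, and the goodness values are toy numbers. The function names, and returning 0 when no training word has a vector, are assumptions made for illustration.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def word_goodness(word: str, table: Dict[str, float],
                  vectors: Dict[str, Sequence[float]]) -> float:
    if word in table:                      # known word
        return table[word]
    if word not in vectors:                # treated as a spelling error
        return 0.0
    candidates = [w for w in table if w in vectors]
    if not candidates:                     # assumption: not covered by the slides
        return 0.0
    # Unknown word: borrow the goodness of the most similar training word.
    nearest = max(candidates, key=lambda w: cosine(vectors[word], vectors[w]))
    return table[nearest]


def phrase_goodness(phrase: str, table: Dict[str, float],
                    vectors: Dict[str, Sequence[float]]) -> float:
    if phrase in table:
        return table[phrase]
    words = phrase.split()
    if not words:
        return 0.0
    # Unknown phrase: mean goodness of its constituent words.
    return sum(word_goodness(w, table, vectors) for w in words) / len(words)


# Toy goodness table and toy vectors (stand-ins, not real GloVe embeddings).
toy_table = {"cherish": 4.1, "forever": 3.8, "trash": 2.5}
toy_vectors = {
    "cherish": [1.0, 0.1],
    "forever": [0.2, 1.0],
    "trash": [-1.0, 0.0],
    "treasure": [0.9, 0.2],
}
print(word_goodness("treasure", toy_table, toy_vectors))  # 4.1
```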
Other Features
Essay statistics: word count, sentence count.
Punctuation features: count of commas, exclamation points, question marks, quotations.
Complexity features: Flesch Reading Ease score, avg. parse tree depth, count of SBARs.
Language modeling features: LM score, perplexity, perplexity per word, OOVs.
Coherence-based features: pairwise sentence content word similarity, entity grid features (window size from 2 to 4).
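The essay statistics and punctuation features are simple counts, as the sketch below shows. The slides do not name a tokenizer or sentence splitter, so whitespace splitting and a naive split on `.`, `!` and `?` are assumptions here, and `surface_features` is a hypothetical name. The complexity, language modeling and coherence features need a parser or a language model and are not attempted.

```python
import re
from typing import Dict


def surface_features(essay: str) -> Dict[str, int]:
    """Count-based features; tokenization here is deliberately naive."""
    words = essay.split()
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "commas": essay.count(","),
        "exclamation_points": essay.count("!"),
        "question_marks": essay.count("?"),
        "quotation_marks": essay.count('"'),
    }


features = surface_features('Why laugh? "Because," she said, "it helps!" I agree.')
print(features["word_count"], features["sentence_count"])  # 9 3
```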
Experimental Setup: Dataset Used
ONLY Prompts #7 (style) and #8 (word choice & sentence fluency) of the ASAP Automatic Essay Grading Dataset.

Prompt ID      7             8
Score Range    1-4           1-6
Essays         1569          723
Avg. Length    ~250 words    ~600 words
Quantities     Style         WC & SF
Experimental Setup: Classification Details
Evaluation method: 5-fold cross-validation, using stratified sampling.
Evaluation metric: Quadratic Weighted Kappa.
Classifier: Ordinal Class Classifier using Naive Bayes (NB) and Random Forest (RF).
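For readers unfamiliar with the metric, Quadratic Weighted Kappa compares the observed disagreement between two sets of integer scores with the disagreement expected by chance, penalizing each disagreement by the squared distance between the two scores. The sketch below follows the standard definition in plain Python. It is not the authors' evaluation code, and the function signature is an illustrative choice.

```python
from typing import Sequence


def quadratic_weighted_kappa(rater_a: Sequence[int], rater_b: Sequence[int],
                             min_score: int, max_score: int) -> float:
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = max_score - min_score + 1
    if n < 2:
        raise ValueError("need at least two possible scores")
    # Observed confusion matrix between the two raters.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1
    total = len(rater_a)
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    numerator = denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2       # quadratic penalty
            expected = hist_a[i] * hist_b[j] / total   # chance agreement
            numerator += weight * observed[i][j]
            denominator += weight * expected
    # If both raters use a single identical score, agreement is perfect.
    return 1.0 - numerator / denominator if denominator else 1.0


print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4))  # 1.0
```

A value of 1 means perfect agreement, 0 means agreement at chance level, and negative values mean systematic disagreement.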
Results

Feature Set                              Style (NB)   Word Choice (RF)   Sentence Fluency (RF)
Taghipour and Ng (2016)                  0.4902       0.2511             0.3463
All features - goodness                  0.5485       0.3433             0.3886
Content word goodness                    0.2259       0.3323             0.3586
ALL words goodness                       0.2821       0.3557             0.3984
ALL words + content phrase goodness      0.0792       0.1785             0.2241
ALL features                             0.5617       0.4233             0.4443
Other human rater                        0.5444       0.4816             0.5091
Analysis of Goodness Scores
Range 1-2. Example words: ower, rumers, computers. Example phrases: sameting funing, adefokil stoeshi, feel happy.
Range 2-3. Example words: tho, trash, reward, relay. Example phrases: we love laughter, a good thing, laugh that much.
Range 3-4. Example words: ok, fair, forever. Example phrases: make me happy, a joke, love to laugh.
Range 4-5. Example words: cherish, role, obvious. Example phrases: cherish forever, the center of attention.
Range 5-6. Example words: dire, aggressively, anguish. Example phrases: one of utter sarcasm, went on similarly, something ridiculous.
Analysis: Content Phrases
Content phrases suffer from data sparsity.
Example: "cherish forever" occurred in only 1 fold, and had a goodness value of ~4.5. The constituent words "cherish" and "forever" have goodness scores of ~4.1 and ~3.8.
Analysis: Adversarial Essays
An adversarial essay is one that a human rater would rate low, but that would fool our system into rating it high.
Creation of adversarial essays:
Taghipour (2017) suggests using context-free grammars and language modeling to create spurious essays, before trying to detect whether an input essay is spurious or not.
Farag et al. (2018) construct adversarial essays by permuting the sentences of good scoring essays.
We create adversarial essays by writing lots of long sentences of rubbish using only good words.
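The slides do not say how the rubbish sentences were written. One way to approximate the idea automatically is to sample words at random from the high-goodness part of the table, as in this sketch. The threshold, the sentence length, and the function name `make_adversarial_essay` are all hypothetical, and the goodness values are toy numbers.

```python
import random
from typing import Dict


def make_adversarial_essay(table: Dict[str, float], min_goodness: float,
                           n_sentences: int, words_per_sentence: int,
                           seed: int = 0) -> str:
    """String together randomly chosen high-goodness words into long sentences."""
    good_words = sorted(w for w, g in table.items() if g >= min_goodness)
    if not good_words:
        raise ValueError("no word reaches the goodness threshold")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sentences = []
    for _ in range(n_sentences):
        words = [rng.choice(good_words) for _ in range(words_per_sentence)]
        sentences.append(" ".join(words).capitalize() + ".")
    return " ".join(sentences)


# Toy goodness table (values are illustrative, not taken from the corpus).
toy_goodness = {"dire": 5.4, "anguish": 5.2, "aggressively": 5.1,
                "ok": 3.2, "trash": 2.4}
essay = make_adversarial_essay(toy_goodness, min_goodness=5.0,
                               n_sentences=3, words_per_sentence=20)
print(essay)
```

An essay built this way has a high average word goodness but no meaning, which is why the goodness-only models are the most exposed to it.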
Analysis: Adversarial Essays, Results (Average Score Increase)

Property                Style    Word Choice    Sentence Fluency
Using goodness only     1.20     2.05           1.96
Using ALL features      0.42     1.36           1.22
Conclusion & Future Work
We have defined a property of the essay called the goodness score, and used it as a way to score the style, word choice and sentence fluency of essays.
In future, we plan to extend our work to cover other aspects of essay scoring, like content, organization, etc.
THANK YOU! QUESTIONS?
References
Fei Dong and Yue Zhang. 2016. Automatic features for essay scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1072-1077, Austin, Texas.
Youmna Farag, Helen Yannakoudakis, and Ted Briscoe. 2018. Neural automated essay scoring and coherence modeling for adversarially crafted input. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Louisiana, USA.
Isaac Persing, Alan Davis, and Vincent Ng. 2010. Modeling organization in student essays. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 229-239, Cambridge, Massachusetts, USA.
Isaac Persing and Vincent Ng. 2013. Modeling thesis clarity in student essays. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria.
Isaac Persing and Vincent Ng. 2014. Modeling prompt adherence in student essays. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Maryland, USA.
Isaac Persing and Vincent Ng. 2015. Modeling argument strength in student essays. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 543-552, Beijing, China.
Isaac Persing and Vincent Ng. 2016. Modeling stance in student essays. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany.
Swapna Somasundaran, Jill Burstein, and Martin Chodorow. 2014. Lexical chaining for measuring discourse coherence quality in test-taker essays. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 950-961, Dublin, Ireland.
Kaveh Taghipour. 2017. Robust Trait-Specific Essay Scoring Using Neural Networks and Density Estimators. Ph.D. thesis.
Kaveh Taghipour and Hwee Tou Ng. 2016. A neural approach to automated essay scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1882-1891, Austin, Texas.