Data Annotation and Prediction for Student Behavior Analysis
This presentation covers developing models that predict aspects of student behavior and academic performance using data annotation and classification methods. It explains why labeled data is needed, how to obtain outside knowledge about the construct of interest, and how to collect bronze-standard labels for behavioral constructs where no gold-standard metric exists. Examples include identifying off-task students, students who are gaming the system, and students at risk of failing a class.
Presentation Transcript
Data Annotation for Classification
Prediction: Develop a model which can infer a single aspect of the data (the predicted variable) from some combination of other aspects of the data (the predictor variables). Which students are off-task? Which students will fail the class?
Classification: Develop a model which can infer a categorical predicted variable from some combination of other aspects of the data. Which students will fail the class? Is the student currently gaming the system? Which type of gaming the system is occurring?
We will go into detail on classification methods tomorrow.
In order to use prediction methods: We need to know what we're trying to predict, and we need to have some labels of it in real data.
For example: If we want to predict whether a student using educational software is off-task, or gaming the system, or bored, or frustrated, or going to fail the class, we need to first collect some data. And within that data, we need to be able to identify which students are off-task (or the construct of interest), and ideally when.
So we need to label some data: We need to obtain outside knowledge to determine what the value is for the construct of interest.
In some cases: We can get a gold-standard label. For instance, if we want to know if a student passed a class, we just go ask their instructor.
But for behavioral constructs: There's no one to ask. We can't ask the student (self-presentation). There's no gold-standard metric. So we use data labeling methods or observation methods (e.g. quantitative field observations, video coding) to collect bronze-standard labels. Not perfect, but good enough.
One such labeling method: Text replay coding
Text replays: Pretty-prints of student interaction behavior from the logs
Sampling: You can set up any sampling schema you want, if you have enough log data. 5-action sequences. 20-second sequences. Every behavior on a specific skill, but other skills omitted.
Sampling: Equal number of observations per lesson. Equal number of observations per student. Observations that machine learning software needs help to categorize ("biased sampling").
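The two simplest sampling schemas above (5-action sequences and 20-second sequences) can be sketched as follows. This is only an illustration under assumptions: the log format (a chronologically ordered list of dicts with a `time` field in seconds) and the function names are hypothetical, not the format used by the text replay software.

```python
from typing import Dict, List

# Hypothetical log record: one dict per student action, with a "time" in seconds.
Action = Dict[str, float]


def clips_by_action_count(actions: List[Action], size: int = 5) -> List[List[Action]]:
    """Split an ordered action list into consecutive clips of `size` actions.

    The final clip may be shorter if the log does not divide evenly.
    """
    return [actions[i:i + size] for i in range(0, len(actions), size)]


def clips_by_duration(actions: List[Action], seconds: float = 20.0) -> List[List[Action]]:
    """Group actions into consecutive time windows of `seconds`.

    Windows are measured from the first action; empty windows are skipped.
    """
    if not actions:
        return []
    start = actions[0]["time"]
    windows: Dict[int, List[Action]] = {}
    for action in actions:
        index = int((action["time"] - start) // seconds)
        windows.setdefault(index, []).append(action)
    return [windows[k] for k in sorted(windows)]


if __name__ == "__main__":
    # Made-up timestamps, purely for illustration.
    log = [{"time": t} for t in [0, 3, 8, 19, 21, 25, 41, 59, 60, 61, 62]]
    print([len(c) for c in clips_by_action_count(log, 5)])
    print([len(c) for c in clips_by_duration(log, 20)])
```

Stratified schemas (equal observations per lesson or per student) would add a grouping step before clipping; that step is left out of this sketch.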
Major Advantages: Both video and field observations hold some risk of observer effects. Text replays are based on logs that were collected completely unobtrusively.
Major Advantages: Blazing fast to conduct. 8 to 40 seconds per observation.
Notes: Decent inter-rater reliability is possible (Baker, Corbett, & Wagner, 2006) (Baker, Mitrovic, & Mathews, 2010) (Sao Pedro et al., 2010) (Montalvo et al., 2010). Agree with other measures of constructs (Baker, Corbett, & Wagner, 2006). Can be used to train machine-learned detectors (Baker & de Carvalho, 2008) (Baker, Mitrovic, & Mathews, 2010) (Sao Pedro et al., 2010).
Major Limitations: Limited range of constructs you can code. Gaming the System: yes. Collaboration in online chat: yes (Prata et al., 2008). Frustration, Boredom: sometimes. Off-Task Behavior outside of software: no. Collaborative Behavior outside of software: no.
Major Limitations: Lower precision (because lower bandwidth of observation).
Find a partner: Could be your project team-mate, but doesn't have to be. You will do this exercise with them.
Get a copy of the text replay software: On your flash drive, or at http://www.joazeirodebaker.net/algebra-obspackage-LSRM.zip
Skim the instructions: At Instructions-LSRM.docx
Log into text replay software: Using exploratory login. Try to figure out what the student's behavior means, with your partner. Do this for ~5 minutes.
Now pick a category you want to code: With your partner.
Now code data: According to your coding scheme (is-category versus is-not-category). Separate from your partner. For 20 minutes.
Now put your data together: Using the observations-NAME files you obtained, make a table (in Excel?) showing:
            Coder 1 Y   Coder 1 N
Coder 2 Y   15          2
Coder 2 N   3           8
Now we can compute your inter-rater reliability (also called agreement).
Agreement/Accuracy: The easiest measure of inter-rater reliability is agreement, also called accuracy: (# of agreements) / (total number of codes)
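As a minimal sketch of this calculation, the snippet below computes agreement from a square table of counts in which the diagonal cells hold the cases where both coders gave the same label. The function name is an illustrative choice, and the sample counts are the ones from the two-coder table in the exercise above.

```python
from typing import List


def percent_agreement(table: List[List[int]]) -> float:
    """Agreement = (number of agreements) / (total number of codes).

    `table[i][j]` is the number of clips one coder put in category i
    and the other coder put in category j, so agreements lie on the diagonal.
    """
    total = sum(sum(row) for row in table)
    agreements = sum(table[i][i] for i in range(len(table)))
    return agreements / total


if __name__ == "__main__":
    # Rows: Coder 2 (Y, N); columns: Coder 1 (Y, N).
    exercise_table = [[15, 2], [3, 8]]
    print(round(percent_agreement(exercise_table), 3))  # 23 / 28 -> 0.821
```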
Agreement/Accuracy: There is general agreement across fields that agreement/accuracy is not a good metric. What are some drawbacks of agreement/accuracy?
Agreement/Accuracy: Let's say that Tasha and Uniqua agreed on the classification of 9200 time sequences, out of 10000 actions, for a coding scheme with two codes. 92% accuracy. Good, right?
Non-even assignment to categories: Percent agreement does poorly when there is non-even assignment to categories, which is almost always the case. Imagine an extreme case: Uniqua (correctly) picks category A 92% of the time. Tasha always picks category A. Agreement/accuracy of 92%, but essentially no information.
An alternate metric: Kappa = (Agreement - Expected Agreement) / (1 - Expected Agreement)
Kappa: Expected agreement is computed from a table of the form
                     Rater 2 Category 1   Rater 2 Category 2
Rater 1 Category 1   Count                Count
Rater 1 Category 2   Count                Count
Note that Kappa can be calculated for any number of categories (but only 2 raters).
Cohen's (1960) Kappa: The formula for 2 categories. Fleiss's (1971) Kappa, which is more complex, can be used for 3+ categories. I have an Excel spreadsheet which calculates multi-category Kappa, which I would be happy to share with you.
Expected agreement: Look at the proportion of labels each coder gave to each category. To find the proportion of agreement on category A that could be expected by chance, multiply pct(coder1/categoryA) * pct(coder2/categoryA). Do the same thing for category B. Add these two values together. Because the inputs are proportions, the sum is already a proportion; you only divide by the total number of labels if you worked with expected counts instead. This is your expected agreement.
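The expected-agreement recipe and the Kappa formula can be put together in a short sketch. The function below is an illustrative two-rater implementation for a square table of counts, not the spreadsheet mentioned above. The second sample table is a hypothetical set of counts constructed to match the extreme case described earlier, where one coder always picks category A.

```python
from typing import List


def cohen_kappa(table: List[List[int]]) -> float:
    """Kappa = (agreement - expected agreement) / (1 - expected agreement).

    `table[i][j]` counts clips that rater 1 put in category i and
    rater 2 put in category j. Works for two raters and any number
    of categories, as long as the table is square.
    """
    k = len(table)
    total = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(k)) / total
    # Proportion of labels each rater gave to each category.
    rater1 = [sum(table[i]) / total for i in range(k)]
    rater2 = [sum(table[i][j] for i in range(k)) / total for j in range(k)]
    # Chance agreement: sum over categories of the product of proportions.
    expected = sum(p1 * p2 for p1, p2 in zip(rater1, rater2))
    if expected == 1.0:
        raise ValueError("Kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Counts from the two-coder exercise table.
    print(round(cohen_kappa([[15, 2], [3, 8]]), 3))
    # Hypothetical extreme case: rows are Uniqua (A, B), columns are Tasha (A, B),
    # and Tasha always picks A. Agreement is 92%, but Kappa is 0.
    print(cohen_kappa([[9200, 0], [800, 0]]))
```

In the extreme case, the expected agreement equals the observed agreement, so the numerator is zero: the 92% agreement carries no information beyond chance.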
Example
                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task   20               5
Tyrone On-Task    15               60
What is the percent agreement? 80%
What is Tyrone's expected frequency for on-task? 75%
What is Pablo's expected frequency for on-task? 65%
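Carrying the example one step further, the remaining arithmetic follows directly from the counts in the table and the Kappa formula given earlier. The symbols P_o (agreement) and P_e (expected agreement) are shorthand introduced here, not notation from the slides; the off-task proportions are simply the complements of the on-task ones (25% for Tyrone, 35% for Pablo).

```latex
\begin{align*}
P_o &= \frac{20 + 60}{100} = 0.80 \\
P_e &= \underbrace{0.75 \times 0.65}_{\text{on-task}}
     + \underbrace{0.25 \times 0.35}_{\text{off-task}}
     = 0.4875 + 0.0875 = 0.575 \\
\kappa &= \frac{P_o - P_e}{1 - P_e}
        = \frac{0.80 - 0.575}{1 - 0.575}
        = \frac{0.225}{0.425} \approx 0.529
\end{align*}
```

So although Pablo and Tyrone agree on 80% of the observations, their agreement beyond chance is only about 0.53.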