
Regression in Prediction for Data Mining
Regression in prediction involves developing models to infer a single aspect of the data from other aspects. This can be used for predicting future outcomes or making inferences about the present, with examples in educational scenarios and automated decision-making. The process involves predicting numerical labels based on features in the data set.
Presentation Transcript
Week 1, video 2: Regressors
Prediction: Develop a model that can infer a single aspect of the data (the predicted variable) from some combination of other aspects of the data (the predictor variables). Sometimes used to predict the future; sometimes used to make inferences about the present.
Prediction: Examples. A student is watching a video in a MOOC right now. Is he bored or frustrated? A student has used educational software for the last half hour. How likely is it that she knows the skill in the next problem? A student has completed three years of high school. What will be her score on the college entrance exam?
What can we use this for? Improved educational design: if we know when students get bored, we can improve that content. Automated decisions by software: if we know that a student is frustrated, let's offer the student some online help. Informing teachers, instructors, and other stakeholders: if we know that a student is frustrated, let's tell their teacher.
Regression in Prediction: There is something you want to predict ("the label"), and the thing you want to predict is numerical: the number of hints a student requests, how long the student takes to answer, how much of the video the student will watch, or what the student's test score will be.
Regression in Prediction: A model that predicts a number is called a regressor in data mining. The overall task is called regression.
Regression: To build a regression model, you obtain a data set where you already know the answer, called the training label. For example, if you want to predict the number of hints the student requests, each value of numhints is a training label.
Skill          pknow  time  totalactions  numhints
ENTERINGGIVEN  0.704     9             1         0
ENTERINGGIVEN  0.502    10             2         0
USEDIFFNUM     0.049     6             1         3
ENTERINGGIVEN  0.967     7             3         0
REMOVECOEFF    0.792    16             1         1
REMOVECOEFF    0.792    13             2         0
USEDIFFNUM     0.073     5             2         0
...
Regression: Associated with each label is a set of "features": other variables that you will try to use to predict the label. In the table above, Skill, pknow, time, and totalactions are the features, and numhints is the label.
Regression: The basic idea of regression is to determine which features, in which combination, can predict the label's value.
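As a rough illustration of fitting a model to training labels, the sketch below runs ordinary least squares with a single feature (pknow) on the seven rows visible in the table. Using only one feature is a simplification made here for brevity, not something the slides prescribe, and the names `ROWS` and `fit_simple_regression` are invented for this example.

```python
# Training rows from the table: (skill, pknow, time, totalactions, numhints).
ROWS = [
    ("ENTERINGGIVEN", 0.704, 9, 1, 0),
    ("ENTERINGGIVEN", 0.502, 10, 2, 0),
    ("USEDIFFNUM", 0.049, 6, 1, 3),
    ("ENTERINGGIVEN", 0.967, 7, 3, 0),
    ("REMOVECOEFF", 0.792, 16, 1, 1),
    ("REMOVECOEFF", 0.792, 13, 2, 0),
    ("USEDIFFNUM", 0.073, 5, 2, 0),
]


def fit_simple_regression(xs, ys):
    """Ordinary least squares with one feature; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


pknow = [row[1] for row in ROWS]     # one feature (a simplification)
numhints = [row[4] for row in ROWS]  # the training labels
intercept, slope = fit_simple_regression(pknow, numhints)
print(f"numhints ~ {intercept:.3f} + {slope:.3f} * pknow")
```

With so few rows the fitted numbers mean little; the point is only the workflow of learning weights from rows whose labels are already known.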
Linear Regression: The most classic form of regression is linear regression.
Numhints = 0.12*Pknow + 0.932*Time - 0.11*Totalactions
Skill         pknow  time  totalactions  numhints
COMPUTESLOPE  0.544     9             1         ?
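Applying the model to the COMPUTESLOPE row is plain arithmetic, sketched below. The sketch assumes the Totalactions term is subtracted (the sign appears to have been dropped from the transcript), and the function name `predict_numhints` is invented for this example.

```python
def predict_numhints(pknow, time, totalactions):
    """Linear model from the slide.

    Assumption: the totalactions term is subtracted; the operator is
    missing from the transcript.
    """
    return 0.12 * pknow + 0.932 * time - 0.11 * totalactions


# The COMPUTESLOPE row: pknow 0.544, time 9, totalactions 1.
prediction = predict_numhints(pknow=0.544, time=9, totalactions=1)
print(round(prediction, 2))  # 0.06528 + 8.388 - 0.11 = 8.34328, prints 8.34
```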
Quiz: What is the value of numhints?
Numhints = 0.12*Pknow + 0.932*Time - 0.11*Totalactions
Skill         pknow  time  totalactions  numhints
COMPUTESLOPE  0.322    15             4         ?
A) 8.34  B) 13.58  C) 3.67  D) 9.21  E) FNORD
Quiz: Which of the variables has the largest impact on numhints? (Assume they are scaled the same.)
Numhints = 0.12*Pknow + 0.932*Time - 0.11*Totalactions
A) Pknow  B) Time  C) Totalactions  D) Numhints  E) They are equal
However: These variables are unlikely to be scaled the same! If Pknow is a probability (from 0 to 1; we'll discuss this variable later in the class) and time is a number of seconds to respond (from 0 to infinity), then you can't interpret the weights in a straightforward fashion. You need to transform them first.
Transform: When you make a new variable by applying some mathematical function to the previous variable, for example $X_t = X^2$.
Transform: Unitization. Increases interpretability of relative strength of features. Reduces interpretability of individual features.
$X_t = \frac{X - M(X)}{SD(X)}$, where $M(X)$ is the mean of $X$ and $SD(X)$ is its standard deviation.
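A minimal sketch of unitization, assuming M(X) is the arithmetic mean and SD(X) is the sample standard deviation (the slide does not say whether the sample or population form is meant). The function name `unitize` is invented here; the `times` values are the time column of the training table.

```python
from statistics import mean, stdev


def unitize(values):
    """Rescale values to mean 0 and standard deviation 1.

    Assumption: sample standard deviation (statistics.stdev); swap in
    statistics.pstdev if the population form is intended.
    """
    m = mean(values)
    sd = stdev(values)
    return [(v - m) / sd for v in values]


times = [9, 10, 6, 7, 16, 13, 5]  # time column from the training table
unit_times = unitize(times)
print([round(t, 2) for t in unit_times])
```

After unitization a value such as 1.5 means "1.5 standard deviations above the mean" instead of a number of seconds. That is why weights become comparable across features while individual features become harder to read.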
Linear Regression: Linear regression only fits linear functions, except when you apply transforms to the input variables, which most statistics and data mining packages can do for you.
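To illustrate how a transform lets a linear fit capture a non-linear shape, the sketch below fits a line to synthetic data generated from y = 3 + 2x^2, first against x and then against the transformed input x^2. The data and the helper `fit_line` are made up for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one input; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic (made-up) data that follows a quadratic exactly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [3 + 2 * x ** 2 for x in xs]

# Fitting against raw x finds no linear trend in this symmetric data.
intercept_raw, slope_raw = fit_line(xs, ys)

# Fitting against the transformed input x^2 recovers the generating model.
xt = [x ** 2 for x in xs]
intercept_t, slope_t = fit_line(xt, ys)

print(f"raw x:       y ~ {intercept_raw:.2f} + {slope_raw:.2f} * x")
print(f"transformed: y ~ {intercept_t:.2f} + {slope_t:.2f} * x^2")
```

The model stays linear in its weights; only the input has been reshaped.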
[Plots of example transforms, each drawn for X from -15 to 15: Ln(X), Sqrt(X), X^2, X^3, 1/X, Sin(X)]
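The six transforms plotted above can be written down in a few lines. The sketch below is one possible arrangement (the `TRANSFORMS` table and `apply_transform` helper are invented names). It returns `None` where a transform is undefined, such as Ln(X) and Sqrt(X) for negative X or 1/X at zero.

```python
import math

# Transform name -> function; the names are chosen for this example.
TRANSFORMS = {
    "ln": math.log,
    "sqrt": math.sqrt,
    "square": lambda x: x ** 2,
    "cube": lambda x: x ** 3,
    "reciprocal": lambda x: 1 / x,
    "sin": math.sin,
}


def apply_transform(name, x):
    """Return the transformed value, or None where the transform is undefined."""
    try:
        return TRANSFORMS[name](x)
    except (ValueError, ZeroDivisionError):
        return None


for name in TRANSFORMS:
    print(name, [apply_transform(name, x) for x in (-5, 0, 5)])
```

Checking the domain matters in practice: a feature like response time can be zero, so a log or reciprocal transform needs a decision about how to handle that row.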
Linear Regression: Surprisingly flexible. But even without that, it is blazing fast; it is often more accurate than more complex models, particularly once you cross-validate (Caruana & Niculescu-Mizil, 2006); and it is feasible to understand your model (with the caveat that the second feature in your model is in the context of the first feature, and so on).
Example of Caveat: Let's graph the relationship between number of graduate students and number of papers per year.
Data: [Plot of papers per year (y-axis, 0 to 16) against number of graduate students (x-axis, 0 to 16)]
Data: [Same plot, annotated: "Too much time spent filling out personnel action forms?"]
Model: Number of papers = 4 + 2 * (# of grad students) - 0.1 * (# of grad students)^2
But does that actually mean that (# of grad students)^2 is associated with less publication? No!
Example of Caveat: [Plot of papers per year (y-axis, 0 to 16) against number of graduate students (x-axis, 0 to 16)] (# of grad students)^2 is actually positively correlated with publications! r = 0.46
Example of Caveat: [Same plot] The relationship is only in the negative direction when the number of graduate students is already in the model.
Example of Caveat: So be careful when interpreting linear regression models (or almost any other type of model).
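The caveat can be checked numerically. The underlying data are not given in the transcript, so the sketch below assumes one point per whole number of graduate students from 0 to 16, lying exactly on the model curve. Under that assumption the squared term has a negative weight in the model yet correlates positively with papers per year. The helper `pearson_r` is written out here for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Assumed data: 0..16 graduate students, papers taken from the model itself.
grad_students = list(range(17))
papers = [4 + 2 * g - 0.1 * g ** 2 for g in grad_students]
grad_students_squared = [g ** 2 for g in grad_students]

r = pearson_r(grad_students_squared, papers)
print(f"r = {r:.2f}")  # prints r = 0.46 for these assumed points
```

The -0.1 weight describes what the squared term contributes once the linear term is already in the model. It says nothing about the term's relationship with the label taken alone.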
Regression Trees (non-linear; RepTree):
If X > 3, Y = 2
Else if X < -7, Y = 4
Else Y = 3
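The example tree can be read as a chain of threshold tests that each end in a constant prediction. The sketch below transcribes the slide's example as a function (the name `regression_tree` is invented here); it is not output from RepTree.

```python
def regression_tree(x):
    """Hand-written version of the slide's example tree: constant leaves."""
    if x > 3:
        return 2
    elif x < -7:
        return 4
    else:
        return 3


for x in (-10, 0, 5):
    print(x, regression_tree(x))
```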
Linear Regression Trees (linear; M5):
If X > 3, Y = 2A + 3B
Else if X < -7, Y = 2A - 3B
Else Y = 2A + 0.5B + C
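In a linear regression tree, each leaf holds a small linear model in place of a constant. The sketch below transcribes the slide's example, assuming the second leaf is 2A - 3B (the operator is missing from the transcript). The function name `linear_regression_tree` is invented here and this is not output from M5.

```python
def linear_regression_tree(x, a, b, c):
    """Hand-written version of the slide's example: each leaf is a linear model.

    Assumption: the middle leaf subtracts 3B; the sign is missing from
    the transcript.
    """
    if x > 3:
        return 2 * a + 3 * b
    elif x < -7:
        return 2 * a - 3 * b
    else:
        return 2 * a + 0.5 * b + c


for x in (-10, 0, 5):
    print(x, linear_regression_tree(x, a=1, b=1, c=1))
```

X selects which linear model applies, and A, B, and C then determine the prediction within that region.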
Linear Regression Tree: [Plot of papers per year (y-axis, 0 to 16) against number of graduate students (x-axis, 0 to 16)]
Later Lectures: Other regressors; goodness metrics for comparing regressors; validating regressors.
Next Lecture: Classifiers, another type of prediction model.