Unlocking the Art of Feature Engineering in Data Mining Studio


Dive into the world of feature engineering at Feature Engineering Studio, where you'll learn the intricate process of distilling and engineering features for data mining. Discover why feature engineering is a crucial yet underappreciated aspect of developing prediction models, and explore the tools and methods that will be covered in this design studio-style course. Get ready to reshape educational data efficiently and effectively using tools like Excel, Google Refine, and RapidMiner. Join us in mastering the art of feature engineering and enhancing your predictive modeling skills!

Uploaded by gildard on Mar 20, 2025


Presentation Transcript

Feature Engineering Studio January 21, 2015

Welcome to Feature Engineering Studio
- Design studio-style course teaching how to distill and engineer features for data mining

What We'll Cover
- The process of feature engineering and distillation:
  - brainstorming features
  - deciding what features to create
  - criteria for selecting features
  - actually creating the features
  - studying the impact of features on model goodness

Why?
- Feature engineering is the most important, and least well-studied, part of the process of developing prediction models
- It is an art; it is human-driven design
- It involves lore rather than well-known and validated principles
- It is hard! (But fun, and important)

Why?
- It's well known in data mining (and statistics, for that matter) that your model will never be any good if your features (predictors) aren't very good

The Big Idea
- How can we take the voluminous, ill-formed, and yet under-specified data that we now have in education
- and shape it into a reasonable set of variables
- in an efficient, effective, and predictive way?

Tools We'll Use
- Excel
- Google Refine
- RapidMiner
- Other relevant tools (TBD/your choice)

Course times
- Monday 11am-12:40pm
- Wednesday 11am-12:40pm
- Not every week; please see online schedule

Course Prerequisite
- Core Methods in Educational Data Mining
- Or instructor approval
  - I will approve anyone who has at least a little bit of background building prediction models or similar statistical models
  - Talk to me after class, during my office hours, or by appointment

That said
- If you haven't had experience building prediction models in RapidMiner or a similar tool, then you'll need to learn
- We will have a few special lab sessions to help you catch up if you don't have experience with this paradigm or tools
- You can definitely catch up

Who here?
- Took or audited my Core Methods course?
- Has built a prediction model using a classification algorithm and cross-validation?
- Has built a regression model in a stats package using stepwise regression?
- Has run a regression in a stats package?
- Has built any kind of mathematical model?

How this class works
- Lots of assignments (13)
  - They can't be late, because we will discuss them in class
  - 3 of 12 regular assignments can be missed without penalty, but not the final presentation (#13)
  - Important note: You cannot do extra assignments and take the best grades. Only the first 9 assignments turned in will be graded.
- Not many required readings
- Essential to participate in critique and class discussions

Who here?
- Has had a design studio-style course before?

This is not
- A lecture class
- A reading discussion seminar

This is
- A class where you will be working on a project of your own choosing the whole semester
- A class where you'll get, and give, a lot of constructive criticism

The semester project
- You will build a prediction model
- If you have your own data set and research question, perfect!
- If you don't have your own data set and research question, no worries! I will help you find one!

Two types of classes
- Regular sessions
  - Discuss readings, work on projects
- Lab sessions
  - Extra practice with tools
  - Lecture on concepts beyond regular class topics
  - Including core content from HUDK4050 needed for this class
  - Not a substitute for HUDK4050; we'll be covering about 5% of HUDK4050 in these sessions

Assignments
- Let's look at the syllabus

Readings
- Will be made available very soon

Any questions?

Upcoming Classes
- 1/26 Lab session on data set finding
  - Come to this if you don't have a data set in mind
- 2/2 Problem proposal (Asgn. 1 due)
- 2/4 Data cleaning (Asgn. 2 due)
- 2/16 Lab session on RapidMiner
  - Come to this if you've never built a classifier or regressor in RapidMiner (or a similar tool)
  - Statistical significance tests using linear regression don't count
- 2/23 Feature distillation in Excel (Asgn. 3 due)

Assignment One: Problem Proposal
- Due Monday, February 2
- Be ready to talk for 5 minutes on:
  - A data set
    - Give where it came from and how big it is
    - You need to already have this data set, or be able to acquire it in the next two weeks
  - A prediction model you will build in this data set
    - What variable will you predict?
    - What kind of variables will you use to predict it?
  - Why is this worth doing?

Example (Pardos et al., 2014)
- Data set
  - ASSISTments system, formative assessment and learning software for math used by 60k students a year (Razzaq et al., 2007)
  - 810,000 data points from 229 students studied
  - Student actions in the software have been overlaid with synchronized field observations of student affect (boredom, frustration, etc.)
  - 3075 field observations
  - Each field observation connects to 20 seconds of log file actions

Example (Pardos et al., 2014)
- We will predict whether a student is bored at a specific time
  - So that we can replicate the human judgments without needing a field observer
- We will predict this from what was going on in the log files at the time the field observation was made
  - We know every student action's correctness, timing, relevant skill, and probability they knew the skill

Example (Pardos et al., 2014)
- This is worth doing because boredom is known to predict student learning (Craig et al., 2004; Rodrigo et al., 2009; Pekrun et al., 2010)
- And building a detector will help us study boredom more thoroughly
- As well as enabling us to intervene on boredom in real time
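
The example above pairs each field observation with 20 seconds of log file actions and then predicts boredom from those actions. The following is a minimal Python sketch of what that feature distillation step might look like; it is not the method used by Pardos et al. The record layouts (`Action`, `Observation`), the field names, the use of seconds as the time unit, and the choice to take the 20 seconds ending at the observation time are all assumptions made for illustration. The slides only say that correctness, timing, and the probability the student knew the skill are available for every action.

```python
from dataclasses import dataclass
from statistics import mean

WINDOW_SECONDS = 20  # clip length stated in the slides


@dataclass
class Action:
    """One logged student action (hypothetical layout)."""
    student_id: str
    timestamp: float  # seconds since session start (assumed unit)
    correct: bool
    duration: float   # seconds spent on the action
    p_known: float    # estimated probability the student knew the skill


@dataclass
class Observation:
    """One field observation of affect (hypothetical layout)."""
    student_id: str
    timestamp: float
    bored: bool


def clip_features(obs, actions):
    """Summarize the actions in the window ending at the observation.

    Returns a dict of features plus the label, or None when the student
    has no logged actions inside the window.
    """
    start = obs.timestamp - WINDOW_SECONDS
    clip = [
        a for a in actions
        if a.student_id == obs.student_id and start < a.timestamp <= obs.timestamp
    ]
    if not clip:
        return None
    return {
        "n_actions": len(clip),
        "frac_correct": mean(1.0 if a.correct else 0.0 for a in clip),
        "mean_duration": mean(a.duration for a in clip),
        "max_duration": max(a.duration for a in clip),
        "mean_p_known": mean(a.p_known for a in clip),
        "bored": obs.bored,  # ground-truth label from the field observer
    }


# Made-up records, only to show the shape of the output.
actions = [
    Action("s1", 5.0, True, 3.0, 0.75),
    Action("s1", 12.0, False, 9.0, 0.25),
    Action("s1", 40.0, True, 2.0, 0.9),   # after the observation, excluded
    Action("s2", 10.0, True, 4.0, 0.7),   # different student, excluded
]
observation = Observation("s1", 20.0, True)
print(clip_features(observation, actions))
```

Each observation becomes one row of summary variables and a label, which is the form a tool such as RapidMiner expects for building a classifier. Which summaries to compute (counts, means, maxima, or something else) is exactly the design question the course is about; the four used here are placeholders.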

Important Considerations
- Is the problem genuinely important? (usable or publishable)
- Is there a good measure of ground truth? (the variable you want to predict)
- Do we have rich enough data to distill meaningful features?
- Is there enough data to be able to take advantage of data mining?

You don't need to be able to answer these questions in a week
- Think about them
- Think about your problem
- Email me or come to my office hours (or set up an appointment)
- Bring it to class
- We'll discuss it in class
- No idea is perfect right from the start!

Be ready to answer questions
- Be ready to ask questions too

No data ready at hand?
- Come to next Monday's session; we will find you data!

Any questions or concerns?
