Feature Engineering in Machine Learning Class with Kelsey Emnett

Explore the intricacies of feature engineering in machine learning with Kelsey Emnett's Class 11. Dive into incorporating business knowledge, handling class imbalance, and the impact of infrastructure design. Real-life examples and challenging modeling scenarios are covered. Join the session to enhance your understanding of this key aspect of machine learning.

  • Machine Learning
  • Feature Engineering
  • Kelsey Emnett
  • Class 11
  • Data Science




Presentation Transcript


  1. Machine Learning for Beginners: Class 11 By Kelsey Emnett

  2. Links: Files are located here: bit.ly/machine-learning-introduction. Class videos will be posted by the Sunday after class on my website www.kelseyemnett.com and my YouTube channel https://www.youtube.com/channel/UCR7ejcDyKykQ3wE-KS08Smg.

  3. Topics for Today: feature engineering; incorporating business knowledge into modeling; handling class imbalance; infrastructure and design matter.

  4. Overview: Covering topics that are difficult to pick up from studying or data science tutorials. We will describe real examples and how we approached difficult modeling challenges, focusing on the more subjective areas of machine learning.

  5. Feature Engineering

  6. Feature Engineering: What is feature engineering? Why do we do it? How do we evaluate multiple features? Example: combine the Asian or Pacific Islander and Native Hawaiian or Other Pacific Islander groups due to small sample size.

  7. Feature Engineering Methods: Bucketizing: bucketize continuous features with the KBinsDiscretizer function in Scikit-Learn; can help with skewed distributions. Recoding categorical variables: combine categories to deal with class imbalance for categories with small sample sizes. Feature crossing: concatenate features to create a single feature with combinations of two or more features; captures the interaction between features and helps models learn relationships faster. Also handle missing values that are actually not applicable.
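Of the methods above, feature crossing is the easiest to show in a few lines. A minimal sketch, assuming hypothetical column names ("region", "tier") that are not from the class:

```python
# Feature crossing: concatenate two categorical features into one
# combined feature so a model can learn their interaction directly.
# Column names here are illustrative, not from the presentation.
def cross_features(rows, col_a, col_b, sep="_x_"):
    """Return the crossed value of col_a and col_b for each row."""
    return [f"{row[col_a]}{sep}{row[col_b]}" for row in rows]

customers = [
    {"region": "west", "tier": "gold"},
    {"region": "east", "tier": "silver"},
]
crossed = cross_features(customers, "region", "tier")
# crossed == ["west_x_gold", "east_x_silver"]
```

A model can then treat each region/tier combination as its own category instead of having to discover the interaction on its own.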

  8. Real Life Example: Bucketing Features: Rewards programs often have super-users that are extreme outliers. Example: the certificates-used variable had a long tail, so we attempted a bucketed variable with the tail collapsed into a single bucket.
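The tail-in-one-bucket idea can be sketched with `numpy.digitize`; the bin edges and counts below are made up for illustration, not the ones from the class:

```python
import numpy as np

# Bucketize a long-tailed count: regular buckets for common values,
# with the entire tail (super-users) collapsed into one top bucket.
certificates_used = np.array([0, 1, 2, 3, 5, 8, 40, 250])
edges = [1, 3, 6, 10]  # everything >= 10 (the long tail) shares a bucket
buckets = np.digitize(certificates_used, edges)
# buckets: 0 -> [<1], 1 -> [1,3), 2 -> [3,6), 3 -> [6,10), 4 -> [>=10]
```

Scikit-Learn's `KBinsDiscretizer` (mentioned on the previous slide) does the same kind of thing with data-driven edges, e.g. quantile-based bins.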

  9. Real Life Example: NA Categories: Dealing with not-applicable categories. Example: "received a product by third visit" is really three categories: received the product by the third visit, did not receive a product by the third visit, and never had a third visit. Split into -1, 0, and 1 categories. Domain knowledge and data exploration are important!
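The three-way split described above can be sketched as a small encoder; the flag names are hypothetical:

```python
# Encode "received product by third visit" as three categories:
# -1 = never had a third visit (the question is not applicable),
#  0 = had a third visit but received no product,
#  1 = received the product by the third visit.
def encode_third_visit(had_third_visit, received_product):
    if not had_third_visit:
        return -1
    return 1 if received_product else 0

assert encode_third_visit(False, False) == -1
assert encode_third_visit(True, False) == 0
assert encode_third_visit(True, True) == 1
```

Without the -1 category, "never had a third visit" would either be dropped as missing or silently merged with "did not receive a product", losing real information.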

  10. Dealing with Class Imbalance

  11. Class Imbalance: Occurs when predicting categories with an uneven distribution. Example: predicting the elites segment for a retail client, where only 7% of the data were elites. It is difficult for a model to train when there are very few examples of a class; the model will have a high rate of false negatives. This is why accuracy is not a good model evaluation metric here. Image Source: https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/
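The accuracy trap is easy to see with the 7% figure from the slide (the total count below is a made-up illustration):

```python
# With 7% positives, a model that always predicts the majority class
# scores 93% accuracy while catching zero elites: every elite becomes
# a false negative.
n_total, n_elite = 1000, 70            # 7% minority class (toy numbers)
correct = n_total - n_elite            # "predict majority" is right 930 times
accuracy = correct / n_total
# accuracy == 0.93, yet recall on the elite class is 0.0
```

Metrics like precision, recall, or F1 on the minority class expose this failure; plain accuracy hides it.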

  12. Methods for Handling Class Imbalance: Up-sampling, down-sampling, SMOTE, moving the classification threshold, class weights. Up-sampling: randomly duplicating examples of the minority class. Down-sampling: randomly removing examples of the majority class. Image Source: https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/
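Up- and down-sampling as defined above can be sketched with the standard library alone; the 93/7 split mirrors the elites example:

```python
import random

random.seed(0)
majority = [0] * 93   # e.g. non-elites
minority = [1] * 7    # e.g. elites

# Up-sampling: randomly duplicate minority examples until classes match.
up_sampled = minority + random.choices(minority, k=len(majority) - len(minority))

# Down-sampling: randomly keep only as many majority examples as minority.
down_sampled = random.sample(majority, k=len(minority))

balanced_up = majority + up_sampled      # 93 of each class
balanced_down = down_sampled + minority  # 7 of each class
```

Up-sampling keeps all the data but repeats minority rows (risking overfitting to them); down-sampling balances cheaply but throws information away.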

  13. Methods for Handling Class Imbalance: SMOTE: 1. Choose a minority-class point as the input vector. 2. Find its k-nearest neighbors. 3. Choose one of these neighbors and place a synthetic point anywhere on the line joining the point under consideration and its chosen neighbor. 4. Repeat the steps until the data is balanced. Citation: https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/
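The four steps above can be sketched directly in NumPy; this is a minimal illustration of one synthetic point, not the imbalanced-learn library's production implementation:

```python
import numpy as np

def smote_point(X_minority, k=3, rng=None):
    """Generate one synthetic minority point via the SMOTE steps."""
    rng = rng or np.random.default_rng(0)
    i = rng.integers(len(X_minority))           # 1. pick a minority point
    x = X_minority[i]
    d = np.linalg.norm(X_minority - x, axis=1)
    neighbors = np.argsort(d)[1:k + 1]          # 2. its k nearest neighbors
    j = rng.choice(neighbors)                   # 3. pick one neighbor and a
    lam = rng.random()                          #    spot on the line to it
    return x + lam * (X_minority[j] - x)

X_min = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.5]])
synthetic = smote_point(X_min)
# Step 4: in practice, repeat until the classes are balanced.
```

Because the synthetic point sits on a segment between two real minority points, it always stays inside the region the minority class already occupies.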

  14. Handling Class Imbalance: Moving the classification threshold: move the threshold to trade off false positives against false negatives. Demonstration: https://plotly.com/python/roc-and-pr-curves/
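Threshold moving can be sketched in a few lines; the predicted probabilities below are made up for illustration:

```python
import numpy as np

# With imbalanced classes, the default 0.5 cut-off on predicted
# probabilities often misses the rare class; lowering the threshold
# trades false negatives for false positives.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
p_pred = np.array([0.1, 0.2, 0.15, 0.3, 0.35, 0.4, 0.45, 0.35, 0.6, 0.4])

def recall_at(threshold):
    """Fraction of true positives caught at a given cut-off."""
    y_hat = (p_pred >= threshold).astype(int)
    return (y_hat[y_true == 1] == 1).mean()

# recall_at(0.5) catches 1 of 3 positives;
# recall_at(0.3) catches all 3, at the cost of more false positives.
```

ROC and precision-recall curves (as in the linked demonstration) show this trade-off across every threshold at once, so you can pick the cut-off that matches the business cost of each error type.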

  15. Handling Class Imbalance: Class weights: for classification models, you can increase or decrease class weights, which increases the value of a class during training. Example: with a class imbalance that has fewer 1 values, you can increase the value of correct guesses on 1 values; setting a weight of 1.2 for class 1 increases the value of correct 1 guesses by 20%. This helps the model become more effective on the less common class.
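One common way libraries realize class weights is to weight each example's training loss by its class's weight; a minimal sketch with the 1.2 weight from the slide:

```python
import numpy as np

# Class weights as per-example loss weights: examples of the rare
# class 1 get weight 1.2, so errors on them cost 20% more in training.
def weighted_log_loss(y_true, p_pred, class_weight={0: 1.0, 1: 1.2}):
    w = np.array([class_weight[y] for y in y_true])
    p = np.clip(p_pred, 1e-12, 1 - 1e-12)
    losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return np.average(losses, weights=w)

y = np.array([0, 0, 1])
p = np.array([0.9, 0.8, 0.3])   # made-up predictions
loss = weighted_log_loss(y, p)
```

Scikit-Learn exposes the same idea through the `class_weight` parameter on many estimators, so you rarely need to write the weighted loss yourself.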

  16. Infrastructure Best Practices

  17. Structuring Databases: Well-structured databases enable you to concentrate on analytics and innovate rapidly; they increase ease of use and improve data quality. Citation: https://databricks.com/blog/2019/08/14/productionizing-machine-learning-with-delta-lake.html

  18. Model Pipelining: Don't create a monolith! Split pipeline components based on their purpose. Benefits: reduce the impact of errors, prevent duplication of work, better enable collaboration, and simplify the debugging process. Example stages: ETL, final data formatting, Model 1, Model 2, dashboard.
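The split-by-purpose idea can be sketched as small single-purpose functions (all names and data here are hypothetical) instead of one monolithic script; each stage can then be run, tested, and debugged on its own:

```python
# Each pipeline stage does one job and hands a clean result to the next.
def run_etl(raw_rows):
    """ETL stage: drop rows with missing values."""
    return [r for r in raw_rows if r.get("amount") is not None]

def format_data(rows):
    """Final data formatting: normalize field types."""
    return [{"amount": float(r["amount"])} for r in rows]

def run_model(rows):
    """Stand-in 'model' stage: here, just an average."""
    amounts = [r["amount"] for r in rows]
    return sum(amounts) / len(amounts)

raw = [{"amount": "10"}, {"amount": None}, {"amount": "20"}]
score = run_model(format_data(run_etl(raw)))
# score == 15.0
```

If `format_data` breaks, the error surfaces at that stage with its own inputs and outputs, instead of somewhere deep inside a single tangled script.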

  19. Conclusions

  20. Data Science is Subjective: Common question: what is the cutoff? When does a metric become good or bad? There are no right answers! Keys to success: domain knowledge, understanding of the business goal, understanding what the models do, and exhaustive testing and model evaluation.
