Machine Learning Techniques for Categorical Data Imputation

This presentation covers the use of machine learning methods to impute categorical data in statistical surveys. It reviews non-response treatments and imputation procedures (algorithmic and model-based), examines the statistical community's recommendations for categorical data, and reports the results of comparing two machine learning classifiers with multiple imputation by logistic regression.

  • Machine Learning
  • Categorical Data
  • Imputation Methods
  • Statistical Surveys
  • Data Modeling

Uploaded on Mar 10, 2025 | 0 Views



Presentation Transcript


  1. Use of Machine Learning Methods to Impute Categorical Data
     Pilar Rey del Castillo*, EUROSTAT, Unit B1: Quality, Research and Methodology
     UNECE Conference of European Statisticians, Work Session on Statistical Data Editing, 24-26 September 2012

  2. Use of Machine Learning Methods to Impute Categorical Data
     • Problem: non-response in statistical surveys and missing information in machine learning are handled with different approaches and different evaluation criteria
     • Case of categorical variables: the practical recommendations from the statistical approach just reuse procedures designed for numeric variables
     • Aim: show that the commitment to the almost exclusive use of probabilistic data models prevents statisticians from using the most convenient technologies

  3. Outline of the presentation
     1. Review: non-response treatments, imputation procedures, evaluation criteria
     2. Recommendations for categorical data imputation from the statistical community: why these are not appropriate
     3. Results of comparisons with two machine learning methods
     4. Final remarks

  4. Non-response treatments
     • Deletion procedures: using only the units with complete data for further analysis
     • Tolerance procedures: internal, neither removing incomplete records nor completing them
     • Imputation procedures: replacing each missing value by an estimate
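
To make the contrast between deletion and imputation on slide 4 concrete, here is a minimal sketch in plain Python. The records, the field names (`region`, `vote`) and the helper names are invented for illustration and are not taken from the presentation; the imputation estimate used here is simply the most frequent observed category.

```python
from collections import Counter

# Toy records; None marks a missing answer. All values are made up.
records = [
    {"region": "north", "vote": "A"},
    {"region": "north", "vote": None},
    {"region": "south", "vote": "B"},
    {"region": "south", "vote": "B"},
    {"region": "north", "vote": "B"},
]

def listwise_delete(rows):
    """Deletion procedure: keep only the units with complete data."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mode(rows, field):
    """Imputation procedure: replace each missing value by an estimate,
    here the most frequent observed category of `field`."""
    observed = [r[field] for r in rows if r[field] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [dict(r, **{field: mode}) if r[field] is None else dict(r)
            for r in rows]

complete_cases = listwise_delete(records)   # drops the incomplete unit
completed = impute_mode(records, "vote")    # keeps every unit
```

Deletion shrinks the file, while imputation keeps every unit at the cost of inserting estimated values; the input list is left unchanged in both cases.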

  5. Imputation procedures
     • Algorithmic methods: use an algorithm to produce the imputations (cold-deck and hot-deck, nearest-neighbour, mean, machine learning classification and prediction techniques)
     • Model-based methods: the predictive distributions come from a formal statistical model; state of the art: multiple imputation (MI)
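
As an illustration of one algorithmic method named on slide 5, the following sketch implements a random hot-deck: each missing value is copied from a randomly chosen donor in the same imputation class. The function name, the field names and the fallback to all donors when a class has none are assumptions made for this example, not details from the presentation.

```python
import random

def hot_deck(rows, target, class_field, seed=0):
    """Random hot-deck: fill a missing `target` with the value of a donor
    drawn at random from records sharing the same `class_field` value."""
    rng = random.Random(seed)
    donors = {}
    for r in rows:
        if r[target] is not None:
            donors.setdefault(r[class_field], []).append(r[target])
    all_donors = [v for values in donors.values() for v in values]

    completed = []
    for r in rows:
        r = dict(r)  # work on a copy so the input stays untouched
        if r[target] is None:
            # Assumed fallback: use every donor if the class has none.
            pool = donors.get(r[class_field]) or all_donors
            r[target] = rng.choice(pool)
        completed.append(r)
    return completed

# Made-up survey records; None marks item non-response.
rows = [
    {"region": "north", "vote": "A"},
    {"region": "north", "vote": "A"},
    {"region": "north", "vote": None},
    {"region": "south", "vote": "B"},
    {"region": "south", "vote": "C"},
    {"region": "south", "vote": None},
]
completed = hot_deck(rows, "vote", "region")
```

No statistical model is specified anywhere in this procedure, which is what separates it from the model-based methods on the same slide.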

  6. Criteria for evaluating the imputation results
     • Statistical surveys: the goal is valid and efficient inference, with the missing-data treatment considered part of the overall procedure. "Judging the quality of missing data procedures by their ability to recreate the individual missing values (according to hit-rate, mean square error, etc.) does not lead to choosing procedures that result in valid inference, which is our objective" (Rubin, 1996)
     • Machine learning: general artificial intelligence framework (empirical results obtained by simulating missing data and measuring the closeness between real and imputed values)
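
The machine learning evaluation described on slide 6 (simulate missing data, impute, measure closeness to the real values) can be sketched as follows. This is a toy setup under stated assumptions: the data, the 30% masking fraction and the mode imputer are invented for the example, and the hit rate shown is exactly the kind of criterion that the Rubin quotation warns against for survey inference.

```python
import random
from collections import Counter

def mask_values(values, fraction, seed=0):
    """Simulate missing data by hiding a random fraction of known values."""
    rng = random.Random(seed)
    n_mask = int(len(values) * fraction)
    hidden = set(rng.sample(range(len(values)), n_mask))
    masked = [None if i in hidden else v for i, v in enumerate(values)]
    return masked, sorted(hidden)

def hit_rate(true_values, imputed_values, hidden_idx):
    """Share of hidden positions whose imputed value equals the real one."""
    hits = sum(true_values[i] == imputed_values[i] for i in hidden_idx)
    return hits / len(hidden_idx)

truth = ["A"] * 6 + ["B"] * 4             # made-up categorical variable
masked, hidden = mask_values(truth, 0.3)  # hide 3 of the 10 values

# Stand-in imputer: most frequent category among the observed values.
mode = Counter(v for v in masked if v is not None).most_common(1)[0][0]
imputed = [mode if v is None else v for v in masked]

rate = hit_rate(truth, imputed, hidden)
```

Because the true values are known before masking, the score needs no model assumptions; it does not, however, say whether estimates computed from the completed file are valid.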

  7. Categorical data imputation in statistical surveys
     • State of the art: MI or other model-based methods
     • Log-linear model: not always possible
     • Logistic regression models: sometimes problems at the estimation step
     • Binary case: Rubin & Schenker (1986), Schafer (1997): approximate by using a Gaussian distribution
     • Non-binary case: Yucel & Zaslavsky (2003), Van Ginkel et al. (2007): round the values imputed under a multivariate normal distribution
     • Criticisms from the practical perspective: Horton (2003), Ake (2005), Allison (2006), Demirtas (2008)
     • Contradiction: the theoretical framework focuses on model adequacy, yet the practical recommendations rely on models that are clearly not adequate
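
The criticism of rounding on slide 7 can be illustrated with a deliberately simplified, covariate-free calculation. Assume a binary variable with true proportion p is imputed from a Gaussian with the same mean and variance, N(p, p(1-p)), and each draw is rounded to 0 or 1 at the 0.5 cut-off. The function name and the set-up are assumptions for this sketch; the cited papers work with full multivariate normal models.

```python
from math import sqrt
from statistics import NormalDist

def share_of_ones_after_rounding(p):
    """Expected share of 1s when a binary variable with true proportion p
    (0 < p < 1) is imputed from N(p, p(1-p)) and rounded at 0.5."""
    gaussian = NormalDist(mu=p, sigma=sqrt(p * (1 - p)))
    # A draw is rounded to 1 exactly when it exceeds the 0.5 cut-off.
    return 1 - gaussian.cdf(0.5)

for p in (0.5, 0.2, 0.02):
    share = share_of_ones_after_rounding(p)
    print(f"true p = {p:.2f}, share of 1s after rounding = {share:.4f}")
```

Under these assumptions the proportion is reproduced only at p = 0.5; it is overstated at p = 0.2 and almost wiped out at p = 0.02, which is the kind of bias the practical criticisms point to.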

  8. Problem of categorical data imputation to be solved
     • Survey microdata file: opinion poll (no. 2750 in the CIS catalogue)
     • Quantitative variables (8): ideological self-location; rating of three specific political figures; likelihood to vote; likelihood to vote for three specific political parties
     • Ordered categorical variables (2): government and opposition party ratings (converted to quantitative)
     • Categorical variables with non-ordered categories (7): voting intention; voting memory; the autonomous community; the political party the respondent would prefer to see win
     • Voting intention to be imputed: 11 categories (biggest political parties, "blank vote", "abstention", "others")
     • 13,280 interviews with no missing values

  9. Imputation methods to be compared
     • MI logistic regression
     • Classifiers (matching each class with one of the voting intention categories):
       • Fuzzy min-max neural network classifier, recently extended to deal with mixed numeric and categorical data as inputs (Rey del Castillo & Cardeñosa, 2012)
       • Bayesian network classifier: not a Naïve Bayes classifier but a more complex architecture learnt with a score + search paradigm
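
The idea on slide 9 of treating imputation as classification (train on complete records, predict the missing category) can be sketched with a much simpler stand-in. The code below is a categorical Naïve Bayes classifier with Laplace smoothing, which is explicitly not one of the two classifiers used in the presentation; the field names and toy records are also invented.

```python
import math
from collections import Counter, defaultdict

def fit_naive_bayes(rows, target, features):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(r[target] for r in rows)
    value_counts = defaultdict(Counter)  # (feature, class) -> value counts
    levels = defaultdict(set)            # feature -> observed values
    for r in rows:
        for f in features:
            value_counts[(f, r[target])][r[f]] += 1
            levels[f].add(r[f])
    return class_counts, value_counts, levels

def predict(model, row, features):
    """Return the class with the highest smoothed log-posterior score."""
    class_counts, value_counts, levels = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)
        for f in features:
            # Laplace smoothing keeps unseen combinations from scoring zero.
            count = value_counts[(f, c)][row[f]]
            score += math.log((count + 1) / (n_c + len(levels[f])))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

features = ["past_vote", "preferred"]
complete = [  # made-up records with the target observed
    {"past_vote": "A", "preferred": "A", "vote": "A"},
    {"past_vote": "A", "preferred": "A", "vote": "A"},
    {"past_vote": "B", "preferred": "B", "vote": "B"},
    {"past_vote": "B", "preferred": "B", "vote": "B"},
    {"past_vote": "A", "preferred": "B", "vote": "B"},
]
model = fit_naive_bayes(complete, "vote", features)

# Impute the missing voting intention of a new record.
incomplete = {"past_vote": "A", "preferred": "A", "vote": None}
incomplete["vote"] = predict(model, incomplete, features)
```

Each class of the classifier corresponds to one category of the variable to be imputed, so any classifier with a fit and predict step could take the place of this stand-in.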

  10. Comparison criterion
     • The classical survey inference criterion cannot be used because there are no models
     • EUREDIT project: Wald statistic for categorical variables, but none of the methods passes the proposed test
     • The correctly imputed rate is used instead (ten-fold cross-validation)
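
A correctly imputed rate estimated by ten-fold cross-validation, as named on slide 10, could be computed along these lines. This is a sketch under assumptions: the fold assignment, the function names and the per-region mode classifier used to exercise it are invented here, and the presentation does not describe its own implementation.

```python
import random
from collections import Counter

def cross_validated_hit_rate(rows, target, fit, predict, k=10, seed=0):
    """Hold out each of k folds in turn, impute its target values with a
    classifier trained on the other folds, and return the share of
    held-out records imputed correctly."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    hits = 0
    for fold in folds:
        held_out = set(fold)
        train = [rows[i] for i in idx if i not in held_out]
        model = fit(train)
        hits += sum(predict(model, rows[i]) == rows[i][target] for i in fold)
    return hits / len(rows)

# Hypothetical classifier: most frequent vote within each region.
def fit_region_mode(train):
    by_region = {}
    for r in train:
        by_region.setdefault(r["region"], Counter())[r["vote"]] += 1
    return {reg: c.most_common(1)[0][0] for reg, c in by_region.items()}

def predict_region_mode(model, row):
    return model.get(row["region"])

# Made-up data in which the region fully determines the vote.
rows = ([{"region": "north", "vote": "A"} for _ in range(10)]
        + [{"region": "south", "vote": "B"} for _ in range(10)])
rate = cross_validated_hit_rate(rows, "vote", fit_region_mode,
                                predict_region_mode)
```

Every record is imputed exactly once by a model that never saw it, so the resulting rate is comparable across imputation methods regardless of whether they rest on a statistical model.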

  11. Results of the comparison
     Correctly imputed rate (%) by imputation method:
     • MI logistic regression: 66.0
     • Fuzzy min-max neural network classifier: 86.1
     • Bayesian network classifier: 87.4

  12. Conclusions & final remarks
     1. Differences between machine learning and MI logistic regression are always similar
     2. Simplest case: missing data exclusively on one variable
     3. Extensible to numeric variables?
     4. Machine learning procedures are easier to automate:
        • Non-dependence on model assumptions
        • Don't break down with a large number of variables?
        • More robust to outliers?
     5. Machine learning may be used for massive imputation tasks

  13. Thank you!

  14. References (1)
     • Ake, C. F. (2005), Rounding After Multiple Imputation with Non-Binary Categorical Covariates, SAS Conference Proceedings: SAS User Group International 30, Philadelphia, PA, April 2005.
     • Allison, P. (2006), Multiple Imputation of Categorical Variables under the Multivariate Normal Model, paper presented at the Annual Meeting of the American Sociological Association, Montreal Convention Center, Montreal, Quebec, Canada, August 2006.
     • Demirtas, H. (2008), On Imputing Continuous Data When the Eventual Interest Pertains to Ordinalized Outcomes Via Threshold Concept, Computational Statistics and Data Analysis, vol. 52, pp. 2261-2271.
     • Horton, N. J., Lipsitz, S. R. and Parzen, M. (2003), A Potential for Bias when Rounding in Multiple Imputation, The American Statistician, vol. 57, no. 4, pp. 229-232, November 2003.
     • Rey-del-Castillo, P. and Cardeñosa, J. (2012), Fuzzy Min-Max Neural Networks for Categorical Data: Application to Missing Data Imputation, Neural Computing and Applications, vol. 21, no. 6, pp. 1349-1362, DOI 10.1007/s00521-011-0574-x, Springer-Verlag London.
     • Rubin, D. B. (1996), Multiple Imputation After 18+ Years, Journal of the American Statistical Association, vol. 91, no. 434, Applications and Case Studies, June 1996.

  15. References (2)
     • Rubin, D. B. and Schenker, N. (1986), Multiple Imputation for Interval Estimation from Simple Random Samples with Ignorable Nonresponse, Journal of the American Statistical Association, vol. 81, no. 394, Survey Research Methods, June 1986.
     • Schafer, J. L. and Graham, J. W. (2002), Missing Data: Our View of the State of the Art, Psychological Methods, vol. 7, no. 2, pp. 147-177.
     • Van Ginkel, J. R., Van der Ark, L. A. and Sijtsma, K. (2007), Multiple Imputation of Item Scores when Test Data are Factorially Complex, British Journal of Mathematics and Statistical Psychology, vol. 60, pp. 315-337.
     • Yucel, R. M. and Zaslavsky, A. M. (2003), Practical Suggestions on Rounding in Multiple Imputation, Proceedings of the Joint American Statistical Association Meeting, Section on Survey Research Methods, Toronto, Canada, August 2003.
