Inference from Non-Probability Samples Using Machine Learning

This presentation discusses the use of machine learning techniques for inference from non-probability samples. Topics include expanding the toolbox for statistical analysis, disruptive technologies, big data challenges, and prediction methods such as k-nearest neighbors and regression trees.

  • Machine Learning
  • Statistical Analysis
  • Non-Probability Samples
  • Big Data Challenges
  • k-Nearest Neighbors


Presentation Transcript


  1. Expanding the toolbox: inference from non-probability samples using machine learning. Joep Burger, Bart Buelens, Jan van den Brakel (Statistics Netherlands). INPS, 16-17 March 2017, Paris.

  2. Disruptive technologies, big picture: transport, light, music.

  3. Disruptive technologies, big picture: official statistics.

  4. Big data (UNECE 2013): human-sourced data (social media, internet search); process-mediated data (scanners, electronic funds transfers); machine-generated data (GPS, sensors).

  5. Potential: timelier, higher frequency, more detail, higher precision (which is not the same as more accurate), lower measurement bias, cheaper, less respondent burden.

  6. Challenges. Representation: a big data source is a non-probability sample of the population of interest. Others: measurement (turning data into information), processing, privacy, continuity.

  7. Street Bump app: automatic pothole mapping. Selection bias: wealthier, younger neighborhoods.

  8. INPS. Inference: generalize from the sample to the population by predicting the missing values for the remainder; approaches are pseudo-design-based, model-based, or algorithmic. Auxiliary information is known for all units in the population (sample plus remainder).

  9. Formal. Population quantity of interest: the mean $\bar{Y} = \frac{1}{N}\sum_{i \in U} y_i$ over the population $U$ of size $N$, with values $y_i$ known for the sample $S$ and unknown for the remainder $R$. Prediction estimator $\hat{\bar{Y}} = \frac{1}{N}\left(\sum_{i \in S} y_i + \sum_{i \in R} \hat{y}_i\right)$, with $\hat{y}_i$ predicted from a model. Variance estimated through bootstrapping.
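
A minimal Python sketch of this prediction estimator and its bootstrap variance (the names y_s, X_s, X_r, fit_predict and B are illustrative, not from the slides; y_s is a NumPy array of observed values for S, X_s and X_r are arrays of auxiliary variables for S and R, and fit_predict stands for any of the prediction methods introduced below):

```python
import numpy as np

def prediction_estimate(y_s, y_r_hat):
    """Prediction estimator of the population mean: observed values for the
    sample S plus model predictions for the remainder R, divided by N."""
    return (y_s.sum() + y_r_hat.sum()) / (len(y_s) + len(y_r_hat))

def bootstrap_variance(y_s, X_s, X_r, fit_predict, B=200, seed=1):
    """Approximate the variance of the prediction estimator by resampling S
    with replacement, refitting the model and re-predicting R each time."""
    rng = np.random.default_rng(seed)
    n = len(y_s)
    estimates = []
    for _ in range(B):
        idx = rng.integers(0, n, n)               # bootstrap resample of S
        y_r_hat = fit_predict(X_s[idx], y_s[idx], X_r)
        estimates.append(prediction_estimate(y_s[idx], y_r_hat))
    return np.var(estimates, ddof=1)
```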

  10. Methods: sample mean (SAM), pseudo-design-based (PDB), generalized linear model (GLM), k-nearest neighbors (KNN), artificial neural network (ANN), regression tree (RTR), support vector machine (SVM).

  11. Sample mean (SAM): predict every missing value by the mean of the observed units, $\hat{y}_i = \frac{1}{n}\sum_{j \in S} y_j$.
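
As a hedged illustration, the SAM predictor can be written to match the fit_predict signature sketched under slide 9 (sam_predict and the argument names are hypothetical):

```python
import numpy as np

def sam_predict(X_s, y_s, X_r):
    """SAM: every unit in the remainder R gets the sample mean of S."""
    return np.full(len(X_r), y_s.mean())
```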

  12. Pseudo-design-based (PDB): use auxiliary variables to form strata and predict by the mean of the observed units within the same stratum, $\hat{y}_i = \frac{1}{n_{h(i)}}\sum_{j \in S_{h(i)}} y_j$, where $h(i)$ denotes the stratum of unit $i$.
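
A minimal sketch of the PDB predictor, assuming stratum labels have already been derived from the auxiliary variables (pdb_predict, strata_s and strata_r are hypothetical names):

```python
import pandas as pd

def pdb_predict(strata_s, y_s, strata_r):
    """PDB: predict each remainder unit by the mean of the observed units in
    its stratum; strata unobserved in the sample come out as NaN."""
    stratum_means = pd.Series(y_s).groupby(pd.Series(strata_s)).mean()
    return pd.Series(strata_r).map(stratum_means).to_numpy()
```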

  13. Generalized linear model (GLM): a generalized linear combination of auxiliary variables, $g(\mathrm{E}[y_i]) = x_i^{\top}\beta$, so that $\hat{y}_i = g^{-1}(x_i^{\top}\hat{\beta})$.
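
A hedged sketch of the GLM predictor using statsmodels (the function name, argument names and the default Gaussian family are assumptions; the slides do not specify the link or family):

```python
import statsmodels.api as sm

def glm_predict(X_s, y_s, X_r, family=None):
    """GLM: fit g(E[y]) = x'beta on the sample, predict the remainder with
    the inverse link applied to the fitted linear predictor."""
    family = family or sm.families.Gaussian()
    fit = sm.GLM(y_s, sm.add_constant(X_s), family=family).fit()
    return fit.predict(sm.add_constant(X_r))
```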

  14. k-nearest neighbors (KNN): predict by the mean of the $k$ observed units closest in $x$-space under a chosen distance measure, $\hat{y}_i = \frac{1}{k}\sum_{j \in N_k(i)} y_j$, with $N_k(i)$ the $k$ nearest observed neighbors of unit $i$.
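
A minimal KNN sketch with scikit-learn (knn_predict and k=5 are illustrative choices; standardising the auxiliaries is one way to make the Euclidean distance sensible, not something the slides prescribe):

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

def knn_predict(X_s, y_s, X_r, k=5):
    """KNN: mean of the k observed units nearest in (standardised) x-space."""
    scaler = StandardScaler().fit(X_s)
    model = KNeighborsRegressor(n_neighbors=k).fit(scaler.transform(X_s), y_s)
    return model.predict(scaler.transform(X_r))
```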

  15. Artificial neural network (ANN): from a single artificial neuron to a network of artificial neurons, $\hat{y}_i = f_{\mathrm{ANN}}(x_i, \hat{\theta})$. (Figure: Koné Mamadou Tadiou.)
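
A hedged ANN sketch with scikit-learn's multilayer perceptron (the single hidden layer of 16 units, the iteration limit and the scaling are illustrative assumptions, not the architecture used in the presentation):

```python
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

def ann_predict(X_s, y_s, X_r, hidden=(16,)):
    """ANN: feed-forward network fitted on the sample, applied to the remainder."""
    scaler = StandardScaler().fit(X_s)
    model = MLPRegressor(hidden_layer_sizes=hidden, max_iter=2000, random_state=0)
    model.fit(scaler.transform(X_s), y_s)
    return model.predict(scaler.transform(X_r))
```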

  16. Regression tree (RTR): construct a binary tree by recursively splitting on the auxiliary variables (splits of the form $x_1 > c_1$, $x_2 > c_2$, ...), maximizing the between-node variance until a stopping criterion is met. Predict by the mean of the observed units within the same leaf, $\hat{y}_i = \frac{1}{n_{\ell(i)}}\sum_{j \in S_{\ell(i)}} y_j$, where $\ell(i)$ denotes the leaf of unit $i$. An algorithmic version of PDB.
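
A minimal RTR sketch with scikit-learn (rtr_predict and the min_samples_leaf stopping criterion are illustrative; CART's variance-reduction splitting mirrors the "maximize between variance" idea on the slide):

```python
from sklearn.tree import DecisionTreeRegressor

def rtr_predict(X_s, y_s, X_r, min_leaf=20):
    """RTR: grow a binary regression tree on the sample; each remainder unit
    is predicted by the mean of the observed units in its leaf."""
    model = DecisionTreeRegressor(min_samples_leaf=min_leaf).fit(X_s, y_s)
    return model.predict(X_r)
```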

  17. Support vector machine (SVM): linear separation in a higher-dimensional feature space $M > N$ reached by a mapping $\varphi$, learned in the original $N$-dimensional space via the kernel trick, $K(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$; predictions take the form $\hat{y}_i = \sum_{j \in S} \alpha_j K(x_j, x_i) + b$.
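
A hedged SVM sketch using support vector regression with an RBF kernel (the kernel choice and the scaling are assumptions; the slides only state that learning happens in the original space via the kernel trick):

```python
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

def svm_predict(X_s, y_s, X_r, kernel="rbf"):
    """SVM regression: predictions are kernel-weighted combinations of
    support vectors, fitted in the original space via the kernel trick."""
    scaler = StandardScaler().fit(X_s)
    model = SVR(kernel=kernel).fit(scaler.transform(X_s), y_s)
    return model.predict(scaler.transform(X_r))
```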

  18. Case study: Online Kilometer Registration. 6.7 million privately owned cars; mileage readings converted to annual mileage. Auxiliary variables: registration year, weight, fuel type, owner's age.

  19. Non-probability sample: only data about young cars. Inference: $\hat{\bar{Y}} = \frac{1}{N}\left(\sum_{i \in S} y_i + \sum_{i \in R} \hat{y}_i\right)$.
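
Putting the pieces together for this case study, a hypothetical usage sketch building on the helper functions defined above (y_s, X_s and X_r would have to be prepared from the registration data, with categorical auxiliaries such as fuel type encoded numerically; KNN is used purely as an example):

```python
# Point estimate of mean annual mileage and its bootstrap variance,
# predicting the unobserved (older) cars from the auxiliary variables.
point_estimate = prediction_estimate(y_s, knn_predict(X_s, y_s, X_r))
variance = bootstrap_variance(y_s, X_s, X_r, knn_predict)
```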

  20. Inference

  21. Accuracy

  22. Conclusions. Both probability samples (PS) and non-probability samples (NPS) may suffer from selection bias. Go beyond pseudo-design-based methods: model-based and algorithmic methods. (Many, continuous) auxiliary variables are crucial: registers, paradata, profiling.

  23. Working paper: https://www.cbs.nl/nl-nl/achtergrond/2015/44/predictive-inference-for-non-probability-samples (contact: j.burger@cbs.nl)
