
Machine Learning Approach for Product Code Classification in Economic Census
Explore how machine learning methods are utilized for classifying product codes in the Economic Census, addressing the challenges of handling vast arrays of unique codes and write-ins. The study delves into autocoding techniques, prior research in classification systems, and considerations for real-time autocoders in the survey process.
Presentation Transcript
Machine Learning for In-Instrument Product Code Search
Clayton Knappenberger (he/him), Data Scientist, U.S. Census Bureau
04/16/2024
Any opinions and conclusions expressed herein are those of the author(s) and do not reflect the views of the U.S. Census Bureau. The Census Bureau has reviewed this data product to ensure appropriate access, use, and disclosure avoidance protection of the confidential source data (Project No. P-7504847, Disclosure Review Board (DRB) approval number: CBDRB-FY24-EWD001-002).
Background
- The Economic Census is conducted every 5 years.
- Roughly 4.2 million U.S. business establishments receive the Economic Census.
- North American Product Classification System (NAPCS):
  - Beginning in 2017, establishments were asked to report revenue by NAPCS.
  - Establishments are likely to offer multiple products and services.
  - NAPCS has 7,234 unique codes.
  - Establishments have the option to write in a free-text answer instead.
The Problem
- In 2017, the Census Bureau received about 1 million NAPCS write-ins.
Autocoders
- Example write-ins: ice cream, car wash, legal advice
- Lots of experience building these:
  1. Collect data
  2. Classify write-ins in batch later
- The Survey of Occupational Injuries and Illnesses (Measure, 2017; BLS, 2023) is a sophisticated example.
- Cuts the respondent out of the loop
- How to validate the predictions?
- [Diagram: Data Collection, Internal Database, Post-collection Processing, Autocoder]
Autocoder in the survey?
- Example write-ins: ice cream, car wash, legal advice
- What if the autocoder worked in real time instead of in batch?
- Pros:
  - Get respondent feedback
  - Less data comes to us as a write-in
- Cons:
  - Latency matters: requests might time out
  - Infrastructure is more complicated
- [Diagram: Data Collection, Autocoder, Post-collection Processing, Internal Database]
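The timeout concern above can be illustrated with a minimal sketch. It assumes the survey instrument calls the autocoder with a time budget and falls back to accepting the plain write-in if no answer arrives in time. The names `search_with_timeout`, `fast_autocoder`, `slow_autocoder`, the placeholder codes, and the 0.5-second budget are all hypothetical and are not taken from the presentation.

```python
import concurrent.futures
import time

# Shared worker pool for autocoder calls (hypothetical setup).
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def fast_autocoder(query):
    """Hypothetical autocoder that answers immediately with placeholder codes."""
    return ["CODE_A", "CODE_B"]


def slow_autocoder(query):
    """Hypothetical autocoder that is too slow for an interactive survey."""
    time.sleep(1.0)
    return ["CODE_A"]


def search_with_timeout(autocoder, query, timeout_s=0.5):
    """Return suggestions if the autocoder answers in time, else fall back.

    On timeout the respondent would simply keep the free-text write-in,
    which can still be classified later in batch.
    """
    future = _POOL.submit(autocoder, query)
    try:
        codes = future.result(timeout=timeout_s)
        return {"status": "ok", "codes": codes}
    except concurrent.futures.TimeoutError:
        future.cancel()  # best effort; a call that is already running continues
        return {"status": "timeout", "codes": []}
```

With this shape, a slow model degrades to the batch workflow for that one response instead of blocking the respondent.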
Prior work
Researcher | Classification System | Method | N. Codes | Training Data
Roberson and Nguyen (2018) | Equipment/Structures/NA | Logistic regression on term frequencies | 3 | 14,000
Dumbacher and Whitehead (2022) | NAICS | Hierarchical ensemble of IR methods with learned weights | 1,012 | 3.7 million
Moscardi and Schultz (2023) | SCTG | Logistic regression on unigram, bigram, and 3-5 character n-gram TF-IDF weighted term frequencies and one-hot-encoded NAICS codes | 514 | 400,000
In-survey autocoders tend to:
- Be simpler models (logistic regression)
- Use information retrieval ideas
- Have a good balance of training data to classes
How does NAPCS compare?
- NAPCS was collected only in the 2017 Economic Census.
- 2012 data could be converted into NAPCS.
- Initially: training data 225,000; NAPCS codes 7,234
- Subject matter experts removed 950 NAPCS codes.
- After removal: training data 149,500; NAPCS codes 6,284
Data Source | Number of Observations | Advantages | Disadvantages
Economic Census | 180,000 | Represents target population; reflects natural language | Descriptions not classified perfectly; descriptions contain misspellings
Classification Analytical Processing System | 24,000 | Provides a rich vocabulary; descriptions are classified correctly | Text does not always reflect natural language
Subject matter expert (SME) examples | 11,500 | Incorporates institutional knowledge; can be continuously updated with SME feedback | Small data source
NAPCS title file | 9,100 | Definitive source of NAPCS descriptions | Small data source
SINCT: Smart Instrument NAPCS Classification Tool
- Uses the Doc2Vec unsupervised neural network for document embeddings
- Incoming queries are then compared to the training corpus
- The respondent is shown the NAPCS codes for the top-10 closest examples
- [Figure: Doc2Vec network diagram with a doc ID and example words ("new", "and", "sales", "car", "used") as inputs, concatenated word/doc vectors, and a hidden layer; Le and Mikolov (2014)]
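The retrieval step described on this slide (embed the query, compare it to the training corpus, show the codes of the closest examples) can be sketched as follows. This is not the SINCT implementation: a plain bag-of-words count vector stands in for the Doc2Vec embedding so the example runs with the standard library only. The corpus texts and codes are made up, and collapsing repeated codes is an assumption.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: bag-of-words counts (an assumption, NOT Doc2Vec)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_codes(query, corpus, k=10):
    """Rank training examples by similarity to the query and return the
    distinct codes of the k closest examples, best match first.

    corpus is a list of (description, code) pairs.
    """
    q = embed(query)
    scored = [(cosine(q, embed(text)), code) for text, code in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    codes = []
    for _, code in scored[:k]:
        if code not in codes:  # assumption: repeated codes are shown once
            codes.append(code)
    return codes


# Hypothetical training corpus with made-up codes.
CORPUS = [
    ("ice cream cones and sundaes", "A1"),
    ("car wash services", "B2"),
    ("legal advice and representation", "C3"),
    ("new car sales", "D4"),
    ("used car sales", "D5"),
]
```

Calling `top_k_codes("car wash", CORPUS, k=3)` ranks the car wash example first, followed by the two car sales examples that share only the word "car" with the query.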
Testing
- Split 149,500 classified write-ins into:
  - 145,000 training samples
  - 4,500 test samples
- Trained and evaluated three different SINCT models:
  1. Search only: embed only the write-in text
  2. NAICS + Search: embed the write-in text concatenated with the NAICS description
  3. Combined: run models 1 and 2 and merge and sort their top-10 recommendations by cosine similarity
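The merge step of the Combined model can be sketched as below, assuming each model returns a list of (code, cosine similarity) pairs. The function name `merge_recommendations` and the rule for a code returned by both models (keep its higher similarity) are assumptions; the slide only states that the two top-10 lists are merged and sorted by cosine similarity.

```python
def merge_recommendations(search_only, naics_search, k=10):
    """Merge two ranked lists of (code, similarity) pairs into one top-k list.

    Assumption: when both models return the same code, keep the higher
    similarity so the code appears only once in the merged list.
    """
    best = {}
    for code, sim in list(search_only) + list(naics_search):
        if code not in best or sim > best[code]:
            best[code] = sim
    # Sort by similarity, highest first, and keep the top k.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

One caveat with any merge of this kind: similarities produced by two different embedding models are not guaranteed to be on a comparable scale, so sorting them together is a heuristic.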
Testing Results
- Respondent-level accuracy: how often did a respondent find a correct NAPCS code across all their searches?
- Search-level accuracy: how often did a search return the correct NAPCS code?
- Latency: how long did respondents have to wait to see results?
- The NAICS + Search model had the highest accuracy and competitive latency.
Table columns: Search Only, NAICS + Search, Combined
Top 10 Accuracy (Respondent-level, Search-level): 85.7% 72.3% 72.1% 60.2% 81.1% 68.1%
Inference Latency (s) (Mean, Minimum, Median, Maximum): 0.013 0.004 0.011 0.060 0.014 0.005 0.012 0.072 0.025 0.014 0.025 0.124
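The difference between the two accuracy definitions can be made concrete with a small sketch. It assumes a search log of (respondent ID, hit) pairs, where a hit means the correct code appeared in that search's top 10, and it reads "respondent-level" as "at least one of the respondent's searches was a hit". The log format, the function names, and the sample data are hypothetical.

```python
import statistics
from collections import defaultdict


def accuracy_metrics(search_log):
    """Return (respondent_level, search_level) accuracy.

    search_log is a list of (respondent_id, hit) pairs, where hit is True
    if the correct code appeared in that search's top-10 results.
    """
    hits_by_respondent = defaultdict(list)
    for respondent_id, hit in search_log:
        hits_by_respondent[respondent_id].append(hit)
    # Search-level: share of individual searches that were hits.
    search_level = sum(1 for _, hit in search_log if hit) / len(search_log)
    # Respondent-level: share of respondents with at least one hit.
    respondent_level = sum(
        1 for hits in hits_by_respondent.values() if any(hits)
    ) / len(hits_by_respondent)
    return respondent_level, search_level


def latency_summary(latencies_s):
    """Summarize per-search inference latency in seconds."""
    return {
        "mean": statistics.mean(latencies_s),
        "min": min(latencies_s),
        "median": statistics.median(latencies_s),
        "max": max(latencies_s),
    }


# Hypothetical log: r1 misses once and then succeeds, r2 succeeds, r3 misses.
SAMPLE_LOG = [("r1", False), ("r1", True), ("r2", True), ("r3", False)]
```

On `SAMPLE_LOG`, two of four searches are hits while two of three respondents eventually find a code, which shows why respondent-level accuracy is at least as high as search-level accuracy when respondents can retry.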
2022 Economic Census Results
- No timeouts!
- EC17 to EC22: 78% reduction in NAPCS write-ins
- Remaining write-ins: Retail 26.4%, Healthcare 18.1%, Wholesale 9.2%, Other Services 8.0%
- [Chart: NAPCS Write-in Reduction, EC17 - EC22]
Next steps?
- Can we extend SINCT to other surveys?
  - Annual Integrated Economic Survey
- Can we extend SINCT to other product/service codes?
  - Harmonized System codes
  - UPC/GTIN barcodes
- Can we use transformer-based embeddings?
Questions?
Clayton.G.Knappenberger@census.gov
Taylor.J.Wilson@census.gov
Emily.L.Wiley@census.gov