When Labels Become Obsolete: Updating SOII Autocoder to OIICS v3.0

David H. Oh, a data scientist at the U.S. Bureau of Labor Statistics, shares insights on updating the SOII Autocoder to OIICS v3.0. Learn about his experiences and challenges faced during this process in the Office of Compensation and Working Conditions.

Uploaded on Feb 25, 2025

Presentation Transcript

When Labels Become Obsolete: Updating SOII Autocoder to OIICS v3.0
David H. Oh, Data Scientist
U.S. Bureau of Labor Statistics (BLS), Office of Compensation and Working Conditions (OCWC)
2023 FedCASIC Workshops - Session 5C, April 12, 2023
U.S. BUREAU OF LABOR STATISTICS bls.gov

Motivation
- BLS recently completed a major update to the Occupational Injury and Illness Classification System (OIICS): OIICS v2.01 → v3.0
- Research Question: What happens to a machine learning model (an "autocoder") when the labels in its training data become obsolete?

Overview
- Background
- The Challenge: Updating SOII Autocoder to OIICS v3.0
- Traditional Approach: A Simple Crosswalk
- Next Step: Weak Supervision Using Snorkel

Disclaimer
This is an ongoing project. Any results presented here are preliminary and are subject to change.

Survey of Occupational Injuries and Illnesses
- Annual establishment survey collecting injury and illness information since 1972
- Information collected:
  - Total number of cases resulting in days away from work or days of job transfer and work restrictions
  - Detailed case and demographic information about some injury or illness cases
- 200,000+ descriptions of work-related injuries and illnesses each year

Survey of Occupational Injuries and Illnesses: Example
Narrative
- Job title: sanitation worker
- What was the employee doing just before the incident? mopping floor in gym
- What happened? slipped on wet floor and fell
- What part of the body was affected? fractured right arm
- What object directly harmed the employee? wet floor
Codes Assigned
- SOC: 37-2011 (Janitor)
- OIICS-Nature: 111 (Fracture)
- OIICS-Part: 420 (Arm)
- OIICS-Event: 422 (Fall, slipping)
- OIICS-Source: 6620 (Floor)

Computer-Assisted Coding
- SOII Autocoder
  - 2014-17: Logistic Regression Model
  - 2018-20: LSTM Model
  - 2021-Present: Transformer Model
- The usage of computer-assisted coding was expanded gradually over time
  - 2014: Just a few occupational codes
  - 2015-17: Expanded to Nature, Part, Event, and Source
  - 2019: Expanded to Secondary Source
[Figure: Percent of SOII codes automatically assigned by survey year]

Standardized Classification Systems
- Official statistics rely on standardized classification systems to aggregate microdata into meaningful statistics
  - North American Industry Classification System (NAICS)
  - Standard Occupational Classification (SOC) System
  - Occupational Injury and Illness Classification System (OIICS)
- They are periodically updated to better reflect the current state of the categories that they cover
  - SOC 2000 → SOC 2010 → SOC 2018
  - OIICS v2.01 → OIICS v3.0

OIICS v3.0
- First developed by BLS in 1992 to code detailed injury, illness, and fatality data
- Recent update aims to better capture information useful and relevant for targeting safety and health interventions and to reflect changing technologies or emerging areas of interest
Counts of Unique Codes by OIICS Type*
              Total   Nature  Part  Event  Source
OIICS v2.01   2,535   507     182   441    1,405
OIICS v3.0    2,418   503     166   342    1,407
Difference    -117    -4      -16   -99    +2
*Excludes summary-level codes

The Challenge
- Problem: What happens to a machine learning model (an "autocoder") when the labels in its training data become obsolete?
  - SOII Autocoder needs to assign codes using the updated OIICS v3.0
  - It is trained on a large quantity of labeled SOII cases using OIICS v2.01
  - A large portion of its training data will have labels that are no longer valid
- Possible Responses:
  - Option 1. Pause autocoding until enough data are collected using OIICS v3.0 to train a new autocoder
  - Option 2. Update the existing training data labels to v3.0

Traditional Approach: A Simple Crosswalk
- Crosswalk: a dictionary-based approach that maps an old code to a new code
- One-to-one mapping
  - The definition of a code in OIICS v2.01 is captured entirely within the definition of a single code in OIICS v3.0
- One-to-many mapping
  - The definition of a code in OIICS v2.01 is not entirely captured by a single code in OIICS v3.0
  - Instead, the code can be conditionally mapped to multiple codes in OIICS v3.0
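
The two mapping types can be illustrated with a small dictionary. This is only a sketch: the codes (OLD_A, NEW_B1, and so on) are placeholders rather than real OIICS codes, and the names CROSSWALK and mapping_type are made up for illustration, since the slides do not show the actual crosswalk file format.

```python
# Hypothetical crosswalk: each old (v2.01-style) code maps to a list of
# candidate new (v3.0-style) codes. All codes here are placeholders.
CROSSWALK = {
    "OLD_A": ["NEW_A"],              # one-to-one: a single candidate
    "OLD_B": ["NEW_B1", "NEW_B2"],   # one-to-many: several candidates
}


def mapping_type(old_code, crosswalk=CROSSWALK):
    """Classify an old code by how many new codes it can map to."""
    candidates = crosswalk[old_code]
    return "one-to-one" if len(candidates) == 1 else "one-to-many"


for code in CROSSWALK:
    print(code, "->", CROSSWALK[code], f"({mapping_type(code)})")
```

Storing every value as a list, even for one-to-one codes, keeps the lookup logic uniform; the one-to-many case then needs extra information to choose among the candidates.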

OIICS Crosswalk
Distribution of Mapping Types in the OIICS Crosswalk File
              Total  Nature  Part  Event  Source
One-to-One    87%    93%     85%   79%    88%
One-to-Many   13%    7%      11%   21%    12%
Distribution of Mapping Types in the Training Data
              Total  Nature  Part  Event  Source  Secondary Source
One-to-One    72%    59%     74%   51%    81%     93%
One-to-Many   28%    41%     26%   49%    19%     7%

OIICS Crosswalk: Integrity Test
1. Use a test dataset containing gold standard (GS) codes in both OIICS v2.01 and OIICS v3.0
2. Crosswalk GS OIICS v2.01 to v3.0 and compare against GS OIICS v3.0
3. Consider the crosswalk to be valid if
   - One-to-One: gold_standard_<code>_v3 = <code>_crosswalked
   - One-to-Many: gold_standard_<code>_v3 ∈ {<code>_crosswalked}
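
The validity rule in step 3 could be checked along the following lines. This is a minimal sketch under assumptions: the crosswalk is a dictionary from an old code to a list of candidate new codes, the codes are placeholders, and the function names are hypothetical.

```python
# Placeholder crosswalk; real OIICS codes are not used here.
crosswalk = {
    "OLD_A": ["NEW_A"],
    "OLD_B": ["NEW_B1", "NEW_B2"],
}


def crosswalk_is_valid(gold_v3, candidates):
    """One-to-one: the gold v3.0 code must equal the single crosswalked code.
    One-to-many: the gold v3.0 code must be among the crosswalked candidates."""
    if len(candidates) == 1:
        return gold_v3 == candidates[0]
    return gold_v3 in candidates


def validity_rate(cases, crosswalk):
    """cases: list of (gold_v2, gold_v3) pairs; returns the share that pass."""
    results = [
        crosswalk_is_valid(gold_v3, crosswalk.get(gold_v2, []))
        for gold_v2, gold_v3 in cases
    ]
    return sum(results) / len(results)


gold_cases = [
    ("OLD_A", "NEW_A"),    # passes the one-to-one check
    ("OLD_A", "NEW_A"),    # passes the one-to-one check
    ("OLD_B", "NEW_B2"),   # passes the one-to-many check
    ("OLD_B", "NEW_C"),    # fails: gold code is not a crosswalk candidate
]
print(validity_rate(gold_cases, crosswalk))
```

An old code missing from the crosswalk yields an empty candidate list, so the case counts as a failure rather than raising an error.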

OIICS Crosswalk: Integrity Test
Crosswalk Integrity Test Using GS Data
Legend: One-to-One, One-to-Many
           Total  Nature  Part  Event  Source  Secondary Source
Series 1   81%    94%     95%   78%    77%     67%
Series 2   83%    91%     92%   81%    70%     62%

SOII Autocoder: Simply Crosswalked
1. Apply the OIICS crosswalk to more than 13.78 million OIICS v2.01 codes
   - For codes with a one-to-many mapping, randomly assign one of their possible mappings
2. Train a SOII Autocoder using the simply crosswalked training data
3. Measure the performance of the trained autocoder using a gold standard dataset
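
Step 1 might look roughly like the sketch below. The slides say only that one of the possible mappings is assigned at random; drawing uniformly is an assumption here, as are the placeholder codes and the function name simple_crosswalk.

```python
import random

# Placeholder crosswalk; real OIICS codes are not used here.
CROSSWALK = {
    "OLD_A": ["NEW_A"],
    "OLD_B": ["NEW_B1", "NEW_B2"],
}


def simple_crosswalk(old_codes, crosswalk, seed=0):
    """Relabel old codes: one-to-one codes map directly, one-to-many codes
    get one candidate drawn uniformly at random (an assumed distribution)."""
    rng = random.Random(seed)  # seeded so the relabeling is reproducible
    new_codes = []
    for code in old_codes:
        candidates = crosswalk[code]
        if len(candidates) == 1:
            new_codes.append(candidates[0])
        else:
            new_codes.append(rng.choice(candidates))
    return new_codes


print(simple_crosswalk(["OLD_A", "OLD_B", "OLD_B"], CROSSWALK))
```

Because the random draw ignores the narrative and the other codes on the case, some share of the one-to-many labels will be wrong, which is one plausible source of noise in the relabeled training data.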

SOII Autocoder: Simply Crosswalked
SOII Autocoder Performance: Simply Crosswalked vs Production
                              Average  Nature  Part   Event  Source  Secondary Source
Simply Crosswalked  Accuracy  66.0%    72.8%   83.4%  49.4%  63.0%   61.2%
Simply Crosswalked  Macro-F1  43.3     49.3    64.6   39.1   50.4    13.3
Production          Accuracy  82.0%    90.7%   92.6%  70.4%  76.3%   79.9%
Production          Macro-F1  59.2     61.9    78.4   51.7   63.1    40.8
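
For readers comparing the two metrics, the sketch below uses the standard definitions of accuracy and macro-averaged F1. The slides do not say how the metrics were computed, so this is an assumption; the labels are placeholders, and the result is on a 0-1 scale, whereas the slide appears to report Macro-F1 on a 0-100 scale.

```python
def accuracy(y_true, y_pred):
    """Share of cases where the predicted code equals the true code."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-code F1 over every code seen in either list."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A model that always predicts the common code "A" and never the rare code "B".
y_true = ["A", "A", "A", "B"]
y_pred = ["A", "A", "A", "A"]
print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

In this toy case accuracy is 0.75 while macro-F1 is 3/7, because the missed rare code contributes an F1 of zero with the same weight as the common code. A gap of this kind between the two metrics is typical when many codes are rare.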

Limitations to Traditional Approach
- OIICS crosswalk is not perfect
  - OIICS is a conglomeration of four different sets of codes (Nature, Part, Event, and Source) that interact with one another
  - OIICS v2.01 → v3.0 resulted in major changes that are difficult to capture entirely in a simple crosswalk
- No simple way to conditionally map one-to-many codes to their correct codes

Next Step: Weak Supervision Using Snorkel
- Weak Supervision
  - A broad collection of techniques in machine learning in which models are trained using sources of information that are easier to provide than hand-labeled data, where this information is incomplete, inexact, or otherwise less accurate
  - Supplement crosswalk file with other sources of information
    - Coding heuristics from subject-matter experts
    - Code relationships reflected in edits file
- Snorkel Data Programming
  - Started as a project at Stanford in 2016
  - Programmatically create weakly labeled data
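
The data-programming idea can be sketched without any library: several noisy "labeling functions" each vote for a new code or abstain, and the votes are combined into one weak label. The sketch below does not use Snorkel's API, and a plain majority vote stands in for Snorkel's learned label model. The heuristics, field names, and codes are all hypothetical.

```python
from collections import Counter

ABSTAIN = None

# Placeholder crosswalk; real OIICS codes are not used here.
CROSSWALK = {"OLD_A": ["NEW_A"], "OLD_B": ["NEW_B1", "NEW_B2"]}


def lf_crosswalk(case):
    """Vote only when the crosswalk mapping is unambiguous (one-to-one)."""
    candidates = CROSSWALK.get(case["old_code"], [])
    return candidates[0] if len(candidates) == 1 else ABSTAIN


def lf_keyword(case):
    """Hypothetical expert heuristic based on the narrative text."""
    if case["old_code"] == "OLD_B" and "ladder" in case["narrative"].lower():
        return "NEW_B1"
    return ABSTAIN


def lf_related_code(case):
    """Hypothetical edit-style rule: another code on the case narrows the choice."""
    if case["old_code"] == "OLD_B" and case.get("source_code") == "SRC_X":
        return "NEW_B1"
    return ABSTAIN


def weak_label(case, labeling_functions):
    """Majority vote over non-abstaining functions; ABSTAIN if none vote.
    Ties go to the label that was voted for first."""
    votes = [lf(case) for lf in labeling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]


LFS = [lf_crosswalk, lf_keyword, lf_related_code]
case = {"old_code": "OLD_B", "narrative": "Fell from ladder", "source_code": "SRC_X"}
print(weak_label(case, LFS))
```

Cases where every function abstains stay unlabeled, so they can be dropped from training or routed to manual review rather than receiving a random code.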

Contact Information
David H. Oh, Data Scientist
U.S. Bureau of Labor Statistics, Office of Compensation and Working Conditions
Oh.David@bls.gov
