Re-Development of Cell Suppression Methodology at US Census Bureau

Slide Note

This project focuses on enhancing the cell suppression methodology at the US Census Bureau by introducing a new program based on linear programming techniques. The team addresses issues related to processing models, table relations, and objective functions to improve the accuracy and efficiency of data analysis. The methodology involves preprocessing, elimination of duplicates, and protection of company data through aggregate supercells. Various constraints and generators are applied to ensure data accuracy, with a focus on additivity and bounds. The project aims to streamline data collection and analysis processes for the 2012 economic census.

kaem_116 Follow

Uploaded on Mar 03, 2025 | 0 Views


Presentation Transcript

Re-development of the Cell Suppression Methodology at the US Census Bureau
Philip Steel, James Fagan, Paul Massell, Richard Moore Jr., John Slanta, Bei Wang

Background
- Jewett's network flow program
- Need for new program
- 2012 economic census
- LP (linear programming) methodology
- R&M cell suppression team

Processing Model
- Preprocessing
- Create table description
- Determine primaries
- Unduplicate
- Sequential processing of primaries
- Queue reduction
- Test company protection (aggregate/supercell)
- Sequential processing of supercells

Table relations
- Marginals are the sum of interior cells.
- Geographic relationships tend to generate our most complex sets of table relations.
- State is the sum of metropolitan areas within the state and the balance. State is also the sum of counties.
- Relations are of the form A = B + ... + Z, where A, B, ..., Z are (one of) rows, columns or levels that define some Cartesian integer space (i,j,k).
- Duplicates are recorded as A = B (e.g. a county is also a place).
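One way to picture a relation of the form A = B + ... + Z is as a total index plus a list of component indices along one axis. The sketch below is illustrative only: the `Relation` class, the `check_additivity` helper, and all sample values are assumptions made for the example, not the Census Bureau's data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One table relation A = B + ... + Z along a single axis (hypothetical layout)."""
    total: int          # index of A (a row, column, or level)
    components: tuple   # indices of B, ..., Z

    @property
    def is_duplicate(self):
        # A = B with a single component records a duplicate,
        # e.g. a county that is also a place.
        return len(self.components) == 1


def check_additivity(values, relation, tol=1e-9):
    """Return True when the total equals the sum of its components."""
    parts = sum(values[c] for c in relation.components)
    return abs(values[relation.total] - parts) <= tol


# Made-up values: index 0 = state, 1-2 = metropolitan areas, 3 = balance of state,
# 4-6 = counties, 7 = a place that coincides with county 6.
values = [100.0, 40.0, 35.0, 25.0, 50.0, 30.0, 20.0, 20.0]

state_by_metro = Relation(total=0, components=(1, 2, 3))
state_by_county = Relation(total=0, components=(4, 5, 6))
place_is_county = Relation(total=7, components=(6,))
```

The state appears as the total of two different relations, which is the kind of overlap that makes geographic tables the most complex case.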

minimize:

$$Y = \sum_{i=1}^{rows} \sum_{j=1}^{cols} \sum_{\substack{k=1 \\ (i,j,k) \in A}}^{levs} c_{i,j,k} \left( x^{(u)}_{i,j,k} + x^{(l)}_{i,j,k} \right)$$

subject to:

(a)
$$\sum_{\substack{k=2 \\ (i,j,k) \in A}}^{levs} \left( x^{(u)}_{i,j,k} - x^{(l)}_{i,j,k} \right) = x^{(u)}_{i,j,1} - x^{(l)}_{i,j,1}$$
for i = 1, ..., rows, j = 1, ..., cols : levs > 1, ws(i,j,1) = 0

(b)
$$\sum_{\substack{i=1 \\ (rowrel(ii,i),j,k) \in A}}^{limr(ii)} \left( x^{(u)}_{rowrel(ii,i),j,k} - x^{(l)}_{rowrel(ii,i),j,k} \right) = x^{(u)}_{rowrel(ii,0),j,k} - x^{(l)}_{rowrel(ii,0),j,k}$$
for ii = 1, ..., rr, j = 1, ..., cols, k = 1, ..., levs : limr(ii) \ge 1, ws(ii,j,k) = 0

(c)
$$\sum_{\substack{j=1 \\ (i,colrel(jj,j),k) \in A}}^{limc(jj)} \left( x^{(u)}_{i,colrel(jj,j),k} - x^{(l)}_{i,colrel(jj,j),k} \right) = x^{(u)}_{i,colrel(jj,0),k} - x^{(l)}_{i,colrel(jj,0),k}$$
for i = 1, ..., rows, jj = 1, ..., cc, k = 1, ..., levs : limc(jj) \ge 1, ws(i,jj,k) = 0

(d)
$$0 \le x^{(u)}_{i,j,k} \le h_{i,j,k}; \qquad 0 \le x^{(l)}_{i,j,k} \le h_{i,j,k}$$
for i = 1, ..., rows, j = 1, ..., cols, k = 1, ..., levs : (i,j,k) \in A

(e)
$$x^{(u)}_{prow,pcol,plev} = prot; \qquad x^{(l)}_{prow,pcol,plev} = 0$$

where:

$$c_{i,j,k} = \begin{cases} \max(0, v_{i,j,k}) & \text{when } (i,j,k) \in U \\ 0 & \text{when } (i,j,k) \in P \cup C \end{cases}$$

$$h_{i,j,k} = \max(0, v_{i,j,k})$$

Objective Function

$$Y = \sum_{i=1}^{rows} \sum_{j=1}^{cols} \sum_{\substack{k=1 \\ (i,j,k) \in A}}^{levs} c_{i,j,k} \left( x^{(u)}_{i,j,k} + x^{(l)}_{i,j,k} \right)$$
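To make the objective concrete, the following sketch evaluates Y for a small made-up flow. It assumes the cost reading c = 0 for cells in P or C and c = max(0, v) for all other (unsuppressed) cells, with A taken to be the set of keys of `v`. The function names, the dictionary layout, and the sample numbers are illustrative assumptions, not part of the presented program.

```python
def cost(cell, v, P, C):
    """Cost of moving flow through a cell: free if already suppressed (assumed reading)."""
    if cell in P or cell in C:
        return 0.0
    return max(0.0, v[cell])


def objective(x_up, x_lo, v, P, C):
    """Y = sum over cells in A of c * (x_up + x_lo); A is taken as the keys of v."""
    return sum(
        cost(cell, v, P, C) * (x_up.get(cell, 0.0) + x_lo.get(cell, 0.0))
        for cell in v
    )


# Made-up 2 x 2 x 1 interior, keyed by (i, j, k).
v = {(1, 1, 1): 50.0, (1, 2, 1): 30.0, (2, 1, 1): 20.0, (2, 2, 1): 10.0}
P = {(1, 1, 1)}   # primary suppressions
C = {(1, 2, 1)}   # complementary suppressions chosen so far

# A cycle of flow 5 around the four cells: up, down, down, up.
x_up = {(1, 1, 1): 5.0, (2, 2, 1): 5.0}
x_lo = {(1, 2, 1): 5.0, (2, 1, 1): 5.0}

Y = objective(x_up, x_lo, v, P, C)
```

Only the two unsuppressed cells contribute here, so Y = 20 * 5 + 10 * 5 = 150. Flow through cells that are already suppressed costs nothing, which is what allows a solution to have objective 0.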

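Additivity constraint generator (based on row relations)

(b)
$$\sum_{\substack{i=1 \\ (rowrel(ii,i),j,k) \in A}}^{limr(ii)} \left( x^{(u)}_{rowrel(ii,i),j,k} - x^{(l)}_{rowrel(ii,i),j,k} \right) = x^{(u)}_{rowrel(ii,0),j,k} - x^{(l)}_{rowrel(ii,0),j,k}$$
for ii = 1, ..., rr, j = 1, ..., cols, k = 1, ..., levs : limr(ii) \ge 1, ws(ii,j,k) = 0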
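A generator of this kind can be sketched as a loop over row relations, columns, and levels that emits one equality per combination. The sketch assumes each relation is stored as a tuple `(total, component_1, ..., component_n)`, standing in for rowrel(ii,0) and rowrel(ii,1..limr(ii)), and writes each constraint as coefficients whose weighted sum must equal 0. The `ws` filter is left out because the slides do not define it. The function name, the tuple layout, and the `in_A` callback are all hypothetical.

```python
def row_additivity_constraints(rowrel, cols, levs, in_A=lambda cell: True):
    """Yield one constraint per (relation, column, level).

    Each constraint is a dict {(direction, i, j, k): coefficient}, read as
    sum(coefficient * x) == 0, where direction is "u" (upward flow) or "l"
    (downward flow).
    """
    for relation in rowrel:
        total, components = relation[0], relation[1:]
        if len(components) < 1:          # mirrors the condition limr(ii) >= 1
            continue
        for j in range(1, cols + 1):
            for k in range(1, levs + 1):
                coeffs = {}
                for i in components:
                    if in_A((i, j, k)):
                        coeffs[("u", i, j, k)] = 1.0
                        coeffs[("l", i, j, k)] = -1.0
                if in_A((total, j, k)):
                    # Move the total to the left-hand side with opposite sign.
                    coeffs[("u", total, j, k)] = -1.0
                    coeffs[("l", total, j, k)] = 1.0
                yield coeffs


# Made-up relation: row 3 = row 1 + row 2, in a table with 2 columns and 1 level.
constraints = list(row_additivity_constraints([(3, 1, 2)], cols=2, levs=1))

# A flow that raises row 1 by 5 and lowers row 2 by 5 leaves the total unchanged.
x = {("u", 1, 1, 1): 5.0, ("l", 2, 1, 1): 5.0}
residual = sum(c * x.get(var, 0.0) for var, c in constraints[0].items())
```

Column and level constraints would follow the same pattern with the roles of the indices exchanged.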

Bounds

$$h_{i,j,k} = \max(0, v_{i,j,k})$$
for i = 1, ..., rows, j = 1, ..., cols, k = 1, ..., levs : (i,j,k) \in A

For the primary

Skip P
The model changes only in the constraints on the target primary. How can the minimal solution for one target be transformed into a solution for another target? By applying a scalar that converts the flow through the second P to the fixed value of the model! This can be done when the scalar does not violate the bounding conditions and the complementary flow in the second target is 0, i.e. when the solution's flow through the secondary target exceeds its protection requirement.
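The Skip P test can be sketched as a scan over the other primaries after each solve. The reasoning assumed here: if a solution sends upward flow f through a second primary with required protection p, has no downward flow there, and f is at least p, then multiplying every flow by p / f (a factor no greater than 1) keeps the flows within their bounds, preserves the additivity equalities, and sets the flow through the second primary to exactly p. The function names and the dictionary layout are hypothetical, and the numbers are made up.

```python
def skip_p_scalars(x_up, x_lo, prot, target):
    """Return {primary: scalar} for primaries the current solution already protects.

    x_up, x_lo: upward and downward flows by cell from the solve for `target`.
    prot: required protection for each primary.
    """
    skippable = {}
    for p, required in prot.items():
        if p == target:
            continue
        flow = x_up.get(p, 0.0)
        # Complementary (downward) flow must be 0 and the upward flow must
        # meet or exceed the protection requirement.
        if x_lo.get(p, 0.0) == 0.0 and flow >= required > 0.0:
            skippable[p] = required / flow   # always <= 1, so bounds still hold
    return skippable


def rescale(x, scalar):
    """Scale every flow by the same factor."""
    return {cell: scalar * value for cell, value in x.items()}


# Made-up solution for target P1.
prot = {"P1": 10.0, "P2": 4.0, "P3": 6.0}
x_up = {"P1": 10.0, "P2": 8.0, "P3": 3.0, "U1": 10.0}
x_lo = {"U2": 10.0}

skippable = skip_p_scalars(x_up, x_lo, prot, target="P1")
scaled_up = rescale(x_up, skippable["P2"])
```

Here P2 carries flow 8 against a requirement of 4, so it can be skipped with scalar 0.5, while P3 carries only 3 against a requirement of 6 and still needs its own optimization.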

Empirical confirmation
In our large sparse tables, we would see a lot of objective-0 results. That is, the solver finds a 0-cost pattern to protect the primary: it is already protected! Skip P eliminated most objective-0 results and left intact the sequence of positive objectives and their solutions.

Fat solution
CPLEX uses a dual simplex method to find solutions. The solutions have a growing 0-cost component, with many more cells than are required to protect the target P. The flow in the 0-cost cells far exceeds what is required to protect the target P (except in very small or dense examples). The solution lights up the possible flows in the table's current state, giving a "fat" solution.

Skip P and the fat solution

| Optimization number | Count of P with flow | Running total of skipped P |
|---|---|---|
| 1 | 3961 | 3076 |
| 2 | 3952 | 3243 |
| ... | ... | ... |
| 587 | 11035 | 10448 |
| 588 | 11037 | 10449 |

dg10 sector 44
- Cartesian cells: 367,605 (2d)
- Non-zero cells: 159,849
- Relations: 283 (row and column)
- 14,000 potential tables, linked
- P: 95,062
- LP problems: 10,604
- Typical LP size: Reduced LP has 64826 rows, 156809 columns, and 528838 nonzeros
- Time: 8hr:37min (includes everything)

Comparison between network and LP on one dataset (of hundreds) from 2007

| | Network flow | LP |
|---|---|---|
| C | 14,551 | 11,283 |
| Cvalue | 1,813,213,710 | 598,886,234 |
| PubValue | 12,348,960,578 | 13,563,288,054 |
| (@10%) undersuppressions | # | 0 |
| time | 24min | 8hrs 37min |

Statistics based on unduplicated data with an approximation of a published status flag.
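The figures in the comparison can be cross-checked with a few lines of arithmetic. The sketch below assumes that C is the number of complementary suppressions, Cvalue the total value of those cells, and PubValue the total published value; these readings are inferred from the labels rather than stated in the slides.

```python
network = {"C": 14_551, "Cvalue": 1_813_213_710, "PubValue": 12_348_960_578}
lp = {"C": 11_283, "Cvalue": 598_886_234, "PubValue": 13_563_288_054}

# Suppressed value plus published value should give the same total under both
# methods, since both protect the same underlying data.
total_network = network["Cvalue"] + network["PubValue"]
total_lp = lp["Cvalue"] + lp["PubValue"]

# LP result relative to the network flow result.
count_ratio = lp["C"] / network["C"]
value_ratio = lp["Cvalue"] / network["Cvalue"]

print(f"totals agree: {total_network == total_lp}")
print(f"LP complements as a share of network complements: {count_ratio:.2f}")
print(f"LP suppressed value as a share of network suppressed value: {value_ratio:.2f}")
```

Both totals come to 14,162,174,288. On this dataset the LP run uses roughly 78% as many complementary cells and suppresses roughly a third of the value, at the price of a much longer run time.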

Thank you!
philip.m.steel@census.gov
