
Creating Synthetic Microdata for Higher Education in Japan
"Explore the creation of synthetic microdata for educational use in Japan, including issues with existing data, correcting methods, and future considerations. Learn about the legal framework and advancements in statistical legislation enabling broader access to official data for academic research."
Presentation Transcript
Creating Synthetic Microdata for Higher Educational Use in Japan: Reproduction of Distribution Type based on the Descriptive Statistics
Kiyomi Shirakawa (Hitotsubashi University / National Statistics Center)
Yutaka Abe (Hitotsubashi University)
Shinsuke Ito (Chuo University)
Outline
1. Synthetic Microdata in Japan
2. Problems with Existing Synthetic Microdata
3. Correcting Existing Synthetic Microdata
4. Creating New Synthetic Microdata
5. Sensitivity Rules for New Synthetic Microdata
6. Comparison between Various Sets of Synthetic Microdata
7. Conclusions and Future Outlook
1. Synthetic Microdata in Japan
Synthetic microdata for educational use are available in Japan:
- Generated using multidimensional statistical tables.
- Based on the methodology of microaggregation (Ito (2008), Ito and Takano (2011), Makita et al. (2013)).
- Created based on the original microdata from the 2004 National Survey of Family Income and Expenditure.
Synthetic microdata are not original microdata.
Legal Framework
New Statistics Act in Japan (April 2009):
- Enables the provision of Anonymized microdata (Article 36) and tailor-made tabulations (Article 34).
- Allows a wider use of official microdata.
- Allows use of official statistics in higher education and academic research.
- However, a permission process is required.
To provide an alternative to Anonymized microdata, the NSTAC has developed Synthetic microdata that can be accessed without a permission process.
[Figure: Image of Frequency of Original and Synthetic Microdata. Source: Makita et al. (2013).]
2. Problems with Existing Synthetic Microdata
(1) All variables are subjected to exponential transformation in units of cells in the result table.

| Number of Earners | Structure of Dwelling | Frequency | Living Expenditure: Mean | SD | C.V. | Food: Mean | SD | C.V. |
|---|---|---|---|---|---|---|---|---|
| One person | (all) | 4,132 | 302,492.8 | 148,598.9 | 0.491 | 71,009.0 | 25,089.5 | 0.353 |
| One person | Wooden | 1,436 | 300,390.3 | 170,211.4 | 0.567 | 71,018.5 | 24,187.6 | 0.341 |
| One person | Wooden with fore roof | 501 | 298,961.0 | 125,682.9 | 0.420 | 73,507.3 | 24,947.7 | 0.339 |
| One person | Ferro-concrete | 1,624 | 306,947.4 | 131,895.0 | 0.430 | 69,873.1 | 25,844.2 | 0.370 |
| One person | Unknown | 571 | 298,209.7 | 153,651.1 | 0.515 | 72,024.1 | 25,125.1 | 0.349 |
| Two persons | (all) | 4,201 | 346,195.7 | 215,911.7 | 0.624 | 78,209.1 | 25,288.1 | 0.323 |
| Two persons | Wooden | 1,962 | 346,980.3 | 172,673.2 | 0.498 | 78,961.7 | 24,233.5 | 0.307 |
| Two persons | Wooden with fore roof | 558 | 356,021.5 | 160,579.8 | 0.451 | 81,039.4 | 24,628.2 | 0.304 |
| Two persons | Ferro-concrete | 1,120 | 353,093.9 | 313,837.8 | 0.889 | 76,860.8 | 26,250.7 | 0.342 |
| Two persons | Others | 3 | 260,759.8 | 37,924.3 | 0.145 | 72,733.1 | 5,358.9 | 0.074 |
| Two persons | Unknown | 558 | 320,224.5 | 148,230.3 | 0.463 | 75,468.5 | 27,241.1 | 0.361 |

Slide annotation within the "Two persons" block: "Too large".
(2) Correlation coefficients (numerical) between all variables are reproduced.
In the table below, several correlation coefficients are too small. The reason is that correlation coefficients between uncorrelated variables are also reproduced.

| | Living expenditure | Food | Housing |
|---|---|---|---|
| Living expenditure | 1.00 | 0.50 | 0.28 |
| Food | 0.43 | 1.00 | -0.03 |
| Housing | 0.28 | -0.06 | 1.00 |

Top half: original data; bottom half: synthetic microdata. Slide annotation: "Too small".
(3) Qualitative attributes of groups having a frequency (size) of 1 or 2 are transformed to "Unknown" (V) or deleted.
- The information loss when using this method is too large.
- Furthermore, the variations within the groups are too large to merge qualitative attributes between different groups.
[Figure 1: Processing records with common values for qualitative attributes into groups with a minimum size of 3. Panels: Individual Data; Multidimensional Tables (Gender, Employment Status, N). Note: "V" stands for "unknown". Source: Makita et al. (2013).]
[Figure: Scatter plots of numerical examples for Anscombe's quartet]
Examples of numerical values for Anscombe's quartet

| I: x | I: y | II: x | II: y | III: x | III: y | IV: x | IV: y |
|---|---|---|---|---|---|---|---|
| 10.0 | 8.04 | 10.0 | 9.14 | 10.0 | 7.46 | 8.0 | 6.58 |
| 8.0 | 6.95 | 8.0 | 8.14 | 8.0 | 6.77 | 8.0 | 5.76 |
| 13.0 | 7.58 | 13.0 | 8.74 | 13.0 | 12.74 | 8.0 | 7.71 |
| 9.0 | 8.81 | 9.0 | 8.77 | 9.0 | 7.11 | 8.0 | 8.84 |
| 11.0 | 8.33 | 11.0 | 9.26 | 11.0 | 7.81 | 8.0 | 8.47 |
| 14.0 | 9.96 | 14.0 | 8.10 | 14.0 | 8.84 | 8.0 | 7.04 |
| 6.0 | 7.24 | 6.0 | 6.13 | 6.0 | 6.08 | 8.0 | 5.25 |
| 4.0 | 4.26 | 4.0 | 3.10 | 4.0 | 5.39 | 19.0 | 12.50 |
| 12.0 | 10.84 | 12.0 | 9.13 | 12.0 | 8.15 | 8.0 | 5.56 |
| 7.0 | 4.82 | 7.0 | 7.26 | 7.0 | 6.42 | 8.0 | 7.91 |
| 5.0 | 5.68 | 5.0 | 4.74 | 5.0 | 5.73 | 8.0 | 6.89 |

| Property | Value |
|---|---|
| Mean of x in each case | 9 (exact) |
| Sample variance of x in each case | 11 (exact) |
| Mean of y in each case | 7.50 (to 2 decimal places) |
| Sample variance of y in each case | 4.122 or 4.127 (to 3 decimal places) |
| Correlation between x and y in each case | 0.816 (to 3 decimal places) |
| Linear regression line in each case | y = 3.00 + 0.500x (to 2 and 3 decimal places, respectively) |

http://en.wikipedia.org/wiki/Anscombe%27s_quartet
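The point of the quartet for synthetic microdata is that four very different distributions share the same frequency, mean, variance, and correlation. The following sketch (NumPy is an assumption here; the slides name no tooling) recomputes the summary properties from the listed values:

```python
import numpy as np

# Anscombe's quartet, values as listed in the slide table.
x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
quartet = {
    "I": (x123, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])),
    "II": (x123, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])),
    "III": (x123, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])),
    "IV": (x4, np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])),
}

def summarize(x, y):
    """Return the descriptive statistics that the four cases share."""
    slope, intercept = np.polyfit(x, y, 1)  # least-squares line y = intercept + slope * x
    return {
        "mean_x": float(x.mean()),
        "var_x": float(x.var(ddof=1)),   # sample variance
        "mean_y": float(y.mean()),
        "var_y": float(y.var(ddof=1)),
        "corr": float(np.corrcoef(x, y)[0, 1]),
        "slope": float(slope),
        "intercept": float(intercept),
    }

summary = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in summary.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

All four cases print nearly identical statistics, which is why matching these indicators alone does not guarantee that synthetic data reproduce the shape of the original distribution.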
Kinds of Synthetic Microdata in Japan
Individual microdata -> Statistical tables -> synthetic microdata
- No tabulation: maximum, minimum, median, range
- AUF (Academic Use Files): frequency, mean, SD, correlation coefficient, skewness, kurtosis; reproduce the original distribution type
- PUF (Public Use Files): frequency, mean, standard deviation; do not consider the original distribution type
Information
Shortly, the National Statistics Center will be delivering a new Public Use File for the 2009 National Survey of Family Income and Expenditure.
However, there is no plan to create an 'Academic Use File'.
Therefore, in this paper, we suggest a way of creating such a file.
3. Correcting Existing Synthetic Microdata
The following approaches can be used to correct the existing Synthetic microdata.
(1) Select the transformation method (logarithmic transformation, exponential transformation, square-root transformation, reciprocal transformation) based on the original distribution type (normal, bimodal, uniform, etc.).
(2) Detect non-correlations for each variable.
(3) Merge qualitative attributes in groups with a size of 1 or 2 into a group that has a minimum size of 3 in the upper hierarchical level.
Box-Cox Transformation
[Figure: Histogram of Wool$cycles; x-axis: Wool$cycles (0 to 4000), y-axis: Frequency]
- λ = 0: logarithmic transformation
- λ = 0.5: square-root transformation
- λ = -1: reciprocal transformation
- λ = 1: linear transformation
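As a minimal sketch of the transformation family listed on this slide, the standard one-parameter Box-Cox form is (x^λ - 1) / λ for λ ≠ 0 and log(x) for λ = 0. The function name `box_cox` and the sample values are illustrative and not taken from the slides:

```python
import numpy as np

def box_cox(x, lam):
    """One-parameter Box-Cox transform for strictly positive data."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(x)              # limiting case as lambda -> 0
    return (x**lam - 1.0) / lam       # affine function of x**lam

values = np.array([100.0, 400.0, 900.0, 1600.0])  # illustrative data only
for lam in (0, 0.5, -1, 1):
    print(lam, box_cox(values, lam))
```

For λ = 0.5, -1, and 1 the result is an affine rescaling of the square root, the reciprocal, and the original values respectively, which matches the labels on the slide up to shift and scale.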
4. Creating New Synthetic Microdata
In order to address the problems with the existing Synthetic microdata, new synthetic microdata were created based on the following approaches.
(1) Create microdata based on kurtosis and skewness
(2) Create microdata based on the two tabulation tables of the basic table and details table
(3) Create microdata based on multivariate normal random numbers and exponential transformation
This process allows creating synthetic microdata with characteristics similar to those of the original microdata. It is called the 'Academic Use File'.
(1) Microdata created based on Kurtosis and Skewness
[Figure: Differences of kurtosis and skewness; series: Original microdata, Close data, Not close data]
Original microdata and transformed indicators for each transformation

| Indicator | Original data | Log2 transformation | Natural log transformation | Square-root transformation | Reciprocal transformation |
|---|---|---|---|---|---|
| Mean | 861.370 | 9.139 | 6.335 | 26.451 | 2.651 |
| Standard deviation | 882.057 | 1.363 | 0.945 | 12.960 | 2.548 |
| Kurtosis | 4.004 | -0.448 | -0.448 | 0.974 | 4.185 |
| Skewness | 2.002 | 0.107 | 0.107 | 1.115 | 1.943 |

Frequency: 27
λ = 0 / -0.047 (as printed on the slide)
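The table compares kurtosis and skewness before and after each candidate transformation. One way to sketch that comparison is shown below, under two loud assumptions: the moment-based (population) estimators used here may differ from the sample-adjusted formulas behind the slide's figures, and the "closest to normal" score (|skewness| + |excess kurtosis|) is an illustrative choice, not the authors' stated criterion. The input data are simulated, not the slide's 27 observations.

```python
import numpy as np

def skewness(x):
    """Moment-based skewness (population form)."""
    d = x - x.mean()
    return float((d**3).mean() / (d**2).mean() ** 1.5)

def excess_kurtosis(x):
    """Moment-based excess kurtosis (0 for a normal distribution)."""
    d = x - x.mean()
    return float((d**4).mean() / (d**2).mean() ** 2 - 3.0)

TRANSFORMS = {
    "original": lambda x: x,
    "log": np.log,
    "sqrt": np.sqrt,
    "reciprocal": lambda x: 1.0 / x,
}

def shape_table(x):
    """Skewness and excess kurtosis of x under each candidate transformation."""
    return {name: (skewness(f(x)), excess_kurtosis(f(x))) for name, f in TRANSFORMS.items()}

def closest_to_normal(x):
    """Pick the transformation whose shape indicators are nearest to (0, 0)."""
    table = shape_table(x)
    return min(table, key=lambda name: abs(table[name][0]) + abs(table[name][1]))

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=6.0, sigma=1.0, size=2000)  # hypothetical right-skewed data
for name, (skew, kurt) in shape_table(sample).items():
    print(f"{name:10s} skewness={skew:8.3f} kurtosis={kurt:8.3f}")
print("selected:", closest_to_normal(sample))
```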
(2) Microdata created based on two Tabulation Tables (Basic Table and Details Table)
Basic Table (matches with original mean and standard deviation, approximate correlation coefficients for each variable)

| | Living expenditure | Food | Housing |
|---|---|---|---|
| Mean | 195,624.8 | 54,647.8 | 1,648.8 |
| Standard deviation | 59,892.6 | 21,218.1 | 3,144.4 |
| Kurtosis | -1.004164 | 1.628974 | 6.918601 |
| Skewness | 0.346305 | 0.992579 | 2.605260 |
| Frequency | 20 | 20 | 8 |

| Correlation coefficients | Living expenditure | Food | Housing |
|---|---|---|---|
| Living expenditure | 1 | | |
| Food | 0.643 | 1 | |
| Housing | -0.335 | -0.489 | 1 |

Details Table (means and standard deviations for creating synthetic microdata for multidimensional cross fields)

| Groups | Living expenditure: Frequency | Mean | Standard deviation | Food: Frequency | Mean | Standard deviation |
|---|---|---|---|---|---|---|
| 1 | 3 | 185,499.9 | 65,680.5 | 3 | 31,193.5 | 6,406.9 |
| 2 | 3 | 150,424.8 | 28,599.3 | 3 | 51,457.2 | 20,795.2 |
| 3 | 3 | 269,749.0 | 43,611.7 | 3 | 80,520.1 | 28,447.0 |
| 4 | 4 | 209,347.8 | 50,580.8 | 4 | 45,359.0 | 12,618.4 |
| 5 | 3 | 236,587.8 | 40,679.9 | 3 | 75,606.2 | 3,049.8 |
| 6 | 4 | 137,080.2 | 15,119.7 | 4 | 48,797.2 | 1,071.9 |
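The two tables are linked: the group frequencies, means, and standard deviations in the details table determine the overall mean and standard deviation in the basic table. A short sketch of that aggregation follows, assuming the reported standard deviations are sample standard deviations (divisor n - 1); the function name `combine` is illustrative.

```python
import math

# (frequency, mean, standard deviation) per group, copied from the details table.
living = [
    (3, 185499.9, 65680.5), (3, 150424.8, 28599.3), (3, 269749.0, 43611.7),
    (4, 209347.8, 50580.8), (3, 236587.8, 40679.9), (4, 137080.2, 15119.7),
]
food = [
    (3, 31193.5, 6406.9), (3, 51457.2, 20795.2), (3, 80520.1, 28447.0),
    (4, 45359.0, 12618.4), (3, 75606.2, 3049.8), (4, 48797.2, 1071.9),
]

def combine(groups):
    """Aggregate group statistics into the overall frequency, mean, and sample SD."""
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / n_total
    within = sum((n - 1) * s**2 for n, _, s in groups)            # spread inside groups
    between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)  # spread of group means
    sd = math.sqrt((within + between) / (n_total - 1))
    return n_total, grand_mean, sd

for label, groups in (("Living expenditure", living), ("Food", food)):
    n, mean, sd = combine(groups)
    print(f"{label}: N={n}, mean={mean:,.1f}, SD={sd:,.1f}")
```

The aggregated values can be compared directly with the Mean and Standard deviation rows of the basic table.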
(3) Microdata created based on Multivariate Normal Random Numbers and Exponential Transformation
- λ in the Box-Cox transformation is required in order to change the distribution type of the original data into a standard distribution.
- Based on λ in the Box-Cox transformation, a random number that approximates the kurtosis and skewness of the original microdata is selected.
- However, approximately using exponential transformation was [sentence cut off in the transcript]
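A rough sketch of the general idea (multivariate normal draws followed by an exponential transformation) is given below. It is not the authors' procedure: the rescaling step, the function name `synthesize`, and the use of the basic-table means, standard deviations, and the 0.643 correlation as targets are assumptions made for illustration.

```python
import numpy as np

def synthesize(means, sds, corr, n, seed=0):
    """Draw correlated normal numbers, exponentiate, then rescale to target mean and SD."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    sds = np.asarray(sds, dtype=float)
    z = rng.multivariate_normal(np.zeros(len(means)), corr, size=n)
    skewed = np.exp(z)  # exponential transformation gives a right-skewed (lognormal) shape
    standardized = (skewed - skewed.mean(axis=0)) / skewed.std(axis=0, ddof=1)
    return standardized * sds + means  # columns now match the target mean and sample SD

target_means = [195624.8, 54647.8]   # living expenditure, food (basic table)
target_sds = [59892.6, 21218.1]
corr = np.array([[1.0, 0.643], [0.643, 1.0]])

synthetic = synthesize(target_means, target_sds, corr, n=20)
print(synthetic.mean(axis=0), synthetic.std(axis=0, ddof=1))
print(np.corrcoef(synthetic, rowvar=False)[0, 1])
```

Means and standard deviations are reproduced exactly by construction, while the correlation after the exponential step only approximates the target, and rescaled values are not guaranteed to stay positive.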
5. Sensitivity Rules for Academic Use File
A cell is considered unsafe if:
- Minimum frequency rule: the cell frequency is less than a pre-specified minimum frequency n (the common choice is n = 3).
- (n, k) rule: the sum of the n largest contributions exceeds k% of the cell total, e.g. $x_1 + x_2 + \dots + x_n > \frac{k}{100} X$.
- p% rule: the cell total minus the 2 largest contributions $x_1$ and $x_2$ is less than p% of the largest contribution, e.g. $X - x_1 - x_2 < \frac{p}{100} x_1$.
* Reference 1: A Network of Excellence in the European Statistical System in the field of Statistical Disclosure Control (ESSNet SDC); NSTAC Working Paper, No.10
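The three rules can be sketched as simple predicates on the list of contributions to a cell. The parameter defaults below (n = 2, k = 85, p = 10) and the example cells are illustrative assumptions; the slide only states n = 3 as a common minimum frequency.

```python
def minimum_frequency_unsafe(contributions, n_min=3):
    """Unsafe if the cell has fewer than n_min contributors."""
    return len(contributions) < n_min

def nk_rule_unsafe(contributions, n=2, k=85.0):
    """Unsafe if the n largest contributions exceed k% of the cell total."""
    ordered = sorted(contributions, reverse=True)
    return sum(ordered[:n]) > k / 100.0 * sum(ordered)

def p_percent_unsafe(contributions, p=10.0):
    """Unsafe if total minus the two largest contributions is below p% of the largest."""
    ordered = sorted(contributions, reverse=True)
    if len(ordered) < 2:
        return True  # a single contributor is trivially disclosed
    x1, x2 = ordered[0], ordered[1]
    return sum(ordered) - x1 - x2 < p / 100.0 * x1

balanced_cell = [60, 25, 10, 5]   # hypothetical cell with total 100
dominated_cell = [90, 6, 3, 1]    # hypothetical cell dominated by one contributor
for cell in (balanced_cell, dominated_cell):
    print(cell, minimum_frequency_unsafe(cell), nk_rule_unsafe(cell), p_percent_unsafe(cell))
```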
Combination when N=20 (trial): Is AUF safe or unsafe?
Requirement of combination, assuming each variable is an integer: $x_1 + x_2 + \dots + x_{19} + x_{20} = 100$
Range of each variable (N-th maximum and minimum value):
Max.: 100, 50, 33, 25, 20, 16, 14, 12, 11, 10, 9, 8, 7, 7, 6, 6, 5, 5, 5, 5
Mini.: 0 (19 values listed on the slide)
Number of combinations: 97,132,873 patterns
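If the requirement is read as N non-negative integers in non-increasing order that sum to 100 (an interpretation inferred from the listed maxima 100, 50, 33, ..., not stated explicitly in the transcript), the number of combinations is the number of integer partitions of 100 into at most N parts. A small dynamic-programming sketch, with the illustrative name `count_combinations`:

```python
def count_combinations(total, parts):
    """Count partitions of `total` into at most `parts` non-negative integer parts.

    Uses the classic identity that partitions into at most k parts are in
    one-to-one correspondence with partitions whose parts are all <= k.
    """
    ways = [1] + [0] * total          # ways[s]: partitions of s with the part sizes seen so far
    for size in range(1, parts + 1):  # allow part sizes 1..parts
        for s in range(size, total + 1):
            ways[s] += ways[s - size]
    return ways[total]

for n in (2, 3, 4, 20):
    print(n, count_combinations(100, n))
```

The output for N = 20 can be compared with the 97,132,873 patterns reported on the slide.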
Number of combinations by frequency

| Frequency | # of combinations | X1 Max. | X1 Mini. | Mode of X1 | # of combinations at mode | Ratio (%) |
|---|---|---|---|---|---|---|
| 20 | 97,132,873 | 100 | 5 | 20 | 5,498,387 | 5.7 |
| 19 | 86,658,411 | 100 | 6 | 20 | 4,865,168 | 5.6 |
| 18 | 75,772,412 | 100 | 6 | 21 | 4,214,238 | 5.6 |
| 17 | 64,684,584 | 100 | 6 | 21 | 3,565,531 | 5.5 |
| 16 | 53,662,038 | 100 | 7 | 22 | 2,914,553 | 5.4 |
| 15 | 43,018,955 | 100 | 7 | 22 | 2,307,803 | 5.4 |
| 14 | 33,097,743 | 100 | 8 | 23 | 1,743,387 | 5.3 |
| 13 | 24,234,058 | 100 | 8 | 23 | 1,248,206 | 5.2 |
| 12 | 16,713,148 | 100 | 9 | 24 | 840,657 | 5.0 |
| 11 | 10,718,685 | 100 | 10 | 25 | 523,214 | 4.9 |
| 10 | 6,292,069 | 100 | 10 | 27 | 296,777 | 4.7 |
| 9 | 3,314,203 | 100 | 12 | 28 | 149,938 | 4.5 |
| 8 | 1,527,675 | 100 | 12 | 30 | 65,728 | 4.3 |
| 7 | 596,763 | 100 | 15 | 32 | 24,135 | 4.0 |
| 6 | 189,509 | 100 | 17 | 36 | 7,119 | 3.8 |
| 5 | 46,262 | 100 | 20 | 40 | 1,592 | 3.4 |
| 4 | 8,037 | 100 | 25 | 46 | 251 | 3.1 |
| 3 | 884 | 100 | 34 | 50 | 26 | 2.9 |
| 2 | 51 | 100 | 50 | | 1 | 2.0 |

X1 is the largest variable in a combination.
Unsafe combinations by p% rule (freq. N=20)

| X1 | freq. | X1 | freq. | X1 | freq. | X1 | freq. | X1 | freq. |
|---|---|---|---|---|---|---|---|---|---|
| 100 | 1 | 89 | 19 | 78 | 195 | 67 | 373 | 56 | 195 |
| 99 | 1 | 88 | 30 | 77 | 195 | 66 | 373 | 55 | 139 |
| 98 | 2 | 87 | 30 | 76 | 272 | 65 | 272 | 54 | 139 |
| 97 | 2 | 86 | 45 | 75 | 272 | 64 | 272 | 53 | 139 |
| 96 | 4 | 85 | 45 | 74 | 373 | 63 | 272 | 52 | 139 |
| 95 | 4 | 84 | 67 | 73 | 373 | 62 | 272 | 51 | 139 |
| 94 | 7 | 83 | 67 | 72 | 508 | 61 | 272 | 50 | 97 |
| 93 | 7 | 82 | 97 | 71 | 508 | 60 | 195 | 49 | 95 |
| 92 | 12 | 81 | 97 | 70 | 373 | 59 | 195 | 48 | 90 |
| 91 | 12 | 80 | 139 | 69 | 373 | 58 | 195 | 47 | 78 |
| 90 | 19 | 79 | 139 | 68 | 373 | 57 | 195 | 46 | 52 |

Total: 8,849
Only 8,849 sets in 97,132,873 sets (0.01%) are unsafe.
Maximum number of combinations grouping freq. N=20 combinations by each statistic

| Statistic | N=30 | N=20 | Difference |
|---|---|---|---|
| StDev | 331,258 | 223,627 | 107,631 |
| Skew | 873 | 550 | 323 |
| Kurt | 16 | 23 | 7 |
| Median | 23 | 12 | 11 |
| Intruder | 22 | 9 | 13 |
Range of each statistic grouping by SD, Skewness and Kurtosis (freq. N=20)

| Count | freq. | Max. SD | Mini. SD | Max. Skew. | Mini. Skew. | Max. Kurt | Mini. Kurt |
|---|---|---|---|---|---|---|---|
| 16 | 2 | 4.180153611 | 3.741657387 | 0.57644737 | 0.361706944 | -0.587458267 | -0.840557276 |
| 15 | 13 | 4.565315462 | 3.524351377 | 0.964551851 | 0.192366206 | 0.309376382 | -1.054150134 |
| 14 | 64 | 4.565315462 | 3.324549831 | 0.955694047 | 0.145282975 | 0.530688627 | -1.231641448 |
| 13 | 155 | 4.963021151 | 3.077935056 | 0.901252349 | 0.124964037 | 0.334557548 | -1.371295887 |
| 12 | 445 | 5.211323703 | 2.901905 | 1.105031963 | 0.083177673 | 0.882366328 | -1.325211176 |
| 11 | 1,153 | 5.380275868 | 2.91998558 | 1.35646267 | 0.060994516 | 1.329705653 | -1.43192732 |
| 10 | 3,233 | 5.830951895 | 2.695024656 | 1.50255637 | -0.009153874 | 2.780946447 | -1.463372549 |
| 9 | 8,186 | 5.938279035 | 2.675424216 | 2.643593746 | -0.054613964 | 8.856069776 | -1.506246692 |
| 8 | 20,059 | 6.316228055 | 2.533979604 | 2.790656548 | -0.129051837 | 9.651964674 | -1.619289547 |
| 7 | 46,302 | 6.844129255 | 2.406132516 | 3.167347883 | -0.377822461 | 11.83376246 | -1.677127049 |
| 6 | 109,605 | 7.813618341 | 2.339590607 | 3.529878009 | -0.44212357 | 14.2766552 | -1.701512451 |
| 5 | 269,146 | 8.926601286 | 2.152110347 | 3.859596389 | -0.595837348 | 16.21353047 | -1.755565919 |
| 4 | 718,999 | 10.35679284 | 1.91942974 | 4.019587737 | -0.71911404 | 17.17271192 | -1.858246651 |
| 3 | 2,210,969 | 13.58404637 | 1.716790151 | 4.237554114 | -1.141558149 | 18.50919755 | -1.934757558 |
| 2 | 8,524,260 | 15.97366253 | 1.376494403 | 4.412088065 | -1.578947368 | 19.62372574 | -2.036823063 |
6. Comparison between Various Sets of Synthetic Microdata
Comparison of original microdata and each set of synthetic microdata
Column numbers: 1 = Original microdata; 2 = Hierarchization, and kurtosis, skewness and λ of Box-Cox transformation; 3 = Kurtosis and skewness; 4 = Multivariate lognormal random numbers.

| 1: Living expenditure | 1: Food | 2: Living expenditure | 2: Food | 3: Living expenditure | 3: Food | 4: Living expenditure | 4: Food |
|---|---|---|---|---|---|---|---|
| 125,503.5 | 29,496.1 | 110,487.8 | 25,143.0 | 107,684.0 | 23,459.9 | 133,549.9 | 38,559.9 |
| 255,675.9 | 25,806.2 | 232,691.8 | 37,905.5 | 281,880.8 | 56,520.4 | 123,716.6 | 42,930.1 |
| 175,320.4 | 38,278.2 | 213,320.2 | 30,531.9 | 254,267.3 | 37,419.4 | 152,784.8 | 67,263.8 |
| 181,085.6 | 74,122.1 | 183,430.4 | 75,469.1 | 294,589.9 | 112,843.9 | 195,764.8 | 8,286.1 |
| 124,471.0 | 33,256.8 | 134,867.6 | 39,568.9 | 193,191.6 | 54,363.3 | 202,865.8 | 75,558.0 |
| 145,717.7 | 46,992.8 | 132,976.4 | 39,333.7 | 189,242.7 | 53,980.3 | 193,003.4 | 70,994.2 |
| 319,114.3 | 113,177.1 | 242,622.5 | 68,472.2 | 151,183.6 | 55,303.2 | 191,620.1 | 52,311.7 |
| 253,685.2 | 67,253.6 | 320,055.9 | 113,008.5 | 271,338.1 | 79,991.4 | 72,773.7 | 13,621.6 |
| 236,447.6 | 61,129.8 | 246,568.6 | 60,079.7 | 157,306.9 | 50,650.9 | 201,114.6 | 74,899.0 |
| 137,315.3 | 27,050.1 | 144,192.6 | 32,572.9 | 167,431.0 | 36,116.3 | 217,530.7 | 60,736.0 |
| 253,393.7 | 47,205.6 | 267,708.8 | 60,344.8 | 270,301.8 | 78,246.4 | 297,608.7 | 77,464.3 |
| 232,141.8 | 52,259.6 | 212,050.7 | 37,656.3 | 223,946.8 | 43,827.9 | 175,993.6 | 71,416.6 |
| 214,540.4 | 54,920.9 | 213,439.1 | 50,862.2 | 225,103.2 | 63,861.2 | 297,653.0 | 86,400.5 |
| 234,151.4 | 74,993.0 | 205,595.0 | 73,919.1 | 165,972.3 | 49,350.6 | 123,197.1 | 31,645.5 |
| 278,431.0 | 78,916.1 | 282,652.7 | 79,126.9 | 249,749.1 | 73,474.1 | 277,501.6 | 69,910.5 |
| 197,180.8 | 72,909.6 | 221,515.6 | 73,772.7 | 183,281.1 | 48,672.3 | 235,221.1 | 58,700.6 |
| 118,895.1 | 48,821.6 | 127,964.3 | 50,240.7 | 115,639.3 | 71,059.5 | 182,363.2 | 49,433.2 |
| 130,482.8 | 47,798.5 | 159,328.0 | 48,533.5 | 170,231.1 | 38,723.5 | 158,939.4 | 45,131.8 |
| 147,969.1 | 50,277.9 | 133,795.5 | 47,660.6 | 125,789.2 | 22,188.5 | 212,194.2 | 37,995.6 |
| 150,973.7 | 48,291.0 | 127,232.9 | 48,754.2 | 114,366.4 | 42,903.1 | 267,100.1 | 59,697.3 |
Judging from the indicators in the table below, the most useful microdata are in column number 2.
Column numbers: 1 = Original microdata; 2 = Hierarchization, and kurtosis, skewness and λ of Box-Cox transformation; 3 = Kurtosis and skewness; 4 = Multivariate lognormal random numbers.

| Indicator | 1: Living expenditure | 1: Food | 2: Living expenditure | 2: Food | 3: Living expenditure | 3: Food | 4: Living expenditure | 4: Food |
|---|---|---|---|---|---|---|---|---|
| Mean | 195,624.8 | 54,647.8 | 195,624.8 | 54,647.8 | 195,624.8 | 54,647.8 | 195,624.8 | 54,647.8 |
| Standard deviation | 59,892.6 | 21,218.1 | 59,892.6 | 21,218.1 | 59,892.6 | 21,218.1 | 59,892.6 | 21,218.1 |
| Kurtosis | -1.004164 | 1.628974 | -0.810215 | 1.473853 | -1.220185 | 1.721354 | -0.212358 | -0.052164 |
| Skewness | 0.346305 | 0.992579 | 0.310913 | 1.050568 | 0.160612 | 0.949106 | 0.035785 | -0.709361 |
| Maximum value | 319,114.3 | 113,177.1 | 320,055.9 | 113,008.5 | 294,589.9 | 112,843.9 | 297,653.0 | 86,400.5 |
| Minimum value | 118,895.1 | 25,806.2 | 110,487.8 | 25,143.0 | 107,684.0 | 22,188.5 | 72,773.7 | 8,286.1 |

Correlation coefficients (living expenditure and food): column 1: 0.642511; column 2: 0.689447; column 3: 0.642511; column 4: 0.642511.
Note that for reference, column number 4 is the same as the trial synthetic microdata method.
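Since the mean and standard deviation are identical across all sets, the sets differ mainly in their shape indicators. One simple way to compare them, offered here as an illustrative metric rather than the authors' criterion, is the summed absolute gap in kurtosis and skewness relative to the original microdata (values copied from the indicator table; `shape_gap` is a hypothetical helper name):

```python
# (kurtosis, skewness) per variable, copied from the indicator table.
original = {"living": (-1.004164, 0.346305), "food": (1.628974, 0.992579)}
synthetic_sets = {
    2: {"living": (-0.810215, 0.310913), "food": (1.473853, 1.050568)},
    3: {"living": (-1.220185, 0.160612), "food": (1.721354, 0.949106)},
    4: {"living": (-0.212358, 0.035785), "food": (-0.052164, -0.709361)},
}

def shape_gap(candidate):
    """Sum of absolute differences in kurtosis and skewness against the original."""
    return sum(
        abs(c - o)
        for variable in original
        for c, o in zip(candidate[variable], original[variable])
    )

gaps = {column: shape_gap(stats) for column, stats in synthetic_sets.items()}
best = min(gaps, key=gaps.get)
for column, gap in sorted(gaps.items()):
    print(f"column {column}: total shape gap = {gap:.3f}")
print("closest to the original:", best)
```

Under this metric column 2 has the smallest gap, although its correlation coefficient (0.689447) deviates from the original more than those of columns 3 and 4.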
[Figure: Scatter plots of living expenditure (x-axis) and food (y-axis) for each microdata set. Series: Original microdata; Column number 2 (Hierarchization, and kurtosis, skewness and λ of Box-Cox transformation); Column number 3 (Kurtosis and skewness); Column number 4 (Multivariate lognormal random numbers).]
Example Result Table for Academic Use File

| No. | A | B | C | D | E | F | Living expenditure: Frequency | Living expenditure: Mean | Living expenditure: SD | Food: Frequency | Food: Mean | Food: SD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 1 | 2 | 5 | 1 | 3 | 185,499.9 | 65,680.5 | 3 | 31,193.5 | 6,406.9 |
| 2 | 2 | 1 | 1 | 3 | | | 6 | 210,086.9 | 73,208 | 6 | 65,988.7 | 27,387.3 |
| | 2 | 1 | 1 | 3 | 6 | 1 | 3 | 150,424.8 | 28,599.3 | 3 | 51,457.2 | 20,795.2 |
| | 2 | 1 | 1 | 3 | 7 | 1 | 3 | 269,749.0 | 43,611.7 | 3 | 80,520.1 | 28,447.0 |
| 3 | 3 | 1 | 1 | 1 | | | 7 | 221,022.1 | 45,197.7 | 7 | 58,322.1 | 18,550.2 |
| | 3 | 1 | 1 | 1 | 5 | 1 | 4 | 209,347.8 | 50,580.8 | 4 | 45,359.0 | 12,618.4 |
| | 3 | 1 | 1 | 1 | 6 | 1 | 3 | 236,587.8 | 40,679.9 | 3 | 75,606.2 | 3,049.8 |
| 4 | 3 | 1 | 1 | 2 | 5 | 1 | 4 | 137,080.2 | 15,119.7 | 4 | 48,797.2 | 1,071.9 |

| Indicator | Living expenditure | Food |
|---|---|---|
| Mean | 195,624.8 | 54,647.8 |
| Standard deviation | 59,892.6 | 21,218.1 |
| Kurtosis | -1.004 | 1.629 |
| Skewness | 0.346 | 0.993 |
| Correlation coefficient | 0.643 | |

A: 5-year age groups; B: employment/unemployed; C: company classification; D: company size; E: industry code; F: occupation code
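In the example result table, the row with frequency 6 (living expenditure mean 210,086.9, SD 73,208) appears to be a subtotal of the two rows with frequency 3 that follow it (means 150,424.8 and 269,749.0). The sketch below is a consistency check under that reading, not the authors' procedure. It pools group frequencies, means, and SDs, assuming the SDs use an n - 1 denominator. The function name `pool` is hypothetical.

```python
import math

def pool(groups):
    """Combine (frequency, mean, sd) triples into one overall triple.

    Assumes each sd was computed with an n - 1 denominator.
    """
    n = sum(k for k, _, _ in groups)
    mean = sum(k * m for k, m, _ in groups) / n
    # Within-group plus between-group sums of squares.
    ss = sum((k - 1) * s ** 2 + k * (m - mean) ** 2 for k, m, s in groups)
    return n, mean, math.sqrt(ss / (n - 1))

# Living expenditure, the two detail rows with frequency 3.
detail_rows = [(3, 150424.8, 28599.3), (3, 269749.0, 43611.7)]
n, mean, sd = pool(detail_rows)
print(n, round(mean, 1), round(sd, 1))
```

Under these assumptions the pooled values agree with the tabulated subtotal (frequency 6, mean 210,086.9, SD about 73,208), which supports reading the rows without E and F codes as subtotals over those two dimensions.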
7. Conclusions and Future Outlook

Conclusions
1. We suggested improvements to synthetic microdata created by the National Statistics Center for statistics education and training.
2. We created new synthetic microdata using several methods that adhere to this disclosure limitation method.
3. The results show that kurtosis, skewness, and the Box-Cox transformation are useful for creating synthetic microdata, in addition to frequency, mean, standard deviation, and correlation coefficient, which have previously been used as indicators.

Next Steps
1. Decide the number of cross fields (dimensionality) of the basic table and details table, and the style (indicators to tabulate) of the result table, according to the statistical fields in the public survey.
2. Expand this work to the creation and improvement of synthetic microdata from other surveys.
Thank you for your attention.
Creating the Academic Use File
Synthetic microdata based on frequency, SD, skewness, and kurtosis

[Flow diagram: creation of the Academic Use File]
Original data: individual data are tabulated into tables (tabular items: frequency, SD, skewness, kurtosis).
Making the Academic Use File:
Combination-by-frequency DB (freq., SD, skewness, kurtosis): store all combinations by frequency; transform the total to 100.
Extract candidate data: because each value is an integer, extract by each statistic.
Narrow and collate: SD and skewness are the same; kurtosis is approximate.
Decide the combination of data: approximation by GRG non-linear optimization using the MS-Excel Solver.
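The diagram describes storing integer combinations whose total is scaled to 100 and extracting candidates whose statistics match a tabulated cell. The sketch below is one possible reading of that idea, not the authors' procedure. It replaces the GRG non-linear step in the MS-Excel Solver with a plain exhaustive search, uses moment-ratio formulas for skewness and excess kurtosis, and measures closeness with an unweighted squared distance. All of these are assumptions, and all function names are hypothetical.

```python
import math

def shape_stats(values):
    """Return (SD, skewness, excess kurtosis) using n-denominator moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        return None                      # all values equal: shape undefined
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return (math.sqrt(m2), m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0)

def integer_combinations(total, parts, minimum=1):
    """Yield non-decreasing tuples of `parts` positive integers summing to `total`."""
    if parts == 1:
        if total >= minimum:
            yield (total,)
        return
    for first in range(minimum, total // parts + 1):
        for rest in integer_combinations(total - first, parts - 1, first):
            yield (first,) + rest

def closest_combination(target, total=100, parts=4):
    """Return the combination whose statistics are nearest to `target`."""
    best, best_dist = None, math.inf
    for combo in integer_combinations(total, parts):
        stats = shape_stats(combo)
        if stats is None:
            continue
        dist = sum((s - t) ** 2 for s, t in zip(stats, target))
        if dist < best_dist:
            best, best_dist = combo, dist
    return best

# Hypothetical target: the statistics of a known combination of four values.
target = shape_stats((10, 20, 30, 40))
print(closest_combination(target))
```

The frequency of the cell fixes `parts`, so one such search (or one stored lookup table) would be needed per frequency. A continuous optimizer such as the Solver would then only have to refine the selected candidate.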
Rules for output checking

| Type of statistics | Type of output | Classification |
|---|---|---|
| Descriptive statistics | Frequency tables | Unsafe |
| Descriptive statistics | Magnitude tables | Unsafe |
| Descriptive statistics | Maxima, minima and percentiles (incl. median) | Unsafe |
| Descriptive statistics | Mode | Safe |
| Descriptive statistics | Means, indices, ratios, indicators | Unsafe |
| Descriptive statistics | Concentration ratios | Safe |
| Descriptive statistics | Higher moments of distributions (incl. variance, covariance, kurtosis, skewness) | Safe |
| Descriptive statistics | Graphs: pictorial representations of actual data | Unsafe |
| Correlation and regression analysis | Linear regression coefficients | Safe |
| Correlation and regression analysis | Non-linear regression coefficients | Safe |
| Correlation and regression analysis | Estimation residuals | Unsafe |
| Correlation and regression analysis | Summary and test statistics from estimates (R², χ², etc.) | Safe |
| Correlation and regression analysis | Correlation coefficients | Safe |

Brandt, M., Franconi, L., Guerke, C., Hundepool, A., Lucarelli, M., Mol, J., Ritchie, F., Seri, G. and Welpton, R. (2010) Guidelines for the checking of output based on microdata research. Project Report. ESSnet SDC.