Anonymized Taxpayer Data: Balancing Privacy and Publicity

Explore how anonymized taxpayer information is published while respecting data privacy laws and statistical properties, using a methodology that reconciles maximum publicity with information privacy.

  • Anonymization
  • Taxpayer Data
  • Privacy
  • Publicity
  • Statistical Properties


Presentation Transcript


  1. SAT más abierto: Fiscal Data Anonymization

  2. Objective: Publish anonymized taxpayer information in a way that respects the legal provisions protecting the confidentiality of personal data and maintains the statistical properties of the data, both over time and across the cross-section. How did we do it?

  3. Introduction: Fiscal and administrative data serve several users.
     • Policymakers: input for decision making; public policy generation and evaluation.
     • Researchers: microdata has greater analytical value than aggregate data.
     • Government: registration and/or control of taxpayer activity.
     The dilemma: personal data protection versus maximum publicity.

  4. Introduction: We use a methodology that reconciles maximum publicity with the privacy of information. With this methodology we create a panel of microdata with selected Income Tax variables. The expected result is to preserve the statistical properties of the original data.

  5. Introduction: How to solve the dilemma? Two approaches to data anonymization were examined:
     • Randomization: modifies the veracity of the data in order to eliminate the link between records and individuals.
     • Generalization: consists of spatial, temporal and monetary groupings, among others, in order to dilute the attributes of individuals.
     The goal is to protect taxpayer privacy while maintaining the usefulness of the data.

  6. Data: The information comes from annual tax returns, 2011 to 2015, with an update every two years. The variables are components of the income tax determination:
     • Corporations: 36 variables (deductions, expenses, investments, among other relevant information)
     • Individuals: 43 variables

     Number of anonymized records from annual tax returns
     Fiscal year   Individual records   Corporation records
     2010          3,173,063            520,627
     2011          3,539,472            560,293
     2012          3,858,513            596,079
     2013          4,163,485            616,269
     2014          4,512,515            701,203
     2015          4,666,323            707,213
     TOTAL         23,913,371           3,701,684
     Source: Servicio de Administración Tributaria

  7. Methodology: Additive noise. The vector of anonymized data is constructed as $y = x + \varepsilon$, where:
     • $y$ is the transformed data vector;
     • $x$ is the original data vector;
     • $\varepsilon$ is the random variable (noise), with distribution $\varepsilon \sim N(\mathbf{0}, d\Sigma)$.
     Random noise: the variance-covariance matrix of the original data is obtained, $\mathrm{Cov}(x) = \Sigma$. The noise $\varepsilon$ is then generated computationally with $E(\varepsilon) = \mathbf{0}$ and $\mathrm{Cov}(\varepsilon) = d\Sigma$, where $d$ is a scalar selected ad hoc by the researcher.
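As a rough illustration of the noise-addition step on this slide, the sketch below adds zero-mean Gaussian noise whose covariance is a scalar multiple of the covariance of the original data. It is not the SAT implementation: the function names, the value d = 0.01, and the synthetic income and deductions figures are all assumptions made for the example, and the Cholesky factorization is simply one common way to draw correlated normal noise.

```python
import math
import random


def covariance_matrix(rows):
    """Sample variance-covariance matrix of a list of equal-length records."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(k)] for i in range(k)]


def cholesky(a):
    """Lower-triangular L with L * L^T = a; ValueError if a is not positive definite."""
    k = len(a)
    low = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(low[i][m] * low[j][m] for m in range(j))
            if i == j:
                diag = a[i][i] - s
                if diag <= 0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(diag)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def add_noise(rows, d, rng):
    """Return y = x + e with e ~ N(0, d * Sigma), where Sigma = Cov(x)."""
    sigma = covariance_matrix(rows)
    k = len(sigma)
    low = cholesky([[d * sigma[i][j] for j in range(k)] for i in range(k)])
    noisy = []
    for r in rows:
        z = [rng.gauss(0.0, 1.0) for _ in range(k)]          # independent N(0, 1)
        e = [sum(low[i][m] * z[m] for m in range(i + 1)) for i in range(k)]
        noisy.append([r[i] + e[i] for i in range(k)])
    return noisy


# Synthetic records (made up, not SAT data): [income, deductions].
rng = random.Random(0)
original = []
for _ in range(5000):
    income = rng.lognormvariate(10, 1)
    original.append([income, 0.6 * income + rng.gauss(0, 5000)])

anonymized = add_noise(original, d=0.01, rng=rng)  # d chosen arbitrarily here
```

If the noise is independent of the original data, Cov(y) = Σ + dΣ = (1 + d)Σ, so variances are inflated by the factor 1 + d while Pearson correlations are unchanged in expectation. The `ValueError` raised by `cholesky` corresponds to the case where the covariance matrix is not positive definite.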

  8. Heuristic Solution: The anonymization process is a heuristic solution. Drawing on knowledge of the data, micro-adjustments are made to obtain results that are robust and compatible with the real data.
     [Figure: anonymized income; axis from -300,000 to 300,000]

  9. Heuristic Solution
     • Nulls and zeros: The variables are standardized by converting all null values to zero. Taxpayers that reported "0" in every variable are set aside.
     • Selected data: The process was carried out separately for each fiscal year.
     • Exclusion of atypical data: "Total Cumulative Revenue" is taken as the pivot variable, and observations above the mean plus three standard deviations ($\mu + 3\sigma$) are excluded.
     • Partition: The databases are stratified by income, because of the large differences within income and other variables.
     • Special treatments: If the variance-covariance matrix is not positive definite, the variable causing the problem is excluded and treated separately.
     • Scalar: One scalar $d$ was selected per stratum, so that $\mathrm{Cov}(\varepsilon) = d\Sigma$.
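The preparation steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (a list of dicts), the function name `preprocess`, the key `income` standing in for the pivot variable, and the equal-size strata are all hypothetical. The slide does not say how the stratum boundaries were actually chosen.

```python
import statistics


def preprocess(records, pivot, n_strata):
    """Clean records and split them into strata by the pivot variable.

    records: list of dicts mapping variable name -> number or None.
    Returns (all_zero, strata); strata is a list of lists of cleaned records.
    """
    # Nulls and zeros: convert every null to zero.
    cleaned = [{k: 0.0 if v is None else float(v) for k, v in r.items()}
               for r in records]
    # Set aside taxpayers that report zero in every variable.
    all_zero = [r for r in cleaned if all(v == 0 for v in r.values())]
    active = [r for r in cleaned if any(v != 0 for v in r.values())]
    # Exclusion of atypical data: drop pivot values above mean + 3 std. dev.
    values = [r[pivot] for r in active]
    cutoff = statistics.mean(values) + 3 * statistics.pstdev(values)
    kept = sorted((r for r in active if r[pivot] <= cutoff),
                  key=lambda r: r[pivot])
    # Partition: equal-size strata ordered by the pivot (an assumption).
    size = max(1, -(-len(kept) // n_strata))  # ceiling division
    strata = [kept[i:i + size] for i in range(0, len(kept), size)]
    return all_zero, strata


# Made-up records: 20 ordinary filers, one extreme value, two empty returns.
records = [{"income": 100 * i, "deductions": None if i % 2 else 10 * i}
           for i in range(1, 21)]
records.append({"income": 1_000_000, "deductions": 500})
records += [{"income": None, "deductions": 0},
            {"income": 0, "deductions": None}]

all_zero, strata = preprocess(records, pivot="income", n_strata=4)
print(len(all_zero), [len(s) for s in strata])
```

Each stratum would then receive its own noise scalar, and the all-zero records are kept apart so they do not distort the covariance estimate.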

  10. Results: Generally speaking, the statistics of the original variables are very similar to those of their anonymized counterparts. For example, the income means differ by 0.3%, while the means of deductions and of tax payable for the fiscal year differ by 0.9% and 5%, respectively.

     Comparison of basic statistics, fiscal year 2015 (population data)
                      Income                    Deductions                Taxes to charge
     Statistic        Original     Anonymised   Original     Anonymised   Original     Anonymised
     Max (millions)   10,493       10,704       128,187      128,097      2,183        2,189
     Min              0            0            0            0            0            0
     Q1               -            -            93,987       14,741       -            -
     Median           684,250      668,295      1,933,570    1,970,640    -            -
     Q3               7,077,461    6,955,911    10,765,422   10,383,603   4,251        322
     Mean             29,421,101   29,500,515   35,152,400   35,461,566   486,516      511,020
     N                707,213      707,213      569,932      569,932      707,213      707,213
     Std. Dev.        226,598,954  227,290,288  296,595,661  297,686,999  9,402,328    9,431,887
     Sum (millions)   20,806,985   20,863,148   20,034,478   20,210,681   344,070      361,400
     Source: Servicio de Administración Tributaria

     The medians vary by about 2% (2.3% for income and 1.9% for deductions). The standard deviations of the three selected variables vary by around 0.3%. The quartiles show greater variation, which indicates that the anonymization process is not harmless and does affect nonparametric results.
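The percentage differences quoted on this slide can be recomputed directly from the fiscal year 2015 population figures. The short sketch below does that arithmetic; the helper name `pct_diff` and the dictionary layout are illustrative choices, and the numbers are copied from the comparison of basic statistics.

```python
def pct_diff(original, anonymized):
    """Absolute difference as a percentage of the original value."""
    return abs(anonymized - original) / original * 100


# (original, anonymized) pairs from the fiscal year 2015 population data.
stats_2015 = {
    "mean income": (29_421_101, 29_500_515),
    "mean deductions": (35_152_400, 35_461_566),
    "mean taxes to charge": (486_516, 511_020),
    "median income": (684_250, 668_295),
    "median deductions": (1_933_570, 1_970_640),
    "Q3 income": (7_077_461, 6_955_911),
    "Q3 deductions": (10_765_422, 10_383_603),
    "Q3 taxes to charge": (4_251, 322),
}

differences = {name: pct_diff(o, a) for name, (o, a) in stats_2015.items()}
for name, value in differences.items():
    print(f"{name}: {value:.1f}%")
```

The means move by roughly 0.3%, 0.9% and 5%, while the third quartile of taxes to charge falls from 4,251 to 322, which is the kind of shift that matters for nonparametric statistics.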

  11. Results

     Tests of equality (p-values), fiscal year 2015 (population data)
     Test             Income       Deductions   Taxes to charge
     Means            0.8352       0.5786       0.1218
     Wilcoxon         0.6720       0.0000       0.5885
     Variances        0.8880       0.9870       0.9720
     Source: Servicio de Administración Tributaria

     To verify that the means are not statistically different, Student's t-test for independent samples was performed; there was not sufficient evidence to conclude that the means differ at the 5% significance level. The standard deviations of the original and anonymized variables are likewise not statistically different at the same significance level. To reinforce the results on equality of means and variances, the Wilcoxon signed-rank test was performed. It does not reject equality of the original and anonymized series for income and for the tax to be charged, but it rejects equality in the case of deductions.
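For readers who want to reproduce this kind of check, here is a minimal sketch of a two-sample test of means and a Wilcoxon signed-rank test using only the Python standard library. It relies on assumptions that are not taken from the slides: both p-values use a large-sample normal approximation (not the exact Student t distribution), the Wilcoxon variance has no tie correction, and the function names and synthetic data are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, variance


def two_sample_mean_test(x, y):
    """Two-sided p-value for H0: equal means (Welch statistic, normal approx.)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    z = (mean(x) - mean(y)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def wilcoxon_signed_rank(x, y):
    """Two-sided p-value for paired samples (normal approx., average ranks)."""
    diffs = [b - a for a, b in zip(x, y) if b != a]   # zero differences dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for m in range(i, j + 1):
            ranks[order[m]] = (i + j) / 2 + 1          # average 1-based rank
        i = j + 1
    w_plus = sum(r for r, dv in zip(ranks, diffs) if dv > 0)
    mu = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return 2 * (1 - NormalDist().cdf(abs((w_plus - mu) / sd)))


# Synthetic paired data (made up): original values and a noisy copy.
rng = random.Random(1)
x = [rng.lognormvariate(10, 1) for _ in range(2000)]
y = [v + rng.gauss(0, 1000) for v in x]               # zero-mean noise
print(two_sample_mean_test(x, y), wilcoxon_signed_rank(x, y))
```

A p-value above 0.05 means equality is not rejected at the 5% level. The signed-rank test works on the paired differences, so it can reject equality even when the means and variances look alike.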

  12. Results: Ordinary least squares regressions were performed between the selected variables (Income = f(Deductions), Income = f(Taxes), Deductions = f(Taxes), etc.), regressing original on original and anonymized on anonymized.

     Comparison of correlations, fiscal year 2015 (population data)
                      Income                    Deductions                Taxes to charge
                      Original     Anonymised   Original     Anonymised   Original     Anonymised
     Regression coefficients
     Income           1            1            0.9368       0.9371       0.0218       0.0217
     Deductions       0.6730       0.6723       1            1            0.0114       0.0113
     Taxes to charge  12.6348      12.6045      9.2982       9.2921       1            1
     R²
     Income           1            1            0.6304       0.6300       0.2708       0.2736
     Deductions       0.6304       0.6300       1            1            0.1055       0.1053
     Taxes to charge  0.2748       0.2736       0.1055       0.1053       1            1
     Pearson correlation
     Income           1            1            0.7940       0.7937       0.5243       0.5231
     Deductions       0.7940       0.7937       1            1            0.3248       0.3245
     Taxes to charge  0.5243       0.5231       0.3248       0.3245       1            1
     Spearman correlation
     Income           1            1            0.9834       0.9484       0.5628       0.5091
     Deductions       0.9834       0.9484       1            1            0.4791       0.4026
     Taxes to charge  0.5628       0.5091       0.4791       0.4026       1            1
     Source: Servicio de Administración Tributaria

     Similarly, the Pearson correlation was obtained; in the population data its variations are below 0.3% (0.4% to 1.2% in the sample data). Spearman's nonparametric rank correlation was also computed; it varies between 3.6% and 16% in the population data (3% to 16.8% in the sample data).
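The gap between the Pearson and Spearman results has a simple intuition: in highly skewed data, small perturbations can reorder many low values (changing ranks) while the linear correlation stays dominated by the few very large values. The toy example below illustrates this with hand-made numbers. The data and function names are hypothetical, and Spearman is computed as the Pearson correlation of average ranks.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for m in range(i, j + 1):
            ranks[order[m]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up skewed data: four small values and one very large one.
income = [1.0, 2.0, 3.0, 4.0, 1000.0]
deductions = [1.0, 2.0, 3.0, 4.0, 1000.0]
# Small perturbations that swap neighbours among the low values.
noisy_deductions = [2.5, 1.5, 4.5, 3.5, 1000.0]

print(pearson(income, deductions), spearman(income, deductions))
print(pearson(income, noisy_deductions), spearman(income, noisy_deductions))
```

Before the perturbation both coefficients equal 1. Afterwards the Pearson coefficient is still almost exactly 1, while the Spearman coefficient drops to 0.8 because two pairs of ranks have been swapped.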

  13. Advantages:
     • Transparency can build greater confidence in institutions.
     • Knowledge is created by allowing analysis of "new information" and evaluation of past phenomena.
     • Information is shared so that fiscal policy can be analysed with microdata.
     Disadvantages / Risks:
     • Categorical data are lost, which prevents analysis by strata, sectors, localities and other groups of interest.
     • The data may appear inconsistent.
     • Incorrect incentives: the published information could affect taxpayer behavior. For example, the proportion of zeros and nulls could create the perception that most taxpayers do not file or do not pay, which could encourage non-compliance with tax obligations.

  14. Conclusions
     • Categorical data was eliminated so that the records cannot be linked with other public databases.
     • Anonymization requires knowing and understanding the data that will be masked.
     • Statistical properties are maintained.
     • The noise is generated so that it is consistent with the aggregates, not at the microdata level.
     • One of the purposes of publishing is to encourage feedback from citizens who use the information, in order to improve the quality of the data.


  16. Annex

  17. Sample Results: The median of income varies by 2% between anonymized and original; for deductions the variation is 0.6%. The third quartile of income is 12.9 million pesos in the original data and 13.0 million in the anonymized data, a difference of 0.6%; for deductions the amounts are 16.9 and 16.9 million pesos, that is, 0% variation. The third quartile of taxes to charge varies by 28%, which shows that the anonymization process is not harmless at the microdata level.

     Comparison of basic statistics, fiscal year 2015 (sample data)
                      Income                    Deductions                Taxes to charge
     Statistic        Original     Anonymised   Original     Anonymised   Original     Anonymised
     Max (millions)   4,357        4,097        3,892        3,725        139          143
     Min              0            0            0            0            0            0
     Q1               0            0            352,898      403,355      0            0
     Median           1,942,755    1,983,231    3,190,841    3,212,584    0            0
     Q3               12,940,180   13,025,318   16,944,614   16,937,072   27,720       20,034
     Mean             46,808,998   46,935,628   50,593,161   50,927,721   986,840      996,981
     N                1,000        1,000        859          859          1,000        1,000
     Std. Dev.        239,703,766  232,967,366  236,701,835  231,089,470  7,841,426    7,961,613
     Sum (millions)   46,809       46,936       43,460       43,747       987          997
     Source: Servicio de Administración Tributaria

  18. Sample Results: The Wilcoxon signed-rank test improves in the sample: we cannot reject the null hypothesis that the distributions of the original and anonymized variables are the same. The tests of means and variances likewise do not allow us to reject equality between the original and anonymized series.

     Tests of equality (p-values), fiscal year 2015 (sample data)
     Test             Income       Deductions   Taxes to charge
     Means            0.9904       0.9764       0.9771
     Wilcoxon         0.7701       0.4415       0.1815
     Variances        0.9420       0.9480       0.9740
     Source: Servicio de Administración Tributaria

     Comparison of correlations, fiscal year 2015 (sample data)
                      Income                    Deductions                Taxes to charge
                      Original     Anonymised   Original     Anonymised   Original     Anonymised
     Regression coefficients
     Income           1            1            0.9145       0.9155       0.0261       0.0269
     Deductions       1.0851       1.0761       1            1            0.0266       0.0274
     Taxes to charge  24.3981      23.0671      20.8858      19.8491      1            1
     R²
     Income           1            1            0.9923       0.9851       0.6370       0.6214
     Deductions       0.9923       0.9851       1            1            0.5559       0.5431
     Taxes to charge  0.6370       0.6214       0.5559       0.5431       1            1
     Pearson correlation
     Income           1            1            0.9961       0.9925       0.7981       0.7883
     Deductions       0.9961       0.9925       1            1            0.7456       0.7370
     Taxes to charge  0.7981       0.7883       0.7456       0.7369       1            1
     Spearman correlation
     Income           1            1            0.9849       0.9550       0.5632       0.5112
     Deductions       0.9849       0.9550       1            1            0.4715       0.3922
     Taxes to charge  0.5632       0.5112       0.4715       0.3922       1            1
     Source: Servicio de Administración Tributaria

  19. Sample Results

     Comparison of panel data results, fiscal year 2015 (sample data)
                      Income                    Deductions                Taxes to charge
     Regressors       Original     Anonymised   Original     Anonymised   Original     Anonymised
     Random effects: Betas
     Income           1.0000       1.0000       0.0109       0.0095       0.2988       0.2968
     Deductions       2.8279       2.6936       1.0000       1.0000       0.5029       0.5374
     Taxes to charge  3.1902       3.1705       0.0212       0.0189       1.0000       1.0000
     Random effects: Constant (values in the order given in the transcript)
     - - 14,873 16,080 -32,080 91,260 -30,388 92,062 392,956 125,240 391,821 - 120,872 - 17,521 18,363 - -
     Random effects: Rho
     Income           1.0000       1.0000       0.5736       0.5470       0.2811       0.2100
     Deductions       0.7092       0.6549       1.0000       1.0000       0.6836       0.6305
     Taxes to charge  0.3360       0.2559       0.5868       0.5576       1.0000       1.0000
     Fixed effects: Betas
     Income           1.0000       1.0000       0.0057       0.0051       0.3088       0.3046
     Deductions       1.5955       1.5429       1.0000       1.0000       0.1473       0.1600
     Taxes to charge  3.0162       2.9886       0.0053       0.0051       1.0000       1.0000
     Fixed effects: Constant (values in the order given in the transcript)
     - 418,506 139,156 - 17,181 18,125 - 36,634 98,253 - 33,988 99,725 417,172 144,071 - - 19,123 19,784 - -
     Fixed effects: Rho
     Income           1.0000       1.0000       0.6381       0.6146       0.4118       0.3332
     Deductions       0.7603       0.7115       1.0000       1.0000       0.7344       0.6870
     Taxes to charge  0.5601       0.4730       0.6514       0.6245       1.0000       1.0000
     Source: Servicio de Administración Tributaria
