Italian Examples of the use of big data for producing statistics


The Italian National Institute of Statistics (Istat) established a Technical Commission to guide investments in Big Data adoption in statistical production processes. The Commission developed a roadmap that combined a top-down analysis of the state of the art with bottom-up experiments. Short-term actions include exploring Big Data sources such as online search data, Internet-scraped data, mobile phone data, and scanner data in domains such as labour force, ICT usage, price, mobility, and tourism statistics.

  • Big Data
  • Statistics
  • Italy
  • Istat
  • Eurostat

Uploaded on Mar 08, 2025 | 0 Views



Presentation Transcript


  1. Italian Examples of the use of big data for producing statistics
     Monica Scannapieco
     The contractor is acting under a framework contract concluded with the Commission.
     Eurostat

  2. Istat Big Data Strategy - 1
     Istat (the Italian National Institute of Statistics) set up a Technical Commission with the objective of orienting investments in Big Data adoption in statistical production processes.
     • Duration: from February 2013 to February 2015
     • Members came from different areas: official statistics, academia, and the private sector

  3. Istat Big Data Strategy - 2
     The Commission released a Roadmap for Big Data adoption, the result of a mixed approach that combined:
     • Top-down phase: analysis of the state of the art of Big Data research and practice
     • Bottom-up phase: experiments

  4. Istat Big Data Strategy - 3
     A new Technical Commission has been in place since January 2016, with the main objective of monitoring the roadmap implementation.

  5. Which use: Roadmap Short Term Actions - 1
     Possible uses of Big Data sources in official statistics (OS):
     • by themselves
     • in combination with more traditional data sources such as sample surveys and administrative registers
     Short-term use

  6. Which use: Roadmap Short Term Actions - 2
     Finalization to production:

     Source type             Domain(s)
     Online Search Data      Labour Force statistics
     Internet-scraped Data   ICT usage and Price statistics
     Mobile Phone Data       Mobility and Tourism statistics
     Scanner data            Price statistics

  7. Which use: Roadmap Short Term Actions - 3
     Laboratory to deal with other source types:

     Source type                                 Domain(s)
     Social Media                                Social statistics (e.g. Consumer Confidence)
     Images: Traffic Webcams & Orthoimages       Traffic and Agriculture statistics

  8. Examples of experiences so far
     • ICT Usage in Enterprises, based on Internet as a Data Source (IaD)
     • Persons and Places, based on Mobile Phone Data

  9. ICT Usage in Enterprises
     Purpose: evaluate the possibility of adopting Web scraping and text mining techniques to produce estimates of the usage of ICT by enterprises and public institutions.
     Actors involved in the project:
     • Istat: Survey on the ICT Usage in Enterprises
     • Cineca (consortium of Italian universities, the National Research Council, and the Ministry of Education and Research)
     Methodology:
     • Scraping of websites for data extraction
     • Supervised classification task

  10. The ICT in enterprises survey
      In Italy, the survey investigates a universe of 211,851 enterprises with at least 10 employees, by means of a sampling survey involving 19,186 of them (2011).
      In the 2013 round of the survey, 8,687 units indicated their website (45% of respondent units).
      Access to the indicated websites, in order to gather information directly from them, offers different opportunities.

  11. The web questionnaire is used to collect information on the characteristics of the websites owned or used by the enterprises.
      Objective: predict the values of questions B8a to B8g using machine learning techniques applied to texts scraped from the websites (text mining). Particular effort was dedicated to question B8a ("Web sales facilities or e-commerce").

  12. The overall methodology
      Both the 2013 and 2014 rounds of the survey were used in the experiment.
      • Phase 1, Web scraping: the websites of all respondents who declared that they own one were scraped.
      • Phase 2, Estimation: the texts collected in phase 1 were submitted to classical text mining procedures in order to build a term/document matrix. Learners were then used to predict the values of target variables (for instance, e-commerce: yes/no) on the basis of relevant terms identified in the websites.
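
As an illustration of the term/document matrix built in phase 2, the following minimal Python sketch counts terms in a few hypothetical website texts. The sample texts, the function names, and the simple ASCII tokenizer are assumptions made for this example; the slides do not describe the text mining pipeline actually used. Documents are stored as rows and terms as columns.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into plain alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical scraped texts, one per enterprise website.
sites = [
    "Add to cart and pay online",
    "Company history and contact details",
    "Online shop: cart, checkout, pay by card",
]
vocabulary, matrix = term_document_matrix(sites)
cart_column = vocabulary.index("cart")
print([row[cart_column] for row in matrix])
```

A real pipeline would typically also remove stop words and handle accented characters, which this tokenizer ignores.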

  13. Phase 1: Web Scraping
      So far, three different solutions have been investigated:
      1. The Apache suite Nutch/Solr (https://nutch.apache.org) for crawling, content extraction, indexing, and searching.
      2. HTTrack (http://www.httrack.com/), a free and open source software tool that mirrors a website locally by downloading each page that composes its structure.
      3. JSOUP (http://jsoup.org), which parses an HTML document and extracts its structure. It has been integrated in a specific step of the ADaMSoft system (http://adamsoft.sourceforge.net).
      Ad hoc JSOUP-based solutions are currently being developed.
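
The text-extraction part of scraping can be sketched with Python's standard-library HTML parser. This is only a stand-in for illustration: the project used tools such as Nutch, HTTrack, and JSOUP, and the class name, helper function, and sample page below are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML page as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page standing in for a downloaded enterprise website.
page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Shop</h1><p>Add to <b>cart</b></p>"
    "<script>var x = 1;</script></body></html>"
)
text = extract_text(page)
print(text)
```

The extracted string is the kind of input a later text mining step would tokenize.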

  14. Phase 1: web scraping

      Solution  # websites reached    Average number of webpages per site  Time spent  Type of storage                        Storage dimensions
      Nutch     7020 / 8550 = 82.1%   15.2                                 32.5 hours  Binary files on HDFS                   2.3 GB (data), 5.6 GB (index)
      HTTrack   7710 / 8550 = 90.2%   43.5                                 6.7 days    HTML files on file system              16.1 GB
      JSOUP     7835 / 8550 = 91.6%   68                                   11 hours    HTML ADaMSoft compressed binary files  500 MB

  15. Phase 2: Estimation
      The 2013 data were used as the training dataset and the 2014 data as the test dataset. The performance of each learner was evaluated by means of the usual quality indicators:
      • accuracy: share of correctly classified cases among all cases;
      • sensitivity: share of correctly classified positive cases among all positive cases;
      • specificity: share of correctly classified negative cases among all negative cases.
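
The three indicators can be computed from observed and predicted yes/no values as in the sketch below. The function name and the toy data are assumptions for illustration, and the code assumes that both positive and negative cases occur in the observed values.

```python
def quality_indicators(observed, predicted):
    """Return accuracy, sensitivity, and specificity for binary labels."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o and p)          # true positives
    tn = sum(1 for o, p in pairs if not o and not p)  # true negatives
    fp = sum(1 for o, p in pairs if not o and p)      # false positives
    fn = sum(1 for o, p in pairs if o and not p)      # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical test set: 1 = e-commerce declared in the survey, 0 = not.
observed = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
indicators = quality_indicators(observed, predicted)
print(indicators)
```

With a rare positive class, accuracy alone can look good even when sensitivity is low, which is why the three indicators are reported together.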

  16. Quality Indicators

      Learner          Accuracy  Sensitivity  Specificity  Proportion of e-commerce (observed)  Proportion of e-commerce (predicted)
      GLM (Logistic)   0.69      0.68         0.69         0.19                                 0.22
      Random Forest    0.79      0.63         0.83         0.19                                 0.25
      Neural Network   0.70      0.62         0.72         0.19                                 0.20
      Boosting         0.67      0.66         0.67         0.19                                 0.22
      Bagging          0.82      0.38         0.92         0.19                                 0.19
      Naïve Bayes      0.75      0.55         0.79         0.19                                 0.21
      LDA              0.66      0.71         0.65         0.19                                 0.28
      RPART (Tree)     0.82      0.25         0.95         0.19                                 0.16


  18. Conclusions for the ICT Usage in Enterprises Project
      • Scenario 1 (reduction of respondent burden): so far, the pilot has explored the possibility of replicating the information collected by the questionnaire, using the scraped content of the websites and applying the best predictor.
      • Scenario 2: a more relevant possibility is to combine survey data and Big Data in order to improve the quality of the estimates.

  19. Conclusions for the ICT Usage in Enterprises Project
      The aim is to adopt a full predictive approach with a combined use of data:
      1. all the websites owned by the whole population of enterprises are identified and their content is collected by web scraping (= Big Data);
      2. survey data (the "ground truth") are combined with Big Data in order to establish relations (models) between the values of target variables and the terms collected in the corresponding scraped websites;
      3. the models estimated in step 2 are applied to the whole set of texts obtained in step 1 in order to produce estimates of the target variables.
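
The three steps can be sketched end to end with a small multinomial Naive Bayes classifier (Naive Bayes is one of the learners listed on slide 16). Everything below is a toy illustration under stated assumptions: the texts, labels, and function names are hypothetical, and the implementation is not the one used in the project.

```python
import math
from collections import Counter


def train_naive_bayes(texts, labels):
    """Step 2: fit a multinomial Naive Bayes model on survey-labelled texts."""
    classes = sorted(set(labels))
    term_counts = {c: Counter() for c in classes}
    for text, label in zip(texts, labels):
        term_counts[label].update(text.lower().split())
    return {
        "classes": classes,
        "term_counts": term_counts,
        "doc_counts": Counter(labels),
        "vocabulary": set().union(*term_counts.values()),
        "n_docs": len(labels),
    }


def predict(model, text):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    tokens = [t for t in text.lower().split() if t in model["vocabulary"]]
    vocab_size = len(model["vocabulary"])
    best_class, best_score = None, -math.inf
    for c in model["classes"]:
        total = sum(model["term_counts"][c].values())
        score = math.log(model["doc_counts"][c] / model["n_docs"])
        for t in tokens:
            score += math.log((model["term_counts"][c][t] + 1) / (total + vocab_size))
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical survey respondents: scraped text plus the e-commerce answer.
survey_texts = [
    "add to cart checkout pay online",
    "online shop cart secure payment",
    "company history contact address",
    "our mission team contact",
]
survey_labels = [True, True, False, False]

# Step 1 (assumed already done): scraped texts for the whole population.
population_texts = ["shop cart checkout", "history team address", "contact our team"]

model = train_naive_bayes(survey_texts, survey_labels)
# Step 3: apply the model to every website and aggregate into an estimate.
predictions = [predict(model, t) for t in population_texts]
estimated_share = sum(predictions) / len(predictions)
print(predictions, estimated_share)
```

In practice the model would be validated on a held-out survey round before its predictions are aggregated.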

  20. The Persons and Places Project
      Purpose: production of the origin/destination matrix of daily mobility for work and study purposes, at the spatial granularity of municipalities, starting from phone (tracking) data.
      Actors involved in the project:
      • Istat
      • National Research Council
      • University of Pisa
      Methodology:
      • Inference of population mobility profiles from GSM Call Detail Records (CDRs)
      • Comparison with data derived from administrative sources

  21. Data
      • CDR (Wind, province of Pisa, October 2011)
      • Administrative data (P&P, province of Pisa, December 2011)

  22. Methodology
      Diagram labels: CDR Data, Extraction, Aggregation, Risk evaluation, Classification, Statistics, Interpretation, Admin Data, Validation

  23. Aggregation: Individual Call Profiles
      • The temporal aggregation is by week: the days of each week are grouped into weekdays and weekend.
      • Given, for example, a temporal window of 28 days (4 weeks), the resulting matrix has 8 columns (2 columns for each week, one for the weekdays and one for the weekend).
      • A further temporal partitioning is applied to the hours of the day: a day is divided into several timeslots, representing interesting times of the day.
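
The aggregation can be sketched as follows for a single user. The timeslot boundaries, the use of call counts as cell values, the function name, and the sample call times are all assumptions made for illustration; the slide specifies only the weekday/weekend columns per week and the existence of timeslots.

```python
from datetime import date, datetime

# Hypothetical timeslots as (label, first hour, end hour exclusive).
TIMESLOTS = [("night", 0, 8), ("day", 8, 19), ("evening", 19, 24)]


def build_icp(call_times, window_start, weeks=4):
    """Return a len(TIMESLOTS) x (2 * weeks) matrix of call counts.

    Column 2*w holds the weekdays of week w, column 2*w + 1 its weekend.
    """
    icp = [[0] * (2 * weeks) for _ in TIMESLOTS]
    for t in call_times:
        day_offset = (t.date() - window_start).days
        if not 0 <= day_offset < 7 * weeks:
            continue  # call falls outside the observation window
        week = day_offset // 7
        is_weekend = t.weekday() >= 5  # Saturday = 5, Sunday = 6
        column = 2 * week + (1 if is_weekend else 0)
        for row, (_, start, end) in enumerate(TIMESLOTS):
            if start <= t.hour < end:
                icp[row][column] += 1
    return icp


# Hypothetical calls of one user; the window starts on Monday 3 October 2011.
calls = [
    datetime(2011, 10, 3, 9, 30),   # Monday, week 0, daytime
    datetime(2011, 10, 4, 21, 0),   # Tuesday, week 0, evening
    datetime(2011, 10, 8, 10, 0),   # Saturday, week 0, daytime
    datetime(2011, 10, 12, 9, 0),   # Wednesday, week 1, daytime
    datetime(2011, 11, 5, 9, 0),    # outside the 28-day window, ignored
]
icp = build_icp(calls, date(2011, 10, 3))
for (label, _, _), row in zip(TIMESLOTS, icp):
    print(label, row)
```

Starting the window on a Monday keeps each week's weekdays and weekend in the same pair of columns.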

  24. Classification
      Profile classification, i.e. the attribution of ICPs to the proper class, was performed in two steps:
      1. Extraction of representative call profiles, i.e. a relatively small set of synthetic call profiles, each summarizing a homogeneous set of (real) ICPs. This step reduces the set of samples to be manually classified.
      2. Propagation of the labels assigned to the representative profiles to the full set of ICPs.

  25. Classification
      • The mean values of the ICPs belonging to each cluster serve as the prototype (representative) of the cluster.
      • The parameter K was set to 100 after a wide range of experiments aimed at minimizing the intra-cluster distance and maximizing the inter-cluster distance.
      • Once extracted, the representatives (RCPs) were labeled by domain experts with the identified Profile Classes.
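
The slide does not name the clustering algorithm, but a parameter K and cluster means as prototypes suggest a k-means-style procedure. Under that assumption, a minimal sketch of how representatives could be extracted looks like this; the function names, the seeding with the first k profiles, and the toy two-dimensional profiles are illustrative choices only.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(profiles, k, iterations=20):
    """Plain Lloyd's algorithm; the first k profiles seed the centroids."""
    centroids = [list(p) for p in profiles[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in profiles:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # The cluster mean becomes the new prototype; empty clusters keep theirs.
        centroids = [mean_vector(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters


# Toy flattened call profiles forming two obvious groups (K = 2 here, not 100).
profiles = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]]
centroids, clusters = kmeans(profiles, k=2)
print(centroids)
```

Each returned centroid plays the role of one representative profile to be labeled by an expert.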

  26. Classification
      The second step, i.e. the propagation of the labels manually assigned to the RCPs, followed a standard 1-Nearest-Neighbor (1-NN) classification: each ICP is assigned the label of the closest RCP.
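
A 1-NN propagation of this kind can be sketched in a few lines. The use of Euclidean distance, the function names, and the toy vectors are assumptions for illustration; the class names follow those shown on the slides.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def propagate_labels(icps, rcps, rcp_labels):
    """Assign to each ICP the label of its closest RCP (1-NN)."""
    labels = []
    for icp in icps:
        nearest = min(range(len(rcps)), key=lambda i: squared_distance(icp, rcps[i]))
        labels.append(rcp_labels[nearest])
    return labels


# Hypothetical representatives, already labeled by a domain expert.
rcps = [[5.0, 5.0, 4.0], [5.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
rcp_labels = ["Resident", "Commuter", "Visitor"]

# Hypothetical individual call profiles to classify.
icps = [[4, 5, 5], [6, 1, 0], [0, 0, 0], [1, 1, 0]]
icp_labels = propagate_labels(icps, rcps, rcp_labels)
print(icp_labels)
```

Because only the representatives are labeled by hand, the manual effort stays proportional to K rather than to the number of users.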

  27. Diagram: Individual call profile, Classification algorithm, and the resulting classes (Resident, Dynamic resident, Commuters, Visitors) with respect to areas A and B

  28. A flow from A -> B is defined by dynamic residents in B who work in A (commuters).
      Diagram labels: Commuter, Dynamic Resident, A, B

  29. Comparison of estimates derived from CDRs with administrative data
      GSM estimates are rescaled according to the market share of the operator.
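
The rescaling amounts to dividing the counts observed for one operator by that operator's market share. A minimal sketch, in which the function name, the flow counts, and the 25% market share are all hypothetical values chosen for illustration:

```python
def rescale_by_market_share(observed_users, market_share):
    """Scale a count observed for one operator up to the whole population."""
    if not 0 < market_share <= 1:
        raise ValueError("market_share must be in (0, 1]")
    return observed_users / market_share


# Hypothetical commuter counts between areas A and B for a single operator.
observed_flows = {("A", "B"): 300, ("B", "A"): 120}
market_share = 0.25  # hypothetical share of the operator in the area

estimated_flows = {
    pair: rescale_by_market_share(count, market_share)
    for pair, count in observed_flows.items()
}
print(estimated_flows)
```

A single area-wide share is the simplest choice; shares that vary by municipality would need one factor per origin.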

  30. Commuters (inbound flow)

  31. Dynamic residents (outbound flow)


  33. Inbound commuters in Pisa

  34. Inbound commuters in Pisa

  35. Outbound commuters in Pisa

  36. Conclusions for the Persons and Places Project
      • Semi-automatic methodology for the estimation of population flows
      • Good alignment with results from administrative data
      • First steps towards the use of mobile phone data for official statistics (OS)

  37. Recommendations from experimentations - 1
      ICT Usage in Enterprises:
      • Even unstructured data can be harnessed by OS.
      • Very promising preliminary results in terms of the quality of the estimates with respect to questionnaire-based estimates.
      • Dedicated IT infrastructure for (i) scraping and (ii) scaling up.

  38. Recommendations from experimentations - 2
      Persons and Places:
      • Privacy issues in dealing with mobile phone data. First positive solutions from the Italian Garante per la Privacy.
      • Comparison with administrative data suggests that estimates based on mobile phone data are reliable (though work is still needed to ensure OS quality levels).

  39. References
      • Persons and Places: Furletti, B., Gabrielli, L., Garofalo, G., Giannotti, F., Milli, L., Nanni, M., Pedreschi, D., Vivio, R.: Use of mobile phone data to estimate mobility flows. Measuring urban population and inter-city mobility using big data in an integrated approach. SIS, Cagliari, 2014.
      • Labour Market Estimation: Bacchini, F., D'Alò, M., Falorsi, S., Fasulo, A., Pappalardo, A.: Does Google index improve the forecast of Italian labour market? SIS, Cagliari, 2014.
      • ICT Usage: Barcaroli, G., Scannapieco, M., Nurra, A., Scarnò, M., Salamone, S., Summa, D.: Internet as Data Source in Istat Survey on ICT in Enterprises. Austrian Journal of Statistics, Vol. 44, No. 2, 2015.
      • Analysis techniques: Barcaroli, G., De Francisci, S., Scannapieco, M.: Big Data Analysis: Experiences and Best Practices in Official Statistics. Conference of European Statistical Stakeholders, Rome, 2014.
      • IT issues: Barcaroli, G., De Francisci, S., Scannapieco, M., Summa, D.: Dealing with Big Data for Official Statistics: IT Issues. MSIS, Dublin, 2014.
      • Introductory: Scannapieco, M., Virgillito, A., Zardetto, D.: Placing Big Data in Official Statistics: A Big Challenge? NTTS, Brussels, 2013.
