Italian Examples of the use of big data for producing statistics
The Italian National Institute of Statistics (Istat) established a Technical Commission to guide investments in Big Data adoption for statistical production processes. The Commission developed a roadmap combining top-down analysis with bottom-up experiments, focused on using various data sources to produce statistics. Short-term actions include exploring the use of Big Data sources in official statistics, in domains such as labour force statistics and ICT usage, drawing on sources such as online search data.
Presentation Transcript
Italian Examples of the use of big data for producing statistics Monica Scannapieco THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Eurostat
Istat Big Data Strategy - 1: Istat (the Italian National Institute of Statistics) set up a Technical Commission with the objective of orienting investments in Big Data adoption in statistical production processes. Duration: from February 2013 to February 2015. Members came from different areas: official statistics, academia, and the private sector.
Istat Big Data Strategy - 2: The Commission released a Roadmap for Big Data adoption, the result of a mixed approach that combined a top-down phase (analysis of the state of the art of Big Data research and practice) and a bottom-up phase (experiments).
Istat Big Data Strategy - 3: A new Technical Commission was set up in January 2016, with the main objective of monitoring the roadmap implementation.
Roadmap Short Term Actions - 1 (Which use): Big Data sources can be used in official statistics (OS) either by themselves or in combination with more traditional data sources such as sample surveys and administrative registers (short term use).
Roadmap Short Term Actions - 2 (Which use): Finalization to production.
Source type | Domain(s)
Online Search Data | Labour Force statistics
Internet-scraped Data | ICT usage and Price statistics
Mobile Phone Data | Mobility and Tourism statistics
Scanner data | Price statistics
Roadmap Short Term Actions - 3 (Which use): Laboratory to deal with other source types.
Source type | Domain(s)
Social Media | Social statistics (e.g. Consumer Confidence)
Images: Traffic Webcams & Orthoimages | Traffic and Agriculture statistics
Examples of experiences so far: ICT Usage in Enterprises, based on Internet as a Data Source (IaD); Persons and Places, based on Mobile Phone Data.
ICT Usage in Enterprises. Purpose: evaluate the possibility of adopting web scraping and text mining techniques for estimates on the usage of ICT by enterprises and public institutions. Actors involved in the project: Istat (Survey on the ICT Usage in Enterprises) and Cineca (consortium of Italian universities, National Research Council and Ministry of Education and Research). Methodology: scraping of web sites for data extraction; supervised classification task.
The ICT in enterprises survey: In Italy, the survey investigates a universe of 211,851 enterprises with at least 10 employees, by means of a sampling survey involving 19,186 of them (2011). In the 2013 round of the survey, 8,687 sampling units (45% of respondent units) indicated their website. Access to the indicated websites, in order to gather information directly within them, gives different opportunities.
The web questionnaire is used to collect information on the characteristics of the websites owned or used by the enterprises. Objective: predict the values of questions B8a to B8g using machine learning techniques applied to texts scraped from the websites (text mining). Particular effort was dedicated to question B8a ("Web sales facilities or e-commerce").
The overall methodology: The 2013 and 2014 rounds of the survey have both been used in the experiment. Phase 1 - Web scraping: the websites of all respondents declaring to own one have been scraped. Phase 2 - Estimation: the texts collected in phase 1 were submitted to classical text mining procedures in order to build a term/document matrix. Learners were then used to predict the values of target variables (for instance, e-commerce yes/no) on the basis of relevant terms identified in the websites.
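As an illustration of the matrix built in phase 2, here is a minimal sketch in Python using only the standard library. The tokenizer, the function names and the two sample texts are assumptions made for this example; the slides do not say which text mining tools or preprocessing steps were used. The matrix is stored with one row per document and one column per term.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts)) if counts else []
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical scraped texts, one per enterprise website.
scraped = [
    "Add to cart and pay online",
    "Contact us for a quote",
]
vocab, tdm = term_document_matrix(scraped)
```

Each row of `tdm` can then serve as the feature vector of one website when training a learner on a target variable such as e-commerce (yes/no).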
Phase 1: Web Scraping. So far, three different solutions have been investigated:
1. The Apache suite Nutch/Solr (https://nutch.apache.org) for crawling, content extraction, indexing and searching.
2. HTTrack (http://www.httrack.com/), a free and open source software tool that mirrors a web site locally by downloading each page that composes its structure.
3. JSOUP (http://jsoup.org), which parses an HTML document and extracts its structure. It has been integrated in a specific step of the ADaMSoft system (http://adamsoft.sourceforge.net).
Ad-hoc JSOUP-based solutions are currently being developed.
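To give an idea of the content extraction step, the sketch below pulls the visible text out of an HTML page. It uses Python's standard `html.parser` module as a stand-in and is not the JSOUP/ADaMSoft solution described in the slides; the class name, function name and sample page are assumptions made for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML document as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page content; a real run would first download or mirror the site.
page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Shop</h1><p>Add to <b>cart</b></p>"
    "<script>var x=1;</script></body></html>"
)
text = extract_text(page)
```

The resulting string is the kind of input that the text mining procedures of phase 2 would work on.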
Phase 1: web scraping
Solution | # websites reached | Average number of webpages per site | Time spent | Type of storage | Storage dimensions
Nutch | 7020/8550 = 82.1% | 15.2 | 32.5 hours | Binary files on HDFS | 2.3 GB (data), 5.6 GB (index)
HTTrack | 7710/8550 = 90.2% | 43.5 | 6.7 days | HTML files on file system | 16.1 GB
JSOUP | 7835/8550 = 91.6% | 68 | 11 hours | HTML ADaMSoft compressed binary files | 500 MB
Phase 2: Estimation. 2013 data have been used as the training dataset, while 2014 data have been used as the test dataset. The performance of each learner has been evaluated by means of the usual quality indicators: accuracy (rate of correctly classified cases on the total), sensitivity (rate of correctly classified positive cases on total positive cases) and specificity (rate of correctly classified negative cases on total negative cases).
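The three indicators follow directly from the counts of true/false positives and negatives. The sketch below computes them for binary labels; the function name and the sample labels are assumptions made for this example, and it presumes that both classes occur among the observed values.

```python
def quality_indicators(observed, predicted):
    """Compute accuracy, sensitivity and specificity for binary labels (0/1)."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o and p)          # true positives
    tn = sum(1 for o, p in pairs if not o and not p)  # true negatives
    fp = sum(1 for o, p in pairs if not o and p)      # false positives
    fn = sum(1 for o, p in pairs if o and not p)      # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical test-set labels: 1 = e-commerce, 0 = no e-commerce.
observed = [1, 1, 0, 0, 0]
predicted = [1, 0, 0, 0, 1]
indicators = quality_indicators(observed, predicted)
```

With a rare positive class (here e-commerce), accuracy alone can look good even when sensitivity is poor, which is why the three indicators are read together.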
Quality Indicators
Learner | Accuracy | Sensitivity | Specificity | Proportion of e-commerce (observed) | Proportion of e-commerce (predicted)
GLM (Logistic) | 0.69 | 0.68 | 0.69 | 0.19 | 0.22
Random Forest | 0.79 | 0.63 | 0.83 | 0.19 | 0.25
Neural Network | 0.70 | 0.62 | 0.72 | 0.19 | 0.20
Boosting | 0.67 | 0.66 | 0.67 | 0.19 | 0.22
Bagging | 0.82 | 0.38 | 0.92 | 0.19 | 0.19
Naïve Bayes | 0.75 | 0.55 | 0.79 | 0.19 | 0.21
LDA | 0.66 | 0.71 | 0.65 | 0.19 | 0.28
RPART (Tree) | 0.82 | 0.25 | 0.95 | 0.19 | 0.16
Conclusions for the ICT Usage in Enterprises Project: So far, the pilot has explored the possibility of replicating the information collected by the questionnaire using the scraped content of the websites and applying the best predictor (scenario 1: reduction of respondent burden). A more relevant possibility is to combine survey data and Big Data (scenario 2) in order to improve the quality of the estimates.
Conclusions for the ICT Usage in Enterprises Project: The aim is to adopt a full predictive approach with a combined use of data:
1. All the websites owned by the whole population of enterprises are identified and their content is collected by web scraping (= Big Data).
2. Survey data (the "ground truth") are combined with Big Data in order to establish relations (models) between the values of target variables and the terms collected in the corresponding scraped websites.
3. The estimated models obtained in step 2 are applied to the whole set of texts obtained in step 1 in order to produce estimates related to the target variables.
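The three steps can be sketched end to end on toy data. The example below uses a small Bernoulli Naive Bayes classifier written in plain Python, chosen here only for brevity (Naive Bayes is one of the learners listed in the slides, but this is not a statement about which predictor was adopted). All texts, labels and function names are hypothetical, and the sketch assumes both classes occur in the survey data.

```python
import math


def train_naive_bayes(texts, labels):
    """Fit a Bernoulli Naive Bayes model on whitespace-tokenised texts."""
    vocab = sorted({term for text in texts for term in text.split()})
    model = {"vocab": vocab, "prior": {}, "term_prob": {}}
    for cls in (0, 1):
        docs = [set(t.split()) for t, y in zip(texts, labels) if y == cls]
        model["prior"][cls] = len(docs) / len(texts)
        # Laplace smoothing keeps every probability strictly between 0 and 1.
        model["term_prob"][cls] = {
            term: (sum(term in d for d in docs) + 1) / (len(docs) + 2)
            for term in vocab
        }
    return model


def predict(model, text):
    """Return the class (0 or 1) with the highest log-posterior score."""
    tokens = set(text.split())
    scores = {}
    for cls in (0, 1):
        score = math.log(model["prior"][cls])
        for term in model["vocab"]:
            p = model["term_prob"][cls][term]
            score += math.log(p if term in tokens else 1 - p)
        scores[cls] = score
    return max(scores, key=scores.get)


# Step 2: survey respondents provide the ground truth (hypothetical data).
survey_texts = ["cart checkout pay", "shop cart order",
                "contact address phone", "company history contact"]
survey_labels = [1, 1, 0, 0]  # 1 = e-commerce, 0 = no e-commerce
model = train_naive_bayes(survey_texts, survey_labels)

# Steps 1 and 3: apply the model to the texts scraped for the whole population.
all_texts = survey_texts + ["order cart online", "phone address map"]
predictions = [predict(model, t) for t in all_texts]
estimated_proportion = sum(predictions) / len(predictions)
```

In this toy setting the estimate is simply the share of websites predicted as positive; a production estimate would also have to account for enterprises without a website and for classification error.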
The Persons and Places Project. Purpose: production of the origin/destination matrix of daily mobility for purposes of work and study, at the spatial granularity of municipalities, starting from phone (tracking) data. Actors involved in the project: Istat, National Research Council, University of Pisa. Methodology: inference of population mobility profiles from GSM Call Detail Records (CDRs); comparison with data derived from administrative sources.
Data: CDR (Wind, province of Pisa, October 2011); administrative data (P&P, province of Pisa, December 2011).
Methodology (workflow diagram; steps shown: CDR Data, Extraction, Aggregation, Risk evaluation, Classification, Statistics, Interpretation, Admin Data, Validation).
Aggregation: Individual Call Profiles (ICPs). The temporal aggregation is by week, where the days of a given week are grouped into weekdays and weekend. Given, for example, a temporal window of 28 days (4 weeks), the resulting matrix has 8 columns (2 columns for each week, one for the weekdays and one for the weekend). A further temporal partitioning is applied to the daily hours: a day is divided into several timeslots, representing interesting times of the day.
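A minimal sketch of this aggregation for one user in one location is given below. The three timeslots, the choice of counting calls per cell, and the function name are assumptions made for this example; the slides only state that a day is divided into several timeslots.

```python
from datetime import date, datetime

# Hypothetical timeslots (start hour inclusive, end hour exclusive).
TIMESLOTS = [(0, 8), (8, 19), (19, 24)]


def call_profile(call_times, window_start, weeks=4):
    """Count calls per timeslot (rows) and per week x weekday/weekend (columns)."""
    profile = [[0] * (2 * weeks) for _ in TIMESLOTS]
    for t in call_times:
        day_offset = (t.date() - window_start).days
        if not 0 <= day_offset < 7 * weeks:
            continue  # the call falls outside the temporal window
        week = day_offset // 7
        is_weekend = t.weekday() >= 5  # Saturday = 5, Sunday = 6
        column = 2 * week + (1 if is_weekend else 0)
        for row, (start, end) in enumerate(TIMESLOTS):
            if start <= t.hour < end:
                profile[row][column] += 1
    return profile


# Hypothetical calls within a 28-day window starting on Monday 3 October 2011.
calls = [
    datetime(2011, 10, 3, 9, 30),   # week 0, weekday, daytime slot
    datetime(2011, 10, 8, 21, 0),   # week 0, weekend, evening slot
    datetime(2011, 10, 12, 7, 15),  # week 1, weekday, night slot
]
profile = call_profile(calls, window_start=date(2011, 10, 3))
```

With a 4-week window each row of `profile` has 8 columns, matching the layout described above.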
Classification: Profile classification, i.e. the attribution of ICPs to the proper class, was performed in two steps. First, representative call profiles are extracted, i.e. a relatively small set of synthetic call profiles, each summarizing a homogeneous set of (real) ICPs; this step reduces the set of samples to be manually classified. Second, the labels assigned to the representative profiles are propagated to the full set of ICPs.
Classification: The mean values of the ICPs belonging to each cluster serve as the prototype (representative) of the cluster. The parameter K was set to 100 after a wide range of experiments aimed at minimizing the intra-cluster distance and maximizing the inter-cluster distance. Once extracted, the representatives (RCPs) were labeled by domain experts with the identified Profile Classes.
Classification: The second step, i.e. the propagation of the labels manually assigned to the RCPs, followed a standard 1-Nearest-Neighbor (1-NN) classification step, which assigns to each ICP the label of the closest RCP.
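The sketch below illustrates the two ingredients of this procedure: a cluster prototype computed as the mean of its member profiles, and 1-NN propagation of the expert labels. Profiles are flattened into plain vectors, Euclidean distance is assumed (the slides do not name the distance), and all numbers and function names are hypothetical. It needs Python 3.8 or later for `math.dist`.

```python
import math


def mean_profile(profiles):
    """Prototype of a cluster: component-wise mean of its member profiles."""
    n = len(profiles)
    return [sum(values) / n for values in zip(*profiles)]


def nearest_label(profile, representatives):
    """1-NN: return the label of the closest representative profile."""
    closest = min(representatives, key=lambda rep: math.dist(profile, rep[0]))
    return closest[1]


# Hypothetical representatives (RCPs) with labels assigned by domain experts.
representatives = [
    (mean_profile([[5, 5, 0], [7, 5, 0]]), "resident"),
    (mean_profile([[0, 4, 0], [0, 6, 0]]), "commuter"),
]

# Hypothetical individual call profiles (ICPs) to be classified.
icps = [[6, 4, 1], [1, 5, 0]]
labels = [nearest_label(icp, representatives) for icp in icps]
```

Because only the representatives are labeled by hand, the manual effort grows with K (100 in the slides) rather than with the number of ICPs.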
Diagram: the classification algorithm assigns each individual call profile to one of the classes Resident, Dynamic resident, Commuter or Visitor.
A flow from A -> B is defined by the dynamic residents in B who work in A (commuters).
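Following this definition, flows can be counted once each user has a class in each municipality where they were observed. The sketch below is an assumption-laden illustration: the input layout, the class labels as strings, the function name and the sample users are all hypothetical.

```python
from collections import Counter


def flow_counts(user_classes):
    """Count users per (A, B) pair: commuter in A and dynamic resident in B."""
    flows = Counter()
    for classes in user_classes.values():
        residences = [m for m, c in classes.items() if c == "dynamic resident"]
        workplaces = [m for m, c in classes.items() if c == "commuter"]
        for b in residences:
            for a in workplaces:
                flows[(a, b)] += 1
    return flows


# Hypothetical classification output: user -> {municipality: class}.
user_classes = {
    "u1": {"A": "commuter", "B": "dynamic resident"},
    "u2": {"A": "commuter", "B": "dynamic resident"},
    "u3": {"A": "dynamic resident"},
    "u4": {"C": "commuter", "B": "dynamic resident"},
}
flows = flow_counts(user_classes)
```

The resulting counts form the cells of an origin/destination matrix at municipality level.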
Comparison of estimates derived from CDRs with administrative data: GSM estimates are rescaled considering the market share of the operator.
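One simple way to read this rescaling is to divide the count observed on the operator's network by the operator's market share. The sketch below assumes a single, uniform market share; the slides do not say whether the share was applied globally or per municipality, and the figures used here are hypothetical.

```python
def rescale_by_market_share(observed_count, market_share):
    """Scale an operator-level count up to the whole population."""
    if not 0 < market_share <= 1:
        raise ValueError("market_share must be in the interval (0, 1]")
    return observed_count / market_share


# Hypothetical example: 300 commuters observed, operator market share of 25%.
estimated_commuters = rescale_by_market_share(300, 0.25)
```

The rescaled figure is what would then be compared with the corresponding value from administrative data.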
Commuters (inbound flow)
Dynamic residents (outbound flow)
Inbound commuters in Pisa
Inbound commuters in Pisa
Outbound commuters in Pisa
Conclusions for the Persons and Places Project: Semi-automatic methodology for the estimation of population flows. Good alignment with results from administrative data. First steps towards the use of mobile phone data for OS.
Recommendations from experimentations - 1: ICT Usage in Enterprises. Even unstructured data can be harnessed by OS. Preliminary results are very promising in terms of the quality of the estimates with respect to questionnaire-based estimates. A dedicated IT infrastructure is needed for (i) scraping and (ii) scaling up.
Recommendations from experimentations - 2: Persons and Places. Privacy issues arise in dealing with mobile phone data; first positive solutions have come from the Italian Garante per la Privacy. Comparison with administrative data suggests the reliability of mobile phone data estimation (though work is still needed to ensure OS quality levels).
References
Persons and Places: Furletti, B., Gabrielli, L., Garofalo, G., Giannotti, F., Milli, L., Nanni, M., Pedreschi, D., Vivio, R.: Use of mobile phone data to estimate mobility flows. Measuring urban population and inter-city mobility using big data in an integrated approach. SIS, Cagliari, 2014.
Labour Market Estimation: Bacchini, F., D'Alò, M., Falorsi, S., Fasulo, A., Pappalardo, A.: Does Google index improve the forecast of Italian labour market? SIS, Cagliari, 2014.
ICT Usage: Barcaroli, G., Scannapieco, M., Nurra, A., Scarnò, M., Salamone, S., Summa, D.: Internet as Data Source in Istat Survey on ICT in Enterprises. Austrian Journal of Statistics, Vol. 44, No. 2, 2015.
Analysis techniques: Barcaroli, G., De Francisci, S., Scannapieco, M.: Big Data Analysis: Experiences and Best Practices in Official Statistics. Conference of European Statistical Stakeholders, Rome, 2014.
IT issues: Barcaroli, G., De Francisci, S., Scannapieco, M., Summa, D.: Dealing with Big Data for Official Statistics: IT Issues. MSIS, Dublin, 2014.
Introductory: Scannapieco, M., Virgillito, A., Zardetto, D.: Placing Big Data in Official Statistics: A Big Challenge? NTTS, Brussels, 2013.