
Enhancing Information Discovery in Social Science Research
Explore the challenges in information search in social science research, and the development and evaluation of methods for automatic detection and linking of relevant entities in publications. Learn about mining data references to improve information discovery. Discover the importance of linking research data with publications for reliable retrieval.
Presentation Transcript
Mining data references from social science publications to enhance information discovery and linking
Peter Mutschke
Digital Infrastructures for Research, Brussels, December 01, 2017
twitter.com/openminted_eu

GESIS: Infrastructure for the Social Sciences
Research-based Services
Basic Research
Image: GESIS stall at the Sociological Conference 2010

Difficulties in Information Search
Survey: 119 respondents, 44 completed the survey
65% non-university research institute, 26% university, 9% other
Main problems:
- non-relevant hits due to inability of retrieval systems to disambiguate search terms
- finding relevant research data
- missing links between data and publications

Develop and evaluate methods for automatic detection and linking of relevant entities in Social Science publications in order to advance reliable and context-sensitive retrieval and linking:
- Named Entity Recognition
- Keyword Extraction
- Detection of data citations

Variable Detection and Linking
v39: Believe in life after death
v40: Believe in Heaven
ISSP 2008
Olga Nešporová, Zdeněk R. Nešpor (2009). Religion: An Unsolved Problem for the Modern Czech Nation
Link Database

Survey Variables

Variable: v20
Label: Religious leaders should not influence vote
Question: [How much do you agree or disagree with each of the following:] Q.10a Religious leaders should not try to influence how people vote in elections.
Categories:
  1 Strongly agree
  2 Agree
  3 Neither agree nor disagree
  4 Disagree
  5 Strongly disagree
  8 Can't choose
  9 No answer

Variable: v33
Label: Closest to Rs belief about God?
Question: Please indicate which statement below comes closest to expressing what you believe about God.
Categories:
  1 I don't believe in God
  2 I don't know whether there is a God and I don't believe there is any way to find out
  3 I don't believe in a personal God, but I do believe in a Higher Power of some kind
  4 ...

Variable: v52
Label: Religion respondent raised in
Question: Q22a What religion, if any, were you raised in? Was it Protestant, Catholic, Jewish, some other religion, or no religion? (IF PROTESTANT) What specific denomination was that?
Categories:
  0 No religion
  100 Roman Catholic
  110 Greek Catholic
  200 Protestant
  210 Anglican, Church of England, Episcopal
  220 Baptists

Task
Detect variable mentions in full texts
Link the text to the survey variable:
- match to label
- match to label and answer categories
- match to answer categories

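The three matching modes of the task can be illustrated with a small sketch. It is not the method used in the project: the `SurveyVariable` class, the stopword list, the `link_sentence` helper, and the use of Jaccard token overlap as the matching score are all assumptions made for illustration. Only the variable names and labels come from the slides; answer categories for v39 and v40 are not shown there and are left empty.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class SurveyVariable:
    """Hypothetical container for one survey variable."""
    name: str
    label: str
    categories: Tuple[str, ...] = ()


# Tiny illustrative stopword list (an assumption, not from the slides).
STOPWORDS = {"a", "an", "the", "in", "of", "to", "and", "or", "not", "is", "that"}


def tokens(text: str) -> Set[str]:
    """Lowercase word tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of tokens that the two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_scores(sentence: str, variable: SurveyVariable) -> Dict[str, float]:
    """Score a sentence against the label, the categories, and both together."""
    s = tokens(sentence)
    label = tokens(variable.label)
    cats = tokens(" ".join(variable.categories))
    return {
        "label": jaccard(s, label),
        "label_and_categories": jaccard(s, label | cats),
        "categories": jaccard(s, cats),
    }


def link_sentence(sentence: str, variables: List[SurveyVariable],
                  mode: str = "label") -> SurveyVariable:
    """Return the variable with the highest score in the chosen mode."""
    return max(variables, key=lambda v: match_scores(sentence, v)[mode])


VARIABLES = [
    SurveyVariable("v39", "Believe in life after death"),
    SurveyVariable("v40", "Believe in Heaven"),
    SurveyVariable(
        "v20",
        "Religious leaders should not influence vote",
        ("Strongly agree", "Agree", "Neither agree nor disagree", "Disagree",
         "Strongly disagree", "Can't choose", "No answer"),
    ),
]

# Invented example sentence, not taken from the corpus.
print(link_sentence("Most respondents believe in life after death.", VARIABLES).name)
```

Plain token overlap misses paraphrases and, for the original setting, German morphology, which is presumably why the prototypes described on the next slide rely on trained classifiers and semantic similarity instead.
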
Classification Approach
First Prototype:
- approach A: supervised model using ML classifiers (NB, KNN, SVM) based on a BoW representation
- approach B: supervised model using a regression classifier based on semantic similarity scores
Second Prototype:
- more features (e.g., German morphology, DerivBase, Paraphrase Corpus, Distributional Corpus, Word Embeddings)
- joint classification of variables in close-by sentences
- use document structure information

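To make approach A concrete, the following is a minimal sketch of a Naive Bayes (NB) classifier over a bag-of-words (BoW) representation, written in plain Python. It assumes that each training sentence is labelled with the variable it mentions; the slides do not say how the task was actually framed, and the `NaiveBayes` class and the four training sentences are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def bag_of_words(text):
    """Word counts of a sentence (the BoW representation)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(bag_of_words(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        bow = bag_of_words(text)
        total = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            logp = math.log(n / total)  # class prior
            for word, freq in bow.items():
                if word in self.vocab:  # words never seen in training are skipped
                    logp += freq * math.log((counts[word] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


# Invented toy training data; the real gold standard is described on the next slide.
TRAIN = [
    ("belief in life after death is widespread", "v39"),
    ("few people believe in an afterlife after death", "v39"),
    ("many respondents believe in heaven", "v40"),
    ("belief in heaven declined", "v40"),
]

model = NaiveBayes().fit([t for t, _ in TRAIN], [v for _, v in TRAIN])
print(model.predict("respondents who believe in life after death"))
```

A BoW model of this kind only sees surface word overlap between training and test sentences, which is one plausible reason for adding morphological, paraphrase, and embedding features in the second prototype.
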
Gold Standard Corpus
- approx. 100 scientific publications having an established link to a survey study (compiled from the Social Science Open Access Repository)
- 70 variables out of 23 general-domain topics such as religion, economy, political attitudes and participation, attitudes towards marriage, family and partnership, and use of media
- 500 positive and 1,500 negative variable-sentence pairs, hand-tagged by two social science students
- positive = a sentence fragment can be linked to a survey variable
- negative = no semantic equivalence between any variable-sentence pair

Preliminary Evaluation Results
Approach A (ML classifiers based on BoW):
- macro average precision (MAP) ~ 0.3
- macro average recall (MAR) ~ 0.2
Approach B (similarity scores):
- scores for positive annotations range from 0.96 to 4.20 (on a scale of 0 to 5), average ~ 2.6

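For reference, macro-averaged precision and recall are computed per class and then averaged with equal weight, so rarely mentioned variables count as much as frequent ones. The sketch below shows the calculation; the function name `macro_precision_recall` and the gold and predicted labels are invented for illustration and are not the project's evaluation code or data.

```python
from collections import Counter


def macro_precision_recall(gold, predicted):
    """Return (macro precision, macro recall) over all labels seen."""
    labels = sorted(set(gold) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but wrong
            fn[g] += 1  # g was missed
    precisions = [tp[l] / (tp[l] + fp[l]) if tp[l] + fp[l] else 0.0 for l in labels]
    recalls = [tp[l] / (tp[l] + fn[l]) if tp[l] + fn[l] else 0.0 for l in labels]
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


# Invented toy labels: one variable id per sentence.
gold = ["v39", "v39", "v40", "v20"]
pred = ["v39", "v40", "v40", "v39"]
precision, recall = macro_precision_recall(gold, pred)
print(round(precision, 3), round(recall, 3))
```

In this sketch a class that is never predicted contributes a precision of 0, which is one common convention. With 70 variables and only about 500 positive pairs, many classes have few examples, so macro averages are sensitive to such cases.
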
OpenMinTeD and the EOSC
Chance to create new roles to foster interdisciplinary research and sharing of information:
- creating a new value chain (by discovering hidden relationships between different types of information)
- becoming a hub for open tools, applications, and resources in the area of Text Mining

twitter.com/openminted_eu
facebook.com/openminted
bit.do/openmintedlinkedin
vimeo.com/openminted
bit.do/openmintedplus
peter.mutschke@gesis.org
www.openminted.eu