
Assessment of Data Sharing and Reuse Practices in Interaction Studies
This research delves into the current practices of sharing and reusing audio/video data of natural interaction. It explores the technical environments and considerations emerging from workshops and conferences in conversation analysis (CA) and interactional linguistics (IL). The focus is on the challenges and opportunities in developing tools for browsing corpora and the constraints imposed by available technologies, particularly in serving different analytical paradigms.
Part II. Audio/video-recorded and transcribed data of naturally occurring interaction: assessment of current practices of data sharing and reuse, with special attention to their technical environments
Part II - content
- Considerations about practices of data sharing and reuse in CA and IL as they emerge from workshop conferences and discussions
- Assessment of the practices
  - Assessment criteria
  - Practices of data sharing
  - Practices of data reuse
Closing workshop: Project results and future perspectives, 16.09.2024
Some thoughts from the 2nd workshop on designing searchable corpora of social interaction
Workshop held at the University of Basel on 7-8.12.2023 with Arnulf Deppermann, Henrike Helmer and Silke Reineke from the Corpus of Spoken German (FOLK), designed at the IDS.
The IDS corpora
Datenbank für Gesprochenes Deutsch (DGD), including the Archiv für Gesprochenes Deutsch (AGD): a large and heterogeneous collection of corpora, including heritage corpora
+ Forschungs- und Lehrkorpus Gesprochenes Deutsch (FOLK): a systematic collection of audio and video recordings of a wide spectrum of natural everyday interactions in German society
- built since 2008, launched in 2012
- specifically targets Conversation Analysis and Interactional Linguistics
- at the end of 2023: 347 hours of recordings (190 h audio-only, 157 h video), 1,317 documented speakers, 3.3 million transcribed tokens
Focus on potentialities but also limitations
- Tension between developing tools for browsing corpora and the inherent limitations of technologies
- Available technologies are better suited to analyses from certain paradigms than from others:
  - e.g. research on single linguistic forms and co-occurrences, as well as correlations with metadata (e.g. corpus linguistics)
  VERSUS
  - research in conversation analysis (CA) and interactional linguistics (IL) focusing on actions and their linguistic (as well as embodied) formats, from an emic perspective
- CA and IL form an interesting community of practice because they challenge the simplest search engines and express complex analytical requirements -> CA as cripping the corpus technology
The challenge of searching for sequential environments
- Instead of searching for all the occurrences of a form, constraining the environment in which the form occurs (CA: sequential environment): not only the forms that precede and follow the target form, but also interactional features
- Example: searching for dispreferred responses
  - negative expression (e.g. verb in negative form)
  - preceded by some self-repairs and hitches
  - preceded by a pause
  - preceded by a connective
- But CA forces us to think about environments in terms that are much more complex than just co-occurrences or co-texts
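The form-based approximation of a dispreferred response listed above can be sketched as a small sequence matcher. This is only an illustration, not the query mechanism of FOLK or any other platform: the `Unit` data model, the `CONNECTIVES` and `NEGATIVES` word lists and the function name are all assumptions made for the example, and a real transcript would need a far richer annotation layer.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """One annotated element of a turn (hypothetical data model)."""
    kind: str       # "word", "pause", "repair" or "hitch"
    text: str = ""


# Illustrative word lists, NOT taken from any corpus documentation.
CONNECTIVES = {"but", "well", "so"}
NEGATIVES = {"no", "not", "never", "don't", "can't"}


def find_candidates(turn):
    """Return (start, end) index pairs matching the pattern
    connective -> pause -> one or more repairs/hitches -> negative word."""
    hits = []
    for i, unit in enumerate(turn):
        if unit.kind != "word" or unit.text.lower() not in CONNECTIVES:
            continue
        j = i + 1
        if j >= len(turn) or turn[j].kind != "pause":
            continue
        j += 1
        k = j
        # Consume the run of self-repairs and hitches.
        while k < len(turn) and turn[k].kind in ("repair", "hitch"):
            k += 1
        if k == j:
            continue  # at least one repair or hitch is required
        if (k < len(turn) and turn[k].kind == "word"
                and turn[k].text.lower() in NEGATIVES):
            hits.append((i, k))
    return hits


example_turn = [
    Unit("word", "well"), Unit("pause", "(0.8)"), Unit("hitch", "uh"),
    Unit("repair", "I- I"), Unit("word", "can't"), Unit("word", "come"),
]
print(find_candidates(example_turn))  # prints [(0, 4)]
```

Such a matcher only returns candidates. Whether a hit actually is a dispreferred response depends on the sequential position of the turn, which is exactly what a co-occurrence pattern of this kind cannot capture.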
The problem of etic vs emic approaches: the example of metadata
- Metadata are conceived in relation to the (socio)linguistic literature and to current standards, and are fundamental for computing correlations
- BUT in CA, the relevant categories that participants orient to when referring to identities, settings and activities are established locally and emergently. They also change as the interaction unfolds.
  - e.g. a participant can interact at some point as a mother, a political activist, a university professor, ...
- These categories conflict not only with metadata (e.g. characterization of speakers in terms of gender, age, social class, ...) but with the idea of metadata itself
- Again, CA is an interesting challenger to established practices: it questions the lack of contextuality and dynamicity in some essential features of databases
Integrating the database in a web of practices
- Manual practices that represent the current advanced methodology
- Searches enabled by the engines in the database
- Manual refinement of the results
CA was not born out of computerized procedures, but tries to use them in a way that does not distort what CA aims to do -> an invitation to think about research procedures more generally and about integrating the use of databanks within these global procedures
- Instead of treating search engines as producing end results, treating them as heuristic tools for exploring corpora
- Importance of preliminary manual/qualitative in-depth analyses for preparing searches
- Outputs of searches are to be refined: exported and worked on manually OR saved in a sub-space within the system, enabling refinements in a personal workspace (cf. also CLAPI)
CH-ORD Final Workshop, Lugano, 16.09.2024
Database evolution for the study of social interaction: Designing annotations for long-term usability
Report on the 3rd Workshop of the CHORD project
Jérôme Jacquin (UNIL) & Simona Pekarek Doehler (UNINE)
Main sources
Two reports about the 3rd Workshop:
- Stern, G. & Miecznikowski, J. (2024). Large data banks as continuous accomplishments: Developing, managing, and up-dating infrastructures for data sharing.
- Profazi, N. & Miecznikowski, J. (2024). Managing and using data banks of spoken language: perspectives from corpus providers and users.
Disclaimer: This presentation provides only a short summary and some cherry-picked ideas that are presented and detailed in the two reports.
The workshop / 15-16 January 2024, University of Neuchâtel
Aim
- Identify the advantages and challenges posed by the development and management of large data banks, with a specific focus on organization, annotation and the warranting of longevity
- Focus on interactional data and interactional analysis of the data (IL, CA)
Invited speakers
Johannes Wagner (University of Southern Denmark): co-manager of TalkBank
- TalkBank emerged from CHILDES (1984) and has since expanded to encompass 14 corpus banks today
- Includes a transcription environment (CHAT) and an annotation program (CLAN)
- Primary data include audio, video and pictures, linked to transcripts; data from various languages
Carole Etienne (Laboratoire ICAR, Lyon): founding member of CLAPI (Corpus de langue parlée en interaction)
- CLAPI has been developed since the nineties at the ICAR lab
- CLAPI is focused on data in French
- Primary data include video-audio recordings and transcripts of interactions in private and institutional settings (more than 40 corpora)
Large data banks as continuous accomplishments
Challenges to warrant longevity:
1. Adapting to evolving technologies
2. Observing and adopting scientific standards: the FAIR principles
3. Responding to evolving user needs
4. Dealing with the complexity of heritage data
(Stern & Miecznikowski, 2024; Profazi & Miecznikowski, 2024)
Large data banks as continuous accomplishments
1. Adapting to evolving technologies
How to warrant accessibility and interoperability in the long term, given that technologies often evolve fast?
- E.g. TalkBank: its conversation analytic component (CA Bank) necessitated translating Jeffersonian transcription conventions into a computer-readable format (e.g. problems with polyfunctional symbols stemming from the original typewriter-based conventions)
- E.g. TalkBank underwent a set of technological updates, such as
  o the transition from ASCII to Unicode character encoding standards (1990s), which necessitated a re-coding of the data
  o the change from 16-bit to 32-bit within the Macintosh system, which required the implementation of a new version of CLAN
(Stern & Miecznikowski, 2024; Profazi & Miecznikowski, 2024)
Large data banks as continuous accomplishments
2. Observing and adopting scientific standards: the FAIR principles
How to warrant the Findability, Accessibility, Interoperability and Reusability (FAIR) of data?
E.g. CLAPI
  o provides general metadata for each corpus
  o addresses heterogeneity and representativity of the data by creating a sub-platform for a type of data that is over-represented
  o for reusability purposes, video tracks of the primary data were complemented with audio-only tracks for users uneasy with video analysis; transcripts are offered in a simplified format
  o search tools are tailored to the CA and IL communities, but also include sociolinguistic metadata and a concordance tool; searches can be constrained by number of speakers, overlap and other features
(Stern & Miecznikowski, 2024; Profazi & Miecznikowski, 2024)
Large data banks as continuous accomplishments
3. Responding to evolving user needs
How to warrant continuous adaptation to the evolving needs and practices of the scientific communities concerned, which can be diverse?
- E.g. the emergence, within CA and IL, of mixed-methods paradigms that integrate quantification into their analytic procedures
- Responding to the CA community's need for collaborative work on transcripts (which raises technical issues because multiple versions of the same transcript coexist)
  o TalkBank avoids this additional complexity by keeping only one master file. However, CLAN has generally prioritized functionality over usability.
- Adapting transcript formats to user preferences / importance of various converters
  o CLAPI: transcripts were initially provided in EAF format only (an XML schema specific to ELAN); DOC/RTF/TXT formats, which are known to a wider user community, were added later.
(Stern & Miecznikowski, 2024; Profazi & Miecznikowski, 2024)
Large data banks as continuous accomplishments
4. Dealing with the complexity of heritage data
How to integrate data collected before the database was set up?
- Heritage data have a structure that does not correspond to the technical standards of the database
  o This necessitates manual processing, e.g. manual annotation
- Heritage data were collected in compliance with ethical standards that differ from those of today
  o Data cannot be fully shared -> provide only snapshots of the data and grant full access to researchers who sign an agreement with the data owner
- Heritage data raise issues about citation practices
  o Acknowledgment of a resource's production is a necessary symbolic reward and an important incentive for researchers and institutions to make data available
(Stern & Miecznikowski, 2024; Profazi & Miecznikowski, 2024)
Upshots and perspectives
Keeping databases up to date is an arduous, expensive and time-consuming commitment.
Importance of:
- Raising awareness about the interdependence between the data hosted and the hosting platform: the ability of large databases to keep functioning over time depends on the structure of the corpora they host and on previous choices.
- Raising awareness among researchers about existing databases, their affordances and requirements, so that they can maximize the interoperability and adaptability of the data they produce.
(Stern & Miecznikowski, 2024; Profazi & Miecznikowski, 2024)
Thank you for your attention. We look forward to any questions or feedback!
jerome.jacquin@unil.ch
simona.pekarek@unine.ch
- Stern, G. & Miecznikowski, J. (2024). Large data banks as continuous accomplishments: Developing, managing, and up-dating infrastructures for data sharing.
- Profazi, N. & Miecznikowski, J. (2024). Managing and using data banks of spoken language: perspectives from corpus providers and users.
Assessment criteria
- Informativity (Is information lost or gained during the practices? Do the practices add to or reduce the scientific value of the data?)
- FAIR principles
- Legal aspects
- CARE principles
- Efficiency
- Sustainability
Data sharing: changing fieldwork practices
Premise: Awareness of ethical issues and of the protection of the privacy of participants/speakers, including with regard to sensitive data, is generally very well developed among researchers in CA and IL.
Informed consent: This has been a widespread practice in CA/IL for decades. The prospect of sharing data beyond a single research group nevertheless requires adaptations of existing practices, and this process is under way:
- Understanding the notions of "personal data" and "sensitive data"
- Formulating research objectives so that they can include partially unknown future research
- Mentioning de-identification measures without making unrealistic promises of anonymisation
- Dissemination channels
- Recipients and access management
N.B. The lack of consent declarations that provide for the dissemination of data is a thorny problem when digitising heritage corpora.
Data sharing: de-identification
- CA and IL have a long tradition of pseudonymising transcripts and de-identifying audio excerpts and still images.
- Sharing not only excerpts but large quantities of multimedia data changes the scale of de-identification tasks.
- Semi-automated solutions are necessary to carry out these tasks efficiently; such solutions are not yet routinely adopted.
Data sharing: inter-institutional and international collaborations
- For legal reasons concerning data ownership, inter-institutional collaborations to jointly manage corpus data require detailed, explicit agreements.
- International collaborations to jointly manage data are even more complicated, because several legal systems have to be taken into account.
- Currently, problems are treated case by case and require a lot of time and effort from researchers and institutions.
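As a rough illustration of the semi-automated de-identification mentioned under "Data sharing: de-identification", the sketch below pseudonymises names in a plain-text transcript and counts the replacements so that a human can review them. It is a minimal example under stated assumptions: the function name `pseudonymise`, the name-to-pseudonym mapping and the sample transcript are all hypothetical, and nothing here addresses audio or video.

```python
import re


def pseudonymise(transcript, name_map):
    """Replace every whole-word occurrence of a real name by its pseudonym.

    Returns the new text plus a per-name replacement count, which can be
    used to spot names that were expected but never matched.
    """
    if not name_map:
        return transcript, {}
    lookup = {real.lower(): pseudo for real, pseudo in name_map.items()}
    canonical = {real.lower(): real for real in name_map}
    counts = {real: 0 for real in name_map}
    # Longest names first, so "Anna-Maria" is tried before "Anna".
    names = sorted(name_map, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )

    def swap(match):
        key = match.group(1).lower()
        counts[canonical[key]] += 1
        return lookup[key]

    # A single pass avoids re-replacing text that is already a pseudonym.
    return pattern.sub(swap, transcript), counts


# Hypothetical mapping and transcript, for illustration only.
mapping = {"Anna": "Lena", "Anna-Maria": "Eva", "Basel": "Townville"}
text, report = pseudonymise("A: I saw Anna-Maria in Basel\nB: not Anna?", mapping)
print(text)
print(report)
```

The replacement counts support manual checking rather than replacing it: nicknames, misspellings and indirect identifiers are not caught by a fixed name list.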
Data sharing: processing of transcripts
- Data sharing via corpus platforms usually requires a certain amount of preprocessing so that transcripts conform to the input formats of the platform.
- Such preprocessing requires computational know-how, human resources and time. All three are challenging!
Data sharing: metadata
- CA and IL scholars are used to describing their data in detail.
- In view of sharing corpora in digital environments, corpus descriptions need to be formalised by means of descriptive, technical and administrative metadata categories in dedicated documents or document sections.
- This kind of formalisation is not part of the analytical attitude of CA and IL scholars and raises epistemological problems. It also requires technical know-how about file formats and standards.
- Metadata management is not routinely performed, and minimal solutions are often adopted.
Schmidt, T. (2022). Daten und Metadaten. In M. Beißwenger, L. Lemnitzer & C. Müller-Spitzer (Eds.), Forschen in der Linguistik. Eine Methodeneinführung für das Germanistik-Studium (pp. 249-258). Wilhelm Fink.
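To make the three metadata categories named above more concrete, here is a minimal sketch of a formalised record for one recording. Every field name and value is an assumption chosen for illustration; the sketch does not follow any particular metadata standard or platform schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DescriptiveMetadata:
    title: str
    language: str
    setting: str
    number_of_speakers: int


@dataclass
class TechnicalMetadata:
    media_format: str
    transcript_format: str
    duration_seconds: int


@dataclass
class AdministrativeMetadata:
    rights_holder: str
    access_conditions: str
    consent_covers_sharing: bool


@dataclass
class RecordingMetadata:
    descriptive: DescriptiveMetadata
    technical: TechnicalMetadata
    administrative: AdministrativeMetadata

    def missing_fields(self):
        """List 'section.field' names whose value is an empty string."""
        return [
            f"{section}.{name}"
            for section, fields in asdict(self).items()
            for name, value in fields.items()
            if value == ""
        ]

    def to_json(self):
        """Serialise the record so it can be stored next to the data."""
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


# Hypothetical record; all values are invented for the example.
record = RecordingMetadata(
    DescriptiveMetadata("Dinner talk 01", "fr", "family dinner", 4),
    TechnicalMetadata("MP4", "EAF", 1800),
    AdministrativeMetadata("Example University", "", True),
)
print(record.missing_fields())  # prints ['administrative.access_conditions']
```

Keeping the record in a machine-readable form makes gaps visible before a corpus is submitted to a platform. It does not resolve the epistemological issue raised above, since fields such as `setting` remain categories assigned from outside the interaction.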
Data reuse: general premise
- Reusing third-party interactional data has evident advantages in terms of sustainability and efficiency, especially for comparative and diachronic studies (scientific value).
- Gathering one's own data, by doing fieldwork and transcribing spoken and/or multimodal discourse, is a strongly entrenched aspect of CA and IL methodology. Once collected, recordings are repeatedly viewed and listened to, and entire transcripts are read and revised multiple times.
- Researchers therefore perceive reusing third-party interactional data as giving, to some extent, incomplete access to the data (see Part I).
  - Example: Users asked TalkBank for functionalities that would allow them to retranscribe data.
  - Example: Some participants in our project's workshops voiced concerns about the use of third-party data in general.
- Negative assessments of data reuse may lead to refusal instead of a critical investigation of both the opportunities and the challenges of data reuse.
Data reuse: collection-based and quantitative research
- On the methodological level, corpus reuse typically involves not only qualitative single-case studies but also some kind of collection building and/or quantitative research.
- Not all CA and IL scholars have adequate know-how to put a corpus to use with these methods.
- Not all corpus infrastructures offer excellent support for collection building.
- As a consequence, the potential of data reuse is not fully realised.
Further problems:
- Formats and query languages are not standardized, and this complexity impacts data reuse.
- Contextualization: the importance of documentation.
- Practices of citation and referencing: not fully satisfactory, according to the testimonials of corpus hosting platforms.