Research Infrastructure for Arts and Humanities: Token-Annotated Corpora Overview

Exploring token-annotated corpora and treebanks in the field of arts and humanities. Uncover the tools and methodologies for search and analysis, as well as the potential of Linked Open Data (LOD) in enhancing research capabilities and query languages.

Uploaded by kal_mcn on May 31, 2025

Presentation Transcript

Slide 1: Common Lab Research Infrastructure for the Arts and Humanities

Slide 2: OVERVIEW
  Search in Token-annotated Corpora
  Search in Treebanks

Slide 3: OVERVIEW
  Search in Token-annotated Corpora
  Search in Treebanks

Slide 4: TOKEN-ANNOTATED
  Corpus: SoNaR (535 M tokens, Dutch)
    POS-tagged
    Encoded in FoLiA
  Search: OpenSoNaR
    4 search interfaces, of increasing complexity
    Expert: CQP queries
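
To make the idea of a CQP query over a POS-tagged corpus concrete, here is a minimal sketch in plain Python. It is not how OpenSoNaR or CQP is implemented; the tokens, the tag names and the helper functions `matches` and `find` are all invented for illustration. It only shows what a query such as `[pos="ADJ"] [pos="N"]` asks for: a sequence of tokens, each satisfying a regular expression over its attributes.

```python
import re

# Toy POS-tagged sentence; words and tag names are invented for illustration.
tokens = [
    {"word": "de", "pos": "DET"},
    {"word": "oude", "pos": "ADJ"},
    {"word": "man", "pos": "N"},
    {"word": "leest", "pos": "V"},
    {"word": "een", "pos": "DET"},
    {"word": "goed", "pos": "ADJ"},
    {"word": "boek", "pos": "N"},
]


def matches(token, constraint):
    """True if every constrained attribute of the token fully matches its regex."""
    return all(re.fullmatch(rx, token[attr]) for attr, rx in constraint.items())


def find(tokens, pattern):
    """Return the word sequences that match a fixed-length list of token constraints."""
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(matches(t, c) for t, c in zip(window, pattern)):
            hits.append([t["word"] for t in window])
    return hits


# Roughly what the CQP query  [pos="ADJ"] [pos="N"]  expresses.
pattern = [{"pos": "ADJ"}, {"pos": "N"}]
hits = find(tokens, pattern)
print(hits)
```

Real CQP additionally supports repetition, alternation and structural attributes; this sketch covers only fixed-length token sequences.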

Slide 5: TOKEN-ANNOTATED
  Other corpora:
    Own corpora
    BNC
    Contemporary Dutch Corpus
  Search (all use CQP):
    AutoSearch
    BNC Lancaster
    Contemporary Dutch Corpus

Slide 6: TOKEN-ANNOTATED
  LOD?
    Will it bring advantages? If so, which ones?
    Does it retain the power and simple notation of CQP?
    SPARQL queries?
    REs (regular expressions) over token descriptions?
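
The notation question can be illustrated with a small sketch, assuming a token layer stored as subject/predicate/object triples. The predicate names (`word`, `pos`, `next`) and the `:` prefix in the SPARQL comment are hypothetical and not taken from any existing vocabulary. The same adjective-noun search that CQP writes as `[pos="ADJ"] [pos="N"]` becomes an explicit join over triples.

```python
# Token layer as (subject, predicate, object) triples.
# The predicate names are hypothetical, chosen only for this illustration.
triples = {
    ("t1", "word", "de"), ("t1", "pos", "DET"), ("t1", "next", "t2"),
    ("t2", "word", "oude"), ("t2", "pos", "ADJ"), ("t2", "next", "t3"),
    ("t3", "word", "man"), ("t3", "pos", "N"),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the store."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the store, sorted."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)


# A SPARQL formulation over such a store might look like this
# (hypothetical vocabulary):
#   SELECT ?a ?n WHERE { ?a :pos "ADJ" ; :next ?n . ?n :pos "N" . }
pairs = []
for adj in subjects("pos", "ADJ"):
    for nxt in objects(adj, "next"):
        if "N" in objects(nxt, "pos"):
            pairs.append((objects(adj, "word")[0], objects(nxt, "word")[0]))
print(pairs)
```

Even in this toy case the adjacency relation has to be spelled out, which is the kind of notational overhead the slide asks about.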

Slide 7: OVERVIEW
  Search in Token-annotated Corpora
  Search in Treebanks

Slide 8: TREEBANKS
  Treebank = text corpus in which each sentence has been assigned a syntactic structure
  I use:
    CGN, LASSY, CHILDES for Dutch
    LINDAT/CLARIN for many different languages
    Tündra for (mainly) German
    INESS treebanks for multiple languages
  Query languages:
    CGN, LASSY, CHILDES: XPath/XQuery
    LINDAT/CLARIN: PML-TQ
    Tündra: Tiger
    INESS: Tiger
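
As a rough illustration of XPath search in a treebank, the sketch below queries one toy parse with Python's standard `xml.etree.ElementTree`, which supports a subset of XPath. The XML layout (nested `node` elements with `rel`, `cat`, `pos` and `word` attributes) is a simplified assumption modelled loosely on Alpino-style dependency trees, not the exact format of CGN, LASSY or CHILDES.

```python
import xml.etree.ElementTree as ET

# One toy parse; the element and attribute names are a simplified assumption.
xml = """
<alpino_ds>
  <node cat="top" rel="top">
    <node cat="smain" rel="--">
      <node rel="su" cat="np">
        <node rel="det" pos="det" word="de"/>
        <node rel="hd" pos="noun" word="man"/>
      </node>
      <node rel="hd" pos="verb" word="leest"/>
      <node rel="obj1" cat="np">
        <node rel="det" pos="det" word="een"/>
        <node rel="hd" pos="noun" word="boek"/>
      </node>
    </node>
  </node>
</alpino_ds>
"""

root = ET.fromstring(xml)

# XPath: the head noun of every direct-object NP.
heads = root.findall(".//node[@rel='obj1'][@cat='np']/node[@rel='hd'][@pos='noun']")
words = [n.get("word") for n in heads]
print(words)
```

The query already depends on knowing that a direct object is labelled `obj1` and that its head is a child with `rel="hd"`, so even a simple question requires detailed knowledge of the annotation scheme.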

Slide 9: TREEBANKS
  Dedicated search applications:
    GrETEL: example-based search & XPath
    PaQU: dedicated search for dependencies & XPath
  Performance:
    OK for corpora of 65k sentences / 1 M tokens
    Too slow for corpora of 7 M sentences (and getting slower every 18 months)

Slide 10: TREEBANKS
  LOD: could it be used to overcome the many different query languages in use?
  Query language? Same potential, transparent notation?
    Query language syntax: NO problem
    Queries get very complex very quickly
    Must know the structure of the syntactic structures in every fine detail

Slide 11: TREEBANKS
  LOD: linking to other resources
    Combined syntactic/morphological/semantic search
    Wordnet for checking semantic properties (mass/count, human/nonhuman)
    CELEX for morphological/phonological properties
    Performance?
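
A combined syntactic/semantic search of this kind could be sketched as follows. Everything here is an assumption made for illustration: the simplified tree layout, the `LEXICON` dictionary standing in for lookups in a resource such as Wordnet, and the helper `subjects_with`. No real Wordnet or CELEX API is used.

```python
import xml.etree.ElementTree as ET

# Two toy parses in a simplified, Alpino-like layout (an assumption).
treebank = ET.fromstring("""
<treebank>
  <node cat="smain">
    <node rel="su" pos="noun" word="water"/>
    <node rel="hd" pos="verb" word="kookt"/>
  </node>
  <node cat="smain">
    <node rel="su" pos="noun" word="leraar"/>
    <node rel="hd" pos="verb" word="kookt"/>
  </node>
</treebank>
""")

# Hypothetical stand-in for a linked lexical resource; the values are toy data.
LEXICON = {
    "water": {"countability": "mass", "human": False},
    "leraar": {"countability": "count", "human": True},
}


def subjects_with(tree, **wanted):
    """Subject nouns whose lexicon entry has all the requested property values."""
    hits = []
    for node in tree.findall(".//node[@rel='su'][@pos='noun']"):
        entry = LEXICON.get(node.get("word"), {})
        if all(entry.get(key) == value for key, value in wanted.items()):
            hits.append(node.get("word"))
    return hits


human_subjects = subjects_with(treebank, human=True)
mass_subjects = subjects_with(treebank, countability="mass")
print(human_subjects, mass_subjects)
```

The syntactic condition is evaluated on the tree and the semantic condition on the linked resource. With one lexicon lookup per candidate node, the performance question raised on the slide becomes relevant for large corpora.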

Slide 12: Thanks for your attention
