
Estonian National Corpus 2019: Features and Search Capabilities
Explore the Estonian National Corpus 2019 and its features, including metadata availability, subcorpus sizes, annotations, search interface, and search possibilities like thesaurus, synonyms, and terminology extraction. Learn about accessing corpus files and more. Dive into this valuable resource for translation studies and linguistic analysis.
Presentation Transcript
Estonian National Corpus 2019 Sketch Engine Liisi Jakobson Translation Studies
SUBCORPUS SIZES
Subcorpus                 Token %
Text_Type_Blogs           10.139
Text_Type_Discussion      13.665
Text_Type_Education        2.024
Text_Type_Fiction          0.201
Text_Type_Food             0.389
Text_Type_Health           1.763
Text_Type_Journals        12.856
Text_Type_News            11.289
Text_Type_Religion         2.468
Text_Type_Science          2.027
Text_Type_Society          2.693
Text_Type_Sports           2.449
Text_Type_Wikipedia        2.167
Representative (enough variety) and balanced (right proportions)?
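One rough way to approach the balance question is to total the listed shares and rank them. The sketch below uses only the percentages from the slide. Reading the remainder as "tokens not assigned to any of the listed text types" is an assumption, not something the slide states.

```python
# Token shares (% of all corpus tokens) copied from the slide.
shares = {
    "Text_Type_Blogs": 10.139,
    "Text_Type_Discussion": 13.665,
    "Text_Type_Education": 2.024,
    "Text_Type_Fiction": 0.201,
    "Text_Type_Food": 0.389,
    "Text_Type_Health": 1.763,
    "Text_Type_Journals": 12.856,
    "Text_Type_News": 11.289,
    "Text_Type_Religion": 2.468,
    "Text_Type_Science": 2.027,
    "Text_Type_Society": 2.693,
    "Text_Type_Sports": 2.449,
    "Text_Type_Wikipedia": 2.167,
}


def summarize(shares):
    """Return the total listed share, the remainder, and a descending ranking."""
    total = sum(shares.values())
    # Assumption: whatever is left over belongs to none of the listed types.
    remainder = 100.0 - total
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    return total, remainder, ranked


total, remainder, ranked = summarize(shares)
print(f"Listed text types cover {total:.3f}% of tokens")
print(f"Outside the listed text types: {remainder:.3f}%")
for name, pct in ranked:
    print(f"{name:<24}{pct:>8.3f}")
```

On these figures, discussion, journal, news and blog texts dominate, while fiction and food each account for well under 1% of tokens, which is worth keeping in mind when judging balance.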
Annotations
The Estonian National Corpus is morphologically annotated with the tagging tool EstNLTK v1.6. The abbreviated tag contains only basic information about the part of speech; the longtag contains detailed information, including other categories for particular parts of speech. The concordance annotation (mode) lets you categorize or add labels to concordance lines.
Search interface
The search can be fairly complex: for example, you can search within different subcorpora and text types (domain, website, newspaper No., etc.). It is possible to search for a phrase. You don't have to start a new search; you can just change the criteria. If the search word has more than one part of speech, you can switch between them. You can save your search.
Search possibilities
thesaurus: synonyms and similar words for every word
keywords: terminology extraction of one-word and multi-word units
word lists: lists of Estonian nouns, verbs, adjectives etc., organized by frequency
n-grams: frequency list of multi-word units
concordance: examples in context
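To make the n-gram idea concrete, here is a minimal sketch of a frequency list of multi-word units. It is not how Sketch Engine computes n-grams. The toy sentence and the whitespace tokenization are assumptions for illustration, and `ngram_frequencies` is a hypothetical helper name.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n shifted copies of the token list yields the n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


# Toy Estonian text; a real corpus would use proper tokenization.
text = "see on hea see on halb see on"
tokens = text.lower().split()

bigrams = ngram_frequencies(tokens, 2)
for gram, count in bigrams.most_common():
    print(" ".join(gram), count)
```

The most frequent bigram in this toy text is "see on", which occurs three times; every other bigram occurs once.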
Access to the actual corpus files
Files can't be downloaded directly from the interface. Sketch Engine is not a public cloud. Texts you upload are stored in the personal space of your account, and other users cannot access them. You can, however, grant access to individually selected users by sharing the corpus.
Exporting your search results
A maximum of 10,000 rows can be downloaded. You can choose how many rows to download and set the context size. Downloading is possible in .txt, .csv, .xlsx or .xml format.
Pluses and minuses
User-friendly interface; thorough user manual; terminology extraction; texts are protected; GDEX
Do the texts contain the linguistic features I'm investigating?
Estonian National Corpus 2019: 1.5 billion words
Remote sensing 2022: 4.6 million words
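Because the two corpora differ in size by a factor of several hundred, raw hit counts are not directly comparable. A common way to compare them is frequency per million words. The sketch below uses the corpus sizes from the slide. The hit counts are invented purely for illustration, and `per_million` is a hypothetical helper.

```python
# Corpus sizes in words, as given on the slide.
ENC_2019_WORDS = 1_500_000_000
REMOTE_SENSING_2022_WORDS = 4_600_000


def per_million(raw_count, corpus_size):
    """Normalize a raw hit count to hits per million words."""
    return raw_count / corpus_size * 1_000_000


# Hypothetical hit counts for one search term (not real results).
hits_enc = 3000
hits_remote = 46

print(f"ENC 2019: {per_million(hits_enc, ENC_2019_WORDS):.2f} per million")
print(f"Remote sensing 2022: {per_million(hits_remote, REMOTE_SENSING_2022_WORDS):.2f} per million")
print(f"Size ratio: {ENC_2019_WORDS / REMOTE_SENSING_2022_WORDS:.0f} to 1")
```

With these made-up counts, the term would be relatively more frequent in the small specialized corpus even though it has far fewer raw hits there.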