Centre for Language Resources and Technologies at University of Ljubljana


The Centre for Language Resources and Technologies at the University of Ljubljana develops language infrastructure for contemporary Slovene. It brings together several faculties and focuses on scientific research, language standardization, language technologies, terminology, and multilinguality. The Centre's corpus work builds on earlier projects such as FIDA, FidaPLUS, and Gigafida, and its current initiatives aim to improve language technologies and to support users with special needs.


Presentation Transcript

Gigafida, Kres, ccGigafida and ccKres
Simon Krek
Centre for Language Resources and Technologies, University of Ljubljana
"Jožef Stefan" Institute, Artificial Intelligence Laboratory

Contents
  Centre for Language Resources and Technologies (University of Ljubljana)
  History (FIDA, FidaPLUS, Gigafida)
  From Gigafida 1.0 to Gigafida 2.0
    Project plan
    New corpus material
    Processing tools and web concordancer
    Corpus of standard Slovene
    Availability
  Plans for the future
SlaviCorp, Prague, 25. 9. 2018

Centre for Language Resources and Technologies
  Part of the Network of Research and Infrastructural Centres at the University of Ljubljana
  Planned from 2012, established in 2015
  Faculty of Arts, Faculty of Computer and Information Science, Faculty of Social Sciences, Faculty of Education, Faculty of Electrical Engineering
  Aimed at scientific research and the development of language infrastructure for contemporary (standard) Slovene
  Language description, language standardization, language technologies, terminology, multilinguality, special needs
  Web page: https://www.cjvt.si/en/

History: FIDA and FidaPLUS
  FIDA (1997-2000)
    Faculty of Arts (Uni-Lj), Jožef Stefan Institute, DZS publishing house, Amebis
    100 million words
    Amebis rule-based tagger (85% accuracy)
    Web concordancer (for partners)
  FidaPLUS (2004-2006)
    Two applied research projects
    620 million words
    Available on the web (general access with registration)
    Same tagger and web concordancer

History: Gigafida
  Communication in Slovene (2008-2013)
    Web page: http://eng.slovenscina.eu/
    Corpora (written, spoken, learners), lexicon, lexical database, tagger, parser, training corpus, web portals (concordancer etc.)
  Gigafida, Kres, ccGigafida, ccKres (2012)
    Gigafida: 1.2 billion words
    Kres: 100 million words (balanced)
    ccGigafida and ccKres (open licence): 100/10 million words
    Statistical tagger (92% accuracy)
    New concordancer, focus on user-friendliness (surveys, log analysis etc.)

History: Gigafida/Kres (genre)
  Gigafida: 1.2 billion
  Kres: 100 million

Gigafida 2.0: project plan
  Project
    Upgrade of Gigafida, Kres, ccGigafida and ccKres (2014-2018)
    Ministry of Culture
    Centre for Language Resources and Technologies
  Plan
    New material (1.5B)
    Processing (new tools)
    Different focus (standard Slovene)
    Availability (distribution)

Gigafida 2.0: new material
  Solving two problems
    Outdatedness
    Underrepresentation
  Outdatedness
    News corpus (http://newsfeed.ijs.si/)
      news portals (rtvslo.si, 24ur.com, siol.net, žurnal24.si, sta.si)
      daily newspapers (delo.si, dnevnik.si, vecer.si etc.)
    From 2012 onwards, two batches (300M words)
  Underrepresentation
    Textbooks and literature (high sales or borrowing in libraries): 10M

Gigafida 2.0: processing and concordancer
  Tagging
    Obeliks tagger (Gigafida 1.0)
    Reldi tagger (94.27% accuracy)
    Meta-tagger
      Training corpus: 10,000 tokens on which the two taggers made different decisions
  Tokenization
    Rule-based (from the Obeliks tool)
    Obeliks4J (https://github.com/clarinsi/Obeliks4J)
  Deduplication

Gigafida 2.0: tagging
  Meta-tagger (training corpus: 10,000 tokens with different decisions)
    Both taggers are wrong or not enough context: 1,121
    Same tag, different lemma: 858
    Excluded: 1,979
    Training set: 8,021
      Obeliks is right: 2,654 (33.09%)
      Reldi is right: 5,367 (66.91%)
    Improvement: 0.71 (tags), 0.73 (lemmas)
    Tool: https://github.com/clarinsi/meta-tagger
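The counts on the tagging slide are internally consistent, which a short arithmetic check in Python confirms. This is only a sanity check on the reported numbers; the variable names are illustrative and are not taken from the meta-tagger tool.

```python
# Counts as reported on the slide.
total_tokens = 10_000          # tokens where the two taggers disagreed
both_wrong_or_unclear = 1_121  # both taggers wrong, or not enough context
same_tag_diff_lemma = 858      # same tag, different lemma
obeliks_right = 2_654
reldi_right = 5_367

# Excluded tokens and the remaining training set.
excluded = both_wrong_or_unclear + same_tag_diff_lemma
training_set = total_tokens - excluded

# Share of the training set on which each tagger was right, in percent.
obeliks_share = round(100 * obeliks_right / training_set, 2)
reldi_share = round(100 * reldi_right / training_set, 2)

print(excluded, training_set, obeliks_share, reldi_share)
```

The two "is right" counts add up to the training set, so every retained token has exactly one tagger marked as correct.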

Gigafida 2.0: deduplication
  Old material from FIDA and FidaPLUS
  Onion tool (http://corpus.tools/wiki/Onion) parameters
    n-gram length (n, default 7)
    Percent of n-gram overlap (p, default 0.5)
  Parameters considered
    n = 7, p = 0.7 (75.8%)
    n = 9, p = 0.5 (75.1%): chosen
  Gigafida 1.1 DeDup
    http://clarin.si/noske/
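The role of the two parameters can be illustrated with a small Python sketch of n-gram-overlap deduplication. This is a simplified illustration written for this transcript, not Onion's actual implementation: the function names, whitespace tokenization, and paragraph-level granularity are all assumptions. A paragraph is dropped when more than share p of its n-grams have already been seen in earlier, kept paragraphs.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) found in a list of tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=7, p=0.5):
    """Keep paragraphs in order, dropping those whose n-gram overlap exceeds p.

    Hypothetical helper: whitespace tokenization is assumed for simplicity.
    """
    seen = set()   # n-grams from all paragraphs kept so far
    kept = []
    for text in paragraphs:
        grams = ngrams(text.split(), n)
        if grams:
            overlap = len(grams & seen) / len(grams)
            if overlap > p:
                continue  # treated as a duplicate of earlier material
        # Paragraphs shorter than n tokens have no n-grams and are always kept.
        seen |= grams
        kept.append(text)
    return kept


docs = [
    "a b c d e f g h i j",
    "a b c d e f g h i j",   # exact repeat of the first paragraph
    "x y z",
]
print(deduplicate(docs, n=3, p=0.5))
```

In this sketch a longer n-gram makes accidental matches less likely, while a higher p tolerates more shared text before a paragraph counts as a duplicate, which is the trade-off behind comparing settings such as 7/0.7 and 9/0.5.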

Gigafida 2.0: web concordancer

Gigafida 2.0: corpus of standard Slovene
  Three categories
    (1) texts for which there is a high probability that the authors were required or wanted to write in standard Slovene: textbooks, literature, magazines, newspapers, legislation etc.
    (2) media that intentionally use non-standard language, e.g. a newspaper published by the Slovenian minority in Italy (Novi Matajur)
    (3) computer-mediated communication (CMC) typical of social media, forums etc.
  Corpora
    Gigafida 2.0 (standard)
    Janes (CMC)
    Specialized (KAS, IMP etc.)

Gigafida 2.0: availability
  Distributor: Centre for Language Resources and Technologies
  Access: CLARIN.SI repository
  ccGigafida and ccKres
    Creative Commons (CC BY-SA)
  Gigafida and Kres
    Standardized (online) contract
    Authentication
    Scrambling
  January 2019

Plans for the future
  Development: Centre for Language Resources and Technologies
  Distribution: CLARIN.SI repository
  Research programme Uni-Lj (Dec 2018)
  Yearly upgrade?
  Neural tagger
  Ongoing projects
    New grammar and KOLOS
    ELEXIS (European Lexicographic Infrastructure)
  Annotation (syntactic treebank, SRL, NER, semantic types?)
  Extraction of data (n-grams, collocations, valency patterns etc.)
