Enhancing Bulgarian Language Processing with End-to-End Treebank-Informed Pipeline


This article discusses the implementation of a new pipeline for Bulgarian language processing, integrating semantic analysis capabilities and achieving competitive accuracy scores. The adoption of spaCy, a Python-based framework, is explored for its speed, flexibility, and ease of use, with a focus on adapting it to Bulgarian language specifics through tokenization, lemmatization, and training statistical models.

  • Bulgarian Language
  • End-to-End Pipeline
  • Semantic Analysis
  • spaCy Framework
  • Tokenization


Presentation Transcript


  1. Implementing an End-to-End Treebank-Informed Pipeline for Bulgarian
     19th International Workshop on Treebanks and Linguistic Theories
     Alexander Popov, Petya Osenova and Kiril Simov, Bulgarian Academy of Sciences

  2. Previous pipeline for Bulgarian
     • Developed in Java and in the XML-based CLaRK system
     • Java wrappers call various modules: rule-based ones implemented in CLaRK (tokenizer, lemmatizer, etc.) and statistical ones (Mate Tools POS tagger, dependency parser, UKB word sense disambiguation, etc.)
     • Good accuracy scores, but quite slow, heterogeneous and brittle, and difficult to maintain

  3. Desiderata for a new pipeline
     • implemented within a single framework
     • includes semantic analysis capabilities (word sense disambiguation, named entity recognition)
     • achieves competitive accuracy scores
     • affords processing speeds suitable for real applications
     • can handle large volumes of data

  4. Choosing a framework: spaCy
     • Python-based
     • Developed for industrial-strength solutions: fast, well-structured, flexible, and easy to use
     • Non-destructive tokenization: the original input can always be recovered from the processed output
     • Fixed neural architectures, but spaCy can easily be combined with other Python-based modules and libraries
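The non-destructive property can be illustrated without spaCy itself: if every token records its text together with its trailing whitespace, the original input is always exactly recoverable. This is a minimal pure-Python sketch of the idea, not spaCy's implementation (spaCy exposes the equivalent information via token attributes).

```python
# Minimal illustration of non-destructive tokenization (not spaCy's code):
# each token keeps its text and trailing whitespace, so the original
# input string can always be reassembled from the tokens.
import re

def nd_tokenize(text):
    """Split on whitespace, but remember each token's trailing whitespace."""
    return [(m.group(1), m.group(2)) for m in re.finditer(r"(\S+)(\s*)", text)]

def detokenize(tokens):
    """Reconstruct the exact original string."""
    return "".join(word + ws for word, ws in tokens)

text = "Оригиналният  вход\tсе възстановява."
assert detokenize(nd_tokenize(text)) == text
```

Because no character is discarded, downstream annotations can always be aligned back to character offsets in the raw input.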

  5. spaCy: processing speed
     Benchmarks: https://spacy.io/usage/facts-figures

  6. Adapting spaCy for Bulgarian
     1) Adding language-specific lists and rules for tokenization and lemmatization, i.e. adding/modifying resources and methods in spacy.lang.bg and in spacy-lookups-data
     2) Training statistical models on Bulgarian data (the BTB Treebank converted into UD), then saving the models and their metadata for later use

  7. Tokenization
     Tokenization is carried out via rules and language-specific exceptions. This amounts to compiling:
     • a list of strings, each of which is to be analyzed as a single token item
     • attributes associated with abbreviated tokens: lemmas, as well as morphological analyses
     • regular expressions for handling tokens with special symbols, like hyphens and apostrophes
     • regular expressions for handling punctuation marks that should not split strings into tokens (e.g. dates in [DD.MM.YYYY] format)
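The interplay of exception lists and protective regular expressions can be sketched in plain Python. The abbreviations and the date pattern below are illustrative examples; the real resources live in spacy.lang.bg and spacy-lookups-data.

```python
# Sketch of rule-based tokenization with language-specific exceptions.
# The exception entries and patterns are illustrative, not the real lists.
import re

# Strings to keep as single tokens, with associated lemma attributes.
EXCEPTIONS = {
    "г.": {"lemma": "година"},       # year (abbreviation)
    "напр.": {"lemma": "например"},  # for example (abbreviation)
}

# Tokens that must not be split on internal punctuation, e.g. DD.MM.YYYY dates.
DATE_RE = re.compile(r"\d{2}\.\d{2}\.\d{4}")

def rule_tokenize(text):
    tokens = []
    for chunk in text.split():
        if chunk in EXCEPTIONS or DATE_RE.fullmatch(chunk):
            tokens.append(chunk)  # protected: emitted as a single token
        else:
            # naive fallback: separate word characters from punctuation
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

print(rule_tokenize("Роден на 03.05.1987 г. в София."))
# -> ['Роден', 'на', '03.05.1987', 'г.', 'в', 'София', '.']
```

Note how "г." survives as one token because of the exception list, while the final "София." is split, since no rule protects its trailing period.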

  8. Lemmatization
     • Bulgarian has relatively rich morphology: the unique fine-grained morphological tags in the BTB treebank number 578
     • While spaCy uses the Universal POS tagset by default, this is not always expressive enough to make correct decisions due to ambiguity
     • A cascade of decreasingly sophisticated mappings is used to fix lemmas: morpho-tag >> POS tag >> string.lower()
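The back-off cascade above can be sketched as a chain of dictionary lookups. The lexicon entries here are illustrative stand-ins for the BTB resources, and the morpho-tags are example strings, not an exhaustive tagset.

```python
# Sketch of the lemma back-off cascade:
# fine-grained morpho-tag first, then coarse UPOS, then the lowercased form.
# The table entries are illustrative, not the real BTB lexicon.
MORPHO_LEMMAS = {("вода", "Ncfsd"): "вода"}  # (form, fine-grained tag) -> lemma

# "води" is ambiguous: 3sg of the verb "водя" vs. plural of the noun "вода".
POS_LEMMAS = {("води", "VERB"): "водя", ("води", "NOUN"): "вода"}

def lemmatize(form, upos, morpho_tag):
    if (form, morpho_tag) in MORPHO_LEMMAS:       # most specific mapping
        return MORPHO_LEMMAS[(form, morpho_tag)]
    if (form, upos) in POS_LEMMAS:                # coarse UPOS back-off
        return POS_LEMMAS[(form, upos)]
    return form.lower()                           # last resort

print(lemmatize("води", "VERB", "Vpitf-r3s"))  # -> "водя"
print(lemmatize("Непознато", "X", "?"))        # -> "непознато"
```

The ambiguous form "води" shows why the coarse UPOS tag still matters when no fine-grained entry is found: the verb and noun readings resolve to different lemmas.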

  9. Word forms with multiple lemmas

  10. POS tagging and dependency parsing
      • POS tagging accuracy: 94.13% on the development set and 94.49% on the test set (UD Treebank)
      • Old pipeline accuracy: 96.87% (trained on an older treebank version)
      • Dependency parsing: 88.95% UAS / 83.03% LAS on the development set and 89.71% UAS / 83.95% LAS on the test set (UD)
      • Old pipeline: 90.83% UAS / 87.41% LAS (older data)
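For readers unfamiliar with the parsing metrics: UAS (unlabeled attachment score) is the fraction of tokens with the correct head, and LAS (labeled attachment score) additionally requires the correct dependency label. A minimal sketch of the computation, with made-up gold and predicted analyses:

```python
# Sketch: computing UAS/LAS from gold and predicted (head, deprel) pairs.
def uas_las(gold, pred):
    """gold/pred: one (head_index, dependency_label) pair per token."""
    assert len(gold) == len(pred)
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))   # head only
    label_hits = sum(g == p for g, p in zip(gold, pred))        # head + label
    n = len(gold)
    return head_hits / n, label_hits / n

# Toy 3-token sentence: all heads correct, one label wrong (obj vs. obl).
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl")]
uas, las = uas_las(gold, pred)  # UAS 1.00, LAS ~0.67
```

This also explains why UAS is always at least as high as LAS in the figures above: every labeled match is also an unlabeled match.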

  11. Evaluation of the core dependency relations

  12. Named entity recognition
      • Since NE data is not included in the UD version, we trained the NER module separately on the BulTreeBank data
      • We also included data from the BSNLP corpus (Marinova et al., 2020), which adds Event, Product and Other types to Person, Location and Organization
      • The two corpora were processed into the spaCy-readable IOB format, then concatenated and shuffled to balance them between the training (20,803 sentences) and development (2,312 sentences) portions
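The IOB (inside-outside-begin) conversion mentioned above can be sketched as follows; the sentence, spans, and labels are illustrative, not taken from the actual corpora.

```python
# Sketch: converting entity span annotations into per-token IOB tags,
# the format the two NER corpora were concatenated in.
def to_iob(tokens, spans):
    """spans: list of (start_token, end_token_exclusive, label) triples."""
    tags = ["O"] * len(tokens)            # default: outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"        # B- marks the entity's first token
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"        # I- marks continuation tokens
    return list(zip(tokens, tags))

sent = ["Иван", "живее", "в", "София", "."]
spans = [(0, 1, "PER"), (3, 4, "LOC")]
print(to_iob(sent, spans))
# -> [('Иван', 'B-PER'), ('живее', 'O'), ('в', 'O'), ('София', 'B-LOC'), ('.', 'O')]
```

Once both corpora share this token-level representation, concatenating and shuffling them is straightforward, which is what allows the balanced train/dev split described above.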

  13. NER evaluation
      Combined (cross-category) results: Precision 92.75%, Recall 93.31%, F-measure 93.03%

  14. Word sense disambiguation
      • Experimented with the EWISER system (Bevilacqua and Navigli, 2020) because of its superior accuracy compared to other systems, easy integration with spaCy, and multilingual support
      • EWISER improves the state of the art in WSD on the popular evaluation framework for English (Raganato et al., 2017), breaking through the 80% glass ceiling
      • State-of-the-art results on German, French, Italian and Spanish data when trained on SemCor and multilingual BERT

  15. Adapting the WSD module for Bulgarian
      • preparing a dictionary that maps lemmas to possible synsets
      • replacing the Bulgarian synset IDs with IDs from Princeton WordNet
      • mapping the PWN IDs from version 3.1 (used in BTB) to version 3.0 (used in EWISER)
      • mapping WN IDs to BabelNet IDs, in order to produce the final dictionary required by EWISER
      • compiling a list of lemmas that the system can recognize (i.e. lemmas present in the dictionary)
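The steps above form an ID-mapping chain that can be sketched as nested dictionary lookups. All table entries below are illustrative placeholders, not real BTB, WordNet, or BabelNet mappings.

```python
# Sketch of the ID-mapping chain for building the EWISER dictionary:
# BTB synset -> PWN 3.1 -> PWN 3.0 -> BabelNet.
# Every ID below is a placeholder, not a real mapping entry.
BTB_TO_PWN31 = {"btb-00123": "02084071-n"}
PWN31_TO_PWN30 = {"02084071-n": "02084071-n"}    # many IDs coincide; not all do
PWN30_TO_BABELNET = {"02084071-n": "bn:00015267n"}
LEMMA_TO_BTB = {"куче": ["btb-00123"]}           # lemma -> candidate BTB synsets

def lemma_to_babelnet(lemma):
    """Resolve a lemma to BabelNet synset IDs via the full mapping chain."""
    synsets = []
    for btb_id in LEMMA_TO_BTB.get(lemma, []):
        pwn31 = BTB_TO_PWN31.get(btb_id)
        pwn30 = PWN31_TO_PWN30.get(pwn31) if pwn31 else None
        if pwn30 and pwn30 in PWN30_TO_BABELNET:
            synsets.append(PWN30_TO_BABELNET[pwn30])
    return synsets

print(lemma_to_babelnet("куче"))
```

The final recognizable-lemma list then falls out naturally: it is simply the set of lemmas for which this chain resolves to at least one BabelNet ID.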

  16. Conclusions & future work
      • End-to-end pipeline for Bulgarian, including semantic analysis
      • Planned improvements:
        • better tokenization/segmentation through more precise rules and syntactic parses
        • better mapping between POS/morpho-tags and lemma candidates
        • adapting the BTB-WordNet data for training WSD models
        • adding more gold data for training the NER module and the UD parser
        • optimizing the training parameters
