Universal Dependencies: Treebank Annotation Schemes and Syntactic Analysis
The Universal Dependencies project aims to provide cross-linguistically consistent grammatical annotation in support of multilingual NLP and linguistic research. Its guidelines cover the annotation of syntactic words, part-of-speech tags, morphological features, and dependency relations, with an emphasis on maximizing parallelism and applying the same annotation principles across languages.
Presentation Transcript
Universal Dependencies
Joakim Nivre, Uppsala University
Universal Dependencies
Background:
- Treebank annotation schemes vary across languages
- Hard to compare results across languages [Nivre et al. 2007]
- Hard to evaluate cross-lingual learning [McDonald et al. 2013]
- Hard to build multilingual systems
Universal Dependencies (http://universaldependencies.github.io/docs/):
- Stanford universal dependencies [de Marneffe et al. 2014]
- Google universal part-of-speech tags [Petrov et al. 2012]
- Interset morphological features [Zeman 2008]
- First guidelines released Oct 1, 2014
- First 10 treebanks released Jan 15, 2015
Universal Dependencies
- Syntactic words: explicit splitting of clitics and contractions
- Universal part-of-speech tags + morphological features
- Dependency tree + augmented dependencies (not shown)
Goals
- Cross-linguistically consistent grammatical annotation
- Support multilingual NLP and linguistic research
- Build on common usage and existing de facto standards
- Complement, not replace, language-specific schemes
- Open community effort: anyone can contribute
Guiding Principles
- Maximize parallelism
- Don't annotate the same thing in different ways
- Don't make different things look the same
- Don't annotate things that are not there
- Languages select from a universal pool of categories
- Allow language-specific extensions
Design Principles
Dependency
- Widely used in practical NLP systems
- Available in treebanks for many languages
Lexicalism
- Basic annotation units are words (syntactic words)
- Words have morphological properties
- Words enter into syntactic relations
Recoverability
- Transparent mapping from input text to word segmentation
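The recoverability principle can be illustrated with a minimal sketch. It assumes a hypothetical token layout, chosen only for this example: each surface token stores its form, whether a space follows it, and the syntactic words it splits into. The French contraction "du" (split into "de" + "le") is the one used elsewhere in these slides.

```python
from typing import List, Tuple

# Hypothetical layout for this sketch: (surface form, space follows?, syntactic words).
Token = Tuple[str, bool, List[str]]


def surface_text(tokens: List[Token]) -> str:
    """Rebuild the input text from the surface tokens."""
    parts = []
    for form, space_after, _words in tokens:
        parts.append(form)
        if space_after:
            parts.append(" ")
    return "".join(parts).rstrip()


def syntactic_words(tokens: List[Token]) -> List[str]:
    """Flatten the tokens into the word segmentation used for annotation."""
    words = []
    for _form, _space_after, token_words in tokens:
        words.extend(token_words)
    return words


sentence = [
    ("Le", True, ["Le"]),
    ("chat", True, ["chat"]),
    ("boit", True, ["boit"]),
    ("du", True, ["de", "le"]),  # contraction split into two syntactic words
    ("lait", False, ["lait"]),   # no space before the final period
    (".", False, ["."]),
]

print(surface_text(sentence))     # Le chat boit du lait.
print(syntactic_words(sentence))  # ['Le', 'chat', 'boit', 'de', 'le', 'lait', '.']
```

Because both views are derived from the same token list, the original text and the word segmentation cannot drift apart.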
Morphological Annotation
Form    Lemma    POS    Features
Le      le       DET    Definite=Def Gender=Masc Number=Sing
chat    chat     NOUN   Gender=Masc Number=Sing
chasse  chasser  VERB   Mood=Ind Number=Sing Person=3
les     le       DET    Definite=Def Gender=Masc Number=Plur
chiens  chien    NOUN   Gender=Masc Number=Plur
.       .        PUNCT
- The lemma represents the semantic content of a word
- The part-of-speech tag represents its grammatical class
- Features represent lexical and grammatical properties of the lemma or the particular word form
Syntactic Annotation
- Content words are related by dependency relations
- Function words attach to the content word they modify
- Punctuation attaches to the head of the phrase or clause
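Since every word attaches to exactly one head, the basic annotation of a sentence forms a tree. The following is a minimal sketch of a well-formedness check, assuming heads are given as a list of 1-based indices with 0 marking the root; the function name `is_tree` is hypothetical.

```python
from typing import List


def is_tree(heads: List[int]) -> bool:
    """heads[i] is the 1-based head of word i+1; 0 marks the root."""
    n = len(heads)
    # Exactly one word may attach to the artificial root.
    if sum(1 for h in heads if h == 0) != 1:
        return False
    # Every head index must point at the root or at a word in the sentence.
    if any(h < 0 or h > n for h in heads):
        return False
    # Following heads upward from any word must reach the root without a cycle.
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


# Heads for "Le chat boit de le lait ." with "boit" as the root.
print(is_tree([2, 3, 0, 6, 6, 3, 3]))  # True
print(is_tree([2, 1, 0]))              # False: words 1 and 2 form a cycle
```

The check is quadratic in the worst case, which is fine for sentence-length inputs.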
CoNLL-U Format
ID   FORM  LEMMA  UPOSTAG  XPOSTAG  FEATS  HEAD  DEPREL  DEPS  MISC
1    Le    le     DET      _        _      2     det     _     _
2    chat  chat   NOUN     _        _      3     nsubj   _     _
3    boit  boire  VERB     _        _      0     root    _     _
4-5  du    _      _        _        _      _     _       _     _
4    de    de     ADP      _        _      6     case    _     _
5    le    le     DET      _        _      6     det     _     _
6    lait  lait   NOUN     _        _      3     obj     _     SpaceAfter=no
7    .     .      PUNCT    _        _      3     punct   _     _
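The ten columns can be read with a few lines of code. The following is a minimal sketch, assuming tab-separated fields and `#` comment lines; the names `parse_conllu` and `words_only` are hypothetical helpers, not part of any UD tool. The sample restates the French example from this slide.

```python
from typing import Dict, List

FIELDS = ["ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_conllu(block: str) -> List[Dict[str, str]]:
    """Parse one sentence block into a list of field dictionaries."""
    rows = []
    for line in block.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
        rows.append(dict(zip(FIELDS, values)))
    return rows


def words_only(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop multiword-token range lines such as '4-5', keeping syntactic words."""
    return [row for row in rows if "-" not in row["ID"]]


RAW = """
1 Le le DET _ _ 2 det _ _
2 chat chat NOUN _ _ 3 nsubj _ _
3 boit boire VERB _ _ 0 root _ _
4-5 du _ _ _ _ _ _ _ _
4 de de ADP _ _ 6 case _ _
5 le le DET _ _ 6 det _ _
6 lait lait NOUN _ _ 3 obj _ SpaceAfter=no
7 . . PUNCT _ _ 3 punct _ _
"""
# The sample is written with spaces for readability; convert it to tabs here.
SAMPLE = "\n".join("\t".join(line.split()) for line in RAW.strip().splitlines())

rows = parse_conllu(SAMPLE)
print([row["FORM"] for row in words_only(rows)])
# ['Le', 'chat', 'boit', 'de', 'le', 'lait', '.']
```

Keeping the range line `4-5` in the parsed rows preserves the surface token "du", while `words_only` gives the sequence that the HEAD column indexes into.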
Dependency Structure
(figure panels: English, Swedish)
- Keeping content words as heads promotes parallelism across languages
- Function words often correlate with morphology
Dependency Relations [de Marneffe et al. 2014]
- Taxonomy of 42 universal grammatical relations, broadly attested across languages in linguistic typology
- Language-specific subtypes can be added
Morphology: POS
Open class words:    ADJ ADV INTJ NOUN PROPN VERB
Closed class words:  ADP AUX CONJ DET NUM PART PRON SCONJ
Other:               PUNCT SYM X
- Taxonomy of 17 universal part-of-speech tags, based on the Google Universal Tagset [Petrov et al. 2012]
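The three groups on this slide can be written down directly as sets, which makes tag validation a simple lookup. This is a minimal sketch using the tag inventory exactly as listed here; the function name `tag_class` is hypothetical.

```python
OPEN_CLASS = {"ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB"}
CLOSED_CLASS = {"ADP", "AUX", "CONJ", "DET", "NUM", "PART", "PRON", "SCONJ"}
OTHER = {"PUNCT", "SYM", "X"}

UPOS_TAGS = OPEN_CLASS | CLOSED_CLASS | OTHER


def tag_class(tag: str) -> str:
    """Return 'open', 'closed', or 'other'; raise on an unknown tag."""
    if tag in OPEN_CLASS:
        return "open"
    if tag in CLOSED_CLASS:
        return "closed"
    if tag in OTHER:
        return "other"
    raise ValueError(f"not a universal part-of-speech tag: {tag!r}")


print(len(UPOS_TAGS))     # 17
print(tag_class("NOUN"))  # open
print(tag_class("DET"))   # closed
```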
Morphology: Universal Features
- Standardized inventory of morphological features, based on the Interset system [Zeman 2008]
Lexical features:                 PronType NumType Poss Reflex Foreign Abbr
Inflectional features, nominal*:  Gender Animacy Number Case Definite Degree
Inflectional features, verbal*:   VerbForm Mood Tense Aspect Voice Evident Polarity Person Polite
Morphology: Examples
la     Definite=Def|Gender=Fem|Number=Sing|PronType=Art
hanno  Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin
fatto  Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part
casa   Gender=Fem|Number=Sing
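Feature strings like these are `Name=Value` pairs joined by `|`, so they map naturally onto a dictionary. The following is a minimal sketch; the function names are hypothetical, and it assumes an underscore stands for an empty feature set, as in the CoNLL-U example.

```python
from typing import Dict


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn 'Gender=Fem|Number=Sing' into {'Gender': 'Fem', 'Number': 'Sing'}."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def format_feats(feats: Dict[str, str]) -> str:
    """Serialize a feature dictionary, sorted by feature name."""
    if not feats:
        return "_"
    ordered = sorted(feats.items(), key=lambda item: item[0].lower())
    return "|".join(f"{name}={value}" for name, value in ordered)


la = parse_feats("Definite=Def|Gender=Fem|Number=Sing|PronType=Art")
casa = parse_feats("Gender=Fem|Number=Sing")

# "la" and "casa" agree on every feature they share.
shared = {name for name in la if name in casa}
print(sorted(shared))                          # ['Gender', 'Number']
print(all(la[n] == casa[n] for n in shared))   # True
print(format_feats(la))  # Definite=Def|Gender=Fem|Number=Sing|PronType=Art
```

Working with dictionaries makes agreement checks like the one above a matter of comparing shared keys.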