Extending Universal Dependencies to Tamil Poetics: Multiword Tokenisation and Ellipsis in the ThirukkuRaḷ Treebank
Abstract
Treebanks are critical resources in Natural Language Processing (NLP), supporting parser development, linguistic
research, and the evaluation of large language models. While Tamil has seen progress in Universal Dependencies
(UD) treebanking, existing corpora have been restricted to prose texts, leaving its vast poetic tradition
underrepresented. This paper presents the first effort to construct a syntactic treebank for Tamil poetry,
focussing on the ThirukkuRaḷ, which is composed in the kuRaḷ veṇpā form. A central challenge in this work
is the treatment of multiword tokens (MWTs) and elliptical constructions, both of which occur frequently in
Tamil verse owing to the agglutinative morphology of Tamil and the metrical constraints of the verse form. An annotation strategy
is proposed within the Enhanced UD (EUD) framework to systematically address five major types of ellipsis—casal,
verbal, adjectival, comparative/simile, and cumulative—alongside complex MWT patterns. These annotations not
only enhance the representation of Tamil poetic syntax but also broaden the applicability of UD guidelines to
underrepresented genres. The work underscores the linguistic and computational importance of
capturing the structural specificities of Tamil poetry and establishes a foundation for future cross-linguistic and
literary treebanking efforts.
