Extending Universal Dependencies to Tamil Poetics: Multiword Tokenisation and Ellipsis in the ThirukkuRaḷ Treebank

Dilukshana, C.; Sarveswaran, K.

View/Open

MERGED FULL JOURNAL KJMS VOL 7 (2) NOV 2025 with DOI (pages 217-226).pdf (516.3Kb)

Date

2025-11

Abstract

Treebanks are critical resources in Natural Language Processing (NLP), supporting parser development, linguistic research, and the evaluation of large language models. While Tamil has seen progress in Universal Dependencies (UD) treebanking, existing corpora have been restricted to prose texts, leaving its vast poetic tradition underrepresented. This paper presents the first effort to construct a syntactic treebank for Tamil poetry, focussing specifically on the ThirukkuRaḷ, which is composed in the kuRaḷ veṇpā form. A central challenge in this work is the treatment of multiword tokens (MWTs) and elliptical constructions, both of which occur frequently in Tamil verse due to its agglutinative morphology and metrical constraints. An annotation strategy is proposed within the Enhanced UD (EUD) framework to systematically address five major types of ellipsis (casal, verbal, adjectival, comparative/simile, and cumulative) alongside complex MWT patterns. These annotations not only enhance the representation of Tamil poetic syntax but also broaden the applicability of UD guidelines to underrepresented genres. The contribution underscores the linguistic and computational importance of capturing the structural specificities of Tamil poetry, while establishing a foundation for future cross-linguistic and literary treebanking efforts.

URI

https://ir.kdu.ac.lk/handle/345/8980

Collections

Volume 07, Issue 02, 2025 [21]