Building a Sinhala-English Parallel Corpus for Neural Machine Translation Based on Exam Questions

Rilfi, MRM; Gunawansha, UGYM; Prasandika, KAC; Chandrani, KGA

Date

2021

Abstract

In neural machine translation (NMT) between two natural languages, a parallel corpus is an essential input to the training process, so an effective method for gathering one is among the most important steps in building an MT system. Constructing a parallel corpus, however, requires substantial knowledge of both languages and is time-consuming. These constraints make digitizing source documents extremely challenging, which in turn lowers the quality of machine translation systems. This research proposes a method for producing an English to Sinhala parallel corpus that is faster and more efficient and requires less human intervention. The system generates a parallel corpus for the language pair in the following steps: scanning the exam question papers with a special type of scanner, optimizing the images for Optical Character Recognition (OCR), extracting text from the images, and converting the unstructured text into structured form as a parallel corpus.
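
The final step, turning unstructured OCR text into a structured parallel corpus, can be illustrated with a minimal Python sketch. The abstract does not say how the authors align the two languages, so everything here is an assumption made for illustration: that each language version of a paper has already been passed through OCR into plain text, that questions start with a number followed by "." or ")", and that the same question carries the same number in both versions. The names `split_questions` and `align_by_number` and the sample strings are hypothetical and are not taken from the paper.

```python
import re
from typing import Dict, List, Tuple

# Assumed layout: a question starts with a number followed by "." or ")".
QUESTION_START = re.compile(r"^\s*(\d+)\s*[.)]\s*(.*)$")


def split_questions(ocr_text: str) -> Dict[int, str]:
    """Group OCR output lines into {question number: question text}."""
    parts: Dict[int, List[str]] = {}
    current = None
    for raw_line in ocr_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines left by the OCR layout
        match = QUESTION_START.match(line)
        if match:
            current = int(match.group(1))
            parts[current] = [match.group(2)]
        elif current is not None:
            # A line without a number continues the current question.
            parts[current].append(line)
    return {n: " ".join(p for p in chunk if p) for n, chunk in parts.items()}


def align_by_number(sinhala_text: str, english_text: str) -> List[Tuple[str, str]]:
    """Pair questions that carry the same number in both language versions."""
    si = split_questions(sinhala_text)
    en = split_questions(english_text)
    shared = sorted(si.keys() & en.keys())  # drop questions missing on one side
    return [(si[n], en[n]) for n in shared]


# Invented sample text standing in for OCR output of one bilingual paper.
sinhala_page = """1. ශ්‍රී ලංකාවේ අගනුවර කුමක්ද?
2. ජලයේ රසායනික
සූත්‍රය කුමක්ද?
3. සූර්යයා නැගෙන්නේ කුමන දිශාවෙන්ද?"""

english_page = """1. What is the capital of Sri Lanka?
2. What is the chemical formula
of water?"""

pairs = align_by_number(sinhala_page, english_page)
for sinhala, english in pairs:
    print(f"{sinhala}\t{english}")
```

Aligning on question numbers keeps the sketch simple and discards any question that OCR recovered in only one language (question 3 in the sample). Real scans would likely need more robust handling of OCR noise, sub-questions, and answer options than this sketch provides.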

URI

http://ir.kdu.ac.lk/handle/345/5248
