dc.description.abstract | Leukon, a language spoken by approximately 1,200 individuals on Simeulue Island, Aceh,
is critically endangered due to the dominance of other languages and social changes. To
preserve this language, the development of digital resources, such as machine translation
systems, is considered a viable solution for documentation and educational purposes for
future generations. However, this endeavor is hindered by the limited availability of
parallel corpora essential for training translation models. This study aims to construct a
Leukon-Indonesian parallel corpus as a linguistic resource to support machine translation
development. The corpus development involved several stages, including transcription
extraction, corpus normalization, and sentence structure refinement using deletion,
insertion, and replacement techniques. Spelling corrections were performed by building a
word dictionary as a reference and applying fuzzy matching techniques based on the
Levenshtein algorithm to detect and correct errors. Optimization was further achieved by
removing duplicates and employing Concatenation Augmentation techniques to enhance
data diversity. The resulting parallel corpus was evaluated by training a machine translation
model using a Bidirectional Long Short-Term Memory (BiLSTM) architecture with
Attention mechanisms. Performance metrics, including Cosine Similarity and Quadratic
Weighted Kappa (QWK), were used for evaluation. The corpus comprises 1,111 lines, and the
model trained on it achieved a Cosine Similarity score of 0.666 and a QWK score of 0.975. These findings
underscore the potential of the constructed corpus to support the preservation of the Leukon
language through the development of effective and sustainable machine translation
systems. | en_US |