Subword Embedding dengan Pendekatan Byte Pair Encoding dan Morfologi Bahasa Indonesia (Subword Embedding Using a Byte Pair Encoding Approach and Indonesian Morphology)

Amalia, Amalia

Date

2021

Advisor(s)

Sitompul, Opim Salim

Mantoro, Teddy

Nabahan, Erna Budhiarti

Abstract

Conventional word embedding methods treat each word as the smallest independent unit and ignore its internal structure. As a result, such embeddings cannot handle out-of-vocabulary (OOV) words and cannot capture explicit morphological and syntactic relationships between words. One solution is to build a word embedding from subwords, such as characters or morphemes. This study aims to generate a subword embedding for Indonesian (bahasa Indonesia) that incorporates additional information about Indonesian morphology. The main challenges are tokenizing each word into its subwords and merging the subword encodings back into a full-word encoding. The framework consists of four steps: word segmentation using Byte Pair Encoding (BPE), subword encoding using the GloVe algorithm, merging the subword encodings into a whole-word encoding by addition, and subword embedding using skip-gram word2vec. The training corpus is the Indonesian Wikipedia, which contains 729,260,836 words in total and 364,184 unique words, with a size of 746 MB. The implemented model yields a pre-trained subword embedding with 385,446 tokens and a size of 950.3 MB. The evaluation uses a series of analogy test sets that measure semantic and syntactic relations, based on the Google and BATS benchmark analogy test sets adapted for Indonesian. The results show that the model can handle OOV words and performs well at capturing semantic and syntactic relations. As additional contributions, this study produced a second corpus that combines Wikipedia with crawled newspaper text, totaling 1,652,081,275 words; pre-trained word vectors for Indonesian built with the word2vec and fastText algorithms; and an intrinsic evaluation test set of 4,935 analogy questions for Indonesian.
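The first and third steps of the framework (BPE segmentation and merging subword encodings by addition) can be illustrated with a minimal sketch. It is not the dissertation's implementation: the toy word counts, the `</w>` end-of-word marker, the function names, and the `toy_table` of subword vectors are all assumptions made for illustration. In the study the subword vectors come from GloVe and the morphological information is handled separately; neither is modeled here.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merge rules from a {word: count} dictionary."""
    # Start from characters, with an end-of-word marker (an assumed convention).
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Split a word (seen or unseen) into subwords by replaying the learned merges."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def compose(subwords, table, dim):
    """Build a word vector as the element-wise sum of its subword vectors.

    Subwords missing from `table` contribute a zero vector (an assumption).
    """
    vec = [0.0] * dim
    for sub in subwords:
        for k, value in enumerate(table.get(sub, [0.0] * dim)):
            vec[k] += value
    return vec


# Toy word counts, invented for illustration only.
toy_counts = {"membaca": 5, "membeli": 4, "baca": 6, "beli": 3, "pembaca": 2}
merges = learn_bpe(toy_counts, num_merges=10)

# "pembeli" is not in the toy counts, yet it can still be segmented.
pieces = segment("pembeli", merges)

# Hypothetical 2-dimensional subword vectors standing in for trained ones.
toy_table = {piece: [1.0, 0.5] for piece in pieces}
word_vector = compose(pieces, toy_table, dim=2)
```

Because segmentation falls back to single characters when no merge applies, any word can be split into known symbols, which is what allows a subword model to produce a vector for an OOV word.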

URI

https://repositori.usu.ac.id/handle/123456789/51040

Collections

Doctoral Dissertations [64]