Subword Embedding dengan Pendekatan Byte Pair Encoding dan Morfologi Bahasa Indonesia (Subword Embedding Using a Byte Pair Encoding Approach and Indonesian Morphology)
Date
2021
Author
Amalia, Amalia
Advisor(s)
Sitompul, Opim Salim
Mantoro, Teddy
Nabahan, Erna Budhiarti
Abstract
Conventional word embedding methods treat each word as the smallest independent
unit and ignore the word's internal structure. As a result, the embedding cannot
handle out-of-vocabulary (OOV) words and cannot capture the explicit morphological
and syntactic relationships between words. One solution is to generate a word embedding from
its subwords, such as characters or morphemes. This study aims to generate a
subword embedding for bahasa Indonesia by considering additional information on
bahasa Indonesia morphology. The biggest challenge of this research is the process
of tokenizing each word into its subwords and of merging the subword
encodings back into a full-word encoding. The framework of this study consists of
word segmentation using Byte Pair Encoding (BPE), subword encoding using the
GloVe algorithm, merging the subword encodings into a whole-word encoding by
addition, and subword embedding using skip-gram word2vec. The training corpus is from
Wikipedia bahasa Indonesia, with a total of 729,260,836 words and
364,184 unique words, amounting to 746 MB. The model implemented in this
study produces a pre-trained subword embedding with 385,446 tokens and a size
of 950.3 MB. The evaluation uses a series of analogy test sets
to measure semantic and syntactic relations. The evaluation refers to the
Google and BATS benchmark analogy test sets, adapted for bahasa Indonesia. The
evaluation results show that the model of this study is able to handle OOV words
and performs well in capturing semantic and syntactic relations. As an additional
contribution, this study generated a second, combined corpus from Wikipedia
and crawled newspaper text containing 1,652,081,275 words. This study also yields
pre-trained word vectors for bahasa Indonesia, built using the word2vec and
fastText algorithms, and an intrinsic evaluation test set of 4,935
analogy questions for bahasa Indonesia.
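
The segmentation and merging steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the toy word frequencies, the number of merges, the end-of-word marker "</w>", and the function names (learn_bpe, apply_merge, segment, compose) are all assumptions made for illustration, and the subword vectors are passed in as a plain dictionary rather than trained with GloVe. The sketch learns BPE merge rules from a small vocabulary, segments an unseen word into subwords, and builds a whole-word vector by adding the vectors of its subwords.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merge rules from a {word: frequency} dict."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Merge the most frequent pair everywhere in the vocabulary.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {tuple(apply_merge(s, best)): f for s, f in vocab.items()}
    return merges


def segment(word, merges):
    """Split a word (seen or unseen) into subwords by replaying the learned merges."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def compose(word, merges, subword_vectors, dim):
    """Build a word vector by adding the vectors of its subwords (unknown subwords are skipped)."""
    vec = [0.0] * dim
    for sub in segment(word, merges):
        sub_vec = subword_vectors.get(sub)
        if sub_vec is not None:
            vec = [a + b for a, b in zip(vec, sub_vec)]
    return vec


# Hypothetical toy frequencies; a real run would count words in the training corpus.
word_freqs = {"makan": 10, "makanan": 6, "memakan": 4, "dimakan": 3, "minum": 8, "minuman": 5}
merges = learn_bpe(word_freqs, num_merges=10)

# "meminum" is not in the toy vocabulary, but it can still be segmented into known pieces.
print(segment("meminum", merges))
```

Because an unseen word is always decomposed into symbols that were available during training, a vector can be composed for it even though the word itself never appeared in the corpus, which is the sense in which a subword model handles OOV words.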