    Subword Embedding with a Byte Pair Encoding Approach and Indonesian Morphology

    View/Open
    Fulltext (3.373Mb)
    Date
    2021
    Author
    Amalia, Amalia
    Advisor(s)
    Sitompul, Opim Salim
    Mantoro, Teddy
    Nabahan, Erna Budhiarti
    Abstract
    Conventional word embedding methods treat each word as the smallest independent unit and ignore its internal structure. As a result, the embedding cannot handle out-of-vocabulary (OOV) words and cannot explicitly capture the syntactic relationships expressed through morphology. One solution is to build a word's embedding from its subwords, such as characters or morphemes. This study aims to generate a subword embedding for bahasa Indonesia that takes additional information about Indonesian morphology into account. The biggest challenges are tokenizing each word into its subwords and merging the subword encodings back into a full-word encoding. The framework consists of four steps: word segmentation using Byte Pair Encoding (BPE), subword encoding using the GloVe algorithm, merging the subword encodings into a whole-word encoding by addition, and subword embedding using skip-gram word2vec. The training corpus is the bahasa Indonesia Wikipedia, which contains 729,260,836 words in total and 364,184 unique words, in 746 MB. The implemented model produces a pretrained subword embedding with 385,446 tokens in 950.3 MB. The model is evaluated with a series of analogy test sets that measure semantic and syntactic relations, based on the Google and BATS benchmark analogy test sets adapted for bahasa Indonesia. The evaluation results show that the model handles OOV words and performs well at capturing semantic and syntactic relations. As additional contributions, this study produced a second corpus that combines Wikipedia with crawled newspaper text (1,652,081,275 words), pretrained word vectors for bahasa Indonesia built with the word2vec and fastText algorithms, and an intrinsic evaluation test set of 4,935 analogy questions for bahasa Indonesia.
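To make the segmentation and merging steps described in the abstract more concrete, the following is a minimal sketch of generic BPE segmentation followed by additive composition of subword vectors. It is not the dissertation's implementation: the toy word counts, the function names (`learn_bpe`, `segment`, `compose`), and the two-dimensional vectors are all hypothetical, and the sketch does not model the Indonesian morphological information or the GloVe and word2vec training stages that the study uses.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merge rules from a {word: count} mapping."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Split a word (possibly unseen in training) into subwords by replaying the merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def compose(subwords, subword_vectors, dim):
    """Build a word vector as the element-wise sum of its known subword vectors."""
    total = [0.0] * dim
    for subword in subwords:
        vector = subword_vectors.get(subword)
        if vector is None:
            continue  # assumption of this sketch: unknown subwords are skipped
        total = [a + b for a, b in zip(total, vector)]
    return total


# Hypothetical toy counts; the study trains on the bahasa Indonesia Wikipedia instead.
toy_counts = {"makan": 10, "makanan": 6, "memakan": 4, "dimakan": 3, "minum": 8, "minuman": 5}
merges = learn_bpe(toy_counts, num_merges=12)

# "diminum" is absent from toy_counts, yet it still decomposes into learned subwords.
pieces = segment("diminum", merges)

# Hypothetical vectors; in the study the subword encodings come from GloVe.
toy_vectors = {piece: [1.0, 0.5] for piece in pieces}
word_vector = compose(pieces, toy_vectors, dim=2)
```

Because every word can be broken down at least to single characters, a word outside the training vocabulary still receives a vector as long as some of its subwords are known, which is the property the abstract refers to as handling OOV words.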
    URI
    https://repositori.usu.ac.id/handle/123456789/51040
    Collections
    • Doctoral Dissertations [51]

    Repositori Institusi Universitas Sumatera Utara (RI-USU)
    Universitas Sumatera Utara | Perpustakaan | Resource Guide | Katalog Perpustakaan
    DSpace software copyright © 2002-2016  DuraSpace
    Contact Us | Send Feedback
    Theme by 
    Atmire NV
     

     
