Corpus Generation untuk Natural Language Processing Bahasa Indonesia dengan Metode Web Corpora

Rahmatunnisa, Nadia

Date

2019

Advisor(s)

Amalia

Lydia, Maya Silvi

Abstract

A corpus is a collection of texts stored electronically for study and research, including natural language processing (NLP). The development of NLP in Indonesia is considered to be hampered by a lack of resources such as corpora. Corpora were originally built by compiling original texts into a computer. Today a corpus can also be built with the web corpora method, which uses the web as the source of text; this is supported by the fact that the web is the largest source of text available in many languages. This study builds an Indonesian-language corpus from seven news sites covering various categories. The first step in building the web corpus is to determine the dataset, which here consists of 52 news URLs. The next step is to analyze the structure of each website, because the structure, and therefore the way elements are extracted, differs from site to site. The sites are then crawled and scraped with the Scrapy framework. The scraped output goes through data cleaning, which removes noise: non-ASCII characters, symbols, numbers, and duplicate spaces are removed, and stopwords are filtered out. The cleaned data is then arranged into a corpus so that it is easy to use. In this study, 85.85% of the content from the seven news sites was successfully retrieved, yielding 569,458 news articles that cover 219,392 word types. The corpus produced with the web corpora method can be used as data for natural language processing research.
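The cleaning steps named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: the function names `clean_text` and `count_types`, the tiny `STOPWORDS` set, the order of the steps, and the lowercasing step are all assumptions made for illustration, since the abstract does not specify them.

```python
import re

# Hypothetical, deliberately tiny stopword list; the abstract does not say
# which Indonesian stopword list the thesis used.
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu", "untuk", "dengan", "pada"}


def clean_text(text: str) -> str:
    """Apply the cleaning steps listed in the abstract to one scraped article."""
    # Assumption: lowercase first so that stopword matching and type counting
    # are case-insensitive (not stated in the abstract).
    text = text.lower()
    # Remove non-ASCII characters.
    text = text.encode("ascii", "ignore").decode("ascii")
    # Remove symbols and numbers by keeping only letters and whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse duplicate spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove stopwords.
    tokens = [tok for tok in text.split(" ") if tok and tok not in STOPWORDS]
    return " ".join(tokens)


def count_types(documents: list[str]) -> int:
    """Count distinct word types across the cleaned documents."""
    types: set[str] = set()
    for doc in documents:
        types.update(clean_text(doc).split())
    return len(types)


if __name__ == "__main__":
    sample = [
        "Harga BBM naik 10% di Jakarta!",
        "Harga   emas turun, dan ini \u201ckabar baik\u201d.",
    ]
    for article in sample:
        print(clean_text(article))
    print(count_types(sample))
```

Under these assumptions, the first sample sentence reduces to `harga bbm naik jakarta`, and the two sentences together contain 8 distinct word types. Counting types over the full cleaned corpus in this way is one plausible reading of the "219,392 word types" figure, though the abstract does not describe how that number was computed.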

URI

http://repositori.usu.ac.id/handle/123456789/21514

Collections

Undergraduate Theses [1180]