Corpus Generation untuk Natural Language Processing Bahasa Indonesia dengan Metode Web Corpora

Rahmatunnisa, Nadia

dc.contributor.advisor	Amalia
dc.contributor.advisor	Lydia, Maya Silvi
dc.contributor.author	Rahmatunnisa, Nadia
dc.date.accessioned	2019-12-03T03:42:59Z
dc.date.available	2019-12-03T03:42:59Z
dc.date.issued	2019
dc.identifier.uri	http://repositori.usu.ac.id/handle/123456789/21514
dc.description.abstract	A corpus is a collection of texts stored electronically for various investigative and research needs, one of which is Natural Language Processing (NLP). The development of NLP in Indonesia is considered to be hampered by a lack of resources such as corpora. Corpora were originally built by compiling original texts into a computer. As the field developed, it became possible to build a corpus with the web corpora method, which treats the web as the source of text. This is supported by the fact that the web is the largest source of text available in many languages. In this study, the corpus to be built is an Indonesian corpus taken from seven news sites across various categories. The first stage in building a web corpus is determining the dataset, which comprises 52 news URLs to be studied. Next, the structure of the web pages is analyzed, because the structure differs from site to site. The sites are then scraped using the Scrapy framework. The scraping results then go through data cleaning, which removes noise such as non-ASCII characters, symbols, numbers, and duplicate spaces, and applies stopword removal. The cleaned data is then arranged into a corpus so that it is easy to use. In this study, the percentage of content successfully obtained from the seven news sites was 85.85%, yielding 569,458 news articles with a coverage of 219,392 word types. A corpus produced with the web corpora method can serve as data for Natural Language Processing research.	en_US
dc.description.abstract	A corpus is a collection of texts stored electronically for purposes such as study and research, one of which is Natural Language Processing (NLP). The development of NLP in Indonesia is considered to be hampered by a lack of resources such as corpora. Originally, a corpus was built by compiling original texts into a computer. Today a corpus can also be built with the web corpora method, which uses the web as the source of text. This is supported by the fact that the web is the largest source of text available in many languages. In this study, the corpus built is an Indonesian-language corpus taken from seven news sites covering various categories. The first step in building a web corpus is determining the dataset, which here consists of 52 news URLs from various categories. The next step is analyzing the structure of each website, because each site is structured differently and its elements must be extracted in a different way. The sites are then crawled by a crawler, and the content is extracted and stored by a scraper; both crawling and scraping are done with Scrapy. The scraped data then goes through data cleaning, which removes non-ASCII characters, symbols, numbers, duplicate spaces, and stopwords. The cleaned data is stored and arranged into a corpus so that it is easy to use. In this study, Scrapy successfully retrieved about 85.85% of the content, yielding 569,458 news articles and 219,392 word types. The corpus from this study can be used for Natural Language Processing research.	en_US
dc.language.iso	id	en_US
dc.publisher	Universitas Sumatera Utara	en_US
dc.subject	Natural Language Processing (NLP)	en_US
dc.subject	Corpus	en_US
dc.subject	Web Corpora	en_US
dc.subject	Scrapy	en_US
dc.subject	Web Scraping	en_US
dc.title	Corpus Generation untuk Natural Language Processing Bahasa Indonesia dengan Metode Web Corpora	en_US
dc.type	Thesis	en_US
dc.identifier.nim	NIM141401025
dc.description.pages	56 pages	en_US
dc.description.type	Undergraduate thesis (Skripsi Sarjana)	en_US
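
The data cleaning stage named in the abstract (removing non-ASCII characters, symbols, numbers, and duplicate spaces, followed by stopword removal) and the final count of word types can be illustrated roughly as follows. This is only a minimal sketch using the Python standard library, not the thesis's actual code: the function names `clean_text` and `count_word_types`, the order of the steps, the lowercasing, and the tiny `STOPWORDS` sample are all assumptions, since this record does not give those details.

```python
import re

# Hypothetical, very small stopword sample for illustration only;
# the stopword list used in the thesis is not given in this record.
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "untuk", "pada", "dengan", "ini", "itu"}


def clean_text(text, stopwords=STOPWORDS):
    """Return the cleaned tokens of one scraped article."""
    # 1. Drop non-ASCII characters.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # 2. Replace symbols and digits with spaces (keep letters and whitespace).
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # 3. Collapse duplicate whitespace; lowercasing is an added assumption.
    text = re.sub(r"\s+", " ", text).strip().lower()
    # 4. Remove stopwords.
    return [token for token in text.split() if token not in stopwords]


def count_word_types(documents):
    """Count distinct words (types) across all cleaned documents."""
    types = set()
    for doc in documents:
        types.update(clean_text(doc))
    return len(types)
```

Calling `clean_text` on each scraped article and then `count_word_types` on the whole collection would give a type count comparable in kind, though not necessarily in value, to the figure reported in the abstract.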

Files in this item

Name: 141401025.pdf
Size: 3.472 MB
Format: PDF
Description: Fulltext

This item appears in the following Collection(s)

Undergraduate Theses [1181]
Skripsi Sarjana
