Removing Unrelated Content from Online News Articles Using the Boilerpipe Algorithm and String Processing

Dwi, Dhany

dc.contributor.advisor	Gunawan, Dani
dc.contributor.advisor	Arisandi, Dedy
dc.contributor.author	Dwi, Dhany
dc.date.accessioned	2021-10-25T04:50:05Z
dc.date.available	2021-10-25T04:50:05Z
dc.date.issued	2021
dc.identifier.uri	https://repositori.usu.ac.id/handle/123456789/44665
dc.description.abstract	Perkembangan jumlah web berita online di dunia mengalami peningkatan yang sangat pesat. Pada umumnya halaman web tidak hanya berisi konten utama, tetapi juga elemen lain seperti panel navigasi, iklan, dan link ke dokumen terkait atau disebut juga boilerplate. Untuk memastikan halaman web berkualitas tinggi, diperlukan algoritma penghapusan boilerplate yang baik untuk mengidentifikasi konten yang relevan dari halaman web. Tujuan ekstraksi konten atau mendeteksi boilerplate adalah untuk memisahkan konten utama dari panel navigasi, iklan, pemberitahuan hak cipta dan link ke dokumen terkait di halaman web. Dalam sistem untuk menghilangkan boilerplate terdapat dua fase: fase ekstraksi konten dan string processing. Fase pertama yaitu ekstraksi konten untuk mengambil konten utama menggunakan algoritma boilerpipe, lalu ke fase berikutnya yaitu string processing untuk membersihkan berita terkait yang biasanya terdapat di tengah-tengah konten berita. Hasil penelitian menunjukkan bahwa sistem ini cukup efektif karena mendapatkan tingkat akurasi yang cukup tinggi setelah dihitung persentase kemiripannya menggunakan cosine similarity, dan juga dapat digunakan untuk mempermudah penelitian berikutnya dalam hal text processing.	en_US
dc.description.abstract	The number of online news websites worldwide has grown very rapidly. Web pages generally contain not only the main content but also other elements such as navigation panels, advertisements, and links to related documents, collectively known as boilerplate. Ensuring high-quality web pages requires a good boilerplate removal algorithm that identifies the relevant content of a page. The purpose of content extraction, or boilerplate detection, is to separate the main content from navigation panels, advertisements, copyright notices, and links to related documents on the page. The boilerplate removal system has two phases: content extraction and string processing. The first phase extracts the main content using the boilerpipe algorithm; the second phase uses string processing to remove the related-news items that usually appear in the middle of the news content. The results show that the approach is quite effective, achieving a fairly high level of accuracy when the percentage similarity is measured with cosine similarity, and its output can also facilitate subsequent research in text processing.	en_US
dc.language.iso	id	en_US
dc.publisher	Universitas Sumatera Utara	en_US
dc.subject	Online News Websites	en_US
dc.subject	Boilerplate	en_US
dc.subject	Content Extraction	en_US
dc.subject	String Processing	en_US
dc.title	Pembersihan Konten yang Tidak Berhubungan pada Artikel Berita Online Menggunakan Algoritma Boilerpipe dan String Processing	en_US
dc.type	Thesis	en_US
dc.identifier.nim	NIM151402122
dc.description.pages	55 pages	en_US
dc.description.type	Skripsi Sarjana	en_US
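
The abstract reports accuracy as a percentage similarity computed with cosine similarity. The following is a minimal sketch of that measure, not the thesis's implementation: it assumes lowercase bag-of-words count vectors, and the function name `cosine_similarity` and the sample strings are hypothetical. The record does not state which tokenization or term weighting the thesis uses.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    # Assumption: lowercase word tokens; the thesis may tokenize differently.
    vec_a = Counter(re.findall(r"\w+", text_a.lower()))
    vec_b = Counter(re.findall(r"\w+", text_b.lower()))

    # Dot product over the tokens of the first text (missing tokens count as 0).
    dot = sum(count * vec_b[token] for token, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))

    # An empty text has no direction, so define its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample strings, for illustration only.
reference = "the main news content"
cleaned = "the main news content"
noisy = "the main news content read also related news"

print(f"cleaned vs reference: {cosine_similarity(reference, cleaned):.2%}")
print(f"noisy vs reference:   {cosine_similarity(reference, noisy):.2%}")
```

Under these assumptions, text that still carries leftover related-news fragments scores lower against a manually cleaned reference than fully cleaned text does, which is how a similarity percentage can serve as an accuracy measure.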

Files in this item

Name: 151402122.pdf
Size: 2.033 MB
Format: PDF
Description: Fulltext


This item appears in the following Collection(s)

Undergraduate Theses [800]
Skripsi Sarjana
