Removing Unrelated Content from Online News Articles Using the Boilerpipe Algorithm and String Processing

Dwi, Dhany

Date

2021

Author

Dwi, Dhany

Advisor(s)

Gunawan, Dani

Arisandi, Dedy

Abstract

The number of online news websites worldwide has grown very rapidly. Web pages generally contain not only the main content but also other elements, such as navigation panels, advertisements, and links to related documents, which are collectively known as boilerplate. To ensure high-quality web pages, a good boilerplate removal algorithm is needed to identify the relevant content of a page. The goal of content extraction, or boilerplate detection, is to separate the main content from the navigation panels, advertisements, copyright notices, and links to related documents on a web page. The boilerplate removal system has two phases: content extraction and string processing. The first phase extracts the main content using the boilerpipe algorithm. The second phase applies string processing to remove the related-news items that usually appear in the middle of the news content. The results show that the approach is quite effective: it achieves a fairly high level of accuracy when the percentage similarity is calculated using cosine similarity. The output can also be used to facilitate subsequent research on text processing.
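The second phase and the evaluation step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes the boilerpipe extraction has already produced plain text, that related-news items appear on their own lines behind markers such as "Baca juga" (the marker list is hypothetical), and that cosine similarity is computed over simple term-frequency vectors. The function names `strip_related_lines` and `cosine_similarity` are illustrative only.

```python
import math
import re
from collections import Counter

# Hypothetical markers placed in front of inline related-article links;
# an assumption for illustration, not a list taken from the thesis.
RELATED_PREFIXES = ("baca juga", "baca:", "lihat juga", "simak juga")


def strip_related_lines(text: str) -> str:
    """Drop blank lines and lines that start with a related-news marker."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # str.startswith accepts a tuple of candidate prefixes.
        if stripped.lower().startswith(RELATED_PREFIXES):
            continue
        kept.append(stripped)
    return "\n".join(kept)


def tokenize(text: str) -> list:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the term-frequency vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up sample text standing in for the output of the extraction phase.
    extracted = (
        "Pemerintah mengumumkan kebijakan baru hari ini.\n"
        "Baca juga: Harga beras naik di Medan\n"
        "Kebijakan itu berlaku mulai bulan depan."
    )
    # Made-up reference text containing only the article body.
    reference = (
        "Pemerintah mengumumkan kebijakan baru hari ini.\n"
        "Kebijakan itu berlaku mulai bulan depan."
    )
    cleaned = strip_related_lines(extracted)
    print("before cleaning:", cosine_similarity(extracted, reference))
    print("after cleaning: ", cosine_similarity(cleaned, reference))
```

In this sketch the similarity to the reference text rises once the related-news line is removed, which is the kind of comparison a percentage similarity score would capture. A prefix match is a deliberately simple choice; real pages may need patterns tuned per news site.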

URI

https://repositori.usu.ac.id/handle/123456789/44665

Collections

Undergraduate Theses [800]