Implementasi Arsitektur EfficientNetV2-Transformer pada Aplikasi Image Captioning Bahasa Indonesia
Implementation of EfficientNetV2-Transformer Architecture for Indonesian Image Captioning Application

Date
2024
Author
Sinulingga, Muhammad Teguh
Advisor(s)
Amalia
Jaya, Ivan
Abstract
Image captioning is a task that combines computer vision, natural language processing (NLP), and machine learning. In this task, the model must not only recognize the objects or scenes in an image but also describe the relationships between them. Image captioning has various use cases, such as adding captions to news images, creating descriptions for medical images, supporting text-based image search, providing image information for visually impaired users, and facilitating interaction between humans and robots. Research on image captioning in Bahasa Indonesia using a combined CNN-Transformer architecture is still limited. Recent work shows that EfficientNetV2, a member of the CNN family developed from EfficientNet, performs well in image feature extraction. In addition, the Transformer architecture has been widely used in NLP tasks such as machine translation. However, no study to date has developed an Indonesian image captioning system that combines these two architectures. This research aims to develop an image captioning system that can generate image descriptions in Bahasa Indonesia. The test results show that the developed model achieves best BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores of 0.6028, 0.3547, 0.2247, and 0.1572, respectively. The study also found that using EfficientNetV2 at the small (S) and medium (M) scales produced different image descriptions and varied evaluation scores.
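The architecture described in the abstract pairs a convolutional image encoder with a Transformer decoder. Below is a minimal sketch of such an encoder-decoder captioning model, assuming TensorFlow/Keras; the vocabulary size, caption length, embedding dimension, and attention settings are illustrative placeholders, not the thesis's actual configuration.

```python
# Sketch of an EfficientNetV2 + Transformer-decoder captioning model (illustrative only).
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 10000   # assumed Indonesian vocabulary size
SEQ_LEN    = 30      # assumed maximum caption length
EMBED_DIM  = 512
NUM_HEADS  = 8

# Encoder: EfficientNetV2-S as a frozen feature extractor (the study compares S and M scales).
cnn = tf.keras.applications.EfficientNetV2S(
    include_top=False, weights="imagenet", input_shape=(384, 384, 3))
cnn.trainable = False

image_in = layers.Input(shape=(384, 384, 3))
features = cnn(image_in)                                         # spatial feature map
features = layers.Reshape((-1, features.shape[-1]))(features)    # flatten H*W into a sequence
features = layers.Dense(EMBED_DIM)(features)                     # project to decoder dimension

# Decoder: one Transformer block with masked self-attention and cross-attention to the image.
tokens_in = layers.Input(shape=(SEQ_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tokens_in)
self_attn = layers.MultiHeadAttention(NUM_HEADS, EMBED_DIM // NUM_HEADS)
x = layers.LayerNormalization()(x + self_attn(x, x, use_causal_mask=True))
cross_attn = layers.MultiHeadAttention(NUM_HEADS, EMBED_DIM // NUM_HEADS)
x = layers.LayerNormalization()(x + cross_attn(x, features))
x = layers.LayerNormalization()(x + layers.Dense(EMBED_DIM, activation="relu")(x))
logits = layers.Dense(VOCAB_SIZE)(x)                             # next-token logits per position

model = tf.keras.Model([image_in, tokens_in], logits)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```

In a typical setup the decoder is trained with teacher forcing: the token input is the reference caption shifted right, and the loss targets are the same caption shifted left. At inference time the caption is generated token by token, feeding each prediction back into the decoder.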
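The BLEU-1 through BLEU-4 scores reported in the abstract measure 1- to 4-gram precision between generated and reference captions. A small illustration of how these scores can be computed with NLTK's corpus_bleu is shown below; the tokenized reference and hypothesis are invented for demonstration only.

```python
# Computing BLEU-1..BLEU-4 with NLTK (illustrative example, not the thesis's evaluation data).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["seorang", "anak", "bermain", "bola", "di", "lapangan"]]]  # one reference per image
hypotheses = [["anak", "bermain", "bola", "di", "lapangan"]]               # model output

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))   # uniform weights over 1..n-grams
    score = corpus_bleu(references, hypotheses, weights=weights,
                        smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.4f}")
```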
Collections
- Undergraduate Theses
