Real-time Arabic Video Captioning Using CNN and Transformer Networks Based on Parallel Implementation

Adel Jalal Yousif; Mohammed H. Al-Jammas

Please use this identifier to cite or link to this item: http://148.72.244.84:8080/xmlui/handle/xmlui/12927

Full metadata record

DC Field	Value	Language
dc.contributor.author	Adel Jalal Yousif	-
dc.contributor.author	Mohammed H. Al-Jammas	-
dc.date.accessioned	2024-03-19T13:29:18Z	-
dc.date.available	2024-03-19T13:29:18Z	-
dc.date.issued	2024-03-01	-
dc.identifier.citation	https://djes.info/index.php/djes/article/view/1259	en_US
dc.identifier.issn	1999-8716	-
dc.identifier.uri	http://148.72.244.84:8080/xmlui/handle/xmlui/12927	-
dc.description.abstract	Video captioning techniques have practical applications in fields such as video surveillance and robotic vision, particularly in real-time scenarios. However, most current approaches still exhibit limitations when applied to live video, and research has predominantly focused on English-language captioning. In this paper, we introduce a novel approach for real-time Arabic video captioning using deep neural networks with a parallel architecture implementation. The proposed model relies primarily on an encoder-decoder architecture trained end-to-end on Arabic text. A Video Swin Transformer and a deep convolutional network are employed for video understanding, while the standard Transformer architecture is used for both video feature encoding and caption decoding. Experiments on the translated MSVD and MSR-VTT datasets demonstrate that an end-to-end Arabic model performs better than methods that translate generated English captions into Arabic. Our approach shows notable improvements over the compared methods, achieving CIDEr scores of 78.3 and 36.3 on the MSVD and MSR-VTT datasets, respectively. In terms of inference speed, our model achieves a latency of approximately 95 ms on an RTX 3090 GPU for a temporal video segment of 16 frames captured online from a camera device.	en_US
dc.language.iso	en	en_US
dc.publisher	University of Diyala – College of Engineering	en_US
dc.subject	Arabic Video Captioning	en_US
dc.subject	Parallel Architecture	en_US
dc.title	Real-time Arabic Video Captioning Using CNN and Transformer Networks Based on Parallel Implementation	en_US
dc.type	Article	en_US
Appears in Collections:	مجلة ديالى للعلوم الهندسية / Diyala Journal of Engineering Sciences (DJES)

Files in This Item:

File	Description	Size	Format
8-1259 FF.pdf		1.02 MB	Adobe PDF
