
Can Multimodal Transformers Beat LLMs? A Cross-Attention Approach to Sarcasm Detection in Social Media Videos

Narang, Mohammadhossein (2025) Can Multimodal Transformers Beat LLMs? A Cross-Attention Approach to Sarcasm Detection in Social Media Videos. Master thesis, Voice Technology (VT).

PDF: MA6028608MNarang.pdf (949 kB)

Abstract

Detecting sarcasm in social media videos is a complex challenge for natural language processing, largely due to the inherent ambiguity and semantic incongruity of sarcastic expressions, where the intended meaning often contrasts with the literal words. Sarcasm frequently depends on subtle, unspoken cues such as exaggerated intonation, prosodic changes, or facial expressions that convey underlying attitudes. For example, a raised pitch or exaggerated tone when saying “What a fantastic plan!” may signal a negative or ironic sentiment beneath the surface meaning. While these characteristics complicate automatic detection, the multimodal nature of sarcasm, which includes textual, auditory, and visual signals, offers complementary information that can be exploited to improve recognition accuracy. This thesis proposes a novel sarcasm detection system built on a transformer-based architecture augmented with cross-attention mechanisms, allowing the model to integrate and interpret synchronized inputs from text, speech, and facial expressions. Leveraging the MUStARD++ dataset, the model is trained to identify sarcasm in short video content typical of platforms like TikTok. Traditional sarcasm detection methods depend on textual cues alone, limiting their ability to capture the nuances embedded in tone and facial expression. By incorporating cross-modal attention, the system dynamically prioritizes and aligns salient features across modalities, capturing the interplay of conflicting cues, such as a cheerful tone contrasted with negative words, that is essential to recognizing sarcasm. Comparative experiments with large language models benchmark the proposed model’s performance against unimodal and text-only baselines, highlighting the advantages of multimodal integration for sarcasm detection. This research advances the field of affective computing and has practical applications in content moderation, recommendation systems, and social media analytics. Ethical considerations, including bias mitigation and user privacy, are addressed, and future work is proposed on transfer learning for low-resource contexts and real-time deployment strategies.

Keywords: sarcasm detection, multimodal sarcasm recognition, cross-attention mechanisms, transformer architecture, natural language processing, prosodic features, affective computing, video content analysis, social media analytics.
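To make the cross-modal fusion idea concrete, the sketch below shows one common way such an architecture can be wired: text-token embeddings act as queries that attend over audio and visual feature sequences, and the fused representation is pooled for binary sarcasm classification. This is a minimal PyTorch illustration, not the model described in the thesis; the module names, feature dimensions, mean pooling, and two-way fusion are assumptions chosen for brevity.

import torch
import torch.nn as nn

class CrossModalSarcasmClassifier(nn.Module):
    """Minimal sketch: text queries attend over audio and visual features,
    and the fused representation is pooled for binary sarcasm prediction."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        # Cross-attention blocks: text embeddings act as queries,
        # audio/visual embeddings as keys and values.
        self.text_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_to_visual = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(3 * d_model, 2)  # sarcastic vs. non-sarcastic

    def forward(self, text, audio, visual):
        # text:   (batch, T_t, d_model)  e.g. token embeddings from a text encoder
        # audio:  (batch, T_a, d_model)  e.g. projected prosodic/acoustic frames
        # visual: (batch, T_v, d_model)  e.g. projected facial-expression frames
        a_ctx, _ = self.text_to_audio(text, audio, audio)      # align speech cues to words
        v_ctx, _ = self.text_to_visual(text, visual, visual)   # align facial cues to words
        fused = torch.cat([self.norm(text), self.norm(a_ctx), self.norm(v_ctx)], dim=-1)
        pooled = fused.mean(dim=1)  # simple mean pooling over the token axis
        return self.classifier(pooled)

# Example with random features standing in for real encoder outputs
model = CrossModalSarcasmClassifier()
logits = model(torch.randn(2, 20, 256), torch.randn(2, 50, 256), torch.randn(2, 30, 256))
print(logits.shape)  # torch.Size([2, 2])

The key property this sketch captures is that the text stream is re-contextualized by tone and facial cues before classification, which is how a cheerful prosody can override superficially positive words.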

Item Type: Thesis (Master)
Supervisors: Gao, X. and Nayak, S.
Date Deposited: 16 Jun 2025 11:15
Last Modified: 16 Jun 2025 11:15
URI: https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/664
