
An Exploration of Cross-Lingual Model Transfer in Multimodal Sarcasm Detection

Zhang, Meiling (2025) An Exploration of Cross-Lingual Model Transfer in Multimodal Sarcasm Detection. Master thesis, Voice Technology (VT).

Abstract

Sarcasm detection poses unique challenges due to the contrast between literal expressions and intended meaning, especially in spoken and multimodal communication. Misinterpreting sarcasm can negatively impact applications such as sentiment analysis, human-computer interaction, and online content moderation. While significant progress has been made for English text, robust and generalizable approaches for other languages, particularly tonal languages like Mandarin Chinese, remain underexplored, as does the integration of multimodal cues.

This study introduces a transfer learning framework for multimodal sarcasm detection across English and Mandarin Chinese. The models are constructed and evaluated using three complementary modalities: text, audio, and visual information. For text features, BERT-based sentence embeddings are used to capture deep semantic and contextual nuances. High-level audio features are extracted using VGGish, a deep audio representation model pre-trained on large-scale datasets; these features can implicitly capture intonation, emotion, and other paralinguistic cues. Visual representations are extracted from video segments using ResNet-152, which captures facial expressions and gestures relevant to sarcasm. The modality-specific features are then integrated through a fusion mechanism, allowing for comprehensive multimodal modeling.

Experiments are conducted on the public English MUStARD dataset and MCSD, a Mandarin Chinese sarcasm dataset, both of which contain aligned multimodal data. The task is formulated as a binary classification problem with support vector machines as the baseline classifier. To address the scarcity of labeled Mandarin data, cross-lingual transfer learning experiments are designed for both zero-shot and few-shot settings. In the zero-shot scenario, models trained exclusively on English data are directly evaluated on the Chinese test set. In the few-shot scenario, a small number of labeled Chinese samples are used to adapt the model, simulating low-resource transfer conditions.

Results show that in cross-lingual few-shot transfer from English (MUStARD) to Mandarin (MCSD), providing 40 labeled target samples increases the macro F1 of multimodal fusion (text + audio + video) from 46.8% (zero-shot) to 61.2%. In the reverse direction, transferring from MCSD to MUStARD under the same setting improves the macro F1 of multimodal fusion from 47.9% to 63.6%. Similar improvements are observed for the text and audio modalities. These findings highlight the effectiveness and generalizability of few-shot multimodal transfer learning in both directions. BERT-based text embeddings and VGGish-based audio features contribute most to cross-lingual generalization, while ResNet-based visual features provide complementary cues. Notably, the larger performance gains for audio models when transferring to Mandarin suggest that paralinguistic cues, such as tone and prosody, may be especially salient for sarcasm detection in Mandarin.

Overall, this work systematically investigates multimodal sarcasm detection across English and Mandarin, demonstrating that integrating text, audio, and visual cues, together with transfer learning, enables robust performance in both high- and low-resource settings. The results show that zero-shot and few-shot cross-lingual adaptation can effectively extend sarcasm detection to underexplored languages and modalities.

Keywords: Sarcasm Detection, Multimodal Learning, Transfer Learning, Cross-Lingual, Mandarin Chinese
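
As a rough illustration of the pipeline described above (fusion of precomputed text, audio, and video features, an SVM baseline, and zero-shot versus few-shot cross-lingual transfer), a minimal Python sketch follows. The feature dimensions, the random placeholder data, and the simple concatenation fusion are assumptions made for illustration only, not the thesis implementation.

    # Minimal sketch of the fusion + transfer setup outlined in the abstract.
    # Feature extraction (BERT, VGGish, ResNet-152) is assumed to be done
    # offline; the arrays below are hypothetical placeholders.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import f1_score

    rng = np.random.default_rng(0)

    def fake_features(n, dim):
        """Stand-in for precomputed utterance-level embeddings."""
        return rng.normal(size=(n, dim))

    def fuse(n):
        """Early fusion by concatenating text (768-d), audio (128-d),
        and video (2048-d) features for n utterances."""
        return np.hstack([fake_features(n, 768),
                          fake_features(n, 128),
                          fake_features(n, 2048)])

    # Source language (e.g. MUStARD) training data.
    n_src, n_tgt_few, n_tgt_test = 600, 40, 200
    X_src, y_src = fuse(n_src), rng.integers(0, 2, size=n_src)

    # Target language (e.g. MCSD): a small labeled adaptation set and a test set.
    X_tgt_few, y_tgt_few = fuse(n_tgt_few), rng.integers(0, 2, size=n_tgt_few)
    X_tgt_test, y_tgt_test = fuse(n_tgt_test), rng.integers(0, 2, size=n_tgt_test)

    # Zero-shot: train on the source language only, evaluate on the target test set.
    zero_shot = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    zero_shot.fit(X_src, y_src)
    f1_zero = f1_score(y_tgt_test, zero_shot.predict(X_tgt_test), average="macro")

    # Few-shot: add the 40 labeled target samples to the training pool and retrain.
    few_shot = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    few_shot.fit(np.vstack([X_src, X_tgt_few]),
                 np.concatenate([y_src, y_tgt_few]))
    f1_few = f1_score(y_tgt_test, few_shot.predict(X_tgt_test), average="macro")

    print(f"zero-shot macro F1: {f1_zero:.3f}, few-shot macro F1: {f1_few:.3f}")

With real extracted features in place of the placeholders, the same two fits reproduce the structure of the zero-shot and 40-sample few-shot comparisons reported above.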

Item Type: Thesis (Master)
Name supervisor: Nayak, S. and Gao, X.
Date Deposited: 16 Jun 2025 11:39
Last Modified: 16 Jun 2025 11:39
URI: https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/667
