Zhang, Meiling (2025) An Exploration of Cross-Lingual Model Transfer in Multimodal Sarcasm Detection. Master thesis, Voice Technology (VT).
PDF: MScSpeechTechThesisTemplatev120520-6.pdf (871 kB)
Abstract
Sarcasm detection poses unique challenges due to the contrast between literal expressions and intended meaning, especially in spoken and multimodal communication. Misinterpretation of sarcasm can negatively impact sentiment analysis, human-computer interaction, and online content moderation. While significant progress has been made for English text, robust and generalizable approaches for other languages, especially tonal languages such as Mandarin Chinese, remain underexplored, as does the integration of multimodal cues.

This study presents a transfer learning framework for multimodal sarcasm detection across English and Mandarin Chinese. Models are constructed and evaluated using three complementary modalities: text, audio, and visual information. For text features, BERT-based sentence embeddings capture deep semantic and contextual nuances. High-level audio features are extracted using VGGish, a deep audio representation model pre-trained on large-scale audio datasets; these features may implicitly reflect intonation, emotion, and other paralinguistic cues. ResNet-152 is employed to extract high-level visual representations from video segments, capturing facial expressions and gestures relevant to sarcasm. These modality-specific features are integrated through a fusion mechanism so that they provide complementary information, enabling comprehensive multimodal modeling.

Experiments are conducted on the public English MUStARD dataset and MCSD, a Mandarin Chinese sarcasm dataset, both containing aligned multimodal data. The task is formulated as binary classification with support vector machines as the baseline classifier. To address the scarcity of labeled Mandarin data, we design cross-lingual transfer learning experiments in both zero-shot and few-shot settings. In the zero-shot scenario, models trained exclusively on English data are directly evaluated on the Chinese test set; in the few-shot scenario, a small number of labeled Chinese samples is used to adapt the model, simulating low-resource transfer conditions.

In cross-lingual few-shot transfer from English (MUStARD) to Mandarin (MCSD), providing 40 labeled target samples increases the macro F1 of multimodal fusion (text+audio+video) from 46.8% (zero-shot) to 61.2%. In the reverse direction, from MCSD to MUStARD under the same setting, the macro F1 of multimodal fusion improves from 47.9% to 63.6%. Similar improvements are observed for the text and audio modalities. These findings highlight the effectiveness and generalizability of few-shot multimodal transfer learning in both directions. BERT-based text embeddings and VGGish-based audio features contribute most to cross-lingual generalization, while ResNet-based visual features provide complementary cues. The greater performance gains for audio models when transferring to Mandarin suggest that paralinguistic cues, such as tone and prosody, may be more salient for sarcasm detection in Mandarin than in English.

This work systematically investigates multimodal sarcasm detection across English and Mandarin, demonstrating that integrating text, audio, and visual cues, together with transfer learning, enables robust performance in both high- and low-resource settings. Our results show that zero-shot and few-shot cross-lingual adaptation can effectively extend sarcasm detection to underexplored languages and modalities.

Keywords: Sarcasm Detection, Multimodal Learning, Transfer Learning, Cross-Lingual, Mandarin Chinese
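The abstract describes a pipeline of pre-trained feature extractors (BERT for text, VGGish for audio, ResNet-152 for video), modality fusion, an SVM baseline, and zero-shot versus few-shot cross-lingual evaluation with macro F1. The sketch below illustrates that setup with synthetic stand-in feature vectors; the concatenation fusion, the linear SVM kernel, the dataset sizes, and all names are illustrative assumptions rather than the thesis implementation, and only the 40-sample few-shot budget and the macro F1 metric are taken from the abstract.

```python
# Minimal sketch of the fusion-and-transfer setup described in the abstract.
# Assumption: random vectors stand in for real per-utterance embeddings
# (BERT text: 768-d, VGGish audio: 128-d, ResNet-152 visual: 2048-d).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def fake_features(n):
    """Stand-ins for per-utterance text, audio, and visual embeddings,
    fused by simple concatenation (an assumed fusion mechanism)."""
    text  = rng.normal(size=(n, 768))    # BERT sentence embedding size
    audio = rng.normal(size=(n, 128))    # VGGish embedding size
    video = rng.normal(size=(n, 2048))   # ResNet-152 pooled feature size
    return np.hstack([text, audio, video])

# Source-language data (MUStARD-style English clips) and
# target-language data (MCSD-style Mandarin clips), binary labels.
X_src, y_src = fake_features(600), rng.integers(0, 2, 600)
X_tgt, y_tgt = fake_features(200), rng.integers(0, 2, 200)

def make_clf():
    # SVM baseline classifier; the linear kernel is an assumed choice.
    return make_pipeline(StandardScaler(), SVC(kernel="linear"))

# Zero-shot transfer: train on the source language only, test on the target.
zero_shot = make_clf().fit(X_src, y_src)
f1_zero = f1_score(y_tgt, zero_shot.predict(X_tgt), average="macro")

# Few-shot transfer: add 40 labeled target samples to the training set,
# then evaluate on the remaining target data.
k = 40
few_shot = make_clf().fit(np.vstack([X_src, X_tgt[:k]]),
                          np.concatenate([y_src, y_tgt[:k]]))
f1_few = f1_score(y_tgt[k:], few_shot.predict(X_tgt[k:]), average="macro")

print(f"zero-shot macro F1: {f1_zero:.3f}   few-shot macro F1: {f1_few:.3f}")
```

With real MUStARD and MCSD features in place of the stand-ins, the same two fits correspond to the zero-shot and few-shot comparisons reported in the abstract.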
Item Type: Thesis (Master)
Name supervisor: Nayak, S. and Gao, X.
Date Deposited: 03 Sep 2025 12:04
Last Modified: 03 Sep 2025 12:04
URI: https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/768