Qiu, Yan (2025) A Trial Toward Real-Time Vision-to-Speech Systems: An Exploratory Study of BLIP and FastSpeech 2 for Assistive Applications and Latency-Precision Trade-Offs. Master thesis, Voice Technology (VT).
PDF: MSC5906350YQiu.pdf (429 kB)
Abstract
The number of individuals affected by visual impairments worldwide continues to rise, creating a growing need for real-time assistive technologies that enhance navigation, situational awareness, and independence. While current assistive tools provide valuable support, they often suffer from high latency, a lack of contextual clarity, and prohibitive costs. Recent advances in neural text-to-speech (TTS) systems, such as FastSpeech 2 and ChatTTS, offer an opportunity to bridge this gap by delivering fast, natural-sounding speech. This thesis focuses on optimizing low-latency TTS pipelines tailored to real-time assistive applications. The project benchmarks state-of-the-art TTS models, applies optimization strategies such as post-training quantization and TensorRT acceleration, and improves input text clarity through prompt engineering and lightweight rephrasing of outputs from vision-language models such as BLIP-2. By addressing these problems, this research aims to create a complete, accessible, and responsive assistive voice pipeline that empowers visually impaired users to interact with their environment more safely and effectively.
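As a rough orientation to the pipeline the abstract describes, below is a minimal sketch of a BLIP-2-to-FastSpeech 2 chain using publicly available checkpoints (Salesforce/blip2-opt-2.7b via Hugging Face transformers and facebook/fastspeech2-en-ljspeech via the fairseq hub interface). The checkpoint choices, the trivial rephrasing heuristic, and the device handling are assumptions for illustration, not the configuration evaluated in the thesis.

```python
# Minimal vision-to-speech sketch: BLIP-2 caption -> light rephrasing -> FastSpeech 2.
# Illustrative only; checkpoints and the rephrasing step are assumptions,
# not the thesis's exact setup.
# Requires: pip install torch transformers fairseq pillow

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
from fairseq.models.text_to_speech.hub_interface import TTSHubInterface

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Stage 1: caption the scene with BLIP-2.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
captioner = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

image = Image.open("scene.jpg")  # placeholder input image
inputs = processor(images=image, return_tensors="pt").to(device, dtype)
ids = captioner.generate(**inputs, max_new_tokens=30)
caption = processor.decode(ids[0], skip_special_tokens=True).strip()

# Stage 2: lightweight rephrasing for speech clarity. The thesis uses prompt
# engineering here; this one-liner is only a stand-in.
utterance = caption[:1].upper() + caption[1:].rstrip(".") + "."

# Stage 3: synthesize speech with FastSpeech 2 through the fairseq hub interface.
models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
    "facebook/fastspeech2-en-ljspeech",
    arg_overrides={"vocoder": "hifigan", "fp16": False},
)
tts_model = models[0]
TTSHubInterface.update_cfg_with_data_cfg(cfg, task.data_cfg)
generator = task.build_generator([tts_model], cfg)
sample = TTSHubInterface.get_model_input(task, utterance)
wav, sample_rate = TTSHubInterface.get_prediction(task, tts_model, generator, sample)
# `wav` is a waveform tensor at `sample_rate` Hz, ready for playback or saving.
```

Of the optimization strategies the abstract names, post-training dynamic quantization has a standard PyTorch entry point, sketched below; whether the full fairseq FastSpeech 2 module quantizes cleanly end-to-end, and how it would then be exported to TensorRT (typically via ONNX), is not verified here.

```python
import torch

# Post-training dynamic quantization of the TTS model's Linear layers
# (standard PyTorch API; `tts_model` is the model loaded in the sketch above,
# and applying quantization to the whole fairseq module is illustrative).
quantized_tts = torch.ao.quantization.quantize_dynamic(
    tts_model, {torch.nn.Linear}, dtype=torch.qint8
)
```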
| Item Type: | Thesis (Master) |
|---|---|
| Supervisor: | Nayak, S. |
| Date Deposited: | 27 Jul 2025 11:12 |
| Last Modified: | 27 Jul 2025 11:12 |
| URI: | https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/735 |