Qiu, Yan (2025) A Trial Toward Real-Time Vision-to-Speech Systems: An Exploratory Study of BLIP and FastSpeech 2 for Assistive Applications and Latency-Precision Trade-Offs. Master thesis, Voice Technology (VT).
PDF: MSC5906350YQiu.pdf (429 kB)
Abstract
The number of individuals affected by visual impairments worldwide continues to rise, creating a growing need for real-time assistive technologies that enhance navigation, situational awareness, and independence. While current assistive tools provide valuable support, they often suffer from high latency, a lack of contextual clarity, and prohibitive costs. Recent advances in neural text-to-speech (TTS) systems, such as FastSpeech 2 and ChatTTS, offer an opportunity to bridge this gap by delivering fast, natural-sounding speech. This thesis focuses on optimizing low-latency TTS pipelines tailored to real-time assistive applications. The project benchmarks state-of-the-art TTS models, applies optimization strategies such as post-training quantization and TensorRT acceleration, and improves input text clarity through prompt engineering and lightweight rephrasing of outputs from vision-language models such as BLIP-2. By addressing these problems, this research aims to create a complete, accessible, and responsive assistive voice pipeline that empowers visually impaired users to interact with their environment more safely and effectively.
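As a rough orientation to the pipeline the abstract describes, below is a minimal sketch of a BLIP-2-to-FastSpeech 2 chain using publicly available checkpoints (Salesforce/blip2-opt-2.7b via Hugging Face transformers and facebook/fastspeech2-en-ljspeech via the fairseq hub interface). The checkpoint choices, the trivial rephrasing heuristic, and the device handling are assumptions for illustration, not the configuration evaluated in the thesis.

```python
# Minimal vision-to-speech sketch: BLIP-2 caption -> light rephrasing -> FastSpeech 2.
# Illustrative only; checkpoints and the rephrasing step are assumptions,
# not the thesis's exact setup.
# Requires: pip install torch transformers fairseq pillow

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
from fairseq.models.text_to_speech.hub_interface import TTSHubInterface

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Stage 1: caption the scene with BLIP-2.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
captioner = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

image = Image.open("scene.jpg")  # placeholder input image
inputs = processor(images=image, return_tensors="pt").to(device, dtype)
ids = captioner.generate(**inputs, max_new_tokens=30)
caption = processor.decode(ids[0], skip_special_tokens=True).strip()

# Stage 2: lightweight rephrasing for speech clarity. The thesis uses prompt
# engineering here; this one-liner is only a stand-in.
utterance = caption[:1].upper() + caption[1:].rstrip(".") + "."

# Stage 3: synthesize speech with FastSpeech 2 through the fairseq hub interface.
models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
    "facebook/fastspeech2-en-ljspeech",
    arg_overrides={"vocoder": "hifigan", "fp16": False},
)
tts_model = models[0]
TTSHubInterface.update_cfg_with_data_cfg(cfg, task.data_cfg)
generator = task.build_generator([tts_model], cfg)
sample = TTSHubInterface.get_model_input(task, utterance)
wav, sample_rate = TTSHubInterface.get_prediction(task, tts_model, generator, sample)
# `wav` is a waveform tensor at `sample_rate` Hz, ready for playback or saving.
```

Of the optimization strategies the abstract names, post-training dynamic quantization has a standard PyTorch entry point, sketched below; whether the full fairseq FastSpeech 2 module quantizes cleanly end-to-end, and how it would then be exported to TensorRT (typically via ONNX), is not verified here.

```python
import torch

# Post-training dynamic quantization of the TTS model's Linear layers
# (standard PyTorch API; `tts_model` is the model loaded in the sketch above,
# and applying quantization to the whole fairseq module is illustrative).
quantized_tts = torch.ao.quantization.quantize_dynamic(
    tts_model, {torch.nn.Linear}, dtype=torch.qint8
)
```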
| Item Type: | Thesis (Master) |
|---|---|
| Supervisor: | Nayak, S. |
| Date Deposited: | 27 Jul 2025 11:12 |
| Last Modified: | 27 Jul 2025 11:12 |
| URI: | https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/735 |