Towards conversational speech synthesis : Experiments with data quality, prosody modification, and non-verbal signals

University dissertation from Stockholm : KTH Royal Institute of Technology

Abstract: The aim of a text-to-speech synthesis (TTS) system is to generate a human-like speech waveform from a given input text. Current TTS sys- tems have already reached a high degree of intelligibility, and they can be readily used to read aloud a given text. For many applications, e.g. public address systems, reading style is enough to convey the message to the people. However, more recent applications, such as human-machine interaction and speech-to-speech translation, call for TTS systems to be increasingly human- like in their conversational style. The goal of this thesis is to address a few issues involved in a conversational speech synthesis system.First, we discuss issues involve in data collection for conversational speech synthesis. It is very important to have data with good quality as well as con- tain more conversational characteristics. In this direction we studied two methods 1) harvesting the world wide web (WWW) for the conversational speech corpora, and 2) imitation of natural conversations by professional ac- tors. In former method, we studied the effect of compression on the per- formance of TTS systems. It is often the case that speech data available on the WWW is in compression form, mostly use the standard compression techniques such as MPEG. Thus in paper 1 and 2, we systematically stud- ied the effect of MPEG compression on TTS systems. Results showed that the synthesis quality indeed affect by the compression, however, the percep- tual differences are strongly significant if the compression rate is less than 32kbit/s. Even if one is able to collect the natural conversational speech it is not always suitable to train a TTS system due to problems involved in its production. Thus in later method, we asked the question that can we imi- tate the conversational speech by professional actors in recording studios. In this direction we studied the speech characteristics of acted and read speech. Second, we asked a question that can we borrow a technique from voice con- version field to convert the read speech into conversational speech. In paper 3, we proposed a method to transform the pitch contours using artificial neu- ral networks. Results indicated that neural networks are able to transform pitch values better than traditional linear approach. Finally, we presented a study on laughter synthesis, since non-verbal sounds particularly laughter plays a prominent role in human communications. In paper 4 we present an experimental comparison of state-of-the-art vocoders for the application of HMM-based laughter synthesis. 

  This dissertation MIGHT be available in PDF-format. Check this page to see if it is available for download.