Lexical and Acoustic Modelling of Swedish Prosody

Abstract: Prosody and intonation are important ingredients of human speech. In speech technology, text-to-speech (TTS) and automatic speech recognition (ASR) systems must incorporate prosodic models in order to reach acceptable performance. We contribute to the modelling of Swedish prosody in a speech technology context from lexical and acoustic angles. Three aspects of lexical prosodic modelling are studied. We sketch a model for handling prosody in 'non-standard' words, and then we perform a study of Swedish word stress. In a larger study, we build a CART-based system for the prediction of pronunciation from orthography. Using a letter-by-letter prediction method, allophones are correctly predicted in 96.87% of the cases and whole words in 72.26%. For prosody, predictions based on whole-word features perform better: the location of primary stress is correct in 88.6% of the cases and word accent in 87.7%. In the acoustic modelling section, we first present two surveys: one with special reference to previous work on TTS-related intonation modelling of Swedish, and one on intonation modelling in general, with special emphasis on stylization models. Our own work is concerned with the development of a system for the generation of F0 contours from phonological intonation labels. In the second, general survey, we describe some of the more influential intonation models in the field. This review leads to the selection of two models for our continued work in the thesis: a stylization-based model, where temporal and frequency information is extracted directly from actual F0 contours, and Taylor's tilt model, which parameterizes the contours using a mathematical function. The stylization model is then used in building a data-driven method for the generation of pitch patterns in Swedish content words. The model is able to produce pitch contours that closely approach real ones, but the performance varies with the complexity of the pitch patterns. Both models are used in the final study, which is concerned with the automatic identification and classification of dialect, word accent categories and prominence levels in Swedish data. Material from almost 100 dialects and 250 speakers is used to build a model that predicts these features from F0 contours. For material from the whole Swedish-speaking area, the best results for word accent, prominence level and regional dialect type are 79.3%, 62.2% and 59.1% correct, respectively. For individual villages, word accent data can be classified correctly with an accuracy of more than 90%.
