Improving Quality of Service in Baseband Speech Communication

University dissertation from Stockholm : KTH Royal Institute of Technology

Abstract: Speech is the most important communication modality for human interaction. Automatic speech recognition and speech synthesis have further extended the relevance of speech to man-machine interaction. Environment noise and various distortions, such as reverberation and speech processing artifacts, reduce the mutual information between the message modulated in the clean speech and the message decoded from the observed signal. This degrades intelligibility and perceived quality, which are the two attributes associated with quality of service. An estimate of the state of these attributes provides important diagnostic information about the communication equipment and the environment. When the adverse effects occur at the presentation side, an objective measure of intelligibility facilitates speech signal modification for improved communication.

The contributions of this thesis come from non-intrusive quality assessment and intelligibility-enhancing modification of speech. On the part of quality, the focus is on predictor design for limited training data. Paper A proposes a quality assessment model for bounded-support ratings that learns efficiently from a limited amount of training data, scales easily with the sampling frequency, and provides a platform for modeling variations in the individual subjective ratings. The predictive performance of the model for the mean of the subjective quality ratings compares favorably to the state of the art in the field. Patterns in the spread of the individual ratings are captured in the feature space of the training data.

Paper B focuses on enhancing predictive performance for the mean of the quality variable when the signal feature space is sparsely sampled by the training data. Using a Gaussian Processes framework, the deterministic signal-based feature set is augmented with a stochastic feature that is hypothesized to be jointly distributed with the target quality rating. An uncertainty propagation mechanism ensures that the variance of this feature is reflected in the prediction. The proposed architecture can take advantage of i) data that cannot be pooled due to subjective test protocol incompatibility and ii) models trained on data that are no longer available.

With respect to intelligibility enhancement, a hierarchical perspective of the speech communication process, extended from foundational work in the field, is used in paper C to create a unified framework for method analysis and comparison. A high-level intelligibility measure related to the probability of correct recognition is derived using a hit-or-miss distortion criterion in the transcription domain. The measure is used to optimize two speech modifications at different levels of the message encoding hierarchy, leading to significantly enhanced intelligibility in noise. The conceptual novelty of the method comes at the cost of higher complexity and the requirement for additional information, including message transcription, sound segmentation, and a model of speech.

Mapping the high-level measure to a lower level removes the need for additional information and asymptotically preserves high-level optimality. Two methods are proposed to reduce degradation in the accuracy of the spectral dynamics due to additive noise. The focus of paper D is dynamics preservation in a range that is lower-bounded by an optimal band-power threshold. The performance of the method is competitive but allows for improvement in power efficiency. This issue is addressed in paper E, which proposes and optimizes a distortion measure for spectral dynamics, leading to a significant increase in intelligibility. Use of functional optimization techniques allows for families of solutions, among which are dynamic range compressors adaptive to the statistics of the speech and the noise.
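
As a rough illustration of the hit-or-miss criterion mentioned for paper C: under a 0/1 distortion on transcriptions, the expected distortion equals one minus the probability of correct recognition. The Python sketch below is not taken from the dissertation; the function names and the toy decoding distribution are hypothetical and serve only to show that relationship.

```python
from typing import Dict


def hit_or_miss(sent: str, decoded: str) -> float:
    """0/1 distortion: 0 for an exact transcription match, 1 otherwise."""
    return 0.0 if sent == decoded else 1.0


def expected_distortion(decoding_probs: Dict[str, float], sent: str) -> float:
    """Average the hit-or-miss distortion over a decoding distribution.

    decoding_probs maps each candidate transcription to the (assumed)
    probability that a listener decodes it from the observed signal.
    """
    return sum(p * hit_or_miss(sent, decoded)
               for decoded, p in decoding_probs.items())


def prob_correct(decoding_probs: Dict[str, float], sent: str) -> float:
    """Probability of correct recognition implied by the 0/1 distortion."""
    return 1.0 - expected_distortion(decoding_probs, sent)


# Hypothetical toy distribution over decoded transcriptions for the message "yes".
toy_probs = {"yes": 0.7, "no": 0.2, "guess": 0.1}
print(prob_correct(toy_probs, "yes"))
```

In this toy setting, any signal modification that lowers the expected hit-or-miss distortion raises the probability of correct recognition by the same amount, which is why such a measure can serve as an optimization target.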
