Voice Source and Duration Modelling for Voice Conversion and Speech Repair

Directors: Julián Flórez Esnal (Vicomtech)

University: Department of Engineering, University of Cambridge

Date: 15.07.2008

Place: Cambridge, UK

Voice conversion aims to transform a source speaker’s speech so that it sounds like that of a different target speaker. Text-to-speech synthesisers, dialogue systems and speech repair are among the many applications that can benefit greatly from the development of voice conversion technology. Whilst state-of-the-art implementations can achieve reasonable conversions between speakers with similar voice characteristics and prosodic patterns, they do not work as well when the differences between the source and target speech are more extreme. This is mainly due to limitations in the modelling and conversion of the voice source and prosody. In this thesis, a refined modelling and transformation of the voice source and duration is proposed to increase the robustness of voice conversion systems in extreme applications. In addition, the developed techniques are tested in a speech repair framework.
The voice source modelling refinement involves representing the voice source with the Liljencrants-Fant model instead of the linear prediction residuals employed by existing implementations. A speech model has been developed which automatically estimates voice source and vocal tract filter parameterisations. Using this speech modelling technique for the analysis, modification and synthesis of speech allows linear transformations to be applied to convert voice source parameters. The performance of the developed conversion system has been shown to be comparable to that of state-of-the-art implementations in terms of speaker identity, while producing converted speech of better quality.
Regarding duration, a decision tree approach is proposed to convert duration contours. Its application has been shown to reduce the mean squared error between the converted and target duration patterns and to increase their correlation.
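The two duration measures mentioned above, mean squared error and correlation between converted and target duration patterns, can be illustrated with a minimal Python sketch. The function names and the duration values (in milliseconds) are hypothetical and chosen only for illustration; they are not results from the thesis, and Pearson correlation is assumed here as the correlation measure.

```python
import math


def mean_squared_error(converted, target):
    """Mean squared error between two equal-length duration contours."""
    if len(converted) != len(target) or not converted:
        raise ValueError("contours must be non-empty and of equal length")
    return sum((c - t) ** 2 for c, t in zip(converted, target)) / len(converted)


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length contours."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("contours must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant contour")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-phone durations in milliseconds (illustrative only).
target_durations = [80.0, 120.0, 60.0, 150.0]
source_durations = [95.0, 100.0, 90.0, 110.0]
converted_durations = [85.0, 115.0, 65.0, 140.0]

mse_before = mean_squared_error(source_durations, target_durations)
mse_after = mean_squared_error(converted_durations, target_durations)
corr_before = pearson_correlation(source_durations, target_durations)
corr_after = pearson_correlation(converted_durations, target_durations)
```

A successful duration conversion would be expected to give a lower `mse_after` than `mse_before` and a `corr_after` closer to 1.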
The developed speech model and duration conversion techniques are then tested in an extreme application: the repair of the voice source and duration limitations of tracheoesophageal speech. Tracheoesophageal voice source repair involves replacing the glottal source, smoothing jitter and shimmer, reducing the aspiration noise component and, in some cases, raising the fundamental frequency. As for duration, decision trees trained on normal data are employed to repair the tracheoesophageal duration contours. The performance of the repair algorithms has been found to be highly dependent on the quality of the tracheoesophageal speakers. Whilst the repaired speech has been found to be less deviant, ugly and unpleasant to listen to overall, its naturalness, intelligibility and rhythm are still relatively poor compared to those achieved for normal speakers.
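As a rough illustration of what smoothing jitter means, the Python sketch below measures local jitter (the mean absolute difference between consecutive pitch periods, relative to the mean period) and reduces it with a simple moving average. The moving average is an assumption used as a stand-in; the abstract does not describe the smoothing method actually used in the thesis. The function names and period values are hypothetical. Shimmer could be treated analogously by applying the same functions to per-cycle amplitudes.

```python
def local_jitter(periods):
    """Mean absolute difference of consecutive periods divided by the mean period."""
    if len(periods) < 2:
        raise ValueError("need at least two pitch periods")
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods) / len(periods)
    return mean_diff / mean_period


def moving_average(values, window=3):
    """Centred moving average; the window is truncated at the edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical, highly irregular pitch periods in milliseconds (illustrative only).
periods_ms = [10.0, 12.0, 9.0, 13.0, 10.5, 12.5, 9.5, 11.5]

smoothed_periods_ms = moving_average(periods_ms, window=3)
jitter_before = local_jitter(periods_ms)
jitter_after = local_jitter(smoothed_periods_ms)
```

With these values the smoothed period sequence has a lower local jitter than the original; a wider window would smooth more strongly at the cost of flattening genuine pitch movement.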