Probabilistic Kernels for Improved Text-to-Speech Alignment in Long Audio Tracks

Authors: German Bordel, Mikel Peñagarikano, Luis Javier Rodriguez-Fuentes, Aitor Álvarez Muniain, Amparo Varona

Date: 01.01.2016

Signal Processing Letters, IEEE


Abstract

The synchronization of text transcripts with audio tracks is typically solved by forced alignment at the phonetic level. However, when dealing with either very long audio tracks or acoustically inaccurate text transcripts, more complex methods are needed, usually based on heavy and costly ASR systems. In a previous work, we showed that a simple and lightweight method could be effectively applied, based on a free phonetic decoding of the speech signal and the alignment of the free and reference phonetic sequences, allowing the transfer of timestamps from the former to the latter. This method has yielded competitive results on the Hub4-97 dataset and is currently applied to synchronize the videos and minutes of the Basque Parliament plenary sessions. In this paper, probabilistic kernels (similarity functions) are applied, based on the hypothesis that a confusion matrix computed from a large corpus of speech conveys key information about the behavior of the phonetic decoder, and that the probabilistic interpretation of this information may help design informative kernels leading to improved alignments. The probabilistic kernels proposed in this work outperform our baseline kernels and other alternatives, including a reference ASR-based approach and a knowledge-based kernel, in experiments on the Hub4-97 dataset.
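The pipeline outlined in the abstract (free phonetic decoding, alignment of the decoded and reference phone sequences under a similarity kernel, then timestamp transfer) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the kernel is a generic log-ratio of joint to independent probabilities estimated from a confusion matrix, the alignment is a plain global dynamic-programming alignment with a fixed gap penalty, and the names `make_kernel`, `align`, `floor` and `gap` are hypothetical.

```python
import math


def make_kernel(confusion, floor=1e-6):
    """Build a similarity function from confusion[ref_phone][decoded_phone] counts.

    Assumption: the score is log(P(r, d) / (P(r) * P(d))), a generic
    probabilistic similarity, not necessarily the kernels proposed in the paper.
    """
    total = sum(sum(row.values()) for row in confusion.values())
    ref_tot = {r: sum(row.values()) for r, row in confusion.items()}
    dec_tot = {}
    for row in confusion.values():
        for d, count in row.items():
            dec_tot[d] = dec_tot.get(d, 0) + count

    def kernel(r, d):
        joint = confusion.get(r, {}).get(d, 0) / total
        indep = (ref_tot.get(r, 0) / total) * (dec_tot.get(d, 0) / total)
        # Floor both terms so unseen pairs get a large negative score, not an error.
        return math.log(max(joint, floor) / max(indep, floor))

    return kernel


def align(reference, decoded, kernel, gap=-1.0):
    """Globally align reference phones with decoded (phone, start_time) pairs.

    Returns one start time per reference phone, or None where the reference
    phone was aligned to a gap. The gap penalty is an arbitrary illustrative value.
    """
    n, m = len(reference), len(decoded)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = j * gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (score[i - 1][j - 1] + kernel(reference[i - 1], decoded[j - 1][0]), "diag"),
                (score[i - 1][j] + gap, "up"),      # reference phone left unmatched
                (score[i][j - 1] + gap, "left"),    # decoded phone left unmatched
            ]
            score[i][j], back[i][j] = max(options)

    # Trace back from the end, copying timestamps onto matched reference phones.
    times = [None] * n
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            times[i - 1] = decoded[j - 1][1]
            i, j = i - 1, j - 1
        elif move == "up":
            i -= 1
        else:
            j -= 1
    return times


# Toy confusion counts (made up for the example).
confusion = {"a": {"a": 9, "e": 1}, "e": {"a": 1, "e": 9}, "t": {"t": 10}}
kernel = make_kernel(confusion)
reference = ["t", "a", "t", "e"]
decoded = [("t", 0.0), ("e", 0.1), ("t", 0.3), ("e", 0.4)]
print(align(reference, decoded, kernel))
```

In this toy case the decoder misrecognizes the reference "a" as "e". Because the confusion counts make that substitution much less costly than two gaps, the reference phone still receives the timestamp of the decoded "e".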

BIB_text

@Article {
title = {Probabilistic Kernels for Improved Text-to-Speech Alignment in Long Audio Tracks},
journal = {Signal Processing Letters, IEEE},
pages = {126--129},
number = {1},
volume = {23},
keywds = {Acoustics; Databases; Decoding; Kernel; Probabilistic logic; Speech; Videos; long audio tracks; probabilistic kernel; text-to-speech alignment},
abstract = {The synchronization of text transcripts with audio tracks is typically solved by forced alignment at the phonetic level. However, when dealing with either very long audio tracks or acoustically inaccurate text transcripts, more complex methods are needed, usually based on heavy and costly ASR systems. In a previous work, we showed that a simple and lightweight method could be effectively applied, based on a free phonetic decoding of the speech signal and the alignment of the free and reference phonetic sequences, allowing the transfer of timestamps from the former to the latter. This method has yielded competitive results on the Hub4-97 dataset and is currently applied to synchronize the videos and minutes of the Basque Parliament plenary sessions. In this paper, probabilistic kernels (similarity functions) are applied, based on the hypothesis that a confusion matrix computed from a large corpus of speech conveys key information about the behavior of the phonetic decoder, and that the probabilistic interpretation of this information may help design informative kernels leading to improved alignments. The probabilistic kernels proposed in this work outperform our baseline kernels and other alternatives, including a reference ASR-based approach and a knowledge-based kernel, in experiments on the Hub4-97 dataset.},
isi = {1},
date = {2016-01-01},
year = {2016},
}