Evaluating Novel Speech Transcription Architectures on the Spanish RTVE2020 Database

Date: 01.02.2022

Applied Sciences


Abstract

This work presents three novel speech recognition architectures evaluated on the Spanish RTVE2020 dataset, which served as the main evaluation set in the Albayzín S2T Transcription Challenge 2020. The main objective was to improve on the systems the authors previously submitted to the challenge, in which their primary system ranked second. The novel systems are based on both DNN-HMM and E2E acoustic models, for which fully supervised and self-supervised learning methods were used. As a result, the new speech recognition engines clearly outperformed the initial systems, lowering the best WER from 19.27 to 17.60, a result achieved by the DNN-HMM based system. This work therefore provides a benchmark of the latest acoustic models on a highly challenging dataset, and identifies the most suitable ones depending on the expected quality, the available resources and the required latency.

BIB_text

@Article {
title = {Evaluating Novel Speech Transcription Architectures on the Spanish RTVE2020 Database},
journal = {Applied Sciences},
pages = {1889},
volume = {12},
keywds = {automatic speech recognition; deep learning; Spanish; convolutional neural networks; recurrent neural networks; embedded systems; quartznet; Wav2vec2.0; self-supervised learning},
abstract = {This work presents three novel speech recognition architectures evaluated on the Spanish RTVE2020 dataset, which served as the main evaluation set in the Albayzín S2T Transcription Challenge 2020. The main objective was to improve on the systems the authors previously submitted to the challenge, in which their primary system ranked second. The novel systems are based on both DNN-HMM and E2E acoustic models, for which fully supervised and self-supervised learning methods were used. As a result, the new speech recognition engines clearly outperformed the initial systems, lowering the best WER from 19.27 to 17.60, a result achieved by the DNN-HMM based system. This work therefore provides a benchmark of the latest acoustic models on a highly challenging dataset, and identifies the most suitable ones depending on the expected quality, the available resources and the required latency.},
doi = {10.3390/app12041889},
date = {2022-02-01},
}