This year, the MLLP took part in the 2021 edition of the Blizzard Challenge, and its neural text-to-speech synthesis system obtained some of the best results in the challenge.
The Blizzard Challenge is a renowned international challenge that compares different research techniques for building text-to-speech (TTS) systems, and it has been organized annually since 2005. Some of the most important tech companies and institutions participate every year (Microsoft, Samsung, Tencent and OPPO, among others), presenting their latest research and developments in the area. The challenge is to take the transcribed speech recordings provided by the organizing committee, build text-to-speech models from this data, and synthesize a prescribed set of test sentences. The synthetic samples from all participants are then evaluated through extensive subjective listening tests.
In this year’s challenge, the organization released a dataset containing approximately 5 hours of studio-quality recordings from a single European Spanish female speaker. The first task (Hub task 2021-SH1) consisted of building a text-to-speech model from the provided data to synthesize texts containing only Spanish words. The second task (Spoke task 2021-SS1) addressed code-switching synthesis, where the texts may contain a small number of English words. Participants were allowed to use any kind of additional external data (freely available or not) up to a total of 100 hours.
The MLLP participated in the Hub 2021-SH1 task, building a two-stage neural text-to-speech system that comprises a sequence-to-sequence acoustic model and a Generative Adversarial Network (GAN)-based vocoder model. Both were trained exclusively on the provided 5-hour dataset.
The MLLP’s TTS system for the Hub 2021-SH1 task obtained excellent results in terms of speech naturalness and intelligibility in the subjective listening tests, as can be seen in the results table below (focusing on naturalness). Among the other 12 participants (including Microsoft, Samsung, Tencent and several top universities), only one system obtained a higher naturalness mean opinion score (MOS), and that system used an additional 80 hours of non-freely-available Spanish TTS data to train its models.
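For readers unfamiliar with the metric, a mean opinion score is simply the average of the ratings that listeners give to a system's samples. The sketch below is only an illustration, assuming the usual 1-to-5 naturalness scale and a normal-approximation 95% confidence interval; the function name, the system names and the ratings are hypothetical and are not taken from the challenge data or from the organizers' evaluation scripts.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, ci95_half_width) for listener ratings on a 1..5 scale."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in the range 1..5")
    mos = statistics.mean(ratings)
    # Normal-approximation 95% interval: 1.96 * sample std dev / sqrt(n).
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings for two made-up systems (not challenge results).
ratings_by_system = {
    "system_a": [4, 5, 4, 3, 4, 5],
    "system_b": [3, 3, 4, 2, 3, 4],
}

for name, ratings in ratings_by_system.items():
    mos, half_width = mean_opinion_score(ratings)
    print(f"{name}: MOS = {mos:.2f} ± {half_width:.2f}")
```

With so few ratings the intervals are wide; in a real listening test many listeners rate many sentences per system, and differences between systems are usually checked with a significance test rather than by comparing means alone.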
You can watch the MLLP system’s presentation by MLLP researcher Alejandro Pérez González de Martos in the following video (from 37:32):
For more information, we recommend checking the following links:
- Video: Blizzard Challenge 2021 summary and results
- Video: Blizzard Challenge 2021 MLLP system presentation (from 37:32)
- Proceedings: Blizzard Challenge 2021 system descriptions
- Results article: Blizzard Challenge 2021 description and results paper
The MLLP research group is proud to add these excellent scientific results in Text-to-Speech to previous top results in scientific challenges for streaming and offline Automatic Speech Recognition (IberSpeech-RTVE 2020 Challenge, IberSpeech-RTVE 2018 Challenge) and Machine Translation (WMT19, WMT18).
We would like to thank the organizers of the Blizzard Challenge 2021 for their work: the University of Science and Technology of China (USTC), the University of Edinburgh and the Blizzard Challenge committee. We look forward to future editions of the challenge.