A 1300-hour English speech and text corpus of parliamentary debates for (streaming) ASR training and benchmarking, speech data filtering and speech data verbatimization. https://www.mllp.upv.es/europarl-asr/

Europarl-ASR

Europarl-ASR: A Large Speech+Text Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization.

  • 1300 hours of transcribed EN speech data.
  • 18 hours of EN speech data with revised verbatim and official non-verbatim transcriptions, split into 2 dev/test partitions for 2 realistic ASR tasks.
  • 3 full sets of timed transcriptions for the training data: official non-verbatim, automatically noise-filtered, and automatically verbatimized.
  • 70 million tokens of EN text data.