A 1300-hour English speech and text corpus of parliamentary debates for (streaming) ASR training and benchmarking, speech data filtering and speech data verbatimization. https://www.mllp.upv.es/europarl-asr/

Gonçal cd503e3df8 Initial commit		5 years ago

Europarl-ASR

Europarl-ASR: A Large Speech+Text Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization.

1300 hours EN transcribed speech data.
18 hours EN speech data with revised verbatim and official non-verbatim transcriptions, split into 2 dev/test partitions for 2 realistic ASR tasks.
3 full sets of timed transcriptions for the training data: official non-verbatim, automatically noise-filtered, and automatically verbatimized.
70 million tokens EN text data.
