Europarl-ASR: A Large Speech+Text Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization.
- 1300 hours of transcribed EN speech data.
- 18 hours of EN speech data with revised verbatim and official non-verbatim transcriptions, split into 2 dev/test partitions for 2 realistic ASR tasks.
- 3 full sets of timed transcriptions for the training data: official non-verbatim, automatically noise-filtered, and automatically verbatimized.
- 70 million tokens of EN text data.