@@ -3,8 +3,8 @@ Europarl-ASR v1.0
2 April 2021
[www.mllp.upv.es/europarl-asr](https://www.mllp.upv.es/europarl-asr)

-A large English-language speech and text corpus of parliamentary debates for
-streaming ASR benchmarking, speech data filtering and speech data verbatimization.
+A 1300-hour English-language speech and text corpus of parliamentary debates for
+(streaming) ASR training and benchmarking, speech data filtering and speech data verbatimization.

Keywords: automatic speech recognition; speech corpus; speech data filtering;
speech data verbatimization.
@@ -280,19 +280,19 @@ Europarl-ASR (EN) includes:

#### Speech data

-* 1300 hours of English-language annotated speech data (33K speeches, 1K
+* 1263 hours of English-language annotated speech data (33,002 speeches, 1046
speakers).
* 3 full sets of timed transcriptions: official non-verbatim transcriptions,
automatically noise-filtered transcriptions and automatically verbatimized
transcriptions.
-* 18 hours of speech data with both manually revised verbatim transcriptions
+* 17.5 hours of speech data with both manually revised verbatim transcriptions
and official non-verbatim transcriptions, split in 2 independent validation-
evaluation partitions for 2 realistic ASR tasks (with vs. without previous
knowledge of the speaker).

#### Text data

-* 70 million tokens of English-language text data.
+* 69.4 million tokens of English-language text data.

#### Pretrained language models