@@ -4,7 +4,7 @@ v1.0<br />
 [www.mllp.upv.es/europarl-asr](www.mllp.upv.es/europarl-asr)
 
 A large English-language speech and text corpus of parliamentary debates for
-streaming ASR benchmarking and speech data filtering/verbatimization.
+streaming ASR benchmarking, speech data filtering and speech data verbatimization.
 
 Keywords: automatic speech recognition; speech corpus; speech data filtering;
 speech data verbatimization.
@@ -19,8 +19,9 @@ README CONTENTS
 ---------------
 
 - [Overview](#overview)
-- [Corpus structure and contents](#contents)
+- [Get the data](#get)
 - [Additional Europarl-ASR materials](#additional)
+- [Corpus structure and contents](#contents)
 - [Extended description](#description)
 - [Acknowledgements](#ack)
 - [Legal disclaimers](#legal)
@@ -58,6 +59,33 @@ tokens, Europarl-ASR also includes tools to add all English-language text from
 the DCEP Digital Corpus of the European Parliament.
 
 
+<a id="get"></a>GET THE DATA
+------------
+
+Download the full Europarl-ASR speech and text corpus from:
+
+https://www.mllp.upv.es/europarl-asr/Europarl-ASR_v1.0.tar.gz
+
+
+<a id="additional-materials"></a>ADDITIONAL Europarl-ASR MATERIALS
+---------------------------------
+
+In addition to the speech and text data included in the main release and
+described in this document, we are making available for download the following
+materials to facilitate the reproducibility of our experiments:
+
+* The pretrained Europarl-ASR English-language n-gram language model, together
+ with its vocabulary file:
+
+ https://www.mllp.upv.es/europarl-asr/Europarl-ASR_v1.0_ngram_lm_and_vocab.tar.gz
+
+* The Europarl-ASR English-language verbatim transcription guidelines, which
+ were applied to produce the manually revised verbatim transcriptions for the
+ dev and test sets:
+
+ https://www.mllp.upv.es/europarl-asr/Europarl-ASR_transcription_guidelines.pdf
+
+
 <a id="contents"></a>CORPUS STRUCTURE AND CONTENTS
 -----------------------------
 
@@ -213,24 +241,6 @@ Each "text" directory contains 2 subdirectories: "raw" (except in
 website and save it in compressed plain text (.txt.gz).
 
 
-<a id="additional-materials"></a>ADDITIONAL Europarl-ASR MATERIALS
----------------------------------
-
-https://www.mllp.upv.es/europarl-asr/Europarl-ASR_v1.0_ngram_lm_and_vocab.tar.gz<br />
-https://www.mllp.upv.es/europarl-asr/Europarl-ASR_transcription_guidelines.pdf
-
-In addition to the speech and text data included in the main release and
-described in this document, we are making available for download the following
-materials to facilitate the reproducibility of our experiments:
-
-* The pretrained Europarl-ASR English-language n-gram language model, together
- with its vocabulary file.
-
-* The Europarl-ASR English-language verbatim transcription guidelines, which
- were applied to produce the manually revised verbatim transcriptions for the
- dev and test sets.
-
-
 <a id="description"></a>EXTENDED DESCRIPTION
 --------------------
 