|
@@ -1,8 +1,347 @@
|
|
|
-# Europarl-ASR
|
|
|
-
|
|
|
-Europarl-ASR: A Large Speech+Text Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization.
|
|
|
-
|
|
|
-- 1300 hours EN transcribed speech data.
|
|
|
-- 18 hours EN speech data w/ revised verbatim and official non-verbatim transcriptions, split in 2 dev/test partitions for 2 realistic ASR tasks.
|
|
|
-- 3 full sets of timed transcriptions for the training data: official non-verbatim, automatically noise-filtered, and automatically verbatimized.
|
|
|
-- 70 million tokens EN text data.
|
|
|
+Europarl-ASR
|
|
|
+v1.0
|
|
|
+2 April 2021
|
|
|
+www.mllp.upv.es/europarl-asr/
|
|
|
+
|
|
|
+A large English-language speech and text corpus of parliamentary debates for
|
|
|
+streaming ASR benchmarking and speech data filtering/verbatimization.
|
|
|
+
|
|
|
+Keywords: automatic speech recognition; speech corpus; speech data filtering;
|
|
|
+speech data verbatimization.
|
|
|
+
|
|
|
+CONTACT:
|
|
|
+Gonçal V. Garcés Díaz-Munío (gogardia@vrain.upv.es),
|
|
|
+Joan Albert Silvestre-Cerdà (jsilvestre@vrain.upv.es).
|
|
|
+Universitat Politècnica de València.
|
|
|
+
|
|
|
+
|
|
|
+README CONTENTS
|
|
|
+---------------
|
|
|
+
|
|
|
+- Overview
|
|
|
+- Corpus structure and contents
|
|
|
+- Additional Europarl-ASR materials
|
|
|
+- Extended description
|
|
|
+- Acknowledgements
|
|
|
+- Legal disclaimers
|
|
|
+- Licence
|
|
|
+
|
|
|
+
|
|
|
+OVERVIEW
|
|
|
+--------
|
|
|
+
|
|
|
+Europarl-ASR (EN) includes:
|
|
|
+
|
|
|
+*Speech data:
|
|
|
+
|
|
|
+- 1.3K hours of English-language annotated speech data.
|
|
|
+- 18 hours of speech data with both manually revised verbatim transcriptions
|
|
|
+ and official non-verbatim transcriptions, split in 2 independent validation-
|
|
|
+ evaluation partitions for 2 realistic ASR tasks (with vs. without previous
|
|
|
+ knowledge of the speaker).
|
|
|
+- 3 full sets of timed transcriptions for the rest of the speech data
|
|
|
+ (training partition): official non-verbatim transcriptions, automatically
|
|
|
+ noise-filtered transcriptions and automatically verbatimized transcriptions.
|
|
|
+
|
|
|
+*Text data:
|
|
|
+
|
|
|
+- 70M tokens of English-language text data.
|
|
|
+
|
|
|
+*Pretrained language models:
|
|
|
+
|
|
|
+- The Europarl-ASR English-language n-gram language model and vocabulary.
|
|
|
+
|
|
|
+This data comprises most of the European Parliament's English-language debate
|
|
|
+recordings, transcriptions and translations available from the Parliament's
|
|
|
+website from 1996 to 2020. Additionally, to increase text data up to 170M
|
|
|
+tokens, Europarl-ASR also includes tools to add all English-language text from
|
|
|
+the DCEP Digital Corpus of the European Parliament.
|
|
|
+
|
|
|
+
|
|
|
+CORPUS STRUCTURE AND CONTENTS
|
|
|
+-----------------------------
|
|
|
+
|
|
|
+Total size: 20 GB
|
|
|
+
|
|
|
+The data is organized in 3 main directories: "train" (training data), "dev"
|
|
|
+(validation data) and "test" (evaluation data). Each directory contains the
|
|
|
+subdirectories "original_audio" and "text", the first one containing speech
|
|
|
+data with annotations (for acoustic modelling), the second one containing text
|
|
|
+data (for language modelling).
|
|
|
+
|
|
|
+Here we can see more completely the corpus structure, with additional
|
|
|
+subdirectories:
|
|
|
+
|
|
|
+ Europarl-ASR
|
|
|
+ └── en
|
|
|
+ ├── dev
|
|
|
+ │ ├── original_audio
|
|
|
+ │ │ ├── spk-dep
|
|
|
+ │ │ │ ├── lists
|
|
|
+ │ │ │ ├── metadata
|
|
|
+ │ │ │ ├── refs
|
|
|
+ │ │ │ └── speeches
|
|
|
+ │ │ └── spk-indep
|
|
|
+ │ │ ├── lists
|
|
|
+ │ │ ├── metadata
|
|
|
+ │ │ ├── refs
|
|
|
+ │ │ └── speeches
|
|
|
+ │ └── text
|
|
|
+ │ ├── prepro
|
|
|
+ │ └── raw
|
|
|
+ ├── test
|
|
|
+ │ ├── original_audio
|
|
|
+ │ │ ├── spk-dep
|
|
|
+ │ │ │ ├── lists
|
|
|
+ │ │ │ ├── metadata
|
|
|
+ │ │ │ ├── refs
|
|
|
+ │ │ │ └── speeches
|
|
|
+ │ │ └── spk-indep
|
|
|
+ │ │ ├── lists
|
|
|
+ │ │ ├── metadata
|
|
|
+ │ │ ├── refs
|
|
|
+ │ │ └── speeches
|
|
|
+ │ └── text
|
|
|
+ │ ├── prepro
|
|
|
+ │ └── raw
|
|
|
+ └── train
|
|
|
+ ├── original_audio
|
|
|
+ │ ├── lists
|
|
|
+ │ ├── metadata
|
|
|
+ │ └── speeches
|
|
|
+ └── text
|
|
|
+ ├── external
|
|
|
+ │ ├── prepro
|
|
|
+ │ └── scripts
|
|
|
+ └── internal
|
|
|
+ ├── prepro
|
|
|
+ └── raw
|
|
|
+
|
|
|
+*Speech data ("original_audio" directories):
|
|
|
+
|
|
|
+In the cases of "dev" and "test", they are subdivided in directories "spk-dep"
|
|
|
+and "spk-indep". Thus, for speech data, we have 2 train-dev-test partitions
|
|
|
+for 2 different ASR tasks, as follows:
|
|
|
+
|
|
|
+ 1) ASR with known speakers (MEP):
|
|
|
+ train ; dev/original_audio/spk-dep ; test/original_audio/spk-dep
|
|
|
+
|
|
|
+ 2) ASR with unknown speakers (Guest):
|
|
|
+ train ; dev/original_audio/spk-indep ; test/original_audio/spk-indep
|
|
|
+
|
|
|
+Each of these partition directories contains 3 to 4 subdirectories (depending
|
|
|
+on whether it is the train set or the dev/test sets): "lists", "metadata",
|
|
|
+"refs" (only in "dev" and "test") and "speeches".
|
|
|
+
|
|
|
+"lists" contains lists of all the speeches in "speeches", as well as lists of
|
|
|
+speeches per speaker.
|
|
|
+
|
|
|
+"metadata" contains metadata for each speech and for each speaker in the
|
|
|
+corresponding set (as csv and json files). For each speech we will find these
|
|
|
+metadata (as reflected in speeches.headers.csv):
|
|
|
+
|
|
|
+ term;session_date;speech_id;speaker_type;speaker_id;raw_dur;
|
|
|
+ aligned-speech_dur;filtered-speech_dur;cer;ar;path;agenda_item_title
|
|
|
+
|
|
|
+And for each speaker (as reflected in speakers.headers.csv):
|
|
|
+
|
|
|
+ type;id;name;gender;url
|
|
|
+
|
|
|
+"speeches" contains a subdirectory for each speech in the corresponding set,
|
|
|
+according to this subdirectory structure:
|
|
|
+
|
|
|
+ speeches/<term>/<session_date>/<speech_id>/
|
|
|
+
|
|
|
+For each speech, we will find some of the following files (depending on
|
|
|
+whether it is in the train set or in the dev/test sets):
|
|
|
+
|
|
|
+ ep-asr.en.orig.<term>.<session_date>.<speech_id>.m4a
|
|
|
+ [In all sets] Audio of the speech.
|
|
|
+
|
|
|
+ ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.orig.{dfxp,json,srt,txt}
|
|
|
+ [In all sets] Official non-verbatim transcription of the speech, as a txt
|
|
|
+ raw transcription file, as dfxp or srt force-aligned timed subtitle files,
|
|
|
+ and its json metadata.
|
|
|
+
|
|
|
+ ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.filt.{dfxp,json,srt}
|
|
|
+ [In train set] Automatically filtered transcription of the speech, as dfxp
|
|
|
+ or srt force-aligned timed subtitle files, and its json metadata.
|
|
|
+
|
|
|
+ ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.verb.{dfxp,json,srt,txt}
|
|
|
+ [In train set] Automatically verbatimized transcription of the speech, as
|
|
|
+ a txt transcription file, as dfxp or srt force-aligned timed subtitle files,
|
|
|
+ and its json metadata.
|
|
|
+
|
|
|
+ ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.rev.{dfxp,json,srt,txt}
|
|
|
+ [In dev/test sets] Manually revised verbatim transcription of the speech,
|
|
|
+ as a txt transcription file, as dfxp or srt force-aligned timed subtitle
|
|
|
+ files, and its json metadata.
|
|
|
+
|
|
|
+Finally, in "refs" (only in "dev" and "test") each file contains every speech
|
|
|
+in the corresponding dev or test set, that is, the full reference for that
|
|
|
+set. In each case, we will find 4 files, containing the official non-verbatim
|
|
|
+reference (*.orig.*) and the manually revised verbatim reference (*.rev.*), as
|
|
|
+transcriptions (*.ref) and as segment time marked files (*.stm). In all 4
|
|
|
+cases, the text is presented preprocessed for evaluation (tokenized,
|
|
|
+lowercased, punctuation removed...).
|
|
|
+
|
|
|
+*Text data ("text" directories):
|
|
|
+
|
|
|
+In the case of "train", they are subdivided in directories "external" and
|
|
|
+"internal". "internal" contains all the official non-verbatim transcriptions
|
|
|
+and translations in the train set, together with the selected non-overlapping
|
|
|
+Europarl v10 transcriptions and translations; "external" contains the files to
|
|
|
+make use of the external DCEP: Digital Corpus of the European Parliament.
|
|
|
+
|
|
|
+Each "text" directory contains 2 subdirectories: "raw" (except in
|
|
|
+"train/external"), "prepro" (in all sets), or "scripts" (only in
|
|
|
+"train/external").
|
|
|
+
|
|
|
+ "raw" contains the raw text data for the corresponding set (*.txt.gz), and
|
|
|
+ its metadata (*.csv). In the cases of "dev" and "test", both the official
|
|
|
+ non-verbatim transcriptions (*.orig.*) and the manually revised verbatim
|
|
|
+ transcriptions (*.rev.*) are included.
|
|
|
+
|
|
|
+ "prepro" contains the text data for the corresponding set, preprocessed for
|
|
|
+ training or evaluation (tokenized, lowercased, punctuation removed...). This
|
|
|
+ data is released to facilitate the reproducibility of our experiments.
|
|
|
+
|
|
|
+ Finally, "scripts" (only in "train/text/external") contains the script
|
|
|
+ get_DCEP.sh, which can be used to download the DCEP corpus from its original
|
|
|
+ website and save it in compressed plain text (.txt.gz).
|
|
|
+
|
|
|
+
|
|
|
+ADDITIONAL Europarl-ASR MATERIALS
|
|
|
+---------------------------------
|
|
|
+
|
|
|
+https://www.mllp.upv.es/europarl-asr/Europarl-ASR_v1.0_ngram_lm_and_vocab.tar.gz
|
|
|
+https://www.mllp.upv.es/europarl-asr/Europarl-ASR_transcription_guidelines.pdf
|
|
|
+
|
|
|
+In addition to the speech and text data included in the main release and
|
|
|
+described in this document, we are making available for download the following
|
|
|
+materials to facilitate the reproducibility of our experiments:
|
|
|
+
|
|
|
+- The pretrained Europarl-ASR English-language n-gram language model, together
|
|
|
+ with its vocabulary file.
|
|
|
+
|
|
|
+- The Europarl-ASR English-language verbatim transcription guidelines, which
|
|
|
+ were applied to produce the manually revised verbatim transcriptions for the
|
|
|
+ dev and test sets.
|
|
|
+
|
|
|
+
|
|
|
+EXTENDED DESCRIPTION
|
|
|
+--------------------
|
|
|
+
|
|
|
+Europarl-ASR (EN) is a large English-language speech and text corpus of
|
|
|
+parliamentary debates for (streaming) ASR benchmarking and speech data
|
|
|
+filtering/verbatimization, including 1300 hours of annotated English-language
|
|
|
+speeches from European Parliament sessions held in the period 1996-2020.
|
|
|
+
|
|
|
+It was compiled and released by the Machine Learning and Language Processing
|
|
|
+(MLLP) research group of VRAIN Institut Valencià d'Investigació en
|
|
|
+Intel·ligència Artificial, Universitat Politècnica de València
|
|
|
+( www.mllp.upv.es ).
|
|
|
+
|
|
|
+Europarl-ASR (EN) includes:
|
|
|
+
|
|
|
+*Speech data:
|
|
|
+
|
|
|
+- 1.3K hours of English-language annotated speech data (33K speeches, 1K
|
|
|
+ speakers).
|
|
|
+- 18 hours of speech data with both manually revised verbatim transcriptions
|
|
|
+ and official non-verbatim transcriptions, split in 2 independent validation-
|
|
|
+ evaluation partitions for 2 realistic ASR tasks (with vs. without previous
|
|
|
+ knowledge of the speaker).
|
|
|
+- 3 full sets of timed transcriptions for the rest of the speech data
|
|
|
+ (training partition): official non-verbatim transcriptions, automatically
|
|
|
+ noise-filtered transcriptions and automatically verbatimized transcriptions.
|
|
|
+
|
|
|
+*Text data:
|
|
|
+
|
|
|
+- 70M tokens of English-language text data.
|
|
|
+
|
|
|
+*Language models:
|
|
|
+
|
|
|
+- The Europarl-ASR English-language n-gram language model and vocabulary.
|
|
|
+
|
|
|
+This data comprises most of the European Parliament's English-language debate
|
|
|
+recordings, transcriptions and translations available from the Parliament's
|
|
|
+website from 1999 to 2020 (recordings being only available from 2008). This is
|
|
|
+complemented by including all English-language transcriptions and translations
|
|
|
+from the Europarl v10 text corpus for the period 1996-1999.
|
|
|
+
|
|
|
+Additionally, to increase text data for language modelling up to 170M tokens,
|
|
|
+Europarl-ASR also includes tools to add all English-language text from the
|
|
|
+DCEP Digital Corpus of the European Parliament.
|
|
|
+
|
|
|
+Detailed dates of the EP speech and text data gathered:
|
|
|
+
|
|
|
+- English speech: 2008-09-01 to 2020-05-27.
|
|
|
+- English transcriptions: 1999-07-20 to 2020-05-27.
|
|
|
+- Translations into English: 1999-07-20 to 2012-11-30.
|
|
|
+- Europarl v10 (selected to avoid overlapping): 1996-04-15 to 1999-07-19.
|
|
|
+- DCEP (does not include any EP reports of proceedings): 2001 to 2012.
|
|
|
+
|
|
|
+
|
|
|
+ACKNOWLEDGEMENTS
|
|
|
+---------------
|
|
|
+
|
|
|
+The authors would like to acknowledge:
|
|
|
+
|
|
|
+- The European Parliament, the European Commission and other EU organizations,
|
|
|
+ for making available a wealth of multilingual speech and text data, both in
|
|
|
+ their websites and as ready-made corpora such as the DCEP Digital Corpus of
|
|
|
+ the European Parliament
|
|
|
+ ( https://ec.europa.eu/jrc/en/language-technologies ).
|
|
|
+
|
|
|
+- Philipp Koehn for compiling the Europarl corpus
|
|
|
+ ( https://www.statmt.org/europarl/ ).
|
|
|
+
|
|
|
+This work has received funding from the EU's H2020 research and innovation
|
|
|
+programme under grant agreements 761758 (X5gon) and 952215 (TAILOR); the
|
|
|
+Government of Spain's research project Multisub (RTI2018-094879-B-I00,
|
|
|
+MCIU/AEI/FEDER,EU) and FPU scholarships FPU14/03981 and FPU18/04135; the
|
|
|
+Generalitat Valenciana's research project Classroom Activity Recognition
|
|
|
+(PROMETEO/2019/111) and predoctoral research scholarship ACIF/2017/055; and
|
|
|
+the Universitat Politècnica de València's PAID-01-17 R&D support programme.
|
|
|
+
|
|
|
+
|
|
|
+LEGAL DISCLAIMERS
|
|
|
+-----------------
|
|
|
+
|
|
|
+- Speech and text data from the European Parliament website (audio, official
|
|
|
+ transcriptions and translations) were sourced from
|
|
|
+ https://www.europarl.europa.eu/plenary/en/debates-video.html
|
|
|
+
|
|
|
+- Text data from the DCEP Digital Corpus of the European Parliament are the
|
|
|
+ exclusive property of the European Parliament. These data were sourced from
|
|
|
+ https://ec.europa.eu/jrc/en/language-technologies/dcep (date of the latest
|
|
|
+ update: 11 March 2015).
|
|
|
+
|
|
|
+
|
|
|
+LICENCE
|
|
|
+-------
|
|
|
+
|
|
|
+- Speech and text data from the European Parliament website (audio, official
|
|
|
+ transcriptions and translations) are the exclusive property of the European
|
|
|
+ Union represented by the European Parliament. These data are reused here
|
|
|
+ under the conditions stated in the European Parliament website's Legal
|
|
|
+ notice ( https://www.europarl.europa.eu/legal-notice ).
|
|
|
+
|
|
|
+- Text data from the DCEP Digital Corpus of the European Parliament are the
|
|
|
+ exclusive property of the European Parliament. The European Parliament
|
|
|
+ retains ownership of the data. These data are reused here under the usage
|
|
|
+ conditions of the DCEP Digital Corpus of the European Parliament (
|
|
|
+ https://ec.europa.eu/jrc/en/language-technologies/dcep#Usage%20Conditions ).
|
|
|
+
|
|
|
+- Text data from the Europarl v10 corpus are reused here under the Europarl
|
|
|
+ corpus terms of use ( https://www.statmt.org/europarl/ ).
|
|
|
+
|
|
|
+- Europarl-ASR data and code not covered by the previously mentioned licences
|
|
|
+ © 2021 by Pau Baquero-Arnal, Jorge Civera, Gonçal V. Garcés Dı́az-Munı́o,
|
|
|
+ Adrià Giménez, Javier Iranzo-Sánchez, Javier Jorge, Alfons Juan, Alejandro
|
|
|
+ Pérez-González-de-Martos, Nahuel Roselló, Albert Sanchis and Joan Albert
|
|
|
+ Silvestre-Cerdà are licenced under CC BY 4.0. To view a copy of this
|
|
|
+ licence, visit http://creativecommons.org/licenses/by/4.0/
|
|
|
+
|
|
|
+See the file LICENSE for the full licence texts.
|