@@ -1,6 +1,6 @@
 # Europarl-ASR
-v1.0<br />
-2 April 2021<br />
+Europarl-ASR v1.0
+2 April 2021
 [www.mllp.upv.es/europarl-asr](https://www.mllp.upv.es/europarl-asr)

 A large English-language speech and text corpus of parliamentary debates for
@@ -171,10 +171,10 @@ In the cases of "dev" and "test", they are subdivided in directories "spk-dep"
 and "spk-indep". Thus, for speech data, we have 2 train-dev-test partitions
 for 2 different ASR tasks, as follows:

-1. ASR with known speakers (MEP):<br />
+1. ASR with known speakers (MEP):
 train ; dev/original_audio/spk-dep ; test/original_audio/spk-dep

-1. ASR with unknown speakers (Guest):<br />
+1. ASR with unknown speakers (Guest):
 train ; dev/original_audio/spk-indep ; test/original_audio/spk-indep

 Each of these partition directories contains 3 to 4 subdirectories (depending
@@ -188,7 +188,7 @@ speeches per speaker.
 corresponding set (as csv and json files). For each speech we will find these
 metadata (as reflected in speeches.headers.csv):

- term;session_date;speech_id;speaker_type;speaker_id;raw_dur;<br />
+ term;session_date;speech_id;speaker_type;speaker_id;raw_dur;
 aligned-speech_dur;filtered-speech_dur;cer;ar;path;agenda_item_title

 And for each speaker (as reflected in speakers.headers.csv):
@@ -203,24 +203,24 @@ according to this subdirectory structure:
 For each speech, we will find some of the following files (depending on
 whether it is in the train set or in the dev/test sets):

- `ep-asr.en.orig.<term>.<session_date>.<speech_id>.m4a`<br />
+ `ep-asr.en.orig.<term>.<session_date>.<speech_id>.m4a`
 [In all sets] Audio of the speech.

- `ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.orig.{dfxp,json,srt,txt}`<br />
+ `ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.orig.{dfxp,json,srt,txt}`
 [In all sets] Official non-verbatim transcription of the speech, as a txt
 raw transcription file, as dfxp or srt force-aligned timed subtitle files,
 and its json metadata.

- `ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.filt.{dfxp,json,srt}`<br />
+ `ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.filt.{dfxp,json,srt}`
 [In train set] Automatically filtered transcription of the speech, as dfxp
 or srt force-aligned timed subtitle files, and its json metadata.

- `ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.verb.{dfxp,json,srt,txt}`<br />
+ `ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.verb.{dfxp,json,srt,txt}`
 [In train set] Automatically verbatimized transcription of the speech, as
 a txt transcription file, as dfxp or srt timed subtitle files,
 and its json metadata.

- `ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.rev.{dfxp,json,srt,txt}`<br />
+ `ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.rev.{dfxp,json,srt,txt}`
 [In dev/test sets] Manually revised verbatim transcription of the speech,
 as a txt transcription file, as dfxp or srt timed subtitle
 files, and its json metadata.
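
The per-speech file naming scheme listed in the README can be split into its fields with a short script. The following is a minimal sketch and not part of the corpus tooling: the function name `parse_speech_filename`, the `SpeechFile` tuple, and the sample values used in the usage note are hypothetical, and the pattern assumes that `<term>`, `<session_date>` and `<speech_id>` never contain a dot.

```python
import re
from typing import NamedTuple, Optional


class SpeechFile(NamedTuple):
    term: str
    session_date: str
    speech_id: str
    kind: Optional[str]  # "orig", "filt", "verb" or "rev"; None for the audio file
    ext: str


# Extensions listed in the README for each transcription kind.
TRANSCRIPT_EXTS = {
    "orig": {"dfxp", "json", "srt", "txt"},
    "filt": {"dfxp", "json", "srt"},
    "verb": {"dfxp", "json", "srt", "txt"},
    "rev": {"dfxp", "json", "srt", "txt"},
}

# Assumption: none of the three identifier fields contains a dot.
_PATTERN = re.compile(
    r"ep-asr\.en\.orig\."
    r"(?P<term>[^.]+)\.(?P<session_date>[^.]+)\.(?P<speech_id>[^.]+)\."
    r"(?:tr\.(?P<kind>[a-z]+)\.)?(?P<ext>[a-z0-9]+)"
)


def parse_speech_filename(name: str) -> Optional[SpeechFile]:
    """Return the fields of a per-speech file name, or None if it does not fit."""
    m = _PATTERN.fullmatch(name)
    if m is None:
        return None
    kind, ext = m["kind"], m["ext"]
    if kind is None:
        # Without a "tr.<kind>" part, only the audio file is expected.
        if ext != "m4a":
            return None
    elif ext not in TRANSCRIPT_EXTS.get(kind, set()):
        return None
    return SpeechFile(m["term"], m["session_date"], m["speech_id"], kind, ext)
```

With made-up field values, `parse_speech_filename("ep-asr.en.orig.7.20100101.42.tr.rev.srt")` yields a tuple whose `kind` is `"rev"`, while a name such as `...tr.filt.txt` is rejected because no txt file is listed for the filtered transcription.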