Europarl-ST

Europarl-ST is a multilingual speech translation corpus that contains paired audio-text samples for speech translation, built from the debates held in the European Parliament between 2008 and 2012.

The full details of the corpus are available in the paper:
https://ieeexplore.ieee.org/document/9054626
(Preprint also available: https://arxiv.org/abs/1911.03167)

New release v1.1

This release adds 3 new languages (Romanian, Polish and Dutch). Together with the 6 languages already available (German, English, Spanish, French, Italian and Portuguese), the corpus now offers 72 speech translation directions.
We have also released a new set, train-noisy, which contains the speeches that were discarded during our filtering process, as they may still be useful for some training regimes.
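
The figure of 72 directions follows from counting every ordered (source, target) pair among the 9 languages, i.e. 9 * 8. A minimal Python sketch of that count is given here; the function name translation_directions is hypothetical and is not part of the corpus tooling.

```python
from itertools import permutations

# Language codes as they appear in the corpus tables (release v1.1).
LANGUAGES = ["en", "fr", "de", "it", "es", "pt", "pl", "ro", "nl"]


def translation_directions(languages):
    """Return every ordered (source, target) pair with source != target.

    Hypothetical helper for illustration only.
    """
    return list(permutations(languages, 2))


directions = translation_directions(LANGUAGES)
print(len(directions))  # 9 * 8 = 72
```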

For more information about the activities of our research group, visit:
https://www.mllp.upv.es/

For any questions or comments regarding the corpus, don't hesitate to contact Javier Iranzo-Sánchez (jairsan@upv.es).

If you use the corpus in your research, please cite the following reference:

  @INPROCEEDINGS{jairsan2020a,
  author={J. {Iranzo-Sánchez} and J. A. {Silvestre-Cerdà} and J. {Jorge} and N. {Roselló} and A. {Giménez} and A. {Sanchis} and J. {Civera} and A. {Juan}},
  booktitle={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates}, 
  year={2020},
  pages={8229-8233},}

A summary of the data for release v1.1 is as follows:

Train set (hours):

src/tgt	en	fr	de	it	es	pt	pl	ro	nl
en	-	81	83	80	81	81	79	72	80
fr	32	-	21	20	21	22	20	18	22
de	30	18	-	17	18	18	17	17	18
it	37	21	21	-	21	21	21	19	20
es	22	14	14	14	-	14	13	12	13
pt	15	10	10	10	10	-	9	9	9
pl	28	18	18	17	18	18	-	16	18
ro	24	12	12	12	12	12	12	-	12
nl	7	5	5	4	5	4	4	4	-
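
For readers who want to work with these figures programmatically, the table above can be loaded into a dictionary keyed by (source, target). This is only a sketch: the names TRAIN_HOURS and parse_hours_table are hypothetical, the table is copied by hand from this README, and nothing here reflects how the corpus itself is distributed.

```python
# Train-set table copied from this README (hours of speech per direction).
# Columns are whitespace-separated; "-" marks the source == target cell.
TRAIN_HOURS = """
src/tgt en fr de it es pt pl ro nl
en - 81 83 80 81 81 79 72 80
fr 32 - 21 20 21 22 20 18 22
de 30 18 - 17 18 18 17 17 18
it 37 21 21 - 21 21 21 19 20
es 22 14 14 14 - 14 13 12 13
pt 15 10 10 10 10 - 9 9 9
pl 28 18 18 17 18 18 - 16 18
ro 24 12 12 12 12 12 12 - 12
nl 7 5 5 4 5 4 4 4 -
"""


def parse_hours_table(text):
    """Parse a src/tgt table into {(src, tgt): hours}, skipping "-" cells.

    Hypothetical helper for illustration only.
    """
    rows = [line.split() for line in text.strip().splitlines()]
    targets = rows[0][1:]  # header row without the "src/tgt" label
    hours = {}
    for row in rows[1:]:
        src, cells = row[0], row[1:]
        for tgt, cell in zip(targets, cells):
            if cell != "-":
                hours[(src, tgt)] = int(cell)
    return hours


train = parse_hours_table(TRAIN_HOURS)
print(train[("en", "fr")])  # hours for the en -> fr direction
print(len(train))           # number of directions listed in the table
```

The same function can be applied to the train-noisy table, provided it is pasted in the same whitespace-separated layout.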

Train-noisy set (hours):

src/tgt	en	fr	de	it	es	pt	pl	ro	nl
en	-	89	90	84	88	89	87	89	88
fr	39	-	39	38	39	41	40	38	43
de	54	54	51	53	53	53	53	53	53
it	15	15	15	-	15	15	15	15	15
es	10	10	10	10	-	10	10	10	10
pt	5	5	5	5	5	-	5	5	5
pl	16	15	16	15	16	15	-	16	15
ro	4	4	3	3	4	4	3	-	4
nl	5	5	5	5	5	5	5	5	-

The dev and test sets are all between 3 and 6 hours long.
