Europarl-ST

Europarl-ST is a multilingual speech translation corpus containing paired audio-text samples, built from the debates held in the European Parliament between 2008 and 2012.

The full details of the corpus are available in the paper:
https://ieeexplore.ieee.org/document/9054626
(Preprint also available: https://arxiv.org/abs/1911.03167)

New release v1.1

For more information about the activities of our research group, visit:
https://www.mllp.upv.es/

For questions or comments about the corpus, contact Javier Iranzo-Sánchez (jairsan@upv.es).

If you use the corpus in your research, please cite the following reference:

  @INPROCEEDINGS{jairsan2020a,
    author={J. {Iranzo-Sánchez} and J. A. {Silvestre-Cerdà} and J. {Jorge} and N. {Roselló} and A. {Giménez} and A. {Sanchis} and J. {Civera} and A. {Juan}},
    booktitle={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    title={Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates},
    year={2020},
    pages={8229-8233},
  }
  

A summary of the data for release v1.1 is as follows:

Train set (hours):
src/tgt  en  fr  de  it  es  pt  pl  ro  nl
en        -  81  83  80  81  81  79  72  80
fr       32   -  21  20  21  22  20  18  22
de       30  18   -  17  18  18  17  17  18
it       37  21  21   -  21  21  21  19  20
es       22  14  14  14   -  14  13  12  13
pt       15  10  10  10  10   -   9   9   9
pl       28  18  18  17  18  18   -  16  18
ro       24  12  12  12  12  12  12   -  12
nl        7   5   5   4   5   4   4   4   -

Train-noisy set (hours):
src/tgt  en  fr  de  it  es  pt  pl  ro  nl
en        -  89  90  84  88  89  87  89  88
fr       39   -  39  38  39  41  40  38  43
de       54  54  51  53  53  53  53  53  53
it       15  15  15   -  15  15  15  15  15
es       10  10  10  10   -  10  10  10  10
pt        5   5   5   5   5   -   5   5   5
pl       16  15  16  15  16  15   -  16  15
ro        4   4   3   3   4   4   3   -   4
nl        5   5   5   5   5   5   5   5   -

The dev and test sets each contain between 3 and 6 hours.
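
The hour tables above can be read as a matrix with source languages in rows and target languages in columns. As a minimal sketch, the snippet below parses the train-set table into a dictionary so that the hours for a given translation direction can be looked up. The function name `parse_hours_table` and the variable names are hypothetical and are not part of the corpus distribution; the table text is copied from this README.

```python
def parse_hours_table(text):
    """Parse a whitespace-separated 'src/tgt' matrix into {(src, tgt): hours}."""
    rows = [line.split() for line in text.strip().splitlines()]
    targets = rows[0][1:]  # drop the 'src/tgt' corner label
    hours = {}
    for row in rows[1:]:
        src, cells = row[0], row[1:]
        for tgt, cell in zip(targets, cells):
            if cell != "-":  # '-' marks the same-language cell, which has no data
                hours[(src, tgt)] = int(cell)
    return hours


# Train-set table for release v1.1, copied from this README.
TRAIN_TABLE = """
src/tgt en fr de it es pt pl ro nl
en - 81 83 80 81 81 79 72 80
fr 32 - 21 20 21 22 20 18 22
de 30 18 - 17 18 18 17 17 18
it 37 21 21 - 21 21 21 19 20
es 22 14 14 14 - 14 13 12 13
pt 15 10 10 10 10 - 9 9 9
pl 28 18 18 17 18 18 - 16 18
ro 24 12 12 12 12 12 12 - 12
nl 7 5 5 4 5 4 4 4 -
"""

train_hours = parse_hours_table(TRAIN_TABLE)
print(train_hours[("en", "fr")])  # hours of English audio paired with French text
print(train_hours[("nl", "en")])  # hours of Dutch audio paired with English text
```

Because the parser splits on whitespace, it works whether or not the table columns are aligned.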