Publications

Iranzo-Sánchez, Javier; Iranzo-Sánchez, Jorge; Giménez, Adrià; Civera, Jorge; Juan, Alfons

Segmentation-Free Streaming Machine Translation Journal Article

Transactions of the Association for Computational Linguistics, 12, pp. 1104–1121, 2024 (also accepted for presentation at ACL 2024).

Tags: segmentation-free, streaming machine translation

Benstead, Kim; Brandl, Andreas; Brouwers, Ton; Civera, Jorge; Collen, Sarah; Csaba, Degi L; Munter, Johan De; Dewitte, Marieke; Diez de los Rios, Celia; Dodlek, Nikolina; Eriksen, Jesper G; Forget, Patrice; Gasparatto, Chiara; Geissler, Jan; Hall, Corinne; Juan, Alfons; Kalz, Marco; Kelly, Richard; Klis, Giorgos; Kulaksiz, Taibe; Lecoq, Carine; Marangoni, Francesca; McInally, Wendy; Oliver, Kathy; Popovics, Maria; Poulios, Christos; Price, Richard; Rollo, Irena; Romeo, Silvia; Steinbacher, Jana; Sulosaari, Virpi; O’Higgins, Niall

An inter-specialty cancer training programme curriculum for Europe Journal Article

European Journal of Surgical Oncology, 49 (9), pp. 106989, 2023.

Tags: educational technologies, Neural Machine Translation

@article{Benstead2023,
title = {An inter-specialty cancer training programme curriculum for Europe},
author = {Kim Benstead and Andreas Brandl and Ton Brouwers and Jorge Civera and Sarah Collen and Degi L. Csaba and Johan De Munter and Marieke Dewitte and Diez de los Rios, Celia and Nikolina Dodlek and Jesper G. Eriksen and Patrice Forget and Chiara Gasparatto and Jan Geissler and Corinne Hall and Alfons Juan and Marco Kalz and Richard Kelly and Giorgos Klis and Taibe Kulaksiz and Carine Lecoq and Francesca Marangoni and Wendy McInally and Kathy Oliver and Maria Popovics and Christos Poulios and Richard Price and Irena Rollo and Silvia Romeo and Jana Steinbacher and Virpi Sulosaari and Niall O’Higgins},
doi = {10.1016/j.ejso.2023.106989},
year = {2023},
date = {2023-07-28},
journal = {European Journal of Surgical Oncology},
volume = {49},
number = {9},
pages = {106989},
abstract = {INTRODUCTION: Multidisciplinary and multi-professional collaboration is vital in providing better outcomes for patients. The aim of the INTERACT-EUROPE Project (Wide Ranging Cooperation and Cutting Edge Innovation As A Response To Cancer Training Needs) was to develop an inter-specialty curriculum. A pilot project will enable a pioneer cohort to acquire a sample of the competencies needed. METHODS: A scoping review, qualitative and quantitative surveys were undertaken. The quantitative survey results are reported here. Respondents, including members of education boards, curriculum committees, trainee committees of European specialist societies and the ECO Patient Advisory Committee, were asked to score 127 proposed competencies on a 7-point Likert scale as to their value in achieving the aims of the curriculum. Results were discussed and competencies developed at two stakeholder meetings. A consultative document, shared with stakeholders and available online, requested views regarding the other components of the curriculum. RESULTS: Eleven competencies were revised, three omitted and three added. The competencies were organised according to the CanMEDS framework with 13 Entrustable Professional Activities, 23 competencies and 127 enabling competencies covering all roles in the framework. Recommendations regarding the infrastructure, organisational aspects, eligibility of trainees and training centres, programme contents, assessment and evaluation were developed using the replies to the consultative document. CONCLUSIONS: An Inter-specialty Cancer Training Programme Curriculum and a pilot programme with virtual and face-to-face components have been developed with the aim of improving the care of people affected by cancer.},
keywords = {educational technologies, Neural Machine Translation},
pubstate = {published},
tppubtype = {article}
}

Pérez González de Martos, Alejandro; Giménez Pastor, Adrià; Jorge Cano, Javier; Iranzo-Sánchez, Javier; Silvestre-Cerdà, Joan Albert; Garcés Díaz-Munío, Gonçal V; Baquero-Arnal, Pau; Sanchis Navarro, Alberto; Civera Sáiz, Jorge; Juan Ciscar, Alfons; Turró Ribalta, Carlos

Doblaje automático de vídeo-charlas educativas en UPV[Media] Inproceedings

Proc. of VIII Congrés d'Innovació Educativa i Docència en Xarxa (IN-RED 2022), pp. 557–570, València (Spain), 2022.

Tags: automatic dubbing, Automatic Speech Recognition, Machine Translation, OER, text-to-speech

Iranzo-Sánchez, Javier; Jorge, Javier; Pérez-González-de-Martos, Alejandro; Giménez, Adrià; Garcés Díaz-Munío, Gonçal V; Baquero-Arnal, Pau; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons

MLLP-VRAIN UPV systems for the IWSLT 2022 Simultaneous Speech Translation and Speech-to-Speech Translation tasks Inproceedings

Proc. of 19th Intl. Conf. on Spoken Language Translation (IWSLT 2022), pp. 255–264, Dublin (Ireland), 2022.

Tags: Simultaneous Speech Translation, speech-to-speech translation

Iranzo-Sánchez, Javier; Civera, Jorge; Juan, Alfons

From Simultaneous to Streaming Machine Translation by Leveraging Streaming History Inproceedings

Proc. 60th Annual Meeting of the Association for Computational Linguistics Vol. 1: Long Papers (ACL 2022), pp. 6972–6985, Dublin (Ireland), 2022.

Tags: simultaneous machine translation, streaming machine translation

Baquero-Arnal, Pau; Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Pérez-González-de-Martos, Alejandro; Garcés Díaz-Munío, Gonçal V; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons

MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge: Extension Journal Article

Applied Sciences, 12 (2), pp. 804, 2022.

Tags: Automatic Speech Recognition, Natural Language Processing, streaming

@article{applsci1505192,
title = {MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge: Extension},
author = {Pau Baquero-Arnal and Javier Jorge and Adrià Giménez and Javier Iranzo-Sánchez and Alejandro Pérez-González-de-Martos and Garcés Díaz-Munío, Gonçal V. and Joan Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan},
doi = {10.3390/app12020804},
year = {2022},
date = {2022-01-01},
journal = {Applied Sciences},
volume = {12},
number = {2},
pages = {804},
abstract = {This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzín-RTVE 2020 Speech-to-Text Challenge, and includes an extension of the work consisting in building and evaluating equivalent systems under the closed data conditions from the 2018 challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid ASR system using streaming one-pass decoding with a context window of 1.5 seconds. This system achieved 16.0% WER on the test-2020 set. We also submitted three contrastive systems. From these, we highlight the system c2-streaming_600ms_t which, following a similar configuration as the primary system with a smaller context window of 0.6 s, scored 16.9% WER points on the same test set, with a measured empirical latency of 0.81±0.09 seconds (mean±stdev). That is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative. As an extension, the equivalent closed-condition systems obtained 23.3% WER and 23.5% WER respectively. When evaluated with an unconstrained language model, we obtained 19.9% WER and 20.4% WER; i.e., not far behind the top-performing systems with only 5% of the full acoustic data and with the extra ability of being streaming-capable. Indeed, all of these streaming systems could be put into production environments for automatic captioning of live media streams.},
keywords = {Automatic Speech Recognition, Natural Language Processing, streaming},
pubstate = {published},
tppubtype = {article}
}

Jorge, Javier; Giménez, Adrià; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons

Live Streaming Speech Recognition Using Deep Bidirectional LSTM Acoustic Models and Interpolated Language Models Journal Article

IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, pp. 148–161, 2021.

Tags: acoustic modelling, Automatic Speech Recognition, decoding, language modelling, neural networks, streaming

Pérez, Alejandro; Garcés Díaz-Munío, Gonçal; Giménez, Adrià; Silvestre-Cerdà, Joan Albert; Sanchis, Albert; Civera, Jorge; Jiménez, Manuel; Turró, Carlos; Juan, Alfons

Towards cross-lingual voice cloning in higher education Journal Article

Engineering Applications of Artificial Intelligence, 105, pp. 104413, 2021.

Tags: cross-lingual voice conversion, educational resources, multilinguality, OER, text-to-speech

@article{Pérez2021,
title = {Towards cross-lingual voice cloning in higher education},
author = {Alejandro Pérez and Garcés Díaz-Munío, Gonçal and Adrià Giménez and Silvestre-Cerdà, Joan Albert and Albert Sanchis and Jorge Civera and Manuel Jiménez and Carlos Turró and Alfons Juan},
url = {https://doi.org/10.1016/j.engappai.2021.104413},
year = {2021},
date = {2021-10-01},
journal = {Engineering Applications of Artificial Intelligence},
volume = {105},
pages = {104413},
abstract = {The rapid progress of modern AI tools for automatic speech recognition and machine translation is leading to a progressive cost reduction to produce publishable subtitles for educational videos in multiple languages. Similarly, text-to-speech technology is experiencing large improvements in terms of quality, flexibility and capabilities. In particular, state-of-the-art systems are now capable of seamlessly dealing with multiple languages and speakers in an integrated manner, thus enabling lecturer's voice cloning in languages she/he might not even speak. This work is to report the experience gained on using such systems at the Universitat Politècnica de València (UPV), mainly as a guidance for other educational organizations willing to conduct similar studies. It builds on previous work on the UPV's main repository of educational videos, MediaUPV, to produce multilingual subtitles at scale and low cost. Here, a detailed account is given on how this work has been extended to also allow for massive machine dubbing of MediaUPV. This includes collecting 59 hours of clean speech data from UPV’s academic staff, and extending our production pipeline of subtitles with a state-of-the-art multilingual and multi-speaker text-to-speech system trained from the collected data. Our main result comes from an extensive, subjective evaluation of this system by lecturers contributing to data collection. In brief, it is shown that text-to-speech technology is not only mature enough for its application to MediaUPV, but also needed as soon as possible by students to improve its accessibility and bridge language barriers.},
keywords = {cross-lingual voice conversion, educational resources, multilinguality, OER, text-to-speech},
pubstate = {published},
tppubtype = {article}
}

Jorge, Javier; Giménez, Adrià; Baquero-Arnal, Pau; Iranzo-Sánchez, Javier; Pérez-González-de-Martos, Alejandro; Garcés Díaz-Munío, Gonçal V; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons

MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge Inproceedings

Proc. of IberSPEECH 2021, pp. 118–122, Valladolid (Spain), 2021.

Tags: Automatic Speech Recognition, Natural Language Processing, streaming

@inproceedings{Jorge2021,
title = {MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge},
author = {Javier Jorge and Adrià Giménez and Pau Baquero-Arnal and Javier Iranzo-Sánchez and Alejandro Pérez-González-de-Martos and Garcés Díaz-Munío, Gonçal V. and Joan Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan},
doi = {10.21437/IberSPEECH.2021-25},
year = {2021},
date = {2021-03-24},
booktitle = {Proc. of IberSPEECH 2021},
pages = {118--122},
address = {Valladolid (Spain)},
abstract = {1st place in IberSpeech-RTVE 2020 TV Speech-to-Text Challenge.

[EN] This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzin-RTVE 2020 Speech-to-Text Challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid BLSTM-HMM ASR system using streaming one-pass decoding with a context window of 1.5 seconds and a linear combination of an n-gram, an LSTM, and a Transformer language model (LM). The acoustic model was trained on nearly 4,000 hours of speech data from different sources, using the MLLP's transLectures-UPV toolkit (TLK) and TensorFlow; whilst LMs were trained using SRILM (n-gram), CUED-RNNLM (LSTM) and Fairseq (Transformer), with up to 102G tokens. This system achieved 11.6% and 16.0% WER on the test-2018 and test-2020 sets, respectively. As it is streaming-enabled, it could be put into production environments for automatic captioning of live media streams, with a theoretical delay of 1.5 seconds. Along with the primary system, we also submitted three contrastive systems. From these, we highlight the system c2-streaming_600ms_t that, following the same configuration as the primary one, but using a smaller context window of 0.6 seconds and a Transformer LM, scored 12.3% and 16.9% WER points respectively on the same test sets, with a measured empirical latency of 0.81±0.09 seconds (mean±stdev). That is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative.

[CA] "Sistemes de reconeixement automàtic de la parla en castellà de MLLP-VRAIN per a la competició Albayzin-RTVE 2020 Speech-To-Text Challenge": En aquest article, es descriuen els sistemes de reconeixement automàtic de la parla (RAP) creats pel grup d'investigació MLLP-VRAIN de la Universitat Politecnica de València per a la competició Albayzin-RTVE 2020 Speech-to-Text Challenge. El sistema primari (p-streaming_1500ms_nlt) és un sistema de RAP híbrid BLSTM-HMM amb descodificació en temps real en una passada amb una finestra de context d'1,5 segons i una combinació lineal de models de llenguatge (ML) d'n-grames, LSTM i Transformer. El model acústic s'ha entrenat amb vora 4000 hores de parla transcrita de diferents fonts, usant el transLectures-UPV toolkit (TLK) del grup MLLP i TensorFlow; mentre que els ML s'han entrenat amb SRILM (n-grames), CUED-RNNLM (LSTM) i Fairseq (Transformer), amb 102G paraules (tokens). Aquest sistema ha obtingut 11,6 % i 16,0 % de WER en els conjunts test-2018 i test-2020, respectivament. És un sistema amb capacitat de temps real, que pot desplegar-se en producció per a subtitulació automàtica de fluxos audiovisuals en directe, amb un retard teòric d'1,5 segons. A banda del sistema primari, s'han presentat tres sistemes contrastius. D'aquests, destaquem el sistema c2-streaming_600ms_t que, amb la mateixa configuració que el sistema primari, però amb una finestra de context més reduïda de 0,6 segons i un ML Transformer, ha obtingut 12,3 % i 16,9 % de WER, respectivament, sobre els mateixos conjunts, amb una latència empírica mesurada de 0,81+-0,09 segons (mitjana+-desv). És a dir, s'han obtingut latències punteres per a subtitulació automàtica en directe d'alta qualitat amb una degradació del WER petita, del 6 % relatiu.},
keywords = {Automatic Speech Recognition, Natural Language Processing, streaming},
pubstate = {published},
tppubtype = {inproceedings}
}

Iranzo-Sánchez, Javier; Jorge, Javier; Baquero-Arnal, Pau; Silvestre-Cerdà, Joan Albert; Giménez, Adrià; Civera, Jorge; Sanchis, Albert; Juan, Alfons

Streaming cascade-based speech translation leveraged by a direct segmentation model Journal Article

Neural Networks, 142, pp. 303–315, 2021.

Tags: Automatic Speech Recognition, Cascade System, Deep Neural Networks, Hybrid System, Machine Translation, Segmentation Model, Speech Translation, streaming

@article{Iranzo-Sánchez2021,
title = {Streaming cascade-based speech translation leveraged by a direct segmentation model},
author = {Javier Iranzo-Sánchez and Javier Jorge and Pau Baquero-Arnal and Silvestre-Cerdà, Joan Albert and Adrià Giménez and Jorge Civera and Albert Sanchis and Alfons Juan},
doi = {10.1016/j.neunet.2021.05.013},
year = {2021},
date = {2021-01-01},
journal = {Neural Networks},
volume = {142},
pages = {303--315},
abstract = {The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. Nowadays, state-of-the-art ST systems are populated with deep neural networks that are conceived to work in an offline setup in which the audio input to be translated is fully available in advance. However, a streaming setup defines a completely different picture, in which an unbounded audio input gradually becomes available and at the same time the translation needs to be generated under real-time constraints. In this work, we present a state-of-the-art streaming ST system in which neural-based models integrated in the ASR and MT components are carefully adapted in terms of their training and decoding procedures in order to run under a streaming setup. In addition, a direct segmentation model that adapts the continuous ASR output to the capacity of simultaneous MT systems trained at the sentence level is introduced to guarantee low latency while preserving the translation quality of the complete ST system. The resulting ST system is thoroughly evaluated on the real-life streaming Europarl-ST benchmark to gauge the trade-off between quality and latency for each component individually as well as for the complete ST system.},
keywords = {Automatic Speech Recognition, Cascade System, Deep Neural Networks, Hybrid System, Machine Translation, Segmentation Model, Speech Translation, streaming},
pubstate = {published},
tppubtype = {article}
}

Garcés Díaz-Munío, Gonçal V; Silvestre-Cerdà, Joan Albert; Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Baquero-Arnal, Pau; Roselló, Nahuel; Pérez-González-de-Martos, Alejandro; Civera, Jorge; Sanchis, Albert; Juan, Alfons

Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization Inproceedings

Proc. Interspeech 2021, pp. 3695–3699, Brno (Czech Republic), 2021.

Tags: Automatic Speech Recognition, speech corpus, speech data filtering, speech data verbatimization

@inproceedings{Garcés2021,
title = {Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization},
author = {Garcés Díaz-Munío, Gonçal V. and Silvestre-Cerdà, Joan Albert and Javier Jorge and Adrià Giménez and Javier Iranzo-Sánchez and Pau Baquero-Arnal and Nahuel Roselló and Alejandro Pérez-González-de-Martos and Jorge Civera and Albert Sanchis and Alfons Juan},
url = {https://www.mllp.upv.es/wp-content/uploads/2021/09/europarl-asr-presentation-extended.pdf
https://www.youtube.com/watch?v=Tc0gNSDdnQg&list=PLlePn-Yanvnc_LRhgmmaNmH12Bwm6BRsZ
https://paperswithcode.com/paper/europarl-asr-a-large-corpus-of-parliamentary
https://github.com/mllpresearch/Europarl-ASR},
doi = {10.21437/Interspeech.2021-1905},
year = {2021},
date = {2021-01-01},
booktitle = {Proc. Interspeech 2021},
journal = {Proc. Interspeech 2021},
pages = {3695--3699},
address = {Brno (Czech Republic)},
abstract = {[EN] We introduce Europarl-ASR, a large speech and text corpus of parliamentary debates including 1300 hours of transcribed speeches and 70 million tokens of text in English extracted from European Parliament sessions. The training set is labelled with the Parliament’s non-fully-verbatim official transcripts, time-aligned. As verbatimness is critical for acoustic model training, we also provide automatically noise-filtered and automatically verbatimized transcripts of all speeches based on speech data filtering and verbatimization techniques. Additionally, 18 hours of transcribed speeches were manually verbatimized to build reliable speaker-dependent and speaker-independent development/test sets for streaming ASR benchmarking. The availability of manual non-verbatim and verbatim transcripts for dev/test speeches makes this corpus useful for the assessment of automatic filtering and verbatimization techniques. This paper describes the corpus and its creation, and provides off-line and streaming ASR baselines for both the speaker-dependent and speaker-independent tasks using the three training transcription sets. The corpus is publicly released under an open licence.

[CA] "Europarl-ASR: Un extens corpus parlamentari de referència per a reconeixement de la parla i filtratge/literalització de transcripcions": Presentem Europarl-ASR, un extens corpus de veu i text de debats parlamentaris amb 1300 hores d'intervencions transcrites i 70 milions de paraules de text en anglés extrets de sessions del Parlament Europeu. Les transcripcions oficials del Parlament Europeu, no literals, s'han sincronitzat per a tot el conjunt d'entrenament. Com que l'entrenament de models acústics requereix transcripcions com més literals millor, també s'han inclòs transcripcions filtrades i transcripcions literalitzades de totes les intervencions, basades
en tècniques de filtratge i literalització automàtics. A més, s'han inclòs 18 hores de transcripcions literals revisades manualment per definir dos conjunts de validació i avaluació de referència per a reconeixement automàtic de la parla en temps real, amb oradors coneguts i amb oradors desconeguts. Pel fet de disposar de transcripcions literals i no literals, aquest corpus és també ideal per a l'anàlisi de tècniques de filtratge i de literalització. En aquest article, es descriu la creació del corpus i es proporcionen mesures de referència de reconeixement automàtic de la parla en temps real i en diferit, amb oradors coneguts i amb oradors desconeguts, usant els tres conjunts de transcripcions d'entrenament. El corpus es fa públic amb una llicència oberta.},
keywords = {Automatic Speech Recognition, speech corpus, speech data filtering, speech data verbatimization},
pubstate = {published},
tppubtype = {inproceedings}
}

Pérez-González-de-Martos, Alejandro; Iranzo-Sánchez, Javier; Giménez Pastor, Adrià; Jorge, Javier; Silvestre-Cerdà, Joan-Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons

Towards simultaneous machine interpretation Inproceedings

Proc. Interspeech 2021, pp. 2277–2281, Brno (Czech Republic), 2021.

Tags: cross-lingual voice cloning, incremental text-to-speech, simultaneous machine interpretation, speech-to-speech translation

Iranzo-Sánchez, Javier; Civera, Jorge; Juan, Alfons

Stream-level Latency Evaluation for Simultaneous Machine Translation Inproceedings

Findings of the ACL: EMNLP 2021, pp. 664–670, Punta Cana (Dominican Republic), 2021.

Tags: latency, simultaneous machine translation, stream-level evaluation, streaming

Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons

LSTM-Based One-Pass Decoder for Low-Latency Streaming Inproceedings

Proc. of 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020), pp. 7814–7818, Barcelona (Spain), 2020.

Tags: acoustic modeling, Automatic Speech Recognition, decoding, Language Modeling, streaming

Iranzo-Sánchez, Javier; Silvestre-Cerdà, Joan Albert; Jorge, Javier; Roselló, Nahuel; Giménez, Adrià; Sanchis, Albert; Civera, Jorge; Juan, Alfons

Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates Inproceedings

Proc. of 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020), pp. 8229–8233, Barcelona (Spain), 2020.

Tags: Automatic Speech Recognition, Machine Translation, Multilingual Corpus, Speech Translation, Spoken Language Translation

Baquero-Arnal, Pau; Jorge, Javier; Giménez, Adrià; Silvestre-Cerdà, Joan Albert; Iranzo-Sánchez, Javier; Sanchis, Albert; Civera, Jorge; Juan, Alfons

Improved Hybrid Streaming ASR with Transformer Language Models Inproceedings

Proc. of 21st Annual Conf. of the Intl. Speech Communication Association (InterSpeech 2020), pp. 2127–2131, Shanghai (China), 2020.

Tags: hybrid ASR, language models, streaming, Transformer

Iranzo-Sánchez, Javier; Giménez Pastor, Adrià; Silvestre-Cerdà, Joan Albert; Baquero-Arnal, Pau; Civera Saiz, Jorge; Juan, Alfons

Direct Segmentation Models for Streaming Speech Translation Inproceedings

Proc. of 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pp. 2599–2611, 2020.

Tags: Segmentation, Speech Translation, streaming

Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Civera, Jorge; Sanchis, Albert; Juan, Alfons

Real-time One-pass Decoder for Speech Recognition Using LSTM Language Models Inproceedings

Proc. of the 20th Annual Conf. of the ISCA (Interspeech 2019), pp. 3820–3824, Graz (Austria), 2019.

Tags: Automatic Speech Recognition, LSTM language models, one-pass decoding, real-time

Baquero-Arnal, Pau; Iranzo-Sánchez, Javier; Civera, Jorge; Juan, Alfons

The MLLP-UPV Spanish-Portuguese and Portuguese-Spanish Machine Translation Systems for WMT19 Similar Language Translation Task Inproceedings

Proc. of Fourth Conference on Machine Translation (WMT19), pp. 179–184, Florence (Italy), 2019.

Tags: Machine Translation, Neural Machine Translation, WMT19

Iranzo-Sánchez, Javier; Garcés Díaz-Munío, Gonçal V.; Civera, Jorge; Juan, Alfons

The MLLP-UPV Supervised Machine Translation Systems for WMT19 News Translation Task Inproceedings

Proc. of Fourth Conference on Machine Translation (WMT19), pp. 218–224, Florence (Italy), 2019.

Abstract | Links | BibTeX | Tags: Machine Translation, Neural Machine Translation, WMT19 News Translation

Iranzo-Sánchez, Javier; Baquero-Arnal, Pau; Garcés Díaz-Munío, Gonçal V.; Martínez-Villaronga, Adrià; Civera, Jorge; Juan, Alfons

The MLLP-UPV German-English Machine Translation System for WMT18 Inproceedings

Proc. of the Third Conference on Machine Translation (WMT18), Volume 2: Shared Task Papers, pp. 422–428, Brussels (Belgium), 2018.

Abstract | Links | BibTeX | Tags: Data Selection, Machine Translation, Neural Machine Translation, WMT18 news translation

@inproceedings{Iranzo-Sánchez2018,
title = {The MLLP-UPV German-English Machine Translation System for WMT18},
author = {Iranzo-Sánchez, Javier and Baquero-Arnal, Pau and Garcés Díaz-Munío, Gonçal V. and Martínez-Villaronga, Adrià and Civera, Jorge and Juan, Alfons},
url = {http://dx.doi.org/10.18653/v1/W18-6414
https://www.mllp.upv.es/wp-content/uploads/2018/11/wmt18_mllp-upv_poster.pdf},
year = {2018},
date = {2018-01-01},
booktitle = {Proc. of the Third Conference on Machine Translation (WMT18), Volume 2: Shared Task Papers},
pages = {422--428},
address = {Brussels (Belgium)},
abstract = {[EN] This paper describes the statistical machine translation system built by the MLLP research group of Universitat Politècnica de València for the German>English news translation shared task of the EMNLP 2018 Third Conference on Machine Translation (WMT18). We used an ensemble of Transformer architecture–based neural machine translation systems. To train our system under "constrained" conditions, we filtered the provided parallel data with a scoring technique using character-based language models, and we added parallel data based on synthetic source sentences generated from the provided monolingual corpora.

[CA] "El sistema de traducció automàtica alemany>anglés de l'MLLP-UPV per a WMT18": En aquest article descrivim el sistema de traducció automàtica estadística creat pel grup d'investigació MLLP de la Universitat Politècnica de València per a la competició de traducció de notícies alemany>anglés de la Third Conference on Machine Translation (WMT18, associada a la conferència EMNLP 2018). Hem utilitzat una combinació de sistemes de traducció automàtica neuronal basats en l'arquitectura Transformer. Per a entrenar el nostre sistema en la categoria "fitada" (només amb els corpus lingüístics oficials de la competició), hem filtrat les dades paral·leles disponibles amb una tècnica que assigna puntuacions utilitzant models de llenguatge de caràcters, i hem afegit dades paral·leles basades en frases d'origen sintètiques generades a partir dels corpus monolingües disponibles.},
keywords = {Data Selection, Machine Translation, Neural Machine Translation, WMT18 news translation},
pubstate = {published},
tppubtype = {inproceedings}
}

Del-Agua, Miguel Ángel; Giménez, Adrià; Sanchis, Alberto; Civera, Jorge; Juan, Alfons

Speaker-Adapted Confidence Measures for ASR using Deep Bidirectional Recurrent Neural Networks Journal Article

IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26 (7), pp. 1194–1202, 2018.

Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, Confidence estimation, Confidence measures, Deep bidirectional recurrent neural networks, Long short-term memory, Speaker adaptation

@article{Del-Agua2018,
title = {Speaker-Adapted Confidence Measures for ASR using Deep Bidirectional Recurrent Neural Networks},
author = {Del-Agua, Miguel Ángel and Giménez, Adrià and Sanchis, Alberto and Civera, Jorge and Juan, Alfons},
url = {http://www.mllp.upv.es/wp-content/uploads/2018/04/Del-Agua2018_authors_version.pdf
https://doi.org/10.1109/TASLP.2018.2819900},
year = {2018},
date = {2018-01-01},
journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
volume = {26},
number = {7},
pages = {1194--1202},
abstract = {In recent years, Deep Bidirectional Recurrent Neural Networks (DBRNN) and DBRNN with Long Short-Term Memory cells (DBLSTM) have outperformed the most accurate classifiers for confidence estimation in automatic speech recognition. At the same time, we have recently shown that speaker adaptation of confidence measures using DBLSTM yields significant improvements over non-adapted confidence measures. In accordance with these two recent contributions to the state of the art in confidence estimation, this paper presents a comprehensive study of speaker-adapted confidence measures using DBRNN and DBLSTM models. Firstly, we present new empirical evidence of the superiority of RNN-based confidence classifiers evaluated over a large speech corpus consisting of the English LibriSpeech and the Spanish poliMedia tasks. Secondly, we show new results on speaker-adapted confidence measures considering a multi-task framework in which RNN-based confidence classifiers trained with LibriSpeech are adapted to speakers of the TED-LIUM corpus. These experiments confirm that speaker-adapted confidence measures outperform their non-adapted counterparts. Lastly, we describe an unsupervised adaptation method for the acoustic DBLSTM model based on confidence measures which results in better automatic speech recognition performance.},
keywords = {Automatic Speech Recognition, Confidence estimation, Confidence measures, Deep bidirectional recurrent neural networks, Long short-term memory, Speaker adaptation},
pubstate = {published},
tppubtype = {article}
}

Valor Miró, Juan Daniel; Baquero-Arnal, Pau; Civera, Jorge; Turró, Carlos; Juan, Alfons

Multilingual videos for MOOCs and OER Journal Article

Journal of Educational Technology & Society, 21 (2), pp. 1–12, 2018.

Abstract | Links | BibTeX | Tags: Machine Translation, MOOCs, multilingual, Speech Recognition, video lecture repositories

@article{Miró2018,
title = {Multilingual videos for MOOCs and OER},
author = {Valor Miró, Juan Daniel and Pau Baquero-Arnal and Jorge Civera and Carlos Turró and Alfons Juan},
url = {https://www.mllp.upv.es/wp-content/uploads/2019/11/JETS2018MLLP.pdf
http://hdl.handle.net/10251/122577
https://www.jstor.org/stable/26388375
https://www.j-ets.net/collection/published-issues/21_2},
year = {2018},
date = {2018-01-01},
journal = {Journal of Educational Technology & Society},
volume = {21},
number = {2},
pages = {1--12},
abstract = {Massive Open Online Courses (MOOCs) and Open Educational Resources (OER) are rapidly growing, but are not usually offered in multiple languages due to the lack of cost-effective solutions to translate the different objects comprising them and particularly videos. However, current state-of-the-art automatic speech recognition (ASR) and machine translation (MT) techniques have reached a level of maturity which opens the possibility of producing multilingual video subtitles of publishable quality at low cost. This work summarizes the authors' experience in exploring this possibility in two real-life case studies: a MOOC platform and a large video lecture repository. Apart from describing the systems, tools and integration components employed for this purpose, a comprehensive evaluation of the results achieved is provided in terms of quality and efficiency. More precisely, it is shown that draft multilingual subtitles produced by domain-adapted ASR/MT systems reach a level of accuracy that makes them worth post-editing, instead of generating them ex novo, saving approximately 25%–75% of the time. Finally, the results reported on user multilingual data consumption reflect that multilingual subtitles have had a very positive impact in our case studies, boosting student enrolment, in the case of the MOOC platform, by 70% relative.},
keywords = {Machine Translation, MOOCs, multilingual, Speech Recognition, video lecture repositories},
pubstate = {published},
tppubtype = {article}
}

Piqueras, Santiago; Pérez, Alejandro; Turró, Carlos; Jiménez, Manuel; Sanchis, Albert; Civera, Jorge; Juan, Alfons

Hacia la traducción integral de vídeo charlas educativas Inproceedings

Proc. of III Congreso Nacional de Innovación Educativa y Docencia en Red (IN-RED 2017), pp. 117–124, València (Spain), 2017.

Abstract | Links | BibTeX | Tags: MOOCs, multilingual, translation

Silvestre-Cerdà, Joan Albert; Juan, Alfons; Civera, Jorge

Different Contributions to Cost-Effective Transcription and Translation of Video Lectures Inproceedings

Proc. of IX Jornadas en Tecnología del Habla and V Iberian SLTech Workshop (IberSpeech 2016), pp. 313–319, Lisbon (Portugal), 2016, ISBN: 978-3-319-49168-4.

Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, Automatic transcription and translation, Machine Translation, Video Lectures

@inproceedings{Silvestre-Cerdà2016b,
title = {Different Contributions to Cost-Effective Transcription and Translation of Video Lectures},
author = {Joan Albert Silvestre-Cerdà and Alfons Juan and Jorge Civera},
url = {http://www.mllp.upv.es/wp-content/uploads/2016/11/poster.pdf
http://www.mllp.upv.es/wp-content/uploads/2016/11/paper.pdf
http://hdl.handle.net/10251/62194},
isbn = {978-3-319-49168-4},
year = {2016},
date = {2016-11-24},
booktitle = {Proc. of IX Jornadas en Tecnología del Habla and V Iberian SLTech Workshop (IberSpeech 2016)},
pages = {313--319},
address = {Lisbon (Portugal)},
abstract = {In recent years, on-line multimedia repositories have experienced strong growth that has consolidated them as essential knowledge assets, especially in the area of education, where large repositories of video lectures have been built in order to complement or even replace traditional teaching methods. However, most of these video lectures are neither transcribed nor translated due to a lack of cost-effective solutions to do so in a way that gives accurate enough results. Solutions of this kind are clearly necessary in order to make these lectures accessible to speakers of different languages and to people with hearing disabilities, among many other benefits and applications.

For this reason, the main aim of this thesis is to develop a cost-effective solution capable of transcribing and translating video lectures to a reasonable degree of accuracy. More specifically, we address the integration of state-of-the-art techniques in Automatic Speech Recognition and Machine Translation into large video lecture repositories to generate high-quality multilingual video subtitles without human intervention and at a reduced computational cost. Also, we explore the potential benefits of the exploitation of the information that we know a priori about these repositories, that is, lecture-specific knowledge such as speaker, topic or slides, to create specialised, in-domain transcription and translation systems by means of massive adaptation techniques.

The proposed solutions have been tested in real-life scenarios by carrying out several objective and subjective evaluations, obtaining very positive results. The main outcome derived from this multidisciplinary thesis, The transLectures-UPV Platform, has been publicly released as open-source software, and, at the time of writing, it is serving automatic transcriptions and translations for several thousands of video lectures in many Spanish and European universities and institutions.},
keywords = {Automatic Speech Recognition, Automatic transcription and translation, Machine Translation, Video Lectures},
pubstate = {published},
tppubtype = {inproceedings}
}

del-Agua, Miguel Ángel; Piqueras, Santiago; Giménez, Adrià; Sanchis, Alberto; Civera, Jorge; Juan, Alfons

ASR Confidence Estimation with Speaker-Adapted Recurrent Neural Networks Inproceedings

Proc. of the 17th Annual Conf. of the ISCA (Interspeech 2016), pp. 3464–3468, San Francisco (USA), 2016.

Abstract | Links | BibTeX | Tags: BLSTM, Confidence measures, Recurrent Neural Networks, Speaker adaptation, Speech Recognition

Valor Miró, Juan Daniel; Turró, C; Civera, J; Juan, A

Generación eficiente de transcripciones y traducciones automáticas en poliMedia Inproceedings

Proc. of II Congreso Nacional de Innovación Educativa y Docencia en Red (IN-RED 2016), pp. 21–29, València (Spain), 2016.

Abstract | Links | BibTeX | Tags: Docencia en Red, e-learning, transcription, translation, video

del-Agua, Miguel Ángel; Martínez-Villaronga, Adrià; Giménez, Adrià; Sanchis, Alberto; Civera, Jorge; Juan, Alfons

The MLLP system for the 4th CHiME Challenge Inproceedings

Proc. of the 4th Intl. Workshop on Speech Processing in Everyday Environments (CHiME 2016), pp. 57–59, San Francisco (USA), 2016.

del-Agua, Miguel Ángel; Martínez-Villaronga, Adrià; Piqueras, Santiago; Giménez, Adrià; Sanchis, Alberto; Civera, Jorge; Juan, Alfons

The MLLP ASR Systems for IWSLT 2015 Inproceedings

Proc. of 12th Intl. Workshop on Spoken Language Translation (IWSLT 2015), pp. 39–44, Da Nang (Vietnam), 2015.

Valor Miró, Juan Daniel; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Turró, Carlos; Juan, Alfons

Efficient Generation of High-Quality Multilingual Subtitles for Video Lecture Repositories Inproceedings

Proc. of 10th European Conf. on Technology Enhanced Learning (EC-TEL 2015), pp. 485–490, Toledo (Spain), 2015, ISBN: 978-3-319-24258-3.

Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, Docencia en Red, Efficient video subtitling, Polimedia, Statistical machine translation, video lecture repositories

@inproceedings{valor2015efficient,
title = {Efficient Generation of High-Quality Multilingual Subtitles for Video Lecture Repositories},
author = {Valor Miró, Juan Daniel and Silvestre-Cerdà, Joan Albert and Civera, Jorge and Turró, Carlos and Juan, Alfons},
url = {http://link.springer.com/chapter/10.1007/978-3-319-24258-3_44
http://www.mllp.upv.es/wp-content/uploads/2016/03/paper.pdf},
isbn = {978-3-319-24258-3},
year = {2015},
date = {2015-09-17},
booktitle = {Proc. of 10th European Conf. on Technology Enhanced Learning (EC-TEL 2015)},
pages = {485--490},
address = {Toledo (Spain)},
abstract = {Video lectures are a valuable educational tool in higher education to support or replace face-to-face lectures in active learning strategies. In 2007 the Universitat Politècnica de València (UPV) implemented its video lecture capture system, resulting in a high quality educational video repository, called poliMedia, with more than 10,000 mini lectures created by 1,373 lecturers. Also, in the framework of the European project transLectures, UPV has automatically generated transcriptions and translations in Spanish, Catalan and English for all videos included in the poliMedia video repository. transLectures’s objective responds to the widely-recognised need for subtitles to be provided with video lectures, as an essential service for non-native speakers and hearing impaired persons, and to allow advanced repository functionalities. Although high-quality automatic transcriptions and translations were generated in transLectures, they were not error-free. For this reason, lecturers need to manually review video subtitles to guarantee the absence of errors. The aim of this study is to evaluate the efficiency of the manual review process from automatic subtitles in comparison with the conventional generation of video subtitles from scratch. The reported results clearly indicate the convenience of providing automatic subtitles as a first step in the generation of video subtitles and the significant savings in time of up to almost 75% involved in reviewing subtitles.},
keywords = {Automatic Speech Recognition, Docencia en Red, Efficient video subtitling, Polimedia, Statistical machine translation, video lecture repositories},
pubstate = {published},
tppubtype = {inproceedings}
}
