2021 |
Iranzo-Sánchez, Javier; Jorge, Javier; Baquero-Arnal, Pau; Silvestre-Cerdà, Joan Albert ; Giménez, Adrià; Civera, Jorge; Sanchis, Albert; Juan, Alfons Streaming cascade-based speech translation leveraged by a direct segmentation model Journal Article Neural Networks, 142 , pp. 303–315, 2021. Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, Cascade System, Deep Neural Networks, Hybrid System, Machine Translation, Segmentation Model, Speech Translation, streaming @article{Iranzo-Sánchez2021, title = {Streaming cascade-based speech translation leveraged by a direct segmentation model}, author = {Javier Iranzo-Sánchez and Javier Jorge and Pau Baquero-Arnal and Silvestre-Cerdà, Joan Albert and Adrià Giménez and Jorge Civera and Albert Sanchis and Alfons Juan}, doi = {10.1016/j.neunet.2021.05.013}, year = {2021}, date = {2021-01-01}, journal = {Neural Networks}, volume = {142}, pages = {303--315}, abstract = {The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. Nowadays, state-of-the-art ST systems are populated with deep neural networks that are conceived to work in an offline setup in which the audio input to be translated is fully available in advance. However, a streaming setup defines a completely different picture, in which an unbounded audio input gradually becomes available and at the same time the translation needs to be generated under real-time constraints. In this work, we present a state-of-the-art streaming ST system in which neural-based models integrated in the ASR and MT components are carefully adapted in terms of their training and decoding procedures in order to run under a streaming setup. In addition, a direct segmentation model that adapts the continuous ASR output to the capacity of simultaneous MT systems trained at the sentence level is introduced to guarantee low latency while preserving the translation quality of the complete ST system. The resulting ST system is thoroughly evaluated on the real-life streaming Europarl-ST benchmark to gauge the trade-off between quality and latency for each component individually as well as for the complete ST system.}, keywords = {Automatic Speech Recognition, Cascade System, Deep Neural Networks, Hybrid System, Machine Translation, Segmentation Model, Speech Translation, streaming}, pubstate = {published}, tppubtype = {article} } The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. Nowadays, state-of-the-art ST systems are populated with deep neural networks that are conceived to work in an offline setup in which the audio input to be translated is fully available in advance. However, a streaming setup defines a completely different picture, in which an unbounded audio input gradually becomes available and at the same time the translation needs to be generated under real-time constraints. In this work, we present a state-of-the-art streaming ST system in which neural-based models integrated in the ASR and MT components are carefully adapted in terms of their training and decoding procedures in order to run under a streaming setup. In addition, a direct segmentation model that adapts the continuous ASR output to the capacity of simultaneous MT systems trained at the sentence level is introduced to guarantee low latency while preserving the translation quality of the complete ST system. The resulting ST system is thoroughly evaluated on the real-life streaming Europarl-ST benchmark to gauge the trade-off between quality and latency for each component individually as well as for the complete ST system. |
2020 |
Iranzo-Sánchez, Javier; Silvestre-Cerdà, Joan Albert; Jorge, Javier; Roselló, Nahuel; Giménez, Adrià; Sanchis, Albert; Civera, Jorge; Juan, Alfons Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates Inproceedings Proc. of 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020), pp. 8229–8233, Barcelona (Spain), 2020. Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, Machine Translation, Multilingual Corpus, Speech Translation, Spoken Language Translation @inproceedings{Iranzo2020, title = {Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates}, author = {Javier Iranzo-Sánchez and Joan Albert Silvestre-Cerdà and Javier Jorge and Nahuel Roselló and Adrià Giménez and Albert Sanchis and Jorge Civera and Alfons Juan}, url = {https://arxiv.org/abs/1911.03167 https://paperswithcode.com/paper/europarl-st-a-multilingual-corpus-for-speech https://www.mllp.upv.es/europarl-st/}, doi = {10.1109/ICASSP40776.2020.9054626}, year = {2020}, date = {2020-01-01}, booktitle = {Proc. of 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020)}, pages = {8229--8233}, address = {Barcelona (Spain)}, abstract = {Current research into spoken language translation (SLT), or speech-to-text translation, is often hampered by the lack of specific data resources for this task, as currently available SLT datasets are restricted to a limited set of language pairs. In this paper we present Europarl-ST, a novel multilingual SLT corpus containing paired audio-text samples for SLT from and into 6 European languages, for a total of 30 different translation directions. This corpus has been compiled using the de-bates held in the European Parliament in the period between2008 and 2012. This paper describes the corpus creation process and presents a series of automatic speech recognition,machine translation and spoken language translation experiments that highlight the potential of this new resource. The corpus is released under a Creative Commons license and is freely accessible and downloadable.}, keywords = {Automatic Speech Recognition, Machine Translation, Multilingual Corpus, Speech Translation, Spoken Language Translation}, pubstate = {published}, tppubtype = {inproceedings} } Current research into spoken language translation (SLT), or speech-to-text translation, is often hampered by the lack of specific data resources for this task, as currently available SLT datasets are restricted to a limited set of language pairs. In this paper we present Europarl-ST, a novel multilingual SLT corpus containing paired audio-text samples for SLT from and into 6 European languages, for a total of 30 different translation directions. This corpus has been compiled using the de-bates held in the European Parliament in the period between2008 and 2012. This paper describes the corpus creation process and presents a series of automatic speech recognition,machine translation and spoken language translation experiments that highlight the potential of this new resource. The corpus is released under a Creative Commons license and is freely accessible and downloadable. |
Iranzo-Sánchez, Javier; Giménez Pastor, Adrià ; Silvestre-Cerdà, Joan Albert; Baquero-Arnal, Pau; Saiz, Jorge Civera; Juan, Alfons Direct Segmentation Models for Streaming Speech Translation Inproceedings Proc. of 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pp. 2599–2611, 2020. Abstract | Links | BibTeX | Tags: Segmentation, Speech Translation, streaming @inproceedings{Iranzo-Sánchez2020, title = {Direct Segmentation Models for Streaming Speech Translation}, author = {Javier Iranzo-Sánchez and Giménez Pastor, Adrià and Joan Albert Silvestre-Cerdà and Pau Baquero-Arnal and Jorge Civera Saiz and Alfons Juan}, url = {http://dx.doi.org/10.18653/v1/2020.emnlp-main.206}, year = {2020}, date = {2020-01-01}, booktitle = {Proc. of 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)}, pages = {2599--2611}, abstract = {The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. These systems are usually connected by a segmenter that splits the ASR output into, hopefully, semantically self-contained chunks to be fed into the MT system. This is especially challenging in the case of streaming ST, where latency requirements must also be taken into account. This work proposes novel segmentation models for streaming ST that incorporate not only textual, but also acoustic information to decide when the ASR output is split into a chunk. An extensive and thorough experimental setup is carried out on the Europarl-ST dataset to prove the contribution of acoustic information to the performance of the segmentation model in terms of BLEU score in a streaming ST scenario. Finally, comparative results with previous work also show the superiority of the segmentation models proposed in this work.}, keywords = {Segmentation, Speech Translation, streaming}, pubstate = {published}, tppubtype = {inproceedings} } The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. These systems are usually connected by a segmenter that splits the ASR output into, hopefully, semantically self-contained chunks to be fed into the MT system. This is especially challenging in the case of streaming ST, where latency requirements must also be taken into account. This work proposes novel segmentation models for streaming ST that incorporate not only textual, but also acoustic information to decide when the ASR output is split into a chunk. An extensive and thorough experimental setup is carried out on the Europarl-ST dataset to prove the contribution of acoustic information to the performance of the segmentation model in terms of BLEU score in a streaming ST scenario. Finally, comparative results with previous work also show the superiority of the segmentation models proposed in this work. |
Publications
Accessibility Automatic Speech Recognition Computer-assisted transcription Confidence measures Deep Neural Networks Docencia en Red Education language model adaptation Language Modeling Language Technologies Length modelling Log-linear models Machine Translation Massive Adaptation Models basats en seqüències de paraules Models log-lineals Multilingualism Neural Machine Translation Opencast Matterhorn Polimedia Sliding window Speaker adaptation Speech Recognition Speech Translation Statistical machine translation streaming text-to-speech transcripciones video lecture repositories Video Lectures
2021 |
Streaming cascade-based speech translation leveraged by a direct segmentation model Journal Article Neural Networks, 142 , pp. 303–315, 2021. |
2020 |
Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates Inproceedings Proc. of 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020), pp. 8229–8233, Barcelona (Spain), 2020. |
Direct Segmentation Models for Streaming Speech Translation Inproceedings Proc. of 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pp. 2599–2611, 2020. |