Publications

2023
Iranzo-Sánchez, Javier: Streaming Neural Speech Translation. PhD Thesis, Universitat Politècnica de València, 2023. Advisors: Alfons Juan Ciscar and Jorge Civera Saiz. doi: 10.4995/Thesis/10251/199170

Abstract: Thanks to significant advances in Deep Learning, Speech Translation (ST) has become a mature field that enables the use of ST technology in production-ready solutions. Due to the ever-increasing hours of audio-visual content produced each year, as well as growing awareness of the importance of media accessibility, ST is poised to become a key element in the production of entertainment and educational media. Although significant advances have been made in ST, most research has focused on the offline scenario, where the entire input audio is available in advance. In contrast, online ST remains an under-researched topic. A special case of online ST, streaming ST, translates an unbounded input stream in real time under strict latency constraints. This is a much more realistic problem that needs to be solved in order to apply ST to a variety of real-life tasks. This thesis focuses on researching and developing the key techniques necessary for a successful streaming ST solution. First, in order to enable ST system development and evaluation, a new multilingual ST dataset is collected, which significantly expands the number of hours available for ST. Then, a streaming-ready segmenter component is developed to segment the intermediate transcriptions of the proposed cascade solution, which consists of an Automatic Speech Recognition (ASR) system that transcribes the audio, followed by a Machine Translation (MT) system that translates the intermediate transcriptions into the desired language. Research has shown that segmentation quality plays a significant role in downstream MT performance, so the development of an effective streaming segmenter is a critical step in the streaming ST process. This segmenter is then integrated, and the components of the cascade are jointly optimized to achieve an appropriate quality-latency trade-off. Streaming ST has much stricter latency constraints than standard online ST, as the desired latency level must be maintained throughout the whole translation process. It is therefore crucial to measure this latency accurately, but the standard online ST metrics are not well suited to this task. As a consequence, new evaluation methods are proposed for streaming ST that ensure realistic yet interpretable results. Lastly, a novel method is presented for improving translation quality through the use of contextual information. Whereas standard online ST systems translate each audio in isolation, there is a wealth of contextual information available for improving streaming ST systems. Our approach introduces the concept of streaming history: the most recent information of the translation process is stored and then used by the model to improve translation quality.
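To make the streaming-history idea concrete, here is a minimal Python sketch under assumed interfaces: the `model.translate` call, its `target_prefix` parameter, and the `<SEP>` context separator are hypothetical illustrations of keeping a bounded buffer of recent source/target pairs as extra context, not the thesis' actual implementation.

```python
from collections import deque

class StreamingHistory:
    """Keep the most recent source/target sentence pairs as context.

    A bounded buffer: old entries are dropped as new ones arrive, so the
    context fed to the model never grows without limit.
    """

    def __init__(self, max_sentences: int = 3):
        self.buffer = deque(maxlen=max_sentences)

    def update(self, source: str, target: str) -> None:
        self.buffer.append((source, target))

    def context(self) -> tuple:
        """Return the concatenated recent source and target history."""
        sources, targets = zip(*self.buffer) if self.buffer else ((), ())
        return " ".join(sources), " ".join(targets)

def translate_with_history(model, history: StreamingHistory, chunk: str) -> str:
    # Prepend the stored history to the new chunk; the model is assumed
    # (hypothetically) to accept context and input separated by <SEP>,
    # plus a target-side prefix for the decoder.
    src_ctx, tgt_ctx = history.context()
    hypothesis = model.translate(src_ctx + " <SEP> " + chunk, target_prefix=tgt_ctx)
    history.update(chunk, hypothesis)
    return hypothesis
```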
2021
Iranzo-Sánchez, Javier; Jorge, Javier; Baquero-Arnal, Pau; Silvestre-Cerdà, Joan Albert; Giménez, Adrià; Civera, Jorge; Sanchis, Albert; Juan, Alfons: Streaming cascade-based speech translation leveraged by a direct segmentation model. Neural Networks, 142, pp. 303–315, 2021. doi: 10.1016/j.neunet.2021.05.013

Abstract: The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. Nowadays, state-of-the-art ST systems are built from deep neural networks conceived to work in an offline setup, in which the audio input to be translated is fully available in advance. However, a streaming setup defines a completely different picture, in which an unbounded audio input gradually becomes available and, at the same time, the translation needs to be generated under real-time constraints. In this work, we present a state-of-the-art streaming ST system in which the neural models integrated in the ASR and MT components are carefully adapted, in terms of their training and decoding procedures, to run under a streaming setup. In addition, a direct segmentation model that adapts the continuous ASR output to the capacity of simultaneous MT systems trained at the sentence level is introduced to guarantee low latency while preserving the translation quality of the complete ST system. The resulting ST system is thoroughly evaluated on the real-life streaming Europarl-ST benchmark to gauge the trade-off between quality and latency, both for each component individually and for the complete ST system.
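As a rough illustration of the cascade described above, the following Python sketch wires incremental ASR, a boundary-deciding segmenter, and sentence-level MT into one streaming loop. The `asr.step`, `segmenter.is_boundary`, and `mt.translate` interfaces are hypothetical placeholders, not the components of the published system.

```python
def streaming_cascade(audio_stream, asr, segmenter, mt):
    """Sketch of a cascade streaming ST loop: ASR transcribes incoming
    audio incrementally, a segmenter decides where to cut the unbounded
    transcription, and each closed chunk is sent to sentence-level MT.
    """
    pending = []  # words transcribed but not yet assigned to a chunk
    for frame in audio_stream:                   # unbounded input
        for word in asr.step(frame):             # incremental transcription
            pending.append(word)
            if segmenter.is_boundary(pending):   # split decision
                yield mt.translate(" ".join(pending))
                pending = []
    if pending:                                  # flush the tail at stream end
        yield mt.translate(" ".join(pending))
```

The key design point this sketch captures is that latency is governed by the segmenter: MT only starts once a chunk is closed, so a segmenter that waits too long inflates latency, while one that cuts too eagerly feeds the MT system fragments it was not trained on.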
2020
Iranzo-Sánchez, Javier; Silvestre-Cerdà, Joan Albert; Jorge, Javier; Roselló, Nahuel; Giménez, Adrià; Sanchis, Albert; Civera, Jorge; Juan, Alfons: Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates. In: Proc. of the 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020), pp. 8229–8233, Barcelona (Spain), 2020. doi: 10.1109/ICASSP40776.2020.9054626
Links: https://arxiv.org/abs/1911.03167 · https://paperswithcode.com/paper/europarl-st-a-multilingual-corpus-for-speech · https://www.mllp.upv.es/europarl-st/

Abstract: Current research into spoken language translation (SLT), or speech-to-text translation, is often hampered by the lack of specific data resources for this task, as currently available SLT datasets are restricted to a limited set of language pairs. In this paper we present Europarl-ST, a novel multilingual SLT corpus containing paired audio-text samples for SLT from and into 6 European languages, for a total of 30 different translation directions. This corpus has been compiled using the debates held in the European Parliament between 2008 and 2012. This paper describes the corpus creation process and presents a series of automatic speech recognition, machine translation and spoken language translation experiments that highlight the potential of this new resource. The corpus is released under a Creative Commons license and is freely accessible and downloadable.
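For intuition on the corpus layout: with 6 languages, every ordered source-target pair is one translation direction, giving 6 × 5 = 30 directions. A small Python check follows; the concrete language list is an assumption reflecting the initial Europarl-ST release and is shown for illustration only.

```python
from itertools import permutations

# Assumed set of the six Europarl-ST languages; every ordered pair
# is a translation direction, so there are 6 * 5 = 30 in total.
LANGS = ["en", "fr", "de", "it", "es", "pt"]

directions = list(permutations(LANGS, 2))
assert len(directions) == 30
print(directions[:3])  # [('en', 'fr'), ('en', 'de'), ('en', 'it')]
```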
Iranzo-Sánchez, Javier; Giménez Pastor, Adrià; Silvestre-Cerdà, Joan Albert; Baquero-Arnal, Pau; Civera Saiz, Jorge; Juan, Alfons: Direct Segmentation Models for Streaming Speech Translation. In: Proc. of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pp. 2599–2611, 2020. doi: 10.18653/v1/2020.emnlp-main.206

Abstract: The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. These systems are usually connected by a segmenter that splits the ASR output into, hopefully, semantically self-contained chunks to be fed into the MT system. This is especially challenging in the case of streaming ST, where latency requirements must also be taken into account. This work proposes novel segmentation models for streaming ST that incorporate not only textual but also acoustic information to decide when the ASR output is split into a chunk. An extensive and thorough experimental setup on the Europarl-ST dataset demonstrates the contribution of acoustic information to the performance of the segmentation model, in terms of BLEU score, in a streaming ST scenario. Finally, comparative results with previous work also show the superiority of the segmentation models proposed here.
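The following PyTorch sketch shows one plausible shape of such a segmenter: a sliding-window classifier over word embeddings concatenated with per-word acoustic features (e.g., word duration and the length of the following pause). The feature choices, window size, and layer dimensions are assumptions made for illustration, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class ChunkBoundaryClassifier(nn.Module):
    """Illustrative segmenter: at each word position, decide split /
    no-split from a window of text embeddings plus simple per-word
    acoustic features. All sizes below are assumed, not the paper's.
    """

    def __init__(self, vocab_size: int, emb_dim: int = 128,
                 window: int = 5, n_acoustic: int = 2, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        in_dim = window * (emb_dim + n_acoustic)
        self.ff = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),      # logits for {no-split, split}
        )

    def forward(self, word_ids, acoustic):
        # word_ids: (batch, window); acoustic: (batch, window, n_acoustic)
        text = self.embed(word_ids)                  # (B, W, E)
        feats = torch.cat([text, acoustic], dim=-1)  # (B, W, E + A)
        return self.ff(feats.flatten(1))             # (B, 2)
```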