2024 |
Garcés Díaz-Munío, Gonçal Automatic speech recognition and machine translation with deep neural networks for open educational resources, parliamentary contents and broadcast media PhD Thesis Universitat Politècnica de València, 2024, (advisers: Alfons Juan Ciscar and Jorge Civera Saiz). Tags: Automatic Speech Recognition, Broadcast Media, Deep Neural Networks, Machine Translation, Open Educational Resources, Parliamentary Contents @phdthesis{Garcés2024, title = {Automatic speech recognition and machine translation with deep neural networks for open educational resources, parliamentary contents and broadcast media}, author = {Garcés Díaz-Munío, Gonçal}, url = {https://www.upv.es/pls/oalu/sic_ted.mostrar_tesis?p_num_reg=12900 https://www.mllp.upv.es/phd-thesis-automatic-speech-recognition-and-machine-translation-with-deep-neural-networks-for-open-educational-resources-parliamentary-contents-and-broadcast-media-by-goncal-garces/}, year = {2024}, date = {2024-01-01}, school = {Universitat Politècnica de València}, abstract = {In the last decade, automatic speech recognition (ASR) and machine translation (MT) have improved enormously through the use of constantly evolving deep neural network (DNN) models. If at the beginning of the 2010s the then pre-DNN ASR and MT systems were ready to tackle with success some real-life applications such as offline video lecture transcription and translation, now in the 2020s much more challenging applications are within grasp, such as live broadcast media subtitling. At the same time in this period, media accessibility for everyone, including deaf and hard-of-hearing people, is being given more and more importance. ASR and MT, in their current state, are powerful tools to increase the coverage of accessibility measures such as subtitles, transcriptions and translations, also as a way of providing multilingual access to all types of content. 
In this PhD thesis, we present research results on automatic speech recognition and machine translation based on deep neural networks in three very active domains: open educational resources, parliamentary contents and broadcast media. Regarding open educational resources (OER), we first present work on the evaluation and post-editing of ASR and MT with intelligent interaction approaches, as carried out in the framework of EU project transLectures: Transcription and Translation of Video Lectures. The results obtained confirm that the intelligent interaction approach can make post-editing automatic transcriptions and translations even more cost-effective. Then, in the context of subsequent EU project X5gon, we present research on developing DNN-based neural MT systems, and making the most of larger MT corpora through automatic data filtering. This work resulted in a first-rank classification in an international evaluation campaign on MT, and we show how these new NMT systems improved the quality of multilingual subtitles in real OER scenarios. In the also growing domain of language technologies for parliamentary contents, we describe research on speech data curation techniques for streaming ASR in the context of European Parliament debates. This research resulted in the release of Europarl-ASR, a new, large speech corpus for streaming ASR system training and evaluation, as well as for the benchmarking of speech data curation techniques. Finally, we present work in a domain on the edge of the state of the art for ASR and MT: the live subtitling of broadcast media, in the context of the 2020–2023 R&D collaboration agreement between the Valencian public broadcaster À Punt and the Universitat Politècnica de València for real-time computer assisted subtitling of media contents. 
This research has resulted in the deployment of high-quality, low-latency, real-time streaming ASR systems for a less-spoken language (Catalan) and a widely spoken language (Spanish) in a real broadcast use case.}, note = {advisers: Alfons Juan Ciscar and Jorge Civera Saiz}, keywords = {Automatic Speech Recognition, Broadcast Media, Deep Neural Networks, Machine Translation, Open Educational Resources, Parliamentary Contents}, pubstate = {published}, tppubtype = {phdthesis} } |
Iranzo-Sánchez, Javier; Iranzo-Sánchez, Jorge; Giménez, Adrià; Civera, Jorge; Juan, Alfons Segmentation-Free Streaming Machine Translation Journal Article Transactions of the Association for Computational Linguistics, 12 , pp. 1104-1121, 2024, (also accepted for presentation at ACL 2024). Tags: segmentation-free, streaming machine translation @article{Juan2024, title = {Segmentation-Free Streaming Machine Translation}, author = {Javier Iranzo-Sánchez AND Jorge Iranzo-Sánchez AND Adrià Giménez AND Jorge Civera AND Alfons Juan}, url = {https://paperswithcode.com/paper/segmentation-free-streaming-machine https://github.com/jairsan/Segmentation-Free_Streaming_Machine_Translation https://arxiv.org/abs/2309.14823 https://2024.aclweb.org/program/tacl_papers/ https://www.mllp.upv.es/wp-content/uploads/2024/09/tacl_segfree_poster.pdf}, doi = {10.1162/tacl_a_00691}, year = {2024}, date = {2024-01-01}, journal = {Transactions of the Association for Computational Linguistics}, volume = {12}, pages = {1104-1121}, abstract = {Streaming Machine Translation (MT) is the task of translating an unbounded input text stream in real-time. The traditional cascade approach, which combines an Automatic Speech Recognition (ASR) and an MT system, relies on an intermediate segmentation step which splits the transcription stream into sentence-like units. However, the incorporation of a hard segmentation constrains the MT system and is a source of errors. This paper proposes a Segmentation-Free framework that enables the model to translate an unsegmented source stream by delaying the segmentation decision until the translation has been generated. 
Extensive experiments show how the proposed Segmentation-Free framework has better quality-latency trade-off than competing approaches that use an independent segmentation model.}, note = {also accepted for presentation at ACL 2024}, keywords = {segmentation-free, streaming machine translation}, pubstate = {published}, tppubtype = {article} }
|
2023 |
Iranzo Sánchez, Javier Streaming Neural Speech Translation PhD Thesis Universitat Politècnica de València, 2023, (Advisors: Alfons Juan Ciscar and Jorge Civera Saiz). Tags: Speech Translation, streaming speech translation @phdthesis{Sánchez2023, title = {Streaming Neural Speech Translation}, author = {Iranzo Sánchez, Javier}, doi = {10.4995/Thesis/10251/199170}, year = {2023}, date = {2023-09-29}, school = {Universitat Politècnica de València}, abstract = {Thanks to significant advances in Deep Learning, Speech Translation (ST) has become a mature field that enables the use of ST technology in production-ready solutions. Due to the ever-increasing hours of audio-visual content produced each year, as well as higher awareness of the importance of media accessibility, ST is poised to become a key element for the production of entertainment and educational media. Although significant advances have been made in ST, most research has focused on the offline scenario, where the entire input audio is available. In contrast, online ST remains an under-researched topic. A special case of online ST, streaming ST, translates an unbounded input stream in a real-time fashion under strict latency constraints. This is a much more realistic problem that needs to be solved in order to apply ST to a variety of real-life tasks. The focus of this thesis is on researching and developing key techniques necessary for a successful streaming ST solution. First, in order to enable ST system development and evaluation, a new multilingual ST dataset is collected, which significantly expands the amount of hours available for ST. Then, a streaming-ready segmenter component is developed to segment the intermediate transcriptions of our proposed cascade solution, which consists in an Automatic Speech Recognition (ASR) system that transcribes the audio, followed by a Machine Translation (MT) system that translates the intermediate transcriptions into the desired language. 
Research has shown that segmentation quality plays a significant role in downstream MT performance, so the development of an effective streaming segmenter is a critical step in the streaming ST process. This segmenter is then integrated and the components of the cascade are jointly optimized to achieve an appropriate quality-latency trade-off. Streaming ST has much more strict latency constraints than standard online ST, as the desired latency level must be maintained during the whole translation process. Therefore, it is crucial to be able to accurately measure this latency, but the standard online ST metrics are not well suited for this task. As a consequence, new evaluation methods are proposed for streaming ST evaluation, which ensure realistic, yet interpretable results. Lastly, a novel method is presented for improving translation quality through the use of contextual information. Whereas standard online ST systems translate audios in isolation, there is a wealth of contextual information available for improving streaming ST systems. Our approach introduces the concept of streaming history by storing the most recent information of the translation process, which is then used by the model in order to improve translation quality.}, note = {Advisors: Alfons Juan Ciscar and Jorge Civera Saiz}, keywords = {Speech Translation, streaming speech translation}, pubstate = {published}, tppubtype = {phdthesis} } |
Benstead, Kim; Brandl, Andreas; Brouwers, Ton; Civera, Jorge; Collen, Sarah; Csaba, Degi L; Munter, Johan De; Dewitte, Marieke; Diez de los Rios, Celia; Dodlek, Nikolina; Eriksen, Jesper G; Forget, Patrice; Gasparatto, Chiara; Geissler, Jan; Hall, Corinne; Juan, Alfons; Kalz, Marco; Kelly, Richard; Klis, Giorgos; Kulaksiz, Taibe; Lecoq, Carine; Marangoni, Francesca; McInally, Wendy; Oliver, Kathy; Popovics, Maria; Poulios, Christos; Price, Richard; Rollo, Irena; Romeo, Silvia; Steinbacher, Jana; Sulosaari, Virpi; O’Higgins, Niall An inter-specialty cancer training programme curriculum for Europe Journal Article European Journal of Surgical Oncology, 49 (9), pp. 106989, 2023. Tags: educational technologies, Neural Machine Translation @article{Benstead2023, title = {An inter-specialty cancer training programme curriculum for Europe}, author = {Kim Benstead AND Andreas Brandl AND Ton Brouwers AND Jorge Civera AND Sarah Collen AND Degi L. Csaba AND Johan De Munter AND Marieke Dewitte AND Diez de los Rios, Celia AND Nikolina Dodlek AND Jesper G. Eriksen AND Patrice Forget AND Chiara Gasparatto AND Jan Geissler AND Corinne Hall AND Alfons Juan AND Marco Kalz AND Richard Kelly AND Giorgos Klis AND Taibe Kulaksiz AND Carine Lecoq AND Francesca Marangoni AND Wendy McInally AND Kathy Oliver AND Maria Popovics AND Christos Poulios AND Richard Price AND Irena Rollo AND Silvia Romeo AND Jana Steinbacher AND Virpi Sulosaari AND Niall O’Higgins}, doi = {10.1016/j.ejso.2023.106989}, year = {2023}, date = {2023-07-28}, journal = {European Journal of Surgical Oncology}, volume = {49}, number = {9}, pages = {106989}, abstract = {INTRODUCTION: Multidisciplinary and multi-professional collaboration is vital in providing better outcomes for patients. The aim of the INTERACT-EUROPE Project (Wide Ranging Cooperation and Cutting Edge Innovation As A Response To Cancer Training Needs) was to develop an inter-specialty curriculum. 
A pilot project will enable a pioneer cohort to acquire a sample of the competencies needed. METHODS: A scoping review, qualitative and quantitative surveys were undertaken. The quantitative survey results are reported here. Respondents, including members of education boards, curriculum committees, trainee committees of European specialist societies and the ECO Patient Advisory Committee, were asked to score 127 proposed competencies on a 7-point Likert scale as to their value in achieving the aims of the curriculum. Results were discussed and competencies developed at two stakeholder meetings. A consultative document, shared with stakeholders and available online, requested views regarding the other components of the curriculum. RESULTS: Eleven competencies were revised, three omitted and three added. The competencies were organised according to the CanMEDS framework with 13 Entrustable Professional Activities, 23 competencies and 127 enabling competencies covering all roles in the framework. Recommendations regarding the infrastructure, organisational aspects, eligibility of trainees and training centres, programme contents, assessment and evaluation were developed using the replies to the consultative document. CONCLUSIONS: An Inter-specialty Cancer Training Programme Curriculum and a pilot programme with virtual and face-to-face components have been developed with the aim of improving the care of people affected by cancer.}, keywords = {educational technologies, Neural Machine Translation}, pubstate = {published}, tppubtype = {article} } |
Baquero Arnal, Pau Transformer models for Machine Translation and Streaming Automatic Speech Recognition PhD Thesis Universitat Politècnica de València, 2023, (Advisors: Alfons Juan Ciscar and Hermann Ney). Tags: Automatic Speech Recognition, Neural Machine Translation, Transformer, Transformer Language Model @phdthesis{Arnal2023, title = {Transformer models for Machine Translation and Streaming Automatic Speech Recognition}, author = {Baquero Arnal, Pau}, url = {https://doi.org/10.4995/Thesis/10251/193680 https://www.upv.es/pls/oalu/sic_ted.mostrar_tesis?p_num_reg=12917}, year = {2023}, date = {2023-01-01}, school = {Universitat Politècnica de València}, abstract = {Natural language processing (NLP) is a set of fundamental computing problems with immense applicability, as language is the natural communication vehicle for people. NLP, along with many other computer technologies, has been revolutionized in recent years by the impact of deep learning. This thesis is centered around two keystone problems for NLP: machine translation (MT) and automatic speech recognition (ASR); and a common deep neural architecture, the Transformer, that is leveraged to improve the technical solutions for some MT and ASR applications. ASR and MT can be utilized to produce cost-effective, high-quality multilingual texts for a wide array of media. Particular applications pursued in this thesis are that of news translation or that of automatic live captioning of television broadcasts. ASR and MT can also be combined with each other, for instance generating automatic translated subtitles from audio, or augmented with other NLP solutions: text summarization to produce a summary of a speech, or speech synthesis to create an automatic translated dubbing, for instance. 
These other applications fall out of the scope of this thesis, but can profit from the contributions that it contains, as they help to improve the performance of the automatic systems on which they depend. This thesis contains an application of the Transformer architecture to MT as it was originally conceived, achieving state-of-the-art results in similar language translation. In successive chapters, this thesis covers the adaptation of the Transformer as a language model for streaming hybrid ASR systems. Afterwards, it describes how we applied the developed technology for a specific use case in television captioning by participating in a competitive challenge and achieving the first position by a large margin. We also show that the gains came mostly from the improvement in technology capabilities over two years including that of the Transformer language model adapted for streaming, and the data component was minor.}, note = {Advisors: Alfons Juan Ciscar and Hermann Ney}, keywords = {Automatic Speech Recognition, Neural Machine Translation, Transformer, Transformer Language Model}, pubstate = {published}, tppubtype = {phdthesis} } |
2022 |
Jorge Cano, Javier Streaming Automatic Speech Recognition with Hybrid Architectures and Deep Neural Network Models PhD Thesis Universitat Politècnica de València, 2022, (Advisors: Alfons Juan Ciscar and Jorge Civera Saiz). Tags: Automatic Speech Recognition, Deep Neural Networks, hybrid ASR, streaming @phdthesis{Cano2022, title = {Streaming Automatic Speech Recognition with Hybrid Architectures and Deep Neural Network Models}, author = {Jorge Cano, Javier}, url = {https://doi.org/10.4995/Thesis/10251/191001}, year = {2022}, date = {2022-11-21}, school = {Universitat Politècnica de València}, note = {Advisors: Alfons Juan Ciscar and Jorge Civera Saiz}, keywords = {Automatic Speech Recognition, Deep Neural Networks, hybrid ASR, streaming}, pubstate = {published}, tppubtype = {phdthesis} } |
Pérez González de Martos, Alejandro Deep Neural Networks for Automatic Speech-To-Speech Translation of Open Educational Resources PhD Thesis Universitat Politècnica de València, 2022, (Advisors: Alfons Juan Ciscar and Alberto Sanchis Navarro). Tags: automatic dubbing, cross-lingual voice cloning, educational resources, simultaneous machine interpretation, text-to-speech @phdthesis{aperez2022, title = {Deep Neural Networks for Automatic Speech-To-Speech Translation of Open Educational Resources}, author = {Pérez González de Martos, Alejandro}, url = {http://hdl.handle.net/10251/184019}, doi = {10.4995/Thesis/10251/184019}, year = {2022}, date = {2022-06-15}, school = {Universitat Politècnica de València}, note = {Advisors: Alfons Juan Ciscar and Alberto Sanchis Navarro}, keywords = {automatic dubbing, cross-lingual voice cloning, educational resources, simultaneous machine interpretation, text-to-speech}, pubstate = {published}, tppubtype = {phdthesis} } |
Pérez González de Martos, Alejandro; Giménez Pastor, Adrià; Jorge Cano, Javier; Iranzo-Sánchez, Javier; Silvestre-Cerdà, Joan Albert; Garcés Díaz-Munío, Gonçal V; Baquero-Arnal, Pau; Sanchis Navarro, Alberto; Civera Sáiz, Jorge; Juan Ciscar, Alfons; Turró Ribalta, Carlos Doblaje automático de vídeo-charlas educativas en UPV[Media] Inproceedings Proc. of VIII Congrés d'Innovació Educativa i Docència en Xarxa (IN-RED 2022), pp. 557–570, València (Spain), 2022. Tags: automatic dubbing, Automatic Speech Recognition, Machine Translation, OER, text-to-speech @inproceedings{deMartos2022, title = {Doblaje automático de vídeo-charlas educativas en UPV[Media]}, author = {Pérez González de Martos, Alejandro AND Giménez Pastor, Adrià AND Jorge Cano, Javier AND Javier Iranzo-Sánchez AND Joan Albert Silvestre-Cerdà AND Garcés Díaz-Munío, Gonçal V. AND Pau Baquero-Arnal AND Sanchis Navarro, Alberto AND Civera Sáiz, Jorge AND Juan Ciscar, Alfons AND Turró Ribalta, Carlos}, doi = {10.4995/INRED2022.2022.15844}, year = {2022}, date = {2022-01-01}, booktitle = {Proc. of VIII Congrés d'Innovació Educativa i Docència en Xarxa (IN-RED 2022)}, pages = {557--570}, address = {València (Spain)}, abstract = {More and more universities are banking on the production of digital content to support online or blended learning in higher education. Over the last years, the MLLP research group has been working closely with the UPV's ASIC media services in order to enrich educational multimedia resources through the application of natural language processing technologies including automatic speech recognition, machine translation and text-to-speech. 
In this work, we present the steps that are being followed for the comprehensive translation of these materials, specifically through (semi-)automatic dubbing by making use of state-of-the-art speaker-adaptive text-to-speech technologies.}, keywords = {automatic dubbing, Automatic Speech Recognition, Machine Translation, OER, text-to-speech}, pubstate = {published}, tppubtype = {inproceedings} } |
Iranzo-Sánchez, Javier; Jorge, Javier; Pérez-González-de-Martos, Alejandro; Giménez, Adrià; Garcés Díaz-Munío, Gonçal V; Baquero-Arnal, Pau; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons MLLP-VRAIN UPV systems for the IWSLT 2022 Simultaneous Speech Translation and Speech-to-Speech Translation tasks Inproceedings Proc. of 19th Intl. Conf. on Spoken Language Translation (IWSLT 2022), pp. 255–264, Dublin (Ireland), 2022. Tags: Simultaneous Speech Translation, speech-to-speech translation @inproceedings{Iranzo-Sánchez2022b, title = {MLLP-VRAIN UPV systems for the IWSLT 2022 Simultaneous Speech Translation and Speech-to-Speech Translation tasks}, author = {Javier Iranzo-Sánchez and Javier Jorge and Alejandro Pérez-González-de-Martos and Adrià Giménez and Garcés Díaz-Munío, Gonçal V. and Pau Baquero-Arnal and Joan Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan}, doi = {10.18653/v1/2022.iwslt-1.22}, year = {2022}, date = {2022-01-01}, booktitle = {Proc. of 19th Intl. Conf. on Spoken Language Translation (IWSLT 2022)}, pages = {255--264}, address = {Dublin (Ireland)}, abstract = {This work describes the participation of the MLLP-VRAIN research group in the two shared tasks of the IWSLT 2022 conference: Simultaneous Speech Translation and Speech-to-Speech Translation. We present our streaming-ready ASR, MT and TTS systems for Speech Translation and Synthesis from English into German. Our submission combines these systems by means of a cascade approach paying special attention to data preparation and decoding for streaming inference.}, keywords = {Simultaneous Speech Translation, speech-to-speech translation}, pubstate = {published}, tppubtype = {inproceedings} } |
Iranzo-Sánchez, Javier ; Civera, Jorge ; Juan, Alfons From Simultaneous to Streaming Machine Translation by Leveraging Streaming History Inproceedings Proc. 60th Annual Meeting of the Association for Computational Linguistics Vol. 1: Long Papers (ACL 2022), pp. 6972–6985, Dublin (Ireland), 2022. Abstract | Links | BibTeX | Tags: simultaneous machine translation, streaming machine translation @inproceedings{Iranzo-Sánchez2022, title = {From Simultaneous to Streaming Machine Translation by Leveraging Streaming History}, author = {Iranzo-Sánchez, Javier and Civera, Jorge and Juan, Alfons}, url = {https://arxiv.org/abs/2203.02459 https://github.com/jairsan/Speech_Translation_Segmenter}, doi = {10.18653/v1/2022.acl-long.480}, year = {2022}, date = {2022-01-01}, booktitle = {Proc. 60th Annual Meeting of the Association for Computational Linguistics Vol. 1: Long Papers (ACL 2022)}, pages = {6972--6985}, address = {Dublin (Ireland)}, abstract = {Simultaneous Machine Translation is the task of incrementally translating an input sentence before it is fully available. Currently, simultaneous translation is carried out by translating each sentence independently of the previously translated text. More generally, Streaming MT can be understood as an extension of Simultaneous MT to the incremental translation of a continuous input text stream. In this work, a state-of-the-art simultaneous sentence-level MT system is extended to the streaming setup by leveraging the streaming history. Extensive empirical results are reported on IWSLT Translation Tasks, showing that leveraging the streaming history leads to significant quality gains. In particular, the proposed system proves to compare favorably to the best performing systems.}, keywords = {simultaneous machine translation, streaming machine translation}, pubstate = {published}, tppubtype = {inproceedings} } |
Baquero-Arnal, Pau; Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Pérez-González-de-Martos, Alejandro; Garcés Díaz-Munío, Gonçal V; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge: Extension Journal Article Applied Sciences, 12 (2), pp. 804, 2022. Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, Natural Language Processing, streaming @article{applsci1505192, title = {MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge: Extension}, author = {Pau Baquero-Arnal and Javier Jorge and Adrià Giménez and Javier Iranzo-Sánchez and Alejandro Pérez-González-de-Martos and Garcés Díaz-Munío, Gonçal V. and Joan Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan}, doi = {10.3390/app12020804}, year = {2022}, date = {2022-01-01}, journal = {Applied Sciences}, volume = {12}, number = {2}, pages = {804}, abstract = {This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzín-RTVE 2020 Speech-to-Text Challenge, and includes an extension of the work consisting in building and evaluating equivalent systems under the closed data conditions from the 2018 challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid ASR system using streaming one-pass decoding with a context window of 1.5 seconds. This system achieved 16.0% WER on the test-2020 set. We also submitted three contrastive systems. From these, we highlight the system c2-streaming_600ms_t which, following a similar configuration as the primary system with a smaller context window of 0.6 s, scored 16.9% WER points on the same test set, with a measured empirical latency of 0.81±0.09 seconds (mean±stdev). That is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative. 
As an extension, the equivalent closed-condition systems obtained 23.3% WER and 23.5% WER respectively. When evaluated with an unconstrained language model, we obtained 19.9% WER and 20.4% WER; i.e., not far behind the top-performing systems with only 5% of the full acoustic data and with the extra ability of being streaming-capable. Indeed, all of these streaming systems could be put into production environments for automatic captioning of live media streams.}, keywords = {Automatic Speech Recognition, Natural Language Processing, streaming}, pubstate = {published}, tppubtype = {article} } |
2021 |
Jorge, Javier ; Giménez, Adrià ; Silvestre-Cerdà, Joan Albert ; Civera, Jorge ; Sanchis, Albert ; Juan, Alfons Live Streaming Speech Recognition Using Deep Bidirectional LSTM Acoustic Models and Interpolated Language Models Journal Article IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30 , pp. 148–161, 2021. Abstract | Links | BibTeX | Tags: acoustic modelling, Automatic Speech Recognition, decoding, language modelling, neural networks, streaming @article{Jorge2021b, title = {Live Streaming Speech Recognition Using Deep Bidirectional LSTM Acoustic Models and Interpolated Language Models}, author = {Jorge, Javier and Giménez, Adrià and Silvestre-Cerdà, Joan Albert and Civera, Jorge and Sanchis, Albert and Juan, Alfons}, doi = {10.1109/TASLP.2021.3133216}, year = {2021}, date = {2021-11-23}, journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing}, volume = {30}, pages = {148--161}, abstract = {Although Long-Short Term Memory (LSTM) networks and deep Transformers are now extensively used in offline ASR, it is unclear how best offline systems can be adapted to work with them under the streaming setup. After gaining considerable experience in this regard in recent years, in this paper we show how an optimized, low-latency streaming decoder can be built in which bidirectional LSTM acoustic models, together with general interpolated language models, can be nicely integrated with minimal performance degradation. In brief, our streaming decoder consists of a one-pass, real-time search engine relying on a limited-duration window sliding over time and a number of ad hoc acoustic and language model pruning techniques. Extensive empirical assessment is provided on truly streaming tasks derived from the well-known LibriSpeech and TED talks datasets, as well as from TV shows from a large Spanish broadcasting station.}, keywords = {acoustic modelling, Automatic Speech Recognition, decoding, language modelling, neural networks, streaming}, pubstate = {published}, tppubtype = {article} } |
Pérez, Alejandro; Garcés Díaz-Munío, Gonçal ; Giménez, Adrià; Silvestre-Cerdà, Joan Albert ; Sanchis, Albert; Civera, Jorge; Jiménez, Manuel; Turró, Carlos; Juan, Alfons Towards cross-lingual voice cloning in higher education Journal Article Engineering Applications of Artificial Intelligence, 105 , pp. 104413, 2021. Abstract | Links | BibTeX | Tags: cross-lingual voice conversion, educational resources, multilinguality, OER, text-to-speech @article{Pérez2021, title = {Towards cross-lingual voice cloning in higher education}, author = {Alejandro Pérez and Garcés Díaz-Munío, Gonçal and Adrià Giménez and Silvestre-Cerdà, Joan Albert and Albert Sanchis and Jorge Civera and Manuel Jiménez and Carlos Turró and Alfons Juan}, url = {https://doi.org/10.1016/j.engappai.2021.104413}, year = {2021}, date = {2021-10-01}, journal = {Engineering Applications of Artificial Intelligence}, volume = {105}, pages = {104413}, abstract = {The rapid progress of modern AI tools for automatic speech recognition and machine translation is leading to a progressive cost reduction to produce publishable subtitles for educational videos in multiple languages. Similarly, text-to-speech technology is experiencing large improvements in terms of quality, flexibility and capabilities. In particular, state-of-the-art systems are now capable of seamlessly dealing with multiple languages and speakers in an integrated manner, thus enabling lecturer's voice cloning in languages she/he might not even speak. This work is to report the experience gained on using such systems at the Universitat Politècnica de València (UPV), mainly as a guidance for other educational organizations willing to conduct similar studies. It builds on previous work on the UPV's main repository of educational videos, MediaUPV, to produce multilingual subtitles at scale and low cost. Here, a detailed account is given on how this work has been extended to also allow for massive machine dubbing of MediaUPV. 
This includes collecting 59 hours of clean speech data from UPV’s academic staff, and extending our production pipeline of subtitles with a state-of-the-art multilingual and multi-speaker text-to-speech system trained from the collected data. Our main result comes from an extensive, subjective evaluation of this system by lecturers contributing to data collection. In brief, it is shown that text-to-speech technology is not only mature enough for its application to MediaUPV, but also needed as soon as possible by students to improve its accessibility and bridge language barriers.}, keywords = {cross-lingual voice conversion, educational resources, multilinguality, OER, text-to-speech}, pubstate = {published}, tppubtype = {article} } |
Jorge, Javier; Giménez, Adrià; Baquero-Arnal, Pau; Iranzo-Sánchez, Javier; Pérez-González-de-Martos, Alejandro; Garcés Díaz-Munío, Gonçal V; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge Inproceedings Proc. of IberSPEECH 2021, pp. 118–122, Valladolid (Spain), 2021. Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, Natural Language Processing, streaming @inproceedings{Jorge2021, title = {MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge}, author = {Javier Jorge and Adrià Giménez and Pau Baquero-Arnal and Javier Iranzo-Sánchez and Alejandro Pérez-González-de-Martos and Garcés Díaz-Munío, Gonçal V. and Joan Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan}, doi = {10.21437/IberSPEECH.2021-25}, year = {2021}, date = {2021-03-24}, booktitle = {Proc. of IberSPEECH 2021}, pages = {118--122}, address = {Valladolid (Spain)}, abstract = {1st place in IberSpeech-RTVE 2020 TV Speech-to-Text Challenge. This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzin-RTVE 2020 Speech-to-Text Challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid BLSTM-HMM ASR system using streaming one-pass decoding with a context window of 1.5 seconds and a linear combination of an n-gram, an LSTM, and a Transformer language model (LM). The acoustic model was trained on nearly 4,000 hours of speech data from different sources, using the MLLP's transLectures-UPV toolkit (TLK) and TensorFlow; whilst LMs were trained using SRILM (n-gram), CUED-RNNLM (LSTM) and Fairseq (Transformer), with up to 102G tokens. This system achieved 11.6% and 16.0% WER on the test-2018 and test-2020 sets, respectively. As it is streaming-enabled, it could be put into production environments for automatic captioning of live media streams, with a theoretical delay of 1.5 seconds. Along with the primary system, we also submitted three contrastive systems. From these, we highlight the system c2-streaming_600ms_t that, following the same configuration as the primary one, but using a smaller context window of 0.6 seconds and a Transformer LM, scored 12.3% and 16.9% WER points respectively on the same test sets, with a measured empirical latency of 0.81±0.09 seconds (mean±stdev). That is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative.}, keywords = {Automatic Speech Recognition, Natural Language Processing, streaming}, pubstate = {published}, tppubtype = {inproceedings} } |
Garcés Díaz-Munío, Gonçal V; Silvestre-Cerdà, Joan Albert ; Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Baquero-Arnal, Pau; Roselló, Nahuel; Pérez-González-de-Martos, Alejandro; Civera, Jorge; Sanchis, Albert; Juan, Alfons Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization Inproceedings Proc. Interspeech 2021, pp. 3695–3699, Brno (Czech Republic), 2021. Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, speech corpus, speech data filtering, speech data verbatimization @inproceedings{Garcés2021, title = {Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization}, author = {Garcés Díaz-Munío, Gonçal V. and Silvestre-Cerdà, Joan Albert and Javier Jorge and Adrià Giménez and Javier Iranzo-Sánchez and Pau Baquero-Arnal and Nahuel Roselló and Alejandro Pérez-González-de-Martos and Jorge Civera and Albert Sanchis and Alfons Juan}, url = {https://www.mllp.upv.es/wp-content/uploads/2021/09/europarl-asr-presentation-extended.pdf https://www.youtube.com/watch?v=Tc0gNSDdnQg&list=PLlePn-Yanvnc_LRhgmmaNmH12Bwm6BRsZ https://paperswithcode.com/paper/europarl-asr-a-large-corpus-of-parliamentary https://github.com/mllpresearch/Europarl-ASR}, doi = {10.21437/Interspeech.2021-1905}, year = {2021}, date = {2021-01-01}, booktitle = {Proc. Interspeech 2021}, journal = {Proc. Interspeech 2021}, pages = {3695--3699}, address = {Brno (Czech Republic)}, abstract = {We introduce Europarl-ASR, a large speech and text corpus of parliamentary debates including 1300 hours of transcribed speeches and 70 million tokens of text in English extracted from European Parliament sessions. The training set is labelled with the Parliament’s non-fully-verbatim official transcripts, time-aligned. As verbatimness is critical for acoustic model training, we also provide automatically noise-filtered and automatically verbatimized transcripts of all speeches based on speech data filtering and verbatimization techniques. Additionally, 18 hours of transcribed speeches were manually verbatimized to build reliable speaker-dependent and speaker-independent development/test sets for streaming ASR benchmarking. The availability of manual non-verbatim and verbatim transcripts for dev/test speeches makes this corpus useful for the assessment of automatic filtering and verbatimization techniques. This paper describes the corpus and its creation, and provides off-line and streaming ASR baselines for both the speaker-dependent and speaker-independent tasks using the three training transcription sets. The corpus is publicly released under an open licence.}, keywords = {Automatic Speech Recognition, speech corpus, speech data filtering, speech data verbatimization}, pubstate = {published}, tppubtype = {inproceedings} } |
Iranzo-Sánchez, Javier; Jorge, Javier; Baquero-Arnal, Pau; Silvestre-Cerdà, Joan Albert ; Giménez, Adrià; Civera, Jorge; Sanchis, Albert; Juan, Alfons Streaming cascade-based speech translation leveraged by a direct segmentation model Journal Article Neural Networks, 142 , pp. 303–315, 2021. Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, Cascade System, Deep Neural Networks, Hybrid System, Machine Translation, Segmentation Model, Speech Translation, streaming @article{Iranzo-Sánchez2021, title = {Streaming cascade-based speech translation leveraged by a direct segmentation model}, author = {Javier Iranzo-Sánchez and Javier Jorge and Pau Baquero-Arnal and Silvestre-Cerdà, Joan Albert and Adrià Giménez and Jorge Civera and Albert Sanchis and Alfons Juan}, doi = {10.1016/j.neunet.2021.05.013}, year = {2021}, date = {2021-01-01}, journal = {Neural Networks}, volume = {142}, pages = {303--315}, abstract = {The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. Nowadays, state-of-the-art ST systems are populated with deep neural networks that are conceived to work in an offline setup in which the audio input to be translated is fully available in advance. However, a streaming setup defines a completely different picture, in which an unbounded audio input gradually becomes available and at the same time the translation needs to be generated under real-time constraints. In this work, we present a state-of-the-art streaming ST system in which neural-based models integrated in the ASR and MT components are carefully adapted in terms of their training and decoding procedures in order to run under a streaming setup. 
In addition, a direct segmentation model that adapts the continuous ASR output to the capacity of simultaneous MT systems trained at the sentence level is introduced to guarantee low latency while preserving the translation quality of the complete ST system. The resulting ST system is thoroughly evaluated on the real-life streaming Europarl-ST benchmark to gauge the trade-off between quality and latency for each component individually as well as for the complete ST system.}, keywords = {Automatic Speech Recognition, Cascade System, Deep Neural Networks, Hybrid System, Machine Translation, Segmentation Model, Speech Translation, streaming}, pubstate = {published}, tppubtype = {article} } |
Juan-Albarracín, Javier; Fuster-Garcia, Elies; Juan, Alfons; García-Gómez, Juan M Non-local spatially varying finite mixture models for image segmentation Journal Article Statistics and Computing, 31 (3), 2021. Abstract | Links | BibTeX | Tags: Non-local means, Spatially varying finite mixture models, Unsupervised learning @article{Juan-Albarracín2021, title = {Non-local spatially varying finite mixture models for image segmentation}, author = {Javier Juan-Albarracín and Elies Fuster-Garcia and Alfons Juan and Juan M. García-Gómez}, url = {http://hdl.handle.net/10251/183895}, doi = {10.1007/s11222-020-09988-w}, year = {2021}, date = {2021-01-01}, journal = {Statistics and Computing}, volume = {31}, number = {3}, abstract = {In this work, we propose a new Bayesian model for unsupervised image segmentation based on a combination of the spatially varying finite mixture models (SVFMMs) and the non-local means (NLM) framework. The probabilistic NLM weighting function is successfully integrated into a varying Gauss–Markov random field, yielding a prior density that adaptively imposes a local regularization to simultaneously preserve edges and enforce smooth constraints in homogeneous regions of the image. Two versions of our model are proposed: a pixel-based model and a patch-based model, depending on the design of the probabilistic NLM weighting function. Contrary to previous methods proposed in the literature, our approximation does not introduce new parameters to be estimated into the model, because the NLM weighting function is completely known once the neighborhood of a pixel is fixed. The proposed model can be estimated in closed-form solution via a maximum a posteriori (MAP) estimation in an expectation–maximization scheme. We have compared our model with previously proposed SVFMMs using two public datasets: the Berkeley Segmentation dataset and the BRATS 2013 dataset. 
The proposed model compares favorably with previous approaches in the literature, achieving better results in terms of Rand Index and Dice metrics in our experiments.}, keywords = {Non-local means, Spatially varying finite mixture models, Unsupervised learning}, pubstate = {published}, tppubtype = {article} } |
Pérez-González-de-Martos, Alejandro; Iranzo-Sánchez, Javier; Giménez Pastor, Adrià ; Jorge, Javier; Silvestre-Cerdà, Joan-Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons Towards simultaneous machine interpretation Inproceedings Proc. Interspeech 2021, pp. 2277–2281, Brno (Czech Republic), 2021. Abstract | Links | BibTeX | Tags: cross-lingual voice cloning, incremental text-to-speech, simultaneous machine interpretation, speech-to-speech translation @inproceedings{Pérez-González-de-Martos2021, title = {Towards simultaneous machine interpretation}, author = {Alejandro Pérez-González-de-Martos and Javier Iranzo-Sánchez and Giménez Pastor, Adrià and Javier Jorge and Joan-Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan}, doi = {10.21437/Interspeech.2021-201}, year = {2021}, date = {2021-01-01}, booktitle = {Proc. Interspeech 2021}, journal = {Proc. Interspeech 2021}, pages = {2277--2281}, address = {Brno (Czech Republic)}, abstract = {Automatic speech-to-speech translation (S2S) is one of the most challenging speech and language processing tasks, especially when considering its application to real-time settings. Recent advances in streaming Automatic Speech Recognition (ASR), simultaneous Machine Translation (MT) and incremental neural Text-To-Speech (TTS) make it possible to develop real-time cascade S2S systems with greatly improved accuracy. On the way to simultaneous machine interpretation, a state-of-the-art cascade streaming S2S system is described and empirically assessed in the simultaneous interpretation of European Parliament debates. 
We pay particular attention to the TTS component, particularly in terms of speech naturalness under a variety of response-time settings, as well as in terms of speaker similarity for its cross-lingual voice cloning capabilities.}, keywords = {cross-lingual voice cloning, incremental text-to-speech, simultaneous machine interpretation, speech-to-speech translation}, pubstate = {published}, tppubtype = {inproceedings} } |
Iranzo-Sánchez, Javier; Civera, Jorge; Juan, Alfons Stream-level Latency Evaluation for Simultaneous Machine Translation Inproceedings Findings of the ACL: EMNLP 2021, pp. 664–670, Punta Cana (Dominican Republic), 2021. Abstract | Links | BibTeX | Tags: latency, simultaneous machine translation, stream-level evaluation, streaming @inproceedings{Iranzo-Sánchez2021b, title = {Stream-level Latency Evaluation for Simultaneous Machine Translation}, author = {Javier Iranzo-Sánchez and Jorge Civera and Alfons Juan}, url = {https://arxiv.org/abs/2104.08817 https://github.com/jairsan/Stream-level_Latency_Evaluation_for_Simultaneous_Machine_Translation}, doi = {10.18653/v1/2021.findings-emnlp.58}, year = {2021}, date = {2021-01-01}, booktitle = {Findings of the ACL: EMNLP 2021}, pages = {664--670}, address = {Punta Cana (Dominican Republic)}, abstract = {Simultaneous machine translation has recently gained traction thanks to significant quality improvements and the advent of streaming applications. Simultaneous translation systems need to find a trade-off between translation quality and response time, and to this end multiple latency measures have been proposed. However, latency evaluations for simultaneous translation are estimated at the sentence level, not taking into account the sequential nature of a streaming scenario. Indeed, these sentence-level latency measures are not well suited for continuous stream translation, resulting in figures that are not coherent with the simultaneous translation policy of the system being assessed. 
This work proposes a stream-level adaptation of the current latency measures based on a re-segmentation approach applied to the output translation, which is successfully evaluated under streaming conditions for a reference IWSLT task.}, keywords = {latency, simultaneous machine translation, stream-level evaluation, streaming}, pubstate = {published}, tppubtype = {inproceedings} } |
Pérez-González-de-Martos, Alejandro; Sanchis, Albert; Juan, Alfons VRAIN-UPV MLLP's system for the Blizzard Challenge 2021 Inproceedings Proc. of Blizzard Challenge 2021, 2021. Abstract | Links | BibTeX | Tags: Blizzard Challenge, HiFi-GAN, text-to-speech @inproceedings{Pérez-González-de-Martos2021b, title = {VRAIN-UPV MLLP's system for the Blizzard Challenge 2021}, author = {Alejandro Pérez-González-de-Martos and Albert Sanchis and Alfons Juan}, url = {http://hdl.handle.net/10251/192554 https://arxiv.org/abs/2110.15792 http://www.festvox.org/blizzard/blizzard2021.html}, year = {2021}, date = {2021-01-01}, booktitle = {Proc. of Blizzard Challenge 2021}, abstract = {This paper presents the VRAIN-UPV MLLP’s speech synthesis system for the SH1 task of the Blizzard Challenge 2021. The SH1 task consisted of building a Spanish text-to-speech system trained on (but not limited to) the corpus released by the Blizzard Challenge 2021 organization. It included 5 hours of studio-quality recordings from a native Spanish female speaker. In our case, this dataset was solely used to build a two-stage neural text-to-speech pipeline composed of a non-autoregressive acoustic model with explicit duration modeling and a HiFi-GAN neural vocoder. Our team is identified as J in the evaluation results. Our system obtained very good results in the subjective evaluation tests. Only one system among the other 11 participants achieved better naturalness than ours. Concretely, it achieved a naturalness MOS of 3.61 compared to 4.21 for real samples.}, keywords = {Blizzard Challenge, HiFi-GAN, text-to-speech}, pubstate = {published}, tppubtype = {inproceedings} } |
2020 |
Iranzo-Sánchez, Javier; Giménez Pastor, Adrià ; Silvestre-Cerdà, Joan Albert; Baquero-Arnal, Pau; Saiz, Jorge Civera; Juan, Alfons Direct Segmentation Models for Streaming Speech Translation Inproceedings Proc. of 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pp. 2599–2611, 2020. Abstract | Links | BibTeX | Tags: Segmentation, Speech Translation, streaming @inproceedings{Iranzo-Sánchez2020, title = {Direct Segmentation Models for Streaming Speech Translation}, author = {Javier Iranzo-Sánchez and Giménez Pastor, Adrià and Joan Albert Silvestre-Cerdà and Pau Baquero-Arnal and Jorge Civera Saiz and Alfons Juan}, url = {http://dx.doi.org/10.18653/v1/2020.emnlp-main.206}, year = {2020}, date = {2020-01-01}, booktitle = {Proc. of 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)}, pages = {2599--2611}, abstract = {The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. These systems are usually connected by a segmenter that splits the ASR output into, hopefully, semantically self-contained chunks to be fed into the MT system. This is especially challenging in the case of streaming ST, where latency requirements must also be taken into account. This work proposes novel segmentation models for streaming ST that incorporate not only textual, but also acoustic information to decide when the ASR output is split into a chunk. An extensive and thorough experimental setup is carried out on the Europarl-ST dataset to prove the contribution of acoustic information to the performance of the segmentation model in terms of BLEU score in a streaming ST scenario. 
Finally, comparative results with previous work also show the superiority of the segmentation models proposed in this work.}, keywords = {Segmentation, Speech Translation, streaming}, pubstate = {published}, tppubtype = {inproceedings} } |
Baquero-Arnal, Pau; Jorge, Javier; Giménez, Adrià; Silvestre-Cerdà, Joan Albert; Iranzo-Sánchez, Javier; Sanchis, Albert; Civera, Jorge; Juan, Alfons Improved Hybrid Streaming ASR with Transformer Language Models Inproceedings Proc. of 21st Annual Conf. of the Intl. Speech Communication Association (InterSpeech 2020), pp. 2127–2131, Shanghai (China), 2020. Abstract | Links | BibTeX | Tags: hybrid ASR, language models, streaming, Transformer @inproceedings{Baquero-Arnal2020, title = {Improved Hybrid Streaming ASR with Transformer Language Models}, author = {Baquero-Arnal, Pau and Jorge, Javier and Giménez, Adrià and Silvestre-Cerdà, Joan Albert and Iranzo-Sánchez, Javier and Sanchis, Albert and Civera, Jorge and Juan, Alfons}, url = {http://dx.doi.org/10.21437/Interspeech.2020-2770}, year = {2020}, date = {2020-01-01}, booktitle = {Proc. of 21st Annual Conf. of the Intl. Speech Communication Association (InterSpeech 2020)}, pages = {2127--2131}, address = {Shanghai (China)}, abstract = {Streaming ASR is gaining momentum due to its wide applicability, though it is still unclear how best to come close to the accuracy of state-of-the-art off-line ASR systems when the output must come within a short delay after the incoming audio stream. Following our previous work on streaming one-pass decoding with hybrid ASR systems and LSTM language models, in this work we report further improvements by replacing LSTMs with Transformer models. First, two key ideas are discussed so as to run these models fast during inference. Then, empirical results on LibriSpeech and TED-LIUM are provided showing that Transformer language models lead to improved recognition rates on both tasks. ASR systems obtained in this work can be seamlessly transferred to a streaming setup with minimal quality losses. 
Indeed, to the best of our knowledge, no better results have been reported on these tasks when assessed under a streaming setup.}, keywords = {hybrid ASR, language models, streaming, Transformer}, pubstate = {published}, tppubtype = {inproceedings} } |
Iranzo-Sánchez, Javier; Silvestre-Cerdà, Joan Albert; Jorge, Javier; Roselló, Nahuel; Giménez, Adrià; Sanchis, Albert; Civera, Jorge; Juan, Alfons Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates Inproceedings Proc. of 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020), pp. 8229–8233, Barcelona (Spain), 2020. Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, Machine Translation, Multilingual Corpus, Speech Translation, Spoken Language Translation @inproceedings{Iranzo2020, title = {Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates}, author = {Javier Iranzo-Sánchez and Joan Albert Silvestre-Cerdà and Javier Jorge and Nahuel Roselló and Adrià Giménez and Albert Sanchis and Jorge Civera and Alfons Juan}, url = {https://arxiv.org/abs/1911.03167 https://paperswithcode.com/paper/europarl-st-a-multilingual-corpus-for-speech https://www.mllp.upv.es/europarl-st/}, doi = {10.1109/ICASSP40776.2020.9054626}, year = {2020}, date = {2020-01-01}, booktitle = {Proc. of 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020)}, pages = {8229--8233}, address = {Barcelona (Spain)}, abstract = {Current research into spoken language translation (SLT), or speech-to-text translation, is often hampered by the lack of specific data resources for this task, as currently available SLT datasets are restricted to a limited set of language pairs. In this paper we present Europarl-ST, a novel multilingual SLT corpus containing paired audio-text samples for SLT from and into 6 European languages, for a total of 30 different translation directions. This corpus has been compiled using the debates held in the European Parliament in the period between 2008 and 2012. This paper describes the corpus creation process and presents a series of automatic speech recognition, machine translation and spoken language translation experiments that highlight the potential of this new resource. 
The corpus is released under a Creative Commons license and is freely accessible and downloadable.}, keywords = {Automatic Speech Recognition, Machine Translation, Multilingual Corpus, Speech Translation, Spoken Language Translation}, pubstate = {published}, tppubtype = {inproceedings} } |
Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons LSTM-Based One-Pass Decoder for Low-Latency Streaming Inproceedings Proc. of 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020), pp. 7814–7818, Barcelona (Spain), 2020. Abstract | Links | BibTeX | Tags: acoustic modeling, Automatic Speech Recognition, decoding, Language Modeling, streaming @inproceedings{Jorge2020, title = {LSTM-Based One-Pass Decoder for Low-Latency Streaming}, author = {Javier Jorge and Adrià Giménez and Javier Iranzo-Sánchez and Joan Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan}, url = {https://www.mllp.upv.es/wp-content/uploads/2020/01/jorge2020_preprint.pdf https://doi.org/10.1109/ICASSP40776.2020.9054267}, year = {2020}, date = {2020-01-01}, booktitle = {Proc. of 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020)}, pages = {7814--7818}, address = {Barcelona (Spain)}, abstract = {Current state-of-the-art models based on Long Short-Term Memory (LSTM) networks have been extensively used in ASR to improve performance. However, using LSTMs under a streaming setup is not straightforward due to real-time constraints. In this paper we present a novel streaming decoder that includes a bidirectional LSTM acoustic model as well as a unidirectional LSTM language model to perform the decoding efficiently while keeping the performance comparable to that of an off-line setup. We perform a one-pass decoding using a sliding window scheme for a bidirectional LSTM acoustic model and an LSTM language model. This has been implemented and assessed under a pure streaming setup, and deployed into our production systems. 
We report WER and latency figures for the well-known LibriSpeech and TED-LIUM tasks, obtaining competitive WER results with low-latency responses.}, keywords = {acoustic modeling, Automatic Speech Recognition, decoding, Language Modeling, streaming}, pubstate = {published}, tppubtype = {inproceedings} } |
2019 |
Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Civera, Jorge; Sanchis, Albert; Juan, Alfons Real-time One-pass Decoder for Speech Recognition Using LSTM Language Models Inproceedings Proc. of the 20th Annual Conf. of the ISCA (Interspeech 2019), pp. 3820–3824, Graz (Austria), 2019. Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, LSTM language models, one-pass decoding, real-time @inproceedings{Jorge2019, title = {Real-time One-pass Decoder for Speech Recognition Using LSTM Language Models}, author = {Javier Jorge and Adrià Giménez and Javier Iranzo-Sánchez and Jorge Civera and Albert Sanchis and Alfons Juan}, url = {https://www.isca-speech.org/archive/interspeech_2019/jorge19_interspeech.html}, year = {2019}, date = {2019-01-01}, booktitle = {Proc. of the 20th Annual Conf. of the ISCA (Interspeech 2019)}, pages = {3820--3824}, address = {Graz (Austria)}, abstract = {Recurrent Neural Networks, in particular Long Short-Term Memory (LSTM) networks, are widely used in Automatic Speech Recognition for language modelling during decoding, usually as a mechanism for rescoring hypotheses. This paper proposes a new architecture to perform real-time one-pass decoding using LSTM language models. To make decoding efficient, the estimation of look-ahead scores was accelerated by precomputing static look-ahead tables. These static tables were precomputed from a pruned n-gram model, drastically reducing the computational cost during decoding. Additionally, the LSTM language model evaluation was efficiently performed using Variance Regularization along with a strategy of lazy evaluation. The proposed one-pass decoder architecture was evaluated on the well-known LibriSpeech and TED-LIUMv3 datasets. Results showed that the proposed algorithm obtains very competitive WERs with ∼0.6 RTFs. 
Finally, our one-pass decoder is compared with a decoupled two-pass decoder.}, keywords = {Automatic Speech Recognition, LSTM language models, one-pass decoding, real-time}, pubstate = {published}, tppubtype = {inproceedings} } |
Iranzo-Sánchez, Javier; Garcés Díaz-Munío, Gonçal V.; Civera, Jorge; Juan, Alfons The MLLP-UPV Supervised Machine Translation Systems for WMT19 News Translation Task Inproceedings Proc. of Fourth Conference on Machine Translation (WMT19), pp. 218–224, Florence (Italy), 2019. Abstract | Links | BibTeX | Tags: Machine Translation, Neural Machine Translation, WMT19 News Translation @inproceedings{Iranzo-Sánchez2019, title = {The MLLP-UPV Supervised Machine Translation Systems for WMT19 News Translation Task}, author = {Iranzo-Sánchez, Javier and Garcés Díaz-Munío, Gonçal V. and Civera, Jorge and Juan, Alfons}, url = {https://www.mllp.upv.es/wp-content/uploads/2019/09/poster-1.pdf}, doi = {10.18653/v1/W19-5320}, year = {2019}, date = {2019-01-01}, booktitle = {Proc. of Fourth Conference on Machine Translation (WMT19)}, pages = {218--224}, address = {Florence (Italy)}, abstract = {This paper describes the participation of the MLLP research group of the Universitat Politècnica de València in the WMT 2019 News Translation Shared Task. In this edition, we have submitted systems for the German ↔ English and German ↔ French language pairs, participating in both directions of each pair. Our submitted systems, based on the Transformer architecture, make ample use of data filtering, synthetic data and domain adaptation through fine-tuning.}, keywords = {Machine Translation, Neural Machine Translation, WMT19 News Translation}, pubstate = {published}, tppubtype = {inproceedings} } |
Baquero-Arnal, Pau; Iranzo-Sánchez, Javier; Civera, Jorge; Juan, Alfons The MLLP-UPV Spanish-Portuguese and Portuguese-Spanish Machine Translation Systems for WMT19 Similar Language Translation Task Inproceedings Proc. of Fourth Conference on Machine Translation (WMT19), pp. 179–184, Florence (Italy), 2019. Abstract | Links | BibTeX | Tags: Machine Translation, Neural Machine Translation, WMT19 @inproceedings{Baquero-Arnal2019, title = {The MLLP-UPV Spanish-Portuguese and Portuguese-Spanish Machine Translation Systems for WMT19 Similar Language Translation Task}, author = {Baquero-Arnal, Pau and Iranzo-Sánchez, Javier and Civera, Jorge and Juan, Alfons}, url = {https://www.aclweb.org/anthology/W19-5423/ https://www.mllp.upv.es/wp-content/uploads/2019/09/poster-2.pdf}, year = {2019}, date = {2019-01-01}, booktitle = {Proc. of Fourth Conference on Machine Translation (WMT19)}, pages = {179--184}, address = {Florence (Italy)}, abstract = {This paper describes the participation of the MLLP research group of the Universitat Politècnica de València in the WMT 2019 Similar Language Translation Shared Task. We have submitted systems for the Portuguese ↔ Spanish language pair, in both directions. They are based on the Transformer architecture, as well as on a novel architecture called 2D alternating RNN. Both systems have been domain-adapted through fine-tuning, which has been shown to be very effective.}, keywords = {Machine Translation, Neural Machine Translation, WMT19}, pubstate = {published}, tppubtype = {inproceedings} } |
del Agua Teba, Miguel Á. Contributions to Efficient Automatic Transcription of Video Lectures PhD Thesis Universitat Politècnica de València, 2019, (Advisers: Alfons Juan Ciscar and Albert Sanchis Navarro). Tags: Automatic Speech Recognition, Confidence measures, Video Lectures @phdthesis{delTeba2019, title = {Contributions to Efficient Automatic Transcription of Video Lectures}, author = {del Agua Teba, Miguel Á.}, url = {https://www.upv.es/pls/oalu/sic_ted.mostrar_tesis?p_num_reg=10772}, year = {2019}, date = {2019-01-01}, school = {Universitat Politècnica de València}, note = {Advisers: Alfons Juan Ciscar and Albert Sanchis Navarro}, keywords = {Automatic Speech Recognition, Confidence measures, Video Lectures}, pubstate = {published}, tppubtype = {phdthesis} } |
2018 |
Matusov, Evgeny; Wilken, Patrick; Bahar, Parnia; Schamper, Julian; Golik, Pavel; Zeyer, Albert; Silvestre-Cerdà, Joan Albert; Martínez-Villaronga, Adrià; Pesch, Hendrick; Peter, Jan-Thorsten Neural Speech Translation at AppTek Inproceedings Proc. of 15th Intl. Workshop on Spoken Language Translation (IWSLT 2018), pp. 104–111, Hong Kong, 2018. Tags: Automatic Speech Recognition, Machine Translation @inproceedings{Matusov18, title = {Neural Speech Translation at AppTek}, author = {Evgeny Matusov and Patrick Wilken and Parnia Bahar and Julian Schamper and Pavel Golik and Albert Zeyer and Joan Albert Silvestre-Cerdà and Adrià Martínez-Villaronga and Hendrick Pesch and Jan-Thorsten Peter}, url = {https://www.mllp.upv.es/wp-content/uploads/2019/07/iwslt18.pdf https://workshop2018.iwslt.org/downloads/Proceedings_IWSLT_2018.pdf}, year = {2018}, date = {2018-07-01}, booktitle = {Proc. of 15th Intl. Workshop on Spoken Language Translation (IWSLT 2018)}, pages = {104--111}, address = {Hong Kong}, keywords = {Automatic Speech Recognition, Machine Translation}, pubstate = {published}, tppubtype = {inproceedings} } |
Iranzo-Sánchez, Javier; Baquero-Arnal, Pau; Garcés Díaz-Munío, Gonçal V.; Martínez-Villaronga, Adrià; Civera, Jorge; Juan, Alfons The MLLP-UPV German-English Machine Translation System for WMT18 Inproceedings Proc. of the Third Conference on Machine Translation (WMT18), Volume 2: Shared Task Papers, pp. 422–428, Brussels (Belgium), 2018. Tags: Data Selection, Machine Translation, Neural Machine Translation, WMT18 news translation @inproceedings{Iranzo-Sánchez2018, title = {The MLLP-UPV German-English Machine Translation System for WMT18}, author = {Iranzo-Sánchez, Javier and Baquero-Arnal, Pau and Garcés Díaz-Munío, Gonçal V. and Martínez-Villaronga, Adrià and Civera, Jorge and Juan, Alfons}, url = {http://dx.doi.org/10.18653/v1/W18-6414 https://www.mllp.upv.es/wp-content/uploads/2018/11/wmt18_mllp-upv_poster.pdf}, year = {2018}, date = {2018-01-01}, booktitle = {Proc. of the Third Conference on Machine Translation (WMT18), Volume 2: Shared Task Papers}, pages = {422--428}, address = {Brussels (Belgium)}, abstract = {[EN] This paper describes the statistical machine translation system built by the MLLP research group of Universitat Politècnica de València for the German>English news translation shared task of the EMNLP 2018 Third Conference on Machine Translation (WMT18). We used an ensemble of Transformer architecture–based neural machine translation systems. To train our system under "constrained" conditions, we filtered the provided parallel data with a scoring technique using character-based language models, and we added parallel data based on synthetic source sentences generated from the provided monolingual corpora.
[CA] "El sistema de traducció automàtica alemany>anglés de l'MLLP-UPV per a WMT18": En aquest article descrivim el sistema de traducció automàtica estadística creat pel grup d'investigació MLLP de la Universitat Politècnica de València per a la competició de traducció de notícies alemany>anglés de la Third Conference on Machine Translation (WMT18, associada a la conferència EMNLP 2018). Hem utilitzat una combinació de sistemes de traducció automàtica neuronal basats en l'arquitectura Transformer. Per a entrenar el nostre sistema en la categoria "fitada" (només amb els corpus lingüístics oficials de la competició), hem filtrat les dades paral·leles disponibles amb una tècnica que assigna puntuacions utilitzant models de llenguatge de caràcters, i hem afegit dades paral·leles basades en frases d'origen sintètiques generades a partir dels corpus monolingües disponibles.}, keywords = {Data Selection, Machine Translation, Neural Machine Translation, WMT18 news translation}, pubstate = {published}, tppubtype = {inproceedings} } |