Publications

2024
Garcés Díaz-Munío, Gonçal: Automatic speech recognition and machine translation with deep neural networks for open educational resources, parliamentary contents and broadcast media. PhD Thesis, Universitat Politècnica de València, 2024 (advisers: Alfons Juan Ciscar and Jorge Civera Saiz).
Links: http://hdl.handle.net/10251/212454 · https://doi.org/10.4995/Thesis/10251/212454 · https://www.upv.es/pls/oalu/sic_ted.mostrar_tesis?p_num_reg=12900 · https://github.com/gonsalet/ASR_and_MT_for_educational_parliamentary_and_broadcast_media
Tags: Automatic Speech Recognition, Broadcast Media, Deep Neural Networks, Machine Translation, Open Educational Resources, Parliamentary Contents

Abstract: In the last decade, automatic speech recognition (ASR) and machine translation (MT) have improved enormously through the use of constantly evolving deep neural network (DNN) models. If at the beginning of the 2010s the then pre-DNN ASR and MT systems were ready to successfully tackle some real-life applications such as offline video lecture transcription and translation, now in the 2020s much more challenging applications are within grasp, such as live broadcast media subtitling. Over the same period, media accessibility for everyone, including deaf and hard-of-hearing people, has been given more and more importance. ASR and MT, in their current state, are powerful tools to increase the coverage of accessibility measures such as subtitles, transcriptions and translations, and also a way of providing multilingual access to all types of content. In this PhD thesis, we present research results on automatic speech recognition and machine translation based on deep neural networks in three very active domains: open educational resources, parliamentary contents and broadcast media. Regarding open educational resources (OER), we first present work on the evaluation and post-editing of ASR and MT with intelligent interaction approaches, carried out in the framework of the EU project transLectures: Transcription and Translation of Video Lectures. The results obtained confirm that the intelligent interaction approach can make post-editing automatic transcriptions and translations even more cost-effective. Then, in the context of the subsequent EU project X5gon, we present research on developing DNN-based neural MT systems and on making the most of larger MT corpora through automatic data filtering. This work resulted in a first-rank classification in an international evaluation campaign on MT, and we show how these new NMT systems improved the quality of multilingual subtitles in real OER scenarios. In the likewise growing domain of language technologies for parliamentary contents, we describe research on speech data curation techniques for streaming ASR in the context of European Parliament debates. This research resulted in the release of Europarl-ASR, a new, large speech corpus for streaming ASR system training and evaluation, as well as for the benchmarking of speech data curation techniques. Finally, we present work in a domain at the edge of the state of the art for ASR and MT: the live subtitling of broadcast media, in the context of the 2020–2023 R&D collaboration agreement between the Valencian public broadcaster À Punt and the Universitat Politècnica de València for real-time computer-assisted subtitling of media contents. This research has resulted in the deployment of high-quality, low-latency, real-time streaming ASR systems for a less-spoken language (Catalan) and a widely spoken language (Spanish) in a real broadcast use case.
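To give a concrete flavour of the "automatic data filtering" of MT corpora mentioned in this abstract, the following is a minimal, self-contained sketch of rule-based parallel-corpus filtering. It does not reproduce the thesis's actual method; the function names and thresholds are hypothetical examples of common sanity filters (length limits, length-ratio checks, deduplication).

```python
# Illustrative sketch of simple parallel-corpus filtering for MT training
# data. All thresholds below are hypothetical defaults, not values from
# the thesis.

def keep_pair(src: str, tgt: str,
              max_len: int = 250,
              max_ratio: float = 2.0) -> bool:
    """Return True if a sentence pair passes basic sanity filters."""
    src_tokens, tgt_tokens = src.split(), tgt.split()
    # Drop empty or overly long segments.
    if not src_tokens or not tgt_tokens:
        return False
    if len(src_tokens) > max_len or len(tgt_tokens) > max_len:
        return False
    # Drop pairs whose length ratio suggests a misalignment.
    ratio = len(src_tokens) / len(tgt_tokens)
    if ratio > max_ratio or ratio < 1.0 / max_ratio:
        return False
    # Drop pairs where source and target are identical (likely untranslated).
    if src.strip() == tgt.strip():
        return False
    return True

def filter_corpus(pairs):
    """Deduplicate and filter an iterable of (src, tgt) sentence pairs."""
    seen = set()
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key in seen or not keep_pair(src, tgt):
            continue
        seen.add(key)
        yield key
```

For example, `list(filter_corpus([("Hello world", "Hola mundo"), ("Hello world", "Hello world")]))` keeps only the first pair, discarding the untranslated one. Production filtering pipelines typically add language identification and model-based scoring on top of such rules.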
2022
Jorge Cano, Javier: Streaming Automatic Speech Recognition with Hybrid Architectures and Deep Neural Network Models. PhD Thesis, Universitat Politècnica de València, 2022 (advisers: Alfons Juan Ciscar and Jorge Civera Saiz).
Links: https://doi.org/10.4995/Thesis/10251/191001
Tags: Automatic Speech Recognition, Deep Neural Networks, hybrid ASR, streaming
2021
Iranzo-Sánchez, Javier; Jorge, Javier; Baquero-Arnal, Pau; Silvestre-Cerdà, Joan Albert; Giménez, Adrià; Civera, Jorge; Sanchis, Albert; Juan, Alfons: Streaming cascade-based speech translation leveraged by a direct segmentation model. Journal Article, Neural Networks, 142, pp. 303–315, 2021.
Links: https://doi.org/10.1016/j.neunet.2021.05.013
Tags: Automatic Speech Recognition, Cascade System, Deep Neural Networks, Hybrid System, Machine Translation, Segmentation Model, Speech Translation, streaming

Abstract: The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. Nowadays, state-of-the-art ST systems are populated with deep neural networks that are conceived to work in an offline setup, in which the audio input to be translated is fully available in advance. However, a streaming setup presents a completely different picture, in which an unbounded audio input gradually becomes available and, at the same time, the translation needs to be generated under real-time constraints. In this work, we present a state-of-the-art streaming ST system in which the neural-based models integrated in the ASR and MT components are carefully adapted, in terms of their training and decoding procedures, to run under a streaming setup. In addition, a direct segmentation model that adapts the continuous ASR output to the capacity of simultaneous MT systems trained at the sentence level is introduced to guarantee low latency while preserving the translation quality of the complete ST system. The resulting ST system is thoroughly evaluated on the real-life streaming Europarl-ST benchmark to gauge the trade-off between quality and latency, both for each component individually and for the complete ST system.
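The abstract above describes a three-stage cascade: streaming ASR produces an unbounded word stream, a segmentation model cuts that stream into sentence-like chunks, and a sentence-level MT system translates each chunk. The sketch below illustrates that control flow only; `asr_step`, `is_boundary` and `translate` are hypothetical stand-ins, not the paper's actual models.

```python
# Illustrative control flow of a cascade streaming speech translation
# pipeline: ASR -> segmentation -> sentence-level MT. The three callables
# are hypothetical placeholders for the real neural components.

from typing import Callable, Iterable, Iterator, List

def cascade_st(audio_chunks: Iterable[bytes],
               asr_step: Callable[[bytes], List[str]],
               is_boundary: Callable[[List[str]], bool],
               translate: Callable[[str], str]) -> Iterator[str]:
    """Translate a stream of audio chunks under streaming constraints.

    asr_step    -- consumes one audio chunk, returns newly recognized words
    is_boundary -- the segmentation model: decides whether the buffered
                   words form a complete, translatable segment
    translate   -- a sentence-level MT system
    """
    buffer: List[str] = []
    for chunk in audio_chunks:
        for word in asr_step(chunk):
            buffer.append(word)
            # Query the segmentation model after every new word, so a
            # segment is translated and emitted as soon as a boundary is
            # detected; this is what keeps the end-to-end latency low.
            if is_boundary(buffer):
                yield translate(" ".join(buffer))
                buffer.clear()
    if buffer:  # flush any trailing words at the end of the stream
        yield translate(" ".join(buffer))
```

With a trivial punctuation-based segmenter, e.g. `is_boundary=lambda ws: ws[-1].endswith((".", "?", "!"))`, the pipeline already emits translations segment by segment; the point of a learned direct segmentation model is to place these boundaries so that each chunk matches what the sentence-level MT system was trained on, trading a little latency for translation quality.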