2024 |
Garcés Díaz-Munío, Gonçal Automatic speech recognition and machine translation with deep neural networks for open educational resources, parliamentary contents and broadcast media PhD Thesis Universitat Politècnica de València, 2024, (advisers: Alfons Juan Ciscar and Jorge Civera Saiz).
@phdthesis{Garcés2024,
  title = {Automatic speech recognition and machine translation with deep neural networks for open educational resources, parliamentary contents and broadcast media},
  author = {Garcés Díaz-Munío, Gonçal},
  url = {http://hdl.handle.net/10251/212454 https://www.upv.es/pls/oalu/sic_ted.mostrar_tesis?p_num_reg=12900 https://github.com/gonsalet/ASR_and_MT_for_educational_parliamentary_and_broadcast_media},
  doi = {10.4995/Thesis/10251/212454},
  year = {2024},
  date = {2024-11-25},
  school = {Universitat Politècnica de València},
  abstract = {In the last decade, automatic speech recognition (ASR) and machine translation (MT) have improved enormously through the use of constantly evolving deep neural network (DNN) models. Whereas at the beginning of the 2010s the then pre-DNN ASR and MT systems were ready to successfully tackle some real-life applications such as offline video lecture transcription and translation, now in the 2020s much more challenging applications are within grasp, such as live broadcast media subtitling. Over the same period, media accessibility for everyone, including deaf and hard-of-hearing people, has been given more and more importance. ASR and MT, in their current state, are powerful tools to increase the coverage of accessibility measures such as subtitles, transcriptions and translations, and also a way of providing multilingual access to all types of content. In this PhD thesis, we present research results on automatic speech recognition and machine translation based on deep neural networks in three very active domains: open educational resources, parliamentary contents and broadcast media. Regarding open educational resources (OER), we first present work on the evaluation and post-editing of ASR and MT with intelligent interaction approaches, as carried out in the framework of the EU project transLectures: Transcription and Translation of Video Lectures. The results obtained confirm that the intelligent interaction approach can make post-editing automatic transcriptions and translations even more cost-effective. Then, in the context of the subsequent EU project X5gon, we present research on developing DNN-based neural MT (NMT) systems, and on making the most of larger MT corpora through automatic data filtering. This work resulted in a first-place ranking in an international evaluation campaign on MT, and we show how these new NMT systems improved the quality of multilingual subtitles in real OER scenarios. In the likewise growing domain of language technologies for parliamentary contents, we describe research on speech data curation techniques for streaming ASR in the context of European Parliament debates. This research resulted in the release of Europarl-ASR, a new, large speech corpus for streaming ASR system training and evaluation, as well as for the benchmarking of speech data curation techniques. Finally, we present work in a domain on the edge of the state of the art for ASR and MT: the live subtitling of broadcast media, in the context of the 2020–2023 R&D collaboration agreement between the Valencian public broadcaster À Punt and the Universitat Politècnica de València for real-time computer-assisted subtitling of media contents. This research has resulted in the deployment of high-quality, low-latency, real-time streaming ASR systems for a less-spoken language (Catalan) and a widely spoken language (Spanish) in a real broadcast use case.},
  note = {advisers: Alfons Juan Ciscar and Jorge Civera Saiz},
  keywords = {Automatic Speech Recognition, Broadcast Media, Deep Neural Networks, Machine Translation, Open Educational Resources, Parliamentary Contents},
  pubstate = {published},
  tppubtype = {phdthesis}
} |
2022 |
Baquero-Arnal, Pau; Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Pérez-González-de-Martos, Alejandro; Garcés Díaz-Munío, Gonçal V; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge: Extension Journal Article Applied Sciences, 12 (2), pp. 804, 2022.
@article{applsci1505192,
  title = {MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge: Extension},
  author = {Pau Baquero-Arnal and Javier Jorge and Adrià Giménez and Javier Iranzo-Sánchez and Alejandro Pérez-González-de-Martos and Garcés Díaz-Munío, Gonçal V. and Joan Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan},
  doi = {10.3390/app12020804},
  year = {2022},
  date = {2022-01-01},
  journal = {Applied Sciences},
  volume = {12},
  number = {2},
  pages = {804},
  abstract = {This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzín-RTVE 2020 Speech-to-Text Challenge, and includes an extension of the work consisting of building and evaluating equivalent systems under the closed data conditions from the 2018 challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid ASR system using streaming one-pass decoding with a context window of 1.5 seconds. This system achieved 16.0% WER on the test-2020 set. We also submitted three contrastive systems. Of these, we highlight the system c2-streaming_600ms_t which, following a configuration similar to that of the primary system but with a smaller context window of 0.6 s, scored 16.9% WER on the same test set, with a measured empirical latency of 0.81±0.09 seconds (mean±stdev). That is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative. As an extension, the equivalent closed-condition systems obtained 23.3% WER and 23.5% WER, respectively. When evaluated with an unconstrained language model, we obtained 19.9% WER and 20.4% WER; i.e., not far behind the top-performing systems with only 5% of the full acoustic data and with the added ability of being streaming-capable. Indeed, all of these streaming systems could be put into production environments for automatic captioning of live media streams.},
  keywords = {Automatic Speech Recognition, Natural Language Processing, streaming},
  pubstate = {published},
  tppubtype = {article}
} |
Pérez González de Martos, Alejandro; Giménez Pastor, Adrià; Jorge Cano, Javier; Iranzo-Sánchez, Javier; Silvestre-Cerdà, Joan Albert; Garcés Díaz-Munío, Gonçal V; Baquero-Arnal, Pau; Sanchis Navarro, Alberto; Civera Sáiz, Jorge; Juan Ciscar, Alfons; Turró Ribalta, Carlos Doblaje automático de vídeo-charlas educativas en UPV[Media] Inproceedings Proc. of VIII Congrés d'Innovació Educativa i Docència en Xarxa (IN-RED 2022), pp. 557–570, València (Spain), 2022.
@inproceedings{deMartos2022,
  title = {Doblaje automático de vídeo-charlas educativas en UPV[Media]},
  author = {Pérez González de Martos, Alejandro and Giménez Pastor, Adrià and Jorge Cano, Javier and Javier Iranzo-Sánchez and Joan Albert Silvestre-Cerdà and Garcés Díaz-Munío, Gonçal V. and Pau Baquero-Arnal and Sanchis Navarro, Alberto and Civera Sáiz, Jorge and Juan Ciscar, Alfons and Turró Ribalta, Carlos},
  doi = {10.4995/INRED2022.2022.15844},
  year = {2022},
  date = {2022-01-01},
  booktitle = {Proc. of VIII Congrés d'Innovació Educativa i Docència en Xarxa (IN-RED 2022)},
  pages = {557--570},
  address = {València (Spain)},
  abstract = {More and more universities are banking on the production of digital content to support online or blended learning in higher education. Over the last few years, the MLLP research group has been working closely with the UPV's ASIC media services in order to enrich educational multimedia resources through the application of natural language processing technologies including automatic speech recognition, machine translation and text-to-speech. In this work, we present the steps that are being followed for the comprehensive translation of these materials, specifically through (semi-)automatic dubbing by making use of state-of-the-art speaker-adaptive text-to-speech technologies.},
  keywords = {automatic dubbing, Automatic Speech Recognition, Machine Translation, OER, text-to-speech},
  pubstate = {published},
  tppubtype = {inproceedings}
} |
2021 |
Jorge, Javier; Giménez, Adrià; Baquero-Arnal, Pau; Iranzo-Sánchez, Javier; Pérez-González-de-Martos, Alejandro; Garcés Díaz-Munío, Gonçal V; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge Inproceedings Proc. of IberSPEECH 2021, pp. 118–122, Valladolid (Spain), 2021.
@inproceedings{Jorge2021,
  title = {MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge},
  author = {Javier Jorge and Adrià Giménez and Pau Baquero-Arnal and Javier Iranzo-Sánchez and Alejandro Pérez-González-de-Martos and Garcés Díaz-Munío, Gonçal V. and Joan Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan},
  doi = {10.21437/IberSPEECH.2021-25},
  year = {2021},
  date = {2021-03-24},
  booktitle = {Proc. of IberSPEECH 2021},
  pages = {118--122},
  address = {Valladolid (Spain)},
  abstract = {1st place in IberSpeech-RTVE 2020 TV Speech-to-Text Challenge. [EN] This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzin-RTVE 2020 Speech-to-Text Challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid BLSTM-HMM ASR system using streaming one-pass decoding with a context window of 1.5 seconds and a linear combination of an n-gram, an LSTM, and a Transformer language model (LM). The acoustic model was trained on nearly 4,000 hours of speech data from different sources, using the MLLP's transLectures-UPV toolkit (TLK) and TensorFlow; whilst LMs were trained using SRILM (n-gram), CUED-RNNLM (LSTM) and Fairseq (Transformer), with up to 102G tokens. This system achieved 11.6% and 16.0% WER on the test-2018 and test-2020 sets, respectively. As it is streaming-enabled, it could be put into production environments for automatic captioning of live media streams, with a theoretical delay of 1.5 seconds. Along with the primary system, we also submitted three contrastive systems. Of these, we highlight the system c2-streaming_600ms_t that, following the same configuration as the primary one, but using a smaller context window of 0.6 seconds and a Transformer LM, scored 12.3% and 16.9% WER, respectively, on the same test sets, with a measured empirical latency of 0.81±0.09 seconds (mean±stdev). That is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative. [CA] "Sistemes de reconeixement automàtic de la parla en castellà de MLLP-VRAIN per a la competició Albayzin-RTVE 2020 Speech-To-Text Challenge": En aquest article, es descriuen els sistemes de reconeixement automàtic de la parla (RAP) creats pel grup d'investigació MLLP-VRAIN de la Universitat Politècnica de València per a la competició Albayzin-RTVE 2020 Speech-to-Text Challenge. El sistema primari (p-streaming_1500ms_nlt) és un sistema de RAP híbrid BLSTM-HMM amb descodificació en temps real en una passada amb una finestra de context d'1,5 segons i una combinació lineal de models de llenguatge (ML) d'n-grames, LSTM i Transformer. El model acústic s'ha entrenat amb vora 4000 hores de parla transcrita de diferents fonts, usant el transLectures-UPV toolkit (TLK) del grup MLLP i TensorFlow; mentre que els ML s'han entrenat amb SRILM (n-grames), CUED-RNNLM (LSTM) i Fairseq (Transformer), amb 102G paraules (tokens). Aquest sistema ha obtingut 11,6 % i 16,0 % de WER en els conjunts test-2018 i test-2020, respectivament. És un sistema amb capacitat de temps real, que pot desplegar-se en producció per a subtitulació automàtica de fluxos audiovisuals en directe, amb un retard teòric d'1,5 segons. A banda del sistema primari, s'han presentat tres sistemes contrastius. D'aquests, destaquem el sistema c2-streaming_600ms_t que, amb la mateixa configuració que el sistema primari, però amb una finestra de context més reduïda de 0,6 segons i un ML Transformer, ha obtingut 12,3 % i 16,9 % de WER, respectivament, sobre els mateixos conjunts, amb una latència empírica mesurada de 0,81±0,09 segons (mitjana±desv). És a dir, s'han obtingut latències punteres per a subtitulació automàtica en directe d'alta qualitat amb una degradació del WER petita, del 6 % relatiu.},
  keywords = {Automatic Speech Recognition, Natural Language Processing, streaming},
  pubstate = {published},
  tppubtype = {inproceedings}
} |
Garcés Díaz-Munío, Gonçal V; Silvestre-Cerdà, Joan Albert; Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Baquero-Arnal, Pau; Roselló, Nahuel; Pérez-González-de-Martos, Alejandro; Civera, Jorge; Sanchis, Albert; Juan, Alfons Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization Inproceedings Proc. Interspeech 2021, pp. 3695–3699, Brno (Czech Republic), 2021.
@inproceedings{Garcés2021,
  title = {Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization},
  author = {Garcés Díaz-Munío, Gonçal V. and Silvestre-Cerdà, Joan Albert and Javier Jorge and Adrià Giménez and Javier Iranzo-Sánchez and Pau Baquero-Arnal and Nahuel Roselló and Alejandro Pérez-González-de-Martos and Jorge Civera and Albert Sanchis and Alfons Juan},
  url = {https://www.mllp.upv.es/wp-content/uploads/2021/09/europarl-asr-presentation-extended.pdf https://www.youtube.com/watch?v=Tc0gNSDdnQg&list=PLlePn-Yanvnc_LRhgmmaNmH12Bwm6BRsZ https://paperswithcode.com/paper/europarl-asr-a-large-corpus-of-parliamentary https://github.com/mllpresearch/Europarl-ASR},
  doi = {10.21437/Interspeech.2021-1905},
  year = {2021},
  date = {2021-01-01},
  booktitle = {Proc. Interspeech 2021},
  journal = {Proc. Interspeech 2021},
  pages = {3695--3699},
  address = {Brno (Czech Republic)},
  abstract = {[EN] We introduce Europarl-ASR, a large speech and text corpus of parliamentary debates including 1300 hours of transcribed speeches and 70 million tokens of text in English extracted from European Parliament sessions. The training set is labelled with the Parliament’s non-fully-verbatim official transcripts, time-aligned. As verbatimness is critical for acoustic model training, we also provide automatically noise-filtered and automatically verbatimized transcripts of all speeches based on speech data filtering and verbatimization techniques. Additionally, 18 hours of transcribed speeches were manually verbatimized to build reliable speaker-dependent and speaker-independent development/test sets for streaming ASR benchmarking. The availability of manual non-verbatim and verbatim transcripts for dev/test speeches makes this corpus useful for the assessment of automatic filtering and verbatimization techniques. This paper describes the corpus and its creation, and provides off-line and streaming ASR baselines for both the speaker-dependent and speaker-independent tasks using the three training transcription sets. The corpus is publicly released under an open licence. [CA] "Europarl-ASR: Un extens corpus parlamentari de referència per a reconeixement de la parla i filtratge/literalització de transcripcions": Presentem Europarl-ASR, un extens corpus de veu i text de debats parlamentaris amb 1300 hores d'intervencions transcrites i 70 milions de paraules de text en anglés extrets de sessions del Parlament Europeu. Les transcripcions oficials del Parlament Europeu, no literals, s'han sincronitzat per a tot el conjunt d'entrenament. Com que l'entrenament de models acústics requereix transcripcions com més literals millor, també s'han inclòs transcripcions filtrades i transcripcions literalitzades de totes les intervencions, basades en tècniques de filtratge i literalització automàtics. A més, s'han inclòs 18 hores de transcripcions literals revisades manualment per definir dos conjunts de validació i avaluació de referència per a reconeixement automàtic de la parla en temps real, amb oradors coneguts i amb oradors desconeguts. Pel fet de disposar de transcripcions literals i no literals, aquest corpus és també ideal per a l'anàlisi de tècniques de filtratge i de literalització. En aquest article, es descriu la creació del corpus i es proporcionen mesures de referència de reconeixement automàtic de la parla en temps real i en diferit, amb oradors coneguts i amb oradors desconeguts, usant els tres conjunts de transcripcions d'entrenament. El corpus es fa públic amb una llicència oberta.},
  keywords = {Automatic Speech Recognition, speech corpus, speech data filtering, speech data verbatimization},
  pubstate = {published},
  tppubtype = {inproceedings}
}
|
2012 |
Silvestre-Cerdà, Joan Albert; Del Agua, Miguel; Garcés, Gonçal; Gascó, Guillem; Giménez-Pastor, Adrià; Martínez, Adrià; Pérez González de Martos, Alejandro; Sánchez, Isaías; Serrano Martínez-Santos, Nicolás; Spencer, Rachel; Valor Miró, Juan Daniel; Andrés-Ferrer, Jesús; Civera, Jorge; Sanchís, Alberto; Juan, Alfons transLectures Inproceedings Proceedings (Online) of IberSPEECH 2012, pp. 345–351, Madrid (Spain), 2012.
@inproceedings{Silvestre-Cerdà2012b,
  title = {transLectures},
  author = {Silvestre-Cerdà, Joan Albert and Del Agua, Miguel and Gonçal Garcés and Guillem Gascó and Adrià Giménez-Pastor and Adrià Martínez and Pérez González de Martos, Alejandro and Isaías Sánchez and Serrano Martínez-Santos, Nicolás and Rachel Spencer and Valor Miró, Juan Daniel and Jesús Andrés-Ferrer and Jorge Civera and Alberto Sanchís and Alfons Juan},
  url = {http://hdl.handle.net/10251/37290 http://lorien.die.upm.es/~lapiz/rtth/JORNADAS/VII/IberSPEECH2012_OnlineProceedings.pdf https://web.archive.org/web/20130609073144/http://iberspeech2012.ii.uam.es/IberSPEECH2012_OnlineProceedings.pdf http://www.mllp.upv.es/wp-content/uploads/2015/04/1209IberSpeech.pdf},
  year = {2012},
  date = {2012-11-22},
  booktitle = {Proceedings (Online) of IberSPEECH 2012},
  pages = {345--351},
  address = {Madrid (Spain)},
  abstract = {[EN] transLectures (Transcription and Translation of Video Lectures) is an EU STREP project in which advanced automatic speech recognition and machine translation techniques are being tested on large video lecture repositories. The project began in November 2011 and will run for three years. This paper will outline the project's main motivation and objectives, and give a brief description of the two main repositories being considered: VideoLectures.NET and poliMedia. The first results obtained by the UPV group for the poliMedia repository will also be provided. [CA] transLectures (Transcription and Translation of Video Lectures) és un projecte del 7PM de la Unió Europea en el qual s'estan posant a prova tècniques avançades de reconeixement automàtic de la parla i de traducció automàtica sobre grans repositoris digitals de vídeos docents. El projecte començà al novembre de 2011 i tindrà una duració de tres anys. En aquest article exposem la motivació i els objectius del projecte, i descrivim breument els dos repositoris principals sobre els quals es treballa: VideoLectures.NET i poliMèdia. També oferim els primers resultats obtinguts per l'equip de la UPV al repositori poliMèdia.},
  keywords = {Accessibility, Automatic Speech Recognition, Education, Intelligent Interaction, Language Technologies, Machine Translation, Massive Adaptation, Multilingualism, Opencast Matterhorn, Video Lectures},
  pubstate = {published},
  tppubtype = {inproceedings}
} |