2022 |
Pérez González de Martos, Alejandro; Giménez Pastor, Adrià; Jorge Cano, Javier; Iranzo-Sánchez, Javier; Silvestre-Cerdà, Joan Albert; Garcés Díaz-Munío, Gonçal V; Baquero-Arnal, Pau; Sanchis Navarro, Alberto; Civera Sáiz, Jorge; Juan Ciscar, Alfons; Turró Ribalta, Carlos
Doblaje automático de vídeo-charlas educativas en UPV[Media]
Inproceedings
Proc. of VIII Congrés d'Innovació Educativa i Docència en Xarxa (IN-RED 2022), pp. 557–570, València (Spain), 2022.
Tags: automatic dubbing, Automatic Speech Recognition, Machine Translation, OER, text-to-speech
@inproceedings{deMartos2022, title = {Doblaje automático de vídeo-charlas educativas en UPV[Media]}, author = {Pérez González de Martos, Alejandro AND Giménez Pastor, Adrià AND Jorge Cano, Javier AND Javier Iranzo-Sánchez AND Joan Albert Silvestre-Cerdà AND Garcés Díaz-Munío, Gonçal V. AND Pau Baquero-Arnal AND Sanchis Navarro, Alberto AND Civera Sáiz, Jorge AND Juan Ciscar, Alfons AND Turró Ribalta, Carlos}, doi = {10.4995/INRED2022.2022.15844}, year = {2022}, date = {2022-01-01}, booktitle = {Proc. of VIII Congrés d'Innovació Educativa i Docència en Xarxa (IN-RED 2022)}, pages = {557--570}, address = {València (Spain)}, abstract = {More and more universities are banking on the production of digital content to support online or blended learning in higher education. Over the last years, the MLLP research group has been working closely with the UPV's ASIC media services in order to enrich educational multimedia resources through the application of natural language processing technologies including automatic speech recognition, machine translation and text-to-speech. In this work, we present the steps that are being followed for the comprehensive translation of these materials, specifically through (semi-)automatic dubbing by making use of state-of-the-art speaker-adaptive text-to-speech technologies.}, keywords = {automatic dubbing, Automatic Speech Recognition, Machine Translation, OER, text-to-speech}, pubstate = {published}, tppubtype = {inproceedings} } |
Iranzo-Sánchez, Javier; Jorge, Javier; Pérez-González-de-Martos, Alejandro; Giménez, Adrià; Garcés Díaz-Munío, Gonçal V; Baquero-Arnal, Pau; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons
MLLP-VRAIN UPV systems for the IWSLT 2022 Simultaneous Speech Translation and Speech-to-Speech Translation tasks
Inproceedings
Proc. of 19th Intl. Conf. on Spoken Language Translation (IWSLT 2022), pp. 255–264, Dublin (Ireland), 2022.
Tags: Simultaneous Speech Translation, speech-to-speech translation
@inproceedings{Iranzo-Sánchez2022b, title = {MLLP-VRAIN UPV systems for the IWSLT 2022 Simultaneous Speech Translation and Speech-to-Speech Translation tasks}, author = {Javier Iranzo-Sánchez and Javier Jorge and Alejandro Pérez-González-de-Martos and Adrià Giménez and Garcés Díaz-Munío, Gonçal V. and Pau Baquero-Arnal and Joan Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan}, doi = {10.18653/v1/2022.iwslt-1.22}, year = {2022}, date = {2022-01-01}, booktitle = {Proc. of 19th Intl. Conf. on Spoken Language Translation (IWSLT 2022)}, pages = {255--264}, address = {Dublin (Ireland)}, abstract = {This work describes the participation of the MLLP-VRAIN research group in the two shared tasks of the IWSLT 2022 conference: Simultaneous Speech Translation and Speech-to-Speech Translation. We present our streaming-ready ASR, MT and TTS systems for Speech Translation and Synthesis from English into German. Our submission combines these systems by means of a cascade approach paying special attention to data preparation and decoding for streaming inference.}, keywords = {Simultaneous Speech Translation, speech-to-speech translation}, pubstate = {published}, tppubtype = {inproceedings} } |
Baquero-Arnal, Pau; Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Pérez-González-de-Martos, Alejandro; Garcés Díaz-Munío, Gonçal V; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons
MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge: Extension
Journal Article
Applied Sciences, 12 (2), pp. 804, 2022.
Tags: Automatic Speech Recognition, Natural Language Processing, streaming
@article{applsci1505192, title = {MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge: Extension}, author = {Pau Baquero-Arnal and Javier Jorge and Adrià Giménez and Javier Iranzo-Sánchez and Alejandro Pérez-González-de-Martos and Garcés Díaz-Munío, Gonçal V. and Joan Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan}, doi = {10.3390/app12020804}, year = {2022}, date = {2022-01-01}, journal = {Applied Sciences}, volume = {12}, number = {2}, pages = {804}, abstract = {This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzín-RTVE 2020 Speech-to-Text Challenge, and includes an extension of the work consisting in building and evaluating equivalent systems under the closed data conditions from the 2018 challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid ASR system using streaming one-pass decoding with a context window of 1.5 seconds. This system achieved 16.0% WER on the test-2020 set. We also submitted three contrastive systems. From these, we highlight the system c2-streaming_600ms_t which, following a similar configuration as the primary system with a smaller context window of 0.6 s, scored 16.9% WER points on the same test set, with a measured empirical latency of 0.81±0.09 seconds (mean±stdev). That is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative. As an extension, the equivalent closed-condition systems obtained 23.3% WER and 23.5% WER respectively. When evaluated with an unconstrained language model, we obtained 19.9% WER and 20.4% WER; i.e., not far behind the top-performing systems with only 5% of the full acoustic data and with the extra ability of being streaming-capable. Indeed, all of these streaming systems could be put into production environments for automatic captioning of live media streams.}, keywords = {Automatic Speech Recognition, Natural Language Processing, streaming}, pubstate = {published}, tppubtype = {article} } |
2021 |
Jorge, Javier; Giménez, Adrià; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons
Live Streaming Speech Recognition Using Deep Bidirectional LSTM Acoustic Models and Interpolated Language Models
Journal Article
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, pp. 148–161, 2021.
Tags: acoustic modelling, Automatic Speech Recognition, decoding, language modelling, neural networks, streaming
@article{Jorge2021b, title = {Live Streaming Speech Recognition Using Deep Bidirectional LSTM Acoustic Models and Interpolated Language Models}, author = {Jorge, Javier and Giménez, Adrià and Silvestre-Cerdà, Joan Albert and Civera, Jorge and Sanchis, Albert and Juan, Alfons}, doi = {10.1109/TASLP.2021.3133216}, year = {2021}, date = {2021-11-23}, journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing}, volume = {30}, pages = {148--161}, abstract = {Although Long-Short Term Memory (LSTM) networks and deep Transformers are now extensively used in offline ASR, it is unclear how best offline systems can be adapted to work with them under the streaming setup. After gaining considerable experience in this regard in recent years, in this paper we show how an optimized, low-latency streaming decoder can be built in which bidirectional LSTM acoustic models, together with general interpolated language models, can be nicely integrated with minimal performance degradation. In brief, our streaming decoder consists of a one-pass, real-time search engine relying on a limited-duration window sliding over time and a number of ad hoc acoustic and language model pruning techniques. Extensive empirical assessment is provided on truly streaming tasks derived from the well-known LibriSpeech and TED talks datasets, as well as from TV shows from a large Spanish broadcasting station.}, keywords = {acoustic modelling, Automatic Speech Recognition, decoding, language modelling, neural networks, streaming}, pubstate = {published}, tppubtype = {article} } |
Pérez, Alejandro; Garcés Díaz-Munío, Gonçal; Giménez, Adrià; Silvestre-Cerdà, Joan Albert; Sanchis, Albert; Civera, Jorge; Jiménez, Manuel; Turró, Carlos; Juan, Alfons
Towards cross-lingual voice cloning in higher education
Journal Article
Engineering Applications of Artificial Intelligence, 105, pp. 104413, 2021.
Tags: cross-lingual voice conversion, educational resources, multilinguality, OER, text-to-speech
@article{Pérez2021, title = {Towards cross-lingual voice cloning in higher education}, author = {Alejandro Pérez and Garcés Díaz-Munío, Gonçal and Adrià Giménez and Silvestre-Cerdà, Joan Albert and Albert Sanchis and Jorge Civera and Manuel Jiménez and Carlos Turró and Alfons Juan}, url = {https://doi.org/10.1016/j.engappai.2021.104413}, year = {2021}, date = {2021-10-01}, journal = {Engineering Applications of Artificial Intelligence}, volume = {105}, pages = {104413}, abstract = {The rapid progress of modern AI tools for automatic speech recognition and machine translation is leading to a progressive cost reduction to produce publishable subtitles for educational videos in multiple languages. Similarly, text-to-speech technology is experiencing large improvements in terms of quality, flexibility and capabilities. In particular, state-of-the-art systems are now capable of seamlessly dealing with multiple languages and speakers in an integrated manner, thus enabling lecturer's voice cloning in languages she/he might not even speak. This work is to report the experience gained on using such systems at the Universitat Politècnica de València (UPV), mainly as a guidance for other educational organizations willing to conduct similar studies. It builds on previous work on the UPV's main repository of educational videos, MediaUPV, to produce multilingual subtitles at scale and low cost. Here, a detailed account is given on how this work has been extended to also allow for massive machine dubbing of MediaUPV. This includes collecting 59 hours of clean speech data from UPV’s academic staff, and extending our production pipeline of subtitles with a state-of-the-art multilingual and multi-speaker text-to-speech system trained from the collected data. Our main result comes from an extensive, subjective evaluation of this system by lecturers contributing to data collection. In brief, it is shown that text-to-speech technology is not only mature enough for its application to MediaUPV, but also needed as soon as possible by students to improve its accessibility and bridge language barriers.}, keywords = {cross-lingual voice conversion, educational resources, multilinguality, OER, text-to-speech}, pubstate = {published}, tppubtype = {article} } |
Jorge, Javier; Giménez, Adrià; Baquero-Arnal, Pau; Iranzo-Sánchez, Javier; Pérez-González-de-Martos, Alejandro; Garcés Díaz-Munío, Gonçal V; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons
MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge
Inproceedings
Proc. of IberSPEECH 2021, pp. 118–122, Valladolid (Spain), 2021.
Tags: Automatic Speech Recognition, Natural Language Processing, streaming
@inproceedings{Jorge2021, title = {MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge}, author = {Javier Jorge and Adrià Giménez and Pau Baquero-Arnal and Javier Iranzo-Sánchez and Alejandro Pérez-González-de-Martos and Garcés Díaz-Munío, Gonçal V. and Joan Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan}, doi = {10.21437/IberSPEECH.2021-25}, year = {2021}, date = {2021-03-24}, booktitle = {Proc. of IberSPEECH 2021}, pages = {118--122}, address = {Valladolid (Spain)}, abstract = {1st place in IberSpeech-RTVE 2020 TV Speech-to-Text Challenge. [EN] This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzin-RTVE 2020 Speech-to-Text Challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid BLSTM-HMM ASR system using streaming one-pass decoding with a context window of 1.5 seconds and a linear combination of an n-gram, an LSTM, and a Transformer language model (LM). The acoustic model was trained on nearly 4,000 hours of speech data from different sources, using the MLLP's transLectures-UPV toolkit (TLK) and TensorFlow; whilst LMs were trained using SRILM (n-gram), CUED-RNNLM (LSTM) and Fairseq (Transformer), with up to 102G tokens. This system achieved 11.6% and 16.0% WER on the test-2018 and test-2020 sets, respectively. As it is streaming-enabled, it could be put into production environments for automatic captioning of live media streams, with a theoretical delay of 1.5 seconds. Along with the primary system, we also submitted three contrastive systems. From these, we highlight the system c2-streaming_600ms_t that, following the same configuration of the primary one, but using a smaller context window of 0.6 seconds and a Transformer LM, scored 12.3% and 16.9% WER points respectively on the same test sets, with a measured empirical latency of 0.81±0.09 seconds (mean±stdev). That is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative. [CA] "Sistemes de reconeixement automàtic de la parla en castellà de MLLP-VRAIN per a la competició Albayzin-RTVE 2020 Speech-To-Text Challenge": En aquest article, es descriuen els sistemes de reconeixement automàtic de la parla (RAP) creats pel grup d'investigació MLLP-VRAIN de la Universitat Politècnica de València per a la competició Albayzin-RTVE 2020 Speech-to-Text Challenge. El sistema primari (p-streaming_1500ms_nlt) és un sistema de RAP híbrid BLSTM-HMM amb descodificació en temps real en una passada amb una finestra de context d'1,5 segons i una combinació lineal de models de llenguatge (ML) d'n-grames, LSTM i Transformer. El model acústic s'ha entrenat amb vora 4000 hores de parla transcrita de diferents fonts, usant el transLectures-UPV toolkit (TLK) del grup MLLP i TensorFlow; mentre que els ML s'han entrenat amb SRILM (n-grames), CUED-RNNLM (LSTM) i Fairseq (Transformer), amb 102G paraules (tokens). Aquest sistema ha obtingut 11,6 % i 16,0 % de WER en els conjunts test-2018 i test-2020, respectivament. És un sistema amb capacitat de temps real, que pot desplegar-se en producció per a subtitulació automàtica de fluxos audiovisuals en directe, amb un retard teòric d'1,5 segons. A banda del sistema primari, s'han presentat tres sistemes contrastius. D'aquests, destaquem el sistema c2-streaming_600ms_t que, amb la mateixa configuració que el sistema primari, però amb una finestra de context més reduïda de 0,6 segons i un ML Transformer, ha obtingut 12,3 % i 16,9 % de WER, respectivament, sobre els mateixos conjunts, amb una latència empírica mesurada de 0,81±0,09 segons (mitjana±desv). És a dir, s'han obtingut latències punteres per a subtitulació automàtica en directe d'alta qualitat amb una degradació del WER petita, del 6 % relatiu.}, keywords = {Automatic Speech Recognition, Natural Language Processing, streaming}, pubstate = {published}, tppubtype = {inproceedings} } |
Pérez-González-de-Martos, Alejandro; Iranzo-Sánchez, Javier; Giménez Pastor, Adrià; Jorge, Javier; Silvestre-Cerdà, Joan-Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons
Towards simultaneous machine interpretation
Inproceedings
Proc. Interspeech 2021, pp. 2277–2281, Brno (Czech Republic), 2021.
Tags: cross-lingual voice cloning, incremental text-to-speech, simultaneous machine interpretation, speech-to-speech translation
@inproceedings{Pérez-González-de-Martos2021, title = {Towards simultaneous machine interpretation}, author = {Alejandro Pérez-González-de-Martos and Javier Iranzo-Sánchez and Giménez Pastor, Adrià and Javier Jorge and Joan-Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan}, doi = {10.21437/Interspeech.2021-201}, year = {2021}, date = {2021-01-01}, booktitle = {Proc. Interspeech 2021}, journal = {Proc. Interspeech 2021}, pages = {2277--2281}, address = {Brno (Czech Republic)}, abstract = {Automatic speech-to-speech translation (S2S) is one of the most challenging speech and language processing tasks, especially when considering its application to real-time settings. Recent advances in streaming Automatic Speech Recognition (ASR), simultaneous Machine Translation (MT) and incremental neural Text-To-Speech (TTS) make it possible to develop real-time cascade S2S systems with greatly improved accuracy. On the way to simultaneous machine interpretation, a state-of-the-art cascade streaming S2S system is described and empirically assessed in the simultaneous interpretation of European Parliament debates. We pay particular attention to the TTS component, particularly in terms of speech naturalness under a variety of response-time settings, as well as in terms of speaker similarity for its cross-lingual voice cloning capabilities.}, keywords = {cross-lingual voice cloning, incremental text-to-speech, simultaneous machine interpretation, speech-to-speech translation}, pubstate = {published}, tppubtype = {inproceedings} } |
Garcés Díaz-Munío, Gonçal V; Silvestre-Cerdà, Joan Albert; Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Baquero-Arnal, Pau; Roselló, Nahuel; Pérez-González-de-Martos, Alejandro; Civera, Jorge; Sanchis, Albert; Juan, Alfons
Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization
Inproceedings
Proc. Interspeech 2021, pp. 3695–3699, Brno (Czech Republic), 2021.
Tags: Automatic Speech Recognition, speech corpus, speech data filtering, speech data verbatimization
@inproceedings{Garcés2021, title = {Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization}, author = {Garcés Díaz-Munío, Gonçal V. and Silvestre-Cerdà, Joan Albert and Javier Jorge and Adrià Giménez and Javier Iranzo-Sánchez and Pau Baquero-Arnal and Nahuel Roselló and Alejandro Pérez-González-de-Martos and Jorge Civera and Albert Sanchis and Alfons Juan}, url = {https://www.mllp.upv.es/wp-content/uploads/2021/09/europarl-asr-presentation-extended.pdf https://www.youtube.com/watch?v=Tc0gNSDdnQg&list=PLlePn-Yanvnc_LRhgmmaNmH12Bwm6BRsZ https://paperswithcode.com/paper/europarl-asr-a-large-corpus-of-parliamentary https://github.com/mllpresearch/Europarl-ASR}, doi = {10.21437/Interspeech.2021-1905}, year = {2021}, date = {2021-01-01}, booktitle = {Proc. Interspeech 2021}, journal = {Proc. Interspeech 2021}, pages = {3695--3699}, address = {Brno (Czech Republic)}, abstract = {[EN] We introduce Europarl-ASR, a large speech and text corpus of parliamentary debates including 1300 hours of transcribed speeches and 70 million tokens of text in English extracted from European Parliament sessions. The training set is labelled with the Parliament’s non-fully-verbatim official transcripts, time-aligned. As verbatimness is critical for acoustic model training, we also provide automatically noise-filtered and automatically verbatimized transcripts of all speeches based on speech data filtering and verbatimization techniques. Additionally, 18 hours of transcribed speeches were manually verbatimized to build reliable speaker-dependent and speaker-independent development/test sets for streaming ASR benchmarking. The availability of manual non-verbatim and verbatim transcripts for dev/test speeches makes this corpus useful for the assessment of automatic filtering and verbatimization techniques. This paper describes the corpus and its creation, and provides off-line and streaming ASR baselines for both the speaker-dependent and speaker-independent tasks using the three training transcription sets. The corpus is publicly released under an open licence. [CA] "Europarl-ASR: Un extens corpus parlamentari de referència per a reconeixement de la parla i filtratge/literalització de transcripcions": Presentem Europarl-ASR, un extens corpus de veu i text de debats parlamentaris amb 1300 hores d'intervencions transcrites i 70 milions de paraules de text en anglés extrets de sessions del Parlament Europeu. Les transcripcions oficials del Parlament Europeu, no literals, s'han sincronitzat per a tot el conjunt d'entrenament. Com que l'entrenament de models acústics requereix transcripcions com més literals millor, també s'han inclòs transcripcions filtrades i transcripcions literalitzades de totes les intervencions, basades en tècniques de filtratge i literalització automàtics. A més, s'han inclòs 18 hores de transcripcions literals revisades manualment per definir dos conjunts de validació i avaluació de referència per a reconeixement automàtic de la parla en temps real, amb oradors coneguts i amb oradors desconeguts. Pel fet de disposar de transcripcions literals i no literals, aquest corpus és també ideal per a l'anàlisi de tècniques de filtratge i de literalització. En aquest article, es descriu la creació del corpus i es proporcionen mesures de referència de reconeixement automàtic de la parla en temps real i en diferit, amb oradors coneguts i amb oradors desconeguts, usant els tres conjunts de transcripcions d'entrenament. El corpus es fa públic amb una llicència oberta.}, keywords = {Automatic Speech Recognition, speech corpus, speech data filtering, speech data verbatimization}, pubstate = {published}, tppubtype = {inproceedings} } |
Iranzo-Sánchez, Javier; Jorge, Javier; Baquero-Arnal, Pau; Silvestre-Cerdà, Joan Albert; Giménez, Adrià; Civera, Jorge; Sanchis, Albert; Juan, Alfons Streaming cascade-based speech translation leveraged by a direct segmentation model Journal Article Neural Networks, 142, pp. 303–315, 2021. @article{Iranzo-Sánchez2021, title = {Streaming cascade-based speech translation leveraged by a direct segmentation model}, author = {Javier Iranzo-Sánchez and Javier Jorge and Pau Baquero-Arnal and Silvestre-Cerdà, Joan Albert and Adrià Giménez and Jorge Civera and Albert Sanchis and Alfons Juan}, doi = {10.1016/j.neunet.2021.05.013}, year = {2021}, date = {2021-01-01}, journal = {Neural Networks}, volume = {142}, pages = {303--315}, abstract = {The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. Nowadays, state-of-the-art ST systems are populated with deep neural networks that are conceived to work in an offline setup in which the audio input to be translated is fully available in advance. However, a streaming setup defines a completely different picture, in which an unbounded audio input gradually becomes available and at the same time the translation needs to be generated under real-time constraints. In this work, we present a state-of-the-art streaming ST system in which neural-based models integrated in the ASR and MT components are carefully adapted in terms of their training and decoding procedures in order to run under a streaming setup. In addition, a direct segmentation model that adapts the continuous ASR output to the capacity of simultaneous MT systems trained at the sentence level is introduced to guarantee low latency while preserving the translation quality of the complete ST system. The resulting ST system is thoroughly evaluated on the real-life streaming Europarl-ST benchmark to gauge the trade-off between quality and latency for each component individually as well as for the complete ST system.}, keywords = {Automatic Speech Recognition, Cascade System, Deep Neural Networks, Hybrid System, Machine Translation, Segmentation Model, Speech Translation, streaming}, pubstate = {published}, tppubtype = {article} } |
2020 |
Iranzo-Sánchez, Javier; Giménez Pastor, Adrià; Silvestre-Cerdà, Joan Albert; Baquero-Arnal, Pau; Civera Saiz, Jorge; Juan, Alfons Direct Segmentation Models for Streaming Speech Translation Inproceedings Proc. of 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pp. 2599–2611, 2020. @inproceedings{Iranzo-Sánchez2020, title = {Direct Segmentation Models for Streaming Speech Translation}, author = {Javier Iranzo-Sánchez and Giménez Pastor, Adrià and Joan Albert Silvestre-Cerdà and Pau Baquero-Arnal and Jorge Civera Saiz and Alfons Juan}, url = {http://dx.doi.org/10.18653/v1/2020.emnlp-main.206}, year = {2020}, date = {2020-01-01}, booktitle = {Proc. of 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)}, pages = {2599--2611}, abstract = {The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. These systems are usually connected by a segmenter that splits the ASR output into, hopefully, semantically self-contained chunks to be fed into the MT system. This is especially challenging in the case of streaming ST, where latency requirements must also be taken into account. This work proposes novel segmentation models for streaming ST that incorporate not only textual, but also acoustic information to decide when the ASR output is split into a chunk. An extensive and thorough experimental setup is carried out on the Europarl-ST dataset to prove the contribution of acoustic information to the performance of the segmentation model in terms of BLEU score in a streaming ST scenario. Finally, comparative results with previous work also show the superiority of the segmentation models proposed in this work.}, keywords = {Segmentation, Speech Translation, streaming}, pubstate = {published}, tppubtype = {inproceedings} } |
Baquero-Arnal, Pau; Jorge, Javier; Giménez, Adrià; Silvestre-Cerdà, Joan Albert; Iranzo-Sánchez, Javier; Sanchis, Albert; Civera, Jorge; Juan, Alfons Improved Hybrid Streaming ASR with Transformer Language Models Inproceedings Proc. of 21st Annual Conf. of the Intl. Speech Communication Association (InterSpeech 2020), pp. 2127–2131, Shanghai (China), 2020. @inproceedings{Baquero-Arnal2020, title = {Improved Hybrid Streaming ASR with Transformer Language Models}, author = {Baquero-Arnal, Pau and Jorge, Javier and Giménez, Adrià and Silvestre-Cerdà, Joan Albert and Iranzo-Sánchez, Javier and Sanchis, Albert and Civera, Jorge and Juan, Alfons}, url = {http://dx.doi.org/10.21437/Interspeech.2020-2770}, year = {2020}, date = {2020-01-01}, booktitle = {Proc. of 21st Annual Conf. of the Intl. Speech Communication Association (InterSpeech 2020)}, pages = {2127--2131}, address = {Shanghai (China)}, abstract = {Streaming ASR is gaining momentum due to its wide applicability, though it is still unclear how best to come close to the accuracy of state-of-the-art off-line ASR systems when the output must come within a short delay after the incoming audio stream. Following our previous work on streaming one-pass decoding with hybrid ASR systems and LSTM language models, in this work we report further improvements by replacing LSTMs with Transformer models. First, two key ideas are discussed so as to run these models fast during inference. Then, empirical results on LibriSpeech and TED-LIUM are provided showing that Transformer language models lead to improved recognition rates on both tasks. ASR systems obtained in this work can be seamlessly transferred to a streaming setup with minimal quality losses. Indeed, to the best of our knowledge, no better results have been reported on these tasks when assessed under a streaming setup.}, keywords = {hybrid ASR, language models, streaming, Transformer}, pubstate = {published}, tppubtype = {inproceedings} } |
Iranzo-Sánchez, Javier; Silvestre-Cerdà, Joan Albert; Jorge, Javier; Roselló, Nahuel; Giménez, Adrià; Sanchis, Albert; Civera, Jorge; Juan, Alfons Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates Inproceedings Proc. of 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020), pp. 8229–8233, Barcelona (Spain), 2020. @inproceedings{Iranzo2020, title = {Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates}, author = {Javier Iranzo-Sánchez and Joan Albert Silvestre-Cerdà and Javier Jorge and Nahuel Roselló and Adrià Giménez and Albert Sanchis and Jorge Civera and Alfons Juan}, url = {https://arxiv.org/abs/1911.03167 https://paperswithcode.com/paper/europarl-st-a-multilingual-corpus-for-speech https://www.mllp.upv.es/europarl-st/}, doi = {10.1109/ICASSP40776.2020.9054626}, year = {2020}, date = {2020-01-01}, booktitle = {Proc. of 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020)}, pages = {8229--8233}, address = {Barcelona (Spain)}, abstract = {Current research into spoken language translation (SLT), or speech-to-text translation, is often hampered by the lack of specific data resources for this task, as currently available SLT datasets are restricted to a limited set of language pairs. In this paper we present Europarl-ST, a novel multilingual SLT corpus containing paired audio-text samples for SLT from and into 6 European languages, for a total of 30 different translation directions. This corpus has been compiled using the debates held in the European Parliament in the period between 2008 and 2012. This paper describes the corpus creation process and presents a series of automatic speech recognition, machine translation and spoken language translation experiments that highlight the potential of this new resource. The corpus is released under a Creative Commons license and is freely accessible and downloadable.}, keywords = {Automatic Speech Recognition, Machine Translation, Multilingual Corpus, Speech Translation, Spoken Language Translation}, pubstate = {published}, tppubtype = {inproceedings} } |
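The figure of 30 translation directions quoted in the Europarl-ST abstract above follows from counting ordered (source, target) pairs over 6 languages: 6 × 5 = 30. A minimal Python sketch of that count is given below; the labels `L1` to `L6` are placeholders chosen for illustration and are not taken from the abstract.

```python
from itertools import permutations

# Placeholder labels (assumption): the abstract gives only the number of
# languages, not their names.
languages = ["L1", "L2", "L3", "L4", "L5", "L6"]

# A translation direction is an ordered (source, target) pair with
# source != target, so the count is n * (n - 1).
directions = list(permutations(languages, 2))

print(len(directions))
```

Because direction matters (L1 to L2 is a different task from L2 to L1), ordered pairs rather than unordered combinations are counted.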
Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons LSTM-Based One-Pass Decoder for Low-Latency Streaming Inproceedings Proc. of 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020), pp. 7814–7818, Barcelona (Spain), 2020. @inproceedings{Jorge2020, title = {LSTM-Based One-Pass Decoder for Low-Latency Streaming}, author = {Javier Jorge and Adrià Giménez and Javier Iranzo-Sánchez and Joan Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan}, url = {https://www.mllp.upv.es/wp-content/uploads/2020/01/jorge2020_preprint.pdf https://doi.org/10.1109/ICASSP40776.2020.9054267}, year = {2020}, date = {2020-01-01}, booktitle = {Proc. of 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020)}, pages = {7814--7818}, address = {Barcelona (Spain)}, abstract = {Current state-of-the-art models based on Long Short-Term Memory (LSTM) networks have been extensively used in ASR to improve performance. However, using LSTMs under a streaming setup is not straightforward due to real-time constraints. In this paper we present a novel streaming decoder that includes a bidirectional LSTM acoustic model as well as a unidirectional LSTM language model to perform the decoding efficiently while keeping the performance comparable to that of an off-line setup. We perform a one-pass decoding using a sliding window scheme for a bidirectional LSTM acoustic model and an LSTM language model. This has been implemented and assessed under a pure streaming setup, and deployed into our production systems. We report WER and latency figures for the well-known LibriSpeech and TED-LIUM tasks, obtaining competitive WER results with low-latency responses.}, keywords = {acoustic modeling, Automatic Speech Recognition, decoding, Language Modeling, streaming}, pubstate = {published}, tppubtype = {inproceedings} } |
2018 |
Matusov, Evgeny; Wilken, Patrick; Bahar, Parnia; Schamper, Julian; Golik, Pavel; Zeyer, Albert; Silvestre-Cerdà, Joan Albert; Martínez-Villaronga, Adrià; Pesch, Hendrick; Peter, Jan-Thorsten Neural Speech Translation at AppTek Inproceedings Proc. of 15th Intl. Workshop on Spoken Language Translation (IWSLT 2018), pp. 104–111, Hong Kong, 2018. @inproceedings{Matusov18, title = {Neural Speech Translation at AppTek}, author = {Evgeny Matusov AND Patrick Wilken AND Parnia Bahar AND Julian Schamper AND Pavel Golik AND Albert Zeyer AND Joan Albert Silvestre-Cerdà AND Adrià Martínez-Villaronga AND Hendrick Pesch AND Jan-Thorsten Peter}, url = {https://www.mllp.upv.es/wp-content/uploads/2019/07/iwslt18.pdf https://workshop2018.iwslt.org/downloads/Proceedings_IWSLT_2018.pdf}, year = {2018}, date = {2018-07-01}, booktitle = {Proc. of 15th Intl. Workshop on Spoken Language Translation (IWSLT 2018)}, pages = {104--111}, address = {Hong Kong}, keywords = {Automatic Speech Recognition, Machine Translation}, pubstate = {published}, tppubtype = {inproceedings} } |
Jorge, Javier; Martínez-Villaronga, Adrià; Golik, Pavel; Giménez, Adrià; Silvestre-Cerdà, Joan Albert; Doetsch, Patrick; Císcar, Vicent Andreu; Ney, Hermann; Juan, Alfons; Sanchis, Albert MLLP-UPV and RWTH Aachen Spanish ASR Systems for the IberSpeech-RTVE 2018 Speech-to-Text Transcription Challenge Inproceedings Proc. of IberSPEECH 2018: 10th Jornadas en Tecnologías del Habla and 6th Iberian SLTech Workshop, pp. 257–261, Barcelona (Spain), 2018. @inproceedings{Jorge2018, title = {MLLP-UPV and RWTH Aachen Spanish ASR Systems for the IberSpeech-RTVE 2018 Speech-to-Text Transcription Challenge}, author = {Jorge, Javier and Martínez-Villaronga, Adrià and Golik, Pavel and Giménez, Adrià and Silvestre-Cerdà, Joan Albert and Doetsch, Patrick and Císcar, Vicent Andreu and Ney, Hermann and Juan, Alfons and Sanchis, Albert}, doi = {10.21437/IberSPEECH.2018-54}, year = {2018}, date = {2018-01-01}, booktitle = {Proc. of IberSPEECH 2018: 10th Jornadas en Tecnologías del Habla and 6th Iberian SLTech Workshop}, pages = {257--261}, address = {Barcelona (Spain)}, abstract = {This paper describes the Automatic Speech Recognition systems built by the MLLP research group of Universitat Politècnica de València and the HLTPR research group of RWTH Aachen for the IberSpeech-RTVE 2018 Speech-to-Text Transcription Challenge. We participated in both the closed and the open training conditions. The best system built for the closed conditions was a hybrid BLSTM-HMM ASR system using one-pass decoding with a combination of an RNN LM and show-adapted n-gram LMs. It was trained on a set of reliable speech data extracted from the train and dev1 sets using the MLLP’s transLectures-UPV toolkit (TLK) and TensorFlow. This system achieved 20.0% WER on the dev2 set. For the open conditions, we used approx. 3800 hours of out-of-domain training data from multiple sources and trained a one-pass hybrid BLSTM-HMM ASR system using the open-source tools RASR and RETURNN developed at RWTH Aachen. This system scored 15.6% WER on the dev2 set. The highlights of these systems include robust speech data filtering for acoustic model training and show-specific language modelling.}, keywords = {Automatic Speech Recognition, Iberspeech-RTVE-Challenge2018, IberSpeech2018, Speech-to-Text}, pubstate = {published}, tppubtype = {inproceedings} } |
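The WER figures quoted in the abstract above (20.0% and 15.6% on dev2) are word error rates. WER is conventionally defined as the word-level edit distance between a reference transcript and the recogniser output, divided by the number of reference words. The sketch below illustrates that definition only: it assumes plain whitespace tokenisation, the function name `wer` and the sample sentences are hypothetical, and it is not the scoring tool used in the challenge.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn ref[:i] into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(wer("the cat sat down", "the cat sit down"))
```

Since insertions are counted against a fixed reference length, WER can exceed 100% when the hypothesis is much longer than the reference.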
2016 |
Silvestre-Cerdà, Joan Albert; Juan, Alfons; Civera, Jorge Different Contributions to Cost-Effective Transcription and Translation of Video Lectures Inproceedings Proc. of IX Jornadas en Tecnología del Habla and V Iberian SLTech Workshop (IberSpeech 2016), pp. 313–319, Lisbon (Portugal), 2016, ISBN: 978-3-319-49168-4. @inproceedings{Silvestre-Cerdà2016b, title = {Different Contributions to Cost-Effective Transcription and Translation of Video Lectures}, author = {Joan Albert Silvestre-Cerdà and Alfons Juan and Jorge Civera}, url = {http://www.mllp.upv.es/wp-content/uploads/2016/11/poster.pdf http://www.mllp.upv.es/wp-content/uploads/2016/11/paper.pdf http://hdl.handle.net/10251/62194}, isbn = {978-3-319-49168-4}, year = {2016}, date = {2016-11-24}, booktitle = {Proc. of IX Jornadas en Tecnología del Habla and V Iberian SLTech Workshop (IberSpeech 2016)}, pages = {313--319}, address = {Lisbon (Portugal)}, abstract = {In recent years, on-line multimedia repositories have experienced a strong growth that has consolidated them as essential knowledge assets, especially in the area of education, where large repositories of video lectures have been built in order to complement or even replace traditional teaching methods. However, most of these video lectures are neither transcribed nor translated due to a lack of cost-effective solutions to do so in a way that gives accurate enough results. Solutions of this kind are clearly necessary in order to make these lectures accessible to speakers of different languages and to people with hearing disabilities, among many other benefits and applications. For this reason, the main aim of this thesis is to develop a cost-effective solution capable of transcribing and translating video lectures to a reasonable degree of accuracy. More specifically, we address the integration of state-of-the-art techniques in Automatic Speech Recognition and Machine Translation into large video lecture repositories to generate high-quality multilingual video subtitles without human intervention and at a reduced computational cost. Also, we explore the potential benefits of the exploitation of the information that we know a priori about these repositories, that is, lecture-specific knowledge such as speaker, topic or slides, to create specialised, in-domain transcription and translation systems by means of massive adaptation techniques. The proposed solutions have been tested in real-life scenarios by carrying out several objective and subjective evaluations, obtaining very positive results. The main outcome derived from this multidisciplinary thesis, the transLectures-UPV Platform, has been publicly released as open-source software, and, at the time of writing, it is serving automatic transcriptions and translations for several thousands of video lectures in many Spanish and European universities and institutions.}, keywords = {Automatic Speech Recognition, Automatic transcription and translation, Machine Translation, Video Lectures}, pubstate = {published}, tppubtype = {inproceedings} } |
Silvestre-Cerdà, Joan Albert Different Contributions to Cost-Effective Transcription and Translation of Video Lectures PhD Thesis Universitat Politècnica de València, 2016, (Advisors: Alfons Juan Ciscar and Jorge Civera Saiz). @phdthesis{Silvestre-Cerdà2016, title = {Different Contributions to Cost-Effective Transcription and Translation of Video Lectures}, author = {Joan Albert Silvestre-Cerdà}, url = {http://hdl.handle.net/10251/62194 http://www.mllp.upv.es/wp-content/uploads/2016/01/slides.pdf http://www.mllp.upv.es/wp-content/uploads/2016/01/thesis.pdf http://www.mllp.upv.es/phd-thesis-different-contributions-to-cost-effective-transcription-and-translation-of-video-lectures-by-joan-albert-silvestre-cerda-abstract/}, year = {2016}, date = {2016-01-27}, school = {Universitat Politècnica de València}, abstract = {In recent years, online multimedia repositories have experienced a strong growth that has consolidated them as essential knowledge assets, especially in the area of education, where large repositories of video lectures have been built in order to complement or even replace traditional teaching methods. However, most of these video lectures are neither transcribed nor translated due to a lack of cost-effective solutions to do so in a way that provides accurate enough results. Solutions of this kind are clearly necessary in order to make these lectures accessible to speakers of different languages and to people with hearing disabilities. They would also facilitate lecture searchability and analysis functions, such as classification, recommendation or plagiarism detection, as well as the development of advanced educational functionalities like content summarisation to assist student note-taking. For this reason, the main aim of this thesis is to develop a cost-effective solution capable of transcribing and translating video lectures to a reasonable degree of accuracy. More specifically, we address the integration of state-of-the-art techniques in Automatic Speech Recognition and Machine Translation into large video lecture repositories to generate high-quality multilingual video subtitles without human intervention and at a reduced computational cost. Also, we explore the potential benefits of the exploitation of the information that we know a priori about these repositories, that is, lecture-specific knowledge such as speaker, topic or slides, to create specialised, in-domain transcription and translation systems by means of massive adaptation techniques. The proposed solutions have been tested in real-life scenarios by carrying out several objective and subjective evaluations, obtaining very positive results. The main technological outcome derived from this thesis, the transLectures-UPV Platform (TLP), has been publicly released as open-source software, and, at the time of writing, it is serving automatic transcriptions and translations for several thousands of video lectures in Spanish and European universities and institutions.}, note = {Advisors: Alfons Juan Ciscar and Jorge Civera Saiz}, keywords = {Automatic Speech Recognition, Education, Language Technologies, Machine Translation, Massive Adaptation, Multilingualism, video lecture repositories, Video Lectures}, pubstate = {published}, tppubtype = {phdthesis} } |
2015 |
Valor Miró, Juan Daniel ; Silvestre-Cerdà, Joan Albert ; Civera, Jorge ; Turró, Carlos ; Juan, Alfons Efficient Generation of High-Quality Multilingual Subtitles for Video Lecture Repositories Inproceedings Proc. of 10th European Conf. on Technology Enhanced Learning (EC-TEL 2015), pp. 485–490, Toledo (Spain), 2015, ISBN: 978-3-319-24258-3. Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, Docencia en Red, Efficient video subtitling, Polimedia, Statistical machine translation, video lecture repositories @inproceedings{valor2015efficient, title = {Efficient Generation of High-Quality Multilingual Subtitles for Video Lecture Repositories}, author = {Valor Miró, Juan Daniel and Silvestre-Cerdà, Joan Albert and Civera, Jorge and Turró, Carlos and Juan, Alfons}, url = {http://link.springer.com/chapter/10.1007/978-3-319-24258-3_44 http://www.mllp.upv.es/wp-content/uploads/2016/03/paper.pdf }, isbn = {978-3-319-24258-3}, year = {2015}, date = {2015-09-17}, booktitle = {Proc. of 10th European Conf. on Technology Enhanced Learning (EC-TEL 2015)}, pages = {485--490}, address = {Toledo (Spain)}, abstract = {Video lectures are a valuable educational tool in higher education to support or replace face-to-face lectures in active learning strategies. In 2007 the Universitat Politècnica de València (UPV) implemented its video lecture capture system, resulting in a high quality educational video repository, called poliMedia, with more than 10,000 mini lectures created by 1,373 lecturers. Also, in the framework of the European project transLectures, UPV has automatically generated transcriptions and translations in Spanish, Catalan and English for all videos included in the poliMedia video repository. transLectures’s objective responds to the widely-recognised need for subtitles to be provided with video lectures, as an essential service for non-native speakers and hearing impaired persons, and to allow advanced repository functionalities. 
Although high-quality automatic transcriptions and translations were generated in transLectures, they were not error-free. For this reason, lecturers need to manually review video subtitles to guarantee the absence of errors. The aim of this study is to evaluate the efficiency of the manual review process from automatic subtitles in comparison with the conventional generation of video subtitles from scratch. The reported results clearly indicate the convenience of providing automatic subtitles as a first step in the generation of video subtitles and the significant savings in time of up to almost 75% involved in reviewing subtitles.}, keywords = {Automatic Speech Recognition, Docencia en Red, Efficient video subtitling, Polimedia, Statistical machine translation, video lecture repositories}, pubstate = {published}, tppubtype = {inproceedings} } |
Pérez González de Martos, Alejandro ; Silvestre-Cerdà, Joan Albert ; Valor Miró, Juan Daniel ; Civera, Jorge ; Juan, Alfons MLLP Transcription and Translation Platform Miscellaneous 2015, (Short paper for demo presentation accepted at 10th European Conf. on Technology Enhanced Learning (EC-TEL 2015), Toledo (Spain), 2015.). Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, Docencia en Red, Document translation, Efficient video subtitling, Machine Translation, MLLP, Post-editing, Video Lectures @misc{mllpttp, title = {MLLP Transcription and Translation Platform}, author = {Pérez González de Martos, Alejandro and Silvestre-Cerdà, Joan Albert and Valor Miró, Juan Daniel and Civera, Jorge and Juan, Alfons}, url = {http://hdl.handle.net/10251/65747 http://www.mllp.upv.es/wp-content/uploads/2015/09/ttp_platform_demo_ectel2015.pdf http://ectel2015.httc.de/index.php?id=722}, year = {2015}, date = {2015-09-16}, booktitle = {Tenth European Conference On Technology Enhanced Learning (EC-TEL 2015)}, abstract = {This paper briefly presents the main features of MLLP’s Transcription and Translation Platform, which uses state-of-the-art automatic speech recognition and machine translation systems to generate multilingual subtitles of educational audiovisual and textual content. It has proven to reduce user effort up to 1/3 of the time needed to generate transcriptions and translations from scratch.}, note = {Short paper for demo presentation accepted at 10th European Conf. 
on Technology Enhanced Learning (EC-TEL 2015), Toledo (Spain), 2015.}, keywords = {Automatic Speech Recognition, Docencia en Red, Document translation, Efficient video subtitling, Machine Translation, MLLP, Post-editing, Video Lectures}, pubstate = {published}, tppubtype = {misc} } |
Valor Miró, Juan Daniel ; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Turró, Carlos; Juan, Alfons Efficiency and usability study of innovative computer-aided transcription strategies for video lecture repositories Journal Article Speech Communication, 74 , pp. 65–75, 2015, ISSN: 0167-6393. Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, Computer-assisted transcription, Interface design strategies, Usability study, video lecture repositories @article{Valor201565, title = {Efficiency and usability study of innovative computer-aided transcription strategies for video lecture repositories}, author = {Valor Miró, Juan Daniel and Joan Albert Silvestre-Cerdà and Jorge Civera and Carlos Turró and Alfons Juan}, url = {http://www.sciencedirect.com/science/article/pii/S0167639315001016 http://www.mllp.upv.es/wp-content/uploads/2016/03/paper1.pdf}, issn = {0167-6393}, year = {2015}, date = {2015-01-01}, journal = {Speech Communication}, volume = {74}, pages = {65--75}, abstract = {Video lectures are widely used in education to support and complement face-to-face lectures. However, the utility of these audiovisual assets could be further improved by adding subtitles that can be exploited to incorporate added-value functionalities such as searchability, accessibility, translatability, note-taking, and discovery of content-related videos, among others. Today, automatic subtitles are prone to error, and need to be reviewed and post-edited in order to ensure that what students see on-screen are of an acceptable quality. This work investigates different user interface design strategies for this post-editing task to discover the best way to incorporate automatic transcription technologies into large educational video repositories. Our three-phase study involved lecturers from the Universitat Politècnica de València (UPV) with videos available on the poliMedia video lecture repository, which is currently over 10,000 video objects. 
Simply by conventional post-editing automatic transcriptions users almost reduced to half the time that would require to generate the transcription from scratch. As expected, this study revealed that the time spent by lecturers reviewing automatic transcriptions correlated directly with the accuracy of said transcriptions. However, it is also shown that the average time required to perform each individual editing operation could be precisely derived and could be applied in the definition of a user model. In addition, the second phase of this study presents a transcription review strategy based on confidence measures (CM) and compares it to the conventional post-editing strategy. Finally, a third strategy resulting from the combination of that based on CM with massive adaptation techniques for automatic speech recognition (ASR), achieved to improve the transcription review efficiency in comparison with the two aforementioned strategies.}, keywords = {Automatic Speech Recognition, Computer-assisted transcription, Interface design strategies, Usability study, video lecture repositories}, pubstate = {published}, tppubtype = {article} } |
2014 |
Martínez-Villaronga, A; del-Agua, M A; Silvestre-Cerdà, J A; Andrés-Ferrer, J; Juan, A Language model adaptation for lecture transcription by document retrieval Inproceedings Proc. of VIII Jornadas en Tecnología del Habla and IV Iberian SLTech Workshop (IberSpeech 2014), Las Palmas de Gran Canaria (Spain), 2014. @inproceedings{MarAgu14, title = {Language model adaptation for lecture transcription by document retrieval}, author = {A. Martínez-Villaronga and M. A. del-Agua and J.A. Silvestre-Cerdà and J. Andrés-Ferrer and A. Juan}, url = {http://www.mllp.upv.es/wp-content/uploads/2015/04/ibsp14-cameraReady.pdf}, year = {2014}, date = {2014-01-01}, booktitle = {Proc. of VIII Jornadas en Tecnología del Habla and IV Iberian SLTech Workshop (IberSpeech 2014)}, address = {Las Palmas de Gran Canaria (Spain)}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Pérez-González-de-Martos, A; Silvestre-Cerdà, J A; Rihtar, M; Juan, A; Civera, J Using Automatic Speech Transcriptions in Lecture Recommendation Systems Inproceedings Proc. of VIII Jornadas en Tecnología del Habla and IV Iberian SLTech Workshop (IberSpeech 2014), Las Palmas de Gran Canaria (Spain), 2014. @inproceedings{PerSil14, title = {Using Automatic Speech Transcriptions in Lecture Recommendation Systems}, author = {A. Pérez-González-de-Martos and J. A. Silvestre-Cerdà and M. Rihtar and A. Juan and J. Civera}, url = {http://www.mllp.upv.es/wp-content/uploads/2015/04/lavie_is2014_camready1.pdf}, year = {2014}, date = {2014-01-01}, booktitle = {Proc. of VIII Jornadas en Tecnología del Habla and IV Iberian SLTech Workshop (IberSpeech 2014)}, address = {Las Palmas de Gran Canaria (Spain)}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
2013 |
Silvestre-Cerdà, Joan Albert; Pérez, Alejandro; Jiménez, Manuel; Turró, Carlos; Juan, Alfons; Civera, Jorge A System Architecture to Support Cost-Effective Transcription and Translation of Large Video Lecture Repositories Inproceedings Proc. of the IEEE Intl. Conf. on Systems, Man, and Cybernetics SMC 2013 , pp. 3994-3999, Manchester (UK), 2013. Abstract | Links | BibTeX | Tags: Accessibility, Automatic Speech Recognition, Education, Intelligent Interaction, Language Technologies, Machine Translation, Massive Adaptation, Multilingualism, Opencast Matterhorn, Video Lectures @inproceedings{Silvestre-Cerdà2013, title = {A System Architecture to Support Cost-Effective Transcription and Translation of Large Video Lecture Repositories}, author = {Joan Albert Silvestre-Cerdà and Alejandro Pérez and Manuel Jiménez and Carlos Turró and Alfons Juan and Jorge Civera}, url = {http://dx.doi.org/10.1109/SMC.2013.682}, year = {2013}, date = {2013-01-01}, booktitle = {Proc. of the IEEE Intl. Conf. on Systems, Man, and Cybernetics SMC 2013 }, pages = {3994-3999}, address = {Manchester (UK)}, abstract = {Online video lecture repositories are rapidly growing and becoming established as fundamental knowledge assets. However, most lectures are neither transcribed nor translated because of the lack of cost-effective solutions that can give accurate enough results. In this paper, we describe a system architecture that supports the cost-effective transcription and translation of large video lecture repositories. This architecture has been adopted in the EU project transLectures and is now being tested on a repository of more than 9000 video lectures at the Universitat Politecnica de Valencia. Following a brief description of this repository and of the transLectures project, we describe the proposed system architecture in detail. 
We also report empirical results on the quality of the transcriptions and translations currently being maintained and steadily improved.}, keywords = {Accessibility, Automatic Speech Recognition, Education, Intelligent Interaction, Language Technologies, Machine Translation, Massive Adaptation, Multilingualism, Opencast Matterhorn, Video Lectures}, pubstate = {published}, tppubtype = {inproceedings} } |
2012 |
Silvestre-Cerdà, Joan Albert ; Del Agua, Miguel ; Garcés, Gonçal; Gascó, Guillem; Giménez-Pastor, Adrià; Martínez, Adrià; Pérez González de Martos, Alejandro ; Sánchez, Isaías; Serrano Martínez-Santos, Nicolás ; Spencer, Rachel; Valor Miró, Juan Daniel ; Andrés-Ferrer, Jesús; Civera, Jorge; Sanchís, Alberto; Juan, Alfons transLectures Inproceedings Proceedings (Online) of IberSPEECH 2012, pp. 345–351, Madrid (Spain), 2012. Abstract | Links | BibTeX | Tags: Accessibility, Automatic Speech Recognition, Education, Intelligent Interaction, Language Technologies, Machine Translation, Massive Adaptation, Multilingualism, Opencast Matterhorn, Video Lectures @inproceedings{Silvestre-Cerdà2012b, title = {transLectures}, author = {Silvestre-Cerdà, Joan Albert and Del Agua, Miguel and Gonçal Garcés and Guillem Gascó and Adrià Giménez-Pastor and Adrià Martínez and Pérez González de Martos, Alejandro and Isaías Sánchez and Serrano Martínez-Santos, Nicolás and Rachel Spencer and Valor Miró, Juan Daniel and Jesús Andrés-Ferrer and Jorge Civera and Alberto Sanchís and Alfons Juan}, url = {http://hdl.handle.net/10251/37290 http://lorien.die.upm.es/~lapiz/rtth/JORNADAS/VII/IberSPEECH2012_OnlineProceedings.pdf https://web.archive.org/web/20130609073144/http://iberspeech2012.ii.uam.es/IberSPEECH2012_OnlineProceedings.pdf http://www.mllp.upv.es/wp-content/uploads/2015/04/1209IberSpeech.pdf}, year = {2012}, date = {2012-11-22}, booktitle = {Proceedings (Online) of IberSPEECH 2012}, pages = {345--351}, address = {Madrid (Spain)}, abstract = {[EN] transLectures (Transcription and Translation of Video Lectures) is an EU STREP project in which advanced automatic speech recognition and machine translation techniques are being tested on large video lecture repositories. The project began in November 2011 and will run for three years. 
This paper will outline the project's main motivation and objectives, and give a brief description of the two main repositories being considered: VideoLectures.NET and poliMèdia. The first results obtained by the UPV group for the poliMedia repository will also be provided. [CA] transLectures (Transcription and Translation of Video Lectures) és un projecte del 7PM de la Unió Europea en el qual s'estan posant a prova tècniques avançades de reconeixement automàtic de la parla i de traducció automàtica sobre grans repositoris digitals de vídeos docents. El projecte començà al novembre de 2011 i tindrà una duració de tres anys. En aquest article exposem la motivació i els objectius del projecte, i descrivim breument els dos repositoris principals sobre els quals es treballa: VideoLectures.NET i poliMèdia. També oferim els primers resultats obtinguts per l'equip de la UPV al repositori poliMèdia.}, keywords = {Accessibility, Automatic Speech Recognition, Education, Intelligent Interaction, Language Technologies, Machine Translation, Massive Adaptation, Multilingualism, Opencast Matterhorn, Video Lectures}, pubstate = {published}, tppubtype = {inproceedings} } |
Silvestre-Cerdà, Joan Albert; Giménez, Adrià; Andrés-Ferrer, Jesús; Civera, Jorge; Juan, Alfons Albayzin Evaluation: The PRHLT-UPV Audio Segmentation System Inproceedings Proceedings (Online) of IberSPEECH 2012, pp. 596-600, Madrid (Spain), 2012. Abstract | Links | BibTeX | Tags: @inproceedings{Silvestre-Cerdà2012c, title = {Albayzin Evaluation: The PRHLT-UPV Audio Segmentation System}, author = {Joan Albert Silvestre-Cerdà and Adrià Giménez and Jesús Andrés-Ferrer and Jorge Civera and Alfons Juan}, url = {http://hdl.handle.net/10251/53699 http://iberspeech2012.ii.uam.es/IberSPEECH2012_OnlineProceedings.pdf}, year = {2012}, date = {2012-11-22}, booktitle = {Proceedings (Online) of IberSPEECH 2012}, pages = {596-600}, address = {Madrid (Spain)}, abstract = {This paper describes the audio segmentation system developed by the PRHLT research group at the UPV for the Albayzin Audio Segmentation Evaluation 2012. The PRHLT-UPV audio segmentation system is based on a conventional GMM-HMM speech recognition approach in which the vocabulary set is defined by the power set of segment classes. MFCC features were extracted to represent the acoustic signal and the AK toolkit was used for both, training acoustic models and performing audio segmentation. Experimental results reveals that our system provides an excellent performance on speech detection, so it could be successfully employed to provide speech segments to a diarization or speech recognition system.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Silvestre-Cerdà, Joan Albert; Andrés-Ferrer, Jesús; Civera, Jorge Explicit length modelling for statistical machine translation Journal Article Pattern Recognition, 45 (9), pp. 3183 - 3192, 2012, ISSN: 0031-3203. Abstract | Links | BibTeX | Tags: Length modelling, Log-linear models, Phrase-based models, Statistical machine translation @article{Silvestre-Cerdà2012a, title = {Explicit length modelling for statistical machine translation}, author = {Joan Albert Silvestre-Cerdà and Jesús Andrés-Ferrer and Jorge Civera}, url = {http://hdl.handle.net/10251/34996}, issn = {0031-3203}, year = {2012}, date = {2012-01-01}, journal = {Pattern Recognition}, volume = {45}, number = {9}, pages = {3183 - 3192}, abstract = {Explicit length modelling has been previously explored in statistical pattern recognition with successful results. In this paper, two length models along with two parameter estimation methods and two alternative parametrisation for statistical machine translation (SMT) are presented. More precisely, we incorporate explicit bilingual length modelling in a state-of-the-art log-linear SMT system as an additional feature function in order to prove the contribution of length information. Finally, a systematic evaluation on reference SMT tasks considering different language pairs prove the benefits of explicit length modelling.}, keywords = {Length modelling, Log-linear models, Phrase-based models, Statistical machine translation}, pubstate = {published}, tppubtype = {article} } |
2011 |
Silvestre-Cerdà, Joan Albert Modelat explícit de la longitud en Traducció Automàtica Estadística Masters Thesis Màster U. en Intel·ligència Artificial, Reconeixement de Formes i Imatge Digital, Universitat Politècnica de València, 2011. Abstract | Links | BibTeX | Tags: Modelat de la longitud, Models basats en seqüències de paraules, Models log-lineals, Traducció Automàtica Estadística @mastersthesis{Silvestre-Cerdà2011b, title = {Modelat explícit de la longitud en Traducció Automàtica Estadística}, author = {Joan Albert Silvestre-Cerdà}, url = {http://www.mllp.upv.es/wp-content/uploads/2014/04/memoria2.pdf}, year = {2011}, date = {2011-09-08}, school = {Màster U. en Intel·ligència Artificial, Reconeixement de Formes i Imatge Digital, Universitat Politècnica de València}, abstract = {El modelat explícit de la longitud és un problema que ha estat explorat prèviament en diferents tasques de reconeixement de formes, oferint bons resultats. En aquest treball, es presenten dos models de longitud juntament amb dos mètodes d'estimació i dos parametritzacions alternatives per a traducció automàtica estadística (TAE). Concretament, hem incorporat els models de longitud com a característiques addicionals al model logarítmic-lineal d'un sistema de TAE estat de l'art basat en seqüències de paraules, amb l'objectiu d'estudiar la contribució de la informació de la longitud de les seqüències de paraules en el procés de traducció. Es mostren els resultats dels experiments que hem dut a terme en una tasca de referència de la TAE, els quals posen en relleu els beneficis del modelat explícit de la longitud.}, keywords = {Modelat de la longitud, Models basats en seqüències de paraules, Models log-lineals, Traducció Automàtica Estadística}, pubstate = {published}, tppubtype = {mastersthesis} } |
Silvestre-Cerdà, Joan Albert; García-Martínez, Mercedes; Barrón-Cedeño, Alberto; Civera, Jorge; Rosso, Paolo Extracción de corpus paralelos de la Wikipedia basada en la obtención de alineamientos bilingües a nivel de frase Inproceedings Proceedings of the Workshop on Iberian Cross-Language Natural Language Processing Tasks (ICL 2011), pp. 14-21, CEUR-WS, 2011, ISSN: 1613-0073. Abstract | Links | BibTeX | Tags: Comparable Corpora, Parallel Sentences Extraction, Statistical machine translation @inproceedings{Silvestre-Cerdà2011b, title = {Extracción de corpus paralelos de la Wikipedia basada en la obtención de alineamientos bilingües a nivel de frase}, author = {Joan Albert Silvestre-Cerdà and Mercedes García-Martínez and Alberto Barrón-Cedeño and Jorge Civera and Paolo Rosso}, url = {http://hdl.handle.net/10251/27930}, issn = {1613-0073}, year = {2011}, date = {2011-01-01}, booktitle = {Proceedings of the Workshop on Iberian Cross-Language Natural Language Processing Tasks (ICL 2011)}, volume = {824}, pages = {14-21}, publisher = {CEUR-WS}, abstract = {This paper presents a proposal for extracting parallel corpora from Wikipedia on the basis of statistical machine translation techniques. We have used word-level alignment models from IBM in order to obtain phrase-level bilingual alignments between documents pairs. We have manually annotated a set of test English-Spanish comparable documents in order to evaluate the model. The obtained results are encouraging.}, keywords = {Comparable Corpora, Parallel Sentences Extraction, Statistical machine translation}, pubstate = {published}, tppubtype = {inproceedings} } |
Silvestre-Cerdà, Joan Albert; Andrés-Ferrer, Jesús ; Civera, Jorge Explicit Length Modelling for Statistical Machine Translation Incollection Vitrià, Jordi ; Sanches, JoãoMiguel ; Hernández, Mario (Ed.): Pattern Recognition and Image Analysis (IbPRIA 2011), 6669 , pp. 273-280, Springer Berlin Heidelberg, 2011, ISBN: 978-3-642-21256-7. Abstract | Links | BibTeX | Tags: Length modelling, Log-linear models, Phrase-based models, Statistical machine translation @incollection{Silvestre-Cerdà2011, title = {Explicit Length Modelling for Statistical Machine Translation}, author = { Joan Albert Silvestre-Cerdà and Jesús Andrés-Ferrer and Jorge Civera}, editor = {Vitrià, Jordi and Sanches, JoãoMiguel and Hernández, Mario}, url = {http://hdl.handle.net/10251/35749 http://dx.doi.org/10.1007/978-3-642-21257-4_34}, isbn = {978-3-642-21256-7}, year = {2011}, date = {2011-01-01}, booktitle = {Pattern Recognition and Image Analysis (IbPRIA 2011)}, volume = {6669}, pages = {273-280}, publisher = {Springer Berlin Heidelberg}, series = {Lecture Notes in Computer Science}, abstract = {Explicit length modelling has been previously explored in statistical pattern recognition with successful results. In this paper, two length models along with two parameter estimation methods for statistical machine translation (SMT) are presented. More precisely, we incorporate explicit length modelling in a state-of-the-art log-linear SMT system as an additional feature function in order to prove the contribution of length information. Finally, promising experimental results are reported on a reference SMT task.}, keywords = {Length modelling, Log-linear models, Phrase-based models, Statistical machine translation}, pubstate = {published}, tppubtype = {incollection} } |
2010 |
Silvestre-Cerdà, Joan Albert
Aportacions a la millora d'un sistema interactiu d'ajuda a la traducció basat en mètodes estadístics Miscellaneous Final Year Project (Computer Science and Engineering at Universitat Politècnica de València), 2010.
Tags: Modelat de la longitud, Models basats en seqüències de paraules, Models log-lineals, Traducció Automàtica Estadística
@misc{Silvestre-Cerdà2010,
  title = {Aportacions a la millora d'un sistema interactiu d'ajuda a la traducció basat en mètodes estadístics},
  author = {Joan Albert Silvestre-Cerdà},
  url = {http://hdl.handle.net/10251/9109 http://www.mllp.upv.es/wp-content/uploads/2014/04/memoria3.pdf},
  year = {2010},
  date = {2010-11-01},
  school = {School of Computer Science, Universitat Politècnica de València},
  howpublished = {Final Year Project (Computer Science and Engineering at Universitat Politècnica de València)},
  keywords = {Modelat de la longitud, Models basats en seqüències de paraules, Models log-lineals, Traducció Automàtica Estadística},
  pubstate = {published},
  tppubtype = {misc}
} |
Publications
2022 |
Doblaje automático de vídeo-charlas educativas en UPV[Media] Inproceedings Proc. of VIII Congrés d'Innovació Educativa i Docència en Xarxa (IN-RED 2022), pp. 557–570, València (Spain), 2022. |
MLLP-VRAIN UPV systems for the IWSLT 2022 Simultaneous Speech Translation and Speech-to-Speech Translation tasks Inproceedings Proc. of 19th Intl. Conf. on Spoken Language Translation (IWSLT 2022), pp. 255–264, Dublin (Ireland), 2022. |
MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge: Extension Journal Article Applied Sciences, 12 (2), pp. 804, 2022. |
2021 |
Live Streaming Speech Recognition Using Deep Bidirectional LSTM Acoustic Models and Interpolated Language Models Journal Article IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, pp. 148–161, 2021. |
Towards cross-lingual voice cloning in higher education Journal Article Engineering Applications of Artificial Intelligence, 105, pp. 104413, 2021. |
MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge Inproceedings Proc. of IberSPEECH 2021, pp. 118–122, Valladolid (Spain), 2021. |
Towards simultaneous machine interpretation Inproceedings Proc. Interspeech 2021, pp. 2277–2281, Brno (Czech Republic), 2021. |
Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization Inproceedings Proc. Interspeech 2021, pp. 3695–3699, Brno (Czech Republic), 2021. |
Streaming cascade-based speech translation leveraged by a direct segmentation model Journal Article Neural Networks, 142, pp. 303–315, 2021. |
2020 |
Direct Segmentation Models for Streaming Speech Translation Inproceedings Proc. of 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pp. 2599–2611, 2020. |
Improved Hybrid Streaming ASR with Transformer Language Models Inproceedings Proc. of 21st Annual Conf. of the Intl. Speech Communication Association (InterSpeech 2020), pp. 2127–2131, Shanghai (China), 2020. |
Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates Inproceedings Proc. of 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020), pp. 8229–8233, Barcelona (Spain), 2020. |
LSTM-Based One-Pass Decoder for Low-Latency Streaming Inproceedings Proc. of 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020), pp. 7814–7818, Barcelona (Spain), 2020. |
2018 |
Neural Speech Translation at AppTek Inproceedings Proc. of 15th Intl. Workshop on Spoken Language Translation (IWSLT 2018), pp. 104–111, Hong Kong, 2018. |
MLLP-UPV and RWTH Aachen Spanish ASR Systems for the IberSpeech-RTVE 2018 Speech-to-Text Transcription Challenge Inproceedings Proc. of IberSPEECH 2018: 10th Jornadas en Tecnologías del Habla and 6th Iberian SLTech Workshop, pp. 257–261, Barcelona (Spain), 2018. |
2016 |
Different Contributions to Cost-Effective Transcription and Translation of Video Lectures Inproceedings Proc. of IX Jornadas en Tecnología del Habla and V Iberian SLTech Workshop (IberSpeech 2016), pp. 313–319, Lisbon (Portugal), 2016, ISBN: 978-3-319-49168-4. |
Different Contributions to Cost-Effective Transcription and Translation of Video Lectures PhD Thesis Universitat Politècnica de València, 2016, (Advisors: Alfons Juan Ciscar and Jorge Civera Saiz). |
2015 |
Efficient Generation of High-Quality Multilingual Subtitles for Video Lecture Repositories Inproceedings Proc. of 10th European Conf. on Technology Enhanced Learning (EC-TEL 2015), pp. 485–490, Toledo (Spain), 2015, ISBN: 978-3-319-24258-3. |
MLLP Transcription and Translation Platform Miscellaneous 2015, (Short paper for demo presentation accepted at 10th European Conf. on Technology Enhanced Learning (EC-TEL 2015), Toledo (Spain), 2015.). |
Efficiency and usability study of innovative computer-aided transcription strategies for video lecture repositories Journal Article Speech Communication, 74, pp. 65–75, 2015, ISSN: 0167-6393. |
2014 |
Language model adaptation for lecture transcription by document retrieval Inproceedings Proc. of VIII Jornadas en Tecnología del Habla and IV Iberian SLTech Workshop (IberSpeech 2014), Las Palmas de Gran Canaria (Spain), 2014. |
Using Automatic Speech Transcriptions in Lecture Recommendation Systems Inproceedings Proc. of VIII Jornadas en Tecnología del Habla and IV Iberian SLTech Workshop (IberSpeech 2014), Las Palmas de Gran Canaria (Spain), 2014. |
2013 |
A System Architecture to Support Cost-Effective Transcription and Translation of Large Video Lecture Repositories Inproceedings Proc. of the IEEE Intl. Conf. on Systems, Man, and Cybernetics SMC 2013, pp. 3994–3999, Manchester (UK), 2013. |
2012 |
transLectures Inproceedings Proceedings (Online) of IberSPEECH 2012, pp. 345–351, Madrid (Spain), 2012. |
Albayzin Evaluation: The PRHLT-UPV Audio Segmentation System Inproceedings Proceedings (Online) of IberSPEECH 2012, pp. 596–600, Madrid (Spain), 2012. |
Explicit length modelling for statistical machine translation Journal Article Pattern Recognition, 45 (9), pp. 3183–3192, 2012, ISSN: 0031-3203. |
2011 |
Modelat explícit de la longitud en Traducció Automàtica Estadística Masters Thesis Màster U. en Intel·ligència Artificial, Reconeixement de Formes i Imatge Digital, Universitat Politècnica de València, 2011. |
Extracción de corpus paralelos de la Wikipedia basada en la obtención de alineamientos bilingües a nivel de frase Inproceedings Proceedings of the Workshop on Iberian Cross-Language Natural Language Processing Tasks (ICL 2011), pp. 14–21, CEUR-WS, 2011, ISSN: 1613-0073. |
Explicit Length Modelling for Statistical Machine Translation Incollection Vitrià, Jordi; Sanches, João Miguel; Hernández, Mario (Ed.): Pattern Recognition and Image Analysis (IbPRIA 2011), 6669, pp. 273–280, Springer Berlin Heidelberg, 2011, ISBN: 978-3-642-21256-7. |
2010 |
Aportacions a la millora d'un sistema interactiu d'ajuda a la traducció basat en mètodes estadístics Miscellaneous Final Year Project (Computer Science and Engineering at Universitat Politècnica de València), 2010. |