2021 |
Jorge, Javier; Giménez, Adrià; Baquero-Arnal, Pau; Iranzo-Sánchez, Javier; Pérez-González-de-Martos, Alejandro; Garcés Díaz-Munío, Gonçal V; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge Inproceedings Proc. of IberSPEECH 2021, pp. 118–122, Valladolid (Spain), 2021. Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, Natural Language Processing, streaming @inproceedings{Jorge2021, title = {MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge}, author = {Javier Jorge and Adrià Giménez and Pau Baquero-Arnal and Javier Iranzo-Sánchez and Alejandro Pérez-González-de-Martos and Garcés Díaz-Munío, Gonçal V. and Joan Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan}, doi = {10.21437/IberSPEECH.2021-25}, year = {2021}, date = {2021-03-24}, booktitle = {Proc. of IberSPEECH 2021}, pages = {118--122}, address = {Valladolid (Spain)}, abstract = {1st place in IberSpeech-RTVE 2020 TV Speech-to-Text Challenge. [EN] This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politecnica de València for the Albayzin-RTVE 2020 Speech-to-Text Challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid BLSTM-HMM ASR system using streaming one-pass decoding with a context window of 1.5 seconds and a linear combination of an n-gram, a LSTM, and a Transformer language model (LM). The acoustic model was trained on nearly 4,000 hours of speech data from different sources, using the MLLP's transLectures-UPV toolkit (TLK) and TensorFlow; whilst LMs were trained using SRILM (n-gram), CUED-RNNLM (LSTM) and Fairseq (Transformer), with up to 102G tokens. This system achieved 11.6% and 16.0% WER on the test-2018 and test-2020 sets, respectively. As it is streaming-enabled, it could be put into production environments for automatic captioning of live media streams, with a theoretical delay of 1.5 seconds. Along with the primary system, we also submitted three contrastive systems. From these, we highlight the system c2-streaming_600ms_t that, following the same configuration of the primary one, but using a smaller context window of 0.6 seconds and a Transformer LM, scored 12.3% and 16.9% WER points respectively on the same test sets, with a measured empirical latency of 0.81+-0.09 seconds (mean+-stdev). This is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative. [CA] "Sistemes de reconeixement automàtic de la parla en castellà de MLLP-VRAIN per a la competició Albayzin-RTVE 2020 Speech-To-Text Challenge": En aquest article, es descriuen els sistemes de reconeixement automàtic de la parla (RAP) creats pel grup d'investigació MLLP-VRAIN de la Universitat Politecnica de València per a la competició Albayzin-RTVE 2020 Speech-to-Text Challenge. El sistema primari (p-streaming_1500ms_nlt) és un sistema de RAP híbrid BLSTM-HMM amb descodificació en temps real en una passada amb una finestra de context d'1,5 segons i una combinació lineal de models de llenguatge (ML) d'n-grames, LSTM i Transformer. El model acústic s'ha entrenat amb vora 4000 hores de parla transcrita de diferents fonts, usant el transLectures-UPV toolkit (TLK) del grup MLLP i TensorFlow; mentre que els ML s'han entrenat amb SRILM (n-grames), CUED-RNNLM (LSTM) i Fairseq (Transformer), amb 102G paraules (tokens). Aquest sistema ha obtingut 11,6 % i 16,0 % de WER en els conjunts test-2018 i test-2020, respectivament. És un sistema amb capacitat de temps real, que pot desplegar-se en producció per a subtitulació automàtica de fluxos audiovisuals en directe, amb un retard teòric d'1,5 segons. A banda del sistema primari, s'han presentat tres sistemes contrastius. D'aquests, destaquem el sistema c2-streaming_600ms_t que, amb la mateixa configuració que el sistema primari, però amb una finestra de context més reduïda de 0,6 segons i un ML Transformer, ha obtingut 12,3 % i 16,9 % de WER, respectivament, sobre els mateixos conjunts, amb una latència empírica mesurada de 0,81+-0,09 segons (mitjana+-desv). És a dir, s'han obtingut latències punteres per a subtitulació automàtica en directe d'alta qualitat amb una degradació del WER petita, del 6 % relatiu.}, keywords = {Automatic Speech Recognition, Natural Language Processing, streaming}, pubstate = {published}, tppubtype = {inproceedings} } 1st place in IberSpeech-RTVE 2020 TV Speech-to-Text Challenge. [EN] This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politecnica de València for the Albayzin-RTVE 2020 Speech-to-Text Challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid BLSTM-HMM ASR system using streaming one-pass decoding with a context window of 1.5 seconds and a linear combination of an n-gram, a LSTM, and a Transformer language model (LM). The acoustic model was trained on nearly 4,000 hours of speech data from different sources, using the MLLP's transLectures-UPV toolkit (TLK) and TensorFlow; whilst LMs were trained using SRILM (n-gram), CUED-RNNLM (LSTM) and Fairseq (Transformer), with up to 102G tokens. This system achieved 11.6% and 16.0% WER on the test-2018 and test-2020 sets, respectively. As it is streaming-enabled, it could be put into production environments for automatic captioning of live media streams, with a theoretical delay of 1.5 seconds. Along with the primary system, we also submitted three contrastive systems. From these, we highlight the system c2-streaming_600ms_t that, following the same configuration of the primary one, but using a smaller context window of 0.6 seconds and a Transformer LM, scored 12.3% and 16.9% WER points respectively on the same test sets, with a measured empirical latency of 0.81+-0.09 seconds (mean+-stdev). This is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative. [CA] "Sistemes de reconeixement automàtic de la parla en castellà de MLLP-VRAIN per a la competició Albayzin-RTVE 2020 Speech-To-Text Challenge": En aquest article, es descriuen els sistemes de reconeixement automàtic de la parla (RAP) creats pel grup d'investigació MLLP-VRAIN de la Universitat Politecnica de València per a la competició Albayzin-RTVE 2020 Speech-to-Text Challenge. El sistema primari (p-streaming_1500ms_nlt) és un sistema de RAP híbrid BLSTM-HMM amb descodificació en temps real en una passada amb una finestra de context d'1,5 segons i una combinació lineal de models de llenguatge (ML) d'n-grames, LSTM i Transformer. El model acústic s'ha entrenat amb vora 4000 hores de parla transcrita de diferents fonts, usant el transLectures-UPV toolkit (TLK) del grup MLLP i TensorFlow; mentre que els ML s'han entrenat amb SRILM (n-grames), CUED-RNNLM (LSTM) i Fairseq (Transformer), amb 102G paraules (tokens). Aquest sistema ha obtingut 11,6 % i 16,0 % de WER en els conjunts test-2018 i test-2020, respectivament. És un sistema amb capacitat de temps real, que pot desplegar-se en producció per a subtitulació automàtica de fluxos audiovisuals en directe, amb un retard teòric d'1,5 segons. A banda del sistema primari, s'han presentat tres sistemes contrastius. D'aquests, destaquem el sistema c2-streaming_600ms_t que, amb la mateixa configuració que el sistema primari, però amb una finestra de context més reduïda de 0,6 segons i un ML Transformer, ha obtingut 12,3 % i 16,9 % de WER, respectivament, sobre els mateixos conjunts, amb una latència empírica mesurada de 0,81+-0,09 segons (mitjana+-desv). És a dir, s'han obtingut latències punteres per a subtitulació automàtica en directe d'alta qualitat amb una degradació del WER petita, del 6 % relatiu. |
Javier Iranzo-Sánchez Jorge Civera, Alfons Juan Stream-level Latency Evaluation for Simultaneous Machine Translation Inproceedings Findings of the ACL: EMNLP 2021, pp. 664–670, Punta Cana (Dominican Republic), 2021. Abstract | Links | BibTeX | Tags: latency, simultaneous machine translation, stream-level evaluation, streaming @inproceedings{Iranzo-Sánchez2021b, title = {Stream-level Latency Evaluation for Simultaneous Machine Translation}, author = {Javier Iranzo-Sánchez, Jorge Civera, Alfons Juan}, url = {https://arxiv.org/abs/2104.08817 https://github.com/jairsan/Stream-level_Latency_Evaluation_for_Simultaneous_Machine_Translation}, doi = {10.18653/v1/2021.findings-emnlp.58}, year = {2021}, date = {2021-01-01}, booktitle = {Findings of the ACL: EMNLP 2021}, pages = {664--670}, address = {Punta Cana (Dominican Republic)}, abstract = {Simultaneous machine translation has recently gained traction thanks to significant quality improvements and the advent of streaming applications. Simultaneous translation systems need to find a trade-off between translation quality and response time, and with this purpose multiple latency measures have been proposed. However, latency evaluations for simultaneous translation are estimated at the sentence level, not taking into account the sequential nature of a streaming scenario. Indeed, these sentence-level latency measures are not well suited for continuous stream translation, resulting in figures that are not coherent with the simultaneous translation policy of the system being assessed. This work proposes a stream level adaptation of the current latency measures based on a re-segmentation approach applied to the output translation, that is successfully evaluated on streaming conditions for a reference IWSLT task.}, keywords = {latency, simultaneous machine translation, stream-level evaluation, streaming}, pubstate = {published}, tppubtype = {inproceedings} } Simultaneous machine translation has recently gained traction thanks to significant quality improvements and the advent of streaming applications. Simultaneous translation systems need to find a trade-off between translation quality and response time, and with this purpose multiple latency measures have been proposed. However, latency evaluations for simultaneous translation are estimated at the sentence level, not taking into account the sequential nature of a streaming scenario. Indeed, these sentence-level latency measures are not well suited for continuous stream translation, resulting in figures that are not coherent with the simultaneous translation policy of the system being assessed. This work proposes a stream level adaptation of the current latency measures based on a re-segmentation approach applied to the output translation, that is successfully evaluated on streaming conditions for a reference IWSLT task. |
Publications
Accessibility Automatic Speech Recognition Computer-assisted transcription Confidence measures Deep Neural Networks Docencia en Red Education language model adaptation Language Modeling Language Technologies Length modelling Log-linear models Machine Translation Massive Adaptation Models basats en seqüències de paraules Models log-lineals Multilingualism Neural Machine Translation Opencast Matterhorn Polimedia Sliding window Speaker adaptation Speech Recognition Speech Translation Statistical machine translation streaming text-to-speech transcripciones video lecture repositories Video Lectures
2021 |
MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge Inproceedings Proc. of IberSPEECH 2021, pp. 118–122, Valladolid (Spain), 2021. |
Stream-level Latency Evaluation for Simultaneous Machine Translation Inproceedings Findings of the ACL: EMNLP 2021, pp. 664–670, Punta Cana (Dominican Republic), 2021. |