2023
Iranzo Sánchez, Javier: Streaming Neural Speech Translation. PhD Thesis, Universitat Politècnica de València, 2023 (Advisors: Alfons Juan Ciscar and Jorge Civera Saiz).

@phdthesis{Sánchez2023,
  title     = {Streaming Neural Speech Translation},
  author    = {Iranzo Sánchez, Javier},
  doi       = {10.4995/Thesis/10251/199170},
  year      = {2023},
  date      = {2023-09-29},
  school    = {Universitat Politècnica de València},
  abstract  = {Thanks to significant advances in Deep Learning, Speech Translation (ST) has become a mature field that enables the use of ST technology in production-ready solutions. Due to the ever-increasing hours of audio-visual content produced each year, as well as greater awareness of the importance of media accessibility, ST is poised to become a key element in the production of entertainment and educational media. Although significant advances have been made in ST, most research has focused on the offline scenario, where the entire input audio is available. In contrast, online ST remains an under-researched topic. A special case of online ST, streaming ST, translates an unbounded input stream in real time under strict latency constraints. This is a much more realistic problem that needs to be solved in order to apply ST to a variety of real-life tasks. The focus of this thesis is on researching and developing the key techniques necessary for a successful streaming ST solution. First, in order to enable ST system development and evaluation, a new multilingual ST dataset is collected, which significantly expands the number of hours available for ST. Then, a streaming-ready segmenter component is developed to segment the intermediate transcriptions of our proposed cascade solution, which consists of an Automatic Speech Recognition (ASR) system that transcribes the audio, followed by a Machine Translation (MT) system that translates the intermediate transcriptions into the desired language. Research has shown that segmentation quality plays a significant role in downstream MT performance, so the development of an effective streaming segmenter is a critical step in the streaming ST process. This segmenter is then integrated and the components of the cascade are jointly optimized to achieve an appropriate quality-latency trade-off. Streaming ST has much stricter latency constraints than standard online ST, as the desired latency level must be maintained during the whole translation process. Therefore, it is crucial to be able to accurately measure this latency, but the standard online ST metrics are not well suited for this task. As a consequence, new evaluation methods are proposed for streaming ST, which ensure realistic yet interpretable results. Lastly, a novel method is presented for improving translation quality through the use of contextual information. Whereas standard online ST systems translate audios in isolation, there is a wealth of contextual information available for improving streaming ST systems. Our approach introduces the concept of streaming history by storing the most recent information of the translation process, which is then used by the model to improve translation quality.},
  note      = {Advisors: Alfons Juan Ciscar and Jorge Civera Saiz},
  keywords  = {Speech Translation, streaming speech translation},
  pubstate  = {published},
  tppubtype = {phdthesis}
}
Benstead, Kim; Brandl, Andreas; Brouwers, Ton; Civera, Jorge; Collen, Sarah; Csaba, Degi L; Munter, Johan De; Dewitte, Marieke; Diez de los Rios, Celia; Dodlek, Nikolina; Eriksen, Jesper G; Forget, Patrice; Gasparatto, Chiara; Geissler, Jan; Hall, Corinne; Juan, Alfons; Kalz, Marco; Kelly, Richard; Klis, Giorgos; Kulaksiz, Taibe; Lecoq, Carine; Marangoni, Francesca; McInally, Wendy; Oliver, Kathy; Popovics, Maria; Poulios, Christos; Price, Richard; Rollo, Irena; Romeo, Silvia; Steinbacher, Jana; Sulosaari, Virpi; O’Higgins, Niall: An inter-specialty cancer training programme curriculum for Europe. Journal Article. European Journal of Surgical Oncology, 49 (9), pp. 106989, 2023.

@article{Benstead2023,
  title     = {An inter-specialty cancer training programme curriculum for Europe},
  author    = {Kim Benstead and Andreas Brandl and Ton Brouwers and Jorge Civera and Sarah Collen and Degi L. Csaba and Johan De Munter and Marieke Dewitte and Diez de los Rios, Celia and Nikolina Dodlek and Jesper G. Eriksen and Patrice Forget and Chiara Gasparatto and Jan Geissler and Corinne Hall and Alfons Juan and Marco Kalz and Richard Kelly and Giorgos Klis and Taibe Kulaksiz and Carine Lecoq and Francesca Marangoni and Wendy McInally and Kathy Oliver and Maria Popovics and Christos Poulios and Richard Price and Irena Rollo and Silvia Romeo and Jana Steinbacher and Virpi Sulosaari and Niall O’Higgins},
  doi       = {10.1016/j.ejso.2023.106989},
  year      = {2023},
  date      = {2023-07-28},
  journal   = {European Journal of Surgical Oncology},
  volume    = {49},
  number    = {9},
  pages     = {106989},
  abstract  = {INTRODUCTION: Multidisciplinary and multi-professional collaboration is vital in providing better outcomes for patients. The aim of the INTERACT-EUROPE Project (Wide Ranging Cooperation and Cutting Edge Innovation As A Response To Cancer Training Needs) was to develop an inter-specialty curriculum. A pilot project will enable a pioneer cohort to acquire a sample of the competencies needed. METHODS: A scoping review and qualitative and quantitative surveys were undertaken. The quantitative survey results are reported here. Respondents, including members of education boards, curriculum committees, trainee committees of European specialist societies and the ECO Patient Advisory Committee, were asked to score 127 proposed competencies on a 7-point Likert scale as to their value in achieving the aims of the curriculum. Results were discussed and competencies developed at two stakeholder meetings. A consultative document, shared with stakeholders and available online, requested views regarding the other components of the curriculum. RESULTS: Eleven competencies were revised, three omitted and three added. The competencies were organised according to the CanMEDS framework, with 13 Entrustable Professional Activities, 23 competencies and 127 enabling competencies covering all roles in the framework. Recommendations regarding the infrastructure, organisational aspects, eligibility of trainees and training centres, programme contents, assessment and evaluation were developed using the replies to the consultative document. CONCLUSIONS: An Inter-specialty Cancer Training Programme Curriculum and a pilot programme with virtual and face-to-face components have been developed with the aim of improving the care of people affected by cancer.},
  keywords  = {educational technologies, Neural Machine Translation},
  pubstate  = {published},
  tppubtype = {article}
}
Baquero Arnal, Pau: Transformer models for Machine Translation and Streaming Automatic Speech Recognition. PhD Thesis, Universitat Politècnica de València, 2023 (Advisors: Alfons Juan Ciscar and Hermann Ney).

@phdthesis{Arnal2023,
  title     = {Transformer models for Machine Translation and Streaming Automatic Speech Recognition},
  author    = {Baquero Arnal, Pau},
  url       = {https://doi.org/10.4995/Thesis/10251/193680 https://www.upv.es/pls/oalu/sic_ted.mostrar_tesis?p_num_reg=12917},
  year      = {2023},
  date      = {2023-01-01},
  school    = {Universitat Politècnica de València},
  abstract  = {Natural language processing (NLP) is a set of fundamental computing problems with immense applicability, as language is the natural communication vehicle for people. NLP, along with many other computer technologies, has been revolutionized in recent years by the impact of deep learning. This thesis is centered around two keystone problems for NLP: machine translation (MT) and automatic speech recognition (ASR); and a common deep neural architecture, the Transformer, that is leveraged to improve the technical solutions for some MT and ASR applications. ASR and MT can be utilized to produce cost-effective, high-quality multilingual texts for a wide array of media. Particular applications pursued in this thesis are news translation and automatic live captioning of television broadcasts. ASR and MT can also be combined with each other, for instance generating automatic translated subtitles from audio, or augmented with other NLP solutions: text summarization to produce a summary of a speech, or speech synthesis to create an automatic translated dubbing, for instance. These other applications fall outside the scope of this thesis, but can profit from the contributions that it contains, as they help to improve the performance of the automatic systems on which they depend. This thesis contains an application of the Transformer architecture to MT as it was originally conceived, achieving state-of-the-art results in similar language translation. In successive chapters, this thesis covers the adaptation of the Transformer as a language model for streaming hybrid ASR systems. Afterwards, it describes how we applied the developed technology for a specific use case in television captioning by participating in a competitive challenge and achieving the first position by a large margin. We also show that the gains came mostly from the improvement in technology capabilities over two years, including the Transformer language model adapted for streaming, while the data component was minor.},
  note      = {Advisors: Alfons Juan Ciscar and Hermann Ney},
  keywords  = {Automatic Speech Recognition, Neural Machine Translation, Transformer, Transformer Language Model},
  pubstate  = {published},
  tppubtype = {phdthesis}
}