Direct Segmentation Models for Streaming Speech Translation

If you use this code, please cite the publication:

@inproceedings{iranzo-sanchez-etal-2020-direct,
    title = "Direct Segmentation Models for Streaming Speech Translation",
    author = "Iranzo-S{\'a}nchez, Javier  and
      Gim{\'e}nez Pastor, Adri{\`a}  and
      Silvestre-Cerd{\`a}, Joan Albert  and
      Baquero-Arnal, Pau  and
      Civera Saiz, Jorge  and
      Juan, Alfons",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "",
    doi = "10.18653/v1/2020.emnlp-main.206",
    pages = "2599--2611",
    abstract = "The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. These systems are usually connected by a segmenter that splits the ASR output into hopefully semantically self-contained chunks to be fed into the MT system. This is especially challenging in the case of streaming ST, where latency requirements must also be taken into account. This work proposes novel segmentation models for streaming ST that incorporate not only textual, but also acoustic information to decide when the ASR output is split into a chunk. An extensive and thorough experimental setup is carried out on the Europarl-ST dataset to prove the contribution of acoustic information to the performance of the segmentation model in terms of BLEU score in a streaming ST scenario. Finally, comparative results with previous work also show the superiority of the segmentation models proposed in this work.",
}


In the cascade approach to Speech Translation, an ASR system first transcribes the audio, and a downstream MT system then translates the transcriptions.

Standard MT systems are usually trained on sequences of around 100-150 tokens. Thus, if the audio is short, we can translate the transcription directly. However, with a long audio stream, the resulting transcription is many times longer than the maximum length seen by the MT model during training. A segmenter model is therefore needed: it takes the stream of transcribed words as input and outputs a stream of (hopefully) semantically self-contained segments, which are then translated independently by the MT model. The model presented here performs this segmentation in a streaming fashion.
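The streaming setup described above can be sketched as a greedy loop: consume transcribed words one at a time, and close the current chunk whenever a segmentation model predicts a split, or when a hard length limit is hit. This is a minimal illustration, not the paper's actual model; `split_prob` is a hypothetical stand-in for the segmenter (the paper's models additionally use acoustic features and a future word window).

```python
from typing import Callable, Iterable, Iterator, List

def segment_stream(words: Iterable[str],
                   split_prob: Callable[[List[str]], float],
                   max_len: int = 100,
                   threshold: float = 0.5) -> Iterator[List[str]]:
    """Greedy streaming segmenter (illustrative sketch).

    `split_prob(chunk)` stands in for the segmentation model: it returns
    the probability that the current chunk should end after its last word.
    A chunk is also forced out when it reaches `max_len`, so the MT model
    never sees a sequence longer than what it was trained on.
    """
    chunk: List[str] = []
    for word in words:
        chunk.append(word)
        if len(chunk) >= max_len or split_prob(chunk) >= threshold:
            yield chunk
            chunk = []
    if chunk:  # flush the final partial chunk at end of stream
        yield chunk

# Toy stand-in "model": split after sentence-final punctuation tokens.
toy_model = lambda chunk: 1.0 if chunk[-1] in {".", "?", "!"} else 0.0

stream = "this is a test . another sentence here .".split()
print(list(segment_stream(stream, toy_model)))
# → [['this', 'is', 'a', 'test', '.'], ['another', 'sentence', 'here', '.']]
```

Each yielded chunk would be passed independently to the MT system; because decisions are made left-to-right as words arrive, the segmenter adds only bounded latency to the pipeline.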

Get the code

You can find the repository containing the code of the paper Direct Segmentation Models for Streaming Speech Translation at: