CN117252213A - End-to-end speech translation method using synthesized speech as supervision information - Google Patents
- Publication number
- CN117252213A (application No. CN202310824069.5A)
- Authority
- CN
- China
- Prior art keywords
- speech
- voice
- translation
- original
- synthesized
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/27—Regression, e.g. linear or logistic regression
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G06F40/35—Discourse or dialogue representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/45—Example-based machine translation; Alignment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/096—Transfer learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/04—Time compression or expansion
- G10L21/043—Time compression or expansion by changing speed
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an end-to-end speech translation method using synthesized speech as supervision information. The method first preprocesses the triplet original speech translation data to be translated, obtaining quadruple speech translation data that contains synthesized speech. It then constructs a speech translation model and trains it using the quadruple speech translation data as samples, where an alignment adapter module is designed to take the semantic representation of the synthesized speech as supervision information, so that the semantic representation of the original speech approaches that of the synthesized speech; at the same time, on the shared decoder side, the logits distribution of the synthesized speech is distilled onto the logits distribution of the original speech. Finally, the trained speech translation model translates the input speech to be translated and outputs the target translated text. The invention uses standard synthesized speech as supervision information, integrated into the speech translation model training framework; during training, this supervision guides the end-to-end training on original speech, thereby improving the translation effect.
Description
Technical Field
The invention relates to the technical field of speech translation, and in particular to an end-to-end speech translation method using synthesized speech as supervision information.
Background
Speech translation technology comes in two types: (1) Speech-to-Speech translation (S2S), which automatically translates an audio signal in one language into an audio signal in another language; and (2) Speech-to-Text translation (S2T), which automatically translates an audio signal in one language into text in another language. Both technologies are widely used in simultaneous interpretation systems, such as the iFlytek and Baidu simultaneous interpretation products.
This patent concerns S2T technology, i.e., converting source-language speech into target-language text. By implementation, speech translation can be divided into cascade (tandem) speech translation and end-to-end speech translation. Cascade speech translation concatenates automatic speech recognition (ASR) and machine translation (MT): the source speech is first recognized as source text, which is then translated into target text. The advantages of this scheme are: (1) speech recognition and machine translation can be optimized independently, reducing the difficulty of the speech translation task; (2) ASR and MT have abundant data, so both components individually achieve good results. However, the disadvantages are also apparent. (1) Error propagation: if the source text produced by the speech recognition model contains errors, those errors are likely to be amplified during translation, causing large deviations in the final result. (2) High latency: the speech recognition model and the text translation model can only run serially, so latency is high and translation efficiency relatively low, which matters especially in real-time speech translation scenarios with strict efficiency requirements; moreover, cascade systems in practice decompose the task further and add intermediate processing modules, which can improve overall system performance but increases latency further and reduces efficiency. (3) Loss of speech information: in recognizing speech as text, information such as mood, emotion, and tone carried by the speech is lost, since such information is usually not expressed in text form; the same sentence spoken differently may convey different meanings, and this information also helps translation. End-to-end speech translation, by contrast, directly models the conversion from speech to target text and is a current research focus in the field. Its advantages: (1) it avoids error propagation; (2) latency is significantly reduced; (3) model deployment is lightweight. Its disadvantages: (1) modeling complexity is high, since the conversion crosses modalities; (2) training data is scarce, which is the biggest bottleneck affecting end-to-end speech translation.
Current end-to-end speech translation (End-to-End S2T) schemes are mainly based on multi-task learning, knowledge distillation (KD), speech-text manifold mixup learning, contrastive learning, and the like. These schemes can effectively improve end-to-end speech translation. Multi-task learning adds auxiliary training objectives to guide the model's learning, introducing knowledge from other models to provide supervised learning signals that assist training, thereby alleviating the shortage of speech translation training data. Knowledge distillation distills knowledge from a text translation model with relatively good performance into the more complex and weaker speech translation model, improving speech translation quality. Speech-text mixup learning is essentially a data augmentation technique that improves speech translation by constructing more data. Contrastive learning pulls mutually translated sentence pairs closer within a training batch while pushing non-translation pairs apart, thereby improving the speech translation effect. None of these techniques considers the robustness of speech translation; they all improve translation results from the standpoint of pulling the speech and text modalities closer together.
Robustness is an important direction in speech translation. The robustness problem manifests as follows: for the same text, the model's translations are not fully consistent when the sentence is spoken by different people, or even by the same person in different recordings. What robustness demands is that, for speech corresponding to the same text, whether from different speakers or the same speaker and under various conditions, the translation results should remain consistent and of high quality. Different people, or the same person on different occasions, saying the same sentence may differ in speaking duration, pauses, and so on; these variations are collectively referred to here as differences in timbre, and they affect the robustness of speech translation.
Disclosure of Invention
The invention aims to solve the technical problem of end-to-end speech translation robustness and provides an end-to-end speech translation method using synthesized speech as supervision information. The method uses standard synthesized speech as supervision information, integrated into the speech translation model training framework; during training, this supervision guides the end-to-end training on original speech, thereby improving the speech translation effect.
The technical scheme adopted for realizing the purpose of the invention is as follows:
An end-to-end speech translation method using synthesized speech as supervision information comprises the following steps:
Step 1: preprocess the triplet original speech translation data to be translated to obtain quadruple speech translation data; the triplet is: original speech, the transcribed text corresponding to the original speech, and the translated text corresponding to the original speech; the quadruple is: original speech, the synthesized speech corresponding to the transcribed text, the transcribed text, and the translated text;
Step 2: construct a speech translation model, and train it using the quadruple speech translation data obtained in step 1 as samples;
the speech translation model includes 5 modules: a speech encoder module, a text encoder module, a shared encoder module, an alignment adapter module, and a shared decoder module;
the speech encoder module encodes the original speech and the synthesized speech in the quadruple speech translation data into real-valued vectors that represent the speech features within the model;
the text encoder module encodes the transcribed text in the quadruple speech translation data into word vectors;
the shared encoder module obtains semantic representations of the original speech, the synthesized speech, and the transcribed text;
the shared decoder module obtains inference results using an autoregressive method;
the alignment adapter module takes the semantic representation of the synthesized speech as supervision information, making the semantic representation of the original speech approach that of the synthesized speech;
Define a sample D = (s, s', x, y), where s, s', x, y represent the original speech, the synthesized speech, the transcribed text, and the target translated text respectively, and θ represents the model parameters. Given the sample D and model parameters θ, the following loss functions are established:
1. Loss of translating the original speech into the target translated text:
L_ST = -log P(y | s; θ);
2. Loss of recognizing the original speech as the transcribed text:
L_ASR = -log P(x | s; θ);
3. Loss of machine-translating the transcribed text into the target translated text:
L_MT = -log P(y | x; θ);
4. Loss of translating the synthesized speech into the target translated text:
L_ST' = -log P(y | s'; θ);
In the four loss functions above, (s, y) is the original-speech/target-translation pair, (s', y) the synthesized-speech/target-translation pair, (s, x) the original-speech/transcribed-text pair, and (x, y) the transcribed-text/target-translation pair; P(y | s; θ) is the probability of translating s into y under model parameters θ; P(x | s; θ) is the probability of recognizing s as x under θ; P(y | x; θ) is the probability of translating x into y under θ;
5. Loss of the alignment adapter:
L_Adapter = MSE(Adapter(s), Encoder(s'));
where Adapter(s) is the output of s through the alignment adapter, Encoder(s') is the output of s' through the shared encoder, and MSE is the mean-square-error loss;
6. Loss of knowledge distillation from synthesized speech to original speech:
P(y_i = v_k | y_{<i}, s; θ, T) = exp(z_{i,k} / T) / Σ_{k'=1}^{|V|} exp(z_{i,k'} / T);
L_KD = -Σ_{i=1}^{N} Σ_{k=1}^{|V|} P(y_i = v_k | y_{<i}, s'; θ, T) · log P(y_i = v_k | y_{<i}, s; θ, T);
where z_i is the logits output by the shared decoder module at step i; the logits represent the probability distribution over all words in the vocabulary at position i, and each decoding step produces one logits vector; T is the temperature coefficient; P(y_i = v_k | y_{<i}, s'; θ, T) is the probability, given input s' and parameters θ, that the token output at step i is v_k (a token is a specific word in the vocabulary); P(y_i = v_k | y_{<i}, s; θ, T) is the corresponding probability for input s; |V| is the vocabulary size; N is the target translation length; KD stands for knowledge distillation;
Combining the 6 loss functions above, the loss function of the final overall speech translation model is:
L = L_ST + L_ASR + L_MT + L_ST' + L_Adapter + L_KD.
Step 3: translate the input speech to be translated using the trained speech translation model obtained in step 2, and output the target translated text.
In the above technical solution, step 1 includes the following steps:
Step 1.1: generate the synthesized speech corresponding to the transcribed text of the original speech;
Step 1.2: adjust the sampling rate of the synthesized speech obtained in step 1.1 to match the sampling rate of the original speech;
Step 1.3: compute the duration of the synthesized speech obtained in step 1.2 and the duration of the original speech, and compute the compression rate between them, where compression rate = duration of synthesized speech / duration of original speech;
Step 1.4: filter out original and synthesized speech whose compression rate does not meet the requirement, and synchronously filter out the corresponding transcribed and translated texts;
Step 1.5: for the original and synthesized speech remaining after step 1.4, convert the duration of the synthesized speech to match the duration of the original speech;
Step 1.6: add the synthesized speech processed in step 1.5 to the original speech translation data to obtain the quadruple speech translation data.
In the above technical solution, in step 1.1, the synthesized speech corresponding to the transcribed text of the original speech is generated by the espnet2-TTS system.
In the above technical solution, in step 1.4, original and synthesized speech with compression rate < 0.4 or > 3 are filtered out, and the corresponding transcribed and translated texts are filtered out at the same time.
In the above technical solution, in step 2, the speech encoder module adds two CNN layers after the wav2vec2.0 open-source speech pre-training model.
In the above technical solution, in step 2, the shared encoder module adopts a classical Transformer Encoder structure.
In the above technical solution, in step 2, the shared decoder module adopts a classical Transformer Decoder structure.
In the above technical solution, in step 2, the alignment adapter module adopts a classical Transformer Encoder structure.
Compared with the prior art, the invention has the following beneficial effects:
The invention uses standard synthesized speech as supervision information, integrated into the speech translation model training framework; during training, this supervision guides the end-to-end training on original speech, thereby improving the translation effect.
Drawings
Fig. 1 is a flow chart of an end-to-end speech translation method of the present invention using synthesized speech as supervisory information.
FIG. 2 is a schematic diagram of a speech translation model according to the present invention.
Other relevant drawings may be made by those of ordinary skill in the art from the above figures without undue burden.
Detailed Description
The present invention will be described in further detail with reference to specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, an end-to-end speech translation method using synthesized speech as supervision information includes the following steps:
Step 1: preprocess the triplet original speech translation data to be translated to obtain the quadruple speech translation data. The specific steps are as follows:
Step 1.1: the original speech translation data has a triplet structure, the triplet being: the original speech, the transcribed text corresponding to the original speech, and the translated text corresponding to the original speech. First, the transcribed text corresponding to the original speech is synthesized into speech by a text-to-speech synthesis system; that is, synthesized speech data corresponding to the transcribed text of the original speech is generated. This embodiment adopts the open-source espnet2-TTS system, which generates the synthesized speech corresponding to the transcribed text of the original speech.
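As an illustrative sketch of step 1.1 (the pretrained model tag, helper name, and file paths here are assumptions for illustration, not values fixed by this embodiment), the espnet2-TTS inference interface can be driven as follows:

```python
# Sketch of step 1.1: synthesize speech from each transcribed text.
# "kan-bayashi/ljspeech_vits" is a placeholder espnet2 model tag.
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

tts = Text2Speech.from_pretrained("kan-bayashi/ljspeech_vits")

def synthesize(transcript: str, out_path: str) -> None:
    """Generate the synthesized speech for one transcribed text."""
    wav = tts(transcript)["wav"]             # waveform tensor at the model's rate
    sf.write(out_path, wav.numpy(), tts.fs)  # tts.fs: model sampling rate (22 kHz here)

synthesize("an example transcribed sentence", "synth_0001.wav")
```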
Step 1.2: since the sample rate of the synthesized speech generated by the espnet2-TTS system is 22k, which is different from the sample rate (16 k) of the original speech, the synthesized speech sample rate obtained in step 1.1 needs to be converted into 16k by a ffmpeg tool.
Step 1.3: and (2) calculating the duration of the synthesized voice and the duration of the original voice obtained in the step (1.2), and calculating the compression rate between the synthesized voice and the original voice, wherein the compression rate=the duration of the synthesized voice/the duration of the original voice.
Step 1.4: the original speech with compression <0.4 and compression >3 is filtered out with the synthesized speech, and the corresponding transcribed text and translated text are filtered out simultaneously.
Step 1.5: for the original speech processed in step 1.4 and the synthesized speech, the duration of the synthesized speech is converted to be identical to the speech duration of the original speech by the ffmpeg tool (ideally, the duration of the synthesized speech is completely identical, but the conversion process of the ffmpeg tool has a loss, so that the final duration is not completely identical, and has a small gap, which is about 1% or less, and if the duration of the original speech is about 1 second, the duration of the synthesized speech is about 1.00x or 0.99x seconds).
ffmpeg lengthens or shortens speech through the atempo filter, whose parameter value must lie between 0.5 and 100. So if the compression rate is 0.49, which is below 0.5, several atempo filters must be chained; the invention takes the square root of the compression rate (0.7) as the atempo value and stacks 2 filters (atempo=0.7,atempo=0.7).
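The atempo-stacking rule can be written as a small helper; a sketch that splits any ratio below 0.5 into equal factors (n-th roots), matching the 0.49 example above:

```python
def atempo_chain(ratio: float) -> str:
    """Build the ffmpeg audio-filter string that time-scales speech by `ratio`."""
    n = 1
    while ratio ** (1.0 / n) < 0.5:   # each atempo factor must stay within [0.5, 100]
        n += 1
    factor = ratio ** (1.0 / n)
    return ",".join(f"atempo={factor:.4f}" for _ in range(n))

print(atempo_chain(0.49))  # "atempo=0.7000,atempo=0.7000"
# Used as: ffmpeg -i synth_16k.wav -filter:a "atempo=0.7000,atempo=0.7000" synth_matched.wav
```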
Step 1.6: adding the synthesized voice processed in the step 1.5 into original voice translation data to obtain voice translation data of four tuples, wherein the four tuples are respectively: original speech, synthesized speech, transcribed text, and translated text.
Step 2: and (3) constructing a voice translation model, and training the voice translation model by using the voice translation data of the four-element group obtained in the step (1) as a sample.
Referring to fig. 2, the speech translation model includes 5 modules: a speech encoder module, a text encoder module, a shared encoder module, an alignment adapter module, and a shared decoder module.
A speech encoder module: encodes the original speech and the synthesized speech in the quadruple speech translation data into real-valued vectors, which represent the speech features within the model and participate in subsequent model training. Concretely, the speech encoder module adds two layers of CNN (Convolutional Neural Network) after the wav2vec2.0 open-source speech pre-training model.
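A minimal PyTorch sketch of this module follows (the torchaudio wav2vec 2.0 bundle, kernel sizes, strides, and the model width of 512 are illustrative assumptions; the embodiment fixes only "wav2vec2.0 followed by two CNN layers"):

```python
import torch
import torch.nn as nn
import torchaudio

class SpeechEncoder(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        # Pretrained wav2vec 2.0 backbone (768-dim features for the base model).
        self.wav2vec2 = torchaudio.pipelines.WAV2VEC2_BASE.get_model()
        # Two CNN layers; stride 2 each, so the feature sequence is shortened 4x.
        self.cnn = nn.Sequential(
            nn.Conv1d(768, d_model, kernel_size=5, stride=2, padding=2), nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2), nn.GELU(),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        feats, _ = self.wav2vec2.extract_features(waveform)  # list of layer outputs
        x = feats[-1].transpose(1, 2)       # (batch, 768, time)
        return self.cnn(x).transpose(1, 2)  # (batch, time/4, d_model)
```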
A text encoder module: encodes the transcribed text in the quadruple speech translation data into word vectors.
A shared encoder module: adopts a classical Transformer Encoder structure and obtains the semantic representations of the original speech, the synthesized speech, and the transcribed text, with parameters layers = 6 (number of encoder layers) and heads = 8 (number of attention heads).
A shared decoder module: adopts a classical Transformer Decoder structure, with parameters layers = 6 (number of decoder layers) and heads = 8 (number of attention heads). Its function is to obtain inference results using an autoregressive method.
Alignment adapter module: has the same structure as the shared encoder, i.e., a classical Transformer Encoder. Its function is to take the semantic representation of the synthesized speech as supervision information and pull the semantic representation of the original speech toward that of the synthesized speech, thereby eliminating the negative influence of speech variation across different speakers or across recordings of the same speaker (i.e., the semantic representations of different speakers, or of the same speaker, tend toward consistency).
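These three Transformer modules can be sketched directly with standard PyTorch building blocks (the width of 512 and the adapter depth are assumptions; the text fixes only 6 layers and 8 heads for the shared encoder and decoder):

```python
import torch.nn as nn

d_model = 512
shared_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=6,
)
shared_decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=6,
)
# The alignment adapter reuses the Transformer Encoder structure; its depth is
# not stated in the text, so 6 layers here is an assumption.
alignment_adapter = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=6,
)
```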
Define a sample D = (s, s', x, y), where s, s', x, y represent the original speech, the synthesized speech, the transcribed text, and the target translated text respectively, and θ represents the model parameters. Given the sample D and model parameters θ, the following loss functions are established:
1. Loss of translating the original speech into the target translated text:
L_ST = -log P(y | s; θ);
2. Loss of recognizing the original speech as the transcribed text:
L_ASR = -log P(x | s; θ);
3. Loss of machine-translating the transcribed text into the target translated text:
L_MT = -log P(y | x; θ);
4. Loss of translating the synthesized speech into the target translated text:
L_ST' = -log P(y | s'; θ);
In the loss functions above, (s, y) is the original-speech/target-translation pair, (s', y) the synthesized-speech/target-translation pair, (s, x) the original-speech/transcribed-text pair, and (x, y) the transcribed-text/target-translation pair; P(y | s; θ) is the probability of translating s into y under model parameters θ; P(x | s; θ) is the probability of recognizing s as x under θ; P(y | x; θ) is the probability of translating x into y under θ; ST stands for Speech Translation, ASR for Automatic Speech Recognition, and MT for Machine Translation.
5. Loss of the alignment adapter:
L_Adapter = MSE(Adapter(s), Encoder(s'));
where Adapter(s) is the output of s through the alignment adapter, Encoder(s') is the output of s' through the shared encoder, and MSE (mean squared error) is the mean-square-error loss.
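In training code this loss is a single call; a sketch follows (detaching the synthesized-speech representation so that it acts purely as a supervision target is our assumption, as is the expectation that step 1.5's duration matching lets the two sequences share a time dimension):

```python
import torch.nn.functional as F

def adapter_loss(h_orig_adapter, h_synth_encoder):
    """L_Adapter = MSE(Adapter(s), Encoder(s')); inputs: (batch, time, d_model)."""
    return F.mse_loss(h_orig_adapter, h_synth_encoder.detach())
```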
6. Loss of knowledge distillation from synthesized speech to original speech:
P(y_i = v_k | y_{<i}, s; θ, T) = exp(z_{i,k} / T) / Σ_{k'=1}^{|V|} exp(z_{i,k'} / T);
L_KD = -Σ_{i=1}^{N} Σ_{k=1}^{|V|} P(y_i = v_k | y_{<i}, s'; θ, T) · log P(y_i = v_k | y_{<i}, s; θ, T);
where z_i is the logits output by the shared decoder module at step i (the logits represent the probability distribution over all words in the vocabulary at position i); T is the temperature coefficient; P(y_i = v_k | y_{<i}, s'; θ, T) is the probability, given input s' and parameters θ, that the token output at step i is v_k (during inference each step generates one word; a token is a specific word in the vocabulary); P(y_i = v_k | y_{<i}, s; θ, T) is the corresponding probability for input s; |V| is the vocabulary size; N is the target translation length; KD stands for knowledge distillation (Knowledge Distillation).
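A sketch of this distillation term (teacher = synthesized speech, student = original speech), following the temperature-softened formula above; this is illustrative training code, not the embodiment's exact implementation:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, T: float = 1.0):
    """logits shapes: (batch, target_len N, vocab |V|)."""
    p_teacher = F.softmax(teacher_logits.detach() / T, dim=-1)  # soft labels from s'
    log_p_student = F.log_softmax(student_logits / T, dim=-1)   # predictions from s
    # Cross-entropy of teacher soft labels against student predictions, summed
    # over the vocabulary and target positions, averaged over the batch.
    return -(p_teacher * log_p_student).sum(dim=(-1, -2)).mean()
```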
Combining the 6 loss functions above, the loss function of the final overall speech translation model is:
L = L_ST + L_ASR + L_MT + L_ST' + L_Adapter + L_KD.
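The overall objective is then an unweighted sum, per the formula above (a sketch; the six terms are assumed to be computed as in the preceding snippets, and no weighting coefficients are specified in this embodiment):

```python
def total_loss(l_st, l_asr, l_mt, l_st_tts, l_adapter, l_kd):
    # L = L_ST + L_ASR + L_MT + L_ST' + L_Adapter + L_KD
    return l_st + l_asr + l_mt + l_st_tts + l_adapter + l_kd
```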
Step 3: translate the input speech to be translated using the trained speech translation model obtained in step 2, and output the target translated text.
The foregoing describes exemplary embodiments of the invention. It should be understood that those of ordinary skill in the art may make simple variations, modifications, or other equivalent arrangements without departing from the spirit of the invention, and such changes fall within the scope of protection of the invention.
Claims (8)
1. An end-to-end speech translation method using synthesized speech as supervisory information, comprising the following steps:
step 1: preprocessing the triplet original speech translation data to be translated to obtain quadruple speech translation data; wherein the triplet is: original speech, the transcribed text corresponding to the original speech, and the translated text corresponding to the original speech, and the quadruple is: original speech, synthesized speech, transcribed text, and translated text;
step 2: constructing a speech translation model, and training the speech translation model using the quadruple speech translation data obtained in step 1 as samples;
the speech translation model includes 5 modules: a speech encoder module, a text encoder module, a shared encoder module, an alignment adapter module, and a shared decoder module;
the speech encoder module encodes the original speech and the synthesized speech in the quadruple speech translation data into real-valued vectors representing the speech features within the model;
the text encoder module encodes the transcribed text in the quadruple speech translation data into word vectors;
the shared encoder module obtains semantic representations of the original speech, the synthesized speech, and the transcribed text;
the shared decoder module obtains inference results using an autoregressive method;
the alignment adapter module takes the semantic representation of the synthesized speech as supervisory information, making the semantic representation of the original speech approach that of the synthesized speech;
define a sample D = (s, s', x, y), where s, s', x, y represent the original speech, the synthesized speech, the transcribed text, and the target translated text respectively, and θ represents the model parameters; given the sample D and model parameters θ, the following loss functions are established:
(1) loss of translating the original speech into the target translated text:
L_ST = -log P(y | s; θ);
(2) loss of recognizing the original speech as the transcribed text:
L_ASR = -log P(x | s; θ);
(3) loss of machine-translating the transcribed text into the target translated text:
L_MT = -log P(y | x; θ);
(4) loss of translating the synthesized speech into the target translated text:
L_ST' = -log P(y | s'; θ);
for the 4 loss functions above, (s, y) denotes the original-speech/target-translation pair, (s', y) the synthesized-speech/target-translation pair, (s, x) the original-speech/transcribed-text pair, and (x, y) the transcribed-text/target-translation pair; P(y | s; θ) denotes the probability of translating s into y under model parameters θ; P(x | s; θ) denotes the probability of recognizing s as x under θ; P(y | x; θ) denotes the probability of translating x into y under θ;
(5) loss of the alignment adapter:
L_Adapter = MSE(Adapter(s), Encoder(s'));
wherein Adapter(s) is the output of s through the alignment adapter, Encoder(s') is the output of s' through the shared encoder, and MSE denotes the mean-square-error loss;
(6) loss of knowledge distillation from synthesized speech to original speech:
P(y_i = v_k | y_{<i}, s; θ, T) = exp(z_{i,k} / T) / Σ_{k'=1}^{|V|} exp(z_{i,k'} / T);
L_KD = -Σ_{i=1}^{N} Σ_{k=1}^{|V|} P(y_i = v_k | y_{<i}, s'; θ, T) · log P(y_i = v_k | y_{<i}, s; θ, T);
wherein z_i is the logits output by the shared decoder module at step i, the logits representing the probability distribution over all words in the vocabulary at position i, with one logits vector produced per decoding step; T is the temperature coefficient; P(y_i = v_k | y_{<i}, s'; θ, T) denotes the probability, given input s' and parameters θ, that the token output at step i is v_k, a token being a specific word in the vocabulary; P(y_i = v_k | y_{<i}, s; θ, T) denotes the corresponding probability for input s; |V| denotes the vocabulary size; N denotes the target translation length; and KD denotes knowledge distillation;
combining the 6 loss functions above, the loss function of the final overall speech translation model is:
L = L_ST + L_ASR + L_MT + L_ST' + L_Adapter + L_KD;
and step 3: translating the input speech to be translated using the trained speech translation model obtained in step 2, and outputting the target translated text.
2. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 1, wherein: step 1 comprises the following steps:
step 1.1: generating the synthesized speech corresponding to the transcribed text of the original speech;
step 1.2: adjusting the sampling rate of the synthesized speech obtained in step 1.1 to be the same as the sampling rate of the original speech;
step 1.3: computing the duration of the synthesized speech obtained in step 1.2 and the duration of the original speech, and computing the compression rate between them, wherein compression rate = duration of synthesized speech / duration of original speech;
step 1.4: filtering out original and synthesized speech whose compression rate does not meet the requirement, and synchronously filtering out the corresponding transcribed and translated texts;
step 1.5: for the original and synthesized speech processed in step 1.4, converting the duration of the synthesized speech to be consistent with the duration of the original speech;
step 1.6: adding the synthesized speech processed in step 1.5 into the original speech translation data to obtain the quadruple speech translation data.
3. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 2, wherein: in step 1.1, the synthesized speech corresponding to the transcribed text of the original speech is generated by means of the espnet2-TTS system.
4. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 2, wherein: in step 1.4, original and synthesized speech with compression rate < 0.4 or > 3 are filtered out, and the corresponding transcribed and translated texts are filtered out at the same time.
5. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 1, wherein: in step 2, the speech encoder module adds two CNN layers after the wav2vec2.0 open-source speech pre-training model.
6. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 1, wherein: in step 2, the shared encoder module adopts a classical Transformer Encoder structure.
7. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 1, wherein: in step 2, the shared decoder module adopts a classical Transformer Decoder structure.
8. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 1, wherein: in step 2, the alignment adapter module adopts a classical Transformer Encoder structure.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310824069.5A CN117252213B (en) | 2023-07-06 | 2023-07-06 | End-to-end speech translation method using synthesized speech as supervision information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310824069.5A CN117252213B (en) | 2023-07-06 | 2023-07-06 | End-to-end speech translation method using synthesized speech as supervision information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117252213A true CN117252213A (en) | 2023-12-19 |
CN117252213B CN117252213B (en) | 2024-05-31 |
Family
ID=89125402
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310824069.5A Active CN117252213B (en) | 2023-07-06 | 2023-07-06 | End-to-end speech translation method using synthesized speech as supervision information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117252213B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2020038343A (en) * | 2018-08-30 | 2020-03-12 | 国立研究開発法人情報通信研究機構 | Method and device for training language identification model, and computer program for it |
CN111326157A (en) * | 2020-01-20 | 2020-06-23 | 北京字节跳动网络技术有限公司 | Text generation method and device, electronic equipment and computer readable medium |
CN112204653A (en) * | 2019-03-29 | 2021-01-08 | 谷歌有限责任公司 | Direct speech-to-speech translation through machine learning |
CN112951213A (en) * | 2021-02-09 | 2021-06-11 | 中国科学院自动化研究所 | End-to-end online voice detection and recognition method, system and equipment |
CN113505611A (en) * | 2021-07-09 | 2021-10-15 | 中国人民解放军战略支援部队信息工程大学 | Training method and system for obtaining better speech translation model in generation of confrontation |
CN114842834A (en) * | 2022-03-31 | 2022-08-02 | 中国科学院自动化研究所 | Voice text joint pre-training method and system |
CN115828943A (en) * | 2022-12-28 | 2023-03-21 | 沈阳雅译网络技术有限公司 | Speech translation model modeling method and device based on speech synthesis data |
CN115985298A (en) * | 2022-12-20 | 2023-04-18 | 沈阳雅译网络技术有限公司 | End-to-end speech translation method based on automatic alignment, mixing and self-training of speech texts |
CN116227503A (en) * | 2023-01-06 | 2023-06-06 | 沈阳雅译网络技术有限公司 | CTC-based non-autoregressive end-to-end speech translation method |
- 2023-07-06: CN CN202310824069.5A, patent CN117252213B (en), status: active
Non-Patent Citations (2)
Title |
---|
YUCHEN LIU ET AL.: "End-to-End Speech Translation with Knowledge Distillation", arXiv, 17 April 2019 (2019-04-17) *
ZOU JIANYUN: "Research on End-to-End Real-Time Speech Translation Based on Transformer Transducer", China Master's Theses Full-text Database (Information Science and Technology), no. 6, 15 June 2022 (2022-06-15) *
Also Published As
Publication number | Publication date |
---|---|
CN117252213B (en) | 2024-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107545903B (en) | Voice conversion method based on deep learning | |
WO2022048405A1 (en) | Text-based virtual object animation generation method, apparatus, storage medium, and terminal | |
CN112767958B (en) | Zero-order learning-based cross-language tone conversion system and method | |
Zhang et al. | Joint training framework for text-to-speech and voice conversion using multi-source tacotron and wavenet | |
Zhao et al. | Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams. | |
Zhou et al. | Converting anyone's emotion: Towards speaker-independent emotional voice conversion | |
CN108777140A (en) | Phonetics transfer method based on VAE under a kind of training of non-parallel corpus | |
CN114023316A (en) | TCN-Transformer-CTC-based end-to-end Chinese voice recognition method | |
Liu et al. | Voice conversion with transformer network | |
Luong et al. | Bootstrapping non-parallel voice conversion from speaker-adaptive text-to-speech | |
CN112420050B (en) | Voice recognition method and device and electronic equipment | |
CN111009235A (en) | Voice recognition method based on CLDNN + CTC acoustic model | |
CN111460143A (en) | Emotion recognition model of multi-person conversation system | |
Ueno et al. | Data augmentation for asr using tts via a discrete representation | |
An et al. | Speech Emotion Recognition algorithm based on deep learning algorithm fusion of temporal and spatial features | |
Wang et al. | Speech augmentation using wavenet in speech recognition | |
Dai et al. | Cloning one’s voice using very limited data in the wild | |
Moritani et al. | Stargan-based emotional voice conversion for japanese phrases | |
US11715457B1 (en) | Real time correction of accent in speech audio signals | |
CN117252213B (en) | End-to-end speech translation method using synthesized speech as supervision information | |
US20230317059A1 (en) | Alignment Prediction to Inject Text into Automatic Speech Recognition Training | |
Akuzawa et al. | Conditional deep hierarchical variational autoencoder for voice conversion | |
CN115376533A (en) | Voice conversion method for personalized voice generation | |
US20240005907A1 (en) | Real time correction of accent in speech audio signals | |
CN113852851B (en) | Rapid lip movement-voice alignment method based on parallel flow model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |