CN117252213B - End-to-end speech translation method using synthesized speech as supervision information - Google Patents

End-to-end speech translation method using synthesized speech as supervision information

Info

Publication number
CN117252213B
CN117252213B (application CN202310824069.5A)
Authority
CN
China
Prior art keywords
speech
voice
translation
original
synthesized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310824069.5A
Other languages
Chinese (zh)
Other versions
CN117252213A (en)
Inventor
熊德意
薛征山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202310824069.5A priority Critical patent/CN117252213B/en
Publication of CN117252213A publication Critical patent/CN117252213A/en
Application granted granted Critical
Publication of CN117252213B publication Critical patent/CN117252213B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/27Regression, e.g. linear or logistic regression
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • G06F40/35Discourse or dialogue representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/42Data-driven translation
    • G06F40/45Example-based machine translation; Alignment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/096Transfer learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/1822Parsing for meaning understanding
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04Time compression or expansion
    • G10L21/043Time compression or expansion by changing speed
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an end-to-end speech translation method using synthesized speech as supervision information. The method first preprocesses the triplet original speech translation data to be translated to obtain quadruple speech translation data containing synthesized speech; it then constructs a speech translation model and trains it using the quadruple speech translation data as samples, wherein an alignment adapter module is designed to take the semantic representation of the synthesized speech as supervision information so that the semantic representation of the original speech approaches that of the synthesized speech; at the same time, on the shared decoder side, the logits distribution of the synthesized speech is distilled onto the logits distribution of the original speech. Finally, the trained speech translation model translates the input speech to be translated and outputs the target translated text. The invention uses standard synthesized speech as supervision information and integrates it into the speech translation model training framework; during training, this supervision signal guides the end-to-end training of original speech translation, thereby improving the translation effect.

Description

End-to-end speech translation method using synthesized speech as supervision information
Technical Field
The invention relates to the technical field of speech translation, and in particular to an end-to-end speech translation method using synthesized speech as supervision information.
Background
Speech translation techniques include two types: (1) Speech-to-Speech translation (S2S), which automatically translates audio signals in one language into audio signals in another language; and (2) Speech-to-Text translation (S2T), which automatically translates audio signals in one language into text in another language. Both technologies are widely applied in simultaneous interpretation systems, such as the iFlytek and Baidu simultaneous interpretation products.
This patent belongs to the S2T technology, i.e., converting source-language speech into target-language text. Depending on the implementation, speech translation can be classified into cascaded (tandem) speech translation and end-to-end speech translation. Cascaded speech translation concatenates automatic speech recognition (ASR) and machine translation (MT): the source speech is first recognized as source text, and the source text is then translated into target text. The advantages of this scheme are: (1) speech recognition and machine translation can be optimized independently, which reduces the difficulty of the speech translation task; (2) ASR and MT both have abundant data, so speech recognition and machine translation each achieve good results on their own. However, the disadvantages of this approach are also apparent: (1) error propagation: if the source text produced by the speech recognition model contains errors, these errors are likely to be amplified during translation, causing large deviations in the final translation result; (2) high latency: the speech recognition model and the text translation model can only run serially, so the latency is high and the translation efficiency is relatively low, which matters especially in real-time speech translation scenarios with strict efficiency requirements; moreover, since cascaded systems in practical scenarios often decompose the task further and add intermediate processing modules, the overall performance of the system may improve, but the latency increases further and the translation efficiency drops; (3) loss of speech information: in the process of recognizing speech as text, information contained in the speech such as mood, emotion and intonation is lost, because such information is usually not expressed in textual form; the same sentence spoken in different ways may convey different meanings, and this information is also helpful for translation. End-to-end speech translation, in contrast to cascaded speech translation, directly models the conversion from speech to target text and is a current research focus in the industry. Its advantages are: (1) error propagation is avoided; (2) latency is significantly reduced; (3) model deployment is lightweight. Its disadvantages are: (1) high modeling complexity, since cross-modal conversion is involved; (2) training data is scarce, which is the biggest bottleneck affecting end-to-end speech translation.
Current technical schemes for end-to-end speech translation (End-to-End S2T) are mainly based on multi-task learning, knowledge distillation (KD), speech-text mixup learning (speech-text manifold mixup learning), contrastive learning, and so on. These schemes can effectively improve the end-to-end speech translation effect. Multi-task learning adds extra training objectives to guide the learning of the model; by introducing knowledge from other models it provides supervised learning signals that assist the training of the speech translation model, alleviating the problem of insufficient speech translation training data. Knowledge distillation distills the knowledge of a text translation model, which performs relatively well, onto the speech translation model, which is more complex and performs relatively poorly, thereby improving speech translation. Speech-text mixup learning is essentially a data augmentation technique that improves speech translation by constructing more data. Contrastive learning pulls mutually translated sentence pairs closer within the same training batch while pushing non-translation pairs apart, thereby improving the speech translation effect. None of the above techniques considers the robustness of speech translation; they improve the model's translation results mainly from the perspective of closing the gap between modalities.
Robustness is an important research direction in speech translation. The robustness problem is mainly manifested as follows: for the same text, the model's translation results are not fully consistent when the sentence is spoken by different people, or even when the same person speaks it on different occasions. What robustness aims to ensure is that, for speech corresponding to the same text, whether spoken by different people or by the same person, the translation results remain consistent, or remain of high quality, under various scene conditions. Different people, or the same person in different situations, may produce the same sentence with different durations, pauses, and so on; these variations are collectively referred to herein as different timbres, and they affect the robustness of speech translation.
Disclosure of Invention
The invention aims to solve the technical problem of robustness in end-to-end speech translation and provides an end-to-end speech translation method using synthesized speech as supervision information. The method uses standard synthesized speech as supervision information and integrates it into the speech translation model training framework; during training, this supervision signal guides the end-to-end training of original speech translation, thereby improving the speech translation effect.
The technical scheme adopted for realizing the purpose of the invention is as follows:
an end-to-end speech translation method using synthesized speech as supervisory information, comprising the steps of:
Step 1: preprocessing the triplet original voice translation data to be translated to obtain quadruple voice translation data; wherein the triples are: original voice, transcribed text corresponding to the original voice and translated text corresponding to the original voice, and four-tuple is: the original voice, the synthesized voice corresponding to the transcribed text, the transcribed text and the translated text;
Step 2: constructing a voice translation model, and training the voice translation model by using the voice translation data of the four-element group obtained in the step 1 as a sample;
The speech translation model includes 5 modules: a speech encoder module, a text encoder module, a shared encoder module, an alignment adapter module, and a shared decoder module;
a speech encoder module for encoding the original speech and the synthesized speech in the quadruple speech translation data into real-valued vectors representing the speech features within the model;
a text encoder module for encoding the transcribed text in the quadruple speech translation data into word vectors;
a shared encoder module for obtaining the semantic representations of the original speech, the synthesized speech and the transcribed text;
a shared decoder module for obtaining the inference result using an autoregressive method;
an alignment adapter module for using the semantic representation of the synthesized speech as supervision information so that the semantic representation of the original speech approaches the semantic representation of the synthesized speech;
Define a sample D = (s, s′, x, y), where s, s′, x, y denote the original speech, the synthesized speech, the transcribed text and the target translated text respectively, and θ denotes the model parameters. Given the sample D and the model parameters θ, the following loss functions are established:
1. Loss of translating the original speech into the target translated text, L_ST(D;θ):
L_ST(D;θ) = -∑_{(s,y)∈D} log P(y|s;θ)
2. Loss of recognizing the original speech as the transcribed text, L_ASR(D;θ):
L_ASR(D;θ) = -∑_{(s,x)∈D} log P(x|s;θ)
3. Loss of machine-translating the transcribed text into the target translated text, L_MT(D;θ):
L_MT(D;θ) = -∑_{(x,y)∈D} log P(y|x;θ)
4. Loss of translating the synthesized speech into the target translated text, L_ST′(D;θ):
L_ST′(D;θ) = -∑_{(s′,y)∈D} log P(y|s′;θ)
For the four loss functions above, (s,y) denotes an original speech and target translated text pair, (s′,y) denotes a synthesized speech and target translated text pair, (s,x) denotes an original speech and transcribed text pair, and (x,y) denotes a transcribed text and target translated text pair; P(y|s;θ) denotes the probability of translating s into y under the model parameters θ; P(x|s;θ) denotes the probability of recognizing s as x under the model parameters θ; P(y|x;θ) denotes the probability of translating x into y under the model parameters θ;
5. Loss of the alignment adapter, L_align(D;θ):
L_align(D;θ) = ∑_{(s,s′)∈D} MSE(s*, s′*)
where s* denotes the output of s through the alignment adapter, s′* denotes the output of s′ through the shared encoder, and MSE denotes the mean square error loss;
6. Knowledge distillation loss from the synthesized speech to the original speech, L_KD(D;θ):
L_KD(D;θ) = -∑_{(s,s′,y)∈D} ∑_{t=1}^{N} ∑_{k=1}^{|V|} P(y_t=k | y_<t, s′; θ) · log P(y_t=k | y_<t, s; θ)
where logits_i is the output of the shared decoder at step i and represents the probability distribution over all words in the vocabulary at position i, each step producing one logits vector; τ is the temperature coefficient by which the logits are divided before the softmax; P(y_t=k | y_<t, s; θ) denotes the probability that, with input s and model parameters θ, the output token at step t is k, a token being a specific word in the logits; P(y_t=k | y_<t, s′; θ) denotes the corresponding probability with input s′; |V| denotes the vocabulary size, N denotes the target translation length, and KD stands for knowledge distillation;
Combining the above 6 loss functions, the loss function of the final overall speech translation model is L(D;θ):
L(D;θ) = L_ST(D;θ) + L_ASR(D;θ) + L_MT(D;θ) + L_ST′(D;θ) + L_align(D;θ) + L_KD(D;θ)
Step 3: translating the input speech to be translated by using the trained speech translation model obtained in step 2, and outputting the target translated text.
In the above technical solution, step 1 includes the following steps:
Step 1.1: generating synthesized speech data corresponding to the transcribed text of the original speech;
Step 1.2: adjusting the sampling rate of the synthesized speech obtained in step 1.1 to be the same as the sampling rate of the original speech;
Step 1.3: calculating the duration of the synthesized speech obtained in step 1.2 and the duration of the original speech, and calculating the compression rate between them, where compression rate = duration of the synthesized speech / duration of the original speech;
Step 1.4: filtering out the original speech and synthesized speech whose compression rate does not meet the requirement, and synchronously filtering out the corresponding transcribed text and translated text;
Step 1.5: for the original speech and synthesized speech processed in step 1.4, converting the duration of the synthesized speech to be consistent with the duration of the original speech;
Step 1.6: adding the synthesized speech processed in step 1.5 into the original speech translation data to obtain the quadruple speech translation data.
In the above technical solution, in step 1.1, the synthesized speech corresponding to the transcribed text of the original speech is generated by means of an espnet-TTS system.
In the above technical solution, in step 1.4, the original speech and the synthesized speech with compression rate <0.4 or compression rate >3 are filtered out, and the corresponding transcribed text and translated text are filtered out simultaneously.
In the above technical solution, in step 2, the speech encoder module adds two layers of CNNs after the wav2vec2.0 open source speech pre-training model.
In the above technical solution, in step 2, the shared encoder module adopts a classical Transformer Encoder structure.
In the above technical solution, in step 2, the shared decoder module adopts a classical Transformer Decoder structure.
In the above technical solution, in step 2, the alignment adapter module adopts a classical Transformer Encoder structure.
Compared with the prior art, the invention has the beneficial effects that:
The invention uses standard synthesized speech as supervision information and integrates it into the speech translation model training framework; during training, this supervision signal guides the end-to-end training of original speech translation, thereby improving the translation effect.
Drawings
Fig. 1 is a flow chart of an end-to-end speech translation method of the present invention using synthesized speech as supervisory information.
FIG. 2 is a schematic diagram of a speech translation model according to the present invention.
Other relevant drawings may be made by those of ordinary skill in the art from the above figures without undue burden.
Detailed Description
The present invention will be described in further detail with reference to specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, an end-to-end speech translation method using synthesized speech as supervisory information includes the steps of:
Step 1: preprocessing the triplet original voice translation data to be translated to obtain the voice translation data of the quadruple. The method comprises the following specific steps:
Step 1.1: the original speech translation data is a triplet structure, and the three tuples are: the method comprises the steps of firstly, synthesizing corresponding voices from the original voices, transcribed texts corresponding to the original voices and translated texts corresponding to the original voices by a text-to-voice synthesis system, namely, generating synthesized voice data corresponding to the transcribed texts corresponding to the original voices. In this embodiment, an open source espnet-TTS system is used, and a synthetic speech corresponding to the transcribed text corresponding to the original speech is generated by means of the espnet-TTS system.
Step 1.2: since the sample rate of the synthesized speech generated by espnet2-TTS system is 22k, which is different from the sample rate (16 k) of the original speech, the synthesized speech sample rate obtained in step 1.1 needs to be converted into 16k by a ffmpeg tool.
Step 1.3: and (2) calculating the duration of the synthesized voice and the duration of the original voice obtained in the step (1.2), and calculating the compression rate between the synthesized voice and the original voice, wherein the compression rate=the duration of the synthesized voice/the duration of the original voice.
Step 1.4: the original speech with compression <0.4 and compression >3 is filtered out with the synthesized speech, and the corresponding transcribed text and translated text are filtered out simultaneously.
Step 1.5: for the original speech processed in step 1.4 and the synthesized speech, the duration of the synthesized speech is converted to be identical to the speech duration of the original speech by the ffmpeg tool (ideally, the duration of the synthesized speech is completely identical, but the conversion process of the ffmpeg tool has a loss, so that the final duration is not completely identical, and has a small gap, which is about 1% or less, and if the duration of the original speech is about 1 second, the duration of the synthesized speech is about 1.00x or 0.99x seconds).
The ffmpeg tool lengthens or shortens speech through the atempo filter, whose parameter value must lie between 0.5 and 100. Therefore, when the compression rate is, for example, 0.49, which is smaller than 0.5, a single atempo parameter is not enough and several atempo filters must be chained; in this case the invention takes the square root of the compression rate (√0.49 = 0.7) as the value of the atempo parameter and stacks two atempo filters (atempo=0.7,atempo=0.7).
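For concreteness, the following minimal Python sketch covers steps 1.2 to 1.5: it resamples the synthesized speech to 16 kHz, computes the compression rate, filters pairs outside the 0.4–3 range, and applies a stacked atempo chain so each factor stays within ffmpeg's allowed 0.5–100 interval. It assumes ffmpeg and ffprobe are installed; the function names and file handling are illustrative, not part of the patent.

import math
import subprocess

def audio_duration(path: str) -> float:
    """Return the duration of an audio file in seconds using ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

def atempo_chain(ratio: float) -> str:
    """Split a tempo ratio into stacked ffmpeg atempo factors, each in [0.5, 100]."""
    n = 1
    while not 0.5 <= ratio ** (1.0 / n) <= 100.0:
        n += 1
    factor = ratio ** (1.0 / n)            # e.g. ratio 0.49 -> two factors of 0.7
    return ",".join(f"atempo={factor:.4f}" for _ in range(n))

def preprocess_pair(orig_wav: str, synth_wav: str, out_wav: str) -> bool:
    """Steps 1.2-1.5: resample the synthesized speech to 16 kHz, filter by
    compression rate, and stretch/compress it to match the original duration.
    Returns False if the quadruple should be discarded (step 1.4)."""
    resampled = synth_wav.replace(".wav", "_16k.wav")
    subprocess.run(["ffmpeg", "-y", "-i", synth_wav, "-ar", "16000", resampled],
                   check=True, capture_output=True)
    ratio = audio_duration(resampled) / audio_duration(orig_wav)  # compression rate
    if ratio < 0.4 or ratio > 3.0:
        return False
    # atempo > 1 speeds the audio up (shortens it), so the factor is the ratio itself
    subprocess.run(["ffmpeg", "-y", "-i", resampled,
                    "-filter:a", atempo_chain(ratio), out_wav],
                   check=True, capture_output=True)
    return True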
Step 1.6: adding the synthesized voice processed in the step 1.5 into original voice translation data to obtain voice translation data of four tuples, wherein the four tuples are respectively: original speech, synthesized speech, transcribed text, and translated text.
Step 2: and (3) constructing a voice translation model, and training the voice translation model by using the voice translation data of the four-element group obtained in the step (1) as a sample.
Referring to fig. 2, the speech translation model includes 5 modules: a speech encoder module, a text encoder module, a shared encoder module, an alignment adapter module, and a shared decoder module.
A speech encoder module: its function is to encode the original speech and the synthesized speech in the quadruple speech translation data into real-valued vectors, which represent the speech features within the model and participate in the subsequent model training. Concretely, the speech encoder module adds two layers of CNN (Convolutional Neural Network) after the wav2vec2.0 open-source speech pre-training model.
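A minimal PyTorch sketch of a speech encoder of this shape, using the HuggingFace transformers implementation of wav2vec2.0; the checkpoint name, the kernel sizes and strides of the two CNN layers, and the output dimension are illustrative assumptions, since the patent does not specify them.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SpeechEncoder(nn.Module):
    """Speech encoder module: wav2vec2.0 followed by two 1-D CNN layers that
    project and downsample the feature sequence before the shared encoder."""
    def __init__(self, d_model: int = 512, pretrained: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.wav2vec = Wav2Vec2Model.from_pretrained(pretrained)
        hidden = self.wav2vec.config.hidden_size
        # two stacked CNN layers; kernel size 5 and stride 2 are assumed values
        self.cnn = nn.Sequential(
            nn.Conv1d(hidden, d_model, kernel_size=5, stride=2, padding=2), nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2), nn.GELU(),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        feats = self.wav2vec(waveform).last_hidden_state          # (batch, frames, hidden)
        feats = self.cnn(feats.transpose(1, 2)).transpose(1, 2)   # (batch, frames/4, d_model)
        return feats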
A text encoder module: its function is to encode the transcribed text in the quadruple speech translation data into word vectors.
A shared encoder module: a classical Transformer Encoder structure is adopted, and its function is to obtain the semantic representations of the original speech, the synthesized speech and the transcribed text, with parameters layers (number of encoder layers) = 6 and multi-heads (number of attention heads) = 8.
A shared decoder module: a classical Transformer Decoder structure is adopted, with parameters layers (number of decoder layers) = 6 and multi-heads (number of attention heads) = 8. Its function is to obtain the inference result using an autoregressive method.
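As an illustration of the autoregressive inference performed by the shared decoder, a greedy decoding loop might look as follows; the helper objects (embedding, output projection), the use of PyTorch's nn.TransformerDecoder built with batch_first=True layers, and the omission of positional encodings are simplifying assumptions, not part of the patent.

import torch
import torch.nn as nn

def greedy_decode(decoder: nn.TransformerDecoder, memory: torch.Tensor,
                  embed: nn.Embedding, proj: nn.Linear,
                  bos_id: int, eos_id: int, max_len: int = 128) -> list:
    """Autoregressive (greedy) inference: feed the tokens generated so far,
    pick the most probable next token from the decoder logits, stop at EOS.
    memory is the shared-encoder output, shape (1, src_frames, d_model)."""
    tokens = [bos_id]
    for _ in range(max_len):
        tgt = embed(torch.tensor(tokens).unsqueeze(0))        # (1, t, d_model)
        t = len(tokens)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        out = decoder(tgt, memory, tgt_mask=causal)           # (1, t, d_model)
        logits = proj(out[:, -1])                             # logits for the next position
        next_id = int(logits.argmax(dim=-1))
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens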
Alignment adapter module: it has the same structure as the shared encoder, i.e., a classical Transformer Encoder structure. Its function is to take the semantic representation of the synthesized speech as supervision information and make the semantic representation of the original speech approach that of the synthesized speech, thereby eliminating the negative influence caused by speech variations across different speakers or within the same speaker (that is, the semantic representations of different speakers, or of the same speaker, tend toward consistency).
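A minimal sketch of the alignment adapter and its supervision signal; it assumes the duration matching in step 1.5 keeps the original and synthesized frame sequences the same length, and detaching the synthesized-speech branch so it acts purely as a teacher is an additional assumption not stated in the patent.

import torch
import torch.nn as nn

class AlignmentAdapter(nn.Module):
    """Alignment adapter: a Transformer encoder stack applied to the original-speech
    representation, whose output is pulled toward the synthesized-speech representation."""
    def __init__(self, d_model: int = 512, nhead: int = 8, num_layers: int = 6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)

def alignment_loss(orig_hidden: torch.Tensor, synth_hidden: torch.Tensor,
                   adapter: AlignmentAdapter) -> torch.Tensor:
    """L_align = MSE(s*, s'*): orig_hidden and synth_hidden are the shared-encoder
    outputs for the original and synthesized speech, shape (batch, frames, d_model)."""
    s_star = adapter(orig_hidden)
    # the synthesized branch is detached here so it serves only as supervision
    return nn.functional.mse_loss(s_star, synth_hidden.detach())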
Define a sample D = (s, s′, x, y), where s, s′, x, y denote the original speech, the synthesized speech, the transcribed text and the target translated text respectively, and θ denotes the model parameters. Given the sample D and the model parameters θ, the following loss functions are established:
1. Loss of translating the original speech into the target translated text, L_ST(D;θ):
L_ST(D;θ) = -∑_{(s,y)∈D} log P(y|s;θ)
2. Loss of recognizing the original speech as the transcribed text, L_ASR(D;θ):
L_ASR(D;θ) = -∑_{(s,x)∈D} log P(x|s;θ)
3. Loss of machine-translating the transcribed text into the target translated text, L_MT(D;θ):
L_MT(D;θ) = -∑_{(x,y)∈D} log P(y|x;θ)
4. Loss of translating the synthesized speech into the target translated text, L_ST′(D;θ):
L_ST′(D;θ) = -∑_{(s′,y)∈D} log P(y|s′;θ)
For the above loss functions, (s,y) denotes an original speech and target translated text pair, (s′,y) denotes a synthesized speech and target translated text pair, (s,x) denotes an original speech and transcribed text pair, and (x,y) denotes a transcribed text and target translated text pair; P(y|s;θ) denotes the probability of translating s into y under the model parameters θ; P(x|s;θ) denotes the probability of recognizing s as x under the model parameters θ; P(y|x;θ) denotes the probability of translating x into y under the model parameters θ; ST stands for Speech Translation, ASR for Automatic Speech Recognition, and MT for Machine Translation;
5. Loss of the alignment adapter, L_align(D;θ):
L_align(D;θ) = ∑_{(s,s′)∈D} MSE(s*, s′*)
where s* denotes the output of s through the alignment adapter, s′* denotes the output of s′ through the shared encoder, and MSE (mean squared error) denotes the mean square error loss;
6. Knowledge distillation loss from the synthesized speech to the original speech, L_KD(D;θ):
L_KD(D;θ) = -∑_{(s,s′,y)∈D} ∑_{t=1}^{N} ∑_{k=1}^{|V|} P(y_t=k | y_<t, s′; θ) · log P(y_t=k | y_<t, s; θ)
where logits_i is the output of the shared decoder at step i and represents the probability distribution over all words in the vocabulary at position i (during inference, each step produces one logits vector, and a token refers to a specific word in the logits); τ is the temperature coefficient by which the logits are divided before the softmax; P(y_t=k | y_<t, s; θ) denotes the probability that, with input s and model parameters θ, the output token at step t is k; P(y_t=k | y_<t, s′; θ) denotes the corresponding probability with input s′; |V| denotes the vocabulary size, N denotes the target translation length, and KD stands for Knowledge Distillation.
Combining the above 6 loss functions, the loss function of the final overall speech translation model is L(D;θ):
L(D;θ) = L_ST(D;θ) + L_ASR(D;θ) + L_MT(D;θ) + L_ST′(D;θ) + L_align(D;θ) + L_KD(D;θ)
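A minimal sketch of the knowledge-distillation term and the unweighted sum of the six losses, assuming the shared decoder's logits for the original and synthesized speech have already been computed with teacher forcing over the same target prefix; averaging over the batch and target positions is a normalization choice the patent does not fix.

import torch
import torch.nn.functional as F

def kd_loss(orig_logits: torch.Tensor, synth_logits: torch.Tensor,
            tau: float = 1.0) -> torch.Tensor:
    """L_KD: distill the decoder logits on the synthesized speech (teacher) onto the
    logits on the original speech (student), with temperature tau.
    Both tensors have shape (batch, target_len, vocab_size)."""
    teacher = F.softmax(synth_logits.detach() / tau, dim=-1)
    student = F.log_softmax(orig_logits / tau, dim=-1)
    # cross-entropy of the student against the teacher distribution over the vocabulary
    return -(teacher * student).sum(dim=-1).mean()

def total_loss(l_st, l_asr, l_mt, l_st_synth, l_align, l_kd):
    """L(D;θ) = L_ST + L_ASR + L_MT + L_ST' + L_align + L_KD (unweighted sum)."""
    return l_st + l_asr + l_mt + l_st_synth + l_align + l_kd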
Step 3: translate the input speech to be translated by using the trained speech translation model obtained in step 2, and output the target translated text.
The foregoing has described exemplary embodiments of the invention, it being understood that those skilled in the art may make simple variations, modifications, or other equivalent arrangements without undue burden and without departing from the spirit of the invention.

Claims (8)

1. An end-to-end speech translation method using synthesized speech as supervisory information, comprising the steps of:
Step 1: preprocessing the triplet original voice translation data to be translated to obtain quadruple voice translation data; wherein the triples are: original voice, transcribed text corresponding to the original voice and translated text corresponding to the original voice, and four-tuple is: original speech, synthesized speech, transcribed text, and translated text;
Step 2: constructing a voice translation model, and training the voice translation model by using the voice translation data of the four-element group obtained in the step 1 as a sample;
The speech translation model includes 5 modules: a speech encoder module, a text encoder module, a shared encoder module, an alignment adapter module, and a shared decoder module;
a speech encoder module for encoding the original speech and the synthesized speech in the quadruple speech translation data into real-valued vectors representing the speech features within the model;
a text encoder module for encoding the transcribed text in the quadruple speech translation data into word vectors;
a shared encoder module for obtaining the semantic representations of the original speech, the synthesized speech and the transcribed text;
a shared decoder module for obtaining the inference result using an autoregressive method;
an alignment adapter module for using the semantic representation of the synthesized speech as supervision information so that the semantic representation of the original speech approaches the semantic representation of the synthesized speech;
Defining a sample D=(s, s′, x, y), s, s′, x, y respectively representing an original speech, a synthesized speech, a transcribed text and a target translated text, and θ representing model parameters, and under the conditions of the sample D and the model parameters θ, establishing the following loss functions:
1. Loss of original speech translation to target translation text L ST (D; θ):
LST(D;θ)=-∑(s,y)∈DlogP(y|s;θ)
2. original speech is recognized as a loss of transcribed text L ASR (D; θ):
LASR(D;θ)=-∑(s,x)∈DlogP(x|s;θ)
3. loss of machine translation of transcribed text to target translated text L MT (D; θ):
LMT(D;θ)=-∑(x,y)∈DlogP(y|x;θ)
4. Loss of translation of synthesized speech to target translation text L ST′ (D; θ):
LST′(D;θ)=-∑(s′,y)∈DlogP(y|s′;θ)
for the 4 loss functions described above, where (s, y) represents the original speech and target translation pair, (s′, y) represents the synthesized speech and target translation pair, (s, x) represents the original speech and transcription pair, and (x, y) represents the transcription and target translation pair; P(y|s;θ) represents the probability of s translating to y under the model parameters θ; P(x|s;θ) represents the probability that s is recognized as x under the model parameters θ; P(y|x;θ) represents the probability of x translating to y under the model parameters θ;
5. loss of the alignment adapter L align (D; θ):
Lalign(D;θ)=∑(s,s′)∈DMSE(s*,s′*)
Where s represents the original speech, s′ represents the synthesized speech, s* represents the output of s through the alignment adapter, s′* represents the output of s′ through the shared encoder, and MSE represents the mean square error loss;
6. knowledge distillation loss from the synthesized speech to the original speech L KD (D; θ):
LKD(D;θ)=-∑(s,s′,y)∈D∑t=1..N∑k=1..V P(yt=k|y<t,s′,θ)logP(yt=k|y<t,s,θ)
Wherein logits represents the probability distribution of all words in the vocabulary at position i, each step yielding one logits; τ is the temperature coefficient by which the logits are divided before the softmax; P(yt=k|y<t,s,θ) represents the probability that the input is s, the model parameter is θ, and the output token at time t is k, the token being a specific word in logits; P(yt=k|y<t,s′,θ) represents the probability of the output token being k at input s′, model parameter θ and time t; V represents the vocabulary size, N represents the target translation length, KD represents knowledge distillation;
In combination with the above 6 loss functions, the final loss function of the whole speech translation model is L (D; θ):
L(D;θ)=LST(D;θ)+LASR(D;θ)+LMT(D;θ)+LST′(D;θ)+Lalign(D;θ)+LKD(D;θ)
Step 3: translating the input speech to be translated by using the trained speech translation model obtained in step 2, and outputting the target translated text.
2. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 1, wherein: step 1 comprises the following steps:
Step 1.1: generating synthesized speech data corresponding to the transcribed text of the original speech;
Step 1.2: adjusting the sampling rate of the synthesized speech obtained in step 1.1 to be the same as the sampling rate of the original speech;
Step 1.3: calculating the duration of the synthesized speech obtained in step 1.2 and the duration of the original speech, and calculating the compression rate between them, where compression rate = duration of the synthesized speech / duration of the original speech;
Step 1.4: filtering out the original speech and synthesized speech whose compression rate does not meet the requirement, and synchronously filtering out the corresponding transcribed text and translated text;
Step 1.5: for the original speech and synthesized speech processed in step 1.4, converting the duration of the synthesized speech to be consistent with the duration of the original speech;
Step 1.6: adding the synthesized speech processed in step 1.5 into the original speech translation data to obtain the quadruple speech translation data.
3. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 2, wherein: in step 1.1, the synthesized speech corresponding to the transcribed text of the original speech is generated by means of an espnet-TTS system.
4. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 2, wherein: in step 1.4, the original speech and the synthesized speech with compression rate <0.4 or compression rate >3 are filtered out, and the corresponding transcribed text and translated text are filtered out simultaneously.
5. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 1, wherein: in step 2, the speech coder module adds two layers of CNN after the wav2vec2.0 open source speech pre-training model.
6. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 1, wherein: in step 2, the shared encoder module adopts a classical Transformer Encoder structure.
7. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 1, wherein: in step 2, the shared decoder module adopts a classical Transformer Decoder structure.
8. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 1, wherein: in step 2, the alignment adapter module adopts a classical Transformer Encoder structure.
CN202310824069.5A 2023-07-06 2023-07-06 End-to-end speech translation method using synthesized speech as supervision information Active CN117252213B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310824069.5A CN117252213B (en) 2023-07-06 2023-07-06 End-to-end speech translation method using synthesized speech as supervision information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310824069.5A CN117252213B (en) 2023-07-06 2023-07-06 End-to-end speech translation method using synthesized speech as supervision information

Publications (2)

Publication Number Publication Date
CN117252213A CN117252213A (en) 2023-12-19
CN117252213B true CN117252213B (en) 2024-05-31

Family

ID=89125402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310824069.5A Active CN117252213B (en) 2023-07-06 2023-07-06 End-to-end speech translation method using synthesized speech as supervision information

Country Status (1)

Country Link
CN (1) CN117252213B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020038343A (en) * 2018-08-30 2020-03-12 国立研究開発法人情報通信研究機構 Method and device for training language identification model, and computer program for it
CN111326157A (en) * 2020-01-20 2020-06-23 北京字节跳动网络技术有限公司 Text generation method and device, electronic equipment and computer readable medium
CN112204653A (en) * 2019-03-29 2021-01-08 谷歌有限责任公司 Direct speech-to-speech translation through machine learning
CN112951213A (en) * 2021-02-09 2021-06-11 中国科学院自动化研究所 End-to-end online voice detection and recognition method, system and equipment
CN113505611A (en) * 2021-07-09 2021-10-15 中国人民解放军战略支援部队信息工程大学 Training method and system for obtaining better speech translation model in generation of confrontation
CN114842834A (en) * 2022-03-31 2022-08-02 中国科学院自动化研究所 Voice text joint pre-training method and system
CN115828943A (en) * 2022-12-28 2023-03-21 沈阳雅译网络技术有限公司 Speech translation model modeling method and device based on speech synthesis data
CN115985298A (en) * 2022-12-20 2023-04-18 沈阳雅译网络技术有限公司 End-to-end speech translation method based on automatic alignment, mixing and self-training of speech texts
CN116227503A (en) * 2023-01-06 2023-06-06 沈阳雅译网络技术有限公司 CTC-based non-autoregressive end-to-end speech translation method

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020038343A (en) * 2018-08-30 2020-03-12 国立研究開発法人情報通信研究機構 Method and device for training language identification model, and computer program for it
CN112204653A (en) * 2019-03-29 2021-01-08 谷歌有限责任公司 Direct speech-to-speech translation through machine learning
CN111326157A (en) * 2020-01-20 2020-06-23 北京字节跳动网络技术有限公司 Text generation method and device, electronic equipment and computer readable medium
CN112951213A (en) * 2021-02-09 2021-06-11 中国科学院自动化研究所 End-to-end online voice detection and recognition method, system and equipment
CN113505611A (en) * 2021-07-09 2021-10-15 中国人民解放军战略支援部队信息工程大学 Training method and system for obtaining better speech translation model in generation of confrontation
CN114842834A (en) * 2022-03-31 2022-08-02 中国科学院自动化研究所 Voice text joint pre-training method and system
CN115985298A (en) * 2022-12-20 2023-04-18 沈阳雅译网络技术有限公司 End-to-end speech translation method based on automatic alignment, mixing and self-training of speech texts
CN115828943A (en) * 2022-12-28 2023-03-21 沈阳雅译网络技术有限公司 Speech translation model modeling method and device based on speech synthesis data
CN116227503A (en) * 2023-01-06 2023-06-06 沈阳雅译网络技术有限公司 CTC-based non-autoregressive end-to-end speech translation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
End-to-End Speech Translation with Knowledge Distillation; Yuchen Liu et al.; arXiv; 2019-04-17; full text *
Research on End-to-End Real-Time Speech Translation Based on Transformer Transducer; Zou Jianyun (邹剑云); China Master's Theses Full-text Database (Information Science and Technology); 2022-06-15 (No. 6); full text *

Also Published As

Publication number Publication date
CN117252213A (en) 2023-12-19

Similar Documents

Publication Publication Date Title
CN107545903B (en) Voice conversion method based on deep learning
Zhang et al. Joint training framework for text-to-speech and voice conversion using multi-source tacotron and wavenet
CN112767958B (en) Zero-order learning-based cross-language tone conversion system and method
WO2022048405A1 (en) Text-based virtual object animation generation method, apparatus, storage medium, and terminal
Kameoka et al. ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion
CN108777140A (en) Phonetics transfer method based on VAE under a kind of training of non-parallel corpus
Zhao et al. Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams.
Zhou et al. Converting anyone's emotion: Towards speaker-independent emotional voice conversion
CN114023316A (en) TCN-Transformer-CTC-based end-to-end Chinese voice recognition method
Liu et al. Voice conversion with transformer network
Luong et al. Bootstrapping non-parallel voice conversion from speaker-adaptive text-to-speech
CN111009235A (en) Voice recognition method based on CLDNN + CTC acoustic model
Kameoka et al. Fasts2s-vc: Streaming non-autoregressive sequence-to-sequence voice conversion
An et al. Speech Emotion Recognition algorithm based on deep learning algorithm fusion of temporal and spatial features
CN112735404A (en) Ironic detection method, system, terminal device and storage medium
Wang et al. Speech augmentation using wavenet in speech recognition
Wu et al. Audio-Visual Multi-Talker Speech Recognition in a Cocktail Party.
Dai et al. Cloning one’s voice using very limited data in the wild
Moritani et al. Stargan-based emotional voice conversion for japanese phrases
Fu et al. Cycletransgan-evc: A cyclegan-based emotional voice conversion model with transformer
CN117252213B (en) End-to-end speech translation method using synthesized speech as supervision information
US11715457B1 (en) Real time correction of accent in speech audio signals
US20230317059A1 (en) Alignment Prediction to Inject Text into Automatic Speech Recognition Training
Tan et al. Denoised senone i-vectors for robust speaker verification
Akuzawa et al. Conditional deep hierarchical variational autoencoder for voice conversion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant