CN117252213B - End-to-end speech translation method using synthesized speech as supervision information - Google Patents

End-to-end speech translation method using synthesized speech as supervision information

Info

Publication number
CN117252213B
CN117252213B (application CN202310824069.5A)
Authority
CN
China
Prior art keywords
speech
voice
translation
original
synthesized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310824069.5A
Other languages
Chinese (zh)
Other versions
CN117252213A (en)
Inventor
熊德意
薛征山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202310824069.5A priority Critical patent/CN117252213B/en
Publication of CN117252213A publication Critical patent/CN117252213A/en
Application granted granted Critical
Publication of CN117252213B publication Critical patent/CN117252213B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/27Regression, e.g. linear or logistic regression
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • G06F40/35Discourse or dialogue representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/42Data-driven translation
    • G06F40/45Example-based machine translation; Alignment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/096Transfer learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/1822Parsing for meaning understanding
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04Time compression or expansion
    • G10L21/043Time compression or expansion by changing speed
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an end-to-end speech translation method using synthesized speech as supervision information. The method first preprocesses the triplet original speech translation data to be translated to obtain quadruple speech translation data containing synthesized speech; it then constructs a speech translation model and trains it using the quadruple speech translation data as samples, wherein an alignment adapter module is designed to take the semantic representation of the synthesized speech as supervision information so that the semantic representation of the original speech approaches that of the synthesized speech; at the same time, on the shared decoder side, the logits distribution of the synthesized speech is distilled onto the logits distribution of the original speech. Finally, the trained speech translation model translates the input speech to be translated and outputs the target translated text. The invention uses standard synthesized speech as supervision information and integrates it into the speech translation model training framework; during training, this supervision signal guides the end-to-end training of original speech translation, thereby improving the translation effect.

Description

End-to-end speech translation method using synthesized speech as supervision information
Technical Field
The invention relates to the technical field of speech translation, and in particular to an end-to-end speech translation method using synthesized speech as supervision information.
Background
Speech translation techniques include two types: (1) Speech-to-Speech translation (S2S), which automatically translates audio signals in one language into audio signals in another language; and (2) Speech-to-Text translation (S2T), which automatically translates audio signals in one language into text in another language. Both technologies are widely applied in simultaneous interpretation systems, such as the iFlytek and Baidu simultaneous interpretation products.
This patent belongs to the S2T technology, i.e., converting source-language speech into target-language text. Depending on the implementation, speech translation can be classified into cascaded (tandem) speech translation and end-to-end speech translation. Cascaded speech translation concatenates automatic speech recognition (ASR) and machine translation (MT): the source speech is first recognized as source text, and the source text is then translated into target text. The advantages of this scheme are: (1) speech recognition and machine translation can be optimized independently, which reduces the difficulty of the speech translation task; (2) ASR and MT both have abundant data, so speech recognition and machine translation each achieve good results on their own. However, the disadvantages of this approach are also apparent: (1) error propagation: if the source text produced by the speech recognition model contains errors, these errors are likely to be amplified during translation, causing large deviations in the final translation result; (2) high latency: the speech recognition model and the text translation model can only run serially, so the latency is high and the translation efficiency is relatively low, which matters especially in real-time speech translation scenarios with strict efficiency requirements; moreover, since cascaded systems in practical scenarios often decompose the task further and add intermediate processing modules, the overall performance of the system may improve, but the latency increases further and the translation efficiency drops; (3) loss of speech information: in the process of recognizing speech as text, information contained in the speech such as mood, emotion and intonation is lost, because such information is usually not expressed in textual form; the same sentence spoken in different ways may convey different meanings, and this information is also helpful for translation. End-to-end speech translation, in contrast to cascaded speech translation, directly models the conversion from speech to target text and is a current research focus in the industry. Its advantages are: (1) error propagation is avoided; (2) latency is significantly reduced; (3) model deployment is lightweight. Its disadvantages are: (1) high modeling complexity, since cross-modal conversion is involved; (2) training data is scarce, which is the biggest bottleneck affecting end-to-end speech translation.
Current technical schemes for end-to-end speech translation (End-to-End S2T) are mainly based on multi-task learning, knowledge distillation (KD), speech-text mixup learning (speech-text manifold mixup learning), contrastive learning, and so on. These schemes can effectively improve the end-to-end speech translation effect. Multi-task learning adds extra training objectives to guide the learning of the model; by introducing knowledge from other models it provides supervised learning signals that assist the training of the speech translation model, alleviating the problem of insufficient speech translation training data. Knowledge distillation distills the knowledge of a text translation model, which performs relatively well, onto the speech translation model, which is more complex and performs relatively poorly, thereby improving speech translation. Speech-text mixup learning is essentially a data augmentation technique that improves speech translation by constructing more data. Contrastive learning pulls mutually translated sentence pairs closer within the same training batch while pushing non-translation pairs apart, thereby improving the speech translation effect. None of the above techniques considers the robustness of speech translation; they improve the model's translation results mainly from the perspective of closing the gap between modalities.
Robustness is an important research direction in speech translation. The robustness problem is mainly manifested as follows: for the same text, the model's translation results are not fully consistent when the sentence is spoken by different people, or even when the same person speaks it on different occasions. What robustness aims to ensure is that, for speech corresponding to the same text, whether spoken by different people or by the same person, the translation results remain consistent, or remain of high quality, under various scene conditions. Different people, or the same person in different situations, may produce the same sentence with different durations, pauses, and so on; these variations are collectively referred to herein as different timbres, and they affect the robustness of speech translation.
Disclosure of Invention
The invention aims to solve the technical problem of robustness in end-to-end speech translation and provides an end-to-end speech translation method using synthesized speech as supervision information. The method uses standard synthesized speech as supervision information and integrates it into the speech translation model training framework; during training, this supervision signal guides the end-to-end training of original speech translation, thereby improving the speech translation effect.
The technical scheme adopted for realizing the purpose of the invention is as follows:
an end-to-end speech translation method using synthesized speech as supervisory information, comprising the steps of:
Step 1: preprocessing the triplet original voice translation data to be translated to obtain quadruple voice translation data; wherein the triples are: original voice, transcribed text corresponding to the original voice and translated text corresponding to the original voice, and four-tuple is: the original voice, the synthesized voice corresponding to the transcribed text, the transcribed text and the translated text;
Step 2: constructing a voice translation model, and training the voice translation model by using the voice translation data of the four-element group obtained in the step 1 as a sample;
The speech translation model includes 5 modules: a speech encoder module, a text encoder module, a shared encoder module, an alignment adapter module, and a shared decoder module;
a speech encoder module for encoding the original speech and the synthesized speech in the quadruple speech translation data into real-valued vectors representing the speech features within the model;
a text encoder module for encoding the transcribed text in the quadruple speech translation data into word vectors;
a shared encoder module for obtaining the semantic representations of the original speech, the synthesized speech and the transcribed text;
a shared decoder module for obtaining the inference result using an autoregressive method;
an alignment adapter module for using the semantic representation of the synthesized speech as supervision information so that the semantic representation of the original speech approaches the semantic representation of the synthesized speech;
Define a sample D = (s, s′, x, y), where s, s′, x, y denote the original speech, the synthesized speech, the transcribed text and the target translated text respectively, and θ denotes the model parameters. Given the sample D and the model parameters θ, the following loss functions are established:
1. Loss of translating the original speech into the target translated text, L_ST(D;θ):
L_ST(D;θ) = -∑_{(s,y)∈D} log P(y|s;θ)
2. Loss of recognizing the original speech as the transcribed text, L_ASR(D;θ):
L_ASR(D;θ) = -∑_{(s,x)∈D} log P(x|s;θ)
3. Loss of machine-translating the transcribed text into the target translated text, L_MT(D;θ):
L_MT(D;θ) = -∑_{(x,y)∈D} log P(y|x;θ)
4. Loss of translating the synthesized speech into the target translated text, L_ST′(D;θ):
L_ST′(D;θ) = -∑_{(s′,y)∈D} log P(y|s′;θ)
For the four loss functions above, (s,y) denotes an original speech and target translated text pair, (s′,y) denotes a synthesized speech and target translated text pair, (s,x) denotes an original speech and transcribed text pair, and (x,y) denotes a transcribed text and target translated text pair; P(y|s;θ) denotes the probability of translating s into y under the model parameters θ; P(x|s;θ) denotes the probability of recognizing s as x under the model parameters θ; P(y|x;θ) denotes the probability of translating x into y under the model parameters θ;
5. Loss of the alignment adapter, L_align(D;θ):
L_align(D;θ) = ∑_{(s,s′)∈D} MSE(s*, s′*)
where s* denotes the output of s through the alignment adapter, s′* denotes the output of s′ through the shared encoder, and MSE denotes the mean square error loss;
6. Knowledge distillation loss from the synthesized speech to the original speech, L_KD(D;θ):
L_KD(D;θ) = -∑_{(s,s′,y)∈D} ∑_{t=1}^{N} ∑_{k=1}^{|V|} P(y_t=k | y_<t, s′; θ) · log P(y_t=k | y_<t, s; θ)
where logits_i is the output of the shared decoder at step i and represents the probability distribution over all words in the vocabulary at position i, each step producing one logits vector; τ is the temperature coefficient by which the logits are divided before the softmax; P(y_t=k | y_<t, s; θ) denotes the probability that, with input s and model parameters θ, the output token at step t is k, a token being a specific word in the logits; P(y_t=k | y_<t, s′; θ) denotes the corresponding probability with input s′; |V| denotes the vocabulary size, N denotes the target translation length, and KD stands for knowledge distillation;
Combining the above 6 loss functions, the loss function of the final overall speech translation model is L(D;θ):
L(D;θ) = L_ST(D;θ) + L_ASR(D;θ) + L_MT(D;θ) + L_ST′(D;θ) + L_align(D;θ) + L_KD(D;θ)
Step 3: translating the input speech to be translated by using the trained speech translation model obtained in step 2, and outputting the target translated text.
In the above technical solution, step 1 includes the following steps:
Step 1.1: generating synthesized speech data corresponding to the transcribed text of the original speech;
Step 1.2: adjusting the sampling rate of the synthesized speech obtained in step 1.1 to be the same as the sampling rate of the original speech;
Step 1.3: calculating the duration of the synthesized speech obtained in step 1.2 and the duration of the original speech, and calculating the compression rate between them, where compression rate = duration of the synthesized speech / duration of the original speech;
Step 1.4: filtering out the original speech and synthesized speech whose compression rate does not meet the requirement, and synchronously filtering out the corresponding transcribed text and translated text;
Step 1.5: for the original speech and synthesized speech processed in step 1.4, converting the duration of the synthesized speech to be consistent with the duration of the original speech;
Step 1.6: adding the synthesized speech processed in step 1.5 into the original speech translation data to obtain the quadruple speech translation data.
In the above technical solution, in step 1.1, the synthesized speech corresponding to the transcribed text of the original speech is generated by means of an espnet-TTS system.
In the above technical solution, in step 1.4, the original speech and the synthesized speech with compression rate <0.4 or compression rate >3 are filtered out, and the corresponding transcribed text and translated text are filtered out simultaneously.
In the above technical solution, in step 2, the speech encoder module adds two layers of CNNs after the wav2vec2.0 open source speech pre-training model.
In the above technical solution, in step 2, the shared encoder module adopts a classical Transformer Encoder structure.
In the above technical solution, in step 2, the shared decoder module adopts a classical Transformer Decoder structure.
In the above technical solution, in step 2, the alignment adapter module adopts a classical Transformer Encoder structure.
Compared with the prior art, the invention has the beneficial effects that:
The invention uses standard synthesized speech as supervision information and integrates it into the speech translation model training framework; during training, this supervision signal guides the end-to-end training of original speech translation, thereby improving the translation effect.
Drawings
Fig. 1 is a flow chart of an end-to-end speech translation method of the present invention using synthesized speech as supervisory information.
FIG. 2 is a schematic diagram of a speech translation model according to the present invention.
Other relevant drawings may be made by those of ordinary skill in the art from the above figures without undue burden.
Detailed Description
The present invention will be described in further detail with reference to specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, an end-to-end speech translation method using synthesized speech as supervisory information includes the steps of:
Step 1: preprocessing the triplet original voice translation data to be translated to obtain the voice translation data of the quadruple. The method comprises the following specific steps:
Step 1.1: the original speech translation data is a triplet structure, and the three tuples are: the method comprises the steps of firstly, synthesizing corresponding voices from the original voices, transcribed texts corresponding to the original voices and translated texts corresponding to the original voices by a text-to-voice synthesis system, namely, generating synthesized voice data corresponding to the transcribed texts corresponding to the original voices. In this embodiment, an open source espnet-TTS system is used, and a synthetic speech corresponding to the transcribed text corresponding to the original speech is generated by means of the espnet-TTS system.
Step 1.2: since the sample rate of the synthesized speech generated by espnet2-TTS system is 22k, which is different from the sample rate (16 k) of the original speech, the synthesized speech sample rate obtained in step 1.1 needs to be converted into 16k by a ffmpeg tool.
Step 1.3: and (2) calculating the duration of the synthesized voice and the duration of the original voice obtained in the step (1.2), and calculating the compression rate between the synthesized voice and the original voice, wherein the compression rate=the duration of the synthesized voice/the duration of the original voice.
Step 1.4: the original speech with compression <0.4 and compression >3 is filtered out with the synthesized speech, and the corresponding transcribed text and translated text are filtered out simultaneously.
Step 1.5: for the original speech processed in step 1.4 and the synthesized speech, the duration of the synthesized speech is converted to be identical to the speech duration of the original speech by the ffmpeg tool (ideally, the duration of the synthesized speech is completely identical, but the conversion process of the ffmpeg tool has a loss, so that the final duration is not completely identical, and has a small gap, which is about 1% or less, and if the duration of the original speech is about 1 second, the duration of the synthesized speech is about 1.00x or 0.99x seconds).
The ffmpeg tool lengthens or shortens speech through the atempo filter, whose parameter value must lie between 0.5 and 100. Therefore, when the compression rate is, for example, 0.49, which is smaller than 0.5, a single atempo parameter is not enough and several atempo filters must be chained; in this case the invention takes the square root of the compression rate (√0.49 = 0.7) as the value of the atempo parameter and stacks two atempo filters (atempo=0.7,atempo=0.7).
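For concreteness, the following minimal Python sketch covers steps 1.2 to 1.5: it resamples the synthesized speech to 16 kHz, computes the compression rate, filters pairs outside the 0.4–3 range, and applies a stacked atempo chain so each factor stays within ffmpeg's allowed 0.5–100 interval. It assumes ffmpeg and ffprobe are installed; the function names and file handling are illustrative, not part of the patent.

import math
import subprocess

def audio_duration(path: str) -> float:
    """Return the duration of an audio file in seconds using ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

def atempo_chain(ratio: float) -> str:
    """Split a tempo ratio into stacked ffmpeg atempo factors, each in [0.5, 100]."""
    n = 1
    while not 0.5 <= ratio ** (1.0 / n) <= 100.0:
        n += 1
    factor = ratio ** (1.0 / n)            # e.g. ratio 0.49 -> two factors of 0.7
    return ",".join(f"atempo={factor:.4f}" for _ in range(n))

def preprocess_pair(orig_wav: str, synth_wav: str, out_wav: str) -> bool:
    """Steps 1.2-1.5: resample the synthesized speech to 16 kHz, filter by
    compression rate, and stretch/compress it to match the original duration.
    Returns False if the quadruple should be discarded (step 1.4)."""
    resampled = synth_wav.replace(".wav", "_16k.wav")
    subprocess.run(["ffmpeg", "-y", "-i", synth_wav, "-ar", "16000", resampled],
                   check=True, capture_output=True)
    ratio = audio_duration(resampled) / audio_duration(orig_wav)  # compression rate
    if ratio < 0.4 or ratio > 3.0:
        return False
    # atempo > 1 speeds the audio up (shortens it), so the factor is the ratio itself
    subprocess.run(["ffmpeg", "-y", "-i", resampled,
                    "-filter:a", atempo_chain(ratio), out_wav],
                   check=True, capture_output=True)
    return True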
Step 1.6: adding the synthesized voice processed in the step 1.5 into original voice translation data to obtain voice translation data of four tuples, wherein the four tuples are respectively: original speech, synthesized speech, transcribed text, and translated text.
Step 2: and (3) constructing a voice translation model, and training the voice translation model by using the voice translation data of the four-element group obtained in the step (1) as a sample.
Referring to fig. 2, the speech translation model includes 5 modules: a speech encoder module, a text encoder module, a shared encoder module, an alignment adapter module, and a shared decoder module.
A speech encoder module: its function is to encode the original speech and the synthesized speech in the quadruple speech translation data into real-valued vectors, which represent the speech features within the model and participate in the subsequent model training. Concretely, the speech encoder module adds two layers of CNN (Convolutional Neural Network) after the wav2vec2.0 open-source speech pre-training model.
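A minimal PyTorch sketch of a speech encoder of this shape, using the HuggingFace transformers implementation of wav2vec2.0; the checkpoint name, the kernel sizes and strides of the two CNN layers, and the output dimension are illustrative assumptions, since the patent does not specify them.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SpeechEncoder(nn.Module):
    """Speech encoder module: wav2vec2.0 followed by two 1-D CNN layers that
    project and downsample the feature sequence before the shared encoder."""
    def __init__(self, d_model: int = 512, pretrained: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.wav2vec = Wav2Vec2Model.from_pretrained(pretrained)
        hidden = self.wav2vec.config.hidden_size
        # two stacked CNN layers; kernel size 5 and stride 2 are assumed values
        self.cnn = nn.Sequential(
            nn.Conv1d(hidden, d_model, kernel_size=5, stride=2, padding=2), nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2), nn.GELU(),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        feats = self.wav2vec(waveform).last_hidden_state          # (batch, frames, hidden)
        feats = self.cnn(feats.transpose(1, 2)).transpose(1, 2)   # (batch, frames/4, d_model)
        return feats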
A text encoder module: its function is to encode the transcribed text in the quadruple speech translation data into word vectors.
A shared encoder module: a classical Transformer Encoder structure is adopted, and its function is to obtain the semantic representations of the original speech, the synthesized speech and the transcribed text, with parameters layers (number of encoder layers) = 6 and multi-heads (number of attention heads) = 8.
A shared decoder module: a classical Transformer Decoder structure is adopted, with parameters layers (number of decoder layers) = 6 and multi-heads (number of attention heads) = 8. Its function is to obtain the inference result using an autoregressive method.
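As an illustration of the autoregressive inference performed by the shared decoder, a greedy decoding loop might look as follows; the helper objects (embedding, output projection), the use of PyTorch's nn.TransformerDecoder built with batch_first=True layers, and the omission of positional encodings are simplifying assumptions, not part of the patent.

import torch
import torch.nn as nn

def greedy_decode(decoder: nn.TransformerDecoder, memory: torch.Tensor,
                  embed: nn.Embedding, proj: nn.Linear,
                  bos_id: int, eos_id: int, max_len: int = 128) -> list:
    """Autoregressive (greedy) inference: feed the tokens generated so far,
    pick the most probable next token from the decoder logits, stop at EOS.
    memory is the shared-encoder output, shape (1, src_frames, d_model)."""
    tokens = [bos_id]
    for _ in range(max_len):
        tgt = embed(torch.tensor(tokens).unsqueeze(0))        # (1, t, d_model)
        t = len(tokens)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        out = decoder(tgt, memory, tgt_mask=causal)           # (1, t, d_model)
        logits = proj(out[:, -1])                             # logits for the next position
        next_id = int(logits.argmax(dim=-1))
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens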
Alignment adapter module: it has the same structure as the shared encoder, i.e., a classical Transformer Encoder structure. Its function is to take the semantic representation of the synthesized speech as supervision information and make the semantic representation of the original speech approach that of the synthesized speech, thereby eliminating the negative influence caused by speech variations across different speakers or within the same speaker (that is, the semantic representations of different speakers, or of the same speaker, tend toward consistency).
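A minimal sketch of the alignment adapter and its supervision signal; it assumes the duration matching in step 1.5 keeps the original and synthesized frame sequences the same length, and detaching the synthesized-speech branch so it acts purely as a teacher is an additional assumption not stated in the patent.

import torch
import torch.nn as nn

class AlignmentAdapter(nn.Module):
    """Alignment adapter: a Transformer encoder stack applied to the original-speech
    representation, whose output is pulled toward the synthesized-speech representation."""
    def __init__(self, d_model: int = 512, nhead: int = 8, num_layers: int = 6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)

def alignment_loss(orig_hidden: torch.Tensor, synth_hidden: torch.Tensor,
                   adapter: AlignmentAdapter) -> torch.Tensor:
    """L_align = MSE(s*, s'*): orig_hidden and synth_hidden are the shared-encoder
    outputs for the original and synthesized speech, shape (batch, frames, d_model)."""
    s_star = adapter(orig_hidden)
    # the synthesized branch is detached here so it serves only as supervision
    return nn.functional.mse_loss(s_star, synth_hidden.detach())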
Define a sample D = (s, s′, x, y), where s, s′, x, y denote the original speech, the synthesized speech, the transcribed text and the target translated text respectively, and θ denotes the model parameters. Given the sample D and the model parameters θ, the following loss functions are established:
1. Loss of translating the original speech into the target translated text, L_ST(D;θ):
L_ST(D;θ) = -∑_{(s,y)∈D} log P(y|s;θ)
2. Loss of recognizing the original speech as the transcribed text, L_ASR(D;θ):
L_ASR(D;θ) = -∑_{(s,x)∈D} log P(x|s;θ)
3. Loss of machine-translating the transcribed text into the target translated text, L_MT(D;θ):
L_MT(D;θ) = -∑_{(x,y)∈D} log P(y|x;θ)
4. Loss of translating the synthesized speech into the target translated text, L_ST′(D;θ):
L_ST′(D;θ) = -∑_{(s′,y)∈D} log P(y|s′;θ)
For the above loss functions, (s,y) denotes an original speech and target translated text pair, (s′,y) denotes a synthesized speech and target translated text pair, (s,x) denotes an original speech and transcribed text pair, and (x,y) denotes a transcribed text and target translated text pair; P(y|s;θ) denotes the probability of translating s into y under the model parameters θ; P(x|s;θ) denotes the probability of recognizing s as x under the model parameters θ; P(y|x;θ) denotes the probability of translating x into y under the model parameters θ; ST stands for Speech Translation, ASR for Automatic Speech Recognition, and MT for Machine Translation;
5. Loss of the alignment adapter, L_align(D;θ):
L_align(D;θ) = ∑_{(s,s′)∈D} MSE(s*, s′*)
where s* denotes the output of s through the alignment adapter, s′* denotes the output of s′ through the shared encoder, and MSE (mean squared error) denotes the mean square error loss;
6. Knowledge distillation loss from the synthesized speech to the original speech, L_KD(D;θ):
L_KD(D;θ) = -∑_{(s,s′,y)∈D} ∑_{t=1}^{N} ∑_{k=1}^{|V|} P(y_t=k | y_<t, s′; θ) · log P(y_t=k | y_<t, s; θ)
where logits_i is the output of the shared decoder at step i and represents the probability distribution over all words in the vocabulary at position i (during inference, each step produces one logits vector, and a token refers to a specific word in the logits); τ is the temperature coefficient by which the logits are divided before the softmax; P(y_t=k | y_<t, s; θ) denotes the probability that, with input s and model parameters θ, the output token at step t is k; P(y_t=k | y_<t, s′; θ) denotes the corresponding probability with input s′; |V| denotes the vocabulary size, N denotes the target translation length, and KD stands for Knowledge Distillation.
Combining the above 6 loss functions, the loss function of the final overall speech translation model is L(D;θ):
L(D;θ) = L_ST(D;θ) + L_ASR(D;θ) + L_MT(D;θ) + L_ST′(D;θ) + L_align(D;θ) + L_KD(D;θ)
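A minimal sketch of the knowledge-distillation term and the unweighted sum of the six losses, assuming the shared decoder's logits for the original and synthesized speech have already been computed with teacher forcing over the same target prefix; averaging over the batch and target positions is a normalization choice the patent does not fix.

import torch
import torch.nn.functional as F

def kd_loss(orig_logits: torch.Tensor, synth_logits: torch.Tensor,
            tau: float = 1.0) -> torch.Tensor:
    """L_KD: distill the decoder logits on the synthesized speech (teacher) onto the
    logits on the original speech (student), with temperature tau.
    Both tensors have shape (batch, target_len, vocab_size)."""
    teacher = F.softmax(synth_logits.detach() / tau, dim=-1)
    student = F.log_softmax(orig_logits / tau, dim=-1)
    # cross-entropy of the student against the teacher distribution over the vocabulary
    return -(teacher * student).sum(dim=-1).mean()

def total_loss(l_st, l_asr, l_mt, l_st_synth, l_align, l_kd):
    """L(D;θ) = L_ST + L_ASR + L_MT + L_ST' + L_align + L_KD (unweighted sum)."""
    return l_st + l_asr + l_mt + l_st_synth + l_align + l_kd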
Step 3: translate the input speech to be translated by using the trained speech translation model obtained in step 2, and output the target translated text.
The foregoing has described exemplary embodiments of the invention, it being understood that those skilled in the art may make simple variations, modifications, or other equivalent arrangements without undue burden and without departing from the spirit of the invention.

Claims (8)

1. An end-to-end speech translation method using synthesized speech as supervisory information, comprising the steps of:
Step 1: preprocessing the triplet original voice translation data to be translated to obtain quadruple voice translation data; wherein the triples are: original voice, transcribed text corresponding to the original voice and translated text corresponding to the original voice, and four-tuple is: original speech, synthesized speech, transcribed text, and translated text;
Step 2: constructing a voice translation model, and training the voice translation model by using the voice translation data of the four-element group obtained in the step 1 as a sample;
The speech translation model includes 5 modules: a speech encoder module, a text encoder module, a shared encoder module, an alignment adapter module, and a shared decoder module;
a speech encoder module for encoding the original speech and the synthesized speech in the quadruple speech translation data into real-valued vectors representing the speech features within the model;
a text encoder module for encoding the transcribed text in the quadruple speech translation data into word vectors;
a shared encoder module for obtaining the semantic representations of the original speech, the synthesized speech and the transcribed text;
a shared decoder module for obtaining the inference result using an autoregressive method;
an alignment adapter module for using the semantic representation of the synthesized speech as supervision information so that the semantic representation of the original speech approaches the semantic representation of the synthesized speech;
Defining a sample D=(s, s′, x, y), s, s′, x, y respectively representing an original speech, a synthesized speech, a transcribed text and a target translated text, and θ representing model parameters, and under the conditions of the sample D and the model parameters θ, establishing the following loss functions:
1. Loss of original speech translation to target translation text L ST (D; θ):
LST(D;θ)=-∑(s,y)∈DlogP(y|s;θ)
2. original speech is recognized as a loss of transcribed text L ASR (D; θ):
LASR(D;θ)=-∑(s,x)∈DlogP(x|s;θ)
3. loss of machine translation of transcribed text to target translated text L MT (D; θ):
LMT(D;θ)=-∑(x,y)∈DlogP(y|x;θ)
4. Loss of translation of synthesized speech to target translation text L ST′ (D; θ):
LST′(D;θ)=-∑(s′,y)∈DlogP(y|s′;θ)
for the 4 loss functions described above, where (s, y) represents the original speech and target translation pair, (s′, y) represents the synthesized speech and target translation pair, (s, x) represents the original speech and transcription pair, and (x, y) represents the transcription and target translation pair; P(y|s;θ) represents the probability of s translating to y under the model parameters θ; P(x|s;θ) represents the probability that s is recognized as x under the model parameters θ; P(y|x;θ) represents the probability of x translating to y under the model parameters θ;
5. loss of the alignment adapter L align (D; θ):
Lalign(D;θ)=∑(s,s′)∈DMSE(s*,s′*)
Where s represents the original speech, s′ represents the synthesized speech, s* represents the output of s through the alignment adapter, s′* represents the output of s′ through the shared encoder, and MSE represents the mean square error loss;
6. knowledge distillation loss from the synthesized speech to the original speech L KD (D; θ):
LKD(D;θ)=-∑(s,s′,y)∈D∑t=1..N∑k=1..V P(yt=k|y<t,s′,θ)logP(yt=k|y<t,s,θ)
Wherein logits represents the probability distribution of all words in the vocabulary at position i, each step yielding one logits; τ is the temperature coefficient by which the logits are divided before the softmax; P(yt=k|y<t,s,θ) represents the probability that the input is s, the model parameter is θ, and the output token at time t is k, the token being a specific word in logits; P(yt=k|y<t,s′,θ) represents the probability of the output token being k at input s′, model parameter θ and time t; V represents the vocabulary size, N represents the target translation length, KD represents knowledge distillation;
In combination with the above 6 loss functions, the final loss function of the whole speech translation model is L (D; θ):
L(D;θ)=LST(D;θ)+LASR(D;θ)+LMT(D;θ)+LST′(D;θ)+Lalign(D;θ)+LKD(D;θ)
Step 3: translating the input speech to be translated by using the trained speech translation model obtained in step 2, and outputting the target translated text.
2. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 1, wherein: step 1 comprises the following steps:
Step 1.1: generating synthesized speech data corresponding to the transcribed text of the original speech;
Step 1.2: adjusting the sampling rate of the synthesized speech obtained in step 1.1 to be the same as the sampling rate of the original speech;
Step 1.3: calculating the duration of the synthesized speech obtained in step 1.2 and the duration of the original speech, and calculating the compression rate between them, where compression rate = duration of the synthesized speech / duration of the original speech;
Step 1.4: filtering out the original speech and synthesized speech whose compression rate does not meet the requirement, and synchronously filtering out the corresponding transcribed text and translated text;
Step 1.5: for the original speech and synthesized speech processed in step 1.4, converting the duration of the synthesized speech to be consistent with the duration of the original speech;
Step 1.6: adding the synthesized speech processed in step 1.5 into the original speech translation data to obtain the quadruple speech translation data.
3. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 2, wherein: in step 1.1, the synthesized speech corresponding to the transcribed text of the original speech is generated by means of an espnet-TTS system.
4. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 2, wherein: in step 1.4, the original speech and the synthesized speech with compression rate <0.4 or compression rate >3 are filtered out, and the corresponding transcribed text and translated text are filtered out simultaneously.
5. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 1, wherein: in step 2, the speech coder module adds two layers of CNN after the wav2vec2.0 open source speech pre-training model.
6. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 1, wherein: in step 2, the shared encoder module adopts a classical Transformer Encoder structure.
7. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 1, wherein: in step 2, the shared decoder module adopts a classical Transformer Decoder structure.
8. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 1, wherein: in step 2, the alignment adapter module adopts a classical Transformer Encoder structure.
CN202310824069.5A 2023-07-06 2023-07-06 End-to-end speech translation method using synthesized speech as supervision information Active CN117252213B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310824069.5A CN117252213B (en) 2023-07-06 2023-07-06 End-to-end speech translation method using synthesized speech as supervision information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310824069.5A CN117252213B (en) 2023-07-06 2023-07-06 End-to-end speech translation method using synthesized speech as supervision information

Publications (2)

Publication Number Publication Date
CN117252213A CN117252213A (en) 2023-12-19
CN117252213B true CN117252213B (en) 2024-05-31

Family

ID=89125402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310824069.5A Active CN117252213B (en) 2023-07-06 2023-07-06 End-to-end speech translation method using synthesized speech as supervision information

Country Status (1)

Country Link
CN (1) CN117252213B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020038343A (en) * 2018-08-30 2020-03-12 国立研究開発法人情報通信研究機構 Method and device for training language identification model, and computer program for it
CN111326157A (en) * 2020-01-20 2020-06-23 北京字节跳动网络技术有限公司 Text generation method and device, electronic equipment and computer readable medium
CN112204653A (en) * 2019-03-29 2021-01-08 谷歌有限责任公司 Direct speech-to-speech translation through machine learning
CN112951213A (en) * 2021-02-09 2021-06-11 中国科学院自动化研究所 End-to-end online voice detection and recognition method, system and equipment
CN113505611A (en) * 2021-07-09 2021-10-15 中国人民解放军战略支援部队信息工程大学 Training method and system for obtaining better speech translation model in generation of confrontation
CN114842834A (en) * 2022-03-31 2022-08-02 中国科学院自动化研究所 Voice text joint pre-training method and system
CN115828943A (en) * 2022-12-28 2023-03-21 沈阳雅译网络技术有限公司 Speech translation model modeling method and device based on speech synthesis data
CN115985298A (en) * 2022-12-20 2023-04-18 沈阳雅译网络技术有限公司 End-to-end speech translation method based on automatic alignment, mixing and self-training of speech texts
CN116227503A (en) * 2023-01-06 2023-06-06 沈阳雅译网络技术有限公司 CTC-based non-autoregressive end-to-end speech translation method

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020038343A (en) * 2018-08-30 2020-03-12 国立研究開発法人情報通信研究機構 Method and device for training language identification model, and computer program for it
CN112204653A (en) * 2019-03-29 2021-01-08 谷歌有限责任公司 Direct speech-to-speech translation through machine learning
CN111326157A (en) * 2020-01-20 2020-06-23 北京字节跳动网络技术有限公司 Text generation method and device, electronic equipment and computer readable medium
CN112951213A (en) * 2021-02-09 2021-06-11 中国科学院自动化研究所 End-to-end online voice detection and recognition method, system and equipment
CN113505611A (en) * 2021-07-09 2021-10-15 中国人民解放军战略支援部队信息工程大学 Training method and system for obtaining better speech translation model in generation of confrontation
CN114842834A (en) * 2022-03-31 2022-08-02 中国科学院自动化研究所 Voice text joint pre-training method and system
CN115985298A (en) * 2022-12-20 2023-04-18 沈阳雅译网络技术有限公司 End-to-end speech translation method based on automatic alignment, mixing and self-training of speech texts
CN115828943A (en) * 2022-12-28 2023-03-21 沈阳雅译网络技术有限公司 Speech translation model modeling method and device based on speech synthesis data
CN116227503A (en) * 2023-01-06 2023-06-06 沈阳雅译网络技术有限公司 CTC-based non-autoregressive end-to-end speech translation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
End-to-End Speech Translation with Knowledge Distillation; Yuchen Liu et al.; arXiv; 2019-04-17; full text *
Research on End-to-End Real-Time Speech Translation Based on Transformer Transducer; Zou Jianyun (邹剑云); China Master's Theses Full-text Database (Information Science and Technology); 2022-06-15 (No. 6); full text *

Also Published As

Publication number Publication date
CN117252213A (en) 2023-12-19

Similar Documents

Publication Publication Date Title
CN107545903B (en) Voice conversion method based on deep learning
Zhang et al. Joint training framework for text-to-speech and voice conversion using multi-source tacotron and wavenet
CN112767958B (en) Zero-order learning-based cross-language tone conversion system and method
WO2022048405A1 (en) Text-based virtual object animation generation method, apparatus, storage medium, and terminal
Kameoka et al. ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion
CN108777140A (en) Phonetics transfer method based on VAE under a kind of training of non-parallel corpus
Zhao et al. Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams.
Zhou et al. Converting anyone's emotion: Towards speaker-independent emotional voice conversion
CN114023316A (en) TCN-Transformer-CTC-based end-to-end Chinese voice recognition method
Liu et al. Voice conversion with transformer network
Luong et al. Bootstrapping non-parallel voice conversion from speaker-adaptive text-to-speech
CN111009235A (en) Voice recognition method based on CLDNN + CTC acoustic model
Kameoka et al. Fasts2s-vc: Streaming non-autoregressive sequence-to-sequence voice conversion
An et al. Speech Emotion Recognition algorithm based on deep learning algorithm fusion of temporal and spatial features
CN112735404A (en) Ironic detection method, system, terminal device and storage medium
Wang et al. Speech augmentation using wavenet in speech recognition
Wu et al. Audio-Visual Multi-Talker Speech Recognition in a Cocktail Party.
Dai et al. Cloning one’s voice using very limited data in the wild
Moritani et al. Stargan-based emotional voice conversion for japanese phrases
Fu et al. Cycletransgan-evc: A cyclegan-based emotional voice conversion model with transformer
CN117252213B (en) End-to-end speech translation method using synthesized speech as supervision information
US11715457B1 (en) Real time correction of accent in speech audio signals
US20230317059A1 (en) Alignment Prediction to Inject Text into Automatic Speech Recognition Training
Tan et al. Denoised senone i-vectors for robust speaker verification
Akuzawa et al. Conditional deep hierarchical variational autoencoder for voice conversion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant