CN117252213A - End-to-end speech translation method using synthesized speech as supervision information - Google Patents
- Publication number
- CN117252213A (application No. CN202310824069.5A)
- Authority
- CN
- China
- Prior art keywords
- speech
- voice
- translation
- original
- synthesized
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/27—Regression, e.g. linear or logistic regression
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G06F40/35—Discourse or dialogue representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/45—Example-based machine translation; Alignment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/096—Transfer learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/04—Time compression or expansion
- G10L21/043—Time compression or expansion by changing speed
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an end-to-end speech translation method using synthesized speech as supervision information. The method first preprocesses the triplet original speech translation data to be translated, obtaining quadruple speech translation data that contains synthesized speech. It then constructs a speech translation model and trains it using the quadruple speech translation data as samples, where an alignment adapter module is designed to take the semantic representation of the synthesized speech as supervision information, so that the semantic representation of the original speech approaches that of the synthesized speech; at the same time, on the shared decoder side, the logits distribution of the synthesized speech is distilled onto the logits distribution of the original speech. Finally, the trained speech translation model translates the input speech to be translated and outputs the target translated text. The invention uses standard synthesized speech as supervision information, integrated into the speech translation model training framework; during training, this supervision guides the end-to-end training on original speech, thereby improving the translation effect.
Description
Technical Field
The invention relates to the technical field of speech translation, and in particular to an end-to-end speech translation method using synthesized speech as supervision information.
Background
Speech translation technology comes in two types: (1) Speech-to-Speech translation (S2S), which automatically translates an audio signal in one language into an audio signal in another language; and (2) Speech-to-Text translation (S2T), which automatically translates an audio signal in one language into text in another language. Both technologies are widely used in simultaneous interpretation systems, such as the iFlytek and Baidu simultaneous interpretation products.
This patent concerns S2T technology, i.e., converting source-language speech into target-language text. By implementation, speech translation can be divided into cascade (tandem) speech translation and end-to-end speech translation. Cascade speech translation concatenates automatic speech recognition (ASR) and machine translation (MT): the source speech is first recognized as source text, which is then translated into target text. The advantages of this scheme are: (1) speech recognition and machine translation can be optimized independently, reducing the difficulty of the speech translation task; (2) ASR and MT have abundant data, so both components individually achieve good results. However, the disadvantages are also apparent. (1) Error propagation: if the source text produced by the speech recognition model contains errors, those errors are likely to be amplified during translation, causing large deviations in the final result. (2) High latency: the speech recognition model and the text translation model can only run serially, so latency is high and translation efficiency relatively low, which matters especially in real-time speech translation scenarios with strict efficiency requirements; moreover, cascade systems in practice decompose the task further and add intermediate processing modules, which can improve overall system performance but increases latency further and reduces efficiency. (3) Loss of speech information: in recognizing speech as text, information such as mood, emotion, and tone carried by the speech is lost, since such information is usually not expressed in text form; the same sentence spoken differently may convey different meanings, and this information also helps translation. End-to-end speech translation, by contrast, directly models the conversion from speech to target text and is a current research focus in the field. Its advantages: (1) it avoids error propagation; (2) latency is significantly reduced; (3) model deployment is lightweight. Its disadvantages: (1) modeling complexity is high, since the conversion crosses modalities; (2) training data is scarce, which is the biggest bottleneck affecting end-to-end speech translation.
Current end-to-end speech translation (End-to-End S2T) schemes are mainly based on multi-task learning, knowledge distillation (KD), speech-text manifold mixup learning, contrastive learning, and the like. These schemes can effectively improve end-to-end speech translation. Multi-task learning adds auxiliary training objectives to guide the model's learning, introducing knowledge from other models to provide supervised learning signals that assist training, thereby alleviating the shortage of speech translation training data. Knowledge distillation distills knowledge from a text translation model with relatively good performance into the more complex and weaker speech translation model, improving speech translation quality. Speech-text mixup learning is essentially a data augmentation technique that improves speech translation by constructing more data. Contrastive learning pulls mutually translated sentence pairs closer within a training batch while pushing non-translation pairs apart, thereby improving the speech translation effect. None of these techniques considers the robustness of speech translation; they all improve translation results from the standpoint of pulling the speech and text modalities closer together.
Robustness is an important direction in speech translation. The robustness problem manifests as follows: for the same text, the model's translations are not fully consistent when the sentence is spoken by different people, or even by the same person in different recordings. What robustness demands is that, for speech corresponding to the same text, whether from different speakers or the same speaker and under various conditions, the translation results should remain consistent and of high quality. Different people, or the same person on different occasions, saying the same sentence may differ in speaking duration, pauses, and so on; these variations are collectively referred to here as differences in timbre, and they affect the robustness of speech translation.
Disclosure of Invention
The invention aims to solve the technical problem of end-to-end speech translation robustness and provides an end-to-end speech translation method using synthesized speech as supervision information. The method uses standard synthesized speech as supervision information, integrated into the speech translation model training framework; during training, this supervision guides the end-to-end training on original speech, thereby improving the speech translation effect.
The technical scheme adopted for realizing the purpose of the invention is as follows:
An end-to-end speech translation method using synthesized speech as supervision information comprises the following steps:
Step 1: preprocess the triplet original speech translation data to be translated to obtain quadruple speech translation data; the triplet is: original speech, the transcribed text corresponding to the original speech, and the translated text corresponding to the original speech; the quadruple is: original speech, the synthesized speech corresponding to the transcribed text, the transcribed text, and the translated text;
Step 2: construct a speech translation model, and train it using the quadruple speech translation data obtained in step 1 as samples;
the speech translation model includes 5 modules: a speech encoder module, a text encoder module, a shared encoder module, an alignment adapter module, and a shared decoder module;
the speech encoder module encodes the original speech and the synthesized speech in the quadruple speech translation data into real-valued vectors that represent the speech features within the model;
the text encoder module encodes the transcribed text in the quadruple speech translation data into word vectors;
the shared encoder module obtains semantic representations of the original speech, the synthesized speech, and the transcribed text;
the shared decoder module obtains inference results using an autoregressive method;
the alignment adapter module takes the semantic representation of the synthesized speech as supervision information, making the semantic representation of the original speech approach that of the synthesized speech;
Define a sample D = (s, s', x, y), where s, s', x, y represent the original speech, the synthesized speech, the transcribed text, and the target translated text respectively, and θ represents the model parameters. Given the sample D and model parameters θ, the following loss functions are established:
1. Loss of translating the original speech into the target translated text:
L_ST = -log P(y | s; θ);
2. Loss of recognizing the original speech as the transcribed text:
L_ASR = -log P(x | s; θ);
3. Loss of machine-translating the transcribed text into the target translated text:
L_MT = -log P(y | x; θ);
4. Loss of translating the synthesized speech into the target translated text:
L_ST' = -log P(y | s'; θ);
In the four loss functions above, (s, y) is the original-speech/target-translation pair, (s', y) the synthesized-speech/target-translation pair, (s, x) the original-speech/transcribed-text pair, and (x, y) the transcribed-text/target-translation pair; P(y | s; θ) is the probability of translating s into y under model parameters θ; P(x | s; θ) is the probability of recognizing s as x under θ; P(y | x; θ) is the probability of translating x into y under θ;
5. Loss of the alignment adapter:
L_Adapter = MSE(Adapter(s), Encoder(s'));
where Adapter(s) is the output of s through the alignment adapter, Encoder(s') is the output of s' through the shared encoder, and MSE is the mean-square-error loss;
6. Loss of knowledge distillation from synthesized speech to original speech:
P(y_i = v_k | y_{<i}, s; θ, T) = exp(z_{i,k} / T) / Σ_{k'=1}^{|V|} exp(z_{i,k'} / T);
L_KD = -Σ_{i=1}^{N} Σ_{k=1}^{|V|} P(y_i = v_k | y_{<i}, s'; θ, T) · log P(y_i = v_k | y_{<i}, s; θ, T);
where z_i is the logits output by the shared decoder module at step i; the logits represent the probability distribution over all words in the vocabulary at position i, and each decoding step produces one logits vector; T is the temperature coefficient; P(y_i = v_k | y_{<i}, s'; θ, T) is the probability, given input s' and parameters θ, that the token output at step i is v_k (a token is a specific word in the vocabulary); P(y_i = v_k | y_{<i}, s; θ, T) is the corresponding probability for input s; |V| is the vocabulary size; N is the target translation length; KD stands for knowledge distillation;
Combining the 6 loss functions above, the loss function of the final overall speech translation model is:
L = L_ST + L_ASR + L_MT + L_ST' + L_Adapter + L_KD.
Step 3: translate the input speech to be translated using the trained speech translation model obtained in step 2, and output the target translated text.
In the above technical solution, step 1 includes the following steps:
Step 1.1: generate the synthesized speech corresponding to the transcribed text of the original speech;
Step 1.2: adjust the sampling rate of the synthesized speech obtained in step 1.1 to match the sampling rate of the original speech;
Step 1.3: compute the duration of the synthesized speech obtained in step 1.2 and the duration of the original speech, and compute the compression rate between them, where compression rate = duration of synthesized speech / duration of original speech;
Step 1.4: filter out original and synthesized speech whose compression rate does not meet the requirement, and synchronously filter out the corresponding transcribed and translated texts;
Step 1.5: for the original and synthesized speech remaining after step 1.4, convert the duration of the synthesized speech to match the duration of the original speech;
Step 1.6: add the synthesized speech processed in step 1.5 to the original speech translation data to obtain the quadruple speech translation data.
In the above technical solution, in step 1.1, the synthesized speech corresponding to the transcribed text of the original speech is generated by the espnet2-TTS system.
In the above technical solution, in step 1.4, original and synthesized speech with compression rate < 0.4 or > 3 are filtered out, and the corresponding transcribed and translated texts are filtered out at the same time.
In the above technical solution, in step 2, the speech encoder module adds two CNN layers after the wav2vec2.0 open-source speech pre-training model.
In the above technical solution, in step 2, the shared encoder module adopts a classical Transformer Encoder structure.
In the above technical solution, in step 2, the shared decoder module adopts a classical Transformer Decoder structure.
In the above technical solution, in step 2, the alignment adapter module adopts a classical Transformer Encoder structure.
Compared with the prior art, the invention has the following beneficial effects:
The invention uses standard synthesized speech as supervision information, integrated into the speech translation model training framework; during training, this supervision guides the end-to-end training on original speech, thereby improving the translation effect.
Drawings
Fig. 1 is a flow chart of an end-to-end speech translation method of the present invention using synthesized speech as supervisory information.
FIG. 2 is a schematic diagram of a speech translation model according to the present invention.
Other relevant drawings may be made by those of ordinary skill in the art from the above figures without undue burden.
Detailed Description
The present invention will be described in further detail with reference to specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, an end-to-end speech translation method using synthesized speech as supervision information includes the following steps:
Step 1: preprocess the triplet original speech translation data to be translated to obtain the quadruple speech translation data. The specific steps are as follows:
Step 1.1: the original speech translation data has a triplet structure, the triplet being: the original speech, the transcribed text corresponding to the original speech, and the translated text corresponding to the original speech. First, the transcribed text corresponding to the original speech is synthesized into speech by a text-to-speech synthesis system; that is, synthesized speech data corresponding to the transcribed text of the original speech is generated. This embodiment adopts the open-source espnet2-TTS system, which generates the synthesized speech corresponding to the transcribed text of the original speech.
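As an illustrative sketch of step 1.1 (the pretrained model tag, helper name, and file paths here are assumptions for illustration, not values fixed by this embodiment), the espnet2-TTS inference interface can be driven as follows:

```python
# Sketch of step 1.1: synthesize speech from each transcribed text.
# "kan-bayashi/ljspeech_vits" is a placeholder espnet2 model tag.
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

tts = Text2Speech.from_pretrained("kan-bayashi/ljspeech_vits")

def synthesize(transcript: str, out_path: str) -> None:
    """Generate the synthesized speech for one transcribed text."""
    wav = tts(transcript)["wav"]             # waveform tensor at the model's rate
    sf.write(out_path, wav.numpy(), tts.fs)  # tts.fs: model sampling rate (22 kHz here)

synthesize("an example transcribed sentence", "synth_0001.wav")
```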
Step 1.2: since the sample rate of the synthesized speech generated by the espnet2-TTS system is 22k, which is different from the sample rate (16 k) of the original speech, the synthesized speech sample rate obtained in step 1.1 needs to be converted into 16k by a ffmpeg tool.
Step 1.3: and (2) calculating the duration of the synthesized voice and the duration of the original voice obtained in the step (1.2), and calculating the compression rate between the synthesized voice and the original voice, wherein the compression rate=the duration of the synthesized voice/the duration of the original voice.
Step 1.4: the original speech with compression <0.4 and compression >3 is filtered out with the synthesized speech, and the corresponding transcribed text and translated text are filtered out simultaneously.
Step 1.5: for the original speech processed in step 1.4 and the synthesized speech, the duration of the synthesized speech is converted to be identical to the speech duration of the original speech by the ffmpeg tool (ideally, the duration of the synthesized speech is completely identical, but the conversion process of the ffmpeg tool has a loss, so that the final duration is not completely identical, and has a small gap, which is about 1% or less, and if the duration of the original speech is about 1 second, the duration of the synthesized speech is about 1.00x or 0.99x seconds).
ffmpeg lengthens or shortens speech through the atempo filter, whose parameter value must lie between 0.5 and 100. So if the compression rate is 0.49, which is below 0.5, several atempo filters must be chained; the invention takes the square root of the compression rate (0.7) as the atempo value and stacks 2 filters (atempo=0.7,atempo=0.7).
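The atempo-stacking rule can be written as a small helper; a sketch that splits any ratio below 0.5 into equal factors (n-th roots), matching the 0.49 example above:

```python
def atempo_chain(ratio: float) -> str:
    """Build the ffmpeg audio-filter string that time-scales speech by `ratio`."""
    n = 1
    while ratio ** (1.0 / n) < 0.5:   # each atempo factor must stay within [0.5, 100]
        n += 1
    factor = ratio ** (1.0 / n)
    return ",".join(f"atempo={factor:.4f}" for _ in range(n))

print(atempo_chain(0.49))  # "atempo=0.7000,atempo=0.7000"
# Used as: ffmpeg -i synth_16k.wav -filter:a "atempo=0.7000,atempo=0.7000" synth_matched.wav
```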
Step 1.6: adding the synthesized voice processed in the step 1.5 into original voice translation data to obtain voice translation data of four tuples, wherein the four tuples are respectively: original speech, synthesized speech, transcribed text, and translated text.
Step 2: and (3) constructing a voice translation model, and training the voice translation model by using the voice translation data of the four-element group obtained in the step (1) as a sample.
Referring to fig. 2, the speech translation model includes 5 modules: a speech encoder module, a text encoder module, a shared encoder module, an alignment adapter module, and a shared decoder module.
A speech encoder module: encodes the original speech and the synthesized speech in the quadruple speech translation data into real-valued vectors, which represent the speech features within the model and participate in subsequent model training. Concretely, the speech encoder module adds two layers of CNN (Convolutional Neural Network) after the wav2vec2.0 open-source speech pre-training model.
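A minimal PyTorch sketch of this module follows (the torchaudio wav2vec 2.0 bundle, kernel sizes, strides, and the model width of 512 are illustrative assumptions; the embodiment fixes only "wav2vec2.0 followed by two CNN layers"):

```python
import torch
import torch.nn as nn
import torchaudio

class SpeechEncoder(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        # Pretrained wav2vec 2.0 backbone (768-dim features for the base model).
        self.wav2vec2 = torchaudio.pipelines.WAV2VEC2_BASE.get_model()
        # Two CNN layers; stride 2 each, so the feature sequence is shortened 4x.
        self.cnn = nn.Sequential(
            nn.Conv1d(768, d_model, kernel_size=5, stride=2, padding=2), nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2), nn.GELU(),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        feats, _ = self.wav2vec2.extract_features(waveform)  # list of layer outputs
        x = feats[-1].transpose(1, 2)       # (batch, 768, time)
        return self.cnn(x).transpose(1, 2)  # (batch, time/4, d_model)
```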
A text encoder module: encodes the transcribed text in the quadruple speech translation data into word vectors.
A shared encoder module: adopts a classical Transformer Encoder structure and obtains the semantic representations of the original speech, the synthesized speech, and the transcribed text, with parameters layers = 6 (number of encoder layers) and heads = 8 (number of attention heads).
A shared decoder module: adopts a classical Transformer Decoder structure, with parameters layers = 6 (number of decoder layers) and heads = 8 (number of attention heads). Its function is to obtain inference results using an autoregressive method.
Alignment adapter module: has the same structure as the shared encoder, i.e., a classical Transformer Encoder. Its function is to take the semantic representation of the synthesized speech as supervision information and pull the semantic representation of the original speech toward that of the synthesized speech, thereby eliminating the negative influence of speech variation across different speakers or across recordings of the same speaker (i.e., the semantic representations of different speakers, or of the same speaker, tend toward consistency).
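These three Transformer modules can be sketched directly with standard PyTorch building blocks (the width of 512 and the adapter depth are assumptions; the text fixes only 6 layers and 8 heads for the shared encoder and decoder):

```python
import torch.nn as nn

d_model = 512
shared_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=6,
)
shared_decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=6,
)
# The alignment adapter reuses the Transformer Encoder structure; its depth is
# not stated in the text, so 6 layers here is an assumption.
alignment_adapter = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=6,
)
```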
Define a sample D = (s, s', x, y), where s, s', x, y represent the original speech, the synthesized speech, the transcribed text, and the target translated text respectively, and θ represents the model parameters. Given the sample D and model parameters θ, the following loss functions are established:
1. Loss of translating the original speech into the target translated text:
L_ST = -log P(y | s; θ);
2. Loss of recognizing the original speech as the transcribed text:
L_ASR = -log P(x | s; θ);
3. Loss of machine-translating the transcribed text into the target translated text:
L_MT = -log P(y | x; θ);
4. Loss of translating the synthesized speech into the target translated text:
L_ST' = -log P(y | s'; θ);
In the loss functions above, (s, y) is the original-speech/target-translation pair, (s', y) the synthesized-speech/target-translation pair, (s, x) the original-speech/transcribed-text pair, and (x, y) the transcribed-text/target-translation pair; P(y | s; θ) is the probability of translating s into y under model parameters θ; P(x | s; θ) is the probability of recognizing s as x under θ; P(y | x; θ) is the probability of translating x into y under θ; ST stands for Speech Translation, ASR for Automatic Speech Recognition, and MT for Machine Translation.
5. Loss of the alignment adapter:
L_Adapter = MSE(Adapter(s), Encoder(s'));
where Adapter(s) is the output of s through the alignment adapter, Encoder(s') is the output of s' through the shared encoder, and MSE (mean squared error) is the mean-square-error loss.
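In training code this loss is a single call; a sketch follows (detaching the synthesized-speech representation so that it acts purely as a supervision target is our assumption, as is the expectation that step 1.5's duration matching lets the two sequences share a time dimension):

```python
import torch.nn.functional as F

def adapter_loss(h_orig_adapter, h_synth_encoder):
    """L_Adapter = MSE(Adapter(s), Encoder(s')); inputs: (batch, time, d_model)."""
    return F.mse_loss(h_orig_adapter, h_synth_encoder.detach())
```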
6. Loss of knowledge distillation from synthesized speech to original speech:
P(y_i = v_k | y_{<i}, s; θ, T) = exp(z_{i,k} / T) / Σ_{k'=1}^{|V|} exp(z_{i,k'} / T);
L_KD = -Σ_{i=1}^{N} Σ_{k=1}^{|V|} P(y_i = v_k | y_{<i}, s'; θ, T) · log P(y_i = v_k | y_{<i}, s; θ, T);
where z_i is the logits output by the shared decoder module at step i (the logits represent the probability distribution over all words in the vocabulary at position i); T is the temperature coefficient; P(y_i = v_k | y_{<i}, s'; θ, T) is the probability, given input s' and parameters θ, that the token output at step i is v_k (during inference each step generates one word; a token is a specific word in the vocabulary); P(y_i = v_k | y_{<i}, s; θ, T) is the corresponding probability for input s; |V| is the vocabulary size; N is the target translation length; KD stands for knowledge distillation (Knowledge Distillation).
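A sketch of this distillation term (teacher = synthesized speech, student = original speech), following the temperature-softened formula above; this is illustrative training code, not the embodiment's exact implementation:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, T: float = 1.0):
    """logits shapes: (batch, target_len N, vocab |V|)."""
    p_teacher = F.softmax(teacher_logits.detach() / T, dim=-1)  # soft labels from s'
    log_p_student = F.log_softmax(student_logits / T, dim=-1)   # predictions from s
    # Cross-entropy of teacher soft labels against student predictions, summed
    # over the vocabulary and target positions, averaged over the batch.
    return -(p_teacher * log_p_student).sum(dim=(-1, -2)).mean()
```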
Combining the 6 loss functions above, the loss function of the final overall speech translation model is:
L = L_ST + L_ASR + L_MT + L_ST' + L_Adapter + L_KD.
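The overall objective is then an unweighted sum, per the formula above (a sketch; the six terms are assumed to be computed as in the preceding snippets, and no weighting coefficients are specified in this embodiment):

```python
def total_loss(l_st, l_asr, l_mt, l_st_tts, l_adapter, l_kd):
    # L = L_ST + L_ASR + L_MT + L_ST' + L_Adapter + L_KD
    return l_st + l_asr + l_mt + l_st_tts + l_adapter + l_kd
```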
Step 3: translate the input speech to be translated using the trained speech translation model obtained in step 2, and output the target translated text.
The foregoing describes exemplary embodiments of the invention. It should be understood that those of ordinary skill in the art may make simple variations, modifications, or other equivalent arrangements without departing from the spirit of the invention, and such changes fall within the scope of protection of the invention.
Claims (8)
1. An end-to-end speech translation method using synthesized speech as supervisory information, comprising the following steps:
step 1: preprocessing the triplet original speech translation data to be translated to obtain quadruple speech translation data; wherein the triplet is: original speech, the transcribed text corresponding to the original speech, and the translated text corresponding to the original speech, and the quadruple is: original speech, synthesized speech, transcribed text, and translated text;
step 2: constructing a speech translation model, and training the speech translation model using the quadruple speech translation data obtained in step 1 as samples;
the speech translation model includes 5 modules: a speech encoder module, a text encoder module, a shared encoder module, an alignment adapter module, and a shared decoder module;
the speech encoder module encodes the original speech and the synthesized speech in the quadruple speech translation data into real-valued vectors representing the speech features within the model;
the text encoder module encodes the transcribed text in the quadruple speech translation data into word vectors;
the shared encoder module obtains semantic representations of the original speech, the synthesized speech, and the transcribed text;
the shared decoder module obtains inference results using an autoregressive method;
the alignment adapter module takes the semantic representation of the synthesized speech as supervisory information, making the semantic representation of the original speech approach that of the synthesized speech;
define a sample D = (s, s', x, y), where s, s', x, y represent the original speech, the synthesized speech, the transcribed text, and the target translated text respectively, and θ represents the model parameters; given the sample D and model parameters θ, the following loss functions are established:
(1) loss of translating the original speech into the target translated text:
L_ST = -log P(y | s; θ);
(2) loss of recognizing the original speech as the transcribed text:
L_ASR = -log P(x | s; θ);
(3) loss of machine-translating the transcribed text into the target translated text:
L_MT = -log P(y | x; θ);
(4) loss of translating the synthesized speech into the target translated text:
L_ST' = -log P(y | s'; θ);
for the 4 loss functions above, (s, y) denotes the original-speech/target-translation pair, (s', y) the synthesized-speech/target-translation pair, (s, x) the original-speech/transcribed-text pair, and (x, y) the transcribed-text/target-translation pair; P(y | s; θ) denotes the probability of translating s into y under model parameters θ; P(x | s; θ) denotes the probability of recognizing s as x under θ; P(y | x; θ) denotes the probability of translating x into y under θ;
(5) loss of the alignment adapter:
L_Adapter = MSE(Adapter(s), Encoder(s'));
wherein Adapter(s) is the output of s through the alignment adapter, Encoder(s') is the output of s' through the shared encoder, and MSE denotes the mean-square-error loss;
(6) loss of knowledge distillation from synthesized speech to original speech:
P(y_i = v_k | y_{<i}, s; θ, T) = exp(z_{i,k} / T) / Σ_{k'=1}^{|V|} exp(z_{i,k'} / T);
L_KD = -Σ_{i=1}^{N} Σ_{k=1}^{|V|} P(y_i = v_k | y_{<i}, s'; θ, T) · log P(y_i = v_k | y_{<i}, s; θ, T);
wherein z_i is the logits output by the shared decoder module at step i, the logits representing the probability distribution over all words in the vocabulary at position i, with one logits vector produced per decoding step; T is the temperature coefficient; P(y_i = v_k | y_{<i}, s'; θ, T) denotes the probability, given input s' and parameters θ, that the token output at step i is v_k, a token being a specific word in the vocabulary; P(y_i = v_k | y_{<i}, s; θ, T) denotes the corresponding probability for input s; |V| denotes the vocabulary size; N denotes the target translation length; and KD denotes knowledge distillation;
combining the 6 loss functions above, the loss function of the final overall speech translation model is:
L = L_ST + L_ASR + L_MT + L_ST' + L_Adapter + L_KD;
and step 3: translating the input speech to be translated using the trained speech translation model obtained in step 2, and outputting the target translated text.
2. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 1, wherein: step 1 comprises the following steps:
step 1.1: generating the synthesized speech corresponding to the transcribed text of the original speech;
step 1.2: adjusting the sampling rate of the synthesized speech obtained in step 1.1 to be the same as the sampling rate of the original speech;
step 1.3: computing the duration of the synthesized speech obtained in step 1.2 and the duration of the original speech, and computing the compression rate between them, wherein compression rate = duration of synthesized speech / duration of original speech;
step 1.4: filtering out original and synthesized speech whose compression rate does not meet the requirement, and synchronously filtering out the corresponding transcribed and translated texts;
step 1.5: for the original and synthesized speech processed in step 1.4, converting the duration of the synthesized speech to be consistent with the duration of the original speech;
step 1.6: adding the synthesized speech processed in step 1.5 into the original speech translation data to obtain the quadruple speech translation data.
3. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 2, wherein: in step 1.1, the synthesized speech corresponding to the transcribed text of the original speech is generated by means of the espnet2-TTS system.
4. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 2, wherein: in step 1.4, original and synthesized speech with compression rate < 0.4 or > 3 are filtered out, and the corresponding transcribed and translated texts are filtered out at the same time.
5. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 1, wherein: in step 2, the speech encoder module adds two CNN layers after the wav2vec2.0 open-source speech pre-training model.
6. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 1, wherein: in step 2, the shared encoder module adopts a classical Transformer Encoder structure.
7. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 1, wherein: in step 2, the shared decoder module adopts a classical Transformer Decoder structure.
8. The end-to-end speech translation method using synthesized speech as supervisory information according to claim 1, wherein: in step 2, the alignment adapter module adopts a classical Transformer Encoder structure.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310824069.5A CN117252213B (en) | 2023-07-06 | 2023-07-06 | End-to-end speech translation method using synthesized speech as supervision information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310824069.5A CN117252213B (en) | 2023-07-06 | 2023-07-06 | End-to-end speech translation method using synthesized speech as supervision information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117252213A true CN117252213A (en) | 2023-12-19 |
CN117252213B CN117252213B (en) | 2024-05-31 |
Family
ID=89125402
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310824069.5A Active CN117252213B (en) | 2023-07-06 | 2023-07-06 | End-to-end speech translation method using synthesized speech as supervision information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117252213B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2020038343A (en) * | 2018-08-30 | 2020-03-12 | 国立研究開発法人情報通信研究機構 | Method and device for training language identification model, and computer program for it |
CN111326157A (en) * | 2020-01-20 | 2020-06-23 | 北京字节跳动网络技术有限公司 | Text generation method and device, electronic equipment and computer readable medium |
CN112204653A (en) * | 2019-03-29 | 2021-01-08 | 谷歌有限责任公司 | Direct speech-to-speech translation through machine learning |
CN112951213A (en) * | 2021-02-09 | 2021-06-11 | 中国科学院自动化研究所 | End-to-end online voice detection and recognition method, system and equipment |
CN113505611A (en) * | 2021-07-09 | 2021-10-15 | 中国人民解放军战略支援部队信息工程大学 | Training method and system for obtaining better speech translation model in generation of confrontation |
CN114842834A (en) * | 2022-03-31 | 2022-08-02 | 中国科学院自动化研究所 | Voice text joint pre-training method and system |
CN115828943A (en) * | 2022-12-28 | 2023-03-21 | 沈阳雅译网络技术有限公司 | Speech translation model modeling method and device based on speech synthesis data |
CN115985298A (en) * | 2022-12-20 | 2023-04-18 | 沈阳雅译网络技术有限公司 | End-to-end speech translation method based on automatic alignment, mixing and self-training of speech texts |
CN116227503A (en) * | 2023-01-06 | 2023-06-06 | 沈阳雅译网络技术有限公司 | CTC-based non-autoregressive end-to-end speech translation method |
- 2023-07-06: CN CN202310824069.5A, patent CN117252213B (en), status: active
Non-Patent Citations (2)
Title |
---|
YUCHEN LIU ET AL.: "End-to-End Speech Translation with Knowledge Distillation", arXiv, 17 April 2019 (2019-04-17) *
ZOU JIANYUN: "Research on End-to-End Real-Time Speech Translation Based on Transformer Transducer", China Master's Theses Full-text Database (Information Science and Technology), no. 6, 15 June 2022 (2022-06-15) *
Also Published As
Publication number | Publication date |
---|---|
CN117252213B (en) | 2024-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107545903B (en) | Voice conversion method based on deep learning | |
WO2022048405A1 (en) | Text-based virtual object animation generation method, apparatus, storage medium, and terminal | |
CN112767958B (en) | Zero-order learning-based cross-language tone conversion system and method | |
Zhang et al. | Joint training framework for text-to-speech and voice conversion using multi-source tacotron and wavenet | |
Zhao et al. | Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams. | |
Zhou et al. | Converting anyone's emotion: Towards speaker-independent emotional voice conversion | |
CN108777140A (en) | Phonetics transfer method based on VAE under a kind of training of non-parallel corpus | |
CN114023316A (en) | TCN-Transformer-CTC-based end-to-end Chinese voice recognition method | |
Liu et al. | Voice conversion with transformer network | |
Luong et al. | Bootstrapping non-parallel voice conversion from speaker-adaptive text-to-speech | |
CN112420050B (en) | Voice recognition method and device and electronic equipment | |
CN111009235A (en) | Voice recognition method based on CLDNN + CTC acoustic model | |
CN111460143A (en) | Emotion recognition model of multi-person conversation system | |
Ueno et al. | Data augmentation for asr using tts via a discrete representation | |
An et al. | Speech Emotion Recognition algorithm based on deep learning algorithm fusion of temporal and spatial features | |
Wang et al. | Speech augmentation using wavenet in speech recognition | |
Dai et al. | Cloning one’s voice using very limited data in the wild | |
Moritani et al. | Stargan-based emotional voice conversion for japanese phrases | |
US11715457B1 (en) | Real time correction of accent in speech audio signals | |
CN117252213B (en) | End-to-end speech translation method using synthesized speech as supervision information | |
US20230317059A1 (en) | Alignment Prediction to Inject Text into Automatic Speech Recognition Training | |
Akuzawa et al. | Conditional deep hierarchical variational autoencoder for voice conversion | |
CN115376533A (en) | Voice conversion method for personalized voice generation | |
US20240005907A1 (en) | Real time correction of accent in speech audio signals | |
CN113852851B (en) | Rapid lip movement-voice alignment method based on parallel flow model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |