CN113505611A - Training method and system for obtaining a better speech translation model via generative adversarial training - Google Patents

Training method and system for obtaining a better speech translation model via generative adversarial training

Info

Publication number
CN113505611A
Authority
CN
China
Prior art keywords
model
training
coding layer
loss
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110780410.2A
Other languages
Chinese (zh)
Other versions
CN113505611B (en)
Inventor
屈丹
张昊
杨绪魁
闫红刚
张文林
郝朝龙
魏雪娟
李真
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Engineering University of PLA Strategic Support Force
Zhengzhou Xinda Institute of Advanced Technology
Original Assignee
Information Engineering University of PLA Strategic Support Force
Zhengzhou Xinda Institute of Advanced Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force and Zhengzhou Xinda Institute of Advanced Technology
Priority to CN202110780410.2A priority Critical patent/CN113505611B/en
Publication of CN113505611A publication Critical patent/CN113505611A/en
Application granted granted Critical
Publication of CN113505611B publication Critical patent/CN113505611B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Abstract

The present invention provides a training method and system for obtaining a better speech translation model via generative adversarial training. The method comprises: collecting training data, and training an MT model with the transcription-translation data pairs in the training data; compressing the input length of the ST model with a shrinking mechanism so that the speech and text coding layer output lengths are approximately the same, which comprises first using a CTC loss to help the ST model predict the transcription of the speech and capture its acoustic information, and then using the spike phenomenon of CTC to remove redundant information from the ST model coding layer states; using an adversarial min-max game to fit the coding layer output distribution of the ST model to that of the MT model, helping the ST model capture more semantic information; and jointly training the whole speech translation model with the CTC loss as an additional loss combined with the loss of the end-to-end ST model. The invention can improve the recognition performance of the speech translation model, thereby improving speech translation efficiency and quality.

Description

Training method and system for obtaining a better speech translation model via generative adversarial training
Technical Field
The invention relates to the technical field of speech translation, and in particular to a training method and system for obtaining a better speech translation model via generative adversarial training, as well as a speech translation method and device.
Background
Speech translation refers to inputting speech in one language and outputting text in another language. A traditional speech translation system adopts a cascade: speech is first transcribed by an ASR (Automatic Speech Recognition) system, and the transcription is then fed into an MT (Machine Translation) system for translation. Such a system can exploit more data to train the ASR and MT components separately and thus obtain an ST (Speech Translation) system of higher translation quality, so cascaded ST systems have been widely applied for many years.
An end-to-end system skips the intermediate transcription stage and translates the input speech directly. Because it avoids error propagation between the two systems, has low inference latency and has lower memory and computing requirements, it has attracted increasing attention from researchers in recent years and has gradually become the mainstream paradigm. For an end-to-end ST system with an encoding-decoding architecture, however, one recognized problem is the overburdened coding layer: it must not only extract the acoustic information of the raw audio but also map it into the semantic space, i.e., it must not only listen but also understand. Insufficient data and the absence of internal supervision signals prevent the ST coding layer from being trained effectively, which hinders further improvement of the whole model.
The currently common solution is to pre-train the coding layer of the ST model with an ASR model. In the ASR model, however, the alignment between speech and transcription is monotonic and captures local, surface-level information, so pre-training in this way helps the ST model capture acoustic rather than semantic information. In fact, a very common observation is that the modality gap causes an MT model trained on the same data pairs to outperform the ST model, and many researchers have considered using the MT model to enhance the performance of the ST model. The most direct method is knowledge distillation, but it only provides guidance at the last layer of the model and hardly influences the earlier parts of the network.
Disclosure of Invention
Aiming at the problem that the coding layer of the ST model cannot be trained effectively due to insufficient data and the absence of internal supervision signals, the invention provides a training method and system for obtaining a better speech translation model via generative adversarial training.
In a first aspect, the present invention provides a training method for obtaining a better speech translation model via generative adversarial training, comprising:
step 1: collecting training data, and training an MT model with the transcription-translation data pairs in the training data;
step 2: compressing the input length of the ST model with a shrinking mechanism so that the speech and text coding layer output lengths are approximately the same, comprising: first using a CTC loss to help the ST model predict the transcription of the speech and capture its acoustic information; then using the spike phenomenon of CTC to remove redundant information from the ST model coding layer states;
step 3: using an adversarial min-max method to fit the coding layer output distribution of the ST model to the coding layer output distribution of the MT model, thereby helping the ST model capture more semantic information;
step 4: jointly training the whole speech translation model with the CTC loss as an additional loss combined with the loss of the end-to-end ST model.
Further, in step 2, the shrunk ST model coding layer state h' is described as:
h' = { h_i | Idx(argmax(Softmax(Wᵀh_i))) ≠ '-' and Idx(argmax(Softmax(Wᵀh_i))) ≠ Idx(argmax(Softmax(Wᵀh_{i-1}))) }
wherein Idx denotes the corresponding token indexed in the vocabulary, h denotes the original ST model coding layer state, W denotes the linear transformation in the CTC module, ᵀ denotes transposition, and h_i denotes the i-th row vector of h.
Further, step 3 comprises:
training a discriminator for classifying speech and text encodings; the input of the discriminator is a coding layer output h, and the output of the discriminator is a binary classification prediction p_D(l | h) of the modality corresponding to h, where l ∈ {0, 1}: 0 means h comes from speech and 1 means h comes from text; for the MT model, the coding layer output h is the coding layer output h_MT of the MT model; for the ST model, the coding layer output h is the semantic coding layer output h_s of the ST model;
the process of training the discriminator comprises: the discriminator is first trained to predict the modality correctly by minimizing the cross-entropy loss:
L_adv(θ_D) = -Σ_i log p_D(l_i | h_i)
wherein l_i denotes the modality corresponding to h_i, l_i ∈ {ST, MT}, p_D denotes the probability of selecting the correct modality for the coding layer output h, and θ_D denotes the parameters of the discriminator;
the semantic coding layer of the ST model is then trained to produce outputs that fool the discriminator, by minimizing the cross-entropy loss with flipped modality labels:
L_adv(θ_s) = -Σ_i log p_D(MT | h_s,i)
further, in step 4, the objective function adopted by the joint training is adopted
Figure BDA0003156565950000032
As shown in the following formula:
Figure BDA0003156565950000033
wherein the content of the first and second substances,
Figure BDA0003156565950000034
the loss of the CTC is indicated,
Figure BDA0003156565950000035
representing the coding layer loss of the ST model,
Figure BDA0003156565950000036
represents the loss of the end-to-end ST model, θaParameters, θ, representing the acoustic coding layer of the ST modelsParameters, θ, representing the semantic coding layer of the ST modeldParameter, θ, representing the decoded layer of the ST modelDRepresenting the parameters of the countermeasure, alpha, beta, gamma are hyper-parameters.
Further, L_CTC is defined as:
L_CTC = -log Pr(x* | h)
wherein
Pr(x | h) = Σ_{a ∈ B⁻¹(x)} Pr(a | h)
Pr(a | h) = Π_{t=1}^{T_x} Pr(a_t, t | h)
Pr(k, t | h) = q_t^k
wherein h denotes the coding layer output of length T_x; x* denotes the target transcription sequence; x denotes the transcription sequence corresponding to the speech; B denotes the operation of processing an alignment into a transcription; Pr(x | h) denotes the posterior probability of the transcription x; Pr(a | h) denotes the probability of a CTC alignment a; Pr(k, t | h) denotes the probability of generating the k-th label at time step t; q_t^k denotes the k-th element of q_t; V denotes the vocabulary; and '-' denotes the blank label.
Further, L_ST is defined as:
L_ST = -Σ_{t=1}^{T_y} log p_θ(y_t | y_{<t}, h_s)
wherein θ denotes the parameters of the ST model, p_θ denotes the probability of the decoding layer output after softmax, y = [y_1, …, y_{T_y}] denotes the translation sequence, T_y denotes the length of the translation sequence, and h_s denotes the semantic coding layer output of the ST model.
In a second aspect, the present invention also provides a training system for obtaining a better speech translation model via generative adversarial training, comprising:
a data collection module for collecting training data; and
a model training module for training an MT model with the transcription-translation data pairs in the training data; compressing the input length of the ST model with a shrinking mechanism so that the speech and text coding layer output lengths are approximately the same, comprising: first using a CTC loss to help the ST model predict the transcription of the speech and capture its acoustic information, then using the spike phenomenon of CTC to remove redundant information from the ST model coding layer states; using a discriminator to bring the coding layer output distribution of the ST model close to that of the MT model through a min-max game, helping the ST model capture more semantic information; and jointly training the whole speech translation model with the CTC loss as an additional loss combined with the loss of the end-to-end ST model.
In a third aspect, the present invention further provides a speech translation method, comprising:
acquiring target speech to be translated; and
translating the target speech with a speech translation model obtained by training with the above method.
In a fourth aspect, the present invention further provides a speech translation device, comprising:
a speech acquisition unit for acquiring target speech to be translated; and
a speech translation unit for translating the target speech with a speech translation model obtained by training with the above method.
The invention has the following beneficial effects:
The invention uses an adversarial approach to bring the coding layer output of the ST model close to that of the MT model, reducing the representation difference between the two modalities and improving the performance of the ST model. Specifically, the coding layer output of the trained MT model is regarded as the "real" representation, and the coding layer output of the ST model as the "fake" representation. The encoder of the ST model must generate a more realistic representation that confuses the discriminator so that it cannot tell whether its input comes from speech or text, while the discriminator must judge whether the input is "real" or "fake". In this generative adversarial process, the output distribution of the ST model coding layer gradually approaches that of the MT model coding layer and learns more semantic information. In addition, the invention adopts a shrinking mechanism to solve the problem of inconsistent lengths of speech and text: the posterior probabilities of the spike labels output by the CTC module are used to filter out redundant coding layer states.
Drawings
FIG. 1 is a schematic flow chart of a training method for obtaining a better speech translation model via generative adversarial training according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a training framework for obtaining a better speech translation model via generative adversarial training according to an embodiment of the present invention;
FIG. 3 shows the training process of the model with and without the shrinking mechanism on the En-De data set according to an embodiment of the present invention;
FIG. 4 shows the performance of the baseline and the model of the invention on En-De at different data volumes according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in FIG. 1, an embodiment of the present invention provides a training method for obtaining a better speech translation model via generative adversarial training, comprising the following steps:
Step 1: collecting training data, and training an MT model with the transcription-translation data pairs in the training data;
Step 2: compressing the input length of the ST model with a shrinking mechanism so that the speech and text coding layer output lengths are approximately the same, comprising: first using a CTC loss to help the ST model predict the transcription of the speech and capture its acoustic information; then using the spike phenomenon of CTC to remove redundant information from the ST model coding layer states;
Step 3: using a discriminator to bring the coding layer output distribution of the ST model close to that of the MT model through a min-max game, helping the ST model capture more semantic information;
Specifically, after step 2 the shrunk ST model coding layer states contain acoustic information but still lack the semantic information needed for translation. To obtain a better semantic representation, the coding layer output of the MT model is used as an internal supervision signal to guide the learning of the ST model coding layer, so that the output distribution of the ST coding layer approaches that of the MT coding layer and the distance between the two modal representations is reduced.
Step 4: jointly training the whole speech translation model with the CTC loss as an additional loss combined with the loss of the end-to-end ST model.
The embodiment of the invention uses an adversarial approach to bring the coding layer output of the ST model close to that of the MT model, reducing the representation difference between the two modalities and improving the performance of the ST model. Specifically, the coding layer output of the trained MT model is regarded as the "real" representation, and the coding layer output of the ST model as the "fake" representation. The encoder of the ST model must generate a more realistic representation that confuses the discriminator so that it cannot tell whether its input comes from speech or text, while the discriminator must judge whether the input is "real" or "fake". In this generative adversarial process, the output distribution of the ST model coding layer gradually approaches that of the MT model coding layer and learns more semantic information. In addition, there is a length mismatch between text and speech: even after down-sampling, the length of the speech (in frames) is usually more than 20 times the length of the text (in tokens), and it was found that the model cannot be trained if this mismatch is left unaddressed. The embodiment of the invention therefore adopts a shrinking mechanism, filtering out redundant coding layer states with the posterior probabilities of the spike labels output by the CTC module.
The embodiment of the invention relies on two key techniques: (1) compressing the length of the speech by exploiting the spike phenomenon of CTC; (2) introducing a discriminator into the model structure, so that the speech translation model learns from the machine translation model. This assists the training of the speech translation model, improves its performance and reduces the requirement on data volume, thereby yielding a better speech translation model.
Example 2
On the basis of the above embodiments, a training method for obtaining a better speech translation model in generating countermeasures according to the embodiments of the present invention is described in more detail with reference to the training framework shown in fig. 2.
The training framework of the invention is a general, network-agnostic structure: it applies equally to convolutional networks, recurrent neural networks and Transformer structures. The embodiment of the present invention adopts the Transformer structure as the main structure, as shown in FIG. 2.
The training framework of the invention mainly comprises five parts: (1) an acoustic coding layer that encodes the acoustic features into coding layer states corresponding to the source text; (2) a CTC module that predicts the transcription of the speech and helps the acoustic coding layer capture acoustic information; (3) a shrinking mechanism that filters out redundant coding layer states using the posterior probabilities predicted by CTC, solving the problem of inconsistent lengths of the speech and text representations; (4) a discriminator that brings the representation of the speech close to the representation of the text, reducing the difference between the two modal representations; and (5) a decoder that predicts text in the target language from the speech representation.
Regarding FIG. 2, note that the acoustic coding layer, the shrinking mechanism and the semantic coding layer are labeled in the figure. For simplicity of presentation, the residual connections and layer normalization in the Transformer structure, and the translated-text input of the decoding part, are not shown in FIG. 2.
The speech translation task refers to producing text output in a target language given input speech, and training minimizes the cross entropy between the predicted text and the reference text. The speech translation corpus is a set of triplets, denoted D_ST = {(s, x, y)}, where s = [s_1, …, s_{T_s}] denotes the sequence of acoustic features (e.g., MFCC), x = [x_1, …, x_{T_x}] is the corresponding transcription sequence, and y = [y_1, …, y_{T_y}] is the corresponding translation sequence. T_s, T_x and T_y denote the lengths of the acoustic feature sequence, the transcription and the translation, respectively.
(1) Acoustic coding layer: this layer processes the acoustic features and encodes them into vectors corresponding to the speech transcription. Before inputting the acoustic features s into the network, the embodiment of the present invention preprocesses them with a two-layer CNN, which preliminarily reduces the length (number of frames) of the acoustic features and transforms the feature dimension to the model dimension. The transformed features are input into N_a stacked Transformer sub-blocks, each consisting of multi-head self-attention, a forward neural network, residual connections and layer normalization. The process can be expressed as (residual connections and layer normalization omitted for simplicity):
s' = CNN(s)
h = FFN(Self_Attention(s'))
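The following is a minimal PyTorch sketch of this acoustic front-end, assuming the experimental settings given later (two 256-channel CNN layers with kernel 3 and stride 2, 80-dimensional input features, N_a = 6 Transformer sub-blocks with 4 heads); all class and parameter names are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, n_layers=6, n_heads=4, d_ff=2048):
        super().__init__()
        # Two stride-2 convolutions reduce the frame count roughly 4x.
        self.subsample = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        freq = (feat_dim + 3) // 4               # feature dim after two stride-2 convs
        self.proj = nn.Linear(d_model * freq, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, d_ff,
                                           dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, feats):                    # feats: (B, T, feat_dim)
        x = self.subsample(feats.unsqueeze(1))   # (B, d_model, ~T/4, ~feat_dim/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        return self.encoder(self.proj(x))        # h: (B, ~T/4, d_model)
```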
(2) CTC module: in the embodiment of the invention, a CTC loss helps the model predict the transcription of the speech and capture acoustic information. The CTC module contains only one linear projection layer. Given the coding layer output h of length T_x, the projected vectors are normalized with Softmax to obtain q, giving the probability of generating the k-th label at time step t:
Pr(k, t | h) = q_t^k, where q_t = Softmax(Wᵀh_t)
wherein q_t^k denotes the k-th element of q_t, V denotes the vocabulary, and '-' denotes the blank label. A CTC alignment a is an index list of length T_x containing blanks and labels. The probability of a, Pr(a | h), is the product of the emission probabilities at each time step:
Pr(a | h) = Π_{t=1}^{T_x} q_t^{a_t}
The alignment a of CTC is a many-to-one mapping to a transcription, since blank and consecutively repeated labels are allowed. For example, the alignments (a, -, b, c, -, -) and (-, -, a, -, b, c) both correspond to the transcription (a, b, c). B denotes the operation of processing an alignment into a transcription: B first removes the repeated labels in the alignment and then removes the blank labels. The posterior probability of a transcription x is the sum of the probabilities of all alignments corresponding to x:
Pr(x | h) = Σ_{a ∈ B⁻¹(x)} Pr(a | h)
The intuition behind CTC is that, since it is not known where in the speech a transcription label appears without an alignment between speech and text, all possible positions are summed over. Given a target transcription x*, the objective function of CTC is:
L_CTC = -log Pr(x* | h)
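As a hedged sketch, this objective maps directly onto torch.nn.functional.ctc_loss, which performs the sum over all alignments via dynamic programming; the blank index of 0 and the tensor layout are assumptions of this example.

```python
import torch
import torch.nn.functional as F

def ctc_objective(h, ctc_proj, x_star, h_lens, x_lens, blank=0):
    """h: (B, T', d) encoder states; ctc_proj: nn.Linear(d, vocab_size),
    i.e. W in the text; x_star: (B, L) padded target transcriptions."""
    log_probs = F.log_softmax(ctc_proj(h), dim=-1)   # q_t = Softmax(W^T h_t)
    # F.ctc_loss expects (T', B, V) and sums Pr(a|h) over all alignments a.
    return F.ctc_loss(log_probs.transpose(0, 1), x_star, h_lens, x_lens,
                      blank=blank, zero_infinity=True)
```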
(3) Shrinking mechanism: CTC exhibits a spike phenomenon: apart from the blank labels, the remaining predicted labels form spikes, as shown in part (b) of FIG. 2. This phenomenon can reduce the number of recognition units and speed up decoding. The units corresponding to the spikes are regarded as the desired coding layer states, containing the prior information of the original speech. The remaining blank and repeated labels contain no useful information; they only increase the length and impede model training. The coding layer states h' corresponding to the CTC spikes can thus be extracted, as shown in FIG. 2: the label whose index has the maximum softmax probability is first taken as the predicted label, and then the coding layer states corresponding to non-blank labels are extracted. For repeated labels, only the coding layer state corresponding to the first label is extracted and the others are discarded. In the embodiment of the present invention, the shrunk ST model coding layer state is described as:
h' = { h_i | Idx(argmax(Softmax(Wᵀh_i))) ≠ '-' and Idx(argmax(Softmax(Wᵀh_i))) ≠ Idx(argmax(Softmax(Wᵀh_{i-1}))) }
wherein Idx denotes the corresponding token indexed in the vocabulary, h denotes the original ST model coding layer state, h' denotes the shrunk ST model coding layer state, W denotes the linear transformation in the CTC module, ᵀ denotes transposition, and h_i denotes the i-th row vector of h.
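A minimal sketch of this shrinking rule for a single utterance, assuming blank index 0: greedy CTC labels are computed from the projected states, blank positions are dropped, and only the first state of each run of repeated labels is kept. The function name is illustrative.

```python
import torch

def shrink(h, ctc_proj, blank=0):
    """h: (T', d) encoder states of one utterance; returns the shrunk state h'."""
    pred = ctc_proj(h).argmax(dim=-1)            # greedy CTC labels, shape (T',)
    keep = pred != blank                         # drop blank positions
    first_of_run = torch.ones_like(keep)         # drop repeats: keep first of each run
    first_of_run[1:] = pred[1:] != pred[:-1]
    return h[keep & first_of_run]                # states at the CTC spikes
```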
(4) Semantic coding layer and decoder: the shrunk ST model coding layer state contains acoustic information but still lacks the semantic information required for translation. To obtain a better semantic representation, the embodiment of the invention uses N_s stacked Transformer sub-blocks to encode the shrunk coding layer state:
h_s = Semantic_Encoder(h')
A basic stack of N_d Transformer sub-blocks serves as the decoder: it first encodes the embedding of the target text with a self-attention layer, then integrates the encoded output with the semantic coding layer output through an interactive attention layer. The process is described as:
h_y = Embedding(y)
h_d = FFN(Cross_Attention(Self_Attention(h_y), h_s))
Finally, the cross-entropy loss of the end-to-end ST model is defined as:
L_ST = -Σ_{t=1}^{T_y} log p_θ(y_t | y_{<t}, h_s)
wherein θ denotes the parameters of the ST model and p_θ denotes the probability of the decoding layer output h_d after softmax.
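A sketch of the decoder step and the ST loss above: PyTorch's built-in TransformerDecoderLayer already chains self-attention, the interactive (cross) attention layer and the forward network. The padding index and the class and function names are assumptions of this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_layers=6, n_heads=4, d_ff=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, d_ff,
                                           dropout=0.1, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, y, h_s):                   # y: (B, Ty), h_s: shrunk+encoded states
        mask = nn.Transformer.generate_square_subsequent_mask(y.size(1)).to(y.device)
        h_d = self.decoder(self.embed(y), h_s, tgt_mask=mask)
        return self.out(h_d)                     # logits defining p_theta

def st_loss(logits, y_ref, pad_id=0):
    # Token-level cross entropy of the end-to-end ST model.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           y_ref.reshape(-1), ignore_index=pad_id)
```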
(5) Adversarial training: to enable the model to obtain a better semantic representation, the embodiment of the invention uses the coding layer output of the MT model as an internal supervision signal to guide the learning of the ST model coding layer, so that the output distribution of the ST coding layer approaches that of the MT coding layer and the distance between the two modal representations is reduced.
To bring the distributions of the two modalities close, the embodiment of the present invention trains a discriminator to classify speech and text encodings. The input of the discriminator is a coding layer output h, and the output of the discriminator is a binary classification prediction p_D(l | h) of the modality corresponding to h, where l ∈ {0, 1}: 0 means h comes from speech and 1 means h comes from text. For the MT model, the coding layer output h is the coding layer output h_MT of the MT model; for the ST model, the coding layer output h is the semantic coding layer output h_s of the ST model.
The process of training the discriminator comprises: the discriminator is first trained to predict the modality correctly by minimizing the cross-entropy loss:
L_adv(θ_D) = -Σ_i log p_D(l_i | h_i)
wherein l_i denotes the modality corresponding to h_i, l_i ∈ {ST, MT}, p_D denotes the probability of selecting the correct modality for the coding layer output h, and θ_D denotes the parameters of the discriminator.
The semantic coding layer of the ST model is then trained to produce outputs that fool the discriminator, by minimizing the cross-entropy loss with flipped modality labels:
L_adv(θ_s) = -Σ_i log p_D(MT | h_s,i)
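A hedged sketch of the two sides of this min-max game, treating the discriminator D as a module that maps each representation to a single logit with the convention above (0 = speech, 1 = text); the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, h_s, h_mt):
    """Train D to label speech states 0 and text states 1 (the caller detaches
    h_s so that only D is updated here)."""
    logits = torch.cat([D(h_s), D(h_mt)])                       # (N, 1) scores
    labels = torch.cat([torch.zeros(h_s.size(0), 1, device=h_s.device),
                        torch.ones(h_mt.size(0), 1, device=h_mt.device)])
    return F.binary_cross_entropy_with_logits(logits, labels)

def generator_loss(D, h_s):
    # Fool D: push p_D(text | h_s) toward 1 for speech representations.
    target = torch.ones(h_s.size(0), 1, device=h_s.device)
    return F.binary_cross_entropy_with_logits(D(h_s), target)
```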
it is noted that in the embodiments of the present invention, the purpose of the reactor is to guide the ST model with the output of the MT model, rather than to make the model learn the output of multiple modalities. So the MT model is fixed and its parameters are not updated as "real output" throughout the training process. In addition, the role of the antagonists is to approximate the distribution of the semantic space of the two modalities, so there is no need for a one-to-one correspondence of the full two modality inputs, which can help model training with additional text with low resource ST data.
By incorporating antagonists into the speech translation model, embodiments of the present invention can ensure that coding layer representations from different modalities become indistinguishable, i.e., arrive at the representation of the ST model guided by the representation of the MT model as described earlier, bringing the representations of the two close together.
Finally, in the model of the invention, the objective function L(θ_a, θ_s, θ_d, θ_D) adopted by the joint training is:
L(θ_a, θ_s, θ_d, θ_D) = α·L_CTC(θ_a) + β·L_adv(θ_s, θ_D) + γ·L_ST(θ_a, θ_s, θ_d)
wherein L_CTC denotes the CTC loss, L_adv denotes the coding layer loss of the ST model, L_ST denotes the loss of the end-to-end ST model, θ_a denotes the parameters of the acoustic coding layer of the ST model, θ_s the parameters of the semantic coding layer of the ST model, θ_d the parameters of the decoding layer of the ST model, θ_D the parameters of the discriminator, and α, β, γ are hyper-parameters.
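A sketch of one joint training step under this objective, reusing discriminator_loss and generator_loss from the sketch above: the discriminator is updated on detached encoder states first, then the ST model is updated with the weighted sum of the three losses. The st_model interface and the values of α, β, γ here are assumptions of this example, and the MT encoder stays frozen as the "real" representation.

```python
import torch

def train_step(batch, st_model, mt_encoder, D, opt_st, opt_d,
               alpha=0.5, beta=0.5, gamma=1.0):
    # Assumed interface: the ST model returns its semantic encoder states
    # together with the CTC and end-to-end translation losses.
    h_s, l_ctc, l_st = st_model(batch.speech, batch.transcript, batch.translation)
    with torch.no_grad():                        # MT model is fixed ("real output")
        h_mt = mt_encoder(batch.transcript)

    # (1) Discriminator step on detached speech states.
    opt_d.zero_grad()
    discriminator_loss(D, h_s.detach(), h_mt).backward()
    opt_d.step()

    # (2) ST model step: CTC + adversarial + end-to-end translation loss.
    opt_st.zero_grad()
    loss = alpha * l_ctc + beta * generator_loss(D, h_s) + gamma * l_st
    loss.backward()
    opt_st.step()
    return loss.item()
```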
Example 3
Correspondingly, an embodiment of the invention also provides a training system for obtaining a better speech translation model via generative adversarial training, comprising:
a data collection module for collecting training data; and
a model training module for training an MT model with the transcription-translation data pairs in the training data; compressing the input length of the ST model with a shrinking mechanism so that the speech and text coding layer output lengths are approximately the same, comprising: first using a CTC loss to help the ST model predict the transcription of the speech and capture its acoustic information, then using the spike phenomenon of CTC to remove redundant information from the ST model coding layer states; using a discriminator to bring the coding layer output distribution of the ST model close to that of the MT model through a min-max game, helping the ST model capture more semantic information; and jointly training the whole speech translation model with the CTC loss as an additional loss combined with the loss of the end-to-end ST model.
It should be noted that the training system provided in the embodiments of the present invention implements the training method of the method embodiments; for its functions, reference may be made to the method embodiments, which are not repeated here.
Example 4
An embodiment of the invention also provides a speech translation method, comprising:
acquiring target speech to be translated; and
translating the target speech with a speech translation model obtained by training with the method of the above method embodiments.
Example 5
Correspondingly, an embodiment of the present invention further provides a speech translation device, comprising:
a speech acquisition unit for acquiring target speech to be translated; and
a speech translation unit for translating the target speech with a speech translation model obtained by training with the method of the above method embodiments.
To verify the effectiveness of the training method of the speech translation model provided by the invention, the following experiments were carried out.
(1) Data sets
Experiments were performed on two public data sets: the Augmented LibriSpeech English-French corpus and the MuST-C English-German TED corpus.
Augmented LibriSpeech English-French (En-Fr): this corpus is obtained by aligning French e-books with the English utterances in LibriSpeech, and the data volume is further increased by providing French translations through Google Translate. The entire corpus contains 236 hours of speech. The experiments were trained on the 100-hour (47271 utterances) clean data set; the model selected on the development set (2 hours, 1071 utterances) was used for testing on the test set (4 hours, 2048 utterances).
MuST-C English-German TED (En-De for short): the MuST-C corpus is based on English TED talks and includes English speech with the corresponding transcriptions and translations in 8 different directions. Experiments were performed in the English-German direction. The corpus contains 234K translation pairs totaling 408 hours, divided into a training set (400 hours, 229703 utterances), a validation set (3 hours, 1423 utterances) and a test set (5 hours, 2641 utterances).
Additional data: for further analysis and comparison, some experiments required additional ASR data; the 960-hour LibriSpeech data was used.
(2) Preprocessing and evaluation
For speech, the experiments used Kaldi to extract 80-dimensional MFCC features with a step size of 10 ms and a window length of 25 ms, normalized in mean and variance. The data volume was increased with speed perturbations of 0.9, 1.0 and 1.1. Text data in the source language was lowercased, tokenized and stripped of all punctuation, consistent with the ASR data. Text data in the target language was lowercased and tokenized, with normalized punctuation. Text processing was implemented with Moses. BPE was applied to the combined source and target text to obtain shared sub-word units, with the vocabulary size set to 8K.
For comparison with existing work, the multi-bleu.pl script was used to compute case-insensitive BLEU values as the model evaluation metric.
(3) Model set-up
The models involved in the experiments all adopt the Transformer structure. For the MT and ASR models, the number of encoding and decoding layers is 6, the number of attention heads is 4, the embedding dimension is 256, the filter size of the forward neural network is d_ff = 2048, attention dropout is 0 and the remaining dropouts are 0.1. For the ST model, the number of acoustic coding, semantic coding and decoding layers is 6, with 4 attention heads, an embedding dimension of 256, attention dropout of 0 and remaining dropouts of 0.1. For speech preprocessing, a 2-D CNN is used for down-sampling; each CNN layer has 256 output channels, a convolution kernel of 3 and a stride of 2.
For the discriminator, a three-layer forward neural network is used, with a hidden dimension of 1024, Leaky-ReLU as the activation function and dropout of 0.1. To ensure stable training of the discriminator, the parameters of each layer are processed with spectral normalization.
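A sketch of this discriminator configuration, assuming the 256-dimensional encoder states above as input; torch.nn.utils.spectral_norm wraps each linear layer as described, and the builder name is illustrative.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def make_discriminator(d_in=256, d_hidden=1024, p_drop=0.1):
    # Three spectrally normalized linear layers with Leaky-ReLU, as configured above.
    return nn.Sequential(
        spectral_norm(nn.Linear(d_in, d_hidden)), nn.LeakyReLU(), nn.Dropout(p_drop),
        spectral_norm(nn.Linear(d_hidden, d_hidden)), nn.LeakyReLU(), nn.Dropout(p_drop),
        spectral_norm(nn.Linear(d_hidden, 1)),   # single logit: 0 = speech, 1 = text
    )
```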
(4) Training arrangement
The models were trained on 4 NVIDIA V100 GPUs with the Adam optimizer. For the ASR, MT and ST models, the learning-rate decay schedule follows the original Transformer paper, with a scaling factor k of 5.0 and warmup steps of 25000 for ASR and ST and 8000 for MT. For the discriminator, the learning rate is 1e-4, with β1 = 0.9, β2 = 0.999 and ε = 1e-8.
The 5 best-performing checkpoints on the validation set were saved and averaged to obtain the final test model. Early stopping was adopted with accuracy (ACC) as the criterion and a patience of 3. The beam size at decoding was set to 10. Experiments were implemented on the basis of ESPnet, with the code organized using pytorch-lightning.
TABLE 1 Experimental results on Augmented LibriSpeech English-French
(Table 1 appears as an image in the original publication and is not reproduced here.)
In Table 1, the corresponding model differs from the usual initialization method in that it employs joint ASR and MT training and then initializes the entire ST model with the MT model.
TABLE 2 Experimental results on MuST-C English-German TED
(Table 2 appears as an image in the original publication and is not reproduced here.)
Transformer + adversarial used joint ASR and MT training, after which the entire ST model was initialized with the MT model. Meta-Learning initializes the ST model using the MAML method.
(5) Analysis of results
Results on Augmented LibriSpeech English-French: experiments were performed under two settings. In E2E base, the model is trained using only the ST corpus. In E2E expanded, the acoustic coding layer is pre-trained with LibriSpeech. The trained speech translation model of the invention is compared with existing end-to-end models, as shown in Table 1.
In both the base and expanded experiments, the model of the invention outperforms all previous end-to-end models. Compared with Ashkan Alinejad's adversarially trained model (Transformer + adversarial), the model of the invention improves BLEU by 0.93. Their approach gives the MT coding layer multi-modal representation ability by jointly training ASR and MT, and initializes the ST model with the entire MT model, ensuring continuity between coding-layer and decoding-layer initialization. Although this gives the ST model a better initialization, it still does not address the absence of internal supervision signals, whereas the present method directly provides supervision signals to help model training and therefore performs better. The invention also achieves better results than knowledge distillation, since knowledge distillation guides the ST model with the MT model only at the last layer; although it provides a supervision signal, the signal hardly propagates into the earlier network. Compared with TCEN and Curriculum pre-training, the model of the invention is simple and flexible and avoids a complex training procedure. LUT and STAST also adopt different strategies to provide guidance to the coding layer and relieve its burden so as to obtain more semantic information; the model of the invention performs better than LUT and STAST.
Results on MuST-C English-German TED: Table 2 compares the performance of various models on the MuST-C English-German TED data set. Compared with Augmented LibriSpeech English-French, the performance gap between the ST model and the MT model (which can be regarded as the upper bound of the ST model) is larger. This is because the speech in this corpus is recorded live, contains more noise, and the transcriptions and translations are not well aligned with the original audio. The comparison of experimental results shows that the model of the invention achieves better results than all previous models without any initialization, exhibiting the same trend as on the Augmented LibriSpeech English-French data set.
(6) Ablation test
TABLE 3 Ablation tests on the two data sets
(Table 3 appears as an image in the original publication and is not reproduced here.)
Impact of the additional losses: to evaluate the contribution of each part of the model of the invention, ablation tests of the additional tasks were performed on both data sets. The results in Table 3 show that each part of the model has a positive effect. If the adversarial loss is removed, the performance of the model degrades, demonstrating the ability of the discriminator to reduce the difference between the representations of the two modalities, so that the model learns more semantic information and performs better. If the semantic coding layer is removed, the model performance drops sharply, which means the semantic coding layer is necessary and gives the model more capacity to learn semantic information. Without the shrinking mechanism the model behaves very poorly, since the coding layer then contains too many meaningless and redundant states, making the lengths of the text and speech representations differ too much; the discriminator cannot learn in this situation.
Influence of the hyper-parameters: experiments were performed on the MuST-C En-De data by varying the weights of the different loss terms in the loss function. To reduce the burden of parameter tuning, α was set to 1 - β.
In general, the most appropriate discriminator weight varies with the CTC weight, but after convergence the model performance does not change much.
TABLE 4 Ablation testing on the En-De test set
(Table 4 appears as an image in the original publication and is not reproduced here.)
Impact of the shrinking mechanism on training: FIG. 3 illustrates the training process of the model with and without the shrinking mechanism on the En-De data set, i.e., the discriminator loss L_adv(θ_D) and the generator loss L_adv(θ_s). It can be seen that, without the shrinking mechanism, the length difference between the two modalities is too large, the generator loss rises rapidly, the goal of pushing the speech representation toward the text representation by confusing the discriminator cannot be achieved, and the model cannot converge.
Effect of the shrinking mechanism at test time: the shrinking mechanism effectively compresses the coding layer states, but the compression inevitably loses some information at test time, so the shrinking mechanism was removed during testing. The results in Table 5 show that removing the shrinking mechanism at test time brings a gain of around 0.2-0.5 BLEU.
TABLE 5 Performance of the model of the invention on the En-De test set with and without the shrinking mechanism
(Table 5 appears as an image in the original publication and is not reproduced here.)
Probe tasks: the learned representations were further analyzed by performing probe tasks on the Fluent Speech Commands data set. SpeakerVer is a task designed to recognize the speaker and benefits from more acoustic information; IntentIde focuses on intent recognition and requires more linguistic information. The data set contains 30043 utterances, 97 speakers and 31 intents. The utterances were randomly divided into training, test and validation sets so that every speaker and intent appears in each set. During training, the coding layer states of the trained model were extracted and frozen, a fully connected layer was added, and fine-tuning was performed separately on the two probe tasks; the accuracy on the test set is reported. The comparison shows that, in the ST model, acoustic information is mainly learned in the bottom layers while semantic information is captured with emphasis in the higher layers.
TABLE 6 Classification accuracy on the speaker recognition and intent recognition tasks
(Table 6 appears as an image in the original publication and is not reproduced here.)
AE denotes the acoustic coding layer output and SE denotes the semantic coding layer output.
Impact of additional text: the behavior of the model in low-resource situations was studied. For different data-resource scenarios, 50, 100, 200 and 300 hours of speech training data (speech-transcription-translation triplets) were randomly selected from the En-De data set. Two models, the basic end-to-end baseline and the model of the invention, were then trained on data of the different scales, with the model of the invention additionally able to use extra transcription data.
FIG. 4 shows the performance of the baseline and the model of the invention on En-De at the different data volumes. As the amount of data increases, the performance of both models keeps improving, and the model of the invention stays consistently above the baseline. With only 50 hours of data, the model of the invention achieves performance comparable to the 300-hour baseline. This is because the model of the invention provides a supervision signal to the ST coding layer and, by using the discriminator to push the representation of speech toward the representation of text, helps the ST model capture more semantic information, reduces the learning difficulty and reduces the model's requirement on data volume.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A training method for obtaining a better speech translation model via generative adversarial training, comprising:
step 1: collecting training data, and training an MT model with the transcription-translation data pairs in the training data;
step 2: compressing the input length of the ST model with a shrinking mechanism so that the speech and text coding layer output lengths are approximately the same, comprising: first using a CTC loss to help the ST model predict the transcription of the speech and capture its acoustic information; then using the spike phenomenon of CTC to remove redundant information from the ST model coding layer states;
step 3: using an adversarial min-max method to fit the coding layer output distribution of the ST model to the coding layer output distribution of the MT model, thereby helping the ST model capture more semantic information;
step 4: jointly training the whole speech translation model with the CTC loss as an additional loss combined with the loss of the end-to-end ST model.
2. The training method for obtaining a better speech translation model via generative adversarial training according to claim 1, wherein in step 2, the shrunk ST model coding layer state h' is described as:
h' = { h_i | Idx(argmax(Softmax(Wᵀh_i))) ≠ '-' and Idx(argmax(Softmax(Wᵀh_i))) ≠ Idx(argmax(Softmax(Wᵀh_{i-1}))) }
wherein Idx denotes the corresponding token indexed in the vocabulary, h denotes the original ST model coding layer state, W denotes the linear transformation in the CTC module, ᵀ denotes transposition, and h_i denotes the i-th row vector of h.
3. The training method for obtaining a better speech translation model via generative adversarial training according to claim 1, wherein step 3 comprises:
training a discriminator for classifying speech and text encodings; the input of the discriminator is a coding layer output h, and the output of the discriminator is a binary classification prediction p_D(l | h) of the modality corresponding to h, where l ∈ {0, 1}: 0 means h comes from speech and 1 means h comes from text; for the MT model, the coding layer output h is the coding layer output h_MT of the MT model; for the ST model, the coding layer output h is the semantic coding layer output h_s of the ST model;
the process of training the discriminator comprises: the discriminator is first trained to predict the modality correctly by minimizing the cross-entropy loss:
L_adv(θ_D) = -Σ_i log p_D(l_i | h_i)
wherein l_i denotes the modality corresponding to h_i, l_i ∈ {ST, MT}, p_D denotes the probability of selecting the correct modality for the coding layer output h, and θ_D denotes the parameters of the discriminator;
the semantic coding layer of the ST model is then trained to produce outputs that fool the discriminator, by minimizing the cross-entropy loss with flipped modality labels:
L_adv(θ_s) = -Σ_i log p_D(MT | h_s,i)
4. The training method for obtaining a better speech translation model via generative adversarial training according to claim 3, wherein in step 4, the objective function L(θ_a, θ_s, θ_d, θ_D) adopted by the joint training is:
L(θ_a, θ_s, θ_d, θ_D) = α·L_CTC(θ_a) + β·L_adv(θ_s, θ_D) + γ·L_ST(θ_a, θ_s, θ_d)
wherein L_CTC denotes the CTC loss, L_adv denotes the coding layer loss of the ST model, L_ST denotes the loss of the end-to-end ST model, θ_a denotes the parameters of the acoustic coding layer of the ST model, θ_s the parameters of the semantic coding layer of the ST model, θ_d the parameters of the decoding layer of the ST model, θ_D the parameters of the discriminator, and α, β, γ are hyper-parameters.
5. The training method for obtaining a better speech translation model via generative adversarial training according to claim 4, wherein L_CTC is defined as:
L_CTC = -log Pr(x* | h)
wherein
Pr(x | h) = Σ_{a ∈ B⁻¹(x)} Pr(a | h)
Pr(a | h) = Π_{t=1}^{T_x} Pr(a_t, t | h)
Pr(k, t | h) = q_t^k
wherein h denotes the coding layer output of length T_x; x* denotes the target transcription sequence; x denotes the transcription sequence corresponding to the speech; B denotes the operation of processing an alignment into a transcription; Pr(x | h) denotes the posterior probability of the transcription x; Pr(a | h) denotes the probability of a CTC alignment a; Pr(k, t | h) denotes the probability of generating the k-th label at time step t; q_t^k denotes the k-th element of q_t; V denotes the vocabulary; and '-' denotes the blank label.
6. The training method for obtaining a better speech translation model via generative adversarial training according to claim 4, wherein L_ST is defined as:
L_ST = -Σ_{t=1}^{T_y} log p_θ(y_t | y_{<t}, h_s)
wherein θ denotes the parameters of the ST model, p_θ denotes the probability of the decoding layer output after softmax, y = [y_1, …, y_{T_y}] denotes the translation sequence, T_y denotes the length of the translation sequence, and h_s denotes the semantic coding layer output of the ST model.
7. A training system for obtaining a better speech translation model via generative adversarial training, comprising:
a data collection module for collecting training data; and
a model training module for training an MT model with the transcription-translation data pairs in the training data; compressing the input length of the ST model with a shrinking mechanism so that the speech and text coding layer output lengths are approximately the same, comprising: first using a CTC loss to help the ST model predict the transcription of the speech and capture its acoustic information, then using the spike phenomenon of CTC to remove redundant information from the ST model coding layer states; using a discriminator to bring the coding layer output distribution of the ST model close to that of the MT model through a min-max game, helping the ST model capture more semantic information; and jointly training the whole speech translation model with the CTC loss as an additional loss combined with the loss of the end-to-end ST model.
8. A speech translation method, comprising:
acquiring target speech to be translated; and
translating the target speech with a speech translation model trained by the method of any one of claims 1 to 6.
9. A speech translation device, comprising:
a speech acquisition unit for acquiring target speech to be translated; and
a speech translation unit for translating the target speech with a speech translation model trained by the method of any one of claims 1 to 6.
CN202110780410.2A 2021-07-09 2021-07-09 Training method and system for obtaining a better speech translation model via generative adversarial training Active CN113505611B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110780410.2A CN113505611B (en) 2021-07-09 2021-07-09 Training method and system for obtaining a better speech translation model via generative adversarial training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110780410.2A CN113505611B (en) 2021-07-09 2021-07-09 Training method and system for obtaining a better speech translation model via generative adversarial training

Publications (2)

Publication Number Publication Date
CN113505611A true CN113505611A (en) 2021-10-15
CN113505611B CN113505611B (en) 2022-04-15

Family

ID=78012608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110780410.2A Active CN113505611B (en) 2021-07-09 2021-07-09 Training method and system for obtaining a better speech translation model via generative adversarial training

Country Status (1)

Country Link
CN (1) CN113505611B (en)

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1812998A (en) * 2003-04-30 2006-08-02 Ares Trading S.A. Secreted protein family
US20120143591A1 (en) * 2010-12-01 2012-06-07 Microsoft Corporation Integrative and discriminative technique for spoken utterance translation
US20140365202A1 (en) * 2013-06-11 2014-12-11 Facebook, Inc. Translation and integration of presentation materials in cross-lingual lecture support
US20160147740A1 (en) * 2014-11-24 2016-05-26 Microsoft Technology Licensing, Llc Adapting machine translation data using damaging channel model
CN108304390A (en) * 2017-12-15 2018-07-20 Tencent Technology (Shenzhen) Co., Ltd. Training method and translation method based on a translation model, device and storage medium
GB201804073D0 (en) * 2018-03-14 2018-04-25 Papercup Tech Limited A speech processing system and a method of processing a speech signal
CN108766414A (en) * 2018-06-29 2018-11-06 Beijing Baidu Netcom Science and Technology Co., Ltd. Method, apparatus, device and computer-readable storage medium for speech translation
CN108630199A (en) * 2018-06-30 2018-10-09 Information Engineering University of PLA Strategic Support Force Data processing method for an acoustic model
CN108846384A (en) * 2018-07-09 2018-11-20 Beijing University of Posts and Telecommunications Multi-task collaborative recognition method and system fusing video perception
CN110110337A (en) * 2019-05-08 2019-08-09 NetEase Youdao Information Technology (Beijing) Co., Ltd. Translation model training method, medium, apparatus and computing device
CN110598224A (en) * 2019-09-23 2019-12-20 Tencent Technology (Shenzhen) Co., Ltd. Translation model training method, text processing method, apparatus and storage medium
CN111343650A (en) * 2020-02-14 2020-06-26 Shandong University Urban-scale wireless traffic prediction method based on cross-domain data and adversarial loss
CN111783477A (en) * 2020-05-13 2020-10-16 Xiamen Kuaishangtong Technology Co., Ltd. Speech translation method and system
CN111858931A (en) * 2020-07-08 2020-10-30 Central China Normal University Text generation method based on deep learning
CN112183120A (en) * 2020-09-18 2021-01-05 Beijing ByteDance Network Technology Co., Ltd. Speech translation method, apparatus, device and storage medium
CN112686058A (en) * 2020-12-24 2021-04-20 Information Engineering University of PLA Strategic Support Force BERT embedded speech translation model training method and system, and speech translation method and equipment
CN112784881A (en) * 2021-01-06 2021-05-11 Beijing Southwest Jiaotong University Shengyang Technology Co., Ltd. Network abnormal traffic detection method, model and system
CN112800782A (en) * 2021-01-29 2021-05-14 Institute of Automation, Chinese Academy of Sciences Speech translation method, system and device fusing text semantic features

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YUN KYUNG LEE et al.: "Multimodal Unsupervised Speech Translation for Recognizing and Evaluating Second Language Speech", Applied Sciences *
ZHENGKUN TIAN et al.: "Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition", https://arxiv.org/pdf/2005.07903.pdf *
HE Wenlong et al.: "Research on End-to-End Speech Translation Based on Adversarial Training", Journal of Signal Processing *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114299937A (en) * 2021-12-31 2022-04-08 Megatronix (Beijing) Technology Co., Ltd. DNN model training method and speech recognition method and device
CN114299937B (en) * 2021-12-31 2022-07-01 Megatronix (Beijing) Technology Co., Ltd. DNN model training method and speech recognition method and device
CN117252213A (en) * 2023-07-06 2023-12-19 Tianjin University End-to-end speech translation method using synthesized speech as supervision information

Also Published As

Publication number Publication date
CN113505611B (en) 2022-04-15

Similar Documents

Publication Publication Date Title
Shor et al. Personalizing ASR for dysarthric and accented speech with limited data
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
Zhao et al. Hearing lips: Improving lip reading by distilling speech recognizers
Chen et al. End-to-end neural network based automated speech scoring
CN110827801B (en) Automatic voice recognition method and system based on artificial intelligence
CN112017644A (en) Sound transformation system, method and application
CN113470662A (en) Generating and using text-to-speech data for keyword spotting systems and speaker adaptation in speech recognition systems
Kheddar et al. Deep transfer learning for automatic speech recognition: Towards better generalization
CN113505611B (en) Training method and system for obtaining better speech translation model in generation of confrontation
CN115019776A (en) Voice recognition model, training method thereof, voice recognition method and device
Denisov et al. End-to-end multi-speaker speech recognition using speaker embeddings and transfer learning
Boito et al. Empirical evaluation of sequence-to-sequence models for word discovery in low-resource settings
Chen et al. SpeechFormer++: A hierarchical efficient framework for paralinguistic speech processing
Ubale et al. Exploring end-to-end attention-based neural networks for native language identification
CN114627162A (en) Multimodal dense video description method based on video context information fusion
Sefara et al. HMM-based speech synthesis system incorporated with language identification for low-resourced languages
EP4235485A1 (en) Method for converting text data into acoustic feature, electronic device, and storage medium
Basak et al. Challenges and Limitations in Speech Recognition Technology: A Critical Review of Speech Signal Processing Algorithms, Tools and Systems.
CN112735404A Irony detection method, system, terminal device and storage medium
CN113393841A (en) Training method, device and equipment of speech recognition model and storage medium
CN114333762B Expressiveness-based speech synthesis method and system, electronic device and storage medium
CN115376547A (en) Pronunciation evaluation method and device, computer equipment and storage medium
Amari et al. Arabic speech recognition based on a CNN-BLSTM combination
CN114595700A Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information
Shao et al. Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant