CN113505611A - Training method and system for obtaining a better speech translation model via generative adversarial training - Google Patents

Training method and system for obtaining a better speech translation model via generative adversarial training

Info

Publication number
CN113505611A
Authority
CN
China
Prior art keywords
model
training
coding layer
loss
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110780410.2A
Other languages
Chinese (zh)
Other versions
CN113505611B (en)
Inventor
屈丹
张昊
杨绪魁
闫红刚
张文林
郝朝龙
魏雪娟
李真
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Engineering University of PLA Strategic Support Force
Zhengzhou Xinda Institute of Advanced Technology
Original Assignee
Information Engineering University of PLA Strategic Support Force
Zhengzhou Xinda Institute of Advanced Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force and Zhengzhou Xinda Institute of Advanced Technology
Priority to CN202110780410.2A priority Critical patent/CN113505611B/en
Publication of CN113505611A publication Critical patent/CN113505611A/en
Application granted granted Critical
Publication of CN113505611B publication Critical patent/CN113505611B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Abstract

The present invention provides a training method and system for obtaining a better speech translation model via generative adversarial training. The method comprises: collecting training data, and training an MT model with the transcription-translation data pairs in the training data; compressing the input length of the ST model with a shrinking mechanism so that the speech and text coding layer output lengths are approximately the same, which comprises first using a CTC loss to help the ST model predict the transcription of the speech and capture its acoustic information, and then using the spike phenomenon of CTC to remove redundant information from the ST model coding layer states; using an adversarial min-max game to fit the coding layer output distribution of the ST model to that of the MT model, helping the ST model capture more semantic information; and jointly training the whole speech translation model with the CTC loss as an additional loss combined with the loss of the end-to-end ST model. The invention can improve the recognition performance of the speech translation model, thereby improving speech translation efficiency and quality.

Description

Training method and system for obtaining a better speech translation model via generative adversarial training
Technical Field
The invention relates to the technical field of speech translation, and in particular to a training method and system for obtaining a better speech translation model via generative adversarial training, as well as a speech translation method and device.
Background
Speech translation refers to inputting speech in one language and outputting text in another language. A traditional speech translation system adopts a cascade: speech is first transcribed by an ASR (Automatic Speech Recognition) system, and the transcription is then fed into an MT (Machine Translation) system for translation. Such a system can exploit more data to train the ASR and MT components separately and thus obtain an ST (Speech Translation) system of higher translation quality, so cascaded ST systems have been widely applied for many years.
An end-to-end system skips the intermediate transcription stage and translates the input speech directly. Because it avoids error propagation between the two systems, has low inference latency and has lower memory and computing requirements, it has attracted increasing attention from researchers in recent years and has gradually become the mainstream paradigm. For an end-to-end ST system with an encoding-decoding architecture, however, one recognized problem is the overburdened coding layer: it must not only extract the acoustic information of the raw audio but also map it into the semantic space, i.e., it must not only listen but also understand. Insufficient data and the absence of internal supervision signals prevent the ST coding layer from being trained effectively, which hinders further improvement of the whole model.
The currently common solution is to pre-train the coding layer of the ST model with an ASR model. In the ASR model, however, the alignment between speech and transcription is monotonic and captures local, surface-level information, so pre-training in this way helps the ST model capture acoustic rather than semantic information. In fact, a very common observation is that the modality gap causes an MT model trained on the same data pairs to outperform the ST model, and many researchers have considered using the MT model to enhance the performance of the ST model. The most direct method is knowledge distillation, but it only provides guidance at the last layer of the model and hardly influences the earlier parts of the network.
Disclosure of Invention
Aiming at the problem that the coding layer of the ST model cannot be trained effectively due to insufficient data and the absence of internal supervision signals, the invention provides a training method and system for obtaining a better speech translation model via generative adversarial training.
In a first aspect, the present invention provides a training method for obtaining a better speech translation model via generative adversarial training, comprising:
step 1: collecting training data, and training an MT model with the transcription-translation data pairs in the training data;
step 2: compressing the input length of the ST model with a shrinking mechanism so that the speech and text coding layer output lengths are approximately the same, comprising: first using a CTC loss to help the ST model predict the transcription of the speech and capture its acoustic information; then using the spike phenomenon of CTC to remove redundant information from the ST model coding layer states;
step 3: using an adversarial min-max method to fit the coding layer output distribution of the ST model to the coding layer output distribution of the MT model, thereby helping the ST model capture more semantic information;
step 4: jointly training the whole speech translation model with the CTC loss as an additional loss combined with the loss of the end-to-end ST model.
Further, in step 2, the shrunk ST model coding layer state h' is described as:
h' = { h_i | Idx(argmax(Softmax(Wᵀh_i))) ≠ '-' and Idx(argmax(Softmax(Wᵀh_i))) ≠ Idx(argmax(Softmax(Wᵀh_{i-1}))) }
wherein Idx denotes the corresponding token indexed in the vocabulary, h denotes the original ST model coding layer state, W denotes the linear transformation in the CTC module, ᵀ denotes transposition, and h_i denotes the i-th row vector of h.
Further, step 3 comprises:
training a discriminator for classifying speech and text encodings; the input of the discriminator is a coding layer output h, and the output of the discriminator is a binary classification prediction p_D(l | h) of the modality corresponding to h, where l ∈ {0, 1}: 0 means h comes from speech and 1 means h comes from text; for the MT model, the coding layer output h is the coding layer output h_MT of the MT model; for the ST model, the coding layer output h is the semantic coding layer output h_s of the ST model;
the process of training the discriminator comprises: the discriminator is first trained to predict the modality correctly by minimizing the cross-entropy loss:
L_adv(θ_D) = -Σ_i log p_D(l_i | h_i)
wherein l_i denotes the modality corresponding to h_i, l_i ∈ {ST, MT}, p_D denotes the probability of selecting the correct modality for the coding layer output h, and θ_D denotes the parameters of the discriminator;
the semantic coding layer of the ST model is then trained to produce outputs that fool the discriminator, by minimizing the cross-entropy loss with flipped modality labels:
L_adv(θ_s) = -Σ_i log p_D(MT | h_s,i)
further, in step 4, the objective function adopted by the joint training is adopted
Figure BDA0003156565950000032
As shown in the following formula:
Figure BDA0003156565950000033
wherein the content of the first and second substances,
Figure BDA0003156565950000034
the loss of the CTC is indicated,
Figure BDA0003156565950000035
representing the coding layer loss of the ST model,
Figure BDA0003156565950000036
represents the loss of the end-to-end ST model, θaParameters, θ, representing the acoustic coding layer of the ST modelsParameters, θ, representing the semantic coding layer of the ST modeldParameter, θ, representing the decoded layer of the ST modelDRepresenting the parameters of the countermeasure, alpha, beta, gamma are hyper-parameters.
Further, L_CTC is defined as:
L_CTC = -log Pr(x* | h)
wherein
Pr(x | h) = Σ_{a ∈ B⁻¹(x)} Pr(a | h)
Pr(a | h) = Π_{t=1}^{T_x} Pr(a_t, t | h)
Pr(k, t | h) = q_t^k
wherein h denotes the coding layer output of length T_x; x* denotes the target transcription sequence; x denotes the transcription sequence corresponding to the speech; B denotes the operation of processing an alignment into a transcription; Pr(x | h) denotes the posterior probability of the transcription x; Pr(a | h) denotes the probability of a CTC alignment a; Pr(k, t | h) denotes the probability of generating the k-th label at time step t; q_t^k denotes the k-th element of q_t; V denotes the vocabulary; and '-' denotes the blank label.
Further, L_ST is defined as:
L_ST = -Σ_{t=1}^{T_y} log p_θ(y_t | y_{<t}, h_s)
wherein θ denotes the parameters of the ST model, p_θ denotes the probability of the decoding layer output after softmax, y = [y_1, …, y_{T_y}] denotes the translation sequence, T_y denotes the length of the translation sequence, and h_s denotes the semantic coding layer output of the ST model.
In a second aspect, the present invention also provides a training system for obtaining a better speech translation model via generative adversarial training, comprising:
a data collection module for collecting training data; and
a model training module for training an MT model with the transcription-translation data pairs in the training data; compressing the input length of the ST model with a shrinking mechanism so that the speech and text coding layer output lengths are approximately the same, comprising: first using a CTC loss to help the ST model predict the transcription of the speech and capture its acoustic information, then using the spike phenomenon of CTC to remove redundant information from the ST model coding layer states; using a discriminator to bring the coding layer output distribution of the ST model close to that of the MT model through a min-max game, helping the ST model capture more semantic information; and jointly training the whole speech translation model with the CTC loss as an additional loss combined with the loss of the end-to-end ST model.
In a third aspect, the present invention further provides a speech translation method, comprising:
acquiring target speech to be translated; and
translating the target speech with a speech translation model obtained by training with the above method.
In a fourth aspect, the present invention further provides a speech translation device, comprising:
a speech acquisition unit for acquiring target speech to be translated; and
a speech translation unit for translating the target speech with a speech translation model obtained by training with the above method.
The invention has the following beneficial effects:
The invention uses an adversarial approach to bring the coding layer output of the ST model close to that of the MT model, reducing the representation difference between the two modalities and improving the performance of the ST model. Specifically, the coding layer output of the trained MT model is regarded as the "real" representation, and the coding layer output of the ST model as the "fake" representation. The encoder of the ST model must generate a more realistic representation that confuses the discriminator so that it cannot tell whether its input comes from speech or text, while the discriminator must judge whether the input is "real" or "fake". In this generative adversarial process, the output distribution of the ST model coding layer gradually approaches that of the MT model coding layer and learns more semantic information. In addition, the invention adopts a shrinking mechanism to solve the problem of inconsistent lengths of speech and text: the posterior probabilities of the spike labels output by the CTC module are used to filter out redundant coding layer states.
Drawings
FIG. 1 is a schematic flow chart of a training method for obtaining a better speech translation model via generative adversarial training according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a training framework for obtaining a better speech translation model via generative adversarial training according to an embodiment of the present invention;
FIG. 3 shows the training process of the model with and without the shrinking mechanism on the En-De data set according to an embodiment of the present invention;
FIG. 4 shows the performance of the baseline and the model of the invention on En-De at different data volumes according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in FIG. 1, an embodiment of the present invention provides a training method for obtaining a better speech translation model via generative adversarial training, comprising the following steps:
Step 1: collecting training data, and training an MT model with the transcription-translation data pairs in the training data;
Step 2: compressing the input length of the ST model with a shrinking mechanism so that the speech and text coding layer output lengths are approximately the same, comprising: first using a CTC loss to help the ST model predict the transcription of the speech and capture its acoustic information; then using the spike phenomenon of CTC to remove redundant information from the ST model coding layer states;
Step 3: using a discriminator to bring the coding layer output distribution of the ST model close to that of the MT model through a min-max game, helping the ST model capture more semantic information;
Specifically, after step 2 the shrunk ST model coding layer states contain acoustic information but still lack the semantic information needed for translation. To obtain a better semantic representation, the coding layer output of the MT model is used as an internal supervision signal to guide the learning of the ST model coding layer, so that the output distribution of the ST coding layer approaches that of the MT coding layer and the distance between the two modal representations is reduced.
Step 4: jointly training the whole speech translation model with the CTC loss as an additional loss combined with the loss of the end-to-end ST model.
The embodiment of the invention uses an adversarial approach to bring the coding layer output of the ST model close to that of the MT model, reducing the representation difference between the two modalities and improving the performance of the ST model. Specifically, the coding layer output of the trained MT model is regarded as the "real" representation, and the coding layer output of the ST model as the "fake" representation. The encoder of the ST model must generate a more realistic representation that confuses the discriminator so that it cannot tell whether its input comes from speech or text, while the discriminator must judge whether the input is "real" or "fake". In this generative adversarial process, the output distribution of the ST model coding layer gradually approaches that of the MT model coding layer and learns more semantic information. In addition, there is a length mismatch between text and speech: even after down-sampling, the length of the speech (in frames) is usually more than 20 times the length of the text (in tokens), and it was found that the model cannot be trained if this mismatch is left unaddressed. The embodiment of the invention therefore adopts a shrinking mechanism, filtering out redundant coding layer states with the posterior probabilities of the spike labels output by the CTC module.
The embodiment of the invention relies on two key techniques: (1) compressing the length of the speech by exploiting the spike phenomenon of CTC; (2) introducing a discriminator into the model structure, so that the speech translation model learns from the machine translation model. This assists the training of the speech translation model, improves its performance and reduces the requirement on data volume, thereby yielding a better speech translation model.
Example 2
On the basis of the above embodiments, a training method for obtaining a better speech translation model in generating countermeasures according to the embodiments of the present invention is described in more detail with reference to the training framework shown in fig. 2.
The training framework of the invention is a general, network-agnostic structure: it applies equally to convolutional networks, recurrent neural networks and Transformer structures. The embodiment of the present invention adopts the Transformer structure as the main structure, as shown in FIG. 2.
The training framework of the invention mainly comprises five parts: (1) an acoustic coding layer that encodes the acoustic features into coding layer states corresponding to the source text; (2) a CTC module that predicts the transcription of the speech and helps the acoustic coding layer capture acoustic information; (3) a shrinking mechanism that filters out redundant coding layer states using the posterior probabilities predicted by CTC, solving the problem of inconsistent lengths of the speech and text representations; (4) a discriminator that brings the representation of the speech close to the representation of the text, reducing the difference between the two modal representations; and (5) a decoder that predicts text in the target language from the speech representation.
Regarding FIG. 2, note that the acoustic coding layer, the shrinking mechanism and the semantic coding layer are labeled in the figure. For simplicity of presentation, the residual connections and layer normalization in the Transformer structure, and the translated-text input of the decoding part, are not shown in FIG. 2.
The speech translation task refers to producing text output in a target language given input speech, and training minimizes the cross entropy between the predicted text and the reference text. The speech translation corpus is a set of triplets, denoted D_ST = {(s, x, y)}, where s = [s_1, …, s_{T_s}] denotes the sequence of acoustic features (e.g., MFCC), x = [x_1, …, x_{T_x}] is the corresponding transcription sequence, and y = [y_1, …, y_{T_y}] is the corresponding translation sequence. T_s, T_x and T_y denote the lengths of the acoustic feature sequence, the transcription and the translation, respectively.
(1) Acoustic coding layer: this layer processes the acoustic features and encodes them into vectors corresponding to the speech transcription. Before inputting the acoustic features s into the network, the embodiment of the present invention preprocesses them with a two-layer CNN, which preliminarily reduces the length (number of frames) of the acoustic features and transforms the feature dimension to the model dimension. The transformed features are input into N_a stacked Transformer sub-blocks, each consisting of multi-head self-attention, a forward neural network, residual connections and layer normalization. The process can be expressed as (residual connections and layer normalization omitted for simplicity):
s' = CNN(s)
h = FFN(Self_Attention(s'))
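The following is a minimal PyTorch sketch of this acoustic front-end, assuming the experimental settings given later (two 256-channel CNN layers with kernel 3 and stride 2, 80-dimensional input features, N_a = 6 Transformer sub-blocks with 4 heads); all class and parameter names are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, n_layers=6, n_heads=4, d_ff=2048):
        super().__init__()
        # Two stride-2 convolutions reduce the frame count roughly 4x.
        self.subsample = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        freq = (feat_dim + 3) // 4               # feature dim after two stride-2 convs
        self.proj = nn.Linear(d_model * freq, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, d_ff,
                                           dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, feats):                    # feats: (B, T, feat_dim)
        x = self.subsample(feats.unsqueeze(1))   # (B, d_model, ~T/4, ~feat_dim/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        return self.encoder(self.proj(x))        # h: (B, ~T/4, d_model)
```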
(2) CTC module: in the embodiment of the invention, a CTC loss helps the model predict the transcription of the speech and capture acoustic information. The CTC module contains only one linear projection layer. Given the coding layer output h of length T_x, the projected vectors are normalized with Softmax to obtain q, giving the probability of generating the k-th label at time step t:
Pr(k, t | h) = q_t^k, where q_t = Softmax(Wᵀh_t)
wherein q_t^k denotes the k-th element of q_t, V denotes the vocabulary, and '-' denotes the blank label. A CTC alignment a is an index list of length T_x containing blanks and labels. The probability of a, Pr(a | h), is the product of the emission probabilities at each time step:
Pr(a | h) = Π_{t=1}^{T_x} q_t^{a_t}
The alignment a of CTC is a many-to-one mapping to a transcription, since blank and consecutively repeated labels are allowed. For example, the alignments (a, -, b, c, -, -) and (-, -, a, -, b, c) both correspond to the transcription (a, b, c). B denotes the operation of processing an alignment into a transcription: B first removes the repeated labels in the alignment and then removes the blank labels. The posterior probability of a transcription x is the sum of the probabilities of all alignments corresponding to x:
Pr(x | h) = Σ_{a ∈ B⁻¹(x)} Pr(a | h)
The intuition behind CTC is that, since it is not known where in the speech a transcription label appears without an alignment between speech and text, all possible positions are summed over. Given a target transcription x*, the objective function of CTC is:
L_CTC = -log Pr(x* | h)
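As a hedged sketch, this objective maps directly onto torch.nn.functional.ctc_loss, which performs the sum over all alignments via dynamic programming; the blank index of 0 and the tensor layout are assumptions of this example.

```python
import torch
import torch.nn.functional as F

def ctc_objective(h, ctc_proj, x_star, h_lens, x_lens, blank=0):
    """h: (B, T', d) encoder states; ctc_proj: nn.Linear(d, vocab_size),
    i.e. W in the text; x_star: (B, L) padded target transcriptions."""
    log_probs = F.log_softmax(ctc_proj(h), dim=-1)   # q_t = Softmax(W^T h_t)
    # F.ctc_loss expects (T', B, V) and sums Pr(a|h) over all alignments a.
    return F.ctc_loss(log_probs.transpose(0, 1), x_star, h_lens, x_lens,
                      blank=blank, zero_infinity=True)
```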
(3) Shrinking mechanism: CTC exhibits a spike phenomenon: apart from the blank labels, the remaining predicted labels form spikes, as shown in part (b) of FIG. 2. This phenomenon can reduce the number of recognition units and speed up decoding. The units corresponding to the spikes are regarded as the desired coding layer states, containing the prior information of the original speech. The remaining blank and repeated labels contain no useful information; they only increase the length and impede model training. The coding layer states h' corresponding to the CTC spikes can thus be extracted, as shown in FIG. 2: the label whose index has the maximum softmax probability is first taken as the predicted label, and then the coding layer states corresponding to non-blank labels are extracted. For repeated labels, only the coding layer state corresponding to the first label is extracted and the others are discarded. In the embodiment of the present invention, the shrunk ST model coding layer state is described as:
h' = { h_i | Idx(argmax(Softmax(Wᵀh_i))) ≠ '-' and Idx(argmax(Softmax(Wᵀh_i))) ≠ Idx(argmax(Softmax(Wᵀh_{i-1}))) }
wherein Idx denotes the corresponding token indexed in the vocabulary, h denotes the original ST model coding layer state, h' denotes the shrunk ST model coding layer state, W denotes the linear transformation in the CTC module, ᵀ denotes transposition, and h_i denotes the i-th row vector of h.
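A minimal sketch of this shrinking rule for a single utterance, assuming blank index 0: greedy CTC labels are computed from the projected states, blank positions are dropped, and only the first state of each run of repeated labels is kept. The function name is illustrative.

```python
import torch

def shrink(h, ctc_proj, blank=0):
    """h: (T', d) encoder states of one utterance; returns the shrunk state h'."""
    pred = ctc_proj(h).argmax(dim=-1)            # greedy CTC labels, shape (T',)
    keep = pred != blank                         # drop blank positions
    first_of_run = torch.ones_like(keep)         # drop repeats: keep first of each run
    first_of_run[1:] = pred[1:] != pred[:-1]
    return h[keep & first_of_run]                # states at the CTC spikes
```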
(4) Semantic coding layer and decoder: the shrunk ST model coding layer state contains acoustic information but still lacks the semantic information required for translation. To obtain a better semantic representation, the embodiment of the invention uses N_s stacked Transformer sub-blocks to encode the shrunk coding layer state:
h_s = Semantic_Encoder(h')
A basic stack of N_d Transformer sub-blocks serves as the decoder: it first encodes the embedding of the target text with a self-attention layer, then integrates the encoded output with the semantic coding layer output through an interactive attention layer. The process is described as:
h_y = Embedding(y)
h_d = FFN(Cross_Attention(Self_Attention(h_y), h_s))
Finally, the cross-entropy loss of the end-to-end ST model is defined as:
L_ST = -Σ_{t=1}^{T_y} log p_θ(y_t | y_{<t}, h_s)
wherein θ denotes the parameters of the ST model and p_θ denotes the probability of the decoding layer output h_d after softmax.
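A sketch of the decoder step and the ST loss above: PyTorch's built-in TransformerDecoderLayer already chains self-attention, the interactive (cross) attention layer and the forward network. The padding index and the class and function names are assumptions of this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_layers=6, n_heads=4, d_ff=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, d_ff,
                                           dropout=0.1, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, y, h_s):                   # y: (B, Ty), h_s: shrunk+encoded states
        mask = nn.Transformer.generate_square_subsequent_mask(y.size(1)).to(y.device)
        h_d = self.decoder(self.embed(y), h_s, tgt_mask=mask)
        return self.out(h_d)                     # logits defining p_theta

def st_loss(logits, y_ref, pad_id=0):
    # Token-level cross entropy of the end-to-end ST model.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           y_ref.reshape(-1), ignore_index=pad_id)
```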
(5) Adversarial training: to enable the model to obtain a better semantic representation, the embodiment of the invention uses the coding layer output of the MT model as an internal supervision signal to guide the learning of the ST model coding layer, so that the output distribution of the ST coding layer approaches that of the MT coding layer and the distance between the two modal representations is reduced.
To bring the distributions of the two modalities close, the embodiment of the present invention trains a discriminator to classify speech and text encodings. The input of the discriminator is a coding layer output h, and the output of the discriminator is a binary classification prediction p_D(l | h) of the modality corresponding to h, where l ∈ {0, 1}: 0 means h comes from speech and 1 means h comes from text. For the MT model, the coding layer output h is the coding layer output h_MT of the MT model; for the ST model, the coding layer output h is the semantic coding layer output h_s of the ST model.
The process of training the discriminator comprises: the discriminator is first trained to predict the modality correctly by minimizing the cross-entropy loss:
L_adv(θ_D) = -Σ_i log p_D(l_i | h_i)
wherein l_i denotes the modality corresponding to h_i, l_i ∈ {ST, MT}, p_D denotes the probability of selecting the correct modality for the coding layer output h, and θ_D denotes the parameters of the discriminator.
The semantic coding layer of the ST model is then trained to produce outputs that fool the discriminator, by minimizing the cross-entropy loss with flipped modality labels:
L_adv(θ_s) = -Σ_i log p_D(MT | h_s,i)
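A hedged sketch of the two sides of this min-max game, treating the discriminator D as a module that maps each representation to a single logit with the convention above (0 = speech, 1 = text); the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, h_s, h_mt):
    """Train D to label speech states 0 and text states 1 (the caller detaches
    h_s so that only D is updated here)."""
    logits = torch.cat([D(h_s), D(h_mt)])                       # (N, 1) scores
    labels = torch.cat([torch.zeros(h_s.size(0), 1, device=h_s.device),
                        torch.ones(h_mt.size(0), 1, device=h_mt.device)])
    return F.binary_cross_entropy_with_logits(logits, labels)

def generator_loss(D, h_s):
    # Fool D: push p_D(text | h_s) toward 1 for speech representations.
    target = torch.ones(h_s.size(0), 1, device=h_s.device)
    return F.binary_cross_entropy_with_logits(D(h_s), target)
```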
it is noted that in the embodiments of the present invention, the purpose of the reactor is to guide the ST model with the output of the MT model, rather than to make the model learn the output of multiple modalities. So the MT model is fixed and its parameters are not updated as "real output" throughout the training process. In addition, the role of the antagonists is to approximate the distribution of the semantic space of the two modalities, so there is no need for a one-to-one correspondence of the full two modality inputs, which can help model training with additional text with low resource ST data.
By incorporating antagonists into the speech translation model, embodiments of the present invention can ensure that coding layer representations from different modalities become indistinguishable, i.e., arrive at the representation of the ST model guided by the representation of the MT model as described earlier, bringing the representations of the two close together.
Finally, in the model of the invention, the objective function L(θ_a, θ_s, θ_d, θ_D) adopted by the joint training is:
L(θ_a, θ_s, θ_d, θ_D) = α·L_CTC(θ_a) + β·L_adv(θ_s, θ_D) + γ·L_ST(θ_a, θ_s, θ_d)
wherein L_CTC denotes the CTC loss, L_adv denotes the coding layer loss of the ST model, L_ST denotes the loss of the end-to-end ST model, θ_a denotes the parameters of the acoustic coding layer of the ST model, θ_s the parameters of the semantic coding layer of the ST model, θ_d the parameters of the decoding layer of the ST model, θ_D the parameters of the discriminator, and α, β, γ are hyper-parameters.
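A sketch of one joint training step under this objective, reusing discriminator_loss and generator_loss from the sketch above: the discriminator is updated on detached encoder states first, then the ST model is updated with the weighted sum of the three losses. The st_model interface and the values of α, β, γ here are assumptions of this example, and the MT encoder stays frozen as the "real" representation.

```python
import torch

def train_step(batch, st_model, mt_encoder, D, opt_st, opt_d,
               alpha=0.5, beta=0.5, gamma=1.0):
    # Assumed interface: the ST model returns its semantic encoder states
    # together with the CTC and end-to-end translation losses.
    h_s, l_ctc, l_st = st_model(batch.speech, batch.transcript, batch.translation)
    with torch.no_grad():                        # MT model is fixed ("real output")
        h_mt = mt_encoder(batch.transcript)

    # (1) Discriminator step on detached speech states.
    opt_d.zero_grad()
    discriminator_loss(D, h_s.detach(), h_mt).backward()
    opt_d.step()

    # (2) ST model step: CTC + adversarial + end-to-end translation loss.
    opt_st.zero_grad()
    loss = alpha * l_ctc + beta * generator_loss(D, h_s) + gamma * l_st
    loss.backward()
    opt_st.step()
    return loss.item()
```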
Example 3
Correspondingly, an embodiment of the invention also provides a training system for obtaining a better speech translation model via generative adversarial training, comprising:
a data collection module for collecting training data; and
a model training module for training an MT model with the transcription-translation data pairs in the training data; compressing the input length of the ST model with a shrinking mechanism so that the speech and text coding layer output lengths are approximately the same, comprising: first using a CTC loss to help the ST model predict the transcription of the speech and capture its acoustic information, then using the spike phenomenon of CTC to remove redundant information from the ST model coding layer states; using a discriminator to bring the coding layer output distribution of the ST model close to that of the MT model through a min-max game, helping the ST model capture more semantic information; and jointly training the whole speech translation model with the CTC loss as an additional loss combined with the loss of the end-to-end ST model.
It should be noted that the training system provided in the embodiments of the present invention implements the training method of the method embodiments; for its functions, reference may be made to the method embodiments, which are not repeated here.
Example 4
An embodiment of the invention also provides a speech translation method, comprising:
acquiring target speech to be translated; and
translating the target speech with a speech translation model obtained by training with the method of the above method embodiments.
Example 5
Correspondingly, an embodiment of the present invention further provides a speech translation device, comprising:
a speech acquisition unit for acquiring target speech to be translated; and
a speech translation unit for translating the target speech with a speech translation model obtained by training with the method of the above method embodiments.
To verify the effectiveness of the training method of the speech translation model provided by the invention, the following experiments were carried out.
(1) Data sets
Experiments were performed on two public data sets: the Augmented LibriSpeech English-French corpus and the MuST-C English-German TED corpus.
Augmented LibriSpeech English-French (En-Fr): this corpus is obtained by aligning French e-books with the English utterances in LibriSpeech, and the data volume is further increased by providing French translations through Google Translate. The entire corpus contains 236 hours of speech. The experiments were trained on the 100-hour (47271 utterances) clean data set; the model selected on the development set (2 hours, 1071 utterances) was used for testing on the test set (4 hours, 2048 utterances).
MuST-C English-German TED (En-De for short): the MuST-C corpus is based on English TED talks and includes English speech with the corresponding transcriptions and translations in 8 different directions. Experiments were performed in the English-German direction. The corpus contains 234K translation pairs totaling 408 hours, divided into a training set (400 hours, 229703 utterances), a validation set (3 hours, 1423 utterances) and a test set (5 hours, 2641 utterances).
Additional data: for further analysis and comparison, some experiments required additional ASR data; the 960-hour LibriSpeech data was used.
(2) Preprocessing and evaluation
For speech, the experiments used Kaldi to extract 80-dimensional MFCC features with a step size of 10 ms and a window length of 25 ms, normalized in mean and variance. The data volume was increased with speed perturbations of 0.9, 1.0 and 1.1. Text data in the source language was lowercased, tokenized and stripped of all punctuation, consistent with the ASR data. Text data in the target language was lowercased and tokenized, with normalized punctuation. Text processing was implemented with Moses. BPE was applied to the combined source and target text to obtain shared sub-word units, with the vocabulary size set to 8K.
For comparison with existing work, the multi-bleu.pl script was used to compute case-insensitive BLEU values as the model evaluation metric.
(3) Model set-up
The models involved in the experiments all adopt the Transformer structure. For the MT and ASR models, the number of encoding and decoding layers is 6, the number of attention heads is 4, the embedding dimension is 256, the filter size of the forward neural network is d_ff = 2048, attention dropout is 0 and the remaining dropouts are 0.1. For the ST model, the number of acoustic coding, semantic coding and decoding layers is 6, with 4 attention heads, an embedding dimension of 256, attention dropout of 0 and remaining dropouts of 0.1. For speech preprocessing, a 2-D CNN is used for down-sampling; each CNN layer has 256 output channels, a convolution kernel of 3 and a stride of 2.
For the discriminator, a three-layer forward neural network is used, with a hidden dimension of 1024, Leaky-ReLU as the activation function and dropout of 0.1. To ensure stable training of the discriminator, the parameters of each layer are processed with spectral normalization.
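A sketch of this discriminator configuration, assuming the 256-dimensional encoder states above as input; torch.nn.utils.spectral_norm wraps each linear layer as described, and the builder name is illustrative.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def make_discriminator(d_in=256, d_hidden=1024, p_drop=0.1):
    # Three spectrally normalized linear layers with Leaky-ReLU, as configured above.
    return nn.Sequential(
        spectral_norm(nn.Linear(d_in, d_hidden)), nn.LeakyReLU(), nn.Dropout(p_drop),
        spectral_norm(nn.Linear(d_hidden, d_hidden)), nn.LeakyReLU(), nn.Dropout(p_drop),
        spectral_norm(nn.Linear(d_hidden, 1)),   # single logit: 0 = speech, 1 = text
    )
```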
(4) Training arrangement
The models were trained on 4 NVIDIA V100 GPUs with the Adam optimizer. For the ASR, MT and ST models, the learning-rate decay schedule follows the original Transformer paper, with a scaling factor k of 5.0 and warmup steps of 25000 for ASR and ST and 8000 for MT. For the discriminator, the learning rate is 1e-4, with β1 = 0.9, β2 = 0.999 and ε = 1e-8.
The 5 best-performing checkpoints on the validation set were saved and averaged to obtain the final test model. Early stopping was adopted with accuracy (ACC) as the criterion and a patience of 3. The beam size at decoding was set to 10. Experiments were implemented on the basis of ESPnet, with the code organized using pytorch-lightning.
TABLE 1 Experimental results on Augmented LibriSpeech English-French
(Table 1 appears as an image in the original publication and is not reproduced here.)
In Table 1, the corresponding model differs from the usual initialization method in that it employs joint ASR and MT training and then initializes the entire ST model with the MT model.
TABLE 2 Experimental results on MuST-C English-German TED
(Table 2 appears as an image in the original publication and is not reproduced here.)
Transformer + adversarial used joint ASR and MT training, after which the entire ST model was initialized with the MT model. Meta-Learning initializes the ST model using the MAML method.
(5) Analysis of results
Results on Augmented LibriSpeech English-French: experiments were performed under two settings. In E2E base, the model is trained using only the ST corpus. In E2E expanded, the acoustic coding layer is pre-trained with LibriSpeech. The trained speech translation model of the invention is compared with existing end-to-end models, as shown in Table 1.
In both the base and expanded experiments, the model of the invention outperforms all previous end-to-end models. Compared with Ashkan Alinejad's adversarially trained model (Transformer + adversarial), the model of the invention improves BLEU by 0.93. Their approach gives the MT coding layer multi-modal representation ability by jointly training ASR and MT, and initializes the ST model with the entire MT model, ensuring continuity between coding-layer and decoding-layer initialization. Although this gives the ST model a better initialization, it still does not address the absence of internal supervision signals, whereas the present method directly provides supervision signals to help model training and therefore performs better. The invention also achieves better results than knowledge distillation, since knowledge distillation guides the ST model with the MT model only at the last layer; although it provides a supervision signal, the signal hardly propagates into the earlier network. Compared with TCEN and Curriculum pre-training, the model of the invention is simple and flexible and avoids a complex training procedure. LUT and STAST also adopt different strategies to provide guidance to the coding layer and relieve its burden so as to obtain more semantic information; the model of the invention performs better than LUT and STAST.
Results on MuST-C English-German TED: Table 2 compares the performance of various models on the MuST-C English-German TED data set. Compared with Augmented LibriSpeech English-French, the performance gap between the ST model and the MT model (which can be regarded as the upper bound of the ST model) is larger. This is because the speech in this corpus is recorded live, contains more noise, and the transcriptions and translations are not well aligned with the original audio. The comparison of experimental results shows that the model of the invention achieves better results than all previous models without any initialization, exhibiting the same trend as on the Augmented LibriSpeech English-French data set.
(6) Ablation test
TABLE 3 Ablation tests on the two data sets
(Table 3 appears as an image in the original publication and is not reproduced here.)
Impact of the additional losses: to evaluate the contribution of each part of the model of the invention, ablation tests of the additional tasks were performed on both data sets. The results in Table 3 show that each part of the model has a positive effect. If the adversarial loss is removed, the performance of the model degrades, demonstrating the ability of the discriminator to reduce the difference between the representations of the two modalities, so that the model learns more semantic information and performs better. If the semantic coding layer is removed, the model performance drops sharply, which means the semantic coding layer is necessary and gives the model more capacity to learn semantic information. Without the shrinking mechanism the model behaves very poorly, since the coding layer then contains too many meaningless and redundant states, making the lengths of the text and speech representations differ too much; the discriminator cannot learn in this situation.
Influence of the hyper-parameters: experiments were performed on the MuST-C En-De data by varying the weights of the different loss terms in the loss function. To reduce the burden of parameter tuning, α was set to 1 - β.
In general, the most appropriate discriminator weight varies with the CTC weight, but after convergence the model performance does not change much.
TABLE 4 Ablation testing on the En-De test set
(Table 4 appears as an image in the original publication and is not reproduced here.)
Impact of the shrinking mechanism on training: FIG. 3 illustrates the training process of the model with and without the shrinking mechanism on the En-De data set, i.e., the discriminator loss L_adv(θ_D) and the generator loss L_adv(θ_s). It can be seen that, without the shrinking mechanism, the length difference between the two modalities is too large, the generator loss rises rapidly, the goal of pushing the speech representation toward the text representation by confusing the discriminator cannot be achieved, and the model cannot converge.
Effect of the shrinking mechanism at test time: the shrinking mechanism effectively compresses the coding layer states, but the compression inevitably loses some information at test time, so the shrinking mechanism was removed during testing. The results in Table 5 show that removing the shrinking mechanism at test time brings a gain of around 0.2-0.5 BLEU.
TABLE 5 Performance of the model of the invention on the En-De test set with and without the shrinking mechanism
(Table 5 appears as an image in the original publication and is not reproduced here.)
Probe tasks: the learned representations were further analyzed by performing probe tasks on the Fluent Speech Commands data set. SpeakerVer is a task designed to recognize the speaker and benefits from more acoustic information; IntentIde focuses on intent recognition and requires more linguistic information. The data set contains 30043 utterances, 97 speakers and 31 intents. The utterances were randomly divided into training, test and validation sets so that every speaker and intent appears in each set. During training, the coding layer states of the trained model were extracted and frozen, a fully connected layer was added, and fine-tuning was performed separately on the two probe tasks; the accuracy on the test set is reported. The comparison shows that, in the ST model, acoustic information is mainly learned in the bottom layers while semantic information is captured with emphasis in the higher layers.
TABLE 6 Classification accuracy on the speaker recognition and intent recognition tasks
(Table 6 appears as an image in the original publication and is not reproduced here.)
AE denotes the acoustic coding layer output and SE denotes the semantic coding layer output.
Impact of additional text: the behavior of the model in low-resource situations was studied. For different data-resource scenarios, 50, 100, 200 and 300 hours of speech training data (speech-transcription-translation triplets) were randomly selected from the En-De data set. Two models, the basic end-to-end baseline and the model of the invention, were then trained on data of the different scales, with the model of the invention additionally able to use extra transcription data.
FIG. 4 shows the performance of the baseline and the model of the invention on En-De at the different data volumes. As the amount of data increases, the performance of both models keeps improving, and the model of the invention stays consistently above the baseline. With only 50 hours of data, the model of the invention achieves performance comparable to the 300-hour baseline. This is because the model of the invention provides a supervision signal to the ST coding layer and, by using the discriminator to push the representation of speech toward the representation of text, helps the ST model capture more semantic information, reduces the learning difficulty and reduces the model's requirement on data volume.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A training method for obtaining a better speech translation model via generative adversarial training, comprising:
step 1: collecting training data, and training an MT model with the transcription-translation data pairs in the training data;
step 2: compressing the input length of the ST model with a shrinking mechanism so that the speech and text coding layer output lengths are approximately the same, comprising: first using a CTC loss to help the ST model predict the transcription of the speech and capture its acoustic information; then using the spike phenomenon of CTC to remove redundant information from the ST model coding layer states;
step 3: using an adversarial min-max method to fit the coding layer output distribution of the ST model to the coding layer output distribution of the MT model, thereby helping the ST model capture more semantic information;
step 4: jointly training the whole speech translation model with the CTC loss as an additional loss combined with the loss of the end-to-end ST model.
2. The training method for obtaining a better speech translation model via generative adversarial training according to claim 1, wherein in step 2, the shrunk ST model coding layer state h' is described as:
h' = { h_i | Idx(argmax(Softmax(Wᵀh_i))) ≠ '-' and Idx(argmax(Softmax(Wᵀh_i))) ≠ Idx(argmax(Softmax(Wᵀh_{i-1}))) }
wherein Idx denotes the corresponding token indexed in the vocabulary, h denotes the original ST model coding layer state, W denotes the linear transformation in the CTC module, ᵀ denotes transposition, and h_i denotes the i-th row vector of h.
3. The training method for obtaining a better speech translation model via generative adversarial training according to claim 1, wherein step 3 comprises:
training a discriminator for classifying speech and text encodings; the input of the discriminator is a coding layer output h, and the output of the discriminator is a binary classification prediction p_D(l | h) of the modality corresponding to h, where l ∈ {0, 1}: 0 means h comes from speech and 1 means h comes from text; for the MT model, the coding layer output h is the coding layer output h_MT of the MT model; for the ST model, the coding layer output h is the semantic coding layer output h_s of the ST model;
the process of training the discriminator comprises: the discriminator is first trained to predict the modality correctly by minimizing the cross-entropy loss:
L_adv(θ_D) = -Σ_i log p_D(l_i | h_i)
wherein l_i denotes the modality corresponding to h_i, l_i ∈ {ST, MT}, p_D denotes the probability of selecting the correct modality for the coding layer output h, and θ_D denotes the parameters of the discriminator;
the semantic coding layer of the ST model is then trained to produce outputs that fool the discriminator, by minimizing the cross-entropy loss with flipped modality labels:
L_adv(θ_s) = -Σ_i log p_D(MT | h_s,i)
4. The training method for obtaining a better speech translation model via generative adversarial training according to claim 3, wherein in step 4, the objective function L(θ_a, θ_s, θ_d, θ_D) adopted by the joint training is:
L(θ_a, θ_s, θ_d, θ_D) = α·L_CTC(θ_a) + β·L_adv(θ_s, θ_D) + γ·L_ST(θ_a, θ_s, θ_d)
wherein L_CTC denotes the CTC loss, L_adv denotes the coding layer loss of the ST model, L_ST denotes the loss of the end-to-end ST model, θ_a denotes the parameters of the acoustic coding layer of the ST model, θ_s the parameters of the semantic coding layer of the ST model, θ_d the parameters of the decoding layer of the ST model, θ_D the parameters of the discriminator, and α, β, γ are hyper-parameters.
5. The training method for obtaining a better speech translation model via generative adversarial training according to claim 4, wherein L_CTC is defined as:
L_CTC = -log Pr(x* | h)
wherein
Pr(x | h) = Σ_{a ∈ B⁻¹(x)} Pr(a | h)
Pr(a | h) = Π_{t=1}^{T_x} Pr(a_t, t | h)
Pr(k, t | h) = q_t^k
wherein h denotes the coding layer output of length T_x; x* denotes the target transcription sequence; x denotes the transcription sequence corresponding to the speech; B denotes the operation of processing an alignment into a transcription; Pr(x | h) denotes the posterior probability of the transcription x; Pr(a | h) denotes the probability of a CTC alignment a; Pr(k, t | h) denotes the probability of generating the k-th label at time step t; q_t^k denotes the k-th element of q_t; V denotes the vocabulary; and '-' denotes the blank label.
6. The training method for obtaining a better speech translation model via generative adversarial training according to claim 4, wherein L_ST is defined as:
L_ST = -Σ_{t=1}^{T_y} log p_θ(y_t | y_{<t}, h_s)
wherein θ denotes the parameters of the ST model, p_θ denotes the probability of the decoding layer output after softmax, y = [y_1, …, y_{T_y}] denotes the translation sequence, T_y denotes the length of the translation sequence, and h_s denotes the semantic coding layer output of the ST model.
7. A training system for obtaining a better speech translation model via generative adversarial training, comprising:
a data collection module for collecting training data; and
a model training module for training an MT model with the transcription-translation data pairs in the training data; compressing the input length of the ST model with a shrinking mechanism so that the speech and text coding layer output lengths are approximately the same, comprising: first using a CTC loss to help the ST model predict the transcription of the speech and capture its acoustic information, then using the spike phenomenon of CTC to remove redundant information from the ST model coding layer states; using a discriminator to bring the coding layer output distribution of the ST model close to that of the MT model through a min-max game, helping the ST model capture more semantic information; and jointly training the whole speech translation model with the CTC loss as an additional loss combined with the loss of the end-to-end ST model.
8. A speech translation method, comprising:
acquiring target speech to be translated; and
translating the target speech with a speech translation model trained by the method of any one of claims 1 to 6.
9. A speech translation device, comprising:
a speech acquisition unit for acquiring target speech to be translated; and
a speech translation unit for translating the target speech with a speech translation model trained by the method of any one of claims 1 to 6.
CN202110780410.2A 2021-07-09 2021-07-09 Training method and system for obtaining a better speech translation model via generative adversarial training Active CN113505611B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110780410.2A CN113505611B (en) 2021-07-09 2021-07-09 Training method and system for obtaining a better speech translation model via generative adversarial training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110780410.2A CN113505611B (en) 2021-07-09 2021-07-09 Training method and system for obtaining a better speech translation model via generative adversarial training

Publications (2)

Publication Number Publication Date
CN113505611A true CN113505611A (en) 2021-10-15
CN113505611B CN113505611B (en) 2022-04-15

Family

ID=78012608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110780410.2A Active CN113505611B (en) 2021-07-09 2021-07-09 Training method and system for obtaining a better speech translation model via generative adversarial training

Country Status (1)

Country Link
CN (1) CN113505611B (en)

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1812998A (en) * 2003-04-30 2006-08-02 Ares Trading S.A. Secreted protein family
US20120143591A1 (en) * 2010-12-01 2012-06-07 Microsoft Corporation Integrative and discriminative technique for spoken utterance translation
US20140365202A1 (en) * 2013-06-11 2014-12-11 Facebook, Inc. Translation and integration of presentation materials in cross-lingual lecture support
US20160147740A1 (en) * 2014-11-24 2016-05-26 Microsoft Technology Licensing, Llc Adapting machine translation data using damaging channel model
CN108304390A (en) * 2017-12-15 2018-07-20 Tencent Technology (Shenzhen) Co., Ltd. Training method and translation method based on a translation model, device and storage medium
GB201804073D0 (en) * 2018-03-14 2018-04-25 Papercup Tech Limited A speech processing system and a method of processing a speech signal
CN108766414A (en) * 2018-06-29 2018-11-06 Beijing Baidu Netcom Science and Technology Co., Ltd. Method, apparatus, device and computer-readable storage medium for speech translation
CN108630199A (en) * 2018-06-30 2018-10-09 Information Engineering University of PLA Strategic Support Force Data processing method for an acoustic model
CN108846384A (en) * 2018-07-09 2018-11-20 Beijing University of Posts and Telecommunications Multi-task collaborative recognition method and system fusing video perception
CN110110337A (en) * 2019-05-08 2019-08-09 NetEase Youdao Information Technology (Beijing) Co., Ltd. Translation model training method, medium, apparatus and computing device
CN110598224A (en) * 2019-09-23 2019-12-20 Tencent Technology (Shenzhen) Co., Ltd. Translation model training method, text processing method, apparatus and storage medium
CN111343650A (en) * 2020-02-14 2020-06-26 Shandong University Urban-scale wireless traffic prediction method based on cross-domain data and adversarial loss
CN111783477A (en) * 2020-05-13 2020-10-16 Xiamen Kuaishangtong Technology Co., Ltd. Speech translation method and system
CN111858931A (en) * 2020-07-08 2020-10-30 Central China Normal University Text generation method based on deep learning
CN112183120A (en) * 2020-09-18 2021-01-05 Beijing ByteDance Network Technology Co., Ltd. Speech translation method, apparatus, device and storage medium
CN112686058A (en) * 2020-12-24 2021-04-20 Information Engineering University of PLA Strategic Support Force BERT embedded speech translation model training method and system, and speech translation method and equipment
CN112784881A (en) * 2021-01-06 2021-05-11 Beijing Southwest Jiaotong University Shengyang Technology Co., Ltd. Network abnormal traffic detection method, model and system
CN112800782A (en) * 2021-01-29 2021-05-14 Institute of Automation, Chinese Academy of Sciences Speech translation method, system and device fusing text semantic features

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YUN KYUNG LEE et al.: "Multimodal Unsupervised Speech Translation for Recognizing and Evaluating Second Language Speech", Applied Sciences *
ZHENGKUN TIAN et al.: "Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition", https://arxiv.org/pdf/2005.07903.pdf *
HE Wenlong et al.: "Research on End-to-End Speech Translation Based on Adversarial Training", Journal of Signal Processing *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114299937A (en) * 2021-12-31 2022-04-08 Megatronix (Beijing) Technology Co., Ltd. DNN model training method and speech recognition method and device
CN114299937B (en) * 2021-12-31 2022-07-01 Megatronix (Beijing) Technology Co., Ltd. DNN model training method and speech recognition method and device
CN117252213A (en) * 2023-07-06 2023-12-19 Tianjin University End-to-end speech translation method using synthesized speech as supervision information

Also Published As

Publication number Publication date
CN113505611B (en) 2022-04-15

Similar Documents

Publication Publication Date Title
Shor et al. Personalizing ASR for dysarthric and accented speech with limited data
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
Zhao et al. Hearing lips: Improving lip reading by distilling speech recognizers
Chen et al. End-to-end neural network based automated speech scoring
CN110827801B (en) Automatic voice recognition method and system based on artificial intelligence
CN112017644A (en) Sound transformation system, method and application
CN113470662A (en) Generating and using text-to-speech data for keyword spotting systems and speaker adaptation in speech recognition systems
Kheddar et al. Deep transfer learning for automatic speech recognition: Towards better generalization
CN113505611B (en) Training method and system for obtaining better speech translation model in generation of confrontation
CN115019776A (en) Voice recognition model, training method thereof, voice recognition method and device
Denisov et al. End-to-end multi-speaker speech recognition using speaker embeddings and transfer learning
Boito et al. Empirical evaluation of sequence-to-sequence models for word discovery in low-resource settings
Chen et al. SpeechFormer++: A hierarchical efficient framework for paralinguistic speech processing
Ubale et al. Exploring end-to-end attention-based neural networks for native language identification
CN114627162A (en) Multimodal dense video description method based on video context information fusion
Sefara et al. HMM-based speech synthesis system incorporated with language identification for low-resourced languages
EP4235485A1 (en) Method for converting text data into acoustic feature, electronic device, and storage medium
Basak et al. Challenges and Limitations in Speech Recognition Technology: A Critical Review of Speech Signal Processing Algorithms, Tools and Systems.
CN112735404A Irony detection method, system, terminal device and storage medium
CN113393841A (en) Training method, device and equipment of speech recognition model and storage medium
CN114333762B Expressiveness-based speech synthesis method and system, electronic device and storage medium
CN115376547A (en) Pronunciation evaluation method and device, computer equipment and storage medium
Amari et al. Arabic speech recognition based on a CNN-BLSTM combination
CN114595700A Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information
Shao et al. Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant