CN112926344B - Word vector replacement data enhancement-based machine translation model training method and device, electronic equipment and storage medium - Google Patents

Word vector replacement data enhancement-based machine translation model training method and device, electronic equipment and storage medium

Info

Publication number
CN112926344B
CN112926344B (application CN202110271844.XA)
Authority
CN
China
Prior art keywords
word
corpus
language model
model
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110271844.XA
Other languages
Chinese (zh)
Other versions
CN112926344A (en)
Inventor
杨雅婷
陈玺
董瑞
马博
王磊
周喜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinjiang Technical Institute of Physics and Chemistry of CAS
Original Assignee
Xinjiang Technical Institute of Physics and Chemistry of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinjiang Technical Institute of Physics and Chemistry of CAS filed Critical Xinjiang Technical Institute of Physics and Chemistry of CAS
Priority to CN202110271844.XA priority Critical patent/CN112926344B/en
Publication of CN112926344A publication Critical patent/CN112926344A/en
Application granted granted Critical
Publication of CN112926344B publication Critical patent/CN112926344B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06F 40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/216: Parsing using statistical methods
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06N 3/04: Neural networks; Architecture, e.g. interconnection topology
    • G06N 3/08: Neural networks; Learning methods
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a machine translation model training method and device based on word vector replacement data enhancement, together with electronic equipment and a storage medium. The specific implementation scheme is as follows: acquire a training sample dataset and preprocess it; for the existing source-language or target-language corpus, train a forward language model and a reverse language model based on the Transformer structure; obtain, through the forward and reverse language models, the probability distribution over the whole vocabulary of the word at any position in a sentence; determine a final word vector from the probability distribution and the word vectors of the whole vocabulary, and replace the word at that position with the final word vector; train a neural machine translation model with the replaced bilingual parallel corpus to obtain the translation result. Monolingual data can also be integrated into the method to obtain a better translation effect. Experimental results show that the method can remarkably improve the translation quality of the machine translation model.

Description

Word vector replacement data enhancement-based machine translation model training method and device, electronic equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a machine translation model training method and device based on word vector replacement data enhancement, electronic equipment and a storage medium.
Background
In recent years, with the development of artificial intelligence, and in particular the growing maturity of deep learning technology, artificial intelligence has been widely applied across industries and has greatly improved production efficiency. In the field of machine translation, a subfield of natural language processing, neural machine translation based on neural networks has also achieved strong results. Machine translation uses computers to convert between pairs of languages. With the development of deep learning, neural machine translation (NMT) has made great progress, and its network structures have evolved from recurrent neural networks to convolutional neural networks and then to networks based entirely on self-attention mechanisms. Among these architectures, the highly parallelizable Transformer based on self-attention has achieved very good results. Neural machine translation has gradually replaced statistical machine translation and has become the mainstream approach.
Current neural machine translation models achieve good translation results for resource-rich language pairs with large-scale parallel corpora, such as English-French and English-Chinese, but their performance is unsatisfactory on machine translation tasks for low-resource languages. As a data-driven translation method, neural machine translation depends heavily on the quality and scale of parallel data. In machine translation tasks for resource-scarce languages, the corresponding neural machine translation systems perform poorly because large-scale high-quality parallel corpora and effective analysis tools are lacking. How to build systems and improve translation performance under low-resource conditions has therefore become a central problem in the Uyghur-Chinese machine translation task.
To make a limited dataset provide more varied content, and following the basic idea of image data augmentation, data enhancement can be achieved by modifying some of the words in parallel sentence pairs. Zhang X et al. perform data enhancement by using existing synonym forests to find and replace selected content in the text. Fadaee et al. first proposed using a language model to replace high-frequency words in a text sequence with low-frequency words while replacing the corresponding translation. Although this approach can effectively improve machine translation, it focuses only on part of the vocabulary (the low-frequency words). Substitution among high-frequency words can also improve machine translation, and there are many words that could be used for substitution, but the above method cannot generate all possible substitutions.
Since word vectors were proposed, they have described relationships between words by mapping words to dense continuous vectors. Semantically similar words obtain similar vector representations, so words with similar semantics can be captured through word vectors. However, semantically similar words do not occur with the same probability in real datasets. The invention therefore provides a machine translation model training method, device, electronic equipment and storage medium based on word vector replacement data enhancement.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a machine translation model training method and device based on word vector replacement data enhancement, electronic equipment and a storage medium, so as to improve the translation effect of a machine translation model. The method acquires a training sample dataset and preprocesses it; trains a forward language model and a reverse language model based on the Transformer structure for the existing source-language or target-language corpus; obtains, through the forward and reverse language models, the probability distribution over the whole vocabulary of the word at any position in a sentence; determines a final word vector from the probability distribution and the word vectors of the whole vocabulary, and replaces the word at that position with the final word vector; and trains a neural machine translation model with the replaced bilingual parallel corpus to obtain a translation result. Monolingual data can also be integrated into the method to obtain a better translation effect. Experimental results show that the method can remarkably improve the translation quality of the machine translation model.
The invention discloses a machine translation model training method based on word vector replacement data enhancement, which comprises the following steps:
a. for the existing parallel corpus, respectively training a forward language model and a reverse language model based on the Transformer structure using the source language, wherein, in these models, given the word vector matrix E of all words, the word vector of the word w_t generated by the forward language model is expressed as:

PF(w)E = Σ_{j=1}^{|V|} f_j(w_j)·E_j

wherein f_j(w_j) is the probability of each word in the vocabulary, E_j is the j-th row of E, and PF(w) is the one-dimensional vector representation of the word w_t under the forward language model; the word vector of w_t generated by the reverse language model is calculated by:

PB(w)E = Σ_{j=1}^{|V|} b_j(w_j)·E_j

wherein b_j(w_j) is the probability of each word in the vocabulary and PB(w) is the one-dimensional vector representation of the word w_t under the reverse language model;
b. obtaining, through the forward and reverse language models, the probability distribution of the word at any position in a sentence over the whole vocabulary, i.e. replacing the one-hot encoding of the word with the probability distribution to represent all possible substitutions at that position;
c. determining a final word vector from the probability distribution and the word vectors of the whole vocabulary, wherein the final word vector of the word at any position is expressed as:

e_w = avg(PF(w)E + PB(w)E)

and replacing the word at that position with the final word vector;
d. training a neural machine translation model by using the replaced bilingual parallel corpus;
e. performing back-translation on the monolingual corpus to obtain a pseudo parallel corpus, adding the pseudo parallel corpus to the training data, and repeating steps a-d to obtain the final translation result.
The parallel corpus required in step a is used as the training sample dataset, and the parallel corpus is preprocessed as follows:
filtering noise symbols from the corpus; segmenting the corpus with a segmentation tool; lowercasing the corpus and converting full-width characters to half-width where necessary; filtering out parallel sentence pairs whose length ratio is too large or too small; processing the corpus with byte pair encoding; converting words in the corpus into one-hot encoded representations; dividing the corpus into different training batches; and truncating longer sentences and padding shorter sentences with zeros so that all sentences in the same batch have the same length.
In step e, the monolingual corpus is used in three ways: adding more domain-related monolingual data on top of the training corpus to train a better-performing language model; adding back-translated data; and adding tagged back-translated data, in which a tag is added to the source-side translated data to distinguish different kinds of source-side data.
The invention also provides a machine translation model training device based on word vector replacement data enhancement, which comprises: a sample dataset corpus preprocessing module, a forward language model module, a reverse language model module, a word embedding module, a probability distribution determination module, a final word vector determination module, a model training module and a monolingual corpus integration module, wherein:
the sample dataset corpus preprocessing module: used for preprocessing a bilingual parallel corpus dataset or a monolingual corpus;
forward language model module: used for obtaining the forward language model, which predicts the next word from left to right given the preceding context;
reverse language model module: used for obtaining the reverse language model, which predicts the preceding word from right to left given the following context;
word embedding module: used for converting words into word embeddings that serve as the input to the machine translation model;
probability distribution determination module: used for obtaining, through the forward and reverse language models, the probability distribution of the word at any position in a sentence over the whole vocabulary;
final word vector determination module: used for determining the final word vector from the probability distribution and the word vectors of the whole vocabulary, and replacing the word at that position with the final word vector;
model training module: used for iteratively training a neural machine translation model with the replaced bilingual parallel corpus in combination with the monolingual data;
monolingual corpus integration module: used for performing back-translation with the monolingual dataset and adding the resulting pseudo parallel corpus to the training corpus.
The model training module comprises:
an encoder module to encode the source language into semantic features of a particular dimension;
a decoder module for decoding the semantic features into a target language.
The corpus preprocessing module of the sample data set comprises:
noise symbol filtering unit: used for filtering noise symbols from the corpus;
corpus segmentation unit: used for segmenting the corpus;
conversion unit: used for lowercasing the corpus and converting full-width characters to half-width;
length filtering unit: used for filtering out parallel sentence pairs whose length ratio is too large or too small;
encoding unit: used for preprocessing the corpus with byte pair encoding;
numerical conversion unit: used for converting words in the corpus into one-hot encoded representations;
corpus dividing unit: used for dividing the corpus into different training batches;
length adjustment unit: used for truncating longer sentences and padding shorter sentences with zeros so that all sentences in the same batch have the same length.
The invention also provides an electronic device, comprising: at least one multi-core processor; at least one GPU computing card; and a memory communicatively connected to the at least one multi-core processor, wherein the memory stores instructions executable by the at least one multi-core processor or the at least one GPU computing card, so that the at least one multi-core processor or the at least one GPU computing card can carry out the steps of any of the above machine translation model training methods.
The present invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of any one of the machine translation model training methods.
The word vector replacement data enhancement-based machine translation model training method and device, electronic equipment and storage medium have the beneficial effects that:
The invention provides a machine translation model training method and device based on word vector replacement data enhancement, electronic equipment and a storage medium. First, the probability distribution over the whole vocabulary of any position in a sentence is obtained through a forward language model and a reverse language model; then a final word vector is determined from the probability distribution and the word vectors of the whole vocabulary, and the word at that position is replaced by the final word vector. Finally, monolingual data is used to further improve the method, and experiments on several language pairs show that the translation effect of the machine translation model can be effectively improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
FIG. 1 is a workflow diagram of the present invention;
FIG. 2 is a block diagram of corpus preprocessing of a sample dataset of the device of the present invention;
FIG. 3 is a schematic diagram of the apparatus of the present invention;
FIG. 4 is a block diagram of the device of the present invention for monolingual corpus fusion;
FIG. 5 is a frame diagram of the training method of the present invention;
Fig. 6 is a structural diagram of the electronic device of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, exemplary embodiments will be described in detail below, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
The invention discloses a machine translation model training method based on word vector replacement data enhancement, which comprises the following steps:
a. for the existing parallel corpus, respectively training a forward language model and a reverse language model based on the Transformer structure using the source language, wherein, in these models, given the word vector matrix E of all words, the word vector of the word w_t generated by the forward language model is expressed as:

PF(w)E = Σ_{j=1}^{|V|} f_j(w_j)·E_j

wherein f_j(w_j) is the probability of each word in the vocabulary, E_j is the j-th row of E, and PF(w) is the one-dimensional vector representation of the word w_t under the forward language model; the word vector of w_t generated by the reverse language model is calculated by:

PB(w)E = Σ_{j=1}^{|V|} b_j(w_j)·E_j

wherein b_j(w_j) is the probability of each word in the vocabulary and PB(w) is the one-dimensional vector representation of the word w_t under the reverse language model;
The parallel corpus is preprocessed as follows:
filtering noise symbols from the corpus;
segmenting the corpus with a segmentation tool;
lowercasing the corpus and converting full-width characters to half-width where necessary;
filtering out parallel sentence pairs whose length ratio is too large or too small;
processing the corpus with byte pair encoding;
converting words in the corpus into one-hot encoded representations;
dividing the corpus into different training batches;
and truncating longer sentences and padding shorter sentences with zeros so that all sentences in the same batch have the same length.
b. obtaining, through the forward and reverse language models, the probability distribution of the word at any position in a sentence over the whole vocabulary, i.e. replacing the one-hot encoding of the word with the probability distribution to represent all possible substitutions at that position;
c. determining a final word vector from the probability distribution and the word vectors of the whole vocabulary, wherein the final word vector of the word at any position is expressed as:

e_w = avg(PF(w)E + PB(w)E)

and replacing the word at that position with the final word vector;
d. training a neural machine translation model by using the replaced bilingual parallel corpus;
e. performing back-translation on the monolingual corpus to obtain a pseudo parallel corpus, adding the pseudo parallel corpus to the training data, and repeating steps a-d to obtain the final translation result; the monolingual corpus is used in three ways: adding more domain-related monolingual data on top of the training corpus to train a better-performing language model; adding back-translated data; and adding tagged back-translated data, in which a tag is added to the source-side translated data to distinguish different kinds of source-side data;
A machine translation model training apparatus based on word vector replacement data enhancement, the apparatus comprising: a sample dataset corpus preprocessing module, a forward language model module, a reverse language model module, a word embedding module, a probability distribution determination module, a final word vector determination module, a model training module and a monolingual corpus integration module, wherein:
the sample dataset corpus preprocessing module: used for preprocessing a bilingual parallel corpus dataset or a monolingual corpus;
forward language model module: used for obtaining the forward language model, which predicts the next word from left to right given the preceding context;
reverse language model module: used for obtaining the reverse language model, which predicts the preceding word from right to left given the following context;
word embedding module: used for converting words into word embeddings that serve as the input to the machine translation model;
probability distribution determination module: used for obtaining, through the forward and reverse language models, the probability distribution of the word at any position in a sentence over the whole vocabulary;
final word vector determination module: used for determining the final word vector from the probability distribution and the word vectors of the whole vocabulary, and replacing the word at that position with the final word vector;
model training module: used for iteratively training a neural machine translation model with the replaced bilingual parallel corpus in combination with the monolingual data;
monolingual corpus integration module: used for performing back-translation with the monolingual dataset and adding the resulting pseudo parallel corpus to the training corpus.
The model training module comprises:
an encoder module to encode the source language into semantic features of a particular dimension;
A decoder module for decoding the semantic features into a target language.
The corpus preprocessing module of the sample data set comprises:
noise symbol filtering unit: used for filtering noise symbols from the corpus;
corpus segmentation unit: used for segmenting the corpus;
conversion unit: used for lowercasing the corpus and converting full-width characters to half-width;
length filtering unit: used for filtering out parallel sentence pairs whose length ratio is too large or too small;
encoding unit: used for preprocessing the corpus with byte pair encoding;
numerical conversion unit: used for converting words in the corpus into one-hot encoded representations;
corpus dividing unit: used for dividing the corpus into different training batches;
length adjustment unit: used for truncating longer sentences and padding shorter sentences with zeros so that all sentences in the same batch have the same length;
an electronic device, comprising: at least one multi-core processor; at least one GPU computing card; and a memory communicatively connected to the at least one multi-core processor, wherein the memory stores instructions executable by the at least one multi-core processor or the at least one GPU computing card, so that the at least one multi-core processor or the at least one GPU computing card can carry out the steps of the method of any of claims 1 to 4;
a computer readable storage medium having stored thereon a computer program, wherein the computer instructions, when executed by a processor, implement the steps of the method of any of claims 1 to 4;
as shown in fig. 1, to improve the translation effect of the existing machine translation model, the method specifically includes the following steps:
s1, acquiring a training sample dataset, wherein the training sample dataset comprises a number of sentence-aligned bilingual parallel corpora and a number of monolingual source-language or monolingual target-language corpora, and each corpus also carries its corresponding language type;
s2, preprocessing the sample dataset, which specifically comprises: filtering noise symbols from the corpus; segmenting the corpus with a segmentation tool; lowercasing the corpus and converting full-width characters to half-width where necessary; filtering out parallel sentence pairs whose length ratio is too large or too small; processing the corpus with byte pair encoding (BPE); converting words in the corpus into one-hot encoding representations; dividing the corpus into different training batches; and truncating longer sentences and padding shorter sentences with zeros so that all sentences in the same batch have the same length;
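As a non-limiting illustration of the length-ratio filtering and batch padding described above, a Python sketch is given below; the maximum ratio of 1.5, the padding id 0 and the maximum length of 128 are assumptions chosen for the example rather than fixed parameters of the method:

    from typing import List, Tuple

    PAD_ID = 0  # assumed id reserved for padding

    def filter_by_length_ratio(pairs: List[Tuple[List[str], List[str]]],
                               max_ratio: float = 1.5) -> List[Tuple[List[str], List[str]]]:
        # Drop parallel sentence pairs whose source/target length ratio is too large or too small.
        kept = []
        for src, tgt in pairs:
            if not src or not tgt:
                continue
            ratio = len(src) / len(tgt)
            if 1.0 / max_ratio <= ratio <= max_ratio:
                kept.append((src, tgt))
        return kept

    def pad_batch(id_sequences: List[List[int]], max_len: int = 128) -> List[List[int]]:
        # Truncate longer sentences and pad shorter ones with PAD_ID so that every
        # sentence in the same batch has the same length.
        batch_len = min(max(len(s) for s in id_sequences), max_len)
        return [s[:batch_len] + [PAD_ID] * (batch_len - len(s[:batch_len])) for s in id_sequences]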
S3, for the existing source-language or target-language corpus, respectively training a forward language model and a reverse language model based on the Transformer structure;
In a neural machine translation system, each word is assigned a unique ID and is represented by a one-hot encoding; for example, the i-th word in the vocabulary can be represented as a |V|-dimensional word vector (0, …, 1, …, 0), where |V| is the size of the vocabulary and the i-th dimension of the vector is 1. During training, the one-hot encoding corresponding to each word w_t is multiplied by a word vector matrix E of size |V|×m to obtain the corresponding word vector representation, where m is the dimension of the vectors. At any position in a text there is a probability distribution, determined by its context, describing how likely each word is to appear at that position; replacing the one-hot encoding of the word with this probability distribution therefore represents all possible substitutions at that position. In order to make full use of the context information when generating the probability distribution of a position, the invention computes it with a forward language model and a reverse language model;
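The equivalence between a one-hot lookup in E and a probability-weighted mixture of word vectors can be illustrated with the following small NumPy sketch (the vocabulary size, dimension and probability values are arbitrary assumptions for the example):

    import numpy as np

    V, m = 5, 4                       # assumed vocabulary size |V| and vector dimension m
    E = np.random.rand(V, m)          # word vector matrix E of size |V| x m

    one_hot = np.zeros(V)
    one_hot[2] = 1.0                  # one-hot encoding of the word with id 2
    print(np.allclose(one_hot @ E, E[2]))    # multiplying by E is an ordinary embedding lookup: True

    p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])  # a probability distribution over the vocabulary
    soft_vector = p @ E               # replacing the one-hot code with the distribution yields a
                                      # probability-weighted mixture of all word vectors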
Given a source-language or target-language sentence S = (x_1, x_2, …, x_n) and the corresponding forward language model LM_forward and reverse language model LM_backward, according to the context information x_1, x_2, …, x_{t-1}, x_{t+1}, x_{t+2}, …, x_n, the word probability distribution at position t can be calculated by the following formulas:

f_j(w_j) = LM_forward(w_j | w_{<t})

b_j(w_j) = LM_backward(w_j | w_{>t})

wherein LM_forward(w_j | w_{<t}) represents the probability that the j-th word in the vocabulary appears after the word sequence x_1, x_2, …, x_{t-1}; likewise, LM_backward(w_j | w_{>t}) represents the probability that the j-th word in the vocabulary appears before x_{t+1}, x_{t+2}, …, x_n;
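A minimal sketch of obtaining these two distributions from two Transformer language models is shown below; the assumed interface (a model mapping a 1 x L tensor of token ids to 1 x L x |V| logits) is only an illustration, not a prescribed API, and boundary positions would additionally require BOS/EOS handling:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def position_distributions(lm_forward, lm_backward, token_ids, t):
        # token_ids: list of ids for x_1 ... x_n; t: 0-based position to be replaced.
        left = torch.tensor([token_ids[:t]])              # x_1 ... x_{t-1}
        right = torch.tensor([token_ids[t + 1:][::-1]])   # x_n ... x_{t+1}, fed right-to-left

        f = F.softmax(lm_forward(left)[0, -1], dim=-1)    # f_j(w_j) = LM_forward(w_j | w_<t)
        b = F.softmax(lm_backward(right)[0, -1], dim=-1)  # b_j(w_j) = LM_backward(w_j | w_>t)
        return f, b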
s4, obtaining probability distribution of words at any position in sentences on the whole word list through a forward language model and a reverse language model;
Taking the forward language model as an example, once the probability f_j(w_j) of each word in the vocabulary has been obtained, a one-dimensional vector representation of w_t can be formed:

PF(w_t) = (f_1(w_1), f_2(w_2), ..., f_|V|(w_|V|))

wherein f_j(w) ≥ 0 and the probabilities sum to 1, i.e. Σ_{j=1}^{|V|} f_j(w_j) = 1;
s5, determining a final word vector according to the probability distribution and word vectors of the whole word list, and replacing the words at the positions by using the final word vector;
In the machine translation model, given the word vector matrix E of all words, the word vector of the word w_t generated by the forward language model can be rewritten as:

PF(w)E = Σ_{j=1}^{|V|} f_j(w_j)·E_j

wherein f_j(w_j) is the probability of each word in the vocabulary, E_j is the j-th row of E, and PF(w) is the one-dimensional vector representation of the word w_t under the forward language model; the word vector of w_t generated by the reverse language model can be calculated by:

PB(w)E = Σ_{j=1}^{|V|} b_j(w_j)·E_j

wherein b_j(w_j) is the probability of each word in the vocabulary and PB(w) is the one-dimensional vector representation of the word w_t under the reverse language model;

finally, the obtained forward and reverse word vectors are averaged, and the final word vector of the word at any position is expressed as:

e_w = avg(PF(w)E + PB(w)E)
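Interpreting avg(·) as the element-wise mean of the two probability-weighted vectors (an assumption consistent with the description above), the final word vector can be sketched as:

    import torch

    def replacement_embedding(f, b, E):
        # f, b: |V|-dimensional distributions from the forward and reverse language models;
        # E: |V| x m word vector matrix; returns e_w = avg(PF(w)E + PB(w)E).
        return 0.5 * (f @ E + b @ E)

    # e.g. e_w = replacement_embedding(f, b, embedding_layer.weight)
    # (embedding_layer is assumed to be the source-side embedding of the NMT model)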
s6, training a neural machine translation model by utilizing the replaced bilingual parallel corpus and combining with the monolingual data to obtain a translation result;
s7, since a large amount of monolingual corpus exists for some languages, it plays a very important role when parallel data is lacking; therefore, on the basis of the above method, the invention effectively integrates monolingual data to further improve machine translation performance; the invention includes three ways of integrating monolingual data:
s71, training a better-performing language model: the method mentioned above uses only the original training dataset when training the language model; therefore, more domain-related monolingual data is added on top of the original corpus to train a better-performing language model;
s72, adding back-translated data: the back-translation method has proved quite effective on many language pairs, but the improvement from adding a limited amount of monolingual data is limited; therefore, the method described above is applied on top of back-translation to further improve the machine translation effect;
S73, adding tagged back-translated data: when back-translated data is added, part of the source-side data is machine-translated text. This part of the data may contain noise and may adversely affect the language model. Therefore, a tag is added to the source-side translated data to distinguish different kinds of source-side data, as shown in Table 1:
table 1 reverse translation data Tag (Tag) example:
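A simple sketch of building the tagged pseudo parallel corpus described above is given below; the translate() interface of the reverse model and the <BT> token are illustrative assumptions:

    def build_tagged_back_translation_corpus(target_monolingual, reverse_model, tag="<BT>"):
        # Back-translate target-side monolingual sentences and tag the synthetic source side
        # so the model can distinguish it from genuine parallel data.
        pseudo_parallel = []
        for tgt_sentence in target_monolingual:
            src_sentence = reverse_model.translate(tgt_sentence)  # target -> source back-translation
            pseudo_parallel.append((tag + " " + src_sentence, tgt_sentence))
        return pseudo_parallel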
In order to verify the effectiveness of the method, the invention carries out experiments on the IWSLT2014 German-English translation task (EN-DE) and the CWMT2017 Uyghur-Chinese (UY-CH) machine translation task. The German-English task contains 160k parallel sentence pairs and can be used to simulate a low-resource translation task; 5% of the sentences are randomly selected as the validation set, which contains about 7200 sentences; all the test sets IWSLT14.TED.dev2010, IWSLT14.TED.dev2012 and the 2010-2012 test sets are combined to test the performance of the model, and the combined test set contains 6750 sentences. For the Uyghur-Chinese machine translation task, the 2017 CWMT evaluation dataset is selected, and the test sets are taken from the 2015 and 2017 evaluation tasks to better verify the performance of the model; each test set has 4 different reference translations, and only one reference translation is selected for scoring during testing. All datasets are processed with the preprocessing tools provided by the Moses framework. For the monolingual dataset, part of the English and German corpus is selected from the TED corpus; for the Uyghur and Chinese monolingual corpora, Uyghur and Chinese news websites are crawled to obtain the corresponding monolingual data. The specific statistics are shown in Table 2:
Table 2 monolingual corpus statistics
For a clear comparison between experiments, the invention compares various existing data enhancement methods (Otto E et al.) with the data enhancement method of the invention on the German-English translation task, as follows:
base: a translation model trained based on the initial data, without using any data enhancement strategy;
swap: randomly exchanging adjacent words within a window size K;
dropout: randomly discarding words in the sentence;
blank: randomly replacing words in the sentence with placeholders (placeholder token);
LM_replace: data enhancement by replacing words, following the work of Fadaee et al.
In the German-English translation task (DE-EN), the Transformer is adopted as the training framework with its default parameter configuration: the encoder and decoder both have 6 layers; the hidden layer dimension is set to 512; the feed-forward layer dimension is 1024; the number of attention heads is 4; the dropout rate is set to 0.3; label smoothing is set to 0.1; training is not stopped until the model converges. During decoding, the beam search size is set to 5 and the length penalty to 1. The language model also uses the Transformer, with the parameter settings of transformer_base. In order to combine the language model with neural machine translation, the language model uses the same vocabulary as the machine translation model, and the parameters of the language model are not changed at all during machine translation training. All data enhancement models adopt the same experimental settings, and during training each word is replaced with a probability of 0.15;
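One possible way to apply the 0.15 replacement probability inside the NMT embedding layer is sketched below; the tensor shapes, the helper name and the assumption that the language-model distributions have been precomputed for the batch are all illustrative:

    import torch

    REPLACE_PROB = 0.15  # each non-padding token is replaced with this probability

    def soft_replace_embeddings(token_ids, embed, lm_probs_fwd, lm_probs_bwd, pad_id=0):
        # token_ids: batch x length ids; embed: nn.Embedding whose weight is the matrix E;
        # lm_probs_*: batch x length x |V| distributions from the forward/reverse language models.
        base = embed(token_ids)                                    # ordinary embedding lookup
        soft = 0.5 * (lm_probs_fwd + lm_probs_bwd) @ embed.weight  # e_w for every position
        mask = torch.rand_like(token_ids, dtype=torch.float) < REPLACE_PROB
        mask &= (token_ids != pad_id)                              # never replace padding
        return torch.where(mask.unsqueeze(-1), soft, base)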
TABLE 3 German-English translation results
The experimental results in Table 3 show that, compared with the baseline system, the proposed method effectively improves machine translation performance without using any monolingual data; on the German-English (DE-EN) translation task, translation performance improves by 1.20 BLEU. Comparing the different data enhancement methods shows that the swap-based approach reduces model performance, while all the other data enhancement methods help improve machine translation; among all the data enhancement methods, the proposed method obtains the best performance. The results show that the data enhancement method based on word vector replacement can provide the model with more diverse sentence information;
The invention also verifies the effectiveness of the method on the Uyghur-Chinese translation task. For the Uyghur-Chinese translation experiments, the basic Transformer is likewise selected as the model structure, with the following parameters: the encoder and decoder both have 6 layers; the feed-forward layer dimension is set to 2048 and the hidden layer dimension to 512; the number of attention heads is 8; Adam is used as the optimization algorithm with a learning rate of 0.001, the dropout rate is set to 0.3, and 4000 warm-up steps are used. In the decoding step, the last 5 checkpoints are averaged for decoding, the beam search size is set to 5, and the length penalty is set to 1. The language model also uses the Transformer and its vocabulary is identical to that of the neural machine translation model. The performance of the Uyghur-Chinese translation task on the different test sets is shown in Table 4:
TABLE 4 Uygur language-Chinese translation results
Analysis of the results in Table 4 shows that, in the Uyghur-Chinese machine translation task, the proposed method brings a clear improvement over the baseline system, with gains of 0.71 and 0.44 BLEU on the different test sets. This directly demonstrates that the data enhancement method based on the word vector replacement technique is clearly effective on the low-resource Uyghur-Chinese translation task;
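The checkpoint averaging used in the decoding setup above can be sketched as follows (a minimal illustration assuming the checkpoint files store plain state dictionaries with identical keys):

    import torch

    def average_checkpoints(paths):
        # Average the parameters of the last few checkpoints (e.g. the last 5) before decoding.
        states = [torch.load(p, map_location="cpu") for p in paths]
        averaged = {}
        for key in states[0]:
            averaged[key] = sum(state[key].float() for state in states) / len(states)
        return averaged

    # e.g. model.load_state_dict(average_checkpoints(last_five_checkpoint_paths))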
For the experiments with monolingual data, 5 groups of comparative experiments are carried out according to the methods described above: OurModel_big_LM trains the language model with the original training set plus all the monolingual data, while the dataset used to train the NMT model is unchanged; Back-translation directly adds the back-translated data to the original data without using any other data enhancement method, i.e. a reverse translation model is trained to decode the target-side monolingual data and the result is added directly to the original data, and the language model is then trained on the new training corpus; OurModel_Back-translation_tag adds a <BT> tag to the source side of the back-translated data when it is added, and trains a new language model on that basis. After adding the monolingual data, the experimental results for German-English and Uyghur-Chinese are shown in Table 5 and Table 6:
TABLE 5 German-English translation results after addition of the monolingual corpus
TABLE 6 Uygur language-Chinese translation results after adding a monolingual corpus
Tables 5 and 6 show that all methods of adding monolingual data have a positive impact compared with the baseline system. Adding untagged back-translated data together with the data enhancement method based on word vector replacement obtains the best translation result. Comparing experiment 2 with experiment 3 shows that using a better-performing language model has little influence on the machine translation model; this is because the vocabulary of the neural machine translation model is identical to that of the language model, and the probability distributions produced by the better language model do not change much. Compared with the baseline system, experiment 3 shows that adding back-translated data improves machine translation performance. Experiments 5 and 6, compared with experiment 4, show that combining the word vector replacement data enhancement method with back-translation further improves machine translation performance, because the word vector replacement-based data enhancement method can refine the back-translated data and give it more diverse expression. Compared with experiment 6, experiment 5 shows slightly reduced performance after the Tag label is added, which is particularly obvious in the Uyghur-Chinese translation; Uyghur is morphologically complex, and back-translated data may generate more noise when tag labels are used in the process of generating diversity;
This embodiment also provides a machine translation model training device based on word vector replacement data enhancement, as shown in Figs. 2, 3 and 4, to improve the translation effect of existing machine translation models; the device specifically comprises: a sample dataset corpus preprocessing module 11, a forward language model module 12, a reverse language model module 13, a word embedding module 14, a probability distribution determination module 15, a final word vector determination module 16, a model training module 17 and a monolingual corpus integration module 18;
each module is described in detail below:
The sample dataset corpus preprocessing module 11 is configured to preprocess a bilingual parallel corpus dataset or a monolingual dataset, and specifically includes: a noise symbol filtering unit 111, configured to filter noise symbols from the corpus; a corpus segmentation unit 112, configured to segment the corpus; a conversion unit 113, configured to lowercase the corpus and convert full-width characters to half-width; a length filtering unit 114, configured to filter out parallel sentence pairs whose length ratio is too large or too small; an encoding unit 115, configured to preprocess the corpus with byte pair encoding (BPE); a numerical conversion unit 116, configured to convert words in the corpus into one-hot encoded representations; a corpus dividing unit 117, configured to divide the corpus into different training batches; and a length adjustment unit 118, configured to truncate longer sentences and pad shorter sentences with zeros so that all sentences in the same batch have the same length;
The forward language model module 12 is configured to obtain the forward language model, which predicts the next word from left to right given the preceding context;
The reverse language model module 13 is configured to obtain the reverse language model, which predicts the preceding word from right to left given the following context;
The word embedding module 14 is configured to convert words into word embeddings that serve as the input to the machine translation model;
the probability distribution determining module 15 is used for obtaining the probability distribution of words at any position in the sentence on the whole word list through a forward language model and a reverse language model;
the final word vector determining module 16 is configured to determine a final word vector according to the probability distribution and the word vector of the entire vocabulary, and replace the word at the position with the final word vector;
The model training module 17 is used for iteratively training a neural machine translation model with the replaced bilingual parallel corpus in combination with the monolingual data; the training module comprises an encoder module for encoding the source language into semantic features of a specific dimension, and a decoder module for decoding the semantic features into the target language;
The monolingual corpus integration module 18 is configured to use the monolingual dataset to further improve the translation effect of the invention, including adding more domain-related monolingual data on top of the training corpus to train a better-performing language model; adding back-translated data; and adding tagged back-translated data, in which a tag is added to the source-side translated data to distinguish different kinds of source-side data.
In the neural machine translation device, the one-hot encoding corresponding to each word w_t is multiplied by a word vector matrix E of size |V|×m, where m is the dimension of the vectors, to obtain the corresponding word vector representation. At any position in a text there is a probability distribution, determined by its context, indicating how likely each word is to appear at that position; replacing the one-hot encoding of the word with this probability distribution represents all possible substitutions at that position. In order to make full use of the context information when generating the probability distribution of the corresponding position, a forward language model and a reverse language model are used for the calculation;
Given a source-language or target-language sentence S = (x_1, x_2, …, x_n) and the corresponding forward language model LM_forward and reverse language model LM_backward, according to the context information x_1, x_2, …, x_{t-1}, x_{t+1}, x_{t+2}, …, x_n, the word probability distribution at position t can be calculated by the following formulas:

f_j(w_j) = LM_forward(w_j | w_{<t})

b_j(w_j) = LM_backward(w_j | w_{>t})

wherein LM_forward(w_j | w_{<t}) represents the probability that the j-th word in the vocabulary appears after the word sequence x_1, x_2, …, x_{t-1}; likewise, LM_backward(w_j | w_{>t}) represents the probability that the j-th word in the vocabulary appears before x_{t+1}, x_{t+2}, …, x_n;
obtaining probability distribution of words at any position in sentences on the whole word list through a forward language model and a reverse language model;
Taking the forward language model as an example, once the probability f_j(w_j) of each word in the vocabulary has been obtained, a one-dimensional vector representation of w_t can be formed:

PF(w_t) = (f_1(w_1), f_2(w_2), ..., f_|V|(w_|V|))

wherein f_j(w) ≥ 0 and the probabilities sum to 1, i.e. Σ_{j=1}^{|V|} f_j(w_j) = 1;
determining a final word vector according to the probability distribution and the word vector of the whole word list, and replacing the word at the position by the final word vector;
In the machine translation model, given the word vector matrix E of all words, the word vector of the word w_t generated by the forward language model can be rewritten as:

PF(w)E = Σ_{j=1}^{|V|} f_j(w_j)·E_j

wherein f_j(w_j) is the probability of each word in the vocabulary, E_j is the j-th row of E, and PF(w) is the one-dimensional vector representation of the word w_t under the forward language model; the word vector of w_t generated by the reverse language model can be calculated by:

PB(w)E = Σ_{j=1}^{|V|} b_j(w_j)·E_j

wherein b_j(w_j) is the probability of each word in the vocabulary and PB(w) is the one-dimensional vector representation of the word w_t under the reverse language model;

finally, the obtained forward and reverse word vectors are averaged, and the final word vector of the word at any position is expressed as:

e_w = avg(PF(w)E + PB(w)E)
Training a neural machine translation model with the replaced bilingual parallel corpus in combination with the monolingual corpus to obtain the translation result;
Because a large amount of monolingual data exists for some languages, it plays a very important role when parallel data is lacking; therefore, on the basis of the above method, the monolingual data is integrated to further improve machine translation performance, and the monolingual corpus integration module 18 includes the following three units:
The extended language model module 181: the method described above uses only the original training dataset when training the language model; therefore, more domain-related monolingual data is added on top of the original corpus to train a better-performing language model;
The back-translation data addition module 182: back-translation has proved very effective on many language pairs, but the model improvement from adding a limited amount of monolingual data is limited; therefore, the method described above is applied on top of back-translation to further improve the machine translation effect;
The tagged back-translation data addition module 183: when back-translated data is added, part of the source-side data is machine-translated text; this part of the data contains some noise and may adversely affect the language model, so a tag is added to the source-side translated data to distinguish different kinds of source-side data;
In summary, the device improves the traditional data enhancement approach based on word replacement and provides a device based on a probability-distribution word vector replacement technique. The probability distribution of the input word is generated from its context by the language models, and a new word vector is then generated from the probability distribution over all words to replace the original word vector. Experiments on the German-English and Uyghur-Chinese language pairs verify the effectiveness of the method. At the same time, monolingual data is combined with this data enhancement, and experimental results show that adding monolingual data through the back-translation technique can remarkably improve model performance;
The present embodiment provides an electronic device, which refers to a wide variety of modern electronic digital computers, including, for example: personal computers, portable computers, various server devices. The electronic device includes a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, can implement the word vector replacement data enhancement-based machine translation model training method provided in embodiment 1;
as shown in fig. 6, the electronic device includes:
one or more multi-core processors 401, one or more GPU computing cards 402, memory 403, the memory 403 including volatile memory, such as Random Access Memory (RAM) 404 and/or cache memory 406, and may further include Read Only Memory (ROM) 405;
for the electronic device to interact, it should further comprise: an input device 408, an output device 409, and the various devices are interconnected by a bus 410;
the memory stores instructions executable by the at least one multi-core processor or the at least one GPU computing card, so that the word vector replacement data-based enhanced machine translation model training method provided by the application is executed. The memory 403 of the present application stores computer instructions for causing a computer to perform the word vector replacement data enhancement-based machine translation model training method provided by the present application;
The input device 408 accepts control signals input by the user into the electronic device and includes a keyboard for generating numeric or character information and a mouse for generating other key control signals; the output device 409 provides feedback information to the user of the electronic device, including a display for presenting execution results or processes;
the electronic device provided in this embodiment may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the internet) via a network adapter 407, the network adapter 407 communicating with other modules of the electronic device via a bus 410; it should be appreciated that although not shown, other hardware and/or software modules may be used in connection with the electronic device, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, data backup storage systems, and the like;
it should be noted that although several units/modules or sub-units/modules of an electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, in accordance with embodiments of the present application, the features and functions of two or more units/modules described above may be embodied in one unit/module, whereas the features and functions of one unit/module described herein may be further divided into a plurality of units/modules;
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the machine translation model training method of the embodiment based on word vector replacement data enhancement;
more specifically, among others, readable storage media may be employed including, but not limited to: portable disks, hard disks, random access memories, read-only memories, erasable programmable read-only memories, optical storage devices, magnetic storage devices, or any suitable combination of the foregoing;
in a possible embodiment, the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps of implementing the machine translation model training method of example 1 based on word vector replacement data enhancement, when said program product is run on the terminal device;
wherein the program code for carrying out the invention is written in any combination of one or more programming languages, the program code may be executed entirely on the user device, partially on the user device, as a stand-alone software package, partially on the user device, partially on a remote device or entirely on the remote device.
In summary, the invention improves the traditional data enhancement approach based on word replacement and provides a word vector replacement method based on probability distributions. The probability distribution of the input word is generated from its context by the language models, and a new word vector is then generated from the probability distribution over all words to replace the original word vector. Experiments on the German-English and Uyghur-Chinese language pairs verify the effectiveness of the method. At the same time, monolingual data is combined with this data enhancement, and experimental results show that adding monolingual data through the back-translation technique can remarkably improve model performance.
It should be noted that, for the sake of simplicity of description, the foregoing embodiments are all described as a series of combinations of actions, but it should be understood by those skilled in the art that the present invention is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present invention; further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required for the present invention.
This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (8)

1. A machine translation model training method based on word vector replacement data enhancement, the method comprising the steps of:
a. for the existing parallel corpus, training a forward language model and a reverse language model based on a Transformer structure by using the source language, wherein, given the word vector matrix E of all words, the word vector of the word w_t generated by the forward language model is expressed as:
PF(w_t)E
wherein f_j(w_j) is the probability distribution of each word in the vocabulary, and PF(w_t) is the one-dimensional vector representation of the word w_t in the forward language model; the word vector of the word w_t generated by the reverse language model is calculated by the following formula:
PB(w_t)E
wherein b_j(w_j) is the probability distribution of each word in the vocabulary, and PB(w_t) is the one-dimensional vector representation of the word w_t in the reverse language model;
given a source or target language sentence s = (x_1, x_2, ..., x_n) and the corresponding forward language model LM_forward and reverse language model LM_backward, according to the context information x_1, x_2, ..., x_{t-1}, x_{t+1}, x_{t+2}, ..., x_n, the word probability distribution at position t can be calculated by the following formulas:
f_j(w_j) = LM_forward(w_j | w_<t)
b_j(w_j) = LM_backward(w_j | w_>t)
wherein LM_forward(w_j | w_<t) represents the probability that the j-th word in the vocabulary occurs after the word sequence x_1, x_2, ..., x_{t-1}; likewise, LM_backward(w_j | w_>t) represents the probability that the j-th word in the vocabulary occurs before x_{t+1}, x_{t+2}, ..., x_n;
the probability distribution of the word at any position in a sentence over the whole vocabulary is thus obtained through the forward language model and the reverse language model;
for the forward language model, once the probability distribution f_j(w_j) of each word in the vocabulary has been obtained, a one-dimensional vector representation of w_t can be formed:
PF(w_t) = (f_1(w_1), f_2(w_2), ..., f_|V|(w_|V|))
wherein f_j(w_j) ≥ 0 and Σ_{j=1}^{|V|} f_j(w_j) = 1;
b. obtaining, through the forward language model and the reverse language model, the probability distribution of the word at any position in a sentence over the whole vocabulary, namely replacing the one-hot encoding of the word with the probability distribution so as to represent all possibilities at that position;
c. determining the final word vector according to the probability distribution and the word vectors of the whole vocabulary, wherein the final word vector of the word at any position is expressed as:
e_w = avg(PF(w)E + PB(w)E)
and replacing the word at that position with the final word vector;
d. training a neural machine translation model by using the replaced bilingual parallel corpus;
e. carrying out reverse translation on the monolingual corpus to obtain a pseudo-parallel corpus, adding the pseudo-parallel corpus into the training data, and repeating steps a-d to obtain the final translation result.
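For readers of the description, the following minimal Python sketch (not part of the claims) illustrates the control flow of steps a-e; every helper here (train_language_models, soft_replace_embeddings, train_nmt, back_translate) is a hypothetical stub standing in for the corresponding trained Transformer models, so only the orchestration of the steps is shown.

```python
from typing import List, Tuple

def train_language_models(src_sentences: List[str]):
    # stub for step a: would train forward and reverse Transformer language models
    return object(), object()

def soft_replace_embeddings(corpus, lm_fwd, lm_bwd):
    # stub for steps b-c: would replace each word vector with avg(PF(w)E + PB(w)E)
    return list(corpus)

def train_nmt(corpus):
    # stub for step d: a real system would train a Transformer NMT model here
    return lambda target_sentence: " ".join(reversed(target_sentence.split()))

def back_translate(model, target_monolingual: List[str]) -> List[Tuple[str, str]]:
    # step e: translate target-language sentences back into the source language
    # to build pseudo-parallel (source, target) pairs
    return [(model(t), t) for t in target_monolingual]

parallel = [("ein kleines haus", "a small house")]
mono_target = ["a big house"]

lm_fwd, lm_bwd = train_language_models([s for s, _ in parallel])   # step a
replaced = soft_replace_embeddings(parallel, lm_fwd, lm_bwd)       # steps b-c
nmt = train_nmt(replaced)                                          # step d
training_data = replaced + back_translate(nmt, mono_target)        # step e
print(training_data)
```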
2. The word vector replacement data enhancement-based machine translation model training method according to claim 1, wherein the parallel corpus required in step a is used as a training sample data set, and the parallel corpus is preprocessed as follows:
filtering noise symbols in the corpus; segmenting the corpus with a segmentation tool; preprocessing the corpus with byte pair encoding; converting the words in the corpus into one-hot encoded representations; dividing the corpus into different training batches; and scaling the corpora within the same batch to representations of the same length.
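A minimal, non-limiting Python sketch of such a preprocessing pipeline follows; the helper names are illustrative, and plain whitespace tokenization stands in here for the segmentation tool and the byte pair encoding step of the claim.

```python
import re
from typing import List

def clean(line: str) -> str:
    # stand-in for noise-symbol filtering: drop control characters and stray markup
    return re.sub(r"[\x00-\x1f<>|]", " ", line).strip()

def tokenize(line: str) -> List[str]:
    # whitespace segmentation stands in for the segmentation tool and BPE
    return line.lower().split()

def batch_and_pad(sentences: List[List[int]], batch_size: int, pad_id: int = 0):
    # divide the corpus into batches and scale each batch to a single length
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        width = max(len(s) for s in batch)
        yield [s + [pad_id] * (width - len(s)) for s in batch]

corpus = ["Ein  kleines | Haus .", "Ein sehr grosses Haus ."]
tokens = [tokenize(clean(line)) for line in corpus]
vocab = {w: i + 1 for i, w in enumerate(sorted({w for s in tokens for w in s}))}
ids = [[vocab[w] for w in s] for s in tokens]   # index form of the one-hot representation
for batch in batch_and_pad(ids, batch_size=2):
    print(batch)
```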
3. The word vector replacement data enhancement-based machine translation model training method according to claim 1, wherein the monolingual corpus in step e is exploited by: adding more domain-related monolingual data sets on the basis of the training corpus; training a language model; adding reverse-translated data; and adding labelled reverse-translated data, the labels being added to the source-language side of the translated data so as to distinguish the different kinds of source-side data.
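By way of illustration only, a minimal Python sketch of the labelled reverse-translation variant follows; the tag token <BT> and the helper tag_back_translated are assumptions made for this example rather than part of the claimed method.

```python
from typing import List, Tuple

BT_TAG = "<BT>"   # illustrative label token; the claim only requires some label

def tag_back_translated(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    # prefix the synthetic source side with the label so the model can
    # distinguish pseudo-parallel data from genuine parallel data
    return [(f"{BT_TAG} {src}", tgt) for src, tgt in pairs]

genuine = [("ein kleines haus", "a small house")]
synthetic = [("ein grosses haus", "a big house")]   # obtained by reverse translation
training_data = genuine + tag_back_translated(synthetic)
print(training_data[1])   # ('<BT> ein grosses haus', 'a big house')
```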
4. A word vector replacement data enhancement-based machine translation model training device, comprising a sample data set corpus preprocessing module, a forward language model module, a reverse language model module, a word embedding module, a probability distribution determination module, a final word vector determination module, a model training module and a monolingual corpus merging module, wherein:
the sample data set corpus preprocessing module: used for preprocessing a bilingual parallel corpus data set or a monolingual corpus;
the forward language model module: used for obtaining a forward language model, the language model predicting the following text from left to right according to the preceding text;
the reverse language model module: used for obtaining a reverse language model, the language model predicting the preceding text from right to left according to the following text;
the word embedding module: used for providing the word vectors that serve as the input to the machine translation model;
the probability distribution determination module: used for obtaining, through the forward language model and the reverse language model, the probability distribution of the word at any position in a sentence over the whole vocabulary; given a source or target language sentence s = (x_1, x_2, ..., x_n) and the corresponding forward language model LM_forward and reverse language model LM_backward, according to the context information x_1, x_2, ..., x_{t-1}, x_{t+1}, x_{t+2}, ..., x_n, the word probability distribution at position t can be calculated by the following formulas:
f_j(w_j) = LM_forward(w_j | w_<t)
b_j(w_j) = LM_backward(w_j | w_>t)
wherein LM_forward(w_j | w_<t) represents the probability that the j-th word in the vocabulary occurs after the word sequence x_1, x_2, ..., x_{t-1}; likewise, LM_backward(w_j | w_>t) represents the probability that the j-th word in the vocabulary occurs before x_{t+1}, x_{t+2}, ..., x_n;
the probability distribution of the word at any position in a sentence over the whole vocabulary is thus obtained through the forward language model and the reverse language model;
for the forward language model, once the probability distribution f_j(w_j) of each word in the vocabulary has been obtained, a one-dimensional vector representation of w_t can be formed:
PF(w_t) = (f_1(w_1), f_2(w_2), ..., f_|V|(w_|V|))
wherein f_j(w_j) ≥ 0 and Σ_{j=1}^{|V|} f_j(w_j) = 1;
the final word vector determination module: used for determining the final word vector according to the probability distribution and the word vectors of the whole vocabulary, and replacing the word at the position with the final word vector; in the machine translation model, given the word vector matrix E of all words, the word vector of the word w_t generated by the forward language model can be rewritten as:
PF(w_t)E
wherein f_j(w_j) is the probability distribution of each word in the vocabulary, and PF(w_t) is the one-dimensional vector representation of the word w_t in the forward language model; the word vector of the word w_t generated by the reverse language model can be calculated by the following formula:
PB(w_t)E
wherein b_j(w_j) is the probability distribution of each word in the vocabulary, and PB(w_t) is the one-dimensional vector representation of the word w_t in the reverse language model;
finally, the obtained forward and reverse word vectors are averaged, and the final word vector of the word at any position can be calculated by the following formula:
e_w = avg(PF(w)E + PB(w)E);
the model training module: used for iteratively training a neural machine translation model by utilizing the replaced bilingual parallel corpus in combination with the monolingual data;
the monolingual corpus merging module: used for performing reverse translation by using the monolingual data set and adding the obtained pseudo-parallel corpus into the training data set corpus.
5. The word vector replacement data enhancement based machine translation model training device according to claim 4, wherein the model training module comprises:
an encoder module to encode the source language into semantic features of a particular dimension;
a decoder module for decoding the semantic features into a target language.
6. The word vector replacement data enhancement-based machine translation model training device according to claim 4, wherein the sample dataset corpus preprocessing module comprises:
the noise symbol filtering unit: used for filtering noise symbols in the corpus;
the corpus segmentation unit: used for segmenting the corpus;
the conversion unit: used for performing case lowering and full-width to half-width conversion on the corpus;
the length filtering unit: used for filtering the parallel language pairs by length;
the encoding unit: used for preprocessing the corpus through byte pair encoding;
the numerical value conversion unit: used for converting the words in the corpus into one-hot encoded representations;
the corpus dividing unit: used for dividing the corpus into different training batches;
the length adjustment unit: used for scaling the corpora within the same batch to representations of the same length.
7. An electronic device, comprising: at least one multi-core processor; at least one GPU computing card; and a memory communicatively connected to the at least one multi-core processor; characterized in that the memory stores instructions executable by the at least one multi-core processor, the instructions being executed by the at least one multi-core processor or the at least one GPU computing card to enable the at least one multi-core processor or the at least one GPU computing card to implement the steps of the method of any one of claims 1 to 3.
8. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 3.
CN202110271844.XA 2021-03-13 2021-03-13 Word vector replacement data enhancement-based machine translation model training method and device, electronic equipment and storage medium Active CN112926344B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110271844.XA CN112926344B (en) 2021-03-13 2021-03-13 Word vector replacement data enhancement-based machine translation model training method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110271844.XA CN112926344B (en) 2021-03-13 2021-03-13 Word vector replacement data enhancement-based machine translation model training method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112926344A (en) 2021-06-08
CN112926344B true CN112926344B (en) 2023-11-17

Family

ID=76172944

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110271844.XA Active CN112926344B (en) 2021-03-13 2021-03-13 Word vector replacement data enhancement-based machine translation model training method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112926344B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591492B (en) * 2021-06-30 2023-03-24 北京百度网讯科技有限公司 Corpus generation method and device, electronic equipment and storage medium
CN113869070A (en) * 2021-10-15 2021-12-31 大连理工大学 Multi-language neural machine translation method fusing specific language adapter module

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9449599B2 (en) * 2013-05-30 2016-09-20 Promptu Systems Corporation Systems and methods for adaptive proper name entity recognition and understanding

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3046053A2 (en) * 2015-01-19 2016-07-20 Samsung Electronics Co., Ltd Method and apparatus for training language model, and method and apparatus for recongnizing language
CN104933039A (en) * 2015-06-04 2015-09-23 中国科学院新疆理化技术研究所 Entity link system for language lacking resources
CN108920473A (en) * 2018-07-04 2018-11-30 中译语通科技股份有限公司 A kind of data enhancing machine translation method based on similar word and synonym replacement
CN109933795A (en) * 2019-03-19 2019-06-25 上海交通大学 Based on context-emotion term vector text emotion analysis system
CN110765769A (en) * 2019-08-27 2020-02-07 电子科技大学 Entity attribute dependency emotion analysis method based on clause characteristics
CN110852117A (en) * 2019-11-08 2020-02-28 沈阳雅译网络技术有限公司 Effective data enhancement method for improving translation effect of neural machine
CN111027595A (en) * 2019-11-19 2020-04-17 电子科技大学 Double-stage semantic word vector generation method
CN111428490A (en) * 2020-01-17 2020-07-17 北京理工大学 Reference resolution weak supervised learning method using language model
CN111460838A (en) * 2020-04-23 2020-07-28 腾讯科技(深圳)有限公司 Pre-training method and device of intelligent translation model and storage medium
CN111695356A (en) * 2020-05-28 2020-09-22 平安科技(深圳)有限公司 Synonym corpus generation method, synonym corpus generation device, computer system and readable storage medium
CN111737996A (en) * 2020-05-29 2020-10-02 北京百度网讯科技有限公司 Method, device and equipment for obtaining word vector based on language model and storage medium

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Backward-Forward Sequence Generative Network for Multiple Lexical Constraints; Seemab Latif et al.; Artificial Intelligence Applications and Innovations; 2020-05-29; pp. 39-50 *
Chinese Named Entity Recognition Method Based on Multi-Criteria Fusion; Cai Qing; Journal of Southeast University (Natural Science Edition); 2020-09-20; Vol. 50, No. 5; pp. 929-934 *
Mongolian-Chinese Neural Machine Translation Model Incorporating Prior Information; Fan Wenting et al.; Journal of Chinese Information Processing; 2018-06-15; Vol. 32, No. 6; pp. 36-43 *
Research on Semi-Supervised Attribute Recognition Incorporating Bidirectional Language Models; Hou Xingchi et al.; Computer and Digital Engineering; 2020-10-20; Vol. 48, No. 10; pp. 2436-2451 *
Sentiment Analysis Combining Language-Model-Based Word Embeddings and Multi-Scale Convolutional Neural Networks; Zhao Yaou et al.; Journal of Computer Applications; 2019-09-27; Vol. 40, No. 3; pp. 651-657 *
Hierarchical Multi-Feature Fusion Model for Uyghur-Chinese Machine Translation; Yang Yating et al.; Journal of Xiamen University (Natural Science); 2020-03-23; Vol. 59, No. 2; pp. 206-212 *

Also Published As

Publication number Publication date
CN112926344A (en) 2021-06-08

Similar Documents

Publication Publication Date Title
US11544474B2 (en) Generation of text from structured data
CN110795556A (en) Abstract generation method based on fine-grained plug-in decoding
CN111401079A (en) Training method and device of neural network machine translation model and storage medium
CN112926344B (en) Word vector replacement data enhancement-based machine translation model training method and device, electronic equipment and storage medium
CN114596566B (en) Text recognition method and related device
CN116324972A (en) System and method for a multilingual speech recognition framework
Meng et al. WeChat neural machine translation systems for WMT20
Mansimov et al. Towards end-to-end in-image neural machine translation
DE102022131824A1 (en) Visual speech recognition for digital videos using generative-adversative learning
CN114662483A (en) Text abstract generation method and device and storage medium
CN114564933A (en) Personalized machine translation training method and system
Li et al. Is Synthetic Data From Diffusion Models Ready for Knowledge Distillation?
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN112069810A (en) Text filling method and device
Zhan et al. Non-autoregressive translation with dependency-aware decoder
CN115659172A (en) Generation type text summarization method based on key information mask and copy
CN114595700A (en) Zero-pronoun and chapter information fused Hanyue neural machine translation method
CN114298031A (en) Text processing method, computer device and storage medium
Bhatnagar et al. Neural machine translation of Hindi and English
Yang et al. Hanoit: Enhancing context-aware translation via selective context
CN112257461A (en) XML document translation and evaluation method based on attention mechanism
Xie et al. Enhancing multimodal deep representation learning by fixed model reuse
Dehaqi et al. Adversarial image caption generator network
CN116310984B (en) Multi-mode video subtitle generating method based on Token sampling
Vasselli et al. A Closer Look at k-Nearest Neighbors Grammatical Error Correction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant