CN113673259A - Low-resource neural machine translation method and system based on data enhancement - Google Patents

Low-resource neural machine translation method and system based on data enhancement

Info

Publication number
CN113673259A
CN113673259A (Application CN202110857215.5A)
Authority
CN
China
Prior art keywords
data, machine translation, resource, neural machine, low
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110857215.5A
Other languages
Chinese (zh)
Inventor
刘洋
米尔阿迪力江·麦麦提
栾焕博
孙茂松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202110857215.5A
Publication of CN113673259A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/40: Processing or translation of natural language
    • G06F 40/42: Data-driven translation
    • G06F 40/53: Processing of non-Latin text
    • G06F 40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Abstract

The invention provides a low-resource neural machine translation method and system based on data enhancement. The method comprises: determining real data to be translated; and inputting the real data to be translated into a neural machine translation model to obtain the neural machine translation result output by the model. The neural machine translation model is obtained by performing data enhancement on original real data, comprising parallel corpora and monolingual corpora of a low-resource language pair, and then training a low-resource neural machine translation model on the enhanced data. Embodiments of the invention make sparse data usable for training a neural machine translation model for low-resource translation, and can efficiently and accurately alleviate the resource shortage in low-resource neural machine translation.

Description

Low-resource neural machine translation method and system based on data enhancement
Technical Field
The invention relates to the technical field of machine translation, in particular to a low-resource neural machine translation method and system based on data enhancement.
Background
Translation between low-resource languages and Chinese is currently an urgent and important task. Commonly used techniques for automatic machine translation fall into statistics-based and neural network-based methods: the former is statistical machine translation, the latter neural machine translation. Obtaining a reliable translation model requires collecting large-scale, high-quality parallel corpora, which often exist only between a few languages and are limited to certain specific domains such as government documents and news, while corpora in other domains are comparatively scarce. Beyond the domain problem, some languages are inherently resource-poor, and finding or obtaining usable parallel corpora from the Internet is very difficult. Neural machine translation now surpasses traditional statistical machine translation in translation quality, but its main drawback is that training the translation model depends heavily on large-scale parallel corpora.
The large amount of linguistic data on the Internet makes it possible to acquire parallel corpora covering multiple languages and domains. However, among corpora obtained from the Internet, few belong to a specific domain: news corpora, for example, are easy to obtain, but corpora in domains such as government, film, trade, education, sports, literature, and medicine are hard to obtain. If the training set, the development set (used for tuning the trained model), and the test set belong to the same domain, the translation results on in-domain corpora are very good; otherwise, the results on out-of-domain corpora are very poor. Although research on neural machine translation for high-resource languages has achieved excellent results, in machine translation tasks for low-resource languages, parallel corpora are difficult to obtain, let alone parallel corpora in a specific domain. The resulting problem is data sparsity (Data Sparsity): if the translation model cannot be trained sufficiently, even the currently most popular and effective neural machine translation methods for high-resource language pairs are difficult to apply to low-resource machine translation. Therefore, low-resource machine translation is one of the problems that urgently needs to be solved.
Disclosure of Invention
The embodiment of the invention provides a low-resource neural machine translation method and system based on data enhancement, which are used for solving the problem that a neural machine translation model cannot be applied to low-resource neural machine translation due to data sparsity at present.
In a first aspect, an embodiment of the present invention provides a data enhancement-based low-resource neural machine translation method, including:
determining real data to be translated;
inputting the real data to be translated into a neural machine translation model to obtain a neural machine translation result output by the neural machine translation model;
the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is carried out on original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair.
Further, the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is performed on original real data including parallel corpora and monolingual corpora on the low-resource language pair, and includes:
acquiring original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair, and carrying out negative sampling on the original real data to obtain negative sample data;
training a discriminator sub-model based on the original real data and the negative sample data to obtain an evaluation model;
constructing the original real data into pseudo data based on data enhancement of an editing distance, and screening the pseudo data based on the evaluation model to obtain screened data;
and combining the screened data with the original real data to construct enhanced data, and training a low-resource neural machine translation model using the enhanced data and an attention-based encoder-decoder translation framework to obtain the neural machine translation model.
Further, the original real data comes from a public data set or manually prepared data;
the low-resource language pair is a language pair with the parallel corpus size smaller than a preset value and comes from an open data set;
the monolingual corpus is a monolingual corpus of the source language or the target language of the low-resource language pair and comes from manually prepared data;
the obtaining of negative sample data by negative sampling of the original real data includes: and generating negative sample data by randomly discarding or randomly adding the original real data.
Further, after obtaining the original real data including the parallel corpus and the monolingual corpus of the low-resource language pair, the method further includes:
and carrying out cleaning data preprocessing on the original real data comprising the source language or the target language and secondary preprocessing comprising cleaning data, eliminating blank lines, eliminating illegal characters and non-English characters at a target end.
Further, the data enhancement based on the edit distance constructs the original real data into pseudo data, including:
performing edit distance sampling on the original real data based on an edit distance submodel;
selecting the position of the replacement word based on the position sub-model and the sampled editing distance;
and replacing the new word at the position of the replacement word based on the replacement sub-model to obtain pseudo data.
Further, the edit distance submodel is represented as follows:

P(d | x) = exp(-d/τ) · c(d, I) / Σ_{d'=0}^{I} exp(-d'/τ) · c(d', I)

wherein τ represents a temperature hyperparameter, and c(d, I) represents the number of sentences at edit distance d (d ∈ {0, 1, 2, 3, …, I}) from a sentence of length I;

the position submodel is represented as follows:

P(p | x, d) = 1 / C(I, d)

i.e., the set of d edited positions p = {p_1, …, p_d} is drawn uniformly from the positions of the sentence;

the replacement submodel is represented as follows:

P(w_j | x, d, p) = P(w_j | w_{j-1}, p_j);

wherein w_j is the new word sampled at step j and p_j is the sampling position.
Further, the discriminator submodel is used for distinguishing the distribution of the original real data from the distribution of the pseudo data, and comprises a discriminator loss function, which is expressed as follows:

L_D = E_{x∼P_r(x)}[(D̂(x) - 1)^2] + E_{x̂∼P̂(x̂)}[D̂(x̂)^2]

wherein L_D is the loss function, D̂ is the discriminator, P_r(x) is the distribution of the original real data, and P̂(x̂) is the distribution of the pseudo data.
In a second aspect, an embodiment of the present invention provides a data enhancement-based low-resource neural machine translation system, including:
the data determining unit is used for determining real data to be translated;
the machine translation unit is used for inputting the real data to be translated into a neural machine translation model to obtain a neural machine translation result output by the neural machine translation model;
the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is carried out on original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the program to implement the steps of the data enhancement-based low-resource neural machine translation method according to any one of the above first aspects.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the data enhancement-based low-resource neural machine translation method according to any one of the first aspect.
According to the low-resource neural machine translation method and system based on data enhancement, real data are input into a neural machine translation model, and a neural machine translation result output by the neural machine translation model is obtained; the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is carried out on original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair. The invention realizes the application of sparse data to the neural machine translation model of low-resource neural machine translation, and can efficiently and accurately solve the problem of resource shortage in low-resource neural machine translation.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a data enhancement-based low-resource neural machine translation method provided by the invention;
FIG. 2 is a schematic diagram of a training process of a neural machine translation model provided by the present invention;
FIG. 3 is a diagram of a data enhancement architecture for constrained sampling based on edit distance provided by the present invention;
FIG. 4 is a sample diagram of a process for generating negative examples provided by the present invention;
FIG. 5 is a schematic diagram of a process for constructing pseudo data according to the present invention;
FIG. 6 is a diagram illustrating a word-level data enhancement method provided by the present invention;
FIG. 7 is a schematic structural diagram of a data enhancement-based low-resource neural machine translation system provided by the present invention;
FIG. 8 is a schematic diagram of the structure of a machine translation unit provided by the present invention;
FIG. 9 is a schematic structural diagram of a data screening unit provided in the present invention;
fig. 10 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The technical idea of the invention is as follows: the translation quality of neural machine translation is closely related to the quantity and quality of the corpus. To train a high-quality neural machine translation model, commercial-grade translation systems often carefully process (filter) tens of millions or even billions of bilingual parallel sentence pairs and train the model on them. For resource-rich languages such as English, French, German and Chinese, corpora of such scale can indeed be obtained. For resource-poor languages, however, obtaining massive parallel corpora is very difficult. Related research has shown that, when the parallel corpus size is limited, the translation quality of neural machine translation is inferior to that of statistical machine translation. Therefore, how to realize high-quality neural machine translation when parallel corpora are limited has become a hot research topic in the machine translation community. In addition, languages themselves have historical, cultural and regional characteristics. These properties are evident in many resource-poor languages, which also poses some challenges for the study of low-resource neural machine translation.
It is an object of the invention to achieve high-quality data enhancement. Although various researchers have used data enhancement to generate bilingual corpora from monolingual corpora, it is difficult to avoid syntactic and semantic errors in the generated pseudo data, whether sentence-level enhancement (generating pseudo data by translating a monolingual corpus) or word-level enhancement (replacing words of a low-resource source language and words of a high-resource source language with each other, i.e., constructing a dictionary by a word-vector method and then replacing words of the high-resource source language with words of the low-resource language) is used. In other words, it should be possible to construct high-quality pseudo data from existing monolingual or bilingual data and to expand the scale of the bilingual data by making full use of existing data, thereby further improving the quality of low-resource neural machine translation. For example, for the existing low-resource language pair Uzbek → Chinese (Uz → Zh), the source side (Uzbek, Uz) or the target side (Chinese, Zh) can be enhanced with the method proposed in the present invention to achieve a better translation effect. Although the data sets of low-resource languages are much smaller, the degradation of translation quality caused by data sparsity needs to be avoided as much as possible. If a method for training a better translation model between low-resource languages, or between a high-resource and a low-resource language, can be provided on the basis of an efficient and easy-to-use data enhancement technique, this problem will no longer be an obstacle. Therefore, the invention provides a data-enhanced low-resource neural machine translation method based on edit distance constrained sampling. The constrained sampling method based on edit distance is more efficient than the random sampling methods used in previous work. In addition, the invention designs a discriminator submodel to select higher-quality data after generation; this submodel filters the generated data so that, to a certain extent, only pseudo data with few syntactic and semantic errors is retained. In summary, a simple and effective data enhancement method is urgently needed in low-resource machine translation, so as to solve the problem that the quality of pseudo data is difficult to guarantee in the prior art, improve the performance of low-resource neural machine translation, improve translation efficiency, and obtain a large amount of more accurate data.
The following describes a data enhancement-based low-resource neural machine translation method and system provided by the present invention with reference to fig. 1 to 10.
The embodiment of the invention provides a low-resource neural machine translation method based on data enhancement. Fig. 1 is a schematic flowchart of a data-enhancement-based low-resource neural machine translation method provided in an embodiment of the present invention, and as shown in fig. 1, the method includes:
step 110, determining real data to be translated;
step 120, inputting the real data to be translated into a neural machine translation model to obtain a neural machine translation result output by the neural machine translation model;
the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is carried out on original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair.
According to the method provided by the embodiment of the invention, the real data is input into the neural machine translation model to obtain the neural machine translation result output by the neural machine translation model, so that the problem of resource shortage in low-resource neural machine translation can be efficiently and accurately solved.
Based on any of the above embodiments, as shown in fig. 2, the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is performed on original real data including parallel corpora and monolingual corpora on a low-resource language pair, and includes:
step 210, acquiring original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair, and performing negative sampling on the original real data to obtain negative sample data;
specifically, Parallel corpora (Parallel Corpus) and MonoLingual corpora (MonoLingual Corpus) on a Low-Resource (Low-Resource) language pair are prepared as original Real Data (Real Data).
Several elements of parallel and monolingual corpora on low-resource language pairs are prepared:
A. the factors to be considered in selecting a low-resource language pair are the parallel corpus size, the number of non-repeating words in the monolingual corpus, and the recognition level of the data set.
B. Generally, a low-resource language pair as referred to herein is a language pair with few parallel corpora, for example, a language pair whose parallel corpus contains fewer than 1,000,000, fewer than 500,000, fewer than 200,000, or even fewer than 10,000 sentence pairs.
C. The monolingual corpus used in the invention is a monolingual corpus of the source (Source) language or the target (Target) language of the low-resource language pair, not a monolingual corpus of a language in some other high-resource language pair.
D. Meanwhile, the quality of the monolingual corpus also needs to be carefully considered, because in the present invention, the monolingual corpus is always considered as the original real data. If the quality of the prepared monolingual corpus is problematic, poor quality samples may occur when constructing the dummy data by the method proposed in the present invention.
E. Monolingual corpora must be prepared manually, not through machine-generated data. If the monolingual corpus is not manually prepared data, the monolingual corpus cannot be said to be real data. The monolingual corpus required in the present invention is not only raw data but also real data.
Step 220, training a discriminator submodel based on the original real data and the negative sample data to obtain an evaluation model;
specifically, when a negative sample is generated on the original real data by using the negative sampling method, the more noise the generated sample contains, the better. If more noise is contained in the negative samples, the trained Discriminator Sub-Model (Discriminator Sub-Model) performs better. The positive and negative samples are needed in training the discriminator submodel, respectively using Dr(X) and Dn(X) represents. The Negative Sampling method (Negative Sampling) employed in the present invention is a method of random discard (Randomly infection). In practice, as long as D can be guaranteednThe noise in (X) is sufficient from Dr(X) production of DnIn the case of (X), other methods may be used. If D isnInsufficient noise in (X) may affect the performance of the discriminator submodel。
That is, on the original real data, a new Negative sample data, i.e., data with noise (noise) different from the original real data, is constructed by using a Negative Sampling (Negative Sampling) method.
A Discriminator Sub-Model (Discriminator Sub-Model) is trained as an evaluation Model (Evaluator Model) by using the original real data and a negative sample obtained by negative sampling. Data enhancement methods (either word-level or sentence-level methods) have difficulty in guaranteeing the quality of the dummy data, and even have no guarantee of semantic and syntactic integrity of the dummy data at all. Therefore, the invention designs a discriminator submodel, which can be also called a screening model or a Filter (Filter) model. This model is an independent part of the invention, i.e. the discriminator submodel is not trained together with the data enhancement module based on edit distance constrained sampling, but independently. The pseudo data generated by the data enhancement module is used as the input of the discriminator submodel, which is beneficial to constructing the pseudo data with higher quality.
Step 230, constructing the original real data into pseudo data based on data enhancement of an editing distance, and screening the pseudo data based on the evaluation model to obtain screened data;
Specifically, on the basis of the positive samples P_r(X), the original real data D_r(X) is enhanced by the edit-distance-based constrained sampling data enhancement method to construct a pseudo-data distribution P̂(x̂).

Pseudo data (Pseudo Data) is generated from the real data D_r(X) by the data enhancement method based on edit distance constrained sampling, and the data is screened by the discriminator submodel provided by the invention, so that syntactic and semantic errors can be reduced to a certain extent. Therefore, the data enhancement method provided by the invention is superior to existing methods in its overall architecture.
Parallel corpora on the original low-resource language pair are enhanced (Augmented) from the original real data by edit distance (Edit Distance) constrained sampling (Constrained Sampling); that is, pseudo data (Pseudo Data) is constructed by the data enhancement method provided by the invention, thereby improving the performance of the low-resource neural machine translation model.
In order to further guarantee the fluency and the fidelity of the pseudo data, a discriminator submodel is adopted for screening, so that the data with the best quality is reserved. In other words, sentences containing syntax or semantic errors are removed through the discriminator submodel, and the pseudo data generated by the core algorithm provided by the invention is mainly screened, so that the quality of the pseudo data is further improved.
FIG. 3 shows the structure of the core data enhancement module and discriminator submodel proposed in the present invention. Let x = x_1, x_2, x_3, …, x_i, …, x_I be a source sentence containing I words and y = y_1, y_2, y_3, …, y_j, …, y_J be a target sentence containing J words,
and let D = {(x^(m), y^(m))}_{m=1}^{M} represent the original training data containing M sentence pairs. The data enhancement task is formalized as shown in FIG. 3: given the distribution P_r(x) of the real data, the enhancement task is to train the enhancement model on the basis of P_r(x) and generate pseudo data. The pseudo-data distribution P̂(x̂) generated by the model proposed in the invention should be close to P_r(x).
Using the negative sampling method shown in FIG. 4, some negative samples P_n(x) are sampled from the original real data P_r(x) for training the discriminator submodel. Then, on the original real data P_r(x), pseudo data x̂ ∼ P̂(x̂) is generated through the procedure of step D explained in this description (a highlight and the core of the invention), i.e., the edit distance constrained sampling method. Next, the discriminator submodel D̂ is trained on the negative samples P_n(x) constructed by negative sampling and on the original real data P_r(x). The pseudo data is then passed through the discriminator submodel D̂ to further select the data with the best quality, i.e., data with few or even no syntactic and semantic errors. Finally, the selected high-quality data is combined with the most original real data to generate large-scale data, thereby realizing data enhancement and effectively alleviating the problem that performance cannot be improved due to insufficient data resources in low-resource neural machine translation tasks.
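The overall flow can be summarized in the following sketch (all function and method names, such as negative_sampler, augmenter and discriminator.fit/score, are illustrative assumptions for exposition and are not a reference implementation of the invention):

```python
# Illustrative sketch of the data-enhancement pipeline described above.

def build_enhanced_corpus(real_corpus, negative_sampler, augmenter, discriminator, threshold=0.5):
    """Construct enhanced training data from the original real corpus."""
    # 1. Build noisy negative samples from the real data (e.g. random word deletion).
    negatives = [negative_sampler(sentence) for sentence in real_corpus]

    # 2. Train the discriminator sub-model to separate real sentences from noisy ones.
    discriminator.fit(positive=real_corpus, negative=negatives)

    # 3. Generate pseudo data by edit-distance constrained sampling.
    pseudo = [augmenter(sentence) for sentence in real_corpus]

    # 4. Keep only pseudo sentences the discriminator scores as close to the real distribution.
    filtered = [s for s in pseudo if discriminator.score(s) >= threshold]

    # 5. Merge the filtered pseudo data with the original real data for NMT training.
    return real_corpus + filtered
```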
Step 240, combining the screened data and the original real data to construct enhanced data, and training a low-resource neural machine translation model using the enhanced data and an attention-based encoder-decoder translation framework to obtain the neural machine translation model.
Specifically, a large amount of screened pseudo data and the small amount of most original real data are combined to construct larger-scale data, so that data enhancement is realized. This process satisfies the data-hungry nature of neural machine translation, can to a certain extent solve the problem that the performance of a low-resource neural machine translation model cannot be improved because enough data cannot be obtained, and improves the performance of neural machine translation models between resource-scarce languages or between resource-rich and resource-scarce languages. After data enhancement is achieved by the method proposed in the present invention, training of the low-resource neural machine translation model begins. The neural machine translation model used by the invention adopts an encoder-decoder framework based on an attention mechanism, and an RNN (Recurrent Neural Network) based on LSTM (Long Short-Term Memory) is used at both the encoder end and the decoder end.
The screened data and the most primitive real data are combined to construct high-quality enhancement data on which a low-resource neural machine translation model is trained using an attention-based encoder-decoder neural machine translation framework. Training the training corpus in the training set by a neural machine translation model of an encoder-decoder framework based on an attention mechanism to obtain neural machine translation model parameters of a low-resource language; specifically, the model parameters include source language end and target language end word vectors, and model weight matrix parameters.
The training process of the translation model is as follows:
First, the word vector of each word of the input sentence is obtained; this is realized through the preprocessing step of an RNN language model as follows:

F1. The RNN language model is composed of a look-up layer, a hidden layer and an output layer. Each word contained in the input sentence is converted into its corresponding word vector representation through the look-up layer:

x_t = look-up(s)    (1)

wherein x_t is the word vector representation of s, s is the input at each time step t, and look-up represents the look-up layer.
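A minimal sketch of such a look-up layer, assuming a randomly initialized embedding matrix (vocabulary size and dimension are illustrative):

```python
import numpy as np

# Sketch of the look-up layer in equation (1): map a token id to its word
# vector by indexing an embedding matrix.

def make_lookup(vocab_size=10000, dim=512, seed=0):
    rng = np.random.default_rng(seed)
    E = rng.normal(size=(vocab_size, dim)).astype(np.float32)  # word vector matrix
    def look_up(token_id):
        return E[token_id]  # x_t = look-up(s)
    return look_up
```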
F2. For the parallel sentence pair obtained in step A, let its source side be x = x_1, …, x_i, …, x_I and its target side be y = y_1, …, y_j, …, y_J. Neural machine translation usually factorizes the sentence-level translation probability into word-level probabilities:
P(y | x; θ) = Π_{j=1}^{J} P(y_j | x, y_{<j}; θ)    (2)

where θ denotes the model parameters and y_{<j} is the partial translation. If the training set is D = {(x^(m), y^(m))}_{m=1}^{M}, the training goal is to maximize the log-likelihood on the training set:

L(θ) = Σ_{m=1}^{M} log P(y^(m) | x^(m); θ)    (3)

θ̂ = argmax_θ L(θ)    (4)

The decision rule for translation is to use the learned model parameters θ̂ to obtain, for a source sentence x that has not been encountered (i.e., not seen during training), the target sentence ŷ with the maximum translation probability:

ŷ = argmax_y P(y | x; θ̂)    (5)

In particular, P(y | x; θ̂) is maximized by maximizing the word-level translation probability of each word y_j:

ŷ_j = argmax_{y_j} P(y_j | x, ŷ_{<j}; θ̂)    (6)
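A small sketch of this factorized objective and greedy decision rule, assuming a word-level scorer log_p_word(y_j, x, y_prefix) is available (the interface is hypothetical):

```python
# Sketch of equations (2)-(6): sentence log-likelihood as a sum of word-level
# log-probabilities, and a greedy word-by-word decision rule.

def sentence_log_likelihood(x, y, log_p_word):
    """log P(y | x) = sum_j log P(y_j | x, y_<j)."""
    return sum(log_p_word(y[j], x, y[:j]) for j in range(len(y)))

def corpus_log_likelihood(pairs, log_p_word):
    """Training objective: summed log-likelihood over the training set."""
    return sum(sentence_log_likelihood(x, y, log_p_word) for x, y in pairs)

def greedy_decode(x, log_p_word, vocab, max_len=50, eos="</s>"):
    """Pick the highest-probability word at each step (a greedy form of eq. (6))."""
    y = []
    for _ in range(max_len):
        best = max(vocab, key=lambda w: log_p_word(w, x, y))
        y.append(best)
        if best == eos:
            break
    return y
```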
after the steps F1, F2, the following steps are also required:
F3. The input obtained through step F2 requires further processing: a bidirectional LSTM is used at the encoder side to obtain a representation of the entire source sentence. (The GRU is likewise a kind of RNN unit; as described in step F1, the RNN language model is composed of a look-up layer, a hidden layer and an output layer.) The word vector representation of each word is obtained by the RNN in step F1, and the result is then used as the input of the encoder, i.e., the information prepared for the hidden layer of the encoder stage. When the hidden layer calculates the current hidden state, the output of the look-up layer is used as input; that is, words are mapped to a context vector according to the word vector of each word and the previous hidden-state information:
h_t = f(x_t, h_{t-1})    (7)

where f is an abstract function that computes the current new hidden state given the input x_t and the historical state h_{t-1}. The initial state h_0 is often set to 0, and f typically takes the form h_t = σ(W_xh x_t + W_hh h_{t-1}), where σ is a non-linear function (e.g., sigmoid or tanh).
Thus, the forward (forward) state of the bidirectional BiRNN is calculated according to the following equations (the arrows → and ← denote the forward and reverse directions, and →h_0 = 0):

→h_i = (1 - →z_i) ∘ →h_{i-1} + →z_i ∘ →h̃_i    (8)

wherein

→h̃_i = tanh(→W E x_i + →U [→r_i ∘ →h_{i-1}])    (9)

→z_i = σ(→W_z E x_i + →U_z →h_{i-1})    (10)

→r_i = σ(→W_r E x_i + →U_r →h_{i-1})    (11)

E is the word vector matrix, →W, →W_z, →W_r and →U, →U_z, →U_r are weight matrices, m and n are the word vector dimension and the hidden state dimension respectively, and σ is the sigmoid function.

The reverse state ←h_i is calculated in the same way as the forward state. The word vector matrix E is shared between the forward and reverse states, but the weight matrices are not. The forward and reverse states are combined to obtain

h_i = [→h_i; ←h_i]    (12)
the decoder of the translation model uses a unidirectional RNN, as opposed to the encoder using a bidirectional RNN.
The decoder also has a corresponding hidden state, but it is computed differently from that of the encoder; the detailed calculation is as follows:

s_i = (1 - z_i) ∘ s_{i-1} + z_i ∘ s̃_i    (13)

wherein

s̃_i = tanh(W E y_i + U [r_i ∘ s_{i-1}] + C c_i)    (14)

z_i = σ(W_z E y_i + U_z s_{i-1} + C_z c_i)    (15)

r_i = σ(W_r E y_i + U_r s_{i-1} + C_r c_i)    (16)

E is the word vector matrix of the words contained in the target-language sentence, W, W_z, W_r, U, U_z, U_r and C, C_z, C_r are weight matrices, m and n are the word vector dimension and the hidden state dimension respectively, and σ is the sigmoid function. The initial hidden state s_0 is calculated as follows:

s_0 = tanh(W_s ←h_1)    (17)

wherein W_s is a weight matrix. The context vector is recalculated at each time step:

c_i = Σ_{j=1}^{I} α_ij h_j    (18)

wherein

α_ij = exp(e_ij) / Σ_{k=1}^{I} exp(e_ik)    (19)

e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)    (20)

h_j is the hidden state corresponding to the j-th word in the source sentence, and v_a, W_a and U_a are all weight matrices.
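A small numpy sketch of the additive attention in equations (18)-(20); shapes and matrix names are illustrative assumptions:

```python
import numpy as np

# Sketch of equations (18)-(20): alignment scores e_ij, attention weights
# alpha_ij, and the context vector c_i.

def attention_context(s_prev, H, W_a, U_a, v_a):
    """s_prev: (n,) previous decoder state; H: (J, 2n) encoder states h_1..h_J."""
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])  # e_ij
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                                  # alpha_ij
    c = alpha @ H                                                         # c_i = sum_j alpha_ij h_j
    return c, alpha
```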
According to any of the above embodiments, the raw real data is from a public data set or manually prepared data;
specifically, the constructed data sets are from five public data sets NIST, Tanzil, WMT14, IWSLT14, and IWSLT 15. The NIST data set contains chinese → english (Zh → En); the Tanzil dataset contains asebai → english (Az → En), indian → english (Hi → En), uzbekkaiki → english (Uz → En), uygur → english (Ug → En) and turkey → english (Tr → En); WMT14 contains english → german (En → De); IWSLT14 contains german → english (De → En); IWSLT15 contains vietnamese → english (Vi → En). Selecting a corresponding training set, a corresponding development set and a corresponding test set according to each language pair; it is emphasized that the parallel corpus referred to herein has no specific labeling information such as language direction (e.g., <2ch > indicates the direction of language from the source language to Chinese).
The low-resource language pair is a language pair with the parallel corpus size smaller than a preset value and comes from an open data set;
the monolingual corpus is a monolingual corpus of the source language or the target language of the low-resource language pair and comes from manually prepared data;
the obtaining of negative sample data by negative sampling of the original real data includes: and generating negative sample data by randomly discarding or randomly adding the original real data.
Specifically, the negative sampling method employed in generating the negative examples is to randomly discard a word so that the original real sentence generates a grammatical error (semantic error or syntactic error). Further, because there are many methods of negative sampling, in order to make the negative sample contain enough noise to train a better-performing discriminator, a method of randomly discarding words is selected from many negative sampling methods. In fact, a certain position can be randomly selected to insert a new word to break the integrity of the whole sentence, or different words can be randomly selected from the original sentence and the positions can be exchanged. However, other methods do not produce enough noise in most cases, and therefore the present invention chooses a method to randomly discard words.
Some negative samples are generated by the negative sampling method. Suppose, without loss of generality, that the source side is the side to be enhanced, so only the source side is negatively sampled (Negative Sampling) here to generate negative samples; the enhancement of the target side is analogous, i.e., when the target side needs to be enhanced, the target-side monolingual data is negatively sampled. Let the original real data be D_r(x) and the generated negative sample data be D_n(x). New negative samples are generated by random deletion or random insertion. For example, for the original real sentence S_r = w_1, w_2, w_3, …, w_i, …, w_I, the negative sample after random deletion (assuming w_2 is deleted) is S_n = w_1, w_3, …, w_i, …, w_I, and its length changes from I to I - 1; the negative sample after random insertion (assuming the word w_l is inserted after w_i) is S_n = w_1, w_2, …, w_i, w_l, …, w_I, and its length changes from I to I + 1.
FIG. 4 shows the process of generating negative samples. The discriminator submodel mentioned in the present invention is a core part of the invention and also one of its highlights. Since training the discriminator submodel requires preparing negative samples, the generation of negative samples is also an important part of the invention. The negative samples used are the noisy negative sample data D_n(x) constructed from the original real data D_r(x) by the negative sampling method shown in FIG. 4.
Negative sampling also has broad applications in many tasks in machine learning and natural language processing. Negative sampling refers to randomly generating negative samples related to the positive samples in the training data, and it plays different roles in different machine learning tasks. For example, in contrastive learning, negative sampling is used to achieve the training goal of contrastive learning, increasing the distance between the representations of positive and negative samples. In word2vec, negative sampling is used to reduce the number of model parameters updated at each step, which can effectively improve training efficiency. In machine translation, the use of negative sampling follows its use in machine learning: from the original real sentence S_r = w_1, w_2, w_3, …, w_i, …, w_I, a new sequence S_n is generated by the negative sampling strategy, but the newly generated sentence S_n differs to some extent from the original sentence S_r. In most cases, S_n and S_r differ at the syntactic or semantic level, which is exactly the goal that negative sampling is expected to achieve.
As shown in FIG. 4, a common negative sampling method is to randomly delete some sentence components (core components) from the original real sentence S_r, or to randomly add some components (irrelevant components), so that the generated S_n contains, as far as possible, some syntactic or semantic errors. Whether random deletion or random addition is used to generate S_n from S_r, samples free of syntactic and semantic errors are sometimes produced, which resemble positive samples more than negative samples. For example, among the sentences generated from S_r by random deletion in FIG. 4, the one in which "is" is deleted contains a syntax error, whereas the one in which "carefully" is deleted contains no syntax error. Likewise, among the sentences generated by random addition, the one in which "math" is added contains a syntax error, whereas the one in which "now" is added contains no syntax error. During negative sampling, it is desirable to generate, as far as possible, sentences with grammatical errors rather than sentences without grammatical errors; therefore, negative sampling is applied in a relatively aggressive manner, ensuring as far as possible that the sampled samples are truly negative samples while avoiding producing positive samples.
Aiming at the data scarcity problem faced by neural machine translation in low-resource scenarios, the invention provides a constrained sampling strategy. In this method, negative samples are prepared in order to train the discriminator submodel: negative samples D_n(x) are generated from the real data D_r(x) by randomly deleting some words.
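A minimal sketch of this negative-sample construction (random word deletion, with random insertion as an alternative); the probabilities and interfaces are illustrative:

```python
import random

# Sketch of negative sampling: corrupt a real sentence S_r by randomly deleting
# a word (or randomly inserting one) so that the result is likely to contain a
# grammatical error.

def negative_sample(sentence, vocab, p_delete=0.8):
    words = sentence.split()
    if len(words) < 2:
        return sentence
    if random.random() < p_delete:
        # Random deletion: drop one word, length I -> I - 1.
        del words[random.randrange(len(words))]
    else:
        # Random insertion: add a word after a random position, length I -> I + 1.
        words.insert(random.randrange(len(words)) + 1, random.choice(vocab))
    return " ".join(words)
```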
Based on any of the above embodiments, after obtaining the original real data including the parallel corpus and the monolingual corpus of the low-resource language pair, the method further includes:
and carrying out cleaning data preprocessing on the original real data comprising the source language or the target language and secondary preprocessing comprising cleaning data, eliminating blank lines, eliminating illegal characters and non-English characters at a target end.
Specifically, the data preprocessing step processes the source-language text and the target-language text in the data set; for example, the data is cleaned using the preprocessing tool provided by NiuTrans to eliminate illegal characters (the target side of the parallel sentence pairs used in the experiments of the present invention is English). In addition, a series of preprocessing tools developed in the Python language perform further operations, including secondary preprocessing (cleaning the data again, eliminating blank lines, eliminating illegal characters and non-English characters at the target side, etc.). In the preprocessing stage, data in languages other than Chinese is tokenized using tokenizer.perl provided by the open-source statistical machine translation system MOSES, and Chinese data is segmented using the THULAC toolkit (a Chinese word segmentation tool developed by a university natural language processing laboratory). Then, for all corpora, the BPE method is used for sub-word segmentation.
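A minimal sketch of the secondary preprocessing (cleaning, blank-line removal, illegal-character filtering); the exact character classes are assumptions, and the external tokenizer.perl, THULAC and BPE steps are not reproduced here:

```python
import re

# Sketch of the secondary preprocessing: strip control characters, drop blank
# lines, and (for the English target side) remove non-English characters.

CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
NON_ENGLISH = re.compile(r"[^A-Za-z0-9\s\.,;:!?'\"()\-]")

def clean_lines(lines, english_target=False):
    for line in lines:
        line = CONTROL_CHARS.sub("", line).strip()
        if english_target:
            line = NON_ENGLISH.sub("", line)
        if line:  # eliminate blank lines
            yield line
```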
Based on any of the above embodiments, as shown in fig. 5, the data enhancement based on the edit distance constructs the original real data into pseudo data, including:
step 510, performing edit distance sampling on the original real data based on an edit distance sub-model;
step 520, selecting the position of the replacement word based on the position sub-model and the sampled editing distance;
step 530, replacing the new word at the position of the replacement word based on the replacement sub-model, and obtaining the pseudo data.
Specifically, data enhancement based on edit distance constrained sampling is also a word-level data enhancement method, i.e., a word replacement process. This step can be viewed as a constrained sampling process divided into three sub-steps: the first is edit distance sampling, the second is calculating the positions based on the edit distance, and the third is calculating the replacement words based on the previous two steps.
FIG. 6 shows one of the most common methods in the field of data enhancement, often used in other tasks, whether neural machine translation or natural language processing.
The data enhancement method refers to a method of making the scale of training data large. Data enhancement has also been widely used not only in machine translation, but also in natural language processing tasks such as dialog generation, question answering, machine writing, and natural language reasoning.
In machine translation, the commonly used data enhancement method is mainly performed from two perspectives of "word level" and "sentence level".
Word-level data enhancement performs data enhancement by randomly replacing words, randomly inserting words, randomly deleting words, randomly exchanging the positions of different words, etc. For example, a word is randomly selected from the original sentence and then replaced with a word from the dictionary (in FIG. 6, "on" is replaced with "is" and "story" with "material"), or a word is randomly selected from the original sentence and its position is exchanged with that of another word in the sentence (in FIG. 6, "now" is originally the second word, but in the first generated sentence its position is exchanged with that of the first word), thereby generating a new sentence. A typical work on word-level data enhancement methods is the random substitution method proposed by Fadaee et al.
Sentence-level data enhancement is mainly performed by means of Back Translation (BT), Forward Translation (FT), and some modified versions of the Back Translation, such as Tagged Back Translation (Tagged BT). Among various sentence-level data enhancement methods, a common method is translation, and the core idea is to make full use of the existing bilingual data and train a reverse neural machine translation model, construct a pseudo source-end sentence through the target-end monolingual data, thereby forming pseudo parallel data, and combine the pseudo parallel data with the original bilingual corpus to perform data enhancement.
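A brief sketch of the back-translation idea described above (the reverse_model.translate interface is an assumption for exposition):

```python
# Sketch of sentence-level back translation (BT): a reverse (target-to-source)
# model creates pseudo source sentences from target-side monolingual data; the
# resulting pseudo pairs are merged with the original bilingual corpus.

def back_translate(target_monolingual, reverse_model):
    pseudo_parallel = []
    for y in target_monolingual:
        x_pseudo = reverse_model.translate(y)  # pseudo source-side sentence
        pseudo_parallel.append((x_pseudo, y))  # pseudo parallel pair
    return pseudo_parallel
```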
However, whether it is a BT method at sentence level or a substitution, exchange, or the like at word level, some sentences containing syntax errors are often generated, which are actually undesired sentences. Therefore, in order to alleviate the syntax problem after data enhancement to a certain extent, a method for generating high-quality data based on edit distance constraint sampling is provided.
Based on any of the above embodiments, the edit distance submodel is represented as follows:

P(d | x) = exp(-d/τ) · c(d, I) / Σ_{d'=0}^{I} exp(-d'/τ) · c(d', I)    (21)

wherein τ represents a temperature hyperparameter, and c(d, I) represents the number of sentences at edit distance d (d ∈ {0, 1, 2, 3, …, I}) from a sentence of length I;

the position submodel is represented as follows:

P(p | x, d) = 1 / C(I, d)    (22)

i.e., the set of d edited positions is drawn uniformly from the positions of the sentence;

the replacement submodel is represented as follows:

P(w_j | x, d, p) = P(w_j | w_{j-1}, p_j);    (23)

wherein w_j is the new word sampled at step j and p_j is the sampling position.
Specifically, the present invention develops a constrained sampling method for low-resource neural machine translation. The enhancement steps of the invention follow essentially the same idea as pure word-replacement methods, but words are replaced with a sampling method different from random sampling. Let P_r(x) be the real data distribution and x̂ denote a forged sentence generated from the original sentence x. Given x, let P̂(x̂ | x) denote the distribution of the pseudo data x̂. The distribution of the enhancement data can then be expressed as:
P̂(x̂) = Σ_x P_r(x) · P̂(x̂ | x)    (24)
x̂ is generated from x by replacing words based on the edit distance. For example, assume x is "I like the book" and x̂ is "I like a movie"; the edit distance is 2, the positions of the replaced words are 3 and 4, and the replacement words are "a" and "movie". More formally, d denotes the edit distance between x and x̂, p = {p_1, …, p_d} denotes the positions of the replaced words, and w = {w_1, …, w_d} denotes the list of replacement words. In the above example, d = 2, p_1 = 3, p_2 = 4, w_1 = "a", and w_2 = "movie". According to the distribution P_r(x) of the original real data, the enhancement distribution is defined as:

P̂(x̂ | x) = P(d | x) · P(p | x, d) · Π_{j=1}^{d} P(w_j | x, d, p)    (25)
more precisely, the constrained sampling method herein consists of 3 parts as follows:
D1. The edit distance d is sampled according to the real data sample. Following the idea of the edit distance, the edit distance submodel is defined as follows:

P(d | x) = exp(-d/τ) · c(d, I) / Σ_{d'=0}^{I} exp(-d'/τ) · c(d', I)    (26)

where τ represents a temperature hyperparameter that limits the search space around the original sentence; it follows that a larger τ yields more samples with longer edit distances. c(d, I) represents the number of sentences at edit distance d (d ∈ {0, 1, 2, 3, …, I}) from a sentence of length I, which can be obtained as follows:

c(d, I) = C(I, d) · (|V| - 1)^d    (27)

where V represents the vocabulary and C(I, d) is the binomial coefficient.
D2. When replacing words, the positions of the words are selected according to the sampled edit distance d. The position submodel is defined as follows:

P(p | x, d) = 1 / C(I, d)    (28)

i.e., a set of d distinct positions is drawn uniformly. According to the above sampling method, the position set p = {p_1, p_2, p_3, …, p_d} is obtained. This approach essentially guarantees an edit distance of d between the new sentence and the initial sentence.
D3. The replacement submodel replaces a new word at each sampled position p_j. This process takes d steps in total; in each step (say the j-th step), a new word w_j can be sampled from the distribution P(w | X_{j-1}, p = p_j), and the old word at position p_j of X_{j-1} is then replaced to generate X_j. Finally, the replacement submodel is defined as:

P(w_j | x, d, p) = P(w_j | w_{j-1}, p_j)    (29)

where a constrained sampling scheme is used to sample w_j so as to maximize the language model score of the sequence X_j.
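A minimal sketch of steps D1-D3, under the assumptions that c(d, I) = C(I, d)·(|V| - 1)^d, that positions are drawn uniformly, and that score_fn is some external language-model scorer (all of these interfaces are illustrative, not the patent's reference implementation):

```python
import math
import random

# Sketch of edit-distance constrained sampling: sample d, sample d positions,
# then pick each replacement word to maximize a language-model score.

def sample_edit_distance(I, vocab_size, tau=1.0):
    """D1: sample d with P(d) proportional to exp(-d/tau) * C(I, d) * (|V|-1)^d
    (weights computed in log space for numerical stability)."""
    log_w = [(-d / tau)
             + (math.lgamma(I + 1) - math.lgamma(d + 1) - math.lgamma(I - d + 1))
             + d * math.log(vocab_size - 1)
             for d in range(I + 1)]
    m = max(log_w)
    weights = [math.exp(w - m) for w in log_w]
    return random.choices(range(I + 1), weights=weights, k=1)[0]

def sample_positions(I, d):
    """D2: choose d distinct positions to edit, uniformly at random."""
    return sorted(random.sample(range(I), d))

def constrained_augment(words, vocab, score_fn, tau=1.0):
    """D3: at each sampled position, choose the replacement word that maximizes
    the language-model score of the partially edited sentence."""
    I = len(words)
    d = sample_edit_distance(I, len(vocab), tau)
    new_words = list(words)
    for p in sample_positions(I, d):
        new_words[p] = max(vocab, key=lambda w: score_fn(new_words[:p] + [w] + new_words[p + 1:]))
    return new_words
```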
Based on any of the above embodiments, the discriminator submodel is configured to distinguish the distribution of the original real data from the distribution of the pseudo data, and comprises a discriminator loss function, which is expressed as follows:

L_D = E_{x∼P_r(x)}[(D̂(x) - 1)^2] + E_{x̂∼P̂(x̂)}[D̂(x̂)^2]

wherein L_D is the loss function, D̂ is the discriminator, P_r(x) is the distribution of the original real data, and P̂(x̂) is the distribution of the pseudo data.
Specifically, in order to improve the quality of the pseudo data, the discriminator submodel D̂, also called the data filter, is trained on the original real data D_r(X) and the negative samples D_n(X) obtained by negative sampling. The discriminator acts as a filter for the pseudo data obtained after the constrained sampling enhancement. The discriminator submodel is similar to a GAN discriminator and is mainly used to distinguish the distribution P_r(x) of the real data from the distribution P̂(x̂) of the pseudo data. Similar to the least-squares generative adversarial network (LSGAN), the discriminator loss function is set as follows:

L_D = E_{x∼P_r(x)}[(D̂(x) - 1)^2] + E_{x̂∼P̂(x̂)}[D̂(x̂)^2]    (30)

The loss function L_D makes the reward the discriminator D̂ assigns to real data higher than that assigned to pseudo data. Thus, the discriminator D̂ can select pseudo data of higher quality, i.e., data closer to the distribution of the real data. The data enhancement method described above targets the source side, but it can easily be extended to the target side.
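A small sketch of this least-squares discriminator objective (the discriminator itself is assumed to be any sentence encoder producing a scalar score; the function below is illustrative):

```python
import numpy as np

# Sketch of the LSGAN-style loss in equation (30): mean((D(x) - 1)^2) on real
# data plus mean(D(x_hat)^2) on pseudo data.

def discriminator_loss(d_real, d_fake):
    """d_real, d_fake: arrays of discriminator scores on real and pseudo batches."""
    d_real, d_fake = np.asarray(d_real), np.asarray(d_fake)
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)
```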
The invention provides a data enhancement-based low-resource neural machine translation system, and the following description and the above-described data enhancement-based low-resource neural machine translation method can be referred to correspondingly.
Fig. 7 is a schematic structural diagram of a data enhancement-based low-resource neural machine translation system according to an embodiment of the present invention, as shown in fig. 7, the system includes a data determination unit 710 and a machine translation unit 720;
the data determining unit 710 is configured to determine real data to be translated;
the machine translation unit 720 is configured to input the real data to be translated into a neural machine translation model, so as to obtain a neural machine translation result output by the neural machine translation model;
the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is carried out on original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair.
According to the system provided by the embodiment of the invention, the real data is input into the neural machine translation model to obtain the neural machine translation result output by the neural machine translation model, so that the problem of resource shortage in low-resource neural machine translation can be efficiently and accurately solved.
Based on any of the above embodiments, as shown in fig. 8, the machine translation unit includes a data acquisition unit 810, a model training unit 820, a data screening unit 830, and a data enhancement unit 840;
the data obtaining unit 810 is configured to obtain original real data including a parallel corpus and a monolingual corpus of a low-resource language pair, and perform negative sampling on the original real data to obtain negative sample data;
the model training unit 820 is configured to train a discriminator sub-model based on the original real data and the negative sample data to obtain an evaluation model;
the data screening unit 830 is configured to build the original real data into pseudo data based on data enhancement of an editing distance, and screen the pseudo data based on the evaluation model to obtain screened data;
the data enhancement unit 840 is configured to combine the filtered data and the original real data to construct enhanced data, and train a low-resource neural machine translation model using the enhanced data and an attention-based encoder/decoder translation framework to obtain the neural machine translation model.
According to any of the above embodiments, the raw real data is from a public data set or manually prepared data;
the low-resource language pair is a language pair with the parallel corpus size smaller than a preset value and comes from an open data set;
the monolingual corpus is a monolingual corpus of the source language or the target language of the low-resource language pair and comes from manually prepared data;
the obtaining of negative sample data by negative sampling of the original real data includes: and generating negative sample data by randomly discarding or randomly adding the original real data.
Based on any embodiment of the foregoing, after the data obtaining unit is configured to obtain original real data including parallel corpora and monolingual corpora of the low-resource language pair, the data obtaining unit further includes:
and carrying out cleaning data preprocessing on the original real data comprising the source language or the target language and secondary preprocessing comprising cleaning data, eliminating blank lines, eliminating illegal characters and non-English characters at a target end.
Based on any of the above embodiments, as shown in fig. 9, the data filtering unit includes an edit distance module 910, a location selection module 920, and a location replacement module 930;
the edit distance module 910 is configured to perform edit distance sampling on the original real data based on an edit distance submodel;
the position selection module 920 is configured to select a position of a replacement word based on the position sub-model and the sampled edit distance;
the position replacement module 930 is configured to replace a new word at the position of the replacement word based on the replacement sub-model, so as to obtain pseudo data.
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 10, the electronic device may include: a processor (processor) 1010, a communication interface (Communications Interface) 1020, a memory (memory) 1030, and a communication bus 1040, wherein the processor 1010, the communication interface 1020, and the memory 1030 communicate with each other via the communication bus 1040. The processor 1010 may invoke logic instructions in the memory 1030 to perform the data enhancement-based low-resource neural machine translation method, comprising: determining real data to be translated; inputting the real data to be translated into a neural machine translation model to obtain a neural machine translation result output by the neural machine translation model; the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is carried out on original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair.
Furthermore, the logic instructions in the memory 1030 can be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, an embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer, the computer is capable of executing the data enhancement-based low-resource neural machine translation method provided by the above methods, where the method includes: determining real data to be translated; inputting the real data to be translated into a neural machine translation model to obtain a neural machine translation result output by the neural machine translation model; the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is carried out on original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair.
In yet another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program being implemented by a processor to perform the data enhancement-based low-resource neural machine translation method provided in the foregoing, the method including: determining real data to be translated; inputting the real data to be translated into a neural machine translation model to obtain a neural machine translation result output by the neural machine translation model; the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is carried out on original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A data enhancement-based low-resource neural machine translation method is characterized by comprising the following steps:
determining real data to be translated;
inputting the real data to be translated into a neural machine translation model to obtain a neural machine translation result output by the neural machine translation model;
the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is carried out on original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair.
2. The method for low-resource neural machine translation based on data enhancement of claim 1, wherein the neural machine translation model is obtained by training a low-resource neural machine translation model after data enhancement is performed on original real data including parallel corpora and monolingual corpora of a low-resource language pair, and comprises:
acquiring original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair, and carrying out negative sampling on the original real data to obtain negative sample data;
training a discriminator sub-model based on the original real data and the negative sample data to obtain an evaluation model;
constructing pseudo data from the original real data based on edit-distance data enhancement, and screening the pseudo data based on the evaluation model to obtain screened data;
and combining the screened data with the original real data to construct enhanced data, and training a low-resource neural machine translation model by using the enhanced data and an attention-based encoder/decoder translation framework to obtain the neural machine translation model.
3. The data enhancement-based low-resource neural machine translation method of claim 2, wherein the original real data comes from a public data set or manually prepared data;
the low-resource language pair is a language pair with the parallel corpus size smaller than a preset value and comes from an open data set;
the monolingual corpus is a monolingual corpus of the source language or the target language of the low-resource language pair and comes from manually prepared data;
the obtaining of negative sample data by negative sampling of the original real data includes: generating the negative sample data by randomly discarding content from, or randomly adding content to, the original real data.
4. The method for low-resource neural machine translation based on data enhancement of claim 2, wherein after obtaining the original real data including the parallel corpora and the monolingual corpora of the low-resource language pair, further comprising:
performing cleaning preprocessing on the original real data of the source language or the target language, and secondary preprocessing that includes data cleaning, removing blank lines, and removing illegal characters and non-English characters at the target end.
5. The data enhancement-based low-resource neural machine translation method according to claim 2, wherein the data enhancement based on edit distance constructs the original real data into pseudo data, comprising:
performing edit distance sampling on the original real data based on an edit distance submodel;
selecting the position of the replacement word based on the position sub-model and the sampled editing distance;
and substituting a new word at the selected position based on the replacement sub-model to obtain the pseudo data.
6. The data enhancement-based low-resource neural machine translation method of claim 5, wherein the edit distance sub-model is represented as follows:
Figure FDA0003184546170000021
wherein τ denotes a temperature hyper-parameter, and c(d, I) denotes the number of sentences with edit distance d (d ∈ {0, 1, 2, 3, …, I}) and length I;
the position sub-model is represented as follows:
Figure FDA0003184546170000022
the replacement sub-model is represented as follows:
P(w_j | x, d, p) = P(w_i | w_{i-1}, p_i);
wherein w_j denotes the new word sampled at step j, and p_j denotes the sampling position.
7. The data-enhancement-based low-resource neural machine translation method of claim 2, wherein the discriminator submodel is used for distinguishing the distribution of original real data from the distribution of pseudo data, and comprises a discriminator loss function represented as follows:
Figure FDA0003184546170000031
wherein
Figure FDA0003184546170000032
denotes the loss function,
Figure FDA0003184546170000033
denotes the discriminator, and P_r(x) denotes the distribution of the original real data.
8. A data enhancement-based low-resource neural machine translation system, comprising:
the data determining unit is used for determining real data to be translated;
the machine translation unit is used for inputting the real data to be translated into a neural machine translation model to obtain a neural machine translation result output by the neural machine translation model;
the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is carried out on original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the data enhancement-based low-resource neural machine translation method of any one of claims 1 to 7.
10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, performs the steps of the data-enhancement-based low-resource neural machine translation method of any one of claims 1 to 7.
CN202110857215.5A 2021-07-28 2021-07-28 Low-resource neural machine translation method and system based on data enhancement Pending CN113673259A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110857215.5A CN113673259A (en) 2021-07-28 2021-07-28 Low-resource neural machine translation method and system based on data enhancement

Publications (1)

Publication Number Publication Date
CN113673259A true CN113673259A (en) 2021-11-19

Family

ID=78540422

Country Status (1)

Country Link
CN (1) CN113673259A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination