CN113673259A - Low-resource neural machine translation method and system based on data enhancement - Google Patents

Low-resource neural machine translation method and system based on data enhancement

Info

Publication number
CN113673259A
CN113673259A (Application CN202110857215.5A)
Authority
CN
China
Prior art keywords
data, machine translation, resource, neural machine, low
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110857215.5A
Other languages
Chinese (zh)
Inventor
刘洋
米尔阿迪力江·麦麦提
栾焕博
孙茂松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202110857215.5A
Publication of CN113673259A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/40: Processing or translation of natural language
    • G06F 40/42: Data-driven translation
    • G06F 40/53: Processing of non-Latin text
    • G06F 40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Abstract

The invention provides a low-resource neural machine translation method and system based on data enhancement. The method comprises: determining real data to be translated; and inputting the real data to be translated into a neural machine translation model to obtain the neural machine translation result output by the model. The neural machine translation model is obtained by performing data enhancement on original real data, comprising parallel corpora and monolingual corpora of a low-resource language pair, and then training a low-resource neural machine translation model on the enhanced data. Embodiments of the invention make sparse data usable for training a neural machine translation model for low-resource translation, and can efficiently and accurately alleviate the resource shortage in low-resource neural machine translation.

Description

Low-resource neural machine translation method and system based on data enhancement
Technical Field
The invention relates to the technical field of machine translation, in particular to a low-resource neural machine translation method and system based on data enhancement.
Background
Translation between low-resource languages and Chinese is currently an urgent and important task. Commonly used techniques for automatic machine translation fall into statistics-based and neural network-based methods: the former is statistical machine translation, the latter neural machine translation. Obtaining a reliable translation model requires collecting large-scale, high-quality parallel corpora, which often exist only between a few languages and are limited to certain specific domains such as government documents and news, while corpora in other domains are comparatively scarce. Beyond the domain problem, some languages are inherently resource-poor, and finding or obtaining usable parallel corpora from the Internet is very difficult. Neural machine translation now surpasses traditional statistical machine translation in translation quality, but its main drawback is that training the translation model depends heavily on large-scale parallel corpora.
The large amount of linguistic data on the Internet makes it possible to acquire parallel corpora covering multiple languages and domains. However, among corpora obtained from the Internet, few belong to a specific domain: news corpora, for example, are easy to obtain, but corpora in domains such as government, film, trade, education, sports, literature, and medicine are hard to obtain. If the training set, the development set (used for tuning the trained model), and the test set belong to the same domain, the translation results on in-domain corpora are very good; otherwise, the results on out-of-domain corpora are very poor. Although research on neural machine translation for high-resource languages has achieved excellent results, in machine translation tasks for low-resource languages, parallel corpora are difficult to obtain, let alone parallel corpora in a specific domain. The resulting problem is data sparsity (Data Sparsity): if the translation model cannot be trained sufficiently, even the currently most popular and effective neural machine translation methods for high-resource language pairs are difficult to apply to low-resource machine translation. Therefore, low-resource machine translation is one of the problems that urgently needs to be solved.
Disclosure of Invention
The embodiment of the invention provides a low-resource neural machine translation method and system based on data enhancement, which are used for solving the problem that a neural machine translation model cannot be applied to low-resource neural machine translation due to data sparsity at present.
In a first aspect, an embodiment of the present invention provides a data enhancement-based low-resource neural machine translation method, including:
determining real data to be translated;
inputting the real data to be translated into a neural machine translation model to obtain a neural machine translation result output by the neural machine translation model;
the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is carried out on original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair.
Further, the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is performed on original real data including parallel corpora and monolingual corpora on the low-resource language pair, and includes:
acquiring original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair, and carrying out negative sampling on the original real data to obtain negative sample data;
training a discriminator sub-model based on the original real data and the negative sample data to obtain an evaluation model;
constructing the original real data into pseudo data based on data enhancement of an editing distance, and screening the pseudo data based on the evaluation model to obtain screened data;
and combining the screened data with the original real data to construct enhanced data, and training a low-resource neural machine translation model using the enhanced data and an attention-based encoder-decoder translation framework to obtain the neural machine translation model.
Further, the original real data comes from a public data set or manually prepared data;
the low-resource language pair is a language pair with the parallel corpus size smaller than a preset value and comes from an open data set;
the monolingual corpus is a monolingual corpus of the source language or the target language of the low-resource language pair and comes from manually prepared data;
the obtaining of negative sample data by negative sampling of the original real data includes: and generating negative sample data by randomly discarding or randomly adding the original real data.
Further, after obtaining the original real data including the parallel corpus and the monolingual corpus of the low-resource language pair, the method further includes:
and carrying out cleaning data preprocessing on the original real data comprising the source language or the target language and secondary preprocessing comprising cleaning data, eliminating blank lines, eliminating illegal characters and non-English characters at a target end.
Further, the data enhancement based on the edit distance constructs the original real data into pseudo data, including:
performing edit distance sampling on the original real data based on an edit distance submodel;
selecting the position of the replacement word based on the position sub-model and the sampled editing distance;
and replacing the new word at the position of the replacement word based on the replacement sub-model to obtain pseudo data.
Further, the edit distance submodel is represented as follows:

P(d | x) = exp(-d/τ) · c(d, I) / Σ_{d'=0}^{I} exp(-d'/τ) · c(d', I)

wherein τ represents a temperature hyperparameter, and c(d, I) represents the number of sentences at edit distance d (d ∈ {0, 1, 2, 3, …, I}) from a sentence of length I;

the position submodel is represented as follows:

P(p | x, d) = 1 / C(I, d)

i.e., the set of d edited positions p = {p_1, …, p_d} is drawn uniformly from the positions of the sentence;

the replacement submodel is represented as follows:

P(w_j | x, d, p) = P(w_j | w_{j-1}, p_j);

wherein w_j is the new word sampled at step j and p_j is the sampling position.
Further, the discriminator submodel is used for distinguishing the distribution of the original real data from the distribution of the pseudo data, and comprises a discriminator loss function, which is expressed as follows:

L_D = E_{x∼P_r(x)}[(D̂(x) - 1)^2] + E_{x̂∼P̂(x̂)}[D̂(x̂)^2]

wherein L_D is the loss function, D̂ is the discriminator, P_r(x) is the distribution of the original real data, and P̂(x̂) is the distribution of the pseudo data.
In a second aspect, an embodiment of the present invention provides a data enhancement-based low-resource neural machine translation system, including:
the data determining unit is used for determining real data to be translated;
the machine translation unit is used for inputting the real data to be translated into a neural machine translation model to obtain a neural machine translation result output by the neural machine translation model;
the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is carried out on original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the program to implement the steps of the data enhancement-based low-resource neural machine translation method according to any one of the above first aspects.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the data enhancement-based low-resource neural machine translation method according to any one of the first aspect.
According to the low-resource neural machine translation method and system based on data enhancement, real data are input into a neural machine translation model, and a neural machine translation result output by the neural machine translation model is obtained; the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is carried out on original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair. The invention realizes the application of sparse data to the neural machine translation model of low-resource neural machine translation, and can efficiently and accurately solve the problem of resource shortage in low-resource neural machine translation.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a data enhancement-based low-resource neural machine translation method provided by the invention;
FIG. 2 is a schematic diagram of a training process of a neural machine translation model provided by the present invention;
FIG. 3 is a diagram of a data enhancement architecture for constrained sampling based on edit distance provided by the present invention;
FIG. 4 is a sample diagram of a process for generating negative examples provided by the present invention;
FIG. 5 is a schematic diagram of a process for constructing pseudo data according to the present invention;
FIG. 6 is a diagram illustrating a word-level data enhancement method provided by the present invention;
FIG. 7 is a schematic structural diagram of a data enhancement-based low-resource neural machine translation system provided by the present invention;
FIG. 8 is a schematic diagram of the structure of a machine translation unit provided by the present invention;
FIG. 9 is a schematic structural diagram of a data screening unit provided in the present invention;
fig. 10 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The technical idea of the invention is as follows: the translation quality of neural machine translation is closely related to the quantity and quality of the corpus. To train a high-quality neural machine translation model, commercial-grade translation systems often carefully process (filter) tens of millions or even billions of bilingual parallel sentence pairs and train the model on them. For resource-rich languages such as English, French, German and Chinese, corpora of such scale can indeed be obtained. For resource-poor languages, however, obtaining massive parallel corpora is very difficult. Related research has shown that, when the parallel corpus size is limited, the translation quality of neural machine translation is inferior to that of statistical machine translation. Therefore, how to realize high-quality neural machine translation when parallel corpora are limited has become a hot research topic in the machine translation community. In addition, languages themselves have historical, cultural and regional characteristics. These properties are evident in many resource-poor languages, which also poses some challenges for the study of low-resource neural machine translation.
It is an object of the invention to achieve high-quality data enhancement. Although various researchers have used data enhancement to generate bilingual corpora from monolingual corpora, it is difficult to avoid syntactic and semantic errors in the generated pseudo data, whether sentence-level enhancement (generating pseudo data by translating a monolingual corpus) or word-level enhancement (replacing words of a low-resource source language and words of a high-resource source language with each other, i.e., constructing a dictionary by a word-vector method and then replacing words of the high-resource source language with words of the low-resource language) is used. In other words, it should be possible to construct high-quality pseudo data from existing monolingual or bilingual data and to expand the scale of the bilingual data by making full use of existing data, thereby further improving the quality of low-resource neural machine translation. For example, for the existing low-resource language pair Uzbek → Chinese (Uz → Zh), the source side (Uzbek, Uz) or the target side (Chinese, Zh) can be enhanced with the method proposed in the present invention to achieve a better translation effect. Although the data sets of low-resource languages are much smaller, the degradation of translation quality caused by data sparsity needs to be avoided as much as possible. If a method for training a better translation model between low-resource languages, or between a high-resource and a low-resource language, can be provided on the basis of an efficient and easy-to-use data enhancement technique, this problem will no longer be an obstacle. Therefore, the invention provides a data-enhanced low-resource neural machine translation method based on edit distance constrained sampling. The constrained sampling method based on edit distance is more efficient than the random sampling methods used in previous work. In addition, the invention designs a discriminator submodel to select higher-quality data after generation; this submodel filters the generated data so that, to a certain extent, only pseudo data with few syntactic and semantic errors is retained. In summary, a simple and effective data enhancement method is urgently needed in low-resource machine translation, so as to solve the problem that the quality of pseudo data is difficult to guarantee in the prior art, improve the performance of low-resource neural machine translation, improve translation efficiency, and obtain a large amount of more accurate data.
The following describes a data enhancement-based low-resource neural machine translation method and system provided by the present invention with reference to fig. 1 to 10.
The embodiment of the invention provides a low-resource neural machine translation method based on data enhancement. Fig. 1 is a schematic flowchart of a data-enhancement-based low-resource neural machine translation method provided in an embodiment of the present invention, and as shown in fig. 1, the method includes:
step 110, determining real data to be translated;
step 120, inputting the real data to be translated into a neural machine translation model to obtain a neural machine translation result output by the neural machine translation model;
the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is carried out on original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair.
According to the method provided by the embodiment of the invention, the real data is input into the neural machine translation model to obtain the neural machine translation result output by the neural machine translation model, so that the problem of resource shortage in low-resource neural machine translation can be efficiently and accurately solved.
Based on any of the above embodiments, as shown in fig. 2, the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is performed on original real data including parallel corpora and monolingual corpora on a low-resource language pair, and includes:
step 210, acquiring original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair, and performing negative sampling on the original real data to obtain negative sample data;
specifically, Parallel corpora (Parallel Corpus) and MonoLingual corpora (MonoLingual Corpus) on a Low-Resource (Low-Resource) language pair are prepared as original Real Data (Real Data).
Several elements of parallel and monolingual corpora on low-resource language pairs are prepared:
A. the factors to be considered in selecting a low-resource language pair are the parallel corpus size, the number of non-repeating words in the monolingual corpus, and the recognition level of the data set.
B. Generally, a low-resource language pair as referred to herein is a language pair with few parallel corpora, for example, a language pair whose parallel corpus contains fewer than 1,000,000, fewer than 500,000, fewer than 200,000, or even fewer than 10,000 sentence pairs.
C. The monolingual corpus used in the invention is a monolingual corpus of the source (Source) language or the target (Target) language of the low-resource language pair, not a monolingual corpus of a language in some other high-resource language pair.
D. Meanwhile, the quality of the monolingual corpus also needs to be carefully considered, because in the present invention, the monolingual corpus is always considered as the original real data. If the quality of the prepared monolingual corpus is problematic, poor quality samples may occur when constructing the dummy data by the method proposed in the present invention.
E. Monolingual corpora must be prepared manually, not through machine-generated data. If the monolingual corpus is not manually prepared data, the monolingual corpus cannot be said to be real data. The monolingual corpus required in the present invention is not only raw data but also real data.
Step 220, training a discriminator submodel based on the original real data and the negative sample data to obtain an evaluation model;
specifically, when a negative sample is generated on the original real data by using the negative sampling method, the more noise the generated sample contains, the better. If more noise is contained in the negative samples, the trained Discriminator Sub-Model (Discriminator Sub-Model) performs better. The positive and negative samples are needed in training the discriminator submodel, respectively using Dr(X) and Dn(X) represents. The Negative Sampling method (Negative Sampling) employed in the present invention is a method of random discard (Randomly infection). In practice, as long as D can be guaranteednThe noise in (X) is sufficient from Dr(X) production of DnIn the case of (X), other methods may be used. If D isnInsufficient noise in (X) may affect the performance of the discriminator submodel。
That is, on the original real data, a new Negative sample data, i.e., data with noise (noise) different from the original real data, is constructed by using a Negative Sampling (Negative Sampling) method.
A Discriminator Sub-Model (Discriminator Sub-Model) is trained as an evaluation Model (Evaluator Model) by using the original real data and a negative sample obtained by negative sampling. Data enhancement methods (either word-level or sentence-level methods) have difficulty in guaranteeing the quality of the dummy data, and even have no guarantee of semantic and syntactic integrity of the dummy data at all. Therefore, the invention designs a discriminator submodel, which can be also called a screening model or a Filter (Filter) model. This model is an independent part of the invention, i.e. the discriminator submodel is not trained together with the data enhancement module based on edit distance constrained sampling, but independently. The pseudo data generated by the data enhancement module is used as the input of the discriminator submodel, which is beneficial to constructing the pseudo data with higher quality.
Step 230, constructing the original real data into pseudo data based on data enhancement of an editing distance, and screening the pseudo data based on the evaluation model to obtain screened data;
Specifically, on the basis of the positive samples P_r(X), the original real data D_r(X) is enhanced by the edit-distance-based constrained sampling data enhancement method to construct a pseudo-data distribution P̂(x̂).

Pseudo data (Pseudo Data) is generated from the real data D_r(X) by the data enhancement method based on edit distance constrained sampling, and the data is screened by the discriminator submodel provided by the invention, so that syntactic and semantic errors can be reduced to a certain extent. Therefore, the data enhancement method provided by the invention is superior to existing methods in its overall architecture.
Parallel corpora on the original low-resource language pair are enhanced (Augmented) from the original real data by edit distance (Edit Distance) constrained sampling (Constrained Sampling); that is, pseudo data (Pseudo Data) is constructed by the data enhancement method provided by the invention, thereby improving the performance of the low-resource neural machine translation model.
In order to further guarantee the fluency and the fidelity of the pseudo data, a discriminator submodel is adopted for screening, so that the data with the best quality is reserved. In other words, sentences containing syntax or semantic errors are removed through the discriminator submodel, and the pseudo data generated by the core algorithm provided by the invention is mainly screened, so that the quality of the pseudo data is further improved.
FIG. 3 shows the structure of the core data enhancement module and discriminator submodel proposed in the present invention. Let x = x_1, x_2, x_3, …, x_i, …, x_I be a source sentence containing I words and y = y_1, y_2, y_3, …, y_j, …, y_J be a target sentence containing J words,
and let D = {(x^(m), y^(m))}_{m=1}^{M} represent the original training data containing M sentence pairs. The data enhancement task is formalized as shown in FIG. 3: given the distribution P_r(x) of the real data, the enhancement task is to train the enhancement model on the basis of P_r(x) and generate pseudo data. The pseudo-data distribution P̂(x̂) generated by the model proposed in the invention should be close to P_r(x).
Using the negative sampling method shown in FIG. 4, some negative samples P_n(x) are sampled from the original real data P_r(x) for training the discriminator submodel. Then, on the original real data P_r(x), pseudo data x̂ ∼ P̂(x̂) is generated through the procedure of step D explained in this description (a highlight and the core of the invention), i.e., the edit distance constrained sampling method. Next, the discriminator submodel D̂ is trained on the negative samples P_n(x) constructed by negative sampling and on the original real data P_r(x). The pseudo data is then passed through the discriminator submodel D̂ to further select the data with the best quality, i.e., data with few or even no syntactic and semantic errors. Finally, the selected high-quality data is combined with the most original real data to generate large-scale data, thereby realizing data enhancement and effectively alleviating the problem that performance cannot be improved due to insufficient data resources in low-resource neural machine translation tasks.
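The overall flow can be summarized in the following sketch (all function and method names, such as negative_sampler, augmenter and discriminator.fit/score, are illustrative assumptions for exposition and are not a reference implementation of the invention):

```python
# Illustrative sketch of the data-enhancement pipeline described above.

def build_enhanced_corpus(real_corpus, negative_sampler, augmenter, discriminator, threshold=0.5):
    """Construct enhanced training data from the original real corpus."""
    # 1. Build noisy negative samples from the real data (e.g. random word deletion).
    negatives = [negative_sampler(sentence) for sentence in real_corpus]

    # 2. Train the discriminator sub-model to separate real sentences from noisy ones.
    discriminator.fit(positive=real_corpus, negative=negatives)

    # 3. Generate pseudo data by edit-distance constrained sampling.
    pseudo = [augmenter(sentence) for sentence in real_corpus]

    # 4. Keep only pseudo sentences the discriminator scores as close to the real distribution.
    filtered = [s for s in pseudo if discriminator.score(s) >= threshold]

    # 5. Merge the filtered pseudo data with the original real data for NMT training.
    return real_corpus + filtered
```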
Step 240, combining the screened data and the original real data to construct enhanced data, and training a low-resource neural machine translation model using the enhanced data and an attention-based encoder-decoder translation framework to obtain the neural machine translation model.
Specifically, a large amount of screened pseudo data and the small amount of most original real data are combined to construct larger-scale data, so that data enhancement is realized. This process satisfies the data-hungry nature of neural machine translation, can to a certain extent solve the problem that the performance of a low-resource neural machine translation model cannot be improved because enough data cannot be obtained, and improves the performance of neural machine translation models between resource-scarce languages or between resource-rich and resource-scarce languages. After data enhancement is achieved by the method proposed in the present invention, training of the low-resource neural machine translation model begins. The neural machine translation model used by the invention adopts an encoder-decoder framework based on an attention mechanism, and an RNN (Recurrent Neural Network) based on LSTM (Long Short-Term Memory) is used at both the encoder end and the decoder end.
The screened data and the most primitive real data are combined to construct high-quality enhancement data on which a low-resource neural machine translation model is trained using an attention-based encoder-decoder neural machine translation framework. Training the training corpus in the training set by a neural machine translation model of an encoder-decoder framework based on an attention mechanism to obtain neural machine translation model parameters of a low-resource language; specifically, the model parameters include source language end and target language end word vectors, and model weight matrix parameters.
The training process of the translation model is as follows:
First, the word vector of each word of the input sentence is obtained; this is realized through the preprocessing step of an RNN language model as follows:

F1. The RNN language model is composed of a look-up layer, a hidden layer and an output layer. Each word contained in the input sentence is converted into its corresponding word vector representation through the look-up layer:

x_t = look-up(s)    (1)

wherein x_t is the word vector representation of s, s is the input at each time step t, and look-up represents the look-up layer.
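A minimal sketch of such a look-up layer, assuming a randomly initialized embedding matrix (vocabulary size and dimension are illustrative):

```python
import numpy as np

# Sketch of the look-up layer in equation (1): map a token id to its word
# vector by indexing an embedding matrix.

def make_lookup(vocab_size=10000, dim=512, seed=0):
    rng = np.random.default_rng(seed)
    E = rng.normal(size=(vocab_size, dim)).astype(np.float32)  # word vector matrix
    def look_up(token_id):
        return E[token_id]  # x_t = look-up(s)
    return look_up
```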
F2. For the parallel sentence pair obtained in step A, let its source side be x = x_1, …, x_i, …, x_I and its target side be y = y_1, …, y_j, …, y_J. Neural machine translation usually factorizes the sentence-level translation probability into word-level probabilities:
P(y | x; θ) = Π_{j=1}^{J} P(y_j | x, y_{<j}; θ)    (2)

where θ denotes the model parameters and y_{<j} is the partial translation. If the training set is D = {(x^(m), y^(m))}_{m=1}^{M}, the training goal is to maximize the log-likelihood on the training set:

L(θ) = Σ_{m=1}^{M} log P(y^(m) | x^(m); θ)    (3)

θ̂ = argmax_θ L(θ)    (4)

The decision rule for translation is to use the learned model parameters θ̂ to obtain, for a source sentence x that has not been encountered (i.e., not seen during training), the target sentence ŷ with the maximum translation probability:

ŷ = argmax_y P(y | x; θ̂)    (5)

In particular, P(y | x; θ̂) is maximized by maximizing the word-level translation probability of each word y_j:

ŷ_j = argmax_{y_j} P(y_j | x, ŷ_{<j}; θ̂)    (6)
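A small sketch of this factorized objective and greedy decision rule, assuming a word-level scorer log_p_word(y_j, x, y_prefix) is available (the interface is hypothetical):

```python
# Sketch of equations (2)-(6): sentence log-likelihood as a sum of word-level
# log-probabilities, and a greedy word-by-word decision rule.

def sentence_log_likelihood(x, y, log_p_word):
    """log P(y | x) = sum_j log P(y_j | x, y_<j)."""
    return sum(log_p_word(y[j], x, y[:j]) for j in range(len(y)))

def corpus_log_likelihood(pairs, log_p_word):
    """Training objective: summed log-likelihood over the training set."""
    return sum(sentence_log_likelihood(x, y, log_p_word) for x, y in pairs)

def greedy_decode(x, log_p_word, vocab, max_len=50, eos="</s>"):
    """Pick the highest-probability word at each step (a greedy form of eq. (6))."""
    y = []
    for _ in range(max_len):
        best = max(vocab, key=lambda w: log_p_word(w, x, y))
        y.append(best)
        if best == eos:
            break
    return y
```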
after the steps F1, F2, the following steps are also required:
F3. The input obtained through step F2 requires further processing: a bidirectional LSTM is used at the encoder side to obtain a representation of the entire source sentence. (The GRU is likewise a kind of RNN unit; as described in step F1, the RNN language model is composed of a look-up layer, a hidden layer and an output layer.) The word vector representation of each word is obtained by the RNN in step F1, and the result is then used as the input of the encoder, i.e., the information prepared for the hidden layer of the encoder stage. When the hidden layer calculates the current hidden state, the output of the look-up layer is used as input; that is, words are mapped to a context vector according to the word vector of each word and the previous hidden-state information:
h_t = f(x_t, h_{t-1})    (7)

where f is an abstract function that computes the current new hidden state given the input x_t and the historical state h_{t-1}. The initial state h_0 is often set to 0, and f typically takes the form h_t = σ(W_xh x_t + W_hh h_{t-1}), where σ is a non-linear function (e.g., sigmoid or tanh).
Thus, the forward (forward) state of the bidirectional BiRNN is calculated according to the following equations (the arrows → and ← denote the forward and reverse directions, and →h_0 = 0):

→h_i = (1 - →z_i) ∘ →h_{i-1} + →z_i ∘ →h̃_i    (8)

wherein

→h̃_i = tanh(→W E x_i + →U [→r_i ∘ →h_{i-1}])    (9)

→z_i = σ(→W_z E x_i + →U_z →h_{i-1})    (10)

→r_i = σ(→W_r E x_i + →U_r →h_{i-1})    (11)

E is the word vector matrix, →W, →W_z, →W_r and →U, →U_z, →U_r are weight matrices, m and n are the word vector dimension and the hidden state dimension respectively, and σ is the sigmoid function.

The reverse state ←h_i is calculated in the same way as the forward state. The word vector matrix E is shared between the forward and reverse states, but the weight matrices are not. The forward and reverse states are combined to obtain

h_i = [→h_i; ←h_i]    (12)
the decoder of the translation model uses a unidirectional RNN, as opposed to the encoder using a bidirectional RNN.
The decoder also has a corresponding hidden state, but it is computed differently from that of the encoder; the detailed calculation is as follows:

s_i = (1 - z_i) ∘ s_{i-1} + z_i ∘ s̃_i    (13)

wherein

s̃_i = tanh(W E y_i + U [r_i ∘ s_{i-1}] + C c_i)    (14)

z_i = σ(W_z E y_i + U_z s_{i-1} + C_z c_i)    (15)

r_i = σ(W_r E y_i + U_r s_{i-1} + C_r c_i)    (16)

E is the word vector matrix of the words contained in the target-language sentence, W, W_z, W_r, U, U_z, U_r and C, C_z, C_r are weight matrices, m and n are the word vector dimension and the hidden state dimension respectively, and σ is the sigmoid function. The initial hidden state s_0 is calculated as follows:

s_0 = tanh(W_s ←h_1)    (17)

wherein W_s is a weight matrix. The context vector is recalculated at each time step:

c_i = Σ_{j=1}^{I} α_ij h_j    (18)

wherein

α_ij = exp(e_ij) / Σ_{k=1}^{I} exp(e_ik)    (19)

e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)    (20)

h_j is the hidden state corresponding to the j-th word in the source sentence, and v_a, W_a and U_a are all weight matrices.
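A small numpy sketch of the additive attention in equations (18)-(20); shapes and matrix names are illustrative assumptions:

```python
import numpy as np

# Sketch of equations (18)-(20): alignment scores e_ij, attention weights
# alpha_ij, and the context vector c_i.

def attention_context(s_prev, H, W_a, U_a, v_a):
    """s_prev: (n,) previous decoder state; H: (J, 2n) encoder states h_1..h_J."""
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])  # e_ij
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                                  # alpha_ij
    c = alpha @ H                                                         # c_i = sum_j alpha_ij h_j
    return c, alpha
```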
According to any of the above embodiments, the raw real data is from a public data set or manually prepared data;
specifically, the constructed data sets are from five public data sets NIST, Tanzil, WMT14, IWSLT14, and IWSLT 15. The NIST data set contains chinese → english (Zh → En); the Tanzil dataset contains asebai → english (Az → En), indian → english (Hi → En), uzbekkaiki → english (Uz → En), uygur → english (Ug → En) and turkey → english (Tr → En); WMT14 contains english → german (En → De); IWSLT14 contains german → english (De → En); IWSLT15 contains vietnamese → english (Vi → En). Selecting a corresponding training set, a corresponding development set and a corresponding test set according to each language pair; it is emphasized that the parallel corpus referred to herein has no specific labeling information such as language direction (e.g., <2ch > indicates the direction of language from the source language to Chinese).
The low-resource language pair is a language pair with the parallel corpus size smaller than a preset value and comes from an open data set;
the monolingual corpus is a monolingual corpus of the source language or the target language of the low-resource language pair and comes from manually prepared data;
the obtaining of negative sample data by negative sampling of the original real data includes: and generating negative sample data by randomly discarding or randomly adding the original real data.
Specifically, the negative sampling method employed in generating the negative examples is to randomly discard a word so that the original real sentence generates a grammatical error (semantic error or syntactic error). Further, because there are many methods of negative sampling, in order to make the negative sample contain enough noise to train a better-performing discriminator, a method of randomly discarding words is selected from many negative sampling methods. In fact, a certain position can be randomly selected to insert a new word to break the integrity of the whole sentence, or different words can be randomly selected from the original sentence and the positions can be exchanged. However, other methods do not produce enough noise in most cases, and therefore the present invention chooses a method to randomly discard words.
Some negative samples are generated by the negative sampling method. Suppose, without loss of generality, that the source side is the side to be enhanced, so only the source side is negatively sampled (Negative Sampling) here to generate negative samples; the enhancement of the target side is analogous, i.e., when the target side needs to be enhanced, the target-side monolingual data is negatively sampled. Let the original real data be D_r(x) and the generated negative sample data be D_n(x). New negative samples are generated by random deletion or random insertion. For example, for the original real sentence S_r = w_1, w_2, w_3, …, w_i, …, w_I, the negative sample after random deletion (assuming w_2 is deleted) is S_n = w_1, w_3, …, w_i, …, w_I, and its length changes from I to I - 1; the negative sample after random insertion (assuming the word w_l is inserted after w_i) is S_n = w_1, w_2, …, w_i, w_l, …, w_I, and its length changes from I to I + 1.
FIG. 4 shows the process of generating negative samples. The discriminator submodel mentioned in the present invention is a core part of the invention and also one of its highlights. Since training the discriminator submodel requires preparing negative samples, the generation of negative samples is also an important part of the invention. The negative samples used are the noisy negative sample data D_n(x) constructed from the original real data D_r(x) by the negative sampling method shown in FIG. 4.
Negative sampling also has broad applications in many tasks in machine learning and natural language processing. Negative sampling refers to randomly generating negative samples related to the positive samples in the training data, and it plays different roles in different machine learning tasks. For example, in contrastive learning, negative sampling is used to achieve the training goal of contrastive learning, increasing the distance between the representations of positive and negative samples. In word2vec, negative sampling is used to reduce the number of model parameters updated at each step, which can effectively improve training efficiency. In machine translation, the use of negative sampling follows its use in machine learning: from the original real sentence S_r = w_1, w_2, w_3, …, w_i, …, w_I, a new sequence S_n is generated by the negative sampling strategy, but the newly generated sentence S_n differs to some extent from the original sentence S_r. In most cases, S_n and S_r differ at the syntactic or semantic level, which is exactly the goal that negative sampling is expected to achieve.
As shown in FIG. 4, a common negative sampling method is to randomly delete some sentence components (core components) from the original real sentence S_r, or to randomly add some components (irrelevant components), so that the generated S_n contains, as far as possible, some syntactic or semantic errors. Whether random deletion or random addition is used to generate S_n from S_r, samples free of syntactic and semantic errors are sometimes produced, which resemble positive samples more than negative samples. For example, among the sentences generated from S_r by random deletion in FIG. 4, the one in which "is" is deleted contains a syntax error, whereas the one in which "carefully" is deleted contains no syntax error. Likewise, among the sentences generated by random addition, the one in which "math" is added contains a syntax error, whereas the one in which "now" is added contains no syntax error. During negative sampling, it is desirable to generate, as far as possible, sentences with grammatical errors rather than sentences without grammatical errors; therefore, negative sampling is applied in a relatively aggressive manner, ensuring as far as possible that the sampled samples are truly negative samples while avoiding producing positive samples.
Aiming at the data scarcity problem faced by neural machine translation in low-resource scenarios, the invention provides a constrained sampling strategy. In this method, negative samples are prepared in order to train the discriminator submodel: negative samples D_n(x) are generated from the real data D_r(x) by randomly deleting some words.
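A minimal sketch of this negative-sample construction (random word deletion, with random insertion as an alternative); the probabilities and interfaces are illustrative:

```python
import random

# Sketch of negative sampling: corrupt a real sentence S_r by randomly deleting
# a word (or randomly inserting one) so that the result is likely to contain a
# grammatical error.

def negative_sample(sentence, vocab, p_delete=0.8):
    words = sentence.split()
    if len(words) < 2:
        return sentence
    if random.random() < p_delete:
        # Random deletion: drop one word, length I -> I - 1.
        del words[random.randrange(len(words))]
    else:
        # Random insertion: add a word after a random position, length I -> I + 1.
        words.insert(random.randrange(len(words)) + 1, random.choice(vocab))
    return " ".join(words)
```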
Based on any of the above embodiments, after obtaining the original real data including the parallel corpus and the monolingual corpus of the low-resource language pair, the method further includes:
and carrying out cleaning data preprocessing on the original real data comprising the source language or the target language and secondary preprocessing comprising cleaning data, eliminating blank lines, eliminating illegal characters and non-English characters at a target end.
Specifically, the data preprocessing step processes the source-language text and the target-language text in the data set; for example, the data is cleaned using the preprocessing tool provided by NiuTrans to eliminate illegal characters (the target side of the parallel sentence pairs used in the experiments of the present invention is English). In addition, a series of preprocessing tools developed in the Python language perform further operations, including secondary preprocessing (cleaning the data again, eliminating blank lines, eliminating illegal characters and non-English characters at the target side, etc.). In the preprocessing stage, data in languages other than Chinese is tokenized using tokenizer.perl provided by the open-source statistical machine translation system MOSES, and Chinese data is segmented using the THULAC toolkit (a Chinese word segmentation tool developed by a university natural language processing laboratory). Then, for all corpora, the BPE method is used for sub-word segmentation.
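A minimal sketch of the secondary preprocessing (cleaning, blank-line removal, illegal-character filtering); the exact character classes are assumptions, and the external tokenizer.perl, THULAC and BPE steps are not reproduced here:

```python
import re

# Sketch of the secondary preprocessing: strip control characters, drop blank
# lines, and (for the English target side) remove non-English characters.

CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
NON_ENGLISH = re.compile(r"[^A-Za-z0-9\s\.,;:!?'\"()\-]")

def clean_lines(lines, english_target=False):
    for line in lines:
        line = CONTROL_CHARS.sub("", line).strip()
        if english_target:
            line = NON_ENGLISH.sub("", line)
        if line:  # eliminate blank lines
            yield line
```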
Based on any of the above embodiments, as shown in fig. 5, the data enhancement based on the edit distance constructs the original real data into pseudo data, including:
step 510, performing edit distance sampling on the original real data based on an edit distance sub-model;
step 520, selecting the position of the replacement word based on the position sub-model and the sampled editing distance;
step 530, replacing the new word at the position of the replacement word based on the replacement sub-model, and obtaining the pseudo data.
Specifically, data enhancement based on edit distance constrained sampling is also a word-level data enhancement method, i.e., a word replacement process. This step can be viewed as a constrained sampling process divided into three sub-steps: the first is edit distance sampling, the second is calculating the positions based on the edit distance, and the third is calculating the replacement words based on the previous two steps.
FIG. 6 shows one of the most common methods in the field of data enhancement, often used in other tasks, whether neural machine translation or natural language processing.
The data enhancement method refers to a method of making the scale of training data large. Data enhancement has also been widely used not only in machine translation, but also in natural language processing tasks such as dialog generation, question answering, machine writing, and natural language reasoning.
In machine translation, the commonly used data enhancement method is mainly performed from two perspectives of "word level" and "sentence level".
Word-level data enhancement performs data enhancement by randomly replacing words, randomly inserting words, randomly deleting words, randomly exchanging the positions of different words, etc. For example, a word is randomly selected from the original sentence and then replaced with a word from the dictionary (in FIG. 6, "on" is replaced with "is" and "story" with "material"), or a word is randomly selected from the original sentence and its position is exchanged with that of another word in the sentence (in FIG. 6, "now" is originally the second word, but in the first generated sentence its position is exchanged with that of the first word), thereby generating a new sentence. A typical work on word-level data enhancement methods is the random substitution method proposed by Fadaee et al.
Sentence-level data enhancement is mainly performed by means of Back Translation (BT), Forward Translation (FT), and some modified versions of the Back Translation, such as Tagged Back Translation (Tagged BT). Among various sentence-level data enhancement methods, a common method is translation, and the core idea is to make full use of the existing bilingual data and train a reverse neural machine translation model, construct a pseudo source-end sentence through the target-end monolingual data, thereby forming pseudo parallel data, and combine the pseudo parallel data with the original bilingual corpus to perform data enhancement.
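A brief sketch of the back-translation idea described above (the reverse_model.translate interface is an assumption for exposition):

```python
# Sketch of sentence-level back translation (BT): a reverse (target-to-source)
# model creates pseudo source sentences from target-side monolingual data; the
# resulting pseudo pairs are merged with the original bilingual corpus.

def back_translate(target_monolingual, reverse_model):
    pseudo_parallel = []
    for y in target_monolingual:
        x_pseudo = reverse_model.translate(y)  # pseudo source-side sentence
        pseudo_parallel.append((x_pseudo, y))  # pseudo parallel pair
    return pseudo_parallel
```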
However, whether it is a BT method at sentence level or a substitution, exchange, or the like at word level, some sentences containing syntax errors are often generated, which are actually undesired sentences. Therefore, in order to alleviate the syntax problem after data enhancement to a certain extent, a method for generating high-quality data based on edit distance constraint sampling is provided.
Based on any of the above embodiments, the edit distance submodel is represented as follows:

P(d | x) = exp(-d/τ) · c(d, I) / Σ_{d'=0}^{I} exp(-d'/τ) · c(d', I)    (21)

wherein τ represents a temperature hyperparameter, and c(d, I) represents the number of sentences at edit distance d (d ∈ {0, 1, 2, 3, …, I}) from a sentence of length I;

the position submodel is represented as follows:

P(p | x, d) = 1 / C(I, d)    (22)

i.e., the set of d edited positions is drawn uniformly from the positions of the sentence;

the replacement submodel is represented as follows:

P(w_j | x, d, p) = P(w_j | w_{j-1}, p_j);    (23)

wherein w_j is the new word sampled at step j and p_j is the sampling position.
Specifically, the present invention develops a constrained sampling method for low-resource neural machine translation. The enhancement steps of the invention follow essentially the same idea as pure word-replacement methods, but words are replaced with a sampling method different from random sampling. Let P_r(x) be the real data distribution and x̂ denote a forged sentence generated from the original sentence x. Given x, let P̂(x̂ | x) denote the distribution of the pseudo data x̂. The distribution of the enhancement data can then be expressed as:
P̂(x̂) = Σ_x P_r(x) · P̂(x̂ | x)    (24)
x̂ is generated from x by replacing words based on the edit distance. For example, assume x is "I like the book" and x̂ is "I like a movie"; the edit distance is 2, the positions of the replaced words are 3 and 4, and the replacement words are "a" and "movie". More formally, d denotes the edit distance between x and x̂, p = {p_1, …, p_d} denotes the positions of the replaced words, and w = {w_1, …, w_d} denotes the list of replacement words. In the above example, d = 2, p_1 = 3, p_2 = 4, w_1 = "a", and w_2 = "movie". According to the distribution P_r(x) of the original real data, the enhancement distribution is defined as:

P̂(x̂ | x) = P(d | x) · P(p | x, d) · Π_{j=1}^{d} P(w_j | x, d, p)    (25)
more precisely, the constrained sampling method herein consists of 3 parts as follows:
D1. The edit distance d is sampled according to the real data sample. Following the idea of the edit distance, the edit distance submodel is defined as follows:

P(d | x) = exp(-d/τ) · c(d, I) / Σ_{d'=0}^{I} exp(-d'/τ) · c(d', I)    (26)

where τ represents a temperature hyperparameter that limits the search space around the original sentence; it follows that a larger τ yields more samples with longer edit distances. c(d, I) represents the number of sentences at edit distance d (d ∈ {0, 1, 2, 3, …, I}) from a sentence of length I, which can be obtained as follows:

c(d, I) = C(I, d) · (|V| - 1)^d    (27)

where V represents the vocabulary and C(I, d) is the binomial coefficient.
D2. When replacing words, the positions of the words are selected according to the sampled edit distance d. The position submodel is defined as follows:

P(p | x, d) = 1 / C(I, d)    (28)

i.e., a set of d distinct positions is drawn uniformly. According to the above sampling method, the position set p = {p_1, p_2, p_3, …, p_d} is obtained. This approach essentially guarantees an edit distance of d between the new sentence and the initial sentence.
D3. The replacement submodel replaces a new word at each sampled position p_j. This process takes d steps in total; in each step (say the j-th step), a new word w_j can be sampled from the distribution P(w | X_{j-1}, p = p_j), and the old word at position p_j of X_{j-1} is then replaced to generate X_j. Finally, the replacement submodel is defined as:

P(w_j | x, d, p) = P(w_j | w_{j-1}, p_j)    (29)

where a constrained sampling scheme is used to sample w_j so as to maximize the language model score of the sequence X_j.
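A minimal sketch of steps D1-D3, under the assumptions that c(d, I) = C(I, d)·(|V| - 1)^d, that positions are drawn uniformly, and that score_fn is some external language-model scorer (all of these interfaces are illustrative, not the patent's reference implementation):

```python
import math
import random

# Sketch of edit-distance constrained sampling: sample d, sample d positions,
# then pick each replacement word to maximize a language-model score.

def sample_edit_distance(I, vocab_size, tau=1.0):
    """D1: sample d with P(d) proportional to exp(-d/tau) * C(I, d) * (|V|-1)^d
    (weights computed in log space for numerical stability)."""
    log_w = [(-d / tau)
             + (math.lgamma(I + 1) - math.lgamma(d + 1) - math.lgamma(I - d + 1))
             + d * math.log(vocab_size - 1)
             for d in range(I + 1)]
    m = max(log_w)
    weights = [math.exp(w - m) for w in log_w]
    return random.choices(range(I + 1), weights=weights, k=1)[0]

def sample_positions(I, d):
    """D2: choose d distinct positions to edit, uniformly at random."""
    return sorted(random.sample(range(I), d))

def constrained_augment(words, vocab, score_fn, tau=1.0):
    """D3: at each sampled position, choose the replacement word that maximizes
    the language-model score of the partially edited sentence."""
    I = len(words)
    d = sample_edit_distance(I, len(vocab), tau)
    new_words = list(words)
    for p in sample_positions(I, d):
        new_words[p] = max(vocab, key=lambda w: score_fn(new_words[:p] + [w] + new_words[p + 1:]))
    return new_words
```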
Based on any of the above embodiments, the discriminator submodel is configured to distinguish the distribution of the original real data from the distribution of the pseudo data, and comprises a discriminator loss function, which is expressed as follows:

L_D = E_{x∼P_r(x)}[(D̂(x) - 1)^2] + E_{x̂∼P̂(x̂)}[D̂(x̂)^2]

wherein L_D is the loss function, D̂ is the discriminator, P_r(x) is the distribution of the original real data, and P̂(x̂) is the distribution of the pseudo data.
Specifically, in order to improve the quality of the pseudo data, the discriminator submodel D̂, also called the data filter, is trained on the original real data D_r(X) and the negative samples D_n(X) obtained by negative sampling. The discriminator acts as a filter for the pseudo data obtained after the constrained sampling enhancement. The discriminator submodel is similar to a GAN discriminator and is mainly used to distinguish the distribution P_r(x) of the real data from the distribution P̂(x̂) of the pseudo data. Similar to the least-squares generative adversarial network (LSGAN), the discriminator loss function is set as follows:

L_D = E_{x∼P_r(x)}[(D̂(x) - 1)^2] + E_{x̂∼P̂(x̂)}[D̂(x̂)^2]    (30)

The loss function L_D makes the reward the discriminator D̂ assigns to real data higher than that assigned to pseudo data. Thus, the discriminator D̂ can select pseudo data of higher quality, i.e., data closer to the distribution of the real data. The data enhancement method described above targets the source side, but it can easily be extended to the target side.
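A small sketch of this least-squares discriminator objective (the discriminator itself is assumed to be any sentence encoder producing a scalar score; the function below is illustrative):

```python
import numpy as np

# Sketch of the LSGAN-style loss in equation (30): mean((D(x) - 1)^2) on real
# data plus mean(D(x_hat)^2) on pseudo data.

def discriminator_loss(d_real, d_fake):
    """d_real, d_fake: arrays of discriminator scores on real and pseudo batches."""
    d_real, d_fake = np.asarray(d_real), np.asarray(d_fake)
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)
```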
The invention provides a data enhancement-based low-resource neural machine translation system, and the following description and the above-described data enhancement-based low-resource neural machine translation method can be referred to correspondingly.
Fig. 7 is a schematic structural diagram of a data enhancement-based low-resource neural machine translation system according to an embodiment of the present invention, as shown in fig. 7, the system includes a data determination unit 710 and a machine translation unit 720;
the data determining unit 710 is configured to determine real data to be translated;
the machine translation unit 720 is configured to input the real data to be translated into a neural machine translation model, so as to obtain a neural machine translation result output by the neural machine translation model;
the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is carried out on original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair.
According to the system provided by the embodiment of the invention, the real data is input into the neural machine translation model to obtain the neural machine translation result output by the neural machine translation model, so that the problem of resource shortage in low-resource neural machine translation can be efficiently and accurately solved.
Based on any of the above embodiments, as shown in fig. 8, the machine translation unit includes a data acquisition unit 810, a model training unit 820, a data screening unit 830, and a data enhancement unit 840;
the data obtaining unit 810 is configured to obtain original real data including a parallel corpus and a monolingual corpus of a low-resource language pair, and perform negative sampling on the original real data to obtain negative sample data;
the model training unit 820 is configured to train a discriminator sub-model based on the original real data and the negative sample data to obtain an evaluation model;
the data screening unit 830 is configured to build the original real data into pseudo data based on data enhancement of an editing distance, and screen the pseudo data based on the evaluation model to obtain screened data;
the data enhancement unit 840 is configured to combine the filtered data and the original real data to construct enhanced data, and train a low-resource neural machine translation model using the enhanced data and an attention-based encoder/decoder translation framework to obtain the neural machine translation model.
According to any of the above embodiments, the raw real data is from a public data set or manually prepared data;
the low-resource language pair is a language pair with the parallel corpus size smaller than a preset value and comes from an open data set;
the monolingual corpus is a monolingual corpus of the source language or the target language of the low-resource language pair and comes from manually prepared data;
the obtaining of negative sample data by negative sampling of the original real data includes: and generating negative sample data by randomly discarding or randomly adding the original real data.
Based on any embodiment of the foregoing, after the data obtaining unit is configured to obtain original real data including parallel corpora and monolingual corpora of the low-resource language pair, the data obtaining unit further includes:
and carrying out cleaning data preprocessing on the original real data comprising the source language or the target language and secondary preprocessing comprising cleaning data, eliminating blank lines, eliminating illegal characters and non-English characters at a target end.
Based on any of the above embodiments, as shown in fig. 9, the data filtering unit includes an edit distance module 910, a location selection module 920, and a location replacement module 930;
the edit distance module 910 is configured to perform edit distance sampling on the original real data based on an edit distance submodel;
the position selection module 920 is configured to select a position of a replacement word based on the position sub-model and the sampled edit distance;
the position replacement module 930 is configured to replace a new word at the position of the replacement word based on the replacement sub-model, so as to obtain pseudo data.
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 10, the electronic device may include: a processor (processor) 1010, a communication interface (Communications Interface) 1020, a memory (memory) 1030, and a communication bus 1040, wherein the processor 1010, the communication interface 1020, and the memory 1030 communicate with each other via the communication bus 1040. The processor 1010 may invoke logic instructions in the memory 1030 to perform the data enhancement-based low-resource neural machine translation method, comprising: determining real data to be translated; inputting the real data to be translated into a neural machine translation model to obtain a neural machine translation result output by the neural machine translation model; the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is carried out on original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair.
Furthermore, the logic instructions in the memory 1030 can be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, an embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer, the computer is capable of executing the data enhancement-based low-resource neural machine translation method provided by the above methods, where the method includes: determining real data to be translated; inputting the real data to be translated into a neural machine translation model to obtain a neural machine translation result output by the neural machine translation model; the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is carried out on original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair.
In yet another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program being implemented by a processor to perform the data enhancement-based low-resource neural machine translation method provided in the foregoing, the method including: determining real data to be translated; inputting the real data to be translated into a neural machine translation model to obtain a neural machine translation result output by the neural machine translation model; the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is carried out on original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A data enhancement-based low-resource neural machine translation method is characterized by comprising the following steps:
determining real data to be translated;
inputting the real data to be translated into a neural machine translation model to obtain a neural machine translation result output by the neural machine translation model;
the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is carried out on original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair.
2. The method for low-resource neural machine translation based on data enhancement of claim 1, wherein the neural machine translation model is obtained by training a low-resource neural machine translation model after data enhancement is performed on original real data including parallel corpora and monolingual corpora of a low-resource language pair, and comprises:
acquiring original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair, and carrying out negative sampling on the original real data to obtain negative sample data;
training a discriminator sub-model based on the original real data and the negative sample data to obtain an evaluation model;
constructing pseudo data from the original real data based on edit-distance data enhancement, and screening the pseudo data based on the evaluation model to obtain screened data;
and combining the screened data with the original real data to construct enhanced data, and training a low-resource neural machine translation model by using the enhanced data and an attention-based encoder/decoder translation framework to obtain the neural machine translation model.
3. The data enhancement-based low-resource neural machine translation method of claim 2, wherein the original real data comes from a public data set or manually prepared data;
the low-resource language pair is a language pair with the parallel corpus size smaller than a preset value and comes from an open data set;
the monolingual corpus is a monolingual corpus of the source language or the target language of the low-resource language pair and comes from manually prepared data;
the obtaining of negative sample data by negative sampling of the original real data includes: generating the negative sample data by randomly discarding content from, or randomly adding content to, the original real data.
4. The method for low-resource neural machine translation based on data enhancement of claim 2, wherein after obtaining the original real data including the parallel corpora and the monolingual corpora of the low-resource language pair, further comprising:
performing cleaning preprocessing on the original real data of the source language or the target language, and secondary preprocessing that includes data cleaning, removing blank lines, and removing illegal characters and non-English characters at the target end.
5. The data enhancement-based low-resource neural machine translation method according to claim 2, wherein the data enhancement based on edit distance constructs the original real data into pseudo data, comprising:
performing edit distance sampling on the original real data based on an edit distance submodel;
selecting the position of the replacement word based on the position sub-model and the sampled editing distance;
and substituting a new word at the selected position based on the replacement sub-model to obtain the pseudo data.
6. The data enhancement-based low-resource neural machine translation method of claim 5, wherein the edit distance sub-model is represented as follows:
Figure FDA0003184546170000021
wherein τ denotes a temperature hyper-parameter, and c(d, I) denotes the number of sentences with edit distance d (d ∈ {0, 1, 2, 3, …, I}) and length I;
the position sub-model is represented as follows:
Figure FDA0003184546170000022
the replacement sub-model is represented as follows:
P(w_j | x, d, p) = P(w_i | w_{i-1}, p_i);
wherein w_j denotes the new word sampled at step j, and p_j denotes the sampling position.
7. The data-enhancement-based low-resource neural machine translation method of claim 2, wherein the discriminator submodel is used for distinguishing the distribution of original real data from the distribution of pseudo data, and comprises a discriminator loss function represented as follows:
Figure FDA0003184546170000031
wherein
Figure FDA0003184546170000032
denotes the loss function,
Figure FDA0003184546170000033
denotes the discriminator, and P_r(x) denotes the distribution of the original real data.
8. A data enhancement-based low-resource neural machine translation system, comprising:
the data determining unit is used for determining real data to be translated;
the machine translation unit is used for inputting the real data to be translated into a neural machine translation model to obtain a neural machine translation result output by the neural machine translation model;
the neural machine translation model is obtained by training the low-resource neural machine translation model after data enhancement is carried out on original real data including parallel linguistic data and monolingual linguistic data on a low-resource language pair.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the data enhancement-based low-resource neural machine translation method of any one of claims 1 to 7.
10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, performs the steps of the data-enhancement-based low-resource neural machine translation method of any one of claims 1 to 7.
CN202110857215.5A 2021-07-28 2021-07-28 Low-resource neural machine translation method and system based on data enhancement Pending CN113673259A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110857215.5A CN113673259A (en) 2021-07-28 2021-07-28 Low-resource neural machine translation method and system based on data enhancement

Publications (1)

Publication Number Publication Date
CN113673259A true CN113673259A (en) 2021-11-19

Family

ID=78540422

Country Status (1)

Country Link
CN (1) CN113673259A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination