CN108549644A - Omission pronominal translation method towards neural machine translation - Google Patents
- Publication number
- CN108549644A CN108549644A CN201810326895.6A CN201810326895A CN108549644A CN 108549644 A CN108549644 A CN 108549644A CN 201810326895 A CN201810326895 A CN 201810326895A CN 108549644 A CN108549644 A CN 108549644A
- Authority
- CN
- China
- Prior art keywords
- pronoun
- language material
- missing
- word alignment
- machine translation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Machine Translation (AREA)
Abstract
The present invention relates to a corpus processing method for handling omitted pronouns in a neural machine translation (NMT) system, applied to attention-based NMT models that use an encoder-decoder framework, comprising: obtaining the original corpus; performing word alignment on the obtained corpus to find the approximate positions of missing pronouns; inserting every candidate pronoun into every candidate missing position; selecting the most suitable pronoun and position with a language model; performing word alignment again and replacing each supplemented pronoun with the corresponding pronoun from the target sentence; and training a sequence-labeling model on the supplemented training corpus. This corpus processing method automatically restores the pronouns omitted in source sentences while avoiding the ambiguity introduced when missing pronouns are filled in with source-language words, thereby effectively improving translation quality. The invention further relates to a translation method using a neural machine translation system.
Description
Technical field
The present invention relates to neural machine translation, and more particularly to an omitted-pronoun translation method for neural machine translation.
Background technology
With the improvement of computing power and the application of big data, deep learning has found ever wider use, and neural machine translation (NMT) based on deep learning has attracted more and more attention. The most common translation model in the NMT field is the attention-based encoder-decoder model. Its main idea is that the sentence to be translated (hereinafter the 'source sentence') is encoded by an encoder into a vector representation, and a decoder then decodes that representation to produce the corresponding translation (hereinafter the 'target sentence'). Although deep-learning-based NMT can translate source sentences well to a certain extent, it performs much less well when translating from a language that habitually omits pronouns into one that does not. For example, the colloquial Chinese sentence '吃了吗' corresponds to the English 'Have you eaten?', but in practice a standard attention-based encoder-decoder model outputs something like 'Eaten?'. Because '你' (you) is omitted in Chinese while English does not omit pronouns, machine translation of such pronoun-dropping spoken sentences greatly reduces the fluency and readability of the output, and thus degrades translation quality.
Two kinds of methods currently exist to solve this problem:
1. Manually filling in the pronouns omitted in the source sentences.
2. Automatically supplementing the omitted pronouns with source-language words, as follows: first, a word-alignment operation (matching corresponding words between the two sentences) is performed between the source and target sentences to obtain the approximate positions of the missing pronouns; then every candidate pronoun is inserted into every candidate missing position; finally, a language model (which judges whether a sentence reads like a normal statement: the lower the perplexity, the closer to natural language) selects the most suitable pronoun and position.
The corpus processed by either of the two methods above is then translated with the attention-based encoder-decoder model.
The shortcomings of the two existing techniques:
The first (manual supplementation) is time-consuming and laborious, and the amount of corpus it can handle is limited.
The second (automatic supplementation) avoids the drawbacks of manual supplementation, but the pronouns it fills in easily cause ambiguity during translation: a supplemented Chinese pronoun '我' may become either 'I' or 'me' in English, and the ambiguity produced by such one-to-many words degrades translation quality.
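The automatic supplementation described above (find alignment gaps, try candidate pronouns, pick by language-model score) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the pronoun list is a small subset, and `lm_score` is a hypothetical stand-in for a real language model (lower score = more fluent).

```python
# Sketch of the prior-art automatic supplementation step: every candidate
# pronoun is tried at every candidate gap position, and the insertion with
# the lowest language-model score is kept.

from itertools import product

CANDIDATE_PRONOUNS = ["我", "你", "他", "她", "我们", "你们", "他们"]

def lm_score(tokens):
    """Hypothetical stand-in for a language model: returns a low score if a
    pronoun appears anywhere before the final token, a high score otherwise.
    A real system would compute perplexity with an n-gram or neural LM."""
    for tok in tokens[:-1]:
        if tok in CANDIDATE_PRONOUNS:
            return 1.0
    return 10.0

def best_insertion(tokens, gap_positions, pronouns=CANDIDATE_PRONOUNS):
    """Try every (position, pronoun) pair; return the lowest-scoring sentence."""
    best_score, best_cand = float("inf"), None
    for pos, pro in product(gap_positions, pronouns):
        cand = tokens[:pos] + [pro] + tokens[pos:]
        score = lm_score(cand)
        if score < best_score:
            best_score, best_cand = score, cand
    return best_cand

print(best_insertion(["吃", "了", "吗"], gap_positions=[0, 1]))
```

A real implementation would score with an actual language model and consider all 31 pronouns mentioned later in the patent.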
Invention content
In view of the shortcomings of the above techniques, we propose a method that automatically supplements the pronouns omitted in the source sentence while avoiding the ambiguity produced when missing pronouns are filled in with source-language words, thereby effectively improving translation quality.
A corpus processing method for handling pronouns omitted in a neural machine translation system, applied to attention-based NMT models using an encoder-decoder framework, comprising:
obtaining the original corpus;
performing word alignment on the obtained corpus to find the approximate positions of missing pronouns;
inserting every candidate pronoun into every candidate missing position;
selecting the most suitable pronoun and position with a language model;
performing word alignment again and replacing each supplemented pronoun with the corresponding pronoun from the target sentence;
training a sequence-labeling model on the supplemented training corpus;
labeling the development and test sets with the trained sequence-labeling model to supplement their pronouns.
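The step that distinguishes this method from the prior art, replacing the supplemented source-language pronoun with the pronoun from the aligned target sentence, can be sketched as follows. The alignment dict stands in for GIZA++ output, and the tokens and indices are illustrative assumptions.

```python
# Sketch of the target-pronoun copy step: after a pronoun has been filled
# in on the source side, a second word alignment maps the filled-in
# position to its target-side pronoun, which then replaces the
# source-language one.

def replace_with_target_pronoun(src_tokens, trg_tokens, alignment, filled_idx):
    """alignment: dict mapping a source index to a target index (one-to-one
    for simplicity; real GIZA++ output can be many-to-many)."""
    trg_idx = alignment.get(filled_idx)
    if trg_idx is None:
        return src_tokens  # no aligned target word; keep the supplemented form
    out = list(src_tokens)
    out[filled_idx] = trg_tokens[trg_idx]  # copy the target-end pronoun
    return out

src = ["我", "用", "了", "我", "一辈子"]    # the second '我' was filled in at index 3
trg = ["I", "spend", "my", "whole", "life"]
alignment = {0: 0, 3: 2, 4: 4}             # filled index 3 aligns to 'my'
print(replace_with_target_pronoun(src, trg, alignment, filled_idx=3))
```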
The above corpus processing method automatically restores the pronouns omitted in source sentences and avoids the ambiguity introduced when missing pronouns are filled in with source-language words, thereby effectively improving translation quality.
In another embodiment, in the step "performing word alignment on the obtained corpus to find the approximate positions of missing pronouns", word alignment is carried out with the GIZA++ model.
In another embodiment, in the step "performing word alignment again and replacing each supplemented pronoun with the corresponding pronoun from the target sentence", word alignment is carried out with the GIZA++ model.
In another embodiment, the word-alignment methods used in the two word-alignment steps above are the same.
In another embodiment, the word-alignment methods used in the two word-alignment steps above are different.
A translation method using a neural machine translation system, applied to attention-based NMT models using an encoder-decoder framework:
processing the original corpus with the above corpus processing method for handling omitted pronouns;
inserting a first label and a second label before and after each supplemented pronoun in the source sentence;
adding the same first and second labels around the corresponding pronoun in the target sentence;
training the NMT system on the corpus processed as above;
translating with the trained NMT system.
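The labeling steps above can be sketched as follows, assuming the <copy>/</copy> label pair described in the embodiments below; the example tokens are illustrative.

```python
# Minimal sketch of the tagging step: a first label is inserted before and
# a second label after the supplemented pronoun on the source side, and
# around the corresponding pronoun on the target side, so the NMT model can
# learn the src/trg <copy>...</copy> correspondence.

def wrap_copy(tokens, pronoun_idx, open_tag="<copy>", close_tag="</copy>"):
    """Insert open_tag before and close_tag after the token at pronoun_idx."""
    return (tokens[:pronoun_idx]
            + [open_tag, tokens[pronoun_idx], close_tag]
            + tokens[pronoun_idx + 1:])

src = ["我", "用", "了", "my", "一辈子"]   # 'my' was copied in at index 3
trg = ["I", "spend", "my", "whole", "life"]
print(wrap_copy(src, 3))
print(wrap_copy(trg, 2))
```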
In one embodiment, the first label is <copy> and the second label is </copy>.
In another embodiment, the first label is </copy> and the second label is <copy>.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the program, implements the steps of any of the above methods.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the above methods.
Description of the drawings
Fig. 1 is a schematic diagram of a corpus processing method for handling omitted pronouns in a neural machine translation system according to an embodiment of the present application.
Fig. 2 is a flowchart of the corpus processing method.
Fig. 3 is a flowchart of a translation method using a neural machine translation system according to an embodiment of the present application.
Fig. 4 is the first effect diagram of the corpus processing method.
Fig. 5 is the second effect diagram of the corpus processing method.
Specific implementation mode
To make the purpose, technical scheme, and advantages of the present invention clearer, the invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and do not limit it.
The foundation of this application, the attention-based NMT model, is introduced first. In neural machine translation systems, translation is generally realized with an encoder-decoder framework. For each word in the training corpus, we initialize a word vector, and the word vectors of all words constitute the word-vector dictionary. A word vector is usually a multidimensional vector in which every dimension is a real number; the dimensionality is generally determined from experimental results. For example, the word vector of the word "we" might be <0.12, -0.23, ..., 0.99>.
The encoder consists of a bidirectional RNN (Recurrent Neural Network). In the encoder stage, the encoder reads in a sentence and encodes it into a series of vectors. Concretely, the sentence is first expressed as a sequence of word vectors x = <x_1, x_2, ..., x_T>, where x is the input sentence and x_j is the m-dimensional word vector of the j-th word. A forward RNN reads the sentence left to right and, via the recurrence h_j = f(h_{j-1}, x_j), produces a forward sequence of hidden-layer vectors; by the same principle a backward RNN reads right to left and produces a backward sequence. For each word x_j we concatenate its forward and backward hidden vectors as its encoded, context-aware representation h_j. From the hidden-layer sequence <h_1, h_2, ..., h_T> we obtain the context vector c_t = q({h_1, h_2, ..., h_T}). Here h_j ∈ R^n is the hidden state at time step j, and f and q are nonlinear activation functions: f is generally a GRU or LSTM, and q is generally an attention network.
In a classical neural machine translation system, the context vector is generally obtained with an attention network, computed as follows:

e_tj = a(s_{t-1}, h_j)
α_tj = exp(e_tj) / Σ_k exp(e_tk)
c_t = Σ_j α_tj h_j

where a is a one-layer feed-forward network and α_tj is the weight of encoder hidden state h_j.
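A minimal numeric sketch of the attention computation described above. For illustration the score function a is replaced by a plain dot product (the patent's a is a one-layer feed-forward network), and the vectors are toy values.

```python
# Attention, written out numerically: scores e_tj = a(s_{t-1}, h_j),
# weights α_tj = softmax over the scores, context c_t = Σ_j α_tj · h_j.

import math

def attention_context(s_prev, hidden_states):
    # e_tj: dot-product stand-in for the one-layer feed-forward network a
    scores = [sum(si * hi for si, hi in zip(s_prev, h)) for h in hidden_states]
    # α_tj = exp(e_tj) / Σ_k exp(e_tk), with max-subtraction for stability
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # c_t = Σ_j α_tj h_j (weighted sum of encoder hidden states)
    dim = len(hidden_states[0])
    c_t = [sum(a * h[d] for a, h in zip(alphas, hidden_states)) for d in range(dim)]
    return alphas, c_t

alphas, c_t = attention_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(alphas)  # weights sum to 1; the state most similar to s_prev weighs more
```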
The decoder is also composed of an RNN network. In the decoder stage, given the context vector c_t and all previously predicted words {y_1, y_2, ..., y_{t-1}}, the decoder continues to predict y_t. This is achieved step by step through the definition p(y) = Π_t p(y_t | {y_1, ..., y_{t-1}}, c_t), where y = <y_1, ..., y_T>. In addition, p(y_t | {y_1, ..., y_{t-1}}, c_t) = g(y_{t-1}, s_t, c_t), where g is a nonlinear activation function, generally softmax, and s_t is the RNN hidden-layer state, s_t = f(y_{t-1}, s_{t-1}, c_t).
The encoder and decoder both use RNN networks mainly because of the RNN's defining feature: its hidden state is jointly determined by the current input and the previous hidden state. In this neural machine translation process, the encoder-stage hidden state is jointly determined by the word vector of the current source-language word and the previous hidden state, and the decoder-stage hidden state is jointly determined by the target-language word vector computed in the previous step and the previous hidden state.
The model is generally trained by minimizing the negative log-likelihood as the loss function, iterating with stochastic gradient descent. On a training set D = {(x_n, y_n)}_{n=1}^N, where each (x_n, y_n) is a parallel sentence pair, the training objective is:

J(θ) = -(1/N) Σ_{n=1}^N log p(y_n | x_n; θ)
The specific application scenario of the present invention is described below:
Processing of the training-set corpus:
Referring to Fig. 1 and Fig. 2, we add the target-end pronoun at the position where the source sentence lacks a pronoun. Because the training-set corpus is parallel, we can use alignment information. GIZA++ is first used for word alignment to obtain the approximate positions of missing pronouns; then every candidate pronoun is inserted into every candidate missing position; then a language model selects the most suitable pronoun and position. After the best pronoun and position have been picked, GIZA++ word alignment is run again and the supplemented pronoun is replaced with the corresponding pronoun in the target sentence.
For example, if the source sentence is '吃了吗' ('Eaten?'), the sentence supplemented with the target-end pronoun is 'you 吃了吗'.
Processing of the test-set and development-set corpus:
Because the development and test sets are not parallel and have no target-end sentences, they cannot be processed with the training-set method. Instead, we treat the processing of the development and test sets as a part-of-speech-tagging problem. There are 32 label classes, one for each pronoun plus an empty class (meaning no pronoun is missing). Using the open-source Foolnltk toolkit, a tagging model is trained on the processed training-set corpus and then applied to the test and development sets; the processed examples look like those above.
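The conversion of supplemented training sentences into sequence-labeling examples can be sketched as follows. This is toolkit-agnostic data preparation, not the Foolnltk API, and the label scheme (a pronoun label on the token that follows the insertion point, 'O' for the empty class) is one plausible reading of the 32-class scheme described above.

```python
# Sketch: derive per-token labels by aligning the original (unsupplemented)
# token list against the supplemented one. Each original token is labeled
# with the pronoun inserted directly before it, or "O" (the empty class).

def to_labels(original, supplemented):
    """Walk both sequences in step; tokens present only in `supplemented`
    are insertions and become the label of the next original token."""
    labels, j = [], 0
    for tok in original:
        pending = "O"
        while supplemented[j] != tok:   # extra tokens are inserted pronouns
            pending = supplemented[j]
            j += 1
        labels.append(pending)
        j += 1
    return labels

# 'my' was inserted before '一辈子' in the supplemented sentence:
print(to_labels(["我", "用", "了", "一辈子"],
                ["我", "用", "了", "my", "一辈子"]))
```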
Referring to Fig. 3, we add a copy mechanism to the above NMT model: the training, test, and development sets are processed once more, with the labels <copy> and </copy> inserted before and after each supplemented pronoun in the source sentence. The same labels are also added around the corresponding pronoun in the target sentence, and the NMT system is trained on the processed corpus. The copied pronoun (for example "my") in the src and in the trg shares the same word embedding. The NMT system can learn the correspondence between <copy>...</copy> at the src end and <copy>...</copy> at the trg end, and the shared word embedding helps ensure the correctness of the generated translation.
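The shared word-embedding idea can be sketched as follows: a copied pronoun looks up the same row of one shared table whether it occurs on the src or the trg side, which is what ties the two <copy>...</copy> spans together. The vocabularies and vectors below are illustrative assumptions, not the system's actual parameters.

```python
# Sketch of src/trg weight tying for copied tokens: tokens in the shared
# vocabulary fall back to one common embedding table on both sides.

shared_vocab = {"<copy>": 0, "</copy>": 1, "my": 2}
embedding_table = [[0.1, 0.2], [0.3, 0.4], [0.9, -0.5]]  # one row per entry

def embed(token, side_vocab, side_table):
    """Shared tokens use the common table; others use their side's table."""
    if token in shared_vocab:
        return embedding_table[shared_vocab[token]]
    return side_table[side_vocab[token]]

src_vocab, src_table = {"我": 0}, [[0.7, 0.7]]
trg_vocab, trg_table = {"I": 0}, [[-0.2, 0.5]]

# 'my' resolves to the identical vector whether it appears in src or trg:
print(embed("my", src_vocab, src_table) is embed("my", trg_vocab, trg_table))
```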
The proposed method not only supplements the omitted pronouns but also avoids the lexical translation ambiguity introduced after missing pronouns are filled in, thereby effectively improving the translation quality of pronoun-dropping conversational language.
Through various experiments we found that translations with pronouns filled in clearly improve on translations without. Our method of filling in the target-end pronoun (+ProDrop_target) yields better translations than the previously proposed method of directly filling in source-language pronouns (+ProDrop), with a BLEU improvement of nearly 1 point, showing that our method can substantially improve the translation of pronoun-dropping spoken utterances. The experimental results are shown in Table 1 below:
Table 1
A specific example:
src: 我用了(我)一辈子 ('I have used (my) a lifetime')
trg: I spend my whole life.
In this example the pronoun '我' (here corresponding to 'my') is omitted in the original sentence. With our method, a word-alignment operation is performed first, with the effect shown in Fig. 4. Then, from the alignment of the surrounding words, we judge that the missing pronoun lies roughly between '用了' ('used') and '一辈子' ('a lifetime'). All 31 pronouns are inserted at these candidate positions and the best match is selected with the language model. Suppose we determine that inserting the pronoun '我' between '用了' and '一辈子' works best; we take this sentence as the final candidate and run word alignment again, with the effect shown in Fig. 5.
At this point it is clear that the originally missing pronoun positions are all aligned; therefore we only need to replace the supplemented Chinese pronoun with the English-end pronoun, obtaining:
src: 我用了 my 一辈子
trg: I spend my whole life.
In the final step, the labels <copy> and </copy> are inserted before and after the supplemented pronoun, obtaining:
src: 我用了 <copy>my</copy> 一辈子
trg: I spend <copy>my</copy> whole life.
The processed corpus is then fed to the attention-based encoder-decoder model for training and translation.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these belong to the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.
Claims (10)
1. A corpus processing method for handling pronouns omitted in a neural machine translation system, applied to attention-based NMT models using an encoder-decoder framework, characterized by comprising:
obtaining the original corpus;
performing word alignment on the obtained corpus to find the approximate positions of missing pronouns;
inserting every candidate pronoun into every candidate missing position;
selecting the most suitable pronoun and position with a language model;
performing word alignment again and replacing each supplemented pronoun with the corresponding pronoun from the target sentence;
training a sequence-labeling model on the supplemented training corpus;
labeling the development and test sets with the trained sequence-labeling model to supplement their pronouns.
2. The corpus processing method according to claim 1, characterized in that in the step "performing word alignment on the obtained corpus to find the approximate positions of missing pronouns", word alignment is carried out with the GIZA++ model.
3. The corpus processing method according to claim 1, characterized in that in the step "performing word alignment again and replacing each supplemented pronoun with the corresponding pronoun from the target sentence", word alignment is carried out with the GIZA++ model.
4. The corpus processing method according to claim 1, characterized in that the word-alignment methods used in the two word-alignment steps are the same.
5. The corpus processing method according to claim 1, characterized in that the word-alignment methods used in the two word-alignment steps are different.
6. A translation method using a neural machine translation system, applied to attention-based NMT models using an encoder-decoder framework, characterized by:
processing the original corpus with the corpus processing method according to any one of claims 1 to 5;
inserting a first label and a second label before and after each supplemented pronoun in the source sentence;
adding the same first and second labels around the corresponding pronoun in the target sentence;
training the NMT system on the corpus processed as above;
translating with the trained NMT system.
7. The translation method according to claim 6, characterized in that the first label is <copy> and the second label is </copy>.
8. The translation method according to claim 6, characterized in that the first label is </copy> and the second label is <copy>.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that the processor, when executing the program, implements the steps of the method according to any one of claims 1-8.
10. A computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810326895.6A CN108549644A (en) | 2018-04-12 | 2018-04-12 | Omission pronominal translation method towards neural machine translation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810326895.6A CN108549644A (en) | 2018-04-12 | 2018-04-12 | Omission pronominal translation method towards neural machine translation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108549644A true CN108549644A (en) | 2018-09-18 |
Family
ID=63514808
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810326895.6A Pending CN108549644A (en) | 2018-04-12 | 2018-04-12 | Omission pronominal translation method towards neural machine translation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108549644A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109948166A (en) * | 2019-03-25 | 2019-06-28 | 腾讯科技(深圳)有限公司 | Text interpretation method, device, storage medium and computer equipment |
CN110598222A (en) * | 2019-09-12 | 2019-12-20 | 北京金山数字娱乐科技有限公司 | Language processing method and device, and training method and device of language processing system |
WO2020197504A1 (en) * | 2019-03-28 | 2020-10-01 | Agency For Science, Technology And Research | A method for pre-processing a sequence of words for neural machine translation |
CN112257460A (en) * | 2020-09-25 | 2021-01-22 | 昆明理工大学 | Pivot-based Hanyue combined training neural machine translation method |
CN114595700A (en) * | 2021-12-20 | 2022-06-07 | 昆明理工大学 | Zero-pronoun and chapter information fused Hanyue neural machine translation method |
WO2022116841A1 (en) * | 2020-12-04 | 2022-06-09 | 北京有竹居网络技术有限公司 | Text translation method, apparatus and device, and storage medium |
US20220215177A1 (en) * | 2018-07-27 | 2022-07-07 | Beijing Jingdong Shangke Information Technology Co., Ltd. | Method and system for processing sentence, and electronic device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101187922A (en) * | 2006-11-17 | 2008-05-28 | 徐赞国 | Precision machine translation method and its device |
CN107092666A (en) * | 2010-12-30 | 2017-08-25 | 脸谱公司 | System, method and storage medium for network |
EP3210132A1 (en) * | 2014-10-24 | 2017-08-30 | Google, Inc. | Neural machine translation systems with rare word processing |
CN107423290A (en) * | 2017-04-19 | 2017-12-01 | 厦门大学 | A kind of neural network machine translation model based on hierarchical structure |
CN107590138A (en) * | 2017-08-18 | 2018-01-16 | 浙江大学 | A kind of neural machine translation method based on part of speech notice mechanism |
CN107632981A (en) * | 2017-09-06 | 2018-01-26 | 沈阳雅译网络技术有限公司 | A kind of neural machine translation method of introducing source language chunk information coding |
Application Events
- 2018-04-12: Application CN201810326895.6A filed in China; published as CN108549644A; status: Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101187922A (en) * | 2006-11-17 | 2008-05-28 | Xu Zanguo | Precision machine translation method and its device |
CN107092666A (en) * | 2010-12-30 | 2017-08-25 | Facebook, Inc. | System, method and storage medium for network |
EP3210132A1 (en) * | 2014-10-24 | 2017-08-30 | Google, Inc. | Neural machine translation systems with rare word processing |
CN107423290A (en) * | 2017-04-19 | 2017-12-01 | Xiamen University | Neural network machine translation model based on hierarchical structure |
CN107590138A (en) * | 2017-08-18 | 2018-01-16 | Zhejiang University | Neural machine translation method based on a part-of-speech attention mechanism |
CN107632981A (en) * | 2017-09-06 | 2018-01-26 | Shenyang YaTrans Network Technology Co., Ltd. | Neural machine translation method incorporating source-language chunk information encoding |
Non-Patent Citations (2)
Title |
---|
Wang Longyue et al.: "Translating Pro-Drop Languages with Reconstruction Models", Proceedings of the AAAI Conference on Artificial Intelligence * |
Xiong Deyi et al.: "A Survey on Computational Semantic Compositionality", Journal of Chinese Information Processing * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220215177A1 (en) * | 2018-07-27 | 2022-07-07 | Beijing Jingdong Shangke Information Technology Co., Ltd. | Method and system for processing sentence, and electronic device |
CN109948166A (en) * | 2019-03-25 | 2019-06-28 | Tencent Technology (Shenzhen) Co., Ltd. | Text translation method, apparatus, storage medium and computer device |
WO2020197504A1 (en) * | 2019-03-28 | 2020-10-01 | Agency For Science, Technology And Research | A method for pre-processing a sequence of words for neural machine translation |
CN110598222A (en) * | 2019-09-12 | 2019-12-20 | Beijing Kingsoft Digital Entertainment Technology Co., Ltd. | Language processing method and device, and training method and device of language processing system |
CN110598222B (en) * | 2019-09-12 | 2023-05-30 | Beijing Kingsoft Digital Entertainment Technology Co., Ltd. | Language processing method and device, and training method and device of language processing system |
CN112257460A (en) * | 2020-09-25 | 2021-01-22 | Kunming University of Science and Technology | Pivot-based Chinese-Vietnamese joint-training neural machine translation method |
CN112257460B (en) * | 2020-09-25 | 2022-06-21 | Kunming University of Science and Technology | Pivot-based Chinese-Vietnamese joint-training neural machine translation method |
WO2022116841A1 (en) * | 2020-12-04 | 2022-06-09 | Beijing Youzhuju Network Technology Co., Ltd. | Text translation method, apparatus and device, and storage medium |
CN114595700A (en) * | 2021-12-20 | 2022-06-07 | Kunming University of Science and Technology | Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108549644A (en) | Dropped-pronoun translation method for neural machine translation | |
CN110334361B (en) | Neural machine translation method for the Chinese language | |
CN110678881B (en) | Natural language processing using context-specific word vectors | |
CN109117483B (en) | Training method and device of neural network machine translation model | |
Yi et al. | Automatic poetry generation with mutual reinforcement learning | |
Zhang et al. | Lattice transformer for speech translation | |
CN108733837B (en) | Natural language structuring method and device for medical history text | |
CN110134968B (en) | Poem generation method, device, equipment and storage medium based on deep learning | |
CN113158665B (en) | Method for improving dialog text generation based on text abstract generation and bidirectional corpus generation | |
CN106663092A (en) | Neural machine translation systems with rare word processing | |
CN108132932B (en) | Neural machine translation method with copy mechanism | |
US11574142B2 (en) | Semantic image manipulation using visual-semantic joint embeddings | |
CN108563622B (en) | Method and device for generating jueju (Chinese quatrains) with style diversity | |
Liu et al. | Qaner: Prompting question answering models for few-shot named entity recognition | |
Yoon et al. | Efficient transfer learning schemes for personalized language modeling using recurrent neural network | |
CN108460028A (en) | Domain adaptation method for neural machine translation incorporating sentence weights | |
CN107657313B (en) | System and method for transfer learning of natural language processing task based on field adaptation | |
CN113096242A (en) | Virtual anchor generation method and device, electronic equipment and storage medium | |
KR20200116760A (en) | Methods and apparatuses for embedding word considering contextual and morphosyntactic information | |
CN104933038A (en) | Machine translation method and machine translation device | |
CN115906815A (en) | Error correction method and device for modifying one or more types of wrong sentences | |
CN109359308A (en) | Machine translation method, device and readable storage medium | |
Mandal et al. | Futurity of translation algorithms for neural machine translation (NMT) and its vision | |
CN116136870A (en) | Intelligent social conversation method and conversation system based on enhanced entity representation | |
CN114282555A (en) | Translation model training method and device, and translation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 2018-09-18