CN108363704A - A neural machine translation corpus expansion method based on a statistical phrase table

Info

Publication number: CN108363704A
Application number: CN201810175915.4A
Authority: CN
Inventors: 黄河燕; 史学文; 鉴萍; 唐翼琨
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2018-03-02
Filing date: 2018-03-02
Publication date: 2018-08-03

Abstract

A neural machine translation corpus expansion method based on a statistical phrase table, belonging to the field of machine translation technology. For neural machine translation, the invention proposes a corpus expansion method based on a statistical phrase table that can effectively enlarge the corpus starting from the original machine translation training set. The method mainly comprises a training set extension stage and a model training stage. In stage one, a phrase table is learned from the original training set by a statistical machine learning method, filtered according to certain filtering rules, and merged with the original training set to form a new, extended training set. In stage two, the neural machine translation model is trained: it is first pre-trained on the extended training set and then trained again on the original training set for fine-tuning, yielding the final model. Experimental results show that, compared with a machine translation model trained without the corpus expansion method, the invention clearly improves the BLEU evaluation metric.

Description

A neural machine translation corpus expansion method based on a statistical phrase table

Technical field

The present invention relates to a neural machine translation corpus expansion method based on a statistical phrase table, and belongs to the field of computer applications and machine translation technology.

Background art

Machine translation is the technology of using a computer to automatically translate one language (the source language) into another language (the target language).

With the development of artificial neural networks and deep learning, neural network machine translation based on deep learning (hereinafter, neural machine translation) has achieved important results in recent years. Its advantages include little need for linguistic knowledge or manual intervention, a small model storage footprint, and fluent, natural-sounding output. For translation tasks with abundant bilingual resources, neural machine translation is generally considered the best choice. It has received wide attention and recognition in the machine translation field and has been put into commercial use.

Neural networks for translation are trained on bilingual parallel sentence pairs. The neural network models used in neural machine translation generally have a large number of free parameters and, in theory, require large-scale bilingual parallel corpora for training. Experience shows that a neural machine translation model with tens of millions of free parameters usually needs training data on the order of at least one million sentence pairs to achieve good results. For languages whose bilingual parallel resources are scarce, translation with neural networks rarely achieves satisfactory results.

In addition, neural machine translation is usually trained with one complete sentence pair, or a group of complete sentence pairs, as the unit. When corpus resources are scarce, the model's ability to learn the low-frequency phrases contained in the sentence pairs is limited, especially when those phrases are translated on their own.

Summary of the invention

Addressing the model training problem of neural machine translation for low-resource languages, the present invention proposes a neural machine translation corpus expansion method based on a statistical phrase table. The method can effectively extend the training data of a neural machine translation model and alleviate the adverse effect of scarce language resources on model training.

The present invention comprises a training set extension stage and a model training stage.

A) The training set extension stage operates as follows: a phrase table with probability scores is learned from the original training set by a statistical machine learning method; the learned phrase table is filtered according to rules; the filtered phrase table is extracted into a new data set of bilingual parallel phrase pairs; and the newly extracted data set is spliced with the original training set to obtain new bilingual parallel pseudo-data, thereby extending the training set;

B) The model training stage is divided into two steps. Step 1 is pre-training: the model is pre-trained on the bilingual parallel pseudo-data obtained in stage A), yielding the pre-trained model b₁. Step 2 trains model b₁ again on the original training set, yielding model b₂; the purpose is to fine-tune the model and alleviate the influence of the noise introduced by the pseudo-data;

To achieve the above purpose, the technical solution adopted by the present invention is as follows:

The relevant definitions are given first:

Definition 1: Source language, i.e., in machine translation, the language of the content to be translated; for example, in Chinese-to-English machine translation, Chinese is the source language;

Definition 2: Source-language data, i.e., data belonging to the source language; if the source-language data is a natural-language sentence, it is called a source-language sentence; for example, in Chinese-to-English machine translation, the input Chinese sentence is source-language data, also called a source-language sentence;

A collection of source-language data is called a source-language data set;

Definition 3: Target language, i.e., in machine translation, the language of the content that is translated into; for example, in Chinese-to-English machine translation, English is the target language;

Definition 4: Target-language data, i.e., data belonging to the target language; if the target-language data is a natural-language sentence, it is called a target-language sentence; for example, in Chinese-to-English machine translation, the output English sentence is target-language data, also called a target-language sentence;

A collection of target-language data is called a target-language data set;

Definition 5: Training set, here specifically the training set of a statistical machine translation model, i.e., the data set used to train the statistical machine translation model, denoted T;

Definition 6: Original training set, i.e., the training set before extension;

Definition 7: Word alignment information, word alignment for short, i.e., the alignment relations between source-language words and target-language words in training set T, denoted α;

where, if the j-th source-language word and the i-th target-language word in training set T are aligned, the relation is denoted (j, i);

Definition 8: Phrase, a linguistic unit composed of one or more words;

A phrase in the source language is called a source-language phrase, denoted f; a phrase in the target language is called a target-language phrase, denoted e;

Definition 9: Translation phrase pair, a pair consisting of a source-language phrase and the target-language phrase aligned with it, for example ('长城', 'The Great Wall');

Definition 10: Forward phrase translation probability, i.e., the conditional probability of translating into target-language phrase e given source-language phrase f, denoted [symbol missing];

Definition 11: Reverse phrase translation probability, i.e., the conditional probability of translating back into source-language phrase f given target-language phrase e, denoted [symbol missing];

Definition 12: Bidirectional phrase translation probability, the collective name for the forward and reverse phrase translation probabilities;

Definition 13: Forward lexical phrase translation probability, i.e., the lexical translation probability of translating into target-language phrase e given source-language phrase f, denoted lex(e | f);

Definition 14: Reverse lexical phrase translation probability, i.e., the lexical translation probability of translating back into source-language phrase f given target-language phrase e, denoted lex(f | e);

Definition 15: Bidirectional lexical phrase translation probability, the collective name for the forward and reverse lexical translation probabilities;

Definition 16: Phrase table, also called phrase translation table, consisting of multiple translation phrase pairs, each annotated with its bidirectional phrase translation probability and bidirectional lexical translation probability;

Definition 17: Filtering rule, i.e., a manually formulated rule for filtering the phrase table based on the information it contains: source-language phrases, target-language phrases, bidirectional phrase translation probabilities, and bidirectional lexical phrase translation probabilities;
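
As an illustration of Definitions 7 to 16, the following Python sketch models one phrase-table record. It assumes the plain-text phrase-table layout written by Moses (fields separated by `|||`, four scores ordered as reverse phrase probability, reverse lexical probability, forward phrase probability, forward lexical probability, and 0-based alignment points written as `j-i`). The names `PhraseEntry` and `parse_phrase_table_line` and the example scores are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class PhraseEntry:
    source: str            # source-language phrase f
    target: str            # target-language phrase e
    p_f_given_e: float     # reverse phrase translation probability
    lex_f_given_e: float   # reverse lexical translation probability, lex(f | e)
    p_e_given_f: float     # forward phrase translation probability
    lex_e_given_f: float   # forward lexical translation probability, lex(e | f)
    alignment: FrozenSet[Tuple[int, int]]  # word alignment points (j, i)


def parse_phrase_table_line(line: str) -> PhraseEntry:
    """Parse one record, assuming the Moses-style '|||' field layout."""
    fields = [part.strip() for part in line.split("|||")]
    source, target, scores = fields[0], fields[1], fields[2].split()
    points = fields[3].split() if len(fields) > 3 else []
    alignment = frozenset(
        (int(j), int(i)) for j, i in (p.split("-") for p in points)
    )
    return PhraseEntry(
        source, target,
        float(scores[0]), float(scores[1]),
        float(scores[2]), float(scores[3]),
        alignment,
    )


# Made-up scores for illustration only.
entry = parse_phrase_table_line(
    "长城 ||| The Great Wall ||| 0.8 0.6 0.9 0.7 ||| 0-0 0-1 0-2"
)
```

Keeping the four scores as separate named fields makes a later filtering rule (Definition 17) a simple comparison against thresholds.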

The training set extension stage comprises the following steps:

Step A1: according to Definitions 1, 2, 3, 4, and 5, pre-process the original training set to obtain the pre-processed original training set T_f;

The specific pre-processing procedure differs with the source and target languages; its purpose is to normalize the training set and obtain the pre-processed original training set T_f;

Step A2: based on the pre-processed original training set T_f obtained in step A1, and according to Definitions 7 and 8, learn the word alignment information. This is usually done with an open-source word alignment toolkit: the pre-processed original training set obtained in step A1 is taken as input, and training the word alignment tool yields the word alignment information α of the training set;

Step A3: according to Definitions 6 through 16, and combining the pre-processed original training set T_f obtained in step A1 with the word alignment information α obtained in step A2, extract translation phrase pairs and estimate their probabilities to obtain the bidirectional phrase translation probability and bidirectional lexical translation probability of each pair; combining the translation phrase pairs with their translation probabilities yields the phrase table, each record of which consists of a translation phrase pair, word alignment information, the bidirectional phrase translation probability, and the bidirectional lexical translation probability;

Step A4: according to Definitions 9, 12, 15, 16, and 17, filter the phrase table obtained in step A3 using the manually defined filtering rules, removing translation phrase pairs with low probability, to obtain the filtered phrase table, denoted P_new;

Step A5: according to Definitions 5 and 16, splice the translation phrase pairs of the filtered phrase table P_new obtained in step A4 with the pre-processed original training set T_f obtained in step A1 to obtain the new training set T_new;

Steps A1 through A5 complete the training set extension stage of the method;
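
Steps A2 and A3 can be illustrated with a small sketch. The code below assumes the textbook consistency criterion from phrase-based statistical machine translation (a phrase pair is kept when at least one alignment point lies inside it and no alignment point links a word inside the pair to a word outside it) and a plain relative-frequency estimate of the forward phrase translation probability. The function names, the `max_len` limit, the 0-based indices, and the toy Chinese-English sentence are illustrative assumptions; the embodiment delegates these details to open-source tools, and lexical probabilities are omitted here.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Alignment = Set[Tuple[int, int]]  # (j, i): source word j aligned to target word i


def extract_phrase_pairs(src: List[str], tgt: List[str],
                         alignment: Alignment,
                         max_len: int = 4) -> List[Tuple[str, str]]:
    """Return all phrase pairs that are consistent with the word alignment."""
    pairs = []
    for j1 in range(len(src)):
        for j2 in range(j1, min(j1 + max_len, len(src))):
            for i1 in range(len(tgt)):
                for i2 in range(i1, min(i1 + max_len, len(tgt))):
                    inside = [(j, i) for (j, i) in alignment
                              if j1 <= j <= j2 and i1 <= i <= i2]
                    # A point with only one end inside the box breaks consistency.
                    crossing = [(j, i) for (j, i) in alignment
                                if (j1 <= j <= j2) != (i1 <= i <= i2)]
                    if inside and not crossing:
                        pairs.append((" ".join(src[j1:j2 + 1]),
                                      " ".join(tgt[i1:i2 + 1])))
    return pairs


def forward_phrase_probabilities(
        pairs: List[Tuple[str, str]]) -> Dict[Tuple[str, str], float]:
    """Relative-frequency estimate: count(f, e) / count(f)."""
    pair_counts = Counter(pairs)
    source_counts = Counter(f for f, _ in pairs)
    return {(f, e): c / source_counts[f] for (f, e), c in pair_counts.items()}


sentence_src = ["我", "爱", "长城"]
sentence_tgt = ["I", "love", "the", "Great", "Wall"]
alpha = {(0, 0), (1, 1), (2, 3), (2, 4)}  # "the" is left unaligned
pairs = extract_phrase_pairs(sentence_src, sentence_tgt, alpha)
probs = forward_phrase_probabilities(pairs)
```

Because "the" is unaligned in this toy example, both ("长城", "Great Wall") and ("长城", "the Great Wall") are extracted, which is one way low-quality pairs can enter the table and why a filtering step such as A4 is useful.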

The model training stage comprises the following steps:

Step B1: pre-train the model on the new training set T_new obtained in step A5 to obtain model b₁;

Step B2: using the pre-processed original training set T_f obtained in step A1, train the model b₁ obtained in step B1 again to obtain the newly trained model b₂;

Steps B1 and B2 complete the model training stage of the method;
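
The two-step schedule of steps B1 and B2 can be sketched as follows. This is only an outline under stated assumptions: `two_stage_training` and `toy_train` are hypothetical names, and `toy_train` is a stand-in that merely records how much data it saw instead of training a real neural machine translation model.

```python
from typing import Callable, List, Tuple

SentencePair = Tuple[str, str]


def two_stage_training(model, t_new: List[SentencePair],
                       t_f: List[SentencePair], train: Callable):
    """Pre-train on the extended set, then train again on the original set."""
    b1 = train(model, t_new)  # step B1: pre-training on T_new
    b2 = train(b1, t_f)       # step B2: continue training b1 on T_f only
    return b1, b2


def toy_train(model, data):
    """Toy stand-in for a real trainer: appends the size of the data seen."""
    return model + [len(data)]


b1, b2 = two_stage_training([], [("a", "b")] * 5, [("a", "b")] * 2, toy_train)
```

The point of the ordering is that the final parameter updates come only from original sentence pairs, so noise from the phrase-pair pseudo-data influences the model mainly through initialization.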

Steps A1 through A5 and steps B1 and B2 together complete the neural machine translation corpus expansion method based on a statistical phrase table.

Advantageous effects

Compared with existing ways of using machine translation training sets, the neural machine translation corpus expansion method based on a statistical phrase table of the present invention has the following advantages:

1. The invention provides a neural machine translation corpus expansion method based on a statistical phrase table. Without requiring additional bilingual or monolingual data, the method effectively extends the original training set and alleviates the adverse effect that the small training sets of low-resource languages have on the training of neural machine translation models.

2. With identical training, development, and test data, the invention clearly improves the BLEU evaluation metric compared with a neural machine translation model training method that does not use the invention.

Description of the drawings

Fig. 1 is a flowchart of the neural machine translation corpus expansion method based on a statistical phrase table of the present invention and of its embodiment.

Detailed description of the embodiments

The method of the invention is described in detail below with reference to the accompanying drawing and embodiments. The two main stages of the invention, 1) the training set extension stage and 2) the model training stage, are described separately.

Embodiment 1

This embodiment describes the flow of the method of the invention and a specific implementation of it.

Fig. 1 is a flowchart of the neural machine translation corpus expansion method based on a statistical phrase table of the present invention as applied in this embodiment.

Fig. 1 shows the operating flow of the two stages of the invention: 1) the training set extension stage and 2) the model training stage.

Take Uyghur-to-Chinese translation as an example, where Uyghur is the source language and Chinese is the target language.

1) Training set extension stage:

Step 1: according to Definitions 1, 2, 3, 4, and 5, pre-process the original training set. The specific pre-processing procedure differs with the source and target languages; its purpose is to normalize the training set. Here, the Uyghur source-language data and the Chinese target-language data are pre-processed by first performing word-piece segmentation and then tokenization, yielding the pre-processed original training set T_f;

Step 2: according to Definitions 6 and 7, learn the word alignment. In this embodiment, this is done with the open-source word alignment toolkit GIZA++: the pre-processed original training set obtained in step 1 is taken as input, and training GIZA++ yields the word alignment information α of the training set;

Step 3: according to Definitions 6 through 16, and combining the pre-processed original training set T_f obtained in step 1 with the word alignment information α obtained in step 2, extract translation phrase pairs and estimate their probabilities. In this embodiment, the train-model.perl script of the open-source Moses toolkit is used for this, yielding the phrase table P, each record of which consists of a translation phrase pair, word alignment information, the bidirectional phrase translation probability, and the bidirectional lexical translation probability;

Step 4: according to Definitions 9, 12, 15, 16, and 17, filter the phrase table obtained in step 3 using manually defined filtering rules. The manually defined rule is as follows:

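A translation phrase pair is retained if and only if its forward phrase translation probability and its reverse phrase translation probability each satisfy their threshold conditions [conditions missing], and lex(e | f) ≥ 0.025, and lex(f | e) ≥ 0.025;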

Translation phrase pairs with lower probability are filtered out, yielding the filtered new phrase table P_new;

Step 5: according to Definitions 5 and 16, splice the translation phrase pairs of the filtered new phrase table P_new obtained in step 4 with the pre-processed original training set T_f obtained in step 1 to obtain the new training set T_new;
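
Steps 4 and 5 can be sketched in a few lines of Python. The lexical threshold of 0.025 comes from the rule above; the thresholds on the two phrase translation probabilities are not legible in this text, so they are left as required parameters (`min_p_e_given_f`, `min_p_f_given_e`), and the value 0.1 in the usage lines is a placeholder chosen for illustration only. `Entry`, `filter_phrase_table`, and `extend_training_set` are hypothetical names.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Entry(NamedTuple):
    source: str
    target: str
    p_e_given_f: float    # forward phrase translation probability
    p_f_given_e: float    # reverse phrase translation probability
    lex_e_given_f: float  # lex(e | f)
    lex_f_given_e: float  # lex(f | e)


def filter_phrase_table(table: Iterable[Entry],
                        min_p_e_given_f: float,
                        min_p_f_given_e: float,
                        min_lex: float = 0.025) -> List[Entry]:
    """Keep an entry only if all four probabilities reach their thresholds."""
    return [e for e in table
            if e.p_e_given_f >= min_p_e_given_f
            and e.p_f_given_e >= min_p_f_given_e
            and e.lex_e_given_f >= min_lex
            and e.lex_f_given_e >= min_lex]


def extend_training_set(t_f: List[Tuple[str, str]],
                        p_new: Iterable[Entry]) -> List[Tuple[str, str]]:
    """Splice the phrase pairs of P_new onto the original sentence pairs T_f."""
    return list(t_f) + [(e.source, e.target) for e in p_new]


# Made-up entries; 0.1 is a placeholder threshold, not a value from the patent.
table = [
    Entry("长城", "the Great Wall", 0.6, 0.5, 0.3, 0.2),
    Entry("长城", "wall", 0.05, 0.4, 0.01, 0.3),
]
p_new = filter_phrase_table(table, min_p_e_given_f=0.1, min_p_f_given_e=0.1)
t_new = extend_training_set([("src sentence", "tgt sentence")], p_new)
```

Only the source and target phrases are carried into T_new; the scores are used for filtering and then discarded, since the neural model is trained on plain parallel text.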

2) The model training stage proceeds as follows:

Step 6: pre-train the model. This embodiment uses the open-source neural machine translation model Tensor2Tensor; the model is pre-trained on the new training set T_new obtained in step 5, yielding model b₁;

Step 7: using the pre-processed original training set T_f obtained in step 1, train the model b₁ obtained in step 6 again to obtain the newly trained model b₂;

Steps 1 through 7 complete the neural machine translation corpus expansion method based on a statistical phrase table.

Embodiment 2

The training set of the Uyghur-Chinese news translation task provided by CWMT2017 was randomly split into a training set, a development set, and test set 1. In addition, the development set of the CWMT2017 Uyghur-Chinese news translation evaluation task was used as test set 2. With the original training set, development set, test set data, and neural machine translation model held identical, the invention was compared with a neural machine translation model training method that does not use the invention, using Chinese-character-based BLEU as the evaluation metric, giving the following experimental results.

Table 1. Comparison of BLEU values before and after applying the training set extension method proposed by the present invention

The results in Table 1 show that, with identical training, development, and test data, the method of the invention clearly improves the BLEU evaluation metric compared with a neural machine translation model training method that does not use the invention.
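
For readers unfamiliar with character-based BLEU, the sketch below shows one way such a score can be computed for Chinese output. It is a minimal illustration, assuming a single reference per sentence, n-grams up to length 4, the standard brevity penalty, and no smoothing; the patent does not state which BLEU tool or settings were used, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def char_tokens(sentence: str) -> List[str]:
    """Split a Chinese sentence into characters, dropping whitespace."""
    return [ch for ch in sentence if not ch.isspace()]


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str],
                max_n: int = 4) -> float:
    """Corpus-level BLEU (0 to 100), one reference per sentence, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = char_tokens(hyp), char_tokens(ref)
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Counter intersection gives clipped n-gram matches.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


perfect = corpus_bleu(["我爱长城的风景"], ["我爱长城的风景"])
disjoint = corpus_bleu(["完全不同"], ["我爱长城"])
```

Scoring on characters avoids any dependence on a Chinese word segmenter, so two systems can be compared without their scores being affected by segmentation differences.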

The above is a preferred embodiment of the present invention, and the invention should not be limited to what is disclosed in this embodiment and the drawing. Any equivalent or modification made without departing from the spirit disclosed by the invention falls within the scope of protection of the invention.

Claims

1. A neural machine translation corpus expansion method based on a statistical phrase table, characterized by comprising a training set extension stage and a model training stage;

wherein A) the training set extension stage operates as follows: a phrase table with probability scores is learned from the original training set by a statistical machine learning method; the learned phrase table is filtered according to rules; the filtered phrase table is extracted into a new data set of bilingual parallel phrase pairs; and the newly extracted data set is spliced with the original training set to obtain new bilingual parallel pseudo-data, thereby extending the training set;

B) the model training stage is divided into two steps: step 1 is pre-training, in which the model is pre-trained on the bilingual parallel pseudo-data obtained in stage A), yielding the pre-trained model b₁; step 2 trains model b₁ again on the original training set, yielding model b₂, the purpose being to fine-tune the model and alleviate the influence of the noise introduced by the pseudo-data.

2. The neural machine translation corpus expansion method based on a statistical phrase table according to claim 1, characterized in that the following technical solution is adopted:

The relevant definitions are given first:

Definition 1: Source language, i.e., in machine translation, the language of the content to be translated; for example, in Chinese-to-English machine translation, Chinese is the source language;

Definition 2: Source-language data, i.e., data belonging to the source language; if the source-language data is a natural-language sentence, it is called a source-language sentence; for example, in Chinese-to-English machine translation, the input Chinese sentence is source-language data, also called a source-language sentence;

Definition 3: Target language, i.e., in machine translation, the language of the content that is translated into; for example, in Chinese-to-English machine translation, English is the target language;

Definition 4: Target-language data, i.e., data belonging to the target language; if the target-language data is a natural-language sentence, it is called a target-language sentence; for example, in Chinese-to-English machine translation, the output English sentence is target-language data, also called a target-language sentence;

Definition 5: Training set, here specifically the training set of a statistical machine translation model, i.e., the data set used to train the statistical machine translation model, denoted T;

Definition 6: Original training set, i.e., the training set before extension;

Definition 7: Word alignment information, word alignment for short, i.e., the alignment relations between source-language words and target-language words in training set T, denoted α;

where, if the j-th source-language word and the i-th target-language word in training set T are aligned, the relation is denoted (j, i);

Definition 8: Phrase, a linguistic unit composed of one or more words;

A phrase in the source language is called a source-language phrase, denoted f; a phrase in the target language is called a target-language phrase, denoted e;

Definition 9: Translation phrase pair, a pair consisting of a source-language phrase and the target-language phrase aligned with it, for example ('长城', 'The Great Wall');

Definition 10: Forward phrase translation probability, i.e., the conditional probability of translating into target-language phrase e given source-language phrase f, denoted [symbol missing];

Definition 11: Reverse phrase translation probability, i.e., the conditional probability of translating back into source-language phrase f given target-language phrase e, denoted [symbol missing];

Definition 12: Bidirectional phrase translation probability, the collective name for the forward and reverse phrase translation probabilities;

Definition 13: Forward lexical phrase translation probability, i.e., the lexical translation probability of translating into target-language phrase e given source-language phrase f, denoted lex(e | f);

Definition 14: Reverse lexical phrase translation probability, i.e., the lexical translation probability of translating back into source-language phrase f given target-language phrase e, denoted lex(f | e);

Definition 15: Bidirectional lexical phrase translation probability, the collective name for the forward and reverse lexical translation probabilities;

Definition 16: Phrase table, also called phrase translation table, consisting of multiple translation phrase pairs, each annotated with its bidirectional phrase translation probability and bidirectional lexical translation probability;

Definition 17: Filtering rule, i.e., a manually formulated rule for filtering the phrase table based on the information it contains: source-language phrases, target-language phrases, bidirectional phrase translation probabilities, and bidirectional lexical phrase translation probabilities;

The training set extension stage comprises the following steps:

Step A1: according to Definitions 1, 2, 3, 4, and 5, pre-process the original training set to obtain the pre-processed original training set T_f;

Step A2: based on the pre-processed original training set T_f obtained in step A1, and according to Definitions 7 and 8, learn the word alignment information; this is usually done with an open-source word alignment toolkit: the pre-processed original training set obtained in step A1 is taken as input, and training the word alignment tool yields the word alignment information α of the training set;

Step A3: according to Definitions 6 through 16, and combining the pre-processed original training set T_f obtained in step A1 with the word alignment information α obtained in step A2, extract translation phrase pairs and estimate their probabilities to obtain the bidirectional phrase translation probability and bidirectional lexical translation probability of each pair; combining the translation phrase pairs with their translation probabilities yields the phrase table, each record of which consists of a translation phrase pair, word alignment information, the bidirectional phrase translation probability, and the bidirectional lexical translation probability;

Step A4: according to Definitions 9, 12, 15, 16, and 17, filter the phrase table obtained in step A3 using the manually defined filtering rules, removing translation phrase pairs with low probability, to obtain the filtered phrase table, denoted P_new;

Step A5: according to Definitions 5 and 16, splice the translation phrase pairs of the filtered phrase table P_new obtained in step A4 with the pre-processed original training set T_f obtained in step A1 to obtain the new training set T_new.

3. The neural machine translation corpus expansion method based on a statistical phrase table according to claim 1, characterized in that the model training stage comprises the following steps:

Step B1: pre-train the model on the new training set T_new obtained in step A5 to obtain model b₁;

Step B2: using the pre-processed original training set T_f obtained in step A1, train the model b₁ obtained in step B1 again to obtain the newly trained model b₂.

4. The neural machine translation corpus expansion method based on a statistical phrase table according to claim 1, characterized in that, in step A1, the specific procedure for pre-processing the original training set differs with the source and target languages; its purpose is to normalize the training set and obtain the pre-processed original training set T_f.