CN107092594A - Graph-based bilingual recursive autoencoder
- Publication number: CN107092594A (application CN201710257714.4A; granted as CN107092594B)
- Authority: CN (China)
- Legal status: Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/44—Statistical methods, e.g. probability models
Abstract
A graph-based bilingual recursive autoencoder, relating to natural language processing based on deep learning. Bilingual phrases are extracted from a parallel corpus as training data, and the translation probabilities between the bilingual phrases are computed; paraphrase probabilities are computed with a pivot-based method; a semantic relation graph of the bilingual phrases is constructed; constraints are defined based on this graph; finally, the model objective function is quantified and the model parameters are trained. The goal is to learn better bilingual phrase embeddings. Whereas conventional methods fail to consider the richer semantic constraint relations present in natural language, the invention proposes a graph-based bilingual recursive autoencoder. The algorithm is well defined and clearly structured; it improves the learned bilingual phrase embeddings and thereby better serves natural language processing tasks. The semantic relation graph of the bilingual phrases is constructed first, and two implicit semantic constraints are defined through the graph structure; these are used to learn more accurate bilingual phrase embeddings, which can then be better applied in natural language processing tasks such as machine translation.
Description
Technical field
The present invention relates to natural language processing based on deep learning, and in particular to a graph-based bilingual recursive autoencoder.
Background technology
Natural language processing (NLP) is an important research direction of artificial intelligence within computer science. It studies how to enable efficient communication between humans and computers through natural language, and it is a discipline that unites linguistics, computer science, and mathematics.
This invention mainly concerns the construction of a graph-based bilingual recursive autoencoder and its use for modeling bilingual phrase embeddings. A neural network is a mathematical model that processes information in a manner loosely analogous to the synaptic connection structure of the brain. In recent years, NLP research based on neural networks has become the main trend of the field, and a great variety of network architectures have emerged. Among them, the recursive neural network (RecNN) is widely used in research on embedded representations of text. Following the tree topology of the text (a structure obtained either by minimizing the total reconstruction error or from a syntactic parse), the network performs bottom-up recursive merging of semantic representations, finally producing a semantic representation of the whole text. RecNNs have been widely applied in NLP tasks such as sentiment classification [1], paraphrase detection [2], parsing [3], and statistical machine translation [4][5]. The autoencoder (AE) was originally proposed as a dimensionality-reduction technique; nowadays, AEs are more often used to obtain higher-dimensional, meaningful representations. An AE consists of an encoder and a decoder: the input is encoded and then decoded back, and by minimizing the reconstruction error between input and output, a more accurate and meaningful semantic representation is obtained. Through continued development, many AE variants have appeared, such as the Denoising Auto-Encoder (DAE) and the Contractive Auto-Encoder (CAE).
In NLP, statistical methods have shown outstanding performance in tasks such as word segmentation, part-of-speech tagging, and syntactic parsing. However, compared with the linguistic phenomena that statistical models can describe, natural language in practical applications is far more complex, particularly in its many constraint relations. Graph models combine graph theory with statistical methods, applying graph-based reasoning within a probabilistic framework, and thus offer a feasible approach to describing the various complex constraint relations in natural language. Graph models have been widely applied in NLP tasks such as parsing [6][7][8].
In research on bilingual phrase embeddings, traditional methods [9][10][11][12] mainly consist of two steps: 1) a recursive autoencoder (RAE) is used to generate the embedded representation of each monolingual phrase; concretely, a binary tree structure for the phrase is built according to the principle of minimal total reconstruction error, semantic representations are then recursively merged along this tree, and the phrase embedding vector is generated; 2) as proposed in the bilingual recursive autoencoder (BRAE), the fact that the two phrases of a bilingual pair share the same meaning is exploited for mutual supervised training of the bilingual phrase embeddings. However, conventional methods consider only the reconstruction error and the semantic correspondence of the bilingual phrases during modeling, and fail to consider richer semantic constraint relations. Existing methods therefore remain deficient, and how to learn better bilingual phrase embeddings is still a problem worth studying.
Bibliography:
[1] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In Proc. of EMNLP, 2011.
[2] Richard Socher, Eric H. Huang, Jeffrey Pennington, Christopher D. Manning, and Andrew Y. Ng. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proc. of NIPS, pages 801–809, 2011.
[3] Richard Socher, Cliff Chiung-Yu Lin, Andrew Y. Ng, and Christopher D. Manning. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. In Proc. of ICML, 2011.
[4] Jiajun Zhang, Shujie Liu, Mu Li, Ming Zhou, and Chengqing Zong. Bilingually-constrained phrase embeddings for machine translation. In Proc. of ACL, pages 111–121, June 2014.
[5] Jinsong Su, Deyi Xiong, Biao Zhang, Yang Liu, Junfeng Yao, and Min Zhang. Bilingual correspondence recursive autoencoder for statistical machine translation. In Proc. of EMNLP, pages 1248–1258, September 2015.
[6] Andrei Alexandrescu and Katrin Kirchhoff. Graph-based learning for statistical machine translation. In Proc. of NAACL, pages 119–127, 2009.
[7] Shujie Liu, Chi-Ho Li, Mu Li, and Ming Zhou. Learning translation consensus with structured label propagation. In Proc. of ACL, pages 302–310, 2012.
[8] Xiaoning Zhu, Zhongjun He, Hua Wu, Haifeng Wang, Conghui Zhu, and Tiejun Zhao. Improving pivot-based statistical machine translation using random walk. In Proc. of EMNLP, pages 524–534, 2013.
[9] Jianfeng Gao, Xiaodong He, Wen-tau Yih, and Li Deng. Learning continuous phrase representations for translation modeling. In Proc. of ACL, pages 699–709, 2014.
[10] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. of EMNLP, pages 1724–1734, 2014.
[11] Shixiang Lu, Zhenbiao Chen, and Bo Xu. Learning new semi-supervised deep auto-encoder features for statistical machine translation. In Proc. of ACL, pages 122–132, 2014.
[12] Xing Wang, Deyi Xiong, and Min Zhang. Learning semantic representations for nonterminals in hierarchical phrase-based translation. In Proc. of EMNLP, pages 1391–1400, 2015.
The content of the invention
The object of the invention is to provide a graph-based bilingual recursive autoencoder.
The present invention comprises the following steps:
1) extracting bilingual phrases from a parallel corpus as training data, and computing the translation probabilities between the bilingual phrases;
2) computing paraphrase probabilities with a pivot-based method;
3) constructing the semantic relation graph of the bilingual phrases;
In step 3), the semantic relation graph of the bilingual phrases may be constructed as follows: the source-side and target-side phrases serve as nodes; for any source phrase and target phrase, if they form a phrase pair in the bilingual phrase corpus, an edge is constructed between them; the set of all nodes and the set of all edges constitute the semantic relation graph of the bilingual phrases;
4) defining constraints based on the semantic relation graph of the bilingual phrases;
In step 4), the constraints based on the semantic relation graph may be defined as follows: two implicit semantic constraints are defined; for two different nodes of the same language, if they connect to the same set of nodes of the other language, they are considered close to each other in the semantic space, which is constraint one; for any node of one language, the semantic correlation strength between it and its adjacent nodes of the other language should be closely related to the phrase translation probabilities obtained by maximum likelihood estimation, which is constraint two;
5) quantifying the model objective function and training the model parameters.
In step 5), the model objective function may be quantified and the model parameters trained as follows: the traditional bilingual recursive autoencoder comprises a monolingual reconstruction error and a bilingual alignment consistency score; on this basis, the two defined implicit semantic constraints are simultaneously applied to the bilingual phrases, introducing a monolingual paraphrase-similarity score and a bilingual translation-distribution consistency score.
The invention aims to learn better bilingual phrase embeddings. Addressing the failure of conventional methods to consider the richer semantic constraint relations in natural language, it proposes a graph-based bilingual recursive autoencoder. The algorithm is well defined and clearly structured; the method improves the learned bilingual phrase embeddings and thereby better serves natural language processing tasks.
The invention first constructs the semantic relation graph of the bilingual phrases and defines two implicit semantic constraints through the graph structure; these are used to learn more accurate bilingual phrase embeddings, which can then be better applied in natural language processing tasks such as machine translation.
The concrete idea of the invention is as follows:
Graph models can often describe the more complex constraint relations in natural language. On the basis of the traditional bilingual recursive autoencoder, the invention first constructs the semantic relation graph of the bilingual phrases, explores richer semantic knowledge through the graph structure, and further defines two implicit semantic constraints: for two different nodes of the same language, if they connect to the same set of nodes of the other language, they are considered close to each other in the semantic space (constraint one); for any node of one language, the semantic correlation strength between it and its adjacent nodes of the other language should be closely related to the phrase translation probabilities obtained by maximum likelihood estimation (constraint two). These constraints are finally applied to the extracted bilingual phrases, so that more accurate bilingual phrase embeddings are learned.
Brief description of the drawings
Fig. 1 shows the traditional BRAE model framework.
Fig. 2 shows an example of the semantic relation graph structure of the invention. In Fig. 2, panel (a) shows the constructed semantic relation graph of the bilingual phrases, panel (b) shows a subgraph example of implicit semantic constraint one, and panel (c) shows a subgraph example of implicit semantic constraint two; v_f nodes denote source phrases, v_e nodes denote target phrases, and a solid black line denotes an alignment relation between the connected source and target phrases, serving as an edge of the semantic relation graph.
Embodiment
The specific embodiment of the invention is as follows:
Step 1: extract bilingual phrases from a parallel corpus as training data, and compute the translation probabilities between the bilingual phrases.
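Step 1's maximum-likelihood translation probabilities can be sketched as relative frequencies over the extracted phrase pairs. This is a minimal illustration, not the patent's implementation; the phrase pairs below are invented toy data:

```python
from collections import Counter

def mle_translation_probs(phrase_pairs):
    """Relative-frequency estimates p(e|f) and p(f|e) from (source, target) pairs."""
    pair_count = Counter(phrase_pairs)
    src_count = Counter(f for f, _ in phrase_pairs)
    tgt_count = Counter(e for _, e in phrase_pairs)
    # p(e|f) = count(f, e) / count(f);  p(f|e) = count(f, e) / count(e)
    p_e_given_f = {(f, e): c / src_count[f] for (f, e), c in pair_count.items()}
    p_f_given_e = {(f, e): c / tgt_count[e] for (f, e), c in pair_count.items()}
    return p_e_given_f, p_f_given_e

# Toy extracted phrase pairs (duplicates stand in for corpus counts).
pairs = [("ruci yuan", "so far away"), ("ruci yuan", "so remote"),
         ("ruci yuan", "so far away"), ("zhijin", "up to now")]
p_ef, p_fe = mle_translation_probs(pairs)
```

For instance, `p_ef[("ruci yuan", "so far away")]` is 2/3, since two of the three occurrences of the source phrase align with that target phrase.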
Step 2: compute paraphrase probabilities with the pivot-based method.
Step 3: construct the semantic relation graph of the bilingual phrases. The source and target phrases serve as nodes; for any source phrase and target phrase, if they form a phrase pair in the bilingual phrase corpus, an edge is constructed between them. The set of all nodes and the set of all edges constitute the semantic relation graph of the bilingual phrases.
Step 4: define the two implicit semantic constraints based on the semantic relation graph of the bilingual phrases. For two different nodes of the same language, if they connect to the same set of nodes of the other language, they are considered close to each other in the semantic space (constraint one). For any node of one language, the semantic correlation strength between it and its adjacent nodes of the other language should be closely related to the phrase translation probabilities obtained by maximum likelihood estimation (constraint two).
Step 5: quantify the model objective function and train the model parameters. The traditional bilingual recursive autoencoder comprises a monolingual reconstruction error and a bilingual alignment consistency score. On this basis, the two implicit semantic constraints are simultaneously applied to the bilingual phrases, introducing a monolingual paraphrase-similarity score and a bilingual translation-distribution consistency score.
The implementation details of the key steps are described below:
1. Constructing the semantic relation graph of the bilingual phrases
Graph models combine graph theory with statistical methods, applying graph-based reasoning within a probabilistic framework, and offer a feasible approach to describing the various complex constraint relations in natural language. The graph-based bilingual recursive autoencoder proposed by the invention first requires constructing the semantic relation graph of the bilingual phrases. Panel (a) of Fig. 2 shows a part of the constructed semantic relation graph. First, all source phrases and target phrases serve as nodes of the graph; for example, the source phrase "so remote (ruci yuan)" and the target phrase "so far away" are both nodes. Then, for any source phrase and target phrase, if they form a phrase pair in the bilingual phrase corpus, an edge is constructed between them. For example, the source phrase "so remote (ruci yuan)" and the target phrase "so far away" are aligned phrases in the bilingual corpus, so there is an edge between nodes n_f1 and n_e1; the source phrase "so remote (ruci yuan)" and the target phrase "up to now" are not aligned, so there is no edge between nodes n_f1 and n_e4.
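The graph construction above can be sketched as a bipartite adjacency structure. This is an illustrative sketch, not the patent's implementation; the example phrases are toy data:

```python
def build_semantic_relation_graph(phrase_pairs):
    """Bipartite semantic relation graph: source and target phrases are nodes,
    aligned phrase pairs are edges (adjacency sets kept per side)."""
    src_adj, tgt_adj = {}, {}
    for f, e in phrase_pairs:
        src_adj.setdefault(f, set()).add(e)   # neighbors of source node f
        tgt_adj.setdefault(e, set()).add(f)   # neighbors of target node e
    return src_adj, tgt_adj

# Toy aligned phrase pairs from a bilingual phrase corpus.
pairs = [("ruci yuan", "so far away"), ("zhijin", "up to now"),
         ("zhijin", "so far")]
src_adj, tgt_adj = build_semantic_relation_graph(pairs)
```

Unaligned phrases simply have no edge: "ruci yuan" is not adjacent to "up to now" here, matching the n_f1/n_e4 example in the text.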
2. Defining the two implicit semantic constraints
The traditional bilingual recursive autoencoder comprises a monolingual reconstruction error and a bilingual alignment consistency score. Both are realized from explicit semantic information, which is insufficient for learning precise representations of bilingual phrases. The graph-based bilingual recursive autoencoder proposed here therefore draws on graph theory to explore deeper, implicit semantic constraint information, from which two implicit semantic constraints are defined.
For two different nodes of the same language, if they connect to the same set of nodes of the other language, they are considered close to each other in the semantic space; this is constraint one. For example, in panel (b) of Fig. 2, the source phrase nodes n_f2 and n_f3 both connect to the same target phrase set {n_e2, n_e3, n_e4}, so n_f2 and n_f3 are considered likely to be close in the semantic space. The paraphrase probability of two source phrases that share the same target phrase node set is computed by the pivot method from the translation probabilities p(f|e) and p(e|f_s), where f and f_s denote source phrases, e denotes a shared target phrase, and the translation probabilities are those computed by maximum likelihood estimation. A paraphrase weight between f and f_s is then defined from this probability. For any source phrase f, the weights of all paraphrase pairs containing it are computed, and only the paraphrase pair with maximum weight is retained. The weights of target-side paraphrase pairs are computed analogously to the source side. An error function for monolingual paraphrase similarity can then be defined from the Euclidean distances between the embeddings of the retained source-side and target-side paraphrase pairs, where θ denotes the model parameters.
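The pivot-based paraphrase probability is described only in terms of p(f|e) and p(e|f_s); the formula itself is not reproduced in this text, so the sketch below assumes the standard pivot formulation, summing p(f|e)·p(e|f_s) over the shared target phrases e. All names and numbers are illustrative:

```python
def pivot_paraphrase_prob(f, f_s, p_f_given_e, p_e_given_f, shared_targets):
    """Pivot-style paraphrase probability between source phrases f and f_s:
    sum over shared target phrases e of p(f|e) * p(e|f_s).
    An assumed reconstruction of the formula described in the text."""
    return sum(p_f_given_e.get((f, e), 0.0) * p_e_given_f.get((f_s, e), 0.0)
               for e in shared_targets)

# Toy translation probabilities over one shared pivot phrase "e1".
p_f_given_e = {("f1", "e1"): 0.5, ("f2", "e1"): 0.5}   # p(f|e)
p_e_given_f = {("f1", "e1"): 1.0, ("f2", "e1"): 1.0}   # p(e|f)
prob = pivot_paraphrase_prob("f1", "f2", p_f_given_e, p_e_given_f, {"e1"})
```

With these toy values the paraphrase probability of ("f1", "f2") is p(f1|e1)·p(e1|f2) = 0.5.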
For any node of one language, the semantic correlation strength between it and its adjacent nodes of the other language should be closely related to the phrase translation probabilities obtained by maximum likelihood estimation; this is constraint two. For example, in panel (c) of Fig. 2, the source node n_f3 connects to the target node set {n_e2, n_e3, n_e4}. From the phrase embeddings, the semantic correlation strengths from n_f3 to n_e2, n_e3, and n_e4 can each be computed, and a translation probability distribution is estimated from them; this distribution should agree as closely as possible with the distribution obtained by maximum likelihood estimation. A semantic correlation score over adjacent nodes is therefore defined, in which e and f denote a target phrase and a source phrase, x_e and x_f are their corresponding embedded representations, W and b are the corresponding weight matrix and bias, and tran(f) is the candidate translation set of f. The KL divergence is used to measure the similarity between the semantic-correlation-based translation distribution of a source phrase and the translation distribution obtained by maximum likelihood estimation:
E_tran(f, θ) = count(f) · KL(p(·|f) ‖ p_sc(·|f))
where p(·|f) is the translation distribution obtained by maximum likelihood estimation, p_sc(·|f) is the distribution derived from the semantic correlation scores, count(f) is the number of occurrences of f in the training corpus, and θ denotes the model parameters. The score for the target language is computed analogously. In this way, the error function for the consistency of the bilingual translation distributions is defined as:
E_tran(n_f, n_e, θ) = E_tran(f, θ) + E_tran(e, θ)
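The E_tran definition above can be sketched directly. The exact parametric form of the semantic correlation score and its normalization into p_sc(·|f) are only described in the text, so the sketch assumes a softmax over the scores; all numbers are toy values:

```python
import math

def softmax_translation_dist(scores):
    """Normalize semantic-correlation scores s(e|f) into p_sc(.|f) (assumed softmax)."""
    z = sum(math.exp(s) for s in scores.values())
    return {e: math.exp(s) / z for e, s in scores.items()}

def kl_divergence(p, q):
    """KL(p || q) over a shared support."""
    return sum(p[e] * math.log(p[e] / q[e]) for e in p if p[e] > 0)

def e_tran(count_f, p_mle, scores):
    """count(f) * KL(p(.|f) || p_sc(.|f)), as in the E_tran definition."""
    return count_f * kl_divergence(p_mle, softmax_translation_dist(scores))

# Toy data: equal correlation scores give a uniform p_sc, matching a uniform p(.|f).
p_mle = {"so far away": 0.5, "so remote": 0.5}
scores = {"so far away": 1.0, "so remote": 1.0}
loss = e_tran(3, p_mle, scores)
```

When the score-derived distribution matches the maximum-likelihood distribution, as here, the loss is zero, which is the behavior constraint two asks for.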
3. Model objective function and training method
The objective function of the model comprises four parts: the monolingual reconstruction error of the traditional RAE model, the bilingual alignment consistency error of the BRAE model, and the monolingual paraphrase-similarity score and translation-distribution consistency score based on the two implicit semantic constraints defined by the invention. They are described as follows:
1) The monolingual reconstruction error models the quality of the phrase embeddings. First, adjacent nodes are merged into a parent node. As in the dashed box of Fig. 1, for an input phrase p = (x1, x2, x3), a parent node y1 is generated from x1 and x2 as y1 = f(W^(1)[x1; x2] + b^(1)), where x1 and x2 are the word vector representations, f is the activation function (tanh is chosen here), and W^(1) and b^(1) are the corresponding weight matrix and bias. The parent node is then decoded to reconstruct the original child nodes, [x1'; x2'] = f(W^(2)y1 + b^(2)). Afterwards, the parent node y1 and the next child node x3 are combined to generate a new parent node y2, and this recursive combination and reconstruction continues until a representation of the whole phrase is generated. The quality of the phrase embedding is then modeled by the reconstruction error, summed over the set T(p) of parent nodes, where the cost function is the Euclidean distance between the representations of the initial words in the phrase and those of their reconstructions.
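One composition/reconstruction step of the RAE described above can be sketched in NumPy. The embedding dimension and random initialization are illustrative assumptions, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # embedding dimension (illustrative)

# Parameters of one RAE step (randomly initialized here for the sketch).
W1, b1 = rng.standard_normal((d, 2 * d)), np.zeros(d)      # encoder: 2d -> d
W2, b2 = rng.standard_normal((2 * d, d)), np.zeros(2 * d)  # decoder: d -> 2d

def compose(x1, x2):
    """Parent embedding y = tanh(W1 [x1; x2] + b1)."""
    return np.tanh(W1 @ np.concatenate([x1, x2]) + b1)

def reconstruction_error(x1, x2):
    """Euclidean reconstruction error of one merge, as in the RAE."""
    y = compose(x1, x2)
    recon = np.tanh(W2 @ y + b2)                      # reconstructed [x1'; x2']
    return float(np.linalg.norm(np.concatenate([x1, x2]) - recon))

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
err = reconstruction_error(x1, x2)
```

In training, the parent y would be merged with the next word vector x3 and the per-merge errors summed over T(p); here only a single merge is shown.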
2) The bilingual alignment consistency error models the loss of a bilingual phrase pair with respect to alignment consistency. As shown on the right of the dashed box in Fig. 1, it is defined over the source phrase f and the target phrase e, their corresponding phrase embeddings x_f and x_e, and the corresponding weight matrices and biases.
3) The error functions based on the implicit semantic constraints are as defined above.
The overall objective function J_GBRAE of the model is defined as:
J_GBRAE(n_f, n_e) = α·E_rec(n_f, n_e; θ) + β·E_sem(n_f, n_e; θ) + g·(E_syn(n_f, n_e; θ) + E_tran(n_f, n_e; θ))
where E_rec is the monolingual reconstruction error, E_sem the bilingual alignment consistency error, E_syn the monolingual paraphrase-similarity error, and E_tran the translation-distribution consistency error; α, β, and g are hyperparameters balancing the loss terms, with α + β + g = 1. To avoid overfitting, a parameter regularization term R(θ) over the model parameters θ is introduced, weighted by λ. The final objective of the model averages J_GBRAE over G, the set of all aligned phrase pairs, with N the number of aligned phrase pairs in the training corpus, and adds the regularization term.
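The final regularized objective is described in words above; its formula appears only as an image in the source. A plausible LaTeX reconstruction from the surrounding definitions, assuming λ weights the regularizer and the sum runs over the aligned phrase set G:

```latex
J(\theta) \;=\; \frac{1}{N} \sum_{(n_f,\, n_e) \in G} J_{GBRAE}(n_f, n_e; \theta) \;+\; \lambda\, R(\theta)
```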
This objective function not only retains the explicit semantic constraints of the traditional BRAE but also incorporates the two newly defined implicit semantic constraints, making full use of the various kinds of semantic constraint information, so as to learn more accurate bilingual phrase embeddings for use in subsequent applications such as machine translation.
Claims (4)
1. A graph-based bilingual recursive autoencoder, characterized by comprising the following steps:
1) extracting bilingual phrases from a parallel corpus as training data, and computing the translation probabilities between the bilingual phrases;
2) computing paraphrase probabilities with a pivot-based method;
3) constructing the semantic relation graph of the bilingual phrases;
4) defining constraints based on the semantic relation graph of the bilingual phrases;
5) quantifying the model objective function and training the model parameters.
2. The graph-based bilingual recursive autoencoder of claim 1, characterized in that in step 3), the semantic relation graph of the bilingual phrases is constructed as follows: the source and target phrases serve as nodes; for any source phrase and target phrase, if they form a phrase pair in the bilingual phrase corpus, an edge is constructed between them; the set of all nodes and the set of all edges constitute the semantic relation graph of the bilingual phrases.
3. The graph-based bilingual recursive autoencoder of claim 1, characterized in that in step 4), the constraints based on the semantic relation graph of the bilingual phrases are defined as follows: two implicit semantic constraints are defined; for two different nodes of the same language, if they connect to the same set of nodes of the other language, they are considered close to each other in the semantic space, which is constraint one; for any node of one language, the semantic correlation strength between it and its adjacent nodes of the other language should be closely related to the phrase translation probabilities obtained by maximum likelihood estimation, which is constraint two.
4. The graph-based bilingual recursive autoencoder of claim 1, characterized in that in step 5), the model objective function is quantified and the model parameters are trained as follows: the traditional bilingual recursive autoencoder comprises a monolingual reconstruction error and a bilingual alignment consistency score; on this basis, the two defined implicit semantic constraints are simultaneously applied to the bilingual phrases, introducing a monolingual paraphrase-similarity score and a bilingual translation-distribution consistency score.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201710257714.4A (CN107092594B) | 2017-04-19 | 2017-04-19 | Graph-based bilingual recursive autoencoder
Publications (2)
Publication Number | Publication Date
---|---
CN107092594A (application) | 2017-08-25
CN107092594B (grant) | 2019-07-09
Family
ID=59637832
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201710257714.4A (CN107092594B, not active, Expired - Fee Related) | Graph-based bilingual recursive autoencoder | 2017-04-19 | 2017-04-19
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN108363704A | 2018-03-02 | 2018-08-03 | 北京理工大学 | Neural network machine translation corpus expansion method based on a statistical phrase table
CN110334358A | 2019-04-28 | 2019-10-15 | 厦门大学 | Context-aware phrase representation learning method
CN110472252A | 2019-08-15 | 2019-11-19 | 昆明理工大学 | Chinese-Vietnamese neural machine translation method based on transfer learning
CN110489752A | 2019-08-14 | 2019-11-22 | 梁冰 | Semantic recursive representation system for natural language
CN111949792A | 2020-08-13 | 2020-11-17 | 电子科技大学 | Drug relation extraction method based on deep learning
WO2023071123A1 | 2021-10-29 | 2023-05-04 | 广东坚美铝型材厂(集团)有限公司 | Self-learning method for maximum-gap semantic features, computer device, and storage medium
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN103577399A | 2013-11-05 | 2014-02-12 | 北京百度网讯科技有限公司 | Method and device for extending data in bilingual corpora
CN103559181A | 2013-11-14 | 2014-02-05 | 苏州大学 | Method and system for building a bilingual semantic relation classification model
CN104268132A | 2014-09-11 | 2015-01-07 | 北京交通大学 | Machine translation method and system
CN105005554A | 2015-06-30 | 2015-10-28 | 北京信息科技大学 | Method for computing word semantic relatedness
Non-Patent Citations (1)
Title
---|
Su Jinsong et al., "A statistical machine translation model incorporating topic-based paraphrase knowledge", Journal of Zhejiang University (Engineering Science)
Legal Events
Code | Title | Description
---|---|---
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 2019-07-09; Termination date: 2020-04-19