CN109062910A - Sentence alignment method based on deep neural network - Google Patents
- Publication number: CN109062910A
- Application number: CN201810835723.1A
- Authority: CN (China)
- Prior art keywords: word, sentence, layer, neural network, information
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/45—Example-based machine translation; Alignment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Machine Translation (AREA)
Abstract
A sentence alignment method based on a deep neural network. A bidirectional recurrent neural network layer encodes each sentence, taking into account not only the semantic information of each word itself but also its context, so that every word obtains a hidden state that includes its contextual information. A gate related network layer computes the semantic-relation information between word pairs across the two sentences: taking the hidden states produced by the bidirectional recurrent network as input, it uses a network in which a bilinear model and a single-layer neural network are merged by a gating mechanism, capturing word-pair similarity from two angles, linear relations and nonlinear relations, and then applies a max-pooling operation to keep the most informative parts. Two mutually translated sentences contain mostly mutually translated words, and conventional methods likewise use word-level information for the alignment decision; the present invention, however, captures the semantic-relation features between word pairs without any additional dictionary, and also yields a word-pair similarity matrix.
Description
Technical field
The present invention relates to a sentence alignment method based on neural networks.
Background art
A parallel corpus is one of the most important resources for natural language processing; many tasks, such as machine translation, cross-language information retrieval, and bilingual dictionary construction, require the support of parallel corpora. The sentence alignment task is to extract mutually translated parallel sentence pairs from documents in two different languages, in order to enlarge parallel corpora and thereby alleviate the problem of small-scale parallel data.
Early research on sentence alignment was mainly based on feature matching. Such methods consider only surface information between the bilingual sentences, i.e., they judge whether two sentences are aligned according to the length relation between them. Later, based on word-pair relations in parallel sentences, many researchers proposed dictionary-based methods, which judge alignment from the relation between the number of mutually translated word pairs and the total number of words in each sentence. Other methods combine sentence-length information with word information, add further features or heuristic strategies, or translate both sentences into the same language and compare the translation with the other sentence to judge alignment. In recent years, with the deepening of deep learning research, neural network methods have also achieved notable results on the sentence alignment task.
Sentence alignment is a basic task in natural language processing. At present, it is treated as a classification task with two classes, aligned or not aligned, and the final parallel sentence pairs are extracted through an alignment optimization strategy.
Early sentence alignment methods used statistical approaches: features such as sentence length and the number of mutually translated words were obtained by statistical methods, and the alignment decision was made from these feature values together with a formulated alignment strategy. Some methods also use features such as punctuation marks and positional relations to improve alignment performance. Other sentence alignment work first uses an existing alignment tool to extract candidate parallel sentence pairs from the corpus and then refines them with rule-based features to find the optimal alignment. Most of these approaches are trained in an unsupervised or semi-supervised manner.
As shown in Figure 1, with the continuous development of deep learning in recent years, deep neural networks have been applied successfully to the sentence alignment task. In general they require supervised training on a reference corpus, i.e., the classifier parameters are tuned on a series of sentence pairs of known class (parallel and non-parallel) until optimal performance is reached. One existing neural approach encodes the two sentences with a bidirectional recurrent neural network to obtain the hidden state of each word, concatenates the hidden states of the last word and the first word as the hidden state of the whole sentence, adds the element-wise product and the absolute difference of the two sentence states, computes their similarity through the hyperbolic tangent function tanh, obtains a more abstract representation through a fully connected layer, and finally computes the classification probability with the S-shaped sigmoid function. Because this method compresses each sentence into a single vector, it cannot capture the word-level information inside the sentences well; yet word information is a key factor for judging whether sentences are aligned, so comparing only two sentence vectors easily loses important matching information.
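The baseline matching step just described can be sketched as follows. This is a minimal numpy sketch: the weight shapes and the exact way the product and absolute difference are combined are assumptions, since the text only names the operations.

```python
import numpy as np

def sentence_state(word_states):
    # Concatenate the hidden states of the last and first words as the
    # hidden state of the whole sentence.
    return np.concatenate([word_states[-1], word_states[0]])

def baseline_probability(src_states, tgt_states, W, b):
    hs = sentence_state(src_states)
    ht = sentence_state(tgt_states)
    # Element-wise product plus absolute difference, passed through tanh,
    # then a fully connected layer and a sigmoid classification probability.
    features = np.tanh(hs * ht + np.abs(hs - ht))
    logit = W @ features + b
    return 1.0 / (1.0 + np.exp(-logit))
```

Note that `features` never looks at individual word pairs, which is exactly the weakness the text points out.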
As shown in Fig. 2, another line of work compares the similarity between word pairs and then captures the most useful information with a convolutional neural network for classification. For each word pair across the two sentences, cosine similarity and Euclidean distance are computed from the word embeddings, yielding an m*n similarity matrix, where m and n are the source and target sentence lengths respectively; a convolutional neural network then extracts the most useful information from this matrix, and finally the S-shaped sigmoid function outputs the class probability. By capturing matching information between the sentence pair through word-pair similarities, this method improves sentence alignment performance.
However, sentence alignment requires not only that an aligned pair contain many mutually translated word pairs but also that the pair be semantically consistent. Computing word-pair similarity from word embeddings alone may lose the contextual information of the sentence, while deciding alignment only from sentence vectors encoded by a bidirectional recurrent network easily loses word-level matching information.
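The word-pair similarity matrix of the convolutional baseline can be sketched as follows. The text does not state how the cosine and Euclidean values are combined into one score, so returning both as separate m*n matrices is an assumption here.

```python
import numpy as np

def word_pair_similarities(src_emb, tgt_emb):
    # src_emb: (m, d) source word embeddings; tgt_emb: (n, d) target ones.
    # Returns two m*n matrices: cosine similarity and Euclidean distance
    # for every word pair across the two sentences.
    m, n = len(src_emb), len(tgt_emb)
    cos = np.zeros((m, n))
    euc = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            s, t = src_emb[i], tgt_emb[j]
            cos[i, j] = s @ t / (np.linalg.norm(s) * np.linalg.norm(t) + 1e-8)
            euc[i, j] = np.linalg.norm(s - t)
    return cos, euc
```

A convolutional network would then be run over these matrices; note the inputs are raw embeddings, so sentence context is indeed invisible at this stage.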
Summary of the invention
The technical problem to be solved by the present invention is to overcome the shortcomings of the prior art by providing a sentence alignment method based on a deep neural network. Two mutually translated sentences contain mostly mutually translated words, and conventional methods likewise use word-level information for the alignment decision; however, they need dictionary information to judge whether a word appears in the dictionary, whereas the present invention captures the semantic-relation features between word pairs without any additional dictionary. Taking the word hidden states produced by the bidirectional recurrent neural network layer as input, the present invention uses a network in which a bilinear model and a single-layer neural network are merged by a gating mechanism, capturing word-pair relations from both the linear and the nonlinear angle to obtain a word-pair similarity matrix.
A further technical problem to be solved is that conventional methods generally consider only lexical information, whereas two mutually translated sentences must also be consistent in sentence meaning. On the basis of the above, the present invention additionally encodes each sentence to obtain a representation of the whole sentence, capturing its semantic features.
The technical solution of the present invention is a sentence alignment method based on a deep neural network, comprising the following steps. Corpus preprocessing: generate a vocabulary and a word embedding vocabulary from the training corpus.
1) Word embedding layer: for each word in a sentence, look up its corresponding word embedding in the embedding table.
2) Bidirectional recurrent neural network layer: encode the sentence, taking into account not only the semantic information of each word itself but also its context, so that each word obtains a hidden state that includes its contextual information.
3) Gate related network layer: compute the semantic-relation information between word pairs across the two sentences. Taking the hidden states obtained by the bidirectional recurrent network as input, use the network in which a bilinear model and a single-layer neural network are merged by a gating mechanism, i.e., the gate related network, to capture word-pair similarity from the two angles of linear and nonlinear relations, and then capture the most informative parts with a max-pooling operation.
4) Perceptron layer: feed the result of the gate related network layer, as a vector representation, into a perceptron, which produces a more abstract representation for judging sentence alignment.
5) Compute the probability that the two sentences are aligned, judge alignment according to whether this probability exceeds a set value, and then extract the aligned sentences from the two documents.
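Steps 3) to 5) can be sketched end to end as follows. A plain dot product stands in for the full gate related network score and an untrained single-weight classifier stands in for the perceptron, so this is only a shape-level illustration of the pipeline, not the trained model.

```python
import numpy as np

def align_probability(src_states, tgt_states, p1=2, p2=2, threshold=0.5, seed=0):
    # Step 3: word-pair score matrix (dot product as a stand-in for the
    # gate related network), followed by p1*p2 max pooling.
    sim = src_states @ tgt_states.T
    m, n = sim.shape
    sim = sim[:m - m % p1, :n - n % p2]
    pooled = sim.reshape(m // p1, p1, n // p2, p2).max(axis=(1, 3))
    v = pooled.ravel()
    # Step 4: a toy, untrained perceptron over the pooled vector.
    W = np.random.default_rng(seed).normal(scale=0.1, size=v.size)
    # Step 5: sigmoid probability and threshold decision.
    prob = 1.0 / (1.0 + np.exp(-(W @ v)))
    return prob, int(prob > threshold)
```

The pooled matrix here has shape (m//p1, n//p2), matching the (m/p1) x (n/p2) size given later for the gate related network output.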
Based on the above, in a further improved scheme the following step is added after step 2):
31) Convolutional neural network layer: compute sentence vectors with a multi-layer convolutional neural network, which captures sentence-meaning information better; encode each sentence to obtain its vector representation, and concatenate the two sentence vectors.
The perceptron layer of step 4) then combines the result of the gate related network layer and the result of the convolutional neural network layer into one vector representation, feeds it into the perceptron, and obtains a more abstract representation through the perceptron for judging sentence alignment.
Based on the above, in a further improved scheme the perceptron layer is a multilayer perceptron layer.
Based on the above, in a further improved scheme the bidirectional recurrent neural network layer encodes not only each word itself but also the sentence context, from left to right and from right to left. We describe here only the bidirectional recurrent encoding of a source sentence s of length m. The encoding of s uses a pair of recurrent networks, i.e., a bidirectional recurrent neural network: the forward recurrent network reads the input sentence sequence s = (s1, ..., sm) from left to right and outputs a forward hidden state; the backward recurrent network reads the input sentence sequence from right to left and outputs a backward hidden state. The hidden state hj of each word sj in the source sentence s is then expressed as the concatenation of the forward and backward hidden states. In the present invention the bidirectional recurrent network uses gated recurrent units to address the problem of learning long-term dependencies, as follows. At position j, the forward hidden state is updated according to the following four formulas (the standard gated recurrent unit update):
zj = σ(Wz[sj, hj-1])
rj = σ(Wr[sj, hj-1])
h̃j = tanh(W[sj, rj ⊙ hj-1])
hj = (1 - zj) ⊙ hj-1 + zj ⊙ h̃j
where hj-1 denotes the hidden state of the previous step and sj is the word embedding of the j-th word. Since the computation of hj contains information from the preceding positions and the current step, the hidden state of each word more or less includes the information of the sequence before it. σ is the sigmoid function; Wz, Wr, and W are the model parameters to be learned, randomly initialized at the start of training and optimized by gradient descent during training; [·,·] denotes concatenation followed by matrix multiplication with the parameter, and ⊙ denotes element-wise multiplication. The backward hidden state is updated in the same fashion. The present invention uses the same bidirectional recurrent network to encode the source sentence s and the target sentence t, so the formulas for t are identical. Long short-term memory units can also substitute for the gated recurrent units used here: at position j, the forward hidden state is updated according to the following six formulas (the standard LSTM update):
fj = σ(Wf[sj, hj-1] + bf)
ij = σ(Wi[sj, hj-1] + bi)
c̃j = tanh(Wc[sj, hj-1] + bc)
cj = fj ⊙ cj-1 + ij ⊙ c̃j
oj = σ(Wo[sj, hj-1] + bo)
hj = oj ⊙ tanh(cj)
where hj-1 denotes the hidden state of the previous step, sj is the word embedding of the j-th word, σ is the sigmoid function, and Wf, Wi, Wc, Wo, bf, bi, bc, bo are the model parameters to be learned, randomly initialized at the start of training and optimized by gradient descent during training; ⊙ denotes element-wise multiplication.
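The four-formula forward update can be written as a single GRU step. This is a numpy sketch; the convention that each parameter multiplies the concatenation [sj, hj-1] is an assumption matching the formulas above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, s_j, Wz, Wr, W):
    # One forward GRU update at position j: update gate z, reset gate r,
    # candidate state, and element-wise (⊙) interpolation with h_{j-1}.
    concat = np.concatenate([s_j, h_prev])
    z = sigmoid(Wz @ concat)
    r = sigmoid(Wr @ concat)
    h_tilde = np.tanh(W @ np.concatenate([s_j, r * h_prev]))
    return (1 - z) * h_prev + z * h_tilde
```

Running this left to right gives the forward states; running it over the reversed sequence gives the backward states, and the two are concatenated per word.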
Based on the above, in a further improved scheme the gate related network layer takes the hidden states of the two sentences after bidirectional recurrent encoding, (hs1, ..., hsm) and (ht1, ..., htn), as input. It then computes a similarity score for each word pair (si, tj):
s(si, tj) = U · (g ⊙ (hsi M[1:r] htj) + (1 - g) ⊙ f(V[hsi, htj])), with g = σ(Wg[hsi, htj])
where hsi M[1:r] htj is a bilinear model that captures the linear relation between the word pair, f(V[hsi, htj]) is a single-layer neural network that captures the nonlinear relation, and g is the gate that decides how to integrate the linear and nonlinear relations, i.e., in what proportion the two are merged into the final similarity score. U, M, V, and Wg are parameters, randomly initialized at the start of training and optimized by gradient descent during training. After the above computation, a matrix of size m*n is obtained, where m and n are the lengths of the source and target sentences. The relation between two text fragments is usually determined by a few strong semantic interactions, so after obtaining the similarity score matrix the present invention partitions the matrix with a max-pooling strategy: a pooling window of size p1 × p2 takes the maximum of each p1 × p2 block of the similarity score matrix, giving a matrix of size (m/p1) × (n/p2), which is then reshaped into a one-dimensional vector vg as the output of the gate related network.
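The gate related network score and the pooling step can be sketched as follows. The exact arrangement of U, M, V, and Wg is an assumption consistent with the description: r bilinear slices, a single-layer tanh network, and a sigmoid gate mixing the two.

```python
import numpy as np

def grn_score(hs, ht, U, M, V, Wg):
    # hs, ht: hidden states of one source and one target word (dim d).
    # M: (r, d, d) bilinear slices; V, Wg: (r, 2d); U: (r,).
    concat = np.concatenate([hs, ht])
    linear = np.einsum('i,rij,j->r', hs, M, ht)   # hs M[1:r] ht
    nonlinear = np.tanh(V @ concat)               # f(V[hs, ht])
    g = 1.0 / (1.0 + np.exp(-(Wg @ concat)))      # gate
    return U @ (g * linear + (1 - g) * nonlinear)

def pooled_output(sim, p1, p2):
    # Max-pool the m*n score matrix in p1*p2 blocks and flatten to v_g.
    m, n = sim.shape
    sim = sim[:m - m % p1, :n - n % p2]
    return sim.reshape(m // p1, p1, n // p2, p2).max(axis=(1, 3)).ravel()
```

With g near 1 the score is dominated by the bilinear (linear-relation) term, and with g near 0 by the single-layer network, which is the trade-off the gate learns.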
Based on the above, in a further improved scheme the convolutional neural network layer uses two identical convolutional layers to encode the source sentence s and the target sentence t. For an input sentence s = (s1, ..., sm), the word embedding layer yields a matrix of size [m, embed_size]; the present invention reshapes it into a matrix E of size [m, embed_size, 1] and defines a filter of shape [filter_depth, filter_rows, 1]. The filter then slides over the matrix E according to the sliding stride; at each position, the region of E covered by the filter is multiplied element-wise with the filter and the products are summed to give the value at the corresponding position of the output. Finally, an output of smaller shape [m - filter_depth + 1, embed_size - filter_rows + 1, 1] is obtained. The present invention implements this operation with the packaged conv3d function of the Theano framework in the Python programming language. A 3-dimensional max-pooling strategy provided by Theano is then applied to the output to extract its most informative parts: a pooling window of shape [p1, p2, 1] takes the maximum of each [p1, p2, 1] block of the input matrix, giving a most-informative matrix of size [(m - filter_depth + 1)/p1, (embed_size - filter_rows + 1)/p2, 1]. Within one convolution pass the filter values remain unchanged, so the convolution can enhance important features of the sentence and reduce noise. Since the features learned by a single convolutional layer are often local, the present invention uses two convolutional layers to learn more global features, the input of the second convolutional layer being the output of the first. After the source sentence s and the target sentence t are encoded, the results for the two sentences are concatenated along the last dimension as the output vc of the convolutional neural network layer.
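The sliding-filter operation described above (implemented in the patent with Theano's conv3d) reduces to a "valid" 2-D correlation once the trailing singleton dimension is dropped. A minimal numpy sketch, with stride 1 assumed:

```python
import numpy as np

def conv_valid(E, filt):
    # E: (m, embed_size) sentence matrix; filt: (filter_depth, filter_rows).
    # At each sliding position the covered region of E is multiplied
    # element-wise with the filter and summed into one output cell.
    fd, fr = filt.shape
    m, e = E.shape
    out = np.zeros((m - fd + 1, e - fr + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(E[i:i + fd, j:j + fr] * filt)
    return out
```

Two such layers are stacked in the method, the output of the first feeding the second, each followed by [p1, p2] max pooling.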
Based on the above, in a further improved scheme the output vg of the gate related network and the output vc of the convolutional network, obtained as described, are first combined into one vector representation v:
v = Wg·vg + Wc·vc
This is then fed into the perceptron layer; in the present invention the perceptron comprises two hidden layers and one output layer, i.e., v is passed through several fully connected hidden layers to obtain a more abstract representation and is finally connected to the output layer:
cn = tanh (Wn·cn-1 + bn)
where Wg, Wc, Wn, and bn are parameters, randomly initialized at the start of training and optimized by gradient descent during training; cn-1 denotes the computation of the previous hidden layer of the perceptron. Finally, the output cn of the perceptron layer is passed through the sigmoid function to compute the probability p that the sentence pair is classified as 0 or 1: if the probability exceeds a given threshold ρ the sentences are judged aligned (1), otherwise not aligned (0).
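The combination and classification can be sketched as follows. This is a numpy sketch; the output-layer weights Wo and bo are assumed names, since the text only says the perceptron output is passed through the sigmoid.

```python
import numpy as np

def classify(vg, vc, Wg, Wc, W1, b1, W2, b2, Wo, bo, rho=0.5):
    v = Wg @ vg + Wc @ vc                       # v = Wg·vg + Wc·vc
    c1 = np.tanh(W1 @ v + b1)                   # first hidden layer
    c2 = np.tanh(W2 @ c1 + b2)                  # second hidden layer
    p = 1.0 / (1.0 + np.exp(-(Wo @ c2 + bo)))   # sigmoid probability
    return p, int(p > rho)                      # 1 = aligned, 0 = not aligned
```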
Based on the above, in a further improved scheme the language pair used by the present invention is Chinese-English: the Chinese documents are word-segmented, and the English documents are fully converted to lowercase. In the corpus preprocessing stage, the words of the training corpus are first sorted by frequency from large to small, the first thirty thousand words are selected, and the vocabulary is generated in the form "word number", where "number" counts from 0, so that each word can subsequently be represented by its number. The method then looks up, in a bilingual word embedding file, the embedding corresponding to each word of the vocabulary to generate the word embedding vocabulary. Compared with the traditional one-hot representation, word embeddings allow words of similar meaning to have similar representations. Therefore, in the word embedding layer, every word of the two input sentences is mapped to a low-dimensional vector of fixed size, namely that word's embedding in the word embedding vocabulary; all low-frequency words beyond the vocabulary are mapped to one special word embedding. The present invention initializes the word embedding layer with bilingual word embeddings: after the corpus preprocessing step has produced the vocabulary and the word embedding vocabulary, each input sentence is first converted into a number sequence by vocabulary lookup and then, by querying the word embedding vocabulary with this number sequence, into the corresponding embedding sequence, completing the initialization of the word embedding layer.
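The preprocessing step can be sketched as follows. The "<unk>" token name for out-of-vocabulary low-frequency words is an assumption; the text only says such words share one special embedding.

```python
from collections import Counter

def build_vocab(tokens, max_size=30000):
    # Sort words by frequency from large to small, keep the top 30,000,
    # and number them; all other (low-frequency) words map to "<unk>".
    vocab = {"<unk>": 0}
    for word, _ in Counter(tokens).most_common(max_size):
        vocab[word] = len(vocab)
    return vocab

def to_ids(sentence, vocab):
    # Convert a sentence into its number sequence by vocabulary lookup;
    # the embedding sequence is then read off the embedding vocabulary.
    return [vocab.get(w, vocab["<unk>"]) for w in sentence]
```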
The beneficial effects of the present invention are as follows. The present invention likewise uses neural network methods to study sentence alignment. When computing word-pair similarity, it first uses the word vectors obtained after bidirectional recurrent encoding rather than plain word embeddings, taking into account not only the semantic information of each word itself but also its context. Second, it merges two kinds of similarity, a bilinear model and a single-layer neural network, through a gating mechanism, so that word-pair relations are considered more fully from the two complementary angles of linear and nonlinear relations, instead of judging similarity by angle and distance with cosine similarity and Euclidean distance. Finally, it captures the most informative parts of the word-pair similarities with a max-pooling operation rather than a convolutional neural network.
Bidirectional recurrent neural network layer: the word hidden states obtained after bidirectional recurrent encoding contain not only each word's own meaning but also its contextual information. Gate related network layer: taking the word hidden states after bidirectional recurrent encoding as input, the network that merges a bilinear model and a single-layer neural network through a gating mechanism captures word-pair relations from the linear and the nonlinear angle, yielding a word-pair similarity matrix. Convolutional neural network layer: the present invention computes sentence vectors with a multi-layer convolutional neural network, capturing sentence-meaning information better and more fully. Fusion of word-pair information and sentence-meaning information: the present invention combines the output of the gate related network with the output of the convolutional network through a multilayer perceptron.
Brief description of the drawings
Fig. 1 is a structure diagram of an existing bidirectional recurrent neural network;
Fig. 2 is a structure diagram of an existing convolutional neural network;
Fig. 3 is a structure diagram of the sentence alignment method of the deep neural network of the present invention;
Fig. 4 is a structure diagram of the bidirectional recurrent neural network of the present invention;
Fig. 5 is a structure diagram of the gate related network of the present invention;
Fig. 6 is the Bi-RNN model structure of the present invention;
Fig. 7 is the Bi-RNN-CNN model structure of the present invention;
Fig. 8 is the Bi-RNN-GRN model structure of the present invention;
Fig. 9 is a flow chart of the sentence alignment method of the deep neural network of the present invention.
Specific embodiment
The present invention proposes a method based on a deep neural network for extracting parallel sentences, without any external dictionary or features. It mainly uses neural networks to capture the semantic features between sentences and the semantic-relation features between word pairs, exploiting a characteristic of aligned sentence pairs, namely that two aligned sentences contain mostly mutually translated words; for this reason a gate related network computes the similarity of every word pair across the two sentences. To better reflect the performance of the invention, we evaluate, on both a parallel corpus and a comparable corpus, the bidirectional recurrent neural network (Bi-RNN, Fig. 6), the bidirectional recurrent network combined with a convolutional network (Bi-RNN+CNN, Fig. 7), the bidirectional recurrent network combined with the gate related network (Bi-RNN+GRN, Fig. 8), the combination of all three (Bi-RNN+GRN+CNN, Fig. 3), and two common sentence alignment tools, Champollion and Gargantua.
Embodiment 1
The sentence alignment method of the deep neural network (Bi-RNN+GRN+CNN) is shown in Fig. 3, the structure diagram of the sentence alignment method based on the deep neural network of the present invention; Fig. 9 gives the corresponding flow chart. The method comprises the following steps:
1) Corpus preprocessing: generate a vocabulary and a word embedding vocabulary from the training corpus.
2) Word embedding layer: for each word in a sentence, look up its corresponding word embedding in the embedding table, i.e., represent each word as a vector using the bilingual word embeddings provided by reference [note 1], so that similar words have similar representations.
3) Bidirectional recurrent neural network layer: encode the sentence, taking into account not only the semantic information of each word itself but also its context, so that each word obtains a hidden state that includes its contextual information.
4) Gate related network layer: compute the semantic information between word pairs across the two sentences. Taking the hidden states obtained by the bidirectional recurrent network as input, use the network that merges a bilinear model and a single-layer neural network through a gating mechanism, i.e., the gate related network, to capture word-pair similarity from the two angles of linear and nonlinear relations, and then capture the most informative parts with a max-pooling operation, obtaining a word-pair similarity matrix.
5) Convolutional neural network layer: compute sentence vectors with a two-layer convolutional neural network, which captures sentence-meaning information better; encode each sentence to obtain its vector representation, and concatenate the two sentence vectors.
6) Multilayer perceptron layer: combine the result of the gate related network layer and the result of the convolutional neural network layer into one vector representation, feed it into the multilayer perceptron, and obtain a more abstract representation through the multilayer perceptron for judging sentence alignment.
7) Compute the probability that the two sentences are aligned, judge alignment according to whether this probability exceeds a set value, and then extract the aligned sentences from the two documents. The decision is binary: 0 means not aligned, 1 means aligned. The present invention regards sentence alignment as a 0/1 classification task: two bilingual sentences are input, and the model operations above yield the probability that they are aligned. When this probability exceeds the given threshold, the output 1 indicates that the two input sentences are aligned, i.e., a parallel sentence pair; when it is below the threshold, the output 0 indicates that they are not aligned, i.e., not a parallel sentence pair.
[note 1] Will Y. Zou, Richard Socher, Daniel Cer, et al. Bilingual Word Embeddings for Phrase-Based Machine Translation. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013: 1393-1398.
Embodiment 2
The bidirectional recurrent neural network model (Bi-RNN), shown in Fig. 6, is another embodiment of the present invention. This sentence alignment method based on a deep neural network comprises the following steps:
1) Corpus preprocessing: generate a vocabulary and a word embedding vocabulary from the training corpus.
2) Word embedding layer: for each word in a sentence, look up its corresponding word embedding in the embedding table.
3) Bidirectional recurrent neural network layer: encode the sentence, taking into account not only the semantic information of each word itself but also its context, so that each word obtains a hidden state that includes its contextual information. Average the hidden states of the words in each sentence to obtain a sentence vector, then concatenate the two sentence vectors to obtain vr.
4) Multilayer perceptron layer: feed the result of the bidirectional recurrent network layer into the multilayer perceptron and obtain a more abstract representation through the multilayer perceptron for judging sentence alignment.
5) Compute the probability that the two sentences are aligned, judge alignment according to whether this probability exceeds a set value, and then extract the aligned sentences from the two documents.
Bidirectional recurrent neural network layer: the word hidden states obtained after bidirectional recurrent encoding contain not only each word's own meaning but also its contextual information. The performance of Bi-RNN in Tables 1 and 2 shows that the information captured by the bidirectional network layer helps to extract aligned sentences.
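The averaging and concatenation in step 3) amount to the following minimal sketch:

```python
import numpy as np

def mean_sentence_vector(hidden_states):
    # Average the per-word hidden states of one sentence.
    return np.asarray(hidden_states).mean(axis=0)

def pair_representation(src_states, tgt_states):
    # Concatenate the two averaged sentence vectors into v_r.
    return np.concatenate([mean_sentence_vector(src_states),
                           mean_sentence_vector(tgt_states)])
```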
Embodiment 3
The bidirectional recurrent neural network + convolutional neural network model (Bi-RNN+CNN), shown in Fig. 7, is another embodiment of the present invention. This sentence alignment method based on a deep neural network comprises the following steps:
1) Corpus preprocessing: generate a vocabulary and a word embedding vocabulary from the training corpus.
2) Word embedding layer: for each word in a sentence, look up its corresponding word embedding in the embedding table.
3) Bidirectional recurrent neural network layer: encode the sentence, taking into account not only the semantic information of each word itself but also its context, so that each word obtains a hidden state that includes its contextual information. Average the hidden states of the words in each sentence to obtain a sentence vector, then concatenate the two sentence vectors to obtain vr.
4) Convolutional neural network layer: compute sentence vectors with a two-layer convolutional neural network, which captures sentence-meaning information better; encode each sentence to obtain its vector representation, and concatenate the two sentence vectors to obtain vc.
5) Multilayer perceptron layer: combine the result of the bidirectional recurrent network and the result of the convolutional network into one vector representation and feed it into the multilayer perceptron, i.e., compute v = Wr·vr + Wc·vc (Wr and Wc are parameters), then obtain a more abstract representation through the multilayer perceptron for judging sentence alignment.
6) Compute the probability that the two sentences are aligned, judge alignment according to whether this probability exceeds a set value, and then extract the aligned sentences from the two documents.
Convolutional neural networks layer: the present invention calculates sentence vector using two layers of convolutional neural networks, can be more preferable, more complete
Capture to face sentence justice information.As can be found from Table 1, compared with Bi-RNN, the F1 value of Bi-RNN+CNN improves 0.6;
Embodiment 4
The bidirectional recurrent neural network + gated relevance network model (Bi-RNN+GRN) is shown in Fig. 8, which depicts the structure of the Bi-RNN+GRN model of the present invention; it is another embodiment of the present invention. The sentence alignment method based on a deep neural network includes the following steps:
1) corpus preprocessing: generate a vocabulary and a word embedding table from the training corpus;
2) word embedding layer: for each word in a sentence, look up its corresponding word embedding in the word embedding table;
3) bidirectional recurrent neural network layer: encode the sentence, taking into account not only the semantic information of each word itself but also its contextual information, so that each word obtains a hidden state containing its contextual information;
4) gated relevance network layer: compute the semantic information between word pairs across the two sentences; taking the word hidden states produced by the bidirectional recurrent neural network as input, use a network that fuses a bilinear model and a single-layer neural network through a gating mechanism, i.e. the gated relevance network, to capture the similarity between word pairs from the two perspectives of linear and nonlinear relations, and then apply a max-pooling operation to capture its most informative parts;
5) multilayer perceptron layer: feed the result of the gated relevance network layer into a multilayer perceptron, which produces a more abstract representation for judging sentence alignment;
6) compute the probability that the two sentences are aligned, finally judge whether the two sentences are aligned according to whether the probability is greater than a set value, and then extract the aligned sentences from the two documents.
Gated relevance network layer: taking the word hidden states encoded by the bidirectional recurrent neural network layer as input, the layer uses a network that fuses a bilinear model and a single-layer neural network through a gating mechanism to capture the relations between word pairs from the linear and the nonlinear perspective, obtaining a word-pair similarity matrix. As the performance of Bi-RNN+GRN in Tables 1 and 2 shows, sentence alignment performance is greatly improved: on overall performance (All), its F1 value is 8.6 and 29.7 higher than that of Bi-RNN, respectively. This shows that word-pair relations are an important feature for judging sentence alignment, and that the gated relevance network layer of the present invention can effectively capture the relational information between word pairs.
Embodiment 5
The difference from the embodiments above is that the multilayer perceptron layer is a fully connected layer.
The present invention is tested on a Chinese-English bilingual corpus to compare the sentence alignment performance of the systems.
1) Effect of the bidirectional recurrent neural network (RNN) and the convolutional neural network (CNN):
Table 1 gives the alignment performance of the Bi-RNN and Bi-RNN+CNN models on the parallel corpus; their F1 values are 89.2 and 89.8, respectively. The Bi-RNN model encodes the input sentence pair with a bidirectional recurrent neural network so that the hidden state of each word captures its contextual information. The Bi-RNN+CNN model builds on the Bi-RNN model by using a convolutional neural network to obtain a whole-sentence vector representation. These results show that even simple deep neural networks are able to extract the parallel sentence pairs in a bilingual corpus.
Table 1
2) Effect of the gated relevance network (GRN):
Tables 1 and 2 compare the performance of the different models in extracting parallel sentence pairs from the parallel corpus and the comparable corpus, respectively.
Table 2
The Bi-RNN+GRN model builds on the Bi-RNN model by using a gated relevance network (GRN) to capture the semantic relations between word pairs, i.e. it computes the similarity between each word in the source sentence and each word in the target-side sentence. The results show that, compared with Bi-RNN and Bi-RNN+CNN, on overall performance (All) the alignment F1 value of Bi-RNN+GRN is on average 8.3 and 30.2 higher on the parallel corpus and the comparable corpus, respectively, which shows that the gated relevance network captures the semantic relational information between word pairs well and provides strong evidence for judging sentence alignment.
3) Comparison with existing sentence alignment tools:
Point 1) above showed that the sentence vector representation captured by the convolutional neural network helps sentence alignment performance to some extent, so the convolutional neural network is added to the final model of the invention, i.e. Bi-RNN+GRN+CNN. Tables 1 and 2 show that this indeed improves performance. Table 3 compares the performance of the deep neural network method Bi-RNN+GRN+CNN and the sentence alignment tools Champollion and Gargantuan in extracting parallel sentence pairs from the comparable corpus and the parallel corpus.
Table 3
As can be seen from the table, the experimental results of the invention outperform the tools both in extracting different types of sentence pairs, such as 1-1 or 1-2/2-1, and in overall performance. This also shows that the sentence alignment method based on deep neural networks proposed by the present invention can improve sentence alignment performance.
Based on the fact that two sentences that are translations of each other (a parallel sentence pair) contain a large number of word pairs that are translations of each other, the present invention fully exploits the importance of word-pair information in the sentence alignment task and uses a gated relevance network to capture the semantic relations between word pairs. In addition, since a bidirectional recurrent neural network can capture the contextual information of words, the present invention uses a bidirectional recurrent neural network as the bottom network to encode sentences. Furthermore, to obtain more evidence for the alignment decision, the present invention obtains a sentence vector representation through a convolutional neural network. Tables 1 and 2 show that considering only sentence-level semantic information (Bi-RNN, Bi-RNN+CNN) or only word-pair information (Bi-RNN+GRN) already achieves a certain alignment performance, while a comparison with the final model (Bi-RNN+GRN+CNN) shows that combining the two kinds of information is substantially better than using either kind alone. Meanwhile, the test results on the comparable corpus show that the performance of the method of the invention is clearly improved over common sentence alignment tools, obtaining, as shown in Table 3, F1 improvements of 2.0 and 2.2 on overall performance (All), respectively.
Corpus preprocessing
The language pair used in the present invention is Chinese-English: Chinese documents are word-segmented, and English documents are fully converted to lowercase. In the corpus preprocessing stage, the words in the training corpus are first sorted by word frequency in descending order, and the top 30,000 words are selected to generate a vocabulary with entries of the form "word number", where "number" counts from 0, so that the corresponding word can be represented by its "number". Then, for each word in the vocabulary, the method looks up the corresponding word embedding in the bilingual word embedding file provided by the reference paper [note 1] and generates a word embedding table. For example, for the vocabulary entry "positive 248" and the entry "positive <word embedding (expressed as a vector)>" in the bilingual word embedding file, the entry generated in the word embedding table is "248 <word embedding>".
Documents in other languages can also be processed as needed.
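The vocabulary and word embedding table generation described above can be sketched roughly as follows; the tiny corpus, the cutoff of 3 words in place of 30,000, and the toy embeddings standing in for the bilingual word embedding file of [note 1] are all assumptions of this sketch.

```python
from collections import Counter

def build_vocab(tokens, max_words=30000):
    """Sort words by descending frequency and number them from 0."""
    ranked = [w for w, _ in Counter(tokens).most_common(max_words)]
    return {word: number for number, word in enumerate(ranked)}

def build_embedding_table(vocab, bilingual_embeddings):
    """Map each vocabulary number to that word's bilingual embedding."""
    return {number: bilingual_embeddings[word]
            for word, number in vocab.items() if word in bilingual_embeddings}

tokens = "the cat saw the dog the dog ran".split()
vocab = build_vocab(tokens, max_words=3)
print(vocab)  # {'the': 0, 'dog': 1, 'cat': 2} -- most frequent word gets 0
```

A word can then be represented by its number, and the number in turn indexes the word embedding table.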
Word embedding layer
Compared with the traditional one-hot vector representation, word embeddings allow words with similar meanings to have similar representations. Therefore, in the embedding layer (word embedding layer), all words of the two input sentences are mapped to low-dimensional vectors of fixed size; each such vector is the word embedding corresponding to the word in the word embedding table. All low-frequency words outside the vocabulary are mapped to one special word embedding. In the experiments, the present invention initializes the word embedding layer with the bilingual word embeddings provided by the reference paper [note 1]. Through the corpus preprocessing step, the present invention obtains the vocabulary and the word embedding table. For each input sentence, the vocabulary is first queried to generate the corresponding number sequence; for example, "I love China" generates a corresponding number sequence (e.g. "1 3 10"). Then the word embedding table is queried with the number sequence to generate the corresponding word embedding sequence, which completes the initialization of the word embedding layer. Word embedding layer: using the bilingual word embeddings provided by the reference paper [note 1], words are represented as vectors so that similar words have similar representations, providing word features for the later steps.
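The lookup performed by the word embedding layer can be sketched as follows; the vocabulary numbers and the random vectors standing in for the bilingual embeddings of [note 1] are assumptions of this sketch.

```python
import numpy as np

EMBED_SIZE = 4
vocab = {"我": 1, "爱": 3, "中国": 10}  # word -> number, from preprocessing
rng = np.random.default_rng(0)
# One embedding per vocabulary number, plus one shared special embedding
# for low-frequency words outside the vocabulary.
embed_table = {n: rng.normal(size=EMBED_SIZE) for n in vocab.values()}
UNK = rng.normal(size=EMBED_SIZE)

def embed_sentence(words):
    """Sentence -> number sequence -> word embedding sequence."""
    numbers = [vocab.get(w) for w in words]
    return np.stack([embed_table[n] if n is not None else UNK
                     for n in numbers])

E = embed_sentence(["我", "爱", "中国"])
print(E.shape)  # (3, 4): one embedding row per word
```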
Bidirectional recurrent neural network layer
The bidirectional recurrent neural network encodes not only each word itself but also the left-to-right and right-to-left context of the sentence (which provides important features for sentence translation). To avoid repetition, we only describe here the bidirectional recurrent neural network process for a source sentence s of length m. The structure of the bidirectional recurrent neural network is shown in Fig. 4. The source sentence s is encoded with a pair of neural networks, i.e. a bidirectional recurrent neural network: the forward recurrent neural network reads the input sentence sequence s = (s1, ..., sm) from left to right and outputs the forward hidden states (→h1, ..., →hm); the backward recurrent neural network reads the input sentence sequence from right to left and outputs the backward hidden states (←h1, ..., ←hm). The hidden state hj of each word sj in the source sentence s is then expressed as the concatenation of →hj and ←hj. The bidirectional recurrent neural network in the present invention uses gated recurrent units (Gated Recurrent Unit, GRU for short) to solve the problem of learning long-term dependencies, as follows: at position j, the forward hidden state →hj is updated according to the following four formulas:

zj = σ(Wz·[→hj−1, sj])
rj = σ(Wr·[→hj−1, sj])
h̃j = tanh(W·[rj ⊙ →hj−1, sj])
→hj = (1 − zj) ⊙ →hj−1 + zj ⊙ h̃j

where →hj−1 denotes the hidden state of the previous time step and sj is the word embedding of the j-th word. Since the calculation of →hj contains the information of the preceding positions and the current time step, the hidden state of each word more or less contains the information of the sequence before that word. σ is the sigmoid function; Wz, Wr, W are the model parameters to be learned, which are randomly initialized at the start of training and optimized with the gradient descent algorithm during training. · denotes matrix multiplication, ⊙ denotes element-wise multiplication. Likewise, the backward hidden state ←hj is updated in the same fashion. It is worth noting that the present invention encodes the source sentence s and the target-side sentence t with the same bidirectional recurrent neural network, so the calculation formulas for the target-side sentence t are the same as above. At the same time, the gated recurrent unit (GRU) used by the present invention can also be replaced by a long short-term memory unit (Long Short-Term Memory, LSTM for short). At position j, the forward hidden state →hj is then updated according to the following six formulas:

fj = σ(Wf·[→hj−1, sj] + bf)
ij = σ(Wi·[→hj−1, sj] + bi)
c̃j = tanh(Wc·[→hj−1, sj] + bc)
cj = fj ⊙ cj−1 + ij ⊙ c̃j
oj = σ(Wo·[→hj−1, sj] + bo)
→hj = oj ⊙ tanh(cj)

where →hj−1 denotes the hidden state of the previous time step, sj is the word embedding of the j-th word, σ is the sigmoid function, and Wf, Wi, Wc, Wo, bf, bi, bc, bo are the model parameters to be learned, which are randomly initialized at the start of training and optimized with the gradient descent algorithm during training. · denotes matrix multiplication, ⊙ denotes element-wise multiplication.
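One forward GRU step under the concatenated-input formulation written above can be sketched in plain numpy as follows; the random weights stand in for the learned parameters Wz, Wr, W, so this is a shape-level illustration rather than a trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, s_j, Wz, Wr, W):
    """One forward GRU update at position j (the four formulas above)."""
    x = np.concatenate([h_prev, s_j])
    z = sigmoid(Wz @ x)                       # update gate zj
    r = sigmoid(Wr @ x)                       # reset gate rj
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, s_j]))
    return (1.0 - z) * h_prev + z * h_tilde   # new hidden state

hidden, embed = 5, 4
rng = np.random.default_rng(1)
Wz, Wr, W = [rng.normal(size=(hidden, hidden + embed)) for _ in range(3)]
h = np.zeros(hidden)                          # initial hidden state
for s_j in rng.normal(size=(3, embed)):       # run over a 3-word sentence
    h = gru_step(h, s_j, Wz, Wr, W)
print(h.shape)  # (5,)
```

The backward pass is the same computation over the reversed sentence; concatenating the forward and backward states gives hj.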
Gated relevance network layer
Word-pair similarity features are widely used in many tasks, and given the characteristics of aligned sentence pairs, word-pair similarity features greatly improve sentence alignment performance. In the present invention, we use a gated relevance network to model word pairs and capture the contextual relations between the word pairs of a bilingual sentence pair; it is applied on top of the bidirectional recurrent neural network layer. Most existing sentence alignment methods make the judgment from features such as length and dictionaries; the dictionary-based methods obtain the word matching information of a sentence pair from a dictionary and thereby the degree to which the pair matches. Such methods match mechanically on shallow word-sense information and can hardly capture deep semantic information. In order to better capture the richer semantic information between word pairs in the two sentences, the present invention uses a network that fuses a bilinear model and a single-layer neural network through a gating mechanism, i.e. the gated relevance network, to capture the similarity between word pairs from the two perspectives of linear and nonlinear relations. The structure of the gated relevance network is shown in Fig. 5. The network takes the hidden states (hs1, ..., hsm) and (ht1, ..., htn) of the two sentences encoded by the bidirectional recurrent neural network as input; the advantage is that, after encoding by the bidirectional recurrent neural network, the hidden state of each word contains not only its own semantic information but also its contextual information. Then, the similarity score of each word pair (si, tj) is calculated by the following formula:

score(si, tj) = U·(g ⊙ (hsi^T·M[1:r]·htj) + (1 − g) ⊙ f(V·[hsi, htj])),  g = σ(Wg·[hsi, htj])

where, as shown in the lower-left part (a) of Fig. 5, hsi^T·M[1:r]·htj is the bilinear model, used to capture the linear relation between the word pair; as shown in the lower-right part (b) of Fig. 5, f(V·[hsi, htj]) is the single-layer neural network, used to capture the nonlinear relation between the word pair; and as shown in the lower-middle part (c) of Fig. 5, g is the gate that determines how to integrate the linear and the nonlinear relation between the word pair, i.e. in what proportions the final word-pair similarity score merges the linear and the nonlinear relation. U, M, V, Wg are parameters, randomly initialized at the start of training and optimized with the gradient descent algorithm during training. After the above calculation, a matrix of size m*n is obtained, where m and n are the lengths of the source sentence and the target-side sentence, respectively, i.e. the grey squares shown in Fig. 3. The relation between two text fragments is usually determined by some strong semantic interactions, so after obtaining the similarity score matrix of the grey squares in Fig. 3, the present invention partitions the matrix with a max-pooling strategy: through a max pooling of size p1 × p2, i.e. taking the maximum of each p1 × p2 block of the similarity score matrix, a matrix of size (m/p1) × (n/p2) is obtained, which is then reshaped into a one-dimensional vector vg as the output of the gated relevance network.
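The gated relevance network score and the max pooling can be sketched as follows. The exact combination in the patent's formula is not fully reproduced in the text, so this sketch assumes a gate-weighted mix of the bilinear term hsi^T·M[1:r]·htj and the single-layer term f(V·[hsi, htj]), with random stand-ins for the parameters U, M, V, Wg.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grn_score(hs_i, ht_j, U, M, V, Wg):
    """Gate-weighted mix of a bilinear and a single-layer-NN score."""
    pair = np.concatenate([hs_i, ht_j])
    bilinear = np.array([hs_i @ M[k] @ ht_j for k in range(M.shape[0])])
    nonlinear = np.tanh(V @ pair)             # f(V·[hsi, htj])
    g = sigmoid(Wg @ pair)                    # gate g in (0, 1)
    return float(U @ (g * bilinear + (1.0 - g) * nonlinear))

def max_pool(S, p1, p2):
    """Take the maximum of each p1 x p2 block of S."""
    return np.array([[S[a:a + p1, b:b + p2].max()
                      for b in range(0, S.shape[1], p2)]
                     for a in range(0, S.shape[0], p1)])

d, r = 4, 2                                   # hidden size, bilinear slices
rng = np.random.default_rng(2)
U = rng.normal(size=r)
M = rng.normal(size=(r, d, d))
V, Wg = rng.normal(size=(r, 2 * d)), rng.normal(size=(r, 2 * d))
hs = rng.normal(size=(4, d))                  # source hidden states, m = 4
ht = rng.normal(size=(6, d))                  # target hidden states, n = 6
S = np.array([[grn_score(a, b, U, M, V, Wg) for b in ht] for a in hs])
vg = max_pool(S, 2, 2).reshape(-1)            # (m/p1)*(n/p2) = 6 values
print(S.shape, vg.shape)  # (4, 6) (6,)
```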
Convolutional neural network layer
Similar to the bidirectional recurrent neural network layer, the present invention encodes the source sentence s and the target-side sentence t with identical convolutional neural network layers. The input sentence s = (s1, ..., sm), after the word embedding layer, yields a matrix of size [m, embed_size] (embed_size is the size of the word embeddings), which the present invention reshapes into a matrix E of size [m, embed_size, 1], and a filter of shape [filter_depth, filter_rows, 1] is defined. Then, according to the chosen sliding stride, e.g. 1, the filter slides over the matrix E with stride 1; at each slide, the part of E covered by the filter is multiplied element-wise with the corresponding positions of the filter, and the products are summed as the value of the corresponding position in the output. Finally, we obtain an output of smaller shape [m-filter_depth+1, embed_size-filter_rows+1, 1]. The present invention implements this operation with the conv3d function provided in the Theano framework for the Python programming language. Then, the 3-dimensional max-pooling strategy provided in the Theano framework is applied to the output to obtain its most informative parts: through a max pooling of shape [p1, p2, 1], i.e. taking the maximum of each [p1, p2, 1] block of the above matrix, a most-informative matrix of size [(m-filter_depth+1)/p1, (embed_size-filter_rows+1)/p2, 1] is obtained. During one convolution pass the values of the filter remain unchanged, so the convolution can enhance some important features of the sentence and reduce noise. Since the features learned by one convolutional neural network layer are often local, the present invention uses two convolutional neural network layers to learn more global features; the input of the second convolutional neural network layer is the output of the first. After the source sentence s and the target-side sentence t are encoded, the resulting matrices of the two sentences are concatenated along the last dimension as the output vc of the convolutional neural network layer. Meanwhile, the two-layer convolutional neural network in the present invention can also be replaced by a multilayer convolutional neural network, i.e. the output of each convolutional layer serves as the input of the next.
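The convolution and pooling described above can be sketched with plain numpy in place of Theano's conv3d (the trailing singleton dimension is dropped for clarity); the filter values and the sentence matrix are random stand-ins.

```python
import numpy as np

def conv_valid(E, F):
    """Slide F over E with stride 1; sum of element-wise products."""
    m, e = E.shape
    fd, fr = F.shape
    out = np.empty((m - fd + 1, e - fr + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(E[i:i + fd, j:j + fr] * F)
    return out

def max_pool(A, p1, p2):
    """Take the maximum of each [p1, p2] block of A."""
    return np.array([[A[a:a + p1, b:b + p2].max()
                      for b in range(0, A.shape[1], p2)]
                     for a in range(0, A.shape[0], p1)])

rng = np.random.default_rng(3)
E = rng.normal(size=(9, 8))   # m = 9 words, embed_size = 8
F = rng.normal(size=(2, 3))   # filter_depth = 2, filter_rows = 3
out = conv_valid(E, F)        # shape (9-2+1, 8-3+1) = (8, 6)
pooled = max_pool(out, 2, 2)  # shape (4, 3)
# A second convolutional layer would take this output as its input.
print(out.shape, pooled.shape)  # (8, 6) (4, 3)
```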
Multilayer perceptron layer
As described above, the output vg of the gated relevance network and the output vc of the convolutional neural network are obtained. Next, the two vectors are first combined into one vector representation v:

v = Wg·vg + Wc·vc

Then v is fed into the multilayer perceptron layer. In the present invention the multilayer perceptron consists of two hidden layers and one output layer, i.e. v is input to two fully connected hidden layers to obtain a more abstract representation and is finally connected to the output layer:

c1 = tanh(v + b1)
c2 = tanh(W2·c1 + b2)

where Wg, Wc, W2 and b1, b2 are parameters that are randomly initialized at the start of training and optimized with the gradient descent algorithm during training; c1 and c2 denote the computations of the two hidden layers of the multilayer perceptron. Meanwhile, the two fully connected layers of the multilayer perceptron in the present invention can be replaced by multiple fully connected layers, i.e. cn = tanh(Wn·cn-1 + bn), where n is the number of fully connected layers. Finally, the output c2 of the multilayer perceptron layer is passed through the sigmoid function to compute the probability p that the sentence pair is classified as 0 or 1; if the probability is greater than a given threshold ρ, the sentences are judged aligned (1), otherwise not aligned (0).
Claims (8)
1. A sentence alignment method based on a deep neural network, characterized by comprising the following steps:
1) corpus preprocessing: generating a vocabulary and a word embedding table from the training corpus;
2) a word embedding layer: for each word in a sentence, looking up its corresponding word embedding in the word embedding table;
3) a bidirectional recurrent neural network layer: encoding the sentence, taking into account not only the semantic information of each word itself but also its contextual information, and obtaining for each word a hidden state containing its contextual information;
4) a gated relevance network layer: computing the semantic relational information between word pairs across the two sentences; taking the word hidden states produced by the bidirectional recurrent neural network as input, using a network that fuses a bilinear model and a single-layer neural network through a gating mechanism, i.e. the gated relevance network, to capture the similarity between word pairs from the two perspectives of linear and nonlinear relations, and then applying a max-pooling operation to capture its most informative parts;
5) a perceptron layer: feeding the result of the gated relevance network layer as a vector representation into a perceptron, which produces a more abstract representation for judging sentence alignment;
6) computing the probability that the two sentences are aligned, finally judging whether the two sentences are aligned according to whether the probability is greater than a set value, and then extracting the aligned sentences from the two documents.
2. The sentence alignment method based on a deep neural network according to claim 1, characterized in that after step 2) it further comprises the following step:
31) a convolutional neural network layer: computing the sentence vector with a multilayer convolutional neural network, which captures the semantic information of the sentence better; encoding the sentence to obtain its vector representation; and concatenating the two sentence vectors;
and step 5) comprises the following: the perceptron layer combines the result of the gated relevance network layer and the result of the convolutional neural network layer into one vector representation, feeds it into the perceptron, and obtains a more abstract representation through the perceptron for judging sentence alignment.
3. The sentence alignment method based on a deep neural network according to claim 1 or 2, characterized in that the perceptron layer is a multilayer perceptron layer.
4. The sentence alignment method based on a deep neural network according to claim 3, characterized in that: the bidirectional recurrent neural network layer encodes not only each word itself but also the left-to-right and right-to-left context of the sentence; only the bidirectional recurrent neural network process for a source sentence s of length m is described here; the source sentence s is encoded with a pair of neural networks, i.e. a bidirectional recurrent neural network: the forward recurrent neural network reads the input sentence sequence s = (s1, ..., sm) from left to right and outputs the forward hidden states (→h1, ..., →hm); the backward recurrent neural network reads the input sentence sequence from right to left and outputs the backward hidden states (←h1, ..., ←hm); the hidden state hj of each word sj in the source sentence s is then expressed as the concatenation of →hj and ←hj; the bidirectional recurrent neural network in the present invention uses gated recurrent units to solve the problem of learning long-term dependencies, as follows: at position j, the forward hidden state →hj is updated according to the following four formulas:
zj = σ(Wz·[→hj−1, sj])
rj = σ(Wr·[→hj−1, sj])
h̃j = tanh(W·[rj ⊙ →hj−1, sj])
→hj = (1 − zj) ⊙ →hj−1 + zj ⊙ h̃j
where →hj−1 denotes the hidden state of the previous time step and sj is the word embedding of the j-th word; since the calculation contains the information of the preceding states and the current time step, the hidden state of each word more or less contains the information of the sequence before that word; σ is the sigmoid function; Wz, Wr, W are the model parameters to be learned, randomly initialized at the start of training and optimized with the gradient descent algorithm during training; · denotes matrix multiplication, ⊙ denotes element-wise multiplication; likewise, the backward hidden state ←hj is updated in the same fashion; the present invention encodes the source sentence s and the target-side sentence t with the same bidirectional recurrent neural network, so the calculation formulas for the target-side sentence t are the same as above; at the same time, the gated recurrent unit used by the present invention can also be replaced by a long short-term memory unit, in which case, at position j, the forward hidden state →hj is updated according to the following six formulas:
fj = σ(Wf·[→hj−1, sj] + bf)
ij = σ(Wi·[→hj−1, sj] + bi)
c̃j = tanh(Wc·[→hj−1, sj] + bc)
cj = fj ⊙ cj−1 + ij ⊙ c̃j
oj = σ(Wo·[→hj−1, sj] + bo)
→hj = oj ⊙ tanh(cj)
where →hj−1 denotes the hidden state of the previous time step, sj is the word embedding of the j-th word, σ is the sigmoid function, Wf, Wi, Wc, Wo, bf, bi, bc, bo are the model parameters to be learned, randomly initialized at the start of training and optimized with the gradient descent algorithm during training; · denotes matrix multiplication, ⊙ denotes element-wise multiplication.
5. The sentence alignment method based on a deep neural network according to claim 4, characterized in that: the gated relevance network layer takes the hidden states (hs1, ..., hsm) and (ht1, ..., htn) of the two sentences encoded by the bidirectional recurrent neural network as input; then, the similarity score of each word pair (si, tj) is calculated according to the following formula:
score(si, tj) = U·(g ⊙ (hsi^T·M[1:r]·htj) + (1 − g) ⊙ f(V·[hsi, htj])), g = σ(Wg·[hsi, htj])
where hsi^T·M[1:r]·htj is the bilinear model, used to capture the linear relation between the word pair; f(V·[hsi, htj]) is the single-layer neural network, used to capture the nonlinear relation between the word pair; g is the gate that determines how to integrate the linear and the nonlinear relation between the word pair, i.e. in what proportions the final word-pair similarity score merges the linear and the nonlinear relation; U, M, V, Wg are parameters, randomly initialized at the start of training and optimized with the gradient descent algorithm during training; after the above calculation, a matrix of size m*n is obtained, where m and n are the lengths of the source sentence and the target-side sentence, respectively; the relation between two text fragments is usually determined by some strong semantic interactions, so after obtaining the similarity score matrix, the present invention partitions the matrix with a max-pooling strategy: through a max pooling of size p1 × p2, i.e. taking the maximum of each p1 × p2 block of the similarity score matrix, a matrix of size (m/p1) × (n/p2) is obtained, which is then reshaped into a one-dimensional vector vg as the output of the gated relevance network.
6. The sentence alignment method based on a deep neural network according to claim 5, characterized in that: the convolutional neural network layer encodes the source sentence s and the target-side sentence t with two identical convolutional neural network layers; the input sentence s = (s1, ..., sm), after the word embedding layer, yields a matrix of size [m, embed_size], which the present invention reshapes into a matrix E of size [m, embed_size, 1], and a filter of shape [filter_depth, filter_rows, 1] is defined; then the filter slides over the matrix E according to the sliding stride; at each slide, the part of E covered by the filter is multiplied element-wise with the corresponding positions of the filter and the products are summed as the value of the corresponding position in the output; finally, we obtain an output of smaller shape [m-filter_depth+1, embed_size-filter_rows+1, 1]; the present invention implements this operation with the conv3d function provided in the Theano framework for the Python programming language; then the 3-dimensional max-pooling strategy provided in the Theano framework is applied to the output to obtain its most informative parts: through a max pooling of shape [p1, p2, 1], i.e. taking the maximum of each [p1, p2, 1] block of the above matrix, a most-informative matrix of size [(m-filter_depth+1)/p1, (embed_size-filter_rows+1)/p2, 1] is obtained; during one convolution pass the values of the filter remain unchanged, so the convolution can enhance some important features of the sentence and reduce noise; since the features learned by one convolutional neural network layer are often local, the invention uses two convolutional neural network layers to learn more global features, the input of the second convolutional neural network layer being the output of the first; after the source sentence s and the target-side sentence t are encoded, the resulting matrices of the two sentences are concatenated along the last dimension as the output vc of the convolutional neural network layer.
7. The sentence alignment method based on a deep neural network according to claim 6, characterized in that: as described above, the output vg of the gated relevance network and the output vc of the convolutional neural network are obtained; next, the two vectors are first combined into one vector representation v:
v = Wg·vg + Wc·vc
then v is fed into the perceptron layer; in the present invention the perceptron comprises multiple hidden layers and one output layer, i.e. v is input to multiple fully connected hidden layers to obtain a more abstract representation and is finally connected to the output layer:
cn = tanh(Wn·cn-1 + bn)
where Wg, Wc, Wn and bn are parameters, randomly initialized at the start of training and optimized with the gradient descent algorithm during training; cn-1 denotes the hidden-layer computation of the perceptron; finally, the output cn of the perceptron layer is passed through the sigmoid function to compute the probability p that the sentence pair is classified as 0 or 1; if the probability is greater than a given threshold ρ, the sentences are judged aligned (1), otherwise not aligned (0).
8. The sentence alignment method based on a deep neural network as claimed in claim 7, characterized in that: according to the above description, the languages used in the present invention are Chinese and English, where Chinese documents are segmented into words and English documents are converted entirely to lowercase. In the corpus preprocessing stage, the words in the training corpus are first sorted by word frequency from largest to smallest, and the top 30,000 words are selected to generate the vocabulary, represented as "word number" pairs, where the numbers start from 0, so that each word can be represented by its number. Then, for each word in the vocabulary, the method looks up the corresponding word embedding in a bilingual word embedding file to generate the word embedding vocabulary. Compared with the traditional one-hot vector representation, word embeddings allow words with similar meanings to have similar representations. Therefore, in the word embedding layer, all words of the two input sentences are mapped to fixed-size low-dimensional vectors, each vector being exactly the word embedding corresponding to that word in the word embedding vocabulary. All low-frequency words outside the vocabulary are mapped to a single special word embedding. The present invention uses bilingual word embeddings to initialize the word embedding layer: through the corpus preprocessing step, the vocabulary and the word embedding vocabulary are obtained; for each input sentence, the vocabulary is first queried to generate the corresponding number sequence, and then the word embedding vocabulary is queried according to the number sequence to generate the corresponding word embedding sequence, thereby completing the initialization of the word embedding layer.
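The preprocessing pipeline above (frequency-sorted vocabulary, numbering from 0, out-of-vocabulary fallback, embedding lookup) can be sketched as follows; the toy corpus, the embedding dimension, and the random stand-in for the bilingual embedding file are assumptions:

```python
import numpy as np
from collections import Counter

# Toy training corpus; in the method this is segmented Chinese and lowercased English.
corpus = ["the cat sat", "the dog sat", "a dog ran"]
VOCAB_SIZE = 30000                 # the method keeps the 30,000 most frequent words

# Sort words by frequency from largest to smallest and number them from 0.
counts = Counter(w for line in corpus for w in line.split())
vocab = {w: i for i, (w, _) in enumerate(counts.most_common(VOCAB_SIZE))}
UNK = len(vocab)                   # single shared embedding for all out-of-vocabulary words

# Stand-in for the bilingual word-embedding file (random vectors here).
rng = np.random.default_rng(2)
EMB_DIM = 4                        # embedding size is an assumption
emb_table = rng.normal(size=(len(vocab) + 1, EMB_DIM))

def embed(sentence):
    ids = [vocab.get(w, UNK) for w in sentence.split()]   # words -> number sequence
    return emb_table[ids]                                 # numbers -> embedding sequence

seq = embed("the cat ran fast")    # "fast" is unseen, so it maps to UNK
print(seq.shape)                   # (4, 4)
```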
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810835723.1A CN109062910A (en) | 2018-07-26 | 2018-07-26 | Sentence alignment method based on deep neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109062910A true CN109062910A (en) | 2018-12-21 |
Family
ID=64836638
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810835723.1A Pending CN109062910A (en) | 2018-07-26 | 2018-07-26 | Sentence alignment method based on deep neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109062910A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103617227A (en) * | 2013-11-25 | 2014-03-05 | 福建工程学院 | Fuzzy-neural-network-based sentence matching degree calculation method and sentence alignment method |
CN106126507A (en) * | 2016-06-22 | 2016-11-16 | 哈尔滨工业大学深圳研究生院 | Character-encoding-based deep neural machine translation method and system |
CN106886516A (en) * | 2017-02-27 | 2017-06-23 | 竹间智能科技(上海)有限公司 | Method and device for automatically identifying sentence relations and entities |
CN107391495A (en) * | 2017-06-09 | 2017-11-24 | 北京吾译超群科技有限公司 | A sentence alignment method for bilingual parallel corpora |
CN107423290A (en) * | 2017-04-19 | 2017-12-01 | 厦门大学 | A neural network machine translation model based on hierarchical structure |
CN107992476A (en) * | 2017-11-28 | 2018-05-04 | 苏州大学 | Corpus generation method and system for sentence-level biological context network extraction |
Non-Patent Citations (2)
Title |
---|
DING Ying et al.: "Sentence Alignment Based on Word-Pair Modeling" (online first address: HTTP://KNS.CNKI.NET/KCMS/DETAIL/31.1289.TP.20180607.1454.002.HTML) * |
LI Yang et al.: "Text Sentiment Analysis Based on Feature Fusion of Convolutional Neural Networks and BLSTM Networks" (online publication address: HTTP://KNS.CNKI.NET/KCMS/DETAIL/51.1307.TP.20180719.1604.080.HTML) * |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US12039281B2 (en) * | 2018-07-27 | 2024-07-16 | Beijing Jingdong Shangke Information Technology Co., Ltd. | Method and system for processing sentence, and electronic device |
US20220215177A1 (en) * | 2018-07-27 | 2022-07-07 | Beijing Jingdong Shangke Information Technology Co., Ltd. | Method and system for processing sentence, and electronic device |
CN110196906A (en) * | 2019-01-04 | 2019-09-03 | 华南理工大学 | Deep learning text similarity detection method oriented to the financial industry |
CN110196906B (en) * | 2019-01-04 | 2023-07-04 | 华南理工大学 | Deep learning text similarity detection method oriented to financial industry |
CN110070175A (en) * | 2019-04-12 | 2019-07-30 | 北京市商汤科技开发有限公司 | Image processing method, model training method and device, electronic equipment |
CN110070175B (en) * | 2019-04-12 | 2021-07-02 | 北京市商汤科技开发有限公司 | Image processing method, model training method and device and electronic equipment |
CN111368564A (en) * | 2019-04-17 | 2020-07-03 | 腾讯科技(深圳)有限公司 | Text processing method and device, computer readable storage medium and computer equipment |
CN110391010A (en) * | 2019-06-11 | 2019-10-29 | 山东大学 | Food recommendation method and system based on personal health perception |
CN110391010B (en) * | 2019-06-11 | 2022-05-13 | 山东大学 | Food recommendation method and system based on personal health perception |
CN111783430A (en) * | 2020-08-04 | 2020-10-16 | 腾讯科技(深圳)有限公司 | Sentence pair matching rate determination method and device, computer equipment and storage medium |
CN112487305B (en) * | 2020-12-01 | 2022-06-03 | 重庆邮电大学 | GCN-based dynamic social user alignment method |
CN112487305A (en) * | 2020-12-01 | 2021-03-12 | 重庆邮电大学 | GCN-based dynamic social user alignment method |
CN112613295B (en) * | 2020-12-21 | 2023-12-22 | 竹间智能科技(上海)有限公司 | Corpus recognition method and device, electronic equipment and storage medium |
CN112613295A (en) * | 2020-12-21 | 2021-04-06 | 竹间智能科技(上海)有限公司 | Corpus identification method and device, electronic equipment and storage medium |
CN113408267B (en) * | 2021-06-23 | 2023-09-01 | 沈阳雅译网络技术有限公司 | Word alignment performance improving method based on pre-training model |
CN113408267A (en) * | 2021-06-23 | 2021-09-17 | 沈阳雅译网络技术有限公司 | Word alignment performance improving method based on pre-training model |
CN113656066B (en) * | 2021-08-16 | 2022-08-05 | 南京航空航天大学 | Clone code detection method based on feature alignment |
CN113656066A (en) * | 2021-08-16 | 2021-11-16 | 南京航空航天大学 | Clone code detection method based on feature alignment |
CN113779978A (en) * | 2021-09-26 | 2021-12-10 | 上海一者信息科技有限公司 | Method for realizing unsupervised cross-language sentence alignment |
CN113705158A (en) * | 2021-09-26 | 2021-11-26 | 上海一者信息科技有限公司 | Method for intelligently restoring original text style in document translation |
CN113779978B (en) * | 2021-09-26 | 2024-05-24 | 上海一者信息科技有限公司 | Method for realizing non-supervision cross-language sentence alignment |
CN113705158B (en) * | 2021-09-26 | 2024-05-24 | 上海一者信息科技有限公司 | Method for intelligently restoring original text style in document translation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109062897A (en) | Sentence alignment method based on deep neural network | |
CN109062910A (en) | Sentence alignment method based on deep neural network | |
CN110287320B (en) | Deep learning multi-classification emotion analysis model combining attention mechanism | |
Alonso et al. | Adversarial generation of handwritten text images conditioned on sequences | |
CN108363753B (en) | Comment text emotion classification model training and emotion classification method, device and equipment | |
Ma et al. | Multimodal convolutional neural networks for matching image and sentence | |
CN108664996A | An ancient writing recognition method and system based on deep learning | |
CN108830287A | Chinese image semantic description method based on a residual-connected Inception network fused with multilayer GRUs | |
CN109992686A | Image-text retrieval system and method based on multi-angle self-attention mechanism | |
CN108416065A | Image-sentence description generation system and method based on hierarchical neural network | |
CN110287323B (en) | Target-oriented emotion classification method | |
CN110069778A | Chinese commodity sentiment analysis method incorporating embedded word position awareness | |
CN110288029A | Image description method based on Tri-LSTMs model | |
CN112287695A | Parallel sentence pair extraction method based on cross-language bilingual pre-training and Bi-LSTM | |
CN109711465A (en) | Image method for generating captions based on MLL and ASCA-FR | |
CN112686345A (en) | Off-line English handwriting recognition method based on attention mechanism | |
CN110162639A | Knowledge graph semantic understanding method, apparatus, device and storage medium | |
CN108920586A | A short text classification method based on deep neural mapping support vector machines | |
Daskalakis et al. | Learning deep spatiotemporal features for video captioning | |
Inunganbi et al. | Recognition of handwritten Meitei Mayek script based on texture feature | |
Fan et al. | Long-term recurrent merge network model for image captioning | |
Vijayaraju | Image retrieval using image captioning | |
CN111813927A (en) | Sentence similarity calculation method based on topic model and LSTM | |
Jang et al. | Paraphrase thought: Sentence embedding module imitating human language recognition | |
CN114881038A (en) | Chinese entity and relation extraction method and device based on span and attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20181221 |