CN109062910A - Sentence alignment method based on deep neural network - Google Patents
- Publication number: CN109062910A
- Application number: CN201810835723.1A
- Authority: CN (China)
- Prior art keywords: word, sentence, layer, neural network, information
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/45—Example-based machine translation; Alignment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Machine Translation (AREA)
Abstract
A sentence alignment method based on a deep neural network. A bidirectional recurrent neural network layer encodes each sentence, taking into account not only the semantic information of each word itself but also its context, so that every word obtains a hidden state that includes its contextual information. A gate related network layer computes the semantic-relation information between word pairs across the two sentences: taking the hidden states produced by the bidirectional recurrent network as input, it uses a network in which a bilinear model and a single-layer neural network are merged by a gating mechanism, capturing word-pair similarity from two angles, linear relations and nonlinear relations, and then applies a max-pooling operation to keep the most informative parts. Two mutually translated sentences contain mostly mutually translated words, and conventional methods likewise use word-level information for the alignment decision; the present invention, however, captures the semantic-relation features between word pairs without any additional dictionary, and also yields a word-pair similarity matrix.
Description
Technical field
The present invention relates to a sentence alignment method based on neural networks.
Background art
A parallel corpus is one of the most important resources for natural language processing; many tasks, such as machine translation, cross-language information retrieval, and bilingual dictionary construction, require the support of parallel corpora. The sentence alignment task is to extract mutually translated parallel sentence pairs from documents in two different languages, in order to enlarge parallel corpora and thereby alleviate the problem of small-scale parallel data.
Early research on sentence alignment was mainly based on feature matching. Such methods consider only surface information between the bilingual sentences, i.e., they judge whether two sentences are aligned according to the length relation between them. Later, based on word-pair relations in parallel sentences, many researchers proposed dictionary-based methods, which judge alignment from the relation between the number of mutually translated word pairs and the total number of words in each sentence. Other methods combine sentence-length information with word information, add further features or heuristic strategies, or translate both sentences into the same language and compare the translation with the other sentence to judge alignment. In recent years, with the deepening of deep learning research, neural network methods have also achieved notable results on the sentence alignment task.
Sentence alignment is a basic task in natural language processing. At present, it is treated as a classification task with two classes, aligned or not aligned, and the final parallel sentence pairs are extracted through an alignment optimization strategy.
Early sentence alignment methods used statistical approaches: features such as sentence length and the number of mutually translated words were obtained by statistical methods, and the alignment decision was made from these feature values together with a formulated alignment strategy. Some methods also use features such as punctuation marks and positional relations to improve alignment performance. Other sentence alignment work first uses an existing alignment tool to extract candidate parallel sentence pairs from the corpus and then refines them with rule-based features to find the optimal alignment. Most of these approaches are trained in an unsupervised or semi-supervised manner.
As shown in Figure 1, with the continuous development of deep learning in recent years, deep neural networks have been applied successfully to the sentence alignment task. In general they require supervised training on a reference corpus, i.e., the classifier parameters are tuned on a series of sentence pairs of known class (parallel and non-parallel) until optimal performance is reached. One existing neural approach encodes the two sentences with a bidirectional recurrent neural network to obtain the hidden state of each word, concatenates the hidden states of the last word and the first word as the hidden state of the whole sentence, adds the element-wise product and the absolute difference of the two sentence states, computes their similarity through the hyperbolic tangent function tanh, obtains a more abstract representation through a fully connected layer, and finally computes the classification probability with the S-shaped sigmoid function. Because this method compresses each sentence into a single vector, it cannot capture the word-level information inside the sentences well; yet word information is a key factor for judging whether sentences are aligned, so comparing only two sentence vectors easily loses important matching information.
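The baseline matching step just described can be sketched as follows. This is a minimal numpy sketch: the weight shapes and the exact way the product and absolute difference are combined are assumptions, since the text only names the operations.

```python
import numpy as np

def sentence_state(word_states):
    # Concatenate the hidden states of the last and first words as the
    # hidden state of the whole sentence.
    return np.concatenate([word_states[-1], word_states[0]])

def baseline_probability(src_states, tgt_states, W, b):
    hs = sentence_state(src_states)
    ht = sentence_state(tgt_states)
    # Element-wise product plus absolute difference, passed through tanh,
    # then a fully connected layer and a sigmoid classification probability.
    features = np.tanh(hs * ht + np.abs(hs - ht))
    logit = W @ features + b
    return 1.0 / (1.0 + np.exp(-logit))
```

Note that `features` never looks at individual word pairs, which is exactly the weakness the text points out.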
As shown in Fig. 2, another line of work compares the similarity between word pairs and then captures the most useful information with a convolutional neural network for classification. For each word pair across the two sentences, cosine similarity and Euclidean distance are computed from the word embeddings, yielding an m*n similarity matrix, where m and n are the source and target sentence lengths respectively; a convolutional neural network then extracts the most useful information from this matrix, and finally the S-shaped sigmoid function outputs the class probability. By capturing matching information between the sentence pair through word-pair similarities, this method improves sentence alignment performance.
However, sentence alignment requires not only that an aligned pair contain many mutually translated word pairs but also that the pair be semantically consistent. Computing word-pair similarity from word embeddings alone may lose the contextual information of the sentence, while deciding alignment only from sentence vectors encoded by a bidirectional recurrent network easily loses word-level matching information.
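The word-pair similarity matrix of the convolutional baseline can be sketched as follows. The text does not state how the cosine and Euclidean values are combined into one score, so returning both as separate m*n matrices is an assumption here.

```python
import numpy as np

def word_pair_similarities(src_emb, tgt_emb):
    # src_emb: (m, d) source word embeddings; tgt_emb: (n, d) target ones.
    # Returns two m*n matrices: cosine similarity and Euclidean distance
    # for every word pair across the two sentences.
    m, n = len(src_emb), len(tgt_emb)
    cos = np.zeros((m, n))
    euc = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            s, t = src_emb[i], tgt_emb[j]
            cos[i, j] = s @ t / (np.linalg.norm(s) * np.linalg.norm(t) + 1e-8)
            euc[i, j] = np.linalg.norm(s - t)
    return cos, euc
```

A convolutional network would then be run over these matrices; note the inputs are raw embeddings, so sentence context is indeed invisible at this stage.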
Summary of the invention
The technical problem to be solved by the present invention is to overcome the shortcomings of the prior art by providing a sentence alignment method based on a deep neural network. Two mutually translated sentences contain mostly mutually translated words, and conventional methods likewise use word-level information for the alignment decision; however, they need dictionary information to judge whether a word appears in the dictionary, whereas the present invention captures the semantic-relation features between word pairs without any additional dictionary. Taking the word hidden states produced by the bidirectional recurrent neural network layer as input, the present invention uses a network in which a bilinear model and a single-layer neural network are merged by a gating mechanism, capturing word-pair relations from both the linear and the nonlinear angle to obtain a word-pair similarity matrix.
A further technical problem to be solved is that conventional methods generally consider only lexical information, whereas two mutually translated sentences must also be consistent in sentence meaning. On the basis of the above, the present invention additionally encodes each sentence to obtain a representation of the whole sentence, capturing its semantic features.
The technical solution of the present invention is a sentence alignment method based on a deep neural network, comprising the following steps. Corpus preprocessing: generate a vocabulary and a word embedding vocabulary from the training corpus.
1) Word embedding layer: for each word in a sentence, look up its corresponding word embedding in the embedding table.
2) Bidirectional recurrent neural network layer: encode the sentence, taking into account not only the semantic information of each word itself but also its context, so that each word obtains a hidden state that includes its contextual information.
3) Gate related network layer: compute the semantic-relation information between word pairs across the two sentences. Taking the hidden states obtained by the bidirectional recurrent network as input, use the network in which a bilinear model and a single-layer neural network are merged by a gating mechanism, i.e., the gate related network, to capture word-pair similarity from the two angles of linear and nonlinear relations, and then capture the most informative parts with a max-pooling operation.
4) Perceptron layer: feed the result of the gate related network layer, as a vector representation, into a perceptron, which produces a more abstract representation for judging sentence alignment.
5) Compute the probability that the two sentences are aligned, judge alignment according to whether this probability exceeds a set value, and then extract the aligned sentences from the two documents.
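Steps 3) to 5) can be sketched end to end as follows. A plain dot product stands in for the full gate related network score and an untrained single-weight classifier stands in for the perceptron, so this is only a shape-level illustration of the pipeline, not the trained model.

```python
import numpy as np

def align_probability(src_states, tgt_states, p1=2, p2=2, threshold=0.5, seed=0):
    # Step 3: word-pair score matrix (dot product as a stand-in for the
    # gate related network), followed by p1*p2 max pooling.
    sim = src_states @ tgt_states.T
    m, n = sim.shape
    sim = sim[:m - m % p1, :n - n % p2]
    pooled = sim.reshape(m // p1, p1, n // p2, p2).max(axis=(1, 3))
    v = pooled.ravel()
    # Step 4: a toy, untrained perceptron over the pooled vector.
    W = np.random.default_rng(seed).normal(scale=0.1, size=v.size)
    # Step 5: sigmoid probability and threshold decision.
    prob = 1.0 / (1.0 + np.exp(-(W @ v)))
    return prob, int(prob > threshold)
```

The pooled matrix here has shape (m//p1, n//p2), matching the (m/p1) x (n/p2) size given later for the gate related network output.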
Based on the above, in a further improved scheme the following step is added after step 2):
31) Convolutional neural network layer: compute sentence vectors with a multi-layer convolutional neural network, which captures sentence-meaning information better; encode each sentence to obtain its vector representation, and concatenate the two sentence vectors.
The perceptron layer of step 4) then combines the result of the gate related network layer and the result of the convolutional neural network layer into one vector representation, feeds it into the perceptron, and obtains a more abstract representation through the perceptron for judging sentence alignment.
Based on the above, in a further improved scheme the perceptron layer is a multilayer perceptron layer.
Based on the above, in a further improved scheme the bidirectional recurrent neural network layer encodes not only each word itself but also the sentence context, from left to right and from right to left. We describe here only the bidirectional recurrent encoding of a source sentence s of length m. The encoding of s uses a pair of recurrent networks, i.e., a bidirectional recurrent neural network: the forward recurrent network reads the input sentence sequence s = (s1, ..., sm) from left to right and outputs a forward hidden state; the backward recurrent network reads the input sentence sequence from right to left and outputs a backward hidden state. The hidden state hj of each word sj in the source sentence s is then expressed as the concatenation of the forward and backward hidden states. In the present invention the bidirectional recurrent network uses gated recurrent units to address the problem of learning long-term dependencies, as follows. At position j, the forward hidden state is updated according to the following four formulas (the standard gated recurrent unit update):
zj = σ(Wz[sj, hj-1])
rj = σ(Wr[sj, hj-1])
h̃j = tanh(W[sj, rj ⊙ hj-1])
hj = (1 - zj) ⊙ hj-1 + zj ⊙ h̃j
where hj-1 denotes the hidden state of the previous step and sj is the word embedding of the j-th word. Since the computation of hj contains information from the preceding positions and the current step, the hidden state of each word more or less includes the information of the sequence before it. σ is the sigmoid function; Wz, Wr, and W are the model parameters to be learned, randomly initialized at the start of training and optimized by gradient descent during training; [·,·] denotes concatenation followed by matrix multiplication with the parameter, and ⊙ denotes element-wise multiplication. The backward hidden state is updated in the same fashion. The present invention uses the same bidirectional recurrent network to encode the source sentence s and the target sentence t, so the formulas for t are identical. Long short-term memory units can also substitute for the gated recurrent units used here: at position j, the forward hidden state is updated according to the following six formulas (the standard LSTM update):
fj = σ(Wf[sj, hj-1] + bf)
ij = σ(Wi[sj, hj-1] + bi)
c̃j = tanh(Wc[sj, hj-1] + bc)
cj = fj ⊙ cj-1 + ij ⊙ c̃j
oj = σ(Wo[sj, hj-1] + bo)
hj = oj ⊙ tanh(cj)
where hj-1 denotes the hidden state of the previous step, sj is the word embedding of the j-th word, σ is the sigmoid function, and Wf, Wi, Wc, Wo, bf, bi, bc, bo are the model parameters to be learned, randomly initialized at the start of training and optimized by gradient descent during training; ⊙ denotes element-wise multiplication.
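The four-formula forward update can be written as a single GRU step. This is a numpy sketch; the convention that each parameter multiplies the concatenation [sj, hj-1] is an assumption matching the formulas above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, s_j, Wz, Wr, W):
    # One forward GRU update at position j: update gate z, reset gate r,
    # candidate state, and element-wise (⊙) interpolation with h_{j-1}.
    concat = np.concatenate([s_j, h_prev])
    z = sigmoid(Wz @ concat)
    r = sigmoid(Wr @ concat)
    h_tilde = np.tanh(W @ np.concatenate([s_j, r * h_prev]))
    return (1 - z) * h_prev + z * h_tilde
```

Running this left to right gives the forward states; running it over the reversed sequence gives the backward states, and the two are concatenated per word.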
Based on the above, in a further improved scheme the gate related network layer takes the hidden states of the two sentences after bidirectional recurrent encoding, (hs1, ..., hsm) and (ht1, ..., htn), as input. It then computes a similarity score for each word pair (si, tj):
s(si, tj) = U · (g ⊙ (hsi M[1:r] htj) + (1 - g) ⊙ f(V[hsi, htj])), with g = σ(Wg[hsi, htj])
where hsi M[1:r] htj is a bilinear model that captures the linear relation between the word pair, f(V[hsi, htj]) is a single-layer neural network that captures the nonlinear relation, and g is the gate that decides how to integrate the linear and nonlinear relations, i.e., in what proportion the two are merged into the final similarity score. U, M, V, and Wg are parameters, randomly initialized at the start of training and optimized by gradient descent during training. After the above computation, a matrix of size m*n is obtained, where m and n are the lengths of the source and target sentences. The relation between two text fragments is usually determined by a few strong semantic interactions, so after obtaining the similarity score matrix the present invention partitions the matrix with a max-pooling strategy: a pooling window of size p1 × p2 takes the maximum of each p1 × p2 block of the similarity score matrix, giving a matrix of size (m/p1) × (n/p2), which is then reshaped into a one-dimensional vector vg as the output of the gate related network.
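The gate related network score and the pooling step can be sketched as follows. The exact arrangement of U, M, V, and Wg is an assumption consistent with the description: r bilinear slices, a single-layer tanh network, and a sigmoid gate mixing the two.

```python
import numpy as np

def grn_score(hs, ht, U, M, V, Wg):
    # hs, ht: hidden states of one source and one target word (dim d).
    # M: (r, d, d) bilinear slices; V, Wg: (r, 2d); U: (r,).
    concat = np.concatenate([hs, ht])
    linear = np.einsum('i,rij,j->r', hs, M, ht)   # hs M[1:r] ht
    nonlinear = np.tanh(V @ concat)               # f(V[hs, ht])
    g = 1.0 / (1.0 + np.exp(-(Wg @ concat)))      # gate
    return U @ (g * linear + (1 - g) * nonlinear)

def pooled_output(sim, p1, p2):
    # Max-pool the m*n score matrix in p1*p2 blocks and flatten to v_g.
    m, n = sim.shape
    sim = sim[:m - m % p1, :n - n % p2]
    return sim.reshape(m // p1, p1, n // p2, p2).max(axis=(1, 3)).ravel()
```

With g near 1 the score is dominated by the bilinear (linear-relation) term, and with g near 0 by the single-layer network, which is the trade-off the gate learns.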
Based on the above, in a further improved scheme the convolutional neural network layer uses two identical convolutional layers to encode the source sentence s and the target sentence t. For an input sentence s = (s1, ..., sm), the word embedding layer yields a matrix of size [m, embed_size]; the present invention reshapes it into a matrix E of size [m, embed_size, 1] and defines a filter of shape [filter_depth, filter_rows, 1]. The filter then slides over the matrix E according to the sliding stride; at each position, the region of E covered by the filter is multiplied element-wise with the filter and the products are summed to give the value at the corresponding position of the output. Finally, an output of smaller shape [m - filter_depth + 1, embed_size - filter_rows + 1, 1] is obtained. The present invention implements this operation with the packaged conv3d function of the Theano framework in the Python programming language. A 3-dimensional max-pooling strategy provided by Theano is then applied to the output to extract its most informative parts: a pooling window of shape [p1, p2, 1] takes the maximum of each [p1, p2, 1] block of the input matrix, giving a most-informative matrix of size [(m - filter_depth + 1)/p1, (embed_size - filter_rows + 1)/p2, 1]. Within one convolution pass the filter values remain unchanged, so the convolution can enhance important features of the sentence and reduce noise. Since the features learned by a single convolutional layer are often local, the present invention uses two convolutional layers to learn more global features, the input of the second convolutional layer being the output of the first. After the source sentence s and the target sentence t are encoded, the results for the two sentences are concatenated along the last dimension as the output vc of the convolutional neural network layer.
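The sliding-filter operation described above (implemented in the patent with Theano's conv3d) reduces to a "valid" 2-D correlation once the trailing singleton dimension is dropped. A minimal numpy sketch, with stride 1 assumed:

```python
import numpy as np

def conv_valid(E, filt):
    # E: (m, embed_size) sentence matrix; filt: (filter_depth, filter_rows).
    # At each sliding position the covered region of E is multiplied
    # element-wise with the filter and summed into one output cell.
    fd, fr = filt.shape
    m, e = E.shape
    out = np.zeros((m - fd + 1, e - fr + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(E[i:i + fd, j:j + fr] * filt)
    return out
```

Two such layers are stacked in the method, the output of the first feeding the second, each followed by [p1, p2] max pooling.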
Based on the above, in a further improved scheme the output vg of the gate related network and the output vc of the convolutional network, obtained as described, are first combined into one vector representation v:
v = Wg·vg + Wc·vc
This is then fed into the perceptron layer; in the present invention the perceptron comprises two hidden layers and one output layer, i.e., v is passed through several fully connected hidden layers to obtain a more abstract representation and is finally connected to the output layer:
cn = tanh (Wn·cn-1 + bn)
where Wg, Wc, Wn, and bn are parameters, randomly initialized at the start of training and optimized by gradient descent during training; cn-1 denotes the computation of the previous hidden layer of the perceptron. Finally, the output cn of the perceptron layer is passed through the sigmoid function to compute the probability p that the sentence pair is classified as 0 or 1: if the probability exceeds a given threshold ρ the sentences are judged aligned (1), otherwise not aligned (0).
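The combination and classification can be sketched as follows. This is a numpy sketch; the output-layer weights Wo and bo are assumed names, since the text only says the perceptron output is passed through the sigmoid.

```python
import numpy as np

def classify(vg, vc, Wg, Wc, W1, b1, W2, b2, Wo, bo, rho=0.5):
    v = Wg @ vg + Wc @ vc                       # v = Wg·vg + Wc·vc
    c1 = np.tanh(W1 @ v + b1)                   # first hidden layer
    c2 = np.tanh(W2 @ c1 + b2)                  # second hidden layer
    p = 1.0 / (1.0 + np.exp(-(Wo @ c2 + bo)))   # sigmoid probability
    return p, int(p > rho)                      # 1 = aligned, 0 = not aligned
```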
Based on the above, in a further improved scheme the language pair used by the present invention is Chinese-English: the Chinese documents are word-segmented, and the English documents are fully converted to lowercase. In the corpus preprocessing stage, the words of the training corpus are first sorted by frequency from large to small, the first thirty thousand words are selected, and the vocabulary is generated in the form "word number", where "number" counts from 0, so that each word can subsequently be represented by its number. The method then looks up, in a bilingual word embedding file, the embedding corresponding to each word of the vocabulary to generate the word embedding vocabulary. Compared with the traditional one-hot representation, word embeddings allow words of similar meaning to have similar representations. Therefore, in the word embedding layer, every word of the two input sentences is mapped to a low-dimensional vector of fixed size, namely that word's embedding in the word embedding vocabulary; all low-frequency words beyond the vocabulary are mapped to one special word embedding. The present invention initializes the word embedding layer with bilingual word embeddings: after the corpus preprocessing step has produced the vocabulary and the word embedding vocabulary, each input sentence is first converted into a number sequence by vocabulary lookup and then, by querying the word embedding vocabulary with this number sequence, into the corresponding embedding sequence, completing the initialization of the word embedding layer.
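The preprocessing step can be sketched as follows. The "<unk>" token name for out-of-vocabulary low-frequency words is an assumption; the text only says such words share one special embedding.

```python
from collections import Counter

def build_vocab(tokens, max_size=30000):
    # Sort words by frequency from large to small, keep the top 30,000,
    # and number them; all other (low-frequency) words map to "<unk>".
    vocab = {"<unk>": 0}
    for word, _ in Counter(tokens).most_common(max_size):
        vocab[word] = len(vocab)
    return vocab

def to_ids(sentence, vocab):
    # Convert a sentence into its number sequence by vocabulary lookup;
    # the embedding sequence is then read off the embedding vocabulary.
    return [vocab.get(w, vocab["<unk>"]) for w in sentence]
```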
The beneficial effects of the present invention are as follows. The present invention likewise uses neural network methods to study sentence alignment. When computing word-pair similarity, it first uses the word vectors obtained after bidirectional recurrent encoding rather than plain word embeddings, taking into account not only the semantic information of each word itself but also its context. Second, it merges two kinds of similarity, a bilinear model and a single-layer neural network, through a gating mechanism, so that word-pair relations are considered more fully from the two complementary angles of linear and nonlinear relations, instead of judging similarity by angle and distance with cosine similarity and Euclidean distance. Finally, it captures the most informative parts of the word-pair similarities with a max-pooling operation rather than a convolutional neural network.
Bidirectional recurrent neural network layer: the word hidden states obtained after bidirectional recurrent encoding contain not only each word's own meaning but also its contextual information. Gate related network layer: taking the word hidden states after bidirectional recurrent encoding as input, the network that merges a bilinear model and a single-layer neural network through a gating mechanism captures word-pair relations from the linear and the nonlinear angle, yielding a word-pair similarity matrix. Convolutional neural network layer: the present invention computes sentence vectors with a multi-layer convolutional neural network, capturing sentence-meaning information better and more fully. Fusion of word-pair information and sentence-meaning information: the present invention combines the output of the gate related network with the output of the convolutional network through a multilayer perceptron.
Brief description of the drawings
Fig. 1 is a structure diagram of an existing bidirectional recurrent neural network;
Fig. 2 is a structure diagram of an existing convolutional neural network;
Fig. 3 is a structure diagram of the sentence alignment method of the deep neural network of the present invention;
Fig. 4 is a structure diagram of the bidirectional recurrent neural network of the present invention;
Fig. 5 is a structure diagram of the gate related network of the present invention;
Fig. 6 is the Bi-RNN model structure of the present invention;
Fig. 7 is the Bi-RNN-CNN model structure of the present invention;
Fig. 8 is the Bi-RNN-GRN model structure of the present invention;
Fig. 9 is a flow chart of the sentence alignment method of the deep neural network of the present invention.
Specific embodiment
The present invention proposes a method based on a deep neural network for extracting parallel sentences, without any external dictionary or features. It mainly uses neural networks to capture the semantic features between sentences and the semantic-relation features between word pairs, exploiting a characteristic of aligned sentence pairs, namely that two aligned sentences contain mostly mutually translated words; for this reason a gate related network computes the similarity of every word pair across the two sentences. To better reflect the performance of the invention, we evaluate, on both a parallel corpus and a comparable corpus, the bidirectional recurrent neural network (Bi-RNN, Fig. 6), the bidirectional recurrent network combined with a convolutional network (Bi-RNN+CNN, Fig. 7), the bidirectional recurrent network combined with the gate related network (Bi-RNN+GRN, Fig. 8), the combination of all three (Bi-RNN+GRN+CNN, Fig. 3), and two common sentence alignment tools, Champollion and Gargantua.
Embodiment 1
The sentence alignment method of the deep neural network (Bi-RNN+GRN+CNN) is shown in Fig. 3, the structure diagram of the sentence alignment method based on the deep neural network of the present invention; Fig. 9 gives the corresponding flow chart. The method comprises the following steps:
1) Corpus preprocessing: generate a vocabulary and a word embedding vocabulary from the training corpus.
2) Word embedding layer: for each word in a sentence, look up its corresponding word embedding in the embedding table, i.e., represent each word as a vector using the bilingual word embeddings provided by reference [note 1], so that similar words have similar representations.
3) Bidirectional recurrent neural network layer: encode the sentence, taking into account not only the semantic information of each word itself but also its context, so that each word obtains a hidden state that includes its contextual information.
4) Gate related network layer: compute the semantic information between word pairs across the two sentences. Taking the hidden states obtained by the bidirectional recurrent network as input, use the network that merges a bilinear model and a single-layer neural network through a gating mechanism, i.e., the gate related network, to capture word-pair similarity from the two angles of linear and nonlinear relations, and then capture the most informative parts with a max-pooling operation, obtaining a word-pair similarity matrix.
5) Convolutional neural network layer: compute sentence vectors with a two-layer convolutional neural network, which captures sentence-meaning information better; encode each sentence to obtain its vector representation, and concatenate the two sentence vectors.
6) Multilayer perceptron layer: combine the result of the gate related network layer and the result of the convolutional neural network layer into one vector representation, feed it into the multilayer perceptron, and obtain a more abstract representation through the multilayer perceptron for judging sentence alignment.
7) Compute the probability that the two sentences are aligned, judge alignment according to whether this probability exceeds a set value, and then extract the aligned sentences from the two documents. The decision is binary: 0 means not aligned, 1 means aligned. The present invention regards sentence alignment as a 0/1 classification task: two bilingual sentences are input, and the model operations above yield the probability that they are aligned. When this probability exceeds the given threshold, the output 1 indicates that the two input sentences are aligned, i.e., a parallel sentence pair; when it is below the threshold, the output 0 indicates that they are not aligned, i.e., not a parallel sentence pair.
[note 1] Will Y. Zou, Richard Socher, Daniel Cer, et al. Bilingual Word Embeddings for Phrase-Based Machine Translation. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013: 1393-1398.
Embodiment 2
The bidirectional recurrent neural network model (Bi-RNN), shown in Fig. 6, is another embodiment of the present invention. This sentence alignment method based on a deep neural network comprises the following steps:
1) Corpus preprocessing: generate a vocabulary and a word embedding vocabulary from the training corpus.
2) Word embedding layer: for each word in a sentence, look up its corresponding word embedding in the embedding table.
3) Bidirectional recurrent neural network layer: encode the sentence, taking into account not only the semantic information of each word itself but also its context, so that each word obtains a hidden state that includes its contextual information. Average the hidden states of the words in each sentence to obtain a sentence vector, then concatenate the two sentence vectors to obtain vr.
4) Multilayer perceptron layer: feed the result of the bidirectional recurrent network layer into the multilayer perceptron and obtain a more abstract representation through the multilayer perceptron for judging sentence alignment.
5) Compute the probability that the two sentences are aligned, judge alignment according to whether this probability exceeds a set value, and then extract the aligned sentences from the two documents.
Bidirectional recurrent neural network layer: the word hidden states obtained after bidirectional recurrent encoding contain not only each word's own meaning but also its contextual information. The performance of Bi-RNN in Tables 1 and 2 shows that the information captured by the bidirectional network layer helps to extract aligned sentences.
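The averaging and concatenation in step 3) amount to the following minimal sketch:

```python
import numpy as np

def mean_sentence_vector(hidden_states):
    # Average the per-word hidden states of one sentence.
    return np.asarray(hidden_states).mean(axis=0)

def pair_representation(src_states, tgt_states):
    # Concatenate the two averaged sentence vectors into v_r.
    return np.concatenate([mean_sentence_vector(src_states),
                           mean_sentence_vector(tgt_states)])
```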
Embodiment 3
The bidirectional recurrent neural network + convolutional neural network model (Bi-RNN+CNN), shown in Fig. 7, is another embodiment of the present invention. This sentence alignment method based on a deep neural network comprises the following steps:
1) Corpus preprocessing: generate a vocabulary and a word embedding vocabulary from the training corpus.
2) Word embedding layer: for each word in a sentence, look up its corresponding word embedding in the embedding table.
3) Bidirectional recurrent neural network layer: encode the sentence, taking into account not only the semantic information of each word itself but also its context, so that each word obtains a hidden state that includes its contextual information. Average the hidden states of the words in each sentence to obtain a sentence vector, then concatenate the two sentence vectors to obtain vr.
4) Convolutional neural network layer: compute sentence vectors with a two-layer convolutional neural network, which captures sentence-meaning information better; encode each sentence to obtain its vector representation, and concatenate the two sentence vectors to obtain vc.
5) Multilayer perceptron layer: combine the result of the bidirectional recurrent network and the result of the convolutional network into one vector representation and feed it into the multilayer perceptron, i.e., compute v = Wr·vr + Wc·vc (Wr and Wc are parameters), then obtain a more abstract representation through the multilayer perceptron for judging sentence alignment.
6) Compute the probability that the two sentences are aligned, judge alignment according to whether this probability exceeds a set value, and then extract the aligned sentences from the two documents.
Convolutional neural networks layer: the present invention calculates sentence vector using two layers of convolutional neural networks, can be more preferable, more complete
Capture to face sentence justice information.As can be found from Table 1, compared with Bi-RNN, the F1 value of Bi-RNN+CNN improves 0.6;
Embodiment 4
The bidirectional recurrent neural network + gated relevance network model (Bi-RNN+GRN) is shown in Fig. 8, which depicts the structure of the Bi-RNN+GRN model of the present invention; it is another embodiment of the present invention. The sentence alignment method based on a deep neural network includes the following steps:
1) corpus preprocessing: generate a vocabulary and a word embedding table from the training corpus;
2) word embedding layer: for each word in a sentence, look up its corresponding word embedding in the word embedding table;
3) bidirectional recurrent neural network layer: encode the sentence, taking into account not only the semantic information of each word itself but also its contextual information, so that each word obtains a hidden state containing its contextual information;
4) gated relevance network layer: compute the semantic information between word pairs across the two sentences; taking the word hidden states produced by the bidirectional recurrent neural network as input, use a network that fuses a bilinear model and a single-layer neural network through a gating mechanism, i.e. the gated relevance network, to capture the similarity between word pairs from the two perspectives of linear and nonlinear relations, and then apply a max-pooling operation to capture its most informative parts;
5) multilayer perceptron layer: feed the result of the gated relevance network layer into a multilayer perceptron, which produces a more abstract representation for judging sentence alignment;
6) compute the probability that the two sentences are aligned, finally judge whether the two sentences are aligned according to whether the probability is greater than a set value, and then extract the aligned sentences from the two documents.
Gated relevance network layer: taking the word hidden states encoded by the bidirectional recurrent neural network layer as input, the layer uses a network that fuses a bilinear model and a single-layer neural network through a gating mechanism to capture the relations between word pairs from the linear and the nonlinear perspective, obtaining a word-pair similarity matrix. As the performance of Bi-RNN+GRN in Tables 1 and 2 shows, sentence alignment performance is greatly improved: on overall performance (All), its F1 value is 8.6 and 29.7 higher than that of Bi-RNN, respectively. This shows that word-pair relations are an important feature for judging sentence alignment, and that the gated relevance network layer of the present invention can effectively capture the relational information between word pairs.
Embodiment 5
The difference from the embodiments above is that the multilayer perceptron layer is a fully connected layer.
The present invention is tested on a Chinese-English bilingual corpus to compare the sentence alignment performance of the systems.
1) Effect of the bidirectional recurrent neural network (RNN) and the convolutional neural network (CNN):
Table 1 gives the alignment performance of the Bi-RNN and Bi-RNN+CNN models on the parallel corpus; their F1 values are 89.2 and 89.8, respectively. The Bi-RNN model encodes the input sentence pair with a bidirectional recurrent neural network so that the hidden state of each word captures its contextual information. The Bi-RNN+CNN model builds on the Bi-RNN model by using a convolutional neural network to obtain a whole-sentence vector representation. These results show that even simple deep neural networks are able to extract the parallel sentence pairs in a bilingual corpus.
Table 1
2) Effect of the gated relevance network (GRN):
Tables 1 and 2 compare the performance of the different models in extracting parallel sentence pairs from the parallel corpus and the comparable corpus, respectively.
Table 2
The Bi-RNN+GRN model builds on the Bi-RNN model by using a gated relevance network (GRN) to capture the semantic relations between word pairs, i.e. it computes the similarity between each word in the source sentence and each word in the target-side sentence. The results show that, compared with Bi-RNN and Bi-RNN+CNN, on overall performance (All) the alignment F1 value of Bi-RNN+GRN is on average 8.3 and 30.2 higher on the parallel corpus and the comparable corpus, respectively, which shows that the gated relevance network captures the semantic relational information between word pairs well and provides strong evidence for judging sentence alignment.
3) Comparison with existing sentence alignment tools:
Point 1) above showed that the sentence vector representation captured by the convolutional neural network helps sentence alignment performance to some extent, so the convolutional neural network is added to the final model of the invention, i.e. Bi-RNN+GRN+CNN. Tables 1 and 2 show that this indeed improves performance. Table 3 compares the performance of the deep neural network method Bi-RNN+GRN+CNN and the sentence alignment tools Champollion and Gargantuan in extracting parallel sentence pairs from the comparable corpus and the parallel corpus.
Table 3
As can be seen from the table, the experimental results of the invention outperform the tools both in extracting different types of sentence pairs, such as 1-1 or 1-2/2-1, and in overall performance. This also shows that the sentence alignment method based on deep neural networks proposed by the present invention can improve sentence alignment performance.
Based on the fact that two sentences that are translations of each other (a parallel sentence pair) contain a large number of word pairs that are translations of each other, the present invention fully exploits the importance of word-pair information in the sentence alignment task and uses a gated relevance network to capture the semantic relations between word pairs. In addition, since a bidirectional recurrent neural network can capture the contextual information of words, the present invention uses a bidirectional recurrent neural network as the bottom network to encode sentences. Furthermore, to obtain more evidence for the alignment decision, the present invention obtains a sentence vector representation through a convolutional neural network. Tables 1 and 2 show that considering only sentence-level semantic information (Bi-RNN, Bi-RNN+CNN) or only word-pair information (Bi-RNN+GRN) already achieves a certain alignment performance, while a comparison with the final model (Bi-RNN+GRN+CNN) shows that combining the two kinds of information is substantially better than using either kind alone. Meanwhile, the test results on the comparable corpus show that the performance of the method of the invention is clearly improved over common sentence alignment tools, obtaining, as shown in Table 3, F1 improvements of 2.0 and 2.2 on overall performance (All), respectively.
Corpus preprocessing
The language pair used in the present invention is Chinese-English: Chinese documents are word-segmented, and English documents are fully converted to lowercase. In the corpus preprocessing stage, the words in the training corpus are first sorted by word frequency in descending order, and the top 30,000 words are selected to generate a vocabulary with entries of the form "word number", where "number" counts from 0, so that the corresponding word can be represented by its "number". Then, for each word in the vocabulary, the method looks up the corresponding word embedding in the bilingual word embedding file provided by the reference paper [note 1] and generates a word embedding table. For example, for the vocabulary entry "positive 248" and the entry "positive <word embedding (expressed as a vector)>" in the bilingual word embedding file, the entry generated in the word embedding table is "248 <word embedding>".
Documents in other languages can also be processed as needed.
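The vocabulary and word embedding table generation described above can be sketched roughly as follows; the tiny corpus, the cutoff of 3 words in place of 30,000, and the toy embeddings standing in for the bilingual word embedding file of [note 1] are all assumptions of this sketch.

```python
from collections import Counter

def build_vocab(tokens, max_words=30000):
    """Sort words by descending frequency and number them from 0."""
    ranked = [w for w, _ in Counter(tokens).most_common(max_words)]
    return {word: number for number, word in enumerate(ranked)}

def build_embedding_table(vocab, bilingual_embeddings):
    """Map each vocabulary number to that word's bilingual embedding."""
    return {number: bilingual_embeddings[word]
            for word, number in vocab.items() if word in bilingual_embeddings}

tokens = "the cat saw the dog the dog ran".split()
vocab = build_vocab(tokens, max_words=3)
print(vocab)  # {'the': 0, 'dog': 1, 'cat': 2} -- most frequent word gets 0
```

A word can then be represented by its number, and the number in turn indexes the word embedding table.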
Word embedding layer
Compared with the traditional one-hot vector representation, word embeddings allow words with similar meanings to have similar representations. Therefore, in the embedding layer (word embedding layer), all words of the two input sentences are mapped to low-dimensional vectors of fixed size; each such vector is the word embedding corresponding to the word in the word embedding table. All low-frequency words outside the vocabulary are mapped to one special word embedding. In the experiments, the present invention initializes the word embedding layer with the bilingual word embeddings provided by the reference paper [note 1]. Through the corpus preprocessing step, the present invention obtains the vocabulary and the word embedding table. For each input sentence, the vocabulary is first queried to generate the corresponding number sequence; for example, "I love China" generates a corresponding number sequence (e.g. "1 3 10"). Then the word embedding table is queried with the number sequence to generate the corresponding word embedding sequence, which completes the initialization of the word embedding layer. Word embedding layer: using the bilingual word embeddings provided by the reference paper [note 1], words are represented as vectors so that similar words have similar representations, providing word features for the later steps.
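The lookup performed by the word embedding layer can be sketched as follows; the vocabulary numbers and the random vectors standing in for the bilingual embeddings of [note 1] are assumptions of this sketch.

```python
import numpy as np

EMBED_SIZE = 4
vocab = {"我": 1, "爱": 3, "中国": 10}  # word -> number, from preprocessing
rng = np.random.default_rng(0)
# One embedding per vocabulary number, plus one shared special embedding
# for low-frequency words outside the vocabulary.
embed_table = {n: rng.normal(size=EMBED_SIZE) for n in vocab.values()}
UNK = rng.normal(size=EMBED_SIZE)

def embed_sentence(words):
    """Sentence -> number sequence -> word embedding sequence."""
    numbers = [vocab.get(w) for w in words]
    return np.stack([embed_table[n] if n is not None else UNK
                     for n in numbers])

E = embed_sentence(["我", "爱", "中国"])
print(E.shape)  # (3, 4): one embedding row per word
```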
Bidirectional recurrent neural network layer
The bidirectional recurrent neural network encodes not only each word itself but also the left-to-right and right-to-left context of the sentence (which provides important features for sentence translation). To avoid repetition, we only describe here the bidirectional recurrent neural network process for a source sentence s of length m. The structure of the bidirectional recurrent neural network is shown in Fig. 4. The source sentence s is encoded with a pair of neural networks, i.e. a bidirectional recurrent neural network: the forward recurrent neural network reads the input sentence sequence s = (s1, ..., sm) from left to right and outputs the forward hidden states (→h1, ..., →hm); the backward recurrent neural network reads the input sentence sequence from right to left and outputs the backward hidden states (←h1, ..., ←hm). The hidden state hj of each word sj in the source sentence s is then expressed as the concatenation of →hj and ←hj. The bidirectional recurrent neural network in the present invention uses gated recurrent units (Gated Recurrent Unit, GRU for short) to solve the problem of learning long-term dependencies, as follows: at position j, the forward hidden state →hj is updated according to the following four formulas:

zj = σ(Wz·[→hj−1, sj])
rj = σ(Wr·[→hj−1, sj])
h̃j = tanh(W·[rj ⊙ →hj−1, sj])
→hj = (1 − zj) ⊙ →hj−1 + zj ⊙ h̃j

where →hj−1 denotes the hidden state of the previous time step and sj is the word embedding of the j-th word. Since the calculation of →hj contains the information of the preceding positions and the current time step, the hidden state of each word more or less contains the information of the sequence before that word. σ is the sigmoid function; Wz, Wr, W are the model parameters to be learned, which are randomly initialized at the start of training and optimized with the gradient descent algorithm during training. · denotes matrix multiplication, ⊙ denotes element-wise multiplication. Likewise, the backward hidden state ←hj is updated in the same fashion. It is worth noting that the present invention encodes the source sentence s and the target-side sentence t with the same bidirectional recurrent neural network, so the calculation formulas for the target-side sentence t are the same as above. At the same time, the gated recurrent unit (GRU) used by the present invention can also be replaced by a long short-term memory unit (Long Short-Term Memory, LSTM for short). At position j, the forward hidden state →hj is then updated according to the following six formulas:

fj = σ(Wf·[→hj−1, sj] + bf)
ij = σ(Wi·[→hj−1, sj] + bi)
c̃j = tanh(Wc·[→hj−1, sj] + bc)
cj = fj ⊙ cj−1 + ij ⊙ c̃j
oj = σ(Wo·[→hj−1, sj] + bo)
→hj = oj ⊙ tanh(cj)

where →hj−1 denotes the hidden state of the previous time step, sj is the word embedding of the j-th word, σ is the sigmoid function, and Wf, Wi, Wc, Wo, bf, bi, bc, bo are the model parameters to be learned, which are randomly initialized at the start of training and optimized with the gradient descent algorithm during training. · denotes matrix multiplication, ⊙ denotes element-wise multiplication.
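One forward GRU step under the concatenated-input formulation written above can be sketched in plain numpy as follows; the random weights stand in for the learned parameters Wz, Wr, W, so this is a shape-level illustration rather than a trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, s_j, Wz, Wr, W):
    """One forward GRU update at position j (the four formulas above)."""
    x = np.concatenate([h_prev, s_j])
    z = sigmoid(Wz @ x)                       # update gate zj
    r = sigmoid(Wr @ x)                       # reset gate rj
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, s_j]))
    return (1.0 - z) * h_prev + z * h_tilde   # new hidden state

hidden, embed = 5, 4
rng = np.random.default_rng(1)
Wz, Wr, W = [rng.normal(size=(hidden, hidden + embed)) for _ in range(3)]
h = np.zeros(hidden)                          # initial hidden state
for s_j in rng.normal(size=(3, embed)):       # run over a 3-word sentence
    h = gru_step(h, s_j, Wz, Wr, W)
print(h.shape)  # (5,)
```

The backward pass is the same computation over the reversed sentence; concatenating the forward and backward states gives hj.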
Gated relevance network layer
Word-pair similarity features are widely used in many tasks, and given the characteristics of aligned sentence pairs, word-pair similarity features greatly improve sentence alignment performance. In the present invention, we use a gated relevance network to model word pairs and capture the contextual relations between the word pairs of a bilingual sentence pair; it is applied on top of the bidirectional recurrent neural network layer. Most existing sentence alignment methods make the judgment from features such as length and dictionaries; the dictionary-based methods obtain the word matching information of a sentence pair from a dictionary and thereby the degree to which the pair matches. Such methods match mechanically on shallow word-sense information and can hardly capture deep semantic information. In order to better capture the richer semantic information between word pairs in the two sentences, the present invention uses a network that fuses a bilinear model and a single-layer neural network through a gating mechanism, i.e. the gated relevance network, to capture the similarity between word pairs from the two perspectives of linear and nonlinear relations. The structure of the gated relevance network is shown in Fig. 5. The network takes the hidden states (hs1, ..., hsm) and (ht1, ..., htn) of the two sentences encoded by the bidirectional recurrent neural network as input; the advantage is that, after encoding by the bidirectional recurrent neural network, the hidden state of each word contains not only its own semantic information but also its contextual information. Then, the similarity score of each word pair (si, tj) is calculated by the following formula:

score(si, tj) = U·(g ⊙ (hsi^T·M[1:r]·htj) + (1 − g) ⊙ f(V·[hsi, htj])),  g = σ(Wg·[hsi, htj])

where, as shown in the lower-left part (a) of Fig. 5, hsi^T·M[1:r]·htj is the bilinear model, used to capture the linear relation between the word pair; as shown in the lower-right part (b) of Fig. 5, f(V·[hsi, htj]) is the single-layer neural network, used to capture the nonlinear relation between the word pair; and as shown in the lower-middle part (c) of Fig. 5, g is the gate that determines how to integrate the linear and the nonlinear relation between the word pair, i.e. in what proportions the final word-pair similarity score merges the linear and the nonlinear relation. U, M, V, Wg are parameters, randomly initialized at the start of training and optimized with the gradient descent algorithm during training. After the above calculation, a matrix of size m*n is obtained, where m and n are the lengths of the source sentence and the target-side sentence, respectively, i.e. the grey squares shown in Fig. 3. The relation between two text fragments is usually determined by some strong semantic interactions, so after obtaining the similarity score matrix of the grey squares in Fig. 3, the present invention partitions the matrix with a max-pooling strategy: through a max pooling of size p1 × p2, i.e. taking the maximum of each p1 × p2 block of the similarity score matrix, a matrix of size (m/p1) × (n/p2) is obtained, which is then reshaped into a one-dimensional vector vg as the output of the gated relevance network.
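The gated relevance network score and the max pooling can be sketched as follows. The exact combination in the patent's formula is not fully reproduced in the text, so this sketch assumes a gate-weighted mix of the bilinear term hsi^T·M[1:r]·htj and the single-layer term f(V·[hsi, htj]), with random stand-ins for the parameters U, M, V, Wg.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grn_score(hs_i, ht_j, U, M, V, Wg):
    """Gate-weighted mix of a bilinear and a single-layer-NN score."""
    pair = np.concatenate([hs_i, ht_j])
    bilinear = np.array([hs_i @ M[k] @ ht_j for k in range(M.shape[0])])
    nonlinear = np.tanh(V @ pair)             # f(V·[hsi, htj])
    g = sigmoid(Wg @ pair)                    # gate g in (0, 1)
    return float(U @ (g * bilinear + (1.0 - g) * nonlinear))

def max_pool(S, p1, p2):
    """Take the maximum of each p1 x p2 block of S."""
    return np.array([[S[a:a + p1, b:b + p2].max()
                      for b in range(0, S.shape[1], p2)]
                     for a in range(0, S.shape[0], p1)])

d, r = 4, 2                                   # hidden size, bilinear slices
rng = np.random.default_rng(2)
U = rng.normal(size=r)
M = rng.normal(size=(r, d, d))
V, Wg = rng.normal(size=(r, 2 * d)), rng.normal(size=(r, 2 * d))
hs = rng.normal(size=(4, d))                  # source hidden states, m = 4
ht = rng.normal(size=(6, d))                  # target hidden states, n = 6
S = np.array([[grn_score(a, b, U, M, V, Wg) for b in ht] for a in hs])
vg = max_pool(S, 2, 2).reshape(-1)            # (m/p1)*(n/p2) = 6 values
print(S.shape, vg.shape)  # (4, 6) (6,)
```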
Convolutional neural network layer
Similar to the bidirectional recurrent neural network layer, the present invention encodes the source sentence s and the target-side sentence t with identical convolutional neural network layers. The input sentence s = (s1, ..., sm), after the word embedding layer, yields a matrix of size [m, embed_size] (embed_size is the size of the word embeddings), which the present invention reshapes into a matrix E of size [m, embed_size, 1], and a filter of shape [filter_depth, filter_rows, 1] is defined. Then, according to the chosen sliding stride, e.g. 1, the filter slides over the matrix E with stride 1; at each slide, the part of E covered by the filter is multiplied element-wise with the corresponding positions of the filter, and the products are summed as the value of the corresponding position in the output. Finally, we obtain an output of smaller shape [m-filter_depth+1, embed_size-filter_rows+1, 1]. The present invention implements this operation with the conv3d function provided in the Theano framework for the Python programming language. Then, the 3-dimensional max-pooling strategy provided in the Theano framework is applied to the output to obtain its most informative parts: through a max pooling of shape [p1, p2, 1], i.e. taking the maximum of each [p1, p2, 1] block of the above matrix, a most-informative matrix of size [(m-filter_depth+1)/p1, (embed_size-filter_rows+1)/p2, 1] is obtained. During one convolution pass the values of the filter remain unchanged, so the convolution can enhance some important features of the sentence and reduce noise. Since the features learned by one convolutional neural network layer are often local, the present invention uses two convolutional neural network layers to learn more global features; the input of the second convolutional neural network layer is the output of the first. After the source sentence s and the target-side sentence t are encoded, the resulting matrices of the two sentences are concatenated along the last dimension as the output vc of the convolutional neural network layer. Meanwhile, the two-layer convolutional neural network in the present invention can also be replaced by a multilayer convolutional neural network, i.e. the output of each convolutional layer serves as the input of the next.
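The convolution and pooling described above can be sketched with plain numpy in place of Theano's conv3d (the trailing singleton dimension is dropped for clarity); the filter values and the sentence matrix are random stand-ins.

```python
import numpy as np

def conv_valid(E, F):
    """Slide F over E with stride 1; sum of element-wise products."""
    m, e = E.shape
    fd, fr = F.shape
    out = np.empty((m - fd + 1, e - fr + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(E[i:i + fd, j:j + fr] * F)
    return out

def max_pool(A, p1, p2):
    """Take the maximum of each [p1, p2] block of A."""
    return np.array([[A[a:a + p1, b:b + p2].max()
                      for b in range(0, A.shape[1], p2)]
                     for a in range(0, A.shape[0], p1)])

rng = np.random.default_rng(3)
E = rng.normal(size=(9, 8))   # m = 9 words, embed_size = 8
F = rng.normal(size=(2, 3))   # filter_depth = 2, filter_rows = 3
out = conv_valid(E, F)        # shape (9-2+1, 8-3+1) = (8, 6)
pooled = max_pool(out, 2, 2)  # shape (4, 3)
# A second convolutional layer would take this output as its input.
print(out.shape, pooled.shape)  # (8, 6) (4, 3)
```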
Multilayer perceptron layer
As described above, the output vg of the gated relevance network and the output vc of the convolutional neural network are obtained. Next, the two vectors are first combined into one vector representation v:

v = Wg·vg + Wc·vc

Then v is fed into the multilayer perceptron layer. In the present invention the multilayer perceptron consists of two hidden layers and one output layer, i.e. v is input to two fully connected hidden layers to obtain a more abstract representation and is finally connected to the output layer:

c1 = tanh(v + b1)
c2 = tanh(W2·c1 + b2)

where Wg, Wc, W2 and b1, b2 are parameters that are randomly initialized at the start of training and optimized with the gradient descent algorithm during training; c1 and c2 denote the computations of the two hidden layers of the multilayer perceptron. Meanwhile, the two fully connected layers of the multilayer perceptron in the present invention can be replaced by multiple fully connected layers, i.e. cn = tanh(Wn·cn-1 + bn), where n is the number of fully connected layers. Finally, the output c2 of the multilayer perceptron layer is passed through the sigmoid function to compute the probability p that the sentence pair is classified as 0 or 1; if the probability is greater than a given threshold ρ, the sentences are judged aligned (1), otherwise not aligned (0).
Claims (8)
1. A sentence alignment method based on a deep neural network, characterized by comprising the following steps:
1) corpus preprocessing: generating a vocabulary and a word embedding table from the training corpus;
2) a word embedding layer: for each word in a sentence, looking up its corresponding word embedding in the word embedding table;
3) a bidirectional recurrent neural network layer: encoding the sentence, taking into account not only the semantic information of each word itself but also its contextual information, and obtaining for each word a hidden state containing its contextual information;
4) a gated relevance network layer: computing the semantic relational information between word pairs across the two sentences; taking the word hidden states produced by the bidirectional recurrent neural network as input, using a network that fuses a bilinear model and a single-layer neural network through a gating mechanism, i.e. the gated relevance network, to capture the similarity between word pairs from the two perspectives of linear and nonlinear relations, and then applying a max-pooling operation to capture its most informative parts;
5) a perceptron layer: feeding the result of the gated relevance network layer as a vector representation into a perceptron, which produces a more abstract representation for judging sentence alignment;
6) computing the probability that the two sentences are aligned, finally judging whether the two sentences are aligned according to whether the probability is greater than a set value, and then extracting the aligned sentences from the two documents.
2. The sentence alignment method based on a deep neural network according to claim 1, characterized in that after step 2) it further comprises the following step:
31) a convolutional neural network layer: computing the sentence vector with a multilayer convolutional neural network, which captures the semantic information of the sentence better; encoding the sentence to obtain its vector representation; and concatenating the two sentence vectors;
and step 5) comprises the following: the perceptron layer combines the result of the gated relevance network layer and the result of the convolutional neural network layer into one vector representation, feeds it into the perceptron, and obtains a more abstract representation through the perceptron for judging sentence alignment.
3. The sentence alignment method based on a deep neural network according to claim 1 or 2, characterized in that the perceptron layer is a multilayer perceptron layer.
4. The sentence alignment method based on a deep neural network according to claim 3, characterized in that: the bidirectional recurrent neural network layer encodes not only each word itself but also the left-to-right and right-to-left context of the sentence; only the bidirectional recurrent neural network process for a source sentence s of length m is described here; the source sentence s is encoded with a pair of neural networks, i.e. a bidirectional recurrent neural network: the forward recurrent neural network reads the input sentence sequence s = (s1, ..., sm) from left to right and outputs the forward hidden states (→h1, ..., →hm); the backward recurrent neural network reads the input sentence sequence from right to left and outputs the backward hidden states (←h1, ..., ←hm); the hidden state hj of each word sj in the source sentence s is then expressed as the concatenation of →hj and ←hj; the bidirectional recurrent neural network in the present invention uses gated recurrent units to solve the problem of learning long-term dependencies, as follows: at position j, the forward hidden state →hj is updated according to the following four formulas:
zj = σ(Wz·[→hj−1, sj])
rj = σ(Wr·[→hj−1, sj])
h̃j = tanh(W·[rj ⊙ →hj−1, sj])
→hj = (1 − zj) ⊙ →hj−1 + zj ⊙ h̃j
where →hj−1 denotes the hidden state of the previous time step and sj is the word embedding of the j-th word; since the calculation contains the information of the preceding states and the current time step, the hidden state of each word more or less contains the information of the sequence before that word; σ is the sigmoid function; Wz, Wr, W are the model parameters to be learned, randomly initialized at the start of training and optimized with the gradient descent algorithm during training; · denotes matrix multiplication, ⊙ denotes element-wise multiplication; likewise, the backward hidden state ←hj is updated in the same fashion; the present invention encodes the source sentence s and the target-side sentence t with the same bidirectional recurrent neural network, so the calculation formulas for the target-side sentence t are the same as above; at the same time, the gated recurrent unit used by the present invention can also be replaced by a long short-term memory unit, in which case, at position j, the forward hidden state →hj is updated according to the following six formulas:
fj = σ(Wf·[→hj−1, sj] + bf)
ij = σ(Wi·[→hj−1, sj] + bi)
c̃j = tanh(Wc·[→hj−1, sj] + bc)
cj = fj ⊙ cj−1 + ij ⊙ c̃j
oj = σ(Wo·[→hj−1, sj] + bo)
→hj = oj ⊙ tanh(cj)
where →hj−1 denotes the hidden state of the previous time step, sj is the word embedding of the j-th word, σ is the sigmoid function, Wf, Wi, Wc, Wo, bf, bi, bc, bo are the model parameters to be learned, randomly initialized at the start of training and optimized with the gradient descent algorithm during training; · denotes matrix multiplication, ⊙ denotes element-wise multiplication.
5. The sentence alignment method based on a deep neural network according to claim 4, characterized in that: the gated relevance network layer takes the hidden states (hs1, ..., hsm) and (ht1, ..., htn) of the two sentences encoded by the bidirectional recurrent neural network as input; then, the similarity score of each word pair (si, tj) is calculated according to the following formula:
score(si, tj) = U·(g ⊙ (hsi^T·M[1:r]·htj) + (1 − g) ⊙ f(V·[hsi, htj])), g = σ(Wg·[hsi, htj])
where hsi^T·M[1:r]·htj is the bilinear model, used to capture the linear relation between the word pair; f(V·[hsi, htj]) is the single-layer neural network, used to capture the nonlinear relation between the word pair; g is the gate that determines how to integrate the linear and the nonlinear relation between the word pair, i.e. in what proportions the final word-pair similarity score merges the linear and the nonlinear relation; U, M, V, Wg are parameters, randomly initialized at the start of training and optimized with the gradient descent algorithm during training; after the above calculation, a matrix of size m*n is obtained, where m and n are the lengths of the source sentence and the target-side sentence, respectively; the relation between two text fragments is usually determined by some strong semantic interactions, so after obtaining the similarity score matrix, the present invention partitions the matrix with a max-pooling strategy: through a max pooling of size p1 × p2, i.e. taking the maximum of each p1 × p2 block of the similarity score matrix, a matrix of size (m/p1) × (n/p2) is obtained, which is then reshaped into a one-dimensional vector vg as the output of the gated relevance network.
6. The sentence alignment method based on a deep neural network according to claim 5, characterized in that: the convolutional neural network layer encodes the source sentence s and the target-side sentence t with two identical convolutional neural network layers; the input sentence s = (s1, ..., sm), after the word embedding layer, yields a matrix of size [m, embed_size], which the present invention reshapes into a matrix E of size [m, embed_size, 1], and a filter of shape [filter_depth, filter_rows, 1] is defined; then the filter slides over the matrix E according to the sliding stride; at each slide, the part of E covered by the filter is multiplied element-wise with the corresponding positions of the filter and the products are summed as the value of the corresponding position in the output; finally, we obtain an output of smaller shape [m-filter_depth+1, embed_size-filter_rows+1, 1]; the present invention implements this operation with the conv3d function provided in the Theano framework for the Python programming language; then the 3-dimensional max-pooling strategy provided in the Theano framework is applied to the output to obtain its most informative parts: through a max pooling of shape [p1, p2, 1], i.e. taking the maximum of each [p1, p2, 1] block of the above matrix, a most-informative matrix of size [(m-filter_depth+1)/p1, (embed_size-filter_rows+1)/p2, 1] is obtained; during one convolution pass the values of the filter remain unchanged, so the convolution can enhance some important features of the sentence and reduce noise; since the features learned by one convolutional neural network layer are often local, the invention uses two convolutional neural network layers to learn more global features, the input of the second convolutional neural network layer being the output of the first; after the source sentence s and the target-side sentence t are encoded, the resulting matrices of the two sentences are concatenated along the last dimension as the output vc of the convolutional neural network layer.
7. The sentence alignment method based on a deep neural network according to claim 6, characterized in that: as described above, the output vg of the gated relevance network and the output vc of the convolutional neural network are obtained; next, the two vectors are first combined into one vector representation v:
v = Wg·vg + Wc·vc
then v is fed into the perceptron layer; in the present invention the perceptron comprises multiple hidden layers and one output layer, i.e. v is input to multiple fully connected hidden layers to obtain a more abstract representation and is finally connected to the output layer:
cn = tanh(Wn·cn-1 + bn)
where Wg, Wc, Wn and bn are parameters, randomly initialized at the start of training and optimized with the gradient descent algorithm during training; cn-1 denotes the hidden-layer computation of the perceptron; finally, the output cn of the perceptron layer is passed through the sigmoid function to compute the probability p that the sentence pair is classified as 0 or 1; if the probability is greater than a given threshold ρ, the sentences are judged aligned (1), otherwise not aligned (0).
8. The sentence alignment method based on a deep neural network as claimed in claim 7, characterized in that: according to the above description, the languages used in the present invention are Chinese and English, where Chinese documents are segmented into words and English documents are converted entirely to lowercase. In the corpus preprocessing stage, the words in the training corpus are first sorted by word frequency from largest to smallest, and the top 30,000 words are selected to generate the vocabulary, represented as "word number" pairs, where the numbers start from 0, so that each word can be represented by its number. Then, for each word in the vocabulary, the method looks up the corresponding word embedding in a bilingual word embedding file to generate the word embedding vocabulary. Compared with the traditional one-hot vector representation, word embeddings allow words with similar meanings to have similar representations. Therefore, in the word embedding layer, all words of the two input sentences are mapped to fixed-size low-dimensional vectors, each vector being exactly the word embedding corresponding to that word in the word embedding vocabulary. All low-frequency words outside the vocabulary are mapped to a single special word embedding. The present invention uses bilingual word embeddings to initialize the word embedding layer: through the corpus preprocessing step, the vocabulary and the word embedding vocabulary are obtained; for each input sentence, the vocabulary is first queried to generate the corresponding number sequence, and then the word embedding vocabulary is queried according to the number sequence to generate the corresponding word embedding sequence, thereby completing the initialization of the word embedding layer.
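The preprocessing pipeline above (frequency-sorted vocabulary, numbering from 0, out-of-vocabulary fallback, embedding lookup) can be sketched as follows; the toy corpus, the embedding dimension, and the random stand-in for the bilingual embedding file are assumptions:

```python
import numpy as np
from collections import Counter

# Toy training corpus; in the method this is segmented Chinese and lowercased English.
corpus = ["the cat sat", "the dog sat", "a dog ran"]
VOCAB_SIZE = 30000                 # the method keeps the 30,000 most frequent words

# Sort words by frequency from largest to smallest and number them from 0.
counts = Counter(w for line in corpus for w in line.split())
vocab = {w: i for i, (w, _) in enumerate(counts.most_common(VOCAB_SIZE))}
UNK = len(vocab)                   # single shared embedding for all out-of-vocabulary words

# Stand-in for the bilingual word-embedding file (random vectors here).
rng = np.random.default_rng(2)
EMB_DIM = 4                        # embedding size is an assumption
emb_table = rng.normal(size=(len(vocab) + 1, EMB_DIM))

def embed(sentence):
    ids = [vocab.get(w, UNK) for w in sentence.split()]   # words -> number sequence
    return emb_table[ids]                                 # numbers -> embedding sequence

seq = embed("the cat ran fast")    # "fast" is unseen, so it maps to UNK
print(seq.shape)                   # (4, 4)
```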
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810835723.1A CN109062910A (en) | 2018-07-26 | 2018-07-26 | Sentence alignment method based on deep neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109062910A true CN109062910A (en) | 2018-12-21 |
Family
ID=64836638
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810835723.1A Pending CN109062910A (en) | 2018-07-26 | 2018-07-26 | Sentence alignment method based on deep neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109062910A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103617227A (en) * | 2013-11-25 | 2014-03-05 | 福建工程学院 | Fuzzy-neural-network-based sentence matching degree calculation method and sentence alignment method |
CN106126507A (en) * | 2016-06-22 | 2016-11-16 | 哈尔滨工业大学深圳研究生院 | Character-encoding-based deep neural machine translation method and system |
CN106886516A (en) * | 2017-02-27 | 2017-06-23 | 竹间智能科技(上海)有限公司 | Method and device for automatically identifying sentence relations and entities |
CN107391495A (en) * | 2017-06-09 | 2017-11-24 | 北京吾译超群科技有限公司 | A sentence alignment method for bilingual parallel corpora |
CN107423290A (en) * | 2017-04-19 | 2017-12-01 | 厦门大学 | A neural network machine translation model based on hierarchical structure |
CN107992476A (en) * | 2017-11-28 | 2018-05-04 | 苏州大学 | Corpus generation method and system for sentence-level biological context network extraction |
Non-Patent Citations (2)
Title |
---|
DING Ying et al.: "Sentence Alignment Based on Word-Pair Modeling" (online first address: HTTP://KNS.CNKI.NET/KCMS/DETAIL/31.1289.TP.20180607.1454.002.HTML) * |
LI Yang et al.: "Text Sentiment Analysis Based on Feature Fusion of Convolutional Neural Networks and BLSTM Networks" (online publication address: HTTP://KNS.CNKI.NET/KCMS/DETAIL/51.1307.TP.20180719.1604.080.HTML) * |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US12039281B2 (en) * | 2018-07-27 | 2024-07-16 | Beijing Jingdong Shangke Information Technology Co., Ltd. | Method and system for processing sentence, and electronic device |
US20220215177A1 (en) * | 2018-07-27 | 2022-07-07 | Beijing Jingdong Shangke Information Technology Co., Ltd. | Method and system for processing sentence, and electronic device |
CN110196906A (en) * | 2019-01-04 | 2019-09-03 | 华南理工大学 | Deep learning text similarity detection method oriented to the financial industry |
CN110196906B (en) * | 2019-01-04 | 2023-07-04 | 华南理工大学 | Deep learning text similarity detection method oriented to financial industry |
CN110070175A (en) * | 2019-04-12 | 2019-07-30 | 北京市商汤科技开发有限公司 | Image processing method, model training method and device, electronic equipment |
CN110070175B (en) * | 2019-04-12 | 2021-07-02 | 北京市商汤科技开发有限公司 | Image processing method, model training method and device and electronic equipment |
CN111368564A (en) * | 2019-04-17 | 2020-07-03 | 腾讯科技(深圳)有限公司 | Text processing method and device, computer readable storage medium and computer equipment |
CN110391010A (en) * | 2019-06-11 | 2019-10-29 | 山东大学 | Food recommendation method and system based on personal health perception |
CN110391010B (en) * | 2019-06-11 | 2022-05-13 | 山东大学 | Food recommendation method and system based on personal health perception |
CN111783430A (en) * | 2020-08-04 | 2020-10-16 | 腾讯科技(深圳)有限公司 | Sentence pair matching rate determination method and device, computer equipment and storage medium |
CN112487305B (en) * | 2020-12-01 | 2022-06-03 | 重庆邮电大学 | GCN-based dynamic social user alignment method |
CN112487305A (en) * | 2020-12-01 | 2021-03-12 | 重庆邮电大学 | GCN-based dynamic social user alignment method |
CN112613295B (en) * | 2020-12-21 | 2023-12-22 | 竹间智能科技(上海)有限公司 | Corpus recognition method and device, electronic equipment and storage medium |
CN112613295A (en) * | 2020-12-21 | 2021-04-06 | 竹间智能科技(上海)有限公司 | Corpus identification method and device, electronic equipment and storage medium |
CN113408267B (en) * | 2021-06-23 | 2023-09-01 | 沈阳雅译网络技术有限公司 | Word alignment performance improving method based on pre-training model |
CN113408267A (en) * | 2021-06-23 | 2021-09-17 | 沈阳雅译网络技术有限公司 | Word alignment performance improving method based on pre-training model |
CN113656066B (en) * | 2021-08-16 | 2022-08-05 | 南京航空航天大学 | Clone code detection method based on feature alignment |
CN113656066A (en) * | 2021-08-16 | 2021-11-16 | 南京航空航天大学 | Clone code detection method based on feature alignment |
CN113779978A (en) * | 2021-09-26 | 2021-12-10 | 上海一者信息科技有限公司 | Method for realizing unsupervised cross-language sentence alignment |
CN113705158A (en) * | 2021-09-26 | 2021-11-26 | 上海一者信息科技有限公司 | Method for intelligently restoring original text style in document translation |
CN113779978B (en) * | 2021-09-26 | 2024-05-24 | 上海一者信息科技有限公司 | Method for realizing non-supervision cross-language sentence alignment |
CN113705158B (en) * | 2021-09-26 | 2024-05-24 | 上海一者信息科技有限公司 | Method for intelligently restoring original text style in document translation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109062897A (en) | Sentence alignment method based on deep neural network | |
CN109062910A (en) | Sentence alignment method based on deep neural network | |
CN110287320B (en) | Deep learning multi-classification emotion analysis model combining attention mechanism | |
Alonso et al. | Adversarial generation of handwritten text images conditioned on sequences | |
CN108363753B (en) | Comment text emotion classification model training and emotion classification method, device and equipment | |
Ma et al. | Multimodal convolutional neural networks for matching image and sentence | |
CN108664996A | An ancient writing recognition method and system based on deep learning | |
CN108830287A | Chinese image semantic description method based on a residual-connected Inception network fused with multilayer GRUs | |
CN109992686A | Image-text retrieval system and method based on multi-angle self-attention mechanism | |
CN108416065A | Image-sentence description generation system and method based on hierarchical neural network | |
CN110287323B (en) | Target-oriented emotion classification method | |
CN110069778A | Chinese commodity sentiment analysis method incorporating embedded word position awareness | |
CN110288029A | Image description method based on Tri-LSTMs model | |
CN112287695A | Parallel sentence pair extraction method based on cross-language bilingual pre-training and Bi-LSTM | |
CN109711465A (en) | Image method for generating captions based on MLL and ASCA-FR | |
CN112686345A (en) | Off-line English handwriting recognition method based on attention mechanism | |
CN110162639A | Knowledge graph semantic understanding method, apparatus, device and storage medium | |
CN108920586A | A short text classification method based on deep neural mapping support vector machines | |
Daskalakis et al. | Learning deep spatiotemporal features for video captioning | |
Inunganbi et al. | Recognition of handwritten Meitei Mayek script based on texture feature | |
Fan et al. | Long-term recurrent merge network model for image captioning | |
Vijayaraju | Image retrieval using image captioning | |
CN111813927A (en) | Sentence similarity calculation method based on topic model and LSTM | |
Jang et al. | Paraphrase thought: Sentence embedding module imitating human language recognition | |
CN114881038A (en) | Chinese entity and relation extraction method and device based on span and attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20181221 |