CN105868181B - Automatic recognition method for natural-language parallel constructions based on a novel neural network - Google Patents

Automatic recognition method for natural-language parallel constructions based on a novel neural network

Info

Publication number
CN105868181B
CN105868181B
Authority
CN
China
Prior art keywords
parallel construction
neural network
phrase
component
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610250258.6A
Other languages
Chinese (zh)
Other versions
CN105868181A (en)
Inventor
黄书剑
周逸初
戴新宇
陈家骏
张建兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN201610250258.6A
Publication of CN105868181A
Application granted
Publication of CN105868181B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The present invention proposes an automatic recognition method for natural-language parallel constructions based on a novel neural network. The method first applies syntactic analysis to the sentence to be analyzed to obtain a candidate set of parallel constructions, and then scores the parallel constructions in the candidate set with a novel neural network learner, selecting the best parallel construction as the final output of the system. The method jointly considers the independence of each component phrase of a parallel construction and the similarity between the phrases, improving the accuracy of parallel construction recognition. Compared with existing techniques, its distinguishing feature is that it can automatically recognize arbitrary parallel constructions, whereas other techniques can only recognize specific types of parallel construction, such as parallel constructions composed solely of nouns. The method therefore provides a more effective approach to parallel construction recognition and improves recognition quality in practical applications.

Description

Automatic recognition method for natural-language parallel constructions based on a novel neural network
Technical field
The present invention relates to methods for automatically recognizing parallel constructions by computer, and in particular to an automatic recognition method for natural-language parallel constructions based on a novel neural network.
Background technology
Syntactic analysis (parsing) technology has developed rapidly since the 1990s, has made significant progress, and has become a research hotspot in the field of natural language processing.
Although syntactic analysis technology has made considerable progress, the practicality and usability of current parsers are still not high, and results on complicated sentences remain unsatisfactory, especially for sentences containing complex structures such as parallel constructions, where parsing quality still needs to be improved. According to statistics, about 10% of parsing errors come from parallel constructions. As further improvement of overall parsing ability becomes increasingly difficult, how to improve parsing quality by focusing on special constructions, such as parallel constructions, has therefore become an important problem.
Among syntactic analysis techniques, a very efficient approach is transition-based parsing, which works as follows: the sentence to be analyzed is input, and the system reads in its words one by one from left to right; each time a word is read in, reduction operations may be applied to the word sequence read so far, and both when to reduce and which reduction operation to perform are decided by a trained scoring model. As the words of the sentence are read in one after another, the syntax tree grows larger, and once the entire sentence has been read in, the syntax tree is complete. It follows that when to reduce and which reduction operation to apply are the key factors affecting parsing ability. Automatically recognizing parallel constructions in advance and feeding this information into the parsing system helps the system make these two decisions correctly, which greatly improves parsing ability on the whole sentence. The present invention focuses on the automatic recognition of parallel constructions and will improve the quality of syntactic analysis in practical use.
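For illustration, the following minimal sketch of a transition-based (shift-reduce) parsing loop, written in Python, shows how a trained scoring model decides, word by word, whether to shift the next word or to apply a reduction; the action names, the reduce rule and the score_action model are simplified placeholders introduced only for this example and are not the parser of the invention.

def transition_parse(words, score_action):
    """words: list of input tokens; score_action(stack, buffer, action) -> float is a
    trained scoring model that rates each candidate parser action."""
    stack, buffer, arcs = [], list(words), []
    while buffer or len(stack) > 1:
        actions = []
        if buffer:
            actions.append("SHIFT")                      # read in the next word
        if len(stack) >= 2:
            actions += ["REDUCE_LEFT", "REDUCE_RIGHT"]   # combine the top two items
        # the trained model decides when to reduce and which reduction to apply
        best = max(actions, key=lambda a: score_action(stack, buffer, a))
        if best == "SHIFT":
            stack.append(buffer.pop(0))
        elif best == "REDUCE_LEFT":
            dependent = stack.pop(-2)                    # second item attaches to the top item
            arcs.append((stack[-1], dependent))
        else:                                            # "REDUCE_RIGHT"
            dependent = stack.pop()                      # top item attaches to the item below it
            arcs.append((stack[-1], dependent))
    return arcs                                          # (head, dependent) pairs of the tree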
Existing techniques perform automatic recognition only for certain special parallel constructions, for example parallel constructions separated by commas or parallel constructions composed solely of nouns; none of these methods and techniques can automatically recognize every parallel construction that may appear in a natural language. Therefore, in order to further improve parsing ability, a new method that can recognize arbitrary parallel constructions is needed.
Summary of the invention
Object of the invention: the technical problem to be solved by the present invention is that current parallel construction recognition focuses only on special parallel constructions, lacks sufficient generalization ability, and therefore does not improve syntactic analysis; the invention proposes a method that uses a neural network learner to automatically recognize arbitrary parallel constructions in natural language.
To solve the above technical problem, the invention discloses an automatic recognition method for natural-language parallel constructions based on a novel neural network.
The method of the present invention for automatically recognizing parallel constructions in natural language sentences using a neural network structure comprises the following steps:
Step 1: the computer reads a text file containing the natural language sentence to be analyzed, performs a syntactic analysis of the sentence targeting parallel constructions, obtains a candidate set of parallel construction syntax trees, and inputs it into the neural network learner;
Step 2: the neural network learner scores all parallel constructions in the candidate set of parallel construction syntax trees and selects the best parallel construction from them.
Step 1 includes the following steps:
Step 1-1: each word of the natural language sentence is read in order from left to right, and a syntactic analysis restricted to parallel constructions is performed on the input sentence using transition-based parsing technology; the analysis yields a candidate set of parallel construction syntax trees.
Step 1-2: the left component phrase and right component phrase of every parallel construction in the candidate set of parallel construction syntax trees are extracted and preliminarily scored, and the left and right component phrases of all parallel constructions are input into the neural network learner.
The neural network learner consists of two recurrent neural networks and one single-hidden-layer neural network. The two recurrent neural networks share the same parameter settings, their hidden layers are connected directly to the input layer of the single-hidden-layer neural network, and the two recurrent neural networks and the single-hidden-layer neural network each have their own independent output layers and do not interfere with one another.
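The following sketch, in Python with NumPy, illustrates one possible layout of the parameters of such a learner; the dimension names, the random initialization and the two-dimensional output layers are assumptions made only for this illustration and are not prescribed by the invention.

import numpy as np

class CoordinationScorer:
    """Parameter container: two recurrent networks sharing U0, U1, P and V,
    plus a single-hidden-layer network with parameters R, Q0, Q1 and T."""
    def __init__(self, word_dim, pos_dim, hidden_dim, context_dim, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.normal(scale=0.1, size=shape)
        # shared recurrent-network parameters (used for both the left and the right phrase)
        self.U0 = init(hidden_dim, word_dim)     # word-vector input weights
        self.U1 = init(hidden_dim, pos_dim)      # part-of-speech-tag input weights
        self.P  = init(hidden_dim, hidden_dim)   # recurrent (previous-state) weights
        self.V  = init(2, hidden_dim)            # recurrent output layer (assumed two-way)
        # single-hidden-layer network parameters
        self.R  = init(hidden_dim, context_dim)  # projection of the context information c
        self.Q0 = init(2, hidden_dim)            # weights for the left-phrase vector
        self.Q1 = init(2, hidden_dim)            # weights for the right-phrase vector
        self.T  = init(2, hidden_dim)            # weights for the context vector h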
Step 1-2 includes the following steps:
Step 1-2-1: for each parallel construction in the candidate set of parallel construction syntax trees, the left component phrase S_left and the right component phrase S_right of the parallel construction are extracted, with S_left = w_0 w_1 ... w_n1 and S_right = w'_0 w'_1 ... w'_m1, where w_n1 denotes the n1-th word of the left component phrase and w'_m1 denotes the m1-th word of the right component phrase;
Step 1-2-2: the left component phrase S_left and the right component phrase S_right are input into the two recurrent neural networks with identical parameter settings using the following formulas:
y(t) = g(V s(t)),
s(t) = f(U_0 w(t) + U_1 o(t) + P s(t-1)),
where y(t) is the final output of the recurrent neural network; w denotes a word in the sentence and o denotes the part-of-speech tag of the corresponding word; t indicates that the t-th word is currently being processed; w(t) denotes the t-th word and o(t) denotes the part-of-speech tag of the t-th word; s(t) and s(t-1) denote the vector representations of the t-th word and the (t-1)-th word respectively; U_0, U_1, V and P are trained model parameters, typically in matrix form, whose elements may take arbitrary real values and whose concrete values are learned automatically by the system; f(.) and g(.) are, respectively, the activation function and the normalization function of the recurrent neural network; V s(t), U_0 w(t), U_1 o(t) and P s(t-1) are matrix multiplications.
The recurrent neural network is used to score S_left and S_right separately, and its final outputs are taken as the scores of the left and right phrases, denoted Score_left and Score_right respectively.
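A minimal sketch of step 1-2-2, reusing the CoordinationScorer parameters above; the sigmoid activation, the softmax normalization and the use of the first output component as the phrase score are assumptions of this example rather than values fixed by the invention.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))               # assumed activation function f

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()                            # assumed normalization function g

def score_phrase(model, word_vecs, pos_vecs):
    """Run one component phrase through the recurrent network:
    s(t) = f(U0 w(t) + U1 o(t) + P s(t-1)),  y(t) = g(V s(t)).
    Returns the phrase score and the final hidden state s."""
    s = np.zeros(model.P.shape[0])
    y = None
    for w_t, o_t in zip(word_vecs, pos_vecs):     # word and POS-tag vectors, left to right
        s = sigmoid(model.U0 @ w_t + model.U1 @ o_t + model.P @ s)
        y = softmax(model.V @ s)
    return float(y[0]), s                         # final output as Score_left / Score_right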
Step 2 includes the following steps:
Step 2-1: the left component phrase S_left, the right component phrase S_right and their shared context information c are simultaneously input into the single-hidden-layer neural network, and the parallel construction is scored as a whole according to the following formulas:
h = f(R c),
y = g(Q_0 s_0(n2) + Q_1 s_1(m2) + T h),
where h is the vector representation of the context information and y is the final output of the single-hidden-layer neural network; R, Q_0, Q_1 and T are trained model parameters, typically in matrix form, whose elements may take arbitrary real values and whose concrete values are learned automatically by the system; n2 and m2 denote the lengths of the left and right component phrases respectively, and s_0(n2) and s_1(m2) denote the vector representations of the left component phrase S_left and the right component phrase S_right obtained after passing through the recurrent neural networks; the final output of the single-hidden-layer neural network is taken as the score of the current parallel construction, denoted Score; R c, Q_0 s_0(n2), Q_1 s_1(m2) and T h are matrix multiplications.
Step 2-2: combining the scores from step 1-2-2 and step 2-1, the average of Score_left, Score_right and Score is computed, and the parallel construction with the highest average score is selected as the best parallel construction.
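A corresponding sketch of steps 2-1 and 2-2, reusing score_phrase above; context_vec stands for a vector encoding of the shared context information c, and taking the first output component as the overall score is again only an assumption of this example.

def score_coordination(model, left_phrase, right_phrase, context_vec):
    """left_phrase and right_phrase are (word_vecs, pos_vecs) pairs for S_left and S_right."""
    score_left,  s_left  = score_phrase(model, *left_phrase)
    score_right, s_right = score_phrase(model, *right_phrase)
    # single-hidden-layer network: h = f(R c), y = g(Q0 s0(n2) + Q1 s1(m2) + T h)
    h = sigmoid(model.R @ context_vec)
    y = softmax(model.Q0 @ s_left + model.Q1 @ s_right + model.T @ h)
    score_whole = float(y[0])
    # step 2-2: the candidate is ranked by the average of the three scores
    return (score_left + score_right + score_whole) / 3.0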
In the above formulas, f(z) and g(z) are, respectively, a common activation function and a common normalization function for recurrent neural networks, where z is the input of the activation and normalization functions, e denotes the base of the natural logarithm, x denotes the dimensionality of the vector, and k indexes the elements of the vector.
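One common instantiation consistent with this description, and the one assumed in the sketches above, is the sigmoid activation f(z) = 1 / (1 + e^(-z)) together with the softmax normalization g(z)_k = e^(z_k) / Σ_x e^(z_x).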
Advantageous effects: the present invention simultaneously considers the local information and the global information of the phrases and selects the best parallel construction on this basis, improving the ability to recognize parallel constructions.
Description of the drawings
The present invention is further illustrated below with reference to the accompanying drawings and the detailed description, from which the above and/or other advantages of the invention will become more apparent.
Fig. 1 and Fig. 2 show two different syntax trees that may arise during syntax tree analysis in Embodiment 1.
Fig. 3 is the flow chart of the present invention.
Fig. 4 and Fig. 5 show two different syntax trees that may arise during syntax tree analysis in Embodiment 2.
Specific embodiments
The present invention proposes an automatic recognition method for natural-language parallel constructions based on a novel neural network. Parsing technology is first used to find the set of possible candidates, and a neural network learner is then used to find the best parallel construction in the candidate set. Existing systems can only recognize some parallel constructions, such as parallel constructions separated by commas or parallel constructions composed solely of nouns; none of these methods and techniques can automatically recognize every parallel construction that may appear in a natural language.
As shown in Fig. 3, the invention discloses a method for automatically recognizing parallel constructions in natural language based on a novel neural network structure; a system based on the invention considers the local information and the global information of a parallel construction as a whole and identifies the best parallel construction. Fig. 3 depicts the novel neural network structure proposed in the present invention.
The process of recognizing parallel constructions in natural language according to the present invention comprises the following steps:
Step 11: the computer reads a text file containing the natural language sentence to be analyzed and parses the input sentence using transition-based parsing technology; the parsing here is constrained by the corresponding grammar so that only parallel construction parsing is performed, and the analysis yields a candidate set of parallel construction syntax trees.
Step 12: from the candidate set of parallel construction syntax trees, all possible parallel construction candidates are extracted, and these candidate parallel constructions are input into the novel neural network proposed by the present invention.
The recognition process of the novel neural network learner of the present invention is as follows:
Step 21: the system receives the set of candidate parallel constructions and extracts from each one the left component phrase S_left = w_0 w_1 ... w_n and the right component phrase S_right = w'_0 w'_1 ... w'_m of the parallel construction.
Step 22: the left and right component phrases of the parallel construction are simultaneously input into two recurrent neural network structures with identical parameters, as shown by the boxed structure in Fig. 2. Through the two parameter-sharing neural network structures, the system scores S_left and S_right according to the following formulas:
y(t) = g(V s(t))
s(t) = f(U_0 w(t) + U_1 o(t) + P s(t-1))
where y(t) is the final output score of the neural network; w denotes a word in the sentence and o denotes the part-of-speech tag of the corresponding word; t indicates that the t-th word is currently being processed; w(t) and o(t) denote the t-th word and its part-of-speech tag respectively; s(t) and s(t-1) denote the vector representations of the t-th word and the (t-1)-th word respectively; U_0, U_1, V and P are trained model parameters, typically in matrix form, whose elements may take arbitrary real values and whose concrete values are learned automatically by the system; f and g are, respectively, the activation function and the normalization function of the recurrent neural network. The network is used to score S_left and S_right separately, and its final outputs are taken as the scores of the left and right phrases, denoted Score_left and Score_right respectively. V s(t), U_0 w(t), U_1 o(t) and P s(t-1) are matrix multiplications.
Step 23: the left component phrase S_left, the right component phrase S_right and their shared context information c are simultaneously input into a single-hidden-layer neural network, using the following formulas:
h = f(R c)
y = g(Q_0 s_0(n) + Q_1 s_1(m) + T h)
to score the parallel construction as a whole.
Here h is the vector representation of the context information and y is the final output of the model; R, Q_0, Q_1 and T are trained model parameters, typically in matrix form, whose elements may take arbitrary real values and whose concrete values are learned automatically by the system; n and m denote the lengths of the left and right component phrases respectively, and s_0(n) and s_1(m) denote the vector representations of the left component phrase S_left and the right component phrase S_right obtained after passing through the recurrent neural networks; the output of this network is taken as the score of the current parallel construction, denoted Score. R c, Q_0 s_0(n), Q_1 s_1(m) and T h are matrix multiplications.
Step 24: after the left phrase, the right phrase and the overall structure have each been scored, the average of the three scores (Score_left, Score_right, Score) is computed as the final score of the current parallel construction.
Step 25: the operations of steps 21 to 24 are performed for all candidate parallel constructions, and the parallel construction with the highest score is selected from them as the best parallel construction.
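Putting the sketches above together, the candidate selection of steps 21 to 25 can be illustrated as follows, where candidates is assumed to be a list of (left_phrase, right_phrase, context_vec) tuples built from the parser's candidate set.

def select_best_coordination(model, candidates):
    """Return the candidate parallel construction with the highest average score (step 25)."""
    return max(candidates, key=lambda cand: score_coordination(model, *cand))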
Embodiment 1
This embodiment recognizes parallel constructions in natural language using the novel neural network structure; its operational process is as follows:
1. Input the natural language sentence to be analyzed: "The development of Shanghai Pudong is synchronized with legal construction", whose true parallel construction is "development and legal construction".
2. The system performs a syntactic analysis of the input natural language sentence restricted to parallel constructions and obtains the possible parse trees, as shown in Fig. 1 and Fig. 2.
3. For each possible parallel construction syntax tree, the system extracts its parallel construction: for Fig. 1 the extracted parallel construction is "development and legal system"; for Fig. 2 the extracted parallel construction is "development and legal construction".
4. The extracted parallel constructions S1 = "development and legal system" and S2 = "development and legal construction" are input into the novel neural network of the present invention.
5. After the neural network receives the input set of parallel constructions, it extracts the left and right phrases of each parallel construction: for S1, the left phrase is "development" and the right phrase is "legal system"; for S2, the left phrase is "development" and the right phrase is "legal construction".
6. The left and right phrases of S1 are input into the recurrent neural networks and scored by them, yielding the scores Score1_left and Score1_right; the left and right phrases of S2 are likewise input into the recurrent neural networks and scored, yielding Score2_left and Score2_right.
7. S1 and S2 are input into the single-hidden-layer neural network, which scores each parallel construction as a whole: S1 is scored Score1 = 0.7 and S2 is scored Score2 = 0.9.
8. The average of Score1_left, Score1_right and Score1 is computed, as is the average of Score2_left, Score2_right and Score2. S2 obtains the highest average score, so "development and legal construction" is output as the final result of the system.
Embodiment 2
This embodiment recognizes parallel constructions in natural language using the novel neural network structure; its operational process is as follows:
1. Input the natural language sentence to be analyzed: "encountered new situation, new problem", whose true parallel construction is "new situation, new problem".
2. The system performs a syntactic analysis of the input natural language sentence restricted to parallel constructions and obtains the possible parse trees, as shown in Fig. 4 and Fig. 5.
3. For each possible parallel construction syntax tree, the system extracts its parallel construction: for Fig. 4 the extracted parallel construction is "new situation, new problem"; for Fig. 5 the extracted parallel construction is "situation, new problem".
4. The extracted parallel constructions S1 = "new situation, new problem" and S2 = "situation, new problem" are input into the novel neural network of the present invention.
5. After the neural network receives the input set of parallel constructions, it extracts the left and right phrases of each parallel construction: for S1, the left phrase is "new situation" and the right phrase is "new problem"; for S2, the left phrase is "situation" and the right phrase is "new problem".
6. The left and right phrases of S1 are input into the recurrent neural networks and scored by them, yielding the scores Score1_left and Score1_right; the left and right phrases of S2 are likewise input into the recurrent neural networks and scored, yielding Score2_left and Score2_right.
7. S1 and S2 are input into the single-hidden-layer neural network, which scores each parallel construction as a whole: S1 is scored Score1 = 0.95 and S2 is scored Score2 = 0.6.
8. The average of Score1_left, Score1_right and Score1 is computed, as is the average of Score2_left, Score2_right and Score2. S1 obtains the highest average score, so "new situation, new problem" is output as the final result of the system.
The present invention provides an automatic recognition method for natural-language parallel constructions based on a novel neural network; there are many specific methods and ways of implementing this technical solution, and the above is only a preferred embodiment of the invention. The present invention scores both the individual components of a parallel construction and the construction as a whole based on a novel neural network structure, so that the system can automatically recognize parallel constructions of any type. In practice, compared with other approaches, the proposed method is not limited to special parallel constructions, such as parallel constructions separated by commas or parallel constructions composed solely of nouns, and can automatically recognize parallel constructions of arbitrary form. It should be pointed out that those skilled in the art can make several improvements and modifications without departing from the principles of the invention, and these improvements and modifications should also be regarded as falling within the scope of protection of the present invention. Each component not explicitly specified in the present invention can be implemented with available prior art.

Claims (1)

1. An automatic recognition method for natural-language parallel constructions based on a neural network, characterized by comprising the following steps:
Step 1: a computer reads a text file containing the natural language sentence to be analyzed, performs a syntactic analysis of the sentence targeting parallel constructions, obtains a candidate set of parallel construction syntax trees, and inputs it into a neural network learner;
Step 2: the neural network learner scores all parallel constructions in the candidate set of parallel construction syntax trees and selects the best parallel construction from them;
Step 1 includes the following steps:
Step 1-1: each word of the natural language sentence is read in order from left to right, and a syntactic analysis restricted to parallel constructions is performed on the input sentence using transition-based parsing technology; the analysis yields a candidate set of parallel construction syntax trees;
Step 1-2: the left component phrase and right component phrase of every parallel construction in the candidate set of parallel construction syntax trees are extracted and preliminarily scored, and the left and right component phrases of all parallel constructions are input into the neural network learner;
the neural network learner consists of two recurrent neural networks and one single-hidden-layer neural network; the two recurrent neural networks share the same parameter settings, their hidden layers are connected directly to the input layer of the single-hidden-layer neural network, and the two recurrent neural networks and the single-hidden-layer neural network each have their own independent output layers and do not interfere with one another;
Step 1-2 includes the following steps:
Step 1-2-1: for each parallel construction in the candidate set of parallel construction syntax trees, the left component phrase S_left and the right component phrase S_right of the parallel construction are extracted, with S_left = w_0 w_1 ... w_n1 and S_right = w'_0 w'_1 ... w'_m1, where w_n1 denotes the n1-th word of the left component phrase and w'_m1 denotes the m1-th word of the right component phrase;
Step 1-2-2: the left component phrase S_left and the right component phrase S_right are input into the two recurrent neural networks with identical parameter settings using the following formulas:
y(t) = g(V s(t)),
s(t) = f(U_0 w(t) + U_1 o(t) + P s(t-1)),
where y(t) is the final output of the recurrent neural network; w denotes a word in the sentence and o denotes the part-of-speech tag of the corresponding word; t indicates that the t-th word is currently being processed; w(t) denotes the t-th word and o(t) denotes the part-of-speech tag of the t-th word; s(t) and s(t-1) denote the vector representations of the t-th word and the (t-1)-th word respectively; U_0, U_1, V and P are trained model parameters; f(.) and g(.) are, respectively, the activation function and the normalization function of the recurrent neural network; V s(t), U_0 w(t), U_1 o(t) and P s(t-1) are matrix multiplications;
the recurrent neural network is used to score S_left and S_right separately, and its final outputs are taken as the scores of the left and right phrases, denoted Score_left and Score_right respectively;
Step 2 includes the following steps:
Step 2-1: the left component phrase S_left, the right component phrase S_right and their shared context information c are simultaneously input into the single-hidden-layer neural network, and the parallel construction is scored as a whole according to the following formulas:
h = f(R c),
y = g(Q_0 s_0(n2) + Q_1 s_1(m2) + T h),
where h is the vector representation of the context information and y is the final output of the single-hidden-layer neural network; R, Q_0, Q_1 and T are trained model parameters; n2 and m2 denote the lengths of the left and right component phrases respectively, and s_0(n2) and s_1(m2) denote the vector representations of the left component phrase S_left and the right component phrase S_right obtained after passing through the recurrent neural networks; the final output of the single-hidden-layer neural network is taken as the score of the current parallel construction, denoted Score; R c, Q_0 s_0(n2), Q_1 s_1(m2) and T h are matrix multiplications;
Step 2-2: combining the scores from step 1-2-2 and step 2-1, the average is computed and the parallel construction with the highest average score is selected as the best parallel construction;
f(z) and g(z) are, respectively, the activation function and the normalization function of the recurrent neural network, where z is the input of the activation and normalization functions, e denotes the base of the natural logarithm, x denotes the dimensionality of the vector, and k indexes the elements of the vector.
CN201610250258.6A 2016-04-21 2016-04-21 The automatic identifying method of natural language parallel construction based on new neural network Active CN105868181B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610250258.6A CN105868181B (en) 2016-04-21 2016-04-21 The automatic identifying method of natural language parallel construction based on new neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610250258.6A CN105868181B (en) 2016-04-21 2016-04-21 The automatic identifying method of natural language parallel construction based on new neural network

Publications (2)

Publication Number Publication Date
CN105868181A CN105868181A (en) 2016-08-17
CN105868181B (en) 2018-08-21

Family

ID=56632710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610250258.6A Active CN105868181B (en) 2016-04-21 2016-04-21 The automatic identifying method of natural language parallel construction based on new neural network

Country Status (1)

Country Link
CN (1) CN105868181B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6663881B2 (en) * 2017-04-13 2020-03-13 日本電信電話株式会社 Parallel phrase analysis device, parallel phrase analysis model learning device, method, and program
CN110046338B (en) * 2018-01-15 2022-11-11 深圳市腾讯计算机系统有限公司 Context selection method and device, electronic equipment and storage medium
CN114722774B (en) * 2022-04-07 2024-01-30 平安科技(深圳)有限公司 Data compression method, device, electronic equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810999A (en) * 2014-02-27 2014-05-21 清华大学 Linguistic model training method and system based on distributed neural networks
CN104102630A (en) * 2014-07-16 2014-10-15 复旦大学 Method for standardizing Chinese and English hybrid texts in Chinese social networks
CN104463324A (en) * 2014-11-21 2015-03-25 长沙马沙电子科技有限公司 Convolution neural network parallel processing method based on large-scale high-performance cluster
CN104572892A (en) * 2014-12-24 2015-04-29 中国科学院自动化研究所 Text classification method based on cyclic convolution network

Also Published As

Publication number Publication date
CN105868181A (en) 2016-08-17


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant