CN101814065A - Syntactic analysis device and syntactic analysis method - Google Patents

Syntactic analysis device and syntactic analysis method

Info

Publication number
CN101814065A
Authority
CN
China
Prior art keywords
rule
syntactic analysis
production
production rule
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200910118104A
Other languages
Chinese (zh)
Other versions
CN101814065B (en
Inventor
孟遥
于浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN200910118104.1A priority Critical patent/CN101814065B/en
Publication of CN101814065A publication Critical patent/CN101814065A/en
Application granted granted Critical
Publication of CN101814065B publication Critical patent/CN101814065B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a syntactic analysis device and a syntactic analysis method. The syntactic analysis device, which uses regular-expression rules, comprises a training treebank, a rule acquisition module, a rule application module, a syntax tree generation module and a rule set. The rule acquisition module learns syntactic analysis rules from the annotated training treebank by means of a statistical learning method, and generates the rule set used when an input sentence is analyzed. The rule acquisition module uses regular expressions to represent repeated parts in the right-hand side of a production rule. The syntactic analysis rules learned by the rule acquisition module may further contain context information. The rule application module uses the rule set learned by the rule acquisition module to analyze the input sentence, thereby identifying the grammatical constituents of the input sentence and the relations between them. The syntax tree generation module generates a dependency syntactic relation graph or a phrase-structure parse tree according to the analysis result output by the rule application module and the requirements of the user.

Description

Syntactic analysis device and syntactic analysis method
Technical field
The present invention relates to syntactic analysis technology, which identifies the grammatical constituents of a sentence and the relations between those constituents from an input natural-language sentence. More particularly, the present invention relates to a syntactic analysis device and a syntactic analysis method that use regular expressions: syntactic analysis rules in regular-expression form and a syntactic analysis algorithm are used to analyze the grammatical constituents of an input sentence and to output a parse tree.
Background technology
Identifying the grammatical constituents of natural language and the relations between them is both a difficult point and an important task in natural language processing. Many papers and patents have been published on this subject. For example, U.S. Patent No. 5,386,556 A discloses a natural language analyzing apparatus and method (Natural language analyzing apparatus and method), and U.S. Patent No. 5,930,746 A discloses an apparatus and method for automatically parsing and translating natural language sentences (Parsing and translating natural language sentences automatically).
When natural language is processed, syntactic analysis relies on a syntactic analysis rule base, and the quality and capability of that rule base are the most critical factors affecting the analysis result. In existing natural language analysis devices and methods, however, the expressive power of syntactic analysis rules is limited. The grammatical features of natural language therefore cannot be described flexibly and efficiently with syntactic rules, and the syntactic constituents of an input sentence cannot be identified effectively and accurately.
Summary of the invention
In view of the foregoing, the present invention proposes a syntactic analysis device and a syntactic analysis method that use regular expressions to describe syntactic analysis rules, in order to identify the syntactic constituents of input natural language and the relations between them. With the device and method of the present invention, syntactic analysis rules in regular-expression form and a syntactic analysis algorithm can be used to analyze the grammatical constituents of an input sentence and to output a parse tree, which strengthens the ability to describe the rules of natural language.
According to one aspect of the present invention, a syntactic analysis device is provided, comprising: a rule acquisition module configured to learn syntactic analysis rules from an annotated training treebank so as to generate a production rule set that includes rules in regular-expression form, wherein repeated parts in the right-hand side of a production rule are represented with regular expressions; a rule application module configured to analyze an input sentence using the production rule set obtained by the rule acquisition module, and to identify the constituents of the input sentence and the relations between them; and a syntax tree generation module configured to generate a dependency syntactic relation graph or a phrase-structure parse tree of the input sentence according to the grammatical constituents and inter-constituent relations output by the rule application module.
In a syntactic analysis device according to an embodiment of the invention, the rule acquisition module comprises: a tree fragment decomposition portion configured to decompose every syntax tree in the training treebank into tree fragments that serve as production rules, so as to form a tree fragment set; a repeated fragment detection portion configured to detect whether the right-hand side of each production rule in the tree fragment set obtained by the tree fragment decomposition portion contains a repeated node sequence, and to express each production rule that has a repeated node sequence as a production rule in regular-expression form; and a duplicate rule merging portion configured to merge production rules of identical form generated by the repeated fragment detection portion into a single production rule, so as to form the production rule set.
Preferably, the rule acquisition module further comprises a rule selection portion configured to select, according to a selection strategy, among the production rules generated by the duplicate rule merging portion, so as to generate a reduced production rule set.
According to one embodiment of the present invention, a tree fragment obtained by the tree fragment decomposition portion is expressed as <freq:xx> {<f_1> ... <f_n>} P → Y_1 Y_2 ... Y_n, where n is a positive integer greater than or equal to 1. Here <freq:xx> is frequency information and gives the number of occurrences of this tree fragment in the training treebank; <f_i> is an attribute feature used to describe context information, lexical features or semantic features that apply when this production rule is used, with i being any integer from 1 to n; P denotes the upper node of the tree fragment and is a phrase mark; Y_i denotes a child node of the node P and is either a phrase mark or a lexical mark. The expression "{<f_1> ... <f_n>} P → Y_1 Y_2 ... Y_n" states that, when the attributes <f_1> ... <f_n> are present, the phrase P can be composed of Y_1 Y_2 ... Y_n. When the duplicate rule merging portion merges production rules of identical form generated by the repeated fragment detection portion into a single production rule, the frequencies of those production rules are merged accordingly. The attribute features <f_i> in a production rule are optional.
Preferably, the production rules comprise: single-node repetition rules, in which a constituent in the right-hand side of the production rule is repeated two or more times; multi-node repetition rules, in which a fragment in the right-hand side of the production rule is repeated two or more times; mixed repetition rules, whose right-hand side contains both a single-node repeated part and a multi-node repeated part; and no-repetition rules, whose right-hand side contains no repeated part.
In a syntactic analysis device according to an embodiment of the invention, the rule application module comprises: a rule compilation portion configured to compile the production rule set generated by the rule acquisition module into a rule query table of syntactic analysis rules; a rule query portion configured to query the rule query table compiled by the rule compilation portion; and a syntactic analysis portion configured to look up, through the rule query portion, the syntactic analysis rules in the rule query table that can be applied to the input sentence, and to identify the grammatical constituents of the input sentence and the relations between them according to those rules.
Preferably, the rule application module further comprises an ambiguity resolution portion configured to select the optimal partial analysis result from the partial analysis candidates generated by the syntactic analysis portion; the syntactic analysis portion then performs further syntactic analysis on the partial analysis result selected by the ambiguity resolution portion, so as to output a final analysis result that meets the requirements.
According to one embodiment of the present invention, the rule compilation portion adds an intermediate mark and intermediate generation rules for each part of the production rule set that contains a regular expression, and replaces the part of the production rule represented by the regular expression with the intermediate mark. The intermediate marks are incorporated into the phrase mark set of the syntactic analysis rule set, and the intermediate generation rules are incorporated into the syntactic analysis rule set. The syntactic analysis portion recognizes intermediate marks in the same way as it recognizes phrase marks, and looks up through the rule query portion all syntactic analysis rules, including intermediate generation rules, that fit the current input sentence, so as to identify the grammatical constituents of the input sentence and the relations between them.
Preferably, the ambiguity resolution portion determines the optimal production rules to use, and thereby selects the optimal partial analysis result, by calculating the probability P(S, t) of a parse tree according to the following formula:
$P(S, t) = \prod_{r \in D(t)} p(r)$
where S is the input sentence, t is a parse tree, r is a syntactic analysis rule used in the syntactic analysis process, D(t) is the set of all syntactic analysis rules used to generate the parse tree t, and p(r) is the rule probability of the syntactic analysis rule r.
According to a preferred embodiment of the present invention, the rule probability of an intermediate mark is calculated in the same way as the rule probability of a phrase mark that exists in syntactic analysis, and the rule probability of an intermediate generation rule is calculated in the same way as that of a non-intermediate generation rule. For a production rule that contains a regular expression, the rule probability is calculated according to the formula
[Formula image B2009101181041D0000032]
where r_i is an intermediate generation rule used when the rule containing the regular expression is converted, p(r_i) is the rule probability of that intermediate generation rule r_i, n is a positive integer greater than or equal to 1, and i is any integer from 1 to n.
In a syntactic analysis device according to an embodiment of the invention, the syntax tree generation module comprises: an intermediate mark cleaning portion configured to remove the intermediate marks used in the syntactic analysis process; and a phrase structure generation portion configured to generate a phrase-structure parse tree according to the grammatical constituents of the input sentence, and the relations between them, that the intermediate mark cleaning portion outputs after the intermediate marks have been removed.
In a syntactic analysis device according to another embodiment of the invention, the syntax tree generation module may instead comprise: an intermediate mark cleaning portion configured to remove the intermediate marks used in the syntactic analysis process; a head node marking portion configured to mark head nodes according to the grammatical constituents of the input sentence, and the relations between them, that the intermediate mark cleaning portion outputs after the intermediate marks have been removed; and a dependency structure generation portion configured to generate the dependency syntactic relation graph of the input sentence according to the head nodes marked by the head node marking portion.
According to another aspect of the present invention, a syntactic analysis method is provided, comprising: a rule acquisition step of learning syntactic analysis rules from an annotated training treebank so as to generate a production rule set that includes rules in regular-expression form, wherein repeated parts in the right-hand side of a production rule are represented with regular expressions; a rule application step of analyzing an input sentence using the production rule set obtained in the rule acquisition step, and identifying the constituents of the input sentence and the relations between them; and a syntax tree generation step of generating a dependency syntactic relation graph or a phrase-structure parse tree of the input sentence according to the grammatical constituents and inter-constituent relations output by the rule application step.
The syntactic analysis device and syntactic analysis method proposed by the present invention can use syntactic analysis rules in regular-expression form. This increases the descriptive power of syntactic analysis rules and overcomes the shortcomings of existing methods, in which the representation of rules is inflexible and weakly expressive. The syntactic rule acquisition method and syntactic analysis algorithm proposed by the present invention can form a parser that supports regular-expression rules, thereby achieving efficient and correct syntactic analysis.
In addition, the present invention also provides a computer program for implementing the above-mentioned character recognition method.
In addition, the present invention also provides a computer program product, at least in the form of a computer-readable medium, on which computer program code for implementing the above-mentioned character recognition method is recorded.
Description of drawings
The above and other objects, features and advantages of the present invention can be understood more easily with reference to the following description of embodiments of the invention in conjunction with the accompanying drawings. In the drawings, identical or corresponding technical features or components are denoted by identical or corresponding reference numerals. In the drawings:
Fig. 1 is a schematic structural diagram of a syntactic analysis device that uses regular-expression rules according to an embodiment of the invention;
Fig. 2 is a block diagram of the rule acquisition module shown in Fig. 1 according to an embodiment of the invention;
Fig. 3 is a block diagram of the rule application module shown in Fig. 1 according to an embodiment of the invention;
Fig. 4 is a block diagram of the syntax tree generation module shown in Fig. 1 according to an embodiment of the invention;
Fig. 5 is a schematic diagram for explaining upper nodes and lower nodes;
Fig. 6 shows an example syntax tree S1 used when the rule acquisition module acquires syntactic rules;
Fig. 7 shows another example syntax tree S2 used when the rule acquisition module acquires syntactic rules;
Fig. 8 shows the final syntactic analysis result output after the syntactic analysis module analyzes an input sentence according to one embodiment of the present invention;
Fig. 9 shows the result obtained after the intermediate mark cleaning portion of the syntax tree generation module removes the intermediate nodes from the final syntactic analysis result shown in Fig. 8, according to one embodiment of the present invention;
Fig. 10 is a flowchart of a syntactic analysis method according to an embodiment of the invention;
Fig. 11 is a detailed flowchart of the processing performed in the rule acquisition step shown in Fig. 10 according to one embodiment of the invention;
Fig. 12 is a detailed flowchart of the processing performed in the rule application step shown in Fig. 10 according to one embodiment of the invention;
Fig. 13 is a detailed flowchart of the processing performed in the syntax tree generation step shown in Fig. 10 according to one embodiment of the invention; and
Fig. 14 is a structural block diagram of an information processing device for implementing the syntactic analysis method according to the present invention.
Embodiment
Embodiments of the invention are described below with reference to the accompanying drawings. It should be noted that, for the sake of clarity, representations and descriptions of components and processing that are unrelated to the present invention or that are known to those of ordinary skill in the art are omitted from the drawings and the description.
First, the definitions of the single-node repetition rule, multi-node repetition rule, mixed repetition rule and no-repetition rule used in the present invention, and the definitions of upper node and lower node, are given so that the principle of the present invention can be set out more clearly.
Definition 1: single-node repetition rule. A rule in which a constituent in the right-hand side of the production rule is repeated two or more times is defined as a single-node repetition rule.
Definition 2: multi-node repetition rule. A rule in which a fragment in the right-hand side of the production rule is repeated two or more times is defined as a multi-node repetition rule.
Definition 3: mixed repetition rule. A rule whose right-hand side contains both a single-node repeated part and a multi-node repeated part is defined as a mixed repetition rule.
Definition 4: no-repetition rule. A rule whose right-hand side contains no repeated part is defined as a no-repetition rule.
Some examples of each type of production rule are given below to further illustrate the single-node repetition rule, multi-node repetition rule, mixed repetition rule and no-repetition rule defined above.
(1) Single-node repetition rule
For example, in RL1: P → AAABC, the part "AAA" is a single-node repeated part; in the regular-expression rule notation of the present invention, the rule is written P → A*BC.
As another example, RL2: P → ABBCCD contains two single-node repeated parts, "BB" and "CC"; in the regular-expression rule notation of the present invention, the rule is written P → AB*C*D.
(2) Multi-node repetition rule
For example, in RL3: P → ABABC, the part "ABAB" is a multi-node repeated part; in the regular-expression rule notation of the present invention, the rule is written P → [AB]+C.
As another example, in RL4: P → ABCBCDEFDEFG, "BCBC" and "DEFDEF" are multi-node repeated parts; in the regular-expression rule notation of the present invention, the rule is written P → A[BC]+[DEF]+G.
(3) Mixed repetition rule
For example, RL5: P → AAABCDBCDE contains both single-node repetition and multi-node repetition: the part "AAA" is a single-node repeated part and the part "BCDBCD" is a multi-node repeated part. In the regular-expression rule notation of the present invention, the rule is written P → A*[BCD]+E.
(4) No-repetition rule
For example, RL6: P → ABCDEABC contains neither single-node repetition nor multi-node repetition (the two occurrences of "ABC" are not adjacent), and is therefore a no-repetition rule.
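The six examples RL1 to RL6 can be reproduced with a short Python sketch. This is only one possible way to detect adjacent repetitions, not the detection procedure of the patent: the function name `compress_rhs`, the shortest-block-first scanning order and the returned type labels are all assumptions made for illustration.

```python
def compress_rhs(rhs):
    """Collapse adjacent repeats in a rule's right-hand side.

    rhs is a list of symbols, e.g. list("AAABC").
    Returns (tokens, kind), where kind is one of
    "single-node", "multi-node", "mixed" or "none" (hypothetical labels).
    """
    tokens, kinds = [], set()
    i, n = 0, len(rhs)
    while i < n:
        # Try the shortest repeating block first, so "AAAA" becomes A*, not [AA]+.
        for k in range(1, (n - i) // 2 + 1):
            block = rhs[i:i + k]
            reps = 1
            while rhs[i + reps * k:i + (reps + 1) * k] == block:
                reps += 1
            if reps >= 2:
                if k == 1:
                    tokens.append(block[0] + "*")
                    kinds.add("single")
                else:
                    tokens.append("[" + "".join(block) + "]+")
                    kinds.add("multi")
                i += reps * k
                break
        else:
            # No adjacent repetition starts here: keep the symbol unchanged.
            tokens.append(rhs[i])
            i += 1
    if kinds == {"single", "multi"}:
        kind = "mixed"
    elif kinds:
        kind = kinds.pop() + "-node"
    else:
        kind = "none"
    return tokens, kind


for name, rhs in [("RL1", "AAABC"), ("RL2", "ABBCCD"), ("RL3", "ABABC"),
                  ("RL4", "ABCBCDEFDEFG"), ("RL5", "AAABCDBCDE"),
                  ("RL6", "ABCDEABC")]:
    tokens, kind = compress_rhs(list(rhs))
    print(name, "P →", "".join(tokens), "(" + kind + ")")
```

Symbols are joined without separators only because the examples use single-letter marks; real phrase marks would need a delimiter.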
Next, upper nodes and lower nodes are defined.
Definition 5: upper node. A node that has child nodes in a tree structure is defined as an upper node.
Definition 6: lower node. A node that has a parent node in a tree structure is defined as a lower node.
Upper and lower are relative notions: the same node of a tree may act as an upper node or as a lower node, depending on which part of the tree is under consideration. For example, consider the three nodes A2, B3 and C1 in the syntax tree shown in Fig. 5: C1 is a lower node of B3 and A2; A2 is an upper node of B3 and C1; and B3 is an upper node of C1 and a lower node of A2.
Next, the general working principle of the syntactic analysis device according to the embodiment of the invention is described with reference to the drawings, in particular Fig. 1 to Fig. 4. As shown in Fig. 1, the syntactic analysis device that uses regular-expression rules according to the embodiment of the invention comprises a training treebank 101, a rule acquisition module 102, a rule application module 103, a syntax tree generation module 104 and a rule set 105.
The rule acquisition module 102 learns syntactic analysis rules from the annotated training treebank 101, for example by a statistical learning method, and generates the rule set 105 used when an input sentence is analyzed. For repeated parts in the right-hand side of a production rule, the rule acquisition module 102 applies the regular-expression forms defined above. The rule set 105 is therefore a set that includes production rules in regular-expression form. In addition, the syntactic analysis rules learned by the rule acquisition module 102 may also contain context information.
The rule application module 103 uses the syntactic analysis rule set 105 learned by the rule acquisition module 102 to analyze the input sentence, and identifies the constituents of the input sentence and the relations between them.
The syntax tree generation module 104 generates, according to the analysis result output by the rule application module 103 and the requirements of the user, a dependency syntactic relation graph or a phrase-structure parse tree of the input sentence.
The three main modules of the syntactic analysis device of the present invention, namely the rule acquisition module 102, the rule application module 103 and the syntax tree generation module 104, are described in detail below.
Fig. 2 is a block diagram of the rule acquisition module 102 shown in Fig. 1 according to the embodiment of the invention. As shown in Fig. 2, the rule acquisition module 102 according to this embodiment comprises a syntactic analysis treebank 201, a tree fragment decomposition portion 202, a decomposition parameter input unit 203, a tree fragment set 204, a repeated fragment detection portion 205, a no-repetition rule unit 206, a single-node repetition rule unit 207, a multi-node repetition rule unit 208, a mixed repetition rule unit 209, a duplicate rule merging portion 211 and a production rule set 213.
The syntactic analysis treebank 201 is the treebank used for learning, that is, the training treebank 101 shown in Fig. 1, in which the grammatical constituents of the training sentences and the nesting relations between those constituents are annotated. The present invention has been applied in practice to two treebanks, the English Penn Treebank and the Chinese Penn Treebank. It should be noted, however, that the proposed syntactic analysis device and method are independent of language: for any language in which the grammatical constituents of sentences and the nesting relations between them have been annotated, the technical solution of the present invention can be used to obtain syntactic analysis rules and subsequently to perform syntactic analysis on input sentences.
The tree fragment decomposition portion 202 decomposes every syntax tree in the syntactic analysis treebank 201, that is, in the training treebank 101, into a number of smaller subtrees or tree fragments according to the decomposition parameters input through the decomposition parameter input unit 203. The representation format of a tree fragment is as follows.
<freq:xx> {<f_1> ... <f_n>} P → Y_1 Y_2 ... Y_n
Here <freq:xx> is frequency information and gives the number of occurrences of this tree fragment in the training treebank. <f_i> is an attribute feature, mainly used to describe context information, lexical features or semantic features that apply when the rule is used. Attribute features are determined by the decomposition parameters input through the decomposition parameter input unit 203; a rule may or may not include attribute features.
P denotes the upper node of the tree fragment and is a phrase mark. Y_i denotes a child node of the node P and is either a phrase mark or a lexical mark. The expression "{<f_1> ... <f_n>} P → Y_1 Y_2 ... Y_n" states that, when the attributes <f_1> ... <f_n> are present, the phrase P can be composed of Y_1 Y_2 ... Y_n.
As the representation format shows, a tree fragment is a rule in production form. Every syntax tree in the syntactic analysis treebank 201 can be decomposed into a number of tree fragments, and all decomposition results are stored in the tree fragment set 204, forming a rule set in production form.
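As a rough illustration of this decomposition, the Python sketch below splits a syntax tree into context-free fragments and counts how often each occurs. The nested-tuple tree encoding, the function name `decompose` and the sample tree are assumptions for illustration only; the sketch ignores attribute features and does not reproduce the trees of Fig. 6 or Fig. 7.

```python
from collections import Counter


def decompose(tree, counts=None):
    """Split a nested-tuple syntax tree into production-style fragments.

    A node is (label, child, child, ...); a leaf is a plain string.
    Returns a Counter mapping (lhs, rhs_tuple) to its frequency.
    """
    if counts is None:
        counts = Counter()
    if isinstance(tree, str):
        return counts  # a leaf has no children, so it yields no rule
    label, *children = tree
    # The right-hand side lists only the labels of the direct children.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        decompose(child, counts)
    return counts


# Hypothetical tree: S has three children; the X nodes cover runs of "a".
sample = ("S", ("X", "a", "a", "a"), ("Y", "b"), ("X", "a", "a"))
for (lhs, rhs), freq in decompose(sample).items():
    print(f"<freq:{freq}> {lhs} → {' '.join(rhs)}")
```

Passing the same Counter to several calls accumulates frequencies over a whole treebank.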
Then, the tree fragment set (that is, the production rule set) is input to the repeated fragment detection portion 205. The repeated fragment detection portion 205 detects whether the "Y_1 Y_2 ... Y_n" part of each input tree fragment contains a repeated node sequence. According to the form of the node repetition, the tree fragment set is divided among the single-node repetition rule unit 207, the multi-node repetition rule unit 208, the mixed repetition rule unit 209 and the no-repetition rule unit 206, as defined above. Rules that contain a repeated node sequence are represented with the regular-expression symbol "*" or "+".
Once regular expressions are introduced, some rules are converted into identical forms. For example, "rule R_1: P → ABBBC" and "rule R_2: P → ABBC" are both written P → AB*C in regular-expression form.
Therefore, after all rules have been converted into regular-expression form, duplicated rules of identical form need to be removed. To this end, the duplicate rule merging portion 211 merges rules of identical form into a single rule and merges their frequencies accordingly. In this way, the production rule set 213 used for syntactic analysis of input sentences can be generated directly.
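The merging step can be sketched in a few lines of Python. The sketch assumes that "merging the frequencies" means summing them, which the text does not state explicitly; the function name `merge_rules` and the frequency values are hypothetical.

```python
from collections import Counter


def merge_rules(rules):
    """Merge rules of identical form, summing their frequencies (assumed).

    rules is an iterable of (freq, lhs, rhs) with rhs already written
    in regular-expression form.
    """
    merged = Counter()
    for freq, lhs, rhs in rules:
        merged[(lhs, rhs)] += freq
    return merged


# R1: P → ABBBC and R2: P → ABBC both become P → AB*C after conversion.
converted = [(3, "P", "AB*C"), (2, "P", "AB*C"), (4, "X", "a*")]
for (lhs, rhs), freq in merge_rules(converted).items():
    print(f"<freq:{freq}> {lhs} → {rhs}")
```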
In addition, in order to improve the efficiency of syntactic analysis, the rule acquisition module 102 according to an embodiment of the invention may also comprise a selection strategy unit 210 and a rule selection portion 212. The rule selection portion 212 selects among the production rules generated by the duplicate rule merging portion 211 according to the selection strategy provided by the selection strategy unit 210, thereby generating a reduced and efficient production rule set 213.
The operation of the rule acquisition module 102 is explained below by means of a concrete example. Fig. 6 and Fig. 7 show two example syntax trees, S1 and S2, used when the rule acquisition module 102 acquires syntactic rules. The rule acquisition module 102 obtains syntactic analysis rules from the syntax trees S1 and S2 as follows.
First, the tree fragment decomposition portion 202 decomposes the syntax trees S1 and S2 according to the decomposition parameters provided by the decomposition parameter input unit 203. In this example, the decomposition parameters are assumed to specify that a tree is decomposed into context-free phrases. The tree fragment set formed by decomposing the syntax trees S1 and S2 shown in Fig. 6 and Fig. 7 is then as shown in Table 1 below.
Table 1 tree fragment collection
Figure B2009101181041D0000091
Then, the fragment set shown in Table 1 enters the repeated fragment detection portion 205, which detects whether repeated nodes are present. In this example, the fragments
Figure B2009101181041D0000092
contain repeated nodes. After the repeated parts have been expressed in regular-expression form, the rules are sent to the units 206 to 209 according to their repetition type. The result of expressing the fragment set of Table 1 with regular expressions is shown in Table 2 below.
The rule set that table 2 regular expression is represented
Figure B2009101181041D0000101
The fragments expressed with regular expressions are input, as syntactic analysis rule candidates, to the duplicate rule merging portion 211, where duplicated rules are merged. After the rules shown in Table 2 pass through the duplicate rule merging portion 211, the resulting rule set comprises 9 rules in total; the rule "X → a*" is the one for which duplicates were merged. The output of the duplicate rule merging portion 211 is shown in Table 3 below.
Rule set after table 3 duplicate rule merging
Figure B2009101181041D0000102
Afterwards, the output of the duplicate rule merging portion 211 enters the rule selection portion 212, where rules are selected according to the selection strategy. This finally forms the syntactic analysis rule set that the rule application module 103 needs for syntactic analysis, which is input to the rule application module 103 as the production rule set 213.
Fig. 3 is a block diagram of the rule application module 103 shown in Fig. 1 according to the embodiment of the invention. The rule application module 103 uses the syntactic analysis rule set formed by the rule acquisition module 102 shown in Fig. 2, that is, the production rule set, to perform syntactic analysis on an input sentence, and outputs the grammatical constituents of the input sentence and the relations between them. As shown in Fig. 3, the rule application module 103 according to this embodiment comprises a production rule set 302, a rule compilation portion 303, a rule query table 304, a rule query portion 305, a syntactic analysis portion 306, an ambiguity resolution portion 308 and so on.
The production rule set 302 is the production rule set 213 formed by the rule acquisition module 102, and it first enters the rule compilation portion 303. The rule compilation portion 303 compiles the production rule set 302 into a rule query table 304 that can be used by the rule query portion 305.
After the input sentence 301 input syntactic analysis portions 306, in rule query table 304, inquire about the syntactic analysis rules that can be applied to this input sentence 301 by syntactic analysis portion 306 by rule query portion 305, according to the grammer composition of syntactic analysis rules identification input sentence 301, and the output analysis result.The process of syntactic analysis adopts the speech node of CYK algorithm from input sentence 301, and the process of rule query expands to 2 nodes from 1 node, covers whole sentence to end.
Syntactic analysis rules may provide a plurality of partial analysis candidates 307, and optimum partial analysis result 309 therefrom selects in ambiguity resolution portion 308.The partial analysis result 309 that ambiguity resolution portion 308 is selected enters syntactic analysis portion 306 and carries out further syntactic analysis, until the satisfied final analysis result 310 of syntactic analysis portion 306 outputs.
For the production rules in regular expression form generated by the rule acquisition module 102, according to embodiments of the invention, intermediate generation results are added for the parts of the production rules that contain regular expressions. In this way the rule compiling portion 303, the rule query portion 305 and the syntactic analysis portion 306 solve the problem of using syntactic analysis rules that contain regular expressions during syntactic analysis. The specific operation is as follows.
For each part of the production rule set that contains a regular expression, the rule compiling portion 303 adds an intermediate marker and intermediate generation rules, and replaces the part of the rule expressed by the regular expression with the intermediate marker.
Specifically, for a part of the form x*, an intermediate marker <x> and the intermediate generation rules <x> → x x and <x> → <x> x are added. For a part of the form [x..y]+, an intermediate marker [x..y] and the intermediate generation rules [x..y] → x..y x..y and [x..y] → [x..y] x..y are added.
For example, in the rule "R2: X → a*", the part "a*" contains a regular expression. An intermediate marker "<a>" and two intermediate generation rules "<a> → aa" and "<a> → <a>a" are therefore added for "a*". Replacing the regular expression part of the rule with the intermediate marker converts rule R2 into X → <a>.
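The conversion just described can be sketched in Python. This is only an illustrative sketch, not the implementation of the rule compiling portion 303: the function name `compile_rule`, the representation of a right-hand side as a list of strings, and the convention that a multi-node repeat is written with space-separated symbols (for example "[a b]+") are all assumptions made for the example.

```python
def compile_rule(lhs, rhs):
    """Replace regular-expression parts of a rule by intermediate markers.

    rhs is a list of symbols. "x*" marks a single-node repeat and
    "[x y]+" a multi-node repeat (hypothetical notation for this sketch).
    Returns the rewritten rule and the intermediate generation rules.
    """
    compiled_rhs = []
    intermediate = []
    for sym in rhs:
        if len(sym) > 1 and sym.endswith("*"):
            unit = [sym[:-1]]                 # repeated single node, e.g. "a"
            marker = "<" + sym[:-1] + ">"     # intermediate marker <a>
        elif sym.startswith("[") and sym.endswith("]+"):
            unit = sym[1:-2].split()          # repeated node sequence
            marker = sym[:-1]                 # intermediate marker [a b]
        else:
            compiled_rhs.append(sym)          # ordinary symbol, unchanged
            continue
        compiled_rhs.append(marker)
        # marker -> unit unit   (two repetitions)
        intermediate.append((marker, tuple(unit + unit)))
        # marker -> marker unit (one more repetition)
        intermediate.append((marker, tuple([marker] + unit)))
    return (lhs, tuple(compiled_rhs)), intermediate


rule, extra = compile_rule("X", ["a*"])
print(rule)   # rewritten rule X -> <a>
print(extra)  # the two intermediate generation rules for <a>
```

Because the second intermediate rule is left-recursive, any number of repetitions from two upward is covered by just two rules per marker.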
After intermediate markers and intermediate generation rules are added, the output of the repeated rule merging portion 211 shown in Table 3 becomes as shown in Table 4 below. The intermediate markers and intermediate generation rules used are shown in Table 5; 6 intermediate generation rules are introduced, denoted R13 to R18.
Table 4. Output of the repeated rule merging portion 211 after intermediate markers are added
Figure B2009101181041D0000121
Table 5. Intermediate markers and intermediate generation rules used when converting the output of the repeated rule merging portion 211
Intermediate marker    Intermediate generation rule
<a>                    R13: <a> → aa
                       R14: <a> → <a>a
[ab]                   R15: [ab] → abab
                       R16: [ab] → [ab]ab
<X>                    R17: <X> → XX
                       R18: <X> → <X>X
During rule compilation, the intermediate markers are incorporated into the phrase marker set of the rule set, and the intermediate generation rules are incorporated into the syntactic analysis rule set. The rule compiling portion 303 organizes all rules and markers in a unified manner and generates the rule query table 304, which is convenient to query.
During analysis, the syntactic analysis portion 306 identifies intermediate markers in the same way as it identifies phrases. Using the CYK algorithm, it queries through the rule query portion 305 all syntactic analysis rules (including the intermediate generation rules) that fit the current input sentence 301, and generates the syntactic analysis result.
The operation of the rule application module 103 is illustrated below by analyzing the input sentence "aaababababcaa" with the rules in Table 4 and Table 5.
Table 6. Example of the rules used when analyzing a sentence
Process      Current analysis state    Rules used
Process 1    aaababababcaa             R13: <a> → aa; R15: [ab] → abab; R13: <a> → aa
Process 2    <a>[ab]abc<a>             R2: X → <a>; R4: Z → [ab]
Process 3    XZabcX                    R5: M → abcX
Process 4    XZM                       R7: S → XZM
Process 5    S
It should be noted that the above example merely shows one way of applying the syntactic analysis rules; it is neither optimal nor unique. The purpose of the example is to explain how syntactic analysis rules containing regular expressions can be used during syntactic analysis by introducing intermediate markers and intermediate generation rules.
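The bottom-up use of intermediate generation rules can be sketched with a small span-based recognizer in Python. This is an illustrative sketch under several assumptions: it contains only the rules that appear in Tables 5 and 6 (the full content of Table 4 is not reproduced in this text), it treats every character of the sentence as one word node, and it uses memoized recursion over spans instead of an explicit CYK table. The names `RULES` and `recognizes` are hypothetical.

```python
from functools import lru_cache

# Partial grammar assembled from Tables 5 and 6; each right-hand side is a
# tuple of symbols. Symbols without an entry are treated as words.
RULES = {
    "<a>":  [("a", "a"), ("<a>", "a")],                  # R13, R14
    "[ab]": [("a", "b", "a", "b"), ("[ab]", "a", "b")],  # R15, R16
    "X":    [("<a>",)],                                  # R2
    "Z":    [("[ab]",)],                                 # R4
    "M":    [("a", "b", "c", "X")],                      # R5
    "S":    [("X", "Z", "M")],                           # R7
}


def recognizes(sentence, start="S"):
    """Return True if `start` can cover the whole sentence."""
    tokens = tuple(sentence)
    n = len(tokens)

    @lru_cache(maxsize=None)
    def derives(symbol, i, j):
        # True if `symbol` can cover tokens[i:j].
        if symbol not in RULES:  # a word node
            return j == i + 1 and tokens[i] == symbol
        return any(covers(rhs, i, j) for rhs in RULES[symbol])

    @lru_cache(maxsize=None)
    def covers(rhs, i, j):
        # True if the symbol sequence `rhs` can cover tokens[i:j].
        if len(rhs) == 1:
            return derives(rhs[0], i, j)
        # Every symbol covers at least one token, so the first symbol must
        # leave at least len(rhs) - 1 tokens for the remaining symbols.
        return any(derives(rhs[0], i, k) and covers(rhs[1:], k, j)
                   for k in range(i + 1, j - len(rhs) + 2))

    return n > 0 and derives(start, 0, n)


print(recognizes("aaababababcaa"))
```

Intermediate markers such as <a> and [ab] are handled exactly like ordinary phrase markers here, which is the point the text makes about the syntactic analysis portion 306. The recursion terminates because this grammar has no empty rules and no cycles of single-symbol rules.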
During rule application, several applicable analysis rules may be available in the same state, i.e. analysis ambiguity arises. The multiple rule candidates 307 obtained by the rule query portion 305 are input to the ambiguity resolution portion 308, which selects the optimal rule to apply and thereby generates the partial analysis result 309.
According to one embodiment of the invention, the ambiguity resolution portion 308 selects the optimal rule to apply by calculating the probability P(S, t) of a parse tree, and then outputs the optimal syntactic analysis result 309. The basic formula is as follows:
$$P(S, t) = \prod_{r \in D(t)} p(r)$$
where S is the input sentence, t is a parse tree, r is a syntactic analysis rule used in the syntactic analysis process, D(t) is the set of all syntactic analysis rules used to generate the parse tree t, and p(r) is the rule probability of the syntactic analysis rule r.
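The product formula can be illustrated with a few lines of Python. This is a sketch only: the function names `tree_probability` and `best_candidate` are hypothetical, a parse is represented simply as the list of rule identifiers it used, and the probability values are invented for the example rather than taken from the text.

```python
import math


def tree_probability(rules_used, rule_prob):
    """P(S, t): product of p(r) over all rules r used to build the tree t."""
    return math.prod(rule_prob[r] for r in rules_used)


def best_candidate(candidates, rule_prob):
    """Pick the candidate analysis whose rule sequence is most probable."""
    return max(candidates, key=lambda rules: tree_probability(rules, rule_prob))


# Invented probabilities, for illustration only.
rule_prob = {"R2": 0.4, "R13": 0.5, "R14": 0.25}
candidates = [["R13", "R2"], ["R13", "R14", "R2"]]
print(best_candidate(candidates, rule_prob))
```

For long sentences a practical implementation would usually sum log probabilities instead of multiplying, to avoid numerical underflow; that is a general design remark, not something the text prescribes.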
When rule probabilities are calculated, intermediate markers are treated the same as the phrase markers that exist in syntactic analysis, and intermediate generation rules are treated the same as non-intermediate rules. Any existing method for calculating the probability of syntactic analysis rules can be used for the syntactic analysis rules obtained in the processing according to the invention. The rule probability estimation is briefly summarized as follows:
1. Rules without regular expressions: the rule probability is estimated as in existing methods.
2. Intermediate generation rules: intermediate markers are treated as phrases, and the rule probability is estimated with existing methods.
3. Rules containing regular expressions:
Figure B2009101181041D0000141
where r_i is an intermediate generation rule used when converting the rule containing the regular expression, and p(r_i) is the rule probability of that intermediate generation rule r_i.
In the rule application module 103 according to this embodiment of the invention, introducing intermediate markers and intermediate generation rules solves the problem of how to apply regular expression rules. Treating intermediate markers and intermediate generation rules on an equal footing with phrase markers and ordinary rules solves the problem of estimating the probability of regular expression rules.
Using the rules of Table 6, the final syntactic analysis result obtained for the input sentence "aaababababcaa" is shown in Fig. 8.
The syntactic analysis result output by the rule application module 103, for example the result shown in Fig. 8, is input to the syntax tree generation module 104, which generates a dependency syntactic relation graph or a phrase structure parse tree according to the needs of the user.
Fig. 4 is a block diagram of the syntax tree generation module 104 shown in Fig. 1 according to an embodiment of the invention. As shown in Fig. 4, the syntax tree generation module 104 according to this embodiment comprises an intermediate marker cleaning portion 402, a head node labeling portion 403, a dependency structure generation portion 404 and a phrase structure generation portion 405.
The final analysis result 401 shown in Fig. 4 is the syntactic analysis result 310 output by the rule application module 103. The final analysis result 401 is first input to the intermediate marker cleaning portion 402, which removes the intermediate markers used during syntactic analysis.
After the intermediate markers are removed, if a phrase structure parse tree is required, the phrase structure generation portion 405 generates the phrase structure tree. If a dependency syntactic relation graph is required, the analysis result with the intermediate markers removed is input to the head node labeling portion 403 for head node labeling, after which the dependency structure generation portion 404 generates the dependency relations.
According to one embodiment of the invention, the steps by which the intermediate marker cleaning portion 402 removes intermediate nodes are described below as a recursive function.
Function CleanTags(n_i)
Begin
  For each s_i, where s_i ∈ {sons of n_i}
    CleanTags(s_i)
    // if s_i is an intermediate marker
    If s_i is a semi-finished label
      // all sons of s_i are moved up to become sons of n_i
      move up all the sons of s_i as the sons of n_i
    Endif
  End for
End
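The recursive function above can be written as runnable Python. This is a sketch under assumptions: the `Node` class and the test `is_intermediate`, which recognizes intermediate markers by the bracket shapes <x> and [x..y] used in Table 5, are hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def is_intermediate(label):
    # Assumption: intermediate markers look like <x> or [x..y] (see Table 5).
    return ((label.startswith("<") and label.endswith(">")) or
            (label.startswith("[") and label.endswith("]")))


def clean_tags(node):
    """Remove intermediate nodes below `node`, lifting their children."""
    new_children = []
    for child in node.children:
        clean_tags(child)                        # clean the subtree first
        if is_intermediate(child.label):
            # all sons of the intermediate node become sons of `node`
            new_children.extend(child.children)
        else:
            new_children.append(child)
    node.children = new_children
    return node


# X -> <a>, <a> -> <a> a, <a> -> a a   covers the words "a a a"
tree = Node("X", [Node("<a>", [Node("<a>", [Node("a"), Node("a")]), Node("a")])])
clean_tags(tree)
print([c.label for c in tree.children])
```

Building a new child list, rather than editing the list while iterating over it, keeps the order of the lifted children and avoids skipping siblings.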
Fig. 9 shows the result after the intermediate marker cleaning portion 402 has removed the intermediate nodes from the final analysis result 310 of the rule application module 103 shown in Fig. 8.
The structure and working principle of the syntactic analysis device according to embodiments of the invention have been described above. The syntactic analysis method used by this device according to embodiments of the invention is described below with reference to Figs. 10 to 13.
Fig. 10 is a flowchart of a syntactic analysis method according to an embodiment of the invention. As shown in Fig. 10, the syntactic analysis method according to this embodiment comprises a rule acquisition step S1001, a rule application step S1003 and a syntax tree generation step S1005.
First, in the rule acquisition step S1001, syntactic analysis rules are learned from an annotated training treebank, for example the training treebank 101 shown in Fig. 1, to generate a production rule set in regular expression form, in which repeated parts of the right-hand side of a production rule are expressed with regular expressions. Then, in the rule application step S1003, the input sentence is analyzed with the production rule set obtained in the rule acquisition step S1001, and the constituents of the input sentence and the relations between constituents are identified. Finally, in the syntax tree generation step S1005, a dependency syntactic relation graph or a phrase structure parse tree of the input sentence is generated from the constituents and the relations between constituents output by the rule application step S1003.
Fig. 11 is a detailed flowchart of the processing performed in the rule acquisition step S1001 shown in Fig. 10 according to an embodiment of the invention. As shown in Fig. 11, the rule acquisition method according to this embodiment comprises a tree fragment decomposition step S1101, a repeated fragment detection step S1103, a repeated rule merging step S1105 and a rule selection step S1107.
First, in the tree fragment decomposition step S1101, every syntax tree in the training treebank, for example the training treebank 101 shown in Fig. 1, is decomposed into tree fragments that serve as production rules, forming a tree fragment set such as the tree fragment set 204 shown in Fig. 2.
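A minimal Python sketch of this decomposition is given below. It rests on assumptions that are not in the text: a syntax tree is represented as a nested tuple (label, child, child, ...) with words as plain strings, attribute features are left out, and the names `tree_fragments` and `fragment_set` are hypothetical.

```python
from collections import Counter


def tree_fragments(tree):
    """Yield one (parent, child labels) fragment per internal node."""
    label, *children = tree
    yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
    for child in children:
        if not isinstance(child, str):       # recurse into sub-phrases
            yield from tree_fragments(child)


def fragment_set(treebank):
    """Count how often each fragment occurs in the whole treebank."""
    return Counter(f for tree in treebank for f in tree_fragments(tree))


treebank = [
    ("S", ("X", "a", "a"), ("Z", "a", "b")),
    ("S", ("X", "a", "a")),
]
for (parent, rhs), freq in fragment_set(treebank).items():
    print(freq, parent, "->", " ".join(rhs))
```

The counts play the role of the frequency information attached to each tree fragment.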
Then, in the repeated fragment detection step S1103, it is detected whether the right-hand side of each production rule in the tree fragment set obtained in the tree fragment decomposition step S1101 contains a repeated node sequence, and production rules with a repeated node sequence are expressed as production rules in regular expression form. Next, in the repeated rule merging step S1105, production rules of identical form generated in the repeated fragment detection step S1103 are merged into a single production rule, forming the production rule set.
Preferably, in order to improve the efficiency of syntactic analysis, the rule acquisition method according to an embodiment of the invention further comprises a rule selection step S1107, in which the production rules generated in the repeated rule merging step S1105 are selected according to a preset selection strategy to generate a reduced production rule set, such as the production rule set 213 shown in Fig. 2.
A tree fragment obtained in the tree fragment decomposition step can be expressed as
<freq:xx> {<f1> ... <fn>} P → Y1 Y2 ... Yn, where n is a positive integer greater than or equal to 1. Here <freq:xx> is frequency information representing the number of occurrences of the tree fragment in the training treebank; <fi> is an attribute feature used to describe the context information, vocabulary or semantic features that apply when the production rule is used, and i is any integer from 1 to n; P denotes the parent node of the tree fragment and is a phrase marker; Yi denotes a child node of the node P and is a phrase marker or a word marker; "{<f1> ... <fn>} P → Y1 Y2 ... Yn" means that, when the attributes <f1> ... <fn> are present, the phrase P can be composed of Y1 Y2 ... Yn.
It is worth noting that when the repeated rule merging step S1105 merges production rules of identical form generated in the repeated fragment detection step S1103 into a single production rule, the frequencies of those production rules are merged accordingly.
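Repeated fragment detection and frequency merging can be sketched together in Python. The sketch makes several assumptions: a repeat means two or more consecutive occurrences, the shortest repeating unit found at each position is used, a multi-node repeat is written with space-separated symbols such as "[a b]+", and the names `compress_rhs` and `merge_rules` are hypothetical. The text does not prescribe a particular detection algorithm.

```python
from collections import Counter


def compress_rhs(rhs):
    """Rewrite consecutive repeats in a right-hand side in regex form."""
    out, i, n = [], 0, len(rhs)
    while i < n:
        found = None
        for size in range(1, (n - i) // 2 + 1):   # candidate unit lengths
            unit = rhs[i:i + size]
            reps = 1
            while rhs[i + reps * size:i + (reps + 1) * size] == unit:
                reps += 1
            if reps >= 2:                         # repeated at least twice
                found = (unit, reps)
                break
        if found is None:
            out.append(rhs[i])
            i += 1
        else:
            unit, reps = found
            out.append(unit[0] + "*" if len(unit) == 1
                       else "[" + " ".join(unit) + "]+")
            i += len(unit) * reps
    return tuple(out)


def merge_rules(fragments):
    """fragments: iterable of (freq, lhs, rhs). Sum the frequencies of
    rules that become identical once repeats are compressed."""
    merged = Counter()
    for freq, lhs, rhs in fragments:
        merged[(lhs, compress_rhs(tuple(rhs)))] += freq
    return merged


fragments = [(3, "X", ("a", "a")), (2, "X", ("a", "a", "a")), (1, "X", ("a", "b"))]
print(merge_rules(fragments))
```

Here the two fragments X → a a and X → a a a collapse into the single rule X → a*, and their frequencies are added.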
Fig. 12 is a detailed flowchart of the processing performed in the rule application step S1003 shown in Fig. 10 according to an embodiment of the invention.
As shown in Fig. 12, first, in a rule compiling step S1201, the production rule set generated in the rule acquisition step S1001 is compiled into a rule query table of syntactic analysis rules, for example the rule query table 304 shown in Fig. 3. Then, in a syntactic analysis step S1203, syntactic analysis rules applicable to the input sentence are queried in the rule query table 304 through a rule query step S1205, and the grammatical constituents of the input sentence and the relations between constituents are identified according to those rules. Here, the rule query step S1205 queries the rule query table 304 compiled in the rule compiling step S1201.
Next, in an ambiguity resolution step S1207, the optimal partial analysis result is selected from the partial analysis candidates generated by the syntactic analysis step S1203 and the rule query step S1205. Then, in step S1209, it is judged whether the analysis result obtained meets the requirements. If not, the flow returns to the syntactic analysis step S1203, and further syntactic analysis is performed on the partial analysis result selected in the ambiguity resolution step S1207.
If it is judged in step S1209 that the analysis result obtained after ambiguity resolution meets the requirements, the flow advances to step S1211, and the final syntactic analysis result is output.
According to a preferred embodiment of the invention, in the rule compiling step S1201, an intermediate marker and intermediate generation rules are added for each part of the production rule set that contains a regular expression, and the part of the production rule expressed by the regular expression is replaced with the intermediate marker. The intermediate markers are incorporated into the phrase marker set of the syntactic analysis rule set, and the intermediate generation rules are incorporated into the syntactic analysis rule set.
In the syntactic analysis step S1203, intermediate markers are identified in the same way as phrase markers, and all syntactic analysis rules that fit the current input sentence, including the intermediate generation rules, are queried through the rule query step S1205 to identify the grammatical constituents of the input sentence and the relations between constituents.
In the ambiguity resolution step S1207, the optimal production rule to use is determined by calculating the probability P(S, t) of a parse tree according to the following formula, so as to select the optimal partial analysis result:
$$P(S, t) = \prod_{r \in D(t)} p(r)$$
where S is the input sentence, t is a parse tree, r is a syntactic analysis rule used in the syntactic analysis process, D(t) is the set of all syntactic analysis rules used to generate the parse tree t, and p(r) is the rule probability of the syntactic analysis rule r.
The rule probability for an intermediate marker is calculated in the same way as for the phrase markers that exist in syntactic analysis, and the rule probability of an intermediate generation rule is calculated in the same way as that of a non-intermediate rule.
For a production rule that contains a regular expression, the rule probability is calculated from the rule probabilities p(r_i) of its intermediate generation rules, where r_i is an intermediate generation rule used when converting the rule containing the regular expression, p(r_i) is the rule probability of that intermediate generation rule r_i, n is a positive integer greater than or equal to 1, and i is any integer from 1 to n.
Fig. 13 is a detailed flowchart of the processing performed in the syntax tree generation step S1005 shown in Fig. 10 according to an embodiment of the invention.
As shown in Fig. 13, first, in an intermediate marker cleaning step S1301, the intermediate markers used during syntactic analysis are removed. Then, in step S1303, it is judged whether a dependency syntactic relation graph or a phrase structure parse tree is to be generated.
If it is judged in step S1303 that a phrase structure parse tree is required, the flow advances to a phrase structure generation step S1305, in which a phrase structure parse tree is generated from the constituents of the input sentence and the relations between constituents output by the intermediate marker cleaning step S1301 after removal of the intermediate markers, and the generated phrase structure parse tree of the input sentence is output.
If it is judged in step S1303 that a dependency syntactic relation graph of the input sentence is required, the flow advances to a head node labeling step S1307, in which head nodes are labeled according to the constituents of the input sentence and the relations between constituents output by the intermediate marker cleaning step S1301 after removal of the intermediate markers. Then, in a dependency structure generation step S1309, the dependency syntactic relation graph of the input sentence is generated from the head nodes labeled in the head node labeling step S1307, and the generated dependency syntactic relation graph of the input sentence is output.
Specific embodiments of the syntactic analysis device and syntactic analysis method of the invention have been described in detail above with reference to the drawings. As can be seen from the above description, the syntactic analysis device and method according to the invention use syntactic analysis rules in regular expression form, which increases the descriptive power of the rules and overcomes the inflexibility and weak expressive power of the rule representations in existing methods. The syntactic rule acquisition method and syntactic analysis algorithm proposed by the invention can form a parser that supports regular expression rules and achieves efficient and correct syntactic analysis.
In addition, by introducing regular expressions into production rules, the syntactic analysis device and method according to the invention provide a way of describing repeated parts of a grammatical structure. The number of repetitions of a repeated fragment is unrestricted, so the same syntactic analysis rule can analyze a class of phrases of different lengths. With the device and method proposed by the invention, grammar can be described more flexibly: both the positional relations of the constituents and the recursive character of local constituents in a syntactic structure can be described. The rules obtained with the invention are therefore more general and more robust.
Furthermore, in the training treebank used for learning, phrases comprising many constituents (i.e. phrases with many nodes, referred to herein as long phrases) occur with low frequency. When described with existing syntactic analysis rules, such phrases are usually ignored, so that long phrases cannot be analyzed during syntactic analysis and the problem of rule sparseness arises. Long phrases usually contain repeatable parts. With the syntactic analysis device and method of the invention, the repeated parts of a long phrase are merged, so that the phrase can be represented by the same syntactic analysis rule no matter how many times the repeated part occurs, which alleviates the rule sparseness problem to a certain extent.
In addition, when performing syntactic analysis, the syntactic analysis device and method according to the invention add intermediate generation results for the parts of a rule that contain regular expressions, thereby solving the problem of using syntactic analysis rules containing regular expressions during syntactic analysis.
The basic working principle of the invention has been set forth above using syntax trees taken from Chinese as specific examples, but the syntactic analysis device and method described by the invention can equally identify grammatical or semantic constituents in various other languages. The method of the invention can also be used for the analysis of genome sequences or for similar tasks of identifying constituents of a certain class in an input symbol sequence. It should therefore be understood that all applications to other languages or symbol systems, and all variations that do not depart from the gist of the invention, fall within the scope of protection of the invention.
The basic principle of the invention has been described above in connection with specific embodiments. It should be pointed out, however, that those of ordinary skill in the art will understand that all or any of the steps or components of the method and device of the invention can be implemented in hardware, firmware, software or a combination thereof, in any computing device (including a processor, a storage medium and so on) or in a network of computing devices. Those of ordinary skill in the art can achieve this with their basic programming skills after reading the description of the invention.
The object of the invention can therefore also be achieved by running a program or a set of programs on any computing device. The computing device may be a well-known general-purpose device. The object of the invention can thus also be achieved merely by providing a program product containing program code that implements the method or device. In other words, such a program product also constitutes the invention, and a storage medium storing such a program product also constitutes the invention. Obviously, the storage medium may be any known storage medium or any storage medium developed in the future.
When embodiments of the invention are implemented by software and/or firmware, a program constituting the software is installed from a storage medium or a network onto a computer with a dedicated hardware structure, for example the general-purpose personal computer 700 shown in Fig. 14. With the various programs installed, the computer can perform various functions and so on.
In Fig. 14, a central processing unit (CPU) 701 performs various processing according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage section 708 into a random access memory (RAM) 703. The RAM 703 also stores, as needed, data required when the CPU 701 performs the various processing. The CPU 701, the ROM 702 and the RAM 703 are connected to one another via a bus 704. An input/output interface 705 is also connected to the bus 704.
The following components are connected to the input/output interface 705: an input section 706, including a keyboard, a mouse and so on; an output section 707, including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a loudspeaker and so on; the storage section 708, including a hard disk and so on; and a communication section 709, including a network interface card such as a LAN card, a modem and so on. The communication section 709 performs communication processing via a network such as the Internet.
A drive 710 is also connected to the input/output interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disc, a magneto-optical disc or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read from it is installed into the storage section 708 as needed.
When the above series of processing is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 711.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 711 shown in Fig. 14, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 711 include a magnetic disk (including a floppy disk (registered trademark)), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disc (including a MiniDisc (MD) (registered trademark)) and a semiconductor memory. Alternatively, the storage medium may be the ROM 702, a hard disk included in the storage section 708, or the like, in which the program is stored and which is distributed to the user together with the device that contains it.
It should also be pointed out that, in the device and method of the invention, each component or each step can obviously be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalents of the invention. Moreover, the steps of the above series of processing can naturally be performed chronologically in the order described, but they need not necessarily be performed chronologically. Some steps can be performed in parallel or independently of one another.
Although the invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the terms "comprise", "include" and any other variants thereof in this application are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device that comprises the element.

Claims (24)

1. A syntactic analysis device, comprising:
a rule acquisition module configured to learn syntactic analysis rules from an annotated training treebank to generate a production rule set in regular expression form, wherein repeated parts of the right-hand side of a production rule are expressed with regular expressions;
a rule application module configured to analyze an input sentence with the production rule set obtained by the rule acquisition module, and to identify the grammatical constituents of the input sentence and the relations between constituents; and
a syntax tree generation module configured to generate a dependency syntactic relation graph or a phrase structure parse tree of the input sentence according to the grammatical constituents of the input sentence and the relations between constituents output by the rule application module.
2. The syntactic analysis device according to claim 1, wherein the rule acquisition module comprises:
a tree fragment decomposition portion configured to decompose every syntax tree in the training treebank into tree fragments serving as production rules, to form a tree fragment set;
a repeated fragment detection portion configured to detect whether the right-hand side of each production rule in the tree fragment set obtained by the tree fragment decomposition portion contains a repeated node sequence, and to express production rules with a repeated node sequence as production rules in regular expression form; and
a repeated rule merging portion configured to merge production rules of identical form generated by the repeated fragment detection portion into a single production rule, to form the production rule set.
3. The syntactic analysis device according to claim 2, wherein the rule acquisition module further comprises a rule selection portion configured to select, according to a selection strategy, among the production rules generated by the repeated rule merging portion, to generate a reduced production rule set.
4. The syntactic analysis device according to claim 3, wherein a tree fragment obtained by the tree fragment decomposition portion is expressed as <freq:xx> {<f1> ... <fn>} P → Y1 Y2 ... Yn, n being a positive integer greater than or equal to 1, wherein
<freq:xx> is frequency information representing the number of occurrences of the tree fragment in the training treebank;
<fi> is an attribute feature used to describe the context information, vocabulary or semantic features that apply when the production rule is used, i being any integer from 1 to n;
P denotes the parent node of the tree fragment and is a phrase marker;
Yi denotes a child node of the node P and is a phrase marker or a word marker;
"{<f1> ... <fn>} P → Y1 Y2 ... Yn" means that, when the attributes <f1> ... <fn> are present, the phrase P can be composed of Y1 Y2 ... Yn; and
wherein, when the repeated rule merging portion merges production rules of identical form generated by the repeated fragment detection portion into a single production rule, the frequencies of the production rules are merged accordingly;
the attribute feature <fi> in a production rule is optional.
5. The syntactic analysis device according to claim 4, wherein the production rules comprise:
single-node repetition rules, being production rules in which a constituent in the right-hand side is repeated two or more times;
multi-node repetition rules, being production rules in which a fragment in the right-hand side is repeated two or more times;
mixed repetition rules, being production rules whose right-hand side contains both a single-node repeated part and a multi-node repeated part; and
non-repetition rules, being production rules whose right-hand side contains no repeated part.
6. The syntactic analysis device according to any one of claims 1 to 5, wherein the rule application module comprises:
a rule compiling portion configured to compile the production rule set generated by the rule acquisition module into a rule query table of syntactic analysis rules;
a rule query portion configured to query the rule query table compiled by the rule compiling portion; and
a syntactic analysis portion configured to look up in the rule query table, through the rule query portion, the syntactic analysis rules applicable to the input sentence, and to identify the grammatical constituents of the input sentence and the relations between the constituents according to those syntactic analysis rules.
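As a rough illustration of claim 6, a rule query table can be thought of as an index from consequents to the phrase markers they may form. The dictionary layout and the names `compile_table` and `query` are assumptions of this sketch; the claims do not specify how the table is organised.

```python
from collections import defaultdict


def compile_table(rules):
    """Index rules by consequent so applicable parents can be looked up quickly."""
    table = defaultdict(list)
    for parent, consequent in rules:
        table[tuple(consequent)].append(parent)
    return table


def query(table, labels):
    """Return every phrase marker that the given label sequence can form."""
    return table.get(tuple(labels), [])


# Hypothetical toy rule set.
rules = [("NP", ["ADJ", "N"]), ("NP", ["N"]), ("VP", ["V", "NP"])]
table = compile_table(rules)
print(query(table, ["ADJ", "N"]))
print(query(table, ["V", "V"]))
```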
7. The syntactic analysis device according to claim 6, wherein the rule application module further comprises an ambiguity resolution portion configured to select the optimal partial analysis result from the partial analysis candidates generated by the syntactic analysis portion; and
the syntactic analysis portion performs further syntactic analysis on the partial analysis result selected by the ambiguity resolution portion, so as to output a final analysis result that meets the requirements.
8. The syntactic analysis device according to claim 7, wherein:
the rule compiling portion adds intermediate generated marks and intermediate generation rules for the parts of the production rule set that contain regular expressions, replaces the part of each production rule expressed by a regular expression with the corresponding intermediate generated mark, incorporates the intermediate generated marks into the phrase marker set of the syntactic analysis rule set, and incorporates the intermediate generation rules into the syntactic analysis rule set; and
the syntactic analysis portion identifies intermediate generated marks in the same way as it identifies phrase markers, and looks up, through the rule query portion, all syntactic analysis rules, including intermediate generation rules, that fit the current input sentence, so as to identify the grammatical constituents of the input sentence and the relations between the constituents.
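The compilation described in claim 8 can be pictured with the sketch below, which replaces each repeated part of a consequent with a fresh intermediate mark and emits two ordinary rules for it. This left-recursive expansion is one standard way to express "one or more" in a context-free grammar and is an assumption here, as are the `@X` naming scheme and the `("+", ...)` encoding of a repeated part.

```python
from itertools import count


def compile_rule(parent, consequent, fresh):
    """Replace each repeated part ("+", (sym, ...)) with an intermediate mark.

    Returns the rewritten rule and the intermediate generation rules.
    """
    new_rhs, intermediate_rules = [], []
    for item in consequent:
        if isinstance(item, tuple) and item[0] == "+":
            mark = f"@X{next(fresh)}"          # hypothetical naming scheme
            unit = list(item[1])
            intermediate_rules.append((mark, unit))           # one occurrence
            intermediate_rules.append((mark, [mark] + unit))  # one more occurrence
            new_rhs.append(mark)
        else:
            new_rhs.append(item)
    return (parent, new_rhs), intermediate_rules


fresh = count(1)
# NP → (ADJ)+ N, written with the assumed encoding of the repeated part.
main, extra = compile_rule("NP", [("+", ("ADJ",)), "N"], fresh)
print(main)
print(extra)
```

After this step the parser only sees plain rules, so the intermediate mark can be matched exactly like any other phrase marker.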
9. The syntactic analysis device according to claim 8, wherein the ambiguity resolution portion determines the optimal production rules to use by calculating the probability P(S, t) of a parse tree according to the following formula, so as to select the optimal partial analysis result:
$$P(S, t) = \prod_{r \in D(t)} p(r)$$
wherein S is the input sentence, t is a parse tree, r is a syntactic analysis rule used in the syntactic analysis process, D(t) is the set of all syntactic analysis rules used to generate the parse tree t, and p(r) is the rule probability of the syntactic analysis rule r.
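The product formula of claim 9 can be evaluated as in the sketch below. The rule strings and probabilities are made-up toy values, and summing logarithms instead of multiplying is an implementation choice of this example (it avoids underflow for long derivations), not something stated in the claims.

```python
import math


def tree_log_prob(rules_used, rule_prob):
    """log P(S, t): sum of log p(r) over every rule r used to build tree t."""
    return sum(math.log(rule_prob[r]) for r in rules_used)


def best_candidate(candidates, rule_prob):
    """Pick the candidate derivation with the highest probability."""
    return max(candidates, key=lambda rules: tree_log_prob(rules, rule_prob))


# Hypothetical rule probabilities and two candidate derivations.
rule_prob = {
    "S->NP VP": 1.0,
    "NP->ADJ N": 0.4,
    "NP->N": 0.6,
    "VP->V NP": 0.5,
}
cand_a = ["S->NP VP", "NP->N", "VP->V NP", "NP->ADJ N"]
cand_b = ["S->NP VP", "NP->N", "VP->V NP", "NP->N"]

print(math.exp(tree_log_prob(cand_a, rule_prob)))
print(best_candidate([cand_a, cand_b], rule_prob))
```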
10. The syntactic analysis device according to claim 9, wherein:
the rule probability of an intermediate generated mark is calculated in the same way as the rule probability of a phrase marker that exists in the syntactic analysis, and the rule probability of an intermediate generation rule is calculated in the same way as the rule probability of a non-intermediate generation rule; and
for a production rule that contains a regular expression, the rule probability is calculated by a formula [formula image missing from the source text], wherein r_i is an intermediate generation rule used when converting the rule that contains the regular expression, p(r_i) is the rule probability of the intermediate generation rule r_i, n is a positive integer greater than or equal to 1, and i is any integer from 1 to n.
11. The syntactic analysis device according to any one of claims 8 to 10, wherein the syntax tree generation module comprises:
an intermediate marker cleanup portion configured to remove the intermediate generated marks used in the syntactic analysis process; and
a phrase structure generating portion configured to generate a phrase-structure parse tree from the grammatical constituents of the input sentence and the relations between the constituents that the intermediate marker cleanup portion outputs after the intermediate generated marks have been removed.
12. The syntactic analysis device according to any one of claims 8 to 10, wherein the syntax tree generation module comprises:
an intermediate marker cleanup portion configured to remove the intermediate generated marks used in the syntactic analysis process;
a head node annotation portion configured to annotate head nodes according to the grammatical constituents of the input sentence and the relations between the constituents that the intermediate marker cleanup portion outputs after the intermediate generated marks have been removed; and
a dependency structure generating portion configured to generate the dependency syntactic relation graph of the input sentence according to the head nodes annotated by the head node annotation portion.
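To make the pipeline of claims 11 and 12 concrete, the sketch below removes intermediate marks from a parse tree and then derives dependency pairs from head annotations. The tree encoding as `(label, children)` tuples, the `@` prefix for intermediate marks, and the `HEAD_CHILD` table are all assumptions of this example; the claims do not define how head nodes are chosen.

```python
def clean(node):
    """Remove intermediate marks by splicing their children into the parent."""
    label, kids = node
    if isinstance(kids, str):          # leaf: (tag, word)
        return [node]
    new_kids = [c for k in kids for c in clean(k)]
    if label.startswith("@"):          # intermediate mark (assumed prefix)
        return new_kids
    return [(label, new_kids)]


# Hypothetical head table: which child label heads each phrase.
HEAD_CHILD = {"NP": "N", "VP": "V", "S": "VP"}


def dependencies(node, deps):
    """Return the head word of node; append (dependent, head) pairs to deps."""
    label, kids = node
    if isinstance(kids, str):
        return kids
    heads = [dependencies(k, deps) for k in kids]
    h = [k[0] for k in kids].index(HEAD_CHILD[label])
    for i, word in enumerate(heads):
        if i != h:
            deps.append((word, heads[h]))
    return heads[h]


tree = ("S", [
    ("NP", [("@X1", [("@X1", [("ADJ", "big")]), ("ADJ", "red")]), ("N", "dogs")]),
    ("VP", [("V", "bark")]),
])
phrase_tree = clean(tree)[0]
deps = []
root = dependencies(phrase_tree, deps)
print(phrase_tree)
print(root, deps)
```

The same cleaned tree serves both outputs: it is the phrase-structure tree as is, and the dependency pairs are read off it once each phrase has a head child.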
13. A syntactic analysis method, comprising:
a rule obtaining step of learning syntactic analysis rules from an annotated training treebank to generate a production rule set in regular-expression form, wherein repeated parts in the consequent of a production rule are expressed with regular expressions;
a rule application step of analyzing an input sentence using the production rule set obtained in the rule obtaining step, and identifying the grammatical constituents of the input sentence and the relations between the constituents; and
a syntax tree generation step of generating a dependency syntactic relation graph or a phrase-structure parse tree of the input sentence according to the grammatical constituents of the input sentence and the relations between the constituents output by the rule application step.
14. The syntactic analysis method according to claim 13, wherein the rule obtaining step comprises:
a tree fragment decomposition step of decomposing every syntax tree in the training treebank into tree fragments that serve as production rules, to form a tree fragment set;
a repeated fragment detection step of detecting whether the consequent of each production rule in the tree fragment set obtained by the tree fragment decomposition step contains a repeated node sequence, and expressing each production rule that has a repeated node sequence as a production rule in regular-expression form; and
a recurring rule merging step of merging identically formed production rules generated by the repeated fragment detection step into one production rule, to form the production rule set.
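The three steps of claim 14 might be chained as in the following sketch. It assumes trees are `(label, children)` tuples with `(tag, word)` leaves, treats a repeated node sequence as a unit occurring at least twice in a row, and writes the regular-expression form as `X+` or `(X Y)+`; these conventions and all function names are illustrative assumptions.

```python
from collections import Counter


def fragments(node):
    """Tree fragment decomposition: yield (parent, child labels) per internal node."""
    label, kids = node
    if isinstance(kids, str):      # leaf: (tag, word)
        return
    yield label, tuple(k[0] for k in kids)
    for k in kids:
        yield from fragments(k)


def collapse(symbols):
    """Repeated fragment detection: rewrite adjacent repeats, shortest unit first."""
    symbols = list(symbols)
    k = 1
    while 2 * k <= len(symbols):
        out, i = [], 0
        while i < len(symbols):
            block = symbols[i:i + k]
            j = i + k
            while symbols[j:j + k] == block:
                j += k
            if j > i + k:          # the block occurred at least twice in a row
                out.append(block[0] + "+" if k == 1 else "(" + " ".join(block) + ")+")
                i = j
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
        k += 1
    return symbols


def learn_rules(treebank):
    """Recurring rule merging: identical collapsed rules share one counter entry."""
    counts = Counter()
    for tree in treebank:
        for parent, kids in fragments(tree):
            counts[(parent, tuple(collapse(kids)))] += 1
    return counts


treebank = [
    ("NP", [("ADJ", "big"), ("ADJ", "red"), ("N", "dogs")]),
    ("NP", [("ADJ", "old"), ("ADJ", "big"), ("ADJ", "red"), ("N", "dogs")]),
]
print(learn_rules(treebank))
```

Two fragments with different numbers of adjectives end up as the same regular-expression rule, which is how the rule set stays small.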
15. The syntactic analysis method according to claim 14, wherein the rule obtaining step further comprises a rule selection step of selecting, according to a selection strategy, from the production rules generated by the recurring rule merging step, so as to generate a reduced production rule set.
16. The syntactic analysis method according to claim 15, wherein each tree fragment obtained by the tree fragment decomposition step is expressed as <freq:xx> {<f_1> ... <f_n>} P → Y_1 Y_2 ... Y_n, n being a positive integer greater than or equal to 1, wherein
<freq:xx> is frequency information representing the number of occurrences of the tree fragment in the training treebank;
<f_i> is an attribute feature used to describe contextual information, lexical features, or semantic features that apply when the production rule is used, i being any integer from 1 to n;
P represents the upper node of the tree fragment and is a phrase marker;
Y_i represents a child node of the P node and is a phrase marker or a lexical marker;
"{<f_1> ... <f_n>} P → Y_1 Y_2 ... Y_n" indicates that, when the attributes <f_1> ... <f_n> are present, the phrase P can be composed of Y_1 Y_2 ... Y_n; and
wherein, when the recurring rule merging step merges identically formed production rules generated by the repeated fragment detection step into one production rule, the frequencies of those production rules are merged accordingly;
the attribute features <f_i> in a production rule are optional.
17. The syntactic analysis method according to claim 16, wherein the production rules comprise:
a single-node repetition rule, being a production rule in whose consequent a single constituent is repeated two or more times;
a multi-node repetition rule, being a production rule in whose consequent a fragment is repeated two or more times;
a mixed repetition rule, being a production rule whose consequent contains both a single-node repeated part and a multi-node repeated part; and
a non-repetition rule, being a production rule whose consequent contains no repeated part.
18. The syntactic analysis method according to any one of claims 13 to 17, wherein the rule application step comprises:
a rule compiling step of compiling the production rule set generated by the rule obtaining step into a rule query table of syntactic analysis rules;
a rule query step of querying the rule query table compiled by the rule compiling step; and
a syntactic analysis step of looking up in the rule query table, through the rule query step, the syntactic analysis rules applicable to the input sentence, and identifying the grammatical constituents of the input sentence and the relations between the constituents according to those syntactic analysis rules.
19. The syntactic analysis method according to claim 18, wherein the rule application step further comprises an ambiguity resolution step of selecting the optimal partial analysis result from the partial analysis candidates generated by the syntactic analysis step; and
the syntactic analysis step performs further syntactic analysis on the partial analysis result selected by the ambiguity resolution step, so as to output a final analysis result that meets the requirements.
20. The syntactic analysis method according to claim 19, wherein:
the rule compiling step adds intermediate generated marks and intermediate generation rules for the parts of the production rule set that contain regular expressions, replaces the part of each production rule expressed by a regular expression with the corresponding intermediate generated mark, incorporates the intermediate generated marks into the phrase marker set of the syntactic analysis rule set, and incorporates the intermediate generation rules into the syntactic analysis rule set; and
the syntactic analysis step identifies intermediate generated marks in the same way as phrase markers are identified, and looks up, through the rule query step, all syntactic analysis rules, including intermediate generation rules, that fit the current input sentence, so as to identify the grammatical constituents of the input sentence and the relations between the constituents.
21. The syntactic analysis method according to claim 20, wherein the ambiguity resolution step determines the optimal production rules to use by calculating the probability P(S, t) of a parse tree according to the following formula, so as to select the optimal partial analysis result:
$$P(S, t) = \prod_{r \in D(t)} p(r)$$
wherein S is the input sentence, t is a parse tree, r is a syntactic analysis rule used in the syntactic analysis process, D(t) is the set of all syntactic analysis rules used to generate the parse tree t, and p(r) is the rule probability of the syntactic analysis rule r.
22. The syntactic analysis method according to claim 21, wherein:
the rule probability of an intermediate generated mark is calculated in the same way as the rule probability of a phrase marker that exists in the syntactic analysis, and the rule probability of an intermediate generation rule is calculated in the same way as the rule probability of a non-intermediate generation rule; and
for a production rule that contains a regular expression, the rule probability is calculated by the formula [formula image F2009101181041C0000072, not reproduced in the source text], wherein r_i is an intermediate generation rule used when converting the rule that contains the regular expression, p(r_i) is the rule probability of the intermediate generation rule r_i, n is a positive integer greater than or equal to 1, and i is any integer from 1 to n.
23. The syntactic analysis method according to any one of claims 20 to 22, wherein the syntax tree generation step comprises:
an intermediate marker cleanup step of removing the intermediate generated marks used in the syntactic analysis process; and
a phrase structure generation step of generating a phrase-structure parse tree from the grammatical constituents of the input sentence and the relations between the constituents that the intermediate marker cleanup step outputs after the intermediate generated marks have been removed.
24. The syntactic analysis method according to any one of claims 20 to 22, wherein the syntax tree generation step comprises:
an intermediate marker cleanup step of removing the intermediate generated marks used in the syntactic analysis process;
a head node annotation step of annotating head nodes according to the grammatical constituents of the input sentence and the relations between the constituents that the intermediate marker cleanup step outputs after the intermediate generated marks have been removed; and
a dependency structure generation step of generating the dependency syntactic relation graph of the input sentence according to the head nodes annotated in the head node annotation step.
CN200910118104.1A 2009-02-23 2009-02-23 Syntactic analysis device and syntactic analysis method Expired - Fee Related CN101814065B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910118104.1A CN101814065B (en) 2009-02-23 2009-02-23 Syntactic analysis device and syntactic analysis method

Publications (2)

Publication Number Publication Date
CN101814065A true CN101814065A (en) 2010-08-25
CN101814065B CN101814065B (en) 2014-07-30

Family

ID=42621323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910118104.1A Expired - Fee Related CN101814065B (en) 2009-02-23 2009-02-23 Syntactic analysis device and syntactic analysis method

Country Status (1)

Country Link
CN (1) CN101814065B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1226327A (en) * 1996-06-28 1999-08-18 微软公司 Method and system for computing semantic logical forms from syntax trees

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
魏蓉: "限定领域的基本陈述句句法分析", 《中国优秀硕士学位论文全文数据库(电子期刊)》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102637180B (en) * 2011-02-14 2014-06-18 汉王科技股份有限公司 Character post processing method and device based on regular expression
CN102637180A (en) * 2011-02-14 2012-08-15 汉王科技股份有限公司 Character post processing method and device based on regular expression
CN103324653A (en) * 2012-03-21 2013-09-25 株式会社东芝 Main point extraction device and main point extraction method
CN103324653B (en) * 2012-03-21 2016-12-28 株式会社东芝 Main point extraction device and main point extraction method
WO2014190901A1 (en) * 2013-05-28 2014-12-04 百度在线网络技术(北京)有限公司 Syntax compilation method, semantic parsing method, devices, computer storage medium and apparatus
CN103294666B (en) * 2013-05-28 2017-03-01 百度在线网络技术(北京)有限公司 Grammar compilation method, semantic analytic method and corresponding intrument
CN104281564B (en) * 2014-08-12 2017-08-08 中国科学院计算技术研究所 A kind of bilingual unsupervised syntactic analysis method and system
CN105893343A (en) * 2015-02-12 2016-08-24 富士通株式会社 Information management device and information management method
CN105740234A (en) * 2016-01-29 2016-07-06 昆明理工大学 MST algorithm based Vietnamese dependency tree library construction method
CN106202395A (en) * 2016-07-11 2016-12-07 上海智臻智能网络科技股份有限公司 Text clustering method and device
CN106202395B (en) * 2016-07-11 2019-12-31 上海智臻智能网络科技股份有限公司 Text clustering method and device
CN106843849A (en) * 2016-12-28 2017-06-13 南京大学 A kind of automatic synthesis method of the code model of the built-in function based on document
CN108021559A (en) * 2018-02-05 2018-05-11 威盛电子股份有限公司 Natural language understanding system and lexical analysis method
CN109684644A (en) * 2018-12-27 2019-04-26 南京大学 The construction method of interdependent syntax tree based on context
CN111767709A (en) * 2019-03-27 2020-10-13 武汉慧人信息科技有限公司 Logic method for carrying out error correction and syntactic analysis on English text

Also Published As

Publication number Publication date
CN101814065B (en) 2014-07-30

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140730

Termination date: 20180223