CN101814065B - Syntactic analysis device and syntactic analysis method - Google Patents
Syntactic analysis device and syntactic analysis method
- Publication number: CN101814065B (application CN200910118104.1A)
- Authority: CN (China)
- Prior art keywords: rule, syntactic analysis, production, production rule, tree
- Legal status: Expired - Fee Related (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Landscapes: Machine Translation (AREA)
Abstract
The invention discloses a syntactic analysis device and a syntactic analysis method. The syntactic analysis device, which uses regular expression rules, comprises a training treebank, a rule acquisition module, a rule application module, a syntax tree generation module and a rule set. The rule acquisition module learns syntactic analysis rules from the annotated training treebank by a statistical learning method and generates the rule set used when analyzing an input sentence. The rule acquisition module uses a regular expression to represent the repeated part in the right-hand side of a production rule. The syntactic analysis rules learned by the rule acquisition module may further contain context information. The rule application module uses the learned syntactic analysis rule set to analyze the input sentence, thereby identifying the grammatical constituents of the input sentence and the relations between them. The syntax tree generation module generates a dependency syntactic relation graph or a phrase-structure parse tree from the analysis result output by the rule application module, according to the user's needs.
Description
Technical field
The present invention relates to syntactic analysis technology for identifying the grammatical constituents of an input natural language sentence and the relations between those constituents. More particularly, the present invention relates to a syntactic analysis device and a syntactic analysis method that use regular expressions: syntactic analysis rules in regular expression form and a parsing algorithm are used to analyze the grammatical constituents of an input sentence and to output a parse tree.
Background technology
Identifying the grammatical constituents of natural language and the relations between them is both a difficulty and a key task in natural language processing. Many papers and patents have been published on this subject. For example, US Patent No. 5,386,556A discloses a natural language analyzing apparatus and method (Natural language analyzing apparatus and method), and US Patent No. 5,930,746A discloses an apparatus and method for automatically parsing and translating natural language (Parsing and translating natural language sentences automatically).
In natural language processing, syntactic analysis relies on a syntactic analysis rule base, and the quality and expressive power of that rule base are the most critical factors affecting the analysis result. In existing natural language analysis apparatuses and methods, however, the expressive power of syntactic analysis rules is limited, so syntactic rules cannot be applied flexibly and efficiently to describe the grammatical features of natural language, and consequently the syntactic constituents of an input sentence cannot be identified effectively and accurately.
Summary of the invention
In view of the foregoing, the present invention proposes a syntactic analysis device and a syntactic analysis method that use regular expressions to describe syntactic analysis rules, in order to identify the syntactic constituents of input natural language and the relations between them. The device and method use syntactic analysis rules in regular expression form, together with a parsing algorithm, to analyze the grammatical constituents of an input sentence and output a parse tree, thereby strengthening the ability to describe the rules of natural language.
According to one aspect of the present invention, a syntactic analysis device is provided, comprising: a rule acquisition module configured to learn syntactic analysis rules from an annotated training treebank to generate a production rule set that includes rules in regular expression form, wherein the repeated part in the right-hand side of a production rule is represented by a regular expression; a rule application module configured to analyze an input sentence using the production rule set obtained by the rule acquisition module and to identify the grammatical constituents of the input sentence and the relations between them; and a syntax tree generation module configured to generate a dependency syntactic relation graph or a phrase-structure parse tree of the input sentence according to the grammatical constituents and relations output by the rule application module.
In a syntactic analysis device according to an embodiment of the invention, the rule acquisition module comprises: a tree fragment decomposition unit configured to decompose each syntax tree in the training treebank into tree fragments serving as production rules, to form a tree fragment set; a repeated fragment detection unit configured to detect whether the right-hand side of each production rule in the tree fragment set contains a repeated node sequence, and to express production rules with a repeated node sequence as production rules in regular expression form; and a duplicate rule merging unit configured to merge production rules of identical form generated by the repeated fragment detection unit into a single production rule, to form the production rule set.
Preferably, the rule acquisition module further comprises a rule selection unit configured to select, according to a selection strategy, among the production rules generated by the duplicate rule merging unit, to generate a reduced production rule set.
According to one embodiment of the present invention, a tree fragment obtained by the tree fragment decomposition unit is expressed as <freq:xx>{<f_1>...<f_n>} P → Y_1 Y_2 ... Y_n, where n is a positive integer greater than or equal to 1. Here <freq:xx> is frequency information giving the number of occurrences of the tree fragment in the training treebank; <f_i> is an attribute feature describing the context information, lexical features or semantic features that apply when the production rule is used, i being any integer from 1 to n; P is the upper node of the tree fragment and is a phrase mark; Y_i is a child node of node P and is either a phrase mark or a lexical mark; and "{<f_1>...<f_n>} P → Y_1 Y_2 ... Y_n" means that, when the attributes <f_1>...<f_n> are present, phrase P can be formed from Y_1 Y_2 ... Y_n. When the duplicate rule merging unit merges production rules of identical form generated by the repeated fragment detection unit into a single production rule, the frequencies of those production rules are merged accordingly. The attribute features <f_i> in a production rule are optional.
Preferably, the production rules comprise: single-node repetition rules, in which a constituent in the right-hand side repeats two or more times; multi-node repetition rules, in which a fragment in the right-hand side repeats two or more times; mixed repetition rules, whose right-hand side contains both a single-node repeated part and a multi-node repeated part; and non-repetition rules, whose right-hand side contains no repeated part.
In a syntactic analysis device according to an embodiment of the invention, the rule application module comprises: a rule compiling unit configured to compile the production rule set generated by the rule acquisition module into a rule query table of syntactic analysis rules; a rule query unit configured to query the rule query table compiled by the rule compiling unit; and a syntactic analysis unit configured to look up in the rule query table, through the rule query unit, the syntactic analysis rules applicable to the input sentence, and to identify the grammatical constituents of the input sentence and the relations between them according to those rules.
Preferably, the rule application module further comprises an ambiguity resolution unit configured to select the optimal partial analysis result from the partial analysis candidates generated by the syntactic analysis unit; the syntactic analysis unit then performs further syntactic analysis on the partial analysis result selected by the ambiguity resolution unit, to output a final analysis result that meets the requirements.
According to one embodiment of the present invention, for each part of the production rule set that contains a regular expression, the rule compiling unit adds an intermediate mark and intermediate production rules, replaces the part of the production rule represented by the regular expression with the intermediate mark, adds the intermediate mark to the phrase mark set of the syntactic analysis rule set, and adds the intermediate production rules to the syntactic analysis rule set. The syntactic analysis unit identifies intermediate marks in the same way as phrase marks and, through the rule query unit, looks up all syntactic analysis rules, including intermediate production rules, that are applicable to the current input sentence, to identify the grammatical constituents of the input sentence and the relations between them.
Preferably, the ambiguity resolution unit determines the optimal production rules to use by calculating the probability P(S, t) of a parse tree according to the following formula, so as to select the optimal partial analysis result:
$$P(S, t) = \prod_{r \in D(t)} p(r)$$
where S is the input sentence, t is the parse tree, r is a syntactic analysis rule used in the syntactic analysis process, D(t) is the set of all syntactic analysis rules used to generate parse tree t, and p(r) is the rule probability of syntactic analysis rule r.
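The scoring step can be illustrated with a minimal Python sketch. It assumes, as the definitions of D(t) and p(r) suggest, that the probability of a parse tree is the product of the probabilities of the rules used to build it. The rule names `r1` to `r3`, their probabilities and the two candidate derivations are hypothetical, and logarithms are summed only to avoid numerical underflow on long derivations.

```python
import math

def tree_log_prob(rules_used, rule_prob):
    """Log of the product of p(r) over every rule r used to build the tree."""
    return sum(math.log(rule_prob[r]) for r in rules_used)

def best_candidate(candidates, rule_prob):
    """Pick the candidate derivation with the highest probability."""
    return max(candidates, key=lambda rules: tree_log_prob(rules, rule_prob))

# Hypothetical rule probabilities and two competing partial analyses.
rule_prob = {"r1": 0.5, "r2": 0.4, "r3": 0.2}
candidates = [
    ["r1", "r2"],  # 0.5 * 0.4
    ["r1", "r3"],  # 0.5 * 0.2
]
best = best_candidate(candidates, rule_prob)
```

Here `best` is the first candidate, because its rule probabilities multiply to the larger value.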
According to a preferred embodiment of the present invention, the rule probability of an intermediate mark is calculated in the same way as that of an existing phrase mark in syntactic analysis, and the rule probability of an intermediate production rule is calculated in the same way as that of a non-intermediate production rule. For a production rule containing a regular expression, the rule probability is calculated by a formula over the probabilities p(r_i), where r_i is an intermediate production rule used when transforming the rule containing the regular expression, p(r_i) is the rule probability of intermediate production rule r_i, n is a positive integer greater than or equal to 1, and i is any integer from 1 to n.
In a syntactic analysis device according to an embodiment of the invention, the syntax tree generation module comprises: an intermediate mark removal unit configured to remove the intermediate marks used in the syntactic analysis process; and a phrase structure generation unit configured to generate a phrase-structure parse tree from the grammatical constituents and relations of the input sentence output by the intermediate mark removal unit after the intermediate marks have been removed.
In a syntactic analysis device according to another embodiment of the invention, the syntax tree generation module may also comprise: an intermediate mark removal unit configured to remove the intermediate marks used in the syntactic analysis process; a head node marking unit configured to mark head nodes according to the grammatical constituents and relations of the input sentence output by the intermediate mark removal unit after the intermediate marks have been removed; and a dependency structure generation unit configured to generate the dependency syntactic relation graph of the input sentence from the head nodes marked by the head node marking unit.
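Removing intermediate marks amounts to splicing the children of each intermediate node into its parent. The Python sketch below shows one way this could be done. The tree encoding (a `(label, children)` tuple with plain strings as words) and the test that treats any label starting with `<` or `[` as an intermediate mark are assumptions made for illustration, not details taken from the patent.

```python
def is_intermediate(label):
    # Assumption: intermediate marks are written <x> or [x..y].
    return label.startswith("<") or label.startswith("[")

def strip_intermediate(tree):
    """Return a copy of the tree with intermediate nodes spliced out."""
    if isinstance(tree, str):          # a word (leaf) is kept unchanged
        return tree
    label, children = tree
    flat = []
    for child in children:
        child = strip_intermediate(child)
        if not isinstance(child, str) and is_intermediate(child[0]):
            flat.extend(child[1])      # lift the grandchildren into this node
        else:
            flat.append(child)
    return (label, flat)

# Hypothetical analysis of "a a a" using X -> <a>, <a> -> <a> a, <a> -> a a.
parsed = ("X", [("<a>", [("<a>", ["a", "a"]), "a"])])
cleaned = strip_intermediate(parsed)   # ("X", ["a", "a", "a"])
```

The sketch leaves the root untouched, on the assumption that a complete analysis is never rooted in an intermediate mark.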
According to another aspect of the present invention, a syntactic analysis method is provided, comprising: a rule acquisition step of learning syntactic analysis rules from an annotated training treebank to generate a production rule set that includes rules in regular expression form, wherein the repeated part in the right-hand side of a production rule is represented by a regular expression; a rule application step of analyzing an input sentence using the production rule set obtained in the rule acquisition step and identifying the grammatical constituents of the input sentence and the relations between them; and a syntax tree generation step of generating a dependency syntactic relation graph or a phrase-structure parse tree of the input sentence according to the grammatical constituents and relations output by the rule application step.
The syntactic analysis device and method proposed by the present invention can use syntactic analysis rules in regular expression form, which increases the descriptive power of syntactic analysis rules and overcomes the inflexible rule representation and weak expressive power of existing methods. The syntactic rule acquisition method and parsing algorithm proposed by the present invention can form a parser that supports regular expression rules, thereby achieving efficient and correct syntactic analysis.
In addition, the present invention also provides a computer program for implementing the above syntactic analysis method.
In addition, the present invention also provides a computer program product, at least in the form of a computer-readable medium, on which computer program code for implementing the above syntactic analysis method is recorded.
Brief description of the drawings
The above and other objects, features and advantages of the present invention will be understood more easily from the following description of embodiments of the invention with reference to the accompanying drawings. In the drawings, identical or corresponding technical features or components are denoted by identical or corresponding reference numerals. In the drawings:
Fig. 1 is a schematic structural diagram of a syntactic analysis device using regular expression rules according to an embodiment of the present invention;
Fig. 2 is a block diagram of the rule acquisition module shown in Fig. 1 according to an embodiment of the present invention;
Fig. 3 is a block diagram of the rule application module shown in Fig. 1 according to an embodiment of the present invention;
Fig. 4 is a block diagram of the syntax tree generation module shown in Fig. 1 according to an embodiment of the present invention;
Fig. 5 is a schematic diagram illustrating upper nodes and lower nodes;
Fig. 6 is an example syntax tree S1 used to illustrate how the rule acquisition module obtains syntactic rules;
Fig. 7 is another example syntax tree S2 used to illustrate how the rule acquisition module obtains syntactic rules;
Fig. 8 shows the final syntactic analysis result output after the syntactic analysis module analyzes an input sentence according to an embodiment of the present invention;
Fig. 9 shows the result after the intermediate mark removal unit of the syntax tree generation module removes the intermediate nodes from the final syntactic analysis result shown in Fig. 8, according to an embodiment of the present invention;
Fig. 10 is a flowchart of a syntactic analysis method according to an embodiment of the present invention;
Fig. 11 is a detailed flowchart of the processing performed in the rule acquisition step shown in Fig. 10 according to an embodiment of the present invention;
Fig. 12 is a detailed flowchart of the processing performed in the rule application step shown in Fig. 10 according to an embodiment of the present invention;
Fig. 13 is a detailed flowchart of the processing performed in the syntax tree generation step shown in Fig. 10 according to an embodiment of the present invention; and
Fig. 14 is a structural block diagram of an information processing device for implementing the syntactic analysis method according to the present invention.
Embodiment
Embodiments of the present invention are described below with reference to the accompanying drawings. Note that, for clarity, representations and descriptions of components and processing that are unrelated to the present invention or known to those of ordinary skill in the art are omitted from the drawings and the description.
First, the single-node repetition rule, multi-node repetition rule, mixed repetition rule and non-repetition rule used in the present invention are defined, together with the upper node and the lower node, so that the principle of the present invention can be explained more clearly.
Definition 1: single-node repetition rule. A production rule in which a constituent of the right-hand side repeats two or more times is defined as a single-node repetition rule.
Definition 2: multi-node repetition rule. A production rule in which a fragment of the right-hand side repeats two or more times is defined as a multi-node repetition rule.
Definition 3: mixed repetition rule. A production rule whose right-hand side contains both a single-node repeated part and a multi-node repeated part is defined as a mixed repetition rule.
Definition 4: non-repetition rule. A production rule whose right-hand side contains no repeated part is defined as a non-repetition rule.
Some examples of the various types of production rules are given below to further illustrate the single-node repetition rule, multi-node repetition rule, mixed repetition rule and non-repetition rule defined above.
(1) Single-node repetition rule
For example, in RL1: P → AAABC, the part "AAA" is a single-node repeated part, and the rule is written in the regular expression rule notation of the present invention as P → A*BC.
As another example, RL2: P → ABBCCD contains two single-node repeated parts, "BB" and "CC", and is written in the regular expression rule notation of the present invention as P → AB*C*D.
(2) Multi-node repetition rule
For example, in RL3: P → ABABC, the fragment "AB" is a multi-node repeated part, and the rule is written in the regular expression rule notation of the present invention as P → [AB]+C.
As another example, in RL4: P → ABCBCDEFDEFG, "BCBC" and "DEFDEF" are multi-node repeated parts, and the rule is written in the regular expression rule notation of the present invention as P → A[BC]+[DEF]+G.
(3) Mixed repetition rule
For example, RL5: P → AAABCDBCDE contains both single-node and multi-node repetition: "AAA" is a single-node repeated part and "BCDBCD" is a multi-node repeated part, so the rule is written in the regular expression rule notation of the present invention as P → A*[BCD]+E.
(4) Non-repetition rule
For example, RL6: P → ABCDEABC contains neither single-node nor multi-node repetition, and is therefore a non-repetition rule.
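The six examples above can be reproduced with a short Python sketch. It is only one possible detection procedure, written under two assumptions that the text does not state explicitly: a repetition counts only when the copies are adjacent (which is why RL6 stays unchanged), and at each position the shortest repeating block is preferred. The function name `compress_rhs` and the `sep` parameter are hypothetical.

```python
def compress_rhs(symbols, sep=""):
    """Rewrite adjacent repeated runs of a right-hand side in regex form."""
    out = []
    i, n = 0, len(symbols)
    while i < n:
        matched = False
        # Try the shortest repeating block first (k == 1 is a single node).
        for k in range(1, (n - i) // 2 + 1):
            block = symbols[i:i + k]
            reps = 1
            while symbols[i + reps * k:i + (reps + 1) * k] == block:
                reps += 1
            if reps >= 2:
                if k == 1:
                    out.append(block[0] + "*")               # single-node repetition
                else:
                    out.append("[" + sep.join(block) + "]+")  # multi-node repetition
                i += reps * k
                matched = True
                break
        if not matched:
            out.append(symbols[i])
            i += 1
    return out

def regex_form(rhs):
    """Convenience wrapper for right-hand sides of one-letter symbols."""
    return "".join(compress_rhs(list(rhs)))
```

With one-letter symbols, `regex_form("AAABCDBCDE")` gives `A*[BCD]+E`, matching RL5. For multi-letter phrase marks, `compress_rhs` can be called directly on a list of symbols with `sep=" "`.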
Next, the upper node and the lower node are defined.
Definition 5: upper node. A node that has child nodes in a tree structure is defined as an upper node.
Definition 6: lower node. A node that has a parent node in a tree structure is defined as a lower node.
Upper and lower nodes are relative to the part of the tree under consideration: the same node of a tree may serve as an upper node in one context and as a lower node in another. For example, consider the three nodes A2, B3 and C1 in the syntax tree shown in Fig. 5: C1 is a lower node of B3 and A2, A2 is an upper node of B3 and C1, and B3 is an upper node of C1 and a lower node of A2.
Next, the general working principle of the syntactic analysis device according to an embodiment of the present invention is described with reference to the accompanying drawings, in particular Figs. 1 to 4. As shown in Fig. 1, the syntactic analysis device using regular expression rules according to the embodiment comprises a training treebank 101, a rule acquisition module 102, a rule application module 103, a syntax tree generation module 104 and a rule set 105.
The rule acquisition module 102 learns syntactic analysis rules from the annotated training treebank 101, for example by statistical learning, and generates the rule set 105 used when analyzing an input sentence. For the repeated part in the right-hand side of a production rule, the rule acquisition module 102 applies the corresponding regular expression form defined above. The rule set 105 is therefore a set of production rules that includes rules in regular expression form. In addition, the syntactic analysis rules learned by the rule acquisition module 102 may also contain context information.
The rule application module 103 analyzes the input sentence using the syntactic analysis rule set 105 learned by the rule acquisition module 102, and identifies the grammatical constituents of the input sentence and the relations between them.
The syntax tree generation module 104 generates, according to the user's needs, a dependency syntactic relation graph or a phrase-structure parse tree of the input sentence from the analysis result output by the rule application module 103.
The three main modules of the syntactic analysis device of the present invention, namely the rule acquisition module 102, the rule application module 103 and the syntax tree generation module 104, are described in detail below.
Fig. 2 is a block diagram of the rule acquisition module 102 shown in Fig. 1 according to an embodiment of the present invention. As shown in Fig. 2, the rule acquisition module 102 according to this embodiment comprises a syntactic analysis treebank 201, a tree fragment decomposition unit 202, a decomposition parameter input unit 203, a tree fragment set 204, a repeated fragment detection unit 205, a non-repetition rule unit 206, a single-node repetition rule unit 207, a multi-node repetition rule unit 208, a mixed repetition rule unit 209, a duplicate rule merging unit 211 and a production rule set 213.
The syntactic analysis treebank 201 is the treebank used for learning, i.e. the training treebank 101 shown in Fig. 1, in which the grammatical constituents of the training sentences and the nesting relations between them are annotated. The present invention has been applied in practice to two treebanks, the English Penn Treebank and the Chinese Penn Treebank. Note, however, that the proposed syntactic analysis device and method are language independent: for any language, as long as the grammatical constituents of sentences and the nesting relations between them have been annotated, syntactic analysis rules can be obtained by the technical solution of the present invention and then used to analyze input sentences.
According to the decomposition parameters supplied by the decomposition parameter input unit 203, the tree fragment decomposition unit 202 decomposes each syntax tree in the syntactic analysis treebank 201 (the training treebank 101) into several smaller subtrees, or tree fragments. A tree fragment is represented in the following format.
<freq:xx>{<f_1>...<f_n>} P → Y_1 Y_2 ... Y_n
Here <freq:xx> is frequency information giving the number of occurrences of the tree fragment in the training treebank. <f_i> is an attribute feature, mainly used to describe the context information, lexical features or semantic features that apply when the rule is used. The attribute features are determined by the decomposition parameters supplied by the decomposition parameter input unit 203; a rule may or may not contain attribute features.
P is the upper node of the tree fragment and is a phrase mark. Y_i is a child node of node P and is either a phrase mark or a lexical mark. "{<f_1>...<f_n>} P → Y_1 Y_2 ... Y_n" means that, when the attributes <f_1>...<f_n> are present, phrase P can be formed from Y_1 Y_2 ... Y_n.
As the format shows, a tree fragment is a rule in production form. Each syntax tree in the syntactic analysis treebank 201 can be decomposed into several tree fragments, and all decomposition results are stored in the tree fragment set 204, forming a rule set in production form.
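A minimal Python sketch of this decomposition follows, for the simplest case in which every fragment is a context-free production without attribute features. The tree encoding (a `(label, children)` tuple whose leaves are plain strings) and the two toy trees are hypothetical; they are not the trees S1 and S2 of Figs. 6 and 7.

```python
from collections import Counter

def decompose(tree, counts):
    """Record one production per internal node: label -> child labels."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if not isinstance(child, str):
            decompose(child, counts)

def decompose_treebank(trees):
    """Build the tree fragment set with frequency information."""
    counts = Counter()
    for tree in trees:
        decompose(tree, counts)
    return counts

def format_fragments(counts):
    """Render fragments in the <freq:xx> P -> Y_1 ... Y_n style."""
    return [f"<freq:{n}> {lhs} → {' '.join(rhs)}"
            for (lhs, rhs), n in sorted(counts.items())]

# Two hypothetical annotated trees.
t1 = ("X", [("Y", ["a", "a", "a"]), ("Z", ["b"])])
t2 = ("X", [("Y", ["a", "a"]), ("Z", ["b"])])
fragments = decompose_treebank([t1, t2])
```

Calling `format_fragments(fragments)` lists each fragment once with its count, for example `<freq:2> X → Y Z` for the production that occurs in both trees.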
The tree fragment set (that is, the production rule set) is then input to the repeated fragment detection unit 205, which detects whether the "Y_1 Y_2 ... Y_n" part of each input tree fragment contains a repeated node sequence. According to the form of node repetition, the tree fragment set is divided among the single-node repetition rule unit 207, the multi-node repetition rule unit 208, the mixed repetition rule unit 209 and the non-repetition rule unit 206, as defined above. Rules containing a repeated node sequence are represented with the regular expression symbols "*" or "+".
Introducing regular expressions causes some rules to take identical forms. For example, rule R1: P → ABBBC and rule R2: P → ABBC are both expressed as P → AB*C in regular expression form.
Therefore, after all rules have been converted to regular expression form, duplicate rules of identical form must be removed. The duplicate rule merging unit 211 merges rules of identical form into a single rule and merges their frequencies accordingly. The production rule set 213 used for syntactic analysis of input sentences can thus be generated directly.
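The merge can be sketched in a few lines of Python. The frequencies 3 and 2 are made up for illustration, summing them is an assumed reading of merging the frequencies "accordingly", and the helper `single_node_form` handles only single-node repetition of one-letter symbols, which is enough for the pair of rules just discussed.

```python
from collections import Counter
from itertools import groupby

def single_node_form(rhs):
    """Collapse each run of two or more identical symbols into 'x*'."""
    parts = []
    for sym, run in groupby(rhs):
        parts.append(sym + "*" if len(list(run)) >= 2 else sym)
    return "".join(parts)

def merge_rules(rules):
    """Merge rules that share a regex form, summing their frequencies."""
    merged = Counter()
    for lhs, rhs, freq in rules:
        merged[(lhs, single_node_form(rhs))] += freq
    return merged

# Hypothetical frequencies for the two rules in the example above.
merged = merge_rules([("P", "ABBBC", 3), ("P", "ABBC", 2)])
```

Both inputs collapse to the key `("P", "AB*C")`, whose merged frequency is 5.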
In addition, to improve the efficiency of syntactic analysis, the rule acquisition module 102 according to an embodiment of the present invention may further comprise a selection strategy unit 210 and a rule selection unit 212. The rule selection unit 212 selects among the production rules generated by the duplicate rule merging unit 211 according to the selection strategy provided by the selection strategy unit 210, thereby generating a reduced, efficient production rule set 213.
The operation of the rule acquisition module 102 is described below with a concrete example. Fig. 6 and Fig. 7 show two example syntax trees, S1 and S2, used to illustrate how the rule acquisition module 102 obtains syntactic rules. The process by which the rule acquisition module 102 obtains syntactic analysis rules from S1 and S2 is as follows.
First, the tree fragment decomposition unit 202 decomposes syntax trees S1 and S2 according to the decomposition parameters supplied by the decomposition parameter input unit 203. Assuming that in this example the decomposition parameter specifies decomposing the trees into context-free phrases, the tree fragment set formed by decomposing the syntax trees S1 and S2 shown in Figs. 6 and 7 is as shown in Table 1 below.
Table 1: Tree fragment set
Then the decomposed fragment set shown in Table 1 is passed to the repeated fragment detection unit 205, which detects whether repeated nodes are present. The fragment set in this example includes repeated nodes. After the repeated parts have been expressed in regular expression form, the fragments are sent to units 206 to 209 according to their repetition type. The result of expressing the fragment set of Table 1 in regular expression form is shown in Table 2 below.
Table 2: Rule set expressed with regular expressions
The fragments expressed in regular expression form are input to the duplicate rule merging unit 211 as syntactic analysis rule candidates, where duplicate rules are merged. After merging by the duplicate rule merging unit 211, the rules shown in Table 2 form a rule set of 9 rules in total, in which duplicates of the rule "X → a*" have been merged. The output of the duplicate rule merging unit 211 is shown in Table 3 below.
Table 3: Rule set after duplicate rule merging
The output of the duplicate rule merging unit 211 then enters the rule selection unit 212, where rules are selected according to the selection strategy. This finally forms the syntactic analysis rule set that the rule application module 103 needs for syntactic analysis, which is input to the rule application module 103 as the production rule set 213.
Fig. 3 is a block diagram of the rule application module 103 shown in Fig. 1 according to an embodiment of the present invention. The rule application module 103 applies the syntactic analysis rule set formed by the rule acquisition module 102 shown in Fig. 2, i.e. the production rule set, to perform syntactic analysis on an input sentence and output its grammatical constituents and the relations between them. As shown in Fig. 3, the rule application module 103 according to this embodiment comprises a production rule set 302, a rule compiling unit 303, a rule query table 304, a rule query unit 305, a syntactic analysis unit 306, an ambiguity resolution unit 308 and so on.
The production rule set 302, which is the production rule set 213 formed by the rule acquisition module 102, first enters the rule compiling unit 303. The rule compiling unit 303 compiles the production rule set 302 into the rule query table 304 used by the rule query unit 305.
After the input sentence 301 is input to the syntactic analysis unit 306, the syntactic analysis unit 306 looks up in the rule query table 304, through the rule query unit 305, the syntactic analysis rules applicable to the input sentence 301, identifies the grammatical constituents of the input sentence 301 according to those rules, and outputs the analysis result. Syntactic analysis uses the CYK algorithm, starting from the word nodes of the input sentence 301; rule lookup expands from spans of 1 node to spans of 2 nodes and so on, until the whole sentence is covered.
The syntactic analysis rules may yield multiple partial analysis candidates 307, from which the ambiguity resolution unit 308 selects the optimal partial analysis result 309. The selected partial analysis result 309 is returned to the syntactic analysis unit 306 for further syntactic analysis, until the syntactic analysis unit 306 outputs a satisfactory final analysis result 310.
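The bottom-up, span-by-span lookup can be illustrated with a textbook CYK recognizer in Python. This is a simplified sketch rather than the parser of the embodiment: it assumes every rule is either lexical (word to mark) or binary, whereas the rules described here may have longer right-hand sides, and it omits ambiguity resolution. The two dictionaries stand in for the rule query table, and the toy grammar is hypothetical.

```python
def cyk_recognize(words, lexical, binary, start="S"):
    """Return True if the binary grammar derives `words` from `start`."""
    n = len(words)
    # chart[i][j] holds every mark that can cover words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):           # spans of 1 node
        chart[i][i + 1] = set(lexical.get(word, ()))
    for span in range(2, n + 1):               # spans of 2 nodes, then 3, ...
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for left in chart[i][k]:
                    for right in chart[k][j]:
                        chart[i][j] |= binary.get((left, right), set())
    return start in chart[0][n]

# Hypothetical toy grammar: S -> NP VP, VP -> V NP.
lexical = {"she": {"NP"}, "eats": {"V"}, "fish": {"NP"}}
binary = {("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
accepted = cyk_recognize(["she", "eats", "fish"], lexical, binary)
```

Keying `binary` by the pair of child marks makes each lookup a single dictionary access, which is one plausible reason to compile rules into a query table before parsing.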
For the production syntactic analysis rules in regular expression form generated by the rule acquisition module 102, embodiments of the present invention add intermediate results for the parts of the rules that contain regular expressions; the rule compiling unit 303, the rule query unit 305 and the syntactic analysis unit 306 thereby solve the problem of using syntactic analysis rules that contain regular expressions during syntactic analysis. The specific procedure is as follows.
The rule compiling unit 303 adds an intermediate mark and intermediate production rules for each part of the production rule set that contains a regular expression, and replaces that part of the rule with the intermediate mark.
Specifically, for a part of the form x*, the intermediate mark <x> and the intermediate production rules <x> → x x and <x> → <x> x are added. For a part of the form [x..y]+, the intermediate mark [x..y] and the intermediate production rules [x..y] → x...y x...y and [x..y] → [x..y] x...y are added.
For example, in the rule "R2: X → a*", the part "a*" contains a regular expression. The intermediate mark "<a>" and the two intermediate production rules "<a> → aa" and "<a> → <a>a" are added for "a*". Replacing the regular-expression part of the rule with the intermediate mark converts rule R2 into X → <a>.
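The conversion just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the encoding of a repeated part as a tuple of symbols and the function name `compile_rule` are hypothetical and do not come from the patent text.

```python
def compile_rule(lhs, rhs):
    """Replace the repeated parts of one rule by intermediate marks.

    Assumed encoding: each item of rhs is either a plain symbol (str) or a
    tuple of symbols standing for a repeated unit, e.g. ("a",) for a* and
    ("a", "b") for [ab]+.
    """
    new_rhs, intermediate = [], []
    for item in rhs:
        if isinstance(item, tuple):
            unit = list(item)
            body = "".join(unit)
            # <x> for a single repeated node, [x..y] for a repeated sequence
            mark = f"<{body}>" if len(unit) == 1 else f"[{body}]"
            intermediate.append((mark, unit + unit))    # mark -> unit unit
            intermediate.append((mark, [mark] + unit))  # mark -> mark unit
            new_rhs.append(mark)
        else:
            new_rhs.append(item)
    return (lhs, new_rhs), intermediate


rule, extra = compile_rule("X", [("a",)])
print(rule)   # ('X', ['<a>'])
print(extra)  # [('<a>', ['a', 'a']), ('<a>', ['<a>', 'a'])]
```

The two generated rules mirror "<a> → aa" and "<a> → <a>a" from the example: the first covers exactly two repetitions and the second extends an existing run by one more unit.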
Table 4 below shows the output of the repeated-rule merging unit 211 given in Table 3 after intermediate marks and intermediate production rules have been added. Table 5 shows the intermediate marks and intermediate production rules used; 6 intermediate production rules are introduced, denoted R13~R17.
Table 4: Output of the repeated-rule merging unit 211 after intermediate marks are added
Table 5: Intermediate marks and intermediate production rules used when converting the output of the repeated-rule merging unit 211
Intermediate mark | Intermediate production rule |
<a> | R13: <a> → aa |
<a> | R14: <a> → <a>a |
[ab] | R15: [ab] → abab |
[ab] | R16: [ab] → [ab]ab |
<X> | R17: <X> → XX |
<X> | R17: <X> → <X>X |
During rule compilation, the intermediate marks are merged into the phrase mark set of the rule set, and the intermediate production rules are merged into the syntactic analysis rule set. The rule compiling unit 303 organizes all rules and marks uniformly and generates the rule query table 304 for convenient lookup.
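The text does not specify the layout of the rule query table. One simple possibility, shown here purely as an assumption, is an index from each right-hand side to the marks it can be reduced to, so that ordinary rules and intermediate production rules are looked up in exactly the same way. The name `build_query_table` is hypothetical.

```python
from collections import defaultdict


def build_query_table(rules):
    """Index rules by right-hand side.

    rules is an iterable of (lhs, rhs) pairs; intermediate production rules
    are stored alongside ordinary rules, with no special treatment.
    """
    table = defaultdict(list)
    for lhs, rhs in rules:
        table[tuple(rhs)].append(lhs)
    return dict(table)


rules = [
    ("X", ["<a>"]),        # R2 after conversion
    ("<a>", ["a", "a"]),   # R13
    ("<a>", ["<a>", "a"]), # R14
]
table = build_query_table(rules)
print(table[("a", "a")])  # ['<a>']
```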
During analysis, the syntactic analysis unit 306 recognizes intermediate marks in the same way as it recognizes phrases. Using the CYK algorithm, it queries, through the rule query unit 305, all syntactic analysis rules (including intermediate production rules) applicable to the current input sentence 301 and generates the syntactic analysis result.
The operation of the rule application module 103 is illustrated below by analyzing the input sentence "aaababababcaa" with the rules in Table 4 and Table 5.
Table 6: Example of rule use when analyzing the sentence
Step | Current analysis state | Rules used |
Step 1 | aaababababcaa | R13: <a> → aa; R15: [ab] → abab; R13: <a> → aa |
Step 2 | <a>[ab]abc<a> | R2: X → <a>; R4: Z → [ab] |
Step 3 | XZabcX | R5: M → abcX |
Step 4 | XZM | R7: S → XZM |
Step 5 | S | |
Note that the above example shows only one way of using the syntactic analysis rules; it is neither optimal nor unique. Its purpose is to show how rules containing regular expressions can be used during syntactic analysis by introducing intermediate symbols and intermediate analysis rules.
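As a rough check of the worked example, the following Python sketch recognizes the sentence with the rules that appear in Table 5 and Table 6 (R2, R4, R5, R7 and R13 to R16). It is a memoized top-down span recognizer rather than the CYK table the description refers to, and the dictionary `RULES` and the function `recognizes` are hypothetical names introduced for illustration.

```python
from functools import lru_cache

# Rules taken from Table 5 and Table 6; intermediate marks are ordinary keys.
RULES = {
    "<a>": [("a", "a"), ("<a>", "a")],                   # R13, R14
    "[ab]": [("a", "b", "a", "b"), ("[ab]", "a", "b")],  # R15, R16
    "X": [("<a>",)],                                     # R2
    "Z": [("[ab]",)],                                    # R4
    "M": [("a", "b", "c", "X")],                         # R5
    "S": [("X", "Z", "M")],                              # R7
}


def recognizes(tokens, start="S"):
    """Return True if start derives the whole token sequence."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def derives(sym, i, j):
        if sym not in RULES:  # terminal: must match exactly one token
            return j == i + 1 and tokens[i] == sym
        return any(seq(rhs, i, j) for rhs in RULES[sym])

    @lru_cache(maxsize=None)
    def seq(rhs, i, j):
        if len(rhs) == 1:
            return derives(rhs[0], i, j)
        # The first symbol covers tokens[i:k]; every remaining symbol needs
        # at least one token, which also bounds the left-recursive rules.
        return any(derives(rhs[0], i, k) and seq(rhs[1:], k, j)
                   for k in range(i + 1, j - len(rhs) + 2))

    return derives(start, 0, len(tokens))


print(recognizes("aaababababcaa"))  # True
```

Because intermediate marks such as <a> and [ab] are stored like any other left-hand side, the recognizer needs no special case for them, which is the point the description makes about treating them like phrases.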
While rules are being applied, several analysis rules may be applicable in the same state, which gives rise to analysis ambiguity. The multiple production rule candidates 307 found by the rule query unit 305 are input to the ambiguity resolution unit 308, which selects the optimal rule to apply and thereby generates the partial analysis result 309.
According to one embodiment of the present invention, the ambiguity resolution unit 308 selects the optimal rule to use by computing the probability P(S, t) of the parse tree, and then outputs the optimal analysis result 309. The basic formula is:
$$P(S, t) = \prod_{r \in D(t)} p(r)$$
where S is the input sentence, t is the parse tree, r is a syntactic analysis rule used during analysis, D(t) is the set of all syntactic analysis rules used to generate the parse tree t, and p(r) is the rule probability of the rule r.
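A minimal sketch of this selection step follows, assuming that P(S, t) is the product of p(r) over the rules in D(t), which is the usual form for probabilistic grammars. The rule probabilities and the names `tree_log_prob` and `pick_best` are hypothetical; sums of logarithms are used instead of products only to avoid numerical underflow.

```python
import math


def tree_log_prob(rules_used, rule_prob):
    """log P(S, t), assuming P(S, t) is the product of p(r) for r in D(t)."""
    return sum(math.log(rule_prob[r]) for r in rules_used)


def pick_best(candidates, rule_prob):
    """Select the candidate analysis whose rule list has the highest probability."""
    return max(candidates, key=lambda used: tree_log_prob(used, rule_prob))


# Hypothetical probabilities, for illustration only.
rule_prob = {"R2": 1.0, "R13": 0.5, "R14": 0.25}
candidates = [["R13", "R2"], ["R13", "R14", "R2"]]
print(pick_best(candidates, rule_prob))  # ['R13', 'R2']
```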
When rule probabilities are computed, intermediate symbols are treated the same as the phrase symbols used in syntactic analysis, and intermediate analysis rules are treated the same as non-intermediate analysis rules. Any existing method for computing the probabilities of syntactic analysis rules can be used for the rules obtained according to the present invention. Rule probability estimation is briefly summarized as follows:
1. Rules without regular expressions: the rule probability is estimated as in existing methods.
2. Intermediate production rules: intermediate symbols are treated as phrases, and the rule probability is estimated with existing methods.
3. Rules containing regular expressions: the rule probability is computed from the probabilities p(r_i), where r_i is an intermediate rule used when converting the rule containing the regular expression, and p(r_i) is the rule probability of that intermediate rule r_i.
In the rule application module 103 of this embodiment, introducing intermediate symbols and intermediate production rules solves the problem of how to use regular-expression rules. Treating intermediate symbols and intermediate production rules on an equal footing with phrase symbols and ordinary rules solves the problem of estimating the probability of regular-expression rules.
Fig. 8 shows the final syntactic analysis result obtained by analyzing the input sentence "aaababababcaa" with the rules of Table 6.
The syntactic analysis result output by the rule application module 103, for example the result shown in Fig. 8, is input to the syntax tree generation module 104, which generates a dependency graph or a phrase-structure parse tree according to the user's needs.
Fig. 4 is a block diagram of the syntax tree generation module 104 of Fig. 1 according to an embodiment of the present invention. As shown in Fig. 4, the syntax tree generation module 104 of this embodiment comprises an intermediate mark cleaning unit 402, a head node labeling unit 403, a dependency structure generating unit 404 and a phrase structure generating unit 405.
The final analysis result 401 shown in Fig. 4 is the syntactic analysis result 310 output by the rule application module 103. The final analysis result 401 is first input to the intermediate mark cleaning unit 402, which removes the intermediate marks used during syntactic analysis.
After the intermediate marks are removed, if a phrase-structure parse tree is required, the phrase structure generating unit 405 generates the phrase structure tree. If a dependency graph is required, the analysis result with intermediate marks removed is input to the head node labeling unit 403 for head node labeling, after which the dependency structure generating unit 404 generates the dependency relations.
According to one embodiment of the present invention, the recursive function with which the intermediate mark cleaning unit 402 removes intermediate nodes is described below.
Function CleanTags(n_i)
Begin
  For each s_i, where s_i ∈ {sons of n_i}
    CleanTags(s_i)
    // if s_i is an intermediate symbol
    If s_i is a semi-finished label
      // all sons of s_i are raised to become sons of n_i
      move up all the sons of s_i as the sons of n_i
    Endif
  End for
End
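The pseudocode above can be turned into a small runnable Python sketch. The tree encoding (a pair of label and child list, with an empty list for leaves) and the test for intermediate marks (labels written as <...> or [...]) are assumptions made for this illustration.

```python
def is_intermediate(label):
    """Assumed convention: intermediate marks are written <...> or [...]."""
    return (label.startswith("<") and label.endswith(">")) or \
           (label.startswith("[") and label.endswith("]"))


def clean_tags(node):
    """Return a copy of the tree with intermediate nodes spliced out.

    node is (label, children); a leaf has an empty child list.
    """
    label, children = node
    new_children = []
    for child in children:
        cleaned = clean_tags(child)           # clean the subtree first
        if is_intermediate(cleaned[0]):
            new_children.extend(cleaned[1])   # raise its sons to this node
        else:
            new_children.append(cleaned)
    return (label, new_children)


leaf = lambda w: (w, [])
tree = ("X", [("<a>", [("<a>", [leaf("a"), leaf("a")]), leaf("a")])])
print(clean_tags(tree))  # ('X', [('a', []), ('a', []), ('a', [])])
```

Because each subtree is cleaned before its label is inspected, nested intermediate nodes produced by the left-recursive rule <a> → <a>a are flattened in a single pass.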
Fig. 9 shows the result after the intermediate mark cleaning unit 402 removes the intermediate nodes from the final analysis result 310 of the rule application module 103 shown in Fig. 8.
The structure and operating principle of the syntactic analysis device according to embodiments of the present invention have been described above. The syntactic analysis method applied by that device is described below with reference to Figs. 10 to 13.
Fig. 10 is a flowchart of a syntactic analysis method according to an embodiment of the present invention. As shown in Fig. 10, the method of this embodiment comprises a rule acquisition step S1001, a rule application step S1003 and a syntax tree generation step S1005.
First, in the rule acquisition step S1001, syntactic analysis rules are learned from an annotated training treebank, for example the training treebank 101 shown in Fig. 1, to generate a production rule set in regular-expression form, in which the repeated parts of the right-hand side of a production rule are represented by regular expressions. Then, in the rule application step S1003, the input sentence is analyzed with the production rule set obtained in the rule acquisition step S1001, and the constituents of the input sentence and the relations between them are identified. Finally, in the syntax tree generation step S1005, a dependency graph or a phrase-structure parse tree of the input sentence is generated from the constituents and relations output by the rule application step S1003.
Fig. 11 is a detailed flowchart of the processing performed in the rule acquisition step S1001 of Fig. 10 according to an embodiment of the present invention. As shown in Fig. 11, the rule acquisition method of this embodiment comprises a tree fragment decomposition step S1101, a repeated fragment detection step S1103, a repeated-rule merging step S1105 and a rule selection step S1107.
First, in the tree fragment decomposition step S1101, every syntax tree in the training treebank, for example the training treebank 101 shown in Fig. 1, is decomposed into tree fragments that serve as production rules, forming a tree fragment set, for example the tree fragment set 204 shown in Fig. 2.
Then, in the repeated fragment detection step S1103, the right-hand sides of the production rules in the tree fragment set obtained in the tree fragment decomposition step S1101 are checked for repeated node sequences, and each production rule with a repeated node sequence is expressed as a production rule in regular-expression form. Next, in the repeated-rule merging step S1105, production rules of identical form generated in the repeated fragment detection step S1103 are merged into one production rule, forming a production rule set.
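One way to picture the repeated fragment detection step is a greedy left-to-right scan that collapses any unit repeated two or more times in a row. The sketch below is an assumption about how such detection could work, not the patent's own procedure; the name `collapse_repeats` and the output notation (x* for a single node, [x y]+ for a sequence, with spaces added to keep multi-character symbols apart) are chosen for illustration.

```python
def collapse_repeats(rhs):
    """Greedily rewrite consecutive repetitions in a right-hand side.

    A unit repeated two or more times in a row is replaced by one
    regular-expression item. Shorter units are tried first.
    """
    out, i, n = [], 0, len(rhs)
    while i < n:
        for u in range(1, (n - i) // 2 + 1):
            unit = rhs[i:i + u]
            reps = 1
            while rhs[i + reps * u:i + (reps + 1) * u] == unit:
                reps += 1
            if reps >= 2:
                out.append(f"{unit[0]}*" if u == 1 else "[" + " ".join(unit) + "]+")
                i += reps * u
                break
        else:  # no repeated unit starts at position i
            out.append(rhs[i])
            i += 1
    return out


print(collapse_repeats(["a", "a", "a"]))            # ['a*']
print(collapse_repeats(["a", "b", "a", "b", "c"]))  # ['[a b]+', 'c']
```

After this rewriting, right-hand sides such as "a a" and "a a a" become the same string "a*", which is what allows the merging step to treat them as rules of identical form. A greedy scan is the simplest choice and can miss better groupings when runs overlap.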
Preferably, to improve the efficiency of syntactic analysis, the rule acquisition method according to an embodiment of the present invention further comprises the rule selection step S1107, in which the production rules generated in the repeated-rule merging step S1105 are selected according to a preset selection strategy to generate a reduced production rule set, such as the production rule set 213 shown in Fig. 2.
A tree fragment obtained in the tree fragment decomposition step can be expressed as <freq:xx>{<f_1>...<f_n>} P → Y_1 Y_2 ... Y_n, where n is a positive integer greater than or equal to 1. Here <freq:xx> is frequency information giving the number of occurrences of the tree fragment in the training treebank; <f_i> is an attribute feature describing the contextual information, lexical or semantic features that apply when the production rule is used, i being any integer from 1 to n; P is the parent node of the tree fragment and is a phrase mark; Y_i is a child node of the node P and is a phrase mark or a lexical mark; and "{<f_1>...<f_n>} P → Y_1 Y_2 ... Y_n" means that, when the attributes <f_1>...<f_n> are present, the phrase P can be formed from Y_1 Y_2 ... Y_n.
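The decomposition of a syntax tree into frequency-counted fragments can be sketched as follows. The tree encoding and the name `decompose` are assumptions for this illustration, and the attribute features <f_i> are left out for brevity.

```python
from collections import Counter


def decompose(tree, counts=None):
    """Count one production P -> Y_1 ... Y_n for every internal node.

    tree is (label, children); a leaf has an empty child list.
    The returned Counter maps (P, (Y_1, ..., Y_n)) to its frequency.
    """
    counts = Counter() if counts is None else counts
    label, children = tree
    if children:
        counts[(label, tuple(child[0] for child in children))] += 1
        for child in children:
            decompose(child, counts)
    return counts


leaf = lambda w: (w, [])
tree = ("S", [("X", [leaf("a"), leaf("a")]), ("X", [leaf("a"), leaf("a")])])
fragments = decompose(tree)
print(fragments[("X", ("a", "a"))])  # 2
```

To process a whole treebank, the same Counter can be passed to `decompose` for every tree so that the frequencies accumulate across trees.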
Note that when the repeated-rule merging step S1105 merges production rules of identical form generated in the repeated fragment detection step S1103 into one production rule, the frequencies of those production rules are merged accordingly.
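A minimal sketch of merging and selection follows. It assumes that merging frequencies means adding them and that the selection strategy is a simple frequency threshold; neither detail is spelled out in the text, and the names `merge_rules`, `select_rules` and `min_freq` are hypothetical.

```python
from collections import Counter


def merge_rules(rules):
    """Merge rules of identical form, assuming their frequencies are summed.

    rules is an iterable of (lhs, rhs, freq) triples.
    """
    merged = Counter()
    for lhs, rhs, freq in rules:
        merged[(lhs, tuple(rhs))] += freq
    return merged


def select_rules(merged, min_freq=2):
    """Hypothetical selection strategy: keep rules seen at least min_freq times."""
    return {rule: freq for rule, freq in merged.items() if freq >= min_freq}


# "X -> a a" and "X -> a a a" both become "X -> a*" after repeat detection.
rules = [("X", ["a*"], 3), ("X", ["a*"], 1), ("M", ["a", "b", "c", "X"], 1)]
merged = merge_rules(rules)
print(merged[("X", ("a*",))])  # 4
print(select_rules(merged))    # {('X', ('a*',)): 4}
```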
Fig. 12 is a detailed flowchart of the processing performed in the rule application step S1003 of Fig. 10 according to an embodiment of the present invention.
As shown in Fig. 12, first, in the rule compiling step S1201, the production rule set generated in the rule acquisition step S1001 is compiled into a rule query table of syntactic analysis rules, for example the rule query table 304 shown in Fig. 3. Then, in the syntactic analysis step S1203, the rule query table 304 is queried through the rule query step S1205 for syntactic analysis rules applicable to the input sentence, and the grammatical constituents of the input sentence and the relations between them are identified according to those rules. Here, the rule query step S1205 queries the rule query table 304 compiled in the rule compiling step S1201.
Next, in the ambiguity resolution step S1207, the optimal partial analysis result is selected from the partial analysis candidates generated by the syntactic analysis step S1203 and the rule query step S1205. Step S1209 then judges whether the analysis result obtained meets the requirements. If it does not, processing returns to the syntactic analysis step S1203, and the partial analysis result selected in the ambiguity resolution step S1207 is analyzed further.
If step S1209 judges that the analysis result obtained after ambiguity resolution meets the requirements, processing advances to step S1211, which outputs the final syntactic analysis result.
According to a preferred embodiment of the present invention, in the rule compiling step S1201, an intermediate mark and intermediate production rules are added for each part of the production rule set that contains a regular expression, the part of the production rule represented by the regular expression is replaced with the intermediate mark, the intermediate marks are merged into the phrase mark set of the syntactic analysis rule set, and the intermediate production rules are merged into the syntactic analysis rule set.
In the syntactic analysis step S1203, intermediate marks are recognized in the same way as phrase marks, and all syntactic analysis rules applicable to the current input sentence, including intermediate production rules, are queried through the rule query step S1205 to identify the grammatical constituents of the input sentence and the relations between them.
In the ambiguity resolution step S1207, the optimal production rule to use is determined by computing the probability P(S, t) of the parse tree according to the formula below, so as to select the optimal partial analysis result:
$$P(S, t) = \prod_{r \in D(t)} p(r)$$
where S is the input sentence, t is the parse tree, r is a syntactic analysis rule used during analysis, D(t) is the set of all syntactic analysis rules used to generate the parse tree t, and p(r) is the rule probability of the rule r.
The rule probability of an intermediate mark is computed in the same way as that of a phrase symbol used in syntactic analysis, and the rule probability of an intermediate production rule is computed in the same way as that of a non-intermediate production rule.
For a production rule containing a regular expression, the rule probability is computed by a formula in terms of p(r_i), where r_i is an intermediate production rule used when converting the rule containing the regular expression, p(r_i) is the rule probability of that intermediate production rule r_i, n is a positive integer greater than or equal to 1, and i is any integer from 1 to n.
Fig. 13 is a detailed flowchart of the processing performed in the syntax tree generation step S1005 of Fig. 10 according to an embodiment of the present invention.
As shown in Fig. 13, first, in the intermediate mark cleaning step S1301, the intermediate marks used during syntactic analysis are removed. Then step S1303 judges whether a dependency graph or a phrase-structure parse tree is to be generated.
If step S1303 judges that a phrase-structure parse tree is required, processing advances to the phrase structure generation step S1305, which generates the phrase-structure parse tree from the constituents of the input sentence and the relations between them, as output by the intermediate mark cleaning step S1301 after removal of the intermediate marks, and outputs the generated phrase-structure parse tree of the input sentence.
If step S1303 judges that a dependency graph of the input sentence is required, processing advances to the head node labeling step S1307, which labels head nodes according to the constituents of the input sentence and the relations between them, as output by the intermediate mark cleaning step S1301 after removal of the intermediate marks. Then, in the dependency structure generation step S1309, the dependency graph of the input sentence is generated from the head nodes labeled in the head node labeling step S1307, and the generated dependency graph of the input sentence is output.
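The text does not say how head nodes are chosen or how dependencies are derived from them. A common approach, used here purely as an assumption, is to pick one head child per phrase and make the head word of every other child depend on the head word of that child. The tree encoding, the function `head_and_arcs` and the `pick_head` policy are all hypothetical.

```python
def head_and_arcs(node, pick_head=lambda label, kids: 0):
    """Return (head_token, arcs) for a cleaned phrase-structure tree.

    A leaf is an int (token position); an internal node is (label, children).
    pick_head(label, children) returns the index of the head child; the
    default of always choosing the first child is only a placeholder.
    Each arc is a (head_token, dependent_token) pair.
    """
    if isinstance(node, int):
        return node, []
    label, kids = node
    results = [head_and_arcs(kid, pick_head) for kid in kids]
    h = pick_head(label, kids)
    head_tok = results[h][0]
    arcs = [arc for _, sub in results for arc in sub]
    arcs += [(head_tok, tok) for idx, (tok, _) in enumerate(results) if idx != h]
    return head_tok, arcs


# Tokens 0, 1, 2 stand for a subject, a verb and an object.
tree = ("S", [("NP", [0]), ("VP", [1, ("NP", [2])])])
policy = lambda label, kids: 1 if label == "S" else 0
print(head_and_arcs(tree, policy))  # (1, [(1, 2), (1, 0)])
```

Token positions are used instead of words so that repeated words in a sentence remain distinct nodes of the dependency graph.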
Specific embodiments of the syntactic analysis device and syntactic analysis method of the present invention have been described in detail above with reference to the drawings. As the description shows, using syntactic analysis rules in regular-expression form increases the descriptive power of the rules and overcomes the inflexible and weakly expressive rule representation of existing methods. The rule acquisition method and parsing algorithm proposed by the invention can form a parser that supports regular-expression rules and achieves efficient, correct syntactic analysis.
In addition, introducing regular expressions into production rules provides a way to describe the repeated parts of a grammatical structure. The number of repetitions of a repeated fragment is unrestricted, so a single syntactic analysis rule can analyze a class of phrases of different lengths. With the device and method proposed by the invention, a grammar can be described more flexibly: both the positional relations of the constituents and the recursive nature of local constituents in a syntactic structure can be described. The rules obtained are therefore more general and more robust.
In addition, in the training treebank used for learning, phrases with many constituents (phrases with many nodes, called long phrases here) occur with low frequency. When described with existing syntactic analysis rules, such phrases are usually ignored; if such a long phrase then appears during syntactic analysis, it cannot be analyzed, which is the rule sparseness problem. Long phrases, however, usually contain repeatable parts. With the device and method of the invention, the repeated parts of a long phrase are merged, so that however many times a part repeats, the phrase is covered by the same syntactic analysis rule, which alleviates the rule sparseness problem to some extent.
In addition, according to the device and method of the invention, intermediate results are added during syntactic analysis for the parts of a rule that contain a regular expression, which solves the problem of using rules containing regular expressions in syntactic analysis.
The basic operating principle of the invention has been set out above using syntax trees taken from Chinese as concrete examples, but the device and method described can equally identify grammatical or semantic constituents in other languages. The method can also be used to analyze genome sequences, or for similar tasks that identify a certain class of constituent in an input symbol sequence. It should therefore be understood that all variations applied to other languages or symbol systems that do not depart from the essential design of the invention fall within its scope of protection.
The basic principle of the invention has been described above in connection with specific embodiments. It should be noted, however, that a person of ordinary skill in the art will understand that all or any of the steps or components of the method and device of the invention can be implemented in hardware, firmware, software or a combination of them, in any computing device (including processors, storage media and the like) or network of computing devices, and that such a person can do so with basic programming skills after reading the description of the invention.
Therefore, the object of the invention can also be achieved by running a program or a set of programs on any computing device. The computing device may be a well-known general-purpose device. The object of the invention can thus also be achieved merely by providing a program product containing program code that implements the method or device. That is, such a program product also constitutes the invention, and a storage medium storing such a program product also constitutes the invention. Obviously, the storage medium may be any known storage medium or any storage medium developed in the future.
Where embodiments of the invention are implemented by software and/or firmware, the program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure, for example the general-purpose personal computer 700 shown in Fig. 14, which can perform various functions when various programs are installed.
In Fig. 14, a central processing unit (CPU) 701 performs various processing according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage section 708 into a random access memory (RAM) 703. The RAM 703 also stores, as needed, data required when the CPU 701 performs the various processing. The CPU 701, the ROM 702 and the RAM 703 are connected to one another via a bus 704. An input/output interface 705 is also connected to the bus 704.
The following components are connected to the input/output interface 705: an input section 706, including a keyboard, a mouse and the like; an output section 707, including a display such as a cathode-ray tube (CRT) or a liquid crystal display (LCD), a loudspeaker and the like; the storage section 708, including a hard disk and the like; and a communication section 709, including a network interface card such as a LAN card, a modem and the like. The communication section 709 performs communication processing via a network such as the Internet.
A drive 710 is also connected to the input/output interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disc, a magneto-optical disc or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read from it is installed into the storage section 708 as needed.
Where the above series of processing is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 711.
Those skilled in the art will understand that the storage medium is not limited to the removable medium 711 shown in Fig. 14, which stores the program and is distributed separately from the device to provide the program to the user. Examples of the removable medium 711 include a magnetic disk (including a floppy disk (registered trademark)), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disc (including a MiniDisc (MD) (registered trademark)) and a semiconductor memory. Alternatively, the storage medium may be the ROM 702, a hard disk contained in the storage section 708, or the like, which stores the program and is distributed to the user together with the device containing it.
It should also be noted that, in the device and method of the invention, the components or steps can obviously be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalents of the invention. Moreover, the steps of the above series of processing can naturally be performed in chronological order as described, but need not be. Some steps can be performed in parallel or independently of one another.
Although the invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made without departing from the spirit and scope of the invention as defined by the appended claims. Furthermore, the terms "comprise", "include" and any variants of them are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device that comprises the element.
Claims (18)
1. A syntactic analysis device, comprising:
a rule acquisition module, configured to learn syntactic analysis rules from an annotated training treebank to generate a production rule set in regular-expression form, wherein repeated parts of the right-hand side of a production rule are represented by regular expressions;
a rule application module, configured to analyze an input sentence with the production rule set obtained by the rule acquisition module, and to identify the grammatical constituents of the input sentence and the relations between the constituents; and
a syntax tree generation module, configured to generate a dependency graph or a phrase-structure parse tree of the input sentence according to the grammatical constituents of the input sentence and the relations between the constituents output by the rule application module;
wherein the rule acquisition module comprises:
a tree fragment decomposition unit, configured to decompose every syntax tree in the training treebank into tree fragments serving as production rules, to form a tree fragment set;
a repeated fragment detection unit, configured to detect whether the right-hand sides of the production rules in the tree fragment set obtained by the tree fragment decomposition unit contain repeated node sequences, and to express each production rule with a repeated node sequence as a production rule in regular-expression form; and
a repeated-rule merging unit, configured to merge production rules of identical form generated by the repeated fragment detection unit into one production rule, to form a production rule set;
wherein the rule acquisition module further comprises a rule selection unit, configured to select among the production rules generated by the repeated-rule merging unit according to a selection strategy, to generate a reduced production rule set;
wherein a tree fragment obtained by the tree fragment decomposition unit is expressed as <freq:xx>{<f_1>...<f_n>} P → Y_1 Y_2 ... Y_n, n being a positive integer greater than or equal to 1, wherein
<freq:xx> is frequency information giving the number of occurrences of the tree fragment in the training treebank;
<f_i> is an attribute feature describing the contextual information, lexical or semantic features that apply when the production rule is used, i being any integer from 1 to n;
P is the parent node of the tree fragment and is a phrase mark;
Y_i is a child node of the node P and is a phrase mark or a lexical mark;
"{<f_1>...<f_n>} P → Y_1 Y_2 ... Y_n" indicates that, when the attributes <f_1>...<f_n> are present, the phrase P can be formed from Y_1 Y_2 ... Y_n; and
wherein, when the repeated-rule merging unit merges production rules of identical form generated by the repeated fragment detection unit into one production rule, the frequencies of the production rules are merged accordingly;
the attribute feature <f_i> in a production rule is optional.
2. The syntactic analysis device according to claim 1, wherein the production rules comprise:
single-node repetition rules, being production rules in which a constituent of the right-hand side is repeated two or more times;
multi-node repetition rules, being production rules in which a fragment of the right-hand side is repeated two or more times;
mixed repetition rules, being production rules whose right-hand side contains both a single-node repeated part and a multi-node repeated part; and
non-repetition rules, being production rules whose right-hand side contains no repeated part.
3. The syntactic analysis device according to claim 1, wherein the rule application module comprises:
a rule compiling unit, configured to compile the production rule set generated by the rule acquisition module into a rule query table of syntactic analysis rules;
a rule query unit, configured to query the rule query table compiled by the rule compiling unit; and
a syntactic analysis unit, configured to query the rule query table, through the rule query unit, for syntactic analysis rules applicable to the input sentence, and to identify the grammatical constituents of the input sentence and the relations between the constituents according to the syntactic analysis rules.
4. The syntactic analysis device according to claim 3, wherein the rule application module further comprises an ambiguity resolution unit, configured to select the optimal partial analysis result from the partial analysis candidates generated by the syntactic analysis unit; and
the syntactic analysis unit performs further syntactic analysis on the partial analysis result selected by the ambiguity resolution unit, to output a final analysis result that meets the requirements.
5. The syntactic analysis device according to claim 4, wherein:
the rule compiling unit adds an intermediate mark and intermediate production rules for each part of the production rule set that contains a regular expression, replaces the part of the production rule represented by the regular expression with the intermediate mark, merges the intermediate marks into the phrase mark set of the syntactic analysis rule set, and merges the intermediate production rules into the syntactic analysis rule set; and
the syntactic analysis unit recognizes intermediate marks in the same way as it recognizes phrase marks, and queries, through the rule query unit, all syntactic analysis rules applicable to the current input sentence, including the intermediate production rules, to identify the grammatical constituents of the input sentence and the relations between the constituents.
6. The syntactic analysis device according to claim 5, wherein the ambiguity resolution unit determines the optimal production rule to use by computing the probability P(S, t) of the parse tree according to the following formula, so as to select the optimal partial analysis result:
$$P(S, t) = \prod_{r \in D(t)} p(r)$$
where S is the input sentence, t is the parse tree, r is a syntactic analysis rule used during syntactic analysis, D(t) is the set of all syntactic analysis rules used to generate the parse tree t, and p(r) is the rule probability of the syntactic analysis rule r.
7. The syntactic analysis device according to claim 6, wherein:
The rule probability of an intermediate generation mark is calculated in the same way as the rule probability of a phrase symbol already present in syntactic analysis, and the rule probability of an intermediate generation rule is calculated in the same way as the rule probability of a non-intermediate generation rule; and
For a production rule containing a regular expression, the rule probability is computed from the rule probabilities p(r_i) of the intermediate generation rules r_i used when converting that rule, where r_i is an intermediate generation rule used in the conversion, p(r_i) is the rule probability of the intermediate generation rule r_i, n is a positive integer greater than or equal to 1, and i is any integer from 1 to n.
8. The syntactic analysis device according to any one of claims 5 to 7, wherein the syntax tree generation module comprises:
An intermediate mark cleaning portion, configured to remove the intermediate generation marks used in the syntactic analysis process; and
A phrase structure generating portion, configured to generate a phrase structure type parsing tree according to the grammatical components of the input sentence and the relations between the components, as output by the intermediate mark cleaning portion after the intermediate generation marks have been removed.
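The cleaning of intermediate generation marks in claim 8 can be sketched as a tree rewrite that splices the children of every intermediate node into its parent. The tree encoding `(label, [children])` and the convention that intermediate labels start with `@` are assumptions made for this example only.

```python
def remove_intermediate(tree, is_intermediate=lambda label: label.startswith("@")):
    """Return a copy of `tree` with intermediate nodes spliced out.

    A tree is (label, [children]); a leaf is a plain string (a word).
    The root itself is assumed not to be an intermediate node.
    """
    if isinstance(tree, str):
        return tree
    label, children = tree
    cleaned = []
    for child in children:
        child = remove_intermediate(child, is_intermediate)
        if not isinstance(child, str) and is_intermediate(child[0]):
            cleaned.extend(child[1])   # lift the grandchildren into this node
        else:
            cleaned.append(child)
    return (label, cleaned)

parsed = ("NP", [("@X1", [("ADJ", ["big"]),
                          ("@X1", [("ADJ", ["red"])])]),
                 ("N", ["ball"])])
print(remove_intermediate(parsed))
```

The result is the flat phrase structure NP over ADJ ADJ N, which matches the shape the original treebank rule would have produced.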
9. The syntactic analysis device according to any one of claims 5 to 7, wherein the syntax tree generation module comprises:
An intermediate mark cleaning portion, configured to remove the intermediate generation marks used in the syntactic analysis process;
A head node marking portion, configured to mark head nodes according to the grammatical components of the input sentence and the relations between the components, as output by the intermediate mark cleaning portion after the intermediate generation marks have been removed; and
A dependency structure generating portion, configured to generate a dependency syntactic relation graph of the input sentence according to the head nodes marked by the head node marking portion.
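One common way to go from a phrase structure tree to dependencies, once head nodes are marked, is to let the lexical head of each non-head child depend on the lexical head of the head child. The Python sketch below follows that idea; the head table `HEAD_CHILD`, the fallback to the last child, and the use of bare words as node identifiers are assumptions for illustration and are not taken from the claims.

```python
HEAD_CHILD = {"S": "VP", "VP": "V", "NP": "N"}  # hypothetical head table

def to_dependencies(tree, deps=None):
    """Return (head_word, deps), where deps is a list of (dependent, head) pairs.

    A tree is (label, [children]); a pre-terminal is (tag, [word]).
    Words are used as identifiers, so repeated words would be ambiguous.
    """
    if deps is None:
        deps = []
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return children[0], deps           # pre-terminal: the word is its own head
    heads = [to_dependencies(child, deps)[0] for child in children]
    labels = [child[0] for child in children]
    wanted = HEAD_CHILD.get(label)
    head_index = labels.index(wanted) if wanted in labels else len(children) - 1
    for i, word in enumerate(heads):
        if i != head_index:
            deps.append((word, heads[head_index]))
    return heads[head_index], deps

tree = ("S", [("NP", [("ADJ", ["big"]), ("N", ["dogs"])]),
              ("VP", [("V", ["chase"]), ("NP", [("N", ["cats"])])])])
root, deps = to_dependencies(tree)
print(root, deps)
```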
10. A syntactic analysis method, comprising:
A rule acquisition step of learning syntactic analysis rules from an annotated training treebank, wherein repeated parts in the consequents of production rules are expressed with regular expressions, so as to generate a production rule set containing rules in regular expression form;
A rule application step of analyzing an input sentence using the production rule set obtained in the rule acquisition step, and identifying the grammatical components of the input sentence and the relations between the components; and
A syntax tree generation step of generating a dependency syntactic relation graph or a phrase structure type parsing tree of the input sentence according to the grammatical components of the input sentence and the relations between the components output by the rule application step;
Wherein the rule acquisition step comprises:
A tree fragment decomposition step of decomposing every syntax tree in the training treebank into tree fragments serving as production rules, so as to form a tree fragment set;
A repeated fragment detection step of detecting whether a repeated node sequence exists in the consequent of each production rule in the tree fragment set obtained by the tree fragment decomposition step, and expressing each production rule that has a repeated node sequence as a production rule in regular expression form; and
A repeated rule merging step of merging the production rules of identical form generated by the repeated fragment detection step into a single production rule, so as to form the production rule set;
Wherein the rule acquisition step further comprises a rule selection step of selecting, according to a selection strategy, among the production rules generated by the repeated rule merging step, so as to generate a reduced production rule set;
Wherein each tree fragment obtained by the tree fragment decomposition step is expressed as <freq:xx>{<f_1> ... <f_n>} P → Y_1 Y_2 ... Y_n, where n is a positive integer greater than or equal to 1, wherein
<freq:xx> is frequency information, representing the number of occurrences of the tree fragment in the training treebank;
<f_i> is an attribute feature describing the contextual information, lexical features, or semantic features that apply when the production rule is used, and i is any integer from 1 to n;
P represents the parent node of the tree fragment and is a phrase marker;
Y_i represents a child node of node P and is a phrase marker or a lexical marker;
"{<f_1> ... <f_n>} P → Y_1 Y_2 ... Y_n" indicates that, when the attributes <f_1> ... <f_n> are present, the phrase P can be formed from Y_1 Y_2 ... Y_n; and
Wherein, when the repeated rule merging step merges the production rules of identical form generated by the repeated fragment detection step into a single production rule, the frequencies of the production rules are merged accordingly;
The attribute feature <f_i> in a production rule is optional.
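The rule acquisition step of claim 10 can be sketched in Python as three small functions: decomposing trees into fragments, rewriting repeated node sequences in regular expression form, and merging identical rules while summing their frequencies. This is a simplified illustration under several assumptions: trees are encoded as `(label, [children])`, a repeated part is written `(A B)+`, attribute features <f_i> and lexical rules are ignored, and the greedy shortest-unit detection is one possible strategy rather than the one used in the patent.

```python
from collections import Counter

def fragments(tree):
    """Yield one (parent, child-labels) production per internal node."""
    label, children = tree
    if all(isinstance(c, str) for c in children):
        return                      # pre-terminal: lexical rules are skipped here
    yield label, tuple(c[0] for c in children)
    for c in children:
        yield from fragments(c)

def collapse(rhs):
    """Rewrite consecutive repetitions in `rhs` as "(unit)+" items."""
    out, i = [], 0
    while i < len(rhs):
        for width in range(1, (len(rhs) - i) // 2 + 1):   # shortest unit first
            unit = rhs[i:i + width]
            reps = 1
            while rhs[i + reps * width:i + (reps + 1) * width] == unit:
                reps += 1
            if reps > 1:
                out.append("(" + " ".join(unit) + ")+")
                i += reps * width
                break
        else:                        # no repetition starts at position i
            out.append(rhs[i])
            i += 1
    return tuple(out)

def learn_rules(treebank):
    """Merge rules of identical form and add up their frequencies."""
    counts = Counter()
    for tree in treebank:
        for parent, rhs in fragments(tree):
            counts[(parent, collapse(rhs))] += 1
    return counts

t1 = ("NP", [("ADJ", ["big"]), ("ADJ", ["red"]), ("N", ["ball"])])
t2 = ("NP", [("ADJ", ["old"]), ("ADJ", ["big"]), ("ADJ", ["red"]), ("N", ["car"])])
print(learn_rules([t1, t2]))
```

Both trees yield the same rule NP → (ADJ)+ N, so the merged rule carries a frequency of 2. A rule selection step could then, for example, keep only rules whose frequency exceeds a threshold; that threshold strategy is likewise an assumption.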
11. The syntactic analysis method according to claim 10, wherein the production rules comprise:
Single-node repetition rules, which are production rules in which a single component in the consequent is repeated two or more times;
Multi-node repetition rules, which are production rules in which a fragment in the consequent is repeated two or more times;
Mixed repetition rules, which are production rules whose consequent contains both a single-node repeated part and a multi-node repeated part; and
Non-repetition rules, which are production rules whose consequent contains no repeated part.
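The four rule types of claim 11 can be told apart mechanically once repeated parts are written in a fixed notation. The sketch below assumes a hypothetical notation in which a repeated part appears in the consequent as a single item of the form `(A)+` or `(A B)+`; the type names returned are labels chosen for this example.

```python
def rule_type(rhs):
    """Classify a consequent by the kinds of repeated parts it contains."""
    # Width of each repeated unit, e.g. "(ADJ)+" -> 1, "(N P)+" -> 2.
    widths = [len(item[1:-2].split()) for item in rhs if item.endswith(")+")]
    has_single = any(w == 1 for w in widths)
    has_multi = any(w > 1 for w in widths)
    if has_single and has_multi:
        return "mixed"
    if has_single:
        return "single-node"
    if has_multi:
        return "multi-node"
    return "no-repetition"

print(rule_type(("(ADJ)+", "N")))
print(rule_type(("(N P)+", "V")))
print(rule_type(("(ADJ)+", "(N P)+")))
print(rule_type(("V", "NP")))
```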
12. The syntactic analysis method according to claim 10, wherein the rule application step comprises:
A rule compiling step of compiling the production rule set generated by the rule acquisition step into a rule query table of syntactic analysis rules;
A rule query step of querying the rule query table compiled by the rule compiling step; and
A syntactic analysis step of looking up, through the rule query step, the syntactic analysis rules in the rule query table that can be applied to the input sentence, and identifying the grammatical components of the input sentence and the relations between the components according to those syntactic analysis rules.
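A rule query table as in claim 12 can be approximated by a dictionary keyed on the consequent of each rule, so that the parser can ask which rules apply to a given span of symbols. The following Python sketch is only one possible layout; the function names and the exhaustive span enumeration are assumptions, and the claims do not prescribe a particular table structure.

```python
from collections import defaultdict

def compile_table(rules):
    """Index rules by consequent: tuple of symbols -> list of parent markers."""
    table = defaultdict(list)
    for lhs, rhs in rules:
        table[tuple(rhs)].append(lhs)
    return table

def applicable(table, symbols):
    """Return (start, end, parent) for every rule matching a span of `symbols`."""
    hits = []
    for start in range(len(symbols)):
        for end in range(start + 1, len(symbols) + 1):
            for lhs in table.get(tuple(symbols[start:end]), []):
                hits.append((start, end, lhs))
    return hits

rules = [("NP", ["ADJ", "N"]), ("NP", ["N"]), ("VP", ["V", "NP"])]
table = compile_table(rules)
print(applicable(table, ["ADJ", "N", "V", "N"]))
```

Repeating the query after each reduction (replacing a matched span with its parent marker) gives a simple bottom-up analysis loop.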
13. The syntactic analysis method according to claim 12, wherein the rule application step further comprises an ambiguity resolution step of selecting an optimal partial analysis result from the partial analysis candidates generated by the syntactic analysis step; and
The syntactic analysis step performs further syntactic analysis on the partial analysis result selected by the ambiguity resolution step, so as to output a final analysis result that meets the requirements.
14. The syntactic analysis method according to claim 13, wherein:
The rule compiling step adds intermediate generation marks and intermediate generation rules for the parts of the production rule set that contain regular expressions, replaces the parts of the production rules expressed by regular expressions with the intermediate generation marks, incorporates the intermediate generation marks into the phrase marker set of the syntactic analysis rule set, and incorporates the intermediate generation rules into the syntactic analysis rule set; and
The syntactic analysis step identifies intermediate generation marks in the same way as it identifies phrase markers, and looks up, through the rule query step, all syntactic analysis rules, including intermediate generation rules, that are applicable to the current input sentence, so as to identify the grammatical components of the input sentence and the relations between the components.
15. The syntactic analysis method according to claim 14, wherein the ambiguity resolution step determines the optimal production rules to use by calculating the probability P(S, t) of a parsing tree according to the following formula, so as to select the optimal partial analysis result:
$$P(S, t) = \prod_{r \in D(t)} p(r)$$
Wherein S is the input sentence, t is a parsing tree, r is a syntactic analysis rule used in the syntactic analysis process, D(t) is the set of all syntactic analysis rules used to generate the parsing tree t, and p(r) is the rule probability of the syntactic analysis rule r.
16. The syntactic analysis method according to claim 15, wherein:
The rule probability of an intermediate generation mark is calculated in the same way as the rule probability of a phrase symbol already present in syntactic analysis, and the rule probability of an intermediate generation rule is calculated in the same way as the rule probability of a non-intermediate generation rule; and
For a production rule containing a regular expression, the rule probability is computed from the rule probabilities p(r_i) of the intermediate generation rules r_i used when converting that rule, where r_i is an intermediate generation rule used in the conversion, p(r_i) is the rule probability of the intermediate generation rule r_i, n is a positive integer greater than or equal to 1, and i is any integer from 1 to n.
17. The syntactic analysis method according to any one of claims 14 to 16, wherein the syntax tree generation step comprises:
An intermediate mark cleaning step of removing the intermediate generation marks used in the syntactic analysis process; and
A phrase structure generation step of generating a phrase structure type parsing tree according to the grammatical components of the input sentence and the relations between the components, as output by the intermediate mark cleaning step after the intermediate generation marks have been removed.
18. The syntactic analysis method according to any one of claims 14 to 16, wherein the syntax tree generation step comprises:
An intermediate mark cleaning step of removing the intermediate generation marks used in the syntactic analysis process;
A head node marking step of marking head nodes according to the grammatical components of the input sentence and the relations between the components, as output by the intermediate mark cleaning step after the intermediate generation marks have been removed; and
A dependency structure generation step of generating a dependency syntactic relation graph of the input sentence according to the head nodes marked by the head node marking step.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200910118104.1A | 2009-02-23 | 2009-02-23 | Syntactic analysis device and syntactic analysis method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101814065A (en) | 2010-08-25 |
CN101814065B (en) | 2014-07-30 |
Family ID: 42621323
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN200910118104.1A | Expired - Fee Related | 2009-02-23 | 2009-02-23 | Syntactic analysis device and syntactic analysis method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101814065B (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1226327A (en) * | 1996-06-28 | 1999-08-18 | Microsoft Corporation (微软公司) | Method and system for computing semantic logical forms from syntax trees |
Non-Patent Citations (1)
Title |
---|
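Wei Rong (魏蓉). Syntactic analysis of basic declarative sentences in a restricted domain (限定领域的基本陈述句句法分析). China Master's Theses Full-text Database (electronic journal) (《中国优秀硕士学位论文全文数据库(电子期刊)》). 2008, (No. 8), F084-154. * |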
Also Published As
Publication number | Publication date |
---|---|
CN101814065A (en) | 2010-08-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20140730; Termination date: 20180223 |
CF01 | Termination of patent right due to non-payment of annual fee |