CN101814065B - Syntactic analysis device and syntactic analysis method - Google Patents

Syntactic analysis device and syntactic analysis method

Publication number
CN101814065B
Authority
CN
China
Prior art keywords
rule
production
rules
tree
syntactic analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN200910118104.1A
Other languages
Chinese (zh)
Other versions
CN101814065A (en
Inventor
孟遥
于浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Application filed by Fujitsu Ltd
Priority to CN200910118104.1A
Publication of CN101814065A
Application granted
Publication of CN101814065B
Legal status: Expired - Fee Related

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a syntactic analysis device and a syntactic analysis method. The syntactic analysis device, which uses regular expression rules, comprises a training treebank, a rule acquisition module, a rule application module, a syntax tree generation module, and a rule set. The rule acquisition module learns syntactic analysis rules from an annotated training treebank by statistical learning and generates the rule set used when analyzing input sentences. The rule acquisition module represents repeated parts in the right-hand side (consequent) of a production rule with regular expressions. The learned syntactic analysis rules may also contain context information. The rule application module analyzes an input sentence with the learned rule set and identifies the grammatical constituents of the sentence and the relations between them. Based on the analysis result output by the rule application module, the syntax tree generation module generates, according to the user's needs, either a dependency graph or a phrase-structure parse tree of the input sentence.

Description

Syntactic analysis device and syntactic analysis method
Technical field
The present invention relates to syntactic analysis technology for identifying the grammatical constituents of an input natural language sentence and the relations between those constituents. More particularly, the present invention relates to a syntactic analysis device and a syntactic analysis method that use regular expressions: syntactic analysis rules in regular expression form, together with a parsing algorithm, are used to analyze the grammatical constituents of an input sentence and to output a parse tree.
Background art
Identifying the grammatical constituents of natural language and the relations between them is both a difficulty and an important task in natural language processing. Many papers and patents have been published on this subject. For example, US Patent No. 5,386,556A discloses a natural language analysis apparatus and method (Natural language analyzing apparatus and method), and US Patent No. 5,930,746A discloses an apparatus and method for automatically parsing and translating natural language (Parsing and translating natural language sentences automatically).
When syntactic analysis is performed during natural language processing, a syntactic analysis rule base is required, and the quality and capability of that rule base are the most critical factors affecting the analysis result. In existing natural language analysis apparatuses and methods, however, the expressive power of syntactic analysis rules is limited. Syntactic rules therefore cannot be applied flexibly and efficiently to describe the grammatical features of natural language, and the syntactic constituents of an input sentence cannot be identified effectively and accurately.
Summary of the invention
In view of the foregoing, the present invention proposes a syntactic analysis device and a syntactic analysis method that use regular expressions to describe syntactic analysis rules, in order to identify the syntactic constituents of input natural language and the relations between them. With this device and method, syntactic analysis rules in regular expression form and a parsing algorithm are used to analyze the grammatical constituents of an input sentence and output a parse tree, which strengthens the ability to describe the regularities of natural language.
According to one aspect of the present invention, a syntactic analysis device is provided, comprising: a rule acquisition module configured to learn syntactic analysis rules from an annotated training treebank so as to generate a production rule set containing rules in regular expression form, wherein repeated parts in the right-hand side (consequent) of a production rule are represented with regular expressions; a rule application module configured to analyze an input sentence using the production rule set obtained by the rule acquisition module and to identify the grammatical constituents of the input sentence and the relations between them; and a syntax tree generation module configured to generate a dependency graph or a phrase-structure parse tree of the input sentence according to the constituents and relations output by the rule application module.
In a syntactic analysis device according to an embodiment of the invention, the rule acquisition module comprises: a tree fragment decomposition unit configured to decompose every syntax tree in the training treebank into tree fragments that serve as production rules, thereby forming a tree fragment set; a repeated fragment detection unit configured to detect whether the right-hand side of each production rule in the tree fragment set contains a repeated node sequence, and to express production rules that contain a repeated node sequence as production rules in regular expression form; and a duplicate rule merging unit configured to merge production rules of identical form generated by the repeated fragment detection unit into a single production rule, thereby forming the production rule set.
Preferably, the rule acquisition module further comprises a rule selection unit configured to select, according to a selection strategy, among the production rules generated by the duplicate rule merging unit, so as to generate a reduced production rule set.
According to one embodiment of the present invention, a tree fragment obtained by the tree fragment decomposition unit is expressed as <freq:xx>{<f_1>...<f_n>}P → Y_1 Y_2 ... Y_n, where n is a positive integer greater than or equal to 1. <freq:xx> is frequency information giving the number of occurrences of the tree fragment in the training treebank. <f_i> is an attribute feature that describes the contextual information, lexical features, or semantic features that apply when the production rule is used, where i is any integer from 1 to n. P is the upper node of the tree fragment and is a phrase label. Y_i is a child node of P and is either a phrase label or a lexical label. "{<f_1>...<f_n>}P → Y_1 Y_2 ... Y_n" means that, when the attributes <f_1>...<f_n> are present, the phrase P can be formed from Y_1 Y_2 ... Y_n. When the duplicate rule merging unit merges production rules of identical form generated by the repeated fragment detection unit into a single production rule, the frequencies of the merged rules are combined accordingly. The attribute features <f_i> in a production rule are optional.
Preferably, the production rules comprise: single-node repetition rules, in which a single constituent in the right-hand side is repeated two or more times; multi-node repetition rules, in which a fragment in the right-hand side is repeated two or more times; mixed repetition rules, whose right-hand side contains both a single-node repeated part and a multi-node repeated part; and non-repetition rules, whose right-hand side contains no repeated part.
In a syntactic analysis device according to an embodiment of the invention, the rule application module comprises: a rule compiling unit configured to compile the production rule set generated by the rule acquisition module into a rule query table of syntactic analysis rules; a rule query unit configured to query the rule query table compiled by the rule compiling unit; and a syntactic analysis unit configured to look up in the rule query table, through the rule query unit, the syntactic analysis rules applicable to the input sentence, and to identify the grammatical constituents of the input sentence and the relations between them according to those rules.
Preferably, the rule application module further comprises an ambiguity resolution unit configured to select the optimal partial analysis result from the partial analysis candidates generated by the syntactic analysis unit. The syntactic analysis unit then performs further syntactic analysis on the partial analysis result selected by the ambiguity resolution unit, so as to output a final analysis result that meets the requirements.
According to one embodiment of the present invention, the rule compiling unit adds an intermediate generation mark and intermediate generation rules for each part of the production rule set that contains a regular expression, replaces the regular expression part of the production rule with the intermediate generation mark, incorporates the intermediate generation marks into the phrase label set of the syntactic analysis rule set, and incorporates the intermediate generation rules into the syntactic analysis rule set. The syntactic analysis unit recognizes intermediate generation marks in the same way as phrase labels, and uses the rule query unit to look up all syntactic analysis rules, including intermediate generation rules, that apply to the current input sentence, so as to identify the grammatical constituents of the input sentence and the relations between them.
Preferably, the ambiguity resolution unit determines the optimal production rules to use, and thereby selects the optimal partial analysis result, by computing the probability P(S, t) of a parse tree according to the following formula:
$$P(S, t) = \prod_{r \in D(t)} p(r)$$
where S is the input sentence, t is a parse tree, r is a syntactic analysis rule used during the analysis, D(t) is the set of all syntactic analysis rules used to generate the parse tree t, and p(r) is the rule probability of the syntactic analysis rule r.
According to a preferred embodiment of the present invention, the rule probability of an intermediate generation mark is computed in the same way as that of the existing phrase labels in syntactic analysis, and the rule probability of an intermediate generation rule is computed in the same way as that of a non-intermediate generation rule. For a production rule r that contains a regular expression, the rule probability is computed as
$$p(r) = \prod_{i=1}^{n} p(r_i)$$
where r_i is an intermediate generation rule used when converting the rule that contains the regular expression, p(r_i) is the rule probability of the intermediate generation rule r_i, n is a positive integer greater than or equal to 1, and i is any integer from 1 to n.
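The two product formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the probability values are hypothetical and do not come from the patent.

```python
import math

def tree_probability(rule_probs):
    """P(S, t): product of p(r) over every rule r used to build tree t."""
    return math.prod(rule_probs)

def regex_rule_probability(intermediate_probs):
    """p(r) for a rule containing a regular expression: the product of the
    probabilities p(r_i) of the intermediate rules used to expand it."""
    return math.prod(intermediate_probs)

# Hypothetical probabilities, chosen only to exercise the formulas.
p_regex_rule = regex_rule_probability([0.5, 0.4])    # two intermediate rules
p_tree = tree_probability([0.9, p_regex_rule, 0.7])  # three rules in the tree
```

In practice a parser would more likely sum log probabilities to avoid underflow on long sentences; the plain product is kept here to mirror the formulas.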
In a syntactic analysis device according to an embodiment of the invention, the syntax tree generation module comprises: an intermediate mark clearing unit configured to remove the intermediate generation marks used during syntactic analysis; and a phrase structure generation unit configured to generate a phrase-structure parse tree according to the constituents and relations of the input sentence output by the intermediate mark clearing unit after the intermediate generation marks have been removed.
In a syntactic analysis device according to another embodiment of the invention, the syntax tree generation module may also comprise: an intermediate mark clearing unit configured to remove the intermediate generation marks used during syntactic analysis; a head node marking unit configured to mark head nodes according to the constituents and relations of the input sentence output by the intermediate mark clearing unit after the intermediate generation marks have been removed; and a dependency structure generation unit configured to generate a dependency graph of the input sentence according to the head nodes marked by the head node marking unit.
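As a rough illustration of what removing intermediate generation marks could look like, the sketch below splices intermediate nodes out of a parse tree and attaches their children to the nearest ordinary ancestor. The tuple-based tree encoding and the convention that intermediate labels look like "<a>" or "[BC]" are assumptions made for this example.

```python
def is_intermediate(label):
    # Assumed naming convention: intermediate marks look like "<a>" or "[BC]".
    return (label.startswith("<") and label.endswith(">")) or \
           (label.startswith("[") and label.endswith("]"))

def clear_intermediate(tree):
    """Return a copy of `tree` with intermediate nodes below the root spliced out.

    A tree is either a string (a leaf) or a tuple (label, child, child, ...).
    """
    if isinstance(tree, str):
        return tree
    label, children = tree[0], tree[1:]
    new_children = []
    for child in children:
        cleaned = clear_intermediate(child)
        if not isinstance(cleaned, str) and is_intermediate(cleaned[0]):
            new_children.extend(cleaned[1:])   # lift the grandchildren up
        else:
            new_children.append(cleaned)
    return (label, *new_children)

# Hypothetical analysis of "a a a" with nested intermediate <a> nodes.
parsed = ("X", ("<a>", ("<a>", "a", "a"), "a"))
flat = clear_intermediate(parsed)
```

Here `flat` has the three leaves attached directly to X, which is the shape the original rule X → a* describes.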
According to another aspect of the present invention, a syntactic analysis method is provided, comprising: a rule acquisition step of learning syntactic analysis rules from an annotated training treebank to generate a production rule set containing rules in regular expression form, wherein repeated parts in the right-hand side of a production rule are represented with regular expressions; a rule application step of analyzing an input sentence using the production rule set obtained in the rule acquisition step and identifying the grammatical constituents of the input sentence and the relations between them; and a syntax tree generation step of generating a dependency graph or a phrase-structure parse tree of the input sentence according to the constituents and relations output by the rule application step.
The syntactic analysis device and method proposed by the present invention can use syntactic analysis rules in regular expression form, which increases the descriptive power of the rules and overcomes the inflexible rule representation and weak expressive power of existing methods. The syntactic rule acquisition method and parsing algorithm proposed by the present invention can form a parser that supports regular expression rules, thereby achieving efficient and correct syntactic analysis.
In addition, the present invention also provides a computer program for implementing the above syntactic analysis method.
In addition, the present invention also provides a computer program product, at least in the form of a computer-readable medium, on which computer program code for implementing the above syntactic analysis method is recorded.
Brief description of the drawings
Embodiments of the invention are described below with reference to the accompanying drawings, from which the above and other objects, features and advantages of the invention can be understood more easily. In the drawings, identical or corresponding technical features or components are denoted by identical or corresponding reference numerals. In the drawings:
Fig. 1 is a schematic structural diagram of a syntactic analysis device using regular expression rules according to an embodiment of the invention;
Fig. 2 is a block diagram of the rule acquisition module shown in Fig. 1 according to an embodiment of the invention;
Fig. 3 is a block diagram of the rule application module shown in Fig. 1 according to an embodiment of the invention;
Fig. 4 is a block diagram of the syntax tree generation module shown in Fig. 1 according to an embodiment of the invention;
Fig. 5 is a schematic diagram for explaining upper nodes and lower nodes;
Fig. 6 shows an example syntax tree S1 used to illustrate how the rule acquisition module obtains syntactic rules;
Fig. 7 shows another example syntax tree S2 used to illustrate how the rule acquisition module obtains syntactic rules;
Fig. 8 shows the final syntactic analysis result output after an input sentence is analyzed by the syntactic analysis module according to an embodiment of the invention;
Fig. 9 shows the result of removing the intermediate nodes from the final syntactic analysis result of Fig. 8 by the intermediate mark clearing unit of the syntax tree generation module according to an embodiment of the invention;
Fig. 10 is a flowchart of a syntactic analysis method according to an embodiment of the invention;
Fig. 11 is a detailed flowchart of the processing performed in the rule acquisition step shown in Fig. 10 according to an embodiment of the invention;
Fig. 12 is a detailed flowchart of the processing performed in the rule application step shown in Fig. 10 according to an embodiment of the invention;
Fig. 13 is a detailed flowchart of the processing performed in the syntax tree generation step shown in Fig. 10 according to an embodiment of the invention; and
Fig. 14 is a block diagram of an information processing device for implementing the syntactic analysis method according to the invention.
Detailed description of the embodiments
Embodiments of the invention are described below with reference to the accompanying drawings. Note that, for clarity, representations and descriptions of components and processes that are unrelated to the invention or known to those of ordinary skill in the art are omitted from the drawings and the description.
First, to better explain the principle of the invention, the definitions of the single-node repetition rule, multi-node repetition rule, mixed repetition rule, and non-repetition rule used in the invention are given, followed by the definitions of upper node and lower node.
Definition 1: single-node repetition rule. A production rule in which a single constituent of the right-hand side is repeated two or more times is defined as a single-node repetition rule.
Definition 2: multi-node repetition rule. A production rule in which a fragment of the right-hand side is repeated two or more times is defined as a multi-node repetition rule.
Definition 3: mixed repetition rule. A production rule whose right-hand side contains both a single-node repeated part and a multi-node repeated part is defined as a mixed repetition rule.
Definition 4: non-repetition rule. A production rule whose right-hand side contains no repeated part is defined as a non-repetition rule.
Some examples of each type of production rule are given below to further illustrate the single-node repetition, multi-node repetition, mixed repetition, and non-repetition rules defined above.
(1) Single-node repetition rules
For example, in RL1: P → AAABC, the part "AAA" is a single-node repeated part. In the regular expression rule notation of the invention, the rule is written P → A*BC.
As another example, RL2: P → ABBCCD contains two single-node repeated parts, "BB" and "CC". In the regular expression rule notation of the invention, the rule is written P → AB*C*D.
(2) Multi-node repetition rules
For example, in RL3: P → ABABC, the part "ABAB" is a multi-node repeated part (the fragment "AB" repeated). In the regular expression notation of the invention, the rule is written P → [AB]+C.
As another example, in RL4: P → ABCBCDEFDEFG, "BCBC" and "DEFDEF" are multi-node repeated parts. In the regular expression notation of the invention, the rule is written P → A[BC]+[DEF]+G.
(3) Mixed repetition rules
For example, RL5: P → AAABCDBCDE contains both single-node and multi-node repetition: "AAA" is a single-node repeated part and "BCDBCD" is a multi-node repeated part. In the regular expression notation of the invention, the rule is written P → A*[BCD]+E.
(4) Non-repetition rules
For example, RL6: P → ABCDEABC contains neither single-node repetition nor multi-node repetition, and is therefore a non-repetition rule.
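The conversion illustrated by RL1 to RL6 can be sketched as a small function that scans a right-hand side from left to right and collapses adjacent repetitions. The greedy, shortest-unit-first scan is an assumption of this sketch; the patent does not spell out how repeated sequences are detected.

```python
def compress_rhs(symbols):
    """Rewrite a right-hand side in the regular-expression notation above.

    Returns (pattern, kind), where kind is "single", "multi", "mixed" or "none".
    """
    parts, kinds = [], set()
    i, n = 0, len(symbols)
    while i < n:
        for k in range(1, (n - i) // 2 + 1):       # try the shortest unit first
            unit = symbols[i:i + k]
            if symbols[i + k:i + 2 * k] == unit:   # unit occurs at least twice
                j = i + 2 * k
                while symbols[j:j + k] == unit:    # absorb further repetitions
                    j += k
                if k == 1:
                    parts.append(unit[0] + "*")
                    kinds.add("single")
                else:
                    parts.append("[" + "".join(unit) + "]+")
                    kinds.add("multi")
                i = j
                break
        else:                                       # no repetition starts here
            parts.append(symbols[i])
            i += 1
    kind = "mixed" if len(kinds) == 2 else (kinds.pop() if kinds else "none")
    return "".join(parts), kind

rl5 = compress_rhs(list("AAABCDBCDE"))   # right-hand side of RL5
```

Because the scan is greedy, a sequence such as AABAAB would be read as two single-node repeats rather than one multi-node repeat; a real implementation might prefer a different tie-breaking policy.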
Next, upper node and lower node are defined.
Definition 5: upper node. A node that has child nodes in a tree structure is defined as an upper node.
Definition 6: lower node. A node that has a parent node in a tree structure is defined as a lower node.
Whether a node is an upper node or a lower node depends on which part of the tree is under consideration: the same node in a tree can be an upper node in one context and a lower node in another. For example, consider the three nodes A2, B3, and C1 in the syntax tree shown in Fig. 5. C1 is a lower node of B3 and A2; A2 is an upper node of B3 and C1; and B3 is an upper node of C1 and a lower node of A2.
Next, the general operating principle of the syntactic analysis device according to an embodiment of the invention is described with reference to the drawings, in particular Figs. 1 to 4. As shown in Fig. 1, the syntactic analysis device using regular expression rules according to the embodiment comprises a training treebank 101, a rule acquisition module 102, a rule application module 103, a syntax tree generation module 104, and a rule set 105.
The rule acquisition module 102 learns syntactic analysis rules from the annotated training treebank 101, for example by statistical learning, and generates the rule set 105 used when analyzing input sentences. For repeated parts in the right-hand side of a production rule, the rule acquisition module 102 uses the regular expression forms defined above. The rule set 105 is therefore a set of production rules that includes rules in regular expression form. The syntactic analysis rules learned by the rule acquisition module 102 may also contain contextual information.
The rule application module 103 analyzes an input sentence using the syntactic analysis rule set 105 learned by the rule acquisition module 102, and identifies the grammatical constituents of the input sentence and the relations between them.
The syntax tree generation module 104 generates, according to the user's needs, a dependency graph or a phrase-structure parse tree of the input sentence from the analysis result output by the rule application module 103.
The three main modules of the syntactic analysis device of the invention, namely the rule acquisition module 102, the rule application module 103, and the syntax tree generation module 104, are described in detail below.
Fig. 2 shows a block diagram of the rule acquisition module 102 of Fig. 1 according to an embodiment of the invention. As shown in Fig. 2, the rule acquisition module 102 according to this embodiment comprises a syntactic analysis treebank 201, a tree fragment decomposition unit 202, a decomposition parameter input unit 203, a tree fragment set 204, a repeated fragment detection unit 205, a non-repetition rule unit 206, a single-node repetition rule unit 207, a multi-node repetition rule unit 208, a mixed repetition rule unit 209, a duplicate rule merging unit 211, and a production rule set 213.
The syntactic analysis treebank 201 is the treebank used for learning, that is, the training treebank 101 shown in Fig. 1. It annotates the grammatical constituents of the training sentences and the nesting relations between them. The invention has been applied in practice to two treebanks, the English Penn Treebank and the Chinese Penn Treebank. Note, however, that the proposed syntactic analysis device and method are language independent: for any language in which the grammatical constituents of sentences and the nesting relations between them have been annotated, syntactic analysis rules can be obtained with the technical solution of the invention and then used to analyze input sentences.
According to the decomposition parameters input through the decomposition parameter input unit 203, the tree fragment decomposition unit 202 decomposes every syntax tree in the syntactic analysis treebank 201 (that is, the training treebank 101) into a number of smaller subtrees, or tree fragments. The representation format of a tree fragment is as follows.
<freq:xx>{<f_1>...<f_n>}P → Y_1 Y_2 ... Y_n
Here, <freq:xx> is frequency information giving the number of occurrences of the tree fragment in the training treebank. <f_i> is an attribute feature, used mainly to describe the contextual information, lexical features, or semantic features that apply when the rule is used. Attribute features are determined by the decomposition parameters input through the decomposition parameter input unit 203; a rule may or may not contain attribute features.
P is the upper node of the tree fragment and is a phrase label. Y_i is a child node of P and is either a phrase label or a lexical label. "{<f_1>...<f_n>}P → Y_1 Y_2 ... Y_n" means that, when the attributes <f_1>...<f_n> are present, the phrase P can be formed from Y_1 Y_2 ... Y_n.
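One possible in-memory representation of this format is a small record type with a string formatter, sketched below. The class name, field names, and the example labels and feature string are hypothetical; the patent only defines the textual format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Production:
    lhs: str               # P: the upper node, a phrase label
    rhs: tuple             # Y_1 ... Y_n: child labels
    freq: int = 1          # <freq:xx>: occurrences in the training treebank
    features: tuple = ()   # optional attribute features <f_i>

    def __str__(self):
        feats = ""
        if self.features:
            feats = "{" + "".join(f"<{f}>" for f in self.features) + "}"
        return f"<freq:{self.freq}>{feats}{self.lhs} → {' '.join(self.rhs)}"

# Hypothetical fragments; "left:VB" is an invented context feature.
plain = Production("NP", ("DT", "NN"), freq=12)
with_context = Production("NP", ("DT", "NN"), freq=3, features=("left:VB",))
```

Making the record frozen keeps it hashable, so fragments can be used directly as dictionary keys when frequencies are counted or merged.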
As the representation format shows, a tree fragment is a rule in production form. Every syntax tree in the syntactic analysis treebank 201 can be decomposed into a number of tree fragments, and all decomposition results are stored in the tree fragment set 204, forming a rule set in production form.
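A minimal sketch of this decomposition, assuming context-free fragments and a tuple-based tree encoding (both assumptions of this example), emits one production per internal node and counts how often each one occurs. The sample tree is invented and is not one of the trees in the figures.

```python
from collections import Counter

def decompose(tree, fragments=None):
    """Collect one production (parent, children) per internal node of `tree`.

    A tree is either a string (a leaf) or a tuple (label, child, child, ...).
    """
    if fragments is None:
        fragments = Counter()
    if isinstance(tree, str):           # a leaf yields no production
        return fragments
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    fragments[(label, rhs)] += 1        # count occurrences of this fragment
    for child in children:
        decompose(child, fragments)
    return fragments

# Hypothetical tree, for illustration only.
sample = ("S", ("X", "a", "a", "a"), ("X", "a", "a"))
fragment_set = decompose(sample)
```

Passing the same `Counter` to `decompose` for every tree in a treebank accumulates frequencies across the whole collection.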
The tree fragment set (that is, the production rule set) is then input to the repeated fragment detection unit 205, which detects whether the "Y_1 Y_2 ... Y_n" part of each input tree fragment contains a repeated node sequence. According to the form of node repetition, the tree fragment set is divided among the single-node repetition rule unit 207, the multi-node repetition rule unit 208, the mixed repetition rule unit 209, and the non-repetition rule unit 206, as defined above. Rules that contain a repeated node sequence are expressed with the regular expression symbols "*" or "+".
After regular expressions are introduced, some rules are converted into identical forms. For example, "rule R1: P → ABBBC" and "rule R2: P → ABBC" are both expressed as P → AB*C in regular expression form.
Therefore, after all rules have been converted to regular expression form, duplicated rules of identical form must be removed. To this end, the duplicate rule merging unit 211 merges rules of identical form into one rule and combines their frequencies accordingly. The production rule set 213 used to analyze input sentences can thus be generated directly.
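The merging step amounts to grouping rules by their converted form and summing frequencies. The sketch below assumes rules arrive as (left-hand side, pattern, frequency) triples; that layout and the frequency values are invented for illustration.

```python
from collections import Counter

def merge_duplicates(rules):
    """Merge rules with identical (lhs, pattern), summing their frequencies."""
    merged = Counter()
    for lhs, pattern, freq in rules:
        merged[(lhs, pattern)] += freq
    return merged

# P → ABBBC and P → ABBC both become P → AB*C; frequencies are hypothetical.
merged = merge_duplicates([
    ("P", "AB*C", 3),   # originally P → ABBBC
    ("P", "AB*C", 2),   # originally P → ABBC
    ("Q", "AB", 1),
])
```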
In addition, to improve the efficiency of syntactic analysis, the rule acquisition module 102 according to an embodiment of the invention may further comprise a selection strategy unit 210 and a rule selection unit 212. According to the selection strategy provided by the selection strategy unit 210, the rule selection unit 212 selects among the production rules generated by the duplicate rule merging unit 211, thereby generating a reduced, efficient production rule set 213.
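The text does not say which selection strategy is used. Purely as an illustration, the sketch below assumes one simple possibility, a minimum-frequency threshold, to show where such a strategy would plug in; the threshold and the rule data are hypothetical.

```python
def select_rules(rule_freqs, min_freq=2):
    """Keep only rules seen at least `min_freq` times.

    The frequency threshold is an assumed strategy, not the patent's.
    """
    return {rule: freq for rule, freq in rule_freqs.items() if freq >= min_freq}

# Hypothetical merged rule set with frequencies.
reduced = select_rules({("P", "AB*C"): 5, ("Q", "AB"): 1}, min_freq=2)
```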
The operation of the rule acquisition module 102 is described below through a concrete example. Figs. 6 and 7 show two example syntax trees, S1 and S2, used to illustrate how the rule acquisition module 102 obtains syntactic rules. The module obtains syntactic analysis rules from S1 and S2 as follows.
First, the tree fragment decomposition unit 202 decomposes the syntax trees S1 and S2 according to the decomposition parameters provided by the decomposition parameter input unit 203. Assuming that in this example the decomposition parameters specify that trees are decomposed into context-free phrases, the tree fragment set formed by decomposing the syntax trees S1 and S2 of Figs. 6 and 7 is shown in Table 1 below.
Table 1: Tree fragment set
Next, the decomposed fragment set of Table 1 is passed to the repeated fragment detection unit 205, which checks for repeated nodes. The fragment set in this example does contain repeated nodes. After the repeated parts are expressed in regular expression form, the fragments are sent to units 206 to 209 according to their repetition type. The result of expressing the fragment set of Table 1 with regular expressions is shown in Table 2 below.
Table 2: Rule set expressed with regular expressions
The fragments expressed with regular expressions are input to the duplicate rule merging unit 211 as syntactic analysis rule candidates, and duplicate rules are merged there. After merging, the rule set formed from the rules of Table 2 contains 9 rules in total; the rule "X → a*" is the result of merging duplicates. The output of the duplicate rule merging unit 211 is shown in Table 3 below.
Table 3: Rule set after duplicate rule merging
Finally, the output of the duplicate rule merging unit 211 enters the rule selection unit 212, where rules are selected according to the selection strategy. This yields the syntactic analysis rule set that the rule application module 103 needs for syntactic analysis, which is passed to the rule application module 103 as the production rule set 213.
Fig. 3 shows a block diagram of the rule application module 103 of Fig. 1 according to an embodiment of the invention. The rule application module 103 applies the syntactic analysis rule set formed by the rule acquisition module 102 of Fig. 2, that is, the production rule set, to perform syntactic analysis on an input sentence and output its grammatical constituents and the relations between them. As shown in Fig. 3, the rule application module 103 according to this embodiment comprises a production rule set 302, a rule compiling unit 303, a rule query table 304, a rule query unit 305, a syntactic analysis unit 306, an ambiguity resolution unit 308, and so on.
The production rule set 302, which is the production rule set 213 formed by the rule acquisition module 102, first enters the rule compiling unit 303. The rule compiling unit 303 compiles the production rule set 302 into a rule query table 304 that can be used by the rule query unit 305.
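The layout of the rule query table is not described here. One plausible layout, assumed for this sketch, indexes each rule by its right-hand side so that a bottom-up parser can ask which parents a given sequence of children can form. The function names are hypothetical.

```python
from collections import defaultdict

def compile_query_table(rules):
    """Index rules by right-hand side for fast lookup of possible parents."""
    table = defaultdict(set)
    for lhs, rhs in rules:
        table[tuple(rhs)].add(lhs)
    return table

def query(table, rhs):
    """Return the set of parent labels that can be built from `rhs`."""
    return table.get(tuple(rhs), set())

# Rule X → <a> together with the two intermediate rules for a*.
table = compile_query_table([
    ("X", ["<a>"]),
    ("<a>", ["a", "a"]),
    ("<a>", ["<a>", "a"]),
])
```

A hash lookup keyed on the children keeps each query independent of the size of the rule set, which is the usual motivation for compiling rules ahead of parsing.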
After an input sentence 301 is input to the syntactic analysis unit 306, the syntactic analysis unit 306 uses the rule query unit 305 to look up in the rule query table 304 the syntactic analysis rules applicable to the input sentence 301, identifies the grammatical constituents of the input sentence 301 according to those rules, and outputs the analysis result. The analysis uses the CYK algorithm, starting from the word nodes of the input sentence 301: rule lookup first covers spans of 1 node, then spans of 2 nodes, and so on until the whole sentence is covered.
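The bottom-up, span-by-span process can be sketched as a CYK-style recognizer. This sketch assumes the grammar has already been reduced to lexical, unary, and binary rules, which is a simplification; the rule tables below encode X → <a>, <a> → a a, and <a> → <a> a and are written by hand for the example.

```python
from collections import defaultdict

def cyk_recognize(words, lexical, unary, binary):
    """Return the set of labels that span the whole sentence.

    lexical: word -> set of labels
    unary:   child label -> set of parent labels
    binary:  (left label, right label) -> set of parent labels
    """
    n = len(words)
    chart = defaultdict(set)   # chart[(i, j)]: labels spanning words[i:j]

    def close_unary(cell):
        # Apply unary rules repeatedly until no new label appears.
        agenda = list(cell)
        while agenda:
            child = agenda.pop()
            for parent in unary.get(child, ()):
                if parent not in cell:
                    cell.add(parent)
                    agenda.append(parent)

    for i, word in enumerate(words):            # spans of 1 node
        chart[(i, i + 1)] |= lexical.get(word, set())
        close_unary(chart[(i, i + 1)])
    for width in range(2, n + 1):               # spans of 2, 3, ... nodes
        for i in range(n - width + 1):
            j = i + width
            cell = chart[(i, j)]
            for k in range(i + 1, j):           # every split point
                for left in chart[(i, k)]:
                    for right in chart[(k, j)]:
                        cell |= binary.get((left, right), set())
            close_unary(cell)
    return chart[(0, n)]

lexical = {"a": {"a"}}
unary = {"<a>": {"X"}}
binary = {("a", "a"): {"<a>"}, ("<a>", "a"): {"<a>"}}
labels = cyk_recognize(["a", "a", "a"], lexical, unary, binary)
```

The recognizer only reports which labels cover the sentence; a full parser would also store back-pointers in each cell so that the tree can be rebuilt.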
Syntactic analysis rules may provide multiple partial analysis candidates 307, and optimum partial analysis result 309 is therefrom selected by ambiguity resolution portion 308.The partial analysis result 309 that ambiguity resolution portion 308 is selected enters syntactic analysis portion 306 and carries out further syntactic analysis, until the satisfied final analysis result 310 of syntactic analysis portion 306 output.
For the production-form syntactic analysis rules containing regular expressions that are generated by the rule acquisition module 102, embodiments of the invention add intermediate generation results for the parts of the rules that contain regular expressions. In this way the rule compiling portion 303, the rule query portion 305 and the syntactic analysis portion 306 solve the problem of using syntactic analysis rules that contain regular expressions during syntactic analysis. The concrete procedure is as follows.
For each part of the production rule set that contains a regular expression, the rule compiling portion 303 adds an intermediate generation mark and intermediate generation rules, and replaces the part of the rule expressed by the regular expression with the intermediate generation mark.
Specifically, for a part of the form x*, the intermediate generation mark <x> and the intermediate generation rules <x> → xx and <x> → <x>x are added. For a part of the form [x..y]+, the intermediate generation mark [x..y] and the intermediate generation rules [x..y] → x...yx...y and [x..y] → [x..y]x...y are added.
For example, in rule "R2: X → a*", the part "a*" contains a regular expression, so the intermediate mark "<a>" and the two intermediate generation rules "<a> → aa" and "<a> → <a>a" are added for "a*". Replacing the regular expression part of the rule with the intermediate generation mark converts rule R2 into X → <a>.
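The conversion just described can be sketched in Python. This is a minimal illustration rather than the implementation of the rule compiling portion 303: the function name `compile_rule`, the tuple representation of rules, and the string encodings "a*" and "[a b]+" for regular expression parts are assumptions made for this example only.

```python
def compile_rule(lhs, rhs):
    """Replace regular-expression parts of one production rule with
    intermediate marks and return the rewritten rule together with the
    intermediate generation rules that were introduced.

    Assumed encoding (hypothetical): "x*" is a repeated single node,
    "[x y]+" is a repeated node sequence with space-separated nodes.
    """
    new_rhs, intermediate_rules = [], []
    for sym in rhs:
        if len(sym) > 1 and sym.endswith("*"):
            x = sym[:-1]
            mark = f"<{x}>"
            # <x> -> x x   and   <x> -> <x> x
            intermediate_rules += [(mark, (x, x)), (mark, (mark, x))]
            new_rhs.append(mark)
        elif sym.startswith("[") and sym.endswith("]+"):
            units = tuple(sym[1:-2].split())
            mark = "[" + "".join(units) + "]"
            # [x..y] -> x..y x..y   and   [x..y] -> [x..y] x..y
            intermediate_rules += [(mark, units + units), (mark, (mark,) + units)]
            new_rhs.append(mark)
        else:
            new_rhs.append(sym)
    return (lhs, tuple(new_rhs)), intermediate_rules


rule, extra = compile_rule("X", ("a*",))
print(rule)   # ('X', ('<a>',))
print(extra)  # [('<a>', ('a', 'a')), ('<a>', ('<a>', 'a'))]
```

The intermediate rules are left-recursive, so one pair of rules covers any number of repetitions from two upward, which matches the rule pairs given for x* and [x..y]+.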
The output of the recurring rule merging portion 211 shown in Table 3, after intermediate generation marks and intermediate generation rules have been added, is shown in Table 4 below. The intermediate marks and intermediate generation rules used are shown in Table 5; 6 intermediate generation rules are introduced, denoted R13 to R18.
Table 4: Output of the recurring rule merging portion 211 after intermediate marks are added
Table 5: Intermediate marks and intermediate generation rules used when converting the output of the recurring rule merging portion 211
Intermediate mark    Intermediate generation rules
<a>                  R13: <a> → aa
                     R14: <a> → <a>a
[ab]                 R15: [ab] → abab
                     R16: [ab] → [ab]ab
<X>                  R17: <X> → XX
                     R18: <X> → <X>X
During rule compilation, the intermediate marks are merged into the phrase mark set of the rule set, and the intermediate generation rules are merged into the syntactic analysis rule set. The rule compiling portion 303 organizes all rules and marks in a unified way and generates the rule query table 304, which is convenient to query.
During analysis, the syntactic analysis portion 306 identifies intermediate marks in the same way as it identifies phrases. Using the CYK algorithm, it looks up through the rule query portion 305 all syntactic analysis rules (including the intermediate generation rules) that are applicable to the current input sentence 301, and generates the syntactic analysis result.
The operation of the rule application module 103 is illustrated below by analyzing the input sentence "aaababababcaa" with the rules in Table 4 and Table 5.
Table 6: Example of rules used when analyzing the sentence
Process      Current analysis state    Rules used
Process 1    aaababababcaa             R13: <a> → aa; R15: [ab] → abab; R13: <a> → aa
Process 2    <a>[ab]abc<a>             R2: X → <a>; R4: Z → [ab]
Process 3    XZabcX                    R5: M → abcX
Process 4    XZM                       R7: S → XZM
Process 5    S
Note that the above example shows only one possible way of using the syntactic analysis rules; it is neither optimal nor unique. Its purpose is to illustrate how syntactic analysis rules containing regular expressions can be used during syntactic analysis by introducing intermediate symbols and intermediate analysis rules.
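The example can be checked mechanically with a small recognizer. The sketch below is only illustrative and makes several assumptions: it contains only the rules named in the example (R2, R4, R5, R7 and the intermediate rules for <a> and [ab]), since the full rule table is not reproduced here; and because these rules are not in binary form, it uses a simple memoized span-based recognizer in place of the CYK algorithm. The names `RULES` and `recognizes` are hypothetical.

```python
from functools import lru_cache

# Only the rules that appear in the worked example; terminals are single characters.
RULES = {
    "S": [("X", "Z", "M")],                               # R7
    "X": [("<a>",)],                                      # R2 after conversion
    "Z": [("[ab]",)],                                     # R4
    "M": [("a", "b", "c", "X")],                          # R5
    "<a>": [("a", "a"), ("<a>", "a")],                    # intermediate rules for a*
    "[ab]": [("a", "b", "a", "b"), ("[ab]", "a", "b")],   # intermediate rules for [ab]+
}


def recognizes(start, sentence):
    """Return True if `start` derives the whole sentence.

    Intermediate marks are handled exactly like ordinary phrase marks.
    """
    tokens = tuple(sentence)

    @lru_cache(maxsize=None)
    def derives(symbol, i, j):
        # Does `symbol` derive tokens[i:j]?
        if symbol not in RULES:  # terminal symbol
            return j == i + 1 and tokens[i] == symbol
        return any(matches(rhs, i, j) for rhs in RULES[symbol])

    @lru_cache(maxsize=None)
    def matches(rhs, i, j):
        # Can the symbol sequence `rhs` be laid out over tokens[i:j]?
        if not rhs:
            return i == j
        head, rest = rhs[0], rhs[1:]
        # Every remaining symbol needs at least one token.
        return any(
            derives(head, i, k) and matches(rest, k, j)
            for k in range(i + 1, j - len(rest) + 1)
        )

    return derives(start, 0, len(tokens))


print(recognizes("S", "aaababababcaa"))  # True
print(recognizes("S", "aabcaa"))         # False
```

Because every symbol covers at least one token, the left-recursive intermediate rules always recurse on a strictly shorter span, so the recursion terminates.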
While rules are being applied, several analysis rules may be available in the same state, which causes analysis ambiguity. The multiple generation rule candidates 307 found by the rule query portion 305 are input to the ambiguity resolution portion 308, which selects the optimal rule to apply and thereby generates the partial analysis result 309.
According to one embodiment of the present invention, the ambiguity resolution portion 308 chooses the optimal rule to use by computing the probability P(S, t) of the parsing tree, and then outputs the optimal partial analysis result 309. The basic formula is as follows:
$$P(S, t) = \prod_{r \in D(T)} p(r)$$
where S is the input sentence, t is the parsing tree, r is a syntactic analysis rule used during syntactic analysis, D(T) is the set of all syntactic analysis rules used to generate the parsing tree t, and p(r) is the rule probability of the syntactic analysis rule r.
When rule probabilities are computed, intermediate symbols are treated the same as the phrase symbols that already exist in syntactic analysis, and intermediate analysis rules are treated the same as non-intermediate analysis rules. Any existing method for computing the probabilities of syntactic analysis rules can be used for the rules obtained in the processing according to the present invention. The estimation of rule probabilities is summarized briefly as follows:
1. Rules without regular expressions: the rule probability is estimated in the same way as in existing methods.
2. Intermediate generation rules: the intermediate symbol is treated as a phrase, and the rule probability is estimated with existing methods.
3. Rules containing regular expressions: $p(r) = \prod_{i=1}^{n} p(r_i)$, where r_i is an intermediate rule used when converting the rule containing the regular expression, and p(r_i) is the rule probability of that intermediate rule r_i.
In the rule application module 103 according to this embodiment of the invention, the problem of how to use regular expression rules is solved by introducing intermediate symbols and intermediate generation rules. The problem of estimating the probabilities of regular expression rules is solved by treating intermediate symbols and intermediate generation rules on an equal footing with phrase symbols and ordinary rules.
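The probability computations above reduce to products of rule probabilities, which the following Python sketch illustrates. The probability values are invented purely for illustration, and the function names are hypothetical. The sketch also assumes that a rule applied several times in a derivation contributes one factor per application, as is usual for probabilistic grammars; the source does not state this explicitly.

```python
import math

# Hypothetical rule probabilities, for illustration only.
RULE_PROB = {
    "R2": 0.4, "R4": 0.5, "R5": 0.3, "R7": 0.8,
    "R13": 0.5, "R14": 0.5, "R15": 0.25, "R16": 0.75,
}


def regex_rule_probability(intermediate_rule_ids, rule_prob):
    """p(r) for a rule containing a regular expression: the product of the
    probabilities of the intermediate rules used when converting it."""
    return math.prod(rule_prob[r] for r in intermediate_rule_ids)


def tree_probability(used_rule_ids, rule_prob):
    """P(S, t): the product of p(r) over the rules used to build tree t."""
    return math.prod(rule_prob[r] for r in used_rule_ids)


def best_candidate(candidates, rule_prob):
    """Ambiguity resolution: keep the candidate analysis with the highest
    probability. Each candidate is the list of rule ids it used."""
    return max(candidates, key=lambda rules: tree_probability(rules, rule_prob))


candidates = [["R13", "R15", "R13"], ["R13", "R14", "R15"]]
print(best_candidate(candidates, RULE_PROB))
```

In practice log-probabilities would normally be summed instead of multiplying raw probabilities, to avoid numerical underflow on long sentences.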
Fig. 8 shows the final syntactic analysis result obtained by analyzing the input sentence "aaababababcaa" with the rules of Table 6.
The syntactic analysis result output by the rule application module 103, for example the result shown in Fig. 8, is input to the syntax tree generation module 104, which generates a dependency syntactic relation graph or a phrase-structure parsing tree according to the user's needs.
Fig. 4 is a block diagram of the syntax tree generation module 104 shown in Fig. 1 according to an embodiment of the present invention. As shown in Fig. 4, the syntax tree generation module 104 according to this embodiment comprises an intermediate mark cleaning portion 402, a head node marking portion 403, a dependency structure generating unit 404 and a phrase structure generating unit 405.
The final analysis result 401 shown in Fig. 4 is the syntactic analysis result 310 output by the rule application module 103. The final analysis result 401 is first input to the intermediate mark cleaning portion 402, which removes the intermediate marks used during syntactic analysis.
After the intermediate marks are removed, if a phrase-structure parsing tree is required, the phrase structure generating unit 405 generates the phrase structure tree. If a dependency syntactic relation graph is required, the analysis result with the intermediate marks removed is input to the head node marking portion 403 for head node marking, after which the dependency structure generating unit 404 generates the dependency relations.
According to one embodiment of the present invention, the recursive function with which the intermediate mark cleaning portion 402 removes intermediate nodes is described below.
Function CleanTags(n_i)
Begin
  For each s_i, where s_i ∈ {sons of n_i}
    CleanTags(s_i)
    // if s_i is an intermediate symbol
    If s_i is a semi-finished label
      // all sons of s_i are raised to become sons of n_i
      move up all the sons of s_i as the sons of n_i
    Endif
  End for
End
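The recursive cleanup can be written as a short runnable Python sketch. The `Node` class, the `is_intermediate` test (labels of the form <x> or [x..y]) and the choice to splice the raised children in place of the removed node, preserving left-to-right order, are assumptions made for this illustration; the pseudocode above only states that the children are moved up.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def is_intermediate(label):
    # Assumed convention: intermediate marks look like <x> or [x..y].
    return (label.startswith("<") and label.endswith(">")) or (
        label.startswith("[") and label.endswith("]")
    )


def clean_tags(node):
    """Remove intermediate nodes below `node`, raising their children."""
    new_children = []
    for child in node.children:
        clean_tags(child)  # clean the subtree first (post-order)
        if child.children and is_intermediate(child.label):
            # Splice the already cleaned children in place of the intermediate node.
            new_children.extend(child.children)
        else:
            new_children.append(child)
    node.children = new_children
    return node


# X -> <a>, where <a> -> <a> a and the inner <a> -> a a
tree = Node("X", [Node("<a>", [Node("<a>", [Node("a"), Node("a")]), Node("a")])])
clean_tags(tree)
print([c.label for c in tree.children])  # ['a', 'a', 'a']
```

Cleaning children before their parent means that nested intermediate nodes, such as those produced by the left-recursive intermediate rules, collapse into a single flat list of children.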
Fig. 9 shows the result after the intermediate mark cleaning portion 402 removes the intermediate nodes from the final analysis result 310 of the rule application module 103 shown in Fig. 8.
The structure and operating principle of the syntactic analysis device according to embodiments of the present invention have been described above. The syntactic analysis method applied by this device is described below with reference to Figs. 10 to 13.
Fig. 10 is a flowchart of a syntactic analysis method according to an embodiment of the invention. As shown in Fig. 10, the syntactic analysis method according to this embodiment comprises a rule acquisition step S1001, a rule application step S1003 and a syntax tree generation step S1005.
First, in the rule acquisition step S1001, syntactic analysis rules are learned from an annotated training treebank, for example the training treebank 101 shown in Fig. 1, to generate a production rule set containing rules in regular expression form, in which repeated parts in the consequent (right-hand side) of a production rule are expressed with regular expressions. Then, in the rule application step S1003, the input sentence is analyzed with the production rule set obtained in the rule acquisition step S1001, and the grammatical constituents of the input sentence and the relations between them are identified. Finally, in the syntax tree generation step S1005, a dependency syntactic relation graph or a phrase-structure parsing tree of the input sentence is generated from the constituents and relations output by the rule application step S1003.
Fig. 11 is a detailed flowchart of the processing performed in the rule acquisition step S1001 shown in Fig. 10 according to an embodiment of the invention. As shown in Fig. 11, the rule acquisition method according to this embodiment comprises a tree fragment decomposition step S1101, a repeated fragment detection step S1103, a recurring rule merging step S1105 and a rule selection step S1107.
First, in the tree fragment decomposition step S1101, each syntax tree in the training treebank, for example the training treebank 101 shown in Fig. 1, is decomposed into tree fragments that serve as production rules, forming a tree fragment set, for example the tree fragment set 204 shown in Fig. 2.
Then, in the repeated fragment detection step S1103, it is detected whether the consequent of each production rule in the tree fragment set obtained in the tree fragment decomposition step S1101 contains a repeated node sequence, and production rules with repeated node sequences are expressed as production rules in regular expression form. Next, in the recurring rule merging step S1105, production rules of identical form generated by the repeated fragment detection step S1103 are merged into a single production rule, forming the production rule set.
Preferably, to improve the efficiency of syntactic analysis, the rule acquisition method according to an embodiment of the invention further comprises the rule selection step S1107, in which the production rules generated by the recurring rule merging step S1105 are selected according to a preset selection strategy to generate a reduced production rule set, such as the production rule set 213 shown in Fig. 2.
The tree fragments obtained in the tree fragment decomposition step can be expressed as
<freq:xx>{<f1>...<fn>}P → Y1Y2...Yn, where n is a positive integer greater than or equal to 1. Here <freq:xx> is frequency information giving the number of occurrences of the tree fragment in the training treebank; <fi> is an attribute feature describing the contextual information, vocabulary or semantic features under which the production rule is used, i being any integer from 1 to n; P denotes the parent node of the tree fragment and is a phrase mark; Yi denotes a child node of the P node and is a phrase mark or a vocabulary mark; and "{<f1>...<fn>}P → Y1Y2...Yn" means that, when the attributes <f1>...<fn> are present, the phrase P can be composed of Y1Y2...Yn.
It should be noted that when the recurring rule merging step S1105 merges production rules of identical form generated by the repeated fragment detection step S1103 into a single production rule, the frequencies of those production rules are merged accordingly.
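Repeated fragment detection and frequency-preserving merging can be illustrated with the following Python sketch. It is a simplified stand-in, not the patented procedure: the greedy left-to-right detection, the `max_unit` limit, the string encodings "a*" and "[a b]+", the omission of attribute features, and the sample fragments with their frequencies are all assumptions made for this example.

```python
from collections import Counter


def collapse_repeats(rhs, max_unit=3):
    """Rewrite repeated node sequences in a rule consequent in regular
    expression form: 'x*' for a repeated single node, '[x y]+' for a
    repeated multi-node sequence (greedy, shortest unit first)."""
    rhs = tuple(rhs)
    out, i = [], 0
    while i < len(rhs):
        match = None
        for unit in range(1, max_unit + 1):
            block = rhs[i:i + unit]
            reps = 1
            while len(block) == unit and rhs[i + reps * unit:i + (reps + 1) * unit] == block:
                reps += 1
            if reps >= 2:  # the block occurs at least twice in a row
                match = (block, reps)
                break
        if match is None:
            out.append(rhs[i])
            i += 1
        else:
            block, reps = match
            out.append(block[0] + "*" if len(block) == 1 else "[" + " ".join(block) + "]+")
            i += len(block) * reps
    return tuple(out)


def merge_rules(fragments):
    """Merge rules that have the same form after collapsing repeats,
    summing their frequencies. Each fragment is (parent, consequent, freq)."""
    merged = Counter()
    for lhs, rhs, freq in fragments:
        merged[(lhs, collapse_repeats(rhs))] += freq
    return dict(merged)


# Hypothetical tree fragments with made-up frequencies.
fragments = [
    ("X", ("a", "a"), 3),
    ("X", ("a", "a", "a"), 2),
    ("Z", ("a", "b", "a", "b"), 1),
    ("M", ("a", "b", "c", "X"), 4),
]
print(merge_rules(fragments))
```

Here the two X fragments of different lengths collapse to the single rule X → a* with frequency 5, which is the effect that frequency merging is meant to achieve.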
Fig. 12 is a detailed flowchart of the processing performed in the rule application step S1003 shown in Fig. 10 according to an embodiment of the invention.
As shown in Fig. 12, first, in the rule compiling step S1201, the production rule set generated by the rule acquisition step S1001 is compiled into a rule query table of syntactic analysis rules, for example the rule query table 304 shown in Fig. 3. Then, in the syntactic analysis step S1203, the syntactic analysis rules that can be applied to the input sentence are looked up in the rule query table 304 through the rule query step S1205, and the grammatical constituents of the input sentence and the relations between them are identified according to these rules. Here, the rule query step S1205 queries the rule query table 304 compiled in the rule compiling step S1201.
Next, in the ambiguity resolution step S1207, the optimal partial analysis result is selected from the partial analysis candidates generated by the syntactic analysis step S1203 and the rule query step S1205. Then, in step S1209, it is judged whether the analysis result obtained meets the requirements. If not, the flow returns to the syntactic analysis step S1203, and further syntactic analysis is performed on the partial analysis result selected in the ambiguity resolution step S1207.
If it is judged in step S1209 that the analysis result obtained after ambiguity resolution meets the requirements, the flow advances to step S1211, where the final syntactic analysis result is output.
According to a preferred embodiment of the present invention, in the rule compiling step S1201, intermediate generation marks and intermediate generation rules are added for the parts of the production rule set that contain regular expressions, and the parts of the production rules expressed by regular expressions are replaced with the intermediate generation marks. The intermediate generation marks are merged into the phrase mark set of the syntactic analysis rule set, and the intermediate generation rules are merged into the syntactic analysis rule set.
In the syntactic analysis step S1203, intermediate generation marks are identified in the same way as phrase marks, and all syntactic analysis rules applicable to the current input sentence, including the intermediate generation rules, are looked up through the rule query step S1205 to identify the grammatical constituents of the input sentence and the relations between them.
In the ambiguity resolution step S1207, the optimal production rule to use is determined by computing the probability P(S, t) of the parsing tree according to the formula below, in order to select the optimal partial analysis result:
$$P(S, t) = \prod_{r \in D(T)} p(r)$$
where S is the input sentence, t is the parsing tree, r is a syntactic analysis rule used during syntactic analysis, D(T) is the set of all syntactic analysis rules used to generate the parsing tree t, and p(r) is the rule probability of the syntactic analysis rule r.
The rule probability of an intermediate generation mark is computed in the same way as that of a phrase symbol existing in syntactic analysis, and the rule probability of an intermediate generation rule is computed in the same way as that of a non-intermediate generation rule.
For a production rule containing a regular expression, the rule probability is computed with the formula $p(r) = \prod_{i=1}^{n} p(r_i)$, where r_i is an intermediate generation rule used when converting the rule containing the regular expression, p(r_i) is the rule probability of that intermediate generation rule r_i, n is a positive integer greater than or equal to 1, and i is any integer from 1 to n.
Fig. 13 is a detailed flowchart of the processing performed in the syntax tree generation step S1005 shown in Fig. 10 according to an embodiment of the invention.
As shown in Fig. 13, first, in the intermediate mark cleanup step S1301, the intermediate generation marks used during syntactic analysis are removed. Then, in step S1303, it is judged whether a dependency syntactic relation graph or a phrase-structure parsing tree is to be generated.
If it is judged in step S1303 that a phrase-structure parsing tree is required, the flow advances to the phrase structure generation step S1305. There, a phrase-structure parsing tree is generated from the constituents of the input sentence and the relations between them, as output by the intermediate mark cleanup step S1301 after removal of the intermediate generation marks, and the generated phrase-structure parsing tree of the input sentence is output.
If it is judged in step S1303 that a dependency syntactic relation graph of the input sentence is required, the flow advances to the head node marking step S1307, in which head nodes are marked according to the constituents of the input sentence and the relations between them, as output by the intermediate mark cleanup step S1301 after removal of the intermediate generation marks. Then, in the dependency structure generation step S1309, the dependency syntactic relation graph of the input sentence is generated from the head nodes marked in the head node marking step S1307, and the generated dependency syntactic relation graph of the input sentence is output.
The specific embodiments of the syntactic analysis device and syntactic analysis method of the present invention have been described in detail above with reference to the accompanying drawings. As can be seen from the above description, by using syntactic analysis rules in regular expression form, the syntactic analysis device and method of the present invention increase the descriptive power of syntactic analysis rules and overcome the inflexibility and weak expressive power of rule expression in existing methods. The syntactic rule acquisition method and the parsing algorithm proposed by the present invention can form a parser that supports regular expression rules and achieves efficient and correct syntactic analysis.
In addition, by introducing regular expressions into production-form syntactic analysis rules, the syntactic analysis device and method of the present invention provide a way of describing repeated parts in grammatical structures. The number of repetitions of a repeated fragment is unrestricted, so a single syntactic analysis rule can analyze a class of phrases of different lengths. With the proposed device and method, the grammar can be described more flexibly: it can describe both the positional relations of the constituents and the recursive nature of local constituents in the syntactic structure. The rules obtained by the present invention are therefore more general and more robust.
In addition, in the training treebank used for learning, phrases that contain many constituents (i.e. phrases with many nodes, called long phrases herein) occur with low frequency. When described with existing syntactic analysis rules, such phrases are usually ignored; if such a long phrase appears during syntactic analysis, it cannot be analyzed, which gives rise to the rule sparsity problem. Long phrases, however, usually contain repeatable parts. With the syntactic analysis device and method of the present invention, the repeated parts in long phrases are merged, so that no matter how many times a part is repeated in a phrase, it can be expressed by the same syntactic analysis rule, which alleviates the rule sparsity problem to some extent.
In addition, when performing syntactic analysis, the syntactic analysis device and method of the present invention add intermediate generation results for the parts of rules that contain regular expressions, thereby solving the problem of using syntactic analysis rules that contain regular expressions during syntactic analysis.
The basic working principle of the present invention has been explained above using syntax trees taken from Chinese as a concrete example, but the syntactic analysis device and method described by the present invention can equally identify grammatical or semantic constituents in various other languages. In addition, the method of the invention can also be used for the analysis of genome sequences or for similar tasks of identifying a certain class of constituents in an input symbol sequence. It is therefore to be understood that all variations applied to other languages or symbol systems that do not depart from the gist of the present invention fall within the protection scope of the present invention.
The basic principle of the present invention has been described above in connection with specific embodiments. It should be noted, however, that those of ordinary skill in the art will understand that all or any of the steps or components of the method and device of the present invention can be implemented in hardware, firmware, software or a combination thereof, in any computing device (including processors, storage media and the like) or network of computing devices, and that those of ordinary skill in the art can accomplish this with their basic programming skills after reading the description of the present invention.
Therefore, the object of the present invention can also be achieved by running a program or a set of programs on any computing device. The computing device may be a well-known general-purpose device. The object of the present invention can thus also be achieved merely by providing a program product containing program code that implements the method or device. That is, such a program product also constitutes the present invention, and a storage medium storing such a program product also constitutes the present invention. Obviously, the storage medium may be any known storage medium or any storage medium developed in the future.
Where the embodiments of the present invention are implemented by software and/or firmware, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure, for example the general-purpose personal computer 700 shown in Fig. 14, which can perform various functions when various programs are installed.
In Fig. 14, a central processing unit (CPU) 701 performs various processing according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage portion 708 into a random access memory (RAM) 703. The RAM 703 also stores, as required, data needed when the CPU 701 performs the various processing. The CPU 701, the ROM 702 and the RAM 703 are connected to one another via a bus 704. An input/output interface 705 is also connected to the bus 704.
The following components are connected to the input/output interface 705: an input portion 706, including a keyboard, a mouse and the like; an output portion 707, including a display such as a cathode-ray tube (CRT) or liquid crystal display (LCD), a loudspeaker and the like; the storage portion 708, including a hard disk and the like; and a communication portion 709, including a network interface card such as a LAN card, a modem and the like. The communication portion 709 performs communication processing via a network such as the Internet.
A drive 710 is also connected to the input/output interface 705 as required. A removable medium 711, such as a magnetic disk, an optical disc, a magneto-optical disc or a semiconductor memory, is mounted on the drive 710 as required, so that a computer program read from it is installed into the storage portion 708 as required.
Where the above series of processing is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 711.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 711 shown in Fig. 14, which stores the program and is distributed separately from the device to provide the program to the user. Examples of the removable medium 711 include a magnetic disk (including a floppy disk (registered trademark)), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disc (including a MiniDisc (MD) (registered trademark)) and a semiconductor memory. Alternatively, the storage medium may be the ROM 702, a hard disk contained in the storage portion 708, or the like, in which the program is stored and which is distributed to the user together with the device containing it.
It should also be pointed out that, in the device and method of the present invention, the components or steps can obviously be decomposed and/or recombined. Such decomposition and/or recombination should be regarded as equivalents of the present invention. Moreover, the steps of the above series of processing can naturally be performed chronologically in the order described, but need not necessarily be performed in that order. Some steps can be performed in parallel or independently of one another.
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the terms "comprise", "include" or any other variant thereof in this application are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device comprising that element.

Claims (18)

1.一种句法分析装置,包括:1. A syntax analysis device, comprising: 规则获取模块,配置为从已经标注好的训练树库学习句法分析规则以生成包含正则表达式形式的产生式规则集,其中对于产生式规则的后项中的重复部分用正则表达式来表示;The rule acquisition module is configured to learn the syntax analysis rules from the marked training tree bank to generate a production rule set that includes a regular expression form, wherein the repeated part in the latter item of the production rule is represented by a regular expression; 规则应用模块,配置为使用规则获取模块获得的产生式规则集对输入句子进行分析,识别出输入句子的语法成份及成份问的关系;以及The rule application module is configured to use the production rule set obtained by the rule acquisition module to analyze the input sentence, and identify the grammatical components of the input sentence and the relationship between the components; and 句法树生成模块,配置为根据规则应用模块输出的输入句子的语法成份及成份问的关系,生成输入句子的依存句法关系图或者短语结构型句法分析树;The syntactic tree generation module is configured to generate a dependency syntactic relationship diagram or a phrase structure type syntactic analysis tree of the input sentence according to the grammatical components of the input sentence output by the rule application module and the relationship between the components; 其中规则获取模块包括:The rule acquisition module includes: 树片段分解部,配置为将训练树库中的每棵句法树分解为作为产生式规则的树片段,以形成树片段集;a tree fragment decomposition part configured to decompose each syntax tree in the training tree bank into tree fragments as production rules to form a set of tree fragments; 重复片段检测部,配置为检测树片段分解部分解得到的树片段集中的产生式规则的后项中是否有重复的结点序列,并将具有重复的结点序列的产生式规则表示为正则表达式形式的产生式规则;以及The repeated segment detection part is configured to detect whether there is a repeated node sequence in the subsequent item of the production rule in the tree segment set decomposed by the tree segment decomposition part, and express the production rule with the repeated node sequence as a regular expression production rules of the form; and 重复规则合并部,配置为将重复片段检测部生成的形式相同的产生式规则合并为一个产生式规则,以形成产生式规则集;The repeated rule merging unit is configured to combine the production rules of the same form generated by the 
repeated segment detection unit into one production rule to form a production rule set; 其中规则获取模块还包括规则选择部,配置为根据选择策略对重复规则合并部所生成的产生式规则进行选择,以生成缩减的产生式规则集;The rule acquisition module further includes a rule selection part, which is configured to select the production rules generated by the repeated rule merging part according to the selection strategy, so as to generate a reduced production rule set; 其中树片段分解部分解得到的树片段表示为<freq:xx>{<f1>…<fn>}P→Y1Y2…Yn,n为大于等于1的正整数,其中The tree fragment decomposed by the tree fragment decomposition part is expressed as <freq:xx>{<f 1 >…<f n >}P→Y 1 Y 2 …Y n , n is a positive integer greater than or equal to 1, where <freq:xx>为频度信息,表示该树片段在训练树库中的出现次数;<freq:xx> is the frequency information, indicating the number of occurrences of the tree segment in the training tree bank; <fi>为属性特征,用于描述使用该产生式规则时的上下文信息、词汇或语义特点,i为从1至n的任一整数;<f i > is an attribute feature, which is used to describe the context information, vocabulary or semantic features when using the production rule, i is any integer from 1 to n; P表示该树片段的上位结点,为一个短语标记;P represents the upper node of the tree segment, which is a phrase tag; Yi表示P结点的子结点,为一个短语标记或一个词汇标记;Y i represents the child node of the P node, which is a phrase mark or a vocabulary mark; “{<f1>…<fn>}P→Y1Y2…Yn”表示在出现了<f1>…<fn>属性的情况下,短语P可以由Y1Y2…Yn构成;以及“{<f 1 >…<f n >}P→Y 1 Y 2 …Y n ” means that in the case of <f 1 >…<f n > attributes, the phrase P can be composed of Y 1 Y 2 …Y n constitutes; and 其中,重复规则合并部在将重复片段检测部生成的形式相同的产生式规则合并为一个产生式规则时,将产生式规则的频度进行相应合并;Wherein, when the repeated rule merging unit merges the production rules of the same form generated by the repeated segment detection unit into one production rule, the frequencies of the production rules are correspondingly combined; 在产生式规则中属性特征<fi>为可选。The attribute feature <f i > is optional in production rules. 2.根据权利要求1所述的句法分析装置,其中产生式规则包括:2. 
The syntax analysis device according to claim 1, wherein the production rules include: 单结点重复型规则,为产生式规则的后项中的某一成份重复二次以上的产生式规则;Single-node repetitive rules are production rules that repeat more than two times for a certain component in the subsequent item of the production rule; 多结点重复型规则,为产生式规则的后项中的某一片段重复二次以上的产生式规则;A multi-node repetitive rule is a production rule that repeats more than two times for a segment in the subsequent item of the production rule; 混合重复型规则,为产生式规则的后项中既包含单结点重复部分也包含多结点重复部分的产生式规则;以及A mixed-repetition rule is a production rule whose successor contains both a single-node repeat part and a multi-node repeat part; and 无重复型规则,为产生式规则的后项中没有重复部分的产生式规则。A non-repetitive rule is a production rule in which there is no repeated part in the successor of the production rule. 3.根据权利要求1所述的句法分析装置,其中规则应用模块包括:3. The syntax analysis device according to claim 1, wherein the rule application module comprises: 规则编译部,配置为将规则获取模块生成的产生式规则集编译成句法分析规则的规则查询表;The rule compilation part is configured to compile the production rule set generated by the rule acquisition module into a rule lookup table of syntax analysis rules; 规则查询部,配置为查询规则编译部编译的规则查询表;以及The rule query unit is configured to query the rule query table compiled by the rule compilation unit; and 句法分析部,配置为通过规则查询部在规则查询表中查询能够应用于输入句子的句法分析规则,根据句法分析规则识别输入句子的语法成份及成份间的关系。The syntactic analysis unit is configured to query the syntactic analysis rules applicable to the input sentence in the rule query table through the rule query unit, and identify the grammatical components of the input sentence and the relationship between the components according to the syntactic analysis rules. 4.根据权利要求3所述的句法分析装置,其中规则应用模块还包括歧义消解部,配置为从句法分析部生成的局部分析候选中选择最优的局部分析结果;以及4. 
The syntactic analysis device according to claim 3, wherein the rule application module further comprises an ambiguity resolution unit configured to select the optimal local analysis result from the local analysis candidates generated by the syntactic analysis unit; and
the syntactic analysis unit performs further syntactic analysis on the local analysis result selected by the ambiguity resolution unit, so as to output a final analysis result that meets the requirements.
5. The syntactic analysis device according to claim 4, wherein:
the rule compilation unit adds intermediate generated tags and intermediate generation rules for the parts of the production rule set that contain regular expressions, and replaces the parts of the production rules expressed as regular expressions with the intermediate generated tags, the intermediate generated tags being incorporated into the phrase tag set of the syntactic analysis rule set and the intermediate generation rules being incorporated into the syntactic analysis rule set; and
the syntactic analysis unit identifies intermediate generated tags in the same manner as phrase tags, and queries, through the rule query unit, all syntactic analysis rules applicable to the current input sentence, including the intermediate generation rules, so as to identify the grammatical constituents of the input sentence and the relationships between them.
6.
The syntactic analysis device according to claim 5, wherein the ambiguity resolution unit determines the optimal production rule to use by calculating the probability P(S, t) of a syntactic analysis tree according to the following formula, so as to select the optimal local analysis result:
$$P(S, t) = \prod_{r \in D(T)} p(r)$$
where S is the input sentence, t is the syntactic analysis tree, r is a syntactic analysis rule used in the analysis process, D(T) is the set of all syntactic analysis rules used to generate the syntactic analysis tree t, and p(r) is the rule probability of syntactic analysis rule r.
7. The syntactic analysis device according to claim 6, wherein:
the rule probability of an intermediate generated tag is calculated in the same way as that of a phrase symbol present in the syntactic analysis, and the rule probability of an intermediate generation rule is calculated in the same way as that of a non-intermediate generation rule; and
for a production rule containing a regular expression, the rule probability is calculated by a formula in which r_i is an intermediate generation rule used when converting the rule containing the regular expression, p(r_i) is the rule probability of that intermediate generation rule r_i, n is a positive integer greater than or equal to 1, and i is any integer from 1 to n.
8.
The syntactic analysis device according to any one of claims 5 to 7, wherein the syntax tree generation module comprises:
an intermediate tag cleaning unit configured to remove the intermediate generated tags used during syntactic analysis; and
a phrase structure generation unit configured to generate a phrase-structure syntactic analysis tree according to the grammatical constituents of the input sentence and the relationships between them, as output by the intermediate tag cleaning unit after the intermediate generated tags have been removed.
9. The syntactic analysis device according to any one of claims 5 to 7, wherein the syntax tree generation module comprises:
an intermediate tag cleaning unit configured to remove the intermediate generated tags used during syntactic analysis;
a core node labeling unit configured to label core nodes according to the grammatical constituents of the input sentence and the relationships between them, as output by the intermediate tag cleaning unit after the intermediate generated tags have been removed; and
a dependency structure generation unit configured to generate a dependency syntactic relationship graph of the input sentence according to the core nodes labeled by the core node labeling unit.
10.
A syntactic analysis method, comprising:
a rule acquisition step of learning syntactic analysis rules from an annotated training treebank to generate a production rule set containing rules in regular-expression form, wherein repeated parts on the right-hand side of a production rule are represented by regular expressions;
a rule application step of analyzing an input sentence using the production rule set obtained in the rule acquisition step, and identifying the grammatical constituents of the input sentence and the relationships between them; and
a syntax tree generation step of generating a dependency syntactic relationship graph or a phrase-structure syntactic analysis tree of the input sentence according to the grammatical constituents of the input sentence and the relationships between them output by the rule application step;
wherein the rule acquisition step comprises:
a tree fragment decomposition step of decomposing each syntax tree in the training treebank into tree fragments serving as production rules, to form a tree fragment set;
a repeated segment detection step of detecting whether the right-hand side of each production rule in the tree fragment set obtained by the tree fragment decomposition step contains a repeated node sequence, and expressing production rules that contain a repeated node sequence as production rules in regular-expression form; and
a repeated rule merging step of merging production rules of the same form generated by the repeated segment detection step into one production rule, to form a production rule set;
wherein the rule acquisition step further comprises a rule selection step of selecting, according to a selection strategy, among the production rules generated by the repeated rule merging step, so as to generate a reduced production rule set;
wherein a tree fragment obtained by the tree fragment decomposition step is expressed as <freq:xx>{<f_1>…<f_n>} P → Y_1 Y_2 … Y_n, n being a positive integer greater than or equal to 1, where
<freq:xx> is frequency information indicating the number of occurrences of the tree fragment in the training treebank;
<f_i> is an attribute feature describing the context information, lexical characteristics or semantic characteristics that apply when the production rule is used, i being any integer from 1 to n;
P denotes the parent node of the tree fragment and is a phrase tag;
Y_i denotes a child node of node P and is a phrase tag or a lexical tag;
"{<f_1>…<f_n>} P → Y_1 Y_2 … Y_n" means that, when the attributes <f_1>…<f_n> are present, the phrase P can be composed of Y_1 Y_2 … Y_n; and
wherein, when merging production rules of the same form generated by the repeated segment detection step into one production rule, the repeated rule merging step merges the frequencies of those production rules accordingly;
the attribute feature <f_i> is optional in a production rule.
11.
The syntactic analysis method according to claim 10, wherein the production rules include:
single-node repetition rules, which are production rules in which a single constituent on the right-hand side is repeated two or more times;
multi-node repetition rules, which are production rules in which a segment on the right-hand side is repeated two or more times;
mixed repetition rules, which are production rules whose right-hand side contains both a single-node repeated part and a multi-node repeated part; and
non-repetition rules, which are production rules whose right-hand side contains no repeated part.
12. The syntactic analysis method according to claim 10, wherein the rule application step comprises:
a rule compilation step of compiling the production rule set generated by the rule acquisition step into a rule lookup table of syntactic analysis rules;
a rule query step of querying the rule lookup table compiled by the rule compilation step; and
a syntactic analysis step of looking up, through the rule query step, the syntactic analysis rules in the rule lookup table that are applicable to the input sentence, and identifying the grammatical constituents of the input sentence and the relationships between them according to those rules.
13.
The syntactic analysis method according to claim 12, wherein the rule application step further comprises an ambiguity resolution step of selecting the optimal local analysis result from the local analysis candidates generated by the syntactic analysis step; and
the syntactic analysis step performs further syntactic analysis on the local analysis result selected by the ambiguity resolution step, so as to output a final analysis result that meets the requirements.
14. The syntactic analysis method according to claim 13, wherein:
the rule compilation step adds intermediate generated tags and intermediate generation rules for the parts of the production rule set that contain regular expressions, and replaces the parts of the production rules expressed as regular expressions with the intermediate generated tags, the intermediate generated tags being incorporated into the phrase tag set of the syntactic analysis rule set and the intermediate generation rules being incorporated into the syntactic analysis rule set; and
the syntactic analysis step identifies intermediate generated tags in the same manner as phrase tags, and queries, through the rule query step, all syntactic analysis rules applicable to the current input sentence, including the intermediate generation rules, so as to identify the grammatical constituents of the input sentence and the relationships between them.
15.
The syntactic analysis method according to claim 14, wherein the ambiguity resolution step determines the optimal production rule to use by calculating the probability P(S, t) of a syntactic analysis tree according to the following formula, so as to select the optimal local analysis result:
$$P(S, t) = \prod_{r \in D(T)} p(r)$$
where S is the input sentence, t is the syntactic analysis tree, r is a syntactic analysis rule used in the analysis process, D(T) is the set of all syntactic analysis rules used to generate the syntactic analysis tree t, and p(r) is the rule probability of syntactic analysis rule r.
16. The syntactic analysis method according to claim 15, wherein:
the rule probability of an intermediate generated tag is calculated in the same way as that of a phrase symbol present in the syntactic analysis, and the rule probability of an intermediate generation rule is calculated in the same way as that of a non-intermediate generation rule; and
for a production rule containing a regular expression, the rule probability is calculated by a formula in which r_i is an intermediate generation rule used when converting the rule containing the regular expression, p(r_i) is the rule probability of that intermediate generation rule r_i, n is a positive integer greater than or equal to 1, and i is any integer from 1 to n.
17.
The syntactic analysis method according to any one of claims 14 to 16, wherein the syntax tree generation step comprises:
an intermediate tag cleaning step of removing the intermediate generated tags used during syntactic analysis; and
a phrase structure generation step of generating a phrase-structure syntactic analysis tree according to the grammatical constituents of the input sentence and the relationships between them, as output by the intermediate tag cleaning step after the intermediate generated tags have been removed.
18. The syntactic analysis method according to any one of claims 14 to 16, wherein the syntax tree generation step comprises:
an intermediate tag cleaning step of removing the intermediate generated tags used during syntactic analysis;
a core node labeling step of labeling core nodes according to the grammatical constituents of the input sentence and the relationships between them, as output by the intermediate tag cleaning step after the intermediate generated tags have been removed; and
a dependency structure generation step of generating a dependency syntactic relationship graph of the input sentence according to the core nodes labeled in the core node labeling step.
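As an informal illustration of the rule acquisition steps recited in the claims (detecting repeated node sequences on the right-hand side of a production rule, rewriting them in regular-expression form, and merging rules of the same form while combining their frequencies), a minimal Python sketch follows. It also shows the product of rule probabilities used to score a syntactic analysis tree. The function names, the `X+` and `(A B)+` notation, the greedy shortest-segment-first detection, the relative-frequency estimate of p(r), and the sample fragments are all assumptions made for illustration; the claims do not prescribe a particular regular-expression syntax, detection algorithm, or probability estimate.

```python
import math
from collections import Counter


def collapse_repeats(rhs):
    """Rewrite consecutive repeats on a right-hand side in regex-like form.

    Hypothetical notation: a node repeated two or more times becomes "X+",
    a multi-node segment repeated two or more times becomes "(A B)+".
    Detection is greedy, left to right, shortest segment first.
    """
    out, i = [], 0
    while i < len(rhs):
        for seg in range(1, (len(rhs) - i) // 2 + 1):
            unit = rhs[i:i + seg]
            count = 1
            # Count how many times `unit` occurs back to back from position i.
            while rhs[i + count * seg:i + (count + 1) * seg] == unit:
                count += 1
            if count >= 2:
                body = unit[0] if seg == 1 else "(" + " ".join(unit) + ")"
                out.append(body + "+")
                i += count * seg
                break
        else:
            # No repeated segment starts here; keep the node unchanged.
            out.append(rhs[i])
            i += 1
    return tuple(out)


def acquire_rules(fragments):
    """Merge fragments of the same form into one rule and sum frequencies."""
    rules = Counter()
    for parent, rhs in fragments:
        rules[(parent, collapse_repeats(list(rhs)))] += 1
    return rules


def rule_probabilities(rules):
    """Relative-frequency estimate per parent tag (an assumption here)."""
    totals = Counter()
    for (parent, _), freq in rules.items():
        totals[parent] += freq
    return {rule: freq / totals[rule[0]] for rule, freq in rules.items()}


def tree_probability(probs, rules_used):
    """P(S, t) as the product of p(r) over the rules used to build t."""
    return math.prod(probs[r] for r in rules_used)


# Hypothetical tree fragments (parent tag, child tags) from a toy treebank.
fragments = [
    ("NP", ["NN", "NN", "NN"]),
    ("NP", ["NN", "NN"]),
    ("NP", ["NN", "CC", "NN", "CC", "NN"]),
    ("VP", ["VV", "NP"]),
]
rules = acquire_rules(fragments)
probs = rule_probabilities(rules)
score = tree_probability(probs, [("VP", ("VV", "NP")), ("NP", ("NN+",))])
```

In this toy run the first two fragments collapse to the same rule NP → NN+, so their frequencies are combined, which is the effect the repeated rule merging step describes.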
CN200910118104.1A 2009-02-23 2009-02-23 Syntactic analysis device and syntactic analysis method Expired - Fee Related CN101814065B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910118104.1A CN101814065B (en) 2009-02-23 2009-02-23 Syntactic analysis device and syntactic analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910118104.1A CN101814065B (en) 2009-02-23 2009-02-23 Syntactic analysis device and syntactic analysis method

Publications (2)

Publication Number Publication Date
CN101814065A CN101814065A (en) 2010-08-25
CN101814065B true CN101814065B (en) 2014-07-30

Family

ID=42621323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910118104.1A Expired - Fee Related CN101814065B (en) 2009-02-23 2009-02-23 Syntactic analysis device and syntactic analysis method

Country Status (1)

Country Link
CN (1) CN101814065B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102637180B (en) * 2011-02-14 2014-06-18 汉王科技股份有限公司 Character post processing method and device based on regular expression
JP2013196504A (en) * 2012-03-21 2013-09-30 Toshiba Corp Gist extracting device and program
CN103294666B (en) * 2013-05-28 2017-03-01 百度在线网络技术(北京)有限公司 Grammar compilation method, semantic analytic method and corresponding intrument
CN104281564B (en) * 2014-08-12 2017-08-08 中国科学院计算技术研究所 A kind of bilingual unsupervised syntactic analysis method and system
JP2016149023A (en) * 2015-02-12 2016-08-18 富士通株式会社 Information management apparatus, information management method, and information management program
CN105740234A (en) * 2016-01-29 2016-07-06 昆明理工大学 MST algorithm based Vietnamese dependency tree library construction method
CN106202395B (en) * 2016-07-11 2019-12-31 上海智臻智能网络科技股份有限公司 Text clustering method and device
CN106843849B (en) * 2016-12-28 2020-04-14 南京大学 An Automatic Synthesis Method for Code Models of Document-Based Library Functions
CN108021559B (en) * 2018-02-05 2022-05-03 威盛电子股份有限公司 Natural language understanding system and semantic analysis method
CN109684644A (en) * 2018-12-27 2019-04-26 南京大学 A Context-based Dependency Syntax Tree Construction Method
CN111767709A (en) * 2019-03-27 2020-10-13 武汉慧人信息科技有限公司 Logic method for carrying out error correction and syntactic analysis on English text

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1226327A (en) * 1996-06-28 1999-08-18 微软公司 Method and system for computing semantic logical forms from syntax trees

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
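Wei Rong (魏蓉). Syntactic analysis of basic declarative sentences in a restricted domain (限定领域的基本陈述句句法分析). China Master's Theses Full-text Database (Electronic Journal) (《中国优秀硕士学位论文全文数据库(电子期刊)》), 2008, No. 8, F084-154. *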

Also Published As

Publication number Publication date
CN101814065A (en) 2010-08-25

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140730

Termination date: 20180223