CN101814065B - Syntactic analysis device and syntactic analysis method

Info

Publication number: CN101814065B
Application number: CN200910118104.1A
Authority: CN
Inventors: 孟遥; 于浩
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2009-02-23
Filing date: 2009-02-23
Publication date: 2014-07-30
Anticipated expiration: 2029-02-23
Also published as: CN101814065A

Abstract

The invention discloses a syntactic analysis device and a syntactic analysis method. The syntactic analysis device, which uses regular-expression rules, includes a training treebank, a rule acquisition module, a rule application module, a syntax tree generation module and a rule set. The rule acquisition module learns syntactic analysis rules from the annotated training treebank by a statistical learning method and generates the rule set used when analyzing an input sentence. The rule acquisition module represents any repeated part in the right-hand side of a production rule with a regular expression. The learned syntactic analysis rules may also include context information. The rule application module uses the learned rule set to analyze the input sentence and to identify its grammatical constituents and the relations between them. According to the user's needs, the syntax tree generation module generates a dependency relation graph or a phrase-structure parse tree of the input sentence from the analysis results output by the rule application module.

Description

Syntactic analysis device and syntactic analysis method

Technical field

The present invention relates to syntactic analysis technology for identifying, from an input natural language sentence, the grammatical constituents of the sentence and the relations between those constituents. More particularly, the present invention relates to a syntactic analysis device and a syntactic analysis method that use regular expressions: syntactic analysis rules in regular-expression form and a parsing algorithm are used to analyze the grammatical constituents of an input sentence and to output a parse tree.

Background art

Identifying the grammatical constituents of natural language and the relations between them is both a difficulty and a key task in natural language processing. Research in this area has produced many papers and patents. For example, U.S. Pat. No. 5,386,556 A discloses a natural language analyzing apparatus and method ("Natural language analyzing apparatus and method"), and U.S. Pat. No. 5,930,746 A discloses an apparatus and method for automatically parsing and translating natural language sentences ("Parsing and translating natural language sentences automatically").

Syntactic analysis of natural language relies on a syntactic analysis rule base, and the quality and expressive power of that rule base are the most critical factors affecting the analysis result. In existing natural language analysis apparatuses and methods, however, the expressive power of the syntactic analysis rules is limited. Syntactic rules therefore cannot describe the grammatical features of natural language flexibly and efficiently, and the syntactic constituents of an input sentence cannot be identified effectively and accurately.

Summary of the invention

In view of the foregoing, the present invention proposes a syntactic analysis device and a syntactic analysis method that use regular expressions to describe syntactic analysis rules, in order to identify the syntactic constituents of an input natural language sentence and the relations between them. The device and method use syntactic analysis rules in regular-expression form together with a parsing algorithm to analyze the grammatical constituents of the input sentence and to output a parse tree, thereby strengthening the ability to describe the rules of natural language.

According to one aspect of the present invention, a syntactic analysis device is provided, comprising: a rule acquisition module configured to learn syntactic analysis rules from an annotated training treebank so as to generate a production rule set that includes rules in regular-expression form, wherein a repeated part in the right-hand side of a production rule is represented with a regular expression; a rule application module configured to analyze an input sentence using the production rule set obtained by the rule acquisition module and to identify the constituents of the input sentence and the relations between them; and a syntax tree generation module configured to generate a dependency relation graph or a phrase-structure parse tree of the input sentence from the grammatical constituents and inter-constituent relations output by the rule application module.

In a syntactic analysis device according to an embodiment of the invention, the rule acquisition module comprises: a tree fragment decomposition unit configured to decompose each syntax tree in the training treebank into tree fragments serving as production rules, so as to form a tree fragment set; a repeated fragment detection unit configured to detect whether the right-hand side of each production rule in the tree fragment set contains a repeated node sequence, and to express each production rule that has a repeated node sequence as a production rule in regular-expression form; and a duplicate rule merging unit configured to merge production rules of identical form generated by the repeated fragment detection unit into a single production rule, so as to form the production rule set.

Preferably, the rule acquisition module further comprises a rule selection unit configured to select, according to a selection strategy, among the production rules generated by the duplicate rule merging unit, so as to generate a reduced production rule set.

According to one embodiment of the present invention, a tree fragment obtained by the tree fragment decomposition unit is expressed as <freq:xx>{<f₁>...<f_n>} P → Y₁ Y₂ ... Y_n, where n is a positive integer greater than or equal to 1. Here <freq:xx> is frequency information giving the number of occurrences of the tree fragment in the training treebank; <f_i> is an attribute feature describing the context information, lexical features or semantic features under which the production rule is used, with i any integer from 1 to n; P is the upper node of the tree fragment and is a phrase label; Y_i is a child node of P and is either a phrase label or a lexical label. "{<f₁>...<f_n>} P → Y₁ Y₂ ... Y_n" means that when the attributes <f₁>...<f_n> are present, the phrase P can be formed by Y₁ Y₂ ... Y_n. When the duplicate rule merging unit merges production rules of identical form generated by the repeated fragment detection unit into a single production rule, the frequencies of those rules are merged accordingly. The attribute features <f_i> in a production rule are optional.

Preferably, the production rules comprise: single-node repetition rules, in which a single constituent in the right-hand side is repeated two or more times; multi-node repetition rules, in which a fragment in the right-hand side is repeated two or more times; mixed repetition rules, whose right-hand side contains both a single-node repeated part and a multi-node repeated part; and non-repetition rules, whose right-hand side contains no repeated part.

In a syntactic analysis device according to an embodiment of the invention, the rule application module comprises: a rule compiling unit configured to compile the production rule set generated by the rule acquisition module into a rule query table of syntactic analysis rules; a rule query unit configured to query the rule query table compiled by the rule compiling unit; and a syntactic analysis unit configured to look up, through the rule query unit, the syntactic analysis rules in the rule query table that can be applied to the input sentence, and to identify the grammatical constituents of the input sentence and the relations between them according to those rules.

Preferably, the rule application module further comprises a disambiguation unit configured to select the optimal partial analysis result from the partial analysis candidates generated by the syntactic analysis unit. The syntactic analysis unit then performs further syntactic analysis on the partial analysis result selected by the disambiguation unit, so as to output a final analysis result that meets the requirements.

According to one embodiment of the present invention, the rule compiling unit adds an intermediate label and intermediate rules for each part of the production rule set that contains a regular expression, replaces the part of the production rule expressed by the regular expression with the intermediate label, merges the intermediate labels into the phrase label set of the syntactic analysis rule set, and merges the intermediate rules into the syntactic analysis rule set. The syntactic analysis unit recognizes intermediate labels in the same way as it recognizes phrase labels, and looks up, through the rule query unit, all syntactic analysis rules applicable to the current input sentence, including the intermediate rules, so as to identify the grammatical constituents of the input sentence and the relations between them.

Preferably, the disambiguation unit determines the optimal production rules to use by computing the probability P(S, t) of a parse tree according to the following formula, so as to select the optimal partial analysis result:

$$P(S, t) = \prod_{r \in D(t)} p(r)$$

where S is the input sentence, t is a parse tree, r is a syntactic analysis rule used during the analysis, D(t) is the set of all syntactic analysis rules used to generate parse tree t, and p(r) is the rule probability of syntactic analysis rule r.

According to a preferred embodiment of the present invention, the rule probability of an intermediate label is computed in the same way as the rule probability of an ordinary phrase label in syntactic analysis, and the rule probability of an intermediate rule is computed in the same way as that of a non-intermediate rule. For a production rule that contains a regular expression, the rule probability is computed by the formula

$$p(r) = \prod_{i=1}^{n} p(r_i)$$

where r_i is an intermediate rule used when converting the rule that contains the regular expression, p(r_i) is the rule probability of intermediate rule r_i, n is a positive integer greater than or equal to 1, and i is any integer from 1 to n.
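As a minimal sketch of the two product formulas above, the following Python computes a parse-tree probability and the probability of a rule containing a regular expression. The rule strings and the probability values are hypothetical illustrations, not values from the patent, and how p(r) itself is estimated is not specified here.

```python
import math

def tree_probability(rules_used, rule_prob):
    """P(S, t): product of p(r) over every rule r used to build parse tree t."""
    return math.prod(rule_prob[r] for r in rules_used)

def regex_rule_probability(intermediate_rules, rule_prob):
    """p(r) for a regular-expression rule: product of p(r_i) over the
    intermediate rules r_i used when converting that rule."""
    return math.prod(rule_prob[r_i] for r_i in intermediate_rules)

# Hypothetical rule probabilities, chosen only for illustration.
rule_prob = {
    "X -> <a>": 0.25,
    "<a> -> a a": 0.5,
    "<a> -> <a> a": 0.5,
}

# A tree over "a a" that uses X -> <a> and <a> -> a a.
p_tree = tree_probability(["X -> <a>", "<a> -> a a"], rule_prob)

# The rule X -> a* converted through its two intermediate rules.
p_regex = regex_rule_probability(["<a> -> a a", "<a> -> <a> a"], rule_prob)
```

In practice a parser would likely sum log-probabilities instead of multiplying, to avoid underflow on long sentences; that is a common implementation choice rather than something stated in the text.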

In a syntactic analysis device according to an embodiment of the invention, the syntax tree generation module comprises: an intermediate label removal unit configured to remove the intermediate labels used during syntactic analysis; and a phrase structure generation unit configured to generate a phrase-structure parse tree from the grammatical constituents and inter-constituent relations of the input sentence output by the intermediate label removal unit after the intermediate labels have been removed.

In a syntactic analysis device according to another embodiment of the invention, the syntax tree generation module may instead comprise: an intermediate label removal unit configured to remove the intermediate labels used during syntactic analysis; a head node marking unit configured to mark head nodes according to the grammatical constituents and inter-constituent relations of the input sentence output by the intermediate label removal unit after the intermediate labels have been removed; and a dependency structure generation unit configured to generate the dependency relation graph of the input sentence from the head nodes marked by the head node marking unit.
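The removal of intermediate labels can be sketched as follows. This is an illustration under stated assumptions, not the patent's procedure: trees are nested tuples `(label, child, ...)` with words as plain strings, intermediate labels are assumed to be recognizable by a leading `<` or `[`, and "removal" is assumed to mean splicing the children of an intermediate node into its parent.

```python
def remove_intermediate(tree, is_intermediate):
    """Return a copy of `tree` in which every non-root node carrying an
    intermediate label is replaced by its own children."""
    if isinstance(tree, str):          # a word (leaf)
        return tree
    label, children = tree[0], tree[1:]
    new_children = []
    for child in children:
        cleaned = remove_intermediate(child, is_intermediate)
        if not isinstance(cleaned, str) and is_intermediate(cleaned[0]):
            new_children.extend(cleaned[1:])   # splice grandchildren in
        else:
            new_children.append(cleaned)
    return (label, *new_children)

def looks_intermediate(label):
    # Hypothetical naming convention: "<x>" and "[x y]" mark intermediate labels.
    return label.startswith("<") or label.startswith("[")

# Analysis of "a a a" produced with the intermediate label <a>.
parsed = ("X", ("<a>", ("<a>", "a", "a"), "a"))
cleaned = remove_intermediate(parsed, looks_intermediate)
```

Here `cleaned` is the flat tree `("X", "a", "a", "a")`, which corresponds to the original rule X → a* without any intermediate nodes.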

According to another aspect of the present invention, a syntactic analysis method is provided, comprising: a rule acquisition step of learning syntactic analysis rules from an annotated training treebank so as to generate a production rule set that includes rules in regular-expression form, wherein a repeated part in the right-hand side of a production rule is represented with a regular expression; a rule application step of analyzing an input sentence using the production rule set obtained in the rule acquisition step and identifying the constituents of the input sentence and the relations between them; and a syntax tree generation step of generating a dependency relation graph or a phrase-structure parse tree of the input sentence from the grammatical constituents and inter-constituent relations output by the rule application step.

The syntactic analysis device and method proposed by the present invention can use syntactic analysis rules in regular-expression form, which increases the descriptive power of the rules and overcomes the inflexible rule representation and weak expressive power of existing methods. The rule acquisition method and parsing algorithm proposed by the present invention together form a parser that supports regular-expression rules, thereby achieving efficient and correct syntactic analysis.

In addition, the present invention also provides a computer program for implementing the above-described method.

In addition, the present invention also provides a computer program product, at least in the form of a computer-readable medium, on which computer program code for implementing the above-described method is recorded.

Brief description of the drawings

The above and other objects, features and advantages of the present invention will be more easily understood from the following description of embodiments of the invention with reference to the accompanying drawings. In the drawings, identical or corresponding technical features or components are denoted by identical or corresponding reference numerals. In the drawings:

Fig. 1 is a schematic structural diagram of a syntactic analysis device that uses regular-expression rules according to an embodiment of the present invention;

Fig. 2 is a block diagram of the rule acquisition module shown in Fig. 1 according to an embodiment of the present invention;

Fig. 3 is a block diagram of the rule application module shown in Fig. 1 according to an embodiment of the present invention;

Fig. 4 is a block diagram of the syntax tree generation module shown in Fig. 1 according to an embodiment of the present invention;

Fig. 5 is a schematic diagram for explaining upper nodes and lower nodes;

Fig. 6 is an example syntax tree S1 used to explain how the rule acquisition module obtains syntactic rules;

Fig. 7 is another example syntax tree S2 used to explain how the rule acquisition module obtains syntactic rules;

Fig. 8 illustrates the final syntactic analysis result output after an input sentence is analyzed by the syntactic analysis module according to one embodiment of the present invention;

Fig. 9 illustrates the result after the intermediate label removal unit of the syntax tree generation module removes the intermediate nodes from the final syntactic analysis result shown in Fig. 8, according to one embodiment of the present invention;

Fig. 10 is a flowchart of a syntactic analysis method according to an embodiment of the invention;

Fig. 11 is a detailed flowchart of the processing performed in the rule acquisition step shown in Fig. 10 according to one embodiment of the invention;

Fig. 12 is a detailed flowchart of the processing performed in the rule application step shown in Fig. 10 according to one embodiment of the invention;

Fig. 13 is a detailed flowchart of the processing performed in the syntax tree generation step shown in Fig. 10 according to one embodiment of the invention; and

Fig. 14 is a block diagram of an information processing device for implementing the syntactic analysis method according to the present invention.

Detailed description of the embodiments

Embodiments of the invention are described below with reference to the accompanying drawings. It should be noted that, for clarity, the drawings and the description omit the representation and description of components and processing that are unrelated to the invention or known to persons of ordinary skill in the art.

To better explain the principle of the invention, the single-node repetition rule, multi-node repetition rule, mixed repetition rule and non-repetition rule used in the invention are defined first, followed by the definitions of upper node and lower node.

Definition 1: single-node repetition rule. A rule in which a single constituent in the right-hand side of the production rule is repeated two or more times is defined as a single-node repetition rule.

Definition 2: multi-node repetition rule. A rule in which a fragment in the right-hand side of the production rule is repeated two or more times is defined as a multi-node repetition rule.

Definition 3: mixed repetition rule. A rule whose right-hand side contains both a single-node repeated part and a multi-node repeated part is defined as a mixed repetition rule.

Definition 4: non-repetition rule. A rule whose right-hand side contains no repeated part is defined as a non-repetition rule.

Some examples of the various types of production rules are given below to further illustrate the single-node repetition rule, multi-node repetition rule, mixed repetition rule and non-repetition rule defined above.

(1) Single-node repetition rule

For example, RL1: P → AAABC, in which "AAA" is a single-node repeated part. In the regular-expression rule notation of the invention this is written P → A*BC.

As another example, RL2: P → ABBCCD contains two single-node repeated parts, "BB" and "CC". In the regular-expression rule notation of the invention this is written P → AB*C*D.

(2) Multi-node repetition rule

For example, RL3: P → ABABC, in which "ABAB" is a multi-node repeated part formed by repeating the fragment "AB". In the regular-expression rule notation of the invention this is written P → [AB]⁺C.

As another example, RL4: P → ABCBCDEFDEFG, in which "BCBC" and "DEFDEF" are multi-node repeated parts. In the regular-expression rule notation of the invention this is written P → A[BC]⁺[DEF]⁺G.

(3) Mixed repetition rule

For example, RL5: P → AAABCDBCDE contains both single-node and multi-node repetition: "AAA" is a single-node repeated part and "BCDBCD" is a multi-node repeated part. In the regular-expression rule notation of the invention this is written P → A*[BCD]⁺E.

(4) Non-repetition rule

For example, RL6: P → ABCDEABC contains neither single-node repetition nor multi-node repetition, because no constituent or fragment is repeated consecutively. It is therefore a non-repetition rule.
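The examples RL1 to RL6 can be reproduced with a small sketch. The greedy left-to-right scan that prefers the shortest repeating unit is an assumption made for this illustration; the text does not specify how repeated parts are detected. The function name, the space-separated token format (`A*`, `[A B]+`) and the rule-type strings are likewise hypothetical.

```python
from typing import List, Tuple

def compress_rhs(symbols: List[str]) -> Tuple[List[str], str]:
    """Rewrite consecutive repeats in a right-hand side and classify the rule.

    A symbol repeated two or more times becomes "X*"; a fragment of several
    symbols repeated two or more times becomes "[X Y]+".
    """
    out: List[str] = []
    kinds = set()
    i, n = 0, len(symbols)
    while i < n:
        # Try the shortest candidate unit first (assumption of this sketch).
        for length in range(1, (n - i) // 2 + 1):
            unit = symbols[i:i + length]
            count = 1
            while symbols[i + count * length:i + (count + 1) * length] == unit:
                count += 1
            if count >= 2:
                if length == 1:
                    out.append(unit[0] + "*")
                    kinds.add("single")
                else:
                    out.append("[" + " ".join(unit) + "]+")
                    kinds.add("multi")
                i += count * length
                break
        else:
            # No unit starting at position i repeats: keep the symbol as is.
            out.append(symbols[i])
            i += 1
    if not kinds:
        rule_type = "no repetition"
    elif kinds == {"single"}:
        rule_type = "single-node repetition"
    elif kinds == {"multi"}:
        rule_type = "multi-node repetition"
    else:
        rule_type = "mixed repetition"
    return out, rule_type

for name, rhs in [("RL1", "AAABC"), ("RL3", "ABABC"),
                  ("RL5", "AAABCDBCDE"), ("RL6", "ABCDEABC")]:
    tokens, rule_type = compress_rhs(list(rhs))
    print(name, "P →", " ".join(tokens), "|", rule_type)
```

Because the scan is greedy, a right-hand side such as AABAAB is rewritten as `A* B A* B` and not as `[A A B]+`; a different detection order would give a different, equally valid regular-expression form.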

Next, upper nodes and lower nodes are defined.

Definition 5: upper node. A node in a tree structure that has child nodes is defined as an upper node.

Definition 6: lower node. A node in a tree structure that has a parent node is defined as a lower node.

Whether a node is an upper node or a lower node depends on which part of the tree is under consideration: the same node of a tree can act as an upper node in one context and as a lower node in another. For example, consider the three nodes A2, B3 and C1 in the syntax tree shown in Fig. 5. C1 is a lower node of both B3 and A2; A2 is an upper node of both B3 and C1; and B3 is an upper node of C1 and a lower node of A2.

Next, the general working principle of the syntactic analysis device according to an embodiment of the present invention is described with reference to the drawings, in particular Fig. 1 to Fig. 4. As shown in Fig. 1, the syntactic analysis device using regular-expression rules according to the embodiment comprises a training treebank 101, a rule acquisition module 102, a rule application module 103, a syntax tree generation module 104 and a rule set 105.

Rule acquisition module 102 learns syntactic analysis rules from the annotated training treebank 101, for example by statistical learning, and generates the rule set 105 used when an input sentence is analyzed. Rule acquisition module 102 expresses any repeated part in the right-hand side of a production rule in the regular-expression form defined above. Rule set 105 is therefore a set of production rules that includes rules in regular-expression form. The syntactic analysis rules learned by rule acquisition module 102 may also include context information.

Rule application module 103 uses the syntactic analysis rule set 105 learned by rule acquisition module 102 to analyze the input sentence and to identify the constituents of the input sentence and the relations between them.

Syntax tree generation module 104 generates, according to the user's needs, a dependency relation graph or a phrase-structure parse tree of the input sentence from the analysis result output by rule application module 103.

The three main modules of the syntactic analysis device of the invention, namely rule acquisition module 102, rule application module 103 and syntax tree generation module 104, are described in detail below.

Fig. 2 is a block diagram of the rule acquisition module 102 shown in Fig. 1 according to an embodiment of the present invention. As shown in Fig. 2, the rule acquisition module 102 of this embodiment comprises a syntactic analysis treebank 201, a tree fragment decomposition unit 202, a decomposition parameter input unit 203, a tree fragment set 204, a repeated fragment detection unit 205, a non-repetition rule unit 206, a single-node repetition rule unit 207, a multi-node repetition rule unit 208, a mixed repetition rule unit 209, a duplicate rule merging unit 211 and a production rule set 213.

Syntactic analysis treebank 201 is the treebank used for learning, that is, the training treebank 101 shown in Fig. 1. It annotates the grammatical constituents of each training sentence and the nesting relations between those constituents. The invention has been applied in practice to two treebanks, the English Penn Treebank and the Chinese Penn Treebank. It should be noted, however, that the proposed syntactic analysis device and method are language independent: for any language in which the grammatical constituents of sentences and their nesting relations have been annotated, syntactic analysis rules can be obtained with the technical solution of the invention and then used to analyze input sentences.

According to the decomposition parameters supplied by decomposition parameter input unit 203, tree fragment decomposition unit 202 decomposes each syntax tree in syntactic analysis treebank 201 (training treebank 101) into a number of smaller subtrees, or tree fragments. A tree fragment is represented in the following format.

<freq:xx>{<f₁>...<f_n>} P → Y₁ Y₂ ... Y_n

Here <freq:xx> is frequency information giving the number of occurrences of the tree fragment in the training treebank. <f_i> is an attribute feature, used mainly to describe the context information, lexical features or semantic features under which the rule is used. The attribute features are determined by the decomposition parameters supplied by decomposition parameter input unit 203; a rule may or may not contain attribute features.

P is the upper node of the tree fragment and is a phrase label. Y_i is a child node of P and is either a phrase label or a lexical label. "{<f₁>...<f_n>} P → Y₁ Y₂ ... Y_n" means that when the attributes <f₁>...<f_n> are present, the phrase P can be formed by Y₁ Y₂ ... Y_n.

As the representation format shows, a tree fragment is a rule in production form. Each syntax tree in syntactic analysis treebank 201 can be decomposed into a number of tree fragments, and all decomposition results are stored in tree fragment set 204, forming a rule set in production form.
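The decomposition into frequency-annotated productions can be sketched as below. The sketch assumes the simplest decomposition parameter (context-free fragments without attribute features) and a hypothetical tree encoding as nested tuples `(label, child, ...)` with words as strings; the example tree is invented for illustration and is not the tree of Fig. 6 or Fig. 7.

```python
from collections import Counter

def decompose(tree, counts=None):
    """Count one production P -> Y1 ... Yn for every node that has children."""
    if counts is None:
        counts = Counter()
    label, children = tree[0], tree[1:]
    # The right-hand side lists the labels of the direct children.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if not isinstance(child, str):
            decompose(child, counts)
    return counts

def format_fragment(lhs, rhs, freq):
    """Render a fragment in the <freq:xx> P → Y1 ... Yn format (no attributes)."""
    return f"<freq:{freq}> {lhs} → {' '.join(rhs)}"

# Hypothetical annotated tree.
example_tree = ("S", ("X", "a", "a"), ("X", "a", "a", "a"), ("X", "a", "a"))
fragments = decompose(example_tree)
for (lhs, rhs), freq in fragments.items():
    print(format_fragment(lhs, rhs, freq))
```

To process a whole treebank, the same `Counter` can be passed to `decompose` for every tree so that frequencies accumulate across sentences.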

The tree fragment set (that is, the production rule set) is then input to repeated fragment detection unit 205. Repeated fragment detection unit 205 detects whether the "Y₁ Y₂ ... Y_n" part of each input tree fragment contains a repeated node sequence. According to the form of the repetition, the tree fragment set is divided, as defined above, among single-node repetition rule unit 207, multi-node repetition rule unit 208, mixed repetition rule unit 209 and non-repetition rule unit 206. Rules that contain a repeated node sequence are expressed with the regular-expression symbols "*" or "+".

Once regular expressions are introduced, some rules are converted into rules of identical form. For example, "rule R₁: P → ABBBC" and "rule R₂: P → ABBC" are both written P → AB*C in regular-expression form.

Therefore, after all rules have been converted into regular-expression form, the duplicated rules of identical form must be removed. To this end, duplicate rule merging unit 211 merges rules of identical form into a single rule and merges their frequencies accordingly. The production rule set 213 used to analyze input sentences can then be generated directly.

In addition, to improve the efficiency of syntactic analysis, the rule acquisition module 102 according to an embodiment of the invention may further comprise a selection strategy unit 210 and a rule selection unit 212. According to the selection strategy provided by selection strategy unit 210, rule selection unit 212 selects among the production rules generated by duplicate rule merging unit 211, thereby generating a reduced, efficient production rule set 213.
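The merging step can be sketched as a dictionary keyed by rule form. The text says only that frequencies are "merged accordingly"; summing them is an assumption of this sketch, and the frequencies in the example are hypothetical. Rules are given as `(freq, lhs, rhs)` with the right-hand side already in regular-expression form.

```python
from collections import defaultdict

def merge_rules(rules):
    """Merge rules of identical form into one rule and add up their frequencies."""
    merged = defaultdict(int)
    for freq, lhs, rhs in rules:
        merged[(lhs, tuple(rhs))] += freq
    return dict(merged)

# P → ABBBC and P → ABBC both become P → A B* C after conversion.
converted = [
    (2, "P", ["A", "B*", "C"]),   # from P → ABBBC, hypothetical frequency 2
    (1, "P", ["A", "B*", "C"]),   # from P → ABBC, hypothetical frequency 1
    (4, "X", ["a*"]),
]
rule_set = merge_rules(converted)
```

After merging, `rule_set` holds two rules: P → A B* C with frequency 3 and X → a* with frequency 4.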

The operation of rule acquisition module 102 is described below with a concrete example. Fig. 6 and Fig. 7 show two example syntax trees, S1 and S2, used to explain how rule acquisition module 102 obtains syntactic rules. The process by which rule acquisition module 102 obtains syntactic analysis rules from syntax trees S1 and S2 is as follows.

First, tree fragment decomposition unit 202 decomposes syntax trees S1 and S2 according to the decomposition parameters provided by decomposition parameter input unit 203. Suppose that in this example the decomposition parameter specifies that trees are decomposed into context-free phrases. The tree fragment set formed by decomposing the syntax trees S1 and S2 shown in Fig. 6 and Fig. 7 is shown in Table 1 below.

Table 1: Tree fragment set

The decomposed fragment set shown in Table 1 is then passed to repeated fragment detection unit 205, which detects whether repeated nodes are present. The fragment set in this example contains repeated nodes. After the repeated parts are expressed in regular-expression form, the rules are sent to units 206 to 209 according to their type of repetition. The fragment set of Table 1 expressed with regular expressions is shown in Table 2 below.

Table 2: Rule set expressed with regular expressions

The fragments expressed with regular expressions are input to duplicate rule merging unit 211 as candidate syntactic analysis rules, and duplicate rules are merged there. The rule set formed after duplicate rule merging unit 211 merges the rules of Table 2 contains 9 rules in total; in it, the duplicate copies of the rule "X → a*" have been merged. The output of duplicate rule merging unit 211 is shown in Table 3 below.

Table 3: Rule set after duplicate rules are merged

The output of duplicate rule merging unit 211 then enters rule selection unit 212, where rules are selected according to the selection strategy. This finally forms the syntactic analysis rule set that rule application module 103 needs, which is input to rule application module 103 as production rule set 213.

Fig. 3 is a block diagram of the rule application module 103 shown in Fig. 1 according to an embodiment of the present invention. Rule application module 103 applies the syntactic analysis rule set formed by the rule acquisition module 102 shown in Fig. 2, that is, the production rule set, to analyze the input sentence, and outputs the grammatical constituents of the input sentence and the relations between them. As shown in Fig. 3, the rule application module 103 of this embodiment comprises a production rule set 302, a rule compiling unit 303, a rule query table 304, a rule query unit 305, a syntactic analysis unit 306, a disambiguation unit 308, and so on.

Production rule set 302, which is the production rule set 213 formed by rule acquisition module 102, first enters rule compiling unit 303. Rule compiling unit 303 compiles production rule set 302 into rule query table 304, which can be used by rule query unit 305.

After input sentence 301 is input to syntactic analysis unit 306, syntactic analysis unit 306 looks up, through rule query unit 305, the syntactic analysis rules in rule query table 304 that can be applied to input sentence 301, identifies the grammatical constituents of input sentence 301 according to those rules, and outputs the analysis result. The analysis uses the CYK algorithm: starting from the word nodes of input sentence 301, rule lookup expands from spans of one node to spans of two nodes and so on, until the whole sentence is covered.

The syntactic analysis rules may yield multiple partial analysis candidates 307, from which disambiguation unit 308 selects the optimal partial analysis result 309. The partial analysis result 309 selected by disambiguation unit 308 is returned to syntactic analysis unit 306 for further analysis, until syntactic analysis unit 306 outputs a satisfactory final analysis result 310.

For the production rules in regular-expression form generated by rule acquisition module 102, embodiments of the invention add intermediate results for the parts of the rules that contain regular expressions. In this way rule compiling unit 303, rule query unit 305 and syntactic analysis unit 306 solve the problem of using rules that contain regular expressions during syntactic analysis. The detailed operation is as follows.
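The bottom-up, span-by-span lookup described above can be sketched as a CYK-style chart recognizer. This is a simplified illustration, not the patent's parser: it only decides whether a start label covers the whole sentence, it does no disambiguation, and it generalizes textbook CYK to right-hand sides of any length (including single-symbol rules) as an assumption. The rule set used in the example is hypothetical.

```python
def cyk_recognize(words, rules, start):
    """Return True if `start` can cover all of `words`.

    `rules` is a list of (lhs, rhs) pairs, where rhs is a tuple of labels or words.
    chart[(i, j)] holds every label found for the span words[i:j].
    """
    n = len(words)
    chart = {(i, j): set() for i in range(n) for j in range(i + 1, n + 1)}
    for i, word in enumerate(words):
        chart[(i, i + 1)].add(word)          # spans of one node: the words

    def matches(rhs, i, j):
        # Can rhs be split over words[i:j], one non-empty sub-span per symbol?
        if len(rhs) == 1:
            return rhs[0] in chart[(i, j)]
        for k in range(i + 1, j - len(rhs) + 2):
            if rhs[0] in chart[(i, k)] and matches(rhs[1:], k, j):
                return True
        return False

    for span in range(1, n + 1):             # grow spans: 1 node, 2 nodes, ...
        for i in range(n - span + 1):
            j = i + span
            changed = True
            while changed:                   # repeat so single-symbol rules chain
                changed = False
                for lhs, rhs in rules:
                    if (lhs not in chart[(i, j)] and len(rhs) <= span
                            and matches(rhs, i, j)):
                        chart[(i, j)].add(lhs)
                        changed = True
    return start in chart[(0, n)]

# Hypothetical compiled rules for X → a* (two or more a's).
rules = [("X", ("<a>",)), ("<a>", ("a", "a")), ("<a>", ("<a>", "a"))]
accepted = cyk_recognize(["a", "a", "a"], rules, "X")
rejected = cyk_recognize(["a"], rules, "X")
```

With these rules `accepted` is `True` and `rejected` is `False`, since a single "a" does not satisfy a repetition of two or more.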

For each part of the production rule set that contains a regular expression, the rule compilation unit 303 adds an intermediate generated tag and intermediate generation rules, and replaces the part of the rule represented by the regular expression with the intermediate generated tag.

Specifically, for an x* part, an intermediate generated tag <x> and the intermediate generation rules <x> → xx and <x> → <x>x are added. For an [x..y]+ part, an intermediate generated tag [x..y] and the intermediate generation rules [x..y] → x...yx...y and [x..y] → [x..y]x...y are added.

For example, in the rule "R₂: X → a*", the part "a*" is a regular expression. An intermediate tag "<a>" and two intermediate generation rules, "<a> → aa" and "<a> → <a>a", are added for "a*". Replacing the regular expression part of the rule with the intermediate generated tag converts rule R₂ into X → <a>.
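The conversion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `compile_rule` and the string syntax for rule items (`"a*"` for a single-node repetition, `"[a b]+"` for a multi-node repetition) are hypothetical and do not come from the patent.

```python
def compile_rule(lhs, rhs):
    """Replace the regex parts of one production with intermediate tags.

    rhs is a list of items: a plain symbol such as "b", a single-node
    repetition written "a*", or a multi-node repetition written "[a b]+"
    (this string syntax is an assumption of the sketch).
    Returns (rewritten_rule, intermediate_rules).
    """
    new_rhs, extra = [], []
    for item in rhs:
        if item.endswith("*"):                              # x*  ->  <x>
            x = item[:-1]
            tag = f"<{x}>"
            extra += [(tag, [x, x]), (tag, [tag, x])]
            new_rhs.append(tag)
        elif item.startswith("[") and item.endswith("]+"):  # [x..y]+  ->  [x..y]
            seq = item[1:-2].split()
            tag = "[" + "".join(seq) + "]"
            extra += [(tag, seq + seq), (tag, [tag] + seq)]
            new_rhs.append(tag)
        else:                                               # ordinary symbol
            new_rhs.append(item)
    return (lhs, new_rhs), extra


rule, intermediate = compile_rule("X", ["a*"])
print(rule)          # ('X', ['<a>'])
print(intermediate)  # [('<a>', ['a', 'a']), ('<a>', ['<a>', 'a'])]
```

Each repetition yields one base rule (two copies of the repeated unit) and one left-recursive rule, which matches the pair of intermediate generation rules given for "a*" in the example above.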

Table 4 below shows the output of the repeated rule merging unit 211 given in Table 3 after intermediate generated tags and intermediate generation rules have been added. Table 5 shows the intermediate tags and intermediate generation rules used; 6 intermediate generation rules are introduced, denoted R₁₃ to R₁₈.

Table 4: Output of the repeated rule merging unit 211 after adding intermediate tags

Table 5: Intermediate tags and intermediate generation rules used when converting the output of the repeated rule merging unit 211

Intermediate tag	Intermediate generation rule
<a>	R₁₃: <a> → aa
	R₁₄: <a> → <a>a
[ab]	R₁₅: [ab] → abab
	R₁₆: [ab] → [ab]ab
<X>	R₁₇: <X> → XX
	R₁₈: <X> → <X>X

During rule compilation, the intermediate tags are merged into the phrase tag set of the rule set, and the intermediate generation rules are merged into the syntactic analysis rule set. The rule compilation unit 303 organizes all rules and tags uniformly and generates the rule query table 304, which is convenient to query.
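The patent does not specify the internal layout of the rule query table. One plausible layout, shown here purely as an assumption, is a dictionary that indexes every rule by its right-hand side, with intermediate generation rules stored exactly like ordinary rules. The function name `build_query_table` is hypothetical.

```python
from collections import defaultdict


def build_query_table(rules):
    """Index rules by their right-hand side for lookup during parsing.

    rules: iterable of (name, lhs, rhs) triples. Intermediate generation
    rules are stored in the same table as ordinary rules.
    """
    table = defaultdict(list)
    for name, lhs, rhs in rules:
        table[tuple(rhs)].append((name, lhs))
    return table


RULES = [
    ("R2", "X", ["<a>"]),            # converted regex rule
    ("R13", "<a>", ["a", "a"]),      # intermediate generation rule
    ("R14", "<a>", ["<a>", "a"]),    # intermediate generation rule
]
table = build_query_table(RULES)
print(table[("a", "a")])   # [('R13', '<a>')]
```

Indexing by right-hand side lets a bottom-up parser ask which tags can be built from a sequence of already recognized nodes without scanning the whole rule set.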

During analysis, the syntactic analysis unit 306 identifies intermediate tags in the same way as it identifies phrases. Using the CYK algorithm, it queries, through the rule query unit 305, all syntactic analysis rules (including intermediate generation rules) applicable to the current input sentence 301, and generates the syntactic analysis result.

The operation of the rule application module 103 is illustrated below by analyzing the input sentence "aaababababcaa" with the rules in Table 4 and Table 5.

Table 6: Example of the rules used to analyze the sentence

Process	Current analysis state	Rules used
Process 1	aaababababcaa	R₁₃: <a> → aa; R₁₅: [ab] → abab; R₁₃: <a> → aa
Process 2	<a>[ab]abc<a>	R₂: X → <a>; R₄: Z → [ab]
Process 3	XZabcX	R₅: M → abcX
Process 4	XZM	R₇: S → XZM
Process 5	S

It should be noted that the above example gives only one possible way of applying the syntactic analysis rules; it is neither optimal nor unique. Its purpose is to show how, by introducing intermediate symbols and intermediate analysis rules, syntactic analysis rules containing regular expressions can be used during syntactic analysis.
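The analysis in Table 6 can be checked with a small bottom-up chart recognizer. The sketch below is CYK-style in that it fills spans of 1 node, then 2 nodes, and so on, but it is not the patent's implementation: it accepts right-hand sides of any length rather than requiring binary rules, and the function name `recognize` is hypothetical. The grammar contains only the rules that appear in Table 5 and Table 6.

```python
def recognize(tokens, rules, start="S"):
    """Return True if `start` derives the whole token list.

    rules: list of (lhs, rhs) pairs, rhs being a non-empty list of symbols.
    """
    n = len(tokens)
    chart = {}  # (i, j) -> set of symbols that derive tokens[i:j]

    def seq_ok(rhs, i, j):
        # Can the symbol sequence rhs derive exactly tokens[i:j]?
        if len(rhs) == 1:
            return rhs[0] in chart[(i, j)]
        # The first symbol takes tokens[i:k]; every later symbol needs >= 1 token.
        return any(rhs[0] in chart[(i, k)] and seq_ok(rhs[1:], k, j)
                   for k in range(i + 1, j - len(rhs) + 2))

    for length in range(1, n + 1):        # 1 node, 2 nodes, ..., whole sentence
        for i in range(n - length + 1):
            j = i + length
            cell = chart[(i, j)] = set()
            if length == 1:
                cell.add(tokens[i])
            changed = True
            while changed:                # repeat so unit rules like X -> <a> fire
                changed = False
                for lhs, rhs in rules:
                    if lhs not in cell and seq_ok(rhs, i, j):
                        cell.add(lhs)
                        changed = True
    return n > 0 and start in chart[(0, n)]


GRAMMAR = [
    ("<a>", ["a", "a"]), ("<a>", ["<a>", "a"]),                 # R13, R14
    ("[ab]", ["a", "b", "a", "b"]), ("[ab]", ["[ab]", "a", "b"]),  # R15, R16
    ("X", ["<a>"]), ("Z", ["[ab]"]),                            # R2, R4
    ("M", ["a", "b", "c", "X"]), ("S", ["X", "Z", "M"]),        # R5, R7
]
print(recognize(list("aaababababcaa"), GRAMMAR))
```

Intermediate tags such as `<a>` and `[ab]` sit in the chart alongside ordinary phrase tags, which mirrors the statement that they are identified in the same way as phrases.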

While rules are being applied, several analysis rules may be available in the same state, which gives rise to analysis ambiguity. The multiple generation rule candidates 307 found by the rule query unit 305 are input to the ambiguity resolution unit 308, which selects the optimal rule to apply and thereby generates the partial analysis result 309.

According to one embodiment of the present invention, the ambiguity resolution unit 308 selects the optimal rule to use by calculating the probability P(S, t) of the syntactic analysis tree, and then outputs the optimal syntactic analysis result 309. The basic formula is as follows:

P(S, t) = \prod_{r \in D(T)} p(r)

When rule probabilities are calculated, intermediate symbols are treated the same as the phrase symbols already present in syntactic analysis, and intermediate analysis rules are treated the same as non-intermediate analysis rules. Any existing method for calculating the probabilities of syntactic analysis rules can be used for the rules obtained by the processing of the present invention. The estimation of rule probabilities is summarized briefly as follows:

1. Rules without regular expressions: the rule probability is estimated in the same way as in existing methods.

2. Intermediate generation rules: intermediate symbols are treated as phrases, and the rule probability is estimated with existing methods.

3. Rules containing regular expressions:

p(r) = \prod_{i=1}^{n} p(r_i),

where r_i is an intermediate rule used when converting the rule containing the regular expression, and p(r_i) is the rule probability of the intermediate syntactic analysis rule r_i.
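The two products above can be written out directly. The following Python sketch is illustrative only: the function names, the `expansions` mapping from a regular expression rule to the intermediate rules used for it, and all probability values are assumptions made for the example, not data from the patent.

```python
import math


def rule_probability(rule, probs, expansions):
    """p(r); for a regex rule, the product over its intermediate rules r_i."""
    if rule in expansions:
        return math.prod(probs[r_i] for r_i in expansions[rule])
    return probs[rule]


def tree_probability(used_rules, probs, expansions=None):
    """P(S, t): product of p(r) over the rules r in D(T)."""
    expansions = expansions or {}
    return math.prod(rule_probability(r, probs, expansions) for r in used_rules)


def pick_best(candidates, probs):
    """Ambiguity resolution: keep the candidate with the highest probability."""
    return max(candidates, key=lambda rules: tree_probability(rules, probs))


# Made-up probabilities, for illustration only.
PROBS = {"R2": 0.4, "R13": 0.5, "R14": 0.25, "R7": 0.8}
candidate_a = ["R13", "R2", "R7"]          # <a> built from two nodes
candidate_b = ["R13", "R14", "R2", "R7"]   # <a> built from three nodes
best = pick_best([candidate_a, candidate_b], PROBS)
```

Because intermediate rules are scored like any other rule, no special case is needed in `tree_probability` beyond the optional expansion of a regular expression rule into its intermediate rules.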

In the rule application module 103 according to this embodiment of the invention, the problem of how to use regular expression rules is solved by introducing intermediate symbols and intermediate generation rules. The problem of estimating the probabilities of regular expression rules is solved by treating intermediate symbols and intermediate generation rules on an equal footing with phrase symbols and ordinary rules.

Fig. 8 shows the final syntactic analysis result obtained by analyzing the input sentence "aaababababcaa" with the rules of Table 6.

The syntactic analysis result output by the rule application module 103, for example the result shown in Fig. 8, is input to the syntax tree generation module 104, which generates a dependency syntactic relation graph or a phrase-structure syntactic analysis tree according to the user's needs.

Fig. 4 is a block diagram of the syntax tree generation module 104 shown in Fig. 1 according to an embodiment of the present invention. As shown in Fig. 4, the syntax tree generation module 104 according to this embodiment comprises an intermediate tag cleaning unit 402, a core node labeling unit 403, a dependency structure generation unit 404 and a phrase structure generation unit 405.

The final analysis result 401 shown in Fig. 4 is the syntactic analysis result 310 output by the rule application module 103. The final analysis result 401 is first input to the intermediate tag cleaning unit 402, which removes the intermediate tags used during syntactic analysis.

After the intermediate tags have been removed, if a phrase-structure syntactic analysis tree is required, the phrase structure generation unit 405 generates the phrase structure tree. If a dependency syntactic relation graph is required, the analysis result with intermediate tags removed is input to the core node labeling unit 403 for core node labeling, and the dependency structure generation unit 404 then generates the dependency relations.
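The patent does not state how core nodes are chosen or how dependencies are read off them. A common approach, used here only as an assumption, is head percolation: each phrase has one core (head) child, and the head word of every other child depends on the head word of the core child. In the sketch below, the tree encoding and the `head_index` callback are hypothetical.

```python
def lexical_head(tree, head_index):
    """Return the head word of a subtree by following core (head) children."""
    if isinstance(tree, str):          # a leaf is a word
        return tree
    label, children = tree
    return lexical_head(children[head_index(label, children)], head_index)


def dependencies(tree, head_index):
    """Collect (head_word, dependent_word) pairs from a phrase structure tree."""
    if isinstance(tree, str):
        return []
    label, children = tree
    h = head_index(label, children)    # position of the core child
    head_word = lexical_head(children[h], head_index)
    deps = []
    for k, child in enumerate(children):
        if k != h:
            deps.append((head_word, lexical_head(child, head_index)))
        deps.extend(dependencies(child, head_index))
    return deps


# Hypothetical core-node choice: last child for S and NP, first child otherwise.
def choose_head(label, children):
    return len(children) - 1 if label in ("S", "NP") else 0


tree = ("S", [("NP", ["she"]), ("VP", ["reads", ("NP", ["old", "books"])])])
print(dependencies(tree, choose_head))
```

The sketch identifies words by their text, so a real implementation would need word positions to keep repeated words apart.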

According to one embodiment of the present invention, the intermediate tag cleaning unit 402 removes intermediate nodes with the recursive function described below.

Function CleanTags(n_i)
Begin
    For each s_i, where s_i ∈ {sons of n_i}
        CleanTags(s_i)
        // if s_i is an intermediate symbol
        If s_i is an intermediate tag
            // all sons of s_i are raised to become sons of n_i
            move up all the sons of s_i as the sons of n_i
        Endif
    End for
End
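The recursive function above translates almost line for line into Python. The sketch below assumes a simple `Node` class and assumes that intermediate tags can be recognized by their shape ("<a>" or "[ab]", as in Table 5); both are choices made for the illustration rather than details given in the patent.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def is_intermediate(label):
    # Assumption: intermediate tags look like "<a>" or "[ab]", as in Table 5.
    return (label.startswith("<") and label.endswith(">")) or \
           (label.startswith("[") and label.endswith("]"))


def clean_tags(node):
    """Remove intermediate nodes below `node`, lifting their children."""
    new_children = []
    for child in node.children:
        clean_tags(child)                          # clean the subtree first
        if is_intermediate(child.label):
            new_children.extend(child.children)    # lift grandchildren into place
        else:
            new_children.append(child)
    node.children = new_children
    return node


# X -> <a>, where <a> -> <a> a and the inner <a> -> a a
root = Node("X", [Node("<a>", [Node("<a>", [Node("a"), Node("a")]), Node("a")])])
clean_tags(root)
print([c.label for c in root.children])   # ['a', 'a', 'a']
```

Building a new child list, rather than editing the list while iterating over it, keeps the original left-to-right order of the lifted nodes. As in the pseudocode, the root node itself is never removed.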

Fig. 9 shows the result after the intermediate tag cleaning unit 402 removes the intermediate nodes from the final analysis result 310 of the rule application module 103 shown in Fig. 8.

The structure and operating principle of the syntactic analysis device according to embodiments of the present invention have been described above. The syntactic analysis method applied by this device is described below with reference to Figs. 10 to 13.

Fig. 10 is a flowchart of a syntactic analysis method according to an embodiment of the present invention. As shown in Fig. 10, the syntactic analysis method according to this embodiment comprises a rule acquisition step S1001, a rule application step S1003 and a syntax tree generation step S1005.

First, in the rule acquisition step S1001, syntactic analysis rules are learned from an annotated training treebank, for example the training treebank 101 shown in Fig. 1, to generate a production rule set that includes rules in regular expression form, in which repeated parts of a production rule's consequent are represented by regular expressions. Then, in the rule application step S1003, the production rule set obtained in the rule acquisition step S1001 is used to analyze the input sentence and identify its grammatical constituents and the relations between them. Finally, in the syntax tree generation step S1005, a dependency syntactic relation graph or a phrase-structure syntactic analysis tree of the input sentence is generated from the constituents and relations output by the rule application step S1003.

Fig. 11 is a detailed flowchart of the processing performed in the rule acquisition step S1001 shown in Fig. 10 according to an embodiment of the invention. As shown in Fig. 11, the rule acquisition method according to this embodiment comprises a tree fragment decomposition step S1101, a repeated fragment detection step S1103, a repeated rule merging step S1105 and a rule selection step S1107.

First, in the tree fragment decomposition step S1101, each syntax tree in the training treebank, for example the training treebank 101 shown in Fig. 1, is decomposed into tree fragments serving as production rules, to form a tree fragment set, for example the tree fragment set 204 shown in Fig. 2.
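Decomposition into tree fragments can be sketched as follows. The tree encoding (a `(label, children)` pair with words as plain strings), the function name `decompose`, and the toy treebank are assumptions made for illustration; attribute features are left out of the sketch.

```python
from collections import Counter


def decompose(tree, counts):
    """Add one production P -> Y1 ... Yn per internal node of `tree` to `counts`."""
    label, children = tree
    # A child contributes its own label if it is a phrase, or itself if it is a word.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for c in children:
        if not isinstance(c, str):
            decompose(c, counts)


# Toy treebank, invented for the example.
treebank = [
    ("S", [("X", ["a", "a"]), ("Z", ["a", "b", "a", "b"])]),
    ("S", [("X", ["a", "a", "a"]), ("Z", ["a", "b", "a", "b"])]),
]
counts = Counter()
for t in treebank:
    decompose(t, counts)
for (lhs, rhs), freq in counts.items():
    print(f"<freq:{freq}> {lhs} -> {' '.join(rhs)}")
```

Counting each fragment as it is produced gives the occurrence frequency that accompanies every tree fragment in the set.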

Then, in the repeated fragment detection step S1103, the consequents of the production rules in the tree fragment set obtained in step S1101 are checked for repeated node sequences, and each production rule with a repeated node sequence is expressed as a production rule in regular expression form. Next, in the repeated rule merging step S1105, production rules of identical form generated in step S1103 are merged into a single production rule, to form the production rule set.
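One simple way to detect repeated node sequences is a greedy left-to-right scan that looks for a block occurring at least twice in a row. The patent does not describe its detection procedure, so the strategy below (shortest block first, leftmost match) and the output syntax (`"a*"` for a repeated node, `"[a b]+"` for a repeated sequence) are assumptions of this sketch.

```python
def collapse_repeats(rhs):
    """Rewrite immediately repeated node sequences in a rule's right-hand side."""
    out, i, n = [], 0, len(rhs)
    while i < n:
        for k in range(1, (n - i) // 2 + 1):   # candidate block length, shortest first
            block = rhs[i:i + k]
            reps = 1
            while rhs[i + reps * k:i + (reps + 1) * k] == block:
                reps += 1
            if reps >= 2:                       # block occurs at least twice in a row
                out.append(f"{block[0]}*" if k == 1 else "[" + " ".join(block) + "]+")
                i += reps * k
                break
        else:                                   # no repetition starts at position i
            out.append(rhs[i])
            i += 1
    return out


print(collapse_repeats(["a", "a", "a", "b"]))        # ['a*', 'b']
print(collapse_repeats(["a", "b", "a", "b", "c"]))   # ['[a b]+', 'c']
```

After this rewriting, consequents that differ only in the number of repetitions take the same form, which is what allows them to be merged in the next step.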

Preferably, to improve the efficiency of syntactic analysis, the rule acquisition method according to an embodiment of the invention further comprises the rule selection step S1107, in which the production rules generated in the repeated rule merging step S1105 are selected according to a preset selection strategy to generate a reduced production rule set, such as the production rule set 213 shown in Fig. 2.

A tree fragment obtained in the tree fragment decomposition step can be expressed as

<freq:xx>{<f₁>...<f_n>}P → Y₁Y₂...Y_n, where n is a positive integer greater than or equal to 1. Here <freq:xx> is frequency information, giving the number of occurrences of the tree fragment in the training treebank; <f_i> is an attribute feature describing the context information, lexical features or semantic features that apply when the production rule is used, where i is any integer from 1 to n; P is the upper node of the tree fragment and is a phrase tag; Y_i is a child node of the P node and is a phrase tag or a lexical tag; and "{<f₁>...<f_n>}P → Y₁Y₂...Y_n" means that, when the attributes <f₁>...<f_n> are present, the phrase P can be composed of Y₁Y₂...Y_n.

It is worth noting that when the repeated rule merging step S1105 merges the production rules of identical form generated in the repeated fragment detection step S1103 into a single production rule, the frequencies of those production rules are merged accordingly.
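Merging rules of identical form while combining their frequencies amounts to summing counts per rule. A minimal sketch, assuming rules are `(lhs, rhs)` pairs and that "merging frequencies" means adding them (the function name `merge_rules` and the sample data are hypothetical):

```python
from collections import Counter


def merge_rules(rules_with_freq):
    """Merge productions of identical form, summing their frequencies."""
    merged = Counter()
    for (lhs, rhs), freq in rules_with_freq:
        merged[(lhs, tuple(rhs))] += freq
    return merged


# After repetition detection, X -> a a and X -> a a a both read X -> a*.
detected = [
    (("X", ["a*"]), 1),
    (("X", ["a*"]), 1),
    (("S", ["X", "Z"]), 2),
]
merged = merge_rules(detected)
print(merged[("X", ("a*",))])   # 2
```

The summed frequency can then feed whatever selection strategy and probability estimate are applied to the merged rule set.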

Fig. 12 is a detailed flowchart of the processing performed in the rule application step S1003 shown in Fig. 10 according to an embodiment of the invention.

As shown in Fig. 12, first, in a rule compilation step S1201, the production rule set generated in the rule acquisition step S1001 is compiled into a rule query table of syntactic analysis rules, for example the rule query table 304 shown in Fig. 3. Then, in a syntactic analysis step S1203, the syntactic analysis rules applicable to the input sentence are looked up in the rule query table 304 through a rule query step S1205, and the grammatical constituents of the input sentence and the relations between them are identified according to those rules. Here, the rule query step S1205 queries the rule query table 304 compiled in the rule compilation step S1201.

Next, in an ambiguity resolution step S1207, the optimal partial analysis result is selected from the partial analysis candidates generated by the syntactic analysis step S1203 and the rule query step S1205. Then, in step S1209, it is judged whether the analysis result obtained meets the requirements. If not, processing returns to the syntactic analysis step S1203, and the partial analysis result selected in the ambiguity resolution step S1207 undergoes further syntactic analysis.

If it is judged in step S1209 that the analysis result obtained after ambiguity resolution meets the requirements, processing advances to step S1211, where the final syntactic analysis result is output.

According to a preferred embodiment of the present invention, in the rule compilation step S1201, intermediate generated tags and intermediate generation rules are added for the parts of the production rule set that contain regular expressions, and the parts of the production rules represented by regular expressions are replaced with the intermediate generated tags. The intermediate generated tags are merged into the phrase tag set of the syntactic analysis rule set, and the intermediate generation rules are merged into the syntactic analysis rule set.

In the syntactic analysis step S1203, intermediate generated tags are identified in the same way as phrase tags, and all syntactic analysis rules applicable to the current input sentence, including intermediate generation rules, are queried through the rule query step S1205 to identify the grammatical constituents of the input sentence and the relations between them.

In the ambiguity resolution step S1207, the optimal production rule to use is determined by calculating the probability P(S, t) of the syntactic analysis tree according to the formula below, so as to select the optimal partial analysis result:

P(S, t) = \prod_{r \in D(T)} p(r)

The rule probabilities for intermediate generated tags are calculated in the same way as those for the phrase symbols already present in syntactic analysis, and the rule probabilities of intermediate generation rules are calculated in the same way as those of non-intermediate generation rules.

For a production rule containing a regular expression, the rule probability is calculated with the formula

p(r) = \prod_{i=1}^{n} p(r_i)

Fig. 13 is a detailed flowchart of the processing performed in the syntax tree generation step S1005 shown in Fig. 10 according to an embodiment of the invention.

As shown in Fig. 13, first, in an intermediate tag cleaning step S1301, the intermediate generated tags used during syntactic analysis are removed. Then, in step S1303, it is judged whether a dependency syntactic relation graph or a phrase-structure syntactic analysis tree is to be generated.

If it is judged in step S1303 that a phrase-structure syntactic analysis tree is required, processing advances to a phrase structure generation step S1305, in which a phrase-structure syntactic analysis tree is generated from the grammatical constituents of the input sentence and the relations between them, as output by the intermediate tag cleaning step S1301 after removal of the intermediate generated tags, and the generated tree is output.

If it is judged in step S1303 that a dependency syntactic relation graph of the input sentence is required, processing advances to a core node labeling step S1307, in which core nodes are labeled according to the grammatical constituents of the input sentence and the relations between them, as output by the intermediate tag cleaning step S1301 after removal of the intermediate generated tags. Then, in a dependency structure generation step S1309, the dependency syntactic relation graph of the input sentence is generated from the core nodes labeled in step S1307, and the generated graph is output.

Specific embodiments of the syntactic analysis device and syntactic analysis method of the present invention have been described in detail above with reference to the accompanying drawings. As the description shows, the device and method use syntactic analysis rules in regular expression form, which increases the descriptive power of the rules and overcomes the inflexibility and weak expressive power of rule representation in existing methods. The rule acquisition method and syntactic analysis algorithm proposed by the present invention can form a parser that supports regular expression rules and performs syntactic analysis efficiently and correctly.

In addition, by introducing regular expressions into production syntactic analysis rules, the syntactic analysis device and method of the present invention provide a way to describe the repeated parts of grammatical structures. The number of repetitions of a repeated fragment is unrestricted, so a single syntactic analysis rule can analyze a whole class of phrases of different lengths. The device and method can therefore describe a grammar more flexibly: they describe both the positional relations of the constituents and the recursive character of local constituents in a syntactic structure. The rules obtained by the present invention are consequently more general and more robust.

In addition, in the training treebank used for learning, phrases with many constituents (that is, phrases with many nodes, called long phrases herein) occur with low frequency. When existing syntactic analysis rules are used, these phrases are usually ignored, so a long phrase encountered during syntactic analysis cannot be analyzed; this is the rule sparseness problem. Long phrases usually contain repeatable parts. With the syntactic analysis device and method of the present invention, the repeated parts of long phrases are merged, so that however many times a part of a phrase repeats, the phrase can be described by the same syntactic analysis rule. This alleviates the rule sparseness problem to some extent.

In addition, when performing syntactic analysis, the syntactic analysis device and method of the present invention add intermediate generation results for the parts of rules that contain regular expressions, which solves the problem of using syntactic analysis rules containing regular expressions during syntactic analysis.

The basic working principle of the present invention has been set forth above using syntax trees taken from Chinese as concrete examples, but the syntactic analysis device and method described herein can equally identify grammatical or semantic constituents in various other languages. The method of the invention can also be used for the analysis of genome sequences or for similar tasks that identify a certain class of constituents in an input symbol sequence. It should therefore be understood that all applications to other languages or symbol systems, and all variations that do not depart from the essential design of the present invention, fall within its scope of protection.

The basic principle of the present invention has been described above in conjunction with specific embodiments. It should be pointed out, however, that those of ordinary skill in the art will understand that all or any of the steps or components of the method and device of the present invention can be implemented in hardware, firmware, software or a combination thereof, in any computing device (including processors, storage media and the like) or network of computing devices. Those of ordinary skill in the art can achieve this with their basic programming skills after reading the description of the present invention.

Therefore, the object of the present invention can also be achieved by running a program or a set of programs on any computing device. The computing device may be a well-known general-purpose device. The object of the present invention can thus also be achieved merely by providing a program product containing program code that implements the method or device. That is, such a program product also constitutes the present invention, and a storage medium storing such a program product also constitutes the present invention. Obviously, the storage medium may be any known storage medium or any storage medium developed in the future.

Where embodiments of the present invention are implemented by software and/or firmware, the program constituting the software is installed from a storage medium or a network onto a computer with a dedicated hardware structure, for example the general-purpose personal computer 700 shown in Fig. 14. With various programs installed, the computer can perform various functions.

In Fig. 14, a central processing unit (CPU) 701 performs various processing according to programs stored in a read-only memory (ROM) 702 or programs loaded from a storage section 708 into a random access memory (RAM) 703. The RAM 703 also stores, as needed, the data required when the CPU 701 performs various processing. The CPU 701, the ROM 702 and the RAM 703 are connected to one another via a bus 704. An input/output interface 705 is also connected to the bus 704.

The following components are connected to the input/output interface 705: an input section 706, including a keyboard, a mouse and the like; an output section 707, including a display, such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a loudspeaker and the like; the storage section 708, including a hard disk and the like; and a communication section 709, including a network interface card such as a LAN card, a modem and the like. The communication section 709 performs communication processing via a network such as the Internet.

A drive 710 is also connected to the input/output interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disc, a magneto-optical disc or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read from it is installed into the storage section 708 as needed.

Where the above series of processing is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 711.

Those skilled in the art will understand that the storage medium is not limited to the removable medium 711 shown in Fig. 14, which stores the program and is distributed separately from the device to provide the program to the user. Examples of the removable medium 711 include a magnetic disk (including a floppy disk (registered trademark)), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disc (including a MiniDisc (MD) (registered trademark)) and a semiconductor memory. Alternatively, the storage medium may be the ROM 702, a hard disk contained in the storage section 708, or the like, which stores the program and is distributed to the user together with the device containing it.

It should also be pointed out that in the device and method of the present invention, the components or steps can obviously be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalents of the present invention. Moreover, the steps of the above series of processing can naturally be performed in chronological order as described, but need not necessarily be performed in that order. Some steps may be performed in parallel or independently of one another.

Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the terms "comprise", "include" and any other variants thereof in this application are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device that comprises the element.

Claims

1. A syntax analysis device, comprising:

The rule acquisition module is configured to learn the syntax analysis rules from the marked training tree bank to generate a production rule set that includes a regular expression form, wherein the repeated part in the latter item of the production rule is represented by a regular expression;

The rule application module is configured to use the production rule set obtained by the rule acquisition module to analyze the input sentence, and identify the grammatical components of the input sentence and the relationship between the components; and

The syntactic tree generation module is configured to generate a dependency syntactic relationship diagram or a phrase structure type syntactic analysis tree of the input sentence according to the grammatical components of the input sentence output by the rule application module and the relationship between the components;

The rule acquisition module includes:

a tree fragment decomposition part configured to decompose each syntax tree in the training tree bank into tree fragments as production rules to form a set of tree fragments;

The repeated segment detection part is configured to detect whether there is a repeated node sequence in the subsequent item of a production rule in the tree segment set produced by the tree segment decomposition part, and to express each production rule with a repeated node sequence as a production rule in regular expression form; and

The repeated rule merging unit is configured to combine the production rules of the same form generated by the repeated segment detection unit into one production rule to form a production rule set;

The rule acquisition module further includes a rule selection part, which is configured to select the production rules generated by the repeated rule merging part according to the selection strategy, so as to generate a reduced production rule set;

The tree fragment decomposed by the tree fragment decomposition part is expressed as <freq:xx>{<f ₁ >…<f _n >}P→Y ₁ Y ₂ …Y _n , n is a positive integer greater than or equal to 1, where

<freq:xx> is the frequency information, indicating the number of occurrences of the tree segment in the training tree bank;

<f _i > is an attribute feature, which is used to describe the context information, vocabulary or semantic features when using the production rule, i is any integer from 1 to n;

P represents the upper node of the tree segment, which is a phrase tag;

Y _i represents the child node of the P node, which is a phrase mark or a vocabulary mark;

“{<f ₁ >…<f _n >}P→Y ₁ Y ₂ …Y _n ” means that, when the attributes <f ₁ >…<f _n > are present, the phrase P can be composed of Y ₁ Y ₂ …Y _n ; and

Wherein, when the repeated rule merging unit merges the production rules of the same form generated by the repeated segment detection unit into one production rule, the frequencies of the production rules are correspondingly combined;

The attribute feature <f _i > is optional in production rules.

2. The syntax analysis device according to claim 1, wherein the production rules include:

Single-node repetitive rules are production rules that repeat more than two times for a certain component in the subsequent item of the production rule;

A multi-node repetitive rule is a production rule that repeats more than two times for a segment in the subsequent item of the production rule;

A mixed-repetition rule is a production rule whose successor contains both a single-node repeat part and a multi-node repeat part; and

A non-repetitive rule is a production rule in which there is no repeated part in the successor of the production rule.

3. The syntax analysis device according to claim 1, wherein the rule application module comprises:

The rule compilation part is configured to compile the production rule set generated by the rule acquisition module into a rule lookup table of syntax analysis rules;

The rule query unit is configured to query the rule query table compiled by the rule compilation unit; and

The syntactic analysis unit is configured to query the syntactic analysis rules applicable to the input sentence in the rule query table through the rule query unit, and identify the grammatical components of the input sentence and the relationship between the components according to the syntactic analysis rules.

4. The syntax analysis device according to claim 3, wherein the rule application module further comprises an ambiguity resolution unit configured to select an optimal local analysis result from the local analysis candidates generated by the syntax analysis unit; and

The syntactic analysis unit performs further syntactic analysis on the partial analysis results selected by the ambiguity resolution unit, so as to output the final analysis results that meet the requirements.

5. The syntax analysis device according to claim 4, wherein:

The rule compilation part adds intermediate generation tags and intermediate generation rules for the parts of the production rule set that contain regular expressions, and replaces the parts represented by regular expressions in the production rules with the intermediate generation tags, the intermediate generation tags being incorporated into the phrase tag set of the syntax analysis rule set and the intermediate generation rules being incorporated into the syntax analysis rule set; and

The syntactic analysis part identifies the intermediate generation tags in the same way as it identifies phrase tags, and queries, through the rule query part, all the syntactic analysis rules, including the intermediate generation rules, that are applicable to the current input sentence, to identify the grammatical components of the input sentence and the relationship between the components.

6. The syntax analysis device according to claim 5, wherein the disambiguation part determines the optimal production rule used by calculating the probability P (S, t) of the syntax analysis tree according to the following formula, to select the optimal local analysis result:

P(S, t) = \prod_{r \in D(T)} p(r)

where S is the input sentence, t is the syntactic analysis tree, r is a syntactic analysis rule used in the syntactic analysis process, D(T) is the set of all syntactic analysis rules used to generate the syntactic analysis tree t, and p(r) is the rule probability of the syntactic analysis rule r.

7. The syntax analysis device according to claim 6, wherein:

the calculation of rule probabilities for intermediate generating tokens is the same as for phrase tokens present in the syntactic analysis, and the calculation of rule probabilities for intermediate generating rules is the same as for non-intermediate generating rules; and

for production rules that contain regular expressions, the rule probability is calculated using the formula [formula not reproduced], where r_i is an intermediate generation rule used when converting the rule containing regular expressions, p(r_i) is the rule probability of the intermediate generation rule r_i, n is an integer greater than or equal to 1, and i is any integer from 1 to n.

8. The syntax analysis device according to any one of claims 5 to 7, wherein the syntax tree generation module comprises:

an intermediate token cleanup section configured to clean intermediate generated tokens used during parsing; and

a phrase structure generation section configured to generate a phrase-structure syntactic analysis tree according to the grammatical components of the input sentence and the relationship between the components, as output by the intermediate token cleanup section after the intermediate generated tokens have been removed.
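One way to picture the cleanup is as splicing intermediate nodes out of the tree and attaching their children to the nearest non-intermediate ancestor. The sketch below assumes trees are nested tuples of the form `(label, child, ...)` with words as plain strings, and that intermediate tags start with `@`; these representations are assumptions made for illustration.

```python
def clean(tree, is_intermediate=lambda label: label.startswith("@")):
    """Return a copy of `tree` with intermediate nodes spliced out.

    An intermediate root node is left in place; only inner nodes are removed.
    """
    if isinstance(tree, str):  # leaf word
        return tree
    label, *children = tree
    new_children = []
    for child in children:
        cleaned = clean(child, is_intermediate)
        if not isinstance(cleaned, str) and is_intermediate(cleaned[0]):
            new_children.extend(cleaned[1:])  # promote the grandchildren
        else:
            new_children.append(cleaned)
    return (label, *new_children)


parsed = ("NP", ("DT", "the"),
          ("@JJ+", ("@JJ+", ("JJ", "big")), ("JJ", "red")),
          ("NN", "ball"))
print(clean(parsed))
```

The cleaned tree contains only phrase tags and part-of-speech tags, so it can be output directly as a phrase structure tree.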

9. The syntax analysis device according to any one of claims 5 to 7, wherein the syntax tree generation module comprises:

an intermediate token cleaning section configured to remove intermediate generated tokens used during parsing;

a core node labeling part configured to label core nodes according to the grammatical components of the input sentence and the relationship between the components, as output by the intermediate token cleaning section after the intermediate generated tokens have been removed; and

a dependency structure generation part configured to generate a dependency syntactic relationship graph of the input sentence according to the core nodes labeled by the core node labeling part.
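The step from labeled core nodes to a dependency graph can be sketched as follows. The claims do not say how core nodes are chosen, so this example assumes a small head table mapping each phrase tag to the tag of its core child, falling back to the last child. The table, the tuple tree format, and the use of bare words as node identifiers (which would conflate repeated words) are all simplifying assumptions.

```python
HEAD_CHILD = {"S": "VP", "VP": "VB", "NP": "NN"}  # hypothetical head table


def head_word(tree, deps):
    """Return the head word of `tree`, appending (dependent, head) pairs to deps."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return children[0]  # preterminal: its word is the head
    heads = [head_word(child, deps) for child in children]
    wanted = HEAD_CHILD.get(label)
    # Core child: first child carrying the wanted tag, else the last child.
    core = next((i for i, c in enumerate(children) if c[0] == wanted),
                len(children) - 1)
    for i, h in enumerate(heads):
        if i != core:
            deps.append((h, heads[core]))  # non-core children depend on the core
    return heads[core]


tree = ("S",
        ("NP", ("DT", "the"), ("NN", "dog")),
        ("VP", ("VB", "chased"), ("NP", ("DT", "a"), ("NN", "cat"))))
deps = []
root = head_word(tree, deps)
print(root, deps)
```

Each pair reads as "dependent, head", and the returned word is the root of the dependency graph.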

10. A method of syntactic analysis, comprising:

A rule acquisition step of learning syntactic analysis rules from an annotated training treebank to generate a production rule set containing regular expressions, wherein the repeated part in the right-hand side (consequent) of a production rule is represented by a regular expression;

A rule application step of analyzing the input sentence using the production rule set obtained in the rule acquisition step, and identifying the grammatical components of the input sentence and the relationship between the components; and

A syntax tree generation step of generating, according to the grammatical components of the input sentence and the relationship between the components output by the rule application step, a dependency syntactic relationship graph or a phrase-structure syntactic analysis tree of the input sentence;

wherein the rule acquisition step comprises:

A tree fragment decomposition step of decomposing each syntax tree in the training treebank into tree fragments as production rules to form a tree fragment set;

A repeated fragment detection step of detecting whether there are repeated node sequences in the right-hand sides of the production rules in the tree fragment set produced by the tree fragment decomposition step, and expressing the production rules that have repeated node sequences as production rules in regular expression form; and

A repeated rule merging step of merging the production rules of the same form generated by the repeated fragment detection step into one production rule to form a production rule set;

The rule acquisition step also includes a rule selection step of selecting, according to a selection strategy, among the production rules generated in the repeated rule merging step, so as to generate a reduced production rule set;

The tree fragments decomposed in the tree fragment decomposition step are expressed as <freq:xx>{<f_1>…<f_n>}P→Y_1 Y_2…Y_n, where n is a positive integer greater than or equal to 1, and where

P represents the upper node of the tree fragment, which is a phrase tag;

"{<f_1>…<f_n>}P→Y_1 Y_2…Y_n" means that, given the attributes <f_1>…<f_n>, the phrase P can be composed of Y_1 Y_2…Y_n; and

wherein, when the repeated rule merging step merges the production rules of the same form generated by the repeated fragment detection step into one production rule, the frequencies of the merged production rules are combined accordingly; and

the attribute feature <f_i> is optional in production rules.
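The decomposition, repeated fragment detection, and merging steps of claim 10 can be sketched under simplifying assumptions: trees are nested tuples `(label, child, ...)` with words as strings, only runs of a single repeated symbol are detected (the claim speaks of repeated node sequences in general), attribute features are omitted, and a run of two or more identical symbols is written `X+`. All function names are hypothetical.

```python
from collections import Counter
from itertools import groupby


def productions(tree):
    """Yield (lhs, rhs) for every internal node of a nested-tuple tree."""
    label, *children = tree
    if all(isinstance(c, str) for c in children):
        return  # preterminal over a word: no phrase-level rule
    yield label, tuple(c[0] for c in children)
    for child in children:
        yield from productions(child)


def collapse_repeats(rhs):
    """Rewrite each run of one repeated symbol as `X+`."""
    out = []
    for sym, run in groupby(rhs):
        out.append(sym + "+" if len(list(run)) > 1 else sym)
    return tuple(out)


def acquire_rules(treebank):
    """Count rules; rules of the same form merge and their frequencies add up."""
    counts = Counter()
    for tree in treebank:
        for lhs, rhs in productions(tree):
            counts[(lhs, collapse_repeats(rhs))] += 1
    return counts


treebank = [
    ("NP", ("JJ", "big"), ("JJ", "red"), ("NN", "ball")),
    ("NP", ("JJ", "old"), ("JJ", "grey"), ("JJ", "stone"), ("NN", "wall")),
]
for (lhs, rhs), freq in acquire_rules(treebank).items():
    print(f"<freq:{freq}>{lhs}→{' '.join(rhs)}")
```

Here the fragments NP→JJ JJ NN and NP→JJ JJ JJ NN both take the form NP→JJ+ NN, so they merge into one rule with frequency 2. A rule selection step could then, for instance, drop rules below a frequency threshold; that strategy is left open by the claim.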

11. The syntax analysis method according to claim 10, wherein the production rules include:

12. The syntax analysis method according to claim 10, wherein the rule application step comprises:

A rule compilation step of compiling the production rule set generated by the rule acquisition step into a rule lookup table of syntactic analysis rules;

A rule query step of querying the rule lookup table compiled by the rule compilation step; and

A syntactic analysis step of searching, through the rule query step, the rule lookup table for syntactic analysis rules that can be applied to the input sentence, and identifying the grammatical components of the input sentence and the relationship between the components according to those syntactic analysis rules.
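The claims do not specify the layout of the rule lookup table. One plausible arrangement, shown here purely as an assumption, indexes rules by their right-hand side so that a bottom-up parser can ask which phrase tags a given symbol sequence can form.

```python
from collections import defaultdict


def compile_table(rules):
    """Index (lhs, rhs) rules by right-hand side."""
    table = defaultdict(list)
    for lhs, rhs in rules:
        table[tuple(rhs)].append(lhs)
    return table


def query(table, symbols):
    """Return every phrase tag that the given symbol sequence can be reduced to."""
    return table.get(tuple(symbols), [])


# Hypothetical rules; "@JJ+" stands for an intermediate generation tag.
table = compile_table([
    ("NP", ("DT", "@JJ+", "NN")),
    ("@JJ+", ("JJ",)),
    ("@JJ+", ("@JJ+", "JJ")),
])
print(query(table, ["DT", "@JJ+", "NN"]))
print(query(table, ["VB"]))
```

Using `table.get` rather than indexing keeps a failed query from inserting an empty entry into the `defaultdict`.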

13. The syntax analysis method according to claim 12, wherein the rule application step further comprises an ambiguity resolution step, selecting an optimal local analysis result from the local analysis candidates generated by the syntax analysis step; and

The syntactic analysis step performs further syntactic analysis on the local analysis results selected by the ambiguity resolution step, so as to output the final analysis results that meet the requirements.

14. The syntax analysis method according to claim 13, wherein:

The rule compilation step adds intermediate generation tags and intermediate generation rules for the part of the production rule set that contains regular expressions, and replaces the part represented by a regular expression in each production rule with an intermediate generation tag, wherein the intermediate generation tags are incorporated into the phrase tag set of the syntactic analysis rule set, and the intermediate generation rules are incorporated into the syntactic analysis rule set; and

The syntactic analysis step identifies intermediate generation tags in the same manner as it identifies phrase tags, and queries, through the rule query step, all syntactic analysis rules applicable to the current input sentence, including the intermediate generation rules, to identify the grammatical components of the input sentence and the relationship between the components.

15. The syntax analysis method according to claim 14, wherein the ambiguity resolution step determines the optimal production rule to use by calculating the probability P(S, t) of the syntactic analysis tree according to the following formula, so as to select the optimal local analysis result:

$$P(S, t) = \prod_{r \in D(T)} p(r)$$

16. The syntax analysis method according to claim 15, wherein:

17. The syntax analysis method according to any one of claims 14 to 16, wherein the syntax tree generation step comprises:

an intermediate token cleanup step that removes intermediate generated tokens used during parsing; and

a phrase structure generation step of generating a phrase-structure syntactic analysis tree according to the grammatical components of the input sentence and the relationship between the components, as output by the intermediate token cleanup step after the intermediate generated tokens have been removed.

18. The syntax analysis method according to any one of claims 14 to 16, wherein the syntax tree generation step comprises:

an intermediate token cleaning step that removes intermediate generated tokens used during the parsing process;

a core node labeling step of labeling core nodes according to the grammatical components of the input sentence and the relationship between the components, as output by the intermediate token cleaning step after the intermediate generated tokens have been removed; and

a dependency structure generation step of generating a dependency syntactic relationship graph of the input sentence according to the core nodes labeled in the core node labeling step.