CN103020148A

CN103020148A - System and method for converting a Chinese phrase structure treebank into a dependency structure treebank

Info

Publication number: CN103020148A
Application number: CN2012104798011A
Authority: CN
Inventors: 邱锡鹏; 赵建双
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2012-11-23
Filing date: 2012-11-23
Publication date: 2013-04-03

Abstract

The invention belongs to the technical field of natural language processing and specifically relates to a system and a method for converting a Chinese phrase structure treebank into a dependency structure treebank. The method comprises the following steps: splitting complex tree structures; building a more accurate core mapping table; handling special Chinese constructions with rule-based methods; establishing a dependency relation annotation standard; and determining dependency relation types with rule-based methods. The system comprises a splitter for splitting long sentences in the treebank into short sentences, the core mapping table for obtaining the initial dependency head node of each phrase, a dependency rule device for determining the final dependency head node of each phrase, and a dependency relation normalizer for determining the final dependency relation between phrases and forming the final dependency treebank. The invention converts the Penn Chinese Treebank into a dependency structure treebank that is more accurate, more standardized, and more reasonable.

Description

A system and method for converting a Chinese phrase structure treebank into a dependency structure treebank

Technical field

The invention belongs to the technical field of natural language processing and specifically relates to a system and method for converting a Chinese phrase structure treebank into a dependency structure treebank.

Background technology

With the development of natural language processing, rule-based research methods have gradually shown their limitations, and researchers increasingly favor statistical methods that learn the rules of natural language from real corpora. Syntactic parsing occupies a core position in natural language processing, and its performance strongly affects other technologies. Statistical methods are also the mainstream approach to syntactic parsing, so corpus data plays an important role in it. The accuracy and scale of a corpus fundamentally determine parsing performance: without a large-scale, high-accuracy corpus, even a good algorithm loses its effect. Treebanks, as corpora in which sentences carry deep syntactic annotation, are attracting more and more interest.

Researchers have carried out a large amount of research and development on treebanks and have achieved considerable results. The annotation schemes adopted by these treebanks differ greatly, but by descriptive method they fall broadly into two kinds: phrase structure trees and dependency trees. Worldwide, most large-scale treebanks are based on phrase structure. Among Chinese treebanks, those annotated with phrase structure also dominate, the best known being the Penn Chinese Treebank of the University of Pennsylvania.

Among grammar formalisms, dependency grammar has gradually attracted researchers' attention because its form is concise, it is easy to annotate, and it is convenient to apply. The scarcity of Chinese treebanks based on dependency syntax has undoubtedly limited the development of Chinese parsing. Annotating a treebank requires a complete annotation scheme and a standardized annotation workflow to guarantee quality, which is time-consuming and laborious. Research has found that although phrase structure and dependency structure differ in form, both describe the syntactic structure of a sentence and are therefore structurally consistent. Since phrase structure treebanks are now plentiful, a phrase structure can be converted into a dependency structure according to the correspondence between the two, which yields the desired dependency treebank and avoids a large amount of manual annotation.

Many researchers in China and abroad have attempted to convert phrase structure treebanks into dependency treebanks. The mainstream method uses a core node mapping table to find the core node of each level, makes every other node at the same level depend on that core node, and traverses the whole structure tree recursively. The treebank conversion tool PENN2MALT is a mainstream tool based on this idea. It provides core node mapping tables for the Penn Treebank and the Penn Chinese Treebank, and both the tables and its executable file are freely available.

PENN2MALT achieves good results when converting the English corpus of the Penn Treebank. However, because Chinese is complex and the rules of PENN2MALT are simple, the results of converting the Penn Chinese Treebank corpus with PENN2MALT are not very good, and training a dependency parser on the converted corpus degrades the parser's final performance. Therefore, according to the characteristics of Chinese, we defined a large number of rules and developed our own rule-based conversion tool. Compared with the corpus converted by PENN2MALT, the corpus converted by this tool is more accurate and more standardized.

Summary of the invention

The object of the invention is to propose a rule-based Chinese treebank conversion system and method that converts the Penn Chinese Treebank phrase structure treebank into a more reasonable and more standardized dependency treebank.

The method proposed by the invention for converting a Chinese phrase structure treebank into a dependency structure treebank comprises the following steps:

1) Read in the Penn Chinese Treebank and use the splitter to split the long sentences in the treebank into short sentences.

2) Determine the final core mapping table and use it to obtain the initial dependency head node of each word.

3) Determine the final dependency head node of each word with the dependency rule device.

4) Establish the dependency relation type annotation standard and use the dependency relation normalizer to determine the final dependency relation between words, forming the final dependency treebank.

The invention mainly comprises: splitting complex tree structures; building a more accurate core mapping table and excluding punctuation, modal particles, and interjections as core words; handling special Chinese grammatical structures with rule-based methods; establishing a dependency relation type annotation standard; and determining dependency relation types with rule-based methods. The main contents of the invention are introduced one by one below.

One, splitting complex tree structures

The Penn Chinese Treebank contains many long sentences. When such a sentence is annotated as a structure tree, the structure is very complex: the tree may contain several root nodes that hold no dependency relation with one another, so converting such a long sentence directly into a dependency tree greatly reduces the accuracy of the dependency treebank. The invention uses a splitter to cut these long sentences into several short sentences, each of which forms an independent structure tree, thereby reducing the complexity of the structure trees. The regenerated structure trees are then converted into dependency trees, giving a dependency treebank of higher accuracy and standardization. The specific rule is: according to the tree structure, among the child nodes of the root node, those that are commas or semicolons are taken as split points; the long sentence is split into short sentences at these points, and each tree after splitting takes the original root node as its root node.
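The splitting rule above can be sketched as follows. This is a minimal illustration, not the patented implementation: the tuple-based tree representation, the names `split_long_sentence` and `is_split_point`, and the choice to keep each separator as the last child of the piece before it are all assumptions made for the example.

```python
# A tree is (label, children) for a phrase or (pos_tag, word) for a leaf.
# This representation is hypothetical and chosen only for the sketch.
SPLIT_TOKENS = {"，", "；", ",", ";"}  # comma and semicolon, full- and half-width


def is_split_point(node):
    """Return True for a leaf whose word is a comma or semicolon."""
    _label, body = node
    return isinstance(body, str) and body in SPLIT_TOKENS


def split_long_sentence(tree):
    """Cut the root's child list at comma/semicolon leaves.

    Every piece is re-rooted under the original root label. Keeping the
    separator at the end of the preceding piece is an assumption.
    """
    root_label, children = tree
    pieces, current = [], []
    for child in children:
        current.append(child)
        if is_split_point(child):
            pieces.append((root_label, current))
            current = []
    if current:  # trailing material after the last split point
        pieces.append((root_label, current))
    return pieces
```

A tree whose root has no comma or semicolon among its children comes back as a single piece, which matches the case of a sentence that does not need to enter the splitter.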

Two, building a more accurate core mapping table

Although the source code of the PENN2MALT conversion tool is not open, its core mapping table has been published. Extensive experiments carried out for the invention found that corpora converted with the published core mapping table are not very satisfactory, so a new core mapping table, shown in Table 1, was built from a study of the Penn Chinese Treebank.

Table 1

The core mapping table determines which child node in the structure tree is the core node of its parent node. Each parent node label in the table has a rule set. A rule set has two parts: a scan direction and a set of core phrase types. In the table, l means that the child node sequence is scanned from left to right, and r means that it is scanned from right to left.

The core node of each node can be obtained from the core mapping table with the following algorithm.

1. Check whether the node's label is in the table. If it is not, leave the node unprocessed; otherwise go to step 2.

2. Scan the node's child nodes in the direction given by the first rule in its rule set. If the rule's set of core phrase types is empty, take the first node scanned as the core node and go to step 3. Otherwise, look for each label in the set of core phrase types in turn, rescanning the children for each label. If a match is found, go to step 3; if not, apply the second rule and then the third rule in the same way.

3. Check whether the node is a leaf node. If it is, stop; otherwise recursively perform steps 1, 2, and 3 on each of its children.
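Steps 1 and 2 of the algorithm above can be sketched for a single node as follows. This is an illustrative sketch only. The `NP` entry uses the scan direction and label order given for NP in Embodiment 1; the trailing empty-set fallback rule, the names `HEAD_TABLE`, `base_label`, and `find_head_index`, and the stripping of function tags such as `-SBJ` are assumptions, since Table 1 is not reproduced in this text. The recursion of step 3 is omitted.

```python
# Each parent label maps to an ordered list of rules.
# A rule is (direction, labels): direction "l" scans left to right,
# "r" scans right to left; labels are tried in the listed order.
# Only the first NP rule comes from the description; the ("r", []) fallback
# is a hypothetical addition for the sketch.
HEAD_TABLE = {
    "NP": [("r", ["NP", "NN", "NT", "NR", "QP", "IP", "PN"]), ("r", [])],
}


def base_label(label):
    """Strip function tags, e.g. NP-SBJ -> NP (assumed normalization)."""
    return label.split("-")[0]


def find_head_index(parent_label, child_labels):
    """Return the index of the core (head) child, or None if no rule applies."""
    rules = HEAD_TABLE.get(base_label(parent_label))
    if rules is None:
        return None  # step 1: label not in the table, node left unprocessed
    order = list(range(len(child_labels)))
    for direction, wanted in rules:
        scan = list(reversed(order)) if direction == "r" else order
        if not wanted:
            # Empty set of core phrase types: first scanned child is the core.
            return scan[0] if scan else None
        for label in wanted:  # one full scan of the children per label
            for i in scan:
                if base_label(child_labels[i]) == label:
                    return i
    return None
```

With three `NN` children under `NP-SBJ`, the scan for `NP` finds nothing and the scan for `NN` returns the rightmost child, which is the behavior described for the noun phrase in Embodiment 1.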

Three, excluding punctuation, modal particles, and interjections as core words

Because the core mapping table does not list every case, a final r or l rule makes the rightmost or leftmost word the core word. As a result, many sentences end up with a punctuation mark, modal particle, or interjection as the core word, even though these only play an auxiliary role in a sentence and cannot act as core words. When looking up the core node with the core mapping table, the invention excludes punctuation, modal particles, and interjections as core nodes, which makes the converted corpus more accurate, more standardized, and more reasonable.
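The exclusion described above can be sketched as a variant of the fallback scan. This is a hedged illustration: mapping "punctuation, modal particles, interjections" to the Penn Chinese Treebank part-of-speech tags `PU`, `SP`, and `IJ` is an assumption, as are the names `NON_HEAD_TAGS` and `fallback_head_index` and the behavior when every child is excluded.

```python
# POS tags that should not become the core word. Treating PU (punctuation),
# SP (sentence-final particle) and IJ (interjection) as the excluded classes
# is an assumption made for this sketch.
NON_HEAD_TAGS = {"PU", "SP", "IJ"}


def fallback_head_index(child_labels, direction):
    """Pick the first child in scan order that is allowed to be a core word.

    direction is "l" (left to right) or "r" (right to left). If every child
    is excluded, the plain first-scanned child is returned (assumption).
    """
    order = list(range(len(child_labels)))
    if direction == "r":
        order.reverse()
    for i in order:
        if child_labels[i] not in NON_HEAD_TAGS:
            return i
    return order[0] if order else None
```

For example, scanning `["NP", "VP", "SP", "PU"]` from the right skips the punctuation mark and the particle and selects the `VP`.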

Four, handling special Chinese grammatical structures with rule-based methods

Chinese has some special grammatical structures for which the core mapping table alone gives inaccurate results. The invention studies these special syntactic structures in depth, identifies a large number of rules, and builds a dependency rule device that uses these rules to obtain more accurate and reasonable results.

1. The "把" (ba) structure and the "被" (bei) structure

The "把" sentence and the "被" sentence are special sentence patterns in Chinese. They have received much attention in linguistics and have been studied extensively. Given the annotation structure of the Penn Chinese Treebank, if their dependencies are found only with core mapping table rules, as PENN2MALT does, the resulting dependencies do not conform to Chinese grammar and semantics. We therefore defined our own dependency standard for them according to the grammatical features of Chinese and implemented it with rules. The rule is: among the children of the node that immediately follows the "把" or "被" node, if there is a subject-predicate or subject-verb-object structure, the subject and the predicate both depend on the "把" or "被" node and act as its objects.
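As a rough illustration of the rule above, the sketch below re-attaches a subject and a predicate to a "把" or "被" marker in a flat head array. The head-array representation, the function name `apply_ba_bei_rule`, the toy indices, and in particular the step that lets the marker inherit the predicate's former attachment (so that the result remains a tree) are assumptions; the description only states that the subject and predicate depend on the marker as its objects.

```python
ROOT = -1  # hypothetical head index used for the root word


def apply_ba_bei_rule(heads, relations, marker, subject, predicate):
    """Make the subject and predicate depend on the marker as "object".

    heads[i] is the index of word i's head; relations[i] is its label.
    marker, subject and predicate are word indices supplied by the caller.
    """
    new_heads = list(heads)
    new_relations = list(relations)
    # Assumption: the marker takes over the predicate's old attachment,
    # which avoids a cycle when the marker used to depend on the predicate.
    if heads[predicate] != marker:
        new_heads[marker] = heads[predicate]
        new_relations[marker] = relations[predicate]
    for index in (subject, predicate):
        new_heads[index] = marker
        new_relations[index] = "object"
    return new_heads, new_relations
```

With the toy words `["mosquito", "BA", "mycin", "suck"]`, initial heads `[3, 3, 3, ROOT]`, and `marker=1, subject=2, predicate=3`, the marker becomes the root and both "mycin" and "suck" attach to it as objects, while "mosquito" stays attached to the verb.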

2. The "得" (de) structure

The "得" structure occurs frequently in Chinese. It usually appears in a verb phrase, immediately after the verb, and acts as an auxiliary to that verb. The core word of the "得" node is clearly the verb before it, whereas the core word of the object after it is the "得" node.

3. Coordinate structures

Chinese often has sentences with coordinate structures, and in the Penn Chinese Treebank the coordinated nouns of such a sentence are annotated inside a single phrase structure. None of these coordinated nouns depends on another, and such a structure has not just one core word but also several secondary core words, so a core mapping table alone is not enough. The invention defines a standard: the foremost noun is the core word; a conjunction connecting coordinated nouns depends on the noun that follows it; and if the coordinated nouns are separated by pause marks (、), each pause mark depends on the noun before it.
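The coordination standard above can be sketched on a flat sequence of part-of-speech tags. This is only an illustration under stated assumptions: "foremost noun" is read as the first noun in the phrase, the remaining conjunct nouns are attached to that first noun (the description does not say where they attach), the tags `NN`, `NR`, `NT`, `CC`, and `PU` are used to recognize nouns, conjunctions, and pause marks, and the name `coordination_heads` is hypothetical.

```python
NOUN_TAGS = {"NN", "NR", "NT"}  # assumed noun tags


def coordination_heads(tags):
    """Return, for each position, the in-phrase index of its head.

    -1 marks the core word of the phrase (the first noun).
    """
    nouns = [i for i, tag in enumerate(tags) if tag in NOUN_TAGS]
    if not nouns:
        raise ValueError("coordinate phrase contains no noun")
    first = nouns[0]
    heads = []
    for i, tag in enumerate(tags):
        if i == first:
            heads.append(-1)
        elif tag == "CC":
            # A conjunction depends on the noun that follows it.
            following = [n for n in nouns if n > i]
            heads.append(following[0] if following else first)
        elif tag == "PU":
            # A pause mark depends on the noun before it.
            preceding = [n for n in nouns if n < i]
            heads.append(preceding[-1] if preceding else first)
        else:
            # Assumption: other conjuncts attach to the first noun.
            heads.append(first)
    return heads
```

For the tag sequence `["NN", "PU", "NN", "CC", "NN"]`, the first noun is the core word, the pause mark attaches to the noun on its left, and the conjunction attaches to the noun on its right.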

4. Special verb phrase structures

The Penn Chinese Treebank has many annotation conventions, including labels that distinguish certain special verb structures. These labels can be used directly to standardize the dependency syntax. The labels of these special verb phrase structures are VCD, VRD, VSB, VCP, VPT, and VNV. From a study of these special verb phrase structures, we defined the dependency standard shown in Table 2:

Table 2

Five, establishing a dependency relation type annotation standard

Because the rules of PENN2MALT are simple and Chinese itself is complex, the dependency relation types defined by PENN2MALT are not very accurate. From an in-depth study of the Penn Chinese Treebank corpus and of the converted dependency treebank corpus, the invention defines the annotation standard shown in Table 3:

Table 3

Six, determining dependency relation types with rule-based methods

The invention uses the dependency relation normalizer to find the dependency relation between words from two aspects:

1. From the annotation labels of the Penn Chinese Treebank.

2. From the characteristics of the word itself and of the word it depends on.

For the first aspect, the specific rules are:

1) In the Penn Chinese Treebank, a node labeled DVP or ADVP is given the adverbial relation to its core word; a node labeled DNP, DP, or ADJP is given the attribute relation to its core word.

2) In the Penn Chinese Treebank, a node whose label carries -SUB is given the subject relation to its core word; a node whose label carries -OBJ is given the object relation; a node whose label carries -ADV is given the adverbial relation; a node whose label carries -EXT is given the complement relation.

3) In the Penn Chinese Treebank, under a node labeled VRD, VCP, or VPT, the relation of the non-core node is complement; under a node labeled VCD, the relation of the non-core node is coordination; under a node labeled VSB, the relation of the non-core node is serial verb; under a node labeled VNV, the relation of the non-core node is interrogative serial verb.
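The three first-aspect rules above amount to table lookups on treebank labels, which can be sketched as follows. This is an illustrative sketch, not the patented implementation: the dictionary and function names, the English relation names, and the way a label is split into a base category and function tags are assumptions. The function tags are written exactly as listed in the rules above.

```python
# Rule 1): relation determined by the dependent's own phrase label.
PHRASE_RELATIONS = {
    "DVP": "adverbial", "ADVP": "adverbial",
    "DNP": "attribute", "DP": "attribute", "ADJP": "attribute",
}
# Rule 2): relation determined by a function tag on the dependent's label.
FUNCTION_TAG_RELATIONS = {
    "SUB": "subject", "OBJ": "object", "ADV": "adverbial", "EXT": "complement",
}
# Rule 3): relation of a non-core child, determined by the parent's label.
COMPOUND_RELATIONS = {
    "VRD": "complement", "VCP": "complement", "VPT": "complement",
    "VCD": "coordination", "VSB": "serial verb",
    "VNV": "interrogative serial verb",
}


def relation_from_labels(node_label, parent_label):
    """Apply rules 1), 2), 3) in that order to a non-core node.

    Returns the relation name, or None when no first-aspect rule matches.
    """
    parts = node_label.split("-")
    base, tags = parts[0], parts[1:]
    if base in PHRASE_RELATIONS:
        return PHRASE_RELATIONS[base]
    for tag in tags:
        if tag in FUNCTION_TAG_RELATIONS:
            return FUNCTION_TAG_RELATIONS[tag]
    return COMPOUND_RELATIONS.get(parent_label.split("-")[0])
```

Keeping the three rule groups as separate tables makes it easy to reorder or extend them without touching the lookup logic.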

For the second aspect, the specific rules are shown in Table 4:

Table 4

Because these rules can conflict, they must be assigned priorities. From high to low, the priorities are: first, the second-aspect rules in the table whose dependency relation types are root node, tense, mood, exclamation, punctuation, the word structure, the word structure, the "得" structure, and the "地" structure; then rules 1), 2), and 3) of the first aspect; and finally the second-aspect rules in the table whose dependency relation types are coordination, association, preposition-object, quantity, subject, object, attribute, adverbial, and complement. Applying the rules strictly in this priority order yields accurate dependency relations.
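The priority scheme above can be sketched as an ordered list of rule functions in which the first rule that fires decides the relation. This is a minimal sketch under stated assumptions: Table 4 is not reproduced in this text, so the three sample rules are stand-ins (one for each priority tier), and the context keys `pos`, `head_pos`, `side`, and `label`, the tag sets, and all function names are hypothetical.

```python
def punctuation_rule(ctx):
    """High-priority tier: a punctuation word gets the punctuation relation."""
    return "punctuation" if ctx["pos"] == "PU" else None


def object_tag_rule(ctx):
    """Middle tier: a treebank label carrying -OBJ gives the object relation."""
    return "object" if ctx["label"].endswith("-OBJ") else None


def subject_rule(ctx):
    """Low-priority tier: a noun to the left of a verb core word is a subject."""
    if (ctx["pos"] in {"NN", "NR", "PN"}
            and ctx["head_pos"] == "VV"
            and ctx["side"] == "left"):
        return "subject"
    return None


# Rules listed from highest to lowest priority.
RULES_BY_PRIORITY = [punctuation_rule, object_tag_rule, subject_rule]


def decide_relation(ctx, rules=RULES_BY_PRIORITY):
    """Return the relation from the first rule that matches, else None."""
    for rule in rules:
        relation = rule(ctx)
        if relation is not None:
            return relation
    return None
```

A noun left of a verb whose treebank label carries `-OBJ` matches both the label rule and the subject rule; because the label rule sits higher in the list, the result is "object", which is how the strict ordering resolves a conflict.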

The invention also provides a system for converting a Chinese phrase structure treebank into a dependency structure treebank, characterized in that the system comprises:

a splitter, for splitting the long sentences in the treebank into short sentences;

a core mapping table, for obtaining the initial dependency head node of each word;

a dependency rule device, for determining the final dependency head node of each word;

a dependency relation normalizer, for determining the final dependency relation between words and forming the final dependency treebank.

The rule-based Chinese treebank conversion system and method provided by the invention convert the Penn Chinese Treebank phrase structure treebank into a dependency treebank that is more accurate, more standardized, and more reasonable.

Description of drawings

Fig. 1: system flowchart.

Fig. 2: example of a phrase structure tree.

Fig. 3: example of the dependency structure tree obtained with the core mapping table alone.

Fig. 4: example of the final dependency structure tree after all processing.

Embodiment

The invention is described in further detail below with reference to the drawings and an embodiment.

Fig. 1 is the flowchart of the system of the invention. Its specific steps are:

A) read in the Penn Chinese Treebank and use the splitter to split the long sentences in the treebank into short sentences;

B) determine the final core mapping table and use it to obtain the initial dependency head node of each word;

C) determine the final dependency head node of each word with the dependency rule device;

D) establish the dependency relation type annotation standard and use the dependency relation normalizer to determine the final dependency relation between words, forming the final dependency treebank.

Embodiment 1

Consider a sentence from the Penn Chinese Treebank that translates roughly as "When feeding, the mosquito sucks the plant cell mycin into its body." Its phrase structure is shown in Fig. 2, and the dependency structure tree after final conversion is shown in Fig. 4. The specific steps by which this phrase structure tree is converted into a dependency tree are analyzed below.

First, the phrase structure tree is read in as a tree structure. Because there is no comma or semicolon among the child nodes of its root node, it is judged not to be a long sentence and does not need to enter the splitter.

Next, the rules of the core mapping table are applied. The process is illustrated with one noun phrase from this sentence. As shown in Fig. 2, the three words "plant", "cell", and "mycin" form a noun phrase whose parent node is NP-SBJ, so the rule set for the parent node NP is looked up in the mapping table. Because the first column of the rule set is r, the children are scanned from right to left; in this rule set the labels to look for are, in order, NP, NN, NT, NR, QP, IP, and PN. Since all three words have the part of speech NN, the first scan finds no NP node. In the second scan, which looks for NN, the first word scanned is the core node, so the node "mycin" is taken as the core word of this phrase. Scanning of this phrase then ends and the next phrase is processed. The complete phrase structure tree is traversed in this way, and the resulting dependency structure tree is shown in Fig. 3.

Then, the dependency rule device finds that this sentence has the characteristics of the "把" structure, so the "把" structure rule in the rule device is applied. Because the children of the node immediately following the "把" node form a subject-verb-object structure, the subject and the predicate in this structure both depend on "把" and act as its objects. That is, the two words "mycin" and "suck in" both depend on "把", and their dependency relation is set to "object".

Finally, the dependency relation normalizer determines the dependency relations. For example, the word "mosquito" in this sentence is a noun, its core word is a verb, and it stands to the left of its core word, so its dependency relation is defined as "subject". Applying the rules in the dependency relation normalizer according to their assigned priorities yields the final dependency structure tree shown in Fig. 4.

Because no evaluation standard has yet been established for Chinese dependency structure treebanks, automatic evaluation cannot be used. To verify the conversion, 400 samples were drawn at random from the final conversion results of the system and checked manually; the results are shown in Table 5:

Table 5

The above data show that the invention achieves good results, with high accuracy and a high degree of standardization.

Claims

1. A method for converting a Chinese phrase structure treebank into a dependency structure treebank, characterized in that the specific steps are as follows:

2. The method according to claim 1, characterized in that: the splitter in step a), according to the tree structure, takes the child nodes of the root node that are commas or semicolons as split points and splits the long sentence into short sentences, and each tree after splitting takes the original root node as its root node.

3. The method according to claim 1, characterized in that: the core mapping table in step b) follows the format of the core mapping table published with the PENN2MALT conversion tool and is a more accurate core mapping table determined according to the characteristics of the Penn Chinese Treebank and of dependency trees; it excludes punctuation, modal particles, and interjections as core words.

The method according to claim 1, characterized in that: the dependency rule device in step c), according to the characteristics of Chinese grammar and the annotation characteristics of the Penn Chinese Treebank, defines specific rules for the dependency structures that cannot be determined with the core mapping table of step b) alone, and determines the final dependency head node of each word; wherein the specific rules are:

A) rule for the "把" (ba) structure and the "被" (bei) structure: among the children of the node that immediately follows the "把" or "被" node, if there is a subject-predicate or subject-verb-object structure, the subject and the predicate both depend on the "把" or "被" node and act as its objects;

B) rule for the "得" (de) structure: the "得" node takes the verb before it as its core word, and the object after it takes the "得" node as its core word;

C) rule for coordinate structures: the foremost noun is the core word; a conjunction connecting coordinated nouns depends on the noun that follows it; and if the coordinated nouns are separated by pause marks (、), each pause mark depends on the noun before it;

D) rule for special verb phrases: the labels of special verb phrase structures are VCD, VRD, VSB, VCP, VPT, and VNV. From a study of these special verb phrase structures, the following rule table is obtained:

4. The method according to claim 1, characterized in that the dependency relation type annotation standard in step d) is as shown in the following table:

5. The method according to claim 1, characterized in that: the dependency relation normalizer in step d) finds the dependency relation between words from two aspects:

1) from the annotation labels of the Penn Chinese Treebank;

2) from the characteristics of the word itself and of the word it depends on;

wherein the specific rules of the first aspect are:

1. in the Penn Chinese Treebank, a node labeled DVP or ADVP is given the adverbial relation to its core word; a node labeled DNP, DP, or ADJP is given the attribute relation to its core word;

2. in the Penn Chinese Treebank, a node whose label carries the suffix -SUB, -OBJ, -ADV, or -EXT is given, respectively, the subject, object, adverbial, or complement relation to its core word;

3. in the Penn Chinese Treebank, under a node labeled VRD, VCP, or VPT, the relation of the non-core node is complement; under a node labeled VCD, the relation of the non-core node is coordination; under a node labeled VSB, the relation of the non-core node is serial verb; under a node labeled VNV, the relation of the non-core node is interrogative serial verb;

the specific rules of the second aspect are given in the following rule table:

6. The method according to claim 5, characterized in that: the rules of the first aspect and the second aspect can conflict, so the rules are assigned priorities; from high to low, the priorities are: first, the second-aspect rules whose dependency relation types are root node, tense, mood, exclamation, punctuation, the word structure, the word structure, the "得" structure, and the "地" structure; then rules 1., 2., and 3. of the first aspect; and finally the second-aspect rules whose dependency relation types are coordination, association, preposition-object, quantity, subject, object, attribute, adverbial, and complement; applying the rules strictly in this priority order yields accurate dependency relations.

7. A system for converting a Chinese phrase structure treebank into a dependency structure treebank, characterized in that the system comprises:

a splitter, for splitting the long sentences in the treebank into short sentences;