CN106598951A - Dependency structure treebank acquisition method and system - Google Patents

Publication number
CN106598951A
Authority
CN
China
Prior art keywords
treebank
phrase
converted
speech
dependence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611208593.6A
Other languages
Chinese (zh)
Other versions
CN106598951B (en)
Inventor
武英波
杜建平
吕坤河
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Original Assignee
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Office Software Inc and Zhuhai Kingsoft Office Software Co Ltd
Priority to CN201611208593.6A
Publication of CN106598951A
Application granted
Publication of CN106598951B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a dependency structure treebank acquisition method and system. The method comprises the steps of calling a first treebank and converting phrase structures in the first treebank into dependency structures by adopting a conversion tool of the first treebank; converting phrase structures of flat structures in the first treebank into dependency structures by utilizing a syntactic analyzer; and performing dependency relationship conversion on the dependency structures in the first treebank by utilizing a dependency relationship mapping model obtained by training to obtain a dependency structure treebank of a second treebank type. According to the method and the system, the treebank after the conversion can be combined with the original dependency structure treebank, so that the treebank scale is increased and the performance of the syntactic analyzer is improved.

Description

Dependency structure treebank acquisition method and system
Technical field
The present invention relates to treebank conversion, and in particular to a dependency structure treebank acquisition method and system.
Background art
Syntactic analysis is a very important research direction in the field of natural language processing. Statistical syntactic analysis methods can be divided, according to the corpus they use, into supervised methods and unsupervised methods. A supervised method requires sentences to be manually annotated in advance according to a given syntactic specification to serve as training data; the knowledge required for syntactic analysis is then obtained from the training data by various probabilistic or machine learning methods. An unsupervised method instead trains on unannotated data and automatically learns grammar rules from it according to some mechanism.
Supervised syntactic analysis is currently the mainstream approach and has reached relatively high accuracy for languages such as English. In supervised syntactic analysis, the set of sentences annotated in advance for training is called a treebank. Most current statistical syntactic analysis models train their parameters in a supervised manner on an annotated treebank. Treebank construction is therefore very important work, and the quality and scale of a treebank directly affect the training of a syntactic analysis model.
Syntactic analysis must first follow a particular grammar formalism, and the form of the syntax tree is determined by the grammar of that formalism. At present, phrase structure grammar and dependency grammar are the most widely used in syntactic analysis. For example, the phrase structure analysis of the sentence "This year, Siemens will strive to participate in China's Three Gorges Project construction." is shown in Fig. 1a; it resembles a set of nested structures that is split layer by layer.
The first level, "S", is the whole sentence: "This year, Siemens will strive to participate in China's Three Gorges Project construction."
The second level has four parts: the first part, "NP", is a noun phrase corresponding to "this year"; the second part, "NP", is a noun phrase corresponding to "Siemens"; the third part, "VP", is a verb phrase corresponding to "will strive to participate in China's Three Gorges Project construction"; the fourth part, "PU", is punctuation corresponding to "。".
The third level has three parts: the first part, "ADVP", is an adverbial phrase corresponding to "will"; the second part, "ADVP", is an adverbial phrase corresponding to "strive"; the third part, "VP", is a verb phrase corresponding to "participate in China's Three Gorges Project construction".
The fourth level has two parts: the first part, "VV", is a verb corresponding to "participate"; the second part, "NP", is a noun phrase corresponding to "China's Three Gorges Project construction".
The fifth level has three parts: the first part, "DNP", is an attributive phrase corresponding to "China's"; the second part, "NP", is a noun phrase corresponding to "Three Gorges Project"; the third part, "NP", is a noun phrase corresponding to "construction".
The sixth level has four parts: the first part, "NP", is an attributive phrase corresponding to "China"; the second part, "DEG", is an auxiliary word corresponding to "的"; the third part, "NP", is an attributive phrase corresponding to "Three Gorges"; the fourth part, "NP", is an attributive phrase corresponding to "project".
The result of analyzing "China's Three Gorges Project construction" with a dependency structure is shown in Fig. 1b. A dependency structure uses directed arcs to mark the relation between words. The analysis produced by a dependency structure is more intuitive than that produced by a phrase structure.
In the sentence "This year, Siemens will strive to participate in China's Three Gorges Project construction.", the core node "VG" corresponds to "participate"; "this year", "will" and "strive" each stand in the "ADV" (adverbial) relation to "participate"; "Siemens" and "participate" stand in the "SBV" (subject-predicate) relation; "China" and "的" stand in the "ATT" (attributive) relation; "Three Gorges" and "project" stand in the "ATT" relation; and "project" and "construction" stand in the "ATT" relation. The empty node "EOS" after "。" marks the end of the sentence.
How to convert the phrase structure analysis result shown in Fig. 1a into the dependency structure shown in Fig. 1b is a technical problem to be solved in this field.
The development of English syntactic analysis has benefited from the creation of the Penn Treebank. The Penn Treebank is large and its annotation quality is high; it has become the de facto standard for English syntactic analysis, and almost all research work is based on it. The work of converting the Penn Treebank into dependency structures is also mature. For Chinese, by contrast, treebank construction still lags behind: there is neither a unified dependency annotation scheme nor a large-scale dependency treebank. The best-known existing Chinese phrase structure treebanks include PCT (Penn Chinese Treebank, the Chinese treebank of the University of Pennsylvania) and TCT (the Chinese treebank of Tsinghua University). Chinese dependency treebanks are comparatively few; well-known ones are HIT-IR-CDT (the Harbin Institute of Technology Chinese dependency treebank) and SDN (a treebank annotated by the Department of Electronics of Tsinghua University). HIT-IR-CDT is the Chinese dependency treebank annotated by the Information Retrieval Laboratory of the Harbin Institute of Technology.
The technology for converting the Penn Treebank into dependency structures is very mature. Compared with English, work on converting Chinese phrase structure treebanks into dependency structures is still immature. The existing conversion tool Penn2Malt provides a rule file for converting the Penn Chinese Treebank into dependency structures, with which the Penn Chinese Treebank can be converted. However, the rules in the Chinese conversion rule file provided by Penn2Malt cannot accurately describe many language phenomena, and they cannot handle the parallel (coordinate) structures and flat structures in the Penn Chinese Treebank.
Existing approaches to converting TCT into dependency structures rely entirely on rules. This requires great familiarity with the grammar formalism of TCT; rule-based conversion is then applied to each reduction form, including specifying the core node and specifying the relation type. This way of converting TCT into dependency structures does not generalize well and requires considerable manual effort. Moreover, its dependency scheme focuses mainly on describing the various relational constituents related to verbs.
All of the above work converts a phrase structure treebank into some dependency treebank. The annotation scheme of the converted dependency treebank is inconsistent with that of every existing dependency treebank, which hinders effective use of the converted treebank: it can only be used as an independent treebank.
The scale and quality of a treebank directly affect the performance of syntactic analysis: the larger the treebank and the better its quality, the better the performance of the parser trained on it. Therefore, how to convert a Chinese phrase structure treebank into a dependency structure treebank, so as to take full advantage of the scale and quality of both Chinese phrase structure treebanks and dependency structure treebanks, is a technical problem that those skilled in the art urgently need to solve.
Summary of the invention
To solve the problem that the annotation schemes of existing converted dependency treebanks are not unified, the present invention provides a dependency structure treebank acquisition method and system that converts a phrase structure treebank into a dependency structure treebank. The converted treebank can easily be merged with an original dependency structure treebank, thereby increasing the treebank scale and effectively improving parser performance.
To solve the above problem, the present invention provides a dependency structure treebank acquisition method comprising the following steps:
Calling a first treebank, the first treebank being a Chinese phrase structure treebank;
Converting the phrase structures in the first treebank into dependency structures using a conversion tool of the first treebank and a parser, respectively; a second treebank being a dependency structure treebank;
Wherein converting the phrase structures in the first treebank into dependency structures using the conversion tool of the first treebank comprises: converting the phrase structures into dependency structures using the rules provided by the conversion tool for converting phrase structures in the first treebank into dependency structures, or using rules obtained by modifying those rules; and summarizing rules by a rule-based method to convert the phrase structures of parallel (coordinate) structures in the first treebank into dependency structures;
Wherein converting the phrase structures in the first treebank into dependency structures using the parser comprises: converting the phrase structures of flat structures in the first treebank into dependency structures using the parser;
Performing dependency relationship conversion on the dependency structures in the first treebank using a dependency relationship mapping model obtained by training, to obtain a dependency structure treebank of the second treebank type.
Optionally, converting the phrase structures into dependency structures using the rules provided by the conversion tool for converting phrase structures in the first treebank into dependency structures, or using rules obtained by modifying those rules, comprises: determining the core node of each grammar derivation in the phrase structure treebank of the first treebank according to a pre-established Head (core node) mapping table;
Scanning with respect to the core node, using the mapping table and according to the rules in the mapping table, to obtain the dependency relationships between the other child nodes and the core node;
Wherein the Head mapping table is formed from the rules provided by the conversion tool for converting phrase structures in the first treebank into dependency structures, or from rules obtained by modifying those rules.
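By way of illustration only, the Head-table lookup and scan described above might be sketched as follows. The `Node` class, the contents of `HEAD_TABLE` (scan directions and label priorities) and the function names are assumptions made for this sketch; they are not taken from Penn2Malt's rule file or from the original disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    label: str                      # phrase label ("VP") or POS tag for a leaf
    word: Optional[str] = None      # set only for leaves
    children: List["Node"] = field(default_factory=list)


# Hypothetical Head table: parent label -> (scan direction, child label priorities).
HEAD_TABLE: Dict[str, Tuple[str, List[str]]] = {
    "S": ("right", ["VP"]),
    "VP": ("left", ["VV", "VP"]),
    "NP": ("right", ["NN", "NP"]),
}


def head_child(node: Node) -> Node:
    """Pick the core (head) child of a phrase node using the Head table."""
    direction, priorities = HEAD_TABLE.get(node.label, ("right", []))
    kids = node.children if direction == "left" else list(reversed(node.children))
    for wanted in priorities:
        for kid in kids:
            if kid.label == wanted:
                return kid
    return kids[0]  # fallback: first child in scan order


def lexical_head(node: Node) -> Node:
    """Follow core children down to the leaf that heads the phrase."""
    while node.children:
        node = head_child(node)
    return node


def to_dependencies(node: Node, arcs=None) -> List[Tuple[str, str]]:
    """Return (dependent word, head word) arcs for the whole subtree."""
    if arcs is None:
        arcs = []
    if not node.children:
        return arcs
    head = head_child(node)
    head_word = lexical_head(head).word
    for kid in node.children:
        if kid is not head:
            # every non-core child depends on the core node of this derivation
            arcs.append((lexical_head(kid).word, head_word))
        to_dependencies(kid, arcs)
    return arcs


tree = Node("S", children=[
    Node("NP", children=[Node("NN", "Siemens")]),
    Node("VP", children=[
        Node("VV", "participate"),
        Node("NP", children=[Node("NN", "construction")]),
    ]),
])
arcs = to_dependencies(tree)
```

In this toy tree both "Siemens" and "construction" end up depending on "participate". Arcs are keyed by word for brevity; a practical version would use token positions so that repeated words stay distinct.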
Optionally, converting the phrase structures of flat structures in the first treebank into dependency structures using the parser specifically comprises:
Using the parser to search for a maximum spanning tree in a directed graph for each phrase structure of a flat structure in the first treebank, and determining the dependency probabilities between the different phrases in that phrase structure;
Converting the phrase structures of flat structures in the first treebank into dependency structures according to the dependency probabilities of the different phrases.
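As a hedged sketch of the maximum-spanning-tree search above, the following code exhaustively enumerates head assignments for a short flat phrase and keeps the highest-scoring tree. This brute-force search stands in for an efficient algorithm such as Chu-Liu/Edmonds and is only practical for a handful of words. The matrix `prob` (probability that word `d` depends on word `h`) is assumed to come from the trained parser; the numbers below are invented for illustration.

```python
import itertools
import math
from typing import List, Optional


def is_tree(heads: List[int]) -> bool:
    """heads[d] is the head of word d, or -1 for the root of the phrase."""
    if heads.count(-1) != 1:
        return False
    for start in range(len(heads)):
        seen = set()
        node = start
        while node != -1:
            if node in seen:        # a cycle never reaches the root
                return False
            seen.add(node)
            node = heads[node]
    return True


def best_dependency_tree(prob: List[List[float]]) -> Optional[List[int]]:
    """Return the head list of the spanning tree with the largest log-probability."""
    n = len(prob)
    best_heads, best_score = None, -math.inf
    for candidate in itertools.product(range(-1, n), repeat=n):
        heads = list(candidate)
        if not is_tree(heads):
            continue
        arcs = [(h, d) for d, h in enumerate(heads) if h != -1]
        if any(prob[h][d] <= 0.0 for h, d in arcs):
            continue                # skip trees that use an impossible arc
        score = sum(math.log(prob[h][d]) for h, d in arcs)
        if score > best_score:
            best_heads, best_score = heads, score
    return best_heads


# Words: 0 = "Three Gorges", 1 = "project", 2 = "construction".
# prob[h][d]: invented probability that word d depends on word h.
prob = [
    [0.0, 0.1, 0.1],
    [0.9, 0.0, 0.1],
    [0.3, 0.8, 0.0],
]
best_heads = best_dependency_tree(prob)
```

With these invented probabilities the selected tree attaches "Three Gorges" to "project" and "project" to "construction", with "construction" as the root of the phrase.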
Optionally, the parser is trained using the phrases in the second treebank.
Optionally, the method further comprises: obtaining the conversion accuracy with which the phrase structures of flat structures are converted into dependency structures, and adjusting the training of the parser according to that accuracy.
Optionally, network resources are used to search for and count the occurrence probability of the converted dependency structures, and the conversion accuracy is determined according to that probability.
Optionally, summarizing rules by the rule-based method to convert the phrase structures of parallel structures in the first treebank into dependency structures specifically comprises:
Splitting the phrase structure of the parallel structure into multiple fragments;
Determining the core node of each fragment, and determining that the other nodes in each fragment depend on the core node of that fragment;
Determining that the core node of every fragment other than the first fragment depends on the core node of the first fragment.
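The three steps above can be sketched as follows. Several choices here are assumptions made only for illustration: fragments are split on the Penn Chinese Treebank conjunction tag "CC" or on the mark "、", the separators themselves are left unattached, and the last word of each fragment is taken as its core node. The English words stand in for Chinese tokens.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def split_fragments(tokens: List[Token]) -> List[List[int]]:
    """Split token indices into fragments at conjunctions or pause marks."""
    fragments, current = [], []
    for index, (word, pos) in enumerate(tokens):
        if pos == "CC" or word == "、":
            if current:
                fragments.append(current)
                current = []
        else:
            current.append(index)
    if current:
        fragments.append(current)
    return fragments


def convert_parallel(tokens: List[Token]) -> Dict[int, int]:
    """Return a mapping from dependent index to head index."""
    fragments = split_fragments(tokens)
    # Assumption for this sketch: the last word of a fragment is its core node.
    cores = [fragment[-1] for fragment in fragments]
    heads: Dict[int, int] = {}
    for fragment, core in zip(fragments, cores):
        for index in fragment:
            if index != core:
                heads[index] = core     # other nodes depend on the fragment core
    for core in cores[1:]:
        heads[core] = cores[0]          # later cores depend on the first core
    return heads


tokens = [
    ("apples", "NN"), ("、", "PU"),
    ("red", "JJ"), ("pears", "NN"),
    ("and", "CC"), ("plums", "NN"),
]
parallel_heads = convert_parallel(tokens)
```

Here "red" depends on "pears", while "pears" and "plums" both depend on "apples", the core node of the first fragment.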
Optionally, splitting the phrase structure of the parallel structure into multiple fragments specifically comprises:
Performing the splitting using a conjunction part of speech or a pause mark as the splitting criterion.
Optionally, splitting the phrase structure of the parallel structure into multiple fragments specifically comprises:
Obtaining the input method input situation, and performing the splitting using the input pauses in the input method input situation as the splitting criterion.
Optionally, splitting the phrase structure of the parallel structure into multiple fragments specifically comprises:
When the different phrases in the phrase structure of the parallel structure have an association relationship, performing the splitting using the association relationship as the splitting criterion.
Optionally, determining the core node of each fragment comprises: taking the sentence containing the phrase structure as the object of analysis, determining the number of times each node of the fragment occurs in the context of the sentence, and, by comparing the occurrence counts of the different nodes, determining the node whose occurrence count meets the requirement as the core node.
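A minimal sketch of this count-based selection is given below. The text does not state what requirement the occurrence count must meet, so the sketch assumes the most frequent word in the context is chosen, with ties going to the rightmost word; both choices, and the function name `pick_core`, are assumptions.

```python
from collections import Counter
from typing import List


def pick_core(fragment: List[str], context_words: List[str]) -> str:
    """Pick the fragment word that occurs most often in the sentence context."""
    counts = Counter(context_words)
    best_index = max(
        range(len(fragment)),
        key=lambda i: (counts[fragment[i]], i),  # tie-break: rightmost word
    )
    return fragment[best_index]


context = ["pears", "are", "sweet", "pears", "red"]
core = pick_core(["red", "pears"], context)
```

In this example "pears" occurs twice in the context and "red" once, so "pears" is selected as the core node.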
Optionally, establishing the dependency relationship mapping model comprises:
Training a dependency relationship labeling model using the second treebank;
Performing dependency relationship labeling on the first treebank using the dependency relationship labeling model;
Correcting the result of the dependency relationship labeling using the original part-of-speech and syntactic information of the first treebank, and establishing the dependency relationship mapping model.
Optionally, the dependency relationship labeling model performs dependency relationship labeling using a second log-linear model,
wherein i = 0 corresponds to the word word_f feature (the word and its parent word);
i = 1 corresponds to the word pos_f feature (the word and the part of speech of its parent node);
i = 2 corresponds to the pos word_f feature (the part of speech and the parent word);
i = 3 corresponds to the pos pos_f distance feature (the part of speech, the part of speech of the parent node, and the distance);
λ0 is the weight of the word word_f feature (i = 0);
λ1 is the weight of the word pos_f feature (i = 1);
λ2 is the weight of the pos word_f feature (i = 2);
λ3 is the weight of the pos pos_f distance feature (i = 3).
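The formula of the log-linear model is not reproduced in this text. As a hedged illustration, the sketch below assumes the standard log-linear form, in which the probability of a relation r is proportional to exp(Σ_i λ_i · f_i(x, r)) over the four feature templates listed above. The feature values in `table`, the tags "n" and "v", and the function names are invented for the example.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def feature_keys(word: str, pos: str, word_f: str, pos_f: str,
                 distance: int) -> List[Tuple]:
    """Instantiate the four feature templates for one (word, parent) pair."""
    return [
        (word, word_f),          # i = 0: word word_f
        (word, pos_f),           # i = 1: word pos_f
        (pos, word_f),           # i = 2: pos word_f
        (pos, pos_f, distance),  # i = 3: pos pos_f distance
    ]


def relation_distribution(
    keys: Sequence[Tuple],
    relations: Sequence[str],
    feature_value: Callable[[int, Tuple, str], float],
    lambdas: Sequence[float],
) -> Dict[str, float]:
    """Assumed form: P(r | x) proportional to exp(sum_i lambda_i * f_i(x, r))."""
    scores = {}
    for relation in relations:
        scores[relation] = sum(
            lam * feature_value(i, key, relation)
            for i, (lam, key) in enumerate(zip(lambdas, keys))
        )
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {r: math.exp(s - peak) for r, s in scores.items()}
    total = sum(exps.values())
    return {r: v / total for r, v in exps.items()}


# Invented feature values; a real model would estimate them from the second treebank.
table = {
    (0, ("Siemens", "participate"), "SBV"): 1.0,
    (1, ("Siemens", "v"), "SBV"): 1.0,
}
keys = feature_keys("Siemens", "n", "participate", "v", 3)
dist = relation_distribution(
    keys,
    ["SBV", "ADV"],
    lambda i, key, relation: table.get((i, key, relation), 0.0),
    [1.0, 1.0, 1.0, 1.0],
)
```

Under these invented values the model prefers "SBV" over "ADV" for the pair ("Siemens", "participate"); the weights λ0 to λ3 would in practice be learned rather than set to 1.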
Optionally, the dependency relationship mapping model performs dependency relationship labeling using a third log-linear model,
wherein i = 0 corresponds to the phrase feature (the phrase's own phrase type);
i = 1 corresponds to the phrase_s feature (the phrase type generated by the phrase itself);
i = 2 corresponds to the phrase_f feature (the phrase type of the parent);
λ0 is the weight of the phrase feature (i = 0);
λ1 is the weight of the phrase_s feature (i = 1);
λ2 is the weight of the phrase_f feature (i = 2).
Optionally, the method further comprises:
Converting the part-of-speech tag set of the first treebank into a tag set that meets the requirements of a Chinese standard part-of-speech tag set.
Optionally, the Chinese standard part-of-speech tag set is the 863 part-of-speech tag set.
Optionally, converting the part-of-speech tag set of the first treebank into a tag set that meets the requirements of the Chinese standard part-of-speech tag set comprises:
Performing part-of-speech tagging on the words of the first treebank using the second treebank, and performing part-of-speech division using a pre-established part-of-speech mapping model, to correct the tagged parts of speech.
Optionally, the part-of-speech mapping model performs part-of-speech conversion using a first log-linear model,
wherein i = 0 corresponds to the pos feature (the node's own part of speech);
i = 1 corresponds to the pos_s pos feature (the part of speech of the child node and the node's own part of speech);
i = 2 corresponds to the pos pos_f feature (the node's own part of speech and the part of speech of the parent node);
λ0 is the weight of the pos feature (i = 0);
λ1 is the weight of the pos_s pos feature (i = 1);
λ2 is the weight of the pos pos_f feature (i = 2).
Optionally, the first treebank is the Penn Chinese TreeBank (the Chinese treebank of the University of Pennsylvania), and the second treebank is HIT-IR-CDT (the Harbin Institute of Technology Chinese dependency treebank).
The present invention also provides a dependency structure treebank acquisition system, comprising a calling unit and a converting unit:
The calling unit is configured to call a first treebank, the first treebank being a Chinese phrase structure treebank;
The converting unit is configured to convert the phrase structures in the first treebank into dependency structures using a conversion tool of the first treebank and a parser, respectively; a second treebank being a dependency structure treebank;
Wherein converting the phrase structures in the first treebank into dependency structures using the conversion tool of the first treebank comprises: converting the phrase structures into dependency structures using the rules provided by the conversion tool for converting phrase structures in the first treebank into dependency structures, or using rules obtained by modifying those rules; and summarizing rules by a rule-based method to convert the phrase structures of parallel (coordinate) structures in the first treebank into dependency structures;
Wherein converting the phrase structures in the first treebank into dependency structures using the parser comprises: converting the phrase structures of flat structures in the first treebank into dependency structures using the parser;
The converting unit is further configured to perform dependency relationship conversion on the dependency structures in the first treebank using a dependency relationship mapping model obtained by training, to obtain a dependency structure treebank of the second treebank type.
Optionally, the converting unit specifically comprises a determination subunit and a scanning subunit:
The determination subunit is configured to determine the core node of each grammar derivation in the phrase structure treebank of the first treebank according to a pre-established Head (core node) mapping table;
The scanning subunit is configured to scan with respect to the core node, using the mapping table and according to the rules in the mapping table, to obtain the dependency relationships between the other child nodes and the core node;
Wherein the Head mapping table is formed from the rules provided by the conversion tool for converting phrase structures in the first treebank into dependency structures, or from rules obtained by modifying those rules.
Optionally, the converting unit is specifically configured to use the parser to search for a maximum spanning tree in a directed graph for each phrase structure of a flat structure in the first treebank, to determine the dependency probabilities between the different phrases in that phrase structure, and to convert the phrase structures of flat structures in the first treebank into dependency structures according to the dependency probabilities of the different phrases.
Optionally, the system further comprises a parser training unit configured to train the parser using the phrases in the second treebank.
Optionally, the system further comprises an adjustment unit configured to obtain the conversion accuracy with which the phrase structures of flat structures are converted into dependency structures, and to adjust the training of the parser according to that accuracy.
Optionally, the adjustment unit is specifically configured to use network resources to search for and count the occurrence probability of the converted dependency structures, and to determine the conversion accuracy according to that probability.
Optionally, the converting unit specifically comprises a splitting subunit and a dependency determination subunit:
The splitting subunit is configured to split the phrase structure of the parallel structure into multiple fragments;
The dependency determination subunit is configured to determine the core node of each fragment, and to determine that the other nodes in each fragment depend on the core node of that fragment;
The dependency determination subunit is further configured to determine that the core node of every fragment other than the first fragment depends on the core node of the first fragment.
Optionally, the splitting subunit is configured to split the phrase structure of the parallel structure using a conjunction part of speech or a pause mark as the splitting criterion.
Optionally, the splitting subunit is configured to obtain the input method input situation and to perform the splitting using the input pauses in the input method input situation as the splitting criterion.
Optionally, the splitting subunit is configured to perform the splitting using an association relationship as the splitting criterion when the different phrases in the phrase structure of the parallel structure have that association relationship.
Optionally, the dependency determination subunit is configured to take the sentence containing the phrase structure as the object of analysis, to determine the number of times each node of the fragment occurs in the context of the sentence, and, by comparing the occurrence counts of the different nodes, to determine the node whose occurrence count meets the requirement as the core node.
Optionally, for establishing the dependency relationship mapping model, the system further comprises a training unit, a labeling unit and a correction unit:
The training unit is configured to train a dependency relationship labeling model using the second treebank;
The labeling unit is configured to perform dependency relationship labeling on the first treebank using the dependency relationship labeling model;
The correction unit is configured to correct the result of the dependency relationship labeling using the original part-of-speech and syntactic information of the first treebank, and to establish the dependency relationship mapping model.
Optionally, the dependency relationship labeling model performs dependency relationship labeling using a second log-linear model,
wherein i = 0 corresponds to the word word_f feature (the word and its parent word);
i = 1 corresponds to the word pos_f feature (the word and the part of speech of its parent node);
i = 2 corresponds to the pos word_f feature (the part of speech and the parent word);
i = 3 corresponds to the pos pos_f distance feature (the part of speech, the part of speech of the parent node, and the distance);
λ0 is the weight of the word word_f feature (i = 0);
λ1 is the weight of the word pos_f feature (i = 1);
λ2 is the weight of the pos word_f feature (i = 2);
λ3 is the weight of the pos pos_f distance feature (i = 3).
Optionally, the dependency relationship mapping model performs dependency relationship labeling using a third log-linear model,
wherein i = 0 corresponds to the phrase feature (the phrase's own phrase type);
i = 1 corresponds to the phrase_s feature (the phrase type generated by the phrase itself);
i = 2 corresponds to the phrase_f feature (the phrase type of the parent);
λ0 is the weight of the phrase feature (i = 0);
λ1 is the weight of the phrase_s feature (i = 1);
λ2 is the weight of the phrase_f feature (i = 2).
Optionally, the system further comprises a conversion unit:
The conversion unit is configured to convert the part-of-speech tag set of the first treebank into a tag set that meets the requirements of a Chinese standard part-of-speech tag set.
Optionally, the Chinese standard part-of-speech tag set is the 863 part-of-speech tag set.
Optionally, the conversion unit is specifically configured to perform part-of-speech tagging on the words of the first treebank using the second treebank, and to perform part-of-speech division using a pre-established part-of-speech mapping model, to correct the tagged parts of speech.
Optionally, the part-of-speech mapping model performs part-of-speech conversion using a first log-linear model,
wherein i = 0 corresponds to the pos feature (the node's own part of speech);
i = 1 corresponds to the pos_s pos feature (the part of speech of the child node and the node's own part of speech);
i = 2 corresponds to the pos pos_f feature (the node's own part of speech and the part of speech of the parent node);
λ0 is the weight of the pos feature (i = 0);
λ1 is the weight of the pos_s pos feature (i = 1);
λ2 is the weight of the pos pos_f feature (i = 2).
Optionally, the first treebank is the Penn Chinese TreeBank (the Chinese treebank of the University of Pennsylvania), and the second treebank is HIT-IR-CDT (the Harbin Institute of Technology Chinese dependency treebank).
Compared with the prior art described above, the dependency structure treebank acquisition method of the embodiments of the present invention includes the step of converting a first treebank, such as a Chinese phrase structure treebank, into a dependency structure treebank of the second treebank type. Because the method converts a Chinese phrase structure treebank into a dependency structure treebank, the converted treebank can easily be merged with an original dependency structure treebank, thereby increasing the treebank scale and effectively improving parser performance.
In addition, the dependency structure treebank acquisition method of the embodiments of the present invention includes the step of converting the phrase structures of flat structures in the first treebank into dependency structures using a parser, which solves the difficulty of converting phrase structures of flat structures, such as noun compound phrases, into dependency structures.
Brief description of the drawings
Fig. 1a is a diagram of a prior-art phrase structure analysis result;
Fig. 1b is a diagram of a prior-art dependency structure analysis result;
Fig. 2 is a flow chart of a first embodiment of the dependency structure treebank acquisition method of the present invention;
Fig. 3 is a flow chart of establishing the dependency relationship mapping model of the present invention;
Fig. 4a is a schematic diagram of a flat phrase structure according to the present invention;
Fig. 4b is a schematic diagram of converting the flat phrase structure of Fig. 4a into a dependency structure;
Fig. 5 is a flow chart of the method of converting the phrase structure of a parallel structure into a dependency structure according to the present invention;
Fig. 6 is a schematic diagram of converting the phrase structure of a parallel structure into a dependency structure according to the present invention;
Fig. 7 is a flow chart of a second embodiment of the dependency structure treebank acquisition method of the present invention;
Fig. 8 is a schematic diagram of dependency relationships according to the present invention;
Fig. 9 is a structural diagram of a first embodiment of the dependency structure treebank acquisition system of the present invention;
Fig. 10 is a structural diagram of a second embodiment of the dependency structure treebank acquisition system of the present invention.
Specific embodiment
The present invention provides a dependency structure treebank acquisition method that converts a first treebank, such as a Chinese phrase structure treebank, into a dependency structure treebank of the second treebank type. The converted dependency structure treebank can easily be merged with the original dependency structure treebank, which enlarges the treebank and in turn effectively improves the performance of the parser.
Refer to Fig. 2 and Fig. 3. Fig. 2 is a flow chart of the first embodiment of the dependency structure treebank acquisition method of the present invention; Fig. 3 is a flow chart of establishing the dependency relation mapping model of the present invention.
As shown in Fig. 2, the dependency structure treebank acquisition method of the first embodiment of the present invention comprises the following steps:
S201: Call the first treebank.
The first treebank may be a Chinese phrase structure treebank, for example Penn Chinese Treebank or TCT.
S202: Convert the phrase structures in the first treebank into dependency structures, using the conversion tool of the first treebank and a parser respectively.
The second treebank may be a dependency structure treebank, for example HIT-IR-CDT or SDN.
In the embodiments of the present invention, the first treebank may be Penn Chinese Treebank and the second treebank may be HIT-IR-CDT.
Converting the phrase structures in the first treebank into dependency structures using the conversion tool of the first treebank includes: converting the phrase structures into dependency structures using the rules that the conversion tool provides for this purpose, or using rules obtained by modifying those rules; and summarizing the phrase structures of coordinate structure in the first treebank with a rule-based method and converting them into dependency structures.
The concrete operations of converting the phrase structures in the first treebank into dependency structures with the conversion tool are described next. Specifically, converting the phrase structures into dependency structures using the rules provided by the conversion tool, or rules obtained by modifying them, includes:
determining, according to a pre-established Head core node mapping table, the core node of each grammar derivation in the phrase structure treebank of the first treebank;
scanning for the core node using the mapping table and according to the rules in the mapping table, and obtaining the dependency relations between the other child nodes and the core node;
wherein the Head core node mapping table is formed from the rules that the conversion tool provides for converting phrase structures in the first treebank into dependency structures, or from rules obtained by modifying those rules.
For convenience, the following description takes Penn Chinese Treebank as the first treebank and HIT-IR-CDT as the second treebank.
By examining all the grammar derivations in Penn Chinese Treebank, the rule file provided by Penn2Malt is corrected to form the Head mapping table. Structures such as coordination are then handled separately, and finally the Penn Chinese Treebank phrase structures are converted into dependency structures that conform to the HIT-IR-CDT scheme.
The phrase structures of Penn Chinese Treebank are converted into dependency structures using the Head mapping table.
Table 1: Head mapping table
The Head mapping table is used to determine the core node in a grammar derivation, that is, which node in the child node sequence is the core (Head) node of the parent node. Each phrase type in Table 1 corresponds to a rule set, and these rules are applied to convert the Penn Chinese Treebank phrase structures. Each rule has two parts: a direction and a list of core phrase types. The direction is r or l: r means the child node sequence is scanned from right to left, and l means it is scanned from left to right.
For example, Penn Chinese Treebank contains the following grammar derivation of a phrase structure: NP ==> ADJP DNP NN NN. Here "==>" denotes the direction of derivation, the NP on the left of "==>" is the parent node, and ADJP DNP NN NN is the child node sequence.
To distinguish the two NN nodes, they are numbered, and NP ==> ADJP DNP NN NN is written as NP ==> ADJP DNP NN(1) NN(2). Referring to the Head mapping table of Table 1, the rule set corresponding to NP is determined to be:
Rule 1 is examined first; the direction of rule 1 is r.
Scanning from right to left, the first candidate core node, NP, is not found in the child node sequence "ADJP DNP NN(1) NN(2)". The scan is then repeated from right to left for the second candidate core node, NN, which does occur in the child node sequence. Because the scan runs from right to left, NN(2) is found first, so NN(2) is determined to be the core node and the search exits. All the other child nodes, "ADJP DNP NN(1)", depend on the core node NN(2).
The last rule is the default rule, which is used if none of the preceding rules matches. In that case, if the direction of the last rule is r, the rightmost child node is taken as the core node; if it is l, the leftmost child node is taken as the core node.
In this way, the dependency relations of the Penn Chinese Treebank phrase structures can be determined according to the Head mapping table of Table 1.
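The core node search described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the Penn2Malt implementation: the function names and the rule encoding are invented here, and the NP rule set in the usage lines is hypothetical, chosen only so that it agrees with the NP ==> ADJP DNP NN(1) NN(2) walkthrough.

```python
def find_core_index(children, rules):
    """Return the index of the core (Head) child for one grammar derivation.

    children: list of child labels, e.g. ["ADJP", "DNP", "NN", "NN"].
    rules:    list of (direction, candidates) pairs; direction is "r"
              (scan right to left) or "l" (scan left to right). The last
              rule acts as the default rule.
    """
    for direction, candidates in rules:
        if direction == "r":
            order = range(len(children) - 1, -1, -1)
        else:
            order = range(len(children))
        # Try the candidate core types in priority order.
        for candidate in candidates:
            for i in order:
                if children[i] == candidate:
                    return i
    # No rule matched: fall back to the direction of the last rule.
    return len(children) - 1 if rules[-1][0] == "r" else 0


def dependencies(children, rules):
    """Map every non-core child index to the index of the core child."""
    core = find_core_index(children, rules)
    return {i: core for i in range(len(children)) if i != core}


# Hypothetical NP rule set (an assumption, for illustration only).
NP_RULES = [("r", ["NP", "NN"]), ("r", [])]
arcs = dependencies(["ADJP", "DNP", "NN", "NN"], NP_RULES)
```

With this rule set, the three non-core children are attached to the last NN, matching the walkthrough in which NN(2) becomes the core node.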
Conversion with the above rules works for ordinary phrase structures. However, the phrase structures to be converted may include phrase structures of flat structure, and the above rules may be unable to convert these into dependency structures.
In the embodiments of the present invention, the phrase structures of flat structure in the first treebank can be converted into dependency structures using a parser.
The concrete operations of converting the phrase structures in the first treebank into dependency structures using the parser are described next. Specifically, for a phrase structure of flat structure in the first treebank, the parser finds a maximum spanning tree in a directed graph and determines the dependency probabilities of the different phrases in that phrase structure. The phrase structure of flat structure is then converted into a dependency structure according to the dependency probabilities of the different phrases.
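To make the idea of a maximum spanning tree over a directed graph concrete, the sketch below finds the best tree by exhaustive search. It is only an illustration: the source does not specify the parser's algorithm, the scores are hypothetical, and summing arc scores is an assumption. Exhaustive search is practical only for very short phrases; a real parser would use an efficient algorithm such as Chu-Liu/Edmonds.

```python
from itertools import product


def reaches_root(heads):
    """heads[d - 1] is the head of word d; 0 denotes the artificial root."""
    for start in range(1, len(heads) + 1):
        seen, node = set(), start
        while node != 0:
            if node in seen:  # a cycle, so this is not a tree
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def max_spanning_tree(score):
    """Exhaustively find the best dependency tree for a short phrase.

    score[h][d] is the score of word d depending on head h, with h = 0
    standing for the root. Returns (heads, total_score).
    """
    n = len(score) - 1
    best_heads, best_total = None, float("-inf")
    for heads in product(range(n + 1), repeat=n):
        if any(h == d for d, h in enumerate(heads, start=1)):
            continue  # a word cannot depend on itself
        if heads.count(0) != 1:
            continue  # exactly one word attaches to the root
        if not reaches_root(heads):
            continue
        total = sum(score[h][d] for d, h in enumerate(heads, start=1))
        if total > best_total:
            best_heads, best_total = heads, total
    return best_heads, best_total


# Hypothetical scores for a three-word phrase (row = head, column = dependent).
SCORES = [
    [0.0, 0.1, 0.1, 0.9],
    [0.0, 0.0, 0.1, 0.1],
    [0.0, 0.9, 0.0, 0.1],
    [0.0, 0.1, 0.8, 0.0],
]
heads, total = max_spanning_tree(SCORES)
```

Here word 1 attaches to word 2, word 2 to word 3, and word 3 to the root, because that assignment has the highest total score among all valid trees.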
The dependency probability reflects the dependency relation between different phrases and may be a specific numerical value. The higher the dependency probability of two phrases, the stronger their dependency relation. A threshold may be preset, and the phrase structures whose dependency probability is greater than or equal to the preset threshold are converted into dependency structures.
Take two different phrases in a phrase structure of flat structure as an example. If their dependency probability is greater than or equal to the preset threshold, the two phrases have a strong dependency relation. The dependency structure between them then has high reference value and, after dependency relation conversion, can serve as a dependency structure in the dependency structure treebank of the second treebank type, so the phrase structure corresponding to that dependency probability can be converted into a dependency structure. If their dependency probability is less than the preset threshold, the dependency relation between the two phrases is weak and a dependency structure between them would have little reference value. No dependency structure needs to be established between them, and the phrase structure corresponding to that dependency probability need not be converted. The above conversion is carried out mainly by the parser. If the parser is trained on the phrases in the second treebank, the dependency structures it produces when converting the flat phrase structures of the first treebank will be closer to the dependency structures of the second treebank type. In the embodiments of the present invention, the parser can therefore be trained using the phrases in the second treebank.
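The threshold test on dependency probabilities described above can be sketched as a simple filter. The function name, the probabilities, and the threshold value are hypothetical; the source does not state a concrete threshold.

```python
def arcs_above_threshold(arc_probs, threshold):
    """Keep only the arcs whose dependency probability reaches the threshold.

    arc_probs: dict mapping (dependent, head) to a dependency probability.
    """
    return {arc: p for arc, p in arc_probs.items() if p >= threshold}


# Hypothetical probabilities and threshold (assumptions, for illustration).
ARC_PROBS = {("medical", "institution"): 0.92, ("medical", "service"): 0.31}
kept = arcs_above_threshold(ARC_PROBS, threshold=0.5)
```

Only the arc that meets the threshold is converted; the weaker candidate is dropped rather than forced into the treebank.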
The accuracy with which the parser converts the phrase structures of flat structure in the first treebank into dependency structures may not be perfect, that is, the converted dependency structures are not necessarily all correct. To further improve the conversion accuracy of the parser, the accuracy of converting the flat phrase structures into dependency structures can be obtained, and the parser can be given adjustment training according to that accuracy.
The conversion accuracy indicates the probability that a converted dependency structure is correct. Specifically, it can be computed by using Internet resources to search for the converted dependency structures and count their occurrence probability, and then determining the conversion accuracy from that probability.
The higher the occurrence probability of a dependency structure, the higher the conversion accuracy. A preset value can be used to select the dependency structures whose occurrence probability is higher than that value. The accuracy of the dependency structures selected in this way meets the requirement, that is, they have high reference value, so the phrases corresponding to them can be used for adjustment training of the parser. Such adjustment training further improves the accuracy with which the parser converts phrase structures of flat structure into dependency structures, and improves the performance of the parser.
The following description takes Penn Chinese Treebank as the first treebank and HIT-IR-CDT as the second treebank. For convenience, a phrase structure of flat structure is referred to below as a flat phrase structure.
Refer to Fig. 4a and Fig. 4b. Fig. 4a is a schematic diagram of a flat phrase structure of the present invention; Fig. 4b is a schematic diagram of the flat phrase structure of Fig. 4a converted to a dependency structure.
The phrase structures of Penn Chinese Treebank are comparatively flat, which shows mainly in compound noun phrases.
For example, the structure of the Penn Chinese Treebank phrase glossed word for word as "medical institution medicine procurement service center" is shown in Fig. 4a. The parent node is NP (noun phrase), and the child nodes are six NN (noun) nodes: "medical", "institution", "medicine", "procurement", "service" and "center".
Dependency analysis is performed on the phrase structure shown in Fig. 4a using the parser in HIT-IR-LTP to obtain its internal dependency relations. The result is shown in Fig. 4b.
First, the first-level dependency relations are determined: the three dependency relations "medical" and "institution", "medicine" and "procurement", and "service" and "center". Each dependency relation is drawn as an arc with an arrow (a directed arc): "medical" points to "institution", "medicine" points to "procurement", and "service" points to "center".
Then the second-level dependency relations are determined: the two dependency relations "institution" and "medicine", and "procurement" and "service". These are likewise drawn as directed arcs: "institution" points to "medicine", and "procurement" points to "service".
The dependency structure shown in Fig. 4b is thereby determined.
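The two levels of arcs described for Fig. 4b can be written down as (dependent, head) pairs and checked mechanically. This is only a sketch of one possible representation: the English glosses stand in for the six Chinese nouns, and the variable and function names are assumptions.

```python
WORDS = ["medical", "institution", "medicine", "procurement", "service", "center"]

# (dependent, head) pairs: each arc points from a word to the word it depends on.
FIRST_LEVEL = [("medical", "institution"), ("medicine", "procurement"), ("service", "center")]
SECOND_LEVEL = [("institution", "medicine"), ("procurement", "service")]

head_of = dict(FIRST_LEVEL + SECOND_LEVEL)

# The word with no outgoing arc is the core of the whole phrase.
roots = [w for w in WORDS if w not in head_of]


def chain_to_root(word):
    """Follow the arcs from a word up to the core of the phrase."""
    path = [word]
    while path[-1] in head_of:
        path.append(head_of[path[-1]])
    return path
```

Under these arcs the only word without a head is "center", and following the arcs from "medical" visits all six words in order, so the five arcs form a single connected tree.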
Structures that can be expressed with rules are given special handling; this mainly concerns coordinate structures.
Phrase structures of coordinate structure are very numerous. Under the scheme of the second treebank, they require special treatment: they are first summarized using a rule-based method and then handled specially.
In the embodiments of the present invention, phrase structures of coordinate structure can be summarized using a rule-based method and converted into dependency structures. The concrete operations of this conversion are described next. As shown in Fig. 5, they are as follows:
S501: Cut the phrase structure of coordinate structure into multiple segments.
When a phrase structure of coordinate structure is converted into a dependency structure, its core node must be determined first. The core node is the key to the conversion, so its accuracy must be ensured. Taking a passage of text as an example, the longer the passage, the harder it is to determine its core node, and the core word determined may not be the desired one. To improve the accuracy of core node determination, the phrase structure of coordinate structure can first be cut into multiple segments. With the segment as the unit, the core node determined from each segment is more accurate.
The embodiments of the present invention do not limit how the phrase structure of coordinate structure is cut into multiple segments. The cutting may use conjunctions or the enumeration comma as the cutting basis; or input-method input records may be obtained and interruptions in the input used as the cutting basis; or, when different phrases in the phrase structure are associated, the association may be used as the cutting basis. The association between different phrases may be that they are synonyms or antonyms.
S502: Determine the core node of each segment, and determine that the other nodes of each segment depend on the core node of that segment.
Taking one segment as an example, its core node can be determined as follows. The sentence containing the phrase structure is taken as the object of analysis, and the number of occurrences of each node of the segment in the context of that sentence is determined. By comparing the occurrence counts of the different nodes, the node whose count meets the requirement is taken as the core node.
The node with the highest occurrence count may be taken as the core node, or a node with a relatively high occurrence count, or a node whose occurrence count is higher than a set value.
S503: Determine that the core node of every segment other than the first segment depends on the core node of the first segment.
S502 determines the dependency structure between the core node and the other nodes within a segment. For the dependency structure between segments, the core node of the first segment can be taken as the core node of the whole phrase structure of coordinate structure, and the core nodes of the other segments establish dependency structures with it.
For example, as shown in Fig. 6, in the phrase structure glossed word for word as "developed country and Shenzhen etc. special zone", "developed country" and "Shenzhen etc. special zone" are coordinated, so the phrase is a phrase structure of coordinate structure. Using the method above, it can be cut into two segments, "developed country" and "Shenzhen etc. special zone". The core node of the first segment is "country", and the other node, "developed", depends on it, that is, "developed" points to "country" by a directed arc. The core node of the second segment is "Shenzhen", and the other nodes of the second segment, "and", "etc." and "special zone", each depend on it, that is, each points to "Shenzhen" by a directed arc. Between the two segments, the core node "Shenzhen" of the second segment depends on the core node "country" of the first segment, that is, "Shenzhen" points to "country" by a directed arc.
Determining core nodes segment by segment improves the accuracy of the core nodes and so makes the converted dependency structure more accurate.
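Steps S501 to S503 can be sketched as follows, assuming the phrase has already been cut into segments and that the core node of a segment is the word with the highest occurrence count in the sentence context. The function name and the counts are hypothetical; the counts are chosen only so that "country" and "Shenzhen" come out as the core nodes, as in the Fig. 6 example.

```python
def convert_coordination(segments, counts):
    """Convert a coordinate structure into (dependent, head) arcs.

    segments: list of word lists, already cut into segments (S501).
    counts:   occurrence count of each word in the sentence context,
              used here to pick the core node of a segment (S502).
    """
    arcs = []
    cores = []
    for segment in segments:
        core = max(segment, key=lambda w: counts.get(w, 0))
        cores.append(core)
        # S502: the other nodes of the segment depend on its core node.
        arcs.extend((w, core) for w in segment if w != core)
    # S503: the cores of later segments depend on the core of the first.
    arcs.extend((core, cores[0]) for core in cores[1:])
    return arcs


SEGMENTS = [["developed", "country"], ["and", "Shenzhen", "etc.", "special zone"]]
# Hypothetical context counts (an assumption, for illustration only).
COUNTS = {"developed": 1, "country": 3, "and": 1, "Shenzhen": 4, "etc.": 1, "special zone": 2}
arcs = convert_coordination(SEGMENTS, COUNTS)
```

The last arc produced links "Shenzhen" to "country", which is the inter-segment dependency described for Fig. 6.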
S203: Using the dependency relation mapping model obtained by training, perform dependency relation conversion on the dependency structures in the first treebank to obtain a dependency structure treebank of the second treebank type.
Referring to Fig. 3, establishing the dependency relation mapping model comprises the following steps:
S301: Train a dependency relation labeling model using the second treebank.
The task of the dependency relation labeler is to label each dependency arc with a dependency relation. Each arc has a node at each end: the node itself and its parent node. The node itself depends on the parent node, the parent node governs the node itself, and the parent node is the core word. In the figure above, for example, "medical -> institution" forms an arc in which "medical" is the node itself and "institution" is the parent node.
This is a labeling problem, for which a log-linear model is used with the following four features:
Feature | Explanation
word word_f | word, parent word
word pos_f | word, parent part of speech
pos word_f | part of speech, parent word
pos pos_f distance | part of speech, parent part of speech, distance
The probabilities are trained by maximum-likelihood estimation, which gives model entries of the following form:
f0_this_understanding_ATT 1
f1_this_n_ATT 0.8
f2_r_understanding_ATT 0.142857
f3_r_n_1_ATT 0.997324
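For a table of conditional probabilities like the one above, maximum-likelihood estimation reduces to relative frequencies counted over the labeled treebank. The sketch below shows that computation. The function name and the training events are hypothetical; the counts are chosen so that the "word, parent part of speech" context ("this", "n") yields 0.8 for ATT, as in the f1 entry.

```python
from collections import Counter


def estimate(observations):
    """Relative-frequency (maximum-likelihood) estimate of P(label | context).

    observations: list of (context, label) pairs taken from a labeled treebank.
    """
    pair_counts = Counter(observations)
    context_counts = Counter(context for context, _ in observations)
    return {
        (context, label): count / context_counts[context]
        for (context, label), count in pair_counts.items()
    }


# Hypothetical training events for the "word, parent part of speech" feature.
EVENTS = [(("this", "n"), "ATT")] * 4 + [(("this", "n"), "SBV")]
probs = estimate(EVENTS)
```

Each feature type (f0 to f3) would be estimated the same way from its own contexts.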
S302: Label the first treebank with dependency relations using the dependency relation labeling model.
Taking Penn Chinese Treebank as the first treebank and HIT-IR-CDT as the second treebank as an example, the dependency relation labeling model is used to label Penn Chinese Treebank with dependency relations.
The weights of the four features word word_f, word pos_f, pos word_f and pos pos_f distance are 0.4, 0.2, 0.2 and 0.2, respectively.
Tested on the HIT-IR-CDT test corpus, the dependency relation labeling model reaches an accuracy of 89.7%.
To make use of the original, correct part-of-speech and syntactic information in Penn Chinese Treebank, a dependency relation mapping model is trained to correct the dependency relation labeling results.
When a phrase structure is converted to a dependency structure, three pieces of information are recorded: the phrase type of the child node, the generated phrase type, and the phrase type of the parent node.
Refer to Fig. 8, a schematic diagram of a dependency relation of the present invention. In Fig. 8, the dependency relation between "medical" and "institution" is recorded as "NN-NP-NN": "medical" points to "institution" by an arc with an arrow, and the arc is labeled "NN-NP-NN".
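The three recorded pieces of information can be joined into a key such as "NN-NP-NN" while a derivation is converted. The sketch below assumes the order child type, generated phrase type, parent type, which matches the example in the text. The function names and the two-word derivation are hypothetical simplifications.

```python
def relation_record(child_type, generated_type, parent_type):
    """Join the three recorded phrase types into one key, e.g. "NN-NP-NN"."""
    return "-".join([child_type, generated_type, parent_type])


def records_for_derivation(parent_label, children, core_index):
    """Record one key per non-core child of a grammar derivation.

    parent_label: the phrase type generated by the derivation (e.g. "NP").
    children:     list of (word, type) pairs.
    core_index:   index of the core child that the others depend on.
    """
    core_word, core_type = children[core_index]
    return {
        (word, core_word): relation_record(child_type, parent_label, core_type)
        for i, (word, child_type) in enumerate(children)
        if i != core_index
    }


records = records_for_derivation(
    "NP", [("medical", "NN"), ("institution", "NN")], core_index=1
)
```

These keys supply the phrase type features that the dependency relation mapping model is trained on.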
S303: Using the original part-of-speech and syntactic information of the first treebank, correct the results of the dependency relation labeling and establish the dependency relation mapping model.
When the dependency relation mapping model is trained, the three features listed in Table 2 are used.
Table 2: Features for training the dependency relation mapping model
Feature | Explanation
phrase | own phrase type
phrase_s | generated phrase type
phrase_f | parent phrase type
The probabilities are trained by maximum-likelihood estimation, which gives model entries of the following form:
f0_NN_ATT 0.734
f1_NP_ATT 0.543
f2_NN_ATT 0.933
Dependency relation conversion is carried out using the dependency relation mapping model,
where for i=0 the weight of the phrase feature is 0.35;
for i=1 the weight of the phrase_s feature is 0.3;
for i=2 the weight of the phrase_f feature is 0.35.
After dependency relation mapping, the result is as follows:
Word | Shanghai | Pudong | development | and | legal system | construction | synchronized
Number | 1 | 2 | 3 | 4 | 5 | 6 | 7
Dependency structure (parent node number) | 2 | 3 | 7 | 6 | 6 | 3 | 0
Syntactic relation labeler result | ATT | ATT | SBV | LAD | ATT | ATT | HED
Syntactic relation mapping model result | ATT | ATT | SBV | LAD | ATT | COO | HED
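The effect of the mapping model on this example can be checked by holding the result rows as plain lists and comparing the two label rows. This is only a sketch of one possible representation; the variable names are assumptions.

```python
HEADS = [2, 3, 7, 6, 6, 3, 0]  # parent node number of words 1..7; 0 marks the root
LABELER = ["ATT", "ATT", "SBV", "LAD", "ATT", "ATT", "HED"]
MAPPED = ["ATT", "ATT", "SBV", "LAD", "ATT", "COO", "HED"]

# One row per word: (number, parent number, final relation).
rows = [(i, h, rel) for i, (h, rel) in enumerate(zip(HEADS, MAPPED), start=1)]

# Numbers of the words whose relation the mapping model changed.
corrected = [
    i for i, (before, after) in enumerate(zip(LABELER, MAPPED), start=1)
    if before != after
]
```

The two rows differ only for word 6, whose relation to word 3 changes from ATT to COO.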
The dependency relation labeling model uses a second log-linear model
to perform the dependency relation labeling;
where i=0 corresponds to the word word_f feature (word, parent word);
i=1 corresponds to the word pos_f feature (word, parent node part of speech);
i=2 corresponds to the pos word_f feature (part of speech, parent word);
i=3 corresponds to the pos pos_f distance feature (part of speech, parent node part of speech, distance);
λ0: the weight of the word word_f feature, corresponding to i=0;
λ1: the weight of the word pos_f feature, corresponding to i=1;
λ2: the weight of the pos word_f feature, corresponding to i=2;
λ3: the weight of the pos pos_f distance feature, corresponding to i=3.
The dependency relation mapping model uses a third log-linear model
to perform the dependency relation labeling;
where i=0 corresponds to the phrase feature (own phrase type);
i=1 corresponds to the phrase_s feature (generated phrase type);
i=2 corresponds to the phrase_f feature (parent phrase type);
λ0: the weight of the phrase feature, corresponding to i=0;
λ1: the weight of the phrase_s feature, corresponding to i=1;
λ2: the weight of the phrase_f feature, corresponding to i=2.
The dependency structure treebank acquisition method described in the embodiments of the present invention includes a step of converting a first treebank, such as a Chinese phrase structure treebank, into a dependency structure treebank of the second treebank type. Because the method converts a Chinese phrase structure treebank into a dependency structure treebank, the converted treebank can easily be merged with the original dependency structure treebank, which enlarges the treebank and in turn effectively improves the performance of the parser.
In addition, the method includes a step of using a parser to convert the phrase structures of flat structure in the first treebank into dependency structures. This solves the difficulty of converting flat phrase structures, such as compound noun phrases, into dependency structures.
Refer to Fig. 7, a flow chart of the second embodiment of the dependency structure treebank acquisition method of the present invention.
The second embodiment differs from the first embodiment in that it adds a step of converting the part-of-speech tag set.
The dependency structure treebank acquisition method of the second embodiment of the present invention comprises the following steps:
S701: Call the first treebank.
S702: Convert the phrase structures in the first treebank into dependency structures, using the conversion tool of the first treebank and a parser respectively.
S702 is processed in the same way as S202 and is not described again here.
S703: Convert the part-of-speech tag set of the first treebank into a tag set that meets the requirements of the Chinese standard part-of-speech tag set.
The Chinese standard part-of-speech tag set may be the 863 part-of-speech tag set.
A treebank contains not only syntactic structure information but may also contain part-of-speech information, and the part-of-speech tag sets used by different treebanks differ. A step of converting the part-of-speech tag set can therefore be added. The 863 part-of-speech tag set is one of the Chinese standard part-of-speech tag sets. The method of the embodiments of the present invention converts the part-of-speech tag set of the first treebank, such as Penn Chinese Treebank, into a tag set that meets the requirements of the Chinese standard part-of-speech tag set, such as the 863 part-of-speech tag set. This unifies the part-of-speech tags across treebanks and improves the accuracy of the conversion.
The concrete operations of the part-of-speech tag set conversion are described next. Specifically, the words of the first treebank can be part-of-speech tagged using the second treebank, and a pre-established part-of-speech mapping model can be used to classify the parts of speech and correct the tagged parts of speech.
Taking Penn Chinese Treebank as the first treebank and HIT-IR-CDT as the second treebank as an example, the words of Penn Chinese Treebank are part-of-speech tagged using HIT-IR-CDT, and the pre-established part-of-speech mapping model is used to classify the parts of speech and correct the tagged parts of speech.
The part-of-speech mapping model uses a first log-linear model
to perform the part-of-speech conversion;
where i=0 corresponds to the pos feature (own part of speech);
i=1 corresponds to the pos_s pos feature (child node part of speech, own part of speech);
i=2 corresponds to the pos pos_f feature (own part of speech, parent node part of speech);
λ0: the weight of the pos feature, corresponding to i=0;
λ1: the weight of the pos_s pos feature, corresponding to i=1;
λ2: the weight of the pos pos_f feature, corresponding to i=2.
HIT-IR-LTP is the language technology platform developed by the Information Retrieval Laboratory of Harbin Institute of Technology. It contains many natural language processing modules, such as word segmentation and syntactic analysis, as well as some corpus resources, such as the dependency treebank HIT-IR-CDT. HIT-IR-LTP is now freely shared with the academic community.
The part-of-speech tagging module in HIT-IR-LTP reaches a precision of 90%. The HIT-IR-LTP part-of-speech tagger is used to tag Penn Chinese Treebank.
Although the precision of the HIT-IR-LTP part-of-speech tagging module is fairly high, errors are unavoidable. To make use of the original, correct part-of-speech and syntactic information in Penn Chinese Treebank, a part-of-speech mapping model is trained to correct the tagging results.
The part-of-speech mapping model uses a log-linear model with three features:
Parameters are estimated by maximum-likelihood estimation. Examples of the trained model probabilities follow.
f0_NN_n = 0.746038: the probability that NN is mapped to n;
f0_NN_v = 0.1699158: the probability that NN is mapped to v;
f1_VC_NN_n = 0.801055: the probability that NN is mapped to n when the child node is VC;
f1_VC_NN_v = 0.121002: the probability that NN is mapped to v when the child node is VC;
f2_NN_NN_n = 0.776695: the probability that NN is mapped to n when the parent node is NN;
f2_NN_NN_v = 0.180412: the probability that NN is mapped to v when the parent node is NN.
Part-of-speech conversion is carried out using the following formula of the part-of-speech mapping model:
λ0 = 0.4, the weight of the pos feature, corresponding to i=0;
λ1 = 0.3, the weight of the pos_s pos feature, corresponding to i=1;
λ2 = 0.3, the weight of the pos pos_f feature, corresponding to i=2.
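The weights and example probabilities above can be combined in a short sketch. The formula itself is not reproduced in this text, so the code assumes that the score of a target tag is the weighted sum of the three feature probabilities and that the tag with the highest score is chosen. The function names and the key layout of the table are hypothetical; the numbers are the ones quoted above.

```python
WEIGHTS = (0.4, 0.3, 0.3)  # weights of the pos, pos_s pos and pos pos_f features

# Model entries quoted in the text, keyed by (feature index, context, target tag).
MODEL = {
    (0, ("NN",), "n"): 0.746038,
    (0, ("NN",), "v"): 0.1699158,
    (1, ("VC", "NN"), "n"): 0.801055,
    (1, ("VC", "NN"), "v"): 0.121002,
    (2, ("NN", "NN"), "n"): 0.776695,
    (2, ("NN", "NN"), "v"): 0.180412,
}


def score(tag, pos, child_pos, parent_pos):
    """Weighted sum of the three feature probabilities for one target tag."""
    contexts = [(pos,), (child_pos, pos), (pos, parent_pos)]
    return sum(
        weight * MODEL.get((i, context, tag), 0.0)
        for i, (weight, context) in enumerate(zip(WEIGHTS, contexts))
    )


def map_tag(pos, child_pos, parent_pos, candidates=("n", "v")):
    """Pick the candidate tag with the highest combined score."""
    return max(candidates, key=lambda tag: score(tag, pos, child_pos, parent_pos))


best = map_tag("NN", child_pos="VC", parent_pos="NN")
```

Under this assumed scoring, an NN with a VC child and an NN parent is mapped to n, since all three features favor n over v.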
A comparison of tagging errors corrected by the part-of-speech mapping model is shown, for example, in the following table.
As can be seen from the above, using the original Penn Chinese Treebank information effectively corrects some part-of-speech tagging errors.
It should be noted that there is no restriction on the order of S702 and S703.
S704: Using the dependency relation mapping model obtained by training, perform dependency relation conversion on the dependency structures in the first treebank to obtain a dependency structure treebank of the second treebank type.
When the dependency relation mapping model is trained, the three features in the following table are used.
Feature | Explanation
phrase | own phrase type
phrase_s | generated phrase type
phrase_f | parent phrase type
The probabilities are trained by maximum-likelihood estimation to obtain the trained dependency relation mapping model, and dependency relation conversion is carried out using that model.
The formula of the dependency relation mapping model is as follows:
The weights of the three features phrase, phrase_s and phrase_f are 0.35, 0.3 and 0.35, respectively.
After dependency relation mapping, the result is as follows:
Word | Shanghai | Pudong | development | and | legal system | construction | synchronized
Number | 1 | 2 | 3 | 4 | 5 | 6 | 7
Dependency structure (parent node number) | 2 | 3 | 7 | 6 | 6 | 3 | 0
Syntactic relation labeler result | ATT | ATT | SBV | LAD | ATT | ATT | HED
Syntactic relation mapping model result | ATT | ATT | SBV | LAD | ATT | COO | HED
The present invention provides a kind of dependency structure treebank acquisition methods, including by the first treebank such as Chinese phrase structure treebank turn The dependency structure treebank of the second treebank type is changed to, the part-of-speech tagging collection in the first treebank is converted into and is met Chinese Industrial Standards (CIS) part of speech The step of mark that mark collection is required collects, contain the conversion of syntactic structure and the conversion of part-of-speech tagging collection so that after conversion Dependency structure treebank is more accurate, and the dependency structure treebank after conversion very easily can merge with original dependency structure treebank, So as to increase treebank scale, and then effectively improve the performance of parser.
Refer to Fig. 9, which is a structural diagram of the first embodiment of the dependency structure treebank acquisition system of the present invention.
The dependency structure treebank acquisition system of the first embodiment of the present invention includes a call unit 11 and a converting unit 12.
The call unit 11 is configured to call the first treebank; the first treebank is a Chinese phrase structure treebank.
The converting unit 12 is configured to convert the phrase structures in the first treebank into dependency structures, using the conversion tool of the first treebank and a parser respectively; the second treebank is a dependency structure treebank.
Converting the phrase structures in the first treebank into dependency structures using the conversion tool of the first treebank includes: converting the phrase structures into dependency structures using the rules that the conversion tool provides for converting phrase structures in the first treebank into dependency structures, or using rules obtained by modifying those rules; and converting the phrase structures of parallel (coordinate) structures in the first treebank into dependency structures using rules obtained by rule-based induction.
Converting the phrase structures in the first treebank into dependency structures using the parser includes: using the parser to convert the phrase structures of flat structures in the first treebank into dependency structures.
The converting unit 12 is further configured to use the dependency relation mapping model obtained by training to convert the dependency relations of the dependency structures in the first treebank, obtaining a dependency structure treebank of the second treebank type.
The converting unit 12 is connected to the call unit 11.
Optionally, the converting unit specifically includes a determination subunit and a scanning subunit:
The determination subunit is configured to determine, according to a pre-built Head (core node) mapping table, the core node of each grammar derivation in the phrase structure treebank of the first treebank.
The scanning subunit is configured to scan for the core node using the mapping table and according to the rules in the mapping table, and to obtain the dependency relations between the other child nodes and the core node.
The Head core node mapping table is formed from the rules that the conversion tool provides for converting phrase structures in the first treebank into dependency structures, or from rules obtained by modifying those rules.
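Head-table conversion of this kind can be sketched as follows. The table entries, the scan directions and the tree encoding below are hypothetical placeholders rather than the rules shipped with any actual conversion tool; the sketch only illustrates the two steps named above: determine the core (head) child of each derivation, then make the other children depend on it.

```python
# Hypothetical head table: phrase label -> (scan direction, child labels by priority).
HEAD_TABLE = {
    "IP": ("right", ["VP", "IP"]),
    "VP": ("left", ["VV", "VP"]),
    "NP": ("right", ["NN", "NR", "NP"]),
}

def find_head(label, children):
    """Return the index of the head child, scanning in the table's direction."""
    direction, priorities = HEAD_TABLE.get(label, ("right", []))
    order = list(range(len(children)))
    if direction == "right":
        order.reverse()
    for wanted in priorities:
        for i in order:
            if children[i][0] == wanted:
                return i
    return order[0]                     # fallback: first child in scan order

def to_dependencies(tree, deps):
    """tree is (label, [subtrees]) or (pos_tag, word_number); returns the lexical head."""
    label, body = tree
    if isinstance(body, int):           # leaf: the word is its own head
        return body
    lexical_heads = [to_dependencies(child, deps) for child in body]
    head = lexical_heads[find_head(label, body)]
    for other in lexical_heads:
        if other != head:
            deps[other] = head          # non-head children depend on the core node
    return head

# Toy tree: (IP (NP (NR 1) (NN 2)) (VP (VV 3)))
tree = ("IP", [("NP", [("NR", 1), ("NN", 2)]), ("VP", [("VV", 3)])])
deps = {}
root = to_dependencies(tree, deps)
deps[root] = 0                          # attach the sentence head to the artificial root
```

For the toy tree the result is word 1 depending on word 2, word 2 on word 3, and word 3 on the root.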
Optionally, the converting unit is specifically configured to use the parser to find, for the phrase structures of flat structures in the first treebank, the maximum spanning tree in a directed graph and to determine the dependency probabilities of the different phrases in the flat-structure phrase structure; and to convert the phrase structures of flat structures in the first treebank into dependency structures according to the dependency probabilities of the different phrases.
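To make the maximum-spanning-tree step concrete, the sketch below searches exhaustively over head assignments for a very short flat phrase and keeps the tree with the highest total log-probability. This brute-force search is a stand-in chosen for brevity; a practical parser would use an efficient algorithm such as Chu-Liu/Edmonds, and the source does not say which algorithm is used. The probabilities are invented for illustration.

```python
import itertools
import math

def is_tree(heads):
    """heads[i-1] is the head of word i; 0 is an artificial root with exactly one child."""
    if sum(1 for h in heads if h == 0) != 1:
        return False
    for node in range(1, len(heads) + 1):
        seen = set()
        current = node
        while current != 0:
            if current in seen:         # cycle (including a word heading itself)
                return False
            seen.add(current)
            current = heads[current - 1]
    return True

def best_tree(n, prob, floor=1e-9):
    """Exhaustively find the head assignment maximizing the sum of log arc probabilities."""
    best, best_score = None, -math.inf
    for heads in itertools.product(range(n + 1), repeat=n):
        if not is_tree(heads):
            continue
        total = sum(math.log(prob.get((h, d), floor))
                    for d, h in enumerate(heads, start=1))
        if total > best_score:
            best, best_score = heads, total
    return best

# Invented arc probabilities (head, dependent) for a flat three-word noun phrase.
prob = {(0, 3): 0.9, (3, 2): 0.6, (2, 1): 0.7, (3, 1): 0.2,
        (0, 1): 0.05, (0, 2): 0.05, (1, 2): 0.1}
heads = best_tree(3, prob)
```

The search space grows exponentially with phrase length, so this form is only suitable for illustrating the objective, not for production use.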
Optionally, the system further includes a parser training unit, configured to train the parser using phrases in the second treebank.
Optionally, the system further includes an adjustment unit, configured to obtain the conversion accuracy with which the flat-structure phrase structures are converted into dependency structures, and to adjust the training of the parser according to that accuracy.
Optionally, the adjustment unit is specifically configured to use Internet resources to search for and count the occurrence probability of the converted dependency structures, and to determine the conversion accuracy from that probability.
Optionally, the converting unit specifically includes a splitting subunit and a dependency determination subunit:
The splitting subunit is configured to split the phrase structure of the parallel structure into multiple fragments;
The dependency determination subunit is configured to determine the core node of each fragment, and to make every other node in a fragment depend on the core node of that fragment;
The dependency determination subunit is further configured to make the core node of every fragment other than the first fragment depend on the core node of the first fragment.
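The three steps above (split into fragments, attach fragment members to the fragment's core node, attach later core nodes to the first one) can be sketched as below. Two choices are assumptions made only for this example: the core node of a fragment is taken to be its last word, and a separator (conjunction or pause mark) is attached to the core node of the fragment that follows it.

```python
def convert_parallel_structure(tokens, pause_mark="、", conjunction_tag="CC"):
    """tokens: list of (word_number, word, pos_tag). Returns {dependent: head}."""
    fragments, separators, current = [], [], []
    for number, word, tag in tokens:
        if tag == conjunction_tag or word == pause_mark:
            if current:
                fragments.append(current)
                current = []
            # remember which fragment will follow this separator
            separators.append((number, len(fragments)))
        else:
            current.append(number)
    if current:
        fragments.append(current)
    if not fragments:
        return {}

    cores = [fragment[-1] for fragment in fragments]   # assumption: last word is the core
    deps = {}
    for fragment, core in zip(fragments, cores):
        for number in fragment:
            if number != core:
                deps[number] = core                    # members depend on their core node
    for core in cores[1:]:
        deps[core] = cores[0]                          # later cores depend on the first core
    for number, next_fragment in separators:
        # assumption: a separator depends on the core of the following fragment
        deps[number] = cores[min(next_fragment, len(cores) - 1)]
    return deps

# Words 3 to 6 of the earlier example sentence.
tokens = [(3, "开发", "NN"), (4, "与", "CC"), (5, "法制", "NN"), (6, "建设", "NN")]
deps = convert_parallel_structure(tokens)
```

On this input the sketch reproduces the attachments in the example table: word 5 depends on word 6, word 6 on word 3, and the conjunction (word 4) on word 6. The head of word 3 lies outside the parallel structure and is left undetermined.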
Optionally, the splitting subunit is configured to split the phrase structure of the parallel structure using conjunction parts of speech or pause marks as the splitting basis.
Optionally, the splitting subunit is configured to obtain the input-method input information and to perform the splitting using the input interruptions in that information as the splitting basis.
Optionally, the splitting subunit is configured to perform the splitting using an association relation as the splitting basis when the different phrases in the phrase structure of the parallel structure have that association relation.
Optionally, the dependency determination subunit is configured to take the sentence containing the phrase structure as the analysis object, determine the number of occurrences of each node of the fragment in the context of that sentence, and, by comparing the occurrence counts of the different nodes, select the node whose occurrence count meets the requirement as the core node.
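The requirement that an occurrence count must meet is not spelled out here. A minimal sketch, assuming that the most frequent word in the context is chosen and that ties go to the later word in the fragment, could look like this; the function name and the sample context are hypothetical.

```python
from collections import Counter

def pick_core_node(fragment, context_words):
    """Choose the fragment word that occurs most often in the sentence context.

    Assumptions: 'meets the requirement' means 'highest count', and ties are
    broken in favour of the later word in the fragment.
    """
    counts = Counter(context_words)
    best_index = max(range(len(fragment)),
                     key=lambda i: (counts[fragment[i]], i))
    return fragment[best_index]

# Hypothetical context in which the second word of the fragment is more frequent.
context = ["建设", "法制", "建设", "同步", "建设"]
core = pick_core_node(["法制", "建设"], context)
```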
Optionally, for establishing the dependency relation mapping model, the system further includes a training unit, a labeling unit and a correction unit:
The training unit is configured to train a dependency relation labeling model using the second treebank.
The labeling unit is configured to label the dependency relations of the first treebank using the dependency relation labeling model.
The correction unit is configured to correct the result of the dependency relation labeling using the original part-of-speech and syntactic information of the first treebank, thereby establishing the dependency relation mapping model.
Optionally, the dependency relation labeling model uses a second log-linear model
to perform dependency relation labeling;
where i = 0 corresponds to the (word, word_f) feature, i.e. the word and its parent word,
i = 1 corresponds to the (word, pos_f) feature, i.e. the word and the part of speech of its parent node,
i = 2 corresponds to the (pos, word_f) feature, i.e. the part of speech and the parent word,
i = 3 corresponds to the (pos, pos_f, distance) feature, i.e. the part of speech, the part of speech of the parent node, and the distance,
λ0: the weight of the (word, word_f) feature, for i = 0;
λ1: the weight of the (word, pos_f) feature, for i = 1;
λ2: the weight of the (pos, word_f) feature, for i = 2;
λ3: the weight of the (pos, pos_f, distance) feature, for i = 3.
For the dependency relation labeling model, see the description of the dependency relation labeling model in the dependency structure treebank acquisition method above.
Optionally, the dependency relation mapping model uses a third log-linear model
to perform dependency relation labeling;
where i = 0 corresponds to the phrase feature, i.e. the node's own phrase type,
i = 1 corresponds to the phrase_s feature, i.e. the phrase type generating the node itself,
i = 2 corresponds to the phrase_f feature, i.e. the phrase type of the parent,
λ0: the weight of the phrase feature, for i = 0;
λ1: the weight of the phrase_s feature, for i = 1;
λ2: the weight of the phrase_f feature, for i = 2.
For the dependency relation mapping model, see the description of the dependency relation mapping model in the dependency structure treebank acquisition method above.
Optionally, the first treebank is the Penn Chinese TreeBank (the University of Pennsylvania Chinese treebank), and the second treebank is HIT-IR-CDT (the Harbin Institute of Technology Chinese dependency treebank).
The dependency structure treebank acquisition system of this embodiment of the present invention includes the call unit 11, which calls the first treebank, and the converting unit 12, which converts the phrase structures in the first treebank into dependency structures and converts the dependency relations of those dependency structures to obtain a dependency structure treebank of the second treebank type. The system can therefore convert a Chinese phrase structure treebank into a dependency structure treebank. The converted dependency structure treebank can be merged easily with the original dependency structure treebank, which increases the treebank scale and in turn effectively improves parser performance.
In addition, because the converting unit 12 of the system can use the parser to convert the phrase structures of flat structures in the first treebank into dependency structures, the system solves the difficult problem of converting flat-structure phrase structures, such as compound noun phrases, into dependency structures.
Refer to Fig. 10, which is a structural diagram of the second embodiment of the dependency structure treebank acquisition system of the present invention.
Compared with the first embodiment, the second embodiment of the dependency structure treebank acquisition system adds a conversion unit 13.
The dependency structure treebank acquisition system further includes the conversion unit 13 connected to the converting unit 12, configured to convert the part-of-speech tag set of the first treebank into a tag set that meets the requirements of the Chinese standard part-of-speech tag set.
Optionally, the Chinese standard part-of-speech tag set is the 863 part-of-speech tag set.
Optionally, the conversion unit is specifically configured to tag the words of the first treebank with parts of speech using the second treebank, to classify the parts of speech using a pre-built part-of-speech mapping model, and to correct the tagged parts of speech.
Optionally, the part-of-speech mapping model uses a first log-linear model:
to perform part-of-speech conversion;
where i = 0 corresponds to the pos feature, i.e. the word's own part of speech,
i = 1 corresponds to the (pos_s, pos) feature, i.e. the part of speech of the child node and the word's own part of speech,
i = 2 corresponds to the (pos, pos_f) feature, i.e. the word's own part of speech and the part of speech of the parent node,
λ0: the weight of the pos feature, for i = 0;
λ1: the weight of the (pos_s, pos) feature, for i = 1;
λ2: the weight of the (pos, pos_f) feature, for i = 2.
For the part-of-speech mapping model, see the description of the part-of-speech mapping model in the dependency structure treebank acquisition method above.
The dependency structure treebank acquisition system of this embodiment of the present invention includes the call unit 11, which calls the first treebank; the converting unit 12, which converts the phrase structures in the first treebank into dependency structures and converts the dependency relations of those dependency structures to obtain a dependency structure treebank of the second treebank type; and the conversion unit 13, which converts the part-of-speech tag set of the first treebank into a tag set that meets the requirements of the Chinese standard part-of-speech tag set. The system thus performs both the conversion of the syntactic structure and the conversion of the part-of-speech tag set, so that the converted dependency structure treebank is more accurate. The converted dependency structure treebank can be merged easily with the original dependency structure treebank, which increases the treebank scale and in turn effectively improves parser performance.
In addition, because the converting unit 12 of the system can use the parser to convert the phrase structures of flat structures in the first treebank into dependency structures, the system solves the difficult problem of converting flat-structure phrase structures, such as compound noun phrases, into dependency structures.
The above are merely preferred embodiments of the present invention and do not limit its scope. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection claimed by the present invention.

Claims (24)

1. A dependency structure treebank acquisition method, characterized in that the method comprises:
calling a first treebank, the first treebank being a Chinese phrase structure treebank;
converting the phrase structures in the first treebank into dependency structures, using the conversion tool of the first treebank and a parser respectively, the second treebank being a dependency structure treebank;
wherein converting the phrase structures in the first treebank into dependency structures using the conversion tool of the first treebank comprises: converting the phrase structures into dependency structures using the rules that the conversion tool provides for converting phrase structures in the first treebank into dependency structures, or using rules obtained by modifying those rules; and converting the phrase structures of parallel structures in the first treebank into dependency structures using rules obtained by rule-based induction;
wherein converting the phrase structures in the first treebank into dependency structures using the parser comprises: using the parser to convert the phrase structures of flat structures in the first treebank into dependency structures;
converting the dependency relations of the dependency structures in the first treebank using a dependency relation mapping model obtained by training, to obtain a dependency structure treebank of the second treebank type.
2. The method according to claim 1, characterized in that converting the phrase structures into dependency structures using the rules that the conversion tool provides for converting phrase structures in the first treebank into dependency structures, or using rules obtained by modifying those rules, comprises: determining, according to a pre-built Head (core node) mapping table, the core node of each grammar derivation in the phrase structure treebank of the first treebank;
scanning for the core node using the mapping table and according to the rules in the mapping table, to obtain the dependency relations between the other child nodes and the core node;
wherein the Head core node mapping table is formed from the rules that the conversion tool provides for converting phrase structures in the first treebank into dependency structures, or from rules obtained by modifying those rules.
3. The method according to claim 1, characterized in that using the parser to convert the phrase structures of flat structures in the first treebank into dependency structures specifically comprises:
using the parser to find, for the phrase structures of flat structures in the first treebank, the maximum spanning tree in a directed graph, and determining the dependency probabilities of the different phrases in the flat-structure phrase structure;
converting the phrase structures of flat structures in the first treebank into dependency structures according to the dependency probabilities of the different phrases.
4. The method according to claim 1, 2 or 3, characterized in that the parser is trained using phrases in the second treebank.
5. The method according to claim 1, 2 or 3, characterized in that the method further comprises: obtaining the conversion accuracy with which the flat-structure phrase structures are converted into dependency structures, and adjusting the training of the parser according to that accuracy.
6. The method according to claim 5, characterized in that Internet resources are used to search for and count the occurrence probability of the converted dependency structures, and the conversion accuracy is determined from that probability.
7. The method according to claim 1, characterized in that converting the phrase structures of parallel structures in the first treebank into dependency structures using rules obtained by rule-based induction specifically comprises:
splitting the phrase structure of the parallel structure into multiple fragments;
determining the core node of each fragment, and making every other node in a fragment depend on the core node of that fragment;
making the core node of every fragment other than the first fragment depend on the core node of the first fragment.
8. The method according to claim 7, characterized in that splitting the phrase structure of the parallel structure into multiple fragments specifically comprises:
performing the splitting using conjunction parts of speech or pause marks as the splitting basis.
9. The method according to claim 7, characterized in that splitting the phrase structure of the parallel structure into multiple fragments specifically comprises:
obtaining the input-method input information, and performing the splitting using the input interruptions in that information as the splitting basis.
10. The method according to claim 7, characterized in that splitting the phrase structure of the parallel structure into multiple fragments specifically comprises:
when the different phrases in the phrase structure of the parallel structure have an association relation, performing the splitting using the association relation as the splitting basis.
11. The method according to claim 7, characterized in that determining the core node of each fragment comprises: taking the sentence containing the phrase structure as the analysis object, determining the number of occurrences of each node of the fragment in the context of that sentence, and, by comparing the occurrence counts of the different nodes, selecting the node whose occurrence count meets the requirement as the core node.
12. The method according to claim 1, characterized in that establishing the dependency relation mapping model comprises:
training a dependency relation labeling model using the second treebank;
labeling the dependency relations of the first treebank using the dependency relation labeling model;
correcting the result of the dependency relation labeling using the original part-of-speech and syntactic information of the first treebank, thereby establishing the dependency relation mapping model.
13. The method according to claim 12, characterized in that the dependency relation labeling model uses a second log-linear model
where i = 0 corresponds to the (word, word_f) feature, i.e. the word and its parent word,
i = 1 corresponds to the (word, pos_f) feature, i.e. the word and the part of speech of its parent node,
i = 2 corresponds to the (pos, word_f) feature, i.e. the part of speech and the parent word,
i = 3 corresponds to the (pos, pos_f, distance) feature, i.e. the part of speech, the part of speech of the parent node, and the distance,
λ0: the weight of the (word, word_f) feature, for i = 0;
λ1: the weight of the (word, pos_f) feature, for i = 1;
λ2: the weight of the (pos, word_f) feature, for i = 2;
λ3: the weight of the (pos, pos_f, distance) feature, for i = 3.
14. The method according to claim 12 or 13, characterized in that the dependency relation mapping model uses a third log-linear model
where i = 0 corresponds to the phrase feature, i.e. the node's own phrase type,
i = 1 corresponds to the phrase_s feature, i.e. the phrase type generating the node itself,
i = 2 corresponds to the phrase_f feature, i.e. the phrase type of the parent,
λ0: the weight of the phrase feature, for i = 0;
λ1: the weight of the phrase_s feature, for i = 1;
λ2: the weight of the phrase_f feature, for i = 2.
15. The method according to any one of claims 1 to 14, characterized in that the method further comprises:
converting the part-of-speech tag set of the first treebank into a tag set that meets the requirements of the Chinese standard part-of-speech tag set.
16. The method according to claim 15, characterized in that the Chinese standard part-of-speech tag set is the 863 part-of-speech tag set.
17. The method according to claim 15 or 16, characterized in that converting the part-of-speech tag set of the first treebank into a tag set that meets the requirements of the Chinese standard part-of-speech tag set comprises:
tagging the words of the first treebank with parts of speech using the second treebank, classifying the parts of speech using a pre-built part-of-speech mapping model, and correcting the tagged parts of speech.
18. The method according to claim 17, characterized in that the part-of-speech mapping model uses a first log-linear model:
where i = 0 corresponds to the pos feature, i.e. the word's own part of speech,
i = 1 corresponds to the (pos_s, pos) feature, i.e. the part of speech of the child node and the word's own part of speech,
i = 2 corresponds to the (pos, pos_f) feature, i.e. the word's own part of speech and the part of speech of the parent node,
λ0: the weight of the pos feature, for i = 0;
λ1: the weight of the (pos_s, pos) feature, for i = 1;
λ2: the weight of the (pos, pos_f) feature, for i = 2.
19. The method according to any one of claims 1 to 18, characterized in that the first treebank is the Penn Chinese TreeBank (the University of Pennsylvania Chinese treebank), and the second treebank is HIT-IR-CDT (the Harbin Institute of Technology Chinese dependency treebank).
20. A dependency structure treebank acquisition system, characterized in that the system comprises a call unit and a converting unit:
the call unit is configured to call a first treebank, the first treebank being a Chinese phrase structure treebank;
the converting unit is configured to convert the phrase structures in the first treebank into dependency structures, using the conversion tool of the first treebank and a parser respectively, the second treebank being a dependency structure treebank;
wherein converting the phrase structures in the first treebank into dependency structures using the conversion tool of the first treebank comprises: converting the phrase structures into dependency structures using the rules that the conversion tool provides for converting phrase structures in the first treebank into dependency structures, or using rules obtained by modifying those rules; and converting the phrase structures of parallel structures in the first treebank into dependency structures using rules obtained by rule-based induction;
wherein converting the phrase structures in the first treebank into dependency structures using the parser comprises: using the parser to convert the phrase structures of flat structures in the first treebank into dependency structures;
the converting unit is further configured to convert the dependency relations of the dependency structures in the first treebank using a dependency relation mapping model obtained by training, to obtain a dependency structure treebank of the second treebank type.
21. The system according to claim 20, characterized in that the converting unit specifically comprises a determination subunit and a scanning subunit:
the determination subunit is configured to determine, according to a pre-built Head (core node) mapping table, the core node of each grammar derivation in the phrase structure treebank of the first treebank;
the scanning subunit is configured to scan for the core node using the mapping table and according to the rules in the mapping table, and to obtain the dependency relations between the other child nodes and the core node;
wherein the Head core node mapping table is formed from the rules that the conversion tool provides for converting phrase structures in the first treebank into dependency structures, or from rules obtained by modifying those rules.
22. The system according to claim 20, characterized in that the converting unit specifically comprises a splitting subunit and a dependency determination subunit:
the splitting subunit is configured to split the phrase structure of the parallel structure into multiple fragments;
the dependency determination subunit is configured to determine the core node of each fragment, and to make every other node in a fragment depend on the core node of that fragment;
the dependency determination subunit is further configured to make the core node of every fragment other than the first fragment depend on the core node of the first fragment.
23. The system according to claim 20, characterized in that, for establishing the dependency relation mapping model, the system further comprises a training unit, a labeling unit and a correction unit:
the training unit is configured to train a dependency relation labeling model using the second treebank;
the labeling unit is configured to label the dependency relations of the first treebank using the dependency relation labeling model;
the correction unit is configured to correct the result of the dependency relation labeling using the original part-of-speech and syntactic information of the first treebank, thereby establishing the dependency relation mapping model.
24. The system according to any one of claims 20 to 23, characterized in that the system further comprises a conversion unit:
the conversion unit is configured to convert the part-of-speech tag set of the first treebank into a tag set that meets the requirements of the Chinese standard part-of-speech tag set.
CN201611208593.6A 2016-12-23 2016-12-23 Dependency structure treebank acquisition method and system Active CN106598951B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611208593.6A CN106598951B (en) 2016-12-23 2016-12-23 Dependency structure treebank acquisition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611208593.6A CN106598951B (en) 2016-12-23 2016-12-23 Dependency structure treebank acquisition method and system

Publications (2)

Publication Number Publication Date
CN106598951A true CN106598951A (en) 2017-04-26
CN106598951B CN106598951B (en) 2019-08-16

Family

ID=58601481

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611208593.6A Active CN106598951B (en) 2016-12-23 2016-12-23 Dependency structure treebank acquisition method and system

Country Status (1)

Country Link
CN (1) CN106598951B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101201819A (en) * 2007-11-28 2008-06-18 北京金山软件有限公司 Method and system for transferring tree bank
CN101382844A (en) * 2008-10-24 2009-03-11 上海埃帕信息科技有限公司 Method for inputting spacing participle

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
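Zhou Huiwei et al.: "Research on the conversion of phrase structure treebanks to dependency structure treebanks", Journal of Dalian University of Technology *
Li Zhenghua: "Research on statistical models for dependency parsing and on treebank conversion", China Master's Theses Full-text Database, Information Science and Technology *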

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391488A (en) * 2017-07-28 2017-11-24 昆明理工大学 A kind of interdependent syntactic analysis method of Chinese of minimum spanning tree statistics fusion
CN107656921A (en) * 2017-10-10 2018-02-02 上海数眼科技发展有限公司 A kind of short text dependency analysis method based on deep learning
CN108628829A (en) * 2018-04-23 2018-10-09 苏州大学 Automatic treebank method for transformation based on tree-like Recognition with Recurrent Neural Network and system
CN108628829B (en) * 2018-04-23 2022-03-15 苏州大学 Automatic tree bank transformation method and system based on tree-shaped cyclic neural network
CN109460552A (en) * 2018-10-29 2019-03-12 朱丽莉 Rule-based and corpus Chinese faulty wording automatic testing method and equipment
US11769007B2 (en) 2021-05-27 2023-09-26 International Business Machines Corporation Treebank synthesis for training production parsers

Also Published As

Publication number Publication date
CN106598951B (en) 2019-08-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant