CN101901213A - Instance-based dynamic generalization coreference resolution method - Google Patents

Instance-based dynamic generalization coreference resolution method

Info

Publication number
CN101901213A
CN101901213A CN2010102397366A CN201010239736A
Authority
CN
China
Prior art keywords
generalization
point
training
positive
subclass
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010102397366A
Other languages
Chinese (zh)
Inventor
秦兵
刘挺
郎君
黎耀炳
张牧宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN2010102397366A priority Critical patent/CN101901213A/en
Publication of CN101901213A publication Critical patent/CN101901213A/en
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses an instance-based dynamic generalization coreference resolution method, and relates to the field of text information extraction. The method consists of a training instance library construction stage and an in-document entity resolution stage, and completes coreference resolution through instance construction, instance library construction, index creation, dynamic generalization, instance retrieval, and coreference chain merging. The method eliminates the long-tail effect found in statistical coreference models and brings low-frequency training samples fully into play, so that scarce, valuable training data is fully used; its dynamic generalization mechanism adaptively converts the classification of a test instance into the selection and use of the best generalization points in the training instance library, ultimately finding the best-matching training instances.

Description

An instance-based dynamic generalization coreference resolution method
Technical field
The present invention relates to the field of text information extraction, and specifically to an instance-based dynamic generalization coreference resolution method.
Background technology
In recent years, with the explosive growth of information on the Internet, the new information appearing every day has far exceeded human processing capacity. In fields such as natural language processing and information retrieval, the same real-world thing often has different names and descriptions. Correctly mapping these back to the concrete thing is essential for the subsequent processing and deep understanding of the data. In natural language processing, resolving the nouns, pronouns, and common noun phrases that point to the same entity makes later descriptions of entity relations more complete and lays the foundation for other natural language processing fields, such as machine translation, information extraction, automatic summarization, and information retrieval. So-called coreference resolution is the partitioning of all mentions in a document into equivalence classes according to each mention's own content and surrounding context. For example, an article discussing trade among major countries such as China, the United States, and Japan may open with "the People's Republic of China", later say "China", "great China", and so on, and also mention "this country", "she", and the like. These expressions are all different realizations of the single entity "the People's Republic of China". While people can distinguish the varied realizations of the same entity in an article without any difficulty, this remains very hard for a computer. In a sense, coreference plays the role of a hyperlink within natural language. On the one hand, it lets an author express a certain style when writing and achieve discourse coherence; on the other hand, the coreference phenomenon introduces additional ambiguity into natural language understanding and brings difficulty to natural language processing in other fields, such as machine translation and information extraction.
The goal of coreference resolution research is to find every equivalent description of the same entity within a discourse, laying the foundation for subsequent natural language processing.
Coreference resolution research faces many difficulties. It requires not only linguistic knowledge, for example shallow lexical and syntactic knowledge, but also more macroscopic semantic and discourse knowledge, together with rich background knowledge, before the task can be completed. Fully automatic coreference resolution is an important and difficult task in computer understanding of natural language. It has been studied abroad for decades but has only just begun at home. As coreference resolution research has deepened, it has now reached a bottleneck stage. The most critical problem is the scarcity of relevant corpora, which means that traditional methods, whether based on linguistic rules or on statistics, can cover only the majority of training samples and fail to fully exploit the low-frequency ones.
Coreference resolution methods based on linguistic rules mainly include the Hobbs algorithm, centering theory, and several methods derived from centering theory. Rule-based methods are all subjective treatments that predecessors summarized after examining large numbers of language phenomena in the relevant corpora. Such regularity-driven summaries inevitably miss the many coreference phenomena that each occur only rarely; in particular, rules derived from small-scale corpora are hard to apply to real, large-scale processing. In practice, rule-based methods all perform poorly, which ultimately drove the development of statistics-based research methods.
The application of statistical learning methods to the coreference resolution problem began in 1995. Since McCarthy and Lehnert (1995) first cast coreference resolution as binary classification and adopted the C4.5 decision tree (Decision Trees) algorithm, coreference resolution has made significant progress under the binary classification framework. The typical statistics-based machine learning methods in common use include decision trees, maximum entropy, and support vector machines. These statistical classification techniques all first train on a corpus and, after obtaining a learning model that describes the problem uniformly, apply that model to the instances to be classified. Although such methods can achieve some success, they have a particular problem for coreference resolution: during training, each step of the classifier's continual optimization selects the optimization direction that covers the most instances, with no consideration for the instances left uncovered. The model finally learned in this way can cover only the majority of cases, and lower-frequency instances risk being misclassified. This situation is especially acute for coreference, where the number of training instances is relatively small to begin with; in fact, the low-frequency instances at risk of misclassification are numerous.
Summary of the invention
To address the problems above, the invention discloses an instance-based dynamic generalization coreference resolution method. It not only removes the long-tail effect in statistical coreference models but also brings low-frequency training samples fully into play, so that training samples that were precious to begin with are fully used; its instance dynamic generalization mechanism adaptively converts the classification problem into the selection and use of the best generalization points in the training instance library, ultimately finding the training instances that best match the test instance.
The technical scheme by which the present invention solves the technical problem above is: an instance-based dynamic generalization coreference resolution method, characterized in that the method consists of a training instance library construction stage and an in-document entity resolution stage; the training instance library construction stage comprises:
A. Perform low-level natural language preprocessing on the corpus and extract the candidate noun phrases that may corefer with one another;
B. Using the noun phrases on the coreference chains annotated in the corpus together with the noun phrases extracted in A, construct the positive/negative training instances;
C. Extract the feature values of each positive/negative instance and generate from them the "generalization points" belonging to that instance;
D. Construct the training instance library carrying the "generalization points", and build an inverted index over the training instance library.
The in-document entity resolution stage comprises:
E. Receive the plain text to be processed, perform the various low-level natural language preprocessing steps, and extract the candidate noun phrases that may corefer with one another;
F. Using the noun phrases extracted in E, construct the candidate instances that may carry a coreference relation, and extract the feature values of each candidate instance;
G. From the feature values of the candidate instance, generate the "generalization points" belonging to that instance;
H. Following the dynamic generalization algorithm, use the candidate instance's "generalization points" to repeatedly screen the instances in the training instance library; the proportion of positive examples among the remaining training instances serves as the positive-example confidence of this test instance;
I. Produce the binary classification decision from the positive-example confidence of each candidate instance and merge the results into the final coreference chains, completing the coreference resolution.
The natural language preprocessing in steps A and E comprises the following steps:
Sentence splitting: split the document correctly into individual sentences according to the punctuation marks in the document;
Word segmentation: cut the character strings in the document into individual words;
Part-of-speech tagging: attach a part-of-speech label to each word produced by word segmentation;
Noun phrase recognition: identify the noun phrases in the document from the part-of-speech tagging results together with keywords signalling qualitative and demonstrative description;
Named entity recognition: identify the named entities of interest in the current domain from the word segmentation and part-of-speech tagging results;
Syntactic parsing: build the phrase-structure parse tree of each sentence from the word segmentation and part-of-speech tagging results.
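As a rough illustration, the preprocessing chain can be sketched with a few toy functions (all function names and the tiny noun lexicon are assumptions made here for illustration; the patent presupposes real Chinese NLP components for segmentation, tagging, and parsing):

```python
import re

# Toy stand-ins for three of the preprocessing stages: sentence splitting,
# word segmentation, and noun phrase recognition. The lexicon is assumed.
NOUN_LEXICON = {"monkey", "banana", "it"}

def split_sentences(doc):
    """Sentence splitting: cut the document at sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]

def segment(sentence):
    """Word segmentation: whitespace tokens, with the period split off."""
    return sentence.replace(".", " .").split()

def candidate_noun_phrases(words):
    """Noun phrase recognition: keep tokens found in the noun lexicon."""
    return [w for w in words if w.lower() in NOUN_LEXICON]

doc = "The monkey eats a banana. It was hungry."
sentences = split_sentences(doc)
candidates = [np for s in sentences for np in candidate_noun_phrases(segment(s))]
print(sentences)   # two sentences
print(candidates)  # ['monkey', 'banana', 'It']
```

The three candidates extracted here are exactly the mentions that the later stages pair up and classify.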
The construction of training instances in step B of the present invention comprises:
a. Two adjacent noun phrases i and j on a coreference chain that stand in a coreference relation form a positive-example pair <i, j>;
b. Any other noun phrase k lying between two adjacent coreferring noun phrases i and j on the chain (i < k < j) forms, together with noun phrase j, a negative-example pair <k, j>.
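Under the two rules above, pair construction from one annotated chain can be sketched as follows (mentions are identified by their positions in document order; the function name is an assumption):

```python
# Build positive/negative training pairs from one coreference chain.
# `chain` lists the document-order indices of the coreferring mentions.
def build_training_pairs(chain):
    positives, negatives = [], []
    for i, j in zip(chain, chain[1:]):     # adjacent coreferring mentions
        positives.append((i, j))           # rule a: positive pair <i, j>
        for k in range(i + 1, j):          # rule b: every mention strictly
            negatives.append((k, j))       #         between i and j -> <k, j>
    return positives, negatives

pos, neg = build_training_pairs([0, 3, 5])
print(pos)  # [(0, 3), (3, 5)]
print(neg)  # [(1, 3), (2, 3), (4, 5)]
```

Each negative pair shares its anaphor with a positive pair, which is what makes the two classes directly comparable during screening.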
In the present invention, the detailed process of generating "generalization points" from the feature values of a training/test instance is:
For each attribute of the instance under examination, extract the feature value; each feature value corresponds to one "generalization point", formalized as "[a/b/ab].feature name.feature value", where the first part denotes the object the feature describes: a denotes the antecedent, b denotes the anaphor, and ab denotes the combination of antecedent and anaphor.
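A minimal sketch of this formalization (the dictionary keys and feature names are illustrative assumptions, not the patent's feature inventory):

```python
# Turn an instance's feature values into generalization point strings of
# the form "scope.feature_name.feature_value", scope being a, b, or ab.
def generalization_points(features):
    """features: dict mapping (scope, feature_name) -> feature_value."""
    return {f"{scope}.{name}.{value}" for (scope, name), value in features.items()}

points = generalization_points({
    ("a", "Head"): "monkey",
    ("b", "Head"): "it",
    ("ab", "SentenceDistance"): 1,
    ("ab", "GenderConsistent"): "T",
})
print(sorted(points))
```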
The detailed process in step D of the present invention of constructing the training instance library and building the inverted index is:
a. Generate all "generalization points" for each positive/negative training instance;
b. Each row of the training instance library stores all the information of one training instance, including the instance's class label "+" or "-" and all its "generalization points", each generalization point consisting of three parts: generalization point type, feature name, and feature value;
c. On the training instance library so built, take each generalization point as a key word and the list of positions within the library of all training instances carrying that point as its index entry, thereby building the inverted index of the training instance library.
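Step c above can be sketched as a plain point-to-rows mapping (row ids stand in for the "positions in the library"; names are assumed):

```python
from collections import defaultdict

# Inverted index: generalization point -> list of row ids of the training
# instances that carry it.
def build_inverted_index(instances):
    """instances: list of (label, set_of_points) rows of the library."""
    index = defaultdict(list)
    for row, (_label, points) in enumerate(instances):
        for p in points:
            index[p].append(row)
    return index

library = [
    ("+", {"a.Head.monkey", "ab.SentenceDistance.1"}),
    ("-", {"a.Head.banana", "ab.SentenceDistance.1"}),
]
idx = build_inverted_index(library)
print(idx["ab.SentenceDistance.1"])  # [0, 1]
print(idx["a.Head.monkey"])          # [0]
```

The index makes the repeated screening in step H a sequence of cheap list intersections rather than full scans of the library.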
The generalization point types mentioned above divide into the following three kinds:
a. Enumerated type: the possible values of the feature form a discrete set;
b. Fixed infinite type: this type covers features that may return infinitely many results but whose values need no further decomposition when generalization points are matched in the dynamic generalization process; it mainly refers to features whose return value is a character string;
c. Variable infinite type: this type covers features that may return infinitely many results and whose values must be decomposed further when generalization points are matched in the dynamic generalization process; it mainly refers to features whose return value is a tree-shaped graph structure.
The detailed procedure of the dynamic generalization algorithm in step H of the present invention is:
i. Through the generalization point generation process described above, generate the generalization points of the instance to be classified; together they form the generalization point set G;
ii. Take all instances of the training instance library as the instance set S to be screened;
iii. According to the generalization point selection criterion, choose from the set G a generalization point g such that the subset G' formed by all instances in S carrying that point satisfies the selection criterion; this g is called the best generalization point;
iv. Delete g from G and let S = { all instances of the former S carrying generalization point g };
v. If all instances in S belong to the same class (i.e. all positive or all negative), or G is empty, take the positive-example proportion in the final remaining instance subset S as the positive-example confidence of the instance to be classified and terminate the iteration; otherwise, return to step iii.
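The loop in steps i-v can be sketched as follows. As simplifying assumptions, this sketch fixes the selection criterion to "maximize covered instances" (one of the criteria the method allows), ignores the per-type priority ordering, and skips a point whose subset would be empty:

```python
# Dynamic generalization: iteratively pick a generalization point carried
# by the test instance, keep only training instances that also carry it,
# and stop when the survivors are pure or no points remain.
def positive_confidence(test_points, library):
    """library: list of (label, point_set) rows, label '+' or '-'."""
    G = set(test_points)                    # step i: points of the instance
    S = list(library)                       # step ii: whole library
    while G:
        if len({lab for lab, _ in S}) <= 1:
            break                           # step v: survivors are pure
        # step iii (toy criterion): the point covering the most of S
        g = max(G, key=lambda p: sum(p in pts for _, pts in S))
        G.discard(g)                        # step iv
        matched = [(lab, pts) for lab, pts in S if g in pts]
        if matched:                         # simplification: never empty S
            S = matched
    return sum(lab == "+" for lab, _ in S) / len(S) if S else 0.0

lib = [("+", {"p", "q"}), ("+", {"p"}), ("-", {"q"})]
print(positive_confidence({"p"}, lib))  # 1.0: only the two "+" rows carry p
print(positive_confidence({"q"}, lib))  # 0.5: one "+" and one "-" carry q
```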
The generalization point selection criteria available for step iii are the following:
a. Maximize the absolute difference of positive and negative proportions: among the instance subset G' carrying the point, the absolute value of the difference between the positive-example proportion and the negative-example proportion is maximal;
b. Maximize the number of covered instances: the number of instances in the subset G' carrying the point is maximal;
c. Maximize the number of positive examples: the number of positive examples in the subset G' carrying the point is maximal;
d. Minimize the number of covered instances: subject to the subset G' carrying the point being non-empty, the number of instances is minimal;
e. Maximize the positive-example proportion: among the subset G' carrying the point, the positive-example proportion is maximal.
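The five criteria can be read as scoring functions over the labels of the subset G' that a candidate point would retain; the point with the highest score wins. A sketch (criterion names are assumptions):

```python
# Score the instance subset a candidate generalization point would retain.
# `subset` is the list of '+'/'-' labels of the instances carrying the point.
def score(criterion, subset):
    n = len(subset)
    pos = subset.count("+")
    if n == 0:
        return float("-inf")              # empty subsets never win
    if criterion == "abs_diff":           # a. |pos% - neg%| maximal
        return abs(pos / n - (n - pos) / n)
    if criterion == "coverage":           # b. covered count maximal
        return n
    if criterion == "pos_count":          # c. positive count maximal
        return pos
    if criterion == "min_coverage":       # d. covered count minimal (non-empty)
        return -n
    if criterion == "pos_ratio":          # e. positive proportion maximal
        return pos / n
    raise ValueError(criterion)

print(score("coverage", ["+", "-"]))        # 2
print(score("abs_diff", ["+", "+", "-"]))   # |2/3 - 1/3|
```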
Selection of the generalization point in step iii additionally follows this priority order:
First, when the subset of variable-infinite-type generalization points is non-empty, the best generalization point is preferentially chosen, by the selection criterion, from the variable-infinite-type subset;
Next, when the subset of fixed-infinite-type generalization points is non-empty, the best generalization point is preferentially chosen, by the selection criterion, from the fixed-infinite-type subset;
Last, the best generalization point is chosen, by the selection criterion, from the enumerated-type subset.
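The priority rule above amounts to drawing the best point from the highest-priority non-empty type group; a sketch (names and the toy criterion are assumptions):

```python
# Type priority for best-point selection: variable infinite first, then
# fixed infinite, then enumerated.
PRIORITY = ["variable_infinite", "fixed_infinite", "enumerated"]

def pick_best(points_by_type, score):
    """score: point -> float, the selection criterion of the method."""
    for t in PRIORITY:
        group = points_by_type.get(t, [])
        if group:                         # first non-empty group wins
            return max(group, key=score)
    return None

best = pick_best(
    {"enumerated": ["ab.SentenceDistance.1"], "fixed_infinite": ["a.Head.monkey"]},
    score=len,                            # toy criterion for illustration
)
print(best)  # a.Head.monkey: fixed infinite outranks enumerated
```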
When computing the instance subset G' in step iii, generalization point matching adopts a different strategy per type: for an enumerated-type or fixed-infinite-type generalization point g, an instance carries g if and only if one of that instance's own generalization points is identical to g; for a variable-infinite-type point, nodes of the graph structure are deleted one by one to relax the constraint until the pruned substructure appears as a subgraph in at least one training instance, and that substructure is then used to screen the training instances.
The process in step I of the present invention of producing the binary classification decision from each candidate instance's positive-example confidence is: if the positive-example confidence that the dynamic generalization algorithm assigns to a candidate instance exceeds 0.5, the candidate instance is judged a positive example, i.e. the two corresponding noun phrases corefer.
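Step I also merges the positive pairs into final coreference chains. The patent only states that positive pairs are combined; as an assumed merging strategy, a sketch with a simple union-find over mention indices:

```python
# Merge mention pairs judged positive (confidence > 0.5) into chains.
def merge_chains(num_mentions, positive_pairs):
    parent = list(range(num_mentions))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for i, j in positive_pairs:
        parent[find(i)] = find(j)           # union the two mentions' sets
    groups = {}
    for m in range(num_mentions):
        groups.setdefault(find(m), []).append(m)
    return [g for g in groups.values() if len(g) > 1]  # chains of size >= 2

print(merge_chains(5, [(0, 2), (2, 4)]))  # [[0, 2, 4]]
```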
In the instance-based dynamic generalization coreference resolution method of the present invention, for a variable-infinite-type generalization point whose feature value is the phrase-structure parse tree connecting the antecedent and the anaphor, the concrete pruning proceeds as follows:
a. The shortest path connecting the antecedent and the anaphor in the phrase-structure parse tree is called the "critical path";
b. Starting from the bottom layer of the phrase-structure parse tree, delete layer by layer every node not on the "critical path", until the pruned substructure appears as a subgraph in at least one training instance.
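One relaxation step of rule b can be sketched on a toy tree encoding (trees as `(label, children)` tuples; the tree, labels, and path set are illustrative assumptions):

```python
# One layer of constraint relaxation: delete off-path leaves at the
# deepest layer of the tree; nodes on the "critical path" are kept.
def tree_depth(tree):
    _label, children = tree
    return 1 + max((tree_depth(c) for c in children), default=0)

def prune_one_layer(tree, on_path):
    bottom = tree_depth(tree) - 1          # 0-indexed depth of deepest nodes
    def rec(node, depth):
        label, children = node
        kept = [rec(c, depth + 1) for c in children
                if not (depth + 1 == bottom           # child sits at bottom
                        and c[0] not in on_path       # ... off critical path
                        and not c[1])]                # ... and is a leaf
        return (label, kept)
    return rec(tree, 0)

t = ("IP", [("NP", [("monkey", [])]),
            ("VP", [("VV", []), ("NP2", [])])])
pruned = prune_one_layer(t, on_path={"IP", "NP", "VP", "monkey"})
print(pruned)  # ('IP', [('NP', [('monkey', [])]), ('VP', [])])
```

Repeating the step relaxes the tree further, layer by layer, until the remaining substructure matches some training instance as a subgraph.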
The present invention has the following advantages: while ensuring that the majority of training samples are covered correctly, it brings the low-frequency training samples into play as much as possible, so that training data that was precious to begin with is fully used. It improves on the shortcomings of traditional methods in handling low-frequency instances and achieves better results on the resolution problems that traditional methods cannot handle effectively. Given the peculiarities of the coreference resolution problem, the handling of low-frequency instances is all the more important, and this method holds a clear advantage in that respect. Unlike previous methods, the instance-based dynamic generalization mechanism converts the classification of a test instance into the selection and use of the best generalization points in the training instance library. Rather than applying one unified model and parameter set to every instance as traditional methods do, it adaptively finds the best generalization points and, after successive generalization, the best-matching training instances; since it can automatically choose appropriate generalization points for different instances and avoids a single unified model, it has stronger adaptability. In sum, the method performs coreference resolution more effectively.
Description of drawings
Fig. 1 is the flowchart of the overall framework of the instance-based dynamic generalization coreference resolution method of the present invention;
Fig. 2 is a schematic diagram of generating generalization points from a training/test instance;
Fig. 3 is the flowchart of a first embodiment of the dynamic generalization and retrieval algorithm in the present invention;
Fig. 4 is a schematic diagram of an embodiment of layer-by-layer constraint relaxation for variable-infinite-type generalization points in the dynamic generalization and retrieval algorithm.
Embodiment
The present invention is further illustrated below in conjunction with accompanying drawings 1-4 and the embodiments:
(As shown in Fig. 1) An instance-based dynamic generalization coreference resolution method, the method consisting of a training instance library construction stage and an in-document entity resolution stage; the training instance library construction stage comprises:
A. Perform low-level natural language preprocessing on the corpus and extract the candidate noun phrases that may corefer with one another;
B. Using the noun phrases on the coreference chains annotated in the corpus together with the noun phrases extracted in A, construct the positive/negative training instances;
C. Extract the feature values of each positive/negative instance and generate from them the "generalization points" belonging to that instance;
D. Construct the training instance library carrying the "generalization points", and build an inverted index over the training instance library.
The in-document entity resolution stage comprises:
E. Receive the plain text to be processed, perform the various low-level natural language preprocessing steps, and extract the candidate noun phrases that may corefer with one another;
F. Using the noun phrases extracted in E, construct the candidate instances that may carry a coreference relation, and extract the feature values of each candidate instance;
G. From the feature values of the candidate instance, generate the "generalization points" belonging to that instance;
H. Screen the instance library according to the dynamic generalization algorithm. The so-called dynamic generalization method selects dynamically, one by one, among the generalization points carried by the test instance and, according to the chosen generalizations, filters out from the training instance library the instance subset that helps make the final classification decision. The candidate instance's "generalization points" are used to repeatedly screen the instances in the training instance library, and the proportion of positive examples among the remaining training instances serves as the positive-example confidence of this test instance;
I. Produce the binary classification decision from the positive-example confidence of each candidate instance and merge the results into the final coreference chains, completing the coreference resolution.
In the present embodiment, the natural language preprocessing in steps A and E comprises the following steps:
Sentence splitting: split the document correctly into individual sentences according to the punctuation marks in the document;
Word segmentation: cut the character strings in the document into individual words;
Part-of-speech tagging: attach a part-of-speech label to each word produced by word segmentation;
Noun phrase recognition: identify the noun phrases in the document from the part-of-speech tagging results together with keywords signalling qualitative and demonstrative description;
Named entity recognition: identify the named entities of interest in the current domain from the word segmentation and part-of-speech tagging results;
Syntactic parsing: build the phrase-structure parse tree of each sentence from the word segmentation and part-of-speech tagging results.
In the present embodiment, the construction of training instances in step B comprises:
a. Two adjacent noun phrases i and j on a coreference chain that stand in a coreference relation form a positive-example pair <i, j>;
b. Any other noun phrase k lying between two adjacent coreferring noun phrases i and j on the chain (i < k < j) forms, together with noun phrase j, a negative-example pair <k, j>.
The detailed process in step D of this embodiment of constructing the training instance library and building the inverted index is:
a. Generate all "generalization points" for each positive/negative training instance;
b. Each row of the training instance library stores all the information of one training instance, including the instance's class label "+" or "-" and all its "generalization points", each generalization point consisting of three parts: generalization point type, feature name, and feature value;
c. On the training instance library so built, take each generalization point as a key word and the list of positions within the library of all training instances carrying that point as its index entry, thereby building the inverted index of the training instance library.
The instances used in the training and testing processes (as shown in Fig. 2) are noun phrase pairs that may carry a coreference relation. A pair whose two noun phrases do corefer is a positive example; otherwise it is called a negative example. Within an instance, the noun phrase occurring earlier in the text is called the "candidate antecedent" and is denoted mention a (Mention a); the one occurring later is called the "anaphor" and is denoted mention b (Mention b).
Specifically, a generalization point is simply one feature value of an instance; each feature value corresponds to one "generalization point", uniformly formalized as "[a/b/ab].[feature name].[feature value]". Here "[a/b/ab]" denotes the object described by the following "[feature name].[feature value]": a and b denote the candidate antecedent and the anaphor respectively, and ab denotes the combination of candidate antecedent and anaphor; "[feature name]" and "[feature value]" give the feature's name and value. By the generalization behaviour of the feature value, points divide into the following three types:
Enumerated type: such features have a limited number of possible values, mainly the common syntactic and lexical features, such as the noun phrase types of mentions a and b, semantic consistency, gender consistency, and singular/plural consistency, e.g. "a.MentionType.NAM", "b.GrammaticalRole.Subject", "ab.GenderConsistent.T", and "ab.SentenceDistance.1" in Fig. 2;
Fixed infinite type: such features may return infinitely many results but need no further decomposition in the dynamic generalization process, mainly string features such as the head word, e.g. "a.Head.monkey" and "b.Head.it" in Fig. 2;
Variable infinite type: such features may return infinitely many results and can be decomposed further in the dynamic generalization process, mainly the parse-tree structure covering the micro-context between mentions a and b, e.g. in Fig. 2 "ab.Tree.(Sentences (ROOT (IP (NP (NR monkey))) (VP (VV eats) (NP (NR banana))) (PU .)) (ROOT (PP (P because) (IP (NP (PN it)) (VP (VV starves) (AS))) (PU .))))".
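Since the type of a point is fixed by its feature, the type lookup can be sketched as a registry keyed on the feature name (the registry contents are assumptions drawn from the Fig. 2 examples):

```python
# Map a generalization point string to its type via the feature name.
POINT_TYPE = {
    "MentionType": "enumerated", "GrammaticalRole": "enumerated",
    "GenderConsistent": "enumerated", "SentenceDistance": "enumerated",
    "Head": "fixed_infinite",            # string-valued feature
    "Tree": "variable_infinite",         # parse-tree-valued feature
}

def point_type(point):
    # split at most twice: the feature value may itself contain dots
    _scope, name, _value = point.split(".", 2)
    return POINT_TYPE[name]

print(point_type("a.Head.monkey"))           # fixed_infinite
print(point_type("ab.SentenceDistance.1"))   # enumerated
```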
The detailed procedure of the dynamic generalization algorithm in step H of this embodiment is:
i. Through the generalization point generation process described above, generate the generalization points of the instance to be classified; together they form the generalization point set G;
ii. Take all instances of the training instance library as the instance set S to be screened;
iii. According to the generalization point selection criterion, choose from the set G a generalization point g such that the subset G' formed by all instances in S carrying that point satisfies the selection criterion; this g is called the best generalization point;
iv. Delete g from G and let S = { all instances of the former S carrying generalization point g };
v. If all instances in S belong to the same class (i.e. all positive or all negative), or G is empty, take the positive-example proportion in the final remaining instance subset S as the positive-example confidence of the instance to be classified and terminate the iteration; otherwise, return to step iii.
The generalization point selection criteria available for step iii are the following:
a. Maximize the absolute difference of positive and negative proportions: among the instance subset G' carrying the point, the absolute value of the difference between the positive-example proportion and the negative-example proportion is maximal;
b. Maximize the number of covered instances: the number of instances in the subset G' carrying the point is maximal;
c. Maximize the number of positive examples: the number of positive examples in the subset G' carrying the point is maximal;
d. Minimize the number of covered instances: subject to the subset G' carrying the point being non-empty, the number of instances is minimal;
e. Maximize the positive-example proportion: among the subset G' carrying the point, the positive-example proportion is maximal.
Selection of the generalization point in step iii additionally follows this priority order:
First, when the subset of variable-infinite-type generalization points is non-empty, the best generalization point is preferentially chosen, by the selection criterion, from the variable-infinite-type subset;
Next, when the subset of fixed-infinite-type generalization points is non-empty, the best generalization point is preferentially chosen, by the selection criterion, from the fixed-infinite-type subset;
Last, the best generalization point is chosen, by the selection criterion, from the enumerated-type subset.
When computing the instance subset G' in step iii, generalization point matching adopts a different strategy per type: for an enumerated-type or fixed-infinite-type generalization point g, an instance carries g if and only if one of that instance's own generalization points is identical to g; for a variable-infinite-type point, nodes of the graph structure are deleted one by one to relax the constraint until the pruned substructure appears as a subgraph in at least one training instance, and that substructure is then used to screen the training instances.
The process in step I of the present invention of producing the binary classification decision from each candidate instance's positive-example confidence is: if the positive-example confidence that the dynamic generalization algorithm assigns to a candidate instance exceeds 0.5, the candidate instance is judged a positive example, i.e. the two corresponding noun phrases corefer.
In the instance-based dynamic generalization coreference resolution method of the present invention, for a structural-type generalization point whose feature value is the phrase-structure syntax tree connecting the antecedent and the anaphor, the concrete pruning procedure is:
A. the shortest path connecting the antecedent and the anaphor in the phrase-structure syntax tree is called the "critical path";
B. starting from the bottom layer of the phrase-structure syntax tree, nodes are deleted layer by layer, excluding the nodes on the "critical path", until the pruned substructure occurs as a subgraph in at least one training instance.
(Fig. 3) is a flowchart of an embodiment of the dynamic generalization and retrieval algorithm in the present invention; the generalization points used in this embodiment jointly consider the enumeration, string, and structural types. The embodiment comprises the following steps:
Step 3-1 Initialization:
All generalization points of the instance to be classified constitute the unused generalization-point subset G; the used generalization-point subset G' is the empty set; the complete training instance library serves as the initial instance subset E' to be filtered.
Step 3-2 Select the best generalization point:
According to the selection criterion, a generalization point g* is chosen from G such that, within the instance library to be filtered, the distribution of the training instance subset E* possessing g* satisfies the selection criterion; this g* is called the best generalization point.
There are multiple possible selection criteria for generalization points; in this embodiment the following five are available:
1) maximized absolute difference between positive and negative proportions: within the instance subset possessing this generalization point, the absolute value of the difference between the positive-instance proportion and the negative-instance proportion is maximized;
2) maximized covered-instance count: the number of instances in the subset possessing this generalization point is maximized;
3) maximized positive-instance count: the number of positive instances in the subset possessing this generalization point is maximized;
4) minimized covered-instance count: subject to the subset possessing this generalization point being non-empty, the number of instances is minimized;
5) maximized positive-instance proportion: within the instance subset possessing this generalization point, the proportion of positive instances is maximized.
Each of the above selection criteria has its own emphasis; in practice, different criteria are adopted according to the characteristics and requirements of the corpus at hand.
In addition, different priority strategies may be adopted for different types of generalization points. When the subset of structural-type generalization points is non-empty, the best generalization point is preferentially chosen from the structural-type generalization points according to the selection criterion; next, when the subset of string-type generalization points is non-empty, the best generalization point is preferentially chosen from the string-type generalization points; finally, the best generalization point is chosen from the enumeration-type generalization points according to the selection criterion.
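A minimal sketch (our illustration, not code from the patent) of the five selection criteria as scoring functions over the instance subset covered by a generalization point; instances are represented only by their labels, with +1 for positive and -1 for negative:

```python
def pos_ratio(subset):
    """Proportion of positive instances (label +1) in a non-empty subset."""
    return sum(1 for lbl in subset if lbl > 0) / len(subset)

# Each criterion maps a covered subset to a score; the best
# generalization point is the one whose subset maximizes the score.
CRITERIA = {
    # 1) maximize |positive proportion - negative proportion|
    "abs_diff":  lambda s: abs(pos_ratio(s) - (1 - pos_ratio(s))),
    # 2) maximize the number of covered instances
    "max_cover": lambda s: len(s),
    # 3) maximize the number of covered positive instances
    "max_pos":   lambda s: sum(1 for lbl in s if lbl > 0),
    # 4) minimize the number of covered instances (non-empty subsets only)
    "min_cover": lambda s: -len(s),
    # 5) maximize the positive proportion
    "max_ratio": pos_ratio,
}

subset = [+1, +1, +1, -1]   # labels of instances covered by some point
assert CRITERIA["max_pos"](subset) == 3
assert CRITERIA["abs_diff"](subset) == abs(0.75 - 0.25)
```

Negating the count in `min_cover` lets all five criteria be used uniformly with `max()` when ranking candidate generalization points.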
Step 3-3 Filter the instance set:
Delete g* from G and add g* to G';
Using the best generalization point g*, retain from the current training instance subset to be filtered all instances possessing g*, i.e., set E' = {all instances in the former E' possessing generalization point g*}.
It should be emphasized that filtering the training instance subset involves the matching mode of generalization points. In practice, exact matching is used for enumeration-type and string-type generalization points; for structural-type generalization points, nodes are deleted from the graph structure gradually to relax the constraint until the pruned substructure occurs as a subgraph in at least one training instance, and that substructure is then used to filter the training instances.
Step 3-4 Iteration-termination test:
If the instances in E' all belong to the same class (all positive or all negative), or |G| = |G'|, the positive-instance proportion in E' is output as the positive confidence of the instance to be classified and the iteration terminates; otherwise, the unused generalization-point subset G and the training instance subset E' to be filtered are taken as input and the procedure continues from step 3-2.
In an embodiment of the present invention, the dynamic generalization algorithm determines the optimal classification of the instance to be classified; the algorithm is described as follows:
Input: training instance library E; the generalization-point set G of the test instance to be retrieved
Output: the positive confidence p of the test instance, where p ∈ [0, 1], and the instance subset E'
#01: G' ← Φ, E' ← E, p ← I(E')  // positive confidence I computed from E'
#02: while (|G'| < |G| and E' ≠ Φ)
#03:   (g*, E*) ← Best_Generalize_Point(E', G − G')  // find the best generalization point g* and the filtered instance set E*
#04:   G' ← G' ∪ {g*}
#05:   if (E* = Φ)
#06:     continue
#07:   end if
#08:   E' = E*
#09:   if (E' is entirely positive, or E' is entirely negative, or G − G' = Φ)
#10:     p ← I(E')  // positive confidence I computed from E'
#11:     break
#12:   end if
#13: end while
#14: return p, E'
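The pseudocode above can be rendered as runnable Python. This is a sketch under simplifying assumptions: instances are (label, generalization-point set) pairs, matching is exact for all point types, and `best_point` greedily maximizes the positive/negative proportion gap (criterion 1 from the embodiment); the data layout is our own, not the patent's:

```python
def pos_confidence(instances):
    """I(E'): proportion of positive instances among the survivors."""
    return sum(1 for lbl, _ in instances if lbl > 0) / len(instances)

def best_point(instances, points):
    """Pick the point maximizing |pos share - neg share| over its covered subset."""
    def score(g):
        sub = [(l, ps) for l, ps in instances if g in ps]
        return abs(2 * pos_confidence(sub) - 1) if sub else -1.0
    g_star = max(points, key=score)
    covered = [(l, ps) for l, ps in instances if g_star in ps]
    return g_star, covered

def dynamic_generalize(library, test_points):
    """Dynamic generalization: repeatedly filter the library with the
    best remaining generalization point of the test instance."""
    used, survivors = set(), list(library)
    p = pos_confidence(survivors)                      # #01
    while len(used) < len(test_points) and survivors:  # #02
        g_star, covered = best_point(survivors, test_points - used)
        used.add(g_star)                               # #04
        if not covered:                                # #05-#07: skip useless point
            continue
        survivors = covered                            # #08
        labels = {lbl for lbl, _ in survivors}
        if len(labels) == 1 or used == test_points:    # #09
            p = pos_confidence(survivors)              # #10
            break
    return p, survivors                                # #14

# Toy library: (label, generalization points); test instance has points {"a","b"}
library = [(+1, {"a", "b"}), (+1, {"a"}), (-1, {"b"}), (-1, {"c"})]
p, rest = dynamic_generalize(library, {"a", "b"})
assert p == 1.0   # point "a" filters down to two purely positive instances
```

In the toy run, point "a" separates the two positive instances perfectly, so iteration stops after one round with confidence 1.0.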
The best-generalization-point determination algorithm decides which generalization point is most suitable to use in each round of the generalize-and-filter process; the algorithm is described as follows:
Input: instance library E; the generalization-point set G of the test instance to be retrieved
Output: the best generalization point g*; the set E' of instances in E that pass the filter
#01: g* ← null
#02: (N, C, S) ← Divide(G)  // partition the elements of G into: N enumeration type, C string type, S structural type
#03: if (|S| > 0)
#04:   g* ← the generalization point in S whose filtered instance subset satisfies the "selection criterion"
#05: else if (|C| > 0)
#06:   g* ← the generalization point in C whose filtered instance subset satisfies the "selection criterion"
#07: else if (|N| > 0)
#08:   g* ← the generalization point in N whose filtered instance subset satisfies the "selection criterion"
#09: end if
#10: E' ← all instances in E consistent with g*
#11: return g*, E'
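A hedged Python sketch of the best-point routine with the structural → string → enumeration priority; the point representation (a `(type, name, value)` triple), the type tags, and the pluggable `score` function are illustrative assumptions, not the patent's definitions:

```python
def divide(points):
    """Divide(G): partition into enumeration (N), string (C) and
    structural (S) points; each point is a (gtype, name, value) triple."""
    N = [g for g in points if g[0] == "enum"]
    C = [g for g in points if g[0] == "string"]
    S = [g for g in points if g[0] == "struct"]
    return N, C, S

def best_generalize_point(instances, points, score):
    """Try structural points first, then string, then enumeration;
    within the chosen group pick the point whose covered subset
    maximizes the selection criterion `score`."""
    N, C, S = divide(points)
    for group in (S, C, N):            # priority: struct > string > enum
        if group:
            g_star = max(group, key=lambda g: score(
                [inst for inst in instances if g in inst[1]]))
            covered = [inst for inst in instances if g_star in inst[1]]
            return g_star, covered
    return None, []

# Toy run: only string and enum points are present, so a string point wins.
insts = [(+1, {("string", "head", "man"), ("enum", "num", True)}),
         (-1, {("enum", "num", True)})]
pts = [("string", "head", "man"), ("enum", "num", True)]
score = lambda sub: sum(1 for lbl, _ in sub if lbl > 0)  # positive count
g, cov = best_generalize_point(insts, pts, score)
assert g[0] == "string" and len(cov) == 1
```

For structural-type points a real implementation would replace the exact `g in inst[1]` test with the subgraph matching and layer-wise relaxation described in the embodiment.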
(Fig. 4) is a schematic diagram of an embodiment of successively relaxing the constraint of a structural-type generalization point in the dynamic generalization and retrieval algorithm; the structural-type generalization point in this embodiment is the Simple-Expansion structure. The steps are:
The Simple-Expansion structure specifically refers to "the shortest path on the minimal subtree covering the antecedent and the anaphor, plus the direct child nodes of all nodes on that shortest path". For the sentence "[the man] in the room saw [him]", the minimal subtree covering the candidate coreference instance <"the man", "him"> is shown in Fig. 4(a), where the subtree marked by the dashed line is the Simple-Expansion structure.
Fig. 4(b) shows the Simple-Expansion structure connecting the candidate antecedent (NN-CANDI) and the anaphor (PRP-ANA). In Fig. 4(c), the part enclosed by closed curve ① represents the shortest path containing the candidate antecedent and the anaphor; ①+② represents the syntax-tree structure after descending one layer; by analogy, ①+②+③ represents the result after descending two layers, and ①+②+③ is exactly the complete Simple-Expansion structure. Because the constraint imposed by the complete structure is too strict, the subset after generalization filtering easily becomes empty. Therefore, when using this structural-type generalization point for generalization filtering, substructures ③ and then ② are deleted in turn from the complete structure (①+②+③), progressively relaxing the constraint of dynamic generalization.
Besides deleting layers of nodes in turn, the tree remaining after node deletion is further pruned. The subtree in Fig. 4(d) is the result of deleting ③ from Fig. 4(c), and the two nodes enclosed by the dashed line in Fig. 4(d) have the property: "the label of the parent node is identical to the label of the child node, and the child node has only one successor". In this case the two nodes are collapsed into a single node, as shown in Fig. 4(e).
The concrete use of a structural-type generalization point is as follows:
(1) T = the Simple-Expansion structure; the depth of T is n; i = n;
(2) E = all training instances in the training instance set whose generalization point ab.Tree.(...) contains substructure T;
(3) if E is empty and i > 0, delete the i-th layer of nodes from T (except the nodes on shortest path ①), prune as in Fig. 4(d)-(e), set i = i − 1, and return to step (2); otherwise, return E as the instance subset filtered through this generalization point.
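The layer-by-layer relaxation of steps (1)-(3) can be sketched in Python; the tree representation (a dict of node id → (depth, children)), the critical-path set, and the `match_library` callback are illustrative assumptions, and the node-collapsing step of Fig. 4(d)-(e) is omitted:

```python
def prune_layer(tree, layer, critical):
    """Delete all nodes at depth `layer`, keeping nodes on the critical
    path; `tree` maps a node id to (depth, list of child ids)."""
    keep = {n for n, (d, _) in tree.items() if d != layer or n in critical}
    return {n: (d, [c for c in ch if c in keep])
            for n, (d, ch) in tree.items() if n in keep}

def relax_and_filter(simple_expansion, critical, match_library):
    """Steps (1)-(3): start from the full Simple-Expansion structure and
    drop the deepest layer until some training instance contains it."""
    depth = max(d for d, _ in simple_expansion.values())   # (1): i = n
    t, i = simple_expansion, depth
    while i > 0:
        matched = match_library(t)      # (2): instances containing t
        if matched:
            return matched
        t = prune_layer(t, i, critical) # (3): relax by one layer
        i -= 1
    return match_library(t)

# Toy tree: root(0) -> a(1) -> b(2); the critical path is {root, a}.
tree = {"root": (0, ["a"]), "a": (1, ["b"]), "b": (2, [])}
lib = lambda t: ["inst1"] if "b" not in t else []  # matches only after pruning b
assert relax_and_filter(tree, {"root", "a"}, lib) == ["inst1"]
```

In the toy run the full structure matches nothing, so the deepest layer (node "b") is deleted; the relaxed structure then matches one training instance, mirroring how deleting ③ and ② widens the match.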
The above embodiments are provided only to illustrate the present invention and do not limit it; those skilled in the art may devise various equivalent or modified technical solutions without departing from the spirit or scope of the invention, and all such solutions fall within the protection scope of the invention as defined by the claims.

Claims (10)

1. An instance-based dynamic generalization coreference resolution method, characterized in that the dynamic generalization coreference resolution method consists of a training instance library construction stage and an in-discourse entity resolution stage;
The training instance library construction stage comprises:
A. performing low-level natural language preprocessing on the corpus and extracting candidate noun phrases that may be coreferent with each other;
B. constructing positive/negative training instances from the coreference chains annotated in the corpus and the noun phrases extracted in A;
C. extracting the feature values of each positive/negative instance and generating the "generalization points" belonging to that instance from the feature values;
D. constructing the training instance library with "generalization points" and building an inverted index over the training instance library;
The in-discourse entity resolution stage comprises:
E. receiving the plain text to be processed, performing the various low-level natural language preprocessing steps, and extracting candidate noun phrases that may be coreferent with each other;
F. constructing candidate instances that may bear a coreference relation from the noun phrases extracted in E, and extracting the feature values of each candidate instance;
G. generating the "generalization points" belonging to each candidate instance from its extracted feature values;
H. according to the dynamic generalization algorithm, repeatedly filtering the instances in the training instance library using the candidate instance's "generalization points", the positive-instance proportion among the remaining training instances serving as the positive confidence of this test instance;
I. producing binary classification results from each candidate instance's positive confidence and synthesizing the final coreference chains, whereupon coreference resolution is complete.
2. The instance-based dynamic generalization coreference resolution method according to claim 1, further characterized in that the natural language preprocessing in steps A and E comprises the steps of: sentence segmentation; word segmentation; part-of-speech tagging; noun phrase identification; named entity recognition; and syntactic analysis.
3. The instance-based dynamic generalization coreference resolution method according to claim 1, further characterized in that constructing training instances in step B comprises:
a. on a coreference chain, two adjacent coreferent noun phrases i, j form a positive pair <i, j>;
b. on a coreference chain, any other noun phrase k lying between two adjacent coreferent noun phrases i and j (i < k < j) forms, together with noun phrase j, a negative pair <k, j>.
4. The instance-based dynamic generalization coreference resolution method according to claim 1, characterized in that the concrete process of constructing the training instance library and building the inverted index in step D is:
a. generating all "generalization points" for each positive/negative training instance;
b. each row of the training instance library storing all information of one training instance, including its class label "+" or "-", with each of the instance's "generalization points" consisting of three parts: generalization-point type, feature name, and feature value;
c. on the constructed training instance library, taking each generalization point as a keyword and the position list of all training instances possessing that generalization point as the index entry, thereby building the inverted index of the training instance library.
5. The instance-based dynamic generalization coreference resolution method according to claim 1, 2, 3 or 4, characterized in that the feature-value types of generalization points are divided into: enumeration type, deterministic infinite type (string type), and variably infinite type (structural type); and that the concrete process of the dynamic generalization algorithm in step H is:
i. using the generalization-point generation process described above, generalization points are generated for the instance to be classified, all of which form the generalization-point set G;
ii. all instances of the training instance library serve as the instance set S to be filtered;
iii. according to the selection criterion, a generalization point g is chosen from the generalization-point set G such that the subset G' formed by all instances in S possessing g satisfies the selection criterion; this g is called the best generalization point;
iv. g is deleted from G, and S is set to S = {all instances in the former S possessing generalization point g};
v. if all instances in S belong to the same class (all positive or all negative), or G is empty, the positive-instance proportion in the finally remaining instance subset S is taken as the positive confidence of this instance to be classified and the iteration terminates; otherwise, return to step iii.
6. The instance-based dynamic generalization coreference resolution method according to claim 5, further characterized in that the following generalization-point selection criteria are designed for step iii to choose from:
a. maximized absolute difference between positive and negative proportions: within the instance subset G' possessing this generalization point, the absolute value of the difference between the positive-instance proportion and the negative-instance proportion is maximized;
b. maximized covered-instance count: the number of instances in the subset G' possessing this generalization point is maximized;
c. maximized positive-instance count: the number of positive instances in the subset G' possessing this generalization point is maximized;
d. minimized covered-instance count: subject to the instance subset G' possessing this generalization point being non-empty, the number of instances is minimized;
e. maximized positive-instance proportion: within the instance subset G' possessing this generalization point, the proportion of positive instances is maximized.
7. The instance-based dynamic generalization coreference resolution method according to claim 6, characterized in that the selection of generalization points in step iii further follows this priority ordering:
First, when the subset of structural-type generalization points is non-empty, the best generalization point is preferentially chosen from the structural-type generalization points according to the selection criterion;
Next, when the subset of string-type generalization points is non-empty, the best generalization point is preferentially chosen from the string-type generalization points according to the selection criterion;
Finally, the best generalization point is chosen from the enumeration-type generalization points according to the selection criterion.
8. The instance-based dynamic generalization coreference resolution method according to claim 6 or 7, characterized in that, when computing the instance subset G' in step iii, generalization-point matching uses different strategies for different types:
a. for an enumeration-type or string-type generalization point g, an instance possesses g if and only if one of that instance's generalization points is identical to g;
b. for a structural-type generalization point, nodes are deleted from the graph structure one by one to relax the constraint until the pruned substructure occurs as a subgraph in at least one training instance, and that substructure is then used to filter the training instances.
9. The instance-based dynamic generalization coreference resolution method according to claim 8, characterized in that the process in step I of producing binary classification results from each candidate instance's positive confidence is: if the positive confidence that the dynamic generalization algorithm assigns to a candidate instance exceeds 0.5, the candidate instance is judged to be a positive instance, i.e., the corresponding two noun phrases are coreferent.
10. The instance-based dynamic generalization coreference resolution method according to claim 9, characterized in that, for a structural-type generalization point whose feature value is the phrase-structure syntax tree connecting the antecedent and the anaphor, the concrete pruning procedure is:
a. the shortest path connecting the antecedent and the anaphor in the phrase-structure syntax tree is called the "critical path";
b. starting from the bottom layer of the phrase-structure syntax tree, nodes are deleted layer by layer, excluding the nodes on the "critical path", until the pruned substructure occurs as a subgraph in at least one training instance.
CN2010102397366A 2010-07-29 2010-07-29 Instance-based dynamic generalization coreference resolution method Pending CN101901213A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102397366A CN101901213A (en) 2010-07-29 2010-07-29 Instance-based dynamic generalization coreference resolution method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010102397366A CN101901213A (en) 2010-07-29 2010-07-29 Instance-based dynamic generalization coreference resolution method

Publications (1)

Publication Number Publication Date
CN101901213A true CN101901213A (en) 2010-12-01

Family

ID=43226756

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010102397366A Pending CN101901213A (en) 2010-07-29 2010-07-29 Instance-based dynamic generalization coreference resolution method

Country Status (1)

Country Link
CN (1) CN101901213A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102081659A (en) * 2011-01-14 2011-06-01 南开大学 Pretreatment method for compressing inverted index
CN103150405A (en) * 2013-03-29 2013-06-12 苏州大学 Classification model modeling method, Chinese cross-textual reference resolution method and system
CN103838559A (en) * 2012-11-23 2014-06-04 富士通株式会社 Method and device for combining tools
CN104142914A (en) * 2013-05-10 2014-11-12 富士通株式会社 Device and method for function module combination with feedback control, data processing method and data processing equipment
CN105260457A (en) * 2015-10-14 2016-01-20 南京大学 Coreference resolution-oriented multi-semantic web entity contrast table automatic generation method
CN106445911A (en) * 2016-03-18 2017-02-22 苏州大学 Anaphora resolution method and system based on microscopic topic structure
CN106776550A (en) * 2016-12-06 2017-05-31 桂林电子科技大学 A kind of analysis method of english composition textual coherence quality
CN108280064A (en) * 2018-02-28 2018-07-13 北京理工大学 Participle, part-of-speech tagging, Entity recognition and the combination treatment method of syntactic analysis
CN110362682A (en) * 2019-06-21 2019-10-22 厦门美域中央信息科技有限公司 A kind of entity coreference resolution method based on statistical machine learning algorithm
CN112001190A (en) * 2020-07-20 2020-11-27 北京百度网讯科技有限公司 Training method, device and equipment of natural language processing model and storage medium

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102081659A (en) * 2011-01-14 2011-06-01 南开大学 Pretreatment method for compressing inverted index
CN103838559A (en) * 2012-11-23 2014-06-04 富士通株式会社 Method and device for combining tools
CN103150405A (en) * 2013-03-29 2013-06-12 苏州大学 Classification model modeling method, Chinese cross-textual reference resolution method and system
CN103150405B (en) * 2013-03-29 2014-12-10 苏州大学 Classification model modeling method, Chinese cross-textual reference resolution method and system
CN104142914A (en) * 2013-05-10 2014-11-12 富士通株式会社 Device and method for function module combination with feedback control, data processing method and data processing equipment
CN105260457B (en) * 2015-10-14 2018-07-13 南京大学 A kind of multi-semantic meaning network entity contrast table automatic generation method towards coreference resolution
CN105260457A (en) * 2015-10-14 2016-01-20 南京大学 Coreference resolution-oriented multi-semantic web entity contrast table automatic generation method
CN106445911B (en) * 2016-03-18 2022-02-22 苏州大学 Reference resolution method and system based on micro topic structure
CN106445911A (en) * 2016-03-18 2017-02-22 苏州大学 Anaphora resolution method and system based on microscopic topic structure
CN106776550B (en) * 2016-12-06 2019-12-13 桂林电子科技大学 method for analyzing consistency quality of English literary texts
CN106776550A (en) * 2016-12-06 2017-05-31 桂林电子科技大学 A kind of analysis method of english composition textual coherence quality
CN108280064B (en) * 2018-02-28 2020-09-11 北京理工大学 Combined processing method for word segmentation, part of speech tagging, entity recognition and syntactic analysis
CN108280064A (en) * 2018-02-28 2018-07-13 北京理工大学 Participle, part-of-speech tagging, Entity recognition and the combination treatment method of syntactic analysis
CN110362682A (en) * 2019-06-21 2019-10-22 厦门美域中央信息科技有限公司 A kind of entity coreference resolution method based on statistical machine learning algorithm
CN112001190A (en) * 2020-07-20 2020-11-27 北京百度网讯科技有限公司 Training method, device and equipment of natural language processing model and storage medium
US20220019736A1 (en) * 2020-07-20 2022-01-20 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for training natural language processing model, device and storage medium
KR20220011082A (en) * 2020-07-20 2022-01-27 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Natural language processing model training method, device, electric equipment and storage medium
KR102549972B1 (en) * 2020-07-20 2023-06-29 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Natural language processing model training method, device, electric equipment and storage medium
CN112001190B (en) * 2020-07-20 2024-09-20 北京百度网讯科技有限公司 Training method, device, equipment and storage medium for natural language processing model

Similar Documents

Publication Publication Date Title
CN101901213A (en) Instance-based dynamic generalization coreference resolution method
CN108763333B (en) Social media-based event map construction method
CN109189942B (en) Construction method and device of patent data knowledge graph
CN106503192B (en) Name entity recognition method and device based on artificial intelligence
CN111143479B (en) Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm
CN101630314B (en) Semantic query expansion method based on domain knowledge
CN111680173A (en) CMR model for uniformly retrieving cross-media information
CN111190900B (en) JSON data visualization optimization method in cloud computing mode
CN110321432A (en) Textual event information extracting method, electronic device and non-volatile memory medium
CN104679867B (en) Address method of knowledge processing and device based on figure
JP2005526317A (en) Method and system for automatically searching a concept hierarchy from a document corpus
CN114065758B (en) Document keyword extraction method based on hypergraph random walk
CN111177591A (en) Knowledge graph-based Web data optimization method facing visualization demand
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
CN108681574A (en) A kind of non-true class quiz answers selection method and system based on text snippet
CN112818661B (en) Patent technology keyword unsupervised extraction method
CN116501875B (en) Document processing method and system based on natural language and knowledge graph
CN112926337B (en) End-to-end aspect level emotion analysis method combined with reconstructed syntax information
CN116244445B (en) Aviation text data labeling method and labeling system thereof
CN112036178A (en) Distribution network entity related semantic search method
CN104391837A (en) Intelligent grammatical analysis method based on case semantics
CN114997288A (en) Design resource association method
CN114996467A (en) Knowledge graph entity attribute alignment algorithm based on semantic similarity
CN115935995A (en) Knowledge graph generation-oriented non-genetic-fabric-domain entity relationship extraction method
CN103020311B (en) A kind of processing method of user search word and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20101201