CN106445917B

CN106445917B - A pattern-based bootstrapping method for Chinese entity extraction

Info

Publication number: CN106445917B
Application number: CN201610848425.7A
Authority: CN
Inventors: 姜晓夏; 葛唯益; 杨岩; 贺成龙; 宗士强; 徐琳; 王羽
Original assignee: CETC 28 Research Institute
Current assignee: CETC 28 Research Institute
Priority date: 2016-09-23
Filing date: 2016-09-23
Publication date: 2019-02-19
Anticipated expiration: 2036-09-23
Also published as: CN106445917A

Abstract

The invention discloses a pattern-based bootstrapping method for Chinese entity extraction that starts from a small number of seed entities, entity-internal patterns, and entity-external patterns, and iteratively learns more entities and patterns from a corpus. The invention combines statistical and pattern-based methods. Its advantage is that it does not rely on a large manually annotated corpus or on a domain pattern library. Compared with existing pattern-bootstrapping methods, the invention builds on observations of entity-type patterns in specific domains: it uses entity-internal patterns and features to score candidate patterns and entities that cannot be labeled with certainty, thereby improving the accuracy of pattern and entity scoring. It is suitable for domain-specific entity extraction and knowledge base construction.

Description

A pattern-based bootstrapping method for Chinese entity extraction

Technical field

The present invention relates to Chinese natural language processing, and in particular to a pattern-based bootstrapping method for Chinese entity extraction.

Background technique

Named entity recognition (also called entity extraction) is a fundamental task in natural language processing and is widely used in applications such as information extraction, question answering, and machine translation. It was first proposed at the sixth MUC conference, held in 1996. Initially, the goal was to identify named entities such as person names, place names, and organization names in a corpus; as application fields expanded, the definition and extension of entity categories brought great challenges. The main technical approaches to named entity recognition are pattern-based methods, statistical methods, and combinations of the two. Statistical methods are widely studied in academia and are commonly used for domain-independent entity extraction. Pattern-based methods are the mainstream in industrial applications, but they usually require a large number of manually constructed rules and port poorly between domains. Bootstrapping entity extraction starts from a small number of manually labeled entities and iteratively learns more entities and rules from unlabeled text; it needs only a small amount of manual involvement and transfers well between domains. The core of bootstrapping entity extraction is the scoring of patterns and entities. In a specific domain, entities of one type usually satisfy certain constraints and follow certain internal patterns. However, existing bootstrapping methods for Chinese entity extraction cannot use entity-internal patterns for scoring, and the features they extract when scoring entities that cannot be labeled do not fully take the characteristics of Chinese word segmentation into account.

Summary of the invention

Purpose of the invention: the object of the present invention is to provide a pattern-based bootstrapping method for Chinese entity extraction that overcomes the shortcomings of the prior art in the use of entity-internal patterns and in the selection of entity features.

Technical solution: the pattern-based bootstrapping method for Chinese entity extraction of the present invention performs entity recognition and rule base construction for each entity type, and comprises the following steps:

S1: The user provides the following inputs: a. positive seed entities and negative seed entities; b. the internal constraints, internal patterns, and confidence levels of the positive and negative seed entities; c. the external constraints of the positive and negative seed entities, i.e., the contextual information in which each of them occurs; d. the original unlabeled text. Of these four kinds of input, a and d must not be empty, while b and c may be empty;

S2: Perform domain-independent word segmentation, part-of-speech tagging, syntactic parsing, and entity recognition on the original text to generate the base corpus; add the positive seed entities to the final entity library;

S3: Label the base corpus with the positive entities in the final entity library, extract the contextual information of each labeled positive entity to form candidate external patterns, and add them to the candidate external pattern library;

S4: Score the candidate external pattern library: label the original text again with the candidate external patterns; using the final entity library, count the positive entities, negative entities, and entities of undeterminable type extracted by each candidate external pattern; score each candidate external pattern in the library; sort by score from high to low; and add the top K candidate external patterns to the final external pattern library;

S5: Perform entity extraction on the original text with the newly generated final external pattern library to produce the candidate entity library; score each candidate entity in the library; sort by score from high to low; and add the top K candidate entities to the final entity library;

S6: Extract internal patterns from the K candidate entities produced in S5 to form the candidate internal pattern library;

S7: Score each candidate internal pattern in the candidate internal pattern library; sort by score from high to low; and add the top K candidate internal patterns to the final internal pattern library;

S8: If the number of iterations has reached the upper limit, or no new entity is found, the iteration ends; otherwise return to step S3;

S9: Output the resulting final entity library, final external pattern library, and final internal pattern library.
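The control flow of steps S2 to S9 can be sketched as a loop. The following is a minimal Python sketch, not the patented implementation: the function name `bootstrap` and the four callables passed to it are hypothetical placeholders, the scoring functions are supplied by the caller, and the internal-pattern steps S6 and S7 are left out for brevity.

```python
from typing import Callable, Iterable, Set, Tuple


def bootstrap(
    seeds: Set[str],
    propose_patterns: Callable[[Set[str]], Iterable[str]],
    score_pattern: Callable[[str, Set[str]], float],
    extract: Callable[[Set[str]], Iterable[str]],
    score_entity: Callable[[str, Set[str]], float],
    k: int = 2,
    max_iters: int = 10,
) -> Tuple[Set[str], Set[str]]:
    """Iteratively grow an entity library and an external pattern library."""
    entities = set(seeds)          # S2: seeds enter the final entity library
    patterns: Set[str] = set()
    for _ in range(max_iters):     # S8: iteration cap
        # S3-S4: propose candidate external patterns, keep the top K by score
        candidates = set(propose_patterns(entities)) - patterns
        ranked = sorted(candidates, key=lambda p: score_pattern(p, entities), reverse=True)
        patterns.update(ranked[:k])
        # S5: extract candidate entities with the accepted patterns, keep the top K
        found = set(extract(patterns)) - entities
        if not found:              # S8: no new entity was found, stop
            break
        best = sorted(found, key=lambda e: score_entity(e, entities), reverse=True)[:k]
        entities.update(best)
    return entities, patterns      # S9


# Toy usage: each pattern is mapped directly to the strings it would extract.
toy = {
    "$term fighter": {"J-20", "F-22"},
    "$term by GPE": {"J-20", "F-22", "Su-27"},
}
entities_out, patterns_out = bootstrap(
    seeds={"J-20"},
    propose_patterns=lambda ents: [p for p, hits in toy.items() if hits & ents],
    score_pattern=lambda p, ents: len(toy[p] & ents) / len(toy[p]),
    extract=lambda pats: set().union(*(toy[p] for p in pats)),
    score_entity=lambda e, ents: 1.0,
    k=1,
)
```

In the toy run the loop accepts one pattern and one entity per round and stops in the third round, when the accepted patterns yield nothing new.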

Further, in step S1, the internal constraints of the positive and negative seed entities include: their respective length ranges, whether they contain only Chinese characters, whether special symbols are allowed, whether letters and digits are allowed, and known entity head words.

Further, in step S1, the internal patterns of the positive and negative seed entities are the patterns that the positive and negative seed entities respectively follow, generalized using the basic entity types.

Further, in step S3, the candidate external pattern library is formed as follows: the part of speech and entity type of the positive seed entity itself, and the entity types of the elements within a certain window, are collected to form candidate external patterns; for each element in the window, if it has an entity type, that entity type is used as the element's feature tag; otherwise the word itself is used as the feature tag.
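The feature-tag rule for window elements can be illustrated with a short Python sketch. It assumes tokens arrive as (word, entity type) pairs with "O" meaning no entity type, and it looks only at the tokens to the right of the entity; the function name, the "O" convention, and the string form of the tags are all assumptions made for illustration.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, entity type); "O" means the token has no entity type


def external_pattern(tokens: List[Token], end: int, window: int = 2) -> List[str]:
    """Describe the `window` tokens that follow an entity ending before index `end`."""
    features = ["$term"]  # placeholder for the entity slot itself
    for word, ner in tokens[end:end + window]:
        # Prefer the entity type as the feature tag, fall back to the word itself.
        features.append(f"ner:{ner}" if ner != "O" else f"word:{word}")
    return features


sentence = [("J-20", "O"), ("by", "O"), ("China", "GPE"), ("developed", "O")]
pattern = external_pattern(sentence, end=1, window=3)
```

Using the entity type instead of the literal word is what lets a pattern learned next to "China" also match text next to any other GPE.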

Further, in step S4, each candidate external pattern is scored according to the following steps:

S4.1: Perform extraction on the base corpus with the candidate external pattern: if the candidate external pattern cannot obtain any additional entities, delete it from the candidate external pattern library; it no longer takes part in scoring and the process ends; otherwise, continue with step S4.2;

S4.2: If an entity extracted by the candidate external pattern is present in the positive entity library, the entity is judged to be a positive entity and its score is 1; if an entity extracted by the candidate external pattern is present in the negative entity library, the entity is judged to be a negative entity and its score is 0; if the entity type of an extracted entity cannot be determined, carry out step S4.3;

S4.3: For an entity e whose entity type cannot be determined, compute its score score(e) as follows:

S4.31: Compute the internal pattern matching degree innerPat(e);

Apply the existing internal patterns to entity e. If entity e matches an internal pattern, the confidence of that pattern is used as the score innerPat(e): if the pattern confidence is 1, the final score of entity e is 1, no other features are computed, and the process jumps directly to step S4.4. If entity e matches several internal patterns, their confidences are summed, with the sum not exceeding 1. If entity e matches no internal pattern, innerPat(e)=0;

S4.32: Compute the semantic distance sem(e);

Compute the distance between entity e and the positive entities in the existing entity library, and the distance between entity e and the negative entities in the existing entity library: if the distance between entity e and the positive entities in the existing entity library is the larger of the two and is above a threshold, sem(e)=1; otherwise, sem(e)=0. If the semantic distance cannot be computed, extract the head word of entity e and compute the word2vec distance between the head word of entity e and the existing head-word set: if it is above the threshold, sem(e)=1; otherwise, sem(e)=0;

S4.33: Compute the edit distance editDist(e): compute the edit distance between entity e and the positive entities, and the edit distance between entity e and the negative entities: if the distance between entity e and some positive entity is below a threshold, and the edit distances to all negative entities are above the threshold, editDist(e)=1; otherwise, editDist(e)=0;

S4.34: Compute the word-formation probability phraseProb(e): set one threshold for the internal cohesion of entity e and one for its neighboring-character information entropy; if entity e satisfies both thresholds, phraseProb(e)=1; otherwise, phraseProb(e)=0. The internal cohesion is calculated by formula (1):

In formula (1), TS(t) is the set of all possible token segmentations of entity e, each element of TS(t) is denoted S(t), P(t) is the probability that the t-th token in S(t) occurs in the document, NumTokens is the number of all tokens in the base corpus, and freq(e) is the number of times entity e occurs in the base corpus;

S4.35: Compute the domain specificity measure tfidf(e);

First, compute the raw domain specificity measure TFIDF_e by the following formula:

In formula (2), tf_e is the frequency with which entity e occurs in the base corpus, N is the number of documents in the massive domain-independent news corpus, and df_e is the number of documents containing entity e;

Then, normalize the raw domain specificity measure TFIDF_e to the range 0 to 1 to obtain the domain specificity measure tfidf(e);

S4.36: Take the average of the internal pattern matching degree innerPat(e), the semantic distance sem(e), the edit distance editDist(e), the word-formation probability phraseProb(e), and the domain specificity measure tfidf(e) as the score score(e) of entity e;

S4.4: Compute the score of the candidate external pattern according to formula (3):

In formula (3), P_r is the set of positive seed entities extracted by the candidate external pattern, N_r is the set of negative seed entities extracted by the candidate external pattern, |·| denotes the number of elements in a set, U_r is the set of entities whose entity type cannot be determined, and score(e) is the score of an entity e whose entity type cannot be determined.
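Steps S4.31 and S4.36 are simple enough to sketch in Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, the four remaining features are taken as already-computed numbers, and formulas (1) to (3) are not implemented because they are not reproduced in this text.

```python
from statistics import mean
from typing import Iterable


def inner_pat(confidences: Iterable[float]) -> float:
    """S4.31: sum the confidences of all matched internal patterns, capped at 1."""
    return min(1.0, sum(confidences))


def entity_score(matched_confidences: Iterable[float], sem: float, edit: float,
                 phrase: float, tfidf: float) -> float:
    """S4.36: plain average of the five features for an entity of unknown type."""
    confs = list(matched_confidences)
    if any(c == 1.0 for c in confs):
        # S4.31: a confidence-1 internal pattern decides the score outright.
        return 1.0
    return mean([inner_pat(confs), sem, edit, phrase, tfidf])


# Feature values quoted for F-22 in the embodiment: 0.8, 1, 1, 0, 0.8.
f22_score = entity_score([0.8], sem=1, edit=1, phrase=0, tfidf=0.8)
```

With these five inputs the plain mean is 0.72, whereas the embodiment reports 0.74, so the aggregation used there may differ in some detail from this sketch.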

Further, in step S5, each candidate entity is scored by the following rules:

e. If the candidate entity does not satisfy the internal constraints, delete it from the candidate entity library;

f. If the candidate entity is a common word or a stop word, delete it from the candidate entity library;

g. If the candidate entity matches an internal pattern whose confidence is 1, add it to the final entity library;

h. If none of the three cases above applies, first compute five feature values for the candidate entity: the internal pattern matching degree innerPat(e), the semantic distance sem(e), the edit distance editDist(e), the word-formation probability phraseProb(e), and the domain specificity measure tfidf(e). Then sum the scores of all patterns that extracted the candidate entity and normalize the sum to the range 0 to 1; the normalized value is the sixth feature value. Finally, take the weighted average of the six feature values to obtain the final score of the candidate entity.
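The four rules above form a short decision cascade, sketched here in Python. The function name, the use of `None` for a deleted candidate, the use of 1.0 for a candidate accepted outright, and the default of equal weights are assumptions made for illustration; the text does not specify the weights or how the pattern-score sum is normalized, so the sixth feature is passed in already normalized.

```python
from typing import Optional, Sequence, Set


def score_candidate(
    entity: str,
    meets_constraints: bool,
    stop_words: Set[str],
    matches_certain_pattern: bool,
    five_features: Sequence[float],
    pattern_score_feature: float,
    weights: Sequence[float] = (1, 1, 1, 1, 1, 1),
) -> Optional[float]:
    """Return None if the candidate is deleted, otherwise its final score."""
    if not meets_constraints:        # first rule: internal constraints violated
        return None
    if entity in stop_words:         # second rule: common word or stop word
        return None
    if matches_certain_pattern:      # third rule: confidence-1 internal pattern
        return 1.0                   # assumption: treated as the maximum score
    # Fourth rule: weighted average of the five features plus the pattern feature.
    values = list(five_features) + [pattern_score_feature]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


example = score_candidate("F-22", True, set(), False, [0.8, 1, 1, 0, 0.8], 1.0)
```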

Further, in step S6, internal patterns are extracted from the entities in the final entity library by the following rule: if an entity contains continuous letter strings, digits, Chinese numerals, dates, place names, person names, and head words, a generalized internal pattern is extracted.
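As a rough illustration of generalization, the Python sketch below replaces letter runs and digit runs inside an entity string with placeholders. It covers only two of the element kinds listed above, and the placeholder names `<ALPHA>` and `<NUM>` are hypothetical; dates, place names, person names, and head words would need the tags produced in preprocessing.

```python
import re


def generalize(entity: str) -> str:
    """Replace ASCII letter runs and digit runs with placeholders, keep the rest."""
    parts = re.findall(r"[A-Za-z]+|\d+|[^A-Za-z\d]", entity)
    out = []
    for part in parts:
        if part.isascii() and part.isalpha():
            out.append("<ALPHA>")
        elif part.isdigit():
            out.append("<NUM>")
        else:
            out.append(part)  # hyphens, Chinese characters, etc. stay literal
    return "".join(out)


f22_pattern = generalize("F-22")
```

Two entities that generalize to the same string, such as "F-22" and "Su-27", would share one internal pattern under this scheme.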

Further, in step S7, the formula for scoring a candidate internal pattern NP is:

In formula (4), PN_r is the set of final positive entities that match candidate internal pattern NP, NN_r is the set of negative entities that match candidate internal pattern NP, |·| denotes the number of elements in a set, and score(e) is the score obtained by scoring candidate internal pattern NP.

Beneficial effects: compared with the prior art, the present invention has the following beneficial effects:

1) No large manually annotated corpus or hand-written rule set is required. Only a small number of seed entities and rules need to be provided, after which the discovery of further entities and the construction of the rule base are completed automatically; the system ports well between domains;

2) The internal patterns and constraints of entities take part in pattern scoring, and entity features are extracted along multiple dimensions, which can significantly improve entity recognition.

Brief description of the drawings

Fig. 1 is a flow chart of the method of a specific embodiment of the invention.

Specific embodiment

The technical solution of the invention is further described below with reference to the drawing and a specific embodiment.

The invention discloses a pattern-based bootstrapping method for Chinese entity extraction, which performs entity recognition and rule base construction for each entity type, comprising the following steps:

The invention combines statistical and pattern-based methods. Its advantage is that it does not rely on a large manually annotated corpus or on a domain pattern library. Compared with existing pattern-bootstrapping methods, the invention builds on observations of entity-type patterns in specific domains: it uses entity-internal patterns and features to score candidate patterns and entities that cannot be labeled with certainty, thereby improving the accuracy of pattern and entity scoring. It is suitable for domain-specific entity extraction and knowledge base construction.

The flow of this embodiment is shown in Fig. 1:

In step S1, for entities of the "aircraft" class, the user gives the seed entity: J-20.

The entity constraints given by the user are shown in Table 1:

Table 1. Entity constraints given by the user

Constraint item	Constraint value
Length	{2,10}
NumAllowed	true
Alphabetallowed	true
SpecialSymbolAllowed	true
Headwords	aircraft, fighter plane, plane, patrol plane, tanker aircraft
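A check against the constraints of Table 1 might look like the following Python sketch. The class and function names are hypothetical, the length is assumed to be counted in characters, and the head-word list is left out because the text does not say how it is enforced.

```python
import re
from dataclasses import dataclass


@dataclass
class Constraints:
    # Defaults mirror Table 1.
    min_len: int = 2
    max_len: int = 10
    num_allowed: bool = True
    alphabet_allowed: bool = True
    special_symbol_allowed: bool = True


def satisfies(entity: str, c: Constraints) -> bool:
    """Return True if the candidate entity passes every enabled constraint."""
    if not (c.min_len <= len(entity) <= c.max_len):
        return False
    if not c.num_allowed and re.search(r"\d", entity):
        return False
    if not c.alphabet_allowed and re.search(r"[A-Za-z]", entity):
        return False
    # Anything that is neither a word character nor whitespace counts as special.
    if not c.special_symbol_allowed and re.search(r"[^\w\s]", entity):
        return False
    return True


ok = satisfies("F-22", Constraints())
```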

The internal patterns given by the user are shown in Table 2:

Table 2. Internal patterns given by the user

The external patterns given by the user are shown in Table 3:

Table 3. External patterns given by the user

In step S2, the original text is preprocessed with open-source tools (word segmentation, part-of-speech tagging, entity recognition) as follows: word segmentation and part-of-speech tagging use the Ansj tool; entity recognition uses the Chinese classifier bundled with Stanford NER to identify GPE, PERSON, ORGANIZATION, and LOCATION, and the Stanford TokensRegex tool is used to write recognition rules for Chinese dates (DATE), times (TIME), and quantities (NUMBER). In the end, entity recognition provides labels of seven types: GPE, PERSON, LOCATION, ORGANIZATION, DATE, TIME, NUMBER.

In step S3, the preprocessed original corpus is first labeled with the existing positive entities, and the external patterns within the context window are extracted. For example, given "The J-20 fighter plane has the code name Mighty Dragon, and the F-22 fighter plane has the code name Raptor. The J-20 was developed by China, the F-22 was developed by the United States, and the iPhone was developed by the United States", the seed entity "J-20" is matched in the text, and with a window of 2 to 3 the following external patterns can be extracted:

1. (?$term []{1,3}) [{word:/fighter plane/}] [{word:/code name/}]

2. (?$term []{1,3}) [{word:/fighter plane/}] [{word:/code name/}] [{word:/Mighty Dragon/}]

3. (?$term []{1,3}) [{word:/by/}] [{ner:/GPE/}]

4. (?$term []{1,3}) [{word:/by/}] [{ner:/GPE/}] [{word:/research and development/}]

In step S4, each candidate pattern is scored. Take pattern 1 as an example: applied to the original corpus, it extracts F-22, which is then evaluated. First, check whether F-22 satisfies the internal constraints. A word2vec model is trained in advance on a massive unlabeled military corpus; F-22 is fed to word2vec and its distance to J-20 is computed; if the distance is above a certain threshold (e.g., 0.6), the two are considered semantically similar and sem(e)=1. F-22 is matched against the internal patterns: it is found to match internal pattern 3, whose confidence is 0.8, so innerPat(e)=0.8. For the edit distance, after the digits are generalized the edit distance between the two is computed as 33%, so editDist(e)=1. For the word-formation probability, assume that the internal cohesion and the neighboring-character information entropy do not satisfy the threshold requirements, so phraseProb(e)=0 (the calculation is rather complex and is not shown in detail here). The domain specificity measure is computed from n-grams calculated over the massive domain-independent news corpus; assume the normalized result tfidf(e) is 0.8. The final score of the entity is then 0.74.
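One reading that reproduces the 33% figure is to treat each digit run as a single generalized token before computing a Levenshtein distance and dividing by the longer token length. The Python sketch below follows that reading; the tokenization, the `<NUM>` placeholder, and the normalization are assumptions, not details confirmed by the text.

```python
import re
from typing import List, Sequence


def generalized_tokens(entity: str) -> List[str]:
    """Split into characters, collapsing every digit run into one <NUM> token."""
    return ["<NUM>" if tok.isdigit() else tok for tok in re.findall(r"\d+|.", entity)]


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Classic dynamic-programming edit distance over two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution or match
        prev = cur
    return prev[-1]


def normalized_edit_distance(e1: str, e2: str) -> float:
    a, b = generalized_tokens(e1), generalized_tokens(e2)
    return levenshtein(a, b) / max(len(a), len(b), 1)


ratio = normalized_edit_distance("J-20", "F-22")
```

Here "J-20" and "F-22" both become three tokens that differ only in the first one, giving one edit out of three.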

According to the following formula, the final score of the pattern is 3.84.

Following the steps above, a score is computed for each candidate external pattern; pattern 2 is discarded because it cannot identify any additional entities. When scores are equal, the more complex rule is selected. After sorting, the top 2 patterns are added to the final rule base; assume that patterns 1 and 4 are finally selected.
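The ranking with its tie-break can be sketched in Python as a sort on a two-part key. The scores given to patterns 3 and 4 below are invented for illustration (only the 3.84 of pattern 1 comes from the text), and measuring "more complex" by the number of context elements is an assumption.

```python
from typing import Dict, List


def top_k(patterns: List[Dict], k: int = 2) -> List[Dict]:
    """Sort by score, then by number of context elements, both descending."""
    return sorted(patterns, key=lambda p: (p["score"], p["elements"]), reverse=True)[:k]


candidates = [
    {"name": "pattern 1", "score": 3.84, "elements": 2},
    {"name": "pattern 3", "score": 2.0, "elements": 2},   # hypothetical score
    {"name": "pattern 4", "score": 2.0, "elements": 3},   # hypothetical score
]
selected = [p["name"] for p in top_k(candidates, k=2)]
```

With these inputs pattern 4 wins the tie against pattern 3 because it has one more context element, which matches the selection assumed in the text.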

In step S5, extraction is performed with external pattern 1 and external pattern 4, forming the candidate entity library {F-22, iPhone}. Both entities are scored; "F-22" scores better than "iPhone", so the top 1 is added to the final entity library, which now contains {J-20, F-22}.

In step S6, patterns are extracted from the newly added entities; however, because F-22 already matches one of the existing internal patterns, no new internal pattern can be generated. Step S7 is therefore skipped and step S8 is carried out directly.

In step S8, the process returns to step S3: the original corpus is labeled again with {J-20, F-22} as seed entities, an external pattern library is generated, and steps S4 to S7 are executed again.

In step S9, since no new patterns are generated, the iteration ends, and the final entity library, the final external pattern library, and the final internal pattern library are output.

Final entity library: {J-20, F-22}

Final external schema library:

(?$term []{1,3}) [{word:/fighter plane/}] [{word:/code name/}]

(?$term []{1,3}) [{word:/by/}] [{ner:/GPE/}] [{word:/research and development/}]

(?$term []{2,3} [word:$PLANETYPE]) [{word:/|in/}] [{ner:DATE}] [{word:/Landing|take off/}]

Final internal schema library:

(([{word:/J|Su|Ilyushin|beauty|Boeing|MiG|Mi|Airbus/}]) ([{word:"-"}]{0,1}) ([{ner:NUMBER}]))

(([({word:/\d+/}&{ner:NUMBER})|{word:/[a-zA-Z]+/}]+)(([{word:"-"}]) ([({word:/\d+/}&{ner:NUMBER})|{word:/[a-zA-Z]+/}]+))+[word:$PLANETYPE]*)

Claims

1. A pattern-based bootstrapping method for Chinese entity extraction, characterized in that entity recognition and rule base construction are performed for each entity type, comprising the following steps:

S1: The user provides the following inputs: a. positive seed entities and negative seed entities; b. the internal constraints, internal patterns, and confidence levels of the positive and negative seed entities; c. the external constraints of the positive and negative seed entities, i.e., the contextual information in which each of them occurs; d. the original unlabeled text. Of these four kinds of input, a and d must not be empty, while b and c may be empty;

S2: Perform domain-independent word segmentation, part-of-speech tagging, syntactic parsing, and entity recognition on the original text to generate the base corpus; add the positive seed entities to the final entity library;

S3: Label the base corpus with the positive entities in the final entity library, extract the contextual information of each labeled positive entity to form candidate external patterns, and add them to the candidate external pattern library;

S4: Score the candidate external pattern library: label the original text again with the candidate external patterns; using the final entity library, count the positive entities, negative entities, and entities of undeterminable type extracted by each candidate external pattern; score each candidate external pattern in the library; sort by score from high to low; and add the top K candidate external patterns to the final external pattern library;

S5: Perform entity extraction on the original text with the newly generated final external pattern library to produce the candidate entity library; score each candidate entity in the library; sort by score from high to low; and add the top K candidate entities to the final entity library;

S7: Score each candidate internal pattern in the candidate internal pattern library; sort by score from high to low; and add the top K candidate internal patterns to the final internal pattern library;

S8: If the number of iterations has reached the upper limit, or no new entity is found, the iteration ends; otherwise return to step S3;

2. The pattern-based bootstrapping method for Chinese entity extraction according to claim 1, characterized in that: in step S1, the internal constraints of the positive and negative seed entities include: their respective length ranges, whether they contain only Chinese characters, whether special symbols are allowed, whether letters and digits are allowed, and known entity head words.

3. The pattern-based bootstrapping method for Chinese entity extraction according to claim 1, characterized in that: in step S1, the internal patterns of the positive and negative seed entities are the patterns that the positive and negative seed entities respectively follow, generalized using the basic entity types.

4. The pattern-based bootstrapping method for Chinese entity extraction according to claim 1, characterized in that: in step S3, the candidate external pattern library is formed as follows: the part of speech and entity type of the positive seed entity itself, and the entity types of the elements within a certain window, are collected to form candidate external patterns; for each element in the window, if it has an entity type, that entity type is used as the element's feature tag; otherwise the word itself is used as the feature tag.

5. The pattern-based bootstrapping method for Chinese entity extraction according to claim 1, characterized in that: in step S4, each candidate external pattern is scored according to the following steps:

S4.1: Perform extraction on the base corpus with the candidate external pattern: if the candidate external pattern cannot obtain any additional entities, delete it from the candidate external pattern library; it no longer takes part in scoring and the process ends; otherwise, continue with step S4.2;

S4.2: If an entity extracted by the candidate external pattern is present in the positive entity library, the entity is judged to be a positive entity and its score is 1; if an entity extracted by the candidate external pattern is present in the negative entity library, the entity is judged to be a negative entity and its score is 0; if the entity type of an extracted entity cannot be determined, carry out step S4.3;

S4.3: For an entity e whose entity type cannot be determined, compute its score score(e) as follows:

S4.31: Compute the internal pattern matching degree innerPat(e);

Apply the existing internal patterns to entity e. If entity e matches an internal pattern, the confidence of that pattern is used as the score innerPat(e): if the pattern confidence is 1, the final score of entity e is 1, no other features are computed, and the process jumps directly to step S4.4. If entity e matches several internal patterns, their confidences are summed, with the sum not exceeding 1. If entity e matches no internal pattern, innerPat(e)=0;

S4.32: Compute the semantic distance sem(e);

Compute the distance between entity e and the positive entities in the existing entity library, and the distance between entity e and the negative entities in the existing entity library: if the distance between entity e and the positive entities in the existing entity library is above a threshold, sem(e)=1; otherwise, sem(e)=0. If the semantic distance cannot be computed, extract the head word of entity e and compute the word2vec distance between the head word of entity e and the existing head-word set: if it is above the threshold, sem(e)=1; otherwise, sem(e)=0;

S4.33: Compute the edit distance editDist(e): compute the edit distance between entity e and the positive entities, and the edit distance between entity e and the negative entities: if the distance between entity e and some positive entity is below a threshold, and the edit distances to all negative entities are above the threshold, editDist(e)=1; otherwise, editDist(e)=0;

S4.34: Compute the word-formation probability phraseProb(e): set one threshold for the internal cohesion of entity e and one for its neighboring-character information entropy; if entity e satisfies both thresholds, phraseProb(e)=1; otherwise, phraseProb(e)=0. The internal cohesion is calculated by formula (1):

In formula (1), TS(t) is the set of all possible token segmentations of entity e, each element of TS(t) is denoted S(t), P(t) is the probability that the t-th token in S(t) occurs in the document, NumTokens is the number of all tokens in the base corpus, and freq(e) is the number of times entity e occurs in the base corpus;

S4.35: Compute the domain specificity measure tfidf(e);

In formula (2), tf_e is the frequency with which entity e occurs in the base corpus, N is the number of documents in the massive domain-independent news corpus, and df_e is the number of documents containing entity e;

Then, normalize the raw domain specificity measure TFIDF_e to the range 0 to 1 to obtain the domain specificity measure tfidf(e);

S4.36: Take the average of the internal pattern matching degree innerPat(e), the semantic distance sem(e), the edit distance editDist(e), the word-formation probability phraseProb(e), and the domain specificity measure tfidf(e) as the score score(e) of entity e;

In formula (3), P_r is the set of positive seed entities extracted by the candidate external pattern, N_r is the set of negative seed entities extracted by the candidate external pattern, |·| denotes the number of elements in a set, U_r is the set of entities whose entity type cannot be determined, and score(e) is the score of an entity e whose entity type cannot be determined.

6. The pattern-based bootstrapping method for Chinese entity extraction according to claim 1, characterized in that: in step S5, each candidate entity is scored by the following rules:

a. If the candidate entity does not satisfy the internal constraints, delete it from the candidate entity library;

b. If the candidate entity is a common word or a stop word, delete it from the candidate entity library;

c. If the candidate entity matches an internal pattern whose confidence is 1, add it to the final entity library;

d. If none of the three cases above applies, first compute five feature values for the candidate entity: the internal pattern matching degree innerPat(e), the semantic distance sem(e), the edit distance editDist(e), the word-formation probability phraseProb(e), and the domain specificity measure tfidf(e). Then sum the scores of all patterns that extracted the candidate entity and normalize the sum to the range 0 to 1; the normalized value is the sixth feature value. Finally, take the weighted average of the six feature values to obtain the final score of the candidate entity.

7. The pattern-based bootstrapping method for Chinese entity extraction according to claim 1, characterized in that: in step S6, internal patterns are extracted from the entities in the final entity library by the following rule: if an entity contains continuous letter strings, digits, Chinese numerals, dates, place names, person names, and head words, a generalized internal pattern is extracted.

8. The pattern-based bootstrapping method for Chinese entity extraction according to claim 1, characterized in that: in step S7, the formula for scoring a candidate internal pattern NP is:

In formula (4), PN_r is the set of final positive entities that match candidate internal pattern NP, NN_r is the set of negative entities that match candidate internal pattern NP, |·| denotes the number of elements in a set, and score(e) is the score of an entity e whose entity type cannot be determined.