CN106445917B - Pattern-based bootstrapping method for Chinese entity extraction - Google Patents

Pattern-based bootstrapping method for Chinese entity extraction


Publication number
CN106445917B
Authority
CN
China
Prior art keywords
entity
library
pattern
seed entity
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610848425.7A
Other languages
Chinese (zh)
Other versions
CN106445917A (en)
Inventor
姜晓夏
葛唯益
杨岩
贺成龙
宗士强
徐琳
王羽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 28 Research Institute
Priority to CN201610848425.7A
Publication of CN106445917A
Application granted
Publication of CN106445917B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a pattern-based bootstrapping method for Chinese entity extraction. Starting from a small number of seed entities, entity-internal patterns, and entity-external patterns, the method iteratively learns more entities and patterns from a corpus. It combines statistics with patterns, and its advantage is that it does not depend on large manually annotated corpora or domain pattern libraries. Compared with existing pattern bootstrapping methods, the invention builds on observations of entity-type patterns in specific domains: it uses entity-internal patterns and features to score candidate patterns and entities that cannot be labeled with certainty, which improves the accuracy of pattern and entity scoring. The method is suited to entity extraction and knowledge base construction in specific domains.

Description

Pattern-based bootstrapping method for Chinese entity extraction
Technical field
The present invention relates to Chinese natural language processing, and in particular to a pattern-based bootstrapping method for Chinese entity extraction.
Background
Named entity recognition (also called entity extraction) is a basic task in natural language processing, widely used in applications such as information extraction, question answering, and machine translation. It was first proposed at the sixth MUC conference, held in 1996. Initially the goal was to recognize named entities such as person names, place names, and organization names in a corpus; as application domains have expanded, defining and extending entity categories has become a major challenge. The main technical approaches to named entity recognition are pattern-based methods, statistical methods, and combinations of the two. Statistical methods are widely studied in academia and are generally used for domain-independent entity extraction. Pattern-based methods are the mainstream in industrial applications, but they usually require a large number of manually constructed rules and port poorly between domains. Bootstrapping entity extraction starts from a small number of manually labeled entities and iteratively learns more entities and rules from unlabeled text; it needs only a little manual involvement and transfers well between domains. The core of bootstrapping entity extraction is the scoring of patterns and entities. In a specific domain, entities of one type usually satisfy certain constraints and follow certain internal patterns. However, existing bootstrapping methods for Chinese entity extraction cannot use entity-internal patterns for scoring, and the features they extract when scoring entities that cannot be labeled do not fully account for the characteristics of Chinese word segmentation.
Summary of the invention
Object of the invention: the object of the present invention is to provide a pattern-based bootstrapping method for Chinese entity extraction that overcomes the shortcomings of the prior art in using entity-internal patterns and in selecting entity features.
Technical solution: the pattern-based bootstrapping method for Chinese entity extraction of the present invention carries out entity recognition and rule base construction for each entity type, and comprises the following steps:
S1: the user provides the following inputs: a. positive seed entities and negative seed entities; b. the internal constraints, internal patterns, and confidence levels of the positive and negative seed entities; c. the external constraints of the positive and negative seed entities, i.e., the context in which each of them appears; d. the original unlabeled text. Of these four inputs, a and d must not be empty, while b and c may be empty;
S2: perform domain-independent word segmentation, part-of-speech tagging, syntactic parsing, and entity recognition on the original text to generate the base corpus; add the positive seed entities to the final entity library;
S3: annotate the base corpus with the positive entities in the final entity library, extract the context of each annotated positive entity to form candidate external patterns, and add them to the candidate external pattern library;
S4: score the candidate external pattern library: re-annotate the original text with each candidate external pattern; using the final entity library, count the positive entities, negative entities, and entities of undetermined type that each candidate external pattern extracts; score each candidate external pattern, sort by score in descending order, and add the top K candidate external patterns to the final external pattern library;
S5: extract entities from the original text with the newly generated final external pattern library to produce the candidate entity library; score each candidate entity, sort by score in descending order, and add the top K candidate entities to the final entity library;
S6: extract internal patterns from the K candidate entities selected in S5 to form the candidate internal pattern library;
S7: score each candidate internal pattern in the candidate internal pattern library, sort by score in descending order, and add the top K candidate internal patterns to the final internal pattern library;
S8: if the number of iterations has reached the upper limit, or no new entity is found, the iteration ends; otherwise, return to step S3;
S9: output the resulting final entity library, final external pattern library, and final internal pattern library.
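The iteration in steps S2 to S8 can be sketched roughly as follows. This is only an illustrative skeleton, not the patented implementation: the callbacks `propose_patterns`, `score_pattern`, `apply_patterns`, and `score_entity` are hypothetical stand-ins for steps S3, S4, and S5, and the internal-pattern steps S6 and S7 are left out for brevity.

```python
from typing import Callable, Iterable, List, Set, Tuple


def bootstrap(
    seeds: Set[str],
    propose_patterns: Callable[[Set[str]], Iterable[str]],
    score_pattern: Callable[[str, Set[str]], float],
    apply_patterns: Callable[[List[str]], Iterable[str]],
    score_entity: Callable[[str, Set[str]], float],
    k: int = 2,
    max_iter: int = 10,
) -> Tuple[Set[str], List[str]]:
    """Toy bootstrapping loop; all callbacks are assumptions of this sketch."""
    entities = set(seeds)          # S2: seeds enter the final entity library
    patterns: List[str] = []       # final external pattern library
    for _ in range(max_iter):      # S8: upper limit on iterations
        # S3/S4: propose candidate patterns and keep the top K by score
        cands = [p for p in propose_patterns(entities) if p not in patterns]
        cands.sort(key=lambda p: score_pattern(p, entities), reverse=True)
        patterns.extend(cands[:k])
        # S5: extract candidate entities and keep the top K by score
        new = sorted(set(apply_patterns(patterns)) - entities)
        new.sort(key=lambda e: score_entity(e, entities), reverse=True)
        if not new:                # S8: no new entity found, stop
            break
        entities.update(new[:k])
    return entities, patterns
```

Passing the scoring functions in as parameters keeps the loop itself independent of how patterns and entities are scored.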
Further, in step S1, the internal constraints of the positive and negative seed entities include: the length range of the entities, whether they contain only Chinese characters, whether special symbols are allowed, whether letters and digits are allowed, and the known entity head words.
Further, in step S1, the internal patterns of the positive and negative seed entities are the patterns that those entities follow, generalized using basic entity types.
Further, in step S3, the candidate external pattern library is formed as follows: the part of speech and entity type of the positive seed entity itself, together with the entity types of the elements within a given window, are collected to form a candidate external pattern. For each element in the window, if it has an entity type, the entity type is used as the element's feature label; otherwise the word itself is used as the feature label.
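The labeling rule for window elements can be illustrated with a short sketch. It assumes tokens are given as `(word, entity type or None)` pairs and that only the context to the right of the entity is used; the `word:` / `ner:` label format and the `$term` placeholder are conventions of this sketch, chosen to resemble the patterns shown later in the description.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, Optional[str]]  # (word, entity type or None)


def context_pattern(tokens: List[Token], end: int, window: int) -> List[str]:
    """Build a candidate external pattern from the `window` tokens that
    follow an entity whose span ends just before index `end`."""
    labels = ["$term"]  # placeholder for the entity itself
    for word, ner in tokens[end:end + window]:
        # Prefer the entity type as the feature label; fall back to the word.
        labels.append(f"ner:{ner}" if ner else f"word:{word}")
    return labels


tokens = [("J-20", None), ("by", None), ("China", "GPE"), ("developed", None)]
print(context_pattern(tokens, end=1, window=3))
```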
Further, in step S4, each candidate external pattern is scored by the following steps:
S4.1: apply the candidate external pattern to the base corpus: if the pattern cannot obtain any further entities, delete it from the candidate external pattern library; it no longer takes part in scoring, and the process ends for that pattern. Otherwise, continue with step S4.2;
S4.2: if an entity extracted by the candidate external pattern is in the positive entity library, it is judged to be a positive entity and is scored 1; if it is in the negative entity library, it is judged to be a negative entity and is scored 0; if its entity type cannot be determined, proceed to step S4.3;
S4.3: for an entity e whose type cannot be determined, compute its score score(e) as follows:
S4.31: compute the internal pattern match innerPat(e);
Apply the existing internal patterns to entity e. If e matches an internal pattern, the confidence probability of that pattern is used as innerPat(e): if the confidence probability is 1, the final score of e is 1, no other features are computed, and the process jumps directly to step S4.4. If e matches several internal patterns, their confidence probabilities are summed, with the sum capped at 1. If e matches no internal pattern, innerPat(e) = 0;
S4.32: compute the semantic distance sem(e);
Compute the distance between entity e and the positive entities in the existing entity library, and the distance between e and the negative entities in the existing entity library (the distance here behaves as a similarity: a higher value means the two are closer in meaning). If the distance between e and the positive entities is the larger of the two and is above a threshold, sem(e) = 1; otherwise sem(e) = 0. If the semantic distance cannot be computed, extract the head word of e and compute the word2vec distance between that head word and the existing head word set: if it is above the threshold, sem(e) = 1; otherwise sem(e) = 0;
S4.33: compute the edit distance feature editDist(e): compute the edit distance between e and the positive entities and between e and the negative entities. If the distance between e and some positive entity is below the threshold, and the edit distance between e and every negative entity is above the threshold, editDist(e) = 1; otherwise editDist(e) = 0;
S4.34: compute the word-formation probability phraseProb(e): set separate thresholds for the internal cohesion of entity e and for its neighboring-word information entropy. If e meets both the internal cohesion threshold and the neighboring-word information entropy threshold, phraseProb(e) = 1; otherwise phraseProb(e) = 0. The internal cohesion is computed by formula (1):
In formula (1), TS(t) is the set of all possible token segmentations of entity e, each element of TS(t) is denoted S(t), P(t) is the probability that the t-th token in S(t) occurs in the text, NumTokens is the number of all tokens in the base corpus, and Freq(e) is the number of times entity e occurs in the base corpus;
S4.35: compute the domain specificity measure tfidf(e);
First, compute the raw domain specificity measure TFIDF_e by the following formula:
In formula (2), tf_e is the frequency with which entity e occurs in the base corpus, N is the number of documents in a large domain-independent news corpus, and df_e is the number of documents containing entity e;
Then normalize the raw domain specificity measure TFIDF_e to the range 0 to 1 to obtain the domain specificity measure tfidf(e);
S4.36: take the average of the internal pattern match innerPat(e), the semantic distance sem(e), the edit distance feature editDist(e), the word-formation probability phraseProb(e), and the domain specificity measure tfidf(e) as the score score(e) of entity e;
S4.4: compute the score of the candidate external pattern by formula (3):
In formula (3), P_r is the set of positive seed entities extracted by the candidate external pattern, N_r is the set of negative seed entities extracted by the candidate external pattern, |·| is the number of elements in a set, U_r is the set of entities whose type cannot be determined, and score(e) is the score of an entity e whose type cannot be determined.
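Some of the per-entity features in steps S4.31, S4.33, and S4.36 can be sketched as below. This is a simplified illustration under stated assumptions: internal patterns are modeled as plain regular expressions mapped to confidences, the edit distance is a raw Levenshtein count rather than a percentage, and the shortcut for a confidence of exactly 1 (jumping straight to S4.4) is not modeled. The function names are hypothetical.

```python
import re
from typing import Dict, Iterable, Sequence


def inner_pat(entity: str, patterns: Dict[str, float]) -> float:
    """S4.31: sum the confidences of all matching patterns, capped at 1."""
    total = sum(conf for rx, conf in patterns.items() if re.fullmatch(rx, entity))
    return min(total, 1.0)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_dist_feature(entity: str, positives: Iterable[str],
                      negatives: Iterable[str], threshold: int) -> int:
    """S4.33: 1 if close to some positive and far from every negative."""
    near_pos = any(levenshtein(entity, p) < threshold for p in positives)
    far_neg = all(levenshtein(entity, n) > threshold for n in negatives)
    return int(near_pos and far_neg)


def entity_score(features: Sequence[float]) -> float:
    """S4.36: unweighted mean of the feature values."""
    return sum(features) / len(features)
```

Capping the summed confidence keeps innerPat(e) on the same 0 to 1 scale as the other four features, so the mean in S4.36 is not dominated by one feature.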
Further, in step S5, each candidate entity is scored by the following rules:
e. if the candidate entity does not satisfy the internal constraints, it is deleted from the candidate entity library;
f. if the candidate entity is a common word or a stop word, it is deleted from the candidate entity library;
g. if the candidate entity matches an internal pattern whose confidence is 1, it is added to the final entity library;
h. if none of the three cases above applies, first compute five feature values for the candidate entity: the internal pattern match innerPat(e), the semantic distance sem(e), the edit distance feature editDist(e), the word-formation probability phraseProb(e), and the domain specificity measure tfidf(e); then sum the scores of all patterns that extract the candidate entity and normalize the sum to the range 0 to 1 as the sixth feature value; finally, take the weighted average of these six feature values as the final score of the candidate entity.
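Rules e to h can be read as a small decision procedure. The sketch below is one possible reading and not the patent's implementation: the boolean inputs are assumed to be computed elsewhere, returning `None` stands for "deleted from the candidate library", and returning 1.0 for rule g is an assumption standing in for "added directly to the final entity library". The weights are hypothetical, since the text does not give them.

```python
from typing import Optional, Sequence


def score_candidate(
    passes_constraints: bool,
    is_common_or_stop_word: bool,
    matches_confidence_1_pattern: bool,
    features: Sequence[float],   # six feature values, each in [0, 1]
    weights: Sequence[float],    # hypothetical weights, one per feature
) -> Optional[float]:
    # Rules e and f: discard the candidate.
    if not passes_constraints or is_common_or_stop_word:
        return None
    # Rule g: accepted outright (modeled here as the maximum score).
    if matches_confidence_1_pattern:
        return 1.0
    # Rule h: weighted average of the six feature values.
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(f * w for f, w in zip(features, weights)) / sum(weights)
```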
Further, in step S6, internal patterns are extracted from the entities in the final entity library by the following rule: if an entity contains a continuous letter string, digits, Chinese numerals, a date, a place name, a person name, or a head word, a generalized internal pattern is extracted.
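A very reduced illustration of this kind of generalization is given below. It only handles ASCII letter runs and digit runs, replacing them with the placeholder labels `ALPHA` and `NUMBER` (names chosen for this sketch); Chinese numerals, dates, place names, person names, and head words would need the entity recognizer and are not covered.

```python
import re


def generalize(entity: str) -> str:
    """Replace ASCII letter runs and digit runs with type placeholders."""
    parts = re.findall(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9]", entity)
    out = []
    for part in parts:
        if part.isascii() and part.isalpha():
            out.append("ALPHA")
        elif part.isascii() and part.isdigit():
            out.append("NUMBER")
        else:
            out.append(part)  # keep other characters literally
    return " ".join(out)


print(generalize("F-22"))
```

Two entities that generalize to the same string, such as "F-22" and "J-20", would then share one internal pattern.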
Further, in step S7, a candidate internal pattern NP is scored by the following formula:
In formula (4), PN_r is the set of final positive entities that match the candidate internal pattern NP, NN_r is the set of negative entities that match the candidate internal pattern NP, |·| denotes the number of elements in a set, and score(e) is the score obtained by scoring the candidate internal pattern NP.
Beneficial effects: compared with the prior art, the present invention has the following advantages:
1) No large manually annotated corpus or hand-written rule set is needed. Given only a small number of seed entities and rules, the method automatically builds up more entities and the rule base, and the system ports well between domains;
2) The internal patterns and constraints of entities take part in pattern scoring, and entity features are extracted along multiple dimensions, which can noticeably improve entity recognition.
Brief description of the drawings
Fig. 1 is a flow diagram of the method according to an embodiment of the invention.
Detailed description of the embodiments
The technical solution of the present invention is described further below with reference to the drawing and a specific embodiment.
The invention discloses a pattern-based bootstrapping method for Chinese entity extraction, which carries out entity recognition and rule base construction for each entity type and comprises the following steps:
S1: the user provides the following inputs: a. positive seed entities and negative seed entities; b. the internal constraints, internal patterns, and confidence levels of the positive and negative seed entities; c. the external constraints of the positive and negative seed entities, i.e., the context in which each of them appears; d. the original unlabeled text. Of these four inputs, a and d must not be empty, while b and c may be empty;
S2: perform domain-independent word segmentation, part-of-speech tagging, syntactic parsing, and entity recognition on the original text to generate the base corpus; add the positive seed entities to the final entity library;
S3: annotate the base corpus with the positive entities in the final entity library, extract the context of each annotated positive entity to form candidate external patterns, and add them to the candidate external pattern library;
S4: score the candidate external pattern library: re-annotate the original text with each candidate external pattern; using the final entity library, count the positive entities, negative entities, and entities of undetermined type that each candidate external pattern extracts; score each candidate external pattern, sort by score in descending order, and add the top K candidate external patterns to the final external pattern library;
S5: extract entities from the original text with the newly generated final external pattern library to produce the candidate entity library; score each candidate entity, sort by score in descending order, and add the top K candidate entities to the final entity library;
S6: extract internal patterns from the K candidate entities selected in S5 to form the candidate internal pattern library;
S7: score each candidate internal pattern in the candidate internal pattern library, sort by score in descending order, and add the top K candidate internal patterns to the final internal pattern library;
S8: if the number of iterations has reached the upper limit, or no new entity is found, the iteration ends; otherwise, return to step S3;
S9: output the resulting final entity library, final external pattern library, and final internal pattern library.
The present invention combines statistics with patterns. Its advantage is that it does not depend on large manually annotated corpora or domain pattern libraries. Compared with existing pattern bootstrapping methods, the invention builds on observations of entity-type patterns in specific domains: it uses entity-internal patterns and features to score candidate patterns and entities that cannot be labeled with certainty, which improves the accuracy of pattern and entity scoring. It is suited to entity extraction and knowledge base construction in specific domains.
The flow of this embodiment is shown in Fig. 1:
In step S1, for entities of the "aircraft" class, the user provides the seed entity: J-20.
The entity constraints given by the user are shown in Table 1:
Table 1. Entity constraints given by the user
Constraint              Value
Length                  {2,10}
NumAllowed              true
Alphabetallowed         true
SpecialSymbolAllowed    true
Headwords               aircraft, fighter plane, plane, patrol plane, tanker aircraft
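A check against constraints like those in Table 1 might look as follows. This is a sketch under assumptions: length is counted in characters (the text does not say whether characters or tokens are meant), "special symbol" is taken to mean any character that is neither a letter, a digit, nor an underscore, and the Headwords row is not used because the text does not state how it is enforced. The field names are adapted from the table and are otherwise hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Constraints:
    length: Tuple[int, int] = (2, 10)      # Length {2,10}
    num_allowed: bool = True               # NumAllowed
    alphabet_allowed: bool = True          # Alphabetallowed
    special_symbol_allowed: bool = True    # SpecialSymbolAllowed


def satisfies(entity: str, c: Constraints) -> bool:
    lo, hi = c.length
    if not lo <= len(entity) <= hi:
        return False
    if not c.num_allowed and re.search(r"[0-9]", entity):
        return False
    if not c.alphabet_allowed and re.search(r"[A-Za-z]", entity):
        return False
    # \w covers letters (including Chinese characters), digits, underscore.
    if not c.special_symbol_allowed and re.search(r"[^\w]", entity):
        return False
    return True
```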
The internal patterns given by the user are shown in Table 2:
Table 2. Internal patterns given by the user
The external patterns given by the user are shown in Table 3:
Table 3. External patterns given by the user
In step S2, the original text is preprocessed (word segmentation, part-of-speech tagging, entity recognition) with open-source tools, as follows: word segmentation and part-of-speech tagging use the Ansj tool; entity recognition uses the Chinese classifier bundled with Stanford NER to identify GPE, PERSON, ORGANIZATION, and LOCATION, and recognition rules for Chinese dates (DATE), times (TIME), and quantities (NUMBER) are written with the Stanford TokensRegex tool. In the end, entity recognition provides labels of seven types: GPE, PERSON, LOCATION, ORGANIZATION, DATE, TIME, and NUMBER.
In step S3, the preprocessed original corpus is first annotated with the existing positive entities, and external patterns are extracted from the context window. Take the following text as an example (glossed word by word in the original Chinese word order): "J-20 fighter plane code name Mighty Dragon, F-22 fighter plane code name Raptor. J-20 by China developed, F-22 by the United States developed, iPhone by the United States developed". The seed entity "J-20" is matched in the text, and with a window of 2 to 3 the following external patterns can be extracted:
1. (?$term []{1,3}) [{word:/fighter plane/}] [{word:/code name/}]
2. (?$term []{1,3}) [{word:/fighter plane/}] [{word:/code name/}] [{word:/Mighty Dragon/}]
3. (?$term []{1,3}) [{word:/by/}] [{ner:/GPE/}]
4. (?$term []{1,3}) [{word:/by/}] [{ner:/GPE/}] [{word:/developed/}]
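How such context patterns pick out new entities can be shown with a toy matcher. It is not TokensRegex and does not reproduce its semantics: tokens are assumed to be `(word, entity type or None)` pairs, a pattern is reduced to a list of `word:` / `ner:` labels for the right-hand context, and the entity is assumed to be the single token directly before that context rather than a span of 1 to 3 tokens.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, Optional[str]]  # (word, entity type or None)


def label_matches(label: str, token: Token) -> bool:
    kind, value = label.split(":", 1)
    word, ner = token
    return value == (word if kind == "word" else ner)


def extract(tokens: List[Token], context: List[str]) -> List[str]:
    """Return the token preceding every position where `context` matches."""
    found = []
    for i in range(1, len(tokens) - len(context) + 1):
        window = tokens[i:i + len(context)]
        if all(label_matches(l, t) for l, t in zip(context, window)):
            found.append(tokens[i - 1][0])
    return found


sentence = [("J-20", None), ("fighter plane", None), ("code name", None),
            ("Mighty Dragon", None), (",", None), ("F-22", None),
            ("fighter plane", None), ("code name", None), ("Raptor", None)]
print(extract(sentence, ["word:fighter plane", "word:code name"]))
```

Applied to the glossed example sentence, the context of pattern 1 matches after both "J-20" and "F-22", which is how a pattern learned from the seed surfaces a new candidate entity.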
In step S4, each candidate pattern is scored. Take pattern 1 as an example: applied to the original corpus, it extracts F-22, and F-22 is then evaluated. First, check whether F-22 satisfies the internal constraints. A word2vec model is trained in advance on a large unlabeled military corpus. F-22 is fed to word2vec and its distance to J-20 is computed; if the distance is above a threshold (e.g., 0.6), the two are considered semantically similar and sem(e) = 1. F-22 is matched against the internal patterns; it matches internal pattern 3, whose confidence is 0.8, so innerPat(e) = 0.8. For the edit distance, after the digits are generalized, the edit distance between the two is 33%, so editDist(e) = 1. For the word-formation probability, suppose the internal cohesion and the neighboring-word information entropy do not meet the thresholds, which gives phraseProb(e) = 0 (the calculation is fairly involved and is not shown here). The domain specificity measure is computed from n-grams over a large domain-independent news corpus; suppose the normalized tfidf(e) is 0.8. The final score of the entity is then 0.74.
According to the following formula, the final score of the pattern is 3.84.
Following the steps above, a score is computed for each candidate external pattern; pattern 2 is discarded because it cannot identify any further entities. When scores are equal, the more complex rule is selected. After sorting, the top 2 patterns are added to the final rule base; suppose patterns 1 and 4 are finally selected.
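The entity score in the F-22 example can be checked with a few lines of arithmetic. The sketch below takes the unweighted mean described in step S4.36 over the five feature values assumed in the example. With exactly these values the mean comes out at 0.72, while the example reports 0.74, so the example may rely on values, weights, or rounding that are not shown in the text.

```python
# Feature values as assumed in the F-22 example above.
features = {
    "innerPat": 0.8,
    "sem": 1.0,
    "editDist": 1.0,
    "phraseProb": 0.0,
    "tfidf": 0.8,
}

# S4.36: unweighted mean of the five features.
score = sum(features.values()) / len(features)
print(round(score, 2))  # 0.72
```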
In step S5, extraction with external patterns 1 and 4 yields the candidate entity library {F-22, iPhone}. The two entities are scored; "F-22" scores higher than "iPhone", and the top 1 is added to the final entity library, which now contains {J-20, F-22}.
In step S6, patterns are extracted from the newly added entities; however, since F-22 already matches one of the internal patterns, no new internal pattern can be generated. Step S7 is therefore skipped, and step S8 follows directly.
In step S8, the process returns to step S3: the original corpus is annotated again with {J-20, F-22} as seed entities, the external pattern library is generated, and steps S4 to S7 are executed again.
In step S9, since no new patterns are generated, the iteration ends, and the final entity library, final external pattern library, and final internal pattern library are output.
Final entity library: {J-20, F-22}
Final external pattern library:
(?$term []{1,3}) [{word:/fighter plane/}] [{word:/code name/}]
(?$term []{1,3}) [{word:/by/}] [{ner:/GPE/}] [{word:/developed/}]
(?$term []{2,3} [word:$PLANETYPE]) [{word:/|in/}] [{ner:DATE}] [{word:/landing|take off/}]
Final internal pattern library:
$PLANETYPE = "/warplane|aircraft|helicopter|trainer aircraft|patrol plane|tanker aircraft|aerial survey plane|patrol plane|trainer|bomber|reconnaissance plane|research aircraft|fighter plane|jet plane/"
([{word:/J|Su|Ilyushin|beauty|Boeing|MiG|Mi|Airbus/}]) ([{word:"-"}]{0,1}) ([{ner:NUMBER}])
(([({word:/\d+/}&{ner:NUMBER})|{word:/[a-zA-Z]+/}]+)(([{word:"-"}]) ([({word:/\d+/}&{ner:NUMBER})|{word:/[a-zA-Z]+/}]+))+[word:$PLANETYPE]*)

Claims (8)

1. A pattern-based bootstrapping method for Chinese entity extraction, characterized in that entity recognition and rule base construction are carried out for each entity type, comprising the following steps:
S1: the user provides the following inputs: a. positive seed entities and negative seed entities; b. the internal constraints, internal patterns, and confidence levels of the positive and negative seed entities; c. the external constraints of the positive and negative seed entities, i.e., the context in which each of them appears; d. the original unlabeled text; of these four inputs, a and d must not be empty, while b and c may be empty;
S2: perform domain-independent word segmentation, part-of-speech tagging, syntactic parsing, and entity recognition on the original text to generate the base corpus; add the positive seed entities to the final entity library;
S3: annotate the base corpus with the positive entities in the final entity library, extract the context of each annotated positive entity to form candidate external patterns, and add them to the candidate external pattern library;
S4: score the candidate external pattern library: re-annotate the original text with each candidate external pattern; using the final entity library, count the positive entities, negative entities, and entities of undetermined type that each candidate external pattern extracts; score each candidate external pattern, sort by score in descending order, and add the top K candidate external patterns to the final external pattern library;
S5: extract entities from the original text with the newly generated final external pattern library to produce the candidate entity library; score each candidate entity, sort by score in descending order, and add the top K candidate entities to the final entity library;
S6: extract internal patterns from the K candidate entities selected in S5 to form the candidate internal pattern library;
S7: score each candidate internal pattern in the candidate internal pattern library, sort by score in descending order, and add the top K candidate internal patterns to the final internal pattern library;
S8: if the number of iterations has reached the upper limit, or no new entity is found, the iteration ends; otherwise, return to step S3;
S9: output the resulting final entity library, final external pattern library, and final internal pattern library.
2. The pattern-based bootstrapping method for Chinese entity extraction according to claim 1, characterized in that: in step S1, the internal constraints of the positive and negative seed entities include: the length range of the entities, whether they contain only Chinese characters, whether special symbols are allowed, whether letters and digits are allowed, and the known entity head words.
3. The pattern-based bootstrapping method for Chinese entity extraction according to claim 1, characterized in that: in step S1, the internal patterns of the positive and negative seed entities are the patterns that those entities follow, generalized using basic entity types.
4. The pattern-based bootstrapping method for Chinese entity extraction according to claim 1, characterized in that: in step S3, the candidate external pattern library is formed as follows: the part of speech and entity type of the positive seed entity itself, together with the entity types of the elements within a given window, are collected to form a candidate external pattern; for each element in the window, if it has an entity type, the entity type is used as the element's feature label; otherwise the word itself is used as the feature label.
5. The pattern-based bootstrapping method for Chinese entity extraction according to claim 1, characterized in that: in step S4, each candidate external pattern is scored by the following steps:
S4.1: Extract from the base corpus with the candidate external pattern: if the candidate external pattern cannot obtain further entities, delete it from the candidate external pattern library; it no longer takes part in scoring, and the procedure ends. Otherwise, continue to step S4.2;
S4.2: If an entity extracted by the candidate external pattern is present in the positive entity library, the entity is judged to be a positive entity and scores 1; if it is present in the negative entity library, it is judged to be a negative entity and scores 0; if the entity type of the extracted entity cannot be determined, proceed to step S4.3;
S4.3: For an entity e whose entity type cannot be determined, the score score(e) of entity e is computed as follows:
S4.31: Compute the internal pattern matching degree innerPat(e);
Apply the existing internal patterns to entity e. If entity e matches an internal pattern, the confidence probability of that pattern is used as innerPat(e): if the pattern's confidence probability is 1, the final score of entity e is 1, the other features are not computed, and the procedure jumps directly to step S4.4. If entity e matches several internal patterns, their confidence probabilities are summed, capped at 1. If entity e matches no internal pattern, innerPat(e) = 0;
S4.32: Compute the semantic distance sem(e);
Compute the distance between entity e and the positive entities in the existing entity library, and the distance between entity e and the negative entities in the existing entity library: if the distance between entity e and the positive entities in the existing entity library is above a threshold, sem(e) = 1; otherwise sem(e) = 0. If the semantic distance cannot be computed, extract the head word of entity e and compute the word2vec distance between that head word and the existing set of head words: if it is above a threshold, sem(e) = 1; otherwise sem(e) = 0;
S4.33: Compute the edit distance editDist(e): compute the edit distance between entity e and the positive entities and the edit distance between entity e and the negative entities: if the distance between entity e and some positive entity is below a threshold, and its edit distances to all negative entities are all above a threshold, editDist(e) = 1; otherwise editDist(e) = 0;
S4.34: Compute the word-formation probability phraseProb(e): set separate thresholds for the internal cohesion of entity e and for its adjacent-word information entropy; if entity e meets both the internal cohesion threshold and the adjacent-word information entropy threshold, phraseProb(e) = 1; otherwise phraseProb(e) = 0. The internal cohesion is calculated by formula (1):
In formula (1), TS(t) is the set of all possible token segmentations of entity e; each member of TS(t) is called S(t); P(t) is the probability that the t-th token in S(t) occurs in the text; NumTokens is the number of all tokens in the base corpus; freq(e) is the number of times entity e occurs in the base corpus;
S4.35: Compute the domain specificity measure tfidf(e);
First, compute the raw domain specificity measure TFIDF_e by the following formula:
In formula (2), tf_e is the frequency with which entity e occurs in the base corpus, N is the number of documents in a massive domain-independent news corpus, and df_e is the number of documents that contain entity e;
Then, normalize the raw domain specificity measure TFIDF_e to the range 0 to 1 to obtain the domain specificity measure tfidf(e);
S4.36: Take the average of the internal pattern matching degree innerPat(e), the semantic distance sem(e), the edit distance editDist(e), the word-formation probability phraseProb(e) and the domain specificity measure tfidf(e) as the score score(e) of entity e;
S4.4: Compute the score of the candidate external pattern according to formula (3):
In formula (3), P_r is the set of positive seed entities extracted by the candidate external pattern, N_r is the set of negative seed entities extracted by the candidate external pattern, |·| denotes the number of elements in a set, U_r is the set of entities whose entity type cannot be determined, and score(e) is the score of an entity e whose entity type cannot be determined.
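The parts of step S4.3 that are fully described in words (S4.31, S4.33 and S4.36) can be sketched as follows. This is a rough illustration under several assumptions: internal patterns are represented as hypothetical `(predicate, confidence)` pairs, a single edit-distance threshold is used for both the positive and the negative comparison, and the remaining three features (`sem`, `phrase_prob`, `tfidf`) are passed in as precomputed values because their formulas are not reproduced here.

```python
def inner_pat(entity, patterns):
    """S4.31: sum the confidences of matching internal patterns, capped at 1.
    Returns (score, is_certain); is_certain is True when a pattern with
    confidence 1 matches, in which case the remaining features are skipped."""
    total = 0.0
    for matches, confidence in patterns:
        if matches(entity):
            if confidence >= 1.0:
                return 1.0, True
            total += confidence
    return min(total, 1.0), False

def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def edit_dist_feature(entity, positives, negatives, threshold=2):
    """S4.33: 1 if the entity is close to some positive entity and far from
    every negative entity; otherwise 0. The threshold value is an assumption."""
    near_positive = any(levenshtein(entity, p) < threshold for p in positives)
    far_from_negatives = all(levenshtein(entity, n) > threshold for n in negatives)
    return 1.0 if near_positive and far_from_negatives else 0.0

def score_entity(entity, patterns, positives, negatives, sem, phrase_prob, tfidf):
    """S4.36: average of the five features, with the S4.31 shortcut."""
    ip, certain = inner_pat(entity, patterns)
    if certain:
        return 1.0
    ed = edit_dist_feature(entity, positives, negatives)
    return (ip + sem + ed + phrase_prob + tfidf) / 5
```

Returning a flag from `inner_pat` keeps the shortcut for confidence-1 patterns explicit, so the caller can skip the more expensive features exactly as S4.31 describes.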
6. The pattern-based bootstrapping method for Chinese entity extraction according to claim 1, characterized in that: in step S5, the rules for scoring each candidate entity are as follows:
A. If the candidate entity does not satisfy the internal constraints, delete it from the candidate entity library;
B. If the candidate entity is a common word or a stop word, delete it from the candidate entity library;
C. If the candidate entity matches an internal pattern whose confidence is 1, add it to the final entity library;
D. If the candidate entity falls under none of the three cases above, first compute five feature values for it: internal pattern matching degree innerPat(e), semantic distance sem(e), edit distance editDist(e), word-formation probability phraseProb(e) and domain specificity measure tfidf(e); then sum the scores of all patterns that extracted the candidate entity and normalize the sum to the range 0 to 1, taking the normalized value as the sixth feature value; finally, take the weighted average of these six feature values to obtain the final score of the candidate entity.
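Rules A to D amount to a short decision procedure, sketched below. The function names, the return convention (`"delete"`, `"accept"`, `"score"`), the equal default weights, and the normalization of the summed pattern scores by a caller-supplied maximum `max_total` are assumptions of this sketch; the claim does not say how the sum is normalized or which weights are used.

```python
def sixth_feature(pattern_scores, max_total):
    """Sum of the scores of all patterns that extracted the entity,
    scaled into [0, 1]. Dividing by max_total is an assumed normalization."""
    if max_total <= 0:
        return 0.0
    return min(sum(pattern_scores) / max_total, 1.0)

def judge_candidate(entity, meets_constraints, stop_words, has_certain_pattern,
                    features, pattern_scores, max_total,
                    weights=(1, 1, 1, 1, 1, 1)):
    """Apply rules A-D to one candidate entity.

    features: the five values (innerPat, sem, editDist, phraseProb, tfidf)
    Returns (action, score), where score is None unless action == "score".
    """
    if not meets_constraints:          # rule A
        return "delete", None
    if entity in stop_words:           # rule B (common words and stop words)
        return "delete", None
    if has_certain_pattern:            # rule C: matches a confidence-1 pattern
        return "accept", None
    # Rule D: weighted average of the five features plus the pattern feature.
    values = list(features) + [sixth_feature(pattern_scores, max_total)]
    score = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return "score", score
```

A caller would then compare the returned score against whatever acceptance threshold the surrounding pipeline uses.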
7. The pattern-based bootstrapping method for Chinese entity extraction according to claim 1, characterized in that: in step S6, the rule for extracting internal patterns from the entities in the final entity library is as follows: if an entity contains consecutive letter strings, digits, Chinese numerals, dates, place names, person names and head words, a generalized internal pattern is extracted.
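One possible way to generalize an entity into an internal pattern is to replace each recognizable span with a type placeholder in a single regular-expression pass. The placeholder names (`<LOC>`, `<DATE>`, `<ALPHA>`, `<NUM>`, `<CNUM>`), the date format, the Chinese-numeral character set and the place-name lexicon are all assumptions of this sketch; person names and head words would need a lexicon or a named-entity recognizer and are left out.

```python
import re

def generalize(entity, places=()):
    """Return a generalized internal pattern for the entity, or None if
    nothing in it could be generalized."""
    parts = []
    if places:
        # Place names come first so that they win over numerals (e.g. 四川).
        parts.append(("LOC", "|".join(re.escape(p) for p in places)))
    parts += [
        ("DATE", r"\d{4}年\d{1,2}月(?:\d{1,2}日)?"),  # assumed date format
        ("ALPHA", r"[A-Za-z]+"),                       # consecutive letters
        ("NUM", r"\d+"),                               # Arabic digits
        ("CNUM", r"[零一二三四五六七八九十百千万]+"),   # Chinese numerals
    ]
    regex = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in parts))
    # One pass over the original string, so placeholders are never re-matched.
    pattern = regex.sub(lambda m: f"<{m.lastgroup}>", entity)
    return pattern if pattern != entity else None
```

For example, `generalize("A380客机")` yields `<ALPHA><NUM>客机`, and `generalize("南京长江大桥", places=("南京",))` yields `<LOC>长江大桥`. Doing the substitution in one pass avoids a second rule rewriting the letters inside a placeholder that an earlier rule inserted.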
8. The pattern-based bootstrapping method for Chinese entity extraction according to claim 1, characterized in that: in step S7, the formula for scoring a candidate internal pattern NP is:
In formula (4), PN_r is the set of final positive entities that match the candidate internal pattern NP, NN_r is the set of negative entities that match the candidate internal pattern NP, |·| denotes the number of elements in a set, and score(e) is the score of an entity e whose entity type cannot be determined.
CN201610848425.7A 2016-09-23 2016-09-23 A kind of Chinese entity abstracting method of pattern-based bootstrapping Active CN106445917B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610848425.7A CN106445917B (en) 2016-09-23 2016-09-23 A kind of Chinese entity abstracting method of pattern-based bootstrapping

Publications (2)

Publication Number Publication Date
CN106445917A CN106445917A (en) 2017-02-22
CN106445917B true CN106445917B (en) 2019-02-19

Family

ID=58167285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610848425.7A Active CN106445917B (en) 2016-09-23 2016-09-23 A kind of Chinese entity abstracting method of pattern-based bootstrapping

Country Status (1)

Country Link
CN (1) CN106445917B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11238363B2 (en) * 2017-04-27 2022-02-01 Accenture Global Solutions Limited Entity classification based on machine learning techniques
CN108154198B (en) * 2018-01-25 2021-07-13 北京百度网讯科技有限公司 Knowledge base entity normalization method, system, terminal and computer readable storage medium
CN111400458A (en) * 2018-12-27 2020-07-10 上海智臻智能网络科技股份有限公司 Automatic generalization method and device
CN110245354A (en) * 2019-06-20 2019-09-17 贵州电网有限责任公司 The method of entity is extracted in a kind of schedule information
CN111178045A (en) * 2019-10-14 2020-05-19 深圳软通动力信息技术有限公司 Automatic construction method of non-supervised Chinese semantic concept dictionary based on field, electronic equipment and storage medium
CN111259134B (en) * 2020-01-19 2023-08-08 出门问问信息科技有限公司 Entity identification method, equipment and computer readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101196904A (en) * 2007-11-09 2008-06-11 清华大学 News keyword abstraction method based on word frequency and multi-component grammar
CN103186556A (en) * 2011-12-28 2013-07-03 北京百度网讯科技有限公司 Method for obtaining and searching structural semantic knowledge and corresponding device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8271479B2 (en) * 2009-11-23 2012-09-18 International Business Machines Corporation Analyzing XML data
US9111211B2 (en) * 2011-12-20 2015-08-18 Bitly, Inc. Systems and methods for relevance scoring of a digital resource

Also Published As

Publication number Publication date
CN106445917A (en) 2017-02-22

Similar Documents

Publication Publication Date Title
CN106445917B (en) A kind of Chinese entity abstracting method of pattern-based bootstrapping
CN111143479B (en) Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm
CN105786991B (en) In conjunction with the Chinese emotion new word identification method and system of user feeling expression way
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN109408642A (en) A kind of domain entities relation on attributes abstracting method based on distance supervision
CN110334213B (en) Method for identifying time sequence relation of Hanyue news events based on bidirectional cross attention mechanism
CN107590133A (en) The method and system that position vacant based on semanteme matches with job seeker resume
CN106257455B (en) A kind of Bootstrapping method extracting viewpoint evaluation object based on dependence template
CN108509425A (en) Chinese new word discovery method based on novelty
CN111680488B (en) Cross-language entity alignment method based on knowledge graph multi-view information
CN105608070B (en) A kind of character relation abstracting method towards headline
CN107895000B (en) Cross-domain semantic information retrieval method based on convolutional neural network
CN108268539A (en) Video matching system based on text analyzing
CN106445921B (en) Utilize the Chinese text terminology extraction method of quadratic mutual information
CN103049501A (en) Chinese domain term recognition method based on mutual information and conditional random field model
CN103984943A (en) Scene text identification method based on Bayesian probability frame
CN108681574A (en) A kind of non-true class quiz answers selection method and system based on text snippet
CN106446018B (en) Query information processing method and device based on artificial intelligence
CN108509409A (en) A method of automatically generating semantic similarity sentence sample
CN104298663B (en) Method and device for translation consistency and statistical machine translation method and system
CN102054029A (en) Figure information disambiguation treatment method based on social network and name context
Gast et al. The areal factor in lexical typology
CN106569993A (en) Method and device for mining hypernym-hyponym relation between domain-specific terms
CN107807910A (en) A kind of part-of-speech tagging method based on HMM
CN105068990B (en) A kind of English long sentence dividing method of more strategies of Machine oriented translation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant