CN106445917B - Pattern-based bootstrapping method for Chinese entity extraction - Google Patents
Pattern-based bootstrapping method for Chinese entity extraction
- Publication number
- CN106445917B CN201610848425.7A
- Authority
- CN
- China
- Prior art keywords
- entity
- library
- schema
- seed entity
- score
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a pattern-based bootstrapping method for Chinese entity extraction. Starting from a small number of seed entities, entity-internal patterns, and entity-external patterns, it iteratively learns more entities and patterns from a corpus. The method combines statistics with patterns, and its advantage is that it does not depend on large manually annotated corpora or domain pattern libraries. Compared with existing pattern-bootstrapping methods, the invention builds on observations about the patterns of entity types in specific domains: it uses entity-internal patterns and features to score candidate patterns and entities that cannot be labeled with certainty, which improves the accuracy of pattern and entity scoring. It is suitable for domain-specific entity extraction and knowledge base construction.
Description
Technical field
The present invention relates to Chinese natural language processing, and in particular to a pattern-based bootstrapping method for Chinese entity extraction.
Background art
Named entity recognition (also called entity extraction) is a fundamental task in natural language processing and is widely used in applications such as information extraction, question answering, and machine translation. It was first proposed at the 6th MUC conference, held in 1996. Initially, the goal was to identify named entities such as person names, place names, and organization names in a corpus; as application domains have expanded, the definition and extension of entity categories have brought great challenges. The main technical approaches to named entity recognition are pattern-based methods, statistical methods, and combinations of the two. Statistical methods are widely studied in academia and are commonly used for domain-independent entity extraction. Pattern-based methods are the mainstream in industrial applications, but they usually require a large number of manually constructed rules and port poorly between domains. Bootstrapped entity extraction starts from a small number of manually labeled entities and iteratively learns more entities and rules from unlabeled text; it needs only a small amount of manual involvement and transfers well between domains. The core of bootstrapped entity extraction is the scoring of patterns and entities. Within a specific domain, entities of one type usually satisfy certain constraints and follow certain patterns internally. However, existing bootstrapping methods for Chinese entity extraction cannot use entity-internal patterns for scoring, and when extracting features to score entities that cannot be labeled they do not fully consider the characteristics of Chinese word segmentation.
Summary of the invention
Object of the invention: the object of the present invention is to provide a pattern-based bootstrapping method for Chinese entity extraction that overcomes the shortcomings of the prior art in the use of entity-internal patterns and in the selection of entity features.
Technical solution: the pattern-based bootstrapping method for Chinese entity extraction of the present invention performs entity recognition and rule base construction for each entity type, and comprises the following steps:
S1: The user provides the following inputs: a. positive seed entities and negative seed entities; b. the internal constraints, internal patterns, and confidence levels of the positive and negative seed entities; c. the external constraints of the positive and negative seed entities, that is, the contextual information in which each occurs; d. the original unlabeled text. Of these four classes of input, a and d must not be empty, while b and c may be empty;
S2: Perform domain-independent word segmentation, part-of-speech tagging, syntactic parsing, and entity recognition on the original text to generate the base corpus; add the positive seed entities to the final entity library;
S3: Label the base corpus with the positive entities in the final entity library, extract the contextual information of each labeled positive entity to form candidate external patterns, and add them to the candidate external pattern library;
S4: Score the candidate external pattern library: re-label the original text with each candidate external pattern; using the final entity library, count the positive entities, negative entities, and entities of undeterminable type that each candidate external pattern extracts; score each candidate external pattern, sort by score from high to low, and add the top K candidate external patterns to the final external pattern library;
S5: Extract entities from the original text with the newly generated final external pattern library to produce the candidate entity library; score each candidate entity, sort by score from high to low, and add the top K candidate entities to the final entity library;
S6: Extract internal patterns from the K candidate entities produced in S5 to form the candidate internal pattern library;
S7: Score each candidate internal pattern, sort by score from high to low, and add the top K candidate internal patterns to the final internal pattern library;
S8: If the number of iterations has reached the upper limit, or no new entity has been found, the iteration ends; otherwise return to step S3;
S9: Output the final entity library, final external pattern library, and final internal pattern library.
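The loop formed by steps S3 to S8 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the callables `extract_patterns`, `score_pattern`, `apply_patterns`, and `score_entity` are hypothetical stand-ins for steps S3, S4, and S5, and the internal-pattern steps S6 and S7 are omitted for brevity.

```python
from typing import Callable, Iterable, Set, Tuple


def bootstrap(
    seeds: Iterable[str],
    extract_patterns: Callable[[Set[str]], Set[str]],  # S3 (hypothetical)
    score_pattern: Callable[[str, Set[str]], float],   # S4 (hypothetical)
    apply_patterns: Callable[[Set[str]], Set[str]],    # S5 extraction (hypothetical)
    score_entity: Callable[[str, Set[str]], float],    # S5 scoring (hypothetical)
    top_k: int = 2,
    max_iters: int = 10,
) -> Tuple[Set[str], Set[str]]:
    """Iterate S3-S5 until no new entity appears or max_iters is reached (S8)."""
    entities: Set[str] = set(seeds)  # final entity library, seeded in S2
    patterns: Set[str] = set()       # final external pattern library
    for _ in range(max_iters):
        # S3/S4: collect candidate patterns and keep the top K by score.
        candidates = extract_patterns(entities) - patterns
        ranked = sorted(candidates, key=lambda p: score_pattern(p, entities), reverse=True)
        patterns |= set(ranked[:top_k])
        # S5: extract candidate entities and keep the top K by score.
        found = apply_patterns(patterns) - entities
        best = sorted(found, key=lambda e: score_entity(e, entities), reverse=True)[:top_k]
        if not best:  # S8: no new entity was found, so stop
            break
        entities |= set(best)
    return entities, patterns  # S9
```

Passing the scoring functions in as parameters keeps the control flow of the iteration separate from the scoring details described in the later steps.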
Further, in step S1, the internal constraints of the positive and negative seed entities include: their respective length ranges, whether they contain only Chinese characters, whether special symbols are allowed, whether letters and digits are allowed, and known entity head words.
Further, in step S1, the internal patterns of the positive and negative seed entities are the patterns that each follows, generalized using the basic entity types.
Further, in step S3, the candidate external pattern library is formed as follows: the part of speech and entity type of the positive seed entity itself, together with the entity types of the elements within a certain window, are collected to form a candidate external pattern; for each element in the window, if it has an entity type, the entity type is used as the element's feature tag, otherwise the word itself is used as the feature tag.
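The feature-tag rule for window elements can be illustrated with a short Python sketch. The `Token` type and the restriction to the right-hand context are assumptions made for this example; the source does not prescribe a data structure.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    word: str
    ner: Optional[str] = None  # basic entity type such as "GPE", or None


def feature_tag(tok: Token) -> str:
    """Use the entity type when the token has one, otherwise the word itself."""
    return f"ner:{tok.ner}" if tok.ner else f"word:{tok.word}"


def right_context_pattern(tokens: List[Token], entity_end: int, window: int) -> List[str]:
    """Feature tags of up to `window` tokens after an entity ending at entity_end (exclusive)."""
    return [feature_tag(t) for t in tokens[entity_end:entity_end + window]]
```

For example, with tokens `J-20`, `by`, `China` (tagged `GPE`), and `developed`, calling `right_context_pattern(tokens, 1, 2)` produces one word tag followed by one entity-type tag.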
Further, in step S4, each candidate external pattern is scored according to the following steps:
S4.1: Extract from the base corpus with the candidate external pattern: if the candidate external pattern cannot obtain any additional entities, delete it from the candidate external pattern library; it no longer participates in scoring and the process ends for it; otherwise, continue with step S4.2;
S4.2: If an entity extracted by the candidate external pattern is in the positive entity library, the entity is judged to be a positive entity and scores 1; if it is in the negative entity library, it is judged to be a negative entity and scores 0; if its entity type cannot be determined, go to step S4.3;
S4.3: For an entity e whose type cannot be determined, compute its score score(e) as follows:
S4.31: Compute the internal pattern matching degree innerPat(e);
Apply the existing internal patterns to entity e. If e matches an internal pattern, the pattern's confidence probability is taken as innerPat(e): if that confidence probability is 1, the final score of e is 1, no other features are computed, and the process jumps directly to step S4.4; if e matches several internal patterns, their confidence probabilities are summed, capped at 1; if e matches no internal pattern, innerPat(e) = 0;
S4.32: Compute the semantic distance sem(e);
Compute the distance between e and the positive entities in the existing entity library, and the distance between e and the negative entities in the existing entity library. The "distance" here is a similarity measure, so a higher value means the two are semantically closer. If the value for the positive entities is the larger of the two and exceeds a threshold, sem(e) = 1; otherwise sem(e) = 0. If the semantic distance cannot be computed, extract the head word of e and compute the word2vec distance between that head word and the set of existing head words: if it exceeds the threshold, sem(e) = 1; otherwise sem(e) = 0;
S4.33: Compute the edit distance feature editDist(e): compute the edit distance between e and the positive entities and between e and the negative entities. If the edit distance between e and some positive entity is below the threshold, and its edit distance to every negative entity is above the threshold, editDist(e) = 1; otherwise editDist(e) = 0;
S4.34: Compute the word-formation probability phraseProb(e): set separate thresholds for the internal cohesion of e and for its adjacent-word information entropy. If e meets both thresholds, phraseProb(e) = 1; otherwise phraseProb(e) = 0. The internal cohesion is computed by formula (1):
In formula (1), TS(t) is the set of all possible segmentations of entity e into tokens, each element of TS(t) is denoted S(t), P(t) is the probability that the t-th token in S(t) occurs in the documents, NumTokens is the number of all tokens in the base corpus, and Freq(e) is the number of times e occurs in the base corpus;
S4.35: Compute the domain specificity measure tfidf(e);
First, compute the raw domain specificity measure TFIDF_e by the following formula:
In formula (2), tf_e is the frequency with which entity e occurs in the base corpus, N is the number of documents in a large domain-independent news corpus, and df_e is the number of documents containing e;
Then normalize TFIDF_e to the range 0 to 1 to obtain the domain specificity measure tfidf(e);
S4.36: Take the average of innerPat(e), sem(e), editDist(e), phraseProb(e), and tfidf(e) as the score score(e) of entity e;
S4.4: Compute the score of the candidate external pattern by formula (3):
In formula (3), P_r is the set of positive seed entities extracted by the candidate external pattern, N_r is the set of negative seed entities it extracts, |·| denotes the number of elements in a set, U_r is the set of entities whose type cannot be determined, and score(e) is the score of such an entity e.
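Steps S4.31, S4.33, and S4.36 lend themselves to a short Python sketch. It is illustrative only: normalizing the edit distance by the length of the longer string is an assumption of this example, the threshold is chosen by the caller, and the remaining features (sem, phraseProb, tfidf) are taken as precomputed inputs.

```python
from typing import Sequence


def inner_pat(confidences: Sequence[float]) -> float:
    """S4.31: sum the confidences of all matching internal patterns, capped at 1."""
    return min(1.0, sum(confidences))


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, and substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_dist_feature(e: str, positives: Sequence[str], negatives: Sequence[str],
                      threshold: float) -> float:
    """S4.33: 1 if e is close to some positive entity and far from every negative one."""
    def norm(a: str, b: str) -> float:  # assumption: normalize by the longer string
        return levenshtein(a, b) / max(len(a), len(b), 1)
    near_pos = any(norm(e, p) < threshold for p in positives)
    far_neg = all(norm(e, n) > threshold for n in negatives)
    return 1.0 if near_pos and far_neg else 0.0


def entity_score(confidences: Sequence[float], sem: float, edit: float,
                 phrase_prob: float, tfidf: float) -> float:
    """S4.36: plain average of the five features, with the S4.31 shortcut."""
    if any(c == 1.0 for c in confidences):  # a confidence-1 pattern decides the score
        return 1.0
    return (inner_pat(confidences) + sem + edit + phrase_prob + tfidf) / 5
```

Keeping each feature in its own function mirrors the step numbering and makes it easy to swap in a different distance or threshold.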
Further, in step S5, each candidate entity is scored by the following rules:
e. If the candidate entity does not satisfy the internal constraints, delete it from the candidate entity library;
f. If the candidate entity is a common word or a stop word, delete it from the candidate entity library;
g. If the candidate entity matches an internal pattern whose confidence is 1, add it to the final entity library;
h. If none of the three cases above applies, first compute five feature values for the candidate entity: the internal pattern matching degree innerPat(e), semantic distance sem(e), edit distance editDist(e), word-formation probability phraseProb(e), and domain specificity measure tfidf(e); then sum the scores of all patterns that extracted the candidate entity and normalize the sum to the range 0 to 1 as the sixth feature value; finally, take the weighted average of these six feature values as the candidate entity's final score.
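Rules e to h can be read as a small decision procedure, sketched below in Python. The function name, the equal default weights, and the convention of returning `None` for a deleted candidate and `1.0` for a directly accepted one are assumptions of this example; the source does not specify the weights or how the summed pattern scores are normalized, so the sixth feature is passed in already normalized.

```python
from typing import Optional, Sequence


def score_candidate(
    satisfies_constraints: bool,
    is_common_or_stop_word: bool,
    matches_confidence_one_pattern: bool,
    features: Sequence[float],      # the five feature values, each in [0, 1]
    pattern_score_norm: float,      # sixth feature, already normalized to [0, 1]
    weights: Sequence[float] = (1, 1, 1, 1, 1, 1),  # hypothetical equal weights
) -> Optional[float]:
    """Return None to delete the candidate (rules e, f), 1.0 for rule g, else rule h."""
    if not satisfies_constraints or is_common_or_stop_word:
        return None
    if matches_confidence_one_pattern:
        return 1.0
    values = list(features) + [pattern_score_norm]
    # Rule h: weighted average of the six feature values.
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```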
Further, in step S6, internal patterns are extracted from the entities in the final entity library by the following rule: if an entity contains contiguous letter strings, numbers, Chinese numerals, dates, place names, person names, or head words, a generalized internal pattern is extracted.
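A minimal Python sketch of this kind of generalization is shown below. It covers only contiguous ASCII letter strings and digit strings; Chinese numerals, dates, place names, person names, and head words would need the entity recognizer and are left out. The placeholder names `ALPHA` and `NUM` are hypothetical.

```python
import re
from typing import List


def generalize(entity: str) -> List[str]:
    """Replace letter runs with ALPHA and digit runs with NUM; keep other characters."""
    parts = re.findall(r"[A-Za-z]+|[0-9]+|.", entity)
    out = []
    for p in parts:
        if re.fullmatch(r"[A-Za-z]+", p):
            out.append("ALPHA")
        elif re.fullmatch(r"[0-9]+", p):
            out.append("NUM")
        else:
            out.append(p)  # literal character such as "-" or a Chinese character
    return out
```

Two entities that differ only in their letters and digits, such as `F-22` and `B-2`, generalize to the same sequence and so can share one internal pattern.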
Further, in step S7, the candidate internal pattern NP is scored by the following formula:
In formula (4), PN_r is the set of final positive entities that match the candidate internal pattern NP, NN_r is the set of negative entities that match NP, |·| denotes the number of elements in a set, and score(e) is the score obtained for the candidate internal pattern NP.
Beneficial effects: compared with the prior art, the present invention has the following beneficial effects:
1) It requires neither large manually annotated corpora nor hand-written rules; given only a small number of seed entities and rules, it automatically builds up more entities and the rule base, and the system ports well between domains;
2) Entity-internal patterns and constraints take part in pattern scoring, and entity features are extracted along multiple dimensions, which can significantly improve entity recognition.
Brief description of the drawings
Fig. 1 is a flow diagram of the method of a specific embodiment of the invention.
Detailed description of the embodiments
The technical solution of the present invention is further described below with reference to the drawing and a specific embodiment.
The invention discloses a pattern-based bootstrapping method for Chinese entity extraction, which performs entity recognition and rule base construction for each entity type and comprises the following steps:
S1: The user provides the following inputs: a. positive seed entities and negative seed entities; b. the internal constraints, internal patterns, and confidence levels of the positive and negative seed entities; c. the external constraints of the positive and negative seed entities, that is, the contextual information in which each occurs; d. the original unlabeled text. Of these four classes of input, a and d must not be empty, while b and c may be empty;
S2: Perform domain-independent word segmentation, part-of-speech tagging, syntactic parsing, and entity recognition on the original text to generate the base corpus; add the positive seed entities to the final entity library;
S3: Label the base corpus with the positive entities in the final entity library, extract the contextual information of each labeled positive entity to form candidate external patterns, and add them to the candidate external pattern library;
S4: Score the candidate external pattern library: re-label the original text with each candidate external pattern; using the final entity library, count the positive entities, negative entities, and entities of undeterminable type that each candidate external pattern extracts; score each candidate external pattern, sort by score from high to low, and add the top K candidate external patterns to the final external pattern library;
S5: Extract entities from the original text with the newly generated final external pattern library to produce the candidate entity library; score each candidate entity, sort by score from high to low, and add the top K candidate entities to the final entity library;
S6: Extract internal patterns from the K candidate entities produced in S5 to form the candidate internal pattern library;
S7: Score each candidate internal pattern, sort by score from high to low, and add the top K candidate internal patterns to the final internal pattern library;
S8: If the number of iterations has reached the upper limit, or no new entity has been found, the iteration ends; otherwise return to step S3;
S9: Output the final entity library, final external pattern library, and final internal pattern library.
The present invention combines statistics with patterns. Its advantage is that it does not depend on large manually annotated corpora or domain pattern libraries. Compared with existing pattern-bootstrapping methods, it builds on observations about the patterns of entity types in specific domains and uses entity-internal patterns and features to score candidate patterns and entities that cannot be labeled with certainty, which improves the accuracy of pattern and entity scoring. It is suitable for domain-specific entity extraction and knowledge base construction.
The flow of this embodiment is shown in Fig. 1:
In step S1, for entities of the "aircraft" class, the user provides the seed entity: J-20.
The entity constraints provided by the user are shown in Table 1:
Table 1 Entity constraints provided by the user
Constraint | Value |
Length | {2,10} |
NumAllowed | true |
Alphabetallowed | true |
SpecialSymbolAllowed | true |
Headwords | aircraft, fighter, plane, patrol aircraft, tanker |
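The constraints in Table 1 could be checked with a few lines of Python, sketched below. The class and field names are hypothetical, length is assumed to be measured in characters, and the head word list is left out because the source does not say how head words are matched.

```python
import re
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Constraints:
    length: Tuple[int, int] = (2, 10)    # Length {2,10}, assumed to mean characters
    num_allowed: bool = True             # NumAllowed
    alphabet_allowed: bool = True        # Alphabetallowed
    special_symbol_allowed: bool = True  # SpecialSymbolAllowed


def satisfies(entity: str, c: Constraints) -> bool:
    """Return True if the candidate entity passes every enabled constraint."""
    lo, hi = c.length
    if not lo <= len(entity) <= hi:
        return False
    if not c.num_allowed and re.search(r"[0-9]", entity):
        return False
    if not c.alphabet_allowed and re.search(r"[A-Za-z]", entity):
        return False
    # Any character that is not a letter, digit, or underscore counts as special.
    if not c.special_symbol_allowed and re.search(r"[^\w]", entity):
        return False
    return True
```

With the Table 1 defaults, `satisfies("F-22", Constraints())` passes, while a one-character candidate fails the length check.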
The internal patterns provided by the user are shown in Table 2:
Table 2 Internal patterns provided by the user
The external patterns provided by the user are shown in Table 3:
Table 3 External patterns provided by the user
In step S2, the original text is preprocessed (word segmentation, part-of-speech tagging, entity recognition, and so on) with open-source tools, as follows: word segmentation and part-of-speech tagging use the Ansj tool; entity recognition uses the Chinese classifier bundled with Stanford NER to identify GPE, PERSON, ORGANIZATION, and LOCATION, and recognition rules for Chinese dates (DATE), times (TIME), and quantities (NUMBER) are written with the Stanford TokensRegex tool. In the end, entity recognition provides labels of seven types: GPE, PERSON, LOCATION, ORGANIZATION, DATE, TIME, and NUMBER.
In step S3, the preprocessed original corpus is first labeled with the existing positive entities, and the external patterns within the context window are extracted. For example, given the text "The J-20 fighter is code-named Mighty Dragon, and the F-22 fighter is code-named Raptor. The J-20 was developed by China, the F-22 was developed by the United States, and the iPhone was developed by the United States", matching the seed entity "J-20" in the text with a window of 2 to 3 yields the following external patterns. The word tokens in the patterns are English glosses of the Chinese tokens and keep Chinese word order, in which "by China" precedes "developed":
1. (?$term []{1,3}) [{word:/fighter/}] [{word:/code-named/}]
2. (?$term []{1,3}) [{word:/fighter/}] [{word:/code-named/}] [{word:/Mighty Dragon/}]
3. (?$term []{1,3}) [{word:/by/}] [{ner:/GPE/}]
4. (?$term []{1,3}) [{word:/by/}] [{ner:/GPE/}] [{word:/developed/}]
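To show how such a pattern pulls out new entities, the Python sketch below approximates pattern 1 on an already tokenized sentence. It is a simplification and not TokensRegex: it assumes the entity is a single token (the real pattern allows one to three), and the token list uses English glosses.

```python
from typing import List, Sequence


def match_right_context(tokens: Sequence[str], context: Sequence[str]) -> List[str]:
    """Return each token that is immediately followed by the tokens in `context`."""
    n = len(context)
    hits = []
    for i in range(len(tokens) - n):
        if list(tokens[i + 1:i + 1 + n]) == list(context):
            hits.append(tokens[i])
    return hits


tokens = ["J-20", "fighter", "code-named", "Mighty Dragon", ",",
          "F-22", "fighter", "code-named", "Raptor"]
extracted = match_right_context(tokens, ["fighter", "code-named"])
```

Here `extracted` contains the seed itself plus any other token that appears in the same context, which is how the pattern reaches entities outside the seed set.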
In step S4, each candidate pattern is scored. Taking pattern 1 as an example, applying it to the original corpus extracts F-22. F-22 is then evaluated. First, check whether F-22 satisfies the internal constraints. A word2vec model is trained in advance on a large unlabeled military corpus; F-22 is fed to word2vec and its distance to J-20 is computed. If the distance exceeds a certain threshold (for example 0.6), the two are considered semantically similar and sem(e) = 1. F-22 is matched against the internal patterns: it matches internal pattern 3, whose confidence is 0.8, so innerPat(e) = 0.8. For the edit distance, after generalizing the digits the edit distance between the two is 33%, so editDist(e) = 1. For the word-formation probability, suppose the internal cohesion and adjacent-word information entropy do not meet the thresholds, giving phraseProb(e) = 0 (the calculation is rather involved and is not shown here). The domain specificity measure is computed from n-grams over a large domain-independent news corpus; suppose the normalized tfidf(e) is 0.8. The entity's final score is then 0.74.
Applying formula (3), the pattern's final score is 3.84.
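The formula itself is not reproduced in this text. One form that is consistent with the symbols defined for formula (3) and with the figures in this example (one positive entity, no negative entities, and one undetermined entity scoring 0.74) is given below; it is offered as an assumption, not as the patent's confirmed formula.

```latex
\mathrm{score}(r)
  = \frac{|P_r|}{|N_r| + \sum_{e \in U_r} \bigl(1 - \mathrm{score}(e)\bigr)}
  = \frac{1}{0 + (1 - 0.74)}
  \approx 3.85
```

Under this reading the result matches the reported 3.84 up to rounding, and a pattern is rewarded for extracting known positive entities and penalized for extracting negative or low-scoring undetermined ones.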
Following the steps above, a score is computed for each candidate external pattern; pattern 2 is discarded because it cannot identify any additional entities. When scores are equal, the more complex rule is selected. After sorting, the top 2 patterns are added to the final rule base; suppose patterns 1 and 4 are finally selected.
In step S5, extraction with external patterns 1 and 4 forms the candidate entity library {F-22, iPhone}. The two entities are scored; "F-22" scores higher than "iPhone", and the top 1 is added to the final entity library, which now contains {J-20, F-22}.
In step S6, patterns are extracted from the newly added entities; however, since F-22 already matches one of the internal patterns, no new internal pattern can be generated. Step S7 is therefore skipped and step S8 is carried out directly.
In step S8, the process returns to step S3: with {J-20, F-22} as seed entities, the original corpus is labeled again, the external pattern library is generated, and steps S4 to S7 are executed again.
In step S9, since no new patterns are generated, the iteration ends, and the final entity library, final external pattern library, and final internal pattern library are output.
Final entity library: {J-20, F-22}
Final external pattern library:
(?$term []{1,3}) [{word:/fighter/}] [{word:/code-named/}]
(?$term []{1,3}) [{word:/by/}] [{ner:/GPE/}] [{word:/developed/}]
(?$term []{2,3} [word:$PLANETYPE]) [{word:/|in/}] [{ner:DATE}] [{word:/landing|take off/}]
Final internal pattern library:
$PLANETYPE = "/warplane|aircraft|helicopter|trainer|patrol aircraft|tanker|aerial survey aircraft|patrol aircraft|trainer|bomber|reconnaissance aircraft|research aircraft|fighter|jet/"
([{word:/J|Su|Il|Mei|Boeing|MiG|Mi|Airbus/}]) ([{word:"-"}]{0,1}) ([{ner:NUMBER}])
(([({word:/\d+/}&{ner:NUMBER})|{word:/[a-zA-Z]+/}]+)(([{word:"-"}])([({word:/\d+/}&{ner:NUMBER})|{word:/[a-zA-Z]+/}]+))+[word:$PLANETYPE]*).
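The last internal pattern describes runs of letters or digits joined by hyphens, optionally followed by an aircraft type. A rough string-level approximation with Python's `re` module is sketched below; it is not a translation of the token-level TokensRegex rule, and the shortened `PLANE_TYPE` list of English glosses is an assumption for illustration.

```python
import re

# Abbreviated, hypothetical stand-in for the $PLANETYPE macro (English glosses).
PLANE_TYPE = r"(?:fighter|bomber|tanker|helicopter)"

# One or more hyphen-joined alphanumeric runs, then optional aircraft-type words.
INTERNAL = re.compile(rf"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+(?:\s*{PLANE_TYPE})*")


def matches_internal(entity: str) -> bool:
    """Return True if the whole entity string fits the approximated internal pattern."""
    return INTERNAL.fullmatch(entity) is not None
```

Under this approximation a designation such as `F-22` or `Su-27` matches, while a candidate without a hyphenated designation, such as `iPhone`, does not.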
Claims (8)
1. A pattern-based bootstrapping method for Chinese entity extraction, characterized in that entity recognition and rule base construction are performed for each entity type, comprising the following steps:
S1: The user provides the following inputs: a. positive seed entities and negative seed entities; b. the internal constraints, internal patterns, and confidence levels of the positive and negative seed entities; c. the external constraints of the positive and negative seed entities, that is, the contextual information in which each occurs; d. the original unlabeled text. Of these four classes of input, a and d must not be empty, while b and c may be empty;
S2: Perform domain-independent word segmentation, part-of-speech tagging, syntactic parsing, and entity recognition on the original text to generate the base corpus; add the positive seed entities to the final entity library;
S3: Label the base corpus with the positive entities in the final entity library, extract the contextual information of each labeled positive entity to form candidate external patterns, and add them to the candidate external pattern library;
S4: Score the candidate external pattern library: re-label the original text with each candidate external pattern; using the final entity library, count the positive entities, negative entities, and entities of undeterminable type that each candidate external pattern extracts; score each candidate external pattern, sort by score from high to low, and add the top K candidate external patterns to the final external pattern library;
S5: Extract entities from the original text with the newly generated final external pattern library to produce the candidate entity library; score each candidate entity, sort by score from high to low, and add the top K candidate entities to the final entity library;
S6: Extract internal patterns from the K candidate entities produced in S5 to form the candidate internal pattern library;
S7: Score each candidate internal pattern, sort by score from high to low, and add the top K candidate internal patterns to the final internal pattern library;
S8: If the number of iterations has reached the upper limit, or no new entity has been found, the iteration ends; otherwise return to step S3;
S9: Output the final entity library, final external pattern library, and final internal pattern library.
2. The pattern-based bootstrapping method for Chinese entity extraction according to claim 1, characterized in that: in step S1, the internal constraints of the positive and negative seed entities include their respective length ranges, whether they contain only Chinese characters, whether special symbols are allowed, whether letters and digits are allowed, and known entity head words.
3. The pattern-based bootstrapping method for Chinese entity extraction according to claim 1, characterized in that: in step S1, the internal patterns of the positive and negative seed entities are the patterns that each follows, generalized using the basic entity types.
4. The pattern-based bootstrapping method for Chinese entity extraction according to claim 1, characterized in that: in step S3, the candidate external pattern library is formed as follows: the part of speech and entity type of the positive seed entity itself, together with the entity types of the elements within a certain window, are collected to form a candidate external pattern; for each element in the window, if it has an entity type, the entity type is used as the element's feature tag, otherwise the word itself is used as the feature tag.
5. the Chinese entity abstracting method of pattern-based bootstrapping according to claim 1, it is characterised in that: the step S4
In, it is scored according to the following steps external schema to be selected to carry out:
S4.1: it is extracted in basic corpus with external schema to be selected: if the external schema to be selected can not obtain more realities
Body then deletes the external schema to be selected from external schema library to be selected, and the external schema to be selected is no longer participate in scoring, process
Terminate;Otherwise, continue step S4.2;
S4.2: if the entity that the external schema to be selected extracts is present in positive entity library, judge the entity for forward direction
Entity, the entity are scored at 1;If the entity that the external schema to be selected extracts is present in reversed entity library, judgement should
Entity is reversed entity, which is scored at 0;If the entity type for the entity that the external schema to be selected extracts can not be sentenced
It is disconnected, then carry out step S4.3;
S4.3: for an entity e whose entity type cannot be determined, the score score(e) of entity e is computed as follows:
S4.31: compute the internal pattern matching degree innerPat(e);
Apply the existing internal patterns to entity e. If entity e matches an internal pattern, the confidence probability of that pattern is taken as innerPat(e): if the pattern's confidence probability is 1, the final score of entity e is 1, no other features are computed, and processing skips directly to step S4.4; if entity e matches several internal patterns, their confidence probabilities are summed, with the sum capped at 1; if entity e matches no internal pattern, innerPat(e) = 0;
S4.32: compute the semantic distance sem(e);
Compute the distance between entity e and the positive entities in the existing entity library, and the distance between entity e and the negative entities in the existing entity library: if the distance between entity e and the positive entities in the existing entity library is higher than a threshold, sem(e) = 1; otherwise sem(e) = 0; if the semantic distance cannot be computed, extract the head word of entity e and compute the word2vec distance between the head word of entity e and the existing set of head words: if it is higher than the threshold, sem(e) = 1; otherwise sem(e) = 0;
S4.33: compute the edit distance editDist(e): compute the edit distance between entity e and the positive entities, and the edit distance between entity e and the negative entities: if the distance between entity e and some positive entity is less than a threshold, and its edit distance to every negative entity is greater than the threshold, editDist(e) = 1; otherwise editDist(e) = 0;
S4.34: compute the word-formation probability phraseProb(e): set one threshold for the internal cohesion of entity e and another for its adjacent-word information entropy; if entity e satisfies both the internal-cohesion threshold and the adjacent-word-entropy threshold, phraseProb(e) = 1; otherwise phraseProb(e) = 0; the internal cohesion is computed by formula (1):
In formula (1), TS(t) is the set of all possible ways of splitting entity e into tokens, each member of TS(t) is denoted S(t), P(t) is the probability that the t-th token in S(t) occurs in the file, NumTokens is the number of all tokens in the basic corpus, and freq(e) is the number of times entity e occurs in the basic corpus;
S4.35: compute the domain specificity measure tfidf(e);
First, compute the raw domain specificity measure TFIDF_e by the following formula:
In formula (2), tf_e is the frequency with which entity e occurs in the basic corpus, N is the number of documents in a large domain-independent news corpus, and df_e is the number of documents containing entity e;
Then, normalize the raw domain specificity measure TFIDF_e to the range 0 to 1 to obtain the domain specificity measure tfidf(e);
S4.36: take the average of the internal pattern matching degree innerPat(e), the semantic distance sem(e), the edit distance editDist(e), the word-formation probability phraseProb(e) and the domain specificity measure tfidf(e) as the score score(e) of entity e;
S4.4: compute the score of the candidate external pattern according to formula (3):
In formula (3), P_r is the set of positive seed entities extracted by the candidate external pattern, N_r is the set of negative seed entities extracted by the candidate external pattern, |·| denotes the number of elements in a set, U_r is the set of entities whose entity type cannot be determined, and score(e) is the score of an entity e whose entity type cannot be determined.
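Steps S4.31, S4.33, S4.36 and S4.4 can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the patented formulas: formula (3) is not reproduced in this text, so `pattern_score` uses an assumed precision-style ratio in which each undetermined entity contributes its score(e) as partial credit. The function names, the threshold value of 2, and the sample entities are likewise hypothetical; the sem, phraseProb and tfidf features are passed in as precomputed numbers.

```python
from typing import Dict, Iterable, Sequence, Set

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def edit_dist_feature(e: str, positives: Set[str], negatives: Set[str],
                      threshold: int = 2) -> float:
    """S4.33: 1 if e is close to some positive entity and far from every negative one."""
    near_positive = any(edit_distance(e, p) < threshold for p in positives)
    far_from_negatives = all(edit_distance(e, n) > threshold for n in negatives)
    return 1.0 if near_positive and far_from_negatives else 0.0

def entity_score(pattern_confidences: Sequence[float], sem: float, edit: float,
                 phrase: float, tfidf: float) -> float:
    """S4.31 and S4.36: short-circuit on a confidence-1 pattern, else average five features."""
    if any(c >= 1.0 for c in pattern_confidences):
        return 1.0
    inner = min(1.0, sum(pattern_confidences))  # summed confidences, capped at 1
    return (inner + sem + edit + phrase + tfidf) / 5.0

def pattern_score(extracted: Iterable[str], positives: Set[str], negatives: Set[str],
                  unlabeled_scores: Dict[str, float]) -> float:
    """S4.4 under an ASSUMED formula: (|P_r| + sum of score(e) over U_r) / (|P_r| + |N_r| + |U_r|)."""
    found = set(extracted)
    p_r = found & positives
    n_r = found & negatives
    u_r = found - positives - negatives
    if not found:
        return 0.0
    credit = len(p_r) + sum(unlabeled_scores[e] for e in u_r)
    return credit / (len(p_r) + len(n_r) + len(u_r))

positives = {"阿司匹林", "布洛芬"}
negatives = {"医院"}
u = entity_score([0.5, 0.25], sem=1.0,
                 edit=edit_dist_feature("阿司匹林片", positives, negatives),
                 phrase=1.0, tfidf=0.5)
score = pattern_score({"阿司匹林", "医院", "阿司匹林片"}, positives, negatives,
                      {"阿司匹林片": u})
```

Under these assumptions a pattern that extracts one known positive entity, one known negative entity, and one plausible undetermined entity lands between 1/3 and 2/3, with the undetermined entity's score deciding where.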
6. The pattern-based bootstrapping method for Chinese entity extraction according to claim 1, characterized in that, in step S5, each candidate entity is scored according to the following rules:
a. if the candidate entity does not satisfy the internal constraints, delete it from the candidate entity library;
b. if the candidate entity is a common word or a stop word, delete it from the candidate entity library;
c. if the candidate entity matches an internal pattern whose confidence is 1, add it to the final entity library;
d. if none of the three cases above applies, first compute five feature values for the candidate entity: the internal pattern matching degree innerPat(e), the semantic distance sem(e), the edit distance editDist(e), the word-formation probability phraseProb(e) and the domain specificity measure tfidf(e); then sum the scores of all patterns that extracted the candidate entity and normalize the sum to the range 0 to 1, taking the normalized value as the sixth feature value; finally, take the weighted average of these six feature values to obtain the final score of the candidate entity.
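Rules a to d can be read as a small decision procedure, sketched below in Python. Several details are assumptions because the claim leaves them open: the weights, the way the summed pattern scores are normalized (here, division by a caller-supplied maximum `max_pattern_sum`, clipped to 1), and the two predicate callbacks, which are hypothetical stand-ins for the internal-constraint check and the confidence-1 internal pattern check.

```python
from typing import Callable, Sequence, Set, Tuple

def weighted_average(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of the feature values; the weights need not sum to 1."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def judge_candidate(
    entity: str,
    five_features: Sequence[float],      # innerPat, sem, editDist, phraseProb, tfidf
    pattern_scores: Sequence[float],     # scores of all patterns that extracted the entity
    max_pattern_sum: float,              # assumed normalizer for the summed pattern scores
    weights: Sequence[float],            # six weights; values are an assumption
    stop_words: Set[str],
    satisfies_constraints: Callable[[str], bool],
    has_confidence_one_pattern: Callable[[str], bool],
) -> Tuple[str, float]:
    """Return ("delete", 0.0), ("accept", 1.0) or ("score", final_score)."""
    if not satisfies_constraints(entity):        # rule a
        return ("delete", 0.0)
    if entity in stop_words:                     # rule b
        return ("delete", 0.0)
    if has_confidence_one_pattern(entity):       # rule c
        return ("accept", 1.0)
    # rule d: the sixth feature is the normalized sum of pattern scores
    sixth = min(1.0, sum(pattern_scores) / max_pattern_sum) if max_pattern_sum > 0 else 0.0
    return ("score", weighted_average(list(five_features) + [sixth], weights))

result = judge_candidate(
    "阿司匹林片",
    five_features=[1.0, 1.0, 0.0, 1.0, 0.5],
    pattern_scores=[0.5, 0.5],
    max_pattern_sum=2.0,
    weights=[2, 1, 1, 1, 1, 2],
    stop_words={"的", "了"},
    satisfies_constraints=lambda e: len(e) >= 2,
    has_confidence_one_pattern=lambda e: False,
)
```

Returning a decision label together with the score keeps the three outcomes of the claim (delete, accept outright, score) distinguishable for the caller; how a scored entity is then admitted to the final entity library is outside this claim.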
7. The pattern-based bootstrapping method for Chinese entity extraction according to claim 1, characterized in that, in step S6, internal patterns are extracted from the entities in the final entity library according to the following rule: if the inside of an entity contains a continuous letter string, digits, Chinese numerals, a date, a place name, a person name and a head word, a generalized internal pattern is extracted.
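One way to picture the generalization in this claim is a single-pass regular-expression substitution, sketched below in Python. The sketch covers only letter strings, digits, and Chinese numerals; dates, place names, person names, and head words would need a tagger or gazetteer that is not shown here. The placeholder names `<ALPHA>`, `<NUM>`, `<CNUM>` and the sample entities are assumptions for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# One alternation, so every character run is classified exactly once.
_GENERALIZABLE = re.compile(
    r"(?P<ALPHA>[A-Za-z]+)"
    r"|(?P<NUM>[0-9]+)"
    r"|(?P<CNUM>[零一二三四五六七八九十百千万亿两]+)"
)

def generalize(entity: str) -> str:
    """Replace each letter run, digit run, or Chinese-numeral run with a placeholder."""
    return _GENERALIZABLE.sub(lambda m: f"<{m.lastgroup}>", entity)

def internal_patterns(final_entities: Iterable[str]) -> Counter:
    """Count generalized internal patterns over the final entity library."""
    patterns = Counter()
    for entity in final_entities:
        pattern = generalize(entity)
        if pattern != entity:  # keep only entities that actually generalize
            patterns[pattern] += 1
    return patterns

patterns = internal_patterns(["维生素B12片", "维生素B6片", "第三代头孢", "阿司匹林"])
```

Here 维生素B12片 and 维生素B6片 collapse into the same pattern, which is what lets one internal pattern cover entities not yet seen.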
8. The pattern-based bootstrapping method for Chinese entity extraction according to claim 1, characterized in that, in step S7, a candidate internal pattern NP is scored by the following formula:
In formula (4), PN_r is the set of final positive entities that match the candidate internal pattern NP, NN_r is the set of negative entities that match the candidate internal pattern NP, |·| denotes the number of elements in a set, and score(e) is the score of an entity e whose entity type cannot be determined.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610848425.7A (CN106445917B) | 2016-09-23 | 2016-09-23 | A kind of Chinese entity abstracting method of pattern-based bootstrapping |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106445917A | 2017-02-22 |
CN106445917B | 2019-02-19 |
Family
ID=58167285
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201610848425.7A (CN106445917B) | Active | 2016-09-23 | 2016-09-23 | A kind of Chinese entity abstracting method of pattern-based bootstrapping |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106445917B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11238363B2 (en) * | 2017-04-27 | 2022-02-01 | Accenture Global Solutions Limited | Entity classification based on machine learning techniques |
CN108154198B (en) * | 2018-01-25 | 2021-07-13 | 北京百度网讯科技有限公司 | Knowledge base entity normalization method, system, terminal and computer readable storage medium |
CN111400458A (en) * | 2018-12-27 | 2020-07-10 | 上海智臻智能网络科技股份有限公司 | Automatic generalization method and device |
CN110245354A (en) * | 2019-06-20 | 2019-09-17 | 贵州电网有限责任公司 | The method of entity is extracted in a kind of schedule information |
CN111178045A (en) * | 2019-10-14 | 2020-05-19 | 深圳软通动力信息技术有限公司 | Automatic construction method of non-supervised Chinese semantic concept dictionary based on field, electronic equipment and storage medium |
CN111259134B (en) * | 2020-01-19 | 2023-08-08 | 出门问问信息科技有限公司 | Entity identification method, equipment and computer readable storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101196904A (en) * | 2007-11-09 | 2008-06-11 | 清华大学 | News keyword abstraction method based on word frequency and multi-component grammar |
CN103186556A (en) * | 2011-12-28 | 2013-07-03 | 北京百度网讯科技有限公司 | Method for obtaining and searching structural semantic knowledge and corresponding device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8271479B2 (en) * | 2009-11-23 | 2012-09-18 | International Business Machines Corporation | Analyzing XML data |
US9111211B2 (en) * | 2011-12-20 | 2015-08-18 | Bitly, Inc. | Systems and methods for relevance scoring of a digital resource |
- 2016-09-23: CN application CN201610848425.7A (patent CN106445917B), status Active
Also Published As
Publication number | Publication date |
---|---|
CN106445917A (en) | 2017-02-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106445917B (en) | A kind of Chinese entity abstracting method of pattern-based bootstrapping | |
CN111143479B (en) | Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm | |
CN105786991B (en) | In conjunction with the Chinese emotion new word identification method and system of user feeling expression way | |
CN104391942B (en) | Short essay eigen extended method based on semantic collection of illustrative plates | |
CN109408642A (en) | A kind of domain entities relation on attributes abstracting method based on distance supervision | |
CN110334213B (en) | Method for identifying time sequence relation of Hanyue news events based on bidirectional cross attention mechanism | |
CN107590133A (en) | The method and system that position vacant based on semanteme matches with job seeker resume | |
CN106257455B (en) | A kind of Bootstrapping method extracting viewpoint evaluation object based on dependence template | |
CN108509425A (en) | Chinese new word discovery method based on novelty | |
CN111680488B (en) | Cross-language entity alignment method based on knowledge graph multi-view information | |
CN105608070B (en) | A kind of character relation abstracting method towards headline | |
CN107895000B (en) | Cross-domain semantic information retrieval method based on convolutional neural network | |
CN108268539A (en) | Video matching system based on text analyzing | |
CN106445921B (en) | Utilize the Chinese text terminology extraction method of quadratic mutual information | |
CN103049501A (en) | Chinese domain term recognition method based on mutual information and conditional random field model | |
CN103984943A (en) | Scene text identification method based on Bayesian probability frame | |
CN108681574A (en) | A kind of non-true class quiz answers selection method and system based on text snippet | |
CN106446018B (en) | Query information processing method and device based on artificial intelligence | |
CN108509409A (en) | A method of automatically generating semantic similarity sentence sample | |
CN104298663B (en) | Method and device for translation consistency and statistical machine translation method and system | |
CN102054029A (en) | Figure information disambiguation treatment method based on social network and name context | |
Gast et al. | The areal factor in lexical typology | |
CN106569993A (en) | Method and device for mining hypernym-hyponym relation between domain-specific terms | |
CN107807910A (en) | A kind of part-of-speech tagging method based on HMM | |
CN105068990B (en) | A kind of English long sentence dividing method of more strategies of Machine oriented translation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||