CN109635297A - Entity disambiguation method, apparatus, computer device and computer storage medium - Google Patents
- Publication number: CN109635297A
- Application number: CN201811508089.7A
- Authority: CN (China)
- Prior art keywords: entity, gene, disambiguated, occurrence, candidate
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The present invention is applicable to the field of Internet technology and discloses an entity disambiguation method, apparatus, computer device and computer storage medium. The method comprises: constructing a gene of an entity to be disambiguated; determining candidate entities from an entity library according to semantic features of the entity to be disambiguated, where the semantic features include name form similarity, abbreviation information and reference features; and calculating a gene matching degree between a candidate entity and the entity to be disambiguated, and determining that the genes of the candidate entity and the entity to be disambiguated match when the gene matching degree exceeds a preset threshold. The entity disambiguation method provided by the invention can improve the effect of entity disambiguation and, in the course of disambiguation, gradually refines the linked entities and the knowledge base, which helps to improve data-processing efficiency in areas such as target analysis over massive text, knowledge base construction and question answering systems.
Description
Technical field
The invention belongs to the field of Internet technology, and in particular relates to an entity disambiguation method, apparatus, computer device and computer storage medium.
Background
Entity names are often ambiguous in natural language processing: a given name in a text may refer to several entities in the real world. The cause of this ambiguity is the freedom, diversity and ambiguity of natural language expression. Natural language processing (NLP) research has long focused on tasks such as machine translation, information retrieval, text summarization, question answering, information extraction, topic modeling and sentiment mining. Traditional NLP techniques based on syntactic analysis have developed slowly and produced few breakthroughs. With innovations in technologies such as deep learning, artificial intelligence has attracted wide attention in the NLP field. Because natural language descriptions contain synonyms, near-synonyms and polysemous words, natural language analysis becomes harder, and entity disambiguation is therefore a key problem in natural language processing. The main purpose of entity disambiguation is to identify the ambiguous entity names in a sentence and to give each ambiguous entity name the meaning that fits its context. Common entity disambiguation methods require a pre-existing, information-rich knowledge base, perform poorly on large-scale data sources, and achieve low disambiguation precision on Internet data sources.
Summary of the invention
Embodiments of the present invention provide an entity disambiguation method, apparatus, computer device and computer storage medium, which aim to solve the problems that prior-art methods require a pre-existing, information-rich knowledge base, perform poorly on large-scale data sources, and achieve low entity disambiguation precision on Internet data sources.
The invention is realized as follows. An entity disambiguation method includes the following process:
Constructing a gene of an entity to be disambiguated, where the gene includes a co-occurrence entity word gene and an entity attribute gene, the co-occurrence entity word gene includes co-occurrence entity words and their co-occurrence degrees, and the entity attribute gene includes the attributes of the entity to be disambiguated;
Determining candidate entities from an entity library according to semantic features of the entity to be disambiguated, where the semantic features of the entity to be disambiguated include name form similarity, abbreviation information and reference features;
Calculating the gene matching degree between a candidate entity and the entity to be disambiguated, and, when the gene matching degree exceeds a preset threshold, determining that the genes of the candidate entity and the entity to be disambiguated match.
Further, calculating the gene matching degree between the candidate entity and the entity to be disambiguated includes the following process:
Obtaining the co-occurrence entity gene matching degree between the candidate entity and the entity to be disambiguated;
Obtaining the attribute gene matching degree between the candidate entity and the entity to be disambiguated;
Calculating the gene matching degree between the candidate entity and the entity to be disambiguated from the co-occurrence entity gene matching degree and the attribute gene matching degree, using the formula:
$$\mathrm{score}_g(m, e) = \alpha \cdot \mathrm{score}_w(m, e) + \beta \cdot \mathrm{score}_p(m, e)$$
where $\mathrm{score}_g(m, e)$ is the gene matching degree, $\mathrm{score}_w(m, e)$ is the co-occurrence entity gene matching degree, $\mathrm{score}_p(m, e)$ is the attribute gene matching degree, and $\alpha$ and $\beta$ are weights.
Further, obtaining the co-occurrence entity gene matching degree between the candidate entity and the entity to be disambiguated includes the following process:
Determining the co-occurrence entity words of the entity to be disambiguated from pre-stored documents, and determining the co-occurrence entity words of the candidate entity from the pre-stored documents;
Obtaining the gene entity word set of the candidate entity;
Calculating the co-occurrence entity gene matching degree from three quantities: the term frequency of the overlap between the co-occurrence entity words of the entity to be disambiguated and those of the candidate entity, relative to the total number of pre-stored entity words; the term frequency of that overlap relative to the gene entity word set of the candidate entity; and the inverse document frequency of that overlap.
Further, obtaining the attribute gene matching degree between the candidate entity and the entity to be disambiguated includes the following process:
Determining the attribute names of the entity to be disambiguated and the attribute names of the candidate entity;
Calculating the attribute gene matching degree from the overlapping attributes between the attribute names of the entity to be disambiguated and those of the candidate entity, and from the weights of the overlapping attributes.
Further, after determining that the genes of the candidate entity and the entity to be disambiguated match, the entity disambiguation method further includes:
Merging the genes of the candidate entity and the entity to be disambiguated;
Determining, according to the gene matching degree between the candidate entity and the entity to be disambiguated, the target entity that matches the entity to be disambiguated, and performing knowledge fusion on the corresponding entity in the knowledge base according to the knowledge of the target entity.
The present invention also provides an entity disambiguation apparatus, comprising:
A construction module, configured to construct a gene of an entity to be disambiguated, where the gene includes a co-occurrence entity word gene and an entity attribute gene, the co-occurrence entity word gene includes co-occurrence entity words and their co-occurrence degrees, and the entity attribute gene includes the attributes of the entity to be disambiguated;
A screening module, configured to screen candidate entities from an entity library according to semantic features of the entity to be disambiguated, where the semantic features of the entity to be disambiguated include name form similarity, abbreviation information and reference features;
A matching module, configured to calculate the gene matching degree between a candidate entity and the entity to be disambiguated and, when the gene matching degree exceeds a preset threshold, to determine that the genes of the candidate entity and the entity to be disambiguated match.
Further, the matching module includes:
A first acquisition submodule, configured to obtain the co-occurrence entity gene matching degree and the attribute gene matching degree between the candidate entity and the entity to be disambiguated;
A first calculation submodule, configured to calculate the gene matching degree between the candidate entity and the entity to be disambiguated from the co-occurrence entity gene matching degree and the attribute gene matching degree, using the formula:
$$\mathrm{score}_g(m, e) = \alpha \cdot \mathrm{score}_w(m, e) + \beta \cdot \mathrm{score}_p(m, e)$$
where $\mathrm{score}_g(m, e)$ is the gene matching degree, $\mathrm{score}_w(m, e)$ is the co-occurrence entity gene matching degree, $\mathrm{score}_p(m, e)$ is the attribute gene matching degree, and $\alpha$ and $\beta$ are weights.
Further, the first acquisition submodule includes:
A first determination unit, configured to determine the co-occurrence entity words of the entity to be disambiguated from pre-stored documents, and to determine the co-occurrence entity words of the candidate entity from the pre-stored documents;
A first acquisition unit, configured to obtain the gene entity word set of the candidate entity;
A first calculation unit, configured to calculate the co-occurrence entity gene matching degree from three quantities: the term frequency of the overlap between the co-occurrence entity words of the entity to be disambiguated and those of the candidate entity, relative to the total number of pre-stored entity words; the term frequency of that overlap relative to the gene entity word set of the candidate entity; and the inverse document frequency of that overlap.
Further, the first acquisition submodule includes:
A second determination unit, configured to determine the attribute names of the entity to be disambiguated and the attribute names of the candidate entity;
A second calculation unit, configured to calculate the attribute gene matching degree from the overlapping attributes between the attribute names of the entity to be disambiguated and those of the candidate entity, and from the weights of the overlapping attributes.
Further, the entity disambiguation apparatus also includes:
A first fusion module, configured to merge the genes of the candidate entity and the entity to be disambiguated;
A second fusion module, configured to determine, according to the gene matching degree between the candidate entity and the entity to be disambiguated, the target entity that matches the entity to be disambiguated, and to perform knowledge fusion on the corresponding entity in the knowledge base according to the knowledge of the target entity.
The present invention also provides a computer device. The computer device includes a processor, and the processor implements the steps of the above entity disambiguation method when executing a computer program stored in a memory.
The present invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the above entity disambiguation method.
The entity disambiguation method provided by the invention constructs the gene of the entity to be disambiguated, calculates the gene matching degree between the candidate entity and the entity to be disambiguated, and disambiguates the entity words in the text according to that matching degree. Disambiguation accuracy on large-scale data sources is relatively high, which improves the precision of entity disambiguation on Internet data sources and the overall disambiguation effect. In the course of disambiguation, the linked entities and the knowledge base are gradually refined, which helps to improve data-processing efficiency for target analysis over massive text, knowledge base construction, question answering systems and the like.
Brief description of the drawings
Fig. 1 is a flow chart of the entity disambiguation method provided by an embodiment of the present invention;
Fig. 2 is a flow chart of calculating the gene matching degree between the candidate entity and the entity to be disambiguated, provided by an embodiment of the present invention;
Fig. 3 is a flow chart of obtaining the co-occurrence entity gene matching degree, provided by an embodiment of the present invention;
Fig. 4 is a flow chart of obtaining the attribute gene matching degree, provided by an embodiment of the present invention;
Fig. 5 is a flow chart of the entity disambiguation method after it is determined that the genes of the candidate entity and the entity to be disambiguated match, provided by an embodiment of the present invention;
Fig. 6 is a structural schematic diagram of an entity disambiguation apparatus provided by an embodiment of the present invention;
Fig. 7 is a structural schematic diagram of the matching module provided by an embodiment of the present invention;
Fig. 8 is one structural schematic diagram of the first acquisition submodule provided by an embodiment of the present invention;
Fig. 9 is another structural schematic diagram of the first acquisition submodule provided by an embodiment of the present invention;
Fig. 10 is a structural schematic diagram of another entity disambiguation apparatus provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
Fig. 1 shows the flow chart of the entity disambiguation method provided by an embodiment of the present invention. The entity disambiguation method includes the following process:
Step S101: construct the gene of the entity to be disambiguated.
In this embodiment, the gene includes a co-occurrence entity word gene and an entity attribute gene; the co-occurrence entity word gene includes co-occurrence entity words and their co-occurrence degrees, and the entity attribute gene includes the attributes of the entity to be disambiguated. The gene determines the characteristic information of the entity. For each entity word to be disambiguated, the co-occurrence frequencies of the entity words that co-occur with it in the documents are counted, and these co-occurring entity words together with their frequencies serve as the co-occurrence entity gene of the entity; the attributes recorded for the entity word in the document's attribute table serve as the entity attribute gene. In this embodiment, gene enhancement may also be applied, according to relationship-type attributes, to entity words whose co-occurrence degree is weak, finally yielding the co-occurrence entity gene. A relationship-type attribute indicates an association between two entities; for example, it may be a spousal relationship, a peer relationship or a father-daughter relationship. In this embodiment, information extraction may be performed on the document to find the relevant attributes of the entities in it, and the attributes are aligned to predefined attribute slots to standardize the expression of attribute names; this strong entity-attribute information serves as the entity attribute gene.
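As a rough illustration of step S101, the sketch below builds both parts of a gene from toy data. It assumes that each document has already been reduced to a list of entity words and that the attribute table is a list of (entity, attribute name, attribute value) triples; the function name `build_gene` and the dictionary layout are hypothetical, and the gene enhancement for weakly co-occurring words is not modeled.

```python
from collections import Counter


def build_gene(mention, documents, attribute_table):
    """Build a two-part 'gene' for one entity word (hypothetical layout).

    documents: list of lists of entity words, one list per document.
    attribute_table: list of (entity, attribute_name, attribute_value) triples.
    """
    cooccurrence = Counter()
    for entity_words in documents:
        if mention in entity_words:
            # Count every other entity word that shares a document with the mention.
            cooccurrence.update(w for w in entity_words if w != mention)

    attributes = {}
    for entity, name, value in attribute_table:
        if entity == mention:
            attributes.setdefault(name, set()).add(value)

    return {"cooccurrence": dict(cooccurrence), "attributes": attributes}


docs = [
    ["Zhang San", "Li Si", "Wang Group"],
    ["Zhang San", "Li Si"],
    ["Li Si", "Wangjia Village"],
]
table = [("Zhang San", "spouse", "Li Si"), ("Li Si", "employer", "Wang Group")]
gene = build_gene("Zhang San", docs, table)
```

Here the co-occurrence degree is simply a raw document-level count; a real system could weight it differently.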
In this embodiment, the process of obtaining the entity to be disambiguated from text mainly includes web page body extraction, text cleaning, language identification, text segmentation and entity attribute extraction.
Specifically, in web page body extraction, the body text is located mainly according to css-class names and the text content under <div> blocks, and auxiliary elements of the page, such as login, comment and share function buttons and side banners, are removed to obtain the body content of the web page.
Specifically, in text cleaning, HTML source tags such as <div>, <p> and <br> can be removed with suitable tools to obtain natural language text. Full-width/half-width conversion, removal of invisible characters, removal of emoticons, traditional/simplified Chinese conversion and similar steps are then applied to wash out useless, meaningless characters, and the cleaned natural language text is output.
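A minimal sketch of part of this cleaning step, using only the Python standard library, might look as follows. It covers tag removal, full-width to half-width conversion (via NFKC normalization, which is a stand-in and also folds other compatibility characters) and removal of invisible control and format characters; emoticon removal and traditional/simplified conversion are left out. The name `clean_text` is hypothetical.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")


def clean_text(raw):
    text = TAG_RE.sub(" ", raw)                  # drop tags such as <div>, <p>, <br>
    text = html.unescape(text)                   # decode HTML character references
    text = unicodedata.normalize("NFKC", text)   # full-width forms become half-width
    # Remove invisible control/format characters, but keep whitespace for now.
    text = "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    return re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace
```

A regular expression is only adequate for simple markup; a proper HTML parser would be the safer choice for real pages.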
Specifically, in language identification, one option is simply to check the code blocks to which the characters belong; in a Chinese article, for example, most characters fall in the Chinese code block and the ASCII block. Alternatively, language identification can be treated as a classification problem and solved with machine learning. After the language of the text has been determined, the system dispatches the text to the segmentation tool for that language.
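The code-block check can be sketched as below. The 0.5 threshold, the labels `"zh"` and `"latin"`, and the restriction to the basic CJK Unified Ideographs block are all assumptions made for illustration; a block check alone cannot separate English from Spanish, which is where the classification approach mentioned above would come in.

```python
def guess_language(text, threshold=0.5):
    """Guess a coarse language label from the code blocks of the letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "unknown"
    # Basic CJK Unified Ideographs block only (an illustrative simplification).
    cjk = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    ascii_letters = sum(1 for ch in letters if ch.isascii())
    if cjk / len(letters) >= threshold:
        return "zh"
    if ascii_letters / len(letters) >= threshold:
        return "latin"
    return "unknown"
```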
Specifically, in text segmentation, a language-specific Chinese, English or Spanish segmentation tool is used according to the language identification result. Each tool outputs a segmentation result containing words, parts of speech and named entity words, which is then converted to a unified representation.
Specifically, in entity attribute extraction, the system first uses NLP tools to resolve pronoun references as far as possible, replacing each pronoun with the name of the entity it refers to, and then extracts information from the text using rules and an attribute extraction tool. Rule-based information extraction has two parts: building the extraction rules and applying them. The attribute extraction tool first splits each sentence into a series of clauses and then shortens each clause as far as possible to obtain shorter sentence fragments. These fragments are then split into triples, each consisting of an entity, an attribute and an attribute value. The triples form the entity attribute table of the document, which the system finally outputs.
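To make the triple format concrete, here is a deliberately tiny rule-based extractor. The single rule ("X's Y is Z") and the names `RULE` and `extract_triples` are invented for this sketch; pronoun resolution and clause shortening, which the text describes, are not modeled.

```python
import re

# One hypothetical extraction rule: "<entity>'s <attribute> is <value>".
RULE = re.compile(
    r"(?P<entity>[A-Z][\w ]*?)'s (?P<attribute>[\w ]+?) is (?P<value>[^.;]+)"
)


def extract_triples(text):
    """Return (entity, attribute, value) triples found by the rule."""
    triples = []
    for clause in re.split(r"[.;]\s*", text):   # crude clause splitting
        match = RULE.search(clause)
        if match:
            triples.append((
                match["entity"].strip(),
                match["attribute"].strip(),
                match["value"].strip(),
            ))
    return triples
```

The returned list plays the role of the document's entity attribute table.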
After text preprocessing and NLP analysis, the resulting entity attribute table of the document serves as the direct input to entity disambiguation. Text preprocessing and NLP analysis are indispensable to entity disambiguation, but the specific techniques may be replaced by NLP processing methods other than those described above. Configuring these steps helps to speed up subsequent processing, supplies high-quality input text to the disambiguation process, lowers the input requirements of the system, and improves extensibility while widening the range of data that can be handled.
Step S102: determine candidate entities from the entity library according to the semantic features of the entity to be disambiguated.
In this embodiment, the semantic features of the entity to be disambiguated include name form similarity, abbreviation information and reference features. Candidate entities are determined in different ways for different languages. One or more candidate entities may be determined from the entity library; when there are several, they form a candidate entity set. Name form similarity can be the similarity in form between entity nouns. For example, the person name Zhang San in article A and the person name Zhang San in article B have name form similarity, and the English names "jack" and "jackie", or "trump" and "donald trump", are similar. Abbreviation information may include Chinese and English abbreviations; for example, Hunan Province can be abbreviated to Hunan, and media access control address can be abbreviated to MAC address. A reference feature can be understood as the referent that a pronoun in the document points to.
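The candidate screening of step S102 can be approximated as follows. This sketch uses `difflib.SequenceMatcher` as a stand-in for name form similarity, treats containment ("trump" inside "donald trump") as similar, and expands abbreviations through a small hypothetical lookup table; the 0.6 cut-off is arbitrary, and reference features are not modeled.

```python
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real system would load a much larger one.
ABBREVIATIONS = {"mac address": "media access control address"}


def normalise(name):
    name = name.lower().strip()
    return ABBREVIATIONS.get(name, name)


def select_candidates(mention, entity_library, min_similarity=0.6):
    """Return the library entries whose names resemble the mention."""
    target = normalise(mention)
    candidates = []
    for entity in entity_library:
        name = normalise(entity)
        similarity = SequenceMatcher(None, target, name).ratio()
        # A contained name also counts as similar in form.
        if similarity >= min_similarity or target in name or name in target:
            candidates.append(entity)
    return candidates
```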
Step S103: calculate the gene matching degree between the candidate entity and the entity to be disambiguated, and, when the gene matching degree exceeds a preset threshold, determine that the genes of the candidate entity and the entity to be disambiguated match.
In this embodiment, when a candidate entity set exists, the candidate entities in the set are scanned and the gene matching degree between each candidate entity and the entity to be disambiguated is calculated; when the gene matching degree exceeds the preset threshold, the candidate entity is determined to match the entity to be disambiguated. The larger the preset threshold, the higher the matching precision; the smaller the threshold, the lower the precision. The threshold can be set according to actual needs. When a candidate entity matches the entity to be disambiguated, the gene of the entity to be disambiguated is merged with the gene of the candidate entity, so that the entity accumulates. The remaining candidate entities in the set are then scanned until no candidate entity in the set can be matched.
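The scan-match-merge loop can be sketched as below. To keep the example short, each gene is reduced to a plain set of co-occurring words and Jaccard overlap stands in for the gene matching degree; both simplifications, the 0.2 threshold and the function names are assumptions of this sketch.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def scan_candidates(mention_words, candidates, threshold=0.2):
    """Repeatedly scan the candidate set, absorbing every candidate that matches."""
    merged = set(mention_words)
    remaining = dict(candidates)
    matched = []
    progress = True
    while progress:                      # stop once a full pass matches nothing
        progress = False
        for name in list(remaining):
            if jaccard(merged, remaining[name]) > threshold:
                merged |= remaining.pop(name)   # merge genes so the entity accumulates
                matched.append(name)
                progress = True
    return matched, merged


candidates = {
    "B": {"Wangjia Village", "Zhao Liu"},
    "A": {"Li Si", "Wang Group", "Wangjia Village"},
    "C": {"Beijing"},
}
matched, merged = scan_candidates({"Li Si", "Wang Group"}, candidates)
```

In this toy run, candidate B does not match on the first pass, but it does match after A has been merged in, which illustrates why the scan is repeated.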
The entity disambiguation method provided by this embodiment constructs the gene of the entity to be disambiguated, calculates the gene matching degree between the candidate entity and the entity to be disambiguated, and disambiguates the entity words in the text according to that matching degree. It performs relatively well on large-scale data sources, improves the precision of entity disambiguation on Internet data sources, and improves the overall disambiguation effect.
Referring to Fig. 2, calculating the gene matching degree between the candidate entity and the entity to be disambiguated in step S103 includes the following process:
Step S1031: obtain the co-occurrence entity gene matching degree between the candidate entity and the entity to be disambiguated.
Step S1032: obtain the attribute gene matching degree between the candidate entity and the entity to be disambiguated.
Step S1033: calculate the gene matching degree between the candidate entity and the entity to be disambiguated from the co-occurrence entity gene matching degree and the attribute gene matching degree, using the formula:
$$\mathrm{score}_g(m, e) = \alpha \cdot \mathrm{score}_w(m, e) + \beta \cdot \mathrm{score}_p(m, e)$$
where $\mathrm{score}_g(m, e)$ is the gene matching degree, $\mathrm{score}_w(m, e)$ is the co-occurrence entity gene matching degree, $\mathrm{score}_p(m, e)$ is the attribute gene matching degree, and $\alpha$ and $\beta$ are weights.
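The weighted combination and the threshold test are simple enough to show directly. The weights 0.6 and 0.4 and the threshold 0.5 below are placeholder values chosen for illustration; the source does not specify them.

```python
def gene_match_score(score_w, score_p, alpha=0.6, beta=0.4):
    """score_g = alpha * score_w + beta * score_p."""
    return alpha * score_w + beta * score_p


def is_match(score_w, score_p, threshold=0.5, alpha=0.6, beta=0.4):
    """Genes match when the combined score exceeds the preset threshold."""
    return gene_match_score(score_w, score_p, alpha, beta) > threshold
```

For example, with these placeholder weights a co-occurrence score of 0.8 and an attribute score of 0.25 combine to 0.58, which exceeds the threshold.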
Referring to Fig. 3, step S1031 includes the following process:
Step S10311: determine the co-occurrence entity words of the entity to be disambiguated from pre-stored documents, and determine the co-occurrence entity words of the candidate entity from the pre-stored documents;
Step S10312: obtain the gene entity word set of the candidate entity;
Step S10313: calculate the co-occurrence entity gene matching degree from three quantities: the term frequency of the overlap between the co-occurrence entity words of the entity to be disambiguated and those of the candidate entity, relative to the total number of pre-stored entity words; the term frequency of that overlap relative to the gene entity word set of the candidate entity; and the inverse document frequency of that overlap.
In this embodiment, the gene entity word set is the set of entity words related to an entity. For the entity Zhang San, for example, the gene entity word set may include words such as Li Si, Wang Group and Wangjia Village.
To elaborate, the method borrows the idea of TF-IDF (term frequency-inverse document frequency), a weighting technique commonly used in information retrieval and data mining, where TF stands for term frequency and IDF for inverse document frequency. All entities are treated as documents D, and the entity words of an entity's co-occurrence entity gene are treated as terms T. The tf of the entity words in the co-occurrence entity gene of the entity to be disambiguated and the tf of the entity words in the co-occurrence entity gene of the candidate entity are both normalized. Normalized TF is sensitive to the number of gene entity words, and the co-occurrence entity gene of a candidate entity is generally long, so the weights of the overlapping co-occurrence entity words are strengthened before normalization in order to reinforce weak signals. The idf is computed with a caching mechanism to improve calculation speed. Finally, the entity word gene matching degree is obtained from the tf and idf of the overlapping part of the co-occurrence entity word genes. Specifically, the co-occurrence entity gene matching degree $\mathrm{score}_w(m, e)$ is calculated by a formula (the formula itself is not reproduced in this text).
For the entity to be disambiguated and the candidate entity, $m$ and $e$, the normalized tf value of an overlapping gene entity word $w$ is likewise given by a formula that is not reproduced here. In it, $\mathrm{weit}(w, e)$ denotes the weight of gene entity word $w$ of entity $e$, $\mathrm{weit}(e, w)$ denotes the weight of gene entity word $e$ of entity $w$, and $W(m)$ is the gene entity word set of entity $m$. Note that the formula for $\mathrm{score}_w(m, e)$ does not distinguish between the entity to be disambiguated and the candidate entity: $m$ and $e$ are both treated simply as entities and are interchangeable. $\mathrm{tf}_{norm}(w; m^*, e)$ denotes the tf of word $w$ normalized by the total for $m$ after the overlap of $m$ and $e$ has been strengthened; $\mathrm{tf}_{norm}(w; e^*, m)$ is computed analogously and is not described again.
The inverse document frequency $\mathrm{idf}(w)$ of a co-occurrence entity word in the co-occurrence entity gene is also given by a formula that is not reproduced here, in which $E$ is the entity library, $W(e)$ is the gene entity word set of entity $e$, and $Z$ is a compensating constant that is generally small.
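Because the formulas themselves are missing from this text, the following is only a generic TF-IDF-style sketch of the idea described, not the patented calculation. It assumes a gene is a dictionary from word to weight, boosts overlapping words by a made-up factor before normalizing, uses `z` as a smoothing constant in the idf denominator (one plausible reading of the constant Z), and multiplies the two normalized tf values so that the score is symmetric in the two entities. Every function name and default value is hypothetical.

```python
import math


def idf(word, entity_library, z=0.01):
    """Smoothed idf; each entity's gene word set acts as one 'document'."""
    containing = sum(1 for words in entity_library.values() if word in words)
    return math.log(len(entity_library) / (containing + z))


def normalised_tf(gene, overlap, boost=2.0):
    """tf of the overlapping words after boosting them, normalised over the gene."""
    boosted = {w: c * (boost if w in overlap else 1.0) for w, c in gene.items()}
    total = sum(boosted.values())
    return {w: boosted[w] / total for w in overlap}


def cooccurrence_score(gene_m, gene_e, entity_library, z=0.01):
    """Symmetric overlap score built from normalised tf and idf (assumed form)."""
    overlap = set(gene_m) & set(gene_e)
    if not overlap:
        return 0.0
    tf_m = normalised_tf(gene_m, overlap)
    tf_e = normalised_tf(gene_e, overlap)
    return sum(tf_m[w] * tf_e[w] * idf(w, entity_library, z) for w in overlap)
```

In practice the idf values would be cached, as the text notes, since they depend only on the entity library and not on the pair being compared.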
Referring to Fig. 4, step S1032 includes the following process:
Step S10321: determine the attribute names of the entity to be disambiguated and the attribute names of the candidate entity;
Step S10322: calculate the attribute gene matching degree from the overlapping attributes between the attribute names of the entity to be disambiguated and those of the candidate entity, and from the weights of the overlapping attributes.
In this embodiment, attribute names were aligned when the genes were constructed, so attribute gene matching only needs to compare the attribute values under the same attribute name. The entity attribute gene matching degree is a weighted sum over the overlapping attributes. When attribute values are compared, the strings are not required to be identical; instead, their similarity is computed with an edit-distance-based similarity algorithm. When the weighted-sum model for attribute matching is built, fuzzy matching can be tuned for the values of certain special attributes, and the weights of highly discriminative attributes, such as ID card number and spouse, can be increased. Specifically, the attribute gene matching degree $\mathrm{score}_p(m, e)$ is calculated by a formula (the formula itself is not reproduced in this text).
In that formula, $PN(x)$ denotes the attribute names of all attributes of entity $x$, $pv(x, pn)$ denotes all attribute values of entity $x$ for attribute name $pn$, $\mathrm{weit}_p(pn)$ denotes the weight of attribute name $pn$, and $I_{pv}(v_i, v_j) \in \{0, 1\}$ is an indicator function stating whether the attribute values $v_i$ and $v_j$ are the same. In the formula for $I_{pv}(v_i, v_j)$, which is also not reproduced here, $\mathrm{sim}_{pv}(v_i, v_j)$ denotes the normalized string similarity, $\theta_p$ is a high threshold close to 1, and $I_{pfix}(v_i, v_j)$ and $I_{sfix}(v_i, v_j)$ indicate whether one of $v_i$, $v_j$ is a prefix or a suffix, respectively, of the other.
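A possible reading of this attribute matching is sketched below. The exact formulas are not available in this text, so the sketch assumes that the indicator is 1 when the normalized edit-distance similarity reaches the threshold or when one value is a prefix or suffix of the other, and that the score is a plain weighted sum over shared attribute names. The default threshold 0.8, the default weight 1.0 and all function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def same_value(a, b, theta=0.8):
    """Indicator: near-identical strings, or one a prefix/suffix of the other."""
    if not a or not b:
        return 0   # empty values carry no evidence
    if similarity(a, b) >= theta:
        return 1
    if a.startswith(b) or b.startswith(a) or a.endswith(b) or b.endswith(a):
        return 1
    return 0


def attribute_score(attrs_m, attrs_e, weights, theta=0.8):
    """Weighted sum over attribute names shared by both entities."""
    score = 0.0
    for name in set(attrs_m) & set(attrs_e):
        hit = any(same_value(vi, vj, theta)
                  for vi in attrs_m[name] for vj in attrs_e[name])
        score += weights.get(name, 1.0) * hit
    return score
```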
Referring to Fig. 5, after step S103 the method further includes:
Step S104: merge the genes of the candidate entity and the entity to be disambiguated.
Step S105: determine, according to the gene matching degree between the candidate entity and the entity to be disambiguated, the target entity that matches the entity to be disambiguated, and perform knowledge fusion on the corresponding entity in the knowledge base according to the knowledge of the target entity.
In this embodiment, when the candidate entity and the entity to be disambiguated match, the gene of the entity to be disambiguated is merged with the gene of the candidate entity, so that the entity accumulates. The entity words and weights of the co-occurrence entity gene and the entity attribute gene are merged separately, and a growth window is set on the co-occurrence entity gene to reserve room for important words that appear later. In this embodiment, as the entity words in the text are disambiguated, the entity library is gradually improved.
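Gene merging can be sketched as follows, reusing a dictionary layout with `"cooccurrence"` and `"attributes"` keys (a hypothetical structure). The growth window is interpreted here as a cap on the number of co-occurrence words kept after merging, chosen larger than typical gene sizes so that later words still have room; that interpretation and the default of 1000 are assumptions, since the source does not define the window precisely.

```python
def merge_genes(gene_a, gene_b, window=1000):
    """Merge two genes; keep at most `window` co-occurrence words by weight."""
    cooccurrence = dict(gene_a["cooccurrence"])
    for word, weight in gene_b["cooccurrence"].items():
        cooccurrence[word] = cooccurrence.get(word, 0) + weight   # add up weights
    kept = sorted(cooccurrence.items(), key=lambda item: item[1], reverse=True)[:window]

    attributes = {name: set(values) for name, values in gene_a["attributes"].items()}
    for name, values in gene_b["attributes"].items():
        attributes.setdefault(name, set()).update(values)         # union of values

    return {"cooccurrence": dict(kept), "attributes": attributes}
```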
In this embodiment, the knowledge base can be updated according to the matching and disambiguation result. The results fall roughly into three cases: no entity is matched, one entity is matched, or several entities are matched. Attribute fusion is applied to the knowledge base separately for each of the three cases.
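One way to organize the three cases is shown below. The source does not say how each case is handled, so this sketch assumes that an unmatched mention becomes a new knowledge base entry keyed by its own name, and that with one or several matches the attributes are fused into the highest-scoring entity. The knowledge base layout and the name `update_knowledge_base` are hypothetical.

```python
def update_knowledge_base(kb, mention, attributes, matches):
    """Fuse attributes into the knowledge base and return the target entity id.

    kb: dict entity_id -> dict attribute_name -> set of values.
    matches: list of (entity_id, gene_matching_degree) pairs above the threshold.
    """
    if not matches:
        # Case 1: no matching entity, so register the mention as a new entity.
        kb[mention] = {name: set(values) for name, values in attributes.items()}
        return mention
    # Cases 2 and 3: one or several matches; take the best-scoring entity (assumption).
    target = max(matches, key=lambda pair: pair[1])[0]
    for name, values in attributes.items():
        kb[target].setdefault(name, set()).update(values)
    return target
```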
The entity disambiguation method provided by this embodiment constructs the gene of the entity to be disambiguated, calculates the gene matching degree between the candidate entity and the entity to be disambiguated, and disambiguates the entity words in the text according to that matching degree. Disambiguation accuracy on large-scale data sources is relatively high, which improves the precision of entity disambiguation on Internet data sources and the overall disambiguation effect. In the course of disambiguation, the linked entities and the knowledge base are gradually refined, which helps to improve data-processing efficiency for target analysis over massive text, knowledge base construction, question answering systems and the like.
Fig. 6 shows a schematic structural diagram of an entity disambiguator 600 provided by an embodiment of the present invention. For ease of description, only the parts relevant to the embodiment of the present invention are shown. The entity disambiguator 600 includes:
A construction module 601, configured to construct the gene of the entity to be disambiguated.
In this embodiment, the gene includes a co-occurrence entity word gene and an entity attribute gene. The co-occurrence entity word gene includes co-occurrence entity words and their co-occurrence degrees, and the entity attribute gene includes the attributes of the entity to be disambiguated. The gene determines the characteristic information of the entity. For each entity word to be disambiguated, the co-occurrence entity words in the document and their co-occurrence frequencies are counted and used as the co-occurrence entity gene of the entity, and the attributes of the entity word to be disambiguated in the attribute table of the document are used as the entity attribute gene. In this embodiment, gene enhancement may also be applied, according to relational attributes, to entity words with a weak co-occurrence degree, to obtain the final co-occurrence entity gene. It can be understood that a relational attribute indicates an association between two entities; for example, a relational attribute may be a spousal relationship, a peer relationship, a father-daughter relationship, and so on. In this embodiment, information extraction may be performed on the document to find the relevant attributes of the entities in the document, and the attributes are aligned to predefined attribute slots to standardize the expression of attribute names. This strong information, the attributes of the entity, is used as the entity attribute gene.
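The gene construction described above can be sketched in Python as follows. The input layout (one list of entity words per document, and an attribute table of triples) and the name `build_gene` are assumptions made for illustration; the gene enhancement based on relational attributes is omitted.

```python
from collections import Counter


def build_gene(target, documents, attribute_table):
    """Build a gene for the entity word `target` (assumed data layout).

    documents:       list of lists, each holding the entity words of one document.
    attribute_table: list of (entity, attribute_name, value) triples.
    """
    cooccur = Counter()
    for entity_words in documents:
        if target in entity_words:
            # Count every other entity word that shares a document with the target.
            cooccur.update(w for w in entity_words if w != target)

    attrs = {}
    for entity, name, value in attribute_table:
        if entity == target:
            attrs.setdefault(name, set()).add(value)
    return {"cooccur": dict(cooccur), "attrs": attrs}
```

Here the raw co-occurrence count stands in for the co-occurrence degree; a real system might weight or normalize it.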
In this embodiment, obtaining the entity to be disambiguated from text mainly includes web page body extraction, text cleaning, language identification, text segmentation, and entity attribute extraction.
Specifically, in web page body extraction, the part that carries the main content is located mainly according to the css-class names and the text content under <div> blocks, and auxiliary elements of the web page, such as login, comment, and share buttons, margins, banners, and draggable elements, are removed to obtain the body content of the web page.
Specifically, in text cleaning, tools can be used to remove HTML source code such as <div>, <p>, and <br> to obtain natural language text. Full-width/half-width conversion, removal of invisible characters, removal of emoticons, traditional/simplified Chinese conversion, and similar steps are then applied to wash out useless, meaningless characters and output natural language text.
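A simplified cleaning pass of this kind might look as follows, using only the Python standard library. The function name `clean_text` is hypothetical, the regular expression is a crude stand-in for a real HTML parser, and emoticon removal and traditional/simplified conversion are left out because they need additional resources.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")


def clean_text(raw):
    """Turn an HTML fragment into plain text (simplified cleaning pass)."""
    text = TAG_RE.sub(" ", raw)                 # drop tags such as <div>, <p>, <br>
    text = html.unescape(text)                  # decode entities such as &amp;
    text = unicodedata.normalize("NFKC", text)  # full-width forms -> half-width
    # Remove control and format characters (Unicode categories Cc and Cf),
    # which covers zero-width spaces and similar invisible characters.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
```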
Specifically, in language identification, one option is simply to check the encoding blocks to which the characters belong; for example, in a Chinese article most characters fall in the Chinese encoding block and the ASCII block. Alternatively, language identification can be treated as a classification problem and solved with machine learning methods. After the language of the text is determined, the system dispatches the text to the segmentation tool corresponding to that language.
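The encoding-block check can be sketched as below. The 0.5 cut-off, the function name `guess_language`, and the restriction to the basic CJK Unified Ideographs block are assumptions. A block check alone cannot separate languages that share the Latin alphabet, such as English and Spanish, which is where the classification approach would be needed.

```python
def guess_language(text):
    """Return "zh", "latin", or "unknown" from the share of CJK ideographs."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "unknown"
    # Count letters in the basic CJK Unified Ideographs block (U+4E00..U+9FFF).
    cjk = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return "zh" if cjk / len(letters) > 0.5 else "latin"
```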
Specifically, in text segmentation, dedicated Chinese, English, or Spanish segmentation tools are used according to the language identification result. They output segmentation results containing words, parts of speech, and named entity words, which are converted into a unified representation.
Specifically, in entity attribute extraction, the system first uses NLP tools to resolve pronoun references as far as possible, replacing each pronoun with the name of the entity it refers to, and then extracts information from the text using rules and an attribute extraction tool. Rule-based information extraction consists of two parts: building the extraction rules and applying them. The attribute extraction tool first splits each sentence into a series of clauses and then shortens each clause as much as possible to obtain shorter sentence fragments. These fragments are then split into triples, each consisting of an entity, an attribute, and an attribute value. The triples serve as the entity attribute table of the document, which the system finally outputs.
After text preprocessing and NLP analysis, the resulting entity attribute table of the document can be used directly as the input to entity disambiguation. Text preprocessing and NLP analysis are indispensable to entity disambiguation, but the specific techniques may be replaced by NLP processing methods other than those described above. Configuring these steps helps optimize subsequent processing speed, provides high-quality input text for the disambiguation process, lowers the requirements on system input, and improves scalability while widening the range of data that can be processed.
A screening module 602, configured to determine candidate entities from the entity library according to the semantic features of the entity to be disambiguated, where the semantic features of the entity to be disambiguated include name form similarity, abbreviation information, and reference features.
In this embodiment, candidate entities are determined from the entity library in different ways for different languages. One or more candidate entities may be determined; when there are multiple candidate entities, they form a candidate entity set.
It can be understood that name form similarity is the similarity in form between entity names. For example, the personal name Zhang San in article A and the personal name Zhang San in article B have name form similarity, as do the English names "jack" and "jackie", and "trump" and "donald trump". Abbreviation information may include Chinese abbreviation information and English abbreviation information; for example, the abbreviation of Hunan Province is Hunan, and a media access control address may be abbreviated as MAC address. A reference feature can be understood as the reference relation corresponding to a pronoun in the document.
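Candidate screening by name form and abbreviation could be approximated as follows. This sketch uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the function name `select_candidates`, the `aliases` lookup table, and the 0.6 cut-off are assumptions, and reference features are not modelled.

```python
from difflib import SequenceMatcher


def select_candidates(mention, entity_names, aliases=None, min_ratio=0.6):
    """Return the names in entity_names that plausibly refer to mention."""
    aliases = aliases or {}
    # Expand a known abbreviation to its full form before comparing.
    query = aliases.get(mention, mention).lower()
    candidates = []
    for name in entity_names:
        lowered = name.lower()
        ratio = SequenceMatcher(None, query, lowered).ratio()
        # Accept near-identical names and names that contain one another.
        if ratio >= min_ratio or query in lowered or lowered in query:
            candidates.append(name)
    return candidates
```

With this rule, "trump" selects "Donald Trump" through containment, and "MAC address" selects "Media Access Control address" once the alias table maps the abbreviation to its full form.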
A matching module 603, configured to calculate the gene matching degree between the candidate entity and the entity to be disambiguated and, when the gene matching degree exceeds a preset threshold, to determine that the gene of the candidate entity matches the gene of the entity to be disambiguated.
In this embodiment, when there is a candidate entity set, the candidate entities in the set can be scanned and the gene matching degree between each candidate entity and the entity to be disambiguated calculated; when the gene matching degree exceeds the preset threshold, the candidate entity is determined to match the entity to be disambiguated. A larger preset threshold means higher matching precision and a smaller preset threshold means lower matching precision; the threshold can be set according to actual needs. When a candidate entity matches the entity to be disambiguated, the gene of the entity to be disambiguated is merged with the gene of the candidate entity, so that the entity accumulates information. The other candidate entities in the candidate entity set are then scanned until no matchable candidate entity remains in the set.
The entity disambiguator provided by the embodiment of the present invention constructs the gene of the entity to be disambiguated, calculates the gene matching degree between the candidate entity and the entity to be disambiguated, and disambiguates the entity words in the text according to the gene matching degree. It performs entity disambiguation on large-scale data sources relatively well, improves the precision of entity disambiguation on Internet data sources, and improves the effect of entity disambiguation.
Referring to Fig. 7, the matching module 603 includes:
A first acquisition submodule 6031, configured to obtain the co-occurrence entity gene matching degree and the attribute gene matching degree between the candidate entity and the entity to be disambiguated.
A first computing submodule 6032, configured to calculate the gene matching degree between the candidate entity and the entity to be disambiguated according to the co-occurrence entity gene matching degree and the attribute gene matching degree, using the formula:
score_g(m, e) = α * score_w(m, e) + β * score_p(m, e);
where score_g(m, e) is the gene matching degree, score_w(m, e) is the co-occurrence entity gene matching degree, score_p(m, e) is the attribute gene matching degree, and α and β are weights.
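The weighted combination and the threshold test can be written directly. In the sketch below the default values of `alpha`, `beta`, and `threshold` are placeholders chosen for illustration, since the disclosure leaves them to be set as needed; the function names are likewise hypothetical.

```python
def gene_match_score(score_w, score_p, alpha=0.6, beta=0.4):
    """score_g = alpha * score_w + beta * score_p."""
    return alpha * score_w + beta * score_p


def matching_candidates(partial_scores, threshold=0.5, alpha=0.6, beta=0.4):
    """Return the candidates whose gene matching degree exceeds the threshold.

    partial_scores maps each candidate to a (score_w, score_p) pair.
    The result is sorted from the best match to the weakest.
    """
    scored = {
        cand: gene_match_score(w, p, alpha, beta)
        for cand, (w, p) in partial_scores.items()
    }
    matched = [cand for cand, s in scored.items() if s > threshold]
    return sorted(matched, key=scored.get, reverse=True)
```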
Referring to Fig. 8, the first acquisition submodule 6031 includes:
A first determination unit 60311, configured to determine the co-occurrence entity words of the entity to be disambiguated from pre-stored documents, and to determine the co-occurrence entity words of the candidate entity from the pre-stored documents;
A first acquisition unit 60312, configured to obtain the gene entity word set of the candidate entity;
A first computing unit 60313, configured to calculate the co-occurrence entity gene matching degree according to: the word frequency of the overlap between the co-occurrence entity words of the entity to be disambiguated and those of the candidate entity relative to the pre-stored total number of entity words; the word frequency of that overlap relative to the gene entity word set of the candidate entity; and the inverse document frequency of that overlap.
In this embodiment, the gene entity word set is the set of entity words related to an entity. For example, for the entity Zhang San, the gene entity word set may include words such as Li Si, Wang Group, and Wangjia Village.
More specifically, the calculation borrows the idea of TF-IDF (term frequency-inverse document frequency), a weighting technique commonly used in information retrieval and data mining, where TF stands for term frequency and IDF stands for inverse document frequency. Every entity is treated as a document D, and the entity words of an entity's co-occurrence entity gene are treated as terms T. The tf of the entity words in the co-occurrence entity gene of the entity to be disambiguated and the tf of the entity words in the co-occurrence entity gene of the candidate entity are normalized. Normalized TF is sensitive to the number of gene entity words, and the co-occurrence entity gene of a candidate entity is generally long; therefore the weights of the overlapping co-occurrence entity words are strengthened before normalization, to reinforce weak signals. The idf is computed with a caching mechanism to speed up calculation. Finally, the entity word gene matching degree is obtained from the tf and idf of the overlapping part of the co-occurrence entity word genes. Specifically, the co-occurrence entity gene matching degree score_w(m, e) can be calculated according to the following formula:
Here, for the entity to be disambiguated m and the candidate entity e, the normalized tf value of an overlapping gene entity word w is calculated as:
Here, weit(w, e) denotes the weight of gene entity word w of entity e, weit(e, w) denotes the weight of gene entity word e of entity w, and W(m) is the gene entity word set of entity m. Note that the above formula for the co-occurrence entity gene matching degree score_w(m, e) does not distinguish between the entity to be disambiguated and the candidate entity: m and e are both treated as entities and are interchangeable. tf_norm(w; m*, e) denotes the value of word w normalized over the total for m after the overlap between m and e has been strengthened; tf_norm(w; e*, m) is calculated in the same way as tf_norm(w; m*, e) and is not described again here.
The inverse document frequency idf(w) of a co-occurrence entity word in the co-occurrence entity gene is calculated as:
Here, E is the entity library, W(e) is the gene entity word set of entity e, and Z is a compensating constant that is generally small.
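Because the formulas themselves are not reproduced in this text, the following Python sketch only illustrates the TF-IDF idea described above and should not be read as the disclosed formulas. It assumes idf(w) = log(|E| / (n_w + Z)), where n_w is the number of entities whose gene contains w, clamps the result at zero, normalizes each tf by the total gene weight, and omits the strengthening of overlapping words. The names `idf` and `cooccur_match` are hypothetical.

```python
import math


def idf(word, entity_genes, z=0.01):
    """Inverse document frequency of word, treating every entity as a document.

    entity_genes: {entity: {gene_word: weight}}; z is a small compensating constant.
    """
    containing = sum(1 for gene in entity_genes.values() if word in gene)
    return max(0.0, math.log(len(entity_genes) / (containing + z)))


def cooccur_match(m, e, entity_genes, z=0.01):
    """Score the overlap of the co-occurrence genes of m and e (symmetric)."""
    gene_m, gene_e = entity_genes[m], entity_genes[e]
    total_m, total_e = sum(gene_m.values()), sum(gene_e.values())
    score = 0.0
    for w in gene_m.keys() & gene_e.keys():
        tf_m = gene_m[w] / total_m  # normalized weight of w in the gene of m
        tf_e = gene_e[w] / total_e  # normalized weight of w in the gene of e
        score += tf_m * tf_e * idf(w, entity_genes, z)
    return score
```

In practice the idf values would be cached per word, as the description notes, rather than recomputed for every pair.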
Referring to Fig. 9, the first acquisition submodule 6031 includes:
A second determination unit 60314, configured to determine the attribute names of the entity to be disambiguated and the attribute names of the candidate entity;
A second computing unit 60315, configured to calculate the attribute gene matching degree according to the overlapping attributes between the attribute names of the entity to be disambiguated and the attribute names of the candidate entity and the weights of the overlapping attributes.
In this embodiment, attribute names are aligned when the genes are constructed, so attribute gene matching only needs to compare the attribute values under each attribute name. The entity attribute gene matching degree is the weighted sum over the overlapping attributes. When comparing attribute values, the value strings are not required to be identical; instead, their similarity is computed with an edit-distance-based similarity algorithm. When building the weighted-sum model for attribute matching, fuzzy matching can be optimized for the values of certain special attributes, and the weights of highly discriminative attributes, such as ID card number and spouse, can be increased. Specifically, the attribute gene matching degree score_p(m, e) can be calculated according to the following formula:
Here, PN(x) denotes the attribute names of all attributes of entity x, pv(x, pn) denotes all attribute values of entity x for attribute name pn, weit_p(pn) denotes the weight of attribute name pn, and I_pv(v_i, v_j) ∈ {0, 1} is an indicator function that indicates whether attribute values v_i and v_j are the same. I_pv(v_i, v_j) is calculated as follows:
Here, sim_pv(v_i, v_j) denotes the normalized string similarity; θ_p is a high threshold close to 1; and I_pfix(v_i, v_j) and I_sfix(v_i, v_j) indicate whether one of the values v_i, v_j is a prefix or a suffix, respectively, of the other.
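One plausible reading of the attribute matching described above is sketched below in Python. Several points are assumptions, because the formulas are not reproduced in this text: two values are taken to match when their similarity reaches θ_p or when one is a prefix or suffix of the other; an overlapping attribute counts once if any pair of its values matches; and `difflib.SequenceMatcher` replaces the edit-distance-based similarity. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def values_match(v1, v2, theta=0.9):
    """Indicator I_pv: 1 if the two attribute values are judged the same, else 0."""
    v1, v2 = v1.strip().lower(), v2.strip().lower()
    if not v1 or not v2:
        return 0
    similar = SequenceMatcher(None, v1, v2).ratio() >= theta
    prefix = v1.startswith(v2) or v2.startswith(v1)
    suffix = v1.endswith(v2) or v2.endswith(v1)
    return int(similar or prefix or suffix)


def attribute_match(attrs_m, attrs_e, weights, theta=0.9):
    """Weighted sum over the attribute names shared by both entities.

    attrs_m, attrs_e: {attribute_name: set of values}
    weights:          {attribute_name: weight}; missing names default to 1.0.
    """
    score = 0.0
    for name in attrs_m.keys() & attrs_e.keys():
        hit = any(
            values_match(a, b, theta)
            for a in attrs_m[name]
            for b in attrs_e[name]
        )
        score += weights.get(name, 1.0) * hit
    return score
```

Giving a highly discriminative attribute such as an ID card number a larger entry in `weights` reflects the weighting described above.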
Referring to Fig. 10, the entity disambiguator 600 further includes:
A first fusion module 604, configured to merge the gene of the candidate entity with the gene of the entity to be disambiguated.
A second fusion module 605, configured to determine, according to the gene matching degree between the candidate entity and the entity to be disambiguated, the target entity that matches the entity to be disambiguated, and to perform knowledge fusion on the corresponding entity in the knowledge base according to the knowledge of the target entity.
In this embodiment, when the candidate entity and the entity to be disambiguated can be matched, the gene of the entity to be disambiguated is merged with the gene of the candidate entity, so that the entity accumulates information. The entity words and weights of the co-occurrence entity gene and the entity attribute gene can be merged separately, and a growth window is set for the co-occurrence entity gene to reserve room for important words that appear later. In this embodiment, the entity library is gradually improved as the entity words in the text are disambiguated.
In this embodiment, the knowledge base can be updated according to the result of matching and disambiguation. The results fall roughly into three cases: no entity is matched, one entity is matched, or multiple entities are matched. Attribute fusion is performed on the knowledge base separately for each of the three cases.
The entity disambiguation method provided by the embodiment of the present invention constructs the gene of the entity to be disambiguated, calculates the gene matching degree between the candidate entity and the entity to be disambiguated, and disambiguates the entity words in the text according to the gene matching degree. It performs entity disambiguation on large-scale data sources relatively well, improves the precision of entity disambiguation on Internet data sources, and improves the effect of entity disambiguation. During entity disambiguation, the linked entities and the knowledge base are gradually improved, which helps improve data-processing efficiency in target analysis over massive text, knowledge base construction, question answering systems, and the like.
An embodiment of the present invention provides a computer device that includes a processor, where the processor, when executing a computer program stored in a memory, implements the steps of the entity disambiguation method provided by the above method embodiments.
Illustratively, the computer program may be divided into one or more modules, which are stored in the memory and executed by the processor to carry out the present invention. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer program in the computer device. For example, the computer program may be divided into the steps of the entity disambiguation method provided by the above method embodiments.
Those skilled in the art will understand that the above description of the computer device is only an example and does not limit the computer device, which may include more or fewer components than described above, combine certain components, or use different components; for example, it may include input/output devices, network access devices, buses, and the like.
The processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the computer device and connects the various parts of the entire computer device through various interfaces and lines.
The memory may be used to store the computer program and/or modules. The processor implements the various functions of the computer device by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created during use (such as audio data or a phone book). In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage device.
If the integrated modules/units of the computer device are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes of the above method embodiments of the present invention may also be completed by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of each of the above entity disambiguation method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, an electrical signal, a software distribution medium, and the like.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (12)
1. An entity disambiguation method, characterized in that the entity disambiguation method comprises:
constructing a gene of an entity to be disambiguated, the gene comprising a co-occurrence entity word gene and an entity attribute gene, wherein the co-occurrence entity word gene comprises co-occurrence entity words and co-occurrence degrees, and the entity attribute gene comprises attributes of the entity to be disambiguated;
determining a candidate entity from an entity library according to semantic features of the entity to be disambiguated, wherein the semantic features of the entity to be disambiguated comprise name form similarity, abbreviation information, and reference features;
calculating a gene matching degree between the candidate entity and the entity to be disambiguated, and, when the gene matching degree exceeds a preset threshold, determining that the gene of the candidate entity matches the gene of the entity to be disambiguated.
2. The entity disambiguation method according to claim 1, characterized in that calculating the gene matching degree between the candidate entity and the entity to be disambiguated comprises:
obtaining a co-occurrence entity gene matching degree between the candidate entity and the entity to be disambiguated;
obtaining an attribute gene matching degree between the candidate entity and the entity to be disambiguated;
calculating the gene matching degree between the candidate entity and the entity to be disambiguated according to the co-occurrence entity gene matching degree and the attribute gene matching degree, using the formula:
score_g(m, e) = α * score_w(m, e) + β * score_p(m, e)
where score_g(m, e) is the gene matching degree, score_w(m, e) is the co-occurrence entity gene matching degree, score_p(m, e) is the attribute gene matching degree, and α and β are weights.
3. The entity disambiguation method according to claim 2, characterized in that obtaining the co-occurrence entity gene matching degree between the candidate entity and the entity to be disambiguated comprises:
determining co-occurrence entity words of the entity to be disambiguated from pre-stored documents, and determining co-occurrence entity words of the candidate entity from the pre-stored documents;
obtaining a gene entity word set of the candidate entity;
calculating the co-occurrence entity gene matching degree according to: the word frequency of the overlap between the co-occurrence entity words of the entity to be disambiguated and the co-occurrence entity words of the candidate entity relative to the pre-stored total number of entity words; the word frequency of that overlap relative to the gene entity word set of the candidate entity; and the inverse document frequency of that overlap.
4. The entity disambiguation method according to claim 2, characterized in that obtaining the attribute gene matching degree between the candidate entity and the entity to be disambiguated comprises:
determining attribute names of the entity to be disambiguated and attribute names of the candidate entity;
calculating the attribute gene matching degree according to the overlapping attributes between the attribute names of the entity to be disambiguated and the attribute names of the candidate entity and the weights of the overlapping attributes.
5. The entity disambiguation method according to claim 1, characterized in that, after determining that the gene of the candidate entity matches the gene of the entity to be disambiguated, the entity disambiguation method further comprises:
merging the gene of the candidate entity with the gene of the entity to be disambiguated;
determining, according to the gene matching degree between the candidate entity and the entity to be disambiguated, a target entity that matches the entity to be disambiguated, and performing knowledge fusion on the corresponding entity in a knowledge base according to the knowledge of the target entity.
6. An entity disambiguator, characterized by comprising:
a construction module, configured to construct a gene of an entity to be disambiguated, the gene comprising a co-occurrence entity word gene and an entity attribute gene, wherein the co-occurrence entity word gene comprises co-occurrence entity words and co-occurrence degrees, and the entity attribute gene comprises attributes of the entity to be disambiguated;
a screening module, configured to screen a candidate entity from an entity library according to semantic features of the entity to be disambiguated, wherein the semantic features of the entity to be disambiguated comprise name form similarity, abbreviation information, and reference features;
a matching module, configured to calculate a gene matching degree between the candidate entity and the entity to be disambiguated and, when the gene matching degree exceeds a preset threshold, to determine that the gene of the candidate entity matches the gene of the entity to be disambiguated.
7. The entity disambiguator according to claim 6, characterized in that the matching module comprises:
a first acquisition submodule, configured to obtain a co-occurrence entity gene matching degree and an attribute gene matching degree between the candidate entity and the entity to be disambiguated;
a first computing submodule, configured to calculate the gene matching degree between the candidate entity and the entity to be disambiguated according to the co-occurrence entity gene matching degree and the attribute gene matching degree, using the formula:
score_g(m, e) = α * score_w(m, e) + β * score_p(m, e)
where score_g(m, e) is the gene matching degree, score_w(m, e) is the co-occurrence entity gene matching degree, score_p(m, e) is the attribute gene matching degree, and α and β are weights.
8. The entity disambiguator according to claim 7, characterized in that the first acquisition submodule comprises:
a first determination unit, configured to determine co-occurrence entity words of the entity to be disambiguated from pre-stored documents, and to determine co-occurrence entity words of the candidate entity from the pre-stored documents;
a first acquisition unit, configured to obtain a gene entity word set of the candidate entity;
a first computing unit, configured to calculate the co-occurrence entity gene matching degree according to: the word frequency of the overlap between the co-occurrence entity words of the entity to be disambiguated and the co-occurrence entity words of the candidate entity relative to the pre-stored total number of entity words; the word frequency of that overlap relative to the gene entity word set of the candidate entity; and the inverse document frequency of that overlap.
9. The entity disambiguator according to claim 7, characterized in that the first acquisition submodule comprises:
a second determination unit, configured to determine attribute names of the entity to be disambiguated and attribute names of the candidate entity;
a second computing unit, configured to calculate the attribute gene matching degree according to the overlapping attributes between the attribute names of the entity to be disambiguated and the attribute names of the candidate entity and the weights of the overlapping attributes.
10. The entity disambiguator according to claim 6, characterized by further comprising:
a first fusion module, configured to merge the gene of the candidate entity with the gene of the entity to be disambiguated;
a second fusion module, configured to determine, according to the gene matching degree between the candidate entity and the entity to be disambiguated, a target entity that matches the entity to be disambiguated, and to perform knowledge fusion on the corresponding entity in a knowledge base according to the knowledge of the target entity.
11. A computer device, characterized in that the computer device comprises a processor, and the processor, when executing a computer program stored in a memory, implements the steps of the entity disambiguation method according to any one of claims 1-5.
12. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the entity disambiguation method according to any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811508089.7A CN109635297B (en) | 2018-12-11 | 2018-12-11 | Entity disambiguation method and device, computer device and computer storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811508089.7A CN109635297B (en) | 2018-12-11 | 2018-12-11 | Entity disambiguation method and device, computer device and computer storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109635297A true CN109635297A (en) | 2019-04-16 |
CN109635297B CN109635297B (en) | 2022-01-04 |
Family
ID=66072632
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811508089.7A Active CN109635297B (en) | 2018-12-11 | 2018-12-11 | Entity disambiguation method and device, computer device and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109635297B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110134965A (en) * | 2019-05-21 | 2019-08-16 | 北京百度网讯科技有限公司 | Method, apparatus, equipment and computer readable storage medium for information processing |
CN110348012A (en) * | 2019-07-01 | 2019-10-18 | 北京明略软件系统有限公司 | Method, device, storage medium and electronic device for determining target character |
CN110427612A (en) * | 2019-07-02 | 2019-11-08 | 平安科技(深圳)有限公司 | Multilingual-based entity disambiguation method, device, equipment and storage medium |
CN110516252A (en) * | 2019-08-30 | 2019-11-29 | 京东方科技集团股份有限公司 | Data annotation method and device, computer equipment and storage medium |
CN110827831A (en) * | 2019-11-15 | 2020-02-21 | 广州洪荒智能科技有限公司 | Voice information processing method, device, equipment and medium based on man-machine interaction |
CN111259653A (en) * | 2020-01-15 | 2020-06-09 | 重庆邮电大学 | Knowledge graph question-answering method, system and terminal based on entity relationship disambiguation |
CN111401049A (en) * | 2020-03-12 | 2020-07-10 | 京东方科技集团股份有限公司 | Entity linking method and device |
CN111680498A (en) * | 2020-05-18 | 2020-09-18 | 国家基础地理信息中心 | Entity disambiguation method, device, storage medium and computer equipment |
CN113947087A (en) * | 2021-12-20 | 2022-01-18 | 太极计算机股份有限公司 | Label-based relation construction method and device, electronic equipment and storage medium |
CN115293158A (en) * | 2022-06-30 | 2022-11-04 | 撼地数智(重庆)科技有限公司 | Disambiguation method and device based on label assistance |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104182420A (en) * | 2013-05-27 | 2014-12-03 | 华东师范大学 | Ontology-based Chinese name disambiguation method |
CN106202382A (en) * | 2016-07-08 | 2016-12-07 | 南京缘长信息科技有限公司 | Link instance method and system |
CN107861939A (en) * | 2017-09-30 | 2018-03-30 | 昆明理工大学 | Domain entity disambiguation method fusing word vectors and a topic model |
CN108170662A (en) * | 2016-12-07 | 2018-06-15 | 富士通株式会社 | Abbreviation disambiguation method and disambiguation device |
CN108415902A (en) * | 2018-02-10 | 2018-08-17 | 合肥工业大学 | Named entity linking method based on search engine |
CN108959461A (en) * | 2018-06-15 | 2018-12-07 | 东南大学 | Entity linking method based on graph model |
CN108959258A (en) * | 2018-07-02 | 2018-12-07 | 昆明理工大学 | Domain-specific collective entity linking method based on representation learning |
2018-12-11: CN application CN201811508089.7A filed; patent CN109635297B (en), status Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104182420A (en) * | 2013-05-27 | 2014-12-03 | 华东师范大学 | Ontology-based Chinese name disambiguation method |
CN106202382A (en) * | 2016-07-08 | 2016-12-07 | 南京缘长信息科技有限公司 | Link instance method and system |
CN108170662A (en) * | 2016-12-07 | 2018-06-15 | 富士通株式会社 | Abbreviation disambiguation method and disambiguation device |
CN107861939A (en) * | 2017-09-30 | 2018-03-30 | 昆明理工大学 | Domain entity disambiguation method fusing word vectors and a topic model |
CN108415902A (en) * | 2018-02-10 | 2018-08-17 | 合肥工业大学 | Named entity linking method based on search engine |
CN108959461A (en) * | 2018-06-15 | 2018-12-07 | 东南大学 | Entity linking method based on graph model |
CN108959258A (en) * | 2018-07-02 | 2018-12-07 | 昆明理工大学 | Domain-specific collective entity linking method based on representation learning |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110134965A (en) * | 2019-05-21 | 2019-08-16 | 北京百度网讯科技有限公司 | Method, apparatus, equipment and computer readable storage medium for information processing |
CN110134965B (en) * | 2019-05-21 | 2023-08-18 | 北京百度网讯科技有限公司 | Method, apparatus, device and computer readable storage medium for information processing |
CN110348012B (en) * | 2019-07-01 | 2022-12-09 | 北京明略软件系统有限公司 | Method, device, storage medium and electronic device for determining target character |
CN110348012A (en) * | 2019-07-01 | 2019-10-18 | 北京明略软件系统有限公司 | Method, device, storage medium and electronic device for determining target character |
CN110427612A (en) * | 2019-07-02 | 2019-11-08 | 平安科技(深圳)有限公司 | Multilingual-based entity disambiguation method, device, equipment and storage medium |
CN110516252A (en) * | 2019-08-30 | 2019-11-29 | 京东方科技集团股份有限公司 | Data annotation method and device, computer equipment and storage medium |
CN110516252B (en) * | 2019-08-30 | 2022-12-09 | 京东方科技集团股份有限公司 | Data annotation method and device, computer equipment and storage medium |
CN110827831A (en) * | 2019-11-15 | 2020-02-21 | 广州洪荒智能科技有限公司 | Voice information processing method, device, equipment and medium based on man-machine interaction |
CN111259653B (en) * | 2020-01-15 | 2022-06-24 | 重庆邮电大学 | Knowledge graph question-answering method, system and terminal based on entity relationship disambiguation |
CN111259653A (en) * | 2020-01-15 | 2020-06-09 | 重庆邮电大学 | Knowledge graph question-answering method, system and terminal based on entity relationship disambiguation |
CN111401049A (en) * | 2020-03-12 | 2020-07-10 | 京东方科技集团股份有限公司 | Entity linking method and device |
US11914959B2 (en) | 2020-03-12 | 2024-02-27 | Boe Technology Group Co., Ltd. | Entity linking method and apparatus |
CN111680498A (en) * | 2020-05-18 | 2020-09-18 | 国家基础地理信息中心 | Entity disambiguation method, device, storage medium and computer equipment |
CN111680498B (en) * | 2020-05-18 | 2023-04-07 | 国家基础地理信息中心 | Entity disambiguation method, device, storage medium and computer equipment |
CN113947087A (en) * | 2021-12-20 | 2022-01-18 | 太极计算机股份有限公司 | Label-based relation construction method and device, electronic equipment and storage medium |
CN115293158A (en) * | 2022-06-30 | 2022-11-04 | 撼地数智(重庆)科技有限公司 | Disambiguation method and device based on label assistance |
CN115293158B (en) * | 2022-06-30 | 2024-02-02 | 撼地数智(重庆)科技有限公司 | Label-assisted disambiguation method and device |
Also Published As
Publication number | Publication date |
---|---|
CN109635297B (en) | 2022-01-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109635297A (en) | Entity disambiguation method and device, computer device and computer storage medium | |
Ramisch et al. | mwetoolkit: A framework for multiword expression identification. | |
Gupta et al. | A survey of common stemming techniques and existing stemmers for indian languages | |
CN106610951A (en) | Improved text similarity solving algorithm based on semantic analysis | |
WO2005064490A1 (en) | System for recognising and classifying named entities | |
CN110347790B (en) | Text duplicate checking method, device and equipment based on attention mechanism and storage medium | |
CN106570180A (en) | Artificial intelligence based voice searching method and device | |
WO2014022172A2 (en) | Information classification based on product recognition | |
CN106611041A (en) | New text similarity solution method | |
US11170169B2 (en) | System and method for language-independent contextual embedding | |
Weerasinghe et al. | Feature vector difference based neural network and logistic regression models for authorship verification | |
CN103678565A (en) | Domain self-adaption sentence alignment system based on self-guidance mode | |
Amarappa et al. | Named entity recognition and classification in kannada language | |
US20220365956A1 (en) | Method and apparatus for generating patent summary information, and electronic device and medium | |
Wong et al. | iSentenizer‐μ: Multilingual Sentence Boundary Detection Model | |
Venčkauskas et al. | Problems of authorship identification of the national language electronic discourse | |
CN101271448A (en) | Chinese language fundamental noun phrase recognition, its regulation generating method and apparatus | |
Shafi et al. | UNLT: Urdu natural language toolkit | |
CN110874408B (en) | Model training method, text recognition device and computing equipment | |
Adebayo et al. | Normas at semeval-2016 task 1: Semsim: A multi-feature approach to semantic text similarity | |
Muhamad et al. | Proposal: A hybrid dictionary modelling approach for malay tweet normalization | |
Oudah et al. | Person name recognition using the hybrid approach | |
Sharma et al. | Lfwe: Linguistic feature based word embedding for hindi fake news detection | |
Nguyen et al. | L3i_lbpam at the finsim-2 task: Learning financial semantic similarities with siamese transformers | |
Baishya et al. | Present state and future scope of Assamese text processing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB03 | Change of inventor or designer information | Inventor after: Duan Lian, Zhou Zhongcheng. Inventor before: Duan Lian, Zhou Zhongcheng, Huang Jiuming, Zhang Shengdong |
GR01 | Patent grant | ||
GR01 | Patent grant |