CN109635297A - Entity disambiguation method, apparatus, computer device, and computer storage medium

Publication number: CN109635297A
Application number: CN201811508089.7A
Authority: CN (China)
Legal status: Granted (active)
Other languages: Chinese (zh)
Other versions: CN109635297B (en)
Inventors: 段炼, 周忠诚, 黄九鸣, 张圣栋
Current assignee: Hunan Xinghan Shuzhi Technology Co Ltd
Original assignee: Hunan Xinghan Shuzhi Technology Co Ltd
Prior art keywords: entity, gene, disambiguated, occurrence, candidate

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to the field of Internet technology and discloses an entity disambiguation method, apparatus, computer device, and computer storage medium. The method comprises: constructing the gene of an entity to be disambiguated; determining candidate entities from an entity library according to the semantic features of the entity to be disambiguated, where the semantic features include name-form similarity, abbreviation information, and reference features; and calculating the gene matching degree between a candidate entity and the entity to be disambiguated, and determining that the genes of the two match when the gene matching degree exceeds a preset threshold. The method improves the effectiveness of entity disambiguation and, as disambiguation proceeds, progressively refines the linked entities and the knowledge base, which helps improve data-processing efficiency for target analysis in massive text, knowledge base construction, and related applications.

Description

Entity disambiguation method, apparatus, computer device, and computer storage medium
Technical field
The invention belongs to the field of Internet technology and in particular relates to an entity disambiguation method, apparatus, computer device, and computer storage medium.
Background
Entity name ambiguity arises in natural language processing: a given name in a text may refer to several different entities in the real world. The ambiguity stems from the freedom, diversity, and ambiguity of natural language expression. Natural language processing (NLP) research has long focused on tasks such as machine translation, information retrieval, text summarization, question answering, information extraction, topic modeling, and sentiment mining. Traditional NLP techniques based on syntactic analysis have developed slowly and produced few breakthroughs. With advances in technologies such as deep learning, artificial intelligence has attracted wide attention in the NLP field. Because natural language contains synonyms, near-synonyms, and polysemous words, natural language analysis is difficult, and entity disambiguation is therefore a key problem in NLP. Its main purpose is to identify the ambiguous entity names in a sentence and to assign each one the meaning that fits its context. Common entity disambiguation methods require a pre-existing, information-rich knowledge base, perform poorly on large-scale data sources, and achieve low disambiguation precision on Internet data sources.
Summary of the invention
Embodiments of the present invention provide an entity disambiguation method, apparatus, computer device, and computer storage medium, intended to solve the problems that prior-art methods require a pre-existing, information-rich knowledge base, perform poorly on large-scale data sources, and achieve low entity disambiguation precision on Internet data sources.
The invention is realized as an entity disambiguation method comprising the following steps:
Constructing the gene of the entity to be disambiguated, where the gene includes a co-occurrence entity word gene and an entity attribute gene, the co-occurrence entity word gene includes co-occurrence entity words and their co-occurrence degree, and the entity attribute gene includes the attributes of the entity to be disambiguated;
Determining candidate entities from an entity library according to the semantic features of the entity to be disambiguated, where the semantic features of the entity to be disambiguated include name-form similarity, abbreviation information, and reference features;
Calculating the gene matching degree between a candidate entity and the entity to be disambiguated and, when the gene matching degree exceeds a preset threshold, determining that the genes of the candidate entity and the entity to be disambiguated match.
Further, calculating the gene matching degree between the candidate entity and the entity to be disambiguated includes the following steps:
Obtaining the co-occurrence entity gene matching degree between the candidate entity and the entity to be disambiguated;
Obtaining the attribute gene matching degree between the candidate entity and the entity to be disambiguated;
Calculating the gene matching degree between the candidate entity and the entity to be disambiguated from the co-occurrence entity gene matching degree and the attribute gene matching degree, using the formula:
$$\mathrm{score}_g(m, e) = \alpha \cdot \mathrm{score}_w(m, e) + \beta \cdot \mathrm{score}_p(m, e)$$
where $\mathrm{score}_g(m, e)$ is the gene matching degree, $\mathrm{score}_w(m, e)$ is the co-occurrence entity gene matching degree, $\mathrm{score}_p(m, e)$ is the attribute gene matching degree, and $\alpha$ and $\beta$ are weights.
Further, obtaining the co-occurrence entity gene matching degree between the candidate entity and the entity to be disambiguated includes the following steps:
Determining the co-occurrence entity words of the entity to be disambiguated from pre-stored documents, and determining the co-occurrence entity words of the candidate entity from the pre-stored documents;
Obtaining the gene entity word set of the candidate entity;
Calculating the co-occurrence entity gene matching degree from three quantities: the term frequency of the overlapping part of the two entities' co-occurrence entity words relative to the total number of pre-stored entity words; the term frequency of that overlapping part relative to the gene entity word set of the candidate entity; and the inverse document frequency of that overlapping part.
Further, obtaining the attribute gene matching degree between the candidate entity and the entity to be disambiguated includes the following steps:
Determining the attribute names of the entity to be disambiguated and the attribute names of the candidate entity;
Calculating the attribute gene matching degree from the overlapping attributes between the attribute names of the entity to be disambiguated and those of the candidate entity, together with the weights of the overlapping attributes.
Further, after determining that the genes of the candidate entity and the entity to be disambiguated match, the entity disambiguation method further includes:
Merging the genes of the candidate entity and the entity to be disambiguated;
Determining, according to the gene matching degree between the candidate entity and the entity to be disambiguated, the target entity that matches the entity to be disambiguated, and performing knowledge fusion on the corresponding entity in the knowledge base according to the knowledge of the target entity.
The present invention also provides an entity disambiguation apparatus, comprising:
A construction module, for constructing the gene of the entity to be disambiguated, where the gene includes a co-occurrence entity word gene and an entity attribute gene, the co-occurrence entity word gene includes co-occurrence entity words and their co-occurrence degree, and the entity attribute gene includes the attributes of the entity to be disambiguated;
A screening module, for screening candidate entities from an entity library according to the semantic features of the entity to be disambiguated, where the semantic features of the entity to be disambiguated include name-form similarity, abbreviation information, and reference features;
A matching module, for calculating the gene matching degree between the candidate entity and the entity to be disambiguated and, when the gene matching degree exceeds a preset threshold, determining that the genes of the candidate entity and the entity to be disambiguated match.
Further, the matching module includes:
A first acquisition submodule, for obtaining the co-occurrence entity gene matching degree and the attribute gene matching degree between the candidate entity and the entity to be disambiguated;
A first calculation submodule, for calculating the gene matching degree between the candidate entity and the entity to be disambiguated from the co-occurrence entity gene matching degree and the attribute gene matching degree, using the formula:
$$\mathrm{score}_g(m, e) = \alpha \cdot \mathrm{score}_w(m, e) + \beta \cdot \mathrm{score}_p(m, e)$$
where $\mathrm{score}_g(m, e)$ is the gene matching degree, $\mathrm{score}_w(m, e)$ is the co-occurrence entity gene matching degree, $\mathrm{score}_p(m, e)$ is the attribute gene matching degree, and $\alpha$ and $\beta$ are weights.
Further, the first acquisition submodule includes:
A first determination unit, for determining the co-occurrence entity words of the entity to be disambiguated from pre-stored documents, and determining the co-occurrence entity words of the candidate entity from the pre-stored documents;
A first acquisition unit, for obtaining the gene entity word set of the candidate entity;
A first calculation unit, for calculating the co-occurrence entity gene matching degree from the term frequency of the overlapping part of the two entities' co-occurrence entity words relative to the total number of pre-stored entity words, the term frequency of that overlapping part relative to the gene entity word set of the candidate entity, and the inverse document frequency of that overlapping part.
Further, the first acquisition submodule includes:
A second determination unit, for determining the attribute names of the entity to be disambiguated and the attribute names of the candidate entity;
A second calculation unit, for calculating the attribute gene matching degree from the overlapping attributes between the attribute names of the entity to be disambiguated and those of the candidate entity, together with the weights of the overlapping attributes.
Further, the entity disambiguation apparatus also includes:
A first fusion module, for merging the genes of the candidate entity and the entity to be disambiguated;
A second fusion module, for determining, according to the gene matching degree between the candidate entity and the entity to be disambiguated, the target entity that matches the entity to be disambiguated, and performing knowledge fusion on the corresponding entity in the knowledge base according to the knowledge of the target entity.
The present invention also provides a computer device that includes a processor; the processor implements the steps of the above entity disambiguation method when executing a computer program stored in memory.
The present invention also provides a computer-readable storage medium on which a computer program is stored; the computer program implements the steps of the above entity disambiguation method when executed by a processor.
The entity disambiguation method provided by the invention constructs the gene of the entity to be disambiguated, calculates the gene matching degree between the candidate entity and the entity to be disambiguated, and disambiguates the entity words in the text according to that matching degree. Disambiguation accuracy on large-scale data sources is comparatively high, which improves entity disambiguation precision on Internet data sources and the overall disambiguation effect. As disambiguation proceeds, the linked entities and the knowledge base are progressively refined, which helps improve data-processing efficiency for target analysis in massive text, knowledge base construction, question answering systems, and the like.
Description of the drawings
Fig. 1 is a flow chart of the entity disambiguation method provided by an embodiment of the present invention;
Fig. 2 is a flow chart of calculating the gene matching degree between the candidate entity and the entity to be disambiguated, provided by an embodiment of the present invention;
Fig. 3 is a flow chart of obtaining the co-occurrence entity gene matching degree, provided by an embodiment of the present invention;
Fig. 4 is a flow chart of obtaining the attribute gene matching degree, provided by an embodiment of the present invention;
Fig. 5 is a flow chart of the entity disambiguation method after it is determined that the genes of the candidate entity and the entity to be disambiguated match, provided by an embodiment of the present invention;
Fig. 6 is a structural diagram of an entity disambiguation apparatus provided by an embodiment of the present invention;
Fig. 7 is a structural diagram of the matching module provided by an embodiment of the present invention;
Fig. 8 is a structural diagram of the first acquisition submodule provided by an embodiment of the present invention;
Fig. 9 is another structural diagram of the first acquisition submodule provided by an embodiment of the present invention;
Fig. 10 is a structural diagram of another entity disambiguation apparatus provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
Fig. 1 shows the flow chart of the entity disambiguation method provided by an embodiment of the present invention. The entity disambiguation method includes the following steps:
Step S101, constructing the gene of the entity to be disambiguated.
In this embodiment, the gene includes a co-occurrence entity word gene and an entity attribute gene. The co-occurrence entity word gene includes co-occurrence entity words and their co-occurrence degree, and the entity attribute gene includes the attributes of the entity to be disambiguated. The gene determines the characteristic information of an entity. For each entity word to be disambiguated, the co-occurrence entity words in the document and their co-occurrence frequencies are counted and used as the entity's co-occurrence entity gene, and the attributes of that entity word in the document's attribute table are used as the entity attribute gene. Entity words with a weak co-occurrence degree can additionally be gene-enhanced according to relation-type attributes to obtain the final co-occurrence entity gene. A relation-type attribute expresses an association between two entities, for example a marital relationship, a peer relationship, or a father-daughter relationship. Information extraction can be performed on the document to find the relevant attributes of its entities, and the attributes are aligned to predefined attribute slots so that attribute names are expressed in a standard form; this strong attribute information is used as the entity attribute gene.
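The gene construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Gene` class, the `build_gene` function, and the input shapes (a list of entity words per document and a list of attribute triples) are hypothetical names introduced here, and relation-based gene enhancement is omitted.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Gene:
    # Co-occurring entity word -> co-occurrence frequency (hypothetical layout).
    cooccur: Counter = field(default_factory=Counter)
    # Attribute name -> set of attribute values (hypothetical layout).
    attrs: dict = field(default_factory=dict)


def build_gene(mention, doc_entity_words, attribute_table):
    """Build a gene for `mention` from one document.

    doc_entity_words: entity words found in the document (assumed input).
    attribute_table: (entity, attribute, value) triples from the document.
    """
    gene = Gene()
    # Count every other entity word in the document as a co-occurrence.
    gene.cooccur.update(w for w in doc_entity_words if w != mention)
    # Keep only the attribute triples that describe the mention itself.
    for entity, attr, value in attribute_table:
        if entity == mention:
            gene.attrs.setdefault(attr, set()).add(value)
    return gene


gene = build_gene(
    "Zhang San",
    ["Zhang San", "Li Si", "Wang Group", "Li Si"],
    [("Zhang San", "spouse", "Li Si"), ("Li Si", "employer", "Wang Group")],
)
```

Here the co-occurrence degree is approximated by a raw count within one document; a real system would presumably aggregate over many documents.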
In this embodiment, obtaining the entity to be disambiguated from text mainly involves web page body extraction, text cleaning, language identification, text segmentation, and entity attribute extraction.
Specifically, during web page body extraction, the part that carries the body content is located mainly from the css-class names and the text content under <div> blocks. Auxiliary elements of the page, such as login, comment, and share buttons, banners, and the like, are removed to obtain the body content of the page.
Specifically, during text cleaning, tools can be used to strip HTML source tags such as <div>, <p>, and <br> to obtain natural language text. Full-width/half-width conversion, removal of invisible characters, removal of emoticons, and traditional/simplified Chinese conversion are then applied to wash out useless, meaningless characters, and the natural language text is output.
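A minimal sketch of part of the cleaning stage, using only the Python standard library. The function name `clean_text` is hypothetical, the regular expression is a simplification rather than a full HTML parser, and emoticon removal and traditional/simplified conversion are not covered here.

```python
import re
import unicodedata

# Naive tag matcher; an assumption for illustration, not a full HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def clean_text(raw):
    # Replace HTML tags such as <div>, <p>, <br> with a space.
    text = TAG_RE.sub(" ", raw)
    # NFKC normalization folds full-width characters to their half-width forms.
    text = unicodedata.normalize("NFKC", text)
    # Drop control (Cc) and format (Cf) characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch) not in ("Cc", "Cf") or ch in "\n\t"
    )
    # Collapse runs of spaces and tabs left behind by the removals.
    return re.sub(r"[ \t]+", " ", text).strip()


cleaned = clean_text("<div>\uff21\uff22\uff23\u3000\uff11\uff12\uff13</div><br>ok\u200b")
```

The sample input contains full-width letters and digits, an ideographic space, and a zero-width space, which are the kinds of characters the cleaning stage is meant to normalize or remove.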
Specifically, during language identification, one option is simply to check which encoding block each character belongs to; in a Chinese article, for example, most characters fall in the Chinese block and the ASCII block. Alternatively, language identification can be treated as a classification problem and solved with machine learning. Once the language of the text is determined, the system dispatches the text to the word segmentation tool for that language.
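The encoding-block check can be sketched as a character count. The function `guess_language` and its labels are hypothetical, and the sketch only separates Chinese from Latin-script text; telling English from Spanish would need the classifier approach mentioned above.

```python
def guess_language(text):
    # Count characters in the CJK Unified Ideographs block.
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    # Count ASCII letters as evidence of a Latin-script language.
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if cjk == 0 and latin == 0:
        return "unknown"
    # Assumed rule: whichever script has more characters wins.
    return "zh" if cjk >= latin else "latin"
```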
Specifically, during text segmentation, dedicated Chinese, English, and Spanish segmentation tools are used according to the language identification result. Each outputs a segmentation result containing words, parts of speech, and named entity words, which is then converted to a unified representation.
Specifically, during entity attribute extraction, the system first uses NLP tools to resolve pronoun references as far as possible, replacing each pronoun with the name of the entity it refers to, and then extracts information from the text using rules and an attribute extraction tool. Rule-based information extraction has two parts: building the extraction rules and applying them. The attribute extraction tool first splits each sentence into a series of clauses and then shortens each clause as much as possible to obtain shorter sentence fragments. These fragments are divided into triples of entity, attribute, and attribute value; the triples serve as the entity attribute table of the document, which the system finally outputs.
After text preprocessing and NLP analysis, the resulting entity attribute table of the document can be used directly as the input to entity disambiguation. Text preprocessing and NLP analysis are indispensable to entity disambiguation, but the specific techniques may be replaced by NLP processing methods other than those described above. Configuring these steps helps speed up subsequent processing, provides high-quality input text for disambiguation, lowers the input requirements of the system, and extends the range of data that can be processed while also improving scalability.
Step S102, determining candidate entities from the entity library according to the semantic features of the entity to be disambiguated.
In this embodiment, the semantic features of the entity to be disambiguated include name-form similarity, abbreviation information, and reference features. Different approaches are used to determine candidate entities for different languages. One or more candidate entities may be determined from the entity library; when there are several, they form a candidate entity set. Name-form similarity is the similarity in form between entity names. For example, the name Zhang San in article A and the name Zhang San in article B are similar in form, as are the English names "jack" and "jackie", or "trump" and "donald trump". Abbreviation information may include Chinese and English abbreviations; for example, Hunan Province is abbreviated as Hunan, and a media access control address is abbreviated as MAC address. The reference feature can be understood as the referent that corresponds to a pronoun in the document.
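Candidate selection by name form and abbreviation might look like the sketch below. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure; the patent does not name a specific measure, and the `ABBREVIATIONS` table, the `min_sim` cutoff, and the function names are assumptions for illustration. Reference features are not handled.

```python
from difflib import SequenceMatcher

# Hypothetical abbreviation table: short form -> full entity name.
ABBREVIATIONS = {"Hunan": "Hunan Province"}


def name_similarity(a, b):
    # Case-insensitive ratio in [0, 1]; a stand-in for name-form similarity.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def select_candidates(mention, entity_library, min_sim=0.6):
    expanded = ABBREVIATIONS.get(mention, mention)
    candidates = []
    for name in entity_library:
        same_after_expansion = name == expanded
        # "trump" is one whole token of "Donald Trump".
        is_name_part = mention.lower() in name.lower().split()
        similar = name_similarity(mention, name) >= min_sim
        if same_after_expansion or is_name_part or similar:
            candidates.append(name)
    return candidates


library = ["Donald Trump", "Jackie", "Hunan Province"]
```

With this library, `select_candidates("trump", library)` keeps only "Donald Trump", and `select_candidates("jack", library)` keeps only "Jackie" because the two strings share four of their ten characters in sequence.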
Step S103, calculating the gene matching degree between the candidate entity and the entity to be disambiguated and, when the gene matching degree exceeds a preset threshold, determining that the genes of the candidate entity and the entity to be disambiguated match.
In this embodiment, when a candidate entity set exists, the candidate entities in the set are scanned and the gene matching degree between each candidate entity and the entity to be disambiguated is calculated. When the gene matching degree exceeds the preset threshold, the candidate entity is determined to match the entity to be disambiguated. A larger threshold gives higher matching precision and a smaller threshold gives lower matching precision; the threshold can be set according to actual needs. When a candidate entity matches the entity to be disambiguated, the gene of the entity to be disambiguated is merged with the gene of the candidate entity so that the entity accumulates information. The remaining candidate entities in the set are then scanned until no matchable candidate entity remains.
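The scan over the candidate set can be sketched with a pluggable scoring function. The names `match_candidates` and `jaccard` are hypothetical, and the Jaccard overlap used in the example is only a placeholder for the gene matching degree defined in this document.

```python
def match_candidates(mention_gene, candidates, score_fn, threshold=0.5):
    """Return (name, score) pairs whose score exceeds the threshold.

    candidates: dict mapping entity name -> gene (assumed layout).
    score_fn: callable(mention_gene, candidate_gene) -> float.
    """
    matched = []
    for name, gene in candidates.items():
        score = score_fn(mention_gene, gene)
        if score > threshold:
            matched.append((name, score))
    # Best match first.
    return sorted(matched, key=lambda pair: pair[1], reverse=True)


def jaccard(a, b):
    # Placeholder scorer: overlap of two word sets.
    return len(a & b) / len(a | b) if a | b else 0.0


hits = match_candidates(
    {"Li Si", "Wang Group", "Changsha"},
    {"e1": {"Li Si", "Wang Group"}, "e2": {"Beijing"}},
    jaccard,
)
```

Raising `threshold` trades recall for precision, which mirrors the remark above that a larger preset threshold gives higher matching precision.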
The entity disambiguation method provided by this embodiment constructs the gene of the entity to be disambiguated, calculates the gene matching degree between the candidate entity and the entity to be disambiguated, and disambiguates the entity words in the text according to that matching degree. It performs comparatively well on large-scale data sources, which improves entity disambiguation precision on Internet data sources and the overall disambiguation effect.
Referring to Fig. 2, calculating the gene matching degree between the candidate entity and the entity to be disambiguated in step S103 includes the following steps:
Step S1031, obtaining the co-occurrence entity gene matching degree between the candidate entity and the entity to be disambiguated.
Step S1032, obtaining the attribute gene matching degree between the candidate entity and the entity to be disambiguated.
Step S1033, calculating the gene matching degree between the candidate entity and the entity to be disambiguated from the co-occurrence entity gene matching degree and the attribute gene matching degree, using the formula:
$$\mathrm{score}_g(m, e) = \alpha \cdot \mathrm{score}_w(m, e) + \beta \cdot \mathrm{score}_p(m, e)$$
where $\mathrm{score}_g(m, e)$ is the gene matching degree, $\mathrm{score}_w(m, e)$ is the co-occurrence entity gene matching degree, $\mathrm{score}_p(m, e)$ is the attribute gene matching degree, and $\alpha$ and $\beta$ are weights.
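The weighted combination and the threshold test translate directly into code. The default values for `alpha`, `beta`, and `threshold` below are arbitrary placeholders; the source does not state concrete values.

```python
def gene_match_score(score_w, score_p, alpha=0.6, beta=0.4):
    # score_g(m, e) = alpha * score_w(m, e) + beta * score_p(m, e)
    return alpha * score_w + beta * score_p


def is_match(score_w, score_p, threshold=0.5, alpha=0.6, beta=0.4):
    # The genes match when the combined score exceeds the preset threshold.
    return gene_match_score(score_w, score_p, alpha, beta) > threshold
```

For instance, with these placeholder weights a co-occurrence score of 0.5 and an attribute score of 1.0 combine to 0.7, which exceeds a threshold of 0.5.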
Referring to Fig. 3, step S1031 includes the following steps:
Step S10311, determining the co-occurrence entity words of the entity to be disambiguated from pre-stored documents, and determining the co-occurrence entity words of the candidate entity from the pre-stored documents;
Step S10312, obtaining the gene entity word set of the candidate entity;
Step S10313, calculating the co-occurrence entity gene matching degree from the term frequency of the overlapping part of the two entities' co-occurrence entity words relative to the total number of pre-stored entity words, the term frequency of that overlapping part relative to the gene entity word set of the candidate entity, and the inverse document frequency of that overlapping part.
In this embodiment, the gene entity word set is the set of entity words related to an entity. For the entity Zhang San, for example, the gene entity word set may include words such as Li Si, Wang Group, and Wangjia Village.
To elaborate, the computation borrows the idea of TF-IDF (term frequency-inverse document frequency), a weighting technique commonly used in information retrieval and data mining, where TF is the term frequency and IDF is the inverse document frequency. All entities are treated as the documents D, and the entity words of an entity's co-occurrence entity gene are treated as the words T. The tf of the entity words in the co-occurrence entity gene of the entity to be disambiguated and the tf of the entity words in the co-occurrence entity gene of the candidate entity are normalized. Normalized TF is sensitive to the number of gene entity words, and the co-occurrence entity gene of a candidate entity is generally long, so the weights of the co-occurrence entity words on which the genes overlap are strengthened before normalization in order to reinforce weak signals. The idf computation relies on a caching mechanism to improve speed. Finally, the entity word gene matching degree is obtained from the tf and idf of the overlapping part of the co-occurrence entity word genes. Specifically, the co-occurrence entity gene matching degree $\mathrm{score}_w(m, e)$ can be calculated according to the following formula:
[formula not preserved in the source text]
where, for the entity to be disambiguated m and the candidate entity e, the normalized tf value of an overlapping gene entity word w is calculated as:
[formula not preserved in the source text]
where $\mathrm{weit}(w, e)$ is the weight of gene entity word w of entity e, $\mathrm{weit}(e, w)$ is the weight of gene entity word e of entity w, and $W(m)$ is the gene entity word set of entity m. Note that the formula for $\mathrm{score}_w(m, e)$ does not distinguish between the entity to be disambiguated and the candidate entity: m and e are both treated as entities and are interchangeable. $\mathrm{tf}_{norm}(w; m^*, e)$ is the value of word w normalized over the total of m after the overlap between m and e has been strengthened; $\mathrm{tf}_{norm}(w; e^*, m)$ is calculated in the same way and is not repeated here.
The inverse document frequency $\mathrm{idf}(w)$ of a co-occurrence entity word in the co-occurrence entity gene is calculated as:
[formula not preserved in the source text]
where E is the entity library, $W(e)$ is the gene entity word set of entity e, and Z is a compensation constant that is generally small.
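Since the idf formula itself is not reproduced in this text, the sketch below assumes a common smoothed form, idf(w) = log(|E| / (|{e in E : w in W(e)}| + Z)), which is consistent with the symbols E, W(e), and Z defined above but is not confirmed by the source. The function name `build_idf_table` is hypothetical; precomputing the whole table once stands in for the caching mechanism mentioned earlier.

```python
import math
from collections import Counter


def build_idf_table(entity_library, z=0.01):
    """Precompute idf for every gene entity word.

    entity_library: dict entity name -> iterable of gene entity words W(e).
    z: small compensation constant (assumed to sit in the denominator).
    """
    # Number of entities whose gene entity word set contains each word.
    df = Counter(w for words in entity_library.values() for w in set(words))
    n = len(entity_library)
    return {w: math.log(n / (count + z)) for w, count in df.items()}


idf_table = build_idf_table({"A": {"x", "y"}, "B": {"x"}, "C": {"z"}})
```

Words shared by fewer entities receive a larger idf, so a rare shared co-occurrence word contributes more evidence than a common one.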
Referring to Fig. 4, step S1032 includes the following steps:
Step S10321, determining the attribute names of the entity to be disambiguated and the attribute names of the candidate entity;
Step S10322, calculating the attribute gene matching degree from the overlapping attributes between the attribute names of the entity to be disambiguated and those of the candidate entity, together with the weights of the overlapping attributes.
In this embodiment, attribute names are aligned when the genes are constructed, so attribute gene matching only needs to compare the attribute values under the same attribute name. The entity attribute gene matching degree is the weighted sum of the overlapping attributes. When attribute values are compared, the strings are not required to be identical; instead, their similarity is computed with an edit-distance-based similarity algorithm. When the weighted-sum model for attribute matching is built, fuzzy matching can be tuned for the values of certain special attributes, and the weights of strongly discriminative attributes, such as ID card number and spouse, are increased. Specifically, the attribute gene matching degree $\mathrm{score}_p(m, e)$ can be calculated according to the following formula:
[formula not preserved in the source text]
where $PN(x)$ is the set of attribute names of all attributes of entity x, $pv(x, pn)$ is the set of all values of entity x for attribute name pn, $\mathrm{weit}_p(pn)$ is the weight of attribute name pn, and $I_{pv}(v_i, v_j) \in \{0, 1\}$ is an indicator function that states whether attribute values $v_i$ and $v_j$ are the same. $I_{pv}(v_i, v_j)$ is calculated as follows:
[formula not preserved in the source text]
where $\mathrm{sim}_{pv}(v_i, v_j)$ is the normalized string similarity, $\theta_p$ is a high threshold close to 1, and $I_{pfix}(v_i, v_j)$ and $I_{sfix}(v_i, v_j)$ indicate whether one of the values $v_i$, $v_j$ is a prefix or a suffix, respectively, of the other.
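The attribute matching can be sketched as follows. Because the two formulas are not reproduced in this text, the sketch makes explicit assumptions: the indicator is 1 when the normalized edit-distance similarity reaches the threshold or when one value is a prefix or suffix of the other, and the score is the unnormalized sum of weights over shared attribute names that have at least one matching value pair. All function names and defaults are hypothetical.

```python
def edit_distance(a, b):
    # Classic Levenshtein distance computed with a rolling row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def values_match(vi, vj, theta=0.9):
    # Assumed form of the indicator I_pv: similarity OR prefix OR suffix.
    if not vi or not vj:
        return False
    sim = 1 - edit_distance(vi, vj) / max(len(vi), len(vj))
    return (sim >= theta
            or vi.startswith(vj) or vj.startswith(vi)
            or vi.endswith(vj) or vj.endswith(vi))


def attribute_score(attrs_m, attrs_e, weights, default_weight=1.0):
    """attrs_*: dict attribute name -> set of values; weights: name -> weight."""
    score = 0.0
    for pn in attrs_m.keys() & attrs_e.keys():
        if any(values_match(vi, vj) for vi in attrs_m[pn] for vj in attrs_e[pn]):
            score += weights.get(pn, default_weight)
    return score


m_attrs = {"spouse": {"Li Si"}, "city": {"Changsha"}}
e_attrs = {"spouse": {"Li Si"}, "city": {"Beijing"}}
score = attribute_score(m_attrs, e_attrs, {"spouse": 3.0})
```

Giving "spouse" a larger weight than the default reflects the remark that strongly discriminative attributes should carry more weight.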
Referring to Fig. 5, after step 103 above, the method further includes:
Step 104: merge the gene of the candidate entity with the gene of the entity to be disambiguated.
Step 105: according to the gene matching degree between the candidate entity and the entity to be disambiguated, determine the target entity that matches the entity to be disambiguated, and perform knowledge fusion on the corresponding entity in the knowledge base according to the knowledge of the target entity.
In this embodiment, when the candidate entity and the entity to be disambiguated match, the gene of the entity to be disambiguated is merged into the gene of the candidate entity, so that the entity accumulates information. The entity words and weights of the co-occurrence entity gene and the entity attribute gene can be merged separately, and a growth window is set for the co-occurrence entity gene to reserve room for important words that appear later. In this embodiment, the entity-linking library is gradually improved as the entity words in the text are disambiguated.
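As a rough illustration of the merging step, the following Python sketch adds the co-occurrence weights of the two genes and takes the union of their attribute values. The data layout (a dict with `cooccur` and `attrs` keys), the function name, and the reading of the growth window as a capacity limit `max_words` on the merged co-occurrence gene are all assumptions made for this example; the text does not specify them.

```python
def merge_genes(candidate: dict, mention: dict, max_words: int = 100) -> dict:
    """Merge the gene of the mention into the gene of the candidate.

    Each gene is {"cooccur": {word: weight}, "attrs": {name: set(values)}}.
    Returns a new gene; the inputs are not modified.
    """
    # Sum the weights of co-occurrence entity words present in either gene.
    cooccur = dict(candidate["cooccur"])
    for word, weight in mention["cooccur"].items():
        cooccur[word] = cooccur.get(word, 0.0) + weight

    # Hypothetical window: keep only the max_words heaviest words.
    heaviest = sorted(cooccur.items(), key=lambda kv: kv[1], reverse=True)
    cooccur = dict(heaviest[:max_words])

    # Union the attribute values under each aligned attribute name.
    attrs = {name: set(values) for name, values in candidate["attrs"].items()}
    for name, values in mention["attrs"].items():
        attrs.setdefault(name, set()).update(values)

    return {"cooccur": cooccur, "attrs": attrs}
```

Returning a new gene rather than updating in place keeps the sketch easy to test; a real entity library would more likely update the stored gene directly.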
In this embodiment, the knowledge base can be updated according to the disambiguation result. The result falls into three cases: no entity is matched, one entity is matched, or multiple entities are matched. Attribute fusion is performed on the knowledge base separately for each of the three cases.
The entity disambiguation method provided by the embodiment of the present invention constructs the gene of the entity to be disambiguated, calculates the gene matching degree between the candidate entity and the entity to be disambiguated, and disambiguates the entity words in the text according to the gene matching degree. Disambiguation accuracy on large-scale data sources is relatively high, which improves the precision of entity disambiguation on Internet data sources and thereby the overall disambiguation effect. During entity disambiguation, the linked entities and the knowledge base are gradually improved, which helps improve the data processing efficiency of target analysis, knowledge base construction, question answering systems, and the like over massive text.
Fig. 6 shows a schematic structural diagram of an entity disambiguator 600 provided by an embodiment of the present invention. For ease of description, only the parts relevant to the embodiment of the present invention are shown. The entity disambiguator 600 includes:
A construction module 601, configured to construct the gene of the entity to be disambiguated.
In this embodiment, the gene includes a co-occurrence entity word gene and an entity attribute gene. The co-occurrence entity word gene includes co-occurrence entity words and their co-occurrence degree, and the entity attribute gene includes the attributes of the entity to be disambiguated. The gene determines the characteristic information of the entity. For each entity word to be disambiguated, the co-occurrence entity words in the document and their co-occurrence frequencies are counted and used as the co-occurrence entity gene of the entity, and the attributes of the entity word to be disambiguated in the document's attribute table are used as the entity attribute gene. In this embodiment, entity words with a weak co-occurrence degree can also be gene-enhanced according to relationship-type attributes to obtain the final co-occurrence entity gene. A relationship-type attribute indicates an association between two entities; for example, it may be a spousal relationship, a peer relationship, or a father-daughter relationship. In this embodiment, information extraction can be performed on the document to find the relevant attributes of the entities in it, and the attributes are aligned to predefined attribute slots to standardize the attribute names. This strong information, the entity's attributes, is used as the entity attribute gene.
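A minimal Python sketch of building the co-occurrence entity gene by counting follows. It assumes that co-occurrence is measured at document level (two entity words co-occur if they appear in the same document) and that each document has already been reduced to a list of entity words; both are simplifications for illustration, and the function name is hypothetical. Gene enhancement through relationship-type attributes is not shown.

```python
from collections import Counter, defaultdict


def build_cooccurrence_genes(documents):
    """Count, for every entity word, how often each other entity word
    appears in the same document.

    documents: iterable of lists of entity words, one list per document.
    Returns {entity_word: Counter({co_occurring_word: frequency})}.
    """
    genes = defaultdict(Counter)
    for entities in documents:
        unique = set(entities)  # count each pair at most once per document
        for entity in unique:
            for other in unique:
                if other != entity:
                    genes[entity][other] += 1
    return genes
```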
In this embodiment, the process of obtaining the entity to be disambiguated from text mainly includes web page body extraction, text cleaning, language identification, text segmentation, and entity attribute extraction.
Specifically, in web page body extraction, the section containing the main text is located mainly according to css-class names and the text content under <div> blocks, and auxiliary elements of the page, such as login, comment, and share buttons and banners, are removed to obtain the body content of the web page.
Specifically, in text cleaning, HTML source code such as <div>, <p>, and <br> can be removed with existing tools to obtain natural language text. Full-width to half-width conversion, removal of invisible characters, removal of emoticons, and traditional-to-simplified Chinese conversion are then performed to wash out useless or meaningless characters, and the cleaned natural language text is output.
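Part of the cleaning step can be sketched with the Python standard library alone. This example assumes that a simple regular expression is good enough to strip tags, uses Unicode NFKC normalization as the full-width to half-width conversion, and treats control and format characters as the invisible characters to remove. Emoticon removal and traditional-to-simplified conversion are left out because they need extra data or third-party tools. The function name is hypothetical.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")


def clean_text(raw: str) -> str:
    """Strip HTML tags and normalize characters in a text fragment."""
    text = TAG_RE.sub(" ", raw)                 # drop <div>, <p>, <br>, ...
    text = html.unescape(text)                  # &amp; -> &
    text = unicodedata.normalize("NFKC", text)  # full-width -> half-width
    # Remove control (Cc) and format (Cf) characters, keeping newline and tab.
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch) not in {"Cc", "Cf"} or ch in "\n\t"
    )
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace
```

A regular expression is not a robust HTML parser; for real pages an HTML parsing library would be a safer choice.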
Specifically, in language identification, one option is simply to check the encoding block to which the characters belong; for example, in a Chinese article most characters fall in the Chinese encoding block and the ASCII block. Alternatively, language identification can be treated as a classification problem and solved with machine learning methods. After the language of the text is determined, the system dispatches the text to the segmentation tool corresponding to that language.
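The encoding-block check can be sketched in a few lines of Python. The example assumes that the CJK Unified Ideographs block (U+4E00 to U+9FFF) stands in for the Chinese encoding block and that a text counts as Chinese when at least 30% of its letters fall in that block; the threshold and the function name are hypothetical. Telling English from Spanish would need the classifier approach instead, so Latin-script text is only labeled `latin` here.

```python
def guess_language(text: str, threshold: float = 0.3) -> str:
    """Guess the script of a text from the Unicode block of its letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "unknown"
    cjk = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    if cjk / len(letters) >= threshold:
        return "zh"
    return "latin"
```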
Specifically, in text segmentation, dedicated Chinese, English, or Spanish segmentation tools are used according to the language identification result. They output segmentation results containing words, parts of speech, and named entity words, which are then converted to a unified representation.
Specifically, in entity attribute extraction, the system first uses NLP tools to resolve pronoun references as far as possible, replacing each pronoun with the name of the entity it refers to, and then extracts information from the text using rules and an attribute extraction tool. Rule-based information extraction consists of two parts: building the extraction rules and applying them to extract information. The attribute extraction tool first splits each sentence into a series of clauses and then shortens each clause as much as possible to obtain shorter sentence fragments. These fragments are then divided into triples, each consisting of an entity, an attribute, and an attribute value. The triples serve as the entity attribute table of the document, which the system finally outputs.
After text preprocessing and NLP analysis, the resulting entity attribute table of the document can be used as direct input to entity disambiguation. Text preprocessing and NLP analysis are indispensable to entity disambiguation, but the specific techniques may be replaced by NLP processing methods other than those described above. Configuring these steps helps optimize subsequent processing speed, provides high-quality input text for the disambiguation process, and lowers the input requirements of the system, which widens the range of data that can be processed and also improves scalability.
A screening module 602, configured to determine candidate entities from the entity library according to the semantic features of the entity to be disambiguated, where the semantic features of the entity to be disambiguated include name form similarity, abbreviation information, and reference features.
In this embodiment, different approaches are used for different languages to determine candidate entities from the entity library. One or more candidate entities may be determined; when there are several, they form a candidate entity set. Name form similarity is the similarity in form between entity names. For example, the name Zhang San in article A and the name Zhang San in article B have name form similarity, as do the English names "jack" and "jackie", and "trump" and "donald trump". Abbreviation information may include Chinese and English abbreviation information; for example, Hunan Province can be abbreviated as Hunan, and a media access control address can be abbreviated as a MAC address. A reference feature can be understood as the reference feature corresponding to a pronoun in the document.
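A simple way to picture candidate screening by name form and abbreviation is the Python sketch below. It is only an illustration under stated assumptions: name form similarity is approximated with `difflib.SequenceMatcher` plus a whole-token containment check, abbreviations come from a hypothetical lookup table supplied by the caller, and reference (pronoun) features are not handled. The 0.7 similarity cutoff is arbitrary.

```python
from difflib import SequenceMatcher


def select_candidates(mention, entity_names, abbreviations=None,
                      min_ratio=0.7):
    """Return the entity names whose form is close to the mention.

    abbreviations: optional dict mapping an abbreviation to its full form.
    """
    abbreviations = abbreviations or {}
    # Expand the mention first if it is a known abbreviation.
    m = abbreviations.get(mention, mention).lower()
    candidates = []
    for name in entity_names:
        n = name.lower()
        same = m == n
        # "trump" is a whole token of "donald trump", and vice versa.
        contained = m in n.split() or n in m.split()
        similar = SequenceMatcher(None, m, n).ratio() >= min_ratio
        if same or contained or similar:
            candidates.append(name)
    return candidates
```

With the examples from the text, "jack" selects "Jackie" through string similarity, while "trump" selects "Donald Trump" through token containment.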
A matching module 603, configured to calculate the gene matching degree between the candidate entity and the entity to be disambiguated and, when the gene matching degree exceeds a preset threshold, to determine that the genes of the candidate entity and the entity to be disambiguated match.
In this embodiment, when a candidate entity set exists, the candidate entities in the set can be scanned and the gene matching degree between each candidate entity and the entity to be disambiguated calculated. When the gene matching degree exceeds the preset threshold, the candidate entity is determined to match the entity to be disambiguated. A larger preset threshold means higher matching precision and a smaller one means lower matching precision; the threshold can be set according to actual needs. When a candidate entity matches the entity to be disambiguated, the gene of the entity to be disambiguated is merged into the gene of the candidate entity, so that the entity accumulates information. The remaining candidate entities in the set are then scanned until no matchable candidate entity remains.
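The scan over the candidate entity set can be sketched as a loop with a threshold test. In this Python example the scoring function is passed in as a parameter, and a Jaccard overlap of co-occurrence words is used purely as a stand-in for the gene matching degree; the function names, the data layout, and the threshold value are hypothetical. Merging the genes of matched entities is omitted.

```python
def jaccard(a: set, b: set) -> float:
    """Toy stand-in for the gene matching degree."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_candidates(mention_gene, candidates, score_fn, threshold=0.5):
    """Scan all candidates and keep those scoring above the threshold.

    candidates: dict mapping entity id -> gene.
    Returns (entity_id, score) pairs, best match first.
    """
    matches = []
    for entity_id, gene in candidates.items():
        score = score_fn(mention_gene, gene)
        if score > threshold:
            matches.append((entity_id, score))
    return sorted(matches, key=lambda pair: pair[1], reverse=True)
```

Raising `threshold` trades recall for precision, which mirrors the remark that a larger preset threshold gives higher matching precision.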
The entity disambiguator provided by the embodiment of the present invention constructs the gene of the entity to be disambiguated, calculates the gene matching degree between the candidate entity and the entity to be disambiguated, and disambiguates the entity words in the text according to the gene matching degree. Entity disambiguation on large-scale data sources performs relatively well, which improves the precision of entity disambiguation on Internet data sources and the overall disambiguation effect.
Referring to Fig. 7, the matching module 603 includes:
A first acquisition submodule 6031, configured to obtain the co-occurrence entity gene matching degree and the attribute gene matching degree between the candidate entity and the entity to be disambiguated.
A first calculation submodule 6032, configured to calculate the gene matching degree between the candidate entity and the entity to be disambiguated according to the co-occurrence entity gene matching degree and the attribute gene matching degree, using the formula:
score_g(m, e) = α * score_w(m, e) + β * score_p(m, e);
where score_g(m, e) is the gene matching degree, score_w(m, e) is the co-occurrence entity gene matching degree, score_p(m, e) is the attribute gene matching degree, and α and β are weights.
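The weighted combination is straightforward to express in code. The Python sketch below simply evaluates the formula; the default values of 0.5 for α and β are placeholders chosen for the example, since the text does not state how the weights are set.

```python
def gene_match_score(score_w: float, score_p: float,
                     alpha: float = 0.5, beta: float = 0.5) -> float:
    """score_g(m, e) = alpha * score_w(m, e) + beta * score_p(m, e)."""
    return alpha * score_w + beta * score_p
```

For example, with α = 0.7 and β = 0.3, a co-occurrence score of 0.8 and an attribute score of 0.4 combine to 0.7 * 0.8 + 0.3 * 0.4 = 0.68.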
Referring to Fig. 8, the first acquisition submodule 6031 includes:
A first determination unit 60311, configured to determine the co-occurrence entity words of the entity to be disambiguated from pre-stored documents, and to determine the co-occurrence entity words of the candidate entity from the pre-stored documents;
A first acquisition unit 60312, configured to obtain the gene entity word set of the candidate entity;
A first calculation unit 60313, configured to calculate the co-occurrence entity gene matching degree according to: the term frequency of the overlapping part of the co-occurrence entity words of the entity to be disambiguated and those of the candidate entity, relative to the pre-stored total number of entity words; the term frequency of that overlapping part relative to the gene entity word set of the candidate entity; and the inverse document frequency of that overlapping part.
In this embodiment, the gene entity word set is the set of entity words related to an entity. For example, for the entity Zhang San, the gene entity word set may include words such as Li Si, Wang's Group, and Wangjia Village.
Further, the calculation borrows the idea of TF-IDF. TF-IDF (term frequency-inverse document frequency) is a common weighting technique for information retrieval and data mining, where TF means term frequency and IDF means inverse document frequency. All entities are treated as the documents D and the entity words of an entity's co-occurrence entity gene as the words T; the tf of the entity words in the co-occurrence entity gene of the entity to be disambiguated and the tf of the entity words in the co-occurrence entity gene of the candidate entity are normalized. TF normalization is sensitive to the number of gene entity words, and the co-occurrence entity gene of a candidate entity is generally long, so the weights of the co-occurrence entity words on which the genes overlap are strengthened before normalization in order to amplify weak signals. The idf calculation is carried out with a caching mechanism to improve calculation speed. The entity word gene matching degree is finally obtained from the tf and idf of the overlapping part of the co-occurrence entity word genes. Specifically, the co-occurrence entity gene matching degree score_w(m, e) can be calculated according to the following formula:
where, for the entity to be disambiguated m and the candidate entity e, the normalized tf value of an overlapping gene entity word w is calculated as:
where weit(w, e) denotes the weight of gene entity word w of entity e, weit(e, w) denotes the weight of gene entity word e of entity w, and W(m) is the gene entity word set of entity m. Note that the formula for the co-occurrence entity gene matching degree score_w(m, e) does not distinguish between the entity to be disambiguated and the candidate entity: m and e are both treated as entities and are interchangeable. tf_norm(w; m*, e) denotes the value of word w normalized by the total for m after the overlap of m and e has been strengthened; tf_norm(w; e*, m) is calculated analogously to tf_norm(w; m*, e) and is not described again here.
The inverse document frequency idf(w) of a co-occurrence entity word in the co-occurrence entity gene is calculated as:
where E is the entity library, W(e) is the gene entity word set of entity e, and Z is a compensation constant that is generally small.
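The TF-IDF idea can be illustrated with a small Python sketch. Because the formulas are not reproduced in this text, every concrete form here is an assumption: idf(w) is taken as log(|E| / (df(w) + Z)) with Z as a small smoothing constant, the normalized tf is the weight of a word divided by the total weight of its gene, and the score is the sum over overlapping words of the product of the two tf values and the idf. The strengthening of overlapping weights and the idf cache are not modeled, and the function names are hypothetical.

```python
import math


def idf(word: str, library: dict, z: float = 0.01) -> float:
    """Assumed form: log(|E| / (number of entities whose gene has word + Z)).

    Can be slightly negative when a word occurs in every gene.
    """
    df = sum(1 for gene in library.values() if word in gene)
    return math.log(len(library) / (df + z))


def tf_norm(word: str, gene: dict) -> float:
    """Weight of a word relative to the total weight of its gene."""
    total = sum(gene.values())
    return gene[word] / total if total else 0.0


def score_w(gene_m: dict, gene_e: dict, library: dict,
            z: float = 0.01) -> float:
    """Sum tf(m) * tf(e) * idf over the overlapping gene entity words.

    gene_m / gene_e map gene entity word -> weight;
    library maps entity id -> gene and plays the role of E.
    """
    overlap = gene_m.keys() & gene_e.keys()
    return sum(
        tf_norm(w, gene_m) * tf_norm(w, gene_e) * idf(w, library, z)
        for w in overlap
    )
```

The score is symmetric in its first two arguments, which matches the remark that m and e are interchangeable, and rare overlapping words contribute more than common ones through the idf factor.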
Referring to Fig. 9, the first acquisition submodule 6031 includes:
A second determination unit 60314, configured to determine the attribute names of the entity to be disambiguated and the attribute names of the candidate entity;
A second calculation unit 60315, configured to calculate the attribute gene matching degree according to the overlapping attributes between the attribute names of the entity to be disambiguated and the attribute names of the candidate entity, and the weights of those overlapping attributes.
In this embodiment, attribute names are aligned when the genes are constructed, so the attribute gene matching calculation only needs to compare the attribute values under matching attribute names. The entity attribute gene matching degree is a weighted sum over the overlapping attributes. When attribute values are compared, the strings are not required to be identical; instead, their similarity is computed with an edit-distance-based similarity algorithm. When the weighted-sum model for attribute matching is built, fuzzy matching can be applied to the values of certain special attributes, and the weights of highly discriminative attributes, such as ID card number and spouse, can be increased. Specifically, the attribute gene matching degree score_p(m, e) can be calculated according to the following formula:
where PN(x) denotes the attribute names of all attributes of entity x, pv(x, pn) denotes all attribute values of entity x for attribute name pn, weit_p(pn) denotes the weight of attribute name pn, and I_pv(v_i, v_j) ∈ {0, 1} is an indicator function that indicates whether attribute values v_i and v_j are the same. I_pv(v_i, v_j) is calculated as follows:
where sim_pv(v_i, v_j) denotes the normalized string similarity, θ_p is a high threshold close to 1, and I_pfix(v_i, v_j) and I_sfix(v_i, v_j) indicate, respectively, whether one of v_i and v_j is a prefix or a suffix of the other.
Referring to Fig. 10, the entity disambiguator 600 further includes:
A first fusion module 604, configured to merge the gene of the candidate entity with the gene of the entity to be disambiguated.
A second fusion module 605, configured to determine, according to the gene matching degree between the candidate entity and the entity to be disambiguated, the target entity that matches the entity to be disambiguated, and to perform knowledge fusion on the corresponding entity in the knowledge base according to the knowledge of the target entity.
In this embodiment, when the candidate entity and the entity to be disambiguated match, the gene of the entity to be disambiguated is merged into the gene of the candidate entity, so that the entity accumulates information. The entity words and weights of the co-occurrence entity gene and the entity attribute gene can be merged separately, and a growth window is set for the co-occurrence entity gene to reserve room for important words that appear later. In this embodiment, the entity-linking library is gradually improved as the entity words in the text are disambiguated.
In this embodiment, the knowledge base can be updated according to the disambiguation result. The result falls into three cases: no entity is matched, one entity is matched, or multiple entities are matched. Attribute fusion is performed on the knowledge base separately for each of the three cases.
The entity disambiguation method provided by the embodiment of the present invention constructs the gene of the entity to be disambiguated, calculates the gene matching degree between the candidate entity and the entity to be disambiguated, and disambiguates the entity words in the text according to the gene matching degree. Entity disambiguation on large-scale data sources performs relatively well, which improves the precision of entity disambiguation on Internet data sources and the overall disambiguation effect. During entity disambiguation, the linked entities and the knowledge base are gradually improved, which helps improve the data processing efficiency of target analysis, knowledge base construction, question answering systems, and the like over massive text.
An embodiment of the present invention provides a computer device that includes a processor. When the processor executes a computer program stored in a memory, it implements the steps of the entity disambiguation method provided by the method embodiments above.
Illustratively, the computer program can be divided into one or more modules, which are stored in the memory and executed by the processor to carry out the present invention. The one or more modules may be a series of computer program instruction segments capable of performing specific functions; the instruction segments describe the execution of the computer program in the computer device. For example, the computer program can be divided according to the steps of the entity disambiguation method provided by the method embodiments above.
Those skilled in the art will understand that the above description of the computer device is only an example and does not limit the computer device, which may include more or fewer components than described, combine certain components, or use different components; for example, it may include input/output devices, network access devices, buses, and the like.
The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the computer device and connects the various parts of the entire computer device through various interfaces and lines.
The memory can be used to store the computer program and/or modules. The processor implements the various functions of the computer device by running or executing the computer program and/or modules stored in the memory and by calling data stored in the memory. The memory may mainly include a program storage area and a data storage area: the program storage area can store the operating system and the application programs required by at least one function (such as a sound playback function or an image playback function); the data storage area can store data created through use of the mobile phone (such as audio data or a phone book). In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
If the integrated modules/units of the computer device are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. On this understanding, all or part of the processes of the above method embodiments may also be completed by a computer program instructing the relevant hardware. The computer program can be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of each of the entity disambiguation method embodiments above. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, electrical signals, software distribution media, and the like.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (12)

1. An entity disambiguation method, characterized in that the entity disambiguation method comprises:
constructing the gene of an entity to be disambiguated, where the gene includes a co-occurrence entity word gene and an entity attribute gene, the co-occurrence entity word gene includes co-occurrence entity words and a co-occurrence degree, and the entity attribute gene includes the attributes of the entity to be disambiguated;
determining a candidate entity from an entity library according to the semantic features of the entity to be disambiguated, where the semantic features of the entity to be disambiguated include name form similarity, abbreviation information, and reference features;
calculating the gene matching degree between the candidate entity and the entity to be disambiguated and, when the gene matching degree exceeds a preset threshold, determining that the genes of the candidate entity and the entity to be disambiguated match.
2. The entity disambiguation method according to claim 1, characterized in that calculating the gene matching degree between the candidate entity and the entity to be disambiguated comprises:
obtaining the co-occurrence entity gene matching degree between the candidate entity and the entity to be disambiguated;
obtaining the attribute gene matching degree between the candidate entity and the entity to be disambiguated;
calculating the gene matching degree between the candidate entity and the entity to be disambiguated according to the co-occurrence entity gene matching degree and the attribute gene matching degree, using the formula:
score_g(m, e) = α * score_w(m, e) + β * score_p(m, e)
where score_g(m, e) is the gene matching degree, score_w(m, e) is the co-occurrence entity gene matching degree, score_p(m, e) is the attribute gene matching degree, and α and β are weights.
3. The entity disambiguation method according to claim 2, characterized in that obtaining the co-occurrence entity gene matching degree between the candidate entity and the entity to be disambiguated comprises:
determining the co-occurrence entity words of the entity to be disambiguated from pre-stored documents, and determining the co-occurrence entity words of the candidate entity from the pre-stored documents;
obtaining the gene entity word set of the candidate entity;
calculating the co-occurrence entity gene matching degree according to: the term frequency of the overlapping part of the co-occurrence entity words of the entity to be disambiguated and those of the candidate entity, relative to the pre-stored total number of entity words; the term frequency of that overlapping part relative to the gene entity word set of the candidate entity; and the inverse document frequency of that overlapping part.
4. The entity disambiguation method according to claim 2, characterized in that obtaining the attribute gene matching degree between the candidate entity and the entity to be disambiguated comprises:
determining the attribute names of the entity to be disambiguated and the attribute names of the candidate entity;
calculating the attribute gene matching degree according to the overlapping attributes between the attribute names of the entity to be disambiguated and the attribute names of the candidate entity, and the weights of those overlapping attributes.
5. The entity disambiguation method according to claim 1, characterized in that, after determining that the genes of the candidate entity and the entity to be disambiguated match, the entity disambiguation method further comprises:
merging the gene of the candidate entity with the gene of the entity to be disambiguated;
determining, according to the gene matching degree between the candidate entity and the entity to be disambiguated, the target entity that matches the entity to be disambiguated, and performing knowledge fusion on the corresponding entity in the knowledge base according to the knowledge of the target entity.
6. An entity disambiguator, characterized by comprising:
a construction module, configured to construct the gene of an entity to be disambiguated, where the gene includes a co-occurrence entity word gene and an entity attribute gene, the co-occurrence entity word gene includes co-occurrence entity words and a co-occurrence degree, and the entity attribute gene includes the attributes of the entity to be disambiguated;
a screening module, configured to screen a candidate entity from an entity library according to the semantic features of the entity to be disambiguated, where the semantic features of the entity to be disambiguated include name form similarity, abbreviation information, and reference features;
a matching module, configured to calculate the gene matching degree between the candidate entity and the entity to be disambiguated and, when the gene matching degree exceeds a preset threshold, to determine that the genes of the candidate entity and the entity to be disambiguated match.
7. The entity disambiguator according to claim 6, characterized in that the matching module comprises:
a first acquisition submodule, configured to obtain the co-occurrence entity gene matching degree and the attribute gene matching degree between the candidate entity and the entity to be disambiguated;
a first calculation submodule, configured to calculate the gene matching degree between the candidate entity and the entity to be disambiguated according to the co-occurrence entity gene matching degree and the attribute gene matching degree, using the formula:
score_g(m, e) = α * score_w(m, e) + β * score_p(m, e)
where score_g(m, e) is the gene matching degree, score_w(m, e) is the co-occurrence entity gene matching degree, score_p(m, e) is the attribute gene matching degree, and α and β are weights.
8. The entity disambiguator according to claim 7, characterized in that the first acquisition submodule comprises:
a first determination unit, configured to determine the co-occurrence entity words of the entity to be disambiguated from pre-stored documents, and to determine the co-occurrence entity words of the candidate entity from the pre-stored documents;
a first acquisition unit, configured to obtain the gene entity word set of the candidate entity;
a first calculation unit, configured to calculate the co-occurrence entity gene matching degree according to: the term frequency of the overlapping part of the co-occurrence entity words of the entity to be disambiguated and those of the candidate entity, relative to the pre-stored total number of entity words; the term frequency of that overlapping part relative to the gene entity word set of the candidate entity; and the inverse document frequency of that overlapping part.
9. The entity disambiguator according to claim 7, characterized in that the first acquisition submodule comprises:
a second determination unit, configured to determine the attribute names of the entity to be disambiguated and the attribute names of the candidate entity;
a second calculation unit, configured to calculate the attribute gene matching degree according to the overlapping attributes between the attribute names of the entity to be disambiguated and the attribute names of the candidate entity, and the weights of those overlapping attributes.
10. The entity disambiguator according to claim 6, characterized by further comprising:
a first fusion module, configured to merge the gene of the candidate entity with the gene of the entity to be disambiguated;
a second fusion module, configured to determine, according to the gene matching degree between the candidate entity and the entity to be disambiguated, the target entity that matches the entity to be disambiguated, and to perform knowledge fusion on the corresponding entity in the knowledge base according to the knowledge of the target entity.
11. A computer device, characterized in that the computer device includes a processor, and the processor, when executing a computer program stored in a memory, implements the steps of the entity disambiguation method according to any one of claims 1-5.
12. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the entity disambiguation method according to any one of claims 1-5.
CN201811508089.7A 2018-12-11 2018-12-11 Entity disambiguation method and device, computer device and computer storage medium Active CN109635297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811508089.7A CN109635297B (en) 2018-12-11 2018-12-11 Entity disambiguation method and device, computer device and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811508089.7A CN109635297B (en) 2018-12-11 2018-12-11 Entity disambiguation method and device, computer device and computer storage medium

Publications (2)

Publication Number Publication Date
CN109635297A true CN109635297A (en) 2019-04-16
CN109635297B CN109635297B (en) 2022-01-04

Family

ID=66072632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811508089.7A Active CN109635297B (en) 2018-12-11 2018-12-11 Entity disambiguation method and device, computer device and computer storage medium

Country Status (1)

Country Link
CN (1) CN109635297B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134965A (en) * 2019-05-21 2019-08-16 北京百度网讯科技有限公司 Method, apparatus, equipment and computer readable storage medium for information processing
CN110348012A (en) * 2019-07-01 2019-10-18 北京明略软件系统有限公司 Method, device, storage medium and electronic device for determining target character
CN110427612A (en) * 2019-07-02 2019-11-08 平安科技(深圳)有限公司 Multilingual-based entity disambiguation method, device, equipment and storage medium
CN110516252A (en) * 2019-08-30 2019-11-29 京东方科技集团股份有限公司 Data annotation method and device, computer equipment and storage medium
CN110827831A (en) * 2019-11-15 2020-02-21 广州洪荒智能科技有限公司 Voice information processing method, device, equipment and medium based on man-machine interaction
CN111259653A (en) * 2020-01-15 2020-06-09 重庆邮电大学 Knowledge graph question-answering method, system and terminal based on entity relationship disambiguation
CN111401049A (en) * 2020-03-12 2020-07-10 京东方科技集团股份有限公司 Entity linking method and device
CN111680498A (en) * 2020-05-18 2020-09-18 国家基础地理信息中心 Entity disambiguation method, device, storage medium and computer equipment
CN113947087A (en) * 2021-12-20 2022-01-18 太极计算机股份有限公司 Label-based relation construction method and device, electronic equipment and storage medium
CN115293158A (en) * 2022-06-30 2022-11-04 撼地数智(重庆)科技有限公司 Disambiguation method and device based on label assistance

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104182420A (en) * 2013-05-27 2014-12-03 华东师范大学 Ontology-based Chinese name disambiguation method
CN106202382A (en) * 2016-07-08 2016-12-07 南京缘长信息科技有限公司 Link instance method and system
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 Domain entity disambiguation method fusing word vectors and topic model
CN108170662A (en) * 2016-12-07 2018-06-15 富士通株式会社 Abbreviation disambiguation method and disambiguation device
CN108415902A (en) * 2018-02-10 2018-08-17 合肥工业大学 Named entity linking method based on search engine
CN108959461A (en) * 2018-06-15 2018-12-07 东南大学 Entity linking method based on graph model
CN108959258A (en) * 2018-07-02 2018-12-07 昆明理工大学 Domain-specific integrated entity linking method based on representation learning

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134965A (en) * 2019-05-21 2019-08-16 北京百度网讯科技有限公司 Method, apparatus, equipment and computer readable storage medium for information processing
CN110134965B (en) * 2019-05-21 2023-08-18 北京百度网讯科技有限公司 Method, apparatus, device and computer readable storage medium for information processing
CN110348012B (en) * 2019-07-01 2022-12-09 北京明略软件系统有限公司 Method, device, storage medium and electronic device for determining target character
CN110348012A (en) * 2019-07-01 2019-10-18 北京明略软件系统有限公司 Method, device, storage medium and electronic device for determining target character
CN110427612A (en) * 2019-07-02 2019-11-08 平安科技(深圳)有限公司 Multilingual-based entity disambiguation method, device, equipment and storage medium
CN110516252A (en) * 2019-08-30 2019-11-29 京东方科技集团股份有限公司 Data annotation method and device, computer equipment and storage medium
CN110516252B (en) * 2019-08-30 2022-12-09 京东方科技集团股份有限公司 Data annotation method and device, computer equipment and storage medium
CN110827831A (en) * 2019-11-15 2020-02-21 广州洪荒智能科技有限公司 Voice information processing method, device, equipment and medium based on man-machine interaction
CN111259653B (en) * 2020-01-15 2022-06-24 重庆邮电大学 Knowledge graph question-answering method, system and terminal based on entity relationship disambiguation
CN111259653A (en) * 2020-01-15 2020-06-09 重庆邮电大学 Knowledge graph question-answering method, system and terminal based on entity relationship disambiguation
CN111401049A (en) * 2020-03-12 2020-07-10 京东方科技集团股份有限公司 Entity linking method and device
US11914959B2 (en) 2020-03-12 2024-02-27 Boe Technology Group Co., Ltd. Entity linking method and apparatus
CN111680498A (en) * 2020-05-18 2020-09-18 国家基础地理信息中心 Entity disambiguation method, device, storage medium and computer equipment
CN111680498B (en) * 2020-05-18 2023-04-07 国家基础地理信息中心 Entity disambiguation method, device, storage medium and computer equipment
CN113947087A (en) * 2021-12-20 2022-01-18 太极计算机股份有限公司 Label-based relation construction method and device, electronic equipment and storage medium
CN115293158A (en) * 2022-06-30 2022-11-04 撼地数智(重庆)科技有限公司 Disambiguation method and device based on label assistance
CN115293158B (en) * 2022-06-30 2024-02-02 撼地数智(重庆)科技有限公司 Label-assisted disambiguation method and device

Also Published As

Publication number Publication date
CN109635297B (en) 2022-01-04

Similar Documents

Publication Publication Date Title
CN109635297A (en) Entity disambiguation method and device, computer device and computer storage medium
Ramisch et al. mwetoolkit: A framework for multiword expression identification.
Gupta et al. A survey of common stemming techniques and existing stemmers for indian languages
CN106610951A (en) Improved text similarity solving algorithm based on semantic analysis
WO2005064490A1 (en) System for recognising and classifying named entities
CN110347790B (en) Text duplicate checking method, device and equipment based on attention mechanism and storage medium
CN106570180A (en) Artificial intelligence based voice searching method and device
WO2014022172A2 (en) Information classification based on product recognition
CN106611041A (en) New text similarity solution method
US11170169B2 (en) System and method for language-independent contextual embedding
Weerasinghe et al. Feature vector difference based neural network and logistic regression models for authorship verification
CN103678565A (en) Domain self-adaption sentence alignment system based on self-guidance mode
Amarappa et al. Named entity recognition and classification in kannada language
US20220365956A1 (en) Method and apparatus for generating patent summary information, and electronic device and medium
Wong et al. iSentenizer‐μ: Multilingual Sentence Boundary Detection Model
Venčkauskas et al. Problems of authorship identification of the national language electronic discourse
CN101271448A (en) Chinese language fundamental noun phrase recognition, its regulation generating method and apparatus
Shafi et al. UNLT: Urdu natural language toolkit
CN110874408B (en) Model training method, text recognition device and computing equipment
Adebayo et al. Normas at semeval-2016 task 1: Semsim: A multi-feature approach to semantic text similarity
Muhamad et al. Proposal: A hybrid dictionary modelling approach for malay tweet normalization
Oudah et al. Person name recognition using the hybrid approach
Sharma et al. Lfwe: Linguistic feature based word embedding for hindi fake news detection
Nguyen et al. L3i_lbpam at the finsim-2 task: Learning financial semantic similarities with siamese transformers
Baishya et al. Present state and future scope of Assamese text processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Duan Lian; Zhou Zhongcheng

Inventor before: Duan Lian; Zhou Zhongcheng; Huang Jiuming; Zhang Shengdong

GR01 Patent grant