CN109635297A - Entity disambiguation method, apparatus, computer device, and computer storage medium

Info

Publication number: CN109635297A
Application number: CN201811508089.7A
Authority: CN
Inventors: 段炼; 周忠诚; 黄九鸣; 张圣栋
Original assignee: Hunan Xinghan Shuzhi Technology Co Ltd
Current assignee: Hunan Xinghan Shuzhi Technology Co Ltd
Priority date: 2018-12-11
Filing date: 2018-12-11
Publication date: 2019-04-16
Anticipated expiration: 2038-12-11
Also published as: CN109635297B

Abstract

The present invention belongs to the field of Internet technology and discloses an entity disambiguation method, apparatus, computer device, and computer storage medium. The method comprises: constructing the gene of an entity to be disambiguated; determining candidate entities from an entity library according to the semantic features of the entity to be disambiguated, wherein the semantic features include name-form similarity, abbreviation information, and reference features; and calculating the gene matching degree between a candidate entity and the entity to be disambiguated, and determining that the genes of the candidate entity and the entity to be disambiguated match when the gene matching degree exceeds a preset threshold. The entity disambiguation method provided by the invention can improve the effect of entity disambiguation. During disambiguation, the linked entities and the knowledge base are progressively refined, which helps improve data-processing efficiency for target analysis in massive text, knowledge base construction, and related tasks.

Description

Entity disambiguation method, apparatus, computer device, and computer storage medium

Technical field

The invention belongs to the field of Internet technology, and in particular relates to an entity disambiguation method, apparatus, computer device, and computer storage medium.

Background

Entity names are often ambiguous in natural language processing: a name in a text may refer to several entities in the real world. The cause is the freedom, diversity, and ambiguity of natural language expression. Research in natural language processing (NLP) has long focused on tasks such as machine translation, information retrieval, text summarization, question answering, information extraction, topic modeling, and sentiment mining. Traditional NLP techniques based on grammatical analysis have developed slowly and produced few breakthroughs. With advances in technologies such as deep learning, artificial intelligence has attracted wide attention in the NLP field. Because natural language contains synonyms, near-synonyms, and polysemous words, natural language analysis is difficult, and entity disambiguation is therefore a key problem in NLP. The main purpose of entity disambiguation is to identify the ambiguous entity names in a sentence and to assign each one the meaning that fits its context. Common entity disambiguation methods require a pre-existing, information-rich knowledge base, perform poorly on large-scale data sources, and achieve low disambiguation precision on Internet data sources.

Summary of the invention

Embodiments of the present invention provide an entity disambiguation method, apparatus, computer device, and computer storage medium, intended to solve the problems that prior-art methods require a pre-existing, information-rich knowledge base, perform poorly on large-scale data sources, and achieve low entity disambiguation precision on Internet data sources.

The invention is realized as an entity disambiguation method comprising the following steps:

Constructing the gene of an entity to be disambiguated, wherein the gene includes a co-occurrence entity word gene and an entity attribute gene, the co-occurrence entity word gene includes co-occurrence entity words and their co-occurrence degrees, and the entity attribute gene includes the attributes of the entity to be disambiguated;

Determining candidate entities from an entity library according to the semantic features of the entity to be disambiguated, wherein the semantic features of the entity to be disambiguated include name-form similarity, abbreviation information, and reference features;

Calculating the gene matching degree between a candidate entity and the entity to be disambiguated, and, when the gene matching degree exceeds a preset threshold, determining that the genes of the candidate entity and the entity to be disambiguated match.

Further, calculating the gene matching degree between the candidate entity and the entity to be disambiguated comprises the following steps:

Obtaining the co-occurrence entity gene matching degree between the candidate entity and the entity to be disambiguated;

Obtaining the attribute gene matching degree between the candidate entity and the entity to be disambiguated;

Calculating the gene matching degree between the candidate entity and the entity to be disambiguated from the co-occurrence entity gene matching degree and the attribute gene matching degree, using the formula:

score_g(m, e) = α * score_w(m, e) + β * score_p(m, e)

Wherein score_g(m, e) is the gene matching degree, score_w(m, e) is the co-occurrence entity gene matching degree, score_p(m, e) is the attribute gene matching degree, and α and β are weights.

Further, obtaining the co-occurrence entity gene matching degree between the candidate entity and the entity to be disambiguated comprises the following steps:

Determining the co-occurrence entity words of the entity to be disambiguated from pre-stored documents, and determining the co-occurrence entity words of the candidate entity from the pre-stored documents;

Obtaining the gene entity word set of the candidate entity;

Calculating the co-occurrence entity gene matching degree from three quantities: the term frequency of the overlap between the co-occurrence entity words of the entity to be disambiguated and those of the candidate entity, relative to the total number of pre-stored entity words; the term frequency of the same overlap relative to the gene entity word set of the candidate entity; and the inverse document frequency of the overlap.

Further, obtaining the attribute gene matching degree between the candidate entity and the entity to be disambiguated comprises the following steps:

Determining the attribute names of the entity to be disambiguated and the attribute names of the candidate entity;

Calculating the attribute gene matching degree from the overlapping attributes between the attribute names of the entity to be disambiguated and those of the candidate entity, and from the weights of the overlapping attributes.

Further, after determining that the genes of the candidate entity and the entity to be disambiguated match, the entity disambiguation method further comprises:

Merging the genes of the candidate entity and the entity to be disambiguated;

Determining, according to the gene matching degree between the candidate entity and the entity to be disambiguated, the target entity that matches the entity to be disambiguated, and performing knowledge fusion on the corresponding entity in the knowledge base according to the knowledge of the target entity.

The present invention also provides an entity disambiguation apparatus, comprising:

A construction module, for constructing the gene of an entity to be disambiguated, wherein the gene includes a co-occurrence entity word gene and an entity attribute gene, the co-occurrence entity word gene includes co-occurrence entity words and their co-occurrence degrees, and the entity attribute gene includes the attributes of the entity to be disambiguated;

A screening module, for screening candidate entities from an entity library according to the semantic features of the entity to be disambiguated, wherein the semantic features of the entity to be disambiguated include name-form similarity, abbreviation information, and reference features;

A matching module, for calculating the gene matching degree between a candidate entity and the entity to be disambiguated and, when the gene matching degree exceeds a preset threshold, determining that the genes of the candidate entity and the entity to be disambiguated match.

Further, the matching module includes:

A first acquisition submodule, for obtaining the co-occurrence entity gene matching degree and the attribute gene matching degree between the candidate entity and the entity to be disambiguated;

A first calculation submodule, for calculating the gene matching degree between the candidate entity and the entity to be disambiguated from the co-occurrence entity gene matching degree and the attribute gene matching degree, using the formula:

score_g(m, e) = α * score_w(m, e) + β * score_p(m, e)

Further, the first acquisition submodule includes:

A first determination unit, for determining the co-occurrence entity words of the entity to be disambiguated from pre-stored documents, and determining the co-occurrence entity words of the candidate entity from the pre-stored documents;

A first acquisition unit, for obtaining the gene entity word set of the candidate entity;

A first calculation unit, for calculating the co-occurrence entity gene matching degree from three quantities: the term frequency of the overlap between the co-occurrence entity words of the entity to be disambiguated and those of the candidate entity, relative to the total number of pre-stored entity words; the term frequency of the same overlap relative to the gene entity word set of the candidate entity; and the inverse document frequency of the overlap.

Further, the first acquisition submodule includes:

A second determination unit, for determining the attribute names of the entity to be disambiguated and the attribute names of the candidate entity;

A second calculation unit, for calculating the attribute gene matching degree from the overlapping attributes between the attribute names of the entity to be disambiguated and those of the candidate entity, and from the weights of the overlapping attributes.

Further, the entity disambiguation apparatus also includes:

A first fusion module, for merging the genes of the candidate entity and the entity to be disambiguated;

A second fusion module, for determining, according to the gene matching degree between the candidate entity and the entity to be disambiguated, the target entity that matches the entity to be disambiguated, and performing knowledge fusion on the corresponding entity in the knowledge base according to the knowledge of the target entity.

The present invention also provides a computer device comprising a processor, the processor being configured to implement the steps of the above entity disambiguation method when executing a computer program stored in a memory.

The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above entity disambiguation method.

The entity disambiguation method provided by the invention constructs the gene of the entity to be disambiguated, calculates the gene matching degree between the candidate entity and the entity to be disambiguated, and disambiguates the entity words in a text according to the gene matching degree. Disambiguation accuracy on large-scale data sources is comparatively high, and the precision of entity disambiguation on Internet data sources is improved, so the overall effect of entity disambiguation is improved. During disambiguation, the linked entities and the knowledge base are progressively refined, which helps improve data-processing efficiency for target analysis in massive text, knowledge base construction, question answering systems, and the like.

Brief description of the drawings

Fig. 1 is a flowchart of the entity disambiguation method provided by an embodiment of the present invention;

Fig. 2 is a flowchart of calculating the gene matching degree between the candidate entity and the entity to be disambiguated, provided by an embodiment of the present invention;

Fig. 3 is a flowchart of obtaining the co-occurrence entity gene matching degree, provided by an embodiment of the present invention;

Fig. 4 is a flowchart of obtaining the attribute gene matching degree, provided by an embodiment of the present invention;

Fig. 5 is a flowchart of the entity disambiguation method after it is determined that the genes of the candidate entity and the entity to be disambiguated match, provided by an embodiment of the present invention;

Fig. 6 is a structural schematic diagram of an entity disambiguation apparatus provided by an embodiment of the present invention;

Fig. 7 is a structural schematic diagram of the matching module provided by an embodiment of the present invention;

Fig. 8 is a structural schematic diagram of the first acquisition submodule provided by an embodiment of the present invention;

Fig. 9 is another structural schematic diagram of the first acquisition submodule provided by an embodiment of the present invention;

Fig. 10 is a structural schematic diagram of another entity disambiguation apparatus provided by an embodiment of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and do not limit it.

Fig. 1 shows a flowchart of the entity disambiguation method provided by an embodiment of the present invention. The entity disambiguation method comprises the following steps:

Step S101, constructing the gene of the entity to be disambiguated.

In this embodiment, the gene includes a co-occurrence entity word gene and an entity attribute gene. The co-occurrence entity word gene includes co-occurrence entity words and their co-occurrence degrees, and the entity attribute gene includes the attributes of the entity to be disambiguated. The gene determines the characteristic information of the entity. For each entity word to be disambiguated, the co-occurrence entity words in the documents and their co-occurrence frequencies are counted and used as the co-occurrence entity gene of the entity, and the attributes of the entity word to be disambiguated in the document's attribute table are used as the entity attribute gene. In this embodiment, entity words with a weak co-occurrence degree can additionally be reinforced according to relationship-type attributes (gene enhancement) before the final co-occurrence entity gene is obtained. A relationship-type attribute expresses an association between two entities, for example a spousal relationship, a peer relationship, or a father-daughter relationship. In this embodiment, information extraction can be performed on a document to find the relevant attributes of the entities in it, and the attributes are aligned to predefined attribute slots to standardize the attribute names. This strong information, the attributes of the entity, serves as the entity attribute gene.
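The gene construction in step S101 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Gene` class, the `build_gene` function, and the input shapes (one list of entity words per document, and an attribute table of `(entity, attribute, value)` triples) are assumptions, and gene enhancement from relationship-type attributes is omitted.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Gene:
    # Co-occurrence entity word gene: entity word -> co-occurrence frequency.
    cooccurrence: Counter = field(default_factory=Counter)
    # Entity attribute gene: attribute name -> set of attribute values.
    attributes: dict = field(default_factory=dict)


def build_gene(mention, doc_entity_words, attribute_table):
    """Build the gene of `mention` (hypothetical helper).

    doc_entity_words: list of lists, the entity words found in each document.
    attribute_table:  iterable of (entity, attribute_name, value) triples.
    """
    gene = Gene()
    for words in doc_entity_words:
        if mention in words:
            # Every other entity word in the same document co-occurs once.
            gene.cooccurrence.update(w for w in words if w != mention)
    for entity, name, value in attribute_table:
        if entity == mention:
            gene.attributes.setdefault(name, set()).add(value)
    return gene
```

Counting co-occurrence per document is one possible granularity; a sentence or sliding-window scope would fit the same structure.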

In this embodiment, obtaining the entity to be disambiguated from text mainly involves web page body extraction, text cleaning, language identification, word segmentation, and entity attribute extraction.

Specifically, web page body extraction locates the part of the page that carries the body text, mainly according to the names of the css-class values and the text content under <div> blocks, and removes auxiliary elements of the page, such as functional buttons for login, comment, and share, side margins, banners, and trailers, to obtain the body content of the web page.

Specifically, text cleaning can use tools to remove HTML source code such as <div>, <p>, and <br> to obtain natural language text, and then performs full-width/half-width conversion, removal of invisible characters, removal of emoticons, Traditional/Simplified Chinese conversion, and similar steps to wash out useless, meaningless characters and output natural language text.
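A minimal sketch of the text cleaning step, using only the Python standard library. The function name `clean_text` is hypothetical; Unicode NFKC normalization is used here as one way to fold full-width characters to half-width, and emoticon removal and Traditional/Simplified conversion are left out because they need extra resources.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")  # crude matcher for tags such as <div>, <p>, <br>


def clean_text(raw):
    """Strip HTML residue and normalize characters (illustrative only)."""
    text = TAG_RE.sub(" ", raw)                 # drop tags, keep their text
    text = html.unescape(text)                  # &amp; -> &, &lt; -> <, ...
    text = unicodedata.normalize("NFKC", text)  # full-width -> half-width
    # Remove invisible control/format characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace
```

A regular expression is enough for a sketch, but real pages with scripts, comments, or malformed markup are better handled by a proper HTML parser.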

Specifically, language identification can simply check the code block to which each character belongs; in a Chinese article, for example, most characters fall in the Chinese code block and the ASCII block. Alternatively, language identification can be treated as a classification problem and solved with machine learning. Once the language of a text has been determined, the system dispatches the text to the word segmentation tool that corresponds to that language.
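The code-block check can be sketched as below. The function name `guess_language`, the 0.5 ratio, and the restriction to the CJK Unified Ideographs block are assumptions chosen for illustration. Telling English from Spanish needs more than a code-block check (for example, the classifier the text mentions), so the sketch only separates Chinese from Latin-script text.

```python
def guess_language(text, zh_ratio=0.5):
    """Return "zh", "latin", or "unknown" from character code blocks."""
    zh = latin = 0
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":       # CJK Unified Ideographs block
            zh += 1
        elif ch.isascii() and ch.isalpha():  # ASCII letters
            latin += 1
    total = zh + latin
    if total == 0:
        return "unknown"
    return "zh" if zh / total >= zh_ratio else "latin"
```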

Specifically, word segmentation uses a dedicated Chinese, English, or Spanish segmentation tool according to the language identification result. Each tool outputs a segmentation result containing words, parts of speech, and named entity words, which is then converted to a unified representation.

Specifically, in entity attribute extraction, the system first uses NLP tools to resolve pronoun references as far as possible, replacing each pronoun with the name of the entity it refers to, and then extracts information from the text using rules and an attribute extraction tool. Rule-based information extraction consists of two parts: building the extraction rules and applying them. The attribute extraction tool first splits each sentence into a series of clauses and then shortens each clause as much as possible to obtain shorter sentence fragments. These fragments can be divided into triples of entity, attribute, and attribute value. The triples serve as the entity attribute table of the document, which the system finally outputs.

After text preprocessing and NLP analysis, the resulting entity attribute table of the document can be used directly as the input to entity disambiguation. Text preprocessing and NLP analysis are indispensable to entity disambiguation, but the specific techniques may be replaced by NLP processing methods other than those described above. Configuring these steps helps optimize subsequent processing speed, provides high-quality input text for the disambiguation process, lowers the input requirements of the system, and strengthens scalability while widening the range of data that can be processed.

Step S102, determining candidate entities from the entity library according to the semantic features of the entity to be disambiguated.

In this embodiment, the semantic features of the entity to be disambiguated include name-form similarity, abbreviation information, and reference features. Candidate entities are determined in different ways for different languages. One or more candidate entities may be determined from the entity library; when there are several, they form a candidate entity set. Name-form similarity is the similarity in form between entity names. For example, the name Zhang San in article A and the name Zhang San in article B have name-form similarity, as do the English names "jack" and "jackie", and "trump" and "donald trump". Abbreviation information may include Chinese and English abbreviations; for example, the abbreviation of Hunan Province is Xiang, and a media access control address can be abbreviated as MAC address. A reference feature can be understood as the reference to which a pronoun in the document corresponds.
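Candidate screening by name-form similarity and abbreviation information might look like the following sketch. The names `select_candidates` and `name_similarity`, the `abbreviations` mapping (short form to a set of full names), and the 0.6 cutoff are all hypothetical; `difflib.SequenceMatcher` stands in for whatever similarity measure an implementation would use, and reference (pronoun) features are not modeled.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Case-insensitive similarity in [0, 1] between two name strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def select_candidates(mention, entity_library, abbreviations, min_sim=0.6):
    """Return the library names that plausibly refer to `mention`."""
    # Expand the mention with the full names of any known abbreviation.
    forms = {mention} | abbreviations.get(mention, set())
    candidates = []
    for name in entity_library:
        tokens = name.lower().split()
        for form in forms:
            # Accept a similar surface form, or a mention that is one whole
            # token of the library name ("trump" in "Donald Trump").
            if name_similarity(form, name) >= min_sim or form.lower() in tokens:
                candidates.append(name)
                break
    return candidates
```

With a library of `["Donald Trump", "Jackie", "Hunan Province", "Media Access Control address"]`, the mention "jack" selects "Jackie" through surface similarity, while "trump" selects "Donald Trump" through the token rule.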

Step S103, calculating the gene matching degree between a candidate entity and the entity to be disambiguated, and, when the gene matching degree exceeds a preset threshold, determining that the genes of the candidate entity and the entity to be disambiguated match.

In this embodiment, when a candidate entity set exists, the candidate entities in the set can be scanned and the gene matching degree between each candidate entity and the entity to be disambiguated calculated. When the gene matching degree exceeds the preset threshold, the candidate entity is determined to match the entity to be disambiguated. A larger preset threshold means higher matching precision and a smaller one means lower matching precision; the threshold can be set according to actual needs. When a candidate entity matches the entity to be disambiguated, the gene of the entity to be disambiguated is merged with the gene of the candidate entity, so that the entity accumulates information. The remaining candidate entities in the set are then scanned until no matchable candidate entity remains in the candidate entity set.
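The scan, match, and merge loop of step S103 can be sketched as follows. The function `link_mention` is hypothetical, and the scoring and merging functions are passed in as parameters so that the loop stays independent of how genes are represented.

```python
def link_mention(mention_gene, candidates, score_fn, merge_fn, threshold):
    """Repeatedly scan candidates, merging every one whose score exceeds
    `threshold`, until a full pass finds no further match.

    candidates: dict mapping candidate name -> candidate gene.
    Returns (names of matched candidates, accumulated gene).
    """
    remaining = dict(candidates)
    matched = []
    progress = True
    while progress:
        progress = False
        for name, gene in list(remaining.items()):
            if score_fn(mention_gene, gene) > threshold:
                mention_gene = merge_fn(mention_gene, gene)  # accumulate
                matched.append(name)
                del remaining[name]
                progress = True
    return matched, mention_gene
```

Because the gene grows after each merge, a candidate rejected in an early pass can still match in a later one, which is why the loop rescans until a pass produces no match.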

The entity disambiguation method provided by this embodiment constructs the gene of the entity to be disambiguated, calculates the gene matching degree between the candidate entity and the entity to be disambiguated, and disambiguates the entity words in a text according to the gene matching degree. It performs comparatively well on large-scale data sources, improves the precision of entity disambiguation on Internet data sources, and improves the effect of entity disambiguation.

Referring to Fig. 2, calculating the gene matching degree between the candidate entity and the entity to be disambiguated in step S103 comprises the following steps:

Step S1031, obtaining the co-occurrence entity gene matching degree between the candidate entity and the entity to be disambiguated.

Step S1032, obtaining the attribute gene matching degree between the candidate entity and the entity to be disambiguated.

Step S1033, calculating the gene matching degree between the candidate entity and the entity to be disambiguated from the co-occurrence entity gene matching degree and the attribute gene matching degree, using the formula:

score_g(m, e) = α * score_w(m, e) + β * score_p(m, e)
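The weighted combination and the threshold test can be written directly. The function names and the default weights of 0.5 are illustrative assumptions; the text does not state values for α, β, or the threshold.

```python
def gene_score(score_w, score_p, alpha=0.5, beta=0.5):
    """score_g = alpha * score_w + beta * score_p."""
    return alpha * score_w + beta * score_p


def genes_match(score_w, score_p, threshold, alpha=0.5, beta=0.5):
    """True when the combined gene matching degree exceeds the threshold."""
    return gene_score(score_w, score_p, alpha, beta) > threshold
```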

Referring to Fig. 3, step S1031 comprises the following steps:

Step S10311, determining the co-occurrence entity words of the entity to be disambiguated from pre-stored documents, and determining the co-occurrence entity words of the candidate entity from the pre-stored documents;

Step S10312, obtaining the gene entity word set of the candidate entity;

Step S10313, calculating the co-occurrence entity gene matching degree from three quantities: the term frequency of the overlap between the co-occurrence entity words of the entity to be disambiguated and those of the candidate entity, relative to the total number of pre-stored entity words; the term frequency of the same overlap relative to the gene entity word set of the candidate entity; and the inverse document frequency of the overlap.

In this embodiment, the gene entity word set is the set of entity words related to an entity. For the entity Zhang San, for example, the gene entity word set may include words such as Li Si, Wang Group, and Wang Village.

More specifically, the calculation borrows the idea of TF-IDF (term frequency-inverse document frequency), a weighting technique commonly used in information retrieval and data mining, where TF is the term frequency and IDF is the inverse document frequency. All entities are treated as the documents D, and the entity words of an entity's co-occurrence entity gene are treated as the terms T. The tf of the entity words in the co-occurrence entity gene of the entity to be disambiguated and the tf of the entity words in the co-occurrence entity gene of the candidate entity are normalized. Normalized TF is sensitive to the number of gene entity words, and the co-occurrence entity gene of a candidate entity is generally long, so the weights of the co-occurrence entity words on which the genes overlap are reinforced before normalization, which strengthens weak signals. The idf calculation is cached to improve calculation speed. Finally, the entity word gene matching degree is derived from the tf and idf of the overlapping part of the co-occurrence entity word genes. Specifically, the co-occurrence entity gene matching degree score_w(m, e) can be calculated according to the following formula:

Wherein, for the entity to be disambiguated m and the candidate entity e, the normalized tf value of an overlapping gene entity word w is calculated as:

Wherein weit(w, e) denotes the weight of gene entity word w of entity e, weit(e, w) denotes the weight of gene entity word e of entity w, and W(m) is the gene entity word set of entity m. Note that the formula for the co-occurrence entity gene matching degree score_w(m, e) does not distinguish the entity to be disambiguated from the candidate entity: m and e are both treated simply as entities and are interchangeable. tf_norm(w; m*, e) denotes the normalized value of word w relative to the total for m after the overlap between m and e has been reinforced. tf_norm(w; e*, m) is calculated in the same way as tf_norm(w; m*, e) and is not described again here.

The inverse document frequency idf(w) of a co-occurrence entity word in the co-occurrence entity gene is calculated as:

Wherein E is the entity library, W(e) is the gene entity word set of entity e, and Z is a compensation constant, which is generally small.
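The formulas for score_w, tf_norm, and idf are not reproduced in this text, so the sketch below is only one plausible TF-IDF-style reading built from the symbol definitions above, not the patented calculation. It assumes a smoothed logarithmic idf with the constant Z, computes normalized tf as a word's weight divided by the entity's total gene weight, omits the overlap reinforcement, and uses `functools.lru_cache` as a stand-in for the caching mechanism. All function names are hypothetical.

```python
import math
from functools import lru_cache


def make_idf(entity_library, z=1.0):
    """entity_library: dict entity -> gene entity word set W(e).

    Returns a cached idf(w). The smoothed form log((|E| + Z) / (df + Z))
    is an assumption, not the source's formula.
    """
    n = len(entity_library)

    @lru_cache(maxsize=None)
    def idf(word):
        df = sum(1 for words in entity_library.values() if word in words)
        return math.log((n + z) / (df + z))

    return idf


def cooccurrence_score(gene_m, gene_e, idf):
    """gene_m, gene_e: dict entity word -> weight. Symmetric in m and e."""
    total_m, total_e = sum(gene_m.values()), sum(gene_e.values())
    if not total_m or not total_e:
        return 0.0
    overlap = gene_m.keys() & gene_e.keys()
    # Normalized tf on each side, weighted by the idf of the shared word.
    return sum(
        (gene_m[w] / total_m) * (gene_e[w] / total_e) * idf(w)
        for w in overlap
    )
```

Words shared by many entities receive a low idf and therefore contribute little, which matches the motivation for borrowing TF-IDF in the first place.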

Referring to Fig. 4, step S1032 comprises the following steps:

Step S10321, determining the attribute names of the entity to be disambiguated and the attribute names of the candidate entity;

Step S10322, calculating the attribute gene matching degree from the overlapping attributes between the attribute names of the entity to be disambiguated and those of the candidate entity, and from the weights of the overlapping attributes.

In this embodiment, attribute names are aligned when the gene is constructed, so attribute gene matching only needs to compare the attribute values under each attribute name. The entity attribute gene matching degree is a weighted sum over the overlapping attributes. When attribute values are compared, the strings are not required to be identical; instead, their similarity is calculated with an edit-distance-based similarity algorithm. When the weighted-sum model for attribute matching is built, fuzzy matching can be tuned for the values of certain special attributes, and the weights of strongly discriminative attributes, such as identity card number or spouse, can be increased. Specifically, the attribute gene matching degree score_p(m, e) can be calculated according to the following formula:

Wherein PN(x) denotes the attribute names of all attributes of entity x, pv(x, pn) denotes all attribute values of entity x for attribute name pn, weit_p(pn) denotes the weight of attribute name pn, and I_pv(v_i, v_j) ∈ {0, 1} is an indicator function that indicates whether attribute values v_i and v_j are the same. I_pv(v_i, v_j) is calculated as follows:

Wherein sim_pv(v_i, v_j) denotes the normalized string similarity, θ_p is a high threshold close to 1, and I_pfix(v_i, v_j) and I_sfix(v_i, v_j) indicate whether one of v_i and v_j is a prefix or a suffix, respectively, of the other.
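Since the formulas for score_p and I_pv are not reproduced in this text, the following is a sketch of one reading consistent with the definitions above: two values count as the same when their normalized edit-distance similarity reaches θ_p or when one is a prefix or suffix of the other, and each overlapping attribute name with at least one matching value pair contributes its weight. The function names, the default θ_p of 0.9, and the default weight of 1.0 are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Normalized string similarity in [0, 1] (sim_pv in the text)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def values_match(v_i, v_j, theta_p=0.9):
    """Indicator I_pv: near-identical, or one value prefixes/suffixes the other."""
    if not v_i or not v_j:
        return v_i == v_j
    if similarity(v_i, v_j) >= theta_p:
        return True
    return (v_i.startswith(v_j) or v_j.startswith(v_i)
            or v_i.endswith(v_j) or v_j.endswith(v_i))


def attribute_score(attrs_m, attrs_e, weights, theta_p=0.9):
    """Weighted sum over attribute names shared by both entities.

    attrs_*: dict attribute name -> set of values; weights: name -> weight.
    """
    score = 0.0
    for pn in attrs_m.keys() & attrs_e.keys():
        if any(values_match(a, b, theta_p)
               for a in attrs_m[pn] for b in attrs_e[pn]):
            score += weights.get(pn, 1.0)
    return score
```

The prefix and suffix tests let "Hunan" match "Hunan Province" even though their edit-distance similarity is well below θ_p.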

Referring to Fig. 5, after step S103 the method further comprises:

Step S104, merging the genes of the candidate entity and the entity to be disambiguated.

Step S105, determining, according to the gene matching degree between the candidate entity and the entity to be disambiguated, the target entity that matches the entity to be disambiguated, and performing knowledge fusion on the corresponding entity in the knowledge base according to the knowledge of the target entity.

In this embodiment, when a candidate entity matches the entity to be disambiguated, the gene of the entity to be disambiguated is merged with the gene of the candidate entity, so that the entity accumulates information. The entity words and weights of the co-occurrence entity gene and the entity attribute gene can be merged separately, and a growth window is set on the co-occurrence entity gene to reserve room for important words that appear later. In this embodiment, as the entity words in texts are disambiguated, the entity library is progressively refined.
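Gene merging can be sketched as below. The text does not specify how the growth window works, so this sketch assumes it is a cap on the number of co-occurrence entity words retained after merging, set larger than the current gene to leave room for growth; `merge_genes` and its `window` parameter are hypothetical.

```python
from collections import Counter


def merge_genes(cooc_a, attrs_a, cooc_b, attrs_b, window=100):
    """Merge two genes: add co-occurrence weights, union attribute values.

    Only the `window` heaviest co-occurrence words are kept (an assumed
    interpretation of the growth window).
    """
    merged = Counter(cooc_a)
    merged.update(cooc_b)                        # sum weights word by word
    kept = Counter(dict(merged.most_common(window)))
    attrs = {name: set(values) for name, values in attrs_a.items()}
    for name, values in attrs_b.items():
        attrs.setdefault(name, set()).update(values)
    return kept, attrs
```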

In this embodiment, the knowledge base can be updated according to the matching and disambiguation result. There are broadly three cases: no entity is matched, one entity is matched, and multiple entities are matched. Attribute fusion is performed on the knowledge base separately for each of the three cases.

The entity disambiguation method provided by this embodiment constructs the gene of the entity to be disambiguated, calculates the gene matching degree between the candidate entity and the entity to be disambiguated, and disambiguates the entity words in a text according to the gene matching degree. Disambiguation accuracy on large-scale data sources is comparatively high, and the precision of entity disambiguation on Internet data sources is improved, so the overall effect of entity disambiguation is improved. During disambiguation, the linked entities and the knowledge base are progressively refined, which helps improve data-processing efficiency for target analysis in massive text, knowledge base construction, question answering systems, and the like.

Fig. 6 shows a kind of structural schematic diagram of entity disambiguator 600 provided in an embodiment of the present invention, for the ease of saying It is bright, it illustrates only and implements relevant part in the present invention.The entity disambiguator 600, comprising:

Module 601 is constructed, for constructing the gene of entity to be disambiguated.

Screening module 602, for determining candidate's entity from entity library according to the semantic feature of the entity to be disambiguated, In, the semantic feature of the entity to be disambiguated includes name the Formal Similarity, abbreviation information and reference feature.

In this embodiment, different approaches are used to determine candidate entities from the entity library for different languages. One or more candidate entities may be determined; when there are multiple candidate entities, they may form a candidate entity set. It can be understood that name form similarity is the similarity in form between entity nouns. For example, the name Zhang San in article A and the name Zhang San in article B have name form similarity, and the English names "jack" and "jackie", or "trump" and "donald trump", are similar. Abbreviation information may include Chinese abbreviation information and English abbreviation information. For example, the abbreviation of Hunan Province is Hunan, and a media access control address may be abbreviated as a MAC address. A reference feature can be understood as the reference feature corresponding to a pronoun in the document.
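A minimal sketch of candidate screening by name form similarity and abbreviation information is given below. It uses the `SequenceMatcher` class from the Python standard library as a stand-in similarity measure; the abbreviation table, the threshold value, and the function names are hypothetical, and reference features (pronoun resolution) are omitted.

```python
from difflib import SequenceMatcher

# Hypothetical abbreviation table: abbreviation -> full name (lower case).
ABBREVIATIONS = {"mac address": "media access control address"}


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for name form similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def screen_candidates(mention: str, entity_names, min_similarity: float = 0.5):
    """Return entity names that match the mention by abbreviation or by form."""
    expanded = ABBREVIATIONS.get(mention.lower(), mention).lower()
    candidates = []
    for name in entity_names:
        same_after_expansion = name.lower() == expanded
        similar_form = name_similarity(mention, name) >= min_similarity
        if same_after_expansion or similar_form:
            candidates.append(name)
    return candidates
```

With this measure, "jack" and "jackie" score 0.8, and "trump" and "donald trump" score about 0.59, so both pairs pass a threshold of 0.5.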

A matching module 603, configured to calculate the gene matching degree between the candidate entity and the entity to be disambiguated, and, when the gene matching degree exceeds a preset threshold, to determine that the gene of the candidate entity matches the gene of the entity to be disambiguated.

The entity disambiguator provided by the embodiment of the present invention constructs the gene of the entity to be disambiguated, calculates the gene matching degree between the candidate entity and the entity to be disambiguated, and disambiguates the entity words in the text according to the gene matching degree. Entity disambiguation performs relatively well on large-scale data sources, which improves the precision of entity disambiguation on Internet data sources and improves the disambiguation effect.

Referring to Fig. 7, the matching module 603 includes:

A first acquisition submodule 6031, configured to obtain the co-occurrence entity gene matching degree and the attribute gene matching degree between the candidate entity and the entity to be disambiguated.

A first calculation submodule 6032, configured to calculate the gene matching degree between the candidate entity and the entity to be disambiguated according to the co-occurrence entity gene matching degree and the attribute gene matching degree, using the following formula:

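score_g(m, e) = α * score_w(m, e) + β * score_p(m, e);

where score_g(m, e) is the gene matching degree, score_w(m, e) is the co-occurrence entity gene matching degree, score_p(m, e) is the attribute gene matching degree, and α and β are weights.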
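The weighted combination and the threshold test performed by the matching module can be sketched as follows. The default values of `alpha`, `beta`, and `threshold` are placeholders chosen for illustration; the text does not give concrete values.

```python
def gene_matching_degree(score_w: float, score_p: float,
                         alpha: float = 0.5, beta: float = 0.5) -> float:
    """score_g = alpha * score_w + beta * score_p."""
    return alpha * score_w + beta * score_p


def genes_match(score_w: float, score_p: float, threshold: float = 0.6,
                alpha: float = 0.5, beta: float = 0.5) -> bool:
    """True when the gene matching degree exceeds the preset threshold."""
    return gene_matching_degree(score_w, score_p, alpha, beta) > threshold
```

Raising `alpha` relative to `beta` makes the decision depend more on co-occurring entity words than on attribute agreement.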

Referring to Fig. 8, the first acquisition submodule 6031 includes:

A first determination unit 60311, configured to determine the co-occurrence entity words of the entity to be disambiguated from pre-stored documents, and to determine the co-occurrence entity words of the candidate entity from the pre-stored documents;

A first acquisition unit 60312, configured to obtain the gene entity word set of the candidate entity;

A first calculation unit 60313, configured to calculate the co-occurrence entity gene matching degree according to: the word frequency of the overlapping part of the co-occurrence entity words of the entity to be disambiguated and the co-occurrence entity words of the candidate entity, relative to the total number of pre-stored entity words; the word frequency of that overlapping part relative to the gene entity word set of the candidate entity; and the inverse document frequency of that overlapping part.

In this embodiment, the gene entity word set is the set of entity words related to an entity. For example, for the entity Zhang San, the gene entity word set may include words such as Li Si, Wang's Group, and Wangjia Village.
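The three quantities used by the first calculation unit can be combined in a TF-IDF style, as sketched below. This passage names the ingredients but not how they are combined, so the product-and-sum used here, the smoothing term in the inverse document frequency, and all parameter names are assumptions made for illustration.

```python
import math


def cooccurrence_matching_degree(mention_words, candidate_gene_counts,
                                 corpus_counts, doc_freq, num_docs):
    """Score the overlap between two sets of co-occurrence entity words.

    mention_words:         co-occurrence entity words of the entity to be disambiguated
    candidate_gene_counts: word -> count in the candidate's gene entity word set
    corpus_counts:         word -> count among all pre-stored entity words
    doc_freq:              word -> number of pre-stored documents containing the word
    num_docs:              total number of pre-stored documents
    """
    overlap = set(mention_words) & set(candidate_gene_counts)
    corpus_total = sum(corpus_counts.values())
    gene_total = sum(candidate_gene_counts.values())
    if not overlap or corpus_total == 0 or gene_total == 0:
        return 0.0

    score = 0.0
    for word in overlap:
        tf_corpus = corpus_counts.get(word, 0) / corpus_total
        tf_gene = candidate_gene_counts[word] / gene_total
        # Assumption: add-one smoothing avoids division by zero for unseen words.
        idf = math.log(num_docs / (1 + doc_freq.get(word, 0)))
        # Assumption: the three factors are multiplied and summed over the overlap.
        score += tf_corpus * tf_gene * idf
    return score
```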

Referring to Fig. 9, the first acquisition submodule 6031 includes:

A second determination unit 60314, configured to determine the attribute names of the entity to be disambiguated and the attribute names of the candidate entity;

A second calculation unit 60315, configured to calculate the attribute gene matching degree according to the overlapping attributes between the attribute names of the entity to be disambiguated and the attribute names of the candidate entity, and the weights of the overlapping attributes.

Here, PN(x) denotes the attribute names of all attributes of entity x, pv(x, pn) denotes all attribute values of entity x for attribute name pn, weit_p(pn) denotes the weight of attribute name pn, and I_pv(v_i, v_j) ∈ {0, 1} is an indicator function that indicates whether the attribute values v_i and v_j are identical.
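The symbols defined above suggest a computation of the following kind. The formula for the attribute gene matching degree is not reproduced in this passage, so the aggregation below (the weight of each shared attribute name counts once if any pair of values is identical, normalized by the total weight of the shared names) is an assumption, as are the function names and the default weight.

```python
def indicator(v_i, v_j) -> int:
    """I_pv(v_i, v_j): 1 when the two attribute values are identical, else 0."""
    return 1 if v_i == v_j else 0


def attribute_matching_degree(mention_attrs, candidate_attrs, weights,
                              default_weight: float = 1.0) -> float:
    """Weighted agreement over attribute names shared by both entities.

    mention_attrs, candidate_attrs: {attribute name -> set of values}, i.e. pv(x, pn)
    weights:                        {attribute name -> weight}, i.e. weit_p(pn)
    """
    shared_names = set(mention_attrs) & set(candidate_attrs)  # PN(m) ∩ PN(e)
    total_weight = 0.0
    matched_weight = 0.0
    for pn in shared_names:
        weight = weights.get(pn, default_weight)
        total_weight += weight
        # Assumption: one identical value pair is enough for the attribute to agree.
        agrees = max(
            (indicator(v_i, v_j)
             for v_i in mention_attrs[pn]
             for v_j in candidate_attrs[pn]),
            default=0,
        )
        matched_weight += weight * agrees
    return matched_weight / total_weight if total_weight else 0.0
```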

Referring to Fig. 10, the entity disambiguator 600 further includes:

A first fusion module 604, configured to merge the gene of the candidate entity with the gene of the entity to be disambiguated.

A second fusion module 605, configured to determine, according to the gene matching degree between the candidate entity and the entity to be disambiguated, the target entity that matches the entity to be disambiguated, and to perform knowledge fusion on the corresponding entity in the knowledge base according to the knowledge of the target entity.

The entity disambiguation method provided by the embodiment of the present invention constructs the gene of the entity to be disambiguated, calculates the gene matching degree between the candidate entity and the entity to be disambiguated, and disambiguates the entity words in the text according to the gene matching degree. Entity disambiguation performs relatively well on large-scale data sources, which improves the precision of entity disambiguation on Internet data sources and improves the disambiguation effect. During entity disambiguation, the linked entities and the knowledge base are gradually improved, which helps improve data-processing efficiency in target analysis over massive text, knowledge base construction, question answering systems, and the like.

An embodiment of the present invention provides a computer device. The computer device includes a processor, and the processor, when executing a computer program stored in a memory, implements the steps of the entity disambiguation method provided by the above method embodiments.

Illustratively, the computer program may be divided into one or more modules, which are stored in the memory and executed by the processor to carry out the present invention. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer program in the computer device. For example, the computer program may be divided into the steps of the entity disambiguation method provided by the above method embodiments.

Those skilled in the art will understand that the above description of the computer device is only an example and does not limit the computer device. The computer device may include more or fewer components than described above, may combine certain components, or may use different components; for example, it may include input/output devices, network access devices, buses, and the like.

The processor may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor is the control center of the computer device and connects the various parts of the entire computer device through various interfaces and lines.

The memory may be used to store the computer program and/or modules. The processor implements the various functions of the computer device by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playback function or an image playback function), and the like; the data storage area may store data created according to the use of the mobile phone (such as audio data or a phone book), and the like. In addition, the memory may include a high-speed random access memory, and may also include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.

If the integrated modules/units of the computer device are implemented in the form of software functional units and are sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the above method embodiments of the present invention may also be completed by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium, and, when executed by a processor, may implement the steps of each of the above entity disambiguation method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include: any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, an electrical signal, a software distribution medium, and the like.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. An entity disambiguation method, characterized in that the entity disambiguation method comprises:

constructing a gene of an entity to be disambiguated, wherein the gene comprises a co-occurrence entity word gene and an entity attribute gene, the co-occurrence entity word gene comprises co-occurrence entity words and co-occurrence degrees, and the entity attribute gene comprises attributes of the entity to be disambiguated;

determining a candidate entity from an entity library according to semantic features of the entity to be disambiguated, wherein the semantic features of the entity to be disambiguated comprise name form similarity, abbreviation information, and reference features;

calculating a gene matching degree between the candidate entity and the entity to be disambiguated, and, when the gene matching degree exceeds a preset threshold, determining that the gene of the candidate entity matches the gene of the entity to be disambiguated.

2. The entity disambiguation method according to claim 1, characterized in that calculating the gene matching degree between the candidate entity and the entity to be disambiguated comprises:

calculating the gene matching degree between the candidate entity and the entity to be disambiguated according to the co-occurrence entity gene matching degree and the attribute gene matching degree, using the following formula:

score_g(m, e) = α * score_w(m, e) + β * score_p(m, e)

where score_g(m, e) is the gene matching degree, score_w(m, e) is the co-occurrence entity gene matching degree, score_p(m, e) is the attribute gene matching degree, and α and β are weights.

3. The entity disambiguation method according to claim 2, characterized in that obtaining the co-occurrence entity gene matching degree between the candidate entity and the entity to be disambiguated comprises:

determining the co-occurrence entity words of the entity to be disambiguated from pre-stored documents, and determining the co-occurrence entity words of the candidate entity from the pre-stored documents;

obtaining the gene entity word set of the candidate entity;

calculating the co-occurrence entity gene matching degree according to: the word frequency of the overlapping part of the co-occurrence entity words of the entity to be disambiguated and the co-occurrence entity words of the candidate entity, relative to the total number of pre-stored entity words; the word frequency of that overlapping part relative to the gene entity word set of the candidate entity; and the inverse document frequency of that overlapping part.

4. The entity disambiguation method according to claim 2, characterized in that obtaining the attribute gene matching degree between the candidate entity and the entity to be disambiguated comprises:

calculating the attribute gene matching degree according to the overlapping attributes between the attribute names of the entity to be disambiguated and the attribute names of the candidate entity, and the weights of the overlapping attributes.

5. The entity disambiguation method according to claim 1, characterized in that, after determining that the gene of the candidate entity matches the gene of the entity to be disambiguated, the entity disambiguation method further comprises:

determining, according to the gene matching degree between the candidate entity and the entity to be disambiguated, the target entity that matches the entity to be disambiguated, and performing knowledge fusion on the corresponding entity in the knowledge base according to the knowledge of the target entity.

6. An entity disambiguator, characterized by comprising:

a construction module, configured to construct a gene of an entity to be disambiguated, wherein the gene comprises a co-occurrence entity word gene and an entity attribute gene, the co-occurrence entity word gene comprises co-occurrence entity words and co-occurrence degrees, and the entity attribute gene comprises attributes of the entity to be disambiguated;

a screening module, configured to screen a candidate entity from an entity library according to semantic features of the entity to be disambiguated, wherein the semantic features of the entity to be disambiguated comprise name form similarity, abbreviation information, and reference features;

a matching module, configured to calculate a gene matching degree between the candidate entity and the entity to be disambiguated, and, when the gene matching degree exceeds a preset threshold, to determine that the gene of the candidate entity matches the gene of the entity to be disambiguated.

7. The entity disambiguator according to claim 6, characterized in that the matching module comprises:

a first acquisition submodule, configured to obtain the co-occurrence entity gene matching degree and the attribute gene matching degree between the candidate entity and the entity to be disambiguated;

a first calculation submodule, configured to calculate the gene matching degree between the candidate entity and the entity to be disambiguated according to the co-occurrence entity gene matching degree and the attribute gene matching degree, using the following formula:

score_g(m, e) = α * score_w(m, e) + β * score_p(m, e)

8. The entity disambiguator according to claim 7, characterized in that the first acquisition submodule comprises:

a first determination unit, configured to determine the co-occurrence entity words of the entity to be disambiguated from pre-stored documents, and to determine the co-occurrence entity words of the candidate entity from the pre-stored documents;

a first calculation unit, configured to calculate the co-occurrence entity gene matching degree according to: the word frequency of the overlapping part of the co-occurrence entity words of the entity to be disambiguated and the co-occurrence entity words of the candidate entity, relative to the total number of pre-stored entity words; the word frequency of that overlapping part relative to the gene entity word set of the candidate entity; and the inverse document frequency of that overlapping part.

9. The entity disambiguator according to claim 7, characterized in that the first acquisition submodule comprises:

a second calculation unit, configured to calculate the attribute gene matching degree according to the overlapping attributes between the attribute names of the entity to be disambiguated and the attribute names of the candidate entity, and the weights of the overlapping attributes.

10. The entity disambiguator according to claim 6, characterized by further comprising:

a second fusion module, configured to determine, according to the gene matching degree between the candidate entity and the entity to be disambiguated, the target entity that matches the entity to be disambiguated, and to perform knowledge fusion on the corresponding entity in the knowledge base according to the knowledge of the target entity.

11. A computer device, characterized in that the computer device comprises a processor, and the processor, when executing a computer program stored in a memory, implements the steps of the entity disambiguation method according to any one of claims 1-5.

12. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the entity disambiguation method according to any one of claims 1-5.