CN109635297B

CN109635297B - Entity disambiguation method and device, computer device and computer storage medium

Info

Publication number: CN109635297B
Application number: CN201811508089.7A
Authority: CN
Inventors: 段炼; 周忠诚
Original assignee: Hunan Xinghan Shuzhi Technology Co ltd
Current assignee: Hunan Xinghan Shuzhi Technology Co ltd
Priority date: 2018-12-11
Filing date: 2018-12-11
Publication date: 2022-01-04
Anticipated expiration: 2038-12-11
Also published as: CN109635297A

Abstract

The invention is applicable to the technical field of the Internet and discloses an entity disambiguation method, an entity disambiguation device, a computer device and a computer storage medium. The method comprises the following steps: constructing a gene of an entity to be disambiguated; determining candidate entities from an entity library according to semantic features of the entity to be disambiguated, wherein the semantic features comprise name form similarity, abbreviation information and reference features; and calculating the gene matching degree between a candidate entity and the entity to be disambiguated, and determining that the candidate entity matches the gene of the entity to be disambiguated when the gene matching degree exceeds a preset threshold. The entity disambiguation method provided by the invention can improve the effect of entity disambiguation, gradually refines the linked entities and the knowledge base during the entity disambiguation process, and helps improve data processing efficiency for target analysis, knowledge base construction, question-answering systems and the like over massive texts.

Description

Entity disambiguation method and device, computer device and computer storage medium

Technical Field

The invention belongs to the technical field of internet, and particularly relates to an entity disambiguation method, an entity disambiguation device, a computer device and a computer storage medium.

Background

Entity names are often ambiguous in natural language processing; for example, a person's name in a text may refer to multiple real-world entities. This ambiguity arises from the freedom, diversity and ambiguity of natural language expression. At present, natural language processing (NLP) research focuses on machine translation, information retrieval, text summarization, question answering, information extraction, topic modeling and sentiment mining. Traditional NLP technology based on grammatical analysis has developed slowly and produced few breakthroughs. With innovations such as deep learning, artificial intelligence has gained wide attention in the NLP field. Because natural language exhibits synonymy, near-synonymy, polysemy and the like, natural language analysis is difficult, and entity disambiguation is therefore a key problem in natural language processing. The main purpose of entity disambiguation is to identify the ambiguous entity names in a sentence and to assign each one the meaning that fits its context. Common entity disambiguation methods require a pre-existing, information-rich knowledge base, perform poorly on large-scale data sources, and achieve low accuracy on Internet data sources.

Disclosure of Invention

The embodiment of the invention provides an entity disambiguation method, an entity disambiguation device, a computer device and a computer storage medium, and aims to solve the problems that a knowledge base with rich information is required to be pre-stored, the performance of the method on a large-scale data source is poor, and the accuracy of entity disambiguation on an internet data source is low in the prior art.

The invention is realized in such a way that an entity disambiguation method comprises the following processes:

constructing a gene of an entity to be disambiguated, the gene comprising a co-occurrence entity word gene and an entity attribute gene, wherein the co-occurrence entity word gene comprises co-occurrence entity words and their co-occurrence degrees, and the entity attribute gene comprises the attributes of the entity to be disambiguated;

determining candidate entities from an entity library according to the semantic features of the entity to be disambiguated, wherein the semantic features comprise name form similarity, abbreviation information and reference features;

calculating the gene matching degree of the candidate entity and the entity to be disambiguated, and determining that the candidate entity is matched with the gene of the entity to be disambiguated under the condition that the gene matching degree exceeds a preset threshold value.

Further, the calculating the gene matching degree of the candidate entity and the entity to be disambiguated comprises the following processes:

acquiring the co-occurrence entity gene matching degree of the candidate entity and the entity to be disambiguated;

obtaining the matching degree of the attribute genes of the candidate entity and the entity to be disambiguated;

calculating the gene matching degree of the candidate entity and the entity to be disambiguated according to the co-occurrence entity gene matching degree and the attribute gene matching degree, wherein the calculation formula is as follows:

score_g(m, e) = α * score_w(m, e) + β * score_p(m, e)

wherein score_g(m, e) is the gene matching degree, score_w(m, e) is the co-occurrence entity gene matching degree, score_p(m, e) is the attribute gene matching degree, and α and β are weights.

Further, the obtaining of the co-occurrence entity gene matching degree of the candidate entity and the entity to be disambiguated comprises the following processes:

determining co-occurrence entity words of the entity to be disambiguated from a pre-stored document, and determining co-occurrence entity words of the candidate entity from the pre-stored document;

acquiring a gene entity word set of the candidate entity;

and calculating the co-occurrence entity gene matching degree according to the word frequency of the overlapped part of the co-occurrence entity word of the entity to be disambiguated and the co-occurrence entity word of the candidate entity relative to the total amount of the pre-stored entity words, the word frequency of the overlapped part of the co-occurrence entity word of the entity to be disambiguated and the co-occurrence entity word of the candidate entity relative to the gene entity word set of the candidate entity, and the inverse text frequency index of the overlapped part of the co-occurrence entity word of the entity to be disambiguated and the co-occurrence entity word of the candidate entity.

Further, the obtaining of the matching degree of the candidate entity and the attribute gene of the entity to be disambiguated includes the following processes:

determining the attribute name of the entity to be disambiguated and the attribute name of the candidate entity;

and calculating the matching degree of the attribute genes according to the overlapping attribute between the attribute name of the entity to be disambiguated and the attribute name of the candidate entity and the weight value of the overlapping attribute.

Further, after determining that the candidate entity matches the gene of the entity to be disambiguated, the entity disambiguation method further comprises:

fusing the candidate entity and the gene of the entity to be disambiguated;

and determining a target entity matched with the entity to be disambiguated according to the gene matching degree of the candidate entity and the entity to be disambiguated, and performing knowledge fusion on the corresponding entity in a knowledge base according to the knowledge of the target entity.

The present invention also provides an entity disambiguation apparatus comprising:

a construction module for constructing a gene of an entity to be disambiguated, the gene comprising a co-occurrence entity word gene and an entity attribute gene, wherein the co-occurrence entity word gene comprises co-occurrence entity words and their co-occurrence degrees, and the entity attribute gene comprises the attributes of the entity to be disambiguated;

a screening module for screening candidate entities from an entity library according to the semantic features of the entity to be disambiguated, wherein the semantic features comprise name form similarity, abbreviation information and reference features;

and the matching module is used for calculating the gene matching degree of the candidate entity and the entity to be disambiguated, and determining that the candidate entity is matched with the gene of the entity to be disambiguated under the condition that the gene matching degree exceeds a preset threshold value.

Further, the matching module comprises:

the first obtaining submodule is used for obtaining the co-occurrence entity gene matching degree and the attribute gene matching degree of the candidate entity and the entity to be disambiguated;

the first calculation submodule is used for calculating the gene matching degree of the candidate entity and the entity to be disambiguated according to the co-occurrence entity gene matching degree and the attribute gene matching degree, and the calculation formula is as follows:

score_g(m, e) = α * score_w(m, e) + β * score_p(m, e)

Further, the first obtaining sub-module includes:

a first determining unit, configured to determine co-occurrence entity words of the entity to be disambiguated from a pre-stored document, and determine co-occurrence entity words of the candidate entities from the pre-stored document;

a first obtaining unit, configured to obtain a gene entity word set of the candidate entity;

the first calculation unit is used for calculating the co-occurrence entity gene matching degree according to the word frequency of the overlapped part of the co-occurrence entity word of the entity to be disambiguated and the co-occurrence entity word of the candidate entity relative to the total amount of the pre-stored entity words, the word frequency of the overlapped part of the co-occurrence entity word of the entity to be disambiguated and the co-occurrence entity word of the candidate entity relative to the gene entity word set of the candidate entity, and the inverse text frequency index of the overlapped part of the co-occurrence entity word of the entity to be disambiguated and the co-occurrence entity word of the candidate entity.

Further, the first obtaining sub-module includes:

a second determining unit, configured to determine the attribute name of the entity to be disambiguated and the attribute name of the candidate entity;

and the second calculation unit is used for calculating the matching degree of the attribute genes according to the overlapping attribute between the attribute name of the entity to be disambiguated and the attribute name of the candidate entity and the weight value of the overlapping attribute.

Further, the entity disambiguation apparatus further comprises:

a first fusion module for fusing the candidate entity and the gene of the entity to be disambiguated;

and the second fusion module is used for determining a target entity matched with the entity to be disambiguated according to the gene matching degree of the candidate entity and the entity to be disambiguated and carrying out knowledge fusion on the corresponding entity in a knowledge base according to the knowledge of the target entity.

The invention also provides a computer device comprising a processor, the processor being configured to implement the steps of the entity disambiguation method described above when executing a computer program stored in a memory.

The invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the entity disambiguation method as described above.

The entity disambiguation method provided by the invention constructs the gene of the entity to be disambiguated, calculates the gene matching degree between the candidate entity and the entity to be disambiguated, and disambiguates the entity words in the text according to the gene matching degree. This gives higher disambiguation accuracy on large-scale data sources and improves the precision of entity disambiguation on Internet data sources, so that the overall effect of entity disambiguation is improved. The linked entities and the knowledge base are gradually refined during the entity disambiguation process, which improves data processing efficiency for target analysis, knowledge base construction, question-answering systems and the like over massive texts.

Drawings

FIG. 1 is a flow chart of an implementation of an entity disambiguation method provided by an embodiment of the present invention;

FIG. 2 is a flowchart of an implementation of calculating a gene matching degree between the candidate entity and the entity to be disambiguated according to the embodiment of the present invention;

FIG. 3 is a flow chart of the implementation of obtaining co-occurrence entity gene matching degree according to the present invention;

FIG. 4 is a flow chart of the implementation of obtaining the matching degree of the attribute genes according to the present invention;

FIG. 5 is a flow chart of an implementation of the entity disambiguation method provided in accordance with an embodiment of the present invention after determining that the candidate entity matches the gene of the entity to be disambiguated;

FIG. 6 is a schematic structural diagram of an entity disambiguation apparatus according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a matching module according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a first obtaining submodule provided in the embodiment of the present invention;

FIG. 9 is another schematic structural diagram of the first obtaining submodule according to the embodiment of the present invention;

FIG. 10 is a schematic structural diagram of another entity disambiguation apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Fig. 1 is a flowchart illustrating an entity disambiguation method according to an embodiment of the present invention. The entity disambiguation method comprises the following processes:

step S101, constructing a gene of an entity to be disambiguated.

In this embodiment, the gene comprises a co-occurrence entity word gene and an entity attribute gene, wherein the co-occurrence entity word gene comprises co-occurrence entity words and their co-occurrence degrees, and the entity attribute gene comprises the attributes of the entity to be disambiguated. The gene is the characteristic information used to identify an entity. For each entity word to be disambiguated, the co-occurrence frequency of the entity words that co-occur with it in the document is counted, and these co-occurring entity words serve as the co-occurrence entity gene of the entity; the attributes of the entity word to be disambiguated in the attribute table of the document serve as the attribute gene of the entity. In this embodiment, entity words with a weak co-occurrence degree can be strengthened according to relationship-type attributes to obtain the final co-occurrence entity gene. A relationship-type attribute represents an association between two entities, for example a spousal relationship, a colleague relationship or a parent-child relationship. In this embodiment, information extraction may be performed on the document to find the attributes of the entities in the document, the attributes are aligned to predefined attribute slots to normalize the attribute names, and the strongly informative attributes of an entity are used as its entity attribute gene.
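By way of illustration only, the gene described above may be sketched in Python as a pair of structures: a counter of co-occurring entity words and a map of attribute values. The names `Gene` and `build_gene`, and the choice of a sentence as the co-occurrence window, are assumptions made for this sketch and are not specified in the original text.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Gene:
    """Hypothetical container for the two gene parts."""
    cooccurrence: Counter = field(default_factory=Counter)  # entity word -> co-occurrence degree
    attributes: dict = field(default_factory=dict)          # attribute name -> set of values


def build_gene(mention, sentences, attribute_table):
    """Build a gene for `mention`.

    sentences: list of lists of entity words, one list per sentence
        (assumed output of segmentation and named-entity recognition).
    attribute_table: list of (entity, attribute, value) triples for the document.
    """
    gene = Gene()
    for entity_words in sentences:
        if mention in entity_words:
            # Assumption: every other entity word in the same sentence co-occurs once.
            gene.cooccurrence.update(w for w in entity_words if w != mention)
    for entity, attribute, value in attribute_table:
        if entity == mention:
            gene.attributes.setdefault(attribute, set()).add(value)
    return gene
```

The relationship-based strengthening of weak co-occurrence words is not modeled here; it could be added as a later pass over `gene.cooccurrence`.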

In this embodiment, the process of obtaining the entity to be disambiguated from the text mainly includes: extracting a webpage main body, cleaning a text, identifying languages, segmenting a text, extracting entity attributes and the like.

Specifically, in the webpage main body extraction process, the part that carries the main content is located mainly according to the CSS class names and the text content under <div> blocks, and auxiliary elements in the page, such as functional buttons for login, comment and share, banner advertisements and the like, are removed to obtain the main body content of the webpage.

Specifically, in the text cleaning process, HTML markup such as <div>, <p> and <br> can be removed with existing tools to obtain natural language text. Processing such as full-width/half-width conversion, invisible character cleaning, emoticon cleaning and traditional-to-simplified Chinese conversion is then performed to remove useless and meaningless characters, and the cleaned natural language text is output.
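A minimal sketch of part of this cleaning step, using only the Python standard library, is given below. The function name `clean_text` is hypothetical, the regular-expression tag stripping is a simplification of what a real HTML tool would do, and emoticon cleaning and traditional-to-simplified conversion are deliberately left out.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")  # naive tag matcher, adequate only for simple markup


def clean_text(raw_html):
    """Strip markup, fold full-width forms to half-width, drop invisible characters."""
    text = TAG_RE.sub(" ", raw_html)            # remove tags such as <div>, <p>, <br>
    text = html.unescape(text)                  # decode entities such as &amp;
    text = unicodedata.normalize("NFKC", text)  # full-width letters/digits -> half-width
    # Drop control (Cc) and format (Cf) characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    return re.sub(r"[ \t]+", " ", text).strip()
```

NFKC normalization is used here as one convenient way to perform full-width/half-width conversion; it also folds other compatibility characters, which may or may not be desired.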

Specifically, in the language identification process, the code blocks to which the characters belong can simply be checked; for example, in a Chinese article most characters fall within the Chinese code blocks and the ASCII block. Alternatively, language identification can be treated as a classification problem and solved with a machine learning method. After the language of the text has been determined, the system dispatches the text to the word segmentation tool for that language.
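The code-block check can be illustrated with the following sketch. The function name `guess_script`, the 0.5 threshold and the restriction to the CJK Unified Ideographs range U+4E00 to U+9FFF are assumptions for illustration; distinguishing languages that share a script, such as English and Spanish, would need the classification approach instead.

```python
def guess_script(text, threshold=0.5):
    """Guess the dominant script from the code blocks of the alphabetic characters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "unknown"
    cjk = sum("\u4e00" <= ch <= "\u9fff" for ch in letters)  # CJK Unified Ideographs
    latin = sum(ch.isascii() for ch in letters)               # ASCII letters
    if cjk / len(letters) > threshold:
        return "chinese"
    if latin / len(letters) > threshold:
        return "latin"
    return "unknown"
```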

Specifically, in the text segmentation process, according to the result of language identification, a dedicated segmentation tool is used for Chinese, English and Spanish respectively, and the segmentation results, including words, parts of speech and named entity words, are output and converted into a unified representation.

Specifically, in the entity attribute extraction process, the system first resolves pronoun references as far as possible with an NLP tool, replacing each pronoun with the name of the entity it refers to, and then extracts information from the text using rules and an attribute extraction tool. Rule-based information extraction consists of two parts: constructing extraction rules and extracting information with those rules. The attribute extraction tool first splits each sentence into a series of clauses, then shortens each clause as far as possible to obtain shorter sentence fragments, and then splits the fragments into triples, each comprising an entity, an attribute and an attribute value. The triples serve as the entity attribute table of the document, which the system finally outputs.
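Assuming the triples have already been extracted, the entity attribute table could be assembled roughly as follows. The alias table `SLOT_ALIASES` and the function `build_attribute_table` are hypothetical; the alias lookup stands in for the alignment of attribute names to predefined attribute slots mentioned for gene construction.

```python
from collections import defaultdict

# Hypothetical mapping from surface attribute names to predefined attribute slots.
SLOT_ALIASES = {"wife": "spouse", "husband": "spouse", "works at": "employer"}


def build_attribute_table(triples):
    """Group (entity, attribute, value) triples into entity -> slot -> set of values."""
    table = defaultdict(lambda: defaultdict(set))
    for entity, attribute, value in triples:
        key = attribute.lower()
        slot = SLOT_ALIASES.get(key, key)  # unknown names are kept as their own slot
        table[entity][slot].add(value)
    return table
```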

After text preprocessing and NLP analysis, the resulting entity attribute table of the document can be used as direct input to entity disambiguation. Text preprocessing and NLP analysis are indispensable to entity disambiguation, but NLP processing methods other than those described above may be used to carry them out. These steps help speed up subsequent processing, provide high-quality input text for the entity disambiguation process, relax the input requirements of the system, and improve extensibility while widening the range of data that can be processed.

Step S102, determining candidate entities from an entity library according to the semantic features of the entity to be disambiguated.

In this embodiment, the semantic features of the entity to be disambiguated include name form similarity, abbreviation information and reference features. Candidate entities are determined from the entity library in different ways for different languages. One or more candidate entities may be determined; when there are several, they form a candidate entity set. The name form similarity may be the similarity in form between entity names, for example, there is name form similarity between three people in chapter a and three people in article B, and the English name "jack" has similarity to "jackie", "number" and "donald number". The abbreviation information may include Chinese abbreviation information and English abbreviation information; for example, Hunan Province is abbreviated as "Xiang", and a media access control address may be abbreviated as "MAC address". The reference features can be understood as the features of what the pronouns in a document refer to.
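As a rough illustration of candidate screening by name form similarity and abbreviation information, the sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure. The abbreviation table, the function names and the 0.6 cut-off are assumptions, and reference features are not modeled.

```python
from difflib import SequenceMatcher

# Hypothetical abbreviation table: abbreviation -> full name.
ABBREVIATIONS = {"Xiang": "Hunan Province"}


def name_similarity(a, b):
    """Case-insensitive surface-form similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def select_candidates(mention, entity_names, min_similarity=0.6):
    """Return library names close in form to the mention or to its expansion."""
    expanded = ABBREVIATIONS.get(mention, mention)
    return [
        name for name in entity_names
        if max(name_similarity(mention, name),
               name_similarity(expanded, name)) >= min_similarity
    ]
```

For instance, `select_candidates("jack", ["Jackie", "Donald", "Jack"])` keeps the two names that resemble "jack" and drops "Donald".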

Step S103, calculating the gene matching degree of the candidate entity and the entity to be disambiguated, and determining that the candidate entity is matched with the gene of the entity to be disambiguated under the condition that the gene matching degree exceeds a preset threshold value.

In this embodiment, in the presence of the candidate entity set, the candidate entities in the candidate entity set may be scanned, and the gene matching degree between the candidate entities and the entity to be disambiguated may be calculated, and when the gene matching degree exceeds a preset threshold, it is determined that the candidate entities and the entity to be disambiguated can be matched. The larger the preset threshold value is, the higher the matching precision is, the smaller the preset threshold value is, the lower the matching precision is, and the preset threshold value can be set according to actual requirements. And fusing the genes of the entity to be disambiguated and the candidate entity under the condition that the candidate entity and the entity to be disambiguated can be matched, and accumulating the entities. Then, other candidate entities in the candidate entity set are scanned until no more candidate entities in the candidate entity set can be matched.
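The scan over the candidate entity set can be sketched as a single pass that scores each candidate, compares the score with the preset threshold, and fuses on a match. The scoring and fusion functions are passed in as parameters; `scan_candidates`, the toy `jaccard` score and the set-based genes are assumptions used only to make the example self-contained.

```python
def scan_candidates(mention_gene, candidate_genes, score_fn, fuse_fn, threshold):
    """Score every candidate; fuse the mention gene into each one above the threshold.

    candidate_genes: dict mapping candidate name -> gene (updated in place on a match).
    Returns the names of the matched candidates in scan order.
    """
    matched = []
    for name in list(candidate_genes):
        if score_fn(mention_gene, candidate_genes[name]) > threshold:
            candidate_genes[name] = fuse_fn(candidate_genes[name], mention_gene)
            matched.append(name)
    return matched


def jaccard(a, b):
    """Toy matching degree for set-valued genes (illustration only)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Raising `threshold` makes the match stricter, which mirrors the trade-off between threshold and matching precision described above.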

According to the entity disambiguation method provided by the embodiment of the invention, the gene matching degree of the candidate entity and the entity to be disambiguated is calculated by constructing the gene of the entity to be disambiguated, the entity word in the text is disambiguated according to the gene matching degree, the performance of entity disambiguation based on a large-scale data source is higher, the precision of entity disambiguation on an internet data source is improved, and the effect of entity disambiguation is improved.

Referring to fig. 2, the calculating the gene matching degree of the candidate entity and the entity to be disambiguated in step S103 includes the following processes:

Step S1031, obtaining the co-occurrence entity gene matching degree of the candidate entity and the entity to be disambiguated.

Step S1032, obtaining the matching degree of the candidate entity and the attribute gene of the entity to be disambiguated.

Step S1033, calculating the gene matching degree of the candidate entity and the entity to be disambiguated according to the co-occurrence entity gene matching degree and the attribute gene matching degree, wherein the calculation formula is as follows:

score_g(m, e) = α * score_w(m, e) + β * score_p(m, e)
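The weighted combination can be written directly in code. In the sketch below the default values of `alpha`, `beta` and `threshold` are placeholders chosen for illustration; the original text does not give concrete values.

```python
def gene_match_score(score_w, score_p, alpha=0.6, beta=0.4):
    """score_g = alpha * score_w + beta * score_p (weights are illustrative)."""
    return alpha * score_w + beta * score_p


def is_gene_match(score_w, score_p, threshold=0.5, alpha=0.6, beta=0.4):
    """A candidate matches when the combined score exceeds the preset threshold."""
    return gene_match_score(score_w, score_p, alpha, beta) > threshold
```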

Referring to fig. 3, the above step S1031 includes the following processes:

step S10311, determining co-occurrence entity words of the entity to be disambiguated from a pre-stored document, and determining co-occurrence entity words of the candidate entity from the pre-stored document;

step S10312, obtaining a gene entity word set of the candidate entity;

step S10313, calculating the co-occurrence entity gene matching degree according to the word frequency of the overlapped part of the co-occurrence entity words of the entity to be disambiguated and the co-occurrence entity words of the candidate entity relative to the pre-stored entity word total amount, the word frequency of the overlapped part of the co-occurrence entity words of the entity to be disambiguated and the co-occurrence entity words of the candidate entity relative to the gene entity word set of the candidate entity, and the inverse text frequency index of the overlapped part of the co-occurrence entity words of the entity to be disambiguated and the co-occurrence entity words of the candidate entity.

In this embodiment, the gene entity word set is the set of entity words related to an entity. For example, for the entity Zhang San, the gene entity word set may include words such as Li Si, Wang's Group and Wangjiacun.

The calculation borrows the idea of TF-IDF (term frequency-inverse document frequency), a commonly used weighting technique in information retrieval and data mining, where TF denotes term frequency and IDF denotes inverse document frequency. All entities are taken as the document D, and the entity words of the co-occurrence entity genes of the entities are taken as the words T. The tf of the entity words of the co-occurrence entity gene of the entity to be disambiguated and the tf of the entity words of the co-occurrence entity gene of the candidate entity are normalized. The normalized tf is sensitive to the number of gene entity words, and the co-occurrence entity gene of a candidate entity is generally long, so the weights of the co-occurrence entity words that overlap between the two genes are strengthened before normalization, which amplifies weak signals. The idf is computed with a caching mechanism to speed up calculation. Finally, the co-occurrence entity gene matching degree is obtained from the tf and idf of the overlapping part of the co-occurrence entity word genes; specifically, the co-occurrence entity gene matching degree score_w(m, e) is calculated according to the following formula:

wherein, for the entity to be disambiguated and the candidate entity, denoted m and e, the normalized tf value of an overlapping gene entity word w is calculated as follows:

wherein weight(w, e) represents the weight of the gene entity word w of the entity e, weight(e, w) represents the weight of the gene entity word e of the entity w, and W(m) is the gene entity word set of the entity m. It should be noted that, in the above formula for the co-occurrence entity gene matching degree score_w(m, e), the entity to be disambiguated and the candidate entity are not distinguished; m and e are both treated as peer entities. tf_norm(w; m, e) denotes the normalized value of the word w, with m as the total, after the overlap between m and e has been strengthened. tf_norm(w; e, m) is calculated in the same way as tf_norm(w; m, e), and the details are not repeated here.

The inverse document frequency idf(w) of a co-occurrence entity word in the co-occurrence entity genes is calculated as follows:

wherein E is the entity library, W(e) is the gene entity word set of an entity e, and Z is an auxiliary constant that is generally small.
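The formulas themselves are not reproduced in this text, so the following Python sketch shows only one plausible reading of the description and should not be taken as the patented formulas. It assumes that overlapping words are boosted by a constant factor before normalization, that the score is the sum over overlapping words of the product of the two normalized tf values and the idf, and that idf takes the smoothed form log((|E| + Z) / (df + Z)). The names `tf_norm`, `idf`, `cooccurrence_match`, `boost` and `z` are all hypothetical.

```python
import math


def tf_norm(word, gene_a, gene_b, boost=2.0):
    """Normalized weight of `word` in gene_a after boosting words shared with gene_b."""
    shared = gene_a.keys() & gene_b.keys()
    boosted = {w: wt * (boost if w in shared else 1.0) for w, wt in gene_a.items()}
    total = sum(boosted.values())
    return boosted.get(word, 0.0) / total if total else 0.0


def idf(word, library, z=1.0):
    """Smoothed inverse document frequency over the entity library (assumed form)."""
    containing = sum(1 for gene in library.values() if word in gene)
    return math.log((len(library) + z) / (containing + z))


def cooccurrence_match(gene_m, gene_e, library):
    """Sum over overlapping words of tf_norm in both directions times idf."""
    shared = gene_m.keys() & gene_e.keys()
    return sum(
        tf_norm(w, gene_m, gene_e) * tf_norm(w, gene_e, gene_m) * idf(w, library)
        for w in shared
    )
```

Here each gene is a dict from entity word to weight and `library` maps entity names to such genes. The score is symmetric in its two gene arguments, in line with the statement that m and e are treated as peers; a cache around `idf` could be added to mirror the caching mechanism mentioned above.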

Referring to fig. 4, the above step S1032 includes the following processes:

step S10321, determining the attribute name of the entity to be disambiguated and the attribute name of the candidate entity;

step S10322, calculating the matching degree of the attribute gene according to the overlapping attribute between the attribute name of the entity to be disambiguated and the attribute name of the candidate entity and the weight value of the overlapping attribute.

In this embodiment, the attribute names have already been aligned during gene construction, so attribute gene matching only needs to compare the attribute values under matching attribute names. The entity attribute gene matching degree is the weighted sum over the overlapping attributes. When attribute values are compared, the value strings are not required to be identical; instead, their similarity is computed with an edit-distance-based similarity algorithm. When the weighted-sum model for attribute matching is constructed, fuzzy matching is optimized for the values of certain special attributes, and the weights of highly discriminative attributes, such as identity card and spouse, are increased. Specifically, the attribute gene matching degree score_p(m, e) may be calculated according to the following formula:

wherein PN(x) represents the set of attribute names of the entity x, PV(x, pn) represents all attribute values of the entity x for the attribute name pn, weight_p(pn) represents the weight of the attribute name pn, and I_pv(v_i, v_j) ∈ {0, 1} is an indicator function representing whether the attribute values v_i and v_j are the same. I_pv(v_i, v_j) is calculated as follows:

wherein sim_pv(v_i, v_j) represents the normalized string similarity, θ_p is a relatively high threshold close to 1, and I_pfix(v_i, v_j) and I_sfix(v_i, v_j) respectively indicate whether one of v_i and v_j is a prefix or a suffix of the other.
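Since the formulas are not reproduced in this text, the sketch below is only one possible reading of the description: two values are treated as the same when their normalized edit-distance similarity reaches θ_p or when one is a prefix or suffix of the other, and the score is the sum of the weights of the shared attribute names that have at least one matching value pair. The function names, the default `theta_p` of 0.9 and the default weight of 1.0 are assumptions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sim_pv(a, b):
    """Normalized string similarity in [0, 1] derived from the edit distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def values_match(a, b, theta_p=0.9):
    """Indicator: 1 if similar enough, or one value is a prefix/suffix of the other."""
    if not a or not b:
        return 0
    if sim_pv(a, b) >= theta_p:
        return 1
    if a.startswith(b) or b.startswith(a) or a.endswith(b) or b.endswith(a):
        return 1
    return 0


def attribute_match_score(attrs_m, attrs_e, weights, default_weight=1.0):
    """Weighted sum over shared attribute names with at least one matching value."""
    score = 0.0
    for pn in attrs_m.keys() & attrs_e.keys():
        if any(values_match(vi, vj) for vi in attrs_m[pn] for vj in attrs_e[pn]):
            score += weights.get(pn, default_weight)
    return score
```

Giving a larger entry in `weights` to attributes such as spouse reflects the stronger discriminative power described above.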

Referring to fig. 5, after step S103, the method further comprises:

Step S104, fusing the candidate entity and the gene of the entity to be disambiguated.

Step S105, determining a target entity matched with the entity to be disambiguated according to the gene matching degree of the candidate entity and the entity to be disambiguated, and performing knowledge fusion on the corresponding entity in a knowledge base according to the knowledge of the target entity.

In this embodiment, in the case that the candidate entity and the entity to be disambiguated can be matched, the gene of the entity to be disambiguated and the gene of the candidate entity are fused, and the entities are accumulated. The entity words, the weights and the entity attribute genes of the co-occurrence entity genes can be fused respectively, and meanwhile, the growth window is set for the co-occurrence entity genes, so that the growth space is reserved for important words appearing in the later period. In this embodiment, the entity knowledge base may be gradually refined along with the process of disambiguating the entity words in the text.
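A minimal sketch of gene fusion is shown below, assuming each gene is a dict with a `cooccurrence` map (entity word to weight) and an `attributes` map (attribute name to set of values). The function name `fuse_genes` and this layout are hypothetical, and the growth window for co-occurrence entity genes is not modeled.

```python
from collections import Counter


def fuse_genes(candidate, mention):
    """Return a new gene: co-occurrence weights are summed, attribute values are unioned."""
    fused_words = Counter(candidate["cooccurrence"]) + Counter(mention["cooccurrence"])
    # Copy the value sets so the candidate's own gene is left unchanged.
    fused_attrs = {name: set(values) for name, values in candidate["attributes"].items()}
    for name, values in mention["attributes"].items():
        fused_attrs.setdefault(name, set()).update(values)
    return {"cooccurrence": fused_words, "attributes": fused_attrs}
```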

In this embodiment, the knowledge base may be updated according to the result of matching and disambiguation. The result falls roughly into three cases: no entity is matched, one entity is matched, or multiple entities are matched. Attribute fusion of the knowledge base is performed for each of the three cases.

The entity disambiguation method provided by the embodiment of the invention constructs the gene of the entity to be disambiguated, calculates the gene matching degree between the candidate entity and the entity to be disambiguated, and disambiguates the entity words in the text according to the gene matching degree. This gives higher disambiguation accuracy on large-scale data sources and improves the precision of entity disambiguation on Internet data sources, so that the overall effect of entity disambiguation is improved. The linked entities and the knowledge base are gradually refined during the entity disambiguation process, which improves data processing efficiency for target analysis, knowledge base construction, question-answering systems and the like over massive texts.

Fig. 6 is a schematic structural diagram of an entity disambiguation apparatus 600 according to an embodiment of the present invention, and for convenience of description, only the relevant portions for implementing the present invention are shown. The entity disambiguation apparatus 600 includes:

a construction module 601 for constructing genes of the entity to be disambiguated.

A screening module 602, configured to determine a candidate entity from an entity library according to the semantic features of the entity to be disambiguated, where the semantic features of the entity to be disambiguated include name form similarity, abbreviation information, and reference characteristics.

In this embodiment, candidate entities are determined from the entity library in different manners for different languages. One or more candidate entities may be determined; when there are several, they may form a candidate entity set. Name form similarity may be understood as similarity in form between entity names: for example, "Zhang San A" and "Zhang San B" are similar in name form, the English name "jack" is similar to "jackie", and "number" is similar to "donald number". The abbreviation information may include Chinese abbreviation information and English abbreviation information: for example, the abbreviation of Hunan Province is "Xiang", and a media access control address may be referred to as a MAC address. The reference characteristics can be understood as the characteristics of what a pronoun in a document refers to.
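A minimal sketch of candidate screening by name form similarity and abbreviation information follows. It assumes character-level similarity from Python's `difflib.SequenceMatcher`, a hypothetical abbreviation table, and an illustrative cut-off of 0.7; none of these come from the embodiment, and reference characteristics (pronoun resolution) are not covered.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for name form similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def screen_candidates(mention, entity_library, abbreviations, min_similarity=0.7):
    """Return library names that resemble the mention or match its expanded abbreviation."""
    expanded = abbreviations.get(mention, mention)
    candidates = []
    for name in entity_library:
        if name == expanded or name_similarity(mention, name) >= min_similarity:
            candidates.append(name)
    return candidates


library = ["jackie", "john", "Hunan Province"]   # hypothetical entity library
abbreviations = {"Xiang": "Hunan Province"}      # hypothetical abbreviation table

print(screen_candidates("jack", library, abbreviations))
print(screen_candidates("Xiang", library, abbreviations))
```

A language-specific system would likely swap in a different similarity measure per language, in line with the statement that candidates are determined in different manners for different languages.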

A matching module 603, configured to calculate a gene matching degree between the candidate entity and the entity to be disambiguated, and determine that the candidate entity matches the gene of the entity to be disambiguated when the gene matching degree exceeds a preset threshold.

The entity disambiguation device provided by the embodiment of the invention constructs the gene of the entity to be disambiguated, calculates the gene matching degree between the candidate entity and the entity to be disambiguated, and disambiguates the entity words in the text according to the gene matching degree. It performs better when disambiguating entities on large-scale data sources, improves the precision of entity disambiguation on internet data sources, and thereby improves the effect of entity disambiguation.

Referring to fig. 7, the matching module 603 includes:

a first obtaining sub-module 6031, configured to obtain a co-occurrence entity gene matching degree and an attribute gene matching degree of the candidate entity and the entity to be disambiguated.

A first calculating sub-module 6032, configured to calculate, according to the co-occurrence entity gene matching degree and the attribute gene matching degree, the gene matching degree between the candidate entity and the entity to be disambiguated, where the calculation formula is:

score_g(m, e) = α * score_w(m, e) + β * score_p(m, e)
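The weighted combination and the threshold test performed by the matching module can be sketched as below. The values of `alpha`, `beta` and `threshold` are illustrative assumptions only; the embodiment speaks of a preset threshold without giving numbers here.

```python
def gene_matching_degree(score_w, score_p, alpha=0.6, beta=0.4):
    """score_g(m, e) = alpha * score_w(m, e) + beta * score_p(m, e)."""
    return alpha * score_w + beta * score_p


def is_match(score_w, score_p, threshold=0.5, alpha=0.6, beta=0.4):
    """A candidate matches when the gene matching degree exceeds the preset threshold."""
    return gene_matching_degree(score_w, score_p, alpha, beta) > threshold


# Co-occurrence matching degree 0.7 and attribute matching degree 0.5 (made-up inputs).
print(gene_matching_degree(0.7, 0.5), is_match(0.7, 0.5))
```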

Referring to fig. 8, the first obtaining sub-module 6031 includes:

a first determining unit 60311, configured to determine co-occurrence entity words of the entity to be disambiguated from a pre-stored document, and determine co-occurrence entity words of the candidate entities from the pre-stored document;

a first obtaining unit 60312, configured to obtain a set of gene entity words of the candidate entity;

a first calculating unit 60313, configured to calculate the co-occurrence entity gene matching degree according to a word frequency of an overlapping portion of a co-occurrence entity word of the entity to be disambiguated and a co-occurrence entity word of the candidate entity relative to a pre-stored total amount of entity words, a word frequency of an overlapping portion of a co-occurrence entity word of the entity to be disambiguated and a co-occurrence entity word of the candidate entity relative to a gene entity word set of the candidate entity, and an inverse text frequency index of an overlapping portion of a co-occurrence entity word of the entity to be disambiguated and a co-occurrence entity word of the candidate entity.

In this embodiment, the gene entity word set is the set of entity words related to an entity. For example, for the entity Zhang San, the gene entity word set may include words such as Liqu, Wang's group, Wang Jiacun, etc.

It should be further explained that the calculation borrows the idea of TF-IDF. TF-IDF (term frequency-inverse document frequency) is a commonly used weighting technique in information retrieval and data mining, where TF denotes term frequency and IDF denotes inverse document frequency. All entities are taken as the document D, and the entity words of an entity's co-occurrence entity gene are taken as the words T. The tf of the entity words of the co-occurrence entity gene of the entity to be disambiguated and the tf of the entity words of the co-occurrence entity gene of the candidate entity are normalized. The normalized tf is sensitive to the number of gene entity words, and the co-occurrence entity genes of candidate entities are generally long; therefore, the weights of the co-occurrence entity words shared by the two genes are strengthened before normalization, so that weak signals are amplified. The idf is calculated with a caching mechanism, which speeds up the calculation. Finally, the co-occurrence entity gene matching degree is obtained from the tf and idf of the overlapping part of the co-occurrence entity genes; specifically, the co-occurrence entity gene matching degree score_w(m, e) is calculated according to the following formula:

where weight(w, e) denotes the weight of the gene entity word w of the entity e, weight(e, w) denotes the weight of the gene entity word e of the entity w, and W(m) is the gene entity word set of the entity m. It should be noted that the above formula for the co-occurrence entity gene matching degree score_w(m, e) does not distinguish between the entity to be disambiguated and the candidate entity: m and e are both regarded as entities and are treated equally. tf_norm(w; m, e) denotes the value of the word w normalized against the total of m after the overlap of m and e has been strengthened. tf_norm(w; e, m) is calculated in a manner similar to tf_norm(w; m, e), which is not repeated here.
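The formula itself is not reproduced in this text, so the following is only one possible reading, sketched under stated assumptions: each entity counts as one document for idf, overlapping words are strengthened by a fixed factor `BOOST` before normalization, and the score is the sum over overlapping words of the product of the two normalized tf values and the idf (the claims mention a multiplication of these three quantities). The toy genes and the smoothed idf are hypothetical.

```python
import math
from functools import lru_cache

# Hypothetical toy data: entity -> {gene entity word: weight}.
GENES = {
    "m": {"Li Si": 2.0, "Wang Wu": 1.0},
    "e": {"Li Si": 3.0, "Wang Wu": 1.0, "Changsha": 4.0, "Beijing": 2.0},
    "x": {"Beijing": 1.0},
}
BOOST = 2.0  # assumed strengthening factor for words shared by both genes


@lru_cache(maxsize=None)
def idf(word):
    """Cached idf; only valid for words that occur in at least one gene."""
    containing = sum(1 for gene in GENES.values() if word in gene)
    return math.log(1.0 + len(GENES) / containing)


def tf_norm(word, a, b):
    """Weight of `word` in a's gene over a's total, after boosting words shared with b."""
    overlap = GENES[a].keys() & GENES[b].keys()
    boosted = {w: wt * (BOOST if w in overlap else 1.0) for w, wt in GENES[a].items()}
    return boosted[word] / sum(boosted.values())


def score_w(m, e):
    """Assumed form: sum over overlapping words of tf_norm * tf_norm * idf."""
    overlap = sorted(GENES[m].keys() & GENES[e].keys())
    return sum(tf_norm(w, m, e) * tf_norm(w, e, m) * idf(w) for w in overlap)


print(score_w("m", "e"), score_w("e", "m"), score_w("m", "x"))
```

Because `idf` is memoized with `lru_cache`, the cache would need to be cleared (`idf.cache_clear()`) whenever the entity library changes. The sketch treats m and e symmetrically, in line with the remark that the two are regarded equally.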

Referring to fig. 9, the first obtaining sub-module 6031 further includes:

a second determining unit 60314, configured to determine the attribute name of the entity to be disambiguated and the attribute name of the candidate entity;

a second calculating unit 60315, configured to calculate the attribute gene matching degree according to an overlapping attribute between the attribute name of the entity to be disambiguated and the attribute name of the candidate entity, and a weight value of the overlapping attribute.

where pn(x) denotes the attribute names of all attributes of the entity x, pv(x, pn) denotes all attribute values of the entity x for the attribute name pn, weight_p(pn) denotes the weight of the attribute name pn, and I_pv(v_i, v_j) ∈ {0, 1} is an indicator function indicating whether the attribute values v_i and v_j are the same. I_pv(v_i, v_j) is calculated as follows:
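The formulas themselves are not reproduced in this text. As a minimal sketch under stated assumptions, the attribute gene matching degree can be read as the weighted share of overlapping attribute names whose values agree. The indicator here compares values after trimming and case folding, the score is normalized by the total weight of the overlapping attribute names, and the sample attributes and weights are hypothetical; the embodiment may define each of these differently.

```python
def indicator_pv(v_i, v_j):
    """Assumed indicator: 1 if the two attribute values are the same, else 0."""
    return 1 if v_i.strip().lower() == v_j.strip().lower() else 0


def score_p(attrs_m, attrs_e, weight_p):
    """Weighted share of overlapping attribute names with at least one equal value."""
    shared = attrs_m.keys() & attrs_e.keys()          # overlapping attribute names
    total = sum(weight_p.get(pn, 1.0) for pn in shared)
    if total == 0:
        return 0.0
    matched = 0.0
    for pn in shared:
        same = any(indicator_pv(vi, vj) for vi in attrs_m[pn] for vj in attrs_e[pn])
        matched += weight_p.get(pn, 1.0) * int(same)
    return matched / total


mention = {"birthplace": {"Changsha"}, "occupation": {"engineer"}, "age": {"30"}}
candidate = {"birthplace": {"changsha"}, "occupation": {"teacher", "writer"}}
weights = {"birthplace": 2.0, "occupation": 1.0}     # hypothetical attribute weights

print(score_p(mention, candidate, weights))
```

Attribute names missing from the weight table default to a weight of 1.0 in this sketch, which is also an assumption.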

Referring to fig. 10, the entity disambiguation apparatus 600 further comprises:

a first fusion module 604 for fusing the candidate entity and the gene of the entity to be disambiguated.

A second fusion module 605, configured to determine a target entity matched with the entity to be disambiguated according to the gene matching degree between the candidate entity and the entity to be disambiguated, and perform knowledge fusion on a corresponding entity in a knowledge base according to knowledge of the target entity.

According to the entity disambiguation method provided by the embodiment of the invention, the gene of the entity to be disambiguated is constructed, the gene matching degree between the candidate entity and the entity to be disambiguated is calculated, and the entity words in the text are disambiguated according to the gene matching degree. Entity disambiguation on large-scale data sources performs better, the precision of entity disambiguation on internet data sources is improved, and the effect of entity disambiguation is improved. The linked entities and the knowledge base are gradually refined during disambiguation, which improves the data processing efficiency of target analysis, knowledge base construction, question-answering systems and the like over massive texts.

An embodiment of the present invention provides a computer apparatus, which includes a processor, and the processor is configured to implement the steps of the entity disambiguation method provided in the foregoing method embodiments when executing a computer program in a memory.

Illustratively, a computer program can be partitioned into one or more modules, which are stored in the memory and executed by the processor to implement the present invention. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, the instruction segments being used to describe the execution of the computer program in the computer device. For example, the computer program may be divided into the steps of the entity disambiguation method provided by the various method embodiments described above.

Those skilled in the art will appreciate that the above description of the computer apparatus is merely an example and does not limit the computer apparatus, which may include more or fewer components than those described, may combine some of the components, or may include different components, such as input/output devices, network access devices, buses, etc.

The processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the computer device and connects the various parts of the whole computer device using various interfaces and lines.

The memory may be used to store the computer programs and/or modules, and the processor may implement various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.

The modules/units integrated by the computer device may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the embodiments of the entity disambiguation method may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), electrical carrier wave signals, electrical signals, a software distribution medium, and the like.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. An entity disambiguation method, the entity disambiguation method comprising:

calculating the gene matching degree of the candidate entity and the entity to be disambiguated according to the co-occurrence entity gene matching degree and the attribute gene matching degree of the candidate entity and the entity to be disambiguated, and determining that the candidate entity is matched with the gene of the entity to be disambiguated under the condition that the gene matching degree exceeds a preset threshold;

obtaining the co-occurrence entity gene matching degree, comprising:

acquiring a gene entity word set of the candidate entity;

and performing multiplication operation according to the word frequency of the overlapped part of the co-occurrence entity words of the entity to be disambiguated and the co-occurrence entity words of the candidate entity relative to the total amount of the pre-stored entity words, the word frequency of the overlapped part of the co-occurrence entity words of the entity to be disambiguated and the co-occurrence entity words of the candidate entity relative to the gene entity word set of the candidate entity, and the inverse text frequency index of the overlapped part of the co-occurrence entity words of the entity to be disambiguated and the co-occurrence entity words of the candidate entity, and calculating the co-occurrence entity gene matching degree.

2. The entity disambiguation method of claim 1, wherein the gene matching degree of the candidate entity and the entity to be disambiguated is calculated according to the co-occurrence entity gene matching degree and the attribute gene matching degree, and the calculation formula is:

score_g(m, e) = α * score_w(m, e) + β * score_p(m, e)

3. The entity disambiguation method according to claim 1, wherein obtaining the attribute gene matching degree of the candidate entity and the entity to be disambiguated comprises the following processes:

4. The entity disambiguation method of claim 1, wherein after said determining that said candidate entity matches a gene of said entity to be disambiguated, said entity disambiguation method further comprises:

fusing the candidate entity and the gene of the entity to be disambiguated;

5. An entity disambiguation apparatus, comprising:

the matching module is used for calculating the gene matching degree of the candidate entity and the entity to be disambiguated according to the co-occurrence entity gene matching degree and the attribute gene matching degree of the candidate entity and the entity to be disambiguated, and determining that the candidate entity is matched with the gene of the entity to be disambiguated under the condition that the gene matching degree exceeds a preset threshold value;

obtaining the co-occurrence entity gene matching degree, comprising: determining co-occurrence entity words of the entity to be disambiguated from a pre-stored document, and determining co-occurrence entity words of the candidate entity from the pre-stored document; acquiring a gene entity word set of the candidate entity; and performing multiplication operation according to the word frequency of the overlapped part of the co-occurrence entity words of the entity to be disambiguated and the co-occurrence entity words of the candidate entity relative to the total amount of the pre-stored entity words, the word frequency of the overlapped part of the co-occurrence entity words of the entity to be disambiguated and the co-occurrence entity words of the candidate entity relative to the gene entity word set of the candidate entity, and the inverse text frequency index of the overlapped part of the co-occurrence entity words of the entity to be disambiguated and the co-occurrence entity words of the candidate entity, and calculating the co-occurrence entity gene matching degree.

6. The entity disambiguation apparatus of claim 5, wherein said matching module comprises:

score_g(m, e) = α * score_w(m, e) + β * score_p(m, e)

7. The entity disambiguation apparatus of claim 6, wherein said first obtaining sub-module comprises:

8. The entity disambiguation apparatus of claim 5, further comprising:

9. A computer arrangement, characterized in that the computer arrangement comprises a processor for implementing the steps of the entity disambiguation method as claimed in any one of claims 1-4 when executing a computer program in a memory.

10. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program when being executed by a processor realizes the steps of the entity disambiguation method as defined in any one of the claims 1-4.