CN107992480A - Method, apparatus, storage medium, and program product for entity disambiguation - Google Patents

Method, apparatus, storage medium, and program product for entity disambiguation

Info

Publication number
CN107992480A
CN107992480A
Authority
CN
China
Prior art keywords
instance, entity, word, word network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711423446.5A
Other languages
Chinese (zh)
Other versions
CN107992480B (en)
Inventor
蔡巍
崔朝辉
赵立军
张霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp filed Critical Neusoft Corp
Priority to CN201711423446.5A
Publication of CN107992480A
Application granted
Publication of CN107992480B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis


Abstract

This application discloses a method, apparatus, storage medium, and program product for entity disambiguation. The method includes: establishing a first co-word network corresponding to a first entity and a second co-word network corresponding to a second entity, the first co-word network and the second co-word network having at least one entity node in common; calculating the similarity between the first co-word network and the second co-word network; when the similarity is greater than a first threshold, determining the first entity and the second entity to be the same entity; and when the similarity is less than a second threshold, determining the first entity and the second entity to be different entities.

Description

Method, apparatus, storage medium, and program product for entity disambiguation
Technical field
This application relates to the technical field of data processing, and in particular to a method, apparatus, storage medium, and program product for entity disambiguation.
Background technology
Entity disambiguation aims to resolve the name ambiguity problems widely present in text, and has broad applications in fields such as semantic search, question answering systems, knowledge base expansion, and heterogeneous knowledge base fusion. Entity disambiguation has two levels of meaning. The first is the disambiguation of identical names: determining the correct referent of an entity mention and its semantics. For example, 'apple' may refer to Apple Inc. or to a kind of fruit, in which case identically named entities require same-name disambiguation. The second is the association and alignment of different names. For example, 'Neusoft', 'NEUSOFT Group', and the 'Neu-Alpine' appearing in historical texts all refer to Neusoft Group Co., Ltd., in which case differently named entities require alias aggregation.
In the English domain, entity disambiguation relies on the construction of semantic knowledge bases. Chinese entities, however, differ from English words and are expressed more flexibly. In the Chinese domain, research on entity disambiguation started late and results are still scarce, so an effective way of performing entity disambiguation is currently lacking.
Summary of the invention
In view of this, the present application provides a method, apparatus, storage medium, and program product for entity disambiguation, to solve the problem that the prior art cannot effectively perform entity disambiguation in the Chinese domain.
To solve the above problem, the technical solutions provided by the embodiments of the present application are as follows:
A method for entity disambiguation, the method comprising:
establishing a first co-word network corresponding to a first entity and a second co-word network corresponding to a second entity, the first co-word network and the second co-word network having at least one identical entity node;
calculating a similarity between the first co-word network and the second co-word network;
when the similarity is greater than a first threshold, determining the first entity and the second entity to be the same entity;
or, when the similarity is less than a second threshold, determining the first entity and the second entity to be different entities.
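The two-threshold decision in the claim above can be summarized in a small sketch. This is an illustrative reading of the claim, not the patented implementation: the similarity value is assumed to come from an earlier network-comparison step, and the threshold values here are arbitrary placeholders (the patent only states that the first threshold is generally greater than the second).

```python
def disambiguate(similarity: float, t1: float = 0.7, t2: float = 0.3) -> str:
    """Apply the two-threshold decision from the claim.

    similarity -- similarity between the two entities' co-word networks
    t1, t2     -- first and second thresholds (t1 > t2; values illustrative)
    """
    if similarity > t1:
        return "same entity"         # networks are highly similar
    if similarity < t2:
        return "different entities"  # networks are clearly dissimilar
    return "undetermined"            # between the thresholds, the claim decides nothing

print(disambiguate(0.85))  # same entity
print(disambiguate(0.10))  # different entities
print(disambiguate(0.50))  # undetermined
```

Note that the claim leaves the interval between the two thresholds open; a system built on it would need a policy (defer, collect more corpus text, or ask a human) for similarities falling in that gap.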
Optionally, determining the first entity and the second entity to be the same entity when the similarity is greater than the first threshold includes:
when the similarity is greater than the first threshold, if the name of the first entity differs from the name of the second entity, determining the first entity and the second entity to be the same entity with different names;
and determining the first entity and the second entity to be different entities when the similarity is less than the second threshold includes:
when the similarity is less than the second threshold, if the name of the first entity is identical to the name of the second entity, determining the first entity and the second entity to be different entities with the same name.
Optionally, establishing the first co-word network corresponding to the first entity and the second co-word network corresponding to the second entity includes:
obtaining a first text corpus corresponding to the first entity and a second text corpus corresponding to the second entity;
when the first text corpus and the second text corpus are unstructured data, extracting a first feature word set corresponding to the first entity from the first text corpus, and extracting a second feature word set corresponding to the second entity from the second text corpus;
establishing the first co-word network corresponding to the first entity according to the relations between the feature words in the first feature word set and the relations between those feature words and the first entity, and establishing the second co-word network corresponding to the second entity according to the relations between the feature words in the second feature word set and the second entity.
Optionally, extracting the first feature word set corresponding to the first entity from the first text corpus and extracting the second feature word set corresponding to the second entity from the second text corpus includes:
extracting a first co-occurrence word set corresponding to the first entity from the first text corpus, and extracting a second co-occurrence word set corresponding to the second entity from the second text corpus, the first co-occurrence word set including the co-occurrence words that appear within a preset range of the first entity in the first text corpus, and the second co-occurrence word set including the co-occurrence words that appear within a preset range of the second entity in the second text corpus;
extracting a first keyword set and a first category feature word set corresponding to the first entity from the first text corpus, and extracting a second keyword set and a second category feature word set corresponding to the second entity from the second text corpus, the first category feature word set including the category feature words matching the entity category of the first entity, and the second category feature word set including the category feature words matching the entity category of the second entity;
taking the union of the first co-occurrence word set, the first keyword set, and the first category feature word set to obtain the first feature word set corresponding to the first entity, and taking the union of the second co-occurrence word set, the second keyword set, and the second category feature word set to obtain the second feature word set corresponding to the second entity.
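The three-part feature extraction above can be sketched as follows. This is a minimal illustration under simplifying assumptions: tokens are taken as pre-segmented words (a real Chinese pipeline would need word segmentation), the "preset range" is a fixed token window, and the keyword and category word sets are supplied directly rather than extracted by an algorithm, since the patent does not specify the extraction methods.

```python
def co_occurrence_words(tokens, entity, window=3):
    """Words appearing within `window` positions of any mention of `entity`."""
    words = set()
    for i, tok in enumerate(tokens):
        if tok == entity:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            words.update(t for t in tokens[lo:hi] if t != entity)
    return words

def feature_word_set(tokens, entity, keywords, category_words):
    """Union of co-occurrence words, keywords, and category feature words,
    as in the claim's merging step."""
    return co_occurrence_words(tokens, entity) | set(keywords) | set(category_words)

# Illustrative corpus; keyword/category sets are assumed inputs here.
corpus = "apple released a new phone and apple shares rose".split()
features = feature_word_set(corpus, "apple",
                            keywords={"phone", "shares"},
                            category_words={"company"})
print(sorted(features))
```

Taking the union rather than the intersection means the feature word set errs toward recall: any word that is near the entity, important in the corpus, or typical of the entity's category becomes a node in the co-word network.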
Optionally, establishing the first co-word network corresponding to the first entity and the second co-word network corresponding to the second entity includes:
obtaining a first text corpus corresponding to the first entity and a second text corpus corresponding to the second entity;
when the first text corpus and the second text corpus are semi-structured data, obtaining from the first text corpus the associated attributes of the first entity and the entity nodes associated with the first entity, and obtaining from the second text corpus the associated attributes of the second entity and the entity nodes associated with the second entity;
establishing the first co-word network corresponding to the first entity according to the associated attributes of the first entity and the entity nodes associated with the first entity, and establishing the second co-word network corresponding to the second entity according to the associated attributes of the second entity and the entity nodes associated with the second entity.
Optionally, calculating the similarity between the first co-word network and the second co-word network includes:
obtaining the entity nodes that the first co-word network and the second co-word network have in common as an identical entity set;
removing from the first co-word network the entity nodes that do not belong to the identical entity set to obtain a third co-word network, and removing from the second co-word network the entity nodes that do not belong to the identical entity set to obtain a fourth co-word network;
obtaining a first subgraph set corresponding to the third co-word network and a second subgraph set corresponding to the fourth co-word network, the first subgraph set being the set of subgraphs formed by the first entity with any one or more entity nodes in the third co-word network, and the second subgraph set being the set of subgraphs formed by the second entity with any one or more entity nodes in the fourth co-word network;
counting, according to the first subgraph set and the second subgraph set, the number of identical subgraphs present in both the third co-word network and the fourth co-word network;
calculating the similarity between the first co-word network and the second co-word network according to the number of identical subgraphs.
Optionally, calculating the similarity between the first co-word network and the second co-word network according to the number of identical subgraphs includes:
obtaining a first number of subgraphs formed by the entity nodes in the identical entity set with the first entity, and a second number of subgraphs formed by the entity nodes in the identical entity set with the second entity, and calculating the sum of the first number and the second number as the total number of subgraphs;
calculating the ratio of the number of identical subgraphs to the total number of subgraphs as a first result;
calculating the product of the square of the number of entity nodes included in the first co-word network and the square of the number of entity nodes included in the second co-word network as a second result;
calculating the ratio of the first result to the second result as the similarity between the first co-word network and the second co-word network.
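The similarity steps above can be sketched as follows. Several assumptions are made because the claim does not fully fix the subgraph enumeration: here each network is represented as an adjacency dictionary, a subgraph is identified by the nonempty subset of identical-set nodes directly connected to the entity (so two subgraphs "match" when they involve the same common nodes), and the division by the product of squared node counts follows the claim text literally. That divisor makes the raw similarity values very small, so the thresholds of the method would need to be scaled to match.

```python
from itertools import combinations

def subgraphs(adj, entity, shared):
    """Subgraphs formed by `entity` with any nonempty subset of `shared`
    nodes it is connected to, each identified by that subset of nodes.
    Restricting to `shared` mirrors pruning to the third/fourth networks."""
    neigh = sorted(adj[entity] & shared)
    subs = set()
    for r in range(1, len(neigh) + 1):
        for combo in combinations(neigh, r):
            subs.add(frozenset(combo))
    return subs

def similarity(adj1, e1, adj2, e2):
    # identical entity set: nodes present in both networks (excluding the entities)
    shared = (set(adj1) - {e1}) & (set(adj2) - {e2})
    s1 = subgraphs(adj1, e1, shared)   # first subgraph set
    s2 = subgraphs(adj2, e2, shared)   # second subgraph set
    same = len(s1 & s2)                # identical subgraphs in both networks
    total = len(s1) + len(s2)          # first number + second number
    first_result = same / total if total else 0.0
    # second result: product of the squared node counts of the two networks
    second_result = (len(adj1) ** 2) * (len(adj2) ** 2)
    return first_result / second_result

adj1 = {"A": {"x", "y"}, "x": {"A"}, "y": {"A"}}
adj2 = {"B": {"x", "y", "z"}, "x": {"B"}, "y": {"B"}, "z": {"B"}}
print(similarity(adj1, "A", adj2, "B"))  # 0.5 / 144, i.e. roughly 0.00347
```

In the example, the shared nodes are x and y, each network contributes the three subgraphs {x}, {y}, {x, y}, all three match, and the matched fraction 3/6 is then divided by the size normalizer 3² × 4² = 144.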
An apparatus for entity disambiguation, the apparatus comprising:
an establishing unit, configured to establish a first co-word network corresponding to a first entity and a second co-word network corresponding to a second entity, the first co-word network and the second co-word network having at least one identical entity node;
a calculating unit, configured to calculate a similarity between the first co-word network and the second co-word network;
a first determining unit, configured to determine the first entity and the second entity to be the same entity when the similarity is greater than a first threshold;
a second determining unit, configured to determine the first entity and the second entity to be different entities when the similarity is less than a second threshold.
Optionally, the first determining unit is specifically configured to: when the similarity is greater than the first threshold, if the name of the first entity differs from the name of the second entity, determine the first entity and the second entity to be the same entity with different names;
and the second determining unit is specifically configured to: when the similarity is less than the second threshold, if the name of the first entity is identical to the name of the second entity, determine the first entity and the second entity to be different entities with the same name.
Optionally, the establishing unit includes:
a first obtaining subunit, configured to obtain a first text corpus corresponding to the first entity and a second text corpus corresponding to the second entity;
a first extracting subunit, configured to, when the first text corpus and the second text corpus are unstructured data, extract a first feature word set corresponding to the first entity from the first text corpus and extract a second feature word set corresponding to the second entity from the second text corpus;
a first establishing subunit, configured to establish the first co-word network corresponding to the first entity according to the relations between the feature words in the first feature word set and the relations between those feature words and the first entity, and to establish the second co-word network corresponding to the second entity according to the relations between the feature words in the second feature word set and the second entity.
Optionally, the first extracting subunit includes:
a first extracting module, configured to extract a first co-occurrence word set corresponding to the first entity from the first text corpus and a second co-occurrence word set corresponding to the second entity from the second text corpus, the first co-occurrence word set including the co-occurrence words that appear within a preset range of the first entity in the first text corpus, and the second co-occurrence word set including the co-occurrence words that appear within a preset range of the second entity in the second text corpus;
a second extracting module, configured to extract a first keyword set and a first category feature word set corresponding to the first entity from the first text corpus, and a second keyword set and a second category feature word set corresponding to the second entity from the second text corpus, the first category feature word set including the category feature words matching the entity category of the first entity, and the second category feature word set including the category feature words matching the entity category of the second entity;
a merging module, configured to take the union of the first co-occurrence word set, the first keyword set, and the first category feature word set to obtain the first feature word set corresponding to the first entity, and to take the union of the second co-occurrence word set, the second keyword set, and the second category feature word set to obtain the second feature word set corresponding to the second entity.
Optionally, the establishing unit includes:
a second obtaining subunit, configured to obtain a first text corpus corresponding to the first entity and a second text corpus corresponding to the second entity;
a third obtaining subunit, configured to, when the first text corpus and the second text corpus are semi-structured data, obtain from the first text corpus the associated attributes of the first entity and the entity nodes associated with the first entity, and obtain from the second text corpus the associated attributes of the second entity and the entity nodes associated with the second entity;
a second establishing subunit, configured to establish the first co-word network corresponding to the first entity according to the associated attributes of the first entity and the entity nodes associated with the first entity, and to establish the second co-word network corresponding to the second entity according to the associated attributes of the second entity and the entity nodes associated with the second entity.
Optionally, the calculating unit includes:
a fourth obtaining subunit, configured to obtain the entity nodes that the first co-word network and the second co-word network have in common as an identical entity set;
a fifth obtaining subunit, configured to remove from the first co-word network the entity nodes that do not belong to the identical entity set to obtain a third co-word network, and to remove from the second co-word network the entity nodes that do not belong to the identical entity set to obtain a fourth co-word network;
a sixth obtaining subunit, configured to obtain a first subgraph set corresponding to the third co-word network and a second subgraph set corresponding to the fourth co-word network, the first subgraph set being the set of subgraphs formed by the first entity with any one or more entity nodes in the third co-word network, and the second subgraph set being the set of subgraphs formed by the second entity with any one or more entity nodes in the fourth co-word network;
a counting subunit, configured to count, according to the first subgraph set and the second subgraph set, the number of identical subgraphs present in both the third co-word network and the fourth co-word network;
a first calculating subunit, configured to calculate the similarity between the first co-word network and the second co-word network according to the number of identical subgraphs.
Optionally, the first calculating subunit includes:
an obtaining module, configured to obtain a first number of subgraphs formed by the entity nodes in the identical entity set with the first entity and a second number of subgraphs formed by the entity nodes in the identical entity set with the second entity, and to calculate the sum of the first number and the second number as the total number of subgraphs;
a first calculating module, configured to calculate the ratio of the number of identical subgraphs to the total number of subgraphs as a first result;
a second calculating module, configured to calculate the product of the square of the number of entity nodes included in the first co-word network and the square of the number of entity nodes included in the second co-word network as a second result;
a third calculating module, configured to calculate the ratio of the first result to the second result as the similarity between the first co-word network and the second co-word network.
A computer-readable storage medium storing instructions that, when run on a terminal device, cause the terminal device to perform the above method for entity disambiguation.
A computer program product that, when run on a terminal device, causes the terminal device to perform the above method for entity disambiguation.
It can be seen from the above that the embodiments of the present application have the following advantages:
The embodiments of the present application establish a co-word network for each entity and compare the similarity between the co-word networks of two entities to determine whether the two entities refer to the same entity or to different entities. When the similarity between the co-word networks of the two entities is greater than the first threshold, the two entities are considered to refer to the same entity and can be determined to be the same entity; when the similarity is less than the second threshold, the two entities are considered to refer to different entities and can be determined to be different entities, thereby effectively achieving entity disambiguation.
Brief description of the drawings
Fig. 1 is a flowchart of a method for entity disambiguation provided by an embodiment of the present application;
Fig. 2a is an exemplary diagram of co-word networks provided by an embodiment of the present application;
Fig. 2b is an exemplary diagram of co-word networks provided by an embodiment of the present application;
Fig. 3 is a flowchart of a method for establishing a co-word network based on unstructured data provided by an embodiment of the present application;
Fig. 4 is a flowchart of a method for establishing a co-word network based on semi-structured data provided by an embodiment of the present application;
Fig. 5 is an exemplary diagram of an established co-word network provided by an embodiment of the present application;
Fig. 6 is an exemplary diagram of an established co-word network provided by an embodiment of the present application;
Fig. 7 is a flowchart of a method for calculating the similarity between a first co-word network and a second co-word network provided by an embodiment of the present application;
Fig. 8 is a structural diagram of an apparatus for entity disambiguation provided by an embodiment of the present application.
Detailed description of the embodiments
To make the above objects, features, and advantages of the present application clearer and easier to understand, the embodiments of the present application are described in further detail below with reference to the accompanying drawings and specific implementations.
The inventors have found through research that traditional entity disambiguation mainly disambiguates English entities through semantic knowledge bases. Although building a semantic knowledge base could also be used to disambiguate Chinese entities, research results on Chinese entity disambiguation are scarce and no large-scale semantic knowledge base has yet taken shape, so semantic knowledge bases cannot cover all Chinese entities, and many Chinese entities cannot be disambiguated by searching a semantic knowledge base.
To solve the above problem, an embodiment of the present application provides a method for entity disambiguation. The method establishes a co-word network for each entity and compares the similarity between the co-word networks of two entities to determine whether the two entities refer to the same entity or to different entities. When the similarity between the co-word networks of the two entities is greater than a first threshold, the two entities are considered to refer to the same entity; if the two entities have different names, this indicates that alias aggregation is needed, and the two entities are determined to be the same entity with different names. When the similarity is less than a second threshold, the two entities are considered to refer to different entities; if the two entities have the same name, this indicates that same-name disambiguation is needed, and the two entities are determined to be different entities with the same name. The purpose of entity disambiguation is thereby effectively achieved.
It should be noted that an entity in the embodiments of the present application refers to a thing that objectively exists in the real world and can be distinguished from other things. An entity may be a specific person, object, event, or an abstract concept.
In the embodiments of the present application, a co-word network may include entity nodes and edges, and may be a network describing an entity that is formed from the relations between the entity and its feature words and the relations among the feature words. The entity and each feature word may each serve as an entity node in the co-word network, and the relations between the entity and the feature words, as well as the relations among the feature words, may be represented by edges between entity nodes. When a co-word network is stored, an ID may be set for each node, and an edge is determined by its two nodes; for example, if edge e1 is determined by node 1 and node 2, where node 1 has ID1 and node 2 has ID2, then e1 = (ID1, ID2), and e1 may be stored in an edge set E. In the figures, an edge is represented by a line between two nodes, as shown in Fig. 2a, Fig. 2b, Fig. 5, and Fig. 6.
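The storage scheme just described (IDs per node, edges as ID pairs kept in an edge set E) can be sketched as a minimal data structure. The class and method names are illustrative, not from the patent; edges are stored undirected by ordering the two IDs, one reasonable reading of e1 = (ID1, ID2).

```python
class CoWordNetwork:
    """Minimal co-word network: nodes carry integer IDs, an edge is an ID pair."""

    def __init__(self):
        self.node_ids = {}   # label -> ID
        self.labels = []     # ID -> label
        self.edges = set()   # edge set E of (id_a, id_b) pairs, smaller ID first

    def add_node(self, label):
        if label not in self.node_ids:
            self.node_ids[label] = len(self.labels)
            self.labels.append(label)
        return self.node_ids[label]

    def add_edge(self, label_a, label_b):
        a, b = self.add_node(label_a), self.add_node(label_b)
        self.edges.add((min(a, b), max(a, b)))  # e.g. e1 = (ID1, ID2)

net = CoWordNetwork()
net.add_edge("apple", "phone")   # entity-to-feature-word relation
net.add_edge("phone", "screen")  # feature-word-to-feature-word relation
print(len(net.edges))            # 2 edges stored by ID in set E
```

Storing edges as ID pairs keeps the edge set compact and makes edge lookup a set-membership test, independent of the length of the node labels.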
How the embodiments of the present application perform entity disambiguation is described in detail below with reference to the accompanying drawings.
Referring to Fig. 1, which is a flowchart of a method for entity disambiguation provided by an embodiment of the present application.
S101: establish a first co-word network corresponding to a first entity and a second co-word network corresponding to a second entity, the first co-word network and the second co-word network having at least one identical entity node.
When a text corpus is processed, entity names in the corpus may be ambiguous, making it impossible to determine whether multiple entities appearing in the corpus refer to the same entity. In this case, entity disambiguation needs to be performed on those entities. Taking a first entity and a second entity among them as an example, a first co-word network corresponding to the first entity and a second co-word network corresponding to the second entity may be established, and the similarity between the two networks may be analyzed to determine whether the first entity and the second entity refer to the same entity. The established co-word networks may be as shown in Fig. 5 and Fig. 6, where the first entity and the second entity are the same scholar name; by establishing the first and second co-word networks for that name, it can be determined whether the identically named scholars from different institutions refer to the same person.
It should be noted that the text corpus may include unstructured data and semi-structured data; the implementations for establishing a co-word network from each kind of corpus are described in detail later.
It should also be noted that when the first co-word network and the second co-word network have no identical entity node, there is no need to analyze the similarity between them: the two networks can be considered dissimilar, that is, the first entity and the second entity can be considered not to refer to one entity.
As shown in Fig. 2a, the co-word network on the left may represent the first co-word network, in which the grey entity node A may represent the first entity and the white entity nodes represent the feature words corresponding to the first entity; the co-word network on the right may represent the second co-word network, in which the grey entity node B may represent the second entity and the white entity nodes represent the feature words corresponding to the second entity. As can be seen from Fig. 2a, the first and second co-word networks share no common entity node and do not intersect, which indicates that the first entity and the second entity are unrelated; therefore, they can be distinguished directly, and it can be considered that they do not refer to one entity.
When the first co-word network and the second co-word network do have identical entity nodes, the method provided by this embodiment can be continued: the two networks are analyzed further, and the similarity between them is used to determine whether the first entity and the second entity refer to one entity.
As shown in Fig. 2b, in contrast to Fig. 2a, there are two white entity nodes, 5 and 6, between the first and second co-word networks; entity nodes 5 and 6 exist in both the first co-word network and the second co-word network, so they serve as the identical entity nodes between the two networks. The presence of identical entity nodes 5 and 6 gives the first and second co-word networks a certain similarity, and the first entity can be considered to have a certain correlation with the second entity. However, this alone cannot determine whether the first entity and the second entity refer to the same entity; the similarity still needs to be calculated, so that its magnitude can determine whether the two entities refer to one entity.
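The screening step illustrated by Fig. 2a and Fig. 2b, checking whether the two networks share any entity node before computing similarity, can be sketched as a set intersection. The node sets below use the node numbering from the figures; the function name is illustrative.

```python
def shared_entity_nodes(nodes1, entity1, nodes2, entity2):
    """Nodes present in both co-word networks, excluding the entities themselves."""
    return (set(nodes1) - {entity1}) & (set(nodes2) - {entity2})

# Fig. 2a: no common node, so the entities can be told apart directly
print(sorted(shared_entity_nodes({"A", 1, 2, 3}, "A",
                                 {"B", 7, 8, 9}, "B"), key=str))  # []

# Fig. 2b: nodes 5 and 6 belong to both networks, so similarity is computed next
print(sorted(shared_entity_nodes({"A", 1, 5, 6}, "A",
                                 {"B", 5, 6, 9}, "B"), key=str))  # [5, 6]
```

An empty intersection short-circuits the pipeline (the entities are declared different without any similarity calculation), while a nonempty intersection becomes the identical entity set used by the subgraph comparison in S102.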
S102: calculate the similarity between the first co-word network and the second co-word network.
In this embodiment, the similarity between the first co-word network and the second co-word network may refer to the similarity of their macroscopic structures. For example, the similarity may be calculated from the identical subgraphs contained in the first and second co-word networks: the more identical subgraphs there are, the higher the similarity of the macroscopic structures of the two networks, and the more likely it is that the first entity and the second entity refer to the same entity.
It should be noted that the implementation of calculating the similarity between the first co-word network and the second co-word network in S102 is described in detail later.
S103: when the similarity is greater than a first threshold, determining the first entity and the second entity to be the same entity.
S104: when the similarity is less than a second threshold, determining the first entity and the second entity to be different entities.
Since a higher similarity indicates a higher possibility that the first entity and the second entity refer to the same entity, when the calculated similarity is greater than the first threshold, the first entity and the second entity can be considered to actually refer to the same entity; when the calculated similarity is less than the second threshold, the first entity and the second entity can be considered to actually refer to different entities.
In some cases, when the calculated similarity is greater than the first threshold and the first entity and the second entity have different names, the first entity and the second entity can be determined to be the same entity with different names; that is, the differently named first entity and second entity can be considered to actually refer to the same entity, and name aggregation can be performed on the first entity and the second entity.
In some cases, when the calculated similarity is less than the second threshold and the first entity and the second entity have the same name, the first entity and the second entity can be determined to be different entities with the same name; that is, the identically named first entity and second entity can be considered to actually refer to different entities, and namesake disambiguation can be performed on the first entity and the second entity.
It should be noted that S103 and S104 are not executed sequentially; in this embodiment, whether to perform S103 or S104 can be determined according to the magnitude of the similarity value. It should also be noted that the first threshold and the second threshold can be determined experimentally; in general, the first threshold is greater than the second threshold.
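As a minimal sketch of the decision logic of S103 and S104 (the function and threshold names are illustrative, not from the patent), the thresholding can be expressed as:

```python
def decide(similarity, first_threshold, second_threshold):
    """Map a co-word-network similarity to a disambiguation decision.

    Assumes first_threshold > second_threshold, as described above;
    similarities falling between the two thresholds stay undecided.
    """
    if similarity > first_threshold:
        return "same entity"         # S103; name aggregation if the names differ
    if similarity < second_threshold:
        return "different entities"  # S104; namesake disambiguation if the names match
    return "undecided"
```

Because S103 and S104 are alternatives rather than sequential steps, at most one branch fires for a given similarity value.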
By establishing a co-word network for each entity and comparing the similarity between the co-word networks of two entities, the embodiment of the present application determines whether the two entities refer to the same entity or to different entities. When the similarity between the co-word networks of the two entities is greater than the first threshold, the two entities are considered to refer to the same entity and can be determined as the same entity; when the similarity between the co-word networks of the two entities is less than the second threshold, the two entities are considered to refer to different entities and can be determined as different entities, thereby effectively realizing entity disambiguation.
Further, when the similarity between the co-word networks of two entities is greater than the first threshold and the two entities have different names, the two differently named entities are considered to refer to the same entity; this indicates that name aggregation is needed, where name aggregation refers to determining the two entities as the same entity with different names. When the similarity between the co-word networks of two entities is less than the second threshold and the two entities have the same name, the two identically named entities are considered to refer to different entities; this indicates that namesake disambiguation is needed, where namesake disambiguation refers to determining the two entities as different entities with the same name. Entity disambiguation is thereby effectively realized.
Below, with reference to the accompanying drawings and taking the first entity and the second entity as examples, the method for establishing a co-word network is described in detail. The text corpora targeted by the embodiment of the present application mainly include unstructured data and semi-structured data, and because of the different characteristics of the two, the methods for establishing the co-word network of an entity differ between them. This embodiment therefore introduces, respectively, the method for establishing a co-word network based on unstructured data and the method for establishing a co-word network based on semi-structured data.
In this embodiment, a text corpus may refer to the linguistic material of an entity, covering two senses: mention and relatedness. Mention means that the entity appears in the text corpus; relatedness means that although the entity itself does not appear in the text corpus, feature words related to the entity do appear — in other words, the text corpus is about content related to the entity. For example, in a text corpus about patents, even if the phrase "intellectual property" never appears, the corpus is related to intellectual property, because a patent is one kind of intellectual property. In the embodiment of the present application, the language of the text corpus may be Chinese, English, Japanese, and so on; the present application places no specific limitation on it.
Referring to Fig. 3, Fig. 3 shows a flowchart of the method for establishing a co-word network based on unstructured data. The method includes:
S301: obtaining a first text corpus corresponding to the first entity and a second text corpus corresponding to the second entity.
In this embodiment, the first text corpus and the second text corpus are unstructured data. Unstructured data may be data whose structure is irregular or incomplete, that has no predefined data model, and that is inconvenient to represent with the two-dimensional logical tables of a database. Unstructured data may include office documents, plain text, and the like.
S302: extracting a first feature-word set corresponding to the first entity from the first text corpus, and extracting a second feature-word set corresponding to the second entity from the second text corpus.
A feature word may refer to a word or phrase in the text corpus that carries independent meaning. In this embodiment, feature words may include co-occurrence words that appear together with the first entity or the second entity, keywords, and category feature words.
A co-occurrence word may refer to a word in the text corpus that appears within a preset range around the entity; a category feature word may refer to a word in the text corpus that belongs to the same entity category as the entity.
Accordingly, one implementation of extracting the first feature-word set corresponding to the first entity and the second feature-word set corresponding to the second entity in S302 may be as follows. First, a first co-occurrence word set corresponding to the first entity is extracted from the first text corpus, and a second co-occurrence word set corresponding to the second entity is extracted from the second text corpus. Then, a first keyword set and a first category-feature-word set corresponding to the first entity are extracted from the first text corpus, and a second keyword set and a second category-feature-word set corresponding to the second entity are extracted from the second text corpus, where the first category-feature-word set includes category feature words of the same entity category as the first entity, and the second category-feature-word set includes category feature words of the same entity category as the second entity. Finally, the union of the first co-occurrence word set, the first keyword set, and the first category-feature-word set is taken as the first feature-word set corresponding to the first entity; and the union of the second co-occurrence word set, the second keyword set, and the second category-feature-word set is taken as the second feature-word set corresponding to the second entity.
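Assuming the three sub-sets have already been extracted (the words below are invented for illustration), taking the union in S302 is a plain set operation:

```python
# Hypothetical feature-word sub-sets extracted for an entity "apple"
cooccurrence_words = {"juice", "tree", "harvest"}
keywords = {"orchard", "harvest"}
category_feature_words = {"fruit", "pear"}

# The feature-word set is the union of the three sub-sets; a word
# appearing in several sub-sets, such as "harvest", is kept only once.
feature_word_set = cooccurrence_words | keywords | category_feature_words
```

The same union is taken independently for the first entity and for the second entity.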
It should be noted that in this embodiment a sliding window may be used to extract co-occurrence words: the preset range around the entity is the sliding window, and words appearing together with the entity in the sliding window are taken as co-occurrence words of the entity. The size of the sliding window can be determined experimentally; in general, the size may be chosen so that 20 co-occurrence words fall within the window, for example 10 co-occurrence words before the entity and 10 co-occurrence words after it. Multiple co-occurrence words may be extracted, which together form the co-occurrence word set.
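The sliding-window extraction can be sketched as follows; the function name and the tokenized input are assumptions, and the default of 10 words on each side follows the example above:

```python
def extract_cooccurrence_words(tokens, entity, window=10):
    """Collect the words appearing within `window` positions before and
    after every occurrence of `entity` in a tokenized corpus."""
    cooccurring = set()
    for i, token in enumerate(tokens):
        if token == entity:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            cooccurring.update(tokens[start:i])    # words before the entity
            cooccurring.update(tokens[i + 1:end])  # words after the entity
    return cooccurring
```

Every occurrence of the entity contributes its window, so the returned set aggregates co-occurrence words over the whole corpus.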
It can be understood that a co-word network established from co-occurrence words embodies the local relations of the entity within the text corpus. To embody the global relations of the entity within the text corpus, keywords can additionally be extracted from the corpus to form the keyword set, and category feature words can be extracted to form the category-feature-word set, supplementing the relations of the entity in the corpus, so that a co-word network reflecting the global relations of the entity in the text corpus is obtained.
A keyword may be a word in the text corpus that embodies the main content the corpus expresses. A category feature word may be a word in the text corpus that represents the category of the entity, or that belongs to the same category as the entity; for example, if the entity is "apple", category feature words may include "fruit", "pear", and the like. Keywords and category feature words can be extracted by corresponding algorithms, for example algorithms based on semantic recognition.
S303: establishing the first co-word network corresponding to the first entity according to the relations among the feature words in the first feature-word set and the relations between those feature words and the first entity, and establishing the second co-word network corresponding to the second entity according to the relations among the feature words in the second feature-word set and the relations between those feature words and the second entity.
Since a co-word network includes entity nodes and edges, once the entity and its corresponding feature-word set are determined, the entity nodes of the co-word network are determined. To establish the co-word network, it is also necessary to determine how the edges between the entity nodes connect, which requires the relations among the feature words and the relations between the feature words and the entity. An edge between two entity nodes indicates a direct relation between them. Therefore, whether an edge exists between the entity and each feature word in the co-word network can be determined from the relation between the entity and that feature word; likewise, whether an edge exists between two feature words can be determined from the relation between those two feature words.
For example, in the co-word network on the left of Fig. 2a, which represents entity A, there is a direct relation between entity A and the feature word serving as entity node 1, so in the co-word network of entity A an edge exists between entity A and entity node 1. The feature word serving as entity node 4, however, may have been determined from the feature word serving as entity node 2; that is, entity node 4 has a direct relation with entity node 2 and only an indirect relation with entity A, established through entity node 2. Therefore, in the co-word network of entity A, an edge exists between entity node 4 and entity node 2, but no edge exists between entity node 4 and entity A.
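The edge rules of S303 can be sketched with plain sets (all names below are illustrative): the entity gets an edge only to feature words directly related to it, while indirectly related words connect through their intermediate word, as with nodes 2 and 4 of Fig. 2a.

```python
def build_coword_network(entity, direct_words, word_pairs):
    """Represent a co-word network as a set of undirected edges.

    `direct_words`: feature words with a direct relation to the entity.
    `word_pairs`: (word, word) pairs of directly related feature words.
    """
    edges = {frozenset((entity, word)) for word in direct_words}
    edges |= {frozenset(pair) for pair in word_pairs}
    return edges

# Entity A is directly related to nodes 1 and 2; node 4 is related
# only to node 2, so it reaches A indirectly.
network_a = build_coword_network("A", ["node1", "node2"], [("node2", "node4")])
```

Using `frozenset` pairs makes the edges undirected: an edge (A, node1) equals (node1, A).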
Referring to Fig. 4, Fig. 4 shows a flowchart of the method for establishing a co-word network based on semi-structured data. The method includes:
S401: obtaining a first text corpus corresponding to the first entity and a second text corpus corresponding to the second entity.
In this embodiment, the first text corpus and the second text corpus are semi-structured data. Compared with ordinary plain text, semi-structured data has a certain structure, but that structure varies greatly. Semi-structured data may include paper abstracts, encyclopedia entries, and the like.
For example, in academic-research scenarios there may be a large number of identically named scholars. Existing disambiguation of identically named scholars relies on their affiliations; in some cases, however, identically named scholars also exist within the same unit, and identically named scholars at different units may in fact be the same person. For this reason, co-word networks of the identically named scholars can be established based on paper abstracts to perform the disambiguation, where the paper abstract serves as the text corpus and the identically named scholar serves as the entity.
S402: obtaining, from the first text corpus, related attributes of the first entity and entity nodes associated with the first entity, and obtaining, from the second text corpus, related attributes of the second entity and entity nodes associated with the second entity.
Owing to the characteristics of semi-structured data, establishing a co-word network based on semi-structured data is relatively simple compared with unstructured data. When establishing a co-word network based on semi-structured data, the related attributes of the entity and the other entity nodes related to the entity can be extracted from the text corpus according to the entity.
Taking the case where the first entity and the second entity are both the identically named scholar "You Guangrong" as an example, the related attributes of the entity may include the publication in which a paper by the scholar appeared, the publication date, the papers published, the keywords of those papers, and so on; the entity nodes associated with the entity may include the other scholars who co-authored a paper with the identically named scholar, the references of the papers the scholar published, and so on.
S403: establishing the first co-word network corresponding to the first entity according to the related attributes of the first entity and the entity nodes related to the first entity, and establishing the second co-word network corresponding to the second entity according to the related attributes of the second entity and the entity nodes related to the second entity.
In this embodiment, after the related attributes of the entity and the entity nodes related to the entity are determined, both can serve as entity nodes in the co-word network corresponding to the entity. It is then still necessary to determine how the edges between the entity nodes connect. The method of determining these edges is similar to that in S303: whether a direct relation exists between each pair of entity nodes must be determined, and when a direct relation exists between two entity nodes, an edge exists between those two entity nodes in the co-word network.
Continuing the example in which the first entity and the second entity are both the identically named scholar "You Guangrong": when establishing the co-word network for the first entity, the related attributes of the first entity may be a paper A published by "You Guangrong" and the keywords of paper A, and the entity nodes associated with the entity may include the other authors who co-authored paper A with "You Guangrong". The published paper A, the keywords of paper A, and the co-authors of paper A can each serve as an entity node. Therefore, when establishing the first co-word network corresponding to the first entity, an edge may exist between the first entity and the published paper A; since the keywords of paper A have a direct relation with paper A, an edge exists between the keywords of paper A and paper A when the first co-word network is established, while no edge exists between the first entity and the keywords of paper A. Similarly, the co-authors of paper A connect through paper A, and no edge exists between them and the first entity.
When the co-word network of "You Guangrong" is established from the papers published by "You Guangrong", co-word networks such as those shown in Fig. 5 can be obtained. Since the scholar "You Guangrong", as the entity, appears at four different units, the four co-word networks shown in Fig. 5 can be established from the paper abstracts that "You Guangrong" published at the different units.
Further, in many cases a co-word network can be established not only from semi-structured data or unstructured data alone, but also from semi-structured data combined with unstructured data. For example, besides semi-structured parts such as the abstract section and the keyword section, a paper also contains the unstructured data of its body text. Therefore, after the co-word networks shown in Fig. 5 are established, the keywords of "You Guangrong" in the papers published at the different units can be obtained — these keywords may include the paper's own keyword section as well as keywords extracted from the body text — and used to supplement the four co-word networks obtained in Fig. 5, so that the four co-word networks shown in Fig. 6 are obtained.
Using the method for S102-S104, the similarity of 4 co-word networks to being obtained in Fig. 6 is calculated and analyzed, can It is actually same scholar with the scholar " trip is glorious " for determining to appear in 4 units.
In this embodiment, the manner of establishing the first co-word network corresponding to the first entity and the second co-word network corresponding to the second entity can be chosen reasonably according to the data type of the text corpus, so that co-word networks reflecting the relations among entity nodes can be established accurately and conveniently, and whether the first entity and the second entity refer to the same entity can subsequently be determined according to the similarity between the established first co-word network and second co-word network.
Further, for S102, this embodiment provides an implementation that calculates the similarity between the first co-word network and the second co-word network according to the identical subgraphs contained in the two networks. Referring to Fig. 7, Fig. 7 shows a flowchart of a method for calculating the similarity between the first co-word network and the second co-word network. The method includes:
S701: obtaining the entity nodes that the first co-word network and the second co-word network have in common as an identical entity set.
Taking as an example the case where the first entity is entity A, the second entity is entity B, the first co-word network corresponding to the first entity is G_A, and the second co-word network corresponding to the second entity is G_B: the identical entity nodes contained in both G_A and G_B are determined as the identical entity set N_c.
S702: removing, from the first co-word network, the entity nodes that do not belong to the identical entity set to obtain a third co-word network; and removing, from the second co-word network, the entity nodes that do not belong to the identical entity set to obtain a fourth co-word network.
Removing from G_A the entity nodes not belonging to the identical entity set N_c yields the third co-word network G_A'; removing from G_B the entity nodes not belonging to the identical entity set N_c yields the fourth co-word network G_B'.
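Assuming each network is represented by its node set (the labels below are invented, loosely mirroring nodes 5 and 6 of Fig. 2b), S701 and S702 reduce to set operations:

```python
def identical_entity_set(nodes_a, nodes_b):
    """S701: entity nodes present in both co-word networks."""
    return nodes_a & nodes_b

def prune_network(entity, nodes, shared):
    """S702: drop every node outside the identical entity set,
    keeping the entity itself."""
    return {n for n in nodes if n in shared or n == entity}

nodes_ga = {"A", "1", "2", "4", "5", "6"}
nodes_gb = {"B", "3", "5", "6"}
shared = identical_entity_set(nodes_ga, nodes_gb)
ga_prime = prune_network("A", nodes_ga, shared)  # third co-word network
gb_prime = prune_network("B", nodes_gb, shared)  # fourth co-word network
```

After pruning, both reduced networks contain only the shared nodes plus their own entity, which is what the subgraph comparison of S703–S704 operates on.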
S703: obtaining a first subgraph set corresponding to the third co-word network and a second subgraph set corresponding to the fourth co-word network, where the first subgraph set is the set of subgraphs formed by the first entity together with any one or more entity nodes in the third co-word network, and the second subgraph set is the set of subgraphs formed by the second entity together with any one or more entity nodes in the fourth co-word network.
S704: counting, according to the first subgraph set and the second subgraph set, the number of identical subgraphs existing between the third co-word network and the fourth co-word network.
Continuing with the third co-word network G_A' and the fourth co-word network G_B' obtained above: from G_A', all subgraphs containing entity node A can be obtained as the first subgraph set, and from G_B', all subgraphs containing entity node B can be obtained as the second subgraph set. Then, according to the first subgraph set and the second subgraph set, the number of identical subgraphs existing between the third co-word network and the fourth co-word network can be counted.
It should be noted that, since every subgraph in the first subgraph set contains entity node A and every subgraph in the second subgraph set contains entity node B, two subgraphs are considered identical when their structures are identical. In order to compare only the structures of the subgraphs, and to avoid differences caused merely by entity node A versus entity node B, the entity node B in each subgraph of the second subgraph set corresponding to G_B' can be replaced with entity node A, and the number of identical subgraphs of G_B' and G_A' counted using the replaced second subgraph set and the first subgraph set.
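A minimal sketch of the replacement-and-count step, with subgraphs represented as frozensets of undirected edges (the helper names and representation are assumptions; a full implementation would need subgraph enumeration as well):

```python
def count_identical_subgraphs(subgraphs_a, subgraphs_b, entity_a, entity_b):
    """S704: relabel entity_b as entity_a in the second subgraph set so
    that only structure is compared, then count subgraphs found in both."""
    def relabel(edge):
        return frozenset(entity_a if node == entity_b else node for node in edge)

    relabeled_b = {frozenset(relabel(edge) for edge in sg) for sg in subgraphs_b}
    return len(set(subgraphs_a) & relabeled_b)
```

With the entities relabeled to a common name, set intersection directly compares the edge structure of the subgraphs.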
S705: calculating the similarity between the first co-word network and the second co-word network according to the number of identical subgraphs.
In this embodiment, the number of identical subgraphs could itself serve as the similarity between the first co-word network and the second co-word network: the more identical subgraphs, the higher the similarity between the two networks. However, to facilitate comparing different similarities and setting unified thresholds across them — so that whether the first entity and the second entity refer to the same entity can be determined from the relation between the similarity and the thresholds — the number of identical subgraphs can be normalized, and the normalized result taken as the similarity between the first co-word network and the second co-word network.
Therefore, as an example, an implementation of calculating the similarity between the first co-word network and the second co-word network according to the number of identical subgraphs may be as follows: obtain a first number of subgraphs formed by the entity nodes in the identical entity set together with the first entity, and a second number of subgraphs formed by the entity nodes in the identical entity set together with the second entity, and calculate the sum of the first number and the second number as the total number of subgraphs; calculate the ratio of the number of identical subgraphs to the total number of subgraphs as a first result; calculate the product of the square of the number of entity nodes contained in the first co-word network and the square of the number of entity nodes contained in the second co-word network as a second result; and calculate the ratio of the first result to the second result as the similarity between the first co-word network and the second co-word network.
This calculation of the similarity can, for example, be expressed by the following formula:

MacStruSim = (M_SIM / M_NC) / (N_A² × N_B²)

where MacStruSim is the similarity between the first co-word network and the second co-word network, M_SIM is the number of identical subgraphs, M_NC is the total number of subgraphs, N_A is the number of entity nodes contained in the first co-word network, and N_B is the number of entity nodes contained in the second co-word network.
The similarity MacStruSim calculated by the above formula is a number between 0 and 1; the larger its value, the higher the similarity between the first co-word network and the second co-word network, and conversely, the smaller its value, the lower the similarity between the two networks.
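Reading the description above literally, the normalization can be sketched as follows (variable names mirror the formula's symbols; this is an illustrative sketch, not the patent's reference implementation):

```python
def mac_stru_sim(m_sim, m_nc, n_a, n_b):
    """MacStruSim = (M_SIM / M_NC) / (N_A^2 * N_B^2): the share of
    identical subgraphs, normalized by the squared entity-node counts
    of the two co-word networks."""
    return (m_sim / m_nc) / (n_a ** 2 * n_b ** 2)
```

Since m_sim never exceeds m_nc and the node counts are at least 1, the result stays within 0 to 1, as the passage states.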
The similarity calculated by the method shown in this embodiment represents the macrostructural similarity between co-word networks. Macrostructural similarity reflects the similarity of the overall topology of the networks, and can therefore more accurately reflect the degree of similarity between the co-word networks corresponding to the entities, so that whether multiple entities refer to the same entity can be determined more accurately.
Based on the method for realizing entity disambiguation provided above, the embodiment of the present application further provides a device for realizing entity disambiguation. Referring to Fig. 8, Fig. 8 shows a structural diagram of a device for realizing entity disambiguation. The device includes an establishing unit 801, a computing unit 802, a first determination unit 803, and a second determination unit 804:
the establishing unit 801 is configured to establish a first co-word network corresponding to a first entity and a second co-word network corresponding to a second entity, the first co-word network and the second co-word network having identical entity nodes;
the computing unit 802 is configured to calculate the similarity between the first co-word network and the second co-word network;
the first determination unit 803 is configured to determine the first entity and the second entity to be the same entity when the similarity is greater than a first threshold;
the second determination unit 804 is configured to determine the first entity and the second entity to be different entities when the similarity is less than a second threshold.
Optionally, the first determination unit is specifically configured to: when the similarity is greater than the first threshold and the first entity differs in name from the second entity, determine the first entity and the second entity to be the same entity with different names;
the second determination unit is specifically configured to: when the similarity is less than the second threshold and the first entity is identical in name to the second entity, determine the first entity and the second entity to be different entities with the same name.
Optionally, the establishing unit includes:
a first obtaining subunit, configured to obtain a first text corpus corresponding to the first entity and a second text corpus corresponding to the second entity;
a first extraction subunit, configured to, when the first text corpus and the second text corpus are unstructured data, extract a first feature-word set corresponding to the first entity from the first text corpus and extract a second feature-word set corresponding to the second entity from the second text corpus;
a first establishing subunit, configured to establish the first co-word network corresponding to the first entity according to the relations among the feature words in the first feature-word set and the relations between those feature words and the first entity, and to establish the second co-word network corresponding to the second entity according to the relations among the feature words in the second feature-word set and the relations between those feature words and the second entity.
Optionally, the first extraction subunit includes:
a first extraction module, configured to extract a first co-occurrence word set corresponding to the first entity from the first text corpus and extract a second co-occurrence word set corresponding to the second entity from the second text corpus, the first co-occurrence word set including the co-occurrence words appearing within a preset range of the first entity in the first text corpus, and the second co-occurrence word set including the co-occurrence words appearing within a preset range of the second entity in the second text corpus;
a second extraction module, configured to extract a first keyword set and a first category-feature-word set corresponding to the first entity from the first text corpus, and extract a second keyword set and a second category-feature-word set corresponding to the second entity from the second text corpus, the first category-feature-word set including category feature words of the same entity category as the first entity, and the second category-feature-word set including category feature words of the same entity category as the second entity;
a merging module, configured to take the union of the first co-occurrence word set, the first keyword set, and the first category-feature-word set to obtain the first feature-word set corresponding to the first entity, and to take the union of the second co-occurrence word set, the second keyword set, and the second category-feature-word set to obtain the second feature-word set corresponding to the second entity.
Optionally, the establishing unit includes:
a second obtaining subunit, configured to obtain a first text corpus corresponding to the first entity and a second text corpus corresponding to the second entity;
a third obtaining subunit, configured to, when the first text corpus and the second text corpus are semi-structured data, obtain, from the first text corpus, related attributes of the first entity and entity nodes associated with the first entity, and obtain, from the second text corpus, related attributes of the second entity and entity nodes associated with the second entity;
a second establishing subunit, configured to establish the first co-word network corresponding to the first entity according to the related attributes of the first entity and the entity nodes related to the first entity, and to establish the second co-word network corresponding to the second entity according to the related attributes of the second entity and the entity nodes related to the second entity.
Optionally, the computing unit includes:
a fourth obtaining subunit, configured to obtain the entity nodes that the first co-word network and the second co-word network have in common as an identical entity set;
a fifth obtaining subunit, configured to remove, from the first co-word network, the entity nodes not belonging to the identical entity set to obtain a third co-word network, and to remove, from the second co-word network, the entity nodes not belonging to the identical entity set to obtain a fourth co-word network;
a sixth obtaining subunit, configured to obtain a first subgraph set corresponding to the third co-word network and a second subgraph set corresponding to the fourth co-word network, the first subgraph set being the set of subgraphs formed by the first entity together with any one or more entity nodes in the third co-word network, and the second subgraph set being the set of subgraphs formed by the second entity together with any one or more entity nodes in the fourth co-word network;
a counting subunit, configured to count, according to the first subgraph set and the second subgraph set, the number of identical subgraphs existing between the third co-word network and the fourth co-word network;
a first computation subunit, configured to calculate the similarity between the first co-word network and the second co-word network according to the number of identical subgraphs.
Optionally, first computation subunit includes:
Acquisition module, subgraph is formed for obtaining the entity node in the identical entity sets with the first instance Entity node in first number, and the identical entity sets forms second number of subgraph, meter with the second instance Calculate the total number of the sum of first number and second number as subgraph;
First computing module, for calculating the ratio of the number of the same sub-image and the total number of the subgraph, by institute Ratio is stated as the first result;
Second computing module, the entity node number included for calculating first co-word network square with it is described The entity node number that second co-word network includes square product, using the product as the second result;
3rd computing module, for calculating the ratio of first result and second result as the described first common word Similarity between network and second co-word network.
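Taken together, the computing modules above describe a closed-form similarity: the share of identical subgraphs among all subgraphs, divided by the product of the squared node counts of the two networks. A hedged sketch (operating on plain counts, with illustrative inputs) is:

```python
# Similarity as described by the computation modules: (identical subgraphs
# / total subgraphs) divided by (|V1|^2 * |V2|^2). Inputs are counts only.
def co_word_similarity(identical_subgraphs, first_count, second_count,
                       first_nodes, second_nodes):
    total_subgraphs = first_count + second_count       # obtaining module
    first_result = identical_subgraphs / total_subgraphs  # first module
    second_result = (first_nodes ** 2) * (second_nodes ** 2)  # second module
    return first_result / second_result                # third module

# Hypothetical counts: 2 identical subgraphs out of 3 + 1 subgraphs in
# total, with two co-word networks of 2 entity nodes each.
sim = co_word_similarity(2, 3, 1, 2, 2)
```

With these inputs the first result is 2/4 = 0.5 and the second result is 4 · 4 = 16, giving a similarity of 0.03125; the absolute scale matters only relative to the two thresholds.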
In the embodiments of the present application, a co-word network is established for each entity, and the similarity between the co-word networks of two entities is compared to determine whether the two entities refer to the same entity or to different entities. When the similarity between the co-word networks of the two entities is greater than a first threshold, the two entities are considered to refer to the same entity and can be determined to be the same entity; when the similarity is less than a second threshold, the two entities are considered to refer to different entities and can be determined to be different entities, thereby effectively achieving entity disambiguation.
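The two-threshold decision described above can be sketched as follows; the threshold values here are illustrative assumptions, since the patent leaves them unspecified:

```python
# Two-threshold entity disambiguation decision. Threshold values are
# invented for illustration; the patent does not fix them.
def disambiguate(similarity, first_threshold=0.8, second_threshold=0.2):
    if similarity > first_threshold:
        return "same entity"
    if similarity < second_threshold:
        return "different entities"
    return "undetermined"  # between the thresholds, no decision is made
```

Note that with two separate thresholds a middle band can remain where neither determination is made.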
Based on the method and apparatus for entity disambiguation provided above, an embodiment of the present application further provides a computer-readable storage medium storing instructions that, when run on a terminal device, cause the terminal device to perform the method for entity disambiguation according to any one of the foregoing embodiments.
Based on the method, apparatus, and storage medium for entity disambiguation provided above, an embodiment of the present application further provides a computer program product that, when run on a terminal device, causes the terminal device to perform the method for entity disambiguation according to any one of the foregoing embodiments.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts among the embodiments may be cross-referenced. Since the system or apparatus disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and relevant details can be found in the description of the method.
It should also be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

  1. A method for entity disambiguation, characterized in that the method comprises:
    establishing a first co-word network corresponding to a first entity and a second co-word network corresponding to a second entity, wherein the first co-word network and the second co-word network have identical entity nodes;
    calculating the similarity between the first co-word network and the second co-word network;
    when the similarity is greater than a first threshold, determining the first entity and the second entity to be the same entity;
    or, when the similarity is less than a second threshold, determining the first entity and the second entity to be different entities.
  2. The method according to claim 1, characterized in that determining the first entity and the second entity to be the same entity when the similarity is greater than the first threshold comprises:
    when the similarity is greater than the first threshold, if the name of the first entity differs from the name of the second entity, determining the first entity and the second entity to be the same entity with different names;
    and determining the first entity and the second entity to be different entities when the similarity is less than the second threshold comprises:
    when the similarity is less than the second threshold, if the name of the first entity is identical to the name of the second entity, determining the first entity and the second entity to be different entities with the same name.
  3. The method according to claim 1, characterized in that establishing the first co-word network corresponding to the first entity and the second co-word network corresponding to the second entity comprises:
    obtaining a first text corpus corresponding to the first entity and a second text corpus corresponding to the second entity;
    when the first text corpus and the second text corpus are unstructured data, extracting a first feature word set corresponding to the first entity from the first text corpus, and extracting a second feature word set corresponding to the second entity from the second text corpus;
    establishing the first co-word network corresponding to the first entity according to the relations between the feature words in the first feature word set and the relations between each feature word in the first feature word set and the first entity, and establishing the second co-word network corresponding to the second entity according to the relations between the feature words in the second feature word set and the second entity.
  4. The method according to claim 3, characterized in that extracting the first feature word set corresponding to the first entity from the first text corpus and extracting the second feature word set corresponding to the second entity from the second text corpus comprises:
    extracting a first co-occurrence word set corresponding to the first entity from the first text corpus, and extracting a second co-occurrence word set corresponding to the second entity from the second text corpus; the first co-occurrence word set comprises co-occurrence words appearing within a preset range of the first entity in the first text corpus, and the second co-occurrence word set comprises co-occurrence words appearing within a preset range of the second entity in the second text corpus;
    extracting a first keyword set and a first category feature word set corresponding to the first entity from the first text corpus, and extracting a second keyword set and a second category feature word set corresponding to the second entity from the second text corpus; the first category feature word set comprises category feature words of the same entity category as the first entity, and the second category feature word set comprises category feature words of the same entity category as the second entity;
    taking the union of the first co-occurrence word set, the first keyword set, and the first category feature word set to obtain the first feature word set corresponding to the first entity; and taking the union of the second co-occurrence word set, the second keyword set, and the second category feature word set to obtain the second feature word set corresponding to the second entity.
  5. The method according to claim 1, characterized in that establishing the first co-word network corresponding to the first entity and the second co-word network corresponding to the second entity comprises:
    obtaining a first text corpus corresponding to the first entity and a second text corpus corresponding to the second entity;
    when the first text corpus and the second text corpus are semi-structured data, obtaining related attributes of the first entity and entity nodes associated with the first entity according to the first text corpus, and obtaining related attributes of the second entity and entity nodes associated with the second entity according to the second text corpus;
    establishing the first co-word network corresponding to the first entity according to the related attributes of the first entity and the entity nodes related to the first entity, and establishing the second co-word network corresponding to the second entity according to the related attributes of the second entity and the entity nodes related to the second entity.
  6. The method according to claim 1, characterized in that calculating the similarity between the first co-word network and the second co-word network comprises:
    obtaining the entity nodes shared by the first co-word network and the second co-word network as an identical entity set;
    removing, from the first co-word network, the entity nodes that do not belong to the identical entity set to obtain a third co-word network; removing, from the second co-word network, the entity nodes that do not belong to the identical entity set to obtain a fourth co-word network;
    obtaining a first subgraph set corresponding to the third co-word network and a second subgraph set corresponding to the fourth co-word network; the first subgraph set is the set of subgraphs formed by the first entity and any one or more entity nodes in the third co-word network, and the second subgraph set is the set of subgraphs formed by the second entity and any one or more entity nodes in the fourth co-word network;
    counting, according to the first subgraph set and the second subgraph set, the number of identical subgraphs shared by the third co-word network and the fourth co-word network;
    calculating the similarity between the first co-word network and the second co-word network according to the number of identical subgraphs.
  7. The method according to claim 6, characterized in that calculating the similarity between the first co-word network and the second co-word network according to the number of identical subgraphs comprises:
    obtaining a first number of subgraphs that the entity nodes in the identical entity set form with the first entity, and a second number of subgraphs that the entity nodes in the identical entity set form with the second entity, and calculating the sum of the first number and the second number as the total number of subgraphs;
    calculating the ratio of the number of identical subgraphs to the total number of subgraphs as a first result;
    calculating the product of the square of the number of entity nodes included in the first co-word network and the square of the number of entity nodes included in the second co-word network as a second result;
    calculating the ratio of the first result to the second result as the similarity between the first co-word network and the second co-word network.
  8. An apparatus for entity disambiguation, characterized in that the apparatus comprises:
    an establishing unit, configured to establish a first co-word network corresponding to a first entity and a second co-word network corresponding to a second entity, wherein the first co-word network and the second co-word network have identical entity nodes;
    a computing unit, configured to calculate the similarity between the first co-word network and the second co-word network;
    a first determination unit, configured to determine the first entity and the second entity to be the same entity when the similarity is greater than a first threshold;
    a second determination unit, configured to determine the first entity and the second entity to be different entities when the similarity is less than a second threshold.
  9. A computer-readable storage medium, characterized in that instructions are stored on the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device performs the method for entity disambiguation according to any one of claims 1-7.
  10. A computer program product, characterized in that when the computer program product is run on a terminal device, the terminal device performs the method for entity disambiguation according to any one of claims 1-7.
CN201711423446.5A 2017-12-25 2017-12-25 Method, device, storage medium and program product for realizing entity disambiguation Active CN107992480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711423446.5A CN107992480B (en) 2017-12-25 2017-12-25 Method, device, storage medium and program product for realizing entity disambiguation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711423446.5A CN107992480B (en) 2017-12-25 2017-12-25 Method, device, storage medium and program product for realizing entity disambiguation

Publications (2)

Publication Number Publication Date
CN107992480A true CN107992480A (en) 2018-05-04
CN107992480B CN107992480B (en) 2021-09-14

Family

ID=62041764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711423446.5A Active CN107992480B (en) 2017-12-25 2017-12-25 Method, device, storage medium and program product for realizing entity disambiguation

Country Status (1)

Country Link
CN (1) CN107992480B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595708A (en) * 2018-05-10 2018-09-28 北京航空航天大学 A knowledge-graph-based text classification method for abnormal information
CN109472032A (en) * 2018-11-14 2019-03-15 北京锐安科技有限公司 Method, apparatus, server, and storage medium for determining an entity relationship graph
CN110134965A (en) * 2019-05-21 2019-08-16 北京百度网讯科技有限公司 Method, apparatus, equipment and computer readable storage medium for information processing
CN110580337A (en) * 2019-06-11 2019-12-17 福建奇点时空数字科技有限公司 professional entity disambiguation implementation method based on entity similarity calculation
CN110781309A (en) * 2019-07-01 2020-02-11 厦门美域中央信息科技有限公司 Entity parallel relation similarity calculation method based on pattern matching
CN110795572A (en) * 2019-10-29 2020-02-14 腾讯科技(深圳)有限公司 Entity alignment method, device, equipment and medium
CN111930894A (en) * 2020-08-13 2020-11-13 腾讯科技(深圳)有限公司 Long text matching method and device, storage medium and electronic equipment
CN113569060A (en) * 2021-09-24 2021-10-29 中国电子技术标准化研究院 Standard text based knowledge graph disambiguation method, system, device and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2251795A2 (en) * 2009-05-12 2010-11-17 Comcast Interactive Media, LLC Disambiguation and tagging of entities
US20150095306A1 (en) * 2007-12-10 2015-04-02 Sprylogics International Corp. Analysis, inference, and visualization of social networks
CN105183770A (en) * 2015-08-06 2015-12-23 电子科技大学 Chinese integrated entity linking method based on graph model
CN106407180A (en) * 2016-08-30 2017-02-15 北京奇艺世纪科技有限公司 Entity disambiguation method and apparatus
CN107357918A (en) * 2017-07-21 2017-11-17 中国矿业大学(北京) Document representation method based on figure

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150095306A1 (en) * 2007-12-10 2015-04-02 Sprylogics International Corp. Analysis, inference, and visualization of social networks
EP2251795A2 (en) * 2009-05-12 2010-11-17 Comcast Interactive Media, LLC Disambiguation and tagging of entities
CN105183770A (en) * 2015-08-06 2015-12-23 电子科技大学 Chinese integrated entity linking method based on graph model
CN106407180A (en) * 2016-08-30 2017-02-15 北京奇艺世纪科技有限公司 Entity disambiguation method and apparatus
CN107357918A (en) * 2017-07-21 2017-11-17 中国矿业大学(北京) Document representation method based on figure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张雄 et al.: "Research on an entity disambiguation method based on fused feature similarity", 《计算机应用研究》 (Application Research of Computers) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595708A (en) * 2018-05-10 2018-09-28 北京航空航天大学 A knowledge-graph-based text classification method for abnormal information
CN109472032A (en) * 2018-11-14 2019-03-15 北京锐安科技有限公司 Method, apparatus, server, and storage medium for determining an entity relationship graph
CN110134965A (en) * 2019-05-21 2019-08-16 北京百度网讯科技有限公司 Method, apparatus, equipment and computer readable storage medium for information processing
CN110134965B (en) * 2019-05-21 2023-08-18 北京百度网讯科技有限公司 Method, apparatus, device and computer readable storage medium for information processing
CN110580337A (en) * 2019-06-11 2019-12-17 福建奇点时空数字科技有限公司 professional entity disambiguation implementation method based on entity similarity calculation
CN110781309A (en) * 2019-07-01 2020-02-11 厦门美域中央信息科技有限公司 Entity parallel relation similarity calculation method based on pattern matching
CN110795572A (en) * 2019-10-29 2020-02-14 腾讯科技(深圳)有限公司 Entity alignment method, device, equipment and medium
CN110795572B (en) * 2019-10-29 2022-05-17 腾讯科技(深圳)有限公司 Entity alignment method, device, equipment and medium
CN111930894A (en) * 2020-08-13 2020-11-13 腾讯科技(深圳)有限公司 Long text matching method and device, storage medium and electronic equipment
CN111930894B (en) * 2020-08-13 2022-10-28 腾讯科技(深圳)有限公司 Long text matching method and device, storage medium and electronic equipment
CN113569060A (en) * 2021-09-24 2021-10-29 中国电子技术标准化研究院 Standard text based knowledge graph disambiguation method, system, device and medium

Also Published As

Publication number Publication date
CN107992480B (en) 2021-09-14

Similar Documents

Publication Publication Date Title
CN107992480A (en) Method, apparatus, storage medium, and program product for entity disambiguation
Pan et al. Course concept extraction in moocs via embedding-based graph propagation
JP5391634B2 (en) Selecting tags for a document through paragraph analysis
JP5332477B2 (en) Automatic generation of term hierarchy
JP5423030B2 (en) Determining words related to a word set
JP5353173B2 (en) Determining the concreteness of a document
US20120296637A1 (en) Method and apparatus for calculating topical categorization of electronic documents in a collection
JP2009093653A (en) Refining search space responding to user input
JP5391632B2 (en) Determining word and document depth
Rajagopal et al. Commonsense-based topic modeling
Lakhanpal et al. Discover trending domains using fusion of supervised machine learning with natural language processing
Cheng et al. Fine-grained topic detection in news search results
Punitha et al. Partition document clustering using ontology approach
Kaleel et al. Event detection and trending in multiple social networking sites
Na et al. A topic approach to sentence ordering for multi-document summarization
Han et al. Mining Technical Topic Networks from Chinese Patents.
Uday et al. COVID-19 literature mining and retrieval using text mining approaches
Abd Alameer Finding the similarity between two arabic texts
Wang et al. A graph-based approach for semantic similar word retrieval
Khan et al. Semantic-based unsupervised hybrid technique for opinion targets extraction from unstructured reviews
Ye et al. Automatic multi-document summarization based on keyword density and sentence-word Graphs
Ramachandran et al. Document Clustering Using Keyword Extraction
Krieger et al. Creating semantic fingerprints for web documents
Vahidnia et al. Document Clustering and Labeling for Research Trend Extraction and Evolution Mapping.
Abraham et al. Comparison of statistical and semantic similarity techniques for paraphrase identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant