CN109359172A - Entity alignment optimization method based on graph partitioning - Google Patents


Info

Publication number
CN109359172A
Authority
CN
China
Prior art keywords
entity
entities
attribute
name
similarity
Prior art date
Legal status
Granted
Application number
CN201810871604.1A
Other languages
Chinese (zh)
Other versions
CN109359172B (en)
Inventor
陈珂
寿黎但
王凌阳
陈刚
江大伟
伍赛
胡天磊
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201810871604.1A priority Critical patent/CN109359172B/en
Publication of CN109359172A publication Critical patent/CN109359172A/en
Application granted granted Critical
Publication of CN109359172B publication Critical patent/CN109359172B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an entity alignment optimization method based on graph partitioning. Candidate entity pairs are mined from all entities using combined indexes, and an entity similarity measurement method judges whether each candidate entity pair is aligned, yielding equivalent entity pairs; an optimization algorithm based on graph partitioning then uses the similarity relationships between entities to improve the alignment accuracy of the equivalent entity pairs. The method solves the entity alignment problem of large-scale Internet data and can accurately and completely mine the sets of mutually equivalent entities in the original data.

Description

Entity alignment optimization method based on graph partitioning
Technical Field
The invention relates to an entity processing method in the field of databases, in particular to an entity alignment optimization method based on graph partitioning.
The method involves the inverted index and locality-sensitive hashing methods from the database field, the TF-IDF model and the Doc2Vec model from the machine learning field, community partitioning algorithms from the social network field, and entity alignment methods from the semantic web field.
Background
At present, Internet resources containing a large amount of information and knowledge, such as Baidu Baike and Hudong Baike (Interactive Encyclopedia), keep emerging on the Internet. Data barriers naturally exist between different data sources, making it difficult to relate and combine the data. However, if only a single data source is used to describe an object in the real world, problems such as low object coverage and incomplete information description arise. Entity alignment studies how to discover, across different data sources, the records that refer to the same real-world object.
Current research on traditional entity alignment methods faces three problems: (1) even when only two data sources are matched, directly traversing all entity pairs has a computational complexity proportional to the square of the data source size, so the computational cost is too high; (2) most current entity alignment methods align under semantic frameworks such as RDFS or OWL, where entity information is represented as a large number of triples with rich semantic and relationship information, whereas on the Internet entity data is usually represented as a single page or document, so current entity alignment methods lack generality; (3) in the multi-source setting, most existing entity alignment methods reduce the problem to several pairwise alignment problems between two data sources and do not analyze or compute from the global perspective of all data sources.
Disclosure of Invention
In order to solve the problems in the background art, the present invention provides an entity alignment optimization method based on graph partitioning, which aims to overcome the shortcomings in the prior art.
The technical scheme adopted by the invention for solving the technical problems is as follows:
the method uses combined indexes to mine candidate entity pairs from all entities, judges by an entity similarity measurement method whether each candidate entity pair is aligned to obtain equivalent entity pairs, and then applies an optimization algorithm based on graph partitioning that uses the similarity relationships between entities to improve the alignment accuracy of the equivalent entity pairs.
The entity is obtained by analyzing, extracting and converting data from the Internet.
The document-type data are, for example, web pages, encyclopedia entries, or documents of a website.
An entity is, for example, a web page record, an encyclopedia entry record, or a document record.
The method comprises the following steps:
1) analyzing and extracting Internet data to obtain document-type data, and converting each piece of document-type data into an entity, wherein the main information of an entity comprises a name, a unique code (ID), attributes and context information, and a context corpus is formed from the context information of all entities;
The document-type data obtained by parsing and extracting the Internet data are converted into entities with a uniform data structure.
For example, the attributes of the web page are editor information and release time information in the web page, and the context information of the web page is text information.
2) Mining candidate entity pairs by using a combined index mode of an inverted index and a locality sensitive hash index, and then merging and de-duplicating the candidate entity pairs mined by different indexes;
3) traversing the candidate entity pair set obtained in step 2): the entity similarity of each candidate pair is first calculated; if the entity similarity is greater than a similarity threshold, the candidate entity pair is retained, otherwise it is discarded. Equivalent entities are entities that refer to the same real-world object, and the retained candidate entity pairs are taken as equivalent entity pairs, completing the alignment of the candidate entity pairs;
4) with multiple data sources, more than two entities may be mutually equivalent, whereas step 3) only yields entity pairs. Using the transitivity of the equivalence relation, the related equivalent entity pairs are converted into equivalent entity sets by union-find (disjoint-set);
if the same entity exists in two pairs of equivalent entities, then the two pairs of equivalent entities are associated.
5) Because the calculation of entity similarity carries some error, entities that were wrongly merged may exist in an equivalent entity set. The context similarity of every two entities in the set is calculated as the edge weight between the entities, an equivalent entity relationship graph is constructed, and the graph is further partitioned using the edge weights between entities to obtain the final entity sets.
The step 2) generates candidate entity pairs by using four indexes, namely, an inverted index based on a name, an inverted index based on an attribute, an inverted index based on a context, and a locality sensitive hash index based on a name, specifically:
a. The name-based inverted index takes the entity name as the index key and the set of entity IDs as the index value; every two entities in each entity ID set are exhaustively paired and taken as candidate entity pairs;
b. The attribute-based inverted index traverses the attributes p of all entities, where an attribute p consists of an attribute name and an attribute value, and builds a mapping of the following form, using the attribute name and attribute value as a composite key and the set of entity IDs as the index value: (p.name, p.value) → {ID1, ID2, ..., IDn}, so that entities with the same attribute name and the same attribute value are grouped into the same entity attribute set, where p.name denotes the attribute name, p.value denotes the attribute value, and ID1, ID2, ..., IDn denote the unique codes of the entities;
The attribute accumulation weight of each (p.name, p.value) pair for the entities in the entity attribute set is calculated from |SameValueCount|, where |SameValueCount| denotes the number of entities having the same attribute name and the same attribute value;
The index values of the attribute-based inverted index are then traversed: for each group of entity IDs, every two entities are exhaustively paired, the attribute weight of each entity pair is initialized to zero, and the calculated attribute accumulation weight is added to it; after all entity pairs have been traversed, an entity pair whose accumulated attribute weight is greater than a preset attribute weight threshold is taken as a candidate entity pair;
c. The context-based inverted index takes the keyword t as the index key and the set of entity IDs as the index value, and builds an index structure of the following form: t → {ID1, ID2, ..., IDn},
For each entity ID set in this index, every two entities are exhaustively paired, the keyword weight of each entity pair is initialized to zero, and the keyword accumulation weight is added to it, where the keyword accumulation weight is the term frequency-inverse document frequency (TF-IDF) value of the word in the entity context information with respect to the context corpus; after all entity pairs have been traversed, an entity pair whose accumulated keyword weight is greater than a preset keyword weight threshold is taken as a candidate entity pair;
d. As shown in FIG. 2, the process of generating candidate entity pairs from the name-based locality-sensitive hash index is divided into three stages: vectorization, min-hashing and locality-sensitive hashing.
First, the vectorization stage converts each entity name into a 0/1 vector, giving an entity vector for every entity; all entity vectors together form the entity matrix. For example, with an N-gram model, each entity name is first represented as a set of short strings. If S denotes the complete set of strings obtained by decomposing the entity names with the N-gram model, an entity name is mapped to a 0/1 vector of dimension |S|, where each dimension corresponds to one string in S; the bit is set to 1 if the decomposed entity name contains that string and to 0 otherwise.
Then, the min-hashing stage generates random numbers whose count equals the dimension of the entity vectors, uses them to generate different min-hash functions minhash that shuffle the rows of the entity matrix, and takes as the return value of each min-hash function the position of the first 1 in each entity vector of the matrix. Applying min-hashing several times compresses the entity vectors into a hash signature matrix; that is, the high-dimensional sparse 0/1 vectors of the entities are compressed into low-dimensional hash signatures, and the large entity matrix is converted into a small hash signature matrix;
then, the process of the locality sensitive hashing is to carry out locality sensitive hashing processing on the hash signature matrix to obtain candidate entity pairs;
The hash signature matrix is divided horizontally into b bands of r rows each, with each column of the matrix being the signature of one entity. Within each band, the band portion of every entity's signature is hashed and the resulting hash values are mapped into a number of buckets, different bands using different buckets; this is done for every band, so hashing is performed b times. Finally, all buckets are traversed and every two entities within the same bucket are exhaustively paired as candidate entity pairs, forming the candidate entity pair set.
Finally, the candidate entity pairs generated by the four indexes are merged into a union to obtain the final candidate entity pair set.
In the step 3), the entity pair similarity is obtained by respectively calculating the entity name similarity, the attribute similarity and the context similarity and then performing weighted calculation on the three similarities.
The entity name similarity, the attribute similarity and the context similarity are respectively obtained by the following calculation methods:
and (3) calculating the similarity of entity names: calculating the name editing distance similarity as the entity name similarity:
sim_name(ea,eb)=lev_sim(namea,nameb)
wherein namea and nameb respectively denote the names of the two entities, lev_sim(namea, nameb) denotes the edit-distance similarity function, sim_name(ea, eb) denotes the entity name similarity function, and ea and eb respectively denote the two entities;
and (3) calculating the similarity of the entity attributes: firstly, acquiring a common attribute set of two entities, wherein the common attribute refers to an attribute corresponding to an attribute name shared by the two entities, and then calculating the arithmetic mean of the edit distance similarity of all common attribute values as the entity attribute similarity:
sim_attr(ea, eb) = (1/S) * Σi lev_sim(vai, vbi)
wherein vai and vbi respectively denote the values of the i-th common attribute of the two entities, i denotes the ordinal number of the common attribute, S denotes the total number of common attributes of the two entities, and lev_sim() denotes the edit-distance similarity function;
and (3) calculating the context similarity of the entities: training a context corpus through a document vector model (Doc2Vec), converting entity contexts into vectors by using the trained document vector model (Doc2Vec), and calculating cosine values of the two vectors as entity context similarity:
sim_context(ea, eb) = (va · vb) / (|va| · |vb|)
wherein va and vb respectively denote the vectors obtained by converting the contexts of entities ea and eb.
The step 4) specifically constructs a union-find (disjoint-set) data structure: each entity is initialized as an independent set, all equivalent entity pairs are traversed, equivalent entity pairs that share an entity are merged into one set, and after all equivalent entity pairs have been processed, several equivalent entity sets are obtained.
The step 5) specifically converts each equivalent entity set obtained in step 4) into an entity relationship graph, in which the nodes are entities and the edges between nodes are weighted by the context similarity between the entities. Edges whose weight is below the edge weight threshold are removed from the graph, the entities in each equivalent entity set are then partitioned using a social-network modularity partitioning calculation method, and the equivalent entity set is divided into several entity sets.
The method extracts document data from the Internet and converts it into entities with a uniform data structure, mines candidate entity pairs using several combined indexes such as inverted indexes and locality-sensitive hash indexes, and designs an entity similarity measurement method to judge whether an entity pair is aligned. The aligned entity pairs are then converted into equivalent entity sets, and an optimization algorithm based on graph partitioning is proposed: each entity set is converted into an entity relationship graph and further partitioned according to the density of the nodes and the edge weights in the graph, improving the accuracy of entity alignment.
Equivalent entities are entities that refer to the same objective object in the real world.
The invention has the beneficial effects that:
the method combines the inverted index and the locality sensitive hashing technology, reduces the calculation space of entity matching and generates a high-quality candidate set.
The invention integrates the name, attribute and context information of entities, defines an entity similarity measurement method, and effectively judges whether an entity pair is aligned.
The invention further partitions the equivalent entity sets formed by multiple entities, thereby further improving the accuracy of the entity alignment algorithm.
The method can be applied to the entity alignment problem of large-scale Internet data: while keeping entity alignment efficient, the sets of mutually equivalent entities in the original data can be mined more accurately and completely.
Drawings
FIG. 1 is a flow chart of the method steps of the present invention.
FIG. 2 is a flow diagram of the locality sensitive hash index generation candidate entity pair of the present invention.
Detailed Description
The technical solution of the present invention will now be further explained with reference to the specific embodiments and schematic diagrams.
Referring to fig. 1, the embodiment of the present invention and its specific implementation are as follows:
step 1: the data of document types such as web pages from the Internet are analyzed and extracted, and the data are converted into entities with uniform data structures by using the existing tools such as Scapy. A single page or document is mapped into an entity, the main information of the entity comprises a name, a unique code (ID), attributes and context information, and a context corpus is formed by the context information of all the entities.
Preprocessing before entity matching: the attribute information of the entities is traversed to count the weights of different attributes, and the context information of the entities is traversed and segmented into words, collecting statistics such as the word frequency distribution of the whole corpus. The TF-IDF value of each word in a context is calculated as the word's weight, and the words with the highest TF-IDF values are selected and retained as the keywords of that context.
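As one possible, non-limiting illustration of this preprocessing, the following Python sketch computes TF-IDF weights over the context corpus and keeps the highest-scoring words of each context as its keywords; the tokenization, the smoothing of the IDF term and the number of retained keywords are assumptions:

import math
from collections import Counter

def build_idf(contexts):
    # contexts: list of token lists, one per entity context (the context corpus)
    n = len(contexts)
    df = Counter()
    for tokens in contexts:
        df.update(set(tokens))
    # inverse document frequency over the whole context corpus
    return {w: math.log(n / (1 + c)) for w, c in df.items()}

def top_keywords(tokens, idf, k=10):
    # TF-IDF weight of each word in one context; keep the k highest-scoring words
    tf = Counter(tokens)
    total = max(len(tokens), 1)
    scores = {w: (tf[w] / total) * idf.get(w, 0.0) for w in tf}
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

# usage sketch:
# idf = build_idf([e["context_tokens"] for e in entities])
# keywords = {e["id"]: top_keywords(e["context_tokens"], idf) for e in entities}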
Step 2: more equivalent entity pairs are mined by constructing several indexes; specifically, four indexes are built to generate candidate entity pairs: a name-based inverted index, an attribute-based inverted index, a context-based inverted index, and a name-based locality-sensitive hash index. Specifically:
a. The name-based inverted index takes the entity name as the index key and the set of entity IDs as the index value; every two entities in each entity ID set are exhaustively paired and taken as candidate entity pairs;
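A minimal Python sketch of the name-based inverted index is given below; the dictionary layout of an entity follows the illustrative structure above and is an assumption:

from collections import defaultdict
from itertools import combinations

def name_index_candidates(entities):
    index = defaultdict(list)               # entity name -> list of entity IDs
    for e in entities:
        index[e["name"]].append(e["id"])
    candidates = set()
    for ids in index.values():
        # exhaustively pair every two entities that share the same name
        for a, b in combinations(sorted(ids), 2):
            candidates.add((a, b))
    return candidates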
b. The attribute-based inverted index traverses the attributes p of all entities, where an attribute p consists of an attribute name and an attribute value, and builds a mapping of the following form, using the attribute name and attribute value as a composite key and the set of entity IDs as the index value: (p.name, p.value) → {ID1, ID2, ..., IDn}, so that entities with the same attribute name and the same attribute value are grouped into the same entity attribute set, where p.name denotes the attribute name, p.value denotes the attribute value, and ID1, ID2, ..., IDn denote the unique codes of the entities;
The attribute accumulation weight of each (p.name, p.value) pair for the entities in the entity attribute set is calculated from |SameValueCount|, where |SameValueCount| denotes the number of entities having the same attribute name and the same attribute value;
The index values of the attribute-based inverted index are then traversed: for each group of entity IDs, every two entities are exhaustively paired, the attribute weight of each entity pair is initialized to zero, and the calculated attribute accumulation weight is added to it; after all entity pairs have been traversed, an entity pair whose accumulated attribute weight is greater than a preset attribute weight threshold is taken as a candidate entity pair;
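The following Python sketch illustrates candidate generation from the attribute-based inverted index. Because the weight formula is stated only in terms of |SameValueCount|, the sketch substitutes an inverse-frequency weight 1/|SameValueCount| as a stand-in, which is an assumption and may differ from the patented formula; the attribute weight threshold is likewise an assumption:

from collections import defaultdict
from itertools import combinations

def attribute_candidates(entities, weight_threshold=1.0):
    # (attribute name, attribute value) -> list of entity IDs sharing that pair
    index = defaultdict(list)
    for e in entities:
        for name, value in e.get("attributes", {}).items():
            index[(name, value)].append(e["id"])

    pair_weight = defaultdict(float)         # accumulated attribute weight per entity pair
    for ids in index.values():
        same_value_count = len(ids)          # |SameValueCount|
        if same_value_count < 2:
            continue
        w = 1.0 / same_value_count           # stand-in weight: rarer values weigh more
        for a, b in combinations(sorted(ids), 2):
            pair_weight[(a, b)] += w

    return {pair for pair, w in pair_weight.items() if w > weight_threshold}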
c. The context-based inverted index takes the keyword t as the index key and the set of entity IDs as the index value, and builds an index structure of the following form: t → {ID1, ID2, ..., IDn},
For each entity ID set in this index, every two entities are exhaustively paired, the keyword weight of each entity pair is initialized to zero, and the keyword accumulation weight is added to it, where the keyword accumulation weight is the term frequency-inverse document frequency (TF-IDF) value of the word in the entity context information with respect to the context corpus; after all entity pairs have been traversed, an entity pair whose accumulated keyword weight is greater than a preset keyword weight threshold is taken as a candidate entity pair;
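A possible Python sketch of the keyword-based candidate generation is shown below; the per-entity representation of keyword TF-IDF weights, the way the two entities' weights are combined, and the threshold value are assumptions:

from collections import defaultdict
from itertools import combinations

def keyword_candidates(entity_keywords, tfidf, keyword_threshold=0.5):
    # entity_keywords: {entity_id: iterable of keywords}
    # tfidf: {(entity_id, word): TF-IDF weight of that word in that entity's context}
    index = defaultdict(list)                # keyword t -> entity IDs containing it
    for eid, words in entity_keywords.items():
        for w in set(words):
            index[w].append(eid)

    pair_weight = defaultdict(float)         # accumulated keyword weight per entity pair
    for word, ids in index.items():
        for a, b in combinations(sorted(ids), 2):
            pair_weight[(a, b)] += tfidf.get((a, word), 0.0) + tfidf.get((b, word), 0.0)

    return {pair for pair, w in pair_weight.items() if w > keyword_threshold}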
d. As shown in FIG. 2, the process of generating candidate entity pairs from the name-based locality-sensitive hash index is divided into three stages: vectorization, min-hashing and locality-sensitive hashing.
First, the vectorization stage converts each entity name into a 0/1 vector, giving an entity vector for every entity; all entity vectors together form the entity matrix. For example, with an N-gram model, each entity name is first represented as a set of short strings. If S denotes the complete set of strings obtained by decomposing the entity names with the N-gram model, an entity name is mapped to a 0/1 vector of dimension |S|, where each dimension corresponds to one string in S; the bit is set to 1 if the decomposed entity name contains that string and to 0 otherwise.
Then, the min-hashing stage generates random numbers whose count equals the dimension of the entity vectors, uses them to generate different min-hash functions minhash that shuffle the rows of the entity matrix, and takes as the return value of each min-hash function the position of the first 1 in each entity vector of the matrix. Applying min-hashing several times compresses the entity vectors into a hash signature matrix; that is, the high-dimensional sparse 0/1 vectors of the entities are compressed into low-dimensional hash signatures, and the large entity matrix is converted into a small hash signature matrix;
then, the process of the locality sensitive hashing is to carry out locality sensitive hashing processing on the hash signature matrix to obtain candidate entity pairs;
The hash signature matrix is divided horizontally into b bands of r rows each, with each column of the matrix being the signature of one entity. Within each band, the band portion of every entity's signature is hashed and the resulting hash values are mapped into a number of buckets, different bands using different buckets; this is done for every band, so hashing is performed b times. Finally, all buckets are traversed and every two entities within the same bucket are exhaustively paired as candidate entity pairs, forming the candidate entity pair set.
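The following Python sketch illustrates the three stages for the name-based locality-sensitive hash index. Instead of materializing the 0/1 entity matrix, it uses the common hashing shortcut for min-hash; the n-gram size, signature length, band count and random seed are assumptions:

import random
from collections import defaultdict
from itertools import combinations

def char_ngrams(name, n=2):
    # vectorization stage: represent an entity name as a set of character n-grams
    return {name[i:i + n] for i in range(max(len(name) - n + 1, 1))}

def minhash_signature(grams, hash_seeds):
    # min-hashing stage: one minimum hash value per seeded hash function
    return [min(hash((seed, g)) for g in grams) for seed in hash_seeds]

def lsh_candidates(names, num_hashes=20, bands=5, seed=42):
    # names: {entity_id: entity name}; the signature is split into `bands` bands of r rows
    rng = random.Random(seed)
    hash_seeds = [rng.random() for _ in range(num_hashes)]
    rows = num_hashes // bands
    buckets = defaultdict(list)              # (band index, band content) -> entity IDs
    for eid, name in names.items():
        sig = minhash_signature(char_ngrams(name), hash_seeds)
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].append(eid)   # different bands use different buckets
    candidates = set()
    for ids in buckets.values():
        # locality-sensitive hashing stage: pair every two entities in the same bucket
        for a, b in combinations(sorted(ids), 2):
            candidates.add((a, b))
    return candidates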
Finally, the candidate entity pairs generated by the four indexes are merged into a union to obtain the final candidate entity pair set.
Step 3: the final candidate entity pair set obtained in step 2 is traversed, the similarity of each entity pair is calculated, and if the similarity is greater than the similarity threshold the candidate entity pair is taken as an equivalent entity pair;
the entity pair similarity is obtained by respectively calculating the entity name similarity, the attribute similarity and the context similarity and then carrying out weighted calculation on the three similarities.
The entity name similarity, the attribute similarity and the context similarity are respectively obtained by the following calculation methods:
and (3) calculating the similarity of entity names: calculating the name editing distance similarity as the entity name similarity:
sim_name(ea,eb)=lev_sim(namea,nameb)
wherein namea and nameb respectively denote the names of the two entities, lev_sim() denotes the edit-distance similarity function, sim_name() denotes the entity name similarity function, and ea and eb respectively denote the two entities;
and (3) calculating the similarity of the entity attributes: firstly, acquiring a common attribute set of two entities, wherein the common attribute refers to an attribute corresponding to an attribute name shared by the two entities, and then calculating the arithmetic mean of the edit distance similarity of all common attribute values as the entity attribute similarity:
sim_attr(ea, eb) = (1/S) * Σi lev_sim(vai, vbi)
wherein vai and vbi respectively denote the values of the i-th common attribute of the two entities, i denotes the ordinal number of the common attribute, S denotes the total number of common attributes of the two entities, and lev_sim() denotes the edit-distance similarity function;
and (3) calculating the context similarity of the entities: training a context corpus through a document vector model (Doc2Vec), converting entity contexts into vectors by using the trained document vector model (Doc2Vec), and calculating cosine values of the two vectors as entity context similarity:
sim_context(ea, eb) = (va · vb) / (|va| · |vb|)
wherein va and vb respectively denote the vectors obtained by converting the contexts of entities ea and eb.
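A compact Python sketch of the three similarity measures and their weighted combination is given below. The combination weights and the Doc2Vec hyperparameters are assumptions, and gensim's Doc2Vec is used as one possible document vector model:

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def lev_sim(a, b):
    # edit-distance similarity: 1 - normalized Levenshtein distance
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return 1.0 - dp[-1] / max(len(a), len(b), 1)

def attr_sim(ea, eb):
    # arithmetic mean of edit-distance similarity over the common attribute values
    common = set(ea["attributes"]) & set(eb["attributes"])
    if not common:
        return 0.0
    return sum(lev_sim(str(ea["attributes"][k]), str(eb["attributes"][k])) for k in common) / len(common)

def context_sim(model, ea, eb):
    # cosine of the Doc2Vec vectors of the two entity contexts
    va = model.infer_vector(ea["context_tokens"])
    vb = model.infer_vector(eb["context_tokens"])
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

def entity_sim(model, ea, eb, w=(0.4, 0.3, 0.3)):
    # weighted combination of name, attribute and context similarity (weights assumed)
    return (w[0] * lev_sim(ea["name"], eb["name"])
            + w[1] * attr_sim(ea, eb)
            + w[2] * context_sim(model, ea, eb))

# the Doc2Vec model is trained once on the whole context corpus, e.g.:
# docs = [TaggedDocument(e["context_tokens"], [e["id"]]) for e in entities]
# model = Doc2Vec(docs, vector_size=100, min_count=1, epochs=20)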
Step 4: the entity similarity calculation in step 3 yields a number of entity pairs with high similarity as equivalent entity pairs. With multiple data sources, N (N ≥ 3) entities may be mutually equivalent because of the transitivity of the equivalence relation, so the equivalent entity pairs need to be converted into equivalent entity sets. A union-find (disjoint-set) data structure is constructed: each entity is first initialized as an independent set, all equivalent entity pairs are then traversed and pairs that share an entity are merged into one set, and after all equivalent entity pairs have been processed, several equivalent entity sets are obtained.
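The union-find construction described above can be sketched as follows; this is a minimal illustration rather than the patented implementation:

def union_find_sets(entity_ids, equivalent_pairs):
    parent = {e: e for e in entity_ids}      # each entity starts as an independent set

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for a, b in equivalent_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb                  # merge the two sets sharing an entity

    groups = {}
    for e in entity_ids:
        groups.setdefault(find(e), []).append(e)
    return list(groups.values())             # the resulting equivalent entity sets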
Step 5: step 4 yields equivalent entity sets, but because the entity similarity calculation carries some error, entities that were wrongly merged may still exist in a set. This step further divides the entity sets with an optimization algorithm based on graph partitioning, splitting out the sets of truly equivalent entities.
Each equivalent entity set obtained in step 4 is converted into an entity relationship graph, in which the nodes are entities and the edges between nodes are weighted by the context similarity between the entities. Edges whose weight is below the edge weight threshold are removed from the graph, the entities in each equivalent entity set are then calculated, judged and partitioned using a social-network modularity partitioning calculation method, and the equivalent entity set is divided into several entity sets.
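As a non-limiting illustration of this step, the following Python sketch builds the weighted entity relationship graph and partitions it. networkx's greedy modularity communities are used as a stand-in for the social-network modularity partitioning method, and the edge weight threshold is an assumption:

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def split_entity_set(entity_ids, context_sim_fn, edge_threshold=0.5):
    # nodes are entities; edge weights are pairwise context similarities above the threshold
    g = nx.Graph()
    g.add_nodes_from(entity_ids)
    for i, a in enumerate(entity_ids):
        for b in entity_ids[i + 1:]:
            s = context_sim_fn(a, b)
            if s >= edge_threshold:          # remove (never add) low-weight edges
                g.add_edge(a, b, weight=s)
    # modularity-based community detection splits the set into truly equivalent groups
    return [sorted(c) for c in greedy_modularity_communities(g, weight="weight")]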

Claims (8)

1. An entity alignment optimization method based on graph partitioning, which is characterized in that: candidate entity pairs are mined from all entities by using combined indexes, whether a candidate entity pair is aligned is judged by an entity similarity measurement method to obtain equivalent entity pairs, and an optimization algorithm based on graph partitioning is then applied, using the similarity relationships between entities, to improve the alignment accuracy of the equivalent entity pairs.
2. The entity alignment optimization method based on graph partitioning as claimed in claim 1, wherein: the entity is obtained by analyzing, extracting and converting data from the Internet.
3. The entity alignment optimization method based on graph partitioning as claimed in claim 1, wherein: the method comprises the following specific steps:
1) analyzing and extracting Internet data to obtain document-type data, and converting each piece of document-type data into an entity, wherein the main information of an entity comprises a name, a unique code (ID), attributes and context information, and a context corpus is formed from the context information of all entities;
2) mining candidate entity pairs by using a combined index mode of an inverted index and a locality sensitive hash index, and then merging and de-duplicating the candidate entity pairs mined by different indexes;
3) traversing the candidate entity pair set obtained in step 2), and calculating the entity similarity for each pair; if the entity similarity is greater than a similarity threshold, the candidate entity pair is retained, otherwise it is discarded, and the retained candidate entity pairs are taken as equivalent entity pairs;
4) converting the related equivalent entity pairs into equivalent entity sets by union-find (disjoint-set), using the transitivity of the equivalence relation;
5) calculating the context similarity of every two entities in each equivalent entity set as the edge weight between the entities, constructing an equivalent entity relationship graph, and further partitioning it using the edge weights between entities to obtain the final entity sets.
4. The entity alignment optimization method based on graph partitioning as claimed in claim 1, wherein: the step 2) generates candidate entity pairs by using four indexes, namely, an inverted index based on a name, an inverted index based on an attribute, an inverted index based on a context, and a locality sensitive hash index based on a name, specifically:
a. the name-based inverted index takes the entity name as the index key and the set of entity IDs as the index value; every two entities in each entity ID set are exhaustively paired and taken as candidate entity pairs;
b. the attribute-based inverted index traverses the attributes p of all entities, where an attribute p consists of an attribute name and an attribute value, and builds a mapping of the following form, using the attribute name and attribute value as a composite key and the set of entity IDs as the index value: (p.name, p.value) → {ID1, ID2, ..., IDn}, so that entities with the same attribute name and the same attribute value are grouped into the same entity attribute set, where p.name denotes the attribute name, p.value denotes the attribute value, and ID1, ID2, ..., IDn denote the unique codes of the entities;
the attribute accumulation weight of each (p.name, p.value) pair for the entities in the entity attribute set is calculated from |SameValueCount|, where |SameValueCount| denotes the number of entities having the same attribute name and the same attribute value;
for each group of entity IDs, every two entities are exhaustively paired, the attribute weight of each entity pair is initialized to zero, and the calculated attribute accumulation weight is added to it; after all entity pairs have been traversed, an entity pair whose accumulated attribute weight is greater than a preset attribute weight threshold is taken as a candidate entity pair;
c. the context-based inverted index takes the keyword t as the index key and the set of entity IDs as the index value, and builds an index structure of the form t → {ID1, ID2, ..., IDn}; every two entities in each entity ID set are exhaustively paired, the keyword weight of each entity pair is initialized to zero, and the keyword accumulation weight is added to it, where the keyword accumulation weight is the term frequency-inverse document frequency (TF-IDF) value of the word in the entity context information with respect to the context corpus; after all entity pairs have been traversed, an entity pair whose accumulated keyword weight is greater than a preset keyword weight threshold is taken as a candidate entity pair;
d. the process of generating candidate entity pairs from the name-based locality-sensitive hash index is divided into three stages: vectorization, min-hashing and locality-sensitive hashing;
firstly, the vectorization stage converts the name of each entity into a 0/1 vector to obtain an entity vector for each entity, and all entity vectors form an entity matrix;
then, the min-hashing stage generates random numbers whose count equals the dimension of the entity vectors, uses them to generate different min-hash functions minhash that shuffle the rows of the entity matrix, takes the position of the first 1 in each entity vector of the matrix as the return value of the min-hash function minhash, and applies min-hashing several times to compress the entity vectors into a hash signature matrix;
then, the process of the locality sensitive hashing is to carry out locality sensitive hashing processing on the hash signature matrix to obtain candidate entity pairs;
and finally, the candidate entity pairs generated by the four indexes are merged into a union to obtain the final candidate entity pair set.
5. The entity alignment optimization method based on graph partitioning as claimed in claim 1, wherein: in the step 3), the entity pair similarity is obtained by respectively calculating the entity name similarity, the attribute similarity and the context similarity and then performing weighted calculation on the three similarities.
6. The entity alignment optimization method based on graph partitioning as claimed in claim 5, wherein: the entity name similarity, the attribute similarity and the context similarity are respectively obtained by the following calculation methods:
and (3) calculating the similarity of entity names: calculating the name editing distance similarity as the entity name similarity:
sim_name(ea,eb)=lev_sim(namea,nameb)
wherein namea and nameb respectively denote the names of the two entities, lev_sim(namea, nameb) denotes the edit-distance similarity function, sim_name(ea, eb) denotes the entity name similarity function, and ea and eb respectively denote the two entities;
and (3) calculating the similarity of the entity attributes: firstly, acquiring a common attribute set of two entities, wherein the common attribute refers to an attribute corresponding to an attribute name shared by the two entities, and then calculating the arithmetic mean of the edit distance similarity of all common attribute values as the entity attribute similarity:
sim_attr(ea, eb) = (1/S) * Σi lev_sim(vai, vbi)
wherein vai and vbi respectively denote the values of the i-th common attribute of the two entities, i denotes the ordinal number of the common attribute, S denotes the total number of common attributes of the two entities, and lev_sim() denotes the edit-distance similarity function;
and (3) calculating the context similarity of the entities: training a context corpus through a document vector model (Doc2Vec), converting entity contexts into vectors by using the trained document vector model (Doc2Vec), and calculating cosine values of the two vectors as entity context similarity:
sim_context(ea, eb) = (va · vb) / (|va| · |vb|)
wherein va and vb respectively denote the vectors obtained by converting the contexts of entities ea and eb.
7. The entity alignment optimization method based on graph partitioning as claimed in claim 1, wherein: the step 4) specifically constructs a union-find (disjoint-set) data structure, initializes each entity as an independent set, traverses all equivalent entity pairs, merges equivalent entity pairs that share an entity into one set, and obtains several equivalent entity sets after all equivalent entity pairs have been processed.
8. The entity alignment optimization method based on graph partitioning as claimed in claim 1, wherein: the step 5) specifically converts each equivalent entity set obtained in step 4) into an entity relationship graph, in which the nodes are entities and the edges between nodes are weighted by the context similarity between the entities; edges whose weight is below the edge weight threshold are removed from the graph, the entities in each equivalent entity set are processed by a social-network modularity partitioning calculation method, and the equivalent entity set is divided into several entity sets.
CN201810871604.1A 2018-08-02 2018-08-02 Entity alignment optimization method based on graph partitioning Active CN109359172B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810871604.1A CN109359172B (en) 2018-08-02 2018-08-02 Entity alignment optimization method based on graph partitioning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810871604.1A CN109359172B (en) 2018-08-02 2018-08-02 Entity alignment optimization method based on graph partitioning

Publications (2)

Publication Number Publication Date
CN109359172A true CN109359172A (en) 2019-02-19
CN109359172B CN109359172B (en) 2020-12-11

Family

ID=65349767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810871604.1A Active CN109359172B (en) 2018-08-02 2018-08-02 Entity alignment optimization method based on graph partitioning

Country Status (1)

Country Link
CN (1) CN109359172B (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140188839A1 (en) * 2012-12-28 2014-07-03 Microsoft Corporation Using social signals to rank search results
CN104133673A (en) * 2014-07-04 2014-11-05 清华大学 Ontology example matching system and method based on user customization
CN104239553A (en) * 2014-09-24 2014-12-24 江苏名通信息科技有限公司 Entity recognition method based on Map-Reduce framework
CN104866471A (en) * 2015-06-05 2015-08-26 南开大学 Instance matching method based on local sensitive Hash strategy
CN106202041A (en) * 2016-07-01 2016-12-07 北京奇虎科技有限公司 Method and apparatus for solving the entity alignment problem in a knowledge graph
CN107748799A (en) * 2017-11-08 2018-03-02 四川长虹电器股份有限公司 Method for entity alignment of multi-data-source movie data
CN108038183A (en) * 2017-12-08 2018-05-15 北京百度网讯科技有限公司 Architectural entities recording method, device, server and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
庄严 等: "知识库实体对齐技术综述", 《计算机研究与发展》 *
王凌阳 等: "多源异构数据的实体匹配方法研究", 《计算机工程与应用》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059194A (en) * 2019-03-01 2019-07-26 中国科学院信息工程研究所 Large-scale ontology merging method fusing representation learning and a divide-and-conquer strategy
CN110162591A (en) * 2019-05-22 2019-08-23 南京邮电大学 Entity alignment method and system for digital education resources
CN110162591B (en) * 2019-05-22 2022-08-19 南京邮电大学 Entity alignment method and system for digital education resources
CN112182139A (en) * 2019-08-29 2021-01-05 盈盛智创科技(广州)有限公司 Method, device and equipment for tracing resource description framework triple
CN110795453A (en) * 2019-10-22 2020-02-14 中国西安卫星测控中心 Method for automatically constructing RDF (Resource Description Framework) based on relational database
WO2022057303A1 (en) * 2020-09-21 2022-03-24 华为技术有限公司 Image processing method, system and apparatus
CN112966027A (en) * 2021-03-22 2021-06-15 青岛科技大学 Entity association mining method based on dynamic probe
CN112966027B (en) * 2021-03-22 2022-10-21 青岛科技大学 Entity association mining method based on dynamic probe
CN113297213A (en) * 2021-04-29 2021-08-24 军事科学院系统工程研究院网络信息研究所 Dynamic multi-attribute matching method for entity object
CN113297213B (en) * 2021-04-29 2023-09-12 军事科学院系统工程研究院网络信息研究所 Dynamic multi-attribute matching method for entity object
CN114676267A (en) * 2022-04-01 2022-06-28 北京明略软件系统有限公司 Method and device for entity alignment and electronic equipment
CN115906796A (en) * 2022-09-23 2023-04-04 北京市应急管理科学技术研究院 Alignment method and system for potential safety production hazard entities
CN116167530A (en) * 2023-04-25 2023-05-26 安徽思高智能科技有限公司 RPA flow optimization method based on multi-flow node alignment
CN116702899A (en) * 2023-08-07 2023-09-05 上海银行股份有限公司 Entity fusion method suitable for public and private linkage scene
CN116702899B (en) * 2023-08-07 2023-11-28 上海银行股份有限公司 Entity fusion method suitable for public and private linkage scene

Also Published As

Publication number Publication date
CN109359172B (en) 2020-12-11

Similar Documents

Publication Publication Date Title
CN109359172B (en) Entity alignment optimization method based on graph partitioning
Mohammed et al. A state-of-the-art survey on semantic similarity for document clustering using GloVe and density-based algorithms
Karthikeyan et al. A survey on association rule mining
WO2020143184A1 (en) Knowledge fusion method and apparatus, computer device, and storage medium
Popat et al. Hierarchical document clustering based on cosine similarity measure
CN109710701A (en) A kind of automated construction method for public safety field big data knowledge mapping
CN101504654B (en) Method for implementing automatic database schema matching
US10883345B2 (en) Processing of computer log messages for visualization and retrieval
CN107038505B (en) Ore finding model prediction method based on machine learning
CN103729402A (en) Method for establishing mapping knowledge domain based on book catalogue
CN102207946B (en) Knowledge network semi-automatic generation method
CN101449271A (en) Annotation by search
CN105488196A (en) Automatic hot topic mining system based on internet corpora
CN105426529A (en) Image retrieval method and system based on user search intention positioning
CN107291895B (en) Quick hierarchical document query method
CN104112005B (en) Distributed mass fingerprint identification method
CN110969517B (en) Bidding life cycle association method, system, storage medium and computer equipment
CN109325019A (en) Data correlation relation network establishing method
CN110633371A (en) Log classification method and system
CN103886072A (en) Retrieved result clustering system in coal mine search engine
CN117151659B (en) Ecological restoration engineering full life cycle tracing method based on large language model
Shah et al. Analysis of different clustering algorithms for accurate knowledge extraction from popular datasets
Wenli Application research on latent semantic analysis for information retrieval
CN105677684A (en) Method for making semantic annotations on content generated by users based on external data sources
Tian A mathematical indexing method based on the hierarchical features of operators in formulae

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant