CN109359172A - Entity alignment optimization method based on graph partitioning - Google Patents


Info

Publication number
CN109359172A
Authority
CN
China
Prior art keywords
entity
entities
attribute
name
similarity
Prior art date
Legal status
Granted
Application number
CN201810871604.1A
Other languages
Chinese (zh)
Other versions
CN109359172B (en)
Inventor
陈珂
寿黎但
王凌阳
陈刚
江大伟
伍赛
胡天磊
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201810871604.1A priority Critical patent/CN109359172B/en
Publication of CN109359172A publication Critical patent/CN109359172A/en
Application granted granted Critical
Publication of CN109359172B publication Critical patent/CN109359172B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an entity alignment optimization method based on graph partitioning. Candidate entity pairs are mined from all entities using combined indexes, and an entity similarity measurement method judges whether each candidate entity pair is aligned, yielding equivalent entity pairs; an optimization algorithm based on graph partitioning then uses the similarity relationships between entities to improve the alignment accuracy of the equivalent entity pairs. The method solves the entity alignment problem of large-scale Internet data and can accurately and completely mine the sets of mutually equivalent entities in the original data.

Description

Entity alignment optimization method based on graph partitioning
Technical Field
The invention relates to an entity processing method in the field of databases, in particular to an entity alignment optimization method based on graph partitioning.
The method involves the inverted index and locality-sensitive hashing methods from the database field, the TF-IDF model and the Doc2Vec model from the machine learning field, community partitioning algorithms from the social network field, and entity alignment methods from the semantic web field.
Background
At present, Internet resources containing a large amount of information and knowledge, such as Baidu Baike and Hudong Baike (Interactive Encyclopedia), keep emerging on the Internet. Data barriers naturally exist between different data sources, making it difficult to relate and combine the data. However, if only a single data source is used to describe an object in the real world, problems such as low object coverage and incomplete information description arise. Entity alignment studies how to discover, across different data sources, the records that refer to the same real-world object.
Current research on traditional entity alignment methods faces three problems: (1) even when only two data sources are matched, directly traversing all entity pairs has a computational complexity proportional to the square of the data source size, so the computational cost is too high; (2) most current entity alignment methods align under semantic frameworks such as RDFS or OWL, where entity information is represented as a large number of triples with rich semantic and relationship information, whereas on the Internet entity data is usually represented as a single page or document, so current entity alignment methods lack generality; (3) in the multi-source setting, most existing entity alignment methods reduce the problem to several pairwise alignment problems between two data sources and do not analyze or compute from the global perspective of all data sources.
Disclosure of Invention
In order to solve the problems in the background art, the present invention provides an entity alignment optimization method based on graph partitioning, which aims to overcome the shortcomings in the prior art.
The technical scheme adopted by the invention for solving the technical problems is as follows:
the method uses combined indexes to mine candidate entity pairs from all entities, judges by an entity similarity measurement method whether each candidate entity pair is aligned to obtain equivalent entity pairs, and then applies an optimization algorithm based on graph partitioning that uses the similarity relationships between entities to improve the alignment accuracy of the equivalent entity pairs.
The entity is obtained by analyzing, extracting and converting data from the Internet.
The document-type data are, for example, web pages, encyclopedia entries, or documents of a website.
An entity is, for example, a web page record, an encyclopedia entry record, or a document record.
The method comprises the following steps:
1) analyzing and extracting Internet data to obtain document-type data, and converting each piece of document-type data into an entity, wherein the main information of an entity comprises a name, a unique code (ID), attributes and context information, and a context corpus is formed from the context information of all entities;
The document-type data obtained by parsing and extracting the Internet data are converted into entities with a uniform data structure.
For example, the attributes of the web page are editor information and release time information in the web page, and the context information of the web page is text information.
2) Mining candidate entity pairs by using a combined index mode of an inverted index and a locality sensitive hash index, and then merging and de-duplicating the candidate entity pairs mined by different indexes;
3) traversing the candidate entity pair set obtained in step 2): the entity similarity of each candidate pair is first calculated; if the entity similarity is greater than a similarity threshold, the candidate entity pair is retained, otherwise it is discarded. Equivalent entities are entities that refer to the same real-world object, and the retained candidate entity pairs are taken as equivalent entity pairs, completing the alignment of the candidate entity pairs;
4) with multiple data sources, more than two entities may be mutually equivalent, whereas step 3) only yields entity pairs. Using the transitivity of the equivalence relation, the related equivalent entity pairs are converted into equivalent entity sets by union-find (disjoint-set);
if the same entity exists in two pairs of equivalent entities, then the two pairs of equivalent entities are associated.
5) Because the calculation of entity similarity carries some error, entities that were wrongly merged may exist in an equivalent entity set. The context similarity of every two entities in the set is calculated as the edge weight between the entities, an equivalent entity relationship graph is constructed, and the graph is further partitioned using the edge weights between entities to obtain the final entity sets.
The step 2) generates candidate entity pairs by using four indexes, namely, an inverted index based on a name, an inverted index based on an attribute, an inverted index based on a context, and a locality sensitive hash index based on a name, specifically:
a. The name-based inverted index takes the entity name as the index key and the set of entity IDs as the index value; every two entities in each entity ID set are exhaustively paired and taken as candidate entity pairs;
b. The attribute-based inverted index traverses the attributes p of all entities, where an attribute p consists of an attribute name and an attribute value, and builds a mapping of the following form, using the attribute name and attribute value as a composite key and the set of entity IDs as the index value: (p.name, p.value) → {ID1, ID2, ..., IDn}, so that entities with the same attribute name and the same attribute value are grouped into the same entity attribute set, where p.name denotes the attribute name, p.value denotes the attribute value, and ID1, ID2, ..., IDn denote the unique codes of the entities;
The attribute accumulation weight of each (p.name, p.value) pair for the entities in the entity attribute set is calculated from |SameValueCount|, where |SameValueCount| denotes the number of entities having the same attribute name and the same attribute value;
The index values of the attribute-based inverted index are then traversed: for each group of entity IDs, every two entities are exhaustively paired, the attribute weight of each entity pair is initialized to zero, and the calculated attribute accumulation weight is added to it; after all entity pairs have been traversed, an entity pair whose accumulated attribute weight is greater than a preset attribute weight threshold is taken as a candidate entity pair;
c. The context-based inverted index takes the keyword t as the index key and the set of entity IDs as the index value, and builds an index structure of the following form: t → {ID1, ID2, ..., IDn},
For each entity ID set in this index, every two entities are exhaustively paired, the keyword weight of each entity pair is initialized to zero, and the keyword accumulation weight is added to it, where the keyword accumulation weight is the term frequency-inverse document frequency (TF-IDF) value of the word in the entity context information with respect to the context corpus; after all entity pairs have been traversed, an entity pair whose accumulated keyword weight is greater than a preset keyword weight threshold is taken as a candidate entity pair;
d. As shown in FIG. 2, the process of generating candidate entity pairs from the name-based locality-sensitive hash index is divided into three stages: vectorization, min-hashing and locality-sensitive hashing.
First, the vectorization stage converts each entity name into a 0/1 vector, giving an entity vector for every entity; all entity vectors together form the entity matrix. For example, with an N-gram model, each entity name is first represented as a set of short strings. If S denotes the complete set of strings obtained by decomposing the entity names with the N-gram model, an entity name is mapped to a 0/1 vector of dimension |S|, where each dimension corresponds to one string in S; the bit is set to 1 if the decomposed entity name contains that string and to 0 otherwise.
Then, the min-hashing stage generates random numbers whose count equals the dimension of the entity vectors, uses them to generate different min-hash functions minhash that shuffle the rows of the entity matrix, and takes as the return value of each min-hash function the position of the first 1 in each entity vector of the matrix. Applying min-hashing several times compresses the entity vectors into a hash signature matrix; that is, the high-dimensional sparse 0/1 vectors of the entities are compressed into low-dimensional hash signatures, and the large entity matrix is converted into a small hash signature matrix;
then, the process of the locality sensitive hashing is to carry out locality sensitive hashing processing on the hash signature matrix to obtain candidate entity pairs;
The hash signature matrix is divided horizontally into b bands of r rows each, with each column of the matrix being the signature of one entity. Within each band, the band portion of every entity's signature is hashed and the resulting hash values are mapped into a number of buckets, different bands using different buckets; this is done for every band, so hashing is performed b times. Finally, all buckets are traversed and every two entities within the same bucket are exhaustively paired as candidate entity pairs, forming the candidate entity pair set.
Finally, the candidate entity pairs generated by the four indexes are merged into a union to obtain the final candidate entity pair set.
In the step 3), the entity pair similarity is obtained by respectively calculating the entity name similarity, the attribute similarity and the context similarity and then performing weighted calculation on the three similarities.
The entity name similarity, the attribute similarity and the context similarity are respectively obtained by the following calculation methods:
and (3) calculating the similarity of entity names: calculating the name editing distance similarity as the entity name similarity:
sim_name(ea,eb)=lev_sim(namea,nameb)
wherein namea and nameb respectively denote the names of the two entities, lev_sim(namea, nameb) denotes the edit-distance similarity function, sim_name(ea, eb) denotes the entity name similarity function, and ea and eb respectively denote the two entities;
and (3) calculating the similarity of the entity attributes: firstly, acquiring a common attribute set of two entities, wherein the common attribute refers to an attribute corresponding to an attribute name shared by the two entities, and then calculating the arithmetic mean of the edit distance similarity of all common attribute values as the entity attribute similarity:
sim_attr(ea, eb) = (1/S) * Σi lev_sim(vai, vbi)
wherein vai and vbi respectively denote the values of the i-th common attribute of the two entities, i denotes the ordinal number of the common attribute, S denotes the total number of common attributes of the two entities, and lev_sim() denotes the edit-distance similarity function;
and (3) calculating the context similarity of the entities: training a context corpus through a document vector model (Doc2Vec), converting entity contexts into vectors by using the trained document vector model (Doc2Vec), and calculating cosine values of the two vectors as entity context similarity:
sim_context(ea, eb) = (va · vb) / (|va| · |vb|)
wherein va and vb respectively denote the vectors obtained by converting the contexts of entities ea and eb.
The step 4) specifically constructs a union-find (disjoint-set) data structure: each entity is initialized as an independent set, all equivalent entity pairs are traversed, equivalent entity pairs that share an entity are merged into one set, and after all equivalent entity pairs have been processed, several equivalent entity sets are obtained.
The step 5) specifically converts each equivalent entity set obtained in step 4) into an entity relationship graph, in which the nodes are entities and the edges between nodes are weighted by the context similarity between the entities. Edges whose weight is below the edge weight threshold are removed from the graph, the entities in each equivalent entity set are then partitioned using a social-network modularity partitioning calculation method, and the equivalent entity set is divided into several entity sets.
The method extracts document data from the Internet and converts it into entities with a uniform data structure, mines candidate entity pairs using several combined indexes such as inverted indexes and locality-sensitive hash indexes, and designs an entity similarity measurement method to judge whether an entity pair is aligned. The aligned entity pairs are then converted into equivalent entity sets, and an optimization algorithm based on graph partitioning is proposed: each entity set is converted into an entity relationship graph and further partitioned according to the density of the nodes and the edge weights in the graph, improving the accuracy of entity alignment.
Equivalent entities are entities that refer to the same objective object in the real world.
The invention has the beneficial effects that:
the method combines the inverted index and the locality sensitive hashing technology, reduces the calculation space of entity matching and generates a high-quality candidate set.
The invention integrates the name, attribute and context information of entities, defines an entity similarity measurement method, and effectively judges whether an entity pair is aligned.
The invention further partitions the equivalent entity sets formed by multiple entities, thereby further improving the accuracy of the entity alignment algorithm.
The method can be applied to the entity alignment problem of large-scale Internet data: while keeping entity alignment efficient, the sets of mutually equivalent entities in the original data can be mined more accurately and completely.
Drawings
FIG. 1 is a flow chart of the method steps of the present invention.
FIG. 2 is a flow diagram of the locality sensitive hash index generation candidate entity pair of the present invention.
Detailed Description
The technical solution of the present invention will now be further explained with reference to the specific embodiments and schematic diagrams.
Referring to fig. 1, the embodiment of the present invention and its specific implementation are as follows:
step 1: the data of document types such as web pages from the Internet are analyzed and extracted, and the data are converted into entities with uniform data structures by using the existing tools such as Scapy. A single page or document is mapped into an entity, the main information of the entity comprises a name, a unique code (ID), attributes and context information, and a context corpus is formed by the context information of all the entities.
Preprocessing before entity matching: the attribute information of the entities is traversed to count the weights of different attributes, and the context information of the entities is traversed and segmented into words, collecting statistics such as the word frequency distribution of the whole corpus. The TF-IDF value of each word in a context is calculated as the word's weight, and the words with the highest TF-IDF values are selected and retained as the keywords of that context.
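As one possible, non-limiting illustration of this preprocessing, the following Python sketch computes TF-IDF weights over the context corpus and keeps the highest-scoring words of each context as its keywords; the tokenization, the smoothing of the IDF term and the number of retained keywords are assumptions:

import math
from collections import Counter

def build_idf(contexts):
    # contexts: list of token lists, one per entity context (the context corpus)
    n = len(contexts)
    df = Counter()
    for tokens in contexts:
        df.update(set(tokens))
    # inverse document frequency over the whole context corpus
    return {w: math.log(n / (1 + c)) for w, c in df.items()}

def top_keywords(tokens, idf, k=10):
    # TF-IDF weight of each word in one context; keep the k highest-scoring words
    tf = Counter(tokens)
    total = max(len(tokens), 1)
    scores = {w: (tf[w] / total) * idf.get(w, 0.0) for w in tf}
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

# usage sketch:
# idf = build_idf([e["context_tokens"] for e in entities])
# keywords = {e["id"]: top_keywords(e["context_tokens"], idf) for e in entities}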
Step 2: more equivalent entity pairs are mined by constructing several indexes; specifically, four indexes are built to generate candidate entity pairs: a name-based inverted index, an attribute-based inverted index, a context-based inverted index, and a name-based locality-sensitive hash index. Specifically:
a. The name-based inverted index takes the entity name as the index key and the set of entity IDs as the index value; every two entities in each entity ID set are exhaustively paired and taken as candidate entity pairs;
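A minimal Python sketch of the name-based inverted index is given below; the dictionary layout of an entity follows the illustrative structure above and is an assumption:

from collections import defaultdict
from itertools import combinations

def name_index_candidates(entities):
    index = defaultdict(list)               # entity name -> list of entity IDs
    for e in entities:
        index[e["name"]].append(e["id"])
    candidates = set()
    for ids in index.values():
        # exhaustively pair every two entities that share the same name
        for a, b in combinations(sorted(ids), 2):
            candidates.add((a, b))
    return candidates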
b. The attribute-based inverted index traverses the attributes p of all entities, where an attribute p consists of an attribute name and an attribute value, and builds a mapping of the following form, using the attribute name and attribute value as a composite key and the set of entity IDs as the index value: (p.name, p.value) → {ID1, ID2, ..., IDn}, so that entities with the same attribute name and the same attribute value are grouped into the same entity attribute set, where p.name denotes the attribute name, p.value denotes the attribute value, and ID1, ID2, ..., IDn denote the unique codes of the entities;
The attribute accumulation weight of each (p.name, p.value) pair for the entities in the entity attribute set is calculated from |SameValueCount|, where |SameValueCount| denotes the number of entities having the same attribute name and the same attribute value;
The index values of the attribute-based inverted index are then traversed: for each group of entity IDs, every two entities are exhaustively paired, the attribute weight of each entity pair is initialized to zero, and the calculated attribute accumulation weight is added to it; after all entity pairs have been traversed, an entity pair whose accumulated attribute weight is greater than a preset attribute weight threshold is taken as a candidate entity pair;
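The following Python sketch illustrates candidate generation from the attribute-based inverted index. Because the weight formula is stated only in terms of |SameValueCount|, the sketch substitutes an inverse-frequency weight 1/|SameValueCount| as a stand-in, which is an assumption and may differ from the patented formula; the attribute weight threshold is likewise an assumption:

from collections import defaultdict
from itertools import combinations

def attribute_candidates(entities, weight_threshold=1.0):
    # (attribute name, attribute value) -> list of entity IDs sharing that pair
    index = defaultdict(list)
    for e in entities:
        for name, value in e.get("attributes", {}).items():
            index[(name, value)].append(e["id"])

    pair_weight = defaultdict(float)         # accumulated attribute weight per entity pair
    for ids in index.values():
        same_value_count = len(ids)          # |SameValueCount|
        if same_value_count < 2:
            continue
        w = 1.0 / same_value_count           # stand-in weight: rarer values weigh more
        for a, b in combinations(sorted(ids), 2):
            pair_weight[(a, b)] += w

    return {pair for pair, w in pair_weight.items() if w > weight_threshold}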
c. The context-based inverted index takes the keyword t as the index key and the set of entity IDs as the index value, and builds an index structure of the following form: t → {ID1, ID2, ..., IDn},
For each entity ID set in this index, every two entities are exhaustively paired, the keyword weight of each entity pair is initialized to zero, and the keyword accumulation weight is added to it, where the keyword accumulation weight is the term frequency-inverse document frequency (TF-IDF) value of the word in the entity context information with respect to the context corpus; after all entity pairs have been traversed, an entity pair whose accumulated keyword weight is greater than a preset keyword weight threshold is taken as a candidate entity pair;
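A possible Python sketch of the keyword-based candidate generation is shown below; the per-entity representation of keyword TF-IDF weights, the way the two entities' weights are combined, and the threshold value are assumptions:

from collections import defaultdict
from itertools import combinations

def keyword_candidates(entity_keywords, tfidf, keyword_threshold=0.5):
    # entity_keywords: {entity_id: iterable of keywords}
    # tfidf: {(entity_id, word): TF-IDF weight of that word in that entity's context}
    index = defaultdict(list)                # keyword t -> entity IDs containing it
    for eid, words in entity_keywords.items():
        for w in set(words):
            index[w].append(eid)

    pair_weight = defaultdict(float)         # accumulated keyword weight per entity pair
    for word, ids in index.items():
        for a, b in combinations(sorted(ids), 2):
            pair_weight[(a, b)] += tfidf.get((a, word), 0.0) + tfidf.get((b, word), 0.0)

    return {pair for pair, w in pair_weight.items() if w > keyword_threshold}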
d. As shown in FIG. 2, the process of generating candidate entity pairs from the name-based locality-sensitive hash index is divided into three stages: vectorization, min-hashing and locality-sensitive hashing.
First, the vectorization stage converts each entity name into a 0/1 vector, giving an entity vector for every entity; all entity vectors together form the entity matrix. For example, with an N-gram model, each entity name is first represented as a set of short strings. If S denotes the complete set of strings obtained by decomposing the entity names with the N-gram model, an entity name is mapped to a 0/1 vector of dimension |S|, where each dimension corresponds to one string in S; the bit is set to 1 if the decomposed entity name contains that string and to 0 otherwise.
Then, the min-hashing stage generates random numbers whose count equals the dimension of the entity vectors, uses them to generate different min-hash functions minhash that shuffle the rows of the entity matrix, and takes as the return value of each min-hash function the position of the first 1 in each entity vector of the matrix. Applying min-hashing several times compresses the entity vectors into a hash signature matrix; that is, the high-dimensional sparse 0/1 vectors of the entities are compressed into low-dimensional hash signatures, and the large entity matrix is converted into a small hash signature matrix;
then, the process of the locality sensitive hashing is to carry out locality sensitive hashing processing on the hash signature matrix to obtain candidate entity pairs;
The hash signature matrix is divided horizontally into b bands of r rows each, with each column of the matrix being the signature of one entity. Within each band, the band portion of every entity's signature is hashed and the resulting hash values are mapped into a number of buckets, different bands using different buckets; this is done for every band, so hashing is performed b times. Finally, all buckets are traversed and every two entities within the same bucket are exhaustively paired as candidate entity pairs, forming the candidate entity pair set.
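The following Python sketch illustrates the three stages for the name-based locality-sensitive hash index. Instead of materializing the 0/1 entity matrix, it uses the common hashing shortcut for min-hash; the n-gram size, signature length, band count and random seed are assumptions:

import random
from collections import defaultdict
from itertools import combinations

def char_ngrams(name, n=2):
    # vectorization stage: represent an entity name as a set of character n-grams
    return {name[i:i + n] for i in range(max(len(name) - n + 1, 1))}

def minhash_signature(grams, hash_seeds):
    # min-hashing stage: one minimum hash value per seeded hash function
    return [min(hash((seed, g)) for g in grams) for seed in hash_seeds]

def lsh_candidates(names, num_hashes=20, bands=5, seed=42):
    # names: {entity_id: entity name}; the signature is split into `bands` bands of r rows
    rng = random.Random(seed)
    hash_seeds = [rng.random() for _ in range(num_hashes)]
    rows = num_hashes // bands
    buckets = defaultdict(list)              # (band index, band content) -> entity IDs
    for eid, name in names.items():
        sig = minhash_signature(char_ngrams(name), hash_seeds)
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].append(eid)   # different bands use different buckets
    candidates = set()
    for ids in buckets.values():
        # locality-sensitive hashing stage: pair every two entities in the same bucket
        for a, b in combinations(sorted(ids), 2):
            candidates.add((a, b))
    return candidates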
Finally, the candidate entity pairs generated by the four indexes are merged into a union to obtain the final candidate entity pair set.
Step 3: the final candidate entity pair set obtained in step 2 is traversed, the similarity of each entity pair is calculated, and if the similarity is greater than the similarity threshold the candidate entity pair is taken as an equivalent entity pair;
the entity pair similarity is obtained by respectively calculating the entity name similarity, the attribute similarity and the context similarity and then carrying out weighted calculation on the three similarities.
The entity name similarity, the attribute similarity and the context similarity are respectively obtained by the following calculation methods:
and (3) calculating the similarity of entity names: calculating the name editing distance similarity as the entity name similarity:
sim_name(ea,eb)=lev_sim(namea,nameb)
wherein namea and nameb respectively denote the names of the two entities, lev_sim() denotes the edit-distance similarity function, sim_name() denotes the entity name similarity function, and ea and eb respectively denote the two entities;
and (3) calculating the similarity of the entity attributes: firstly, acquiring a common attribute set of two entities, wherein the common attribute refers to an attribute corresponding to an attribute name shared by the two entities, and then calculating the arithmetic mean of the edit distance similarity of all common attribute values as the entity attribute similarity:
sim_attr(ea, eb) = (1/S) * Σi lev_sim(vai, vbi)
wherein vai and vbi respectively denote the values of the i-th common attribute of the two entities, i denotes the ordinal number of the common attribute, S denotes the total number of common attributes of the two entities, and lev_sim() denotes the edit-distance similarity function;
and (3) calculating the context similarity of the entities: training a context corpus through a document vector model (Doc2Vec), converting entity contexts into vectors by using the trained document vector model (Doc2Vec), and calculating cosine values of the two vectors as entity context similarity:
sim_context(ea, eb) = (va · vb) / (|va| · |vb|)
wherein va and vb respectively denote the vectors obtained by converting the contexts of entities ea and eb.
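A compact Python sketch of the three similarity measures and their weighted combination is given below. The combination weights and the Doc2Vec hyperparameters are assumptions, and gensim's Doc2Vec is used as one possible document vector model:

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def lev_sim(a, b):
    # edit-distance similarity: 1 - normalized Levenshtein distance
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return 1.0 - dp[-1] / max(len(a), len(b), 1)

def attr_sim(ea, eb):
    # arithmetic mean of edit-distance similarity over the common attribute values
    common = set(ea["attributes"]) & set(eb["attributes"])
    if not common:
        return 0.0
    return sum(lev_sim(str(ea["attributes"][k]), str(eb["attributes"][k])) for k in common) / len(common)

def context_sim(model, ea, eb):
    # cosine of the Doc2Vec vectors of the two entity contexts
    va = model.infer_vector(ea["context_tokens"])
    vb = model.infer_vector(eb["context_tokens"])
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

def entity_sim(model, ea, eb, w=(0.4, 0.3, 0.3)):
    # weighted combination of name, attribute and context similarity (weights assumed)
    return (w[0] * lev_sim(ea["name"], eb["name"])
            + w[1] * attr_sim(ea, eb)
            + w[2] * context_sim(model, ea, eb))

# the Doc2Vec model is trained once on the whole context corpus, e.g.:
# docs = [TaggedDocument(e["context_tokens"], [e["id"]]) for e in entities]
# model = Doc2Vec(docs, vector_size=100, min_count=1, epochs=20)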
Step 4: the entity similarity calculation in step 3 yields a number of entity pairs with high similarity as equivalent entity pairs. With multiple data sources, N (N ≥ 3) entities may be mutually equivalent because of the transitivity of the equivalence relation, so the equivalent entity pairs need to be converted into equivalent entity sets. A union-find (disjoint-set) data structure is constructed: each entity is first initialized as an independent set, all equivalent entity pairs are then traversed and pairs that share an entity are merged into one set, and after all equivalent entity pairs have been processed, several equivalent entity sets are obtained.
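The union-find construction described above can be sketched as follows; this is a minimal illustration rather than the patented implementation:

def union_find_sets(entity_ids, equivalent_pairs):
    parent = {e: e for e in entity_ids}      # each entity starts as an independent set

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for a, b in equivalent_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb                  # merge the two sets sharing an entity

    groups = {}
    for e in entity_ids:
        groups.setdefault(find(e), []).append(e)
    return list(groups.values())             # the resulting equivalent entity sets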
Step 5: step 4 yields equivalent entity sets, but because the entity similarity calculation carries some error, entities that were wrongly merged may still exist in a set. This step further divides the entity sets with an optimization algorithm based on graph partitioning, splitting out the sets of truly equivalent entities.
Each equivalent entity set obtained in step 4 is converted into an entity relationship graph, in which the nodes are entities and the edges between nodes are weighted by the context similarity between the entities. Edges whose weight is below the edge weight threshold are removed from the graph, the entities in each equivalent entity set are then calculated, judged and partitioned using a social-network modularity partitioning calculation method, and the equivalent entity set is divided into several entity sets.
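As a non-limiting illustration of this step, the following Python sketch builds the weighted entity relationship graph and partitions it. networkx's greedy modularity communities are used as a stand-in for the social-network modularity partitioning method, and the edge weight threshold is an assumption:

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def split_entity_set(entity_ids, context_sim_fn, edge_threshold=0.5):
    # nodes are entities; edge weights are pairwise context similarities above the threshold
    g = nx.Graph()
    g.add_nodes_from(entity_ids)
    for i, a in enumerate(entity_ids):
        for b in entity_ids[i + 1:]:
            s = context_sim_fn(a, b)
            if s >= edge_threshold:          # remove (never add) low-weight edges
                g.add_edge(a, b, weight=s)
    # modularity-based community detection splits the set into truly equivalent groups
    return [sorted(c) for c in greedy_modularity_communities(g, weight="weight")]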

Claims (8)

1. An entity alignment optimization method based on graph partitioning, which is characterized in that: candidate entity pairs are mined from all entities by using combined indexes, whether a candidate entity pair is aligned is judged by an entity similarity measurement method to obtain equivalent entity pairs, and an optimization algorithm based on graph partitioning is then applied, using the similarity relationships between entities, to improve the alignment accuracy of the equivalent entity pairs.
2. The entity alignment optimization method based on graph partitioning as claimed in claim 1, wherein: the entity is obtained by analyzing, extracting and converting data from the Internet.
3. The entity alignment optimization method based on graph partitioning as claimed in claim 1, wherein: the method comprises the following specific steps:
1) analyzing and extracting Internet data to obtain document-type data, and converting each piece of document-type data into an entity, wherein the main information of an entity comprises a name, a unique code (ID), attributes and context information, and a context corpus is formed from the context information of all entities;
2) mining candidate entity pairs by using a combined index mode of an inverted index and a locality sensitive hash index, and then merging and de-duplicating the candidate entity pairs mined by different indexes;
3) traversing the candidate entity pair set obtained in step 2), and calculating the entity similarity for each pair; if the entity similarity is greater than a similarity threshold, the candidate entity pair is retained, otherwise it is discarded, and the retained candidate entity pairs are taken as equivalent entity pairs;
4) converting the related equivalent entity pairs into equivalent entity sets by union-find (disjoint-set), using the transitivity of the equivalence relation;
5) calculating the context similarity of every two entities in each equivalent entity set as the edge weight between the entities, constructing an equivalent entity relationship graph, and further partitioning it using the edge weights between entities to obtain the final entity sets.
4. The entity alignment optimization method based on graph partitioning as claimed in claim 1, wherein: the step 2) generates candidate entity pairs by using four indexes, namely, an inverted index based on a name, an inverted index based on an attribute, an inverted index based on a context, and a locality sensitive hash index based on a name, specifically:
a. the name-based inverted index takes the entity name as the index key and the set of entity IDs as the index value; every two entities in each entity ID set are exhaustively paired and taken as candidate entity pairs;
b. the attribute-based inverted index traverses the attributes p of all entities, where an attribute p consists of an attribute name and an attribute value, and builds a mapping of the following form, using the attribute name and attribute value as a composite key and the set of entity IDs as the index value: (p.name, p.value) → {ID1, ID2, ..., IDn}, so that entities with the same attribute name and the same attribute value are grouped into the same entity attribute set, where p.name denotes the attribute name, p.value denotes the attribute value, and ID1, ID2, ..., IDn denote the unique codes of the entities;
the attribute accumulation weight of each (p.name, p.value) pair for the entities in the entity attribute set is calculated from |SameValueCount|, where |SameValueCount| denotes the number of entities having the same attribute name and the same attribute value;
for each group of entity IDs, every two entities are exhaustively paired, the attribute weight of each entity pair is initialized to zero, and the calculated attribute accumulation weight is added to it; after all entity pairs have been traversed, an entity pair whose accumulated attribute weight is greater than a preset attribute weight threshold is taken as a candidate entity pair;
c. the context-based inverted index takes the keyword t as the index key and the set of entity IDs as the index value, and builds an index structure of the form t → {ID1, ID2, ..., IDn}; every two entities in each entity ID set are exhaustively paired, the keyword weight of each entity pair is initialized to zero, and the keyword accumulation weight is added to it, where the keyword accumulation weight is the term frequency-inverse document frequency (TF-IDF) value of the word in the entity context information with respect to the context corpus; after all entity pairs have been traversed, an entity pair whose accumulated keyword weight is greater than a preset keyword weight threshold is taken as a candidate entity pair;
d. the process of generating candidate entity pairs from the name-based locality-sensitive hash index is divided into three stages: vectorization, min-hashing and locality-sensitive hashing;
firstly, the vectorization stage converts the name of each entity into a 0/1 vector to obtain an entity vector for each entity, and all entity vectors form an entity matrix;
then, the min-hashing stage generates random numbers whose count equals the dimension of the entity vectors, uses them to generate different min-hash functions minhash that shuffle the rows of the entity matrix, takes the position of the first 1 in each entity vector of the matrix as the return value of the min-hash function minhash, and applies min-hashing several times to compress the entity vectors into a hash signature matrix;
then, the process of the locality sensitive hashing is to carry out locality sensitive hashing processing on the hash signature matrix to obtain candidate entity pairs;
and finally, the candidate entity pairs generated by the four indexes are merged into a union to obtain the final candidate entity pair set.
5. The entity alignment optimization method based on graph partitioning as claimed in claim 1, wherein: in the step 3), the entity pair similarity is obtained by respectively calculating the entity name similarity, the attribute similarity and the context similarity and then performing weighted calculation on the three similarities.
6. The entity alignment optimization method based on graph partitioning as claimed in claim 5, wherein: the entity name similarity, the attribute similarity and the context similarity are respectively obtained by the following calculation methods:
and (3) calculating the similarity of entity names: calculating the name editing distance similarity as the entity name similarity:
sim_name(ea,eb)=lev_sim(namea,nameb)
wherein namea and nameb respectively denote the names of the two entities, lev_sim(namea, nameb) denotes the edit-distance similarity function, sim_name(ea, eb) denotes the entity name similarity function, and ea and eb respectively denote the two entities;
and (3) calculating the similarity of the entity attributes: firstly, acquiring a common attribute set of two entities, wherein the common attribute refers to an attribute corresponding to an attribute name shared by the two entities, and then calculating the arithmetic mean of the edit distance similarity of all common attribute values as the entity attribute similarity:
sim_attr(ea, eb) = (1/S) * Σi lev_sim(vai, vbi)
wherein vai and vbi respectively denote the values of the i-th common attribute of the two entities, i denotes the ordinal number of the common attribute, S denotes the total number of common attributes of the two entities, and lev_sim() denotes the edit-distance similarity function;
and (3) calculating the context similarity of the entities: training a context corpus through a document vector model (Doc2Vec), converting entity contexts into vectors by using the trained document vector model (Doc2Vec), and calculating cosine values of the two vectors as entity context similarity:
sim_context(ea, eb) = (va · vb) / (|va| · |vb|)
wherein va and vb respectively denote the vectors obtained by converting the contexts of entities ea and eb.
7. The entity alignment optimization method based on graph partitioning as claimed in claim 1, wherein: the step 4) specifically constructs a union-find (disjoint-set) data structure, initializes each entity as an independent set, traverses all equivalent entity pairs, merges equivalent entity pairs that share an entity into one set, and obtains several equivalent entity sets after all equivalent entity pairs have been processed.
8. The entity alignment optimization method based on graph partitioning as claimed in claim 1, wherein: the step 5) specifically converts each equivalent entity set obtained in step 4) into an entity relationship graph, in which the nodes are entities and the edges between nodes are weighted by the context similarity between the entities; edges whose weight is below the edge weight threshold are removed from the graph, the entities in each equivalent entity set are processed by a social-network modularity partitioning calculation method, and the equivalent entity set is divided into several entity sets.
CN201810871604.1A 2018-08-02 2018-08-02 Entity alignment optimization method based on graph partitioning Active CN109359172B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810871604.1A CN109359172B (en) 2018-08-02 2018-08-02 Entity alignment optimization method based on graph partitioning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810871604.1A CN109359172B (en) 2018-08-02 2018-08-02 Entity alignment optimization method based on graph partitioning

Publications (2)

Publication Number Publication Date
CN109359172A true CN109359172A (en) 2019-02-19
CN109359172B CN109359172B (en) 2020-12-11

Family

ID=65349767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810871604.1A Active CN109359172B (en) 2018-08-02 2018-08-02 Entity alignment optimization method based on graph partitioning

Country Status (1)

Country Link
CN (1) CN109359172B (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140188839A1 (en) * 2012-12-28 2014-07-03 Microsoft Corporation Using social signals to rank search results
CN104133673A (en) * 2014-07-04 2014-11-05 清华大学 Ontology example matching system and method based on user customization
CN104239553A (en) * 2014-09-24 2014-12-24 江苏名通信息科技有限公司 Entity recognition method based on Map-Reduce framework
CN104866471A (en) * 2015-06-05 2015-08-26 南开大学 Instance matching method based on local sensitive Hash strategy
CN106202041A (en) * 2016-07-01 2016-12-07 北京奇虎科技有限公司 Method and apparatus for solving the entity alignment problem in a knowledge graph
CN107748799A (en) * 2017-11-08 2018-03-02 四川长虹电器股份有限公司 Method for entity alignment of multi-data-source movie data
CN108038183A (en) * 2017-12-08 2018-05-15 北京百度网讯科技有限公司 Architectural entities recording method, device, server and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
庄严 等: "知识库实体对齐技术综述", 《计算机研究与发展》 *
王凌阳 等: "多源异构数据的实体匹配方法研究", 《计算机工程与应用》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059194A (en) * 2019-03-01 2019-07-26 中国科学院信息工程研究所 Large-scale ontology merging method fusing representation learning and a divide-and-conquer strategy
CN110162591A (en) * 2019-05-22 2019-08-23 南京邮电大学 Entity alignment method and system for digital education resources
CN110162591B (en) * 2019-05-22 2022-08-19 南京邮电大学 Entity alignment method and system for digital education resources
CN112182139A (en) * 2019-08-29 2021-01-05 盈盛智创科技(广州)有限公司 Method, device and equipment for tracing resource description framework triple
CN110795453A (en) * 2019-10-22 2020-02-14 中国西安卫星测控中心 Method for automatically constructing RDF (Resource Description Framework) based on relational database
WO2022057303A1 (en) * 2020-09-21 2022-03-24 华为技术有限公司 Image processing method, system and apparatus
CN112966027A (en) * 2021-03-22 2021-06-15 青岛科技大学 Entity association mining method based on dynamic probe
CN112966027B (en) * 2021-03-22 2022-10-21 青岛科技大学 Entity association mining method based on dynamic probe
CN113297213A (en) * 2021-04-29 2021-08-24 军事科学院系统工程研究院网络信息研究所 Dynamic multi-attribute matching method for entity object
CN113297213B (en) * 2021-04-29 2023-09-12 军事科学院系统工程研究院网络信息研究所 Dynamic multi-attribute matching method for entity object
CN114676267A (en) * 2022-04-01 2022-06-28 北京明略软件系统有限公司 Method and device for entity alignment and electronic equipment
CN115906796A (en) * 2022-09-23 2023-04-04 北京市应急管理科学技术研究院 Alignment method and system for potential safety production hazard entities
CN116167530A (en) * 2023-04-25 2023-05-26 安徽思高智能科技有限公司 RPA flow optimization method based on multi-flow node alignment
CN116702899A (en) * 2023-08-07 2023-09-05 上海银行股份有限公司 Entity fusion method suitable for public and private linkage scene
CN116702899B (en) * 2023-08-07 2023-11-28 上海银行股份有限公司 Entity fusion method suitable for public and private linkage scene

Also Published As

Publication number Publication date
CN109359172B (en) 2020-12-11

Similar Documents

Publication Publication Date Title
CN109359172B (en) Entity alignment optimization method based on graph partitioning
Mohammed et al. A state-of-the-art survey on semantic similarity for document clustering using GloVe and density-based algorithms
Karthikeyan et al. A survey on association rule mining
WO2020143184A1 (en) Knowledge fusion method and apparatus, computer device, and storage medium
Popat et al. Hierarchical document clustering based on cosine similarity measure
CN109710701A (en) A kind of automated construction method for public safety field big data knowledge mapping
CN101504654B (en) Method for implementing automatic database schema matching
US10883345B2 (en) Processing of computer log messages for visualization and retrieval
CN107038505B (en) Ore finding model prediction method based on machine learning
CN103729402A (en) Method for establishing mapping knowledge domain based on book catalogue
CN102207946B (en) Knowledge network semi-automatic generation method
CN101449271A (en) Annotation by search
CN105488196A (en) Automatic hot topic mining system based on internet corpora
CN105426529A (en) Image retrieval method and system based on user search intention positioning
CN107291895B (en) Quick hierarchical document query method
CN104112005B (en) Distributed mass fingerprint identification method
CN110969517B (en) Bidding life cycle association method, system, storage medium and computer equipment
CN109325019A (en) Data correlation relation network establishing method
CN110633371A (en) Log classification method and system
CN103886072A (en) Retrieved result clustering system in coal mine search engine
CN117151659B (en) Ecological restoration engineering full life cycle tracing method based on large language model
Shah et al. Analysis of different clustering algorithms for accurate knowledge extraction from popular datasets
Wenli Application research on latent semantic analysis for information retrieval
CN105677684A (en) Method for making semantic annotations on content generated by users based on external data sources
Tian A mathematical indexing method based on the hierarchical features of operators in formulae

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant