CN114048240A - Data integration method and system based on approximate graph matching algorithm

Info

Publication number: CN114048240A
Application number: CN202111371583.5A
Authority: CN
Inventors: 陈占芳; 刘庆宗; 姜晓明; 梁玉柱; 吴森森; 高鹏辉
Original assignee: Chongqing Research Institute Of Changchun University Of Technology; Changchun University of Science and Technology
Current assignee: Chongqing Research Institute Of Changchun University Of Technology; Changchun University of Science and Technology
Priority date: 2021-11-18
Filing date: 2021-11-18
Publication date: 2022-02-15
Anticipated expiration: 2041-11-18
Also published as: CN114048240B

Abstract

The invention relates to a data integration method and a data integration system based on an approximate graph matching algorithm. The method comprises: performing graph mapping on different data modes respectively and determining the corresponding graph structures; performing graph matching on the mapped graphs by using an approximate graph matching algorithm and determining a cost matrix between the graph structures; determining an optimal matching sequence of the nodes in the graph structures according to the cost matrix; determining a graph edit distance according to the optimal matching sequence; and integrating the nodes at corresponding positions among the graph structures according to the optimal matching sequence and the graph edit distance. The invention can improve the degree of integration of heterogeneous data.

Description

Data integration method and system based on approximate graph matching algorithm

Technical Field

The invention relates to the field of data integration, in particular to a data integration method and system based on an approximate graph matching algorithm.

Background

In the current era of data explosion, big data management is a major challenge, and data integration is a central problem within it. Data integration addresses mismatches among data sources: it brings heterogeneous, distributed and autonomous data together, provides the user with a single view, and allows the data sources to be accessed transparently. Matching and integrating the schemas of heterogeneous data is therefore particularly important. Early research in the field of data integration was primarily concerned with identifying, for given data sources and data sets, associations between data tables, data columns and tuples that describe the same attributes or the same entities. To mine these associations, relatively simple approaches were mainly used, such as direct matching based on character strings or matching completed through manual identification. For data whose field attributes are similar but whose field names differ, such as Name and employee Name, the degree of integration is low.

Thus, a new data integration method or system is needed to improve the degree of integration of heterogeneous data.

Disclosure of Invention

The invention aims to provide a data integration method and a data integration system based on an approximate graph matching algorithm, which can improve the integration degree of heterogeneous data.

In order to achieve the purpose, the invention provides the following scheme:

a data integration method based on an approximate graph matching algorithm comprises the following steps:

respectively mapping different data modes to determine corresponding graph structures;

carrying out graph matching on the mapped graph by using an approximate graph matching algorithm, and determining a cost matrix between graph structures;

determining an optimal matching sequence of nodes in the graph structure according to the cost matrix between the graph structures; determining a graph editing distance according to the optimal matching sequence;

and integrating nodes at corresponding positions among the graph structures according to the optimal matching sequence and the graph editing distance.

Optionally, respectively performing graph mapping on the different data modes to determine the corresponding graph structures specifically includes:

and respectively carrying out Graph mapping on different data modes by adopting an SQL2Graph tool to determine corresponding Graph structures.

Optionally, the performing graph matching on the mapped graph by using an approximate graph matching algorithm to determine a cost matrix between graph structures specifically includes:

determining a binary cost matrix according to the graph structures mapped by different data modes;

and simplifying the binary cost matrix to construct an SFBP cost matrix.

Optionally, determining the optimal matching sequence of the nodes in the graph structures according to the cost matrix between the graph structures, and determining the graph edit distance according to the optimal matching sequence, specifically includes:

applying the replacement rule of the introduced element A_ij in the cost matrix; the replacement rule of the introduced element A_ij is that, for the same two nodes, the cost of inserting first and then deleting is consistent with the cost of deleting first and then inserting; wherein i represents a deletion operation on the i-th node of one graph structure, and j represents inserting the j-th node of another graph structure g into that graph structure;

and determining the optimal matching sequence by applying the GLA algorithm to the cost matrix obtained after applying the replacement rule of the introduced element A_ij.

Optionally, the integrating the nodes at the corresponding positions between the graph structures according to the optimal matching sequence and the graph editing distance specifically includes:

judging whether the graph editing distance is larger than a distance threshold value;

if the graph editing distance is larger than the distance threshold, different data modes are not matched, and data integration cannot be carried out;

and if the graph editing distance is smaller than or equal to the distance threshold, integrating the graph according to the optimal matching sequence.

A data integration system based on an approximate graph matching algorithm, comprising:

the graph structure determining module is used for respectively carrying out graph mapping on different data modes and determining corresponding graph structures;

the cost matrix determining module is used for performing graph matching on the mapped graphs by using an approximate graph matching algorithm and determining a cost matrix between the graph structures;

the optimal matching sequence and graph editing distance determining module is used for determining the optimal matching sequence of the nodes in the graph structure according to the cost matrix between the graph structures; determining a graph editing distance according to the optimal matching sequence;

and the data integration completion module is used for integrating the nodes at the corresponding positions between the graph structures according to the optimal matching sequence and the graph editing distance.

Optionally, the graph structure determining module specifically includes:

and the Graph structure determining unit is used for respectively carrying out Graph mapping on different data modes by adopting an SQL2Graph tool and determining corresponding Graph structures.

Optionally, the cost matrix determining module specifically includes:

the binary cost matrix determining unit is used for determining a binary cost matrix according to the graph structures mapped by different data modes;

and the SFBP cost matrix determining unit is used for simplifying the binary cost matrix to construct the SFBP cost matrix.

Optionally, the module for determining the optimal matching sequence and the graph edit distance specifically includes:

an introduced-element A_ij replacement rule unit, used for applying the replacement rule of the introduced element A_ij in the cost matrix; the replacement rule of the introduced element A_ij is that, for the same two nodes, the cost of inserting first and then deleting is consistent with the cost of deleting first and then inserting; wherein i represents a deletion operation on the i-th node of one graph structure, and j represents inserting the j-th node of another graph structure g into that graph structure;

an optimal matching sequence determining unit, used for determining the optimal matching sequence by applying the GLA algorithm to the cost matrix obtained after applying the replacement rule of the introduced element A_ij.

Optionally, the data integration completion module specifically includes:

a first judgment unit configured to judge whether the graph edit distance is greater than a distance threshold;

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

according to the data integration method and system based on the approximate graph matching algorithm, two or more data modes (such as XML documents, data in a database and the like) are mapped into a graph structure, and then the mapped graphs are subjected to graph matching through the approximate graph matching algorithm to obtain a cost matrix and graph editing distance between the graphs, so that the similarity between the graphs is measured. On the basis, similar parts in different data modes are extracted and integrated, and the purpose of data integration among heterogeneous data is achieved. The method utilizes the characteristic that the graph structure can better represent the attributes of the objects and the relationship between the objects, maps the original abstract data mode into the graph with clear structural hierarchy, and combines the approximate graph matching algorithm to enable the heterogeneous data to be matched and integrated more efficiently and more accurately.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic flow chart of a data integration method based on an approximate graph matching algorithm according to the present invention;

FIG. 2 is a diagram of a mapped structure;

FIG. 3 is a schematic flow chart of an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a data integration system based on an approximate graph matching algorithm according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Fig. 1 is a schematic flow chart of a data integration method based on an approximate graph matching algorithm provided by the present invention, and as shown in fig. 1, the data integration method based on the approximate graph matching algorithm provided by the present invention includes:

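S101, respectively performing graph mapping on different data modes to determine corresponding graph structures;

S101 specifically comprises the following step:

respectively performing graph mapping on the different data modes by using the SQL2Graph tool to determine the corresponding graph structures.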

As a specific example, the different data patterns are as follows:

Mode S1:

CREATE TABLE Personnel(
    Pno int,
    Pname string,
    Dept string,
    Born date,
    UNIQUE Pkey(Pno)
)

Mode S2:

CREATE TABLE Employee(
    EmpNo int PRIMARY KEY,
    EmpName varchar(50),
    DeptNo int REFERENCES Department,
    Salary dec(15,2),
    Birthdate date
)

CREATE TABLE Department(
    DeptNo int PRIMARY KEY,
    DeptName varchar(70)
)

S1 is a table named Personnel with four fields, namely Pno of type int, Pname of type string, Dept of type string and Born of type date, where Pno is the unique key of the table. S2 contains two tables, Employee and Department. Employee has 5 fields: the primary key EmpNo of type int, EmpName of type varchar(50), DeptNo of type int, which is a foreign key referencing table Department, Salary of type dec(15,2), and Birthdate of type date. The table Department contains two fields: the primary key DeptNo of type int and DeptName of type varchar(70).

The elements in S1 and S2 are all table structures defined using SQL DDL. The above schemas are mapped by means of the SQL2Graph tool: G₁ = SQL2Graph(S₁), G₂ = SQL2Graph(S₂). The mapping result is shown in FIG. 2, where G₁ is the source graph and G₂ is the target graph.

Graph G₁ is the graph structure obtained by mapping mode S1; for clarity, the key and similar elements are not shown in the figure. The nodes in the graph are drawn as ovals and rectangles. Labels in the ovals denote node identifiers, while rectangles denote text or string values. The three ovals in the top layer represent node types; there are three types in the schema, namely Table, Column and ColumnType. The graph can be read as follows: node 1 is of type Table and its table name is Personnel. Node 1 is connected to nodes 2, 4, 6 and 7; these four nodes are of type Column, and the name attached to each is the name of the corresponding field. Nodes 3, 5 and 8 are of type ColumnType: node 3 is the int type, node 5 is the string type, and node 8 is the date type. Node 2 is connected to node 3, indicating that the field of node 2 is of type int, and so on.
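The mapping of S1 described above can be sketched in Python. This is a minimal illustration and not the SQL2Graph tool itself: the function name schema_to_graph, the SchemaGraph container and the edge labels "column" and "type" are assumptions introduced here. Under those assumptions it reproduces the node numbering given for S1 (node 1 is the table, nodes 2, 4, 6 and 7 are columns, nodes 3, 5 and 8 are column types).

```python
from dataclasses import dataclass, field


@dataclass
class SchemaGraph:
    # node id -> (node type, name); node types follow the text: Table, Column, ColumnType
    nodes: dict = field(default_factory=dict)
    # (source id, edge label, target id); the edge labels are hypothetical
    edges: list = field(default_factory=list)


def schema_to_graph(table_name, columns):
    """Map one table, given as ordered (column name, type name) pairs, to a labeled graph."""
    g = SchemaGraph()
    next_id = 1
    table_id = next_id
    g.nodes[table_id] = ("Table", table_name)
    type_ids = {}  # type name -> node id, so each type is represented by a single node
    for col_name, col_type in columns:
        next_id += 1
        col_id = next_id
        g.nodes[col_id] = ("Column", col_name)
        g.edges.append((table_id, "column", col_id))
        if col_type not in type_ids:
            next_id += 1
            type_ids[col_type] = next_id
            g.nodes[next_id] = ("ColumnType", col_type)
        g.edges.append((col_id, "type", type_ids[col_type]))
    return g


# Schema S1 from the example (the UNIQUE key is omitted, as in the figure).
g1 = schema_to_graph(
    "Personnel",
    [("Pno", "int"), ("Pname", "string"), ("Dept", "string"), ("Born", "date")],
)
```

Sharing one ColumnType node between Pname and Dept is what lets the later matching step see that two columns have the same type.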

S102, performing graph matching on the mapped graphs by using an approximate graph matching algorithm, and determining a cost matrix between the graph structures;

S102 specifically comprises the following steps:

determining a binary cost matrix according to the graph structures mapped from the different data modes;

and simplifying the binary cost matrix to construct an SFBP cost matrix.

As a specific example, take the above graphs G₁ and G₂. For simplicity, let |V₁| and |V₂| denote their respective numbers of nodes. Empty nodes ε are added to both graphs so that the two graphs have the same number of nodes, specifically as follows:

In this case, the adjacency matrix A of G₁ can be obtained:

The adjacency matrix B of G₂:

On the basis of the vertex sets of the two graphs, a binary cost matrix C is constructed:

In matrix C, c_ij denotes the cost of the node substitution operation (u_i → v_j), and c_iε denotes the cost of the node deletion operation (u_i → ε). In the cost matrix C, the upper-left block holds the substitution costs of all nodes, the lower-left block holds the node insertion costs, the upper-right block holds the node deletion costs, and the lower-right block corresponds to substitutions between empty nodes, so its entries are all 0. Since each node can be inserted or deleted only once, only the diagonal elements of the lower-left and upper-right blocks are real values; the remaining elements are ∞.
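The block layout of C can be sketched as follows. This is an illustration under stated assumptions rather than the patent's own construction: the function name build_cost_matrix and its three inputs (a substitution cost table, a list of deletion costs and a list of insertion costs) are hypothetical, and the source graph is taken to have n nodes and the target graph m nodes.

```python
import math


def build_cost_matrix(sub, delete, insert):
    """Build the (n + m) x (n + m) cost matrix from its four blocks.

    sub[i][j] : cost of substituting source node i by target node j (n x m)
    delete[i] : cost of deleting source node i (length n)
    insert[j] : cost of inserting target node j (length m)
    """
    n, m = len(delete), len(insert)
    size = n + m
    C = [[0.0] * size for _ in range(size)]  # lower-right block stays 0
    for i in range(n):
        for j in range(m):
            C[i][j] = sub[i][j]  # upper-left block: substitutions
        for k in range(n):
            # upper-right block: deletions on the diagonal, infinity elsewhere
            C[i][m + k] = delete[i] if k == i else math.inf
    for r in range(m):
        for j in range(m):
            # lower-left block: insertions on the diagonal, infinity elsewhere
            C[n + r][j] = insert[j] if r == j else math.inf
    return C


# Two source nodes and one target node give a 3 x 3 matrix.
C_example = build_cost_matrix(sub=[[1], [2]], delete=[3, 4], insert=[5])
```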

The graph edit path, formulated as a quadratic assignment problem:

wherein

represents the (n + m)! possible edit distances obtained from the permutations of the graph node index sequence (1, 2, ..., (n + m)).

The numbers of nodes of the source graph and the target graph are assumed to be n and m respectively, and the SFBP cost matrix is constructed in two cases.

When n > m, the SFBP cost matrix is as follows:

Node insertion operations on the source graph are not considered, and every element of the upper-right block of the cost matrix whose value is ∞ is replaced by the value of the real-valued element in its row.

When n < m, the SFBP cost matrix is as follows:

Node deletion operations on the source graph are not considered, and every element of the lower-left block of the cost matrix whose value is ∞ is replaced by the value of the real-valued element in its column.
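One possible reading of the two cases is sketched below. The shape of the simplified matrix is an assumption: it is taken to be square with dimension max(n, m), consisting of the substitution block plus either deletion costs repeated along each row (n > m) or insertion costs repeated down each column (n < m). The function name build_sfbp_matrix is hypothetical, and the n = m branch, which the text above does not cover, simply returns the substitution block.

```python
def build_sfbp_matrix(sub, delete, insert):
    """Sketch of a simplified square cost matrix of dimension max(n, m).

    sub[i][j] : substitution cost (n x m)
    delete[i] : deletion cost of source node i (length n)
    insert[j] : insertion cost of target node j (length m)
    """
    n, m = len(delete), len(insert)
    if n > m:
        # Insertions are ignored; every entry of the right block in row i
        # takes the one real value of that row, the deletion cost of node i.
        return [list(sub[i]) + [delete[i]] * (n - m) for i in range(n)]
    if n < m:
        # Deletions are ignored; every entry of the lower block in column j
        # takes the one real value of that column, the insertion cost of node j.
        return [list(row) for row in sub] + [list(insert) for _ in range(m - n)]
    # Assumption: with equally sized graphs only substitutions remain.
    return [list(row) for row in sub]


tall = build_sfbp_matrix(sub=[[1], [2]], delete=[3, 4], insert=[5])
wide = build_sfbp_matrix(sub=[[1, 2]], delete=[3], insert=[4, 5])
```

Compared with the full (n + m) x (n + m) matrix, the simplified matrix has at most max(n, m) rows and columns, which is what makes the later assignment step cheaper.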

S103, determining an optimal matching sequence of nodes in the graph structure according to the cost matrix between the graph structures; determining a graph editing distance according to the optimal matching sequence;

S103 specifically comprises the following steps:

In order to avoid the accuracy loss caused by not considering the deletion or insertion of source-graph nodes when the SFBP cost matrix is constructed, a new element A_ij is introduced, where i represents a deletion operation on the i-th node of the source graph G, and j represents inserting the j-th node of the target graph g into the source graph G. The element A_ij satisfies the following conditions:

(1) For the same two nodes, the cost of inserting first and then deleting is consistent with the cost of deleting first and then inserting.

(2) The element A_ij is a non-negative real number.

(3) i ≠ j, i.e., deletion and insertion cannot both be performed on the same node.

Applying the replacement rule of the introduced element A_ij in the cost matrix specifically includes:

(1) When n > m:

the left block of the matrix is formed by c_ij; the elements on the main diagonal of the right block are the cost values of deleting the nodes of the source graph, and the remaining elements of each row are formed in order by the corresponding row elements of A_ij.

(2) When n < m:

the upper block of the matrix is formed by c_ij; the main diagonal elements of the lower block are the cost values of adding nodes to the source graph, and the remaining elements of each row are formed in order by the corresponding row elements of A_ij.

(3) When n = m:

when A_ij < c_ij, A_ij is used to replace c_ij.

The cost matrix obtained after applying the replacement rule of the introduced element A_ij falls into three cases:

Case one: if the total number of rows of the matrix is greater than the total number of columns, and the column index of the element currently being calculated is greater than the difference between the total number of rows and the total number of columns, every element of c_ij that is larger than the corresponding A_ij is replaced by that A_ij. An n-dimensional matrix is constructed, in which the left block is formed by c_ij, the elements on the main diagonal of the right block are the cost values of deleting the nodes of the source graph, and the remaining elements of each row are formed in order by the corresponding row elements of c_ij.

Case two: if the total number of rows of the matrix is less than the total number of columns, and the row index of the element currently being calculated is greater than the difference between the total number of rows and the total number of columns, every element of c_ij that is larger than the corresponding A_ij is replaced by that A_ij. An m-dimensional matrix is constructed, in which the upper block is formed by c_ij, the main diagonal elements of the lower block are the cost values of adding nodes to the source graph, and the remaining elements of each row are formed in order by the corresponding row elements of A_ij.

Case three: if the total number of rows of the matrix is equal to the total number of columns, every element of c_ij that is larger than the corresponding A_ij is replaced by that A_ij. An n-dimensional matrix is constructed whose elements are the corresponding elements of c_ij.

The cost matrix under the following three different conditions can be obtained through the algorithm:

(1) n > m:

(2) n < m:

(3) n = m:

The graph matching problem is solved using the GLA algorithm:

Step 1: initialize ψ to the empty set.

Step 2: check all nodes in the cost matrix C one by one; if a node i has not yet been used, perform the following calculation

and

Step 3: judge whether i = k holds. If so, perform

and remove row i and column φ_i from the cost matrix C.

Otherwise, calculate

and

If

holds, rows i and k and columns φ_i and φ′_i are removed from the cost matrix C.

Step 4: obtain the graph edit distance value N from the sequence in ψ.

When the algorithm ends, the graph edit distance value N is returned.

In the algorithm, ψ is the optimal matching sequence of nodes. The basic idea is to find the minimum-cost element c_ij in row i of the cost matrix and denote its column by φ_i, and then to find the minimum-cost element in column φ_i and denote its row by k. When i = k, c_ij is the minimum-cost element of both row i and column φ_i, and it is added to the optimal matching sequence. If i ≠ k, the minimum-cost element c_ij′ other than c_ij is found in row i and its column is denoted by φ_i′, and the minimum-cost element of row k other than the column minimum already found is located and its column is denoted by φ_k. The resulting candidates are compared, and the rows and columns of the elements with the smaller cost value are added to the optimal matching sequence. This process is iterated until all rows and columns have been assigned, and finally the graph edit distance is calculated from the optimal matching sequence.
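The greedy procedure described above can be sketched in Python. The comparison made when i ≠ k is an assumption, because the corresponding formula is not reproduced in the text: the sketch compares the total cost of keeping (i, φ_i) and giving row k its next-best column against the total cost of keeping (k, φ_i) and giving row i its next-best column, and keeps the cheaper pairing. The names gla_match, best_col, alt_i and alt_k are hypothetical.

```python
def gla_match(C):
    """Greedy matching on a square cost matrix C; returns (psi, N)."""
    size = len(C)
    rows = list(range(size))  # rows not yet assigned, in ascending order
    cols = list(range(size))  # columns not yet assigned, in ascending order
    psi = []                  # matching sequence of (row, column) pairs

    def best_col(r, exclude=None):
        # cheapest unassigned column for row r, optionally skipping one column
        return min((j for j in cols if j != exclude), key=lambda j: C[r][j])

    for i in range(size):
        if i not in rows:
            continue
        phi_i = best_col(i)                        # row minimum of row i
        k = min(rows, key=lambda r: C[r][phi_i])   # column minimum of column phi_i
        if i == k:
            pairs = [(i, phi_i)]
        else:
            alt_i = best_col(i, exclude=phi_i)     # second choice of row i
            alt_k = best_col(k, exclude=phi_i)     # choice of row k without phi_i
            keep_i = C[i][phi_i] + C[k][alt_k]
            keep_k = C[i][alt_i] + C[k][phi_i]
            if keep_i <= keep_k:
                pairs = [(i, phi_i), (k, alt_k)]
            else:
                pairs = [(i, alt_i), (k, phi_i)]
        for r, c in pairs:
            psi.append((r, c))
            rows.remove(r)
            cols.remove(c)

    # graph edit distance value N: sum of the costs selected in psi
    return psi, sum(C[r][c] for r, c in psi)


psi_a, n_a = gla_match([[1, 2], [3, 1]])
psi_b, n_b = gla_match([[5, 6], [1, 9]])
```

In the second call, row 0 prefers column 0 but row 1 is cheaper in that column, so the two pairings are compared and the cheaper one is kept. Because each row is assigned after a constant number of row and column scans, the sketch runs in time quadratic in the matrix dimension, at the price of not guaranteeing an optimal assignment.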

S104, integrating nodes at corresponding positions among the graph structures according to the optimal matching sequence and the graph editing distance.

S104 specifically comprises the following steps:

judging whether the graph editing distance N is larger than a distance threshold value MaxN or not;

if the graph editing distance N is larger than a distance threshold value MaxN, different data modes are not matched, and data integration cannot be carried out;

and if the graph editing distance N is smaller than or equal to the distance threshold value MaxN, integrating the graphs according to the optimal matching sequence. The matched rows and columns of nodes are stored in ψ, and the nodes in the corresponding rows and columns of the two graphs are integrated.
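The threshold decision of S104 can be sketched as follows. The function name integrate and its parameters are hypothetical, and skipping pairs that involve an added empty node is an assumption: the text states only that nodes in corresponding rows and columns are integrated.

```python
def integrate(psi, edit_distance, max_distance, source_nodes, target_nodes):
    """Return the node pairs to integrate, or None if the data modes do not match."""
    if edit_distance > max_distance:
        return None  # N > MaxN: the data modes are not matched
    merged = []
    for i, j in psi:
        # indices beyond the real node lists stand for added empty nodes
        if i < len(source_nodes) and j < len(target_nodes):
            merged.append((source_nodes[i], target_nodes[j]))
    return merged


pairs = integrate(
    psi=[(0, 0), (1, 2), (2, 1)],
    edit_distance=3,
    max_distance=5,
    source_nodes=["Pno", "Pname"],
    target_nodes=["EmpNo", "EmpName"],
)
rejected = integrate([(0, 0)], edit_distance=6, max_distance=5,
                     source_nodes=["Pno"], target_nodes=["EmpNo"])
```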

FIG. 3 is a flowchart of an embodiment provided by the present invention. Because a graph can better represent object attributes and the relationships between objects, the present invention can match and integrate data modes more efficiently. Introducing the new element A_ij on the basis of the SFBP cost matrix improves the matching accuracy. The graph structure makes the abstract associations within the original data modes clearer, so that the problem of matching and integrating massive heterogeneous data in the current era of data explosion can be better solved.

Fig. 4 is a schematic structural diagram of a data integration system based on an approximate graph matching algorithm provided by the present invention, and as shown in fig. 4, the data integration system based on the approximate graph matching algorithm provided by the present invention includes:

a graph structure determining module 401, configured to perform graph mapping on different data modes respectively, and determine corresponding graph structures;

a cost matrix determining module 402, configured to perform graph matching on the mapped graph by using an approximate graph matching algorithm, and determine a cost matrix between graph structures;

an optimal matching sequence and graph edit distance determining module 403, configured to determine an optimal matching sequence of nodes in a graph structure according to a cost matrix between graph structures; determining a graph editing distance according to the optimal matching sequence;

and a data integration completion module 404, configured to integrate nodes at corresponding positions between the graph structures according to the optimal matching sequence and the graph editing distance.

The graph structure determining module 401 specifically includes:

The cost matrix determining module 402 specifically includes:

The optimal matching sequence and graph edit distance determining module 403 specifically includes:

an introduced-element A_ij replacement rule unit, used for applying the replacement rule of the introduced element A_ij in the cost matrix; the replacement rule of the introduced element A_ij is that, for the same two nodes, the cost of inserting first and then deleting is consistent with the cost of deleting first and then inserting; wherein i represents a deletion operation on the i-th node of one graph structure, and j represents inserting the j-th node of another graph structure g into that graph structure;

The data integration completion module 404 specifically includes:

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A data integration method based on approximate graph matching algorithm is characterized by comprising the following steps:

2. The data integration method based on the approximate graph matching algorithm according to claim 1, wherein the mapping different data patterns to determine corresponding graph structures comprises:

3. The data integration method based on the approximate graph matching algorithm according to claim 1, wherein the graph matching is performed on the mapped graph by using the approximate graph matching algorithm to determine the cost matrix between graph structures, and specifically comprises:

and simplifying the binary cost matrix to construct an SFBP cost matrix.

4. The data integration method based on the approximate graph matching algorithm as claimed in claim 1, wherein the optimal matching sequence of the nodes in the graph structure is determined according to the cost matrix between the graph structures; and determining a graph editing distance according to the optimal matching sequence, which specifically comprises the following steps:

5. The data integration method based on the approximate graph matching algorithm according to claim 1, wherein the integrating the nodes at the corresponding positions between the graph structures according to the optimal matching sequence and the graph editing distance specifically comprises:

6. A data integration system based on an approximate graph matching algorithm, comprising:

7. The approximate graph matching algorithm-based data integration system of claim 6, wherein the graph structure determination module specifically comprises:

8. The approximate graph matching algorithm-based data integration system according to claim 6, wherein the cost matrix determination module specifically comprises:

9. The approximate graph matching algorithm-based data integration system of claim 6, wherein the optimal matching sequence and graph edit distance determining module specifically comprises:

an introduced-element A_ij replacement rule unit, used for applying the replacement rule of the introduced element A_ij in the cost matrix; the replacement rule of the introduced element A_ij is that, for the same two nodes, the cost of inserting first and then deleting is consistent with the cost of deleting first and then inserting; wherein i represents a deletion operation on the i-th node of one graph structure, and j represents inserting the j-th node of another graph structure g into that graph structure;

10. The approximate graph matching algorithm-based data integration system according to claim 6, wherein the data integration completion module specifically comprises: