CN114048240A - Data integration method and system based on approximate graph matching algorithm - Google Patents

Publication number
CN114048240A
Authority
CN
China
Prior art keywords
graph
determining
cost matrix
structures
approximate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111371583.5A
Other languages
Chinese (zh)
Other versions
CN114048240B (en
Inventor
陈占芳
刘庆宗
姜晓明
梁玉柱
吴森森
高鹏辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Research Institute Of Changchun University Of Technology
Changchun University of Science and Technology
Original Assignee
Chongqing Research Institute Of Changchun University Of Technology
Changchun University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Research Institute Of Changchun University Of Technology, Changchun University of Science and Technology filed Critical Chongqing Research Institute Of Changchun University Of Technology
Priority to CN202111371583.5A priority Critical patent/CN114048240B/en
Publication of CN114048240A publication Critical patent/CN114048240A/en
Application granted granted Critical
Publication of CN114048240B publication Critical patent/CN114048240B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2457Query processing with adaptation to user needs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a data integration method and a data integration system based on an approximate graph matching algorithm. The method comprises the steps of respectively carrying out graph mapping on different data modes and determining corresponding graph structures; carrying out graph matching on the mapped graph by using an approximate graph matching algorithm, and determining a cost matrix between graph structures; determining an optimal matching sequence of nodes in the graph structure according to the cost matrix between the graph structures; determining a graph editing distance according to the optimal matching sequence; and integrating nodes at corresponding positions among the graph structures according to the optimal matching sequence and the graph editing distance. The invention can improve the integration degree of heterogeneous data.

Description

Data integration method and system based on approximate graph matching algorithm
Technical Field
The invention relates to the field of data integration, in particular to a data integration method and system based on an approximate graph matching algorithm.
Background
In the current era of data explosion, big data management is a major challenge, and data integration is a central issue within it. Data integration addresses mismatches among data sources: it brings heterogeneous, distributed and autonomous data together, provides the user with a single view, and allows the data sources to be accessed transparently. It is therefore particularly important to match and integrate the patterns (schemas) of heterogeneous data. Early research in the field of data integration was primarily directed at identifying, for given data sources and data sets, associations between data tables, data columns and tuples that describe the same attributes or the same entities. To mine these associations, relatively simple approaches were mainly used, such as direct matching based on character strings or manual identification. For data whose field attributes are similar but whose field names are inconsistent, such as Name and employee Name, the resulting degree of integration is low.
Thus, a new data integration method or system is needed to improve the integration degree of heterogeneous data.
Disclosure of Invention
The invention aims to provide a data integration method and a data integration system based on an approximate graph matching algorithm, which can improve the integration degree of heterogeneous data.
In order to achieve the purpose, the invention provides the following scheme:
a data integration method based on an approximate graph matching algorithm comprises the following steps:
respectively mapping different data modes to determine corresponding graph structures;
carrying out graph matching on the mapped graph by using an approximate graph matching algorithm, and determining a cost matrix between graph structures;
determining an optimal matching sequence of nodes in the graph structure according to the cost matrix between the graph structures; determining a graph editing distance according to the optimal matching sequence;
and integrating nodes at corresponding positions among the graph structures according to the optimal matching sequence and the graph editing distance.
Optionally, the respectively carrying out graph mapping on different data modes to determine corresponding graph structures specifically includes:
and respectively carrying out Graph mapping on different data modes by adopting an SQL2Graph tool to determine corresponding Graph structures.
Optionally, the performing graph matching on the mapped graph by using an approximate graph matching algorithm to determine a cost matrix between graph structures specifically includes:
determining a binary cost matrix according to the graph structures mapped by different data modes;
and simplifying the binary cost matrix to construct an SFBP cost matrix.
Optionally, the optimal matching sequence of the nodes in the graph structure is determined according to a cost matrix between the graph structures; and determining a graph editing distance according to the optimal matching sequence, which specifically comprises the following steps:
applying the replacement rule of an introduced element $A_{ij}$ to the cost matrix; the replacement rule of the introduced element $A_{ij}$ is that, for the same two nodes, the cost value of inserting first and then deleting is consistent with the cost value of deleting first and then inserting; wherein i represents a deletion operation on the i-th node in one graph structure, and j represents inserting the j-th node of another graph structure g into that graph structure;
for the cost matrix obtained after applying the replacement rule of the introduced element $A_{ij}$, determining an optimal matching sequence by adopting a GLA algorithm.
Optionally, the integrating the nodes at the corresponding positions between the graph structures according to the optimal matching sequence and the graph editing distance specifically includes:
judging whether the graph editing distance is larger than a distance threshold value;
if the graph editing distance is larger than the distance threshold, different data modes are not matched, and data integration cannot be carried out;
and if the graph editing distance is smaller than or equal to the distance threshold, integrating the graph according to the optimal matching sequence.
A data integration system based on an approximate graph matching algorithm, comprising:
the graph structure determining module is used for respectively carrying out graph mapping on different data modes and determining corresponding graph structures;
the cost matrix determining module is used for carrying out graph matching on the mapped graphs by utilizing an approximate graph matching algorithm and determining a cost matrix between graph structures;
the optimal matching sequence and graph editing distance determining module is used for determining the optimal matching sequence of the nodes in the graph structure according to the cost matrix between the graph structures; determining a graph editing distance according to the optimal matching sequence;
and the data integration completion module is used for integrating the nodes at the corresponding positions between the graph structures according to the optimal matching sequence and the graph editing distance.
Optionally, the graph structure determining module specifically includes:
and the Graph structure determining unit is used for respectively carrying out Graph mapping on different data modes by adopting an SQL2Graph tool and determining corresponding Graph structures.
Optionally, the cost matrix determining module specifically includes:
the binary cost matrix determining unit is used for determining a binary cost matrix according to the graph structures mapped by different data modes;
and the SFBP cost matrix determining unit is used for simplifying the binary cost matrix to construct the SFBP cost matrix.
Optionally, the module for determining the optimal matching sequence and the graph edit distance specifically includes:
an introduced-element $A_{ij}$ replacement rule unit, used for applying the replacement rule of the introduced element $A_{ij}$ to the cost matrix; the replacement rule of the introduced element $A_{ij}$ is that, for the same two nodes, the cost value of inserting first and then deleting is consistent with the cost value of deleting first and then inserting; wherein i represents a deletion operation on the i-th node in one graph structure, and j represents inserting the j-th node of another graph structure g into that graph structure;
an optimal matching sequence determining unit, used for determining an optimal matching sequence by adopting a GLA algorithm for the cost matrix obtained after applying the replacement rule of the introduced element $A_{ij}$.
Optionally, the data integration completion module specifically includes:
a first judgment unit configured to judge whether the graph edit distance is greater than a distance threshold;
if the graph editing distance is larger than the distance threshold, different data modes are not matched, and data integration cannot be carried out;
and if the graph editing distance is smaller than or equal to the distance threshold, integrating the graph according to the optimal matching sequence.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
according to the data integration method and system based on the approximate graph matching algorithm, two or more data modes (such as XML documents, data in a database and the like) are mapped into a graph structure, and then the mapped graphs are subjected to graph matching through the approximate graph matching algorithm to obtain a cost matrix and graph editing distance between the graphs, so that the similarity between the graphs is measured. On the basis, similar parts in different data modes are extracted and integrated, and the purpose of data integration among heterogeneous data is achieved. The method utilizes the characteristic that the graph structure can better represent the attributes of the objects and the relationship between the objects, maps the original abstract data mode into the graph with clear structural hierarchy, and combines the approximate graph matching algorithm to enable the heterogeneous data to be matched and integrated more efficiently and more accurately.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a schematic flow chart of a data integration method based on an approximate graph matching algorithm according to the present invention;
FIG. 2 is a diagram of a mapped structure;
FIG. 3 is a schematic flow chart of an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a data integration system based on an approximate graph matching algorithm according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a data integration method and a data integration system based on an approximate graph matching algorithm, which can improve the integration degree of heterogeneous data.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a schematic flow chart of a data integration method based on an approximate graph matching algorithm provided by the present invention, and as shown in fig. 1, the data integration method based on the approximate graph matching algorithm provided by the present invention includes:
S101, respectively carrying out graph mapping on different data modes to determine corresponding graph structures;
S101 specifically comprises the following steps:
respectively carrying out graph mapping on different data modes by adopting an SQL2Graph tool to determine corresponding graph structures.
As a specific example, the different data patterns are as follows:
Schema S1:
CREATE TABLE Personnel(
Pno int,
Pname string,
Dept string,
Born date,
UNIQUE Pkey(Pno)
);
Schema S2:
CREATE TABLE Employee(
EmpNo int PRIMARY KEY,
EmpName varchar(50),
DeptNo int REFERENCES Department,
Salary dec(15,2),
Birthdate date
);
CREATE TABLE Department(
DeptNo int PRIMARY KEY,
DeptName varchar(70)
);
S1 is a table named Personnel with four fields: Pno of type int, Pname of type string, Dept of type string and Born of type date, where Pno is the unique key of the table. S2 contains two tables, Employee and Department. Employee has five fields: the unique primary key EmpNo of type int, EmpName of type varchar(50), DeptNo of type int, which is a foreign key referencing table Department, Salary of type dec(15,2), and Birthdate of type date. The table Department contains two fields: the unique primary key DeptNo of type int, and DeptName of type varchar(70).
The elements in S1 and S2 are both table structures defined using SQL DDL. The above schemas are mapped by means of the SQL2Graph tool: $G_1 = \mathrm{SQL2Graph}(S_1)$, $G_2 = \mathrm{SQL2Graph}(S_2)$. The mapping result is shown in FIG. 2, where $G_1$ is the source graph and $G_2$ is the target graph.
The graph $G_1$ is the graph structure obtained after mapping the schema S1; for clarity, some details, such as the key, are not shown in the figure. The nodes in the graph are shown as ovals and rectangles. Labels in the ovals denote identifiers of nodes, while rectangles denote text or string values. The three ovals in the upper layer represent node types; there are three types in the schema, namely Table, Column and ColumnType. The graph can be read simply as follows: node 1 is of type Table, with the table name Personnel. Node 1 is connected to nodes 2, 4, 6 and 7; these four nodes are of type Column, and the name connected to each of them is the field name of the corresponding field. Nodes 3, 5 and 8 are of type ColumnType: node 3 is the int type, node 5 is the string type, and node 8 is the date type. Node 2 is connected to node 3, indicating that node 2 is of type int, and so on.
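The node numbering described for $G_1$ can be reproduced with a small sketch. This is not the SQL2Graph tool itself, whose interface is not given here; the function name `schema_to_graph` and the input format (a list of column name and type pairs) are assumptions made only for illustration.

```python
from collections import OrderedDict

def schema_to_graph(table_name, columns):
    """Map one table definition to a labeled graph (illustrative only).

    columns: list of (column_name, type_name) pairs in declaration order.
    Returns (nodes, edges): nodes maps id -> (node_type, label),
    edges is a list of (source_id, target_id) pairs.
    """
    nodes = OrderedDict()
    edges = []
    type_ids = {}   # type name -> id of the ColumnType node already created
    next_id = 1

    table_id = next_id
    nodes[table_id] = ("Table", table_name)
    next_id += 1

    for col_name, type_name in columns:
        col_id = next_id
        nodes[col_id] = ("Column", col_name)
        edges.append((table_id, col_id))      # table -> column
        next_id += 1
        if type_name not in type_ids:         # one shared node per type
            type_ids[type_name] = next_id
            nodes[next_id] = ("ColumnType", type_name)
            next_id += 1
        edges.append((col_id, type_ids[type_name]))  # column -> its type
    return nodes, edges

nodes, edges = schema_to_graph(
    "Personnel",
    [("Pno", "int"), ("Pname", "string"), ("Dept", "string"), ("Born", "date")],
)
for node_id, (kind, label) in nodes.items():
    print(node_id, kind, label)
```

Under these assumptions the table becomes node 1, the columns become nodes 2, 4, 6 and 7, and the shared type nodes become 3, 5 and 8, which agrees with the description of FIG. 2.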
S102, carrying out graph matching on the mapped graphs by using an approximate graph matching algorithm, and determining a cost matrix between graph structures;
S102 specifically comprises the following steps:
determining a binary cost matrix according to the graph structures mapped by different data modes;
and simplifying the binary cost matrix to construct an SFBP cost matrix.
As a specific example, to simplify the handling of the above graphs $G_1$ and $G_2$, assume that their numbers of nodes are $|V_1| = n$ and $|V_2| = m$ respectively. Empty nodes $\varepsilon$ are added to the two graphs so that the numbers of nodes in the two graphs are equal; the specific node sets are as follows:
\[ V_1^{+} = V_1 \cup \{\varepsilon_1, \ldots, \varepsilon_m\} \]
\[ V_2^{+} = V_2 \cup \{\varepsilon_1, \ldots, \varepsilon_n\} \]
In this case, the adjacency matrix A of $G_1$ can be obtained:
Figure BDA0003362523800000073
and the adjacency matrix B of $G_2$:
Figure BDA0003362523800000074
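On the basis of the vertex sets of the two graphs, a binary cost matrix C is constructed:
\[
C = \left[\begin{array}{cccc|cccc}
c_{11} & c_{12} & \cdots & c_{1m} & c_{1\varepsilon} & \infty & \cdots & \infty \\
c_{21} & c_{22} & \cdots & c_{2m} & \infty & c_{2\varepsilon} & \ddots & \vdots \\
\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \ddots & \infty \\
c_{n1} & c_{n2} & \cdots & c_{nm} & \infty & \cdots & \infty & c_{n\varepsilon} \\ \hline
c_{\varepsilon 1} & \infty & \cdots & \infty & 0 & 0 & \cdots & 0 \\
\infty & c_{\varepsilon 2} & \ddots & \vdots & 0 & 0 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \infty & \vdots & \ddots & \ddots & 0 \\
\infty & \cdots & \infty & c_{\varepsilon m} & 0 & \cdots & 0 & 0
\end{array}\right]
\]
In the matrix C, $c_{ij}$ represents the cost value of the replacement operation of a node $(u_i \to v_j)$, $c_{i\varepsilon}$ represents the cost value of the deletion operation of a node $(u_i \to \varepsilon)$, and $c_{\varepsilon j}$ represents the cost value of the insertion operation of a node $(\varepsilon \to v_j)$. In the cost matrix C, the upper left block holds the replacement cost values of all nodes, the lower left block holds the node insertion cost values, the upper right block holds the node deletion cost values, and the lower right block corresponds to replacement operations between empty nodes, so the cost values in the lower right block are all 0. Since each node can undergo only one insertion or deletion operation, only the diagonal elements of the lower left block and of the upper right block are real values, and the remaining elements of these blocks are ∞.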
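The block layout of the cost matrix C described above can be sketched with NumPy. The function name `bipartite_cost_matrix` and the numeric cost values are hypothetical; how the individual replacement, deletion and insertion costs are computed is not specified here, so they are simply passed in.

```python
import numpy as np

def bipartite_cost_matrix(sub, delete, insert):
    """Assemble the (n+m) x (n+m) cost matrix from its four blocks.

    sub:    n x m replacement costs c_ij
    delete: length-n deletion costs (u_i -> epsilon)
    insert: length-m insertion costs (epsilon -> v_j)
    """
    sub = np.asarray(sub, dtype=float)
    n, m = sub.shape
    C = np.zeros((n + m, n + m))                       # lower right block stays 0
    C[:n, :m] = sub                                    # upper left: replacements
    C[:n, m:] = np.inf                                 # upper right: off-diagonal is infinity
    C[n:, :m] = np.inf                                 # lower left: off-diagonal is infinity
    C[np.arange(n), m + np.arange(n)] = delete         # diagonal of upper right block
    C[n + np.arange(m), np.arange(m)] = insert         # diagonal of lower left block
    return C

# Hypothetical costs for a source graph with n = 3 nodes and a target with m = 2.
C = bipartite_cost_matrix(
    sub=[[1, 2], [3, 4], [5, 6]],
    delete=[1, 1, 1],
    insert=[2, 2],
)
print(C)
```

The infinite entries make it impossible for an assignment to delete or insert a node through any cell other than its own diagonal cell.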
Graph edit path formula for the quadratic assignment problem:
Figure BDA0003362523800000082
wherein
Figure BDA0003362523800000083
represents all $(n+m)!$ possible permutations of the sequence of graph node indices $(1, 2, \ldots, (n+m))$.
The numbers of nodes of the source graph and the target graph are assumed to be n and m respectively, and the SFBP cost matrix is constructed in two cases.
When n > m, the SFBP cost matrix is as follows:
Figure BDA0003362523800000084
The node insertion operation on the source graph is not considered, and every element of the upper right block of the cost matrix whose value is ∞ is replaced by the value of the real-valued element in the same row.
When n < m, the SFBP cost matrix is as follows:
Figure BDA0003362523800000091
The node deletion operation on the source graph is not considered, and every element of the lower left block of the cost matrix whose value is ∞ is replaced by the value of the real-valued element in the same column.
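The two cases can be illustrated with a short sketch. The exact matrices appear only in the figures, so the shape used here (a square matrix of dimension max(n, m), with the extra block filled row-wise or column-wise by the real deletion or insertion cost) is an assumption based on the textual description; the function name `sfbp_cost_matrix` is likewise hypothetical.

```python
import numpy as np

def sfbp_cost_matrix(sub, delete, insert):
    """Sketch of a simplified square cost matrix (assumed layout).

    n > m: append n - m columns; row i of the added block holds delete[i].
    n < m: append m - n rows; column j of the added block holds insert[j].
    n == m: the replacement block is returned unchanged.
    """
    sub = np.asarray(sub, dtype=float)
    delete = np.asarray(delete, dtype=float)
    insert = np.asarray(insert, dtype=float)
    n, m = sub.shape
    if n > m:
        right = np.repeat(delete[:, None], n - m, axis=1)   # no infinity left
        return np.hstack([sub, right])
    if n < m:
        lower = np.repeat(insert[None, :], m - n, axis=0)   # no infinity left
        return np.vstack([sub, lower])
    return sub.copy()

# Hypothetical costs: 3 source nodes, 2 target nodes (n > m).
wide = sfbp_cost_matrix([[1, 2], [3, 4], [5, 6]], delete=[7, 8, 9], insert=[2, 2])
# Hypothetical costs: 2 source nodes, 3 target nodes (n < m).
tall = sfbp_cost_matrix([[1, 2, 3], [4, 5, 6]], delete=[1, 1], insert=[7, 8, 9])
print(wide)
print(tall)
```

Compared with the full (n+m) x (n+m) matrix, the assignment problem shrinks to max(n, m) rows and columns, which is where the saving in matching time would come from.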
S103, determining an optimal matching sequence of nodes in the graph structure according to the cost matrix between the graph structures; determining a graph editing distance according to the optimal matching sequence;
S103 specifically comprises the following steps:
applying the replacement rule of an introduced element $A_{ij}$ to the cost matrix; the replacement rule of the introduced element $A_{ij}$ is that, for the same two nodes, the cost value of inserting first and then deleting is consistent with the cost value of deleting first and then inserting; wherein i represents a deletion operation on the i-th node in one graph structure, and j represents inserting the j-th node of another graph structure g into that graph structure;
for the cost matrix obtained after applying the replacement rule of the introduced element $A_{ij}$, determining an optimal matching sequence by adopting a GLA algorithm.
To avoid the loss of accuracy caused by ignoring the deletion or insertion operations on source graph nodes when the SFBP cost matrix is constructed, a new element $A_{ij}$ is introduced, where i represents the deletion operation on the i-th node of the source graph G, and j represents the insertion of the j-th node of the target graph g into the source graph G. The element $A_{ij}$ satisfies the following conditions:
(1) For the same two nodes, the cost value of inserting first and then deleting is consistent with the cost value of deleting first and then inserting.
(2) The element $A_{ij}$ is a non-negative real number.
(3) i ≠ j, i.e., a deletion operation and an insertion operation cannot both be performed on the same node.
Applying the replacement rule of the introduced element $A_{ij}$ to the cost matrix specifically includes:
(1) When n > m:
the left block of the matrix consists of $c_{ij}$; the elements on the main diagonal of the right block are the cost values of deleting the nodes of the source graph, and the remaining elements of each row are formed, in order, from the corresponding row of elements of $A_{ij}$.
(2) When n < m:
the upper block of the matrix consists of $c_{ij}$; the elements on the main diagonal of the lower block are the cost values of the insertion operations on the nodes of the source graph, and the remaining elements of each row are formed, in order, from the corresponding row of elements of $A_{ij}$.
(3) When n = m:
when $A_{ij} < c_{ij}$, $c_{ij}$ is replaced with $A_{ij}$.
The cost matrix obtained after applying the replacement rule of the introduced element $A_{ij}$ can be divided into three cases:
Case one: if the total number of rows of the matrix is larger than the total number of columns, and the column number of the element currently being calculated is larger than the difference between the total number of rows and the total number of columns of the matrix, then every element of $c_{ij}$ that is larger than the corresponding $A_{ij}$ is replaced with that $A_{ij}$. An n-dimensional matrix is constructed, in which the left block consists of $c_{ij}$, the elements on the main diagonal of the right block are the cost values of deleting the nodes of the source graph, and the remaining elements of each row are formed, in order, from the corresponding row of elements of $c_{ij}$.
Case two: if the total number of rows of the matrix is less than the total number of columns, and the row number of the element currently being calculated is larger than the difference between the total number of rows and the total number of columns of the matrix, then every element of $c_{ij}$ that is larger than the corresponding $A_{ij}$ is replaced with that $A_{ij}$. An m-dimensional matrix is constructed, in which the upper block consists of $c_{ij}$, the elements on the main diagonal of the lower block are the cost values of the insertion operations on the nodes of the source graph, and the remaining elements of each row are formed, in order, from the corresponding row of elements of $A_{ij}$.
Case three: if the total number of rows of the matrix is equal to the total number of columns, then every element of $c_{ij}$ that is larger than the corresponding $A_{ij}$ is replaced with that $A_{ij}$. An n-dimensional matrix is constructed, in which the elements are the corresponding elements of $c_{ij}$.
The cost matrix under the following three different conditions can be obtained through the algorithm:
(1)n>m:
Figure BDA0003362523800000101
(2)n<m:
Figure BDA0003362523800000111
(3)n=m:
Figure BDA0003362523800000112
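For the case n = m, the replacement rule (use $A_{ij}$ wherever $A_{ij} < c_{ij}$) can be sketched as follows. The text does not give a formula for $A_{ij}$; the sketch assumes, purely for illustration, that $A_{ij}$ is the cost of deleting node i plus the cost of inserting node j, which is non-negative and independent of the order of the two operations. It does not attempt to encode condition (3), and the function name `apply_aij_rule` is hypothetical.

```python
import numpy as np

def apply_aij_rule(sub, delete, insert):
    """Replace c_ij by A_ij wherever A_ij < c_ij (case n == m).

    Assumption: A_ij = delete[i] + insert[j]. This definition is not
    taken from the source text; it is one choice satisfying conditions
    (1) and (2) when all costs are non-negative.
    """
    sub = np.asarray(sub, dtype=float)
    A = np.add.outer(np.asarray(delete, dtype=float),
                     np.asarray(insert, dtype=float))
    return np.where(A < sub, A, sub), A

# Hypothetical costs for two graphs with 2 nodes each.
new_C, A = apply_aij_rule(sub=[[1, 5], [4, 2]], delete=[1, 1], insert=[2, 2])
print(A)      # every A_ij is 1 + 2 = 3 under the assumed definition
print(new_C)  # entries 5 and 4 exceed 3 and are replaced
```

Entries where replacement is already cheaper than delete-plus-insert are left untouched, so the matrix never gets more expensive after the rule is applied.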
Solving the graph matching problem using the GLA algorithm:
The first step: set $\psi$ to empty.
The second step: check all nodes in the cost matrix C one by one; if a node i has not yet been used, perform the following calculations:
Figure BDA0003362523800000113
and
Figure BDA0003362523800000114
The third step: judge whether i = k holds. If so, perform
Figure BDA0003362523800000115
and remove the i-th row and the $\varphi_i$-th column from the cost matrix C.
Otherwise, calculate
Figure BDA0003362523800000116
and
Figure BDA0003362523800000117
If
Figure BDA0003362523800000118
holds, then the i-th and k-th rows and the $\varphi_i$-th and $\varphi'_i$-th columns are removed from the cost matrix C.
The fourth step: obtain the graph edit distance value N from the sequence in $\psi$.
The algorithm then ends and returns the graph edit distance value N.
In the algorithm, $\psi$ is the optimal matching sequence of nodes. The basic idea is to find the minimum cost value element $c_{ij}$ in the i-th row of the cost matrix, with $\varphi_i$ denoting the column in which this element is located, and then to find in the $\varphi_i$-th column the minimum cost value element
Figure BDA0003362523800000119
with k denoting the row in which this element is located. When i = k, $c_{ij}$ is the minimum cost value element of both the i-th row and the $\varphi_i$-th column, and it is added to the optimal matching sequence. If i ≠ k, the minimum cost value element $c_{ij'}$ other than $c_{ij}$ is found in the i-th row, with $\varphi'_i$ denoting the column of this element, and in the k-th row the minimum element other than
Figure BDA0003362523800000121
is found, namely
Figure BDA0003362523800000122
with $\varphi_k$ denoting the column in which this element is located. The values
Figure BDA0003362523800000123
are compared, and the rows and columns of the elements with the smaller cost value are added to the optimal matching sequence. This process is iterated until all rows and columns have been assigned, and finally the graph edit distance is calculated from the optimal matching sequence.
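The procedure described above can be sketched in Python. Because the formulas of the algorithm survive only as figure placeholders, the comparison used when i ≠ k (the pair of assignments with the smaller total cost wins) is an assumption modeled on a greedy assignment with loss comparison, and the graph edit distance is taken to be the sum of the selected cost values. The function name `greedy_loss_assignment` is hypothetical.

```python
import numpy as np

def greedy_loss_assignment(C):
    """Greedy assignment on a square cost matrix (illustrative sketch).

    Returns (psi, distance): psi is a list of (row, column) pairs and
    distance is the sum of the selected cost values.
    """
    C = np.asarray(C, dtype=float)
    n = C.shape[0]
    assert C.shape == (n, n), "sketch assumes a square cost matrix"
    free_rows = list(range(n))
    free_cols = list(range(n))
    psi = []

    def best_col(row, exclude=None):
        # cheapest still-unused column for this row, optionally skipping one
        cands = [j for j in free_cols if j != exclude]
        return min(cands, key=lambda j: C[row, j])

    while free_rows:
        i = free_rows[0]
        phi_i = best_col(i)                                  # min of row i
        k = min(free_rows, key=lambda r: C[r, phi_i])        # min of column phi_i
        if k == i:
            pairs = [(i, phi_i)]
        else:
            phi_i2 = best_col(i, exclude=phi_i)              # second choice for row i
            phi_k = best_col(k, exclude=phi_i)               # alternative for row k
            # Assumed comparison: keep whichever pair of assignments is cheaper.
            if C[i, phi_i] + C[k, phi_k] < C[i, phi_i2] + C[k, phi_i]:
                pairs = [(i, phi_i), (k, phi_k)]
            else:
                pairs = [(i, phi_i2), (k, phi_i)]
        for r, c in pairs:
            psi.append((r, c))
            free_rows.remove(r)   # "remove the row" from further consideration
            free_cols.remove(c)   # "remove the column" from further consideration
    distance = float(sum(C[r, c] for r, c in psi))
    return psi, distance

# Hypothetical 2 x 2 example where rows 0 and 1 compete for column 0.
psi, N = greedy_loss_assignment([[2, 3], [1, 9]])
print(psi, N)
```

In the example, row 0 prefers column 0, but row 1 is the cheapest row in that column, so the two alternatives are compared: 2 + 9 = 11 against 3 + 1 = 4, and the second is kept. Being greedy, the procedure gives an approximate rather than a guaranteed optimal assignment.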
S104, integrating nodes at corresponding positions among the graph structures according to the optimal matching sequence and the graph editing distance.
S104 specifically comprises the following steps:
judging whether the graph editing distance N is larger than a distance threshold value MaxN or not;
if the graph editing distance N is larger than a distance threshold value MaxN, different data modes are not matched, and data integration cannot be carried out;
and if the graph editing distance N is smaller than or equal to a distance threshold value MaxN, integrating the graph according to the optimal matching sequence. The mapped rows and columns of nodes are stored in psi, and the nodes in corresponding rows and columns in the two graphs are integrated.
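The threshold decision of S104 can be sketched as below. How matched nodes are actually merged is not detailed here, so the sketch only pairs up the labels of matched nodes; the function name `integrate`, the label lists and the numeric values are hypothetical.

```python
def integrate(psi, distance, max_distance, source_labels, target_labels):
    """Return matched label pairs, or None when the graphs do not match.

    psi: list of (source_index, target_index) pairs from the matching step.
    distance / max_distance: graph edit distance N and threshold MaxN.
    """
    if distance > max_distance:
        return None   # data modes do not match; no integration is performed
    return [(source_labels[r], target_labels[c]) for r, c in psi]

# Hypothetical matching result for two column nodes of each schema.
pairs = integrate(
    psi=[(0, 1), (1, 0)],
    distance=4.0,
    max_distance=10.0,
    source_labels=["Pname", "Born"],
    target_labels=["Birthdate", "EmpName"],
)
print(pairs)
```

With a threshold below the distance, for example `max_distance=3.0`, the same call returns `None`, corresponding to the branch in which data integration cannot be carried out.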
Fig. 3 is a flowchart of an embodiment provided by the present invention. The present invention can match and integrate data patterns more efficiently, because a graph can better represent object attributes and the relationships between objects. Introducing the new element $A_{ij}$ on the basis of the SFBP cost matrix improves the matching accuracy. The graph structure makes the abstract associations between the original data patterns clearer, so that the problems of matching and integrating massive heterogeneous data in the current era of data explosion can be better solved.
Fig. 4 is a schematic structural diagram of a data integration system based on an approximate graph matching algorithm provided by the present invention, and as shown in fig. 4, the data integration system based on the approximate graph matching algorithm provided by the present invention includes:
a graph structure determining module 401, configured to perform graph mapping on different data modes respectively, and determine corresponding graph structures;
a cost matrix determining module 402, configured to perform graph matching on the mapped graph by using an approximate graph matching algorithm, and determine a cost matrix between graph structures;
an optimal matching sequence and graph edit distance determining module 403, configured to determine an optimal matching sequence of nodes in a graph structure according to a cost matrix between graph structures; determining a graph editing distance according to the optimal matching sequence;
and a data integration completion module 404, configured to integrate nodes at corresponding positions between the graph structures according to the optimal matching sequence and the graph editing distance.
The graph structure determining module 401 specifically includes:
and the Graph structure determining unit is used for respectively carrying out Graph mapping on different data modes by adopting an SQL2Graph tool and determining corresponding Graph structures.
The cost matrix determining module 402 specifically includes:
the binary cost matrix determining unit is used for determining a binary cost matrix according to the graph structures mapped by different data modes;
and the SFBP cost matrix determining unit is used for simplifying the binary cost matrix to construct the SFBP cost matrix.
The optimal matching sequence and graph edit distance determining module 403 specifically includes:
an introduced-element $A_{ij}$ replacement rule unit, used for applying the replacement rule of the introduced element $A_{ij}$ to the cost matrix; the replacement rule of the introduced element $A_{ij}$ is that, for the same two nodes, the cost value of inserting first and then deleting is consistent with the cost value of deleting first and then inserting; wherein i represents a deletion operation on the i-th node in one graph structure, and j represents inserting the j-th node of another graph structure g into that graph structure;
an optimal matching sequence determining unit, used for determining an optimal matching sequence by adopting a GLA algorithm for the cost matrix obtained after applying the replacement rule of the introduced element $A_{ij}$.
The data integration completion module 404 specifically includes:
a first judgment unit configured to judge whether the graph edit distance is greater than a distance threshold;
if the graph editing distance is larger than the distance threshold, different data modes are not matched, and data integration cannot be carried out;
and if the graph editing distance is smaller than or equal to the distance threshold, integrating the graph according to the optimal matching sequence.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A data integration method based on an approximate graph matching algorithm, characterized by comprising the following steps:
respectively mapping different data modes to determine corresponding graph structures;
carrying out graph matching on the mapped graph by using an approximate graph matching algorithm, and determining a cost matrix between graph structures;
determining an optimal matching sequence of nodes in the graph structure according to the cost matrix between the graph structures; determining a graph editing distance according to the optimal matching sequence;
and integrating nodes at corresponding positions among the graph structures according to the optimal matching sequence and the graph editing distance.
2. The data integration method based on the approximate graph matching algorithm according to claim 1, wherein the respectively mapping different data modes to determine corresponding graph structures specifically comprises:
and respectively carrying out Graph mapping on different data modes by adopting an SQL2Graph tool to determine corresponding Graph structures.
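For orientation only, the sketch below shows one simple way a relational schema could be turned into a labeled graph: tables and columns become nodes, and foreign keys become edges. It does not reproduce the behavior of the SQL2Graph tool named in this claim, which is not described here; the input format, the node and edge labels, and the function name are all assumptions.

```python
def schema_to_graph(schema):
    """Map a relational schema to (nodes, edges).

    schema: {table: {"columns": [...], "foreign_keys": {column: referenced_table}}}
    (hypothetical input format).
    """
    nodes, edges = [], []
    for table, spec in schema.items():
        nodes.append(("table", table))
        for column in spec.get("columns", []):
            column_id = f"{table}.{column}"
            nodes.append(("column", column_id))
            edges.append((table, column_id, "has_column"))
        for column, referenced in spec.get("foreign_keys", {}).items():
            # A foreign key links a column to the table it points at.
            edges.append((f"{table}.{column}", referenced, "references"))
    return nodes, edges
```

Two schemas mapped this way yield two graphs whose nodes can then be compared pairwise when the cost matrix is built.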
3. The data integration method based on the approximate graph matching algorithm according to claim 1, wherein the graph matching is performed on the mapped graph by using the approximate graph matching algorithm to determine the cost matrix between graph structures, and specifically comprises:
determining a binary cost matrix according to the graph structures mapped by different data modes;
and simplifying the binary cost matrix to construct an SFBP cost matrix.
4. The data integration method based on the approximate graph matching algorithm as claimed in claim 1, wherein the optimal matching sequence of the nodes in the graph structure is determined according to the cost matrix between the graph structures; and determining a graph editing distance according to the optimal matching sequence, which specifically comprises the following steps:
introducing an element A_ij replacement rule into the cost matrix; the element A_ij replacement rule is that, for the same two nodes, the cost of inserting first and then deleting is the same as the cost of deleting first and then inserting; wherein i denotes a deletion operation on the i-th node in one graph structure, and j denotes inserting the j-th node of another graph structure g into that graph structure;
and determining an optimal matching sequence by applying a GLA algorithm to the cost matrix obtained after the element A_ij replacement rule has been applied.
5. The data integration method based on the approximate graph matching algorithm according to claim 1, wherein the integrating the nodes at the corresponding positions between the graph structures according to the optimal matching sequence and the graph editing distance specifically comprises:
judging whether the graph editing distance is larger than a distance threshold value;
if the graph editing distance is larger than the distance threshold, the different data modes do not match, and data integration cannot be carried out;
and if the graph editing distance is smaller than or equal to the distance threshold, integrating the graph according to the optimal matching sequence.
6. A data integration system based on an approximate graph matching algorithm, comprising:
the graph structure determining module is used for respectively carrying out graph mapping on different data modes and determining corresponding graph structures;
the cost matrix determining module is used for carrying out graph matching on the mapped graphs by utilizing an approximate graph matching algorithm and determining a cost matrix between graph structures;
the optimal matching sequence and graph editing distance determining module is used for determining the optimal matching sequence of the nodes in the graph structure according to the cost matrix between the graph structures; determining a graph editing distance according to the optimal matching sequence;
and the data integration completion module is used for integrating the nodes at the corresponding positions between the graph structures according to the optimal matching sequence and the graph editing distance.
7. The approximate graph matching algorithm-based data integration system of claim 6, wherein the graph structure determination module specifically comprises:
and the Graph structure determining unit is used for respectively carrying out Graph mapping on different data modes by adopting an SQL2Graph tool and determining corresponding Graph structures.
8. The approximate graph matching algorithm-based data integration system according to claim 6, wherein the cost matrix determination module specifically comprises:
the binary cost matrix determining unit is used for determining a binary cost matrix according to the graph structures mapped by different data modes;
and the SFBP cost matrix determining unit is used for simplifying the binary cost matrix to construct the SFBP cost matrix.
9. The approximate graph matching algorithm-based data integration system of claim 6, wherein the optimal matching sequence and graph edit distance determining module specifically comprises:
an element A_ij replacement rule introducing unit, used for introducing the element A_ij replacement rule into the cost matrix; the element A_ij replacement rule is that, for the same two nodes, the cost of inserting first and then deleting is the same as the cost of deleting first and then inserting; wherein i denotes a deletion operation on the i-th node in one graph structure, and j denotes inserting the j-th node of another graph structure g into that graph structure;
an optimal matching sequence determining unit, used for determining an optimal matching sequence by applying a GLA algorithm to the cost matrix obtained after the element A_ij replacement rule has been applied.
10. The approximate graph matching algorithm-based data integration system according to claim 6, wherein the data integration completion module specifically comprises:
a first judgment unit configured to judge whether the graph edit distance is greater than a distance threshold;
if the graph editing distance is larger than the distance threshold, the different data modes do not match, and data integration cannot be carried out;
and if the graph editing distance is smaller than or equal to the distance threshold, integrating the graph according to the optimal matching sequence.
CN202111371583.5A 2021-11-18 2021-11-18 Data integration method and system based on approximate graph matching algorithm Active CN114048240B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111371583.5A CN114048240B (en) 2021-11-18 2021-11-18 Data integration method and system based on approximate graph matching algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111371583.5A CN114048240B (en) 2021-11-18 2021-11-18 Data integration method and system based on approximate graph matching algorithm

Publications (2)

Publication Number Publication Date
CN114048240A true CN114048240A (en) 2022-02-15
CN114048240B CN114048240B (en) 2024-06-14

Family

ID=80209857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111371583.5A Active CN114048240B (en) 2021-11-18 2021-11-18 Data integration method and system based on approximate graph matching algorithm

Country Status (1)

Country Link
CN (1) CN114048240B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060212860A1 (en) * 2004-09-30 2006-09-21 Benedikt Michael A Method for performing information-preserving DTD schema embeddings
CN106909643A (en) * 2017-02-20 2017-06-30 同济大学 The social media big data motif discovery method of knowledge based collection of illustrative plates
CN107609592A (en) * 2017-09-15 2018-01-19 桂林电子科技大学 A kind of figure edit distance approach towards Letter identification
CN108710663A (en) * 2018-05-14 2018-10-26 北京大学 A kind of data matching method and system based on ontology model
CN108960437A (en) * 2018-07-31 2018-12-07 佛山科学技术学院 Unbalanced data learning method based on industry manufacture big data
CN111461196A (en) * 2020-03-27 2020-07-28 上海大学 Method and device for identifying and tracking fast robust image based on structural features
CN111783879A (en) * 2020-07-01 2020-10-16 中国人民解放军国防科技大学 Hierarchical compression map matching method and system based on orthogonal attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
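冶运涛;王兴奎;蒋云钟;李丹勋;姜晓明: "Research on a multi-way-tree-based contour filling algorithm for the spatio-temporal distribution of rainfall in virtual environments", Journal of Hydroelectric Engineering (水力发电学报), no. 05, 25 October 2011 (2011-10-25) *
范威振;陈占芳;刘燕龙: "Research on a holistic entity unification algorithm based on multi-dimensional similarity", Journal of Changchun University of Science and Technology (Natural Science Edition) (长春理工大学学报(自然科学版)), vol. 42, no. 004, 31 December 2019 (2019-12-31) *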

Also Published As

Publication number Publication date
CN114048240B (en) 2024-06-14

Similar Documents

Publication Publication Date Title
US11475034B2 (en) Schemaless to relational representation conversion
US10515090B2 (en) Data extraction and transformation method and system
US7890518B2 (en) Method for creating a scalable graph database
US7606817B2 (en) Primenet data management system
US7822784B2 (en) Data cells and data cell generations
EP3435256B1 (en) Optimal sort key compression and index rebuilding
CN104866593A (en) Database searching method based on knowledge graph
Lbath et al. Schema inference for property graphs
US7159171B2 (en) Structured document management system, structured document management method, search device and search method
US20080294673A1 (en) Data transfer and storage based on meta-data
US10235422B2 (en) Lock-free parallel dictionary encoding
CN112596719B (en) Method and system for generating front-end and back-end codes
CN112148735B (en) Construction method for structured form data knowledge graph
CN106933844B (en) Construction method of reachability query index facing large-scale RDF data
CN110389953B (en) Data storage method, storage medium, storage device and server based on compression map
CN111984649A (en) Data index searching method and device and related equipment
CN114048240A (en) Data integration method and system based on approximate graph matching algorithm
US20180144060A1 (en) Processing deleted edges in graph databases
CN117033534A (en) Geographic information processing method, device, computer equipment and storage medium
CN109460467B (en) Method for constructing network information classification system
CN113407538B (en) Incremental acquisition method for data of multi-source heterogeneous relational database
CN110390024B (en) Family tree data processing method and device and processor
Ferrada et al. Similarity joins and clustering for SPARQL
CN116881262B (en) Intelligent multi-format digital identity mapping method and system
CN112749301B (en) Keyword query method for fuzzy XML (extensive makeup language) of massive remote sensing metadata

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant