CN113469280A - Data lineage discovery method, system and device based on graph neural network - Google Patents

Data lineage discovery method, system and device based on graph neural network

Info

Publication number
CN113469280A
Authority
CN
China
Prior art keywords
layer
clustering
sample set
neural network
output sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110830737.6A
Other languages
Chinese (zh)
Other versions
CN113469280B (en)
Inventor
黄勋 (Huang Xun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fiberhome Telecommunication Technologies Co Ltd
Original Assignee
Fiberhome Telecommunication Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fiberhome Telecommunication Technologies Co Ltd filed Critical Fiberhome Telecommunication Technologies Co Ltd
Priority to CN202110830737.6A priority Critical patent/CN113469280B/en
Publication of CN113469280A publication Critical patent/CN113469280A/en
Application granted granted Critical
Publication of CN113469280B publication Critical patent/CN113469280B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23213 Non-hierarchical clustering techniques using statistics or function optimisation, e.g. modelling of probability density functions, with a fixed number of clusters, e.g. K-means clustering
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of big data platform construction, and provides a data lineage discovery method, system and device based on a graph neural network. The method acquires the table records newly added to each table in the same batch; the newly added table records of the same batch pass through a training layer and are converted into an output sample set; the output sample set is clustered to obtain clustering pseudo labels; the output sample set is classified according to the clustering pseudo labels to obtain a classification result, and an average classification error is computed from the classification result and the clustering pseudo labels; if the average classification error is greater than or equal to an error threshold, the training layer is updated by back-propagation. The sampling training process is executed in a loop, iteratively updating the training layer, until the average classification error is smaller than the error threshold, whereupon the attention layer is taken out of the training layer to obtain the lineage relationships among the tables.

Description

Data lineage discovery method, system and device based on graph neural network
[ technical field ]
The invention relates to the technical field of big data platform construction, and in particular to a data lineage discovery method, system and device based on a graph neural network.
[ background of the invention ]
As network scale keeps growing, networks generate ever larger volumes of data, giving rise to the industrial reality of network big data. The need to collect, manage and analyze network big data is increasingly urgent, which in turn places higher demands on its quality.
Network big data is stored in different tables of a database, and the processes of data generation, processing and circulation naturally form various relationships among the data, known as data lineage. The lineage relationships between tables and between fields help to analyze whether a table design is reasonable, to assess the impact of upstream data changes on downstream data, and to trace a downstream anomaly back to its upstream source. Data lineage is thus a valuable means of data governance, which makes data lineage discovery technology particularly important.
At present, data lineage discovery techniques fall into four main categories, each with its own drawbacks:
(1) Building dependency fields into the tables, writing the upstream and downstream dependencies directly into the database so as to construct the lineage between tables. The drawback is that the table structure must be changed and the approach depends heavily on the data processing components.
(2) Parsing files to disassemble the DDL of each table and obtain the source-table and destination-table information of the data. The drawback is that this fails whenever the DDL files cannot be obtained, for example due to permission restrictions.
(3) Obtaining the upstream and downstream relationships of tables from the task scheduling system used when the tables were built. The drawback is the reliance on additional task scheduling information: the scheduling process is complex to configure in practice and is constrained by third-party components.
(4) Combing out the data lineage manually. Although this can map the lineage relationships quite finely, it is time-consuming and labor-intensive.
In view of the above, overcoming the drawbacks of the prior art is an urgent problem in the art.
[ summary of the invention ]
The technical problem to be solved by the invention is as follows:
in the prior art, scenarios arise in which the original construction information and the data flow information of the tables are missing; in such cases neither the degree of association between tables nor their direction (i.e. which table is the source and which is the destination) can be obtained.
The invention achieves the above purpose by the following technical scheme:
In a first aspect, the invention provides a data lineage discovery method based on a graph neural network, wherein a sampling training process comprises the step of obtaining the table records newly added to each table in the same batch;
the newly added table records of the same batch are converted into an output sample set after passing through a training layer;
the output sample set is clustered to obtain clustering pseudo labels;
the output sample set is classified according to the clustering pseudo labels to obtain a classification result, an average classification error is computed from the classification result and the clustering pseudo labels, and if the average classification error is greater than or equal to an error threshold, the training layer is updated by back-propagation;
and the sampling training process is executed in a loop, iteratively updating the training layer, until the average classification error is smaller than the error threshold, whereupon the attention layer is taken out of the training layer so as to obtain the lineage relationships among the tables.
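Expressed as control flow, the claimed sampling training loop looks roughly like the sketch below; train, toy_step and the threshold value are illustrative stand-ins for the patent's components, not its implementation.

```python
# Control-flow sketch of the sampling training loop (assumed names; a real
# step would run the training layer, cluster, classify, and back-propagate).
ERROR_THRESHOLD = 0.1  # illustrative value; the patent leaves the threshold open

def train(step_fn, max_iters=1000):
    """Repeat the sampling training step until the average classification
    error drops below the threshold, then return the stand-in attention layer."""
    for _ in range(max_iters):
        avg_error, attention_layer = step_fn()
        if avg_error < ERROR_THRESHOLD:
            return attention_layer  # its weights encode the table lineage
    raise RuntimeError("did not converge within max_iters")

# Toy step whose error halves each call, standing in for one full pass
state = {"err": 1.0}
def toy_step():
    state["err"] /= 2
    return state["err"], "attention-weights"

result = train(toy_step)
```

The loop structure mirrors the claim: back-propagation is assumed to happen inside the step whenever the error is still at or above the threshold.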
Preferably, the acquiring new table records of each table in the same batch specifically includes:
arbitrarily selecting one table from among the tables and inserting one table record into the selected table, whereupon one table record may be correspondingly generated in each of the other tables;
and taking the table record inserted into the selected table each time, together with the table records correspondingly generated in the other tables, as the newly added table records of each table in the same batch.
Preferably, the output sample set is classified according to the clustering pseudo labels to obtain classification results, and an average classification error is obtained according to the classification results and the clustering pseudo labels, and the method specifically includes:
the classification result comprises a probability set corresponding to each output sample in the output sample set under each clustering pseudo label;
converting each clustering pseudo label into a corresponding coding vector;
and obtaining an average classification error according to the probability set corresponding to each output sample under each clustering pseudo label in the output sample set and the coding vector corresponding to the clustering pseudo label of the corresponding output sample.
Preferably, each table is converted into an output sample set after a newly added table record in the same batch passes through a training layer, and the method specifically includes:
the training layer comprises one or more of an input layer and an intermediate layer, wherein the intermediate layer comprises an attention layer;
and the newly added table records in the same batch are converted into an input sample set after passing through an input layer, and the input sample set is converted into an output sample set after passing through an intermediate layer.
Preferably, the input sample set is embodied as a homogeneous matrix.
Preferably, the intermediate layer further comprises one or more of a ReLU layer, a regularization layer, and an output layer.
Preferably, the output sample set is a high-dimensional feature matrix.
In a second aspect, the present invention provides a data lineage discovery system based on a graph neural network, comprising: a graph neural network and a clustering device;
clustering device: clustering and obtaining clustering pseudo labels;
graph neural network: for acquiring the table records newly added to each table in the same batch;
converting the newly added table records of each table in the same batch into an output sample set through the training layer of the graph neural network;
classifying the output sample set according to the clustering pseudo labels to obtain a classification result, computing an average classification error from the classification result and the clustering pseudo labels, and, if the average classification error is greater than or equal to an error threshold, updating the training layer by back-propagation; and iteratively updating the training layer until the average classification error is smaller than the error threshold, then taking the attention layer out of the training layer so as to obtain the lineage relationships among the tables.
Preferably, the training layer further comprises an input layer;
the input layer: for converting the table records newly added to each table in the same batch into an input sample set in homogeneous-matrix form.
In a third aspect, the present invention further provides a data lineage discovery device based on a graph neural network, comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor and programmed to perform the data lineage discovery method based on a graph neural network of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
The method relies only on the tables themselves: the table records newly added to each table in the same batch are acquired repeatedly, the sampling training process is executed in a loop, and the training layer of the graph neural network is iteratively updated until the average classification error is smaller than the error threshold, at which point the attention layer is taken out of the training layer, yielding the lineage relationships among the tables (i.e. the degree of association between tables and their direction: which table is the source and which is the destination). The method can therefore discover the lineage between tables even in scenarios where the DDL cannot be obtained for permission reasons or the task scheduling process is inaccessible.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is an architecture diagram of a data lineage discovery system based on a graph neural network according to an embodiment of the present invention;
FIG. 2 is a graph model used in a data lineage discovery method based on a graph neural network according to an embodiment of the present invention;
FIG. 3 is a flowchart of a data lineage discovery method based on a graph neural network according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of table records added in the same batch in a data lineage discovery method based on a graph neural network according to an embodiment of the present invention;
FIG. 5 is a flowchart of a data lineage discovery method based on a graph neural network according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the data in the attention layer obtained by a data lineage discovery method based on a graph neural network according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a graph model obtained from the data in the attention layer in a data lineage discovery method based on a graph neural network according to an embodiment of the present invention;
fig. 8 is an architecture diagram of a data lineage discovery apparatus based on a graph neural network according to an embodiment of the present invention.
[ detailed description ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the description of the present invention, the terms "inner", "outer", "longitudinal", "lateral", "upper", "lower", "top", "bottom", and the like indicate orientations or positional relationships based on those shown in the drawings, and are for convenience only to describe the present invention without requiring the present invention to be necessarily constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Example 1:
The embodiment of the invention provides a data lineage discovery system based on a graph neural network, comprising a graph neural network and a clustering device.
The clustering device: for clustering and obtaining clustering pseudo labels.
The graph neural network: for acquiring the table records newly added to each table in the same batch and converting them into an output sample set through the training layer of the graph neural network. The output sample set is supplied both to the classifier and to the clustering device. Taking the clustering pseudo labels produced by the clustering device as the standard, the classifier classifies the output sample set and generates an average classification error; the training layer is updated by back-propagating this error, realizing iterative updating of the training layer. When the average classification error is smaller than the error threshold, the attention layer is taken out of the training layer, yielding the lineage relationships among the tables. The output sample set is a set of high-dimensional feature matrices.
This embodiment provides a practically implementable arrangement, as shown in fig. 1. Specifically, the data lineage discovery system based on a graph neural network comprises the graph neural network and the clustering device. The graph neural network may specifically be a graph attention network and comprises a training layer and a classifier; the training layer comprises an input layer and an intermediate layer, and besides the attention layer the intermediate layer further comprises one or more of a ReLU layer, a regularization layer and an output layer.
The input layer: the input matrices are heterogeneous; after passing through the input layer they become the input sample set, which is a set of homogeneous matrices (i.e. matrices of uniform shape).
The intermediate layer: the input sample set is converted into the output sample set by the intermediate layer of the graph neural network and emitted by the output layer of the intermediate layer; the output sample set is specifically a set of high-dimensional feature matrices. The output sample set is clustered by the clustering device to obtain clustering pseudo labels; the classifier in the graph neural network classifies according to these pseudo labels to obtain a classification result, from which the average classification error is computed. If the average classification error is greater than or equal to the error threshold, the training layer in the graph neural network is updated by back-propagation; the training layer is thus updated iteratively until the average classification error is smaller than the error threshold, at which point the attention layer is taken out of the training layer, yielding the lineage relationships among the tables.
The classifier: classifies the output sample set against the clustering pseudo labels obtained by the clustering device to generate the average classification error, and updates the training layer by back-propagating this error, thereby realizing iterative updating of the training layer.
Example 2:
This embodiment also provides a data lineage discovery method based on a graph neural network; as shown in fig. 3, the sampling training process comprises:
Step 10, acquiring the table records newly added to each table in the same batch;
This embodiment uses machine learning over a graph model. As shown in fig. 2, each table is regarded as a node in the graph, and the association between tables is regarded as an edge between nodes. First, one table is arbitrarily selected from among the tables and a table record is inserted into it; because of this insertion, a table record is correspondingly generated in each of the other tables. The record inserted into the selected table, together with the records correspondingly generated in the other tables, constitutes the table records newly added to each table in the same batch.
This embodiment provides an implementable example. Specifically:
Assume the graph contains 4 nodes (i.e. 4 tables), as shown in fig. 1: Table1, Table2, Table3 and Table4, where T1_C1, T1_C2, T1_C3, T2_C1, T2_C2, T3_C1, T3_C2, T3_C3, T3_C4, T4_C1, T4_C2 and T4_C3 denote the field names of the tables;
wherein, the structure of Table1 is:
T1_C1 T1_C2 T1_C3
the structure of Table2 is:
T2_C1 T2_C2
the structure of Table3 is:
T3_C1 T3_C2 T3_C3 T3_C4
the structure of Table4 is:
T4_C1 T4_C2 T4_C3
One table is arbitrarily selected from the 4 tables; assume the selected table is Table1, and a table record (x1, x2, x3) is inserted into it. Because (x1, x2, x3) is inserted into Table1, a table record is correspondingly generated in each of the other tables; for any table in which the insertion of (x1, x2, x3) generates no record, a record whose every field is 0 is substituted, so as to preserve the completeness of the sample record. Assume that after the record (x1, x2, x3) is inserted into Table1, the tables Table1, Table2, Table3 and Table4 become:
table1 is:
T1_C1 T1_C2 T1_C3
x1 x2 x3
table2 is:
T2_C1 T2_C2
y1 y2
table3 is:
T3_C1 T3_C2 T3_C3 T3_C4
z1 z2 z3 z4
table4 is:
T4_C1 T4_C2 T4_C3
v1 v2 v3
The record inserted into Table1 and the records generated in Table2, Table3 and Table4 are (x1, x2, x3), (y1, y2), (z1, z2, z3, z4) and (v1, v2, v3), and these 4 table records are taken as the newly added table records of the same batch. For any table in which the insertion of (x1, x2, x3) generates no record, an all-zero record is substituted; for example, if Table2 had generated nothing, (y1, y2) would be (0, 0). This is illustrative only and does not limit the invention.
The input layer receives these 4 table records as the newly added records of the same batch. This is merely an example: a table's structure may contain any number of fields, and any number of tables may be involved; the tables of this embodiment do not limit the invention.
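The zero-record substitution described above can be sketched as follows; the table names and widths are taken from the example, while the helper function and its names are assumed illustrations rather than the patent's code.

```python
# Assemble one same-batch sample: the selected table gets the inserted record,
# tables that generated a record contribute it, and any table that generated
# nothing contributes an all-zero record of its own width.
TABLE_WIDTHS = {"Table1": 3, "Table2": 2, "Table3": 4, "Table4": 3}

def same_batch_records(selected, inserted, generated):
    batch = {}
    for name, width in TABLE_WIDTHS.items():
        if name == selected:
            batch[name] = list(inserted)
        elif name in generated:
            batch[name] = list(generated[name])
        else:
            batch[name] = [0] * width  # zero record preserves batch completeness
    return batch

# Insert into Table1; suppose only Table3 generated a record this time
batch = same_batch_records("Table1", (1.0, 2.0, 3.0),
                           {"Table3": (0.5, 0.6, 0.7, 0.8)})
```

Here Table2 and Table4 produced nothing, so they appear in the batch as zero records of widths 2 and 3 respectively.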
Step 20, converting the newly added table records of each table in the same batch into an output sample set after passing through a training layer;
This conversion specifically comprises the following. The training layer comprises an input layer and an intermediate layer, the intermediate layer containing an attention layer; the newly added table records of the same batch are converted into an input sample set by the input layer, and the input sample set is converted into the output sample set by the intermediate layer. The input sample set produced by the input layer is a set of homogeneous matrices. First, the input layer converts the newly added table records of the same batch into input matrices; because the tables may have different structures, these input matrices have differing shapes (i.e. they are heterogeneous matrices). Each input matrix is then multiplied by a corresponding parameter matrix Wi in the input layer to obtain homogeneous matrices of uniform shape, yielding the input sample set; the parameter matrices Wi are updatable.
This embodiment provides a practically implementable example, as shown in fig. 4. Specifically:
Taking the 4 table records (x1, x2, x3), (y1, y2), (z1, z2, z3, z4) and (v1, v2, v3) as the newly added records of the same batch, each is converted into a corresponding input matrix, namely the column vectors (x1, x2, x3)^T, (y1, y2)^T, (z1, z2, z3, z4)^T and (v1, v2, v3)^T, whose heights differ from table to table.
The shapes of these input matrices are evidently not uniform, i.e. they are heterogeneous matrices. Each is multiplied by its corresponding parameter matrix Wi in the input layer to obtain homogeneous matrices of uniform shape; converting the heterogeneous matrices into homogeneous ones makes the subsequent intermediate-layer computations of the graph neural network convenient. For example, for the input matrix (x1, x2, x3)^T an n × 3 parameter matrix W1 is selected and multiplied with it:
Input1 = W1 · (x1, x2, x3)^T
The product of W1 and the input matrix (x1, x2, x3)^T is denoted Input1, an n × 1 matrix;
Similarly, the input matrix (y1, y2)^T requires an n × 2 parameter matrix W2, and W2 · (y1, y2)^T gives the n × 1 matrix denoted Input2; the input matrix (z1, z2, z3, z4)^T requires an n × 4 parameter matrix W3, and W3 · (z1, z2, z3, z4)^T gives the n × 1 matrix denoted Input3; the input matrix (v1, v2, v3)^T requires an n × 3 parameter matrix W4, and W4 · (v1, v2, v3)^T gives the n × 1 matrix denoted Input4. The input sample set is then (Input1, Input2, Input3, Input4), a set of homogeneous matrices.
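A minimal numpy sketch of this input-layer projection; the value of n and the random Wi entries are assumptions, and only the shapes follow the text.

```python
import numpy as np

n = 8  # shared output height; the patent leaves n unspecified
rng = np.random.default_rng(0)

# The four same-batch records, with differing widths per table
records = [np.array([1.0, 2.0, 3.0]),        # Table1: (x1, x2, x3)
           np.array([4.0, 5.0]),             # Table2: (y1, y2)
           np.array([6.0, 7.0, 8.0, 9.0]),   # Table3: (z1, z2, z3, z4)
           np.array([1.0, 0.0, 1.0])]        # Table4: (v1, v2, v3)

# One updatable parameter matrix Wi of shape n x d_i per table
W = [rng.standard_normal((n, r.size)) for r in records]

# Input_i = Wi @ record_i^T : every product is an n x 1 homogeneous matrix
inputs = [Wi @ r.reshape(-1, 1) for Wi, r in zip(W, records)]
```

Whatever the original widths (3, 2, 4, 3), every element of the resulting input sample set has the same n × 1 shape, which is the homogenization the text describes.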
The input sample set is converted into the output sample set by the intermediate layer of the graph neural network, and the output sample set is emitted by the output layer; it is specifically a set of high-dimensional feature matrices. In essence, the input sample set undergoes several rounds of matrix multiplication and nonlinear transformation inside the intermediate layer. For example, the input sample set (Input1, Input2, Input3, Input4) consists of homogeneous n × 1 matrices when fed into the graph neural network; after computation with the intermediate layer, the output layer of the graph neural network emits the high-dimensional feature matrices (i.e. the output sample set). The intermediate layer comprises the attention layer, a ReLU layer, a regularization layer and the output layer, and the attention layer records the lineage relationships between tables. The resulting set of high-dimensional feature matrices (Output1, Output2, Output3, Output4) may, for example, each be 1 × 512, such as:
Output1=[-0.043624,0.004513,0.062156,-0.013981,...,0.028901]
Output2=[0.070011,0.000111,0.000002,-0.000044,...,-0.001604]
Output3=[0.015643,-0.092466,-0.001992,0.003929,...,-0.002443]
Output4=[0.000901,-0.012801,0.0103378,-0.000066,...,0.010568]
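The patent does not give the attention layer's formulas, so the sketch below substitutes a simplified GAT-style attention as an assumption; it shows how per-pair attention weights, read here as table-to-table relation strengths, arise from node features.

```python
import numpy as np

def softmax_rows(z):
    # Numerically stable row-wise softmax
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def attention_layer(H, W, a):
    """H: (num_tables, d_in) node features; W: (d_in, d_out) projection;
    a: (2*d_out,) scoring vector. Returns new node features and the
    attention matrix whose entry (i, j) is the weight of edge i -> j."""
    Z = H @ W
    n = Z.shape[0]
    scores = np.array([[np.concatenate([Z[i], Z[j]]) @ a for j in range(n)]
                       for i in range(n)])
    A = softmax_rows(scores)  # each row is a normalized attention distribution
    return A @ Z, A

rng = np.random.default_rng(1)
H = rng.standard_normal((4, 8))              # 4 tables, 8-dim input features
out, A = attention_layer(H, rng.standard_normal((8, 16)),
                         rng.standard_normal(32))
```

After training, reading off A (rather than the features) is what "taking out the attention layer" would correspond to in this simplified picture.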
step 30, clustering the output sample set to obtain a clustering pseudo label;
The graph neural network feeds the output sample set to the clustering device for clustering; the clustering device may be a K-means clusterer. In essence, clustering groups the output samples (i.e. the high-dimensional feature matrices) of the output sample set into several data clusters in the high-dimensional space, the centroid distributions of the clusters differing from one another; each data cluster is marked with a different clustering pseudo label, and output samples falling in the same cluster receive the same pseudo label. To assign stable pseudo labels across batches, this embodiment labels the output sample sets of successive batches according to the centroid distances of the data clusters: after the output sample set of a new batch is clustered into new data clusters, the distance from each new centroid to the centroids of the previous batch's clusters is computed, and each new cluster takes the pseudo label of the previous-batch cluster whose centroid is nearest.
This embodiment provides a practically realizable example. Specifically:
Assume the output sample set is (Output1, Output2, Output3, Output4). After clustering by the clustering device it forms 4 data clusters, and the clusters are marked with different clustering pseudo labels; assume the cluster containing Output1 is labeled 3, the cluster containing Output2 is labeled 1, the cluster containing Output3 is labeled 2, and the cluster containing Output4 is labeled 4.
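The centroid-distance rule for keeping pseudo labels stable across batches can be sketched directly; the centroid coordinates below are made-up toy data, not values from the patent.

```python
import numpy as np

def relabel(new_centroids, prev_centroids, prev_labels):
    """Each new cluster inherits the pseudo label of the previous batch's
    cluster whose centroid is nearest in the high-dimensional space."""
    labels = []
    for c in new_centroids:
        d = np.linalg.norm(prev_centroids - c, axis=1)
        labels.append(prev_labels[int(np.argmin(d))])
    return labels

prev_centroids = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
prev_labels = [1, 2, 3]
# The new batch's clusters drifted slightly but should keep their old labels
new_centroids = np.array([[9.5, 0.4], [0.2, 0.1], [0.3, 9.8]])
labels = relabel(new_centroids, prev_centroids, prev_labels)
```

The first new centroid is nearest the old label-2 centroid, the second nearest label 1, the third nearest label 3, so the labels survive the re-clustering.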
Step 40, classifying the output sample set according to the clustering pseudo labels to obtain a classification result, obtaining an average classification error according to the classification result and the clustering pseudo labels, and if the average classification error is greater than or equal to an error threshold value, reversely propagating and updating a training layer;
the output layer outputs the output sample set to the clustering device for clustering to obtain the clustering pseudo labels, and the clustering device outputs the pseudo labels to the classifier of the graph neural network; the classifier may specifically be a softmax classifier. The classifier in the graph neural network classifies the output sample set with the clustering pseudo labels as the standard, obtaining a probability set for each output sample in the output sample set under each clustering pseudo label (i.e., the classification result). The classifier also converts each clustering pseudo label into the corresponding one-hot encoding vector, and obtains the average classification error from the probability set of each output sample under each clustering pseudo label and the encoding vector corresponding to the pseudo label of that output sample.
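The probability sets produced by a softmax classifier can be sketched as below; the raw per-label scores are made up for illustration and are not values from the patent.

```python
import math

def softmax(scores):
    # Convert raw per-label scores into a probability set that sums to 1.
    m = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores of one output sample for pseudo labels 1..4.
probs = softmax([2.0, 1.2, 1.5, 0.5])
print([round(p, 3) for p in probs])
print(abs(sum(probs) - 1.0) < 1e-9)  # the probability set sums to 1
```

Each output sample thus yields one probability per pseudo label, which is exactly the "probability set under each clustering pseudo label" used in the error computation below.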
This embodiment provides a way this can be realized in an actual scene, specifically:
assume that the Output sample set is (Output1, Output2, Output3, Output4), where,
Output1=[-0.043624,0.004513,0.062156,-0.013981,...,0.028901]
Output2=[0.070011,0.000111,0.000002,-0.000044,...,-0.001604]
Output3=[0.015643,-0.092466,-0.001992,0.003929,...,-0.002443]
Output4=[0.000901,-0.012801,0.0103378,-0.000066,...,0.010568]
after clustering by the clustering device, the clustering pseudo labels of the output samples Output1, Output2, Output3 and Output4 of the output sample set are 3, 1, 2 and 4 respectively. The pseudo labels 3, 1, 2 and 4 are output to the classifier of the graph neural network, and the classifier classifies the output samples in the output sample set with the pseudo labels as the standard, obtaining a probability set for each output sample under each clustering pseudo label, for example:
the probabilities of Output1 under clustering pseudo labels 1, 2, 3 and 4 are 0.44, 0.20, 0.26 and 0.10 respectively, so the probability set is (0.44, 0.20, 0.26, 0.10) and its corresponding vector is [0.44, 0.20, 0.26, 0.10];
the probabilities of Output2 under clustering pseudo labels 1, 2, 3 and 4 are 0.41, 0.19, 0.28 and 0.12 respectively, so the probability set is (0.41, 0.19, 0.28, 0.12) and its corresponding vector is [0.41, 0.19, 0.28, 0.12];
the probabilities of Output3 under clustering pseudo labels 1, 2, 3 and 4 are 0.27, 0.53, 0.11 and 0.09 respectively, so the probability set is (0.27, 0.53, 0.11, 0.09) and its corresponding vector is [0.27, 0.53, 0.11, 0.09];
the probabilities of Output4 under clustering pseudo labels 1, 2, 3 and 4 are 0.14, 0.35, 0.26 and 0.25 respectively, so the probability set is (0.14, 0.35, 0.26, 0.25) and its corresponding vector is [0.14, 0.35, 0.26, 0.25];
each clustering pseudo label is then converted into the corresponding encoding vector, for example:
the encoding vector of the clustering pseudo label 3 of Output1 is [0, 0, 1, 0];
the encoding vector of the clustering pseudo label 1 of Output2 is [1, 0, 0, 0];
the encoding vector of the clustering pseudo label 2 of Output3 is [0, 1, 0, 0];
the encoding vector of the clustering pseudo label 4 of Output4 is [0, 0, 0, 1];
a classification error is calculated for each output sample from the vector corresponding to its probability set under each clustering pseudo label and the encoding vector corresponding to its clustering pseudo label, and the average classification error is then obtained from these classification errors. Taking the classification error as the Euclidean distance between the probability vector and the encoding vector:
the classification error of Output1 is sqrt((0.44-0)^2 + (0.20-0)^2 + (0.26-1)^2 + (0.10-0)^2) ≈ 0.8895;
the classification error of Output2 is sqrt((0.41-1)^2 + (0.19-0)^2 + (0.28-0)^2 + (0.12-0)^2) ≈ 0.6906;
the classification error of Output3 is sqrt((0.27-0)^2 + (0.53-1)^2 + (0.11-0)^2 + (0.09-0)^2) ≈ 0.5604;
the classification error of Output4 is sqrt((0.14-0)^2 + (0.35-0)^2 + (0.26-0)^2 + (0.25-1)^2) ≈ 0.8787.
The average classification error calculated from the classification errors is approximately 0.7548.
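The arithmetic of this example can be checked numerically. Reading the classification error of each output sample as the Euclidean distance between its probability vector and the one-hot encoding vector of its pseudo label (an interpretation consistent with the stated 0.7548), a short script reproduces the average:

```python
import math

def one_hot(label, num_labels):
    # One-hot encoding vector for a pseudo label in 1..num_labels.
    return [1.0 if i == label else 0.0 for i in range(1, num_labels + 1)]

def classification_error(probs, label):
    # Euclidean distance between the probability vector and the one-hot vector.
    target = one_hot(label, len(probs))
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(probs, target)))

samples = [([0.44, 0.20, 0.26, 0.10], 3),   # Output1, pseudo label 3
           ([0.41, 0.19, 0.28, 0.12], 1),   # Output2, pseudo label 1
           ([0.27, 0.53, 0.11, 0.09], 2),   # Output3, pseudo label 2
           ([0.14, 0.35, 0.26, 0.25], 4)]   # Output4, pseudo label 4

errors = [classification_error(p, lbl) for p, lbl in samples]
avg = sum(errors) / len(errors)
print(round(avg, 4))  # -> 0.7548
```

The four per-sample errors (about 0.8895, 0.6906, 0.5604 and 0.8787) average to roughly 0.7548, matching the figure in the text.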
The average classification error here is about 0.7548; if the error threshold is 10^-5, the average classification error is greater than the error threshold, so the classifier back-propagates to update the parameters of the training layer of the graph neural network (i.e., trains the input layer and the intermediate layer of the graph neural network).
And step 50, executing the sampling training process cyclically multiple times and iteratively updating the training layer until the average classification error is smaller than the error threshold, then taking out the attention layer in the training layer so as to obtain the blood relationship among the tables.
When the average classification error is greater than or equal to the error threshold, the classifier back-propagates to update the parameters of the training layer of the graph neural network, and after the parameters are updated, steps 10 to 40 are repeated on a new batch of table records. Specifically, assume a new table record is inserted into the selected Table1; because a record is inserted into Table1, each of the other tables correspondingly regenerates a table record, and the record newly inserted into Table1 together with the records newly regenerated by the other tables is input, as a new batch of table records, into the graph neural network-based data blood margin discovery system provided in this embodiment (since these records are all added at the same time, the present invention refers to them as the table records newly added to each table in the same batch). The sampling training process is repeated in this way, iteratively updating the training layer, until the average classification error is smaller than the error threshold, and the attention layer in the training layer is then taken out to obtain the blood relationship between the tables. Assume that after the training layer has gone through the sampling training process M times, the average classification error obtained by the classifier is smaller than the error threshold 10^-5; the attention layer of the graph neural network is then taken out as the representation of the degree and direction of association between the tables (i.e., the blood relationship between the tables). Assume the data of the extracted attention layer is as shown in figs. 6-7: if the value in the first row and second column is 0.8, the degree of association between Table1 and Table2 is 0.8, a strong dependence, and from this data Table2 is known to be the source and Table1 the target. The blood relationships between the other tables are derived in the same way.
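Reading lineage out of the extracted attention layer can be sketched as follows; apart from the 0.8 entry from the example, the matrix values and the 0.5 threshold are invented for illustration, and the row-is-target / column-is-source convention follows the text.

```python
tables = ["Table1", "Table2", "Table3", "Table4"]

# Hypothetical attention matrix: entry [i][j] is the degree of association
# with tables[j] as the source and tables[i] as the target
# (first row, second column = 0.8, as in the example).
attn = [[0.0, 0.8, 0.1, 0.0],
        [0.1, 0.0, 0.0, 0.2],
        [0.7, 0.1, 0.0, 0.0],
        [0.0, 0.0, 0.9, 0.0]]

def lineage_edges(matrix, names, threshold=0.5):
    # Keep only strong dependencies, as (source, target, degree) triples.
    edges = []
    for i, row in enumerate(matrix):
        for j, degree in enumerate(row):
            if degree >= threshold:
                edges.append((names[j], names[i], degree))
    return edges

print(lineage_edges(attn, tables))
# -> [('Table2', 'Table1', 0.8), ('Table1', 'Table3', 0.7), ('Table3', 'Table4', 0.9)]
```

Each retained entry is one directed lineage edge; the 0.8 entry reproduces the "Table2 is the source, Table1 is the target" reading of the example.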
The acquiring of the table records newly added to each table in the same batch specifically comprises: arbitrarily selecting one table from the tables; each time one table record is inserted into the selected table, each of the other tables correspondingly generates one table record;
this embodiment provides a way this can be realized in an actual scene, specifically:
first, one table is arbitrarily selected from the tables and a table record is inserted into it; because a table record is inserted into the selected table, a table record is correspondingly generated in each of the other tables. The record inserted into the selected table each time, together with the records correspondingly generated in the other tables, is taken as the table records newly added to the same batch of tables.
This embodiment provides an implementable example, specifically:
assume the graph includes 4 nodes (i.e., the graph includes 4 tables), as shown in fig. 2: Table1, Table2, Table3 and Table4;
wherein, the structure of Table1 is:
T1_C1 T1_C2 T1_C3
the structure of Table2 is:
T2_C1 T2_C2
the structure of Table3 is:
T3_C1 T3_C2 T3_C3 T3_C4
the structure of Table4 is:
T4_C1 T4_C2 T4_C3
one table is arbitrarily selected from the 4 tables; assume the selected table is Table1 and a table record (x1, x2, x3) is inserted into it. Because a record (x1, x2, x3) is inserted into Table1, each of the other tables correspondingly generates one table record. After the insertion of (x1, x2, x3) into Table1, the tables Table1, Table2, Table3 and Table4 are respectively:
table1 is:
T1_C1 T1_C2 T1_C3
x1 x2 x3
table2 is:
T2_C1 T2_C2
y1 y2
table3 is:
T3_C1 T3_C2 T3_C3 T3_C4
z1 z2 z3 z4
table4 is:
T4_C1 T4_C2 T4_C3
v1 v2 v3
the table record inserted into the selected table each time, together with the table records correspondingly generated in the other tables, is taken as the table records newly added to each table in the same batch.
The record inserted into Table1 and the records generated by Table2, Table3 and Table4 are (x1, x2, x3), (y1, y2), (z1, z2, z3, z4) and (v1, v2, v3); these 4 table records are taken as the table records newly added in the same batch.
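The batch-generation rule above (insert one record into a chosen table, each of the other tables derives one record) can be mimicked in code; the derivation functions below are purely hypothetical stand-ins for whatever triggers or ETL jobs link the tables, and do not appear in the patent.

```python
# Hypothetical derivation rules: inserting into Table1 triggers one
# derived record in each downstream table, as in the (x*, y*, z*, v*) example.
def derive_table2(rec):
    return (rec[0] + rec[1], rec[2])                   # y1, y2

def derive_table3(rec):
    return (rec[0], rec[1], rec[2], rec[0] + rec[2])   # z1, z2, z3, z4

def derive_table4(rec):
    return (rec[1], rec[2], rec[0])                    # v1, v2, v3

def new_batch(record):
    # One inserted record plus the records it triggers = one training batch.
    return {"Table1": record,
            "Table2": derive_table2(record),
            "Table3": derive_table3(record),
            "Table4": derive_table4(record)}

batch = new_batch((1.0, 2.0, 3.0))
print(batch)
```

Each call to `new_batch` corresponds to one "same batch" of newly added table records that is fed into the graph neural network for a sampling training round.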
The classifying of the output sample set according to the clustering pseudo labels to obtain a classification result and the obtaining of an average classification error from the classification result and the clustering pseudo labels, as shown in fig. 5, specifically comprise: step 401, the classification result comprises a probability set for each output sample in the output sample set under each clustering pseudo label;
this embodiment provides a way this can be realized in an actual scene, specifically:
assume that the Output sample set is (Output1, Output2, Output3, Output4), where,
Output1=[-0.043624,0.004513,0.062156,-0.013981,...,0.028901]
Output2=[0.070011,0.000111,0.000002,-0.000044,...,-0.001604]
Output3=[0.015643,-0.092466,-0.001992,0.003929,...,-0.002443]
Output4=[0.000901,-0.012801,0.0103378,-0.000066,...,0.010568]
after clustering by the clustering device, the clustering pseudo labels of the output samples Output1, Output2, Output3 and Output4 of the output sample set are 3, 1, 2 and 4 respectively. The pseudo labels 3, 1, 2 and 4 are output to the classifier of the graph neural network, and the classifier classifies the output samples in the output sample set with the pseudo labels as the standard, obtaining a probability set for each output sample under each clustering pseudo label, for example:
the probabilities of Output1 under clustering pseudo labels 1, 2, 3 and 4 are 0.44, 0.20, 0.26 and 0.10 respectively, so the probability set is (0.44, 0.20, 0.26, 0.10) and its corresponding vector is [0.44, 0.20, 0.26, 0.10];
the probabilities of Output2 under clustering pseudo labels 1, 2, 3 and 4 are 0.41, 0.19, 0.28 and 0.12 respectively, so the probability set is (0.41, 0.19, 0.28, 0.12) and its corresponding vector is [0.41, 0.19, 0.28, 0.12];
the probabilities of Output3 under clustering pseudo labels 1, 2, 3 and 4 are 0.27, 0.53, 0.11 and 0.09 respectively, so the probability set is (0.27, 0.53, 0.11, 0.09) and its corresponding vector is [0.27, 0.53, 0.11, 0.09];
the probabilities of Output4 under clustering pseudo labels 1, 2, 3 and 4 are 0.14, 0.35, 0.26 and 0.25 respectively, so the probability set is (0.14, 0.35, 0.26, 0.25) and its corresponding vector is [0.14, 0.35, 0.26, 0.25];
step 402, converting each clustering pseudo label into the corresponding encoding vector:
the encoding vector of the clustering pseudo label 3 of Output1 is [0, 0, 1, 0];
the encoding vector of the clustering pseudo label 1 of Output2 is [1, 0, 0, 0];
the encoding vector of the clustering pseudo label 2 of Output3 is [0, 1, 0, 0];
the encoding vector of the clustering pseudo label 4 of Output4 is [0, 0, 0, 1];
step 403, obtaining the average classification error from the probability set of each output sample in the output sample set under each clustering pseudo label and the encoding vector corresponding to the clustering pseudo label of that output sample.
First, a classification error is calculated for each output sample, taken as the Euclidean distance between its probability vector and its encoding vector:
the classification error of Output1 is sqrt((0.44-0)^2 + (0.20-0)^2 + (0.26-1)^2 + (0.10-0)^2) ≈ 0.8895;
the classification error of Output2 is sqrt((0.41-1)^2 + (0.19-0)^2 + (0.28-0)^2 + (0.12-0)^2) ≈ 0.6906;
the classification error of Output3 is sqrt((0.27-0)^2 + (0.53-1)^2 + (0.11-0)^2 + (0.09-0)^2) ≈ 0.5604;
the classification error of Output4 is sqrt((0.14-0)^2 + (0.35-0)^2 + (0.26-0)^2 + (0.25-1)^2) ≈ 0.8787.
The average classification error calculated from the classification errors is approximately 0.7548.
The average classification error here is about 0.7548; if the error threshold is 10^-5, the average classification error is greater than the error threshold, so the classifier back-propagates to update the parameters of the training layer of the graph neural network (i.e., trains the input layer and the intermediate layer of the graph neural network).
The sampling training process is executed cyclically multiple times, iteratively updating the training layer, until the average classification error is smaller than the error threshold; the attention layer in the training layer is then taken out so as to obtain the blood relationship between the tables. Assuming that after the training layer has executed the sampling training process M times the average classification error obtained by the classifier is smaller than the error threshold 10^-5, the attention layer of the graph neural network is taken out as the representation of the degree and direction of association between the tables (i.e., the blood relationship between the tables).
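The overall control flow (sample, cluster, classify, back-propagate, repeat until the average error drops below the threshold) reduces to a loop like the following; `run_sampling_training` is a hypothetical stand-in for one full sampling training round, and its geometric error decay is purely illustrative.

```python
def train_until_converged(run_sampling_training, error_threshold=1e-5, max_rounds=1000):
    """Repeat the sampling training process until the average
    classification error falls below the threshold; return the
    number M of rounds that were needed."""
    for m in range(1, max_rounds + 1):
        avg_error = run_sampling_training()
        if avg_error < error_threshold:
            return m
    raise RuntimeError("did not converge within max_rounds")

# Stub round: the error starts near the 0.7548 of the example and shrinks
# tenfold per round, standing in for one back-propagation update.
state = {"error": 0.7548}
def fake_round():
    state["error"] *= 0.1
    return state["error"]

print(train_until_converged(fake_round))  # -> 5
```

In the patent's setting the returned M is the number of sampling training rounds after which the attention layer is extracted.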
Example 3
On the basis of the data blood margin discovery method based on the graph neural network provided in embodiment 1, the present invention further provides a data blood margin discovery device based on the graph neural network, which can be used to implement the method. Fig. 8 is a schematic structural diagram of the device in this embodiment of the present invention. The data blood margin discovery device based on the graph neural network of this embodiment comprises one or more processors 21 and a memory 22; in fig. 8, one processor 21 is taken as an example.
The processor 21 and the memory 22 may be connected by a bus or other means, and fig. 8 illustrates the connection by a bus as an example.
The memory 22, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as those implementing the graph neural network-based data blood margin discovery method of embodiment 1. The processor 21 executes the various functional applications and data processing of the graph neural network-based data blood margin discovery device by running the nonvolatile software programs, instructions, and modules stored in the memory 22, that is, it implements the graph neural network-based data blood margin discovery method of embodiment 1.
The memory 22 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 22 may optionally include memory located remotely from the processor 21, and these remote memories may be connected to the processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The program instructions/modules are stored in the memory 22 and, when executed by the one or more processors 21, perform the graph neural network-based data blood margin discovery method of embodiment 1 described above, for example, perform the steps of fig. 3 and 5 described above.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the embodiments may be implemented by associated hardware as instructed by a program, which may be stored on a computer-readable storage medium, which may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A data blood margin discovery method based on a graph neural network, characterized in that a sampling training process comprises: obtaining the table records newly added to each table in the same batch;
the table records newly added in the same batch are converted into an output sample set after passing through a training layer;
clustering the output sample set to obtain a clustering pseudo label;
the output sample set is classified according to the clustering pseudo labels to obtain a classification result, an average classification error is obtained according to the classification result and the clustering pseudo labels, and if the average classification error is larger than or equal to an error threshold value, a training layer is updated in a reverse propagation mode;
and performing a sampling training process for multiple times in a circulating manner, iteratively updating the training layer until the average classification error is smaller than the error threshold, and taking out the attention layer in the training layer so as to obtain the blood relationship among the tables.
2. The graph neural network-based data blood margin discovery method according to claim 1, wherein the obtaining of the table records newly added to each table in the same batch comprises:
randomly selecting one table from each table, and inserting one table record into the selected table every time, wherein one table record can be correspondingly generated in other tables;
and taking one table record inserted in the selected table each time and one table record generated correspondingly in other tables as the newly added table record of each table in the same batch.
3. The graph neural network-based data blood margin discovery method according to claim 1, wherein the classifying of the output sample set according to the clustering pseudo labels to obtain a classification result and the obtaining of an average classification error from the classification result and the clustering pseudo labels specifically comprise:
the classification result comprises a probability set corresponding to each output sample in the output sample set under each clustering pseudo label;
converting each clustering pseudo label into a corresponding coding vector;
and obtaining an average classification error according to the probability set corresponding to each output sample under each clustering pseudo label in the output sample set and the coding vector corresponding to the clustering pseudo label of the corresponding output sample.
4. The graph neural network-based data blood margin discovery method according to claim 1, wherein the conversion of the table records newly added to each table in the same batch into an output sample set after passing through the training layer specifically comprises:
the training layer comprises one or more of an input layer and an intermediate layer, wherein the intermediate layer comprises an attention layer;
and the newly added table records in the same batch are converted into an input sample set after passing through an input layer, and the input sample set is converted into an output sample set after passing through an intermediate layer.
5. The method of claim 4, wherein the input sample set is a homogeneous matrix.
6. The graph neural network-based data blood margin discovery method according to claim 4, wherein the intermediate layer further comprises one or more of a ReLU layer, a regularization layer and an output layer.
7. The graph neural network-based data blood margin discovery method according to any one of claims 1-6, wherein the output sample set is specifically a high-dimensional feature matrix.
8. A data blood margin discovery system based on a graph neural network is characterized by comprising: a graph neural network and a clustering device;
the clustering device is configured to cluster the output sample set and obtain clustering pseudo labels;
graph neural network: the method comprises the steps of obtaining table records newly added to each table in the same batch;
the newly added table records of each table in the same batch are converted into an output sample set after passing through a training layer of the graph neural network;
the output sample set is classified according to the clustering pseudo labels to obtain a classification result, an average classification error is obtained according to the classification result and the clustering pseudo labels, and if the average classification error is larger than or equal to an error threshold value, a training layer is updated in a reverse propagation mode; and iteratively updating the training layer until the average classification error is smaller than the error threshold, and taking out the attention layer in the training layer so as to obtain the blood relationship among the tables.
9. The graph neural network-based data blood margin discovery system according to claim 8, wherein the training layer further comprises an input layer;
the input layer: and the table records are used for converting the newly added table records of the tables in the same batch into an input sample set in a homogeneous matrix form.
10. A data blood margin discovery device based on a graph neural network, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to perform the graph neural network-based data blood margin discovery method of any one of claims 1-7.
CN202110830737.6A 2021-07-22 2021-07-22 Data blood-edge discovery method, system and device based on graph neural network Active CN113469280B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110830737.6A CN113469280B (en) 2021-07-22 2021-07-22 Data blood-edge discovery method, system and device based on graph neural network

Publications (2)

Publication Number Publication Date
CN113469280A true CN113469280A (en) 2021-10-01
CN113469280B CN113469280B (en) 2023-06-16

Family

ID=77882026

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110830737.6A Active CN113469280B (en) 2021-07-22 2021-07-22 Data blood-edge discovery method, system and device based on graph neural network

Country Status (1)

Country Link
CN (1) CN113469280B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115374223A (en) * 2022-06-30 2022-11-22 北京三维天地科技股份有限公司 Intelligent blood relationship identification recommendation method and system based on rules and machine learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130332423A1 (en) * 2012-06-12 2013-12-12 Accenture Global Services Limited Data lineage tracking
CN109829057A (en) * 2019-01-11 2019-05-31 中山大学 A kind of knowledge mapping Entity Semantics spatial embedding method based on figure second order similitude
CN112256980A (en) * 2020-10-23 2021-01-22 辽宁工程技术大学 Dynamic graph attention network-based multi-relation collaborative filtering recommendation
CN112948612A (en) * 2021-03-16 2021-06-11 杭州海康威视数字技术股份有限公司 Human body cover generation method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113098892B (en) * 2021-04-19 2023-04-18 恒安嘉新(北京)科技股份公司 Data leakage prevention system and method based on industrial Internet

Also Published As

Publication number Publication date
CN113469280B (en) 2023-06-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant