CN113469280B - Data lineage discovery method, system and device based on graph neural network - Google Patents

Data lineage discovery method, system and device based on graph neural network

Info

Publication number
CN113469280B
CN113469280B
Authority
CN
China
Prior art keywords
layer
clustering
sample set
neural network
tables
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110830737.6A
Other languages
Chinese (zh)
Other versions
CN113469280A (en)
Inventor
黄勋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fiberhome Telecommunication Technologies Co Ltd
Original Assignee
Fiberhome Telecommunication Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fiberhome Telecommunication Technologies Co Ltd filed Critical Fiberhome Telecommunication Technologies Co Ltd
Priority to CN202110830737.6A priority Critical patent/CN113469280B/en
Publication of CN113469280A publication Critical patent/CN113469280A/en
Application granted granted Critical
Publication of CN113469280B publication Critical patent/CN113469280B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of big data platform construction, and provides a data lineage discovery method, system and device based on a graph neural network. The method acquires the table records newly added to each table in the same batch; converts each table's newly added records of the same batch into an output sample set through a training layer; clusters the output sample set to obtain clustering pseudo tags; classifies the output sample set according to the clustering pseudo tags to obtain a classification result and derives an average classification error from the classification result and the clustering pseudo tags; and, if the average classification error is greater than or equal to an error threshold, updates the training layer by back propagation. This sampling training process is repeated, iteratively updating the training layer, until the average classification error is smaller than the error threshold, whereupon the attention layer is taken out of the training layer to obtain the lineage relationships among the tables.

Description

Data lineage discovery method, system and device based on graph neural network
[Technical Field]
The invention relates to the technical field of big data platform construction, and in particular to a data lineage discovery method, system and device based on a graph neural network.
[Background Art]
As network scale keeps growing, the data generated by networks has expanded enormously, making network big data an industry reality. The requirements for acquiring, managing and analysing network big data are increasingly urgent, which in turn places higher demands on its quality.
Network big data is stored in different tables of a database, and various relations naturally form as data is generated, processed and circulated; these relations are called data lineage. The lineage relationships between data tables and between fields help to analyse the rationality of table design, to analyse the downstream impact of a change in upstream data, and to trace a downstream data anomaly back to its upstream source. Data lineage is thus a beneficial technical means of data governance, and techniques for discovering it are particularly important.
At present, data lineage discovery techniques fall mainly into four types, each with its own defects:
Building dependency fields into the tables themselves and writing the upstream and downstream dependency fields directly into the database, thereby constructing the lineage between tables. The disadvantage of this scheme is that the table structure is changed and there is a strong dependence on the data processing components.
Parsing files to disassemble the DDL of each table and obtain the source-table and destination-table information of the data. This scheme is not applicable when the DDL files cannot easily be obtained, for example due to permission problems.
Acquiring the upstream-downstream relations of the tables from a task scheduling system at table construction time. The disadvantage of this scheme is the reliance on additional task scheduling information; the scheduling process is in practice complex to configure and is moreover tied to third-party components.
Combing out the data lineage manually. Although this scheme can map lineage relationships quite finely, it is time-consuming and labour-intensive.
In view of this, overcoming the above drawbacks of the prior art is a problem to be solved in the art.
[Summary of the Invention]
The technical problem the invention aims to solve is the following:
In the prior art, the original construction information of the tables and the table data flow information may both be missing; when they are, neither the degree of association between tables nor the direction between tables (i.e. which table is the source and which table is the destination) can be obtained.
The invention achieves the aim through the following technical scheme:
In a first aspect, the invention provides a data lineage discovery method based on a graph neural network, in which a sampling training process comprises: obtaining the table records newly added to each table in the same batch;
converting each table's newly added records of the same batch into an output sample set through the training layer;
clustering the output sample set to obtain clustering pseudo tags;
classifying the output sample set according to the clustering pseudo tags to obtain a classification result, obtaining an average classification error from the classification result and the clustering pseudo tags, and, if the average classification error is greater than or equal to an error threshold, updating the training layer by back propagation;
and repeatedly executing the sampling training process, iteratively updating the training layer until the average classification error is smaller than the error threshold, and taking the attention layer out of the training layer so as to obtain the lineage relationships among the tables.
Preferably, obtaining the table records newly added to each table in the same batch specifically comprises:
arbitrarily selecting one of the tables and inserting one table record into it, whereupon one table record is correspondingly generated in each of the other tables;
and taking the table record inserted into the selected table and the table records correspondingly generated in the other tables as the table records newly added to each table in the same batch.
Preferably, classifying the output sample set according to the clustering pseudo tags to obtain a classification result and obtaining an average classification error from the classification result and the clustering pseudo tags specifically comprises:
the classification result comprises, for each output sample in the output sample set, a probability set over the clustering pseudo tags;
converting each clustering pseudo tag into a corresponding coding vector;
and obtaining the average classification error from the probability set of each output sample over the clustering pseudo tags and the coding vector corresponding to that output sample's clustering pseudo tag.
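Assuming the probability sets and coding vectors are combined by a cross-entropy (an assumption; the claim itself does not name the loss function), the average classification error can be written as:

```latex
\bar{E} \;=\; -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} q_{i,k}\,\log p_{i,k}
```

where $p_{i,k}$ is the probability of output sample $i$ under clustering pseudo tag $k$, $q_i$ is the one-hot coding vector of sample $i$'s pseudo tag, $N$ is the number of output samples and $K$ the number of pseudo tags.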
Preferably, converting each table's newly added records of the same batch into an output sample set through the training layer specifically comprises:
the training layer comprises an input layer and an intermediate layer, the intermediate layer comprising an attention layer;
each table's newly added records of the same batch are converted into an input sample set by the input layer, and the input sample set is converted into the output sample set by the intermediate layer.
Preferably, the input sample set is in particular a homogeneous matrix.
Preferably, the intermediate layer further comprises one or more of a ReLU layer, a regularization layer and an output layer.
Preferably, the output sample set is specifically a high-dimensional feature matrix.
In a second aspect, the present invention provides a data lineage discovery system based on a graph neural network, comprising a graph neural network and a clustering device;
the clustering device: used for clustering and obtaining clustering pseudo tags;
the graph neural network: used for obtaining the table records newly added to each table in the same batch;
converting each table's newly added records of the same batch into an output sample set through a training layer of the graph neural network;
classifying the output sample set according to the clustering pseudo tags to obtain a classification result, obtaining an average classification error from the classification result and the clustering pseudo tags, and, if the average classification error is greater than or equal to an error threshold, updating the training layer by back propagation; and iteratively updating the training layer until the average classification error is smaller than the error threshold, and taking the attention layer out of the training layer so as to obtain the lineage relationships among the tables.
Preferably, the training layer further comprises an input layer;
the input layer: used for converting the table records newly added to each table in the same batch into an input sample set in homogeneous-matrix form.
In a third aspect, the present invention further provides a data lineage discovery device based on a graph neural network, comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being programmed to perform the graph-neural-network-based data lineage discovery method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
The invention relies only on the tables themselves: the table records newly added to each table in the same batch are acquired repeatedly, the sampling training process is executed repeatedly, the training layer of the graph neural network is iteratively updated, and once the average classification error is smaller than the error threshold the attention layer is taken out of the training layer, whereby the lineage relationships between the tables (i.e. the degree of association between tables, and which table is the source and which the destination) are obtained. The method can therefore discover the lineage between tables even when the DDL cannot be obtained for permission reasons, the task scheduling flow cannot be obtained, and so on.
[Description of the Drawings]
In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings that are required to be used in the embodiments of the present invention will be briefly described below. It is evident that the drawings described below are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a block diagram of a data lineage discovery system based on a graph neural network according to the present invention;
FIG. 2 is a graph model used by a data lineage discovery method based on a graph neural network according to an embodiment of the present invention;
FIG. 3 is a flowchart of a data lineage discovery method based on a graph neural network according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the table records newly added in the same batch in a data lineage discovery method based on a graph neural network according to an embodiment of the present invention;
FIG. 5 is a flowchart of a data lineage discovery method based on a graph neural network according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the data in the attention layer obtained by a data lineage discovery method based on a graph neural network according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the graph model obtained from the data in the attention layer in a data lineage discovery method based on a graph neural network according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a data lineage discovery device based on a graph neural network according to an embodiment of the present invention.
[Detailed Description of the Invention]
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In the description of the present invention, the terms "inner", "outer", "longitudinal", "transverse", "upper", "lower", "top", "bottom", etc. refer to an orientation or positional relationship based on that shown in the drawings, merely for convenience of describing the present invention and do not require that the present invention must be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
Example 1:
The embodiment of the invention provides a data lineage discovery system based on a graph neural network, comprising a graph neural network and a clustering device;
the clustering device: used for clustering and obtaining clustering pseudo tags;
the graph neural network: used for obtaining the table records newly added to each table in the same batch and converting them into an output sample set through its training layer. The output sample set is output both to a classifier and to the clustering device; the classifier classifies the output sample set against the clustering pseudo tags obtained by the clustering device to produce an average classification error, and the average classification error is back-propagated to update the training layer, so that the training layer is updated iteratively. When the average classification error is smaller than the error threshold, the attention layer is taken out of the training layer, whereby the lineage relationships among the tables are obtained. Each output sample is a high-dimensional feature matrix.
This embodiment provides one way the system can be realised in practice, as shown in fig. 1. Specifically, the data lineage discovery system based on a graph neural network comprises the graph neural network and the clustering device; the graph neural network may specifically be a graph attention network, and its training layer comprises an input layer and an intermediate layer, the intermediate layer comprising one or more of an attention layer, a ReLU layer, a regularization layer and an output layer.
The input layer: used for obtaining the table records newly added to each table in the same batch and converting them into input matrices. Because the structures of the tables may differ, the input matrices obtained from the records are non-uniform in form, i.e. heterogeneous matrices; after passing through the input layer they are converted into an input sample set, which is a set of homogeneous matrices (i.e. matrices of uniform form);
the intermediate layer: used for obtaining the input sample set and converting it into an output sample set, which is output by the output layer of the intermediate layer and is a set of high-dimensional feature matrices. The clustering device clusters the output sample set to obtain clustering pseudo tags; the classifier in the graph neural network classifies according to the clustering pseudo tags to obtain a classification result; an average classification error is obtained from the classification result and the clustering pseudo tags; if the average classification error is greater than or equal to the error threshold, the training layer in the graph neural network is updated by back propagation; the training layer is updated iteratively until the average classification error is smaller than the error threshold, and the attention layer is then taken out of the training layer so as to obtain the lineage relationships among the tables;
the classifier: classifies the output sample set against the clustering pseudo tags obtained by the clustering device to produce the average classification error, and back-propagates the average classification error to update the training layer, thereby realising the iterative update of the training layer.
Example 2:
This embodiment also provides a data lineage discovery method based on a graph neural network; as shown in fig. 3, the sampling training process comprises:
step 10, obtaining the table records newly added to each table in the same batch;
In this embodiment, as shown in fig. 2, each table is regarded as a node in the graph and the association between tables as an edge between nodes. First, one table is selected arbitrarily and one table record is inserted into it; because of this insertion, one table record is correspondingly generated in each of the other tables. The record inserted into the selected table and the records correspondingly generated in the other tables are taken as the table records newly added to each table in the same batch.
The embodiment provides a realizable mode, which specifically comprises the following steps:
assuming that the graph contains 4 nodes (i.e., 4 tables in the graph), as shown in fig. 1, table2, table3, and Table4, respectively, where t1_c1, t1_c2, t1_c3, t2_c1, t2_c2, t3_c1, t3_c2, t3_c3, t3_c4, t4_c1, t4_c2, and t4_c3 each represent a field name of a Table;
wherein, the structure of Table1 is:
T1_C1 T1_C2 T1_C3
the structure of Table2 is:
T2_C1 T2_C2
the structure of Table3 is:
T3_C1 T3_C2 T3_C3 T3_C4
the structure of Table4 is:
T4_C1 T4_C2 T4_C3
Any table is selected from the 4 tables; assume the selected table is Table1 and a table record (x1, x2, x3) is inserted into it. Because the record (x1, x2, x3) is inserted into Table1, each of the other tables correspondingly generates a table record; if some table fails to generate a corresponding record, a record whose every field is 0 is used in its place to keep the batch of sample records complete. Suppose that after inserting the record (x1, x2, x3) into Table1, the tables Table1, Table2, Table3 and Table4 become:
table1 is:
T1_C1 T1_C2 T1_C3
x1 x2 x3
table2 is:
T2_C1 T2_C2
y1 y2
table3 is:
T3_C1 T3_C2 T3_C3 T3_C4
z1 z2 z3 z4
table4 is:
T4_C1 T4_C2 T4_C3
v1 v2 v3
The record inserted into Table1 and the records generated by Table2, Table3 and Table4 are respectively (x1, x2, x3), (y1, y2), (z1, z2, z3, z4) and (v1, v2, v3), and these 4 table records are taken as the table records newly added in the same batch. If a table had generated no record in response to the insertion of (x1, x2, x3), a record whose every field is 0 would replace it; for example, if Table2 had generated nothing, (y1, y2) would be (0, 0). This is merely illustrative and does not limit the invention.
The input layer obtains these 4 records as the table records newly added in the same batch. This embodiment is likewise merely illustrative: each table structure may contain one field or a plurality of fields, and the tables illustrated here are not intended to limit the present invention.
Step 20, after the table records newly added in the same batch pass through the training layer, each table is converted into an output sample set;
each table is converted into an output sample set after the table record newly added in the same batch passes through the training layer, and the method specifically comprises the following steps: the training layer comprises one or more of an input layer and an intermediate layer, wherein the intermediate layer comprises an attention layer; each table is converted into an input sample set after the table record newly added in the same batch passes through an input layer, and the input sample set is converted into an output sample set after the input sample set passes through an intermediate layer. The input layer will convert the table records of each table newly added in the same batch into an input sample set, which is specifically a homogeneous matrix. Firstly, an input layer will record and convert the tables newly added in the same batch into input matrixes, and because the structures of the tables may be different, the input matrixes obtained by recording and converting the tables are also non-uniform (i.e. heterogeneous matrixes), and at the moment, each input matrix will be multiplied with a corresponding parameter matrix Wi in the input layer to obtain a homogeneous matrix with uniform form, so that an input sample set is obtained, wherein the parameter matrix Wi is updatable.
The embodiment provides a manner that can be implemented in an actual scene, as shown in fig. 4, specifically:
taking 4 table records (x 1, x2, x 3), (y 1, y 2), (z 1, z2, z3, z 4), (v 1, v2, v 3) as newly added table records in the same batch, and respectively converting the 4 table records into corresponding input matrixes
Figure BDA0003175366240000091
It can be known that the input matrixes are not uniform in form, namely, the input matrixes are heterogeneous matrixes, the input matrixes with the non-uniform in form are multiplied by the corresponding parameter matrixes Wi in the input layer respectively to obtain homogeneous matrixes with the uniform in form, and the heterogeneous matrixes are converted into the homogeneous matrixes so as to facilitate calculation of middle layers in the subsequent graph neural network. For the input matrix of (x 1, x2, x 3), a parameter matrix W1 of n×3 is selected, and the parameter matrix W1 is multiplied by the input matrix of (x 1, x2, x 3), for example:
Figure BDA0003175366240000092
the result obtained by multiplying W1 by the Input matrix of (x 1, x2, x 3) is recorded as a matrix of Input1, and Input1 is n 1;
similarly, the Input matrix of (y 1, y 2) needs to select a parameter matrix W2 of n×2, and the result is recorded as Input2, where n×1 is obtained by multiplying the parameter matrix W2 by the Input matrix of (y 1, y 2); the Input matrix of (z 1, z2, z3, z 4) needs to select a parameter matrix W3 of n×4, and the result is denoted as Input3 by multiplying the parameter matrix W3 by the Input matrix of (z 1, z2, z3, z 4) to obtain a matrix of n×1; the Input matrix of (v 1, v2, v 3) needs to select the parameter matrix W4 of n×3, the multiplication of the parameter matrix W4 and the Input matrix of (v 1, v2, v 3) obtains the matrix of n×1, the result is recorded as Input4, the Input sample set can be obtained at this time, the Input sample set is (Input 1, input2, input3, input 4), and the Input sample set at this time is a set of homogeneous matrices.
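The input-layer projection just described can be sketched as follows; the value n = 8, the random parameter matrices and the concrete record values are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # shared embedding width n; an illustrative choice

# One parameter matrix Wi per table, shaped n x (field count), mirroring
# W1 (n x 3), W2 (n x 2), W3 (n x 4), W4 (n x 3) in the description.
widths = [3, 2, 4, 3]
W = [rng.standard_normal((n, w)) for w in widths]

# The four heterogeneous records of the batch, as column matrices.
records = [np.array([[1.0], [2.0], [3.0]]),        # (x1, x2, x3)
           np.array([[4.0], [5.0]]),               # (y1, y2)
           np.array([[6.0], [7.0], [8.0], [9.0]]), # (z1, z2, z3, z4)
           np.array([[1.5], [2.5], [3.5]])]        # (v1, v2, v3)

# Inputi = Wi @ recordi: every heterogeneous input matrix becomes an
# n x 1 homogeneous column, so the input sample set has uniform shape.
inputs = [Wi @ r for Wi, r in zip(W, records)]
```

Because each Wi has n rows, every element of `inputs` has the same n×1 shape regardless of how many fields its table has.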
The input sample set is converted into the output sample set by the intermediate layer of the graph neural network and output by its output layer; the output samples are specifically high-dimensional feature matrices. In essence, the input sample set undergoes matrix multiplications and nonlinear transformations in the intermediate layer. For example, the input sample set (Input1, Input2, Input3, Input4), a set of n×1 homogeneous matrices when fed into the graph neural network, is computed through the intermediate layer and emitted by the output layer as the output sample set. The intermediate layer of the graph neural network comprises the attention layer, a ReLU layer, a regularization layer and the output layer; the attention layer records the lineage relationships between tables. The input sample set is thus converted through the intermediate layer into the output sample set (Output1, Output2, Output3, Output4), a set of high-dimensional feature matrices, each of which may specifically be a 1×512 high-dimensional feature matrix, for example:
Output1=[-0.043624,0.004513,0.062156,-0.013981,...,0.028901]
Output2=[0.070011,0.000111,0.000002,-0.000044,...,-0.001604]
Output3=[0.015643,-0.092466,-0.001992,0.003929,...,-0.002443]
Output4=[0.000901,-0.012801,0.0103378,-0.000066,...,0.010568]
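The text names the attention layer but does not spell out its computation. A minimal sketch in the style of a standard graph attention layer over a fully connected graph of tables is given below; the single-head form, the LeakyReLU slope and the random parameters are assumptions, not details from the description:

```python
import numpy as np

def softmax(x):
    # Row-wise softmax, numerically stabilised.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_layer(H, W, a, alpha=0.2):
    """One graph-attention layer over a fully connected graph of tables.

    H: N x d node features (one row per table); W: d x d' projection;
    a: length-2d' attention vector. Returns the updated features (after
    ReLU, as in the described intermediate layer) and the N x N attention
    matrix, whose weight att[i, j] can be read as how strongly table j
    contributes to table i, i.e. the inter-table lineage signal.
    """
    Z = H @ W
    N = Z.shape[0]
    e = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            s = np.concatenate([Z[i], Z[j]]) @ a  # a . [Z_i || Z_j]
            e[i, j] = s if s > 0 else alpha * s   # LeakyReLU
    att = softmax(e)                              # normalise per node
    return np.maximum(att @ Z, 0.0), att
```

After training, reading out `att` is what "taking out the attention layer" yields: the matrix from which the degree and direction of association between tables is interpreted.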
step 30, clustering the output sample set to obtain a clustering pseudo tag;
The graph neural network outputs the output sample set to the clustering device for clustering; the clustering device may be a K-means clusterer. Clustering essentially groups the output samples (i.e. the high-dimensional feature matrices) in high-dimensional space into several data clusters whose centroid distributions differ; each data cluster is marked with a different clustering pseudo tag, and output samples assigned to the same data cluster share the same pseudo tag. To assign stable pseudo tags to the data clusters when different batches are clustered, this embodiment marks each new batch of output samples according to centroid distances: after a new batch of output samples is clustered, new data clusters are formed, the centroid of each new data cluster is compared against the centroids of the previous batch's data clusters, and each new data cluster takes the pseudo tag of the previous-batch cluster whose centroid is nearest.
The embodiment provides a mode which can be realized in an actual scene, which is specifically as follows:
assuming that the Output sample set is (Output 1, output2, output3 and Output 4), 4 data clusters are formed after the Output sample set is clustered by a clustering device, different clustering pseudo tags are marked for each data cluster, and assuming that the clustering pseudo tag of the data cluster where Output1 is located is 3, the clustering pseudo tag of the data cluster where Output2 is located is 1, the clustering pseudo tag of the data cluster where Output3 is located is 2 and the clustering pseudo tag of the data cluster where Output4 is located is 4.
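The clustering device and the centroid-distance rule for keeping pseudo tags stable across batches can be sketched with a minimal numpy-only K-means; `kmeans` and `stable_tags` are illustrative helper names, not names from the text:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal K-means: returns a cluster index per sample and the centroids."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)          # nearest centroid per sample
        for j in range(k):
            if (labels == j).any():
                C[j] = X[labels == j].mean(axis=0)
    return labels, C

def stable_tags(C_new, C_prev, prev_tags):
    """Each new cluster takes the pseudo tag of the previous-batch cluster
    whose centroid is nearest, so tags stay consistent across batches."""
    d = ((C_new[:, None, :] - C_prev[None, :, :]) ** 2).sum(axis=-1)
    return [prev_tags[j] for j in d.argmin(axis=1)]
```

In practice X would hold the output samples (e.g. four 512-dimensional rows) and `prev_tags` the tags assigned in the previous batch.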
Step 40, the output sample set is classified according to the clustering pseudo tag to obtain a classification result, an average classification error is obtained according to the classification result and the clustering pseudo tag, and if the average classification error is greater than or equal to an error threshold value, the training layer is updated in a counter-propagation mode;
The output layer outputs the output sample set to the clustering device, which clusters it to obtain the clustering pseudo tags and outputs them to the classifier of the graph neural network; the classifier may specifically be a softmax classifier. The classifier classifies the output sample set against the clustering pseudo tags, obtaining for each output sample a probability set over the pseudo tags (i.e. the classification result). The classifier also converts each clustering pseudo tag into a corresponding one-hot coding vector, and the average classification error is obtained from each output sample's probability set over the pseudo tags and the coding vector of that sample's pseudo tag.
This embodiment provides a mode realizable in an actual scenario, specifically as follows:
Let the output sample set be (Output1, Output2, Output3, Output4), where
Output1=[-0.043624,0.004513,0.062156,-0.013981,...,0.028901]
Output2=[0.070011,0.000111,0.000002,-0.000044,...,-0.001604]
Output3=[0.015643,-0.092466,-0.001992,0.003929,...,-0.002443]
Output4=[0.000901,-0.012801,0.0103378,-0.000066,...,0.010568]
After clustering by the clustering device, the clustering pseudo tags of the output samples Output1, Output2, Output3, and Output4 are 3, 1, 2, and 4 respectively, and these pseudo tags are output to the classifier of the graph neural network. The classifier classifies each output sample with the clustering pseudo tags as the reference, obtaining for each output sample a probability set under each clustering pseudo tag, for example:
the probabilities of Output1 under clustering pseudo tags 1, 2, 3, and 4 are 0.44, 0.20, 0.26, and 0.10 respectively; the probability set is (0.44, 0.20, 0.26, 0.10) and the corresponding vector is [0.44, 0.20, 0.26, 0.10];
the probabilities of Output2 under clustering pseudo tags 1, 2, 3, and 4 are 0.41, 0.19, 0.28, and 0.12 respectively; the probability set is (0.41, 0.19, 0.28, 0.12) and the corresponding vector is [0.41, 0.19, 0.28, 0.12];
the probabilities of Output3 under clustering pseudo tags 1, 2, 3, and 4 are 0.27, 0.53, 0.11, and 0.09 respectively; the probability set is (0.27, 0.53, 0.11, 0.09) and the corresponding vector is [0.27, 0.53, 0.11, 0.09];
the probabilities of Output4 under clustering pseudo tags 1, 2, 3, and 4 are 0.14, 0.35, 0.26, and 0.25 respectively; the probability set is (0.14, 0.35, 0.26, 0.25) and the corresponding vector is [0.14, 0.35, 0.26, 0.25];
each clustering pseudo tag is converted into a corresponding coding vector, for example:
the coding vector of the clustering pseudo tag 3 of Output1 is [0, 0, 1, 0];
the coding vector of the clustering pseudo tag 1 of Output2 is [1, 0, 0, 0];
the coding vector of the clustering pseudo tag 2 of Output3 is [0, 1, 0, 0];
the coding vector of the clustering pseudo tag 4 of Output4 is [0, 0, 0, 1];
a classification error is calculated for each output sample from the vector corresponding to its probability set under the clustering pseudo tags and the coding vector corresponding to its clustering pseudo tag, and the average classification error is then obtained;
Classification errors (the Euclidean distance between each output sample's probability vector and the coding vector of its clustering pseudo tag):
The classification error of Output1 is
√((0.44 − 0)² + (0.20 − 0)² + (0.26 − 1)² + (0.10 − 0)²) ≈ 0.8895
The classification error of Output2 is
√((0.41 − 1)² + (0.19 − 0)² + (0.28 − 0)² + (0.12 − 0)²) ≈ 0.6906
The classification error of Output3 is
√((0.27 − 0)² + (0.53 − 1)² + (0.11 − 0)² + (0.09 − 0)²) ≈ 0.5604
The classification error of Output4 is
√((0.14 − 0)² + (0.35 − 0)² + (0.26 − 0)² + (0.25 − 1)²) ≈ 0.8787
The average classification error calculated from these classification errors is (0.8895 + 0.6906 + 0.5604 + 0.8787) / 4 ≈ 0.7548.
At this time, the average classification error is about 0.7548. If the error threshold is 10⁻⁵, the average classification error is clearly greater than the error threshold, so the classifier back-propagates to update the parameters of the training layer of the graph neural network (i.e., the input layer and the intermediate layer of the graph neural network are trained).
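The worked numbers above can be reproduced in a few lines. Note one assumption: the original formula images are not reproduced in this text, so the error measure (Euclidean distance between each probability vector and the one-hot coding vector of its pseudo tag) is inferred from the fact that it yields exactly the stated average of 0.7548.

```python
import math

# Reproducing the example's average classification error. The distance-based
# error measure is inferred from the example's numbers (formula images are
# unavailable in this text).

probability_sets = {
    "Output1": [0.44, 0.20, 0.26, 0.10],
    "Output2": [0.41, 0.19, 0.28, 0.12],
    "Output3": [0.27, 0.53, 0.11, 0.09],
    "Output4": [0.14, 0.35, 0.26, 0.25],
}
pseudo_tags = {"Output1": 3, "Output2": 1, "Output3": 2, "Output4": 4}

def one_hot(tag, k=4):
    return [1.0 if i == tag - 1 else 0.0 for i in range(k)]

def classification_error(probs, target):
    # Euclidean distance between the probability vector and the coding vector.
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(probs, target)))

errors = {name: classification_error(probs, one_hot(pseudo_tags[name]))
          for name, probs in probability_sets.items()}
average_error = sum(errors.values()) / len(errors)
print(round(average_error, 4))  # 0.7548
```

Comparing this average against the error threshold 10⁻⁵ decides whether another back-propagation round is needed.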
Step 50: the sampling training process is executed cyclically a plurality of times and the training layer is iteratively updated until the average classification error is smaller than the error threshold; the attention layer in the training layer is then taken out so as to obtain the blood-edge (data lineage) relationships among the tables.
When the average classification error is greater than or equal to the error threshold, the classifier back-propagates to update the parameters of the training layer of the graph neural network. After the parameters are updated, steps 10-40 are repeated to obtain a new batch of table records. Specifically, assuming a new table record is inserted into the selected Table1, each other table correspondingly regenerates one table record; the record inserted into Table1 together with the records regenerated by the other tables is input, as the new batch of table records of each table (so called because they are newly added to every table at the same time), into the data blood-edge discovery system based on the graph neural network provided in this embodiment.
The sampling training process is executed repeatedly and the training layer is iteratively updated until the average classification error is smaller than the error threshold, and the attention layer in the training layer is taken out so as to obtain the blood-edge relationships among the tables. Suppose that after the training layer has undergone the sampling training process M times, the average classification error obtained by the classifier is smaller than the error threshold 10⁻⁵; the attention layer of the graph neural network is then taken out as the representation of the degree and direction of association between tables (i.e., the blood-edge relationships between the tables). The data of the extracted attention layer is shown in fig. 6-7: if the value in the first row, second column is 0.8, the degree of association between Table1 and Table2 is 0.8, indicating a strong dependency, and Table1 is known to be the source. The blood-edge relationships between the other tables can be deduced similarly.
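The overall iteration can be summarized as pseudocode. All component names here (`sample_new_records`, `training_layer`, `clusterer`, `classifier`, `attention_layer`) are hypothetical placeholders for the embodiment's parts, not an actual implementation:

```
ERROR_THRESHOLD = 1e-5                             # the embodiment's example threshold 10⁻⁵

repeat (the sampling training process):
    batch   <- sample_new_records(tables)          # step 10: one new record per table
    outputs <- training_layer.forward(batch)       # steps 20-30: output sample set
    tags    <- clusterer.cluster(outputs)          # clustering pseudo tags
    avg_err <- classifier.classify(outputs, tags)  # step 40: average classification error
    if avg_err >= ERROR_THRESHOLD:
        back-propagate to update the training layer's parameters
until avg_err < ERROR_THRESHOLD

take out attention_layer from the training layer   # step 50: table-to-table lineage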
Obtaining the table records newly added to each table in the same batch specifically comprises the following steps: arbitrarily selecting one table from the tables; each time one table record is inserted into the selected table, the other tables correspondingly generate one table record each.
This embodiment provides a mode realizable in an actual scenario, specifically as follows:
First, a table is arbitrarily selected from the tables and a table record is inserted into it. Because a record has been inserted into the selected table, each other table correspondingly generates one table record. The record inserted into the selected table each time, together with the records correspondingly generated in the other tables, is taken as the table records newly added to each table in the same batch.
This embodiment provides a realizable mode, specifically as follows:
Assume the graph contains 4 nodes (i.e., the graph contains 4 tables), as shown in fig. 2: Table1, Table2, Table3, and Table4;
wherein, the structure of Table1 is:
T1_C1 T1_C2 T1_C3
the structure of Table2 is:
T2_C1 T2_C2
the structure of Table3 is:
T3_C1 T3_C2 T3_C3 T3_C4
the structure of Table4 is:
T4_C1 T4_C2 T4_C3
Any table is selected from the 4 tables. Assuming the selected table is Table1, one table record (x1, x2, x3) is inserted into it; because this record is inserted into Table1, each other table correspondingly generates one table record. After the record (x1, x2, x3) is inserted into Table1, the tables Table1, Table2, Table3, and Table4 are respectively:
Table1 is:
T1_C1 T1_C2 T1_C3
x1 x2 x3
Table2 is:
T2_C1 T2_C2
y1 y2
Table3 is:
T3_C1 T3_C2 T3_C3 T3_C4
z1 z2 z3 z4
Table4 is:
T4_C1 T4_C2 T4_C3
v1 v2 v3
The table record inserted each time into the selected table, together with the table records correspondingly generated in the other tables, is taken as the table records newly added to each table in the same batch.
Here, the record inserted into Table1 and the records correspondingly generated by Table2, Table3, and Table4 are (x1, x2, x3), (y1, y2), (z1, z2, z3, z4), and (v1, v2, v3) respectively, and these 4 table records are taken as the table records newly added in the same batch.
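The batch-construction step above can be sketched as follows. How the other tables derive their generated records is an assumption here (placeholder values); in practice they would be produced by whatever jobs link the tables.

```python
# Sketch of producing the "table records newly added in the same batch":
# one record is inserted into the arbitrarily selected table, and every
# other table correspondingly generates one record. Generated values are
# illustrative placeholders.

tables = {
    "Table1": ["T1_C1", "T1_C2", "T1_C3"],
    "Table2": ["T2_C1", "T2_C2"],
    "Table3": ["T3_C1", "T3_C2", "T3_C3", "T3_C4"],
    "Table4": ["T4_C1", "T4_C2", "T4_C3"],
}

def new_batch(tables, selected, record):
    """Return the batch: the inserted record plus one generated row per other table."""
    assert len(record) == len(tables[selected])
    batch = {selected: record}
    for name, columns in tables.items():
        if name != selected:
            # one placeholder generated value per column of the table
            batch[name] = [f"{col}_val" for col in columns]
    return batch

batch = new_batch(tables, "Table1", ["x1", "x2", "x3"])
print(sorted(batch))  # every table contributes exactly one record to the batch
```

Each such batch is what the graph neural network's training layer consumes in one round of the sampling training process.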
The output sample set is classified according to the clustering pseudo tags to obtain a classification result, and the average classification error is obtained from the classification result and the clustering pseudo tags; as shown in fig. 5, this specifically comprises:
Step 401: the classification result comprises a probability set for each output sample in the output sample set under each clustering pseudo tag;
This embodiment provides a mode realizable in an actual scenario, specifically as follows:
Let the output sample set be (Output1, Output2, Output3, Output4), where
Output1=[-0.043624,0.004513,0.062156,-0.013981,...,0.028901]
Output2=[0.070011,0.000111,0.000002,-0.000044,...,-0.001604]
Output3=[0.015643,-0.092466,-0.001992,0.003929,...,-0.002443]
Output4=[0.000901,-0.012801,0.0103378,-0.000066,...,0.010568]
After clustering by the clustering device, the clustering pseudo tags of the output samples Output1, Output2, Output3, and Output4 are 3, 1, 2, and 4 respectively, and these pseudo tags are output to the classifier of the graph neural network. The classifier classifies each output sample with the clustering pseudo tags as the reference, obtaining for each output sample a probability set under each clustering pseudo tag, for example:
the probabilities of Output1 under clustering pseudo tags 1, 2, 3, and 4 are 0.44, 0.20, 0.26, and 0.10 respectively; the probability set is (0.44, 0.20, 0.26, 0.10) and the corresponding vector is [0.44, 0.20, 0.26, 0.10];
the probabilities of Output2 under clustering pseudo tags 1, 2, 3, and 4 are 0.41, 0.19, 0.28, and 0.12 respectively; the probability set is (0.41, 0.19, 0.28, 0.12) and the corresponding vector is [0.41, 0.19, 0.28, 0.12];
the probabilities of Output3 under clustering pseudo tags 1, 2, 3, and 4 are 0.27, 0.53, 0.11, and 0.09 respectively; the probability set is (0.27, 0.53, 0.11, 0.09) and the corresponding vector is [0.27, 0.53, 0.11, 0.09];
the probabilities of Output4 under clustering pseudo tags 1, 2, 3, and 4 are 0.14, 0.35, 0.26, and 0.25 respectively; the probability set is (0.14, 0.35, 0.26, 0.25) and the corresponding vector is [0.14, 0.35, 0.26, 0.25];
step 402, converting each cluster pseudo tag into a corresponding coding vector;
the coding vector of the clustering pseudo tag 3 of Output1 is [0, 0, 1, 0];
the coding vector of the clustering pseudo tag 1 of Output2 is [1, 0, 0, 0];
the coding vector of the clustering pseudo tag 2 of Output3 is [0, 1, 0, 0];
the coding vector of the clustering pseudo tag 4 of Output4 is [0, 0, 0, 1];
and step 403, obtaining an average classification error according to the probability set corresponding to each output sample under each clustering pseudo tag in the output sample set and the coding vector corresponding to the clustering pseudo tag of the corresponding output sample.
First, the classification errors are calculated (as the Euclidean distance between each output sample's probability vector and the coding vector of its clustering pseudo tag):
The classification error of Output1 is
√((0.44 − 0)² + (0.20 − 0)² + (0.26 − 1)² + (0.10 − 0)²) ≈ 0.8895
The classification error of Output2 is
√((0.41 − 1)² + (0.19 − 0)² + (0.28 − 0)² + (0.12 − 0)²) ≈ 0.6906
The classification error of Output3 is
√((0.27 − 0)² + (0.53 − 1)² + (0.11 − 0)² + (0.09 − 0)²) ≈ 0.5604
The classification error of Output4 is
√((0.14 − 0)² + (0.35 − 0)² + (0.26 − 0)² + (0.25 − 1)²) ≈ 0.8787
The average classification error calculated from these classification errors is (0.8895 + 0.6906 + 0.5604 + 0.8787) / 4 ≈ 0.7548.
At this time, the average classification error is about 0.7548. If the error threshold is 10⁻⁵, the average classification error is clearly greater than the error threshold, so the classifier back-propagates to update the parameters of the training layer of the graph neural network (i.e., the input layer and the intermediate layer of the graph neural network are trained).
The sampling training process is performed multiple times and the training layer is iteratively updated until the average classification error is smaller than the error threshold, whereupon the attention layer in the training layer is taken out so as to obtain the blood-edge relationships between the tables. Assuming that after the training layer has undergone the sampling training process M times the average classification error obtained by the classifier is smaller than the error threshold 10⁻⁵, the attention layer of the graph neural network is taken out as the representation of the degree and direction of association between tables (i.e., the blood-edge relationships between the tables).
Example 3
On the basis of the data blood-edge discovery method based on the graph neural network provided in embodiment 1 above, the present invention further provides a data blood-edge discovery device based on the graph neural network that can be used to implement the method. Fig. 8 is a schematic diagram of the device architecture according to an embodiment of the present invention. The data blood-edge discovery device based on the graph neural network of this embodiment includes one or more processors 21 and a memory 22. In fig. 8, one processor 21 is taken as an example.
The processor 21 and the memory 22 may be connected by a bus or otherwise, for example in fig. 8.
The memory 22, as a non-volatile computer-readable storage medium, is used for storing non-volatile software programs, non-volatile computer-executable programs, and modules implementing the data blood-edge discovery method based on the graph neural network of embodiment 1. By executing the non-volatile software programs, instructions, and modules stored in the memory 22, the processor 21 executes the various functional applications and data processing of the data blood-edge discovery device, that is, implements the data blood-edge discovery method based on the graph neural network of embodiment 1.
The memory 22 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 22 may optionally include memory located remotely from the processor 21, such remote memory being connectable to the processor 21 through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The program instructions/modules are stored in the memory 22, which when executed by the one or more processors 21, perform the data blood-edge discovery method based on a graph neural network in the above-described embodiment 1, for example, performing the respective steps of fig. 3 and 5 described above.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the various methods of the embodiments may be implemented by a program that instructs associated hardware; the program may be stored on a computer-readable storage medium, which may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and the like.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (9)

1. The data blood-edge discovery method based on the graph neural network is characterized in that the sampling training process comprises the steps of obtaining newly added table records of each table in the same batch, specifically, randomly selecting one table from the tables, inserting one table record into the selected table each time, and correspondingly generating one table record in other tables; taking one table record inserted each time in the selected table and one table record correspondingly generated in other tables as a newly added table record of each table in the same batch;
the tables are converted into output sample sets after the tables newly added in the same batch are recorded through the training layer;
clustering the output sample set to obtain a clustering pseudo tag;
the output sample set is classified according to the clustering pseudo tag to obtain a classification result, an average classification error is obtained according to the classification result and the clustering pseudo tag, and if the average classification error is greater than or equal to an error threshold value, the training layer is updated in a counter-propagation mode;
and repeatedly executing the sampling training process, and iteratively updating the training layer until the average classification error is smaller than an error threshold value, and taking out the attention layer in the training layer so as to obtain the blood margin relation among the tables.
2. The data blood-edge discovery method based on a graph neural network according to claim 1, wherein the output sample set is classified according to the clustering pseudo tag to obtain a classification result, and an average classification error is obtained according to the classification result and the clustering pseudo tag, and specifically comprises the following steps:
the classification result comprises a probability set corresponding to each output sample in the output sample set under each clustering pseudo tag;
converting each clustering pseudo tag into a corresponding coding vector;
and obtaining average classification errors according to the probability set corresponding to each output sample under each clustering pseudo tag in the output sample set and the coding vector corresponding to the clustering pseudo tag of the corresponding output sample.
3. The data blood-edge discovery method based on a graph neural network according to claim 1, wherein each table is converted into an output sample set after the table record newly added in the same batch passes through a training layer, specifically comprising:
the training layer comprises one or more of an input layer and an intermediate layer, wherein the intermediate layer comprises an attention layer;
each table is converted into an input sample set after the table record newly added in the same batch passes through an input layer, and the input sample set is converted into an output sample set after the input sample set passes through an intermediate layer.
4. A data blood-edge finding method based on a graph neural network as claimed in claim 3, wherein the input sample set is embodied as a homogeneous matrix.
5. The data blood-edge discovery method based on a graph neural network of claim 3, wherein the middle layer further comprises one or more of a ReLU layer, a regularization layer, and an output layer.
6. The method of any one of claims 1-5, wherein the output sample set is embodied as a high-dimensional feature matrix.
7. A data blood-edge discovery system based on a graph neural network, comprising: a graph neural network and a clustering device;
a clustering device: the method comprises the steps of clustering and obtaining a clustering pseudo tag;
graph neural network: the method comprises the steps of obtaining newly added table records of each table in the same batch, specifically, randomly selecting one table from each table, inserting one table record into each selected table, and correspondingly generating one table record in each other table; taking one table record inserted each time in the selected table and one table record correspondingly generated in other tables as a newly added table record of each table in the same batch;
the tables are converted into output sample sets after the tables newly added in the same batch are recorded through a training layer of the graph neural network;
the output sample set is classified according to the clustering pseudo tag to obtain a classification result, an average classification error is obtained according to the classification result and the clustering pseudo tag, and if the average classification error is greater than or equal to an error threshold value, the training layer is updated in a counter-propagation mode; and iteratively updating the training layer until the average classification error is smaller than an error threshold value, and taking out the attention layer in the training layer so as to obtain the blood margin relation among the tables.
8. The data blood-edge discovery system based on a graph neural network of claim 7, wherein the training layer further comprises an input layer;
the input layer: for converting the table records of the tables newly added in the same batch into an input sample set in a homogeneous matrix form.
9. A data blood-edge discovery device based on a graph neural network, comprising at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being programmed to perform the data blood-edge discovery method based on a graph neural network of any one of claims 1-6.
CN202110830737.6A 2021-07-22 2021-07-22 Data blood-edge discovery method, system and device based on graph neural network Active CN113469280B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110830737.6A CN113469280B (en) 2021-07-22 2021-07-22 Data blood-edge discovery method, system and device based on graph neural network


Publications (2)

Publication Number Publication Date
CN113469280A CN113469280A (en) 2021-10-01
CN113469280B (en) 2023-06-16

Family

ID=77882026


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115374223B (en) * 2022-06-30 2023-06-13 北京三维天地科技股份有限公司 Intelligent blood margin identification recommendation method and system based on rules and machine learning

Citations (1)

Publication number Priority date Publication date Assignee Title
CN113098892A (en) * 2021-04-19 2021-07-09 恒安嘉新(北京)科技股份公司 Data leakage prevention system and method based on industrial Internet

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US9659042B2 (en) * 2012-06-12 2017-05-23 Accenture Global Services Limited Data lineage tracking
CN109829057B (en) * 2019-01-11 2023-02-21 中山大学 Knowledge graph entity semantic space embedding method based on graph second-order similarity
CN112256980A (en) * 2020-10-23 2021-01-22 辽宁工程技术大学 Dynamic graph attention network-based multi-relation collaborative filtering recommendation
CN112948612B (en) * 2021-03-16 2024-02-06 杭州海康威视数字技术股份有限公司 Human body cover generation method and device, electronic equipment and storage medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant