CN111275201A

CN111275201A - Sub-graph division based distributed implementation method for semi-supervised learning of graph

Info

Publication number: CN111275201A
Application number: CN202010068356.4A
Authority: CN
Inventors: 蒋俊正; 黄炟鑫; 冯海荣; 卢军志; 池源
Original assignee: Guilin University of Electronic Technology
Current assignee: Guilin University of Electronic Technology
Priority date: 2020-01-21
Filing date: 2020-01-21
Publication date: 2020-06-12

Abstract

The invention discloses a distributed implementation method of semi-supervised learning of a graph based on sub-graph division, which is characterized by comprising the following steps of: 1) constructing a graph; 2) modeling an optimization problem; 3) sub-graph division and optimization problem decomposition; 4) splicing the solution and the solution of the subproblem; 5) carrying out iterative solution; 6) and (4) distributed solving. The method has short calculation time and high acquisition speed of data used for calculation, and the calculation result obtained under large-scale data is consistent with a centralized result.

Description

Sub-graph division based distributed implementation method for semi-supervised learning of graph

Technical Field

The invention relates to the technical field of machine learning and graph signal processing, in particular to a distributed implementation method of graph semi-supervised learning based on sub-graph partitioning.

Background

The era of our modern times is a big data era, the acquisition and storage of data are simpler than ever, and under the huge data, how to extract valuable information from the data becomes more critical. Nowadays, machine learning algorithms are used for processing big data, and algorithms such as neural networks achieve certain results in practical application, however, at present, the training time of the algorithms is long, and the acquisition of data used for training is very difficult.

The semi-supervised learning of the graph is a relatively important part in a machine learning algorithm, and has unique advantages in comparison with other machine learning algorithms. First, the graph semi-supervised learning is a direct-push algorithm, which can directly calculate the result without training the model, and as a semi-supervised learning algorithm, the requirement on data is low, and only a small part of labeled known data is needed to label the rest of data. Based on this, it is necessary to improve the solution method of the current semi-supervised learning problem.

Disclosure of Invention

The invention aims to provide a distributed implementation method of graph semi-supervised learning based on sub-graph partitioning, aiming at the defects of the prior art. The method has short calculation time and high acquisition speed of data used for calculation, and the calculation result obtained under large-scale data is consistent with a centralized result.

The technical scheme for realizing the purpose of the invention is as follows:

a distributed implementation method of semi-supervised learning of a graph based on sub-graph partitioning comprises the following steps:

1) constructing a graph: the data set for semi-supervised learning is

There are a total of N samples, x, in the dataset_nRepresenting the nth sample, the labels in the data set all come from

With a class c tag, where { x₁,x₂,...x_lThe label information of is known, the corresponding label information is

And { x_l+1,...,x_nThe tag information of } is unknown, according to

The similarity of the samples in the drawing is established

And E are respectively a node set and an edge set,

each node in the E group corresponds to each sample in the data set, and the E group comprises the connection information of the nodes;

2) modeling an optimization problem: representing label information of a data set to be processed into a graph signal f ═ f₁,…,f_n]^TThe signal value is a label of the corresponding sample, and the optimization problem of semi-supervised learning of the graph is defined as follows:

the method comprises the following steps that (1) information of each type of label is transmitted to a sample with unknown label information, then, a final classification result is extracted by a formula (2), F in the formula (1) is a classification matrix, Y is a known label information matrix which are both N multiplied by c, Y is generated through a data set F to be processed, and the generation mode is as follows:

in the formula (1), the first and second groups,

for the matching term, τ is the weighting factor, S (F)_:,j) Is set as a penalty item

Wherein L is_norm＝I-D^-1/2WD^-1/2Normalizing the Laplace matrix for the graph, wherein I is a unit matrix, D is a degree matrix, and W is an adjacency matrix, and the propagation of the jth label information in the formula (1) is expressed as:

equation (4) can be expressed as:

f in formula (5)_j＝F_:,j，y_j＝Y_:,j，

Is f_jTransposing;

3) sub-graph division and optimization problem decomposition: using indicator operators

To the picture

Sub-graph partitioning to indicate operators

Is a diagonal matrix and is defined as:

b (k,2r) in equation (7) is a set of nodes that contains k and neighbor nodes within a radius of 2r, for node k,

dividing a subgraph taking k as a center and taking a node set as B (k,2r) for a node k

For a graph with N nodes, there are N such subgraphs, and the optimization problem as shown in equation (5) is projected onto each subgraph, for each node k and the corresponding subgraph

There is a corresponding optimization sub-problem as shown in equation (8):

f in formula (8)_j,kIs f in the subfigure

The projection of (2) is performed,

4) solving the subproblem and splicing the solution: solving the subproblem shown in the formula (8) in the step 3) according to the formula (10):

in formula (10), C ═ τ I + L_norm，b_j＝y_j，

Is the local solution provided by the node k and its neighbors, and the local solutions on all nodes are solved according to the formula (11)

Splicing, and then averaging:

in formula (11)

Is an approximation to the solution of the global problem of equation (5);

5) and (3) iterative solution:

has an error from the actual solution, so that an iterative calculation method is adopted according to the formula (12)

Approaching the solution of the global problem as equation (5):

6) distributed solution: for node k and corresponding subgraph

Comprises the following steps:

6-1)b_jis projected to the subgraph

To obtain b_j,kLet it be a subgraph

Initial signal value of upper node, i.e.

6-2) operator P is projected onto the subgraph

The corresponding information vector p_kTo represent;

6-3) calculating

It is the value of the local approximate iterative solution at node k;

6-4) calculation

Updating the value of the local approximate solution on the node k by using the local approximate iterative solution;

6-5) making the node k exchange information with the neighbor nodes in the 2r to generate a subgraph

Partial solution of v_j,k；

6-6) calculation

6-7) again, this time with the exchange of information

In the corresponding subgraph

The value on the node, the initial value of the next iteration is generated

6-8) judging the termination condition | v_j(k) If | < ε is reached: if the end condition | v is reached_j(k) If | < ε, then all nodes are processed

Spliced together to generate propagation result f of label j_jAnd starting the propagation of the next label information; if the termination condition | v is not reached_j(k) If | < ε, proceed to the initial value

The next iteration.

According to the technical scheme, a global graph semi-supervised learning problem is converted into a series of simpler local problems by adopting subgraph division, then each simple local problem is distributed to a corresponding node in a graph, iterative solution is carried out through all nodes in the graph, and a global approximate result is solved.

The method has short training time and high acquisition speed of data used for training, and the calculation result is obtained by semi-supervised learning under large-scale data and is consistent with a centralized result.

Drawings

FIG. 1 is a schematic diagram of iterative convergence in Minnesota data for an example method;

fig. 2 is a schematic diagram of an iterative convergence situation of the MNIST data according to the embodiment.

Detailed Description

The invention will be further elucidated with reference to the drawings and examples, without however being limited thereto.

Example (b):

1) constructing a graph: the data set for semi-supervised learning is

The data set has N samples, and the labels in the data set are all from

And { x_l+1,...,x_nThe tag information of } is unknown, according to

The similarity of the samples in the drawing is established

And E are respectively a node set and an edge set,

in the formula (1), the first and second groups,

for the matching term, τ is the weighting factor,

is set as a penalty item

equation (4) can be expressed as:

f in formula (5)_j＝F_:,j，y_j＝Y_:,j，

Is f_jTransposing;

To the picture

Sub-graph partitioning to indicate operators

Is a diagonal matrix and is defined as:

There is a corresponding optimization sub-problem as shown in equation (8):

f in formula (8)_j,kIs f in the subfigure

The projection of (2) is performed,

in formula (10), C ═ τ I + L_norm，b_j＝y_j，

Splicing together, and then obtainingAveraging:

in formula (11)

Is an approximation to the solution of the global problem of equation (5);

5) and (3) iterative solution:

Approaching the solution of the global problem as equation (5):

in the formula (12), the first and second groups,

m in the upper right corner is the number of iterations,

6) distributed solution: for node k and corresponding subgraph

Comprises the following steps:

6-1)b_jis projected to the subgraph

To obtain b_j,kLet it be a subgraph

Initial signal value of upper node, i.e.

6-2) operator P is projected onto the subgraph

The corresponding information vector p_kTo represent;

6-3) calculating

It is the value of the local approximate iterative solution at node k;

6-4) calculation

Partial solution of v_j,k；

6-6) calculation

Preparing for the next iteration;

6-7) again, this time with the exchange of information

In the corresponding subgraph

The value on the node, the initial value of the next iteration is generated

6-8) judging the termination condition | v_j(k) If | less epsilon is reached: if the end condition | v is reached_j(k) If | < ε, then all nodes are processed

The next iteration.

Simulation example 1:

in this example, the simulation learning is performed by using Minnesota traffic data, which has 2642 data points and only +1 and-1 values, so that the Minnesota traffic data can be regarded as a database containing two types of tags, before the simulation, the tag signals are sampled to generate data sets with different known tag data ratios, and for different situations, data sets with known tag ratios of 1%, 2%, 5%, 10%, 20% and 50% are generated. The comparison of classification performance in Minnesota data by using the method of the present embodiment and the centralized calculation method, including the classification accuracy and the calculation time, is shown in table 1.

TABLE 1

Ratio of

1％

2％

5％

10％

20％

50％

Accuracy rate

90.29％

93.94％

95.88％

97.38％

98.19％

99.11％

Time consuming(s)

0.2532

0.2782

0.2991

0.3181

0.3439

0.3847

Centralized accuracy

90.65％

93.96％

95.88％

97.38％

98.19％

99.11％

Centralized time consuming(s)

0.3565

0.3886

0.4055

0.3835

0.408

0.4115

The simulation results in table 1 show that in Minnesota data, the error between the method of the present embodiment and the conventional centralized method is slightly small when the label ratio is known to be low, and gradually becomes smaller as the ratio increases, and when the ratio reaches a certain degree, the results of the two methods are the same, as shown in fig. 1, the method of the present embodiment can converge within a limited number of iterations, the method of the present embodiment can produce a better approximate result, and the time consumed by the method of the present embodiment is short.

Simulation example 2:

in this example, MNIST data set is used for simulation learning, MNIST is a hand-written digital character data set, MNIST data content is 0-9 hand-written digital pictures, 10000 pictures in MNIST are taken out for simulation in this example, firstly, Euclidean distance is used for measuring similarity between pictures, and then, kNN algorithm is used for constructing a picture of a database

Finally, sampling the label information to generate data sets with different label information ratios, and generating the data sets with label ratios of 1%, 2%, 5%, 10%, 20% and 50% respectively, which are the same as the simulation example 1. The comparison of the classification performance in the MNIST data by using the present example method and the centralized calculation method, including the classification accuracy and the calculation time, is shown in table 2.

TABLE 2

Ratio of

1％

2％

5％

10％

20％

50％

Accuracy rate

85.89％

89.27％

91.84％

93.28％

94.86％

97.33％

Time consuming(s)

3.41

3.84

4.25

4.68

4.97

5.53

Centralized accuracy

85.97％

89.27％

91.84％

93.28％

94.86％

97.33％

Centralized time consuming(s)

16.17

16.25

16.39

16.24

16.26

16.38

The simulation results in table 2 show that, in MNIST data, the error variation trends of the method of the present embodiment and the method of the centralized type are consistent with those of the simulation example 1, and both of the error variation trends become smaller with the increase of the proportion of labeled data, and the effects of the two are consistent, but in the simulation example, the calculation speed of the method of the present embodiment is faster or faster than that of the method of the centralized type, as shown in fig. 2, the curve in fig. 2 shows that not only can the method of the present embodiment converge within the limited number of iterations, but also the number of steps required for convergence is less than that in the simulation case 1, which shows that the method of the present embodiment can show advantages under large-scale data.

Claims

1. A distributed implementation method of semi-supervised learning of a graph based on sub-graph partitioning is characterized by comprising the following steps:

1) constructing a graph: the data set for semi-supervised learning is