CN111275201A - Sub-graph division based distributed implementation method for semi-supervised learning of graph - Google Patents
- Publication number
- CN111275201A CN111275201A CN202010068356.4A CN202010068356A CN111275201A CN 111275201 A CN111275201 A CN 111275201A CN 202010068356 A CN202010068356 A CN 202010068356A CN 111275201 A CN111275201 A CN 111275201A
- Authority
- CN
- China
- Prior art keywords
- graph
- node
- formula
- solution
- subgraph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Medical Informatics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a distributed implementation method of semi-supervised learning of a graph based on sub-graph division, comprising the following steps: 1) constructing a graph; 2) modeling an optimization problem; 3) sub-graph division and optimization problem decomposition; 4) solving the sub-problems and splicing the solutions; 5) iterative solution; 6) distributed solving. The method requires little computation time, the data it needs are easy to acquire, and the result it computes on large-scale data is consistent with the centralized result.
Description
Technical Field
The invention relates to the technical field of machine learning and graph signal processing, in particular to a distributed implementation method of graph semi-supervised learning based on sub-graph partitioning.
Background
Ours is an era of big data: acquiring and storing data is simpler than ever, and with such huge volumes of data, extracting valuable information from them becomes all the more critical. Machine learning algorithms are now used to process big data, and algorithms such as neural networks have achieved notable results in practical applications; at present, however, these algorithms take a long time to train, and the labeled data needed for training are very difficult to obtain.
Graph semi-supervised learning is an important branch of machine learning with unique advantages over other machine learning algorithms. First, it is a transductive algorithm: it computes the result directly, without training a model. Second, as a semi-supervised algorithm its demand for data is low: only a small portion of labeled data is needed to label the rest. For these reasons, it is worthwhile to improve the solution method of the current graph semi-supervised learning problem.
Disclosure of Invention
The invention aims to provide a distributed implementation method of graph semi-supervised learning based on sub-graph partitioning, addressing the defects of the prior art. The method requires little computation time, the data it needs are easy to acquire, and the result it computes on large-scale data is consistent with the centralized result.
The technical scheme for realizing the purpose of the invention is as follows:
a distributed implementation method of semi-supervised learning of a graph based on sub-graph partitioning comprises the following steps:
1) constructing a graph: the data set for semi-supervised learning is X = {x_1, x_2, ..., x_N}. There are a total of N samples in the data set, with x_n representing the n-th sample. The labels in the data set all come from a set of c classes {1, 2, ..., c}; the label information of {x_1, x_2, ..., x_l} is known, with corresponding labels {y_1, y_2, ..., y_l}, and the label information of {x_{l+1}, ..., x_N} is unknown. According to the similarity of the samples, a graph G = (V, E) is established, where V and E are the node set and the edge set respectively: each node in V corresponds to a sample in the data set, and E contains the connection information of the nodes;
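The graph construction of step 1) can be sketched as follows. The patent does not fix a similarity measure at this point (the MNIST example later uses Euclidean distance plus kNN), so the Gaussian kernel width `sigma` and the neighbour count `k` below are assumed illustrative parameters.

```python
import numpy as np

def build_knn_graph(X, k=5, sigma=1.0):
    """Build a symmetric kNN adjacency matrix W from samples X (N x d).

    Edge weights use a Gaussian kernel on Euclidean distance; sigma is
    an assumed kernel-width parameter (not specified in the patent).
    """
    N = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)          # no self-loops
    W = np.zeros((N, N))
    for n in range(N):
        nbrs = np.argsort(d2[n])[:k]      # k nearest neighbours of sample n
        W[n, nbrs] = np.exp(-d2[n, nbrs] / (2.0 * sigma**2))
    return np.maximum(W, W.T)             # symmetrize: keep an edge if either endpoint chose it
```

The symmetrization step matters because kNN relations are not mutual; a symmetric W keeps the Laplacian of step 2) well defined.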
2) modeling an optimization problem: the label information of the data set to be processed is represented as a graph signal f = [f_1, ..., f_N]^T, whose value at each node is the label of the corresponding sample, and the optimization problem of graph semi-supervised learning is defined as:

F* = argmin_F Σ_{j=1}^{c} ( τ‖F_{:,j} − Y_{:,j}‖₂² + S(F_{:,j}) )    (1)
the method comprises the following steps that (1) information of each type of label is transmitted to a sample with unknown label information, then, a final classification result is extracted by a formula (2), F in the formula (1) is a classification matrix, Y is a known label information matrix which are both N multiplied by c, Y is generated through a data set F to be processed, and the generation mode is as follows:
in the formula (1), the first and second groups,for the matching term, τ is the weighting factor, S (F):,j) Is set as a penalty itemWherein L isnorm=I-D-1/2WD-1/2Normalizing the Laplace matrix for the graph, wherein I is a unit matrix, D is a degree matrix, and W is an adjacency matrix, and the propagation of the jth label information in the formula (1) is expressed as:
Setting the gradient of equation (4) to zero, it can be expressed as the linear system:

(τI + L_norm) f_j = τ y_j, i.e. C f_j = b_j with C = τI + L_norm and b_j = τ y_j.    (5)
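Under the reading of formulas (4)–(5) above (with the right-hand-side scaling b_j = τ·y_j taken as an assumption consistent with the matching term τ‖f_j − y_j‖₂²), a minimal centralized solve can be sketched in Python:

```python
import numpy as np

def centralized_label_propagation(W, Y, tau=1.0):
    """Solve (tau*I + L_norm) f_j = tau * y_j for every label column y_j.

    W : (N, N) adjacency matrix, Y : (N, c) one-hot known-label matrix.
    Returns the (N, c) classification matrix F.
    """
    N = W.shape[0]
    d = W.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1e-12)), 0.0)
    L_norm = np.eye(N) - (d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :])
    C = tau * np.eye(N) + L_norm
    return np.linalg.solve(C, tau * Y)    # one solve per label column

def extract_labels(F):
    """Formula (2): each sample takes the label with the largest score."""
    return np.argmax(F, axis=1)
```

Note that the argmax of formula (2) is invariant to a common positive scaling of the right-hand side, so the exact scaling of b_j does not change the predicted labels.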
3) sub-graph division and optimization problem decomposition: an indicator operator T_{B(k,2r)} is used to divide the graph G into subgraphs. The indicator operator is an N × N diagonal matrix defined as:

(T_{B(k,2r)})_{ii} = 1 if node i ∈ B(k, 2r), and 0 otherwise.    (7)
In formula (7), B(k, 2r) is the set of nodes that contains node k and its neighbor nodes within a radius of 2r hops. For each node k, a subgraph G_k centered at k with node set B(k, 2r) is divided out; for a graph with N nodes there are N such subgraphs. The optimization problem (5) is projected onto each subgraph, so that for each node k and the corresponding subgraph G_k there is an optimization subproblem:

C_k v_{j,k} = b_{j,k}, where C_k = T_{B(k,2r)} C T_{B(k,2r)} and b_{j,k} = T_{B(k,2r)} b_j.    (8)
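The node set B(k, 2r) and the indicator operator of formula (7) can be computed with a breadth-first search, assuming 2r is measured in hops; a minimal sketch:

```python
from collections import deque
import numpy as np

def ball(adj, k, radius):
    """B(k, radius): node k plus all nodes within `radius` hops of k.

    adj : dict mapping each node to an iterable of its neighbours.
    """
    dist = {k: 0}
    q = deque([k])
    while q:
        u = q.popleft()
        if dist[u] == radius:
            continue                      # do not expand past the radius
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return set(dist)

def indicator(nodes, N):
    """Formula (7): diagonal 0/1 matrix selecting the nodes of a subgraph."""
    T = np.zeros((N, N))
    for i in nodes:
        T[i, i] = 1.0
    return T
```

In a deployed system each node can compute its own ball purely from 2r rounds of neighbour exchange, which is what makes the decomposition distributable.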
4) solving the subproblems and splicing the solutions: the subproblem (8) of step 3) is solved according to formula (10):

v_{j,k} = C_k^† b_{j,k}    (10)

where † denotes the solve restricted to the nodes of B(k, 2r).
In formula (10), C = τI + L_norm, b_j = τ y_j, and v_{j,k} is the local solution provided by node k and its neighbors. The local solutions on all the nodes are spliced and then averaged according to formula (11):

f̂_j(i) = (1/|K_i|) Σ_{k ∈ K_i} v_{j,k}(i), where K_i = {k : i ∈ B(k, 2r)}.    (11)
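The local solves of formula (10) and the averaging splice of formula (11) can be sketched as below. Since the displayed formulas did not survive extraction, the exact restriction used here — solving the sub-block of C indexed by B(k, 2r) — is an assumption consistent with the surrounding text.

```python
import numpy as np

def local_solutions(C, b, balls):
    """Formula (10): for each node k, solve the subproblem restricted to B(k, 2r)."""
    sols = {}
    for k, nodes in balls.items():
        idx = np.array(sorted(nodes))
        C_k = C[np.ix_(idx, idx)]         # restriction of C to the subgraph
        v_k = np.linalg.solve(C_k, b[idx])
        sols[k] = (idx, v_k)
    return sols

def splice_average(sols, N):
    """Formula (11): average, at each node i, every local solution covering i."""
    acc = np.zeros(N)
    cnt = np.zeros(N)
    for idx, v_k in sols.values():
        acc[idx] += v_k
        cnt[idx] += 1
    return acc / np.maximum(cnt, 1)
```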
5) iterative solution: f̂_j has an error with respect to the exact solution, so an iterative calculation according to formula (12) is adopted to make f̂_j approach the solution of the global problem (5):

f_j^{(m+1)}(i) = f_j^{(m)}(i) + (1/|K_i|) Σ_{k ∈ K_i} v_{j,k}^{(m)}(i), with v_{j,k}^{(m)} = C_k^† T_{B(k,2r)} (b_j − C f_j^{(m)})    (12)

where the superscript m in the upper right corner is the iteration number.
6-1) b_j is projected onto the subgraph G_k to obtain b_{j,k}, which is taken as the initial signal value on the nodes of G_k, i.e. v_{j,k}^{(0)} = b_{j,k};
6-4) the local approximate iterative solution is calculated and used to update the value of the local approximate solution on node k;
6-5) node k exchanges information with its neighbor nodes within 2r hops to generate the local solution v_{j,k} on the subgraph G_k;
6-7) information is exchanged again, this time exchanging the values of v_{j,k} on the nodes of the corresponding subgraphs, to generate the initial value of the next iteration;
6-8) the termination condition |v_j(k)| < ε is checked: if it is reached, the solutions on all the nodes are spliced together to generate the propagation result f_j of label j, and the propagation of the next label information starts; if it is not reached, the next iteration is carried out with the new initial value.
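Steps 6-1) to 6-8) can be combined into an additive-Schwarz-style loop: each node solves its subproblem on the current residual, the corrections are averaged over the overlapping subgraphs, and iteration stops when the largest correction falls below ε. The exact update of formula (12) did not survive extraction, so the correction form below is a sketch under that assumption, written as a centralized simulation of the node-level message exchange.

```python
import numpy as np

def distributed_solve(C, b, balls, eps=1e-8, max_iter=500):
    """Drive f toward the solution of C f = b using overlapping local solves.

    balls : dict node -> set of nodes B(k, 2r); each sweep simulates one
    round of steps 6-4) to 6-7), and max|correction| < eps plays the role
    of the termination test of step 6-8).
    """
    N = C.shape[0]
    f = np.zeros(N)
    for _ in range(max_iter):
        r = b - C @ f                      # global residual, assembled from neighbour exchanges
        corr = np.zeros(N)
        cnt = np.zeros(N)
        for nodes in balls.values():
            idx = np.array(sorted(nodes))
            v = np.linalg.solve(C[np.ix_(idx, idx)], r[idx])  # local solve on B(k, 2r)
            corr[idx] += v
            cnt[idx] += 1
        corr /= np.maximum(cnt, 1)         # formula (11)-style averaging of overlaps
        f += corr
        if np.max(np.abs(corr)) < eps:     # termination condition of step 6-8)
            break
    return f
```

When every ball covers the whole graph the loop reduces to the centralized solve in a single sweep; with genuinely local balls it converges over several sweeps, matching the behaviour reported in the simulation examples.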
According to the technical scheme, the global graph semi-supervised learning problem is converted into a series of simpler local problems by sub-graph division; each local problem is then assigned to the corresponding node in the graph, and all the nodes iterate together to solve for a global approximate result.
The method has a short training time, the data used for training are easy to acquire, and the calculation result obtained by semi-supervised learning under large-scale data is consistent with the centralized result.
Drawings
FIG. 1 is a schematic diagram of the iterative convergence of the method of the embodiment on the Minnesota data;
FIG. 2 is a schematic diagram of the iterative convergence of the method of the embodiment on the MNIST data.
Detailed Description
The invention will be further elucidated with reference to the drawings and examples, without however being limited thereto.
Example:
Steps 1) to 6-8) of the embodiment are carried out exactly as set out in the Disclosure of Invention above; in formula (12), the superscript m in the upper right corner denotes the iteration number.
Simulation example 1:
In this example, Minnesota traffic data are used for the simulation. The data set has 2642 data points whose values are only +1 and −1, so it can be regarded as a data set containing two classes of labels. Before the simulation, the label signal is sampled to generate data sets with different proportions of known labels: 1%, 2%, 5%, 10%, 20% and 50%. Table 1 compares the classification performance (classification accuracy and computation time) of the method of this embodiment with the centralized calculation method on the Minnesota data.
TABLE 1
Ratio | 1% | 2% | 5% | 10% | 20% | 50%
---|---|---|---|---|---|---
Accuracy | 90.29% | 93.94% | 95.88% | 97.38% | 98.19% | 99.11%
Time (s) | 0.2532 | 0.2782 | 0.2991 | 0.3181 | 0.3439 | 0.3847
Centralized accuracy | 90.65% | 93.96% | 95.88% | 97.38% | 98.19% | 99.11%
Centralized time (s) | 0.3565 | 0.3886 | 0.4055 | 0.3835 | 0.408 | 0.4115
The simulation results in Table 1 show that on the Minnesota data the difference between the method of this embodiment and the conventional centralized method is already small when the known-label ratio is low, gradually shrinks as the ratio increases, and vanishes once the ratio is high enough, where the two methods give identical results. As shown in FIG. 1, the method of this embodiment converges within a limited number of iterations, produces a good approximate result, and consumes less time than the centralized method.
Simulation example 2:
In this example, the MNIST data set is used for the simulation. MNIST is a data set of handwritten digit images of the characters 0-9; 10000 of its images are taken for the simulation. First, the Euclidean distance is used to measure the similarity between images; then the kNN algorithm is used to construct the graph of the data set. Finally, the label information is sampled to generate data sets with known-label ratios of 1%, 2%, 5%, 10%, 20% and 50%, the same as in simulation example 1. Table 2 compares the classification performance (classification accuracy and computation time) of the method of this embodiment with the centralized calculation method on the MNIST data.
TABLE 2
Ratio | 1% | 2% | 5% | 10% | 20% | 50%
---|---|---|---|---|---|---
Accuracy | 85.89% | 89.27% | 91.84% | 93.28% | 94.86% | 97.33%
Time (s) | 3.41 | 3.84 | 4.25 | 4.68 | 4.97 | 5.53
Centralized accuracy | 85.97% | 89.27% | 91.84% | 93.28% | 94.86% | 97.33%
Centralized time (s) | 16.17 | 16.25 | 16.39 | 16.24 | 16.26 | 16.38
The simulation results in Table 2 show that on the MNIST data the error trends of the method of this embodiment and the centralized method are consistent with simulation example 1: the difference between the two shrinks as the proportion of labeled data increases, and their classification results agree. In this simulation example, however, the method of this embodiment computes considerably faster than the centralized method. As shown in FIG. 2, the method not only converges within a limited number of iterations but also needs fewer steps to converge than in simulation example 1, which shows that the advantages of the method grow under large-scale data.
Claims (1)
1. A distributed implementation method of semi-supervised learning of a graph based on sub-graph partitioning, characterized by comprising the following steps:
1) constructing a graph: the data set for semi-supervised learning is X = {x_1, x_2, ..., x_N}; there are a total of N samples in the data set, with x_n representing the n-th sample; the labels in the data set all come from a set of c classes {1, 2, ..., c}, where the label information of {x_1, x_2, ..., x_l} is known, with corresponding labels {y_1, y_2, ..., y_l}, and the label information of {x_{l+1}, ..., x_N} is unknown; according to the similarity of the samples, a graph G = (V, E) is established, where V and E are the node set and the edge set respectively, each node in V corresponds to a sample in the data set, and E contains the connection information of the nodes;
2) modeling an optimization problem: the label information of the data set to be processed is represented as a graph signal f = [f_1, ..., f_N]^T, whose value at each node is the label of the corresponding sample, and the optimization problem of graph semi-supervised learning is defined as:

F* = argmin_F Σ_{j=1}^{c} ( τ‖F_{:,j} − Y_{:,j}‖₂² + S(F_{:,j}) )    (1)

formula (1) propagates the information of each type of label to the samples with unknown label information, and the final classification result is extracted by formula (2):

label(x_n) = argmax_{1 ≤ j ≤ c} F*_{n,j}    (2)

in formula (1), F is the classification matrix and Y is the known-label matrix, both of size N × c; Y is generated from the data set to be processed as:

Y_{n,j} = 1 if sample x_n is known to belong to class j, and Y_{n,j} = 0 otherwise    (3)

in formula (1), τ‖F_{:,j} − Y_{:,j}‖₂² is the matching term and τ is its weighting factor; the penalty term is set as S(F_{:,j}) = F_{:,j}^T L_norm F_{:,j}, where L_norm = I − D^{−1/2} W D^{−1/2} is the normalized graph Laplacian matrix, I is the identity matrix, D is the degree matrix, and W is the adjacency matrix; the propagation of the j-th label information in formula (1) is expressed as:

f_j* = argmin_{f_j} τ‖f_j − y_j‖₂² + f_j^T L_norm f_j    (4)

which is equivalent to the linear system:

(τI + L_norm) f_j = τ y_j, i.e. C f_j = b_j with C = τI + L_norm and b_j = τ y_j    (5)

3) sub-graph division and optimization problem decomposition: an indicator operator T_{B(k,2r)} is used to divide the graph G into subgraphs; the indicator operator is an N × N diagonal matrix defined as:

(T_{B(k,2r)})_{ii} = 1 if node i ∈ B(k, 2r), and 0 otherwise    (7)

in formula (7), B(k, 2r) is the set of nodes containing node k and its neighbor nodes within a radius of 2r hops; for each node k, a subgraph G_k centered at k with node set B(k, 2r) is divided out, and for a graph with N nodes there are N such subgraphs; the optimization problem (5) is projected onto each subgraph, so that for each node k and the corresponding subgraph G_k there is an optimization subproblem:

C_k v_{j,k} = b_{j,k}, where C_k = T_{B(k,2r)} C T_{B(k,2r)} and b_{j,k} = T_{B(k,2r)} b_j    (8)

4) solving the subproblems and splicing the solutions: the subproblem (8) of step 3) is solved according to formula (10):

v_{j,k} = C_k^† b_{j,k}    (10)

in formula (10), v_{j,k} is the local solution provided by node k and its neighbors; the local solutions on all the nodes are spliced and then averaged according to formula (11):

f̂_j(i) = (1/|K_i|) Σ_{k ∈ K_i} v_{j,k}(i), where K_i = {k : i ∈ B(k, 2r)}    (11)

5) iterative solution: f̂_j has an error with respect to the exact solution, so an iterative calculation according to formula (12) is adopted to make f̂_j approach the solution of the global problem (5):

f_j^{(m+1)}(i) = f_j^{(m)}(i) + (1/|K_i|) Σ_{k ∈ K_i} v_{j,k}^{(m)}(i), with v_{j,k}^{(m)} = C_k^† T_{B(k,2r)} (b_j − C f_j^{(m)})    (12)

in formula (12), the superscript m in the upper right corner is the iteration number;
6-1) b_j is projected onto the subgraph G_k to obtain b_{j,k}, which is taken as the initial signal value on the nodes of G_k, i.e. v_{j,k}^{(0)} = b_{j,k};
6-4) the local approximate iterative solution is calculated and used to update the value of the local approximate solution on node k;
6-5) node k exchanges information with its neighbor nodes within 2r hops to generate the local solution v_{j,k} on the subgraph G_k;
6-7) information is exchanged again, this time exchanging the values of v_{j,k} on the nodes of the corresponding subgraphs, to generate the initial value of the next iteration;
6-8) the termination condition |v_j(k)| < ε is checked: if it is reached, the solutions on all the nodes are spliced together to generate the propagation result f_j of label j, and the propagation of the next label information starts; if it is not reached, the next iteration is carried out with the new initial value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010068356.4A CN111275201A (en) | 2020-01-21 | 2020-01-21 | Sub-graph division based distributed implementation method for semi-supervised learning of graph |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111275201A true CN111275201A (en) | 2020-06-12 |
Family
ID=71001851
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010068356.4A Pending CN111275201A (en) | 2020-01-21 | 2020-01-21 | Sub-graph division based distributed implementation method for semi-supervised learning of graph |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111275201A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112417188A (en) * | 2020-12-10 | 2021-02-26 | 桂林电子科技大学 | Hyperspectral image classification method based on graph model |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200612