CN111046914B - Semi-supervised classification method based on dynamic composition - Google Patents

Semi-supervised classification method based on dynamic composition

Info

Publication number
CN111046914B
CN111046914B CN201911131232.XA (application CN201911131232A)
Authority
CN
China
Prior art keywords
matrix
node
neighbor
distance
data
Prior art date
Legal status
Active
Application number
CN201911131232.XA
Other languages
Chinese (zh)
Other versions
CN111046914A (en)
Inventor
马君亮
肖冰
敬欣怡
何聚厚
汪西莉
Current Assignee
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date
Filing date
Publication date
Application filed by Shaanxi Normal University filed Critical Shaanxi Normal University
Priority to CN201911131232.XA priority Critical patent/CN111046914B/en
Publication of CN111046914A publication Critical patent/CN111046914A/en
Application granted granted Critical
Publication of CN111046914B publication Critical patent/CN111046914B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques


Abstract

The present disclosure relates to a semi-supervised classification method based on dynamic composition, comprising: S100, preparing a data set; S200, selecting edges on the data set prepared in step S100 by a dynamic nearest neighbor (DNN) method to obtain an adjacency matrix A; S300, calculating the similarity probabilities between nodes on the adjacency matrix A generated in step S200 by an adaptive degree weighting (ADW) method to obtain an affinity matrix M; S400, performing label propagation according to the affinity matrix M obtained in step S300 to obtain the final classification result. The classification method provided by the disclosure can capture the distribution of the data: more edges are connected in dense data regions and fewer edges in sparse regions, so the density of the data is better reflected and a better classification effect is achieved.

Description

Semi-supervised classification method based on dynamic composition
Technical Field
The present disclosure relates to data classification methods, and in particular to a semi-supervised classification method based on dynamic composition (DCG, Dynamic Construction Graph).
Background
Existing data classification methods include supervised, semi-supervised and unsupervised classification. Supervised classification requires a large amount of labeled data to train the model, so its application scenarios are limited. Unsupervised classification needs no class information and is widely applicable, but the lack of class information leads to poor classification results. Semi-supervised classification needs only a small amount of labeled data, so the acquisition cost is low, and it can obtain better classification results by learning the data distribution of a large amount of unlabeled data; it therefore has wide application scenarios.
Graph-based semi-supervised classification is an important branch of semi-supervised classification; because it makes full use of the relations between data, it often obtains good results and has received great attention. However, current graph-based semi-supervised classification methods usually construct the similarity graph by the k-nearest-neighbor (kNN) or ε-nearest-neighbor method. During graph construction only the attribute features of the data are used and the class information of the labeled data is not, so the obtained similarity graph cannot reflect the actual situation well and the classification result is inaccurate.
Different graph structures are constructed based on different assumptions about the data distribution. An ideal graph should possess the following three features: the edge selection algorithm should reflect the data distribution, selecting more neighbors in dense regions and fewer in sparse regions; the measure of similarity should relate not only to distance but also to local structure; and the construction algorithm should reduce the influence of manually set parameters on the construction result. Because the edge selection algorithms and similarity calculation methods in the prior art each have their own limitations, the results of such classification methods are not accurate enough, and a new semi-supervised classification method is needed to make data classification more accurate.
Disclosure of Invention
To address these problems, the present disclosure provides a semi-supervised classification method based on dynamic composition, which selects edges by a dynamic nearest neighbor (DNN, Dynamic Nearest Neighbor) method and calculates similarity by an adaptive degree weighting (ADW, Adaptive Degree Weighting) method; it can describe the local characteristics of the data well and thereby improves the accuracy of data classification.
The semi-supervised classification method based on dynamic composition first selects edges with the DNN method, dynamically choosing the D-neighbors of each node; it then calculates the edge weights, i.e. the similarity probabilities between nodes, with the ADW method; finally it classifies on the graph with the Learning with Local and Global Consistency (LLGC) algorithm.
Specifically, a semi-supervised classification method based on dynamic composition comprises the following steps:
S100, preparing a data set, wherein the data set comprises two parts, labeled data X_l and unlabeled data X_u; the label information of the labeled data X_l is F_l; the characteristics of the data in the data set are described by data attribute information; l denotes the number of labeled data; the data in the data set are abstracted into n nodes in an m-dimensional space, the i-th node being denoted p_i;
S200, selecting edges on the data set prepared in step S100 by the dynamic nearest neighbor (DNN) method to obtain an adjacency matrix A, specifically comprising:
S201, calculating the Euclidean distances between the nodes in the data set to obtain a direct distance matrix S;
S202, selecting the D-neighbors of each node p_i by the dynamic nearest neighbor (DNN) method, taking the connections from p_i to its D-neighbors as the selected edges, and generating an adjacency matrix A based on the D-neighbors; A is an n×n matrix in which the entry A_ij is 1 if p_j is a D-neighbor of p_i and 0 otherwise, A_ij denoting the value in row i, column j of A;
S300, calculating the similarity probabilities between nodes on the adjacency matrix A generated in step S200 by the ADW method to obtain an affinity matrix M, specifically:
S301, defining a distance matrix S' from the direct distance matrix S of step S201 and the adjacency matrix A of step S202, S'_ij denoting the value in row i, column j of S', specifically:
when i ≠ j, S'_ij = A_ij · exp(−S_ij² / (2σ²)), i.e. the Gaussian-kernel similarity of adjacent node pairs, where σ is the kernel width;
when i = j, S'_ij = 0;
S302, defining a weight matrix W from the distance matrix S' defined in step S301; W is an n×n matrix in which W_ij describes the weight of the edge between node p_i and node p_j, i.e. the value in row i, column j of W;
S303, normalizing the weight matrix W defined in step S302 to obtain an affinity matrix M; M is an n×n matrix in which M_ij is the probability that node p_i and node p_j are similar, i.e. the value in row i, column j of M;
S400, performing label propagation according to the affinity matrix M obtained in step S300 to obtain the final classification result.
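As an illustrative sketch only (not the patented implementation), the four steps can be strung together as follows; the helper names distance_matrix, dnn_adjacency, adw_affinity and llgc_propagate are our hypothetical functions, sketched after the corresponding steps below:

```python
import numpy as np

def classify(X, Y, alpha=0.99):
    """Hedged sketch of S100-S400. X: (n, m) data; Y: (n, c) one-hot labels
    with all-zero rows for unlabeled samples; alpha: LLGC adjustment parameter."""
    S = distance_matrix(X)            # S201: direct distance matrix
    A = dnn_adjacency(S)              # S202: dynamic-neighbor edge selection
    M = adw_affinity(S, A)            # S300: ADW weighting + normalization
    F = llgc_propagate(M, Y, alpha)   # S400: LLGC label propagation
    return F.argmax(axis=1)           # y_i = argmax_j F_ij
```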
Preferably, the data sets in step S100 include the synthetic data sets TwoSpirals, ToyData, FourGaussian and TwoMoon, and the image data sets USPS, Mnist, Mnist-3495, Coil20, Coil(1500), G241d and COIL2.
Preferably, in step S201 the Euclidean distance between nodes p_i and p_j in the data set is:

d(p_i, p_j) = √( Σ_{k=1}^{m} (x_ik − x_jk)² )

where m is the dimension of the data, p_i and p_j are the i-th and j-th nodes in the graph, and x_ik and x_jk are the k-th coordinates of p_i and p_j respectively. From the Euclidean distances between the nodes a direct distance matrix S is generated; S is an n×n two-dimensional matrix whose entry S_ij, the value in row i, column j, stores the Euclidean distance between node p_i and node p_j.
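As a minimal sketch of step S201 (assuming NumPy; the function name distance_matrix is ours, not the patent's):

```python
import numpy as np

def distance_matrix(X):
    """Direct distance matrix S of shape (n, n) for data X of shape (n, m):
    S[i, j] = sqrt(sum_k (x_ik - x_jk)**2)."""
    diff = X[:, None, :] - X[None, :, :]     # (n, n, m) coordinate differences
    return np.sqrt((diff ** 2).sum(axis=2))  # Euclidean distances
```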
Preferably, the step S201 further includes:
the Euclidean distance between each node and other nodes in the direct distance matrix S is sequenced from small to large to obtain a matrix O, and an index matrix E corresponding to the direct distance matrix S is generated at the same time, specifically, for the ith row in the direct distance matrix S, the stored distances are sequenced from small to large, and the distances sequenced as j-1 are stored in O ij In the direct distance matrix S, the positions of the distances in the direct distance matrix S are stored in E ij The corresponding position of the distance stored in the matrix O in the direct distance matrix S can be searched through the index matrix E; the matrix O and the index matrix E are both n×n two-dimensional matrices, E ij Representing the elements of the ith row and jth column of the index matrix E.
Preferably, in step S202 the D-neighbors of a node p_i are selected by the dynamic nearest neighbor (DNN) method, specifically by an algebraic method:

N(p_i) denotes the set of D-neighbors of p_i. In matrix O, row i, column j stores the distance of rank j from node p_i, and through the index matrix E its position S_im in the direct distance matrix S can be found; that is, node p_m is the node whose distance to p_i has rank j, and this node is denoted n_j, the sample point of rank j in distance from p_i, with d(·,·) denoting the distance measure. The judgment proceeds as follows: the nearest neighbor n_1 is always added to the D-neighbors; each already-accepted neighbor then serves as a reference point, and a candidate n_j (examined in order of increasing distance) is judged to be a D-neighbor of p_i when d(p_i, n_j) < d(n_r, n_j) for every reference point n_r accepted so far, i.e. when n_j lies on the p_i side of the perpendicular bisector of every segment p_i–n_r; each accepted candidate becomes a further reference point, and so on until some n_j is not a D-neighbor of p_i, at which point the judgment stops. The result obtained is the set of D-neighbors of node p_i, and the connections from p_i to its D-neighbors are taken as the selected edges.
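A sketch of the algebraic D-neighbor selection under the reading above (a candidate is accepted iff it is closer to p_i than to every already-accepted neighbor, and the scan stops at the first rejection); dnn_adjacency is our hypothetical helper name:

```python
import numpy as np

def dnn_adjacency(S):
    """DNN edge selection from a direct distance matrix S of shape (n, n).
    Returns the adjacency matrix A with A[i, j] = 1 iff p_j is a D-neighbor of p_i."""
    n = S.shape[0]
    E = np.argsort(S, axis=1)            # rank-ordered neighbor indices per row
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        accepted = [E[i, 1]]             # the nearest neighbor is always added
        A[i, E[i, 1]] = 1
        for j in range(2, n):            # examine candidates by increasing distance
            cand = E[i, j]
            # p_i side of every perpendicular bisector: d(p_i, cand) < d(ref, cand)
            if all(S[i, cand] < S[ref, cand] for ref in accepted):
                accepted.append(cand)
                A[i, cand] = 1
            else:
                break                    # stop at the first non-D-neighbor
    return A
```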
Preferably, in step S202 the D-neighbors of a node p_i are selected by the dynamic nearest neighbor (DNN) method, specifically by a geometric method:

D-neighbor search process of node p_i: row i, column j of matrix O stores the distance of rank j from node p_i, and through the index matrix E its position S_im in the direct distance matrix S can be found; that is, node p_m is the node whose distance to p_i has rank j, and this node is denoted n_j. The nearest neighbor n_1 is added to the D-neighbors; the perpendicular bisector of segment p_i–n_1 divides the plane into two regions, and the region to which the D-neighbors of p_i belong is the side close to p_i. Within that region the neighbor nearest to p_i, i.e. n_2, is selected, and the perpendicular bisector of p_i–n_2 further restricts the region to which new D-neighbors belong, again the side close to p_i. Within this region the point nearest to p_i is selected, added, and used as the next reference point. This is repeated until the region close to p_i enclosed by all the perpendicular bisectors becomes a closed region; all nodes inside the closed region are the D-neighbors of p_i, and the connections from p_i to its D-neighbors are taken as the selected edges.
Preferably, in the step S302, the weight matrix W is defined as:
W_ij = deg(p_i) · S'_ij

where p_i is a node in the graph, W_ij denotes the value in row i, column j of the weight matrix W, deg(p_i) is the degree of node p_i, and S'_ij denotes the distance-based similarity between node p_i and node p_j, i.e. the value in row i, column j of the distance matrix S'.
Preferably, in the step S303, an affinity matrix M is defined, specifically:
from the weight matrix W, a diagonal matrix T is calculated using the following formula:
T_ii = Σ_j W_ij

where T_ii denotes the value in row i, column i of the diagonal matrix T, and W_ij denotes the value in row i, column j of the weight matrix W.
The affinity matrix M is obtained after normalization using the following formula:
M = T^{-1/2} W T^{-1/2}
where T is the diagonal matrix, W is the weight matrix, and M is the affinity matrix.
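Steps S301-S303 can be sketched as follows, using the Gaussian-kernel reading of S' from S301 above; sigma is an assumed kernel width and adw_affinity is our hypothetical helper name:

```python
import numpy as np

def adw_affinity(S, A, sigma=1.0):
    """ADW sketch: S'_ij = A_ij * exp(-S_ij^2 / (2*sigma^2)), W_ij = deg(p_i) * S'_ij,
    T_ii = sum_j W_ij, M = T^(-1/2) W T^(-1/2)."""
    Sp = A * np.exp(-S ** 2 / (2.0 * sigma ** 2))  # distance matrix S' (zero where no edge)
    np.fill_diagonal(Sp, 0.0)                      # S'_ii = 0
    deg = A.sum(axis=1)                            # deg(p_i): number of D-neighbors
    W = deg[:, None] * Sp                          # weight matrix W
    t = np.maximum(W.sum(axis=1), 1e-12)           # T_ii, guarded against empty rows
    T_inv_sqrt = np.diag(1.0 / np.sqrt(t))
    return T_inv_sqrt @ W @ T_inv_sqrt             # affinity matrix M
```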
Preferably, the label propagation in step S400 is performed by the local and global consistency (LLGC) method, calculated as follows:

F = (I − αM)^{-1} Y

where F is an n×c matrix, n being the number of nodes and c the number of label classes; F_ij denotes the probability that node p_i is marked with the j-th label, i.e. the value in row i, column j of F; I is the identity matrix, α is an adjustment parameter, M is the affinity matrix, and Y is the label information, an n×c matrix storing the label information of each node, with Y_ij the value in row i, column j of Y: if node p_i is marked with the j-th class label, Y_ij = 1, otherwise Y_ij = 0; (I − αM)^{-1} gives each node the probability of acquiring the labels of the marked nodes;
finally, the label information of node p_i is obtained, specifically:

y_i = argmax_{j≤c} F_ij

where F_ij is the value in row i, column j of the matrix F, and argmax assigns to y_i the value of j at which F_ij attains its maximum, i.e. the label of node p_i is y_i; after labels are obtained for all the nodes, the classification of the data is completed.
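As a sketch of step S400 (llgc_propagate is our hypothetical helper; the default alpha is our assumption, and solving the linear system avoids forming the inverse explicitly):

```python
import numpy as np

def llgc_propagate(M, Y, alpha=0.99):
    """Closed-form LLGC: F = (I - alpha*M)^(-1) Y for affinity M (n, n) and
    label matrix Y (n, c) with Y[i, j] = 1 iff node p_i carries label j."""
    n = M.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * M, Y)

# y_i = argmax_j F_ij gives each node its predicted label:
# labels = llgc_propagate(M, Y).argmax(axis=1)
```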
Compared with the prior art, the method has the following beneficial technical effects:
(1) The semi-supervised classification method based on dynamic composition can better express potential distribution characteristics of data;
(2) The semi-supervised classification method based on dynamic composition adopts the dynamic-neighbor edge selection method and then calculates the edge weights by the ADW method before classification; the classification method can capture the distribution of the data, connecting more edges in dense regions and fewer edges in sparse regions, so the density of the data is better reflected and a better classification effect is achieved.
Drawings
FIG. 1 shows a flow chart of a dynamic composition-based semi-supervised classification algorithm of the present disclosure;
FIG. 2 is a schematic diagram of a DNN edge selection method on a two-dimensional plane;
FIG. 3 shows a schematic diagram of a D-nearest neighbor search process on a two-dimensional plane;
FIG. 4 (a) shows a TwoSpirals dataset;
FIG. 4 (b) shows a ToyData dataset;
FIG. 4 (c) shows a FourGaussian dataset;
FIG. 4 (d) shows a TwoMoon dataset;
FIG. 5 (a) shows the TwoSpirals dataset DNN composition result;
FIG. 5 (b) shows the TwoSpirals dataset DNN prediction result;
FIG. 5 (c) shows the TwoSpirals dataset kNN composition result;
FIG. 5 (d) shows the TwoSpirals dataset kNN prediction result;
FIG. 6 (a) shows the ToyData dataset DNN composition result;
FIG. 6 (b) shows the ToyData dataset DNN prediction result;
FIG. 6 (c) shows the ToyData dataset kNN composition result when k=5;
FIG. 6 (d) shows the ToyData dataset kNN prediction result when k=5;
FIG. 6 (e) shows the ToyData dataset kNN composition result when k=10;
FIG. 6 (f) shows the ToyData dataset kNN prediction result when k=10;
FIG. 6 (g) shows the ToyData dataset kNN composition result when k=15;
FIG. 6 (h) shows the ToyData dataset kNN prediction result when k=15.
Detailed Description
The semi-supervised classification method based on dynamic composition provided by the present disclosure is described below.
The invention is explained with reference to fig. 1 to 6 (h). In one embodiment, as shown in fig. 1, a semi-supervised classification method based on dynamic composition is provided, comprising steps S100 to S400:
S100, preparing a data set, wherein the data set comprises two parts, labeled data X_l and unlabeled data X_u; the label information of the labeled data X_l is F_l; the characteristics of the data in the data set are described by data attribute information; l denotes the number of labeled data; the data in the data set are abstracted into n nodes in an m-dimensional space, the i-th node being denoted p_i;
S200, selecting edges on the data set prepared in step S100 by the dynamic nearest neighbor (DNN) method to obtain an adjacency matrix A, specifically comprising:
S201, calculating the Euclidean distances between the nodes in the data set to obtain a direct distance matrix S;
S202, selecting the D-neighbors of each node p_i by the dynamic nearest neighbor (DNN) method, taking the connections from p_i to its D-neighbors as the selected edges, and generating an adjacency matrix A based on the D-neighbors; A is an n×n matrix in which the entry A_ij is 1 if p_j is a D-neighbor of p_i and 0 otherwise, A_ij denoting the value in row i, column j of A;
S300, calculating the similarity probabilities between nodes on the adjacency matrix A generated in step S200 by the ADW method to obtain an affinity matrix M, specifically:
S301, defining a distance matrix S' from the direct distance matrix S of step S201 and the adjacency matrix A of step S202, S'_ij denoting the value in row i, column j of S', specifically:
when i ≠ j, S'_ij = A_ij · exp(−S_ij² / (2σ²)), i.e. the Gaussian-kernel similarity of adjacent node pairs, where σ is the kernel width;
when i = j, S'_ij = 0;
S302, defining a weight matrix W from the distance matrix S' defined in step S301; W is an n×n matrix in which W_ij describes the weight of the edge between node p_i and node p_j, i.e. the value in row i, column j of W;
S303, normalizing the weight matrix W defined in step S302 to obtain an affinity matrix M; M is an n×n matrix in which M_ij is the probability that node p_i and node p_j are similar, i.e. the value in row i, column j of M;
S400, performing label propagation according to the affinity matrix M obtained in step S300 to obtain the final classification result.
In this embodiment, edges are selected on the prepared data set by the dynamic nearest neighbor DNN method to obtain the adjacency matrix A, the affinity matrix M is calculated, and label propagation is then performed to obtain the final classification result. The classification method can capture the distribution of the data, connecting more edges in dense regions and fewer edges in sparse regions, so the density of the data is better reflected and a better classification effect is achieved. The DNN method may be implemented by an algebraic or a geometric method.
In another embodiment, the classification method proposed by the present disclosure may be applied to classification on a variety of data sets, as long as the data set contains data and matching labels, such as the synthetic data sets TwoSpirals, ToyData, FourGaussian and TwoMoon, and the image data sets USPS, Mnist, Mnist-3495, Coil20, Coil(1500), G241d and COIL2.
The numbers of samples, attributes and categories of the synthetic data sets TwoSpirals, ToyData, FourGaussian and TwoMoon are shown in Table 1:

Table 1 Synthetic data sets

Synthetic data set   Number of samples   Attribute dimension   Number of categories
TwoSpirals           2000                2                     2
ToyData              788                 2                     7
FourGaussian         1200                2                     4
TwoMoon              400                 2                     2
The numbers of samples, attributes and categories of the image data sets USPS, Mnist, Mnist-3495, Coil20, Coil(1500), G241d and COIL2 are shown in Table 2:

Table 2 Image data sets

Image data set   Number of samples   Attribute dimension   Number of categories
USPS             1800                256                   6
Mnist            6996                784                   10
Mnist-3495       3495                784                   10
Coil20           1440                1024                  20
Coil(1500)       1500                241                   6
G241d            1500                241                   2
COIL2            1500                241                   2
The above data sets are all existing, publicly available benchmark data sets.
In another embodiment, in step S201 the Euclidean distance between nodes p_i and p_j in the data set is:

d(p_i, p_j) = √( Σ_{k=1}^{m} (x_ik − x_jk)² )

where m is the dimension of the space, p_i and p_j are the i-th and j-th nodes in the graph, and x_ik and x_jk are the k-th coordinates of p_i and p_j respectively. From the Euclidean distances between the nodes a direct distance matrix S is generated; S is an n×n two-dimensional matrix whose element S_ij, in row i, column j, stores the Euclidean distance between node p_i and node p_j.
In another embodiment, the step S201 further includes:
The Euclidean distances between each node and the other nodes in the direct distance matrix S are sorted in ascending order to obtain a matrix O, and at the same time an index matrix E corresponding to the direct distance matrix S is generated. Specifically, for the i-th row of S, the stored distances are sorted in ascending order; the distance of rank j−1 is stored in O_ij, and its position in the direct distance matrix S is stored in E_ij, so that the position in S of any distance stored in O can be looked up through the index matrix E. Both O and E are n×n two-dimensional matrices, and E_ij denotes the element in row i, column j of E.
In another embodiment, step S202 selects the D-neighbors of a node p_i by the dynamic nearest neighbor (DNN) method, specifically by an algebraic method:

N(p_i) denotes the set of D-neighbors of p_i. In matrix O, row i, column j stores the distance of rank j from node p_i, and through the index matrix E its position S_im in the direct distance matrix S can be found; that is, node p_m is the node whose distance to p_i has rank j, and this node is denoted n_j, the sample point of rank j in distance from p_i, with d(·,·) denoting the distance measure. The judgment proceeds as follows: the nearest neighbor n_1 is always added to the D-neighbors; each already-accepted neighbor then serves as a reference point, and a candidate n_j (examined in order of increasing distance) is judged to be a D-neighbor of p_i when d(p_i, n_j) < d(n_r, n_j) for every reference point n_r accepted so far, i.e. when n_j lies on the p_i side of the perpendicular bisector of every segment p_i–n_r; each accepted candidate becomes a further reference point, and so on until some n_j is not a D-neighbor of p_i, at which point the judgment stops. The result obtained is the set of D-neighbors of node p_i, and the connections from p_i to its D-neighbors are taken as the selected edges.
In this embodiment, the D-neighbor determination of p_i on a two-dimensional plane is shown in fig. 2. To ensure the connectivity of the graph structure in DNN, i.e. so that no isolated nodes appear, at least one connection is established for every sample; therefore the nearest neighbor n_1 is added to the D-neighbors of p_i, and p_i is connected to n_1. The dashed line in the figure is the perpendicular bisector of segment p_i–n_1. It is then judged whether n_2 is a D-neighbor of p_i according to the criterion defined above: when d(p_i, n_2) < d(n_1, n_2), i.e. when n_2 lies on the p_i side of the perpendicular bisector and within the circle centered at p_i with radius d(p_i, n_2), n_2 is added to the set N(p_i). A later candidate may lie at the same distance from p_i as an accepted neighbor and still become, or fail to become, a D-neighbor of p_i, depending on the criterion with respect to the accepted reference points. The search process for the D-neighbors of sample point p_i is essentially a search over concentric circles centered at p_i whose radii are the distances from p_i to its successive neighbors: the denser the local data distribution around p_i, the more concentric circles are searched and the more D-neighbors are generated.
Therefore, the method can capture the distribution of the data, more edges are connected in the data dense area, and fewer edges are connected in the data sparse area, so that the degree of the density of the data can be reflected better, and the method has a better classification effect.
The graphs constructed using DNN are connected: in the graph G = (X, ε), select a connected vertex subset Z; find a point B ∈ X outside Z and a point A ∈ Z such that B is the point closest to the vertex set Z; then point B is an adaptive neighbor of point A, that is, point B is connected with the vertex set Z, and point B is added to the connected subset Z. By repeating the above operation all points in the graph can be added to the connected subset Z, so all points in the graph G = (X, ε) are connected.
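The connectivity property can be checked directly on a constructed adjacency matrix, e.g. with a breadth-first search over the undirected version of the graph (a verification sketch, not part of the method):

```python
from collections import deque
import numpy as np

def is_connected(A):
    """True iff the graph with adjacency matrix A has one connected component,
    treating an edge in either direction as a connection."""
    n = A.shape[0]
    und = (A + A.T) > 0                  # symmetrize: undirected reachability
    seen = np.zeros(n, dtype=bool)
    queue = deque([0])
    seen[0] = True
    while queue:
        u = queue.popleft()
        for v in np.flatnonzero(und[u]):
            if not seen[v]:
                seen[v] = True
                queue.append(v)
    return bool(seen.all())
```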
The graph constructed using DNN does not have the problem of weak connectivity. By contradiction, assume that point A is an adaptive neighbor of point B while point B is not an adaptive neighbor of point A. Since point B is not an adaptive neighbor of point A, there must exist a point C such that the path cost of B–C–A is less than the path cost of B–A; then the path cost of A–B is greater than the path cost of A–C–B, so point A is not an adaptive neighbor of point B, contradicting the assumption. Hence weak connectivity does not occur.
In another embodiment, step S202 selects the D-neighbors of a node p_i by the dynamic nearest neighbor (DNN) method, specifically by a geometric method:

D-neighbor search process of node p_i: row i, column j of matrix O stores the distance of rank j from node p_i, and through the index matrix E its position S_im in the direct distance matrix S can be found; that is, node p_m is the node whose distance to p_i has rank j, and this node is denoted n_j. The nearest neighbor n_1 is added to the D-neighbors; the perpendicular bisector of segment p_i–n_1 divides the plane into two regions, and the region to which the D-neighbors of p_i belong is the side close to p_i. Within that region the neighbor nearest to p_i, i.e. n_2, is selected, and the perpendicular bisector of p_i–n_2 further restricts the region to which new D-neighbors belong, again the side close to p_i. Within this region the point nearest to p_i is selected, added, and used as the next reference point. This is repeated until the region close to p_i enclosed by all the perpendicular bisectors becomes a closed region; all nodes inside the closed region are the D-neighbors of p_i, and the connections from p_i to its D-neighbors are taken as the selected edges.
In this embodiment, a closed region is a region such that the line segment connecting any point inside the region to any point outside it intersects the boundary of the region.
In this embodiment, the D-neighbor determination of p_i on a two-dimensional plane is shown in fig. 3. To ensure connectivity of the graph structure in DNN, i.e. so that no isolated nodes appear, at least one connection is established for every sample; therefore n_1 is added to the D-neighbors of p_i, and p_i is connected to n_1. Dashed line (1) in the figure is the perpendicular bisector of p_i–n_1, dividing the planar region into two parts; the region of the D-neighbors of p_i is the side close to p_i, i.e. the region to the left of dashed line (1). In that region the neighbor nearest to p_i, n_2, is selected and p_i is connected to n_2; dashed line (2) is the perpendicular bisector of p_i–n_2, and lines (1) and (2) divide the plane into four parts, of which the D-neighbor region of p_i is the side close to p_i, i.e. left of line (1) and below line (2). In that region the neighbor nearest to p_i, n_3, is selected and connected; dashed line (3) is the perpendicular bisector of p_i–n_3, and lines (1), (2) and (3) divide the plane into six parts, of which the D-neighbor region is the side close to p_i, i.e. left of line (1), below line (2) and right of line (3). In that region the neighbor nearest to p_i, n_4, is selected and connected; dashed line (4) is the perpendicular bisector of p_i–n_4, and lines (1)-(4) divide the plane into nine parts; the D-neighbor region of p_i is the closed region enclosed by the left side of line (1), the lower side of line (2), the right side of line (3) and the upper side of line (4), and all nodes inside this region are D-neighbors of node p_i.
It can be seen that the more densely the local data around p_i are distributed, the more nodes fall inside the closed region and the more D-neighbors p_i has. The method can thus capture the distribution of the data, connecting more edges in dense regions and fewer edges in sparse regions, so the density of the data is better reflected and a better classification effect is achieved.
In another embodiment, in the step S302, the weight matrix W is defined as:
W_ij = deg(p_i) · S'_ij

where p_i is a node in the graph, W_ij denotes the value in row i, column j of the weight matrix W, deg(p_i) is the degree of node p_i, and S'_ij denotes the distance-based similarity between node p_i and node p_j, i.e. the value in row i, column j of the distance matrix S'.
In another embodiment, the step S303 defines an affinity matrix M, specifically:
from the weight matrix W, a diagonal matrix T is calculated using the following formula:
T_ii = Σ_j W_ij

where T_ii denotes the value in row i, column i of the diagonal matrix T, and W_ij denotes the value in row i, column j of the weight matrix W.
The affinity matrix M is obtained after normalization using the following formula:
M = T^{-1/2} W T^{-1/2}
where T is the diagonal matrix, W is the weight matrix, and M is the affinity matrix.
In the embodiment, the similarity probability of the nodes in the graph in the ADW is determined by the similarity and the node degree, the distribution of the data is expressed by the node degree in the ADW, the similarity between the data is expressed by the Gaussian kernel function, the ADW is simple to calculate, the algorithm complexity is low, and the method has the following two advantages:
(1) The formula reduces the overfitting problem caused by weight parameterization and is insensitive to noisy data. In experiments on synthetic data sets, ADW was observed to be robust to noise in the input data, and its performance advantage was demonstrated on 7 real data sets.
(2) ADW has no additional tuning parameters.
In another embodiment, the label propagation in step S400 is implemented by the local and global consistency LLGC (Learning with Local and Global Consistency) method, calculated as follows:

F = (I − αM)^{-1} Y

where F is an n×c matrix, n being the number of nodes and c the number of label classes; F_ij denotes the probability that node p_i is marked with the j-th label, i.e. the value in row i, column j of F; I is the identity matrix, α is an adjustment parameter, M is the affinity matrix, and Y is the label information, an n×c matrix storing the label information of each node, with Y_ij the value in row i, column j of Y: if node p_i is marked with the j-th class label, Y_ij = 1, otherwise Y_ij = 0; (I − αM)^{-1} gives each node the probability of acquiring the labels of the marked nodes;
finally, the label information of node p_i is obtained, specifically:

y_i = argmax_{j≤c} F_ij

where F_ij is the value in row i, column j of the matrix F, and argmax assigns to y_i the value of j at which F_ij attains its maximum, i.e. the label of node p_i is y_i; after labels are obtained for all the nodes, the classification of the data is completed.
The steps of the semi-supervised classification method based on dynamic composition provided by the present disclosure have been specifically introduced above; the superiority of the method over existing data classification methods is illustrated below by experimental comparison.
Experiment:
to illustrate the superiority of the dynamic composition-based semi-supervised classification approach presented in this disclosure, experiments were performed on synthetic and real datasets widely used in graph-based semi-supervised learning. The method is mainly used for verifying the potential distribution characteristics of the expression data of the proposed method, and improving the semi-supervised classification method. Comparing the method proposed by the present disclosure with the kNN method, the kNN method refers to that by finding k nearest neighbors of a sample, and assigning an average value of the attributes of the nearest neighbors to the sample, the attributes of the sample can be obtained. One of the evaluation criteria of the graph construction method is: on the premise of using the same deduction method, whether better classification performance can be achieved or not is judged, and an LLGC classification method is adopted in the experiment. For multi-classification problems, the algorithm performance is evaluated using Error Rate (Error Rate) as shown in the following equation:
wherein c is the total number of sample categories, N i For the number of samples of class i, F i Is the number of misclassified samples in the class i samples.
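Under this pooled reading of the formula (summing the misclassified counts over all classes and dividing by the total sample count), the error rate reduces to the overall misclassification fraction:

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Error rate = sum_i F_i / sum_i N_i, i.e. the fraction of all samples
    whose predicted class differs from the true class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))
```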
Results of synthetic dataset experiments
The classification performance of DNN+LLGC and kNN+LLGC was verified using the 4 synthetic data sets in Table 1: TwoSpirals, ToyData, FourGaussian and TwoMoon. Their sample numbers, attribute numbers and category numbers are listed in Table 1, and they are randomly generated two-dimensional data. The TwoSpirals data set has 1000 positive and 1000 negative samples and, as shown in fig. 4 (a), is distributed in a double-helix shape. The ToyData data set has 788 samples belonging to 7 classes, each sample having 2 attributes, as shown in fig. 4 (b). The FourGaussian data set consists of 1200 samples belonging to 4 classes, with 2 attributes per sample, as shown in fig. 4 (c). The TwoMoon data set has 200 positive and 200 negative samples and, as shown in fig. 4 (d), is distributed in a double-moon shape.
Taking the TwoSpirals and ToyData data sets as examples, the graphs constructed by the DNN and kNN methods and the class prediction results are shown in fig. 5 (a)-5 (d) and fig. 6 (a)-6 (h); the composition results are the graphs obtained with the above edge selection and edge re-weighting methods, and the prediction results are the classification results obtained by label propagation on the composed graphs. In fig. 5 (a)-5 (d), the dark points and light points represent the two classes of data. Fig. 5 (a) and 5 (c) are the DNN and kNN graphs of the TwoSpirals data set respectively. In the DNN graph, edges still connect light-colored data points that are far apart, better expressing the correlation between similar sample points. Fig. 5 (b) and 5 (d) show the class prediction results obtained by the DNN and kNN algorithms, and it can be seen that the DNN prediction is more accurate. In fig. 6 (a)-6 (h), points of different shades represent the seven classes of data. Fig. 6 (a) and 6 (b) show the composition and class prediction results obtained by the DNN method; fig. 6 (c), 6 (e) and 6 (g) show the composition results obtained by the kNN method for k = 5, 10, 15, and fig. 6 (d), 6 (f) and 6 (h) the corresponding class prediction results.
In the classification experiments, the numbers of labeled samples are shown in Table 3, and all other data are unlabeled. In the graph construction process the kNN (k = 5, 10, 15) and DNN methods are used, and LLGC is used for label propagation. Each experiment was repeated 20 times and the average accuracy was calculated. The classification results are shown in Table 3: the first column is the name of the synthetic data set, the second the number of labeled samples, the third the composition method, the fourth the different values of the parameter k in the kNN method, the fifth and sixth the minimum and maximum node degree in the graph, and the seventh and eighth the average classification error rate and its standard deviation. It can be seen that, except for the TwoSpirals data set, the classification error rate of the DNN method is smaller than that of the kNN method for the other data sets under the different values of k.
Table 3 classification performance of synthetic dataset
Image dataset experimental results
In order to compare the classification performance of different graph-construction and inference methods, combinations of them were applied to the following 7 image data sets, presented in Table 2. Each data set consists of gray-scale images, and the gray values are taken as the feature values of each image.
In each class, randomly selected sample points are taken as the labeled samples; the number of labeled samples per class is shown in Table 4, and the remaining sample points are unlabeled. In the experiments, the PCA method is used for dimensionality reduction; PCA maps the features of high-dimensional data to a low-dimensional space by retaining the important features and removing noise and unimportant features, and in the experiments the data are reduced to 50 dimensions. The unlabeled samples were classified and each experiment repeated 10 times; the classification results (average classification error rate) are shown in Table 4: the first column is the name of the image data set, the second the number of labeled samples, the third the composition method, the fourth the different values of the parameter k in the kNN method, the fifth and sixth the minimum and maximum node degree in the graph, and the seventh and eighth the average classification error rate and its standard deviation. It can be seen that, for each data set, the classification error rate of the DNN method is smaller than that of the kNN method under the different values of k. Therefore, the semi-supervised classification method based on dynamic composition can improve classification accuracy.
Table 4 classification performance of image dataset
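The preprocessing step described above can be sketched as follows (assuming scikit-learn's PCA; the variable names are ours):

```python
from sklearn.decomposition import PCA

# X: (n_samples, n_pixels) matrix of gray values, one row per image (assumed input).
X50 = PCA(n_components=50).fit_transform(X)  # reduce to 50 dimensions before graph construction
```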
Although the embodiments of the present invention have been described above with reference to the accompanying drawings, the present invention is not limited to the above-described specific embodiments and application fields, and the above-described specific embodiments are merely illustrative, and not restrictive. Those skilled in the art, having the benefit of this disclosure, may effect numerous forms of the invention without departing from the scope of the invention as claimed.

Claims (8)

1. A computer-executed semi-supervised classification method based on dynamic composition, which addresses the problem that in graph-based semi-supervised classification both the edge selection algorithm and the similarity calculation method have their respective limitations, so that the classification result is not accurate enough, characterized by comprising the following steps:
S100, preparing a data set, wherein the data set comprises two parts, labeled data X_l and unlabeled data X_u; the label information of the labeled data X_l is F_l; the characteristics of the data in the data set are described by data attribute information; l denotes the number of labeled data; the data in the data set are abstracted into n nodes in an m-dimensional space, the i-th node being denoted p_i;
S200, selecting edges on the data set prepared in step S100 by the dynamic nearest neighbor (DNN) method to obtain an adjacency matrix A, specifically comprising:
S201, calculating the Euclidean distances between the nodes in the data set to obtain a direct distance matrix S;
S202, selecting the D-neighbors of each node p_i by the dynamic nearest neighbor (DNN) method, taking the connections from p_i to its D-neighbors as the selected edges, and generating an adjacency matrix A based on the D-neighbors; A is an n×n matrix in which the entry A_ij is 1 if p_j is a D-neighbor of p_i and 0 otherwise, A_ij denoting the value in row i, column j of A;
S300, calculating the similarity probabilities between nodes on the adjacency matrix A generated in step S200 by the ADW method to obtain an affinity matrix M, specifically:
S301, defining a distance matrix S' from the direct distance matrix S of step S201 and the adjacency matrix A of step S202, S'_ij denoting the value in row i, column j of S', specifically:
when i ≠ j, S'_ij = A_ij · exp(−S_ij² / (2σ²)), i.e. the Gaussian-kernel similarity of adjacent node pairs, where σ is the kernel width;
when i = j, S'_ij = 0;
S302, defining a weight matrix W from the distance matrix S' defined in step S301; W is an n×n matrix in which W_ij describes the weight of the edge between node p_i and node p_j, i.e. the value in row i, column j of W;
S303, normalizing the weight matrix W defined in step S302 to obtain an affinity matrix M; M is an n×n matrix in which M_ij is the probability that node p_i and node p_j are similar, i.e. the value in row i, column j of M;
S400, performing label propagation according to the affinity matrix M obtained in step S300 to obtain the final classification result;
wherein the data sets in step S100 include the synthetic data sets TwoSpirals, ToyData, FourGaussian and TwoMoon, and the image data sets USPS, Mnist, Mnist-3495, Coil20, Coil(1500), G241d and COIL2.
2. The method according to claim 1, wherein in step S201 the Euclidean distance between nodes p_i and p_j in the data set is:

d(p_i, p_j) = √( Σ_{k=1}^{m} (x_ik − x_jk)² )

where m is the dimension of the data, p_i and p_j are the i-th and j-th nodes in the graph, and x_ik and x_jk are the k-th coordinates of p_i and p_j respectively. From the Euclidean distances between the nodes a direct distance matrix S is generated; S is an n×n two-dimensional matrix whose entry S_ij, the value in row i, column j, stores the Euclidean distance between node p_i and node p_j.
3. The method of claim 2, said step S201 further comprising:
The Euclidean distances between each node and the other nodes in the direct distance matrix S are sorted in ascending order to obtain a matrix O, and at the same time an index matrix E corresponding to the direct distance matrix S is generated. Specifically, for the i-th row of S, the stored distances are sorted in ascending order; the distance of rank j−1 is stored in O_ij, and its position in the direct distance matrix S is stored in E_ij, so that the position in S of any distance stored in O can be looked up through the index matrix E. Both O and E are n×n two-dimensional matrices, and E_ij denotes the element in row i, column j of E.
4. A method according to claim 3, wherein in step S202 the D-neighbors of a node p_i are selected by the dynamic nearest neighbor (DNN) method, specifically by an algebraic method:

N(p_i) denotes the set of D-neighbors of p_i. In matrix O, row i, column j stores the distance of rank j from node p_i, and through the index matrix E its position S_im in the direct distance matrix S can be found; that is, node p_m is the node whose distance to p_i has rank j, and this node is denoted n_j, the sample point of rank j in distance from p_i, with d(·,·) denoting the distance measure. The judgment proceeds as follows: the nearest neighbor n_1 is always added to the D-neighbors; each already-accepted neighbor then serves as a reference point, and a candidate n_j (examined in order of increasing distance) is judged to be a D-neighbor of p_i when d(p_i, n_j) < d(n_r, n_j) for every reference point n_r accepted so far, i.e. when n_j lies on the p_i side of the perpendicular bisector of every segment p_i–n_r; each accepted candidate becomes a further reference point, and so on until some n_j is not a D-neighbor of p_i, at which point the judgment stops. The result obtained is the set of D-neighbors of node p_i, and the connections from p_i to its D-neighbors are taken as the selected edges.
5. A method according to claim 3, wherein in step S202 the D-neighbors of a node p_i are selected by the dynamic nearest neighbor (DNN) method, specifically by a geometric method:

D-neighbor search process of node p_i: row i, column j of matrix O stores the distance of rank j from node p_i, and through the index matrix E its position S_im in the direct distance matrix S can be found; that is, node p_m is the node whose distance to p_i has rank j, and this node is denoted n_j. The nearest neighbor n_1 is added to the D-neighbors; the perpendicular bisector of segment p_i–n_1 divides the plane into two regions, and the region to which the D-neighbors of p_i belong is the side close to p_i. Within that region the neighbor nearest to p_i, i.e. n_2, is selected, and the perpendicular bisector of p_i–n_2 further restricts the region to which new D-neighbors belong, again the side close to p_i. Within this region the point nearest to p_i is selected, added, and used as the next reference point. This is repeated until the region close to p_i enclosed by all the perpendicular bisectors becomes a closed region; all nodes inside the closed region are the D-neighbors of p_i, and the connections from p_i to its D-neighbors are taken as the selected edges.
6. The method according to claim 1, wherein in the step S302, the weight matrix W is defined as:
W_ij = deg(p_i) · S'_ij

where p_i is a node in the graph, W_ij denotes the value in row i, column j of the weight matrix W, deg(p_i) is the degree of node p_i, and S'_ij denotes the distance-based similarity between node p_i and node p_j, i.e. the value in row i, column j of the distance matrix S'.
7. The method according to claim 1, wherein the step S303 defines an affinity matrix M, specifically:
from the weight matrix W, a diagonal matrix T is calculated using the following formula:
T_ii = Σ_j W_ij

where T_ii denotes the value in row i, column i of the diagonal matrix T, and W_ij denotes the value in row i, column j of the weight matrix W;
the affinity matrix M is obtained after normalization using the following formula:
M = T^{-1/2} W T^{-1/2}
where T is the diagonal matrix, W is the weight matrix, and M is the affinity matrix.
8. The method according to claim 1, wherein the label propagation in step S400 is performed by the local and global consistency LLGC method, calculated as follows:

F = (I − αM)^{-1} Y

where F is an n×c matrix, n being the number of nodes and c the number of label classes; F_ij denotes the probability that node p_i is marked with the j-th label, i.e. the value in row i, column j of F; I is the identity matrix, α is an adjustment parameter, M is the affinity matrix, and Y is the label information, an n×c matrix storing the label information of each node, with Y_ij the value in row i, column j of Y: if node p_i is marked with the j-th class label, Y_ij = 1, otherwise Y_ij = 0; (I − αM)^{-1} gives each node the probability of acquiring the labels of the marked nodes;
finally, the label information of node p_i is obtained, specifically:

y_i = argmax_{j≤c} F_ij

where F_ij is the value in row i, column j of the matrix F, and argmax assigns to y_i the value of j at which F_ij attains its maximum, i.e. the label of node p_i is y_i; after labels are obtained for all the nodes, the classification of the data is completed.
CN201911131232.XA 2019-11-20 2019-11-20 Semi-supervised classification method based on dynamic composition Active CN111046914B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911131232.XA CN111046914B (en) 2019-11-20 2019-11-20 Semi-supervised classification method based on dynamic composition


Publications (2)

Publication Number Publication Date
CN111046914A CN111046914A (en) 2020-04-21
CN111046914B true CN111046914B (en) 2023-10-27

Family

ID=70232184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911131232.XA Active CN111046914B (en) 2019-11-20 2019-11-20 Semi-supervised classification method based on dynamic composition

Country Status (1)

Country Link
CN (1) CN111046914B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019001071A1 (en) * 2017-06-28 2019-01-03 浙江大学 Adjacency matrix-based graph feature extraction system and graph classification system and method
CN109815986A (en) * 2018-12-24 2019-05-28 陕西师范大学 The semisupervised classification method of fusion part and global characteristics
CN109829472A (en) * 2018-12-24 2019-05-31 陕西师范大学 Semisupervised classification method based on probability neighbour
CN110309871A (en) * 2019-06-27 2019-10-08 西北工业大学深圳研究院 A kind of semi-supervised learning image classification method based on random resampling


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Semi-supervised classification algorithm based on C-means clustering and graph transduction; Wang Na, Wang Xiaofeng, Geng Guohua, Song Qiannan; Journal of Computer Applications (09); full text *
Research on image classification algorithms based on semi-supervised deep belief networks; Zhu Changbao, Cheng Yong, Gao Qiang; Computer Science (S1); full text *

Also Published As

Publication number Publication date
CN111046914A (en) 2020-04-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant