CN111046914A - Semi-supervised classification method based on dynamic composition - Google Patents


Info

Publication number
CN111046914A
Authority
CN
China
Prior art keywords
matrix
node
neighbor
distance
data
Prior art date
Legal status
Granted
Application number
CN201911131232.XA
Other languages
Chinese (zh)
Other versions
CN111046914B (en)
Inventor
马君亮
肖冰
敬欣怡
何聚厚
汪西莉
Current Assignee
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date
Filing date
Publication date
Application filed by Shaanxi Normal University filed Critical Shaanxi Normal University
Priority to CN201911131232.XA
Publication of CN111046914A
Application granted
Publication of CN111046914B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a semi-supervised classification method based on dynamic composition, which includes: S100, preparing a data set; S200, selecting edges on the data set prepared in step S100 by using the dynamic nearest neighbor (DNN) method to obtain an adjacency matrix A; S300, calculating the similarity probability among the nodes of the adjacency matrix A generated in step S200 by using the ADW method to obtain an affinity matrix M; S400, carrying out label propagation according to the affinity matrix M obtained in step S300 to obtain the final classification result. The classification method can capture the distribution of the data: more edges are connected in dense data regions and fewer edges in sparse regions, so the density of the data is better reflected and a better classification effect is achieved.

Description

Semi-supervised classification method based on dynamic composition
Technical Field
The present disclosure relates to a data classification method, and in particular to a semi-supervised classification method based on dynamic composition, DCG (Dynamic Configuration Graph).
Background
Existing data classification methods include supervised, semi-supervised and unsupervised classification. Supervised classification requires a large amount of labeled data to train the model, which limits its application scenarios. Unsupervised classification needs no class information and is widely applicable, but the lack of class information leads to poor classification results. Semi-supervised classification requires only a small amount of labeled data, which is cheap to acquire, and can achieve good classification results by learning the data distribution from a large amount of unlabeled data, so it has wide application scenarios.
Graph-based semi-supervised classification is an important branch of semi-supervised classification; because it makes full use of the relations among the data, it often achieves good results and has attracted wide attention. However, current graph-based semi-supervised classification methods usually construct the similarity graph with the k-nearest neighbor (kNN) or ε-nearest neighbor method; only the attribute features of the data are used in constructing the graph, while the category information of the labeled data is not, so the resulting similarity graph does not reflect the actual situation well and the classification result is relatively inaccurate.
Different graph structures are constructed under different assumptions about the data distribution. An ideal graph should have the following three features: the edge selection algorithm should reflect the data distribution, selecting more neighbors in dense regions and fewer neighbors in sparse regions; the similarity measure should depend not only on distance but also on the local structure; and the graph construction algorithm should reduce the influence of manually set parameters on the construction result. Since the edge selection algorithms and similarity calculation methods in the prior art each have their own limitations and the results of the existing classification methods are not accurate enough, a new semi-supervised classification method is urgently needed to make data classification more accurate.
Disclosure of Invention
In order to solve the above problems, the present disclosure provides a semi-supervised classification method based on dynamic composition, which adopts a dynamic nearest neighbor (DNN) method to select edges and an adaptive degree weighting (ADW) method to calculate similarity. The method can well describe the local features of the data, thereby improving the accuracy of data classification.
In the semi-supervised classification method based on dynamic composition, the DNN method is first used for edge selection, dynamically selecting the D neighbors of each node; the ADW method is then used to calculate the weights of the edges, i.e. the similarity probabilities among the nodes; finally, the local and global consistency (LLGC) algorithm is used to classify on the graph.
Specifically, a semi-supervised classification method based on dynamic composition comprises the following steps:
S100, preparing a data set, wherein the data set comprises two parts, labeled data X_l and unlabeled data X_u; the labeled data X_l carry label information F_l; the characteristics of the data in the data set are described by data attribute information; l represents the number of labeled data; the data in the data set are abstracted into n nodes in an m-dimensional space, and the i-th node is denoted p_i;
S200, selecting edges on the data set prepared in step S100 by using the dynamic nearest neighbor DNN method to obtain an adjacency matrix A, specifically:
S201, calculating the Euclidean distances among the nodes in the data set to obtain a direct distance matrix S;
S202, selecting the D neighbors of each node p_i by using the dynamic nearest neighbor DNN method, taking the connecting lines between p_i and its D neighbors as the selected edges, and generating the adjacency matrix A based on the D neighbors, A being an n × n matrix; in the adjacency matrix A, if p_j is a D neighbor of p_i, the corresponding position A_ij in the matrix is 1, and otherwise 0, A_ij representing the value of the i-th row and j-th column in the adjacency matrix A;
S300, calculating the similarity probability among the nodes of the adjacency matrix A generated in step S200 by using the ADW method to obtain an affinity matrix M, specifically:
S301, defining a distance matrix S' according to the direct distance matrix S in step S201 and the adjacency matrix A defined in step S202, S'_ij representing the value of the i-th row and j-th column in the distance matrix S', specifically defined as:
when i ≠ j, S'_ij = A_ij · exp(-S_ij² / (2σ²)), i.e. a Gaussian kernel of the direct distance restricted to adjacent nodes;
when i = j, S'_ij = 0;
S302, defining a weight matrix W according to the distance matrix S' defined in step S301, the weight matrix W being an n × n matrix, W_ij describing the similarity of node p_i and node p_j, i.e. the value of the i-th row and j-th column of the weight matrix W;
S303, normalizing the weight matrix W defined in step S302 to obtain an affinity matrix M, the affinity matrix M being an n × n matrix, M_ij describing the similarity probability of node p_i and node p_j, i.e. the value of the affinity matrix M at row i, column j;
S400, carrying out label propagation according to the affinity matrix M obtained in step S300 to obtain the final classification result.
Preferably, the data sets in step S100 include the synthetic data sets TwoSpirals, ToyData, FourGaussian and TwoMoon, and the image data sets USPS, Mnist, Mnist-3495, Coil20, Coil(1500), G241d and COIL2.
Preferably, in step S201, the Euclidean distance between nodes p_i and p_j in the data set is:

d(p_i, p_j) = √( Σ_{k=1}^{m} (x_ik - x_jk)² )

where m denotes the dimension of the data, p_i and p_j denote the i-th and j-th nodes in the graph, and x_ik and x_jk are the k-th dimension coordinates of nodes p_i and p_j, respectively; the direct distance matrix S is generated from the Euclidean distances between the nodes, the direct distance matrix S being an n × n two-dimensional matrix, with S_ij, the value of the i-th row and j-th column in the matrix, storing the Euclidean distance between node p_i and node p_j.
Preferably, step S201 further includes:
sorting the Euclidean distances between each node and the other nodes in the direct distance matrix S from small to large to obtain a matrix O, and simultaneously generating an index matrix E corresponding to the direct distance matrix S; the specific process is that, for the i-th row in the direct distance matrix S, the stored distances are sorted from small to large, the distance ranked j-1 is stored in O_ij, and the position of this distance in the direct distance matrix S is stored in E_ij, so that the position in the direct distance matrix S of any distance stored in matrix O can be found through the index matrix E; the matrix O and the index matrix E are both n × n two-dimensional matrices, and E_ij represents the element of the i-th row and j-th column of the index matrix E.
Preferably, in step S202, the D neighbors of node p_i are selected by using the dynamic nearest neighbor DNN method, specifically by an algebraic method:
N(p_i) represents the D-neighbor set of p_i; the i-th row, j-th column of matrix O stores a distance from node p_i whose rank is j, and through the index matrix E the position S_im of this distance in the direct distance matrix S can be found, i.e. node p_m is the node whose distance to node p_i is ranked j, node p_m being denoted p_i^(j).
Whether p_i^(j) is a D neighbor of p_i is judged by the following criterion: the nearest neighbor p_i^(1) is added to the D neighbors and taken as a reference point; when d(p_i^(1), p_i^(j)) > d(p_i, p_i^(j)), p_i^(j) is a D neighbor of p_i; otherwise, p_i^(j) to p_i^(n-1) are not D neighbors of p_i, where p_i^(1) is the nearest neighbor of p_i, p_i^(j) represents the sample point whose distance to p_i is ranked j, and d(·,·) represents a distance metric. Then p_i^(2) is taken as a reference point: when d(p_i^(1), p_i^(j)) > d(p_i, p_i^(j)) and d(p_i^(2), p_i^(j)) > d(p_i, p_i^(j)), p_i^(j) is a D neighbor of p_i; then p_i^(3) is taken as a reference point for the judgment, and so on, until some p_i^(j) is not a D neighbor of p_i, at which point the judgment stops. The result obtained at this time is the D neighbors of node p_i, and the connecting lines between p_i and its D neighbors are the selected edges.
Preferably, in step S202, the D neighbors of node p_i are selected by using the dynamic nearest neighbor DNN method, specifically by a geometric method:
D-neighbor lookup procedure for node p_i: the i-th row, j-th column of matrix O stores a distance from node p_i whose rank is j, and through the index matrix E the position S_im of this distance in the direct distance matrix S can be found, i.e. node p_m is the node whose distance to node p_i is ranked j, node p_m being denoted p_i^(j).
The nearest neighbor p_i^(1) is added to the D neighbors; the perpendicular bisector of segment p_i p_i^(1) divides the plane into two regions, and the region on the p_i side of this perpendicular bisector is selected as the region to which the D neighbors of p_i belong. In the region to which the D neighbors belong, the neighbor nearest to p_i, namely p_i^(2), is selected; according to the perpendicular bisector of segment p_i p_i^(2), the region on the p_i side is selected as the new region to which the D neighbors belong, and in this region the point p_i^(3) nearest to p_i is selected, added, and used as the next reference point. This process is repeated until the region near p_i bounded by all the perpendicular bisectors becomes a closed region; all nodes in the closed region are the D neighbors of p_i, and the connecting lines between p_i and its D neighbors are the selected edges.
Preferably, in step S302, the weight matrix W is defined as:

W_ij = deg(p_i) · S'_ij

where p_i is a node on the graph, W_ij represents the value of the i-th row and j-th column in the weight matrix W, deg(p_i) is the degree of the node, and S'_ij represents the similarity of node p_i and node p_j, i.e. the value of the i-th row and j-th column in the distance matrix S'.
Preferably, the affinity matrix M is defined in step S303, specifically:
according to the weight matrix W, the diagonal matrix T is calculated using the following formula:

T_ii = Σ_j W_ij

where T_ii represents the value of the diagonal matrix T at row i, column i, and W_ij represents the value of the i-th row and j-th column of the weight matrix W;
the affinity matrix M is obtained after normalization using the following formula:

M = T^(-1/2) W T^(-1/2)

where T is the diagonal matrix, W is the weight matrix, and M is the affinity matrix.
Preferably, the label propagation in step S400 is performed by the local and global consistency (LLGC) method, calculated as follows:

F = (I - αM)^(-1) Y

where F is an n × c matrix, n is the number of nodes, c is the number of label types, and F_ij, the value of the i-th row and j-th column of matrix F, represents the probability that node p_i carries the j-th type of label; I is the identity matrix, α is a regulation parameter, M is the affinity matrix, and Y is the label information, an n × c matrix storing the label information of every node; Y_ij, the value of the i-th row and j-th column of matrix Y, is 1 if node p_i is marked with a class-j label and 0 otherwise; (I - αM)^(-1) propagates to each node the probability of acquiring the labels of the marked nodes;
finally, the label of node p_i is obtained, specifically:

y_i = argmax_{j ≤ c} F_ij

where F_ij is the value of the i-th row and j-th column of matrix F, argmax assigns to y_i the value of j at which F_ij attains its maximum, i.e. node p_i is marked as y_i, and the classification of the data is completed after all nodes are marked.
Compared with the prior art, the method has the following beneficial technical effects:
(1) the semi-supervised classification method based on dynamic composition provided by the present disclosure better expresses the underlying distribution characteristics of the data;
(2) the semi-supervised classification method based on dynamic composition provided by the present disclosure adopts a dynamic nearest neighbor edge selection method and then calculates the weights of the edges by the ADW method before classifying; the classification method can capture the distribution of the data, connecting more edges in dense data regions and fewer edges in sparse regions, so the density of the data is better reflected and a better classification effect is achieved.
Drawings
FIG. 1 shows a flow chart of a semi-supervised classification algorithm based on dynamic composition of the present disclosure;
FIG. 2 is a schematic diagram of a DNN edge selection method on a two-dimensional plane;
FIG. 3 is a diagram illustrating a D-nearest neighbor lookup process on a two-dimensional plane;
FIG. 4(a) shows the TwoSpirals dataset;
FIG. 4(b) shows the ToyData dataset;
FIG. 4(c) shows the FourGaussian dataset;
FIG. 4(d) shows the TwoMoon dataset;
FIG. 5(a) shows the TwoSpirals dataset DNN composition result;
FIG. 5(b) shows the TwoSpirals dataset DNN prediction result;
FIG. 5(c) shows the TwoSpirals dataset kNN composition result;
FIG. 5(d) shows the TwoSpirals dataset kNN prediction result;
FIG. 6(a) shows the ToyData dataset DNN composition result;
FIG. 6(b) shows the ToyData dataset DNN prediction result;
FIG. 6(c) shows the kNN composition result for the ToyData dataset when k = 5;
FIG. 6(d) shows the kNN prediction result for the ToyData dataset when k = 5;
FIG. 6(e) shows the kNN composition result for the ToyData dataset when k = 10;
FIG. 6(f) shows the kNN prediction result for the ToyData dataset when k = 10;
FIG. 6(g) shows the kNN composition result for the ToyData dataset when k = 15;
FIG. 6(h) shows the kNN prediction result for the ToyData dataset when k = 15.
Detailed Description
The present disclosure is explained below with reference to FIG. 1 to FIG. 6(h). In one embodiment, as shown in FIG. 1, a semi-supervised classification method based on dynamic composition is provided, comprising steps S100 to S400:
S100, preparing a data set, wherein the data set comprises two parts, labeled data X_l and unlabeled data X_u; the labeled data X_l carry label information F_l; the characteristics of the data in the data set are described by data attribute information; l represents the number of labeled data; the data in the data set are abstracted into n nodes in an m-dimensional space, and the i-th node is denoted p_i;
S200, selecting edges on the data set prepared in step S100 by using the dynamic nearest neighbor DNN method to obtain an adjacency matrix A, specifically:
S201, calculating the Euclidean distances among the nodes in the data set to obtain a direct distance matrix S;
S202, selecting the D neighbors of each node p_i by using the dynamic nearest neighbor DNN method, taking the connecting lines between p_i and its D neighbors as the selected edges, and generating the adjacency matrix A based on the D neighbors, A being an n × n matrix; in the adjacency matrix A, if p_j is a D neighbor of p_i, the corresponding position A_ij in the matrix is 1, and otherwise 0, A_ij representing the value of the i-th row and j-th column in the adjacency matrix A;
S300, calculating the similarity probability among the nodes of the adjacency matrix A generated in step S200 by using the ADW method to obtain an affinity matrix M, specifically:
S301, defining a distance matrix S' according to the direct distance matrix S in step S201 and the adjacency matrix A defined in step S202, S'_ij representing the value of the i-th row and j-th column in the distance matrix S', specifically defined as:
when i ≠ j, S'_ij = A_ij · exp(-S_ij² / (2σ²)), i.e. a Gaussian kernel of the direct distance restricted to adjacent nodes;
when i = j, S'_ij = 0;
S302, defining a weight matrix W according to the distance matrix S' defined in step S301, the weight matrix W being an n × n matrix, W_ij describing the similarity of node p_i and node p_j, i.e. the value of the i-th row and j-th column of the weight matrix W;
S303, normalizing the weight matrix W defined in step S302 to obtain an affinity matrix M, the affinity matrix M being an n × n matrix, M_ij describing the similarity probability of node p_i and node p_j, i.e. the value of the affinity matrix M at row i, column j;
S400, carrying out label propagation according to the affinity matrix M obtained in step S300 to obtain the final classification result.
In this embodiment, the dynamic nearest neighbor DNN method is used for edge selection on the prepared data set to obtain the adjacency matrix A, the affinity matrix M is obtained through calculation, and label propagation is then carried out to obtain the final classification result. The classification method can capture the distribution of the data: more edges are connected in dense data regions and fewer in sparse regions, so the density of the data is better reflected and a better classification effect is achieved. The DNN method may be implemented either by an algebraic method or by a geometric method.
In another embodiment, the classification method proposed by the present disclosure can be applied to a variety of data sets for classification, as long as the data sets contain data and matching labels, such as the synthetic data sets TwoSpirals, ToyData, FourGaussian and TwoMoon and the image data sets USPS, Mnist, Mnist-3495, Coil20, Coil(1500), G241d and COIL2.
The sample numbers, attribute dimensions and category numbers of the synthetic data sets TwoSpirals, ToyData, FourGaussian and TwoMoon are shown in Table 1:
TABLE 1 Synthetic data sets
Synthetic data set  Number of samples  Attribute dimension  Number of categories
TwoSpirals 2000 2 2
ToyData 788 2 7
FourGaussian 1200 2 4
TwoMoon 400 2 2
The sample numbers, attribute dimensions and category numbers of the image data sets USPS, Mnist, Mnist-3495, Coil20, Coil(1500), G241d and COIL2 are shown in Table 2:
TABLE 2 Image data sets
Image data set  Number of samples  Attribute dimension  Number of categories
USPS 1800 256 6
Mnist 6996 784 10
Mnist-3495 3495 784 10
Coil20 1440 1024 20
Coil(1500) 1500 241 6
G241d 1500 241 2
COIL2 1500 241 2
These data sets are all existing, publicly available data sets.
In another embodiment, in step S201, the Euclidean distance between nodes p_i and p_j in the data set is:

d(p_i, p_j) = √( Σ_{k=1}^{m} (x_ik - x_jk)² )

where m denotes the dimension of the data, p_i and p_j denote the i-th and j-th nodes in the graph, and x_ik and x_jk are the k-th dimension coordinates of nodes p_i and p_j, respectively; the direct distance matrix S is generated from the Euclidean distances between the nodes, the direct distance matrix S being an n × n two-dimensional matrix, with S_ij, the value of the i-th row and j-th column in the matrix, storing the Euclidean distance between node p_i and node p_j.
In another embodiment, step S201 further includes:
sorting the Euclidean distances between each node and the other nodes in the direct distance matrix S from small to large to obtain a matrix O, and simultaneously generating an index matrix E corresponding to the direct distance matrix S; the specific process is that, for the i-th row in the direct distance matrix S, the stored distances are sorted from small to large, the distance ranked j-1 is stored in O_ij, and the position of this distance in the direct distance matrix S is stored in E_ij, so that the position in the direct distance matrix S of any distance stored in matrix O can be found through the index matrix E; the matrix O and the index matrix E are both n × n two-dimensional matrices, and E_ij represents the element of the i-th row and j-th column of the index matrix E.
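The following minimal sketch illustrates this step, assuming the data points are the rows of a NumPy array; the function name build_distance_matrices and the use of NumPy are illustrative assumptions, not part of the patent.

```python
import numpy as np

def build_distance_matrices(X):
    """Step S201 sketch: direct distance matrix S, sorted matrix O, index matrix E."""
    # S[i, j] stores the Euclidean distance between nodes p_i and p_j.
    diff = X[:, None, :] - X[None, :, :]
    S = np.sqrt((diff ** 2).sum(axis=2))
    # E[i, j] is the column of row i of S that holds the distance ranked j (0-based),
    # and O[i, j] is that distance itself, so O[i, j] == S[i, E[i, j]].
    E = np.argsort(S, axis=1)
    O = np.take_along_axis(S, E, axis=1)
    return S, O, E
```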
In another embodiment, in step S202 the D neighbors of node p_i are selected by using the dynamic nearest neighbor DNN method, specifically by an algebraic method:
N(p_i) represents the D-neighbor set of p_i; the i-th row, j-th column of matrix O stores a distance from node p_i whose rank is j, and through the index matrix E the position S_im of this distance in the direct distance matrix S can be found, i.e. node p_m is the node whose distance to node p_i is ranked j, node p_m being denoted p_i^(j).
Whether p_i^(j) is a D neighbor of p_i is judged by the following criterion: the nearest neighbor p_i^(1) is added to the D neighbors and taken as a reference point; when d(p_i^(1), p_i^(j)) > d(p_i, p_i^(j)), p_i^(j) is a D neighbor of p_i; otherwise, p_i^(j) to p_i^(n-1) are not D neighbors of p_i, where p_i^(1) is the nearest neighbor of p_i, p_i^(j) represents the sample point whose distance to p_i is ranked j, and d(·,·) represents a distance metric. Then p_i^(2) is taken as a reference point: when d(p_i^(1), p_i^(j)) > d(p_i, p_i^(j)) and d(p_i^(2), p_i^(j)) > d(p_i, p_i^(j)), p_i^(j) is a D neighbor of p_i; then p_i^(3) is taken as a reference point for the judgment, and so on, until some p_i^(j) is not a D neighbor of p_i, at which point the judgment stops. The result obtained at this time is the D neighbors of node p_i, and the connecting lines between p_i and its D neighbors are the selected edges.
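A minimal sketch of this algebraic selection, assuming the matrices S and E from step S201; candidates are taken in order of distance and accepted while they are closer to p_i than to every already accepted reference point, which is the perpendicular-bisector test in algebraic form. The function name dnn_adjacency is an illustrative assumption.

```python
import numpy as np

def dnn_adjacency(S, E):
    """Step S202 sketch: adjacency matrix A from distances S and index matrix E."""
    n = S.shape[0]
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        accepted = [E[i, 1]]  # E[i, 0] is node i itself; E[i, 1] is its nearest neighbor
        for j in range(2, n):
            cand = E[i, j]    # candidate whose distance to p_i is ranked j
            # D-neighbor criterion: closer to p_i than to every accepted reference point.
            if all(S[q, cand] > S[i, cand] for q in accepted):
                accepted.append(cand)
            else:
                break         # once one candidate fails, all farther ones are rejected
        A[i, accepted] = 1    # edges from p_i to its D neighbors
    return A
```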
In this embodiment, specifically, the D-neighbor determination for p_i on a two-dimensional plane is shown in FIG. 2. To guarantee the connectivity of the graph structure in DNN, i.e. that no isolated nodes appear, at least one connection is established for each sample: the nearest neighbor p_i^(1) is first added to the neighborhood of p_i, and p_i and p_i^(1) are connected. The dotted line in the figure is the perpendicular bisector of segment p_i p_i^(1). Then whether p_i^(2) is a D neighbor of p_i is judged according to the previously defined D-neighbor criterion: when d(p_i^(1), p_i^(2)) > d(p_i, p_i^(2)), i.e. p_i^(2) is located on the side of the perpendicular bisector near p_i and inside the circle centered at p_i with radius d(p_i, p_i^(2)), p_i^(2) is a D neighbor of p_i, and p_i^(2) is added to the set N(p_i). A point at the same distance from p_i as an accepted neighbor is judged in the same way: whether it becomes a D neighbor depends on its position relative to the perpendicular bisectors of the neighbors already accepted. The search process for the D neighbors of a sample point p_i is essentially a search over concentric circles centered at p_i whose radii are the distances from p_i to its different neighbors. The denser the local data distribution around p_i, the more concentric circles are searched and the more D neighbors are generated.
Therefore, the method can capture the distribution of the data: more edges are connected in dense data regions and fewer edges in sparse regions, so the density of the data is better reflected and a better classification effect is obtained.
The graph constructed using DNN is connected: in graph G = (X, ε), select a connected vertex subset Z ⊆ X; find a point B ∈ X outside the vertex set that is nearest to Z, and record the point in the vertex set Z nearest to B as A. Point B is then the adaptive neighbor of point A, i.e. point B is connected with the vertex set Z, so point B is added to the connected subset Z. Repeating these operations, all points in the graph can be added to the connected subset Z, so all points in graph G = (X, ε) are connected.
The graph constructed using DNN does not have the problem of weak connectivity. By contradiction, assume that point A is an adaptive neighbor of point B but point B is not an adaptive neighbor of point A. Since point B is not an adaptive neighbor of point A, there must exist a point C such that the path cost of B-C-A is less than the path cost of B-A; but then the path cost of A-B is greater than that of A-C-B, so point A is not an adaptive neighbor of point B, contradicting the assumption. Therefore weak connectivity does not occur.
In another embodiment, in step S202 the D neighbors of node p_i are selected by using the dynamic nearest neighbor DNN method, specifically by a geometric method:
D-neighbor lookup procedure for node p_i: the i-th row, j-th column of matrix O stores a distance from node p_i whose rank is j, and through the index matrix E the position S_im of this distance in the direct distance matrix S can be found, i.e. node p_m is the node whose distance to node p_i is ranked j, node p_m being denoted p_i^(j).
The nearest neighbor p_i^(1) is added to the D neighbors; the perpendicular bisector of segment p_i p_i^(1) divides the plane into two regions, and the region on the p_i side of this perpendicular bisector is selected as the region to which the D neighbors of p_i belong. In the region to which the D neighbors belong, the neighbor nearest to p_i, namely p_i^(2), is selected; according to the perpendicular bisector of segment p_i p_i^(2), the region on the p_i side is selected as the new region to which the D neighbors belong, and in this region the point p_i^(3) nearest to p_i is selected, added, and used as the next reference point. This process is repeated until the region near p_i bounded by all the perpendicular bisectors becomes a closed region; all nodes in the closed region are the D neighbors of p_i, and the connecting lines between p_i and its D neighbors are the selected edges.
In this embodiment, a closed region means that any connecting line between a point inside the region and a point outside the region intersects the boundary of the region.
In this embodiment, specifically, the D-neighbor determination for p_i on a two-dimensional plane is shown in FIG. 3. To guarantee the connectivity of the graph structure in DNN, i.e. that no isolated nodes appear, at least one connection is established for each sample: the nearest neighbor p_i^(1) is first added to the neighborhood of p_i, and p_i and p_i^(1) are connected. Dashed line ① is the perpendicular bisector of segment p_i p_i^(1); it divides the planar area into two parts, and the D neighbors of p_i belong to the region on the p_i side, i.e. the region to the left of dashed line ①. In the region to the left of dashed line ①, the nearest neighbor p_i^(2) of p_i is selected, and p_i and p_i^(2) are connected. Dashed line ② is the perpendicular bisector of segment p_i p_i^(2); dashed lines ① and ② divide the planar area into four parts, and the D neighbors of p_i belong to the region on the p_i side, i.e. to the left of dashed line ① and below dashed line ②, in which the nearest neighbor p_i^(3) of p_i is selected, and p_i and p_i^(3) are connected. Dashed line ③ is the perpendicular bisector of segment p_i p_i^(3); dashed lines ①, ② and ③ divide the planar area into six parts, and the D neighbors of p_i belong to the region on the p_i side, i.e. to the left of dashed line ①, below dashed line ② and to the right of dashed line ③, in which the nearest neighbor p_i^(4) of p_i is selected, and p_i and p_i^(4) are connected. Dashed line ④ is the perpendicular bisector of segment p_i p_i^(4); dashed lines ①, ②, ③ and ④ divide the planar area into nine parts, and the region on the p_i side, i.e. to the left of dashed line ①, below dashed line ②, to the right of dashed line ③ and above dashed line ④, is a closed region; all nodes in the closed region are D neighbors of node p_i.
It can be seen that the more densely the local data around p_i are distributed, the more nodes the closed region contains and the more D neighbors the node has. The method can therefore capture the distribution of the data, connecting more edges in dense data regions and fewer edges in sparse regions, so it better reflects the density of the data and achieves a better classification effect.
In another embodiment, in step S302, the weight matrix W is defined as:

W_ij = deg(p_i) · S'_ij

where p_i is a node on the graph, W_ij represents the value of the i-th row and j-th column in the weight matrix W, deg(p_i) is the degree of the node, and S'_ij represents the similarity of node p_i and node p_j, i.e. the value of the i-th row and j-th column in the distance matrix S'.
In another embodiment, the affinity matrix M is defined in step S303, specifically:
according to the weight matrix W, the diagonal matrix T is calculated using the following formula:

T_ii = Σ_j W_ij

where T_ii represents the value of the diagonal matrix T at row i, column i, and W_ij represents the value of the i-th row and j-th column of the weight matrix W;
the affinity matrix M is obtained after normalization using the following formula:

M = T^(-1/2) W T^(-1/2)

where T is the diagonal matrix, W is the weight matrix, and M is the affinity matrix.
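A sketch of steps S301 to S303 under the Gaussian-kernel reading of S' used above; the bandwidth sigma and the function name adw_affinity are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def adw_affinity(S, A, sigma=1.0):
    """ADW sketch: affinity matrix M from distance matrix S and adjacency matrix A."""
    # S301: similarity restricted to adjacent node pairs, with a zero diagonal.
    S_prime = A * np.exp(-S ** 2 / (2 * sigma ** 2))
    np.fill_diagonal(S_prime, 0.0)
    # S302: W_ij = deg(p_i) * S'_ij, where deg(p_i) is the degree of node p_i.
    deg = A.sum(axis=1)
    W = deg[:, None] * S_prime
    # S303: symmetric normalization M = T^(-1/2) W T^(-1/2) with T_ii = sum_j W_ij.
    t = W.sum(axis=1)
    inv_sqrt = np.where(t > 0, 1.0 / np.sqrt(t), 0.0)
    return inv_sqrt[:, None] * W * inv_sqrt[None, :]
```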
In this embodiment, the similarity probability of the nodes in the graph under ADW is determined jointly by the similarity and the degree of the nodes: the distribution of the data is expressed by the degree of the nodes, and the similarity between the data is expressed by the Gaussian kernel function. ADW is simple to calculate, its algorithmic complexity is low, and it has the following two advantages:
(1) The formulation reduces the overfitting problem caused by weight parameterization and is insensitive to noisy data. In experiments on the synthetic data sets, ADW was observed to be robust to input noise, and its performance advantages were demonstrated on 7 real data sets.
(2) ADW has no additional tuning parameters.
In another embodiment, the label propagation in step S400 is implemented by the local and global consistency LLGC (Learning with Local and Global Consistency) method, calculated as follows:

F = (I - αM)^(-1) Y

where F is an n × c matrix, n is the number of nodes, c is the number of label types, and F_ij, the value of the i-th row and j-th column of matrix F, represents the probability that node p_i carries the j-th type of label; I is the identity matrix, α is a regulation parameter, M is the affinity matrix, and Y is the label information, an n × c matrix storing the label information of every node; Y_ij, the value of the i-th row and j-th column of matrix Y, is 1 if node p_i is marked with a class-j label and 0 otherwise; (I - αM)^(-1) propagates to each node the probability of acquiring the labels of the marked nodes;
finally, the label of node p_i is obtained, specifically:

y_i = argmax_{j ≤ c} F_ij

where F_ij is the value of the i-th row and j-th column of matrix F, argmax assigns to y_i the value of j at which F_ij attains its maximum, i.e. node p_i is marked as y_i, and the classification of the data is completed after all nodes are marked.
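A sketch of this label propagation step; the one-hot construction of Y and the default value of alpha are illustrative assumptions.

```python
import numpy as np

def llgc_predict(M, labels, n_classes, alpha=0.99):
    """Step S400 sketch: labels[i] is the class index of node i, or -1 if unlabeled."""
    n = M.shape[0]
    Y = np.zeros((n, n_classes))
    for i, y in enumerate(labels):
        if y >= 0:
            Y[i, y] = 1.0                          # Y_ij = 1 if node p_i carries label j
    F = np.linalg.solve(np.eye(n) - alpha * M, Y)  # F = (I - alpha*M)^(-1) Y
    return F.argmax(axis=1)                        # y_i = argmax_j F_ij
```

Together with the earlier sketches, an end-to-end run would read A = dnn_adjacency(S, E), M = adw_affinity(S, A), y = llgc_predict(M, labels, c).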
The steps of the semi-supervised classification method based on dynamic composition provided by the present disclosure have been specifically introduced above; the superiority of the classification method provided by the present disclosure over existing data classification methods is illustrated below by specific experimental comparison.
Experiment:
to illustrate the superiority of the semi-supervised classification method based on dynamic composition proposed by the present disclosure, experiments were performed on synthetic data sets and real data sets widely used in graph-based semi-supervised learning. The method mainly aims to verify that the proposed method can better express the potential distribution characteristics of data, and improves the semi-supervised classification method. Comparing the method provided by the present disclosure with the kNN method, the kNN method is to find k nearest neighbors of a sample, and assign an average value of attributes of the nearest neighbors to the sample, so as to obtain the attribute of the sample. One of the evaluation criteria of the graph construction method is: on the premise of using the same derivation method, if the better classification performance can be realized, the LLGC classification method is adopted in the experiment. For the multi-class problem, the Error Rate (Error Rate) is used to evaluate the performance of the algorithm, as shown in the following equation:
Error Rate = (Σ_{i=1}^{c} F_i) / (Σ_{i=1}^{c} N_i)

where c is the total number of sample classes, N_i is the number of class-i samples, and F_i is the number of misclassified samples in the i-th class.
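Written out as code, the error rate is the total number of misclassified samples divided by the total number of samples; a one-line sketch assuming the per-class counts F_i and N_i are available:

```python
def error_rate(misclassified_per_class, samples_per_class):
    # sum of F_i over all classes divided by sum of N_i over all classes
    return sum(misclassified_per_class) / sum(samples_per_class)
```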
Synthetic data set experimental results
The classification performance of DNN+LLGC and kNN+LLGC was verified using the 4 synthetic data sets in Table 1: TwoSpirals, ToyData, FourGaussian and TwoMoon. Their sample numbers, attribute dimensions and category numbers are listed in Table 1; they are randomly generated two-dimensional data. The TwoSpirals dataset contains 1000 positive and 1000 negative samples distributed in a double-spiral shape, as shown in FIG. 4(a). The ToyData dataset has 788 samples belonging to 7 classes, each sample having 2 attributes, as shown in FIG. 4(b). The FourGaussian dataset consists of 1200 samples in 4 classes, each sample having 2 attributes, as shown in FIG. 4(c). The TwoMoon dataset contains 200 positive and 200 negative samples distributed in a double-moon shape, as shown in FIG. 4(d).
Taking the TwoSpirals and ToyData datasets as examples, the graphs constructed by the DNN and kNN methods and the category prediction results are shown in FIGS. 5(a)-5(d) and 6(a)-6(h); the composition results are the graphs obtained by the edge selection and edge re-weighting methods, and the prediction results are the classification results obtained by label propagation on the composed graphs. In FIGS. 5(a)-5(d), dark dots and light dots represent the two classes of data. FIGS. 5(a) and 5(c) are the DNN and kNN graphs of the TwoSpirals dataset, respectively. In the DNN graph, edges still connect light-colored data points that are farther apart, better expressing the correlation between sample points of the same class. FIGS. 5(b) and 5(d) show the category prediction results of the DNN and kNN algorithms; the DNN predictions are more accurate. In FIGS. 6(a)-6(h), dots of different shades represent the seven classes of data. FIGS. 6(a) and 6(b) show the composition and category prediction results of the DNN method; FIGS. 6(c), 6(e) and 6(g) show the composition results of the kNN method for k = 5, 10 and 15, and FIGS. 6(d), 6(f) and 6(h) show the corresponding category prediction results.
For the classification experiments, the numbers of labeled samples are shown in Table 3; the other data were unlabeled. In the graph construction process the kNN (k = 5, 10, 15) and DNN methods were used, and LLGC was used for label propagation. Each experiment was repeated 20 times and the average accuracy was calculated. The classification results are shown in Table 3: the first column is the name of the synthetic data set, the second column the number of labeled samples, the third column the composition method, the fourth column the different values of the parameter k in the kNN method, the fifth and sixth columns the minimum and maximum degrees of the nodes in the graph, and the seventh and eighth columns the average error rate and the standard deviation of the error rate. It can be seen that, except for the TwoSpirals dataset, the classification error rate of the DNN method is smaller than that of the kNN method for all values of k.
TABLE 3 Classification Performance of synthetic datasets
Experimental results of image data set
In order to compare the classification performance of the different composition and label inference methods, combinations of them were applied to the 7 image data sets presented in Table 2. These data sets all consist of grayscale images, and the grayscale values are taken as the feature values of each image.
Randomly selected sample points in each class were labeled; the number of labeled samples per class is shown in Table 4, and the remaining sample points were unlabeled. In the experiments, the data dimensionality was reduced to 50 by PCA, which maps high-dimensional features to a low-dimensional space by retaining the important characteristics of the data and removing noise and unimportant features. The unlabeled samples were classified and each experiment was repeated 10 times; the classification results (average classification error rates) are shown in Table 4: the first column is the name of the image data set, the second column the number of labeled samples, the third column the composition method, the fourth column the different values of the parameter k in the kNN method, the fifth and sixth columns the minimum and maximum degrees of the nodes in the graph, and the seventh and eighth columns the average error rate and the standard deviation of the error rate. It can be seen that, for every data set, the classification error rate of the DNN method is smaller than that of the kNN method for all values of k. Therefore, the semi-supervised classification method based on dynamic composition can improve the classification precision.
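A sketch of the PCA preprocessing described above, assuming scikit-learn is used; the patent states only that the dimensionality is reduced to 50 by PCA, so the array shape below is a stand-in.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1800, 256)                # stand-in for, e.g., the USPS feature matrix
X50 = PCA(n_components=50).fit_transform(X)  # keep the 50 leading principal components
```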
TABLE 4 Classification Performance of image datasets
Although the embodiments of the present invention have been described above with reference to the accompanying drawings, the present invention is not limited to the above-described embodiments and application fields, and the above-described embodiments are illustrative, instructive, and not restrictive. Those skilled in the art, having the benefit of this disclosure, may effect numerous modifications thereto without departing from the scope of the invention as defined by the appended claims.

Claims (9)

1. A semi-supervised classification method based on dynamic composition, comprising the following steps:
S100, preparing a data set, wherein the data set comprises two parts, labeled data X_l and unlabeled data X_u; the labeled data X_l carry label information F_l; the characteristics of the data in the data set are described by data attribute information; l represents the number of labeled data; the data in the data set are abstracted into n nodes in an m-dimensional space, and the i-th node is denoted p_i;
S200, selecting edges on the data set prepared in step S100 by using the dynamic nearest neighbor DNN method to obtain an adjacency matrix A, specifically:
S201, calculating the Euclidean distances among the nodes in the data set to obtain a direct distance matrix S;
S202, selecting the D neighbors of each node p_i by using the dynamic nearest neighbor DNN method, taking the connecting lines between p_i and its D neighbors as the selected edges, and generating the adjacency matrix A based on the D neighbors, A being an n × n matrix; in the adjacency matrix A, if p_j is a D neighbor of p_i, the corresponding position A_ij in the matrix is 1, and otherwise 0, A_ij representing the value of the i-th row and j-th column in the adjacency matrix A;
S300, calculating the similarity probability among the nodes of the adjacency matrix A generated in step S200 by using the ADW method to obtain an affinity matrix M, specifically:
S301, defining a distance matrix S' according to the direct distance matrix S in step S201 and the adjacency matrix A defined in step S202, S'_ij representing the value of the i-th row and j-th column in the distance matrix S', specifically defined as:
when i ≠ j, S'_ij = A_ij · exp(-S_ij² / (2σ²));
when i = j, S'_ij = 0;
S302, defining a weight matrix W according to the distance matrix S' defined in step S301, the weight matrix W being an n × n matrix, W_ij describing the similarity of node p_i and node p_j, i.e. the value of the i-th row and j-th column of the weight matrix W;
S303, normalizing the weight matrix W defined in step S302 to obtain an affinity matrix M, the affinity matrix M being an n × n matrix, M_ij describing the similarity probability of node p_i and node p_j, i.e. the value of the affinity matrix M at row i, column j;
S400, carrying out label propagation according to the affinity matrix M obtained in step S300 to obtain the final classification result.
2. The method according to claim 1, wherein the data sets in step S100 include the synthetic data sets TwoSpirals, ToyData, FourGaussian and TwoMoon and the image data sets USPS, Mnist, Mnist-3495, Coil20, Coil(1500), G241d and COIL2.
3. The method according to claim 1, wherein in step S201 the Euclidean distance between nodes p_i and p_j in the data set is:

d(p_i, p_j) = √( Σ_{k=1}^{m} (x_ik - x_jk)² )

where m denotes the dimension of the data, p_i and p_j denote the i-th and j-th nodes in the graph, and x_ik and x_jk are the k-th dimension coordinates of nodes p_i and p_j, respectively; the direct distance matrix S is generated from the Euclidean distances between the nodes, the direct distance matrix S being an n × n two-dimensional matrix, with S_ij, the value of the i-th row and j-th column in the matrix, storing the Euclidean distance between node p_i and node p_j.
4. The method of claim 3, wherein step S201 further comprises:
sorting the Euclidean distances between each node and the other nodes in the direct distance matrix S from small to large to obtain a matrix O, and simultaneously generating an index matrix E corresponding to the direct distance matrix S; the specific process is that, for the i-th row in the direct distance matrix S, the stored distances are sorted from small to large, the distance ranked j-1 is stored in O_ij, and the position of this distance in the direct distance matrix S is stored in E_ij, so that the position in the direct distance matrix S of any distance stored in matrix O can be found through the index matrix E; the matrix O and the index matrix E are both n × n two-dimensional matrices, and E_ij represents the element of the i-th row and j-th column of the index matrix E.
5. The method of claim 4, wherein in step S202 the D neighbors of node p_i are selected by using the dynamic nearest neighbor DNN method, specifically by an algebraic method:
N(p_i) represents the D-neighbor set of p_i; the i-th row, j-th column of matrix O stores a distance from node p_i whose rank is j, and through the index matrix E the position S_im of this distance in the direct distance matrix S can be found, i.e. node p_m is the node whose distance to node p_i is ranked j, node p_m being denoted p_i^(j);
whether p_i^(j) is a D neighbor of p_i is judged by the following criterion: the nearest neighbor p_i^(1) is added to the D neighbors and taken as a reference point; when d(p_i^(1), p_i^(j)) > d(p_i, p_i^(j)), p_i^(j) is a D neighbor of p_i; otherwise, p_i^(j) to p_i^(n-1) are not D neighbors of p_i, where p_i^(1) is the nearest neighbor of p_i, p_i^(j) represents the sample point whose distance to p_i is ranked j, and d(·,·) represents a distance metric; then p_i^(2) is taken as a reference point: when d(p_i^(1), p_i^(j)) > d(p_i, p_i^(j)) and d(p_i^(2), p_i^(j)) > d(p_i, p_i^(j)), p_i^(j) is a D neighbor of p_i; then p_i^(3) is taken as a reference point for the judgment, and so on, until some p_i^(j) is not a D neighbor of p_i, at which point the judgment stops; the result obtained at this time is the D neighbors of node p_i, and the connecting lines between p_i and its D neighbors are the selected edges.
6. The method of claim 4, wherein in step S202 the D neighbors of node p_i are selected by using the dynamic nearest neighbor DNN method, specifically by a geometric method:
D-neighbor lookup procedure for node p_i: the i-th row, j-th column of matrix O stores a distance from node p_i whose rank is j, and through the index matrix E the position S_im of this distance in the direct distance matrix S can be found, i.e. node p_m is the node whose distance to node p_i is ranked j, node p_m being denoted p_i^(j);
the nearest neighbor p_i^(1) is added to the D neighbors; the perpendicular bisector of segment p_i p_i^(1) divides the plane into two regions, and the region on the p_i side of this perpendicular bisector is selected as the region to which the D neighbors of p_i belong; in the region to which the D neighbors belong, the neighbor nearest to p_i, namely p_i^(2), is selected; according to the perpendicular bisector of segment p_i p_i^(2), the region on the p_i side is selected as the new region to which the D neighbors belong, and in this region the point p_i^(3) nearest to p_i is selected, added, and used as the next reference point; this process is repeated until the region near p_i bounded by all the perpendicular bisectors becomes a closed region; all nodes in the closed region are the D neighbors of p_i, and the connecting lines between p_i and its D neighbors are the selected edges.
7. The method according to claim 1, wherein in step S302 the weight matrix W is defined as:

W_ij = deg(p_i) · S'_ij

where p_i is a node on the graph, W_ij represents the value of the i-th row and j-th column in the weight matrix W, deg(p_i) is the degree of the node, and S'_ij represents the similarity of node p_i and node p_j, i.e. the value of the i-th row and j-th column in the distance matrix S'.
8. The method according to claim 1, wherein the affinity matrix M is defined in step S303, specifically:
according to the weight matrix W, the diagonal matrix T is calculated using the following formula:

T_ii = Σ_j W_ij

where T_ii represents the value of the diagonal matrix T at row i, column i, and W_ij represents the value of the i-th row and j-th column of the weight matrix W;
the affinity matrix M is obtained after normalization using the following formula:

M = T^(-1/2) W T^(-1/2)

where T is the diagonal matrix, W is the weight matrix, and M is the affinity matrix.
9. The method of claim 1, wherein the label propagation in step S400 is performed by the local and global consistency LLGC method, calculated as follows:

F = (I - αM)^(-1) Y

where F is an n × c matrix, n is the number of nodes, c is the number of label types, and F_ij, the value of the i-th row and j-th column of matrix F, represents the probability that node p_i carries the j-th type of label; I is the identity matrix, α is a regulation parameter, M is the affinity matrix, and Y is the label information, an n × c matrix storing the label information of every node; Y_ij, the value of the i-th row and j-th column of matrix Y, is 1 if node p_i is marked with a class-j label and 0 otherwise; (I - αM)^(-1) propagates to each node the probability of acquiring the labels of the marked nodes;
finally, the label of node p_i is obtained, specifically:

y_i = argmax_{j ≤ c} F_ij

where F_ij is the value of the i-th row and j-th column of matrix F, argmax assigns to y_i the value of j at which F_ij attains its maximum, i.e. node p_i is marked as y_i, and the classification of the data is completed after all nodes are marked.
CN201911131232.XA 2019-11-20 2019-11-20 Semi-supervised classification method based on dynamic composition Active CN111046914B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911131232.XA CN111046914B (en) 2019-11-20 2019-11-20 Semi-supervised classification method based on dynamic composition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911131232.XA CN111046914B (en) 2019-11-20 2019-11-20 Semi-supervised classification method based on dynamic composition

Publications (2)

Publication Number Publication Date
CN111046914A true CN111046914A (en) 2020-04-21
CN111046914B CN111046914B (en) 2023-10-27

Family

ID=70232184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911131232.XA Active CN111046914B (en) 2019-11-20 2019-11-20 Semi-supervised classification method based on dynamic composition

Country Status (1)

Country Link
CN (1) CN111046914B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019001071A1 (en) * 2017-06-28 2019-01-03 浙江大学 Adjacency matrix-based graph feature extraction system and graph classification system and method
CN109815986A (en) * 2018-12-24 2019-05-28 陕西师范大学 The semisupervised classification method of fusion part and global characteristics
CN109829472A (en) * 2018-12-24 2019-05-31 陕西师范大学 Semisupervised classification method based on probability neighbour
CN110309871A (en) * 2019-06-27 2019-10-08 西北工业大学深圳研究院 A kind of semi-supervised learning image classification method based on random resampling

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
朱常宝; 程勇; 高强: "Research on an image classification algorithm based on semi-supervised deep belief networks", Computer Science, no. 1 *
王娜; 王小凤; 耿国华; 宋倩楠: "A semi-supervised classification algorithm based on C-means clustering and graph transduction", Computer Applications, no. 09 *

Also Published As

Publication number Publication date
CN111046914B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
Lee et al. Foreground focus: Unsupervised learning from partially matching images
CN113408605B (en) Hyperspectral image semi-supervised classification method based on small sample learning
Liu et al. Nonparametric scene parsing via label transfer
Das et al. Automatic clustering using an improved differential evolution algorithm
US7889914B2 (en) Automated learning of model classifications
CN110163258A (en) A kind of zero sample learning method and system reassigning mechanism based on semantic attribute attention
CN108897791B (en) Image retrieval method based on depth convolution characteristics and semantic similarity measurement
CN110647907B (en) Multi-label image classification algorithm using multi-layer classification and dictionary learning
CN101140623A (en) Video frequency objects recognition method and system based on supporting vectors machine
CN109635140B (en) Image retrieval method based on deep learning and density peak clustering
CN1723468A (en) Computer vision system and method employing illumination invariant neural networks
CN111275052A (en) Point cloud classification method based on multi-level aggregation feature extraction and fusion
CN113807456A (en) Feature screening and association rule multi-label classification algorithm based on mutual information
Hsieh et al. Adaptive structural co-regularization for unsupervised multi-view feature selection
Andreetto et al. Unsupervised learning of categorical segments in image collections
CN117253093A (en) Hyperspectral image classification method based on depth features and graph annotation force mechanism
Tamrakar et al. Integration of lazy learning associative classification with kNN algorithm
CN105844299A (en) Image classification method based on bag of words
CN111046914B (en) Semi-supervised classification method based on dynamic composition
Turtinen et al. Contextual analysis of textured scene images.
CN114821157A (en) Multi-modal image classification method based on hybrid model network
Jamotton et al. Insurance analytics with clustering techniques
Chowdhury et al. Compact image signature generation: An application in image retrieval
Dupé et al. Hierarchical bag of paths for kernel based shape classification
Di Ruberto et al. Decomposition of two-dimensional shapes for efficient retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant