CN111046914B - Semi-supervised classification method based on dynamic composition - Google Patents

Semi-supervised classification method based on dynamic composition

Info

Publication number
CN111046914B
CN111046914B CN201911131232.XA (application CN201911131232A)
Authority
CN
China
Prior art keywords
matrix
node
neighbor
distance
data
Prior art date
Legal status
Active
Application number
CN201911131232.XA
Other languages
Chinese (zh)
Other versions
CN111046914A (en)
Inventor
马君亮
肖冰
敬欣怡
何聚厚
汪西莉
Current Assignee
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date
Filing date
Publication date
Application filed by Shaanxi Normal University filed Critical Shaanxi Normal University
Priority to CN201911131232.XA priority Critical patent/CN111046914B/en
Publication of CN111046914A publication Critical patent/CN111046914A/en
Application granted granted Critical
Publication of CN111046914B publication Critical patent/CN111046914B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques


Abstract

The present disclosure relates to a semi-supervised classification method based on dynamic composition, comprising: S100, preparing a data set; S200, selecting edges on the data set prepared in step S100 by a dynamic nearest neighbor (DNN) method to obtain an adjacency matrix A; S300, calculating the similarity probabilities between nodes on the adjacency matrix A generated in step S200 by an adaptive degree weighting (ADW) method to obtain an affinity matrix M; S400, performing label propagation according to the affinity matrix M obtained in step S300 to obtain the final classification result. The classification method provided by the disclosure can capture the distribution of the data: more edges are connected in dense data regions and fewer edges in sparse regions, so the density of the data is better reflected and a better classification effect is achieved.

Description

Semi-supervised classification method based on dynamic composition
Technical Field
The present disclosure relates to data classification methods, and in particular to a semi-supervised classification method based on dynamic composition (DCG, Dynamic Construction Graph).
Background
Existing data classification methods include supervised, semi-supervised and unsupervised classification. Supervised classification requires a large amount of labeled data to train the model, so its application scenarios are limited. Unsupervised classification needs no class information and is widely applicable, but the lack of class information leads to poor classification results. Semi-supervised classification needs only a small amount of labeled data, so the acquisition cost is low, and it can obtain better classification results by learning the data distribution of a large amount of unlabeled data; it therefore has wide application scenarios.
Graph-based semi-supervised classification is an important branch of semi-supervised classification; because it makes full use of the relations between data, it often obtains good results and has received great attention. However, current graph-based semi-supervised classification methods usually construct the similarity graph by the k-nearest-neighbor (kNN) or ε-nearest-neighbor method. During graph construction only the attribute features of the data are used and the class information of the labeled data is not, so the obtained similarity graph cannot reflect the actual situation well and the classification result is inaccurate.
Different graph structures are constructed based on different assumptions about the data distribution. An ideal graph should possess the following three features: the edge selection algorithm should reflect the data distribution, selecting more neighbors in dense regions and fewer in sparse regions; the measure of similarity should relate not only to distance but also to local structure; and the construction algorithm should reduce the influence of manually set parameters on the construction result. Because the edge selection algorithms and similarity calculation methods in the prior art each have their own limitations, the results of such classification methods are not accurate enough, and a new semi-supervised classification method is needed to make data classification more accurate.
Disclosure of Invention
To address these problems, the present disclosure provides a semi-supervised classification method based on dynamic composition, which selects edges by a dynamic nearest neighbor (DNN, Dynamic Nearest Neighbor) method and calculates similarity by an adaptive degree weighting (ADW, Adaptive Degree Weighting) method; it can describe the local characteristics of the data well and thereby improves the accuracy of data classification.
The semi-supervised classification method based on dynamic composition first selects edges with the DNN method, dynamically choosing the D-neighbors of each node; it then calculates the edge weights, i.e. the similarity probabilities between nodes, with the ADW method; finally it classifies on the graph with the Learning with Local and Global Consistency (LLGC) algorithm.
Specifically, a semi-supervised classification method based on dynamic composition comprises the following steps:
S100, preparing a data set, wherein the data set comprises two parts, labeled data X_l and unlabeled data X_u; the label information of the labeled data X_l is F_l; the characteristics of the data in the data set are described by data attribute information; l denotes the number of labeled data; the data in the data set are abstracted into n nodes in an m-dimensional space, the i-th node being denoted p_i;
S200, selecting edges on the data set prepared in step S100 by the dynamic nearest neighbor (DNN) method to obtain an adjacency matrix A, specifically comprising:
S201, calculating the Euclidean distances between the nodes in the data set to obtain a direct distance matrix S;
S202, selecting the D-neighbors of each node p_i by the dynamic nearest neighbor (DNN) method, taking the connections from p_i to its D-neighbors as the selected edges, and generating an adjacency matrix A based on the D-neighbors; A is an n×n matrix in which the entry A_ij is 1 if p_j is a D-neighbor of p_i and 0 otherwise, A_ij denoting the value in row i, column j of A;
S300, calculating the similarity probabilities between nodes on the adjacency matrix A generated in step S200 by the ADW method to obtain an affinity matrix M, specifically:
S301, defining a distance matrix S' from the direct distance matrix S of step S201 and the adjacency matrix A of step S202, S'_ij denoting the value in row i, column j of S', specifically:
when i ≠ j, S'_ij = A_ij · exp(−S_ij² / (2σ²)), i.e. the Gaussian-kernel similarity of adjacent node pairs, where σ is the kernel width;
when i = j, S'_ij = 0;
S302, defining a weight matrix W from the distance matrix S' defined in step S301; W is an n×n matrix in which W_ij describes the weight of the edge between node p_i and node p_j, i.e. the value in row i, column j of W;
S303, normalizing the weight matrix W defined in step S302 to obtain an affinity matrix M; M is an n×n matrix in which M_ij is the probability that node p_i and node p_j are similar, i.e. the value in row i, column j of M;
S400, performing label propagation according to the affinity matrix M obtained in step S300 to obtain the final classification result.
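As an illustrative sketch only (not the patented implementation), the four steps can be strung together as follows; the helper names distance_matrix, dnn_adjacency, adw_affinity and llgc_propagate are our hypothetical functions, sketched after the corresponding steps below:

```python
import numpy as np

def classify(X, Y, alpha=0.99):
    """Hedged sketch of S100-S400. X: (n, m) data; Y: (n, c) one-hot labels
    with all-zero rows for unlabeled samples; alpha: LLGC adjustment parameter."""
    S = distance_matrix(X)            # S201: direct distance matrix
    A = dnn_adjacency(S)              # S202: dynamic-neighbor edge selection
    M = adw_affinity(S, A)            # S300: ADW weighting + normalization
    F = llgc_propagate(M, Y, alpha)   # S400: LLGC label propagation
    return F.argmax(axis=1)           # y_i = argmax_j F_ij
```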
Preferably, the data sets in step S100 include the synthetic data sets TwoSpirals, ToyData, FourGaussian and TwoMoon, and the image data sets USPS, Mnist, Mnist-3495, Coil20, Coil(1500), G241d and COIL2.
Preferably, in step S201 the Euclidean distance between nodes p_i and p_j in the data set is:

d(p_i, p_j) = √( Σ_{k=1}^{m} (x_ik − x_jk)² )

where m is the dimension of the data, p_i and p_j are the i-th and j-th nodes in the graph, and x_ik and x_jk are the k-th coordinates of p_i and p_j respectively. From the Euclidean distances between the nodes a direct distance matrix S is generated; S is an n×n two-dimensional matrix whose entry S_ij, the value in row i, column j, stores the Euclidean distance between node p_i and node p_j.
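As a minimal sketch of step S201 (assuming NumPy; the function name distance_matrix is ours, not the patent's):

```python
import numpy as np

def distance_matrix(X):
    """Direct distance matrix S of shape (n, n) for data X of shape (n, m):
    S[i, j] = sqrt(sum_k (x_ik - x_jk)**2)."""
    diff = X[:, None, :] - X[None, :, :]     # (n, n, m) coordinate differences
    return np.sqrt((diff ** 2).sum(axis=2))  # Euclidean distances
```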
Preferably, the step S201 further includes:
the Euclidean distance between each node and other nodes in the direct distance matrix S is sequenced from small to large to obtain a matrix O, and an index matrix E corresponding to the direct distance matrix S is generated at the same time, specifically, for the ith row in the direct distance matrix S, the stored distances are sequenced from small to large, and the distances sequenced as j-1 are stored in O ij In the direct distance matrix S, the positions of the distances in the direct distance matrix S are stored in E ij The corresponding position of the distance stored in the matrix O in the direct distance matrix S can be searched through the index matrix E; the matrix O and the index matrix E are both n×n two-dimensional matrices, E ij Representing the elements of the ith row and jth column of the index matrix E.
Preferably, in step S202 the D-neighbors of a node p_i are selected by the dynamic nearest neighbor (DNN) method, specifically by an algebraic method:

N(p_i) denotes the set of D-neighbors of p_i. In matrix O, row i, column j stores the distance of rank j from node p_i, and through the index matrix E its position S_im in the direct distance matrix S can be found; that is, node p_m is the node whose distance to p_i has rank j, and this node is denoted n_j, the sample point of rank j in distance from p_i, with d(·,·) denoting the distance measure. The judgment proceeds as follows: the nearest neighbor n_1 is always added to the D-neighbors; each already-accepted neighbor then serves as a reference point, and a candidate n_j (examined in order of increasing distance) is judged to be a D-neighbor of p_i when d(p_i, n_j) < d(n_r, n_j) for every reference point n_r accepted so far, i.e. when n_j lies on the p_i side of the perpendicular bisector of every segment p_i–n_r; each accepted candidate becomes a further reference point, and so on until some n_j is not a D-neighbor of p_i, at which point the judgment stops. The result obtained is the set of D-neighbors of node p_i, and the connections from p_i to its D-neighbors are taken as the selected edges.
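A sketch of the algebraic D-neighbor selection under the reading above (a candidate is accepted iff it is closer to p_i than to every already-accepted neighbor, and the scan stops at the first rejection); dnn_adjacency is our hypothetical helper name:

```python
import numpy as np

def dnn_adjacency(S):
    """DNN edge selection from a direct distance matrix S of shape (n, n).
    Returns the adjacency matrix A with A[i, j] = 1 iff p_j is a D-neighbor of p_i."""
    n = S.shape[0]
    E = np.argsort(S, axis=1)            # rank-ordered neighbor indices per row
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        accepted = [E[i, 1]]             # the nearest neighbor is always added
        A[i, E[i, 1]] = 1
        for j in range(2, n):            # examine candidates by increasing distance
            cand = E[i, j]
            # p_i side of every perpendicular bisector: d(p_i, cand) < d(ref, cand)
            if all(S[i, cand] < S[ref, cand] for ref in accepted):
                accepted.append(cand)
                A[i, cand] = 1
            else:
                break                    # stop at the first non-D-neighbor
    return A
```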
Preferably, in step S202 the D-neighbors of a node p_i are selected by the dynamic nearest neighbor (DNN) method, specifically by a geometric method:

D-neighbor search process of node p_i: row i, column j of matrix O stores the distance of rank j from node p_i, and through the index matrix E its position S_im in the direct distance matrix S can be found; that is, node p_m is the node whose distance to p_i has rank j, and this node is denoted n_j. The nearest neighbor n_1 is added to the D-neighbors; the perpendicular bisector of segment p_i–n_1 divides the plane into two regions, and the region to which the D-neighbors of p_i belong is the side close to p_i. Within that region the neighbor nearest to p_i, i.e. n_2, is selected, and the perpendicular bisector of p_i–n_2 further restricts the region to which new D-neighbors belong, again the side close to p_i. Within this region the point nearest to p_i is selected, added, and used as the next reference point. This is repeated until the region close to p_i enclosed by all the perpendicular bisectors becomes a closed region; all nodes inside the closed region are the D-neighbors of p_i, and the connections from p_i to its D-neighbors are taken as the selected edges.
Preferably, in the step S302, the weight matrix W is defined as:
W_ij = deg(p_i) · S'_ij

where p_i is a node in the graph, W_ij denotes the value in row i, column j of the weight matrix W, deg(p_i) is the degree of node p_i, and S'_ij denotes the distance-based similarity between node p_i and node p_j, i.e. the value in row i, column j of the distance matrix S'.
Preferably, in the step S303, an affinity matrix M is defined, specifically:
from the weight matrix W, a diagonal matrix T is calculated using the following formula:
T_ii = Σ_j W_ij

where T_ii denotes the value in row i, column i of the diagonal matrix T, and W_ij denotes the value in row i, column j of the weight matrix W.
The affinity matrix M is obtained after normalization using the following formula:
M = T^{-1/2} W T^{-1/2}
where T is the diagonal matrix, W is the weight matrix, and M is the affinity matrix.
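Steps S301-S303 can be sketched as follows, using the Gaussian-kernel reading of S' from S301 above; sigma is an assumed kernel width and adw_affinity is our hypothetical helper name:

```python
import numpy as np

def adw_affinity(S, A, sigma=1.0):
    """ADW sketch: S'_ij = A_ij * exp(-S_ij^2 / (2*sigma^2)), W_ij = deg(p_i) * S'_ij,
    T_ii = sum_j W_ij, M = T^(-1/2) W T^(-1/2)."""
    Sp = A * np.exp(-S ** 2 / (2.0 * sigma ** 2))  # distance matrix S' (zero where no edge)
    np.fill_diagonal(Sp, 0.0)                      # S'_ii = 0
    deg = A.sum(axis=1)                            # deg(p_i): number of D-neighbors
    W = deg[:, None] * Sp                          # weight matrix W
    t = np.maximum(W.sum(axis=1), 1e-12)           # T_ii, guarded against empty rows
    T_inv_sqrt = np.diag(1.0 / np.sqrt(t))
    return T_inv_sqrt @ W @ T_inv_sqrt             # affinity matrix M
```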
Preferably, the label propagation in step S400 is performed by the local and global consistency (LLGC) method, calculated as follows:

F = (I − αM)^{-1} Y

where F is an n×c matrix, n being the number of nodes and c the number of label classes; F_ij denotes the probability that node p_i is marked with the j-th label, i.e. the value in row i, column j of F; I is the identity matrix, α is an adjustment parameter, M is the affinity matrix, and Y is the label information, an n×c matrix storing the label information of each node, with Y_ij the value in row i, column j of Y: if node p_i is marked with the j-th class label, Y_ij = 1, otherwise Y_ij = 0; (I − αM)^{-1} gives each node the probability of acquiring the labels of the marked nodes;
finally, the label information of node p_i is obtained, specifically:

y_i = argmax_{j≤c} F_ij

where F_ij is the value in row i, column j of the matrix F, and argmax assigns to y_i the value of j at which F_ij attains its maximum, i.e. the label of node p_i is y_i; after labels are obtained for all the nodes, the classification of the data is completed.
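As a sketch of step S400 (llgc_propagate is our hypothetical helper; the default alpha is our assumption, and solving the linear system avoids forming the inverse explicitly):

```python
import numpy as np

def llgc_propagate(M, Y, alpha=0.99):
    """Closed-form LLGC: F = (I - alpha*M)^(-1) Y for affinity M (n, n) and
    label matrix Y (n, c) with Y[i, j] = 1 iff node p_i carries label j."""
    n = M.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * M, Y)

# y_i = argmax_j F_ij gives each node its predicted label:
# labels = llgc_propagate(M, Y).argmax(axis=1)
```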
Compared with the prior art, the method has the following beneficial technical effects:
(1) The semi-supervised classification method based on dynamic composition can better express potential distribution characteristics of data;
(2) The semi-supervised classification method based on dynamic composition adopts the dynamic-neighbor edge selection method and then calculates the edge weights by the ADW method before classification; the classification method can capture the distribution of the data, connecting more edges in dense regions and fewer edges in sparse regions, so the density of the data is better reflected and a better classification effect is achieved.
Drawings
FIG. 1 shows a flow chart of a dynamic composition-based semi-supervised classification algorithm of the present disclosure;
FIG. 2 is a schematic diagram of a DNN edge selection method on a two-dimensional plane;
FIG. 3 shows a schematic diagram of a D-nearest neighbor search process on a two-dimensional plane;
FIG. 4 (a) shows a TwoSpirals dataset;
FIG. 4 (b) shows a ToyData dataset;
FIG. 4 (c) shows a FourGaussian dataset;
FIG. 4 (d) shows a TwoMoon dataset;
FIG. 5 (a) shows the TwoSpirals dataset DNN composition result;
FIG. 5 (b) shows the TwoSpirals dataset DNN prediction result;
FIG. 5 (c) shows the TwoSpirals dataset kNN composition result;
FIG. 5 (d) shows the TwoSpirals dataset kNN prediction result;
FIG. 6 (a) shows the ToyData dataset DNN composition result;
FIG. 6 (b) shows the ToyData dataset DNN prediction result;
FIG. 6 (c) shows the ToyData dataset kNN composition result when k=5;
FIG. 6 (d) shows the ToyData dataset kNN prediction result when k=5;
FIG. 6 (e) shows the ToyData dataset kNN composition result when k=10;
FIG. 6 (f) shows the ToyData dataset kNN prediction result when k=10;
FIG. 6 (g) shows the ToyData dataset kNN composition result when k=15;
FIG. 6 (h) shows the ToyData dataset kNN prediction result when k=15.
Detailed Description
The semi-supervised classification method based on dynamic composition provided by the present disclosure is described below.
The invention is explained with reference to fig. 1 to 6 (h). In one embodiment, as shown in fig. 1, a semi-supervised classification method based on dynamic composition is provided, comprising steps S100 to S400:
S100, preparing a data set, wherein the data set comprises two parts, labeled data X_l and unlabeled data X_u; the label information of the labeled data X_l is F_l; the characteristics of the data in the data set are described by data attribute information; l denotes the number of labeled data; the data in the data set are abstracted into n nodes in an m-dimensional space, the i-th node being denoted p_i;
S200, selecting edges on the data set prepared in step S100 by the dynamic nearest neighbor (DNN) method to obtain an adjacency matrix A, specifically comprising:
S201, calculating the Euclidean distances between the nodes in the data set to obtain a direct distance matrix S;
S202, selecting the D-neighbors of each node p_i by the dynamic nearest neighbor (DNN) method, taking the connections from p_i to its D-neighbors as the selected edges, and generating an adjacency matrix A based on the D-neighbors; A is an n×n matrix in which the entry A_ij is 1 if p_j is a D-neighbor of p_i and 0 otherwise, A_ij denoting the value in row i, column j of A;
S300, calculating the similarity probabilities between nodes on the adjacency matrix A generated in step S200 by the ADW method to obtain an affinity matrix M, specifically:
S301, defining a distance matrix S' from the direct distance matrix S of step S201 and the adjacency matrix A of step S202, S'_ij denoting the value in row i, column j of S', specifically:
when i ≠ j, S'_ij = A_ij · exp(−S_ij² / (2σ²)), i.e. the Gaussian-kernel similarity of adjacent node pairs, where σ is the kernel width;
when i = j, S'_ij = 0;
S302, defining a weight matrix W from the distance matrix S' defined in step S301; W is an n×n matrix in which W_ij describes the weight of the edge between node p_i and node p_j, i.e. the value in row i, column j of W;
S303, normalizing the weight matrix W defined in step S302 to obtain an affinity matrix M; M is an n×n matrix in which M_ij is the probability that node p_i and node p_j are similar, i.e. the value in row i, column j of M;
S400, performing label propagation according to the affinity matrix M obtained in step S300 to obtain the final classification result.
In this embodiment, edges are selected on the prepared data set by the dynamic nearest neighbor DNN method to obtain the adjacency matrix A, the affinity matrix M is calculated, and label propagation is then performed to obtain the final classification result. The classification method can capture the distribution of the data, connecting more edges in dense regions and fewer edges in sparse regions, so the density of the data is better reflected and a better classification effect is achieved. The DNN method may be implemented by an algebraic or a geometric method.
In another embodiment, the classification method proposed by the present disclosure may be applied to classification on a variety of data sets, as long as the data set contains data and matching labels, such as the synthetic data sets TwoSpirals, ToyData, FourGaussian and TwoMoon, and the image data sets USPS, Mnist, Mnist-3495, Coil20, Coil(1500), G241d and COIL2.
The numbers of samples, attributes and categories of the synthetic data sets TwoSpirals, ToyData, FourGaussian and TwoMoon are shown in Table 1:

Table 1 Synthetic data sets

Synthetic data set   Number of samples   Attribute dimension   Number of categories
TwoSpirals           2000                2                     2
ToyData              788                 2                     7
FourGaussian         1200                2                     4
TwoMoon              400                 2                     2
The numbers of samples, attributes and categories of the image data sets USPS, Mnist, Mnist-3495, Coil20, Coil(1500), G241d and COIL2 are shown in Table 2:

Table 2 Image data sets

Image data set   Number of samples   Attribute dimension   Number of categories
USPS             1800                256                   6
Mnist            6996                784                   10
Mnist-3495       3495                784                   10
Coil20           1440                1024                  20
Coil(1500)       1500                241                   6
G241d            1500                241                   2
COIL2            1500                241                   2
The above data sets are all existing, publicly available benchmark data sets.
In another embodiment, in step S201 the Euclidean distance between nodes p_i and p_j in the data set is:

d(p_i, p_j) = √( Σ_{k=1}^{m} (x_ik − x_jk)² )

where m is the dimension of the space, p_i and p_j are the i-th and j-th nodes in the graph, and x_ik and x_jk are the k-th coordinates of p_i and p_j respectively. From the Euclidean distances between the nodes a direct distance matrix S is generated; S is an n×n two-dimensional matrix whose element S_ij, in row i, column j, stores the Euclidean distance between node p_i and node p_j.
In another embodiment, the step S201 further includes:
The Euclidean distances between each node and the other nodes in the direct distance matrix S are sorted in ascending order to obtain a matrix O, and at the same time an index matrix E corresponding to the direct distance matrix S is generated. Specifically, for the i-th row of S, the stored distances are sorted in ascending order; the distance of rank j−1 is stored in O_ij, and its position in the direct distance matrix S is stored in E_ij, so that the position in S of any distance stored in O can be looked up through the index matrix E. Both O and E are n×n two-dimensional matrices, and E_ij denotes the element in row i, column j of E.
In another embodiment, step S202 selects the D-neighbors of a node p_i by the dynamic nearest neighbor (DNN) method, specifically by an algebraic method:

N(p_i) denotes the set of D-neighbors of p_i. In matrix O, row i, column j stores the distance of rank j from node p_i, and through the index matrix E its position S_im in the direct distance matrix S can be found; that is, node p_m is the node whose distance to p_i has rank j, and this node is denoted n_j, the sample point of rank j in distance from p_i, with d(·,·) denoting the distance measure. The judgment proceeds as follows: the nearest neighbor n_1 is always added to the D-neighbors; each already-accepted neighbor then serves as a reference point, and a candidate n_j (examined in order of increasing distance) is judged to be a D-neighbor of p_i when d(p_i, n_j) < d(n_r, n_j) for every reference point n_r accepted so far, i.e. when n_j lies on the p_i side of the perpendicular bisector of every segment p_i–n_r; each accepted candidate becomes a further reference point, and so on until some n_j is not a D-neighbor of p_i, at which point the judgment stops. The result obtained is the set of D-neighbors of node p_i, and the connections from p_i to its D-neighbors are taken as the selected edges.
In this embodiment, the D-neighbor determination of p_i on a two-dimensional plane is shown in fig. 2. To ensure the connectivity of the graph structure in DNN, i.e. so that no isolated nodes appear, at least one connection is established for every sample; therefore the nearest neighbor n_1 is added to the D-neighbors of p_i, and p_i is connected to n_1. The dashed line in the figure is the perpendicular bisector of segment p_i–n_1. It is then judged whether n_2 is a D-neighbor of p_i according to the criterion defined above: when d(p_i, n_2) < d(n_1, n_2), i.e. when n_2 lies on the p_i side of the perpendicular bisector and within the circle centered at p_i with radius d(p_i, n_2), n_2 is added to the set N(p_i). A later candidate may lie at the same distance from p_i as an accepted neighbor and still become, or fail to become, a D-neighbor of p_i, depending on the criterion with respect to the accepted reference points. The search process for the D-neighbors of sample point p_i is essentially a search over concentric circles centered at p_i whose radii are the distances from p_i to its successive neighbors: the denser the local data distribution around p_i, the more concentric circles are searched and the more D-neighbors are generated.
Therefore, the method can capture the distribution of the data, more edges are connected in the data dense area, and fewer edges are connected in the data sparse area, so that the degree of the density of the data can be reflected better, and the method has a better classification effect.
The graphs constructed using DNN are connected: in the graph G = (X, ε), select a connected vertex subset Z; find a point B ∈ X outside Z and a point A ∈ Z such that B is the point closest to the vertex set Z; then point B is an adaptive neighbor of point A, that is, point B is connected with the vertex set Z, and point B is added to the connected subset Z. By repeating the above operation all points in the graph can be added to the connected subset Z, so all points in the graph G = (X, ε) are connected.
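The connectivity property can be checked directly on a constructed adjacency matrix, e.g. with a breadth-first search over the undirected version of the graph (a verification sketch, not part of the method):

```python
from collections import deque
import numpy as np

def is_connected(A):
    """True iff the graph with adjacency matrix A has one connected component,
    treating an edge in either direction as a connection."""
    n = A.shape[0]
    und = (A + A.T) > 0                  # symmetrize: undirected reachability
    seen = np.zeros(n, dtype=bool)
    queue = deque([0])
    seen[0] = True
    while queue:
        u = queue.popleft()
        for v in np.flatnonzero(und[u]):
            if not seen[v]:
                seen[v] = True
                queue.append(v)
    return bool(seen.all())
```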
The graph constructed using DNN does not have the problem of weak connectivity. By contradiction, assume that point A is an adaptive neighbor of point B while point B is not an adaptive neighbor of point A. Since point B is not an adaptive neighbor of point A, there must exist a point C such that the path cost of B–C–A is less than the path cost of B–A; then the path cost of A–B is greater than the path cost of A–C–B, so point A is not an adaptive neighbor of point B, contradicting the assumption. Hence weak connectivity does not occur.
In another embodiment, step S202 selects the D-neighbors of a node p_i by the dynamic nearest neighbor (DNN) method, specifically by a geometric method:

D-neighbor search process of node p_i: row i, column j of matrix O stores the distance of rank j from node p_i, and through the index matrix E its position S_im in the direct distance matrix S can be found; that is, node p_m is the node whose distance to p_i has rank j, and this node is denoted n_j. The nearest neighbor n_1 is added to the D-neighbors; the perpendicular bisector of segment p_i–n_1 divides the plane into two regions, and the region to which the D-neighbors of p_i belong is the side close to p_i. Within that region the neighbor nearest to p_i, i.e. n_2, is selected, and the perpendicular bisector of p_i–n_2 further restricts the region to which new D-neighbors belong, again the side close to p_i. Within this region the point nearest to p_i is selected, added, and used as the next reference point. This is repeated until the region close to p_i enclosed by all the perpendicular bisectors becomes a closed region; all nodes inside the closed region are the D-neighbors of p_i, and the connections from p_i to its D-neighbors are taken as the selected edges.
In this embodiment, a closed region is a region such that the line segment connecting any point inside the region to any point outside it intersects the boundary of the region.
In this embodiment, the D-neighbor determination of p_i on a two-dimensional plane is shown in fig. 3. To ensure connectivity of the graph structure in DNN, i.e. so that no isolated nodes appear, at least one connection is established for every sample; therefore n_1 is added to the D-neighbors of p_i, and p_i is connected to n_1. Dashed line (1) in the figure is the perpendicular bisector of p_i–n_1, dividing the planar region into two parts; the region of the D-neighbors of p_i is the side close to p_i, i.e. the region to the left of dashed line (1). In that region the neighbor nearest to p_i, n_2, is selected and p_i is connected to n_2; dashed line (2) is the perpendicular bisector of p_i–n_2, and lines (1) and (2) divide the plane into four parts, of which the D-neighbor region of p_i is the side close to p_i, i.e. left of line (1) and below line (2). In that region the neighbor nearest to p_i, n_3, is selected and connected; dashed line (3) is the perpendicular bisector of p_i–n_3, and lines (1), (2) and (3) divide the plane into six parts, of which the D-neighbor region is the side close to p_i, i.e. left of line (1), below line (2) and right of line (3). In that region the neighbor nearest to p_i, n_4, is selected and connected; dashed line (4) is the perpendicular bisector of p_i–n_4, and lines (1)-(4) divide the plane into nine parts; the D-neighbor region of p_i is the closed region enclosed by the left side of line (1), the lower side of line (2), the right side of line (3) and the upper side of line (4), and all nodes inside this region are D-neighbors of node p_i.
It can be seen that the more densely the local data around p_i are distributed, the more nodes fall inside the closed region and the more D-neighbors p_i has. The method can thus capture the distribution of the data, connecting more edges in dense regions and fewer edges in sparse regions, so the density of the data is better reflected and a better classification effect is achieved.
In another embodiment, in the step S302, the weight matrix W is defined as:
W_ij = deg(p_i) · S'_ij

where p_i is a node in the graph, W_ij denotes the value in row i, column j of the weight matrix W, deg(p_i) is the degree of node p_i, and S'_ij denotes the distance-based similarity between node p_i and node p_j, i.e. the value in row i, column j of the distance matrix S'.
In another embodiment, the step S303 defines an affinity matrix M, specifically:
from the weight matrix W, a diagonal matrix T is calculated using the following formula:
T_ii = Σ_j W_ij

where T_ii denotes the value in row i, column i of the diagonal matrix T, and W_ij denotes the value in row i, column j of the weight matrix W.
The affinity matrix M is obtained after normalization using the following formula:
M = T^{-1/2} W T^{-1/2}
where T is the diagonal matrix, W is the weight matrix, and M is the affinity matrix.
In the embodiment, the similarity probability of the nodes in the graph in the ADW is determined by the similarity and the node degree, the distribution of the data is expressed by the node degree in the ADW, the similarity between the data is expressed by the Gaussian kernel function, the ADW is simple to calculate, the algorithm complexity is low, and the method has the following two advantages:
(1) The formula reduces the overfitting problem caused by weight parameterization and is insensitive to noisy data. In experiments on synthetic data sets, ADW was observed to be robust to noise in the input data, and its performance advantage was demonstrated on 7 real data sets.
(2) ADW has no additional tuning parameters.
In another embodiment, the label propagation in step S400 is implemented by the local and global consistency LLGC (Learning with Local and Global Consistency) method, calculated as follows:

F = (I − αM)^{-1} Y

where F is an n×c matrix, n being the number of nodes and c the number of label classes; F_ij denotes the probability that node p_i is marked with the j-th label, i.e. the value in row i, column j of F; I is the identity matrix, α is an adjustment parameter, M is the affinity matrix, and Y is the label information, an n×c matrix storing the label information of each node, with Y_ij the value in row i, column j of Y: if node p_i is marked with the j-th class label, Y_ij = 1, otherwise Y_ij = 0; (I − αM)^{-1} gives each node the probability of acquiring the labels of the marked nodes;
finally, the label information of node p_i is obtained, specifically:

y_i = argmax_{j≤c} F_ij

where F_ij is the value in row i, column j of the matrix F, and argmax assigns to y_i the value of j at which F_ij attains its maximum, i.e. the label of node p_i is y_i; after labels are obtained for all the nodes, the classification of the data is completed.
The steps of the semi-supervised classification method based on dynamic composition provided by the present disclosure have been specifically introduced above; the superiority of the method over existing data classification methods is illustrated below by experimental comparison.
Experiment:
to illustrate the superiority of the dynamic composition-based semi-supervised classification approach presented in this disclosure, experiments were performed on synthetic and real datasets widely used in graph-based semi-supervised learning. The method is mainly used for verifying the potential distribution characteristics of the expression data of the proposed method, and improving the semi-supervised classification method. Comparing the method proposed by the present disclosure with the kNN method, the kNN method refers to that by finding k nearest neighbors of a sample, and assigning an average value of the attributes of the nearest neighbors to the sample, the attributes of the sample can be obtained. One of the evaluation criteria of the graph construction method is: on the premise of using the same deduction method, whether better classification performance can be achieved or not is judged, and an LLGC classification method is adopted in the experiment. For multi-classification problems, the algorithm performance is evaluated using Error Rate (Error Rate) as shown in the following equation:
wherein c is the total number of sample categories, N i For the number of samples of class i, F i Is the number of misclassified samples in the class i samples.
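Under this pooled reading of the formula (summing the misclassified counts over all classes and dividing by the total sample count), the error rate reduces to the overall misclassification fraction:

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Error rate = sum_i F_i / sum_i N_i, i.e. the fraction of all samples
    whose predicted class differs from the true class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))
```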
Results of synthetic dataset experiments
The classification performance of DNN+LLGC and kNN+LLGC was verified using the 4 synthetic data sets in Table 1: TwoSpirals, ToyData, FourGaussian and TwoMoon. Their sample numbers, attribute numbers and category numbers are listed in Table 1, and they are randomly generated two-dimensional data. The TwoSpirals data set has 1000 positive and 1000 negative samples and, as shown in fig. 4 (a), is distributed in a double-helix shape. The ToyData data set has 788 samples belonging to 7 classes, each sample having 2 attributes, as shown in fig. 4 (b). The FourGaussian data set consists of 1200 samples belonging to 4 classes, with 2 attributes per sample, as shown in fig. 4 (c). The TwoMoon data set has 200 positive and 200 negative samples and, as shown in fig. 4 (d), is distributed in a double-moon shape.
Taking the TwoSpirals and ToyData data sets as examples, the graphs constructed by the DNN and kNN methods and the class prediction results are shown in fig. 5 (a)-5 (d) and fig. 6 (a)-6 (h); the composition results are the graphs obtained with the above edge selection and edge re-weighting methods, and the prediction results are the classification results obtained by label propagation on the composed graphs. In fig. 5 (a)-5 (d), the dark points and light points represent the two classes of data. Fig. 5 (a) and 5 (c) are the DNN and kNN graphs of the TwoSpirals data set respectively. In the DNN graph, edges still connect light-colored data points that are far apart, better expressing the correlation between similar sample points. Fig. 5 (b) and 5 (d) show the class prediction results obtained by the DNN and kNN algorithms, and it can be seen that the DNN prediction is more accurate. In fig. 6 (a)-6 (h), points of different shades represent the seven classes of data. Fig. 6 (a) and 6 (b) show the composition and class prediction results obtained by the DNN method; fig. 6 (c), 6 (e) and 6 (g) show the composition results obtained by the kNN method for k = 5, 10, 15, and fig. 6 (d), 6 (f) and 6 (h) the corresponding class prediction results.
In the classification experiments, the numbers of labeled samples are shown in Table 3, and all other data are unlabeled. In the graph construction process the kNN (k = 5, 10, 15) and DNN methods are used, and LLGC is used for label propagation. Each experiment was repeated 20 times and the average accuracy was calculated. The classification results are shown in Table 3: the first column is the name of the synthetic data set, the second the number of labeled samples, the third the composition method, the fourth the different values of the parameter k in the kNN method, the fifth and sixth the minimum and maximum node degree in the graph, and the seventh and eighth the average classification error rate and its standard deviation. It can be seen that, except for the TwoSpirals data set, the classification error rate of the DNN method is smaller than that of the kNN method for the other data sets under the different values of k.
Table 3 classification performance of synthetic dataset
Image dataset experimental results
In order to compare the classification performance of different graph-construction and inference methods, combinations of them were applied to the following 7 image data sets, presented in Table 2. Each data set consists of gray-scale images, and the gray values are taken as the feature values of each image.
In each class, randomly selected sample points are taken as the labeled samples; the number of labeled samples per class is shown in Table 4, and the remaining sample points are unlabeled. In the experiments, the PCA method is used for dimensionality reduction; PCA maps the features of high-dimensional data to a low-dimensional space by retaining the important features and removing noise and unimportant features, and in the experiments the data are reduced to 50 dimensions. The unlabeled samples were classified and each experiment repeated 10 times; the classification results (average classification error rate) are shown in Table 4: the first column is the name of the image data set, the second the number of labeled samples, the third the composition method, the fourth the different values of the parameter k in the kNN method, the fifth and sixth the minimum and maximum node degree in the graph, and the seventh and eighth the average classification error rate and its standard deviation. It can be seen that, for each data set, the classification error rate of the DNN method is smaller than that of the kNN method under the different values of k. Therefore, the semi-supervised classification method based on dynamic composition can improve classification accuracy.
Table 4 classification performance of image dataset
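The preprocessing step described above can be sketched as follows (assuming scikit-learn's PCA; the variable names are ours):

```python
from sklearn.decomposition import PCA

# X: (n_samples, n_pixels) matrix of gray values, one row per image (assumed input).
X50 = PCA(n_components=50).fit_transform(X)  # reduce to 50 dimensions before graph construction
```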
Although the embodiments of the present invention have been described above with reference to the accompanying drawings, the present invention is not limited to the above-described specific embodiments and application fields, and the above-described specific embodiments are merely illustrative, and not restrictive. Those skilled in the art, having the benefit of this disclosure, may effect numerous forms of the invention without departing from the scope of the invention as claimed.

Claims (8)

1. A computer-executed semi-supervised classification method based on dynamic composition, which addresses the problem that in graph-based semi-supervised classification both the edge selection algorithm and the similarity calculation method have their respective limitations, so that the classification result is not accurate enough, characterized by comprising the following steps:
S100, preparing a data set, wherein the data set comprises two parts, labeled data X_l and unlabeled data X_u; the label information of the labeled data X_l is F_l; the characteristics of the data in the data set are described by data attribute information; l denotes the number of labeled data; the data in the data set are abstracted into n nodes in an m-dimensional space, the i-th node being denoted p_i;
S200, selecting edges on the data set prepared in step S100 by the dynamic nearest neighbor (DNN) method to obtain an adjacency matrix A, specifically comprising:
S201, calculating the Euclidean distances between the nodes in the data set to obtain a direct distance matrix S;
S202, selecting the D-neighbors of each node p_i by the dynamic nearest neighbor (DNN) method, taking the connections from p_i to its D-neighbors as the selected edges, and generating an adjacency matrix A based on the D-neighbors; A is an n×n matrix in which the entry A_ij is 1 if p_j is a D-neighbor of p_i and 0 otherwise, A_ij denoting the value in row i, column j of A;
S300, calculating the similarity probabilities between nodes on the adjacency matrix A generated in step S200 by the ADW method to obtain an affinity matrix M, specifically:
S301, defining a distance matrix S' from the direct distance matrix S of step S201 and the adjacency matrix A of step S202, S'_ij denoting the value in row i, column j of S', specifically:
when i ≠ j, S'_ij = A_ij · exp(−S_ij² / (2σ²)), i.e. the Gaussian-kernel similarity of adjacent node pairs, where σ is the kernel width;
when i = j, S'_ij = 0;
S302, defining a weight matrix W from the distance matrix S' defined in step S301; W is an n×n matrix in which W_ij describes the weight of the edge between node p_i and node p_j, i.e. the value in row i, column j of W;
S303, normalizing the weight matrix W defined in step S302 to obtain an affinity matrix M; M is an n×n matrix in which M_ij is the probability that node p_i and node p_j are similar, i.e. the value in row i, column j of M;
S400, performing label propagation according to the affinity matrix M obtained in step S300 to obtain the final classification result;
wherein the data sets in step S100 include the synthetic data sets TwoSpirals, ToyData, FourGaussian and TwoMoon, and the image data sets USPS, Mnist, Mnist-3495, Coil20, Coil(1500), G241d and COIL2.
2. The method according to claim 1, wherein in step S201 the Euclidean distance between nodes p_i and p_j in the data set is:

d(p_i, p_j) = √( Σ_{k=1}^{m} (x_ik − x_jk)² )

where m is the dimension of the data, p_i and p_j are the i-th and j-th nodes in the graph, and x_ik and x_jk are the k-th coordinates of p_i and p_j respectively. From the Euclidean distances between the nodes a direct distance matrix S is generated; S is an n×n two-dimensional matrix whose entry S_ij, the value in row i, column j, stores the Euclidean distance between node p_i and node p_j.
3. The method of claim 2, said step S201 further comprising:
The Euclidean distances between each node and the other nodes in the direct distance matrix S are sorted in ascending order to obtain a matrix O, and at the same time an index matrix E corresponding to the direct distance matrix S is generated. Specifically, for the i-th row of S, the stored distances are sorted in ascending order; the distance of rank j−1 is stored in O_ij, and its position in the direct distance matrix S is stored in E_ij, so that the position in S of any distance stored in O can be looked up through the index matrix E. Both O and E are n×n two-dimensional matrices, and E_ij denotes the element in row i, column j of E.
4. A method according to claim 3, wherein in step S202 the D-neighbors of a node p_i are selected by the dynamic nearest neighbor (DNN) method, specifically by an algebraic method:

N(p_i) denotes the set of D-neighbors of p_i. In matrix O, row i, column j stores the distance of rank j from node p_i, and through the index matrix E its position S_im in the direct distance matrix S can be found; that is, node p_m is the node whose distance to p_i has rank j, and this node is denoted n_j, the sample point of rank j in distance from p_i, with d(·,·) denoting the distance measure. The judgment proceeds as follows: the nearest neighbor n_1 is always added to the D-neighbors; each already-accepted neighbor then serves as a reference point, and a candidate n_j (examined in order of increasing distance) is judged to be a D-neighbor of p_i when d(p_i, n_j) < d(n_r, n_j) for every reference point n_r accepted so far, i.e. when n_j lies on the p_i side of the perpendicular bisector of every segment p_i–n_r; each accepted candidate becomes a further reference point, and so on until some n_j is not a D-neighbor of p_i, at which point the judgment stops. The result obtained is the set of D-neighbors of node p_i, and the connections from p_i to its D-neighbors are taken as the selected edges.
5. A method according to claim 3, wherein in step S202 the D-neighbors of a node p_i are selected by the dynamic nearest neighbor (DNN) method, specifically by a geometric method:

D-neighbor search process of node p_i: row i, column j of matrix O stores the distance of rank j from node p_i, and through the index matrix E its position S_im in the direct distance matrix S can be found; that is, node p_m is the node whose distance to p_i has rank j, and this node is denoted n_j. The nearest neighbor n_1 is added to the D-neighbors; the perpendicular bisector of segment p_i–n_1 divides the plane into two regions, and the region to which the D-neighbors of p_i belong is the side close to p_i. Within that region the neighbor nearest to p_i, i.e. n_2, is selected, and the perpendicular bisector of p_i–n_2 further restricts the region to which new D-neighbors belong, again the side close to p_i. Within this region the point nearest to p_i is selected, added, and used as the next reference point. This is repeated until the region close to p_i enclosed by all the perpendicular bisectors becomes a closed region; all nodes inside the closed region are the D-neighbors of p_i, and the connections from p_i to its D-neighbors are taken as the selected edges.
6. The method according to claim 1, wherein in the step S302, the weight matrix W is defined as:
W_ij = deg(p_i) · S'_ij

where p_i is a node in the graph, W_ij denotes the value in row i, column j of the weight matrix W, deg(p_i) is the degree of node p_i, and S'_ij denotes the distance-based similarity between node p_i and node p_j, i.e. the value in row i, column j of the distance matrix S'.
7. The method according to claim 1, wherein the step S303 defines an affinity matrix M, specifically:
from the weight matrix W, a diagonal matrix T is calculated using the following formula:
T_ii = Σ_j W_ij

where T_ii denotes the value in row i, column i of the diagonal matrix T, and W_ij denotes the value in row i, column j of the weight matrix W;
the affinity matrix M is obtained after normalization using the following formula:
M = T^{-1/2} W T^{-1/2}
where T is the diagonal matrix, W is the weight matrix, and M is the affinity matrix.
8. The method according to claim 1, wherein the label propagation in step S400 is performed by the local and global consistency LLGC method, calculated as follows:

F = (I − αM)^{-1} Y

where F is an n×c matrix, n being the number of nodes and c the number of label classes; F_ij denotes the probability that node p_i is marked with the j-th label, i.e. the value in row i, column j of F; I is the identity matrix, α is an adjustment parameter, M is the affinity matrix, and Y is the label information, an n×c matrix storing the label information of each node, with Y_ij the value in row i, column j of Y: if node p_i is marked with the j-th class label, Y_ij = 1, otherwise Y_ij = 0; (I − αM)^{-1} gives each node the probability of acquiring the labels of the marked nodes;
finally, the label information of node p_i is obtained, specifically:

y_i = argmax_{j≤c} F_ij

where F_ij is the value in row i, column j of the matrix F, and argmax assigns to y_i the value of j at which F_ij attains its maximum, i.e. the label of node p_i is y_i; after labels are obtained for all the nodes, the classification of the data is completed.
CN201911131232.XA 2019-11-20 2019-11-20 Semi-supervised classification method based on dynamic composition Active CN111046914B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911131232.XA CN111046914B (en) 2019-11-20 2019-11-20 Semi-supervised classification method based on dynamic composition


Publications (2)

Publication Number Publication Date
CN111046914A CN111046914A (en) 2020-04-21
CN111046914B true CN111046914B (en) 2023-10-27

Family

ID=70232184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911131232.XA Active CN111046914B (en) 2019-11-20 2019-11-20 Semi-supervised classification method based on dynamic composition

Country Status (1)

Country Link
CN (1) CN111046914B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019001071A1 (en) * 2017-06-28 2019-01-03 浙江大学 Adjacency matrix-based graph feature extraction system and graph classification system and method
CN109815986A (en) * 2018-12-24 2019-05-28 陕西师范大学 The semisupervised classification method of fusion part and global characteristics
CN109829472A (en) * 2018-12-24 2019-05-31 陕西师范大学 Semisupervised classification method based on probability neighbour
CN110309871A (en) * 2019-06-27 2019-10-08 西北工业大学深圳研究院 A kind of semi-supervised learning image classification method based on random resampling


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Semi-supervised classification algorithm based on C-means clustering and graph transduction; Wang Na, Wang Xiaofeng, Geng Guohua, Song Qiannan; Journal of Computer Applications (09); full text *
Research on image classification algorithms based on semi-supervised deep belief networks; Zhu Changbao, Cheng Yong, Gao Qiang; Computer Science (S1); full text *

Also Published As

Publication number Publication date
CN111046914A (en) 2020-04-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant