CN116821715A - Artificial bee colony optimization clustering method based on semi-supervision constraint - Google Patents
Artificial bee colony optimization clustering method based on semi-supervision constraint
- Publication number
- CN116821715A CN202310412843.1A
- Authority
- CN
- China
- Prior art keywords
- constraint
- clustering
- matrix
- data
- bee colony
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0895—Weakly supervised learning, e.g. semi-supervised or self-supervised learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an artificial bee colony optimization clustering method based on semi-supervised constraints. First, labeled data are used to generate pairwise constraints that guide the construction of the adjacency matrix, which enriches the adjacency matrix information and improves spectral clustering accuracy. Next, an artificial bee colony algorithm is incorporated: a region-restricted optimization search speeds up the search for the cluster centers, a reconstructed fitness function improves the accuracy of that search, and the artificial bee colony algorithm is then used to find reasonable cluster centers. Finally, the K-means algorithm, combined with the Laplacian matrix and the optimized cluster centers, completes the clustering. Comparative experiments against other clustering algorithms on UCI data sets show that the method finds reasonable cluster centers in a short time and obtains more accurate clustering results; the method is robust, precise, and clusters well.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to an artificial bee colony optimization clustering method based on semi-supervised constraint.
Background
Spectral clustering is built on spectral graph theory. Compared with traditional clustering algorithms, it can cluster on sample spaces of arbitrary shape and converge to a globally optimal solution, and it has therefore been widely studied. Dhillon et al. studied the spectral clustering N-cut algorithm together with the K-means algorithm and combined them into a new weighted kernel K-means algorithm. Hu et al. extended spectral clustering from a single view to multiple views, which widened the algorithm's range of application and, thanks to its fast implementation, made it more practical. Zha et al. studied a spectral clustering algorithm based on bipartite graphs and concluded that the singular value decomposition of the edge-weight matrix associated with the bipartite graph is equivalent to minimizing the objective function. Perez et al. proposed sparse kernel spectral clustering, which learns a data similarity matrix by assigning adaptive, optimal neighbors to each data point based on local distance; applied to large-scale data sets, the improved algorithm achieved better results. Xie Juanying proposed fully adaptive spectral clustering to keep noise points from affecting the local scale parameters of the self-tuning spectral clustering algorithm.
However, spectral clustering is an unsupervised algorithm and can use only a single kind of data for clustering, so combining spectral clustering with semi-supervised clustering has received wide attention in recent years. Semi-supervised clustering commonly falls into three types: constraint-based algorithms, distance-based algorithms, and algorithms that fuse constraints and distances. Kamvar S. D. et al. first proposed a constrained spectral clustering algorithm that updates the similarity matrix using a given number of randomly selected pairwise constraints. Kulis et al. improved the performance of semi-supervised graph clustering algorithms with a kernel method. Ding et al. introduced pairwise constraints into the graph-partitioning objective function to obtain an approximate semi-supervised spectral clustering result based on a hidden Markov random field.
However, while these algorithms improve the clustering effect, they contribute little to efficiency. An artificial bee colony optimization clustering method based on semi-supervised constraints is therefore needed to solve these problems in the prior art, improving the clustering effect while also improving data processing efficiency.
Disclosure of Invention
In view of the above problems, the invention aims to provide an artificial bee colony optimization clustering method based on semi-supervised constraints. By adding semi-supervised pairwise constraints to the spectral clustering algorithm to guide the generation of the similarity matrix, the method improves the efficiency of spectral clustering and, at the same time, the clustering effect; it is robust, precise, and clusters well.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
An artificial bee colony optimization clustering method based on semi-supervised constraints comprises the following steps:
Step1. During data processing, collect the data by stratified sampling;
Step2. After the data set is acquired, extract part of the labeled data from it to generate pairwise constraints;
Step3. Construct a semi-supervised constraint matrix from the pairwise constraint relationships;
Step4. Compute an initial similarity matrix S with a Gaussian similarity function, normalize S with a softmax function to obtain S', add the semi-supervised pairwise constraint information to S', and use the labeled data to guide the generation of the adjacency matrix W;
Step5. Take the sum of the similarity weights in each row of the adjacency matrix W as the corresponding diagonal element to obtain the degree matrix D;
Step6. Compute the Laplacian matrix L according to $L = D - W$, and compute the eigenvectors $v_1, v_2, \ldots, v_k$ corresponding to the selected eigenvalues;
Step7. Construct the matrix of the feature space by arranging the k eigenvectors into an $n \times k$ matrix $F = [f_1, f_2, \ldots, f_k]$, $F \in \mathbb{R}^{n \times k}$;
Step8. Cluster the matrix F, searching for the optimal cluster centers with an improved artificial bee colony algorithm;
Step9. Cluster with K-means starting from the cluster centers found, completing the processing of the sample data set.
Preferably, the pairwise constraint generation process of Step2 includes:
Step201. Let the data set be $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i, x_j, x_k$ are sample data points. The pairwise relationships of the association constraint and the non-association constraint are symmetric and transitive. With Re denoting the association constraint and IrRe the non-association constraint, their definitions give:
Transitivity:
Symmetry:
$(x_i, x_j) \in Re \Rightarrow (x_j, x_i) \in Re$
$(x_i, x_j) \in IrRe \Rightarrow (x_j, x_i) \in IrRe$
Step202. According to constraint propagation theory:
In this way, the constraint relationships between the labeled data are obtained.
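The closure of a few labeled pairs under symmetry and transitivity can be sketched in Python as follows. This is a minimal illustration, not the patent's procedure: the function name `propagate_constraints`, the union-find bookkeeping, and the rule that a non-association pair extends to every pair drawn from the two associated groups are all assumptions made for the example.

```python
from itertools import combinations


def propagate_constraints(n, must_link, cannot_link):
    """Close pairwise constraints under symmetry and transitivity.

    n           -- number of samples, indexed 0..n-1
    must_link   -- iterable of (i, j) association pairs (Re)
    cannot_link -- iterable of (i, j) non-association pairs (IrRe)
    Returns (re_pairs, irre_pairs) as sets of ordered index pairs.
    """
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Transitivity of Re: merge every association pair into one group.
    for i, j in must_link:
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    re_pairs = set()
    for members in groups.values():
        for a, b in combinations(members, 2):
            re_pairs.update({(a, b), (b, a)})  # symmetry

    # Assumed propagation rule: a non-association pair extends to every
    # pair drawn from the two associated groups it connects.
    irre_pairs = set()
    for i, j in cannot_link:
        ri, rj = find(i), find(j)
        if ri == rj:
            raise ValueError(f"contradictory constraints on {i} and {j}")
        for a in groups[ri]:
            for b in groups[rj]:
                irre_pairs.update({(a, b), (b, a)})
    return re_pairs, irre_pairs


re_pairs, irre_pairs = propagate_constraints(
    5, must_link=[(0, 1), (1, 2)], cannot_link=[(2, 3)]
)
```

In this toy call, samples 0, 1 and 2 end up mutually associated, sample 3 is non-associated with all three, and sample 4 stays unconstrained.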
Preferably, the semi-supervised constraint matrix construction process of Step3 includes:
Using the transitivity and symmetry of the pairwise constraints together with constraint propagation theory, pairwise constraint information for more data can be obtained, from which the constraint matrix $T = [t_{ij}]$ is derived;
where $t_{ij}$ represents the similarity of samples $x_i$ and $x_j$: when the relationship between the samples is an association constraint, as in formula (1), the similarity is 1; when it is a non-association constraint, the similarity is -1; and when it is unknown, the similarity is 0.
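As a small illustration of the 1 / -1 / 0 encoding, the following Python sketch fills a constraint matrix from two lists of index pairs. The helper name `build_constraint_matrix` and the choice to leave the diagonal at 0 are assumptions made for the example.

```python
def build_constraint_matrix(n, re_pairs, irre_pairs):
    """Return the n x n constraint matrix T as a list of lists.

    t[i][j] =  1 if (i, j) is an association constraint,
              -1 if (i, j) is a non-association constraint,
               0 if the relationship is unknown (diagonal included, by assumption).
    """
    t = [[0] * n for _ in range(n)]
    for i, j in re_pairs:
        t[i][j] = t[j][i] = 1      # symmetric by construction
    for i, j in irre_pairs:
        t[i][j] = t[j][i] = -1
    return t


T = build_constraint_matrix(4, re_pairs=[(0, 1)], irre_pairs=[(1, 2)])
```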
Preferably, obtaining the adjacency matrix W in Step4 includes:
Step401. Compute the sample similarities with a Gaussian kernel function to obtain the initial similarity matrix S, and normalize S with a softmax function to obtain S';
Step402. Add the semi-supervised pairwise constraint matrix to S' to obtain the adjacency matrix W, with the calculation formula:
where $w'_{ij}$ represents the similarity of samples $x_i$ and $x_j$ in the adjacency matrix W.
Preferably, the degree matrix D in Step5 is expressed as
$D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$, $d_i = \sum_{j=1}^{n} w'_{ij}$
where $d_i$ is the i-th diagonal element of D.
Preferably, the computation in Step6 of the eigenvectors $v_1, v_2, \ldots, v_k$ corresponding to the eigenvalues includes:
Step601. Compute the Laplacian matrix according to $L = D - W$ to obtain an $n \times n$ matrix;
Step602. Construct the normalized Laplacian matrix $L_s = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$,
where I is the identity matrix;
Step603. Compute the eigenvalues of $L_s$, arrange them in ascending order, take the first k eigenvalues, and compute the corresponding eigenvectors $v_1, v_2, \ldots, v_k$.
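Steps 601 to 603 can be sketched with NumPy as below. The function name `spectral_embedding` and the toy adjacency matrix (two disconnected triangles) are assumptions made for the example, and the sketch assumes W is symmetric with strictly positive row sums.

```python
import numpy as np


def spectral_embedding(W, k):
    """Return the k smallest eigenvalues of L_s and the n x k eigenvector matrix."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                  # degrees: row sums of W
    d_inv_sqrt = 1.0 / np.sqrt(d)      # assumes every degree is positive
    # L_s = I - D^{-1/2} W D^{-1/2}
    L_s = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(L_s)
    return eigvals[:k], eigvecs[:, :k]


# Toy graph: samples 0-2 form one triangle, samples 3-5 another.
W = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)
vals, F = spectral_embedding(W, k=2)
```

For this graph the two smallest eigenvalues are zero (one per connected component), and the rows of F coincide within each triangle, which is what makes the later clustering of F easy.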
Preferably, the process in Step8 of searching for the optimal cluster centers with the improved artificial bee colony algorithm includes:
Step801. Propose an improvement scheme for the artificial bee colony that improves the traditional artificial bee colony algorithm;
Step802. Update the fitness function of the artificial bee colony;
Step803. Propose a cluster center optimization method based on the improved artificial bee colony algorithm, and use it to search for the optimal cluster centers.
Preferably, the process of updating the fitness function of the artificial bee colony in Step802 includes:
(1) Let the data set be $X = \{x_1, x_2, \ldots, x_n\}$, to be divided into k classes, and suppose clustering divides the sample data into classes $C_k = \{C_1, C_2, \ldots, C_k\}$. The intra-class distance $d_{wit}(C_i)$ is the sum of squared distances between any two data points $x_m, x_n$ of class $C_i$:
The inter-class distance $d_{bet}(C_i, C_j)$ is the distance from class $C_i$ to another class $C_j$:
$D_{i,j} = |x_k - x_l|$ (4)
where $\omega_j = \omega_p \cup \omega_q$, $D_{i,j}$ is the distance between classes $C_i$ and $C_j$, $x_k \in C_i$, and $x_l \in C_j$; the inter-class distance obtained is the intermediate distance;
(2) The fitness function is:
Preferably, the process in Step803 of searching for the optimal cluster centers with the cluster center optimization method based on the improved artificial bee colony algorithm includes:
(1) If the labeled data contain data points of exactly k classes, randomly select one data point of each class as an initial cluster center;
(2) If the data points in the labeled sample determine fewer than k cluster centers, follow the idea of the KMeans++ algorithm and compute the point farthest from each class of data points as the initial cluster center of that class, repeating in turn until data points of k classes are found;
(3) Having found the k initial cluster centers $a_i$, set an effective threshold region centered on each of these points, and run the ABC algorithm search for the cluster center within the threshold range:
(4) Set the initial population SN, compute the fitness of each point, and sort in descending order;
(5) Select the points whose fitness is greater than the threshold and perform a neighborhood search on each of them, where the threshold is the fitness value of the top 50% of points after sorting in descending order;
(6) Find a new cluster center; if its fitness value is higher than that of the original cluster center, replace the original with the new position, otherwise keep the original;
(7) After several new candidate points have been found, compute the fitness value of each candidate again, select the globally optimal point among them by roulette-wheel selection, and take it as the new cluster center of the region;
(8) After multiple iterations, reset the threshold range according to the new candidate points in each region;
(9) Search for a new round of cluster centers, record the optimal solution until the termination condition is met, and output the position of the optimal solution.
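Items (3) to (9) describe a bee colony search confined to a region around each initial center. The Python sketch below shows one way such a region-restricted search could look for a single center. It is an illustration under stated assumptions, not the patent's algorithm: the function name `region_abc`, the spherical region of radius `radius`, the standard ABC neighbor move, the scout phase governed by `limit`, and the toy fitness function are all choices made for the example.

```python
import math
import random


def region_abc(fitness, center, radius, sn=10, max_iter=50, limit=5, seed=0):
    """Search for a higher-fitness position inside a ball around `center`."""
    rng = random.Random(seed)
    dim = len(center)

    def random_source():
        # Sample the bounding box and reject points outside the ball.
        while True:
            p = [c + rng.uniform(-radius, radius) for c in center]
            if math.dist(p, center) <= radius:
                return p

    def neighbour(i):
        # Standard ABC move: shift one coordinate relative to a random partner.
        partner = rng.choice([j for j in range(sn) if j != i])
        d = rng.randrange(dim)
        cand = list(sources[i])
        cand[d] += rng.uniform(-1, 1) * (sources[i][d] - sources[partner][d])
        return cand if math.dist(cand, center) <= radius else None

    def try_improve(i):
        cand = neighbour(i)
        f = fitness(cand) if cand is not None else None
        if f is not None and f > fits[i]:
            sources[i], fits[i], trials[i] = cand, f, 0   # greedy replacement
        else:
            trials[i] += 1

    sources = [list(center)] + [random_source() for _ in range(sn - 1)]
    fits = [fitness(s) for s in sources]
    trials = [0] * sn
    best, best_fit = list(center), fits[0]

    for _ in range(max_iter):
        # Neighborhood search on the better half of the population.
        order = sorted(range(sn), key=lambda i: fits[i], reverse=True)
        for i in order[: sn // 2]:
            try_improve(i)
        # Roulette-wheel selection; weights are shifted to stay positive.
        low = min(fits)
        weights = [f - low + 1e-12 for f in fits]
        for i in rng.choices(range(sn), weights=weights, k=sn):
            try_improve(i)
        # Record the best source before any source is abandoned.
        i_best = max(range(sn), key=lambda i: fits[i])
        if fits[i_best] > best_fit:
            best, best_fit = list(sources[i_best]), fits[i_best]
        # Scout phase: re-sample sources that stopped improving.
        for i in range(sn):
            if trials[i] > limit:
                sources[i] = random_source()
                fits[i], trials[i] = fitness(sources[i]), 0
    return best, best_fit


# Toy fitness (an assumption): closer to a hidden target is better.
target = (2.0, 1.0)
start = (1.5, 1.0)
fit = lambda p: -math.dist(p, target)
best, best_fit = region_abc(fit, start, radius=1.0)
```

Because the starting center is itself one of the food sources and the best position is only ever replaced by a better one, the returned position is never worse than the initial center and never leaves the region.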
The beneficial effects of the invention are as follows. Compared with the prior art, the artificial bee colony optimization clustering method based on semi-supervised constraints disclosed by the invention is improved in that:
The method takes labeled and unlabeled data as its basic data, finds the constraint relationships between the data, generates a constraint matrix by constraint propagation, and adds the constraint matrix to the similarity matrix to reconstruct it into the adjacency matrix; on this basis a constrained spectral clustering algorithm is proposed. The ABC algorithm is then introduced for optimization: restricting the ABC search to a defined threshold range reduces the search time, so that the cluster centers are found quickly and effectively and the subsequent clustering result is more accurate. Finally, the clustering effect of the algorithm is tested on UCI data sets. The experiments show that adding the constraint relationships to spectral clustering changes the adjacency matrix and thereby improves clustering accuracy, and that the improved ABC optimization scheme reduces the time complexity of the algorithm. Combining the two optimizes spectral clustering: compared with the original spectral clustering algorithm and with other algorithms, the proposed algorithm is more precise, clusters better, and is robust.
Drawings
Fig. 1 is an algorithm flow chart of the artificial bee colony optimization clustering method based on semi-supervision constraint.
FIG. 2 is a graph showing the effect of example 2 of the present invention after the pair-wise constraint is introduced.
FIG. 3 is a graph showing clustering time according to the different algorithms of embodiment 2 of the present invention.
FIG. 4 is a graph showing the comparison of the efficiency of different clustering algorithms according to example 2 of the present invention.
Wherein: in FIG. 2, FIG. 2(a) shows the change in the Iris adjacency matrix after the pairwise constraints are introduced, FIG. 2(b) shows the change in the Pima adjacency matrix after the pairwise constraints are introduced, and FIG. 2(c) shows the change in accuracy after the pairwise constraints are introduced;
in FIG. 3, FIG. 3(a) shows the variation in clustering time on the Iris data set, FIG. 3(b) on the Sym data set, FIG. 3(c) on the Pima data set, and FIG. 3(d) on the Segment data set.
Detailed Description
In order to enable those skilled in the art to better understand the technical solution of the present invention, the technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1: Referring to FIGS. 1-4, an artificial bee colony optimization clustering method based on semi-supervised constraints comprises the following steps:
Step1. The data to be processed are user rating data sets (covering fields such as film reviews and music ratings); during data processing, the data are collected by stratified sampling;
Step2. After the data set is acquired, extract part of the labeled data from it to generate pairwise constraints
This embodiment divides the label information attached to the data samples into two types: the association constraint (relevant constraint) and the non-association constraint (irrelevant constraint). Because pairwise constraints are mandatory constraints, they can describe whether two data samples belong to the same class: an association constraint between two samples indicates that the users are similar, and a non-association constraint indicates that the users are dissimilar. The pairwise constraints are defined as follows:
Definition 1. Relevant set $R = \{(x_i, x_j)\}$: if $(x_i, x_j) \in R$, then $x_i$ and $x_j$ must belong to the same class, i.e. $x_i$ and $x_j$ satisfy the association constraint;
Definition 2. Irrelevant set $I = \{(x_i, x_j)\}$: if $(x_i, x_j) \in I$, then $x_i$ and $x_j$ must not belong to the same class, i.e. $x_i$ and $x_j$ satisfy the non-association constraint;
Step201. Let the data set be $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i, x_j, x_k$ are sample data points. The pairwise constraint relationships of the two rules are then symmetric and transitive. With Re denoting the association constraint and IrRe the non-association constraint, their definitions give:
Transitivity:
Symmetry:
$(x_i, x_j) \in Re \Rightarrow (x_j, x_i) \in Re$
$(x_i, x_j) \in IrRe \Rightarrow (x_j, x_i) \in IrRe$
Step202. According to constraint propagation theory:
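The transitivity and propagation formulas are not reproduced in this text. Under the usual reading of association (must-link) and non-association (cannot-link) constraints, which is an assumption here and not a quotation of the patent, they would take the following form: associated pairs chain together, and a sample associated with one member of a non-associated pair inherits the non-association.

```latex
\begin{aligned}
(x_i, x_j) \in Re \ \wedge\ (x_j, x_k) \in Re &\ \Rightarrow\ (x_i, x_k) \in Re \\
(x_i, x_j) \in Re \ \wedge\ (x_j, x_k) \in IrRe &\ \Rightarrow\ (x_i, x_k) \in IrRe
\end{aligned}
```

For example, if users $x_1$ and $x_2$ are labeled similar and $x_2$ and $x_3$ are labeled dissimilar, the second rule marks $x_1$ and $x_3$ as dissimilar even though that pair was never labeled.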
Step3. Construct a semi-supervised constraint matrix from the pairwise constraint relationships
Using the transitivity and symmetry of the pairwise constraints together with constraint propagation theory, pairwise constraint information for more data can be obtained, from which the constraint matrix $T = [t_{ij}]$ is derived, where $t_{ij}$ represents the similarity of samples $x_i$ and $x_j$: when the relationship between the samples is an association constraint, as in formula (1), the similarity is 1; when it is a non-association constraint, the similarity is -1; and when it is unknown, the similarity is 0;
Step4. Compute an initial similarity matrix S with a Gaussian similarity function, normalize S with a softmax function to obtain S', add the semi-supervised pairwise constraint information to S', and use the labeled data to guide the generation of the adjacency matrix W
Step401. First compute the sample similarities with a Gaussian kernel function to obtain the initial similarity matrix S, and normalize S with a softmax function to obtain S';
Step402. Add the semi-supervised pairwise constraint matrix to S' to obtain the adjacency matrix W, with the calculation formula:
where $w'_{ij}$ represents the similarity of samples $x_i$ and $x_j$ in the adjacency matrix W. If two samples are under an association constraint, they are highly similar, and the weight between them should be increased accordingly so that they are sure to fall in the same cluster; if two samples are under a non-association constraint, their similarity is reduced by weighting so that they are not assigned to the same cluster in the subsequent classification; if two samples are under neither constraint, the similarity between them is left unadjusted for the time being;
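Since the calculation formula for W is not reproduced in this text, the following Python sketch only illustrates the behavior described above. It assumes a row-wise softmax followed by symmetrization, and it assumes the simplest possible adjustment rule: weight 1 for an association constraint, weight 0 for a non-association constraint, and the normalized similarity otherwise. The function names and the bandwidth `sigma` are likewise assumptions.

```python
import math


def gaussian_similarity(X, sigma=1.0):
    """S[i][j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    n = len(X)
    return [[math.exp(-math.dist(X[i], X[j]) ** 2 / (2 * sigma ** 2))
             for j in range(n)] for i in range(n)]


def softmax_rows(S):
    """Row-wise softmax (assumed normalization); each row sums to 1."""
    out = []
    for row in S:
        m = max(row)                              # subtract the max for stability
        e = [math.exp(v - m) for v in row]
        total = sum(e)
        out.append([v / total for v in e])
    return out


def constrained_adjacency(S_norm, T):
    """Overlay the constraint matrix T on the normalized similarities."""
    n = len(S_norm)
    # Symmetrize first, because a row-wise softmax is not symmetric.
    W = [[0.5 * (S_norm[i][j] + S_norm[j][i]) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        for j in range(n):
            if T[i][j] == 1:
                W[i][j] = 1.0      # association: raise the weight
            elif T[i][j] == -1:
                W[i][j] = 0.0      # non-association: suppress the weight
    return W


X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
T = [[0, 1, 0, 0],
     [1, 0, -1, 0],
     [0, -1, 0, 0],
     [0, 0, 0, 0]]
S_norm = softmax_rows(gaussian_similarity(X))
W = constrained_adjacency(S_norm, T)
```

Pairs without a constraint, such as samples 2 and 3 here, keep their symmetrized normalized similarity.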
Step5. Take the sum of the similarity weights in each row of the adjacency matrix W as the corresponding diagonal element to obtain the degree matrix D, where $d_i$ is the i-th diagonal element of D; the formula is
$D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$, $d_i = \sum_{j=1}^{n} w'_{ij}$
Step6. Compute the Laplacian matrix L according to $L = D - W$
Step601. Compute the Laplacian matrix according to $L = D - W$ to obtain an $n \times n$ matrix;
Step602. Construct the normalized Laplacian matrix $L_s = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$,
where I is the identity matrix;
Step603. Compute the eigenvalues of $L_s$, arrange them in ascending order, take the first k eigenvalues, and compute the corresponding eigenvectors $v_1, v_2, \ldots, v_k$;
Step7. Construct the matrix of the feature space by arranging the k eigenvectors into an $n \times k$ matrix $F = [f_1, f_2, \ldots, f_k]$, $F \in \mathbb{R}^{n \times k}$;
Step8. Cluster the matrix F, searching for the optimal cluster centers with an improved Artificial Bee Colony (ABC) algorithm;
Step9. Cluster with K-means starting from the cluster centers found, completing the processing of the sample data set and yielding sets of similar users.
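Step9 differs from ordinary K-means only in that the initial centers are supplied rather than chosen at random. A minimal Python sketch of that idea follows; the function name `kmeans_from_centers`, the iteration cap, and the toy points are assumptions made for the example.

```python
import math


def kmeans_from_centers(points, centers, max_iter=100):
    """Lloyd iterations started from the given centers."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:       # assignments are stable: converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:                # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
labels, centers = kmeans_from_centers(points, centers=[(0, 0), (9, 9)])
```

In the method described here the points would be the rows of F and the starting centers would come from the bee colony search of Step8.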
Preferably, the process in Step8 of searching for the optimal cluster centers with the improved Artificial Bee Colony (ABC) algorithm includes:
Step801. Propose an improvement scheme for the Artificial Bee Colony (ABC) that improves the traditional ABC algorithm
The traditional ABC algorithm may become trapped in a local optimum during the search, which degrades the overall performance of the algorithm. To alleviate this, an improvement scheme for the ABC algorithm is proposed: first, a fitness function built from the intra-class and inter-class distances is introduced, and the fitness function is used to compare the quality of candidate cluster centers during the search; second, the ABC algorithm is applied within several threshold ranges, which reduces both the likelihood of the search falling into a local optimum and the computational complexity;
Step802. Update the fitness function of the Artificial Bee Colony (ABC)
In the original ABC algorithm, onlooker bees use probability information to choose high-quality food sources to follow and exploit them locally to drive the evolution of the population. The probability reflects the quality of the food source, i.e. its fitness value: the larger the fitness value, the better the food source and the higher its probability of being selected. Selecting a food source corresponds to searching for a cluster center, so the choice of fitness function has an important influence on the result of that search. The quality of a food source depends mainly on its richness: a food source located at the center of the population has a higher fitness value. This embodiment therefore constructs the objective function from the intra-class distance and the inter-class distance, on the basis that when objects within the same cluster are most similar and objects in different clusters are most dissimilar, the food source at the current point is in a better position and is more likely to be selected as a cluster center;
Let the data set be $X = \{x_1, x_2, \ldots, x_n\}$, to be divided into k classes, and suppose clustering divides the sample data into classes $C_k = \{C_1, C_2, \ldots, C_k\}$. The intra-class distance $d_{wit}(C_i)$ is defined as the sum of squared distances between any two data points $x_m, x_n$ of class $C_i$:
The inter-class distance $d_{bet}(C_i, C_j)$ is defined as the distance from class $C_i$ to another class $C_j$, with the formula:
$D_{i,j} = |x_k - x_l|$ (4)
where $\omega_j = \omega_p \cup \omega_q$, $D_{i,j}$ is the distance between classes $C_i$ and $C_j$, $x_k \in C_i$, and $x_l \in C_j$. The inter-class distance obtained is the intermediate distance, which avoids the drawbacks of the farthest-distance and nearest-distance calculations and does not require traversing all cluster centers, improving computational efficiency;
This embodiment defines the fitness function as:
The improved fitness function measures the quality of a food source through the intra-class and inter-class relationships: when the intra-class distance is small and the inter-class distance is large, the fitness value of the point is larger and so is its probability of being selected as a cluster center (food source); when the intra-class distance is large and the inter-class distance is small, the fitness value of the point is small and its probability of being selected as a cluster center is low. Selecting points with large fitness as cluster centers gives a better clustering effect and higher algorithm accuracy;
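The exact fitness formula is not reproduced in this text, so the Python sketch below only mirrors the qualitative behavior described above. Its specific forms are assumptions: the intra-class distance is the sum of squared pairwise distances within a class, the inter-class distance is stood in for by the distance between class means, and the fitness is the ratio `inter / (1 + intra)`.

```python
import math
from itertools import combinations


def intra_class_distance(cluster):
    """Sum of squared distances over all pairs of points in one class."""
    return sum(math.dist(a, b) ** 2 for a, b in combinations(cluster, 2))


def centroid(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def inter_class_distance(ci, cj):
    # Stand-in for the patent's intermediate distance: distance between means.
    return math.dist(centroid(ci), centroid(cj))


def fitness(clusters):
    """Larger when classes are compact and far apart (assumed ratio form)."""
    intra = sum(intra_class_distance(c) for c in clusters)
    inter = sum(inter_class_distance(a, b) for a, b in combinations(clusters, 2))
    return inter / (1.0 + intra)


tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]   # compact, well separated
mixed = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]   # stretched, overlapping
fit_tight, fit_mixed = fitness(tight), fitness(mixed)
```

The compact, well-separated partition scores far higher than the stretched one, which is the ordering the text asks of any fitness function used here.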
Step803. Propose a cluster center optimization method based on the improved Artificial Bee Colony (ABC) algorithm, and use it to search for the optimal cluster centers
On the basis of the improved ABC algorithm, a region optimization search scheme is proposed. The data set is divided into k categories, and k center points are initialized from the pairwise constraint relationships: data points are extracted from the samples that carry pairwise constraints, and, using the inclusion and mutual-exclusion relationships of the association and non-association constraints, data points of different categories are taken as the initial cluster centers. The possible cases for the initial cluster centers, and how each is handled, are as follows:
(1) If the labeled data contain data points of exactly k classes, randomly select one data point of each class as an initial cluster center;
(2) If the data points in the labeled sample determine fewer than k cluster centers, follow the idea of the KMeans++ algorithm and compute the point farthest from each class of data points as the initial cluster center of that class, repeating in turn until data points of k classes are found;
(3) Having found the k initial cluster centers $a_i$, set an effective threshold region centered on each of these points, and run the ABC algorithm search for the cluster center within the threshold range:
(4) Set the initial population SN, compute the fitness of each point, and sort in descending order;
(5) Select the points whose fitness is greater than the threshold (where the threshold is the fitness value of the top 50% of points after sorting in descending order) and perform a neighborhood search on each of them;
(6) Find a new cluster center; if its fitness value is higher than that of the original cluster center, replace the original with the new position, otherwise keep the original;
(7) After several new candidate points have been found, compute the fitness value of each candidate again, select the globally optimal point among them by roulette-wheel selection, and take it as the new cluster center of the region;
(8) After multiple iterations, reset the threshold range according to the new candidate points in each region;
(9) Search for a new round of cluster centers, record the optimal solution until the termination condition is met, and output the position of the optimal solution;
In the improved artificial bee colony algorithm (SSABC), IrRe denotes the non-associated data, Re denotes the associated data, limit denotes the maximum number of searches, MAX denotes the maximum number of iterations, k denotes the number of clusters, and C(food) denotes the cluster center points finally obtained;
the specific algorithm is as follows:
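A minimal Python sketch of steps (3) to (9) for a single region is given here for illustration only. It is not the SSABC listing itself: the fitness function (the inverse of one plus the sum of squared distances), the radius shrink factor of 0.9, and the names `fitness`, `roulette_pick` and `search_center` are assumptions made for the example, and the `limit` counter is omitted.

```python
import math
import random


def fitness(center, points):
    """Assumed fitness: larger when the points lie closer to the candidate center."""
    sse = sum(math.dist(center, p) ** 2 for p in points)
    return 1.0 / (1.0 + sse)


def roulette_pick(candidates, scores, rng):
    """Pick one candidate with probability proportional to its fitness."""
    return rng.choices(candidates, weights=scores, k=1)[0]


def search_center(init_center, points, radius, sn=20, max_iter=30, seed=0):
    """Search one threshold region for a better cluster center (hypothetical helper)."""
    rng = random.Random(seed)
    dim = len(init_center)
    best = tuple(init_center)
    best_fit = fitness(best, points)
    for _ in range(max_iter):
        # (4) sample SN points inside the region and sort by fitness, descending
        population = [
            tuple(c + rng.uniform(-radius, radius) for c in best) for _ in range(sn)
        ]
        population.sort(key=lambda c: fitness(c, points), reverse=True)
        # (5) keep the better half and run a neighborhood search around each point
        elite = population[: max(1, sn // 2)]
        candidates = []
        for src in elite:
            partner = rng.choice(elite)
            j = rng.randrange(dim)
            trial = list(src)
            trial[j] = src[j] + rng.uniform(-1.0, 1.0) * (src[j] - partner[j])
            trial = tuple(trial)
            # (6) greedy replacement: keep the new position only if it is fitter
            better = fitness(trial, points) > fitness(src, points)
            candidates.append(trial if better else src)
        # (7) roulette-wheel selection among the candidates
        scores = [fitness(c, points) for c in candidates]
        chosen = roulette_pick(candidates, scores, rng)
        # (9) record the best solution seen so far
        chosen_fit = fitness(chosen, points)
        if chosen_fit > best_fit:
            best, best_fit = chosen, chosen_fit
        # (8) reset the threshold range (assumed here: shrink it by 10%)
        radius *= 0.9
    return best, best_fit
```

Calling `search_center((0.0, 0.0), points, radius=1.0)` returns a center and its fitness; because the best solution is only replaced by a fitter one, the returned fitness is never lower than that of the initial center.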
Example 2: In contrast to Example 1 above, this example verifies the effectiveness and superiority of the method described in Example 1 through the following experiments:
Step one: experimental data and environment
The experimental environment of the algorithm in this embodiment is a Windows 10 operating system with an Intel Core i7 processor and 4 GB of memory; the experiments are run in Python on the VS Code platform. To verify the clustering effect of the hybrid clustering algorithm proposed in Example 1, which combines semi-supervised spectral clustering with the improved artificial bee colony algorithm, the accuracy of the clustering algorithm is evaluated on UCI data sets; four data sets are selected from the UCI database as experimental data sets, as shown in Table 1;
Table 1: UCI data set information
When evaluating the performance of the semi-supervised clustering algorithm, the clustering accuracy ACC and the adjusted Rand index ARI are adopted as the clustering evaluation indices;
The calculation formula of ACC is as follows:
$ACC = \frac{TP}{TP + FP}$
where TP is the number of data points whose real class label is the same as the class label obtained from the model, and FP is the number of data points whose real class label differs from the class label obtained from the model; ACC reflects how well the classification computed by the model matches the real classification, and the larger the value, the more accurate the result;
The formula for the adjusted Rand index is as follows:
$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$
where RI denotes the Rand index, E[RI] denotes the expected value of RI, and max(RI) denotes the maximum value of RI; ARI ∈ [-1, 1], and a larger value means that the clustering result is more consistent with the real situation;
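Both indices can be computed with a few lines of Python, as in the sketch below. It assumes that the cluster labels have already been mapped onto the real class labels before ACC is computed, that there are at least two samples, and it uses the standard pair-counting form of the adjusted Rand index; the function names are hypothetical.

```python
from collections import Counter
from math import comb


def acc(true_labels, pred_labels):
    """ACC = TP / (TP + FP), with labels assumed to be already aligned."""
    tp = sum(t == p for t, p in zip(true_labels, pred_labels))
    fp = len(true_labels) - tp
    return tp / (tp + fp)


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index from pair counts of the contingency table."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, pred_labels))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # plays the role of E[RI]
    max_index = (sum_rows + sum_cols) / 2        # plays the role of max(RI)
    if max_index == expected:
        # Degenerate case, e.g. both partitions put every sample in one cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

ARI does not depend on how the clusters are numbered, so it needs no label alignment, whereas ACC as written here does.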
Step two: experimental analysis
Experiment one: verifying the effect of introducing the pairwise constraint matrix
To verify the influence of the pairwise constraints on the adjacency matrix, experiments are carried out on the Iris and Pima data sets. The proportion of introduced constraint pairs is increased gradually from 0 to 50%, and the change in the number of non-zero entries of the adjacency matrix is examined, together with the change in clustering accuracy of the clustering algorithm with constraint pairs added and of the traditional spectral clustering algorithm; the experimental results are shown in FIG. 2;
It can be seen that when no pairwise constraints are introduced, the adjacency matrix has few non-zero entries, i.e. the relevance among users is low; after the pairwise constraint information is added, the number of non-zero entries of the adjacency matrix increases and the relevance among users is strengthened. As shown in FIG. 2(c), when the constraint pair proportion is 0, i.e. with the traditional spectral clustering algorithm, the ACC value is 0.8511 on the Iris data set and 0.6612 on the Pima data set. After pairwise constraints are introduced, the accuracy of the clustering result improves as the proportion of added constraint pairs increases; when the proportion is 50%, the ACC value is 0.9014 on the Iris data set and 0.7421 on the Pima data set. Because the pairwise constraint relationship improves the similarity matrix, adding semi-supervised constraints increases the non-zero entries of the spectral clustering adjacency matrix, strengthens the correlation among users, and enriches the information in the adjacency matrix, which improves the accuracy of spectral clustering.
Experiment II: clustering effect verification for optimizing ABC algorithm
Four groups of control experiments are designed: the first group uses the traditional spectral clustering algorithm; the second group uses the spectral clustering region optimization algorithm after data processing; the third group builds on the second and uses the ABC algorithm to search for cluster centers; the fourth group builds on the second and searches for cluster centers with the ABC algorithm using the improved fitness function. The running times are compared to verify whether the improved algorithm reduces the time complexity of the spectral clustering algorithm; the experimental results are shown in FIG. 3;
As shown in FIG. 3, on the same data set the traditional spectral clustering algorithm takes the longest time and the spectral clustering algorithm optimized by the improved ABC takes the shortest. Taking the Sym data set as an example, traditional spectral clustering takes 3.7563 s, threshold-optimized spectral clustering takes 3.3459 s, spectral clustering with the ABC algorithm added takes 2.1554 s, and spectral clustering with the improved ABC added takes 2.0125 s. On the Iris and Segment data sets the time consumption of the ABC algorithm with the improved fitness function is close to that of the original ABC algorithm, which is attributed to the data sets themselves, whereas on the Sym and Pima data sets the ABC algorithm with the improved fitness function performs better. The reason is that the ABC algorithm with the improved fitness function is combined with the region optimization search, which improves search efficiency; using the ABC algorithm to search for cluster center points makes the data converge faster, so the cluster center points are found in a short time and the time complexity of clustering is reduced. Across all data set experiments, the ABC algorithm with the improved fitness function took less time than the other algorithms, which indicates that introducing it effectively reduces the time complexity of spectral clustering.
Experiment III: algorithm comprehensive contrast verification
To verify the advantage over the traditional spectral clustering algorithm of combining semi-supervised pairwise constraints with the ABC algorithm, the algorithm proposed in Example 1 (SS-ACSC) is compared with the traditional spectral clustering algorithm (SC), the spectral clustering algorithm that only adds semi-supervised pairwise constraints to improve the adjacency matrix (SS-SC), and the spectral clustering algorithm that only adds the ABC search for cluster centers with the improved fitness function (ACSC). Clustering is performed on the different UCI data sets, the relative quality of the algorithms is reflected in the adjusted Rand index, and the overall comparison is shown in FIG. 4;
As can be seen from FIG. 4, spectral clustering used alone gives the lowest ARI value on every data set. Taking the Segment data set as an example, the ARI value is 0.45 with the SC algorithm, 0.49 with the SS-SC algorithm, 0.51 with the ACSC algorithm, and 0.55 with the SS-ACSC algorithm. The ARI value increases step by step, which shows that the improved algorithm proposed in Example 1 clearly outperforms the traditional spectral clustering algorithm. Because the spectral clustering algorithm with pairwise constraints alone improves the adjacency matrix, its ARI values on all four data sets are higher than those of the SC algorithm, which shows that it improves the effectiveness of the algorithm. The spectral clustering algorithm that introduces only the ABC search for cluster centers mainly reduces the time complexity of the algorithm, and its improvement in effectiveness is less pronounced, so the ACSC and SC algorithms differ little on the Iris and Sym data sets, although some improvement remains. According to the experimental results, the improved spectral clustering algorithm proposed here, which combines pairwise constraints with artificial bee colony optimization, is superior to the comparison algorithms and effectively improves the clustering effect.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.
Claims (9)
1. A semi-supervised constraint-based artificial bee colony optimization clustering method, characterized by comprising the following steps:
Step1. In the data processing process, collecting data by stratified sampling;
Step2. After the data set is acquired, extracting part of the labeled data from the data set to generate pairwise constraints;
Step3. Constructing a semi-supervised constraint matrix according to the pairwise constraint relationship;
Step4. Calculating an initial similarity matrix S by using a Gaussian similarity function, normalizing the similarity matrix S with a softmax function to obtain S', adding the semi-supervised pairwise constraint information to the similarity matrix S', and using the labeled data to guide the generation of an adjacency matrix W;
Step5. Taking the sum of the similarity weights in each row of the adjacency matrix W as the diagonal element to obtain a degree matrix D;
Step6. Calculating a Laplacian matrix L according to the formula L = D - W, and calculating the eigenvectors $v_1, v_2, \ldots, v_k$ corresponding to the computed eigenvalues;
Step7. Constructing the vector matrix of the feature space by arranging the k eigenvectors into an n × k matrix $F = [f_1, f_2, \ldots, f_k]$, $F \in \mathbb{R}^{n \times k}$;
Step8. Clustering the matrix F, and searching for the optimal cluster center points by using an improved artificial bee colony algorithm;
Step9. Clustering with K-means according to the cluster center points found, completing the processing of the sample data set.
2. The artificial bee colony optimization clustering method based on semi-supervised constraint of claim 1, wherein the method comprises the following steps: the pair-wise constraint generation process described in Step2 includes
Step201. Assume a data set $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i, x_j, x_k$ are sample data points; the pairwise constraint relationships of the association constraint and the non-association constraint are symmetric and transitive, Re denotes the association constraint and IrRe denotes the non-association constraint; according to the definitions of the association constraint and the non-association constraint, the pairwise constraint relationship comprises:
Transitivity:
Symmetry:
$(x_i, x_j) \in Re \;\&\; (x_j, x_i) \in Re$
$(x_i, x_j) \in IrRe \;\&\; (x_j, x_i) \in IrRe$
Step202. According to the constraint propagation theory:
in this way, constraint relationships between tag data are obtained.
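The propagation of pairwise constraints can be illustrated with a short Python sketch. The rules assumed here are the commonly used ones, not formulas taken from this text: association (Re) pairs are closed under transitivity, and a non-association (IrRe) pair between two samples extends to every pair drawn from their respective associated groups. Samples are referred to by index, the input constraints are assumed to be consistent, and `propagate_constraints` is a hypothetical name.

```python
from itertools import combinations


def propagate_constraints(n, must_link, cannot_link):
    """Return the closed sets of Re and IrRe index pairs (i < j) for n samples."""
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Transitive closure of the association constraints
    for i, j in must_link:
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    re_pairs = set()
    for members in groups.values():
        re_pairs.update(combinations(members, 2))

    # A non-association constraint holds between whole associated groups
    irre_pairs = set()
    for i, j in cannot_link:
        for a in groups[find(i)]:
            for b in groups[find(j)]:
                if a != b:
                    irre_pairs.add((min(a, b), max(a, b)))
    return re_pairs, irre_pairs
```

For example, with five samples, Re pairs (0, 1) and (1, 2), and one IrRe pair (2, 3), the sketch derives the extra Re pair (0, 2) and the extra IrRe pairs (0, 3) and (1, 3); sample 4 stays unconstrained.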
3. The artificial bee colony optimization clustering method based on semi-supervised constraint of claim 1, wherein the method comprises the following steps: the semi-supervised constraint matrix construction process of Step3 includes
According to the transitivity and symmetry of the pairwise constraints and the constraint propagation theory, pairwise constraint information can be obtained for more data, from which the constraint matrix $T = [t_{ij}]$ is derived:
$t_{ij} = \begin{cases} 1, & (x_i, x_j) \in Re \\ -1, & (x_i, x_j) \in IrRe \\ 0, & \text{otherwise} \end{cases} \quad (1)$
where $t_{ij}$ denotes the similarity between sample $x_i$ and sample $x_j$; as shown in formula (1), when the relationship between the samples is an association constraint the similarity is 1, when it is a non-association constraint the similarity is -1, and when the relationship between the samples is undetermined the similarity is 0.
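As an illustration, the constraint matrix can be filled in from lists of constrained index pairs as in the Python sketch below. The function name `build_constraint_matrix` and the choice to leave the diagonal at 0 are assumptions of the example.

```python
def build_constraint_matrix(n, re_pairs, irre_pairs):
    """Build the n x n matrix T with 1 for Re, -1 for IrRe and 0 otherwise."""
    t = [[0] * n for _ in range(n)]
    for i, j in re_pairs:
        t[i][j] = t[j][i] = 1   # association constraint, stored symmetrically
    for i, j in irre_pairs:
        t[i][j] = t[j][i] = -1  # non-association constraint, stored symmetrically
    return t
```

With three samples, one Re pair (0, 1) and one IrRe pair (1, 2), the result is `[[0, 1, 0], [1, 0, -1], [0, -1, 0]]`.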
4. The artificial bee colony optimization clustering method based on semi-supervised constraint of claim 1, wherein the method comprises the following steps: the process of obtaining the adjacency matrix W in Step4 includes
Step401. Calculating the sample similarity with a Gaussian kernel function to obtain an initial similarity matrix S, and normalizing the similarity matrix S with a softmax function to obtain S';
Step402. Adding the obtained semi-supervised pairwise constraint matrix to S' to obtain the adjacency matrix W, where the calculation formula is as follows:
where $w'_{ij}$ denotes the similarity between sample $x_i$ and sample $x_j$ in the adjacency matrix W.
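Step401 can be sketched in Python as follows. The kernel width `sigma`, the row-wise application of softmax, and the function names are assumptions of the example; the combination of S' with the constraint matrix in Step402 is not sketched because its formula is not reproduced in this text.

```python
import math


def gaussian_similarity(points, sigma=1.0):
    """S[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    n = len(points)
    return [
        [
            math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
            for j in range(n)
        ]
        for i in range(n)
    ]


def softmax_rows(s):
    """Normalize each row with softmax so that every row sums to 1 (assumed row-wise)."""
    out = []
    for row in s:
        m = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out
```

Note that a row-wise softmax generally makes S' asymmetric, so an implementation that needs a symmetric adjacency matrix would have to symmetrize it afterwards.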
5. The artificial bee colony optimization clustering method based on semi-supervised constraint of claim 1, wherein the method comprises the following steps: the degree matrix D in Step5 is expressed as
$D = \mathrm{diag}(d_1, d_2, \ldots, d_n), \quad d_i = \sum_{j=1}^{n} w'_{ij}$
where $d_i$ is the i-th diagonal element of D.
6. The artificial bee colony optimization clustering method based on semi-supervised constraint of claim 1, wherein the method comprises the following steps: the process in Step6 of calculating the eigenvectors $v_1, v_2, \ldots, v_k$ corresponding to the eigenvalues includes
Step601. Calculating the Laplacian matrix according to the formula L = D - W to obtain an n × n matrix;
Step602. Constructing the normalized Laplacian matrix $L_s = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$,
where I is the identity matrix;
Step603. Calculating the eigenvalues of $L_s$, arranging them in ascending order, taking the first k eigenvalues, and calculating the corresponding eigenvectors $v_1, v_2, \ldots, v_k$.
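Step601 and Step602 can be illustrated in plain Python as below. The sketch assumes every row of W has a positive sum (a row with a non-positive sum is given a scaling factor of 0, which is a choice of the example), and the function names are hypothetical. The eigen-decomposition of Step603 is left to a numerical linear algebra library.

```python
import math


def laplacian(w):
    """Unnormalized Laplacian L = D - W, where D holds the row sums of W."""
    n = len(w)
    d = [sum(row) for row in w]
    return [[(d[i] if i == j else 0.0) - w[i][j] for j in range(n)] for i in range(n)]


def normalized_laplacian(w):
    """Normalized Laplacian L_s = I - D^{-1/2} W D^{-1/2}."""
    n = len(w)
    d = [sum(row) for row in w]
    inv_sqrt = [1.0 / math.sqrt(di) if di > 0 else 0.0 for di in d]
    return [
        [(1.0 if i == j else 0.0) - inv_sqrt[i] * w[i][j] * inv_sqrt[j] for j in range(n)]
        for i in range(n)
    ]
```

For the two-node graph `[[0.0, 1.0], [1.0, 0.0]]` both functions return `[[1.0, -1.0], [-1.0, 1.0]]`, since each node has degree 1.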
7. The artificial bee colony optimization clustering method based on semi-supervised constraint of claim 1, wherein the method comprises the following steps: the process of searching for the optimal clustering center point by using the improved artificial bee colony algorithm in the Step8 comprises
Step801. Providing an improvement scheme for the artificial bee colony, improving the traditional artificial bee colony algorithm;
Step802. Updating the fitness function of the artificial bee colony;
Step803. Providing a cluster center optimization method based on the improved artificial bee colony algorithm, and searching for the optimal cluster center points with this method.
8. The artificial bee colony optimization clustering method based on semi-supervised constraint of claim 7, wherein the method comprises the following steps: the process of updating the fitness function of the artificial bee colony in Step802 comprises
(1) Let the data set be $X = \{x_1, x_2, \ldots, x_n\}$, to be divided into k classes; clustering divides the sample data into classes $C_k$, $C_k = \{C_1, C_2, \ldots, C_k\}$; the intra-class distance $d_{wit}(C_i)$ is the sum of the squared distances between any two data points $x_m, x_n$ in class $C_i$:
The inter-class distance $d_{bet}(C_i, C_j)$ is the distance from class $C_i$ to another class $C_j$:
$D_{i,j} = |x_k - x_l| \quad (4)$
where $\omega_j = \omega_p \cup \omega_q$, $D_{i,j}$ is the distance between classes $C_i$ and $C_j$, $x_k \in C_i$, $x_l \in C_j$; the inter-class distance obtained is the intermediate distance;
(2) The fitness function is:
9. the artificial bee colony optimization clustering method based on semi-supervised constraint of claim 7, wherein the method comprises the following steps: the process of searching for the optimal clustering center point by using the improved artificial bee colony algorithm-based clustering center optimization method in Step803 comprises
(1) If the label data exactly has k types of data points, directly and randomly selecting one of the data points of each type as an initial clustering center;
(2) If the number of the clustering centers which can be determined by the data points in the label sample is less than k, calculating the points farthest from each class of data points by using a KMeans++ algorithm idea as the initial clustering center points of the class, and sequentially calculating until the data points of k classes are found;
(3) After the k initial cluster center points $a_i$ are found, setting an effective threshold region centered on each point, and performing the ABC algorithm search for cluster center points within the threshold range:
(4) Setting an initial population SN, calculating the fitness of each point, and carrying out descending order sequencing;
(5) Selecting the points whose fitness is greater than the threshold fitness, and carrying out a neighborhood search on each of them;
wherein the threshold fitness is the fitness value of the first 50% of points after the descending sort;
(6) Finding a new clustering center, if the fitness value of the new clustering center is higher than that of the original clustering center, replacing the original clustering center with the new position, otherwise, reserving the original clustering center;
(7) After a plurality of new candidate points are found, calculating the fitness value of each candidate point again, selecting the global optimal point among them by roulette-wheel selection, and taking that point as the new clustering center of the region;
(8) After multiple iterations, resetting the threshold range according to the new candidate points in each region;
(9) And searching a new round of clustering center, recording the optimal solution until the termination condition is met, and outputting the position of the optimal solution.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310412843.1A CN116821715A (en) | 2023-04-18 | 2023-04-18 | Artificial bee colony optimization clustering method based on semi-supervision constraint |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116821715A true CN116821715A (en) | 2023-09-29 |
Family
ID=88124839
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310412843.1A Pending CN116821715A (en) | 2023-04-18 | 2023-04-18 | Artificial bee colony optimization clustering method based on semi-supervision constraint |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116821715A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||