Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a distributed clustering method based on a mixed factor analysis model in a sensor network. Clustering refers to the process of dividing data into several classes by some method; a class produced by clustering is a collection of data objects that are similar to the objects in the same cluster and distinct from the objects in other clusters. Since the class labels of the data are unknown during clustering, data clustering is an unsupervised learning process in the field of machine learning. Many data clustering methods exist, but most of them assume that all of the data are clustered at a single processing center, whereas in a sensor network distributed processing is critical; the method is therefore designed on the basis of a mixed factor analysis model to solve this problem. Its advantages mainly include: (1) the mixed factor analysis model can effectively process high-dimensional data; (2) by designing a cooperation mode among the nodes, a satisfactory clustering result can be obtained by transmitting only intermediate results; compared with transmitting the original data, this reduces the communication overhead, protects the private information in the data, and ensures data security in the network.
The technical scheme adopted by the invention for solving the technical problems is as follows: a distributed clustering method based on a mixed factor analysis model in a sensor network comprises the following steps:
Suppose there are M sensor nodes in the sensor network, and the m-th node collects $N_m$ data, denoted $Y_m = \{y_{m,n}\}_{n=1,\dots,N_m}$, where $y_{m,n}$ represents the n-th datum at node m, with dimension p. The distribution of $Y_m$ $(m = 1, \dots, M)$ is described with a mixed factor analysis model (MFA); note that the data of all nodes share the same MFA. The MFA is a mixture model with I components; each datum $y_{m,n}$ can be expressed as:
$$y_{m,n} = \mu_i + A_i u_{m,n} + e_{m,n,i} \quad \text{with probability } \pi_i \quad (i = 1, \dots, I),$$
where $\mu_i$ is the p-dimensional mean vector of the i-th mixture component; $u_{m,n}$ is the factor corresponding to $y_{m,n}$ in the low-dimensional space, with dimension $q$ ($q < p$), obeying the Gaussian distribution $N(u_{m,n} \mid 0, I_q)$; the value of q is selected according to the size of p in the specific problem, and is generally an arbitrary integer between p/6 and p/2; $A_i$ is the $(p \times q)$ factor loading matrix; the error variable $e_{m,n,i}$ obeys the Gaussian distribution $N(e_{m,n,i} \mid 0, D_i)$, where $D_i$ is a $(p \times p)$ diagonal matrix; and the probabilities $\pi_i$ satisfy $\sum_{i=1}^{I} \pi_i = 1$. The parameter set of the MFA is then $\Theta = \{\pi_i, A_i, \mu_i, D_i\}_{i=1,\dots,I}$. Note that the values of the parameters to be estimated are the same for all nodes.
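Under this model, a datum is generated by first choosing a component i with probability $\pi_i$ and then adding a low-dimensional factor contribution and noise, so each component's marginal is $N(\mu_i, A_i A_i^T + D_i)$. The following is a minimal generative sketch in Python with NumPy; the function name and array layout are illustrative assumptions, not part of the disclosure:

```python
import numpy as np

def sample_mfa(n, pi, mu, A, D, seed=None):
    """Draw n samples from a mixed factor analysis model (MFA).

    pi : (I,)      mixture weights, summing to 1
    mu : (I, p)    component mean vectors
    A  : (I, p, q) factor loading matrices
    D  : (I, p)    diagonals of the (p x p) error covariances
    """
    rng = np.random.default_rng(seed)
    I, p = mu.shape
    q = A.shape[2]
    z = rng.choice(I, size=n, p=pi)                  # component of each datum
    u = rng.standard_normal((n, q))                  # factors u ~ N(0, I_q)
    e = rng.standard_normal((n, p)) * np.sqrt(D[z])  # errors e ~ N(0, D_i)
    y = mu[z] + np.einsum('npq,nq->np', A[z], u) + e
    return y, z
```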
In addition, the data transmission range of each node is set as W; for the current node m, all nodes at a distance smaller than W are its neighbor nodes, and the neighbor node set of node m is denoted $R_m$. The relationship between nodes in a sensor network is shown in FIG. 1, where circles represent nodes; if two nodes are connected by an edge, the two nodes can communicate with each other to transmit information. The dotted box in FIG. 1 represents $R_m$ for a node m. In the present invention, the network topology is established before distributed clustering is implemented, and communication between any two nodes is guaranteed either directly or over multiple hops.
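As a sketch of how $R_m$ could be constructed at deployment time (the coordinate array and function name are assumptions for illustration, since the disclosure only states that the topology is fixed in advance):

```python
import numpy as np

def neighbor_sets(coords, W):
    """Build the neighbor set R_m of every node from node coordinates.

    coords : (M, 2) array of node positions (assumed known at deployment)
    W      : transmission range; nodes closer than W are neighbors
    """
    M = len(coords)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    return {m: [l for l in range(M) if l != m and dist[m, l] < W]
            for m in range(M)}
```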
After the sensor network topology and the MFA describing the data distribution are established, the distributed clustering process starts, as shown in FIG. 2; the specific steps are as follows:
Step 1: initialization. There are M sensor nodes in the sensor network, and the m-th node collects $N_m$ data, denoted $Y_m = \{y_{m,n}\}_{n=1,\dots,N_m}$, where $y_{m,n}$ represents the n-th datum at node m, with dimension p. The network topology is determined in advance, and the neighbor node set of node m is denoted $R_m$. The distribution of $Y_m$ $(m = 1, \dots, M)$ is described with the mixed factor analysis model (MFA), the data of all nodes sharing the same MFA. The parameter set of the MFA is $\{\pi_i, A_i, \mu_i, D_i\}_{i=1,\dots,I}$, where $\pi_i$ is the weight of the i-th mixture component; $A_i$ is the $(p \times q)$ factor loading matrix of the i-th mixture component, q being the dimension of the low-dimensional factor, an arbitrary integer between p/6 and p/2; $\mu_i$ is the p-dimensional mean vector of the i-th mixture component; and $D_i$ is the covariance matrix of the error of the i-th mixture component.
First, the number of mixture components I in the MFA is set, which is also the number of classes to be clustered, and the initial values of the parameters in the MFA are set according to I, p and q. At each node, the initial $\mu_i$ are randomly selected from the data collected by that node, and the elements of the remaining initial parameters are generated from a standard normal distribution $N(0, 1)$. In addition, the number $N_l$ of data collected by each node l is broadcast to its neighbor nodes. When a node m has received the data counts broadcast by all of its neighbor nodes $l$ $(l \in R_m)$, the node calculates the weights $c_{lm}$:
After initialization is completed, the iteration counter is set to iter = 1, and the iterative process begins;
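The disclosure defines $c_{lm}$ by an equation not reproduced here. Purely as an assumption, the sketch below uses a common choice in diffusion-based distributed estimation: each node's weight is its share of the data visible to node m. All function names are hypothetical:

```python
import numpy as np

def init_node(Y, I, q, seed=None):
    """Initialize MFA parameters at one node from its local data Y (N x p)."""
    rng = np.random.default_rng(seed)
    N, p = Y.shape
    pi = np.full(I, 1.0 / I)                 # equal initial weights (assumed)
    mu = Y[rng.choice(N, I, replace=False)]  # means drawn from local data
    A = rng.standard_normal((I, p, q))       # loading elements from N(0, 1)
    D = np.ones((I, p))                      # unit error variances (assumed,
    return pi, mu, A, D                      # to keep D_i positive definite)

def weights(counts, R_m, m):
    """Assumed form of c_lm: node l's share of the data visible to node m."""
    nodes = R_m + [m]
    total = sum(counts[l] for l in nodes)
    return {l: counts[l] / total for l in nodes}
```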
Step 2: local calculation. At each node l, based on the data $Y_l$ it has collected, the intermediate variables $g_i$, $\Omega_i$ and the posterior probabilities $\langle z_{l,n,i} \rangle$ $(n = 1, \dots, N_l;\ i = 1, \dots, I)$ are first calculated:
where $\hat{\Theta}^{old}$ denotes the parameter values obtained after the previous iteration was completed (the initial parameter values at the first iteration), and $\langle z_{l,n,i} \rangle$ represents the probability that the n-th datum $y_{l,n}$ at node l belongs to the i-th class (mixture component);
Next, the node calculates the local sufficient statistics $\mathrm{LSS}_l$ as follows:
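The exact expressions for $g_i$, $\Omega_i$ and $\mathrm{LSS}_l$ are given by equations not reproduced here. As a hedged illustration, the sketch below implements the standard E-step of a mixture of factor analyzers (responsibilities from the component marginals $N(\mu_i, A_i A_i^T + D_i)$ and posterior factor means), together with a minimal set of local statistics; this is an assumption about the disclosure's exact formulas:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(Y, pi, mu, A, D):
    """Standard MFA E-step (assumed form of the local calculation)."""
    N, p = Y.shape
    I = len(pi)
    q = A.shape[2]
    z = np.empty((N, I))                    # responsibilities <z_{l,n,i}>
    Eu = np.empty((I, N, q))                # posterior factor means E[u | y, i]
    for i in range(I):
        cov = A[i] @ A[i].T + np.diag(D[i])          # marginal covariance
        z[:, i] = pi[i] * multivariate_normal.pdf(Y, mu[i], cov)
        beta = A[i].T @ np.linalg.inv(cov)           # (q x p) regression matrix
        Eu[i] = (Y - mu[i]) @ beta.T
    z /= z.sum(axis=1, keepdims=True)                # normalize over components
    return z, Eu

def local_stats(Y, z):
    """A minimal LSS_l: zero- and first-order moments per component."""
    return z.sum(axis=0), z.T @ Y           # sum_n <z_{n,i}>, sum_n <z_{n,i}> y_n
```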
Step 3: broadcast diffusion. Each node l in the sensor network broadcasts the calculated local sufficient statistics $\mathrm{LSS}_l$ to its neighbor nodes;
Step 4: joint calculation. When node m $(m = 1, \dots, M)$ has received the $\mathrm{LSS}_l$ from all of its neighbor nodes $l$ $(l \in R_m)$, node m calculates the joint sufficient statistics $\mathrm{CSS}_m$:
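The combination rule for $\mathrm{CSS}_m$ is given by an equation not reproduced here; a plausible form, assumed for illustration only, is the $c_{lm}$-weighted sum of node m's own statistics and those of its neighbors:

```python
def joint_stats(lss, c, R_m, m):
    """Assumed combination: CSS_m = sum over l in (R_m and m) of c_lm * LSS_l.

    lss : dict mapping node id -> (s0, s1) local sufficient statistics
    c   : dict mapping node id -> weight c_lm at node m
    """
    nodes = R_m + [m]
    s0 = sum(c[l] * lss[l][0] for l in nodes)
    s1 = sum(c[l] * lss[l][1] for l in nodes)
    return s0, s1
```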
Step 5: parameter estimation. Node m $(m = 1, \dots, M)$ estimates $\Theta = \{\pi_i, A_i, \mu_i, D_i\}_{i=1,\dots,I}$ from the $\mathrm{CSS}_m$ calculated in the previous step, where the estimation of $\{\pi_i, \mu_i\}_{i=1,\dots,I}$ proceeds as follows:
For $\{A_i, D_i\}_{i=1,\dots,I}$, the process is as follows:
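The update equations themselves are not reproduced here. The sketch below shows the standard M-step of a mixture of factor analyzers, written against raw data for readability; in the distributed method, the per-datum sums would be replaced by the corresponding entries of $\mathrm{CSS}_m$. This is an assumption about the disclosure's exact updates:

```python
import numpy as np

def m_step(Y, z, Eu, A, D):
    """Standard MFA M-step (assumed form; sums over n would come from CSS_m)."""
    N, p = Y.shape
    I = z.shape[1]
    q = A.shape[2]
    pi_new = z.sum(axis=0) / N                          # mixture weights
    mu_new = (z.T @ Y) / z.sum(axis=0)[:, None]         # component means
    A_new = np.empty_like(A)
    D_new = np.empty_like(D)
    for i in range(I):
        Yc = Y - mu_new[i]                              # centered data
        cov = A[i] @ A[i].T + np.diag(D[i])
        beta = A[i].T @ np.linalg.inv(cov)
        Nz = z[:, i].sum()
        # sum_n <z> E[u u^T | y, i] = Nz (I_q - beta A) + sum_n <z> E[u] E[u]^T
        Euu = Nz * (np.eye(q) - beta @ A[i]) + (z[:, i, None] * Eu[i]).T @ Eu[i]
        S1 = (z[:, i, None] * Yc).T @ Eu[i]             # sum_n <z> y~ E[u]^T
        A_new[i] = S1 @ np.linalg.inv(Euu)              # loading update
        D_new[i] = ((z[:, i, None] * Yc * Yc).sum(axis=0)
                    - (A_new[i] * S1).sum(axis=1)) / Nz # diagonal error update
    return pi_new, mu_new, A_new, D_new
```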
Step 6: convergence judgment. Node m $(m = 1, \dots, M)$ computes the log-likelihood value at the current iteration:

$$\log p(Y_m \mid \Theta) = \sum_{n=1}^{N_m} \log \sum_{i=1}^{I} \pi_i\, N\!\left(y_{m,n} \mid \mu_i,\ A_i A_i^{T} + D_i\right);$$
If $\log p(Y_m \mid \Theta) - \log p(Y_m \mid \Theta^{old}) < \varepsilon$, the node has converged and iteration stops; otherwise, step 2 is executed and the next iteration begins (iter = iter + 1). Here $\Theta$ denotes the parameter values estimated in the current iteration and $\Theta^{old}$ the parameter values estimated in the previous iteration; that is, the algorithm converges when the log-likelihood values of two adjacent iterations differ by less than the threshold $\varepsilon$, which takes an arbitrary value between $10^{-6}$ and $10^{-5}$. Because each node in the network processes its data in parallel, all nodes cannot converge in the same iteration. When a node l has converged but a node m has not, node l no longer sends $\mathrm{LSS}_l$ and no longer receives the information transmitted by its neighbor nodes; node m then uses the last $\mathrm{LSS}_l$ received from node l to update its $\mathrm{CSS}_m$. The nodes that have not converged continue to iterate until all nodes in the network have converged;
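A direct computation of this convergence quantity, consistent with the component marginals used above (a sketch reusing the assumed array layout):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(Y, pi, mu, A, D):
    """Marginal MFA log-likelihood used in the convergence test."""
    dens = np.column_stack([
        pi[i] * multivariate_normal.pdf(Y, mu[i], A[i] @ A[i].T + np.diag(D[i]))
        for i in range(len(pi))])
    return float(np.log(dens.sum(axis=1)).sum())

def converged(ll, ll_old, eps=1e-5):
    """eps is chosen between 1e-6 and 1e-5, per the disclosure."""
    return ll - ll_old < eps
```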
Step 7: clustering output. After steps 1 to 6, node m $(m = 1, \dots, M)$ has obtained, for each of its data, the corresponding $\langle z_{m,n,i} \rangle$ $(n = 1, \dots, N_m;\ i = 1, \dots, I)$; the index corresponding to the maximum of $\langle z_{m,n,i} \rangle$ over $i = 1, \dots, I$ is taken as the class $C_{m,n}$ to which $y_{m,n}$ is finally assigned, namely:

$$C_{m,n} = \arg\max_{i=1,\dots,I} \langle z_{m,n,i} \rangle;$$
In this way, the clustering results $\{C_{m,n}\}_{n=1,\dots,N_m;\ m=1,\dots,M}$ of all data on all nodes are obtained.
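Tying the sketched helpers together, a single-node view of the loop (the broadcast and diffusion steps between nodes are abstracted away; all function names are the hypothetical ones introduced above):

```python
def cluster_node(Y, I, q, max_iter=200, eps=1e-5, seed=0):
    """Single-node sketch of steps 1-7; inter-node cooperation omitted."""
    pi, mu, A, D = init_node(Y, I, q, seed)      # step 1
    ll_old = float("-inf")
    for _ in range(max_iter):
        z, Eu = e_step(Y, pi, mu, A, D)          # step 2
        pi, mu, A, D = m_step(Y, z, Eu, A, D)    # step 5
        ll = log_likelihood(Y, pi, mu, A, D)     # step 6
        if converged(ll, ll_old, eps):
            break
        ll_old = ll
    z, _ = e_step(Y, pi, mu, A, D)
    return z.argmax(axis=1)                      # step 7: labels C_{m,n}
```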
The method models the data to be clustered at each node in the sensor network with a mixed factor analysis model. Each node calculates local sufficient statistics based on its own data and then broadcasts them to its neighbor nodes in a diffusion mode; after receiving all the local sufficient statistics from its neighbor nodes, a node obtains the joint sufficient statistics, estimates the parameters of the mixed factor analysis model from these statistics, and finally completes the clustering based on the estimated model. The mixed factor analysis model established by the invention can complete the dimensionality reduction of the data while clustering, and the distributed clustering mode avoids the network collapse caused by the failure of a central node in the traditional centralized processing mode.
Advantageous effects:
1. The mixed factor analyzer adopted in the invention can reduce the dimension of high-dimensional data, so that clustering is completed smoothly while the dimension is reduced, and better clustering performance is obtained.
2. The distributed clustering method based on the mixed factor analysis model enables each node in the sensor network to fully utilize the information contained in the data of other nodes, achieving clustering performance comparable to that of a centralized method.
3. In the distributed clustering method based on the mixed factor analysis model, the nodes exchange local sufficient statistics during cooperation instead of directly transmitting the original data; since the number and dimensionality of the local sufficient statistics are far smaller than those of the data, the method saves communication overhead on the one hand and, on the other hand, helps to fully protect the private information in the data, improving the security of a system adopting the method.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
In order to better illustrate the distributed clustering method based on the mixed factor analysis model in a sensor network, the method is applied to the clustering of wine composition data. In some countries, several detection stations are distributed in different areas for detecting the content of the ingredients in wine. The types of wine sent to a detection station vary, so it is desirable to cluster wines of similar classes. If a detection station can form a sensor network with the other detection stations, the wine data at the other stations can be fully utilized through mutual cooperation, thereby improving the clustering accuracy. Here, the wine data to be clustered are taken from the UCI machine learning repository; there are 178 data in 3 classes, and each datum has dimension 13, representing the content of each component in the wine. The sensor network has 8 nodes, the average number of neighbor nodes per node is 2, and the network is connected (a direct or indirect path exists between any two nodes). Thus, in this example, M = 8, p = 13, I = 3 and q = 3. The amount of data at each node is: $N_1 = 21$, $N_2 = 22$, $N_3 = 21$, $N_4 = 21$, $N_5 = 22$, $N_6 = 22$, $N_7 = 21$, $N_8 = 28$. The neighbor sets of the nodes are: $R_1 = \{3,5,6\}$, $R_2 = \{3,5\}$, $R_3 = \{1,4,2\}$, $R_4 = \{3\}$, $R_5 = \{1,2\}$, $R_6 = \{1,7\}$, $R_7 = \{6,8\}$, $R_8 = \{3,7\}$.
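This configuration can be reproduced as follows (a sketch; loading the UCI Wine data through scikit-learn and partitioning it randomly across nodes are assumptions, since the disclosure does not specify how the data were distributed):

```python
import numpy as np
from sklearn.datasets import load_wine   # assumed source of the UCI Wine data

M, p, I, q = 8, 13, 3, 3
counts = {1: 21, 2: 22, 3: 21, 4: 21, 5: 22, 6: 22, 7: 21, 8: 28}  # N_1..N_8
R = {1: [3, 5, 6], 2: [3, 5], 3: [1, 4, 2], 4: [3],                # sets R_m
     5: [1, 2], 6: [1, 7], 7: [6, 8], 8: [3, 7]}

wine = load_wine()
X, labels = wine.data, wine.target                    # (178, 13) data, 3 classes
perm = np.random.default_rng(0).permutation(178)      # assumed random partition
splits = np.split(perm, np.cumsum(list(counts.values()))[:-1])
Y = {m + 1: X[idx] for m, idx in enumerate(splits)}       # data at each node
truth = {m + 1: labels[idx] for m, idx in enumerate(splits)}  # true classes
```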
According to the flow of the inventive content (shown in fig. 2), distributed clustering is started:
(1) Initialization: the initial values of the parameters in the MFA are set. At each node, the initial $\mu_i$ are randomly selected from the data of the node, and the elements of the remaining initial parameters are generated from a standard normal distribution $N(0, 1)$. In addition, each node l $(l = 1, \dots, M)$ broadcasts the number $N_l$ of data it has collected to its neighbor nodes. When a node m has received the data counts broadcast by all of its neighbor nodes, the node calculates the weights $c_{lm}$:
The weights measure the importance, at node m, of the message transmitted by each neighbor node l $(l \in R_m)$. After initialization is completed, the iteration counter is set to iter = 1, and the iterative process begins.
(2) Local calculation: this step requires no information from the neighbor nodes. At each node l, based on the data $Y_l$ it has collected, $g_i$, $\Omega_i$ and $\langle z_{l,n,i} \rangle$ $(n = 1, \dots, N_l;\ i = 1, \dots, I)$ are first calculated:
where $\hat{\Theta}^{old}$ denotes the parameter values obtained after the previous iteration was completed (the initial parameter values at the first iteration), and $\langle z_{l,n,i} \rangle$ represents the probability that the n-th datum $y_{l,n}$ at node l belongs to the i-th class (mixture component).
Next, the node calculates the local sufficient statistics $\mathrm{LSS}_l$ as follows:
(3) Broadcast diffusion: each node l in the sensor network broadcasts the calculated local sufficient statistics $\mathrm{LSS}_l$ to its neighbor nodes, as shown in FIG. 1.
(4) Joint calculation: when node m $(m = 1, \dots, M)$ has received the $\mathrm{LSS}_l$ from all of its neighbor nodes $l$ $(l \in R_m)$, node m calculates the joint sufficient statistics $\mathrm{CSS}_m$:
(5) Parameter estimation: node m $(m = 1, \dots, M)$ estimates $\Theta = \{\pi_i, A_i, \mu_i, D_i\}_{i=1,\dots,I}$ from the $\mathrm{CSS}_m$ calculated in the previous step, where the estimation of $\{\pi_i, \mu_i\}_{i=1,\dots,I}$ proceeds as follows:
For $\{A_i, D_i\}_{i=1,\dots,I}$, the process is as follows:
(6) Convergence judgment: node m $(m = 1, \dots, M)$ computes the log-likelihood value $\log p(Y_m \mid \Theta)$ at the current iteration:
If $\log p(Y_m \mid \Theta) - \log p(Y_m \mid \Theta^{old}) < \varepsilon$, the node has converged and iteration stops; otherwise, step (2) is executed and the next iteration begins (iter = iter + 1). Here $\Theta$ denotes the parameter values estimated in the current iteration and $\Theta^{old}$ the parameter values estimated in the previous iteration; that is, the algorithm converges when the log-likelihood values of two adjacent iterations differ by less than the threshold $\varepsilon$, which takes an arbitrary value between $10^{-6}$ and $10^{-5}$. It is worth noting that, since each node in the network processes its data in parallel, the nodes cannot all converge in the same iteration. For example, when a node l has converged but a node m has not, node l no longer sends $\mathrm{LSS}_l$ and no longer receives the information transmitted by its neighbor nodes; node m then uses the last $\mathrm{LSS}_l$ received from node l to update its $\mathrm{CSS}_m$. The nodes that have not converged continue to iterate until all nodes in the network converge.
(7) Clustering output: after steps (1)-(6), node m $(m = 1, \dots, M)$ has obtained, for each of its data $\{y_{m,n}\}_{n=1,\dots,N_m}$, the corresponding $\langle z_{m,n,i} \rangle$ $(n = 1, \dots, N_m;\ i = 1, \dots, I)$; the index corresponding to the maximum of $\langle z_{m,n,i} \rangle$ over $i = 1, \dots, I$ is taken as the class $C_{m,n}$ to which $y_{m,n}$ is finally assigned, namely:

$$C_{m,n} = \arg\max_{i=1,\dots,I} \langle z_{m,n,i} \rangle.$$
In this way, the clustering results $\{C_{m,n}\}$ of all data on all nodes are obtained.
Performance evaluation:
The results obtained by the clustering method of the invention are compared with the correct class labels, so that the effectiveness and accuracy of the method can be evaluated and measured. The clustering results at each node are shown in FIG. 3, in which the abscissa indicates the 178 data (the non-vacant positions indicating the node to which each datum is assigned) and the ordinate indicates the class number (3 classes in total) to which each datum is assigned. In the figure, "o" indicates correctly clustered data and "x" indicates incorrectly clustered data. From FIG. 3, the clustering accuracies at the 8 nodes are: 100%, 100%, 95.2%, 95.5%, 100%, 95.5%, 100%, 92.9%. Only five data in total were clustered incorrectly, and the average accuracy over the entire network is 97.2%. Compared with the result (98%) obtained by the centralized method, the accuracy is basically the same. The disadvantages of centralized transmission are quite obvious: first, once the central node fails, the whole network crashes; second, each node directly transmits its original data to the central node, which increases the communication burden in the network and easily reveals the private information in the data. The method of the invention therefore overcomes these defects while obtaining good distributed clustering performance.
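A sketch of the accuracy computation (hypothetical helper; it assumes the predicted labels have already been matched to the ground-truth labeling, e.g. by majority vote per cluster, since cluster indices are arbitrary):

```python
import numpy as np

def accuracies(preds, truths):
    """Per-node and network-average clustering accuracy.

    preds, truths : dicts mapping node id -> label arrays of equal length
    """
    per_node = {m: float(np.mean(preds[m] == truths[m])) for m in preds}
    total = sum(len(t) for t in truths.values())
    correct = sum(int(np.sum(preds[m] == truths[m])) for m in preds)
    return per_node, correct / total
```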
The scope of the invention is not limited to the description of the embodiments.