Summary of the invention
The present invention aims to overcome the shortcomings of the prior art by proposing a distributed
clustering method for sensor networks based on the mixture of factor analyzers (MFA) model.
Clustering is the process of partitioning data into several classes by some method. Each class
produced by clustering is a set of data objects that are similar to the other objects in the same
cluster and dissimilar to the objects in other clusters. Because the class labels of the data are
unknown during clustering, clustering is an unsupervised learning process in the field of machine
learning. Many data clustering methods exist, but most of them assume that all data are clustered
at a single processing center, whereas in a sensor network distributed processing is critical.
The present method is designed to solve exactly this problem. Its main advantages are:
(1) the MFA model can handle high-dimensional data effectively; (2) through a cooperation scheme
designed between the nodes, a satisfactory clustering result is obtained by transmitting only
intermediate results, which, compared with transmitting the raw data, both reduces the communication
overhead and protects the private information in the data, ensuring data security in the network.
The technical solution adopted by the present invention to solve the technical problem is a
distributed clustering method based on the mixture of factor analyzers model in a sensor network,
the method comprising the following steps:
Suppose the sensor network has M sensor nodes, and the m-th node collects N_m data points, denoted
Y_m = {y_{m,n}}_{n=1,...,N_m}, where y_{m,n} is the n-th data point at node m and has dimension p.
The distribution of Y_m (m = 1, ..., M) is described by a mixture of factor analyzers (MFA); note
that the data of all nodes share the same MFA. The MFA is a mixture model with I components.
Each data point y_{m,n} can be expressed as:
y_{m,n} = μ_i + A_i u_{m,n} + e_{m,n,i}, with probability π_i (i = 1, ..., I),
where μ_i is the p-dimensional mean vector of the i-th mixture component; u_{m,n} is the factor in
the low-dimensional space corresponding to y_{m,n}, has dimension q (q ≪ p), and follows the
Gaussian distribution N(u_{m,n} | 0, I_q); the value of q is chosen according to the size of p in
the particular problem and is generally an integer between p/6 and p/2; A_i is the (p × q) factor
loading matrix; the error term e_{m,n,i} follows the Gaussian distribution N(e_{m,n,i} | 0, D_i),
where D_i is a (p × p) diagonal matrix; and the probabilities π_i satisfy π_1 + ... + π_I = 1.
The parameter set Θ of the MFA is therefore {π_i, A_i, μ_i, D_i}_{i=1,...,I}. Note that the values
of the parameters in the MFA parameter set to be estimated are identical for all nodes.
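The generative model above can be illustrated with a short sampling sketch. It is a minimal example under stated assumptions, not part of the claimed method: the function name sample_mfa, the array layouts and all parameter values are hypothetical, and D_i is stored as the vector of its diagonal entries.

```python
import numpy as np

def sample_mfa(n, pi, mu, A, D_diag, rng):
    """Draw n points from an MFA (hypothetical helper, illustration only).

    pi: (I,) mixing weights, mu: (I, p) means,
    A: (I, p, q) factor loading matrices, D_diag: (I, p) diagonals of D_i.
    """
    I, p, q = A.shape
    comp = rng.choice(I, size=n, p=pi)            # component i with probability pi_i
    u = rng.standard_normal((n, q))               # factors u ~ N(0, I_q)
    e = rng.standard_normal((n, p)) * np.sqrt(D_diag[comp])  # errors e ~ N(0, D_i)
    y = mu[comp] + np.einsum("npq,nq->np", A[comp], u) + e   # y = mu_i + A_i u + e
    return y, comp

rng = np.random.default_rng(0)
I, p, q = 3, 13, 3                                # sizes borrowed from the embodiment
pi = np.array([0.3, 0.3, 0.4])                    # illustrative values only
mu = rng.standard_normal((I, p))
A = rng.standard_normal((I, p, q))
D_diag = np.full((I, p), 0.1)
Y, labels = sample_mfa(21, pi, mu, A, D_diag, rng)
```

Each row of Y is one p-dimensional data point, and labels records which mixture component generated it; in clustering these labels are the unknowns to be recovered.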
In addition, the data transmission range of each node is set to W. For the current node m, every
node whose distance from it is less than W is one of its neighbor nodes, and the set of neighbor
nodes of node m is denoted R_m. Fig. 1 illustrates the relations between the nodes of a sensor
network: circles represent nodes, and an edge between two nodes indicates that the two nodes can
communicate and exchange information. The dashed box in Fig. 1 indicates the neighbor set R_m of
node m. In the present invention, the network topology is determined before the distributed
clustering is carried out, and it must be guaranteed that any two nodes can communicate with each
other either directly or over multiple hops.
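A minimal sketch of how the neighbor sets R_m and the connectivity requirement could be checked is given here. The node coordinates, the value of W and the helper names neighbor_sets and is_connected are assumptions made for illustration; the text does not prescribe how the topology is established.

```python
import numpy as np
from collections import deque

def neighbor_sets(positions, W):
    """R_m = all nodes l != m whose distance to node m is less than W."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    M = len(positions)
    return {m: {l for l in range(M) if l != m and dist[m, l] < W}
            for m in range(M)}

def is_connected(R):
    """Breadth-first search: True if every node is reachable from the first one."""
    nodes = list(R)
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        m = queue.popleft()
        for l in R[m]:
            if l not in seen:
                seen.add(l)
                queue.append(l)
    return len(seen) == len(nodes)

# Four hypothetical node positions in the plane (0-based node indices).
positions = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.0, 1.0]])
R = neighbor_sets(positions, W=1.5)
connected = is_connected(R)
```

Because the distance relation is symmetric, neighbor sets built this way are symmetric as well, so a single search from any node is enough to test the multi-hop reachability condition.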
Once the sensor network topology and the MFA describing the data distribution have been
established, the distributed clustering process begins. As shown in Fig. 2, its specific steps are:
Step 1: Initialization. The sensor network has M sensor nodes, and the m-th node collects N_m
data points, denoted Y_m = {y_{m,n}}_{n=1,...,N_m}, where y_{m,n} is the n-th data point at node m
and has dimension p. The network topology has been determined in advance, and the neighbor set of
node m is denoted R_m. The distribution of Y_m (m = 1, ..., M) is described by a mixture of factor
analyzers (MFA), and the data of all nodes share the same MFA. The parameter set of the MFA is
{π_i, A_i, μ_i, D_i}_{i=1,...,I}, where π_i is the weight of the i-th mixture component, A_i is
the (p × q) factor loading matrix of the i-th mixture component, q is the dimension of the
low-dimensional factor and is an integer between p/6 and p/2, μ_i is the p-dimensional mean vector
of the i-th mixture component, and D_i is the covariance matrix of the error of the i-th mixture
component;
First, the number of mixture components I in the MFA is set to the number of classes to be
clustered, and the initial value of each parameter of the MFA is set according to I, p and q.
At each node, the initial mean vectors μ_i are selected at random from the data collected at that
node, and each element of the initial A_i and D_i is generated from the standard normal
distribution N(0, 1). In addition, each node l broadcasts the number N_l of data points it has
collected to its neighbor nodes. When a node m has received the data counts broadcast by all of
its neighbor nodes l (l ∈ R_m), it computes the weights c_{lm} according to the following formula:
After initialization is complete, the iteration counter is set to iter = 1 and the iterative
process begins;
Step 2: Local computation. At each node l, based on the data Y_l it has collected, the
intermediate variables, including g_i, Ω_i and ⟨z_{l,n,i}⟩ (n = 1, ..., N_l; i = 1, ..., I), are
computed first:
Here Θ_old denotes the parameter values obtained when the previous iteration completed (in the
first iteration, the initial values of the parameters), and ⟨z_{l,n,i}⟩ denotes the probability
that the n-th data point y_{l,n} at node l belongs to the i-th class (mixture component);
Then node l computes its local sufficient statistics LSS_l, which comprise:
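The membership probabilities ⟨z_{l,n,i}⟩ used in this step can be sketched with the standard MFA result that a data point drawn from component i has the marginal distribution N(μ_i, A_i A_i^T + D_i). This is a hedged illustration and not a reproduction of the formulas of this step: the function name responsibilities and the array layouts are assumptions, and the other intermediate variables are not computed here.

```python
import numpy as np

def responsibilities(Y, pi, mu, A, D_diag):
    """Posterior probability that each row of Y belongs to each component.

    Uses the marginal N(mu_i, A_i A_i^T + D_i) of a standard MFA.
    Returns (z, loglik): z has shape (N, I), loglik is log p(Y | Theta).
    """
    N, p = Y.shape
    I = len(pi)
    log_w = np.empty((N, I))
    for i in range(I):
        cov = A[i] @ A[i].T + np.diag(D_diag[i])
        diff = Y - mu[i]
        _, logdet = np.linalg.slogdet(cov)
        # Squared Mahalanobis distance of every data point to component i.
        maha = np.einsum("np,np->n", diff, np.linalg.solve(cov, diff.T).T)
        log_w[:, i] = np.log(pi[i]) - 0.5 * (p * np.log(2 * np.pi) + logdet + maha)
    log_norm = np.logaddexp.reduce(log_w, axis=1)   # stable log-sum-exp over components
    z = np.exp(log_w - log_norm[:, None])
    return z, log_norm.sum()

# Two well-separated hypothetical components in p = 3 dimensions, q = 1.
Y = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]])
pi = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]])
A = np.full((2, 3, 1), 0.1)
D_diag = np.ones((2, 3))
z, loglik = responsibilities(Y, pi, mu, A, D_diag)
```

Working in the log domain avoids underflow when p is large, and the returned log-likelihood is the quantity a node would compare between iterations when testing convergence.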
Step 3: Broadcast diffusion. Each node l in the sensor network broadcasts the local sufficient
statistics LSS_l it has computed to its neighbor nodes;
Step 4: Combined computation. When node m (m = 1, ..., M) has received LSS_l from all of its
neighbor nodes l (l ∈ R_m), node m computes the combined sufficient statistics CSS_m:
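The combination step can be sketched as a weighted sum of the statistics received from the neighbors. This is only an assumption about the form of the rule: the weights c_{lm} are taken as given inputs, and the function name combine, the statistic names "count" and "sum_y", and all numbers are hypothetical.

```python
import numpy as np

def combine(lss_by_neighbor, weights):
    """Weighted sum over neighbors l of each named statistic in LSS_l.

    lss_by_neighbor: {l: {name: value}}, weights: {l: c_lm} (both supplied by the caller).
    """
    names = next(iter(lss_by_neighbor.values())).keys()
    return {
        name: sum(weights[l] * lss_by_neighbor[l][name] for l in lss_by_neighbor)
        for name in names
    }

# Hypothetical statistics received by one node from neighbors 3 and 5.
lss = {
    3: {"count": 2.0, "sum_y": np.array([1.0, 2.0])},
    5: {"count": 4.0, "sum_y": np.array([3.0, 0.0])},
}
c = {3: 0.25, 5: 0.75}          # illustrative weights, not derived from any formula
css = combine(lss, c)
```

Only these statistics cross the network; their size depends on I, p and q rather than on the number of data points a node holds.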
Step 5: Parameter estimation. Node m (m = 1, ..., M) estimates Θ = {π_i, A_i, μ_i, D_i}_{i=1,...,I}
from the CSS_m computed in the previous step, where {π_i, μ_i}_{i=1,...,I} are estimated as follows:
{A_i, D_i}_{i=1,...,I} are estimated as follows:
Step 6: Convergence test. Node m (m = 1, ..., M) computes the log-likelihood under the current
iteration:
If log p(Y_m | Θ) - log p(Y_m | Θ_old) < ε, the node has converged and stops iterating; otherwise
it returns to Step 2 and starts the next iteration (iter = iter + 1). Here Θ denotes the parameter
values estimated in the current iteration and Θ_old the parameter values estimated in the previous
iteration; that is, the algorithm has converged when the difference between the log-likelihoods of
two successive iterations is less than the threshold ε, where ε takes any value between 10^-5 and
10^-6. Because the nodes in the network process data in parallel, they do not all converge in the
same iteration. When node l has converged and node m has not, node l no longer sends LSS_l and no
longer receives the information sent by its neighbor nodes; node m then updates its CSS_m with the
LSS_l it last received from node l. The nodes that have not converged continue to iterate until
all nodes in the network have converged;
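The convergence rule and the handling of converged neighbors can be sketched as follows. The helper names has_converged and refresh_cache are hypothetical, and the string values merely stand in for real sufficient statistics.

```python
def has_converged(loglik, loglik_old, eps=1e-5):
    """Test from Step 6: the log-likelihood gain over the previous iteration is below eps."""
    return (loglik - loglik_old) < eps

def refresh_cache(cache, received):
    """Overwrite cached LSS_l with freshly received ones.

    Entries of neighbors that stayed silent (already converged) are kept,
    so the node keeps using the LSS_l it last received from them.
    """
    cache.update(received)
    return cache

# Node m holds the statistics of neighbors 3 and 5 from iteration 4.
cache = {3: "LSS_3@iter4", 5: "LSS_5@iter4"}
# In iteration 5 only neighbor 5 transmits; neighbor 3 has converged.
refresh_cache(cache, {5: "LSS_5@iter5"})

small_gain = has_converged(-100.0, -100.000001)   # gain of about 1e-6
large_gain = has_converged(-100.0, -101.0)        # gain of 1.0
```

Keeping the last received statistics lets a node that is still iterating compute CSS_m without waiting for neighbors that will never transmit again.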
Step 7: Cluster output. After Steps 1 to 6, node m (m = 1, ..., M) has obtained the ⟨z_{m,n,i}⟩
(n = 1, ..., N_m; i = 1, ..., I) corresponding to each of its data points {y_{m,n}}_{n=1,...,N_m}.
The index of the maximum of ⟨z_{m,n,i}⟩ over i = 1, ..., I is taken as the class C_{m,n} to which
y_{m,n} is finally assigned, i.e.:
C_{m,n} = argmax_{i=1,...,I} ⟨z_{m,n,i}⟩
In this way the clustering results {C_{m,n}} of all data on all nodes are obtained.
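The assignment rule of Step 7 amounts to a row-wise argmax over the membership probabilities. A minimal sketch, with a hypothetical probability matrix and 1-based class indices:

```python
import numpy as np

def assign_clusters(z):
    """Return for each data point the 1-based index of its most probable component."""
    return np.argmax(z, axis=1) + 1

# Hypothetical membership probabilities for two data points and I = 3 components.
z = np.array([[0.1, 0.7, 0.2],
              [0.6, 0.3, 0.1]])
C = assign_clusters(z)
```

Here the first data point is assigned to class 2 and the second to class 1.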
The present invention models the data to be clustered at each node of the sensor network with a
mixture of factor analyzers. Each node computes local sufficient statistics from its own data and
then broadcasts them to its neighbor nodes. When a node has received the local sufficient
statistics from all of its neighbor nodes, it obtains the combined sufficient statistics, estimates
from them the parameters of the mixture of factor analyzers, and finally completes the clustering
on the basis of the estimated model. The mixture of factor analyzers established by the present
invention reduces the dimensionality of the data while clustering, and the distributed clustering
scheme avoids the collapse of the whole network that failure of the central node causes in
traditional centralized processing. In addition, in the distributed clustering method of the
present invention, the nodes exchange sufficient statistics rather than data, which both greatly
reduces the communication overhead and better protects the private information in the data, so
that the security of a system using this method is greatly improved.
Advantageous effects:
1. The mixture of factor analyzers used in the present invention can reduce the dimensionality of
high-dimensional data, so that clustering is completed together with the dimensionality reduction
and better clustering performance is obtained.
2. The distributed clustering method based on the mixture of factor analyzers used in the present
invention allows each node in the sensor network to make full use of the information contained in
the data of the other nodes, and its clustering performance is better than that of the centralized
approach.
3. In the distributed clustering method based on the mixture of factor analyzers used in the
present invention, the nodes cooperate by exchanging local sufficient statistics rather than
transmitting the raw data directly. Because the quantity and dimension of the local sufficient
statistics are far smaller than those of the data, this on the one hand saves communication
overhead and on the other hand helps to protect the private information in the data, improving
the security of a system using this method.
Specific embodiments
The invention is described in further detail below with reference to the accompanying drawings.
To better illustrate the distributed clustering method based on the mixture of factor analyzers in
a sensor network according to the present invention, the method is applied to the clustering of
wine composition data. In some countries, testing stations are distributed over different regions
to measure the content of each constituent of wine. The wines sent to a testing station are of
different kinds, so wines of the same kind need to be clustered together. If a testing station can
form a sensor network with other testing stations, then by cooperating with them it can make full
use of the wine data held at the other stations and thereby improve the clustering accuracy. The
wine data to be clustered come from the UCI machine learning repository. There are 178 data points
in total, drawn from 3 classes. Each data point has dimension 13, representing the content of each
constituent of the wine. The sensor network has 8 nodes, the average number of neighbor nodes per
node is 2, and the network is connected (a direct or indirect path exists between any two nodes).
In this example, therefore, M = 8, p = 13, I = 3 and q = 3. The number of data points at each node
is: N_1 = 21, N_2 = 22, N_3 = 21, N_4 = 21, N_5 = 22, N_6 = 22, N_7 = 21, N_8 = 28. The neighbor
nodes of each node are: R_1 = {3, 5, 6}, R_2 = {3, 5}, R_3 = {1, 4, 2}, R_4 = {3}, R_5 = {1, 2},
R_6 = {1, 7}, R_7 = {6, 8}, R_8 = {3, 7}.
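The configuration listed above can be checked with a few lines of code. The sketch below only tallies the printed values; the variable names are hypothetical. As printed, R_8 contains node 3 while R_3 does not contain node 8, and whether that is a typographical slip in the list cannot be determined from the text.

```python
# Data counts N_m and neighbor sets R_m exactly as listed in the embodiment.
N = {1: 21, 2: 22, 3: 21, 4: 21, 5: 22, 6: 22, 7: 21, 8: 28}
R = {1: {3, 5, 6}, 2: {3, 5}, 3: {1, 4, 2}, 4: {3},
     5: {1, 2}, 6: {1, 7}, 7: {6, 8}, 8: {3, 7}}

total = sum(N.values())                                   # total number of data points
avg_neighbors = sum(len(r) for r in R.values()) / len(R)  # mean size of R_m

# Pairs (m, l) where l is listed as a neighbor of m but m is not listed for l.
one_way = sorted((m, l) for m in R for l in R[m] if m not in R[l])

print(total, avg_neighbors, one_way)   # prints: 178 2.125 [(8, 3)]
```

The total matches the 178 data points of the data set, and the mean neighbor count of 2.125 agrees with the stated average of about 2.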
Following the flow described in the summary of the invention (shown in Fig. 2), the distributed
clustering is started:
(1) Initialization: The initial values of the parameters of the MFA are set. At each node, the
initial mean vectors μ_i are selected at random from the data of that node, and each element of
the initial A_i and D_i is generated from the standard normal distribution N(0, 1). In addition,
each node l (l = 1, ..., M) broadcasts the number N_l of data points it has collected to its
neighbor nodes. When a node m has received the data counts broadcast by all of its neighbor nodes,
it computes the weights c_{lm}:
The weight measures the importance at node m of the information transmitted each time by each
neighbor node l (l ∈ R_m) of node m. After initialization is complete, the iteration counter is
set to iter = 1 and the iterative process begins.
(2) Local computation: This step does not require information from the neighbor nodes. At each
node l, based on the data Y_l it has collected, the intermediate variables, including g_i, Ω_i and
⟨z_{l,n,i}⟩ (n = 1, ..., N_l; i = 1, ..., I), are computed first:
Here Θ_old denotes the parameter values obtained when the previous iteration completed (in the
first iteration, the initial values of the parameters), and ⟨z_{l,n,i}⟩ denotes the probability
that the n-th data point y_{l,n} at node l belongs to the i-th class (mixture component).
Next, node l computes its local sufficient statistics LSS_l as follows:
(3) Broadcast diffusion: Each node l in the sensor network broadcasts the local sufficient
statistics LSS_l it has computed to its neighbor nodes, as shown in Fig. 1.
(4) Combined computation: When node m (m = 1, ..., M) has received LSS_l from all of its neighbor
nodes l (l ∈ R_m), node m computes the combined sufficient statistics CSS_m:
(5) Parameter estimation: Node m (m = 1, ..., M) estimates Θ = {π_i, A_i, μ_i, D_i}_{i=1,...,I}
from the CSS_m computed in the previous step, where {π_i, μ_i}_{i=1,...,I} are estimated as follows:
{A_i, D_i}_{i=1,...,I} are estimated as follows:
(6) Convergence test: Node m (m = 1, ..., M) computes the log-likelihood under the current
iteration:
If log p(Y_m | Θ) - log p(Y_m | Θ_old) < ε, the node has converged and stops iterating; otherwise
it returns to step (2) and starts the next iteration (iter = iter + 1). Here Θ denotes the
parameter values estimated in the current iteration and Θ_old the parameter values estimated in
the previous iteration; that is, the algorithm has converged when the difference between the
log-likelihoods of two successive iterations is less than the threshold ε, where ε takes any value
between 10^-5 and 10^-6. It is worth noting that, because the nodes in the network process data in
parallel, they cannot be expected to all converge in the same iteration. For example, when node l
has converged and node m has not, node l no longer sends LSS_l and no longer receives the
information sent by its neighbor nodes; node m then updates its CSS_m with the LSS_l it last
received from node l. The nodes that have not converged continue to iterate until all nodes in the
network have converged.
(7) Cluster output: After steps (1) to (6), node m (m = 1, ..., M) has obtained the ⟨z_{m,n,i}⟩
(n = 1, ..., N_m; i = 1, ..., I) corresponding to each of its data points {y_{m,n}}_{n=1,...,N_m}.
The index of the maximum of ⟨z_{m,n,i}⟩ over i = 1, ..., I is taken as the class C_{m,n} to which
y_{m,n} is finally assigned, i.e.:
C_{m,n} = argmax_{i=1,...,I} ⟨z_{m,n,i}⟩
In this way the clustering results {C_{m,n}} of all data on all nodes are obtained.
Performance evaluation:
The results {C_{m,n}} obtained with the clustering method of the present invention are compared
with the correct class memberships in order to evaluate the effectiveness and accuracy of the
method. The clustering results at each node are shown in Fig. 3. The abscissa of the figure
represents the 178 data points, and a non-empty position indicates that the data point was
allocated to that node; the ordinate represents the index of the class (3 classes in total) to
which the data point was assigned. In the figure, "o" marks a correctly clustered data point and
"x" marks an incorrectly clustered one. From Fig. 3, the clustering accuracies at the 8 nodes are:
100%, 100%, 95.2%, 95.5%, 100%, 95.5%, 100%, 92.9%. Only five data points in total are clustered
incorrectly, and the average accuracy over the whole network is 97.2%. Compared with the result
(98%) obtained by the method using centralized transmission, the accuracy is essentially the same.
The shortcomings of centralized transmission, however, are obvious: first, once the central node
fails, the whole network collapses; second, each node transmits its raw data directly to the
central node, which not only increases the communication burden on the network but also easily
leaks the private information in the data. The method of the present invention therefore
overcomes the above shortcomings and achieves good distributed clustering performance.
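One way such an accuracy figure could be computed is sketched below. Because the cluster indices produced by a mixture model are arbitrary, the sketch scores a clustering under the best one-to-one relabelling of the clusters; whether the evaluation above was carried out this way is an assumption, and the function name clustering_accuracy and the toy labels are hypothetical.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_ids, n_classes):
    """Fraction of correctly clustered points under the best relabelling of clusters."""
    classes = range(1, n_classes + 1)
    best = 0
    for perm in permutations(classes):
        mapping = dict(zip(classes, perm))     # cluster index -> class label
        hits = sum(mapping[c] == t for c, t in zip(cluster_ids, true_labels))
        best = max(best, hits)
    return best / len(true_labels)

# Toy example: six points, three classes, one point clustered incorrectly.
true = [1, 1, 2, 2, 3, 3]
pred = [2, 2, 3, 3, 1, 2]
acc = clustering_accuracy(true, pred, 3)

# Network-wide figure from the totals reported above: 5 errors out of 178 points.
overall = (178 - 5) / 178
```

With the reported totals, overall evaluates to about 0.972, in agreement with the stated 97.2%. Trying every permutation is practical only for a small number of classes such as the three used here.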
The scope of protection claimed by the present invention is not limited to the description of this embodiment.