Summary of the invention
The object of the invention is to overcome the shortcomings of the prior art by proposing a distributed clustering method for sensor networks based on a mixture of factor analyzers. Clustering is the process of partitioning data into multiple classes by some method. A class produced by clustering is a set of data objects that are similar to the other objects in the same cluster and dissimilar to the objects in other clusters. Because the class labels of the data are unknown during clustering, clustering is regarded in machine learning as an unsupervised learning process. Many data clustering methods exist, but most assume that all the data are clustered at a single processing center, whereas in a sensor network distributed processing is essential. The present method addresses this problem with a distributed clustering method based on a mixture of factor analyzers. Its main advantages are: (1) the mixture of factor analyzers handles high-dimensional data effectively; (2) through the cooperation scheme designed among the nodes, transmitting only intermediate results is enough to obtain a satisfactory clustering result, which, compared with transmitting the raw data, both reduces communication overhead and protects the private information in the data, helping to secure the data in the network.
The technical solution adopted by the present invention to solve its technical problem is a distributed clustering method for sensor networks based on a mixture of factor analyzers. The method comprises the following steps:
Suppose the sensor network contains M sensor nodes, and the m-th node collects $N_m$ data points, denoted $Y_m = \{y_{m,n}\}_{n=1,\dots,N_m}$, where $y_{m,n}$ is the n-th data point at node m and has dimension p. A mixture of factor analyzers (MFA) describes the distribution of $Y_m$ (m = 1, ..., M); note that the data of all nodes share the same MFA. The MFA is a mixture model with I components. Each data point $y_{m,n}$ can be expressed as:
$y_{m,n} = \mu_i + A_i u_{m,n} + e_{m,n,i}$ with probability $\pi_i$ (i = 1, ..., I),
where $\mu_i$ is the p-dimensional mean vector of the i-th mixture component; $u_{m,n}$ is the factor in the low-dimensional space corresponding to the data point $y_{m,n}$, has dimension q ($q \ll p$), and follows the Gaussian distribution $N(u_{m,n} \mid 0, I_q)$; the value of q is chosen according to the size of p in the specific problem, generally as any integer between p/6 and p/2; $A_i$ is the (p × q) factor loading matrix; the error variable $e_{m,n,i}$ follows the Gaussian distribution $N(e_{m,n,i} \mid 0, D_i)$, where $D_i$ is a (p × p) diagonal matrix; and the probabilities $\pi_i$ satisfy $\sum_{i=1}^{I} \pi_i = 1$. The parameter set $\Theta$ of the MFA is therefore $\{\pi_i, A_i, \mu_i, D_i\}_{i=1,\dots,I}$. Note that the parameters to be estimated in the MFA take the same values at all nodes.
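The generative model above can be made concrete with a minimal sketch that draws one data point from an MFA. The function name `sample_mfa`, the list-based parameter layout, and the toy parameter values are all illustrative assumptions and do not come from the specification; only the rule $y = \mu_i + A_i u + e$ with $u \sim N(0, I_q)$ and $e \sim N(0, D_i)$ is taken from the text.

```python
import random

def sample_mfa(pi, mu, A, D_diag, rng):
    """Draw one p-dimensional point y = mu_i + A_i u + e from an MFA.

    pi: list of I mixing weights; mu[i]: length-p mean vector;
    A[i]: p x q loading matrix (list of rows);
    D_diag[i]: length-p diagonal of D_i (noise variances).
    """
    # Choose component i with probability pi_i.
    i = rng.choices(range(len(pi)), weights=pi, k=1)[0]
    p, q = len(mu[i]), len(A[i][0])
    # Low-dimensional factor u ~ N(0, I_q).
    u = [rng.gauss(0.0, 1.0) for _ in range(q)]
    # Noise e ~ N(0, D_i), with D_i diagonal.
    e = [rng.gauss(0.0, D_diag[i][r] ** 0.5) for r in range(p)]
    y = [mu[i][r] + sum(A[i][r][c] * u[c] for c in range(q)) + e[r]
         for r in range(p)]
    return i, y

# Toy parameters (hypothetical): I = 2 components, p = 4, q = 2.
rng = random.Random(0)
pi = [0.3, 0.7]
mu = [[0.0] * 4, [5.0] * 4]
A = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]] * 2
D_diag = [[0.1] * 4, [0.2] * 4]
component, y = sample_mfa(pi, mu, A, D_diag, rng)
```

Because $D_i$ is diagonal, the noise coordinates are drawn independently, so all correlation between the p coordinates within a component comes from the q-dimensional factor.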
In addition, the data transmission range of each node is set to W. For a given node m, all nodes at a distance less than W from it are its neighbor nodes, and the set of neighbor nodes of node m is denoted $R_m$. Fig. 1 illustrates the relationships among the nodes of a sensor network: circles represent nodes, and an edge between two nodes indicates that the two nodes can communicate with each other and exchange information. The dashed box in Fig. 1 marks $R_m$ for node m. In the present invention, the network topology is determined before distributed clustering is carried out, and it must ensure that any two nodes can communicate either directly or over multiple hops.
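As an illustration of how the neighbor sets and the connectivity requirement could be checked, the following minimal sketch builds $R_m$ from node coordinates and a range W, then tests connectivity with a breadth-first search. The coordinates, the value of W, and the function names are hypothetical; the specification fixes only the rule "distance less than W" and the requirement that every pair of nodes be reachable.

```python
import math
from collections import deque

def neighbor_sets(positions, W):
    """Return {m: set of nodes l != m whose distance to m is less than W}."""
    R = {m: set() for m in positions}
    for m, pm in positions.items():
        for l, pl in positions.items():
            if l != m and math.dist(pm, pl) < W:
                R[m].add(l)
    return R

def is_connected(R):
    """Breadth-first search: True if every node is reachable from the first."""
    if not R:
        return True
    start = next(iter(R))
    seen = {start}
    queue = deque([start])
    while queue:
        m = queue.popleft()
        for l in R[m]:
            if l not in seen:
                seen.add(l)
                queue.append(l)
    return len(seen) == len(R)

# Hypothetical layout of four nodes in the plane.
positions = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (2.0, 0.0), 4: (2.0, 1.0)}
R = neighbor_sets(positions, W=1.5)
```

With these coordinates node 1 can only reach nodes 3 and 4 through node 2, which is the multi-hop case the text allows.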
Once the sensor network topology and the MFA describing the data distribution have been established, the distributed clustering process begins. As shown in Fig. 2, its specific steps are:
Step 1: Initialization. The sensor network contains M sensor nodes, and the m-th node collects $N_m$ data points, denoted $Y_m = \{y_{m,n}\}_{n=1,\dots,N_m}$, where $y_{m,n}$ is the n-th data point at node m and has dimension p; the network topology is determined in advance, and the set of neighbor nodes of node m is denoted $R_m$; a mixture of factor analyzers (MFA) describes the distribution of $Y_m$ (m = 1, ..., M), and the data of all nodes share the same MFA; the parameter set of the MFA is $\{\pi_i, A_i, \mu_i, D_i\}_{i=1,\dots,I}$, where $\pi_i$ is the weight of the i-th mixture component, $A_i$ is the (p × q) factor loading matrix of the i-th mixture component, q is the dimension of the low-dimensional factor and is taken as any integer between p/6 and p/2, $\mu_i$ is the p-dimensional mean vector of the i-th mixture component, and $D_i$ is the covariance matrix of the error of the i-th mixture component;
First, set the number of mixture components I in the MFA, which is also the number of classes to be clustered, and set the initial value of each MFA parameter according to I, p, and q. At each node, the initial value of $\mu_i$ is selected at random from the data collected by that node, and each element of the initial values of $A_i$ and $D_i$ is generated from the standard normal distribution N(0, 1). In addition, each node l broadcasts the number of data points it has collected, $N_l$, to its neighbor nodes. When a node m has received the data counts broadcast by all of its neighbor nodes l ($l \in R_m$), it computes the weights $c_{lm}$ according to the following formula:
After initialization is complete, the iteration counter is set to iter = 1 and the iterative process begins;
Step 2: Local computation. At each node l, based on the data $Y_l$ it has collected, first compute the intermediate variables $g_i$, $\Omega_i$, […] and $\langle z_{l,n,i} \rangle$ (n = 1, ..., $N_l$; i = 1, ..., I):
Here, $\Theta^{old}$ denotes the parameter values obtained after the previous iteration (in the first iteration, the initial parameter values), and $\langle z_{l,n,i} \rangle$ denotes the probability that the n-th data point $y_{l,n}$ at node l belongs to the i-th class (mixture component).
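The membership probability $\langle z_{l,n,i} \rangle$ can be illustrated with a short sketch. It assumes the usual mixture-model posterior, in which $\langle z_{l,n,i} \rangle$ is proportional to $\pi_i$ times the density of $y_{l,n}$ under component i; this is an assumption, since the formula itself is not reproduced in this text. The function name and the toy log-density values are hypothetical, and the per-component log densities are taken as given rather than computed from $\mu_i$, $A_i$, and $D_i$.

```python
import math

def responsibilities(log_pi, log_density):
    """Normalize log(pi_i) + log p_i(y) into probabilities (log-sum-exp).

    log_pi[i]: log of the mixing weight of component i.
    log_density[i]: log density of one data point under component i.
    """
    scores = [lp + ld for lp, ld in zip(log_pi, log_density)]
    top = max(scores)                       # subtract the max for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Toy values (hypothetical): I = 3 equally weighted components.
log_pi = [math.log(1.0 / 3.0)] * 3
log_density = [-1050.0, -1052.0, -1049.0]
z = responsibilities(log_pi, log_density)
```

Working in the log domain matters for high-dimensional data: exponentiating a log density near -1000 directly would underflow to zero for every component.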
Then, the node computes the local sufficient statistics $LSS_l$, comprising:
Step 3: Broadcast diffusion. Each node l in the sensor network broadcasts the local sufficient statistics $LSS_l$ it has computed to its neighbor nodes;
Step 4: Joint computation. When node m (m = 1, ..., M) has received $LSS_l$ from all of its neighbor nodes l ($l \in R_m$), node m computes the combined sufficient statistics $CSS_m$:
Step 5: Parameter estimation. Node m (m = 1, ..., M) estimates $\Theta = \{\pi_i, A_i, \mu_i, D_i\}_{i=1,\dots,I}$ from the $CSS_m$ computed in the previous step, where $\{\pi_i, \mu_i\}_{i=1,\dots,I}$ are estimated as follows:
$\{A_i, D_i\}_{i=1,\dots,I}$ are estimated as follows:
Step 6: Convergence check. Node m (m = 1, ..., M) computes the log-likelihood at the current iteration:
If $\log p(Y_m \mid \Theta) - \log p(Y_m \mid \Theta^{old}) < \varepsilon$, the node has converged and stops iterating; otherwise it returns to Step 2 and starts the next iteration (iter = iter + 1). Here $\Theta$ denotes the parameter values estimated in the current iteration and $\Theta^{old}$ those estimated in the previous iteration; that is, the algorithm has converged when the difference in log-likelihood between two consecutive iterations is less than the threshold $\varepsilon$, where $\varepsilon$ takes any value between $10^{-6}$ and $10^{-5}$. Because the nodes in the network process data in parallel, they do not all converge in the same iteration. When node l has converged and node m has not, node l no longer sends $LSS_l$ and no longer receives information sent by its neighbor nodes, and node m updates its $CSS_m$ using the last $LSS_l$ it received from node l. Nodes that have not converged continue iterating until all nodes in the network have converged;
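The stopping rule and the handling of neighbors that have already converged can be sketched as follows. The class `NeighborCache`, the dictionary representation of an LSS, and the numbers are illustrative assumptions; the sketch only mirrors the two behaviors described in Step 6, namely stopping when the log-likelihood gain falls below $\varepsilon$ and reusing the last LSS received from a neighbor that has stopped sending.

```python
def has_converged(loglik, loglik_old, eps=1e-5):
    """Step 6 test: stop when the log-likelihood gain falls below eps."""
    return loglik - loglik_old < eps

class NeighborCache:
    """Keeps the most recent LSS received from each neighbor.

    A converged neighbor stops sending, so its last message keeps being
    reused whenever the combined statistic is updated.
    """

    def __init__(self):
        self.last_lss = {}

    def receive(self, sender, lss):
        # Overwrite any older message from the same sender.
        self.last_lss[sender] = lss

    def current(self, neighbors):
        # Latest known LSS for each neighbor that has sent at least once.
        return {l: self.last_lss[l] for l in neighbors if l in self.last_lss}

cache = NeighborCache()
cache.receive(3, {"count": 21.0})   # iteration 1: neighbor 3 sends
cache.receive(5, {"count": 22.0})   # iteration 1: neighbor 5 sends
cache.receive(5, {"count": 22.5})   # iteration 2: only neighbor 5 still sends
available = cache.current([3, 5])
```

After the second iteration, `available` still holds neighbor 3's first message next to neighbor 5's newer one, so the node that is still iterating always has one LSS per neighbor to combine.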
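Step 7: Cluster output. After Steps 1 to 6, node m (m = 1, ..., M) has obtained the $\langle z_{m,n,i} \rangle$ (n = 1, ..., $N_m$; i = 1, ..., I) corresponding to each of its data points $\{y_{m,n}\}_{n=1,\dots,N_m}$. The index of the largest value among $\langle z_{m,n,i} \rangle$ (i = 1, ..., I) is taken as the class $C_{m,n}$ to which $y_{m,n}$ is finally assigned, that is:
$C_{m,n} = \arg\max_{i=1,\dots,I} \langle z_{m,n,i} \rangle$
In this way, the clustering results $\{C_{m,n}\}$ of all data at all nodes are obtained.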
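The assignment rule of Step 7 amounts to taking, for each data point, the index of its largest membership probability. A minimal sketch, with a hypothetical function name and toy probabilities:

```python
def assign_clusters(z):
    """z[n][i] holds <z_{m,n,i}>; returns 1-based class labels C_{m,n}."""
    return [max(range(len(row)), key=row.__getitem__) + 1 for row in z]

# Toy membership probabilities for three data points and I = 3 classes.
z_m = [
    [0.90, 0.05, 0.05],
    [0.10, 0.20, 0.70],
    [0.30, 0.40, 0.30],
]
labels = assign_clusters(z_m)
```

The labels are 1-based so that they line up with the component index i = 1, ..., I used in the text.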
The present invention uses a mixture of factor analyzers to model the data to be clustered at each node of the sensor network. Each node computes local sufficient statistics from its own data and broadcasts them to its neighbor nodes. After a node has received the local sufficient statistics from all of its neighbor nodes, it obtains the combined sufficient statistics, estimates the parameters of the mixture of factor analyzers from them, and finally completes the clustering based on the estimated model. The mixture of factor analyzers established by the present invention reduces the dimensionality of the data while clustering. The distributed clustering scheme avoids the network-wide failure that a failed central node causes in traditional centralized processing. In addition, in the distributed clustering method of the present invention the nodes exchange sufficient statistics rather than data, which both greatly reduces communication overhead and better protects the private information in the data, so that the security of a system adopting the method is greatly increased.
Beneficial effects:
1. The mixture of factor analyzers adopted in the present invention can reduce the dimensionality of high-dimensional data, so that clustering is completed together with dimensionality reduction and better clustering performance is obtained.
2. The distributed clustering method based on a mixture of factor analyzers adopted in the present invention allows each node in the sensor network to make full use of the information contained in the data of other nodes, and its clustering performance is better than that of the centralized approach.
3. In the distributed clustering method based on a mixture of factor analyzers adopted in the present invention, the nodes cooperate by exchanging local sufficient statistics rather than transmitting the raw data directly. Because the number and dimension of the local sufficient statistics are much smaller than those of the data, this approach reduces communication overhead on the one hand and, on the other hand, helps protect the private information in the data, improving the security of a system that adopts this method.
Embodiment
The invention is described in further detail below with reference to the accompanying drawings.
To better illustrate the distributed clustering method for sensor networks based on a mixture of factor analyzers, the method is applied to the clustering of wine composition data. In some countries, testing stations are distributed across different regions to measure the content of each constituent in wine. The wines delivered to a testing station are of different kinds, so wines of similar type need to be clustered. If a testing station can form a sensor network with other testing stations, then by cooperating it can make full use of the wine data at the other stations and thereby improve clustering accuracy. The wine data to be clustered here come from the UCI machine learning repository; there are 178 data points from 3 classes. Each data point has dimension 13, representing the content of each constituent of the wine. The sensor network has 8 nodes, each node has 2 neighbor nodes on average, and the network is connected (a direct or indirect path exists between any two nodes). In this example, therefore, M = 8, p = 13, I = 3, and q = 3. The number of data points at each node is: $N_1 = 21$, $N_2 = 22$, $N_3 = 21$, $N_4 = 21$, $N_5 = 22$, $N_6 = 22$, $N_7 = 21$, $N_8 = 28$. The neighbor nodes of each node are: $R_1 = \{3, 5, 6\}$, $R_2 = \{3, 5\}$, $R_3 = \{1, 4, 2\}$, $R_4 = \{3\}$, $R_5 = \{1, 2\}$, $R_6 = \{1, 7\}$, $R_7 = \{6, 8\}$, $R_8 = \{3, 7\}$.
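The figures of this example can be cross-checked with a few lines of arithmetic. The sketch below takes $R_3$ as {1, 4, 2} and uses variable names chosen here for illustration. It confirms that the per-node counts add up to the 178 data points, that the average number of neighbors is close to 2, and that q = 3 lies in the range p/6 to p/2 for p = 13. It also lists any neighbor relation that appears in only one direction in the sets as printed.

```python
import math

# Data counts and neighbor sets as listed in the embodiment
# (R_3 read as {1, 4, 2}).
N = {1: 21, 2: 22, 3: 21, 4: 21, 5: 22, 6: 22, 7: 21, 8: 28}
R = {1: {3, 5, 6}, 2: {3, 5}, 3: {1, 4, 2}, 4: {3},
     5: {1, 2}, 6: {1, 7}, 7: {6, 8}, 8: {3, 7}}
p, q = 13, 3

total_points = sum(N.values())
mean_degree = sum(len(r) for r in R.values()) / len(R)
# Integers between p/6 and p/2, the range given for q.
q_candidates = list(range(math.ceil(p / 6), math.floor(p / 2) + 1))
# Pairs (m, l) where l is in R_m but m is not in R_l.
one_way_links = sorted((m, l) for m in R for l in R[m] if m not in R[l])
```

In the sets as printed, node 8 lists node 3 as a neighbor while $R_3$ does not list node 8, so `one_way_links` contains that single pair; all other relations are mutual.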
Distributed clustering proceeds according to the flow described in the Summary of the invention (shown in Fig. 2):
(1) Initialization: set the initial values of the MFA parameters. At each node, the initial value of $\mu_i$ is selected at random from the data of that node, and each element of the initial values of $A_i$ and $D_i$ is generated from the standard normal distribution N(0, 1). In addition, each node l (l = 1, ..., M) broadcasts the number of data points it has collected, $N_l$, to its neighbor nodes. After a node m has received the data counts broadcast by all of its neighbor nodes, it computes the weights $c_{lm}$:
This weight measures the importance, at node m, of the information transmitted by each neighbor node l ($l \in R_m$). After initialization is complete, the iteration counter is set to iter = 1 and the iterative process begins.
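The weight formula itself is not reproduced in this text. Purely as an illustration of weights derived from the broadcast data counts, the sketch below assumes the rule $c_{lm} = N_l / \sum_{l' \in R_m} N_{l'}$, so that neighbors holding more data count for more. This rule and the function name are assumptions, not the specification's definition.

```python
def neighbor_weights(N, R, m):
    """Hypothetical rule: c_lm = N_l / (sum of N_l' over l' in R_m)."""
    total = sum(N[l] for l in R[m])
    return {l: N[l] / total for l in R[m]}

# Node 1 of the example: neighbors 3, 5, 6 with 21, 22, 22 data points.
N = {3: 21, 5: 22, 6: 22}
R = {1: {3, 5, 6}}
c_1 = neighbor_weights(N, R, 1)
```

Under this assumed rule the weights at each node sum to 1; whether node m's own data count also enters the normalization is not recoverable from the text.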
(2) Local computation: this step does not require information from neighbor nodes. At each node l, based on the data $Y_l$ it has collected, first compute $g_i$, $\Omega_i$, […] and $\langle z_{l,n,i} \rangle$ (n = 1, ..., $N_l$; i = 1, ..., I):
Here, $\Theta^{old}$ denotes the parameter values obtained after the previous iteration (in the first iteration, the initial parameter values), and $\langle z_{l,n,i} \rangle$ denotes the probability that the n-th data point $y_{l,n}$ at node l belongs to the i-th class (mixture component).
Next, the node computes the local sufficient statistics $LSS_l$ as follows:
(3) Broadcast diffusion: each node l in the sensor network broadcasts the local sufficient statistics $LSS_l$ it has computed to its neighbor nodes, as shown in Fig. 1.
(4) Joint computation: when node m (m = 1, ..., M) has received $LSS_l$ from all of its neighbor nodes l ($l \in R_m$), node m computes the combined sufficient statistics $CSS_m$:
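The combination formula is not reproduced in this text. The following sketch only illustrates the general shape of such a step under the assumption that $CSS_m$ is an entrywise weighted sum of the received statistics, $CSS_m = \sum_{l \in R_m} c_{lm} LSS_l$. The rule, the function name, the dictionary keys `s0` and `s1`, and the numbers are all hypothetical.

```python
def combine(lss_by_neighbor, weights):
    """Hypothetical rule: CSS_m = sum over l of c_lm * LSS_l, entry by entry."""
    keys = next(iter(lss_by_neighbor.values())).keys()
    return {
        k: sum(weights[l] * lss_by_neighbor[l][k] for l in lss_by_neighbor)
        for k in keys
    }

# Toy statistics received at node m from neighbors 3 and 5.
lss = {3: {"s0": 6.0, "s1": 12.0}, 5: {"s0": 8.0, "s1": 4.0}}
c = {3: 0.25, 5: 0.75}
css = combine(lss, c)
```

Only the statistics cross the network in this step; the raw data points used to compute them never leave the node that collected them.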
(5) Parameter estimation: node m (m = 1, ..., M) estimates $\Theta = \{\pi_i, A_i, \mu_i, D_i\}_{i=1,\dots,I}$ from the $CSS_m$ computed in the previous step, where $\{\pi_i, \mu_i\}_{i=1,\dots,I}$ are estimated as follows:
$\{A_i, D_i\}_{i=1,\dots,I}$ are estimated as follows:
(6) Convergence check: node m (m = 1, ..., M) computes the log-likelihood at the current iteration:
If $\log p(Y_m \mid \Theta) - \log p(Y_m \mid \Theta^{old}) < \varepsilon$, the node has converged and stops iterating; otherwise it returns to step (2) and starts the next iteration (iter = iter + 1). Here $\Theta$ denotes the parameter values estimated in the current iteration and $\Theta^{old}$ those estimated in the previous iteration; that is, the algorithm has converged when the difference in log-likelihood between two consecutive iterations is less than the threshold $\varepsilon$, where $\varepsilon$ takes any value between $10^{-6}$ and $10^{-5}$. Note that because the nodes in the network process data in parallel, they do not all converge in the same iteration. For example, when node l has converged and node m has not, node l no longer sends $LSS_l$ and no longer receives information sent by its neighbor nodes, and node m updates its $CSS_m$ using the last $LSS_l$ it received from node l. Nodes that have not converged continue iterating until all nodes in the network have converged.
(7) Cluster output: after steps (1) to (6), node m (m = 1, ..., M) has obtained the $\langle z_{m,n,i} \rangle$ (n = 1, ..., $N_m$; i = 1, ..., I) corresponding to each of its data points $\{y_{m,n}\}_{n=1,\dots,N_m}$. The index of the largest value among $\langle z_{m,n,i} \rangle$, i = 1, ..., I, is taken as the class $C_{m,n}$ to which $y_{m,n}$ is finally assigned, that is:
$C_{m,n} = \arg\max_{i=1,\dots,I} \langle z_{m,n,i} \rangle$
In this way, the clustering results $\{C_{m,n}\}$ of all data at all nodes are obtained.
Performance evaluation:
The result obtained with the clustering method of the present invention, $\{C_{m,n}\}$, is compared with the correct class memberships in order to evaluate the effectiveness and accuracy of the method. Fig. 3 shows the clustering result at each node. The horizontal axis of the figure indexes the 178 data points; a non-empty position indicates that the data point was assigned to that node, and the vertical axis gives the index of the class (3 classes in total) to which the data point was assigned. In the figure, "o" marks correctly clustered data and "x" marks incorrectly clustered data. From Fig. 3, the clustering accuracies at the 8 nodes are 100%, 100%, 95.2%, 95.5%, 100%, 95.5%, 100%, and 92.9%. Only five data points in total are clustered incorrectly, and the average accuracy over the whole network is 97.2%. This is essentially the same as the result (98%) obtained with the centralized method. The shortcomings of the centralized approach, however, are obvious: first, once the central node fails, the whole network fails; second, each node transmits its raw data directly to the central node, which not only increases the communication burden on the network but also easily leaks private information in the data. The method of the present invention overcomes these shortcomings and achieves good distributed clustering performance.
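The network-wide figure can be reproduced from the counts given in this example. The sketch below assumes that the five misclassified data points fall on nodes 3, 4, 6, and 8, the nodes whose reported accuracy is below 100%, in the split 1, 1, 1, 2; that split is an inference from the reported percentages, not a figure stated in the text.

```python
# Data counts per node, as listed in the embodiment.
N = [21, 22, 21, 21, 22, 22, 21, 28]
# Assumed split of the five errors over nodes 1..8 (inferred, not stated).
errors = [0, 0, 1, 1, 0, 1, 0, 2]

per_node = [100 * (n - e) / n for n, e in zip(N, errors)]
overall = 100 * (sum(N) - sum(errors)) / sum(N)
```

With 173 of 178 data points clustered correctly, `overall` rounds to the 97.2% reported for the whole network.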
The scope of protection claimed for the present invention is not limited to the description of this embodiment.