Summary of the invention
The object of the invention is to overcome the defects of the prior art by proposing a distributed clustering method, based on a mixture factor analysis model, for sensor networks.
The technical scheme adopted by the present invention to solve this technical problem is a distributed handwritten digit recognition method based on t mixture factor analysis, comprising the following steps:
Step 1: collection of data and feature extraction.
Suppose there are M computers/computing nodes (nodes for short) forming a network. Node m collects raw data for the digits 0 to 9 (10 classes in total) through a handwriting pad attached to it. The handwriting pad automatically records the two-dimensional coordinates of each point on the writing track of each character; the coordinates of 8 equally spaced points along the track are taken as the feature data s corresponding to each raw datum, 16 dimensions in total. For convenience of notation, let the training data set of digit d obtained at node m after feature extraction be S_m^(d) = {s_mn^(d)}, n = 1, ..., N_m^(d), where s_mn^(d) denotes the n-th feature datum of handwritten digit d used for training at node m, with dimension p, and N_m^(d) is the number of training data for digit d.
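The feature extraction in Step 1 can be sketched as follows. This is a minimal illustration, assuming that "equally spaced" means equal spacing over the recorded sample indices of the track (the original may instead intend arc-length spacing); the trajectory array and function name are hypothetical.

```python
import numpy as np

def extract_features(trajectory, n_points=8):
    """Sample n_points at equal intervals along a writing trajectory
    (an array of (x, y) coordinates recorded by the handwriting pad)
    and flatten them into one feature vector of dimension 2 * n_points."""
    trajectory = np.asarray(trajectory, dtype=float)
    # indices of n_points equally spaced samples along the recorded track
    idx = np.linspace(0, len(trajectory) - 1, n_points).round().astype(int)
    return trajectory[idx].reshape(-1)   # shape: (16,) when n_points=8

# example: a synthetic 50-point trajectory
traj = np.column_stack([np.linspace(0, 1, 50), np.sin(np.linspace(0, 3, 50))])
s = extract_features(traj)
print(s.shape)   # (16,)
```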
The distribution of S_m^(d) is described by a t mixture factor analysis model (tMFA). Note that all nodes model the training data of digit d with the same common tMFA. A tMFA is a mixture model with I components; each datum s_mn^(d) can be expressed, with probability π_i (i = 1, ..., I), as

s_mn^(d) = μ_i + A_i z + e,

where μ_i is the mean vector of the i-th mixture component, with dimension p; z is the factor in the low-dimensional space corresponding to s_mn^(d), with dimension q (q << p), and obeys a t distribution. The value of q is chosen according to the size of p in the particular problem, generally any integer between q = p/5 and q = p/3. A_i is the (p × q) factor loading matrix of the i-th mixture component. The error e obeys a t distribution, where D_i is a (p × p) diagonal matrix and ν_i is the degrees of freedom of the i-th mixture component. The weights π_i of the mixture components satisfy Σ_{i=1}^{I} π_i = 1. The parameter set of the tMFA is therefore Θ = {π_i, A_i, μ_i, D_i, ν_i}_{i=1,...,I}. Note that for all nodes, the values of the parameters in the tMFA parameter set to be estimated are identical. Here it should be noted that the t distribution can be expanded as an integral of a Gaussian distribution and a Gamma distribution:
t(s | μ, Σ, ν) = ∫ N(s | μ, Σ/u) · Gamma(u | ν/2, ν/2) du,

where u is the integration hidden variable corresponding to s.
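The generative model above can be sketched as follows. The shared Gamma scale u for both the factor and the error follows the usual hierarchical construction of the t distribution; all parameter values in the example are illustrative assumptions, not values from the patent.

```python
import numpy as np

def sample_tmfa(pi, A, mu, D, nu, n, rng):
    """Draw n samples from a t mixture of factor analyzers, using the
    representation of the t distribution as a Gaussian scaled by a
    Gamma-distributed hidden variable u (hierarchical construction)."""
    I, p, q = len(pi), mu.shape[1], A.shape[2]
    out = np.empty((n, p))
    for j in range(n):
        i = rng.choice(I, p=pi)                      # pick component i with prob pi_i
        u = rng.gamma(nu[i] / 2, 2 / nu[i])          # u ~ Gamma(nu_i/2, rate nu_i/2)
        z = rng.normal(size=q) / np.sqrt(u)          # factor z ~ t_q(0, I_q, nu_i)
        e = rng.normal(size=p) * np.sqrt(np.diag(D[i]) / u)  # error ~ t_p(0, D_i, nu_i)
        out[j] = mu[i] + A[i] @ z + e
    return out

# toy example with I = 2 components, p = 4, q = 2 (illustrative values)
rng = np.random.default_rng(0)
pi = np.array([0.5, 0.5])
mu = np.zeros((2, 4)); mu[1] += 5.0
A  = rng.normal(size=(2, 4, 2))
D  = np.stack([np.eye(4) * 0.1, np.eye(4) * 0.1])
nu = np.array([4.0, 4.0])
X = sample_tmfa(pi, A, mu, D, nu, 200, rng)
print(X.shape)   # (200, 4)
```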
In addition, the data transmission range of each node is set to Dis. For the current node m, all nodes whose distance to m is less than Dis are its neighbor nodes, and the set of neighbor nodes of node m is denoted R_m. Fig. 1 illustrates the relations among the nodes in a network, where each computer icon represents a node; an edge between two nodes indicates that the two nodes can communicate with each other and exchange information. The dashed box in Fig. 1 represents R_m for node m. In the present invention the network topology is determined in advance; it is only necessary to ensure that between any two nodes there exists at least one path, direct or reachable through multiple hops.
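The construction of the neighbor sets R_m from the transmission range Dis can be sketched as follows; node positions are assumed to be known two-dimensional coordinates, which is an assumption for illustration.

```python
import numpy as np

def neighbor_sets(positions, Dis):
    """Return, for every node m, the set R_m of nodes whose Euclidean
    distance to node m is less than the transmission range Dis."""
    positions = np.asarray(positions, dtype=float)
    M = len(positions)
    R = []
    for m in range(M):
        d = np.linalg.norm(positions - positions[m], axis=1)
        R.append({l for l in range(M) if l != m and d[l] < Dis})
    return R

pos = [(0, 0), (1, 0), (3, 0)]
print(neighbor_sets(pos, Dis=1.5))  # [{1}, {0}, set()]
```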
Step 2: distributed training. The data S_m^(d) are used for distributed training to obtain the tMFA parameter set corresponding to each digit class d.
After the network topology and the tMFA describing the data distribution have been established, distributed training starts. Taking the training of digit d as an example, the training process is shown in Fig. 2, and its concrete steps are as follows:
Step 2-1: initialization. Set the number of mixture components I of the tMFA. Here I determines the complexity of the tMFA model; I can take any integer in 3 to 8, and I = 5 gives good performance in handwritten digit recognition. The initial values of the parameters of the tMFA are set according to I and the dimension p of the data. At each node, μ_i is randomly selected from the data collected by that node; each element of A_i and D_i is generated from the standard normal distribution N(0, 1); each ν_i takes any integer between 1 and 5. In addition, each node l (l = 1, ..., M) broadcasts the number of data it has collected, N_l^(d), to its neighbor nodes. After a node m receives the data counts broadcast by all its neighbor nodes, it computes the weights c_lm. The meaning of these weights is to measure, at node m, the importance of the information transmitted by each neighbor node l (l ∈ R_m). After initialization is complete, the iteration counter is set to iter = 1 and the iterative process starts.
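The initialization step can be sketched as follows. Two details are assumptions not stated in this text: the diagonal of D_i is squared to keep the error variances positive (a practical tweak on the N(0, 1) draw), and the weight c_lm is taken proportional to the data count of each node, since the exact formula for c_lm is not reproduced here.

```python
import numpy as np

def init_node(data, I, q, rng):
    """Initialise one node's tMFA parameters as described above."""
    N, p = data.shape
    mu = data[rng.choice(N, size=I, replace=False)]   # mu_i drawn from local data
    A  = rng.normal(size=(I, p, q))                   # A_i entries ~ N(0, 1)
    # squaring keeps the error variances positive (assumption)
    D  = np.stack([np.diag(rng.normal(size=p) ** 2) for _ in range(I)])
    nu = rng.integers(1, 6, size=I).astype(float)     # nu_i: integer in 1..5
    pi = np.full(I, 1.0 / I)                          # uniform mixing weights
    return pi, A, mu, D, nu

def neighbor_weights(counts, R_m, m):
    """One plausible choice (assumption): c_lm proportional to the
    data count N_l of each neighbour l and of node m itself."""
    idx = sorted(R_m | {m})
    total = sum(counts[l] for l in idx)
    return {l: counts[l] / total for l in idx}

rng = np.random.default_rng(42)
data = rng.normal(size=(40, 16))      # 40 local feature vectors, p = 16
pi, A, mu, D, nu = init_node(data, I=5, q=4, rng=rng)
print(mu.shape)   # (5, 16)

counts = {0: 10, 1: 30, 2: 20}
c = neighbor_weights(counts, R_m={1, 2}, m=0)
print(round(sum(c.values()), 6))   # 1.0
```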
Step 2-2: compute local statistics and broadcast. This step does not need information from neighbor nodes. At each node l, based on the data it has collected, the intermediate variables g_i, Ω_i and the related expectation terms are computed first, using the parameter values obtained after the previous iteration (at the first iteration, the initial values of the parameters); here g_i denotes the probability that the n-th datum at node l belongs to the i-th class (i.e., mixture component), and the remaining terms are the corresponding expectation values. With these intermediate results, the node computes the local statistics (LS). Finally, each node l broadcasts the computed local statistics LS_l to its neighbor nodes, as shown in Fig. 1.
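The exact form of the intermediate variables is not reproduced in this text. As an illustration of how g_i is typically obtained in t-mixture EM, the sketch below computes the responsibility of each component from the component t densities, with the component covariance taken as A_i A_i^T + D_i; this is a common construction and an assumption here, not the patent's verbatim formula.

```python
import math
import numpy as np

def log_t_pdf(s, mu, Sigma, nu):
    """Log density of a p-variate t distribution t_p(mu, Sigma, nu)."""
    p = len(mu)
    diff = s - mu
    delta = float(diff @ np.linalg.solve(Sigma, diff))   # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(Sigma)
    return (math.lgamma((nu + p) / 2) - math.lgamma(nu / 2)
            - 0.5 * (p * math.log(nu * math.pi) + logdet)
            - 0.5 * (nu + p) * math.log1p(delta / nu))

def responsibilities(s, pi, A, mu, D, nu):
    """Probability g_i that datum s belongs to mixture component i."""
    logp = np.array([math.log(pi[i])
                     + log_t_pdf(s, mu[i], A[i] @ A[i].T + D[i], nu[i])
                     for i in range(len(pi))])
    logp -= logp.max()                 # numerical stabilisation
    g = np.exp(logp)
    return g / g.sum()

# toy check: two well-separated components (illustrative parameters)
pi = np.array([0.5, 0.5])
A  = np.zeros((2, 4, 2))               # zero loadings: covariance reduces to D_i
mu = np.stack([np.zeros(4), np.full(4, 5.0)])
D  = np.stack([np.eye(4), np.eye(4)])
nu = np.array([4.0, 4.0])
g = responsibilities(np.full(4, 0.1), pi, A, mu, D, nu)
print(round(g.sum(), 6))   # 1.0
```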
Step 2-3: compute the combined statistics. When node m (m = 1, ..., M) has received the local statistics from all its neighbor nodes l (l ∈ R_m), node m computes the combined statistics CS_m.
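The combination step can be sketched as follows. The weighted-sum rule using the weights c_lm from initialization is an assumption about the exact formula, which is not reproduced in this text.

```python
import numpy as np

def combine_statistics(LS, c, R_m, m):
    """Combined statistic CS_m at node m: a weighted sum of the local
    statistics of its neighbours and of node m itself (assumed rule)."""
    idx = sorted(R_m | {m})
    return sum(c[l] * LS[l] for l in idx)

# toy example with vector-valued local statistics
LS = {0: np.array([1.0, 2.0]), 1: np.array([3.0, 4.0]), 2: np.array([5.0, 6.0])}
c  = {0: 0.2, 1: 0.5, 2: 0.3}
CS = combine_statistics(LS, c, R_m={1, 2}, m=0)
print(CS)
```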
Step 2-4: estimate the parameters of the model. Node m (m = 1, ..., M) estimates the parameters Θ = {π_i, A_i, μ_i, D_i, ν_i}_{i=1,...,I} according to the CS_m computed in the previous step. The estimation procedure for {π_i, μ_i}_{i=1,...,I} is as follows:
For the estimation of {A_i, D_i}_{i=1,...,I}, the process is as follows:
In addition, {ν_i}_{i=1,...,I} is obtained by solving the following equation, where ψ(·) is the standard digamma function; the equation is solved by Newton's method.
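The Newton solve for the degrees of freedom can be sketched as follows. Since the patent's exact equation is not reproduced in this text, a representative form of the degrees-of-freedom update in t-mixture EM is used, with a constant c standing in for the E-step expectation terms; the numerical digamma is a lightweight stand-in for a library implementation. All of these are labeled assumptions.

```python
import math

def digamma(x, h=1e-5):
    """Numerical digamma via a central difference of math.lgamma
    (an approximation standing in for a library digamma)."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def solve_nu(c, nu0=4.0, tol=1e-10, max_iter=100):
    """Newton's method for f(nu) = log(nu/2) - digamma(nu/2) + 1 + c = 0,
    a representative form of the nu update in t-mixture EM; c collects
    the E-step expectations (assumption, not the patent's exact equation)."""
    g = lambda v: math.log(v / 2) - digamma(v / 2)
    nu = nu0
    for _ in range(max_iter):
        f = g(nu) + 1 + c
        h = 1e-6 * nu
        fp = (g(nu + h) - g(nu - h)) / (2 * h)   # numerical derivative of f
        nu_new = nu - f / fp
        if nu_new <= 0:
            nu_new = nu / 2                      # guard: keep nu positive
        if abs(nu_new - nu) < tol:
            return nu_new
        nu = nu_new
    return nu

nu_hat = solve_nu(c=-1.2)
print(round(nu_hat, 2))
```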
Step 2-5: judge convergence. Node m (m = 1, ..., M) computes the log-likelihood under the current iteration iter. If |L(Θ) − L(Θ_old)| < ε, the algorithm has converged and iteration stops; otherwise step 2-2 is executed to start the next iteration (iter = iter + 1). Here Θ denotes the parameter values estimated in the current iteration and Θ_old the parameter values estimated in the previous iteration; that is, the algorithm converges when the difference in log-likelihood between two adjacent iterations is less than the threshold ε. ε takes any value in 10^-5 to 10^-6. It should be noted that, because the nodes in the network process data in parallel, they cannot all converge in the same iteration. For example, if node l has converged while node m has not, node l no longer sends LS_l and no longer receives information transmitted by its neighbor nodes; node m then updates its CS_m with the last LS_l received from node l. The unconverged nodes continue to iterate until all nodes in the network have converged.
After the above steps 2-1 to 2-5, the model tMFA (i.e., represented by the parameters Θ at training convergence) corresponding to the training data of handwritten digit d is obtained. The above steps are repeated 10 times, once per digit, to obtain the tMFA model corresponding to each of the 10 digits; for convenience of notation and to distinguish them, Θ^(d) is used to represent the tMFA model corresponding to digit d. Distributed training is then complete.
Step 3: distributed recognition. When any computer in the network collects a new handwritten digit to be recognized, its corresponding features are first obtained by step 1 and denoted s'; then the log-likelihoods log p(s' | Θ^(d)) (d = 0, 1, ..., 9) are computed, and the label d' corresponding to the maximum log-likelihood value is taken as the recognition result for s':

d' = argmax_d log p(s' | Θ^(d)).

The specific flow of the distributed recognition method of the present invention is shown in Fig. 2.
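The recognition rule above can be sketched as follows. The toy "models" and scoring function are stand-ins (a class mean with a negative squared distance, which is not the tMFA likelihood); they only illustrate the argmax decision rule.

```python
import numpy as np

def recognise(s_prime, models, loglik):
    """Return d' = argmax_d loglik(s' | model_d)."""
    scores = [loglik(s_prime, models[d]) for d in range(len(models))]
    return int(np.argmax(scores))

# toy stand-in: each "model" is just a class mean (assumption for
# illustration; in the method each model is a trained tMFA)
means = [np.full(16, d) for d in range(10)]
toy_loglik = lambda s, m: -float(np.sum((s - m) ** 2))
s_new = np.full(16, 6.9)
print(recognise(s_new, means, toy_loglik))   # 7
```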
Beneficial effect:
1. the t hybrid cytokine analysis adopted in the present invention, to the outlier existed in data, there is higher robustness, and high dimensional data can be described better, thus better obtain the model corresponding with data, thus also can obtain better training and recognition performance.
2. the distributed training process based on t hybrid cytokine analytical model adopted in the present invention, the information comprised in each computer node in network can be made full use of data that other computer node collects, thus the more accurate model of training place.
3. the distributed training process based on t hybrid cytokine analytical model adopted in the present invention; in computer node cooperating process; exchange local statistic instead of directly transmit raw data; because the quantity of local statistic and dimension are much smaller than data; therefore this mode saves the expense of communication on the one hand; on the other hand, be conducive to the privacy information adequately protected in data, improve the security performance of the system adopting this method.
4. the Distributed identification process based on t hybrid cytokine analytical model adopted in the present invention, can gather new data in any computer node place in a network, can obtain identical recognition result.
Embodiment
The invention is described in further detail below in conjunction with the accompanying drawings.
To better illustrate the distributed handwritten digit recognition method based on t mixture factor analysis to which the present invention relates, we describe it with a concrete application example.
(1) Collection of data and feature extraction: suppose there are 44 people in total and each person writes each digit by hand 25 times, for a total of 25*10*44 = 11000 raw data. There are 20 computers/nodes in the network (M = 20), each node has 3 neighbor nodes, and any two nodes can communicate through a direct or multi-hop path. The raw data of the handwritten digits of 30 of the people (250*30 = 7500) are used for distributed training, divided into 20 parts and assigned randomly to the 20 nodes. For each raw datum, the coordinates of 8 equally spaced points on its track are taken as its corresponding feature data s, 16 dimensions in total. For convenience of notation, let the training data set of digit d obtained at node m after feature extraction be S_m^(d), where s_mn^(d) denotes the n-th feature datum of handwritten digit d used for training at node m, with dimension p = 16, and N_m^(d) is the number of training data for digit d.
(2) Distributed training: after step (1) is completed, distributed training starts. Taking the training of digit d as an example, the distribution of the feature data is modeled with a tMFA; the training process is shown in Fig. 2, and its concrete steps are as follows:
(2-1) Initialization: set the number of mixture components of the tMFA to I = 5. The initial values of the parameters of the tMFA are set according to I and the dimension p of the data. At each node, μ_i is randomly selected from the data collected by that node, and each element of A_i and D_i is generated from the standard normal distribution N(0, 1). In addition, each node l (l = 1, ..., M) broadcasts the number of data it has collected, N_l^(d), to its neighbor nodes. After a node m receives the data counts broadcast by all its neighbor nodes, it computes the weights c_lm, whose meaning is to measure, at node m, the importance of the information transmitted by each neighbor node l (l ∈ R_m). After initialization is complete, the iteration counter is set to iter = 1 and the iterative process starts.
(2-2) Compute local statistics and broadcast: this step does not need information from neighbor nodes. At each node l, based on the data it has collected, the intermediate variables g_i, Ω_i and the related expectation terms are computed first, using the parameter values obtained after the previous iteration (at the first iteration, the initial values of the parameters); here g_i denotes the probability that the n-th datum at node l belongs to the i-th class (mixture component), and the remaining terms are the corresponding expectation values. With these intermediate results, the node computes the local statistics (LS). Finally, each node l broadcasts the computed local statistics LS_l to its neighbor nodes, as shown in Fig. 1.
(2-3) Compute the combined statistics: when node m (m = 1, ..., M) has received the local statistics from all its neighbor nodes l (l ∈ R_m), node m computes the combined statistics CS_m.
(2-4) Estimate the parameters of the model: node m (m = 1, ..., M) estimates Θ = {π_i, A_i, μ_i, D_i, ν_i}_{i=1,...,I} according to the CS_m computed in the previous step. The estimation procedure for {π_i, μ_i}_{i=1,...,I} is as follows:
For the estimation of {A_i, D_i}_{i=1,...,I}, the process is as follows:
In addition, {ν_i}_{i=1,...,I} is obtained by solving the following equation, where ψ(·) is the standard digamma function; the equation is generally solved by Newton's method.
(2-5) Judge convergence: node m (m = 1, ..., M) computes the log-likelihood under the current iteration iter. If |L(Θ) − L(Θ_old)| < ε, the algorithm has converged and iteration stops; otherwise step (2-2) is executed to start the next iteration (iter = iter + 1). Here Θ denotes the parameter values estimated in the current iteration and Θ_old the parameter values estimated in the previous iteration; that is, the algorithm converges when the difference in log-likelihood between two adjacent iterations is less than the threshold ε. ε takes any value in 10^-5 to 10^-6. It should be noted that, because the nodes in the network process data in parallel, they cannot all converge in the same iteration. For example, if node l has converged while node m has not, node l no longer sends LS_l and no longer receives information transmitted by its neighbor nodes; node m then updates its CS_m with the last LS_l received from node l. The unconverged nodes continue to iterate until all nodes in the network have converged.
After the above steps (2-1) to (2-5), the model tMFA (represented by the parameters Θ at training convergence) corresponding to the training data of handwritten digit d is obtained. The above steps are repeated 10 times to obtain the tMFA model corresponding to each of the 10 digits; for convenience of notation and to distinguish them, Θ^(d) is used to represent the tMFA model corresponding to digit d. Distributed training is then complete.
(3) Distributed recognition: when any computer in the network collects a new handwritten digit to be recognized, its corresponding features are first obtained by step (1) and denoted s'; then the log-likelihoods log p(s' | Θ^(d)) (d = 0, 1, ..., 9) are computed, and the label d' corresponding to the maximum log-likelihood value is taken as the recognition result for s':

d' = argmax_d log p(s' | Θ^(d)).

The distributed recognition flow of the present invention is shown in Fig. 2.
Performance evaluation:
Because the true values of the digits to be recognized are known, comparing the results of the recognition method involved in the present invention with the true values gives the recognition accuracy (i.e., recognition accuracy = number of handwritten digits correctly recognized by all nodes / (20*3500)), so the validity and accuracy of the method involved in the present invention can be evaluated and measured. To compare the performance of the tMFA-based distributed handwritten digit recognition method proposed by the present invention (distributed tMFA for short) with other methods, it is compared here with a tMFA-based centralized handwritten digit recognition method (centralized tMFA for short) and a tMFA-based handwritten digit recognition method without cooperation among the nodes (non-cooperative tMFA for short). It should be noted that in centralized tMFA all nodes must transmit their raw data to a central node, which carries out training and recognition with traditional MFA and then returns the results to each node. This approach is of little use in practice: first, transmitting raw data incurs very large communication overhead, and packet loss or packet damage greatly degrades the final recognition performance; second, it is unfavorable to the privacy protection of the data and raises network security concerns. The purpose here is to examine whether the recognition method of the distributed tMFA proposed by the present invention can reach the same performance as the centralized tMFA. The recognition results are presented both qualitatively and quantitatively. For the qualitative presentation, Hinton diagrams of the confusion matrices are used, as shown in Fig. 3. In these diagrams, the rows represent the recognition results for digits 0 to 9 and the columns represent the true values of digits 0 to 9. A block on the main diagonal represents correct recognition, and the larger the block, the more digits are recognized correctly; blocks at other positions indicate cases of wrong recognition. As can be seen from the figure, the centralized tMFA and the distributed tMFA of the present invention (only node 3 is shown for reasons of space; the other nodes give the same results) perform well, while the non-cooperative tMFA performs poorly. For the quantitative presentation, the mean and variance of the recognition accuracy are used, as shown in Fig. 4. The mean recognition accuracy of the distributed tMFA designed by the present invention is essentially identical to that of the centralized tMFA, while that of the non-cooperative tMFA is worse; in addition, the variance of the recognition accuracy of the distributed tMFA is much smaller than that of the non-cooperative tMFA. The method of the present invention therefore overcomes the shortcomings of traditional centralized tMFA handwritten digit recognition and achieves distributed recognition with good performance.
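The evaluation described above can be sketched as follows; the confusion-matrix layout (rows = recognized digits, columns = true digits) follows the description of Fig. 3, while the function name and toy data are illustrative.

```python
import numpy as np

def evaluate(pred, truth, n_classes=10):
    """Recognition accuracy and a confusion matrix whose rows are the
    recognised digits and whose columns are the true digits."""
    pred = np.asarray(pred); truth = np.asarray(truth)
    acc = float((pred == truth).mean())
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for p_, t_ in zip(pred, truth):
        cm[p_, t_] += 1
    return acc, cm

# toy example: one "1" is wrongly recognised as "2"
acc, cm = evaluate([0, 1, 2, 2], [0, 1, 1, 2])
print(acc)        # 0.75
print(cm[2, 1])   # 1
```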
The scope of protection claimed by the present invention is not limited to the description of this embodiment.