CN101516099B

CN101516099B - Test method for sensor network anomaly

Info

Publication number: CN101516099B
Application number: CN2009100615378A
Authority: CN
Inventors: 刘文予; 蒋洪波; 舒乐; 张松涛; 刘文平; 陈金华; 刘军
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2009-04-07
Filing date: 2009-04-07
Publication date: 2010-12-01
Anticipated expiration: 2029-04-07
Also published as: CN101516099A

Abstract

The invention relates to a method for detecting anomalies in a sensor network. The network is divided into clusters. In each cluster, the cluster head collects the high-dimensional data sequences of all nodes in the cluster for each unit time, and a hidden Markov model (HMM) is built as the node high-dimensional data transition model for each unit time. Using model similarity as the classification criterion, the per-unit-time transition models are classified; all high-dimensional data sequences from the unit times whose models fall into the same class are pooled, and an HMM is built again from the pooled data to obtain a new node high-dimensional data transition model, which is used for anomaly detection within the cluster. The method makes full use of the temporal and spatial correlation of the data in the sensor network, effectively reduces data redundancy and communication overhead, and prolongs the service life of the sensor nodes, while achieving the purpose of anomaly detection.

Description

A method for detecting anomalies in a sensor network

Technical field

The present invention relates to wireless sensor networks, and in particular to a method for anomaly detection in wireless sensor networks.

Background art

In a wireless sensor network, the large volume of raw sensor data generally exhibits a high degree of spatial or temporal correlation, so transmitting all of it in full to the base-station node (sink) wastes energy and is unnecessary. The main idea of data fusion (also called data aggregation) is for the nodes along the transmission path to cooperate interactively, greatly reducing the data volume by removing correlated redundancy, so that the same or a comparable amount of information is represented with less data.

Some research results already exist on data fusion based on the spatial or temporal correlation of sensor node data, but they have not yet formed a theoretical system. In particular, spatial correlation and temporal correlation have been exploited in isolation rather than being combined organically and systematically, which has hindered further improvement of data fusion efficiency.

Existing data fusion models are all local and independent, and offer no principled explanation or modeling of the relationships between local regions or of multi-level information fusion. In fact, the correlation between data is hierarchical and weakens gradually as distance increases, and correlation also exists between local regions. Existing research, however, deterministically delimits a region or boundary as a local region and assumes that only the data inside it are locally correlated.

First, existing data fusion models focus on the sensor data themselves and apply only simple processing to modeling the detected events that the data reflect; the gap is comparable to that between amplitude coding and model-based coding in speech coding. Second, existing data processing models give insufficient support to the characteristics of higher-level applications. For example, for the same temperature data, data fusion that requires topological information (such as computing contour lines) uses methods entirely different from data fusion that requires only statistical information, and there is no unified mathematical model. In addition, research on unified fusion of multiple kinds of sensor data (for example, temperature and humidity together) has not yet produced significant results. The theories and methods involved often require converting the original low-dimensional data set into a high-dimensional one, yet current research is mainly confined to low-dimensional applications, and the analysis and modeling of high-dimensional data sets has not been fully developed. Finally, existing data processing models mainly use linear systems to analyze and process the spatial and temporal correlation of the data, for example using a linear prediction model to reduce data transmission; real systems are often more complex, and linear system models often lack generality.

Many existing outlier detection algorithms decide whether a reading is an outlier by comparing a sensor node's data with the data of its neighboring nodes; this takes a long time and cannot guarantee accuracy. Within a data fusion framework, the outlier detection problem can be recast as an inference problem on a given data fusion model: for a given sequence, the most likely corresponding state sequence is inferred according to the existing model. Making the judgment from the likelihood value that the existing model assigns to the sequence gives a more practical result.

Summary of the invention

The object of the present invention is to provide a method for anomaly detection in wireless sensor networks that takes full account of the temporal and spatial correlation of the data, effectively reduces data redundancy and communication overhead, prolongs the life of the sensor nodes, and achieves the purpose of anomaly detection.

A method for detecting anomalies in a sensor network divides the network into clusters, and each cluster performs anomaly detection as follows:

Step 1) The cluster head collects the high-dimensional data sequences of all nodes in the cluster for the i-th unit time and, using these sequences as training samples, builds the node high-dimensional data transition model for the i-th unit time with the hidden Markov model (HMM) construction method, where i = 1, 2, ..., N and N is the number of unit times sampled;

Step 2) Using transition model similarity as the classification criterion, the node high-dimensional data transition models of the 1st, 2nd, ..., N-th unit times are classified;

Step 3) For the initial node high-dimensional data transition models belonging to the j-th class, all high-dimensional data sequences in their corresponding unit times are pooled to form a new training sample, and the HMM construction method is used to build the j-th class node high-dimensional data transition model, where j = 1, 2, ..., N1 and N1 is the number of classes obtained in step 2);

Step 4) The j-th class node high-dimensional data transition models are used to perform anomaly detection on the data collected by all nodes in the cluster.

The technical effects of the present invention are as follows:

(1) Multi-dimensional data fusion. Data fusion is performed on the high-dimensional data set in the network (for example, one containing both temperature and humidity). The HMM is used to model the high-dimensional data set, establishing a data fusion model suitable for general data sets of any dimension;

(2) Reduced redundancy and reduced overhead. The present invention makes full use of the temporal and spatial correlation of the data collected by the sensor network. First, because of spatial correlation, the data collected by the nodes within a cluster produced by the clustering algorithm differ little, so modeling cluster by cluster reduces model redundancy and also makes it convenient to subsequently perform anomaly detection on the data collected by the nodes in the cluster. Second, taking temporal correlation into account, the models are classified and a new training sample is assembled to build a new model, further reducing model redundancy. Finally, because an HMM models the transitions of the high-dimensional data set, the member nodes need not send all collected data to the cluster head during anomaly detection; only the transition sequence (that is, the sampled sequence) is sent to the cluster head, which significantly reduces communication overhead.

Description of drawings

Fig. 1 is a schematic diagram of the wireless sensor network structure of the present invention;

Fig. 2 is the overall flow chart of the anomaly detection method of the present invention;

Fig. 3 is a schematic diagram of the clustering result of the wireless sensor network;

Fig. 4 is a schematic diagram of the training samples;

Fig. 5 is a schematic diagram of the anomaly detection results: Fig. 5(a) is the two-dimensional data transition plot of the nodes in the cluster on March 17; Fig. 5(b) is the two-dimensional data transition plot of the nodes in the cluster on March 18; Fig. 5(c) is the two-dimensional data plot of the nodes in the cluster on March 17 (excluding node 3; the thick line is the transition curve of node 4); Figs. 5(d), 5(e) and 5(f) are two-dimensional data plots of the nodes in the cluster on March 18 (excluding node 3; the thick lines are the transition curves of nodes 2, 7 and 9, respectively).

Embodiment

To show the purpose, technical solution and advantages of this patent more clearly, the method is described in detail below with reference to the accompanying drawings and a specific example.

The present invention addresses the anomaly detection problem in wireless sensor networks. It performs data fusion on the high-dimensional data set in the network (containing both temperature and humidity readings), extracts the useful information, and finally produces the anomaly detection result.

Fig. 1 shows the layout of the sensor network used in the embodiment. The sensor data used in the embodiment were collected by the Intel Berkeley Research Lab from nodes 1 to 54 between February 28 and April 5, 2004. The unit time is an integral multiple of 12 hours; in this example a unit time of one day (24 hours) is used for illustration.

Fig. 2 is the flow chart of the embodiment, which comprises the following steps:

Step 1: Cluster the network according to spatial proximity.

For the network structure shown in Fig. 1, we cluster the network as follows. Each node has a status flag, initialized to 0, where 0 means the node has not yet determined its role and 1 means the node has determined that it is a cluster head or a member node. Each node has a level flag, initialized to ∞. Each node also has a parent flag that records its parent node. Each node randomly selects a temporary ID in the range (0, N^3), where N is the total number of nodes in the network.

Step 11: The nodes communicate with one another, and each node obtains the information of its neighbors within h hops. The value of h can be chosen arbitrarily; h = 2 in this experiment. Each node whose status flag is 0 compares its own ID with the IDs of the neighbors within h hops whose status flag is also 0. If its own ID is the largest, the node confirms itself as a cluster head: it sets its level flag to 0, indicating that the cluster head is at level 0, sets its status flag to 1, and proceeds to step 12.

Step 12: Every node whose status flag is 0 computes its similarity to the cluster heads within h hops. In the present invention we express the similarity of two nodes by the correlation coefficient: let x_k and x_l be the sequences of samples read within one day by nodes S_k and S_l respectively; the correlation coefficient r_{k, l} is defined as

r_{k, l} = \frac{E (x_{k} x_{l}) - E (x_{k}) E (x_{l})}{\sqrt{E (x_{k}^{2}) - E^{2} (x_{k})} \sqrt{E (x_{l}^{2}) - E^{2} (x_{l})}}

where E(·) denotes the mean.

The more similar two curves are, the closer the correlation coefficient is to 1. If the correlation coefficient exceeds the correlation threshold (0.97 in this experiment), the node sets its status flag to 1 and becomes a member node.
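The correlation coefficient of step 12 can be sketched in a few lines of Python. This is a minimal illustration only: it assumes each node's daily sequence is a list of scalar values, and the function names (correlation, is_similar) and the handling of constant sequences are assumptions made for the example, not part of the patent.

```python
import math

def mean(values):
    """E(.): arithmetic mean of a sequence."""
    return sum(values) / len(values)

def correlation(x_k, x_l):
    """r_{k,l} = (E[xk*xl] - E[xk]E[xl]) / (sqrt(E[xk^2] - E[xk]^2) * sqrt(E[xl^2] - E[xl]^2))."""
    if len(x_k) != len(x_l) or not x_k:
        raise ValueError("sequences must be non-empty and of equal length")
    e_k, e_l = mean(x_k), mean(x_l)
    cov = mean([a * b for a, b in zip(x_k, x_l)]) - e_k * e_l
    var_k = mean([a * a for a in x_k]) - e_k ** 2
    var_l = mean([b * b for b in x_l]) - e_l ** 2
    if var_k <= 0 or var_l <= 0:
        # A constant sequence has no defined correlation; returning 0.0 is an assumption.
        return 0.0
    return cov / math.sqrt(var_k * var_l)

def is_similar(x_k, x_l, threshold=0.97):
    """True when the correlation exceeds the threshold (0.97 in the experiment)."""
    return correlation(x_k, x_l) > threshold
```

For example, is_similar([1, 2, 3], [2, 4, 6]) holds because the two sequences are perfectly linearly related, whereas a sequence and its reverse give a coefficient of -1.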

Step 13: Repeat steps 11 and 12 until the status flags of all nodes in the network are 1.

Step 14: Each member node computes its similarity to all cluster heads within one hop and finds the cluster head with the largest correlation coefficient. If this correlation coefficient exceeds the correlation threshold (for example, 0.97), the level flag of the member node is set to 1 and its parent flag is set to the ID of that cluster head.

Step 15: Each node with level = ∞ obtains the information of all nodes with level ≠ ∞ within h hops and selects, among those with level ≤ h-1, the node s most similar to itself. The level flag of this node is set to the level flag of node s plus 1, and its parent flag is set to s.

The above is the cluster head election process. At this point every node in the network belongs to exactly one cluster, and the maximum hop count of each cluster is h, where h can be adjusted at program start. The clustering result is shown in Fig. 3.
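Steps 11 to 15 can be sketched as a centralized simulation in Python. This is a sketch under stated assumptions, not the distributed protocol itself: the network is given as an adjacency dictionary adj, temporary IDs are supplied in temp_id, and sim(a, b) is any similarity function such as the correlation coefficient above. All function and parameter names are hypothetical, and breaking ID ties by node label is an added assumption.

```python
import math
from collections import deque

def within_hops(adj, start, h):
    """Return the set of nodes at most h hops from start (start excluded)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if dist[u] == h:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist) - {start}

def elect_cluster_heads(adj, temp_id, sim, h=2, threshold=0.97):
    """Steps 11-13: repeat election rounds until every node has a role."""
    decided, heads = set(), set()
    while len(decided) < len(adj):
        undecided = [n for n in adj if n not in decided]
        # Step 11: largest temporary ID among undecided h-hop neighbours wins.
        for n in undecided:
            rivals = [m for m in within_hops(adj, n, h) if m not in decided]
            if all((temp_id[n], n) > (temp_id[m], m) for m in rivals):
                heads.add(n)
        decided |= heads
        # Step 12: undecided nodes similar to a nearby head become member nodes.
        for n in undecided:
            if n in decided:
                continue
            if any(sim(n, c) > threshold for c in within_hops(adj, n, h) & heads):
                decided.add(n)
    return heads

def assign_levels(adj, heads, sim, h=2, threshold=0.97):
    """Steps 14-15: set the level and parent flags of the member nodes."""
    level = {n: (0 if n in heads else math.inf) for n in adj}
    parent = {n: None for n in adj}
    # Step 14: members one hop from a head attach to the most similar head.
    for n in adj:
        near_heads = sorted(c for c in adj[n] if c in heads)
        if n in heads or not near_heads:
            continue
        best = max(near_heads, key=lambda c: sim(n, c))
        if sim(n, best) > threshold:
            level[n], parent[n] = 1, best
    # Step 15: remaining nodes attach to the most similar levelled node.
    progress = True
    while progress:
        progress = False
        for n in adj:
            if level[n] != math.inf:
                continue
            cands = sorted(m for m in within_hops(adj, n, h) if level[m] <= h - 1)
            if cands:
                s = max(cands, key=lambda m: sim(n, m))
                level[n], parent[n] = level[s] + 1, s
                progress = True
    return level, parent
```

Each election round always produces at least one new cluster head (the undecided node with the largest ID), so the loop terminates. A node that has no levelled node within h hops keeps level ∞ in this sketch; how such a node should be treated is not specified in the text.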

Next, each cluster, centered on its cluster head, analyzes and processes the data in the cluster to obtain the anomaly detection result for that cluster.

Step 2: Each cluster collects samples to train the hidden Markov models (HMMs); the samples comprise the data sequences of all nodes in the cluster over N days. Each node produces one data value every two hours (the mean of all data the node collected during those two hours), giving 12 values per day; the sequence formed by these 12 values is used as a sample. To ensure stable training, we gather the sequences of all nodes in the cluster at the cluster head for training. The parameters of the HMMs, including the initial probability vector, the state transition probability matrix, the parameters of the Gaussian mixture model and the mixture matrix, are computed iteratively with the Baum-Welch algorithm. From Fig. 4 (which shows the cluster formed by nodes 4, 6, 7, 8, 9, 10, 11, 12 and 13 on March 1), we can see that the transition curves of the data within the same cluster are essentially consistent. The training input consists of multiple sequences, and the Baum-Welch iteration maximizes the sum of the probabilities of occurrence of these sequences under the model.
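The two-hour averaging that turns a node's raw readings into a 12-value daily sequence can be sketched as follows. This is an illustration only: the input layout (a list of (hour_of_day, vector) pairs for one node and one day), the function name daily_sequence, and the use of None for a window with no readings are all assumptions made for the example.

```python
def daily_sequence(readings, window_hours=2):
    """Average one day's readings of one node into 24 / window_hours slots.

    readings: list of (hour_of_day, vector) pairs, where hour_of_day is a
    float in [0, 24) and vector is a tuple such as (temperature, humidity).
    """
    slots = 24 // window_hours
    buckets = [[] for _ in range(slots)]
    for hour, vector in readings:
        index = min(int(hour // window_hours), slots - 1)
        buckets[index].append(vector)
    sequence = []
    for bucket in buckets:
        if not bucket:
            # No reading in this window; representing it as None is an assumption.
            sequence.append(None)
            continue
        dims = len(bucket[0])
        sequence.append(tuple(sum(v[d] for v in bucket) / len(bucket)
                              for d in range(dims)))
    return sequence
```

The resulting sequences of all nodes in a cluster would then be gathered at the cluster head as the multi-sequence training set for Baum-Welch.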

Step 3: To further reduce model redundancy, similar models are grouped into classes. Because of temporal correlation, the one-day transition curves of the high-dimensional nonlinear data set within a cluster are similar on neighboring days, so the K-means algorithm can be used to classify the HMMs trained per day. After classification, the data of the days whose models belong to the same class are pooled to form a new training sample, and an HMM is trained again. In the experiment, 15 days of data were divided into 4 classes with the K-means algorithm.
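The grouping of daily models with K-means can be sketched as below. The text does not say how an HMM is turned into a point for K-means, so this sketch assumes each daily model has already been summarized as a fixed-length feature vector (for instance, flattened parameters); that representation, the deterministic initialization from the first k points, and the function names are all assumptions.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100):
    """Plain K-means; centroids start at the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def group_days(labels):
    """Map each class label to the list of day indices whose model fell in it."""
    groups = {}
    for day, label in enumerate(labels):
        groups.setdefault(label, []).append(day)
    return groups
```

The day indices returned by group_days identify which days' sequences are pooled to retrain one HMM per class; in the experiment this would be 15 days grouped with k = 4.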

Step 4: Given all the transition sequences of the data sets of all nodes in any cluster for some day, the forward algorithm for hidden Markov models is used to compute the probability of occurrence of each sequence under each HMM trained in step 3, and the HMM that maximizes the sum of these probabilities is selected. If the probability of occurrence of a node's transition sequence under this model is less than the preset anomaly threshold, the data of that node are judged abnormal. The smaller the probability, the worse the fit to the model and the more likely the data are abnormal; the anomaly threshold is set to 0.1 in this experiment. Let m_j denote the j-th HMM, let P(i|m_j) be the probability that the sequence labeled i occurs under m_j, and let T_i be the length of this sequence; the normalized log-probability is L(i|m_j) = (1/T_i) log P(i|m_j).

We use the data of March 17 and March 18 to check the accuracy of the four models trained in step 3, taking the cluster formed by nodes 4, 6, 7, 8, 9, 10, 11, 12 and 13 as an example; the fitting results are shown in Table 1 and illustrated in Fig. 5. The data in Table 1 are the values of L(i|m_j) under models 1, 2, 3 and 4 for the two days of data of the 9 nodes in this cluster. The table shows that the data of March 17 fit model 3 best and the data of March 18 fit model 4 best. The temperature readings of node 3 are abnormal on both days, exceeding 100 degrees; the curves are shown in Fig. 5(a) and Fig. 5(b) (solid lines are temperature curves, dotted lines are voltage curves, and the bold lines are the temperature and voltage of node 3). For ease of observation, we then remove node 3. On March 17, among the remaining nodes, node 4 has the smallest L(i|m_j); its curve shows that the voltage of node 4 is higher than that of the other nodes, as shown in Fig. 5(c). On March 18, among the remaining nodes, the L(i|m_j) values of nodes 2, 7 and 9 are much smaller than those of the other nodes; their curves are shown in Figs. 5(d), 5(e) and 5(f), and these nodes can be identified as abnormal nodes.

Table 1. Model fitting results for the data of March 17 and March 18
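The scoring in step 4 can be sketched with a scaled forward algorithm. To keep the example short it assumes a discrete-emission HMM given as a (start, trans, emit) tuple, whereas the embodiment uses Gaussian mixture emissions; it also selects the best model by the sum of the normalized values L(i|m_j) and compares L against a caller-supplied threshold. These choices, and all function names, are assumptions of the sketch rather than details confirmed by the text.

```python
import math

def forward_log_prob(seq, start, trans, emit):
    """Scaled forward algorithm: returns log P(seq | model)."""
    n = len(start)
    alpha = [start[s] * emit[s][seq[0]] for s in range(n)]
    log_prob = 0.0
    for t in range(len(seq)):
        if t > 0:
            alpha = [sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][seq[t]]
                     for s in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return -math.inf  # the model cannot produce this sequence
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_prob

def normalized_log_likelihood(seq, model):
    """L(i|m_j) = (1/T_i) * log P(i|m_j)."""
    return forward_log_prob(seq, *model) / len(seq)

def detect(sequences, models, threshold):
    """Pick the best-fitting model, then flag nodes whose L falls below threshold.

    sequences: {node_label: observation sequence}
    models:    {model_label: (start, trans, emit)}
    """
    scores = {j: {i: normalized_log_likelihood(seq, model)
                  for i, seq in sequences.items()}
              for j, model in models.items()}
    best = max(scores, key=lambda j: sum(scores[j].values()))
    anomalies = sorted(i for i, value in scores[best].items() if value < threshold)
    return best, anomalies
```

Working with log-probabilities and rescaling at each step avoids numerical underflow on longer sequences, which is also why the normalized quantity L(i|m_j) is convenient for comparing nodes.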

In summary, the sensor network anomaly detection method based on hierarchical decision-making and HMMs is effective.

Claims

1. A method for detecting anomalies in a sensor network, in which the network is divided into clusters and each cluster performs anomaly detection as follows:

Step 3) For the node high-dimensional data transition models belonging to the j-th class, all high-dimensional data sequences in their corresponding unit times are pooled to form a new training sample, and the HMM construction method is used to build the j-th class node high-dimensional data transition model, where j = 1, 2, ..., N1 and N1 is the number of classes obtained in step 2);

Step 4) Given the transition sequences of the data sets of all nodes in the cluster for some day, the hidden Markov forward algorithm is used to compute the probability of occurrence of every sequence under the node high-dimensional data transition model of each class j; the node high-dimensional data transition model that maximizes the sum of the probabilities is selected, and if the probability of occurrence of a node's transition sequence under this model is less than the preset anomaly threshold, the data of that node are judged abnormal;

The network clustering is carried out as follows:

Step 01: Each node whose role has not been identified compares its own ID with the IDs of the neighbor nodes within h hops; if its own ID is the largest, it identifies itself as a cluster head and sets its level flag to 0;

Step 02: Each node whose role has not been identified computes its similarity to the cluster heads within h hops; if any similarity exceeds the correlation threshold, the node identifies itself as a member node;

Step 03: Repeat steps 01 to 02 until the roles of all nodes have been identified;

Step 04: Each member node computes its similarity to all cluster heads within one hop and finds the cluster head T with the greatest similarity; if the greatest similarity exceeds the correlation threshold, the member node is confirmed as a child node of cluster head T and its level flag is set to 1;

Step 05: Each member node without a level flag computes its similarity to all member nodes within h hops that already have a level flag, and selects the member node S with the greatest similarity and a level flag ≤ h-1; the level flag of this member node is set to the level flag of member node S plus 1, and this member node is confirmed as a child node of member node S;

Step 06: Repeat step 05 until the level flags of all member nodes have been assigned;

The similarity is computed as follows: let x_k and x_l be the high-dimensional data sequences collected within one unit time by nodes S_k and S_l respectively; the similarity r_{k, l} is defined as:

r_{k, l} = \frac{E (x_{k} x_{l}) - E (x_{k}) E (x_{l})}{\sqrt{E (x_{k}^{2}) - E^{2} (x_{k})} \sqrt{E (x_{l}^{2}) - E^{2} (x_{l})}}

where E(·) denotes the mean.

2. The method for detecting anomalies in a sensor network according to claim 1, wherein step 2) uses the K-means algorithm for classification.

3. The method for detecting anomalies in a sensor network according to claim 1, characterized in that the unit time is an integral multiple of 12 hours.