CN111601358B - Multi-stage hierarchical clustering spatial correlation temperature perception data redundancy removing method - Google Patents


Info

Publication number: CN111601358B
Authority: CN (China)
Prior art keywords: cluster, data, node, clustering, redundancy
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Application number: CN202010361344.0A
Other languages: Chinese (zh)
Other versions: CN111601358A
Inventors: 朱容波, 王俊, 李媛丽
Current assignee: South Central Minzu University
Original assignee: South Central University for Nationalities
Application filed by South Central University for Nationalities; priority to CN202010361344.0A
Publication of CN111601358A; application granted; publication of CN111601358B


Classifications

    • H04W 40/20 — Communication route or path selection, e.g. power-based or shortest path routing, based on geographic position or location (under H: Electricity; H04: Electric communication technique; H04W: Wireless communication networks; H04W 40/00: Communication routing or communication path finding; H04W 40/02: Communication route or path selection)
    • H04W 40/248 — Connectivity information update (under H04W 40/24: Connectivity information management, e.g. connectivity discovery or connectivity update)
    • H04W 84/18 — Self-organising networks, e.g. ad-hoc networks or sensor networks (under H04W 84/00: Network topologies)

Abstract

The invention discloses a multi-stage hierarchical clustering spatially correlated temperature sensing data redundancy removal method, which comprises the following steps. Step 1: acquire a large amount of temperature sensing data collected by a temperature sensor network; on the Sink node, improve the k-Means method with the Euclidean distance and the Pearson distance, and perform node similarity analysis according to the node position coordinates to obtain redundant node clusters. Step 2: at the cluster head (CHs) node of each redundant node cluster, judge the similarity of the intra-cluster data with a Gaussian mixture clustering method, thereby further clustering the intra-cluster nodes by data redundancy. Step 3: after the data-redundancy clusters are obtained, apply random weighting to the data within each cluster to obtain the final redundancy-removal result. Step 4: transmit the redundancy-removed temperature data to the Sink node. The method judges redundant nodes more accurately, so the judgment of redundant data is more accurate and the error of the redundancy-removal result is smaller.

Description

Multi-stage hierarchical clustering spatial correlation temperature perception data redundancy removing method
Technical Field
The invention relates to the technical field of wireless sensor networks, and in particular to a multi-stage hierarchical clustering spatially correlated temperature sensing data redundancy removal method.
Background
Wireless sensor networks (WSNs) are deployed in an area to monitor physical phenomena such as temperature, humidity and seismic events. To obtain accurate information about the environment or events, a large number of sensing nodes are deployed to collect data, which they report to the aggregation node (Sink) at high frequency. The data generated by the sensor nodes generally have high spatio-temporal correlation and contain a large amount of redundancy, and transmitting redundant data causes unnecessary power consumption. How to reduce the energy consumed transmitting redundant data and extend the lifetime of WSNs is therefore a very important issue.
One line of work studies the spatio-temporal correlation by running two synchronized predictors, one on the sensor node and one on the Sink. If the prediction error is smaller than a given threshold, the sensor node does not send data to the Sink, and the Sink takes the predicted value as the sensed data; this reduces the cost of data transmission and communication energy and prolongs the lifetime of the network. However, this approach increases the computational complexity of every sensor and cannot guarantee the true reliability of the predicted values. Likewise, judging redundant nodes only by node position lacks accuracy.
To address the inaccurate judgment of redundant nodes in WSNs caused by insufficient judgment conditions, a staged hierarchical clustering similarity redundancy removal method (TSDA) is proposed. The method comprises three stages. In the first stage, the Sink judges node similarity with an improved k-Means algorithm based on node position information and clusters all nodes. In the second stage, the cluster heads (CHs) judge the similarity of the sensed data generated at the same moment by the nodes within a cluster using a Gaussian mixture clustering algorithm, so that intra-cluster node similarity is judged accurately. In the third stage, the sensed data of similar nodes within a cluster are randomly weighted as the redundancy-removal result, which is then transmitted and stored. The algorithm is suitable for clustered networks and mainly comprises a k-Means classification model, a Gaussian mixture classification model and a random-weighting redundancy-removal model. Removing redundant data according to both the node positions and the similarity of the sensed data of intra-cluster nodes can effectively improve the accuracy of node similarity, improve the judgment of redundant data, and further extend the life cycle of the network.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a multi-stage hierarchical clustering spatially correlated temperature sensing data redundancy removal method, addressing the defect of the prior art that judging redundant nodes in WSNs only by node position provides insufficient judgment conditions and therefore judges redundant nodes inaccurately.
The technical scheme adopted by the invention for solving the technical problems is as follows:
the invention provides a multi-stage hierarchical clustering space correlation temperature perception data redundancy removing method, which comprises the following steps:
Step 1: acquire a large amount of temperature sensing data collected by a temperature sensor network; on the Sink node, improve the k-Means method with the Euclidean distance and the Pearson distance, and perform node similarity analysis according to the node position coordinates to obtain redundant node clusters;
Step 2: at the cluster head (CHs) node of each redundant node cluster, judge the similarity of the intra-cluster data with a Gaussian mixture clustering method, thereby further clustering the intra-cluster nodes by data redundancy;
Step 3: after the data-redundancy clusters are obtained, apply random weighting to the data within each data-redundancy cluster to obtain the final redundancy-removal result;
Step 4: transmit the redundancy-removed temperature data to the Sink node.
Further, in step 1 of the invention, the k-Means method is improved with the Euclidean distance and the Pearson distance as follows:
The spatial similarity distance D(i, j) of two nodes is:
D(i, j) = D_E(i, j) + β · D_P(i, j)
where the Euclidean distance D_E(i, j) is:
D_E(i, j) = √( (x_i − x_j)² + (y_i − y_j)² )
and the Pearson correlation distance D_P(i, j) is:
D_P(i, j) = 1 − ρ(i, j)
where ρ(i, j) is the Pearson correlation coefficient between the feature vectors of nodes s_i and s_j, and β is a scale factor that controls the weight of D_P(i, j) in D(i, j). The spatial position coordinates of the n SNs nodes in the sensor network S1 are (x_i, y_i), 1 ≤ i ≤ n, and the nodes are represented as the set S = {s_1, s_2, …, s_n}. The Sink node runs the improved k-Means algorithm on the coordinate set L = {l_1, l_2, …, l_n} corresponding to the nodes of S, where l_i = (x_i, y_i) and 1 ≤ i ≤ n, and divides the n nodes of S = {s_1, s_2, …, s_n} into K mutually disjoint subsets C_i, where C = {C_1, C_2, …, C_K}, C_1 ∪ C_2 ∪ … ∪ C_K = S, and C_i ∩ C_j = ∅ for i ≠ j. The improved k-Means algorithm thus clusters S = {s_1, s_2, …, s_n} and obtains the cluster partition C = {C_1, C_2, …, C_K}.
Further, the improved k-Means algorithm in step 1 of the present invention comprises the following specific steps:
Step 1.1: set the number k of cluster centers of the improved k-Means algorithm;
Step 1.2: randomly select k nodes from the sensor network S1 as the initial means {μ_1, μ_2, …, μ_k};
Step 1.3: compute the spatial similarity distance D(i, j) between each position coordinate l_j and each mean vector μ_i (1 ≤ i ≤ k): D(i, j) ← D_E(i, j) + β · D_P(i, j);
Step 1.4: determine the cluster of node position l_j by the μ_i at minimum distance: λ_j = argmin_{i ∈ {1, …, k}} D(i, j), and add l_j to cluster C_λj;
Step 1.5: update each μ_i: μ_i = (1 / |C_i|) Σ_{l ∈ C_i} l;
Step 1.6: repeat steps 1.3 to 1.5 until the mean vectors no longer change, which yields the clustering result.
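The loop of steps 1.1 to 1.6 can be sketched in Python with NumPy (the experiments below use Python 3.6). The function names, the default β, and the fallback for constant vectors are illustrative assumptions; the distance follows the dual metric D = D_E + β·D_P described above.

```python
import numpy as np

def pearson_distance(a, b):
    """D_P = 1 - Pearson correlation; treated as 0 when a vector is constant."""
    if np.std(a) == 0 or np.std(b) == 0:
        return 0.0
    return 1.0 - np.corrcoef(a, b)[0, 1]

def dual_distance(a, b, beta=0.5):
    """Spatial similarity distance D = D_E + beta * D_P."""
    return np.linalg.norm(a - b) + beta * pearson_distance(a, b)

def improved_kmeans(points, k, beta=0.5, iters=100, seed=0):
    """Steps 1.1-1.6: k-Means under the dual-metric distance."""
    rng = np.random.default_rng(seed)
    # Step 1.2: pick k distinct points as initial means
    means = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # Steps 1.3-1.4: assign each point to the nearest mean under D
        labels = np.array([
            int(np.argmin([dual_distance(p, m, beta) for m in means]))
            for p in points
        ])
        # Step 1.5: recompute each cluster mean (keep old mean if cluster empty)
        new_means = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
            for j in range(k)
        ])
        # Step 1.6: stop once the means no longer change
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means
```

On 2-D coordinates the Pearson term is coarse (the correlation of two 2-vectors is ±1), so the Euclidean component dominates the assignment, as in the position-based first stage.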
Further, the method in step 2 of the present invention is specifically:
The wireless sensor network S1 consists of K clusters, where all data produced by a cluster is represented as the set X = {X_1, X_2, …, X_n}; X_i = {x_i(t_1), x_i(t_2), …, x_i(t_m)}, 1 ≤ i ≤ n, is the time-series set generated by sensor node s_i every T seconds. Each cluster head CH in the wireless sensor network continues to classify and cluster the intra-cluster nodes by data correlation: the Gaussian mixture clustering algorithm divides the sample set D of data sensed at the same time by the nodes of one spatial-similarity cluster into K1 clusters, and the division result is the K1 clusters C = {C_i1, C_i2, C_i3, …, C_iK1}, 0 < i ≤ K1.
Further, the Gaussian mixture clustering method adopted in step 2 of the present invention is specifically:
Let the random variable z_j ∈ {1, 2, …, K1} denote the Gaussian mixture component of the sensed data x_j of node j; z_j is an unknown random value. The prior probability P(z_j = i) corresponds to α_i (i = 1, 2, …, K1). According to Bayes' theorem, the posterior distribution of z_j corresponds to:
γ_ji = p_M(z_j = i | x_j) = α_i · p(x_j | μ_i, Σ_i) / Σ_{l=1}^{K1} α_l · p(x_j | μ_l, Σ_l)
that is, γ_ji is the posterior probability that sample x_j was generated by the i-th Gaussian mixture component. After the Gaussian mixture distribution is obtained, Gaussian mixture clustering divides the sample set D into K1 clusters C = {C_i1, C_i2, C_i3, …, C_iK1}, 0 < i ≤ K1, and the cluster label λ_j of each sample x_j is:
λ_j = argmax_{i ∈ {1, …, K1}} γ_ji
The model parameters {(α_i, μ_i, Σ_i) | 1 ≤ i ≤ K1} are solved by maximizing the log-likelihood:
LL(D) = Σ_{j=1}^{m} ln( Σ_{i=1}^{K1} α_i · p(x_j | μ_i, Σ_i) )
Iterative optimization with the EM algorithm then yields the partition of the sample set D.
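Given learned mixture parameters, the posterior γ_ji and the cluster label λ_j = argmax_i γ_ji can be computed as below. This is a minimal univariate sketch (temperature samples are scalar); the function names are hypothetical.

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """Univariate Gaussian density p(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def posterior(x, alphas, mus, vars_):
    """gamma_ji: posterior probability of each mixture component given sample x."""
    weighted = np.array([a * gauss_pdf(x, m, v)
                         for a, m, v in zip(alphas, mus, vars_)])
    return weighted / weighted.sum()

def cluster_labels(samples, alphas, mus, vars_):
    """lambda_j = argmax_i gamma_ji for every sample."""
    return np.array([int(np.argmax(posterior(x, alphas, mus, vars_)))
                     for x in samples])
```

With two components centered at 20 °C and 25 °C, readings near 20 °C are labeled with the first component and readings near 25 °C with the second, which is exactly the per-time-slot similarity grouping the CHs perform.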
Further, the method in step 3 of the present invention is specifically:
According to the cluster partition C = {C_i1, C_i2, C_i3, …, C_iK1} obtained in step 2, the CHs perform a random weighted average of the data generated by the nodes within each data-similarity cluster; the redundancy-removal result x'(t_j) is:
x'(t_j) = β_1 · x_w(t_j) + β_2 · x_a(t_j) + … + β_v · x_b(t_j)
where β_1, β_2, …, β_v are weighting factors with β_1 + β_2 + … + β_v = 1, and x_w(t_j), x_a(t_j), …, x_b(t_j) are the sensed data generated at time t_j by nodes s_w, s_a, …, s_b, which belong to the same data-similarity cluster.
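The random weighted average can be sketched as follows. The name `deredundancy` is hypothetical, and drawing the weights uniformly at random before normalising them is an illustrative choice not fixed by the text; only the constraint Σβ = 1 comes from the method.

```python
import numpy as np

def deredundancy(samples, weights=None, rng=None):
    """Random weighted average of the sensed data in one similarity cluster.

    If weights is omitted, random weights beta_1..beta_v are drawn and
    normalised so that they sum to 1, as the method requires.
    """
    samples = np.asarray(samples, dtype=float)
    if weights is None:
        rng = rng or np.random.default_rng()
        weights = rng.random(len(samples))
        weights = weights / weights.sum()   # enforce sum(beta) == 1
    return float(np.dot(weights, samples))
```

With equal weights β_1 = … = β_v = 1/v (the choice used in the experiments), the result reduces to the plain mean of the cluster's readings; any valid weight vector yields a convex combination lying between the minimum and maximum reading.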
the invention has the following beneficial effects: the invention discloses a multi-stage hierarchical clustering space correlation temperature perception data redundancy removing method, which comprises the steps of performing redundancy removing processing on space redundancy nodes in three stages; in the process of removing redundancy of the sensing data, the redundant node can be judged more accurately, so that the judgment of the redundant data is more accurate, and the error of the result after removing redundancy is smaller. The invention improves the algorithm, so that the redundant data is removed more reasonably, and the network energy consumption is effectively reduced; experiments show that 70% of spatial redundancy data can be reduced, the data error is 0.2 ℃ on average, and meanwhile, 1.25% of energy consumption can be further reduced.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flowchart of the multi-stage hierarchical clustering spatial-correlation temperature sensing data redundancy removal method according to an embodiment of the present invention;
FIG. 2 is an algorithm flow diagram of an embodiment of the present invention;
FIG. 3 is the system model of an embodiment of the invention;
FIG. 4 shows the raw data of a node in an embodiment of the present invention;
FIG. 5 shows node classification clustering by the improved k-Means in an embodiment of the present invention;
FIG. 6 shows the data similarity distribution of cluster C1 in an embodiment of the present invention;
FIG. 7 compares the data of cluster C13 before and after redundancy removal in an embodiment of the present invention;
FIG. 8 compares the errors of cluster C13 before and after redundancy removal in an embodiment of the present invention;
FIG. 9 shows the impact of K1 on the data redundancy-removal rate in an embodiment of the present invention;
FIG. 10 shows the impact of K1 on network energy consumption in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The temperature sensing data of the Intel Berkeley laboratory are used for analysis in the embodiments of the present invention. This temperature dataset contains about 40,000 readings per sensor node for 54 sensor nodes in total, a data volume of roughly 2,000,000 readings. The embodiment carries out the related work on the laboratory's temperature sensing data; the overall flow of the algorithm is shown in FIG. 1 and FIG. 2, and the system model in FIG. 3.
Step 1: First, node similarity analysis is performed on the Sink according to the node position coordinates. Accurate clustering requires a precise definition of the closeness between samples, based on pairwise similarity or distance. Among the various distances, the Euclidean distance is probably the most common for numerical data. However, the Euclidean distance describes only the magnitude difference of two feature vectors: the Euclidean distance between two differently shaped feature vectors may be smaller than that between similarly shaped ones. The correlation distance, in contrast, measures the difference in direction of two vectors rather than their magnitude. Therefore, the spatial similarity distance D(i, j) between two nodes is:
D(i, j) = D_E(i, j) + β · D_P(i, j)    (1)
where the Euclidean distance D_E(i, j) is:
D_E(i, j) = √( (x_i − x_j)² + (y_i − y_j)² )
and the Pearson correlation distance D_P(i, j) is:
D_P(i, j) = 1 − ρ(i, j)
where ρ(i, j) is the Pearson correlation coefficient and β is a scale factor that controls the weight of D_P(i, j) in D(i, j). This dual-metric distance satisfies the three distance properties: positivity, symmetry, and reflexivity. Under the dual-metric distance, any pair of feature vectors can be compared both in magnitude, via the Euclidean component, and in shape, via the correlation component.
The spatial position coordinates of the n SNs nodes in the sensor network S1 are (x_i, y_i), 1 ≤ i ≤ n, and the nodes are represented as the set S = {s_1, s_2, …, s_n}. The Sink runs the improved k-Means algorithm on the coordinate set L = {l_1, l_2, …, l_n} corresponding to the nodes of S, where l_i = (x_i, y_i) and 1 ≤ i ≤ n, and divides the n nodes of S into K mutually disjoint subsets C_i, where C = {C_1, C_2, …, C_K}, C_1 ∪ C_2 ∪ … ∪ C_K = S, and C_i ∩ C_j = ∅ for i ≠ j. The improved k-Means algorithm clusters S = {s_1, s_2, …, s_n} and obtains the cluster partition C = {C_1, C_2, …, C_K}, where the minimized squared error E of the clustering is:
E = Σ_{i=1}^{K} Σ_{l ∈ C_i} || l − μ_i ||²
where μ_i = (1 / |C_i|) Σ_{l ∈ C_i} l is the mean vector of cluster C_i.
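The minimized squared error E can be computed directly from a partition; a small NumPy sketch with assumed names:

```python
import numpy as np

def squared_error(points, labels, k):
    """E = sum_i sum_{l in C_i} ||l - mu_i||^2, the k-Means objective."""
    e = 0.0
    for i in range(k):
        cluster = points[labels == i]
        if len(cluster):
            mu = cluster.mean(axis=0)            # mu_i: mean vector of cluster C_i
            e += ((cluster - mu) ** 2).sum()     # squared distances to mu_i
    return e
```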
Step 2: After the similar-cluster division of the spatial node positions in the first step, and because Gaussian mixture clustering can quantify objects accurately, the second stage further performs similarity analysis on the data collected at the same time within the same cluster using the Gaussian mixture clustering algorithm, so that the spatial-correlation redundancy of the nodes is judged more accurately. The core of Gaussian mixture clustering is a probabilistic model: the prototype data are analysed and described with a probability model, and the cluster division is determined mainly by the posterior probability corresponding to each prototype.
The Gaussian distribution is defined for a random variable x in an n-dimensional sample space X; if x follows a Gaussian distribution, its probability density function p(x) is:
p(x) = 1 / ( (2π)^{n/2} |Σ|^{1/2} ) · exp( −(1/2) (x − μ)^T Σ^{−1} (x − μ) )
where μ is an n-dimensional mean vector and Σ is an n × n covariance matrix. The Gaussian distribution is completely determined by these two parameters, the mean vector μ and the covariance matrix Σ. For convenience of description, the dependence of the Gaussian density on its parameters is written p(x | μ, Σ).
The Gaussian mixture distribution p_M is:
p_M(x) = Σ_{i=1}^{K1} α_i · p(x | μ_i, Σ_i)
This distribution consists of K1 mixture components, each corresponding to one Gaussian distribution, where μ_i and Σ_i are the parameters of the i-th Gaussian mixture component and α_i > 0 is the corresponding "mixing coefficient", with Σ_{i=1}^{K1} α_i = 1.
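The density p(x | μ, Σ) and the mixture p_M(x) can be evaluated as below; a NumPy sketch with assumed function names.

```python
import numpy as np

def mvn_pdf(x, mu, cov):
    """n-dimensional Gaussian density p(x | mu, Sigma)."""
    n = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(cov))
    # solve(cov, diff) computes Sigma^{-1} (x - mu) without an explicit inverse
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def mixture_pdf(x, alphas, mus, covs):
    """p_M(x) = sum_i alpha_i * p(x | mu_i, Sigma_i)."""
    return sum(a * mvn_pdf(x, m, c) for a, m, c in zip(alphas, mus, covs))
```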
the wireless sensor network S1 is composed of K clusters, where all data generated by a cluster can be represented as a set X ═ X1,X2,…,Xn}。Xi={xi(t1),xi(t2),…,xi(t2) Wherein i is more than or equal to 1 and less than or equal to n is a sensor node s per T secondsiThe generated time series set. Each cluster head CH in the whole wireless sensor network continues to classify and cluster the data correlation of the nodes in the cluster, and the Gaussian mixture clustering algorithm is adopted to collect the data sensed at the same time in the similar cluster of the same spatial node
Figure GDA0002957597100000079
Is divided into component K1A cluster of 1 ≦ j&&1≤h≤K1
Let the random variable z_j ∈ {1, 2, …, K1} denote the Gaussian mixture component of the sensed data x_j of node j; z_j is an unknown random value. The prior probability P(z_j = i) corresponds to α_i (i = 1, 2, …, K1). According to Bayes' theorem, the posterior distribution of z_j corresponds to:
γ_ji = p_M(z_j = i | x_j) = α_i · p(x_j | μ_i, Σ_i) / Σ_{l=1}^{K1} α_l · p(x_j | μ_l, Σ_l)
that is, γ_ji is the posterior probability that sample x_j was generated by the i-th Gaussian mixture component. After the Gaussian mixture distribution is obtained, Gaussian mixture clustering divides the sample set D into K1 clusters C = {C_i1, C_i2, C_i3, …, C_iK1} (0 < i ≤ K1), and the cluster label λ_j of each sample x_j is:
λ_j = argmax_{i ∈ {1, …, K1}} γ_ji
The model parameters {(α_i, μ_i, Σ_i) | 1 ≤ i ≤ K1} are solved by maximizing the log-likelihood:
LL(D) = Σ_{j=1}^{m} ln( Σ_{i=1}^{K1} α_i · p(x_j | μ_i, Σ_i) )
with iterative optimization by the EM algorithm.
If the parameters (α_i, μ_i, Σ_i) maximize LL(D), then setting ∂LL(D)/∂μ_i = 0 gives
μ_i = Σ_{j=1}^{m} γ_ji · x_j / Σ_{j=1}^{m} γ_ji
that is, the mean of each mixture component can be estimated as a weighted average of the samples, where the weight of each sample is the posterior probability γ_ji that it belongs to that component. Similarly, setting ∂LL(D)/∂Σ_i = 0 gives
Σ_i = Σ_{j=1}^{m} γ_ji (x_j − μ_i)(x_j − μ_i)^T / Σ_{j=1}^{m} γ_ji
For the mixing coefficients α_i, in addition to maximizing LL(D), the constraints α_i ≥ 0 and Σ_{i=1}^{K1} α_i = 1 must be satisfied. Consider the Lagrangian form of LL(D):
LL(D) + ρ ( Σ_{i=1}^{K1} α_i − 1 )
where ρ is the Lagrange multiplier. Setting the derivative of this Lagrangian with respect to α_i to 0 gives
Σ_{j=1}^{m} p(x_j | μ_i, Σ_i) / ( Σ_{l=1}^{K1} α_l · p(x_j | μ_l, Σ_l) ) + ρ = 0
Multiplying both sides by α_i and summing over all mixture components yields ρ = −m, and therefore
α_i = (1/m) Σ_{j=1}^{m} γ_ji
i.e., the mixing coefficient of each Gaussian component is the average posterior probability that a sample belongs to that component.
Thus, the sample set D is divided into K1 clusters C = {C_i1, C_i2, C_i3, …, C_iK1} (0 < i ≤ K1).
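The closed-form updates for μ_i, Σ_i and α_i give the EM iteration. The sketch below specialises to univariate data, as the temperature samples here are scalar; the quantile initialisation and the small variance floor are illustrative choices, not part of the described method.

```python
import numpy as np

def em_gmm_1d(x, k, iters=100):
    """EM for a univariate Gaussian mixture, implementing the updates
    mu_i    = sum_j g_ji * x_j / sum_j g_ji
    var_i   = sum_j g_ji * (x_j - mu_i)^2 / sum_j g_ji
    alpha_i = (1/m) * sum_j g_ji
    """
    x = np.asarray(x, dtype=float)
    m = len(x)
    # simple deterministic initialisation: spread the means over the data range
    mus = np.quantile(x, np.linspace(0.0, 1.0, k))
    vars_ = np.full(k, x.var() + 1e-6)
    alphas = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities gamma[i, j] = P(z_j = i | x_j)
        dens = np.array([a * np.exp(-(x - mu) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
                         for a, mu, v in zip(alphas, mus, vars_)])
        gamma = dens / dens.sum(axis=0)
        # M-step: closed-form parameter updates derived above
        nk = gamma.sum(axis=1)
        mus = (gamma @ x) / nk
        vars_ = np.array([(g * (x - mu) ** 2).sum()
                          for g, mu in zip(gamma, mus)]) / nk + 1e-6
        alphas = nk / m
    labels = gamma.argmax(axis=0)    # lambda_j = argmax_i gamma_ji
    return labels, mus, vars_, alphas
```

On readings drawn from two well-separated temperature groups, the iteration recovers one component per group and the mixing coefficients sum to 1, as the derivation requires.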
Step 3: According to the cluster partition C = {C_i1, C_i2, C_i3, …, C_iK1} obtained in step 2, the CHs perform a random weighted average of the data generated by the nodes within each data-similarity cluster; the redundancy-removal result x'(t_j) is:
x'(t_j) = β_1 · x_w(t_j) + β_2 · x_a(t_j) + … + β_v · x_b(t_j)
where β_1, β_2, …, β_v are weighting factors with β_1 + β_2 + … + β_v = 1, and x_w(t_j), x_a(t_j), …, x_b(t_j) are the sensed data generated at time t_j by nodes s_w, s_a, …, s_b of the same data-similarity cluster.
Step 4: The redundancy-removal result x'(t_j) from step 3 is transmitted to the aggregation node Sink.
And (3) redundancy removal and energy consumption analysis of experimental data:
In this experimental part, data from the Intel Berkeley Research Lab are used for the analysis. The laboratory deployed 54 sensor nodes in total, each monitoring the temperature changes at a different position in the laboratory. The sensor nodes collect data every 0.5 minutes and collected data for about one month, so each node contributes about 40,000 readings and the 54 nodes together about 2,000,000, a huge data volume. In the preliminary test, the 40,000 readings of one node are mainly used for analysis and improvement. The raw data for node 1 are shown in FIG. 4.
In FIG. 4, the x-axis represents time and the y-axis temperature. The temperature varies dramatically with time: it reaches a minimum near 430 minutes, increases to a maximum peak around 750 minutes, decreases again to a minimum at 1800 minutes, then rises once more to a maximum at 2250 minutes before decreasing again. Because the variability of the data is so large, the extrema around 430, 750, 1800 and 2250 minutes are the most distinctive points in the data. Therefore, during data redundancy removal, particular attention must be paid to the redundancy-removal behaviour at these four positions.
To verify the performance of the proposed method, the study was simulated using python 3.6. The model uses a single hop approach to transmit data. In order to verify a data transmission algorithm and a network life cycle in the network, a data transmission model and a node energy consumption model are adopted to describe data transmission and node energy consumption conditions in the network. The model parameters are shown in table 1.
Table 1 parameter set-up for simulation experiments
[Table 1 appears as an image in the original and is not reproduced here.]
(1) In the first stage, the Sink runs the improved k-Means clustering algorithm to classify and cluster the nodes according to their coordinate positions. Assuming k = 4 and β ∈ {0, 0.3, 0.5, 0.7, 1}, the four classification clustering results change noticeably with β. The diamonds in the figure represent the cluster centers of the four clusters; the classification clustering results are shown in FIG. 5.
As can be seen from FIG. 5, the classification of the 54 nodes mainly shows two cases: 1) the node classification varies with β without altering the cluster assignment; 2) the node classification varies with β and alters the cluster assignment. The nodes that change noticeably in the figure are S = {0, 2, 5, 9, 10, 19, 20, 32, 33, 45, 46}. Labelling the classification clusters clockwise from the top left as cluster C1, cluster C2, cluster C3 and cluster C4, the probability of each of the four classes is computed for every node that changes noticeably, and each such node is then assigned to the cluster with the maximum probability. The corresponding probability distribution is shown in Table 2.
TABLE 2 node distribution Cluster probability
[Table 2 appears as images in the original and is not reproduced here.]
Based on the results in Table 2, the Sink assigns each node that changes easily to its corresponding class according to the probability ratio, so that the final classification of all nodes into cluster C1, cluster C2, cluster C3 and cluster C4 is:
C1={0,2,21,22,23,24,25,26,27,28,29,30,31,32,33};
C2={1,34,35,36,37,38,39,40,41,42,43,44};
C3={3,4,5,6,7,8,9,45,46,47,48,49,50,51,52,53};
C4={10,11,12,13,14,15,16,17,18,19,20}。
(2) In the second stage, the cluster heads CHs of clusters C1, C2, C3 and C4 each run the Gaussian mixture clustering algorithm, continuously acquiring the sensed data of the intra-cluster nodes at two different moments to analyse the data similarity between nodes. After a period of data-similarity judgment, the final classification of the nodes in the four clusters C1, C2, C3 and C4 is computed. The similarity classification result between two consecutive sensed data for each node of cluster C1 is shown in FIG. 6.
In FIG. 6, the abscissa represents the earlier of two consecutively sensed data and the ordinate the later one, and the similarity clusters among the nodes of C1 are evident from the figure. Cluster C1 is divided into C11 = {22, 25, 28, 30, 32}, C12 = {23, 24, 26}, C13 = {27, 29, 31, 33} and C14 = {0, 2, 21}.
In the same way, cluster C2 is divided into C21 = {1, 34, 35, 36}, C22 = {37, 38, 39} and C23 = {40, 41, 42, 43, 44}; cluster C3 is divided into C31 = {3, 4, 5, 6, 7}, C32 = {8, 9, 45, 46} and C33 = {47, 48, 49, 50, 51, 52, 53}; cluster C4 is divided into C41 = {10, 11, 12}, C42 = {13, 14, 15, 16} and C43 = {17, 18, 19, 20}.
(3) In the third stage, the data in each data-similarity cluster are randomly weighted to obtain the final redundancy-removal result. This stage mainly takes sub-cluster C13 = {27, 29, 31, 33} of cluster C1 as an example. For convenience of calculation, the random weighting factors are set to β_1 + β_2 + … + β_v = 1 with β_1 = β_2 = … = β_v. The data after redundancy removal are then analysed, as shown in FIG. 7, together with their error relative to the original data, as shown in FIG. 8.
TABLE 3 Cluster C13Mean error comparison of middle nodes
[Table 3 appears as an image in the original and is not reproduced here.]
As seen in FIG. 7, the result data of cluster C13 after redundancy removal tend toward the center of the sensed data of the redundant nodes: the sensed data of nodes 29 and 33 lie close to the result data, whereas nodes 27 and 31 are comparatively far from them. FIG. 8 shows the error between each sensed datum of every node in cluster C13 and the corresponding result datum after redundancy removal; clearly, the errors of nodes 29 and 33 are much lower than those of nodes 27 and 31. From the mean error of the sensed data of each node of cluster C13 in Table 3, the mean error of node 27 is 0.348, of node 29 is 0.043, of node 31 is 0.337, and of node 33 is 0.056. This indicates that even data belonging to the same similarity class can still differ considerably. Therefore, a method that performs similarity analysis using only the coordinate positions of the nodes lacks the spatial correlation analysis between the data, which in turn may cause larger errors between the data. The staged, hierarchical clustering approach guarantees the accuracy of the similarity between node data.
(4) The data redundancy-removal rate of the TSDA algorithm is related to the data-similarity cluster number K1 of the second stage, and the value of K1 in turn affects the accuracy of the data correlation: the larger K1, the more accurate the data correlation. The influence of varying K1 (0 < K1 ≤ the number of nodes in cluster C_i) on the redundancy-removal rate is further analysed, as shown in FIG. 9, and its effect on energy consumption, as shown in FIG. 10.
As can be seen from FIG. 9, the de-redundancy rate of the system gradually decreases as K1 grows. When K1 = 1, the clusters C1, C2, C3 and C4 are not divided into similar sub-clusters; instead, all nodes in each of C1, C2, C3 and C4 are placed into a single similar cluster, every node in each cluster is regarded as a redundant node, and random weighted de-redundancy is performed over them to obtain the result data, so the de-redundancy rate is at its maximum. However, K1 = 1 is equivalent to performing only the node position similarity judgment on all nodes without any data similarity judgment, so the error of the result data is also the largest. When K1 = 10, the nodes in each of the clusters C1, C2, C3 and C4 are divided into 10 data similarity sub-clusters, and the 10 sub-clusters of each of C1, C2, C3 and C4 are randomly weighted separately to obtain the de-redundancy result; the data finally retained are those of the 10 sub-clusters of each of C1, C2, C3 and C4, so the data de-redundancy rate is the lowest while the accuracy of the de-redundant result data is best guaranteed. As a compromise between data accuracy and data de-redundancy rate, K1 = 4 is selected for the network energy consumption analysis.
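The trade-off above can be expressed with a minimal model: per cluster, only the K1 sub-cluster results are retained at each sampling instant, so the de-redundancy rate falls as K1 grows. This sketch assumes every node in a cluster contributes one reading per instant (an assumption of the illustration, not a formula stated in the patent):

```python
def deredundancy_rate(nodes_per_cluster: int, k1: int) -> float:
    """Fraction of per-instant readings removed when a cluster's nodes
    are grouped into k1 data-similarity sub-clusters, each of which
    keeps a single randomly weighted result value."""
    if not 0 < k1 <= nodes_per_cluster:
        raise ValueError("require 0 < K1 <= nodes per cluster")
    return 1.0 - k1 / nodes_per_cluster

# K1 = 1 keeps one value per cluster (maximum removal);
# K1 = nodes_per_cluster keeps every reading (no removal).
for k1 in (1, 4, 10):
    print(k1, deredundancy_rate(10, k1))
```

This reproduces the monotone behavior of FIG. 9: the rate is maximal at K1 = 1 and drops to its minimum when K1 equals the cluster size.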
As can be seen from FIG. 10, the TSDA algorithm curve lies mainly between 97.50% and 98.0%, the TCDA algorithm curve lies mainly between 96.26% and 96.75%, and the gap between the two curves stays at about 1.25%. Therefore, combining the TSDA algorithm with the TCDA algorithm can further remove 70% of the redundant data and further improve the network energy performance by about 1.25%, while keeping the error of the redundant nodes between 0.043 and 0.35. The de-redundancy rate can thus be maximized within the error range allowed by the user.
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (1)

1. A multi-stage hierarchical clustering spatial correlation temperature sensing data de-redundancy method is characterized by comprising the following steps:
step 1: acquiring a large amount of temperature sensing data collected by the temperature sensor network, improving the k-Means method on the Sink node using the Euclidean distance and the Pearson distance, and performing node similarity analysis on the nodes according to their position coordinates to obtain redundant node clusters;
the method of improving the k-Means method using the Euclidean distance and the Pearson distance in step 1 is as follows:
the spatial similarity distance D(i, j) of two nodes is:
D(i, j) = DE(i, j) + βDP(i, j)
wherein the Euclidean distance DE(i, j) is:
DE(i, j) = √((xi − xj)² + (yi − yj)²)
and the Pearson correlation distance DP(i, j) is:
DP(i, j) = 1 − ρ(i, j)
where ρ(i, j) is the Pearson correlation coefficient between nodes i and j;
wherein β is a scale factor representing the influence of DP(i, j) on the weight of D(i, j); the spatial position coordinates of the n SN nodes in the sensor network S1 are (xi, yi), 1 ≤ i ≤ n, and the nodes are represented as the set S = {s1, s2, …, sn}; the Sink node runs the improved k-Means algorithm on the coordinate position set L = {l1, l2, …, ln} corresponding to the nodes of S = {s1, s2, …, sn}, where li = (xi, yi) and 1 ≤ i ≤ n, and divides the n nodes of the set S into K mutually disjoint subsets Ci, where C = {C1, C2, …, CK}, C1 ∪ C2 ∪ … ∪ CK = S, Ci ≠ ∅ for 1 ≤ i ≤ K, and Ci ∩ Cj = ∅ for i ≠ j;
through the improved k-Means algorithm, S = {s1, s2, …, sn} is clustered to obtain the cluster division C = {C1, C2, …, CK};
the specific steps of the improved k-Means algorithm in step 1 are:
step 1.1, setting the number k of clustering centers of the improved k-Means algorithm;
step 1.2, randomly selecting k nodes from the sensor network S1 as the initial means {μ1, μ2, …, μk};
step 1.3, computing the spatial similarity distance D(i, j) between each position coordinate lj and each mean vector μi (1 ≤ i ≤ k): D(i, j) ← DE(i, j) + βDP(i, j);
step 1.4, assigning the node position lj to the cluster whose mean μi gives the minimum distance D(i, j):
λj = argmin(1 ≤ i ≤ k) D(i, j), and C(λj) = C(λj) ∪ {lj};
step 1.5, updating each mean: μi = (1 / |Ci|) Σ(l ∈ Ci) l;
step 1.6, repeating steps 1.3 to 1.5 until the clustering result converges;
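Steps 1.1 to 1.6 can be sketched as below: a minimal illustration assuming 2-D node coordinates, with the Pearson distance taken in its standard form 1 − (Pearson correlation coefficient), since the patent's exact DP formula appears only as an image; the sample coordinates are hypothetical:

```python
import numpy as np

def pearson_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Standard Pearson correlation distance: 1 - correlation coefficient."""
    if np.std(a) == 0 or np.std(b) == 0:
        return 0.0  # degenerate (constant) vector: no correlation term
    return 1.0 - np.corrcoef(a, b)[0, 1]

def spatial_distance(a: np.ndarray, b: np.ndarray, beta: float = 0.5) -> float:
    """D(i,j) = DE(i,j) + beta * DP(i,j) from step 1.3."""
    return float(np.linalg.norm(a - b)) + beta * pearson_distance(a, b)

def improved_kmeans(coords: np.ndarray, k: int, beta: float = 0.5,
                    iters: int = 20, seed: int = 0):
    """Cluster node positions with the combined Euclidean+Pearson distance."""
    rng = np.random.default_rng(seed)
    means = coords[rng.choice(len(coords), size=k, replace=False)]  # step 1.2
    for _ in range(iters):
        # steps 1.3-1.4: assign each node to the nearest mean under D(i,j)
        labels = np.array([
            np.argmin([spatial_distance(c, m, beta) for m in means])
            for c in coords
        ])
        # step 1.5: recompute each mean from its current members
        for i in range(k):
            if np.any(labels == i):
                means[i] = coords[labels == i].mean(axis=0)
    return labels, means

# Four hypothetical node positions forming two well-separated groups.
coords = np.array([[0.0, 0.0], [0.5, 0.2], [9.0, 9.0], [9.5, 8.8]])
labels, means = improved_kmeans(coords, k=2)
print(labels)
```

With two well-separated groups, the assignment of step 1.4 places each pair of nearby nodes into the same cluster regardless of the random initialization.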
step 2: performing similarity judgment on data in the cluster by using a Gaussian mixture clustering method at a cluster head CHs node of redundant node clustering, thereby further performing data redundancy clustering on the nodes in the cluster;
the method in the step 2 specifically comprises the following steps:
the wireless sensor network S1 is composed of K clusters, where all data generated by a cluster is represented as a set X ═ X1,X2,…,Xn};Xi={xi(t1),xi(t2),…,xi(t2) Wherein i is more than or equal to 1 and less than or equal to n is a sensor node s per T secondsiA generated time series set; each cluster head CH in the whole wireless sensor network continues to classify and cluster the intra-cluster nodes according to the data correlation, and a Gaussian mixture clustering algorithm is adopted to collect data D sensed at the same time in the same spatial node similar clusterCHh(tj)={x1(tj),x2(tj),…,xz(tj) Is divided into K1A cluster of 1 ≦ j&&1≤h≤K1(ii) a Sample set
Figure FDA0002957597090000024
The division result is K1Each cluster C ═ Ci1,Ci2,Ci3,...,CiK1},0<i≤K1
the Gaussian mixture clustering method adopted in step 2 is specifically:
let the random variable zj ∈ {1, 2, …, K1} represent the Gaussian mixture component that generates the sensed data xj of node j, whose value is unknown; the prior probability P(zj = i) corresponds to αi (i = 1, 2, …, K1); according to Bayes' theorem, the posterior distribution of zj is:
pM(zj = i | xj) = αi · p(xj | μi, Σi) / Σ(l = 1, …, K1) αl · p(xj | μl, Σl)
the right-hand side gives the posterior probability that sample xj is generated by the i-th Gaussian mixture component, denoted γji;
after the Gaussian mixture distribution is obtained, Gaussian mixture clustering divides the sample set DCHh(tj) into K1 clusters C = {Ci1, Ci2, Ci3, …, CiK1}, 0 < i ≤ K1, where the cluster label λj of each sample xj is:
λj = argmax(i ∈ {1, 2, …, K1}) γji
the model parameters {(αi, μi, Σi) | 1 ≤ i ≤ K1} are solved by maximizing the log-likelihood:
LL(D) = Σ(j) ln( Σ(i = 1, …, K1) αi · p(xj | μi, Σi) )
and iterative optimization with the EM algorithm yields the division result of the sample set DCHh(tj);
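The Gaussian mixture step can be sketched for scalar temperature readings as below: a minimal 1-D EM implementation, not the patent's exact procedure; the value of K1, the deterministic mean initialization, and the sample readings are all assumptions of the illustration:

```python
import numpy as np

def gmm_cluster(x: np.ndarray, k: int, iters: int = 50) -> np.ndarray:
    """1-D Gaussian mixture clustering via EM; returns a label per sample."""
    alpha = np.full(k, 1.0 / k)                 # mixture weights alpha_i
    mu = np.linspace(x.min(), x.max(), k)       # component means, spread out
    var = np.full(k, np.var(x) + 1e-6)          # component variances
    for _ in range(iters):
        # E-step: posterior gamma[j, i] = P(z_j = i | x_j) (Bayes' theorem)
        dens = (np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                / np.sqrt(2 * np.pi * var))
        gamma = alpha * dens
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate alpha, mu, var from the posteriors
        nk = gamma.sum(axis=0)
        alpha = nk / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    # cluster label: lambda_j = argmax_i gamma_ji
    return np.argmax(gamma, axis=1)

# Readings sensed at one instant t_j in one position-similarity cluster:
# two clear temperature groups, around 25.5 and 28.0 degrees.
x = np.array([25.4, 25.5, 25.6, 27.9, 28.0, 28.1])
labels = gmm_cluster(x, k=2)
print(labels)  # → [0 0 0 1 1 1]
```

The E-step computes exactly the posterior γji of the claim, and the label rule is the argmax over components, so the two temperature groups end up in separate data similarity clusters.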
step 3: after the data redundancy clustering is obtained, randomly weighting the data in each data redundancy cluster to obtain the final de-redundancy result;
the method in step 3 is specifically:
according to the cluster division C = {Ci1, Ci2, Ci3, …, CiK1} obtained in step 2, the CH performs a random weighted average over the data generated by the nodes in each data similarity cluster, and the de-redundancy result x̄(tj) is:
x̄(tj) = β1 · xw(tj) + β2 · xa(tj) + … + βv · xb(tj)
wherein β1, β2, …, βv are weighting factors with β1 + β2 + … + βv = 1; xw(tj), xa(tj), …, xb(tj) are the sensing data generated at time tj by the nodes sw, sa, …, sb respectively, and {sw, sa, …, sb} are the nodes of the data similarity cluster;
and 4, step 4: and transmitting the temperature data with the redundancy removed to the Sink node.
CN202010361344.0A 2020-04-30 2020-04-30 Multi-stage hierarchical clustering spatial correlation temperature perception data redundancy removing method Active CN111601358B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010361344.0A CN111601358B (en) 2020-04-30 2020-04-30 Multi-stage hierarchical clustering spatial correlation temperature perception data redundancy removing method


Publications (2)

Publication Number Publication Date
CN111601358A CN111601358A (en) 2020-08-28
CN111601358B true CN111601358B (en) 2021-05-18

Family

ID=72190901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010361344.0A Active CN111601358B (en) 2020-04-30 2020-04-30 Multi-stage hierarchical clustering spatial correlation temperature perception data redundancy removing method

Country Status (1)

Country Link
CN (1) CN111601358B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112994991B (en) * 2021-05-20 2021-07-16 中南大学 Redundant node discrimination method, device and equipment and readable storage medium
CN116992220B (en) * 2023-09-25 2023-12-19 国网北京市电力公司 Low-redundancy electricity consumption data intelligent acquisition method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2502775A (en) * 2012-05-31 2013-12-11 Toshiba Res Europ Ltd Selecting routes between nodes in a network based on node processing gain and lifetime
CN107241776B (en) * 2017-07-18 2019-03-22 中南民族大学 A kind of wireless sensor network data fusion method mixing delay sensitive sub-clustering
CN109446028B (en) * 2018-10-26 2022-05-03 中国人民解放军火箭军工程大学 Method for monitoring state of refrigeration dehumidifier based on genetic fuzzy C-mean clustering
CN110830946B (en) * 2019-11-15 2020-11-06 江南大学 Mixed type online data anomaly detection method

Also Published As

Publication number Publication date
CN111601358A (en) 2020-08-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant