Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a distributed clustering method based on a mixed factor analysis model in a sensor network. Clustering refers to the process of dividing data into several classes by some method; a class produced by clustering is a collection of data objects that are similar to the objects in the same cluster and distinct from the objects in other clusters. Since the class labels of the data are unknown during clustering, data clustering is an unsupervised learning process in the field of machine learning. Many data clustering methods exist, but most of them assume that all of the data are clustered at a single processing center, whereas in a sensor network distributed processing is critical; the method is therefore designed on the basis of a mixed factor analysis model to solve this problem. Its advantages mainly include: (1) the mixed factor analysis model can effectively process high-dimensional data; (2) by designing a cooperation mode among the nodes, a satisfactory clustering result can be obtained by transmitting only intermediate results; compared with transmitting the original data, this reduces the communication overhead, protects the private information in the data, and ensures data security in the network.
The technical scheme adopted by the invention for solving the technical problems is as follows: a distributed clustering method based on a mixed factor analysis model in a sensor network comprises the following steps:
Suppose there are M sensor nodes in the sensor network, and the m-th node collects $N_m$ data, denoted $Y_m = \{y_{m,n}\}_{n=1,\dots,N_m}$, where $y_{m,n}$ represents the n-th datum at node m, with dimension p. The distribution of $Y_m$ $(m = 1, \dots, M)$ is described with a mixed factor analysis model (MFA); note that the data of all nodes share the same MFA. The MFA is a mixture model with I components; each datum $y_{m,n}$ can be expressed as:
$$y_{m,n} = \mu_i + A_i u_{m,n} + e_{m,n,i} \quad \text{with probability } \pi_i \quad (i = 1, \dots, I),$$
where $\mu_i$ is the p-dimensional mean vector of the i-th mixture component; $u_{m,n}$ is the factor corresponding to $y_{m,n}$ in the low-dimensional space, with dimension $q$ ($q < p$), obeying the Gaussian distribution $N(u_{m,n} \mid 0, I_q)$; the value of q is selected according to the size of p in the specific problem, and is generally an arbitrary integer between p/6 and p/2; $A_i$ is the $(p \times q)$ factor loading matrix; the error variable $e_{m,n,i}$ obeys the Gaussian distribution $N(e_{m,n,i} \mid 0, D_i)$, where $D_i$ is a $(p \times p)$ diagonal matrix; and the probabilities $\pi_i$ satisfy $\sum_{i=1}^{I} \pi_i = 1$. The parameter set of the MFA is then $\Theta = \{\pi_i, A_i, \mu_i, D_i\}_{i=1,\dots,I}$. Note that the values of the parameters to be estimated are the same for all nodes.
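Under this model, a datum is generated by first choosing a component i with probability $\pi_i$ and then adding a low-dimensional factor contribution and noise, so each component's marginal is $N(\mu_i, A_i A_i^T + D_i)$. The following is a minimal generative sketch in Python with NumPy; the function name and array layout are illustrative assumptions, not part of the disclosure:

```python
import numpy as np

def sample_mfa(n, pi, mu, A, D, seed=None):
    """Draw n samples from a mixed factor analysis model (MFA).

    pi : (I,)      mixture weights, summing to 1
    mu : (I, p)    component mean vectors
    A  : (I, p, q) factor loading matrices
    D  : (I, p)    diagonals of the (p x p) error covariances
    """
    rng = np.random.default_rng(seed)
    I, p = mu.shape
    q = A.shape[2]
    z = rng.choice(I, size=n, p=pi)                  # component of each datum
    u = rng.standard_normal((n, q))                  # factors u ~ N(0, I_q)
    e = rng.standard_normal((n, p)) * np.sqrt(D[z])  # errors e ~ N(0, D_i)
    y = mu[z] + np.einsum('npq,nq->np', A[z], u) + e
    return y, z
```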
In addition, the data transmission range of each node is set as W; for the current node m, all nodes at a distance smaller than W are its neighbor nodes, and the neighbor node set of node m is denoted $R_m$. The relationship between nodes in a sensor network is shown in FIG. 1, where circles represent nodes; if two nodes are connected by an edge, the two nodes can communicate with each other to transmit information. The dotted box in FIG. 1 represents $R_m$ for a node m. In the present invention, the network topology is established before distributed clustering is implemented, and communication between any two nodes is guaranteed either directly or over multiple hops.
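As a sketch of how $R_m$ could be constructed at deployment time (the coordinate array and function name are assumptions for illustration, since the disclosure only states that the topology is fixed in advance):

```python
import numpy as np

def neighbor_sets(coords, W):
    """Build the neighbor set R_m of every node from node coordinates.

    coords : (M, 2) array of node positions (assumed known at deployment)
    W      : transmission range; nodes closer than W are neighbors
    """
    M = len(coords)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    return {m: [l for l in range(M) if l != m and dist[m, l] < W]
            for m in range(M)}
```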
After the sensor network topology and the MFA describing the data distribution are established, the distributed clustering process starts, as shown in FIG. 2; the specific steps are as follows:
Step 1: initialization. There are M sensor nodes in the sensor network, and the m-th node collects $N_m$ data, denoted $Y_m = \{y_{m,n}\}_{n=1,\dots,N_m}$, where $y_{m,n}$ represents the n-th datum at node m, with dimension p. The network topology is determined in advance, and the neighbor node set of node m is denoted $R_m$. The distribution of $Y_m$ $(m = 1, \dots, M)$ is described with the mixed factor analysis model (MFA), the data of all nodes sharing the same MFA. The parameter set of the MFA is $\{\pi_i, A_i, \mu_i, D_i\}_{i=1,\dots,I}$, where $\pi_i$ is the weight of the i-th mixture component; $A_i$ is the $(p \times q)$ factor loading matrix of the i-th mixture component, q being the dimension of the low-dimensional factor, an arbitrary integer between p/6 and p/2; $\mu_i$ is the p-dimensional mean vector of the i-th mixture component; and $D_i$ is the covariance matrix of the error of the i-th mixture component.
First, the number of mixture components I in the MFA is set, which is also the number of classes to be clustered, and the initial values of the parameters in the MFA are set according to I, p and q. At each node, the initial $\mu_i$ are randomly selected from the data collected by that node, and the elements of the remaining initial parameters are generated from a standard normal distribution $N(0, 1)$. In addition, the number $N_l$ of data collected by each node l is broadcast to its neighbor nodes. When a node m has received the data counts broadcast by all of its neighbor nodes $l$ $(l \in R_m)$, the node calculates the weights $c_{lm}$:
After initialization is completed, the iteration counter is set to iter = 1, and the iterative process begins;
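The disclosure defines $c_{lm}$ by an equation not reproduced here. Purely as an assumption, the sketch below uses a common choice in diffusion-based distributed estimation: each node's weight is its share of the data visible to node m. All function names are hypothetical:

```python
import numpy as np

def init_node(Y, I, q, seed=None):
    """Initialize MFA parameters at one node from its local data Y (N x p)."""
    rng = np.random.default_rng(seed)
    N, p = Y.shape
    pi = np.full(I, 1.0 / I)                 # equal initial weights (assumed)
    mu = Y[rng.choice(N, I, replace=False)]  # means drawn from local data
    A = rng.standard_normal((I, p, q))       # loading elements from N(0, 1)
    D = np.ones((I, p))                      # unit error variances (assumed,
    return pi, mu, A, D                      # to keep D_i positive definite)

def weights(counts, R_m, m):
    """Assumed form of c_lm: node l's share of the data visible to node m."""
    nodes = R_m + [m]
    total = sum(counts[l] for l in nodes)
    return {l: counts[l] / total for l in nodes}
```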
Step 2: local calculation. At each node l, based on the data $Y_l$ it has collected, the intermediate variables $g_i$, $\Omega_i$ and the posterior probabilities $\langle z_{l,n,i} \rangle$ $(n = 1, \dots, N_l;\ i = 1, \dots, I)$ are first calculated:
where $\hat{\Theta}^{old}$ denotes the parameter values obtained after the previous iteration was completed (the initial parameter values at the first iteration), and $\langle z_{l,n,i} \rangle$ represents the probability that the n-th datum $y_{l,n}$ at node l belongs to the i-th class (mixture component);
Next, the node calculates the local sufficient statistics $\mathrm{LSS}_l$ as follows:
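The exact expressions for $g_i$, $\Omega_i$ and $\mathrm{LSS}_l$ are given by equations not reproduced here. As a hedged illustration, the sketch below implements the standard E-step of a mixture of factor analyzers (responsibilities from the component marginals $N(\mu_i, A_i A_i^T + D_i)$ and posterior factor means), together with a minimal set of local statistics; this is an assumption about the disclosure's exact formulas:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(Y, pi, mu, A, D):
    """Standard MFA E-step (assumed form of the local calculation)."""
    N, p = Y.shape
    I = len(pi)
    q = A.shape[2]
    z = np.empty((N, I))                    # responsibilities <z_{l,n,i}>
    Eu = np.empty((I, N, q))                # posterior factor means E[u | y, i]
    for i in range(I):
        cov = A[i] @ A[i].T + np.diag(D[i])          # marginal covariance
        z[:, i] = pi[i] * multivariate_normal.pdf(Y, mu[i], cov)
        beta = A[i].T @ np.linalg.inv(cov)           # (q x p) regression matrix
        Eu[i] = (Y - mu[i]) @ beta.T
    z /= z.sum(axis=1, keepdims=True)                # normalize over components
    return z, Eu

def local_stats(Y, z):
    """A minimal LSS_l: zero- and first-order moments per component."""
    return z.sum(axis=0), z.T @ Y           # sum_n <z_{n,i}>, sum_n <z_{n,i}> y_n
```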
Step 3: broadcast diffusion. Each node l in the sensor network broadcasts the calculated local sufficient statistics $\mathrm{LSS}_l$ to its neighbor nodes;
Step 4: joint calculation. When node m $(m = 1, \dots, M)$ has received the $\mathrm{LSS}_l$ from all of its neighbor nodes $l$ $(l \in R_m)$, node m calculates the joint sufficient statistics $\mathrm{CSS}_m$:
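The combination rule for $\mathrm{CSS}_m$ is given by an equation not reproduced here; a plausible form, assumed for illustration only, is the $c_{lm}$-weighted sum of node m's own statistics and those of its neighbors:

```python
def joint_stats(lss, c, R_m, m):
    """Assumed combination: CSS_m = sum over l in (R_m and m) of c_lm * LSS_l.

    lss : dict mapping node id -> (s0, s1) local sufficient statistics
    c   : dict mapping node id -> weight c_lm at node m
    """
    nodes = R_m + [m]
    s0 = sum(c[l] * lss[l][0] for l in nodes)
    s1 = sum(c[l] * lss[l][1] for l in nodes)
    return s0, s1
```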
Step 5: parameter estimation. Node m $(m = 1, \dots, M)$ estimates $\Theta = \{\pi_i, A_i, \mu_i, D_i\}_{i=1,\dots,I}$ from the $\mathrm{CSS}_m$ calculated in the previous step, where the estimation of $\{\pi_i, \mu_i\}_{i=1,\dots,I}$ proceeds as follows:
For $\{A_i, D_i\}_{i=1,\dots,I}$, the process is as follows:
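The update equations themselves are not reproduced here. The sketch below shows the standard M-step of a mixture of factor analyzers, written against raw data for readability; in the distributed method, the per-datum sums would be replaced by the corresponding entries of $\mathrm{CSS}_m$. This is an assumption about the disclosure's exact updates:

```python
import numpy as np

def m_step(Y, z, Eu, A, D):
    """Standard MFA M-step (assumed form; sums over n would come from CSS_m)."""
    N, p = Y.shape
    I = z.shape[1]
    q = A.shape[2]
    pi_new = z.sum(axis=0) / N                          # mixture weights
    mu_new = (z.T @ Y) / z.sum(axis=0)[:, None]         # component means
    A_new = np.empty_like(A)
    D_new = np.empty_like(D)
    for i in range(I):
        Yc = Y - mu_new[i]                              # centered data
        cov = A[i] @ A[i].T + np.diag(D[i])
        beta = A[i].T @ np.linalg.inv(cov)
        Nz = z[:, i].sum()
        # sum_n <z> E[u u^T | y, i] = Nz (I_q - beta A) + sum_n <z> E[u] E[u]^T
        Euu = Nz * (np.eye(q) - beta @ A[i]) + (z[:, i, None] * Eu[i]).T @ Eu[i]
        S1 = (z[:, i, None] * Yc).T @ Eu[i]             # sum_n <z> y~ E[u]^T
        A_new[i] = S1 @ np.linalg.inv(Euu)              # loading update
        D_new[i] = ((z[:, i, None] * Yc * Yc).sum(axis=0)
                    - (A_new[i] * S1).sum(axis=1)) / Nz # diagonal error update
    return pi_new, mu_new, A_new, D_new
```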
Step 6: convergence judgment. Node m $(m = 1, \dots, M)$ computes the log-likelihood value at the current iteration:

$$\log p(Y_m \mid \Theta) = \sum_{n=1}^{N_m} \log \sum_{i=1}^{I} \pi_i\, N\!\left(y_{m,n} \mid \mu_i,\ A_i A_i^{T} + D_i\right);$$
If $\log p(Y_m \mid \Theta) - \log p(Y_m \mid \Theta^{old}) < \varepsilon$, the node has converged and iteration stops; otherwise, step 2 is executed and the next iteration begins (iter = iter + 1). Here $\Theta$ denotes the parameter values estimated in the current iteration and $\Theta^{old}$ the parameter values estimated in the previous iteration; that is, the algorithm converges when the log-likelihood values of two adjacent iterations differ by less than the threshold $\varepsilon$, which takes an arbitrary value between $10^{-6}$ and $10^{-5}$. Because each node in the network processes its data in parallel, all nodes cannot converge in the same iteration. When a node l has converged but a node m has not, node l no longer sends $\mathrm{LSS}_l$ and no longer receives the information transmitted by its neighbor nodes; node m then uses the last $\mathrm{LSS}_l$ received from node l to update its $\mathrm{CSS}_m$. The nodes that have not converged continue to iterate until all nodes in the network have converged;
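A direct computation of this convergence quantity, consistent with the component marginals used above (a sketch reusing the assumed array layout):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(Y, pi, mu, A, D):
    """Marginal MFA log-likelihood used in the convergence test."""
    dens = np.column_stack([
        pi[i] * multivariate_normal.pdf(Y, mu[i], A[i] @ A[i].T + np.diag(D[i]))
        for i in range(len(pi))])
    return float(np.log(dens.sum(axis=1)).sum())

def converged(ll, ll_old, eps=1e-5):
    """eps is chosen between 1e-6 and 1e-5, per the disclosure."""
    return ll - ll_old < eps
```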
Step 7: clustering output. After steps 1 to 6, node m $(m = 1, \dots, M)$ has obtained, for each of its data, the corresponding $\langle z_{m,n,i} \rangle$ $(n = 1, \dots, N_m;\ i = 1, \dots, I)$; the index corresponding to the maximum of $\langle z_{m,n,i} \rangle$ over $i = 1, \dots, I$ is taken as the class $C_{m,n}$ to which $y_{m,n}$ is finally assigned, namely:

$$C_{m,n} = \arg\max_{i=1,\dots,I} \langle z_{m,n,i} \rangle;$$
In this way, the clustering results $\{C_{m,n}\}_{n=1,\dots,N_m;\ m=1,\dots,M}$ of all data on all nodes are obtained.
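Tying the sketched helpers together, a single-node view of the loop (the broadcast and diffusion steps between nodes are abstracted away; all function names are the hypothetical ones introduced above):

```python
def cluster_node(Y, I, q, max_iter=200, eps=1e-5, seed=0):
    """Single-node sketch of steps 1-7; inter-node cooperation omitted."""
    pi, mu, A, D = init_node(Y, I, q, seed)      # step 1
    ll_old = float("-inf")
    for _ in range(max_iter):
        z, Eu = e_step(Y, pi, mu, A, D)          # step 2
        pi, mu, A, D = m_step(Y, z, Eu, A, D)    # step 5
        ll = log_likelihood(Y, pi, mu, A, D)     # step 6
        if converged(ll, ll_old, eps):
            break
        ll_old = ll
    z, _ = e_step(Y, pi, mu, A, D)
    return z.argmax(axis=1)                      # step 7: labels C_{m,n}
```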
The method models the data to be clustered at each node in the sensor network with a mixed factor analysis model. Each node calculates local sufficient statistics based on its own data and then broadcasts them to its neighbor nodes in a diffusion mode; after receiving all the local sufficient statistics from its neighbor nodes, a node obtains the joint sufficient statistics, estimates the parameters of the mixed factor analysis model from these statistics, and finally completes the clustering based on the estimated model. The mixed factor analysis model established by the invention can complete the dimensionality reduction of the data while clustering, and the distributed clustering mode avoids the network collapse caused by the failure of a central node in the traditional centralized processing mode.
Advantageous effects:
1. The mixed factor analyzer adopted in the invention can reduce the dimension of high-dimensional data, so that clustering is completed smoothly while the dimension is reduced, and better clustering performance is obtained.
2. The distributed clustering method based on the mixed factor analysis model enables each node in the sensor network to fully utilize the information contained in the data of other nodes, achieving clustering performance comparable to that of a centralized method.
3. In the distributed clustering method based on the mixed factor analysis model, the nodes exchange local sufficient statistics during cooperation instead of directly transmitting the original data; since the number and dimensionality of the local sufficient statistics are far smaller than those of the data, the method saves communication overhead on the one hand and, on the other hand, helps to fully protect the private information in the data, improving the security of a system adopting the method.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
In order to better illustrate the distributed clustering method based on the mixed factor analysis model in a sensor network, the method is applied to the clustering of wine composition data. In some countries, several detection stations are distributed in different areas for detecting the content of the ingredients in wine. The types of wine sent to a detection station vary, so it is desirable to cluster wines of similar classes. If a detection station can form a sensor network with the other detection stations, the wine data at the other stations can be fully utilized through mutual cooperation, thereby improving the clustering accuracy. Here, the wine data to be clustered are taken from the UCI machine learning repository; there are 178 data in 3 classes, and each datum has dimension 13, representing the content of each component in the wine. The sensor network has 8 nodes, the average number of neighbor nodes per node is 2, and the network is connected (a direct or indirect path exists between any two nodes). Thus, in this example, M = 8, p = 13, I = 3 and q = 3. The amount of data at each node is: $N_1 = 21$, $N_2 = 22$, $N_3 = 21$, $N_4 = 21$, $N_5 = 22$, $N_6 = 22$, $N_7 = 21$, $N_8 = 28$. The neighbor sets of the nodes are: $R_1 = \{3,5,6\}$, $R_2 = \{3,5\}$, $R_3 = \{1,4,2\}$, $R_4 = \{3\}$, $R_5 = \{1,2\}$, $R_6 = \{1,7\}$, $R_7 = \{6,8\}$, $R_8 = \{3,7\}$.
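This configuration can be reproduced as follows (a sketch; loading the UCI Wine data through scikit-learn and partitioning it randomly across nodes are assumptions, since the disclosure does not specify how the data were distributed):

```python
import numpy as np
from sklearn.datasets import load_wine   # assumed source of the UCI Wine data

M, p, I, q = 8, 13, 3, 3
counts = {1: 21, 2: 22, 3: 21, 4: 21, 5: 22, 6: 22, 7: 21, 8: 28}  # N_1..N_8
R = {1: [3, 5, 6], 2: [3, 5], 3: [1, 4, 2], 4: [3],                # sets R_m
     5: [1, 2], 6: [1, 7], 7: [6, 8], 8: [3, 7]}

wine = load_wine()
X, labels = wine.data, wine.target                    # (178, 13) data, 3 classes
perm = np.random.default_rng(0).permutation(178)      # assumed random partition
splits = np.split(perm, np.cumsum(list(counts.values()))[:-1])
Y = {m + 1: X[idx] for m, idx in enumerate(splits)}       # data at each node
truth = {m + 1: labels[idx] for m, idx in enumerate(splits)}  # true classes
```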
According to the flow of the inventive content (shown in fig. 2), distributed clustering is started:
(1) Initialization: the initial values of the parameters in the MFA are set. At each node, the initial $\mu_i$ are randomly selected from the data of the node, and the elements of the remaining initial parameters are generated from a standard normal distribution $N(0, 1)$. In addition, each node l $(l = 1, \dots, M)$ broadcasts the number $N_l$ of data it has collected to its neighbor nodes. When a node m has received the data counts broadcast by all of its neighbor nodes, the node calculates the weights $c_{lm}$:
The weights measure the importance, at node m, of the message transmitted by each neighbor node l $(l \in R_m)$. After initialization is completed, the iteration counter is set to iter = 1, and the iterative process begins.
(2) Local calculation: this step requires no information from the neighbor nodes. At each node l, based on the data $Y_l$ it has collected, $g_i$, $\Omega_i$ and $\langle z_{l,n,i} \rangle$ $(n = 1, \dots, N_l;\ i = 1, \dots, I)$ are first calculated:
where $\hat{\Theta}^{old}$ denotes the parameter values obtained after the previous iteration was completed (the initial parameter values at the first iteration), and $\langle z_{l,n,i} \rangle$ represents the probability that the n-th datum $y_{l,n}$ at node l belongs to the i-th class (mixture component).
Next, the node calculates the local sufficient statistics $\mathrm{LSS}_l$ as follows:
(3) Broadcast diffusion: each node l in the sensor network broadcasts the calculated local sufficient statistics $\mathrm{LSS}_l$ to its neighbor nodes, as shown in FIG. 1.
(4) Joint calculation: when node m $(m = 1, \dots, M)$ has received the $\mathrm{LSS}_l$ from all of its neighbor nodes $l$ $(l \in R_m)$, node m calculates the joint sufficient statistics $\mathrm{CSS}_m$:
(5) Parameter estimation: node m $(m = 1, \dots, M)$ estimates $\Theta = \{\pi_i, A_i, \mu_i, D_i\}_{i=1,\dots,I}$ from the $\mathrm{CSS}_m$ calculated in the previous step, where the estimation of $\{\pi_i, \mu_i\}_{i=1,\dots,I}$ proceeds as follows:
For $\{A_i, D_i\}_{i=1,\dots,I}$, the process is as follows:
(6) Convergence judgment: node m $(m = 1, \dots, M)$ computes the log-likelihood value $\log p(Y_m \mid \Theta)$ at the current iteration:
If $\log p(Y_m \mid \Theta) - \log p(Y_m \mid \Theta^{old}) < \varepsilon$, the node has converged and iteration stops; otherwise, step (2) is executed and the next iteration begins (iter = iter + 1). Here $\Theta$ denotes the parameter values estimated in the current iteration and $\Theta^{old}$ the parameter values estimated in the previous iteration; that is, the algorithm converges when the log-likelihood values of two adjacent iterations differ by less than the threshold $\varepsilon$, which takes an arbitrary value between $10^{-6}$ and $10^{-5}$. It is worth noting that, since each node in the network processes its data in parallel, the nodes cannot all converge in the same iteration. For example, when a node l has converged but a node m has not, node l no longer sends $\mathrm{LSS}_l$ and no longer receives the information transmitted by its neighbor nodes; node m then uses the last $\mathrm{LSS}_l$ received from node l to update its $\mathrm{CSS}_m$. The nodes that have not converged continue to iterate until all nodes in the network converge.
(7) Clustering output: after steps (1)-(6), node m $(m = 1, \dots, M)$ has obtained, for each of its data $\{y_{m,n}\}_{n=1,\dots,N_m}$, the corresponding $\langle z_{m,n,i} \rangle$ $(n = 1, \dots, N_m;\ i = 1, \dots, I)$; the index corresponding to the maximum of $\langle z_{m,n,i} \rangle$ over $i = 1, \dots, I$ is taken as the class $C_{m,n}$ to which $y_{m,n}$ is finally assigned, namely:

$$C_{m,n} = \arg\max_{i=1,\dots,I} \langle z_{m,n,i} \rangle.$$
In this way, the clustering results $\{C_{m,n}\}$ of all data on all nodes are obtained.
Performance evaluation:
The results obtained by the clustering method of the invention are compared with the correct class labels, so that the effectiveness and accuracy of the method can be evaluated and measured. The clustering results at each node are shown in FIG. 3, in which the abscissa indicates the 178 data (the non-vacant positions indicating the node to which each datum is assigned) and the ordinate indicates the class number (3 classes in total) to which each datum is assigned. In the figure, "o" indicates correctly clustered data and "x" indicates incorrectly clustered data. From FIG. 3, the clustering accuracies at the 8 nodes are: 100%, 100%, 95.2%, 95.5%, 100%, 95.5%, 100%, 92.9%. Only five data in total were clustered incorrectly, and the average accuracy over the entire network is 97.2%. Compared with the result (98%) obtained by the centralized method, the accuracy is basically the same. The disadvantages of centralized transmission are quite obvious: first, once the central node fails, the whole network crashes; second, each node directly transmits its original data to the central node, which increases the communication burden in the network and easily reveals the private information in the data. The method of the invention therefore overcomes these defects while obtaining good distributed clustering performance.
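A sketch of the accuracy computation (hypothetical helper; it assumes the predicted labels have already been matched to the ground-truth labeling, e.g. by majority vote per cluster, since cluster indices are arbitrary):

```python
import numpy as np

def accuracies(preds, truths):
    """Per-node and network-average clustering accuracy.

    preds, truths : dicts mapping node id -> label arrays of equal length
    """
    per_node = {m: float(np.mean(preds[m] == truths[m])) for m in preds}
    total = sum(len(t) for t in truths.values())
    correct = sum(int(np.sum(preds[m] == truths[m])) for m in preds)
    return per_node, correct / total
```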
The scope of the invention is not limited to the description of the embodiments.