CN117950912A

CN117950912A - High-robustness distributed data recovery method based on compressed sensing

Info

Publication number: CN117950912A
Application number: CN202311779924.1A
Authority: CN
Inventors: 薛广涛; 杜雨衡; 陈奕超
Original assignee: Shanghai Jiao Tong University
Current assignee: Shanghai Jiao Tong University
Priority date: 2023-12-22
Filing date: 2023-12-22
Publication date: 2024-04-30

Abstract

The present invention relates to a highly robust distributed data recovery method based on compressed sensing. The method utilizes the inherent characteristics of the original data that may have multiple defects, converts the data recovery process into an optimization problem, decomposes the data matrix, noise matrix, abnormal matrix and error matrix from the original data by designing the goal and constraints of the problem, and designs a set of distributed algorithms to greatly reduce the demand for computing power, thereby achieving highly robust and high-precision rapid recovery of data; in addition, a method for balancing recovery effect and recovery time is also provided, which directly affects and predicts the final recovery performance by adjusting the division scheme of the decomposition stage. In general, the data recovery method can be divided into three parts: distributed data decomposition, highly robust data recovery, and iterative data enhancement. Compared with the prior art, the present invention has the advantages of high robustness, strong adjustability, and easy use.

Description

A highly robust distributed data recovery method based on compressed sensing

技术领域Technical Field

本发明涉及数据恢复技术领域，尤其是涉及一种基于压缩感知的高鲁棒性分布式数据恢复方法。The present invention relates to the technical field of data recovery, and in particular to a highly robust distributed data recovery method based on compressed sensing.

背景技术Background Art

数据是物联网中连接网络与现实世界的桥梁。与传统网络中数据直接向人提供服务的模式有所不同，在物联网中，设备是数据的主要生产者与消费者。在这种M2M模式下，数据作为设备间的通信媒介，被用于提供无处不在的智能服务。但由于物联网自身的特性，如物联网设备受到能源和体积的限制，并不能提供较强的计算能力；物联网中因场景和传感器性能的限制，往往使用更简单的网络协议栈，其无线传输过程更容易受到电磁干扰，在网络连接上有较高的间歇性断联风险。如果没有充分考虑这些物联网特性对数据质量的影响，那么后续基于物联网数据的运算和服务将会极为不可靠，可能会对上层应用和业务造成不可估量的损失。因此，必须特别关注数据质量。Data is the bridge between the Internet of Things and the real world. Different from the traditional network model where data directly provides services to people, in the Internet of Things, devices are the main producers and consumers of data. In this M2M model, data is used as a communication medium between devices to provide ubiquitous intelligent services. However, due to the characteristics of the Internet of Things itself, such as the limitations of energy and volume, IoT devices cannot provide strong computing power; due to the limitations of scenarios and sensor performance, the Internet of Things often uses a simpler network protocol stack, and its wireless transmission process is more susceptible to electromagnetic interference, and there is a higher risk of intermittent disconnection in network connections. If the impact of these IoT characteristics on data quality is not fully considered, then subsequent operations and services based on IoT data will be extremely unreliable, which may cause immeasurable losses to upper-layer applications and businesses. Therefore, special attention must be paid to data quality.

针对物联网中计算能力不足的缺点，传感器采集到的感知数据常常会使用边缘网络进行处理。但是，传感器采集到的数据往往同时存在着多种瑕疵和缺陷，而大部分算法都不允许输入中存在数据空缺或超限的异常值。为了避免给后续的分析处理带来麻烦，常见做法是进行数据恢复的预处理，但目前已有的数据恢复算法都存在很多限制。In view of the shortcomings of insufficient computing power in the Internet of Things, the perception data collected by sensors are often processed using edge networks. However, the data collected by sensors often have multiple flaws and defects at the same time, and most algorithms do not allow data gaps or outliers in the input. In order to avoid trouble in subsequent analysis and processing, a common practice is to perform preprocessing of data recovery, but the existing data recovery algorithms have many limitations.

首先，现有数据恢复算法普遍没有充分考虑到数据中可能存在的缺陷，其中包括异常、缺失和噪声。如果不对数据中的这些缺陷进行处理，将会严重降低后续数据分析的性能和精度。First, existing data recovery algorithms generally do not fully consider the possible defects in the data, including anomalies, missing data, and noise. If these defects in the data are not handled, the performance and accuracy of subsequent data analysis will be seriously reduced.

其次，现有数据恢复算法的运算时间过长。网络领域有很多大规模的数据集，这是因为数据采集的成本低、速度快，可以比较容易的获取到储存量达到GB级别的数据。不过对于数据恢复算法来说，这样大规模的数据会使得节点计算能力面临严重挑战。但计算能力成为瓶颈后，数据恢复需要占用大量时间，产生极高的时延，恶化数据分析的实时性。Secondly, the operation time of existing data recovery algorithms is too long. There are many large-scale data sets in the network field. This is because the cost of data collection is low and the speed is fast. It is relatively easy to obtain data with storage capacity of GB level. However, for data recovery algorithms, such large-scale data will pose a serious challenge to node computing power. But when computing power becomes a bottleneck, data recovery will take a lot of time, resulting in extremely high latency, which will deteriorate the real-time nature of data analysis.

发明内容Summary of the invention

本发明的目的是为了提供一种基于压缩感知的高鲁棒性分布式数据恢复方法，结合相关场景特性，在物联网侧就能够完成对数据中多种缺陷的检测和恢复，以此达成增强数据质量的目的，尽可能地减小物联网数据缺陷对更高层次业务的影响。The purpose of the present invention is to provide a highly robust distributed data recovery method based on compressed sensing. Combined with relevant scenario characteristics, it can complete the detection and recovery of various defects in the data on the Internet of Things side, so as to achieve the purpose of enhancing data quality and minimize the impact of Internet of Things data defects on higher-level services.

本发明的目的可以通过以下技术方案来实现：The purpose of the present invention can be achieved by the following technical solutions:

一种基于压缩感知的高鲁棒性分布式数据恢复方法，所述方法将数据的恢复过程转换为一个最优化问题，通过对该问题目标和约束的设计，从原始数据中分解出数据矩阵、噪声矩阵、异常矩阵和错误矩阵，并根据数据矩阵得到恢复数据。A highly robust distributed data recovery method based on compressed sensing, the method converts the data recovery process into an optimization problem, through the design of the problem objectives and constraints, decomposes the original data into a data matrix, a noise matrix, an abnormal matrix and an error matrix, and obtains the recovered data according to the data matrix.

所述方法基于分布式算法实现，主节点执行数据分解过程和数据增强过程，其中数据分解过程将原始数据拆分为多个子矩阵，数据增强过程通过迭代进行分解结果的优化，从节点对主节点拆分的子矩阵执行数据恢复过程。The method is implemented based on a distributed algorithm. The master node executes a data decomposition process and a data enhancement process. The data decomposition process splits the original data into multiple sub-matrices. The data enhancement process optimizes the decomposition results through iteration. The slave node executes a data recovery process on the sub-matrices split by the master node.

所述方法包括以下步骤：The method comprises the following steps:

S1，分布式数据分解：S1, distributed data decomposition:

S101，主节点获取待恢复的原始数据并检查原始矩阵的低秩性，若符合低秩性要求则执行下一步，否则报错；S101, the master node obtains the original data to be restored and checks the low rank of the original matrix. If it meets the low rank requirement, the next step is executed, otherwise an error is reported;

S102，根据预先配置的参数，主节点对原始数据进行聚类以拆分矩阵，同一个簇中的数据被拼接为一个子矩阵，以此得到多个子矩阵；S102, according to pre-configured parameters, the master node clusters the original data to split the matrix, and the data in the same cluster is spliced into a sub-matrix, thereby obtaining multiple sub-matrices;

S103，主节点将拆分后的子矩阵以矩阵规模为权重，均匀发送给多个从节点进行数据恢复；S103, the master node sends the split sub-matrices to multiple slave nodes evenly with the matrix size as the weight for data recovery;

S2，高鲁棒数据恢复：S2, highly robust data recovery:

S201，每个从节点从对应的子矩阵中提取所有空缺位置，构造错误矩阵；S201, each slave node extracts all vacant positions from the corresponding sub-matrix to construct an error matrix;

S202，根据数据中缺陷特点，构建最优化问题的目标和优化约束；S202, constructing the goal and optimization constraints of the optimization problem according to the defect characteristics in the data;

S203，带入约束项得到增广拉格朗日公式，每个从节点利用ADM法最小化对应的增广拉格朗日公式，求解最优化问题，得到分解出的数据矩阵、噪声矩阵和异常矩阵；S203, bringing in the constraint term to obtain the augmented Lagrangian formula, minimizing the corresponding augmented Lagrangian formula for each slave node using the ADM method, solving the optimization problem, and obtaining the decomposed data matrix, noise matrix, and abnormality matrix;

S3，迭代式数据增强：S3, iterative data augmentation:

S301，从节点完成矩阵分解后，主节点汇总各个从节点上所有子矩阵的分解结果，按照拆分位置重新拼接得到4个与输入的原始数据规模一致的矩阵，其中包含恢复后的数据矩阵；S301, after the slave nodes complete the matrix decomposition, the master node summarizes the decomposition results of all sub-matrices on each slave node, and reassembles them according to the split positions to obtain 4 matrices consistent with the input original data scale, including the restored data matrix;

S302，对恢复后的数据矩阵再次进行聚类得到拆分后的子矩阵，并由主节点比较本次拆分结果与上次的拆分结果之间的差异性，若差异小于阈值则完成恢复，将步骤301拼接得到的4个矩阵作为恢复结果，否则继续下一步；S302, clustering the restored data matrix again to obtain the split sub-matrices, and the master node compares the difference between the current split result and the previous split result. If the difference is less than the threshold, the recovery is completed, and the four matrices spliced in step 301 are used as the recovery result, otherwise proceed to the next step;

S303，根据预配置参数，主节点判断计算时间和迭代次数是否达到最大值，若达到最大值则完成恢复，将步骤301拼接得到的4个矩阵作为恢复结果，否则根据步骤S302再次聚类得到的子矩阵，返回步骤S103继续下一轮恢复。S303, based on the pre-configured parameters, the master node determines whether the calculation time and the number of iterations have reached the maximum value. If so, the recovery is completed and the four matrices obtained by splicing in step 301 are used as the recovery result. Otherwise, the sub-matrices obtained by clustering again in step S302 are returned to step S103 to continue the next round of recovery.

所述步骤S101具体为：主节点获取待恢复的原始数据，并在原始数据所有缺失的位置临时填充上数值0，对填充后的原始数据矩阵进行奇异值分解，以计算其中数据分量的秩，确定低秩性指标；若计算得到的低秩性指标小于预设阈值，则执行下一步，否则返回错误值进行报错。The step S101 is specifically as follows: the master node obtains the original data to be restored, and temporarily fills all missing positions of the original data with the value 0, performs singular value decomposition on the filled original data matrix to calculate the rank of the data components therein, and determines the low-rank index; if the calculated low-rank index is less than the preset threshold, execute the next step, otherwise return an error value to report an error.

所述步骤S102中，主节点对原始数据进行聚类以拆分矩阵采用基于密度的改进聚类算法OPTICS，在OPTICS算法基础上，增加对区域末尾点的判定，利用算法生成序列过程中最小堆结构自身维护先驱结点的特性来判决末尾点是否应该被剔除出簇；完成聚类后，所有簇按照规模大小降序排列后从1开始标注id，即id＝1的簇包含最多的数据点，所有被标记的离群点聚合成一个额外的簇，记录id＝0。In step S102, the master node clusters the original data to split the matrix using the density-based improved clustering algorithm OPTICS. On the basis of the OPTICS algorithm, the determination of the end point of the region is added, and the characteristic of the minimum heap structure itself maintaining the pioneer node in the process of algorithm generation sequence is used to determine whether the end point should be removed from the cluster; after clustering is completed, all clusters are arranged in descending order according to size and marked with id starting from 1, that is, the cluster with id=1 contains the most data points, and all marked outliers are aggregated into an additional cluster, and id=0 is recorded.

所述步骤S103中，主节点利用网络连接向下属的从节点下发各个子矩阵，并以矩阵规模的平方作为权重，利用负载因子在从节点间进行任务分配，过程如下：In step S103, the master node sends each sub-matrix to the subordinate slave nodes using the network connection, and uses the square of the matrix size as the weight and the load factor to distribute tasks among the slave nodes. The process is as follows:

预先为第j个节点按照计算能力配置权重w_j，默认全为1，节点初始负载L_j＝1，定义负载因子 Pre-configure the weight w _j for the jth node according to its computing power. The default value is 1. The initial load of the node is L _j = 1. Define the load factor

对于簇C_i对应的子矩阵D_i，定义任务量T_i为规模的平方，即For the submatrix D _i corresponding to cluster C _i , the task volume T _i is defined as the square of the scale, that is,

T_i＝(M_i·col)² T _i =(M _i· col) ²

选择分配给当前拥有最大负载因子的节点，并更新负载因子，即Select the node that currently has the largest load factor and update the load factor, that is,

L′_Node＝L_Node+T_i L′ _Node = L _Node + _Ti

重复上述步骤直到完成分配。Repeat the above steps until the assignment is complete.

所述设共划分了N个子矩阵，其中id＝i的簇对应的子矩阵M_i规模为m_i×n_i，步骤S2把它分解为一个低秩数据矩阵A_i、一个稀疏异常矩阵B_i、一个小幅噪声矩阵C_i和一个错误矩阵E_i，则最优化问题表示为：The assumption is that a total of N sub-matrices are divided, where the sub-matrix M _i corresponding to the cluster with id=i is of size m _i ×n _i . Step S2 decomposes it into a low-rank data matrix A _i , a sparse anomaly matrix B _i , a small noise matrix C _i and an error matrix E _i . The optimization problem is expressed as:

minimize: minimize:

subject to: subject to:

D_i·*E_i＝D_i Di _· *E _i = _Di

其中α、β、σ是三个对应矩阵分量的系数，以满足线性变换和量纲转换的需要；X、Y、Z为全局系数矩阵；D_i为与E_i等大的用于将E_i从二元矩阵转换为数值矩阵的中间变量；和为增添的辅助变量，用于解耦目标和约束中A_i和B_i变量；为惩罚项，其中γ为对应的系数，K为惩罚项的数量，U^k、V^k和R^k是三个根据领域知识设置的全局矩阵参数。Where α, β, σ are the coefficients of the three corresponding matrix components to meet the needs of linear transformation and dimensional conversion; X, Y, Z are global coefficient matrices; _Di is an intermediate variable of the same size as _Ei that is used to convert _Ei from a binary matrix to a numerical matrix; and It is an additional auxiliary variable used to decouple the _Ai and _Bi variables in the objective and constraints; is the penalty term, where γ is the corresponding coefficient, K is the number of penalty terms, and U ^k , V ^k and R ^k are three global matrix parameters set according to domain knowledge.

所述S302中，比较本次拆分结果与上次的拆分结果之间的差异性采用的计算方法为：In S302, the calculation method used to compare the difference between the current split result and the previous split result is:

其中，Lab_Diff表示得到的划分差异性结果，N为子矩阵个数，Labt(i)表示第t次迭代的拆分结果，Labt-1(i)表示第t次迭代的上一次迭代的拆分结果。Wherein, Lab_Diff represents the obtained partition difference result, N is the number of sub-matrices, Labt(i) represents the split result of the t-th iteration, and Labt-1(i) represents the split result of the previous iteration before the t-th iteration.

所述方法通过调整数据分解阶段的划分方案对数据恢复性能进行预测，所述数据恢复性能包括恢复效果和恢复时间，其中，根据子矩阵数量预测恢复效果，根据子矩阵规模的平方和预测恢复时间。The method predicts data recovery performance by adjusting the partitioning scheme in the data decomposition stage, and the data recovery performance includes recovery effect and recovery time, wherein the recovery effect is predicted according to the number of sub-matrices, and the recovery time is predicted according to the sum of squares of the sub-matrix sizes.

所述恢复时间的预测方法为：The prediction method of the recovery time is:

其中，N为拆分的子矩阵个数，k′为时间系数，m×n为输入矩阵规模，Square_Sum为簇规模平方和，Where N is the number of sub-matrices to be split, k′ is the time coefficient, m×n is the input matrix size, Square_Sum is the sum of squares of cluster sizes,

恢复效果使用归一化的平均绝对误差NMAE来衡量，定义为The recovery effect is measured using the normalized mean absolute error (NMAE), defined as

其中，A表示理想的数据矩阵，表示步骤S3输出的恢复后的数据矩阵，M表示带有缺陷的原始矩阵。Where A represents the ideal data matrix, represents the restored data matrix outputted in step S3, and M represents the original matrix with defects.

与现有技术相比，本发明具有以下有益效果：Compared with the prior art, the present invention has the following beneficial effects:

(1)相较于在最优化层面的算法如ADMM，本发明主要关注输入数据矩阵而不是最优化目标和约束的拆分，这种拆分更加直接，对节约计算时间的效果更加明显，并且，本发明在迭代步骤时也能够使用此类最优化算法进行求解。(1) Compared with algorithms at the optimization level such as ADMM, the present invention mainly focuses on the splitting of the input data matrix rather than the optimization objectives and constraints. This splitting is more direct and has a more obvious effect on saving computing time. In addition, the present invention can also use this type of optimization algorithm to solve the problem during the iterative step.

(2)相较于一般的分布式调度算法，本发明能更充分考虑数据矩阵的内部结构，避免随意拆分破坏内在结构，更好地发挥节点间的协同作用。(2) Compared with general distributed scheduling algorithms, the present invention can more fully consider the internal structure of the data matrix, avoid arbitrary splitting and destruction of the internal structure, and better play the synergy between nodes.

(3)相较于边缘计算，本发明能够较好地适配边缘场景，可以选择把部分原始子矩阵传输给边缘节点或云端进行求解以充分利用算力。不过在此场景中，可能需要根据传输开销和多层级节点优先级的影响进行额外的最优化求解，以更好的发挥边缘计算的优势。(3) Compared with edge computing, the present invention can better adapt to edge scenarios, and can choose to transmit part of the original sub-matrix to the edge node or the cloud for solution to make full use of computing power. However, in this scenario, additional optimization solutions may be required based on the influence of transmission overhead and multi-level node priority to better play the advantages of edge computing.

(4)鲁棒性高：不同于很多只针对特定缺陷的恢复算法，本发明允许原始矩阵中同时存在噪声、缺失和异常值等多种缺陷，大大提高了恢复的鲁棒性。(4) High robustness: Unlike many recovery algorithms that only target specific defects, the present invention allows the original matrix to simultaneously contain multiple defects such as noise, missing values, and outliers, greatly improving the robustness of the recovery.

(5)可调性强：本发明在第一次划分时就能够预测恢复时间和效果，用户可以灵活调整聚类参数，平衡好数据的恢复效果和运行时间的矛盾。此外，本发明允许用户限制最大运行时间和迭代次数，为恢复性能与计算资源之间的平衡提供了更多选择。(5) Strong adjustability: The present invention can predict the recovery time and effect at the first partitioning, and users can flexibly adjust clustering parameters to balance the contradiction between data recovery effect and running time. In addition, the present invention allows users to limit the maximum running time and number of iterations, providing more options for balancing recovery performance and computing resources.

(6)使用方便：一方面，本发明中部分参数可以自适应调整，减小了配置的复杂性；另一方面，用户既可以根据具体数据源自定义划分时的聚类方法，也可以使用通用划分方案，能够满足大多数场景的需要。(6) Easy to use: On the one hand, some parameters in the present invention can be adaptively adjusted, reducing the complexity of configuration; on the other hand, users can customize the clustering method when dividing according to the specific data source, or use a general division scheme, which can meet the needs of most scenarios.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1为本发明的流程图；Fig. 1 is a flow chart of the present invention;

图2为本发明的具体实现步骤图；Fig. 2 is a diagram of specific implementation steps of the present invention;

图3为一种实施例中的聚类部分伪代码；FIG3 is a pseudo code of a clustering part in an embodiment;

图4为一种实施例中不同错误率下的恢复性能比较图。FIG. 4 is a graph comparing recovery performance under different error rates in an embodiment.

具体实施方式DETAILED DESCRIPTION

下面结合附图和具体实施例对本发明进行详细说明。本实施例以本发明技术方案为前提进行实施，给出了详细的实施方式和具体的操作过程，但本发明的保护范围不限于下述的实施例。The present invention is described in detail below in conjunction with the accompanying drawings and specific embodiments. This embodiment is implemented based on the technical solution of the present invention, and provides a detailed implementation method and specific operation process, but the protection scope of the present invention is not limited to the following embodiments.

针对现有技术存在的缺陷，本发明充分利用了边缘网络的结构，结合现有数据恢复算法LENS开发了一套分布式数据恢复算法。在边缘服务器上，边缘节点采集的感知数据将会按照一定的规则被收集、重新排列、分发，原始数据矩阵会被划分为多个原始子矩阵，传递给与之相连的边缘节点上。在每一个边缘节点，每个原始子矩阵会被精确拆分为低秩数据矩阵、稀疏异常矩阵、小幅噪声矩阵和错误矩阵。这样的拆分自然反映了现实世界中数据固有的结构，明确反映和表现了数据中可能存在的三种缺陷，远比现有的压缩感知技术更能够直观清晰的提高数据质量。之后，本实施例表明这个问题是一个凸优化问题，并使用了一种交替下降法ADM进行求解。In view of the defects of the prior art, the present invention makes full use of the structure of the edge network and develops a set of distributed data recovery algorithms in combination with the existing data recovery algorithm LENS. On the edge server, the perception data collected by the edge node will be collected, rearranged, and distributed according to certain rules, and the original data matrix will be divided into multiple original sub-matrices and passed to the edge nodes connected to it. At each edge node, each original sub-matrix will be accurately split into a low-rank data matrix, a sparse anomaly matrix, a small noise matrix, and an error matrix. Such a split naturally reflects the inherent structure of data in the real world, clearly reflects and expresses the three defects that may exist in the data, and can improve data quality more intuitively and clearly than the existing compressed sensing technology. Afterwards, this embodiment shows that this problem is a convex optimization problem, and uses an alternating descent method ADM to solve it.

本发明是一种基于压缩感知的高鲁棒性分布式数据恢复方法，命名为LADEN。该方法利用原始数据中可能存在多种缺陷的自身特性，将数据的恢复过程转换为一个最优化问题，通过对该问题目标和约束的巧妙设计，从原始数据中分解出数据矩阵、噪声矩阵、异常矩阵和错误矩阵，并设计了一套分布式算法以大幅降低对算力的需求，进而实现数据的高鲁棒高精确性快速恢复。除此之外，LADEN还提供了一种恢复效果与恢复时间的均衡方法以供使用者取舍：通过理论和实验证明，恢复效果与子矩阵数量、恢复时间与子矩阵规模的平方和这两组数据均成高度线性关系，因此可以通过调整分解阶段的划分方案，直接影响和预测LADEN最终的恢复性能。图1为本发明的流程示意图。The present invention is a highly robust distributed data recovery method based on compressed sensing, named LADEN. This method utilizes the inherent characteristics of the original data that may have multiple defects, converts the data recovery process into an optimization problem, and decomposes the data matrix, noise matrix, abnormal matrix and error matrix from the original data through the clever design of the problem's objectives and constraints, and designs a set of distributed algorithms to greatly reduce the demand for computing power, thereby achieving high robustness and high accuracy of data rapid recovery. In addition, LADEN also provides a method for balancing recovery effect and recovery time for users to choose: It has been proved through theory and experiments that the two sets of data, the recovery effect and the number of sub-matrices, and the recovery time and the sum of the squares of the sub-matrix sizes, are highly linearly related. Therefore, the final recovery performance of LADEN can be directly affected and predicted by adjusting the division scheme in the decomposition stage. Figure 1 is a flow chart of the present invention.

从示意图中可以看出，本发明主要面向分布式系统。其中“传感器”代表了数据源，而LADEN主要逻辑和步骤都在主节点和从节点。主节点只有一个，负责执行步骤S1数据分解和S3数据增强，而从节点可能有多个，负责执行步骤S2数据恢复。需要说明的是，这里的节点仅作为逻辑执行的区分，一个节点既可以是一个线程，也可以是一台服务器，都不影响算法的正确性。As can be seen from the schematic diagram, the present invention is mainly aimed at distributed systems. The "sensor" represents the data source, and the main logic and steps of LADEN are in the master node and the slave node. There is only one master node, which is responsible for executing step S1 data decomposition and S3 data enhancement, and there may be multiple slave nodes, which are responsible for executing step S2 data recovery. It should be noted that the nodes here are only used as a distinction for logical execution. A node can be either a thread or a server, which does not affect the correctness of the algorithm.

本发明的名字LADEN取自5个单词的首字母，包括：低秩Low-rank、异常Anomaly、分布式Distributed、错误Error、噪声Noise，他们也是本发明的关键词。The name LADEN of the present invention is taken from the first letters of five words, including: Low-rank, Anomaly, Distributed, Error, and Noise, which are also the keywords of the present invention.

具体的，如图2所示，所述LADEN方法包括以下步骤：Specifically, as shown in FIG2 , the LADEN method comprises the following steps:

S1，分布式数据分解S1, distributed data decomposition

S101，主节点获取待恢复的原始数据M0后，首先为其中的空缺位置临时填充0，再通过SVD分解检查原始数据的低秩性是否符合要求，若符合低秩性要求则执行下一步，否则报错，发出警告以提醒用户。S101, after the master node obtains the original data M0 to be restored, it first temporarily fills the empty positions with 0, and then checks whether the low rank of the original data meets the requirements through SVD decomposition. If it meets the low rank requirements, it executes the next step, otherwise it reports an error and issues a warning to remind the user.

具体的，在初始化阶段，主节点会负责接收需要恢复的原始数据，这样的原始数据往往会充满瑕疵。为了后续更精确地进行恢复操作，主节点会首先会在所有缺失的位置临时填充上数值0，再对填充后的原始数据矩阵M0进行奇异值分解(SVD)。奇异值分解是一种在线性代数和数值分析中广泛应用的矩阵分解方法，通过保留部分奇异值和相应的奇异向量，可以近似地重构原始矩阵，实现对数据的特征提取，以计算其中的秩。分解形式如下：Specifically, in the initialization phase, the master node will be responsible for receiving the original data that needs to be restored, which is often full of flaws. In order to perform the recovery operation more accurately later, the master node will first temporarily fill all missing positions with the value 0, and then perform singular value decomposition (SVD) on the filled original data matrix M0. Singular value decomposition is a matrix decomposition method widely used in linear algebra and numerical analysis. By retaining some singular values and corresponding singular vectors, the original matrix can be approximately reconstructed to achieve feature extraction of the data and calculate its rank. The decomposition form is as follows:

M＝UΣV^T M＝UΣV ^T

其中，U和V都是酉矩阵，U是一个列正交矩阵，V是一个行正交矩阵，矩阵Σ主对角线上的元素为奇异值，其余均为0。由于原始数据中含有多种缺陷，特别是噪声和异常值的影响，按照传统方法计算出的矩阵M秩会显著偏大失真。因此，可以对奇异值进行从大至小的排序，并选择占有前σ强度的奇异值数量定义为矩阵M的秩rank_σ(M)，σ为接近1的比例。不失一般性地，本实施例选择σ＝90％。之后，再定义Low_Rank值为矩阵秩与矩阵列数之比，用于衡量数据矩阵低秩性的强弱，即低秩性指标。Among them, U and V are both unitary matrices, U is a column orthogonal matrix, V is a row orthogonal matrix, the elements on the main diagonal of the matrix Σ are singular values, and the rest are 0. Due to the various defects in the original data, especially the influence of noise and outliers, the rank of the matrix M calculated by the traditional method will be significantly larger and distorted. Therefore, the singular values can be sorted from large to small, and the number of singular values occupying the top σ intensity is defined as the rank _σ (M) of the matrix M, and σ is a ratio close to 1. Without loss of generality, this embodiment selects σ=90%. Afterwards, the Low_Rank value is defined as the ratio of the matrix rank to the number of matrix columns, which is used to measure the strength of the low rank of the data matrix, that is, the low rank index.

S102，根据预先配置的参数，主节点对原始数据M0进行聚类以拆分矩阵，同一个簇中的数据被拼接为一个子矩阵，以此得到多个子矩阵，记录划分结果为Lab0。S102, according to the pre-configured parameters, the master node clusters the original data M0 to split the matrix, and the data in the same cluster is spliced into a sub-matrix, thereby obtaining multiple sub-matrices, and the division result is recorded as Lab0.

假设输入是规模为m×n的矩阵M，在物联网中这样的数据往往来源于n个传感器在m个时刻的采集，这种数据也可以被看作是m个n维数据点的集合。而聚类作为一种典型的非监督学习方法，非常适合根据相似特征划分点集，有助于发现隐藏结构和理解数据中的模式，能够保证划分后子矩阵的低秩特性。Assume that the input is a matrix M of size m×n. In the Internet of Things, such data often comes from the collection of n sensors at m moments. This data can also be regarded as a collection of m n-dimensional data points. Clustering, as a typical unsupervised learning method, is very suitable for dividing point sets according to similar features, which helps to discover hidden structures and understand patterns in data, and can ensure the low-rank characteristics of the sub-matrices after division.

尽管目前主流聚类算法都可以用于拆分，但LADEN提供一个更通用的高效聚类算法。LADEN中的划分方案主要参考了基于密度的聚类算法OPTICS，主要思想是通过计算构建可达性图，从而识别聚类结构，尤其适用于处理具有不同密度和大小的簇。它特点在于不需要预先指定簇的数量，而是根据数据的分布自动确定簇。此外，它也有对噪声和离群点相对不敏感、对簇的大小和形状自适应的优点。Although the current mainstream clustering algorithms can be used for splitting, LADEN provides a more general and efficient clustering algorithm. The partitioning scheme in LADEN mainly refers to the density-based clustering algorithm OPTICS. The main idea is to construct a reachability graph by calculation to identify the clustering structure, which is particularly suitable for processing clusters with different densities and sizes. Its feature is that it does not need to pre-specify the number of clusters, but automatically determines the clusters based on the distribution of the data. In addition, it also has the advantages of being relatively insensitive to noise and outliers, and adaptive to the size and shape of the cluster.

在LADEN聚类算法需要指定最小邻居数MinPts和最大邻域半径MaxEps。根据这两个参数，可以将点集中的点分为三类：核心点，边界点和离群点。如果一个点在不超过MaxEps的邻域半径内至少有MinPts个点，那么该点就被认为是一个核心点。如果一个点不是核心点，但邻域内存在核心点，那么该点被认为是一个边界点。其余的点被认为是不属于任何簇的离群点。之后，需要量化点之间的可达性，定义为两点间距离与起始点o“核心距离”的较大值，即In the LADEN clustering algorithm, it is necessary to specify the minimum number of neighbors MinPts and the maximum neighborhood radius MaxEps. Based on these two parameters, the points in the point set can be divided into three categories: core points, boundary points, and outliers. If a point has at least MinPts points within a neighborhood radius not exceeding MaxEps, then the point is considered to be a core point. If a point is not a core point, but there are core points in the neighborhood, then the point is considered to be a boundary point. The remaining points are considered to be outliers that do not belong to any cluster. Afterwards, it is necessary to quantify the reachability between points, which is defined as the larger value of the distance between the two points and the "core distance" of the starting point o, that is,

Reachability(o→p)＝max{Distance(o,p),Core(o)}Reachability(o→p)=max{Distance(o,p),Core(o)}

其中核心距离仅对核心点有效，指的是使核心点的邻居数为MinPts的最小邻域半径，否则为无穷大∞，即The core distance is only valid for core points, which refers to the minimum neighborhood radius that makes the number of neighbors of the core point equal to MinPts, otherwise it is infinite ∞, that is,

从上面两个式子可以看出，核心距离是在距离MinPts和MaxEps参数下点自身的一种性质，而可达性是一个点到另一个点的距离度量，具有不对称性。可达性越小，说明点的密度越大。From the above two formulas, we can see that the core distance is a property of the point itself under the distance parameters MinPts and MaxEps, while the reachability is a distance measure from one point to another, which is asymmetric. The smaller the reachability, the greater the density of the points.

在定义了这两种距离后，本实施例使用了一个最小队列，可以近似地由可达性衡量数据集中密度聚类结构。它按照贪心原则从密集区域中找出当前可达性最低的点，也就是期望密度最高的点。将这个点从最小队列转移到当前簇序列的末端，再在最小队列中更新其邻居的可达性。如此迭代至最小队列为空，就可以得到一个按照可达性排序的增广簇序列。After defining these two distances, this embodiment uses a minimum queue, which can approximate the density clustering structure of the data set by the reachability measurement. It finds the point with the lowest current reachability from the dense area according to the greedy principle, that is, the point with the highest expected density. This point is transferred from the minimum queue to the end of the current cluster sequence, and then the reachability of its neighbors is updated in the minimum queue. Iterate until the minimum queue is empty, and you can get an augmented cluster sequence sorted by reachability.

在增广簇序列中，可以按照密度从高到低的顺序寻找簇，对应的邻域半径也总是从小到大逐渐增长。主要方法是把全局可达性的分位数作为阈值以分割序列，少数高于阈值的点被判定为离群点，其余低于该阈值的部分会形成多个不连续区域，每个区域代表一个可达性“低谷”，谷中的低可达性代表高密度的簇。这里本实施例优化了对区域末尾点的判定，充分利用生成序列过程中最小堆结构自身维护先驱结点的特性来判决末尾点是否应该被剔除出簇。该优化可以纠正在边界条件上的错误聚类。In the augmented cluster sequence, clusters can be found in order of density from high to low, and the corresponding neighborhood radius always increases gradually from small to large. The main method is to use the quantile of global reachability as a threshold to split the sequence. A few points above the threshold are judged as outliers, and the rest below the threshold will form multiple discontinuous areas, each of which represents a "valley" of reachability. The low reachability in the valley represents a high-density cluster. Here, this embodiment optimizes the determination of the end point of the region, making full use of the characteristic of the minimum heap structure itself to maintain the pioneer node in the process of generating the sequence to determine whether the end point should be removed from the cluster. This optimization can correct incorrect clustering on boundary conditions.

需要说明的是，增广簇序列这种聚类顺序天然地就方便区分开不同密度的簇，使得邻域半径的自适应性成为可能，因此不需要显式指定邻域半径，而是只需要限制最大邻域半径MaxEps。注意到，簇数量与MaxEps单调递增。在单调性的保证下，如果是为了找到足够多的簇，可以把MaxEps设置为正无穷，再根据需要的簇规模下限调整MinPts即可。但不失一般性的，LADEN中默认设置Minpts＝2，通过改变距离参数Epsilon改变划分结果。It should be noted that the clustering order of the augmented cluster sequence naturally makes it easy to distinguish clusters of different densities, making the adaptability of the neighborhood radius possible. Therefore, there is no need to explicitly specify the neighborhood radius, but only to limit the maximum neighborhood radius MaxEps. Note that the number of clusters increases monotonically with MaxEps. Under the guarantee of monotonicity, if the purpose is to find enough clusters, MaxEps can be set to positive infinity, and then MinPts can be adjusted according to the lower limit of the required cluster size. But without loss of generality, Minpts=2 is set by default in LADEN, and the division result is changed by changing the distance parameter Epsilon.

LADEN聚类部分伪代码如图3。The pseudo code of LADEN clustering is shown in Figure 3.

完成聚类后，所有簇按照规模大小降序排列后从1开始标注id，即id＝1的簇包含最多的数据点。所有被标记的离群点还会聚合成一个额外的簇，记录id＝0。之后，M0中所有点的划分可以由向量Lab0记录，Lab0(i)＝id_j，代表第i个点被划分给了簇j。After clustering, all clusters are sorted in descending order of size and labeled with ids starting from 1, i.e., the cluster with id=1 contains the most data points. All labeled outliers are also aggregated into an additional cluster with id=0. Afterwards, the division of all points in M0 can be recorded by vector Lab0, where Lab0(i)=id _j , indicating that the i-th point is assigned to cluster j.

S103，主节点将拆分后的子矩阵以矩阵规模为权重，均匀发送给多个从节点进行数据恢复。S103, the master node sends the split sub-matrices to multiple slave nodes evenly with the matrix size as the weight for data recovery.

在主节点完成对原始数据矩阵的拆分后，就能利用网络连接向下属的从节点下发各个子矩阵；网络连接包括但不限于广域无线网、有线网络等。为了保证从节点间公平性并提高效率，将会以规模(即行数量)的平方作为权重，利用负载因子在从节点间进行任务分配，主要过程如下：After the master node has completed the splitting of the original data matrix, it can use the network connection to send each sub-matrix to the subordinate slave nodes; the network connection includes but is not limited to wide area wireless network, wired network, etc. In order to ensure fairness among slave nodes and improve efficiency, the square of the scale (i.e. the number of rows) will be used as the weight, and the load factor will be used to distribute tasks among slave nodes. The main process is as follows:

本实施例预先为第j个节点按照计算能力配置权重w_j，默认全为1。节点初始负载L_j＝1，定义负载因子 In this embodiment, the weight w _j is pre-configured for the jth node according to the computing capacity, and the default value is 1. The initial load of the node L _j = 1, and the load factor is defined as

对于簇C_i对应的子矩阵D_i时，定义任务量T_i为规模的平方，即For the submatrix D _i corresponding to cluster C _i , the task volume T _i is defined as the square of the scale, that is,

T_i＝(M_i·col)² T _i =(M _i· col) ²

L′_Node＝L_Node+T_i L′ _Node = L _Node + _Ti

重复上述步骤直到完成分配。由于上文已经对所有簇进行了降序排序，该分配策略得到的结果是近似最优的，保证了从节点计算资源能得到最高程度的利用。Repeat the above steps until the allocation is completed. Since all clusters have been sorted in descending order above, the result obtained by this allocation strategy is approximately optimal, ensuring that the computing resources of the slave nodes can be utilized to the highest extent.

S2，高鲁棒数据恢复S2, highly robust data recovery

S201，每个从节点从对应的子矩阵中提取所有空缺位置，构造错误矩阵。S201, each slave node extracts all vacant positions from the corresponding sub-matrix to construct an error matrix.

从节点在接收到子矩阵后，需要考虑其中可能包含的三种缺陷：错误、异常和噪声。After receiving the sub-matrix, the slave node needs to consider three possible defects: errors, anomalies, and noise.

首先，由于聚类拆分的过程中充分考虑到了矩阵的内部结构，在输入原始矩阵近似低秩的前提下，该步骤接受到的子矩阵也会是近似低秩的，这就为恢复过程指明了目标方向。根据前提假设和对现有数据集的校验，传感器从物理世界中采集到的数据矩阵往往呈现出低秩性的特征。从本质上看，这样的低秩性是数据本身冗余的体现。矩阵的秩越低，说明冗余越多，对恢复越有利。在度量上，计算秩的直观方法是使用矩阵的初等变换。不过为了减小数据中缺陷和浮点数计算可能带来的误差，本实施例仍然选择通过SVD分解计算奇异值的方法判断矩阵的秩，可以沿用Low_Rank值作为对子矩阵中数据分量的评价指标。对于id超过0的子矩阵，其低秩性不会劣于原始数据矩阵，即First of all, since the internal structure of the matrix is fully considered in the process of cluster splitting, under the premise that the input original matrix is approximately low-rank, the sub-matrix received in this step will also be approximately low-rank, which indicates the target direction for the recovery process. According to the premise assumptions and the verification of existing data sets, the data matrices collected by sensors from the physical world often show low-rank characteristics. In essence, such low-rank is a reflection of the redundancy of the data itself. The lower the rank of the matrix, the more redundancy it has, and the more beneficial it is for recovery. In terms of measurement, the intuitive method to calculate the rank is to use elementary transformations of the matrix. However, in order to reduce defects in the data and errors that may be caused by floating-point calculations, this embodiment still chooses to determine the rank of the matrix by calculating singular values through SVD decomposition, and the Low_Rank value can be used as an evaluation indicator for the data components in the sub-matrix. For sub-matrices with id greater than 0, their low rank will not be inferior to the original data matrix, that is

Low_Rank(M_i)≥Low_Rank(M)Low_Rank(M _i )≥Low_Rank(M)

为了方便表达，后续使用范数的形式表示奇异值之和。因此，本实施例把第i个子矩阵中的低秩数据矩阵分量记为‖A_i‖_S。For the convenience of expression, the sum of singular values is subsequently expressed in the form of a norm. Therefore, in this embodiment, the low-rank data matrix component in the i-th sub-matrix is recorded as ‖A _i ‖ _S .

之后，分析三种缺陷的特点。由前文可知，异常值往往来自于突发事件。相比于通常情况下的正常数据，这样的异常数据必然是特殊且稀少的，那么由异常值组成的异常矩阵必然是一个稀疏矩阵。因此，把矩阵中所有元素绝对值之和作为子矩阵稀疏分量的优化目标，以矩阵曼哈顿范数(Manhattan Norm)的形式记录稀疏异常矩阵，即‖B_i‖_M＝∑|b|。After that, the characteristics of the three defects are analyzed. As mentioned above, outliers often come from emergencies. Compared with normal data under normal circumstances, such abnormal data must be special and rare, so the abnormal matrix composed of outliers must be a sparse matrix. Therefore, the sum of the absolute values of all elements in the matrix is used as the optimization target of the sparse component of the submatrix, and the sparse abnormal matrix is recorded in the form of the Manhattan norm of the matrix, that is, ‖B _i ‖ _M ＝∑|b|.

另一种普遍存在的缺陷是噪声。物联网数据中的噪声主要来源于传感器采集过程中的随机扰动，包括：电磁干扰、机械振动、温湿度和光照变化、电源噪声等，在数据传输过程中也可能由于传输介质和数据压缩引入更多噪声。尽管在计算Low_Rank值时可以通过舍弃较小奇异值的方法减小噪声对判定低秩性破坏，但在恢复过程中还必须充分考虑噪声的影响以提高数据的准确性。这里本实施例使用高斯噪声模型，认为上述噪声之和服从正态分布，那么子矩阵中噪声的整体幅度必然偏小且普遍存在，与异常值存在明显的特征差异。因此可以使用矩阵中所有元素平方和作为子矩阵噪声分量的优化目标，以F范数(Frobenius Norm)平方的形式记录小幅噪声矩阵，即 Another common defect is noise. The noise in IoT data mainly comes from random disturbances during the sensor acquisition process, including: electromagnetic interference, mechanical vibration, temperature and humidity and light changes, power supply noise, etc. During the data transmission process, more noise may also be introduced due to the transmission medium and data compression. Although the damage to the low-rank judgment can be reduced by discarding smaller singular values when calculating the Low_Rank value, the impact of noise must be fully considered in the recovery process to improve the accuracy of the data. Here, this embodiment uses a Gaussian noise model, and it is assumed that the sum of the above noises obeys a normal distribution. Then the overall amplitude of the noise in the submatrix must be small and ubiquitous, and there is a clear characteristic difference from the outliers. Therefore, the sum of the squares of all elements in the matrix can be used as the optimization target of the noise component of the submatrix, and the small noise matrix is recorded in the form of the square of the F norm (Frobenius Norm), that is,

由于传感器的偶发故障和通信传输中丢包的情况，数据中可能会出现错误，被标注为NaN(Not a Number)。与异常和噪声不同，子矩阵M_i中的错误位置明确，其错误分量可以视作一个已知的0-1二元矩阵E_i。为了更好地在约束项中表示错误矩阵分量，需要引入与E_i一个等大的D_i作为从二元矩阵转换为数值矩阵的中间变量，以点乘的形式约束，即Due to occasional sensor failures and packet loss during communication transmission, errors may occur in the data and are marked as NaN (Not a Number). Unlike anomalies and noise, the error position in the submatrix _Mi is clear, and its error component can be regarded as a known 0-1 binary matrix _Ei . In order to better represent the error matrix components in the constraint terms, it is necessary to introduce _Di, which is equal to _Ei, as an intermediate variable for converting from a binary matrix to a numerical matrix, and constrain it in the form of dot product, that is,

D_i·*E_i＝D_i Di _· *E _i = _Di

其中，当且仅当E_i中某处等于1时，D_i中对应位置不为0，代表此位置出现了错误。因此可以将M_i中对应位置置0，把错误分量从子矩阵中剥离。Among them, if and only if a certain position in E _i is equal to 1, the corresponding position in D _i is not 0, which means that an error has occurred at this position. Therefore, the corresponding position in M _i can be set to 0 to remove the error component from the sub-matrix.

S202，根据数据中缺陷特点，构建最优化问题的目标和优化约束。S202, constructing the goal and optimization constraints of the optimization problem according to the defect characteristics in the data.

根据步骤S201的分解结构，设共划分了N个子矩阵，其中id＝i的簇对应的子矩阵M_i规模为m_i×n_i，需要把它分解为一个低秩数据矩阵A_i、一个稀疏异常矩阵B_i、一个小幅噪声矩阵C_i和一个错误矩阵E_i，那么就可以初步得到如下的凸优化问题：According to the decomposition structure of step S201, suppose a total of N sub-matrices are divided, where the sub-matrix M _i corresponding to the cluster with id=i is of size m _i ×n _i , and it needs to be decomposed into a low-rank data matrix A _i , a sparse anomaly matrix B _i , a small noise matrix C _i and an error matrix E _i , then the following convex optimization problem can be preliminarily obtained:

minimize: minimize:

subject to:A_i+B_i+C_i+D_i＝M_i subject to:A _i +B _i +C _i +D _i =M _i

其中α、β、σ是三个对应矩阵分量的系数，以满足线性变换和量纲转换的需要。Among them, α, β, and σ are the coefficients of three corresponding matrix components to meet the needs of linear transformation and dimensional conversion.

他们的默认值设定为其中代表了矩阵M_i的标准差。Their default values are set to in Represents the standard deviation of the matrix _Mi.

为了更好地满足和适应现实中不同领域数据恢复的多变需求，本发明还进行两处修改。第一部分，考虑到子矩阵分量的低秩、稀疏和小幅的特性不一定是显式的，因此需要允许用户根据已有经验分别配置对应线性变化的过程，以更好地捕捉三个分量自身的特征。从形式上就是添加三个全局系数矩阵，记为X、Y和Z。需要说明的是，这三个矩阵均不带有下标i以表明他们是全局共享的，但实际计算时会根据具体子矩阵的规模大小做出调整以满足矩阵乘法的要求。默认情况下，X、Y和Z均为对角矩阵。In order to better meet and adapt to the changing needs of data recovery in different fields in reality, the present invention also makes two modifications. In the first part, considering that the low-rank, sparse and small-amplitude characteristics of the submatrix components are not necessarily explicit, it is necessary to allow the user to configure the corresponding linear change process separately according to existing experience to better capture the characteristics of the three components themselves. Formally, three global coefficient matrices are added, denoted as X, Y and Z. It should be noted that these three matrices do not have the subscript i to indicate that they are globally shared, but in actual calculations, adjustments will be made according to the size of the specific submatrix to meet the requirements of matrix multiplication. By default, X, Y and Z are all diagonal matrices.

第二部分，还需要在目标中增添一个惩罚项以充分发挥具体领域中先验知识对检验底层数据结构的重要指导作用，其中γ为对应的系数，K为惩罚项的数量，U^k、V^k和R^k是三个可以根据领域知识设置的全局矩阵参数。默认情况下，为了更好地捕捉数据的局部不变性，设定K＝1，U¹为对角矩阵，R¹为零向量，V¹为常对角阵(Toeplitz Matrix)且主对角线为1，上对角线为-1，其余位置为0。这种默认设定可以减小数据矩阵A_i中相邻位置的数值差，体现时域上的局部稳定性。类似第一部分，这些参数也是全局共享且可根据A_i规模进行调整的。In the second part, we need to add a penalty item to the target In order to give full play to the important guiding role of prior knowledge in specific fields in testing the underlying data structure, γ is the corresponding coefficient, K is the number of penalty terms, and U ^k , V ^k and R ^k are three global matrix parameters that can be set according to domain knowledge. By default, in order to better capture the local invariance of the data, K = 1 is set, U ¹ is a diagonal matrix, R ¹ is a zero vector, V ¹ is a constant diagonal matrix (Toeplitz Matrix) with the main diagonal of 1, the upper diagonal of -1, and the rest of the positions are 0. This default setting can reduce the numerical difference between adjacent positions in the data matrix _Ai , reflecting the local stability in the time domain. Similar to the first part, these parameters are also globally shared and can be adjusted according to the scale of _Ai .

出于解耦目标和约束中A_i和B_i变量的目的，再增添辅助变量和得到如下形式：In order to decouple the _Ai and _Bi variables in the objective and constraints, auxiliary variables are added and The following form is obtained:

minimize: minimize:

subject to: subject to:

D_i·*E_i＝D_i Di _· *E _i = _Di

S203，带入约束项得到增广拉格朗日公式，每个从节点利用交替方向法ADM最小化对应的增广拉格朗日公式，求解最优化问题，得到分解出的数据矩阵、噪声矩阵和异常矩阵。S203, the constraints are brought in to obtain the augmented Lagrangian formula, and each slave node uses the alternating direction method ADM to minimize the corresponding augmented Lagrangian formula to solve the optimization problem and obtain the decomposed data matrix, noise matrix and anomaly matrix.

将步骤S202中的最优化目标带入约束项，得到如下的增广拉格朗日公式：Substituting the optimization objective in step S202 into the constraint term, the following augmented Lagrangian formula is obtained:

其中<P,A>代表矩阵P和A点乘之和，P_i、Q_i都是拉格朗日算子。这样，就可以把原本的有等式约束的优化问题转换为一个无约束优化问题。当增广拉格朗日函数取最小值时，其偏导一定为0，必定满足原来的等式约束。Where <P,A> represents the sum of the dot product of matrix P and A, _Pi , _Qi are all Lagrangian operators. In this way, the original optimization problem with equality constraints can be converted into an unconstrained optimization problem. When the augmented Lagrangian function takes the minimum value, its partial derivative must be 0, which must satisfy the original equality constraint.

针对最优化问题有多种凸优化求解工具可供选择，如CVX。本实施例使用交替方向法ADM对该最优化问题进行求解，收敛后即可得到子矩阵M_i中的三个分量：低秩数据矩阵A_i、稀疏异常矩阵B_i和小幅噪声矩阵C_i。For optimization problems There are many convex optimization solving tools available, such as CVX. This embodiment uses the alternating direction method ADM to solve the optimization problem. After convergence, three components in the submatrix _Mi can be obtained: low-rank data matrix _Ai , sparse anomaly matrix _Bi and small noise matrix _Ci .

完成求解后，从节点向主节点返回分解结果。After the solution is completed, the slave node returns the decomposition result to the master node.

S3，迭代式数据增强S3, iterative data augmentation

S301，从节点完成矩阵分解后，主节点汇总各个从节点上所有子矩阵的分解结果，按照拆分位置重新拼接得到4个与输入的原始数据规模一致的矩阵，其中包含恢复后的数据矩阵Mt，Mt表示第t次迭代得到的恢复后的数据矩阵。S301, after the slave nodes complete the matrix decomposition, the master node summarizes the decomposition results of all sub-matrices on each slave node, and reassembles them according to the split positions to obtain 4 matrices consistent with the scale of the input original data, including the recovered data matrix Mt, Mt represents the recovered data matrix obtained at the tth iteration.

具体的，在第一次执行本步骤时，待从节点完成分解后，主节点可以按照步骤S102中分解步骤的下标记录Lab0，逆序进行拼接。记步骤S101中输入矩阵为M0，那么所有分解结果就可以拼接得到4个与M0等大的矩阵，分别代表了低秩数据矩阵、稀疏异常矩阵、小幅噪声矩阵和错误矩阵。本发明主要关注低秩数据矩阵，记为M1。Specifically, when this step is executed for the first time, after the slave node completes the decomposition, the master node can splice in reverse order according to the subscript record Lab0 of the decomposition step in step S102. Let the input matrix in step S101 be M0, then all the decomposition results can be spliced to obtain 4 matrices of the same size as M0, representing the low-rank data matrix, sparse anomaly matrix, small noise matrix and error matrix respectively. The present invention mainly focuses on the low-rank data matrix, denoted as M1.

同理，当进行了多次迭代时，主节点可以按照上一次迭代分解步骤的下标记录Labt-1，逆序进行拼接，所有分解结果仍可以拼接得到4个与M0等大的矩阵，将其中的低秩数据矩阵，记为Mt。Similarly, when multiple iterations are performed, the master node can record Labt-1 according to the subscript of the decomposition step of the previous iteration, and splice them in reverse order. All decomposition results can still be spliced to obtain 4 matrices of the same size as M0, and the low-rank data matrix among them is recorded as Mt.

S302，对恢复后的数据矩阵Mt再次进行聚类得到拆分后的子矩阵，将拆分结果记录为Labt，并由主节点比较本次拆分结果Labt与上次的拆分结果Labt-1之间的差异性，若差异小于阈值则完成恢复，将将步骤301拼接得到的4个矩阵作为恢复结果，否则继续下一步。S302, cluster the recovered data matrix Mt again to obtain the split sub-matrices, record the split result as Labt, and compare the difference between the current split result Labt and the previous split result Labt-1 by the master node. If the difference is less than the threshold, the recovery is completed and the four matrices spliced in step 301 are used as the recovery result, otherwise proceed to the next step.

本实施例以第一次迭代为例进行详细描述，按照步骤S102中相同的配置参数，再次对M1进行聚类划分。类似步骤S102中的下标记录，此次划分结果记录为Lab1。This embodiment takes the first iteration as an example for detailed description, and clusters M1 again according to the same configuration parameters in step S102. Similar to the subscript record in step S102, the result of this division is recorded as Lab1.

簇数量为N时，按照下列公式计算Lab0与Lab1之间的差异性，记为Lab_Diff值。When the number of clusters is N, the difference between Lab0 and Lab1 is calculated according to the following formula and recorded as Lab_Diff value.

若为第t次迭代，则上式可以表示为：If it is the tth iteration, the above formula can be expressed as:

得到的划分差异性Lab_Diff会与配置的阈值∈进行比较，阈值∈可以设置为0.05。如果Lab_Diff<∈，则代表两次划分的结果基本一致，可以认为收敛，LADEN将会完成分解、返回包括Mt在内的分解结果，即步骤S301中拼接得到的4个矩阵；否则，还需要继续执行步骤S303。The obtained partition difference Lab_Diff will be compared with the configured threshold ∈, which can be set to 0.05. If Lab_Diff<∈, it means that the results of the two partitions are basically the same, which can be considered converged. LADEN will complete the decomposition and return the decomposition results including Mt, that is, the four matrices spliced in step S301; otherwise, it is necessary to continue to execute step S303.

S303，根据预配置参数，主节点判断计算时间和迭代次数是否达到最大值，若达到最大值则完成恢复，将将步骤301拼接得到的4个矩阵作为恢复结果，否则根据步骤S302再次聚类得到的子矩阵，返回步骤S103继续下一轮恢复。S303, based on the pre-configured parameters, the master node determines whether the calculation time and the number of iterations have reached the maximum value. If so, the recovery is completed and the four matrices obtained by splicing in step 301 are used as the recovery result. Otherwise, the sub-matrices obtained by clustering again according to step S302 are returned to step S103 to continue the next round of recovery.

即，主节点会记录当前计算时间以及当前迭代次数。根据用户配置，再判断当前计算时间是否已超过最大计算时间，或当前迭代次数是否已达到最大迭代次数。如果任一指标已达到限制，则也完成分解、返回结果；否则根据Mt、Labt，按照Labt中的记录进行划分，再返回到步骤S103，继续进行下一轮迭代。这种多轮恢复的方案既允许用户按照需求和资源限制最大值，也可以避免高维数据下聚类时的偶然偏差对结果产生影响，更好地实现增强数据质量的目的。That is, the master node will record the current calculation time and the current number of iterations. According to the user configuration, it is determined whether the current calculation time has exceeded the maximum calculation time, or whether the current number of iterations has reached the maximum number of iterations. If any indicator has reached the limit, the decomposition is completed and the result is returned; otherwise, according to Mt and Labt, the division is performed according to the records in Labt, and then the process returns to step S103 to continue the next round of iterations. This multi-round recovery scheme allows users to limit the maximum value according to demand and resources, and can also avoid the impact of accidental deviations during clustering of high-dimensional data on the results, thereby better achieving the purpose of enhancing data quality.

最后，本实施例从理论和实验两个方面证明恢复效果与子矩阵数量、恢复时间与子矩阵规模的平方和这两组变量的线性关系。Finally, this embodiment proves the linear relationship between the two groups of variables, namely, the recovery effect and the number of sub-matrices, and the recovery time and the sum of the squares of the sub-matrix sizes, from both theoretical and experimental aspects.

假设输入矩阵M的规模是m×n。已知优化后的聚类步骤在m个点情况下的最坏时间复杂度为且常数项很小，在实际计算中的开销基本可以忽略。数据恢复步骤的主要运算在于规模m×m的矩阵与规模m×n的矩阵乘法，其时间复杂度为如果考虑到恢复中每次迭代的更新数量和最大迭代次数均为常数，拆分得到了N个子矩阵，那么簇规模平方和为Assume that the size of the input matrix M is m×n. It is known that the worst time complexity of the optimized clustering step in the case of m points is And the constant term is very small, and the overhead in actual calculation can be basically ignored. The main operation of the data recovery step is the multiplication of a matrix of size m×m and a matrix of size m×n, and its time complexity is If we consider that the number of updates per iteration and the maximum number of iterations in the recovery are constant, and N sub-matrices are obtained by splitting, then the sum of the squares of the cluster sizes is

设时间系数为k′，则运行时间可以估算为Assuming the time coefficient is k′, the running time can be estimated as

从式中显然可以看出，运行时间与簇规模平方和线性相关。It is obvious from the formula that the running time is linearly related to the square of the cluster size.

之后，使用归一化的平均绝对误差(NMAE)来衡量恢复效果，定义为Afterwards, the normalized mean absolute error (NMAE) is used to measure the recovery effect, which is defined as

其中，A表示理想的数据矩阵，表示LADEN得到的数据矩阵，M表示带有缺陷的原始矩阵。NMAE越小，代表恢复效果越好。随着簇数量的增加，每个子矩阵内的结构会被逐渐破坏，其中可供恢复的信息量减少，LADEN的恢复效果越差，数据矩阵与原始数据的偏差越大，NMAE越大。Where A represents the ideal data matrix, represents the data matrix obtained by LADEN, and M represents the original matrix with defects. The smaller the NMAE, the better the recovery effect. As the number of clusters increases, the structure within each submatrix will be gradually destroyed, and the amount of information available for recovery will decrease. The worse the recovery effect of LADEN, the greater the deviation between the data matrix and the original data, and the larger the NMAE.

变量间的线性关系的强弱可以用相关系数衡量，取值范围通常在-1到1之间。其绝对值越接近1，表示变量间的相关性越强。一般认为，相关系数的绝对值超过0.7就可以说明变量间的关系非常紧密。由于本实施例假设样本数据呈正态分布，所以只能使用斯皮尔曼等级相关系数(Spearman's rank correlation coefficient)检验相关性。它以牺牲一部分对线性关系的精确性和对异常值的敏感性为代价，摆脱对总体分布参数的依赖。除了相关系数外，相关分析还需要计算显著性。显著性可以由在零假设成立情况下获得同样结果的概率所量化，称为P值。一般认为，当P<0.05时可以认为零假设被拒绝，结论具有显著性。The strength of the linear relationship between variables can be measured by the correlation coefficient, and the range of values is usually between -1 and 1. The closer its absolute value is to 1, the stronger the correlation between the variables. It is generally believed that the absolute value of the correlation coefficient exceeds 0.7, which can illustrate that the relationship between the variables is very close. Since the present embodiment assumes that the sample data is normally distributed, only the Spearman's rank correlation coefficient can be used to test the correlation. It sacrifices a part of the accuracy of the linear relationship and the sensitivity to outliers, and gets rid of the dependence on the overall distribution parameters. In addition to the correlation coefficient, correlation analysis also needs to calculate significance. Significance can be quantified by the probability of obtaining the same result when the null hypothesis is established, which is called the P value. It is generally believed that when P<0.05, it can be considered that the null hypothesis is rejected and the conclusion is significant.

为了更直观的证明这两组线性关系，选用数据集Abilene，其中包括了北美骨干网络Abilene中12个路由器在24周内每五分钟的数据流量大小，代表了传统网络分析中常用的流量数据特征。本实施例选用了其中规模为1000×100的数据，Low_Rank值为17.0％，低秩性比较理想。In order to more intuitively prove the two sets of linear relationships, the Abilene data set is selected, which includes the data flow size of 12 routers in the North American backbone network Abilene every five minutes in 24 weeks, representing the traffic data characteristics commonly used in traditional network analysis. This embodiment selects data with a scale of 1000×100, and the Low_Rank value is 17.0%, which is relatively ideal for low rank.

数据集中的数据已经经过清洗，不包含缺陷。因此需要人为加入各种缺陷以检验LADEN的恢复效果，设置异常率为0.1％，异常规模为1，噪声规模为0.01，不限制最大恢复迭代次数。为了减小偶然性影响，本实施例在每组参数上重复测量了4次，且统计运行时间时累加了各个从节点上的实际运行时间，等效于并发度为1。在这些前提下，本实施例分别在数据中加入了比例不等的错误，并改变聚类参数以得到多种划分方案，分别记录恢复性能，得到的结果如图4所示。The data in the data set has been cleaned and does not contain defects. Therefore, it is necessary to artificially add various defects to verify the recovery effect of LADEN, set the anomaly rate to 0.1%, the anomaly scale to 1, the noise scale to 0.01, and do not limit the maximum number of recovery iterations. In order to reduce the impact of chance, this embodiment repeated the measurement 4 times on each set of parameters, and accumulated the actual running time on each slave node when calculating the running time, which is equivalent to a concurrency of 1. Under these premises, this embodiment added errors of varying proportions to the data, changed the clustering parameters to obtain a variety of partitioning schemes, and recorded the recovery performance separately. The results are shown in Figure 4.

从图4的左图中可以明显看出，簇数量与NMAE线性相关。在不同错误率下，NMAE和簇矩阵数量之间形成了四条清晰的平行线。当错误率为0.3时，相关系数r最小为0.943，对应P值1.0e-20，说明线性程度极高。在图4的右图中可以看出簇规模的平方和与恢复时间也呈现出明显强线性关系。平均相关系数仍然达到了0.864，最低为0.838，对应P值为4.3e-12，因此可认为具有强线性相关性。It can be clearly seen from the left figure of Figure 4 that the number of clusters is linearly correlated with NMAE. At different error rates, four clear parallel lines are formed between NMAE and the number of cluster matrices. When the error rate is 0.3, the correlation coefficient r is as low as 0.943, corresponding to a P value of 1.0e-20, indicating a very high degree of linearity. It can be seen from the right figure of Figure 4 that the sum of the squares of the cluster sizes and the recovery time also show a clear strong linear relationship. The average correlation coefficient still reached 0.864, and the lowest was 0.838, corresponding to a P value of 4.3e-12, so it can be considered to have a strong linear correlation.

根据实验可以看出，在固定场景下可以通过少量测试拟合出对于恢复时间和恢复效果的线性转换关系，允许用户对LADEN的恢复性能按照需求进行调节，显著增强了算法的可调节性和对复杂需求的适应性，大大提高了实用价值。According to the experiments, it can be seen that in a fixed scenario, a linear conversion relationship between recovery time and recovery effect can be fitted through a small number of tests, allowing users to adjust the recovery performance of LADEN according to their needs, significantly enhancing the adjustability of the algorithm and its adaptability to complex needs, greatly improving its practical value.

以上详细描述了本发明的较佳具体实施例。应当理解，本领域的普通技术人员无需创造性劳动就可以根据本发明的构思做出诸多修改和变化。因此，凡本技术领域中技术人员依据本发明的构思在现有技术的基础上通过逻辑分析、推理、或者有限的实验可以得到的技术方案，皆应在权利要求书所确定的保护范围内。The preferred specific embodiments of the present invention are described in detail above. It should be understood that a person skilled in the art can make many modifications and changes based on the concept of the present invention without creative work. Therefore, any technical solution that can be obtained by a person skilled in the art through logical analysis, reasoning, or limited experiments based on the concept of the present invention on the basis of the prior art should be within the scope of protection determined by the claims.

Claims

1. A highly robust distributed data recovery method based on compressed sensing, characterized in that the method converts the data recovery process into an optimization problem, decomposes the data matrix, noise matrix, anomaly matrix and error matrix from the original data by designing the goal and constraints of the problem, and obtains the recovered data based on the data matrix.

2. According to claim 1, a highly robust distributed data recovery method based on compressed sensing is characterized in that the method is implemented based on a distributed algorithm, and the master node performs a data decomposition process and a data enhancement process, wherein the data decomposition process splits the original data into multiple sub-matrices, and the data enhancement process optimizes the decomposition results through iteration, and the slave node performs a data recovery process on the sub-matrices split by the master node.

3. The method for highly robust distributed data recovery based on compressed sensing according to claim 1, characterized in that the method comprises the following steps:

S1, distributed data decomposition:

S101, the master node obtains the original data to be restored and checks the low rank of the original matrix. If it meets the low rank requirement, the next step is executed, otherwise an error is reported;

S102, according to pre-configured parameters, the master node clusters the original data to split the matrix, and the data in the same cluster is spliced into a sub-matrix, thereby obtaining multiple sub-matrices;

S103, the master node sends the split sub-matrices to multiple slave nodes evenly with the matrix size as the weight for data recovery;

S2, highly robust data recovery:

S201, each slave node extracts all vacant positions from the corresponding sub-matrix to construct an error matrix;

S202, constructing the goal and optimization constraints of the optimization problem according to the defect characteristics in the data;

S203, bringing in the constraint term to obtain the augmented Lagrangian formula, minimizing the corresponding augmented Lagrangian formula for each slave node using the ADM method, solving the optimization problem, and obtaining the decomposed data matrix, noise matrix, and abnormality matrix;

S3, iterative data augmentation:

S301, after the slave nodes complete the matrix decomposition, the master node summarizes the decomposition results of all sub-matrices on each slave node, and reassembles them according to the split positions to obtain 4 matrices consistent with the input original data scale, including the restored data matrix;

S302, clustering the restored data matrix again to obtain the split sub-matrices, and the master node compares the difference between the current split result and the previous split result. If the difference is less than the threshold, the recovery is completed, and the four matrices spliced in step 301 are used as the recovery result, otherwise proceed to the next step;

S303, based on the pre-configured parameters, the master node determines whether the calculation time and the number of iterations have reached the maximum value. If so, the recovery is completed and the four matrices obtained by splicing in step 301 are used as the recovery result. Otherwise, the sub-matrices obtained by clustering again in step S302 are returned to step S103 to continue the next round of recovery.

4. According to a highly robust distributed data recovery method based on compressed sensing in claim 3, it is characterized in that the step S101 is specifically as follows: the master node obtains the original data to be recovered, and temporarily fills the value 0 in all missing positions of the original data, and performs singular value decomposition on the filled original data matrix to calculate the rank of the data components therein and determine the low-rank index; if the calculated low-rank index is less than a preset threshold, execute the next step, otherwise return an error value to report an error.

5. According to the highly robust distributed data recovery method based on compressed sensing described in claim 3, it is characterized in that in the step S102, the master node clusters the original data to split the matrix using the density-based improved clustering algorithm OPTICS. On the basis of the OPTICS algorithm, the determination of the end point of the region is added, and the characteristic of the minimum heap structure itself maintaining the pioneer node in the process of algorithm generation sequence is used to determine whether the end point should be removed from the cluster; after clustering is completed, all clusters are arranged in descending order according to size and then labeled with id starting from 1, that is, the cluster with id=1 contains the most data points, and all marked outliers are aggregated into an additional cluster, and id=0 is recorded.

6. According to claim 3, a highly robust distributed data recovery method based on compressed sensing is characterized in that in step S103, the master node sends each sub-matrix to the subordinate slave nodes using the network connection, and uses the square of the matrix size as the weight and the load factor to distribute tasks among the slave nodes, and the process is as follows:

Pre-configure the weight w _j for the jth node according to its computing power. The default value is 1. The initial load of the node is L _j = 1. Define the load factor

For the submatrix D _i corresponding to cluster C _i , the task volume T _i is defined as the square of the scale, that is,

T _i =(M _i .col) ²

Select the node that currently has the largest load factor and update the load factor, that is,

L′ _Node = L _Node + _Ti

Repeat the above steps until the assignment is complete.

7. A highly robust distributed data recovery method based on compressed sensing according to claim 3, characterized in that, the assumption is divided into N sub-matrices, wherein the sub-matrix M _i corresponding to the cluster with id=i is of size m _i ×n _i , and step S2 decomposes it into a low-rank data matrix A _i , a sparse anomaly matrix B _i , a small noise matrix C _i and an error matrix E _i , then the optimization problem is expressed as:

D _i. *E _i = D _i

Where α, β, σ are the coefficients of the three corresponding matrix components to meet the needs of linear transformation and dimensional conversion; X, Y, Z are global coefficient matrices; _Di is an intermediate variable of the same size as _Ei that is used to convert _Ei from a binary matrix to a numerical matrix; and It is an additional auxiliary variable used to decouple the _Ai and _Bi variables in the objective and constraints; is the penalty term, where γ is the corresponding coefficient, K is the number of penalty terms, and U ^k , V ^k and R ^k are three global matrix parameters set according to domain knowledge.

8. The highly robust distributed data recovery method based on compressed sensing according to claim 3, characterized in that, in S302, the calculation method used to compare the difference between the current split result and the previous split result is:

Wherein, Lab_Diff represents the obtained partition difference result, N is the number of sub-matrices, Labt(i) represents the split result of the t-th iteration, and Labt-1(i) represents the split result of the previous iteration before the t-th iteration.

9. According to claim 3, a highly robust distributed data recovery method based on compressed sensing is characterized in that the method predicts data recovery performance by adjusting the partitioning scheme in the data decomposition stage, and the data recovery performance includes recovery effect and recovery time, wherein the recovery effect is predicted based on the number of sub-matrices, and the recovery time is predicted based on the sum of the squares of the sub-matrix sizes.

10. The highly robust distributed data recovery method based on compressed sensing according to claim 9, characterized in that the recovery time prediction method is:

Where N is the number of sub-matrices to be split, k′ is the time coefficient, m×n is the input matrix size, Square_Sum is the sum of squares of cluster sizes,

The recovery effect is measured using the normalized mean absolute error (NMAE), defined as

Where A represents the ideal data matrix, represents the restored data matrix outputted in step S3, and M represents the original matrix with defects.