CN115438035A

CN115438035A - A data anomaly processing method based on KPCA and hybrid similarity

Info

Publication number: CN115438035A
Application number: CN202211321839.6A
Authority: CN
Inventors: 马勇; 赵从俊; 戴梦轩; 贺嘉; 李博嘉; 何兵兵; 唐泳
Original assignee: Jiangxi Normal University
Current assignee: Jiangxi Normal University
Priority date: 2022-10-27
Filing date: 2022-10-27
Publication date: 2022-12-06
Anticipated expiration: 2042-10-27
Also published as: CN115438035B

Abstract

The invention discloses a data anomaly processing method based on KPCA and hybrid similarity, comprising the following steps: S1: a terminal generates a task and uploads the task to an edge end; S2: the edge end receives the task and divides the data involved in the task into high-dimensional data and low-dimensional data; S3: the high-dimensional data and the low-dimensional data are processed; S4: the edge end uploads the processed data to the cloud. In this way, the method mines data features more completely and detects anomalous and redundant data more accurately, which improves the quality management of the data set and supports safe, stable, high-quality execution of tasks at the cloud and the edge end.

Description

A data anomaly processing method based on KPCA and hybrid similarity

Technical field

The invention relates to the field of big data processing, and in particular to a data anomaly processing method based on KPCA and hybrid similarity.

Background art

In recent years, traditional industrial control systems have gradually been connected to the Internet and to cloud platforms, forming industrial Internet platforms. At the same time, the rapid development of the Internet of Things and 5G technology has led mobile terminal devices to generate massive amounts of data. If all data collected by terminal devices is transmitted over the network to the cloud, where cleaning, mining and other work is performed, the network bandwidth comes under heavy pressure, latency grows, and the computing resources of the cloud computing center are wasted. It is therefore necessary to clean the data appropriately at the edge end and then upload the clean data to the cloud for storage and use. The prior art often covers the detection and cleaning of outliers in industrial data but rarely covers the deduplication of redundant data. Because much redundant data can be derived from other data, redundant information contributes nothing during data processing, and removing it can effectively reduce data transmission pressure and task load.

Chinese invention patent (application No. 201811519395.0, publication No. CN 109635958 A) discloses an intelligent power data anomaly detection method that reduces the dimensionality of valid offline data samples and computes a time-series sample sequence, including: using principal component analysis (PCA) to reduce the dimensionality of the valid offline data samples, removing the correlation among the features of dimensions above three to obtain dimension-reduced offline data samples; and serializing the dimension-reduced offline data samples to obtain the time-series sample sequence. The shortcoming of this scheme is that traditional industrial data is mostly high-dimensional and strongly nonlinear. PCA handles nonlinear data only moderately well, preserves little of the data's information after dimensionality reduction, and struggles to capture nonlinear features, so the accuracy of the data after anomaly detection is low.

Chinese invention patent (application No. 201911423436.0, publication No. CN 111275288 A) discloses an XGBoost-based multi-dimensional data anomaly detection method and device, including: collecting and cleaning data, and standardizing the cleaned data to unify the units of data of different dimensions; extracting features and reducing dimensionality; training an anomaly detection model, in which the XGBoost method is trained on the dimension-reduced data to build a prediction model for equipment anomalies; and performing online anomaly detection, in which an anomaly is determined to have occurred if a given threshold is exceeded. The shortcoming of this scheme is that it considers only the Pearson correlation coefficient, which performs well only on data sets with strong correlation and poorly on industrial data with strong nonlinear relationships. Redundant data is therefore detected with insufficient accuracy, and deduplication is poor.

Summary of the invention

To solve the above technical problems, or at least partly solve them, the present disclosure provides a data anomaly processing method based on KPCA and hybrid similarity, characterized by comprising the following steps:

S1: A terminal generates a task and uploads the task to the edge end;

S2: The edge end receives the task and divides the data involved in the task into high-dimensional data and low-dimensional data according to dimension;

S3: The high-dimensional data and the low-dimensional data are processed;

S4: The edge end uploads the processed data to the cloud.

Further, the high-dimensional data is data of dimension 3 or higher;

the low-dimensional data is data of dimension lower than 3.
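The dimension rule above can be sketched as a simple partition. This is a minimal illustration, assuming that a record's dimension is the number of values it holds; the function name `split_by_dimension` and the sample records are hypothetical and do not come from the patent.

```python
from typing import List, Sequence, Tuple

# Threshold taken from the text: dimension 3 or higher is high-dimensional.
HIGH_DIM_THRESHOLD = 3

Record = Sequence[float]


def split_by_dimension(records: Sequence[Record]) -> Tuple[List[Record], List[Record]]:
    """Return (high_dimensional, low_dimensional) records.

    Assumption: the dimension of a record is the number of values it holds.
    """
    high: List[Record] = []
    low: List[Record] = []
    for record in records:
        if len(record) >= HIGH_DIM_THRESHOLD:
            high.append(record)
        else:
            low.append(record)
    return high, low


if __name__ == "__main__":
    sample = [[21.5], [21.5, 0.8], [21.5, 0.8, 101.3]]
    high, low = split_by_dimension(sample)
    print(len(high), len(low))
```

Keeping the threshold in one named constant makes it easy to adjust if a deployment draws the line elsewhere.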

Further, processing the high-dimensional data and the low-dimensional data includes:

S31. Performing anomaly detection on the high-dimensional data and the low-dimensional data to obtain detection results;

S32. Cleaning the detection results to obtain a cleaned data set;

S33. Identifying redundant data in the cleaned data set and processing it.

Further, performing anomaly detection on the high-dimensional data and the low-dimensional data to obtain detection results includes:

S311. Applying iForest to the low-dimensional data for anomaly detection, to obtain the path length and anomaly score of each low-dimensional data item;

S312. Converting the high-dimensional data into feature data using the KPCA algorithm, and then applying iForest to the feature data for anomaly detection, to obtain the path length and anomaly score of each high-dimensional data item.
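The KPCA conversion in S312 can be illustrated with a small pure-Python sketch. The patent does not name a kernel or its parameters, so the RBF kernel, the `gamma` value, the use of power iteration, and the restriction to the first principal component are all assumptions made for illustration. A production system would more likely use a numerical library and keep several components.

```python
import math
from typing import List, Sequence, Tuple


def rbf_kernel_matrix(data: Sequence[Sequence[float]], gamma: float) -> List[List[float]]:
    """K[i][j] = exp(-gamma * ||x_i - x_j||^2) (assumed kernel choice)."""
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj))) for xj in data]
        for xi in data
    ]


def center_kernel(k: List[List[float]]) -> List[List[float]]:
    """Center the kernel matrix in feature space (K is symmetric)."""
    n = len(k)
    row_mean = [sum(row) / n for row in k]
    total_mean = sum(row_mean) / n
    return [
        [k[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def top_eigenpair(m: List[List[float]], iterations: int = 200) -> Tuple[float, List[float]]:
    """Power iteration for the dominant eigenpair of a symmetric PSD matrix."""
    n = len(m)
    # Non-constant start vector: the constant vector lies in the null space
    # of a centered kernel matrix and would collapse to zero.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("degenerate kernel matrix")
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(m[i][j] * v[j] for j in range(n)) for i in range(n))
    return eigenvalue, v


def kpca_first_component(data: Sequence[Sequence[float]], gamma: float = 0.1) -> List[float]:
    """Project each training record onto the first kernel principal component."""
    centered = center_kernel(rbf_kernel_matrix(data, gamma))
    eigenvalue, v = top_eigenpair(centered)
    # For training points the projection is sqrt(lambda) * v_i (v has unit norm).
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [scale * vi for vi in v]
```

The resulting one-dimensional feature values would then be passed to iForest in place of the original high-dimensional records.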

Further, converting the high-dimensional data into feature data using the KPCA algorithm includes:

establishing a high-dimensional data mapping database, in which all original high-dimensional data and the corresponding feature data are recorded.

Further, cleaning the detection results includes:

S321. Obtaining the path lengths and anomaly scores of the high-dimensional data and the low-dimensional data, and calculating the average path length;

S322. Treating data whose average path length is in the range 0 to 0.15 and whose anomaly score is in the range 0.85 to 1 as outliers, and cleaning them out.

Further, in cleaning the detection results, the high-dimensional data and the low-dimensional data are each handled with the methods of S31, S32 and S33 above, separately from one another.

Further, identifying redundant data in the cleaned data set and processing it includes:

S331. Acquiring data items whose average path lengths and anomaly scores are similar; denoting the acquired data as (X, Y), (X, Y) is regarded as redundant data. In step S331, the low-dimensional data and the high-dimensional data both use this method, separately and synchronously;

S332. Analyzing the data type of (X, Y): if (X, Y) is low-dimensional redundant data, go to S333; if (X, Y) is high-dimensional redundant data, go to S334;

S333. Using the Pearson correlation coefficient to obtain the similarity H₁ of the low-dimensional redundant data, by the formula:

$$H_1 = \operatorname{corr}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$

S334. Obtaining from the high-dimensional data mapping database the original high-dimensional data (X', Y') corresponding to (X, Y), and using the hybrid similarity algorithm to obtain the similarity H₂ of the high-dimensional redundant data, by the formula:

$$H_2 = \mu\,\rho(X', Y') + (1 - \mu)\,I(X'; Y')$$

where μ is the weight of the Spearman correlation coefficient, ρ(X', Y') is the Spearman correlation coefficient of the data (X', Y'), and I(X'; Y') is the mutual information value of (X', Y');

S335. Comparing H₁ or H₂ with the preset threshold δ: if H₁ > δ or H₂ > δ, redundant data exists in (X, Y) and data clearing is performed.

Further, μ and the preset threshold δ are set manually: μ ranges from 0 to 1, with a preferred value of 0.5; δ does not exceed the calculated maximum similarity and is preferably set to 90% of the maximum similarity value.

Further, the data cleaning includes: randomly selecting one data item from the redundant data and deleting it.

Compared with the prior art, the technical solution provided by the invention has the following advantages:

The data anomaly processing method based on KPCA and hybrid similarity provided by the invention analyzes the tasks generated by terminals and uploaded to the edge end, divides the data involved in each task into high-dimensional data and low-dimensional data, processes both, and has the edge end upload the processed data to the cloud. Because the dimensionality of industrial data varies widely, the invention divides the data into high-dimensional and low-dimensional types and processes the high-dimensional data with the KPCA algorithm, reducing the dimensionality of the data set through feature extraction so that anomaly detection can be performed on both high-dimensional and low-dimensional data. Because the nonlinear features of industrial data are hard to mine, the invention detects redundant data using the Pearson correlation coefficient together with a hybrid similarity algorithm: since high-dimensional data has nonlinear features and its items show a degree of mutual dependence, the similarity of high-dimensional data is computed by combining the Spearman correlation coefficient with the mutual information value. The method therefore mines data features more completely and detects and deduplicates anomalous data more accurately, which improves the quality management of the data set and supports safe, stable, high-quality execution of tasks at the cloud and the edge end.

Description of drawings

FIG. 1 is a flowchart of the data anomaly processing method based on KPCA and hybrid similarity provided by the invention.

FIG. 2 is a flowchart of the processing of high-dimensional data and low-dimensional data in the method.

FIG. 3 is a flowchart of the abnormal data cleaning in the method.

Detailed description

The preferred embodiments of the invention are described in detail below with reference to the accompanying drawings, so that the advantages and features of the invention can be more readily understood by those skilled in the art and the scope of protection of the invention can be defined more clearly.

Many specific details are set forth in the following description to provide a full understanding of the present disclosure, but the disclosure can also be implemented in ways other than those described here; the embodiments in the description are only some of the embodiments of the disclosure, not all of them.

FIG. 1 is a flowchart of the data anomaly processing method based on KPCA and hybrid similarity provided by the invention; the method comprises steps S1 to S4 as set out above.

Further, referring to FIG. 2, processing the high-dimensional data and the low-dimensional data includes steps S31 to S33 as set out above, where S32 cleans the detection results to obtain a cleaned data set.

Further, the path length is calculated as:

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$

where c(n) is the path length, n is the number of samples, and ξ is Euler's constant, which enters through the harmonic function H defined below;

The anomaly score is calculated as:

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

where s(x, n) is the anomaly score, E(h(x)) is the expected path length, and H is the harmonic function, $H(i) = \ln(i) + \xi$.

E(h(x)) is the expected path length of a data item over all iTrees; the iForest algorithm outputs a value between 0 and 1.
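The two iForest quantities described above can be checked numerically with a short sketch. It assumes the standard iForest definitions, c(n) = 2H(n-1) - 2(n-1)/n with H(i) = ln(i) + ξ and s(x, n) = 2^(-E(h(x))/c(n)); the special cases for n of 2 or fewer follow common iForest implementations and are not stated in the patent. All function names are illustrative.

```python
import math

EULER_CONSTANT = 0.5772156649  # Euler's constant (xi in the text)


def harmonic(i: int) -> float:
    """Harmonic-number approximation H(i) = ln(i) + xi."""
    return math.log(i) + EULER_CONSTANT


def average_path_length(n: int) -> float:
    """c(n): normalizing path length for a sample of n points."""
    if n <= 1:
        return 0.0  # assumed convention for trivial samples
    if n == 2:
        return 1.0  # assumed convention, as in common implementations
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(expected_path_length: float, n: int) -> float:
    """s(x, n) = 2 ** (-E(h(x)) / c(n)); values near 1 indicate anomalies."""
    c = average_path_length(n)
    if c == 0.0:
        raise ValueError("need at least 2 samples")
    return 2.0 ** (-expected_path_length / c)


if __name__ == "__main__":
    n = 256
    print(average_path_length(n))
    print(anomaly_score(2.0, n))   # short path: score close to 1
    print(anomaly_score(12.0, n))  # long path: score below 0.5
```

A data item whose expected path length equals c(n) scores exactly 0.5, which is why scores well above 0.5 are read as anomalous.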

A high-dimensional data mapping database is established, in which all original high-dimensional data and the corresponding feature data are recorded;

It should be understood that the feature data is obtained by reducing the dimensionality of the original high-dimensional data. During anomaly detection, high-dimensional data has nonlinear features, so the KPCA algorithm, which handles such data well, is used to obtain the feature data, and it is the feature data that is processed. When redundant data in the cleaned data set is identified and processed, the original high-dimensional data is processed instead, to preserve the completeness of the high-dimensional information. The high-dimensional data mapping database exists to keep both the original high-dimensional data and the feature data, which makes the scheme more flexible and reliable.
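The role of the mapping database can be sketched as a keyed store that holds each original record next to its feature data. This is a minimal in-memory illustration; the class name `HighDimMappingDB`, the record identifiers, and the method names are hypothetical, and the patent does not say how the database is implemented.

```python
from typing import Dict, Sequence, Tuple

Vector = Tuple[float, ...]


class HighDimMappingDB:
    """In-memory stand-in for the high-dimensional data mapping database."""

    def __init__(self) -> None:
        # record_id -> (original high-dimensional data, feature data)
        self._rows: Dict[str, Tuple[Vector, Vector]] = {}

    def record(self, record_id: str, original: Sequence[float],
               features: Sequence[float]) -> None:
        """Store the original record together with its KPCA feature data."""
        self._rows[record_id] = (tuple(original), tuple(features))

    def original_of(self, record_id: str) -> Vector:
        """Original data, used later for the redundancy check."""
        return self._rows[record_id][0]

    def features_of(self, record_id: str) -> Vector:
        """Feature data, used for anomaly detection."""
        return self._rows[record_id][1]


if __name__ == "__main__":
    db = HighDimMappingDB()
    db.record("sensor-1", [21.5, 0.8, 101.3, 4.2], [0.37])
    print(db.original_of("sensor-1"), db.features_of("sensor-1"))
```

Anomaly detection reads `features_of`, while the later redundancy check reads `original_of`, which mirrors the split described in the paragraph above.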

S322. Treating data whose average path length is in the range 0 to 0.15 and whose anomaly score is in the range 0.85 to 1 as outliers, and cleaning them out;

Specifically, those skilled in the art can set these ranges according to the characteristics of the data and actual needs; the values given here are for reference and are not limiting.
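The cleaning rule of S322 can be sketched as a filter over scored records. The sketch assumes the average path length has already been normalized to the range 0 to 1, since the text gives the range 0 to 0.15 without stating how it is scaled; the `ScoredRecord` type and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ScoredRecord:
    record_id: str
    avg_path_length: float  # assumed to be normalized to [0, 1]
    anomaly_score: float    # iForest score in [0, 1]


def split_outliers(
    records: Iterable[ScoredRecord],
    path_range: Tuple[float, float] = (0.0, 0.15),
    score_range: Tuple[float, float] = (0.85, 1.0),
) -> Tuple[List[ScoredRecord], List[ScoredRecord]]:
    """Return (kept, outliers); a record is an outlier only if BOTH ranges match."""
    kept: List[ScoredRecord] = []
    outliers: List[ScoredRecord] = []
    for r in records:
        short_path = path_range[0] <= r.avg_path_length <= path_range[1]
        high_score = score_range[0] <= r.anomaly_score <= score_range[1]
        if short_path and high_score:
            outliers.append(r)
        else:
            kept.append(r)
    return kept, outliers
```

Both ranges are parameters because the text presents the values 0.15 and 0.85 as reference values rather than fixed limits.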

Further, in cleaning the detection results, the high-dimensional data and the low-dimensional data are each handled with the methods of S31, S32 and S33 above, separately and synchronously. High-dimensional data is grouped so that data of the same dimension is processed together: for example, if the high-dimensional data has dimensions N_i (i = 0, 1, …, n), the N_i-dimensional data of each dimension is gathered and processed with the method above, which is not repeated here.

Further, referring to FIG. 3, identifying redundant data in the cleaned data set and processing it includes:

S331. Acquiring data items whose average path lengths and anomaly scores are similar; denoting the acquired data as (X, Y), (X, Y) is regarded as redundant data. In step S331, the low-dimensional data and the high-dimensional data both use this method, separately and synchronously;

S332. Analyzing the data type of (X, Y): if (X, Y) is low-dimensional redundant data, go to S333; if (X, Y) is high-dimensional redundant data, go to S334;

S333. Using the Pearson correlation coefficient to obtain the similarity H₁ of the low-dimensional redundant data, by the formula:

$$H_1 = \operatorname{corr}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$

S334. Obtaining from the high-dimensional data mapping database the original high-dimensional data (X', Y') corresponding to (X, Y), and using the hybrid similarity algorithm to obtain the similarity H₂ of the high-dimensional redundant data, by the formula:

$$H_2 = \mu\,\rho(X', Y') + (1 - \mu)\,I(X'; Y')$$

where μ is the weight of the Spearman correlation coefficient, ρ(X', Y') is the Spearman correlation coefficient of the data (X', Y'), and I(X'; Y') is the mutual information value of (X', Y'), where:

$$I(X'; Y') = \sum_{x}\sum_{y} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}$$

p(x, y) is the joint probability of the data (X', Y'), p(x) and p(y) are the probabilities with which x and y occur, and the base of the logarithm is usually taken as e.

For example, with X' = [0,0,1] and Y' = [1,1,0], one obtains p(x=0) = 2/3 and p(x=1) = 1/3; p(y=0) = 1/3 and p(y=1) = 2/3; and p(0, 1) = 2/3 and p(1, 0) = 1/3, so that in this case I(X'; Y') = (2/3)·ln(3/2) + (1/3)·ln(3) = 0.6365.

In this scheme, mutual information measures the degree of mutual dependence between two data items: the larger the mutual information value, the stronger the dependence between them;

S335. Comparing H₁ or H₂ with the preset threshold δ: if H₁ > δ or H₂ > δ, redundant data exists in (X, Y) and data clearing is performed.
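Steps S333 and S335 for low-dimensional data can be sketched as follows: compute the Pearson correlation coefficient of two sequences as the similarity H₁ and compare it with the threshold δ. This is a minimal illustration; the function names and the sample values are hypothetical, and the strict comparison follows the "H₁ > δ" wording of S335.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient: cov(x, y) / (sigma_x * sigma_y)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def is_redundant(similarity: float, delta: float) -> bool:
    """S335: redundancy is declared only when the similarity exceeds delta."""
    return similarity > delta


if __name__ == "__main__":
    h1 = pearson([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
    print(h1, is_redundant(h1, delta=0.9))
```

Because the second sequence is close to a scaled copy of the first, it could be derived from the first, which is the sense in which the text calls such data redundant.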

Further, the values of μ and the preset threshold δ can be chosen according to circumstances; μ is preferably 0.5.
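The hybrid similarity of S334 can be sketched by combining a Spearman correlation coefficient with a mutual information value. Two points are assumptions rather than statements from the patent: the weighted-sum form H₂ = μ·ρ + (1 − μ)·I, which is one reading of μ as the weight of the Spearman coefficient, and the estimation of mutual information from empirical frequencies of discrete values (continuous industrial data would first need binning). All function names are illustrative.

```python
import math
from collections import Counter
from typing import List, Sequence


def _pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman coefficient: Pearson correlation of the rank sequences."""
    return _pearson(average_ranks(x), average_ranks(y))


def mutual_information(x: Sequence, y: Sequence) -> float:
    """I(X;Y) = sum p(x,y) * ln(p(x,y) / (p(x) p(y))) from empirical counts."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in joint.items()
    )


def hybrid_similarity(x: Sequence[float], y: Sequence[float], mu: float = 0.5) -> float:
    """Assumed form: H2 = mu * spearman + (1 - mu) * mutual information."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return mu * spearman(x, y) + (1.0 - mu) * mutual_information(x, y)
```

With the sequences [0,0,1] and [1,1,0] from the worked example, `mutual_information` reproduces the value 0.6365 given in the text.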

Specifically, those skilled in the art can set δ according to the characteristics of the data and actual needs; preferably, δ is a manually set fixed threshold equal to 90% of the current upper limit of the similarity. The value given here is for reference and is not limiting.
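The preferred choice of δ, 90% of the largest computed similarity, can be sketched as below. The pair identifiers, the dictionary layout, and the function names are hypothetical; the sketch assumes the similarities of all candidate pairs have already been computed.

```python
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]


def threshold_from_max(similarities: Iterable[float], fraction: float = 0.9) -> float:
    """delta = fraction * (largest computed similarity)."""
    values = list(similarities)
    if not values:
        raise ValueError("no similarities to derive a threshold from")
    return fraction * max(values)


def redundant_pairs(pair_similarity: Dict[Pair, float], fraction: float = 0.9) -> List[Pair]:
    """Pairs whose similarity strictly exceeds the derived threshold."""
    delta = threshold_from_max(pair_similarity.values(), fraction)
    return [pair for pair, h in pair_similarity.items() if h > delta]


if __name__ == "__main__":
    sims = {("a", "b"): 0.98, ("a", "c"): 0.40, ("b", "c"): 0.89}
    print(threshold_from_max(sims.values()))
    print(redundant_pairs(sims))
```

Deriving δ from the data keeps it within the computed similarity range, as the text requires, at the cost of always flagging at least the most similar pair when the maximum is positive.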

Further, the data cleaning includes: randomly selecting one data item from the redundant data and deleting it.
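The random deletion step can be sketched as follows. The data set layout (a dictionary keyed by record identifier) and the function name are assumptions made for illustration; an optional `random.Random` instance is accepted so that the choice can be made reproducible in tests.

```python
import random
from typing import Dict, Optional, Sequence, Tuple


def drop_one_redundant(
    dataset: Dict[str, Sequence[float]],
    pair: Tuple[str, str],
    rng: Optional[random.Random] = None,
) -> Tuple[str, Dict[str, Sequence[float]]]:
    """Randomly delete one member of a redundant pair.

    Returns the identifier of the deleted record and a new data set
    without it; the input data set is left unchanged.
    """
    rng = rng or random.Random()
    victim = rng.choice(pair)
    remaining = {key: value for key, value in dataset.items() if key != victim}
    return victim, remaining


if __name__ == "__main__":
    data = {"a": [1.0, 2.0], "b": [2.0, 4.0], "c": [5.0, 1.0]}
    removed, cleaned = drop_one_redundant(data, ("a", "b"), random.Random(0))
    print(removed, sorted(cleaned))
```

Returning a new dictionary instead of mutating the input makes it easier to audit which record was removed for each redundant pair.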

The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functions and operations that may be implemented by methods according to various embodiments of the invention. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

It should be noted that, in this document, relational terms are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Furthermore, the terms "comprise", "include" and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device comprising a set of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises it.

Claims

1. A data anomaly processing method based on KPCA and hybrid similarity, characterized by comprising the following steps:

S1: The terminal generates a task and uploads the task to the edge;

S2: The edge end receives the task, and divides the data involved in the task into high-dimensional data and low-dimensional data;

S3: Processing the high-dimensional data and low-dimensional data;

S4: The edge uploads the processed data to the cloud.

2. The data anomaly processing method based on KPCA and hybrid similarity according to claim 1, characterized in that:

The high-dimensional data is data with dimensions >= 3;

The low-dimensional data is data with dimension <3.

3. The data anomaly processing method based on KPCA and hybrid similarity according to claim 1, characterized in that:

The processing of the high-dimensional data and low-dimensional data includes:

S31. Perform anomaly detection on the high-dimensional data and low-dimensional data, and obtain a detection result;

S32. Cleaning the detection results to obtain a cleaned data set;

S33. Perform redundant data judgment and processing on the cleaned data set.

4. The data anomaly processing method based on KPCA and hybrid similarity according to claim 3, characterized in that:

Performing anomaly detection on the high-dimensional data and the low-dimensional data to obtain detection results includes:

S311. Applying iForest to the low-dimensional data for anomaly detection, to obtain the path length and anomaly score of each low-dimensional data item;

S312. Converting the high-dimensional data into feature data using the KPCA algorithm, and then applying iForest to the feature data for anomaly detection, to obtain the path length and anomaly score of each high-dimensional data item.

5. The data anomaly processing method based on KPCA and hybrid similarity according to claim 4, characterized in that:

Converting the high-dimensional data into feature data using the KPCA algorithm includes:

establishing a high-dimensional data mapping database, in which all original high-dimensional data and the corresponding feature data are recorded.

6. The data anomaly processing method based on KPCA and hybrid similarity according to claim 5, characterized in that:

Cleaning the detection results includes:

S321. Obtaining the path lengths and anomaly scores of the high-dimensional data and the low-dimensional data, and calculating the average path length;

S322. Treating data whose average path length is in the range 0 to 0.15 and whose anomaly score is in the range 0.85 to 1 as outliers, and cleaning them out.

7. The data anomaly processing method based on KPCA and hybrid similarity according to any one of claims 4 to 6, characterized in that:

In cleaning the detection results, the high-dimensional data and the low-dimensional data are each handled separately with the methods of S31, S32 and S33 above, and for the high-dimensional data, data of the same dimension is selected and processed together.

8. The data anomaly processing method based on KPCA and hybrid similarity according to claim 6, characterized in that:

Identifying redundant data in the cleaned data set and processing it includes:

S331. Acquiring data items whose average path lengths and anomaly scores are similar; denoting the acquired data as (X, Y), (X, Y) is regarded as redundant data;

S332. Analyzing the data type of (X, Y): if (X, Y) is low-dimensional redundant data, go to S333; if (X, Y) is high-dimensional redundant data, go to S334;

S333. Using the Pearson correlation coefficient to obtain the similarity H₁ of the low-dimensional redundant data, by the formula:

$$H_1 = \operatorname{corr}(X, Y)$$

S334. Obtaining from the high-dimensional data mapping database the original high-dimensional data (X', Y') corresponding to (X, Y), and using a hybrid similarity algorithm to obtain the similarity H₂ of the high-dimensional redundant data, by the formula:

$$H_2 = \mu\,\rho(X', Y') + (1 - \mu)\,I(X'; Y')$$

where μ is the weight of the Spearman correlation coefficient, ρ(X', Y') is the Spearman correlation coefficient of the data (X', Y'), and I(X'; Y') is the mutual information value of (X', Y');

S335. Comparing H₁ or H₂ with the preset threshold δ: if H₁ > δ or H₂ > δ, redundant data exists in (X, Y) and data clearing is performed.

9. The data anomaly processing method based on KPCA and hybrid similarity according to claim 8, characterized in that:

μ and the preset threshold δ are set manually, μ ranges from 0 to 1, and δ does not exceed the calculated maximum similarity.