CN115438035A - A data anomaly processing method based on KPCA and hybrid similarity - Google Patents
- Publication number
- CN115438035A
- Authority
- CN
- China
- Prior art keywords
- data
- dimensional data
- dimensional
- low
- kpca
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/258—Data format conversion from or to a database
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Complex Calculations (AREA)
Abstract
Description
Technical field
The invention relates to the field of big data processing, and in particular to a data anomaly processing method based on KPCA and hybrid similarity.
Background
In recent years, traditional industrial control systems have gradually been connected to the Internet and to cloud platforms, forming the industrial Internet platform. At the same time, the rapid development of the Internet of Things and 5G technology has led mobile terminal devices to generate massive amounts of data. All data collected by terminal devices is transmitted over the network to the cloud, where cleaning, mining and other work is performed. This not only puts heavy pressure on network bandwidth and causes long delays, but also wastes the computing resources of the cloud computing center. It is therefore necessary to clean the data appropriately at the edge and only then upload the clean data to the cloud for storage and use. The prior art often covers the detection and cleaning of outliers in industrial data, but rarely covers the deduplication of redundant data. However, because much redundant data can be derived from other data, redundant information contributes nothing during data processing, and deduplicating it can effectively reduce data transmission pressure and task load.
Chinese invention patent (application number: 201811519395.0, publication number: CN 109635958 A) discloses an intelligent power data anomaly detection method that reduces the dimensionality of valid offline data samples and computes a time-series sample sequence. It uses PCA (principal component analysis) to reduce the dimensionality of the valid offline data samples, removing the correlation among features in dimensions above three, to obtain dimension-reduced offline data samples; it then serializes the dimension-reduced samples to obtain the time-series sample sequence. The shortcomings of this scheme are as follows: traditional industrial data is mostly high-dimensional and strongly nonlinear, PCA handles nonlinear data only moderately well, the information retained after dimensionality reduction is poor, and nonlinear features are hard to capture, so the data remaining after anomaly detection has low accuracy.
Chinese invention patent (application number: 201911423436.0, publication number: CN 111275288 A) discloses an XGBoost-based multi-dimensional data anomaly detection method and device, comprising: collecting and cleaning data, and standardizing the cleaned data to unify the scales of the different dimensions; extracting features and reducing dimensionality; training an anomaly detection model by applying XGBoost to the dimension-reduced data to build a prediction model for equipment anomalies; and performing online anomaly detection, where an anomaly is declared if a given threshold is exceeded. The shortcoming of this scheme is that it considers only the Pearson correlation coefficient, which works well only on data sets with strong correlation and poorly on industrial data with strong nonlinear relationships; redundant data is therefore detected with insufficient accuracy, and deduplication performs poorly.
Summary of the invention
To solve the above technical problems, or at least partially solve them, the present disclosure provides a data anomaly processing method based on KPCA and hybrid similarity, comprising the following steps:
S1: The terminal generates a task and uploads the task to the edge;
S2: The edge receives the task and divides the data involved in the task into high-dimensional data and low-dimensional data according to dimension;
S3: The high-dimensional data and the low-dimensional data are processed;
S4: The edge uploads the processed data to the cloud.
Further, the high-dimensional data is data with dimension >= 3;
the low-dimensional data is data with dimension < 3.
Further, processing the high-dimensional data and the low-dimensional data includes:
S31. Perform anomaly detection on the high-dimensional data and the low-dimensional data to obtain detection results;
S32. Clean the detection results to obtain a cleaned data set;
S33. Judge whether the cleaned data set contains redundant data and process it.
Further, performing anomaly detection on the high-dimensional data and the low-dimensional data to obtain detection results includes:
S311. Apply iForest to the low-dimensional data for anomaly detection, obtaining the path length and anomaly score of each low-dimensional data item;
S312. Convert the high-dimensional data into feature data using the KPCA algorithm, then apply iForest to the feature data for anomaly detection, obtaining the path length and anomaly score of each high-dimensional data item.
Further, converting the high-dimensional data into feature data using the KPCA algorithm includes:
establishing a high-dimensional data mapping database, in which all original high-dimensional data and the corresponding feature data are recorded.
Further, cleaning the detection results includes:
S321. Obtain the path lengths and anomaly scores of the high-dimensional data and the low-dimensional data, and compute the average path length;
S322. Treat data whose average path length lies in the range 0 to 0.15 and whose anomaly score lies in the range 0.85 to 1 as outliers, and clean that data.
Further, in cleaning the detection results, the high-dimensional data and the low-dimensional data each use the methods of S31, S32 and S33 above and are processed separately.
Further, judging whether the cleaned data set contains redundant data and processing it includes:
S331. Obtain data whose average path lengths and anomaly scores are similar; denoting the obtained data as $X$ and $Y$, treat $X$ and $Y$ as redundant data. In step S331, the low-dimensional data and the high-dimensional data both use this method and are processed separately and in parallel;
S332. Analyze the data type of $X$ and $Y$: if they are low-dimensional redundant data, go to S333; if they are high-dimensional redundant data, go to S334;
S333. Use the Pearson correlation coefficient to obtain the similarity $H_1$ of the low-dimensional redundant data. The formula is as follows:
$$H_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
S334. Obtain from the high-dimensional data mapping database the original high-dimensional data corresponding to $X$ and $Y$, and use the hybrid similarity algorithm to obtain the similarity $H_2$ of the high-dimensional redundant data. $H_2$ combines the Spearman correlation coefficient of the data, which carries the weight $\alpha$, with the mutual information value of the data;
S335. Compare $H_1$ or $H_2$ with the preset threshold $\delta$: if $H_1 > \delta$ or $H_2 > \delta$, redundant data exists in $X$ and $Y$, and the data is cleaned.
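Step S333 and the low-dimensional branch of S335 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function names `pearson` and `is_redundant_low_dim` are hypothetical, and each data item is assumed to be an equal-length numeric sequence.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def is_redundant_low_dim(x, y, delta):
    """Hypothetical helper: flag a low-dimensional pair when H1 > delta."""
    return pearson(x, y) > delta
```

A perfectly linear pair such as `[1, 2, 3]` and `[2, 4, 6]` yields a coefficient of 1 and would be flagged for any threshold below 1.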
Further, the weight $\alpha$ and the preset threshold $\delta$ are set manually. $\alpha$ ranges from 0 to 1, with a preferred value of 0.5. $\delta$ does not exceed the maximum computed similarity; preferably, $\delta$ is set to 90% of the maximum similarity value.
Further, the data cleaning includes: randomly selecting one data item from $X$ and $Y$ and deleting it.
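The random deletion described above might look like the sketch below. The function name `drop_one_of_pair`, the list-of-rows data layout, and the optional `rng` parameter (added only so the choice can be made reproducible) are assumptions for illustration.

```python
import random


def drop_one_of_pair(dataset, i, j, rng=None):
    """Remove one of the two redundant rows (index i or j), chosen at random.

    Returns the reduced dataset and the index that was dropped.
    """
    if rng is None:
        rng = random.Random()
    victim = rng.choice((i, j))
    kept = [row for k, row in enumerate(dataset) if k != victim]
    return kept, victim
```

Returning the dropped index alongside the reduced data set makes it possible to log which record was treated as the redundant copy.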
Compared with the prior art, the technical solution provided by the invention has the following advantages:
The data anomaly processing method based on KPCA and hybrid similarity provided by the invention analyzes the tasks generated by the terminal and uploaded to the edge, divides the data involved in those tasks into high-dimensional data and low-dimensional data, and processes both; the edge then uploads the processed data to the cloud. Because the dimensionality of industrial data varies widely, the invention divides the data into high-dimensional and low-dimensional data and applies the KPCA algorithm to the high-dimensional data, reducing the dimensionality of the data set through feature extraction so that anomaly detection can be carried out on both high-dimensional and low-dimensional data. Because the nonlinear features of industrial data are difficult to mine, the invention detects redundant data using the Pearson correlation coefficient together with a hybrid similarity algorithm; since the nonlinear features of high-dimensional data and the similarity between high-dimensional data are interdependent to some degree, the similarity of high-dimensional data is computed by combining the Spearman correlation coefficient with the mutual information value. As a result, the method mines data features with high completeness, and its anomaly detection and deduplication scheme is highly accurate, which improves the quality management of data sets and promotes safe, stable and high-quality task execution at the cloud and the edge.
Brief description of the drawings
FIG. 1 is a flowchart of the data anomaly processing method based on KPCA and hybrid similarity provided by the invention.
FIG. 2 is a flowchart of the processing of high-dimensional and low-dimensional data in the data anomaly processing method based on KPCA and hybrid similarity provided by the invention.
FIG. 3 is a flowchart of the cleaning of anomalous data in the data anomaly processing method based on KPCA and hybrid similarity provided by the invention.
Detailed description
Preferred embodiments of the invention are described in detail below with reference to the drawings, so that its advantages and features can be more readily understood by those skilled in the art and its scope of protection can be defined more clearly.
Many specific details are set forth in the following description to give a full understanding of the present disclosure, but the disclosure can also be implemented in ways other than those described here; the embodiments in this specification are only some of the embodiments of the disclosure, not all of them.
FIG. 1 is a flowchart of the data anomaly processing method based on KPCA and hybrid similarity provided by the invention. The method includes:
S1: The terminal generates a task and uploads the task to the edge;
S2: The edge receives the task and divides the data involved in the task into high-dimensional data and low-dimensional data according to dimension;
S3: The high-dimensional data and the low-dimensional data are processed;
S4: The edge uploads the processed data to the cloud.
Further, the high-dimensional data is data with dimension >= 3;
the low-dimensional data is data with dimension < 3.
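The split in step S2 can be sketched in a few lines. This is an illustrative sketch only: the function name `split_by_dimension` is hypothetical, and each record is assumed to be a sequence whose length is its dimension.

```python
def split_by_dimension(records, threshold=3):
    """Partition records into (high, low) by dimension.

    A record with at least `threshold` components is high-dimensional;
    the default of 3 follows the rule stated above.
    """
    high, low = [], []
    for rec in records:
        if len(rec) >= threshold:
            high.append(rec)
        else:
            low.append(rec)
    return high, low
```

For example, `split_by_dimension([[1.0], [1.0, 2.0], [1.0, 2.0, 3.0]])` places the first two records in the low-dimensional group and the third in the high-dimensional group.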
Further, referring to FIG. 2, processing the high-dimensional data and the low-dimensional data includes:
S31. Perform anomaly detection on the high-dimensional data and the low-dimensional data to obtain detection results;
S32. Clean the detection results to obtain a cleaned data set;
S33. Judge whether the cleaned data set contains redundant data and process it.
Further, performing anomaly detection on the high-dimensional data and the low-dimensional data to obtain detection results includes:
S311. Apply iForest to the low-dimensional data for anomaly detection, obtaining the path length and anomaly score of each low-dimensional data item;
S312. Convert the high-dimensional data into feature data using the KPCA algorithm, then apply iForest to the feature data for anomaly detection, obtaining the path length and anomaly score of each high-dimensional data item.
Further, the path length is calculated as:
$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$
where $c(n)$ is the path length, $n$ is the number of samples, and $\xi$ is Euler's constant;
the anomaly score is calculated as:
$$s(x,n) = 2^{-\frac{E(h(x))}{c(n)}}$$
where $s(x,n)$ denotes the anomaly score, $E(h(x))$ denotes the expected path length, and $H(\cdot)$ is the harmonic function, $H(i) = \ln(i) + \xi$.
$E(h(x))$ is the expected path length of the data over all iTrees; the result output by the iForest algorithm is a value between 0 and 1.
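The path-length normalization and anomaly score can be sketched as below, assuming the standard iForest formulation in which the harmonic number is approximated by a natural logarithm plus Euler's constant. The function names are hypothetical, and the special cases for `n <= 2` follow the usual iForest convention rather than anything stated in this text.

```python
import math

EULER_CONSTANT = 0.5772156649  # Euler's constant used in the harmonic approximation


def harmonic(i):
    """Approximate harmonic number H(i) = ln(i) + Euler's constant."""
    return math.log(i) + EULER_CONSTANT


def average_path_length(n):
    """c(n): normalizing path length for a sample of size n."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0  # conventional special case (assumption, not from the text)
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(expected_path_length, n):
    """s = 2 ** (-E(h(x)) / c(n)); values near 1 indicate likely outliers."""
    if n < 2:
        raise ValueError("need at least two samples")
    return 2.0 ** (-expected_path_length / average_path_length(n))
```

A point whose expected path length equals `average_path_length(n)` scores exactly 0.5, while a point isolated almost immediately (expected path length near 0) scores close to 1.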
Further, converting the high-dimensional data into feature data using the KPCA algorithm includes:
establishing a high-dimensional data mapping database, in which all original high-dimensional data and the corresponding feature data are recorded.
It will be understood that the feature data is obtained by reducing the dimensionality of the original high-dimensional data. During anomaly detection on the high-dimensional and low-dimensional data, the high-dimensional data exhibits nonlinear features, so the better-performing KPCA algorithm is used to obtain the feature data of the high-dimensional data, and that feature data is then processed. During redundant-data judgment and processing on the cleaned data set, the original high-dimensional data is processed instead, in order to preserve the completeness of the high-dimensional information. The high-dimensional data mapping database is built to retain both the original high-dimensional data and the feature data, giving the scheme greater flexibility and reliability.
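One way to picture the high-dimensional data mapping database is as a keyed store that holds each original record next to its feature data, so the redundancy step can look the original back up. The sketch below is an assumption-laden illustration: the class `MappingDatabase`, its methods, and the per-record `reducer` callable are all hypothetical, and the reducer is a stand-in for a fitted KPCA transform, not KPCA itself.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class MappingDatabase:
    """In-memory map: record id -> (original high-dimensional data, feature data)."""

    # Stand-in for a fitted KPCA transform (assumption: applied per record).
    reducer: Callable[[Sequence[float]], Sequence[float]]
    rows: Dict[int, Tuple[List[float], List[float]]] = field(default_factory=dict)

    def add(self, record_id: int, original: Sequence[float]) -> List[float]:
        """Reduce one record and store both versions; return the feature data."""
        features = list(self.reducer(original))
        self.rows[record_id] = (list(original), features)
        return features

    def original(self, record_id: int) -> List[float]:
        """Original high-dimensional data, as needed for redundancy checks."""
        return self.rows[record_id][0]

    def features(self, record_id: int) -> List[float]:
        """Feature data, as fed to anomaly detection."""
        return self.rows[record_id][1]
```

For instance, `MappingDatabase(reducer=lambda v: v[:2])` uses a placeholder reducer that simply keeps the first two components; a real deployment would plug in a transform fitted on the whole data set.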
Further, cleaning the detection results includes:
S321. Obtain the path lengths and anomaly scores of the high-dimensional data and the low-dimensional data, and compute the average path length;
S322. Treat data whose average path length lies in the range 0 to 0.15 and whose anomaly score lies in the range 0.85 to 1 as outliers, and clean that data.
Specifically, those skilled in the art can set these ranges according to the data characteristics and actual needs; the values given here are for reference only and are not limiting.
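The rule in S322 can be sketched as a simple filter. The function name `clean_outliers` and the dictionary keys `avg_path_length` and `score` are hypothetical, and the average path length is assumed to be expressed on the same normalized scale as the 0 to 0.15 range quoted above.

```python
def clean_outliers(records, max_path=0.15, min_score=0.85):
    """Split records into (kept, removed) using the S322 ranges.

    A record is an outlier when its average path length lies in
    [0, max_path] AND its anomaly score lies in [min_score, 1].
    """
    kept, removed = [], []
    for rec in records:
        is_outlier = (
            0.0 <= rec["avg_path_length"] <= max_path
            and min_score <= rec["score"] <= 1.0
        )
        if is_outlier:
            removed.append(rec)
        else:
            kept.append(rec)
    return kept, removed
```

Both bounds are parameters because the text treats the ranges as reference values to be tuned to the data.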
Further, in cleaning the detection results, the high-dimensional data and the low-dimensional data each use the methods of S31, S32 and S33 above and are processed separately and in parallel. High-dimensional data of the same dimension is grouped and processed together: for example, if the high-dimensional data has dimensions $N_i$ ($i = 0, 1, \ldots, n$), the $N_i$-dimensional data of each dimension is gathered and processed with the above method, which is not repeated here.
Further, referring to FIG. 3, judging whether the cleaned data set contains redundant data and processing it includes:
S331. Obtain data whose average path lengths and anomaly scores are similar; denoting the obtained data as $X$ and $Y$, treat $X$ and $Y$ as redundant data. In step S331, the low-dimensional data and the high-dimensional data both use this method and are processed separately and in parallel;
S332. Analyze the data type of $X$ and $Y$: if they are low-dimensional redundant data, go to S333; if they are high-dimensional redundant data, go to S334;
S333. Use the Pearson correlation coefficient to obtain the similarity $H_1$ of the low-dimensional redundant data. The formula is as follows:
$$H_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
S334. Obtain from the high-dimensional data mapping database the original high-dimensional data corresponding to $X$ and $Y$, and use the hybrid similarity algorithm to obtain the similarity $H_2$ of the high-dimensional redundant data. $H_2$ combines the Spearman correlation coefficient of the data, which carries the weight $\alpha$, with the mutual information value $I(X;Y)$ of $X$ and $Y$, where:
$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$$
$p(x,y)$ denotes the joint probability of the data $X$ and $Y$, $p(x)$ and $p(y)$ denote the probabilities of $x$ and $y$ occurring, and the base of the logarithm is usually taken as $e$.
For example, with $X = [0,0,1]$ and $Y = [1,1,0]$, we obtain $p(x{=}0) = p(y{=}1) = p(0,1) = \tfrac{2}{3}$ and $p(x{=}1) = p(y{=}0) = p(1,0) = \tfrac{1}{3}$, so that $\tfrac{2}{3}\ln\tfrac{2/3}{(2/3)(2/3)} = 0.2703$ and $\tfrac{1}{3}\ln\tfrac{1/3}{(1/3)(1/3)} = 0.3662$, and in this example $I(X;Y) = 0.2703 + 0.3662 = 0.6365$.
In this scheme, mutual information measures the degree of mutual dependence between two data items: the larger the mutual information value, the stronger the dependence between them;
S335. Compare $H_1$ or $H_2$ with the preset threshold $\delta$: if $H_1 > \delta$ or $H_2 > \delta$, redundant data exists in $X$ and $Y$, and the data is cleaned.
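The high-dimensional branch (S334) can be sketched as below. The mutual information function follows the definition and worked example above; the way the two terms are combined is an assumption, since the exact hybrid formula is not recoverable from this text. Here a convex combination `alpha * spearman + (1 - alpha) * mutual_information` is used purely for illustration, with `alpha` standing for the Spearman weight. All function names are hypothetical, and values are treated as discrete symbols when estimating probabilities.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """I(X;Y) in nats, with probabilities estimated from value counts."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


def _ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def _pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return _pearson(_ranks(x), _ranks(y))


def hybrid_similarity(x, y, alpha=0.5):
    """ASSUMED form of H2: weighted sum of Spearman and mutual information."""
    return alpha * spearman(x, y) + (1 - alpha) * mutual_information(x, y)
```

Calling `mutual_information([0, 0, 1], [1, 1, 0])` reproduces the worked example's value of about 0.6365. For continuous sensor values the count-based probability estimate would need a binning step first, which this sketch leaves out.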
Further, the values of the weight $\alpha$ and the preset threshold $\delta$ may be chosen according to the situation; $\alpha$ is preferably 0.5.
Specifically, those skilled in the art can set $\delta$ according to the data characteristics and actual needs; preferably, $\delta$ is a manually set fixed threshold equal to 90% of the current upper limit of the similarity. The value given here is for reference only and is not limiting.
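The preferred threshold setting can be sketched as follows, reading "the current upper limit of the similarity" as the largest similarity computed so far; that reading, the function names, and the pair-keyed dictionary layout are assumptions for illustration.

```python
def threshold_from_similarities(similarities, fraction=0.9):
    """Threshold set to a fraction (default 90%) of the maximum similarity."""
    values = list(similarities)
    if not values:
        raise ValueError("no similarities to derive a threshold from")
    return fraction * max(values)


def redundant_pairs(pair_similarities, fraction=0.9):
    """Return the pairs whose similarity exceeds the derived threshold."""
    delta = threshold_from_similarities(pair_similarities.values(), fraction)
    return [pair for pair, sim in pair_similarities.items() if sim > delta]
```

With similarities of 0.95, 0.5 and 0.9, the derived threshold is about 0.855, so the first and third pairs are reported as redundant.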
Further, the data cleaning includes: randomly selecting one data item from $X$ and $Y$ and deleting it.
The flowcharts and block diagrams in the drawings illustrate the architecture, functions and operations that methods according to various embodiments of the invention may implement. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that shown in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
It should be noted that, in this document, relational terms are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include" and any variants of them are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises it.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211321839.6A CN115438035B (en) | 2022-10-27 | 2022-10-27 | Data exception handling method based on KPCA and mixed similarity |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211321839.6A CN115438035B (en) | 2022-10-27 | 2022-10-27 | Data exception handling method based on KPCA and mixed similarity |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115438035A true CN115438035A (en) | 2022-12-06 |
CN115438035B CN115438035B (en) | 2023-04-07 |
Family
ID=84252560
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211321839.6A Active CN115438035B (en) | 2022-10-27 | 2022-10-27 | Data exception handling method based on KPCA and mixed similarity |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115438035B (en) |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140162274A1 (en) * | 2012-06-28 | 2014-06-12 | Taxon Biosciences, Inc. | Compositions and methods for identifying and comparing members of microbial communities using amplicon sequences |
CN104091337A (en) * | 2014-07-11 | 2014-10-08 | 北京工业大学 | Deformation medical image registration method based on PCA and diffeomorphism Demons |
CN106709869A (en) * | 2016-12-25 | 2017-05-24 | 北京工业大学 | Dimensionally reduction method based on deep Pearson embedment |
CN106886601A (en) * | 2017-03-02 | 2017-06-23 | 大连理工大学 | A Cross-modal Retrieval Algorithm Based on Subspace Hybrid Hypergraph Learning |
CN109214503A (en) * | 2018-08-01 | 2019-01-15 | 华北电力大学 | Project of transmitting and converting electricity cost forecasting method based on KPCA-LA-RBM |
CN110069467A (en) * | 2019-04-16 | 2019-07-30 | 沈阳工业大学 | System peak load based on Pearson's coefficient and MapReduce parallel computation clusters extraction method |
CN111275288A (en) * | 2019-12-31 | 2020-06-12 | 华电国际电力股份有限公司十里泉发电厂 | XGboost-based multi-dimensional data anomaly detection method and device |
CN111338897A (en) * | 2020-02-24 | 2020-06-26 | 京东数字科技控股有限公司 | Identification method of abnormal node in application host, monitoring equipment and electronic equipment |
US20200293554A1 (en) * | 2018-03-15 | 2020-09-17 | Alibaba Group Holding Limited | Abnormal sample prediction |
CN111931868A (en) * | 2020-09-24 | 2020-11-13 | 常州微亿智造科技有限公司 | Time series data abnormity detection method and device |
US20210200746A1 (en) * | 2019-12-30 | 2021-07-01 | Royal Bank Of Canada | System and method for multivariate anomaly detection |
CN113420691A (en) * | 2021-06-30 | 2021-09-21 | 昆明理工大学 | Mixed domain characteristic bearing fault diagnosis method based on Pearson correlation coefficient |
CN113901993A (en) * | 2021-09-24 | 2022-01-07 | 上海海事大学 | Fault diagnosis method based on PCCs secondary feature optimization |
CN114239807A (en) * | 2021-12-17 | 2022-03-25 | 山东省计算中心(国家超级计算济南中心) | RFE-DAGMM-based high-dimensional data anomaly detection method |
WO2022110557A1 (en) * | 2020-11-25 | 2022-06-02 | 国网湖南省电力有限公司 | Method and device for diagnosing user-transformer relationship anomaly in transformer area |
CN115150744A (en) * | 2022-08-02 | 2022-10-04 | 天津城建大学 | A method for locating indoor signal interference sources in large conference venues |
CN114239807A (en) * | 2021-12-17 | 2022-03-25 | 山东省计算中心(国家超级计算济南中心) | RFE-DAGMM-based high-dimensional data anomaly detection method |
CN115150744A (en) * | 2022-08-02 | 2022-10-04 | 天津城建大学 | A method for locating indoor signal interference sources in large conference venues |
Non-Patent Citations (4)
Title |
---|
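Ning Zhang: "Magnetic Anomaly Detection Method Based on Feature Fusion and Isolation Forest Algorithm", IEEE Access, vol. 10 * |
Li Weizhou (李为州): "Research on supervector dimensionality reduction based on deep belief networks in speaker recognition", Computer Knowledge and Technology (《电脑知识与技术》) * |
Yang Yinghua (杨英华) et al.: "Process monitoring and fault diagnosis based on subspace hybrid similarity", Chinese Journal of Scientific Instrument (《仪器仪表学报》) * |
Chen Mao (陈茂): "Research on big data cleaning algorithms based on edge computing in the industrial Internet of Things", CNKI Outstanding Master's Theses Full-text Database (《CNKI优秀硕士学位论文全文库》) * |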
Also Published As
Publication number | Publication date |
---|---|
CN115438035B (en) | 2023-04-07 |
Similar Documents
Publication | Title |
---|---|
CN109727446B (en) | Method for identifying and processing abnormal value of electricity consumption data |
CN117421684A (en) | Abnormal data monitoring and analyzing method based on data mining and neural network |
CN113328755B (en) | Compressed data transmission method facing edge calculation |
CN112462261B (en) | Motor abnormality detection method and device, electronic equipment and storage medium |
CN111506672B (en) | Method, device, equipment and storage medium for analyzing environment-friendly monitoring data in real time |
CN114281864A (en) | A Correlation Analysis Method for Power Network Alarm Information |
CN111949501A (en) | IT system operation risk monitoring method and device |
CN111275821A (en) | Power line fitting method, system and terminal |
CN117670575A (en) | Intelligent workshop management system and method for furniture production |
CN105913064B (en) | A fitting optimization method for image visual saliency detection |
CN115438035A (en) | A data anomaly processing method based on KPCA and hybrid similarity |
CN114580534A (en) | Industrial data anomaly detection method and device, electronic equipment and storage medium |
CN114398828A (en) | Drilling rate intelligent prediction and optimization method, system, equipment and medium |
CN116050579B (en) | Building energy consumption prediction method and system based on deep feature fusion network |
CN117404853A (en) | External circulating water cooling system and method for tunnel boring machine |
CN106778252A (en) | Intrusion detection method based on rough set theory Yu WAODE algorithms |
CN113420733B (en) | Efficient distributed big data acquisition implementation method and system |
CN105389592A (en) | Method and apparatus for identifying image |
CN114330143B (en) | Distributed parameter system state prediction method based on multi-source space-time information |
CN110807466A (en) | Method and device for processing order data |
CN114372689A (en) | Road network operation characteristic variable point identification method based on dynamic planning |
CN113315524A (en) | Landmark data compression transmission method and device based on deep learning |
CN113919542A (en) | Distribution network edge side load identification method and device and terminal equipment |
CN118275309B (en) | Online monitoring system and method for condensable particles in industrial exhaust flue gas |
CN112560992B (en) | Method, device, electronic equipment and storage medium for optimizing picture classification model |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |