CN115438035A - A data anomaly processing method based on KPCA and hybrid similarity - Google Patents

Info

Publication number
CN115438035A
Authority
CN
China
Prior art keywords
data
dimensional data
dimensional
low
kpca
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211321839.6A
Other languages
Chinese (zh)
Other versions
CN115438035B (en)
Inventor
马勇
赵从俊
戴梦轩
贺嘉
李博嘉
何兵兵
唐泳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Normal University
Original Assignee
Jiangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi Normal University
Priority to CN202211321839.6A
Publication of CN115438035A
Application granted
Publication of CN115438035B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a data anomaly processing method based on KPCA and hybrid similarity, comprising the following steps: S1: a terminal generates a task and uploads the task to the edge; S2: the edge receives the task and divides the data involved in the task into high-dimensional data and low-dimensional data; S3: the high-dimensional data and the low-dimensional data are processed; S4: the edge uploads the processed data to the cloud. In this way, the method mines data features more completely and detects anomalies and duplicates more accurately, which improves the quality management of the data set and promotes safe, stable, high-quality execution of tasks at the cloud and the edge.

Description

A data anomaly processing method based on KPCA and hybrid similarity

Technical field

The invention relates to the field of big data processing, and in particular to a data anomaly processing method based on KPCA and hybrid similarity.

Background

In recent years, traditional industrial control systems have gradually been connected to the Internet and cloud platforms, forming industrial Internet platforms. Meanwhile, with the rapid development of the Internet of Things and 5G, mobile terminal devices generate massive amounts of data. If all data collected by terminal devices is transmitted over the network to the cloud for cleaning, mining, and other work, the network bandwidth comes under heavy pressure, latency grows, and the computing resources of the cloud computing center are wasted. It is therefore necessary to clean the data appropriately at the edge and then upload the clean data to the cloud for storage and use. The prior art often covers the detection and cleaning of outliers in industrial data but rarely covers the deduplication of redundant data. Because much redundant data can be derived from other data, redundant information contributes nothing during data processing, and removing it effectively reduces data transmission pressure and task load.

Chinese invention patent (application No. 201811519395.0, publication No. CN 109635958 A) discloses an intelligent power data anomaly detection method that reduces the dimensionality of valid offline data samples and computes a time-series sample sequence, including: applying PCA (principal component analysis) to the valid offline data samples to reduce their dimensionality and remove the correlation among the features of each dimension above three dimensions, yielding dimension-reduced offline data samples; and serializing the dimension-reduced offline data samples to obtain the time-series sample sequence. The shortcoming of this scheme is that traditional industrial data is mostly high-dimensional and strongly nonlinear, while PCA handles nonlinear data only moderately well: information is poorly preserved after dimensionality reduction and nonlinear features are hard to capture, so the data after anomaly detection is less accurate.

Chinese invention patent (application No. 201911423436.0, publication No. CN 111275288 A) discloses an XGBoost-based multi-dimensional data anomaly detection method and device, including: collecting and cleaning data, and standardizing the cleaned data to unify the scales of data in different dimensions; extracting features and reducing dimensionality; training an anomaly detection model by applying XGBoost to the dimension-reduced data to build a prediction model for equipment anomalies; and performing online anomaly detection, where an anomaly is declared if a given threshold is exceeded. The shortcoming of this scheme is that it considers only the Pearson correlation coefficient, which works well only on data sets with strong correlation and poorly on industrial data with strong nonlinear relationships; redundant data is therefore detected with insufficient accuracy, and deduplication performs poorly.

Summary of the invention

To solve the above technical problems, or at least partly solve them, the present disclosure provides a data anomaly processing method based on KPCA and hybrid similarity, comprising the following steps:

S1: a terminal generates a task and uploads the task to the edge;

S2: the edge receives the task and divides the data involved in the task, by dimension, into high-dimensional data and low-dimensional data;

S3: the high-dimensional data and the low-dimensional data are processed;

S4: the edge uploads the processed data to the cloud.

Further, the high-dimensional data is data with dimension ≥ 3;

the low-dimensional data is data with dimension < 3.
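The split in S2 can be sketched as follows. This is a minimal illustration that assumes each record is a flat sequence of numbers and that "dimension" means the number of attributes in a record; the function name `split_by_dimension` and the sample records are hypothetical and do not come from the patent.

```python
from typing import List, Sequence, Tuple

HIGH_DIM_THRESHOLD = 3  # from the text: dimension >= 3 is high-dimensional


def split_by_dimension(
    records: Sequence[Sequence[float]],
) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (high, low): records with >= 3 attributes and records with < 3."""
    high: List[List[float]] = []
    low: List[List[float]] = []
    for record in records:
        # Assumption: the dimension of a record is its attribute count.
        target = high if len(record) >= HIGH_DIM_THRESHOLD else low
        target.append(list(record))
    return high, low


high, low = split_by_dimension(
    [[1.0], [2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0, 10.0]]
)
```

Keeping the two groups in separate lists matches the later steps, which process high-dimensional and low-dimensional data separately.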

Further, processing the high-dimensional data and the low-dimensional data includes:

S31. performing anomaly detection on the high-dimensional data and the low-dimensional data to obtain detection results;

S32. cleaning the detection results to obtain a cleaned data set;

S33. identifying redundant data in the cleaned data set and processing it.

Further, performing anomaly detection on the high-dimensional data and the low-dimensional data to obtain detection results includes:

S311. performing anomaly detection on the low-dimensional data using iForest to obtain the path length and anomaly score of each low-dimensional data item;

S312. converting the high-dimensional data into feature data using the KPCA algorithm, then performing anomaly detection on the feature data using iForest to obtain the path length and anomaly score of each high-dimensional data item;

Further, converting the high-dimensional data into feature data using the KPCA algorithm includes:

establishing a high-dimensional data mapping database that records all original high-dimensional data together with the corresponding feature data.
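A rough sketch of the KPCA step and the mapping database is given below. It rests on several assumptions that the text does not state: an RBF kernel with a hypothetical `gamma`, extraction of only the first kernel principal component (found by power iteration), and an in-memory dictionary standing in for the mapping database. A production system would more likely rely on an established KPCA implementation.

```python
import math


def rbf_kernel_matrix(points, gamma):
    """K[i][j] = exp(-gamma * ||p_i - p_j||^2). The RBF kernel is an assumption."""
    n = len(points)
    return [
        [
            math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
            for j in range(n)
        ]
        for i in range(n)
    ]


def center_kernel(K):
    """Center the kernel matrix in feature space."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    total_mean = sum(row_mean) / n
    # K is symmetric, so column means equal row means.
    return [
        [K[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def first_kernel_component(points, gamma=0.5, iterations=200):
    """Project each point onto the first kernel principal component."""
    Kc = center_kernel(rbf_kernel_matrix(points, gamma))
    n = len(points)
    v = [float(i + 1) for i in range(n)]  # fixed, non-constant start vector
    for _ in range(iterations):  # power iteration for the top eigenvector
        w = [sum(Kc[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(Kc[i][j] * v[j] for j in range(n)) for i in range(n)
    )
    # For training points the projection is sqrt(eigenvalue) * v_i.
    return [math.sqrt(eigenvalue) * x for x in v]


def build_mapping_database(points, gamma=0.5):
    """Hypothetical mapping database: record id -> original data and feature."""
    features = first_kernel_component(points, gamma)
    return {
        i: {"original": list(p), "feature": f}
        for i, (p, f) in enumerate(zip(points, features))
    }
```

Storing the original record next to its feature lets the later redundancy step look up the original high-dimensional data from the feature data that anomaly detection worked on.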

Further, cleaning the detection results includes:

S321. obtaining the path lengths and anomaly scores of the high-dimensional data and the low-dimensional data, and computing the average path length;

S322. treating data whose average path length lies in the range 0 to 0.15 and whose anomaly score lies in the range 0.85 to 1 as outliers, and cleaning them.
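The rule in S322 can be illustrated with a small filter. The sketch assumes each record already carries an average path length normalized to the range 0 to 1 and an anomaly score; the field names `avg_path_length` and `anomaly_score` are hypothetical.

```python
def remove_outliers(records, max_path=0.15, min_score=0.85):
    """Drop records that fall inside both outlier ranges given in S322."""
    kept = []
    for record in records:
        is_outlier = (
            0.0 <= record["avg_path_length"] <= max_path
            and min_score <= record["anomaly_score"] <= 1.0
        )
        if not is_outlier:
            kept.append(record)
    return kept


sample = [
    {"id": "a", "avg_path_length": 0.10, "anomaly_score": 0.90},  # outlier
    {"id": "b", "avg_path_length": 0.60, "anomaly_score": 0.40},  # normal
    {"id": "c", "avg_path_length": 0.10, "anomaly_score": 0.50},  # only one range
]
cleaned = remove_outliers(sample)
```

Both conditions must hold for a record to be removed, so record `c` above is kept even though its path length is short.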

Further, the high-dimensional data and the low-dimensional data are each processed separately using the methods of S31, S32, and S33 above.

Further, identifying redundant data in the cleaned data set and processing it includes:

S331. obtaining data whose average path lengths and anomaly scores are similar; denoting the obtained data as $X$ and $Y$, $X$ and $Y$ are regarded as redundant data; in step S331, the low-dimensional data and the high-dimensional data both use the above method, separately and in parallel;

S332. analyzing the data type of $X$ and $Y$: if they are low-dimensional redundant data, go to S333; if they are high-dimensional redundant data, go to S334;

S333. using the Pearson correlation coefficient to obtain the similarity $H_1$ of the low-dimensional redundant data; the formula is as follows:

$$H_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

where $x_i$ and $y_i$ are the elements of the two data items being compared, and $\bar{x}$ and $\bar{y}$ are their means;
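As an illustration of S333, the Pearson correlation coefficient can be computed directly from its standard definition. The function name `pearson_similarity` is hypothetical, and the sketch assumes the two data items are equal-length numeric sequences.

```python
import math


def pearson_similarity(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 indicates that one item is close to a linear function of the other, which is the sense in which low-dimensional data is treated as redundant here.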

S334. retrieving from the high-dimensional data mapping database the original high-dimensional data corresponding to the high-dimensional redundant data, and using the hybrid similarity algorithm to obtain the similarity $H_2$ of the high-dimensional redundant data; $H_2$ combines the Spearman correlation coefficient $\rho$ of the data and the mutual information value $I$ of the data, where $\alpha$ is the weight given to the Spearman correlation coefficient;
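One plausible reading of the hybrid similarity is a convex combination, $H_2 = \alpha\rho + (1-\alpha)I$. This form is an assumption: the extracted text only states that $\alpha$ is the weight of the Spearman coefficient and that it ranges from 0 to 1. The sketch below computes the Spearman coefficient with average ranks for ties and takes the mutual information value as an input; all function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation, used here on ranks."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def hybrid_similarity(rho, mutual_info, alpha=0.5):
    """Assumed form: alpha * rho + (1 - alpha) * mutual information."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * rho + (1.0 - alpha) * mutual_info
```

Because the Spearman coefficient works on ranks, it captures monotone but nonlinear relationships, while the mutual information term captures more general dependence.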

S335. comparing $H_1$ or $H_2$ with a preset threshold $\delta$: if $H_1 > \delta$ or $H_2 > \delta$, the data contains redundancy, and data removal is performed.

Further, the weight $\alpha$ of the Spearman correlation coefficient and the preset threshold $\delta$ are set manually; $\alpha$ ranges from 0 to 1 and is preferably 0.5; $\delta$ does not exceed the maximum computed similarity and is preferably set to 90% of the maximum similarity value.
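The threshold choice can be sketched as follows, assuming the similarities of all candidate pairs have already been computed and are held in a dictionary keyed by pair; the function names and the sample values are hypothetical.

```python
def redundancy_threshold(similarities, ratio=0.9):
    """Threshold set to a fraction (90% by default) of the maximum similarity."""
    return ratio * max(similarities)


def flag_redundant(pair_similarities, ratio=0.9):
    """Return the pairs whose similarity exceeds the threshold."""
    delta = redundancy_threshold(pair_similarities.values(), ratio)
    return [pair for pair, h in pair_similarities.items() if h > delta]


sims = {("a", "b"): 0.95, ("a", "c"): 0.40, ("b", "c"): 0.88}
flagged = flag_redundant(sims)
```

With these sample values the threshold is 0.855, so the pairs with similarity 0.95 and 0.88 are flagged and the pair with 0.40 is not.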

Further, the data cleaning includes: randomly selecting one item of the redundant data and deleting it.

Compared with the prior art, the technical solution provided by the invention has the following advantages:

The data anomaly processing method based on KPCA and hybrid similarity provided by the invention analyzes tasks that terminals generate and upload to the edge, divides the data involved in each task into high-dimensional data and low-dimensional data, processes both, and has the edge upload the processed data to the cloud. Because the dimensionality of industrial data varies widely, the invention divides the data into high-dimensional and low-dimensional data and applies the KPCA algorithm to the high-dimensional data, reducing the dimensionality of the data set through feature extraction so that anomalies can be detected in both high-dimensional and low-dimensional data. Because the nonlinear features of industrial data are hard to mine, the invention detects redundant data using the Pearson correlation coefficient together with a hybrid similarity algorithm; since the similarity between high-dimensional data depends to some extent on their nonlinear features, the similarity of high-dimensional data is computed by combining the Spearman correlation coefficient with the mutual information value. As a result, the method mines data features more completely and detects anomalies and duplicates more accurately, which improves the quality management of the data set and promotes safe, stable, high-quality execution of tasks at the cloud and the edge.

Description of drawings

FIG. 1 is a flowchart of the data anomaly processing method based on KPCA and hybrid similarity provided by the invention.

FIG. 2 is a flowchart of the processing of high-dimensional and low-dimensional data in the method.

FIG. 3 is a flowchart of abnormal-data cleaning in the method.

Detailed description

Preferred embodiments of the invention are described in detail below with reference to the accompanying drawings, so that the advantages and features of the invention are easier for those skilled in the art to understand and the scope of protection of the invention is defined more clearly.

Many specific details are set forth in the following description to provide a full understanding of the present disclosure, but the disclosure can also be implemented in ways other than those described here; the embodiments in the specification are only some, not all, of the embodiments of the disclosure.

FIG. 1 is a flowchart of the data anomaly processing method based on KPCA and hybrid similarity provided by the invention. The method includes:

S1: a terminal generates a task and uploads the task to the edge;

S2: the edge receives the task and divides the data involved in the task, by dimension, into high-dimensional data and low-dimensional data;

S3: the high-dimensional data and the low-dimensional data are processed;

S4: the edge uploads the processed data to the cloud.

Further, the high-dimensional data is data with dimension ≥ 3;

the low-dimensional data is data with dimension < 3.

Further, referring to FIG. 2, processing the high-dimensional data and the low-dimensional data includes:

S31. performing anomaly detection on the high-dimensional data and the low-dimensional data to obtain detection results;

S32. cleaning the detection results to obtain a cleaned data set;

S33. identifying redundant data in the cleaned data set and processing it.

Further, performing anomaly detection on the high-dimensional data and the low-dimensional data to obtain detection results includes:

S311. performing anomaly detection on the low-dimensional data using iForest to obtain the path length and anomaly score of each low-dimensional data item;

S312. converting the high-dimensional data into feature data using the KPCA algorithm, then performing anomaly detection on the feature data using iForest to obtain the path length and anomaly score of each high-dimensional data item;

Further, the path length is computed as:

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$

where $c(n)$ is the path length, $n$ is the number of samples, and $\gamma$ is Euler's constant, which enters through the harmonic function $H(i) = \ln(i) + \gamma$;

The anomaly score is computed as:

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

where $s(x, n)$ is the anomaly score, $E(h(x))$ is the expected path length, and $H(\cdot)$ is the harmonic function, $H(i) = \ln(i) + \gamma$.

$E(h(x))$ is the expected path length of a data item over all iTrees; the iForest algorithm outputs a value between 0 and 1.
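The two iForest quantities can be evaluated numerically with a short sketch. It assumes the standard isolation forest normalization, in which $c(n)$ is built from the harmonic-number approximation $H(i) \approx \ln(i) + 0.5772156649$; the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant


def harmonic(i):
    """Approximate harmonic number H(i) = ln(i) + Euler's constant."""
    return math.log(i) + EULER_GAMMA


def normalizing_path_length(n):
    """c(n) = 2 * H(n - 1) - 2 * (n - 1) / n for n samples."""
    if n <= 1:
        return 0.0  # assumption: no normalization is defined for a single sample
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(expected_path_length, n):
    """s = 2 ** (-E(h(x)) / c(n)); values close to 1 suggest an anomaly."""
    return 2.0 ** (-expected_path_length / normalizing_path_length(n))
```

A point whose expected path length equals $c(n)$ scores exactly 0.5, and shorter paths push the score toward 1, which is why high scores mark likely anomalies.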

Further, converting the high-dimensional data into feature data using the KPCA algorithm includes:

establishing a high-dimensional data mapping database that records all original high-dimensional data together with the corresponding feature data;

It can be understood that the feature data is obtained by reducing the dimensionality of the original high-dimensional data. During anomaly detection, high-dimensional data exhibits nonlinear features, so the KPCA algorithm, which handles them well, is used to obtain the feature data of the high-dimensional data, and the feature data is then processed. When redundant data is identified in the cleaned data set and processed, the original high-dimensional data is used instead, to preserve the completeness of the high-dimensional information. The high-dimensional data mapping database is built to retain both the original high-dimensional data and the feature data, giving the scheme greater flexibility and reliability.

Further, cleaning the detection results includes:

S321. obtaining the path lengths and anomaly scores of the high-dimensional data and the low-dimensional data, and computing the average path length;

S322. treating data whose average path length lies in the range 0 to 0.15 and whose anomaly score lies in the range 0.85 to 1 as outliers, and cleaning them;

Specifically, those skilled in the art can set these ranges according to the data characteristics and actual requirements; the values given here are for reference and are not limiting.

Further, in cleaning the detection results, the high-dimensional data and the low-dimensional data each use the methods of S31, S32, and S33 above, separately and in parallel, where the high-dimensional data is processed in groups of equal dimension; for example, if the high-dimensional data has dimensions $N_i$ ($i = 0, 1, \ldots, n$), the $N_i$-dimensional data of each dimension is processed with the above method, which is not repeated here.

Further, referring to FIG. 3, identifying redundant data in the cleaned data set and processing it includes:

S331. obtaining data whose average path lengths and anomaly scores are similar; denoting the obtained data as $X$ and $Y$, $X$ and $Y$ are regarded as redundant data; in step S331, the low-dimensional data and the high-dimensional data both use the above method, separately and in parallel;

S332. analyzing the data type of $X$ and $Y$: if they are low-dimensional redundant data, go to S333; if they are high-dimensional redundant data, go to S334;

S333. using the Pearson correlation coefficient to obtain the similarity $H_1$ of the low-dimensional redundant data; the formula is as follows:

$$H_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

where $x_i$ and $y_i$ are the elements of the two data items being compared, and $\bar{x}$ and $\bar{y}$ are their means;

S334. retrieving from the high-dimensional data mapping database the original high-dimensional data corresponding to the high-dimensional redundant data, and using the hybrid similarity algorithm to obtain the similarity $H_2$ of the high-dimensional redundant data; $H_2$ combines the Spearman correlation coefficient $\rho$ of the data and the mutual information value $I(X;Y)$ of the data, where $\alpha$ is the weight given to the Spearman correlation coefficient, and where:

$$I(X;Y) = \sum_{x \in X}\sum_{y \in Y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$$

$p(x,y)$ is the joint probability of the data $X$ and $Y$, $p(x)$ and $p(y)$ are the probabilities of occurrence of $x$ and $y$, and the base of the logarithm is usually taken as $e$.

For example, for $X = [0, 0, 1]$ and $Y = [1, 1, 0]$ we obtain $p(X=0) = 2/3$, $p(X=1) = 1/3$, $p(Y=0) = 1/3$, $p(Y=1) = 2/3$, and the mutual information in this example is $I(X;Y) = \frac{2}{3}\log\frac{2/3}{(2/3)(2/3)} + \frac{1}{3}\log\frac{1/3}{(1/3)(1/3)} = 0.6365$.
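The worked example can be checked numerically. The sketch below estimates the probabilities from value counts and uses the natural logarithm, as the text suggests; the function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information of two equal-length discrete sequences (natural log)."""
    n = len(x)
    joint = Counter(zip(x, y))  # counts of (x_i, y_i) pairs
    px = Counter(x)             # counts of each value in x
    py = Counter(y)             # counts of each value in y
    return sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in joint.items()
    )


print(round(mutual_information([0, 0, 1], [1, 1, 0]), 4))  # 0.6365
```

Here $Y$ is fully determined by $X$, so the mutual information equals the entropy of $X$, which is about 0.6365 nats.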

In this scheme, mutual information measures how strongly two data items depend on each other: the larger the mutual information value, the stronger the dependence between them;

S335. comparing $H_1$ or $H_2$ with a preset threshold $\delta$: if $H_1 > \delta$ or $H_2 > \delta$, the data contains redundancy, and data removal is performed.

Further, the values of the weight $\alpha$ of the Spearman correlation coefficient and the preset threshold $\delta$ depend on the situation; $\alpha$ is preferably 0.5.

Specifically, those skilled in the art can set $\delta$ according to the data characteristics and actual requirements; preferably, it is a manually set fixed threshold equal to 90% of the current upper limit of the similarity. The values given here are for reference and are not limiting.

Further, the data cleaning includes: randomly selecting one item of the redundant data and deleting it.
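The random removal can be sketched as below, assuming the data set is a list of record identifiers and a redundant pair has already been flagged; the function name `drop_one_of_pair` and the optional `rng` parameter (added so that runs can be reproduced) are hypothetical.

```python
import random


def drop_one_of_pair(dataset, pair, rng=None):
    """Remove one randomly chosen member of a redundant pair from the data set."""
    rng = rng or random.Random()
    victim = rng.choice(list(pair))
    return [item_id for item_id in dataset if item_id != victim]


remaining = drop_one_of_pair(["a", "b", "c"], ("a", "b"), random.Random(0))
```

Exactly one of the two redundant items survives, and records outside the pair are untouched.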

附图中的流程图和框图,图示了按照本发明各种实施例的方法可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,该模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。The flow charts and block diagrams in the accompanying drawings illustrate the architecture, functions and operations that may be implemented by methods according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more logical functions for implementing specified executable instructions. It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or operations , or may be implemented by a combination of dedicated hardware and computer instructions.

It should be noted that, herein, relational terms are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprises", "includes" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent in such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device comprising said element.

Claims (9)

1. A data anomaly processing method based on KPCA and hybrid similarity, characterized by comprising the following steps:
S1: a terminal generates a task and uploads the task to an edge end;
S2: the edge end receives the task and divides the data involved in the task into high-dimensional data and low-dimensional data;
S3: the high-dimensional data and the low-dimensional data are processed;
S4: the edge end uploads the processed data to the cloud.

2. The data anomaly processing method based on KPCA and hybrid similarity according to claim 1, characterized in that:
the high-dimensional data is data with dimension >= 3;
the low-dimensional data is data with dimension < 3.

3. The data anomaly processing method based on KPCA and hybrid similarity according to claim 1, characterized in that processing the high-dimensional data and the low-dimensional data includes:
S31. performing anomaly detection on the high-dimensional data and the low-dimensional data to obtain a detection result;
S32. cleaning the detection result to obtain a cleaned data set;
S33. performing redundant-data judgment on the cleaned data set and processing it.

4. The data anomaly processing method based on KPCA and hybrid similarity according to claim 3, characterized in that performing anomaly detection on the high-dimensional data and the low-dimensional data to obtain a detection result includes:
S311. applying iForest to the low-dimensional data for anomaly detection, and obtaining the path length and anomaly score corresponding to each low-dimensional datum;
S312. converting the high-dimensional data into feature data using the KPCA algorithm, then applying iForest to the feature data for anomaly detection, and obtaining the path length and anomaly score corresponding to each high-dimensional datum.

5. The data anomaly processing method based on KPCA and hybrid similarity according to claim 4, characterized in that converting the high-dimensional data into feature data using the KPCA algorithm includes:
establishing a high-dimensional data mapping database, and recording in the high-dimensional data mapping database all original high-dimensional data and the corresponding feature data.

6. The data anomaly processing method based on KPCA and hybrid similarity according to claim 5, characterized in that cleaning the detection result includes:
S321. obtaining the path lengths and anomaly scores of the high-dimensional data and the low-dimensional data, and calculating the average path length;
S322. treating data whose average path length is in the range 0~0.15 and whose anomaly score is in the range 0.85~1 as outliers, and performing data cleaning.

7. The data anomaly processing method based on KPCA and hybrid similarity according to any one of claims 4-6, characterized in that, in cleaning the detection result, the high-dimensional data and the low-dimensional data are each processed separately using the methods involved in S31, S32 and S33, wherein high-dimensional data of the same dimension are selected and processed together.

8. The data anomaly processing method based on KPCA and hybrid similarity according to claim 6, characterized in that performing redundant-data judgment on the cleaned data set and processing it includes:
S331. acquiring data whose average path lengths and anomaly scores are similar; denoting the acquired data as [Figure 408635DEST_PATH_IMAGE001], [Figure 847838DEST_PATH_IMAGE002] is regarded as redundant data; in step S331, the low-dimensional data and the high-dimensional data both use the above method, performed separately and synchronously;
S332. analyzing the data type of [Figure 297273DEST_PATH_IMAGE003]; if [Figure 214545DEST_PATH_IMAGE001] is low-dimensional redundant data, go to S333; if [Figure 105141DEST_PATH_IMAGE001] is high-dimensional redundant data, go to S334;
S333. using the Pearson correlation coefficient to obtain the similarity H1 of the low-dimensional redundant data, according to the formula: H1 = corr [Figure 964512DEST_PATH_IMAGE003];
S334. obtaining from the high-dimensional data mapping database the original high-dimensional data [Figure 700835DEST_PATH_IMAGE004] corresponding to [Figure 386397DEST_PATH_IMAGE001], and using a hybrid similarity algorithm to obtain the similarity H2 of the high-dimensional redundant data, according to the formula:
[Figure 773833DEST_PATH_IMAGE005]
where μ is the weight of the Spearman correlation coefficient, [Figure 820418DEST_PATH_IMAGE006] is the Spearman correlation coefficient of the data [Figure 713288DEST_PATH_IMAGE007], and [Figure 893733DEST_PATH_IMAGE008] is the mutual information value of [Figure 571970DEST_PATH_IMAGE007];
S335. comparing H1 or H2 with a preset threshold δ; if H1 > δ or H2 > δ, redundant data exists in [Figure 773145DEST_PATH_IMAGE009], and the data is cleared.

9. The data anomaly processing method based on KPCA and hybrid similarity according to claim 8, characterized in that μ and the preset threshold δ are set manually, μ ranges from 0 to 1, and δ does not exceed the calculated maximum similarity.
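The similarity computations of steps S333 to S335 can be illustrated with a short, standard-library Python sketch. Several points are assumptions rather than the claimed method: the hybrid formula itself is only available as an image, so the sketch assumes the convex combination H2 = μ·ρ + (1 − μ)·I, where ρ is the Spearman coefficient and I is a mutual information estimate; the mutual information is estimated from equal-width histograms and normalised to the range 0 to 1; each datum is treated as a numeric sequence of equal length. All function names are hypothetical.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient (H1 in step S333)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def ranks(v):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def _bin(v, lo, hi, bins):
    """Equal-width bin index of v within [lo, hi]."""
    if hi == lo:
        return 0
    return min(int((v - lo) / (hi - lo) * bins), bins - 1)


def normalised_mutual_information(x, y, bins=4):
    """Histogram MI estimate divided by min(H(x), H(y)) (an assumption)."""
    n = len(x)
    bx = [_bin(a, min(x), max(x), bins) for a in x]
    by = [_bin(b, min(y), max(y), bins) for b in y]
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    mi = sum(c / n * math.log(c * n / (px[a] * py[b]))
             for (a, b), c in pxy.items())
    hx = -sum(c / n * math.log(c / n) for c in px.values())
    hy = -sum(c / n * math.log(c / n) for c in py.values())
    return mi / min(hx, hy) if hx and hy else 0.0


def hybrid_similarity(x, y, mu=0.5):
    """Assumed form of H2: mu * Spearman + (1 - mu) * normalised MI."""
    return mu * spearman(x, y) + (1 - mu) * normalised_mutual_information(x, y)


def is_redundant(h, delta):
    """Step S335: similarity above the preset threshold means redundancy."""
    return h > delta


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2 * v + 1 for v in x]      # exact linear copy of x
h1 = pearson(x, y)
h2 = hybrid_similarity(x, y, mu=0.5)
delta = 0.9                      # hypothetical preset threshold
```

In this example `y` is an exact linear function of `x`, so both similarities come out at essentially 1 and the pair is flagged as redundant for any threshold below that. The weight `mu` and the bin count are tuning parameters, and the choice of normalisation for the mutual information term would need to be checked against the original formula.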
CN202211321839.6A 2022-10-27 2022-10-27 Data exception handling method based on KPCA and mixed similarity Active CN115438035B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211321839.6A CN115438035B (en) 2022-10-27 2022-10-27 Data exception handling method based on KPCA and mixed similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211321839.6A CN115438035B (en) 2022-10-27 2022-10-27 Data exception handling method based on KPCA and mixed similarity

Publications (2)

Publication Number Publication Date
CN115438035A true CN115438035A (en) 2022-12-06
CN115438035B CN115438035B (en) 2023-04-07

Family

ID=84252560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211321839.6A Active CN115438035B (en) 2022-10-27 2022-10-27 Data exception handling method based on KPCA and mixed similarity

Country Status (1)

Country Link
CN (1) CN115438035B (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140162274A1 (en) * 2012-06-28 2014-06-12 Taxon Biosciences, Inc. Compositions and methods for identifying and comparing members of microbial communities using amplicon sequences
CN104091337A (en) * 2014-07-11 2014-10-08 北京工业大学 Deformation medical image registration method based on PCA and diffeomorphism Demons
CN106709869A (en) * 2016-12-25 2017-05-24 北京工业大学 Dimensionally reduction method based on deep Pearson embedment
CN106886601A (en) * 2017-03-02 2017-06-23 大连理工大学 A Cross-modal Retrieval Algorithm Based on Subspace Hybrid Hypergraph Learning
CN109214503A (en) * 2018-08-01 2019-01-15 华北电力大学 Project of transmitting and converting electricity cost forecasting method based on KPCA-LA-RBM
CN110069467A (en) * 2019-04-16 2019-07-30 沈阳工业大学 System peak load based on Pearson's coefficient and MapReduce parallel computation clusters extraction method
CN111275288A (en) * 2019-12-31 2020-06-12 华电国际电力股份有限公司十里泉发电厂 XGboost-based multi-dimensional data anomaly detection method and device
CN111338897A (en) * 2020-02-24 2020-06-26 京东数字科技控股有限公司 Identification method of abnormal node in application host, monitoring equipment and electronic equipment
US20200293554A1 (en) * 2018-03-15 2020-09-17 Alibaba Group Holding Limited Abnormal sample prediction
CN111931868A (en) * 2020-09-24 2020-11-13 常州微亿智造科技有限公司 Time series data abnormity detection method and device
US20210200746A1 (en) * 2019-12-30 2021-07-01 Royal Bank Of Canada System and method for multivariate anomaly detection
CN113420691A (en) * 2021-06-30 2021-09-21 昆明理工大学 Mixed domain characteristic bearing fault diagnosis method based on Pearson correlation coefficient
CN113901993A (en) * 2021-09-24 2022-01-07 上海海事大学 Fault diagnosis method based on PCCs secondary feature optimization
CN114239807A (en) * 2021-12-17 2022-03-25 山东省计算中心(国家超级计算济南中心) RFE-DAGMM-based high-dimensional data anomaly detection method
WO2022110557A1 (en) * 2020-11-25 2022-06-02 国网湖南省电力有限公司 Method and device for diagnosing user-transformer relationship anomaly in transformer area
CN115150744A (en) * 2022-08-02 2022-10-04 天津城建大学 A method for locating indoor signal interference sources in large conference venues

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
NING ZHANG: "Magnetic Anomaly Detection Method Based on Feature Fusion and Isolation Forest Algorithm", 《IEEE ACCESS ( VOLUME: 10)》 *
李为州: "说话人识别中基于深度信念网络的超向量降维的研究", 《电脑知识与技术》 *
杨英华等: "基于子空间混合相似度的过程监测与故障诊断", 《仪器仪表学报》 *
陈茂: "工业物联网中基于边缘计算的大数据清洗算法的研究", 《CNKI优秀硕士学位论文全文库》 *

Also Published As

Publication number Publication date
CN115438035B (en) 2023-04-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant