CN115438035A - Data exception handling method based on KPCA and mixed similarity - Google Patents


Publication number
CN115438035A
Authority
CN
China
Prior art keywords
data
dimensional data
dimensional
low
kpca
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211321839.6A
Other languages
Chinese (zh)
Other versions
CN115438035B (en)
Inventor
马勇
赵从俊
戴梦轩
贺嘉
李博嘉
何兵兵
唐泳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Normal University
Original Assignee
Jiangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi Normal University filed Critical Jiangxi Normal University
Priority to CN202211321839.6A priority Critical patent/CN115438035B/en
Publication of CN115438035A publication Critical patent/CN115438035A/en
Application granted granted Critical
Publication of CN115438035B publication Critical patent/CN115438035B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a data exception handling method based on KPCA and mixed similarity, comprising the following steps: S1: the terminal generates a task and uploads the task to the edge terminal; S2: the edge terminal receives the task and divides the data related to the task into high-dimensional data and low-dimensional data; S3: the high-dimensional data and the low-dimensional data are processed; S4: the edge terminal uploads the processed data to the cloud terminal. In this way, the method mines data features more completely and detects abnormal and redundant data more accurately, which improves the quality management of the data set and supports safe, stable, high-quality execution of tasks at the cloud and the edge.

Description

Data exception handling method based on KPCA and mixed similarity
Technical Field
The invention relates to the field of big data processing, in particular to a data exception handling method based on KPCA and mixed similarity.
Background
In recent years, traditional industrial control systems have gradually been connected to the internet and to cloud platforms, forming industrial internet platforms. Meanwhile, with the rapid development of the Internet of Things and 5G technology, mobile terminal devices generate massive amounts of data. Conventionally, all data collected by the terminal devices are transmitted over the network to the cloud, where cleaning, mining, and other work are carried out. This puts heavy pressure on network bandwidth, causes long delays, and wastes the computing resources of the cloud computing center. It is therefore necessary to clean the data reasonably at the edge and to upload only the clean data to the cloud for storage and use. The prior art usually covers the detection and cleaning of abnormal values in industrial data, but rarely covers the de-duplication of redundant data.
The Chinese invention patent (application number: 201811519395.0, publication number: CN 109635958A) discloses an intelligent power data anomaly detection method in which effective offline data samples are reduced in dimension and a time-series sample sequence is computed. The method comprises: reducing the dimensionality of the effective offline data samples by principal component analysis (PCA), removing the correlation among the feature dimensions above three dimensions to obtain the reduced offline data samples; and serializing the reduced offline data samples to obtain a time-series sample sequence. The scheme has the following defects: most traditional industrial data are high-dimensional data with strong nonlinearity; the PCA algorithm performs only moderately on nonlinear data; information is poorly preserved after dimensionality reduction; nonlinear features are difficult to obtain; and the accuracy of the data after anomaly detection is low.
The Chinese invention patent (application number: 201911423436.0, publication number: CN 111275288A) discloses a multidimensional data anomaly detection method and device based on XGBoost, comprising: acquiring and cleaning data, standardizing the cleaned data, and unifying the scales of the different data dimensions; extracting features and reducing dimensionality; constructing and training an anomaly detection model, in which the dimension-reduced data are trained with the XGBoost method to build a prediction model of equipment anomalies; and performing online anomaly detection, judging that an anomaly has occurred if a given threshold is exceeded. The defects of this scheme are that only the Pearson correlation coefficient is considered: it works well on data sets with strong linear correlation but poorly on industrial data with strong nonlinear relationships, so redundant data are not detected accurately enough and the de-duplication effect is poor.
Disclosure of Invention
In order to solve the above technical problem or at least partially solve the above technical problem, the present disclosure provides a data exception handling method based on KPCA and hybrid similarity, comprising:
S1: the terminal generates a task and uploads the task to the edge terminal;
S2: the edge terminal receives the task and divides the data related to the task into high-dimensional data and low-dimensional data according to dimension;
S3: the high-dimensional data and the low-dimensional data are processed;
S4: the edge terminal uploads the processed data to the cloud terminal.
Further, the high-dimensional data are data with dimension >= 3;
the low-dimensional data are data with dimension < 3.
Further, processing the high-dimensional data and the low-dimensional data comprises:
S31. performing anomaly detection on the high-dimensional data and the low-dimensional data to obtain a detection result;
S32. cleaning the detection result to obtain a cleaned data set;
S33. judging and processing redundant data in the cleaned data set.
Further, performing anomaly detection on the high-dimensional data and the low-dimensional data to obtain a detection result comprises:
S311. performing anomaly detection on the low-dimensional data with iForest to obtain the path length and the anomaly score corresponding to each low-dimensional datum;
S312. converting the high-dimensional data into feature data with the KPCA algorithm, and performing anomaly detection on the feature data with iForest to obtain the path length and the anomaly score corresponding to each high-dimensional datum.
Further, converting the high-dimensional data into feature data with the KPCA algorithm comprises:
establishing a high-dimensional data mapping database, and recording all original high-dimensional data and the corresponding feature data in the high-dimensional data mapping database.
Further, cleaning the detection result comprises:
S321. obtaining the path lengths and anomaly scores of the high-dimensional data and the low-dimensional data, and calculating an average path length;
S322. taking the data whose average path length is in the range 0 to 0.15 and whose anomaly score is in the range 0.85 to 1 as abnormal values, and cleaning those data.
Further, the cleaning of the detection result is performed separately for the high-dimensional data and the low-dimensional data by the methods described in S31, S32, and S33.
Further, the performing redundant data judgment and processing on the cleaned data set includes:
S331. obtaining the data whose average path lengths and anomaly scores are similar, and regarding the obtained data as redundant data (the symbols denoting these data appear only as images in the original); step S331 is carried out by the above method for both the low-dimensional data and the high-dimensional data, separately and synchronously;
S332. analyzing the data type of the redundant data: if they are low-dimensional redundant data, go to S333; if they are high-dimensional redundant data, go to S334;
S333. obtaining the similarity H_1 of the low-dimensional redundant data with the Pearson correlation coefficient [formula image not reproduced];
S334. obtaining from the high-dimensional data mapping database the original high-dimensional data corresponding to the redundant data, and obtaining the similarity H_2 of the high-dimensional redundant data with a hybrid similarity algorithm [formula image not reproduced], wherein μ is the weight of the Spearman correlation coefficient, and the other two terms are the Spearman correlation coefficient of the data and the mutual information value of the data;
S335. comparing H_1 or H_2 with a preset threshold δ; if H_1 > δ or H_2 > δ, redundant data exist among the data, and data clearing is performed.
Further, μ and the preset threshold δ are set manually; μ ranges from 0 to 1 and is preferably 0.5; δ does not exceed the calculated maximum similarity and is preferably set to 90% of the maximum similarity value.
Further, the data clearing comprises: randomly selecting one of the redundant data and deleting it.
Compared with the prior art, the technical scheme provided by the invention has the following advantages:
The data exception handling method based on KPCA and mixed similarity provided by the invention analyzes the tasks generated by the terminal and uploaded to the edge terminal, divides the data related to the tasks into high-dimensional data and low-dimensional data, processes both, and has the edge terminal upload the processed data to the cloud terminal. Because the dimensionality of industrial data varies widely, the data are divided into high-dimensional data and low-dimensional data; the high-dimensional data are processed with the KPCA algorithm, which reduces the dimensionality of the data set through feature extraction, so that anomaly detection is achieved for both high-dimensional and low-dimensional data. Because the nonlinear features of industrial data are difficult to mine, the method detects redundant data by combining the Pearson correlation coefficient with a hybrid similarity algorithm: to capture the nonlinear features of high-dimensional data and the similarity between high-dimensional data that have a dependency relationship, the similarity of high-dimensional data is calculated by combining the Spearman correlation coefficient with the mutual information value. The method therefore mines data features more completely, and its anomaly detection and de-duplication scheme is more accurate, which improves the quality management of the data set and supports safe, stable, high-quality execution of tasks at the cloud and the edge.
Drawings
Fig. 1 is a flowchart of the data exception handling method based on KPCA and mixed similarity provided by the present invention.
Fig. 2 is a flowchart of the processing of high-dimensional and low-dimensional data in the data exception handling method based on KPCA and mixed similarity provided by the present invention.
Fig. 3 is a flowchart of abnormal data cleaning in the data exception handling method based on KPCA and mixed similarity provided by the present invention.
Detailed Description
The following detailed description of the preferred embodiments of the present invention, taken in conjunction with the accompanying drawings, will make the advantages and features of the present invention more comprehensible to those skilled in the art, and will thus provide a clear and concise definition of the scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced otherwise than as described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
Fig. 1 is a flowchart of a data exception handling method based on KPCA and mixed similarity provided by the present invention, where the method includes:
S1: the terminal generates a task and uploads the task to the edge terminal;
S2: the edge terminal receives the task and divides the data related to the task into high-dimensional data and low-dimensional data according to dimension;
S3: the high-dimensional data and the low-dimensional data are processed;
S4: the edge terminal uploads the processed data to the cloud terminal.
Further, the high-dimensional data are data with dimension >= 3;
the low-dimensional data are data with dimension < 3.
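The split in S2 can be sketched as follows. This is a minimal illustration, assuming each record is a sequence of numeric features whose "dimension" is simply its length; the function name `split_by_dimension` and the sample records are hypothetical and do not come from the patent.

```python
from typing import List, Sequence, Tuple

# From the text: dimension >= 3 is high-dimensional, dimension < 3 is low-dimensional.
DIMENSION_THRESHOLD = 3

Record = Sequence[float]


def split_by_dimension(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Return (high_dimensional, low_dimensional) lists of records.

    ASSUMPTION: the dimension of a record is its number of features.
    """
    high: List[Record] = []
    low: List[Record] = []
    for record in records:
        if len(record) >= DIMENSION_THRESHOLD:
            high.append(record)
        else:
            low.append(record)
    return high, low


# Hypothetical sample data for illustration only.
records = [(1.0, 2.0), (0.5, 0.1, 0.9), (3.2,), (4.0, 1.0, 2.0, 8.0)]
high, low = split_by_dimension(records)
```

Keeping the threshold in one named constant makes it easy to adjust if a deployment defines "high-dimensional" differently.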
Further, referring to Fig. 2, processing the high-dimensional data and the low-dimensional data comprises:
S31. performing anomaly detection on the high-dimensional data and the low-dimensional data to obtain a detection result;
S32. cleaning the detection result to obtain a cleaned data set;
S33. judging and processing redundant data in the cleaned data set.
Further, performing anomaly detection on the high-dimensional data and the low-dimensional data to obtain a detection result comprises:
S311. performing anomaly detection on the low-dimensional data with iForest to obtain the path length and the anomaly score corresponding to each low-dimensional datum;
S312. converting the high-dimensional data into feature data with the KPCA algorithm, and performing anomaly detection on the feature data with iForest to obtain the path length and the anomaly score corresponding to each high-dimensional datum.
Further, the path length is calculated by a formula [formula image not reproduced] whose terms are the path length, the number of samples, and the Euler constant.
The anomaly score is calculated by a second formula [formula image not reproduced] whose terms are the anomaly score, the expected path length, and the harmonic function. The expected path length is the expectation of the path length of the data over all iTrees, and the iForest algorithm outputs a value from 0 to 1.
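The terms named above (number of samples, Euler constant, harmonic function, expected path length over all iTrees, a score between 0 and 1) match the usual Isolation Forest definitions: c(n) = 2H(n-1) - 2(n-1)/n with H(i) ≈ ln(i) + Euler's constant, and s = 2^(-E(h(x))/c(n)). The sketch below implements those standard definitions. It is an assumption that the patent's own formulas take exactly this form, and all function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant


def harmonic(i: float) -> float:
    """Approximate harmonic number H(i) = ln(i) + Euler's constant."""
    return math.log(i) + EULER_GAMMA


def average_path_length(n: int) -> float:
    """c(n): average path length used to normalize depths for n samples.

    ASSUMPTION: standard iForest definition, not confirmed by the source text.
    """
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(expected_path_length: float, n: int) -> float:
    """s = 2 ** (-E(h(x)) / c(n)); short paths give scores close to 1."""
    return 2.0 ** (-expected_path_length / average_path_length(n))


# A point whose expected path length equals c(n) scores exactly 0.5.
n_samples = 256
baseline = anomaly_score(average_path_length(n_samples), n_samples)
```

Under these definitions, points isolated after very few splits score near 1 and are the candidates for cleaning, while points with long paths score well below 0.5.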
Further, the converting the high-dimensional data into the feature data by using the KPCA algorithm includes:
establishing a high-dimensional data mapping database, and recording all original high-dimensional data and corresponding feature data in the high-dimensional data mapping database;
It can be understood that the feature data are obtained by reducing the dimensionality of the original high-dimensional data. During anomaly detection, because the high-dimensional data have nonlinear features, the KPCA algorithm, which handles such features well, is used to obtain the feature data of the high-dimensional data, and it is the feature data that are processed. During the judgment and processing of redundant data in the cleaned data set, the original high-dimensional data are processed instead, in order to preserve the completeness of the high-dimensional information. The high-dimensional data mapping database is built to store both the original high-dimensional data and the feature data, which makes the scheme more flexible and reliable.
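One way to picture the high-dimensional data mapping database is as a store that keeps each original record next to its feature data under a shared key. The sketch below is a minimal in-memory version under that assumption; `MappingDatabase`, `build_mapping`, and the stand-in reducer `take_first_two` are hypothetical names, and the reducer is deliberately NOT KPCA, only a placeholder for whatever reduction is used.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


@dataclass
class MappingDatabase:
    """Keeps each original high-dimensional record next to its feature data."""
    originals: Dict[int, Vector] = field(default_factory=dict)
    features: Dict[int, Vector] = field(default_factory=dict)

    def record(self, record_id: int, original: Vector, feature: Vector) -> None:
        self.originals[record_id] = original
        self.features[record_id] = feature

    def original_for(self, record_id: int) -> Vector:
        """Look up the original record, e.g. for the later redundancy check."""
        return self.originals[record_id]


def build_mapping(records: List[Vector],
                  reduce_fn: Callable[[Vector], Vector]) -> MappingDatabase:
    """Reduce every record with reduce_fn and store both versions."""
    db = MappingDatabase()
    for record_id, original in enumerate(records):
        db.record(record_id, original, reduce_fn(original))
    return db


def take_first_two(vector: Vector) -> Tuple[float, ...]:
    """Stand-in reducer for illustration only; this is not KPCA."""
    return tuple(vector[:2])


db = build_mapping([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0, 7.0)], take_first_two)
```

With this layout, anomaly detection can read `db.features` while the redundancy step retrieves the untouched originals through `db.original_for`.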
Further, the cleaning the detection result includes:
S321. obtaining the path lengths and anomaly scores of the high-dimensional data and the low-dimensional data, and calculating an average path length;
S322. taking the data whose average path length is in the range 0 to 0.15 and whose anomaly score is in the range 0.85 to 1 as abnormal values, and cleaning those data.
In particular, these ranges can be set by a person skilled in the art according to the data characteristics and actual requirements; the values given here are not limiting.
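The rule in S322 can be sketched as a simple filter. The example assumes the average path length has already been scaled to the interval 0 to 1 (the text's range of 0 to 0.15 suggests a normalized value, but the source does not say how it is normalized); the `Scored` type and the sample values are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Scored(NamedTuple):
    record_id: int
    avg_path_length: float  # ASSUMPTION: already normalized to [0, 1]
    anomaly_score: float    # iForest anomaly score in [0, 1]


# Ranges from S322; the text notes they may be tuned per data set.
PATH_RANGE = (0.0, 0.15)
SCORE_RANGE = (0.85, 1.0)


def is_abnormal(item: Scored) -> bool:
    """Both conditions must hold: short average path AND high anomaly score."""
    short_path = PATH_RANGE[0] <= item.avg_path_length <= PATH_RANGE[1]
    high_score = SCORE_RANGE[0] <= item.anomaly_score <= SCORE_RANGE[1]
    return short_path and high_score


def clean(items: List[Scored]) -> Tuple[List[Scored], List[Scored]]:
    """Split items into (kept, removed) according to the S322 rule."""
    kept = [item for item in items if not is_abnormal(item)]
    removed = [item for item in items if is_abnormal(item)]
    return kept, removed


# Hypothetical scores for illustration only.
items = [Scored(1, 0.10, 0.92), Scored(2, 0.60, 0.45), Scored(3, 0.12, 0.70)]
kept, removed = clean(items)
```

Record 3 is kept in this example because only one of the two conditions holds, which reflects the "and" in S322.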
Further, the cleaning of the detection result is carried out on the high-dimensional data and the low-dimensional data by the methods of S31, S32, and S33 above, separately and synchronously. The high-dimensional data are processed in groups of the same dimension: for example, if the dimensions of the high-dimensional data are N_i (i = 0, 1, …, N), the N_i-dimensional data of each dimension are obtained and processed by the above method, which is not repeated here.
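Grouping the high-dimensional data so that each batch contains records of one dimension could look like the sketch below. It again assumes that a record's dimension is its length; `group_by_dimension` is an illustrative name, and the per-group processing itself is left out.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Record = Sequence[float]


def group_by_dimension(records: List[Record]) -> Dict[int, List[Record]]:
    """Group records so that each group holds data of a single dimension."""
    groups: Dict[int, List[Record]] = defaultdict(list)
    for record in records:
        groups[len(record)].append(record)
    return dict(groups)


# Hypothetical high-dimensional records of mixed dimension.
high_dimensional = [(1, 2, 3), (4, 5, 6, 7), (7, 8, 9), (1, 1, 1, 1)]
groups = group_by_dimension(high_dimensional)
```

Each value of `groups` can then be passed through detection, cleaning, and redundancy handling independently of the others.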
Further, referring to fig. 3, the determining and processing redundant data for the cleaned data set includes:
S331. obtaining the data whose average path lengths and anomaly scores are similar, and regarding the obtained data as redundant data (the symbols denoting these data appear only as images in the original); step S331 is carried out by the above method for both the low-dimensional data and the high-dimensional data, separately and synchronously;
S332. analyzing the data type of the redundant data: if they are low-dimensional redundant data, go to S333; if they are high-dimensional redundant data, go to S334;
S333. obtaining the similarity H_1 of the low-dimensional redundant data with the Pearson correlation coefficient [formula image not reproduced];
S334. obtaining from the high-dimensional data mapping database the original high-dimensional data corresponding to the redundant data, and obtaining the similarity H_2 of the high-dimensional redundant data with a hybrid similarity algorithm [formula image not reproduced], wherein μ is the weight of the Spearman correlation coefficient, and the other two terms are the Spearman correlation coefficient of the data and the mutual information value of the data. The mutual information value is defined by a further formula [formula image not reproduced] in terms of the joint probability of the data and additional probability terms whose definitions appear only as images in the original; the base of the logarithm is usually taken as e.
For example, for the two data [0, 0, 1] and [1, 1, 0], the mutual information value is the sum of two terms and equals 0.6365 (the intermediate probabilities and the two terms appear only as images in the original).
In this scheme, mutual information measures the degree of mutual dependence between two data; the larger the mutual information value, the greater the dependence between the two data;
S335. comparing H_1 or H_2 with a preset threshold δ; if H_1 > δ or H_2 > δ, redundant data exist among the data, and data clearing is performed.
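The similarity measures named in S333 and S334 can be sketched in plain Python. Pearson, Spearman (Pearson on average ranks), and discrete mutual information with natural logarithm are standard definitions; the mutual information function reproduces the 0.6365 value of the example in S334. How the patent combines the Spearman coefficient and the mutual information into H_2 is not recoverable from the text, so `hybrid_similarity` ASSUMES a weighted sum with weight μ. That form and all function names are illustrative only.

```python
import math
from collections import Counter
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient (used for H_1 on low-dimensional data)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def mutual_information(x: Sequence, y: Sequence) -> float:
    """Discrete mutual information with natural logarithm (base e)."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in joint.items())


def hybrid_similarity(x: Sequence, y: Sequence, mu: float = 0.5) -> float:
    """ASSUMED form of H_2: mu * Spearman + (1 - mu) * mutual information."""
    return mu * spearman(x, y) + (1.0 - mu) * mutual_information(x, y)


# The example from S334: two data [0, 0, 1] and [1, 1, 0].
x, y = [0, 0, 1], [1, 1, 0]
mi = mutual_information(x, y)
```

For this pair the two vectors are perfectly anti-correlated in rank, yet their mutual information is positive, which illustrates why a dependence measure is combined with a correlation coefficient for data with nonlinear relationships.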
Further, the values of μ and the preset threshold δ can be determined according to the situation; μ is preferably 0.5.
In particular, δ can be set by a person skilled in the art according to the data characteristics and actual requirements; preferably, the manually set fixed threshold is 90% of the current upper limit of the similarity. The values given here serve as a reference and are not limiting.
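The preferred threshold choice, δ equal to 90% of the largest similarity observed, together with the comparison in S335, can be sketched as below. The pair labels and similarity values are hypothetical, and the sketch assumes the similarities have already been computed for each candidate pair.

```python
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]


def threshold_from_similarities(similarities: Iterable[float],
                                fraction: float = 0.9) -> float:
    """delta = fraction * maximum similarity (0.9 is the preferred value)."""
    return fraction * max(similarities)


def redundant_pairs(pair_similarities: Dict[Pair, float],
                    delta: float) -> List[Pair]:
    """Return the pairs whose similarity exceeds delta (the S335 comparison)."""
    return [pair for pair, h in pair_similarities.items() if h > delta]


# Hypothetical precomputed similarities for three candidate pairs.
pair_similarities = {("a", "b"): 0.98, ("a", "c"): 0.40, ("b", "c"): 0.91}
delta = threshold_from_similarities(pair_similarities.values())
flagged = redundant_pairs(pair_similarities, delta)
```

Because δ is derived from the maximum, the most similar pair is always flagged; a fixed absolute threshold may be preferable when a data set is expected to contain no redundancy at all.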
Further, the data clearing comprises: randomly selecting one of the redundant data and deleting it.
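The clearing step can be sketched as removing one randomly chosen member of a redundant group from the data set. The identifiers, the data set, and the function name `drop_one_at_random` are hypothetical; a seeded generator is used only so that the illustration is repeatable.

```python
import random
from typing import Dict, Sequence, Set, Tuple


def drop_one_at_random(redundant_ids: Set[str],
                       dataset: Dict[str, Sequence[float]],
                       rng: random.Random) -> Tuple[str, Dict[str, Sequence[float]]]:
    """Delete one randomly selected redundant record; return (deleted_id, remaining)."""
    victim = rng.choice(sorted(redundant_ids))  # sort for a stable choice order
    remaining = {key: value for key, value in dataset.items() if key != victim}
    return victim, remaining


# Hypothetical data set in which records "a" and "b" were judged redundant.
dataset = {"a": (0.1, 0.2, 0.3), "b": (0.1, 0.2, 0.31), "c": (5.0, 1.0, 2.0)}
rng = random.Random(0)
victim, remaining = drop_one_at_random({"a", "b"}, dataset, rng)
```

The function returns a new dictionary instead of mutating the input, so the original data set stays available if the deletion needs to be audited.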
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It is noted that, in this document, relational terms are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.

Claims (9)

1. A data exception handling method based on KPCA and mixed similarity, characterized by comprising the following steps:
S1: the terminal generates a task and uploads the task to the edge terminal;
S2: the edge terminal receives the task and divides the data related to the task into high-dimensional data and low-dimensional data;
S3: the high-dimensional data and the low-dimensional data are processed;
S4: the edge terminal uploads the processed data to the cloud terminal.
2. The data exception handling method based on KPCA and mixed similarity according to claim 1, wherein
the high-dimensional data are data with dimension >= 3;
the low-dimensional data are data with dimension < 3.
3. The data exception handling method based on KPCA and mixed similarity according to claim 1, wherein
processing the high-dimensional data and the low-dimensional data comprises:
S31. performing anomaly detection on the high-dimensional data and the low-dimensional data to obtain a detection result;
S32. cleaning the detection result to obtain a cleaned data set;
S33. judging and processing redundant data in the cleaned data set.
4. The data exception handling method based on KPCA and mixed similarity according to claim 3, wherein
performing anomaly detection on the high-dimensional data and the low-dimensional data to obtain a detection result comprises:
S311. performing anomaly detection on the low-dimensional data with iForest to obtain the path length and the anomaly score corresponding to each low-dimensional datum;
S312. converting the high-dimensional data into feature data with the KPCA algorithm, and then performing anomaly detection on the feature data with iForest to obtain the path length and the anomaly score corresponding to each high-dimensional datum.
5. The data exception handling method based on KPCA and mixed similarity according to claim 4, wherein
converting the high-dimensional data into feature data with the KPCA algorithm comprises:
establishing a high-dimensional data mapping database, and recording all original high-dimensional data and the corresponding feature data in the high-dimensional data mapping database.
6. The data exception handling method based on KPCA and mixed similarity according to claim 5, wherein
cleaning the detection result comprises:
S321. obtaining the path lengths and anomaly scores of the high-dimensional data and the low-dimensional data, and calculating an average path length;
S322. taking the data whose average path length is in the range 0 to 0.15 and whose anomaly score is in the range 0.85 to 1 as abnormal values, and cleaning those data.
7. The data exception handling method based on KPCA and mixed similarity according to any one of claims 4-6, wherein
in cleaning the detection result, the high-dimensional data and the low-dimensional data are processed separately by the methods of S31, S32, and S33, each being processed by selecting data with the same dimension.
8. A KPCA and mixed similarity based data exception handling method according to claim 6,
the redundant data judgment and processing of the cleaned data set comprises the following steps:
s331, obtaining the data with the average path length similar to the abnormal score, and assuming the obtained data as
Figure 408635DEST_PATH_IMAGE001
Then will be
Figure 847838DEST_PATH_IMAGE002
Considered as redundant data;in the step S331, the low-dimensional data and the high-dimensional data are both performed by the above method, and are separately and synchronously performed;
s332. Analysis
Figure 297273DEST_PATH_IMAGE003
Data type of (1) if
Figure 214545DEST_PATH_IMAGE001
For low-dimensional redundant data, go to S333, if
Figure 105141DEST_PATH_IMAGE001
Turning to S334 for high-dimensional redundant data;
s333, acquiring similarity H of the low-dimensional redundant data by adopting Pearson correlation coefficient 1 (ii) a The formula is as follows:
H 1 =corr
Figure 964512DEST_PATH_IMAGE003
s334, obtaining the high-dimensional data mapping database
Figure 386397DEST_PATH_IMAGE001
Corresponding original high dimensional data
Figure 700835DEST_PATH_IMAGE004
Obtaining the similarity H of the high-dimensional redundant data by adopting a hybrid similarity algorithm 2 (ii) a The formula is as follows:
Figure 773833DEST_PATH_IMAGE005
wherein mu is the weight occupied by the spearman correlation coefficient,
Figure 820418DEST_PATH_IMAGE006
is composed of
Figure 713288DEST_PATH_IMAGE007
The spearman correlation coefficient of the data,
Figure 893733DEST_PATH_IMAGE008
is composed of
Figure 571970DEST_PATH_IMAGE007
A mutual information value of;
s335. The H is processed 1 Or H 2 Comparing with a predetermined threshold value delta if H 1 >Delta or H 2 >δ, then represents
Figure 773145DEST_PATH_IMAGE009
Redundant data exists in the data, and data clearing is carried out.
9. The data exception handling method based on KPCA and mixed similarity according to claim 8,
wherein μ and the preset threshold δ are set manually, μ ranges from 0 to 1, and δ does not exceed the maximum value of the calculated similarity.
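The parameter constraints in claim 9 could be checked before thresholding, as in the following minimal sketch. The function name `validate_parameters` and the choice to raise `ValueError` are hypothetical; the patent only states the permitted ranges.

```python
def validate_parameters(mu, delta, similarities):
    """Check the manually chosen mu and delta against the ranges in claim 9.

    similarities: the similarity values (H_1 or H_2) already computed.
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in the range 0 to 1")
    max_similarity = max(similarities)
    if delta > max_similarity:
        # A threshold above every computed similarity could never flag redundancy.
        raise ValueError("delta must not exceed the maximum computed similarity")
    return mu, delta
```

For example, `validate_parameters(0.3, 0.8, [0.2, 0.9])` passes, while a δ of 0.95 with the same similarities is rejected.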
CN202211321839.6A 2022-10-27 2022-10-27 Data exception handling method based on KPCA and mixed similarity Active CN115438035B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211321839.6A CN115438035B (en) 2022-10-27 2022-10-27 Data exception handling method based on KPCA and mixed similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211321839.6A CN115438035B (en) 2022-10-27 2022-10-27 Data exception handling method based on KPCA and mixed similarity

Publications (2)

Publication Number Publication Date
CN115438035A (en) 2022-12-06
CN115438035B (en) 2023-04-07

Family

ID=84252560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211321839.6A Active CN115438035B (en) 2022-10-27 2022-10-27 Data exception handling method based on KPCA and mixed similarity

Country Status (1)

Country Link
CN (1) CN115438035B (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140162274A1 (en) * 2012-06-28 2014-06-12 Taxon Biosciences, Inc. Compositions and methods for identifying and comparing members of microbial communities using amplicon sequences
CN104091337A (en) * 2014-07-11 2014-10-08 北京工业大学 Deformation medical image registration method based on PCA and diffeomorphism Demons
CN106709869A (en) * 2016-12-25 2017-05-24 北京工业大学 Dimensionally reduction method based on deep Pearson embedment
CN106886601A (en) * 2017-03-02 2017-06-23 大连理工大学 A kind of Cross-modality searching algorithm based on the study of subspace vehicle mixing
CN109214503A (en) * 2018-08-01 2019-01-15 华北电力大学 Project of transmitting and converting electricity cost forecasting method based on KPCA-LA-RBM
CN110069467A (en) * 2019-04-16 2019-07-30 沈阳工业大学 System peak load based on Pearson's coefficient and MapReduce parallel computation clusters extraction method
CN111275288A (en) * 2019-12-31 2020-06-12 华电国际电力股份有限公司十里泉发电厂 XGboost-based multi-dimensional data anomaly detection method and device
CN111338897A (en) * 2020-02-24 2020-06-26 京东数字科技控股有限公司 Identification method of abnormal node in application host, monitoring equipment and electronic equipment
US20200293554A1 (en) * 2018-03-15 2020-09-17 Alibaba Group Holding Limited Abnormal sample prediction
CN111931868A (en) * 2020-09-24 2020-11-13 常州微亿智造科技有限公司 Time series data abnormity detection method and device
US20210200746A1 (en) * 2019-12-30 2021-07-01 Royal Bank Of Canada System and method for multivariate anomaly detection
CN113420691A (en) * 2021-06-30 2021-09-21 昆明理工大学 Mixed domain characteristic bearing fault diagnosis method based on Pearson correlation coefficient
CN113901993A (en) * 2021-09-24 2022-01-07 上海海事大学 Fault diagnosis method based on PCCs secondary feature optimization
CN114239807A (en) * 2021-12-17 2022-03-25 山东省计算中心(国家超级计算济南中心) RFE-DAGMM-based high-dimensional data anomaly detection method
WO2022110557A1 (en) * 2020-11-25 2022-06-02 国网湖南省电力有限公司 Method and device for diagnosing user-transformer relationship anomaly in transformer area
CN115150744A (en) * 2022-08-02 2022-10-04 天津城建大学 Indoor signal interference source positioning method for large conference venue

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
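NING ZHANG: "Magnetic Anomaly Detection Method Based on Feature Fusion and Isolation Forest Algorithm", IEEE Access (Volume 10) *
李为州 (Li Weizhou): "Research on supervector dimensionality reduction based on deep belief networks in speaker recognition", 《电脑知识与技术》 (Computer Knowledge and Technology) *
杨英华 (Yang Yinghua) et al.: "Process monitoring and fault diagnosis based on subspace mixed similarity", 《仪器仪表学报》 (Chinese Journal of Scientific Instrument) *
陈茂 (Chen Mao): "Research on big data cleaning algorithms based on edge computing in the industrial Internet of Things", CNKI Outstanding Master's Theses Full-text Database *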

Also Published As

Publication number Publication date
CN115438035B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN107563426B (en) Method for learning locomotive running time sequence characteristics
WO2023134086A1 (en) Convolutional neural network model pruning method and apparatus, and electronic device and storage medium
CN111414571A (en) Atmospheric pollutant monitoring method
CN112381790A (en) Abnormal image detection method based on depth self-coding
CN110750412B (en) Log abnormity detection method
CN116148656B (en) Portable analog breaker fault detection method
CN115130519B (en) Hull structure fault prediction method using convolutional neural network
CN113704201A (en) Log anomaly detection method and device and server
CN115452376A (en) Bearing fault diagnosis method based on improved lightweight deep convolution neural network
CN114594398A (en) Energy storage lithium ion battery data preprocessing method
CN115438035B (en) Data exception handling method based on KPCA and mixed similarity
CN110990383A (en) Similarity calculation method based on industrial big data set
CN116881798A (en) Conditional Granger causality analysis method based on variable selection and reverse time-lag feature selection for complex systems such as weather
CN117675230A (en) Knowledge-graph-based oil well data integrity identification method
CN112950566B (en) Windshield damage fault detection method
CN114756742A (en) Information pushing method and device and storage medium
CN114155410A (en) Graph pooling, classification model training and reconstruction model training method and device
CN113240213A (en) Method, device and equipment for selecting people based on neural network and tree model
CN110321366B (en) Statistical quantity determining method and system based on online learning
CN115827982A (en) Big data information acquisition system based on computer
US20240272976A1 (en) Abnormality detection device, abnormality detection method, and abnormality detection program
CN113887718B (en) Channel pruning method and device based on relative activation rate and lightweight flow characteristic extraction network model simplification method
CN110728615B (en) Steganalysis method based on sequential hypothesis testing, terminal device and storage medium
CN113878613B (en) Industrial robot harmonic reducer early fault detection method based on WLCTD and OMA-VMD
CN115204671A (en) Big data-based annual newspaper analysis system for listed companies

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant