CN114443738A - Abnormal data mining method, device, equipment and medium - Google Patents

Abnormal data mining method, device, equipment and medium

Info

Publication number
CN114443738A
Authority
CN
China
Prior art keywords
data
detected
abnormal
outlier
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210264250.0A
Other languages
Chinese (zh)
Inventor
黄建平
陈浩
李钟煦
沈思琪
张建松
王艺丹
彭梁英
潘司晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Zhejiang Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
State Grid Zhejiang Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Zhejiang Electric Power Co Ltd and Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd
Priority to CN202210264250.0A
Publication of CN114443738A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Fuzzy Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an abnormal data mining method, device, equipment and medium. The method comprises: acquiring first data to be detected, metadata of the first data to be detected, and log data related to the first data to be detected, and removing invalid data that meets a preset invalid condition from the log data to obtain target log data; determining first abnormal data of the first data to be detected, and removing the first abnormal data from the first data to be detected to obtain second data to be detected; clustering the second data to be detected to obtain a plurality of clusters, calculating the central data of each cluster, and determining outlier candidate data based on a first distance between the central data and the second data to be detected in the corresponding cluster; and determining the outlier candidate data that meets a preset abnormal condition as second abnormal data, and then taking the first abnormal data and the second abnormal data as target abnormal data. Abnormal data mining with high adaptability, low cost, and high efficiency is thereby achieved.

Description

Abnormal data mining method, device, equipment and medium
Technical Field
The invention relates to the technical field of data quality management, in particular to an abnormal data mining method, device, equipment and medium.
Background
Data quality management is an important part of data management, and the level of data quality governance strongly affects the usability of massive data assets. At present, most conventional approaches to abnormal data mining are driven by business specification requirements: rules for data quality audits are set, abnormal data warning thresholds are plugged into those rules, and regular online data quality monitoring tasks are established. These approaches are well targeted and accurate, and they work well for data quality problems in key specialties. However, because they depend on business specification requirements, they suffer from poor adaptability, high labor cost, and fixed thresholds that must be adjusted frequently.
In summary, how to implement adaptive, low-cost and efficient abnormal data mining is a problem to be solved in the field.
Disclosure of Invention
In view of the above, the present invention provides an abnormal data mining method, apparatus, device and medium, which can achieve high adaptability, low cost and high efficiency of abnormal data mining. The specific scheme is as follows:
In a first aspect, the present application discloses an abnormal data mining method, including:
collecting first data to be detected, metadata of the first data to be detected, and log data related to the first data to be detected, and removing invalid data that meets a preset invalid condition from the log data to obtain target log data;
determining first abnormal data of the first data to be detected based on the first data to be detected, the metadata, and the target log data, and removing the first abnormal data from the first data to be detected to obtain second data to be detected;
clustering the second data to be detected based on a preset clustering algorithm to obtain a plurality of clusters, calculating the central data of each cluster, and determining outlier candidate data based on a first distance between the central data and the second data to be detected in the corresponding cluster;
calculating an outlier factor value of the outlier candidate data, determining the outlier candidate data whose outlier factor value meets a preset abnormal condition as second abnormal data, and then taking the first abnormal data and the second abnormal data as target abnormal data.
Optionally, removing invalid data that meets a preset invalid condition from the log data to obtain target log data includes:
parsing the log data to obtain individual sub-data items;
and taking the sub-data that meets the preset invalid condition as invalid data, and removing the invalid data from the log data to obtain the target log data.
Optionally, the determining first abnormal data of the first data to be detected based on the first data to be detected, the metadata, and the target log data includes:
determining first abnormal data of the first data to be detected by using a preset index rule based on the first data to be detected, the metadata and the target log data; the preset index rules comprise statistical indexes and business rules.
Optionally, before clustering the second data to be detected based on the preset clustering algorithm, the method further includes:
acquiring data tag information corresponding to the second data to be detected in a data center, and identifying a field object of the second data to be detected based on the data tag information and the metadata so as to determine a data type corresponding to the second data to be detected;
and extracting field features of the second data to be detected to obtain a data feature vector so as to cluster the second data to be detected based on the data type, the data feature vector and a preset clustering algorithm.
Optionally, calculating the central data of each cluster and then determining outlier candidate data based on a first distance between the central data and the second data to be detected in the corresponding cluster includes:
calculating the central data of each cluster and the target radius of each cluster, and judging whether the first distance between each item of second data to be detected in the cluster and the central data is not smaller than the target radius;
and when the first distance between an item of second data to be detected in the cluster and the central data is not smaller than the target radius, determining that item of second data to be detected as outlier candidate data.
Optionally, calculating the central data of any one of the clusters and the target radius of the cluster includes:
calculating central data of the cluster;
and determining the average distance between all the second data to be detected in the cluster and the central data, and taking the average distance as the target radius of the cluster.
Optionally, the calculating an outlier factor value of the outlier candidate data, and determining the outlier candidate data whose outlier factor value satisfies a preset abnormal condition as second abnormal data includes:
calculating the reachable distance of the outlier candidate data, and determining local reachable density based on the reachable distance;
calculating an outlier factor value by using the local reachable density, and determining the outlier candidate data with the outlier factor value meeting a preset abnormal condition as second abnormal data.
In a second aspect, the present application discloses an abnormal data mining apparatus, including:
a data acquisition module, configured to collect first data to be detected, metadata of the first data to be detected, and log data related to the first data to be detected, and to remove invalid data that meets a preset invalid condition from the log data to obtain target log data;
a data removing module, configured to determine first abnormal data of the first data to be detected based on the first data to be detected, the metadata, and the target log data, and to remove the first abnormal data from the first data to be detected to obtain second data to be detected;
a candidate data determining module, configured to cluster the second data to be detected based on a preset clustering algorithm to obtain a plurality of clusters, calculate the central data of each cluster, and then determine outlier candidate data based on a first distance between the central data and the second data to be detected in the corresponding cluster;
and an abnormal data determining module, configured to calculate an outlier factor value of the outlier candidate data, determine the outlier candidate data whose outlier factor value meets a preset abnormal condition as second abnormal data, and then take the first abnormal data and the second abnormal data as target abnormal data.
In a third aspect, the present application discloses an electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the abnormal data mining method disclosed in the foregoing.
In a fourth aspect, the present application discloses a computer readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the abnormal data mining method disclosed in the foregoing.
According to the present application, first data to be detected, metadata of the first data to be detected, and log data related to the first data to be detected are collected, and invalid data that meets a preset invalid condition is removed from the log data to obtain target log data; first abnormal data of the first data to be detected is determined based on the first data to be detected, the metadata, and the target log data, and is removed from the first data to be detected to obtain second data to be detected; the second data to be detected is clustered based on a preset clustering algorithm to obtain a plurality of clusters, the central data of each cluster is calculated, and outlier candidate data is determined based on a first distance between the central data and the second data to be detected in the corresponding cluster; an outlier factor value of the outlier candidate data is calculated, the outlier candidate data whose outlier factor value meets a preset abnormal condition is determined as second abnormal data, and the first abnormal data and the second abnormal data are then taken as target abnormal data.
It can be seen that removing invalid data from the log data retains only the frequently accessed (high-heat) target data, so the subsequent computation is meaningful and more reasonable. Determining the first abnormal data based on the first data to be detected, the metadata, and the target log data, and removing it to obtain the second data to be detected, reduces the subsequent amount of computation. The second data to be detected is then clustered with a preset clustering algorithm, and outlier candidate data is determined for each cluster based on the first distance between the central data and the second data to be detected in that cluster. This completes data pruning, improves computational efficiency and the accuracy of the result, and allows the second abnormal data to be determined more accurately from the candidates. Because determining the second abnormal data does not depend on business rules or manual effort, the influence of computation cost and business rules on abnormal data mining is reduced and adaptability is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of an anomalous data mining method as disclosed herein;
FIG. 2 is a flow chart of a particular method of anomalous data mining disclosed herein;
FIG. 3 is a schematic diagram of an environmental deployment of a particular abnormal data mining method disclosed herein;
FIG. 4 is a flow chart of a particular method of anomalous data mining disclosed herein;
FIG. 5 is a flow chart of a particular anomalous data mining method disclosed herein;
FIG. 6 is a flow chart of a particular method of anomalous data mining disclosed herein;
FIG. 7 is a schematic structural diagram of an abnormal data mining device according to the present disclosure;
FIG. 8 is a block diagram of an electronic device disclosed in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As noted in the Background, conventional abnormal data mining depends on business specification requirements and therefore suffers from poor adaptability, high labor cost, and fixed thresholds that must be adjusted frequently.
The present application accordingly provides an abnormal data mining scheme that achieves high adaptability, low cost, and high efficiency.
Referring to fig. 1, an embodiment of the present application discloses an abnormal data mining method, including:
Step S11: collecting first data to be detected, metadata of the first data to be detected, and log data related to the first data to be detected, and removing invalid data that meets a preset invalid condition from the log data to obtain target log data.
In this embodiment, removing invalid data that meets a preset invalid condition from the log data to obtain target log data includes: parsing the log data to obtain individual sub-data items; and taking the sub-data that meets the preset invalid condition as invalid data and removing it from the log data to obtain the target log data. It can be understood that step S11 prepares for the subsequent processing of the first data to be detected: the metadata of the first data to be detected is collected, the log data of the first data to be detected is collected and parsed into sub-data, and invalid data, such as empty tables, process tables, and log tables, is removed automatically so that the frequently accessed (high-heat) data tables are retained.
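The parsing and filtering in step S11 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the log line layout (`table_name,row_count,access_count`), the `TableLogEntry` type, and the name prefixes used to recognize process and log tables are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class TableLogEntry:
    """One parsed sub-data item of the log (hypothetical layout)."""
    table_name: str
    row_count: int
    access_count: int


# Hypothetical naming convention for process (temporary) tables and log tables.
INVALID_PREFIXES = ("tmp_", "log_")


def parse_log(lines):
    """Parse 'table_name,row_count,access_count' lines into entries."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, rows, hits = line.split(",")
        entries.append(TableLogEntry(name.strip(), int(rows), int(hits)))
    return entries


def is_invalid(entry):
    """Assumed invalid condition: empty table, process table, or log table."""
    if entry.row_count == 0:
        return True
    return entry.table_name.startswith(INVALID_PREFIXES)


def filter_target_log(entries):
    """Remove invalid entries; what remains is the target log data."""
    return [e for e in entries if not is_invalid(e)]


raw_log = [
    "orders,120000,845",
    "tmp_orders_stage,120000,3",
    "log_etl_run,5000,12",
    "customers_empty,0,0",
    "meters,98000,410",
]
target_log = filter_target_log(parse_log(raw_log))
target_names = [e.table_name for e in target_log]
```

In a real deployment the invalid condition would come from configuration rather than hard-coded prefixes; the point here is only that the filter runs on parsed sub-data before any detection step.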
Step S12: determining first abnormal data of the first data to be detected based on the first data to be detected, the metadata and the target log data, and eliminating the first abnormal data from the first data to be detected to obtain second data to be detected.
In this embodiment, determining first abnormal data of the first data to be detected based on the first data to be detected, the metadata, and the target log data includes: determining the first abnormal data of the first data to be detected by using preset index rules based on the first data to be detected, the metadata, and the target log data, where the preset index rules comprise statistical indexes and business rules. For example, data with an abnormal proportion is identified through the preset index rules and removed from the first data to be detected to obtain the second data to be detected.
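As a rough sketch of step S12, the example below combines one statistical index with one business rule and splits the input into first abnormal data and second data to be detected. The concrete rules are assumptions chosen for illustration (a z-score threshold of 3 and a "values must be non-negative" rule); the application does not specify which indexes or rules are used.

```python
import statistics


def statistical_rule(values, z_threshold=3.0):
    """Indices whose z-score exceeds the threshold (assumed statistical index)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return set()
    return {i for i, v in enumerate(values) if abs(v - mean) / std > z_threshold}


def business_rule(values, lower=0.0):
    """Indices violating an assumed business rule: values must not be below `lower`."""
    return {i for i, v in enumerate(values) if v < lower}


def split_first_abnormal(values):
    """Return (first abnormal data, second data to be detected)."""
    abnormal_idx = statistical_rule(values) | business_rule(values)
    first_abnormal = [v for i, v in enumerate(values) if i in abnormal_idx]
    second_to_detect = [v for i, v in enumerate(values) if i not in abnormal_idx]
    return first_abnormal, second_to_detect


# Twenty ordinary readings plus one extreme value and one negative value.
readings = [10, 11, 9, 10, 12] * 4 + [500, -5]
first_abnormal, second_data = split_first_abnormal(readings)
```

Here the extreme reading is caught by the statistical index and the negative reading by the business rule, so only the twenty ordinary readings move on to clustering.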
Step S13: clustering the second data to be detected based on a preset clustering algorithm to obtain a plurality of clusters, calculating the central data of each cluster, and determining outlier candidate data based on a first distance between the central data and the second data to be detected in the corresponding cluster.
In this embodiment, before the second data to be detected is clustered, the data type corresponding to the second data to be detected can be determined and the field features of the second data to be detected can be extracted to obtain a data feature vector, so that the preset clustering algorithm clusters the second data to be detected based on the data type and the data feature vector, which improves the accuracy of the clustering result. Determining outlier candidate data in the second data to be detected amounts to pruning the data that is very likely to be correct. This effectively improves the efficiency of the subsequent outlier factor calculation and greatly reduces the probability of misjudgment after pruning.
Step S14: calculating an outlier factor value of the outlier candidate data, determining the outlier candidate data with the outlier factor value meeting a preset abnormal condition as second abnormal data, and then taking the first abnormal data and the second abnormal data as target abnormal data.
In this embodiment, the reachable distance (reachability distance) of the outlier candidate data is calculated first, the local reachable density is determined based on the reachable distance, and the outlier factor value is then calculated from the local reachable density. For example, a Cluster-Based Local Outlier Factor (CLOF) algorithm based on cluster pruning is used to calculate the outlier factor value of the outlier candidate data, and the outlier candidate data whose outlier factor value is greater than 1 is taken as the second abnormal data.
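The chain from reachable distance to local reachable density to outlier factor can be sketched with the textbook Local Outlier Factor definitions. This is a simplified illustration under stated assumptions, not the CLOF implementation of the application: the neighbourhood size `k`, the Euclidean metric, and the choice to take exactly `k` neighbours (ignoring distance ties) are all assumptions, and duplicate points would need extra handling to avoid division by zero.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def k_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour."""
    return math.dist(points[i], points[k_nearest(points, i, k)[-1]])


def reach_dist(points, i, j, k):
    """Reachable distance of points[i] with respect to points[j]."""
    return max(k_distance(points, j, k), math.dist(points[i], points[j]))


def local_reach_density(points, i, k):
    """Inverse of the mean reachable distance to the k nearest neighbours."""
    neighbours = k_nearest(points, i, k)
    total = sum(reach_dist(points, i, j, k) for j in neighbours)
    return len(neighbours) / total


def outlier_factor(points, i, k):
    """Mean neighbour density divided by the point's own density."""
    neighbours = k_nearest(points, i, k)
    own = local_reach_density(points, i, k)
    return sum(local_reach_density(points, j, k) for j in neighbours) / (len(neighbours) * own)


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
candidates = [0, 4]  # indices assumed to have survived cluster pruning
second_abnormal = [points[i] for i in candidates if outlier_factor(points, i, 2) > 1]
```

Only the pruned candidates are scored, which is where the efficiency gain of cluster pruning comes from: the factor is never computed for points that sit well inside their cluster.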
Referring to fig. 2, an embodiment of the present application discloses a specific abnormal data mining method, including:
Step S21: collecting first data to be detected, metadata of the first data to be detected, and log data related to the first data to be detected, and removing invalid data that meets a preset invalid condition from the log data to obtain target log data.
It can be understood that abnormal data mining runs in a corresponding environment. When abnormal data mining is performed on the large-scale data of a data center, efficiency, resource load, and similar issues need to be considered together, and corresponding software and hardware resources need to be provisioned to keep the mining model running normally. Taking the Alibaba Cloud platform as an example, the related requirements are shown in Table 1:
Table 1
No.   Resource type   Configuration requirement   Use
1     ECS             16C-64G                     Algorithm server
2     RDS             500G                        Database server
3     OSS             500G                        Unstructured data storage
4     Redis           -                           Caching
In terms of the data processing pipeline and the deployment environment, the overall environment is deployed and verified on the cloud platform and the Alibaba Cloud data platform. Redis (Remote Dictionary Server) is used as the caching store, unstructured data is stored in the Object Storage Service (OSS), detailed data is stored in the Relational Database Service (RDS), and data is scheduled and detected by the algorithm environment deployed on the Elastic Compute Service (ECS). The execution speed of the algorithm depends on the configuration level of the ECS, and concurrent tasks are used during execution to ensure efficiency. During data quality anomaly mining, the security of data acquisition, transmission, processing, and storage is ensured by firewalls and data desensitization. The specific architecture is shown in fig. 3. Compared with the traditional approach, applying more of the data platform's computing resources shortens the detection time from days to hours, reduces the manual workload of modifying quality problem mining rules, and lowers the cost of data quality work.
Step S22: determining first abnormal data of the first data to be detected based on the first data to be detected, the metadata and the target log data, and eliminating the first abnormal data from the first data to be detected to obtain second data to be detected.
Step S23: and acquiring data tag information corresponding to the second data to be detected in the data center, and identifying the field object of the second data to be detected based on the data tag information and the metadata so as to determine the data type corresponding to the second data to be detected.
In this embodiment, data tag information corresponding to the second data to be detected in the data center is obtained; for example, the tag information includes a library name, a business domain, a purchase plan, and an update frequency. The field object of the second data to be detected is then identified to determine the data type corresponding to the second data to be detected, where the data type can be one of four types: text, date, numeric, or time series.
Step S24: and extracting field features of the second data to be detected to obtain a data feature vector.
It can be understood that, in this embodiment, techniques such as mathematical statistics and natural language processing are applied to the fields of the second data to be detected to automatically extract their field features and obtain the data feature vector.
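Steps S23 and S24 can be illustrated with a very small sketch that infers a field's data type from its values and builds a feature vector from simple statistics. Everything concrete here is an assumption for illustration: the type inference looks only at the values (not at tag information or metadata), it covers three of the four types (time series is omitted), the date format is fixed to `YYYY-MM-DD`, and the three features (null ratio, distinct ratio, mean string length) are placeholders for whatever statistical and natural language features are actually extracted.

```python
from datetime import datetime


def _non_null(values):
    return [v for v in values if v not in (None, "")]


def _all_parse(values, parser):
    """True if every value parses with `parser` and there is at least one value."""
    try:
        for v in values:
            parser(v)
    except ValueError:
        return False
    return bool(values)


def infer_type(values):
    """Classify a field of string values as 'numeric', 'date', or 'text'."""
    present = _non_null(values)
    if _all_parse(present, float):
        return "numeric"
    if _all_parse(present, lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "text"


def feature_vector(values):
    """Return [null ratio, distinct ratio, mean string length] for one field."""
    if not values:
        return [0.0, 0.0, 0.0]
    present = _non_null(values)
    null_ratio = 1 - len(present) / len(values)
    if not present:
        return [null_ratio, 0.0, 0.0]
    distinct_ratio = len(set(present)) / len(present)
    mean_length = sum(len(str(v)) for v in present) / len(present)
    return [null_ratio, distinct_ratio, mean_length]


amount_type = infer_type(["1.5", "2", "", "3"])
date_type = infer_type(["2022-03-17", "2022-03-18"])
name_type = infer_type(["abc", "2022-03-17"])
name_features = feature_vector(["a", "bb", "", "a"])
```

Grouping fields by inferred type before clustering keeps, for example, date fields from being compared against free-text fields on features that mean different things for each.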
Step S25: clustering the second data to be detected based on the data type, the data feature vector, and a preset clustering algorithm to obtain a plurality of clusters, calculating the central data of each cluster, and determining outlier candidate data based on a first distance between the central data and the second data to be detected in the corresponding cluster.
In this embodiment, the second data to be detected is clustered with the preset clustering algorithm according to its data type and the corresponding data feature vector, so that a plurality of accurate clusters can be obtained.
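The clustering of feature vectors can be sketched with a plain k-means loop. K-means is an assumption here: the text only says "a preset clustering algorithm", although the patent's classification codes mention K-means. The number of clusters, the iteration count, and the deterministic initialization from the first `k` points are likewise choices made for a reproducible example.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster points into k groups; returns (labels, centers)."""
    # Deterministic initialization from the first k points (illustrative choice).
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


feature_vectors = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
labels, centers = kmeans(feature_vectors, k=2)
```

To respect the data type as described in step S25, one option is to run this separately on the feature vectors of each type (text, date, numeric, time series) rather than on all fields at once.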
Step S26: calculating an outlier factor value of the outlier candidate data, determining the outlier candidate data with the outlier factor value meeting a preset abnormal condition as second abnormal data, and then taking the first abnormal data and the second abnormal data as target abnormal data.
It can be understood that, in this embodiment, as shown in fig. 4, after data preparation (collecting the first data to be detected), data cleaning (removing invalid data), feature engineering (obtaining the data feature vector), and outlier factor calculation, the results can be displayed visually. That is, the abnormal data is plotted and the data quality detection results for the target abnormal data are displayed and analyzed in a front-end tool that accompanies the algorithm verification environment, which supports the next steps of drawing up an audit scheme and rectifying the abnormal data. Furthermore, in a data center production environment, with 30 selected data tables as the first data to be detected and a data volume of 100 thousand records per table, the accuracy of the abnormal data mining results averages more than 95%. In 6 verified business systems, where the method is applied to first data to be detected in different forms, the average accuracy of the mining results still exceeds 90%, which shows that the method has high adaptability.
For a more specific working process of the step S22, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
Therefore, according to the method and the device, the first abnormal data in the first data to be detected are removed firstly to obtain the second data to be detected, and the clustering is accurately performed by using the preset clustering algorithm according to the data type and the data characteristic vector of the second data to be detected, so that the accuracy of subsequently calculating the outlier candidate data by using the clustered clusters obtained by clustering is improved, and the accuracy of mining the second abnormal data is improved.
Referring to fig. 5, an embodiment of the present application discloses a specific abnormal data mining method, including:
Step S31: collecting first data to be detected, metadata of the first data to be detected and log data related to the first data to be detected, and removing invalid data meeting a preset invalid condition from the log data to obtain target log data.
Step S32: determining first abnormal data of the first data to be detected based on the first data to be detected, the metadata and the target log data, and eliminating the first abnormal data from the first data to be detected to obtain second data to be detected.
Step S33: clustering the second data to be detected based on a preset clustering algorithm to obtain a plurality of clustering clusters, calculating the central data of each clustering cluster, and determining outlier candidate data based on the central data and the first distance between the second data to be detected in the corresponding clustering clusters.
In this embodiment, the calculating the center data of each cluster, and then determining outlier candidate data based on a first distance between the center data and second to-be-detected data in the corresponding cluster includes: calculating the central data of each cluster and the target radius of each cluster, and judging whether a first distance between second data to be detected in each cluster and the central data is not smaller than the target radius; and when the first distance between the second data to be detected in the cluster and the central data is not smaller than the target radius, determining the second data to be detected in the cluster as outlier candidate data.
In this embodiment, calculating the central data of any one of the clusters and the target radius of that cluster includes: calculating the central data of the cluster; and determining the average distance between all the second data to be detected in the cluster and the central data, and taking the average distance as the target radius of the cluster. For example, as shown in fig. 6, first, the second data to be detected is clustered into n clusters by using the K-Means clustering algorithm, and the central data C_i of each cluster is calculated; second, the average distance from all the second data to be detected in any cluster to the central data is calculated and recorded as the target radius R_i of the cluster; then, for each second data to be detected in the cluster, it is judged whether its distance to the central data of the cluster is not less than the target radius R_i; if it is not less than R_i, it is put into the "outlier candidate data". It should be noted that when the number N(i) of second data to be detected in a certain cluster is smaller than a preset threshold m, the cluster contains only a small amount of second data to be detected, and therefore all the second data to be detected in that cluster are put into the "outlier candidate data". Data pruning is thereby completed. The calculation formula of the central data is as follows:
$$C_i = \frac{1}{N(i)} \sum_{j=1}^{N(i)} x_{ij}$$
where C_i denotes the central data of the i-th cluster, N(i) denotes the number of second data to be detected in the i-th cluster, and x_{ij} is the j-th second data to be detected in the i-th cluster.
The target radius calculation formula is as follows:
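$$R_i = \frac{1}{N(i)} \sum_{j=1}^{N(i)} d(x_{ij}, C_i)$$
where R_i denotes the target radius of the i-th cluster, C_i denotes the central data of the i-th cluster, N(i) denotes the number of second data to be detected in the i-th cluster, x_{ij} is the j-th second data to be detected in the i-th cluster, and d(x_{ij}, C_i) is the distance between x_{ij} and C_i.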
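The pruning rule described in this step (central data as the mean of a cluster, target radius as the mean distance to that center, and small clusters kept whole) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: Euclidean distance is assumed, the clusters are taken as already computed, and the names `prune_outlier_candidates`, `clusters` and the sample points are hypothetical.

```python
from math import dist


def prune_outlier_candidates(clusters, m):
    """Return outlier candidates from a list of clusters (each a list of points).

    m is the preset threshold on cluster size: clusters with fewer than m
    points are placed into the candidate set in full.
    """
    candidates = []
    for points in clusters:
        n = len(points)
        if n < m:
            # Small cluster: every point becomes a candidate.
            candidates.extend(points)
            continue
        # Central data: coordinate-wise mean of the cluster.
        center = tuple(sum(col) / n for col in zip(*points))
        # Target radius: average distance from the points to the center.
        radius = sum(dist(p, center) for p in points) / n
        # Keep points whose distance is not smaller than the radius.
        candidates.extend(p for p in points if dist(p, center) >= radius)
    return candidates


# Hypothetical clusters: one of six points, one of two points.
clusters = [
    [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0), (1.0, 1.0), (1.0, 7.0)],
    [(10.0, 10.0), (11.0, 11.0)],
]
candidates = prune_outlier_candidates(clusters, m=3)
print(candidates)
```

Because the radius is an average, a noticeable share of each cluster typically passes the test; the pruning only narrows the set on which the more expensive outlier factor is computed.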
Step S34: and calculating the reachable distance of the outlier candidate data, determining local reachable density based on the reachable distance, and calculating the outlier factor value by using the local reachable density.
In this embodiment, the reachable distance calculation formula process of the outlier candidate data is as follows:
1) Define the k-th distance: if the second data to be detected P is the point k-th closest to the second data to be detected O, the k-th distance of the second data to be detected O can be expressed as:
$$d_k(O) = d(P, O);$$
2) Define the k-th distance neighborhood: N_k(O) is the k-th distance neighborhood of the second data to be detected O, expressed as:
$$N_k(O) = \{P' \in D \setminus \{O\} \mid d(O, P') \le d_k(O)\};$$
where D denotes the set of second data to be detected, so that N_k(O) consists of the second data to be detected lying within the k-th distance of the second data to be detected O.
3) Calculate the reachable distance: taking the second data to be detected O as the center, the k-th reachable distance from the second data to be detected P to the second data to be detected O can be expressed as:
$$d_k(P, O) = \max\{d_k(O), d(P, O)\};$$
where d_k(O) denotes the k-th distance of the second data to be detected O, d(P, O) denotes the distance from the second data to be detected P to the second data to be detected O, and d_k(P, O) denotes the k-th reachable distance from the second data to be detected P to the second data to be detected O.
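The three definitions above translate directly into code. The following Python sketch assumes Euclidean distance and distinct points (a point is excluded from its own neighborhood by value); the function names and the one-dimensional sample data are hypothetical.

```python
from math import dist


def k_distance(o, data, k):
    """k-th distance of o: distance from o to its k-th nearest other point."""
    return sorted(dist(o, p) for p in data if p != o)[k - 1]


def k_neighborhood(o, data, k):
    """All other points lying within the k-th distance of o."""
    dk = k_distance(o, data, k)
    return [p for p in data if p != o and dist(o, p) <= dk]


def reach_distance(p, o, data, k):
    """k-th reachable distance from p to o: max of o's k-th distance and d(p, o)."""
    return max(k_distance(o, data, k), dist(p, o))


# Hypothetical one-dimensional data with one isolated point.
data = [(0.0,), (1.0,), (2.0,), (10.0,)]
print(k_distance((0.0,), data, 2))                # 2.0
print(k_neighborhood((10.0,), data, 2))           # [(1.0,), (2.0,)]
print(reach_distance((10.0,), (2.0,), data, 2))   # 8.0
```

Note that the reachable distance is not symmetric: it depends on the k-th distance of the point taken as the center.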
In this embodiment, the formula process of calculating the local reachable density of the outlier candidate data is as follows:
$$\rho_k(P) = \frac{1}{\dfrac{\sum_{O \in N_k(P)} d_k(P, O)}{|N_k(P)|}}$$
where ρ_k(P) denotes the local reachable density of the second data to be detected P in the outlier candidate data, |N_k(P)| denotes the number of second data to be detected in the k-th distance neighborhood of the second data to be detected P, d_k(P, O) denotes the k-th reachable distance from the second data to be detected P to the second data to be detected O, and N_k(P) denotes the k-th distance neighborhood of the second data to be detected P.
In this embodiment, the formula process for calculating the outlier factor value of the outlier candidate data is as follows:
$$\mathrm{LOF}_k(P) = \frac{\sum_{O \in N_k(P)} \dfrac{\rho_k(O)}{\rho_k(P)}}{|N_k(P)|}$$
where LOF_k(P) denotes the outlier factor value, N_k(P) denotes the k-th distance neighborhood of the second data to be detected P, ρ_k(P) denotes the local reachable density of the second data to be detected P in the outlier candidate data, ρ_k(O) denotes the local reachable density of the second data to be detected O, and |N_k(P)| denotes the number of second data to be detected in the k-th distance neighborhood of the second data to be detected P.
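Putting the reachable distance, the local reachable density and the outlier factor together gives the following self-contained Python sketch. It assumes Euclidean distance, distinct points, and neighborhoods drawn from the full set of second data to be detected; the embodiment evaluates the factor only for the outlier candidates, whereas the sketch scores every point for illustration. All function names and the sample data are hypothetical.

```python
from math import dist


def k_distance(o, data, k):
    """Distance from o to its k-th nearest other point."""
    return sorted(dist(o, p) for p in data if p != o)[k - 1]


def k_neighborhood(o, data, k):
    """All other points within the k-th distance of o."""
    dk = k_distance(o, data, k)
    return [p for p in data if p != o and dist(o, p) <= dk]


def local_reach_density(p, data, k):
    """Inverse of the mean reachable distance from p to its neighbors."""
    nk = k_neighborhood(p, data, k)
    total = sum(max(k_distance(o, data, k), dist(p, o)) for o in nk)
    return len(nk) / total


def lof(p, data, k):
    """Mean ratio of the neighbors' density to p's own density."""
    nk = k_neighborhood(p, data, k)
    rho_p = local_reach_density(p, data, k)
    return sum(local_reach_density(o, data, k) / rho_p for o in nk) / len(nk)


# Hypothetical data: three nearby points and one isolated point.
data = [(0.0,), (1.0,), (2.0,), (10.0,)]
scores = {p: lof(p, data, 2) for p in data}
for point, score in scores.items():
    print(point, round(score, 3))
```

In this sample the isolated point receives by far the largest factor, while the three nearby points score close to 1, which matches the interpretation that a larger value indicates more abnormal data.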
Step S35: determining the outlier candidate data with the outlier factor value meeting a preset abnormal condition as second abnormal data, and then taking the first abnormal data and the second abnormal data as target abnormal data.
It can be understood that, in this embodiment, if the outlier factor value is larger, it indicates that the corresponding second data to be detected is more abnormal, whereas if the outlier factor value is smaller, it indicates that the corresponding second data to be detected is more normal; and outputting the abnormal second data to be detected, namely second abnormal data according to the set preset abnormal condition.
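A simple way to picture the final selection is a threshold on the outlier factor value. The embodiment leaves the preset abnormal condition open, so the rule "value greater than a threshold", the default of 1.5, the function name `select_target_abnormal` and the sample records below are all assumptions made for illustration.

```python
def select_target_abnormal(first_abnormal, lof_scores, threshold=1.5):
    """Combine first abnormal data with candidates whose factor exceeds threshold.

    lof_scores maps each outlier candidate to its outlier factor value.
    """
    second_abnormal = [item for item, score in lof_scores.items() if score > threshold]
    return list(first_abnormal) + second_abnormal


# Hypothetical inputs: one record flagged by the index rules, three candidates.
first_abnormal = ["row_17"]
lof_scores = {"row_3": 0.9, "row_8": 4.96, "row_21": 1.2}
target = select_target_abnormal(first_abnormal, lof_scores)
print(target)  # ['row_17', 'row_8']
```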
Therefore, according to the method and the device, the outlier candidate data are determined based on the first distance between the central data and the second data to be detected in the cluster corresponding to the central data, so that data pruning is completed, and the calculation efficiency and the accuracy of the calculation result can be effectively improved.
Referring to fig. 7, an embodiment of the present application discloses an abnormal data mining apparatus, including:
the data acquisition module 11 is configured to acquire first data to be detected, metadata of the first data to be detected, and log data related to the first data to be detected, and eliminate invalid data meeting a preset invalid condition from the log data to obtain target log data;
a data removing module 12, configured to determine first abnormal data of the first to-be-detected data based on the first to-be-detected data, the metadata, and the target log data, and remove the first abnormal data from the first to-be-detected data to obtain second to-be-detected data;
the candidate data determining module 13 is configured to cluster the second data to be detected based on a preset clustering algorithm to obtain a plurality of clustering clusters, calculate central data of each clustering cluster, and determine outlier candidate data based on a first distance between the central data and the second data to be detected in the corresponding clustering cluster;
an abnormal data determining module 14, configured to calculate an outlier factor value of the outlier candidate data, determine the outlier candidate data whose outlier factor value satisfies a preset abnormal condition as second abnormal data, and then use the first abnormal data and the second abnormal data as target abnormal data.
According to the method, first to-be-detected data, metadata of the first to-be-detected data and log data related to the first to-be-detected data are collected, and invalid data meeting preset invalid conditions are removed from the log data to obtain target log data; determining first abnormal data of the first data to be detected based on the first data to be detected, the metadata and the target log data, and eliminating the first abnormal data from the first data to be detected to obtain second data to be detected; clustering the second data to be detected based on a preset clustering algorithm to obtain a plurality of clustering clusters, calculating central data of each clustering cluster, and determining outlier candidate data based on the central data and a first distance between the second data to be detected in the corresponding clustering clusters; calculating an outlier factor value of the outlier candidate data, determining the outlier candidate data of which the outlier factor value meets a preset abnormal condition as second abnormal data, and then taking the first abnormal data and the second abnormal data as target abnormal data. 
Therefore, by removing invalid data from the log data and retaining the high-heat target data, the data used for calculation is more meaningful and more reasonable; by determining the first abnormal data of the first data to be detected based on the first data to be detected, the metadata and the target log data, and removing it to obtain the second data to be detected, the subsequent calculation amount is reduced; by clustering the second data to be detected with a preset clustering algorithm and calculating the corresponding outlier candidate data for the different types of second data to be detected, the second abnormal data can be determined more accurately from the outlier candidate data; and by determining the outlier candidate data based on the first distance between the central data and the second data to be detected in the corresponding cluster, data pruning is completed, which can effectively improve the calculation efficiency and the accuracy of the calculation result. In addition, determining the second abnormal data does not depend on business rules or manual effort, so the influence of calculation cost and business rules on abnormal data mining is reduced and adaptability is improved.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The memory 22 is used for storing a computer program, and the computer program is loaded and executed by the processor 21 to implement the relevant steps of the abnormal data mining method executed by a computer device as disclosed in any one of the foregoing embodiments.
In this embodiment, the power supply 23 is used to provide operating voltage for each hardware device on the computer device 20; the communication interface 24 can create a data transmission channel between the computer device 20 and an external device, and a communication protocol followed by the communication interface is any communication protocol applicable to the technical solution of the present application, and is not specifically limited herein; the input/output interface 25 is configured to acquire external input data or output data to the outside, and a specific interface type thereof may be selected according to specific application requirements, which is not specifically limited herein.
The processor 21 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 21 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 21 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 21 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 21 may further include an AI (Artificial Intelligence) processor for processing calculation operations related to machine learning.
In addition, the storage 22 is used as a carrier for storing resources, and may be a read-only memory, a random access memory, a magnetic disk or an optical disk, etc., the resources stored thereon include an operating system 221, a computer program 222, data 223, etc., and the storage may be a transient storage or a permanent storage.
The operating system 221 is used for managing and controlling each hardware device and the computer program 222 on the computer device 20, so as to realize the operation and processing of the mass data 223 in the memory 22 by the processor 21, which may be Windows, Unix, Linux, or the like. The computer program 222 may further include a computer program that can be used to perform other specific tasks in addition to the computer program that can be used to perform the abnormal data mining method performed by the computer device 20 disclosed in any of the foregoing embodiments. The data 223 may include data received by the computer device and transmitted from an external device, data collected by the input/output interface 25, and the like.
Further, an embodiment of the present application also discloses a computer-readable storage medium, in which a computer program is stored, and when the computer program is loaded and executed by a processor, the method steps executed in the abnormal data mining process disclosed in any of the foregoing embodiments are implemented.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The method, device, equipment and medium for mining abnormal data provided by the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is only intended to help in understanding the method and core idea of the invention. Meanwhile, a person skilled in the art may, according to the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims (10)

1. An abnormal data mining method, characterized by comprising:
collecting first data to be detected, metadata of the first data to be detected and log data related to the first data to be detected, and eliminating invalid data meeting preset invalid conditions from the log data to obtain target log data;
determining first abnormal data of the first data to be detected based on the first data to be detected, the metadata and the target log data, and eliminating the first abnormal data from the first data to be detected to obtain second data to be detected;
clustering the second data to be detected based on a preset clustering algorithm to obtain a plurality of clustering clusters, calculating central data of each clustering cluster, and determining outlier candidate data based on the central data and a first distance between the second data to be detected in the corresponding clustering clusters;
calculating an outlier factor value of the outlier candidate data, determining the outlier candidate data with the outlier factor value meeting a preset abnormal condition as second abnormal data, and then taking the first abnormal data and the second abnormal data as target abnormal data.
2. The abnormal data mining method according to claim 1, wherein the removing invalid data satisfying a preset invalid condition from the log data to obtain target log data comprises:
analyzing the log data to obtain each subdata;
and taking the subdata meeting the preset invalid condition as invalid data, and removing the invalid data from the log data to obtain target log data.
3. The abnormal data mining method according to claim 1, wherein the determining first abnormal data of the first data to be detected based on the first data to be detected, the metadata, and the target log data comprises:
determining first abnormal data of the first data to be detected by using a preset index rule based on the first data to be detected, the metadata and the target log data; the preset index rules comprise statistical indexes and business rules.
4. The abnormal data mining method according to claim 1, wherein before clustering the second data to be detected based on a preset clustering algorithm, the method further comprises:
acquiring data tag information corresponding to the second data to be detected in a data center, and identifying a field object of the second data to be detected based on the data tag information and the metadata so as to determine a data type corresponding to the second data to be detected;
and extracting field features of the second data to be detected to obtain a data feature vector so as to cluster the second data to be detected based on the data type, the data feature vector and a preset clustering algorithm.
5. The abnormal data mining method according to claim 1, wherein the calculating of the center data of each cluster and the determining of the outlier candidate data based on the first distance between the center data and the second to-be-detected data in the corresponding cluster comprises:
calculating the central data of each cluster and the target radius of each cluster, and judging whether a first distance between second data to be detected in each cluster and the central data is not smaller than the target radius;
and when the first distance between the second data to be detected in the cluster and the central data is not smaller than the target radius, determining the second data to be detected in the cluster as outlier candidate data.
6. The abnormal data mining method according to claim 5, wherein the calculating the central data of any one of the clusters and the target radius of the cluster comprises:
calculating central data of the cluster;
and determining the average distance between all the second data to be detected in the cluster and the central data, and taking the average distance as the target radius of the cluster.
7. The abnormal data mining method according to any one of claims 1 to 6, wherein the calculating an outlier factor value of the outlier candidate data and determining the outlier candidate data whose outlier factor value satisfies a preset abnormal condition as second abnormal data comprises:
calculating the reachable distance of the outlier candidate data, and determining local reachable density based on the reachable distance;
calculating an outlier factor value by using the local reachable density, and determining the outlier candidate data with the outlier factor value meeting a preset abnormal condition as second abnormal data.
8. An abnormal data mining device, comprising:
the data acquisition module is used for acquiring first data to be detected, metadata of the first data to be detected and log data related to the first data to be detected, and eliminating invalid data meeting preset invalid conditions from the log data to obtain target log data;
the data removing module is used for determining first abnormal data of the first data to be detected based on the first data to be detected, the metadata and the target log data, and removing the first abnormal data from the first data to be detected to obtain second data to be detected;
the candidate data determining module is used for clustering the second data to be detected based on a preset clustering algorithm to obtain a plurality of clustering clusters, calculating central data of each clustering cluster, and then determining outlier candidate data based on a first distance between the central data and the second data to be detected in the corresponding clustering clusters;
and the abnormal data determining module is used for calculating an outlier factor value of the outlier candidate data, determining the outlier candidate data of which the outlier factor value meets a preset abnormal condition as second abnormal data, and then taking the first abnormal data and the second abnormal data as target abnormal data.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the abnormal data mining method according to any one of claims 1 to 7.
10. A computer-readable storage medium for storing a computer program; wherein the computer program, when executed by a processor, implements the steps of the abnormal data mining method according to any one of claims 1 to 7.
CN202210264250.0A 2022-03-17 2022-03-17 Abnormal data mining method, device, equipment and medium Pending CN114443738A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210264250.0A CN114443738A (en) 2022-03-17 2022-03-17 Abnormal data mining method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN114443738A true CN114443738A (en) 2022-05-06

Family

ID=81358764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210264250.0A Pending CN114443738A (en) 2022-03-17 2022-03-17 Abnormal data mining method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN114443738A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117829381A (en) * 2024-03-05 2024-04-05 成都农业科技职业学院 Agricultural greenhouse data optimization acquisition system based on Internet of things
CN117829381B (en) * 2024-03-05 2024-05-14 成都农业科技职业学院 Agricultural greenhouse data optimization acquisition system based on Internet of things

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination