CN114580580B - Intelligent operation and maintenance abnormity detection method and device - Google Patents
- Publication number
- CN114580580B CN202210492320.8A
- Authority
- CN
- China
- Prior art keywords
- independent
- sample
- tree
- preliminary
- forest
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S10/00—Systems supporting electrical power generation, transmission or distribution
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Complex Calculations (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention discloses an intelligent operation and maintenance anomaly detection method and device. The method comprises the following steps: acquiring operation and maintenance data and performing dimension reduction to obtain samples of the operation and maintenance data; building independent trees from the samples to form an independent forest; calculating a preliminary anomaly score for each sample according to the independent trees and the independent forest, and marking samples whose preliminary anomaly score exceeds a preset value as preliminary outliers; marking part of the positive samples; identifying valid trees according to the marked preliminary outliers; assigning scores to the features of the valid trees in which preliminary outliers are identified, and calculating a total score according to the number of independent trees in which the outliers are identified and the number of marked positive samples; calculating feature selection probabilities according to the total score and reconstructing the independent trees and the independent forest; and performing anomaly detection according to the reconstructed independent trees and independent forest. Because the independent trees and the independent forest are reconstructed according to the preliminarily identified outliers, anomaly detection is efficient and accurate.
Description
Technical Field
The invention relates to the field of anomaly detection and calculation, in particular to an intelligent operation and maintenance anomaly detection method and device.
Background
In intelligent operation and maintenance scenarios, operation and maintenance personnel often need to capture anomalous signals promptly from many indicators related to system transactions and diagnose them, so as to troubleshoot quickly and avoid accidents. Indicators associated with system transactions include page-opening delay, user click rate, CPU utilization, and the like. The usual challenges in this scenario are that the indicators to be tracked are very high-dimensional, that outliers are hard to capture in time, and that no labels indicate whether a sample is an outlier. Among existing anomaly detection techniques, conventional unsupervised training has poor accuracy, while manually labeling every sample point is very costly.
For example, patent document CN111026925A discloses an anomaly detection method and device for parallelization of an isolated forest algorithm based on Flink, which extracts a data set to be tested from historical data to construct a binary tree, further forms an independent forest, scores the anomaly according to the depth of a sample point in each independent binary tree, and determines whether a sample in the data set is abnormal according to the anomaly score.
The above scheme uses an unsupervised detection algorithm to detect anomalies in the samples and scores the degree of anomaly of each sample point with independent trees, so that outliers can be identified promptly. However, it determines outliers solely from the anomaly score in the independent trees, which is inefficient and inaccurate.
Disclosure of Invention
The invention provides an intelligent operation and maintenance abnormity detection method and device, which are used for reconstructing an independent tree and an independent forest according to an initially identified abnormal point, realizing the integration of an unsupervised independent forest algorithm and supervised learning, and having high abnormity detection efficiency and high accuracy.
An intelligent operation and maintenance abnormity detection method comprises the following steps:
acquiring operation and maintenance data and performing dimension reduction processing to obtain a sample of the operation and maintenance data;
establishing an independent tree according to the sample and forming an independent forest;
calculating a preliminary abnormal score of each sample according to the independent tree and the independent forest, and marking the samples with the preliminary abnormal scores larger than a preset value as preliminary abnormal points;
marking part of the positive sample;
identifying an effective tree according to the marked preliminary abnormal points;
assigning a score to the features of the identified preliminary outliers in the valid trees, and calculating a total score according to the number of the independent trees of the identified preliminary outliers and the number of the marked positive samples;
calculating feature selection probability according to the total score and reconstructing an independent tree and an independent forest;
and carrying out anomaly detection according to the reconstructed independent tree and the independent forest.
Further, the operation and maintenance data are collected and subjected to dimension reduction treatment, and the method comprises the following steps:
arranging the operation and maintenance data into a matrix by columns;
zero-averaging each row of the matrix;
computing the covariance matrix of the zero-averaged matrix;
computing the eigenvalues and corresponding eigenvectors of the covariance matrix;
and arranging the eigenvectors by rows, in order of eigenvalue size, into a feature matrix that serves as the samples.
Further, establishing independent trees according to the samples and forming independent forests, comprising:
randomly selecting a feature as the root node;
selecting a value between the maximum and minimum values of the root node's feature as the split point, and creating two child nodes;
dividing the samples into two groups, which enter the two child nodes respectively;
repeating the following step until the path reaches a preset length or a child node contains only one sample, thereby forming an independent tree: at each child node, selecting a value of one feature as the split point to create two further child nodes, and dividing the remaining samples into two groups that enter those child nodes;
and forming an independent forest from the independent trees generated with different features as root nodes.
Further, the preliminary anomaly score for each sample is calculated by the following formula:
[formula image not reproduced in this text]
wherein the result of the formula represents the preliminary anomaly score, L(p) represents the path length to the leaf node where sample p is located in one independent tree, and E(L(p)) represents the average of the path lengths of sample p over all independent trees in the independent forest.
Further, identifying the valid tree according to the marked preliminary abnormal points comprises:
and determining the independent tree in which the preliminary abnormal point is identified when the path length does not exceed the preset value as a valid tree.
Further, the total score is calculated by the following formula:
[formula image not reproduced in this text]
wherein the terms of the formula represent, in order: the score assigned to a feature of the preliminary outlier P; N, the number of independent trees in which the preliminary outlier P is identified; the sum of the scores of the relevant features of the preliminary outlier P; the total score; and n, the number of marked positive samples.
Further, the feature selection probability is calculated by the following formula:
$$\Pr(f_m) = \frac{S(f_m)}{\sum_{j} S(f_j)}$$
wherein $\Pr(f_m)$ represents the selection probability of the m-th feature, $S(f_m)$ represents its total score, and $f_m$ represents the m-th feature.
Further, calculating feature selection probability according to the total score and reconstructing an independent tree and an independent forest, comprising:
sampling a random variable U that follows a uniform distribution between 0 and 1;
selecting the i-th feature $f_i$ as the root node, where $f_i$ satisfies $\sum_{m=1}^{i-1} \Pr(f_m) < U \le \sum_{m=1}^{i} \Pr(f_m)$ and $\Pr(f_m)$ represents the selection probability of the m-th feature;
selecting a value between the maximum and minimum values of the root node's feature as the split point, and creating two child nodes;
dividing the samples into two groups, which enter the two child nodes respectively;
repeating the following step until the path reaches the preset length or a child node contains only one sample: at each child node, randomly selecting a value of one feature as the split point to create two further child nodes, and dividing the remaining samples into two groups that enter those child nodes;
and recombining the independent trees generated with different features as root nodes into an independent forest.
Further, the anomaly detection is carried out according to the reconstructed independent trees and the independent forests, and comprises the following steps:
calculating the final abnormal score of each sample according to the reconstructed independent tree and the reconstructed independent forest, and marking the sample with the final abnormal score larger than a preset value as an abnormal point;
the final anomaly score is calculated by the following formula:
[formula image not reproduced in this text]
wherein the result of the formula represents the final anomaly score, the path-length term represents the path length to the leaf node where sample p is located in one reconstructed independent tree, and the expectation term represents the average of the path lengths of sample p over all independent trees in the reconstructed independent forest.
An intelligent operation and maintenance abnormity detection device comprises:
the data processing module is used for acquiring operation and maintenance data and performing dimension reduction processing to obtain a sample of the operation and maintenance data;
the preliminary forest establishment module is used for establishing an independent tree according to the sample and forming an independent forest;
the preliminary judgment module is used for calculating the preliminary abnormal score of each sample according to the independent tree and the independent forest and marking the sample with the preliminary abnormal score larger than a preset value as a preliminary abnormal point;
the marking module is used for marking part of the positive samples;
the identification module is used for identifying the effective tree according to the marked preliminary abnormal points;
the total score calculating module is used for giving scores to the features of the identified primary abnormal points in the effective trees and calculating the total score according to the number of the independent trees of the identified primary abnormal points and the number of the marked positive samples;
the reconstruction module is used for calculating the feature selection probability according to the total score and reconstructing an independent tree and an independent forest;
and the anomaly detection module is used for carrying out anomaly detection according to the reconstructed independent tree and the independent forest.
The intelligent operation and maintenance abnormity detection method and device provided by the invention at least have the following beneficial effects:
(1) the operation and maintenance data are subjected to dimension reduction processing before anomaly detection, sample data applied to anomaly detection are simplified, operation time is saved, and the working efficiency of an anomaly detection algorithm is improved.
(2) Part of the positive samples are marked by manual labeling, adding labeled supervised learning to the unsupervised independent forest algorithm, so that the advantages of both can be combined, improving the accuracy of the algorithm while preserving its efficiency.
(3) All features involved are scored through the multiple valid trees of multiple positive samples, and the total score of each feature is calculated to describe how large a role each feature plays in anomaly detection; the total score serves as the basis for selecting root nodes when the independent trees are reconstructed, improving the identification accuracy of the reconstructed independent forest.
(4) Root nodes are selected by sampling a uniformly distributed random variable, which ensures that each feature is selected with a probability equal to its feature selection probability, thereby ensuring the accuracy of the reconstructed independent forest.
Drawings
Fig. 1 is a flowchart of an embodiment of an intelligent operation and maintenance anomaly detection method provided by the present invention.
Fig. 2 is a flowchart of an embodiment of a method for reconstructing an independent tree and an independent forest in the method provided by the present invention.
Fig. 3 is a schematic structural diagram of an embodiment of the intelligent operation and maintenance abnormality detection apparatus provided in the present invention.
Fig. 4 is a schematic structural diagram of an embodiment of an electronic device provided in the present invention.
Reference numerals: 1-a processor, 2-a storage device, 101-a data processing module, 102-a preliminary forest establishment module, 103-a preliminary judgment module, 104-a marking module, 105-an identification module, 106-a total score calculation module, 107-a reconstruction module and 108-an abnormity detection module.
Detailed Description
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
Referring to fig. 1, in some embodiments, an intelligent operation and maintenance anomaly detection method is provided, including:
S1, collecting operation and maintenance data and performing dimension reduction to obtain samples of the operation and maintenance data;
S2, building independent trees from the samples to form an independent forest;
S3, calculating a preliminary anomaly score for each sample according to the independent trees and the independent forest, and marking samples whose preliminary anomaly score exceeds a preset value as preliminary outliers;
S4, marking part of the positive samples;
S5, identifying valid trees according to the marked preliminary outliers;
S6, assigning scores to the features of the valid trees in which preliminary outliers are identified, and calculating a total score according to the number of independent trees in which the preliminary outliers are identified and the number of marked positive samples;
S7, calculating feature selection probabilities according to the total score, and reconstructing the independent trees and the independent forest;
S8, performing anomaly detection according to the reconstructed independent trees and independent forest.
The intelligent operation and maintenance data comprises a plurality of characteristics related to the operation of equipment, a system and a network environment, including but not limited to: network delay, request concurrency number and database capacity. In the collected operation and maintenance data, one dimension corresponds to one feature, that is, the operation and maintenance data are multidimensional data, so that the operation and maintenance data need to be subjected to dimension reduction before anomaly detection.
Specifically, in step S1, the operation and maintenance data collection and the dimension reduction processing include:
S11, arranging the operation and maintenance data into a matrix by columns;
S12, zero-averaging each row of the matrix;
S13, computing the covariance matrix of the zero-averaged matrix;
S14, computing the eigenvalues and corresponding eigenvectors of the covariance matrix;
S15, arranging the eigenvectors by rows, in order of eigenvalue size, into a feature matrix that serves as the samples.
As a preferred embodiment, the operation and maintenance data is reduced in dimension by PCA (Principal Component Analysis). To reduce k pieces of M-dimensional data to m dimensions, the original operation and maintenance data is first arranged by columns into a matrix X_0 with M rows and k columns. The mean of each row is then subtracted from the data of that row to obtain the zero-averaged matrix X, and the covariance matrix of X is computed. The eigenvalues and corresponding eigenvectors of the covariance matrix are solved, the eigenvectors are arranged as rows from top to bottom in descending order of their eigenvalues, and the first m rows are taken to form a matrix P, thereby obtaining samples reduced to m dimensions and described by m features.
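The PCA steps described above can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the function name `pca_reduce` is hypothetical, and the 1/k normalisation of the covariance matrix is an assumption, since the covariance formula is not reproduced in this text.

```python
import numpy as np


def pca_reduce(x0: np.ndarray, m: int) -> np.ndarray:
    """Reduce the M-dimensional columns of x0 (shape M x k) to m dimensions."""
    # Zero-average each row; each row holds one original feature.
    x = x0 - x0.mean(axis=1, keepdims=True)
    # Covariance matrix of the centred data, shape M x M.
    # Assumption: normalised by the number of samples k.
    cov = x @ x.T / x.shape[1]
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    # Rows of p are the eigenvectors belonging to the m largest eigenvalues.
    p = eigvecs[:, order[:m]].T
    # Project the centred data; the result has one reduced sample per column.
    return p @ x
```

For example, `pca_reduce(x0, 2)` on a 5 x 200 matrix would return a 2 x 200 matrix whose first row carries the largest variance.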
In step S2, establishing an independent tree according to the sample and forming an independent forest, including:
S21, randomly selecting a feature as the root node;
S22, selecting a value between the maximum and minimum values of the root node's feature as the split point, and creating two child nodes;
S23, dividing the samples into two groups, which enter the two child nodes respectively;
S24, repeating the following step until the path reaches a preset length or a child node contains only one sample, thereby forming an independent tree: at each child node, selecting a value of one feature as the split point to create two further child nodes, and dividing the remaining samples into two groups that enter those child nodes;
S25, forming an independent forest from the independent trees generated with different features as root nodes.
The anomaly detection method provided by this embodiment uses an independent forest algorithm (commonly known as isolation forest), an unsupervised anomaly detection method suitable for continuous data that detects outliers by isolating sample points. Each independent tree in the algorithm is essentially a decision tree: each sample flows from the root node to child nodes according to the split at each node and finally falls into one leaf node. There is no uniform rule for the number of independent trees to generate, and that number is not directly related to the number of samples. The independent trees are mutually independent, and the judgment of every independent tree on a sample must be considered together when the independent forest algorithm computes an anomaly score.
In steps S21-S25, since the abnormal data sample is relatively isolated from other data samples, the number of partitions required for the abnormal sample to be partitioned separately is relatively small compared to other samples, i.e., the path length of the abnormal sample in the independent tree is relatively short. Therefore, the possibility that the sample is an abnormal sample can be judged according to the path length of each sample which is divided out separately, and the sample is represented by a preliminary abnormal score, and the sample with the preliminary abnormal score larger than the preset value is marked as a preliminary abnormal point.
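The tree construction and the path-length intuition above can be illustrated with a short sketch. It assumes samples are tuples of floats and that the split value is drawn uniformly between a feature's minimum and maximum. The names `Node`, `build_tree` and `path_length` are hypothetical and not taken from the patent.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Node:
    size: int                          # number of samples that reached this node
    feature: Optional[int] = None      # index of the splitting feature; None marks a leaf
    threshold: Optional[float] = None  # split value
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(samples: List[Sequence[float]], max_depth: int,
               rng: random.Random, depth: int = 0) -> Node:
    node = Node(size=len(samples))
    # Stop when the preset path length is reached or at most one sample remains.
    if depth >= max_depth or len(samples) <= 1:
        return node
    feature = rng.randrange(len(samples[0]))
    values = [s[feature] for s in samples]
    lo, hi = min(values), max(values)
    if lo == hi:
        return node  # all values equal, so this feature cannot split the node
    # Assumption: the split value is uniform between the minimum and maximum.
    threshold = rng.uniform(lo, hi)
    node.feature, node.threshold = feature, threshold
    node.left = build_tree([s for s in samples if s[feature] < threshold],
                           max_depth, rng, depth + 1)
    node.right = build_tree([s for s in samples if s[feature] >= threshold],
                            max_depth, rng, depth + 1)
    return node


def path_length(tree: Node, sample: Sequence[float]) -> int:
    """Number of splits a sample passes through before reaching a leaf."""
    depth, node = 0, tree
    while node.feature is not None:
        node = node.left if sample[node.feature] < node.threshold else node.right
        depth += 1
    return depth
```

Building several such trees over the same samples and averaging `path_length` per sample gives the quantity that the preliminary anomaly score is based on: isolated points tend to have short average paths.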
Specifically, the preliminary abnormality score of each sample in step S3 is calculated by the following formula:
[formula image not reproduced in this text]
wherein the result of the formula represents the preliminary anomaly score, L(p) represents the path length to the leaf node where sample p is located in one independent tree, and E(L(p)) represents the average of the path lengths of sample p over all independent trees in the independent forest.
As a preferred embodiment, samples with a preliminary anomaly score greater than 0.9, derived according to the above formula, are labeled as preliminary anomaly points.
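Because the scoring formula itself is not reproduced in this text, the following sketch substitutes the normalisation commonly used in the isolation forest literature, s = 2^(-E(L(p)) / c(n)), where c(n) is the average path length of an unsuccessful binary search tree lookup over n samples. Treat it as an assumption, not as the patent's formula. The function names are hypothetical.

```python
import math
from typing import Sequence

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def average_path_normaliser(n: int) -> float:
    """c(n): expected path length of an unsuccessful BST search over n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(path_lengths: Sequence[float], n_samples: int) -> float:
    """Score in (0, 1]; shorter average paths give scores closer to 1."""
    if n_samples < 2:
        raise ValueError("need at least two samples to normalise path lengths")
    mean_length = sum(path_lengths) / len(path_lengths)  # E(L(p)) over the forest
    return 2.0 ** (-mean_length / average_path_normaliser(n_samples))
```

Under this assumed normalisation, a sample isolated after one or two splits in a forest built on 256 samples scores above the 0.9 threshold mentioned above, while a sample with paths around nine splits scores near 0.5.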
In step S4, a small portion of the positive samples is marked by manual labeling. Marking only part of the positive samples lays the foundation for combining the unsupervised independent forest algorithm with supervised learning, so that the advantages of both can be combined, improving accuracy while preserving efficiency. In addition, compared with marking all samples, this reduces the cost of manual labeling.
The identification precision of the preliminarily identified preliminary abnormal points is not high, so that the reconstruction of the independent trees and the independent forests is required to be further carried out.
In step S5, identifying a valid tree according to the marked preliminary abnormal point includes:
and determining the independent tree of which the preliminary abnormal point is identified when the path length does not exceed the preset value as a valid tree.
In step S6, assigning a score to the feature of the valid tree in which the preliminary outlier is identified, and calculating a total score according to the number of the independent trees in which the outlier is identified and the number of the marked positive samples, includes:
S61, assigning zero to each feature as its initial score;
S62, performing the following step for the preliminary outliers until all valid trees and all preliminary outliers have been traversed, to obtain the total score of a given feature: assigning a score (the expression is not reproduced in this text) to the feature in each valid tree i that identifies a preliminary outlier, the score depending on the path length of the preliminary outlier in valid tree i;
S63, performing step S62 for all features to obtain the total scores of all features.
In step S62, the total score is calculated by the following formula:
[formula image not reproduced in this text]
wherein the terms of the formula represent, in order: the score assigned to a feature of the preliminary outlier P; N, the number of independent trees in which the preliminary outlier P is identified; the sum of the scores of the relevant features of the preliminary outlier P; the total score; and n, the number of marked positive samples.
In certain embodiments, the maximum path length of each independent tree does not exceed D, and the independent trees that identify the preliminary outlier P with a path length not exceeding D-1 are determined to be valid trees; the preliminary outlier P has N valid trees in total. The initial score of each feature is 0. For the i-th independent tree in which the preliminary outlier P is validly identified, each feature involved in the path that detects the preliminary outlier is assigned a score determined by the path length of point P in the i-th independent tree (the expression is not reproduced in this text). Suppose three features are involved in detecting the preliminary outlier P; then, for the i-th independent tree detecting P, all three features receive that score. Thus, over the N valid trees, the score that a feature can be assigned by the preliminary outlier P is the sum of its per-tree scores. The features of all positive samples are identified and scored in this way, finally giving the total score of each feature. Note that if a feature is never used in detecting any preliminary outlier, its score remains zero.
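The scoring procedure above can be sketched as follows. The per-tree score expression is not reproduced in this text, so it is passed in as `score_fn`; the `1.0 / length` used in the usage note is purely a hypothetical placeholder. Representing each path as the list of feature indices split on, and crediting a feature once per tree even if it appears several times in a path, are likewise assumptions of this sketch.

```python
from typing import Callable, List, Sequence


def total_feature_scores(
    paths_per_point: Sequence[Sequence[Sequence[int]]],
    max_valid_length: int,
    score_fn: Callable[[int], float],
    n_features: int,
) -> List[float]:
    """Accumulate per-feature scores over valid trees and labelled outliers.

    paths_per_point[a][t] lists the feature indices split on along the path
    of labelled outlier a in tree t; its length is the path length.
    """
    totals = [0.0] * n_features  # every feature starts at zero
    for tree_paths in paths_per_point:
        for path in tree_paths:
            length = len(path)
            # A tree is valid for this point only if it isolates the point
            # within the allowed path length (D - 1 in the text).
            if length == 0 or length > max_valid_length:
                continue
            score = score_fn(length)
            for feature in set(path):  # assumption: credit each feature once per tree
                totals[feature] += score
    return totals
```

With the placeholder `score_fn=lambda length: 1.0 / length`, two labelled outliers whose paths are `[[0, 1], [2], [0, 1, 2, 3, 3]]` and `[[1]]`, and `max_valid_length=3`, the third tree of the first point is skipped as invalid and feature 3 keeps a total of zero.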
Referring to fig. 2, in step S7, calculating feature selection probabilities according to the total scores and reconstructing an independent tree and an independent forest includes:
S71, sampling a random variable U that follows a uniform distribution between 0 and 1;
S72, selecting the i-th feature $f_i$ as the root node, where $f_i$ satisfies $\sum_{m=1}^{i-1} \Pr(f_m) < U \le \sum_{m=1}^{i} \Pr(f_m)$ and $\Pr(f_m)$ represents the selection probability of the m-th feature;
S73, selecting a value between the maximum and minimum values of the root node's feature as the split point, and creating two child nodes;
S74, dividing the samples into two groups, which enter the two child nodes respectively;
S75, repeating the following step until the path reaches the preset length or a child node contains only one sample: at each child node, randomly selecting a value of one feature as the split point to create two further child nodes, and dividing the remaining samples into two groups that enter those child nodes;
S76, recombining the independent trees generated with different features as root nodes into an independent forest.
In step S72, the feature selection probability is calculated by the following formula:
$$\Pr(f_m) = \frac{S(f_m)}{\sum_{j} S(f_j)}$$
wherein $\Pr(f_m)$ represents the selection probability of the m-th feature, $S(f_m)$ represents its total score, and $f_m$ represents the m-th feature.
The procedure for reconstructing the independent trees in step S7 is substantially the same as the procedure for initially constructing them in step S2. The difference is that when the independent trees are first constructed, every feature is equally likely to be selected for the root node, whereas during reconstruction the selection probability is determined by each feature's total score: the higher the total score, the more likely the feature is to be selected as the root node of a reconstructed independent tree. Sampling the uniformly distributed random variable U and then selecting the root node ensures that each feature is selected with a probability equal to its feature selection probability. In particular, a feature that has never been used in detecting any preliminary outlier has a total score of zero, so its selection probability is zero.
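The root-feature selection described above amounts to inverse-CDF sampling over the features. The sketch below assumes that the selection probability is each feature's total score divided by the sum of all total scores, which is consistent with zero-score features never being chosen. The function names are hypothetical.

```python
import random
from typing import List, Sequence


def selection_probabilities(total_scores: Sequence[float]) -> List[float]:
    """Normalise total scores into selection probabilities (assumed form)."""
    total = sum(total_scores)
    if total <= 0:
        raise ValueError("at least one feature needs a positive total score")
    return [score / total for score in total_scores]


def sample_root_feature(probabilities: Sequence[float], rng: random.Random) -> int:
    """Pick the feature whose cumulative-probability interval contains U."""
    u = rng.random()  # U ~ Uniform[0, 1)
    cumulative = 0.0
    for index, probability in enumerate(probabilities):
        cumulative += probability
        # Zero-probability features never satisfy this test because they
        # leave the cumulative sum unchanged.
        if u < cumulative:
            return index
    # Guard against floating-point round-off in the cumulative sum.
    return max(i for i, p in enumerate(probabilities) if p > 0)
```

Since `rng.random()` returns values in [0, 1), the interval test is half-open on the right rather than on the left; for a continuous U this does not change the selection probabilities.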
In step S8, performing anomaly detection according to the reconstructed independent tree and independent forest, including:
calculating the final abnormal score of each sample according to the reconstructed independent tree and the reconstructed independent forest, and marking the sample with the final abnormal score larger than a preset value as an abnormal point;
the final anomaly score is calculated by the following formula:
[formula image not reproduced in this text]
wherein the result of the formula represents the final anomaly score, the path-length term represents the path length to the leaf node where sample p is located in one reconstructed independent tree, and the expectation term represents the average of the path lengths of sample p over all independent trees in the reconstructed independent forest.
As a preferred embodiment, samples whose final anomaly score, calculated according to the above formula, is greater than 0.9 are marked as final outliers. In the independent trees and independent forest reconstructed according to the feature selection probabilities, features that played a larger role in detecting the preliminary outliers account for a larger share of the root nodes, so anomaly detection with the reconstructed independent trees and independent forest is more accurate.
Referring to fig. 3, in some embodiments, an intelligent operation and maintenance anomaly detection device is provided, including:
the data processing module 101 is configured to acquire operation and maintenance data and perform dimension reduction processing to obtain a sample of the operation and maintenance data;
a preliminary forest establishment module 102, configured to establish an independent tree according to the sample and form an independent forest;
a preliminary judgment module 103, configured to calculate a preliminary abnormal score of each sample according to the independent tree and the independent forest, and mark a sample with the preliminary abnormal score larger than a preset value as a preliminary abnormal point;
a marking module 104, configured to mark some of the positive samples;
an identification module 105, configured to identify valid trees according to the marked preliminary abnormal points;
a total score calculating module 106, configured to assign scores to the features by which the preliminary abnormal points are identified in the valid trees, and to calculate a total score according to the number of independent trees in which the preliminary abnormal points are identified and the number of marked positive samples;
a reconstruction module 107, configured to calculate a feature selection probability according to the total score and reconstruct an independent tree and an independent forest;
and an anomaly detection module 108, configured to perform anomaly detection according to the reconstructed independent tree and the independent forest.
Wherein the data processing module 101 is further configured to:
arranging the operation and maintenance data by columns to form a matrix;
zero-averaging each row of the matrix;
solving the covariance matrix of the zero-averaged matrix;
solving the eigenvalues and the corresponding eigenvectors of the covariance matrix;
and arranging the eigenvectors by rows, in order of eigenvalue magnitude, into a feature matrix that serves as the sample.
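The dimension-reduction steps listed above follow the usual principal component analysis recipe, which can be sketched in plain Python. Several details are assumptions rather than statements from the text: the covariance is normalised by the number of columns, the eigenpairs are obtained by power iteration with deflation (any symmetric eigensolver would do), and the final projection onto the top k eigenvectors is the standard PCA step, which this passage does not spell out. All function names are hypothetical.

```python
import math


def zero_mean_rows(matrix):
    """Subtract each row's mean from that row (rows are features)."""
    centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered.append([x - mean for x in row])
    return centered


def covariance(matrix):
    """Covariance of the rows, assuming 1/m normalisation (m = column count)."""
    m = len(matrix[0])
    return [[sum(a * b for a, b in zip(r1, r2)) / m for r2 in matrix]
            for r1 in matrix]


def eigenpairs(cov, iterations=200):
    """(eigenvalue, eigenvector) pairs of a symmetric matrix, largest first."""
    d = len(cov)
    work = [row[:] for row in cov]
    pairs = []
    for _pair in range(d):
        v = [float(i + 1) for i in range(d)]  # arbitrary start vector
        for _step in range(iterations):
            w = [sum(work[r][c] * v[c] for c in range(d)) for r in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # remaining matrix is numerically zero
            v = [x / norm for x in w]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        value = sum(v[r] * sum(work[r][c] * v[c] for c in range(d))
                    for r in range(d))
        pairs.append((value, v))
        # Deflation: remove the component that was just found.
        work = [[work[r][c] - value * v[r] * v[c] for c in range(d)]
                for r in range(d)]
    return pairs


def pca(data, k):
    """Return (eigenvalues, data projected onto the top-k eigenvectors)."""
    centered = zero_mean_rows(data)
    pairs = eigenpairs(covariance(centered))
    top = [vec for _value, vec in pairs[:k]]  # eigenvectors arranged by rows
    projected = [[sum(p * col for p, col in zip(vec, column))
                  for column in zip(*centered)] for vec in top]
    return [value for value, _vec in pairs], projected


# Two features (rows) observed in five records (columns); illustrative data.
eigenvalues, reduced = pca([[1, 1, 2, 4, 2], [1, 3, 3, 4, 4]], k=1)
print(eigenvalues, reduced)
```

In this small example the two eigenvalues are approximately 2.0 and 0.4, so keeping only the first component retains most of the variance while halving the number of rows.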
The preliminary forest establishment module 102 is further configured to establish an independent tree according to the sample and form an independent forest, including:
randomly selecting a feature as the root node;
selecting a value between the maximum and minimum values of the root-node feature as the split point, and splitting off two child nodes;
dividing the samples into two groups that enter the two child nodes respectively;
repeating the following step until the path reaches a preset length or a child node contains only one sample, thereby forming an independent tree: in each child node, selecting a value of one feature as the split point to split that node into two child nodes, and dividing the node's samples into two groups that enter the two new child nodes;
and forming an independent forest from the independent trees generated with different features as root nodes.
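The tree-building steps above can be sketched as a short recursive routine. This is an illustrative sketch under assumptions: the dictionary-based node layout, the function names, the uniform draw of the split point, and the choice to build exactly one tree per feature are all hypothetical simplifications, and the routine stops splitting a node when the chosen feature has a single value there.

```python
import random


def build_tree(samples, root_feature=None, max_depth=8, depth=0, rng=random):
    """Split samples recursively until one remains or max_depth is reached."""
    if depth >= max_depth or len(samples) <= 1:
        return {"size": len(samples)}  # leaf node
    if depth == 0 and root_feature is not None:
        feature = root_feature  # caller fixes the root-node feature
    else:
        feature = rng.randrange(len(samples[0]))
    values = [s[feature] for s in samples]
    low, high = min(values), max(values)
    if low == high:
        return {"size": len(samples)}  # cannot split here (simplification)
    split = rng.uniform(low, high)  # value between the minimum and maximum
    left = [s for s in samples if s[feature] < split]
    right = [s for s in samples if s[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, None, max_depth, depth + 1, rng),
        "right": build_tree(right, None, max_depth, depth + 1, rng),
    }


def path_length(tree, sample):
    """Number of edges from the root to the leaf that the sample falls into."""
    length = 0
    while "feature" in tree:
        branch = "left" if sample[tree["feature"]] < tree["split"] else "right"
        tree = tree[branch]
        length += 1
    return length


def build_forest(samples, max_depth=8, rng=random):
    """One tree per feature, each using a different feature as its root."""
    return [build_tree(samples, f, max_depth, 0, rng)
            for f in range(len(samples[0]))]


rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)] + [(10.0, 10.0)]
forest = build_forest(data, max_depth=6, rng=rng)
print([path_length(tree, data[-1]) for tree in forest])
```

The printed values are the path lengths of the deliberately distant last point in each tree; points that are easy to separate tend to reach a leaf after fewer splits, which is what the anomaly score exploits.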
In the preliminary judgment module 103, the preliminary abnormal score of each sample is calculated by the following formula:
wherein the result of the formula is the preliminary abnormal score, L(p) represents the path length to the leaf node at which sample p is located in an independent tree, and E(L(p)) represents the average of the path lengths of sample p over the independent trees in the independent forest.
The identification module 105 is further configured to:
determining, as a valid tree, each independent tree in which the preliminary abnormal point is identified with a path length not exceeding the preset value.
In the total score calculating module 106, the total score value is calculated by the following formula:
wherein the terms of the formula denote, respectively, the score assigned to a feature of the preliminary abnormal point P, the sum of the scores of the relevant features of the preliminary abnormal point P, and the total score; N represents the number of independent trees in which the preliminary abnormal point P is identified, and n represents the number of marked positive samples.
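The aggregation described above (over the N valid trees of each marked point, then over the n marked positive samples) can be sketched as follows. The exact formula is not preserved in this text, so this is only an illustration under assumptions: the per-tree feature scores are taken as inputs supplied by the caller, plain sums are used, and any normalisation by N or n that the original formula may contain is not reproduced. The function names and the feature names in the example are hypothetical.

```python
from collections import defaultdict


def valid_trees(path_lengths, max_path):
    """Indices of trees that isolate the marked point within max_path edges."""
    return [t for t, length in enumerate(path_lengths) if length <= max_path]


def total_scores(scores_per_sample):
    """Sum feature scores over valid trees, then over marked positive samples.

    scores_per_sample: one entry per marked positive sample; each entry is a
    list holding one {feature: score} dict per valid tree of that sample.
    """
    totals = defaultdict(float)
    for per_tree_scores in scores_per_sample:      # n marked positive samples
        for tree_scores in per_tree_scores:        # N valid trees per sample
            for feature, score in tree_scores.items():
                totals[feature] += score
    return dict(totals)


print(valid_trees([2, 9, 3, 7], max_path=4))
print(total_scores([
    [{"cpu": 1.0, "mem": 0.5}, {"cpu": 1.0}],  # first marked sample, two valid trees
    [{"disk": 1.0, "cpu": 0.5}],               # second marked sample, one valid tree
]))
```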
In the reconstruction module 107, the feature selection probability is calculated by the following formula:
wherein the terms of the formula denote, respectively, the selection probability of the m-th feature, the total score, and the m-th feature.
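One natural reading of this step, assumed here because the formula is not preserved, is that each feature's selection probability is its total score divided by the sum of all total scores, so that the probabilities sum to 1 and a feature with a zero total score gets probability zero. The function name below is hypothetical.

```python
def feature_selection_probabilities(total_scores):
    """Normalise total scores into probabilities (assumed proportional rule)."""
    total = sum(total_scores)
    if total <= 0:
        raise ValueError("at least one feature needs a positive total score")
    return [score / total for score in total_scores]


print(feature_selection_probabilities([3.0, 0.0, 1.0]))
```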
The reconstruction module 107 is further configured to:
sampling a random variable U, wherein the random variable U obeys uniform distribution between 0 and 1;
selecting the i-th feature as the root node, the i-th feature satisfying the following condition, in which the symbol represents the selection probability of the m-th feature;
selecting a value between the maximum and minimum values of the root-node feature as the split point, and splitting off two child nodes;
dividing the samples into two groups that enter the two child nodes respectively;
repeating the following step until the path reaches the preset length or a child node contains only one sample: in each child node, randomly selecting a value of one feature as the split point to split that node into two child nodes, and dividing the node's samples into two groups that enter the two new child nodes;
and forming a new independent forest from the independent trees generated with different features as root nodes.
The anomaly detection module 108 is further configured to:
calculating the final abnormal score of each sample according to the reconstructed independent tree and the reconstructed independent forest, and marking the samples with the final abnormal scores larger than a preset value as abnormal points;
the final anomaly score is calculated by the following formula:
wherein the terms of the formula denote, respectively, the final anomaly score, the path length to the leaf node at which sample p is located in a reconstructed independent tree, and the average of the path lengths of sample p over all independent trees in the reconstructed independent forest.
Referring to fig. 4, in some embodiments, an electronic device is provided, which includes a processor 1 and a memory 2, where the memory 2 stores a plurality of instructions, and the processor 1 is configured to read the plurality of instructions and execute the above method.
According to the intelligent operation and maintenance anomaly detection method and device, the operation and maintenance data are subjected to dimensionality reduction before anomaly detection, which simplifies the sample data used for anomaly detection, saves computation time, and improves the efficiency of the anomaly detection algorithm. Positive samples are marked manually, which adds labeled, supervised learning to the unsupervised independent forest algorithm, so that the advantages of the two approaches are combined and the accuracy of the algorithm is improved while its efficiency is maintained. Scores are assigned to all features involved in the samples through multiple valid trees of multiple positive samples, and the total score of each feature is calculated to describe how large a role that feature plays in the anomaly detection process; the total score then serves as the basis for selecting root nodes when the independent trees are reconstructed, which improves the identification accuracy of the reconstructed independent forest. Root nodes are selected by sampling a uniformly distributed random variable, which ensures that each feature is selected with its feature selection probability and thereby ensures the accuracy of the reconstructed independent forest.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (8)
1. An intelligent operation and maintenance abnormity detection method is characterized by comprising the following steps:
acquiring operation and maintenance data and performing dimension reduction processing to obtain a sample of the operation and maintenance data;
establishing an independent tree according to the sample and forming an independent forest;
calculating a preliminary abnormal score of each sample according to the independent tree and the independent forest, and marking the samples with the preliminary abnormal scores larger than a preset value as preliminary abnormal points;
marking some of the positive samples;
identifying valid trees according to the marked preliminary abnormal points;
assigning scores to the features by which the preliminary abnormal points are identified in the valid trees, and calculating a total score according to the number of independent trees in which the preliminary abnormal points are identified and the number of marked positive samples;
calculating feature selection probability according to the total score and reconstructing an independent tree and an independent forest;
carrying out anomaly detection according to the reconstructed independent trees and independent forests;
wherein calculating the feature selection probability according to the total score and reconstructing the independent tree and the independent forest comprises:
sampling a random variable U, wherein the random variable U obeys a uniform distribution between 0 and 1;
selecting the i-th feature as the root node, the i-th feature satisfying the following condition, in which the symbol represents the selection probability of the m-th feature;
selecting a value between the maximum and minimum values of the root-node feature as the split point, and splitting off two child nodes;
dividing the samples into two groups that enter the two child nodes respectively;
repeating the following step until the path reaches the preset length or a child node contains only one sample: in each child node, randomly selecting a value of one feature as the split point to split that node into two child nodes, and dividing the node's samples into two groups that enter the two new child nodes;
and forming a new independent forest from the independent trees generated with different features as root nodes.
2. The method of claim 1, wherein collecting operation and maintenance data and performing dimension reduction processing comprises:
arranging the operation and maintenance data by columns to form a matrix;
zero-averaging each row of the matrix;
solving the covariance matrix of the zero-averaged matrix;
solving the eigenvalues and the corresponding eigenvectors of the covariance matrix;
and arranging the eigenvectors by rows, in order of eigenvalue magnitude, into a feature matrix that serves as the sample.
3. The method of claim 2, wherein building independent trees from the samples and composing independent forests comprises:
randomly selecting a feature as the root node;
selecting a value between the maximum and minimum values of the root-node feature as the split point, and splitting off two child nodes;
dividing the samples into two groups that enter the two child nodes respectively;
repeating the following step until the path reaches a preset length or a child node contains only one sample, thereby forming an independent tree: in each child node, selecting a value of one feature as the split point to split that node into two child nodes, and dividing the node's samples into two groups that enter the two new child nodes;
and forming an independent forest from the independent trees generated with different features as root nodes.
4. The method of claim 1, wherein the preliminary anomaly score for each sample is calculated by the formula:
wherein the result of the formula is the preliminary abnormal score, L(p) represents the path length to the leaf node at which sample p is located in an independent tree, and E(L(p)) represents the average of the path lengths of sample p over the independent trees in the independent forest.
5. The method of claim 4, wherein identifying valid trees from the labeled preliminary outliers comprises:
determining, as a valid tree, each independent tree in which the preliminary abnormal point is identified with a path length not exceeding the preset value.
6. The method of claim 4, wherein the total score is calculated by the formula:
wherein the terms of the formula denote, respectively, the score assigned to a feature of the preliminary abnormal point P, the sum of the scores of the relevant features of the preliminary abnormal point P, and the total score; N represents the number of independent trees in which the preliminary abnormal point P is identified, and n represents the number of marked positive samples.
7. The method of claim 1, wherein anomaly detection is performed based on the reconstructed independent trees and independent forests, comprising:
calculating the final abnormal score of each sample according to the reconstructed independent tree and the reconstructed independent forest, and marking the sample with the final abnormal score larger than a preset value as an abnormal point;
the final anomaly score is calculated by the following formula:
wherein the terms of the formula denote, respectively, the final anomaly score, the path length to the leaf node at which sample p is located in a reconstructed independent tree, and the average of the path lengths of sample p over all independent trees in the reconstructed independent forest.
8. An intelligent operation and maintenance anomaly detection device applied to the method of any one of claims 1-7, comprising:
the data processing module is used for acquiring operation and maintenance data and performing dimension reduction processing to obtain a sample of the operation and maintenance data;
the preliminary forest establishment module is used for establishing an independent tree according to the sample and forming an independent forest;
the preliminary judgment module is used for calculating the preliminary abnormal score of each sample according to the independent tree and the independent forest and marking the sample with the preliminary abnormal score larger than a preset value as a preliminary abnormal point;
the marking module is used for marking some of the positive samples;
the identification module is used for identifying valid trees according to the marked preliminary abnormal points;
the total score calculating module is used for assigning scores to the features by which the preliminary abnormal points are identified in the valid trees, and calculating the total score according to the number of independent trees in which the preliminary abnormal points are identified and the number of marked positive samples;
the reconstruction module is used for calculating the feature selection probability according to the total score and reconstructing an independent tree and an independent forest;
the anomaly detection module is used for carrying out anomaly detection according to the reconstructed independent tree and the independent forest;
the reconstruction module is further configured to:
sampling a random variable U, wherein the random variable U obeys uniform distribution between 0 and 1;
selecting the i-th feature as the root node, the i-th feature satisfying the following condition, in which the symbol represents the selection probability of the m-th feature;
selecting a value between the maximum and minimum values of the root-node feature as the split point, and splitting off two child nodes;
dividing the samples into two groups that enter the two child nodes respectively;
repeating the following step until the path reaches the preset length or a child node contains only one sample: in each child node, randomly selecting a value of one feature as the split point to split that node into two child nodes, and dividing the node's samples into two groups that enter the two new child nodes;
and forming a new independent forest from the independent trees generated with different features as root nodes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210492320.8A CN114580580B (en) | 2022-05-07 | 2022-05-07 | Intelligent operation and maintenance abnormity detection method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114580580A (en) | 2022-06-03 |
CN114580580B (en) | 2022-08-16 |
Family
ID=81769157
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210492320.8A Active CN114580580B (en) | 2022-05-07 | 2022-05-07 | Intelligent operation and maintenance abnormity detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114580580B (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109859029A (en) * | 2019-01-04 | 2019-06-07 | 深圳壹账通智能科技有限公司 | Abnormal application detection method, device, computer equipment and storage medium |
CN110149258A (en) * | 2019-04-12 | 2019-08-20 | 北京航空航天大学 | A kind of automobile CAN-bus network data method for detecting abnormality based on isolated forest |
CN111784392A (en) * | 2020-06-29 | 2020-10-16 | 中国平安财产保险股份有限公司 | Abnormal user group detection method, device and equipment based on isolated forest |
CN111833172A (en) * | 2020-05-25 | 2020-10-27 | 百维金科(上海)信息科技有限公司 | Consumption credit fraud detection method and system based on isolated forest |
CN112199670A (en) * | 2020-09-30 | 2021-01-08 | 西安理工大学 | Log monitoring method for improving IFOREST (entry face detection sequence) to conduct abnormity detection based on deep learning |
CN112505549A (en) * | 2020-11-26 | 2021-03-16 | 西安电子科技大学 | New energy automobile battery abnormity detection method based on isolated forest algorithm |
CN112990330A (en) * | 2021-03-26 | 2021-06-18 | 国网河北省电力有限公司营销服务中心 | User energy abnormal data detection method and device |
CN113392914A (en) * | 2021-06-22 | 2021-09-14 | 北京邮电大学 | Anomaly detection algorithm for constructing isolated forest based on weight of data features |
WO2021218314A1 (en) * | 2020-04-27 | 2021-11-04 | 深圳壹账通智能科技有限公司 | Event identification method and apparatus based on position locating, and device and storage medium |
CN113627521A (en) * | 2021-08-09 | 2021-11-09 | 西华大学 | Intelligent logistics unmanned aerial vehicle abnormal behavior identification method based on isolated forest method |
CN113886375A (en) * | 2021-09-29 | 2022-01-04 | 东北电力大学 | Wind power data cleaning method based on isolated forest and local outlier factors |
CN113887674A (en) * | 2021-12-06 | 2022-01-04 | 深圳索信达数据技术有限公司 | Abnormal behavior detection method and system based on big data |
CN114386483A (en) * | 2021-12-17 | 2022-04-22 | 深圳索信达数据技术有限公司 | Method, apparatus, device, and medium for quantifying feature distinguishing capability |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP7063022B2 (en) * | 2018-03-14 | 2022-05-09 | オムロン株式会社 | Anomaly detection system, support device and model generation method |
CN109345137A (en) * | 2018-10-22 | 2019-02-15 | 广东精点数据科技股份有限公司 | A kind of rejecting outliers method based on agriculture big data |
CN109886724B (en) * | 2018-12-29 | 2021-02-12 | 中南大学 | Robust resident travel track identification method |
Non-Patent Citations (1)
Title |
---|
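Application of the isolation forest algorithm to anomaly identification in dam monitoring data; Zhang Hailong et al.; 《人民黄河》 (Yellow River); 2020-08-10 (No. 08); pp. 154-157 * |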
Also Published As
Publication number | Publication date |
---|---|
CN114580580A (en) | 2022-06-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||