CN114580580B - Intelligent operation and maintenance abnormity detection method and device - Google Patents

Intelligent operation and maintenance abnormity detection method and device

Info

Publication number
CN114580580B
Authority
CN
China
Prior art keywords
independent
sample
tree
preliminary
forest
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210492320.8A
Other languages
Chinese (zh)
Other versions
CN114580580A (en)
Inventor
朱松涛
邵俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Suoxinda Data Technology Co ltd
Original Assignee
Shenzhen Suoxinda Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Suoxinda Data Technology Co ltd filed Critical Shenzhen Suoxinda Data Technology Co ltd
Priority to CN202210492320.8A priority Critical patent/CN114580580B/en
Publication of CN114580580A publication Critical patent/CN114580580A/en
Application granted granted Critical
Publication of CN114580580B publication Critical patent/CN114580580B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Complex Calculations (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses an intelligent operation and maintenance abnormity detection method and device, wherein the method comprises the following steps: acquiring operation and maintenance data and performing dimension reduction processing to obtain a sample of the operation and maintenance data; establishing an independent tree according to the sample and forming an independent forest; calculating a preliminary abnormal score of each sample according to the independent tree and the independent forest, and marking the samples with the preliminary abnormal scores larger than a preset value as preliminary abnormal points; marking part of the positive sample; identifying an effective tree according to the marked preliminary abnormal points; assigning a score to the features of the identified preliminary outliers in the valid trees, and calculating a total score according to the number of the independent trees of the identified outliers and the number of the marked positive samples; calculating feature selection probability according to the total score and reconstructing an independent tree and an independent forest; carrying out anomaly detection according to the reconstructed independent tree and the independent forest; according to the method, the independent tree and the independent forest are reconstructed according to the preliminarily identified abnormal points, and the abnormal detection efficiency and the accuracy are high.

Description

Intelligent operation and maintenance abnormity detection method and device
Technical Field
The invention relates to the field of anomaly detection and calculation, in particular to an intelligent operation and maintenance anomaly detection method and device.
Background
In an intelligent operation and maintenance scene, operation and maintenance personnel often need to capture abnormal signals in time from a plurality of indexes related to system transactions and diagnose the abnormal signals, so that the aims of rapidly troubleshooting and avoiding accidents are fulfilled. The indexes associated with the system transaction include page opening delay, user click rate, CPU utilization rate and the like. The challenges often faced in this scenario are that the metrics that need to be tracked are very large in dimension, it is difficult to capture outliers in time, and there is no label to mark whether the sample is an outlier. In the existing anomaly detection technology, the conventional unsupervised training has poor accuracy, and if each sample point is labeled manually, the cost is very high.
For example, patent document CN111026925A discloses an anomaly detection method and device for parallelization of an isolated forest algorithm based on Flink, which extracts a data set to be tested from historical data to construct a binary tree, further forms an independent forest, scores the anomaly according to the depth of a sample point in each independent binary tree, and determines whether a sample in the data set is abnormal according to the anomaly score.
This scheme uses an unsupervised detection algorithm to detect anomalies in the samples and scores the degree of abnormality of each sample point through the independent trees, so that abnormal points can be identified in time. However, abnormal points are determined only from the abnormality score in the independent trees, which is inefficient and inaccurate.
Disclosure of Invention
The invention provides an intelligent operation and maintenance abnormity detection method and device, which are used for reconstructing an independent tree and an independent forest according to an initially identified abnormal point, realizing the integration of an unsupervised independent forest algorithm and supervised learning, and having high abnormity detection efficiency and high accuracy.
An intelligent operation and maintenance abnormity detection method comprises the following steps:
acquiring operation and maintenance data and performing dimension reduction processing to obtain a sample of the operation and maintenance data;
establishing an independent tree according to the sample and forming an independent forest;
calculating a preliminary abnormal score of each sample according to the independent tree and the independent forest, and marking the samples with the preliminary abnormal scores larger than a preset value as preliminary abnormal points;
marking part of the positive sample;
identifying an effective tree according to the marked preliminary abnormal points;
assigning a score to the features of the identified preliminary outliers in the valid trees, and calculating a total score according to the number of the independent trees of the identified preliminary outliers and the number of the marked positive samples;
calculating feature selection probability according to the total score and reconstructing an independent tree and an independent forest;
and carrying out anomaly detection according to the reconstructed independent tree and the independent forest.
Further, the operation and maintenance data are collected and subjected to dimension reduction treatment, and the method comprises the following steps:
forming a matrix by each operation and maintenance data according to columns;
zero-averaging each row of the matrix;
solving a covariance matrix of the matrix after zero-mean processing;
solving the eigenvalues and corresponding eigenvectors of the covariance matrix;
and arranging the eigenvectors by rows into a feature matrix in order of eigenvalue size, as the sample.
Further, establishing independent trees according to the samples and forming independent forests, comprising:
randomly selecting a feature as a root node;
selecting a value between the maximum and minimum values of the root node's feature as the dividing basis, and dividing into two child nodes;
dividing samples into two groups and respectively entering the two sub-nodes;
repeatedly executing the following steps until the path reaches a preset length or the child node only contains one sample to form an independent tree: selecting a characteristic value of one characteristic from each child node as a dividing basis to divide the two child nodes again, and dividing the rest samples into two groups again to enter the two child nodes;
and the independent trees generated by taking different characteristics as root nodes form an independent forest.
Further, the preliminary anomaly score for each sample is calculated by the following formula:
[formula image 155010DEST_PATH_IMAGE001]
wherein [formula image 478675DEST_PATH_IMAGE002] represents the preliminary anomaly score, L(p) represents the path length of the leaf node where sample p is located in an independent tree, and E(L(p)) represents the average of the path lengths of sample p over the independent trees of the independent forest;
[formula image 133779DEST_PATH_IMAGE003]
[formula image 325726DEST_PATH_IMAGE004] indicates the number of samples.
Further, identifying the valid tree according to the marked preliminary abnormal points comprises:
determining as a valid tree each independent tree that identifies the preliminary abnormal point with a path length not exceeding the preset value.
Further, the total score is calculated by the following formula:
[formula image 687568DEST_PATH_IMAGE005]
[formula image 916555DEST_PATH_IMAGE006]
wherein [formula image 980326DEST_PATH_IMAGE007] represents the score assigned to a feature of the preliminary outlier P, N represents the number of independent trees in which the preliminary outlier P is identified, [formula image 133221DEST_PATH_IMAGE008] represents the sum of the scores of the relevant feature of the preliminary outlier P, [formula image 864417DEST_PATH_IMAGE009] represents the total score, and n represents the number of marked positive samples.
Further, the feature selection probability is calculated by the following formula:
[formula image 139671DEST_PATH_IMAGE010]
wherein [formula image 628422DEST_PATH_IMAGE011] represents the selection probability of the m-th feature, [formula image 912904DEST_PATH_IMAGE012] represents the total score, and [formula image 233026DEST_PATH_IMAGE013] represents the m-th feature.
Further, calculating feature selection probability according to the total score and reconstructing an independent tree and an independent forest, comprising:
sampling a random variable U, wherein the random variable U obeys uniform distribution between 0 and 1;
selecting the i-th feature [formula image 210341DEST_PATH_IMAGE014] as a root node, where the feature [formula image 514283DEST_PATH_IMAGE014] satisfies the following condition:
[formula image 184213DEST_PATH_IMAGE015]
wherein [formula image 640733DEST_PATH_IMAGE011] represents the selection probability of the m-th feature;
selecting a value between the maximum and minimum values of the root node's feature as the dividing basis, and dividing into two child nodes;
dividing samples into two groups and respectively entering the two sub-nodes;
the following steps are repeatedly executed until the path reaches the preset length or the child node only contains one sample: randomly selecting a characteristic value of a characteristic vector from each child node as a dividing basis to divide the two child nodes again, and dividing the rest samples into two groups again to enter the two child nodes;
and the independent trees generated by taking different characteristics as root nodes are recombined into an independent forest.
Further, the anomaly detection is carried out according to the reconstructed independent trees and the independent forests, and comprises the following steps:
calculating the final abnormal score of each sample according to the reconstructed independent tree and the reconstructed independent forest, and marking the sample with the final abnormal score larger than a preset value as an abnormal point;
the final anomaly score is calculated by the following formula:
[formula image 507058DEST_PATH_IMAGE016]
wherein [formula image 721133DEST_PATH_IMAGE017] represents the final anomaly score, [formula image 862264DEST_PATH_IMAGE018] represents the path length of the leaf node where sample p is located in a reconstructed independent tree, and [formula image 970028DEST_PATH_IMAGE019] represents the average of the path lengths of sample p over the independent trees of the reconstructed independent forest;
[formula image 289145DEST_PATH_IMAGE020]
[formula image 380729DEST_PATH_IMAGE004] indicates the number of samples.
An intelligent operation and maintenance abnormity detection device comprises:
the data processing module is used for acquiring operation and maintenance data and performing dimension reduction processing to obtain a sample of the operation and maintenance data;
the preliminary forest establishment module is used for establishing an independent tree according to the sample and forming an independent forest;
the preliminary judgment module is used for calculating the preliminary abnormal score of each sample according to the independent tree and the independent forest and marking the sample with the preliminary abnormal score larger than a preset value as a preliminary abnormal point;
the marking module is used for marking part of the positive samples;
the identification module is used for identifying the effective tree according to the marked preliminary abnormal points;
the total score calculating module is used for giving scores to the features of the identified primary abnormal points in the effective trees and calculating the total score according to the number of the independent trees of the identified primary abnormal points and the number of the marked positive samples;
the reconstruction module is used for calculating the feature selection probability according to the total score and reconstructing an independent tree and an independent forest;
and the anomaly detection module is used for carrying out anomaly detection according to the reconstructed independent tree and the independent forest.
The intelligent operation and maintenance abnormity detection method and device provided by the invention at least have the following beneficial effects:
(1) The operation and maintenance data are subjected to dimension reduction before anomaly detection, which simplifies the sample data used for anomaly detection, saves computation time, and improves the efficiency of the anomaly detection algorithm.
(2) A portion of the positive samples is labeled manually, which adds labeled supervised learning to the unsupervised independent forest algorithm, so that the advantages of the two approaches are combined: the accuracy of the algorithm is improved while its efficiency is maintained.
(3) All features involved in detecting the samples are assigned scores through the multiple valid trees of the multiple positive samples, and the total score of each feature is calculated to describe how much that feature contributes to anomaly detection. The total score serves as the basis for selecting root nodes when the independent trees are reconstructed, which improves the identification accuracy of the reconstructed independent forest.
(4) The root node is selected by sampling a uniformly distributed random variable, which guarantees that each feature is selected with its feature selection probability and thereby ensures the accuracy of the reconstructed independent forest.
Drawings
Fig. 1 is a flowchart of an embodiment of an intelligent operation and maintenance anomaly detection method provided by the present invention.
Fig. 2 is a flowchart of an embodiment of a method for reconstructing an independent tree and an independent forest in the method provided by the present invention.
Fig. 3 is a schematic structural diagram of an embodiment of the intelligent operation and maintenance abnormality detection apparatus provided in the present invention.
Fig. 4 is a schematic structural diagram of an embodiment of an electronic device provided in the present invention.
Reference numerals: 1-a processor, 2-a storage device, 101-a data processing module, 102-a preliminary forest establishment module, 103-a preliminary judgment module, 104-a marking module, 105-an identification module, 106-a total score calculation module, 107-a reconstruction module and 108-an abnormity detection module.
Detailed Description
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
Referring to fig. 1, in some embodiments, an intelligent operation and maintenance anomaly detection method is provided, including:
S1, collecting operation and maintenance data and performing dimension reduction processing to obtain a sample of the operation and maintenance data;
S2, establishing an independent tree according to the sample and forming an independent forest;
S3, calculating a preliminary abnormal score of each sample according to the independent tree and the independent forest, and marking the samples with the preliminary abnormal scores larger than a preset value as preliminary abnormal points;
S4, marking part of the positive samples;
S5, identifying a valid tree according to the marked preliminary abnormal points;
S6, assigning scores to the features of the identified preliminary outliers in the valid trees, and calculating the total score according to the number of the independent trees of the identified preliminary outliers and the number of the marked positive samples;
S7, calculating the feature selection probability according to the total score, and reconstructing an independent tree and an independent forest;
and S8, carrying out abnormity detection according to the reconstructed independent tree and the independent forest.
The intelligent operation and maintenance data comprises a plurality of characteristics related to the operation of equipment, a system and a network environment, including but not limited to: network delay, request concurrency number and database capacity. In the collected operation and maintenance data, one dimension corresponds to one feature, that is, the operation and maintenance data are multidimensional data, so that the operation and maintenance data need to be subjected to dimension reduction before anomaly detection.
Specifically, in step S1, the operation and maintenance data collection and the dimension reduction processing include:
S11, forming a matrix from the operation and maintenance data by columns;
S12, zero-centering each row of the matrix;
S13, solving the covariance matrix of the zero-centered matrix;
S14, solving the eigenvalues and corresponding eigenvectors of the covariance matrix;
and S15, arranging the eigenvectors by rows into a feature matrix in order of eigenvalue size, as the sample.
As a preferred embodiment, the operation and maintenance data are subjected to PCA (Principal Component Analysis) dimension reduction. To reduce k pieces of M-dimensional data to m dimensions, the original operation and maintenance data are first arranged by columns into a matrix X_0 with M rows and k columns. The mean of each row is then subtracted from the data in that row of X_0 to obtain the zero-mean matrix X, and the covariance matrix of X is computed:
[formula image 404180DEST_PATH_IMAGE021]
The eigenvalues and corresponding eigenvectors of the covariance matrix are then solved. The eigenvectors are arranged as the rows of a matrix from top to bottom in descending order of their eigenvalues, and the first m rows are taken to form a matrix P, giving samples reduced to m dimensions. The features after dimension reduction are [formula image 100872DEST_PATH_IMAGE022], [formula image 840158DEST_PATH_IMAGE023], ..., [formula image 232087DEST_PATH_IMAGE024].
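The PCA steps S11 to S15 can be written compactly as follows. This is a sketch of the textbook procedure rather than the formula from the source, which survives only as an image: the 1/k normalisation of the covariance, the projection Y = PX, and the symbols \bar{x}, C, \lambda_j, v_j and Y are assumptions introduced here for illustration.

```latex
\begin{aligned}
\bar{x} &= \tfrac{1}{k}\, X_0 \mathbf{1}_k \in \mathbb{R}^{M}, &
X &= X_0 - \bar{x}\,\mathbf{1}_k^{\top}, \\
C &= \tfrac{1}{k}\, X X^{\top} \in \mathbb{R}^{M \times M}, &
C v_j &= \lambda_j v_j, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_M \ge 0, \\
P &= \begin{bmatrix} v_1^{\top} \\ \vdots \\ v_m^{\top} \end{bmatrix} \in \mathbb{R}^{m \times M}, &
Y &= P X \in \mathbb{R}^{m \times k}.
\end{aligned}
```

Under these assumptions, each column of Y is one operation and maintenance record expressed in the m retained features.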
In step S2, establishing an independent tree according to the sample and forming an independent forest, including:
S21, randomly selecting a feature as a root node;
S22, selecting a value between the maximum and minimum values of the root node's feature as the dividing basis, and dividing into two child nodes;
S23, dividing the samples into two groups, which enter the two child nodes respectively;
S24, repeatedly executing the following step until the path reaches a preset length or a child node contains only one sample, to form an independent tree: selecting a value of one feature in each child node as the dividing basis to divide it into two child nodes again, and dividing the remaining samples into two groups that enter those two child nodes;
and S25, forming an independent forest by the independent trees generated by taking different characteristics as root nodes.
The anomaly detection method provided by this embodiment adopts the independent forest algorithm, an unsupervised anomaly detection method suitable for continuous data that detects abnormal values by isolating sample points. Each independent tree in the algorithm is essentially a decision tree: each sample flows from the root node down to the child nodes according to each node's dividing rule and finally falls onto one leaf node. There is no uniform rule for the number of independent trees to generate, and this number is not directly related to the number of samples. The independent trees are independent of one another, and the judgment of every independent tree on a sample must be considered together when the independent forest algorithm is used for anomaly scoring.
In steps S21-S25, since the abnormal data sample is relatively isolated from other data samples, the number of partitions required for the abnormal sample to be partitioned separately is relatively small compared to other samples, i.e., the path length of the abnormal sample in the independent tree is relatively short. Therefore, the possibility that the sample is an abnormal sample can be judged according to the path length of each sample which is divided out separately, and the sample is represented by a preliminary abnormal score, and the sample with the preliminary abnormal score larger than the preset value is marked as a preliminary abnormal point.
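The tree construction of steps S21 to S24 and the resulting path length can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `build_tree` and `path_length`, the dict-based node layout, the default `max_depth`, and the toy data are all hypothetical.

```python
import random

def build_tree(samples, depth=0, max_depth=8, rng=random):
    """Build one independent (isolation) tree from a list of feature vectors."""
    # Stop when the preset path length is reached or at most one sample remains.
    if depth >= max_depth or len(samples) <= 1:
        return {"size": len(samples)}
    feature = rng.randrange(len(samples[0]))         # randomly chosen feature
    values = [s[feature] for s in samples]
    low, high = min(values), max(values)
    if low == high:                                  # constant feature: nothing to split
        return {"size": len(samples)}
    split = rng.uniform(low, high)                   # value between the min and max
    left = [s for s in samples if s[feature] < split]
    right = [s for s in samples if s[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }

def path_length(tree, sample):
    """Count the edges from the root to the leaf that the sample falls into."""
    length = 0
    while "feature" in tree:
        branch = "left" if sample[tree["feature"]] < tree["split"] else "right"
        tree = tree[branch]
        length += 1
    return length

# Toy data (hypothetical): ten clustered one-feature samples and one distant outlier.
rng = random.Random(0)
data = [[x / 10.0] for x in range(10)] + [[100.0]]
forest = [build_tree(data, rng=rng) for _ in range(50)]
mean_outlier = sum(path_length(t, [100.0]) for t in forest) / len(forest)
mean_inlier = sum(path_length(t, [0.5]) for t in forest) / len(forest)
print(mean_outlier, mean_inlier)
```

On this toy data the isolated sample is separated after far fewer splits than a sample in the middle of the cluster, which is the property that the preliminary anomaly score relies on.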
Specifically, the preliminary abnormality score of each sample in step S3 is calculated by the following formula:
[formula image 793649DEST_PATH_IMAGE025]
wherein [formula image 62957DEST_PATH_IMAGE002] represents the preliminary anomaly score, L(p) represents the path length of the leaf node where sample p is located in an independent tree, and E(L(p)) represents the average of the path lengths of sample p over the independent trees of the independent forest;
[formula image 723876DEST_PATH_IMAGE003]
[formula image 321211DEST_PATH_IMAGE004] indicates the number of samples.
As a preferred embodiment, samples with a preliminary anomaly score greater than 0.9, derived according to the above formula, are labeled as preliminary anomaly points.
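The scoring and thresholding step can be sketched as below. Because the source formula is available only as an image, this sketch assumes the standard isolation-forest score 2^(-E(L(p))/c(n)) with the usual normaliser c(n); the function names, the treatment of n = 2, and the example path lengths are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_normalizer(n):
    """c(n): expected path length of an unsuccessful binary-search-tree lookup over n samples."""
    if n < 2:
        raise ValueError("at least two samples are required")
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA      # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """ASSUMED standard form: score in (0, 1], shorter average paths score closer to 1."""
    return 2.0 ** (-mean_path_length / average_path_normalizer(n))

def preliminary_outliers(mean_path_lengths, threshold=0.9):
    """Indices of samples whose score exceeds the preset value (0.9 in the preferred embodiment)."""
    n = len(mean_path_lengths)
    return [i for i, e in enumerate(mean_path_lengths) if anomaly_score(e, n) > threshold]

# Hypothetical example: one sample isolated after a single split, 255 ordinary samples.
flagged = preliminary_outliers([1.0] + [8.0] * 255)
print(flagged)
```

With a threshold as high as 0.9, only samples that are isolated within very few splits are flagged under this assumed formula.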
In step S4, a small number of positive samples are labeled manually; the manually labeled positive samples are denoted as {[formula image 342257DEST_PATH_IMAGE026]}. Labeling part of the positive samples provides the basis for integrating the unsupervised independent forest algorithm with supervised learning, so that the advantages of both can be combined, improving the accuracy of the algorithm while maintaining its efficiency. In addition, compared with labeling all samples, the cost of manual labeling is reduced.
Because the preliminary abnormal points are not identified with high precision, the independent trees and the independent forest need to be reconstructed.
In step S5, identifying a valid tree according to the marked preliminary abnormal point includes:
determining as a valid tree each independent tree that identifies the preliminary abnormal point with a path length not exceeding the preset value.
In step S6, assigning a score to the feature of the valid tree in which the preliminary outlier is identified, and calculating a total score according to the number of the independent trees in which the outlier is identified and the number of the marked positive samples, includes:
S61, assigning each feature a zero value as its initial score;
S62, executing the following step for the preliminary abnormal points until all valid trees and all preliminary abnormal points have been traversed, to obtain the total score of a given feature: assigning a feature in a valid tree that identifies a preliminary outlier the score [formula image 951224DEST_PATH_IMAGE027], wherein [formula image 907678DEST_PATH_IMAGE028] is the path length of the preliminary outlier in valid tree i;
S63, executing step S62 for all features to obtain the total scores of all features.
In step S62, the total score is calculated by the following formula:
[formula image 738448DEST_PATH_IMAGE005]
[formula image 313917DEST_PATH_IMAGE029]
wherein [formula image 636445DEST_PATH_IMAGE030] represents the score assigned to a feature of the preliminary outlier P, N represents the number of independent trees in which the preliminary outlier P is identified, [formula image 701485DEST_PATH_IMAGE031] represents the sum of the scores of the relevant feature of the preliminary outlier P, [formula image 398045DEST_PATH_IMAGE032] represents the total score, and n represents the number of marked positive samples.
In certain embodiments, the maximum path length of each independent tree does not exceed D, and the independent trees that identify the preliminary outlier P with a path length not exceeding D-1 are determined to be valid trees; the preliminary outlier P has N valid trees in total. The initial score of each feature is 0. For the i-th independent tree in which the preliminary outlier P is validly identified, the features involved in the path that detects the preliminary outlier are assigned the score [formula image 511626DEST_PATH_IMAGE027], wherein [formula image 485398DEST_PATH_IMAGE028] is the path length of point P in the i-th independent tree, [formula image 455759DEST_PATH_IMAGE033]. Suppose the features involved in detecting the preliminary outlier P are [formula image 718244DEST_PATH_IMAGE022], [formula image 884784DEST_PATH_IMAGE023] and [formula image 916325DEST_PATH_IMAGE034]. Then, for the i-th independent tree that detects the preliminary outlier P, each of the three features receives the score [formula image 260849DEST_PATH_IMAGE027]. Thus, over the N valid trees, the total score that the feature [formula image 666423DEST_PATH_IMAGE024] can be assigned by the preliminary outlier P is [formula image 652965DEST_PATH_IMAGE035]. The features of all positive samples are identified and scored in this way, and the feature [formula image 398067DEST_PATH_IMAGE024] finally obtains a total score of [formula image 913493DEST_PATH_IMAGE036]. It should be noted that if a feature is never used in the detection of any preliminary outlier, its score remains zero.
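The valid-tree filter and the score accumulation of steps S5 and S6 can be sketched as follows. The per-tree score is given in the source only as a formula image, so this sketch assumes, purely for illustration, that each feature on the detecting path receives the reciprocal of the path length; the function name, the argument layout, and that scoring rule are hypothetical.

```python
def accumulate_feature_scores(paths_per_point, n_features, max_valid_length):
    """Sum per-feature scores over all marked points and their valid trees.

    paths_per_point: one entry per marked point; each entry holds one path per
        tree, where a path is the list of feature indices split on while
        isolating that point.
    max_valid_length: a tree is valid for a point only if its path length does
        not exceed this value (D-1 in the embodiment above).
    """
    totals = [0.0] * n_features                   # S61: every feature starts at zero
    for tree_paths in paths_per_point:            # loop over marked points
        for path in tree_paths:                   # loop over independent trees
            length = len(path)
            if length == 0 or length > max_valid_length:
                continue                          # not a valid tree for this point
            score = 1.0 / length                  # ASSUMED score, not confirmed by the source
            for feature in set(path):             # each involved feature scores once per tree
                totals[feature] += score
    return totals

# One marked point seen by three trees; the third path is too long to be valid.
totals = accumulate_feature_scores([[[0], [0, 1], [2, 2, 1, 0]]], n_features=3, max_valid_length=2)
print(totals)  # [1.5, 0.5, 0.0]
```

Feature 2 appears only on the invalid path, so its total stays at zero, matching the remark that unused features keep a zero score.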
Referring to fig. 2, in step S7, calculating feature selection probabilities according to the total scores and reconstructing an independent tree and an independent forest includes:
S71, sampling a random variable U, wherein the random variable U obeys a uniform distribution between 0 and 1;
S72, selecting the i-th feature [formula image 806362DEST_PATH_IMAGE037] as a root node, where the feature [formula image 65437DEST_PATH_IMAGE037] satisfies the following condition:
[formula image 602728DEST_PATH_IMAGE038]
wherein [formula image 538323DEST_PATH_IMAGE039] represents the selection probability of the m-th feature;
S73, selecting a value between the maximum and minimum values of the root node's feature as the dividing basis, and dividing into two child nodes;
S74, dividing the samples into two groups, which enter the two child nodes respectively;
S75, repeatedly executing the following step until the path reaches the preset length or a child node contains only one sample: randomly selecting a value of one feature in each child node as the dividing basis to divide it into two child nodes again, and dividing the remaining samples into two groups that enter those two child nodes;
and S76, recombining the independent trees generated by taking different characteristics as root nodes into an independent forest.
In step S72, the feature selection probability is calculated by the following formula:
[formula image 532869DEST_PATH_IMAGE040]
[formula image 579322DEST_PATH_IMAGE041]
wherein [formula image 705541DEST_PATH_IMAGE042] represents the selection probability of the m-th feature, [formula image 890666DEST_PATH_IMAGE043] represents the total score, and [formula image 774440DEST_PATH_IMAGE024] represents the m-th feature.
The procedure of reconstructing the independent tree in step S7 is substantially the same as the procedure of initially constructing the independent tree in step S2, except that the feature selection of the root node is randomly equal in probability when the independent tree is initially constructed, the feature selection probability when the independent tree is reconstructed is determined by the total score of the features, and the higher the total score is, the higher the probability is that the feature is selected as the root node of the reconstructed independent tree. By uniformly distributing and sampling the random variable U and then selecting the root node, the probability that each feature is selected can be ensured to be
Figure 296688DEST_PATH_IMAGE044
. In particular, a feature that was never used to detect any preliminary outlier has a total score of zero, so its feature selection probability is also zero.
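The root node selection described above can be sketched as inverse-CDF (roulette-wheel) sampling. The selection condition itself is given only as a figure, so this sketch assumes the usual form: feature i is chosen when U falls between the cumulative probability of the first i-1 features and that of the first i features. The function name `select_root_feature` and the example probabilities are hypothetical.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def select_root_feature(probs, rng=None):
    """Return the index of the feature chosen as root node.

    Assumed condition: sum(probs[:i]) < U <= sum(probs[:i + 1]).
    """
    rng = rng or random
    cumulative = list(accumulate(probs))
    if cumulative[-1] <= 0:
        raise ValueError("probabilities must have a positive sum")
    # 1 - random() lies in (0, 1]; scaling by the last cumulative value
    # guards against rounding when the probabilities do not sum to exactly 1.
    u = (1.0 - rng.random()) * cumulative[-1]
    # First index whose cumulative probability reaches u.
    return bisect_left(cumulative, u)


rng = random.Random(0)
probs = [0.75, 0.0, 0.25]
counts = [0, 0, 0]
for _ in range(10000):
    counts[select_root_feature(probs, rng)] += 1
```

Because U is strictly positive and the chosen index is the first one whose cumulative probability reaches U, a feature with zero probability is never returned, which matches the zero-score case described in the text.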
In step S8, performing anomaly detection according to the reconstructed independent tree and independent forest, including:
calculating the final abnormal score of each sample according to the reconstructed independent tree and the reconstructed independent forest, and marking the sample with the final abnormal score larger than a preset value as an abnormal point;
the final anomaly score is calculated by the following formula:
Figure 152780DEST_PATH_IMAGE016
wherein,
Figure 305543DEST_PATH_IMAGE017
represents the final anomaly score,
Figure 394722DEST_PATH_IMAGE018
represents the path length to the leaf node where sample p is located in a reconstructed independent tree,
Figure 143497DEST_PATH_IMAGE019
represents the average path length of sample p over all independent trees in the reconstructed independent forest;
Figure 119675DEST_PATH_IMAGE045
Figure 833553DEST_PATH_IMAGE046
represents the number of samples.
In a preferred embodiment, samples whose final anomaly score, calculated by the above formula, is greater than 0.9 are marked as final abnormal points. In the independent trees and independent forest reconstructed according to the feature selection probabilities, features that contributed more to preliminary abnormal point detection appear as root nodes in a higher proportion, so anomaly detection using the reconstructed independent trees and independent forest is more accurate.
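As an illustration of the scoring step, the sketch below assumes that the formula shown in the figures has the standard isolation forest form, s(p) = 2^(-E(L(p)) / c(n)), with c(n) the usual normalizing term for n samples. That form is an assumption, since the formula images are not reproduced in this text; only the 0.9 threshold comes from the description. The names `normalizer`, `anomaly_score`, and `is_final_outlier` are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649


def normalizer(n):
    """Assumed normalizing term c(n) of the standard isolation forest."""
    if n < 2:
        raise ValueError("need at least two samples")
    if n == 2:
        return 1.0
    # 2 * H(n - 1) - 2 * (n - 1) / n, with H(i) approximated by ln(i) + gamma.
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def anomaly_score(path_lengths, n):
    """Score of one sample from its path length in every tree of the forest."""
    mean_length = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_length / normalizer(n))


def is_final_outlier(score, threshold=0.9):
    """Apply the preferred threshold of 0.9 from the description."""
    return score > threshold


short_paths = anomaly_score([1, 1, 2], n=256)   # isolated quickly
long_paths = anomaly_score([9, 11, 10], n=256)  # isolated slowly
```

Under this assumed form, a sample that is isolated after only a few splits gets a score close to 1, while a sample whose average path length equals c(n) gets a score of 0.5.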
Referring to fig. 3, in some embodiments, an intelligent operation and maintenance anomaly detection device is provided, including:
the data processing module 101 is configured to acquire operation and maintenance data and perform dimension reduction processing to obtain a sample of the operation and maintenance data;
a preliminary forest establishment module 102, configured to establish an independent tree according to the sample and form an independent forest;
a preliminary judgment module 103, configured to calculate a preliminary abnormal score of each sample according to the independent tree and the independent forest, and mark a sample with the preliminary abnormal score larger than a preset value as a preliminary abnormal point;
a marking module 104, configured to mark some of the positive samples;
an identification module 105, configured to identify valid trees according to the marked preliminary abnormal points;
a total score calculating module 106, configured to assign scores to the features of the preliminary abnormal points identified in the valid trees, and to calculate a total score according to the number of independent trees in which the preliminary abnormal points are identified and the number of marked positive samples;
a reconstruction module 107, configured to calculate a feature selection probability according to the total score and reconstruct an independent tree and an independent forest;
and an anomaly detection module 108, configured to perform anomaly detection according to the reconstructed independent tree and the independent forest.
Wherein the data processing module 101 is further configured to:
forming a matrix by each operation and maintenance data according to columns;
zero-averaging each row of the matrix;
solving a covariance matrix of the matrix after zero-mean processing;
solving the eigenvalues of the covariance matrix and the corresponding features (eigenvectors);
and arranging the features by rows, in order of eigenvalue magnitude, into a feature matrix that serves as the sample.
The preliminary forest establishment module 102 is further configured to establish an independent tree according to the sample and form an independent forest, including:
randomly selecting a feature as the root node;
selecting a value between the maximum and minimum values of the root node's feature as the split basis, and splitting the root node into two child nodes;
dividing the samples into two groups, which enter the two child nodes respectively;
repeating the following step until the path reaches a preset length or a child node contains only one sample, thereby forming an independent tree: in each child node, selecting a value of one feature as the split basis to split that node into two child nodes again, and dividing the remaining samples into two groups that enter the two new child nodes;
and forming an independent forest from the independent trees generated with different features as root nodes.
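The construction steps above can be sketched in a few lines. This is a minimal illustration rather than the patented implementation: the dictionary-based node layout, the function names (`build_tree`, `path_length`, `build_forest`), and the choice of drawing the split value uniformly between the minimum and maximum are all assumptions made for the example.

```python
import random


def build_tree(samples, max_depth, rng, depth=0):
    """Recursively split samples on a randomly chosen feature and value."""
    if depth >= max_depth or len(samples) <= 1:
        return {"size": len(samples)}  # leaf: preset length or one sample
    feature = rng.randrange(len(samples[0]))  # equal-probability choice
    values = [s[feature] for s in samples]
    low, high = min(values), max(values)
    if low == high:
        return {"size": len(samples)}  # cannot split identical values
    split = rng.uniform(low, high)  # value between the min and the max
    left = [s for s in samples if s[feature] < split]
    right = [s for s in samples if s[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, max_depth, rng, depth + 1),
        "right": build_tree(right, max_depth, rng, depth + 1),
    }


def path_length(tree, sample):
    """Number of edges from the root to the leaf that holds the sample."""
    length, node = 0, tree
    while "feature" in node:
        side = "left" if sample[node["feature"]] < node["split"] else "right"
        node = node[side]
        length += 1
    return length


def build_forest(samples, n_trees, max_depth, seed=0):
    rng = random.Random(seed)
    return [build_tree(samples, max_depth, rng) for _ in range(n_trees)]


samples = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [10.0, 10.0]]
forest = build_forest(samples, n_trees=50, max_depth=8)


def mean_path(sample):
    return sum(path_length(t, sample) for t in forest) / len(forest)
```

In this toy data set the point `[10.0, 10.0]` lies far from the others, so it tends to be separated at the first split and ends up with a shorter average path length than the clustered points.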
In the preliminary judgment module 103, the preliminary abnormal score of each sample is calculated by the following formula:
Figure 285394DEST_PATH_IMAGE047
wherein,
Figure 962494DEST_PATH_IMAGE048
represents the preliminary anomaly score, L(p) represents the path length to the leaf node where sample p is located in an independent tree, and E(L(p)) represents the average path length of sample p over all independent trees in the independent forest;
Figure 573604DEST_PATH_IMAGE049
Figure 474695DEST_PATH_IMAGE050
represents the number of samples.
The identification module 105 is further configured to:
determining, as a valid tree, each independent tree in which the preliminary abnormal point is identified with a path length not exceeding the preset value.
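The valid tree rule can be illustrated with a short filter. The sketch assumes that the path length of one marked preliminary abnormal point has already been computed in every independent tree; the function name `valid_tree_indices` and the example numbers are hypothetical.

```python
def valid_tree_indices(path_lengths_per_tree, preset_length):
    """Indices of trees that isolate the point within the preset path length."""
    return [
        index
        for index, length in enumerate(path_lengths_per_tree)
        if length <= preset_length
    ]


# Path length of one marked preliminary abnormal point in four trees.
valid = valid_tree_indices([1, 5, 2, 7], preset_length=3)
```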
In the total score calculating module 106, the total score value is calculated by the following formula:
Figure 148253DEST_PATH_IMAGE005
Figure 409470DEST_PATH_IMAGE051
wherein,
Figure 794245DEST_PATH_IMAGE027
representing the score assigned to a feature of the preliminary outlier P, N representing the number of independent trees from which the preliminary outlier P was identified,
Figure 849926DEST_PATH_IMAGE031
represents the sum of the scores of the relevant features of the preliminary outlier P,
Figure 276359DEST_PATH_IMAGE052
represents the total score, and n represents the number of marked positive samples.
In the reconstruction module 107, the feature selection probability is calculated by the following formula:
Figure 888737DEST_PATH_IMAGE053
wherein,
Figure 694014DEST_PATH_IMAGE054
represents the m-th feature selection probability,
Figure 186175DEST_PATH_IMAGE055
represents the total score,
Figure 709691DEST_PATH_IMAGE024
represents the m-th feature.
The reconstruction module 107 is further configured to:
sampling a random variable U, wherein the random variable U obeys uniform distribution between 0 and 1;
selecting the i-th feature
Figure 922498DEST_PATH_IMAGE037
as the root node, where the feature
Figure 362706DEST_PATH_IMAGE037
satisfies the following condition:
Figure 510922DEST_PATH_IMAGE038
wherein
Figure 771002DEST_PATH_IMAGE039
represents the m-th feature selection probability;
selecting a value between the maximum and minimum values of the root node's feature as the split basis, and splitting the root node into two child nodes;
dividing the samples into two groups, which enter the two child nodes respectively;
repeating the following step until the path reaches the preset length or a child node contains only one sample: in each child node, randomly selecting a feature value from the feature vector as the split basis to split that node into two child nodes again, and dividing the remaining samples into two groups that enter the two new child nodes;
and recombining the independent trees generated with different features as root nodes into an independent forest.
The anomaly detection module 108 is further configured to:
calculating the final abnormal score of each sample according to the reconstructed independent tree and the reconstructed independent forest, and marking the samples with the final abnormal scores larger than a preset value as abnormal points;
the final anomaly score is calculated by the following formula:
Figure 521920DEST_PATH_IMAGE016
wherein,
Figure 567368DEST_PATH_IMAGE017
represents the final anomaly score,
Figure 683223DEST_PATH_IMAGE018
represents the path length to the leaf node where sample p is located in a reconstructed independent tree,
Figure 696178DEST_PATH_IMAGE019
represents the average path length of sample p over all independent trees in the reconstructed independent forest;
Figure 250787DEST_PATH_IMAGE056
Figure 150741DEST_PATH_IMAGE057
represents the number of samples.
Referring to fig. 4, in some embodiments, an electronic device is provided, which includes a processor 1 and a memory 2, where the memory 2 stores a plurality of instructions, and the processor 1 is configured to read the instructions and execute the method described above.
According to the intelligent operation and maintenance anomaly detection method and device, the operation and maintenance data are dimension-reduced before anomaly detection, which simplifies the sample data used for anomaly detection, saves computation time, and improves the efficiency of the anomaly detection algorithm. Positive samples are marked manually, which adds labeled supervised learning to the unsupervised independent forest algorithm, so that the advantages of both approaches are combined: the accuracy of the algorithm is improved while its efficiency is preserved. Scores are assigned to all features involved in the samples through the multiple valid trees of multiple positive samples, and the total score of each feature is calculated to describe how much that feature contributes to anomaly detection; the total score serves as the basis for selecting root nodes when the independent trees are reconstructed, which improves the identification accuracy of the reconstructed independent forest. The root node is selected by sampling a uniformly distributed random variable, which ensures that each feature is selected with exactly its feature selection probability and thus ensures the accuracy of the reconstructed independent forest.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (8)

1. An intelligent operation and maintenance abnormity detection method is characterized by comprising the following steps:
acquiring operation and maintenance data and performing dimension reduction processing to obtain a sample of the operation and maintenance data;
establishing an independent tree according to the sample and forming an independent forest;
calculating a preliminary abnormal score of each sample according to the independent tree and the independent forest, and marking the samples with the preliminary abnormal scores larger than a preset value as preliminary abnormal points;
marking part of the positive sample;
identifying valid trees according to the marked preliminary abnormal points;
assigning a score to the features of the identified preliminary outliers in the valid trees, and calculating a total score according to the number of the independent trees of the identified preliminary outliers and the number of the marked positive samples;
calculating feature selection probability according to the total score and reconstructing an independent tree and an independent forest;
carrying out anomaly detection according to the reconstructed independent trees and independent forests;
wherein calculating the feature selection probability according to the total score and reconstructing the independent tree and the independent forest comprises:
sampling a random variable U, wherein the random variable U obeys uniform distribution between 0 and 1;
selecting the i-th feature
Figure 427151DEST_PATH_IMAGE001
as the root node, where the feature
Figure 626182DEST_PATH_IMAGE001
satisfies the following condition:
Figure 812444DEST_PATH_IMAGE002
wherein
Figure 4391DEST_PATH_IMAGE003
represents the m-th feature selection probability;
selecting a value between the maximum and minimum values of the root node's feature as the split basis, and splitting the root node into two child nodes;
dividing the samples into two groups, which enter the two child nodes respectively;
repeating the following step until the path reaches the preset length or a child node contains only one sample: in each child node, randomly selecting a feature value from the feature vector as the split basis to split that node into two child nodes again, and dividing the remaining samples into two groups that enter the two new child nodes;
and recombining the independent trees generated with different features as root nodes into an independent forest.
2. The method of claim 1, wherein collecting operation and maintenance data and performing dimension reduction processing comprises:
forming a matrix by each operation and maintenance data according to columns;
zero-averaging each row of the matrix;
solving a covariance matrix of the matrix after zero-mean processing;
solving the eigenvalues of the covariance matrix and the corresponding features (eigenvectors);
and arranging the features by rows, in order of eigenvalue magnitude, into a feature matrix that serves as the sample.
3. The method of claim 2, wherein building independent trees from the samples and composing independent forests comprises:
randomly selecting a feature as the root node;
selecting a value between the maximum and minimum values of the root node's feature as the split basis, and splitting the root node into two child nodes;
dividing the samples into two groups, which enter the two child nodes respectively;
repeating the following step until the path reaches a preset length or a child node contains only one sample, thereby forming an independent tree: in each child node, selecting a value of one feature as the split basis to split that node into two child nodes again, and dividing the remaining samples into two groups that enter the two new child nodes;
and forming an independent forest from the independent trees generated with different features as root nodes.
4. The method of claim 1, wherein the preliminary anomaly score for each sample is calculated by the formula:
Figure 964302DEST_PATH_IMAGE004
wherein,
Figure 193289DEST_PATH_IMAGE005
represents the preliminary anomaly score, L(p) represents the path length to the leaf node where sample p is located in an independent tree, and E(L(p)) represents the average path length of sample p over all independent trees in the independent forest;
Figure 601267DEST_PATH_IMAGE006
Figure 331326DEST_PATH_IMAGE007
represents the number of samples.
5. The method of claim 4, wherein identifying valid trees from the labeled preliminary outliers comprises:
determining, as a valid tree, each independent tree in which the preliminary abnormal point is identified with a path length not exceeding the preset value.
6. The method of claim 4, wherein the total score is calculated by the formula:
Figure 406729DEST_PATH_IMAGE008
Figure 681984DEST_PATH_IMAGE009
wherein,
Figure 842838DEST_PATH_IMAGE010
representing the score assigned to a feature of the preliminary outlier P, N representing the number of independent trees from which the preliminary outlier P was identified,
Figure 924058DEST_PATH_IMAGE011
the sum of the scores representing the relevant features of the preliminary outliers P,
Figure 853968DEST_PATH_IMAGE012
represents the total score, and n represents the number of marked positive samples.
7. The method of claim 1, wherein anomaly detection is performed based on the reconstructed independent trees and independent forests, comprising:
calculating the final abnormal score of each sample according to the reconstructed independent tree and the reconstructed independent forest, and marking the sample with the final abnormal score larger than a preset value as an abnormal point;
the final anomaly score is calculated by the following formula:
Figure 80550DEST_PATH_IMAGE013
wherein,
Figure 74174DEST_PATH_IMAGE014
represents the final anomaly score,
Figure 490243DEST_PATH_IMAGE015
represents the path length to the leaf node where sample p is located in a reconstructed independent tree,
Figure 274659DEST_PATH_IMAGE016
represents the average path length of sample p over all independent trees in the reconstructed independent forest;
Figure 626137DEST_PATH_IMAGE017
Figure 496004DEST_PATH_IMAGE007
represents the number of samples.
8. An intelligent operation and maintenance anomaly detection device applied to the method of any one of claims 1-7, comprising:
the data processing module is used for acquiring operation and maintenance data and performing dimension reduction processing to obtain a sample of the operation and maintenance data;
the preliminary forest establishment module is used for establishing an independent tree according to the sample and forming an independent forest;
the preliminary judgment module is used for calculating the preliminary abnormal score of each sample according to the independent tree and the independent forest and marking the sample with the preliminary abnormal score larger than a preset value as a preliminary abnormal point;
the marking module is used for marking part of positive samples;
the identification module is used for identifying valid trees according to the marked preliminary abnormal points;
the total score calculating module is used for assigning scores to the features of the preliminary abnormal points identified in the valid trees, and for calculating the total score according to the number of independent trees in which the preliminary abnormal points are identified and the number of marked positive samples;
the reconstruction module is used for calculating the feature selection probability according to the total score and reconstructing an independent tree and an independent forest;
the anomaly detection module is used for carrying out anomaly detection according to the reconstructed independent tree and the independent forest;
the reconstruction module is further configured to:
sampling a random variable U, wherein the random variable U obeys uniform distribution between 0 and 1;
selecting the i-th feature
Figure 371557DEST_PATH_IMAGE001
as the root node, where the feature
Figure 276059DEST_PATH_IMAGE001
satisfies the following condition:
Figure 1700DEST_PATH_IMAGE018
wherein
Figure 769981DEST_PATH_IMAGE019
represents the m-th feature selection probability;
selecting a value between the maximum and minimum values of the root node's feature as the split basis, and splitting the root node into two child nodes;
dividing the samples into two groups, which enter the two child nodes respectively;
repeating the following step until the path reaches the preset length or a child node contains only one sample: in each child node, randomly selecting a feature value from the feature vector as the split basis to split that node into two child nodes again, and dividing the remaining samples into two groups that enter the two new child nodes;
and recombining the independent trees generated with different features as root nodes into an independent forest.
CN202210492320.8A 2022-05-07 2022-05-07 Intelligent operation and maintenance abnormity detection method and device Active CN114580580B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210492320.8A CN114580580B (en) 2022-05-07 2022-05-07 Intelligent operation and maintenance abnormity detection method and device

Publications (2)

Publication Number Publication Date
CN114580580A CN114580580A (en) 2022-06-03
CN114580580B true CN114580580B (en) 2022-08-16

Family

ID=81769157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210492320.8A Active CN114580580B (en) 2022-05-07 2022-05-07 Intelligent operation and maintenance abnormity detection method and device

Country Status (1)

Country Link
CN (1) CN114580580B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109859029A (en) * 2019-01-04 2019-06-07 深圳壹账通智能科技有限公司 Abnormal application detection method, device, computer equipment and storage medium
CN110149258A (en) * 2019-04-12 2019-08-20 北京航空航天大学 A kind of automobile CAN-bus network data method for detecting abnormality based on isolated forest
CN111784392A (en) * 2020-06-29 2020-10-16 中国平安财产保险股份有限公司 Abnormal user group detection method, device and equipment based on isolated forest
CN111833172A (en) * 2020-05-25 2020-10-27 百维金科(上海)信息科技有限公司 Consumption credit fraud detection method and system based on isolated forest
CN112199670A (en) * 2020-09-30 2021-01-08 西安理工大学 Log monitoring method for improving IFOREST (entry face detection sequence) to conduct abnormity detection based on deep learning
CN112505549A (en) * 2020-11-26 2021-03-16 西安电子科技大学 New energy automobile battery abnormity detection method based on isolated forest algorithm
CN112990330A (en) * 2021-03-26 2021-06-18 国网河北省电力有限公司营销服务中心 User energy abnormal data detection method and device
CN113392914A (en) * 2021-06-22 2021-09-14 北京邮电大学 Anomaly detection algorithm for constructing isolated forest based on weight of data features
WO2021218314A1 (en) * 2020-04-27 2021-11-04 深圳壹账通智能科技有限公司 Event identification method and apparatus based on position locating, and device and storage medium
CN113627521A (en) * 2021-08-09 2021-11-09 西华大学 Intelligent logistics unmanned aerial vehicle abnormal behavior identification method based on isolated forest method
CN113886375A (en) * 2021-09-29 2022-01-04 东北电力大学 Wind power data cleaning method based on isolated forest and local outlier factors
CN113887674A (en) * 2021-12-06 2022-01-04 深圳索信达数据技术有限公司 Abnormal behavior detection method and system based on big data
CN114386483A (en) * 2021-12-17 2022-04-22 深圳索信达数据技术有限公司 Method, apparatus, device, and medium for quantifying feature distinguishing capability

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7063022B2 (en) * 2018-03-14 2022-05-09 オムロン株式会社 Anomaly detection system, support device and model generation method
CN109345137A (en) * 2018-10-22 2019-02-15 广东精点数据科技股份有限公司 A kind of rejecting outliers method based on agriculture big data
CN109886724B (en) * 2018-12-29 2021-02-12 中南大学 Robust resident travel track identification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孤立森林算法在大坝监测数据异常识别中的应用;张海龙等;《人民黄河》;20200810(第08期);第154-157页 *

Also Published As

Publication number Publication date
CN114580580A (en) 2022-06-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant