CN114399407A

CN114399407A - Power dispatching monitoring data anomaly detection method based on dynamic and static selection integration

Info

Publication number: CN114399407A
Application number: CN202210147086.5A
Authority: CN
Inventors: 高欣; 傅世元; 薛冰; 于家豪; 黄子健; 黄旭; 张光耀; 李康生
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2022-02-17
Filing date: 2022-02-17
Publication date: 2022-04-26

Abstract

An embodiment of the invention provides a power dispatching monitoring data anomaly detection method based on dynamic and static selection integration, comprising the following steps: training a number of base detectors on historical power dispatching monitoring data; using an isolation forest to reject poorly performing base detectors; generating a pseudo ground truth (the "false true value") for the historical data by averaging the outputs of the remaining base detectors, and converting both the pseudo ground truth and the base detector outputs into binary labels; removing historical data whose pseudo ground truth is too small, and extracting the meta-features and meta-labels of the base detectors on the remaining historical data; training a random forest on the meta-features and meta-labels; and extracting the meta-features of the base detectors on the data to be detected, inputting them into the random forest, selecting base detectors according to the random forest output, and taking the maximum output of the selected base detectors as the detection result for the data to be detected. The technical scheme provided by the embodiment can improve the accuracy of anomaly detection on power dispatching monitoring data.

Description

Power dispatching monitoring data anomaly detection method based on dynamic and static selection integration

[ technical field ]

The invention relates to a method for detecting anomalies in power dispatching monitoring data, in particular to a power dispatching monitoring data anomaly detection method based on dynamic and static selection integration.

[ background of the invention ]

The unified strong smart grid is a new type of power grid built on the physical grid by tightly integrating modern sensing and measurement, communication, information, computer and control technologies with it. It covers power generation, transmission, transformation, distribution, consumption and dispatching. In the day-to-day operation of the power system, dispatching is responsible for commanding, monitoring and managing power production, and is an important guarantee of safe operation. As the grid grows in scale, the requirements for safe and stable operation keep rising, and anomaly detection on grid dispatching monitoring data becomes increasingly important. Because the monitoring system generates a large amount of monitoring data in a short time while the grid is running, manually labelling the data as normal or abnormal, for example by consulting experts, is almost impossible. The stored historical dispatching monitoring data therefore often lack accurate label information, and unsupervised anomaly detection methods, which do not use training labels, are better suited to this situation.

In existing unsupervised anomaly detection methods based on dynamic selection integration, the pseudo ground truth is generated from all initially trained base detectors, so it is biased by the poorly performing ones, and the base detector performance scores computed against it are not accurate enough. In addition, existing dynamic selection integration methods measure base detector performance with a single evaluation index, whose limited universality leads to poor results whenever that index is not applicable. The invention therefore first selects base detectors statically, rejecting part of the poorly performing ones to obtain a more accurate pseudo ground truth, and then selects base detectors dynamically using a meta-learning approach that combines several indices to evaluate detector performance comprehensively. This dynamic and static selection integration method can improve the accuracy of ensemble-based anomaly detection on power dispatching monitoring data, which is important for strengthening grid state monitoring and ensuring grid safety.

[ summary of the invention ]

In view of this, the invention provides a power dispatching monitoring data anomaly detection method based on dynamic and static selection integration, so as to improve the accuracy of power dispatching monitoring data anomaly detection.

The invention provides a power dispatching monitoring data anomaly detection method based on dynamic and static selection integration, which comprises the following steps:

(1) Training a number of base detectors using historical power dispatching monitoring data, specifically:

All historical power monitoring data are used as the training set $X_{TR}$. Based on the training set, m base detectors are trained using different unsupervised anomaly detection algorithms, generally with m ≥ 50, and the base detector pool composed of all of them is denoted $P_O$. The output of each base detector is an anomaly score; the larger the score, the more abnormal the input data. The anomaly score output by each base detector in $P_O$ is converted into a Z-score by Z-score normalization. Let $s_{i,j}$ denote the anomaly score output by the i-th base detector in $P_O$ on the j-th historical sample $x_j^{TR}$ of $X_{TR}$. Its Z-score $z_{i,j}$ is:

$$z_{i,j} = \frac{s_{i,j} - \mu_i}{\sigma_i}$$

where i = 1, 2, ..., m; j = 1, 2, ..., n; n is the number of historical samples in $X_{TR}$; $\mu_i$ is the mean of the anomaly scores output by the i-th base detector over all historical data; and $\sigma_i$ is the standard deviation of the anomaly scores output by the i-th base detector over all historical data.

The input of each base detector is the real-time process resource usage data related to power dispatching system services, collected by the power dispatching monitoring system. It includes process CPU usage, memory usage, disk I/O, network I/O, thread count and network connection count. If the Z-score output by the i-th base detector is less than its classification threshold $\tau_i$, the input data is judged normal; if it is greater than or equal to $\tau_i$, the input data is judged abnormal. To obtain $\tau_i$, the Z-scores output by the i-th base detector on all training data $X_{TR}$ are sorted in descending order, and $\tau_i$ is the smallest of the top $R_{DA}\%$ of Z-scores. $R_{DA}\%$ is the preset base detector output conversion ratio, generally 10%.
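As an illustration of step (1), the following minimal sketch normalizes one base detector's anomaly scores and derives its classification threshold. The function names, the use of the population standard deviation, the handling of a constant detector and the rounding of the top-ratio count are assumptions; the text does not specify them.

```python
import math
from statistics import mean, pstdev

def z_scores(scores):
    """Z-score normalize one detector's anomaly scores over all training samples."""
    mu, sigma = mean(scores), pstdev(scores)  # population std is an assumption
    if sigma == 0:                            # constant output: assumption, map to 0
        return [0.0] * len(scores)
    return [(s - mu) / sigma for s in scores]

def classification_threshold(z, ratio=0.10):
    """Smallest Z-score among the top `ratio` fraction (R_DA%) in descending order."""
    k = max(1, math.ceil(ratio * len(z)))     # rounding rule is an assumption
    return sorted(z, reverse=True)[k - 1]

raw = [0.1, 0.2, 0.15, 0.12, 0.18, 0.11, 0.14, 0.16, 0.13, 0.9]
z = z_scores(raw)
tau = classification_threshold(z, ratio=0.10)
labels = [1 if v >= tau else 0 for v in z]    # 1 = abnormal, 0 = normal
```

With ten samples and a 10% ratio, only the sample with the largest Z-score reaches the threshold.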

(2) Using an isolation forest to reject poorly performing base detectors, specifically:

The Z-scores output by all m base detectors in $P_O$ on all n historical samples of the training set $X_{TR}$ form the matrix $Score_{m \times n}$, whose r-th row $Score_r$ contains the n Z-scores of the r-th base detector. An isolation forest consisting of n_itree isolation trees is trained on $Score_{m \times n}$, with n_itree typically 100. To construct an isolation tree, ψ rows are sampled uniformly without replacement from $Score_{m \times n}$ [the typical value of ψ is given in the source by a formula that was lost in extraction], and the resulting ψ rows of n-dimensional data, $Score_{\psi \times n}$, serve as the training sample of this tree. Within each tree's sample, a dimension is selected at random and a split value is drawn at random between the minimum and the maximum of the sample in that dimension. The sample is then divided into two branches: rows smaller than the split value in that dimension go to the left of the node, and rows greater than or equal to it go to the right. This yields a splitting condition and the left and right data sets. The process is repeated on the left and right data sets until one of two termination conditions is reached:

1) the data set contains only one sample, or all samples are identical;

2) the height of the tree reaches log₂(ψ).

All trained isolation trees form the isolation forest IForest. Its output is a continuous value; the smaller the output, the more abnormal the input data.

Each row $Score_r$ of $Score_{m \times n}$, r = 1, 2, ..., m, is input to IForest. The m outputs of IForest are sorted in ascending order, and the base detectors corresponding to the first $R_D\%$ of outputs are marked abnormal; $R_D\%$ is generally 10%. The base detectors marked abnormal are removed from $P_O$, and the base detector pool formed by the m' base detectors remaining after screening is denoted $P_F$.
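The static selection in step (2) can be sketched in plain Python as below. This is a simplified isolation forest written for illustration, not the patent's Algorithm 1: the helper names, the sub-sample size `psi=8`, the fixed seed, the rounding of the drop count, and the output convention (the negated standard isolation forest anomaly score, so that smaller means more abnormal) are all assumptions. It expects at least two detectors.

```python
import math
import random

def c_factor(k):
    """Average path length of an unsuccessful binary-search-tree lookup over k points."""
    if k <= 1:
        return 0.0
    if k == 2:
        return 1.0
    return 2.0 * (math.log(k - 1) + 0.5772156649) - 2.0 * (k - 1) / k

def build_tree(rows, height, limit, rng):
    # Stop on the height limit, a single (or no) row, or identical rows.
    if height >= limit or len(rows) <= 1 or all(r == rows[0] for r in rows):
        return {"size": len(rows)}
    dim = rng.randrange(len(rows[0]))
    split = rng.uniform(min(r[dim] for r in rows), max(r[dim] for r in rows))
    left = [r for r in rows if r[dim] < split]
    right = [r for r in rows if r[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, height + 1, limit, rng),
            "right": build_tree(right, height + 1, limit, rng)}

def path_length(tree, row, depth=0):
    if "dim" not in tree:                      # leaf: add the expected remaining depth
        return depth + c_factor(tree["size"])
    child = tree["left"] if row[tree["dim"]] < tree["split"] else tree["right"]
    return path_length(child, row, depth + 1)

def prune_detectors(score_matrix, n_trees=100, psi=8, ratio=0.10, seed=0):
    """Return the row indices (detectors) kept after dropping the most isolated R_D%."""
    rng = random.Random(seed)
    psi = min(psi, len(score_matrix))
    limit = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(score_matrix, psi), 0, limit, rng)
             for _ in range(n_trees)]
    outputs = []
    for row in score_matrix:
        avg = sum(path_length(t, row) for t in trees) / n_trees
        outputs.append(-(2.0 ** (-avg / c_factor(psi))))   # smaller = more abnormal
    n_drop = max(1, math.ceil(ratio * len(score_matrix)))  # rounding is an assumption
    dropped = set(sorted(range(len(outputs)), key=outputs.__getitem__)[:n_drop])
    return [r for r in range(len(score_matrix)) if r not in dropped]

# Nine detectors that broadly agree, plus one whose Z-scores disagree with all others.
matrix = [[0.1 * ((r + c) % 3) for c in range(5)] for r in range(9)]
matrix.append([5.0, -4.0, 6.0, -5.0, 7.0])
kept = prune_detectors(matrix)
```

Each row of `matrix` plays the role of one base detector's Z-scores on the training data, so the forest isolates detectors rather than samples.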

(3) Generating a pseudo ground truth (the "false true value") for the historical data by averaging the outputs of the remaining base detectors, and converting the pseudo ground truth and the base detector outputs into binary (two-class) labels, specifically:

Let $Z_j = (z_{1,j}, z_{2,j}, \dots, z_{m',j})$ denote the Z-scores output by all m' base detectors in $P_F$ on the j-th historical sample $x_j^{TR}$ of the training set $X_{TR}$. The mean of all Z-scores in $Z_j$ is taken as the pseudo ground truth $PScore_j$ of $x_j^{TR}$:

$$PScore_j = \frac{1}{m'} \sum_{a=1}^{m'} z_{a,j}$$

The pseudo ground truth set corresponding to all historical data in $X_{TR}$ is $PScore_{TR} = \{PScore_1, PScore_2, \dots, PScore_n\}$.

The pseudo ground truths in $PScore_{TR}$ are sorted in descending order, and the threshold $PScore_{thr}$ is the smallest of the top $R_{GA}\%$ of values. $R_{GA}\%$ is the preset pseudo ground truth conversion ratio, generally 20%. If the pseudo ground truth $PScore_j$ of the j-th historical sample $x_j^{TR}$ is greater than or equal to $PScore_{thr}$, its pseudo label $PL_j$ is 1, otherwise it is 0. The pseudo label set corresponding to all historical data in $X_{TR}$ is $PL_{TR} = \{PL_1, PL_2, \dots, PL_n\}$.

If the Z-score $z_{a,j}$ output by the a-th base detector in $P_F$ on $x_j^{TR}$ is greater than or equal to that detector's classification threshold $\tau_a$, a = 1, 2, ..., m', then the binary label $BL_{a,j}$ of its output on $x_j^{TR}$ is 1, otherwise it is 0. The binary labels output by the a-th base detector on the training set $X_{TR}$ are $BL_a = \{BL_{a,1}, BL_{a,2}, \dots, BL_{a,n}\}$, and the binary label set output by all base detectors on $X_{TR}$ is $BL_{TR} = \{BL_1, BL_2, \dots, BL_{m'}\}$.
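A small worked sketch of step (3), under stated assumptions: the function names are hypothetical, the Z-scores and thresholds are made-up toy values, and the rounding of the top-ratio count is not specified by the text.

```python
import math

def top_ratio_threshold(values, ratio):
    """Smallest value among the top `ratio` fraction when sorted in descending order."""
    k = max(1, math.ceil(ratio * len(values)))   # rounding rule is an assumption
    return sorted(values, reverse=True)[k - 1]

def pseudo_labels(z, ratio_ga=0.20):
    """z[a][j] is the Z-score of remaining detector a on training sample j."""
    m, n = len(z), len(z[0])
    pscore = [sum(z[a][j] for a in range(m)) / m for j in range(n)]  # mean over detectors
    thr = top_ratio_threshold(pscore, ratio_ga)
    return pscore, [1 if p >= thr else 0 for p in pscore]

def binary_labels(z, thresholds):
    """Binary label of every detector on every sample, using each detector's own threshold."""
    return [[1 if v >= tau else 0 for v in row] for row, tau in zip(z, thresholds)]

z = [[-0.5, -0.4, 2.0, -0.6, -0.5],
     [-0.3, -0.6, 1.8, -0.2, -0.7],
     [-0.4, 1.9, -0.1, -0.7, -0.7]]
pscore, pl = pseudo_labels(z, ratio_ga=0.20)
bl = binary_labels(z, thresholds=[2.0, 1.8, 1.9])
```

Here the third detector flags the second sample while the averaged pseudo ground truth flags the third, so that detector disagrees with the pseudo labels on two samples.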

(4) Removing historical data whose pseudo ground truth is too small, and extracting the meta-features and meta-labels of the base detectors on the remaining historical data, specifically:

The pseudo ground truths of all historical data are sorted in ascending order, and the historical data corresponding to the first $R_S\%$ of values are removed. The remaining n' historical samples are denoted $X_{STR}$. Their pseudo label set and binary label set are denoted $PL_{STR}$ and $BL_{STR}$ respectively, and the Z-scores output by the remaining base detectors on $X_{STR}$ are denoted $Z_{STR}$.

For the t-th historical sample $x_t^{STR}$ of $X_{STR}$, its Euclidean distance to the j-th historical sample $x_j^{TR}$ of the original training set $X_{TR}$ is computed:

$$d(x_t^{STR}, x_j^{TR}) = \sqrt{\sum_{l=1}^{u} \left(x_{t,l}^{STR} - x_{j,l}^{TR}\right)^2}$$

where t = 1, 2, ..., n'; j = 1, 2, ..., n; l = 1, 2, ..., u; u is the dimension of the historical data; $x_{t,l}^{STR}$ is the value of $x_t^{STR}$ in the l-th dimension; and $x_{j,l}^{TR}$ is the value of $x_j^{TR}$ in the l-th dimension.

The historical samples of the original training set $X_{TR}$ are sorted by their Euclidean distance to $x_t^{STR}$ in ascending order, and the first $K_{RC}$ are taken as the performance evaluation set $RC_t$ of $x_t^{STR}$. Generally, 10 ≤ $K_{RC}$ ≤ 30.

Let $z_t^{STR} = (z_{1,t}^{STR}, \dots, z_{m',t}^{STR})$ denote the Z-scores output by all base detectors in $P_F$ on $x_t^{STR}$, and let $z_j^{TR} = (z_{1,j}^{TR}, \dots, z_{m',j}^{TR})$ denote the Z-scores output by all base detectors in $P_F$ on $x_j^{TR}$. The Euclidean distance between $z_t^{STR}$ and $z_j^{TR}$ is computed:

$$d(z_t^{STR}, z_j^{TR}) = \sqrt{\sum_{a=1}^{m'} \left(z_{a,t}^{STR} - z_{a,j}^{TR}\right)^2}$$

where $z_{a,t}^{STR}$ is the Z-score output by the a-th base detector in $P_F$ on $x_t^{STR}$, and $z_{a,j}^{TR}$ is the Z-score output by the a-th base detector in $P_F$ on $x_j^{TR}$.

The historical samples of the original training set $X_{TR}$ are sorted in ascending order by the Euclidean distance between their base detector Z-scores and $z_t^{STR}$, and the first $K_{SOP}$ are taken as the approximate output set $SOP_t$ of $x_t^{STR}$. Generally, 10 ≤ $K_{SOP}$ ≤ 30.
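The two neighbour sets differ only in the space where distance is measured, which the following sketch makes explicit. The helper name `k_nearest` and the toy data are assumptions. Because the neighbours are drawn from the original training set, a training sample appears in its own neighbour sets at distance zero; the text does not say whether it should be excluded.

```python
import math

def k_nearest(query, candidates, k):
    """Indices of the k candidates closest to `query` in Euclidean distance."""
    order = sorted(range(len(candidates)),
                   key=lambda j: math.dist(query, candidates[j]))
    return order[:k]

x_tr = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]   # raw monitoring samples
z_tr = [[0.1, 0.2], [2.0, 2.1], [0.0, 0.3], [0.4, 0.1]]   # z_tr[j][a]: detector outputs
t = 0
rc_t = k_nearest(x_tr[t], x_tr, k=3)    # performance evaluation set (data space)
sop_t = k_nearest(z_tr[t], z_tr, k=3)   # approximate output set (output space)
```

Sample 1 is close to sample 0 in data space but far from it in output space, so it is in `rc_t` and not in `sop_t`.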

The following six groups of meta-features are extracted for the a-th base detector in $P_F$ on $x_t^{STR}$. Here a base detector judges a historical sample correctly when the binary label it outputs equals the sample's pseudo label, $RC_t$ is the performance evaluation set of $x_t^{STR}$, $SOP_t$ is its approximate output set, and $\tau_a$ is the classification threshold of the a-th base detector:

1) in the performance evaluation set $RC_t$, count the historical samples for which the binary label output by the base detector equals the corresponding pseudo label, and take the ratio of this count to $K_{RC}$ as a feature; this group contains 1 feature;

2) in the approximate output set $SOP_t$, count the historical samples for which the binary label output by the base detector equals the corresponding pseudo label, and take the ratio of this count to $K_{SOP}$ as a feature; this group contains 1 feature;

3) for each historical sample in $RC_t$, whether the base detector judges it correctly as normal or abnormal: if the base detector judges the q-th sample in $RC_t$ correctly, q = 1, 2, ..., $K_{RC}$, the q-th feature in this group is 0, otherwise it is 1; this group contains $K_{RC}$ features;

4) for each historical sample in $SOP_t$, whether the base detector judges it correctly: if the base detector judges the p-th sample in $SOP_t$ correctly, p = 1, 2, ..., $K_{SOP}$, the p-th feature in this group is 0, otherwise it is 1; this group contains $K_{SOP}$ features;

5) for each historical sample in $RC_t$, the absolute value of the difference between the Z-score output by the base detector on that sample and the base detector's classification threshold $\tau_a$; this group contains $K_{RC}$ features;

6) the absolute value of the difference between the Z-score output by the base detector on $x_t^{STR}$, the sample whose meta-features are being extracted, and the base detector's own classification threshold $\tau_a$; this group contains 1 feature.

The six groups contain M meta-features in total, where M = 3 + 2 × $K_{RC}$ + $K_{SOP}$. Using the above method, the meta-features of each base detector in $P_F$ on each historical sample in $X_{STR}$ are extracted to form the meta-feature set $X_{TRM}$, which contains n' × m' meta-feature vectors.
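The six groups can be assembled into one vector as in the sketch below, which also checks the count M = 3 + 2 × K_RC + K_SOP. The function signature, the ordering of the groups inside the vector and the toy values are assumptions.

```python
def meta_features(a, z_query, rc, sop, z_tr, bl, pl, tau):
    """Meta-feature vector of detector `a` for one sample.

    z_query: Z-score of detector a on the sample being described;
    rc / sop: index lists of the performance evaluation and approximate output sets;
    z_tr[a][j], bl[a][j]: Z-score and binary label of detector a on training sample j;
    pl[j]: pseudo label of training sample j; tau: classification threshold of detector a.
    """
    miss_rc = [0 if bl[a][j] == pl[j] else 1 for j in rc]     # group 3
    miss_sop = [0 if bl[a][j] == pl[j] else 1 for j in sop]   # group 4
    acc_rc = miss_rc.count(0) / len(rc)                       # group 1
    acc_sop = miss_sop.count(0) / len(sop)                    # group 2
    margins = [abs(z_tr[a][j] - tau) for j in rc]             # group 5
    own_margin = abs(z_query - tau)                           # group 6
    return [acc_rc, acc_sop] + miss_rc + miss_sop + margins + [own_margin]

rc, sop = [0, 1, 2], [0, 2, 3]
features = meta_features(a=0, z_query=0.2, rc=rc, sop=sop,
                         z_tr=[[0.2, 1.5, -0.3, 0.9]], bl=[[0, 1, 0, 0]],
                         pl=[0, 1, 1, 0], tau=1.0)
```

With K_RC = K_SOP = 3 the vector has 12 entries, matching 3 + 2 × 3 + 3.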

The binary label $BL_{a,t}$ output by the a-th base detector in $P_F$ on $x_t^{STR}$ is compared with the pseudo label $PL_t$ of $x_t^{STR}$. If they are the same, the meta-label $ML_{a,t}$ of the a-th base detector on $x_t^{STR}$ is 0, indicating that the a-th base detector judges $x_t^{STR}$ correctly; otherwise it is 1, indicating that the a-th base detector cannot judge $x_t^{STR}$ correctly. Using the above method, the meta-labels of each base detector in $P_F$ on each historical sample in $X_{STR}$ are computed to form the meta-label set $L_{TRM}$, which contains n' × m' meta-labels.
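A minimal sketch of the meta-label rule, with a hypothetical function name and toy labels. Note the convention that 0 means the detector agrees with the pseudo label.

```python
def meta_labels(bl, pl):
    """ml[a][t] is 0 when detector a's binary label matches the pseudo label of sample t."""
    return [[0 if b == p else 1 for b, p in zip(row, pl)] for row in bl]

pl = [0, 0, 1, 0]                       # pseudo labels of n' = 4 samples
bl = [[0, 0, 1, 0],                     # detector 0 agrees everywhere
      [0, 1, 0, 0]]                     # detector 1 disagrees on samples 1 and 2
ml = meta_labels(bl, pl)
flat = [v for row in ml for v in row]   # n' x m' meta-labels, one per (detector, sample)
```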

(5) Training a random forest on the meta-features and meta-labels, specifically:

A random forest consisting of n_dtree decision trees is trained using the meta-feature set $X_{TRM}$ and the meta-label set $L_{TRM}$, with n_dtree generally 100. To construct a decision tree, N samples are drawn uniformly with replacement from $X_{TRM}$ as the training sample of this tree; generally N = n' × m'. In each tree's sample, M' dimensions are drawn at random from the M dimensions [the typical value of M' is given in the source by a formula that was lost in extraction]. On the selected M' dimensions, the best split dimension and split point are chosen according to the Gini index, and the sample is split in two: samples smaller than the split value in that dimension go to the left of the node, and samples greater than or equal to it go to the right, yielding a splitting condition and the left and right data sets. The process is repeated on the left and right data sets until the data set contains only one sample or the meta-labels of all samples are the same. All trained decision trees form the random forest RFC. The output of RFC is a binary label, 0 or 1, indicating whether the corresponding base detector can (0) or cannot (1) judge the corresponding data correctly.
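The node split used inside each decision tree can be illustrated with the sketch below, which picks the dimension and split point that minimize the weighted Gini impurity of the two children. It covers a single split only; bootstrap sampling, the random choice of M' dimensions and the recursion are omitted, and in practice an existing library implementation such as scikit-learn's `RandomForestClassifier` could be used instead. Function names and data are assumptions.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 meta-labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_split(rows, labels, dims):
    """Return (weighted_gini, dim, split_value) with the lowest impurity over `dims`."""
    best = None
    n = len(rows)
    for d in dims:
        # Skip the minimum value: it would leave the left side empty.
        for v in sorted({r[d] for r in rows})[1:]:
            left = [y for r, y in zip(rows, labels) if r[d] < v]
            right = [y for r, y in zip(rows, labels) if r[d] >= v]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, d, v)
    return best

rows = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.2], [0.1, 0.8]]   # toy meta-feature vectors
labels = [0, 0, 1, 1]                                     # toy meta-labels
score, dim, value = best_split(rows, labels, dims=[0, 1])
```

In this toy case the first dimension separates the two meta-label classes perfectly at 0.8, so the weighted impurity is zero.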

(6) Extracting the meta-features of the base detectors on the data to be detected, inputting them into the random forest, selecting base detectors according to the random forest output, and taking the maximum output of the selected base detectors as the detection result for the data to be detected, thereby realizing anomaly detection on power dispatching monitoring data, specifically:

For the data to be detected $x_{TE}$, the M meta-features of each base detector in $P_F$ on $x_{TE}$ are extracted using the same method as in step (4), forming the detection meta-feature set $X_{TEM}$. $X_{TEM}$ is input to the random forest RFC trained in step (5) to obtain the detection meta-label set $L_{TEM}$, which contains m' binary labels.

For each base detector in $P_F$, if its detection meta-label is 0, meaning that the detector is considered able to judge the data to be detected correctly, it is added to the selected base detector pool $P_S$. The maximum of the Z-scores output by all base detectors in $P_S$ on $x_{TE}$ is taken as the detection result of $x_{TE}$. The maximum of the classification thresholds of all base detectors in $P_S$ is taken as the detection threshold for this detection. If the detection result is greater than or equal to the detection threshold, $x_{TE}$ is judged to be abnormal data, thereby realizing anomaly detection on power dispatching monitoring data.
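The final decision in step (6) reduces to two maxima over the selected pool, as the sketch below shows. The function name and values are hypothetical. The text does not say what happens when no base detector is selected; falling back to all detectors is an assumption made here only so that the function always returns a result.

```python
def detect(z_te, thresholds, meta_pred):
    """z_te[a]: Z-score of detector a on the sample; meta_pred[a]: 0 means trusted."""
    selected = [a for a, flag in enumerate(meta_pred) if flag == 0]
    if not selected:                  # case not covered by the text: assumed fallback
        selected = list(range(len(z_te)))
    result = max(z_te[a] for a in selected)             # detection result
    threshold = max(thresholds[a] for a in selected)    # detection threshold
    return result, threshold, result >= threshold

result, threshold, is_anomaly = detect(
    z_te=[2.4, 0.3, 3.1], thresholds=[1.8, 1.5, 2.0], meta_pred=[0, 0, 1])
```

The third detector has the largest Z-score but is not trusted for this sample, so it influences neither the result nor the threshold.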

According to the above technical scheme, the invention has the following beneficial effects:

In the technical scheme of the invention, before base detectors are dynamically selected, an isolation forest removes in advance part of the base detectors that perform poorly on the whole training data. This improves the accuracy of the generated pseudo ground truth and allows base detector performance to be evaluated more accurately. When base detectors are dynamically selected, a meta-learning approach combines several evaluation indices to assess base detector performance comprehensively. This addresses the poor performance of dynamic selection integration methods in some cases, which is caused by the limited universality of a single index, and improves the accuracy of ensemble-based anomaly detection on power dispatching monitoring data.

[ description of the drawings ]

In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the present invention will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive labor.

FIG. 1 is a schematic flow chart of a frame of a power dispatching monitoring data anomaly detection method based on dynamic and static selection integration, which is provided by the invention;

FIG. 2 is a schematic diagram of an abnormal detection method for power dispatching monitoring data based on dynamic and static selection integration according to the present invention;

FIG. 3 is a schematic of the input data and output results of a base detector used in the present invention;

[ detailed description ]

For better understanding of the technical solutions of the present invention, the following detailed description of the present invention is provided with reference to the accompanying drawings.

It should be understood that the described embodiments of the invention are only some, but not all embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention provides a power dispatching monitoring data anomaly detection method based on dynamic and static selection integration. To detect anomalies in power dispatching monitoring data, the invention screens the base detectors with an isolation forest, combines several evaluation indices to measure base detector performance comprehensively, and uses a random forest to select the better-performing base detectors for each piece of data to be detected.

Fig. 1 is a schematic flow chart of a framework of a power scheduling monitoring data anomaly detection method based on dynamic and static selection integration, which is provided by the invention, and the method comprises the following steps:

step 101, training a certain number of base detectors using power scheduling monitoring historical data.

Specifically, all historical power monitoring data are used as the training set $X_{TR}$, and m base detectors (generally m ≥ 50) are trained on it using different unsupervised anomaly detection algorithms. The base detector pool composed of all of them is denoted $P_O$. Each base detector outputs an anomaly score, where a larger score means a more abnormal input, and this score is converted into a Z-score by Z-score normalization. For the anomaly score $s_{i,j}$ output by the i-th base detector in $P_O$ on the j-th historical sample $x_j^{TR}$ of $X_{TR}$, the Z-score $z_{i,j}$ is:

$$z_{i,j} = \frac{s_{i,j} - \mu_i}{\sigma_i}$$

where j = 1, 2, ..., n; n is the number of historical samples in $X_{TR}$; and $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the anomaly scores output by the i-th base detector over all historical data. If the Z-score output by the i-th base detector is less than its classification threshold $\tau_i$, the input data is normal; if it is greater than or equal to $\tau_i$, the input data is abnormal. $\tau_i$ is the smallest of the top $R_{DA}\%$ of the Z-scores output by the i-th base detector on $X_{TR}$, sorted in descending order; $R_{DA}\%$ is generally 10%.

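Step 102, using an isolation forest to reject poorly performing base detectors.

Specifically, the Z-scores output by all m base detectors in $P_O$ on all n historical samples of the training set $X_{TR}$ form the matrix $Score_{m \times n}$, on which an isolation forest IForest of n_itree isolation trees (typically 100) is trained. Each tree is built from ψ rows sampled uniformly without replacement from $Score_{m \times n}$. A dimension and a split value between the minimum and maximum in that dimension are chosen at random; rows below the split value go to the left of the node and the others to the right. This is repeated on both sides until one of two termination conditions is reached:

1) the data set contains only one sample, or all samples are identical;

2) the height of the tree reaches log₂(ψ).

Each row $Score_r$ of $Score_{m \times n}$, r = 1, 2, ..., m, is input to IForest, whose output is smaller for more abnormal inputs. The m outputs are sorted in ascending order, and the base detectors corresponding to the first $R_D\%$ of outputs are marked abnormal; $R_D\%$ is generally 10%. The base detectors marked abnormal are removed from $P_O$, and the base detector pool formed by the m' base detectors remaining after screening is denoted $P_F$.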

Algorithm 1 is pseudo code for this step:

Step 103, generating a pseudo ground truth for the historical data by averaging the outputs of the remaining base detectors, and converting the pseudo ground truth and the base detector outputs into binary labels.

Specifically, let $z_{a,j}$ denote the Z-score output by the a-th of the m' base detectors in $P_F$ on the j-th historical sample $x_j^{TR}$ of the training set $X_{TR}$. The mean of these Z-scores is taken as the pseudo ground truth $PScore_j$ of $x_j^{TR}$:

$$PScore_j = \frac{1}{m'} \sum_{a=1}^{m'} z_{a,j}$$

The pseudo ground truths of all historical data are sorted in descending order, and the threshold $PScore_{thr}$ is the smallest of the top $R_{GA}\%$ of values; $R_{GA}\%$ is generally 20%. If the pseudo ground truth $PScore_j$ of $x_j^{TR}$ is greater than or equal to $PScore_{thr}$, its pseudo label $PL_j$ is 1, otherwise it is 0.

If the Z-score $z_{a,j}$ output by the a-th base detector in $P_F$ on $x_j^{TR}$ is greater than or equal to that detector's classification threshold $\tau_a$, a = 1, 2, ..., m', then the binary label $BL_{a,j}$ of its output on $x_j^{TR}$ is 1, otherwise it is 0. The binary labels of all base detectors on $X_{TR}$ form the binary label set $BL_{TR}$.

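Step 104, removing historical data with a small pseudo ground truth, and extracting the meta-features and meta-labels of the base detectors on the remaining historical data.

Specifically, the pseudo ground truths of all historical data are sorted in ascending order, and the historical data corresponding to the first $R_S\%$ of values are removed. The remaining n' historical samples are denoted $X_{STR}$, with pseudo label set $PL_{STR}$, binary label set $BL_{STR}$ and base detector Z-scores $Z_{STR}$.

For the t-th historical sample $x_t^{STR}$ of $X_{STR}$, t = 1, 2, ..., n', two neighbour sets are drawn from the original training set $X_{TR}$. The performance evaluation set $RC_t$ consists of the $K_{RC}$ samples $x_j^{TR}$ with the smallest Euclidean distance to $x_t^{STR}$:

$$d(x_t^{STR}, x_j^{TR}) = \sqrt{\sum_{l=1}^{u} \left(x_{t,l}^{STR} - x_{j,l}^{TR}\right)^2}$$

where l = 1, 2, ..., u; u is the dimension of the historical data; and $x_{t,l}^{STR}$ and $x_{j,l}^{TR}$ are the values of $x_t^{STR}$ and $x_j^{TR}$ in the l-th dimension. Generally, 10 ≤ $K_{RC}$ ≤ 30.

The approximate output set $SOP_t$ consists of the $K_{SOP}$ samples of $X_{TR}$ whose base detector Z-scores have the smallest Euclidean distance to those of $x_t^{STR}$:

$$d(z_t^{STR}, z_j^{TR}) = \sqrt{\sum_{a=1}^{m'} \left(z_{a,t}^{STR} - z_{a,j}^{TR}\right)^2}$$

where $z_{a,t}^{STR}$ is the Z-score output by the a-th base detector in $P_F$ on $x_t^{STR}$, and $z_{a,j}^{TR}$ is the Z-score output by the a-th base detector in $P_F$ on $x_j^{TR}$. Generally, 10 ≤ $K_{SOP}$ ≤ 30.

Six groups of meta-features are extracted for the a-th base detector in $P_F$ on $x_t^{STR}$, where $\tau_a$ is the classification threshold of that detector:

1) the proportion of historical samples in $RC_t$ for which the binary label output by the base detector equals the corresponding pseudo label, that is, the count of such samples divided by $K_{RC}$; this group contains 1 feature;

2) the proportion of historical samples in $SOP_t$ for which the binary label output by the base detector equals the corresponding pseudo label, that is, the count of such samples divided by $K_{SOP}$; this group contains 1 feature;

3) for the q-th historical sample in $RC_t$, q = 1, 2, ..., $K_{RC}$, the q-th feature is 0 if the base detector judges the sample correctly and 1 otherwise; this group contains $K_{RC}$ features;

4) for the p-th historical sample in $SOP_t$, p = 1, 2, ..., $K_{SOP}$, the p-th feature is 0 if the base detector judges the sample correctly and 1 otherwise; this group contains $K_{SOP}$ features;

5) for each historical sample in $RC_t$, the absolute value of the difference between the Z-score output by the base detector and $\tau_a$; this group contains $K_{RC}$ features;

6) the absolute value of the difference between the Z-score output by the base detector on $x_t^{STR}$ and $\tau_a$; this group contains 1 feature.

The binary label output by the a-th base detector in $P_F$ on $x_t^{STR}$ is compared with the pseudo label of $x_t^{STR}$. If they are the same, the meta-label of the a-th base detector on $x_t^{STR}$ is 0, indicating that the a-th base detector judges $x_t^{STR}$ correctly; otherwise it is 1. Using the above method, the meta-labels of each base detector in $P_F$ on each historical sample in $X_{STR}$ are computed to form the meta-label set $L_{TRM}$, which contains n' × m' meta-labels.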

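Step 105, training a random forest on the meta-features and meta-labels.

Specifically, a random forest consisting of n_dtree decision trees is trained using the meta-feature set $X_{TRM}$ and the meta-label set $L_{TRM}$, with n_dtree generally 100. To construct a decision tree, N samples are drawn uniformly with replacement from $X_{TRM}$ as the training sample of this tree, generally with N = n' × m'. The tree is grown by choosing the best split dimension and split point according to the Gini index on M' dimensions drawn at random from the M dimensions, until a data set contains only one sample or the meta-labels of all its samples are the same. All trained decision trees form the random forest RFC, whose output is a binary label, 0 or 1, indicating whether the corresponding base detector can judge the corresponding data correctly.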

Algorithm 2 is the pseudo code of step 103-105:

Step 106, extracting the meta-features of the base detectors on the data to be detected, inputting them into the random forest, selecting base detectors according to the random forest output, and taking the maximum output of the selected base detectors as the detection result for the data to be detected, thereby realizing anomaly detection on power dispatching monitoring data.

Specifically, for the data to be detected $x_{TE}$, the M meta-features of each base detector in $P_F$ on $x_{TE}$ are extracted using the same method as in step 104, forming the detection meta-feature set $X_{TEM}$. $X_{TEM}$ is input to the random forest RFC trained in step 105 to obtain the detection meta-label set $L_{TEM}$, which contains m' binary labels. Each base detector whose detection meta-label is 0 is added to the selected base detector pool $P_S$. The maximum of the Z-scores output by the base detectors in $P_S$ on $x_{TE}$ is the detection result, and the maximum of their classification thresholds is the detection threshold; if the detection result is greater than or equal to the detection threshold, $x_{TE}$ is judged to be abnormal data.

Algorithm 3 is the pseudo code for step 106:

Fig. 2 is a schematic diagram of the power dispatching monitoring data anomaly detection method based on dynamic and static selection integration provided by the invention. First, a number of base detectors are trained on historical power dispatching monitoring data, an isolation forest is trained on the Z-scores output by all base detectors on all historical data, and the base detectors corresponding to the smaller isolation forest outputs are removed. Second, a pseudo ground truth is generated for each historical sample by averaging the Z-scores output by the remaining base detectors, and is converted into a pseudo label. Third, data with small pseudo ground truths are removed from the historical data, the meta-features of each base detector on the remaining historical data are extracted to form the meta-feature set, and the meta-label set is generated according to whether the labels output by the base detectors on the remaining historical data equal the corresponding pseudo labels. Fourth, a random forest is trained on the meta-feature set and the meta-label set. Finally, the detection meta-feature set of each base detector on the data to be detected is extracted and input to the random forest to obtain the detection meta-label set, and base detectors are selected according to it. The maximum of the Z-scores of the selected base detectors is taken as the detection result, and the maximum of their classification thresholds is taken as the detection threshold for this detection. Data whose detection result is greater than or equal to the detection threshold are judged abnormal, thereby realizing anomaly detection on power dispatching monitoring data.

Fig. 3 is a schematic diagram of the input data and output results of the base detectors used in the invention. The input of each base detector is the real-time process resource usage data related to power dispatching system services, collected by the power dispatching monitoring system, including process CPU usage, memory usage, disk I/O, network I/O, thread count and network connection count. If the Z-score output by the i-th base detector is less than its classification threshold $\tau_i$, the input data is normal; if it is greater than or equal to $\tau_i$, the input data is abnormal. The Z-scores output by the i-th base detector on all training data $X_{TR}$ are sorted in descending order, and $\tau_i$ is the smallest of the top $R_{DA}\%$ of them.

In a specific embodiment, system monitoring data from a smart grid dispatching control system (the D5000 system for short) are used, covering three abnormal conditions: data jump, application network disconnection, and telemetry table not refreshing. A data jump anomaly is defined on a telemetry point: the process data of the D5000 system are collected periodically, and if the difference between adjacent sampling points exceeds a manually set threshold, a data jump is considered to have occurred. When a data jump occurs, the power dispatching position allocates generation to subordinate grid companies with a deviation, which affects the grid's dispatching plan; the electricity reports are also biased, which affects electricity billing. An application network disconnection anomaly means that the network connection of a server running a D5000 application is interrupted or its network card fails, so that key D5000 processes run slowly or even stop, and the services under the application cannot execute their tasks normally, which affects grid dispatching. A telemetry table not refreshing anomaly means that the grid automation system fails to update telemetry data in time. Only with real-time, accurate telemetry data can dispatchers adjust the operating condition of the grid promptly and accurately. When the grid state changes, the corresponding telemetry data should be reported to the dispatching center immediately; if the telemetry table is not updated for a long time, the dispatchers' overall control of the grid's operating state is affected.
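The data jump rule described above can be written as a one-line check on adjacent samples. The function name, the readings and the threshold are hypothetical, and taking the absolute difference (so that drops count as well as rises) is an assumption.

```python
def data_jumps(samples, threshold):
    """Indices i where |samples[i] - samples[i - 1]| exceeds the manually set threshold."""
    return [i for i in range(1, len(samples))
            if abs(samples[i] - samples[i - 1]) > threshold]

readings = [50.0, 50.2, 49.9, 75.0, 74.8]   # periodically collected telemetry values
jumps = data_jumps(readings, threshold=5.0)
```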

The specific information of the system monitoring data corresponding to the three types of anomalies is shown in table 1:

Table 1. Specific information of the system monitoring data for the three types of anomalies

Table 2 shows the base detector algorithms and their parameters used in the embodiment of the present invention:

Table 2. Base detector algorithms and parameters used in the embodiment

To verify the effectiveness of the proposed algorithm, the dynamic and static selection integration anomaly detection method is compared with the direct integration anomaly detection methods Average, Max, AOM and MOA; the static selection integration anomaly detection methods HEnS, SS-FS and BoostSelect; and the dynamic selection integration anomaly detection methods LSCP and ELSCP.

AUC values are used for evaluation in the embodiment of the present invention. The area under the ROC curve (AUC) is commonly used to evaluate the performance of anomaly detection algorithms; the closer the AUC is to 1, the better the performance of the anomaly detection algorithm.
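As an illustration of the metric, the AUC can be computed as the fraction of (anomalous, normal) pairs in which the anomalous sample receives the higher score, with ties counted as one half. The sketch below uses only the Python standard library; the function name and the toy data are hypothetical and are not taken from the embodiment.

```python
def auc(labels, scores):
    """AUC as the probability that a random anomaly outscores a random normal sample."""
    pos = [s for y, s in zip(labels, scores) if y == 1]  # anomalous samples
    neg = [s for y, s in zip(labels, scores) if y == 0]  # normal samples
    if not pos or not neg:
        raise ValueError("AUC needs at least one anomalous and one normal sample")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 (anomalous, normal) pairs are ranked correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # prints 0.75
```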

In the embodiment of the present invention, the parameter R_DA% is set to 10%, R_GA% and R_S% are both set to 20%, K_RC and K_SOP are both set to 30, and n_itree and n_dtree are both set to 100.
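For reference, these settings can be gathered in a small configuration object. This is only a sketch: the class and field names are hypothetical, the percentages are stored as fractions, and the derived meta-feature count uses the relation M = 3 + 2 × K_RC + K_SOP stated in the claims.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbodimentParams:
    """Hypothetical container for the parameter values quoted in the embodiment."""
    r_da: float = 0.10   # base detector output conversion ratio R_DA%
    r_ga: float = 0.20   # false true value conversion ratio R_GA%
    r_s: float = 0.20    # share of historical data removed, R_S%
    k_rc: int = 30       # size of the performance evaluation set
    k_sop: int = 30      # size of the approximate output set
    n_itree: int = 100   # number of isolated trees
    n_dtree: int = 100   # number of decision trees

    @property
    def n_meta_features(self) -> int:
        # M = 3 + 2 * K_RC + K_SOP
        return 3 + 2 * self.k_rc + self.k_sop


print(EmbodimentParams().n_meta_features)  # prints 93
```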

The AUC results of the present invention and of the comparison methods on the D5000 monitoring data set are shown in Table 3. The power dispatching monitoring data anomaly detection method based on dynamic and static selection integration of the invention obtains the highest AUC on data jump anomalies (0.8575) and the highest mean AUC over the three anomalies (0.9477), which shows that the invention achieves higher accuracy in dispatching monitoring data anomaly detection than the existing methods.

Table 3. AUC results on the three anomalies

Anomaly	Average	Max	AOM	MOA	HEnS	SS-FS	BoostSelect	LSCP	ELSCP	The invention
Application network disconnection	0.9908	0.9848	0.9872	0.9904	0.9795	0.9862	0.9603	0.9672	0.9757	0.9885
Data jump	0.7571	0.8132	0.7844	0.7604	0.7506	0.7840	0.6099	0.7874	0.8095	0.8575
Telemetry table not refreshing	0.9979	0.9971	0.9971	0.9977	0.5840	0.9978	1.0000	0.9957	0.9966	0.9970
Mean AUC	0.9153	0.9317	0.9229	0.9162	0.7714	0.9227	0.8567	0.9168	0.9272	0.9477

In summary, the embodiments of the present invention have the following beneficial effects:

In the technical solution of the embodiment of the invention, a certain number of base detectors are first trained on the original power dispatching monitoring historical data using different unsupervised anomaly detection algorithms; an isolated forest is used to eliminate the base detectors with poor performance; an averaging method is used to generate a false true value for each historical datum from the Z scores output by the remaining base detectors, and the false true value is converted into a false label; historical data with small false true values are removed, the meta-features of each base detector on the remaining historical data are extracted to form a meta-feature set, and a meta-tag set is generated according to whether the label output by each base detector on the remaining historical data is the same as the corresponding false label; a random forest is then trained using the meta-feature set and the meta-tag set; finally, the detection meta-feature set of each base detector on the data to be detected is extracted and input into the random forest to obtain a detection meta-tag set, base detectors are selected according to the detection meta-tag set, the maximum of the Z scores of the selected base detectors is taken as the detection result, the maximum of the classification thresholds of the selected base detectors is taken as the detection threshold for this detection, and data to be detected whose detection result is greater than or equal to the detection threshold are judged to be anomalous, thereby realizing anomaly detection of power dispatching monitoring data. Compared with other integration-based anomaly detection methods, the technical solution provided by the embodiment of the invention achieves higher accuracy on the problem of power dispatching monitoring data anomaly detection.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A power dispatching monitoring data anomaly detection method based on dynamic and static selection integration, characterized by comprising the following steps:

(1) training a certain number of base detectors using power dispatching monitoring historical data;

(2) using an isolated forest to reject base detectors with poor performance;

(3) using an averaging method to generate false true values of the historical data from the outputs of the remaining base detectors, and respectively converting the false true values and the outputs of the base detectors into class II labels;

(4) removing historical data whose false true values are too small, and extracting the meta-features and meta-tags of the base detectors on the remaining historical data;

(5) training a random forest using the meta-features and meta-tags;

(6) extracting the meta-features of the base detectors on the data to be detected, inputting the meta-features into the random forest, selecting base detectors according to the output of the random forest, and taking the maximum of the outputs of the selected base detectors as the detection result of the data to be detected, thereby realizing anomaly detection of power dispatching monitoring data.

2. The power dispatching monitoring data anomaly detection method based on dynamic and static selection integration according to claim 1, wherein in the step (1), a certain number of base detectors are trained by using power dispatching monitoring historical data, and specifically:

All power dispatching monitoring historical data are used as the training set X_TR. Based on the training set, m base detectors are trained using different unsupervised anomaly detection algorithms, where generally m ≥ 50, and the base detector pool composed of all base detectors is denoted P_O. The output of each base detector is an anomaly score; the larger the anomaly score, the more anomalous the input data. The anomaly score output by each base detector in P_O is Z-score normalized to convert it into a Z score. Denote the anomaly score output by the i-th base detector in P_O on the j-th historical datum $x_j^{TR}$ of X_TR as $s_{i,j}$; its Z score $z_{i,j}$ is:

$$z_{i,j} = \frac{s_{i,j} - \mu_i}{\sigma_i}$$

where i = 1, 2, ..., m; j = 1, 2, ..., n; n is the number of historical data in X_TR; $\mu_i$ is the mean and $\sigma_i$ is the standard deviation of the anomaly scores output by the i-th base detector on all historical data;

the input of each base detector is the real-time process resource usage data collected by the power dispatching monitoring system and related to the power dispatching system services, including process CPU usage, memory usage, disk I/O, network I/O, thread count and network connection count; if the Z score output by the i-th base detector is greater than or equal to its classification threshold $thr_i$, the input data are considered anomalous; the Z scores output by the i-th base detector on all training data X_TR are sorted in descending order, and the classification threshold $thr_i$ of the i-th base detector is the minimum of the top R_DA% of the sorted Z scores; R_DA% is the set base detector output conversion ratio, generally 10%.
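The normalization and thresholding of this claim can be sketched as follows, using only the Python standard library. The function names are hypothetical, and two details are assumptions not fixed by the text: the population standard deviation is used, and the number of Z scores in the top R_DA% is obtained by rounding to the nearest integer with a minimum of one.

```python
from statistics import mean, pstdev


def z_scores(anomaly_scores):
    """Z-score normalize the anomaly scores of one base detector."""
    mu = mean(anomaly_scores)
    sigma = pstdev(anomaly_scores)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in anomaly_scores]  # constant output, nothing to scale
    return [(s - mu) / sigma for s in anomaly_scores]


def classification_threshold(z, r_da=0.10):
    """Minimum of the top r_da share of the Z scores sorted in descending order."""
    k = max(1, round(r_da * len(z)))  # assumption: round to nearest, at least 1
    return sorted(z, reverse=True)[k - 1]


scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0]  # hypothetical
z = z_scores(scores)
thr = classification_threshold(z, r_da=0.10)
labels = [1 if v >= thr else 0 for v in z]  # 1 marks data judged anomalous
```

With ten scores and R_DA% = 10%, only the largest Z score reaches the threshold, so exactly one datum is labelled 1.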

3. The power dispatching monitoring data anomaly detection method based on dynamic and static selection integration according to claim 1, wherein in the step (2), an isolated forest is used to reject base detectors with poor performance, specifically:

An isolated forest consisting of n_itree isolated trees is trained, where n_itree is generally 100. Let Score_m×n denote the matrix of Z scores output by the m base detectors in P_O on the n historical data of X_TR, in which each row is the n-dimensional Z-score vector of one base detector. When constructing an isolated tree, ψ pieces of n-dimensional data, Score_ψ×n, are sampled from Score_m×n as the training sample of this isolated tree. For each isolated tree sample, a dimension is randomly selected, a value is randomly selected between the maximum and the minimum of the sample in that dimension, and the sample is split in two: samples smaller than the value in that dimension are assigned to the left of the node, and samples greater than or equal to the value are assigned to the right, which yields a splitting condition and the data sets on the left and right sides. The above process is repeated on the left and right data sets until a termination condition is reached. There are two termination conditions:

1) the data set itself comprises only one sample, or all samples are identical;

2) the height of the tree reaches log₂(ψ);

All trained isolated trees form the isolated forest IForest. The output of the isolated forest IForest is a continuous value; the smaller the output, the more anomalous the input data;

The r-th piece of data in Score_m×n is taken as the input of the isolated forest IForest, where r = 1, 2, ..., m. The m outputs of the isolated forest IForest on Score_m×n are sorted in ascending order, and the base detectors corresponding to the first R_D% of the sorted outputs are marked as anomalous, where R_D% is generally 10%. The base detectors marked as anomalous are removed from P_O, and the base detector pool consisting of the m′ base detectors remaining after screening is denoted P_F.
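The screening step can be sketched with a small pure-Python isolated forest. This is an illustrative sketch under several assumptions, not the implementation of the embodiment: the forest output is taken to be the mean path length (shorter paths mean more anomalous, so smaller output means more anomalous), the subsample is drawn without replacement, the height limit is ⌈log₂(ψ)⌉, and the number of removed detectors is rounded to the nearest integer. All function names are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(size):
    """Expected depth correction for a leaf that still holds `size` samples."""
    if size <= 1:
        return 0.0
    if size == 2:
        return 1.0
    return 2.0 * (math.log(size - 1) + EULER_GAMMA) - 2.0 * (size - 1) / size


def build_tree(rows, depth, max_depth, rng):
    # Terminate on a single sample, identical samples, or the height limit.
    if len(rows) <= 1 or depth >= max_depth or all(r == rows[0] for r in rows):
        return {"size": len(rows)}
    dim = rng.randrange(len(rows[0]))          # random dimension
    column = [r[dim] for r in rows]
    split = rng.uniform(min(column), max(column))  # random value in [min, max]
    left = [r for r in rows if r[dim] < split]
    right = [r for r in rows if r[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def path_length(node, row, depth=0):
    if "size" in node:
        return depth + average_path_length(node["size"])
    child = node["left"] if row[node["dim"]] < node["split"] else node["right"]
    return path_length(child, row, depth + 1)


def prune_detectors(score_matrix, n_itree=100, psi=256, r_d=0.10, seed=0):
    """Return the row indices (base detectors) kept after screening."""
    rng = random.Random(seed)
    m = len(score_matrix)
    psi = min(psi, m)
    max_depth = math.ceil(math.log2(psi)) if psi > 1 else 0
    trees = [build_tree(rng.sample(score_matrix, psi), 0, max_depth, rng)
             for _ in range(n_itree)]
    # Assumption: forest output = mean path length (smaller = more anomalous).
    outputs = [sum(path_length(t, row) for t in trees) / n_itree
               for row in score_matrix]
    n_remove = round(r_d * m)
    flagged = set(sorted(range(m), key=lambda r: outputs[r])[:n_remove])
    return [r for r in range(m) if r not in flagged]
```

Each row passed to `prune_detectors` is the Z-score vector of one base detector, so a detector whose outputs differ strongly from those of the others is isolated after very few splits and is removed from the pool.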

4. The power dispatching monitoring data anomaly detection method based on dynamic and static selection integration according to claim 1, wherein in the step (3), an averaging method is used to generate false true values of the historical data from the outputs of the remaining base detectors, and the false true values and the outputs of the base detectors are respectively converted into class II labels, specifically:

Denote the vector composed of the Z scores output by all m′ base detectors in P_F on the j-th historical datum $x_j^{TR}$ of the training set X_TR as $Z_j = (z_{1,j}, z_{2,j}, \ldots, z_{m',j})$. The average of all Z scores in $Z_j$ is taken as the false true value $PScore_j$ of $x_j^{TR}$:

$$PScore_j = \frac{1}{m'} \sum_{a=1}^{m'} z_{a,j}$$

The false true values of all historical data in X_TR are sorted in descending order; the threshold PScore_thr is the minimum of the top R_GA% of the sorted false true values, where R_GA% is the set false true value conversion ratio, generally 20%. If the false true value $PScore_j$ of the j-th historical datum $x_j^{TR}$ is greater than or equal to PScore_thr, its false label $PL_j$ is 1; otherwise it is 0. The false label set corresponding to all historical data in the training set X_TR is $PL_{TR} = \{PL_1, PL_2, \ldots, PL_n\}$.

If the Z score $z_{a,j}$ output by the a-th base detector in P_F on the historical datum $x_j^{TR}$ is greater than or equal to its classification threshold $thr_a$, the class II label $L_{a,j}$ that it outputs on $x_j^{TR}$ is 1; otherwise it is 0. Denote the class II labels output by the a-th base detector on the training set X_TR as $L_a = \{L_{a,1}, L_{a,2}, \ldots, L_{a,n}\}$; the class II label set output by all base detectors on X_TR is $L_{TR} = \{L_1, L_2, \ldots, L_{m'}\}$.
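A minimal sketch of this labelling step is given below. The function names and the toy Z scores are hypothetical, and the count of values in the top R_GA% is rounded to the nearest integer with a minimum of one, which is an assumption.

```python
def false_true_values(z):
    """z[a][j] is the Z score of remaining base detector a on historical datum j."""
    m, n = len(z), len(z[0])
    return [sum(z[a][j] for a in range(m)) / m for j in range(n)]


def false_labels(pscore, r_ga=0.20):
    """Label 1 for the top r_ga share of false true values, else 0."""
    k = max(1, round(r_ga * len(pscore)))  # assumption: round, at least 1
    pscore_thr = sorted(pscore, reverse=True)[k - 1]
    return pscore_thr, [1 if p >= pscore_thr else 0 for p in pscore]


def class_two_labels(z_row, thr):
    """Class II labels of one base detector given its classification threshold."""
    return [1 if v >= thr else 0 for v in z_row]


# Toy example: 2 remaining base detectors, 5 historical data.
z = [[0.0, 0.0, 0.0, 0.0, 3.0],
     [0.0, 1.0, 0.0, 0.0, 5.0]]
pscore = false_true_values(z)            # [0.0, 0.5, 0.0, 0.0, 4.0]
pscore_thr, pl = false_labels(pscore)    # threshold 4.0, labels [0, 0, 0, 0, 1]
```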

5. The power dispatching monitoring data anomaly detection method based on dynamic and static selection integration according to claim 1, wherein in the step (4), historical data whose false true values are too small are removed, and the meta-features and meta-tags of the base detectors on the remaining historical data are extracted, specifically:

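The false true values of all historical data are sorted in ascending order, and the historical data corresponding to the first R_S% of the sorted false true values are removed. Denote the remaining n′ historical data as X_STR; the corresponding false label set and class II label set are denoted $PL_{STR}$ and $L_{STR}$, respectively, and the Z scores output by the remaining base detectors on X_STR are denoted $Z_{STR}$.

For the t-th historical datum $x_t^{STR}$ of X_STR, its Euclidean distance to the j-th historical datum $x_j^{TR}$ of the original training set X_TR is calculated:

$$d(x_t^{STR}, x_j^{TR}) = \sqrt{\sum_{l=1}^{u} \left(x_{t,l}^{STR} - x_{j,l}^{TR}\right)^2}$$

where t = 1, 2, ..., n′; l = 1, 2, ..., u; u is the dimension of the historical data; $x_{t,l}^{STR}$ is the value of $x_t^{STR}$ in the l-th dimension, and $x_{j,l}^{TR}$ is the value of $x_j^{TR}$ in the l-th dimension;

according to these distances, the K_RC historical data of the original training set X_TR closest to $x_t^{STR}$ form the performance evaluation set of $x_t^{STR}$. Generally, 10 ≤ K_RC ≤ 30;

For $x_t^{STR}$, denote the Z scores output on it by all base detectors in P_F as $Z_t^{STR} = (z_{1,t}^{STR}, \ldots, z_{m',t}^{STR})$. For $x_j^{TR}$, denote the Z scores output on it by all base detectors in P_F as $Z_j = (z_{1,j}, \ldots, z_{m',j})$. The Euclidean distance between $Z_t^{STR}$ and $Z_j$ is calculated:

$$d(Z_t^{STR}, Z_j) = \sqrt{\sum_{a=1}^{m'} \left(z_{a,t}^{STR} - z_{a,j}\right)^2}$$

where $z_{a,t}^{STR}$ is the Z score output by the a-th base detector in P_F on $x_t^{STR}$, and $z_{a,j}$ is the Z score output by the a-th base detector in P_F on $x_j^{TR}$;

according to these distances, the K_SOP historical data of X_TR whose Z-score vectors are closest to $Z_t^{STR}$ form the approximate output set of $x_t^{STR}$. Generally, 10 ≤ K_SOP ≤ 30;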
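The two neighbour sets can be sketched as nearest-neighbour searches in two different spaces: the raw feature space for the performance evaluation set, and the space of base detector Z scores for the approximate output set. The function names and toy data below are hypothetical. Whether a datum that itself belongs to X_TR is excluded from its own neighbourhood is not stated in the text; this sketch does not exclude it.

```python
import math


def nearest_indices(query, candidates, k):
    """Indices of the k candidates closest to `query` in Euclidean distance."""
    order = sorted(range(len(candidates)),
                   key=lambda j: math.dist(query, candidates[j]))
    return order[:k]


def neighbour_sets(x_t, z_t, x_tr, z_tr, k_rc, k_sop):
    """Return (performance evaluation set, approximate output set) as index lists.

    x_tr[j] is the j-th historical datum; z_tr[j] is the vector of Z scores
    that the remaining base detectors output on it.
    """
    performance_set = nearest_indices(x_t, x_tr, k_rc)          # feature space
    approximate_output_set = nearest_indices(z_t, z_tr, k_sop)  # output space
    return performance_set, approximate_output_set


# Toy example with 4 historical data, 2 features and 2 base detectors.
x_tr = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]]
z_tr = [[0.0, 0.1], [2.0, 2.0], [0.2, 0.0], [3.0, 3.0]]
rc, sop = neighbour_sets([0.9, 0.1], [0.0, 0.0], x_tr, z_tr, k_rc=2, k_sop=2)
print(rc, sop)  # prints [1, 0] [0, 2]
```

The example shows that the two sets may differ: datum 2 is far from the query in feature space but produces similar base detector outputs.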

The following six groups of meta-features of the a-th base detector in P_F on the t-th historical datum of X_STR are extracted:

1) computing in a performance evaluation set

2) in the approximate output set, the number of historical data for which the class II label output by the base detector is the same as the corresponding false label is counted, and the ratio of this number to K_SOP is taken as a feature; this group includes 1 feature;

3) for performance evaluation set

4) for approximate output set

5) set of computational performance evaluations

6) the absolute value of the difference between the Z score output by the base detector on the datum whose meta-features are being extracted and the classification threshold of the base detector itself is computed; this group includes 1 feature;

The six groups contain M meta-features in total, where M = 3 + 2 × K_RC + K_SOP. The meta-features of each base detector in P_F on each historical datum in X_STR are extracted by the above method to form the meta-feature set X_TRM, which contains n′ × m′ pieces of meta-feature data;

The class II label output by the a-th base detector in P_F on the t-th historical datum of X_STR is compared with the false label of that datum; if they are the same, the meta-tag of the a-th base detector on that datum is 0, indicating that the a-th base detector can correctly judge that datum.
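Two of the meta-feature groups and the meta-tag can be sketched as follows. Only the agreement ratio on a neighbour set and the distance to the classification threshold are shown; the remaining groups are not sketched. The function names are hypothetical, and assigning the meta-tag 1 when the labels differ is an assumption based on the random forest output being 0 or 1.

```python
def agreement_ratio(class2, false_label, neighbours):
    """Share of neighbours on which the detector's class II label equals the false label."""
    same = sum(1 for j in neighbours if class2[j] == false_label[j])
    return same / len(neighbours)


def threshold_margin(z_value, thr):
    """Absolute difference between a detector's Z score and its own threshold."""
    return abs(z_value - thr)


def meta_tag(class2_t, false_label_t):
    """0 if the detector agrees with the false label, otherwise 1 (assumed)."""
    return 0 if class2_t == false_label_t else 1


# Toy example: labels of one base detector on 4 historical data.
class2 = [1, 0, 1, 1]
false_label = [1, 1, 1, 0]
ratio = agreement_ratio(class2, false_label, neighbours=[0, 1, 2])  # 2 of 3 agree
tags = [meta_tag(c, f) for c, f in zip(class2, false_label)]        # [0, 1, 0, 1]
```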

6. The power dispatching monitoring data anomaly detection method based on dynamic and static selection integration according to claim 1, wherein in the step (5), a random forest is trained through meta-features and meta-tags, and specifically comprises the following steps:

A random forest consisting of n_dtree decision trees is trained using the meta-feature set X_TRM and the meta-tag set L_TRM, where n_dtree is generally 100. When constructing a decision tree, N pieces of data are sampled uniformly with replacement from X_TRM as the training sample of this decision tree, where generally N = n′ × m′; for each decision tree sample, M′ dimensions are randomly taken from the M dimensions, typically

The optimal division dimension and division point are selected among the chosen M′ dimensions according to the Gini index, and the samples are split in two: samples smaller than the division point in that dimension are assigned to the left of the node, and samples greater than or equal to it are assigned to the right, which yields a splitting condition and the data sets on the left and right sides. The above process is repeated on the left and right data sets until a data set comprises only one sample or the meta-tags of all its samples are the same. All trained decision trees form the random forest RFC, whose output is a class II label 0 or 1 indicating whether the corresponding base detector can correctly judge the corresponding data.
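The two building blocks of this claim, sampling with replacement and Gini-based splitting, can be sketched with the standard library. This is not a full random forest: it shows the split search on a single dimension only, the candidate division points are restricted to the observed values, which is a simplifying assumption, and the function names are hypothetical.

```python
import random


def bootstrap_indices(n_rows, n_samples, rng):
    """Uniform sampling with replacement of row indices."""
    return [rng.randrange(n_rows) for _ in range(n_samples)]


def gini(tags):
    """Gini index of a list of 0/1 meta-tags."""
    if not tags:
        return 0.0
    p1 = sum(tags) / len(tags)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(values, tags):
    """Best division point on one dimension; samples < point go left."""
    best_point, best_score = None, float("inf")
    for point in sorted(set(values))[1:]:  # assumption: observed values only
        left = [y for v, y in zip(values, tags) if v < point]
        right = [y for v, y in zip(values, tags) if v >= point]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(tags)
        if score < best_score:
            best_point, best_score = point, score
    return best_point, best_score


print(best_split([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))  # prints (0.8, 0.0)
```

In a complete tree, this search would be run over each of the M′ randomly chosen dimensions and the dimension with the lowest weighted Gini index would be used for the split.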

7. The power dispatching monitoring data anomaly detection method based on dynamic and static selection integration according to claim 1, wherein in the step (6), the meta-features of the base detectors on the data to be detected are extracted, the meta-features are input into a random forest, the base detectors are selected according to the output of the random forest, the maximum value of the output of the selected base detectors is taken as the detection result of the data to be detected, and the power dispatching monitoring data anomaly detection is realized, and specifically:

For the data to be detected x_TE, the M meta-features of each base detector in P_F on x_TE are extracted by the same method as in step (4) to form the detection meta-feature set X_TEM; X_TEM is input into the random forest RFC trained in step (5) to obtain the detection meta-tag set L_TEM containing m′ class II labels;

For each base detector in P_F, if its corresponding detection meta-tag is 0, meaning that the base detector is considered able to correctly judge the data to be detected, it is added to the selected base detector pool P_S. The maximum of the Z scores output by all base detectors in P_S on x_TE is taken as the detection result of x_TE; the maximum of the classification thresholds of all base detectors in P_S is taken as the detection threshold of the current detection; data to be detected x_TE whose detection result is greater than or equal to the detection threshold are judged to be anomalous data, thereby realizing anomaly detection of power dispatching monitoring data.
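The final selection and decision can be sketched as below. The function name and the toy values are hypothetical. The text does not say what happens when no base detector receives the meta-tag 0; falling back to all base detectors in that case is an assumption of this sketch.

```python
def detect(z_te, thresholds, meta_tags):
    """Return (detection result, detection threshold, is_anomalous) for one datum.

    z_te[a]       Z score of base detector a on the data to be detected
    thresholds[a] classification threshold of base detector a
    meta_tags[a]  random forest output for base detector a (0 = trusted)
    """
    selected = [a for a, tag in enumerate(meta_tags) if tag == 0]
    if not selected:
        # Assumption: with no trusted detector, use the whole pool.
        selected = list(range(len(z_te)))
    result = max(z_te[a] for a in selected)
    threshold = max(thresholds[a] for a in selected)
    return result, threshold, result >= threshold


# Toy example with three base detectors.
print(detect([0.5, 2.0, 3.5], [1.0, 1.5, 4.0], [1, 0, 1]))  # prints (2.0, 1.5, True)
print(detect([0.5, 2.0, 3.5], [1.0, 1.5, 4.0], [1, 0, 0]))  # prints (3.5, 4.0, False)
```

The second call shows the effect of taking both maxima over the selected pool: adding the third detector raises the detection threshold to 4.0, so the same datum is no longer judged anomalous.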