CN111783904A - Data anomaly analysis method, device, equipment and medium based on environmental data - Google Patents

Data anomaly analysis method, device, equipment and medium based on environmental data

Info

Publication number
CN111783904A
CN111783904A (application CN202010919414.XA)
Authority
CN
China
Prior art keywords
abnormal
environmental data
decision tree
data
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010919414.XA
Other languages
Chinese (zh)
Other versions
CN111783904B (en)
Inventor
张兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd filed Critical Ping An International Smart City Technology Co Ltd
Priority to CN202010919414.XA priority Critical patent/CN111783904B/en
Publication of CN111783904A publication Critical patent/CN111783904A/en
Application granted granted Critical
Publication of CN111783904B publication Critical patent/CN111783904B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/24323 Tree-organised classifiers (under G06F18/00 Pattern recognition; G06F18/24 Classification techniques)
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N20/00 Machine learning
    • G06Q50/26 Government or public services (under G06Q50/00 ICT specially adapted for business processes of specific sectors)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Tourism & Hospitality (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Educational Administration (AREA)
  • General Health & Medical Sciences (AREA)
  • Development Economics (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Mathematical Physics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

The application relates to the field of machine learning in artificial intelligence, and discloses a data anomaly analysis method, device, equipment and medium based on environmental data. The method comprises the following steps: acquiring environmental data to be analyzed and monitoring items; performing abnormal feature extraction on the environmental data to be analyzed based on the monitoring items to obtain a target abnormal feature set; inputting the target abnormal feature set into an abnormal classification model for abnormal classification, wherein the abnormal classification model is obtained by decision-tree-based training; and acquiring an abnormal classification result output by the abnormal classification model, wherein the abnormal classification result indicates the abnormal category of the environmental data to be analyzed. The anomalies in the environmental data to be analyzed are thereby analyzed in a reasoned manner.

Description

Data anomaly analysis method, device, equipment and medium based on environmental data
Technical Field
The present application relates to the field of machine learning in artificial intelligence, and in particular, to a method, an apparatus, a device, and a medium for analyzing data anomalies based on environmental data.
Background
Pollution from enterprises has long been a difficult problem for every city. Governments invest large amounts of manpower and material resources every year to supervise enterprise pollution sources: various environment monitoring devices are installed at enterprises to monitor the waste water and waste gas generated during production, and the monitoring data are uploaded to a supervision platform in real time. A large number of anomalies arise while the monitoring data are collected and uploaded, for example unstable transmission, data loss, data unchanged for a long time, data exceeding the measurement range, abnormal data identification, and the like.
Disclosure of Invention
The application mainly aims to provide a data anomaly analysis method, device, equipment and medium based on environmental data, so as to solve the technical problem that existing supervision platforms cannot reasonably analyze anomalies in the received monitoring data.
In order to achieve the above object, the present application provides a data anomaly analysis method based on environmental data, the method including:
acquiring environmental data to be analyzed and monitoring items;
performing abnormal feature extraction on the environmental data to be analyzed based on the monitoring items to obtain a target abnormal feature set;
inputting the target abnormal feature set into an abnormal classification model for abnormal classification, wherein the abnormal classification model is a model obtained based on decision tree training;
and acquiring an abnormal classification result output by the abnormal classification model, wherein the abnormal classification result is used for expressing the abnormal category of the environmental data to be analyzed.
Further, the step of acquiring the environmental data to be analyzed includes:
acquiring a data message monitored by environment monitoring equipment;
and analyzing according to the data message to obtain the environmental data to be analyzed.
Further, before the step of inputting the target abnormal feature set into an abnormal classification model for abnormal classification, wherein the abnormal classification model is obtained by decision-tree-based training, the method further includes:
obtaining a training sample set, the training sample set comprising a plurality of environmental data training samples, the environmental data training samples comprising: at least one abnormal characteristic sample value and an abnormal classification calibration value;
carrying out recursive division by adopting a CART algorithm (Classification and Regression Tree algorithm) according to the plurality of environmental data training samples to establish a CART decision tree;
carrying out random pruning and constant determination on the CART decision tree to obtain a plurality of sub decision trees to be trained, wherein each sub decision tree to be trained corresponds to a target constant;
obtaining a verification sample set, wherein the verification sample set comprises a plurality of environment data verification samples;
and determining the abnormal classification model according to the plurality of environmental data verification samples and the plurality of sub-decision trees to be trained.
Further, the step of establishing the CART decision tree by performing recursive partitioning by using a CART algorithm according to the plurality of environmental data training samples includes:
selecting an independent variable Xi and determining a value Vi according to the independent variable Xi, and dividing the n-dimensional space into two parts, wherein all points of one part satisfy Xi ≤ Vi and all points of the other part satisfy Xi > Vi; for non-continuous variables, each abnormal feature takes only two values: yes or no;
and reselecting one abnormal feature from the two parts to continue the division, using the Gini index as the division standard, stopping tree building when a stopping condition is met, and using the built binary tree as the CART decision tree, wherein the stopping condition is: the number of environmental data training samples at a leaf node is 1, or the samples belong to the same abnormal class.
Further, the step of reselecting an abnormal feature from the two parts to continue the division, using the Gini index as the division standard, stopping tree building when the stopping condition is met, and using the built binary tree as the CART decision tree includes:
taking each part of the two parts as a node to be divided;
taking each abnormal feature of the nodes to be divided as an abnormal feature to be divided;
performing a Gini index calculation according to the value of the abnormal feature to be divided and all splitting points corresponding to the value of the abnormal feature to be divided to obtain a plurality of splitting Gini indexes;
determining an optimal splitting point according to the plurality of splitting Gini indexes, and taking the splitting Gini index corresponding to the optimal splitting point as the optimal splitting Gini index;
determining the optimal abnormal feature according to the optimal splitting Gini indexes of all the abnormal features to be divided;
generating two sub-nodes from the node to be divided according to the optimal abnormal feature;
dividing all the environmental data training samples of the node to be divided into the two child nodes according to the optimal abnormal feature and the optimal splitting point of the optimal abnormal feature;
and when the stopping condition is met, taking the established binary tree as the CART decision tree, otherwise, taking the two child nodes as the two parts, and executing the step of taking each part of the two parts as a node to be divided.
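The split-selection loop above (Gini calculation per feature, optimal split, two child nodes) can be sketched for the yes/no abnormal features the text describes. This is a minimal illustration under assumed function and feature names, not the patent's implementation:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_gini(samples, feature):
    """Weighted Gini index of splitting `samples` on a yes/no feature.

    samples -- list of (features: dict, label) pairs
    """
    yes = [y for f, y in samples if f[feature] == "yes"]
    no = [y for f, y in samples if f[feature] == "no"]
    n = len(samples)
    return len(yes) / n * gini(yes) + len(no) / n * gini(no)

def best_feature(samples, features):
    """The optimal abnormal feature: the one with the smallest split Gini index."""
    return min(features, key=lambda f: split_gini(samples, f))

# Hypothetical training samples (feature names and labels are illustrative):
samples = [
    ({"data_lost": "yes", "over_range": "no"}, "equipment_failure"),
    ({"data_lost": "yes", "over_range": "no"}, "equipment_failure"),
    ({"data_lost": "no", "over_range": "yes"}, "data_exception"),
    ({"data_lost": "no", "over_range": "no"}, "data_exception"),
]
best = best_feature(samples, ["data_lost", "over_range"])  # "data_lost" splits purely
```

Splitting on the chosen feature and recursing on each child node until the stopping condition reproduces the binary tree construction described above.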
Further, the step of performing random pruning and constant determination on the CART decision tree to obtain a plurality of sub decision trees to be trained, where each sub decision tree to be trained corresponds to one target constant includes:
randomly pruning the CART decision tree to obtain a plurality of sub decision trees to be trained;
determining a constant of the sub-decision tree to be trained to obtain a target constant corresponding to the sub-decision tree to be trained;
pruning an internal node t of the sub-decision tree to be trained, wherein the loss function of t regarded as a single-node tree is
Cα(t) = C(t) + α,
where C(t) is the prediction error of node t on the training samples and α is a variable;
the loss function of the pruned subtree Tt with t as its root node is
Cα(Tt) = C(Tt) + α|Tt|,
where C(Tt) is the prediction error of Tt on the training sample set and |Tt| is the number of leaf nodes of Tt;
when the loss function of t as a single-node tree equals the loss function of the subtree Tt rooted at t, that is, C(t) + α = C(Tt) + α|Tt|, the value of the variable α is taken as the target constant corresponding to the sub-decision tree to be trained.
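Setting the two loss functions equal, C(t) + α = C(Tt) + α|Tt|, solves to α = (C(t) − C(Tt)) / (|Tt| − 1), the standard cost-complexity pruning constant. A minimal sketch of this computation (function name assumed):

```python
def pruning_alpha(c_node: float, c_subtree: float, n_leaves: int) -> float:
    """Solve C(t) + alpha = C(T_t) + alpha * |T_t| for alpha.

    c_node    -- prediction error C(t) of t collapsed to a single leaf
    c_subtree -- prediction error C(T_t) of the subtree rooted at t
    n_leaves  -- number of leaf nodes |T_t| of that subtree (must be > 1)
    """
    return (c_node - c_subtree) / (n_leaves - 1)

# A subtree with 3 leaves and error 0.05, versus error 0.20 when collapsed:
alpha = pruning_alpha(0.20, 0.05, 3)  # (0.20 - 0.05) / (3 - 1) = 0.075
```

Intuitively, a small α means pruning the subtree away costs little extra error per removed leaf, so the subtree is a good pruning candidate.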
Further, the step of performing random pruning and constant determination on the CART decision tree to obtain a plurality of sub decision trees to be trained, where each sub decision tree to be trained corresponds to one target constant, further includes:
carrying out random pruning and constant determination on the CART decision tree to obtain a plurality of sub-decision trees to be trained and a cut environmental data training sample set, wherein each sub-decision tree to be trained corresponds to a target constant; and
the obtaining a set of validation samples includes:
and removing the cut environmental data training sample set from the training sample set, and taking the remaining part as the verification sample set.
The application also provides a data anomaly analysis device based on environmental data, the device includes:
the abnormal feature extraction module is used for acquiring environmental data to be analyzed and monitoring items, and extracting abnormal features of the environmental data to be analyzed based on the monitoring items to obtain a target abnormal feature set;
and the anomaly classification module is used for inputting the target abnormal feature set into an abnormal classification model for abnormal classification, wherein the abnormal classification model is obtained by decision-tree-based training, and for acquiring the abnormal classification result output by the abnormal classification model, the result indicating the abnormal category of the environmental data to be analyzed.
The present application also proposes a computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of any of the above-mentioned methods when executing the computer program.
The present application also proposes a computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of the above.
According to the data anomaly analysis method, device, equipment and medium based on environmental data, a target abnormal feature set is obtained by feature extraction from the environmental data to be analyzed, the target abnormal feature set is input into the abnormal classification model for abnormal classification, and the abnormal classification result output by the model is obtained. The anomaly in the environmental data to be analyzed is clearly indicated by the abnormal classification result, and because the abnormal classification model is obtained by decision-tree-based training, the accuracy of the abnormal classification result is improved, so that the anomalies in the environmental data to be analyzed are reasonably analyzed.
Drawings
FIG. 1 is a schematic flow chart illustrating a method for analyzing data anomalies based on environmental data according to an embodiment of the present application;
FIG. 2 is a block diagram schematically illustrating a structure of an apparatus for analyzing data anomalies based on environmental data according to an embodiment of the present application;
fig. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, an embodiment of the present application provides a method for analyzing data anomalies based on environmental data, where the method includes:
s1: acquiring environmental data to be analyzed and monitoring items;
s2: performing abnormal feature extraction on the environmental data to be analyzed based on the monitoring items to obtain a target abnormal feature set;
s3: inputting the target abnormal feature set into an abnormal classification model for abnormal classification, wherein the abnormal classification model is a model obtained based on decision tree training;
s4: and acquiring an abnormal classification result output by the abnormal classification model, wherein the abnormal classification result is used for expressing the abnormal category of the environmental data to be analyzed.
The anomaly categories of the environmental data include, but are not limited to: equipment failure class, data exception class and data superstandard class.
The device failure class refers to environmental data anomalies caused by device failure, where device failure includes but is not limited to environment monitoring device failure, network failure, and environmental data receiving device failure. The abnormal features of the device failure class include, but are not limited to: unstable transmission and data loss. Unstable transmission refers to instability in the speed at which environmental data is transmitted from the environment monitoring device to the environmental data receiving device. Data loss refers to partial or complete loss of the environmental data.
The data exception class is an environmental data anomaly that results from the data itself not meeting expected requirements. The abnormal features of the data exception class include, but are not limited to: data unchanged for a long time, data exceeding the range, and abnormal data identification. Data unchanged for a long time means that the value of the same type of data has not changed for longer than a preset duration. Data exceeding the range means that the environmental data exceeds the measurement range; for example, the pH value lies in the range 0-14, so a value outside 0-14 is out of range. Abnormal data identification means that the value of an identification bit in the environmental data is not within a preset identification range.
The data superstandard class is an environmental data anomaly resulting from a value exceeding an expected standard. The abnormal features of the data superstandard class include, but are not limited to: conventional pollutants slightly exceeding the standard, conventional pollutants severely exceeding the standard, and heavy metals or similar pollutants exceeding the standard. Slightly exceeding the standard means that the conventional-pollutant value in the environmental data exceeds the slight-over-standard threshold but is lower than the severe-over-standard threshold. Severely exceeding the standard means that the conventional-pollutant value is greater than or equal to the severe-over-standard threshold. Heavy metals and similar pollutants exceeding the standard means that the heavy-metal or similar-pollutant value exceeds its over-standard threshold.
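The slight/severe over-standard sub-categories described above amount to a two-threshold check. The sketch below illustrates that logic only; the threshold values and names are hypothetical placeholders, not values from the patent:

```python
# Assumed placeholder thresholds for a conventional pollutant reading:
SLIGHT_LIMIT = 1.0  # slight over-standard threshold (hypothetical)
SEVERE_LIMIT = 2.0  # severe over-standard threshold (hypothetical)

def pollutant_status(value: float) -> str:
    """Map a conventional-pollutant reading onto the categories in the text:
    >= severe threshold -> severe; between the thresholds -> slight."""
    if value >= SEVERE_LIMIT:
        return "severely over standard"
    if value > SLIGHT_LIMIT:
        return "slightly over standard"
    return "within standard"

pollutant_status(1.5)  # falls between the two thresholds
```

Note the boundary handling follows the text: severe over-standard is "greater than or equal to" its threshold, while slight over-standard strictly exceeds the lower one.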
For step S1, the environmental data to be analyzed may be obtained from a database, or may be sent by the environmental monitoring device in real time; and acquiring the monitoring items, wherein the monitoring items can be acquired from a database.
The environmental data to be analyzed is environmental data which needs to be subjected to abnormal classification. The environmental data is obtained from the environment monitored by the environmental monitoring device. The environmental data monitored by the environmental monitoring equipment can be sent to the environmental data receiving equipment by the environmental monitoring equipment in a data message mode.
The monitoring items include, but are not limited to, monitoring items for water and monitoring items for atmosphere.
For step S2, all the abnormal features of the environmental data to be analyzed are extracted, and at least one target abnormal feature is obtained. That is, the number of target anomaly features in the target anomaly feature set includes, but is not limited to, one, two, three, four, five. The target anomaly feature is also an anomaly feature, and the target anomaly feature comprises: exception feature name, value of exception feature. For example, if the name of the abnormal feature is long-term unchanged, the value of the abnormal feature includes: with or without change, for data that is not changed for a long time, "changed" may also be expressed as "no" and "unchanged" may also be expressed as "yes", which are not specifically limited by the examples herein.
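A target abnormal feature as described here pairs a feature name with a yes/no value. A minimal representation, with hypothetical feature names, might encode the set for the classifier as follows:

```python
# Hypothetical target abnormal feature set (names and values are illustrative):
target_feature_set = {
    "unchanged_long_term": "yes",  # data unchanged beyond the preset duration
    "data_lost": "no",
    "over_range": "no",
}

def as_binary_vector(features: dict, order: list) -> list:
    """Encode yes/no feature values as 1/0 in a fixed feature order."""
    return [1 if features[name] == "yes" else 0 for name in order]

vec = as_binary_vector(
    target_feature_set, ["unchanged_long_term", "data_lost", "over_range"]
)
# vec == [1, 0, 0]
```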
For the decision-tree-based abnormal classification model, a training method may be selected from the prior art. Alternatively, a training sample set may be obtained, a CART decision tree established according to it, the CART decision tree randomly pruned to generate a plurality of sub-decision trees to be trained, the sub-decision trees verified with a verification sample set, and the abnormal classification model determined according to the verification result. Because the abnormal classification model is obtained by decision-tree-based training, it has strong generalization capability.
Decision trees are a very common classification method and a form of supervised learning. In supervised learning, a set of samples is given, each sample having a set of attributes and a predetermined class; a classifier is then learned from these samples that can assign the correct class to newly appearing objects.
For step S3, all the target abnormal feature sets are input into an abnormal classification model for abnormal classification, and the abnormal classification model outputs an abnormal classification result. The anomaly classification result is an anomaly class of the environmental data.
For step S4, obtaining an anomaly classification result output by the anomaly classification model, and taking the anomaly classification result as a result of anomaly classification of the environmental data to be analyzed.
The abnormal classification result comprises the name of the abnormal category and the value of the abnormal category. For example, the names of the abnormal categories include the device failure class, the data exception class, and the data superstandard class, which is not specifically limited by the examples herein.
According to the method, a target abnormal feature set is obtained by feature extraction from the environmental data to be analyzed, the target abnormal feature set is input into the abnormal classification model for abnormal classification, and the abnormal classification result output by the model is obtained. The abnormal classification result clearly shows the anomaly in the environmental data to be analyzed, and because the abnormal classification model is obtained by decision-tree-based training, the accuracy of the result is improved, so that the anomalies in the environmental data to be analyzed are reasonably analyzed.
In an embodiment, the step of obtaining the environmental data to be analyzed includes:
s11: acquiring a data message monitored by environment monitoring equipment;
s12: and analyzing according to the data message to obtain the environmental data to be analyzed.
For step S11, the environment monitoring device communicates with the environmental data receiving device through the 212 communication protocol (likely the Chinese HJ 212 data transmission standard for pollutant online monitoring), packages the monitored environmental data in a data message, and sends the data message to the environmental data receiving device based on that protocol. It is understood that the environment monitoring device and the environmental data receiving device may also communicate through other protocols, such as TCP (Transmission Control Protocol), which is not specifically limited by the examples herein.
In step S12, the data message is parsed to obtain the environmental data monitored by the environment monitoring device; that is, the environmental data to be analyzed is the environmental data carried in the message. For example, in the data message "##0307ST=32, CN=2011, QN=20200714141100011, PW=123456, MN=WWSZ0003060095, CP=&&DataTime=20200714141100, 060-Rtd=1.150, 060-Flag=N, 011-Rtd=9.1, 011-Flag=N, B01-Rtd=0.275, B01-Flag=N, B01TOTAL-Rtd=8462.000, B01TOTAL-Flag=N, 101-Rtd=0.0140, 101-Flag=N, 029-Rtd=0.0070, 029-Flag=N, 028-Rtd=0.0471, 028-Flag=N, 001-Rtd=7.379, 001-Flag=N&&C301", the data between the two "&&" markers is the environmental data monitored by the environment monitoring device, and parsing the message yields the environmental data to be analyzed: "DataTime=20200714141100, 060-Rtd=1.150, 060-Flag=N, 011-Rtd=9.1, 011-Flag=N, B01-Rtd=0.275, B01-Flag=N, B01TOTAL-Rtd=8462.000, B01TOTAL-Flag=N, 101-Rtd=0.0140, 101-Flag=N, 029-Rtd=0.0070, 029-Flag=N, 028-Rtd=0.0471, 028-Flag=N, 001-Rtd=7.379, 001-Flag=N", which is not specifically limited by way of example.
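The parsing step can be sketched as extracting the key/value fields between the two "&&" markers of such a message. This is an illustrative simplification with an assumed function name, not the patent's parser:

```python
def parse_message(message: str) -> dict:
    """Return the key/value pairs found between the two '&&' markers."""
    start = message.index("&&") + 2
    end = message.index("&&", start)
    payload = message[start:end]  # e.g. "DataTime=...,060-Rtd=1.150,..."
    fields = {}
    for item in payload.split(","):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields

# Shortened example message in the same style as the one shown above:
msg = ("##0307ST=32,CN=2011,QN=20200714141100011,PW=123456,MN=WWSZ0003060095,"
       "CP=&&DataTime=20200714141100,060-Rtd=1.150,060-Flag=N,"
       "011-Rtd=9.1,011-Flag=N&&C301")
data = parse_message(msg)
# data["060-Rtd"] == "1.150", data["011-Flag"] == "N"
```

The "NNN-Rtd" values (real-time readings) and "NNN-Flag" markers extracted this way form the raw material for the abnormal feature extraction of step S2.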
In an embodiment, before the step of inputting the target abnormal feature set into an abnormal classification model for abnormal classification, wherein the abnormal classification model is obtained by decision-tree-based training, the method further includes:
s31: obtaining a training sample set, the training sample set comprising a plurality of environmental data training samples, the environmental data training samples comprising: at least one abnormal characteristic sample value and an abnormal classification calibration value;
s32: performing recursive division by adopting a CART algorithm according to the plurality of environmental data training samples to establish a CART decision tree;
s33: carrying out random pruning and constant determination on the CART decision tree to obtain a plurality of sub decision trees to be trained, wherein each sub decision tree to be trained corresponds to a target constant;
s34: obtaining a verification sample set, wherein the verification sample set comprises a plurality of environment data verification samples;
s35: and determining the abnormal classification model according to the plurality of environmental data verification samples and the plurality of sub-decision trees to be trained.
For step S31, performing abnormal feature extraction on each historical environmental data to obtain an abnormal feature sample value corresponding to the historical environmental data; and calibrating each historical environment data to obtain an abnormal classification calibration value corresponding to the historical environment data.
The abnormal feature sample value comprises the name and the sample value of the abnormal feature. The abnormal classification calibration value comprises an abnormal class name and a calibration result.
It will be appreciated that there may be multiple abnormal classification calibration values. For example, when the abnormal categories of the environmental data are the three classes of device failure, data exception and data superstandard, there are three abnormal classification calibration values: the device failure class calibration value, the data exception class calibration value and the data superstandard class calibration value, which is not specifically limited by this example.
The CART algorithm is a classification and regression tree algorithm and a binary recursive partitioning technique: the current sample set is divided into two sub-sample sets, and each generated non-leaf node has two branches, so the decision tree generated by the CART algorithm is a binary tree with a concise structure.
For step S32, the method specifically includes: and performing recursive division by adopting a CART algorithm according to all the environmental data training samples to establish a CART decision tree. That is, the CART decision tree is a binary tree with a compact structure.
For step S33, the method specifically includes: randomly pruning the CART decision tree, and taking the residual part of the CART decision tree after pruning as a sub-decision tree to be trained; and determining a constant for each sub-decision tree to be trained to obtain a target constant corresponding to the sub-decision tree to be trained.
With respect to step S34, it is understood that the verification sample set may be generated according to historical environmental data, or may be generated according to the training sample set. Like a training sample, each environmental data verification sample comprises: at least one abnormal feature sample value and an abnormal classification calibration value.
For step S35, the method specifically includes: performing a Gini index calculation on the plurality of sub-decision trees to be trained according to the plurality of environmental data verification samples to obtain a plurality of subtree Gini indexes; and determining an optimal sub-decision tree according to the plurality of subtree Gini indexes, and taking the optimal sub-decision tree and its corresponding target constant as the abnormal classification model.
That is, the plurality of environmental data verification samples are input into each sub-decision tree to be trained for Gini index calculation, yielding one subtree Gini index per subtree.
The minimum value is then found among the subtree Gini indexes, and the sub-decision tree to be trained corresponding to that minimum subtree Gini index is taken as the optimal sub-decision tree.
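Selecting the optimal sub-decision tree can be sketched as scoring each candidate on the verification samples and keeping the minimum. In this sketch the misclassification rate stands in for the subtree Gini index, and the (predictor, target constant) representation of a subtree is an assumption, not the patent's data structure:

```python
def validation_score(predict, validation):
    """Misclassification rate of `predict` over (features, label) pairs;
    used here as a stand-in for the subtree Gini index in the text."""
    wrong = sum(1 for features, label in validation if predict(features) != label)
    return wrong / len(validation)

def select_best_subtree(subtrees, validation):
    """Each candidate is a (predict_fn, target_constant) pair; return the
    pair whose predictor scores lowest on the verification samples."""
    return min(subtrees, key=lambda t: validation_score(t[0], validation))

# Two hypothetical pruned sub-decision trees represented as predict functions:
t1 = (lambda f: "equipment_failure" if f["data_lost"] == "yes" else "data_exception", 0.05)
t2 = (lambda f: "data_exception", 0.10)
validation = [
    ({"data_lost": "yes"}, "equipment_failure"),
    ({"data_lost": "no"}, "data_exception"),
]
best_tree, alpha = select_best_subtree([t1, t2], validation)  # t1 scores 0.0 and wins
```

The selected subtree together with its target constant then serves as the abnormal classification model.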
It can be understood that when new abnormal categories are added, the abnormal classification model needs to be retrained through steps S31 to S35, and the retrained abnormal classification model is used for abnormal classification.
Through steps S31 to S35, the abnormal classification model is obtained by decision-tree-based training, giving the trained model strong generalization capability; pruning prevents the decision tree from being divided too finely, and thus prevents the abnormal classification model from overfitting noisy data; and obtaining a plurality of sub-decision trees to be trained by random pruning and then determining the optimal sub-decision tree among them improves the accuracy of the optimal sub-decision tree found, and hence the accuracy of the abnormal classification model.
In an embodiment, the step of building a CART decision tree by performing recursive partitioning with a CART algorithm according to the plurality of environmental data training samples includes:
s321: selecting an independent variable Xi and determining a value Vi according to Xi, thereby dividing the n-dimensional space into two parts: all points of one part satisfy Xi ≤ Vi, and all points of the other part satisfy Xi > Vi. For discrete variables, an abnormal feature takes only two values, which are: yes or no;
s322: reselecting one abnormal feature in each of the two parts and continuing to divide, using the Gini index as the division criterion, and stopping building the tree when a stopping condition is met; the built binary tree is used as the CART decision tree. The stopping condition is: the number of environmental data training samples in a leaf node is 1, or all samples in the node belong to the same abnormal category.
Through steps S321 and S322, the plurality of environmental data training samples are divided among the leaf nodes of the binary tree. Using the Gini index as the division criterion makes each division an optimal one, improving the accuracy of the CART decision tree.
For step S322, specifically: in each of the two parts, the Gini index is used to select the optimal abnormal feature and the optimal split point; that part is then divided according to them, and the division of all nodes is completed recursively, with tree building stopping when the stopping condition is met.
Wherein the number of samples of the environmental data training samples of the leaf node is 1 or the abnormal class belongs to the same class, including: the number of the environmental data training samples of the leaf node is 1, or the environmental data training samples of the leaf node belong to the same abnormal category.
In an embodiment, the step of reselecting one abnormal feature from the two parts to continue partitioning, using the Gini index as the partitioning criterion, stopping building the tree when a stopping condition is met, and using the built binary tree as the CART decision tree includes:
s3221: taking each part of the two parts as a node to be divided;
s3222: taking each abnormal feature of the nodes to be divided as an abnormal feature to be divided;
s3223: performing a Gini index calculation according to the value of the abnormal feature to be divided and all splitting points corresponding to the value of the abnormal feature to be divided to obtain a plurality of splitting Gini indexes;
s3224: determining an optimal split point according to the plurality of splitting Gini indexes, and taking the splitting Gini index corresponding to the optimal split point as the optimal splitting Gini index;
s3225: determining the optimal abnormal feature according to the optimal splitting Gini indexes of all the abnormal features to be divided;
s3226: generating two sub-nodes from the node to be divided according to the optimal abnormal feature;
s3227: dividing all the environment data verification samples of the node to be divided into the two child nodes according to the optimal abnormal feature and the optimal split point of the optimal abnormal feature;
s3228: and when the stopping condition is met, taking the established binary tree as the CART decision tree, otherwise, taking the two child nodes as the two parts, and executing the step of taking each part of the two parts as a node to be divided.
Through steps S3221 to S3228, it is realized that the CART decision tree is constructed by the CART algorithm, so that the obtained CART decision tree is a binary tree with the finest division.
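The recursion of steps S3221 to S3228 can be sketched for purely binary (yes/no) abnormal features as follows. The dictionary-based tree representation and all names are illustrative assumptions; continuous features and per-split-point search are omitted for brevity.

```python
from collections import Counter

def gini(labels):
    # Gini index of a list of abnormal-classification calibration values.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_feature(samples, features):
    # samples: list of (feature_dict, label); pick the feature whose yes/no
    # division gives the minimal weighted Gini index after the split.
    best = None
    for f in features:
        yes = [lab for feat, lab in samples if feat[f] == 1]
        no = [lab for feat, lab in samples if feat[f] == 0]
        if not yes or not no:
            continue  # feature cannot split this node
        n = len(samples)
        g = len(yes) / n * gini(yes) + len(no) / n * gini(no)
        if best is None or g < best[1]:
            best = (f, g)
    return best

def build(samples, features):
    labels = [lab for _, lab in samples]
    # Stopping condition: one sample left, or all samples share a category.
    if len(samples) == 1 or len(set(labels)) == 1 or not features:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    choice = best_feature(samples, features)
    if choice is None:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    f, _ = choice
    rest = [x for x in features if x != f]
    return {
        "feature": f,
        "yes": build([s for s in samples if s[0][f] == 1], rest),
        "no": build([s for s in samples if s[0][f] == 0], rest),
    }
```

Each recursive call corresponds to one pass through S3221–S3228 on a node to be divided.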
For step S3221, specifically, the method includes: and taking each of the two parts as a node to be divided, namely, the number of the obtained nodes to be divided is two.
For step S3222, each abnormal feature in the node to be divided is an abnormal feature to be divided. For example, if the node to be divided contains 3 abnormal features, there are 3 abnormal features to be divided; this example is not limiting.
For step S3223, specifically: for an abnormal feature to be divided that is a continuous variable, each split point is the midpoint of a pair of adjacent feature values. Assuming one abnormal feature to be divided takes m distinct continuous values over the m environmental data training samples, there are m − 1 split points, each being the mean of two adjacent values. The candidate divisions of each abnormal feature to be divided are ranked by impurity reduction, and the split point with the largest impurity reduction is taken as the optimal split point. Impurity is measured by the Gini index; the impurity reduction is defined as the impurity before division minus the sum, over the nodes after division, of each node's impurity multiplied by that node's sample proportion.
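For instance, the m − 1 candidate split points of a continuous feature can be generated as midpoints of adjacent sorted values; this small helper is an illustrative assumption:

```python
def candidate_split_points(values):
    # m distinct continuous values give m - 1 split points, each the mean
    # of two adjacent values after sorting (duplicates add no split point).
    ordered = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
```

Each returned midpoint would then be scored by its impurity reduction to find the optimal split point.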
The Gini index of a node A to be divided is calculated as:
Gini(A) = 1 − Σᵢ₌₁^C pᵢ²
where pᵢ is the probability of belonging to the i-th abnormal category and C is the number of abnormal categories. Gini(A) = 0 when all samples belong to the same category; Gini(A) is largest when all categories occur in the node with equal probability, in which case Gini(A) = (C − 1)/C.
Assuming that a node A to be divided is divided into B and C, where B contains a proportion p of the samples in A and C contains a proportion q, with p + q = 1, the impurity reduction is calculated as: Gini(A) − (p × Gini(B) + q × Gini(C)).
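A minimal sketch of these two formulas, assuming the class probabilities and child Gini indexes are given directly:

```python
def gini(probabilities):
    # Gini(A) = 1 - sum of p_i squared over the C abnormal categories.
    return 1.0 - sum(p * p for p in probabilities)

def impurity_reduction(gini_a, p, gini_b, q, gini_c):
    # Gini(A) - (p * Gini(B) + q * Gini(C)), with p + q = 1.
    return gini_a - (p * gini_b + q * gini_c)
```

Note that for C equally probable categories, `gini` returns (C − 1)/C, the maximum stated above.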
For example, Table 1 shows the 6 environmental data training samples of a node A to be divided, where the abnormal features include: data unchanged for a long time, data exceeds the range, and data identification abnormal; the abnormal classification calibration value is a data-abnormality-type calibration value.
Table 1: environmental data training sample data table (6 samples)
According to the environmental data training samples in Table 1, the samples are divided by the abnormal feature "data unchanged for a long time", as shown in Table 2, and the Gini index after division is calculated as follows:
Table 2: statistics of the environmental data training samples divided by the abnormal feature "data unchanged for a long time"
Gini(t1) = 1 − (2/2)² − (0/2)² = 0, the Gini index of the subset in which the data is unchanged for a long time;
Gini(t2) = 1 − (2/4)² − (2/4)² = 0.5, the Gini index of the subset in which the data is not unchanged for a long time;
Gini = (2/6) × 0 + (4/6) × 0.5 = 0.333, the weighted Gini index of the division.
According to the environmental data training samples in Table 1, the samples are divided by the abnormal feature "data exceeds the range", as shown in Table 3, and the Gini index after division is calculated as follows:
Table 3: statistics of the environmental data training samples divided by the abnormal feature "data exceeds the range"
Gini(t1) = 1 − (1/3)² − (2/3)² = 0.444, the Gini index of the subset in which the data exceeds the range;
Gini(t2) = 1 − (3/3)² − (0/3)² = 0, the Gini index of the subset in which the data does not exceed the range;
Gini = (3/6) × 0.444 + (3/6) × 0 = 0.222, the weighted Gini index of the division.
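The two divisions above can be checked numerically. The per-subset class counts used here are those implied by the calculations in the text (the tables themselves appear only as images in the publication), so they are an inference, not a reproduction of Table 1:

```python
def gini_from_counts(counts):
    # Gini index of a node given its per-category sample counts.
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_gini_after_split(children):
    # children: per-child-node category count tuples; returns the
    # sample-proportion-weighted Gini index after the division.
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini_from_counts(c) for c in children)

# "data unchanged for a long time": t1 -> (2, 0), t2 -> (2, 2)
g_unchanged = weighted_gini_after_split([(2, 0), (2, 2)])  # ~0.333
# "data exceeds the range": t1 -> (1, 2), t2 -> (3, 0)
g_overrange = weighted_gini_after_split([(1, 2), (3, 0)])  # ~0.222
```

The over-range feature yields the smaller weighted Gini index (0.222 < 0.333), so it would be preferred for this division.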
For step S3224, specifically: the minimum value is found among the plurality of splitting Gini indexes, the split point corresponding to that minimum is taken as the optimal split point, and its splitting Gini index is taken as the optimal splitting Gini index.
Step S3225 specifically includes: finding the minimum value among the optimal splitting Gini indexes of all the abnormal features to be divided, and taking the abnormal feature to be divided corresponding to the found optimal splitting Gini index as the optimal abnormal feature.
For step S3228, when the stop condition is satisfied, taking the established binary tree as the CART decision tree; when the stop condition is not satisfied, steps S3221 to S3228 are repeatedly performed, thereby realizing recursive iteration.
In an embodiment, the step of performing random pruning and constant determination on the CART decision tree to obtain a plurality of sub-decision trees to be trained, where each sub-decision tree to be trained corresponds to a target constant includes:
s331: randomly pruning the CART decision tree to obtain a plurality of sub decision trees to be trained;
s332: determining a constant of the sub-decision tree to be trained to obtain a target constant corresponding to the sub-decision tree to be trained;
For any internal node t of the sub-decision tree to be trained, the loss function of the single-node tree consisting of t alone is
Cα(t) = C(t) + α
where C(t) is the prediction error of node t on the training sample set and α is a variable (a complexity penalty);
the loss function of the subtree Tt rooted at t (the subtree that would be pruned) is
Cα(Tt) = C(Tt) + α|Tt|
where C(Tt) is the prediction error of Tt on the training sample set and |Tt| is the number of leaf nodes of Tt;
when the loss function of the single-node tree t equals the loss function of the subtree Tt rooted at t, the value of the variable α is taken as the target constant corresponding to the sub-decision tree to be trained.
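Setting the two loss functions equal and solving for α gives α = (C(t) − C(Tt)) / (|Tt| − 1); a one-line sketch (the argument names are assumptions):

```python
def target_constant(c_node, c_subtree, num_leaves):
    # Solve C(t) + alpha = C(T_t) + alpha * |T_t| for alpha.
    return (c_node - c_subtree) / (num_leaves - 1)
```

For example, a node with prediction error 0.5 whose 4-leaf subtree has prediction error 0.2 yields α = 0.1.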
Pruning prevents the decision tree from being divided too finely, which would cause the abnormal classification model to overfit noise data. The plurality of sub-decision trees to be trained are obtained by random pruning, and the optimal sub-decision tree is then determined from among them, which improves the accuracy of the chosen optimal sub-decision tree and therefore the accuracy of the abnormal classification model.
When α = 0, and while α remains sufficiently small, the inequality
Cα(Tt) < Cα(t)
holds. As α increases, there exists a value of α at which
Cα(Tt) = Cα(t),
that is, C(Tt) + α|Tt| = C(t) + α, from which it can be derived that
α = (C(t) − C(Tt)) / (|Tt| − 1).
At this α, the subtree Tt rooted at t has the same loss function value as the single-node tree t, but t has fewer nodes; therefore Tt can be pruned and replaced by the single node t without affecting the accuracy of the sub-decision tree to be trained obtained after pruning.
In an embodiment, the step of performing random pruning and constant determination on the CART decision tree to obtain a plurality of sub-decision trees to be trained, where each sub-decision tree to be trained corresponds to a target constant further includes:
carrying out random pruning and constant determination on the CART decision tree to obtain a plurality of sub-decision trees to be trained and a pruned-off environmental data training sample set, wherein each sub-decision tree to be trained corresponds to a target constant; and,
the obtaining a set of validation samples includes:
the pruned-off environmental data training samples are removed from the training sample set, and the remaining part is used as the verification sample set.
Removing the pruned-off environmental data training samples from the training sample set and using the remainder as the verification sample set helps improve the accuracy of verification performed with the verification sample set.
Wherein the pruned-off environmental data training sample set consists of the environmental data training samples in all pruned-off leaf nodes.
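A sketch of forming the verification sample set by set difference; identifying samples by their index is an assumption made here, since different samples may share identical feature values:

```python
def verification_set(training_samples, pruned_indices):
    # Keep every training sample whose index is not in a pruned-off leaf.
    pruned = set(pruned_indices)
    return [s for i, s in enumerate(training_samples) if i not in pruned]
```

The samples that sat in pruned-off leaf nodes are excluded, and the remainder serves as the verification sample set.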
In one embodiment, after the step of obtaining the anomaly classification result output by the anomaly classification model, the method further includes: and carrying out proportion calculation according to the abnormal classification result and the abnormal classification type to obtain the proportion corresponding to each abnormal type.
If the proportion of the equipment fault category is large, the monitoring equipment is frequently abnormal, and the monitoring equipment or the data transmission channel needs to be inspected and corrected; if the proportion of the data abnormality category is large, the monitoring data or the monitoring equipment is abnormal, and the monitoring equipment or the monitoring data needs to be inspected and corrected; if the proportion of the data-exceeding-standard category is large, the production pollution discharge of the enterprise exceeds the standard, and the enterprise needs to be supervised and rectified.
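The proportion calculation can be sketched with a simple counter over the classification results; the category names used here are placeholders, not the method's fixed vocabulary:

```python
from collections import Counter

def category_proportions(results):
    # Fraction of classification results falling in each abnormal category.
    counts = Counter(results)
    n = len(results)
    return {category: count / n for category, count in counts.items()}
```

A downstream rule can then compare each proportion against a threshold to decide which entity (equipment, data channel, or enterprise) to inspect.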
As shown in fig. 2, in one embodiment, an apparatus for analyzing data abnormality based on environmental data is provided, the apparatus comprising:
the abnormal feature extraction module 10 is configured to acquire environmental data to be analyzed and monitoring items, and perform abnormal feature extraction on the environmental data to be analyzed based on the monitoring items to obtain a target abnormal feature set;
and an anomaly classification module 20, configured to input the target anomaly feature set into an anomaly classification model for performing anomaly classification, where the anomaly classification model is a model obtained based on decision tree training, and obtains an anomaly classification result output by the anomaly classification model, and the anomaly classification result is used to express an anomaly category of the environmental data to be analyzed.
According to the method, abnormal features are extracted from the environmental data to be analyzed to obtain the target abnormal feature set, and the target abnormal feature set is input into the abnormal classification model for abnormal classification to obtain the abnormal classification result output by the model. The abnormal classification result intuitively shows the abnormality of the environmental data to be analyzed, and because the abnormal classification model is obtained through decision-tree-based training, the accuracy of the abnormal classification result is improved, so that the abnormality of the environmental data to be analyzed is analyzed reasonably.
In one embodiment, the abnormal feature extraction module further comprises: an environment data acquisition submodule;
the environment data acquisition submodule is used for acquiring data messages monitored by the environment monitoring equipment and analyzing the data messages to obtain the environment data to be analyzed.
In one embodiment, the apparatus further comprises:
a training sample determination module, configured to obtain a training sample set, where the training sample set includes a plurality of environmental data training samples, and the environmental data training samples include: at least one abnormal characteristic sample value and an abnormal classification calibration value;
the training module is used for performing recursive division with the CART algorithm according to the plurality of environmental data training samples to establish a CART decision tree; performing random pruning and constant determination on the CART decision tree to obtain a plurality of sub-decision trees to be trained, each corresponding to a target constant; obtaining a verification sample set comprising a plurality of environmental data verification samples; and determining the abnormal classification model according to the plurality of environmental data verification samples and the plurality of sub-decision trees to be trained.
In one embodiment, the training module comprises: a CART decision tree construction submodule;
the CART decision tree construction submodule is used for selecting an independent variable Xi, determining a value Vi according to Xi, and dividing the n-dimensional space into two parts, where all points of one part satisfy Xi ≤ Vi and all points of the other part satisfy Xi > Vi (for discrete variables, an abnormal feature takes only two values: yes or no); reselecting one abnormal feature in each of the two parts and continuing to divide, using the Gini index as the division criterion; and stopping building the tree when a stopping condition is met, the built binary tree being used as the CART decision tree, the stopping condition being: the number of environmental data training samples in a leaf node is 1, or all samples in the node belong to the same abnormal category.
In one embodiment, the CART decision tree construction sub-module comprises: an optimal abnormal feature determining unit and a CART decision tree determining unit;
the optimal abnormal feature determining unit is configured to take each of the two parts as a node to be divided, take each abnormal feature of the node to be divided as an abnormal feature to be divided, perform a Gini index calculation according to the value of the abnormal feature to be divided and all split points corresponding to that value to obtain a plurality of splitting Gini indexes, determine an optimal split point according to the plurality of splitting Gini indexes, take the splitting Gini index corresponding to the optimal split point as the optimal splitting Gini index, and determine the optimal abnormal feature according to the optimal splitting Gini indexes of all the abnormal features to be divided;
the CART decision tree determining unit is configured to generate two child nodes from the node to be partitioned according to the optimal abnormal feature, partition all the environment data verification samples of the node to be partitioned into the two child nodes according to the optimal abnormal feature and the optimal split point of the optimal abnormal feature, use the established binary tree as the CART decision tree when the stopping condition is met, and use the two child nodes as the two parts if the stopping condition is not met, and execute the step of using each of the two parts as one node to be partitioned.
In one embodiment, the training module comprises: determining a sub-module of the sub-decision tree to be trained;
the sub-decision tree to be trained determining sub-module is used for randomly pruning the CART decision tree to obtain a plurality of sub-decision trees to be trained, and determining constants of the sub-decision trees to be trained to obtain target constants corresponding to the sub-decision trees to be trained;
for any internal node t of the sub-decision tree to be trained, the loss function of the single-node tree consisting of t alone is
Cα(t) = C(t) + α
where C(t) is the prediction error of node t on the training sample set and α is a variable;
the loss function of the subtree Tt rooted at t is
Cα(Tt) = C(Tt) + α|Tt|
where C(Tt) is the prediction error of Tt on the training sample set and |Tt| is the number of leaf nodes of Tt;
when the loss function of the single-node tree t equals the loss function of the subtree Tt rooted at t, the value of the variable α is taken as the target constant corresponding to the sub-decision tree to be trained.
In one embodiment, the training module further comprises: a verification sample set determination submodule;
the random pruning and constant determination of the CART decision tree to obtain a plurality of sub-decision trees to be trained, wherein each sub-decision tree to be trained corresponds to a target constant, further comprises: carrying out random pruning and constant determination on the CART decision tree to obtain a plurality of sub-decision trees to be trained and a pruned-off environmental data training sample set, wherein each sub-decision tree to be trained corresponds to a target constant; and,
the verification sample set determining submodule is used for removing the pruned-off environmental data training samples from the training sample set and using the remaining part as the verification sample set.
Referring to fig. 3, a computer device, which may be a server and whose internal structure may be as shown in fig. 3, is also provided in an embodiment of the present application. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data involved in the data anomaly analysis method based on environmental data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a data anomaly analysis method based on environmental data, comprising the following steps: acquiring environmental data to be analyzed and monitoring items; performing abnormal feature extraction on the environmental data to be analyzed based on the monitoring items to obtain a target abnormal feature set; inputting the target abnormal feature set into an abnormal classification model for abnormal classification, wherein the abnormal classification model is a model obtained based on decision tree training; and acquiring an abnormal classification result output by the abnormal classification model, wherein the abnormal classification result is used for expressing the abnormal category of the environmental data to be analyzed.
According to the method, abnormal features are extracted from the environmental data to be analyzed to obtain the target abnormal feature set, and the target abnormal feature set is input into the abnormal classification model for abnormal classification to obtain the abnormal classification result output by the model. The abnormal classification result intuitively shows the abnormality of the environmental data to be analyzed, and because the abnormal classification model is obtained through decision-tree-based training, the accuracy of the abnormal classification result is improved, so that the abnormality of the environmental data to be analyzed is analyzed reasonably.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing a method for data anomaly analysis based on environmental data, including the steps of: acquiring environmental data to be analyzed and monitoring items; performing abnormal feature extraction on the environmental data to be analyzed based on the monitoring items to obtain a target abnormal feature set; inputting the target abnormal feature set into an abnormal classification model for abnormal classification, wherein the abnormal classification model is a model obtained based on decision tree training; and acquiring an abnormal classification result output by the abnormal classification model, wherein the abnormal classification result is used for expressing the abnormal category of the environmental data to be analyzed.
According to the method, abnormal features are extracted from the environmental data to be analyzed to obtain the target abnormal feature set, and the target abnormal feature set is input into the abnormal classification model for abnormal classification to obtain the abnormal classification result output by the model. The abnormal classification result intuitively shows the abnormality of the environmental data to be analyzed, and because the abnormal classification model is obtained through decision-tree-based training, the accuracy of the abnormal classification result is improved, so that the abnormality of the environmental data to be analyzed is analyzed reasonably.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by instructing relevant hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims (10)

1. A method for analyzing data anomalies based on environmental data, the method comprising:
acquiring environmental data to be analyzed and monitoring items;
performing abnormal feature extraction on the environmental data to be analyzed based on the monitoring items to obtain a target abnormal feature set;
inputting the target abnormal feature set into an abnormal classification model for abnormal classification, wherein the abnormal classification model is a model obtained based on decision tree training;
and acquiring an abnormal classification result output by the abnormal classification model, wherein the abnormal classification result is used for expressing the abnormal category of the environmental data to be analyzed.
2. The method for analyzing data abnormality based on environmental data according to claim 1, wherein the step of obtaining the environmental data to be analyzed includes:
acquiring a data message monitored by environment monitoring equipment;
and analyzing according to the data message to obtain the environmental data to be analyzed.
3. The method for analyzing data abnormality based on environmental data according to claim 1, wherein the step of inputting the target abnormality feature set into an abnormality classification model for abnormality classification, wherein the abnormality classification model is obtained based on decision tree training and further comprises:
obtaining a training sample set, the training sample set comprising a plurality of environmental data training samples, the environmental data training samples comprising: at least one abnormal characteristic sample value and an abnormal classification calibration value;
performing recursive division by adopting a CART algorithm according to the plurality of environmental data training samples to establish a CART decision tree;
carrying out random pruning and constant determination on the CART decision tree to obtain a plurality of sub decision trees to be trained, wherein each sub decision tree to be trained corresponds to a target constant;
obtaining a verification sample set, wherein the verification sample set comprises a plurality of environment data verification samples;
and determining the abnormal classification model according to the plurality of environmental data verification samples and the plurality of sub-decision trees to be trained.
4. The method of claim 3, wherein the step of building the CART decision tree by recursive partitioning using the CART algorithm according to the plurality of environmental data training samples comprises:
selecting an independent variable Xi and determining a value Vi according to Xi, thereby dividing the n-dimensional space into two parts: all points of one part satisfy Xi ≤ Vi, and all points of the other part satisfy Xi > Vi; for discrete variables, an abnormal feature takes only two values, which are: yes or no;
and reselecting one abnormal feature in each of the two parts to continue dividing, using the Gini index as the division criterion, stopping building the tree when a stopping condition is met, and using the built binary tree as the CART decision tree, wherein the stopping condition is: the number of environmental data training samples in a leaf node is 1, or all samples in the node belong to the same abnormal category.
5. The method of claim 4, wherein the step of reselecting an abnormal feature from the two parts to continue dividing, using the Gini index as the division criterion, stopping building the tree when a stopping condition is met, and using the built binary tree as the CART decision tree comprises:
taking each part of the two parts as a node to be divided;
taking each abnormal feature of the nodes to be divided as an abnormal feature to be divided;
performing a Gini index calculation according to the value of the abnormal feature to be divided and all splitting points corresponding to the value of the abnormal feature to be divided to obtain a plurality of splitting Gini indexes;
determining an optimal split point according to the plurality of splitting Gini indexes, and taking the splitting Gini index corresponding to the optimal split point as the optimal splitting Gini index;
determining the optimal abnormal feature according to the optimal splitting Gini indexes of all the abnormal features to be divided;
generating two sub-nodes from the node to be divided according to the optimal abnormal feature;
dividing all the environment data verification samples of the node to be divided into the two child nodes according to the optimal abnormal feature and the optimal split point of the optimal abnormal feature;
and when the stopping condition is met, taking the established binary tree as the CART decision tree, otherwise, taking the two child nodes as the two parts, and executing the step of taking each part of the two parts as a node to be divided.
6. The method of claim 3, wherein the step of randomly pruning and constant-determining the CART decision tree to obtain a plurality of sub-decision trees to be trained, each sub-decision tree to be trained corresponding to a target constant comprises:
randomly pruning the CART decision tree to obtain a plurality of sub decision trees to be trained;
determining a constant of the sub-decision tree to be trained to obtain a target constant corresponding to the sub-decision tree to be trained;
pruning an internal node t of the sub-decision tree to be trained, and taking the loss function of t as a single-node tree as

C_α(t) = C(t) + α

where C(t) is the prediction error of node t on the training sample set and α is a variable;

taking the loss function of the subtree T_t rooted at t as

C_α(T_t) = C(T_t) + α|T_t|

where C(T_t) is the prediction error of the subtree T_t on the training sample set and |T_t| is the number of leaf nodes of T_t;

and when the loss function of t as a single-node tree equals the loss function of the subtree T_t, that is, when α = (C(t) − C(T_t)) / (|T_t| − 1), taking the value of the variable α as the target constant corresponding to the sub-decision tree to be trained.
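Setting the two loss functions in claim 6 equal, C(t) + α = C(T_t) + α|T_t|, and solving for α gives the target constant directly. A minimal sketch of that calculation, with illustrative names not drawn from the patent:

```python
def target_constant(c_node, c_subtree, n_leaves):
    """Solve C(t) + a == C(T_t) + a * n_leaves for a: the value of the
    variable at which collapsing the subtree T_t (with `n_leaves` leaf
    nodes and prediction error `c_subtree`) into the single node t
    (prediction error `c_node`) leaves the loss function unchanged."""
    if n_leaves <= 1:
        raise ValueError("subtree must have more than one leaf node")
    return (c_node - c_subtree) / (n_leaves - 1)
```

For example, a subtree with 4 leaves whose error is 0.2, replacing a node whose error would be 0.5, gives a target constant of 0.1.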
7. The method of claim 3, wherein the step of randomly pruning the CART decision tree and determining constants to obtain a plurality of sub-decision trees to be trained, each sub-decision tree to be trained corresponding to a target constant, further comprises:
carrying out random pruning and constant determination on the CART decision tree to obtain a plurality of sub-decision trees to be trained and a set of pruned-off environmental data training samples, wherein each sub-decision tree to be trained corresponds to a target constant; and,
the obtaining a set of validation samples includes:
and removing the set of pruned-off environmental data training samples from the training sample set, and taking the remaining portion as the verification sample set.
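The verification-set construction of claim 7 is a set difference: whatever training samples were pruned off during tree cutting are withheld, and the remainder becomes the verification set. A sketch under the assumption that samples are hashable (the function name is illustrative):

```python
def make_verification_set(training_samples, pruned_off_samples):
    """Remove the pruned-off samples from the full training sample set;
    the remaining portion serves as the verification sample set.
    Preserves the original ordering of the surviving samples."""
    pruned = set(pruned_off_samples)
    return [s for s in training_samples if s not in pruned]
```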
8. An apparatus for analyzing data abnormality based on environmental data, the apparatus comprising:
the abnormal feature extraction module is used for acquiring environmental data to be analyzed and monitoring projects, and extracting abnormal features of the environmental data to be analyzed based on the monitoring projects to obtain a target abnormal feature set;
and the anomaly classification module is used for inputting the target anomaly feature set into an anomaly classification model for anomaly classification, wherein the anomaly classification model is a model obtained based on decision tree training, and for obtaining an anomaly classification result output by the anomaly classification model, the anomaly classification result being used for expressing the anomaly category of the environmental data to be analyzed.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010919414.XA 2020-09-04 2020-09-04 Data anomaly analysis method, device, equipment and medium based on environmental data Active CN111783904B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010919414.XA CN111783904B (en) 2020-09-04 2020-09-04 Data anomaly analysis method, device, equipment and medium based on environmental data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010919414.XA CN111783904B (en) 2020-09-04 2020-09-04 Data anomaly analysis method, device, equipment and medium based on environmental data

Publications (2)

Publication Number Publication Date
CN111783904A true CN111783904A (en) 2020-10-16
CN111783904B CN111783904B (en) 2020-12-04

Family

ID=72762144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010919414.XA Active CN111783904B (en) 2020-09-04 2020-09-04 Data anomaly analysis method, device, equipment and medium based on environmental data

Country Status (1)

Country Link
CN (1) CN111783904B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061322A1 (en) * 2015-08-31 2017-03-02 International Business Machines Corporation Automatic generation of training data for anomaly detection using other user's data samples
CN106872657A (en) * 2017-01-05 2017-06-20 河海大学 A kind of multivariable water quality parameter time series data accident detection method
CN107291911A (en) * 2017-06-26 2017-10-24 北京奇艺世纪科技有限公司 A kind of method for detecting abnormality and device
CN108830765A (en) * 2018-04-18 2018-11-16 中国地质大学(武汉) A kind of checking method and system of pollution entering the water monitoring data
CN108921440A (en) * 2018-07-11 2018-11-30 平安科技(深圳)有限公司 Pollutant method for monitoring abnormality, system, computer equipment and storage medium
CN109657616A (en) * 2018-12-19 2019-04-19 四川立维空间信息技术有限公司 A kind of remote sensing image land cover pattern automatic classification method
CN109948669A (en) * 2019-03-04 2019-06-28 腾讯科技(深圳)有限公司 A kind of abnormal deviation data examination method and device
CN110413682A (en) * 2019-08-09 2019-11-05 云南电网有限责任公司 A kind of the classification methods of exhibiting and system of data
CN110569867A (en) * 2019-07-15 2019-12-13 山东电工电气集团有限公司 Decision tree algorithm-based power transmission line fault reason distinguishing method, medium and equipment
CN110766059A (en) * 2019-10-14 2020-02-07 四川西部能源股份有限公司郫县水电厂 Transformer fault prediction method, device and equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIE LIU ET AL: "An integrated data-driven framework for surface water quality anomaly detection and early warning", JOURNAL OF CLEANER PRODUCTION *
LI CANGBAI ET AL: "Support vector machine, random forest and artificial neural network machine learning", ACTA GEOSCIENTICA SINICA (地球学报) *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112344990A (en) * 2020-10-21 2021-02-09 平安国际智慧城市科技股份有限公司 Environmental anomaly monitoring method, device, equipment and storage medium
CN112344990B (en) * 2020-10-21 2024-05-31 平安国际智慧城市科技股份有限公司 Environment anomaly monitoring method, device, equipment and storage medium
CN112487033A (en) * 2020-11-30 2021-03-12 国网山东省电力公司电力科学研究院 Service visualization method and system for data flow and network topology construction
CN113205274A (en) * 2021-05-21 2021-08-03 华设设计集团股份有限公司 Quantitative ranking method for construction quality
CN113344417A (en) * 2021-06-23 2021-09-03 武汉虹信技术服务有限责任公司 Method, system, computer equipment and readable medium for checking houses of individual workshops in residential building
CN113435517B (en) * 2021-06-29 2023-06-02 平安科技(深圳)有限公司 Abnormal data point output method, device, computer equipment and storage medium
CN113435517A (en) * 2021-06-29 2021-09-24 平安科技(深圳)有限公司 Abnormal data point output method and device, computer equipment and storage medium
CN114066438A (en) * 2021-11-15 2022-02-18 平安证券股份有限公司 Model-based monitoring data display method, device, equipment and storage medium
CN114227378A (en) * 2021-11-17 2022-03-25 航天科工深圳(集团)有限公司 Clamp state detection method and device, terminal and storage medium
CN114118306A (en) * 2022-01-26 2022-03-01 北京普利莱基因技术有限公司 Method and device for analyzing SDS (sodium dodecyl sulfate) gel electrophoresis experimental data and SDS gel reagent
CN114118306B (en) * 2022-01-26 2022-04-01 北京普利莱基因技术有限公司 Method and device for analyzing SDS (sodium dodecyl sulfate) gel electrophoresis experimental data and SDS gel reagent
CN116910667A (en) * 2023-09-08 2023-10-20 中国铁塔股份有限公司吉林省分公司 Communication tower abnormal state analysis method and system based on decision algorithm
CN116910667B (en) * 2023-09-08 2023-11-21 中国铁塔股份有限公司吉林省分公司 Communication tower abnormal state analysis method and system based on decision algorithm
CN117434227A (en) * 2023-12-20 2024-01-23 河北金隅鼎鑫水泥有限公司 Method and system for monitoring waste gas components of cement manufacturing plant
CN117434227B (en) * 2023-12-20 2024-04-30 河北金隅鼎鑫水泥有限公司 Method and system for monitoring waste gas components of cement manufacturing plant

Also Published As

Publication number Publication date
CN111783904B (en) 2020-12-04

Similar Documents

Publication Publication Date Title
CN111783904B (en) Data anomaly analysis method, device, equipment and medium based on environmental data
CN110609759B (en) Fault root cause analysis method and device
US11301759B2 (en) Detective method and system for activity-or-behavior model construction and automatic detection of the abnormal activities or behaviors of a subject system without requiring prior domain knowledge
Gokhale et al. Regression tree modeling for the prediction of software quality
CN111885059B (en) Method for detecting and positioning abnormal industrial network flow
CN111309539A (en) Abnormity monitoring method and device and electronic equipment
CN114124482B (en) Access flow anomaly detection method and equipment based on LOF and isolated forest
CN111666276A (en) Method for eliminating abnormal data by applying isolated forest algorithm in power load prediction
CN113093985B (en) Sensor data link abnormity detection method and device and computer equipment
CN110188196B (en) Random forest based text increment dimension reduction method
CN115511398A (en) Welding quality intelligent detection method and system based on time sensitive network
CN113284004A (en) Power data diagnosis treatment method based on isolated forest algorithm
CN115622867A (en) Industrial control system safety event early warning classification method and system
CN111738530B (en) River water quality prediction method, device and computer readable storage medium
CN113489606A (en) Network application identification method and device based on graph neural network
CN115422263B (en) Multifunctional universal fault analysis method and system for electric power field
CN116907764A (en) Method, device, equipment and storage medium for detecting air tightness of desulfurization equipment
CN116126807A (en) Log analysis method and related device
CN116030955A (en) Medical equipment state monitoring method and related device based on Internet of things
CN113726756A (en) Web abnormal traffic detection method, device, equipment and storage medium
WO2020133470A1 (en) Chat corpus cleaning method and apparatus, computer device, and storage medium
CN112069037A (en) Method and device for detecting no threshold value of cloud platform
CN113676457B (en) Streaming type multilayer security detection method and system based on state machine
CN116094955B (en) Operation and maintenance fault chain labeling system and method based on self-evolution network knowledge base
Viademonte et al. Discovering knowledge from meteorological databases: a meteorological aviation forecast study

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Data anomaly analysis method, device, equipment and medium based on environmental data

Effective date of registration: 20210429

Granted publication date: 20201204

Pledgee: Shenzhen hi tech investment small loan Co.,Ltd.

Pledgor: Ping An International Smart City Technology Co.,Ltd.

Registration number: Y2021980003211

PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20230206

Granted publication date: 20201204

Pledgee: Shenzhen hi tech investment small loan Co.,Ltd.

Pledgor: Ping An International Smart City Technology Co.,Ltd.

Registration number: Y2021980003211
