CN117235624B - Emission data falsification detection method, device and system and storage medium - Google Patents
- Publication number: CN117235624B
- Application number: CN202311236361.1A
- Authority: CN (China)
- Prior art keywords
- data
- pollutant
- anomaly
- contaminant
- entropy
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention discloses an emission data falsification detection method, device, system and storage medium. Multiple mathematical correlation tests, distribution tests and anomaly tests are performed on an enterprise's emission data, and a statistics-based deep learning model is constructed and trained on the results of these tests. Using the model, environmental monitoring data can be examined on the basis of the anomaly distributions of the data, and falsified data that do not conform to the distribution relationships of the data are identified automatically. The invention can effectively detect falsified emission data and reduces the likelihood that enterprises falsify online monitoring data.
Description
Technical Field
The present invention relates to the field of environmental monitoring and protection technologies, and in particular, to a method, an apparatus, a system, and a storage medium for detecting emission data falsification.
Background
Falsification of online monitoring data and its detection amount to an ongoing contest of attack and defense. Enterprises' existing falsification methods generally either rely on basic physical and chemical means or tamper with the online data.
Falsification by physical and chemical means, such as soaking the filter element in alkali liquor, pulling out a tube or pumping air, usually leaves fairly regular signatures, and the corresponding identification methods, such as video recognition, are comparatively well defined; they are not described further here.
Data tampering, the falsification means that the invention mainly targets, has so far generally been identified either by manually summarizing salient data characteristics or by building a BP neural network on the raw pollutant data. Once an enterprise applies some mathematical grounding and upgrades its falsification means, such identification methods can easily be perceived by the enterprise.
Description of the prior art and its deficiencies:
1) Method based on manually summarizing salient data characteristics
Over long-term operation, experienced engineers often become aware of unreasonable portions of enterprise data. Regularities that can be described mathematically can, once summarized by experienced engineers, be written explicitly into the system for automatic recognition.
The main problems of this method are that it depends heavily on engineers' manual experience, that it inevitably contains unstable and hard-to-describe parts, and that it consumes considerable manpower and material resources.
2) Method of building a BP neural network on raw pollutant data
Specifically, a BP network is built directly on the raw pollutant emission data to form a regression model.
The main problem of this method is as follows. The real-time sub-scores of the pollution source's discharge are used as input to compute a predicted pollution source score, while the actual pollution source score is calculated from the same indices; the actual score and the score predicted by the neural network may therefore follow the same mathematical relationship, and when both share a falsified input they may produce approximately the same falsified output. The method uses a BP neural network to build a regression model that fits the relationship between the sub-scores and the pollution source score. Where a definite mapping already exists, the neural network is likely merely to fit the conventional calculation by which the sub-score evaluation indices yield the actual pollution source score. If the batch of data has already been falsified, that is, the input itself is falsified, the neural network and the conventional standard calculation may be indistinguishable in their ability to identify falsification.
This is similar to solving a quadratic equation in one variable with a neural network: however complex the structure, the final parameter matrix, if trained well enough, only fits a mathematical relationship that can be expressed with a concise elementary function. If the model result differs from the result of the conventional calculation, the model has more likely been trained inadequately than actually identified falsification. Neural networks are more effective when the model describes a more complex relationship, one that conventional mathematical means have shown to be difficult to express as a concise mathematical procedure.
Disclosure of Invention
The present invention has been made to solve the above problems in the prior art. Accordingly, there is a need for an emission data falsification detection method, device, system and storage medium that solve at least the following problems:
1. The method based on manually summarizing salient data characteristics relies heavily on engineers' manual experience, so it inevitably contains unstable and hard-to-describe parts. It therefore consumes more manpower and material resources, costs more, and is harder to turn into a standardized, scalable and reproducible method.
2. Although the method of building a BP neural network on raw pollutant data avoids the pitfalls of the previous method, it has a pitfall of its own in the mathematical relationship that the model is used to describe.
The invention is developed mainly on the basis of statistical analysis and neural network methods.
According to a first aspect of the present invention, there is provided an emission data falsification detection method, the method comprising:
Acquiring online monitoring data, wherein the online monitoring data comprise production state data and emission data of multiple pollutants, each time point is recorded as a vector Z, Z = {Z1, Z2 … Zn}, where Zn represents the emission data of the nth pollutant and n is the number of pollutant types, a matrix A of multiple time points is recorded as one sample, and the set of all samples is used as a data set;
Calculating the mutual information between pollutants and extracting it as a feature of the correlation between pollutant sequences;
Performing a separate normal distribution test on each dimension to obtain statistic parameters, wherein each dimension corresponds to the sequence of one pollutant over multiple time points;
Obtaining anomaly scores using anomaly detection algorithms under different distribution assumptions;
Concatenating, along the pollutant dimension, the features of the correlation between pollutant sequences, the statistic parameters and the anomaly scores, and using them as the input of a neural network to construct a classification model;
Training the classification model based on the data set to obtain a falsification detection model;
And calculating the probability of emission data falsification based on the falsification detection model.
Further, the mutual information between the pollutants is calculated by the following formulas:
$$E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$
$$\mathrm{Gain}(T, X) = \mathrm{Entropy}(T) - \mathrm{Entropy}(T, X)$$
where E(S) represents the information entropy of a pollutant, i indexes the pollutant types, c represents the number of pollutant types, p_i represents the marginal probability density function of the i-th pollutant, Gain(T, X) represents the mutual information between two pollutants T and X, Entropy(T) represents the information entropy of T, and Entropy(T, X) represents the information entropy of T given the other pollutant X.
Further, performing a separate normal distribution test on each dimension to obtain statistic parameters specifically includes:
the Kolmogorov-Smirnov test, with the formula:
$$D_n = \sup_x \left| F_n(x) - F(x) \right|$$
where D_n represents the statistic of the normal distribution test, sup represents the supremum of the set of distances, x represents the data of the single pollutant under test, F_n(x) represents the cumulative probability of the actual distribution obtained from x, and F(x) represents the cumulative probability of the theoretical distribution to be obeyed;
the Anderson-Darling test, with the formula:
$$Z = n \int_{-\infty}^{\infty} \left[ F_n(x) - F(x) \right]^2 w(x) f(x)\, dx$$
where Z represents the statistic of the normal distribution test, n represents the amount of data of the single pollutant under test, w(x) represents the weight function, and f(x) represents the theoretical distribution density function.
Further, obtaining anomaly scores using anomaly detection algorithms under different distribution assumptions specifically includes:
dividing each dimension into intervals with a static-width histogram to obtain an anomaly score:
$$\mathrm{HBOS}(p) = \sum_{i=0}^{d} \log\left( \frac{1}{\mathrm{hist}_i(p)} \right)$$
in actual calculation, this formula is equivalent to the following formula:
$$\mathrm{HBOS}(p) = -\sum_{i=0}^{d} \log\left( \mathrm{hist}_i(p) \right)$$
where HBOS(p) represents the anomaly score calculated by the Histogram-based Outlier Score method, d represents the amount of data of the single pollutant participating in the calculation, and hist_i(p) represents the frequency (relative amount) of the histogram after bin normalization;
obtaining an anomaly score by calculating outliers through the Mahalanobis distance:
$$D_M(x_i) = \sqrt{(x_i - \bar{x})^{T} S^{-1} (x_i - \bar{x})}$$
where D_M(x_i) represents the Mahalanobis distance measure, x_i represents the value of a sample point, \bar{x} represents the population mean, and S is the covariance matrix of the population;
iteratively isolating, with a binary search tree structure, the samples presumed to be outliers, and calculating the anomaly score:
$$s(x, \psi) = 2^{-\frac{E(h(x))}{c(\psi)}}$$
where
$$c(\psi) = 2H(\psi - 1) - \frac{2(\psi - 1)}{\psi}$$
where ψ represents the number of data points drawn from the data set to which x belongs, c(ψ) represents the average height over ψ data points, s(x, ψ) represents the anomaly score, H(ψ - 1) is the harmonic number of (ψ - 1), E(h(x)) is the mean of h(x) over the trees, and h(x) represents the height of a data point x, i.e. the number of edges traversed from the root node of the tree to reach the leaf node.
Further, where a method presupposes a normal distribution but the data fail the normal distribution test, a transformation to a normal distribution is applied first and the anomaly score is then calculated.
Further, the classification model includes a Self-Attention structure, an RNN structure and a Luong Attention structure, and constructing the classification model specifically includes:
calculating the output of each block by applying a Self-Attention structure to the features of the correlation between pollutant sequences, the statistic parameters and the anomaly scores;
constructing an RNN structure that takes the results of the multiple Self-Attention blocks, ordered according to their logical relationship, and calculates an output;
calculating, in a Luong Attention structure, the weighted result of the input and output of the RNN structure, and calculating the probability that each pollutant in the corresponding sequence is anomalous through an MLP structure with two hidden layers.
Further, training the classification model based on the data set to obtain a falsification detection model specifically includes:
extracting a preset proportion of positive samples from the data set, modifying the values in those positive samples that exceed the pollutant standard by lowering them to a Euclidean distance away from the pollutant standard, labelling the modified samples as negative samples, and taking the generated negative samples together with the original real sample set as the total data set;
dividing the total data set into a training set, a test set and a validation set;
training the neural network from different random seeds based on the training set and the test set, and, for each random seed, saving the model parameters of the one model with the best training effect; the training effect is determined by the test-set accuracy, which is obtained by predicting falsification wherever the falsification probability calculated for a pollutant is greater than or equal to 0.5 and comparing the prediction with the true label;
configuring a classification model with each set of saved model parameters and comparing the models on the validation set, the model with the best validation-set effect being taken as the falsification detection model.
According to a second aspect of the present invention, there is provided an emission data falsification detection device including:
a data acquisition module configured to acquire online monitoring data including production state data and emission data of multiple pollutants, each time point being recorded as a vector Z, Z = {Z1, Z2 … Zn}, where Zn represents the emission data of the nth pollutant and n is the number of pollutant types, a matrix A of multiple time points being recorded as one sample, and the set of all samples being used as a data set;
a feature calculation module configured to calculate the mutual information between pollutants and extract it as a feature of the correlation between pollutant sequences;
a parameter calculation module configured to perform a separate normal distribution test on each dimension to obtain statistic parameters, wherein each dimension corresponds to the sequence of one pollutant over multiple time points;
an anomaly score acquisition module configured to obtain anomaly scores using anomaly detection algorithms under different distribution assumptions;
a classification model construction module configured to concatenate, along the pollutant dimension, the features of the correlation between pollutant sequences, the statistic parameters and the anomaly scores, and to use them as the input of a neural network to construct a classification model;
a model training module configured to train the classification model based on the data set to obtain a falsification detection model;
and a falsification detection module configured to calculate the probability of emission data falsification based on the falsification detection model.
According to a third aspect of the present invention, there is provided an emission data falsification detection system, the system including: a memory for storing a computer program; a processor for executing the computer program to implement the method as described above.
According to a fourth aspect of the invention, there is provided a non-transitory computer readable storage medium storing instructions which, when executed by a processor, perform the method as described above.
The emission data falsification detection method, device, system and storage medium according to the above aspects have at least the following technical effects:
Because secondary modeling is performed on the results of multiple anomaly distributions, the invention has several advantages. First, the invention takes into account the preconditions of various statistical assumptions and adapts to different data distributions in actual use. Second, the method describes a complex mathematical relationship with a deep learning model, so it is difficult for an enterprise to find a data generation method that defeats that relationship. In addition, in the inference stage the method can be deployed on a server and run automatically against a connected database, requiring only a small amount of manpower and material resources.
Drawings
In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. The same reference numerals with letter suffixes or different letter suffixes may represent different instances of similar components. The accompanying drawings illustrate various embodiments by way of example in general and not by way of limitation, and together with the description and claims serve to explain the inventive embodiments. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. Such embodiments are illustrative and not intended to be exhaustive or exclusive of the present apparatus or method.
FIG. 1 illustrates a flow chart of an emission data falsification detection method in accordance with an embodiment of the present invention.
Fig. 2 shows a schematic structural diagram of a classification model according to an embodiment of the invention.
FIG. 3 illustrates a classification model building flow chart of an emissions data falsification detection method according to an embodiment of the invention.
FIG. 4 illustrates a model training flow chart of an emission data falsification detection method in accordance with an embodiment of the present invention.
Fig. 5 shows a structural diagram of an emission data falsification detection device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the drawings and detailed description to enable those skilled in the art to better understand the technical scheme of the present invention. Embodiments of the present invention will be described in further detail below with reference to the drawings and specific examples, but not by way of limitation. The order in which the steps are described herein by way of example should not be construed as limiting where the steps do not depend on one another; those skilled in the art will understand that the order of the steps may be adjusted, provided that the logic between them is not broken and the overall process can still be carried out.
An embodiment of the invention provides an emission data falsification detection method, specifically a method that detects emission data falsification based on multiple anomaly distributions of online monitoring data and on deep learning.
Multiple mathematical correlation tests, distribution tests and anomaly tests are performed on an enterprise's emission data, and a statistics-based deep learning model is constructed and trained on the results of these tests.
The scheme is suited to examining the emission data in enterprise online monitoring data marked as being in the normal production state, and to determining which pollutants in the emission data may have been falsified.
Using the model, environmental monitoring data can be examined on the basis of the anomaly distributions of the data, and falsified data that do not conform to the distribution relationships of the data are identified automatically.
For environmental protection authorities supervising enterprises, automatic identification can save a great deal of manpower and material resources and create economic benefits. Secondary modeling on anomaly scores under multiple distribution assumptions makes it hard for enterprises to work out how detection is performed and then falsify data in a targeted way; this deterrent effect has corresponding social value.
Specifically, referring to fig. 1, the method includes the following steps:
Step S100, acquiring online monitoring data, wherein the online monitoring data comprise production state data and emission data of multiple pollutants, each time point is recorded as a vector Z, Z = {Z1, Z2 … Zn}, where Zn represents the emission data of the nth pollutant and n is the number of pollutant types, a matrix A of multiple time points is recorded as one sample, and the set of all samples is used as a data set.
Illustratively, assuming 6 contaminant species, each time point is noted as a vector Z (z= { Z1, Z2, Z3, Z4, Z5, Z6 }). The matrix a at 180 time points is noted as one sample, and all sample sets are taken as data sets.
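As a minimal sketch of how such samples might be assembled, the Python fragment below cuts a time-ordered list of emission vectors into matrices of 180 time points. The function name `build_samples` and the `stride` parameter are hypothetical; the text does not say whether consecutive samples overlap, so non-overlapping windows are assumed here.

```python
from typing import List

Vector = List[float]   # one time point: Z = [Z1, ..., Zn]
Sample = List[Vector]  # matrix A: `window` consecutive time points


def build_samples(series: List[Vector], window: int = 180,
                  stride: int = 180) -> List[Sample]:
    """Cut a time-ordered list of emission vectors into fixed-length samples."""
    n = len(series[0]) if series else 0
    if any(len(z) != n for z in series):
        raise ValueError("every time point must report the same pollutants")
    # Assumption: windows that would run past the end of the series are dropped.
    return [series[start:start + window]
            for start in range(0, len(series) - window + 1, stride)]


# Toy data: 400 time points of 6 pollutants (values are placeholders).
series = [[float(t + k) for k in range(6)] for t in range(400)]
dataset = build_samples(series)
print(len(dataset), len(dataset[0]), len(dataset[0][0]))
```

With 400 time points and a window of 180, two complete samples are produced and the trailing 40 time points are discarded.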
Step S200, calculating mutual information between pollutants, and extracting features as correlations between pollutant sequences.
Where the mutual information between contaminants refers to the mutual information between contaminants of each type in a sequence of contaminants (matrix a of multiple time points), if the contaminant type has 6 types, denoted as Z1, Z2, Z3, Z4, Z5, Z6, respectively, the mutual information between contaminants may be the mutual information between Z1 and Z2, Z2 and Z3, Z4 and Z5, Z5 and Z6, etc. The mutual information between the contaminants characterizes the correlation or independence between the data, and is therefore extracted as a feature of the correlation between contaminant sequences.
In some embodiments, the mutual information between pollutants is calculated by the following formulas:
$$E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$
$$\mathrm{Gain}(T, X) = \mathrm{Entropy}(T) - \mathrm{Entropy}(T, X)$$
where E(S) represents the information entropy of a pollutant, i indexes the pollutant types, c represents the number of pollutant types, p_i represents the marginal probability density function of the i-th pollutant, Gain(T, X) represents the mutual information between two pollutants T and X, Entropy(T) represents the information entropy of T, and Entropy(T, X) represents the information entropy of T given the other pollutant X.
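The Gain(T, X) calculation can be illustrated with a short sketch. It assumes that each continuous pollutant series is first discretized into equal-width bins (the text does not say how the probabilities are estimated) and that Entropy(T, X) is the entropy of T averaged over the bins of X. The names `discretize`, `entropy` and `information_gain` are illustrative only.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def discretize(values: Sequence[float], bins: int = 8) -> List[int]:
    """Map a series to equal-width bin indices (an assumed estimator)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy in bits of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(t: Sequence[int], x: Sequence[int]) -> float:
    """Gain(T, X) = Entropy(T) - sum_v P(X = v) * Entropy(T | X = v)."""
    groups: Dict[int, List[int]] = {}
    for tv, xv in zip(t, x):
        groups.setdefault(xv, []).append(tv)
    conditional = sum(len(g) / len(t) * entropy(g) for g in groups.values())
    return entropy(t) - conditional


z1 = [float(i) for i in range(64)]
z2 = [2.0 * v for v in z1]   # fully determined by z1
z3 = [5.0] * 64              # carries no information about z1
print(information_gain(discretize(z1), discretize(z2)))
print(information_gain(discretize(z1), discretize(z3)))
```

A series that is a deterministic function of another yields a gain equal to the entropy of the first series, while a constant series yields zero gain; real pollutant pairs would fall in between.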
And step S300, performing independent normal distribution verification on each dimension to obtain statistic parameters, wherein the dimension corresponds to a sequence of a plurality of time points of each pollutant.
For example, assuming 6 contaminants, a matrix a contains 6 contaminant sequences and a sample has 6 dimensions.
In some embodiments, performing a separate normal distribution test on each dimension to obtain statistic parameters specifically includes:
the Kolmogorov-Smirnov test, with the formula:
$$D_n = \sup_x \left| F_n(x) - F(x) \right|$$
where D_n represents the statistic of the normal distribution test, sup represents the supremum of the set of distances, x represents the data of the single pollutant under test, F_n(x) represents the cumulative probability of the actual distribution obtained from x, and F(x) represents the cumulative probability of the theoretical distribution to be obeyed;
the Anderson-Darling test, with the formula:
$$Z = n \int_{-\infty}^{\infty} \left[ F_n(x) - F(x) \right]^2 w(x) f(x)\, dx$$
where Z represents the statistic of the normal distribution test, n represents the amount of data of the single pollutant under test, w(x) represents the weight function, and f(x) represents the theoretical distribution density function.
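Both statistics can be computed for one pollutant series with the standard library alone, as in the sketch below. It rests on two assumptions that the text does not state: the theoretical distribution is a normal distribution fitted with the sample mean and standard deviation, and the Anderson-Darling weight is the usual w(x) = 1 / (F(x)(1 - F(x))), for which the integral has a well-known closed computational form. The function name `normality_statistics` is hypothetical.

```python
import math
from statistics import NormalDist, fmean, stdev
from typing import Sequence, Tuple


def normality_statistics(x: Sequence[float]) -> Tuple[float, float]:
    """Return (KS statistic, Anderson-Darling statistic) for one series."""
    xs = sorted(x)
    n = len(xs)
    # Assumption: the reference normal is fitted to the sample itself.
    ref = NormalDist(fmean(xs), stdev(xs))
    # Clip the CDF away from 0 and 1 so the logarithms below stay finite.
    eps = 1e-12
    cdf = [min(max(ref.cdf(v), eps), 1.0 - eps) for v in xs]
    # Kolmogorov-Smirnov: largest gap between the empirical and theoretical
    # CDF, checked just before and just after each step of the empirical CDF.
    d_n = max(max((i + 1) / n - f, f - i / n) for i, f in enumerate(cdf))
    # Anderson-Darling: computational form of the weighted squared-distance
    # integral, using the order statistics of the sample.
    s = sum((2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
            for i in range(n))
    a2 = -n - s / n
    return d_n, a2


n = 200
normal_like = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
skewed = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]
print(normality_statistics(normal_like))
print(normality_statistics(skewed))
```

The skewed series produces markedly larger values of both statistics than the normal-like one. Because the method feeds the statistics to a neural network as features, no critical values or p-values are computed here.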
Step S400, obtaining an anomaly score by utilizing an anomaly detection algorithm under different distribution hypothesis conditions.
In some embodiments, where a method presupposes a normal distribution but the data fail the normal distribution test, a transformation to a normal distribution is applied first and the anomaly score is then calculated.
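The text does not name the transformation. One common choice, shown here purely as an assumed example, is a rank-based inverse normal transform: each value is replaced by the standard normal quantile of its rank, which preserves the ordering of the data while forcing an approximately normal shape. The function name `rank_to_normal` is hypothetical.

```python
from statistics import NormalDist
from typing import List, Sequence


def rank_to_normal(x: Sequence[float]) -> List[float]:
    """Map values to standard-normal quantiles of their (average) ranks."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values so they share one average rank.
        while j + 1 < n and x[order[j + 1]] == x[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    std_normal = NormalDist()
    # (rank - 0.5) / n keeps the argument strictly inside (0, 1).
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]


print(rank_to_normal([1.0, 100.0, 2.0, 3.0, 50.0]))
```

Other transformations, for example a Box-Cox power transform, would serve the same purpose; the rank-based version is used here only because it needs no parameter fitting.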
In some embodiments, obtaining anomaly scores using anomaly detection algorithms under different distribution assumptions specifically includes:
dividing each dimension into intervals with a static-width histogram to obtain an anomaly score:
$$\mathrm{HBOS}(p) = \sum_{i=0}^{d} \log\left( \frac{1}{\mathrm{hist}_i(p)} \right)$$
in actual calculation, this formula is equivalent to the following formula:
$$\mathrm{HBOS}(p) = -\sum_{i=0}^{d} \log\left( \mathrm{hist}_i(p) \right)$$
where HBOS(p) represents the anomaly score calculated by the Histogram-based Outlier Score method, d represents the amount of data of the single pollutant participating in the calculation, and hist_i(p) represents the frequency (relative amount) of the histogram after bin normalization;
obtaining an anomaly score by calculating outliers through the Mahalanobis distance:
$$D_M(x_i) = \sqrt{(x_i - \bar{x})^{T} S^{-1} (x_i - \bar{x})}$$
where D_M(x_i) represents the Mahalanobis distance measure, x_i represents the value of a sample point, \bar{x} represents the population mean, and S is the covariance matrix of the population;
iteratively isolating, with a binary search tree structure, the samples presumed to be outliers, and calculating the anomaly score:
$$s(x, \psi) = 2^{-\frac{E(h(x))}{c(\psi)}}$$
where
$$c(\psi) = 2H(\psi - 1) - \frac{2(\psi - 1)}{\psi}$$
where ψ represents the number of data points drawn from the data set to which x belongs, c(ψ) represents the average height over ψ data points, s(x, ψ) represents the anomaly score, H(ψ - 1) is the harmonic number of (ψ - 1), E(h(x)) is the mean of h(x) over the trees, and h(x) represents the height of a data point x, i.e. the number of edges traversed from the root node of the tree to reach the leaf node.
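The three scores can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patented implementation: the histogram score is computed for a single dimension with bin heights normalized so that the tallest bin is 1; the Mahalanobis distance uses the sample mean and sample covariance; and for the tree-based score only the normalization c(ψ) and the final score are shown, with the mean path length E(h(x)) assumed to come from trees built elsewhere. All function names are hypothetical.

```python
import math
from statistics import fmean
from typing import List, Sequence


def hbos_1d(values: Sequence[float], bins: int = 10) -> List[float]:
    """Per-point score log(1 / hist(p)) from one static-width histogram."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data
    idx = [min(int((v - lo) / width), bins - 1) for v in values]
    counts = [0] * bins
    for b in idx:
        counts[b] += 1
    top = max(counts)
    # hist(p) = counts / top, so log(1 / hist(p)) = log(top / counts).
    return [math.log(top / counts[b]) for b in idx]


def _solve(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a @ y = b by Gauss-Jordan elimination with partial pivoting."""
    k = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]


def mahalanobis(points: Sequence[Sequence[float]]) -> List[float]:
    """Distance of every row from the sample mean under the sample covariance."""
    n, k = len(points), len(points[0])
    mean = [fmean(col) for col in zip(*points)]
    centred = [[p[j] - mean[j] for j in range(k)] for p in points]
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(k)]
           for a in range(k)]
    return [math.sqrt(max(0.0, sum(di * yi
                                   for di, yi in zip(d, _solve(cov, d)))))
            for d in centred]


def c(psi: int) -> float:
    """c(psi) = 2 H(psi - 1) - 2 (psi - 1) / psi with an exact harmonic number."""
    harmonic = sum(1.0 / j for j in range(1, psi))
    return 2.0 * harmonic - 2.0 * (psi - 1) / psi


def isolation_score(mean_path_length: float, psi: int) -> float:
    """s(x, psi) = 2 ** (-E(h(x)) / c(psi)); requires psi >= 2."""
    return 2.0 ** (-mean_path_length / c(psi))
```

A point whose mean path length equals c(ψ) scores exactly 0.5, and shorter paths push the score towards 1. The Mahalanobis sketch raises `ZeroDivisionError` when the covariance matrix is singular, which a production version would need to handle.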
Step S500, concatenating, along the pollutant dimension, the features of the correlation between pollutant sequences, the statistic parameters and the anomaly scores, and using them as the input of a neural network to construct a classification model.
In some embodiments, the structure of the classification model is shown in fig. 2; the classification model includes a Self-Attention structure, an RNN structure and a Luong Attention structure. As shown in fig. 3, constructing the classification model from the concatenated features, statistic parameters and anomaly scores specifically includes:
Step S501, calculating the output of each block by applying a Self-Attention structure to the features of the correlation between pollutant sequences, the statistic parameters and the anomaly scores;
Step S502, constructing an RNN structure that takes the results of the multiple Self-Attention blocks, ordered according to their logical relationship, and calculates an output;
Step S503, calculating, in a Luong Attention structure, the weighted result of the input and output of the RNN structure, and calculating the probability that each pollutant in the corresponding sequence is anomalous through an MLP structure with two hidden layers.
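The weighting performed in step S503 can be illustrated with the simplest Luong-style score, the dot product. The sketch below assumes that a query vector (for example the final RNN state) is scored against each RNN output, the scores are normalized with a softmax, and the outputs are averaged with those weights. Which vectors play the roles of query and states, and which Luong score variant is used, are not specified in the text, so both are assumptions here; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def luong_dot_attention(query: Vec,
                        states: Sequence[Vec]) -> Tuple[List[float], List[float]]:
    """Return (attention weights, context vector) for dot-product scoring."""
    # score(query, state) = query . state
    scores = [sum(q * h for q, h in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    # Context vector: attention-weighted average of the states.
    context = [sum(w * state[j] for w, state in zip(weights, states))
               for j in range(dim)]
    return weights, context


weights, context = luong_dot_attention(
    [0.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(weights, context)
```

In a full model the context vector would be combined with the query and passed to the MLP with two hidden layers to produce one probability per pollutant; that part, like the Self-Attention and RNN layers, is left to a deep learning framework.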
Step S600, training the classification model based on the data set to obtain a falsification detection model.
In some embodiments, as shown in fig. 4, training the classification model based on the data set to obtain a falsification detection model specifically includes:
Step S601, extracting a preset proportion of positive samples from the data set, modifying the values in those positive samples that exceed the pollutant standard by lowering them to a Euclidean distance away from the pollutant standard, labelling the modified samples as negative samples, and taking the generated negative samples together with the original real sample set as the total data set;
Step S602, dividing the total data set into a training set, a test set and a validation set;
Step S603, training the neural network from different random seeds based on the training set and the test set, and, for each random seed, saving the model parameters of the one model with the best training effect; the training effect is determined by the test-set accuracy, which is obtained by predicting falsification wherever the falsification probability calculated for a pollutant is greater than or equal to 0.5 and comparing the prediction with the true label;
Step S604, configuring a classification model with each set of saved model parameters and comparing the models on the validation set, the model with the best validation-set effect being taken as the falsification detection model.
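Steps S601 to S604 can be outlined as below. The sketch makes several assumptions: `limits` holds one emission standard per pollutant and `margin` is the distance by which an over-limit value is moved below the standard (both hypothetical parameters); label 1 marks a falsified (negative) sample; and `train_fn` is a stand-in for real network training that returns the candidate models produced for one random seed.

```python
from typing import Callable, List, Sequence, Tuple

Model = Callable[[Sequence[float]], float]  # features -> falsification probability
Labelled = Sequence[Tuple[Sequence[float], int]]


def make_negative(sample: Sequence[Sequence[float]],
                  limits: Sequence[float], margin: float) -> List[List[float]]:
    """S601: copy a sample, moving each over-limit reading to limit - margin."""
    return [[limit - margin if v > limit else v
             for v, limit in zip(row, limits)] for row in sample]


def accuracy(probabilities: Sequence[float], labels: Sequence[int],
             threshold: float = 0.5) -> float:
    """S603: predict falsification when probability >= threshold."""
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    return sum(int(p == y) for p, y in zip(predicted, labels)) / len(labels)


def evaluate(model: Model, data: Labelled) -> float:
    return accuracy([model(x) for x, _ in data], [y for _, y in data])


def select_model(seeds: Sequence[int],
                 train_fn: Callable[[int], Sequence[Model]],
                 test_set: Labelled, validation_set: Labelled) -> Model:
    # S603: for each seed keep the candidate with the best test-set accuracy.
    per_seed = [max(train_fn(seed), key=lambda m: evaluate(m, test_set))
                for seed in seeds]
    # S604: across seeds keep the candidate that does best on validation.
    return max(per_seed, key=lambda m: evaluate(m, validation_set))


print(make_negative([[1.0, 9.0], [2.0, 3.0]], limits=[5.0, 5.0], margin=1.0))
```

Selecting per seed on the test set and then across seeds on the validation set mirrors the two-stage comparison described in the steps; the split into three sets (S602) is omitted because the text gives no proportions.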
Finally, in step S700, the probability of emission data falsification is calculated based on the falsification detection model.
Specifically, for a model that has been trained, the model structure and parameters are loaded directly in the inference phase. For each batch of data, steps S100-S400 are carried out, the model is then loaded directly for inference, and the probability of emission data falsification is calculated.
An embodiment of the invention also provides an emission data falsification detection device. As shown in Fig. 5, the device 500 comprises:
A data acquisition module 501 configured to acquire on-line monitoring data, the on-line monitoring data including production status data and emission data of a plurality of pollutants, wherein each time point is denoted as a vector Z, Z = {Z1, Z2, …, Zn}, Zn represents the emission data of the n-th pollutant, n is the number of pollutant types, a matrix A of a plurality of time points is denoted as one sample, and the set of all samples is the data set;
A feature calculation module 502 configured to calculate the mutual information between pollutants and extract it as a feature of the correlation between pollutant sequences;
A parameter calculation module 503 configured to perform a separate normal distribution test on each dimension to obtain statistical parameters, each dimension corresponding to the sequence of one pollutant over a plurality of time points;
An anomaly score acquisition module 504 configured to obtain anomaly scores using anomaly detection algorithms under different distribution assumptions;
A classification model construction module 505 configured to concatenate, along the pollutant dimension, the features of the correlation between pollutant sequences, the statistical parameters and the anomaly scores, and to use them as the input of a neural network to construct a classification model;
A model training module 506 configured to train the classification model on the data set to obtain a falsification detection model;
A falsification detection module 507 configured to calculate the probability that the emission data have been falsified based on the falsification detection model.
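The data layout handled by the data acquisition module can be sketched as follows. This is only an illustration: the sliding-window grouping, the `window` and `step` parameters and the function names are assumptions, since the description only states that a matrix A of several time points forms one sample.

```python
def build_samples(readings, window, step=1):
    """Group consecutive time-point vectors Z into matrices A, one per sample.

    `readings` is a list of vectors Z = [Z1, ..., Zn], one per time point.
    Each returned sample is a list of `window` consecutive vectors.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = len(readings) - window
    return [readings[s:s + window] for s in range(0, last_start + 1, step)]


def pollutant_sequence(sample, index):
    """Return one dimension of a sample: the time sequence of a single pollutant."""
    return [row[index] for row in sample]
```

The per-pollutant sequences returned by `pollutant_sequence` are the "dimensions" on which the normal distribution tests and anomaly scores are computed.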
In some embodiments, the feature calculation module is further configured to calculate the mutual information between the pollutants by the following formulas:
$$E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$
$$\mathrm{Gain}(T, X) = \mathrm{Entropy}(T) - \mathrm{Entropy}(T, X)$$
Wherein E(S) represents the information entropy of the pollutant, i represents the type of pollutant, c represents the number of pollutant types, p_i represents the marginal probability density function of the i-th pollutant, Gain(T, X) represents the mutual information between the two pollutants, Entropy(T) represents the information entropy of one of the two pollutants, and Entropy(T, X) represents the information entropy of the other of the two pollutants.
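As a worked illustration of the Gain formula, the following sketch estimates it from two pollutant sequences. It rests on several assumptions not stated in the description: the sequences are discretized into equal-width bins, probabilities are estimated by counting, and Entropy(T, X) is read as the entropy of T conditioned on X, as in the usual information-gain definition.

```python
import math
from collections import Counter


def discretize(values, bins=4):
    """Map continuous readings to equal-width bin indices (hypothetical choice)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant sequences
    return [min(int((v - lo) / width), bins - 1) for v in values]


def entropy(labels):
    """E(S) = sum over classes of -p_i * log2(p_i), with p_i estimated by counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(t, x):
    """Entropy of t within each group of equal x, weighted by group size."""
    groups = {}
    for ti, xi in zip(t, x):
        groups.setdefault(xi, []).append(ti)
    return sum(len(g) / len(t) * entropy(g) for g in groups.values())


def gain(t, x):
    """Gain(T, X) = Entropy(T) - Entropy(T, X) under the assumptions above."""
    return entropy(t) - conditional_entropy(t, x)
```

For two continuous pollutant sequences `a` and `b`, the feature would be obtained as `gain(discretize(a), discretize(b))`.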
In some embodiments, the parameter calculation module is further configured to perform:
the Kolmogorov-Smirnov test, given by:
$$D_n = \sup_x \left| F_n(x) - F(x) \right|$$
Wherein D_n represents the statistic of the normal distribution test, sup represents the supremum of the set of distances, x represents the data of the single pollutant under test, F_n(x) represents the cumulative probability of the actual distribution at x, and F(x) represents the cumulative probability of the theoretical distribution to be obeyed;
the Anderson-Darling test, given by:
$$Z = n \int_{-\infty}^{\infty} \left[ F_n(x) - F(x) \right]^2 w(x) f(x) \, dx$$
Wherein Z represents the statistic of the normal distribution test, n represents the amount of data of the single pollutant under test, w(x) represents the weight function, and f(x) represents the theoretical distribution density function.
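The two test statistics can be computed from a sorted sample as sketched below. The sketch makes two assumptions: the theoretical distribution is a normal distribution fitted to the sample mean and standard deviation, and the Anderson-Darling weight is the standard choice w(x) = 1 / (F(x)(1 - F(x))), which yields the closed-form sum used in `ad_statistic`.

```python
import math
from statistics import NormalDist, mean, stdev


def fitted_normal_cdf(data):
    """CDF of a normal distribution fitted to the sample (needs non-zero spread)."""
    return NormalDist(mean(data), stdev(data)).cdf


def ks_statistic(data, cdf):
    """D_n = sup_x |F_n(x) - F(x)|, evaluated at the jumps of the empirical CDF."""
    xs = sorted(data)
    n = len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n)
               for i, x in enumerate(xs))


def ad_statistic(data, cdf):
    """Anderson-Darling statistic with the standard weight 1 / (F (1 - F)).

    Assumes 0 < cdf(x) < 1 for every data point; otherwise the logarithm fails.
    """
    xs = sorted(data)
    n = len(xs)
    total = 0.0
    for i, x in enumerate(xs, start=1):
        # xs[n - i] is the (n + 1 - i)-th order statistic in 1-based numbering.
        total += (2 * i - 1) * (math.log(cdf(x)) + math.log(1.0 - cdf(xs[n - i])))
    return -n - total / n
```

Larger values of either statistic indicate a larger departure of the pollutant sequence from the theoretical distribution.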
In some embodiments, the anomaly score acquisition module is further configured to perform:
Dividing each dimension into intervals using a static-width histogram to obtain the anomaly score:
in actual calculation, this formula is also equivalent to the following formula:
Wherein HBOS(p) represents the anomaly score calculated by the Histogram-based Outlier Score method, d represents the amount of data of the single pollutant taking part in the calculation, and hist_i(p) represents the frequency (relative quantity) of the histogram after bin normalization;
Calculating outliers by the Mahalanobis distance to obtain the anomaly score:
$$D_M(x_i) = \sqrt{(x_i - \mu)^{T} \Sigma^{-1} (x_i - \mu)}$$
Wherein D_M(x_i) represents the Mahalanobis distance measure, x_i represents the value of a sample point, μ represents the population mean, and Σ represents the covariance matrix of the population;
Iteratively isolating the samples presumed to be outliers using a binary search tree structure, and calculating the anomaly score:
$$s(x, \psi) = 2^{-\frac{E(h(x))}{c(\psi)}}$$
Wherein
$$c(\psi) = 2H(\psi - 1) - \frac{2(\psi - 1)}{\psi}$$
Wherein ψ represents the number of data points drawn from the data set to which x belongs, c(ψ) represents the average height for ψ data points, s(x, ψ) represents the anomaly score, H(ψ-1) is the harmonic number of (ψ-1), h(x) represents the height of a data point x, i.e. the number of edges traversed from the root node of the tree to reach the leaf node, and E(h(x)) is the mean of h(x) over the trees.
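The three scores above can be sketched in a few lines. The sketch is illustrative and rests on stated assumptions: the HBOS score uses the sum-of-logarithms form of the published Histogram-based Outlier Score method with bin heights normalized so that the tallest bin equals 1; the Mahalanobis distance is shown for a single dimension, where it reduces to the absolute deviation from the mean divided by the standard deviation; and the isolation-tree part only covers the normalization of a given mean path height, not the construction of the trees. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def hbos_scores(rows, bins=5):
    """Sum over dimensions of log(1 / hist_j(p)) for each point p in `rows`."""
    scores = [0.0] * len(rows)
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        lo, hi = min(col), max(col)
        width = (hi - lo) / bins or 1.0  # static (equal) bin width
        idx = [min(int((v - lo) / width), bins - 1) for v in col]
        counts = [0] * bins
        for k in idx:
            counts[k] += 1
        peak = max(counts)  # normalize so the tallest bin has height 1
        for n, k in enumerate(idx):
            scores[n] += math.log(peak / counts[k])
    return scores


def mahalanobis_1d(values):
    """One-dimensional Mahalanobis distance; requires a non-constant sequence."""
    mu, sigma = mean(values), pstdev(values)
    return [abs(v - mu) / sigma for v in values]


def average_path_length(psi):
    """c(psi) = 2 H(psi - 1) - 2 (psi - 1) / psi, with H the harmonic number."""
    if psi <= 1:
        return 0.0
    harmonic = sum(1.0 / i for i in range(1, psi))
    return 2.0 * harmonic - 2.0 * (psi - 1) / psi


def isolation_score(mean_height, psi):
    """s(x, psi) = 2 ** (-E(h(x)) / c(psi)); values near 1 suggest an outlier."""
    return 2.0 ** (-mean_height / average_path_length(psi))
```

A point that is isolated after very few splits has a small mean height and therefore a score close to 1, while a point whose mean height equals c(ψ) scores exactly 0.5.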
In some embodiments, the anomaly score acquisition module is further configured to:
for an algorithm that presupposes a normal distribution, when the data fail the normal distribution test, perform a normal distribution transformation before the anomaly scores are calculated.
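The description does not say which normal distribution transformation is used. One possible choice, shown purely as an assumption, is a rank-based inverse normal transformation: each value is replaced by the standard normal quantile of its rank, so the transformed sequence is approximately normal whatever the original shape. A Box-Cox transformation would be another option.

```python
from statistics import NormalDist


def rank_inverse_normal(values):
    """Replace each value by the standard normal quantile of its rank.

    Ties are ranked in order of appearance, which is a simplification.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # (rank - 0.5) / n keeps the probability strictly inside (0, 1).
        out[i] = std_normal.inv_cdf((rank - 0.5) / n)
    return out
```

The transformed sequence would then be passed to the score that presupposes normality, such as the Mahalanobis distance.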
In some embodiments, the classification model includes a Self-Attention structure, an RNN structure and a Luong Attention structure, and the classification model construction module is further configured to:
Calculate the output of each block by applying a Self-Attention structure to the features of the correlation between pollutant sequences, the statistical parameters and the anomaly scores;
For the results of the plurality of Self-Attention blocks, construct an RNN structure that is ordered according to their logical relationship, and calculate its output;
Calculate, in the Luong Attention structure, the weighted result of the inputs and outputs of the RNN structure, and calculate the probability of an anomaly for each pollutant in the corresponding sequence via an MLP structure with two hidden layers.
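The weighting performed in the Luong Attention structure can be illustrated with the dot-product scoring variant. This is a sketch under assumptions: the description does not say which Luong scoring function is used, and the `query` and `states` arguments merely stand in for an RNN output vector and the sequence of vectors being attended over.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def luong_dot_attention(query, states):
    """Return attention weights and the weighted context vector.

    score(query, state) is the dot product; the context is the
    weight-averaged state, as in Luong-style global attention.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * state[k] for w, state in zip(weights, states))
               for k in range(dim)]
    return weights, context
```

In the described model, a context vector of this kind would be fed to the MLP with two hidden layers that outputs the per-pollutant anomaly probability.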
In some embodiments, the model training module is further configured to:
Extract a preset proportion of positive samples from the data set, modify the values in those samples that exceed the pollutant standard by reducing them to values at a large Euclidean distance from the standard, label the modified samples as negative samples, and take the generated negative samples together with the original real sample set as the total data set;
Divide the total data set into a training set, a test set and a verification set;
Based on the training set and the test set, train the neural network with different random seeds and, for each random seed, keep the model parameters of the model with the best training effect; the training effect is determined by the test-set accuracy, which is obtained by predicting falsification whenever the falsification probability calculated for a pollutant is greater than or equal to 0.5 and comparing the prediction with the true label;
Configure a classification model with each set of model parameters, compare the models on the verification set, and take the model with the best verification-set effect as the falsification detection model.
It should be noted that the emission data falsification detection device belongs to the same technical concept as the method described above and provides the same beneficial effects, which are not repeated here.
The embodiment of the invention provides an emission data falsification detection system, which comprises:
A memory for storing a computer program;
a processor for executing the computer program to implement the method as described in any of the embodiments above.
Embodiments of the present invention provide a non-transitory computer-readable storage medium storing instructions which, when executed by a processor, perform a method as described in any of the embodiments above.
Furthermore, although exemplary embodiments have been described herein, the scope thereof includes any and all embodiments having equivalent elements, modifications, omissions, combinations (e.g., of aspects across various embodiments), adaptations or alterations based on the present application. The elements in the claims are to be interpreted broadly based on the language employed in the claims and are not limited to the examples described in the present specification or during the prosecution of the application, which examples are to be construed as non-exclusive. It is intended, therefore, that the specification and examples be considered as exemplary only, with a true scope and spirit being indicated by the following claims and their full scope of equivalents.
The above description is intended to be illustrative and not restrictive. For example, the above-described examples (or one or more aspects thereof) may be used in combination with each other. For example, other embodiments may be used by those of ordinary skill in the art upon reading the above description. In addition, in the above detailed description, various features may be grouped together to streamline the invention. This is not to be interpreted as an intention that the features of the claimed invention are essential to any of the claims. Rather, inventive subject matter may lie in less than all features of a particular inventive embodiment. Thus, the following claims are hereby incorporated into the detailed description as examples or embodiments, with each claim standing on its own as a separate embodiment, and it is contemplated that these embodiments may be combined with one another in various combinations or permutations. The scope of the invention should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
Claims (6)
1. A method for detecting falsification of emission data, the method comprising:
Acquiring on-line monitoring data, wherein the on-line monitoring data comprise production status data and emission data of a plurality of pollutants, each time point is denoted as a vector Z, Z = {Z1, Z2, …, Zn}, Zn represents the emission data of the n-th pollutant, n is the number of pollutant types, a matrix A of a plurality of time points is denoted as one sample, and the set of all samples is used as a data set;
Calculating the mutual information between pollutants and extracting it as a feature of the correlation between pollutant sequences;
Performing a separate normal distribution test on each dimension to obtain statistical parameters, wherein each dimension corresponds to the sequence of one pollutant over a plurality of time points;
Obtaining anomaly scores using anomaly detection algorithms under different distribution assumptions;
Concatenating, along the pollutant dimension, the features of the correlation between pollutant sequences, the statistical parameters and the anomaly scores, and using them as the input of a neural network to construct a classification model;
Training the classification model on the data set to obtain a falsification detection model;
Calculating the probability that the emission data have been falsified based on the falsification detection model;
wherein the mutual information between the pollutants is calculated by the following formulas:
$$E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$
$$\mathrm{Gain}(T, X) = \mathrm{Entropy}(T) - \mathrm{Entropy}(T, X)$$
Wherein E(S) represents the information entropy of the pollutant, i represents the type of pollutant, c represents the number of pollutant types, p_i represents the marginal probability density function of the i-th pollutant, Gain(T, X) represents the mutual information between the two pollutants, Entropy(T) represents the information entropy of one of the two pollutants, and Entropy(T, X) represents the information entropy of the other of the two pollutants;
wherein performing the separate normal distribution test on each dimension to obtain the statistical parameters specifically comprises:
the Kolmogorov-Smirnov test, given by:
$$D_n = \sup_x \left| F_n(x) - F(x) \right|$$
Wherein D_n represents the statistic of the normal distribution test, sup represents the supremum of the set of distances, x represents the data of the single pollutant under test, F_n(x) represents the cumulative probability of the actual distribution at x, and F(x) represents the cumulative probability of the theoretical distribution to be obeyed;
the Anderson-Darling test, given by:
$$Z = n \int_{-\infty}^{\infty} \left[ F_n(x) - F(x) \right]^2 w(x) f(x) \, dx$$
Wherein Z represents the statistic of the normal distribution test, n represents the amount of data of the single pollutant under test, w(x) represents the weight function, and f(x) represents the theoretical distribution density function;
wherein obtaining the anomaly scores using anomaly detection algorithms under different distribution assumptions specifically comprises:
Dividing each dimension into intervals using a static-width histogram to obtain the anomaly score:
in actual calculation, this formula is also equivalent to the following formula:
Wherein HBOS(p) represents the anomaly score calculated by the Histogram-based Outlier Score method, d represents the amount of data of the single pollutant taking part in the calculation, and hist_i(p) represents the frequency of the histogram after bin normalization;
Calculating outliers by the Mahalanobis distance to obtain the anomaly score:
$$D_M(x_i) = \sqrt{(x_i - \mu)^{T} \Sigma^{-1} (x_i - \mu)}$$
Wherein D_M(x_i) represents the Mahalanobis distance measure, x_i represents the value of a sample point, μ represents the population mean, and Σ represents the covariance matrix of the population;
Iteratively isolating the samples presumed to be outliers using a binary search tree structure, and calculating the anomaly score:
$$s(x, \psi) = 2^{-\frac{E(h(x))}{c(\psi)}}$$
Wherein
$$c(\psi) = 2H(\psi - 1) - \frac{2(\psi - 1)}{\psi}$$
Wherein ψ represents the number of data points drawn from the data set to which x belongs, c(ψ) represents the average height for ψ data points, s(x, ψ) represents the anomaly score, H(ψ-1) is the harmonic number of (ψ-1), h(x) represents the height of a data point x, i.e. the number of edges traversed from the root node of the tree to reach the leaf node, and E(h(x)) is the mean of h(x) over the trees;
wherein the classification model comprises a Self-Attention structure, an RNN structure and a Luong Attention structure, and concatenating, along the pollutant dimension, the features of the correlation between pollutant sequences, the statistical parameters and the anomaly scores and using them as the input of the neural network to construct the classification model specifically comprises:
Calculating the output of each block by applying a Self-Attention structure to the features of the correlation between pollutant sequences, the statistical parameters and the anomaly scores;
For the results of the plurality of Self-Attention blocks, constructing an RNN structure that is ordered according to their logical relationship, and calculating its output;
Calculating, in the Luong Attention structure, the weighted result of the inputs and outputs of the RNN structure, and calculating the probability of an anomaly for each pollutant in the corresponding sequence via an MLP structure with two hidden layers.
2. The method of claim 1, wherein, for an algorithm that presupposes a normal distribution, when the data fail the normal distribution test, a normal distribution transformation is performed before the anomaly scores are calculated.
3. The method according to claim 1, wherein training the classification model on the data set to obtain the falsification detection model specifically comprises:
Extracting a preset proportion of positive samples from the data set, modifying the values in those samples that exceed the pollutant standard by reducing them to values at a large Euclidean distance from the standard, labeling the modified samples as negative samples, and taking the generated negative samples together with the original real sample set as the total data set;
Dividing the total data set into a training set, a test set and a verification set;
Based on the training set and the test set, training the neural network with different random seeds and, for each random seed, keeping the model parameters of the model with the best training effect; the training effect is determined by the test-set accuracy, which is obtained by predicting falsification whenever the falsification probability calculated for a pollutant is greater than or equal to 0.5 and comparing the prediction with the true label;
Configuring a classification model with each set of model parameters, comparing the models on the verification set, and taking the model with the best verification-set effect as the falsification detection model.
4. An emission data falsification detection device, the device comprising:
A data acquisition module configured to acquire on-line monitoring data, the on-line monitoring data including production status data and emission data of a plurality of pollutants, wherein each time point is denoted as a vector Z, Z = {Z1, Z2, …, Zn}, Zn represents the emission data of the n-th pollutant, n is the number of pollutant types, a matrix A of a plurality of time points is denoted as one sample, and the set of all samples is used as a data set;
A feature calculation module configured to calculate the mutual information between pollutants and extract it as a feature of the correlation between pollutant sequences;
A parameter calculation module configured to perform a separate normal distribution test on each dimension to obtain statistical parameters, wherein each dimension corresponds to the sequence of one pollutant over a plurality of time points;
An anomaly score acquisition module configured to obtain anomaly scores using anomaly detection algorithms under different distribution assumptions;
A classification model construction module configured to concatenate, along the pollutant dimension, the features of the correlation between pollutant sequences, the statistical parameters and the anomaly scores, and to use them as the input of a neural network to construct a classification model;
A model training module configured to train the classification model on the data set to obtain a falsification detection model;
A falsification detection module configured to calculate the probability that the emission data have been falsified based on the falsification detection model;
the feature calculation module is further configured to calculate the mutual information between the pollutants by the following formulas:
$$E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$
$$\mathrm{Gain}(T, X) = \mathrm{Entropy}(T) - \mathrm{Entropy}(T, X)$$
Wherein E(S) represents the information entropy of the pollutant, i represents the type of pollutant, c represents the number of pollutant types, p_i represents the marginal probability density function of the i-th pollutant, Gain(T, X) represents the mutual information between the two pollutants, Entropy(T) represents the information entropy of one of the two pollutants, and Entropy(T, X) represents the information entropy of the other of the two pollutants;
The parameter calculation module is further configured to perform:
the Kolmogorov-Smirnov test, given by:
$$D_n = \sup_x \left| F_n(x) - F(x) \right|$$
Wherein D_n represents the statistic of the normal distribution test, sup represents the supremum of the set of distances, x represents the data of the single pollutant under test, F_n(x) represents the cumulative probability of the actual distribution at x, and F(x) represents the cumulative probability of the theoretical distribution to be obeyed;
the Anderson-Darling test, given by:
$$Z = n \int_{-\infty}^{\infty} \left[ F_n(x) - F(x) \right]^2 w(x) f(x) \, dx$$
Wherein Z represents the statistic of the normal distribution test, n represents the amount of data of the single pollutant under test, w(x) represents the weight function, and f(x) represents the theoretical distribution density function;
the anomaly score acquisition module is further configured to perform:
Dividing each dimension into intervals using a static-width histogram to obtain the anomaly score:
in actual calculation, this formula is also equivalent to the following formula:
Wherein HBOS(p) represents the anomaly score calculated by the Histogram-based Outlier Score method, d represents the amount of data of the single pollutant taking part in the calculation, and hist_i(p) represents the frequency of the histogram after bin normalization;
Calculating outliers by the Mahalanobis distance to obtain the anomaly score:
$$D_M(x_i) = \sqrt{(x_i - \mu)^{T} \Sigma^{-1} (x_i - \mu)}$$
Wherein D_M(x_i) represents the Mahalanobis distance measure, x_i represents the value of a sample point, μ represents the population mean, and Σ represents the covariance matrix of the population;
Iteratively isolating the samples presumed to be outliers using a binary search tree structure, and calculating the anomaly score:
$$s(x, \psi) = 2^{-\frac{E(h(x))}{c(\psi)}}$$
Wherein
$$c(\psi) = 2H(\psi - 1) - \frac{2(\psi - 1)}{\psi}$$
Wherein ψ represents the number of data points drawn from the data set to which x belongs, c(ψ) represents the average height for ψ data points, s(x, ψ) represents the anomaly score, H(ψ-1) is the harmonic number of (ψ-1), h(x) represents the height of a data point x, i.e. the number of edges traversed from the root node of the tree to reach the leaf node, and E(h(x)) is the mean of h(x) over the trees;
The classification model includes a Self-Attention structure, an RNN structure and a Luong Attention structure, and the classification model construction module is further configured to:
Calculate the output of each block by applying a Self-Attention structure to the features of the correlation between pollutant sequences, the statistical parameters and the anomaly scores;
For the results of the plurality of Self-Attention blocks, construct an RNN structure that is ordered according to their logical relationship, and calculate its output;
Calculate, in the Luong Attention structure, the weighted result of the inputs and outputs of the RNN structure, and calculate the probability of an anomaly for each pollutant in the corresponding sequence via an MLP structure with two hidden layers.
5. An emission data falsification detection system, characterized in that the system comprises:
A memory for storing a computer program;
A processor for executing the computer program to implement the method of any one of claims 1 to 3.
6. A non-transitory computer readable storage medium storing instructions which, when executed by a processor, perform the method of any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311236361.1A CN117235624B (en) | 2023-09-22 | 2023-09-22 | Emission data falsification detection method, device and system and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117235624A CN117235624A (en) | 2023-12-15 |
CN117235624B true CN117235624B (en) | 2024-05-07 |
Family
ID=89090785
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311236361.1A Active CN117235624B (en) | 2023-09-22 | 2023-09-22 | Emission data falsification detection method, device and system and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117235624B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118313564B (en) * | 2024-06-05 | 2024-08-23 | 生态环境部环境工程评估中心 | Abnormality identification method, device, equipment and medium for enterprise emission monitoring data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | ||
Country or region after: China; Address after: 100096 no.258, 2nd floor, building 2, Xisanqi building materials City, Haidian District, Beijing; Applicant after: China Energy Conservation Digital Technology Co.,Ltd.; Address before: 100096 no.258, 2nd floor, building 2, Xisanqi building materials City, Haidian District, Beijing; Applicant before: CECEP TALROAD TECHNOLOGY CO.,LTD.; Country or region before: China |
GR01 | Patent grant | ||