CN117235624B - Emission data falsification detection method, device and system and storage medium - Google Patents

Publication number
CN117235624B
Authority
CN
China
Prior art keywords
data
pollutant
anomaly
contaminant
entropy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311236361.1A
Other languages
Chinese (zh)
Other versions
CN117235624A (en)
Inventor
庞继伟
孙艺嘉
张栩
郭炜
杨珊珊
丁杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Energy Conservation Digital Technology Co ltd
Original Assignee
China Energy Conservation Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Energy Conservation Digital Technology Co ltd
Priority to CN202311236361.1A
Publication of CN117235624A
Application granted
Publication of CN117235624B

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method, device, system and storage medium for detecting emission data falsification. Multiple mathematical correlation tests, distribution tests and anomaly tests are performed on an enterprise's emission data, and a statistics-based deep learning model is constructed and trained on the results of these tests. With this model, environmental monitoring data can be checked on the basis of its anomaly distribution, and falsified data that does not conform to the expected distribution relationships is identified automatically. The invention effectively detects falsified emission data and reduces the likelihood that enterprises falsify online monitoring data.

Description

Emission data falsification detection method, device and system and storage medium
Technical Field
The present invention relates to the field of environmental monitoring and protection technologies, and in particular, to a method, an apparatus, a system, and a storage medium for detecting emission data falsification.
Background
Falsifying online monitoring data and detecting that falsification is an ongoing contest between attack and defense. Enterprises currently falsify data either through basic physical and chemical means or by tampering with the online data itself.
Falsification by physical and chemical means, such as soaking the filter element in alkaline solution, pulling out the sampling tube, or drawing in outside air, usually leaves fairly regular traces. The corresponding identification methods, such as video recognition, are comparatively well defined and are not described further here.
Data tampering, the falsification means that the invention mainly addresses, has in the past generally been identified either by manually summarizing salient data patterns or by building a BP neural network on raw pollutant data. Once an enterprise upgrades its falsification means with a certain mathematical foundation, it can easily work out and evade these identification methods.
Description of the prior art and its deficiencies:
1) Method based on manually summarized salient data patterns
Over long-term operation, experienced engineers often notice implausible portions of enterprise data. Patterns that can be described mathematically can, once summarized by experienced engineers, be built into the system for automatic recognition.
The main problems of this method are that it depends heavily on the engineers' experience, inevitably contains unstable and hard-to-describe parts, and consumes considerable manpower and material resources.
2) Method of building a BP neural network on raw pollutant data
Specifically, a BP network is built directly on the raw pollutant emission data as a regression model.
The main problem of this method is as follows. The real-time sub-scores of the pollution source outlet are used as inputs to compute a predicted pollution source score, while the actual pollution source score is computed from the same indices. The actual score and the score predicted by the neural network may therefore follow the same mathematical relationship, and when both share a falsified input they may produce approximately the same falsified output. The method uses a BP neural network to build a regression model that fits the relationship from the sub-scores to the pollution source score. Where a definite mapping already exists, the network is likely to do no more than fit the traditional calculation that derives the actual pollution source score from the sub-score evaluation indices. When a batch of data has been falsified, that is, when the input itself is false, neither the neural network nor the conventional standard calculation may be able to identify the falsification.
This is similar to using a neural network to solve a quadratic equation in one variable: however complex the structure, a well-trained parameter matrix ultimately fits only a mathematical relationship that can be expressed as a compact elementary function. If the model's result differs from that of the traditional calculation, the model has most likely been inadequately trained rather than having actually identified falsification. Neural networks are more effective when the model describes a more complex relationship, one that traditional mathematical means have shown to be difficult to express as a concise mathematical procedure.
Disclosure of Invention
The present invention has been made to solve the above problems in the prior art. There is therefore a need for an emission data falsification detection method, device, system, and storage medium that solve at least the following problems:
1. The method based on manually summarized salient data patterns relies heavily on the experience of engineers, so it inevitably contains unstable and hard-to-describe parts. It therefore consumes more manpower and material resources, costs more, and is harder to turn into a standardized, widely applicable, and reproducible method.
2. Although the method of building a BP neural network on raw pollutant data avoids the pitfalls of the former method, it has a pitfall of its own in the mathematical relationship that the model is used to describe.
The invention is developed mainly on statistical analysis and neural network methods.
According to a first aspect of the present invention, there is provided an emission data falsification detection method, the method comprising:
acquiring online monitoring data, wherein the online monitoring data comprises production state data and emission data of multiple pollutants; each time point is recorded as a vector Z = {Z1, Z2, …, Zn}, where Zn represents the emission data of the n-th pollutant and n is the number of pollutant types; a matrix A composed of multiple time points is recorded as one sample; and the set of all samples is used as the data set;
calculating mutual information between pollutants and extracting it as a feature of the correlation between pollutant sequences;
performing a separate normal distribution test on each dimension to obtain statistic parameters, wherein each dimension corresponds to the sequence of one pollutant over the multiple time points;
obtaining anomaly scores using anomaly detection algorithms under different distribution assumptions;
concatenating, per pollutant dimension, the features of the correlation between pollutant sequences, the statistic parameters, and the anomaly scores, and using them as the input of a neural network to construct a classification model;
training the classification model on the data set to obtain a falsification detection model;
and calculating the probability of emission data falsification using the falsification detection model.
Further, the mutual information between pollutants is calculated by the following formulas:
$$E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$
$$\mathrm{Gain}(T, X) = \mathrm{Entropy}(T) - \mathrm{Entropy}(T, X)$$
where $E(S)$ represents the information entropy of a pollutant, $i$ represents the pollutant type, $c$ represents the number of pollutant types, $p_i$ represents the marginal probability density function of the $i$-th pollutant, $\mathrm{Gain}(T, X)$ represents the mutual information between two pollutants, $\mathrm{Entropy}(T)$ represents the information entropy of one of the two pollutants, and $\mathrm{Entropy}(T, X)$ represents the information entropy of the other of the two pollutants.
Further, performing the separate normal distribution test on each dimension to obtain the statistic parameters specifically includes:
the Kolmogorov-Smirnov test, with the following formula:
$$D_n = \sup_x \left| F_n(x) - F(x) \right|$$
where $D_n$ represents the statistic of the normal distribution test, $\sup$ represents the supremum of the set of distances, $x$ represents the data of the single pollutant under test, $F_n(x)$ represents the cumulative probability of the actual distribution obtained from $x$, and $F(x)$ represents the cumulative probability of the theoretical distribution to be followed;
the Anderson-Darling test, with the following formula:
$$Z = n \int_{-\infty}^{\infty} \left[ F_n(x) - F(x) \right]^2 w(x) f(x)\, dx$$
where $Z$ represents the statistic of the normal distribution test, $n$ represents the amount of data of the single pollutant under test, $w(x)$ represents the weight function, and $f(x)$ represents the theoretical distribution density function.
Further, obtaining the anomaly scores using anomaly detection algorithms under different distribution assumptions specifically includes:
dividing each dimension into intervals with a static-width histogram to obtain an anomaly score:
$$\mathrm{HBOS}(p) = \sum_{i=1}^{d} \log \frac{1}{\mathrm{hist}_i(p)}$$
in actual calculation, this formula is equivalent to the following formula:
$$\mathrm{HBOS}(p) = -\sum_{i=1}^{d} \log \mathrm{hist}_i(p)$$
where $\mathrm{HBOS}(p)$ represents the anomaly score calculated by the Histogram-based Outlier Score method, $d$ represents the amount of data of a single pollutant participating in the calculation, and $\mathrm{hist}_i(p)$ represents the frequency (relative amount) of the histogram bin after normalization;
obtaining an anomaly score by calculating outlier values with the Mahalanobis distance:
$$D_M(x_i) = \sqrt{(x_i - \bar{x})^{T} S^{-1} (x_i - \bar{x})}$$
where $D_M$ represents the Mahalanobis distance measure, $x_i$ represents the value of a sample point, $\bar{x}$ represents the population mean, and $S$ denotes the covariance matrix of the data;
iteratively isolating samples presumed to be outliers using a binary search tree structure and calculating the anomaly score:
$$s(x, \psi) = 2^{-\frac{E(h(x))}{c(\psi)}}$$
where
$$c(\psi) = 2H(\psi - 1) - \frac{2(\psi - 1)}{\psi}$$
where $\psi$ represents the number of data points drawn from the data set to which $x$ belongs, $c(\psi)$ represents the average height for $\psi$ data points, $s(x, \psi)$ represents the anomaly score, $H(\psi - 1)$ is the harmonic number of $(\psi - 1)$, $h(x)$ represents the height of a data point $x$, that is, the number of edges traversed from the root node of the tree to the leaf node, and $E(h(x))$ is the mean of $h(x)$ over the trees.
Further, where a method presupposes a normal distribution but the data fails the normal distribution test, a normal distribution transformation is performed first and the anomaly score is then calculated.
Further, the classification model includes a Self-Attention structure, an RNN structure, and a Luong Attention structure, and constructing the classification model specifically includes:
computing the output of each block by applying a Self-Attention structure to the features of the correlation between pollutant sequences, the statistic parameters, and the anomaly scores;
feeding the results of the multiple Self-Attention blocks, ordered according to their logical relationship, into an RNN structure and computing its output;
computing a weighted result of the inputs and outputs of the RNN structure in a Luong Attention structure, and computing the probability that each pollutant in the corresponding sequence is anomalous through an MLP structure with two hidden layers.
Further, training the classification model on the data set to obtain the falsification detection model specifically includes:
extracting a preset proportion of positive samples from the data set, altering the values in those samples that exceed the pollutant standard by reducing them to values at a large Euclidean distance from the standard, marking the results as negative samples, and taking the generated negative samples together with the original real sample set as the total data set;
dividing the total data set into a training set, a test set, and a validation set;
based on the training set and the test set, training the neural network with different random seeds and, for each random seed, saving the model parameters of the model with the best training effect; the training effect is determined by test set accuracy, where a pollutant is predicted as falsified when its computed falsification probability is greater than or equal to 0.5, and the accuracy is obtained by comparing these predictions with the true labels;
configuring classification models with each of the saved sets of model parameters and comparing them on the validation set, the model with the best validation set performance being taken as the falsification detection model.
According to a second aspect of the present invention, there is provided an emission data falsification detection device including:
a data acquisition module configured to acquire online monitoring data, wherein the online monitoring data comprises production state data and emission data of multiple pollutants; each time point is recorded as a vector Z = {Z1, Z2, …, Zn}, where Zn represents the emission data of the n-th pollutant and n is the number of pollutant types; a matrix A composed of multiple time points is recorded as one sample; and the set of all samples is used as the data set;
a feature calculation module configured to calculate mutual information between pollutants and extract it as a feature of the correlation between pollutant sequences;
a parameter calculation module configured to perform a separate normal distribution test on each dimension to obtain statistic parameters, wherein each dimension corresponds to the sequence of one pollutant over the multiple time points;
an anomaly score acquisition module configured to obtain anomaly scores using anomaly detection algorithms under different distribution assumptions;
a classification model construction module configured to concatenate, per pollutant dimension, the features of the correlation between pollutant sequences, the statistic parameters, and the anomaly scores, and to use them as the input of a neural network to construct a classification model;
a model training module configured to train the classification model on the data set to obtain a falsification detection model;
and a falsification detection module configured to calculate the probability of emission data falsification using the falsification detection model.
According to a third aspect of the present invention, there is provided an emission data falsification detection system, the system including: a memory for storing a computer program; a processor for executing the computer program to implement the method as described above.
According to a fourth aspect of the invention, there is provided a non-transitory computer readable storage medium storing instructions which, when executed by a processor, perform the method as described above.
The emission data falsification detection method, device, system, and storage medium according to the various aspects of the invention have at least the following technical effects:
Because secondary modeling is performed on the results of multiple anomaly distributions, the invention has several advantages. First, the invention takes into account the preconditions of various statistical assumptions and adapts to the different data distributions encountered in practice. Second, the method describes a complex mathematical relationship with a deep learning model, which makes it difficult for enterprises to find a data generation method that defeats that relationship. In addition, at the inference stage the method can be deployed on a server and run automatically against a connected database, requiring only a small amount of manpower and material resources.
Drawings
In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. The same reference numerals with letter suffixes or different letter suffixes may represent different instances of similar components. The accompanying drawings illustrate various embodiments by way of example in general and not by way of limitation, and together with the description and claims serve to explain the inventive embodiments. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. Such embodiments are illustrative and not intended to be exhaustive or exclusive of the present apparatus or method.
FIG. 1 shows a flow chart of an emission data falsification detection method according to an embodiment of the invention.
FIG. 2 shows a schematic structural diagram of a classification model according to an embodiment of the invention.
FIG. 3 shows a classification model construction flow chart of an emission data falsification detection method according to an embodiment of the invention.
FIG. 4 shows a model training flow chart of an emission data falsification detection method according to an embodiment of the invention.
FIG. 5 shows a structural diagram of an emission data falsification detection device according to an embodiment of the invention.
Detailed Description
The present invention is described in detail below with reference to the drawings and specific embodiments so that those skilled in the art can better understand its technical scheme. The embodiments are described in further detail with reference to the drawings and specific examples, but not by way of limitation. Where steps have no necessary dependency on one another, the order in which they are described here by way of example should not be construed as limiting; those skilled in the art will understand that the order may be adjusted as long as the logic between the steps is not broken to the point that the overall process cannot be carried out.
An embodiment of the invention provides a method for detecting emission data falsification, specifically a method based on multiple anomaly distributions of online monitoring data and on deep learning.
Multiple mathematical correlation tests, distribution tests, and anomaly tests are performed on the enterprise's emission data, and a statistics-based deep learning model is constructed and trained on the results of these tests.
The scheme is suited to examining the emission data in enterprise online monitoring data that is marked as being in the normal production state, and to determining which pollutants in that data may have been falsified.
With this model, environmental monitoring data can be checked on the basis of its anomaly distribution, and falsified data that does not conform to the expected distribution relationships is identified automatically.
For environmental protection departments supervising enterprises, this automatic identification method saves considerable manpower and material resources and creates economic benefit. Because secondary modeling uses anomaly scores under multiple distribution assumptions, enterprises cannot easily work out how detection is performed and then falsify data in a targeted way; this deterrent effect has corresponding social value.
Specifically, referring to FIG. 1, the method includes the following steps:
Step S100, acquiring online monitoring data, wherein the online monitoring data comprises production state data and emission data of multiple pollutants; each time point is recorded as a vector Z = {Z1, Z2, …, Zn}, where Zn represents the emission data of the n-th pollutant and n is the number of pollutant types; a matrix A composed of multiple time points is recorded as one sample; and the set of all samples is used as the data set.
For example, with 6 pollutant types, each time point is recorded as a vector Z = {Z1, Z2, Z3, Z4, Z5, Z6}. The matrix A of 180 time points is recorded as one sample, and the set of all samples is used as the data set.
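The sample layout of step S100 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `make_samples` and `pollutant_sequence`, the use of non-overlapping windows, and the synthetic readings are all assumptions, since the text only fixes the shape of a sample (for example 180 time points by 6 pollutants).

```python
from typing import List

Vector = List[float]   # one time point: Z = {Z1, ..., Zn}
Matrix = List[Vector]  # one sample A: `window` consecutive time points


def make_samples(series: List[Vector], window: int = 180) -> List[Matrix]:
    """Cut a chronological list of time-point vectors into samples.

    Assumption: windows do not overlap and a trailing partial window
    is dropped; the source does not say how windows are chosen.
    """
    usable = len(series) - len(series) % window
    return [series[i:i + window] for i in range(0, usable, window)]


def pollutant_sequence(sample: Matrix, k: int) -> List[float]:
    """Return the time sequence (one 'dimension') of the k-th pollutant."""
    return [z[k] for z in sample]


# Synthetic stand-in for monitoring data: 400 time points, 6 pollutants.
series = [[float(t + k) for k in range(6)] for t in range(400)]
dataset = make_samples(series)
```

Each element of `dataset` is one matrix A, and `pollutant_sequence(sample, k)` yields the per-pollutant sequence on which the later tests operate.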
Step S200, calculating mutual information between pollutants and extracting it as a feature of the correlation between pollutant sequences.
Here, the mutual information between pollutants refers to the mutual information between the individual pollutant types within one pollutant sequence (the matrix A of multiple time points). If there are 6 pollutant types, denoted Z1, Z2, Z3, Z4, Z5, and Z6, the mutual information between pollutants may be the mutual information between Z1 and Z2, Z2 and Z3, Z4 and Z5, Z5 and Z6, and so on. The mutual information between pollutants characterizes the correlation or independence between the data and is therefore extracted as a feature of the correlation between pollutant sequences.
In some embodiments, the mutual information between pollutants is calculated by the following formulas:
$$E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$
$$\mathrm{Gain}(T, X) = \mathrm{Entropy}(T) - \mathrm{Entropy}(T, X)$$
where $E(S)$ represents the information entropy of a pollutant, $i$ represents the pollutant type, $c$ represents the number of pollutant types, $p_i$ represents the marginal probability density function of the $i$-th pollutant, $\mathrm{Gain}(T, X)$ represents the mutual information between two pollutants, $\mathrm{Entropy}(T)$ represents the information entropy of one of the two pollutants, and $\mathrm{Entropy}(T, X)$ represents the information entropy of the other of the two pollutants.
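As a hedged illustration of the Gain(T, X) formula, the sketch below estimates mutual information between two pollutant sequences from equal-width histograms. Two points are assumptions rather than statements of the source: the continuous readings are discretized into `bins` intervals (the text does not say how probabilities are estimated), and Entropy(T, X) is read as the conditional entropy of T given X, which is the reading under which the difference is a mutual information. All function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def discretize(values: Sequence[float], bins: int = 10) -> List[int]:
    """Map readings to equal-width bin indices (assumed estimator)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(t: Sequence[int], x: Sequence[int]) -> float:
    """H(T | X) = sum over values of X of P(x) * H(T | X = x)."""
    groups: Dict[int, List[int]] = {}
    for ti, xi in zip(t, x):
        groups.setdefault(xi, []).append(ti)
    return sum(len(g) / len(t) * entropy(g) for g in groups.values())


def gain(t_values: Sequence[float], x_values: Sequence[float],
         bins: int = 10) -> float:
    """Gain(T, X) = Entropy(T) - Entropy(T given X)."""
    t = discretize(t_values, bins)
    x = discretize(x_values, bins)
    return entropy(t) - conditional_entropy(t, x)


identical = gain([0, 1, 2, 3] * 25, [0, 1, 2, 3] * 25, bins=4)
independent = gain([0, 0, 1, 1] * 25, [0, 1, 0, 1] * 25, bins=2)
```

A sequence compared with itself gives a gain equal to its own entropy, while two independent sequences give a gain of zero, which matches the role of this feature as a measure of correlation between pollutants.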
Step S300, performing a separate normal distribution test on each dimension to obtain statistic parameters, wherein each dimension corresponds to the sequence of one pollutant over the multiple time points.
For example, with 6 pollutants, a matrix A contains 6 pollutant sequences and a sample has 6 dimensions.
In some embodiments, performing the separate normal distribution test on each dimension to obtain the statistic parameters specifically includes:
the Kolmogorov-Smirnov test, with the following formula:
$$D_n = \sup_x \left| F_n(x) - F(x) \right|$$
where $D_n$ represents the statistic of the normal distribution test, $\sup$ represents the supremum of the set of distances, $x$ represents the data of the single pollutant under test, $F_n(x)$ represents the cumulative probability of the actual distribution obtained from $x$, and $F(x)$ represents the cumulative probability of the theoretical distribution to be followed;
the Anderson-Darling test, with the following formula:
$$Z = n \int_{-\infty}^{\infty} \left[ F_n(x) - F(x) \right]^2 w(x) f(x)\, dx$$
where $Z$ represents the statistic of the normal distribution test, $n$ represents the amount of data of the single pollutant under test, $w(x)$ represents the weight function, and $f(x)$ represents the theoretical distribution density function.
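The two test statistics can be computed for one pollutant sequence roughly as follows. This is a sketch under stated assumptions: the theoretical distribution is a normal distribution whose mean and standard deviation are estimated from the sequence itself, and the Anderson-Darling integral is evaluated with its usual order-statistic form, which corresponds to the weight w(x) = 1 / (F(x)(1 - F(x))). The source does not state either choice, and the function name is illustrative.

```python
import math
import random
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def normality_statistics(x: Sequence[float]) -> Tuple[float, float]:
    """Return (Kolmogorov-Smirnov D_n, Anderson-Darling statistic).

    Assumes the sequence is not constant, so that the fitted
    standard deviation is positive.
    """
    n = len(x)
    xs = sorted(x)
    fitted = NormalDist(mean(xs), stdev(xs))
    eps = 1e-12  # keep the logarithms below finite
    cdf = [min(max(fitted.cdf(v), eps), 1.0 - eps) for v in xs]

    # KS: largest gap between the empirical step function and F(x),
    # checked on both sides of every step.
    d_n = max(max((i + 1) / n - f, f - i / n) for i, f in enumerate(cdf))

    # AD: -n - (1/n) * sum (2i - 1) [ln F(x_(i)) + ln(1 - F(x_(n+1-i)))]
    total = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
        for i in range(n)
    )
    return d_n, -n - total / n


rng = random.Random(0)
normal_like = [rng.gauss(50.0, 5.0) for _ in range(500)]
skewed = [rng.expovariate(0.1) for _ in range(500)]
d_normal, a_normal = normality_statistics(normal_like)
d_skewed, a_skewed = normality_statistics(skewed)
```

Both statistics grow as the sequence departs from normality, so the skewed sequence yields larger values than the normal-like one; these values are the statistic parameters that are later concatenated into the model input.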
Step S400, obtaining anomaly scores using anomaly detection algorithms under different distribution assumptions.
In some embodiments, where a method presupposes a normal distribution but the data fails the normal distribution test, a normal distribution transformation is performed first and the anomaly score is then calculated.
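The text does not name the normal distribution transformation. One common choice, used here purely as an assumed example, is a rank-based inverse normal transform, which replaces each reading by the standard normal quantile of its rank. Box-Cox or Yeo-Johnson transforms would be alternatives.

```python
from statistics import NormalDist
from typing import List, Sequence


def rank_inverse_normal(values: Sequence[float]) -> List[float]:
    """Map values to standard normal quantiles of their ranks.

    The k-th smallest value (k = 1..n) becomes the quantile at
    (k - 0.5) / n. Ties are broken by position, a simplification.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    out = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        out[idx] = std_normal.inv_cdf((rank - 0.5) / n)
    return out


transformed = rank_inverse_normal([1.0, 10.0, 100.0, 1000.0, 10000.0])
```

The transform keeps the ordering of the readings while making their distribution approximately normal, so a score that assumes normality, such as the Mahalanobis distance, can then be applied.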
In some embodiments, obtaining the anomaly scores using anomaly detection algorithms under different distribution assumptions specifically includes:
dividing each dimension into intervals with a static-width histogram to obtain an anomaly score:
$$\mathrm{HBOS}(p) = \sum_{i=1}^{d} \log \frac{1}{\mathrm{hist}_i(p)}$$
in actual calculation, this formula is equivalent to the following formula:
$$\mathrm{HBOS}(p) = -\sum_{i=1}^{d} \log \mathrm{hist}_i(p)$$
where $\mathrm{HBOS}(p)$ represents the anomaly score calculated by the Histogram-based Outlier Score method, $d$ represents the amount of data of a single pollutant participating in the calculation, and $\mathrm{hist}_i(p)$ represents the frequency (relative amount) of the histogram bin after normalization;
obtaining an anomaly score by calculating outlier values with the Mahalanobis distance:
$$D_M(x_i) = \sqrt{(x_i - \bar{x})^{T} S^{-1} (x_i - \bar{x})}$$
where $D_M$ represents the Mahalanobis distance measure, $x_i$ represents the value of a sample point, $\bar{x}$ represents the population mean, and $S$ denotes the covariance matrix of the data;
iteratively isolating samples presumed to be outliers using a binary search tree structure and calculating the anomaly score:
$$s(x, \psi) = 2^{-\frac{E(h(x))}{c(\psi)}}$$
where
$$c(\psi) = 2H(\psi - 1) - \frac{2(\psi - 1)}{\psi}$$
where $\psi$ represents the number of data points drawn from the data set to which $x$ belongs, $c(\psi)$ represents the average height for $\psi$ data points, $s(x, \psi)$ represents the anomaly score, $H(\psi - 1)$ is the harmonic number of $(\psi - 1)$, $h(x)$ represents the height of a data point $x$, that is, the number of edges traversed from the root node of the tree to the leaf node, and $E(h(x))$ is the mean of $h(x)$ over the trees.
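The tree-based score s(x, ψ) can be sketched for a single pollutant sequence as below. This is a simplified, one-dimensional illustration in the style of an isolation forest, not the patented implementation: the function names, the defaults `psi=64` and `trees=100`, the depth limit, and the choice to follow only the branch that contains x instead of building full trees are all assumptions. The height is averaged over the trees before the score is formed.

```python
import random
from typing import Sequence


def c(psi: int) -> float:
    """Average height c(psi) = 2*H(psi - 1) - 2*(psi - 1)/psi."""
    if psi <= 1:
        return 0.0
    harmonic = sum(1.0 / k for k in range(1, psi))  # H(psi - 1)
    return 2.0 * harmonic - 2.0 * (psi - 1) / psi


def path_length(x: float, data: Sequence[float], rng: random.Random,
                depth: int = 0, limit: int = 50) -> float:
    """Edges traversed until x is isolated by random 1-D splits."""
    if len(data) <= 1 or depth >= limit:
        return depth + c(len(data))
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth + c(len(data))
    split = rng.uniform(lo, hi)
    if x < split:
        side = [v for v in data if v < split]
    else:
        side = [v for v in data if v >= split]
    return path_length(x, side, rng, depth + 1, limit)


def anomaly_score(x: float, data: Sequence[float], psi: int = 64,
                  trees: int = 100, seed: int = 0) -> float:
    """s(x, psi) = 2 ** (-mean height / c(psi)); needs len(data) >= 2."""
    rng = random.Random(seed)
    psi = min(psi, len(data))
    heights = [path_length(x, rng.sample(list(data), psi), rng)
               for _ in range(trees)]
    return 2.0 ** (-(sum(heights) / trees) / c(psi))


rng = random.Random(1)
readings = [rng.gauss(0.0, 1.0) for _ in range(256)]
score_outlier = anomaly_score(8.0, readings)
score_inlier = anomaly_score(0.0, readings)
```

A reading far from the bulk of the data is isolated after few splits, so its mean height is small and its score is closer to 1, while a typical reading needs more splits and scores lower.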
Step S500, concatenating, per pollutant dimension, the features of the correlation between pollutant sequences, the statistic parameters, and the anomaly scores, and using them as the input of a neural network to construct a classification model.
In some embodiments, the structure of the classification model is shown in FIG. 2; the classification model includes a Self-Attention structure, an RNN structure, and a Luong Attention structure. As shown in FIG. 3, constructing the classification model from the concatenated features, statistic parameters, and anomaly scores specifically includes:
Step S501, computing the output of each block by applying a Self-Attention structure to the features of the correlation between pollutant sequences, the statistic parameters, and the anomaly scores;
Step S502, feeding the results of the multiple Self-Attention blocks, ordered according to their logical relationship, into an RNN structure and computing its output;
Step S503, computing a weighted result of the inputs and outputs of the RNN structure in a Luong Attention structure, and computing the probability that each pollutant in the corresponding sequence is anomalous through an MLP structure with two hidden layers.
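The two attention operations named in steps S501 and S503 can be illustrated with the following sketch. It is deliberately reduced and rests on several assumptions: learned projection matrices are omitted (queries, keys, and values are the inputs themselves), the Luong score is the plain dot-product variant, the RNN and the two-hidden-layer MLP are left out, and the feature values are invented. The source does not disclose layer sizes or weights, so this shows only the weighting mechanism.

```python
import math
from typing import List, Tuple

Vec = List[float]


def softmax(scores: Vec) -> Vec:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights: Vec, vectors: List[Vec]) -> Vec:
    width = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors))
            for j in range(width)]


def self_attention(tokens: List[Vec]) -> List[Vec]:
    """Scaled dot-product self-attention with Q = K = V = tokens."""
    scale = math.sqrt(len(tokens[0]))
    return [
        weighted_sum(softmax([dot(q, k) / scale for k in tokens]), tokens)
        for q in tokens
    ]


def luong_attention(state: Vec, outputs: List[Vec]) -> Tuple[Vec, Vec]:
    """Luong dot score: weights = softmax(state . h), context = sum(w * h)."""
    weights = softmax([dot(state, h) for h in outputs])
    return weights, weighted_sum(weights, outputs)


# One made-up feature vector per pollutant (mutual information,
# test statistic, anomaly score), for illustration only.
features = [[0.2, 1.0, 0.1], [0.3, 0.9, 0.2], [2.5, 0.1, 3.0]]
blocks = self_attention(features)
weights, context = luong_attention(blocks[-1], blocks)
```

In a full model the `blocks` would pass through the RNN, the Luong attention would combine the RNN inputs and outputs, and the resulting context would feed the MLP that outputs one falsification probability per pollutant.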
Step S600, training the classification model on the data set to obtain a falsification detection model.
In some embodiments, as shown in FIG. 4, training the classification model on the data set to obtain the falsification detection model specifically includes:
Step S601, extracting a preset proportion of positive samples from the data set, altering the values in those samples that exceed the pollutant standard by reducing them to values at a large Euclidean distance from the standard, marking the results as negative samples, and taking the generated negative samples together with the original real sample set as the total data set;
Step S602, dividing the total data set into a training set, a test set, and a validation set;
Step S603, based on the training set and the test set, training the neural network with different random seeds and, for each random seed, saving the model parameters of the model with the best training effect; the training effect is determined by test set accuracy, where a pollutant is predicted as falsified when its computed falsification probability is greater than or equal to 0.5, and the accuracy is obtained by comparing these predictions with the true labels;
Step S604, configuring classification models with each of the saved sets of model parameters and comparing them on the validation set, the model with the best validation set performance being taken as the falsification detection model.
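Steps S601 and S602 can be sketched as below. The concrete numbers are assumptions, because the text leaves them open: the proportion of samples that are falsified (`ratio`), how far below the limit the altered values are placed (`margin`), the split fractions, and the per-pollutant `limits` are all illustrative, as are the function names. Labels here use 1 for a generated (falsified) sample and 0 for a real one.

```python
import random
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Labelled = Tuple[Matrix, int]  # (sample, 1 if falsified else 0)


def falsify(sample: Matrix, limits: Sequence[float], margin: float,
            rng: random.Random) -> Matrix:
    """Copy a real sample, pulling over-limit readings well below the limit."""
    return [
        [limit * (1.0 - margin * rng.uniform(0.5, 1.0)) if value > limit
         else value
         for value, limit in zip(row, limits)]
        for row in sample
    ]


def build_total_dataset(real: List[Matrix], limits: Sequence[float],
                        ratio: float = 0.3, margin: float = 0.5,
                        seed: int = 0) -> List[Labelled]:
    """Real samples plus falsified copies of a preset proportion of them."""
    rng = random.Random(seed)
    chosen = rng.sample(real, int(len(real) * ratio))
    total: List[Labelled] = [(s, 0) for s in real]
    total += [(falsify(s, limits, margin, rng), 1) for s in chosen]
    rng.shuffle(total)
    return total


def split(total: List[Labelled], train: float = 0.7, test: float = 0.15):
    """Return (training set, test set, validation set)."""
    n_train = int(len(total) * train)
    n_test = int(len(total) * test)
    return (total[:n_train], total[n_train:n_train + n_test],
            total[n_train + n_test:])


limits = [10.0, 5.0]
real = [[[12.0, 4.0], [8.0, 6.0], [9.0, 3.0]] for _ in range(20)]
total = build_total_dataset(real, limits, ratio=0.5)
train_set, test_set, val_set = split(total)
```

The seed loop of steps S603 and S604 is not shown because it depends on the network being trained: it would train once per random seed on `train_set`, keep the parameters with the best accuracy on `test_set`, and finally pick among the kept models by their performance on `val_set`.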
Finally, in step S700, the probability of emission data falsification is calculated using the falsification detection model.
Specifically, for a trained model, the model structure and parameters are loaded directly at the inference stage. For each batch of data, steps S100 to S400 are performed, the model is then loaded for inference, and the probability of emission data falsification is calculated.
An embodiment of the invention also provides an emission data falsification detection device. As shown in FIG. 5, the device 500 comprises:
a data acquisition module 501 configured to acquire online monitoring data, wherein the online monitoring data comprises production state data and emission data of multiple pollutants; each time point is recorded as a vector Z = {Z1, Z2, …, Zn}, where Zn represents the emission data of the n-th pollutant and n is the number of pollutant types; a matrix A composed of multiple time points is recorded as one sample; and the set of all samples is used as the data set;
a feature calculation module 502 configured to calculate mutual information between pollutants and extract it as a feature of the correlation between pollutant sequences;
a parameter calculation module 503 configured to perform a separate normal distribution test on each dimension to obtain statistic parameters, wherein each dimension corresponds to the sequence of one pollutant over the multiple time points;
an anomaly score acquisition module 504 configured to obtain anomaly scores using anomaly detection algorithms under different distribution assumptions;
a classification model construction module 505 configured to concatenate, per pollutant dimension, the features of the correlation between pollutant sequences, the statistic parameters, and the anomaly scores, and to use them as the input of a neural network to construct a classification model;
a model training module 506 configured to train the classification model on the data set to obtain a falsification detection model;
and a falsification detection module 507 configured to calculate the probability of emission data falsification using the falsification detection model.
In some embodiments, the feature calculation module is further configured to calculate the mutual information between pollutants by the following formulas:
$$E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$
$$\mathrm{Gain}(T, X) = \mathrm{Entropy}(T) - \mathrm{Entropy}(T, X)$$
where $E(S)$ represents the information entropy of a pollutant, $i$ represents the pollutant type, $c$ represents the number of pollutant types, $p_i$ represents the marginal probability density function of the $i$-th pollutant, $\mathrm{Gain}(T, X)$ represents the mutual information between two pollutants, $\mathrm{Entropy}(T)$ represents the information entropy of one of the two pollutants, and $\mathrm{Entropy}(T, X)$ represents the information entropy of the other of the two pollutants.
In some embodiments, the parameter calculation module is further configured to perform:
the Kolmogorov-Smirnov test, given by the following formula:
Wherein D_n represents the statistic of the normal distribution test, sup represents the supremum (least upper bound) of the set of distances, x represents the data of the single pollutant type under test, F_n(x) represents the cumulative probability of the empirical distribution obtained from x, and F(x) represents the cumulative probability of the theoretical distribution to be obeyed;
the Anderson-Darling test, given by the following formula:
Wherein Z represents the statistic of the normal distribution test, n represents the amount of data of the single pollutant under test, w(x) represents the weight function, and f(x) represents the theoretical distribution density function.
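By way of non-limiting illustration, both statistics can be sketched with the Python standard library. The sketch assumes the standard forms that match the symbol definitions above: D_n = sup_x |F_n(x) - F(x)| for the Kolmogorov-Smirnov test, and Z = n ∫ [F_n(x) - F(x)]^2 w(x) f(x) dx with w(x) = 1 / (F(x)(1 - F(x))) for the Anderson-Darling test, the latter evaluated through its usual closed-form sum. Fitting the normal distribution to the sample mean and standard deviation is also an assumption.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Sequence


def ks_statistic(x: Sequence[float]) -> float:
    """D_n = sup_x |F_n(x) - F(x)| against a normal law fitted to the sample."""
    n = len(x)
    dist = NormalDist(mean(x), stdev(x))
    d = 0.0
    for i, v in enumerate(sorted(x), start=1):
        f = dist.cdf(v)
        # The empirical CDF jumps at v, so both sides of the step are checked.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def anderson_darling(x: Sequence[float]) -> float:
    """A^2 = -n - (1/n) * sum_{i=1..n} (2i - 1) * [ln F(x_(i)) + ln(1 - F(x_(n+1-i)))]."""
    n = len(x)
    dist = NormalDist(mean(x), stdev(x))
    f = [dist.cdf(v) for v in sorted(x)]  # assumes no value is so extreme that F is 0 or 1
    total = sum(
        (2 * i - 1) * (math.log(f[i - 1]) + math.log(1 - f[n - i]))
        for i in range(1, n + 1)
    )
    return -n - total / n
```

Larger values of either statistic indicate a stronger departure of the pollutant sequence from normality.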
In some embodiments, the anomaly score acquisition module is further configured to:
divide each dimension into intervals using a static-width histogram to obtain an anomaly score:
in actual calculation, this formula is equivalent to the following formula:
Wherein HBOS(p) represents the anomaly score calculated by the Histogram-based Outlier Score method, d represents the amount of data of a single pollutant participating in the calculation, and hist_i(p) represents the frequency (relative quantity) of the histogram bin after normalization;
calculate outliers by the Mahalanobis distance to obtain an anomaly score:
Wherein the Mahalanobis distance measure is computed from x_i, the value of a sample point, and the population mean;
iteratively identify samples presumed to be outliers using a binary search tree structure, and calculate an anomaly score:
Wherein ψ represents the number of data points drawn from the data set to which x belongs, c(ψ) represents the average height for ψ data points, s(x, ψ) represents the anomaly score, H(ψ-1) is the harmonic number calculated from (ψ-1), and h(x) represents the height of a data point x, i.e. the number of edges traversed from the root node of the tree to reach the leaf node.
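By way of non-limiting illustration, the three scores can be sketched as follows. The sketch assumes the commonly published forms: a per-dimension HBOS term log(1 / hist(p)), which equals -log hist(p); a univariate Mahalanobis distance, which reduces to |x_i - mean| / standard deviation; and the isolation-forest score s(x, ψ) = 2^(-E[h(x)] / c(ψ)) with c(ψ) = 2H(ψ-1) - 2(ψ-1)/ψ and H(k) approximated by ln(k) plus Euler's constant. The bin count and all function names are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence

EULER_GAMMA = 0.5772156649


def hbos_scores(column: Sequence[float], bins: int = 5) -> List[float]:
    """Per-value score log(1 / hist(p)) from a static-width histogram of one dimension."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / bins or 1.0
    index = [min(int((v - lo) / width), bins - 1) for v in column]
    counts = [index.count(b) for b in range(bins)]
    tallest = max(counts)
    # Heights are normalised so the tallest bin is 1; sparse bins then score high.
    return [math.log(tallest / counts[b]) for b in index]


def mahalanobis_1d(column: Sequence[float]) -> List[float]:
    """Univariate Mahalanobis distance |x_i - mean| / std (assumes a non-constant column)."""
    mu, sigma = mean(column), pstdev(column)
    return [abs(v - mu) / sigma for v in column]


def average_path_length(psi: int) -> float:
    """c(psi) = 2 * H(psi - 1) - 2 * (psi - 1) / psi."""
    if psi <= 1:
        return 0.0
    harmonic = math.log(psi - 1) + EULER_GAMMA
    return 2 * harmonic - 2 * (psi - 1) / psi


def isolation_score(mean_height: float, psi: int) -> float:
    """s(x, psi) = 2 ** (-E[h(x)] / c(psi)); values well above 0.5 suggest an outlier."""
    return 2 ** (-mean_height / average_path_length(psi))
```

Building the isolation trees themselves is omitted here; `mean_height` stands for the average of h(x) over the trees.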
In some embodiments, the anomaly score acquisition module is further configured to:
for an anomaly detection algorithm that presupposes a normal distribution, when the data fail the normal distribution test, perform a normal distribution transformation before calculating the anomaly score.
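By way of non-limiting illustration, one possible normal distribution transformation is a rank-based inverse normal transform. The specific transformation is not named above, so this choice (rather than, for example, a Box-Cox transform) is an assumption, as is the function name `rank_to_normal`.

```python
from statistics import NormalDist
from typing import List, Sequence


def rank_to_normal(values: Sequence[float]) -> List[float]:
    """Replace each value by the standard-normal quantile of its rank.

    Ties are broken by position; the offset 0.5 keeps every quantile
    strictly between 0 and 1.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        out[i] = NormalDist().inv_cdf((rank - 0.5) / n)
    return out
```

The transformed sequence preserves the ordering of the readings, so a distance-based score such as the Mahalanobis distance can then be applied to it.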
In some embodiments, the classification model includes a Self-Attention structure, an RNN structure, and a Luong Attention structure, and the classification model construction module is further configured to:
calculate the output of each block by applying a Self-Attention structure to the features of the correlation between pollutant sequences, the statistic parameters and the anomaly scores;
construct an RNN structure that orders the results of the plurality of Self-Attention blocks according to their logical relation, and calculate its output;
weight the inputs and outputs of the RNN structure using the Luong Attention structure, and calculate the probability that each pollutant in the corresponding sequence is anomalous via an MLP structure with two hidden layers.
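By way of non-limiting illustration, the two attention steps can be sketched without any learned parameters. The sketch assumes scaled dot-product self-attention and the dot-product scoring variant of Luong attention; the absence of projection matrices, the omission of the RNN and the MLP, and all function names are simplifications made for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(query: Vector, keys: List[Vector], values: List[Vector], scale: float = 1.0) -> Vector:
    """Weighted sum of the values, with weights softmax(scale * query . key)."""
    weights = softmax([dot(query, k) * scale for k in keys])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]


def self_attention(block: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention over the rows of one feature block."""
    scale = 1.0 / math.sqrt(len(block[0]))
    return [attend(row, block, block, scale) for row in block]


def luong_context(final_state: Vector, rnn_outputs: List[Vector]) -> Vector:
    """Luong dot attention: the final RNN state attends over all RNN outputs."""
    return attend(final_state, rnn_outputs, rnn_outputs)
```

In a full model the context vector returned by `luong_context` would be fed, together with the RNN state, to the MLP that outputs the per-pollutant probability.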
In some embodiments, the model training module is further configured to:
extract a preset proportion of positive samples from the data set, modify the values in those samples that exceed the pollutant standard by reducing them to values lying a Euclidean distance away from the pollutant standard, label the modified samples as negative samples, and take the generated negative samples together with the original real sample set as a total data set;
divide the total data set into a training set, a test set and a validation set;
based on the training set and the test set, train the neural network starting from different random seeds and, for each random seed, save the model parameters of the model with the best training effect; the training effect is determined by the test set accuracy, which is the accuracy obtained by predicting falsification whenever the falsification probability calculated for a pollutant is greater than or equal to 0.5 and comparing the prediction with the true label;
configure a classification model with each set of saved model parameters, and compare the models on the validation set to take the model with the best validation set performance as the falsification detection model.
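By way of non-limiting illustration, the generation of negative samples and the accuracy rule can be sketched as follows. The parameter names `limits`, `distance` and `proportion`, and the choice of lowering a reading to exactly `limit - distance`, are hypothetical. In the code, real (positive) samples carry the label False and generated (negative) samples the label True, meaning "falsified".

```python
import random
from typing import List, Sequence, Tuple

Sample = List[List[float]]  # rows are time points, columns are pollutants


def forge_sample(sample: Sample, limits: Sequence[float], distance: float) -> Sample:
    """Lower every reading above its limit to `limit - distance`; keep the rest."""
    return [
        [limit - distance if value > limit else value for value, limit in zip(row, limits)]
        for row in sample
    ]


def build_total_dataset(
    real: List[Sample],
    limits: Sequence[float],
    proportion: float,
    distance: float,
    seed: int = 0,
) -> List[Tuple[Sample, bool]]:
    """Return all real samples plus forged copies of a preset proportion of them."""
    chosen = random.Random(seed).sample(real, int(len(real) * proportion))
    forged = [(forge_sample(s, limits, distance), True) for s in chosen]
    return [(s, False) for s in real] + forged


def accuracy(probabilities: Sequence[float], labels: Sequence[bool]) -> float:
    """Share of predictions (probability >= 0.5 means falsified) matching the labels."""
    hits = sum((p >= 0.5) == bool(y) for p, y in zip(probabilities, labels))
    return hits / len(labels)
```

The same `accuracy` function could serve for both the per-seed selection on the test set and the final comparison on the validation set.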
It should be noted that the emission data falsification detection device and the method described above are based on the same technical concept and achieve the same beneficial effects, which are not repeated here.
The embodiment of the invention provides an emission data falsification detection system, which comprises:
A memory for storing a computer program;
a processor for executing the computer program to implement the method as described in any of the embodiments above.
Embodiments of the present invention provide a non-transitory computer-readable storage medium storing instructions which, when executed by a processor, perform a method as described in any of the embodiments above.
Furthermore, although exemplary embodiments have been described herein, the scope thereof includes any and all embodiments having equivalent elements, modifications, omissions, combinations (e.g., of aspects across various embodiments), adaptations or alterations based on the present application. The elements in the claims are to be construed broadly based on the language employed in the claims and are not limited to examples described in the present specification or during the prosecution of the application, which examples are to be construed as non-exclusive. It is intended, therefore, that the specification and examples be considered as exemplary only, with a true scope and spirit being indicated by the following claims and their full scope of equivalents.
The above description is intended to be illustrative and not restrictive. For example, the above-described examples (or one or more aspects thereof) may be used in combination with each other. Other embodiments may be used by those of ordinary skill in the art upon reading the above description. In addition, in the above detailed description, various features may be grouped together to streamline the disclosure. This is not to be interpreted as an intention that an unclaimed disclosed feature is essential to any claim. Rather, inventive subject matter may lie in less than all features of a particular disclosed embodiment. Thus, the following claims are hereby incorporated into the detailed description as examples or embodiments, with each claim standing on its own as a separate embodiment, and it is contemplated that these embodiments may be combined with one another in various combinations or permutations. The scope of the invention should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims (6)

1. A method for detecting false emissions data, the method comprising:
Acquiring on-line monitoring data, wherein the on-line monitoring data comprise production state data and emission data of various pollutants, each time point is marked as a vector Z, Z= { Z1, Z2 … Zn }, wherein Zn represents emission data of an nth pollutant, n is the number of types of the pollutants, a matrix A of the plurality of time points is marked as one sample, and a set of all samples is used as a data set;
Calculating mutual information among pollutants, and extracting features serving as relativity among pollutant sequences;
Performing independent normal distribution verification on each dimension to obtain statistic parameters, wherein the dimensions correspond to sequences of a plurality of time points of each pollutant;
Obtaining an anomaly score by utilizing anomaly detection algorithms under different distribution hypothesis conditions;
Splicing the characteristics, statistic parameters and abnormal scores of the relativity among pollutant sequences by taking the dimension of the pollutants as the standard, and constructing a classification model by taking the characteristics, statistic parameters and abnormal scores as the input of a neural network;
training the classification model based on the data set to obtain a fake detection model;
calculating to obtain the probability of emission data counterfeiting based on the counterfeiting detection model;
the mutual information between the pollutants is calculated by the following formula:
Gain(T, X) = Entropy(T) - Entropy(T, X)
wherein E(S) represents the information entropy of a pollutant, i represents the pollutant type, c represents the number of pollutant types, p_i represents the marginal probability density function of the i-th pollutant, Gain(T, X) represents the mutual information between two pollutants, Entropy(T) represents the information entropy of one of the two pollutants, and Entropy(T, X) represents the information entropy of the other of the two pollutants;
performing a separate normal distribution test on each dimension to obtain statistic parameters specifically comprises:
the Kolmogorov-Smirnov test, given by the following formula:
wherein D_n represents the statistic of the normal distribution test, sup represents the supremum (least upper bound) of the set of distances, x represents the data of the single pollutant type under test, F_n(x) represents the cumulative probability of the empirical distribution obtained from x, and F(x) represents the cumulative probability of the theoretical distribution to be obeyed;
the Anderson-Darling test, given by the following formula:
wherein Z represents the statistic of the normal distribution test, n represents the amount of data of the single pollutant under test, w(x) represents the weight function, and f(x) represents the theoretical distribution density function;
obtaining anomaly scores by using anomaly detection algorithms under different distribution hypothesis conditions specifically comprises:
dividing each dimension into intervals using a static-width histogram to obtain an anomaly score:
in actual calculation, this formula is equivalent to the following formula:
wherein HBOS(p) represents the anomaly score calculated by the Histogram-based Outlier Score method, d represents the amount of data of a single pollutant participating in the calculation, and hist_i(p) represents the frequency of the histogram bin after normalization;
calculating outliers by the Mahalanobis distance to obtain an anomaly score:
wherein the Mahalanobis distance measure is computed from x_i, the value of a sample point, and the population mean;
iteratively identifying samples presumed to be outliers using a binary search tree structure, and calculating an anomaly score:
wherein ψ represents the number of data points drawn from the data set to which x belongs, c(ψ) represents the average height for ψ data points, s(x, ψ) represents the anomaly score, H(ψ-1) is the harmonic number calculated from (ψ-1), and h(x) represents the height of a data point x, i.e. the number of edges traversed from the root node of the tree to reach the leaf node;
the classification model comprises a Self-Attention structure, an RNN structure and a Luong Attention structure, and splicing, for each pollutant dimension, the features of the correlation between pollutant sequences, the statistic parameters and the anomaly scores as the input of the neural network to construct the classification model specifically comprises:
calculating the output of each block by applying a Self-Attention structure to the features of the correlation between pollutant sequences, the statistic parameters and the anomaly scores;
constructing an RNN structure that orders the results of the plurality of Self-Attention blocks according to their logical relation, and calculating its output;
weighting the inputs and outputs of the RNN structure using the Luong Attention structure, and calculating the probability that each pollutant in the corresponding sequence is anomalous via an MLP structure with two hidden layers.
2. The method of claim 1, wherein, for an anomaly detection algorithm that presupposes a normal distribution, when the data fail the normal distribution test, a normal distribution transformation is performed before the anomaly score is calculated.
3. The method according to claim 1, wherein training the classification model based on the data set to obtain a falsification detection model specifically comprises:
extracting a preset proportion of positive samples from the data set, modifying the values in those samples that exceed the pollutant standard by reducing them to values lying a Euclidean distance away from the pollutant standard, labelling the modified samples as negative samples, and taking the generated negative samples together with the original real sample set as a total data set;
dividing the total data set into a training set, a test set and a validation set;
based on the training set and the test set, training the neural network starting from different random seeds and, for each random seed, saving the model parameters of the model with the best training effect; the training effect is determined by the test set accuracy, which is the accuracy obtained by predicting falsification whenever the falsification probability calculated for a pollutant is greater than or equal to 0.5 and comparing the prediction with the true label;
configuring a classification model with each set of saved model parameters, and comparing the models on the validation set to take the model with the best validation set performance as the falsification detection model.
4. An emissions data fraud detection apparatus, the apparatus comprising:
A data acquisition module configured to acquire on-line monitoring data including production status data and emission data of a plurality of pollutants, each time point being denoted as a vector Z, Z = {Z1, Z2, …, Zn}, where Zn represents the emission data of the n-th pollutant and n is the number of pollutant types, a matrix A of a plurality of time points being denoted as one sample, and the set of all samples being used as a data set;
The characteristic calculation module is configured to calculate mutual information among pollutants and extract characteristics serving as relativity among pollutant sequences;
the parameter calculation module is configured to perform independent normal distribution verification on each dimension to obtain statistic parameters, wherein the dimensions correspond to a sequence of a plurality of time points of each pollutant;
An anomaly score acquisition module configured to acquire anomaly scores using anomaly detection algorithms under different distribution hypothesis conditions;
the classification model construction module is configured to splice the characteristics, statistic parameters and abnormal scores of the relativity among pollutant sequences based on the dimension of the pollutants, and is used as the input of the neural network to construct a classification model;
The model training module is configured to train the classification model based on the data set to obtain a fake detection model;
the fake-making detection module is configured to calculate the probability of fake emission data based on the fake-making detection model;
the feature calculation module is further configured to calculate mutual information between contaminants by the following formula:
Gain(T,X)=Entropy(T)-Entropy(T,X)
wherein E(S) represents the information entropy of a pollutant, i represents the pollutant type, c represents the number of pollutant types, p_i represents the marginal probability density function of the i-th pollutant, Gain(T, X) represents the mutual information between two pollutants, Entropy(T) represents the information entropy of one of the two pollutants, and Entropy(T, X) represents the information entropy of the other of the two pollutants;
the parameter calculation module is further configured to perform:
the Kolmogorov-Smirnov test, given by the following formula:
wherein D_n represents the statistic of the normal distribution test, sup represents the supremum (least upper bound) of the set of distances, x represents the data of the single pollutant type under test, F_n(x) represents the cumulative probability of the empirical distribution obtained from x, and F(x) represents the cumulative probability of the theoretical distribution to be obeyed;
the Anderson-Darling test, given by the following formula:
wherein Z represents the statistic of the normal distribution test, n represents the amount of data of the single pollutant under test, w(x) represents the weight function, and f(x) represents the theoretical distribution density function;
the anomaly score acquisition module is further configured to:
divide each dimension into intervals using a static-width histogram to obtain an anomaly score:
in actual calculation, this formula is equivalent to the following formula:
wherein HBOS(p) represents the anomaly score calculated by the Histogram-based Outlier Score method, d represents the amount of data of a single pollutant participating in the calculation, and hist_i(p) represents the frequency of the histogram bin after normalization;
calculate outliers by the Mahalanobis distance to obtain an anomaly score:
wherein the Mahalanobis distance measure is computed from x_i, the value of a sample point, and the population mean;
iteratively identify samples presumed to be outliers using a binary search tree structure, and calculate an anomaly score:
wherein ψ represents the number of data points drawn from the data set to which x belongs, c(ψ) represents the average height for ψ data points, s(x, ψ) represents the anomaly score, H(ψ-1) is the harmonic number calculated from (ψ-1), and h(x) represents the height of a data point x, i.e. the number of edges traversed from the root node of the tree to reach the leaf node;
the classification model includes a Self-Attention structure, an RNN structure, and a Luong Attention structure, and the classification model construction module is further configured to:
calculate the output of each block by applying a Self-Attention structure to the features of the correlation between pollutant sequences, the statistic parameters and the anomaly scores;
construct an RNN structure that orders the results of the plurality of Self-Attention blocks according to their logical relation, and calculate its output;
weight the inputs and outputs of the RNN structure using the Luong Attention structure, and calculate the probability that each pollutant in the corresponding sequence is anomalous via an MLP structure with two hidden layers.
5. An emission data falsification detection system, characterized in that the system comprises:
A memory for storing a computer program;
A processor for executing the computer program to implement the method of any one of claims 1 to 3.
6. A non-transitory computer readable storage medium storing instructions which, when executed by a processor, perform the method of any one of claims 1 to 3.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311236361.1A CN117235624B (en) 2023-09-22 2023-09-22 Emission data falsification detection method, device and system and storage medium

Publications (2)

Publication Number Publication Date
CN117235624A (en) 2023-12-15
CN117235624B (en) 2024-05-07

Family

ID=89090785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311236361.1A Active CN117235624B (en) 2023-09-22 2023-09-22 Emission data falsification detection method, device and system and storage medium

Country Status (1)

Country Link
CN (1) CN117235624B (en)

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Country or region after: China
Address after: 100096 no.258, 2nd floor, building 2, Xisanqi building materials City, Haidian District, Beijing
Applicant after: China Energy Conservation Digital Technology Co.,Ltd.
Address before: 100096 no.258, 2nd floor, building 2, Xisanqi building materials City, Haidian District, Beijing
Applicant before: CECEP TALROAD TECHNOLOGY CO.,LTD.
Country or region before: China
GR01 Patent grant