CN117235624B

CN117235624B - Emission data falsification detection method, device and system and storage medium

Info

Publication number: CN117235624B
Application number: CN202311236361.1A
Authority: CN
Inventors: 庞继伟; 孙艺嘉; 张栩; 郭炜; 杨珊珊; 丁杰
Original assignee: China Energy Conservation Digital Technology Co ltd
Current assignee: China Energy Conservation Digital Technology Co ltd
Priority date: 2023-09-22
Filing date: 2023-09-22
Publication date: 2024-05-07
Anticipated expiration: 2043-09-22
Also published as: CN117235624A

Abstract

The invention discloses a method, a device, a system and a storage medium for detecting emission data falsification. A variety of mathematical correlation tests, distribution tests and anomaly tests are carried out on the emission data of enterprises, and a statistics-based deep learning model is constructed and trained on the results of these tests. Using the model, environmental monitoring data can be checked on the basis of anomalous data distributions, and falsified data that do not conform to the expected distribution relationships can be identified automatically. The invention can effectively detect falsified emission data and reduces the likelihood that enterprises falsify on-line monitoring data.

Description

Emission data falsification detection method, device and system and storage medium

Technical Field

The present invention relates to the field of environmental monitoring and protection technologies, and in particular, to a method, an apparatus, a system, and a storage medium for detecting emission data falsification.

Background

Falsification of on-line monitoring data and its detection form an ongoing contest of attack and defense. The falsification methods currently used by enterprises generally either rely on basic physical and chemical means or tamper with the on-line data directly.

Falsification by physical and chemical means, such as soaking the filter element in alkali liquor, pulling out a tube or pumping in air, usually leaves fairly regular signatures, and the corresponding identification methods, such as video identification, are relatively well defined; they are not described further herein.

Data tampering, which is the falsification means the present invention mainly targets, has so far generally been identified either by manually summarizing salient data signatures or by establishing a BP neural network on the raw pollutant data. Once an enterprise applies some mathematical grounding and upgrades its falsification means, it can easily see through these methods of identifying tampered data.

Description of the prior art and deficiencies thereof:

1) Method based on manually summarizing salient data signatures

Over long-term operation, experienced engineers often notice unreasonable portions of enterprise data. Patterns that can be described mathematically can, once summarized by experienced engineers, be written explicitly into the system for automatic recognition.

The main problems of this method are that it depends heavily on the experience of engineers, that some patterns are inevitably unstable or hard to describe, and that it consumes considerable manpower and material resources.

2) Method for establishing a BP neural network based on raw pollutant data

Specifically, a BP network is built directly on the raw pollutant emission data as a regression model.

The main problem of this method is that the real-time sub-scores of the pollution source outlet are used as input both to predict the pollution source score and to calculate the actual pollution source score under the index, so the actual score and the score predicted by the neural network may follow the same mathematical relationship; when both share a falsified input, both may produce similarly false outputs. The method uses a BP neural network to build a regression model that fits the relationship from the sub-scores to the pollution source score. Where a definite mapping already exists, the neural network is likely to do no more than fit the conventional calculation that derives the actual pollution source score from the sub-score evaluation indices. If the batch of data has already been falsified, that is, the input itself is falsified, the neural network is no better than the conventional standard calculation at telling falsified data apart.

This is similar to solving a quadratic equation in one variable with a neural network: however complex the structure, a well-trained parameter matrix ends up fitting only a mathematical relationship that can be expressed as a compact elementary function. If the model result differs from the result of the conventional calculation, the model has most likely learned inadequately rather than actually identified falsification. Neural networks are more effective when the model describes a more complex relationship, one that conventional mathematical means have shown to be difficult to express as a concise mathematical process.

Disclosure of Invention

The present invention has been made to solve the above-mentioned problems occurring in the prior art. Accordingly, there is a need for an emission data falsification detection method, apparatus and system, and storage medium, to solve at least the following problems:

1. The method based on manually summarizing salient data signatures relies heavily on the experience of engineers, so some parts are inevitably unstable and hard to describe. It therefore consumes more labor and material resources, costs more, and is harder to turn into a standardized, transferable and reproducible method.

2. Although the method of establishing a BP neural network on raw pollutant data avoids the pitfalls of the former method, it has a pitfall of its own in the mathematical relationship that the model is used to describe.

The invention is mainly developed on the methods of statistical analysis and neural networks.

According to a first aspect of the present invention, there is provided an emission data falsification detection method, the method comprising:

Acquiring on-line monitoring data, wherein the on-line monitoring data comprise production state data and emission data of various pollutants, each time point is marked as a vector Z, Z= { Z1, Z2 … Zn }, wherein Zn represents emission data of an nth pollutant, n is the number of types of the pollutants, a matrix A of the plurality of time points is marked as one sample, and a set of all samples is used as a data set;

Calculating mutual information among pollutants, and extracting features serving as relativity among pollutant sequences;

Performing independent normal distribution verification on each dimension to obtain statistic parameters, wherein the dimensions correspond to sequences of a plurality of time points of each pollutant;

Obtaining an anomaly score by utilizing anomaly detection algorithms under different distribution hypothesis conditions;

Splicing the characteristics, statistic parameters and abnormal scores of the relativity among pollutant sequences by taking the dimension of the pollutants as the standard, and constructing a classification model by taking the characteristics, statistic parameters and abnormal scores as the input of a neural network;

training the classification model based on the data set to obtain a fake detection model;

And calculating the probability of emission data counterfeiting based on the counterfeiting detection model.

Further, the mutual information between the contaminants is calculated by the following formulas:

$$E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$

$$\mathrm{Gain}(T, X) = \mathrm{Entropy}(T) - \mathrm{Entropy}(T, X)$$

Wherein E(S) represents the information entropy of the contaminant, i represents the type of contaminant, c represents the number of types of contaminant, p_i represents the marginal probability density function of the i-th contaminant, Gain(T, X) represents the mutual information between the two contaminants, Entropy(T) represents the information entropy of one of the two contaminants, and Entropy(T, X) represents the information entropy of that contaminant given the other of the two contaminants.
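As a minimal sketch of the entropy-based calculation above, the following Python computes Gain(T, X) for two pollutant series. It assumes that Entropy(T, X) denotes the entropy of T conditioned on X, as in the usual information-gain definition, and that continuous readings are first discretized into equal-width bins; the bin count, the function names and the sample readings are illustrative assumptions, not taken from the source.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def discretize(values, bins=4):
    """Map continuous readings to equal-width bin indices (assumed choice)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / width), bins - 1) for v in values]


def information_gain(t, x, bins=4):
    """Gain(T, X) = Entropy(T) - Entropy(T given X) on discretized series."""
    t_bins, x_bins = discretize(t, bins), discretize(x, bins)
    n = len(t_bins)
    conditional = 0.0
    for x_value, count in Counter(x_bins).items():
        subset = [tb for tb, xb in zip(t_bins, x_bins) if xb == x_value]
        conditional += (count / n) * entropy(subset)
    return entropy(t_bins) - conditional


# Hypothetical readings for illustration only.
so2 = [10.0, 12.0, 30.0, 32.0, 50.0, 52.0, 70.0, 72.0]
nox = [v * 2.0 for v in so2]                      # fully determined by so2
flat = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0]  # unrelated to so2
gain_dependent = information_gain(so2, nox)
gain_unrelated = information_gain(so2, flat, bins=2)
```

A series that is fully determined by another yields a gain equal to its own entropy, while an unrelated series yields a gain near zero, which is why the value can serve as a correlation feature.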

Further, the performing an independent normal distribution check on each dimension to obtain a statistic parameter specifically includes:

the Kolmogorov-Smirnov test, given by:

$$D_n = \sup_x \left| F_n(x) - F(x) \right|$$

Wherein D_n represents the statistic of the normal distribution test, sup represents the supremum of the set of distances, x represents the data of the single contaminant under test, F_n(x) represents the cumulative probability of the actual distribution obtained from x, and F(x) represents the cumulative probability of the theoretical distribution to be obeyed;

the Anderson-Darling test, given by:

$$Z = n \int_{-\infty}^{\infty} \left[ F_n(x) - F(x) \right]^2 w(x) f(x)\,dx$$

Wherein Z represents the statistic of the normal distribution test, n represents the amount of data of the single contaminant under test, w(x) represents the weight function, and f(x) represents the theoretical distribution density function.
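The two normality statistics can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: the theoretical distribution is a normal fitted to the sample itself, and the Anderson-Darling statistic is evaluated with the standard order-statistic form that corresponds to the weight w(x) = 1 / (F(x)(1 - F(x))). The function names and the test series are hypothetical.

```python
import math
from statistics import NormalDist


def normal_cdf(x, mu, sigma):
    """Cumulative probability F(x) of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def fit_normal(values):
    """Sample mean and standard deviation (n - 1 in the denominator)."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / (n - 1))
    return mu, sigma


def ks_statistic(values):
    """D_n: largest gap between the empirical CDF and the fitted normal CDF."""
    mu, sigma = fit_normal(values)
    ordered = sorted(values)
    n = len(ordered)
    d_n = 0.0
    for i, v in enumerate(ordered):
        f = normal_cdf(v, mu, sigma)
        # The empirical CDF jumps from i/n to (i+1)/n at each order statistic.
        d_n = max(d_n, abs((i + 1) / n - f), abs(i / n - f))
    return d_n


def anderson_darling_statistic(values):
    """Anderson-Darling statistic via its order-statistic form."""
    mu, sigma = fit_normal(values)
    ordered = sorted(values)
    n = len(ordered)
    # Clip the CDF away from 0 and 1 so the logarithms stay finite.
    cdf = [min(max(normal_cdf(v, mu, sigma), 1e-12), 1.0 - 1e-12) for v in ordered]
    total = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
        for i in range(n)
    )
    return -n - total / n


# Hypothetical series: one bell-shaped, one strongly right-skewed.
bell = [NormalDist().inv_cdf((i + 0.5) / 40) for i in range(40)]
skewed = [math.exp(2.0 * z) for z in bell]
ks_bell, ks_skewed = ks_statistic(bell), ks_statistic(skewed)
ad_bell, ad_skewed = anderson_darling_statistic(bell), anderson_darling_statistic(skewed)
```

Both statistics grow as the series departs from normality, so they can be passed on as per-pollutant statistic parameters rather than reduced to a pass/fail decision.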

Further, the method for obtaining the anomaly score by using the anomaly detection algorithm under different distribution hypothesis conditions specifically comprises the following steps:

Dividing each dimension into intervals with a static-width histogram to obtain an anomaly score:

$$\mathrm{HBOS}(p) = \sum_{i=0}^{d} \log\left(\frac{1}{\mathrm{hist}_i(p)}\right)$$

In actual calculation, this formula is equivalent to the following formula:

$$\mathrm{HBOS}(p) = -\sum_{i=0}^{d} \log\left(\mathrm{hist}_i(p)\right)$$

Wherein HBOS(p) represents the anomaly score calculated by the Histogram-based Outlier Score method, d represents the amount of data of the single pollutant participating in the calculation, and hist_i(p) represents the frequency (relative amount) of the histogram bin after normalization;

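Calculating outliers through the Mahalanobis distance to obtain an anomaly score:

$$D_M(x_i) = \sqrt{(x_i - \mu)^{T} \Sigma^{-1} (x_i - \mu)}$$

Wherein D_M(x_i) represents the Mahalanobis distance measure, x_i represents the value of a sample point, μ represents the average of the population, and Σ represents the covariance matrix of the population;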

Using a binary search tree structure to iteratively isolate the samples presumed to be outliers, and computing anomaly scores:

$$s(x, \psi) = 2^{-\frac{E(h(x))}{c(\psi)}}$$

where

$$c(\psi) = 2H(\psi - 1) - \frac{2(\psi - 1)}{\psi}$$

Wherein ψ represents the number of data points drawn from the dataset to which x belongs, c(ψ) represents the average height for ψ data points, s(x, ψ) represents the anomaly score, H(ψ - 1) is the harmonic number calculated from (ψ - 1), E(h(x)) represents the average of h(x) over the trees, and h(x) represents the height of a data point x, i.e. the number of edges traversed from the root node of the tree to reach the leaf node.
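The tree-based score can be illustrated with a small one-dimensional sketch. It follows the common isolation-forest formulation, in which the harmonic number is approximated as ln(i) plus Euler's constant and only the branch that x follows is grown in each random tree. The depth limit, the tree count, the function names and the sample readings are assumptions made for the example.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def harmonic(i):
    """H(i), approximated as ln(i) + Euler's constant."""
    return math.log(i) + EULER_GAMMA


def c(psi):
    """Average height c(psi) used to normalize the path length."""
    if psi <= 1:
        return 0.0
    if psi == 2:
        return 1.0
    return 2.0 * harmonic(psi - 1) - 2.0 * (psi - 1) / psi


def path_length(x, data, rng, depth=0, limit=10):
    """h(x): edges traversed until x is isolated by random splits."""
    if len(data) <= 1 or depth >= limit:
        return depth + c(len(data))
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth + c(len(data))
    split = rng.uniform(lo, hi)
    # Keep only the points that fall on the same side of the split as x.
    side = [v for v in data if (v < split) == (x < split)]
    return path_length(x, side, rng, depth + 1, limit)


def anomaly_score(x, data, trees=100, psi=256, seed=0):
    """s(x, psi) = 2 ** (-E[h(x)] / c(psi)), averaged over random trees."""
    rng = random.Random(seed)
    psi = min(psi, len(data))
    total = 0.0
    for _ in range(trees):
        total += path_length(x, rng.sample(data, psi), rng)
    return 2.0 ** (-(total / trees) / c(psi))


# Hypothetical readings: 31 values near 10 to 13 plus one extreme value.
readings = [10.0 + 0.1 * i for i in range(31)] + [50.0]
outlier_score = anomaly_score(50.0, readings)
inlier_score = anomaly_score(11.5, readings)
```

Points that are isolated after very few splits receive scores close to 1, while points buried inside the bulk of the data need more splits and receive lower scores.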

Further, where a method assumes a normal distribution as its precondition and the data fail the normal distribution test, a normal distribution transformation is performed first, and the anomaly score is calculated afterwards.
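The text does not name a specific normal distribution transformation. Purely as an assumed stand-in, the sketch below uses a rank-based inverse normal transform; a Box-Cox or Yeo-Johnson transform would be other candidates. The function name and the sample series are hypothetical.

```python
from statistics import NormalDist


def rank_inverse_normal(values):
    """Map values to standard-normal quantiles of their (average) ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        # Extend the run over tied values so they share one average rank.
        while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - 0.5) / n) for r in ranks]


# Hypothetical right-skewed series that would fail a normality test.
skewed = [1.0, 1.2, 1.5, 2.0, 3.0, 5.0, 9.0, 40.0, 200.0]
transformed = rank_inverse_normal(skewed)
```

The transform preserves the ordering of the readings while giving them an approximately normal shape, so distance-based scores that assume normality can then be applied.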

Further, the classification model includes a Self-Attention structure, an RNN structure, and a Luong Attention structure, and the constructing the classification model specifically includes:

Calculating the output of each block by applying a Self-Attention structure to the features of the correlation among the pollutant sequences, the statistic parameters and the anomaly scores;

Constructing an RNN structure that orders the results of the plurality of Self-Attention blocks according to their logical relation, and calculating its output;

Calculating the weighted result of the input and output of the RNN structure in the Luong Attention structure, and calculating the probability of each contaminant being abnormal in the corresponding sequence via an MLP structure with two hidden layers.

Further, training the classification model based on the data set to obtain a fake detection model, which specifically includes:

Extracting a preset proportion of positive samples from the data set, modifying the values in those samples that exceed the pollutant standard by reducing them to values at a large Euclidean distance from the pollutant standard, labeling the modified samples as negative samples, and taking the generated negative samples together with the original real sample set as the total data set;

Dividing the total data set into a training set, a test set and a verification set;

Based on the training set and the test set, starting neural network training with different random seeds, and saving, for each random seed, the model parameters of the model with the best training effect; the training effect is determined by the accuracy on the test set, which is obtained by predicting a contaminant as falsified when its calculated falsification probability is greater than or equal to 0.5 and comparing the prediction with the true label;

Configuring classification models with the respective saved model parameters and comparing them on the verification set, and taking the model with the best effect on the verification set as the falsification detection model.
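The sample-generation and splitting steps might be sketched as follows. The margin by which an over-limit reading is pulled below its limit, the sampling ratio, the split percentages, the 0/1 label encoding and all names are illustrative assumptions; the source only states that over-standard values are reduced to values far from the standard.

```python
import random


def falsify(sample, limits, margin=0.3):
    """Copy a sample, pulling every over-limit reading well below its limit."""
    return [
        [limit * (1.0 - margin) if value > limit else value
         for value, limit in zip(row, limits)]
        for row in sample
    ]


def build_total_dataset(real_samples, limits, ratio=0.5, seed=0):
    """Pair each sample with a flag: 0 for real, 1 for generated forgery."""
    rng = random.Random(seed)
    chosen = rng.sample(real_samples, int(len(real_samples) * ratio))
    total = [(s, 0) for s in real_samples]
    total += [(falsify(s, limits), 1) for s in chosen]
    rng.shuffle(total)
    return total


def split(dataset, train_pct=70, test_pct=15):
    """Split into training, test and verification subsets by percentage."""
    a = len(dataset) * train_pct // 100
    b = a + len(dataset) * test_pct // 100
    return dataset[:a], dataset[a:b], dataset[b:]


# Hypothetical data: 20 samples, 2 time points, 2 pollutants with limits.
limits = [10.0, 100.0]
real_samples = [[[5.0 + k, 120.0 - k], [4.0, 80.0]] for k in range(20)]
forged_example = falsify(real_samples[0], limits)
total = build_total_dataset(real_samples, limits)
train_set, test_set, validation_set = split(total)
```

Generating forgeries from real over-limit samples gives the classifier labeled examples of what a downward adjustment looks like, without requiring confirmed falsification cases.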

According to a second aspect of the present invention, there is provided an emission data falsification detection device including:

A data acquisition module configured to acquire on-line monitoring data including production status data and emission data of a plurality of pollutants, each time point being denoted as a vector Z, z= { Z1, Z2 … Zn }, where Zn represents emission data of an nth pollutant, n is a kind number of pollutants, a matrix a of a plurality of time points being denoted as one sample, and a set of all samples being used as a data set;

The characteristic calculation module is configured to calculate mutual information among pollutants and extract characteristics serving as relativity among pollutant sequences;

the parameter calculation module is configured to perform independent normal distribution verification on each dimension to obtain statistic parameters, wherein the dimensions correspond to a sequence of a plurality of time points of each pollutant;

An anomaly score acquisition module configured to acquire anomaly scores using anomaly detection algorithms under different distribution hypothesis conditions;

the classification model construction module is configured to splice the characteristics, statistic parameters and abnormal scores of the relativity among pollutant sequences based on the dimension of the pollutants, and is used as the input of the neural network to construct a classification model;

The model training module is configured to train the classification model based on the data set to obtain a fake detection model;

And the fake-making detection module is configured to calculate the probability of fake emission data based on the fake-making detection model.

According to a third aspect of the present invention, there is provided an emission data falsification detection system, the system including: a memory for storing a computer program; a processor for executing the computer program to implement the method as described above.

According to a fourth aspect of the invention, there is provided a non-transitory computer readable storage medium storing instructions which, when executed by a processor, perform the method as described above.

The emission data falsification detection method, the device, the system and the storage medium according to the various schemes have at least the following technical effects:

The invention has several advantages because secondary modeling is performed on the results of a variety of anomaly and distribution tests. First, the invention takes the preconditions of various statistical assumptions into account and adapts to different data distributions in actual use. Second, the method describes a complex mathematical relationship with a deep learning model, so enterprises can hardly find a data generation method that defeats this relationship. In addition, in the inference stage the method can be deployed on a server and run automatically against a connected database, requiring only a small amount of manpower and material resources.

Drawings

In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. The same reference numerals with letter suffixes or different letter suffixes may represent different instances of similar components. The accompanying drawings illustrate various embodiments by way of example in general and not by way of limitation, and together with the description and claims serve to explain the inventive embodiments. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. Such embodiments are illustrative and not intended to be exhaustive or exclusive of the present apparatus or method.

FIG. 1 illustrates a flow chart of an emission data falsification detection method in accordance with an embodiment of the present invention.

Fig. 2 shows a schematic structural diagram of a classification model according to an embodiment of the invention.

FIG. 3 illustrates a classification model building flow chart of an emissions data falsification detection method according to an embodiment of the invention.

FIG. 4 illustrates a model training flow chart of an emission data falsification detection method in accordance with an embodiment of the present invention.

Fig. 5 shows a structural diagram of an emission data falsification detection device according to an embodiment of the present invention.

Detailed Description

The present invention will be described in detail below with reference to the drawings and detailed description to enable those skilled in the art to better understand the technical scheme of the present invention. Embodiments of the present invention will be described in further detail below with reference to the drawings and specific examples, but not by way of limitation. Where the steps described herein do not depend on one another, the order in which they are described by way of example should not be construed as limiting, and those skilled in the art will understand that the order may be adjusted, provided that doing so does not break the logic between the steps and make the overall process unrealizable.

The embodiment of the invention provides a method for detecting emission data falsification, in particular a method based on multiple anomaly distributions of on-line monitoring data and deep learning.

A variety of mathematical correlation tests, distribution tests and anomaly tests are carried out on the emission data of enterprises, and a statistics-based deep learning model is constructed and trained on the results of these tests.

The scheme is suitable for examining the emission data in enterprise on-line monitoring data that are marked as being in the normal production state, and for judging which pollutants in the emission data may have been falsified.

Using the model, environmental monitoring data can be checked on the basis of anomalous data distributions, and falsified data that do not conform to the expected distribution relationships can be identified automatically.

For environmental protection departments supervising enterprises, the automatic identification method can save considerable manpower and material resources and create economic benefits. Secondary modeling on the anomaly scores obtained under multiple distribution conditions makes it hard for enterprises to work out how detection is performed and to falsify data in a targeted way, and this deterrent effect has corresponding social value.

Specifically, referring to fig. 1, the method includes the following steps:

Step S100, acquiring on-line monitoring data, wherein the on-line monitoring data comprises production state data and emission data of various pollutants, each time point is marked as a vector Z, Z = {Z1, Z2 … Zn}, Zn represents emission data of an nth pollutant, n is the number of types of pollutant, a matrix A of a plurality of time points is marked as one sample, and a set of all samples is used as a data set.

Illustratively, assuming 6 contaminant species, each time point is noted as a vector Z (Z = {Z1, Z2, Z3, Z4, Z5, Z6}). The matrix A of 180 time points is noted as one sample, and the set of all samples is taken as the data set.
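A minimal sketch of how such samples could be assembled is given below. It assumes non-overlapping windows and discards an incomplete trailing window; the source does not say whether windows overlap, and the function name and the synthetic readings are hypothetical.

```python
def build_samples(records, window=180):
    """Cut a time-ordered list of vectors Z into non-overlapping matrices A."""
    usable = len(records) - len(records) % window
    return [records[start:start + window] for start in range(0, usable, window)]


# Hypothetical readings: 400 time points, 6 pollutants each.
records = [[float(t + k) for k in range(6)] for t in range(400)]
dataset = build_samples(records)
```

Each element of `dataset` is then one 180 x 6 matrix A, and the column for pollutant k is the sequence on which the per-dimension tests operate.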

Step S200, calculating mutual information between pollutants, and extracting features as correlations between pollutant sequences.

Here, the mutual information between contaminants refers to the mutual information between pairs of contaminant types within a contaminant sequence (the matrix A of multiple time points). If there are 6 contaminant types, denoted Z1, Z2, Z3, Z4, Z5 and Z6, the mutual information between contaminants may be that between Z1 and Z2, Z2 and Z3, Z4 and Z5, Z5 and Z6, and so on. Because the mutual information between contaminants characterizes the correlation or independence of the data, it is extracted as the feature of correlation between contaminant sequences.

In some embodiments, the mutual information between contaminants is calculated by the following formula:

Gain(T，X)＝Entropy(T)-Entropy(T,X)

And step S300, performing independent normal distribution verification on each dimension to obtain statistic parameters, wherein the dimension corresponds to a sequence of a plurality of time points of each pollutant.

For example, assuming 6 contaminants, a matrix A contains 6 contaminant sequences, so a sample has 6 dimensions.

In some embodiments, the performing a separate normal distribution check on each dimension to obtain a statistic parameter specifically includes:

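the Kolmogorov-Smirnov test, given by:

$$D_n = \sup_x \left| F_n(x) - F(x) \right|$$

the Anderson-Darling test, given by:

$$Z = n \int_{-\infty}^{\infty} \left[ F_n(x) - F(x) \right]^2 w(x) f(x)\,dx$$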

Step S400, obtaining an anomaly score by utilizing an anomaly detection algorithm under different distribution hypothesis conditions.

In some embodiments, where a method assumes a normal distribution as its precondition and the data fail the normal distribution test, a normal distribution transformation is performed before the anomaly score is calculated.

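In some embodiments, the obtaining the anomaly score by using the anomaly detection algorithm under different distribution hypothesis conditions specifically includes:

dividing each dimension into intervals with a static-width histogram to obtain an anomaly score:

$$\mathrm{HBOS}(p) = \sum_{i=0}^{d} \log\left(\frac{1}{\mathrm{hist}_i(p)}\right)$$

calculating outliers through the Mahalanobis distance to obtain an anomaly score:

$$D_M(x_i) = \sqrt{(x_i - \mu)^{T} \Sigma^{-1} (x_i - \mu)}$$

using a binary search tree structure to iteratively isolate the samples presumed to be outliers, and computing anomaly scores:

$$s(x, \psi) = 2^{-\frac{E(h(x))}{c(\psi)}}$$

where

$$c(\psi) = 2H(\psi - 1) - \frac{2(\psi - 1)}{\psi}$$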
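For the histogram-based score used in step S400, a small sketch is shown below. It follows the standard HBOS formulation, in which the sum runs over the dimensions of a sample point p and each histogram is normalized so that its tallest bin has height 1. The bin count, the floor that guards against empty bins, the function names and the readings are assumptions for the example.

```python
import math


def histogram_density(values, bins=5):
    """Static (equal-width) histogram with heights scaled so the peak is 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    peak = max(counts)
    return lo, width, [count / peak for count in counts]


def hbos_scores(samples, bins=5, floor=1e-3):
    """HBOS(p) = sum over dimensions of log(1 / hist_i(p))."""
    models = [histogram_density(list(dim), bins) for dim in zip(*samples)]
    scores = []
    for p in samples:
        score = 0.0
        for value, (lo, width, heights) in zip(p, models):
            idx = min(int((value - lo) / width), bins - 1)
            # The floor keeps the logarithm finite if a bin height is zero.
            score += math.log(1.0 / max(heights[idx], floor))
        scores.append(score)
    return scores


# Hypothetical two-pollutant readings; the last point sits far from the rest.
samples = [
    (10.0, 5.0), (10.5, 5.1), (10.2, 4.9),
    (9.8, 5.0), (10.1, 5.2), (10.3, 4.8),
    (30.0, 15.0),
]
scores = hbos_scores(samples)
```

Points that fall into sparsely populated bins in several dimensions accumulate a large score, whereas points in the densest bins score close to zero.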

And S500, splicing the characteristics, the statistic parameters and the abnormal scores of the relativity among the pollutant sequences based on the dimension of the pollutants, and constructing a classification model by taking the characteristics, the statistic parameters and the abnormal scores as the input of the neural network.

In some embodiments, the structure of the classification model is shown in fig. 2, the classification model includes a Self-Attention structure, an RNN structure, and a Luong Attention structure, as shown in fig. 3, the constructing the classification model by using the feature, the statistic parameter, and the anomaly score of the correlation between the pollutant sequences based on the dimension of the pollutant as the input of the neural network specifically includes:

step S501, calculating the output of each block by using a Self-Attention structure for the characteristics, statistic parameters and anomaly scores of the relativity among the pollutant sequences;

Step S502, constructing RNN structure calculation output containing sequence according to logic relation for the results of a plurality of Self-Attention;

In step S503, the weighted results of the input and output of the RNN structure are calculated in Luong Attention structures, and the probability of each contaminant anomaly in the corresponding sequence is calculated via the MLP structure of the two hidden layers.
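The attention operations in steps S501 and S503 can be illustrated in plain Python. This is a simplified sketch, not the patented network: it assumes identity query, key and value projections for the self-attention, uses the dot-product variant of Luong attention, and omits the RNN and MLP layers entirely. The feature vectors and function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Combine vectors component-wise using the given weights."""
    width = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(width)]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections."""
    scale = math.sqrt(len(tokens[0]))
    return [
        weighted_sum(softmax([dot(q, k) / scale for k in tokens]), tokens)
        for q in tokens
    ]


def luong_dot_attention(query, states):
    """Luong-style dot score: weight the states by softmax(query . state)."""
    weights = softmax([dot(query, s) for s in states])
    return weights, weighted_sum(weights, states)


# Hypothetical per-pollutant feature vectors
# (correlation feature, test statistic, anomaly score).
features = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1], [0.1, 0.9, 0.7]]
attended = self_attention(features)
weights, context = luong_dot_attention(attended[-1], attended)
```

In a full model the `context` vector would be fed, together with the recurrent output, into the two-hidden-layer MLP that produces the per-pollutant probabilities.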

And step S600, training the classification model based on the data set to obtain a fake detection model.

In some embodiments, as shown in fig. 4, the training the classification model based on the data set to obtain a fake detection model specifically includes:

step S601, extracting positive samples with preset proportion from the data set, changing the value exceeding the pollutant standard in the positive samples, reducing the value to the Euclidean distance far away from the pollutant standard, marking the value as a negative sample, and taking the generated negative sample and the original real sample set as a total data set;

Step S602, dividing the total data set into a training set, a testing set and a verification set;

Step S603, based on the training set and the testing set, starting a neural network for training by using different random seeds, and taking down model parameters corresponding to a model with the best training effect from each random seed; the training effect is determined according to the accuracy of a test set, wherein the accuracy of the test set is the accuracy obtained by predicting fake making according to the fact that the fake making probability calculated by each pollutant is more than or equal to 0.5 and comparing the real label;

Step S604, respectively configuring classification models by using the model parameters, and comparing the verification sets to obtain a model with the best verification set effect as a fake detection model.
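The seed-wise selection in steps S603 and S604 can be sketched independently of any particular network. In the sketch below, a model is any callable that maps a feature list to per-pollutant falsification probabilities, and `toy_train` is a hypothetical stand-in for real training that merely returns a few candidate checkpoints per seed.

```python
import random


def predict_forged(probabilities, threshold=0.5):
    """Flag a pollutant as falsified when its probability is >= threshold."""
    return [p >= threshold for p in probabilities]


def accuracy(model, labelled):
    """Share of per-pollutant predictions that match the true labels."""
    hits = total = 0
    for features, truth in labelled:
        predicted = predict_forged(model(features))
        hits += sum(p == t for p, t in zip(predicted, truth))
        total += len(truth)
    return hits / total


def select_model(seeds, train_fn, test_set, validation_set):
    """Best checkpoint per seed by test accuracy, then best by validation."""
    finalists = []
    for seed in seeds:
        checkpoints = train_fn(seed)
        finalists.append(max(checkpoints, key=lambda m: accuracy(m, test_set)))
    return max(finalists, key=lambda m: accuracy(m, validation_set))


def toy_train(seed):
    """Hypothetical stand-in for training: three scaled-identity checkpoints."""
    rng = random.Random(seed)
    scales = [rng.uniform(0.2, 2.0) for _ in range(3)]
    return [lambda feats, s=s: [min(1.0, f * s) for f in feats] for s in scales]


test_set = [([0.9, 0.1], [True, False]), ([0.2, 0.8], [False, True])]
validation_set = [([0.7, 0.3], [True, False]), ([0.1, 0.6], [False, True])]
best = select_model([0, 1, 2], toy_train, test_set, validation_set)
best_accuracy = accuracy(best, validation_set)
```

Choosing the per-seed winner on the test set and the final winner on a separate verification set keeps the final comparison from reusing the data that already guided checkpoint selection.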

Finally, in step S700, the probability of emission data falsification is calculated based on the falsification detection model.

Specifically, for a model that has been trained, the model structure and parameters are directly loaded in the inference phase. When each batch of data is operated, the model is directly loaded for reasoning after the steps S100-S400 are carried out, and the probability of emission data counterfeiting is calculated.

The embodiment of the invention also provides a device for detecting the falsification of the emission data, as shown in fig. 5, the device 500 comprises:

A data acquisition module 501 configured to acquire on-line monitoring data including production status data and emission data of a plurality of pollutants, each time point being denoted as a vector Z, z= { Z1, Z2 … Zn }, where Zn represents emission data of an nth pollutant, n is a kind number of pollutants, a matrix a of a plurality of time points being denoted as one sample, and a set of all samples being a data set;

A feature computation module 502 configured to compute mutual information between contaminants, extracting features that are correlations between contaminant sequences;

A parameter calculation module 503 configured to perform an individual normal distribution check for each dimension, resulting in a statistic parameter, the dimension corresponding to a sequence of multiple time points for each contaminant;

an anomaly score acquisition module 504 configured to acquire anomaly scores using anomaly detection algorithms under different distribution hypothesis conditions;

The classification model construction module 505 is configured to splice the characteristics, the statistic parameters and the anomaly scores of the relativity between the pollutant sequences based on the dimension of the pollutants, and construct a classification model by taking the characteristics, the statistic parameters and the anomaly scores as the input of the neural network;

The model training module 506 is configured to train the classification model based on the data set to obtain a fake detection model;

A fraud detection module 507 configured to calculate a probability of emission data fraud based on the fraud detection model.

In some embodiments, the feature calculation module is further configured to calculate the mutual information between the contaminants by the following formula:

Gain(T，X)＝Entropy(T)-Entropy(T,X)

In some embodiments, the parameter calculation module is further configured to perform the following tests:

the Kolmogorov-Smirnov test is given by:

Anderson-Darling tested, the formula is as follows:

In some embodiments, the anomaly score acquisition module is further configured to:

Wherein the method comprises the steps of

where a method assumes a normal distribution as its precondition and the data fail the normal distribution test, a normal distribution transformation is performed before the anomaly score is calculated.
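One of the scores that the anomaly score acquisition module relies on is the Mahalanobis distance. A self-contained sketch is given below; it estimates the mean and covariance from the sample itself and solves the linear system by Gauss-Jordan elimination instead of forming an explicit inverse. The helper names and the two-pollutant readings are illustrative assumptions.

```python
import math


def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows, mean):
    """Sample covariance matrix (n - 1 in the denominator)."""
    n, d = len(rows), len(mean)
    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        diff = [v - m for v, m in zip(row, mean)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += diff[a] * diff[b] / (n - 1)
    return cov


def solve(matrix, rhs):
    """Solve matrix @ y = rhs by Gauss-Jordan elimination with pivoting."""
    d = len(rhs)
    aug = [list(matrix[r]) + [rhs[r]] for r in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(d):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[r][d] / aug[r][r] for r in range(d)]


def mahalanobis(point, mean, cov):
    """sqrt((x - mu)^T Sigma^{-1} (x - mu)) for one sample point."""
    diff = [v - m for v, m in zip(point, mean)]
    weighted = solve(cov, diff)  # equals Sigma^{-1} @ diff
    return math.sqrt(sum(a * b for a, b in zip(diff, weighted)))


# Hypothetical readings of two strongly correlated pollutants.
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8), (6.0, 12.1)]
mu = mean_vector(rows)
sigma = covariance_matrix(rows, mu)
typical = mahalanobis((3.5, 7.0), mu, sigma)
off_pattern = mahalanobis((3.5, 2.0), mu, sigma)
```

Because the distance accounts for the covariance between pollutants, a reading that looks ordinary in each dimension but breaks the usual relationship between them still receives a large score.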

In some embodiments, the classification model includes a Self-Attention structure, an RNN structure, and a Luong Attention structure, the classification model construction module is further configured to:

In some embodiments, the model training module is further configured to:

It should be noted that, the emission data falsification detection device and the method described above belong to the same technical idea, which can have the same beneficial effects, and are not repeated here.

The embodiment of the invention provides an emission data falsification detection system, which comprises:

A memory for storing a computer program;

a processor for executing the computer program to implement the method as described in any of the embodiments above.

Embodiments of the present invention provide a non-transitory computer-readable storage medium storing instructions which, when executed by a processor, perform a method as described in any of the embodiments above.

Furthermore, although exemplary embodiments have been described herein, the scope thereof includes any and all embodiments having equivalent elements, modifications, omissions, combinations (e.g., of the various embodiments across), adaptations or alterations as pertains to the present application. The elements in the claims are to be construed broadly based on the language employed in the claims and are not limited to examples described in the present specification or during the practice of the application, which examples are to be construed as non-exclusive. It is intended, therefore, that the specification and examples be considered as exemplary only, with a true scope and spirit being indicated by the following claims and their full scope of equivalents.

The above description is intended to be illustrative and not restrictive. For example, the above-described examples (or one or more aspects thereof) may be used in combination with each other. For example, other embodiments may be used by those of ordinary skill in the art upon reading the above description. In addition, in the above detailed description, various features may be grouped together to streamline the invention. This is not to be interpreted as an intention that the features of the claimed invention are essential to any of the claims. Rather, inventive subject matter may lie in less than all features of a particular inventive embodiment. Thus, the following claims are hereby incorporated into the detailed description as examples or embodiments, with each claim standing on its own as a separate embodiment, and it is contemplated that these embodiments may be combined with one another in various combinations or permutations. The scope of the invention should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims

1. A method for detecting false emissions data, the method comprising:

calculating to obtain the probability of emission data counterfeiting based on the counterfeiting detection model;

the mutual information between the contaminants is calculated by the following formulas:

$$E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$

$$\mathrm{Gain}(T, X) = \mathrm{Entropy}(T) - \mathrm{Entropy}(T, X)$$

Wherein E(S) represents the information entropy of the contaminant, i represents the type of contaminant, c represents the number of types of contaminant, p_i represents the marginal probability density function of the i-th contaminant, Gain(T, X) represents the mutual information between the two contaminants, Entropy(T) represents the information entropy of one of the two contaminants, and Entropy(T, X) represents the information entropy of that contaminant given the other of the two contaminants;

performing an independent normality test on each dimension to obtain statistical parameters, specifically comprising:

the Kolmogorov-Smirnov test is given by:

the Anderson-Darling test is given by:

wherein Z represents the statistic of the normal distribution test, n represents the number of data points of a single pollutant participating in the test, w(x) represents the weight function, and f(x) represents the theoretical distribution density function;
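As a hedged illustration of the two normality tests named above, the sketch below computes both statistics for a single pollutant series. It assumes the textbook forms: the Kolmogorov-Smirnov statistic is the largest gap between the empirical distribution function and the theoretical one, and the Anderson-Darling statistic is n times the integral of the squared gap weighted by w(x) f(x), with w(x) = 1 / (F(x)(1 - F(x))), evaluated through its usual computational sum. Fitting the normal distribution with the sample mean and standard deviation, the clamping constant and the function name are assumptions of this sketch.

```python
import math
from statistics import NormalDist, mean, stdev

def normality_statistics(sample):
    """Return (ks, ad) statistics against a normal fitted to the sample."""
    n = len(sample)
    xs = sorted(sample)
    fitted = NormalDist(mean(xs), stdev(xs))
    eps = 1e-12  # keep CDF values away from 0 and 1 so log() stays finite
    cdf = [min(max(fitted.cdf(x), eps), 1.0 - eps) for x in xs]
    # KS: largest gap between the empirical and the theoretical CDF.
    ks = max(max((i + 1) / n - f, f - i / n) for i, f in enumerate(cdf))
    # AD: computational form of n * integral (Fn - F)^2 * w(x) * f(x) dx
    # with w(x) = 1 / (F(x) * (1 - F(x))).
    ad = -n - sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
        for i in range(n)
    ) / n
    return ks, ad
```

Larger values of either statistic indicate a poorer fit to the normal distribution. Critical values for accepting or rejecting normality are not included here.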

obtaining anomaly scores using anomaly detection algorithms under different distribution assumptions, specifically comprising:

wherein HBOS(p) represents the anomaly score calculated by the Histogram-based Outlier Score method, d represents the number of data points of a single pollutant participating in the calculation, and hist_i(p) represents the histogram frequency after bin normalization;
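The histogram-based score described above can be sketched as follows. The sketch assumes the commonly used form HBOS(p) = Σ log(1 / hist_i(p)), with one histogram per input column and bin heights normalized so that the tallest bin has height 1. Equal-width bins, the bin count and the function name are assumptions of this sketch rather than details from the claims.

```python
import math

def hbos_scores(columns, bins=5):
    """Histogram-based outlier score for each row of column-wise data."""
    n = len(columns[0])
    scores = [0.0] * n
    for col in columns:
        lo, hi = min(col), max(col)
        width = (hi - lo) / bins or 1.0  # constant column: single bin
        idx = [min(int((v - lo) / width), bins - 1) for v in col]
        counts = [0] * bins
        for b in idx:
            counts[b] += 1
        peak = max(counts)
        for row, b in enumerate(idx):
            # log(1 / normalized height): rarer bins give larger scores.
            scores[row] += math.log(peak / counts[b])
    return scores
```

A point falling in the most populated bin of every histogram scores 0, and points in sparsely populated bins receive higher scores.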

wherein the left-hand side of the formula represents the Mahalanobis distance measure, x_i represents the value of a sample point, and the mean term represents the average of the population;
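For illustration, the Mahalanobis distance can be sketched for two pollutant dimensions. The sketch assumes the standard definition, the square root of (x - mean)^T Σ^{-1} (x - mean), where Σ is the sample covariance matrix. Restricting the example to two dimensions, so that the matrix inverse can be written out by hand, and the function name are choices made for this sketch only.

```python
import math
from statistics import mean

def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-D point from the sample mean."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    n = len(points)
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) * inverse(cov) * (dx, dy)^T, expanded for the 2x2 case.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances
```

Unlike the Euclidean distance, this measure accounts for the scale of each pollutant and for the correlation between them, so a point is flagged only when it is unusual relative to the joint spread of the data.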

wherein:

where ψ represents the number of data points sampled from the data set to which x belongs, c(ψ) represents the average height for ψ data points, s(x, ψ) represents the anomaly score, H(ψ-1) is the harmonic number of (ψ-1), and h(x) represents the height of a data point x, i.e., the number of edges traversed from the root node of the tree to reach the leaf node containing x;
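The quantities defined above match the usual isolation-forest scoring, which can be sketched as follows. The sketch assumes the standard forms s(x, ψ) = 2^(-E(h(x)) / c(ψ)) and c(ψ) = 2H(ψ-1) - 2(ψ-1)/ψ, with H(i) approximated by ln(i) plus Euler's constant, and with c(2) = 1 by convention. Building the trees is omitted: the path heights of x in each tree are taken as given, and all function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649

def harmonic(i):
    """Approximate harmonic number H(i) as ln(i) + Euler's constant."""
    return math.log(i) + EULER_GAMMA

def average_height(psi):
    """c(psi): average path height for psi sampled data points."""
    if psi <= 1:
        return 0.0
    if psi == 2:
        return 1.0
    return 2.0 * harmonic(psi - 1) - 2.0 * (psi - 1) / psi

def anomaly_score(path_heights, psi):
    """s(x, psi) = 2 ** (-E(h(x)) / c(psi)); needs psi >= 2."""
    expected_height = sum(path_heights) / len(path_heights)
    return 2.0 ** (-expected_height / average_height(psi))
```

Under these assumptions a score near 1 indicates a point that is isolated after very few splits, and a score of 0.5 corresponds to a point whose average height equals c(ψ).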

the classification model comprises a Self-Attention structure, an RNN structure and a Luong Attention structure; the inter-pollutant correlation features, statistical parameters and anomaly scores, concatenated into sequences along the pollutant dimension, are used as inputs of a neural network; and constructing the classification model specifically comprises:

2. The method of claim 1, wherein, where a normal distribution is assumed and the data fail the normality verification, a normal distribution transformation is performed before the anomaly score is calculated.
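The claim does not name a particular normal distribution transformation. As one possible illustration only, the sketch below uses a rank-based inverse normal transform, which replaces each value by the standard normal quantile of its rank. The choice of this transform (rather than, for example, a power transform), the mid-rank plotting position and the function name are all assumptions of this sketch.

```python
from statistics import NormalDist

def rank_inverse_normal(values):
    """Map values to standard-normal quantiles according to their ranks."""
    n = len(values)
    # Indices sorted by value; ties keep their original relative order.
    order = sorted(range(n), key=lambda i: values[i])
    standard = NormalDist()
    out = [0.0] * n
    for rank, i in enumerate(order):
        # Mid-rank position keeps probabilities strictly inside (0, 1).
        out[i] = standard.inv_cdf((rank + 0.5) / n)
    return out
```

The transformed series preserves the ordering of the original values, so an anomaly detection algorithm that assumes normality can then be applied to it.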

3. The method according to claim 1, wherein training the classification model based on the data set to obtain the falsification detection model specifically comprises:

4. An emission data falsification detection apparatus, the apparatus comprising:

a falsification detection module configured to calculate the probability of emission data falsification based on the falsification detection model;

the feature calculation module is further configured to calculate mutual information between contaminants by the following formula:

Gain(T,X) = Entropy(T) - Entropy(T,X)

wherein E(S) represents the information entropy of the contaminant, i represents the type of contaminant, c represents the number of contaminant types, p_i represents the marginal probability density function of the i-th contaminant, Gain(T,X) represents the mutual information between the two contaminants, Entropy(T) represents the information entropy of one of the two contaminants, and Entropy(T,X) represents the information entropy of the other of the two contaminants;

The parameter calculation module is further configured to:

the Kolmogorov-Smirnov test is given by:

the Anderson-Darling test is given by:

the anomaly score acquisition module is further configured to:

in practical calculations, this formula is equivalent to the following formula:

wherein:

The classification model includes a Self-Attention structure, an RNN structure, and a Luong Attention structure, the classification model construction module is further configured to:

5. An emission data falsification detection system, the system comprising:

A memory for storing a computer program;

A processor for executing the computer program to implement the method of any one of claims 1 to 3.

6. A non-transitory computer readable storage medium storing instructions which, when executed by a processor, perform the method of any one of claims 1 to 3.