CN111985567A

CN111985567A - Automatic pollution source type identification method based on machine learning

Info

Publication number: CN111985567A
Application number: CN202010846058.3A
Authority: CN
Inventors: 王春迎; 詹宇; 马景金; 马红楠; 张朝; 王振强; 张仕富; 吴秦慧姿
Original assignee: Hebei Advanced Environmental Protection Industry Innovation Center Co ltd; Hebei Xianhe Environmental Protection Technology Co ltd
Current assignee: Hebei Advanced Environmental Protection Industry Innovation Center Co ltd; Hebei Xianhe Environmental Protection Technology Co ltd
Priority date: 2020-08-21
Filing date: 2020-08-21
Publication date: 2020-11-24
Anticipated expiration: 2040-08-21
Also published as: CN111985567B

Abstract

A machine-learning-based method for automatically identifying pollution source types, comprising the following steps: based on environmental monitoring data together with time and geographic information, identifying the occurrence of pollution problems and determining the type of pollution source through analysis and judgment, and establishing a typical pollution case library; using a machine learning algorithm, extracting data features with the case library data as samples, and developing a pollution source type identification model; applying the model to real-time monitoring data, marking abnormal data as a pollution event when it is found, and identifying the type of source causing the pollution, thereby achieving online identification of pollution source emissions with automatic alarms; reviewing the model's identification result, remotely or by on-site inspection, according to the alarm information; if the pollution problem is confirmed, handling it and adding the event to the typical case library for continuous optimization of the model; and if the identification result is inaccurate, removing the pollution event mark. Because the method is based on monitoring data from gridded micro stations, small stations and the like, more data can be brought into the data source and the model can be further optimized.

Description

Automatic pollution source type identification method based on machine learning

Technical Field

The invention relates to the field of atmospheric environment monitoring, in particular to a pollution source type automatic identification method based on machine learning.

Background

In the field of atmospheric environment monitoring, traditional monitoring relies on standard air stations. Because these are expensive, few are deployed, the volume of data they generate is small, and fine-scale pollution problems are difficult to reflect accurately. Micro stations using sensor methods are inexpensive and can therefore be deployed at many points, so that monitoring data with high spatial and temporal resolution are obtained across the monitoring area. The monitored parameters include PM10, PM2.5, SO2, NO2, CO, O3, temperature and humidity, with a spatial resolution of up to 1 km × 1 km and a temporal resolution of 1 h. The availability of massive environmental monitoring data supports establishing the relationship between pollution sources and air quality. Through manual analysis and research, existing pollution problems can be found from the data characteristics and the source type of the air pollution can be determined, including dust sources, mobile sources, coal-fired sources, catering oil smoke sources, industrial sources and the like. This narrows the scope of investigation, improves its accuracy, raises supervision efficiency and saves manpower in on-site investigation of environmental problems.

However, finding pollution problems and source types from massive monitoring data currently requires a great deal of manpower and time and depends heavily on the technical level and experience of the analysts. The overall process is inefficient, has poor timeliness, is limited by the skill of the technical staff, and can hardly provide effective support for environmental management. A computational method that can identify the type of pollution source efficiently, quickly and stably is therefore needed.

At present, existing patented pollution source identification techniques are based on hot spot grids rather than real-time monitoring data. For example, Chinese patent CN110147383A, entitled "Method and apparatus for determining pollution source type", discloses a method that determines the pollution source type of a polluted grid by setting a preset concentration value and a preset concentration difference and combining them with wind speed, wind direction and the pollution sources within the grid. Chinese patent CN110006799A, entitled "A classification method of hotspot grid pollution types", discloses a method that classifies atmospheric hot spot grid pollution types according to how the concentration of atmospheric pollutants changes over time. These techniques have the following disadvantages. First, the temporal and spatial resolution of hot spot grid data is low, so pollution source identification is mostly based on historical data, cannot guide pollution tracing in real time, and is difficult to verify scientifically and effectively. Second, satellite inversion data are constrained by meteorological conditions such as cloud cover, so accuracy cannot be guaranteed and effective tracing cannot be achieved. Third, hot spot grid data reflect the air quality of the grid area rather than the surroundings of the pollution source, so the pollution source type is difficult to distinguish from the data characteristics. Fourth, the identification approach is limited to a single mode with few characteristic parameters, whereas there are at least 6 types of pollution source with different pollution characteristics, and these characteristics cannot be described accurately.

Disclosure of Invention

In order to solve the above problems, the present invention provides a machine-learning-based method for automatically identifying the type of a pollution source. The method can use the parameters of each pollutant together with time and spatial coordinate information, and the spatial information takes part in the model computation that identifies the pollution process; that is, differences between the data of the target grid and the data of the surrounding grids are considered, rather than only the trend of the data over a time series.

In order to achieve the purpose, the invention provides a machine learning-based pollution source type automatic identification method, which mainly comprises the following steps:

Step one: based on monitoring data such as PM10, PM2.5, SO2, NO2, CO, O3, temperature and humidity, together with time and geographic information, the occurrence of pollution problems is identified and the type of pollution source is determined through (expert) analysis and judgment, and a typical pollution case library is established;

Step two: using a machine learning algorithm, data features are extracted with the mass data of the case library as samples, and a pollution source type identification model is developed;

Step three: the model is applied to real-time monitoring data; when abnormal data are found they are marked as a pollution event, and the type of source causing the pollution is identified, achieving online identification of pollution source emissions with automatic alarms;

Step four: an expert reviews the model's identification result, remotely or by on-site inspection, according to the alarm information; if the pollution problem is confirmed, it is handled and the event is added to the typical case library for continuous optimization of the model; if the identification result is inaccurate, the pollution event mark is removed.

The identification algorithm adopted by the method works as follows. Based on monitoring data such as PM10, PM2.5, SO2, NO2, CO, O3, temperature and humidity, together with time and geographic information, the occurrence of pollution problems is identified and the type of pollution source is determined through analysis and judgment (for example, manual judgment by experts), and a typical pollution case library is established. Then, using a machine learning algorithm, data features are extracted with the mass data of the case library as samples, and a pollution source type identification model is developed. The model is applied to real-time monitoring data; when abnormal data are found they are marked as a pollution event, and the type of source causing the pollution is identified, achieving online identification of pollution source emissions with automatic alarms. Furthermore, experts can review the model's identification result, remotely or on site, according to the alarm information; if the pollution problem is confirmed, it is handled and the event is added to the typical case library for continuous optimization of the model; if the identification result is inaccurate, the pollution event mark is removed.

Preferably, the training set of the algorithm model also includes pollution-free time-series data. After the pollution data of a grid are obtained, the 38 proposed features are calculated and input into the trained mathematical model, which outputs the classification result for the grid's pollution type.

By means of the above technical scheme, the invention achieves the following advantages over the prior art:

(1) Data source: compared with the prior art based on hot spot grid data, the method is based on monitoring data from gridded micro stations, small stations and the like, and can bring more data into the data source;

(2) Algorithm model: the technical scheme adopts machine learning algorithms, specifically including random forest, neural network, support vector machine, gradient boosting machine and the like, and adopts a combined model that includes sub-models based on curve shape (time-series shape), on automatic feature extraction by deep neural networks, and the like;

(3) Features: whereas the prior art offers few selectable features (judgment on a single grid), the algorithm of the invention, arrived at through repeated research by the inventors, can include 38 feature values in total, realizes comparison and judgment across multiple points, and further takes data such as surrounding pollution sources as feature values (this can improve the accuracy of pollution type identification, makes it possible to distinguish local sources from external sources, and overcomes the one-sidedness of single-grid analysis);

(4) Continuous model optimization: compared with the prior art, which is based on historical data and uses a fixed algorithm, the technical scheme generates a first-generation algorithm model from historical data, applies it to subsequent monitoring data and finds new cases; after review by technicians the new cases are automatically added to the case library, and the model can be further optimized;

(5) Cloud server: compared with the client-based prior art, the method can be based on a cloud server, which reduces client cost, builds a big-data advantage on the server side, and collects a large number of cases from different places, so that the advantages of the technical scheme are brought into full play and the accuracy of the algorithm's judgments is further improved.

Drawings

Fig. 1 is a flowchart illustrating steps of a method for automatically identifying a pollution source type based on machine learning according to the present invention.

Detailed Description

For a better understanding of the objects, aspects and advantages of the present invention, reference is made to the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings.

The hot spot grid referred to in the invention is a technical unit defined by the environmental protection authorities: the Beijing-Tianjin-Hebei region and the surrounding key area of "2+26" cities are divided into grids of 3 km × 3 km. Multiple kinds of data, such as satellite remote sensing, ground-based air quality observation and meteorological observation, are integrated; remote sensing image recognition based on cognition and multi-source data fusion is applied; the average PM2.5 concentration of each grid is then determined by satellite remote sensing inversion of atmospheric pollutants; and the key supervision areas among the hot spot grids are determined by ranking the concentration values.

Fig. 1 shows a flow chart of the identification algorithm used in the present invention. Its main concepts are briefly described as follows:

1. Establishment of the typical case base

The typical case base is a collection of cases that describe pollution events on the basis of environmental monitoring data and have been reviewed by humans (experts). The information contained in each case comprises at least: the start time and end time of the pollution event, the name and coordinates of the affected site, the types of the affected parameters, the local meteorological conditions at the time, and the type of pollution source as judged by experts. The parameter types may include PM10, PM2.5, SO2, NO2, CO, O3 and VOCs; the meteorological conditions include wind direction, wind speed, temperature and humidity; and the pollution source types include dust sources, mobile sources, coal-fired sources, catering oil smoke sources, industrial sources and others.
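The case fields listed above can be pictured as a small record type. The following is a minimal sketch only: the class name `PollutionCase`, the field names, the units and the spelling of the source type strings are assumptions made for illustration, and the example values are invented placeholders rather than data from the case base.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple

# Source types named in the text; the string spellings are illustrative.
SOURCE_TYPES = {"dust", "mobile", "coal", "catering", "industrial", "other"}


@dataclass
class PollutionCase:
    """One expert-reviewed pollution event (hypothetical field names)."""
    start: datetime
    end: datetime
    site_name: str
    site_coords: Tuple[float, float]   # (longitude, latitude)
    affected_parameters: List[str]     # e.g. ["PM10", "PM2.5"]
    wind_direction_deg: float
    wind_speed_ms: float
    temperature_c: float
    humidity_pct: float
    source_types: List[str]            # expert judgment, one or more types

    def __post_init__(self):
        # Basic consistency checks before a case enters the library.
        if self.end <= self.start:
            raise ValueError("end must be after start")
        unknown = set(self.source_types) - SOURCE_TYPES
        if unknown:
            raise ValueError(f"unknown source types: {sorted(unknown)}")

    def duration_hours(self) -> float:
        return (self.end - self.start).total_seconds() / 3600.0


# Placeholder values, not taken from any real case.
case = PollutionCase(
    start=datetime(2020, 1, 1, 8), end=datetime(2020, 1, 1, 14),
    site_name="example-site", site_coords=(118.0, 39.6),
    affected_parameters=["PM10", "PM2.5"],
    wind_direction_deg=270.0, wind_speed_ms=2.5,
    temperature_c=3.0, humidity_pct=40.0,
    source_types=["dust"],
)
```

Keeping `source_types` as a list allows a single case to carry more than one source type.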

2. Data features

The data features extracted by the algorithm model of the invention comprise:

1. standard deviation of the first derivative of PM2.5;
2. standard deviation of the first derivative of CO;
3. standard deviation of the first derivative of SO₂;
4. sum of squares of the first 10 autocorrelation coefficients of the first-differenced PM2.5 series;
5. maximum value of CO;
6. the primary pollutant;
7. skewness of AQI;
8. first autocorrelation coefficient of PM10;
9. quartiles of CO;
10. first autocorrelation coefficient of PM2.5;
11. coefficient of variation of AQI;
12. coefficient of variation of CO;
13. standard deviation of the first derivative of PM10;
14. sum of squares of the first 10 autocorrelation coefficients of the first-differenced CO series;
15. sum of AQI;
16. sum of SO₂ plus sum of CO;
17. skewness of PM10;
18. maximum value of SO₂;
19. sum of SO₂;
20. median of CO;
21. sum of squares of the first 10 autocorrelation coefficients of the first-differenced AQI series;
22. kurtosis of NO₂;
23. sum of squares of the first 10 autocorrelation coefficients of the first-differenced PM10 series;
24. first autocorrelation coefficient of AQI;
25. first autocorrelation coefficient of the first-differenced CO series;
26. first autocorrelation coefficient of CO;
27. first autocorrelation coefficient of the first-differenced SO₂ series;
28. sum of CO;
29. median of SO₂;
30. kurtosis of PM2.5;
31. first autocorrelation coefficient of the first-differenced PM2.5 series;
32. sum of squares of the first 10 autocorrelation coefficients of the first-differenced NO₂ series;
33. kurtosis of SO₂;
34. hour at which AQI reaches its maximum;
35. coefficient of variation of SO₂;
36. correlation coefficient of PM10 and CO;
37. correlation coefficient of SO₂ and CO;
38. correlation coefficient of NO₂ and CO.

To a certain extent, the 38 features reflect the rise and fall of each pollutant and the (cross-time) correlation between the time series of the pollutants, and they characterize, from a statistical perspective, the pollution type of each site in different periods.

For example, feature 6 (no2_diff1_acf10) represents the degree of variation of the NO2 series, feature 11 (distance_dtw) represents the similarity of the time series of different pollutants, and feature 17 (co_quantile) represents the frequency distribution of CO pollution, which can indicate to some extent whether a case belongs to motor vehicle pollution.

However, because multivariate time series vary in complex ways and are correlated with the multivariate time series of surrounding sites, it is difficult to summarize and select by hand the time-series features that correspond to each pollution type (or case). The invention therefore uses a machine learning algorithm to weight and combine the 38 features automatically, based on the training data in the case base, to produce a data-driven prediction model.

3. Model algorithm description/calculation formula

The technical scheme establishes a multi-label classification model for the existing cases and for cases added later, because compound pollution formed by several pollution types may occur at the same place in the same period. As shown in Table 1 (which does not include all pollution types), each row corresponds to one case, that is, one pollution event. X is the collection of selected feature values; X1, X2, X3, X4, X5 and X6 are the feature values of the corresponding cases; Y1, Y2, Y3, Y4 and Y5 are different pollution types, called labels in the multi-label model, where 1 means the case belongs to the type and 0 means it does not. The model adopts a combination strategy; such strategies mainly include Binary Relevance, Classifier Chains, Nested Stacking and the like.

X     Y1    Y2    Y3    Y4    Y5
X1    1     0     0     0     0
X2    0     1     1     0     0
X3    0     0     0     1     0
X4    0     0     0     0     1
X5    0     1     0     0     0
X6    1     0     1     0     0

TABLE 1 Multi-label model example
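As an illustration of the encoding in Table 1, the sketch below turns a set of pollution types per case into a 0/1 indicator row. The function name `to_indicator_row` and the dictionary layout are assumptions for illustration; the values reproduce the rows of Table 1.

```python
# Label names as used in Table 1.
LABELS = ["Y1", "Y2", "Y3", "Y4", "Y5"]


def to_indicator_row(label_set, labels=LABELS):
    """Return a 0/1 list: 1 where the case belongs to the label, else 0."""
    return [1 if name in label_set else 0 for name in labels]


# Each case maps to the set of pollution types it belongs to (from Table 1).
cases = {
    "X1": {"Y1"},
    "X2": {"Y2", "Y3"},   # compound pollution: two types at once
    "X3": {"Y4"},
    "X4": {"Y5"},
    "X5": {"Y2"},
    "X6": {"Y1", "Y3"},
}

label_matrix = {name: to_indicator_row(types) for name, types in cases.items()}
```

Rows with more than one 1, such as X2 and X6, are the compound pollution cases that motivate a multi-label rather than a single-label model.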

The invention mainly uses the binary relevance strategy. Its principle is to build one binary classifier for each label; each binary classification is a simple yes/no problem, namely whether the case belongs to that type. As shown in Table 2, the model is divided into five binary classifiers, which are then combined. At prediction time each label is predicted independently, without considering dependencies between labels, and the results are combined into a multi-label target. Binary relevance has computational complexity that is linear in the number of labels and is easily parallelized: the binary classifiers for all labels can be built at the same time, which improves running speed. In addition, machine learning models (e.g., random forest) under default parameter configurations tend to overlook, at prediction time, the pollution types that make up only a small share of the training samples. In this algorithm, the cutoff parameter of each binary classifier is adjusted based on the proportion of each pollution type in the training samples, so that the optimized model predicts each pollution type in a balanced manner and overall prediction performance improves.

TABLE 2 Binary relevance strategy example

When the binary classifier for each label is built independently, by default the same machine learning algorithm is used to model every binary classifier; the candidate algorithms include random forest, neural network, support vector machine, gradient boosting machine and the like. After further study, different combinations of feature values and different machine learning algorithms can be tried when modeling each pollution type, and the best feature combination and best algorithm are selected to build each binary classifier. Finally, the different binary classifiers are combined by binary relevance into an optimal multi-label model, and when a new pollution event is to be predicted, its pollution type can be judged comprehensively from its feature values.
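The binary relevance strategy with a per-label cutoff can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: a k-nearest-neighbour vote fraction stands in for the random forest (or other) base learner, the class name `BinaryRelevance` is hypothetical, and setting each cutoff equal to the label's share of positive training samples is one plausible reading of "adjusted based on the proportion of each pollution type", since the text does not give the exact rule.

```python
import math


def knn_positive_fraction(train_x, train_y, x, k=3):
    """Stand-in base learner: share of positives among the k nearest rows."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


class BinaryRelevance:
    """One independent yes/no classifier per label (hypothetical sketch)."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, Y):
        self.X = [list(row) for row in X]
        n_labels = len(Y[0])
        # Split the label matrix into one binary target column per label.
        self.columns = [[row[j] for row in Y] for j in range(n_labels)]
        # Assumed rule: cutoff = share of positive samples for that label,
        # so rare pollution types need fewer votes to be predicted.
        self.cutoffs = [sum(col) / len(col) for col in self.columns]
        return self

    def predict(self, x):
        result = []
        for column, cutoff in zip(self.columns, self.cutoffs):
            score = knn_positive_fraction(self.X, column, x, self.k)
            # Each label is decided on its own; label dependencies are ignored.
            result.append(1 if score > 0 and score >= cutoff else 0)
        return result


# Toy two-feature data, invented for illustration only.
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [9, 0], [9, 1]]
Y_train = [[0, 0], [0, 0], [0, 0], [1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
model = BinaryRelevance(k=3).fit(X_train, Y_train)
```

Because each label has its own classifier, swapping in a different algorithm or feature subset per pollution type only changes the scoring function used for that label.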

The invention builds the model using three algorithms: support vector machine, random forest and XGBoost. Briefly, a support vector machine (SVM) is a generalized linear classifier that performs binary classification of data by supervised learning, and it can be used for both classification and regression. Random forest is an algorithm that integrates multiple trees through ensemble learning; it is a nonlinear classifier and can therefore mine complex nonlinear interdependencies among variables. The basic unit of a random forest is the decision tree, a basic classifier whose main work is to select features to partition the data set so that the data are finally assigned one of two class labels; the constructed classifier has a tree structure. A random forest is obtained by building many decision trees; at prediction time each tree gives a classification result, the results are put to a vote, and the final classification is output on the principle that the minority yields to the majority. XGBoost is also a decision-tree-based machine learning algorithm. Unlike a random forest, in which each decision tree is built independently, XGBoost grows the model by continually adding trees and continually splitting on features; each added tree in effect learns a new function that fits the residual of the previous prediction, until a stopping condition is reached, such as the number of trees to be built. At prediction time, according to the features of the sample, a corresponding leaf node is found in each tree, each leaf node corresponds to a score, and the scores from all trees are summed to give the predicted value for the sample.
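The contrast between voting over independently built trees and summing the scores of sequentially built trees can be sketched in a few lines. This is a simplified illustration: the boosting part is plain squared-error gradient boosting with one-split trees (stumps) on a single feature, not XGBoost's regularized objective, and all function names and the toy data are assumptions.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Random-forest style: each tree votes, the most common class wins."""
    return Counter(tree_predictions).most_common(1)[0][0]


def fit_stump(xs, residuals):
    """Find the single threshold that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]                           # (threshold, left score, right score)


def stump_score(stump, x):
    threshold, left_score, right_score = stump
    return left_score if x <= threshold else right_score


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):                  # stopping condition: tree count
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_score(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def boosted_predict(model, x):
    """Sum the leaf scores of all trees on top of the base value."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_score(s, x) for s in stumps)


# Toy data, invented for illustration.
boosted = boost([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
```

For example, `majority_vote(["dust", "coal", "dust"])` returns `"dust"`, while `boosted_predict(boosted, 5)` approaches 1 as more stumps are added.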

When the model is built, the samples of the different pollution types are not necessarily balanced, which affects the accuracy of the model to some degree. The method addresses this point during model construction: by adjusting the model parameters it avoids, to a certain extent, the influence of unbalanced samples, and the parameters can be adjusted accordingly as cases continue to be added.

4. Cloud-server-based operation

In developing the algorithm model of the invention, the more cases are available, the higher the prediction accuracy of the resulting model, so environmental monitoring data from multiple cities are required; once developed, the model can be applied to different cities. In the scheme of the invention the model is therefore set to run in the cloud. This mode of operation makes effective use of as much data as possible, improves the accuracy of the model and facilitates wide application later.

The following specific examples are intended to illustrate the invention, but are not intended to limit the scope of the invention.

In this embodiment, the machine-learning-based method for automatically identifying pollution source types uses micro stations to obtain monitoring data with high spatial and temporal resolution in the monitoring area, the monitored parameters including PM10, PM2.5, SO2, NO2, CO, O3, temperature and humidity, and classifies pollution based on how concentrations vary with time together with geographic information. As shown in Fig. 1, the method mainly includes the following steps:

obtaining, through micro stations, monitoring data such as PM10, PM2.5, SO2, NO2, CO, O3, temperature and humidity with high spatial and temporal resolution in the monitoring area, together with time and geographic information;

establishing a typical pollution case library based on expert judgment;

developing, using a machine learning algorithm, a pollution source type identification model directed at the data characteristics of pollution source emissions;

using the model to mark abnormal data in the real-time monitoring data, identify the type of pollution source and raise an alarm automatically;

having an expert then review the model's identification result according to the alarm information to determine whether it is accurate; if the identification is correct, the pollution source is handled and the event is added to the case library to further optimize the model; if the identification is wrong, the pollution event mark is removed.
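The review step at the end of this list can be pictured as a small feedback loop. The sketch below is illustrative only: the names `Alarm`, `CaseLibrary` and `review`, the returned status strings and the site names are assumptions, and "correct" is taken here to mean that the expert's types match the model's types exactly.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Alarm:
    site: str
    predicted_types: List[str]
    flagged: bool = True          # the pollution event mark set by the model


@dataclass
class CaseLibrary:
    cases: List[Alarm] = field(default_factory=list)


def review(alarm: Alarm, expert_types: Optional[List[str]],
           library: CaseLibrary) -> str:
    """Apply the expert's judgment to one alarm.

    expert_types is None or empty when the expert finds no pollution source.
    """
    if expert_types and set(expert_types) == set(alarm.predicted_types):
        # Confirmed: keep the mark and add the event as a new training case.
        library.cases.append(alarm)
        return "confirmed"
    # Not confirmed: remove the pollution event mark, do not add the case.
    alarm.flagged = False
    return "mark_removed"


library = CaseLibrary()
first = Alarm(site="site-A", predicted_types=["dust"])
second = Alarm(site="site-B", predicted_types=["enterprise"])
first_status = review(first, ["dust"], library)
second_status = review(second, None, library)
```

Only confirmed events reach the library, so retraining on `library.cases` uses expert-verified labels.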

In the following embodiments, the classification of the high spatial and temporal resolution site pollution types in the monitored area comprises the following steps:

1. Monitoring data such as PM10, PM2.5, SO2, NO2, CO, O3, temperature and humidity with high spatial and temporal resolution in the monitoring area, together with time and geographic information, are obtained through the micro stations.

Because different pollution types produce different characteristic changes in pollutant concentration, a variety of features are extracted from the time-series pollution data using basic statistics of the data. Geographic information, emission inventory information and information obtained from expert judgment are then converted into corresponding feature variables, such as the pollution sources around each station, the road network density around each station, and time-series distance features, giving 140 features in total.

The features computed for each pollutant, and some of the calculations involved, are as follows:

For each of the 6 pollutants (PM10, PM2.5, SO2, NO2, O3, CO) and AQI, grouped by case:

diff1_acf10: sum of squares of the first 10 autocorrelation coefficients of the first-differenced series;

diff1_acf1: first autocorrelation coefficient of the first-differenced series;

x_acf1: first autocorrelation coefficient;

x_pacf5: sum of squares of the first 5 partial autocorrelation coefficients;

diff2x_pacf5: sum of squares of the first 5 partial autocorrelation coefficients of the twice-differenced series;

std1st_der: standard deviation of the first derivative;

in addition, for the 6 pollutants and AQI grouped by case: the sum, maximum, quartiles, coefficient of variation, mean, standard deviation, median, variance, skewness and kurtosis; the hour at which AQI reaches its maximum; the correlation coefficients among the six pollutants and AQI; and the primary pollutant.
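Several of the per-case statistics listed above can be computed with the standard library alone. The sketch below is a hedged illustration: the function name `case_features` and the dictionary keys are assumptions, the first derivative is approximated by the first difference of the hourly series, and skewness and kurtosis use population moments (kurtosis not excess), which may differ from the convention used by the inventors. The series must contain at least three values.

```python
import math
import statistics


def diff(series):
    """First difference of a series."""
    return [b - a for a, b in zip(series, series[1:])]


def acf(series, lag):
    """Sample autocorrelation at the given lag (0.0 for a constant series)."""
    mean = statistics.fmean(series)
    denom = sum((v - mean) ** 2 for v in series)
    if denom == 0:
        return 0.0
    num = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(len(series) - lag))
    return num / denom


def central_moment(series, order):
    mean = statistics.fmean(series)
    return sum((v - mean) ** order for v in series) / len(series)


def case_features(series):
    """Statistics for one pollutant over one case (hypothetical key names)."""
    d1 = diff(series)
    mean = statistics.fmean(series)
    m2 = central_moment(series, 2)
    max_lag = min(10, len(d1) - 1)          # short cases have fewer lags
    nan = float("nan")
    return {
        "x_acf1": acf(series, 1),
        "diff1_acf1": acf(d1, 1),
        "diff1_acf10": sum(acf(d1, k) ** 2 for k in range(1, max_lag + 1)),
        "std1st_der": statistics.stdev(d1),  # assumption: derivative = difference
        "cv": math.sqrt(m2) / mean if mean else nan,
        "skewness": central_moment(series, 3) / m2 ** 1.5 if m2 else nan,
        "kurtosis": central_moment(series, 4) / m2 ** 2 if m2 else nan,
        "max": max(series),
        "sum": sum(series),
        "median": statistics.median(series),
    }


# Toy hourly values, invented for illustration.
features = case_features([1, 2, 3, 4, 5])
```

Calling `case_features` once per pollutant and per case, then prefixing the keys with the pollutant name, would give names in the style of `co_max` or `so2_sum`.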

Pollution sources around the station: from the pollution source information and emission inventory information around each station, the numbers of the different types of pollution source around each station are obtained and used as feature values;

Road network density around the station: to account for the influence of motor vehicle emissions on the pollutant data, the density of the road network around each station is obtained from the surrounding road network using geographic information system techniques and is used as a feature value;

Time-series distance features: the similarity between the time series of different pollutants, measured with the dynamic time warping (DTW) distance.
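The DTW distance used for the time-series distance feature can be computed with the classic dynamic programming recurrence. The sketch below uses the absolute difference as the local cost and no warping window; both are assumptions, since the text does not specify them, and the function name `dtw_distance` is illustrative.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Because pollutants are measured on different scales, each series would probably need to be standardized before comparison; that preprocessing step is an assumption and is not described in the text.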

A subset of feature variables is then screened from all the candidate variables according to variable importance in the random forest model, and the following 38 data features, drawn from the pollution data, geographic information, emission inventory information and expert judgment, are finally selected as the basis for pollution type classification.

Feature 1: co_std1st_der; standard deviation of the first derivative of CO;

Feature 2: pm10_diff1_acf10; sum of squares of the first 10 autocorrelation coefficients of the first-differenced PM10 series;

Feature 3: pm2_5_diff1_acf10; sum of squares of the first 10 autocorrelation coefficients of the first-differenced PM2.5 series;

Feature 4: co_diff1_acf10; sum of squares of the first 10 autocorrelation coefficients of the first-differenced CO series;

Feature 5: polarization; the location type of the site of the pollution case as judged by experts, such as main road, sensitive point, town, construction site, environmental background point and the like;

Feature 6: no2_diff1_acf10; sum of squares of the first 10 autocorrelation coefficients of the first-differenced NO2 series;

Feature 7: aqi_diff1_acf10; sum of squares of the first 10 autocorrelation coefficients of the first-differenced AQI series;

Feature 8: x_acf1_aqi; first autocorrelation coefficient of AQI;

Feature 9: aqi_cv; coefficient of variation of AQI;

Feature 10: data; hour at which AQI reaches its maximum;

Feature 11: distance_dtw; similarity between the time series of pollutants, using the DTW distance;

Feature 12: aqi_sum; sum of AQI;

Feature 13: pm10_std1st_der; standard deviation of the first derivative of PM10;

Feature 14: pm2_5_std1st_der; standard deviation of the first derivative of PM2.5;

Feature 15: so2_std1st_der; standard deviation of the first derivative of SO2;

Feature 16: co_max; maximum value of CO;

Feature 17: co_quantile; quartiles of CO;

Feature 18: so2_co_sum; sum of SO2 plus sum of CO;

Feature 19: so2_max; maximum value of SO2;

Feature 20: co_sum; sum of CO;

Feature 21: so2_sum; sum of SO2;

Feature 22: x_acf1_pm10; first autocorrelation coefficient of PM10;

Feature 23: x_acf1_co; first autocorrelation coefficient of CO;

Feature 24: x_acf1_pm2_5; first autocorrelation coefficient of PM2.5;

Feature 25: co_cv; coefficient of variation of CO;

Feature 26: so2_cv; coefficient of variation of SO2;

Feature 27: so2_mean; median of SO2;

Feature 28: co_mean; median of CO;

Feature 29: pm2_5_diff1_acf1; first autocorrelation coefficient of the first-differenced PM2.5 series;

Feature 30: so2_diff1_acf1; first autocorrelation coefficient of the first-differenced SO2 series;

Feature 31: co_diff1_acf1; first autocorrelation coefficient of the first-differenced CO series;

Feature 32: skewness_pm10; skewness of PM10;

Feature 33: skewness_aqi; skewness of AQI;

Feature 34: pm2_5_kurtosis; kurtosis of PM2.5;

Feature 35: so2_kurtosis; kurtosis of SO2;

Feature 36: no2_kurtosis; kurtosis of NO2;

Feature 37: polarization_entities; the number of each type of pollution source around the site, obtained from the pollution source information around the site;

Feature 38: polarization_type; the number of each type of pollution source around the station, obtained from the emission inventory.
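Counting the pollution sources of each type around a station, as in features 37 and 38, can be sketched with a great-circle distance test. The 1 km radius, the function names and the tuple layout of the source records are assumptions made for illustration; the text does not state how "around the station" is delimited.

```python
import math
from collections import Counter


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def count_sources_nearby(station, sources, radius_km=1.0):
    """Count sources of each type within radius_km of the station.

    station: (lon, lat); sources: iterable of (lon, lat, source_type).
    The default radius is an assumed value, not taken from the text.
    """
    lon, lat = station
    counts = Counter()
    for source_lon, source_lat, source_type in sources:
        if haversine_km(lon, lat, source_lon, source_lat) <= radius_km:
            counts[source_type] += 1
    return dict(counts)


# Placeholder coordinates, invented for illustration.
nearby = count_sources_nearby(
    (118.0, 39.6),
    [(118.001, 39.6, "industrial"),
     (118.002, 39.601, "industrial"),
     (118.5, 39.6, "catering")],
)
```

The resulting per-type counts can be fed to the model as separate feature columns, one per source type.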

2. A typical pollution case library is established based on expert judgment.

The pollution type of each high spatial and temporal resolution grid can be determined by expert judgment based on the pollution data and some other information. In this embodiment the pollution types determined are: fugitive dust; motor vehicles; heavy vehicles, machinery and ships; catering oil smoke; coal burning; unorganized incineration; enterprises; fireworks and firecrackers; and processes involving VOCs, 9 types in total.

3. Develop a pollution source type identification algorithm model for the characteristics of pollution source emission data, based on machine learning algorithms.

The 38 features selected by the invention are calculated from the pollution data and other information, and the feature data of each grid cell are labeled with the pollution type judged by the experts to serve as training data for the model. The method uses machine learning algorithms, specifically random forests, neural networks, support vector machines, gradient boosting machines, and the like, and adopts a combined model that includes sub-models based on curve shape (time-series shape), automatic feature extraction by deep neural networks, and the like, to train the data model, so that the proposed dimensions and feature classification can be better understood and the accuracy of pollution type classification can be improved.
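The labeling step described above can be sketched as follows: each expert-judged case is turned into one feature row and one 0/1 indicator row over the nine pollution types, which also allows several types to be marked for the same case. The type keys, the dictionary layout of a case, and the function names are hypothetical and chosen only for this example.

```python
# The nine pollution types of this embodiment, in the order given in the text.
# The string keys are hypothetical identifiers for this sketch.
POLLUTION_TYPES = [
    "dust",
    "motor_vehicle",
    "heavy_vehicle_machinery_ship",
    "catering_fumes",
    "coal_burning",
    "unorganized_incineration",
    "enterprise",
    "fireworks",
    "voc_process",
]


def encode_labels(expert_types):
    """Turn the expert-judged types of one case into a 0/1 indicator row."""
    unknown = set(expert_types) - set(POLLUTION_TYPES)
    if unknown:
        raise ValueError(f"unknown pollution types: {sorted(unknown)}")
    return [1 if t in expert_types else 0 for t in POLLUTION_TYPES]


def build_training_rows(cases, feature_names):
    """Build the feature matrix X and the label matrix Y from labeled cases.

    Each case is assumed to be a dict with a "features" mapping and an
    "expert_types" collection (hypothetical layout).
    """
    X, Y = [], []
    for case in cases:
        X.append([case["features"][name] for name in feature_names])
        Y.append(encode_labels(case["expert_types"]))
    return X, Y
```

The resulting `X` and `Y` can then be passed to whichever learner is used, for example one of the algorithms named above.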

4. Use the algorithm model to mark abnormal data in the real-time monitoring data, identify the pollution source type, and raise an alarm automatically.

The training set of the algorithm model also contains time-series data from pollution-free periods. After the pollution data of a grid cell are obtained, the 38 proposed features are calculated and input into the trained mathematical model, which outputs the classification result for the pollution type of that grid cell. In the early-stage test, two standard air stations and three micro stations (at the southeast corner of a steel enterprise, at a sewage treatment plant, and on the north ring road of the city) in a certain city were randomly selected, and data from September 1, 2019 onward were used. A segmentation function divides the data into segments, pollution segments are then screened using concentration conditions for the different pollutants, and each segment is predicted with the model built from the cases to obtain its pollution type. The information on the resulting site pollution segments is then sent back to the experts for secondary judgment.
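The segmentation and screening step can be illustrated with a small sketch. The patent does not disclose the segmentation function or the concentration conditions, so the rule used here (contiguous runs above a threshold that last at least a minimum number of hours) and all parameter values are assumptions for illustration only.

```python
def find_pollution_segments(values, threshold, min_length=3):
    """Return (start, end) index pairs, end exclusive, of candidate segments.

    A candidate segment is a contiguous run of hourly values above
    `threshold` that lasts at least `min_length` hours. Both parameters
    are hypothetical screening conditions.
    """
    segments = []
    start = None
    for i, v in enumerate(values):
        if v > threshold:
            if start is None:
                start = i  # a new run begins
        elif start is not None:
            if i - start >= min_length:
                segments.append((start, i))
            start = None  # the run has ended
    # Close a run that extends to the end of the series.
    if start is not None and len(values) - start >= min_length:
        segments.append((start, len(values)))
    return segments
```

Each returned segment would then be converted into the 38 features and passed to the trained model for classification.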

5. The experts review the model identification result according to the alarm information and determine whether it is accurate. If the identification is correct, the pollution source is dealt with and the event is added to the case library to further optimize the algorithm model; if the identification is wrong, the pollution event flag is removed. For example, the pollution type at the site Tangshan ceramics for the period 2019/9/7 14:00 to 2019/9/8 7:00 was identified as a (raised dust and dust). The secondary judgment of the expert group was also type a (raised dust and dust), so the model result matches the expert judgment; the case can be added to the case library as a supplementary case, and the pollution source is dealt with. By contrast, the pollution type at the site suburb sewage treatment plant 842 for the period 2019/9/212:00-2019/9/2111:00 was identified as g (enterprise), but the expert group found no obvious pollution source in its secondary judgment. The expert review therefore differs from the model identification result, and the pollution event flag is removed.
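The review loop of this step can be sketched as a small function that either adds a confirmed event to the case library or removes its pollution event flag. The event layout, the field names, and the rule that a result counts as correct only when the predicted and expert-judged type sets are identical are assumptions made for this example.

```python
def apply_expert_review(event, expert_types, case_library):
    """Update an alarmed event and the case library after expert review.

    `event` is a dict with at least "predicted_types" and
    "is_pollution_event" (hypothetical layout). Returns True when the
    model result is confirmed by the experts.
    """
    predicted = set(event["predicted_types"])
    # Assumption: confirmed only if experts found a source and the sets match.
    confirmed = bool(expert_types) and predicted == set(expert_types)
    if confirmed:
        event["status"] = "confirmed"
        # Store a copy so later changes to the event do not alter the library.
        case_library.append({**event, "expert_types": sorted(expert_types)})
    else:
        event["is_pollution_event"] = False
        event["status"] = "flag_removed"
    return confirmed
```

Applied to the two examples above, the Tangshan ceramics event would be appended to the case library, while the sewage treatment plant event would have its flag removed.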

Those skilled in the art will appreciate that the identification accuracy of the model of the present invention increases as pollution events are added to the case library.

Although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the invention.

Claims

1. An automatic pollution source type identification method based on machine learning, characterized by comprising the following steps:

step one: based on environmental monitoring data, time, and geographic information, identifying the occurrence of pollution problems and judging the pollution source type through analysis and judgment, and establishing a typical pollution case library;

monitoring the real-time monitoring data with the algorithm model, marking abnormal data as a pollution event when abnormal data are found, further identifying the type of source causing the pollution, thereby realizing online identification of pollution source emissions, and raising an alarm automatically;

reviewing the model identification result, or checking it on site, according to the alarm information; if the pollution problem really exists, dealing with it and adding the event to the typical case library for continuous optimization of the algorithm model; and if the identification result is not accurate, removing the pollution event flag.

2. The method for automatically identifying the type of the pollution source based on the machine learning as claimed in claim 1, wherein:

the environmental monitoring data includes: PM10, PM2.5, SO2, NO2, CO, O3, temperature, and humidity, as well as time and geographic information;

the typical case library is a set of audited cases that describe pollution events on the basis of the environmental monitoring data, and each case comprises the following data: the start time and end time of the pollution event, the name and coordinates of the affected point, the types of the affected parameters and the local meteorological conditions at the time, and the pollution source type judged by experts;

the affected parameters are parameters of the monitored area obtained at high spatio-temporal resolution through micro stations, and the parameter types comprise at least 6 pollutants: PM10, PM2.5, SO2, NO2, CO, O3, and VOCs; the meteorological conditions include wind direction, wind speed, temperature, and humidity; and the pollution source types include dust sources, mobile sources, coal-fired sources, catering oil fume sources, and industrial sources.

3. The automatic identification method for the pollution source type based on the machine learning according to the claim 1 or 2, characterized in that in the step one, various characteristics are extracted from the time series pollution data according to the basic statistics of the data; and converting the geographic information, the emission list information and the information acquired by expert judgment into corresponding characteristic variables.

4. The method for automatically identifying the type of the pollution source based on the machine learning as claimed in claim 3, wherein the extracted features and the calculation method are as follows:

for the 6 pollutants and AQI, grouped by case:

diff1_acf10: sum of squares of the first 10 autocorrelation coefficients of the first-order differenced series;

diff1_acf1: first autocorrelation coefficient of the first-order differenced series;

x_acf1: first autocorrelation coefficient;

diff2x_pacf5: sum of squares of the first 5 partial autocorrelation coefficients of the second-order differenced series;

std1st_der: standard deviation of the first derivative;

for the 6 pollutants and AQI, grouped by case: the average value, sum, maximum, quartiles, coefficient of variation, mean, standard deviation, median, variance, skewness, and kurtosis, and the hour value of the time of maximum AQI; the correlation coefficients between the six pollutants and AQI; and the primary pollutant.

5. The method of claim 4, wherein a number of feature variables are selected from all considered variables according to variable importance in the random forest model, and the following 38 data features, based on pollution data, geographic information, emission inventory information, and information obtained by expert judgment, are selected as the basis for pollution type classification:

Feature 9: aqi_cv; coefficient of variation of AQI;

Feature: data; hour value of the time of maximum AQI;

Feature 12: aqi_sum; sum of AQI;

Feature 14: pm2_5_std1st_der; first derivative standard deviation of PM2.5;

Feature 16: co_max; maximum value of CO;

Feature 17: co_quantile; quartiles of CO;

Feature 18: so2_co_sum; sum of SO2 plus sum of CO;

Feature 19: so2_max; maximum value of SO2;

Feature 20: co_sum; sum of CO;

Feature 21: so2_sum; sum of SO2;

Feature 23: x_acf1_co; first autocorrelation coefficient of CO;

Feature 24: x_acf1_pm2_5; first autocorrelation coefficient of PM2.5;

Feature 25: co_cv; coefficient of variation of CO;

Feature 26: so2_cv; coefficient of variation of SO2;

Feature 27: so2_mean; median of SO2;

Feature 28: co_mean; median of CO;

Feature 30: so2_diff1_acf1; first autocorrelation coefficient of the first-order differenced SO2 series;

Feature 31: co_diff1_acf1; first autocorrelation coefficient of the first-order differenced CO series;

Feature 32: skewness_pm10; skewness of PM10;

Feature 33: skewness_aqi; skewness of AQI;

Feature 34: pm2_5_kurtosis; kurtosis of PM2.5;

Feature 35: so2_kurtosis; kurtosis of SO2;

Feature 36: no2_kurtosis; kurtosis of NO2;

Feature 38: polarization_type; number of pollution sources of each type around the station, obtained from the emission inventory;

the data features extracted in the algorithm model comprise: 1. first derivative standard deviation of PM2.5; 2. first derivative standard deviation of CO; 3. first derivative standard deviation of SO2; 4. sum of squares of the first 10 autocorrelation coefficients of the first-order differenced PM2.5 series; 5. maximum value of CO; 6. primary pollutant; 7. skewness of AQI; 8. first autocorrelation coefficient of PM10; 9. quartiles of CO; 10. first autocorrelation coefficient of PM2.5; 11. coefficient of variation of AQI; 12. coefficient of variation of CO; 13. first derivative standard deviation of PM10; 14. sum of squares of the first 10 autocorrelation coefficients of the first-order differenced CO series; 15. sum of AQI; 16. sum of SO2 plus sum of CO; 17. skewness of PM10; 18. maximum value of SO2; 19. sum of SO2; 20. median of CO; 21. sum of squares of the first 10 autocorrelation coefficients of the first-order differenced AQI series; 22. kurtosis of NO2; 23. sum of squares of the first 10 autocorrelation coefficients of the first-order differenced PM10 series; 24. first autocorrelation coefficient of AQI; 25. first autocorrelation coefficient of the first-order differenced CO series; 26. first autocorrelation coefficient of CO; 27. first autocorrelation coefficient of the first-order differenced SO2 series; 28. sum of CO; 29. median of SO2; 30. kurtosis of PM2.5; 31. first autocorrelation coefficient of the first-order differenced PM2.5 series; 32. sum of squares of the first 10 autocorrelation coefficients of the first-order differenced NO2 series; 33. kurtosis of SO2; 34. hour value of the time of maximum AQI; 35. coefficient of variation of SO2; 36. correlation coefficient of PM10 and CO; 37. correlation coefficient of SO2 and CO; 38. correlation coefficient of NO2 and CO.

6. The method for automatically identifying the type of the pollution source based on machine learning as claimed in claim 1, wherein: the pollution source type identification algorithm model is a multi-label classification model established for the existing cases and the cases supplemented later, that is, it can express composite pollution formed by a combination of several pollution types that may exist at the same place in the same time period; and a combination strategy is adopted to train the pollution source type identification algorithm model.

7. The method for automatically identifying the type of the pollution source based on machine learning as claimed in claim 6, wherein: the combination strategy is binary relevance, classifier chains, or nested stacking; and a cutoff parameter is set in each classifier according to the proportion of each pollution type in the training data, so as to address the imbalance of the training samples and improve the prediction accuracy for sporadic pollution types.

8. The method for automatically identifying the type of the pollution source based on machine learning as claimed in claim 7, wherein: the combination strategy is a binary relevance strategy, in which a binary classifier is established for each label; each binary classification is a simple problem, namely whether or not a sample belongs to that type; the model is thus divided into several binary classifiers that are then combined; during prediction each label is predicted independently, without considering dependencies between labels, and the results are then combined into a multi-label target; the computational complexity of binary relevance is linear in the number of labels and the strategy is easy to parallelize, that is, the binary classifiers for all labels can be built at the same time, which improves the operation speed.
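As an illustration only, the binary relevance strategy with a per-label cutoff can be sketched as below. The base learner (a toy scorer based on distances to class centroids) and the cutoff rule (the share of positive samples for the label, clamped to the range 0.1 to 0.5) are assumptions for this example and are not taken from the patent; any of the learners named in the description could take the place of the toy scorer.

```python
import math


class CentroidScorer:
    """Toy binary learner: score in [0, 1] from distances to class centroids.

    Assumes each label has at least one positive and one negative sample.
    """

    def fit(self, X, y):
        pos = [x for x, t in zip(X, y) if t == 1]
        neg = [x for x, t in zip(X, y) if t == 0]
        self.pos_c = self._centroid(pos)
        self.neg_c = self._centroid(neg)
        return self

    @staticmethod
    def _centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def score(self, x):
        d_pos = math.dist(x, self.pos_c)
        d_neg = math.dist(x, self.neg_c)
        total = d_pos + d_neg
        # Closer to the positive centroid gives a score nearer to 1.
        return 0.5 if total == 0 else d_neg / total


class BinaryRelevance:
    """One independent binary classifier per label, each with its own cutoff."""

    def fit(self, X, Y):
        self.models, self.cutoffs = [], []
        for j in range(len(Y[0])):
            y = [row[j] for row in Y]
            self.models.append(CentroidScorer().fit(X, y))
            share = sum(y) / len(y)  # proportion of this type in the training data
            # Hypothetical rule: rarer labels get a lower cutoff, within [0.1, 0.5].
            self.cutoffs.append(min(0.5, max(0.1, share)))
        return self

    def predict(self, x):
        # Labels are predicted independently and combined into one multi-label row.
        return [1 if m.score(x) >= c else 0
                for m, c in zip(self.models, self.cutoffs)]
```

Because the per-label models do not depend on one another, the loop in `fit` could be run in parallel, which is the property the claim relies on.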

9. The method according to claim 8, wherein the selected 38 features are calculated from the pollution data and other information, and the feature data of each grid cell are labeled with the judged pollution type to serve as training data for the model; the method uses machine learning algorithms, specifically random forests, neural networks, support vector machines, gradient boosting machines, and the like, and adopts a combined model that includes sub-models based on curve shape (time-series shape), automatic feature extraction by deep neural networks, and the like, to train the data model, so that the proposed dimensions and feature classification can be better understood and the accuracy of pollution type classification can be improved.

10. The method of claim 1, wherein the training set of the algorithm model also contains time-series data from pollution-free periods; after the pollution data of a grid cell are obtained, the 38 proposed features are calculated and input into the trained mathematical model, which outputs the classification result for the pollution type of that grid cell.