AU2021105563A4

AU2021105563A4 - Method for Traceability of Air Pollutants Based on Coupled Machine Learning and Correlation Analysis

Info

Publication number: AU2021105563A4
Application number: AU2021105563A
Authority: AU
Inventors: Min Gao; Wei Guo; Lifen LI; Jiwei Pang; Xu Zhang
Original assignee: Cecep Tianrong Technology Co Ltd
Current assignee: Cecep Tianrong Technology Co Ltd
Priority date: 2021-08-16
Filing date: 2021-08-16
Publication date: 2021-10-14
Anticipated expiration: 2029-08-16

Abstract

The present invention discloses a method for traceability of air pollutants based on coupled machine learning and correlation analysis. The method comprises the following steps: establishing a transient model for pollutant concentration distribution based on spatio-temporal pollutant concentration data of a regional grid source; extracting features from a pollutant concentration correlation coefficient matrix for a grid source by Gaussian regression; and coupling machine learning algorithms for intelligent recognition of pollutant transport channels and pollutant source regions. According to the present invention, features are extracted from the spatio-temporal correlation matrix of pollutant concentrations by Gaussian regression and correlation analysis, using the transient model established from spatio-temporal pollutant concentration data of a regional grid source, so as to solve the problems of delayed response and uncertainty in the time window of pollutant concentrations. In addition, model training data are constantly updated by using machine learning algorithms, thereby ensuring the continuous and effective improvement of the accuracy of traceability algorithms.

Description

FIG. 1 (flowchart; legible labels: historical data, data preprocessing, possible pollution source selection model, model optimization, new data, results of model application)

Method for Traceability of Air Pollutants Based on Coupled Machine Learning and Correlation Analysis

TECHNICAL FIELD

The present invention relates to the technical field of pollutant traceability, in

particular to a method for traceability of air pollutants based on coupled machine

learning and correlation analysis.

BACKGROUND

With the rapid development of the economy, the acceleration of industrialization and urbanization, and the increase of energy consumption in China, a series of atmospheric environmental problems have emerged. Compared with pollutants in water and soil, air pollutants diffuse easily, mix easily and follow unclear pollution paths, and are affected by emission sources, pollution processes, meteorological conditions and the like. Among them, emission sources are internal factors, meteorological conditions are external factors, and pollution processes are driving factors. Since driving factors and external factors are mainly governed by the objective laws of nature, they are difficult for human beings to control. Therefore, controlling internal factors is the most effective approach to air pollution control and environmental management, the core of which is to identify pollution sources, clarify the causes of pollution, achieve targeted governance and improve control efficiency.

Identification of the sources of air pollution can be divided into two categories: pollution traceability, focusing on the traceability of emission sources in terms of spatio-temporal distribution; and analysis of emission sources, focusing on the analysis of composition and industry of emission sources. As the main means for precise and scientific control of ambient air quality, the atmospheric fine grid system is widely used. Grid-based environmental statistical data analysis can achieve rough traceability of air pollution, but the response time is long. As a result, researchers have shortened the response time by tracing air pollution with model software and machine learning algorithms. However, the existing methods have the following disadvantages in achieving traceability of air pollution:

(1) Backward trajectory: backward trajectory is an integrated model system for calculating and analyzing airflow motion, deposition and diffusion trajectories, the core of which is to calculate and describe air mass motion through wind direction and wind speed in a three-dimensional meteorological field, and thus localize pollution sources through air mass trajectories. However, the method is highly dependent on wind field data and is limited by the input field of multiple meteorological elements. Current research mainly focuses on long-range transport on a short time scale and the identification of external pollution sources, which can provide a theoretical reference for dealing with overseas pollution sources and coordinated inter-regional prevention and control, but is not yet applicable to the traceability of small-scale regional endogenous pollution.

(2) Probabilistic method: the probabilistic method is a method for traceability of pollution mainly developed for the complexity of physical and chemical processes of air pollution and the discreteness of numerical models. The main principle is to combine available concentration observations with prior information, and to analyze and mine the uncertainty and confidence intervals of posterior parameters based on a large amount of historical data. In application, a large amount of supporting data and prior information on known pollution sources are required, which is difficult to obtain in atmospheric emergency response.

(3) Analytical method for sources of particulate matter: the method is developed for qualitative identification of pollution sources by analyzing the physicochemical properties of particulate matter in ambient air and samples from the pollution sources. Meanwhile, the contribution rate of pollution sources can be calculated quantitatively based on mathematical statistics and numerical model simulation. However, the method focuses on the analysis of the composition and industry of emission sources, and cannot determine the location and contribution rate of pollution sources in geographical space. Therefore, it is difficult for the method to meet the requirements for accurate traceability, targeted governance and efficient control of air pollution.

At this stage, model-based traceability analysis is mostly conducted from the perspective of the influence of boundary conditions on pollutant diffusion, such as wind direction, wind power and other factors. Such models are not universal and cannot be rapidly deployed in different regional scenarios, and personnel with relevant background knowledge are required to adjust localization parameters. In addition, changes in real-time data during pollutant transport are not considered. Such models are therefore steady-state models, and their dynamic response cannot be corrected according to transient spatio-temporal pollutant concentration data. Moreover, the existing models cannot effectively account for the delay effects during pollutant concentration transport or the uncertainty in the valid time window of pollutant events.

SUMMARY

The purpose of the present invention is to provide a method for traceability of air

pollutants based on coupled machine learning and correlation analysis. According to

the method, intelligent recognition is performed on pollutant transport channels and

pollutant source regions by using coupled machine learning and correlation analysis

based on spatio-temporal pollutant concentration data of a regional grid source.

In order to achieve the purpose, the present invention provides a method for

traceability of air pollutants based on coupled machine learning and correlation

analysis, comprising the following steps:

S1. acquiring real-time spatio-temporal data and historical spatio-temporal data

for each grid source site in a target region;

S2. constructing a database based on the real-time spatio-temporal data and the

historical spatio-temporal data; and extracting historical spatio-temporal data over a

period of time from the database;

S3. constructing a transient model for pollutant concentration distribution based

on the historical spatio-temporal data over a period of time;

S4. extracting features from the transient model for pollutant concentration

distribution by Gaussian regression, and standardizing the extracted features;

S5. constructing a possible pollution source selection model by using machine

learning algorithms, and training the possible pollution source selection model with

the extracted historical spatio-temporal data over a period of time as a training set;

then inputting the extracted features into the trained possible pollution source

selection model, and outputting the result of whether the grid source is on a transport

path;

S6. repeating S3-S4 for the real-time spatio-temporal data to obtain features $R^2$, $\mu$ and $\sigma$, and standardizing the features to obtain preprocessed new data; and

S7. taking the preprocessed new data as an input into the trained possible

pollution source selection model, and outputting whether the grid source is on the

transport path; and adding the output result to the training set for optimization and

continual learning of the possible pollution source selection model.

Preferably, both the real-time spatio-temporal data and the historical spatio-temporal data include geographic location information, pollutant concentration information, sampling time and meteorological information.
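As an illustration of the data described above, the following is a minimal sketch of one possible record layout for a grid source observation, together with a split into historical and real-time data. The class name, field names, units and the cutoff-based split are assumptions made for this example and are not specified in the text.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class GridSourceRecord:
    """One observation from one grid source site (all field names are hypothetical)."""
    site_id: str
    latitude: float
    longitude: float
    sampled_at: datetime
    concentration: float   # pollutant concentration (unit assumed, e.g. ug/m3)
    wind_speed: float      # meteorological information (unit assumed, m/s)
    wind_direction: float  # degrees clockwise from north (assumed convention)


def split_history(records, cutoff):
    """Split records into historical (before cutoff) and real-time (at or after cutoff)."""
    historical = [r for r in records if r.sampled_at < cutoff]
    realtime = [r for r in records if r.sampled_at >= cutoff]
    return historical, realtime
```

A database table with the same columns would serve equally well; the point is only that every record carries location, concentration, sampling time and meteorological information.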

Preferably, the S3 specifically comprises:

extracting pollutant concentration information from the historical

spatio-temporal data, then obtaining a valid time window i and a transport response

delay j according to set pollution events based on a hierarchical tree structure

constructed by a transport channel grid source k, and constructing a matrix to be

compared step by step and a correlation coefficient matrix in real time, i.e., a transient

model for pollutant concentration distribution.

Preferably, the S4 specifically comprises:

converting the correlation coefficient matrix into a correlation coefficient vector, and extracting features from the correlation coefficient vector by Gaussian regression to obtain Gaussian regression eigenvalues $R_k^2$, $\mu_k$ and $\sigma_k$.

Preferably, the possible pollution source selection model has an expression as follows:

$$y_k = f_k(R_k^2, \mu_k, \sigma_k)$$

where $y_k \in \{0, 1\}$, 0 means that the grid source k is not on the transport path and 1 means that the grid source k is on the transport path; $f_k$ represents the possible pollution source selection model, and each site is analyzed based on $f_k$ to record the grid source k with $y_k = 1$.

Preferably, the machine learning algorithms include random forest, decision tree,

clustering, Bayesian classification, support vector machine, EM and Adaboost.

Preferably, the S7 specifically comprises:

reassigning the pollution event time and pollution event concentration vector,

repeating S1-S5, then marking step by step until the correlation coefficient is the lowest, and outputting the marking result, i.e., the transport channel and possible pollution source regions, and ending iterative computation to realize continual learning and optimization of the model.

Preferably, the method for constructing the transient model for pollutant

concentration distribution specifically comprises:

Step 1. setting trigger conditions for a pollution event according to national

standards, and automatically marking the pollution event time t;

Step 2. constructing a pollution event concentration vector $X_i$ and a vector $Y_{ij}^k$ to be compared step by step by the valid time window i and the transport response delay j of the set pollution event based on Step 1;

Step 3. constructing a matrix $Z_{ij}^k$ to be compared step by step according to the vector $Y_{ij}^k$ to be compared step by step; and

Step 4. constructing a correlation coefficient matrix $R_{ij}^k$ for the grid source k based on the matrix $Z_{ij}^k$ to be compared step by step, i.e., a transient model for pollutant concentration distribution.

The following is the advantageous technical effect of the present invention as

compared with the prior art:

Most existing traceability models are steady-state models that consider the influence of boundaries on pollutant diffusion. They do not address the delay effects during pollutant concentration transport or the uncertainty in the valid time window of pollutant events, and they cannot continuously improve the stability, universality and accuracy of the models. According to the present invention, features are extracted from the spatio-temporal correlation matrix of pollutant concentrations by Gaussian regression and correlation analysis, using the transient model established from spatio-temporal pollutant concentration data of a regional grid source, so as to solve the problems of delayed response and uncertainty in the time window of pollutant concentrations. In addition, model training data are constantly updated by using machine learning algorithms, thereby ensuring the continuous and effective improvement of the accuracy of traceability algorithms.

BRIEF DESCRIPTION OF THE FIGURES

In order to explain the technical solutions in the embodiments of the present

invention or the prior art more clearly, the drawings used in the embodiments will be

briefly introduced below. Obviously, the drawings in the following description are

some embodiments of the present invention. For those of ordinary skill in the art,

other drawings can be obtained based on these drawings without paying creative

labor.

FIG. 1 is a flowchart showing the method of the present invention.

DESCRIPTION OF THE PRESENT INVENTION

The technical solutions in the embodiments of the present invention will be

described clearly and completely with reference to the accompanying drawings in the

embodiments of the present invention. Apparently, the described embodiments are

only a part of the embodiments of the present invention, not all of the embodiments.

Based on the embodiments of the present invention, all other embodiments obtained

by those of ordinary skill in the art without creative work should fall within the

protection scope of the present invention.

The present invention will be further described in detail with reference to

accompanying drawings and preferred embodiments for clear understanding of the

above purpose, features and advantages of the present invention.

Embodiment 1

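Referring to FIG. 1, the present invention provides a method for traceability of air pollutants based on coupled machine learning and correlation analysis, specifically comprising the following steps:

S1. acquiring real-time spatio-temporal data and historical spatio-temporal data for each grid source site in a target region;

where both the real-time spatio-temporal data and the historical spatio-temporal data include geographic location information (latitude and longitude), sampling time, meteorological information (e.g., wind power and wind direction) and pollutant concentration information;

S2. constructing a database based on the real-time spatio-temporal data and the historical spatio-temporal data; and extracting historical spatio-temporal data over a period of time from the database;

S3. constructing a transient model for pollutant concentration distribution based on the historical spatio-temporal data over a period of time to improve the dynamic response speed of the model;

extracting pollutant concentration information from the historical spatio-temporal data, then obtaining a valid time window i and a transport response delay j according to set pollution events based on a hierarchical tree structure constructed by a transport channel grid source k, and constructing a matrix to be compared step by step and a correlation coefficient matrix, i.e., a transient model for pollutant concentration distribution, which specifically comprises: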

S3.1. setting trigger conditions for a pollution event according to national

standards, and automatically marking the pollution event time t;

S3.2. constructing a pollution event concentration vector $X_i$, as shown in formula (1):

where $x_t$ represents the pollutant concentration of the grid source at which a standard event occurs at time t; i represents the valid time window of a set pollutant event, $i \in [3, I]$;

$$I = \left\lceil \frac{T}{\Delta T} \right\rceil \quad (2)$$

where I represents the upper limit of i; the operator $\lceil \cdot \rceil$ indicates rounding up; T is the antecedent duration of the pollution event; and $\Delta T$ is the data monitoring cycle of the grid source;
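The event marking of S3.1 and the window limit of S3.2 can be sketched as follows. This is a minimal illustration that assumes a pollution event is triggered when the concentration first rises above a threshold, and that the upper limit I is the antecedent duration T divided by the monitoring cycle ΔT, rounded up. The function names and the threshold value of 75 are hypothetical; the text refers only to "national standards" without giving figures.

```python
import math


def mark_event_times(concentrations, threshold):
    """Return the sample indices t at which the concentration first exceeds the threshold."""
    events = []
    for t, value in enumerate(concentrations):
        above = value > threshold
        was_above = t > 0 and concentrations[t - 1] > threshold
        if above and not was_above:
            events.append(t)  # start of a new exceedance episode
    return events


def window_upper_limit(antecedent_duration, monitoring_cycle):
    """I = ceil(T / dT): the largest valid time window, counted in samples."""
    return math.ceil(antecedent_duration / monitoring_cycle)


series = [30, 42, 80, 95, 60, 40, 78, 90]
print(mark_event_times(series, threshold=75))  # -> [2, 6]
print(window_upper_limit(5.5, 1.0))            # -> 6
```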

S3.3. constructing a vector $Y_{ij}^k$ to be compared step by step, as shown in formulas (3)-(5):

$$Y_{ij}^k = \left(y_{t-j}^k,\ y_{t-j+1}^k,\ \ldots,\ y_{t-j+i}^k\right) \quad (3)$$

$$J = \left\lceil \frac{\bar{d} / (v \cos\alpha)}{\Delta T} \right\rceil \quad (4)$$

$$\bar{d} = \frac{\sum_{m=1}^{k-1} \sum_{n=m+1}^{k} d_{m,n}}{C_k^2} \quad (5)$$

where $Y \neq X$; $y_{t-j}^k$ represents the pollutant concentration of a site k at time t-j; j represents the set pollutant transport response delay, $j \in [1, J]$; J represents the upper limit of j; v is the wind speed; $\alpha$ is the angle between wind direction and two points in space; $\bar{d}$ is the average distance between any two grid sources; $m, n \le k$, where k is the total number of grid sources; $d_{m,n}$ is the distance between grid sources m and n; and $C_k^2$ is the number of combinations of any two grid sources from k grid sources;
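The average inter-site distance and the delay limit can be sketched as below. The first function follows the stated definition directly: the sum of all pairwise distances divided by the number of two-site combinations. The second function is an assumption: it takes J as the travel time over the mean distance at the along-wind speed v·cos(α), divided by the monitoring cycle and rounded up, because the printed formula for J is only partly legible. Planar coordinates and Euclidean distance are also assumptions of this example.

```python
import math
from itertools import combinations


def mean_pairwise_distance(coords):
    """Average Euclidean distance over all C(k, 2) pairs of grid sources."""
    total = sum(math.dist(a, b) for a, b in combinations(coords, 2))
    return total / math.comb(len(coords), 2)


def delay_upper_limit(mean_distance, wind_speed, angle_rad, monitoring_cycle):
    """Assumed form of the delay limit: J = ceil(d / (v * cos(a) * dT))."""
    along_wind = wind_speed * math.cos(angle_rad)
    if along_wind <= 0:
        raise ValueError("wind has no component along the line between the sites")
    return math.ceil(mean_distance / (along_wind * monitoring_cycle))


sites = [(0.0, 0.0), (3.0, 4.0), (0.0, 8.0)]  # planar coordinates in km (hypothetical)
d_bar = mean_pairwise_distance(sites)
print(d_bar)                                   # -> 6.0
print(delay_upper_limit(d_bar, 2.0, 0.0, 1.0)) # -> 3
```

With real latitude and longitude, a great-circle distance would replace `math.dist`.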

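S3.4. constructing a matrix $Z_{ij}^k$ to be compared step by step according to the vector $Y_{ij}^k$ to be compared step by step, as shown in formula (6):

$$Z_{ij}^k = \begin{pmatrix} Y_{1,1}^k & Y_{1,2}^k & \cdots & Y_{1,j}^k \\ Y_{2,1}^k & Y_{2,2}^k & \cdots & Y_{2,j}^k \\ \vdots & \vdots & & \vdots \\ Y_{i-2,1}^k & Y_{i-2,2}^k & \cdots & Y_{i-2,j}^k \end{pmatrix} \quad (6)$$

S3.5. constructing a correlation coefficient matrix $R_{ij}^k$ of the grid source k based on the matrix $Z_{ij}^k$ to be compared step by step, as shown in formula (7):

$$R_{ij}^k = \begin{pmatrix} r_{1,1}^k & r_{1,2}^k & \cdots & r_{1,j}^k \\ r_{2,1}^k & r_{2,2}^k & \cdots & r_{2,j}^k \\ \vdots & \vdots & & \vdots \\ r_{i-2,1}^k & r_{i-2,2}^k & \cdots & r_{i-2,j}^k \end{pmatrix} \quad (7)$$

$$r_{i-2,j}^k = \frac{\mathrm{Cov}\left(X_i, Y_{i-2,j}^k\right)}{\sigma_{X_i}\,\sigma_{Y_{i-2,j}^k}} = \frac{\mathrm{Cov}\left(X_i, Y_{i-2,j}^k\right)}{\sqrt{\mathrm{Var}(X_i) \cdot \mathrm{Var}\left(Y_{i-2,j}^k\right)}} \quad (8)$$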
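The correlation coefficient matrix of S3.5 can be sketched with NumPy as below. The sketch assumes that, for each window length i from 3 to I and each delay j from 1 to J, the event site's window starting at time t is compared with site k's window starting at t − j using the Pearson coefficient, which gives a matrix of shape (I − 2, J). The exact window alignment and length are assumptions, and the function and variable names are hypothetical.

```python
import numpy as np


def correlation_matrix(event_series, site_series, t, max_window, max_delay):
    """Pearson correlations for windows i = 3..I (rows) and delays j = 1..J (columns)."""
    rows = []
    for i in range(3, max_window + 1):
        x = event_series[t : t + i]              # event window of i samples
        row = []
        for j in range(1, max_delay + 1):
            y = site_series[t - j : t - j + i]   # site-k window shifted back by j
            row.append(np.corrcoef(x, y)[0, 1])
        rows.append(row)
    return np.array(rows)


# Synthetic check: the event site repeats site k's signal two samples later.
rng = np.random.default_rng(0)
site_k = rng.normal(size=50)
event_site = np.roll(site_k, 2)
R = correlation_matrix(event_site, site_k, t=20, max_window=6, max_delay=4)
print(R.shape)  # -> (4, 4)
```

In this synthetic case the column for j = 2 is close to 1 in every row, which is the kind of signature the transport response delay is meant to capture.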

S4. extracting features from a pollutant concentration correlation coefficient matrix for a grid source by Gaussian regression, and standardizing the extracted features to obtain Gaussian regression eigenvalues $R_k^2$, $\mu_k$ and $\sigma_k$;

converting the correlation coefficient matrix $R_{ij}^k$ into a vector $R_r^k$, where

$$r = (I - 2) \cdot J \quad (9)$$

then extracting features from the vector $R_r^k$ by Gaussian regression to obtain Gaussian regression eigenvalues $R_k^2$, $\mu_k$ and $\sigma_k$; where $R^2$ represents the fitting effect of Gaussian regression, $\mu$ represents the mean value of the correlation coefficient, and $\sigma$ represents the variance of the correlation coefficient;
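The text does not spell out how the Gaussian regression is carried out, so the following is only one possible reading, offered as a sketch. It flattens the correlation matrix into a vector, estimates the mean and variance of the coefficients by moments, and reports as the goodness of fit the coefficient of determination between a histogram of the coefficients and the corresponding Gaussian density. The moment-based fit, the histogram binning, the z-score standardization and all names are assumptions of this example.

```python
import numpy as np


def gaussian_features(r_vector, bins=10):
    """Return (r2, mean, variance) for a vector of correlation coefficients."""
    r = np.asarray(r_vector, dtype=float).ravel()
    mu, sigma = r.mean(), r.std()
    density, edges = np.histogram(r, bins=bins, range=(-1.0, 1.0), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    fitted = np.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    ss_res = np.sum((density - fitted) ** 2)          # unexplained variation
    ss_tot = np.sum((density - density.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot, mu, sigma ** 2


def standardize(features):
    """Z-score each feature column across grid sources."""
    f = np.asarray(features, dtype=float)
    return (f - f.mean(axis=0)) / f.std(axis=0)


rng = np.random.default_rng(1)
coeffs = np.clip(rng.normal(0.3, 0.2, size=500), -1.0, 1.0)
r2, mean, variance = gaussian_features(coeffs)
```

A least-squares curve fit of the Gaussian parameters would be an alternative to the moment estimates used here.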

S5. constructing a possible pollution source selection model by using machine learning algorithms, and training the possible pollution source selection model with the extracted historical spatio-temporal data over a period of time as a training set; then inputting the extracted features into the trained possible pollution source selection model, outputting the result of whether the grid source is on a transport path, and performing intelligent recognition on pollutant transport channels and pollution source regions to reduce the difficulty in manual analysis and improve the universality of the model, which specifically comprises: establishing the possible pollution source selection model $f_k$ based on the transient model for pollutant concentration distribution by using a random forest machine learning algorithm, as shown in formula (10):

$$y_k = f_k(R_k^2, \mu_k, \sigma_k) \quad (10)$$

where $y_k \in \{0, 1\}$, 0 means that the grid source k is not on the transport path and 1 means that the grid source k is on the transport path. Here, whether the grid source site is on the pollution transport path is manually marked by a professional (0 means that the grid source is not on the transport path, and 1 means that the grid source is on the transport path).

Each site is analyzed based on $f_k$ to record the grid source k with $y_k = 1$.

The machine learning algorithms include random forest, decision tree, clustering, Bayesian classification, support vector machine, EM and Adaboost. The machine learning algorithms include, but are not limited to, random forest; a variety of machine learning algorithms are used for modeling, and a champion model is selected as the possible pollution source selection model;
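The champion model selection can be sketched with scikit-learn as follows. The sketch assumes that each training row holds the three standardized features of one grid source, that the label is the manually marked 0/1 value, and that the champion is the candidate with the highest mean cross-validated accuracy. The selection criterion, the hyperparameters and the synthetic data are assumptions of this example, and only a subset of the listed algorithms is included.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def select_champion(X, y, cv=5):
    """Score each candidate by cross-validation and refit the best one on all data."""
    candidates = {
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
        "decision_tree": DecisionTreeClassifier(random_state=0),
        "naive_bayes": GaussianNB(),
        "svm": SVC(),
        "adaboost": AdaBoostClassifier(random_state=0),
    }
    scores = {name: cross_val_score(model, X, y, cv=cv).mean()
              for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    return best, candidates[best].fit(X, y), scores


# Synthetic stand-in for (R2, mean, variance) features and expert labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
name, model, scores = select_champion(X, y)
on_path = model.predict(X[:5])  # 1 = on the transport path, 0 = not on it
```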

S6. extracting new geographic location information (longitude and latitude), pollutant concentration information, sampling time and meteorological information (e.g., wind power and wind direction) when a new pollution path grid source (new data) appears, i.e., real-time spatio-temporal data; repeating S3-S4 for dynamic correlation analysis to obtain features $R^2$, $\mu$ and $\sigma$, and standardizing the features to obtain preprocessed new data; and

S7. inputting the preprocessed new data into the possible pollution source selection model to obtain the result of whether the grid source is on the transport path; and adding the result to the training set based on application feedback of the possible pollution source selection model to realize continual learning and optimization of the model based on new data, thus improving the judgment accuracy, which specifically comprises: reassigning by formulas (11) and (12), repeating S1-S5, then marking $k_w$ step by step until the correlation coefficient $R_{ij}^k$ is the lowest, and outputting $k_w$, i.e., the transport channel and possible pollution source regions, and ending iterative computation to realize continual learning and optimization of the model, where w indicates the wth cycle.

$$t = t - j \quad (11)$$

$$X_i = Y_{ij}^{k_w} \quad (12)$$
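The step-by-step marking can be illustrated with the following sketch. It assumes that in each cycle the unmarked site whose lagged window correlates best with the current event window is marked, after which the event time is moved back by the matched delay and the event vector is replaced by that site's window, as in formulas (11) and (12). Stopping when the best correlation falls below a threshold `min_corr` is an assumption standing in for "until the correlation coefficient is the lowest", and all names and the synthetic series are hypothetical.

```python
import numpy as np


def best_lag(x_event, site_series, t, window, max_delay):
    """Return (j, r) for the delay j in 1..J with the highest Pearson correlation."""
    best_j, best_r = None, -np.inf
    for j in range(1, min(max_delay, t) + 1):
        y = site_series[t - j : t - j + window]
        r = np.corrcoef(x_event, y)[0, 1]
        if r > best_r:
            best_j, best_r = j, r
    return best_j, best_r


def trace_back(series_by_site, start_site, t, window, max_delay, min_corr=0.95):
    """Mark sites cycle by cycle, moving upstream along the transport channel."""
    path = [start_site]
    x = series_by_site[start_site][t : t + window]
    while True:
        options = {k: best_lag(x, s, t, window, max_delay)
                   for k, s in series_by_site.items() if k not in path}
        options = {k: v for k, v in options.items() if v[0] is not None}
        if not options:
            break
        k = max(options, key=lambda name: options[name][1])
        j, r = options[k]
        if r < min_corr:
            break                                # no further upstream site found
        path.append(k)
        t = t - j                                # formula (11)
        x = series_by_site[k][t : t + window]    # formula (12)
    return path


rng = np.random.default_rng(0)
base = rng.normal(size=80)
sites = {"A": base, "B": np.roll(base, 3), "C": np.roll(base, 5),
         "D": rng.normal(size=80)}
path = trace_back(sites, "C", t=40, window=8, max_delay=4)
```

In this synthetic chain the signal travels from A to B to C, so the trace starting at C is expected to mark B and then A.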

The preferred embodiments described herein are only for illustration purpose,

and are not intended to limit the present invention. Various modifications and

improvements on the technical solution of the present invention made by those of

ordinary skill in the art without departing from the design spirit of the present

invention shall fall within the protection scope set forth in claims of the present

invention.

Claims

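1. A method for traceability of air pollutants based on coupled machine learning and correlation analysis, comprising the following steps:

S1. acquiring real-time spatio-temporal data and historical spatio-temporal data for each grid source site in a target region;

S2. constructing a database based on the real-time spatio-temporal data and the historical spatio-temporal data; and extracting historical spatio-temporal data over a period of time from the database;

S3. constructing a transient model for pollutant concentration distribution based on the historical spatio-temporal data over a period of time;

S4. extracting features from the transient model for pollutant concentration distribution by Gaussian regression, and standardizing the extracted features;

S5. constructing a possible pollution source selection model by using machine learning algorithms, and training the possible pollution source selection model with the extracted historical spatio-temporal data over a period of time as a training set; then inputting the extracted features into the trained possible pollution source selection model, and outputting the result of whether the grid source is on a transport path;

S6. repeating S3-S4 for the real-time spatio-temporal data to obtain features $R^2$, $\mu$ and $\sigma$, and standardizing the features to obtain preprocessed new data; and

S7. taking the preprocessed new data as an input into the trained possible pollution source selection model, and outputting whether the grid source is on the transport path; and adding the output result to the training set for optimization and continual learning of the possible pollution source selection model.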

2. The method for traceability of air pollutants based on coupled machine learning and correlation analysis according to claim 1, characterized in that both the real-time spatio-temporal data and the historical spatio-temporal data include geographic location information, pollutant concentration information, sampling time and meteorological information.

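3. The method for traceability of air pollutants based on coupled machine learning and correlation analysis according to claim 2, characterized in that the S3 specifically comprises:

extracting pollutant concentration information from the historical spatio-temporal data, then obtaining a valid time window i and a transport response delay j according to set pollution events based on a hierarchical tree structure constructed by a transport channel grid source k, and constructing a matrix to be compared step by step and a correlation coefficient matrix in real time, i.e., a transient model for pollutant concentration distribution.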

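4. The method for traceability of air pollutants based on coupled machine learning and correlation analysis according to claim 3, characterized in that the S4 specifically comprises:

converting the correlation coefficient matrix into a correlation coefficient vector, and extracting features from the correlation coefficient vector by Gaussian regression to obtain Gaussian regression eigenvalues $R_k^2$, $\mu_k$ and $\sigma_k$.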

5. The method for traceability of air pollutants based on coupled machine learning and correlation analysis according to claim 1, characterized in that the possible pollution source selection model has an expression as follows:

$$y_k = f_k(R_k^2, \mu_k, \sigma_k) \quad (10)$$

where $y_k \in \{0, 1\}$, 0 means that the grid source k is not on the transport path and 1 means that the grid source k is on the transport path; $f_k$ represents the possible pollution source selection model, and each site is analyzed based on $f_k$ to record the grid source k with $y_k = 1$.

6. The method for traceability of air pollutants based on coupled machine

learning and correlation analysis according to claim 5, characterized in that the

machine learning algorithms include random forest, decision tree, clustering,

Bayesian classification, support vector machine, EM and Adaboost.

7. The method for traceability of air pollutants based on coupled machine

learning and correlation analysis according to claim 5, characterized in that the S7

specifically comprises:

reassigning the pollution event time and pollution event concentration vector,

repeating S1-S5, then marking step by step until the correlation coefficient is the

lowest, and outputting the marking result, i.e., the transport channel and possible

pollution source regions, and ending iterative computation to realize continual

learning and optimization of the model.

8. The method for traceability of air pollutants based on coupled machine

learning and correlation analysis according to claim 3, characterized in that the

method for constructing the transient model for pollutant concentration distribution

specifically comprises:

Step 1. setting trigger conditions for a pollution event according to national

standards, and automatically marking the pollution event time t;

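Step 2. constructing a pollution event concentration vector $X_i$ and a vector $Y_{ij}^k$ to be compared step by step by the valid time window i and the transport response delay j of the set pollution event based on Step 1;

Step 3. constructing a matrix $Z_{ij}^k$ to be compared step by step according to the vector $Y_{ij}^k$ to be compared step by step; and

Step 4. constructing a correlation coefficient matrix $R_{ij}^k$ for the grid source k based on the matrix $Z_{ij}^k$ to be compared step by step, i.e., a transient model for pollutant concentration distribution.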