CN115471011A

CN115471011A - Air quality prediction method based on rough set and structure risk minimization

Info

Publication number: CN115471011A
Application number: CN202211277183.2A
Authority: CN
Inventors: 张晓霞; 张蓬浩; 王国胤
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2022-10-19
Filing date: 2022-10-19
Publication date: 2022-12-13

Abstract

The invention relates to an air quality prediction method based on rough set and structure risk minimization, which comprises the following steps: acquiring environmental parameter sample data related to air quality; establishing an air quality evaluation system for carrying out grade evaluation on environmental parameter sample data acquired from a meteorological monitoring station to establish an air quality index decision table; calculating the empirical error of the condition attribute subset and the mutual information of the condition attribute subset and the decision attribute by utilizing a rough set theory and a structure risk minimization theory according to an air quality index decision table; and calculating by utilizing a genetic algorithm to obtain an optimal condition attribute subset, and predicting the air quality by taking the condition attributes in the optimal condition attribute subset as the condition attributes of the rough set classifier and using the environmental parameter data of the target monitoring point.

Description

Air quality prediction method based on rough set and structure risk minimization

Technical Field

The invention relates to the field of air quality prediction, in particular to an air quality prediction method based on rough set and structure risk minimization.

Background

Environmental pollution has provoked a strong public response, because serious air pollution affects people's health and daily life. The main pollutants that make up the Air Quality Index (AQI) are PM2.5, PM10, SO2, NO2, CO, O3, TSP (suspended particulate matter), and DF (dustfall), and the AQI may be associated with one or more of these pollutant factors.

In practice, the data of the main pollutants are the key inputs to the AQI. However, because of errors in pollutant data acquisition, the pollutant data may be incomplete or redundant, which makes air quality analysis and prediction more difficult.

Rough set theory is a mathematical tool proposed by Z. Pawlak in 1982 for dealing with incomplete and uncertain knowledge. Rough sets can effectively analyze and process many kinds of incomplete information and discover implicit rules in it. At present, air quality prediction is mostly carried out with rough sets. However, rough set theory is built on the indiscernibility relation, that is, an equivalence relation. Because the equivalence relation is strict and tolerates little erroneous information, rough sets generally generalize poorly when the data set contains a large amount of noise, so the prediction accuracy is unstable.

Disclosure of Invention

The invention aims to solve the problem in the prior art that, because of errors in pollutant data acquisition, the pollutant data may be incomplete or redundant, which increases the difficulty of air quality analysis and prediction. To this end, the invention provides an air quality prediction method based on rough set and structure risk minimization, which comprises the following steps:

S1: acquiring environmental parameter sample data related to air quality from a meteorological monitoring station;

S2: establishing an air quality evaluation system to perform grade evaluation on the environmental parameter sample data acquired from the meteorological monitoring station to obtain the air quality index grade of each environmental parameter sample;

S3: according to the environmental parameter sample data, taking the environmental parameters related to air quality as condition attributes, and taking the air quality index grade of the environmental parameter sample as the decision attribute to create an air quality index decision table;

S4: generating a limited number of condition attribute subsets according to the condition attributes in the air quality index decision table, and calculating the empirical error of each condition attribute subset and the mutual information of the condition attribute subset and the decision attribute by using rough set theory and structural risk minimization theory;

S5: calculating an optimal condition attribute subset by using a genetic algorithm according to the empirical error of the condition attribute subset and the mutual information of the condition attribute subset and the decision attribute;

S6: taking the condition attributes in the optimal condition attribute subset as the condition attributes of the rough set classifier, and predicting the air quality by using the environmental parameter data of the target monitoring point to obtain an air quality result.

The present invention has at least the following advantageous effects:

The invention combines rough set theory with the structural risk minimization criterion. Rough set theory supports quantitative analysis, so the relationships between data can be inferred and explained; adding the structural risk minimization criterion balances prediction error against model complexity and improves the stability and robustness of air quality prediction. In combination with a genetic algorithm, the feature dimension is reduced without reducing the classification accuracy, which increases the speed of air quality prediction.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a schematic diagram of the structural risk minimization criteria of the present invention;

FIG. 3 is a flow chart of the genetic algorithm of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

Referring to FIG. 1, the present invention provides an air quality prediction method based on rough set and structure risk minimization, comprising:

The main pollutants forming the air quality index comprise eight items: PM2.5, PM10, SO2, NO2, CO, O3, TSP (suspended particulate matter), and DF (dustfall). Therefore, the environmental parameters mainly collected by the invention are PM2.5, PM10, SO2, NO2, CO, O3, TSP, and DF, and the environmental parameter sample data are the pollutant concentrations corresponding to these eight parameters.

According to the national standard GB 3095-2012, the air quality evaluation system establishes an air quality index grade evaluation system with six air quality index grades: excellent, good, light pollution, moderate pollution, heavy pollution, and severe pollution. An air quality index of 0-50 is grade one; 51-100 is grade two; 101-150 is grade three; 151-200 is grade four; 201-300 is grade five; and an air quality index greater than 300 is grade six, as shown in Table 1:

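Table 1. Air quality evaluation table

Air quality index | Grade | Category
0-50 | 1 | Excellent
51-100 | 2 | Good
101-150 | 3 | Light pollution
151-200 | 4 | Moderate pollution
201-300 | 5 | Heavy pollution
Greater than 300 | 6 | Severe pollution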
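The grade boundaries stated above can be expressed as a small lookup. The following is a minimal sketch in Python; the function name `aqi_grade` is illustrative and does not appear in the original text.

```python
def aqi_grade(aqi: float) -> int:
    """Map an air quality index value to its grade (1-6) using the ranges stated above."""
    if aqi < 0:
        raise ValueError("AQI must be non-negative")
    # Inclusive upper bound of grades 1 to 5; anything above 300 is grade 6.
    upper_bounds = [50, 100, 150, 200, 300]
    for grade, bound in enumerate(upper_bounds, start=1):
        if aqi <= bound:
            return grade
    return 6


print([aqi_grade(v) for v in (35, 100, 101, 250, 420)])
```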

Three intervals are defined from the graded concentration limit values of each environmental parameter in the national standard GB 3095-2012 and are coded as low, medium, and high: low means the environmental parameter does not exceed the standard, medium means it exceeds the standard, and high means it seriously exceeds the standard. Let A₁, A₂, A₃, A₄, A₅, A₆, A₇, A₈ denote the environmental parameters PM2.5, PM10, SO2, NO2, CO, O3, TSP (suspended particulate matter), and DF (dustfall), respectively; an environmental parameter evaluation table is obtained for each of them.

Each environmental parameter sample is then evaluated according to the environmental parameter evaluation table, and the air quality index grade corresponding to the sample is obtained according to the air quality index evaluation table.
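The three-interval coding can be sketched as follows. This is only an illustration: the function `encode_level`, its parameter names, and the numeric limits in the example are placeholders chosen here and are not taken from GB 3095-2012 or from the original text, which does not list the limit values.

```python
def encode_level(concentration: float, standard_limit: float, severe_limit: float) -> str:
    """Code a concentration as low / medium / high.

    low    : does not exceed the standard limit
    medium : exceeds the standard limit
    high   : exceeds the (assumed) limit for seriously exceeding the standard
    """
    if standard_limit >= severe_limit:
        raise ValueError("standard_limit must be below severe_limit")
    if concentration <= standard_limit:
        return "low"
    if concentration <= severe_limit:
        return "medium"
    return "high"


# Placeholder limits for two hypothetical parameters (illustrative numbers only).
limits = {"A1": (10.0, 20.0), "A2": (1.0, 2.0)}
sample = {"A1": 25.0, "A2": 0.4}
encoded = {name: encode_level(value, *limits[name]) for name, value in sample.items()}
print(encoded)
```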

S3: according to the environmental parameter sample data, taking the environmental parameters related to air quality as condition attributes, and taking the air quality index grade of the environmental parameter sample as the decision attribute to create an air quality index decision table, as shown in Table 2:

Table 2. Air quality index decision table

Columns: Universe of discourse | A₁ | A₂ | A₃ | A₄ | A₅ | A₆ | A₇ | A₈ | D

Rows list the recorded attribute values in order, followed by the decision D:

x₁: low, low, medium, low, medium; D = Grade 2
x₂: low, low, high, medium, low, high, medium; D = Grade 3
x₃: low, low, low, medium, low, medium; D = Grade 1
x₄: low, medium, low, medium, low, medium, low, medium; D = Grade 2
x₅: medium, high, medium, medium, high; D = Grade 2
x₆: high, medium, high, high, medium, high; D = Grade 4
x₇: low, medium, high, low, medium, high; D = Grade 3
x₈: high, medium, high, medium, high; D = Grade 5
…

Here, x₁ represents the first environmental parameter sample, x₂ represents the second environmental parameter sample, and so on; D represents the air quality index grade corresponding to the environmental parameter sample data.

Rough set theory:

Rough set theory is a mathematical method for processing inaccurate, uncertain, and incomplete data. It can discover implicit knowledge and reveal underlying rules by analyzing and reasoning about the data. It uses a pair of exact sets, the upper approximation and the lower approximation, to give an approximate description of an uncertain target set.

Let S be an information table, expressed as: $S = (U, At = C \cup \{d\}, \{V_a \mid a \in At\}, \{I_a \mid a \in At\})$

U is a finite set of objects called the universe of discourse; At is a finite, non-empty attribute set; $V_a$ is the value range of the attribute $a \in At$; $I_a$ is an information function that gives the value of an object x on attribute a.

The indiscernibility relation on the universe of discourse U is:
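In the standard Pawlak formulation, which is assumed here, the relation is defined for an attribute subset $B \subseteq At$; the symbol $R_B$ follows the notation used in the next paragraph:

```latex
R_B = \{(x, y) \in U \times U \mid \forall a \in B,\ I_a(x) = I_a(y)\}
```

Two objects are therefore indiscernible with respect to B exactly when they take the same value on every attribute in B.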

Clearly, the indiscernibility relation is an equivalence relation that partitions the universe of discourse U into $U/R_B = \{X_1, X_2, \dots, X_m\}$, the set of equivalence classes induced by the equivalence relation $R_B$. The equivalence class $[x]_B = \{y \mid (x, y) \in R_B\}$ formed from the equivalence relation is the basic knowledge granule in rough set theory.

For each subset $X \subseteq U$ and an equivalence relation R, the upper and lower approximations of X are defined as follows:
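The standard definitions, assumed here to be the ones intended, write the lower approximation as $\underline{R}X$ and the upper approximation as $\overline{R}X$, with $[x]_R$ the equivalence class of x under R:

```latex
\underline{R}X = \{x \in U \mid [x]_R \subseteq X\}, \qquad
\overline{R}X = \{x \in U \mid [x]_R \cap X \neq \emptyset\}
```

The lower approximation collects the objects that certainly belong to X, and the upper approximation collects those that possibly belong to X.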

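The upper and lower approximations of X divide the universe of discourse U into three disjoint regions, the positive region $POS_R(X)$, the negative region $NEG_R(X)$, and the boundary region $BND_R(X)$, where $\underline{R}X$ and $\overline{R}X$ denote the lower and upper approximations of X:

Positive region: $POS_R(X) = \underline{R}X$;

Negative region: $NEG_R(X) = U - \overline{R}X$;

Boundary region: $BND_R(X) = \overline{R}X - \underline{R}X$.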
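The partition into equivalence classes and the three regions can be computed directly. The sketch below assumes the standard Pawlak definitions (positive region = lower approximation, negative region = complement of the upper approximation, boundary region = upper minus lower); the toy table and the names `partition` and `regions` are hypothetical and are not part of the original text.

```python
from collections import defaultdict


def partition(universe, table, attributes):
    """Group objects that agree on every attribute in `attributes` (equivalence classes)."""
    blocks = defaultdict(set)
    for x in universe:
        key = tuple(table[x][a] for a in attributes)
        blocks[key].add(x)
    return list(blocks.values())


def regions(universe, blocks, target):
    """Return the positive, negative, and boundary regions of `target`."""
    lower, upper = set(), set()
    for block in blocks:
        if block <= target:   # block lies entirely inside the target set
            lower |= block
        if block & target:    # block overlaps the target set
            upper |= block
    return {
        "positive": lower,
        "negative": set(universe) - upper,
        "boundary": upper - lower,
    }


# Hypothetical toy table with two condition attributes.
table = {
    "x1": {"A1": "low", "A2": "low"},
    "x2": {"A1": "low", "A2": "low"},
    "x3": {"A1": "high", "A2": "medium"},
    "x4": {"A1": "medium", "A2": "high"},
}
universe = list(table)
target = {"x1", "x3"}
result = regions(universe, partition(universe, table, ["A1", "A2"]), target)
print(result)
```

In this toy case x1 and x2 are indiscernible but only x1 is in the target set, so both fall into the boundary region.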

The approximation quality is used to describe the dependency between attribute sets. If the values of the attributes in Q are completely determined by those in P, then Q is said to depend on P, denoted as

Let $P, Q \subseteq At$; Q depends on P with a degree k ($0 \le k \le 1$), expressed as:

S41: obtaining the decision information system of the air quality index decision table according to rough set theory:

wherein $S = (U, C \cup D, V, f)$ is a decision information system. $U = \{x_1, x_2, \dots, x_n\}$ is a non-empty finite set of objects, also called the universe of discourse, and $x_i$ denotes the i-th environmental parameter sample. $C = \{A_1, A_2, \dots, A_m\}$ is a non-empty finite set of condition attributes, where $A_i$ represents the concentration of a gas or aerosol such as PM2.5, PM10, SO2, NO2, CO, O3, or TSP, and B is a subset of the condition attribute set C. D is the decision attribute, here the air quality index grade, which is divided into six grades according to the severity of the air pollution.

$V = \bigcup_{a \in C \cup D} V_a$ is the value range, in which $V_a$ represents the value range of the attribute a, and f is an information function.

S42: calculating the empirical error of the condition attribute subset according to the dependence of the decision attribute subset D on the condition attribute subset B in the decision information system;

$R_{emp}(B) = 1 - \gamma_B(D)$,

where $|\cdot|$ represents the cardinality of a set, i.e., the number of elements in the set. $U/IND(B) = \{X_1, X_2, \dots, X_n\}$ is the partition derived from the condition attribute subset B, and $X_i$ is one of its equivalence classes. $[x]_D$ is an equivalence class of the partition $U/IND(D)$ derived from the decision attribute D.

For example, according to Table 2, let $U = \{x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8\}$ be the environmental sample data set and let $C = \{A_1, A_2, A_3, A_4, A_5, A_6, A_7, A_8\}$ denote the set of environmental parameters PM2.5, PM10, SO2, NO2, CO, O3, TSP, and DF; D represents the decision attribute (the air quality index grade corresponding to the environmental parameter sample data). When the condition attributes $A_1, A_2, A_3, A_4$ are used for the partition, the condition attribute subset is B = {PM2.5, PM10, SO2, NO2}, and the set of equivalence classes derived from B is $U/IND(B) = \{\{x_1, x_3\}, \{x_2\}, \{x_4\}, \{x_5\}, \{x_7\}, \{x_6, x_8\}\}$. When the decision attribute D is used for the partition, $U/IND(D) = \{\{x_3\}, \{x_1, x_4, x_5\}, \{x_2, x_7\}, \{x_6\}, \{x_8\}\}$. The dependency of the decision attribute D on the condition attribute subset B is then $\gamma_B(D) = 0.625$,

thus $R_{emp}(B) = 1 - \gamma_B(D) = 1 - 0.625 = 0.375$.
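A computation of this kind can be sketched in Python. The sketch assumes the standard rough-set dependency degree, in which $\gamma_B(D)$ is the fraction of objects whose B-equivalence class lies entirely inside one decision class; this definition is an assumption about the omitted formula. The toy decision table below is hypothetical and is deliberately not Table 2, and the function names are illustrative.

```python
from collections import defaultdict


def partition(table, attributes):
    """Equivalence classes of the objects in `table` with respect to `attributes`."""
    blocks = defaultdict(set)
    for name, row in table.items():
        blocks[tuple(row[a] for a in attributes)].add(name)
    return list(blocks.values())


def dependency(table, condition_attrs, decision_attr):
    """Assumed standard definition: |positive region of D under B| / |U|."""
    decision_blocks = partition(table, [decision_attr])
    positive = set()
    for block in partition(table, condition_attrs):
        if any(block <= d for d in decision_blocks):
            positive |= block
    return len(positive) / len(table)


def empirical_error(table, condition_attrs, decision_attr):
    """R_emp(B) = 1 - gamma_B(D)."""
    return 1 - dependency(table, condition_attrs, decision_attr)


# Hypothetical toy decision table (not Table 2).
table = {
    "x1": {"A1": "low", "A2": "low", "D": 1},
    "x2": {"A1": "low", "A2": "low", "D": 2},
    "x3": {"A1": "high", "A2": "low", "D": 2},
    "x4": {"A1": "high", "A2": "medium", "D": 3},
}
print(empirical_error(table, ["A1", "A2"], "D"))
print(empirical_error(table, ["A1"], "D"))
```

Here x1 and x2 agree on A1 and A2 but have different grades, so they fall outside the positive region; dropping A2 also merges x3 and x4 and raises the empirical error.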

S43: introducing a mutual information regularization function according to a structure risk minimization criterion to calculate mutual information of the condition attribute subset and the decision attribute;

Referring to FIG. 2, structural risk minimization (SRM) is a strategy proposed to prevent overfitting. Structural risk minimization is equivalent to regularization. The structural risk adds to the empirical risk a regularizer, or penalty term, that represents the complexity of the model. The structural risk is defined as:

wherein J(f) is a function of the model complexity, and λ ≥ 0 is a coefficient that balances the empirical risk and the model complexity. The strategy of structural risk minimization regards the model with the smallest structural risk as the optimal model:

A small structural risk requires both the empirical risk and the model complexity to be small, and a model that satisfies this generalizes better.
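The two definitions referred to above are usually written in the following textbook form, which is assumed here to match the omitted formulas. The sample size N, the loss function L, and the hypothesis space $\mathcal{F}$ are symbols introduced for this sketch; J(f) and λ are as described in the text.

```latex
R_{srm}(f) = \frac{1}{N} \sum_{i=1}^{N} L\bigl(y_i, f(x_i)\bigr) + \lambda J(f)

\min_{f \in \mathcal{F}} \; \frac{1}{N} \sum_{i=1}^{N} L\bigl(y_i, f(x_i)\bigr) + \lambda J(f)
```

The first term is the empirical risk and the second is the complexity penalty, so a larger λ favors simpler models.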

On this basis, a mutual information regularization term I(B; D) is introduced into the selected rough set model, expressed as

$I(B; D) = H(D) - H(D \mid B)$

where H(D) is the entropy of the decision attribute D, H(D | B) is the conditional entropy of the decision attribute D given the condition attribute subset B, and I(B; D) represents the mutual information of the attribute subset B and the decision attribute D. Rough set theory supports quantitative analysis, so the relationships between data can be inferred and explained; adding the structural risk minimization criterion balances the prediction error against the complexity and improves the stability and robustness of air quality prediction.
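The mutual information term can be estimated from the decision table by counting. The sketch below assumes Shannon entropy with base-2 logarithms and probabilities estimated as relative frequencies; neither choice is stated in the original text. The toy rows and function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (base 2) of a list of labels, from relative frequencies."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def mutual_information(rows, condition_attrs, decision_attr):
    """I(B; D) = H(D) - H(D | B) for the attribute subset `condition_attrs`."""
    decisions = [row[decision_attr] for row in rows]
    groups = {}
    for row in rows:
        key = tuple(row[a] for a in condition_attrs)
        groups.setdefault(key, []).append(row[decision_attr])
    # H(D | B): entropy of D inside each B-equivalence class, weighted by class size.
    conditional = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(decisions) - conditional


# Hypothetical rows: A1 determines D completely, A2 carries no information about D.
rows = [
    {"A1": "low", "A2": "low", "D": 1},
    {"A1": "low", "A2": "low", "D": 1},
    {"A1": "high", "A2": "low", "D": 2},
    {"A1": "high", "A2": "low", "D": 2},
]
print(mutual_information(rows, ["A1"], "D"))
print(mutual_information(rows, ["A2"], "D"))
```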

Referring to FIG. 3, S5: calculating an optimal condition attribute subset by using a genetic algorithm according to the empirical error of the condition attribute subset and the mutual information of the condition attribute subset and the decision attribute;

S51: coding the condition attribute subsets, and taking the coded condition attribute subsets as the initial chromosome population of the genetic algorithm;

S52: calculating the expected error of each condition attribute subset according to the empirical error of the condition attribute subset and the mutual information of the condition attribute subset and the decision attribute;

The expected error of a condition attribute subset is:

$\min R_{reg}(B) = R_{emp}(B) + \alpha I(B; D)$;

wherein $\min R_{reg}(B)$ represents the expected error of the attribute subset B, $R_{emp}(B)$ represents the empirical error of the subset B, $I(B; D)$ represents the mutual information of the attribute subset B and the decision attribute D, and α is a hyperparameter with α ≥ 0.
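Steps S51 and S52 can be sketched together: a chromosome is decoded into an attribute subset and scored with the expression above, with the sign of the α term taken exactly as printed. The bit-string encoding (one bit per condition attribute), the standard positive-region dependency, base-2 entropy, the toy rows, and all function names are assumptions made for this illustration.

```python
from collections import Counter, defaultdict
from math import log2


def decode(chromosome, all_attrs):
    """Assumed encoding: bit i set to 1 means attribute i is in the subset B."""
    return [a for bit, a in zip(chromosome, all_attrs) if bit == 1]


def _decisions_by_block(rows, attrs, decision):
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in attrs)].append(row[decision])
    return list(groups.values())


def _entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def regularized_error(rows, attrs, alpha, decision="D"):
    """R_reg(B) = R_emp(B) + alpha * I(B; D), as written in the text."""
    n = len(rows)
    groups = _decisions_by_block(rows, attrs, decision)
    # A block belongs to the positive region when all its objects share one grade.
    consistent = sum(len(g) for g in groups if len(set(g)) == 1)
    r_emp = 1 - consistent / n
    h_d = _entropy([row[decision] for row in rows])
    h_d_given_b = sum(len(g) / n * _entropy(g) for g in groups)
    return r_emp + alpha * (h_d - h_d_given_b)


# Hypothetical toy decision table.
rows = [
    {"A1": "low", "A2": "low", "D": 1},
    {"A1": "low", "A2": "low", "D": 2},
    {"A1": "high", "A2": "low", "D": 2},
    {"A1": "high", "A2": "medium", "D": 3},
]
attrs = ["A1", "A2"]
subset = decode([1, 1], attrs)
print(subset, regularized_error(rows, subset, alpha=0.1))
```

A genetic algorithm would evaluate this score for every chromosome in the population and keep the subset with the smallest value.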

S53: and processing the initial chromosome by utilizing selection, crossing and mutation operators of a genetic algorithm to obtain a crossed mutated chromosome. Wherein, the selection operator adopts a roulette method, the crossover operator adopts single-point crossover, and the mutation operator adopts basic bit mutation;

the selection is made by roulette rules based on the expected error for each initial chromosome, as follows:

(1) Let the population size (the number of initial chromosomes) be M, and calculate the fitness $f_i$ ($i = 1, 2, \dots, M$), i.e. the expected error, of each individual (initial chromosome) in the population;

(2) Calculate the probability that each individual (initial chromosome) is inherited into the next generation population;

(3) Calculate the cumulative probability $q_i$ of each individual (initial chromosome) $x_i$ ($i = 1, 2, \dots, n$);

(4) Generate a uniformly distributed pseudo-random number r in the interval [0, 1];

(5) If $r < q_1$, select individual 1; otherwise, select the individual k for which $q_{k-1} < r \le q_k$ holds;

(6) Repeat steps (4) and (5) M times.
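The six steps above can be sketched as follows. The original text does not give the probability formula for step (2); this sketch assumes the common choice $p_i = f_i / \sum_j f_j$ with non-negative fitness values for which larger means better, so an expected error would first have to be converted into such a score. The function name and the example population are hypothetical.

```python
import random
from itertools import accumulate


def roulette_select(population, fitness, rng):
    """Select len(population) individuals by roulette wheel (steps 2 to 6)."""
    total = sum(fitness)
    if total <= 0 or any(f < 0 for f in fitness):
        raise ValueError("fitness values must be non-negative with a positive sum")
    # Steps 2 and 3: selection probabilities and cumulative probabilities q_i.
    cumulative = list(accumulate(f / total for f in fitness))
    cumulative[-1] = 1.0  # guard against floating-point rounding
    selected = []
    for _ in range(len(population)):      # step 6: repeat M times
        r = rng.random()                  # step 4: uniform pseudo-random number
        for individual, q in zip(population, cumulative):
            if r <= q:                    # step 5: first k with q_{k-1} < r <= q_k
                selected.append(individual)
                break
    return selected


rng = random.Random(0)
population = ["c1", "c2", "c3"]
print(roulette_select(population, [0.2, 0.5, 0.3], rng))
```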

In the crossover operation, individuals are selected to take part in crossover with a certain probability. A crossover point is chosen at random in the two paired chromosomes, and the substrings after the crossover point are then exchanged to generate the next-generation individuals.
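A minimal sketch of single-point crossover on bit-string chromosomes follows. The function name, the parameter `crossover_prob`, and the list-of-bits representation are assumptions made for illustration.

```python
import random


def single_point_crossover(parent_a, parent_b, crossover_prob, rng):
    """Exchange the substrings after a random crossover point with probability crossover_prob."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    if len(parent_a) < 2 or rng.random() >= crossover_prob:
        return parent_a[:], parent_b[:]   # no crossover: children are copies
    point = rng.randint(1, len(parent_a) - 1)  # crossover point strictly inside the string
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


rng = random.Random(1)
child_a, child_b = single_point_crossover([1] * 8, [0] * 8, 1.0, rng)
print(child_a, child_b)
```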

The basic bit mutation operation mutates, with probability $P_m$, the gene values at one or several randomly designated loci of an individual's code string. The operation process is as follows:

(1) For each locus of an individual (initial chromosome), designate it as a mutation point with probability $P_m$;

(2) Carry out the mutation operation on the designated mutation points.
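For a binary encoding, the two steps above amount to flipping each bit independently with probability $P_m$. The sketch below assumes chromosomes are lists of 0/1 values; the function name is illustrative.

```python
import random


def bit_mutation(chromosome, mutation_prob, rng):
    """Flip each gene independently with probability mutation_prob."""
    return [1 - gene if rng.random() < mutation_prob else gene for gene in chromosome]


rng = random.Random(3)
print(bit_mutation([1, 0, 1, 1, 0, 0, 1, 0], 0.1, rng))
```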

S54: taking the chromosomes after crossover and mutation as the initial chromosomes of the next iteration of the genetic algorithm, and repeatedly executing steps S52-S54 until the preset number of iterations is reached; and taking the condition attribute subset with the minimum expected error as the optimal condition attribute subset. In combination with the genetic algorithm, the method reduces the feature dimension without reducing the classification accuracy and increases the speed of air quality prediction.

For example, according to Table 2, let $U = \{x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8\}$ be the environmental sample data set and let $C = \{A_1, A_2, A_3, A_4, A_5, A_6, A_7, A_8\}$ denote the set of environmental parameters PM2.5, PM10, SO2, NO2, CO, O3, TSP, and DF; D represents the decision attribute (the air quality index grade corresponding to the environmental parameter sample data). Assuming that the optimal attribute subset obtained by the above operations is $A_1, A_2, A_3$, the optimal condition attribute subset is B = {PM2.5, PM10, SO2}, the set of equivalence classes derived from B is $U/IND(B) = \{\{x_1, x_3\}, \{x_2\}, \{x_4\}, \{x_5\}, \{x_7\}, \{x_6, x_8\}\}$, and the rule set derived from the decision table is $\gamma(B) = \{\gamma_1, \gamma_2, \gamma_3, \gamma_4, \gamma_5, \gamma_6\}$. Given an environmental sample to be measured X = {low, high, medium}, the rule set derived from the decision table gives X an air quality index grade of 3 with probability 1.
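The prediction step can be sketched as a rule lookup. The original text does not spell out how the rules are derived, so this sketch assumes one rule per combination of condition values, carrying the relative frequency of each grade among the matching training samples. The toy rows and the names `build_rules` and `predict` are hypothetical.

```python
from collections import Counter, defaultdict


def build_rules(rows, attrs, decision="D"):
    """Map each combination of condition values to the relative frequency of every grade."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[tuple(row[a] for a in attrs)][row[decision]] += 1
    return {
        key: {grade: c / sum(counter.values()) for grade, c in counter.items()}
        for key, counter in counts.items()
    }


def predict(rules, sample, attrs):
    """Return (grade, probability) for the matching rule, or None if no rule matches."""
    distribution = rules.get(tuple(sample[a] for a in attrs))
    if distribution is None:
        return None
    grade = max(distribution, key=distribution.get)
    return grade, distribution[grade]


# Hypothetical training rows restricted to the selected attributes A1, A2, A3.
rows = [
    {"A1": "low", "A2": "high", "A3": "medium", "D": 3},
    {"A1": "low", "A2": "low", "A3": "low", "D": 1},
    {"A1": "low", "A2": "low", "A3": "low", "D": 2},
]
attrs = ["A1", "A2", "A3"]
rules = build_rules(rows, attrs)
print(predict(rules, {"A1": "low", "A2": "high", "A3": "medium"}, attrs))
```

A sample whose condition values match no rule returns `None` in this sketch; how such samples should be handled is not specified in the original text.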

The predicted air quality result is fed back to the relevant enterprises in a timely manner, so that pollutant emissions are reduced and the environmental quality at the target monitoring point is improved.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. An air quality prediction method based on rough set and structure risk minimization, comprising:

S6: taking the condition attributes in the optimal condition attribute subset as the condition attributes of the rough set classifier, and predicting the air quality by using the environmental parameter data of the target monitoring point to obtain an air quality result.

2. The method of claim 1, wherein the calculating the empirical error of the subset of condition attributes and the mutual information between the subset of condition attributes and the decision attributes using the rough set theory and the structure risk minimization theory comprises:

S41: obtaining the decision information system of the air quality index decision table according to rough set theory;

S43: introducing a mutual information regularization function according to the structure risk minimization criterion to calculate the mutual information of the condition attribute subset and the decision attribute.

3. The method of claim 2, wherein the empirical error for the subset of condition attributes comprises:

$R_{emp}(B) = 1 - \gamma_B(D)$

wherein $|\cdot|$ represents the cardinality of a set, i.e., the number of elements in the set; $U/IND(B) = \{X_1, X_2, \dots, X_n\}$ denotes the partition derived from the condition attribute subset B, $X_i$ represents an equivalence class in the partition, and $[x]_D$ represents an equivalence class of the partition $U/IND(D)$ derived from the decision attribute D.

4. The method of claim 2, wherein the mutual information between the subset of condition attributes and the decision attributes comprises:

$I(B; D) = H(D) - H(D \mid B)$

where H(D) is the entropy of the decision attribute D, H(D | B) is the conditional entropy of the decision attribute D given the condition attribute subset B, and I(B; D) represents the mutual information of the attribute subset B and the decision attribute D.

5. The method of claim 1, wherein the calculating an optimal condition attribute subset using a genetic algorithm based on empirical errors of the condition attribute subset and mutual information between the condition attribute subset and the decision attribute comprises:

S53: processing the initial chromosomes by using the selection, crossover, and mutation operators of the genetic algorithm to obtain chromosomes after crossover and mutation, wherein the selection operator adopts the roulette method, the crossover operator adopts single-point crossover, and the mutation operator adopts basic bit mutation;

S54: taking the chromosomes after crossover and mutation as the initial chromosomes of the next iteration of the genetic algorithm, and repeatedly executing steps S52-S54 until the preset number of iterations is reached; and taking the condition attribute subset with the minimum expected error as the optimal condition attribute subset.

6. The method of claim 1, wherein the expected error for the subset of conditional attributes comprises:

$\min R_{reg}(B) = R_{emp}(B) + \alpha I(B; D)$;

wherein $\min R_{reg}(B)$ represents the expected error of the attribute subset B, $R_{emp}(B)$ represents the empirical error of the subset B, $I(B; D)$ represents the mutual information of the attribute subset B and the decision attribute D, and α is a hyperparameter with α ≥ 0.

7. The method of claim 1, wherein the environmental parameters associated with air quality comprise PM2.5, PM10, SO2, NO2, CO, O3, TSP, and DF.

8. The air quality prediction method based on rough set and structural risk minimization according to claim 1, characterized in that the air quality evaluation system establishes an air quality index grade evaluation system with six air quality index grades, namely excellent, good, light pollution, moderate pollution, heavy pollution, and severe pollution, according to the national standard GB 3095-2012.