CN112394137A - Intelligent calibration method for monitoring environmental air quality - Google Patents


Publication number
CN112394137A
CN112394137A (application CN201910747028.4A)
Authority
CN
China
Prior art keywords
data, data set, model, unlabeled, monitoring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910747028.4A
Other languages
Chinese (zh)
Inventor
祁柏林, 王宁, 张欣, 魏景峰, 刘闽, 杜毅明, 周晓磊, 白雪, 张镝, 陈月, 王兴刚, 范秋枫, 孟繁星, 金继鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Institute of Computing Technology of CAS
Original Assignee
Shenyang Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Institute of Computing Technology of CAS filed Critical Shenyang Institute of Computing Technology of CAS
Priority to CN201910747028.4A priority Critical patent/CN112394137A/en
Publication of CN112394137A publication Critical patent/CN112394137A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01N: INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N 33/00: Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N 33/0004: Gaseous mixtures, e.g. polluted air
    • G01N 33/0006: Calibrating gas analysers
    • G01N 33/0008: Details concerning storage of calibration data, e.g. in EEPROM
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08: Learning methods


Abstract

The invention relates to an intelligent calibration method for monitoring ambient air quality. Addressing the gridded monitoring requirement of the current air environment and the monitoring-data deviation caused by the low-precision sensors of current miniature monitoring instruments, the method calibrates the data monitored by the miniature monitoring instrument. Standard data from a national standard station are taken as the target learning values, and the miniature monitoring instrument is placed around a national standard station so that the learning rule is trained under the same environment. After training, the miniature monitoring instrument learns a rule close to that of the national standard station, and the new rule can improve the accuracy of its monitoring data. SO2, one of the six air-quality pollutants, is taken as the example throughout the description.

Description

Intelligent calibration method for monitoring environmental air quality
Technical Field
The invention relates to the field of air quality monitoring, in particular to an intelligent calibration method for monitoring the air quality of an environment.
Background
The quality of the ambient air is closely related to everyday life. Whether learned from the media or from other channels, environmental problems have become more serious in the present era, and addressing them is a difficult task. Realizing gridded supervision with the standard stations deployed by the state would require a large number of high-precision sensors, which are costly, and the monitoring sites are not flexible to arrange, so comprehensive supervision remains a distant goal.
Facing the policies aimed at realizing gridded supervision, many miniature monitoring instruments that are cheap and easy to deploy have appeared on the market. However, these devices share a common disadvantage: the physical characteristics of their sensors cause some deviation in the monitored data. These data errors interfere with normal environmental supervision work.
Disclosure of Invention
To reduce the data deviation caused by the physical characteristics of common sensors, the invention provides an intelligent calibration method for monitoring ambient air quality. The method can improve the accuracy of a miniature monitor to a certain extent and offers a reference for solving the problem of data deviation in miniature monitors caused by sensor physics.
The technical scheme adopted by the invention for solving the technical problems is as follows: an intelligent calibration method for monitoring the quality of ambient air comprises the following steps:
data processing: integrating data transmitted into a database by a micro monitoring instrument and national standard data to obtain marked data and unmarked data; taking national standard data as a label, and duplicating a marked data set and an unmarked data set into two parts for collaborative training;
model training: respectively training the two marked data sets on an LSTM model to obtain two models with different parameters;
collaborative training: the trained model is applied to an unlabeled data set; data with high confidence in the unlabeled data set are selected and added to the labeled data set of the other trainer, and this iterative training repeats until all parameters of the models are stable.
The data processing uses hourly SO2 data, SO2 being one of the six air-quality pollutants; national standard data serve as the training target, and the data collected by the miniature monitoring instrument serve as the input data.
The data processing comprises the following steps:
integrating the data transmitted into the database by the micro monitoring instrument with the obtained national standard data by time: where national standard data exist for the same time as data transmitted into the database by the micro monitoring instrument, the instrument data at that time are labeled data; the rest are unlabeled data;
the marked data set is copied into two parts which are respectively used as a marked data set 1 and a marked data set 2; the unlabeled data set is also duplicated as unlabeled data set 1 and unlabeled data set 2, respectively.
The labeled data are the data transmitted into the database by the miniature monitoring instrument that have corresponding national standard data at the same time, i.e. data matched to the national standard records; the unlabeled data are the miniature-monitoring-instrument data that have no corresponding national standard data.
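The data-processing step above can be sketched as follows. This is an illustrative assumption about the storage format (the patent does not specify one): instrument readings and national standard-station values are keyed by timestamp, and all names and values are hypothetical.

```python
# Sketch of the data-processing step: align micro-instrument readings with
# national standard-station data by timestamp. Readings that have a matching
# national value become labeled pairs (x, y); the rest remain unlabeled.

def split_labeled_unlabeled(instrument, national):
    """instrument: {timestamp: signal value}, national: {timestamp: SO2 value}."""
    labeled = {t: (x, national[t]) for t, x in instrument.items() if t in national}
    unlabeled = {t: x for t, x in instrument.items() if t not in national}
    return labeled, unlabeled

instrument = {"2019-08-14 01:00": 0.41, "2019-08-14 02:00": 0.39, "2019-08-14 03:00": 0.44}
national   = {"2019-08-14 01:00": 12.0, "2019-08-14 03:00": 14.0}

labeled, unlabeled = split_labeled_unlabeled(instrument, national)
# Each set is then duplicated into two copies in preparation for co-training:
labeled_1, labeled_2 = dict(labeled), dict(labeled)
unlabeled_1, unlabeled_2 = dict(unlabeled), dict(unlabeled)
```

The 02:00 reading has no national value at that hour, so it lands in the unlabeled set.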
The model training adopts a long short-term memory (LSTM) network; the neural network has three layers.
The cooperative training comprises the following steps:
1) Apply model 1, trained on labeled data set 1, to unlabeled data set 1 and predict with the model; perform confidence detection on each pair (xμ1, yμ1) from unlabeled data set 1 against the data of labeled data set 1:
Find, in labeled data set 1, the K nearest neighbors of the pair (xμ1, yμ1) from unlabeled data set 1; the neighbors found form a set Z1, sorted by the difference between each neighbor's label y and the prediction yμ1 corresponding to xμ1.
Add the pair (xμ1, yμ1) from unlabeled data set 1 to labeled data set 1 and use the sorted set Z1 as the test set; train model 2 with these data and check whether the output loss becomes smaller than the loss obtained before (xμ1, yμ1) was added. A decrease indicates that the unlabeled sample (xμ1, yμ1) improves model 2, so it becomes a candidate from unlabeled data set 1. Among the candidates, the sample that decreases the loss the most is selected as the highest-confidence sample and is finally added to labeled data set 2 of model 2. Here xμ1 denotes a piece of data in unlabeled data set 1, and yμ1 is the value that the trained model 1 predicts for xμ1.
Apply model 2, trained on labeled data set 2, to unlabeled data set 2 and predict with the model; perform confidence detection on each pair (xμ2, yμ2) from unlabeled data set 2 against the data of labeled data set 2:
Find, in labeled data set 2, the K nearest neighbors of the pair (xμ2, yμ2) from unlabeled data set 2; the neighbors found form a set Z2, sorted by the difference between each neighbor's label y and the prediction yμ2 corresponding to xμ2.
Add the pair (xμ2, yμ2) from unlabeled data set 2 to labeled data set 2 and use the sorted set Z2 as the test set; train model 1 with these data and check whether the output loss becomes smaller than the loss obtained before (xμ2, yμ2) was added. A decrease indicates that the unlabeled sample (xμ2, yμ2) improves model 1, so it becomes a candidate from unlabeled data set 2. Among the candidates, the sample that decreases the loss the most is selected as the highest-confidence sample and is finally added to labeled data set 1 of model 1. Here xμ2 denotes a piece of data in unlabeled data set 2, and yμ2 is the value that the trained model 2 predicts for xμ2.
2) The above two processes alternate repeatedly until the parameters of the two trained models converge.
The format of the marking data is (x, y); x represents the data collected by the micro monitoring instrument, namely the electric signal value, and y represents the national standard data.
The K nearest neighbors are used to compute, within the neighbor set, the difference between the predicted value yμ and each neighbor value y; the K-neighbor distance is computed as

d(xμ, xi) = ||xμ − xi||2    (1)

where xμ denotes a piece of data (an electrical-signal value) in the unlabeled data set and xi is a neighbor of xμ in the labeled data set.
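The neighbor search of formula (1) can be sketched as follows. Since each sample here is a single electrical-signal value, the Euclidean norm reduces to an absolute difference; the function name and the example values are illustrative, not from the patent.

```python
# Sketch of the K-neighbor search of formula (1): for an unlabeled signal
# value x_u, find its K nearest neighbors among the labeled pairs (x, y),
# using the distance d(x_u, x_i) = |x_u - x_i| in the 1-D case.

def k_nearest(x_u, labeled_pairs, k):
    """labeled_pairs: list of (x, y); returns the k pairs whose x is closest to x_u."""
    return sorted(labeled_pairs, key=lambda p: abs(x_u - p[0]))[:k]

labeled_pairs = [(0.3, 10.0), (0.4, 12.0), (0.55, 15.0), (0.9, 25.0)]
neighbors = k_nearest(0.42, labeled_pairs, k=2)
print(neighbors)  # → [(0.4, 12.0), (0.3, 10.0)]
```

The returned pairs form the sorted set Z used later as the test set during confidence detection.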
The invention has the following beneficial effects and advantages:
1. High data utilization. Through semi-supervised learning, both labeled and unlabeled data can be fully exploited. Although a large amount of data can now be obtained, very little of it truly meets the requirements for direct use. Unlabeled data alone may not give the best model, but manual labeling is very laborious. The semi-supervised learning method avoids this drawback and makes full use of the data.
2. Multiple angles. Co-training is a multi-angle training method: viewing a problem from different angles tends to give a more comprehensive picture, and co-training seeks the same effect here. Splitting the data set into two identical parts means training the model from different angles, so the final fitting and generalization capability of the model is better.
Drawings
FIG. 1 is a comparison of before and after data processing.
Fig. 2 is a flowchart of the overall algorithm.
FIG. 3 is a diagram of the structure of the LSTM.
Detailed Description
The present invention will be described in further detail with reference to examples.
As shown in fig. 2, an intelligent calibration method for monitoring ambient air quality includes the following steps:
step 1: and (6) data processing. The data transmitted into the database by the existing miniature monitoring instrument and the known national standard data are processed and then divided. And (3) taking national standard data as a label, copying the marked data set and the unmarked data set into two parts, and preparing for cooperative training in the early stage.
Step 2: and (5) training a model. Training the marked two data sets on an LSTM model to obtain two models with different parameters;
and step 3: and (5) detecting confidence. For data x in each unmarked datasetμPredicting y with trained modelsμThen find x in the tagged datasetμK proximity values that are close, and grouping the K proximity values into a set, the set being ordered by K-proximity distance. K-neighborhood distance is the predicted value y in calculating the set of neighborhood valuesuA difference value with each neighboring value y; the K-vicinity calculation formula is shown in formula (1). We will note the set of K neighbors sorted by K-neighbor distance as Z. We treat this set as a new test set, with each piece of prediction data (x)μ,yμ) And adding the new training set into the labeled data set to serve as a new training set, and training the corresponding model by using the new training set and the test set. Finding a (x) that minimizes the model loss functionμ,yμ) Of (x)μ,yμ) The data with the highest confidence.
Step 4: co-training. Per step 2, the two data sets train models with different parameters. Each trained model is then applied to its unlabeled data set; the high-confidence samples are selected and added to the labeled data set of the other trainer. This iterative training repeats until the labeled data sets stabilize, that is, until no sample in the unlabeled data sets meets the confidence requirement for being added to a labeled data set, which means the model training is stable.
Data processing: hourly SO2 data, SO2 being one of the six air-quality pollutants, are used here. The monitoring data of a national standard station serve as the training target, and the data collected by the miniature monitoring instrument serve as the input. Because the available national standard data are limited, some of the miniature-monitoring-instrument data have no corresponding national standard data; such data are treated as unlabeled. Both the labeled and the unlabeled data are duplicated into two copies in preparation for subsequent co-training.
The miniature monitoring instrument is a small air-quality monitoring device, for example the small instruments commonly seen mounted above roads at intersections. The national monitoring network requires many sites, is inflexible, and is costly. Compared with the national standard stations, the miniature monitoring instrument is cheap and flexible to deploy, but its sensors are less accurate.
Model training method: a long short-term memory (LSTM) network is used here. Experiments show that the effect is best with a three-layer network. Two different rules are trained from the two data sets.
The method for collaborative training comprises the following steps:
we apply the model trained with labeled data set 1 to unlabeled data set 1, and by model prediction, we predict (x) from unlabeled data 1μ1,yμ1) And carrying out confidence detection on the data in the labeled data 1, screening out data with high confidence, adding the data into a labeled data set 2 of another training model, and continuously and repeatedly training in a crossed manner until the last two labeled data sets are stable, namely, data which does not meet the confidence condition in the unlabeled data set 1, namely data which can reduce the loss function is not available, so that no new data is added into the labeled data set 2 to be stable, and each parameter of the model is stable (the parameter is not changed).
We apply the model trained with labeled data set 2 to unlabeled data set 2, and by model prediction, we predict (x) from unlabeled data 2μ2,yμ2) Carrying out confidence detection with the data in the labeled data 2, screening out the data with high confidence, adding the data into the labeled data set 1 of another training model, and repeating the training continuously and alternately until the last two labeled data sets are stable, namely, the data which does not meet the confidence condition in the unlabeled data set 2, namely the data which does not meet the confidence condition in the unlabeled data setThe existence of data enables the loss function to be reduced, so that the mark data set 1 has no new data to be added, and is stable, which also means that the parameters of the model are stable (the parameters are not changed).
Step 1: the data are preprocessed, and values with large fluctuations are pulled back to an upper limit. The threshold is set to 2.3: when the difference between data points exceeds 2.3, the abnormal value is adjusted so that the difference stays within 2.3.
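Step 1 above can be sketched as follows. The patent does not state exactly how the pull-back is done, so clipping each reading relative to the previous accepted value is an interpretation assumption; the series values are illustrative.

```python
# Sketch of the outlier treatment in step 1: a reading whose difference from
# the previous value exceeds the 2.3 threshold is pulled back to the limit.
# NOTE: clipping against the previous accepted value is an assumption; the
# patent only says the difference is kept within 2.3.
THRESHOLD = 2.3

def clip_fluctuations(series, threshold=THRESHOLD):
    out = [series[0]]
    for v in series[1:]:
        diff = v - out[-1]
        if diff > threshold:
            v = out[-1] + threshold      # pull a spike down to the upper limit
        elif diff < -threshold:
            v = out[-1] - threshold      # pull a drop up to the lower limit
        out.append(v)
    return out

print(clip_fluctuations([10.0, 10.5, 15.0, 10.8]))
```

The jump from 10.5 to 15.0 exceeds 2.3, so the third value is pulled back to 10.5 + 2.3.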
Step 2: the processed labeled and unlabeled data are each copied into two sets, and a model is trained on each labeled set; the LSTM model is used here. Experimentally, both models were set to three layers.
Step 3: the models trained on the two labeled data sets are applied. The model trained on labeled data set 1 is applied to unlabeled data set 1: for each sample xμ1 in the unlabeled data set, yμ1 is predicted with the trained model, and the K nearest neighbors of xμ1 are then found in the labeled data set and gathered into a set ordered by K-neighbor distance. The K-neighbor distance compares the predicted value yμ1 with each neighbor value y; the calculation is given by formula (1). At the same time, the model trained on labeled data set 2 is applied to unlabeled data set 2: for each sample xμ2, yμ2 is predicted with the trained model, and the K nearest neighbors of xμ2 in the labeled data set are gathered into a set.
d(xμ, xi) = ||xμ − xi||2    (1)

where xμ is an electrical-signal value in the unlabeled data set and xi is a neighbor of xμ in the labeled data set.
Step 4: following step 3, confidence detection is performed. The obtained K-neighbor sets are first sorted by distance, and it is checked whether the K-neighbor set together with the original labeled data set reduces the cost given by the loss function on the labeled data set. The loss function used here is the mean squared error. If the loss decreases, the sample in the K-neighbor set that decreases the loss the most is found; the highest-confidence sample corresponding to labeled data set 1 is added to labeled data set 2, and likewise the highest-confidence sample corresponding to labeled data set 2 is added to labeled data set 1. The two data sets iterate in this loop until both labeled data sets remain stable or the maximum number of iterations is reached; the algorithm then ends and the currently updated optimal parameters are saved. The trained model here has three layers: the first and second layers each have an output dimension of 22, and the third layer has a one-dimensional output. The maximum number of iterations is set to 1000.
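The confidence test of steps 3-4 can be sketched as follows. A least-squares line stands in for retraining the LSTM (an assumption made purely to keep the sketch self-contained); `best_candidate`, the stub fit, and all data are illustrative names and values, not the patent's implementation.

```python
# Sketch of the confidence test: an unlabeled candidate (x, predicted y) is
# kept only if adding it lowers the mean squared error on the K-neighbor
# test set; the candidate that lowers it the most is the highest-confidence
# sample. A linear least-squares fit is a stand-in for LSTM retraining.

def mse(model, test_set):
    return sum((model(x) - y) ** 2 for x, y in test_set) / len(test_set)

def fit_linear(pairs):
    """Least-squares line y = a*x + b, standing in for model retraining."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs); sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs); sxy = sum(x * y for x, y in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda x: a * x + b

def best_candidate(labeled, candidates, test_set):
    """Return the candidate pair that lowers the loss the most, or None."""
    base = mse(fit_linear(labeled), test_set)
    scored = [(mse(fit_linear(labeled + [c]), test_set), c) for c in candidates]
    best_loss, best = min(scored)
    return best if best_loss < base else None

labeled = [(0.3, 10.0), (0.9, 25.0)]
test_set = [(0.4, 12.0), (0.5, 14.5)]        # the sorted K-neighbor set Z
candidates = [(0.6, 16.5), (0.6, 30.0)]      # (x, predicted y) pairs
print(best_candidate(labeled, candidates, test_set))  # → (0.6, 16.5)
```

The consistent candidate (0.6, 16.5) lowers the test loss and is selected; the outlier prediction (0.6, 30.0) would raise it and is rejected.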
Figure 1 shows a comparison of the data before and after processing. Because the monitoring data vary smoothly, the processing ensures that the difference between data points does not exceed 2.3; this threshold is the optimal value found by experimental verification.
After processing, the whole training can be carried out according to the overall algorithm of fig. 2: the labeled data are first copied into two sets, and an LSTM model is trained on each. The LSTM model has three control gates: a forget gate, an input gate and an output gate. The LSTM structure is shown in fig. 3, and the governing formulas are as follows:
Forget gate:
ft = σ(Wf·[ht-1, xt] + bf)    (2)

In the formula:
ft: output of the forget gate
σ: activation function; sigmoid is used here
Wf: weight matrix of the forget gate
ht-1: output of the previous moment
xt: input of the current moment
bf: bias term of the forget gate

Input gate and cell state:
Ct = ft * Ct-1 + it * C̃t    (3)
it = σ(Wi·[ht-1, xt] + bi)    (4)
C̃t = tanh(Wc·[ht-1, xt] + bc)    (5)

In the formulas:
Ct: cell state at the current moment t
Ct-1: cell state of the previous moment
it: output of the input gate
C̃t: new candidate cell state
Wi, bi: weight matrix and bias of the input gate
Wc, bc: weight matrix and bias of the candidate cell state
tanh: activation function

Output gate:
Ot = σ(Wo·[ht-1, xt] + bo)    (6)
ht = Ot * tanh(Ct)    (7)

In the formulas:
Ot: sigmoid output of the output gate
Wo, bo: weight matrix and bias of the output gate
ht: output of the current moment
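A minimal NumPy sketch of one LSTM step implementing formulas (2)-(7) follows. The dimensions and the random weights are illustrative stand-ins; in the actual method, Wf, Wi, Wc, Wo and the biases would be learned from the labeled SO2 data.

```python
# Sketch of one LSTM time step following formulas (2)-(7).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # (2) forget gate
    i_t = sigmoid(W_i @ z + b_i)           # (4) input gate
    c_tilde = np.tanh(W_c @ z + b_c)       # (5) candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde     # (3) cell-state update
    o_t = sigmoid(W_o @ z + b_o)           # (6) output gate
    h_t = o_t * np.tanh(c_t)               # (7) hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
hidden, inp = 4, 3                         # illustrative sizes, not the 22-unit model
W = lambda: rng.standard_normal((hidden, hidden + inp))
b = lambda: np.zeros(hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.standard_normal(inp), h, c, W(), b(), W(), b(), W(), b(), W(), b())
```

Because ht = Ot * tanh(Ct) with Ot in (0, 1), every component of the hidden state stays strictly inside (-1, 1).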
The trained models are applied to unlabeled data sets 1 and 2 respectively. Unlabeled data set 1 is predicted with the LSTM model trained on labeled data set 1; the K nearest neighbors in labeled data set 1 are then found to form a K-neighbor set, distances are computed with formula (1), and the set is sorted by distance. The model is retrained with the K-neighbor data and labeled data set 1; if the loss function can be reduced, the sample in the K-neighbor set that reduces the model loss the most is found and added to labeled data set 2. The iteration repeats in this way until the model parameters are stable, and the calibration method finishes.

Claims (8)

1. An intelligent calibration method for monitoring the quality of ambient air is characterized by comprising the following steps:
data processing: integrating data transmitted into a database by a micro monitoring instrument and national standard data to obtain marked data and unmarked data; taking national standard data as a label, and duplicating a marked data set and an unmarked data set into two parts for collaborative training;
model training: respectively training the two marked data sets on an LSTM model to obtain two models with different parameters;
collaborative training: the trained model is applied to an unlabeled data set; data with high confidence in the unlabeled data set are selected and added to the labeled data set of the other trainer, and this iterative training repeats until all parameters of the models are stable.
2. The intelligent calibration method for monitoring the air quality of the environment as claimed in claim 1, wherein the data processing uses hourly SO2 data, SO2 being one of the six air-quality pollutants, with national standard data as the training target and data collected by the miniature monitoring instrument as the input data.
3. The intelligent calibration method for monitoring the quality of the ambient air as claimed in claim 1, wherein the data processing comprises the following steps:
integrating the data transmitted into the database by the micro monitoring instrument with the obtained national standard data by time: where national standard data exist for the same time as data transmitted into the database by the micro monitoring instrument, the instrument data at that time are labeled data; the rest are unlabeled data;
the marked data set is copied into two parts which are respectively used as a marked data set 1 and a marked data set 2; the unlabeled data set is also duplicated as unlabeled data set 1 and unlabeled data set 2, respectively.
4. The intelligent calibration method for monitoring the quality of the ambient air according to claim 1, wherein the labeled data are the data transmitted into the database by the miniature monitoring instrument that have corresponding national standard data at the same time, i.e. data matched to the national standard records; the unlabeled data are the miniature-monitoring-instrument data that have no corresponding national standard data.
5. The intelligent calibration method for monitoring the quality of the ambient air as claimed in claim 1, wherein the model training adopts a long-short term memory network model, and the neural network has three layers.
6. The intelligent calibration method for monitoring the quality of the ambient air as claimed in claim 1, wherein the cooperative training comprises the following steps:
1) model 1, trained on labeled data set 1, is applied to unlabeled data set 1 and predictions are made with the model; confidence detection is performed on each pair (xμ1, yμ1) from unlabeled data set 1 against the data of labeled data set 1:
the K nearest neighbors of the pair (xμ1, yμ1) from unlabeled data set 1 are found in labeled data set 1; the neighbors form a set Z1, sorted by the difference between each neighbor's label y and the prediction yμ1 corresponding to xμ1;
the pair (xμ1, yμ1) from unlabeled data set 1 is added to labeled data set 1 and the sorted set Z1 is used as the test set; model 2 is trained with these data, and the output loss is compared with the loss obtained before (xμ1, yμ1) was added; a decrease indicates that the unlabeled sample (xμ1, yμ1) improves model 2, so it becomes a candidate from unlabeled data set 1; among the candidates, the sample that decreases the loss the most is selected as the highest-confidence sample and is finally added to labeled data set 2 of model 2; here xμ1 denotes a piece of data in unlabeled data set 1, and yμ1 is the value predicted for xμ1 by the trained model 1;
model 2, trained on labeled data set 2, is applied to unlabeled data set 2 and predictions are made with the model; confidence detection is performed on each pair (xμ2, yμ2) from unlabeled data set 2 against the data of labeled data set 2:
the K nearest neighbors of the pair (xμ2, yμ2) from unlabeled data set 2 are found in labeled data set 2; the neighbors form a set Z2, sorted by the difference between each neighbor's label y and the prediction yμ2 corresponding to xμ2;
the pair (xμ2, yμ2) from unlabeled data set 2 is added to labeled data set 2 and the sorted set Z2 is used as the test set; model 1 is trained with these data, and the output loss is compared with the loss obtained before (xμ2, yμ2) was added; a decrease indicates that the unlabeled sample (xμ2, yμ2) improves model 1, so it becomes a candidate from unlabeled data set 2; among the candidates, the sample that decreases the loss the most is selected as the highest-confidence sample and is finally added to labeled data set 1 of model 1; here xμ2 denotes a piece of data in unlabeled data set 2, and yμ2 is the value predicted for xμ2 by the trained model 2;
2) the above processes alternate repeatedly until the parameters of the two trained models converge.
7. The intelligent calibration method for monitoring the quality of the ambient air according to claim 5, wherein the format of the tag data is (x, y); x represents the data collected by the micro monitoring instrument, namely the electric signal value, and y represents the national standard data.
8. The intelligent calibration method for monitoring the quality of the ambient air according to claim 5, wherein the K nearest neighbors are used to compute, within the neighbor set, the difference between the predicted value yμ and each neighbor value y; the K-neighbor distance is computed as

d(xμ, xi) = ||xμ − xi||2    (1)

where xμ denotes a piece of data (an electrical-signal value) in the unlabeled data set and xi is a neighbor of xμ in the labeled data set.
CN201910747028.4A 2019-08-14 2019-08-14 Intelligent calibration method for monitoring environmental air quality Pending CN112394137A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910747028.4A CN112394137A (en) 2019-08-14 2019-08-14 Intelligent calibration method for monitoring environmental air quality


Publications (1)

Publication Number Publication Date
CN112394137A true CN112394137A (en) 2021-02-23

Family

ID=74602708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910747028.4A Pending CN112394137A (en) 2019-08-14 2019-08-14 Intelligent calibration method for monitoring environmental air quality

Country Status (1)

Country Link
CN (1) CN112394137A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108469273A (en) * 2018-02-27 2018-08-31 济宁中科云天环保科技有限公司 High in the clouds data joint debugging calibration method based on machine learning algorithm
CN109558971A (en) * 2018-11-09 2019-04-02 河海大学 Intelligent landslide monitoring device and method based on LSTM shot and long term memory network


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
孟想: "Design and implementation of an air-quality sensor calibration method based on crowd sensing" (基于群智感知的空气质量传感器校准方法设计与实现), 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》 *
蔡毅 et al.: "A survey of semi-supervised ensemble learning" (半监督集成学习综述), 《计算机化学》 *
邱云飞 et al.: "An intent classification optimization method based on co-training" (基于协同训练的意图分类优化方法), 《现代情报》 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210223