AU2019100348A4

AU2019100348A4 - A specified gas sensor correction method based on locally weighted regression algorithm

Info

Publication number: AU2019100348A4
Application number: AU2019100348A
Authority: AU
Inventors: Tong Chang; Zhiyi HUANG; Zihao Xue; Zhicheng Yang; Shuhe Zhang; Chongbo ZHAO
Original assignee: Huang Zhiyi Miss
Current assignee: Huang Zhiyi Miss
Priority date: 2019-04-04
Filing date: 2019-04-04
Publication date: 2019-05-09
Anticipated expiration: 2027-04-04

Abstract

This application relates to a correction method for a specified gas sensor based on a locally weighted regression algorithm. Local linear regression is performed on accurate carbon monoxide concentration data and the sensor data to be corrected. The data are the hourly harmful-gas measurements from the Wanliu Air Quality Monitoring Station and the real-time harmful-gas measurements from the sensor, collected from July to October 2017. The carbon monoxide concentration, temperature, humidity and the sensor data to be corrected are extracted, and the temperature and humidity data are normalized. Local linear regression is then applied to the processed data and its parameters are optimized. After correction, the sensor data stay close to the actual harmful gas concentration at the collection site under different environmental conditions, with the error kept within a small range. The method offers high precision and good applicability and can meet the need for accurate measurement of the actual harmful gas concentration.

[Figure 1: flowchart with the steps Start, Get Sensor Data / Get Monitoring Station Data, Data Integrating, Data Training, Test and Optimization, End]

Description

A specified gas sensor correction method based on locally weighted regression algorithm

TECHNOLOGY FIELD

The application belongs to the field of information processing and specifically relates to a specified gas sensor correction method based on a locally weighted regression algorithm.

BACKGROUND

With the advance of modernization and socialization, the demand for precise detection of gas components in the environment, and for lower-cost instruments, is gradually increasing. For instance, the sulfur dioxide generated by automobile exhaust emissions in daily life, the formaldehyde produced by house decoration, the raw material silane in the semiconductor film deposition process, and the oxygen concentration in aerospace oxygen supply systems all require us to measure the concentration of the specified gas accurately while controlling the cost of detection. Meanwhile, gas sensors are susceptible to drift due to environmental influences, so they need to be calibrated periodically to meet our needs.

2019100348 04 Apr 2019

Current methods for measuring the concentration of a specified gas in engineering practice include DOAS (Differential Optical Absorption Spectroscopy), which is a long-path method, and the sensor method. Both methods have their pros and cons. The advantage of the long-path method is its high measurement accuracy, but the instrument it requires is very expensive. In addition, a trained professional is needed to operate the instrument, and the detection cycle is long. The sensor method uses metal oxide semiconductor and electrochemical sensors, so its cost is low. However, its accuracy is lower than that of DOAS, because it is affected by humidity and temperature, which cause drift, and the consistency between sensors of the same batch is unsatisfactory. When the required measurement accuracy is relatively low, a sensor with a small nonlinear error can be approximated as linear within a certain range, which greatly simplifies the measurement^[1]. Therefore, to monitor the concentration of a harmful gas in real time with the sensor method, the sensor must first be corrected against the accurate data obtained by DOAS, using the locally weighted regression algorithm.


SUMMARY

This application provides a gas sensor correction method based on a locally weighted regression algorithm. While maintaining monitoring efficiency, it establishes a correction model so that the gas concentration monitored by the sensor better reflects the true value, which improves the validity of the gas data.

The steps to establish the correction model, as shown in Figure 1, are as follows:

(1) Get Sensor Data and Get Monitoring Station Data
(2) Data Integrating
(3) Data Training
(4) Test and Optimization

2.1 Data Processing

2.1.1 Sensor Data

Sensors developed by the Institute of Automation of the Chinese Academy of Sciences provide real-time monitoring of gas concentration, temperature and humidity data. This application obtains these gas data and corrects them. Considering the influence of temperature and humidity, these two kinds of data are extracted together with the gas data to generate a TXT file for subsequent use.

Since the temperature and humidity values in the original data are large compared with the gas concentration values, using them directly in the weighting introduces a large error. Therefore, the temperature and humidity values are normalized to the [0, 1] interval.

2.1.2 Real Data

Because the long-path detector used by the Wanliu monitoring station is highly precise, its measurements can be used as the real data in the locally weighted regression algorithm of this application.

The data of the Wanliu monitoring station are already generated in units of hours. This application extracts the corresponding gas concentration data by date and hour in the same way, and also generates a TXT file from these real data for subsequent use.

2.1.3 Data Integrating

The subsequent data processing integrates the sensor data with the monitoring station data according to the date and hour. In the integration process, unpaired data are discarded.
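The integration step can be sketched as a join on the (date, hour) key. This is a minimal illustration only: the function name `integrate`, the tuple layout and the sample rows are assumptions made for the example and are not taken from the application's program.

```python
def integrate(sensor_rows, station_rows):
    """Pair hourly sensor rows with station rows on (date, hour); drop unpaired rows.

    sensor_rows:  iterable of (date, hour, sensor_value, temperature, humidity)
    station_rows: iterable of (date, hour, co_value)
    """
    station = {(date, hour): co for date, hour, co in station_rows}
    paired = []
    for date, hour, value, temp, hum in sensor_rows:
        key = (date, hour)
        if key in station:  # keep only hours present in both sources
            paired.append((date, hour, value, temp, hum, station[key]))
    return sorted(paired)  # chronological order by (date, hour)


# Illustrative rows only (hypothetical gaps: hour 1 missing at the station,
# hour 3 missing at the sensor).
sensor = [
    ("20170709", 0, 2.593333, 0.653315, 0.788614),
    ("20170709", 1, 2.663504, 0.657166, 0.796238),
    ("20170709", 2, 2.632833, 0.657578, 0.791769),
]
station = [
    ("20170709", 0, 0.9),
    ("20170709", 2, 0.9),
    ("20170709", 3, 0.8),
]
merged = integrate(sensor, station)
print(merged)
```

Only the hours present in both sources survive, which corresponds to discarding the unpaired data.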

2.2 Algorithm Introduction

This application provides a method for correcting inaccurate data generated by a sensor. It is based on a locally weighted regression algorithm.

In order to make the test data close to the real data and follow its changes, in the process of data fitting at a certain point, not only the difference between the test value and the real value but also the influence of other values near the point should be considered. The closer the nearby point is to the test point, the stronger the ability to predict the change law of the point. Conversely, the farther the distance is, the weaker the prediction ability becomes.

The locally weighted regression algorithm was born from this idea and is a non-parametric learning algorithm. It is based on the general linear regression algorithm, but assigns a weight to each point near the test point, which reflects the contribution of that nearby point to predicting the test point. The closer a point is to the test point, the larger its weight; the farther away it is, the smaller its weight. Here, the Gaussian kernel formula is used to express the weight.

$$\omega = \exp\left(-\frac{d^2}{2\tau^2}\right) \tag{1}$$

In formula (1), the parameter d measures the distance between the test point and the nearby point. The parameter τ controls the speed at which the weight of nearby points decreases with distance. The smaller τ is, the faster the decline; the larger τ is, the slower the decline.

Therefore, when τ is large, distant points receive almost the same weight as nearby points, the local character of the fit is lost, and the value corresponding to the test point cannot be well predicted, so that the fitting curve cannot accurately reflect the real data and under-fitting occurs. When τ is small, only the points closest to the test point contribute, which fits the training set very closely; this prediction ability is too prominent on the training set and leads to fitting deviation on the test data, that is, over-fitting.
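The effect of τ on the weight can be checked numerically. The sketch below assumes the Gaussian kernel form with 2τ² in the denominator, as in formula (14); the function name `gaussian_weight` and the sample distances are illustrative assumptions.

```python
import math


def gaussian_weight(d, tau):
    """Weight of a training point at distance d from the test point."""
    return math.exp(-d ** 2 / (2 * tau ** 2))


# Compare how quickly the weight falls off with distance for several tau values.
for tau in (0.1, 0.5, 1.0):
    weights = [gaussian_weight(d, tau) for d in (0.0, 0.1, 0.5)]
    print(tau, [round(w, 4) for w in weights])
```

For a fixed distance the weight shrinks as τ decreases, so a small τ restricts the fit to the nearest training points.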

In summary, the value of τ is the key to the algorithm, and it is also the key to the performance of the correction model provided by this application. This application divides the data into a training set and a test set at a ratio of 5:1, implements the algorithm, and performs locally weighted regression on the training set with different values of the parameter τ to find a suitable τ value.

2.3 Algorithm Test

Through data training, a set of candidate τ values has been obtained. The algorithm test substitutes each generated τ value as the parameter, uses the test data as the test points for locally weighted regression, and then verifies whether that τ value makes the fitted curve of the test set reflect the true values as accurately as possible.

2.4 Optimization

The purpose of the optimization is to select the τ value with the smallest error while avoiding over-fitting and under-fitting. The optimization method uses cross-validation to perform repeated tests, calculates the average error for each τ value, compares the average errors, and selects the τ value with the smallest error as the optimized result.


DESCRIPTION OF DRAWINGS

Figure 1 is a flowchart of the steps to establish a correction model.

Figure 2 is a block diagram of data processing.

Figure 3 is a flowchart of locally weighted regression.

Figure 4 illustrates the weight value distribution map of the training set.

Figure 5 illustrates the chart of the training data.

Figure 6 illustrates the chart of the predicted data.

Figure 7 illustrates the flowchart of Cross-validation method.

Figure 8 illustrates the block diagram of optimization.

DESCRIPTION OF EMBODIMENT

4.1 Data Processing

In the acquisition of the original data, the data of the unreliable sensor and of Wanliu district (reliable data) are each read from the corresponding original TXT file.

The sensor data provided by the Institute of Automation reflect the real-time changes of various meteorological indicators: the gas concentration, temperature, humidity and other values are updated several times per second, whereas the data of the Wanliu monitoring station, which are closer to the real values, are updated only once per hour. Therefore, during data processing the imbalance of the update rates must be overcome to realize a one-to-one correspondence between the sensor data and the monitoring station data.

Analysis of a large amount of data shows that the sensor readings do not change much within a one-hour span, with only a small, negligible drift, so they can be averaged to represent that hour and put in one-to-one correspondence with the true values. However, comparison with the actual values shows that the gas data detected by the sensor are unreliable: the deviation between the sensor data and the Wanliu data is very large, and this deviation is related to temperature and humidity.
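The hourly averaging described above can be sketched as a group-by on (date, hour). The function name `hourly_average`, the tuple layout and the sample readings are assumptions for illustration, not the application's actual file format.

```python
from collections import defaultdict


def hourly_average(readings):
    """Average raw sensor readings per (date, hour).

    readings: iterable of (date, hour, sensor_value, temperature, humidity)
    returns:  chronologically sorted list of
              (date, hour, mean_value, mean_temperature, mean_humidity)
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])  # value, temp, hum, count
    for date, hour, value, temp, hum in readings:
        acc = sums[(date, hour)]
        acc[0] += value
        acc[1] += temp
        acc[2] += hum
        acc[3] += 1
    return [
        (date, hour, s[0] / s[3], s[1] / s[3], s[2] / s[3])
        for (date, hour), s in sorted(sums.items())
    ]


# Illustrative readings only: two samples in hour 0, one sample in hour 1.
raw = [
    ("20170709", 0, 2.5, 29.0, 61.0),
    ("20170709", 0, 2.7, 29.2, 62.0),
    ("20170709", 1, 2.6, 29.1, 62.5),
]
hourly = hourly_average(raw)
print(hourly)
```

Each output row then corresponds to exactly one hourly record of the monitoring station.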


Therefore, the data must be processed in consideration of the influence of temperature and humidity. In the process of curve fitting, temperature and humidity must exist as independent variables.

However, the specific values of the temperature and humidity monitored by the sensor are not in the same magnitude as the specific values of the gas concentration and need to be normalized. The normalization formula is as follows:

$$\text{normalized value} = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{2}$$

In formula (2), the $X_{\min}$ and $X_{\max}$ values of the temperature and humidity data are obtained by traversing all the data to be processed. This rule is intended to reflect the changes of temperature and humidity with the seasons, and it contributes to the accuracy of the regression results.
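A minimal sketch of formula (2) follows. The function name `min_max_normalize`, the handling of a constant column and the sample temperatures are assumptions added for the example.

```python
def min_max_normalize(values):
    """Scale values into [0, 1] using formula (2).

    A constant column (max == min) is mapped to 0.0 to avoid division by zero;
    this fallback is an assumption, not something the source specifies.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


temperatures = [29.02, 29.08, 29.43, 18.5, 35.1]  # illustrative values only
normalized = min_max_normalize(temperatures)
print(normalized)
```

Because the minimum and maximum are taken over the whole data set, the normalized values keep the relative ordering of the raw readings across the full collection period.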

Spurious extreme values of temperature and humidity, together with abnormal data (such as a sudden change of temperature or humidity in one second that returns to normal in the next second), need to be eliminated.


The significance of normalization is that it preserves the relative size of the temperature and humidity values while avoiding the large errors and additional abnormal data that would appear after curve fitting if the independent variables differed greatly in magnitude. Specifically, data processing is divided into the following steps:

First, read the sensor data, extract the temperature, humidity and gas concentration data to be corrected, and calculate the hourly average of the data collected by the sensor as the original data value corresponding to the monitoring station data. The data comprise twenty-four sets per day from July to October, arranged in the order of month, day, and hour, which facilitates subsequent comparison. While traversing the data to generate a TXT file, the maximum and minimum values of temperature and humidity in the sensor data are calculated for use in normalization. Data read from the sensor are shown in Table 1.

Table 1 Data read from the sensor

Date	Hour	Sensor Value	Temperature	Humidity
20170709	0	2.593333333	29.02226496	61.77863248
20170709	1	2.663504274	29.07833333	62.2482906
20170709	2	2.632832618	29.08433476	61.97296137
20170709	3	2.598969957	29.11918455	61.77682403
20170709	4	2.584188034	29.12307692	61.98974359
20170709	5	2.594316239	29.16589744	62.6508547
20170709	6	2.629484979	29.26154506	62.29785408
20170709	7	2.67021645	29.43134199	61.81688312

Second, the hourly gas concentration data monitored by the monitoring station are extracted as the real values; the data amount is likewise twenty-four sets per day from July to October. Data read from the monitoring station are shown in Table 2.


Table 2 Data read from the monitoring station

Date	Hour	Value
20170701	0	0.8
20170701	1	0.8
20170701	2	0.8
20170701	3	0.8
20170701	4	0.8
20170701	5	0.9
20170701	6	1
20170701	7	0.9
20170701	8	0.9

Third, according to formula (2), the hourly averaged temperature and humidity monitored by the sensor are normalized into the [0, 1] interval, and the modified data are written back to the original file.

Finally, the sensor data and the monitoring station data, arranged in chronological order, are compared one by one; paired data are kept, while unpaired data (for example, sensor data may be missing for some hours and thus cannot be matched to the station data) are discarded. The file is then rewritten with its columns arranged in the order of the independent and dependent variables required by the locally weighted regression program. Table 4 shows the resulting data.


Figure 2 shows the complete data processing flow described above.

Table 3 shows the normalized sensor data.

Table 3 Normalized sensor data

Date	Hour	Sensor Value	Temperature	Humidity	CO Value
20170709	0	2.593333	0.653315	0.788614	0.9
20170709	1	2.663504	0.657166	0.796238	1.1
20170709	2	2.632833	0.657578	0.791769	0.9
20170709	3	2.59897	0.659971	0.788585	0.8
20170709	4	2.584188	0.660239	0.792041	0.8
20170709	5	2.594316	0.66318	0.802774	0.9
20170709	6	2.629485	0.669749	0.797043	1
20170709	7	2.670216	0.681411	0.789235	1
20170709	8	2.611325	0.710849	0.779706	1

It should be emphasized that the values monitored by the sensor and the values obtained by the monitoring station cannot be compared directly as absolute values, because they are reported in different units; they are comparable only in relative size, and that comparison contains a certain error. For example, between the 2nd and the 8th sets of data in Table 4, the relative order of the sensor data and of the monitoring station data is reversed, which shows that the sensor data deviate under the influence of temperature and humidity. This deviation is what this application aims to correct.


Table 4 The resulting data

Date	Hour	x0	Sensor Value	Temperature	Humidity	CO Value
20170709	0	1	2.593333	0.653315	0.788614	0.9
20170709	1	1	2.663504	0.657166	0.796238	1.1
20170709	2	1	2.632833	0.657578	0.791769	0.9
20170709	3	1	2.59897	0.659971	0.788585	0.8
20170709	4	1	2.584188	0.660239	0.792041	0.8
20170709	5	1	2.594316	0.66318	0.802774	0.9
20170709	6	1	2.629485	0.669749	0.797043	1
20170709	7	1	2.670216	0.681411	0.789235	1
20170709	8	1	2.611325	0.710849	0.779706	1

4.2 Core Algorithm

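The objective formula ^[2] is defined as:

$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_{n-1} x_{n-1} = \sum_{i=0}^{n-1} \theta_i x_i = \theta^T x \tag{3}$$

Our goal is to minimize the cost formula:

$$\sum_{i=1}^{m} w^{(i)} \left[ y^{(i)} - \theta^T x^{(i)} \right]^2 \tag{4}$$

Change it to the representation of linear algebra:

$$\sum_{i=1}^{m} w^{(i)} \left[ y^{(i)} - \theta^T x^{(i)} \right]^2 = (X\theta - Y)^T W (X\theta - Y) \tag{5}$$

$$W = \begin{bmatrix} w^{(1)} & & & 0 \\ & w^{(2)} & & \\ & & \ddots & \\ 0 & & & w^{(m)} \end{bmatrix} \tag{6}$$

W is a diagonal matrix of m×m dimensions.

$$X = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_{n-1}^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_{n-1}^{(2)} \\ \vdots & \vdots & & \vdots \\ 1 & x_1^{(m)} & \cdots & x_{n-1}^{(m)} \end{bmatrix} \tag{7}$$

X is the input matrix of m×n dimensions.

$$Y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \tag{8}$$

Y is the result vector of m×1 dimensions.

$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_{n-1} \end{bmatrix} \tag{9}$$

θ is the parameter vector of n×1 dimensions.

Define the loss formula J(θ) as:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} w^{(i)} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 = \frac{1}{2} (X\theta - Y)^T W (X\theta - Y) \tag{10}$$

$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2} \frac{\partial}{\partial \theta} \left[ (X\theta - Y)^T W (X\theta - Y) \right] = \frac{1}{2} \frac{\partial}{\partial \theta} \left\{ \theta^T X^T W X \theta - \theta^T X^T W Y - Y^T W X \theta + Y^T W Y \right\} \tag{11}$$

When $\frac{\partial J(\theta)}{\partial \theta} = 0$:

$$X^T W X \theta - X^T W Y = 0 \tag{12}$$

$$\theta = (X^T W X)^{-1} X^T W Y \tag{13}$$

The weight here, according to formula (1), is defined as:

$$w^{(i)} = \exp\left( -\frac{(x^{(i)} - x)^2}{2\tau^2} \right) \tag{14}$$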
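Formulas (3), (13) and (14) can be combined into a short prediction routine. The following is a sketch under stated assumptions rather than the application's program: the names `solve` and `lwr_predict` are hypothetical, the distance in formula (14) is taken as the squared Euclidean distance between feature vectors, and the data are synthetic. Only the Python standard library is used, so the small linear system is solved by hand.

```python
import math


def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular; a zero pivot raises ZeroDivisionError.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def lwr_predict(x_query, X, Y, tau):
    """Predict y at x_query from training rows X (each starting with x0 = 1) and targets Y."""
    n = len(x_query)
    # Formula (14): Gaussian weight from the squared distance to the query point.
    w = [math.exp(-sum((a - b) ** 2 for a, b in zip(row, x_query)) / (2 * tau ** 2))
         for row in X]
    # Build X^T W X and X^T W Y without forming the m-by-m matrix W explicitly.
    xtwx = [[sum(wi * row[j] * row[k] for wi, row in zip(w, X)) for k in range(n)]
            for j in range(n)]
    xtwy = [sum(wi * row[j] * yi for wi, row, yi in zip(w, X, Y)) for j in range(n)]
    theta = solve(xtwx, xtwy)                          # formula (13)
    return sum(t * v for t, v in zip(theta, x_query))  # formula (3)


# Synthetic data: y = 2 + 3 * x exactly, so the local fit should recover the line.
X = [[1.0, 0.0], [1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]]
Y = [2.0 + 3.0 * row[1] for row in X]
prediction = lwr_predict([1.0, 0.75], X, Y, tau=0.5)
print(prediction)
```

Note that θ is recomputed for every query point, because the weights depend on the query; this is what makes the method non-parametric.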

4.3 Correction model

The data source is divided into two parts: the training set and the test set (the ratio is about 5:1, and cross-validation is used so that the subsets validate each other; this part is explained in detail in 4.4).

Both the training set and the test set are processed through data processing. The temperature and humidity of the sensor have been normalized, and the data of the sensor and the monitoring station are also one-to-one corresponding according to time.

The correction model uses the sensor data of the training set as the training samples, the monitoring station data of the training set as the real samples, and the sensor data of the test set as the test samples. A τ value is input and substituted into the locally weighted regression algorithm to obtain the predicted value corresponding to each test sample. After glitch processing based on the Pauta Criterion, the curve corresponding to the predicted values is fitted and compared with the monitoring station data (true values) of the test set, and the prediction error is output. Figure 3 shows the complete flow of locally weighted regression.

It can be seen from formula (6) that the weight matrix W is a diagonal matrix whose dimension equals the number of training samples, m. Its main diagonal element $w^{(i)}$ is obtained from formula (14), where $x^{(i)}$ is a training sample, comprising the gas data and the normalized temperature and humidity in the training set, and x is a test sample containing the corresponding parameters in the test set. The τ value can be specified arbitrarily; its optimal value is obtained by training, as explained in detail in 4.4 (Optimization). For a given test sample, $w^{(i)}$ is the weight of the i-th training sample. This is the core of the locally weighted regression algorithm: the difference between the training sample and the test sample determines the magnitude of the weight, as already mentioned in the Summary and not repeated here. As can be seen from the above description, each test sample corresponds to its own weight matrix, which reflects the contribution of each training sample to it.

After obtaining the weight matrix W, the regression coefficient θ can be calculated directly from formula (13). The values of X and Y come from the training set: X is the sensor data (gas data and normalized temperature and humidity) and Y is the monitoring station data. Formula (13) is obtained by mathematical derivation from formula (11). Its form shows that locally weighted regression is based on linear regression with a weight function introduced; ordinary linear regression is the special case of formula (13) in which all weights are equal. The derivation is described in detail in 4.2.

From the θ value and the test sample value, the predicted value $h_\theta(x)$ can be calculated directly by formula (3), and the fitted curve can be compared with the true values of the test set. However, when the τ value is too small or the test sample is insufficient, the predicted value $h_\theta(x)$ may show abnormal spikes (glitches). To deal with this problem, the application uses the Pauta Criterion from statistics, that is, data with large errors are eliminated. This part is described in detail in 4.5.


The purpose of this algorithm is to make the fitted data of the test set close to the real data, that is, the output prediction error should be as small as possible. When the training set and the test set are given, the prediction error is affected only by the τ value, so the choice of the τ value is particularly important in this application; this part is described in detail in 4.4. Tables 5 and 6 show the training and prediction errors for different τ values.

Table 5 Training τ data

τ	error
0.1	0.04027532432543213
0.2	0.04773740604327002
0.3	0.05075444332661508
0.4	0.0529269027580894
0.5	0.054712653947254095
0.6	0.05606035131168442
0.7	0.05696749658484701
0.8	0.05757518800873845
0.9	0.057998093148849024
1.0	0.05830514930158624

Table 6 Prediction data

τ	error
0.1	0.7632511923439733
0.2	0.16123384228185111
0.3	0.07527770429771075
0.4	0.05915819623830766
0.5	0.055009295190228576
0.6	0.053560744552811825
0.7	0.05295997610744374
0.8	0.0526895006748422
0.9	0.05256505960580796
1.0	0.05250999816875201

4.4 Optimization

The purpose of optimization is to determine the parameter value that gives the smallest prediction error. In the weight function of formula (14), x is the test vector and τ determines the rate at which the weight decreases with distance. Because each test vector lies at a different distance from the samples in the training set, the training samples receive different weights for different test vectors. The closer a sample vector in the training set is to the test vector, the greater its weight. Figure 4 shows the weight value distribution map of the training set with respect to the first row of test vectors when the τ value is 0.5.

The optimization method first determines a suitable range of τ values by calculating the error values, avoiding over-fitting and under-fitting, and then finds the optimal τ value by cross-validation. Over-fitting means that the hypothesis becomes excessively strict in order to obtain a consistent hypothesis. Avoiding over-fitting is a core task in classifier design. Classifier performance is typically evaluated using methods that increase the amount of data and test sample sets ^[3]. Under-fitting means that the model does not fit well: the data lie far from the fitted curve, or the model does not capture the data features well, so the data cannot be well fitted ^[4]. In cross-validation, for a given modeling sample, most of the samples are used to build the model and a small number of samples are left to be predicted by the newly established model. The prediction errors of this small part of the sample are calculated and their sum of squares is recorded. This process continues until every sample has been predicted once and only once, and the squared prediction errors of all samples are summed ^[5].

First, use the locally weighted regression algorithm to find the training set error and the prediction set error corresponding to different τ values, to determine whether there is over-fitting or under-fitting. Take τ = 0.1 and τ = 0.5 in turn, calculate the sample error and the prediction error, and draw the curves. The results are shown in Figures 5 and 6.

As shown in Table 7, when τ = 0.1 serious over-fitting occurs. The weight function ω decays too fast, so the weights of the surrounding samples are too small and the consistency requirement on the sample set is too strict; in actual prediction, the data are not well predicted. The prediction error and the sample error are both acceptable when τ = 0.5, so it can be determined that the value of τ should be around 0.5.

Table 7 Training data and prediction data comparison table

The value of τ	Training Error Rate	Forecast Error Rate
0.1	0.04027532432543213	0.7632511923439733
0.5	0.054712653947254095	0.055009295190228576
1.0	0.05830514930158624	0.05250999816875201

After determining the approximate range of τ values, the exact τ value is obtained by cross-validation, whose procedure is shown in Figure 7. First, the training set, a total of 1560 groups of data, is divided into 6 × 260 groups (6 subsets of 260 groups each). Each subset is taken in turn as the verification set and the rest serve as the training set. The model is trained on the training set, the predicted values are calculated, and they are compared with the verification set to calculate the loss function J(θ) of formula (15). This process is repeated so that every subset serves as the verification set once; the error of each subset is calculated, and the average is taken as the total error value. Each τ value corresponds to one such error. First, τ values with a step length of 0.05 are used to calculate the error; the range is then narrowed, and the τ value corresponding to the smallest error is taken as the optimized result. The optimization process is shown in Figure 8.

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \tag{15}$$

In summary, τ = 0.26 is obtained by the locally weighted regression algorithm and cross-validation.
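The cross-validated search over τ can be sketched as follows. To keep the example short it uses a single feature plus an intercept, synthetic data, six contiguous folds and a mean squared error per fold; the names `lwr_1d` and `cross_validate`, the data and the error definition are all assumptions for illustration and do not reproduce the application's data or results.

```python
import math
import random


def lwr_1d(xq, xs, ys, tau):
    """Locally weighted line fit at xq for one feature plus intercept (closed-form 2x2 solve)."""
    w = [math.exp(-(x - xq) ** 2 / (2 * tau ** 2)) for x in xs]
    sw = sum(w)
    sx = sum(wi * x for wi, x in zip(w, xs))
    sy = sum(wi * y for wi, y in zip(w, ys))
    sxx = sum(wi * x * x for wi, x in zip(w, xs))
    sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    det = sw * sxx - sx * sx
    if abs(det) < 1e-12:  # weights collapsed onto one point: use the weighted mean
        return sy / sw
    slope = (sw * sxy - sx * sy) / det
    return (sy - slope * sx) / sw + slope * xq


def cross_validate(xs, ys, taus, k=6):
    """Return {tau: validation error averaged over k contiguous folds}."""
    size = len(xs) // k  # samples beyond k * size are used for training only
    errors = {}
    for tau in taus:
        fold_errors = []
        for f in range(k):
            held = range(f * size, (f + 1) * size)
            train = [i for i in range(len(xs)) if i not in held]
            tx = [xs[i] for i in train]
            ty = [ys[i] for i in train]
            mse = sum((lwr_1d(xs[i], tx, ty, tau) - ys[i]) ** 2 for i in held) / size
            fold_errors.append(mse)
        errors[tau] = sum(fold_errors) / k
    return errors


rng = random.Random(0)  # fixed seed so the synthetic data are reproducible
xs = [rng.random() for _ in range(60)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.1) for x in xs]
taus = [round(0.05 * i, 2) for i in range(1, 21)]  # coarse grid, step 0.05
errors = cross_validate(xs, ys, taus)
best_tau = min(errors, key=errors.get)
print(best_tau, errors[best_tau])
```

A second pass with a finer grid around `best_tau` (for example a step of 0.01) would mirror the two-stage refinement described in the text.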

Tables 8 and 9 show the optimized data.

Table 8 Optimized data (τ step length 0.05)

The value of τ	Error value
0.1	0.07592582931205656
0.15	0.06752336680726281
0.2	0.06639442993008411
0.25	0.06515814801346187
0.3	0.0651135339903361
0.35	0.06571206680262161
0.4	0.06565720645536112
0.45	0.06596193454022407
0.5	0.06556091887898556
0.55	0.0658312756962428
0.6	0.06599210106979636
0.65	0.0659552674648468
0.7	0.06606561845802374
0.75	0.06625633289772172
0.8	0.06681197179614594
0.85	0.0668532678294503
0.9	0.06688899918754804
0.95	0.06689299030116082

Table 9 Optimized data (τ step length 0.01)

The value of τ	Error value
0.21	0.06573004795859695
0.22	0.0653437976794451
0.23	0.06534679642034062
0.24	0.06523371547425948
0.25	0.06515814801346187
0.26	0.06502440855809799
0.27	0.06513030940646333
0.28	0.06509019840769054
0.29	0.06512060153426212
0.30	0.06511353399033583


4.5 Pauta Criterion

Due to the influence of the external environment, the sensor may drift, which introduces random errors into the sensor test data. To eliminate such effects, the data can be processed using the Pauta Criterion. The Pauta Criterion calculates the arithmetic mean $\bar{X}$ of the samples and the residual error $v_b$ of each sample, and calculates the standard deviation σ according to the Bessel formula. When the absolute value of a residual error is greater than 3 times the standard deviation, the corresponding sample value is judged to contain a large error and needs to be removed.
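A single pass of the 3σ rule described above can be sketched as follows. The function name `pauta_filter` and the sample values are illustrative assumptions; `statistics.stdev` uses the n - 1 denominator, which corresponds to the Bessel formula.

```python
import statistics


def pauta_filter(values):
    """Drop values whose residual from the mean exceeds 3 sample standard deviations."""
    mean = statistics.mean(values)
    sigma = statistics.stdev(values)  # Bessel-corrected (n - 1 denominator)
    return [v for v in values if abs(v - mean) <= 3 * sigma]


# Illustrative predictions only: twenty ordinary values and one spike.
predictions = [1.0] * 10 + [1.1] * 10 + [5.0]
cleaned = pauta_filter(predictions)
print(len(predictions), len(cleaned))
```

The rule needs a reasonably large sample to be effective: with fewer than about eleven values no residual can exceed 3σ, so nothing would ever be removed.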

Claims

CLAIM

1. A method for correcting specific gas detection sensors, which is based on a locally weighted regression algorithm, characterized in that said method measures specific gases accurately in real time and improves the efficiency of measurement by means of correcting the sensor data with real data.
2. The method according to claim 1, characterized in utilizing a locally weighted regression method which relates only to the distance to the training data, wherein the closer the distance is, the larger the relationship is, and the farther the distance is, the smaller the relationship is; further, under-fitting can be effectively avoided and the interference of distant data can be reduced, so that the result relates only to nearer data.
3. The method according to claim 1, wherein said method utilizes a cross-validation method to screen for optimal parameters; said cross-validation method divides the training set into multiple sub-training sets, divides the subsets into training sets and test sets according to the specified proportion, makes them test each other, and selects the optimal parameters by judging the mean error of the specified parameters; said cross-validation method may extract as much effective information as possible from limited information, and reduce the randomness problem caused by manual division.