AU2018100221A4

AU2018100221A4 - A correction method based on linear regression algorithm for PM2.5 sensors

Info

Publication number: AU2018100221A4
Application number: AU2018100221A
Authority: AU
Inventors: Yinan Feng; Taifu Li; Yifu Qiao; Weiyi Shi; Hao Wu; Ziying Zhou
Original assignee: Zhou Ziying Miss
Current assignee: Zhou Ziying Miss
Priority date: 2018-02-21
Filing date: 2018-02-21
Publication date: 2018-03-29
Anticipated expiration: 2026-02-21

Abstract

With a booming population, growing use of transportation vehicles and the establishment of factories, society is moving forward at an unprecedented rate. These advancements and innovations have greatly improved the lives of those who live in the 21st century, but many serious environmental and health issues have emerged, one of which is PM2.5, atmospheric particulate matter with a diameter of 2.5 micrometers or less. This invention relates to a method for correcting PM2.5 sensors based on a linear regression algorithm. Based on accurate data received from a PM2.5 monitoring station in the proximity of the PM2.5 sensor, the method uses a linear regression algorithm to correct the PM2.5 output value of the sensor, making the output value of the sensor consistent with the accurate PM2.5 value from the monitoring station, and thereby making the PM2.5 sensor more accurate and reliable. The method is simple, has high precision and is highly practical.

Description

Title

A correction method based on linear regression algorithm for PM2.5 sensors

FIELD OF THE INVENTION

This invention belongs to the field of detection equipment technology; specifically, it is a method of correcting and adjusting PM2.5 sensors using a linear regression algorithm.

BACKGROUND OF THE INVENTION

With the spread of industrialization, one persistent problem is the increase in haze. In the north of China in particular, many citizens have suffered from the hazards brought by haze. According to an analysis by the China Meteorological Administration, an area of over 101 million square kilometers covering the north of China, the Yellow River-Huai River valley and the Changjiang-Huaihe basin was covered by haze in December 2016, and 23 cities including Beijing and Tianjin issued red alerts. What's more, nearly 50 expressways were closed. Fortunately, the Chinese government has implemented a regulation that PM2.5, the main cause of haze, be included in daily air quality monitoring.

This also brings another problem to the public. Owing to the scarcity of accurate and precise devices, ordinary people can hardly determine the PM2.5 concentration with general detection equipment. Humidity, temperature, sunlight and many other factors can affect the detection result. For these reasons, a method is needed to correct such devices so that they yield a relatively precise reading. One way to do this is to use the data collected at a nearby monitoring station as the true value. This patent uses a linear regression equation to establish a relationship between the data from the detection equipment and the data from the monitoring station. Hence, with the help of linear regression, we can correct subsequent sensor data.

Compared to other methods, linear regression is both convenient and effective. Although it is only an estimate, once we have enough data it can predict the true value accurately. Data from monitoring stations is open to the public, so it is convenient for anyone to acquire a large amount of data with which to estimate the true value.

SUMMARY OF THE INVENTION

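We start by pre-processing the PM2.5 data collected by the sensor and the real PM2.5 data collected by the monitoring station. Suppose we have a sample of data from the PM2.5 sensor, and let the i-th sample be represented as $x^{(i)}$. For any $i \in \{1, \dots, n\}$,

$$x^{(i)} = \left(x_0^{(i)}, x_1^{(i)}, \dots, x_m^{(i)}\right)^T,$$

where n is the number of samples, m is the number of features, and we assume that $x_0^{(i)} = 1$.

Besides, we construct a linear-regression model

$$h_\theta(x) = \sum_{j=0}^{m} \theta_j x_j = \theta^T x,$$

where $\theta_j$ is the parameter of the j-th feature, $x_j$ is the value of the j-th feature and $h_\theta(x)$ is the value predicted by the linear-regression algorithm. We also construct a cost function

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2.$$

The aim is to minimize the value of $J(\theta)$ and thereby find the value of $\theta$. Here $y^{(i)}$ is the value of the real data of the i-th sample.

By gradient descent, we get the update

$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j},$$

where $\alpha$ is the step size. From this function, we can calculate

$$\frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}.$$

We can get the value of

$$\theta_j := \theta_j + \alpha \sum_{i=1}^{n} \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}.$$

After that, we let

$$X = \left(x^{(1)}, x^{(2)}, \dots, x^{(n)}\right)^T$$

and

$$y = \left(y^{(1)}, y^{(2)}, \dots, y^{(n)}\right)^T.$$

Through the function

$$\theta := \theta + \alpha X^T (y - X\theta),$$

we can get the value of $\theta$.

Using a new sample $x^{(new)}$, we can calculate the prediction $h_\theta(x^{(new)})$ of this sample. The prediction is the corrected (shifted) value of the PM2.5 sensor output, from which we obtain the correct measured value of the PM2.5 sensor.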
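As a minimal sketch of the gradient-descent update for the single-feature case (one sensor reading plus the constant feature x_0 = 1), the following pure-Python function applies the update theta := theta + alpha * X^T (y - X theta) component by component. The function name fit_linear_gd, the default values of alpha and max_cycles, and the toy data mentioned afterwards are assumptions for illustration only and are not taken from this description.

```python
def fit_linear_gd(xs, ys, alpha=0.01, max_cycles=5000):
    """Batch gradient descent for h(x) = theta0 * 1 + theta1 * x.

    alpha and max_cycles are illustrative defaults (assumptions),
    not values stated in the patent text.
    """
    theta0, theta1 = 1.0, 1.0  # all parameters start at 1, as in the embodiment
    for _ in range(max_cycles):
        # error_i = y_i - h(x_i)
        errors = [y - (theta0 + theta1 * x) for x, y in zip(xs, ys)]
        # theta_j := theta_j + alpha * sum_i error_i * x_j^(i), with x_0 = 1
        theta0 += alpha * sum(errors)
        theta1 += alpha * sum(e * x for e, x in zip(errors, xs))
    return theta0, theta1
```

On the toy points xs = [0, 1, 2, 3], ys = [1, 3, 5, 7] the result approaches (1, 2). With unscaled PM2.5 readings in the tens or hundreds, a much smaller step size or rescaled inputs would usually be needed for the iteration to converge.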

DESCRIPTION OF DRAWINGS

Fig. 1 Original data

Fig. 2 The monitoring station data at Wanliu, Beijing

Fig. 3 The flow chart of data pre-processing

Fig. 4 Processed data

Fig. 5 The flow chart of gradient descent method

Fig. 6 Correction for the sensor data

DESCRIPTION OF PREFERRED EMBODIMENT

The haze sensor of this invention was placed in Wanliu, Beijing. Its sampling period is 15 seconds and the original data is shown in Fig. 1. Real haze data is collected from the Wanliu monitoring station in Beijing and is shown in Fig. 2. Concrete implementation steps are as follows:

Step 1: The data pre-processing stage, as shown in Fig. 3, can be divided into three steps. In Steps 1.1 and 1.2, we process the PM2.5 sensor data and the Wanliu station data, and then in Step 1.3 we integrate the processed data. Specific steps are as follows:

Step 1.1 This step is the PM2.5 sensor data processing. The sensor data is shown in Fig. 1. This step processes the data collected by the sensor from October 20 to October 31 and computes the hourly average haze value for that period. Specific steps are as follows:

Define the function ReadData(month, day, min, max), in which month, day, min and max represent the month, day, hour lower limit and hour upper limit of the period in which the data was collected. Set the PM2.5 sum PM25_sum = 0 and the counter num = 0.

Open the original sensor data file m3_y201710.txt, which is stored in Data/Original_SampleData; the data format is shown in Fig. 1.

Store the original data in different files according to month and date. For every line of the original data, open or create the corresponding file under the directory /Data/PM25, named in the form month_day.

To compute the hourly haze value for each day and store it in the corresponding file month_day, we split every line of the data on the space character and take the second term, which is the year/month/date string. We then split this string on '/', take the third term and use the int function to get the value of the date. If the date value equals day and the hour value is greater than or equal to min and less than max, the line is within the set range and we go on to the next step: if the PM2.5 value is not 0, add it to PM25_sum and add 1 to num. If the line is not within the set range, there are two cases. In the first case, num is not equal to 0, which shows that the data for this period has been processed: work out the average value PM25_sum/num, write it into the file month_day and break the loop.

In the second case, num is equal to 0, which shows that the data of this period is missing or that no eligible data has been read yet: continue the loop. The code is shown below ("..." marks parts of the listing that are missing from this text).

    def ReadData(month, day, min, max):
        PM25_sum = 0
        num = 0
        Original_data = open("../Data/Original_SampleData/m3_y201710.txt")
        for line in Original_data:
            ...
            else:
                ...
                else:
                    continue
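Because the body of the ReadData loop is not reproduced in this text, the hourly averaging of Step 1.1 can only be sketched under assumptions. The version below assumes a hypothetical line layout of `<device_id> <YYYY/MM/DD> <HH:MM:SS> <pm25>` (the real layout is the one in Fig. 1 and may differ) and uses an illustrative function name, hourly_means. It keeps the two rules stated in the text: the date is the third term of the '/'-separated date string, and zero readings are skipped.

```python
from collections import defaultdict

def hourly_means(lines):
    """Average the non-zero PM2.5 readings per (day, hour).

    ASSUMED line layout (hypothetical, not confirmed by the source):
        <device_id> <YYYY/MM/DD> <HH:MM:SS> <pm25>
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip empty or malformed lines
        day = int(fields[1].split("/")[2])   # third term of year/month/date
        hour = int(fields[2].split(":")[0])
        value = float(fields[3])
        if value == 0:
            continue  # zero readings are not counted, as in Step 1.1
        sums[(day, hour)] += value
        counts[(day, hour)] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

Grouping by (day, hour) in one pass avoids re-reading the file for every hour, at the cost of holding one running sum per hour in memory.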

Step 1.2 This step is the Wanliu station data processing. The Wanliu station data serves as the accurate reference data and is stored in csv files. From Fig. 2 we can see that the files also contain lines for PM10 and AQI, as well as information for areas other than Wanliu.

Therefore, the main task of Step 1.2 is to extract only the PM2.5 information for Wanliu from Oct 21st to Oct 31st, which we call screening. Specific steps are as follows.

We define a function called Wanliupm25, whose parameters are month, dmin and dmax. dmin and dmax bound the range of dates to process and are themselves not included in the dates for which we want to collect data. In our case, we call Wanliupm25(10,20,32), meaning we want to process the files from Oct 21st to Oct 31st.

According to the dates we have set, the appropriate csv files are opened one by one in a loop. The first line of each csv file is the header, which includes the names of the locations. Using the islice function from the itertools library, we skip the header so that the loop starts from the second line of the file.

An inner loop scans all the lines in the file. Using the split function, we split each line on the comma character and collect the resulting elements in a list called mm.

The third column of the file is the pollutant type, and the tenth column stores the value for the Wanliu area. If mm[2], the third element of a line, equals "PM2.5" and mm[9], the tenth element of the line, is not empty, we have found the target information. We write mm[1], the hour of the day, and mm[9] into a newly created file named according to the date. Therefore, there will be 11 files, one for each of the 11 dates.

The code is shown below.

    from itertools import islice

    def Wanliupm25(month, dmin, dmax):
        for j in range(dmin, dmax):
            original_file = open(r'Data\Wanliu\beijing_all_201710' + str(j) + '.csv', 'r', encoding='UTF-8')
            f = open('Data/WanliuPM25/' + str(month) + '_' + str(j), 'a')
            # skip the header line and scan the remaining lines
            for line in islice(original_file, 1, None):
                mm = line.split(',')
                for i in range(25):
                    if (int(mm[1]) == i and mm[2] == 'PM2.5' and mm[9] != ''):
                        # write the hour and the Wanliu PM2.5 value
                        f.write(mm[1] + '\t' + mm[9] + '\n')

    if __name__ == '__main__':
        Wanliupm25(10, 20, 32)
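The screening rule of Step 1.2 (keep a row only if its type is "PM2.5" and its Wanliu column is non-empty) can also be sketched with the standard csv module, which handles the comma splitting. This is a sketch under assumptions: the function name wanliu_pm25_rows is illustrative, the input is passed as a string rather than read from the station files, and the column positions (hour at index 1, type at index 2, Wanliu at index 9) are taken from the description of mm[1], mm[2] and mm[9].

```python
import csv
import io

def wanliu_pm25_rows(csv_text, hour_col=1, type_col=2, station_col=9):
    """Return (hour, value) pairs for PM2.5 rows with a non-empty station value."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    rows = []
    for fields in reader:
        if len(fields) <= station_col:
            continue  # row too short to contain the station column
        if fields[type_col] == "PM2.5" and fields[station_col] != "":
            rows.append((int(fields[hour_col]), float(fields[station_col])))
    return rows
```

Returning the pairs instead of writing them directly makes the filter easy to check on a few hand-made rows before it is run on the real station files.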

Step 1.3 This step is the combination of the data. Through Step 1.1 and Step 1.2, we have obtained 11 files of sensor data and 11 files of Wanliu station data, divided according to the date. However, the model takes two input files: the training set and the testing set. The main task of this step is to combine the two groups of data into these files.

We use the sensor and station data from Oct 21st to Oct 30th as the training set, and the data from Oct 31st as the testing set. Two newly created files called "com" and "test" store the training data and the testing data respectively, selected with an if/else statement.

In the loop that scans all the days, for each day we open the corresponding files in the sensor and Wanliu folders. We create two empty lists called t and m to store the contents of the two files. We use the split function to split each line on the tab character, and use the append function to add the time and the numerical PM2.5 value of each line to the list.

After filling the lists t and m, we run a nested loop that scans the elements of both lists. The indices i and j range over the number of lines in list t and list m respectively. When m[j].split('\t')[0], the time in the Wanliu data, equals t[i].split('\t')[0], the time in the sensor data, we write the pair into the appropriate file. According to the date, we write it to the training set or the testing set.

Each line written to the file has the form: the number "1", followed by the sensor value, then the Wanliu station value. The three numbers are separated by tab characters. The result is shown in Fig. 4.

The code is shown below.

    f_com = open(r'Data\COMData\com', 'a')
    f_test = open(r'Data\COMData\test', 'a')
    for filename in range(20, 32):
        # Oct 31st goes to the testing set, every other day to the training set
        if filename != 31:
            f_out = f_com
        else:
            f_out = f_test
        sampledata = open(r'Data\PM25\PM2510_' + str(filename))
        wanliudata = open(r'Data\WanliuPM25\10_' + str(filename))
        t = []
        m = []
        for sampleline in sampledata:
            sample_m = sampleline.split('\t')
            t.append(sample_m[0] + '\t' + sample_m[1])
        for wanliuline in wanliudata:
            wanliu_m = wanliuline.split('\t')
            m.append(wanliu_m[0] + '\t' + wanliu_m[1])
        for i in range(len(t)):
            for j in range(len(m)):
                if m[j].split('\t')[0] == t[i].split('\t')[0]:
                    # write "1", the sensor value, then the station value of the matching hour
                    f_out.write('1' + '\t' + t[i].split('\t')[1].strip('\n') + '\t' + m[j].split('\t')[1])
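The matching of sensor and station values by hour can also be sketched with dictionaries instead of a nested loop. The function name join_by_hour and the in-memory dictionaries are assumptions for illustration; the row layout (constant 1, sensor value, station value) follows the order in which loadDataSet later reads the features and the label.

```python
def join_by_hour(sensor, station):
    """Pair sensor and station values that share the same hour.

    sensor, station: dicts mapping hour -> PM2.5 value (hypothetical inputs).
    Returns rows of the form (1.0, sensor_value, station_value).
    """
    rows = []
    # only hours present in both data sets can be used for training or testing
    for hour in sorted(sensor.keys() & station.keys()):
        rows.append((1.0, sensor[hour], station[hour]))
    return rows
```

Hours missing from either source are dropped, which is also what the nested loop does when it finds no matching time.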

Step 2: Use the gradient descent method described above to solve for the parameters of the linear regression model. Specific steps are as follows:

Step 2.1 Read the data. Define the function loadDataSet(fileName):

Split the first line of the file on '\t' and count the elements. Set numFeat to the number of elements minus 1, which is the number of feature fields.

Create three empty lists dataMat, labelMat and lineArr. Split every line on '\t'; curLine holds the elements of the line. Append the values of the first numFeat elements of each line, converted to float, to lineArr, then append lineArr to dataMat. dataMat contains the haze data collected from the haze sensor.

Append the value of the last element of each line, float(curLine[-1]), to labelMat; this gives a list containing the haze data from the Wanliu haze monitoring station. The code is shown below:

    def loadDataSet(fileName):
        # general function to parse tab-delimited floats
        numFeat = len(open(fileName).readline().split('\t')) - 1  # get number of fields
        print(numFeat)
        dataMat = []
        labelMat = []
        fr = open(fileName)
        for line in fr.readlines():
            lineArr = []
            curLine = line.strip().split('\t')
            for i in range(numFeat):
                lineArr.append(float(curLine[i]))
            dataMat.append(lineArr)
            labelMat.append(float(curLine[-1]))
        return dataMat, labelMat

Step 2.2 Use the gradient descent method to solve for the model parameters, as shown in Fig. 5. Define the gradient descent function gradAscent(dataMatIn, classLabels):

dataMatIn and classLabels are the dataMat and labelMat obtained from loadDataSet(fileName), respectively.

Set dataMatrix and labelMat to the matrix form of dataMatIn and the transpose of classLabels, respectively. Create an n×1 matrix weights with all elements equal to 1, where n is the number of columns of dataMatrix. Execute the following loop maxCycles times:

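Calculate the predicted haze value h = linearFunc(dataMatrix, weights) = dataMatrix × weights. Calculate the error error = labelMat - h. Update the parameters weights = weights + alpha × dataMatrix^T × error, in which alpha is the learning rate. The code is shown below:

    from numpy import mat, ones, shape

    # alpha (learning rate) and maxCycles (number of iterations) must be defined
    # before calling gradAscent; their values are not stated in this description.

    def linearFunc(dataMatrix, weights):
        # predicted value: dataMatrix x weights
        return dataMatrix * weights

    def gradAscent(dataMatIn, classLabels):
        dataMatrix = mat(dataMatIn)
        labelMat = mat(classLabels).transpose()
        m, n = shape(dataMatrix)
        weights = ones((n, 1))
        print(weights)
        print(shape(dataMatrix))
        for k in range(maxCycles):
            h = linearFunc(dataMatrix, weights)
            error = (labelMat - h)
            weights = weights + alpha * dataMatrix.transpose() * error
        return weights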
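One way to check whether the gradient loop has converged is to compare its result with closed-form ordinary least squares. This is a standard technique swapped in here purely as a cross-check; it is not part of the method described in this document. For a single sensor feature plus the constant term it reduces to the slope and intercept formulas below. The function name fit_linear_closed_form is an assumption for illustration.

```python
def fit_linear_closed_form(xs, ys):
    """Ordinary least squares for y ~ theta0 + theta1 * x (single feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # sums of squared deviations and cross deviations
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    theta1 = sxy / sxx          # slope
    theta0 = mean_y - theta1 * mean_x  # intercept
    return theta0, theta1
```

If the weights returned by the gradient loop differ noticeably from these values on the same training data, the learning rate or the number of iterations probably needs adjusting.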

Step 3 For a new set of sensor data, we first use Step 1.1 to process the data; we can then call the linear regression model to obtain the corrected value of each measured value. That is to say, the haze sensor is corrected. Step 3 is shown in Fig. 6.
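Applying the fitted model to new hourly sensor averages amounts to evaluating h(x) = theta0 + theta1 * x for each reading. The sketch below assumes a single sensor feature; the function name correct_readings is illustrative, and any parameter values passed to it in an example are placeholders rather than fitted results from this document.

```python
def correct_readings(hourly_values, theta0, theta1):
    """Map raw hourly sensor averages to corrected PM2.5 values.

    theta0 multiplies the constant feature x_0 = 1, theta1 the sensor reading.
    """
    return [theta0 + theta1 * value for value in hourly_values]
```

For instance, with placeholder parameters theta0 = 1.0 and theta1 = 2.0, the readings [10.0, 20.0] map to [21.0, 41.0].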

Claims

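1. A correction method based on a linear regression algorithm for PM2.5 sensors, including the following steps: take a sample of data from the PM2.5 sensor, the i-th sample being represented as $x^{(i)}$.

2. The method according to claim 1, wherein we construct a linear-regression model

$$h_\theta(x) = \sum_{j=0}^{m} \theta_j x_j = \theta^T x,$$

where $\theta_j$ is the parameter of the j-th feature, $x_j$ is the value of the j-th feature and $h_\theta(x)$ is the value predicted by the linear-regression algorithm. We also construct a cost function

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2.$$

The aim is to minimize the value of $J(\theta)$ and thereby find the value of $\theta$. Here $y^{(i)}$ is the value of the real data of the i-th sample. By gradient descent, we get the update

$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j},$$

where $\alpha$ is the step size. From this function, we can calculate

$$\frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}.$$

We can get the value of

$$\theta_j := \theta_j + \alpha \sum_{i=1}^{n} \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}.$$

After that, we let

$$X = \left(x^{(1)}, x^{(2)}, \dots, x^{(n)}\right)^T$$

and

$$y = \left(y^{(1)}, y^{(2)}, \dots, y^{(n)}\right)^T.$$

Through the function

$$\theta := \theta + \alpha X^T (y - X\theta),$$

we can get the value of $\theta$. Using a new sample $x^{(new)}$, we can calculate the prediction $h_\theta(x^{(new)})$ of this sample. The prediction is the corrected (shifted) value of the PM2.5 sensor output, from which we obtain the correct measured value of the PM2.5 sensor.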