CN112613536A

CN112613536A - Near infrared spectrum diesel grade identification method based on SMOTE and deep learning

Info

Publication number: CN112613536A
Application number: CN202011443096.0A
Authority: CN
Inventors: 王书涛; 刘诗瑜; 崔凯; 张靖昆; 孔德明
Original assignee: Yanshan University
Current assignee: Yanshan University
Priority date: 2020-12-08
Filing date: 2020-12-08
Publication date: 2021-04-06

Abstract

The invention discloses a near infrared spectrum diesel grade identification method based on SMOTE and deep learning, comprising the following steps: step 1, plotting the diesel near-infrared spectra, analyzing the distribution of the different diesel grades, mapping the grade labels to class attributes, and taking the diesel samples of the different grades as the sample set; step 2, balancing the sample set with SMOTE and dividing it into training set samples and test set samples; step 3, constructing a near infrared spectrum classification model based on a one-dimensional deep convolutional neural network using the training set samples; and step 4, feeding the test set samples into the established model to obtain the diesel grade identification result, drawing a multi-class confusion matrix, and analyzing the identification rate of each class. The invention does not require extensive preprocessing, improves classification accuracy, and improves the identification rate of minority-class samples.

Description

Near infrared spectrum diesel grade identification method based on SMOTE and deep learning

Technical Field

The invention relates to the field of near infrared spectrum, in particular to a near infrared spectrum diesel grade identification method based on SMOTE and deep learning.

Background

Owing to its high energy density, low fuel consumption and low price, petroleum-derived diesel still dominates the market. Improving diesel quality and detection precision to meet the changing demands of the diesel market remains one of the main development directions of the global petroleum industry. According to the GB/T1.1-2020 standard, commercial diesel is divided by condensation point into 6 grades: 5#, 0#, -10#, -20#, -35# and -50#. The lower the grade, the less likely the diesel is to form wax and the higher its price. To increase profits, unscrupulous manufacturers often adulterate diesel or mislabel its grade; such illegal products can damage engines, increase pollutant emissions and even endanger personal safety. Identifying diesel grades quickly and accurately therefore gives regulators timely and accurate detection data, and is important for protecting the rights and safety of consumers.

Identifying the grade of diesel only by color, feel, smell and the like, although common in daily life, is time-consuming, labor-intensive and highly subjective, and is unsuitable for large-scale production testing. Near infrared spectroscopy (NIRS) is a fast, green, low-cost, easy-to-operate and non-destructive technique that has been widely used in the petrochemical field. The NIRS of diesel involves the characteristic absorption of various hydrogen-containing groups (such as O-H, C-H and N-H) in a complex mixture, so accurate identification of diesel grades is extremely difficult and requires computer-aided detection. Commonly used auxiliary models include partial least squares, support vector machines and artificial neural networks. Because NIRS covers a wide spectral range and has weak useful signal, heavy noise and severe peak overlap, these traditional machine learning methods must be combined with extensive preprocessing such as denoising, feature extraction and dimension reduction to achieve fast detection and accurate prediction. This adds considerable workload, and the applicability and prediction accuracy of the models still need improvement.

Deep learning, which is built on deep networks, is a new research direction in machine learning and has in recent years advanced rapidly in many application fields such as image processing, speech recognition and machine translation. The deep convolutional neural network (DCNN) is the most widely applied deep learning model; it can autonomously extract effective features from complex data and reduce dimensionality, and it has stronger expressive power than traditional shallow models. However, because the DCNN is mainly used to process two-dimensional or three-dimensional images, it has seldom been applied to one-dimensional NIRS.

Disclosure of Invention

The invention aims to provide a near infrared spectrum diesel grade identification method based on SMOTE and deep learning, which improves classification accuracy and improves the identification rate of minority-class samples.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows:

a near infrared spectrum diesel grade identification method based on SMOTE and deep learning comprises the following steps:

step 1, drawing a diesel near-infrared spectrogram, analyzing distribution conditions of different grades of diesel, performing attribute mapping on grade labels, and taking the different grades of the diesel as a sample set;

step 2, carrying out data equalization processing on the sample set by adopting SMOTE, and dividing the sample set into a training set sample and a test set sample;

step 3, constructing a near infrared spectrum classification model of the one-dimensional deep convolution neural network by using the training set sample;

and step 4, bringing the test set samples into the established model to obtain the diesel grade identification result, drawing a multi-classification confusion matrix, and analyzing the identification rate of each class.

The technical scheme of the invention is further improved as follows: a near infrared spectrum plot is drawn from the diesel sample set, and the samples are divided by condensation point into 5 classes, namely -10#, -20#, -35#, -50# and interference, which are mapped to class attributes 1, 2, 3, 4 and 0, respectively.

The technical scheme of the invention is further improved as follows: the specific process of performing data equalization processing on the sample set by using SMOTE is as follows:

1): first, for each sample x in the minority class, calculate the Euclidean distance to all other samples in the minority class sample set to obtain the k nearest neighbors of x;

2): set a sampling ratio according to the sample imbalance ratio to determine the sampling multiplier N, and for each minority-class sample x randomly select several samples from its k nearest neighbors, each selected neighbor being denoted x_n;

3): for each selected neighbor x_n, construct a new sample from the original sample x according to formula <1>;

$$x_{new} = x + \mathrm{rand}(0,1) \times |x - x_n|, \quad new \in \{1, 2, \ldots, N\}$$ <1>

4): finally, repeat the above steps N times to synthesize N new samples; if the minority class has T samples in total, NT new samples can be synthesized.
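The four SMOTE steps above can be sketched in plain Python. This is only an illustrative sketch, not the implementation used in the invention: the function names `k_nearest` and `smote`, the toy data and the choice k = 2 are assumptions. The sketch also interpolates with the signed difference (x_n - x) used in the commonly published form of SMOTE, which places each new sample on the segment between x and x_n; formula <1> as printed uses the absolute difference instead.

```python
import math
import random


def k_nearest(sample, minority, k):
    """Return the k minority samples closest to `sample` (Euclidean distance)."""
    others = [s for s in minority if s is not sample]
    others.sort(key=lambda s: math.dist(sample, s))
    return others[:k]


def smote(minority, n_per_sample, k=5, seed=0):
    """Synthesize n_per_sample new points for every minority sample."""
    rng = random.Random(seed)
    synthetic = []
    for x in minority:
        neighbours = k_nearest(x, minority, k)
        for _ in range(n_per_sample):
            x_n = rng.choice(neighbours)   # randomly chosen neighbour
            gap = rng.random()             # rand(0, 1)
            # Assumption: signed interpolation between x and x_n.
            synthetic.append([a + gap * (b - a) for a, b in zip(x, x_n)])
    return synthetic


# Toy minority class with T = 4 samples and sampling multiplier N = 3.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, n_per_sample=3, k=2)
print(len(new_samples))  # N * T = 12
```

With T = 4 minority samples and N = 3, the sketch produces NT = 12 synthetic samples, matching the count stated in step 4).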

The technical scheme of the invention is further improved as follows: the specific method for constructing the near infrared spectrum classification model of the one-dimensional depth convolution neural network comprises the following steps:

the one-dimensional near infrared spectrum data are transformed so that the input signal meets the requirements of a convolutional neural network: the one-dimensional spectrum is regarded as a special two-dimensional image containing only one row or one column, the spectral signal is expanded in dimension accordingly, and the class labels are converted to one-hot encoding; a one-dimensional deep convolutional neural network model is then constructed with reference to LeNet-5, comprising an input layer, two convolution layers, two pooling layers, two fully connected layers and an output layer.

The technical scheme of the invention is further improved as follows: the convolution layer consists of a group of convolution kernels with trainable parameters, the kernel size is set to m × 1, and the convolution operation on the one-dimensional signal is shown in formula <2>:

$$y_j^{l} = f\left(\sum_i x_i^{l-1} * \omega_{ij}^{l} + b_j^{l}\right)$$ <2>

where l is the current convolution layer, l-1 is the (l-1)-th convolution layer, x_i and y_j denote the i-th input feature map and the j-th output feature map respectively, * is the convolution operator, ω_ij is the convolution kernel, b is the bias, and f(·) is the activation function.
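As an illustration of the one-dimensional convolution described above, the following sketch computes each output feature map as the sum over input maps of a sliding dot product with an m × 1 kernel, plus a bias, passed through an activation. The function name `conv1d`, the absence of padding, the stride of 1 and the cross-correlation form (no kernel flip, as in most deep learning frameworks) are assumptions made for the example.

```python
def conv1d(inputs, kernels, biases, activation=lambda v: v):
    """1-D convolution without padding, stride 1.

    inputs:  list of input feature maps x_i, each a list of floats
    kernels: kernels[j][i] is the kernel linking input map i to output map j
    biases:  one bias b_j per output map
    """
    m = len(kernels[0][0])               # kernel size
    out_len = len(inputs[0]) - m + 1     # output length without padding
    outputs = []
    for j, kernel_row in enumerate(kernels):
        y_j = []
        for t in range(out_len):
            total = biases[j]
            for i, x_i in enumerate(inputs):
                total += sum(x_i[t + s] * kernel_row[i][s] for s in range(m))
            y_j.append(activation(total))
        outputs.append(y_j)
    return outputs


x = [[1.0, 2.0, 3.0, 4.0, 5.0]]   # one input map of length 5
w = [[[1.0, 0.0, -1.0]]]          # one output map, kernel size m = 3
y = conv1d(x, w, biases=[0.5])
print(y)  # [[-1.5, -1.5, -1.5]]
```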

The technical scheme of the invention is further improved as follows: an activation function PReLU is introduced into the convolution layer, and its expression is shown in formula <3>:

$$f(x) = \max(0, x) + a \min(0, x)$$ <3>

where a is a learnable slope coefficient applied to negative inputs.
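For reference, PReLU in its usual form passes positive inputs unchanged and scales negative inputs by a slope a that is learned during training. The sketch below fixes a = 0.25 purely for illustration; that value and the function name `prelu` are assumptions and are not taken from the invention.

```python
def prelu(x, a=0.25):
    """PReLU: identity for positive inputs, slope `a` for non-positive inputs."""
    return x if x > 0 else a * x


print([prelu(v) for v in (-2.0, 0.0, 3.0)])  # [-0.5, 0.0, 3.0]
```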

The technical scheme of the invention is further improved as follows: the operation of the pooling layer is shown in formula <4>:

$$y_j^{l} = f\left(\beta_j^{l}\, \mathrm{down}\left(y_j^{l-1}\right) + b_j^{l}\right)$$ <4>

where l is the current pooling layer, l-1 is the (l-1)-th pooling layer, y_j is the j-th output feature map, down(·) is the down-sampling function, β is a multiplicative bias term, and b is the bias;

the pooling method is max pooling, whose sampling rule is given by formula <5>:

$$\mathrm{down}(X_k) = \max_{x \in X_k} x, \quad k \in \{1, 2, \ldots, K\}$$ <5>

where a feature map obtained from the convolution layer is divided into K regions X_k, k ∈ {1, 2, …, K}.
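Max pooling over a one-dimensional feature map can be sketched as follows: the map is cut into regions by a sliding window, and only the largest value of each region is kept. The function name `max_pool1d` and the toy values are assumptions for illustration.

```python
def max_pool1d(feature_map, window, stride):
    """Keep the maximum of each window (region X_k) of a 1-D feature map."""
    return [
        max(feature_map[start:start + window])
        for start in range(0, len(feature_map) - window + 1, stride)
    ]


fmap = [0.1, 0.7, 0.3, 0.9, 0.2, 0.4]
print(max_pool1d(fmap, window=3, stride=1))  # [0.7, 0.9, 0.9, 0.9]
print(max_pool1d(fmap, window=3, stride=3))  # [0.7, 0.9]
```

With a stride of 1 the regions overlap and the output is only window - 1 points shorter than the input; with a stride equal to the window size the regions are disjoint and the length shrinks by roughly that factor.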

The technical scheme of the invention is further improved as follows: the fully connected part comprises a Flatten layer and two Dense layers, the activation function of the last Dense layer is Softmax, and a certain proportion of random deactivation (Dropout) is added to the fully connected layer, which is calculated according to formula <6>:

$$h_{\omega,b}(x) = f(\omega^{T} x + b)$$ <6>

where ω is the weight vector of the neuron, b is the bias, T denotes the transpose, f(·) is the activation function, and h_{ω,b}(x) is the output of the neuron.
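Formula <6> applied to a whole layer, followed by Softmax as used in the last Dense layer, can be sketched as below. The helper names `dense` and `softmax`, the weights and the input are made-up values for illustration only.

```python
import math


def softmax(z):
    """Turn raw scores into a probability distribution."""
    shift = max(z)                       # subtract the maximum for stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights, biases, activation):
    """weights[j] is the weight vector of neuron j; applies f(w^T x + b)."""
    z = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
         for w, b in zip(weights, biases)]
    return activation(z)


probs = dense([1.0, 2.0],
              [[0.5, -0.5], [0.0, 1.0], [1.0, 1.0]],   # three neurons
              [0.0, 0.0, 0.0],
              softmax)
predicted_class = probs.index(max(probs))
print(predicted_class)  # 2
```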

The technical scheme of the invention is further improved as follows: after the model is constructed, a training method is configured, comprising a loss function, an optimizer and an evaluation index. The loss function is the cross-entropy loss function shown in formula <7>, the optimizer is Adam, and the evaluation index is the accuracy A shown in formula <8>:

$$L = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$ <7>

$$A = \frac{n_i}{n}$$ <8>

where y_c is the one-hot encoded label for class c, ŷ_c is the corresponding Softmax output, C is the number of classes, n_i is the number of samples whose predicted class is the same as the actual class, and n is the total number of samples.
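The loss function and the evaluation index can be illustrated with a short sketch. It assumes the usual categorical cross entropy between a one-hot label and the Softmax output, and takes accuracy as the fraction of samples whose predicted class equals the actual class; the function names, the small constant `eps` and the toy values are assumptions.

```python
import math


def cross_entropy(one_hot, probs, eps=1e-12):
    """Categorical cross entropy for one sample (eps avoids log(0))."""
    return -sum(t * math.log(p + eps) for t, p in zip(one_hot, probs))


def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted class matches the actual class."""
    n_i = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return n_i / len(true_labels)


loss = cross_entropy([0, 0, 1, 0, 0], [0.05, 0.05, 0.8, 0.05, 0.05])
acc = accuracy([0, 1, 2, 3, 4, 2], [0, 1, 2, 3, 4, 1])
print(round(loss, 4))  # 0.2231
print(round(acc, 4))   # 0.8333
```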

The technical scheme of the invention is further improved as follows: on the basis of the constructed one-dimensional deep convolutional neural network model, the diesel grade is predicted for both the SMOTE-oversampled test set and the original test set to obtain the overall classification recognition rate; a multi-class confusion matrix is then drawn, from which the precision, recall, accuracy and balanced F score are obtained as shown in formulas <9>, <10>, <11> and <12>:

$$Precision = \frac{TP}{TP + FP}$$ <9>

$$Recall = \frac{TP}{TP + FN}$$ <10>

$$Accuracy = \frac{TP + TN}{TP + FN + FP + TN}$$ <11>

$$F = \frac{2 \times Precision \times Recall}{Precision + Recall}$$ <12>

where TP is the number of positive samples predicted as positive, FN is the number of positive samples predicted as negative, FP is the number of negative samples predicted as positive, and TN is the number of negative samples predicted as negative.
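For a multi-class confusion matrix, TP, FN, FP and TN can be read off for one class at a time by treating that class as positive and all others as negative. The sketch below assumes this one-vs-rest reading and uses a made-up 3-class matrix; the function names are hypothetical.

```python
def one_vs_rest_counts(cm, k):
    """cm[i][j] = number of class-i samples predicted as class j."""
    n = len(cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                          # row k, off-diagonal
    fp = sum(cm[i][k] for i in range(n)) - tp     # column k, off-diagonal
    tn = sum(map(sum, cm)) - tp - fn - fp
    return tp, fn, fp, tn


def metrics(tp, fn, fp, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    acc = (tp + tn) / (tp + fn + fp + tn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, acc, f_score


cm = [[8, 2, 0],
      [1, 9, 0],
      [0, 0, 10]]   # made-up example, rows = actual, columns = predicted
counts = one_vs_rest_counts(cm, 0)
result = metrics(*counts)
print(counts)                               # (8, 2, 1, 19)
print(tuple(round(v, 4) for v in result))   # (0.8889, 0.8, 0.9, 0.8421)
```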

Due to the adoption of the technical scheme, the invention has the technical progress that:

the near infrared spectrum diesel grade identification method based on SMOTE and deep learning greatly improves classification accuracy without complex operations such as manual feature extraction and dimension reduction, and, by addressing the class imbalance found in practice, improves the identification rate of minority-class samples. The model combining SMOTE and deep learning has strong applicability and extensibility, and facilitates the development of NIRS-based rapid detection systems that are accurate, simple to operate and portable.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a NIRS diagram for diesel fuel;

FIG. 3 is a distribution diagram of sample numbers of various categories of raw data;

FIG. 4 is a training set sample distribution graph after SMOTE oversampling processing;

FIG. 5 is a diagram of a NIRS-based one-dimensional depth convolution neural network classification model architecture;

FIG. 6 is a graph of a training set loss function variation;

FIG. 7 is a graph of training set accuracy change;

FIG. 8 is a diagram of a training set multi-class confusion matrix after SMOTE oversampling processing;

FIG. 9 is a diagram of a test set multi-class confusion matrix after SMOTE oversampling;

FIG. 10 is a diagram of an original test set multi-classification confusion matrix;

FIG. 11 is a comparison graph of the prediction accuracy of the XGBoost, SVM and BP methods.

Detailed Description

The present invention will be described in further detail with reference to the following examples:

referring to fig. 1, the specific implementation steps of the present invention are as follows:

step 1, drawing a diesel NIRS diagram, analyzing distribution conditions of different grades of diesel, performing attribute mapping on grade labels, and taking the different grades of the diesel as a sample set;

in the present embodiment, the details of the diesel grade sample set used are shown in table 1. There are 394 samples in total, and the grades -10#, -20#, -35#, -50# and interference are mapped to classes 1, 2, 3, 4 and 0, respectively. As the table shows, the number of samples differs between classes, and the class distribution is extremely uneven.

TABLE 1 sample details of Diesel grade data set

The NIRS of the diesel samples is shown in fig. 2. The spectral wavelength range is 750 nm to 1550 nm with an interval of 2 nm, giving 401 characteristic wavelength points in total. As the figure shows, the 394 spectra overlap and cannot be told apart; the spectral information is weak and the interference is strong, so the 5 classes cannot be accurately distinguished from the NIRS plot alone. The data are therefore balanced by the method of step 2 before modeling.
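The stated number of wavelength points follows directly from the range and the interval, as this small check shows (variable names are illustrative):

```python
start_nm, stop_nm, step_nm = 750, 1550, 2

# Both end points are included, hence stop_nm + 1.
wavelengths = list(range(start_nm, stop_nm + 1, step_nm))
print(len(wavelengths))                  # 401
print(wavelengths[0], wavelengths[-1])   # 750 1550
```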

Step 2, performing data equalization processing on the sample set by adopting an SMOTE method;

first, a cross-validation method is used to split the sample set automatically in a 7:3 ratio, giving 275 training set samples and 119 test set samples. To improve the generalization ability of the model and address the class imbalance, the training set is balanced with the SMOTE oversampling technique. The class distribution before SMOTE processing is shown in fig. 3, and the class distribution of the training set after SMOTE processing is shown in fig. 4. After processing, every class has the same number of samples, 184, so the training set grows from the original 275 samples to 920 samples. For later comparison, SMOTE is also applied to the test set to synthesize samples for the under-represented classes, and the resulting new test set contains 395 samples in total.
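The sample counts quoted above are mutually consistent, which can be checked with a few lines of arithmetic. The per-class test figure of 79 is an inference that assumes the oversampled test set is also perfectly balanced across the 5 classes; it is not stated in the text.

```python
n_total = 394
n_train = int(n_total * 0.7)          # 7:3 split, rounded down -> 275
n_test = n_total - n_train            # 119

n_classes = 5
per_class_train = 184                 # samples per class after SMOTE
train_after_smote = n_classes * per_class_train    # 920
synthetic_train = train_after_smote - n_train      # 645 synthesized samples

test_after_smote = 395
per_class_test = test_after_smote // n_classes     # 79, assuming a balanced set

print(n_train, n_test, train_after_smote, synthetic_train, per_class_test)
# 275 119 920 645 79
```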

Step 3, establishing an NIRS classification model of the one-dimensional deep convolution neural network by using a diesel training set sample;

in the present embodiment, the overall structure of the NIRS-based one-dimensional depth convolution neural network classification model is shown in fig. 5. The method comprises the following specific steps:

step 3.1: input layer. The input is the one-dimensional diesel NIRS signal, with input shape (401, 1).

Step 3.2: convolution layer. One-dimensional convolution is used, with a kernel size of 40 × 1, 16 kernels, a stride of 1 and the PReLU activation function. PReLU converges quickly, gives a low error rate, and effectively mitigates vanishing and exploding gradients.

Step 3.3: pooling layer. Max pooling is used, with a pooling window size of 3 × 1 and a stride of 1.

Step 3.4: convolution layer. The kernel size is 40 × 1, the number of kernels is 64, the stride is 1, and the activation function is PReLU.

Step 3.5: pooling layer. Max pooling is used, with a pooling window size of 3 × 1 and a stride of 1.

Step 3.6: Flatten layer, which converts the multi-dimensional input into one dimension for the transition to the fully connected layers.

Step 3.7: Dense layer. The number of neurons is 128 and the activation function is PReLU; to reduce the risk of overfitting, Dropout with a ratio of 0.1 is added.

Step 3.8: Dense layer. The number of neurons is 5, corresponding to the 5 output classes, and the activation function is Softmax.

Step 3.9: training configuration. Cross entropy is used as the loss function, the Adam optimizer is adopted, the accuracy A is used as the evaluation index, and the batch size is set to 16. The model is trained on the SMOTE-oversampled training set. The resulting training loss curve is shown in fig. 6: the loss decreases as training proceeds and finally approaches 0. The training accuracy curve is shown in fig. 7: the accuracy increases gradually as training proceeds and finally approaches 1. The training results show that the model performs well and that the one-dimensional deep convolutional neural network model for qualitative analysis of diesel grade has been successfully established.
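The layer sizes in steps 3.1 to 3.6 determine the length of the signal at each stage. The walk-through below assumes that no padding is used in the convolution and pooling layers, which the text does not state; with padding the lengths would differ. The helper name `out_len` is hypothetical.

```python
def out_len(length, window, stride=1):
    """Output length of a 1-D convolution or pooling layer without padding."""
    return (length - window) // stride + 1


length = 401                      # step 3.1: input shape (401, 1)
length = out_len(length, 40)      # step 3.2: conv, 16 kernels -> (362, 16)
conv1_len = length
length = out_len(length, 3)       # step 3.3: max pool         -> (360, 16)
length = out_len(length, 40)      # step 3.4: conv, 64 kernels -> (321, 64)
length = out_len(length, 3)       # step 3.5: max pool         -> (319, 64)
flattened = length * 64           # step 3.6: Flatten          -> 20416 values

print(conv1_len, length, flattened)  # 362 319 20416
```

Under this assumption the first Dense layer in step 3.7 receives 20416 inputs and maps them to 128 neurons, and the final Dense layer maps those 128 values to the 5 class probabilities.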

Here, it should be noted that:

in step 3.2 and step 3.4, the convolution layer consists of a set of convolution kernels with trainable parameters; the kernels slide over the input data according to a fixed rule and perform the convolution operation, extracting local abstract features of the spectrum and generating the corresponding one-dimensional feature maps.

In step 3.3 and step 3.5, the pooling layer down-samples the feature maps generated by the convolution operation to reduce the dimensionality of the feature vectors from the convolution layer; with the number of feature maps unchanged, reducing the data volume greatly speeds up the algorithm.

In step 3.6, the Flatten layer flattens the data so that it can be connected to the neurons in order.

In step 3.9, the cross entropy evaluates the difference between the probability distribution obtained by the current training and the true distribution, that is, the distance between the actual output probabilities and the expected output probabilities.

In this embodiment, based on the constructed one-dimensional deep convolutional neural network classification model, classification prediction is first performed on the training set data, giving a training set classification accuracy of 97.61%; for the SMOTE-oversampled test set the classification accuracy is 95.44%, and for the original 119 test set samples the classification accuracy is 95.80%. To observe the recognition rate of each class, especially that of the minority classes, a multi-class confusion matrix is drawn. The multi-class confusion matrix extends the traditional two-class confusion matrix with a one-to-one strategy; its rows represent the true classes of the data and its columns represent the predicted classes. The numbers on the main diagonal are therefore the numbers of samples whose predicted class matches the true class, and the off-diagonal numbers are the numbers of misclassified samples.
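The layout described above, with rows for true classes and columns for predicted classes, can be sketched as follows. The labels are made-up values for illustration and are not results from the embodiment; the function name `confusion_matrix` is chosen for the example.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """cm[t][p] counts samples of true class t predicted as class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        cm[t][p] += 1
    return cm


y_true = [0, 1, 1, 2, 2, 2, 3, 4, 4, 2]   # made-up labels
y_pred = [0, 1, 2, 2, 2, 2, 3, 4, 4, 1]
cm = confusion_matrix(y_true, y_pred, 5)

correct = sum(cm[k][k] for k in range(5))            # main diagonal
overall_accuracy = correct / len(y_true)
per_class_rate = [cm[k][k] / sum(cm[k]) for k in range(5)]

print(overall_accuracy)   # 0.8
print(per_class_rate)     # [1.0, 0.5, 0.75, 1.0, 1.0]
```

The per-class rate (diagonal entry divided by its row sum) is what reveals whether minority classes are recognized as well as the majority classes, which the overall accuracy alone does not show.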

The multi-class confusion matrices for the three cases above are shown in figs. 8, 9 and 10, respectively. The prediction accuracy of every class is relatively high for both the training set and the test sets, and for the original test set samples the recognition rate of the minority classes, namely class 0, class 1 and class 4, reaches 100%. According to the confusion matrix and formulas <9>, <10>, <11> and <12>, the precision of the prediction model is 98.67%, the recall is 100%, the accuracy is 95.80% and the F1 value is 0.9933. The model thus classifies diesel grades with high accuracy and generalizes well.
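The reported F1 value can be cross-checked against the usual definition of the F score as the harmonic mean of precision and recall. The check assumes that the 98.67% figure is the precision and the 100% figure is the recall:

```python
precision, recall = 0.9867, 1.0

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9933
```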

For comparison, the XGBoost ensemble learning method, an SVM and a BP neural network are applied to the same diesel NIRS data set.

Specifically, the XGBoost parameters are set as follows: the number of trees is 196, the maximum tree depth is 5, the minimum sum of leaf node weights is 1, the complexity control term gamma is 0.15, the L1 regularization weight is 0.08, the L2 regularization weight is 0.1, the random sampling ratio for each tree is 0.71, the column sampling ratio colsample_bytree is 0.69, the learning rate is 0.1, the weak learner is 'gbtree', the objective function is 'multi:softmax', the number of classes is 5, and the number of CPU threads is 4. With this model, the XGBoost grade classification recognition rate on the original test set diesel samples is 75.63%.
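One plausible way to write these settings as a parameter dictionary for the native XGBoost training API is sketched below. The mapping of each translated description to a parameter name is an assumption (for example, that the per-tree sampling ratio corresponds to `subsample` and the number of trees to the number of boosting rounds); only `colsample_bytree`, 'gbtree' and 'multi:softmax' are named in the text.

```python
# Assumed mapping of the described settings to XGBoost parameter names.
xgb_params = {
    "booster": "gbtree",
    "objective": "multi:softmax",
    "num_class": 5,
    "max_depth": 5,
    "min_child_weight": 1,
    "gamma": 0.15,
    "alpha": 0.08,             # L1 regularization weight
    "lambda": 0.1,             # L2 regularization weight
    "subsample": 0.71,         # row sampling ratio per tree (assumed mapping)
    "colsample_bytree": 0.69,
    "eta": 0.1,                # learning rate
    "nthread": 4,
}
num_boost_round = 196          # number of trees (assumed mapping)
```

Such a dictionary would typically be passed as `xgboost.train(xgb_params, dtrain, num_boost_round=num_boost_round)`, where `dtrain` is a hypothetical `DMatrix` built from the training spectra and labels.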

The SVM parameters are set as follows: the search range of the kernel parameter g and the penalty parameter c is [-10, 0.2], the RBF is used as the kernel function, and the parameters are optimized by cross-validated grid search. With this model, the SVM grade classification recognition rate on the original test set diesel samples is 78.99%, and the method runs somewhat slowly.

The BP neural network parameters are set as follows: the input layer has 275 nodes, the hidden layer has 9 nodes and the output layer has 5 nodes; Softmax converts the network output into a probability distribution, cross entropy is used as the loss function, and the learning rate is 0.05. With this model, the BP grade classification recognition rate on the original test set diesel samples is 69.75%; because all neurons in adjacent layers of a traditional BP neural network are fully connected, the model runs very slowly.

The classification results of the methods are plotted in fig. 11. Combining the SMOTE oversampling technique with the one-dimensional deep convolutional neural network proposed by the invention clearly gives a much higher diesel grade classification recognition rate than the comparison methods.

In summary, combining the SMOTE oversampling technique with a one-dimensional deep convolutional neural network not only addresses the class imbalance found in practice, but also avoids the complex preprocessing, such as denoising, feature selection and dimension reduction, required by traditional NIRS modeling methods. The overall classification recognition rate of the diesel grade is improved, the recognition rate of minority-class samples is greatly improved, and the model generalizes well and is practical. Intelligent identification of diesel grades based on deep-learning NIRS modeling replaces tedious manual identification and saves manpower and material resources. The method also has good application prospects in the field of NIRS qualitative analysis.

The above-mentioned embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solution of the present invention by those skilled in the art should fall within the protection scope defined by the claims of the present invention without departing from the spirit of the present invention.

Claims

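1. A near infrared spectrum diesel grade identification method based on SMOTE and deep learning, characterized by comprising the following steps:

step 1, drawing a diesel near-infrared spectrogram, analyzing distribution conditions of different grades of diesel, performing attribute mapping on grade labels, and taking the different grades of the diesel as a sample set;

step 2, carrying out data equalization processing on the sample set by adopting SMOTE, and dividing the sample set into a training set sample and a test set sample;

step 3, constructing a near infrared spectrum classification model of the one-dimensional deep convolution neural network by using the training set sample;

and step 4, bringing the test set samples into the established model to obtain the diesel grade identification result, drawing a multi-classification confusion matrix, and analyzing the identification rate of each class.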

2. The SMOTE and deep learning based near infrared spectrum diesel grade identification method according to claim 1, characterized in that: a near infrared spectrum plot is drawn from the diesel sample set, and the samples are divided by condensation point into 5 classes, namely -10#, -20#, -35#, -50# and interference, which are mapped to class attributes 1, 2, 3, 4 and 0, respectively.

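3. The SMOTE and deep learning based near infrared spectrum diesel grade identification method according to claim 2, characterized in that: the specific process of performing data equalization processing on the sample set by using SMOTE is as follows:

1): first, for each sample x in the minority class, calculate the Euclidean distance to all other samples in the minority class sample set to obtain the k nearest neighbors of x;

2): set a sampling ratio according to the sample imbalance ratio to determine the sampling multiplier N, and for each minority-class sample x randomly select several samples from its k nearest neighbors, each selected neighbor being denoted x_n;

3): for each selected neighbor x_n, construct a new sample from the original sample x according to formula <1>;

$$x_{new} = x + \mathrm{rand}(0,1) \times |x - x_n|, \quad new \in \{1, 2, \ldots, N\}$$ <1>

4): finally, repeat the above steps N times to synthesize N new samples; if the minority class has T samples in total, NT new samples can be synthesized.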

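4. The SMOTE and deep learning based near infrared spectrum diesel grade identification method according to claim 3, characterized in that: the specific method for constructing the near infrared spectrum classification model of the one-dimensional deep convolutional neural network comprises the following steps:

the one-dimensional near infrared spectrum data are transformed so that the input signal meets the requirements of a convolutional neural network: the one-dimensional spectrum is regarded as a special two-dimensional image containing only one row or one column, the spectral signal is expanded in dimension accordingly, and the class labels are converted to one-hot encoding; a one-dimensional deep convolutional neural network model is then constructed with reference to LeNet-5, comprising an input layer, two convolution layers, two pooling layers, two fully connected layers and an output layer.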

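5. The SMOTE and deep learning based near infrared spectrum diesel grade identification method according to claim 4, characterized in that: the convolution layer consists of a group of convolution kernels with trainable parameters, the kernel size is set to m × 1, and the convolution operation on the one-dimensional signal is shown in formula <2>:

$$y_j^{l} = f\left(\sum_i x_i^{l-1} * \omega_{ij}^{l} + b_j^{l}\right)$$ <2>

where l is the current convolution layer, l-1 is the (l-1)-th convolution layer, x_i and y_j denote the i-th input feature map and the j-th output feature map respectively, * is the convolution operator, ω_ij is the convolution kernel, b is the bias, and f(·) is the activation function.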

6. The SMOTE and deep learning based near infrared spectrum diesel grade identification method according to claim 4, characterized in that: an activation function PReLU is introduced into the convolution layer, and its expression is shown in formula <3>:

$$f(x) = \max(0, x) + a \min(0, x)$$ <3>

where a is a learnable slope coefficient applied to negative inputs.

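7. The SMOTE and deep learning based near infrared spectrum diesel grade identification method according to claim 4, characterized in that: the operation of the pooling layer is shown in formula <4>:

$$y_j^{l} = f\left(\beta_j^{l}\, \mathrm{down}\left(y_j^{l-1}\right) + b_j^{l}\right)$$ <4>

where l is the current pooling layer, l-1 is the (l-1)-th pooling layer, y_j is the j-th output feature map, down(·) is the down-sampling function, β is a multiplicative bias term, and b is the bias;

the pooling method is max pooling, whose sampling rule is given by formula <5>:

$$\mathrm{down}(X_k) = \max_{x \in X_k} x, \quad k \in \{1, 2, \ldots, K\}$$ <5>

where a feature map obtained from the convolution layer is divided into K regions X_k, k ∈ {1, 2, …, K}.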

8. The SMOTE and deep learning based near infrared spectrum diesel grade identification method according to claim 4, characterized in that: the fully connected part comprises a Flatten layer and two Dense layers, the activation function of the last Dense layer is Softmax, and a certain proportion of random deactivation (Dropout) is added to the fully connected layer, which is calculated according to formula <6>:

$$h_{\omega,b}(x) = f(\omega^{T} x + b)$$ <6>

where ω is the weight vector of the neuron, b is the bias, T denotes the transpose, f(·) is the activation function, and h_{ω,b}(x) is the output of the neuron.

9. The SMOTE and deep learning based near infrared spectrum diesel grade identification method according to claim 4, characterized in that: after the model is constructed, a training method is configured, comprising a loss function, an optimizer and an evaluation index; the loss function is the cross-entropy loss function shown in formula <7>, the optimizer is Adam, and the evaluation index is the accuracy A shown in formula <8>:

$$L = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$ <7>

$$A = \frac{n_i}{n}$$ <8>

where y_c is the one-hot encoded label for class c, ŷ_c is the corresponding Softmax output, C is the number of classes, n_i is the number of samples whose predicted class is the same as the actual class, and n is the total number of samples.

10. The SMOTE and deep learning based near infrared spectrum diesel grade identification method according to claim 4, characterized in that: on the basis of the constructed one-dimensional deep convolutional neural network model, the diesel grade is predicted for both the SMOTE-oversampled test set and the original test set to obtain the overall classification recognition rate; a multi-class confusion matrix is then drawn, from which the precision, recall, accuracy and balanced F score are obtained as shown in formulas <9>, <10>, <11> and <12>:

$$Precision = \frac{TP}{TP + FP}$$ <9>

$$Recall = \frac{TP}{TP + FN}$$ <10>

$$Accuracy = \frac{TP + TN}{TP + FN + FP + TN}$$ <11>

$$F = \frac{2 \times Precision \times Recall}{Precision + Recall}$$ <12>

where TP is the number of positive samples predicted as positive, FN is the number of positive samples predicted as negative, FP is the number of negative samples predicted as positive, and TN is the number of negative samples predicted as negative.