CN110672552A - Confidence coefficient estimation method for vehicle fuel oil near infrared spectrum detection result

Info

Publication number: CN110672552A
Application number: CN201910910971.2A
Authority: CN
Inventors: 熊智新; 张肖雪; 杨冲; 赵静远
Original assignee: Nanjing Forestry University
Current assignee: Nanjing Forestry University
Priority date: 2019-09-25
Filing date: 2019-09-25
Publication date: 2020-01-10
Anticipated expiration: 2039-09-25
Also published as: CN110672552B

Abstract

The invention provides a confidence estimation method for near-infrared spectroscopy detection results of vehicle fuel. The Mahalanobis distance is calculated on the basis of principal component analysis, a significance level is then obtained from the F distribution, and the confidence of the detection result is estimated for each sample to be tested. Because the statistic derived from the Mahalanobis distance follows an F distribution, the method provides a confidence estimate of the reliability of the detection result for practical applications of near-infrared spectral analysis, and gives a quantitative basis for the subsequent qualitative diagnosis of the analysed object or for evaluating the validity of quantitative analysis results.

Description

Confidence coefficient estimation method for vehicle fuel oil near infrared spectrum detection result

Technical Field

The invention relates to near-infrared spectral anomaly detection for vehicle fuel, and in particular to a method for estimating the confidence of near-infrared detection results based on the Mahalanobis distance.

Background

Near-infrared spectra arise from the combination-band and overtone absorptions of the stretching vibrations of hydrogen-containing groups in organic molecules. Using chemometric methods, a linear or nonlinear relationship can be established between the spectrum and a quality index, so that qualitative and quantitative analysis of a sample is completed quickly and efficiently, overcoming the complicated procedures, high cost, and low efficiency of traditional oil analysis techniques.

In recent years, near-infrared spectroscopy has been applied ever more widely and maturely to measuring the content of various components of oil products, improving production management and quality supervision. During acquisition of near-infrared spectra, abnormal spectral data can arise from changes in sample properties, changes in experimental conditions, instrument measurement errors, operator errors, and similar factors; abnormal spectra distort the characteristics of the data and thereby reduce the reliability of the detection result. Identifying and rejecting abnormal samples is therefore a necessary condition for building a reliable near-infrared analysis model. Common methods for eliminating abnormal sample points include the Mahalanobis distance (MD), the leverage method, and Monte Carlo cross-validation. In actual rapid detection of oil products, however, differences among the spectra of the same oil product are often unavoidable, given the different production processes of different refineries and the possibility of adulteration. Simply rejecting abnormal samples is therefore often undesirable; enterprises need a suitable judgment standard (for example, a confidence of not less than 80%) for judging and screening samples, which is important for the rapid detection of oil quality indices and for further diagnostic analysis.

However, in the field of near-infrared spectroscopy there is not yet a mature method for estimating the confidence of detection results from spectral data. In the field of process control, the square of the Mahalanobis distance calculated on the basis of principal component analysis (PCA-MD) is equivalent to the Hotelling T² statistic, so PCA-MD is commonly used for the T² test: by comparing the square of the PCA-MD value with the T² control limit, one can judge whether the sample under test is in a normal state. On this basis, because the statistic derived from T² follows an F distribution, the significance level of a sample can be calculated. Borrowing this data-distribution idea, the significance level of a sample can be obtained from the squared PCA-MD value of its near-infrared spectral data, and the confidence of the detection result can then be estimated.

Disclosure of Invention

For the rapid detection of oil products, the invention provides a method for estimating the confidence of near-infrared detection results based on the Mahalanobis distance, so that in practical applications whether a sample is acceptable can be judged conveniently and effectively from its confidence, and the reliability of the analysis result is ensured.

The method calculates the Mahalanobis distance on the basis of principal component analysis, obtains a significance level from the F distribution, and thereby estimates the confidence of the detection result of each sample to be tested.

The method is implemented by the following steps:

S1, standardizing the spectral data to obtain the calibration set spectral data X;

S2, applying the PCA-MD T² test to detect and remove abnormal samples in the calibration set, so that the calibration set spectral data X contain only normal samples;

S3, performing PCA decomposition on the calibration set spectral data X and, in combination with the test set spectral data X_test, calculating the squared Mahalanobis distance between each sample to be tested and the calibration set samples;

S4, given that the statistic derived from the squared Mahalanobis distance follows an F distribution, calculating the significance level α_test and then obtaining the confidence c_test of the near-infrared detection result.
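Step S1 can be illustrated with a minimal sketch. The patent does not say which standardization is used, so the code below assumes column-wise autoscaling (mean-centring and division by the sample standard deviation), with the statistics estimated on the calibration set and reused for the test set. The function names `fit_standardizer` and `standardize` and the toy data are hypothetical.

```python
import math

def fit_standardizer(X):
    """Column means and sample standard deviations of the calibration spectra."""
    n = len(X)
    m = len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / (n - 1))
        for j in range(m)
    ]
    return means, stds

def standardize(X, means, stds):
    """Centre and scale each wavelength column; constant columns are only centred."""
    return [
        [(x - mu) / sd if sd > 0 else x - mu for x, mu, sd in zip(row, means, stds)]
        for row in X
    ]

# Hypothetical toy data: 4 calibration spectra and 1 test spectrum, 3 wavelength points.
X_cal = [[1.0, 10.0, 0.5], [2.0, 12.0, 0.5], [3.0, 14.0, 0.5], [4.0, 16.0, 0.5]]
X_new = [[2.5, 13.0, 0.7]]

means, stds = fit_standardizer(X_cal)
X = standardize(X_cal, means, stds)        # calibration set spectral data X
X_test = standardize(X_new, means, stds)   # test set scaled with calibration statistics
```

Reusing the calibration statistics for the test set keeps both sets in the same coordinate system, which the later projection of X_test onto the calibration loadings relies on.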

Step S2 includes:

S21: the PCA decomposition of the calibration set spectral data can be expressed as:

$$X = T P^{T} \tag{1}$$

where T ∈ R^{n×p} is the score matrix, n is the number of samples, p is the number of principal components, P ∈ R^{m×p} is the loading matrix, and m is the number of variables;

S22: the square of the PCA-MD value of the i-th sample of the calibration set spectral data can be expressed as:

$$T_i^2 = t_i \Sigma^{-1} t_i^{T} \tag{2}$$

where t_i is the i-th row vector of the score matrix T, and Σ is the covariance matrix of T;

S23: the T² control limit is given by formula (3), which is obtained from the F distribution with degrees of freedom p and n-p at the significance level α; the confidence of the control limit is 1-α. If the squared PCA-MD value of a sample is less than the control limit, the sample is judged to be a normal sample; if it is greater than the control limit, the sample is judged to be an abnormal sample.
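The calculation in S22 and the comparison in S23 can be sketched as follows. The sketch assumes the scores come from PCA on mean-centred data, in which case the score columns have zero mean and are mutually uncorrelated, so Σ is diagonal and formula (2) reduces to a sum of squared scores divided by their variances. The control limit is passed in as a plain number because formula (3) is not reproduced here; the value 2.0 and the toy score matrix are hypothetical.

```python
def score_variances(T):
    """Variance of each score column, i.e. the diagonal of Sigma (scores assumed zero-mean)."""
    n = len(T)
    p = len(T[0])
    return [sum(row[j] ** 2 for row in T) / (n - 1) for j in range(p)]

def t_squared(T, variances):
    """Formula (2) with a diagonal Sigma: sum over components of t_ij^2 / variance_j."""
    return [sum(t * t / v for t, v in zip(row, variances)) for row in T]

def split_by_limit(t2_values, limit):
    """Indices of normal samples (T^2 not above the limit) and abnormal samples (above it)."""
    normal = [i for i, v in enumerate(t2_values) if v <= limit]
    abnormal = [i for i, v in enumerate(t2_values) if v > limit]
    return normal, abnormal

# Hypothetical zero-mean, uncorrelated scores for n = 6 samples and p = 2 components.
T = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0], [4.0, 0.0], [-4.0, 0.0]]

variances = score_variances(T)
t2 = t_squared(T, variances)
limit = 2.0  # hypothetical control limit; in the method it comes from formula (3)
normal, abnormal = split_by_limit(t2, limit)
```

A useful sanity check on any implementation is that the T² values of the calibration samples sum to p(n-1) when Σ is estimated from the same scores with the n-1 denominator.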

Step S3 includes:

S31: after the abnormal samples are removed, PCA decomposition is performed again on the calibration set spectral data X according to formula (1), and the loading matrix P and the covariance matrix Σ of the score matrix T are updated;

S32: the score matrix of the spectral data of the sample set to be tested is calculated as:

$$T_{test} = X_{test} P \tag{4}$$

where X_test is the sample set to be tested and P is the loading matrix of the calibration set;

S33: the squared Mahalanobis distance between the i-th sample to be tested and the calibration set can be expressed as:

$$T_{\text{test-}i}^2 = t_{\text{test-}i} \Sigma^{-1} t_{\text{test-}i}^{T} \tag{5}$$

where t_test-i is the i-th row vector of the score matrix T_test of the sample set to be tested.
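Formulas (4) and (5) amount to a projection followed by a weighted sum of squares. The sketch below assumes that the test spectra have already been standardized with the calibration statistics and that Σ is diagonal (as it is for PCA scores of the calibration set), so only the score variances are needed. The loading matrix, variances, and test spectra are hypothetical toy values.

```python
def project(X_test, P):
    """Formula (4): T_test = X_test P, where P has one row per variable and one column per component."""
    p = len(P[0])
    return [
        [sum(x * P[k][j] for k, x in enumerate(row)) for j in range(p)]
        for row in X_test
    ]

def t_squared_test(T_test, variances):
    """Formula (5) with a diagonal Sigma taken from the calibration set."""
    return [sum(t * t / v for t, v in zip(row, variances)) for row in T_test]

# Hypothetical example: m = 3 variables, p = 2 retained components.
P = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # orthonormal loading columns
variances = [4.0, 1.0]                     # calibration score variances (diagonal of Sigma)
X_test = [[0.0, 0.0, 0.3], [2.0, -1.0, 0.1]]

T_test = project(X_test, P)
t2_test = t_squared_test(T_test, variances)
```

The covariance used for the test samples is the one estimated on the cleaned calibration set, so the distance measures how far each test sample lies from the calibration population rather than from the other test samples.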

Step S4 includes:

S41: the significance level α_test is obtained from the F distribution with degrees of freedom p and n-p; for the i-th test set sample, the significance level α_test-i is obtained from its squared Mahalanobis distance according to formula (6);

S42: the confidence c_test-i of the i-th test set sample can be expressed as:

$$c_{\text{test-}i} = 1 - \alpha_{\text{test-}i} \tag{7}$$
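Step S4 can be sketched under two stated assumptions, since formula (6) is not reproduced here. First, the squared distance is scaled to an F(p, n-p) statistic with the factor n(n-p)/(p(n²-1)) that is commonly used for new observations in PCA-based monitoring; the patent's own factor may differ. Second, α_test-i is taken as the cumulative F probability at that statistic, so that the confidence 1-α falls as the distance grows, which matches the trend reported in the examples. The F distribution function is written with the standard library only; a library routine such as `scipy.stats.f.cdf` would normally be used instead.

```python
import math

def _betacf(a, b, x, max_iter=300, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # One even and one odd step of the continued fraction per iteration.
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def betainc_reg(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b

def f_cdf(x, d1, d2):
    """Cumulative distribution function of the F distribution with (d1, d2) degrees of freedom."""
    if x <= 0.0:
        return 0.0
    return betainc_reg(d1 / 2.0, d2 / 2.0, d1 * x / (d1 * x + d2))

def confidence(t2, n, p):
    """Confidence c = 1 - alpha of one test sample (formula (7))."""
    # ASSUMPTION: scaling of T^2 to an F(p, n-p) statistic; not taken from the patent.
    f_stat = t2 * n * (n - p) / (p * (n * n - 1))
    # ASSUMPTION: alpha is the cumulative probability, so confidence is the upper tail.
    alpha = f_cdf(f_stat, p, n - p)
    return 1.0 - alpha

n, p = 79, 3   # calibration size and component count used in the examples
for t2 in (0.5, 3.0, 8.0, 20.0):   # hypothetical squared distances
    print(t2, round(confidence(t2, n, p), 4))
```

Under these assumptions a sample lies above the T² control limit at significance level α exactly when its confidence is below α, so screening against an enterprise threshold such as 80% is simply a stricter version of the same test.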

The beneficial effect of the method is that, because the statistic derived from the Mahalanobis distance follows an F distribution, a confidence estimate of the reliability of the detection result is provided for practical applications of near-infrared spectral analysis, and a quantitative basis is provided for the subsequent qualitative diagnosis of the analysed object or for evaluating the validity of quantitative analysis results.

Drawings

FIG. 1 is a flow chart of the PCA-MD-based confidence quantification method for near-infrared abnormal spectra;

FIG. 2 is a line graph of the PCA-MD T² test used to detect and remove abnormal samples in the calibration set;

FIG. 3 is a line graph of the sample confidence estimates for the near-infrared spectra of diesel-blended gasoline;

FIG. 4 is a line graph of the sample confidence estimates for simulation case 1;

FIG. 5 is a line graph of the sample confidence estimates for simulation case 2.

Detailed description of the preferred embodiments

The technical scheme adopted by the method for estimating the confidence of near-infrared detection results of oil products comprises steps S1 to S4, with sub-steps S21 to S23, S31 to S33, S41 and S42, as set out in the Disclosure of Invention above. In step S23, the significance level α is typically set to 0.01 or 0.05.

Example 1:

The detection of gasoline samples blended with a certain percentage of diesel is taken as an example. Spectra were collected from diesel and gasoline samples provided by the main oil refineries in Jinan, Shandong, using a Thermo Fisher Antaris II near-infrared spectrometer, and the calibration set was assembled from them. The near-infrared spectra of the blended gasoline were collected at the same time as the test set.

The method is described in further detail below with reference to FIG. 1, using a MATLAB simulation:

The first step: the calibration set and test set are divided and the data are standardized. The calibration set comprises 81 near-infrared spectra of pure gasoline; the test set comprises 1 pure gasoline spectrum and 10 spectra of gasoline blended with different amounts of diesel, with diesel contents of 5.26%, 5.88%, 8.33%, 9.09%, 10%, 11.11%, 12.5%, 14.29%, 16.67% and 20%, respectively.

The second step: PCA model decomposition is performed on the calibration set spectral data to calculate the squared Mahalanobis distance, and the presence of abnormal sample points is then judged against the T² control limit (α set to 0.05). Because the gasoline near-infrared spectral data contain a very large number of variables (wavelength points), the first 6 principal components are examined here to analyse their variance contribution rates. As can be seen from Table 1, the cumulative variance contribution rate no longer increases significantly once the number of principal components exceeds 3, so 3 principal components are selected to calculate the MD values and remove abnormal values. In FIG. 2, the red dotted line represents the control limit at 95% confidence; samples 53 and 54 clearly exceed the control limit and are therefore judged to be abnormal samples. After the abnormal samples are removed, the calibration set contains 79 samples.

Table 1. Influence of the number of principal components of the PCA model on the variance contribution rate and the cumulative variance contribution rate
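The selection rule used in the second step (stop adding principal components once the cumulative variance contribution rate no longer rises significantly) can be sketched as below. The eigenvalues are hypothetical and are not the values of Table 1, and the 1% gain threshold is an assumption, since the text does not quantify "significantly".

```python
def contribution_rates(eigenvalues, total_variance):
    """Variance contribution rate of each component and the running cumulative rate."""
    rates = [ev / total_variance for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for r in rates:
        running += r
        cumulative.append(running)
    return rates, cumulative

def choose_components(cumulative, min_gain=0.01):
    """Smallest component count after which one more component gains less than min_gain."""
    for k in range(1, len(cumulative)):
        if cumulative[k] - cumulative[k - 1] < min_gain:
            return k
    return len(cumulative)

# Hypothetical eigenvalues of the first 6 principal components (not the data of Table 1).
eigenvalues = [60.0, 25.0, 10.0, 0.5, 0.3, 0.2]
total_variance = 100.0   # hypothetical total variance over all components

rates, cumulative = contribution_rates(eigenvalues, total_variance)
p = choose_components(cumulative)   # ASSUMPTION: a gain below 1% counts as "not significant"
```

With these toy numbers the fourth component adds only 0.5% of the variance, so 3 components are retained, mirroring the choice made in the example.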

The third step: PCA decomposition is performed on the calibration set from which the abnormal samples have been removed, and the loading matrix P and the covariance matrix Σ of the score matrix T are updated; then, in combination with the test spectral data X_test, the score matrix T_test of the test set spectral data and the squared Mahalanobis distance between each test set sample and the calibration set samples are calculated.

As can be seen from Table 2, after the abnormal samples in the calibration set are removed, the variance contribution rates and cumulative variance contribution rates of the principal components of the PCA model change slightly; by the criterion that the cumulative variance contribution rate no longer rises significantly, 3 principal components are still selected to update the loading matrix P and the covariance matrix Σ of the score matrix T.

Table 2. Influence of the number of principal components of the PCA model on the variance contribution rate and the cumulative variance contribution rate (after removal of abnormal samples)

The fourth step: since the statistic derived from the squared Mahalanobis distance follows an F distribution with degrees of freedom 3 and 76, the significance level α_test and the confidence c_test are calculated. As shown in FIG. 3, the confidence of the pure gasoline near-infrared spectrum sample in the test set is above 90%; after 5.26% diesel is blended in, the confidence of the sample drops sharply to about 20%; for diesel contents of 5.88% to 11.11%, the confidence shows no obvious downward trend but remains within 20% to 35% in all cases, which may be caused by uneven mixing of the samples or by evaporation of part of the gasoline during actual operation; when the diesel content exceeds 11.11%, the confidence of the sample gradually falls from about 30% to about 1%.

Example 2:

The blending of diesel into gasoline is simulated by combining single spectra in given proportions. Spectra were collected from diesel and gasoline samples provided by the main oil refineries in Jinan, Shandong, using a Thermo Fisher Antaris II near-infrared spectrometer. The calibration set is the 81 pure gasoline near-infrared spectra of Example 1; each of the 11 test set samples is formed by adding 1 gasoline spectrum and 1 diesel spectrum in a specific proportion, with diesel contents of 0% (pure gasoline), 2%, 4%, 6%, 8%, 10%, 12%, 14%, 16%, 18% and 20%, respectively.
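The construction of the simulated test set can be sketched as a weighted sum of one gasoline spectrum and one diesel spectrum. The text says only that the two spectra are added "in a specific proportion", so the convex combination below (weights 1-w and w for diesel fraction w) is an assumption, and the three-point spectra are hypothetical.

```python
def mix_spectra(gasoline, diesel, diesel_fraction):
    """Simulated spectrum of gasoline blended with the given fraction of diesel."""
    if not 0.0 <= diesel_fraction <= 1.0:
        raise ValueError("diesel_fraction must lie between 0 and 1")
    if len(gasoline) != len(diesel):
        raise ValueError("spectra must share the same wavelength grid")
    w = diesel_fraction
    # ASSUMPTION: absorbances combine linearly in proportion to the blend ratio.
    return [(1.0 - w) * g + w * d for g, d in zip(gasoline, diesel)]

# Hypothetical absorbance values at 3 wavelength points.
gasoline = [0.20, 0.40, 0.10]
diesel = [0.60, 0.20, 0.30]

fractions = [k / 100 for k in range(0, 21, 2)]   # 0%, 2%, ..., 20% diesel
test_set = [mix_spectra(gasoline, diesel, w) for w in fractions]
```

The first simulated sample equals the pure gasoline spectrum, and each further sample moves a fixed step toward the diesel spectrum, which is consistent with the smooth confidence curve described for this example.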

The first step: the calibration set and test set data are standardized.

The second step: PCA model decomposition is performed on the calibration set spectral data to calculate the squared Mahalanobis distance, and the presence of abnormal sample points is then judged against the T² control limit (α set to 0.05). Because the calibration set is the same as in Example 1, 3 principal components are again selected to calculate the MD values of the calibration set and remove abnormal values; after the abnormal samples are removed, the calibration set contains 79 samples.

The third step: PCA decomposition is performed on the calibration set from which the abnormal samples have been removed. As shown in Table 2, after the abnormal samples in the calibration set are removed, the variance contribution rates and cumulative variance contribution rates of the principal components of the PCA model change slightly; by the criterion that the cumulative variance contribution rate no longer rises significantly, 3 principal components are still selected to update the loading matrix P and the covariance matrix Σ of the score matrix T.

The fourth step: since the statistic derived from the squared Mahalanobis distance follows an F distribution with degrees of freedom 3 and 76, the significance level α_test and the confidence c_test are calculated. As shown in FIG. 4, as the proportion of blended diesel increases, the confidence of the simulated test set samples follows a smooth decreasing curve; when the diesel proportion is 2% to 10% the confidence falls rapidly, dropping below 50% at a proportion of 6%; as the diesel proportion continues to increase, the rate of decline of the confidence gradually slows.

Example 3:

The blending of diesel into gasoline is simulated by combining multiple spectra in given proportions. Spectra were collected from diesel and gasoline samples provided by the main oil refineries in Jinan, Shandong, using a Thermo Fisher Antaris II near-infrared spectrometer. The calibration set is the 81 pure gasoline near-infrared spectra of Example 1; each of the 11 test set samples is obtained by adding several gasoline spectra and diesel spectra and averaging, with diesel contents of 0%, 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45% and 50%, respectively.

The second step: PCA model decomposition is performed on the calibration set spectral data to calculate the squared Mahalanobis distance, and the presence of abnormal sample points is then judged against the T² control limit (α set to 0.05). Because the calibration set is the same as in Example 1, 3 principal components are again selected to calculate the MD values of the calibration set and remove abnormal values; after the abnormal samples are removed, the calibration set contains 79 samples.

The fourth step: since the statistic derived from the squared Mahalanobis distance follows an F distribution with degrees of freedom 3 and 76, the significance level α_test and the confidence c_test are calculated. As can be seen from FIG. 5, the confidence of the test set samples falls overall as the proportion of blended diesel increases. When the diesel proportion is 5% to 15%, the confidence falls fastest; when it is 15% to 25%, the rate of decline gradually slows; when it is 25% to 50%, the confidence is close to 0 and shows no obvious change.

According to the 3 examples, the confidence of the test samples falls overall as the proportion of diesel blended into the gasoline increases, which demonstrates the effectiveness of using the data distribution to judge whether a near-infrared detection result is abnormal. The sample significance level and confidence estimate provided by the Mahalanobis distance and the F distribution can be compared with an enterprise's judgment standard: if the confidence is not less than the standard, the near-infrared detection result of the sample is considered normal; if it is less than the standard, the sample is considered suspicious and its quality index needs to be determined further. The method thus effectively safeguards the reliability of near-infrared spectral detection results.

The above embodiments are provided only to illustrate the present invention and not to limit it, and those skilled in the art may make various changes or modifications without departing from the spirit and scope of the present invention.

The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the foregoing description only for the purpose of illustrating the principles of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the invention as defined by the appended claims, specification and equivalents thereof.

Claims

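1. A confidence estimation method for near-infrared spectroscopy detection results of vehicle fuel, characterized by comprising the following steps:

S1, standardizing the spectral data to obtain calibration set spectral data X;

S2, applying the PCA-MD T² test to detect and remove abnormal samples in the calibration set spectral data X, so that the calibration set spectral data X contain only normal samples;

S3, performing PCA decomposition on the calibration set spectral data X and, in combination with the test set spectral data X_test, calculating the squared Mahalanobis distance between each sample to be tested and the calibration set samples;

S4, given that the statistic derived from the squared Mahalanobis distance follows an F distribution, calculating the significance level α_test and then obtaining the confidence c_test of the near-infrared detection result.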

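2. The method for estimating the confidence of near-infrared spectroscopy detection results of vehicle fuel according to claim 1, wherein step S2 includes:

S21: the PCA decomposition of the calibration set spectral data is expressed as:

$$X = T P^{T} \tag{1}$$

where T ∈ R^{n×p} is the score matrix, n is the number of samples, p is the number of principal components, P ∈ R^{m×p} is the loading matrix, and m is the number of variables;

S22: the square of the PCA-MD value of the i-th sample of the calibration set spectral data is expressed as:

$$T_i^2 = t_i \Sigma^{-1} t_i^{T} \tag{2}$$

where t_i is the i-th row vector of the score matrix T, and Σ is the covariance matrix of T;

S23: the T² control limit is given by formula (3), which is obtained from the F distribution with degrees of freedom p and n-p at the significance level α, the confidence of the control limit being 1-α; if the squared PCA-MD value of a sample is less than the control limit, the sample is judged to be a normal sample; if it is greater than the control limit, the sample is judged to be an abnormal sample.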

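3. The method for estimating the confidence of near-infrared spectroscopy detection results of vehicle fuel according to claim 2, wherein step S3 includes:

S31: after the abnormal samples are removed, PCA decomposition is performed again on the calibration set spectral data X according to formula (1), and the loading matrix P and the covariance matrix Σ of the score matrix T are updated;

S32: the score matrix of the spectral data of the sample set to be tested is calculated as:

$$T_{test} = X_{test} P \tag{4}$$

where X_test is the sample set to be tested and P is the loading matrix of the calibration set;

S33: the squared Mahalanobis distance between the i-th sample to be tested and the calibration set is expressed as:

$$T_{\text{test-}i}^2 = t_{\text{test-}i} \Sigma^{-1} t_{\text{test-}i}^{T} \tag{5}$$

where t_test-i is the i-th row vector of the score matrix T_test of the sample set to be tested.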

4. The method for estimating the confidence of near-infrared spectroscopy detection results of vehicle fuel according to claim 3, wherein step S4 includes:

S41: the significance level α_test is obtained from the F distribution with degrees of freedom p and n-p; for the i-th test set sample, the significance level α_test-i is obtained from its squared Mahalanobis distance according to formula (6);

S42: the confidence c_test-i of the i-th test set sample is expressed as:

$$c_{\text{test-}i} = 1 - \alpha_{\text{test-}i} \tag{7}$$

5. The method for estimating the confidence of near-infrared spectroscopy detection results of vehicle fuel according to any one of claims 2 to 4, characterized in that the significance level α is set to 0.01 or 0.05.