CN116559110A - Self-adaptive near infrared spectrum transformation method based on correlation and Gaussian curve fitting

Info

Publication number: CN116559110A
Application number: CN202310236906.2A
Authority: CN
Inventors: 康守强; 赵瑞凡; 薛原; 沈涛
Original assignee: Harbin University of Science and Technology
Current assignee: Harbin University of Science and Technology
Priority date: 2023-03-13
Filing date: 2023-03-13
Publication date: 2023-08-08

Abstract

A self-adaptive near infrared spectrum transformation method based on correlation and Gaussian curve fitting relates to the technical field of near infrared spectrum transformation. The invention aims at the problem that the quantitative analysis of the substances to be detected generates larger error due to the overlapping of the spectrum peaks of the near infrared spectrum. The method comprises the steps of determining the number of Gaussian functions participating in Gaussian curve fitting by using discrete points of near infrared spectrums of an object to be detected, and determining the center position of the Gaussian functions by using the wavelength positions of the discrete points; using correlation analysis, an optimal gaussian bandwidth is determined that facilitates extraction of overlapping information. Based on the above, a curve fitting equation set is constructed, the height of the Gaussian function is determined by solving the equation set, and the area integration is carried out on the Gaussian function to obtain the transformation result of the spectrum of the object to be detected, so that the content prediction model of the object to be detected is constructed. The method is respectively applied to the prediction of the COD content of sewage and the moisture content of corn, and the prediction mean square error is reduced by at least 25% compared with that before transformation, which shows that the Gaussian function participating in curve fitting does not need to correspond to the real spectrum peak information, and the effective decomposition and recombination of the overlapped information in the original spectrum can be realized, so that the quantitative analysis error is reduced.

Description

Self-adaptive near infrared spectrum transformation method based on correlation and Gaussian curve fitting

Technical Field

The invention relates to a self-adaptive near infrared spectrum transformation method based on correlation and Gaussian curve fitting, and relates to the technical field of near infrared spectrum transformation.

Background

Near infrared spectroscopy is widely applied to the quantitative analysis of substances in fields such as food, the chemical industry, and the environment because it causes no secondary pollution and offers high detection speed [1]-[3]. The near infrared spectrum arises from the absorption of photon energy by polar chemical bonds in molecules in the wavelength range of 780-2526 nm. Because each chemical bond has multiple overtone peaks, the bands are usually wide and the spectral peaks of different chemical bonds overlap severely, which can cause quantitative analysis errors [4]. To reduce the influence of overlapping peaks and improve the utilization of spectral information, it is important to apply a suitable transformation or preprocessing to the near infrared spectrum.

Near infrared spectra cannot be analyzed quantitatively from the peak values of the spectra directly; a prediction model of the content of the substance to be detected must be established, and different preprocessing methods are used in this process to improve the accuracy of the prediction model. Document [5] proposes a generalized multiplicative scatter correction method, which better eliminates the influence of the baseline on the near infrared spectrum and improves the prediction accuracy of the oil content in oil palm fruits; document [6] combines several preprocessing methods into a selective aggregate preprocessing strategy, which achieves better preprocessing results on the near infrared spectra of corn, blood, and edible oil. In addition, many band selection methods have been developed to screen effective information from the original spectral data. Document [7] proposes an iteratively reduced-window bootstrapping soft shrinkage algorithm, which selects near infrared bands accurately by continuously shrinking the window and improves the accuracy of corn protein content prediction; document [8] proposes a dual competitive adaptive reweighted sampling algorithm, which effectively reduces the band selection differences between near infrared spectra acquired by multiple instruments and in multiple batches, and verifies it on corn, pharmaceutical, and wheat spectral data sets; document [9] proposes a three-step hybrid strategy that combines the advantages of coarse selection, fine selection, and optimal selection, and achieves effective band selection for the near infrared spectra of tobacco and beer. These methods can improve the accuracy of the prediction model, but they pay little attention to the influence of overlapping spectral peaks on quantitative analysis. They only select, from the original spectral data, the information that is more beneficial to modeling; they cannot fundamentally change the peak overlap in the near infrared spectrum, and their utilization of the information is low.

Current methods for reducing the influence of overlapping peaks on the spectrum include improving the detection resolution of the instrument itself and separating the overlapping peaks by mathematical methods. Because instrument improvement is often limited by funding and development time, more researchers choose mathematical methods [10]. Document [11] proposes a near infrared spectral feature extraction method based on Gaussian curve fitting, which decomposes the original spectrum into three Gaussian peaks and establishes a relatively accurate prediction model of corn chlorophyll content, verifying the effectiveness of curve fitting for resolving overlapping near infrared peaks and improving model performance. Curve fitting with various functional forms can resolve overlapping peaks by determining the peak shape parameters, which generally requires the peak positions to be determined first. Document [12] applies this approach to the infrared and Raman spectra of four globulins and achieves accurate estimation of globulin structure; document [13] proposes a method that combines the second derivative with Lorentzian function curve fitting, analyzes the infrared spectra of four different asphalts, and resolves the spectral peaks of different chemical bonds. The above methods determine the peak positions by increasing the spectral resolution, but ignore the possible influence of noise. To reduce the influence of noise in derivative calculations, document [14] proposes a method that combines Fourier deconvolution with the second derivative to determine the peak positions, controls noise with Savitzky-Golay (S-G) convolution smoothing, and achieves accurate qualitative and quantitative analysis of protein secondary structure from infrared spectra. In addition, applying the wavelet transform to peak detection in overlapping peaks can also reduce the effect of noise [15]. Document [16] proposes a method that combines image segmentation with the wavelet transform, which better eliminates the influence of noise on peak position determination and accurately detects the peak positions of simulated spectra and of actual matrix-assisted laser desorption/ionization time-of-flight mass spectra. With the development of neural networks, deep learning has also been applied to overlapping-peak analysis: document [17] proposes a method for determining spectral peak parameters with a deep neural network, in which the deep model is trained on a synthetic nuclear magnetic resonance data set and verified on nuclear magnetic resonance spectra of complex proteins and metabolomics mixtures, with good results.

The above methods separate overlapping peaks by determining the actual spectral peak parameters and are highly interpretable, but they are rarely applied to the quantitative analysis of near infrared spectra and have certain limitations. On the one hand, reducing the influence of noise on peak position determination requires denoising the data, and denoising can cause a loss of usable information [18]; on the other hand, the spectral peak parameter values usually depend on the initial values, so the analysis result may not be unique.

To avoid the problems encountered when searching for actual spectral peak parameters, researchers have also proposed methods based on the whole spectrum. Document [19] applies a self-modeling mixture analysis method to the analysis of individual substances in a mixture, separates the surface-enhanced Raman scattering spectra of different substances by matrix decomposition, and performs qualitative and quantitative analysis of each component in a mixed pesticide at the same time; document [20] builds data sets from the infrared spectra of pure substances and mixtures and applies neural networks to the quantitative analysis of functional groups in mixed hydrocarbon fuels, which helps to predict various properties of the mixtures accurately. These methods can reduce the influence of overlapping spectral peaks on quantitative analysis without determining the actual spectral peaks. However, whole-spectrum methods place higher requirements on spectral quality and are generally based on superposition theory, so they are not suitable for near infrared spectra, which have low signal energy, are sensitive to external interference, and have complex components.

Based on the problems existing in the current method, a Gaussian curve fitting method without determining an actual spectrum peak is provided, and parameters of a Gaussian function are determined by combining correlation analysis and equation set solving, so that the parameters have unique solutions, and effective decomposition and recombination of overlapping information in an original near infrared spectrum are realized under the condition that fitting errors are not introduced.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: overlapping spectral peaks in the near infrared spectrum cause large errors in the quantitative analysis of the substance to be detected. To address this problem, a self-adaptive near infrared spectrum transformation method based on correlation and Gaussian curve fitting is provided.

The technical scheme adopted by the invention for solving the technical problems is as follows:

an adaptive near infrared spectrum transformation method based on correlation and Gaussian curve fitting comprises the following implementation processes:

(1) Determining the number of Gaussian functions according to the original spectrum discrete point number n, so that each discrete wavelength corresponds to one Gaussian function, and each Gaussian function corresponds to one central wavelength;

(2) The original near infrared spectrum is combined with the formula (9) to adaptively determine the optimal bandwidth of each Gaussian function so as to ensure that the subsequent conversion does not amplify noise and effectively display overlapped peak information;

where P (δ, λ) represents the correlation between the local raw spectral data centered on the wavelength λ and the gaussian function having a bandwidth δ, and a larger correlation means that the gaussian function having a bandwidth δ at the wavelength has a larger component; s (x) represents an original spectral vector corresponding to the original spectral wavelength x;

calculating the correlation corresponding to each bandwidth of each Gaussian function through a formula (9), selecting the bandwidth delta corresponding to the maximum correlation value from the correlation values, and taking the bandwidth delta as the optimal bandwidth of the Gaussian function; the number of optimal bandwidths is consistent with the number of gaussian functions;

(3) Substituting the center wavelength of the step (1) and the optimal bandwidth of the step (2) into a linear equation set shown as a formula (5) to obtain the height A of each Gaussian function,

where A is the height vector of the Gaussian functions, Ŝ is the fitted spectral vector, and S is the original spectral vector; n is the number of spectral discrete points; λ(1) represents the wavelength corresponding to the 1st spectral discrete point, λ(n) represents the wavelength corresponding to the n-th spectral discrete point, and so on; Ŝ(1) represents the fitted value corresponding to the 1st spectral discrete point; S(1) represents the original value of the 1st spectral discrete point; A(1) represents the height of the Gaussian function corresponding to the 1st spectral discrete point, and so on; the matrix on the left of formula (5) is a full-rank matrix;

the area of each gaussian function is integrated according to the bandwidth δ and the height a (column vector) of the gaussian function, thereby completing the transformation of the near infrared spectrum.

The use of a near infrared spectrum obtained by the conversion method according to claim 1 for quantitative analysis.

A quantitative analysis method based on near infrared spectrum comprises the following implementation processes:

quantitative analysis is divided into modeling and detection;

in the modeling stage, a quantitative analysis model is established from the spectral data in the training set and the corresponding true values of the content of the substance to be detected, using the partial least squares method;

and in the detection stage, the test set or the spectrum data acquired in practice are input into a quantitative analysis model to obtain a predicted result of the content of the object to be detected.

In the modeling phase:

(1) Inputting an original near infrared spectrum matrix as a training set; the number of rows of the matrix is the number of samples of the training set, and the number of columns is the number of spectrum discrete points;

(2) Calculating the average spectrum: because the near infrared spectra of the same substance differ little in shape, determining the Gaussian function bandwidths from the average spectrum of the original near infrared spectrum matrix reduces the processing time;

(3) Adaptively determining a Gaussian function bandwidth according to a formula (9) to obtain an optimal bandwidth;

(4) Performing self-adaptive full-rank fitting integral transformation on an original near infrared spectrum matrix (training set) according to the formulas (3) and (9) to obtain a transformation data matrix (training set); performing the adaptive near infrared spectral transformation of claim 1 on the original near infrared spectral matrix;

(5) Constructing a content prediction model according to the formula (1), and constructing the content prediction model by combining the transformed data matrix with a partial least square method;

in the detection stage:

(1) Inputting an original near infrared spectrum as a test set, or actually collecting data as the test set;

(2) Performing self-adaptive full-rank fitting integral transformation on an original spectrum (a test set or actual acquired data) to obtain transformed data;

(3) And (3) content prediction, namely inputting the transformed data into a regression model for quantitative analysis, and outputting a content prediction result to realize quantitative analysis.

The quantitative analysis method based on the near infrared spectrum is used for detecting the water content of grains.

The quantitative analysis method based on the near infrared spectrum is used for detecting the COD content in the sewage.

The invention has the following beneficial technical effects:

the invention aims at the problem that the quantitative analysis of the substances to be detected generates larger error due to the overlapping of the spectrum peaks of the near infrared spectrum. According to the method, an actual spectrum peak is not required to be determined during Gaussian curve fitting, the parameters of the Gaussian function are determined by combining correlation analysis and equation set solving, so that the parameters have unique solutions, the original spectrum is replaced by area integration of the Gaussian function, and effective decomposition and recombination of overlapping information in the original near infrared spectrum are realized under the condition that fitting errors are not introduced.

The method comprises the steps of determining the number of Gaussian functions participating in Gaussian curve fitting by using discrete points of near infrared spectrum of an object to be detected, and determining the center position of the Gaussian functions by using the wavelength positions of the discrete points; using correlation analysis, an optimal gaussian bandwidth is determined that facilitates extraction of overlapping information. Based on the above, a curve fitting equation set is constructed, the height of the Gaussian function is determined by solving the equation set, and the area integration is carried out on the Gaussian function to obtain the transformation result of the spectrum of the object to be detected, so that the content prediction model of the object to be detected is constructed. The method is respectively applied to the prediction of the COD content of sewage and the moisture content of corn, and the prediction mean square error is reduced by at least 25% compared with that before transformation, which shows that the Gaussian function participating in curve fitting does not need to correspond to the real spectrum peak information, and the effective decomposition and recombination of the overlapped information in the original spectrum can be realized, so that the quantitative analysis error is reduced.

Drawings

FIG. 1 shows the Gaussian curve fitting and integration process of near infrared spectra: (a) original spectrum, (b) Gaussian curve fitting, (c) area integration of the Gaussian functions;
FIG. 2 shows the original spectrum and the Gaussian functions corresponding to the wavelength positions λ(i) and λ(j);
FIG. 3 is a schematic diagram of the process for adaptively determining the bandwidth of the Gaussian functions;
FIG. 4 is a schematic diagram of the adaptive full-rank fitting integral transformation;
FIG. 5 is a flow block diagram of the adaptive full-rank fitting integral transformation applied to quantitative analysis;
FIG. 6 shows the near infrared spectrum module and the acquisition device: (a) near infrared spectrum module, (b) water sample near infrared spectrum acquisition device;
FIG. 7 shows the Gaussian functions of a full-rank fit;
FIG. 8 shows the fitted spectrum and the original spectrum;
FIG. 9 shows the near infrared spectrum together with the adaptively determined optimal Gaussian bandwidths;
FIG. 10 shows the original near infrared spectrum of sewage and its transformation result;
FIG. 11 shows the Pearson correlation coefficients of different wavelengths before and after transformation of the sewage near infrared spectrum matrix;
FIG. 12 shows the original near infrared spectrum of moisture-containing grain (corn) and its transformation result;
FIG. 13 shows the Pearson correlation coefficients of different wavelengths before and after transformation of the near infrared spectrum matrix of moisture-containing grain (corn).

Detailed Description

The implementation of a novel near infrared spectrum transformation method based on correlation and Gaussian curve fitting is described below with reference to FIGS. 1 to 13. In this embodiment, Part 1 presents the implementation process of the self-adaptive near infrared spectrum transformation method based on correlation and Gaussian curve fitting. Part 2 describes the related theory of near infrared quantitative analysis and Gaussian curve fitting. Part 3 explains the principle of the proposed adaptive full-rank fitting integral transformation method. Part 4 describes the procedure for applying the proposed transformation method to near infrared spectral analysis. Part 5 applies the proposed method to two kinds of near infrared spectra for experimental verification.

1. The implementation process of the self-adaptive near infrared spectrum transformation method based on correlation and Gaussian curve fitting comprises the following steps:

2 Related theory

2.1 quantitative analysis of near infrared Spectroscopy

Because near infrared bands are wide and contain many overtone peaks, the content of the substance to be detected cannot be read directly from the spectrum; a content prediction model must be established from a spectral matrix and the corresponding true content values [21]. The partial least squares method is a commonly used near infrared modeling method. It determines the regression coefficients between the independent variable (the near infrared spectral matrix) and the dependent variable (the content of the substance to be detected) through multi-step iteration, and its expression is as follows:

U＝Tβ (1)

where U is the scoring matrix of the dependent variable, T is the scoring matrix of the independent variable, and beta is the regression coefficient matrix. U and T are determined according to formula (2):
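Formula (2) is not reproduced in this text. In the standard partial least squares formulation, which matches the symbols defined in the following paragraph, the two decompositions would read as below. Here Y denotes the content matrix of the substance to be detected, T and U each have a columns (one per principal component), and the exact layout of the original formula is an assumption.

```latex
\begin{aligned}
X &= T P^{\mathsf{T}} + E \\
Y &= U Q^{\mathsf{T}} + F
\end{aligned}
\tag{2}
```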

where X and Y are the spectral matrix and the content matrix of the substance to be detected, respectively; P and E are the weight matrix and the residual matrix of the independent variable, respectively; Q and F are the weight matrix and the residual matrix of the dependent variable, respectively; and a is the number of principal components determined through iteration. Iteration stops when adding a principal component no longer improves the model.
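As an illustration of the iteration described above, the following is a minimal NumPy sketch of single-response partial least squares (PLS1) using the NIPALS procedure. The source does not state which PLS algorithm is used, so this choice is an assumption; the names `pls1_fit` and `pls1_predict` and the synthetic data are hypothetical. The number of components is passed in as a fixed value here, whereas the text selects it by stopping when an extra component no longer improves the model.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit single-response PLS by NIPALS; returns (coef, x_mean, y_mean)."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xk = X - x_mean          # centred, progressively deflated spectra
    yk = y - y_mean          # centred, progressively deflated content values
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                    # weight: direction of maximal covariance
        w = w / np.linalg.norm(w)
        t = Xk @ w                       # score vector (one column of T)
        tt = t @ t
        p = Xk.T @ t / tt                # loading of X on t (one column of P)
        qk = (yk @ t) / tt               # regression of y on t
        Xk = Xk - np.outer(t, p)         # remove the explained part from X
        yk = yk - qk * t                 # remove the explained part from y
        W.append(w)
        P.append(p)
        q.append(qk)
    W = np.column_stack(W)
    P = np.column_stack(P)
    q = np.array(q)
    # Regression coefficients expressed in the original (centred) variables.
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, x_mean, y_mean

def pls1_predict(X, coef, x_mean, y_mean):
    """Predict content values for new spectra."""
    return (X - x_mean) @ coef + y_mean

# Synthetic, noise-free toy data (not from the source).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.7
coef, x_mean, y_mean = pls1_fit(X, y, n_components=4)
pred = pls1_predict(X, coef, x_mean, y_mean)
```

With as many components as there are independent columns in X, this reduces to ordinary least squares, which is why the toy data are reproduced closely; on real spectra the component count would normally be chosen by validation.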

2.2 Gauss curve fitting and integration

Overlapping peak analysis is a method for reducing the influence of spectral peak overlap: it separates spectral peaks that originally overlap, which is equivalent to improving the spectral resolution. Gaussian curve fitting is widely applied to overlapping peak analysis and, combined with an integral transformation, can improve the resolution of the original spectrum. Assuming that the original spectrum is formed by the superposition of several independent spectral peaks, each represented by a Gaussian function, the process of superposing the Gaussian functions to approximate the original spectrum is called Gaussian curve fitting [22]. The expression is
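Formula (3) is not reproduced in this text. A form consistent with the symbol definitions that follow, assuming the common Gaussian parameterization with 2δ² in the denominator of the exponent (the original may use a different convention), would be:

```latex
\hat{S}(x) = \sum_{i=1}^{m} A_i \exp\!\left( -\frac{(x-\lambda_i)^2}{2\delta_i^{2}} \right)
\tag{3}
```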

where Ŝ(x) is the fitted spectral data, a set of discrete values; x is the wavelength; A_i, δ_i, and λ_i are the height, bandwidth, and center position of the i-th Gaussian function, representing the height, width, and position of an actual spectral peak, respectively; and m is the number of Gaussian functions involved in the fitting. If the mean square error is used to measure the fitting error, the expression of the fitting error is
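Formula (4) is not reproduced in this text. By the usual definition of the mean square error, with x_k denoting the wavelength of the k-th discrete point (an index introduced here for illustration), it would read:

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{k=1}^{n} \bigl( \hat{S}(x_k) - S(x_k) \bigr)^{2}
\tag{4}
```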

where S represents the original spectrum and n is the number of points of the discrete spectral data. Gaussian curve fitting typically uses a least-squares algorithm to adjust the Gaussian function parameters iteratively, reducing the difference MSE between the fitted spectral data Ŝ and the original spectral data S; parameter adjustment ends when the MSE is smaller than a set value.

In near infrared spectroscopy, the spectrum usually refers to the absorbance spectrum. According to Beer's law, absorbance is proportional to the number of molecules of the substance to be measured, and absorbance can be represented by peak area, so the amount or concentration of the substance can be analyzed using the areas of the fitted Gaussian functions [15]. Note that computing the area is an integration process: integration transforms data that carry bandwidth information into data without bandwidth information, and replacing the original spectral data with the integrated data can improve the resolution of the original data. Fig. 1 shows the Gaussian curve fitting and integration process for near infrared spectra.

3 adaptive full rank fitting integral transform

3.1 full rank Gaussian curve fitting and integration

As described in Section 2, conventional Gaussian curve fitting and integration use each Gaussian function to represent a true spectral peak, so determining the peak positions is critical. However, peak position determination is affected by factors such as noise, and actual spectral peak shapes generally do not conform exactly to a standard Gaussian function, so fitting the original spectrum with a small number of Gaussian functions inevitably introduces fitting error. To address these problems of the conventional method, a transformation method that combines full-rank Gaussian curve fitting (full-rank fitting) with an integration operation is proposed herein.

Assuming that the number of discrete points of the original spectrum is n, the method requires that n gaussian functions fit the original spectrum, i.e. one gaussian function for each wavelength. If the bandwidth delta of the Gaussian function is fixed, only the height parameter of the Gaussian function is required to be adjusted when full rank fitting is performed. Each fitted discrete point corresponds to a linear equation shown in formula (3), then all the spectrum data corresponds to an equation set, and the equation set is written into a matrix form as shown in formula (5):
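Formula (5) is not reproduced in this text. Writing formula (3) at each of the n wavelengths λ(1), ..., λ(n) with m = n Gaussian functions centered at those same wavelengths gives a system of the following shape. The entry notation g_{ki} and the 2δ² convention are assumptions; δ(i) denotes the bandwidth of the i-th Gaussian function.

```latex
\begin{bmatrix}
1 & g_{12} & \cdots & g_{1n} \\
g_{21} & 1 & \cdots & g_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
g_{n1} & g_{n2} & \cdots & 1
\end{bmatrix}
\begin{bmatrix} A(1) \\ A(2) \\ \vdots \\ A(n) \end{bmatrix}
=
\begin{bmatrix} \hat{S}(1) \\ \hat{S}(2) \\ \vdots \\ \hat{S}(n) \end{bmatrix}
=
\begin{bmatrix} S(1) \\ S(2) \\ \vdots \\ S(n) \end{bmatrix},
\qquad
g_{ki} = \exp\!\left( -\frac{(\lambda(k)-\lambda(i))^{2}}{2\,\delta(i)^{2}} \right)
\tag{5}
```

The diagonal entries equal 1 because each Gaussian function is evaluated at its own center.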

where A is the height vector of the Gaussian functions, Ŝ is the fitted spectral vector, and S is the original spectral vector.

It can be seen from formula (5) that, unlike conventional Gaussian curve fitting, the result of full-rank fitting can be equal to the original spectrum, which is determined by the properties of the linear system of equations. In the linear system shown in formula (5), the column vectors of the coefficient matrix (matrix size: n×n) are linearly independent, and the rank of the augmented matrix formed with the spectral vector S is n, which equals the number of unknowns in A. The height vector A of the Gaussian functions therefore has exactly one solution, the Gaussian functions fit the original spectrum completely, and there is no fitting error. After the height A is determined, the area of each Gaussian function is integrated and the original spectrum is replaced with the integrated data; this transformation method is referred to herein as the full-rank fitting integral transformation. Taking any two adjacent points (other than the starting point and the ending point) in the original spectral data as the objects of study, with corresponding wavelengths (abscissa) λ(i) and λ(j) and energies (ordinate) S(i) and S(j), the Gaussian functions corresponding to the full-rank fit are shown in fig. 2:

assuming that the fitting parameters of the two gaussian functions in fig. 2 are affected only by the gaussian functions adjacent to each other, after full-rank fitting, S (i) and S (j) can be expressed as the form of addition of three gaussian functions, as shown in formula (6):

on the basis, the areas of the two Gaussian functions are integrated, and energy of the two discrete points after full-rank fitting integral transformation is obtained:
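Formulas (6) and (7) are not reproduced in this text. Under the stated assumption that only neighboring Gaussian functions contribute, and taking j = i + 1 with the 2δ² Gaussian convention (both assumptions, as is the primed notation S′ for the transformed energy), they can be sketched as:

```latex
\begin{aligned}
S(i) &= A(i-1)\,e^{-\frac{(\lambda(i)-\lambda(i-1))^{2}}{2\delta(i-1)^{2}}}
      + A(i)
      + A(j)\,e^{-\frac{(\lambda(i)-\lambda(j))^{2}}{2\delta(j)^{2}}} \\
S(j) &= A(i)\,e^{-\frac{(\lambda(j)-\lambda(i))^{2}}{2\delta(i)^{2}}}
      + A(j)
      + A(j+1)\,e^{-\frac{(\lambda(j)-\lambda(j+1))^{2}}{2\delta(j+1)^{2}}}
\end{aligned}
\tag{6}
```

```latex
\begin{aligned}
S'(i) &= \int_{-\infty}^{\infty} A(i)\,e^{-\frac{(x-\lambda(i))^{2}}{2\delta(i)^{2}}}\,dx
       = \sqrt{2\pi}\,A(i)\,\delta(i) \\
S'(j) &= \sqrt{2\pi}\,A(j)\,\delta(j)
\end{aligned}
\tag{7}
```

In (6) each original value mixes contributions from three Gaussian functions, while in (7) each transformed value depends on a single Gaussian function only.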

as can be seen from a comparison of equation (6) and equation (7), the transformation method takes each discrete data point as an independent transformation object. In the fitting stage, decomposing data information at different positions by using a denser Gaussian function with a certain bandwidth; and in the integration stage, collecting the information in the bandwidth range of the Gaussian function in the center of the Gaussian function, and decomposing and recombining the original information to display the information covered by the spectrum peak overlapping. As can be seen from equation (7), the transform result is directly related to the height and bandwidth, and the height is obtained by solving the equation set after the bandwidth is determined, so that the key to achieving good transform effect of full-rank fitting integration is to determine the optimal bandwidth of each gaussian function before transform, so that the transform better eliminates the influence of overlapping peaks.
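The full-rank fitting and integration described in this section can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation: it assumes the Gaussian form A·exp(-(x-λ)²/(2δ²)), so that each area equals √(2π)·A·δ, and the function names, the wavelength grid, and the toy spectrum are all hypothetical.

```python
import numpy as np

def gaussian_design_matrix(wavelengths, bandwidths):
    """Entry (k, i) is the unit-height Gaussian centred at wavelengths[i]
    with bandwidth bandwidths[i], evaluated at wavelengths[k]."""
    diff = wavelengths[:, None] - wavelengths[None, :]
    return np.exp(-diff**2 / (2.0 * bandwidths[None, :]**2))

def full_rank_fit_integral(wavelengths, spectrum, bandwidths):
    """Solve for one Gaussian height per wavelength, then return
    (heights, areas); the areas form the transformed spectrum."""
    G = gaussian_design_matrix(wavelengths, bandwidths)
    heights = np.linalg.solve(G, spectrum)       # unique solution if G is full rank
    areas = heights * bandwidths * np.sqrt(2.0 * np.pi)
    return heights, areas

# Toy example: 11 wavelengths spaced 10 nm apart, one bandwidth for all.
wavelengths = np.linspace(900.0, 1000.0, 11)
spectrum = 0.5 + 0.1 * np.sin(wavelengths / 20.0)
bandwidths = np.full(wavelengths.size, 8.0)
heights, transformed = full_rank_fit_integral(wavelengths, spectrum, bandwidths)
```

`np.linalg.solve` raises `LinAlgError` when the matrix is singular. Bandwidths that are large relative to the wavelength spacing tend to make this kind of matrix ill-conditioned, so checking `np.linalg.cond(G)` before trusting the heights may be worthwhile.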

3.2 adaptive determination of Gaussian function Bandwidth

Near infrared spectral waveforms are variable and their overlap is complex, and different Gaussian function bandwidths suit different overlap conditions: with a large bandwidth, the full-rank fitting integral transformation decomposes and recombines the data over a larger range, which suits cases where the spectral peaks overlap over a wide range; the opposite holds for a small bandwidth. Determining the bandwidth of each Gaussian function by manual analysis for every such case is clearly impractical, so a method for adaptively determining the bandwidth of the Gaussian functions is provided.

Since the full rank fitting enables the curve after the gaussian function superposition to completely fit the original spectrum, the bandwidth of the superposition component contained in the original spectrum is used as the bandwidth of the gaussian function in the full rank fitting. Analysis of the superimposed components may be achieved by correlation computation, in particular in the form of inner product operations of gaussian function components of different bandwidths with the original spectrum. The process of determining bandwidth using correlation is called adaptively determining gaussian function bandwidth as in equation (8):

where P (δ, λ) represents the correlation between the local raw spectral data centered on λ and the gaussian function having a bandwidth δ, and a larger correlation means that the gaussian function having a bandwidth δ at that wavelength has a larger component.

When the correlation is calculated, the result should depend only on the bandwidth; however, as can be seen from formula (8), both the bandwidth δ and the height A of the Gaussian function affect the inner product. There are two ways to solve this problem: 1. fix the height of the Gaussian function; 2. express the height parameter in terms of the bandwidth. In scheme 1, because the integral of the Gaussian function is always positive, if the height is fixed then the number of data points that affect the inner product grows as δ increases, so the inner product of the spectrum and the Gaussian function grows steadily, the search for the optimal δ does not converge, and no solution can be found. Scheme 2 is therefore used herein: the area of the Gaussian function is fixed so that its height is expressed by the bandwidth, and the inner product then depends only on the bandwidth of the Gaussian function. The expression for adaptively determining the bandwidth of the Gaussian function is updated as in formula (9):
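Formulas (8) and (9) are not reproduced in this text. Under the 2δ² Gaussian convention, and taking the fixed area to be 1 (both assumptions; the source only states that the area is fixed), the step from (8) to (9) can be sketched as follows. The sum runs over the local wavelengths x around λ.

```latex
\begin{aligned}
P(\delta,\lambda) &= \sum_{x} S(x)\, A \exp\!\left( -\frac{(x-\lambda)^{2}}{2\delta^{2}} \right)
&& \text{(8)} \\
\sqrt{2\pi}\,A\,\delta = 1 \;&\Rightarrow\; A = \frac{1}{\sqrt{2\pi}\,\delta} \\
P(\delta,\lambda) &= \sum_{x} S(x)\, \frac{1}{\sqrt{2\pi}\,\delta} \exp\!\left( -\frac{(x-\lambda)^{2}}{2\delta^{2}} \right)
&& \text{(9)}
\end{aligned}
```

The optimal bandwidth at wavelength λ is then the candidate δ that maximizes P(δ, λ).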

After this improvement, the Gaussian bandwidth with the strongest correlation at each wavelength position is adaptively determined by equation (9). The process is shown in fig. 3, in which the dark box marks the maximum of each column of correlation values; the Gaussian bandwidth corresponding to that maximum is the optimal bandwidth at that wavelength position.
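The column-wise maximum search described here can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: equations (8) and (9) are not reproduced in this text, so the sketch assumes that δ is the standard deviation of the Gaussian, that the fixed area is 1 (so the height becomes 1/(δ√(2π))), and that the inner product is taken over the whole spectrum rather than a local window. The names `unit_area_gaussian` and `optimal_bandwidths` and the single-peak spectrum are hypothetical.

```python
import numpy as np

def unit_area_gaussian(x, center, delta):
    """Gaussian whose height is tied to its bandwidth so that its area is 1 (assumed form)."""
    height = 1.0 / (delta * np.sqrt(2.0 * np.pi))
    return height * np.exp(-((x - center) ** 2) / (2.0 * delta ** 2))

def optimal_bandwidths(wavelengths, spectrum, candidates):
    """Return, for every wavelength, the candidate bandwidth with the largest inner product."""
    candidates = np.asarray(candidates, dtype=float)
    corr = np.empty((candidates.size, wavelengths.size))
    for i, delta in enumerate(candidates):
        for j, center in enumerate(wavelengths):
            corr[i, j] = spectrum @ unit_area_gaussian(wavelengths, center, delta)
    # Row index of the maximum in each column, like the dark boxes of fig. 3.
    return candidates[np.argmax(corr, axis=0)], corr

# Synthetic example: one narrow peak on a 3.5 nm grid (not data from the experiments).
wavelengths = 900.0 + 3.5 * np.arange(200)
spectrum = np.exp(-((wavelengths - wavelengths[100]) ** 2) / (2.0 * 5.0 ** 2))
candidates = np.arange(4, 45, 4)  # 4, 8, ..., 44 nm
best, corr = optimal_bandwidths(wavelengths, spectrum, candidates)
print(best[100], best[110])
```

On this toy spectrum the narrowest candidate is selected at the peak and a wider one is selected 35 nm away from it, which agrees qualitatively with the behavior the text attributes to sharp and smooth regions.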

Because actual near infrared spectra have different waveforms, the smooth parts of a spectrum correlate more strongly with Gaussian functions of larger δ. A larger δ lets the subsequent full-rank fitting integral transformation decompose and recombine the original data over a wider range, which helps extract, to the greatest extent, the information hidden in the smooth parts by spectrum peak overlapping. In contrast, where the spectrum changes sharply the data resolution is already high and the correlation with Gaussian functions of smaller δ is strong, so decomposition and recombination take place only over a small range, the original information is retained, and noise is not amplified. In summary, determining the Gaussian function bandwidth from the correlation with the local spectral waveform gives the method clear advantages across the different waveform conditions of near infrared spectra.

3.3 adaptive full rank fitting integral transform

Combining the full-rank fitting integral transformation of 3.1 with the adaptive determination of the Gaussian function bandwidth of 3.2 gives a new near infrared spectrum transformation method for quantitative analysis, the adaptive full-rank fitting integral transformation, whose steps are as follows:

(1) Determining the number of Gaussian functions according to the original spectrum discrete points, so that each discrete wavelength corresponds to one Gaussian function;

(3) Construct the linear equation set shown in formula (5) from the center wavelengths of step (1) and the optimal bandwidths of step (2), and perform the full-rank fitting integral transformation on the original spectrum.
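The steps listed here can be sketched as follows. This is a hedged illustration rather than the method as claimed: formula (5) is not reproduced in this text, so the sketch assumes each Gaussian has the form A·exp(−(x − c)²/(2δ²)) with δ as the standard deviation, and the function name `adaptive_full_rank_transform`, the toy grid, and the alternating bandwidths are all hypothetical.

```python
import numpy as np

def adaptive_full_rank_transform(wavelengths, spectrum, bandwidths):
    """Fit one Gaussian per discrete wavelength and return each Gaussian's area."""
    # basis[i, j]: unit-height Gaussian centred on wavelength j, evaluated at wavelength i.
    diff = wavelengths[:, None] - wavelengths[None, :]
    basis = np.exp(-(diff ** 2) / (2.0 * bandwidths[None, :] ** 2))
    # n equations in n unknown heights: the superposition must equal the spectrum.
    heights = np.linalg.solve(basis, spectrum)
    # Closed-form integral of height * exp(-(x - c)^2 / (2 * delta^2)) over the real line.
    areas = heights * bandwidths * np.sqrt(2.0 * np.pi)
    return areas, heights, basis

# Toy data: 10 nm grid and narrow bandwidths, chosen so the system stays well conditioned.
wavelengths = 900.0 + 10.0 * np.arange(30)
spectrum = 1.0 + 0.5 * np.sin(wavelengths / 40.0)
bandwidths = np.where(np.arange(30) % 2 == 0, 4.0, 6.0)  # stand-in for the output of step (2)
areas, heights, basis = adaptive_full_rank_transform(wavelengths, spectrum, bandwidths)
fitted = basis @ heights
print(np.max(np.abs(fitted - spectrum)))
```

Because the system is square and solved directly, the fitted curve passes through every spectral point up to rounding error. With this assumed Gaussian form the matrix becomes ill conditioned when the bandwidths are much wider than the sampling interval, so checking `np.linalg.cond(basis)` before trusting the heights is prudent.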

The transformation process of the proposed method is shown in fig. 4.

4 quantitative analysis based on adaptive full rank fitting integral transformation

Quantitative analysis is divided into modeling and detection. In the modeling stage, a quantitative analysis model is established between the spectral data in the training set and the corresponding true values of the content of the substance to be detected, using the partial least squares method; in the detection stage, the test set or the spectral data acquired in practice are input into the quantitative analysis model to obtain the predicted content of the substance to be detected.

The flow chart of the proposed transformation method applied to quantitative analysis is shown in fig. 5, and the specific steps are as follows:

Modeling:

(1) The original near infrared spectrum matrix (training set) is input. The number of rows of the matrix is the number of samples of the training set, and the number of columns is the number of spectrum discrete points;

(2) An average spectrum is calculated. Because the near infrared spectrum shapes of the same substances have little difference, the processing time can be reduced by determining the bandwidth of the Gaussian function by using the average spectrum of the original near infrared spectrum matrix;

(4) Performing self-adaptive full-rank fitting integral transformation on an original near infrared spectrum matrix (training set) according to the formulas (3) and (9) to obtain a transformation data matrix (training set);

(5) Construct a content prediction model according to formula (1). The transformed data matrix is combined with the partial least squares method to establish the content prediction model;

Detection:

(1) Inputting an original near infrared spectrum (test set or actual acquisition data);

(3) Content prediction. The transformed data are input into the regression model, which outputs the content prediction result, completing the quantitative analysis.
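The modeling and detection stages can be illustrated with a small partial least squares sketch. The source states that modeling was done in Matlab and does not give its code, so what follows is only a textbook PLS1 (NIPALS) routine written in NumPy for illustration; `pls1_fit`, `pls1_predict`, the number of components, and the random matrices standing in for transformed spectra are all assumptions.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Textbook PLS1 (NIPALS); returns regression coefficients and the centring terms."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)      # weight vector
        t = Xk @ w                  # scores
        tt = t @ t
        p = Xk.T @ t / tt           # X loadings
        qk = (yk @ t) / tt          # y loading
        Xk = Xk - np.outer(t, p)    # deflate X
        yk = yk - qk * t            # deflate y
        W.append(w)
        P.append(p)
        q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, x_mean, y_mean

def pls1_predict(X, coef, x_mean, y_mean):
    return (X - x_mean) @ coef + y_mean

# Synthetic stand-ins for the transformed training and test matrices (rows = samples).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 6))
X_test = rng.normal(size=(10, 6))
true_coef = np.array([0.5, -1.0, 2.0, 0.0, 1.5, -0.3])
y_train = X_train @ true_coef + 3.0
y_test = X_test @ true_coef + 3.0

model = pls1_fit(X_train, y_train, n_components=6)  # modeling stage
y_pred = pls1_predict(X_test, *model)               # detection stage
mse = np.mean((y_pred - y_test) ** 2)
print(mse)
```

In practice the rows of `X_train` and `X_test` would be transformed spectra, and the number of components would be chosen by validation rather than set to the number of columns as in this noise-free toy case.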

5 experiment and discussion

To verify the effectiveness of the transformation method, three comparisons are made before and after transformation: the near infrared spectrum itself, the Pearson correlation coefficient between the spectral data matrix and the true values, and the prediction error of the content of the substance to be detected in actual quantitative analysis. To verify the general applicability of the method, experiments were performed on two widely differing data sets.

5.1 Experimental data and Environment

Sewage data set: the transmission near infrared spectrum of the sewage was collected with a Texas Instruments near infrared spectrum module (NIRscan™ Nano Evaluation Module); a photograph of the device is shown in fig. 6. Sewage spectra were collected continuously for about one week in a municipal sewage plant, and the COD value of the same water samples was measured simultaneously with a Shimadzu on-line detector (chemical oxygen demand on-line automatic monitor COD-4200). The detection point was located in a biochemical pond of the sewage plant, and the COD of the water samples fluctuated within about 15-25 mg/L. After removing data excessively affected by manual operation or external interference, a data set was formed, from which 40 samples were selected as the training set and 10 samples as the test set.

Corn dataset: a public data set comprising 80 corn near infrared spectra with 80 corresponding true moisture content values, which fluctuate within 9-11% [23]. 64 samples are selected to form the training set, and 16 samples form the test set.

Experimental environment and software: the computer runs the Windows 10 operating system, the algorithm is implemented in Python, and the modeling software is Matlab.

5.2 Sewage data set experiment and analysis

5.2.1 fitting error

The wavelength range of the near infrared spectrum of the sewage is 901-1700nm, the number of discrete points is 228, namely, the number of Gaussian functions required by full-rank fitting is 228, and each discrete point wavelength corresponds to the central position of one Gaussian function. The gaussian bandwidth was set to 10nm and the gaussian height was solved by equation (3) to obtain a series of gaussian functions as shown in fig. 7.

The gaussian functions are added to produce a fitted spectrum, and fig. 8 shows a comparison of the fitted spectrum and the original spectrum.

As can be seen from fig. 8, the fitted spectrum completely coincides with the original spectrum, verifying that there is no fitting error in the full rank fitting.

5.2.2 optimal Gaussian function Bandwidth

The candidate Gaussian function bandwidths range from 4 nm to 44 nm in steps of 4 nm (the wavelength interval of the near infrared spectrum is about 3.5 nm, so the Gaussian function bandwidth cannot be less than 3.5 nm). Fig. 9 compares the adaptively determined optimal Gaussian function bandwidth with the near infrared spectrum:

As can be seen in fig. 9, the Gaussian function bandwidth is closely related to the local waveform of the near infrared spectrum: the spectrum peaks near 950 nm, 1150 nm and 1400 nm are pronounced, and the corresponding Gaussian function bandwidth is small, which keeps noise under control; the other positions have no obvious spectrum peak, so the bandwidth is large, which facilitates effective decomposition and recombination.

5.2.3 Pre-and post-transform prediction error contrast

The raw spectral data were curve fitted using the Gaussian center positions and bandwidths determined in 5.2.1 and 5.2.2, and the height of each Gaussian function was calculated. Finally, the area of each Gaussian function was calculated and output as the adaptive full-rank fitting integral transformation result. Fig. 10 compares the original spectral data with the transformation result:

As can be seen from the transformation results in fig. 10, the proposed method changes the original near infrared spectrum differently at different waveforms: where the original spectrum changes sharply, the transformed shape is basically consistent with the original data, showing that the method does not amplify the rate of change where the original data change sharply, that is, it does not amplify noise; where the original spectrum is smoother, a distinct peak appears after transformation, indicating that the information there has been decomposed and recombined so that the overlapping information becomes visible. In addition, each transformation takes less than 0.1 s, which meets the requirement of real-time detection.

In near infrared spectrum analysis, the magnitude of the Pearson correlation coefficient represents the degree of linear correlation between the spectral energy at each wavelength position and the true value of the content of the substance to be detected; if the correlation coefficient at a wavelength position increases after transformation, the overlapping information there has been revealed by the transformation. The Pearson correlation coefficient between the energy value at each wavelength and the COD value was calculated before and after transformation of the spectral matrix, and the result is shown in fig. 11:

As can be seen from fig. 11, the Pearson correlation coefficient changes after the spectral transformation. Combined with the shape of the sewage near infrared spectrum, the Pearson correlation coefficient between the data and the true values increases markedly in the bands where the spectrum is smooth (1040-1120 nm, 1224-1325 nm and 1539-1605 nm), reaching 0.25 and 0.30. The maximum Pearson correlation coefficient is 0.22 before transformation and 0.30 after transformation, an improvement of 36%.
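The per-wavelength Pearson calculation can be sketched as below. This is an illustrative NumPy version, not the code used for fig. 11; the function name `pearson_per_wavelength` and the random matrix are hypothetical, and only the two maxima 0.22 and 0.30 come from the text.

```python
import numpy as np

def pearson_per_wavelength(spectra, targets):
    """Pearson r between every column of `spectra` (one wavelength each) and `targets`."""
    xc = spectra - spectra.mean(axis=0)
    yc = targets - targets.mean()
    return (xc.T @ yc) / np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())

# Synthetic matrix: 50 samples, 5 wavelengths; the target depends linearly on column 2 only.
rng = np.random.default_rng(1)
spectra = rng.normal(size=(50, 5))
targets = 2.0 * spectra[:, 2] + 1.0
r = pearson_per_wavelength(spectra, targets)

# Relative gain of the maximum coefficient reported for the sewage data (0.22 -> 0.30).
gain = 0.30 / 0.22 - 1.0
print(r.round(3), round(gain, 3))
```

Applying the function to the spectral matrix before and after transformation gives the two curves that are compared; the ratio 0.30/0.22 corresponds to the stated improvement of about 36%.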

The transformation method was applied to actual quantitative analysis of the COD content of sewage. Content prediction models were built with the partial least squares method on 3 sewage data sets: the original data set, the transformed data set, and the transformed data set with additional multiplicative scatter correction (MSC). The mean square error of the prediction models built on the 3 data sets is shown in table 1:

Table 1 Sewage COD content prediction error

Type of processing	MSE
No processing	1.4307
Proposed method	1.0657
Proposed method with MSC	1.0018

Table 1 shows that the prediction error of the model is reduced by about 25% after transformation by the proposed method. Combining the proposed transformation method with the MSC near infrared spectrum preprocessing method gives a model prediction mean square error of 1.001838, which further reduces the prediction error by 6% relative to the value without MSC preprocessing.

5.3 corn dataset experiments and analysis

The corn near infrared spectrum was subjected to the adaptive full-rank fitting integral transformation following the same flow as in 5.2. Because the original spectra differ, the number of Gaussian functions was set to 700 for the corn spectral data, and the remaining settings were unchanged. Fig. 12 compares the corn spectral data before and after transformation:

From fig. 12, a conclusion similar to that for the sewage spectral transformation can be drawn: the change is large where the waveform is gentle, and the spectrum is basically unchanged where the waveform is pronounced. Each transformation takes about 0.1 s, which meets the requirement of real-time monitoring.

The calculation result of the Pearson correlation coefficient is shown in fig. 13.

As can be seen from fig. 13, in the wavelength bands 1232-1453 nm, 1849-1893 nm and 2022-2290 nm, the Pearson correlation coefficient between the transformed data matrix and the true values increases markedly, reaching 0.75 and 0.79. The maximum Pearson correlation coefficient before transformation is 0.65, from which the improvement in the maximum value is calculated to be about 20%.

Regression models for moisture content prediction were built on the corn near infrared spectrum data sets using the same modeling method, and the experimental results are shown in table 2:

Table 2 Corn moisture content prediction error

Type of processing	MSE
No processing	0.0986
Proposed method	0.0667
Proposed method with MSC	0.0662

The experimental results of table 2 demonstrate that the proposed method reduces the prediction error by about 30%. In combination with multiplicative scatter correction, the prediction error is further reduced by 0.7%. This experiment also verifies the general applicability of the proposed method.
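The percentage reductions quoted for the two data sets follow directly from the tabulated MSE values, as the short check below shows. The helper name `relative_reduction` and the dictionary keys are hypothetical; the numbers are copied from tables 1 and 2.

```python
def relative_reduction(before, after):
    """Fractional decrease of `after` relative to `before`."""
    return (before - after) / before

# MSE values copied from tables 1 and 2.
sewage = {"none": 1.4307, "proposed": 1.0657, "proposed_msc": 1.0018}
corn = {"none": 0.0986, "proposed": 0.0667, "proposed_msc": 0.0662}

for name, mse in (("sewage", sewage), ("corn", corn)):
    by_transform = relative_reduction(mse["none"], mse["proposed"])
    by_msc = relative_reduction(mse["proposed"], mse["proposed_msc"])
    print(f"{name}: transform {by_transform:.1%}, added MSC {by_msc:.1%}")
```

The transform alone lowers the MSE by about 25.5% for sewage and 32.4% for corn, and adding MSC lowers it by a further 6.0% and 0.7% respectively, consistent with the rounded figures in the text.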

5.4 discussion

Experimental verification on two near infrared spectral data sets shows that information decomposition and recombination based on fitting integration can reduce the influence of spectrum peak overlapping in the near infrared band on quantitative analysis, and that the parameters of the actual spectrum peaks need not be determined in the decomposition and recombination process. Near infrared spectroscopy is applied in many fields and the spectrum peak overlapping differs between detection objects, so an adaptive transformation method adapts better to different data.

6 Summary

(1) The full-rank fitting integral transformation method requires no iteration in the fitting process: a unique solution for the Gaussian function parameters is obtained by solving an equation set, and no fitting error is introduced. The calculation time is about 0.1 s when the number of spectrum discrete points is 700, which meets the real-time requirement. Combining full-rank fitting with the integral operation realizes the decomposition and recombination of the original information.

(2) To adapt the full-rank fitting integral transformation to the varied waveforms of near infrared spectra, a method for adaptively determining the Gaussian function bandwidth is proposed. It simultaneously improves the resolution of the data in gentle regions and avoids amplifying noise in regions with pronounced spectrum peaks, so that the original data are effectively decomposed and recombined and the influence of spectrum peak overlapping is reduced.

(3) The method was applied to quantitative analysis of sewage and corn data. Experiments show that the Pearson correlation coefficient between the transformed data and the true value of the content of the substance to be detected is improved, with the maximum value improved by more than 20%, and that the mean square error of the sewage COD and corn moisture content predictions is reduced by more than 25%, showing that the method has general applicability.

Near infrared and other detection technologies are developing rapidly, and how to extract data information effectively is an important research direction for all kinds of detection data, including near infrared spectra. Future work will continue to study information extraction methods in order to further improve data utilization. The details of the references cited in the present invention are as follows:

[1] Z. Yang, Z. Wang, M. Lin, Nondestructive Testing of Jujube Water Based on the NTRS, Xinjiang Agricultural Sciences, 58 (2021) 2320-2326.

[2] H. Yu, W. Du, Z. Lang, K. Wang, J. Long, A Novel Integrated Approach to Characterization of Petroleum Naphtha Properties from Near-Infrared Spectroscopy, Transactions on Instrumentation and Measurement, 70 (2021) 1-13.

[3] Y. Tang, Z. Chen, Soil pH Prediction Based on Convolution Neural Network and Near Infrared Spectroscopy, Spectroscopy and Spectral Analysis, 41 (2021) 892-897.

[4] W. Lu, Modern Near Infrared Spectroscopy Analytical Technology, China Petrochemical Press, Beijing, 2010.

[5] D.D. Silalahi, H. Midi, J. Arasan, M.S. Muatafa, J.P. Caliman, Robust generalized multiplicative scatter correction algorithm on pretreatment of near infrared spectral data, Vibrational Spectroscopy, 97 (2018) 55-65.

[6] X. Bian, K. Wang, E. Tan, P. Diwu, F. Zhang, Y. Guo, A selective ensemble preprocessing strategy for near-infrared spectral quantitative analysis of complex samples, Chemometrics and Intelligent Laboratory Systems, 197 (2020) 103916.

[7] Q. Xu, L. Guo, K. Du, B. Shan, F. Zhang, A Variable Selection Method for Near-infrared Spectroscopy Based on Iterative Shrinkage Window Bootstrapping Soft Shrinkage Algorithm, Journal of Instrumental Analysis, 41 (2022) 1229-1241.

[8] K. Zheng, T. Feng, W. Zhang, X. Huang, Z. Li, D. Zhang, Y. Yao, X. Zou, Variable selection by double competitive adaptive reweighted sampling for calibration transfer of near infrared spectra, Chemometrics and Intelligent Laboratory Systems, 191 (2019) 109-117.

[9] H. Yu, Y. Yun, W. Zhang, H. Chen, D. Liu, Q. Zhong, W. Chen, W. Chen, Three-step hybrid strategy towards efficiently selecting variables in multivariate calibration of near-infrared spectra, Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy, 224 (2020) 117376.

[10] Q. Shen, Y. Xu, H. Kang, J. Bu, W. Guo, Research Status of Decomposition of Overlapping Peaks By Mathematical Methods at Home and Abroad, Value Engineering, 30 (2011) 197-197.

[11] M. Li, Y. Sheng, Study on Application of Gaussian Fitting Algorithm to Building Model of Spectral Analysis, Spectroscopy and Spectral Analysis, 28 (2008) 2352-2355.

[12] A. Sadat, I.J. Joye, Chapter 3: Peak Fitting Applied to Fourier Transform Infrared and Raman Spectroscopic Analysis of Proteins, A multidimensional view on zein proteins: Structure and functionality in dough and bread systems, 10 (2022) 38-61.

[13] M. Asemani, A.R. Rabbani, Detailed FTIR spectroscopy characterization of crude oil extracted asphaltenes: Curve resolve of overlapping bands, Journal of Petroleum Science and Engineering, 185 (2020) 106618.

[14] M. Fevzioglu, O.K. Ozturk, B.R. Hamaker, O.H. Campanella, Quantitative approach to study secondary structure of proteins by FT-IR spectroscopy, using a model wheat gluten system, International Journal of Biological Macromolecules, 164 (2020) 2753-2760.

[15] M.F. Wahab, T.C. O'Haver, Wavelet transforms in separation science for denoising and peak overlap detection, Journal of Separation Science, 43 (2020) 1998-2010.

[16] G. Yang, J. Dai, X. Liu, M. Chen, X. Wu, Spectral feature extraction based on continuous wavelet transform and image segmentation for peak detection, Analytical Methods, 12 (2020) 169-178.

[17] D. Li, A.L. Hansen, C. Yuan, L. Bruschweiler-Li, R. Bruschweiler, DEEP picker is a deep neural network for accurate deconvolution of complex two-dimensional NMR spectra, Nature Communications, 12 (2021) 1-13.

[18] J. Cai, Y. Xiao, X. Li, De-noising of tobacco near infrared spectroscopy based on generalized S transform, Acta Tabacaria Sinica, 23 (2017) 9-14.

[19] B. Hu, D.W. Sun, H. Pu, Q. Wei, Rapid nondestructive detection of mixed pesticides residues on fruit surface using SERS combined with self-modeling mixture analysis method, Talanta, 217 (2020) 120998.

[20] Y. Sun, L. Luo, Y. Liu, Analysis of Functional Group Mole Fraction of Complex Fuels Based on Neural Network and Infrared Spectrum, Journal of Engineering Thermophysics, 43 (2022) 1116-1122.

[21] G. Wang, H. Ye, Principal component analysis and partial least squares, Tsinghua University Press, Beijing, 2012.

[22] A. Goshtasby, W.D. Oneill, Curve fitting by a sum of Gaussians, CVGIP: Graphical Models and Image Processing, 56 (1994) 281-288.

[23] https://eigenvector.com/resources/data-sets/#corn-sec

Claims

1. A self-adaptive near infrared spectrum transformation method based on correlation and Gaussian curve fitting, characterized in that the method comprises the following implementation process:

where A is the height vector of the Gaussian functions, Ŝ is the fitted spectral vector, and S is the original spectral vector; n is the number of spectrum discrete points; λ(1) represents the wavelength corresponding to spectrum discrete point 1, λ(n) represents the wavelength corresponding to spectrum discrete point n, and so on; Ŝ(1) represents the fitted value corresponding to the 1st spectrum discrete point; S(1) represents the original value of the 1st spectrum discrete point; A(1) represents the height of the Gaussian function corresponding to the 1st spectrum discrete point, and so on; the matrix on the left of equation (5) is a full-rank matrix;

2. Use of a near infrared spectrum, characterized in that the near infrared spectrum is a transformed near infrared spectrum obtained by the transformation method according to claim 1, and the transformed near infrared spectrum is used for quantitative analysis.

3. A quantitative analysis method based on near infrared spectrum, characterized by comprising the following implementation process:

quantitative analysis is divided into modeling and detection;

4. A quantitative analysis method based on near infrared spectrum according to claim 3, wherein,

modeling:

detection:

5. A method of quantitative analysis based on near infrared spectroscopy according to claim 3, wherein the method of quantitative analysis based on near infrared spectroscopy is used for detection of moisture content of cereal grains.

6. A method of quantitative analysis based on near infrared spectroscopy according to claim 3, wherein the method of quantitative analysis based on near infrared spectroscopy is used for detecting COD content in sewage.