CN115221927A

CN115221927A - Ultraviolet-visible spectrum dissolved organic carbon detection method

Info

Publication number: CN115221927A
Application number: CN202210888398.1A
Authority: CN
Inventors: 王柯; 刘半藤; 陈友荣; 孟佳洋; 吕晓雯
Original assignee: Zhejiang Shuren University
Current assignee: Zhejiang Shuren University
Priority date: 2022-07-26
Filing date: 2022-07-26
Publication date: 2022-10-21

Abstract

The invention discloses a method for detecting dissolved organic carbon using the ultraviolet-visible spectrum, belonging to the technical field of water quality detection. The method comprises the following steps: dividing the dissolved organic carbon samples to be detected into training samples and test samples, and performing conventional dissolved organic carbon detection on the training samples to obtain their concentration values; preprocessing the training samples, wherein the preprocessing comprises smoothing and denoising, scattering correction, and elimination of baseline drift; performing global feature extraction on the preprocessed training samples using a self-organizing map network; performing cluster analysis on the global features to obtain feature information that represents the spectral differences between samples; constructing a regression tree ensemble model; solving the regression tree ensemble model; introducing a regularization term and constructing a nonlinear analysis model based on a regularized greedy decision tree; and substituting the test samples into the nonlinear analysis model to verify its effectiveness. The method can improve the accuracy of dissolved organic carbon concentration detection.

Description

Ultraviolet-visible spectrum dissolved organic carbon detection method

Technical Field

The invention belongs to the technical field of water quality detection, and particularly relates to a method for detecting dissolved organic carbon by using an ultraviolet-visible spectrum.

Background

Dissolved organic carbon (DOC) is an important indicator that directly reflects the degree to which human activities pollute water bodies, and it is significant for evaluating organic pollution of water in real time and for water environment protection work. Studies have shown that ultraviolet absorbance at a wavelength of 254 nm correlates well with the dissolved organic carbon concentration measured by the high-temperature oxidation method. In actual measurements, however, some wavelengths are susceptible to interference from inorganic ions, so measurement methods based on single-wavelength absorbance have poor applicability.

At present, researchers at home and abroad focus on establishing multi-wavelength analysis models for dissolved organic carbon in water bodies. For example, Sandford et al. used an ultraviolet spectrophotometric sensor for real-time, in-situ detection of DOC in fresh water, performed quantitative analysis in the 230-300 nm wavelength range with a mixed linear analysis curve-fitting algorithm, and showed by comparison with the high-temperature catalytic oxidation method that the approach has good linearity. Fichot et al. established a correlation between the spectral slope coefficient of the ultraviolet spectrum in the 275-295 nm range and the DOC concentration for on-site, online quantitative analysis of DOC in surface water, and found a strong correlation between the ultraviolet absorption characteristic at 280 nm and the dissolved organic carbon content. Li et al. used a self-made miniature LED ultraviolet sensor to study the relationship between dissolved organic carbon and the spectral slopes at 254 nm and 280 nm in surface water and wastewater treatment, and the results showed that the spectrum at 280 nm can supplement the traditional 254 nm measurement. However, in studying multi-wavelength analysis models, these researchers did not consider that a highly effective characteristic wavelength subset extracted from the full spectrum of dissolved organic carbon is easily interfered with by environmental factors, which reduces the detection performance of the model.

In summary, owing to the influence of instruments, environmental noise, scattering interference, and the like, the ultraviolet-visible absorbance of dissolved organic carbon and the concentration of the sample to be detected do not strictly follow the Lambert-Beer law, and the problems of how to select wavelengths reasonably and how to extract spectral features effectively remain to be solved. The lack of a scheme for extracting a highly effective characteristic wavelength subset from the full spectrum leads to high data complexity and a low proportion of useful information, so existing models have difficulty detecting the dissolved organic carbon concentration accurately and efficiently.

Disclosure of Invention

The embodiment of the invention aims to provide a method for detecting dissolved organic carbon using the ultraviolet-visible spectrum, which can solve the technical problems that existing dissolved organic carbon detection methods are easily interfered with by instruments, environmental noise, and scattering, lack a general method for detection over the full spectrum, have high data complexity and a low proportion of useful information, and have difficulty detecting the dissolved organic carbon concentration accurately and efficiently.

In order to solve the technical problem, the invention is realized as follows:

The embodiment of the invention provides a method for detecting dissolved organic carbon using the ultraviolet-visible spectrum, which comprises the following steps:

S101: dividing the dissolved organic carbon samples to be detected into training samples and test samples, and performing conventional dissolved organic carbon detection on the training samples to obtain the concentration values of the training samples;

S102: preprocessing the training samples, wherein the preprocessing comprises smoothing and denoising, scattering correction, and elimination of baseline drift;

S103: performing global feature extraction on the preprocessed training samples using a self-organizing map network;

S104: performing cluster analysis on the global features to obtain feature information that represents the spectral differences between samples;

S105: constructing a regression tree ensemble model;

S106: solving the regression tree ensemble model;

S107: introducing a regularization term and constructing a nonlinear analysis model based on the regularized greedy decision tree;

S108: substituting the test samples into the nonlinear analysis model to verify the effectiveness of the nonlinear analysis model.

In the embodiment of the invention, interference from instruments, environmental noise, and scattering is eliminated by preprocessing the data, and the organic carbon concentration is detected over the full spectrum using the self-organizing map network and the nonlinear analysis model based on the regularized greedy decision tree. The method is therefore broadly applicable, the data complexity is low, the proportion of useful information is high, and the accuracy of dissolved organic carbon concentration detection is improved.

Drawings

Fig. 1 is a schematic flow chart of a method for detecting dissolved organic carbon in ultraviolet-visible spectrum according to an embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings in combination with embodiments.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art based on the embodiments of the present invention without any inventive step, are within the scope of the present invention.

The method for detecting dissolved organic carbon in ultraviolet-visible spectrum provided by the embodiment of the invention is described in detail by specific embodiments and application scenarios thereof with reference to the attached drawings.

Referring to fig. 1, a schematic flow chart of a method for detecting dissolved organic carbon in ultraviolet-visible spectrum according to an embodiment of the present invention is shown.

The method for detecting the dissolved organic carbon in the ultraviolet-visible spectrum, provided by the embodiment of the invention, comprises the following steps:

S101: dividing the dissolved organic carbon samples to be detected into training samples and test samples, and performing conventional dissolved organic carbon detection on the training samples to obtain the concentration values of the training samples.

In the practical application process, a person skilled in the art can select the relative proportion between the training sample and the testing sample according to practical needs, and the embodiment does not limit this.

In a possible implementation, S101 specifically includes:

S1011: assigning 80% of the dissolved organic carbon samples to be detected to the training samples and the remaining 20% to the test samples, and performing conventional dissolved organic carbon detection on the training samples to obtain the concentration values of the training samples.
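By way of non-limiting illustration, the 80%/20% division of S1011 could be sketched as follows. The random shuffling, the fixed seed, the function name `split_samples`, and the sample identifiers are assumptions made for this example; the embodiment does not prescribe how samples are assigned to each subset.

```python
import random

def split_samples(samples, train_fraction=0.8, seed=0):
    """Shuffle the sample records and split them into training and test subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # assumed: random assignment
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical sample identifiers; real records would hold full spectra.
samples = [f"sample_{i:02d}" for i in range(50)]
train, test = split_samples(samples)
print(len(train), len(test))
```

Only the training subset would then be sent for conventional dissolved organic carbon detection at this stage.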

S102: preprocessing the training samples, wherein the preprocessing comprises smoothing and denoising, scattering correction, and elimination of baseline drift.

It should be noted that, in the preprocessing process, the characterization capability of the spectral feature on the sample information can be improved by adding feature correlation analysis on the basis of local feature extraction.

In a possible implementation, S102 specifically includes sub-steps S1021 to S1024:

S1021: removing noise signals with a large frequency span from the training samples using the Savitzky-Golay convolution smoothing method, which is implemented as follows:

$$\bar{x}_n = \frac{1}{H}\sum_{i=-w}^{w} h_i\, x_{n+i} \tag{1}$$

wherein $x_{n+i}$ is the value of the spectrum $x$ at the $(n+i)$-th wavelength, $\bar{x}_n$ represents the $n$-th wavelength of the average spectrum after mean-centering treatment, $h_i$ is a smoothing coefficient, $H$ is a normalization factor, and $w$ determines the window size;
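By way of non-limiting illustration, a Savitzky-Golay convolution of this kind could be sketched as below. The coefficients (-3, 12, 17, 12, -3) with normalization factor 35 are the standard 5-point quadratic Savitzky-Golay set and are an assumption of this example, as is leaving the edge points unsmoothed; the embodiment does not specify the window size or polynomial order.

```python
def savgol_smooth(x, h=(-3, 12, 17, 12, -3), H=35):
    """Smooth spectrum x with a centered window: (1/H) * sum_i h[i] * x[n+i]."""
    w = len(h) // 2                      # half-width of the window
    out = list(x)                        # assumed: edge points are left unsmoothed
    for n in range(w, len(x) - w):
        out[n] = sum(h[i + w] * x[n + i] for i in range(-w, w + 1)) / H
    return out

# A quadratic signal passes through a quadratic Savitzky-Golay filter unchanged,
# while an isolated spike is attenuated.
absorbance = [0.01 * n * n for n in range(10)]
smoothed = savgol_smooth(absorbance)
spike = savgol_smooth([0.0, 0.0, 1.0, 0.0, 0.0])
```

Unlike a plain moving average, this filter preserves the shape of smooth absorption peaks while damping point-to-point noise.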

S1022: using the multiplicative scatter correction (MSC) method, performing linear regression of the spectrum $x$ on the average spectrum $\bar{x}$, and performing scattering correction with the linear regression parameters:

$$x_{MSC} = \frac{x - b_0}{b} \tag{2}$$

wherein $x_{MSC}$ is the set of spectra after multiplicative scatter correction, and $b_0$ and $b$ are correction parameters obtained by linear regression of the spectrum $x$ on the average spectrum $\bar{x}$ using the least squares method;
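By way of non-limiting illustration, multiplicative scatter correction could be sketched as follows: each spectrum is regressed on the mean spectrum by least squares, and the fitted intercept and slope are then removed. The function name `msc` and the list-of-lists data layout are assumptions of this example.

```python
def msc(spectra):
    """Multiplicative scatter correction of each spectrum against the mean spectrum."""
    n_wl = len(spectra[0])
    mean = [sum(s[j] for s in spectra) / len(spectra) for j in range(n_wl)]
    m_bar = sum(mean) / n_wl
    denom = sum((m - m_bar) ** 2 for m in mean)
    corrected = []
    for x in spectra:
        x_bar = sum(x) / n_wl
        # Least-squares fit of x = b0 + b * mean
        b = sum((m - m_bar) * (v - x_bar) for m, v in zip(mean, x)) / denom
        b0 = x_bar - b * m_bar
        corrected.append([(v - b0) / b for v in x])
    return corrected
```

Spectra that differ only by an additive offset and a multiplicative gain collapse onto the same corrected curve.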

S1023: eliminating the influence of baseline drift by a background modeling method: selecting the partial spectrum at non-characteristic wavelengths in the target analyte spectrum, and fitting this partial spectrum in polynomial form using the least squares method:

$$A = c\lambda^{z}$$

$$\log(A) = \log(c) + z\log(\lambda) \tag{3}$$

wherein $A$ represents the absorbance, which is expressed in logarithmic form for the fit, $\lambda$ represents the wavelength, $z$ represents the order of the relationship between absorbance and wavelength, and $c$ represents a constant;

S1024: subtracting the absorbance of the partial spectrum from the total absorbance of the training sample to obtain the absorbance of the training sample after scattering is removed.
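By way of non-limiting illustration, the background model A = c·λ^z could be fitted in log-log space and subtracted as sketched below. Which indices count as non-characteristic wavelengths (`background_idx`) is an assumption supplied by the caller, and the function names are hypothetical.

```python
import math

def fit_power_law(wavelengths, absorbances):
    """Least-squares fit of log(A) = log(c) + z*log(lambda); returns (c, z)."""
    lx = [math.log(w) for w in wavelengths]
    ly = [math.log(a) for a in absorbances]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    z = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
         / sum((a - mx) ** 2 for a in lx))
    c = math.exp(my - z * mx)
    return c, z

def remove_baseline(wavelengths, absorbances, background_idx):
    """Fit the background on the selected indices and subtract it everywhere."""
    wl_bg = [wavelengths[i] for i in background_idx]
    a_bg = [absorbances[i] for i in background_idx]
    c, z = fit_power_law(wl_bg, a_bg)
    return [a - c * w ** z for w, a in zip(wavelengths, absorbances)]
```

The logarithmic form turns the power-law fit into an ordinary straight-line regression, which requires the selected background absorbances to be strictly positive.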

S103: performing global feature extraction on the preprocessed training samples using a self-organizing map network.

It should be noted that the features of the ultraviolet-visible spectrum of dissolved organic carbon are not concentrated and are easily affected by nonlinearity; performing global feature extraction with the self-organizing map network under these conditions achieves a smaller detection error and improves detection performance.

In a possible implementation, S103 specifically includes sub-steps S1031 to S1032:

S1031: letting the number of neurons in the two-dimensional neuron array be $m$ and the external input vector $X$ be an $N$-dimensional vector, i.e., $X = [x_1, x_2, \ldots, x_N]^T$, the weight vector $W_i$ between the input vector and the $i$-th hidden layer unit is $W_i = [w_{i1}, w_{i2}, \ldots, w_{iN}]^T$, wherein $w_{iN}$ represents the $N$-th weight of the $i$-th hidden layer unit;

S1032: in the competitive learning network, the neurons compete with one another to determine a winning neuron, and the winning neuron and its neighboring neurons are adjusted in the learning network, wherein the competition result $q$ is defined as the neuron whose weight vector is closest to the input vector, namely:

$$q = \arg\min_{i} \lVert X - W_i \rVert, \quad i = 1, 2, \ldots, m \tag{4}$$

The topological neighborhood function of the winning neuron $q$ is denoted $\eta_{qi}(k)$ (Formula 5), wherein $r_i$ and $r_q$ are the coordinates of neurons $i$ and $q$, respectively, $\eta(k)$ and $R(k)$ are decreasing functions, and $\eta_{qi}(k)$ decreases monotonically with the number of iterations $k$. The neuron weights are updated by:

$$W_i(k+1) = W_i(k) + \mu(k)\,\eta_{qi}(k)\,[X - W_i(k)] \tag{6}$$

wherein $W_i(k)$ represents the weight at the $k$-th iteration, and $\mu(k)$ is the learning rate parameter at the $k$-th iteration, which decreases with the number of iterations $k$.
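By way of non-limiting illustration, the competitive learning of S1031 and S1032 could be sketched as follows. The Gaussian form of the neighborhood function, the exponential decay schedules for the learning rate and radius, the random weight initialization, and the names `winner` and `train_som` are assumptions of this example; the embodiment only requires that these quantities decrease with the iteration number.

```python
import math
import random

def winner(weights, x):
    """Index q of the neuron whose weight vector is closest to the input x."""
    return min(range(len(weights)),
               key=lambda i: sum((xj - wj) ** 2 for xj, wj in zip(x, weights[i])))

def train_som(data, rows, cols, n_iter=200, mu0=0.5, r0=None, seed=0):
    """Train a rows x cols self-organizing map; returns (weights, coords)."""
    rng = random.Random(seed)
    dim = len(data[0])
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [[rng.random() for _ in range(dim)] for _ in coords]
    r0 = r0 or max(rows, cols) / 2
    for k in range(n_iter):
        x = data[rng.randrange(len(data))]
        decay = math.exp(-k / n_iter)            # assumed decreasing schedule
        mu, radius = mu0 * decay, r0 * decay
        q = winner(weights, x)
        for i, w in enumerate(weights):
            d2 = (coords[i][0] - coords[q][0]) ** 2 + (coords[i][1] - coords[q][1]) ** 2
            eta = math.exp(-d2 / (2 * radius ** 2))   # assumed Gaussian neighborhood
            weights[i] = [wj + mu * eta * (xj - wj) for wj, xj in zip(w, x)]
    return weights, coords

# Hypothetical two-dimensional inputs; real inputs would be preprocessed spectra.
data = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
weights, coords = train_som(data, rows=2, cols=2)
```

After training, the index returned by `winner` for a spectrum, or the corresponding weight vector, can serve as a compact global feature for later steps.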

S104: performing cluster analysis on the global features to obtain feature information that represents the spectral differences between samples.

In a possible implementation, S104 specifically includes:

S1041: performing cluster analysis on the global features using the k-means clustering algorithm to obtain feature information that represents the spectral differences between samples.

It should be noted that the amount of spectral information can be further compressed by using a k-means clustering algorithm.
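By way of non-limiting illustration, the k-means clustering of S1041 could be sketched with Lloyd's algorithm as below. The evenly spaced deterministic initialization, the fixed iteration count, and the example feature vectors are assumptions of this example.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=50):
    """Lloyd's k-means; returns (centroids, labels)."""
    # Assumed initialization: k points taken at even spacing through the data.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:   # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Hypothetical global features, e.g. weight vectors produced by the feature extraction step.
features = [[0.0, 0.0], [0.1, 0.1], [0.0, 0.2], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
centroids, labels = kmeans(features, k=2)
print(labels)
```

Replacing each group of similar features by its centroid is one way the amount of spectral information can be compressed.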

S105: constructing a regression tree ensemble model.

In a possible implementation, S105 specifically includes sub-steps S1051 to S1052:

S1051: for the spectral feature data, representing the feature vector of the sample set by a vector $u$, the regression tree ensemble model $F_M(u)$ is expressed as:

$$F_M(u) = \sum_{m=1}^{M} T(u; \Theta_m) \tag{7}$$

wherein $T(u; \Theta_m)$ represents a decision tree, $\Theta_m$ represents the optimization parameters of the decision tree, and $M$ is the number of decision trees;

S1052: the recursive relationship between two successive models follows from Formula 7:

$$F_m(u) = F_{m-1}(u) + T(u; \Theta_m) \tag{8}$$

S106: solving the regression tree ensemble model.

In a possible implementation, S106 specifically includes sub-steps S1061 to S1062:

S1061: solving for the optimization parameters $\Theta_m$ of the decision tree according to the empirical risk minimization criterion:

$$\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L\big(y_i,\, F_{m-1}(u_i) + T(u_i; \Theta_m)\big) \tag{9}$$

wherein $N$ represents the number of samples in the sample subset used for training, $u_i$ represents the feature vector of the $i$-th sample, and $y_i$ is the corresponding measured value;

S1062: constructing a quantitative analysis model of the spectrum using a squared-error loss function, and then performing the gradient solution by Formula 10:

$$L(y_i, F_m(u_i)) = (y_i - F_m(u_i))^2 = (y_i - F_{m-1}(u_i) - T(u_i; \Theta_m))^2 \tag{10}$$

wherein $y_i - F_{m-1}(u_i)$ is the residual of the current model's fit to the data. For the weak learner $T(u; \Theta_m)$, the regression tree is therefore trained on the residual of the data fit at each step.
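By way of non-limiting illustration, the residual-fitting recursion of S105 and S106 could be sketched as below, with each weak learner reduced to a single-split regression tree (a stump). The stump learner, the zero initial model, the number of rounds, and the function names are assumptions of this example, and every feature is assumed to take at least two distinct values.

```python
def fit_stump(U, residuals):
    """Best single-split regression tree (stump) under squared error."""
    best = None
    for j in range(len(U[0])):
        for t in sorted({u[j] for u in U})[:-1]:      # candidate thresholds
            left = [r for u, r in zip(U, residuals) if u[j] <= t]
            right = [r for u, r in zip(U, residuals) if u[j] > t]
            cl, cr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - cl) ** 2 for r in left) + sum((r - cr) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, cl, cr)
    _, j, t, cl, cr = best
    return lambda u: cl if u[j] <= t else cr

def boost(U, y, M=20):
    """Build F_M(u) as a sum of M stumps, each fitted to the current residuals."""
    trees = []

    def predict(u):
        return sum(tree(u) for tree in trees)

    for _ in range(M):
        residuals = [yi - predict(ui) for ui, yi in zip(U, y)]   # y_i - F_{m-1}(u_i)
        trees.append(fit_stump(U, residuals))
    return predict
```

With the squared-error loss, fitting each new tree to the residuals is exactly the minimization of the loss over the added tree, which is why no separate gradient computation appears in the loop.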

S107: introducing a regularization term and constructing a nonlinear analysis model based on the regularized greedy decision tree.

In a possible implementation, S107 specifically includes sub-steps S1071 to S1075:

S1071: the path from the root node to each leaf node constitutes a rule (Formula 11), wherein $v$ denotes a node of the tree, $L(u)$ denotes a loss function, and the base learner $b_v(u)$ represents whether $u$ can reach the node $v$ after the decisions at the decision tree nodes; among all non-leaf nodes passed, one term of the rule indicates whether the $j$ nodes are less than the node threshold, and the other term indicates whether the $k$ nodes are greater than the node threshold;

S1072: letting $v_1$ and $v_2$ be the two child nodes of the node $v$, the base learner $b_v(u)$ at the node $v$ is expressed, according to the additive method, as a combination of the child nodes:

$$b_v(u) = b_{v_1}(u) + b_{v_2}(u) \tag{12}$$

S1073: for the model $F$, each node $v$ in $F$ is associated with a pair $(b_v, a_v)$, wherein $b_v$ represents the base learner at node $v$ and $a_v$ represents the weight of the node in the global learning process; the additive model of $F$ can be expressed as $h_F(u) = \sum_{v \in F} a_v b_v(u)$, and the loss function of the additive model after adding the regularization term $R(h_F)$ is expressed as:

$$Q(F) = L(h_F(U), Y) + R(h_F) \tag{13}$$

S1074: fixing all node weights, performing a locally optimal search over the leaf nodes of the whole decision forest, finding the structural change that makes the loss function decrease fastest, and determining the locally optimal decision forest structure;

S1075: fixing the decision forest structure and updating the node weights $a_v$ to minimize the loss function and increase the prediction accuracy of the model; steps S1074 and S1075 are performed alternately to obtain the optimal detection model.
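By way of non-limiting illustration, the weight-update half of this alternation (fixed forest structure, weights adjusted to reduce the regularized loss) could be sketched as below. The squared-error loss, the L2 form of the regularization term, plain gradient descent, and all function names are assumptions of this example; the structure search of S1074 is not shown.

```python
def h_F(u, nodes):
    """Additive model: sum over nodes of a_v * b_v(u)."""
    return sum(a * b(u) for a, b in nodes)

def Q(nodes, U, Y, lam):
    """Regularized loss: squared error plus an assumed L2 penalty on the weights."""
    loss = sum((y - h_F(u, nodes)) ** 2 for u, y in zip(U, Y))
    penalty = lam * sum(a * a for a, _ in nodes)
    return loss + penalty

def update_weights(nodes, U, Y, lam, step=0.05, n_steps=200):
    """Gradient descent on the weights a_v with the forest structure held fixed."""
    nodes = list(nodes)
    for _ in range(n_steps):
        residuals = [y - h_F(u, nodes) for u, y in zip(U, Y)]
        grads = [-2 * sum(r * b(u) for r, u in zip(residuals, U)) + 2 * lam * a
                 for a, b in nodes]
        nodes = [(a - step * g, b) for (a, b), g in zip(nodes, grads)]
    return nodes
```

Here each node is a pair of a weight and an indicator function for reaching that node; the penalty shrinks the fitted weights slightly toward zero compared with the unregularized leaf means.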

S108: substituting the test samples into the nonlinear analysis model to verify the effectiveness of the nonlinear analysis model.

In a possible implementation, S108 specifically includes sub-steps S1081 to S1083:

S1081: substituting the test samples into the nonlinear analysis model to obtain the model predicted values of the test samples;

S1082: performing conventional dissolved organic carbon detection on the test samples to obtain their true concentration values;

S1083: comparing the model predicted values with the true concentration values to verify the effectiveness of the nonlinear analysis model.
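By way of non-limiting illustration, the comparison of S1083 could be quantified with the root-mean-square error and the coefficient of determination, as sketched below. The choice of these two metrics and the example concentration values are assumptions of this example; the embodiment does not name a specific metric.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted concentrations."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination of the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical concentrations (e.g. in mg/L) for four test samples.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
print(rmse(measured, predicted), r2(measured, predicted))
```

A lower root-mean-square error and a coefficient of determination closer to 1 indicate closer agreement between the model and the conventional measurement.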

In the embodiment of the invention, interference from instruments, environmental noise, and scattering is eliminated by preprocessing the data, and the organic carbon concentration is detected over the full spectrum using the self-organizing map network and the nonlinear analysis model based on the regularized greedy decision tree. The method is therefore broadly applicable, the data complexity is low, the proportion of useful information is high, and the accuracy of dissolved organic carbon concentration detection is improved.

The above description is only an example of the present invention, and is not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims

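1. A method for detecting dissolved organic carbon using the ultraviolet-visible spectrum, characterized by comprising the following steps:

S101: dividing the dissolved organic carbon samples to be detected into training samples and test samples, and performing conventional dissolved organic carbon detection on the training samples to obtain the concentration values of the training samples;

S102: preprocessing the training samples, wherein the preprocessing comprises smoothing and denoising, scattering correction, and elimination of baseline drift;

S103: performing global feature extraction on the preprocessed training samples using a self-organizing map network;

S104: performing cluster analysis on the global features to obtain feature information that represents the spectral differences between samples;

S105: constructing a regression tree ensemble model;

S106: solving the regression tree ensemble model;

S107: introducing a regularization term and constructing a nonlinear analysis model based on the regularized greedy decision tree;

S108: substituting the test samples into the nonlinear analysis model to verify the effectiveness of the nonlinear analysis model.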

2. The detection method according to claim 1, wherein the S101 specifically includes:

S1011: assigning 80% of the dissolved organic carbon samples to be detected to the training samples and the remaining 20% to the test samples, and performing conventional dissolved organic carbon detection on the training samples to obtain the concentration values of the training samples.

3. The detection method according to claim 1, wherein the S102 specifically includes:

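S1021: removing noise signals with a large frequency span from the training samples using the Savitzky-Golay convolution smoothing method, which is implemented as follows:

$$\bar{x}_n = \frac{1}{H}\sum_{i=-w}^{w} h_i\, x_{n+i} \tag{1}$$

wherein $x_{n+i}$ is the value of the spectrum $x$ at the $(n+i)$-th wavelength, $\bar{x}_n$ represents the $n$-th wavelength of the average spectrum after mean-centering treatment, $h_i$ is a smoothing coefficient, $H$ is a normalization factor, and $w$ determines the window size;

S1022: using the multiplicative scatter correction method, performing linear regression of the spectrum $x$ on the average spectrum $\bar{x}$, and performing scattering correction with the linear regression parameters:

$$x_{MSC} = \frac{x - b_0}{b} \tag{2}$$

wherein $x_{MSC}$ is the set of spectra after multiplicative scatter correction, and $b_0$ and $b$ are correction parameters obtained by linear regression of the spectrum $x$ on the average spectrum $\bar{x}$ using the least squares method;

S1023: eliminating the influence of baseline drift by a background modeling method: selecting the partial spectrum at non-characteristic wavelengths in the target analyte spectrum, and fitting this partial spectrum in polynomial form using the least squares method:

$$A = c\lambda^{z}$$

$$\log(A) = \log(c) + z\log(\lambda) \tag{3}$$

wherein $A$ represents the absorbance, which is expressed in logarithmic form for the fit, $\lambda$ represents the wavelength, $z$ represents the order of the relationship between absorbance and wavelength, and $c$ represents a constant;

S1024: subtracting the absorbance of the partial spectrum from the total absorbance of the training sample to obtain the absorbance of the training sample after scattering is removed.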

4. The detection method according to claim 1, wherein the S103 specifically includes:

S1031: letting the number of neurons in the two-dimensional neuron array be $m$ and the external input vector $X$ be an $N$-dimensional vector, i.e., $X = [x_1, x_2, \ldots, x_N]^T$, the weight vector $W_i$ between the input vector and the $i$-th hidden layer unit is $W_i = [w_{i1}, w_{i2}, \ldots, w_{iN}]^T$, wherein $w_{iN}$ represents the $N$-th weight of the $i$-th hidden layer unit;

S1032: in the competitive learning network, the neurons compete with one another to determine a winning neuron, and the winning neuron and its neighboring neurons are adjusted in the learning network, wherein the competition result $q$ is defined as the neuron whose weight vector is closest to the input vector, namely:

$$q = \arg\min_{i} \lVert X - W_i \rVert, \quad i = 1, 2, \ldots, m \tag{4}$$

The topological neighborhood function of the winning neuron $q$ is denoted $\eta_{qi}(k)$ (Formula 5), wherein $r_i$ and $r_q$ are the coordinates of neurons $i$ and $q$, respectively, $\eta(k)$ and $R(k)$ are decreasing functions, and $\eta_{qi}(k)$ decreases monotonically with the number of iterations $k$. The neuron weights are updated by:

$$W_i(k+1) = W_i(k) + \mu(k)\,\eta_{qi}(k)\,[X - W_i(k)] \tag{6}$$

wherein $W_i(k)$ represents the weight at the $k$-th iteration, and $\mu(k)$ is the learning rate parameter at the $k$-th iteration, which decreases with the number of iterations $k$.

5. The detection method according to claim 1, wherein the S104 specifically includes:

S1041: performing cluster analysis on the global features using the k-means clustering algorithm to obtain feature information that represents the spectral differences between samples.

6. The detection method according to claim 1, wherein the S105 specifically includes:

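S1051: for the spectral feature data, representing the feature vector of the sample set by a vector $u$, the regression tree ensemble model $F_M(u)$ is expressed as:

$$F_M(u) = \sum_{m=1}^{M} T(u; \Theta_m) \tag{7}$$

wherein $T(u; \Theta_m)$ represents a decision tree, $\Theta_m$ represents the optimization parameters of the decision tree, and $M$ is the number of decision trees;

S1052: the recursive relationship between two successive models follows from Formula 7:

$$F_m(u) = F_{m-1}(u) + T(u; \Theta_m) \tag{8}$$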

7. The detection method according to claim 6, wherein the S106 specifically includes:

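S1061: solving for the optimization parameters $\Theta_m$ of the decision tree according to the empirical risk minimization criterion:

$$\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L\big(y_i,\, F_{m-1}(u_i) + T(u_i; \Theta_m)\big) \tag{9}$$

wherein $N$ represents the number of samples in the sample subset used for training, $u_i$ represents the feature vector of the $i$-th sample, and $y_i$ is the corresponding measured value;

S1062: constructing a quantitative analysis model of the spectrum using a squared-error loss function, and then performing the gradient solution by Formula 10:

$$L(y_i, F_m(u_i)) = (y_i - F_m(u_i))^2 = (y_i - F_{m-1}(u_i) - T(u_i; \Theta_m))^2 \tag{10}$$

wherein $y_i - F_{m-1}(u_i)$ is the residual of the current model's fit to the data.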

8. The detection method according to claim 1, wherein the S107 specifically includes:

S1071: the path from the root node to each leaf node constitutes a rule (Formula 11), wherein $v$ denotes a node of the tree, $L(u)$ denotes a loss function, and the base learner $b_v(u)$ represents whether $u$ can reach the node $v$ after the decisions at the decision tree nodes; among all non-leaf nodes passed, one term of the rule indicates whether the $j$ nodes are less than the node threshold, and the other term indicates whether the $k$ nodes are greater than the node threshold;

S1072: letting $v_1$ and $v_2$ be the two child nodes of the node $v$, the base learner $b_v(u)$ at the node $v$ is expressed, according to the additive method, as a combination of the child nodes:

$$b_v(u) = b_{v_1}(u) + b_{v_2}(u) \tag{12}$$

S1073: for the model $F$, each node $v$ in $F$ is associated with a pair $(b_v, a_v)$, wherein $b_v$ represents the base learner at node $v$ and $a_v$ represents the weight of the node in the global learning process; the additive model of $F$ can be expressed as $h_F(u) = \sum_{v \in F} a_v b_v(u)$, and the loss function of the additive model after adding the regularization term $R(h_F)$ is expressed as:

$$Q(F) = L(h_F(U), Y) + R(h_F) \tag{13}$$

S1074: fixing all node weights, performing a locally optimal search over the leaf nodes of the whole decision forest, finding the structural change that makes the loss function decrease fastest, and determining the locally optimal decision forest structure;

S1075: fixing the decision forest structure and updating the node weights $a_v$ to minimize the loss function and increase the prediction accuracy of the model; steps S1074 and S1075 are performed alternately to obtain the optimal detection model.

9. The detection method according to claim 1, wherein the S108 specifically includes:

S1081: substituting the test samples into the nonlinear analysis model to obtain the model predicted values of the test samples;