CN112750507A - Method for simultaneously detecting content of nitrate and nitrite in water based on hybrid machine learning model - Google Patents

Info

Publication number
CN112750507A
Authority
CN
China
Prior art keywords
sample
nitrate
nitrite
model
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110054882.XA
Other languages
Chinese (zh)
Other versions
CN112750507B (en)
Inventor
熊莎
吴琼
张航
李勇刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202110054882.XA
Publication of CN112750507A
Application granted
Publication of CN112750507B
Legal status: Active

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/20: Identification of molecular entities, parts thereof or of chemical compositions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/243: Classification techniques relating to the number of classes
    • G06F18/24323: Tree-organised classifiers
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/254: Fusion techniques of classification results, e.g. of results related to same input data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G06N20/10: Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G06N20/20: Ensemble learning
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70: Machine learning, data mining or chemometrics
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/259: Fusion by voting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Investigating Or Analyzing Non-Biological Materials By The Use Of Chemical Means (AREA)

Abstract

The invention belongs to the field of spectral signal analysis, and particularly relates to a method for simultaneously detecting the content of nitrate and nitrite in water based on a hybrid machine learning model. The method comprises the following steps: acquiring spectral data of a series of nitrate and nitrite mixed solution samples with different nitrogen contents; classifying the samples according to the optimal critical concentration to obtain four types of samples; establishing a relation model between the nitrate and nitrite nitrogen contents of each of the four types of samples and the corresponding spectral data; screening characteristic wavelengths with high sensitivity and correlation, and establishing regression sub-models; and acquiring spectral data of a sample to be detected, determining the type of the sample according to the relation model, and predicting with the regression sub-model corresponding to that type to obtain the concentrations of nitrate and nitrite. The method achieves accurate and rapid detection of nitrate nitrogen and nitrite nitrogen, and maintains detection sensitivity at low concentration.

Description

Method for simultaneously detecting content of nitrate and nitrite in water based on hybrid machine learning model
Technical Field
The invention belongs to the field of spectral signal analysis, and particularly relates to a method for simultaneously detecting the content of nitrate and nitrite in water based on a hybrid machine learning model.
Background
At present, a variety of technologies for detecting nitrogen-containing compounds are available on the market, and they differ greatly in detection principle, calculation method, operating procedure, and field of application. The relatively mature instrumental methods for multi-component concentration analysis studied in China and abroad mainly include electrochemistry, capillary electrophoresis, ion chromatography, biosensing, and spectrophotometry. Electrochemical measurement is not yet adequate for monitoring trace analytes, and the electrode surface is easily fouled in real samples, which tends to make the results unstable. Capillary electrophoresis is reliable, but it requires large instruments, is complex to operate, and is difficult to use for automatic on-site monitoring. Chromatography can determine the concentrations of several ionic components simultaneously and is safe to use, but the equipment needs frequent maintenance and the analysis is time-consuming and expensive. Biosensor approaches still have to solve problems of robustness, selectivity, and standardization of operation. Spectroscopic techniques such as ultraviolet-visible, near-infrared, and fluorescence spectroscopy are non-destructive, general, and flexible, have the characteristics required for online monitoring, and are currently economical, feasible, rapid, and simple. In view of the light absorption characteristics of nitrate and nitrite, rapid and simple ultraviolet-visible spectrophotometry is selected as the basic detection method.
Conventional spectrophotometry usually detects nitrate and nitrite by sequential analysis: nitrite in a sample is first determined by the Griess reagent method; a second, identical sample is then reduced (generally on a copper/cadmium column) so that all nitrate is converted into nitrite; the nitrite analysis is repeated, and the nitrate concentration is calculated by difference. For nitrate this is an indirect analysis that is time-consuming and highly dependent on the accuracy of the nitrite determination. In addition, the Griess method involves toxic chemical reagents that are harmful to the operator and pollute the environment. Researchers have proposed that the ultraviolet absorption spectra of nitrate and nitrite can be used for direct measurement. However, the two spectra have similar shapes in their first half and absorption peaks at very close, nearly overlapping wavelengths, so in practice it is difficult to separate the contributions of nitrite and nitrate in the collected spectra. Existing direct spectroscopy still processes the spectral data with traditional chemometric methods and suffers from a narrow application range and low detection precision. In recent years, the combination of ultraviolet spectroscopy and machine learning has been applied successfully to the rapid detection of various compounds, but few studies address the separation of nitrate and nitrite. Preliminary experiments show that when a common machine learning model is applied to mixed solutions of nitrate and nitrite over a given concentration range, its sensitivity for predicting components at low concentration is insufficient. A machine learning method that maintains the same level of detection precision when the analyte concentration varies widely is therefore urgently needed.
Disclosure of Invention
Based on the above, the invention provides a method for simultaneously detecting the content of nitrate and nitrite in water based on a hybrid machine learning model, which combines classification and regression algorithms, can ensure that the detection precision of nitrate and nitrite in the whole model range reaches balance, is simple and convenient to operate and low in cost, and can simultaneously realize accurate and rapid detection of nitrate and nitrite in a simple environment.
The invention provides a method for simultaneously detecting the content of nitrate and nitrite in water based on a hybrid machine learning model, which specifically comprises the following steps:
S1, preparing a series of nitrate and nitrite mixed solution samples with different nitrogen contents, and measuring the spectral data of the samples;
S2, forming a two-dimensional plane from the nitrogen contents of nitrate and nitrite in the samples, obtaining the optimal critical concentration, dividing the two-dimensional plane into four sub-regions, and taking the samples in each sub-region as one type of sample to obtain four types of samples;
S3, establishing a relation model between the nitrate and nitrite nitrogen contents of each of the four types of samples and the corresponding spectral data, to realize automatic classification of the samples;
S4, taking the samples in each sub-region and on the classification boundaries as modeling samples, screening characteristic wavelengths with high sensitivity and correlation, and establishing regression sub-models;
S5, acquiring the spectral data of the sample to be detected, determining the type of the sample according to the relation model, and predicting with the regression sub-model corresponding to that type to obtain the concentrations of nitrate and nitrite in the sample to be detected.
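The control flow of step S5 can be illustrated with a minimal Python sketch. It only shows the routing idea (classify first, then apply the sub-model of the predicted class); the names `predict_hybrid`, `toy_classifier`, and `toy_sub_models` are hypothetical stand-ins, and the toy rules inside them are not the trained models described in this document.

```python
from typing import Callable, Dict, Sequence, Tuple

Spectrum = Sequence[float]
Concentrations = Tuple[float, float]  # (nitrate N, nitrite N) in mg N/L


def predict_hybrid(
    spectrum: Spectrum,
    classify: Callable[[Spectrum], int],
    sub_models: Dict[int, Callable[[Spectrum], Concentrations]],
) -> Concentrations:
    """Pick the sample class, then apply the regression sub-model of that class."""
    region = classify(spectrum)
    return sub_models[region](spectrum)


# Hypothetical stand-ins, for illustration only (not trained models).
def toy_classifier(spectrum: Spectrum) -> int:
    return 1 if max(spectrum) < 0.5 else 4


toy_sub_models = {
    1: lambda s: (0.2, 0.2),
    4: lambda s: (2.0, 2.0),
}

result = predict_hybrid([0.1, 0.3, 0.2], toy_classifier, toy_sub_models)
```

Keeping the classifier and the sub-models behind plain callables means either part can be replaced without touching the routing logic.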
Further, the step S3 specifically comprises:
training on the nitrate and nitrite nitrogen contents of each of the four types of samples and the corresponding spectral data to obtain a support vector machine classification model, a random forest classification model, and a logistic regression model.
Further, obtaining the support vector machine classification model specifically includes:
the objective function of the support vector classification model is as follows:
$$\min_{\omega, b, \xi}\ \frac{1}{2}\|\omega\|^{2} + C\sum_{i=1}^{l}\xi_i$$
$$\text{s.t.}\quad y_i(\omega^{T}x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, 2, \dots, l$$
where x_i is a sample vector, y_i is a sample class label, ω is a vector whose dimension equals the feature dimension of the sample, b is a real number, l is the total number of samples, C is a penalty factor, and ξ_i represents a slack variable;
selecting a Gaussian kernel function as the kernel function of the support vector machine, the expression of which is:
$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^{2}}{2\sigma^{2}}\right)$$
where x_i, x_j represent the feature vectors of samples in the low-dimensional space, and σ is the bandwidth of the Gaussian kernel, i.e., the kernel parameter.
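As a small illustration of the Gaussian kernel, the following Python sketch evaluates it for two feature vectors. It assumes the common form exp(-||x_i - x_j||^2 / (2σ^2)); the function name `gaussian_kernel` is hypothetical, and toolboxes may parameterize the bandwidth differently (for example with a single factor g multiplying the squared distance).

```python
import math
from typing import Sequence


def gaussian_kernel(x_i: Sequence[float], x_j: Sequence[float], sigma: float) -> float:
    """Gaussian kernel, assuming the form exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


same = gaussian_kernel([1.0, 2.0], [1.0, 2.0], sigma=1.0)
apart = gaussian_kernel([0.0, 0.0], [1.0, 1.0], sigma=1.0)
```

The kernel equals 1 for identical vectors and decays toward 0 as the vectors move apart, with σ controlling how quickly.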
Further, obtaining the random forest classification model specifically includes:
drawing bootstrap samples from the samples and constructing a CART tree from each; extracting a plurality of features at each node of the CART tree, calculating the Gini index of each feature, and obtaining the classification feature with the greatest classification capability; the Gini index of the sample D is calculated as:
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K}\left(\frac{C_k}{|D|}\right)^{2}$$
where C_k is the number of samples in the k-th class and K is the number of classes;
and classifying according to the classification features to obtain a tree structure with completely split nodes.
Further, obtaining the logistic regression model specifically includes:
the logistic regression model is as follows:
$$y = \frac{1}{1 + e^{-\omega^{T}x}}$$
where ω is the weight, x is the input sample data, and y is the probability that the sample belongs to the positive class of the classifier;
the loss function of the model is:
$$L(\omega) = \prod_{n=1}^{N}\hat{y}_n^{\,y_n}\,(1 - \hat{y}_n)^{1 - y_n}$$
where ω is the weight, N is the number of samples, ŷ_n is the probability that the sample is positive, and y_n is the sample class label, 0 or 1.
Further, in the step S4, selecting the characteristic wavelengths by the stable variable displacement method (SVP) and establishing the optimal variable subset specifically includes:
obtaining sub-data sets of the sample space and of the variable space by Monte Carlo sampling, calculating the stability of each variable in the sub-data sets of the sample space, and obtaining the elite variables with high stability; the stability S_j is calculated as:
$$S_j = \frac{|\bar{b}_j|}{\sqrt{\dfrac{1}{M-1}\sum_{i=1}^{M}\left(b_{ij} - \bar{b}_j\right)^{2}}}$$
where b_ij is the regression coefficient of the j-th variable for the i-th sample, b̄_j is the mean value of the regression coefficient of the j-th variable, and M is the total number of samples;
performing variable displacement analysis on the sub-data sets of the variable space, calculating the displacement degree, and obtaining the important variables with a high displacement degree; the displacement degree PD_j is calculated as PD_j = PCE_j - SCE_j, where PCE_j is the mean root-mean-square error of the models built separately on the wavelength subsets that do not contain variable j, and SCE_j is the mean root-mean-square error of the models built separately on the wavelength subsets that contain variable j;
and combining the elite variables and the important variables, and obtaining the optimal variable subset by cross validation.
Further, the final model structure in step S4 is:
$$y(x) = \sum_{i=1}^{n}\alpha_i K(x, x_i) + b$$
wherein
$$K(x, x_i) = \exp\left(-\frac{\|x - x_i\|^{2}}{2\sigma^{2}}\right)$$
x_i is the sample vector, σ is the bandwidth of the Gaussian kernel, i.e., the kernel parameter, and [b α_1 α_2 … α_n] are constants that can be obtained by solving the objective function of the least squares support vector machine with the Lagrange method.
Further, determining the type of the sample to be detected according to the relation model in step S5 specifically includes:
classifying the sample with the support vector machine classification model, the random forest classification model, and the logistic regression model respectively to obtain three class predictions, and selecting the class predicted by the majority of the three models as the class of the sample to be detected.
Further, the conditions for measuring the spectral data in steps S1 and S5 are as follows:
the spectral scanning range is 190-400 nm, and the spectral scanning interval is 1 nm.
Further, the optimal critical concentration in step S2 is 0.4 mg N L⁻¹.
Advantageous effects:
The invention prepares a series of mixed solutions of nitrate and nitrite in advance, measures their spectral data, and uses the data to establish a hybrid machine learning model through classification and regression algorithms.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for simultaneously detecting nitrate and nitrite contents in water based on a hybrid machine learning model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of sample classification provided by an embodiment of the present invention;
FIG. 3 is a frame diagram of an algorithm for analyzing the content of a sample to be tested according to an embodiment of the present invention;
FIG. 4 is a graph comparing the effect of predicting nitrate concentration using a single model and a mixed model according to the present invention;
FIG. 5 is a graph comparing the effect of predicting nitrite concentration using a single model and a mixture model according to embodiments of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Based on research, the invention finds that the combination of the ultraviolet spectrum and the machine learning method can be used for simultaneously and rapidly detecting nitrate and nitrite, but when a common machine learning model is oriented to a mixed solution of nitrate and nitrite within a certain concentration range, the sensitivity of predicting components at low concentration is insufficient, and a machine learning method which can still maintain the detection precision at the same level when the concentration of an analyte changes greatly is urgently needed to be found.
As shown in fig. 1, in one embodiment, a flow chart of a method for simultaneously detecting nitrate and nitrite contents in water based on a hybrid machine learning model is provided, which specifically includes the following steps:
Step S101, a series of nitrate and nitrite mixed solution samples with different nitrogen contents are prepared, and the spectral data of the samples are measured.
In this embodiment, standard stock solutions of nitrate nitrogen and nitrite nitrogen are prepared first: 0.7221 g of dried potassium nitrate or 0.4928 g of sodium nitrite is weighed, dissolved in an appropriate amount of fresh deionized water, transferred into a 1000 mL volumetric flask, diluted to the mark with deionized water, and mixed well for later use. Before use, the stock solutions are diluted to standard working solutions of 10 mg N L⁻¹. All reagents are of analytical grade (National Chemical Reagents, Ltd., China). Mixed samples are prepared with nitrite nitrogen concentrations of 0.1, 0.2, 0.3, 0.4, 0.8, 1.2, 1.6, 2.0, 2.5 and 3.0 mg N L⁻¹ and nitrate nitrogen concentrations of 0.1, 0.2, 0.3, 0.4, 0.8, 1.2, 1.6, 2.0, 2.5 and 3.0 mg N L⁻¹, giving 100 mixed samples in total. Deionized water is used as the reference solution for background subtraction, and the spectral data at each wavelength point are measured at 1 nm intervals over the wavelength range of 190-400 nm.
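The quantities in this step can be checked with a few lines of Python. The sketch below assumes standard molar masses (not given in the text) and computes the nitrogen concentration of each stock solution, the full 10 × 10 design of mixed samples, and the wavelength grid; with these masses both stocks come out close to 100 mg N L⁻¹, a value the text itself does not state. All names are hypothetical.

```python
from itertools import product

# Standard molar masses in g/mol (assumed values, not given in the text).
M_N = 14.007
M_KNO3 = 101.102
M_NANO2 = 68.995


def stock_mg_n_per_litre(mass_g: float, molar_mass: float, volume_l: float = 1.0) -> float:
    """Nitrogen concentration (mg N/L) of a stock made from mass_g of salt."""
    return mass_g / molar_mass * M_N * 1000.0 / volume_l


nitrate_stock = stock_mg_n_per_litre(0.7221, M_KNO3)   # potassium nitrate in 1000 mL
nitrite_stock = stock_mg_n_per_litre(0.4928, M_NANO2)  # sodium nitrite in 1000 mL

# Every nitrate level is combined with every nitrite level.
levels = [0.1, 0.2, 0.3, 0.4, 0.8, 1.2, 1.6, 2.0, 2.5, 3.0]
samples = list(product(levels, levels))  # (nitrate N, nitrite N) pairs

# 190-400 nm inclusive at 1 nm steps.
wavelengths_nm = list(range(190, 401))
```

Each spectrum therefore has 211 wavelength points, which is the input dimension seen by the classifiers before any dimension reduction or wavelength selection.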
Step S102, forming a two-dimensional plane by using the nitrogen content of nitrate and nitrite in the sample, obtaining the optimal critical concentration, dividing the two-dimensional plane into four sub-regions, and obtaining four types of samples by taking the sample in each sub-region as one type of sample.
As shown in FIG. 2, an embodiment of the present invention provides a sample classification diagram in which the nitrate-nitrite concentration plane is divided into four sub-regions that are modeled separately. Because the sensitivity of analyte prediction is insufficient at low concentration, the critical concentration that divides the sub-regions is chosen at a low position. Critical concentrations of 0.3, 0.4 and 0.8 mg N L⁻¹ were each used for modeling analysis, and the results are shown in Table 1: at a critical concentration of 0.4 mg N L⁻¹, the overall model has higher classification accuracy and lower average relative error. The nitrate and nitrite contents have different characteristics in each sub-region: in zone 1 both the nitrate and nitrite contents are low; in zone 2 the nitrate content is much higher than the nitrite content; in zone 3 the nitrate concentration is much lower than the nitrite concentration; in zone 4 both the nitrate and nitrite contents are high. Compared with a single full-range model, each sub-model is better adapted to the sample characteristics of its sub-region and has higher prediction precision.
TABLE 1 Comparison of model performance at different critical concentrations (mg N L⁻¹)
Figure RE-GDA0002979635160000081
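The division into four zones can be sketched in Python as a simple threshold rule on the two known concentrations of a training sample. The function name `region` is hypothetical, and assigning a sample that lies exactly on the critical concentration to the "low" side is an assumption; the text only states that boundary samples are included in the sub-models.

```python
CRITICAL_MG_N_PER_L = 0.4  # critical concentration selected in this embodiment


def region(nitrate_n: float, nitrite_n: float, critical: float = CRITICAL_MG_N_PER_L) -> int:
    """Return the zone (1-4) of a sample from its nitrate and nitrite nitrogen content.

    Assumption: a value equal to the critical concentration counts as "low".
    """
    nitrate_low = nitrate_n <= critical
    nitrite_low = nitrite_n <= critical
    if nitrate_low and nitrite_low:
        return 1  # both low
    if nitrite_low:
        return 2  # nitrate high, nitrite low
    if nitrate_low:
        return 3  # nitrate low, nitrite high
    return 4      # both high


levels = [0.1, 0.2, 0.3, 0.4, 0.8, 1.2, 1.6, 2.0, 2.5, 3.0]
labels = [region(no3, no2) for no3 in levels for no2 in levels]
```

These labels are what the classifiers are trained to reproduce from the spectra alone, since the concentrations of an unknown sample are not available at prediction time.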
Step S103, establishing a relation model between the nitrate and nitrite nitrogen contents of each of the four types of samples and the corresponding spectral data, to realize automatic classification of the samples.
In the embodiment of the invention, the nitrate and nitrite nitrogen contents of each of the four types of samples and the corresponding spectral data are used for training to obtain a support vector machine classification model, a random forest classification model, and a logistic regression model.
In the embodiment of the invention, the MATLAB toolbox LIBSVM-farutoultimateVersion is used to train the support vector machine classification model, whose objective function is as follows:
$$\min_{\omega, b, \xi}\ \frac{1}{2}\|\omega\|^{2} + C\sum_{i=1}^{l}\xi_i$$
$$\text{s.t.}\quad y_i(\omega^{T}x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, 2, \dots, l \qquad (1)$$
where x_i is a sample vector, y_i is a sample class label, ω is a vector whose dimension equals the feature dimension of the sample, b is a real number, l is the total number of samples, C is a penalty factor, and ξ_i represents a slack variable;
a Gaussian kernel function is selected as the kernel function of the support vector machine, and its expression is as follows:
$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^{2}}{2\sigma^{2}}\right) \qquad (2)$$
where x_i, x_j represent the feature vectors of samples in the low-dimensional space, and σ is the bandwidth of the Gaussian kernel;
When modeling with the SVM, the absorbance data are first normalized by mapping them to the range 0-1 to speed up the convergence of training; principal component analysis (PCA) is then used to reduce the dimension of the input data, and the two hyper-parameters, the penalty factor C and the kernel parameter σ, are optimized by particle swarm optimization (PSO).
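The normalization step can be sketched in plain Python as follows. The sketch assumes that each wavelength (column) is scaled independently to the range 0-1 and that a constant column is mapped to 0; the text does not specify either detail, and the function name `min_max_scale` is hypothetical.

```python
from typing import List, Sequence


def min_max_scale(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each column of a samples-by-wavelengths table to the range [0, 1].

    Assumption: columns are scaled independently; a constant column becomes 0.0.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


absorbance = [[1.0, 5.0], [3.0, 5.0], [2.0, 5.0]]  # toy values, not measured data
scaled = min_max_scale(absorbance)
```

In practice the minimum and maximum found on the training set would be stored and reused for the sample to be detected, so that both are mapped with the same transformation.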
The SVC function in the LIBSVM-farutoultimateVersion toolbox integrates the above steps and is called as follows: [predict_label, accuracy, bestc, bestg] = SVC(train_label, train_data, test_label, test_data, Method_option), where Method_option is a structure set to Method_option.scale = 1, Method_option.pca = 0, and Method_option.type = 2. That is, an SVM classification model is established to obtain the predicted sample class predict_label, and the optimal penalty factor C and kernel parameter g are output at the same time.
In the embodiment of the invention, the MATLAB toolbox RF_MexStandalone-v0.02 is used to train the random forest classification model. First, k new bootstrap sample sets are drawn randomly, with replacement, from the original training samples by the bootstrap method, and k CART trees are constructed from them; the samples not drawn in each round form k sets of out-of-bag data. Assuming there are n features, m features are drawn randomly at each node of each tree, and the feature with the greatest classification capability is selected for node splitting by calculating the Gini index of each feature. For a given sample D with K classes, where the number of samples in the k-th class is C_k, the Gini index of the sample D is calculated as follows:
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K}\left(\frac{C_k}{|D|}\right)^{2} \qquad (3)$$
If the selected attribute is A, the Gini index of the data set D after splitting is calculated as follows:
$$\mathrm{Gini}(D, A) = \sum_{j=1}^{K}\frac{|D_j|}{|D|}\,\mathrm{Gini}(D_j) \qquad (4)$$
where K here denotes the number of parts into which the sample D is divided, i.e., the data set D is split into K data sets D_j;
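The two Gini quantities can be computed with a short Python sketch. It assumes the usual CART definitions (one minus the sum of squared class proportions, and the size-weighted average over the subsets produced by a split); the function names `gini` and `gini_split` are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def gini(labels: Sequence[Hashable]) -> float:
    """Gini index of a set of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(subsets: Sequence[Sequence[Hashable]]) -> float:
    """Size-weighted Gini index of the subsets D_j produced by splitting D."""
    non_empty = [s for s in subsets if len(s) > 0]
    total = sum(len(s) for s in non_empty)
    return sum(len(s) / total * gini(s) for s in non_empty)


mixed = gini([1, 1, 2, 2])                   # two equally frequent classes
perfect = gini_split([[1, 1], [2, 2]])       # split separates the classes
useless = gini_split([[1, 2], [1, 2]])       # split leaves both subsets mixed
```

A split is preferred when it gives a lower weighted Gini index, so the split that separates the two classes would be chosen over the one that leaves them mixed.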
A tree structure is formed by complete node splitting, so that each CART tree grows to its maximum extent; every generated tree then votes on the sample class, and the final classification of an unknown sample is decided by majority vote.
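Bootstrap sampling with out-of-bag data, as used when building each tree, can be sketched in Python as follows. The function name `bootstrap_with_oob` and the fixed random seed are illustrative assumptions; the sketch draws sample indices only and does not build the trees.

```python
import random
from typing import List, Tuple


def bootstrap_with_oob(n_samples: int, rng: random.Random) -> Tuple[List[int], List[int]]:
    """Draw n_samples indices with replacement; return (in-bag, out-of-bag) indices."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
in_bag, out_of_bag = bootstrap_with_oob(100, rng)
```

Because sampling is done with replacement, some samples appear several times in a bag and others not at all; the latter form the out-of-bag data of that tree.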
In the embodiment of the invention, a program is written in MATLAB to implement logistic regression. The output of a linear regression model is used as the input of a sigmoid function to obtain the mathematical model of logistic regression, with the following formula:
$$y = \frac{1}{1 + e^{-\omega^{T}x}} \qquad (5)$$
where ω is the weight, x is the input sample data, and y is the probability that the sample belongs to the positive class of the classifier;
the loss function measures the difference between the model output and the true output; in logistic regression, the value of the loss function equals the total probability that the samples belong to their given classes, with the following formula:
$$L(\omega) = \prod_{n=1}^{N}\hat{y}_n^{\,y_n}\,(1 - \hat{y}_n)^{1 - y_n} \qquad (6)$$
where ω is the weight, N is the number of samples, ŷ_n is the probability that the sample is positive, and y_n is the sample class label, 0 or 1.
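A minimal Python sketch of the logistic model and of the total probability it assigns to a labeled data set is given below. It assumes the likelihood is the product of per-sample Bernoulli terms, which is one reading of "total probability" in the text; the function names are hypothetical and the data are toy values.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w: Sequence[float], x: Sequence[float]) -> float:
    """Probability that x belongs to the positive class: sigmoid(w . x)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def likelihood(w: Sequence[float], X: Sequence[Sequence[float]], y: Sequence[int]) -> float:
    """Product over samples of p^y * (1 - p)^(1 - y), with labels y in {0, 1}."""
    total = 1.0
    for x_n, y_n in zip(X, y):
        p = predict_proba(w, x_n)
        total *= p ** y_n * (1.0 - p) ** (1 - y_n)
    return total


X = [[1.0], [-1.0]]  # toy inputs
y = [1, 0]           # toy labels
uninformed = likelihood([0.0], X, y)  # every probability is 0.5
better = likelihood([2.0], X, y)      # weights that separate the two samples
```

Weights that separate the two toy samples give a larger likelihood than the all-zero weights, which is the quantity the training procedure tries to increase. For larger data sets the logarithm of this product is normally used to avoid numerical underflow.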
According to the idea of maximum likelihood estimation, the optimal ω is the one for which the loss function reaches its maximum value. A stochastic gradient descent method is applied: an initial value of ω is generated randomly, and the optimal ω is then obtained by iterating the following formula:
Figure RE-GDA0002979635160000104
where ω_t is the value of ω before the update and ω_{t+1} is the new value of ω;
The obtained ω is substituted into the mathematical model of logistic regression to calculate the class probability score of each sample, and the class with the highest probability score is taken as the final class of the sample. The one-vs-all idea is used to extend logistic regression to multi-class classification: assuming the data have N classes, one independent binary classifier is built with logistic regression for each of the N classes. For classifier i, the samples with label = i are taken as the positive class and all remaining samples as the negative class, and so on. The sample data to be predicted are input to all classifiers to obtain the probabilities p that each classifier judges the sample to belong to its positive class, and the class corresponding to the maximum probability in p is taken as the final predicted class.
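The one-vs-all decision rule can be sketched in Python as follows. The weights below are arbitrary illustrative numbers rather than trained values, and the function name `one_vs_all_predict` is hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def one_vs_all_predict(
    weights_by_class: Dict[int, Sequence[float]], x: Sequence[float]
) -> Tuple[int, Dict[int, float]]:
    """Score x with one binary logistic classifier per class; return the top class."""
    scores = {
        label: sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for label, w in weights_by_class.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Arbitrary illustrative weights for three classes (not trained values).
weights = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [-1.0, -1.0]}
predicted, scores = one_vs_all_predict(weights, [0.2, 2.0])
```

The per-class probabilities do not need to sum to one, because each binary classifier is trained independently; only their ranking is used.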
The classification models established by the support vector machine, the random forest, and logistic regression each vote on the sample class, and the class that receives the majority of the votes (at least 2) is taken as the final class of the sample.
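The voting rule can be written as a short Python sketch. The text does not say what happens when all three models disagree; returning `None` in that case is an assumption made here so that the caller can decide on a fallback. The function name `majority_vote` is hypothetical.

```python
from collections import Counter
from typing import Optional


def majority_vote(svm_label: int, rf_label: int, lr_label: int) -> Optional[int]:
    """Return the class predicted by at least two of the three classifiers.

    Assumption: if all three disagree, None is returned (case not covered by the text).
    """
    label, votes = Counter([svm_label, rf_label, lr_label]).most_common(1)[0]
    return label if votes >= 2 else None


agreed = majority_vote(1, 1, 3)
split = majority_vote(1, 2, 3)
```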
And step S104, taking the samples in the sub-regions and on the classification boundaries as modeling samples, screening characteristic wavelengths with high sensitivity and correlation, and establishing a regression sub-model.
In the embodiment of the invention, because the classifier is more likely to make errors at the boundaries between regions, each sub-model also includes the samples lying on the boundaries, so as to avoid large prediction errors caused by misclassification. A stable variable displacement method (SVP) is used to select the characteristic wavelengths and establish the optimal variable subset, and a least squares support vector machine is used to build the regression sub-models. SVP is based on the evolutionary principle of intraspecific competition and survival of the fittest; it evaluates the variables by considering their stability and displacement degree together with statistics related to model performance, and takes the variable subset with the smallest mean RMSE and a relatively low standard deviation as the optimal subset. For each sub-region, SVP selects a separate variable subset for nitrite and for nitrate. A model built on such a specific variable subset can adapt to the characteristics of the target ion and therefore performs better. The least squares support vector machine models are built in MATLAB with the LSSVMlabv1_8_R2009b_R2011a toolbox, using an RBF kernel function and a grid search for the optimal regularization parameter and kernel parameter, to obtain the regression sub-model of each sub-region.
In the embodiment of the invention, SVP is used to build a model for the nitrate component and for the nitrite component in each sub-region and to select the optimal characteristic wavelength subset. First, sub-data sets of the sample space and of the variable space are obtained by Monte Carlo sampling. The stability of each variable is calculated in the sub-data sets of the sample space and the variables are ranked by stability; the variables with high stability are taken as elite variables, and the rest as normal variables. The stability S_j is calculated as follows:
Figure RE-GDA0002979635160000111
in the formula bijThe regression coefficient of the jth variable of the ith sample,
Figure RE-GDA0002979635160000112
the mean value of the regression coefficient of the jth variable is shown, and M is the total number of samples.
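The stability ranking can be illustrated with a short sketch. It assumes the common definition of stability as the absolute mean of a variable's regression coefficients divided by their sample standard deviation (with M - 1 in the denominator); the function names `variable_stability` and `split_elite` and the parameter `n_elite` are hypothetical and are not taken from the patent.

```python
import math


def variable_stability(coefs):
    """coefs[i][j] is the regression coefficient of variable j in sub-model i."""
    m = len(coefs)
    n_vars = len(coefs[0])
    stability = []
    for j in range(n_vars):
        column = [coefs[i][j] for i in range(m)]
        mean = sum(column) / m
        # Assumption: sample standard deviation with M - 1 in the denominator.
        std = math.sqrt(sum((b - mean) ** 2 for b in column) / (m - 1))
        stability.append(abs(mean) / std if std > 0 else float("inf"))
    return stability


def split_elite(stability, n_elite):
    """Rank variables by descending stability; return (elite, normal) indices."""
    order = sorted(range(len(stability)), key=lambda j: stability[j], reverse=True)
    return order[:n_elite], order[n_elite:]
```

A variable whose coefficient keeps the same sign and magnitude across the Monte Carlo sub-models gets a high stability value, while one that fluctuates around zero gets a value near zero.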
Variable displacement analysis is then carried out on the sub-data sets of the variable space: the displacement degree of each variable is calculated, the variables are sorted by it, and variables with a high displacement degree are taken as important variables. The displacement degree $PD_j$ is calculated as follows:
$$PD_j = PCE_j - SCE_j \tag{9}$$
where $PCE_j$ is the mean root mean square error of the models built on the wavelength subsets that do not contain the j-th variable, and $SCE_j$ is the mean root mean square error of the models built on the remaining wavelength subsets, which contain the j-th variable.
The elite variables and the important variables are merged into a new variable subset, and the process is repeated. N variable subsets are obtained through N iterations, and cross validation is finally used to select, as the optimal subset, the variable subset with the minimum mean root mean square error and a relatively low standard deviation.
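The final selection step can be sketched as follows. The text does not say how the minimum mean and the "relatively low" standard deviation are traded off, so the rule used here (keep the subsets whose mean RMSE is within a relative `tolerance` of the best mean, then take the one with the lowest standard deviation) is purely a hypothetical choice, as are the names `summarize` and `pick_optimal_subset`.

```python
import math


def summarize(rmses):
    """Return (mean, population standard deviation) of a list of RMSE values."""
    mean = sum(rmses) / len(rmses)
    std = math.sqrt(sum((r - mean) ** 2 for r in rmses) / len(rmses))
    return mean, std


def pick_optimal_subset(cv_rmses, tolerance=0.05):
    """cv_rmses maps a subset label to its cross-validation RMSE values."""
    stats = {name: summarize(values) for name, values in cv_rmses.items()}
    best_mean = min(mean for mean, _ in stats.values())
    # Hypothetical rule: subsets within `tolerance` of the best mean compete on std.
    near_best = [name for name, (mean, _) in stats.items()
                 if mean <= best_mean * (1.0 + tolerance)]
    return min(near_best, key=lambda name: stats[name][1])
```

Setting `tolerance=0.0` reduces the rule to picking the subset with the smallest mean RMSE.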
The LSSVMlabv1_8_R2009b_R2011a toolbox is used to train the 4 least squares support vector machine (LSSVM) regression sub-models. LSSVM is an SVM whose loss function is a quadratic loss function, and its objective function is as follows:
$$\min_{\omega, b, \xi} \; \frac{1}{2}\omega^{T}\omega + \frac{C}{2}\sum_{i=1}^{n}\xi_i^{2}$$
$$\text{s.t.} \; y_i = \omega^{T}\varphi(x_i) + b + \xi_i, \quad i = 1, 2, \ldots, n$$
where $x_i$ is a sample vector, $y_i$ is the corresponding target value, $\omega$ is a vector whose dimension equals the feature dimension of the sample, b is a real number, n is the total number of samples, C is a penalty factor, $\xi_i$ is a slack variable, and $\varphi(\cdot)$ is a non-linear mapping function that maps the sample space to a high-dimensional feature space.
The RBF kernel function is used, as follows:
$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^{2}}{2\sigma^{2}}\right)$$
The final model structure of the LSSVM is then as follows:
$$f(x) = \sum_{i=1}^{n}\alpha_i K(x, x_i) + b$$
The model parameters $[\alpha_1\ \alpha_2\ \ldots\ \alpha_n]$ in the formula can be obtained by solving the LSSVM objective function with the Lagrangian method:
$$L(\omega, b, \xi, \alpha) = \frac{1}{2}\omega^{T}\omega + \frac{C}{2}\sum_{i=1}^{n}\xi_i^{2} - \sum_{i=1}^{n}\alpha_i\left(\omega^{T}\varphi(x_i) + b + \xi_i - y_i\right)$$
where $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_n]$ are the Lagrange multipliers.
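To make the Lagrangian solution concrete, the following sketch solves an LSSVM regression in plain Python rather than with the MATLAB toolbox. Setting the derivatives of the Lagrangian to zero gives the linear system with first row [0, 1, ..., 1] and remaining rows [1, K(x_i, x_1), ..., K(x_i, x_n)] plus 1/C on the diagonal, whose solution is [b, α_1, ..., α_n]. The kernel is assumed to use the 2σ² convention, which may differ from the toolbox's own parameterization, and all function names are hypothetical.

```python
import math


def rbf_kernel(x, z, sigma):
    """Gaussian (RBF) kernel, assuming the exp(-||x - z||^2 / (2 sigma^2)) form."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def solve_linear(a, rhs):
    """Solve a @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def lssvm_fit(xs, ys, c, sigma):
    """Return (b, alphas) by solving the LSSVM optimality conditions."""
    n = len(xs)
    system = [[0.0] + [1.0] * n]  # first row enforces sum(alpha) = 0
    for i in range(n):
        row = [1.0]
        for j in range(n):
            k = rbf_kernel(xs[i], xs[j], sigma)
            row.append(k + (1.0 / c if i == j else 0.0))
        system.append(row)
    solution = solve_linear(system, [0.0] + list(ys))
    return solution[0], solution[1:]


def lssvm_predict(x, xs, b, alphas, sigma):
    """Evaluate f(x) = sum_i alpha_i K(x, x_i) + b."""
    return sum(a * rbf_kernel(x, xi, sigma) for a, xi in zip(alphas, xs)) + b
```

Because the constraints are equalities, training reduces to one linear solve instead of the quadratic program needed for a standard SVM. A large C makes the fit follow the training targets closely, while a small C smooths it.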
In the LSSVMlabv1_8_R2009b_R2011a toolbox, the tunelssvm function takes an LSSVM model with initialized parameters and outputs the optimal penalty factor C and kernel parameter g found by grid search; the initial values of C and g are set to 100 and 0.01. The tunelssvm function is called as: model = tunelssvm(model_ori, optfun, costfun, costfun_args), with its input parameters set to costfun = 'crossvalidatelssvm'; costfun_args = {10, 'mse'}; optfun = 'gridsearch'; model_ori = initlssvm(trnX, trnY, 'function estimation', c, g, 'RBF_kernel'). A regression model is then built with the trainlssvm function, which outputs a model structure; this structure is the main input of the simlssvm function, which outputs the predicted value Y of an unknown sample.
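The idea behind tuning C and g by grid search with a 10-fold cross-validated mean squared error can be sketched generically in Python. This is not the toolbox's implementation: its search strategy and fold assignment may differ, and the names `kfold_mse` and `grid_search` as well as the `fit(xs, ys, c=..., g=...)` and `predict(model, x)` callables are assumptions made for the example.

```python
def kfold_mse(fit, predict, xs, ys, params, k=10):
    """Mean squared error of fit/predict under k-fold cross validation."""
    n = len(xs)
    squared_error = 0.0
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # interleaved folds (an assumption)
        train_idx = [i for i in range(n) if i not in test_idx]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx], **params)
        for i in test_idx:
            squared_error += (predict(model, xs[i]) - ys[i]) ** 2
    return squared_error / n


def grid_search(fit, predict, xs, ys, grid_c, grid_g, k=10):
    """Return (best_mse, best_c, best_g) over the Cartesian grid of candidates."""
    best = None
    for c in grid_c:
        for g in grid_g:
            score = kfold_mse(fit, predict, xs, ys, {"c": c, "g": g}, k)
            if best is None or score < best[0]:
                best = (score, c, g)
    return best
```

Any regression routine that accepts the two hyperparameters can be plugged in as `fit`, for example an LSSVM trainer, so the search itself stays independent of the model.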
S105, acquiring the spectral data of the sample to be detected, determining the category of the sample according to the relation model, and using the regression sub-model corresponding to that category for analysis and prediction to obtain the concentrations of nitrate and nitrite in the sample to be detected.
As shown in fig. 3, an embodiment of the present invention provides an algorithm framework for analyzing the content of a sample to be tested. The spectral data of the sample are acquired, and the sample is classified by a support vector machine (SVM) classification model, a random forest (RF) classification model, and a logistic regression (LR) model to obtain three categories i, j, and k; the category that holds the majority among i, j, and k is taken as the category of the sample to be tested. When the votes of the three base classifiers are inconsistent, the category is determined according to Table 2.
TABLE 2 Category determination when the votes of the three base classifiers are inconsistent
[Table 2 is an image in the original publication and is not reproduced here.]
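The voting step can be sketched in a few lines of Python. The function name `majority_category` is hypothetical. When all three classifiers disagree, the patent resolves the category through Table 2, whose contents are not available here, so this sketch simply returns `None` in that case.

```python
from collections import Counter


def majority_category(votes):
    """Return the category chosen by at least two of the three base classifiers.

    Returns None when all three votes differ; the tie-breaking rules of
    Table 2 are not reproduced in this sketch.
    """
    category, count = Counter(votes).most_common(1)[0]
    return category if count >= 2 else None
```

For example, votes of (1, 1, 3) from the SVM, RF, and LR models would assign the sample to category 1.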
In the embodiment of the invention, after the category l is obtained, the variable subset selected by the stable variable displacement method for the corresponding sub-region is used, a regression model is established with the least squares support vector machine, and the predicted values of the nitrate and nitrite concentrations are finally obtained.
In the embodiment of the invention, leave-one-out cross validation is adopted as the evaluation strategy, and four classical parameters, namely the average relative error (ARE), the maximum relative error (MRE), the root mean square error of prediction (RMSEP), and the coefficient of determination (R²), are used to evaluate the performance of the established model. All the procedures of this example were carried out in MATLAB.
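The four evaluation parameters can be computed as in the sketch below, which uses their usual textbook definitions; the patent does not spell the formulas out, so these definitions and the function name `evaluation_metrics` are assumptions. Relative errors are returned as fractions, not percentages, and every true value is assumed to be non-zero.

```python
import math


def evaluation_metrics(y_true, y_pred):
    """Return ARE, MRE, RMSEP and R2 for paired true and predicted values."""
    n = len(y_true)
    rel_err = [abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)]
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "ARE": sum(rel_err) / n,         # average relative error
        "MRE": max(rel_err),             # maximum relative error
        "RMSEP": math.sqrt(ss_res / n),  # root mean square error of prediction
        "R2": 1.0 - ss_res / ss_tot,     # coefficient of determination
    }
```

Under leave-one-out cross validation, `y_pred[i]` would be the prediction for sample i from a model trained on all the other samples.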
Table 3 compares the prediction results for the concentrations of the mixed solutions obtained with the hybrid machine learning model of the present invention and with a single machine learning model; the single machine learning model first selects the characteristic wavelengths using SVP and then establishes a model using LSSVM.
TABLE 3 Test results using different algorithms
[Table 3 is an image in the original publication and is not reproduced here.]
As can be seen from Table 3, with the prediction method using the hybrid machine learning model of the present invention, the average relative error of nitrate is reduced from 6.25% to 1.64% and its maximum relative error from 39.96% to 5.01%, while the average relative error of nitrite is reduced from 12.37% to 4.58% and its maximum relative error from 79.81% to 9.23%. Figs. 4 and 5 compare the nitrate and nitrite concentration predictions of the single model and the hybrid model provided in the embodiments of the present invention. With single modeling, the average relative error is small (<10%) when the analyte concentration is relatively high, but the prediction error increases greatly when the analyte concentration is below 0.4 mg N L⁻¹. With the hybrid machine learning model provided by the invention, the average relative error is always kept below 5% no matter how the analyte concentration changes within the modeling region, so the performance is more stable.
The embodiment of the invention provides a hybrid machine learning model that combines classification and regression algorithms, which can solve the problem that a single model predicts nitrate and nitrite with unbalanced precision. In addition, a support vector machine, a random forest, and logistic regression are used to build a joint classifier, which optimizes the classification system. The experimental results show that, compared with other direct spectroscopy methods that use a single model, the method significantly reduces the maximum relative error of the predicted nitrate and nitrite concentrations and improves the prediction precision for low-concentration components. It should be understood that the method of the present invention is applicable not only to the mixed solutions of nitrate and nitrite prepared at certain concentration ratios in this embodiment, but also to any water sample, in any concentration range, in which nitrate and nitrite are the main components.
The above embodiments express only several implementations of the present invention; their description is relatively specific and detailed, but it should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of protection of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It should be understood that, although the steps in the flowcharts of the embodiments of the present invention are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the various embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

Claims (10)

1. A method for simultaneously detecting the content of nitrate and nitrite in water based on a hybrid machine learning model is characterized by specifically comprising the following steps:
S1, preparing a series of nitrate and nitrite mixed solution samples with different nitrogen contents, and measuring the spectral data of the samples;
S2, forming a two-dimensional plane from the nitrogen contents of nitrate and nitrite in the samples, obtaining the optimal critical concentration, dividing the two-dimensional plane into four sub-regions, and taking the samples in each sub-region as one type of sample to obtain four types of samples;
S3, establishing a relation model between the nitrogen content of the nitrate and nitrite corresponding to each of the four types of samples and the corresponding spectral data;
S4, taking the samples in each sub-region and on the classification boundaries as modeling samples, screening characteristic wavelengths with high sensitivity and correlation, and establishing a regression sub-model; and
S5, acquiring the spectral data of the sample to be detected, determining the type of the sample to be detected according to the relation model, and using the regression sub-model corresponding to that type for analysis and prediction to obtain the concentrations of nitrate and nitrite in the sample to be detected.
2. The method for simultaneous detection of nitrate and nitrite contents in water based on hybrid machine learning model according to claim 1, wherein the step S3 is specifically as follows:
training on the nitrogen content of the nitrate and nitrite corresponding to each of the four types of samples and on the corresponding spectral data to obtain a support vector machine classification model, a random forest classification model, and a logistic regression model.
3. The method for simultaneously detecting the nitrate and nitrite contents in water based on the hybrid machine learning model as claimed in claim 2, wherein the obtaining the support vector machine classification model specifically comprises:
the objective function of the support vector machine classification model is as follows:
$$\min_{\omega, b, \xi} \; \frac{1}{2}\|\omega\|^{2} + C\sum_{i=1}^{n}\xi_i$$
$$\text{s.t.} \; y_i\left(\omega^{T}x_i + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, n$$
where $x_i$ is a sample vector, $y_i$ is a sample class label, $\omega$ is a vector whose dimension is equal to the feature dimension of the sample, b is a real number, n is the total number of samples, C is a penalty factor, and $\xi_i$ represents a slack variable;
selecting a Gaussian kernel function as the kernel function of the support vector machine, wherein the expression of the Gaussian kernel function is as follows:
$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^{2}}{2\sigma^{2}}\right)$$
where $x_i$ and $x_j$ represent the feature vectors of the samples in the low-dimensional space, and $\sigma$ is the bandwidth of the Gaussian kernel, i.e., the kernel parameter.
4. The method for simultaneously detecting the nitrate and nitrite content in water based on the hybrid machine learning model as claimed in claim 2, wherein the obtaining the random forest classification model specifically comprises:
sampling the samples to obtain bootstrap samples and constructing a CART tree on them, extracting a plurality of features at each node of the CART tree, calculating the Gini index of each feature, and obtaining the classification features with classification capability; the Gini index of a sample set D is calculated as follows:
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K}\left(\frac{|C_k|}{|D|}\right)^{2}$$
where $|C_k|$ is the number of samples in the k-th category and K is the number of categories;
and classifying according to the classification features to obtain a tree structure whose nodes are completely split.
5. The method for simultaneous detection of nitrate and nitrite content in water based on hybrid machine learning model according to claim 2, wherein the obtaining of the logistic regression model specifically comprises:
the logistic regression model is as follows:
$$y = \frac{1}{1 + e^{-\omega^{T}x}}$$
where $\omega$ is the weight, x is the input sample data, and y is the probability that the sample belongs to the positive class of the classifier;
the loss function of the model is:
$$J(\omega) = -\frac{1}{N}\sum_{n=1}^{N}\left[y_n \ln \hat{y}_n + (1 - y_n)\ln\left(1 - \hat{y}_n\right)\right]$$
where $\omega$ is the weight, N is the number of samples, $\hat{y}_n$ is the predicted probability that the n-th sample is positive, and $y_n$ is the sample class label, 0 or 1.
6. The method for simultaneous detection of nitrate and nitrite contents in water based on hybrid machine learning model as claimed in claim 1, wherein in step S4 the characteristic wavelengths are selected using the stable variable displacement method, and establishing the optimal variable subset specifically comprises:
obtaining sub-data sets of the sample space and of the variable space by Monte Carlo sampling, calculating the stability of each variable on the sub-data sets of the sample space, and taking the variables with high stability as elite variables, the stability $S_j$ being calculated as follows:
$$S_j = \frac{\left|\bar{b}_j\right|}{\sqrt{\sum_{i=1}^{M}\left(b_{ij} - \bar{b}_j\right)^{2} / (M - 1)}}$$
where $b_{ij}$ is the regression coefficient of the j-th variable for the i-th sample, $\bar{b}_j$ is the mean value of the regression coefficient of the j-th variable, and M is the total number of samples;
performing variable displacement analysis on the sub-data sets of the variable space, calculating the displacement degree, and taking the variables with a high displacement degree as important variables, the displacement degree $PD_j$ being calculated as $PD_j = PCE_j - SCE_j$, where $PCE_j$ is the mean root mean square error of the models built on the wavelength subsets that do not contain the j-th variable, and $SCE_j$ is the mean root mean square error of the models built on the wavelength subsets that contain the j-th variable;
and combining the elite variables and the important variables, and obtaining the optimal variable subset by cross validation.
7. The method for simultaneously detecting the nitrate and nitrite contents in water based on the hybrid machine learning model as claimed in claim 6, wherein the regression submodel in the step S4 is:
$$f(x) = \sum_{i=1}^{n}\alpha_i K(x, x_i) + b$$
wherein
$$K(x, x_i) = \exp\left(-\frac{\|x - x_i\|^{2}}{2\sigma^{2}}\right)$$
$x_i$ is a sample vector, $\sigma$ is the bandwidth of the Gaussian kernel, i.e., the kernel parameter, and $[b\ \alpha_1\ \alpha_2\ \ldots\ \alpha_n]$ are constants that can be obtained by solving the objective function of the least squares support vector machine with the Lagrangian method.
8. The method for simultaneous detection of nitrate and nitrite contents in water based on hybrid machine learning model as claimed in claim 2, wherein the determination of the type of the sample to be detected according to the relationship model in step S5 is specifically as follows:
classifying the sample with the support vector machine classification model, the random forest classification model, and the logistic regression model respectively to obtain three categories, and selecting the category that holds the majority among the three as the category of the sample to be detected.
9. The method for simultaneous detection of nitrate and nitrite in water based on hybrid machine learning model as claimed in claim 1, wherein the conditions for determining the spectral data in steps S1 and S5 are:
the spectral scanning range is 190-400 nm, and the spectral scanning interval is 1 nm.
10. The method for simultaneously detecting the contents of nitrate and nitrite in water based on the hybrid machine learning model as claimed in claim 1, wherein the optimal critical concentration in step S2 is 0.4 mg N L⁻¹.
CN202110054882.XA 2021-01-15 2021-01-15 Method for simultaneously detecting nitrate and nitrite contents in water based on hybrid machine learning model Active CN112750507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110054882.XA CN112750507B (en) 2021-01-15 2021-01-15 Method for simultaneously detecting nitrate and nitrite contents in water based on hybrid machine learning model

Publications (2)

Publication Number Publication Date
CN112750507A true CN112750507A (en) 2021-05-04
CN112750507B CN112750507B (en) 2023-12-22

Family

ID=75652155

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110054882.XA Active CN112750507B (en) 2021-01-15 2021-01-15 Method for simultaneously detecting nitrate and nitrite contents in water based on hybrid machine learning model

Country Status (1)

Country Link
CN (1) CN112750507B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106153601A (en) * 2016-10-08 2016-11-23 江南大学 A kind of method based on SERS detection grease oxide in trace quantities since
CN107024445A (en) * 2017-04-17 2017-08-08 中国科学院南京土壤研究所 The modeling method and detection method of the quick detection of Nitrate in Vegetable
CN109001080A (en) * 2018-05-18 2018-12-14 内蒙古师范大学 A kind of solubility of lanthanum acylalaninies complex and the research method of Assembling Behavior
CN109187392A (en) * 2018-09-26 2019-01-11 中南大学 A kind of zinc liquid trace metal ion concentration prediction method based on two-zone model
US10229370B1 (en) * 2017-08-29 2019-03-12 Massachusetts Mutual Life Insurance Company System and method for managing routing of customer calls to agents
CN110591075A (en) * 2019-06-28 2019-12-20 四川大学华西医院 PEG-Peptide linear-tree-shaped drug delivery system and preparation method and application thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KATHERINE M. RANSOM et al.: "A hybrid machine learning model to predict and visualize nitrate concentration throughout the Central Valley aquifer, California, USA", pages 1 - 15, Retrieved from the Internet <URL: published online: https://www.sciencedirect.com/science/article/pii/S0048969717313013> *
CHEN Jingjing: "Spectral prediction model of trace pesticides based on machine learning", Journal of Beijing Information Science and Technology University, vol. 35, no. 2, pages 62 - 66 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115901677A (en) * 2022-12-02 2023-04-04 北京理工大学 Method for predicting ammonium nitrate concentration in nitric acid-ammonium nitrate solution with updating mechanism
CN115950854A (en) * 2022-12-02 2023-04-11 北京理工大学 Method for predicting concentration of ammonium nitrate in nitric acid-ammonium nitrate solution
CN115950854B (en) * 2022-12-02 2023-10-13 北京理工大学 Method for predicting concentration of ammonium nitrate in nitric acid-ammonium nitrate solution
CN115901677B (en) * 2022-12-02 2023-12-22 北京理工大学 Method for predicting concentration of ammonium nitrate in nitric acid-ammonium nitrate solution with updating mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant