CN112750507B - Method for simultaneously detecting nitrate and nitrite contents in water based on hybrid machine learning model - Google Patents

Info

Publication number
CN112750507B
Authority
CN
China
Prior art keywords
sample
model
nitrite
nitrate
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110054882.XA
Other languages
Chinese (zh)
Other versions
CN112750507A (en
Inventor
熊莎
吴琼
张航
李勇刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202110054882.XA
Publication of CN112750507A
Application granted
Publication of CN112750507B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/20 Identification of molecular entities, parts thereof or of chemical compositions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/259 Fusion by voting

Abstract

The invention belongs to the field of spectral signal analysis, and in particular relates to a method for simultaneously detecting nitrate and nitrite contents in water based on a hybrid machine learning model. The method comprises the following steps: acquiring spectral data for a series of mixed nitrate and nitrite solution samples with different nitrogen contents; classifying the samples according to the optimal critical concentration to obtain four classes of samples; establishing a relation model between the nitrate and nitrite nitrogen contents of each of the four classes and the corresponding spectral data; screening characteristic wavelengths with high sensitivity and correlation, and establishing regression sub-models; and acquiring spectral data of a sample to be tested, determining its class with the relation model, and predicting with the regression sub-model for that class to obtain the nitrate and nitrite concentrations. The method achieves accurate and rapid detection of nitrate nitrogen and nitrite nitrogen while maintaining detection sensitivity at low concentrations.

Description

Method for simultaneously detecting nitrate and nitrite contents in water based on hybrid machine learning model
Technical Field
The invention belongs to the field of spectrum signal analysis, and particularly relates to a method for simultaneously detecting nitrate and nitrite contents in water based on a hybrid machine learning model.
Background
Many detection technologies for nitrogen-containing compounds are currently available, and they differ greatly in detection principle, calculation method, operating procedure and field of application. The relatively mature instrumental methods for multicomponent concentration analysis, in China and abroad, are mainly electrochemical methods, capillary electrophoresis, ion chromatography, biosensing and spectrophotometry. Electrochemical measurement is not yet adequate for monitoring trace concentrations of the analyte, and because the electrode surface is easily fouled, results on real samples tend to be unstable. Capillary electrophoresis is reliable, but it requires large instruments, is complicated to operate, and is difficult to use for automatic on-site monitoring. Chromatography can determine the concentrations of several ionic components simultaneously and is safe, but the equipment needs frequent maintenance and the analysis is time-consuming and expensive. Biosensor methods still have to solve problems of robustness, selectivity and standardization of operation. Spectroscopic techniques such as ultraviolet-visible, near-infrared and fluorescence spectroscopy are non-destructive, general and flexible; they have the characteristics required for on-line monitoring and are currently an economical, feasible, fast and simple approach. Given the light absorption characteristics of nitrate and nitrite, fast and simple ultraviolet-visible spectrophotometry is selected as the basic detection method.
Conventional spectrophotometric detection of nitrate and nitrite usually proceeds by sequential analysis: nitrite in the sample is first determined by the Griess reagent method; a second, identical sample is then reduced (generally with a copper/cadmium column) so that all nitrate is converted to nitrite, and the nitrite analysis is repeated; the nitrate concentration is calculated from the difference. This is an indirect analysis of nitrate: it is time-consuming and depends heavily on the accuracy of the nitrite measurement. In addition, the Griess method uses toxic chemical reagents that are harmful to health and pollute the environment. Researchers have proposed measuring nitrate and nitrite directly from their ultraviolet absorption spectra. However, the two spectra have similar shapes in the short-wavelength region, and their absorption peaks lie very close together and almost overlap, so in practice it is difficult to separate the contributions of nitrite and nitrate in a measured spectrum. Traditional direct spectrometry still processes the spectral data with conventional chemometric methods and suffers from a narrow application range and low detection precision. In recent years the combination of ultraviolet spectroscopy with machine learning has been applied successfully to the rapid detection of various compounds, but there is still little research on separating nitrate from nitrite. Preliminary experiments show that when a common machine learning model is applied to mixed nitrate and nitrite solutions over a given concentration range, its sensitivity to components at low concentration is insufficient. A machine learning method is therefore needed that maintains the same level of detection precision when the analyte concentration varies widely.
Disclosure of Invention
In view of the above, the invention provides a method for simultaneously detecting nitrate and nitrite contents in water based on a hybrid machine learning model. The method combines classification and regression algorithms so that the detection precision for nitrate and nitrite is balanced over the whole modeling range. It is simple to operate and low in cost, and achieves accurate and rapid detection of nitrate and nitrite in a simple environment.
The invention provides a method for simultaneously detecting nitrate and nitrite contents in water based on a hybrid machine learning model, which specifically comprises the following steps:
S1: preparing a series of mixed nitrate and nitrite solution samples with different nitrogen contents, and measuring the spectral data of the samples;
S2: forming a two-dimensional plane from the nitrate and nitrite nitrogen contents of the samples, obtaining the optimal critical concentration, dividing the plane into four sub-regions, and taking the samples in each sub-region as one class to obtain four classes of samples;
S3: establishing a relation model between the nitrate and nitrite nitrogen contents of each of the four classes of samples and the corresponding spectral data, so that samples can be classified automatically;
S4: taking the samples inside each sub-region and on its classification boundaries as modeling samples, screening characteristic wavelengths with high sensitivity and correlation, and establishing regression sub-models;
S5: acquiring spectral data of a sample to be tested, determining its class with the relation model, and predicting with the regression sub-model for that class to obtain the nitrate and nitrite concentrations in the sample to be tested.
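The classify-then-regress flow of steps S3 to S5 can be pictured with a minimal Python sketch. It is an illustration only, not the patented implementation: the function `hybrid_predict`, the `classify` callable and the `submodels` dictionary are hypothetical names, and the toy lambdas merely stand in for trained models.

```python
def hybrid_predict(spectrum, classify, submodels):
    """Route one spectrum to the regression sub-models of its predicted class.

    classify  -- hypothetical callable: spectrum -> class label (1 to 4)
    submodels -- hypothetical dict: label -> (nitrate_model, nitrite_model),
                 each a callable spectrum -> concentration in mg N/L
    """
    label = classify(spectrum)
    nitrate_model, nitrite_model = submodels[label]
    return label, nitrate_model(spectrum), nitrite_model(spectrum)


# Toy stand-ins so the sketch runs; real models would be trained on spectra.
toy_classify = lambda s: 1 if max(s) < 0.5 else 4
toy_submodels = {
    1: (lambda s: 0.2 * sum(s), lambda s: 0.1 * sum(s)),
    4: (lambda s: 2.0 * sum(s), lambda s: 1.0 * sum(s)),
}

result = hybrid_predict([0.1, 0.2, 0.3], toy_classify, toy_submodels)
print(result)
```

The point of the design is that each sub-model only has to be accurate within its own concentration sub-region, while the classifier decides which sub-model applies.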
Further, step S3 specifically comprises:
training on the nitrate and nitrite nitrogen contents of each of the four classes of samples and the corresponding spectral data to obtain a support vector machine classification model, a random forest classification model and a logistic regression model.
Further, obtaining the support vector machine classification model specifically comprises:
the objective function of the support vector machine classification model is:
$$\min_{\omega, b, \xi} \; \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{l} \xi_i$$
$$\text{s.t.} \quad y_i(\omega^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, l$$
where $x_i$ is a sample vector, $y_i$ is the sample class label, $\omega$ is a vector whose dimension equals the feature dimension of the sample, $b$ is a real number, $l$ is the total number of samples, $C$ is the penalty factor, and $\xi_i$ is a slack variable;
a Gaussian kernel function is selected as the kernel function of the support vector machine, with the expression:
$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$$
where $x_i$, $x_j$ are the feature vectors of samples in the low-dimensional space and $\sigma$ is the bandwidth of the Gaussian kernel, i.e. the kernel parameter.
Further, obtaining the random forest classification model specifically comprises:
resampling the samples to obtain bootstrap samples and constructing CART trees; at each node of a CART tree, extracting several features and calculating the Gini index of each feature to obtain the feature with the best classification ability; the Gini index of a sample set D is calculated as $$Gini(D) = 1 - \sum_{k=1}^{K} \left(\frac{|C_k|}{|D|}\right)^2$$ where $|C_k|$ is the number of samples in the k-th class and K is the number of classes;
splitting according to the selected features to obtain a tree structure whose nodes are completely split.
Further, obtaining the logistic regression model specifically comprises:
the logistic regression model is:
$$y = \frac{1}{1 + e^{-\omega^T x}}$$
where $\omega$ is the weight, $x$ is the input sample data, and $y$ is the probability that the sample belongs to the positive class of the classifier;
the loss function of the model is:
$$L(\omega) = \prod_{n=1}^{N} \hat{y}_n^{\,y_n} \left(1 - \hat{y}_n\right)^{1 - y_n}$$
where $\omega$ is the weight, $N$ is the number of samples, $\hat{y}_n$ is the probability that the n-th sample belongs to the positive class, and $y_n$ is the sample class label, 0 or 1.
Further, in step S4, the characteristic wavelengths are selected with the stability and variable permutation (SVP) method, and the optimal variable subset is established as follows:
sub-datasets of the sample space and of the variable space are obtained by Monte Carlo sampling; the stability of each variable is calculated over the sub-datasets of the sample space, and the variables with high stability are taken as elite variables, the stability $S_j$ being calculated as $$S_j = \frac{|\bar{b}_j|}{\sqrt{\frac{1}{M-1}\sum_{i=1}^{M}\left(b_{ij} - \bar{b}_j\right)^2}}$$ where $b_{ij}$ is the regression coefficient of the j-th variable in the model built on the i-th sampled sub-dataset, $\bar{b}_j$ is the mean regression coefficient of the j-th variable, and $M$ is the total number of sampled sub-datasets;
variable permutation analysis is carried out on the sub-datasets of the variable space, the permutation degree of each variable is calculated, and the variables with a high permutation degree are taken as important variables, the permutation degree $PD_j$ being calculated as $$PD_j = PCE_j - SCE_j$$ where $PCE_j$ is the mean root mean square error of the models built on the wavelength subsets that do not contain variable j, and $SCE_j$ is the mean root mean square error of the models built on the remaining wavelength subsets, which contain variable j;
the elite variables and the important variables are merged, and the optimal variable subset is obtained by cross-validation.
Further, the final model structure in step S4 is:
$$y(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i) + b$$
where $K(x, x_i) = \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right)$, $x_i$ is a sample vector, $\sigma$ is the bandwidth of the Gaussian kernel, i.e. the kernel parameter, and $b, \alpha_1, \alpha_2, \ldots, \alpha_n$ are constants obtained by solving the least squares support vector machine objective function with the Lagrangian method.
Further, in step S5, determining the class of the sample to be tested with the relation model specifically comprises:
classifying the sample with the support vector machine classification model, the random forest classification model and the logistic regression model to obtain three class predictions, and selecting the class predicted by the majority as the class of the sample to be tested.
Further, the conditions for measuring the spectral data in steps S1 and S5 are:
the spectral scanning range is 190-400 nm, and the scanning interval is 1 nm.
Further, the optimal critical concentration in step S2 is 0.4 mg N L$^{-1}$.
Beneficial effects:
According to the invention, a series of mixed nitrate and nitrite solutions is prepared in advance and their spectral data are measured, and the hybrid machine learning model is built from these data using classification and regression algorithms. Once the model is built, the nitrate and nitrite contents of a sample to be tested can be detected accurately and rapidly simply by measuring its spectral data. The detection precision for nitrate and nitrite is balanced over the whole modeling range, the prediction precision for low-concentration components is improved, and the method is simple to operate and low in cost.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a method for simultaneously detecting nitrate and nitrite contents in water based on a hybrid machine learning model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of sample classification according to an embodiment of the present invention;
FIG. 3 is an algorithm frame diagram of analysis of the content of a sample to be measured according to an embodiment of the present invention;
FIG. 4 is a graph showing the comparison of the effect of predicting nitrate concentration in a single model and a mixed model according to an embodiment of the present invention;
FIG. 5 is a graph showing the effect of predicting nitrite concentration in a single model and a mixed model according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention is based on the finding that ultraviolet spectroscopy combined with machine learning can be used to detect nitrate and nitrite simultaneously and rapidly, but that a common machine learning model applied to mixed nitrate and nitrite solutions over a given concentration range is not sensitive enough to components at low concentration. A machine learning method is therefore needed that maintains the same level of detection precision when the analyte concentration varies widely.
As shown in FIG. 1, in one embodiment, a flowchart of a method for simultaneously detecting nitrate and nitrite contents in water based on a hybrid machine learning model is provided, specifically comprising the steps of:
Step S101: a series of mixed nitrate and nitrite solution samples with different nitrogen contents is prepared, and the spectral data of the samples are measured.
In the embodiment of the invention, standard stock solutions of nitrate nitrogen and nitrite nitrogen are prepared first: 0.7221 g of dried potassium nitrate or 0.4928 g of dried sodium nitrite is weighed, dissolved in an appropriate amount of fresh deionized water, transferred to a 1000 mL volumetric flask, diluted to the mark with deionized water, and mixed well for later use. Before use, the stock solution is diluted to a standard working solution of 10 mg N L$^{-1}$. All reagents were of analytical grade (Sinopharm Chemical Reagent Co., Ltd., China). Nitrite nitrogen concentrations of 0.1, 0.2, 0.3, 0.4, 0.8, 1.2, 1.6, 2.0, 2.5 and 3.0 mg N L$^{-1}$ were combined with nitrate nitrogen concentrations of 0.1, 0.2, 0.3, 0.4, 0.8, 1.2, 1.6, 2.0, 2.5 and 3.0 mg N L$^{-1}$ to give a total of 100 mixed samples. Background subtraction was performed with deionized water as the reference solution, and spectral data were measured at each wavelength point at 1 nm intervals over the 190-400 nm wavelength range.
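As a quick arithmetic check of the preparation above, the following Python sketch computes the nitrogen concentration of each stock solution and enumerates the sample design. The atomic weights are standard textbook values that the text does not state, and reading the 100 samples as a full 10 x 10 combination of the listed levels is an assumption; the names `stock_mg_n_per_l`, `LEVELS` and `design` are illustrative only.

```python
from itertools import product

# Standard atomic weights in g/mol (textbook values, not given in the text).
N, O, K, NA = 14.007, 15.999, 39.098, 22.990
M_KNO3 = K + N + 3 * O    # potassium nitrate
M_NANO2 = NA + N + 2 * O  # sodium nitrite


def stock_mg_n_per_l(mass_g, molar_mass, volume_l=1.0):
    """Nitrogen concentration (mg N/L) of a salt dissolved to volume_l litres."""
    return mass_g / molar_mass * N * 1000.0 / volume_l


nitrate_stock = stock_mg_n_per_l(0.7221, M_KNO3)
nitrite_stock = stock_mg_n_per_l(0.4928, M_NANO2)

# Concentration levels in mg N/L, used for both nitrate and nitrite.
LEVELS = [0.1, 0.2, 0.3, 0.4, 0.8, 1.2, 1.6, 2.0, 2.5, 3.0]

# Assumed full factorial design: every nitrate level with every nitrite level.
design = list(product(LEVELS, LEVELS))

# 190-400 nm in 1 nm steps: one absorbance value per wavelength point.
wavelengths = list(range(190, 401))

print(f"stocks: {nitrate_stock:.1f} and {nitrite_stock:.1f} mg N/L")
print(len(design), "mixtures,", len(wavelengths), "wavelength points")
```

Both weighed masses correspond to a stock of about 100 mg N per litre, so the 10 mg N L$^{-1}$ working solution is a tenfold dilution, and each spectrum contains 211 wavelength points.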
Step S102: a two-dimensional plane is formed from the nitrate and nitrite nitrogen contents of the samples, the optimal critical concentration is obtained, the plane is divided into four sub-regions, and the samples in each sub-region are taken as one class, giving four classes of samples.
As shown in FIG. 2, the embodiment of the invention provides a schematic diagram of sample classification, in which the nitrate-nitrite concentration plane is divided into four sub-regions that are modeled separately. Because analyte prediction sensitivity is insufficient at low concentrations, the critical concentration used to divide the sub-regions is chosen at a low level. Critical concentrations of 0.3, 0.4 and 0.8 mg N L$^{-1}$ were each used for modeling analysis, and the results are shown in Table 1. With a critical concentration of 0.4 mg N L$^{-1}$, the overall model has higher classification accuracy and lower average relative error. The sub-regions differ in their nitrate and nitrite contents: in zone 1 both contents are low; in zone 2 the nitrate content is much higher than the nitrite content; in zone 3 the nitrate content is much lower than the nitrite content; in zone 4 both contents are high. Compared with a single full-range model, each sub-model is better suited to the sample characteristics of its sub-region and predicts with higher precision.
Table 1. Comparison of model performance at different critical concentrations (mg N L$^{-1}$)
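The four-zone division can be sketched in a few lines of Python. This is a minimal illustration, assuming that a concentration exactly equal to the critical value counts as low; the text does not say which side the boundary belongs to, and the function name `zone` is hypothetical.

```python
from collections import Counter
from itertools import product


def zone(nitrate, nitrite, critical=0.4):
    """Return the zone (1 to 4) of a sample given in mg N/L.

    Assumption: a value equal to the critical concentration counts as low.
    """
    high_nitrate = nitrate > critical
    high_nitrite = nitrite > critical
    if not high_nitrate and not high_nitrite:
        return 1  # both low
    if high_nitrate and not high_nitrite:
        return 2  # nitrate much higher than nitrite
    if not high_nitrate and high_nitrite:
        return 3  # nitrate much lower than nitrite
    return 4      # both high


LEVELS = [0.1, 0.2, 0.3, 0.4, 0.8, 1.2, 1.6, 2.0, 2.5, 3.0]
counts = Counter(zone(no3, no2) for no3, no2 in product(LEVELS, LEVELS))
print(dict(sorted(counts.items())))
```

Under this boundary convention, and assuming all 100 combinations of the listed levels were prepared, the zones contain 16, 24, 24 and 36 samples respectively.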
Step S103: a relation model between the nitrate and nitrite nitrogen contents of each of the four classes of samples and the corresponding spectral data is established, so that samples can be classified automatically.
In the embodiment of the invention, the nitrate and nitrite nitrogen contents of each of the four classes of samples and the corresponding spectral data are used for training to obtain a support vector machine classification model, a random forest classification model and a logistic regression model.
In the embodiment of the invention, the support vector machine classification model is trained with the LIBSVM-farutoUltimateVersion MATLAB toolbox, and its objective function is:
$$\min_{\omega, b, \xi} \; \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{l} \xi_i$$
$$\text{s.t.} \quad y_i(\omega^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, l$$ (1)
where $x_i$ is a sample vector, $y_i$ is the sample class label, $\omega$ is a vector whose dimension equals the feature dimension of the sample, $b$ is a real number, $l$ is the total number of samples, $C$ is the penalty factor, and $\xi_i$ is a slack variable.
A Gaussian kernel function is selected as the kernel function of the support vector machine, with the expression:
$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$$
where $x_i$, $x_j$ are the feature vectors of samples in the low-dimensional space and $\sigma$ is the bandwidth of the Gaussian kernel.
When modeling with the SVM, the absorbance data are first normalized, mapping them to the range 0-1 to speed up convergence during training. Principal component analysis (PCA) is then used to reduce the dimensionality of the input data, and particle swarm optimization (PSO) is used to optimize the two hyperparameters, the penalty factor C and the kernel parameter σ.
The SVC function of the LIBSVM-farutoUltimateVersion toolbox integrates these functions and is called as follows: [prediction_label, accuracy, bestc, bestg] = svc(trace_label, trace_data, test_label, test_data, method_option), where method_option is a structure set to method_option.scale = 1, method_option.pca = 0 and method_option.type = 2. This builds the SVM classification model, returns the predicted sample classes prediction_label, and outputs the optimal penalty factor C and kernel parameter g.
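To make the preprocessing and kernel concrete, here is a small Python sketch of the 0-1 normalization and of a Gaussian kernel. It is not the toolbox code: the helper names are hypothetical, the kernel uses the common form exp(-||x - z||^2 / (2 sigma^2)), which is an assumption about the convention intended, and PCA and PSO are left out for brevity.

```python
import math


def minmax_fit(rows):
    """Per-wavelength minimum and maximum over the training spectra."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def minmax_apply(row, lo, hi):
    """Map one spectrum to the range 0-1 using the training minima and maxima."""
    return [(v - a) / (b - a) if b > a else 0.0 for v, a, b in zip(row, lo, hi)]


def gaussian_kernel(x, z, sigma):
    """Assumed convention: K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


train = [[0.10, 0.50], [0.30, 0.90], [0.20, 0.70]]  # toy absorbance rows
lo, hi = minmax_fit(train)
scaled = [minmax_apply(row, lo, hi) for row in train]
print(scaled)
print(gaussian_kernel(scaled[0], scaled[1], sigma=1.0))
```

Fitting the minima and maxima on the training spectra only, and reusing them for test spectra, keeps the test data from influencing the scaling.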
In the embodiment of the invention, the random forest classification model is trained with the RF_MexStandalone-v0.02 MATLAB toolbox. First, the bootstrap method is used to draw k new bootstrap sample sets from the original training samples at random with replacement, and k CART trees are constructed; the samples not drawn each time form k sets of out-of-bag data. Assuming there are n features, m features are drawn at random at each node of each tree, the Gini index of each feature is calculated, and the feature with the best classification ability is selected to split the node. For a given sample set D with K classes, where $|C_k|$ is the number of samples in the k-th class, the Gini index of D is calculated as:
$$Gini(D) = 1 - \sum_{k=1}^{K} \left(\frac{|C_k|}{|D|}\right)^2$$
If the selected attribute is A, the Gini index of the data set D after the split is calculated as:
$$Gini(D, A) = \sum_{j=1}^{K} \frac{|D_j|}{|D|} \, Gini(D_j)$$
where K here denotes the number of parts into which D is divided, i.e. the data set D is split into K data sets $D_j$.
Each node is split completely to form the tree structure, so that every CART tree grows to its maximum extent. Finally, each generated tree votes on the sample class, and the final class of an unknown sample is decided by majority rule.
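The two Gini quantities used for node splitting can be checked numerically with a short Python sketch. The function names `gini` and `gini_split` are illustrative and do not come from the toolbox.

```python
from collections import Counter


def gini(labels):
    """Gini index of a label list: 1 - sum over classes of (|C_k| / |D|)^2."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def gini_split(parts):
    """Size-weighted Gini index after splitting a data set into the given parts."""
    total = sum(len(part) for part in parts)
    return sum(len(part) / total * gini(part) for part in parts if part)


print(gini([1, 1, 2, 2]))              # evenly mixed node
print(gini_split([[1, 1], [2, 2]]))    # split that separates the classes
print(gini_split([[1, 2], [1, 2]]))    # split that separates nothing
```

A split is better the lower its weighted Gini index, so a candidate feature that separates the classes cleanly is preferred over one that leaves each part as mixed as the parent node.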
In the embodiment of the invention, logistic regression is implemented with a program written in MATLAB. The output of a linear regression model is used as the input of the sigmoid function, which gives the logistic regression model:
$$y = \frac{1}{1 + e^{-\omega^T x}}$$
where $\omega$ is the weight, $x$ is the input sample data, and $y$ is the probability that the sample belongs to the positive class of the classifier.
The loss function measures the difference between the model output and the true output. In logistic regression, its value equals the joint probability that the samples belong to their labeled classes:
$$L(\omega) = \prod_{n=1}^{N} \hat{y}_n^{\,y_n} \left(1 - \hat{y}_n\right)^{1 - y_n}$$
where $\omega$ is the weight, $N$ is the number of samples, $\hat{y}_n$ is the probability that the n-th sample belongs to the positive class, and $y_n$ is the sample class label, 0 or 1.
According to the maximum likelihood principle, the optimal $\omega$ is the one that maximizes this function. A stochastic gradient method is used: an initial value of $\omega$ is generated at random, and $\omega$ is then updated iteratively until the optimum is reached, each iteration replacing the current value of $\omega$ with a new value.
The $\omega$ obtained is substituted into the logistic regression model to calculate the class probability score of each sample, and the class with the highest score is taken as the final class of the sample. The method also uses the one-vs-all strategy to extend logistic regression to multi-class classification: assuming the data has N classes, one binary classifier is built with logistic regression for each of the N classes. For classifier i, the samples with label == i are set as the positive class and the remaining samples as the negative class, and so on. For a sample to be predicted, the probabilities p with which all classifiers assign it to their positive class are obtained, and the class corresponding to the largest probability in p is taken as the final predicted class.
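The one-vs-all scheme described above can be sketched in plain Python. This is an assumption-laden illustration rather than the MATLAB program of the embodiment: it uses batch gradient ascent on the log-likelihood with zero initial weights (the text describes a stochastic method with random initialization), adds a bias term, and the learning rate, epoch count, toy data and function names are all hypothetical.

```python
import math


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))


def score(w, x):
    """Positive-class probability for features x; w[0] is the bias weight."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


def train_binary(X, y, lr=1.0, epochs=8000):
    """Batch gradient ascent on the mean log-likelihood of a binary classifier."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, target in zip(X, y):
            err = target - score(w, x)
            grad[0] += err
            for j, xj in enumerate(x, start=1):
                grad[j] += err * xj
        w = [wi + lr * g / len(X) for wi, g in zip(w, grad)]
    return w


def train_one_vs_all(X, labels):
    """One binary classifier per class: that class positive, the rest negative."""
    return {c: train_binary(X, [1 if l == c else 0 for l in labels])
            for c in sorted(set(labels))}


def predict(models, x):
    """Class whose classifier gives the highest positive-class probability."""
    return max(models, key=lambda c: score(models[c], x))


# Toy two-feature data with four well separated classes (hypothetical values).
X = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.1), (0.8, 0.2),
     (0.1, 0.9), (0.2, 0.8), (0.9, 0.9), (0.8, 0.8)]
labels = [1, 1, 2, 2, 3, 3, 4, 4]
models = train_one_vs_all(X, labels)
print([predict(models, x) for x in X])
```

Maximizing the log-likelihood gives the same optimum as maximizing the likelihood itself, and it avoids multiplying many small probabilities together.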
The classification models built with the support vector machine, the random forest and logistic regression then vote on the sample class, and the class that receives the majority of the votes (at least 2) is taken as the final sample class.
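A minimal Python sketch of this vote is given below. The text does not say what happens when all three classifiers disagree, which is possible with four classes; falling back to the support vector machine's label in that case is purely an assumption of this sketch, and the function name is hypothetical.

```python
from collections import Counter


def majority_vote(svm_label, rf_label, lr_label):
    """Return the class predicted by at least two of the three classifiers.

    Assumption: if all three disagree, the SVM label is used as a fallback.
    """
    label, votes = Counter([svm_label, rf_label, lr_label]).most_common(1)[0]
    return label if votes >= 2 else svm_label


print(majority_vote(2, 2, 3))
print(majority_vote(1, 4, 4))
print(majority_vote(1, 2, 3))
```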
Step S104: the samples inside each sub-region and on its classification boundaries are taken as modeling samples, characteristic wavelengths with high sensitivity and correlation are screened, and regression sub-models are established.
In the embodiment of the invention, because the classifier is more likely to make errors near the region boundaries, each sub-model also includes the samples lying on the boundaries, so that a classification error does not lead to a large prediction error. The stability and variable permutation (SVP) method is used to select the characteristic wavelengths and establish the optimal variable subset, and a least squares support vector machine is used to build the regression sub-models. SVP is based on the evolutionary principle of intraspecies competition and survival of the fittest: variables are evaluated by considering their stability and permutation degree together with statistics on model performance, and the variable subset with the smallest mean RMSE and a relatively low standard deviation is taken as the optimal subset. For each sub-region, SVP selects separate variable subsets for nitrite and for nitrate. A model built with such a specific variable subset can adapt to the characteristics of the target ion and therefore performs better. The least squares support vector machine models are built in MATLAB with the LSSVMlabv1_8_R2009b_R2011a toolbox using an RBF kernel function; the optimal regularization parameter and kernel parameter are found by grid search, giving the regression sub-model of each sub-region.
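One possible reading of the boundary rule at the start of this step is sketched below in Python: a sample whose nitrate or nitrite concentration equals the critical value is included in the modeling set of every zone that touches that boundary. This interpretation, the 0.4 mg N/L default and the function name `modeling_samples` are assumptions made for illustration.

```python
from itertools import product


def modeling_samples(samples, zone, critical=0.4):
    """Samples used to fit the sub-model of one zone, boundary samples included."""
    def low(value):
        return value <= critical

    def high(value):
        return value >= critical

    # (test on nitrate, test on nitrite) for each zone
    tests = {1: (low, low), 2: (high, low), 3: (low, high), 4: (high, high)}
    on_nitrate, on_nitrite = tests[zone]
    return [(no3, no2) for no3, no2 in samples
            if on_nitrate(no3) and on_nitrite(no2)]


LEVELS = [0.1, 0.2, 0.3, 0.4, 0.8, 1.2, 1.6, 2.0, 2.5, 3.0]
samples = list(product(LEVELS, LEVELS))
sizes = {z: len(modeling_samples(samples, z)) for z in (1, 2, 3, 4)}
print(sizes)
```

Under this reading the four modeling sets overlap along the boundaries, so their sizes add up to more than the number of distinct samples.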
In the embodiment of the invention, a stable variable displacement method (SVP) is used for respectively establishing a model for nitrate and nitrite components in each subarea to select an optimal characteristic wavelength subset; first, a sub-data set of a sample space and a variable space is obtained by Monte Carlo sampling, and each sub-data set of the sample space is calculatedAnd (3) the stability of the variables is ordered, the variables with high stability are used as elite variables, and the rest are normal variables. Stability S j The calculation formula is as follows:
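$$S_j=\frac{\left|\bar{b}_j\right|}{\sqrt{\dfrac{1}{M-1}\sum_{i=1}^{M}\left(b_{ij}-\bar{b}_j\right)^{2}}}$$
where $b_{ij}$ is the regression coefficient of the j-th variable for the i-th sample, $\bar{b}_j$ is the mean regression coefficient of the j-th variable, and $M$ is the total number of samples.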
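As a rough illustration of the stability ranking described above, the sketch below computes a stability score for each variable from a matrix of regression coefficients, one row per Monte Carlo sub-data set. It assumes the common form of the criterion, the absolute mean of the coefficients divided by their sample standard deviation. The function name `stability` and the toy coefficients are hypothetical and are not taken from the embodiment.

```python
import statistics

def stability(coefs):
    """Return one stability score per variable.

    coefs[i][j] is the regression coefficient of variable j fitted on the
    i-th Monte Carlo sub-data set (M rows in total).
    Assumed form: S_j = |mean_j| / sample standard deviation_j.
    """
    scores = []
    for column in zip(*coefs):  # iterate over variables
        mean_b = statistics.mean(column)
        std_b = statistics.stdev(column)  # uses M - 1 in the denominator
        # A coefficient that never varies is treated as perfectly stable.
        scores.append(float("inf") if std_b == 0 else abs(mean_b) / std_b)
    return scores

# Hypothetical coefficients from M = 4 sub-data sets and 2 variables.
toy_coefs = [[2.0, 1.0], [2.2, -1.0], [1.8, 1.0], [2.0, -1.0]]
scores = stability(toy_coefs)
# Rank variables from most to least stable; the top ones would be the elite variables.
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

In this toy case the first variable keeps a consistent sign and magnitude across sub-data sets and ranks first, while the second flips sign and scores zero.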
Then, variable displacement analysis is carried out in the sub-data sets of the variable space: the degree of displacement of each variable is calculated, the variables are ranked, and those with a high degree of displacement are taken as important variables. The degree of displacement $PD_j$ is calculated as follows:
$$PD_j = PCE_j - SCE_j \qquad (9)$$
where $PCE_j$ is the mean root mean square error of the models built on each of the wavelength subsets that do not contain variable j, and $SCE_j$ is the mean root mean square error of the models built on each of the remaining wavelength subsets, which contain variable j.
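The degree of displacement in equation (9) can be sketched in a few lines, assuming the RMSE of a model built on each random wavelength subset is already available. The function name `displacement_degree` and the toy subsets and errors are hypothetical.

```python
from statistics import mean

def displacement_degree(subsets, rmses, n_variables):
    """PD_j = PCE_j - SCE_j for every variable j.

    subsets[k] is the set of variable indices used by the k-th model and
    rmses[k] is that model's root mean square error.
    """
    degrees = []
    for j in range(n_variables):
        without_j = [r for s, r in zip(subsets, rmses) if j not in s]
        with_j = [r for s, r in zip(subsets, rmses) if j in s]
        if not without_j or not with_j:
            degrees.append(None)  # cannot be estimated from these subsets
            continue
        pce = mean(without_j)  # mean RMSE of models lacking variable j
        sce = mean(with_j)     # mean RMSE of models containing variable j
        degrees.append(pce - sce)
    return degrees

# Hypothetical example: four models built on subsets of three variables.
toy_subsets = [{0, 1}, {0, 2}, {1, 2}, {2}]
toy_rmses = [0.10, 0.12, 0.30, 0.34]
pd_values = displacement_degree(toy_subsets, toy_rmses, n_variables=3)
```

A positive value means that models containing the variable had a lower error on average than models without it, so variable 0 would be treated as important in this toy case.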
The elite variables and the important variables are merged into a new variable subset, and the process is repeated. N iterations yield N variable subsets, and the subset with the smallest mean root mean square error and a relatively low standard deviation is finally selected as the optimal subset by cross-validation.
Four least squares support vector machine (LSSVM) regression sub-models were trained using the LSSVMlabv1_8_R2009b_R2011a toolbox. The LSSVM is an SVM whose loss function is quadratic; its objective function is as follows:
$$\min_{\omega,b,\xi}\ \frac{1}{2}\omega^{T}\omega+\frac{C}{2}\sum_{i=1}^{n}\xi_i^{2}\qquad \text{s.t.}\quad y_i=\omega^{T}\varphi(x_i)+b+\xi_i,\quad i=1,2,\dots,n$$
where $x_i$ is a sample vector, $y_i$ is the sample label (for regression, the target value), $\omega$ is a vector whose dimension equals the feature dimension of the sample, $b$ is a real number, $n$ is the total number of samples, $C$ is the penalty factor, $\xi_i$ is a slack variable, and $\varphi(\cdot)$ is a nonlinear mapping function that maps the sample space to a high-dimensional feature space.
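The RBF kernel function is used:
$$K(x,x_i)=\exp\left(-\frac{\|x-x_i\|^{2}}{2\sigma^{2}}\right)$$
The final LSSVM model structure is then:
$$y(x)=\sum_{i=1}^{n}\alpha_i K(x,x_i)+b$$
The model parameters $[b\ \alpha_1\ \alpha_2\ \dots\ \alpha_n]$ can be obtained by solving the LSSVM objective function with the Lagrangian method, where $\alpha=[\alpha_1,\alpha_2,\dots,\alpha_n]$ are the Lagrange multipliers.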
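For readers who want the intermediate step, the Lagrangian solution can be sketched as follows. The derivation assumes the usual LSSVM regression formulation, that is, the objective $\frac{1}{2}\omega^{T}\omega+\frac{C}{2}\sum_i\xi_i^{2}$ with equality constraints $y_i=\omega^{T}\varphi(x_i)+b+\xi_i$; it is the textbook derivation under that assumption rather than a formula reproduced from the embodiment.

```latex
\begin{aligned}
\mathcal{L}(\omega,b,\xi,\alpha) &= \tfrac{1}{2}\omega^{T}\omega
  + \tfrac{C}{2}\sum_{i=1}^{n}\xi_i^{2}
  - \sum_{i=1}^{n}\alpha_i\left(\omega^{T}\varphi(x_i)+b+\xi_i-y_i\right) \\
\frac{\partial\mathcal{L}}{\partial\omega}=0 &\;\Rightarrow\; \omega=\sum_{i=1}^{n}\alpha_i\varphi(x_i) \\
\frac{\partial\mathcal{L}}{\partial b}=0 &\;\Rightarrow\; \sum_{i=1}^{n}\alpha_i=0 \\
\frac{\partial\mathcal{L}}{\partial\xi_i}=0 &\;\Rightarrow\; \alpha_i=C\,\xi_i \\
\frac{\partial\mathcal{L}}{\partial\alpha_i}=0 &\;\Rightarrow\; \omega^{T}\varphi(x_i)+b+\xi_i=y_i
\end{aligned}
\qquad\Longrightarrow\qquad
\begin{bmatrix} 0 & \mathbf{1}^{T} \\ \mathbf{1} & K+\tfrac{1}{C}I \end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ y \end{bmatrix},
\qquad K_{ij}=K(x_i,x_j)
```

Eliminating $\omega$ and $\xi$ leaves a single linear system in $b$ and $\alpha$, which is why training an LSSVM reduces to a linear solve instead of a quadratic program.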
In the LSSVMlabv1_8_R2009b_R2011a toolbox, an LSSVM model can be built by first calling the tunelssvm function on an initialized model; it outputs the optimal penalty factor C and kernel parameter g found by grid search, with the initial values of C and g set to 100 and 0.01. The tunelssvm call is model = tunelssvm(model_ori, optfun, costfun, costfun_args), with its inputs set to costfun = 'crossvalidatelssvm'; costfun_args = {10, 'mse'}; optfun = 'gridsearch'; model_ori = initlssvm(trnX, trnY, 'function estimation', c, g, 'RBF_kernel'). The regression model is then built with the trainlssvm function, and the resulting model structure is passed as the main input to the simlssvm function, which outputs the predicted value Y for an unknown sample.
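The tune, train and simulate workflow described above can be imitated outside MATLAB. The following is a minimal pure-Python sketch, not the toolbox's implementation: it assumes the standard LSSVM linear system, a Gaussian kernel written as exp(-||a - b||^2 / (2 sigma^2)) (the toolbox's own kernel parameterisation may differ), and 10-fold cross-validated mean squared error as the grid-search cost. All function names and the toy data are hypothetical.

```python
import math
import random
from itertools import product

def rbf(a, b, sigma):
    """Gaussian kernel exp(-||a - b||^2 / (2 sigma^2))."""
    return math.exp(-sum((p - q) ** 2 for p, q in zip(a, b)) / (2 * sigma ** 2))

def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def fit(X, y, C, sigma):
    """Solve [[0, 1^T], [1, K + I/C]] [b; alpha] = [0; y]."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        A.append([1.0] + [rbf(X[i], X[j], sigma) + (1.0 / C if i == j else 0.0)
                          for j in range(n)])
    z = solve(A, [0.0] + list(y))
    return z[0], z[1:]  # bias b, multipliers alpha

def predict(x, X, b, alpha, sigma):
    """Evaluate y(x) = sum_i alpha_i K(x, x_i) + b."""
    return b + sum(a * rbf(x, xi, sigma) for a, xi in zip(alpha, X))

def cv_mse(X, y, C, sigma, k=10, seed=0):
    """k-fold cross-validated mean squared error for one (C, sigma) pair."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    sq_err = 0.0
    for f in range(k):
        test = idx[f::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        b, alpha = fit(Xtr, ytr, C, sigma)
        sq_err += sum((predict(X[i], Xtr, b, alpha, sigma) - y[i]) ** 2 for i in test)
    return sq_err / len(X)

def grid_search(X, y, Cs, sigmas, k=10):
    """Return the (C, sigma) pair with the lowest cross-validated MSE."""
    return min(product(Cs, sigmas), key=lambda p: cv_mse(X, y, p[0], p[1], k))

# Toy 1-D data (hypothetical, not spectra): y = sin(x).
X = [[0.2 * i] for i in range(30)]
y = [math.sin(x[0]) for x in X]
best_C, best_sigma = grid_search(X, y, Cs=[1, 100, 10000], sigmas=[0.1, 1.0, 3.0])
b, alpha = fit(X, y, best_C, best_sigma)
```

A prediction for a new input would then be obtained with `predict([1.1], X, b, alpha, best_sigma)`, which plays the role that simlssvm plays in the toolbox workflow.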
Step S105: spectral data of the sample to be tested are obtained, the category of the sample is determined according to the relation model, and the regression sub-model corresponding to that category is used for analysis and prediction to obtain the concentrations of nitrate and nitrite in the sample.
As shown in Fig. 3, the embodiment of the invention provides an algorithm framework for analyzing the content of a sample to be tested. The spectral data of the sample are obtained and then classified by the support vector machine (SVM) classification model, the random forest (RF) classification model and the logistic regression (LR) model, giving three categories i, j and k; the category held by the majority of the three is selected as the category l of the sample. When the three base classifiers vote inconsistently, the category is determined according to Table 2.
Table 2 class determination for inconsistent voting by three base classifiers
In the embodiment of the invention, once the category l is obtained, the variable subset selected by the stable variable displacement method for the corresponding region is used, the least squares support vector machine regression model of that region is applied, and the predicted concentrations of nitrate and nitrite are finally obtained.
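The classify-then-regress flow of Fig. 3 can be sketched as follows. The base classifiers and sub-models are represented by plain callables, and because the contents of Table 2 are not reproduced here, the rule for inconsistent votes is left to a caller-supplied `tie_breaker`. Every name in the sketch is hypothetical.

```python
from collections import Counter

def vote(labels, tie_breaker=None):
    """Return the majority label, or defer to tie_breaker when no majority exists."""
    label, count = Counter(labels).most_common(1)[0]
    if count > len(labels) / 2:
        return label
    if tie_breaker is None:
        raise ValueError("no majority among base classifiers")
    return tie_breaker(labels)

def predict_concentrations(spectrum, classifiers, sub_models, tie_breaker=None):
    """Classify the spectrum by voting, then apply that class's regression sub-model."""
    labels = [clf(spectrum) for clf in classifiers]
    region = vote(labels, tie_breaker)
    return region, sub_models[region](spectrum)

# Hypothetical stand-ins for the SVM, RF and LR classifiers and the four sub-models.
classifiers = [lambda s: 2, lambda s: 2, lambda s: 3]
sub_models = {r: (lambda s, r=r: (0.1 * r, 0.01 * r)) for r in (1, 2, 3, 4)}
region, (nitrate, nitrite) = predict_concentrations([0.5, 0.4], classifiers, sub_models)
```

Here two of the three stand-in classifiers agree on region 2, so the region-2 sub-model supplies the nitrate and nitrite estimates.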
In the embodiment of the invention, leave-one-out cross-validation is used as the evaluation strategy, and four classical metrics, namely the average relative error (ARE), the maximum relative error (MRE), the root mean square error of prediction (RMSEP) and the coefficient of determination (R²), are used to evaluate the performance of the established model. The procedure in this example was carried out in MATLAB.
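The four metrics can be computed as in the sketch below, which assumes their usual definitions: relative errors are taken against the reference concentration and expressed in percent, and R² is one minus the ratio of residual to total sum of squares. The function name `evaluate` and the example numbers are hypothetical.

```python
import math

def evaluate(y_true, y_pred):
    """Return ARE and MRE (percent), RMSEP and R2 for paired reference/predicted values."""
    n = len(y_true)
    rel = [abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)]
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return {
        "ARE_percent": 100 * sum(rel) / n,   # average relative error
        "MRE_percent": 100 * max(rel),       # maximum relative error
        "RMSEP": math.sqrt(ss_res / n),      # root mean square error of prediction
        "R2": 1 - ss_res / ss_tot,           # coefficient of determination
    }

# Hypothetical reference and predicted concentrations.
metrics = evaluate([1.0, 2.0, 4.0], [1.1, 1.9, 4.0])
```

Because the relative errors divide by the reference value, they grow quickly at low concentrations, which is the regime the hybrid model is meant to improve.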
As shown in Table 3, the concentration predictions of the hybrid machine learning model of the invention for the mixed solutions are compared with those of a single machine learning model; the single model first uses SVP to select the characteristic wavelengths and then uses LSSVM for modeling.
TABLE 3 detection results Using different algorithms
As can be seen from Table 3, with the prediction method of the hybrid machine learning model of the invention, the average relative error for nitrate falls from 6.25% to 1.64% and the maximum relative error from 39.96% to 5.01%; for nitrite, the average relative error falls from 12.37% to 4.58% and the maximum relative error from 79.81% to 9.23%. Figs. 4 and 5 compare the single model and the hybrid model of the embodiment in predicting the nitrate and nitrite concentrations. Although the average relative error of the single model is small (< 10%) when the analyte concentration is relatively high, the prediction error increases sharply when the analyte concentration is below 0.4 mg N L⁻¹. With the hybrid model, the average relative error stays below 5% however the analyte concentration varies within the modeling region, so the performance is more stable.
The embodiment of the invention provides a hybrid machine learning model combining classification and regression algorithms, which can solve the problem of unbalanced accuracy of nitrate and nitrite prediction by a single model. In addition, a support vector machine, a random forest and logistic regression are used for establishing a joint classifier to optimize the classification system. Experimental results show that compared with other direct spectrometry using a single model, the method remarkably reduces the maximum relative error of predicting the nitrate and nitrite concentrations and improves the prediction accuracy of low-concentration components. It should be understood that the method of the present invention is applicable not only to the mixed solution of nitrate and nitrite with a certain concentration ratio prepared in the present embodiment, but also to any water sample in any concentration range with nitrate and nitrite as main components.
The foregoing examples illustrate only a few embodiments of the invention and are described in detail herein without thereby limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It should be understood that, although the steps in the flowcharts of the embodiments of the present invention are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to this order of execution and may be executed in other orders. Moreover, at least some of the steps in various embodiments may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed in sequence either, and may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing relevant hardware, where the program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above-described embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.

Claims (8)

1. The method for simultaneously detecting the nitrate and nitrite contents in water based on the hybrid machine learning model is characterized by comprising the following steps of:
S1, preparing a series of mixed solution samples of nitrate and nitrite with different nitrogen contents, and measuring spectral data of the samples;
S2, forming a two-dimensional plane from the nitrogen contents of the nitrate and the nitrite in the samples, acquiring the optimal critical concentration, dividing the two-dimensional plane into four sub-regions, and taking the samples in each sub-region as one class of samples to obtain four classes of samples;
S3, establishing a relation model between the nitrogen contents of the nitrate and nitrite corresponding to each of the four classes of samples and the corresponding spectral data;
S4, taking samples in the sub-regions and on the classification boundaries as modeling samples, screening characteristic wavelengths with high sensitivity and correlation, and establishing a regression sub-model;
wherein in step S4, a stable variable displacement method is adopted to select the characteristic wavelengths, and the optimal variable subset is established as follows:
obtaining sub-data sets of a sample space and of a variable space by Monte Carlo sampling, and calculating the stability of each variable in the sub-data sets of the sample space to obtain elite variables with high stability, wherein the stability $S_j$ is calculated as: $$S_j=\frac{\left|\bar{b}_j\right|}{\sqrt{\dfrac{1}{M-1}\sum_{i=1}^{M}\left(b_{ij}-\bar{b}_j\right)^{2}}}$$ where $b_{ij}$ is the regression coefficient of the j-th variable for the i-th sample, $\bar{b}_j$ is the mean regression coefficient of the j-th variable, and $M$ is the total number of samples;
performing variable displacement analysis in the sub-data sets of the variable space, calculating the degree of displacement, and obtaining important variables with a high degree of displacement, wherein the degree of displacement $PD_j$ is calculated as $PD_j = PCE_j - SCE_j$, where $PCE_j$ is the mean root mean square error of the models built on each of the wavelength subsets that do not contain variable j, and $SCE_j$ is the mean root mean square error of the models built on each of the remaining wavelength subsets, which contain variable j;
merging the elite variables and the important variables, and obtaining the optimal variable subset by cross-validation;
the regression sub-model is:
$$y(x)=\sum_{i=1}^{n}\alpha_i K(x,x_i)+b$$
wherein $K(x,x_i)=\exp\left(-\frac{\|x-x_i\|^{2}}{2\sigma^{2}}\right)$, $x_i$ is a sample vector, $\sigma$ is the bandwidth of the Gaussian kernel, i.e., the kernel parameter, and $[b\ \alpha_1\ \alpha_2\ \dots\ \alpha_n]$ are constants that can be obtained by solving the least squares support vector machine objective function by the Lagrangian method;
S5, acquiring spectral data of the sample to be tested, determining the category of the sample to be tested according to the relation model, and using the regression sub-model corresponding to that category for analysis and prediction to obtain the concentrations of nitrate and nitrite in the sample to be tested.
2. The method for simultaneously detecting nitrate and nitrite contents in water based on a hybrid machine learning model according to claim 1, wherein the step S3 is specifically:
training on the nitrogen contents of the nitrate and the nitrite corresponding to each of the four classes of samples and on the corresponding spectral data to obtain a support vector machine classification model, a random forest classification model and a logistic regression model.
3. The method for simultaneously detecting nitrate and nitrite contents in water based on a hybrid machine learning model according to claim 2, wherein the obtaining a support vector machine classification model specifically comprises:
the objective function of the support vector machine classification model is:
$$\min_{\omega,b,\xi}\ \frac{1}{2}\|\omega\|^{2}+C\sum_{i=1}^{n}\xi_i$$
$$\text{s.t.}\quad y_i\left(\omega^{T}x_i+b\right)\ge 1-\xi_i,\quad \xi_i\ge 0,\quad i=1,2,\dots,n$$
wherein $x_i$ is a sample vector, $y_i$ is a sample classification label, $\omega$ is a vector whose dimension equals the feature dimension of the sample, $b$ is a real number, $n$ is the total number of samples, $C$ is a penalty factor, and $\xi_i$ is a slack variable;
a Gaussian kernel function is selected as the kernel function of the support vector machine, expressed as:
$$K(x_i,x_j)=\exp\left(-\frac{\|x_i-x_j\|^{2}}{2\sigma^{2}}\right)$$
wherein $x_i$ and $x_j$ are feature vectors of samples in the low-dimensional space, and $\sigma$ is the bandwidth of the Gaussian kernel, i.e., the kernel parameter.
4. The method for simultaneously detecting nitrate and nitrite contents in water based on a hybrid machine learning model according to claim 2, wherein the obtaining a random forest classification model specifically comprises:
sampling the samples to obtain bootstrap samples and constructing CART trees from them, extracting a plurality of features at each node of a CART tree, and calculating the Gini index of each feature to obtain classification features with classification capability; the Gini index of a sample set D is calculated as $$\mathrm{Gini}(D)=1-\sum_{k=1}^{K}\left(\frac{C_k}{|D|}\right)^{2}$$ where $C_k$ is the number of samples of the k-th category and $|D|$ is the total number of samples in D;
and classifying according to the classification features to obtain a tree structure whose nodes are completely split.
5. The method for simultaneously detecting nitrate and nitrite contents in water based on a hybrid machine learning model according to claim 2, wherein the obtaining a logistic regression model specifically comprises:
the logistic regression model is:
$$y=\frac{1}{1+e^{-\omega^{T}x}}$$
wherein $\omega$ is the weight, $x$ is the input sample data, and $y$ is the probability that the sample belongs to the positive class of the classifier;
the loss function of the model is:
$$L(\omega)=-\frac{1}{N}\sum_{n=1}^{N}\left[y_n\ln\hat{y}_n+(1-y_n)\ln\left(1-\hat{y}_n\right)\right]$$
wherein $\omega$ is the weight, $N$ is the number of samples, $\hat{y}_n$ is the probability that the n-th sample belongs to the positive class, and $y_n$ is the sample class label, 0 or 1.
6. The method for simultaneously detecting nitrate and nitrite contents in water based on the hybrid machine learning model according to claim 2, wherein the determining the category of the sample to be detected according to the relation model in the step S5 is specifically:
classifying with the support vector machine classification model, the random forest classification model and the logistic regression model to obtain three categories, and selecting the category held by the majority of the three as the category of the sample to be tested.
7. The method for simultaneous detection of nitrate and nitrite content in water based on a hybrid machine learning model according to claim 1, wherein the conditions for determining the spectral data in steps S1 and S5 are:
the spectral scanning range is 190-400 nm, and the spectral scanning interval is 1 nm.
8. The method for simultaneous detection of nitrate and nitrite content in water based on a hybrid machine learning model as claimed in claim 1, wherein the optimal critical concentration in step S2 is 0.4 mg N L⁻¹.
CN202110054882.XA 2021-01-15 2021-01-15 Method for simultaneously detecting nitrate and nitrite contents in water based on hybrid machine learning model Active CN112750507B (en)


Publications (2)

Publication Number Publication Date
CN112750507A CN112750507A (en) 2021-05-04
CN112750507B true CN112750507B (en) 2023-12-22

Family

ID=75652155

Country Status (1)

Country Link
CN (1) CN112750507B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115901677B (en) * 2022-12-02 2023-12-22 北京理工大学 Method for predicting concentration of ammonium nitrate in nitric acid-ammonium nitrate solution with updating mechanism
CN115950854B (en) * 2022-12-02 2023-10-13 北京理工大学 Method for predicting concentration of ammonium nitrate in nitric acid-ammonium nitrate solution
CN118152705A (en) * 2024-02-02 2024-06-07 北京工业大学重庆研究院 Method for determining multi-parameter substitution index of abundance of effluent resistance gene of sewage plant

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106153601A (en) * 2016-10-08 2016-11-23 江南大学 A kind of method based on SERS detection grease oxide in trace quantities since
CN107024445A (en) * 2017-04-17 2017-08-08 中国科学院南京土壤研究所 The modeling method and detection method of the quick detection of Nitrate in Vegetable
CN109001080A (en) * 2018-05-18 2018-12-14 内蒙古师范大学 A kind of solubility of lanthanum acylalaninies complex and the research method of Assembling Behavior
CN109187392A (en) * 2018-09-26 2019-01-11 中南大学 A kind of zinc liquid trace metal ion concentration prediction method based on two-zone model
US10229370B1 (en) * 2017-08-29 2019-03-12 Massachusetts Mutual Life Insurance Company System and method for managing routing of customer calls to agents
CN110591075A (en) * 2019-06-28 2019-12-20 四川大学华西医院 PEG-Peptide linear-tree-shaped drug delivery system and preparation method and application thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
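Machine-learning-based spectral prediction model for trace pesticides; Chen Jingjing; Journal of Beijing Information Science and Technology University; Vol. 35 (No. 2); pp. 62-66 *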

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant