CN112750507A

CN112750507A - Method for simultaneously detecting content of nitrate and nitrite in water based on hybrid machine learning model

Info

Publication number: CN112750507A
Application number: CN202110054882.XA
Authority: CN
Inventors: 熊莎; 吴琼; 张航; 李勇刚
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2021-01-15
Filing date: 2021-01-15
Publication date: 2021-05-04
Anticipated expiration: 2041-01-15
Also published as: CN112750507B

Abstract

The invention belongs to the field of spectral signal analysis, and particularly relates to a method for simultaneously detecting the content of nitrate and nitrite in water based on a hybrid machine learning model. The method comprises the following steps: acquiring spectral data for a series of nitrate and nitrite mixed solution samples with different nitrogen contents; classifying the samples according to the optimal critical concentration to obtain four types of samples; establishing a relation model between the nitrate and nitrite nitrogen contents of each of the four types of samples and the corresponding spectral data; screening characteristic wavelengths with high sensitivity and correlation, and establishing a regression sub-model for each type; and acquiring spectral data of a sample to be detected, determining its type according to the relation model, and predicting the nitrate and nitrite concentrations with the regression sub-model corresponding to that type. The method achieves accurate and rapid detection of nitrate nitrogen and nitrite nitrogen and maintains detection sensitivity at low concentrations.

Description

Method for simultaneously detecting content of nitrate and nitrite in water based on hybrid machine learning model

Technical Field

The invention belongs to the field of spectral signal analysis, and particularly relates to a method for simultaneously detecting the content of nitrate and nitrite in water based on a hybrid machine learning model.

Background

Many detection technologies for nitrogen-containing compounds are currently on the market, and they differ greatly in detection principle, calculation method, operating procedure and field of application. The relatively mature instrumental methods for multi-component concentration analysis reported in China and abroad are mainly electrochemistry, capillary electrophoresis, ion chromatography, biosensing and spectrophotometry. Electrochemical measurement is not yet adequate for monitoring trace analytes, and the electrode surface is easily fouled in real samples, which makes the results unstable. Capillary electrophoresis is reliable, but it requires large instruments, is complex to operate and is difficult to automate for field monitoring. Chromatography can determine the concentrations of several ionic components simultaneously and is safe, but the equipment needs frequent maintenance and the analysis is time-consuming and expensive. Biosensor approaches still have to solve problems of robustness, selectivity and standardization of operation. Spectroscopic techniques such as ultraviolet-visible, near-infrared and fluorescence spectroscopy are nondestructive, general and flexible; they have the characteristics required for online monitoring and are currently economical, rapid and simple. Given the light absorption characteristics of nitrate and nitrite, rapid and simple ultraviolet-visible spectrophotometry is selected as the basic detection method.

Conventional spectrophotometry usually detects nitrate and nitrite by sequential analysis: nitrite in a sample is first determined by the Griess reagent method; a second, identical sample is then reduced (generally on a copper/cadmium column) so that all nitrate is converted into nitrite; the nitrite analysis is repeated, and the nitrate concentration is calculated by difference. This is an indirect analysis for nitrate: it is time-consuming and depends heavily on the accuracy of the nitrite determination. In addition, the Griess method uses toxic chemical reagents, which are harmful to health and pollute the environment. Researchers have proposed measuring nitrate and nitrite directly from their ultraviolet absorption spectra. However, the two spectra are similar in shape over the first half of the spectral range and their absorption peaks nearly overlap, so in practice it is difficult to separate the contributions of nitrite and nitrate in a collected spectrum. Existing direct spectroscopic methods still process the spectral data with traditional chemometric methods and suffer from a narrow application range and low detection precision. In recent years, ultraviolet spectroscopy combined with machine learning has been successfully applied to the rapid detection of various compounds, but few studies have addressed the separation of nitrate and nitrite. Preliminary experiments show that when an ordinary machine learning model is applied to mixed nitrate and nitrite solutions over a given concentration range, its sensitivity for components at low concentration is insufficient. A machine learning method that maintains the same level of detection precision when the analyte concentration varies widely is therefore urgently needed.

Disclosure of Invention

Based on the above, the invention provides a method for simultaneously detecting the content of nitrate and nitrite in water based on a hybrid machine learning model. By combining classification and regression algorithms, the method keeps the detection precision for nitrate and nitrite balanced over the whole modeling range, is simple to operate and low in cost, and achieves accurate and rapid simultaneous detection of nitrate and nitrite in a simple environment.

The invention provides a method for simultaneously detecting the content of nitrate and nitrite in water based on a hybrid machine learning model, which specifically comprises the following steps:

S1, preparing a series of nitrate and nitrite mixed solution samples with different nitrogen contents, and measuring the spectral data of the samples;

S2, forming a two-dimensional plane from the nitrogen contents of nitrate and nitrite in the samples, obtaining the optimal critical concentration, dividing the two-dimensional plane into four sub-regions, and taking the samples in each sub-region as one type of sample to obtain four types of samples;

S3, establishing a relation model between the nitrate and nitrite nitrogen contents of each of the four types of samples and the corresponding spectral data to realize automatic classification of the samples;

S4, taking the samples inside the sub-regions and on the classification boundaries as modeling samples, screening characteristic wavelengths with high sensitivity and correlation, and establishing a regression sub-model;

S5, acquiring the spectral data of the sample to be detected, determining the type of the sample to be detected according to the relation model, and predicting with the regression sub-model corresponding to that type to obtain the concentrations of nitrate and nitrite in the sample to be detected.

Further, the step S3 is specifically:

the nitrate and nitrite nitrogen contents of each of the four types of samples and the corresponding spectral data are used to train a support vector machine classification model, a random forest classification model and a logistic regression model.

Further, obtaining the support vector machine classification model specifically includes:

the objective function of the support vector classification model is as follows:

$$\min_{\omega,b,\xi}\ \frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^{l}\xi_i$$

$$\text{s.t.}\quad y_i(\omega^T x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, 2, \ldots, l$$

where $x_i$ is a sample vector, $y_i$ is a sample class label, $\omega$ is a vector whose dimension equals the feature dimension of the sample, $b$ is a real number, $l$ is the total number of samples, $C$ is a penalty factor, and $\xi_i$ is a slack variable;

a Gaussian kernel function is selected as the kernel function of the support vector machine, and its expression is as follows:

$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$$

where $x_i$ and $x_j$ are the feature vectors of samples in the low-dimensional space, and $\sigma$ is the bandwidth of the Gaussian kernel, i.e., the kernel parameter.

Further, obtaining the random forest classification model specifically includes:

sampling the samples with replacement to obtain bootstrap samples and constructing a CART tree from each; extracting a plurality of features at each node of the CART tree and calculating the Gini index of each feature to obtain the classification feature with the most classification capability; the Gini index of a sample set $D$ is calculated as follows:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K}\left(\frac{C_k}{|D|}\right)^2$$

where $K$ is the number of categories, $C_k$ is the number of samples in the $k$-th category, and $|D|$ is the total number of samples in $D$;

and classifying according to the classification features to obtain a tree structure with completely split nodes.

Further, obtaining the logistic regression model specifically includes:

the logistic regression model is as follows:

$$y = \frac{1}{1 + e^{-\omega^T x}}$$

where $\omega$ is the weight, $x$ is the input sample data, and $y$ is the probability that the sample belongs to the positive class of the classifier;

the loss function of the model is:

$$L(\omega) = \prod_{n=1}^{N} \hat{y}_n^{\,y_n}\,(1 - \hat{y}_n)^{1 - y_n}$$

where $\omega$ is the weight, $N$ is the number of samples, $\hat{y}_n$ is the probability that the $n$-th sample is positive, and $y_n$ is the sample class label, 0 or 1.

Further, in step S4, selecting the characteristic wavelengths by the stable variable displacement method (SVP) and establishing the optimal variable subset specifically includes:

obtaining sub-data sets of the sample space and of the variable space by Monte Carlo sampling, calculating the stability of each variable over the sub-data sets of the sample space, and taking the variables with high stability as elite variables, where the stability $S_j$ is calculated as follows:

$$S_j = \frac{|\bar{b}_j|}{\sqrt{\dfrac{1}{M-1}\sum_{i=1}^{M}\left(b_{ij} - \bar{b}_j\right)^2}}$$

where $b_{ij}$ is the regression coefficient of the $j$-th variable for the $i$-th sample, $\bar{b}_j$ is the mean regression coefficient of the $j$-th variable, and $M$ is the total number of samples;

performing variable displacement analysis on the sub-data sets of the variable space, calculating the displacement degree of each variable, and taking the variables with high displacement degree as important variables, where the displacement degree $PD_j$ is calculated as $PD_j = PCE_j - SCE_j$, in which $PCE_j$ is the mean root mean square error of the models built separately on the wavelength subsets that do not contain variable $j$, and $SCE_j$ is the mean root mean square error of the models built separately on the wavelength subsets that contain variable $j$;

and combining the elite variables and the important variables, and obtaining the optimal variable subset by cross validation.

Further, the final model structure in step S4 is:

$$y(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i) + b$$

wherein

$$K(x, x_i) = \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right)$$

$x_i$ is a sample vector, $\sigma$ is the bandwidth of the Gaussian kernel, i.e., the kernel parameter, and $[b, \alpha_1, \alpha_2, \ldots, \alpha_n]$ are constants obtained by solving the objective function of the least squares support vector machine by the Lagrange method.

Further, the step S5 of determining the type of the sample to be tested according to the relationship model specifically includes:

classifying the sample with the support vector machine classification model, the random forest classification model and the logistic regression model respectively to obtain three class predictions, and selecting the class predicted by the majority of the three models as the class of the sample to be detected.

Further, the conditions for measuring the spectrum data in the steps S1 and S5 are as follows:

the spectral scanning range is 190-400nm, and the spectral scanning interval is 1 nm.

Further, the optimal critical concentration in step S2 is 0.4 mg N L^-1.

Beneficial effects:

The invention prepares a series of mixed nitrate and nitrite solutions in advance, measures their spectral data, and uses these data to establish a hybrid machine learning model through classification and regression algorithms.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flowchart of a method for simultaneously detecting nitrate and nitrite contents in water based on a hybrid machine learning model according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of sample classification provided by an embodiment of the present invention;

FIG. 3 is a frame diagram of an algorithm for analyzing the content of a sample to be tested according to an embodiment of the present invention;

FIG. 4 is a graph comparing the effect of predicting nitrate concentration using a single model and a mixed model according to the present invention;

FIG. 5 is a graph comparing the effect of predicting nitrite concentration using a single model and a mixture model according to embodiments of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Based on research, the invention finds that the combination of the ultraviolet spectrum and the machine learning method can be used for simultaneously and rapidly detecting nitrate and nitrite, but when a common machine learning model is oriented to a mixed solution of nitrate and nitrite within a certain concentration range, the sensitivity of predicting components at low concentration is insufficient, and a machine learning method which can still maintain the detection precision at the same level when the concentration of an analyte changes greatly is urgently needed to be found.

As shown in fig. 1, in one embodiment, a flow chart of a method for simultaneously detecting nitrate and nitrite contents in water based on a hybrid machine learning model is provided, which specifically includes the following steps:

Step S101, a series of nitrate and nitrite mixed solution samples with different nitrogen contents are prepared, and the spectral data of the samples are measured.

In this example, standard stock solutions of nitrate nitrogen and nitrite nitrogen are prepared first: 0.7221 g of dried potassium nitrate or 0.4928 g of sodium nitrite is weighed, dissolved in an appropriate amount of fresh deionized water, transferred into a 1000 mL volumetric flask, diluted to the mark with deionized water and mixed well. Before use, the stock solution is diluted to a standard working solution of 10 mg N L^-1. All reagents were of analytical grade (national chemical reagents, ltd., china). Mixed samples are prepared with nitrite nitrogen concentrations of 0.1, 0.2, 0.3, 0.4, 0.8, 1.2, 1.6, 2.0, 2.5 and 3.0 mg N L^-1 and nitrate nitrogen concentrations of 0.1, 0.2, 0.3, 0.4, 0.8, 1.2, 1.6, 2.0, 2.5 and 3.0 mg N L^-1, giving 100 mixed samples in total. Deionized water is used as the reference solution for background subtraction, and the spectral data are measured at 1 nm intervals over the wavelength range 190-400 nm.
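As a quick arithmetic check on the weighed masses above, the following Python sketch converts each mass into a nitrogen concentration and enumerates the 10 x 10 concentration grid. The molar masses are standard values and are not taken from the source; the stock concentration of roughly 100 mg N L^-1 that results is inferred from the stated masses and flask volume rather than stated in the text.

```python
from itertools import product

# Molar masses in g/mol (standard atomic weights; assumed, not given in the source).
M_N = 14.007
M_KNO3 = 39.098 + 14.007 + 3 * 15.999    # potassium nitrate
M_NANO2 = 22.990 + 14.007 + 2 * 15.999   # sodium nitrite


def stock_mg_n_per_l(mass_g, molar_mass, volume_l=1.0):
    """Nitrogen concentration (mg N/L) of a salt dissolved to volume_l litres."""
    return mass_g * (M_N / molar_mass) * 1000.0 / volume_l


kno3_stock = stock_mg_n_per_l(0.7221, M_KNO3)
nano2_stock = stock_mg_n_per_l(0.4928, M_NANO2)

# Ten nitrogen levels per analyte, as listed in the text (mg N/L).
levels = [0.1, 0.2, 0.3, 0.4, 0.8, 1.2, 1.6, 2.0, 2.5, 3.0]
design = list(product(levels, levels))   # (nitrate, nitrite) pairs

print(round(kno3_stock, 1), round(nano2_stock, 1), len(design))
```

Under these assumptions a tenfold dilution of either stock gives the 10 mg N L^-1 working solution, and the full factorial grid reproduces the 100 mixed samples.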

Step S102, forming a two-dimensional plane by using the nitrogen content of nitrate and nitrite in the sample, obtaining the optimal critical concentration, dividing the two-dimensional plane into four sub-regions, and obtaining four types of samples by taking the sample in each sub-region as one type of sample.

As shown in FIG. 2, an embodiment of the present invention provides a sample classification diagram in which the concentration plane of nitrate and nitrite is divided into four sub-regions that are modeled separately. Because the sensitivity of analyte prediction is insufficient at low concentration, the critical concentration dividing the sub-regions is placed at a low value. Critical concentrations of 0.3, 0.4 and 0.8 mg N L^-1 were each used for modeling analysis, and the results are shown in Table 1: with a critical concentration of 0.4 mg N L^-1, the overall model has higher classification accuracy and lower average relative error. The nitrate and nitrite contents differ in character between sub-regions: in zone 1 both are low; in zone 2 the nitrate content is much higher than the nitrite content; in zone 3 the nitrate concentration is much lower than the nitrite concentration; in zone 4 both are high. Compared with a single full-range model, each sub-model is better adapted to the sample characteristics of its sub-region and has higher prediction precision.

TABLE 1 Comparison of model performance at different critical concentrations (mg N L^-1)
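The division of the concentration plane can be sketched in a few lines of Python. This is a minimal illustration, assuming that a concentration exactly equal to the critical value counts as "low"; the source does not say on which side of the boundary such samples are labeled. The function name `region` is hypothetical.

```python
CRITICAL = 0.4  # mg N/L, the optimal critical concentration reported in the text


def region(nitrate, nitrite, critical=CRITICAL):
    """Return the sub-region (1-4) of a sample in the concentration plane.

    Assumption: a concentration equal to the critical value counts as "low".
    """
    low_nitrate = nitrate <= critical
    low_nitrite = nitrite <= critical
    if low_nitrate and low_nitrite:
        return 1          # zone 1: both low
    if low_nitrite:
        return 2          # zone 2: nitrate high, nitrite low
    if low_nitrate:
        return 3          # zone 3: nitrate low, nitrite high
    return 4              # zone 4: both high


levels = [0.1, 0.2, 0.3, 0.4, 0.8, 1.2, 1.6, 2.0, 2.5, 3.0]
counts = {r: 0 for r in (1, 2, 3, 4)}
for no3 in levels:
    for no2 in levels:
        counts[region(no3, no2)] += 1
print(counts)
```

With the ten concentration levels of the embodiment and this boundary convention, the 100 samples are split unevenly across the four zones, which is one reason each zone gets its own regression sub-model.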

S103, establishing a relation model between the nitrogen content of the nitrate and nitrite corresponding to each of the four types of samples and the corresponding spectral data to realize automatic classification of the samples;

In the embodiment of the invention, the nitrate and nitrite nitrogen contents of each of the four types of samples and the corresponding spectral data are used to train a support vector machine classification model, a random forest classification model and a logistic regression model.

In the embodiment of the invention, the MATLAB toolbox LIBSVM-farutoultimateVersion is used to train the support vector machine classification model, whose objective function is as follows:

$$\min_{\omega,b,\xi}\ \frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^{l}\xi_i,\qquad \text{s.t.}\ \ y_i(\omega^T x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, 2, \ldots, l \tag{1}$$

where $x_i$ is a sample vector, $y_i$ is a sample class label, $\omega$ is a vector whose dimension equals the feature dimension of the sample, $b$ is a real number, $l$ is the total number of samples, $C$ is a penalty factor, and $\xi_i$ is a slack variable;

a Gaussian kernel function is used as the kernel function:

$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) \tag{2}$$

where $x_i$ and $x_j$ are the feature vectors of samples in the low-dimensional space, and $\sigma$ is the bandwidth of the Gaussian kernel;
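To make the role of the bandwidth concrete, the sketch below evaluates a Gaussian kernel on a few short made-up vectors. It assumes the common convention K = exp(-||x_i - x_j||^2 / (2 sigma^2)); the function names and the numbers are illustrative only and are not taken from the toolbox.

```python
import math


def gaussian_kernel(x_i, x_j, sigma):
    """Gaussian (RBF) kernel, assuming the 2*sigma^2 convention in the exponent."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gram_matrix(samples, sigma):
    """Kernel value for every pair of samples."""
    return [[gaussian_kernel(a, b, sigma) for b in samples] for a in samples]


# Three made-up feature vectors: the first two are close, the third is far away.
spectra = [[0.10, 0.20, 0.30], [0.12, 0.18, 0.33], [0.90, 0.80, 0.70]]
K = gram_matrix(spectra, sigma=0.5)
for row in K:
    print([round(v, 4) for v in row])
```

LIBSVM itself parameterizes its RBF kernel as exp(-gamma * |u - v|^2), so under the convention assumed here the kernel parameter g would correspond to 1 / (2 sigma^2).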

When modeling with the SVM, the absorbance data are first normalized to the range 0-1 to speed up convergence during training; the dimension of the input data is then reduced by principal component analysis (PCA), and the two hyper-parameters, the penalty factor C and the kernel parameter σ, are optimized by particle swarm optimization (PSO).
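The normalization step can be illustrated with a short Python sketch. It assumes that each wavelength (column) is scaled independently to 0-1, which is one common choice; the source does not state whether scaling is per wavelength or per sample. The function name `minmax_scale` is hypothetical.

```python
def minmax_scale(rows):
    """Scale each column (wavelength) of a list of spectra to the range 0-1.

    Returns the scaled rows plus the per-column minimum and span, so the same
    mapping can later be applied to a sample to be detected.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (value - low) / span if span > 0 else 0.0   # constant column -> 0
            for value, low, span in zip(row, lows, spans)
        ])
    return scaled, lows, spans


# Three made-up spectra with two wavelengths each; the second column is constant.
absorbance = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
scaled, lows, spans = minmax_scale(absorbance)
print(scaled)
```

Keeping the training minimum and span is a design choice in this sketch: a new spectrum should be mapped with the same constants that were used for the modeling samples.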

The SVC function in the LIBSVM-farutoultimateVersion toolbox integrates these functions. It is called as [predict_label, accuracy, bestc, bestg] = SVC(train_label, train_data, test_label, test_data, Method_option), where Method_option is a structure set to Method_option.scale = 1, Method_option.pca = 0 and Method_option.type = 2. This builds the SVM classification model, returns the predicted sample classes predict_label, and outputs the optimal penalty factor C and kernel parameter g.

In the embodiment of the invention, the MATLAB toolbox RF_MexStandalone-v0.02 is used to train the random forest classification model. First, k new bootstrap sample sets are drawn at random with replacement from the original training samples, and k CART trees are constructed; the samples not drawn in each round form k sets of out-of-bag data. Assuming there are n features, m features are drawn at random at each node of each tree, and the feature with the most classification capability is selected for node splitting by calculating the Gini index of each feature. For a given sample set D with K classes, where the number of samples in the k-th class is $C_k$, the Gini index of D is calculated as follows:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K}\left(\frac{C_k}{|D|}\right)^2 \tag{3}$$

If the selected attribute is A, the Gini index of the data set D after splitting is calculated as follows:

$$\mathrm{Gini}(D, A) = \sum_{j=1}^{K}\frac{|D_j|}{|D|}\,\mathrm{Gini}(D_j) \tag{4}$$

where K here denotes the number of parts into which D is divided, i.e., the data set D is split into K data sets $D_j$;

Each tree is formed by complete node splitting, so that every CART tree grows to the maximum extent; the generated trees then vote on the sample category, and the final class of an unknown sample is decided by majority rule.
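The Gini computations used for node splitting can be sketched as follows. This is a minimal illustration on made-up labels; it assumes binary threshold splits on a single numeric feature, as is usual for CART, and the function names are hypothetical rather than part of the toolbox.

```python
from collections import Counter


def gini(labels):
    """Gini index of a set of class labels: 1 - sum_k (C_k / |D|)^2."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def gini_split(groups):
    """Weighted Gini index after splitting a data set into the given groups."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups if g)


def best_threshold(values, labels):
    """Try midpoints between sorted feature values; return (threshold, gini)."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2.0
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        score = gini_split([left, right])
        if score < best[1]:
            best = (t, score)
    return best


# One made-up feature (e.g. absorbance at a single wavelength) and zone labels.
feature = [0.1, 0.2, 0.8, 0.9]
zones = [1, 1, 2, 2]
print(gini(zones), best_threshold(feature, zones))
```

A lower weighted Gini index after the split means purer child nodes, so the candidate feature and threshold with the smallest value would be chosen at that node.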

In the embodiment of the invention, a program is written in MATLAB to implement logistic regression. The output of a linear regression model is used as the input of a sigmoid function to obtain the logistic regression model:

$$y = \frac{1}{1 + e^{-\omega^T x}} \tag{5}$$

The loss function measures the difference between the model output and the true output. In logistic regression, the value of the loss function equals the joint probability that the samples belong to their labeled classes:

$$L(\omega) = \prod_{n=1}^{N} \hat{y}_n^{\,y_n}\,(1 - \hat{y}_n)^{1 - y_n} \tag{6}$$

where $\omega$ is the weight, $N$ is the number of samples, $\hat{y}_n$ is the probability that the $n$-th sample is positive, and $y_n$ is the sample class label, 0 or 1.

Following the idea of maximum likelihood estimation, the optimal ω is the one that maximizes the loss function. Stochastic gradient descent is applied: an initial value of ω is generated at random, and ω is then updated iteratively until the optimum is reached, where ω_t denotes the current value of ω and ω_t+1 denotes its updated value;

The obtained ω is substituted into the logistic regression model to calculate the class probability score of each sample, and the class with the highest score is taken as the final class of the sample. The one-vs-all idea is used to extend logistic regression to multi-class classification: assuming the data have N classes, one independent binary classifier is built by logistic regression for each of the N classes. For classifier i, samples with label = i are the positive class and all other samples are the negative class, and so on. When sample data to be predicted are input, every classifier outputs the probability p that the sample belongs to its positive class, and the class corresponding to the largest p is taken as the final predicted class.
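The one-vs-all scheme described above can be sketched in plain Python. This is an illustration under stated assumptions, not the MATLAB program of the embodiment: the per-sample update used here is gradient ascent on the log-likelihood (equivalent to descent on its negative), and the learning rate, epoch count, seed and toy data are all made up for the example.

```python
import math
import random


def sigmoid(z):
    # Clamp z so that math.exp cannot overflow for extreme scores.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_binary(samples, targets, lr=0.1, epochs=300, seed=0):
    """Fit weights (bias last) by stochastic gradient ascent on the log-likelihood."""
    rng = random.Random(seed)
    dim = len(samples[0])
    w = [rng.uniform(-0.01, 0.01) for _ in range(dim + 1)]   # random initial value
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for n in order:
            x = list(samples[n]) + [1.0]                     # append bias input
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            # Per-sample gradient of the log-likelihood is (y_n - p) * x.
            w = [wi + lr * (targets[n] - p) * xi for wi, xi in zip(w, x)]
    return w


def train_one_vs_all(samples, labels):
    """One binary classifier per class: that class is positive, the rest negative."""
    classes = sorted(set(labels))
    return {c: train_binary(samples, [1 if lab == c else 0 for lab in labels])
            for c in classes}


def predict(models, sample):
    """Return the class whose classifier gives the largest positive-class probability."""
    x = list(sample) + [1.0]
    scores = {c: sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
              for c, w in models.items()}
    return max(scores, key=scores.get)


# Made-up, well-separated 2-D data with three classes.
X = [(0.0, 0.0), (0.5, 0.2), (0.2, 0.5),
     (4.0, 0.0), (4.5, 0.3), (3.8, 0.4),
     (0.0, 4.0), (0.3, 4.5), (0.4, 3.8)]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]
models = train_one_vs_all(X, y)
print([predict(models, s) for s in X])
```

In the embodiment the inputs would be spectral features and the four classes would be the four concentration zones; the toy data only show the mechanics of training one classifier per class and taking the maximum probability.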

The classification models built with the support vector machine, random forest and logistic regression each vote on the sample category, and the category that receives the majority of the votes (at least 2) is taken as the final category of the sample.
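The voting rule can be written as a small Python function. The sketch below returns None when all three base classifiers disagree, because the rule for that case is given in the source's Table 2, which is not reproduced here; the function name is hypothetical.

```python
from collections import Counter


def majority_vote(votes):
    """Return the class predicted by at least two of the three classifiers.

    Returns None when all three votes differ; the source resolves that case
    with its Table 2, which is not reproduced in this sketch.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None


print(majority_vote([1, 1, 3]), majority_vote([2, 4, 4]), majority_vote([1, 2, 3]))
```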

And step S104, taking the samples in the sub-regions and on the classification boundaries as modeling samples, screening characteristic wavelengths with high sensitivity and correlation, and establishing a regression sub-model.

In the embodiment of the invention, because the classifier is more likely to make errors at region boundaries, each sub-model also includes the samples lying on the boundary, so that a classification error does not cause a large prediction error. The stable variable displacement method (SVP) is used to select the characteristic wavelengths and establish the optimal variable subset, and a least squares support vector machine is used to build each regression sub-model. SVP is based on the evolutionary principle of intraspecific competition and survival of the fittest: variables are evaluated by their stability and displacement degree together with statistics on model performance, and the variable subset with the smallest mean RMSE and a relatively low standard deviation is taken as the optimal subset. For each sub-region, SVP selects a separate variable subset for nitrite and for nitrate. Models built on such a specific variable subset adapt to the characteristics of the target ion and therefore perform better. The least squares support vector machine models are built in MATLAB with the LSSVMlabv1_8_R2009b_R2011a toolbox, using the RBF kernel function and a grid search for the optimal regularization parameter and kernel parameter, to obtain the regression sub-model of each sub-region.

In the embodiment of the invention, the stable variable displacement method (SVP) is used to build a model for nitrate and for nitrite in each sub-region and to select the optimal characteristic wavelength subset. First, sub-data sets of the sample space and of the variable space are obtained by Monte Carlo sampling; the stability of each variable is calculated over the sub-data sets of the sample space and the variables are ranked by stability; the variables with high stability are taken as elite variables and the rest as normal variables. The stability $S_j$ is calculated as follows:

$$S_j = \frac{|\bar{b}_j|}{\sqrt{\dfrac{1}{M-1}\sum_{i=1}^{M}\left(b_{ij} - \bar{b}_j\right)^2}} \tag{8}$$

where $b_{ij}$ is the regression coefficient of the $j$-th variable for the $i$-th sample, $\bar{b}_j$ is the mean regression coefficient of the $j$-th variable, and $M$ is the total number of samples.
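The stability ranking can be sketched as follows, assuming the stability of a variable is the absolute mean of its regression coefficients divided by their sample standard deviation (M - 1 in the denominator) across the Monte Carlo sub-models. The coefficient values and the function names are made up for illustration.

```python
from statistics import mean, stdev


def stability(coefficients):
    """coefficients[i][j]: regression coefficient of variable j in sub-model i."""
    scores = []
    for column in zip(*coefficients):
        spread = stdev(column)                 # sample standard deviation (M - 1)
        centre = abs(mean(column))
        scores.append(centre / spread if spread > 0 else float("inf"))
    return scores


def elite_variables(coefficients, n_elite):
    """Indices of the n_elite variables with the highest stability."""
    scores = stability(coefficients)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_elite])


# Made-up coefficients from three sub-models for three variables (wavelengths).
B = [[1.0, 1.0, 0.5],
     [1.1, -1.0, 0.7],
     [0.9, 0.0, 0.6]]
print([round(s, 3) for s in stability(B)], elite_variables(B, 2))
```

A variable whose coefficient keeps the same sign and size across sub-models scores high, while one that flips sign from one sub-model to the next scores near zero and would not be kept as an elite variable.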

Variable displacement analysis is then performed on the sub-data sets of the variable space: the displacement degree of each variable is calculated and ranked, and the variables with high displacement degree are taken as important variables. The displacement degree $PD_j$ is calculated as follows:

$$PD_j = PCE_j - SCE_j \tag{9}$$

where $PCE_j$ is the mean root mean square error of the models built separately on the wavelength subsets that do not contain variable $j$, and $SCE_j$ is the mean root mean square error of the models built separately on the remaining wavelength subsets, which contain variable $j$.
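Formula (9) can be illustrated with a small Python sketch that takes the variable subsets drawn in the variable space and the RMSE of the model built on each subset. The subsets and error values below are made up, the function name is hypothetical, and scoring an undefined case as 0 is an assumption of this sketch.

```python
from statistics import mean


def displacement_degree(subsets, rmse, n_variables):
    """PD_j = PCE_j - SCE_j for each variable j.

    subsets[i] is the set of variable indices used by sub-model i and
    rmse[i] is that sub-model's root mean square error.
    """
    degrees = []
    for j in range(n_variables):
        with_j = [e for s, e in zip(subsets, rmse) if j in s]
        without_j = [e for s, e in zip(subsets, rmse) if j not in s]
        if not with_j or not without_j:
            degrees.append(0.0)   # assumption: undefined cases are scored as neutral
            continue
        degrees.append(mean(without_j) - mean(with_j))   # PCE_j - SCE_j
    return degrees


# Three made-up sub-models over three variables.
subsets = [{0, 1}, {1, 2}, {0, 2}]
rmse = [0.2, 0.5, 0.3]
print([round(d, 3) for d in displacement_degree(subsets, rmse, 3)])
```

A positive value means that models containing the variable have a lower error on average than models without it, so the variable is a candidate important variable.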

The elite variables and the important variables are merged into a new subset of variables, and the process is repeated. And obtaining N variable subsets through N iterations, and finally selecting the variable subset with the minimum mean root mean square error and relatively low standard deviation value as the optimal subset by utilizing cross validation.

The LSSVMlabv1_8_R2009b_R2011a toolbox is used to train four least squares support vector machine (LSSVM) regression sub-models. LSSVM is an SVM whose loss function is quadratic, and its objective function is as follows:

$$\min_{\omega,b,\xi}\ \frac{1}{2}\|\omega\|^2 + \frac{C}{2}\sum_{i=1}^{n}\xi_i^2,\qquad \text{s.t.}\ \ y_i = \omega^T\varphi(x_i) + b + \xi_i,\ \ i = 1, 2, \ldots, n \tag{10}$$

where $x_i$ is a sample vector, $y_i$ is the sample label (the target value in regression), $\omega$ is a vector whose dimension equals the feature dimension of the sample, $b$ is a real number, $n$ is the total number of samples, $C$ is a penalty factor, $\xi_i$ is a slack variable, and $\varphi(\cdot)$ is a non-linear mapping function that maps the sample space to a high-dimensional feature space.

The RBF kernel function is used, as follows:

$$K(x, x_i) = \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right) \tag{11}$$

The final model structure of the LSSVM is then:

$$y(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i) + b \tag{12}$$

where the model parameters $[b, \alpha_1, \alpha_2, \ldots, \alpha_n]$ are obtained by solving the LSSVM objective function with the Lagrange method, and $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_n]$ are the Lagrange multipliers.
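Solving the LSSVM objective by the Lagrange method reduces to a single linear system in b and α. The sketch below shows this in plain Python under stated assumptions: the objective carries the penalty as (C/2) times the sum of squared slacks, so 1/C appears on the diagonal, and the kernel uses the 2σ² convention. It is an illustration, not the toolbox's implementation, and the data and function names are made up.

```python
import math


def rbf(x, z, sigma):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2.0 * sigma ** 2))


def solve(A, rhs):
    """Solve A u = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    u = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * u[c] for c in range(r + 1, n))
        u[r] = (M[r][n] - s) / M[r][r]
    return u


def lssvm_fit(X, y, C, sigma):
    """Return (b, alpha) from the system [[0, 1^T], [1, K + I/C]] [b; alpha] = [0; y]."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0] + [rbf(X[i], X[j], sigma) + (1.0 / C if i == j else 0.0)
                       for j in range(n)]
        A.append(row)
    solution = solve(A, [0.0] + list(y))
    return solution[0], solution[1:]


def lssvm_predict(x, X, b, alpha, sigma):
    """Evaluate y(x) = sum_i alpha_i K(x, x_i) + b."""
    return sum(a * rbf(x, xi, sigma) for a, xi in zip(alpha, X)) + b


# Made-up one-dimensional training data.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.1, 0.9, 2.1, 2.9]
b, alpha = lssvm_fit(X, y, C=100.0, sigma=1.0)
print(round(lssvm_predict([1.5], X, b, alpha, sigma=1.0), 3))
```

In this formulation the training residual of each sample equals α_i / C, so a larger penalty factor forces the model closer to the training targets, while the bandwidth σ controls how smooth the fitted curve is between them.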

In the LSSVMlabv1_8_R2009b_R2011a toolbox, an LSSVM model is set up by initializing the model parameters, and the tunelssvm function outputs the optimal penalty factor C and kernel parameter g found by grid search. The initial values of C and g are set to 100 and 0.01. The tunelssvm function is called as model = tunelssvm(model_ori, optfun, costfun, costfun_args), with the input parameters set to costfun = 'crossvalidatelssvm'; costfun_args = {10, 'mse'}; optfun = 'gridsearch'; model_ori = initlssvm(trnX, trnY, 'function estimation', c, g, 'RBF_kernel'). A regression model is then built with the trainlssvm function; the model structure it outputs is passed as the main input to the simlssvm function, which outputs the predicted value Y of an unknown sample.

And S105, acquiring spectral data of the sample to be detected, determining the type of the sample to be detected according to the relation model, and analyzing and predicting by adopting a regression sub-model corresponding to the type of the sample to be detected to obtain the concentrations of nitrate and nitrite in the sample to be detected.

As shown in FIG. 3, an embodiment of the present invention provides an algorithm framework for content analysis of a sample to be tested. The spectral data of the sample are acquired, and the sample is classified by the support vector machine (SVM) classification model, the random forest (RF) classification model and the logistic regression (LR) model to obtain three categories i, j and k; the majority category among i, j and k is selected as the category l of the sample.

TABLE 2 Category determination when the votes of the three base classifiers are inconsistent

In the embodiment of the invention, after the category l is obtained, the variable subset selected by the stable variable displacement method for the corresponding sub-region is used, the regression model built with the least squares support vector machine for that sub-region is applied, and the predicted nitrate and nitrite concentrations are finally obtained.
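The routing of a sample to its regression sub-models can be sketched as below. Everything in this block is a hypothetical interface chosen for illustration: the classifier, the wavelength subsets and the sub-models are passed in as plain Python callables and dictionaries, and the stub values at the bottom are made up.

```python
def predict_concentrations(spectrum, classify, wavelength_subsets, sub_models):
    """Route a spectrum to the regression sub-models of its predicted region.

    classify           -- callable returning the region label (1-4) for a spectrum
    wavelength_subsets -- {(region, analyte): list of selected wavelength indices}
    sub_models         -- {(region, analyte): callable mapping features to mg N/L}
    All of these are hypothetical interfaces used only for illustration.
    """
    region = classify(spectrum)
    result = {"region": region}
    for analyte in ("nitrate", "nitrite"):
        indices = wavelength_subsets[(region, analyte)]
        features = [spectrum[i] for i in indices]
        result[analyte] = sub_models[(region, analyte)](features)
    return result


# Stub components standing in for the trained classifier and sub-models.
classify = lambda spectrum: 1 if spectrum[0] < 0.5 else 4
wavelength_subsets = {(1, "nitrate"): [0, 1], (1, "nitrite"): [2]}
sub_models = {(1, "nitrate"): lambda f: 2.0 * sum(f),
              (1, "nitrite"): lambda f: f[0] + 1.0}

print(predict_concentrations([0.25, 0.5, 1.0], classify, wavelength_subsets, sub_models))
```

Keeping a separate wavelength subset and sub-model per (region, analyte) pair mirrors the statement that a separate variable subset is selected for nitrite and for nitrate in each sub-region.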

In the embodiment of the invention, leave-one-out cross validation is adopted as the evaluation strategy, and four classical parameters, the average relative error (ARE), maximum relative error (MRE), root mean square error of prediction (RMSEP) and coefficient of determination (R²), are used to evaluate the performance of the established model. All procedures of this example were carried out in MATLAB.
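The four evaluation parameters can be computed as in the sketch below. The definitions used are the usual ones (relative errors taken against the true value and expressed in percent, R² as one minus the ratio of residual to total sum of squares); the source does not spell out its exact formulas, so these are assumptions, and the numbers are made up.

```python
import math


def evaluate(true_values, predicted):
    """Return ARE and MRE (percent), RMSEP and R^2 for paired concentrations."""
    n = len(true_values)
    rel = [abs(p - t) / t for t, p in zip(true_values, predicted)]
    sq = [(p - t) ** 2 for t, p in zip(true_values, predicted)]
    mean_true = sum(true_values) / n
    ss_tot = sum((t - mean_true) ** 2 for t in true_values)
    return {
        "ARE_percent": 100.0 * sum(rel) / n,
        "MRE_percent": 100.0 * max(rel),
        "RMSEP": math.sqrt(sum(sq) / n),
        "R2": 1.0 - sum(sq) / ss_tot,
    }


# Made-up true and predicted concentrations in mg N/L.
metrics = evaluate([1.0, 2.0, 4.0], [1.1, 1.8, 4.0])
print({k: round(v, 4) for k, v in metrics.items()})
```

Because the relative error divides by the true concentration, the same absolute error weighs much more heavily at low concentration, which is why ARE and MRE are sensitive to performance in the low-concentration zones.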

As shown in table 3, the results of the predictive analysis of the concentration of the mixed solution by the hybrid machine learning model of the present invention and the single machine learning model are compared, wherein the single machine learning model first selects the characteristic wavelength by using SVP, and then establishes a model by using LSSVM.

TABLE 3 test results using different algorithms

As can be seen from Table 3, with the prediction method using the hybrid machine learning model of the present invention, the average relative error for nitrate is reduced from 6.25% to 1.64% and the maximum relative error from 39.96% to 5.01%; the average relative error for nitrite is reduced from 12.37% to 4.58% and the maximum relative error from 79.81% to 9.23%. FIG. 4 and FIG. 5 compare the nitrate and nitrite concentration predictions of the single model and the hybrid model provided in the embodiment of the present invention. With single modeling, the average relative error is small (<10%) when the analyte concentration is relatively high, but the prediction error increases greatly when the analyte concentration is below 0.4 mg N L^-1. With the hybrid machine learning model of the invention, the average relative error is kept below 5% however the analyte concentration varies within the modeling region, and the performance is more stable.

The embodiment of the invention provides a hybrid machine learning model that combines classification and regression algorithms, which addresses the unbalanced precision of a single model when predicting nitrate and nitrite. In addition, a support vector machine, a random forest and logistic regression are used to establish a joint classifier, so that the classification system is optimized. The experimental results show that, compared with other direct spectroscopic methods using a single model, the method markedly reduces the maximum relative error in predicting nitrate and nitrite concentrations and improves the prediction precision for low-concentration components. It should be understood that the method of the present invention is applicable not only to the mixed solutions of nitrate and nitrite prepared in the present embodiment at certain concentration ratios, but also to any water sample, in any concentration range, with nitrate and nitrite as main components.
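
By way of illustration only, the classify-then-regress flow of the hybrid model can be sketched as follows. The sketch assumes that the joint classifier combines the three classifiers by simple majority vote and falls back to the first classifier on a complete tie; that combination rule, and the names `joint_vote`, `hybrid_predict`, `classifiers` and `sub_models`, are hypothetical. The classifiers and regression sub-models are passed in as plain callables standing in for trained models.

```python
from collections import Counter

def joint_vote(labels):
    """Majority vote over the labels proposed by the individual classifiers.

    Assumed tie rule: if every classifier disagrees, keep the first label.
    """
    top_label, top_count = Counter(labels).most_common(1)[0]
    if top_count == 1 and len(labels) > 1:
        return labels[0]
    return top_label

def hybrid_predict(spectrum, classifiers, sub_models):
    """Classify the spectrum, then apply the regression sub-model of that class.

    classifiers: callables mapping a spectrum to a category label.
    sub_models:  dict mapping a category label to a callable that returns
                 (nitrate_concentration, nitrite_concentration).
    """
    votes = [clf(spectrum) for clf in classifiers]
    category = joint_vote(votes)
    nitrate, nitrite = sub_models[category](spectrum)
    return category, nitrate, nitrite
```

Keeping the vote separate from the dispatch makes it straightforward to replace the voting rule without touching the regression sub-models.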

The above-mentioned embodiments express only several embodiments of the present invention and are described specifically and in detail, but they are not to be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It should be understood that, although the steps in the flowcharts of the embodiments of the present invention are shown in the sequence indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and described and may be performed in other orders. Moreover, at least a portion of the steps in the various embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

Claims

1. A method for simultaneously detecting the content of nitrate and nitrite in water based on a hybrid machine learning model is characterized by specifically comprising the following steps:

S3, establishing a relation model between the nitrogen content of the nitrate and nitrite corresponding to each type of the four types of samples and the corresponding spectral data;

S4, taking samples in the sub-regions and on the classification boundaries as modeling samples, screening characteristic wavelengths with high sensitivity and correlation, and establishing a regression sub-model; and

2. The method for simultaneous detection of nitrate and nitrite contents in water based on hybrid machine learning model according to claim 1, wherein the step S3 is specifically as follows:

3. The method for simultaneously detecting the nitrate and nitrite contents in water based on the hybrid machine learning model as claimed in claim 2, wherein the obtaining the support vector machine classification model specifically comprises:

s.t. y_i(ω^T x_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0, i = 1, 2, ..., l

where x_i is a sample vector, y_i is the sample class label, ω is a vector whose dimension is equal to the feature dimension of the sample, b is a real number, n is the total number of samples, C is the penalty factor, and ξ_i is a slack variable;

where x_i and x_j are the feature vectors of samples in the low-dimensional space, and σ is the bandwidth of the Gaussian kernel, i.e., the kernel parameter.
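
By way of illustration only, and because the formulas of this claim are not reproduced in the present text, the standard soft-margin support vector machine objective and Gaussian kernel that are consistent with the symbols defined above may be written as follows. This is an assumed textbook form, not a reconstruction of the original formula; the upper summation index l follows the constraint shown above.

```latex
\[
\min_{\omega,\, b,\, \xi}\ \frac{1}{2}\lVert \omega \rVert^{2} + C \sum_{i=1}^{l} \xi_i
\qquad \text{s.t.}\quad y_i\left(\omega^{T} x_i + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, 2, \ldots, l
\]
\[
K(x_i, x_j) = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}}\right)
\]
```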

4. The method for simultaneously detecting the nitrate and nitrite content in water based on the hybrid machine learning model as claimed in claim 2, wherein the obtaining the random forest classification model specifically comprises:

where C_k is the number of samples in the k-th category;
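
By way of illustration only, a class count C_k of this kind is what the Gini index uses to score splits in the decision trees of a random forest. The sketch below assumes the usual definition, one minus the sum of squared class proportions; the formula of this claim is not reproduced in the present text, so this choice and the function name `gini_index` are assumptions.

```python
from collections import Counter

def gini_index(labels):
    """Gini index of a set of class labels: 1 - sum_k (C_k / N)^2.

    C_k is the number of samples in the k-th category, N the total count.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c_k / n) ** 2 for c_k in Counter(labels).values())
```

A pure node gives 0, and four equally populated categories give 0.75, the maximum for a four-class problem.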

5. The method for simultaneous detection of nitrate and nitrite content in water based on hybrid machine learning model according to claim 2, wherein the obtaining of the logistic regression model specifically comprises:

the logistic regression model is as follows:

the loss function of the model is:

where ω is a weight, N is the number of samples,

is the probability that the sample is positive, and y_n is the sample class label, 0 or 1.
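
By way of illustration only, a logistic regression model and loss consistent with the symbols defined above can be sketched as follows. The sketch assumes the usual sigmoid model and the cross-entropy loss averaged over the N samples; whether the original loss carries the 1/N factor is not recoverable from the present text, and the function names are hypothetical.

```python
import math

def predict_proba(w, x):
    """Probability that sample x is positive: sigmoid of the weighted sum."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy_loss(w, X, y):
    """Average cross-entropy over N samples with labels y_n in {0, 1}.

    Simple sketch: no guard against probabilities of exactly 0 or 1.
    """
    total = 0.0
    for x_n, y_n in zip(X, y):
        p = predict_proba(w, x_n)
        total += y_n * math.log(p) + (1 - y_n) * math.log(1.0 - p)
    return -total / len(X)
```

With all weights at zero every probability is 0.5, so the loss equals ln 2 regardless of the labels, which is a convenient sanity check.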

6. The method for simultaneous detection of nitrate and nitrite contents in water based on hybrid machine learning model as claimed in claim 1, wherein in step S4 the characteristic wavelengths are selected by the stability and variable permutation (SVP) method, and establishing the optimal variable subset specifically comprises:

obtaining sub-datasets of the sample space and the variable space by Monte Carlo sampling, calculating the stability of each variable in the sub-datasets of the sample space, and obtaining elite variables with high stability, the stability S_j being calculated as follows:
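
By way of illustration only, a per-variable stability of this kind can be sketched as follows. The stability formula is not reproduced in the present text; the sketch assumes a commonly used definition, the absolute mean of a variable's regression coefficient over the Monte Carlo sub-datasets divided by its sample standard deviation. That definition and the names `variable_stability` and `coef_runs` are assumptions.

```python
import statistics

def variable_stability(coef_runs):
    """Stability of each variable across Monte Carlo sub-models.

    coef_runs: list of coefficient vectors, one per sampled sub-dataset
               (at least two runs are needed for a standard deviation).
    Assumed definition: S_j = |mean(b_j)| / std(b_j).
    """
    n_vars = len(coef_runs[0])
    stability = []
    for j in range(n_vars):
        column = [run[j] for run in coef_runs]
        sd = statistics.stdev(column)
        mean_abs = abs(statistics.mean(column))
        # A coefficient that never varies is treated as maximally stable.
        stability.append(mean_abs / sd if sd > 0 else float("inf"))
    return stability
```

Variables would then be ranked by this score, the highest-scoring ones being retained as elite variables.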

7. The method for simultaneously detecting the nitrate and nitrite contents in water based on the hybrid machine learning model as claimed in claim 6, wherein the regression submodel in the step S4 is:

wherein,

x_i is the sample vector, σ is the bandwidth of the Gaussian kernel, i.e., the kernel parameter, and [b α₁ α₂ … α_n] are constants that can be obtained by solving the objective function of the least squares support vector machine by the Lagrange method.
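
By way of illustration only, a least squares support vector machine regression with a Gaussian kernel can be sketched as follows. The sketch assumes the standard formulation, in which the Lagrange conditions reduce to one linear system in [b α₁ … α_n] and the prediction is b plus the kernel-weighted sum of the α_i. The regularization parameter `gamma` and all function names are assumptions not found in the original text.

```python
import math

def rbf(x, z, sigma):
    """Gaussian kernel with bandwidth sigma."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def lssvm_fit(X, y, sigma, gamma):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b, alpha]^T = [0, y]^T."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        A.append([1.0] + [rbf(X[i], X[j], sigma) + (1.0 / gamma if i == j else 0.0)
                          for j in range(n)])
    sol = solve(A, [0.0] + list(y))
    return sol[0], sol[1:]  # b, [alpha_1 ... alpha_n]

def lssvm_predict(x, X, b, alpha, sigma):
    """Prediction: b + sum_i alpha_i * K(x, x_i)."""
    return b + sum(a * rbf(x, x_i, sigma) for a, x_i in zip(alpha, X))
```

With a very large `gamma` the model nearly interpolates the training samples; smaller values trade training fit for smoothness.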

8. The method for simultaneous detection of nitrate and nitrite contents in water based on hybrid machine learning model as claimed in claim 2, wherein the determination of the type of the sample to be detected according to the relationship model in step S5 is specifically as follows:

9. The method for simultaneous detection of nitrate and nitrite in water based on hybrid machine learning model as claimed in claim 1, wherein the conditions for determining the spectral data in steps S1 and S5 are:

10. The method for simultaneously detecting the contents of nitrate and nitrite in water based on the hybrid machine learning model as claimed in claim 1, wherein the optimal critical concentration in the step S2 is 0.4 mg N L^-1.
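
By way of illustration only, dividing samples into four types around the critical concentration can be sketched as follows. The sketch assumes that the four types arise from nitrate nitrogen and nitrite nitrogen each being below, or at or above, 0.4 mg N L^-1; that reading, the handling of values exactly at the threshold, and the category numbering 1 to 4 are assumptions not stated in the present text.

```python
CRITICAL_CONC = 0.4  # mg N/L, the optimal critical concentration of step S2

def sample_category(nitrate_n, nitrite_n, critical=CRITICAL_CONC):
    """Assign one of four hypothetical categories to a training sample.

    1: both low, 2: nitrite high only, 3: nitrate high only, 4: both high.
    A value equal to the critical concentration is treated as high (assumed).
    """
    nitrate_high = nitrate_n >= critical
    nitrite_high = nitrite_n >= critical
    return 1 + 2 * int(nitrate_high) + int(nitrite_high)
```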