CN112750507B - Method for simultaneously detecting nitrate and nitrite contents in water based on hybrid machine learning model - Google Patents
- Publication number
- CN112750507B (CN202110054882.XA)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/20—Identification of molecular entities, parts thereof or of chemical compositions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/10—Machine learning using kernel methods, e.g. support vector machines [SVM]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/259—Fusion by voting
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Medical Informatics (AREA)
- Evolutionary Biology (AREA)
- Crystallography & Structural Chemistry (AREA)
- Chemical & Material Sciences (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Investigating Or Analyzing Non-Biological Materials By The Use Of Chemical Means (AREA)
Description
Technical field
The invention belongs to the field of spectral signal analysis, and specifically relates to a method for simultaneously detecting nitrate and nitrite content in water based on a hybrid machine learning model.
Background
A variety of detection technologies for nitrogen-containing compounds are currently available on the market, and they differ greatly in detection principle, calculation method, operating procedure, and field of application. The relatively mature instrumental methods for multi-component concentration analysis, in China and abroad, are mainly electrochemistry, capillary electrophoresis, ion chromatography, biosensing, and spectrophotometry. Electrochemical measurement is inadequate for monitoring trace analyte concentrations, and in real samples the electrode surface is easily contaminated, which tends to make the results unstable. Methods based on capillary electrophoresis are relatively reliable, but they require large instruments and are complex to operate, so on-site automated monitoring is difficult. Chromatography can determine the concentrations of several ionic components simultaneously and is very safe, but the equipment needs frequent maintenance, and the method is time-consuming and expensive. Biosensor methods still have to resolve problems of operational robustness, selectivity, and standardization. Spectroscopic techniques such as ultraviolet-visible, near-infrared, and fluorescence spectroscopy are non-destructive, general-purpose, and flexible; they have all the characteristics required for online monitoring and are currently among the more economical, fast, and simple options. Based on the light absorption characteristics of nitrate and nitrite, fast and simple UV-visible spectrophotometry is selected as the basic detection method.
Traditional spectrophotometric determination of nitrate and nitrite commonly uses sequential analysis: nitrite in the sample is first determined with the Griess reagent method; a second, identical sample is then reduced (usually on a copper/cadmium column) so that all nitrate is converted to nitrite; the nitrite analysis is repeated, and the nitrate concentration is calculated from the difference. For nitrate this is an indirect analysis, which is time-consuming and depends heavily on the accuracy of the nitrite determination. In addition, the Griess method involves toxic chemical reagents that are harmful to health and pollute the environment. Some researchers have proposed determining nitrate and nitrite directly from their ultraviolet absorption spectra. However, the two spectra have similar shapes in the first half of the range, and their absorption peak wavelengths are so close that they nearly overlap, so in practice it is difficult to separate the contributions of nitrite and nitrate in a measured spectrum. Existing direct spectroscopic methods still process spectral data with traditional chemometric methods and suffer from a narrow range of application and low detection accuracy. In recent years, the combination of UV spectroscopy and machine learning has been applied successfully to the rapid detection of many compounds, but there has been little research on separating nitrate and nitrite. Preliminary experiments show that, for mixed solutions of nitrate and nitrite over a given concentration range, ordinary machine learning models are not sensitive enough when predicting components at low concentration. A machine learning method is therefore urgently needed whose detection accuracy stays at the same level when the analyte concentration varies widely.
Summary of the invention
In view of the above problems, the present invention provides a method for simultaneously detecting nitrate and nitrite content in water based on a hybrid machine learning model. The method combines classification and regression algorithms so that the detection accuracy for nitrate and nitrite is balanced over the entire modeling range. It is simple to operate and low in cost, and achieves accurate, rapid, simultaneous detection of nitrate and nitrite in simple environments.
The present invention provides a method for simultaneously detecting nitrate and nitrite content in water based on a hybrid machine learning model, which specifically includes:
S1: Prepare a series of mixed nitrate and nitrite solution samples with different nitrogen contents, and measure the spectral data of the samples;
S2: Form a two-dimensional plane from the nitrogen contents of nitrate and nitrite in the samples, obtain the optimal critical concentration, and divide the plane into four sub-regions; the samples in each sub-region form one class, giving four classes of samples;
S3: Build a relationship model between the nitrate and nitrite nitrogen contents of each of the four classes of samples and the corresponding spectral data, so that samples can be classified automatically;
S4: Taking the samples inside each sub-region and on its classification boundary as modeling samples, screen characteristic wavelengths with high sensitivity and correlation, and build a regression sub-model;
S5: Acquire the spectral data of the sample to be tested, determine its class using the relationship model, and analyze it with the regression sub-model corresponding to that class to obtain the nitrate and nitrite concentrations of the sample.
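The classify-then-regress flow of step S5 can be sketched as follows. This is only an illustrative skeleton: the function name `predict_hybrid`, the callable types, and the toy models in the usage lines are assumptions, not part of the patent.

```python
from typing import Callable, Dict, Sequence, Tuple

Spectrum = Sequence[float]
# A regression sub-model maps a spectrum to (nitrate, nitrite) in mg N/L.
Regressor = Callable[[Spectrum], Tuple[float, float]]


def predict_hybrid(
    spectrum: Spectrum,
    classify: Callable[[Spectrum], int],
    submodels: Dict[int, Regressor],
) -> Tuple[int, float, float]:
    """Pick the sub-region with the classifier, then apply that region's regressor."""
    region = classify(spectrum)          # class 1..4 from the relationship model
    nitrate, nitrite = submodels[region](spectrum)
    return region, nitrate, nitrite


# Toy stand-ins (hypothetical): a classifier that always answers region 2
# and constant regressors for each region.
toy_submodels = {r: (lambda s, r=r: (0.5 * r, 0.1 * r)) for r in (1, 2, 3, 4)}
result = predict_hybrid([0.12, 0.34, 0.08], lambda s: 2, toy_submodels)
```

Keeping the classifier and the sub-models as plain callables lets any trained model be plugged in without changing the dispatch logic.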
Further, step S3 is specifically:
The nitrate and nitrite nitrogen contents of each of the four classes of samples, together with the corresponding spectral data, are used to train a support vector machine classification model, a random forest classification model, and a logistic regression model.
Further, obtaining the support vector machine classification model specifically includes:
The objective function of the support vector machine classification model is:
$$\min_{\omega, b, \xi}\ \frac{1}{2}\|\omega\|^{2} + C\sum_{i=1}^{l}\xi_i$$
$$\text{s.t.}\quad y_i(\omega^{T}x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, 2, \ldots, l$$
where $x_i$ is the sample vector, $y_i$ is the class label of the sample, $\omega$ is a vector whose dimension equals the feature dimension of the samples, $b$ is a real number, $l$ is the total number of samples, $C$ is the penalty factor, and $\xi_i$ is the slack variable;
The Gaussian kernel function is selected as the kernel function of the support vector machine, and its expression is:
$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^{2}}{2\sigma^{2}}\right)$$
where $x_i$ and $x_j$ are the feature vectors of the samples in the low-dimensional (input) space, and $\sigma$ is the bandwidth of the Gaussian kernel, that is, the kernel parameter.
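As a small illustration of the Gaussian kernel, the sketch below evaluates it for two feature vectors. It assumes the common form exp(-||x_i - x_j||² / (2σ²)); some toolboxes parameterize the same kernel with γ = 1/(2σ²) instead, and the function name is hypothetical.

```python
import math
from typing import Sequence


def gaussian_kernel(x_i: Sequence[float], x_j: Sequence[float], sigma: float) -> float:
    """Gaussian (RBF) kernel, assuming the exp(-d^2 / (2 sigma^2)) convention."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


# Identical vectors give the maximum similarity of 1; the value decays
# toward 0 as the vectors move apart relative to sigma.
same = gaussian_kernel([0.2, 0.7], [0.2, 0.7], sigma=0.5)
apart = gaussian_kernel([0.0, 0.0], [1.0, 1.0], sigma=1.0)
```

A larger `sigma` makes the kernel flatter, so distant training samples influence a prediction more; this is why the bandwidth is tuned together with the penalty factor.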
Further, obtaining the random forest classification model specifically includes:
Bootstrap samples are drawn from the samples and a CART tree is built on them; at each node of the CART tree several features are drawn, the Gini index of each feature is calculated, and the classification features with classification ability are obtained. The Gini index of a sample set $D$ is calculated as
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K}\left(\frac{C_k}{|D|}\right)^{2}$$
where $C_k$ is the number of samples in the $k$-th class, $|D|$ is the total number of samples in $D$, and $K$ is the number of classes;
Classification is performed according to the classification features to obtain a tree structure whose nodes are completely split.
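The Gini index used to score candidate splits can be computed directly from class labels. The sketch below assumes the standard definition 1 - Σ (C_k / |D|)² and a hypothetical helper name.

```python
from collections import Counter
from typing import Hashable, Sequence


def gini_index(labels: Sequence[Hashable]) -> float:
    """Gini impurity of a label set: 1 - sum over classes of (count / total)^2."""
    total = len(labels)
    if total == 0:
        return 0.0  # convention chosen here for an empty node
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


pure_node = gini_index([3, 3, 3, 3])       # only one class present
mixed_node = gini_index([1, 2, 3, 4])      # four equally frequent classes
```

A pure node scores 0, and with four equally frequent classes (the case here) the impurity reaches its maximum of 0.75, so a good split is one that lowers the weighted Gini index of the child nodes.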
Further, obtaining the logistic regression model specifically includes:
The logistic regression model is:
$$y = \frac{1}{1 + e^{-w^{T}x}}$$
where $w$ is the weight vector, $x$ is the input sample data, and $y$ is the probability that the sample belongs to the positive class of the classifier;
The loss function of the model is:
$$J(w) = -\frac{1}{N}\sum_{n=1}^{N}\left[y_n \ln \hat{y}_n + (1 - y_n)\ln\left(1 - \hat{y}_n\right)\right]$$
where $w$ is the weight vector, $N$ is the number of samples, $\hat{y}_n$ is the probability that the $n$-th sample belongs to the positive class, and $y_n$ is the class label of the sample, 0 or 1.
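The logistic model and its loss can be illustrated with a few lines of Python. This sketch assumes the usual sigmoid model and a cross-entropy loss averaged over the N samples; the function names and the clipping constant are illustrative choices, not taken from the patent.

```python
import math
from typing import Sequence


def predict_proba(weights: Sequence[float], x: Sequence[float]) -> float:
    """Probability of the positive class: sigmoid of the weighted sum w^T x."""
    z = sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy_loss(probabilities: Sequence[float], labels: Sequence[int]) -> float:
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


# With zero weights every sample gets probability 0.5, which gives a loss of ln 2.
p = predict_proba([0.0, 0.0], [1.0, 2.0])
loss = cross_entropy_loss([p, p], [1, 0])
```

Because each logistic model is binary, a four-class problem such as this one would need a one-vs-rest arrangement or a multinomial extension; which of these the patent uses is not stated.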
Further, in step S4 the stabilization and variable permutation method is used to select the characteristic wavelengths, and the optimal variable subset is established as follows:
Monte Carlo sampling is used to obtain sub-datasets in the sample space and in the variable space. In the sub-datasets of the sample space, the stability of each variable is calculated to obtain elite variables with high stability. The stability $S_j$ is calculated as
$$S_j = \left|\frac{\bar{b}_j}{\sqrt{\dfrac{1}{M-1}\sum_{i=1}^{M}\left(b_{ij} - \bar{b}_j\right)^{2}}}\right|$$
where $b_{ij}$ is the regression coefficient of the $j$-th variable for the $i$-th sample, $\bar{b}_j$ is the mean regression coefficient of the $j$-th variable, and $M$ is the total number of samples;
In the sub-datasets of the variable space, variable permutation analysis is performed and the permutation degree is calculated to obtain important variables with a high permutation degree. The permutation degree $PD_j$ is calculated as $PD_j = PCE_j - SCE_j$, where $PCE_j$ is the mean root-mean-square error of the models built on the wavelength subsets that do not contain variable $j$, and $SCE_j$ is the mean root-mean-square error of the models built on the remaining wavelength subsets that contain variable $j$;
The elite variables and important variables are merged, and the optimal variable subset is obtained by cross-validation.
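The two variable-selection scores can be sketched as below. The sketch assumes that stability is the absolute mean of a variable's regression coefficients divided by their sample standard deviation (the usual definition in Monte Carlo based selection), and that each row of `coefficients` comes from one Monte Carlo sub-model. The function names and toy numbers are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def stability(coefficients: Sequence[Sequence[float]]) -> List[float]:
    """S_j = |mean_i(b_ij)| / std_i(b_ij) for each variable j (column)."""
    n_variables = len(coefficients[0])
    scores = []
    for j in range(n_variables):
        column = [row[j] for row in coefficients]
        scores.append(abs(mean(column)) / stdev(column))
    return scores


def permutation_degree(rmse_without_j: Sequence[float], rmse_with_j: Sequence[float]) -> float:
    """PD_j = PCE_j - SCE_j: mean RMSE without variable j minus mean RMSE with it."""
    return mean(rmse_without_j) - mean(rmse_with_j)


# Toy coefficients from three sub-models and two wavelengths:
# the first wavelength has a consistent sign and size, the second does not.
scores = stability([[1.0, 0.5], [1.2, -0.5], [0.8, 0.1]])
pd_j = permutation_degree([0.30, 0.34], [0.20, 0.22])
```

A positive `pd_j` means the models get worse when variable j is left out, which is the sense in which a high permutation degree marks an important wavelength.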
Further, the final model structure in step S4 is:
$$y(x) = \sum_{i=1}^{n}\alpha_i \exp\left(-\frac{\|x - x_i\|^{2}}{2\sigma^{2}}\right) + b$$
where $x_i$ is the sample vector, $\sigma$ is the bandwidth of the Gaussian kernel, that is, the kernel parameter, and $[b\ \alpha_1\ \alpha_2\ \ldots\ \alpha_n]$ are constants obtained by solving the least squares support vector machine objective function with the Lagrangian method.
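Once the constants are known, evaluating the regression sub-model is a weighted sum of kernel values plus a bias. The sketch below assumes the Gaussian kernel form exp(-||x - x_i||² / (2σ²)) and made-up values for the training vectors, `alphas`, and `bias`; solving for those constants is not shown.

```python
import math
from typing import Sequence


def lssvm_predict(
    x: Sequence[float],
    support_vectors: Sequence[Sequence[float]],
    alphas: Sequence[float],
    bias: float,
    sigma: float,
) -> float:
    """y(x) = sum_i alpha_i * exp(-||x - x_i||^2 / (2 sigma^2)) + b."""
    total = bias
    for x_i, alpha_i in zip(support_vectors, alphas):
        squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_i))
        total += alpha_i * math.exp(-squared_distance / (2.0 * sigma ** 2))
    return total


# Hypothetical two-sample model evaluated at one of its training points.
train = [[0.1, 0.2], [0.4, 0.9]]
y_hat = lssvm_predict([0.1, 0.2], train, alphas=[2.0, -0.5], bias=1.0, sigma=0.3)
```

In a least squares SVM every training sample carries a coefficient, so the sum runs over all n modeling samples rather than over a sparse set of support vectors.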
Further, determining the class of the sample to be tested according to the relationship model in step S5 is specifically:
The support vector machine classification model, the random forest classification model, and the logistic regression model are each used to classify the sample, giving three class predictions, and the class predicted by the majority of the three models is taken as the class of the sample to be tested.
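The majority vote over the three classifiers can be sketched as follows. The text does not say what happens when all three models disagree, so falling back to the support vector machine's answer is purely an assumption made here for illustration.

```python
from collections import Counter


def majority_vote(svm_label: int, rf_label: int, lr_label: int) -> int:
    """Return the class predicted by at least two of the three classifiers."""
    votes = [svm_label, rf_label, lr_label]
    label, count = Counter(votes).most_common(1)[0]
    if count >= 2:
        return label
    # Assumption: with three different answers, trust the SVM prediction.
    return svm_label


agreed = majority_vote(2, 2, 4)     # two models agree on region 2
split = majority_vote(1, 3, 4)      # no majority, falls back to the SVM label
```

With four classes and three voters a three-way split is possible, so any implementation needs some explicit tie rule.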
Further, the conditions for measuring the spectral data in steps S1 and S5 are:
The spectral scanning range is 190-400 nm and the scanning interval is 1 nm.
Further, the optimal critical concentration in step S2 is 0.4 mg N L⁻¹.
Beneficial effects:
The present invention prepares a series of mixed nitrate and nitrite solutions in advance, measures their spectral data, and uses these data with classification and regression algorithms to build a hybrid machine learning model. With this model, only the spectral data of the sample to be tested need to be measured in order to determine its nitrate and nitrite content accurately and quickly. The detection accuracy for nitrate and nitrite is balanced over the entire modeling range, the prediction accuracy for low-concentration components is improved, and the method is simple to operate and low in cost.
It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and do not limit the present disclosure.
Description of the drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a flow chart of a method for simultaneously detecting nitrate and nitrite content in water based on a hybrid machine learning model, provided by an embodiment of the present invention;
Figure 2 is a schematic diagram of sample classification provided by an embodiment of the present invention;
Figure 3 is an algorithm framework diagram for content analysis of a sample to be tested, provided by an embodiment of the present invention;
Figure 4 compares the single model and the hybrid model in predicting nitrate concentration, provided by an embodiment of the present invention;
Figure 5 compares the single model and the hybrid model in predicting nitrite concentration, provided by an embodiment of the present invention.
Detailed description
To make the purpose, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.
The present invention is based on the finding that combining ultraviolet spectroscopy with machine learning allows rapid simultaneous detection of nitrate and nitrite, but that ordinary machine learning models, applied to mixed solutions of nitrate and nitrite over a given concentration range, are not sensitive enough when predicting components at low concentration. A machine learning method is therefore needed whose detection accuracy stays at the same level when the analyte concentration varies widely.
As shown in Figure 1, one embodiment provides a method for simultaneously detecting nitrate and nitrite content in water based on a hybrid machine learning model, which specifically includes the following steps:
Step S101: prepare a series of mixed nitrate and nitrite solution samples with different nitrogen contents, and measure the spectral data of the samples.
In this embodiment, standard stock solutions of nitrate nitrogen and nitrite nitrogen are prepared first: 0.7221 g of dried potassium nitrate or 0.4928 g of dried sodium nitrite is weighed, dissolved in an appropriate amount of fresh deionized water, transferred to a 1000 mL volumetric flask, diluted to the mark with deionized water, and mixed well. Before use, the stock is diluted to a working standard of 10 mg N L⁻¹. All reagents are of analytical grade (Sinopharm Chemical Reagent Co., Ltd., China). Mixed solutions are prepared with nitrite nitrogen concentrations of 0.1, 0.2, 0.3, 0.4, 0.8, 1.2, 1.6, 2.0, 2.5, and 3.0 mg N L⁻¹ and nitrate nitrogen concentrations of 0.1, 0.2, 0.3, 0.4, 0.8, 1.2, 1.6, 2.0, 2.5, and 3.0 mg N L⁻¹, giving 100 mixed samples in total. Deionized water is used as the reference solution for background subtraction, and the spectral data are measured at 1 nm intervals over the 190-400 nm wavelength range.
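The quantities in this step can be checked with a short calculation. The sketch below uses standard molar masses (N 14.007, KNO₃ 101.103, NaNO₂ 68.995 g/mol, rounded); the function and variable names are illustrative. It shows that both stock solutions come out at roughly 100 mg N L⁻¹, which is consistent with the later dilution to a 10 mg N L⁻¹ working standard.

```python
from itertools import product

N_MOLAR_MASS = 14.007        # g/mol, nitrogen
KNO3_MOLAR_MASS = 101.103    # g/mol, potassium nitrate
NANO2_MOLAR_MASS = 68.995    # g/mol, sodium nitrite


def stock_n_mg_per_l(salt_mass_g: float, salt_molar_mass: float, volume_l: float = 1.0) -> float:
    """Nitrogen concentration (mg N/L) of a salt with one N atom per formula unit."""
    moles = salt_mass_g / salt_molar_mass
    return moles * N_MOLAR_MASS * 1000.0 / volume_l


nitrate_stock = stock_n_mg_per_l(0.7221, KNO3_MOLAR_MASS)   # about 100 mg N/L
nitrite_stock = stock_n_mg_per_l(0.4928, NANO2_MOLAR_MASS)  # about 100 mg N/L

# Full factorial design: 10 nitrate levels x 10 nitrite levels = 100 mixtures.
LEVELS = [0.1, 0.2, 0.3, 0.4, 0.8, 1.2, 1.6, 2.0, 2.5, 3.0]
design = list(product(LEVELS, LEVELS))

# 190-400 nm at 1 nm steps gives 211 absorbance values per spectrum.
wavelengths = list(range(190, 401))
```

Each spectrum is therefore a 211-dimensional feature vector, which is why the later steps apply dimensionality reduction and wavelength selection before modeling.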
步骤S102,以所述样本中硝酸盐和亚硝酸盐的含氮量构成二维平面,并获取最佳临界浓度,将所述二维平面划分为四个子区域,每个子区域内的样本为一类样本,获得四类样本。Step S102, use the nitrogen content of nitrate and nitrite in the sample to form a two-dimensional plane, obtain the optimal critical concentration, and divide the two-dimensional plane into four sub-regions, and the sample in each sub-region is one Class samples, four classes of samples are obtained.
如图2所示,本发明实施例提供了样本分类示意图,将硝酸盐和亚硝酸盐的浓度平面图划分为四个子区域分别建模,由于对低浓度下的分析物预测灵敏度不足,用于划分子区域的临界浓度被选择在较低的位置,分别选择临界浓度为0.3、0.4和0.8mg N L-1进行建模分析,结果如表1所示,当临界浓度为0.4mg N L-1时,整体模型具有较高的分类准确率和较低的平均相对误差;各子区域中硝酸盐与亚硝酸盐的含量各有不同的特征:区域1中硝酸盐和亚硝酸盐含量均较低;区域2中硝酸盐的含量远高于亚硝酸盐;区域3中硝酸盐的浓度远低于亚硝酸盐的浓度;区域4中硝酸盐和亚硝酸盐含量均较高。相较于单一全模型,每个子模型更适应各个子区域的样本特征,具有更高的预测精度。As shown in Figure 2, the embodiment of the present invention provides a schematic diagram of sample classification. The concentration plan of nitrate and nitrite is divided into four sub-regions for modeling respectively. Due to insufficient sensitivity in predicting analytes at low concentrations, it is used for dividing The critical concentration of the sub-region is selected at a lower position, and the critical concentrations are 0.3, 0.4 and 0.8mg NL -1 respectively for modeling analysis. The results are shown in Table 1. When the critical concentration is 0.4mg NL -1 , The overall model has high classification accuracy and low average relative error; the nitrate and nitrite contents in each sub-region have different characteristics: the nitrate and nitrite contents in region 1 are both low; The nitrate content in 2 is much higher than that of nitrite; the nitrate concentration in zone 3 is much lower than that of nitrite; and the nitrate and nitrite content in zone 4 are both high. Compared with a single full model, each sub-model is more adaptable to the sample characteristics of each sub-region and has higher prediction accuracy.
Table 1. Comparison of model performance at different critical concentrations (mg N L⁻¹)
Step S103: for each of the four classes of samples, a relationship model is established between the nitrogen contents of nitrate and nitrite and the corresponding spectral data, so that samples can be classified automatically.
In the embodiment of the present invention, the nitrogen contents of nitrate and nitrite corresponding to each of the four classes of samples and the corresponding spectral data are used to train a support vector machine classification model, a random forest classification model and a logistic regression model.
In the embodiment of the present invention, the support vector machine classification model is trained with the LIBSVM-farutoUltimateVersion MATLAB toolbox. Its objective function is:
$$\min_{\omega,b,\xi}\ \frac{1}{2}\|\omega\|^{2}+C\sum_{i=1}^{l}\xi_i$$
$$\text{s.t.}\quad y_i(\omega^{T}x_i+b)\ge 1-\xi_i,\quad \xi_i\ge 0,\quad i=1,2,\ldots,l \qquad (1)$$
where $x_i$ is the sample vector, $y_i$ is the sample class label, $\omega$ is a vector whose dimension equals the feature dimension of the samples, $b$ is a real number, $l$ is the total number of samples, $C$ is the penalty factor, and $\xi_i$ is the slack variable;
The Gaussian kernel function is selected as the kernel function of the support vector machine. Its expression is:
$$K(x_i,x_j)=\exp\!\left(-\frac{\|x_i-x_j\|^{2}}{2\sigma^{2}}\right) \qquad (2)$$
where $x_i$ and $x_j$ are the feature vectors of the samples in the low-dimensional space, and $\sigma$ is the bandwidth of the Gaussian kernel;
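As a minimal sketch, the Gaussian kernel can be evaluated as below, assuming the common parameterization exp(-||xi - xj||² / (2σ²)). LIBSVM itself parameterizes the RBF kernel as exp(-gamma·||u - v||²), so under this assumption the toolbox parameter g corresponds to 1/(2σ²). The helper names are illustrative.

```python
import math
from typing import Sequence


def gaussian_kernel(xi: Sequence[float], xj: Sequence[float], sigma: float) -> float:
    """Gaussian (RBF) kernel with bandwidth sigma."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sigma_to_gamma(sigma: float) -> float:
    """Convert a bandwidth sigma to the gamma used in exp(-gamma * ||u - v||^2)."""
    return 1.0 / (2.0 * sigma ** 2)


if __name__ == "__main__":
    print(gaussian_kernel([0.1, 0.2], [0.1, 0.2], sigma=1.0))
    print(sigma_to_gamma(0.5))
```

The kernel equals 1 for identical spectra and decays toward 0 as the spectra move apart, with σ controlling how quickly.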
During SVM modeling, the absorbance data are first normalized, mapping them into the range 0 to 1 to speed up convergence during training. Principal component analysis (PCA) is then used to reduce the dimensionality of the input data, and particle swarm optimization (PSO) is used to tune the two hyperparameters, the penalty factor C and the kernel parameter σ.
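The normalization step above can be sketched as a per-wavelength min-max scaling. This is an assumption about how the mapping to 0 to 1 is done (column-wise, with the scaling fitted on the training spectra); the function names are illustrative, and the PCA and PSO steps are not shown.

```python
from typing import List, Sequence, Tuple


def fit_min_max(train: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Column-wise minima and maxima of the training spectra (rows = samples)."""
    columns = list(zip(*train))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows: Sequence[Sequence[float]],
                  lo: Sequence[float],
                  hi: Sequence[float]) -> List[List[float]]:
    """Scale each column to [0, 1] using the training minima and maxima.

    Columns that are constant in the training set are mapped to 0.0.
    """
    scaled = []
    for row in rows:
        scaled.append([
            (v - l) / (h - l) if h > l else 0.0
            for v, l, h in zip(row, lo, hi)
        ])
    return scaled


if __name__ == "__main__":
    train = [[0.0, 2.0], [1.0, 2.0], [0.5, 2.0]]
    lo, hi = fit_min_max(train)
    print(apply_min_max(train, lo, hi))
```

Fitting the minima and maxima on the training set only, and reusing them for test spectra, keeps the test data out of the preprocessing.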
The SVC function in the LIBSVM-farutoUltimateVersion toolbox integrates the above functions. The function is: [predict_label,accuracy,bestc,bestg]=SVC(train_label,train_data,test_label,test_data,Method_option), where Method_option is a structure. Setting Method_option.scale=1, Method_option.pca=0 and Method_option.type=2 builds the SVM classification model, gives the predicted sample class predict_label, and also outputs the best penalty factor C and kernel parameter g.
In the embodiment of the present invention, the random forest classification model is trained with the RF_MexStandalone-v0.02 MATLAB toolbox. First, the bootstrap method is applied to the original training samples to draw k new bootstrap sample sets at random with replacement, from which k CART trees are built; the samples not drawn each time form k out-of-bag data sets. Assuming there are n features, m features are drawn at random at each node of each tree, the Gini index of each feature is calculated, and the feature with the strongest classification ability is selected to split the node. For a given sample set D with K classes, where the number of samples in the k-th class is $C_k$, the Gini index of D is calculated as:
$$\mathrm{Gini}(D)=1-\sum_{k=1}^{K}\left(\frac{C_k}{|D|}\right)^{2} \qquad (3)$$
If the selected attribute is A, the Gini index of the data set D after splitting is calculated as:
$$\mathrm{Gini}(D,A)=\sum_{j=1}^{k}\frac{|D_j|}{|D|}\,\mathrm{Gini}(D_j) \qquad (4)$$
where k is the number of parts into which the sample set D is divided, that is, the data set D is split into k data sets $D_j$;
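The Gini index used for node splitting can be illustrated with a short sketch. It uses the standard CART definitions (one minus the sum of squared class proportions, and the size-weighted average over the subsets produced by a split); the function names are illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence


def gini(labels: Sequence[Hashable]) -> float:
    """Gini index of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(groups: Sequence[Sequence[Hashable]]) -> float:
    """Size-weighted Gini index of the subsets produced by a split."""
    total = sum(len(g) for g in groups)
    if total == 0:
        return 0.0
    return sum(len(g) / total * gini(g) for g in groups)


if __name__ == "__main__":
    print(gini([1, 1, 2, 2]))
    print(gini_split([[1, 1], [2, 2]]))
```

A split that separates the classes perfectly drives the weighted Gini index to zero, which is why the feature giving the lowest value is chosen at each node.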
The tree structure is formed by fully splitting the nodes, and each CART tree is allowed to grow to its maximum extent. Finally, each generated tree votes on the sample class, and the final classification result for an unknown sample is determined by majority rule.
In the embodiment of the present invention, logistic regression is implemented by a program written in MATLAB. Feeding the output of a linear regression model into the sigmoid function gives the mathematical model of logistic regression:
$$y=\frac{1}{1+e^{-\omega^{T}x}} \qquad (5)$$
where $\omega$ is the weight vector, $x$ is the input sample data, and $y$ is the probability that the sample belongs to the positive class of the classifier;
The loss function measures the difference between the model output and the true output. In logistic regression, the value of the loss function equals the total probability that the samples belong to their respective classes:
$$L(\omega)=\prod_{n=1}^{N}\hat{y}_n^{\,y_n}\,(1-\hat{y}_n)^{1-y_n} \qquad (6)$$
where $\omega$ is the weight vector, $N$ is the number of samples, $\hat{y}_n$ is the probability that the n-th sample belongs to the positive class, and $y_n$ is the sample class label, 0 or 1.
According to the idea of maximum likelihood estimation, the optimal $\omega$ is the one for which this function reaches its maximum. The stochastic gradient descent method is applied: an initial value of $\omega$ is generated at random, and the optimal $\omega$ is then obtained by iterating the following update:
$$\omega^{\text{new}}=\omega^{\text{old}}+\eta\,(y_n-\hat{y}_n)\,x_n \qquad (7)$$
where $\omega^{\text{old}}$ is the initial value of $\omega$, $\omega^{\text{new}}$ is the new value of $\omega$, and $\eta$ is the step size;
The obtained $\omega$ is substituted into the logistic regression model to calculate the class probability score of each sample, and the class with the highest probability score is taken as the final class of the sample. The one-vs-all strategy is also used here to extend logistic regression to multi-class classification: assuming the data have N classes, one independent binary classifier is built by logistic regression for each of the N classes. For classifier i, the samples with label==i are set as the positive class and the remaining samples as the negative class, and so on. For a sample to be predicted, every classifier outputs the probability p that the sample belongs to its positive class, and the class corresponding to the largest p is taken as the final predicted class.
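The one-vs-all logistic regression described above can be sketched as follows. This is not the patent's MATLAB program: the learning rate, number of epochs, bias handling and function names are all assumptions chosen for illustration, and each binary classifier is fitted with per-sample gradient steps on the log-likelihood.

```python
import math
import random
from typing import Dict, Hashable, List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_binary(X: Sequence[Sequence[float]], y: Sequence[int],
                 lr: float = 0.1, epochs: int = 200, seed: int = 0) -> List[float]:
    """Fit one binary classifier; the last weight is the bias term."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.01, 0.01) for _ in range(len(X[0]) + 1)]  # random start
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for n in order:
            xn = list(X[n]) + [1.0]
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, xn)))
            # Per-sample gradient of the log-likelihood is (y - p) * x.
            w = [wi + lr * (y[n] - p) * xi for wi, xi in zip(w, xn)]
    return w


def train_one_vs_all(X: Sequence[Sequence[float]], labels: Sequence[Hashable],
                     **kwargs) -> Dict[Hashable, List[float]]:
    """One classifier per class: that class is positive, all others negative."""
    return {c: train_binary(X, [1 if l == c else 0 for l in labels], **kwargs)
            for c in sorted(set(labels))}


def predict(models: Dict[Hashable, List[float]], x: Sequence[float]) -> Hashable:
    """Return the class whose classifier gives the highest probability."""
    xb = list(x) + [1.0]
    scores = {c: sigmoid(sum(wi * xi for wi, xi in zip(w, xb)))
              for c, w in models.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    X = [(0.0, 0.0), (0.5, 0.2), (0.2, 0.5),
         (4.0, 0.0), (4.2, 0.3), (3.8, 0.1),
         (0.0, 4.0), (0.3, 4.1), (0.1, 3.8)]
    labels = [1, 1, 1, 2, 2, 2, 3, 3, 3]
    models = train_one_vs_all(X, labels)
    print(predict(models, (4.0, 0.2)))
```

The toy data are three well-separated clusters, not spectra; in the method described here the inputs would be the preprocessed absorbance features and the labels the four sub-region classes.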
The classification models built with the support vector machine, random forest and logistic regression each vote on the sample class, and the class that receives the majority of votes (≥2) is taken as the final class of the sample.
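The voting rule can be sketched in a few lines. The text does not say what happens when all three classifiers disagree, so returning `None` in that case is an assumption made here; the function name is illustrative.

```python
from collections import Counter
from typing import Hashable, Optional


def majority_vote(svm_label: Hashable, rf_label: Hashable,
                  lr_label: Hashable) -> Optional[Hashable]:
    """Return the class with at least two of the three votes, else None."""
    label, count = Counter([svm_label, rf_label, lr_label]).most_common(1)[0]
    return label if count >= 2 else None


if __name__ == "__main__":
    print(majority_vote(1, 1, 2))
    print(majority_vote(1, 2, 3))
```

A caller would need its own fallback for the no-majority case, for example deferring to one designated classifier.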
Step S104: the samples within each sub-region and on the classification boundary are used as modeling samples, characteristic wavelengths with high sensitivity and correlation are screened, and a regression sub-model is established.
In the embodiment of the present invention, because the classifier is more likely to make errors at region boundaries, every sub-model includes the samples distributed on the boundary, so that a classification error does not cause a larger prediction error. The stability and variable permutation (SVP) method is used to select characteristic wavelengths and establish the optimal variable subset, and the least squares support vector machine is used to establish the regression sub-models. SVP is based on the evolutionary principles of intraspecific competition and survival of the fittest; it evaluates variables by considering their stability, their permutation degree and statistics related to model performance, and regards the variable subset with the smallest mean RMSE and a relatively low standard deviation as the optimal variables. For each sub-region, SVP selects a unique variable subset for nitrite and for nitrate separately. A model built on a dedicated subset of variables can adapt to the characteristics of the target ion and so achieve better performance. The least squares support vector machine model is built in MATLAB with the LSSVMlabv1_8_R2009b_R2011a toolbox using the RBF kernel function, and grid search is likewise used to find the best regularization parameter and kernel parameter, giving the regression sub-model of each sub-region.
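One way to read the boundary rule above is that a sample whose concentration equals the critical value is used to train every sub-model whose region it touches. The sketch below follows that reading, which is an assumption: the text does not define "on the boundary" precisely, and the function name and tolerance are illustrative.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[float, float]  # (nitrate, nitrite) in mg N/L


def training_sets(samples: Sequence[Sample], critical: float = 0.4,
                  tol: float = 1e-9) -> Dict[int, List[Sample]]:
    """Group samples by sub-region, sharing boundary samples between regions."""
    sets: Dict[int, List[Sample]] = {1: [], 2: [], 3: [], 4: []}
    for no3, no2 in samples:
        no3_low, no3_high = no3 <= critical + tol, no3 >= critical - tol
        no2_low, no2_high = no2 <= critical + tol, no2 >= critical - tol
        if no3_low and no2_low:
            sets[1].append((no3, no2))
        if no3_high and no2_low:
            sets[2].append((no3, no2))
        if no3_low and no2_high:
            sets[3].append((no3, no2))
        if no3_high and no2_high:
            sets[4].append((no3, no2))
    return sets


LEVELS = [0.1, 0.2, 0.3, 0.4, 0.8, 1.2, 1.6, 2.0, 2.5, 3.0]
GRID = [(a, b) for a in LEVELS for b in LEVELS]

if __name__ == "__main__":
    print({region: len(members) for region, members in training_sets(GRID).items()})
```

Under this reading the training sets overlap, so a sample misclassified into a neighboring region near the boundary still falls within the range that region's sub-model was trained on.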
In the embodiment of the present invention, the stability and variable permutation (SVP) method is used to select the optimal characteristic wavelength subset for modeling the nitrate and nitrite components of each sub-region. Monte Carlo sampling is first used to obtain sub-data sets in the sample space and in the variable space. In the sub-data sets of the sample space, the stability of each variable is calculated and ranked; variables with high stability are taken as elite variables and the rest as normal variables. The stability $S_j$ is calculated as:
$$S_j=\frac{\left|\bar{b}_j\right|}{\sqrt{\sum_{i=1}^{M}\left(b_{ij}-\bar{b}_j\right)^{2}/(M-1)}} \qquad (8)$$
where $b_{ij}$ is the regression coefficient of the j-th variable for the i-th sample, $\bar{b}_j$ is the mean regression coefficient of the j-th variable, and M is the total number of samples.
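A minimal sketch of the stability calculation follows, assuming stability is the absolute mean of a variable's regression coefficients divided by their sample standard deviation across the Monte Carlo runs. The input layout and function name are illustrative, and the regression that produces the coefficients is not shown.

```python
import statistics
from typing import List, Sequence


def stability(coeffs_per_run: Sequence[Sequence[float]]) -> List[float]:
    """Stability of each variable.

    coeffs_per_run[i][j] is the regression coefficient of variable j
    obtained from the i-th Monte Carlo sub-data set (at least two runs).
    """
    n_vars = len(coeffs_per_run[0])
    result = []
    for j in range(n_vars):
        column = [run[j] for run in coeffs_per_run]
        mean = statistics.mean(column)
        sd = statistics.stdev(column)  # sample standard deviation (M - 1)
        result.append(abs(mean) / sd if sd > 0 else float("inf"))
    return result


if __name__ == "__main__":
    runs = [[1.0, 0.5], [1.1, -0.5], [0.9, 0.0]]
    print(stability(runs))
```

A variable whose coefficient is large and consistent across runs scores high, whereas one whose coefficient changes sign from run to run scores near zero.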
Variable permutation analysis is then carried out in the sub-data sets of the variable space: the permutation degree of each variable is calculated and ranked, and variables with a high permutation degree are taken as important variables. The permutation degree $PD_j$ is calculated as:
$$PD_j=PCE_j-SCE_j \qquad (9)$$
where $PCE_j$ is the mean root mean square error of the models built separately on the wavelength subsets that do not contain variable j, and $SCE_j$ is the mean root mean square error of the models built separately on the remaining wavelength subsets that do contain variable j.
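The permutation degree can be sketched as below, given the RMSE of a model built on each sampled wavelength subset. The data layout and function name are illustrative, and assigning 0.0 to a variable that appears in all or none of the subsets is an assumption, since the text does not cover that case.

```python
import statistics
from typing import List, Sequence, Set, Tuple


def permutation_degree(results: Sequence[Tuple[Set[int], float]],
                       n_vars: int) -> List[float]:
    """PD_j = mean RMSE of subsets without variable j minus mean RMSE with it.

    results holds (variable_subset, rmse) pairs, one per sampled subset.
    """
    degrees = []
    for j in range(n_vars):
        with_j = [rmse for subset, rmse in results if j in subset]
        without_j = [rmse for subset, rmse in results if j not in subset]
        if not with_j or not without_j:
            degrees.append(0.0)  # assumption: undefined cases score zero
            continue
        degrees.append(statistics.mean(without_j) - statistics.mean(with_j))
    return degrees


if __name__ == "__main__":
    sampled = [({0, 1}, 0.1), ({0, 2}, 0.2), ({1, 2}, 0.5)]
    print(permutation_degree(sampled, n_vars=3))
```

A positive value means models do better when the variable is present, so variables with the largest values are the ones kept as important.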
The elite variables and important variables are merged into a new variable subset, and the above process is repeated. N iterations give N variable subsets; finally, cross-validation is used to select the variable subset with the smallest mean root mean square error and a relatively low standard deviation as the optimal subset.
Four least squares support vector machine (LSSVM) regression sub-models are trained with the LSSVMlabv1_8_R2009b_R2011a toolbox. LSSVM is an SVM whose loss function is a quadratic loss function. Its objective function is:
$$\min_{\omega,b,\xi}\ \frac{1}{2}\omega^{T}\omega+\frac{C}{2}\sum_{i=1}^{n}\xi_i^{2}$$
$$\text{s.t.}\quad y_i=\omega^{T}\varphi(x_i)+b+\xi_i,\quad i=1,2,\ldots,n \qquad (10)$$
where $x_i$ is the sample vector, $y_i$ is the sample label, $\omega$ is a vector whose dimension equals the feature dimension of the samples, $b$ is a real number, $n$ is the total number of samples, $C$ is the penalty factor, $\xi_i$ is the slack variable, and $\varphi(\cdot)$ is the nonlinear mapping function that maps the sample space to a high-dimensional feature space.
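The RBF kernel function is used:
$$K(x_i,x_j)=\exp\!\left(-\frac{\|x_i-x_j\|^{2}}{2\sigma^{2}}\right) \qquad (11)$$
The final LSSVM model structure is then:
$$y(x)=\sum_{i=1}^{n}\alpha_i K(x,x_i)+b \qquad (12)$$
where the model parameters $[\alpha_1\ \alpha_2\ \ldots\ \alpha_n]$ can be obtained by solving the LSSVM objective function with the Lagrangian method:
$$L(\omega,b,\xi,\alpha)=\frac{1}{2}\omega^{T}\omega+\frac{C}{2}\sum_{i=1}^{n}\xi_i^{2}-\sum_{i=1}^{n}\alpha_i\left[\omega^{T}\varphi(x_i)+b+\xi_i-y_i\right] \qquad (13)$$
where $\alpha=[\alpha_1,\alpha_2,\ldots,\alpha_n]$ are the Lagrange multipliers.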
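Solving for the Lagrange multipliers reduces to a linear system. The sketch below assumes the standard LS-SVM regression formulation with quadratic loss (C/2)·Σξ² and an RBF kernel written as exp(-||x - x'||² / (2σ²)), for which the optimality conditions give [[0, 1ᵀ], [1, K + I/C]]·[b; α] = [0; y]. It is a plain-Python illustration with hypothetical function names, not the toolbox implementation.

```python
import math
from typing import List, Sequence, Tuple


def rbf(a: Sequence[float], b: Sequence[float], sigma: float) -> float:
    """RBF kernel with bandwidth sigma."""
    return math.exp(-sum((p - q) ** 2 for p, q in zip(a, b)) / (2.0 * sigma ** 2))


def solve(A: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def fit_lssvm(X: Sequence[Sequence[float]], y: Sequence[float],
              C: float, sigma: float) -> Tuple[float, List[float]]:
    """Return the bias b and multipliers alpha of an LS-SVM regressor."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        A.append([1.0] + [rbf(X[i], X[j], sigma) + (1.0 / C if i == j else 0.0)
                          for j in range(n)])
    solution = solve(A, [0.0] + list(y))
    return solution[0], solution[1:]


def predict_lssvm(X: Sequence[Sequence[float]], alpha: Sequence[float], b: float,
                  sigma: float, x: Sequence[float]) -> float:
    """Evaluate sum_i alpha_i K(x, x_i) + b."""
    return sum(a * rbf(x, xi, sigma) for a, xi in zip(alpha, X)) + b


if __name__ == "__main__":
    X = [[0.0], [1.0], [2.0], [3.0]]
    y = [0.0, 1.0, 2.0, 3.0]
    b, alpha = fit_lssvm(X, y, C=1e6, sigma=1.0)
    print(predict_lssvm(X, alpha, b, 1.0, [1.5]))
```

Because the loss is quadratic and the constraints are equalities, training needs only this one linear solve rather than the quadratic program of a standard SVM.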
In the LSSVMlabv1_8_R2009b_R2011a toolbox, the LSSVM model can be built by using the tunelssvm function to initialize the model parameters, which also outputs the best penalty factor C and kernel parameter g found by grid search; the initial values of C and g are set to 100 and 0.01. The tunelssvm function is: model=tunelssvm(model_ori,optfun,costfun,costfun_args), with its input parameters set to costfun='crossvalidatelssvm'; costfun_args={10,'mse'}; optfun='gridsearch'; model_ori=initlssvm(trnX,trnY,'function estimation',c,g,'RBF_kernel'). The trainlssvm function is then used to build the regression model and output the model structure, which serves as the key input to the simlssvm function to output the predicted value Y for unknown samples.
Step S105: the spectral data of the sample to be tested are acquired, the class of the sample is determined from the relationship model, and the regression sub-model corresponding to that class is used for analysis and prediction, giving the concentrations of nitrate and nitrite in the sample.
As shown in Figure 3, the embodiment of the present invention provides an algorithm framework for analyzing the content of the sample to be tested. The spectral data of the sample are acquired and then classified by the support vector machine (SVM) classification model, the random forest (RF) classification model and the logistic regression (LR) model, giving three classes i, j and k; the class that holds the majority among i, j and k is taken as the class of the sample. The present invention builds a joint classifier from the three classifiers, which vote on the sample class so that the predicted class matches the true class. The classes determined when the three classification models vote inconsistently are listed in Table 2, which shows that, thanks to the voting mechanism of the present invention, the correct classification result is obtained in the end.
Table 2. Class determination when the three base classifiers vote inconsistently
In the embodiment of the present invention, once class l has been obtained, the stability and variable permutation method is used to select the variable subset of the corresponding class region, the least squares support vector machine is used to build the regression model, and the predicted nitrate and nitrite concentrations are finally obtained.
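The classify-then-regress flow of step S105 can be sketched as a small dispatcher. Everything here is illustrative: the `SubModel` container, the function name and the error raised when no class has two votes are assumptions, and the classifiers and regressors are passed in as plain callables rather than trained models.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class SubModel:
    """Regression sub-model for one analyte in one sub-region."""
    wavelength_idx: List[int]                 # indices of the selected wavelengths
    predict: Callable[[List[float]], float]   # maps selected absorbances to mg N/L


def predict_concentrations(
    spectrum: Sequence[float],
    classifiers: Sequence[Callable[[Sequence[float]], int]],
    submodels: Dict[int, Dict[str, SubModel]],
) -> Tuple[int, float, float]:
    """Vote on the sub-region, then apply that region's two sub-models."""
    votes = Counter(clf(spectrum) for clf in classifiers)
    region, count = votes.most_common(1)[0]
    if count < 2:
        raise ValueError("no majority among the base classifiers")
    predicted = {}
    for analyte in ("nitrate", "nitrite"):
        sub = submodels[region][analyte]
        features = [spectrum[i] for i in sub.wavelength_idx]
        predicted[analyte] = sub.predict(features)
    return region, predicted["nitrate"], predicted["nitrite"]


if __name__ == "__main__":
    dummy_classifiers = [lambda s: 2, lambda s: 2, lambda s: 1]
    dummy_submodels = {2: {"nitrate": SubModel([0], lambda f: 10 * f[0]),
                           "nitrite": SubModel([1, 2], lambda f: sum(f))}}
    print(predict_concentrations([0.1, 0.2, 0.3], dummy_classifiers, dummy_submodels))
```

Keeping a separate wavelength subset per analyte and per region mirrors the description that SVP selects a unique variable subset for nitrate and for nitrite in each sub-region.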
In the embodiment of the present invention, leave-one-out cross-validation is used as the evaluation strategy, and four classic parameters, the average relative error (ARE), the maximum relative error (MRE), the root mean square error of prediction (RMSEP) and the coefficient of determination (R²), are used to evaluate the performance of the established models. All programs in this example were implemented in MATLAB.
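The four evaluation parameters can be computed as in the sketch below. The patent does not give their formulas, so the commonly used definitions are assumed here (relative errors as percentages of the true value, and R² as one minus the ratio of residual to total sum of squares); the function name is illustrative.

```python
import math
from typing import Dict, Sequence


def evaluate(true: Sequence[float], pred: Sequence[float]) -> Dict[str, float]:
    """ARE and MRE in percent, RMSEP in concentration units, and R^2.

    Assumes all true values are non-zero and not all identical.
    """
    n = len(true)
    rel = [abs(p - t) / abs(t) for t, p in zip(true, pred)]
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    mean_true = sum(true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return {
        "ARE": 100.0 * sum(rel) / n,
        "MRE": 100.0 * max(rel),
        "RMSEP": math.sqrt(ss_res / n),
        "R2": 1.0 - ss_res / ss_tot,
    }


if __name__ == "__main__":
    print(evaluate([1.0, 2.0, 4.0], [1.1, 1.8, 4.0]))
```

Because relative errors divide by the true concentration, ARE and MRE are dominated by the low-concentration samples, which is where the single-model approach is reported to struggle.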
Table 3 compares the results of predicting the concentrations of the mixed solutions with the hybrid machine learning model of the present invention and with a single machine learning model, where the single machine learning model first uses SVP to select characteristic wavelengths and then uses LSSVM to build the model.
Table 3. Detection results using different algorithms
As Table 3 shows, with the prediction method of the hybrid machine learning model of the present invention, the average relative error for nitrate falls from 6.25% to 1.64% and the maximum relative error from 39.96% to 5.01%; the average relative error for nitrite falls from 12.37% to 4.58% and the maximum relative error from 79.81% to 9.23%. Figures 4 and 5 compare the single model and the hybrid model provided by the embodiment of the present invention in predicting nitrate and nitrite concentrations, respectively. Although the single model has a small average relative error (<10%) when the analyte concentration is relatively high, its prediction error increases greatly when the analyte concentration is below 0.4 mg N L⁻¹. With the prediction method of the hybrid machine learning model of the present invention, however the analyte concentration varies within the modeling region, the average relative error of the hybrid model always stays below 5%, and the performance is more stable.
The embodiment of the present invention provides a hybrid machine learning model that combines classification and regression algorithms. The model addresses the uneven accuracy of a single model in predicting nitrate and nitrite. In addition, a joint classifier built from the support vector machine, random forest and logistic regression is used to optimize the classification system. The experimental results show that, compared with other direct spectroscopic methods that use a single model, this method significantly reduces the maximum relative error in predicting nitrate and nitrite concentrations and improves the prediction accuracy for low-concentration components. It should be understood that the method of the present invention is applicable not only to the nitrate and nitrite mixed solutions of certain concentration ratios prepared in this embodiment, but also to any water sample in any concentration range in which nitrate and nitrite are the main components.
The embodiments described above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not be understood as limiting the patent scope of the present invention. It should be noted that a person of ordinary skill in the art can make a number of variations and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. The protection scope of this patent shall therefore be determined by the appended claims.
Other embodiments of the present disclosure will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure indicated by the claims.
It should be understood that although the steps in the flowcharts of the embodiments of the present invention are shown in sequence as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict restriction on the order in which these steps are executed, and they may be executed in other orders. Moreover, at least some of the steps in each embodiment may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be executed at different moments; nor are they necessarily executed in sequence, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not every possible combination of these technical features has been described; however, as long as a combination of these technical features contains no contradiction, it should be regarded as falling within the scope of this specification.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110054882.XA CN112750507B (en) | 2021-01-15 | 2021-01-15 | Method for simultaneously detecting nitrate and nitrite contents in water based on hybrid machine learning model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110054882.XA CN112750507B (en) | 2021-01-15 | 2021-01-15 | Method for simultaneously detecting nitrate and nitrite contents in water based on hybrid machine learning model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112750507A (en) | 2021-05-04 |
CN112750507B (en) | 2023-12-22 |
Family
ID=75652155
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110054882.XA Active CN112750507B (en) | 2021-01-15 | 2021-01-15 | Method for simultaneously detecting nitrate and nitrite contents in water based on hybrid machine learning model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112750507B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115950854B (en) * | 2022-12-02 | 2023-10-13 | 北京理工大学 | A method for predicting ammonium nitrate concentration in nitric acid-ammonium nitrate solution |
CN115901677B (en) * | 2022-12-02 | 2023-12-22 | 北京理工大学 | Method for predicting concentration of ammonium nitrate in nitric acid-ammonium nitrate solution with updating mechanism |
CN118152705B (en) * | 2024-02-02 | 2024-12-17 | 北京工业大学重庆研究院 | Method for determining multi-parameter substitution index of abundance of effluent resistance gene of sewage plant |
Also Published As
Publication number | Publication date |
---|---|
CN112750507A (en) | 2021-05-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||