CN114354666B - Soil heavy metal spectral feature extraction and optimization method based on wavelength frequency selection - Google Patents

Publication number
CN114354666B
Authority
CN
China
Prior art keywords
wavelength
algorithm
variable
wavelength variable
heavy metal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111677903.XA
Other languages
Chinese (zh)
Other versions
CN114354666A (en)
Inventor
任顺
陆旻波
任东
安毅
杨信廷
王纪华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Three Gorges University CTGU
Original Assignee
China Three Gorges University CTGU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Three Gorges University CTGU
Priority to CN202111677903.XA (CN114354666B)
Priority to CN202311682639.8A (CN117874480A)
Publication of CN114354666A
Application granted
Publication of CN114354666B

Landscapes

  • Analysing Materials By The Use Of Radiation (AREA)

Abstract

The invention relates to a method for extracting and optimizing soil heavy metal spectral features based on wavelength frequency selection, comprising the following steps: collecting soil samples, preparing the samples, and acquiring their spectra to form a sample data set; running the BOSS algorithm multiple times, calculating the probability that each wavelength variable is selected, selecting the wavelength variables with a high selection probability, calculating the mean RMSECV of the prediction model, and adjusting the number of selected wavelength variables until the mean RMSECV is minimal, thereby determining the optimal number N of wavelength variables; repeatedly running the serial ICO-BOSS algorithm to select wavelength variables, calculating the probability that each variable is selected, selecting the N wavelength variables with the highest selection probability, calculating the mean RMSECV of the prediction model, and adjusting the number of wavelength variables until the mean RMSECV is minimal, thereby obtaining the optimal wavelength variable set; and predicting the heavy metal content using the obtained wavelength variable set. By combining the serial ICO-BOSS algorithm with a wavelength frequency selection strategy to select the optimal wavelength variable set for heavy metal content prediction, the invention improves the stability and accuracy of the prediction model.

Description

Soil heavy metal spectral feature extraction and optimization method based on wavelength frequency selection
Technical Field
The invention belongs to the field of soil heavy metal detection, and in particular relates to a method for extracting and optimizing soil heavy metal spectral features based on wavelength frequency selection.
Background
With the development of the chemical industry, transportation and agriculture, heavy metal pollution of soil has become widespread, and it is now a common problem in most areas of China. Besides natural factors, soil heavy metal pollution is mainly caused by modern industry and mining, agricultural production and other human activities. The main heavy metal pollutants in soil include cadmium, arsenic, lead, copper, chromium and mercury. Soil heavy metal pollution is long-lasting, concealed, persistent and irreversible, and heavy metals in polluted soil have poor mobility and a long retention time. In addition, heavy metals are difficult for microorganisms to degrade, and once taken up by water, plants and other media they readily enter the human food chain and affect human health. Therefore, the supervision and monitoring of heavy metals in soil is of great significance for safe agricultural production and the protection of human health in China.
At present, traditional soil heavy metal detection mostly relies on chemical analysis methods such as atomic absorption spectrometry, atomic fluorescence spectrometry, inductively coupled plasma mass spectrometry and inductively coupled plasma emission spectrometry. These methods are highly precise, but the detection process causes some environmental pollution, is inefficient and costly, and is not suited to rapid detection of soil heavy metals.
As a rapid, non-destructive detection method, X-ray fluorescence spectrometry offers simple sample pretreatment, low measurement cost, simple instrument operation and relatively stable results compared with traditional chemical detection methods. It can rapidly determine the content of metal elements in soil on site and on a large scale, which is of great significance for soil pollution surveys and for the rapid detection and screening of multiple heavy metal elements in soil. The bootstrapping soft shrinkage (BOSS) algorithm and the interval combination optimization (ICO) algorithm are currently popular spectral variable selection algorithms. Although ICO and BOSS run quickly, the weighted bootstrap sampling used by ICO and the bootstrap random sampling used by BOSS are highly random, which affects the stability and accuracy of the prediction model.
Disclosure of Invention
To address the above problems, the invention provides a method for extracting and optimizing soil heavy metal spectral features based on wavelength frequency selection. The method connects the interval combination optimization (ICO) algorithm and the bootstrapping soft shrinkage (BOSS) algorithm in series, forming the ICO-BOSS algorithm, as the spectral wavelength variable selection algorithm, and uses a simulated annealing algorithm to optimize the parameters of the BOSS algorithm. Using a wavelength frequency selection strategy, the wavelength variable selection algorithm is run repeatedly, the probability that each wavelength variable is selected is calculated, the wavelength variables with a high selection probability are selected, and a prediction model is built by partial least squares (PLS) to detect the heavy metal content of soil, thereby improving the stability and accuracy of the prediction model.
The technical scheme of the invention is a soil heavy metal spectral feature extraction and optimization method based on wavelength frequency selection, which comprises the following steps:
Step 1: collecting soil samples and preparing soil samples covering a preset range of heavy metal concentrations; acquiring the X-ray fluorescence spectra of the soil samples, with the heavy metal content values calibrated by a chemical method, to form a sample spectral data set;
Step 2: running the BOSS algorithm multiple times, calculating the probability that each wavelength variable is selected, selecting the wavelength variables with a high selection probability, building a partial least squares model to predict the heavy metal content, calculating the mean root mean square error of cross validation (RMSECV), and increasing or decreasing the number of selected wavelength variables until the mean RMSECV is minimal, thereby determining the optimal number N of wavelength variables selected by the BOSS algorithm;
Step 3: repeatedly running the serial ICO-BOSS algorithm to select wavelength variables of the spectrum, calculating the probability that each wavelength variable is selected, ranking the wavelength variables by this probability, selecting N of them, predicting the heavy metal content, calculating the mean RMSECV of the partial least squares model, and increasing or decreasing the number of selected wavelength variables until the mean RMSECV is minimal, thereby obtaining the optimal wavelength variables for predicting the heavy metal content;
Step 4: acquiring the spectrum of the soil sample to be tested, and predicting the heavy metal content using the wavelength variables obtained in step 3.
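The wavelength frequency selection strategy used in steps 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `selection_frequency`, `top_n_by_frequency` and the `selector` callback are hypothetical names, and `selector` stands in for one run of BOSS or ICO-BOSS that returns the indices of the wavelength variables it selected.

```python
import numpy as np

def selection_frequency(selector, n_vars, n_runs=100):
    """Count how often each wavelength variable index is selected over n_runs.

    `selector(run)` is a hypothetical callback standing in for one run of the
    variable selection algorithm; it returns the selected variable indices.
    """
    counts = np.zeros(n_vars, dtype=int)
    for run in range(n_runs):
        chosen = np.unique(np.asarray(selector(run), dtype=int))
        counts[chosen] += 1  # each variable counts at most once per run
    return counts

def top_n_by_frequency(counts, n):
    """Indices of the n most frequently selected variables, in ascending order."""
    order = np.argsort(-np.asarray(counts), kind="stable")
    return np.sort(order[:n])
```

Dividing the counts by `n_runs` gives the selection probability of each variable; the number `n` would then be increased or decreased while monitoring the mean RMSECV of the resulting model.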
Further, the interval combination optimization algorithm ICO includes the following steps:
1) Determining the optimal number of intervals, the number of partial least squares sub-models and the proportion of sub-models to extract:
dividing the spectrum into a number of subintervals, building a partial least squares sub-model for each to predict the heavy metal content, comparing the test results for different numbers of intervals, and taking the number of intervals that gives the minimum root mean square error as the optimal number of subintervals;
2) Performing combination optimization on the wavelength intervals:
2.1) Generating partial least squares sub-models: weighted bootstrap sampling (WBS) is used to generate M spectral subsets, each formed by a random combination of different wavelength intervals, where M is the number of samplings. The initial sampling weight of each wavelength variable is 1, and the probability $p_i$ that wavelength variable i is selected in one sampling is:
$$p_i = \frac{w_i}{\sum_{j=1}^{n} w_j}$$
where n is the number of wavelength variables and $w_i$ is the sampling weight of wavelength variable i;
2.2) Calculating the RMSECV of the sub-model corresponding to each wavelength interval combination subset by partial least squares with 5-fold cross validation;
2.3) Extracting a subset of interval combinations, in proportion α, from all wavelength interval combinations, calculating the mean RMSECV of the corresponding sub-models, and denoting it $RMSECV_m$;
2.4) Counting the frequency with which the wavelength variables of each wavelength interval occur in the interval combination subset determined in step 2.3); the sampling weight $w_i$ of the i-th wavelength interval in the next iteration is:
$$w_i = \frac{f_i}{k_{best}}$$
where $f_i$ is the frequency with which the wavelength variables of the i-th wavelength interval occur in the extracted interval combination subset, and $k_{best}$ is the number of wavelength intervals contained in the extracted interval combination subset;
repeating steps 2.1) to 2.4) as an iterative loop until $RMSECV_m$ rises, then stopping the iteration and executing step 2.5);
2.5) Taking the group of wavelength intervals with the smallest $RMSECV_m$ in the last iteration as the finally selected wavelength intervals.
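Steps 2.1) and 2.4) above can be illustrated with a short sketch. The function names are hypothetical, and the weight update assumes that $k_{best}$ counts the extracted subsets, so that each new weight is the share of extracted subsets containing the interval; the patent text defines $k_{best}$ slightly differently, so treat this normalization as an assumption rather than the authors' exact rule.

```python
import numpy as np

def sampling_probabilities(weights):
    """p_i = w_i / sum_j w_j for the current sampling weights."""
    w = np.asarray(weights, dtype=float)
    return w / w.sum()

def weighted_bootstrap_sample(weights, rng):
    """Draw n indices with replacement according to p_i, then drop duplicates."""
    p = sampling_probabilities(weights)
    drawn = rng.choice(len(p), size=len(p), replace=True, p=p)
    return np.unique(drawn)

def update_weights(extracted_subsets, n_intervals):
    """Assumed update: fraction of extracted subsets that contain each interval."""
    freq = np.zeros(n_intervals)
    for subset in extracted_subsets:
        freq[np.unique(np.asarray(subset, dtype=int))] += 1
    return freq / len(extracted_subsets)
```

With all weights initialized to 1, the first round samples every interval with equal probability; intervals that keep appearing in the extracted subsets are then drawn more often in later rounds.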
Further, the BOSS algorithm specifically includes:
S1: generating K subsets in the wavelength variable space by bootstrap random sampling, extracting the wavelength variables in each subset, removing repeated wavelength variables, and giving equal weights to the remaining wavelength variables;
S2: building a partial least squares sub-model for each wavelength variable subset obtained in step S1, calculating the RMSECV of each sub-model, and extracting the sub-models with the smaller RMSECV values as the best sub-models;
S3: calculating the regression coefficients of the sub-models, normalizing the regression vectors, and summing them to obtain the new weight of each wavelength variable:
$$w_i = \sum_{k=1}^{K} b_{i,k}$$
where $w_i$ is the weight of wavelength variable i, K is the number of sub-models, and $b_{i,k}$ is the absolute value of the normalized regression coefficient of variable i in the k-th sub-model;
S4: based on the new wavelength variable weights, generating new subsets by weighted bootstrap sampling, extracting the wavelength variables from each subset, removing repeated variables, and building sub-models by partial least squares, with variables having larger absolute regression coefficients given larger weights; repeating steps S2, S3 and S4 until the number of wavelength variables in a new subset is 1, then stopping, and taking the subset with the minimum RMSECV over the whole iterative process as the optimal wavelength variable set.
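The weight computation in step S3 can be sketched as below. The text only says the regression vectors are normalized; this sketch assumes normalization to unit Euclidean length, and `boss_weights` is a hypothetical name. Variables that do not appear in a sub-model are represented by a coefficient of 0.

```python
import numpy as np

def boss_weights(coefs):
    """Sum of absolute normalized regression coefficients per variable.

    coefs: (K, p) array; row k holds the regression coefficients of sub-model k,
    with 0 for variables that are absent from that sub-model.
    Assumption: each regression vector is scaled to unit Euclidean length.
    """
    B = np.abs(np.asarray(coefs, dtype=float))
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return (B / norms).sum(axis=0)
```

The resulting vector can be passed directly as the sampling weights of the next weighted bootstrap round, so variables with consistently large coefficients are drawn more often.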
Preferably, step 2 further includes optimizing the parameters of the BOSS algorithm with a simulated annealing algorithm to obtain the optimal BOSS algorithm parameters, specifically:
Step one: for each parameter, selecting an initial solution $x_0$, setting the current solution $x_i = x_0$, and initializing the iteration count $l = 0$ and the current temperature $t_l = t_0$, where $t_0$ denotes any admissible value of the parameter;
Step two: if the inner-loop stopping condition is reached at the current temperature, executing step three; otherwise, randomly selecting a neighbor $x_j$ from the neighborhood $N(x_i)$ of the current solution $x_i$ and calculating $\Delta f_{ij} = f(x_j) - f(x_i)$, where $\Delta f_{ij}$ is the difference in model RMSECV and $f(x_i)$, $f(x_j)$ are the RMSECV of the current solution and the new solution, respectively; if $\Delta f_{ij} \le 0$, accepting the new solution, setting $x_i = x_j$ and $l = l + 1$, and executing step three; otherwise calculating $\exp(-\Delta f_{ij}/t_k)$ at the current temperature $t_k$, and if $\exp(-\Delta f_{ij}/t_k) > \mathrm{random}(0, 1)$, accepting the new solution and setting $x_i = x_j$ and $l = l + 1$; otherwise reselecting a neighbor and executing step two again;
Step three: judging whether the number of iterations has reached the termination count; if so, executing step four, otherwise returning to step two for the next iteration;
Step four: judging whether the model RMSECV has reached the set threshold; if so, outputting the current solution; otherwise lowering the temperature, returning to step two, and starting a new round of iterative search until the termination condition is met.
The optimal BOSS algorithm parameters obtained in step 2 are: number of iterations n = 50, number of samplings K = 1500, and model selection ratio δ = 5%.
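The acceptance rule in step two is the standard Metropolis criterion, and the overall loop can be sketched as follows. This is a simplified illustration, not the patent's procedure: `metropolis_accept` and `anneal` are hypothetical names, the search runs over a one-dimensional grid of candidate values for a single parameter, the geometric cooling schedule and the `t_min` stopping rule are assumptions (the text stops on an RMSECV threshold), and the sketch returns the best value seen rather than the current solution. In practice `objective` would be the RMSECV of the model built with that parameter value.

```python
import math
import random

def metropolis_accept(delta, temperature, rng=random):
    """Always accept improvements; accept worse moves with prob. exp(-delta/t)."""
    if delta <= 0:
        return True
    return math.exp(-delta / temperature) > rng.random()

def anneal(objective, grid, t0=1.0, cooling=0.8, inner_iters=20, t_min=1e-3, seed=0):
    """Minimize objective over a sorted list of candidate parameter values."""
    rng = random.Random(seed)
    i = rng.randrange(len(grid))          # initial solution x_0
    current_val = objective(grid[i])
    best, best_val = grid[i], current_val
    t = t0
    while t > t_min:                      # outer loop: cooling (assumed schedule)
        for _ in range(inner_iters):      # inner loop at fixed temperature
            # neighbour = adjacent grid point, clamped to the grid bounds
            j = min(max(i + rng.choice((-1, 1)), 0), len(grid) - 1)
            cand_val = objective(grid[j])
            if metropolis_accept(cand_val - current_val, t, rng):
                i, current_val = j, cand_val
                if current_val < best_val:
                    best, best_val = grid[i], current_val
        t *= cooling
    return best, best_val
```

Accepting some worse moves at high temperature lets the search leave local minima of the RMSECV surface; as the temperature falls, the rule approaches plain greedy descent.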
Compared with the prior art, the invention has the beneficial effects that:
1) The invention uses the serial ICO-BOSS algorithm to select wavelength variables of the spectrum and adopts a wavelength frequency selection strategy: the wavelength variable selection algorithm is run repeatedly, the probability that each wavelength variable is selected is calculated, and the wavelength variables with a high selection probability are used to detect the heavy metal content of soil, which improves the stability and accuracy of the prediction model;
2) The invention optimizes the parameters of the BOSS algorithm with a simulated annealing algorithm and uses the resulting optimal parameters when the serial ICO-BOSS algorithm selects wavelength variables of the spectrum, which further improves the performance of the prediction model;
3) The serial ICO-BOSS algorithm first screens the full spectrum coarsely with the ICO algorithm and then refines the selected wavelength intervals with the BOSS algorithm, which addresses the problem that the wavelength variable set selected by the BOSS algorithm alone contains uninformative variables and even interfering variables.
Drawings
The invention is further described below with reference to the drawings and examples.
Fig. 1 is a flow chart of the soil heavy metal spectral feature extraction and optimization method according to an embodiment of the invention.
Fig. 2a shows the prediction performance on the training set of the prediction model built by the BOSS algorithm with the frequency selection strategy.
Fig. 2b shows the prediction performance on the test set of the prediction model built by the BOSS algorithm with the frequency selection strategy.
Fig. 3a shows the prediction performance on the training set of the prediction model built by the ICO-BOSS algorithm with the frequency selection strategy.
Fig. 3b shows the prediction performance on the test set of the prediction model built by the ICO-BOSS algorithm with the frequency selection strategy.
Detailed Description
This embodiment detects the content of the heavy metal element Cr in soil.
As shown in Fig. 1, the method for extracting and optimizing soil heavy metal spectral features based on wavelength frequency selection comprises the following steps:
Step 1: collecting soil samples from farmland with no pollution source within 1 km of the sampling site, preparing soil samples covering a preset range of heavy metal concentrations by the concentration gradient method, and acquiring the X-ray fluorescence spectra of the soil samples, with the heavy metal content values calibrated by a chemical method, to form a sample spectral data set; dividing the sample spectral data set into a training set and a test set in proportion;
Step 2: optimizing the parameters of the BOSS algorithm with a simulated annealing algorithm to obtain the optimal BOSS algorithm parameters; running the BOSS algorithm 100 times, calculating the probability that each wavelength variable is selected, selecting the wavelength variables with a high selection probability, building a partial least squares model to predict the heavy metal content, calculating the mean RMSECV, and increasing or decreasing the number of selected wavelength variables until the mean RMSECV is minimal, thereby determining that the optimal number of wavelength variables selected by the BOSS algorithm is 32;
The number of principal components of the partial least squares model is 10. The optimal BOSS algorithm parameters obtained by the simulated annealing algorithm are: number of iterations n = 50, number of samplings K = 1500, and model selection ratio δ = 5%.
The frequencies of the wavelength variables obtained by running the BOSS algorithm 100 times are shown in Table 1, where the wavelength variables are arranged in descending order of frequency.
Table 1 Wavelength variable frequency data obtained by running the BOSS algorithm multiple times
Wavelength variables with a frequency exceeding 40 are selected, the Cr content is predicted with a partial least squares model, and RMSE and $R^2$ are calculated. Wavelength variables with frequencies exceeding 40, 50, 60 and 70 are selected in turn, and the prediction error of the partial least squares model is calculated for each, as shown in Table 2.
Table 2 Comparison of prediction errors of the prediction models for wavelength variables of different frequencies selected by the BOSS algorithm
Compared with the prediction model built on the wavelength variable set selected by a single run of the BOSS algorithm, the prediction models built on wavelength variable sets selected with the wavelength frequency selection strategy are improved in $R^2_c$, $RMSE_c$, $R^2_p$ and $RMSE_p$. The prediction model using the wavelength variables with a frequency exceeding 60 performs best and is the most stable; the relationship between its predictions and the true values is shown in Figs. 2a and 2b.
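The threshold sweep described above, together with the two error metrics, can be sketched as follows. The function names are hypothetical and the sketch only covers variable selection and scoring; fitting the partial least squares model itself is left to whichever PLS implementation is used.

```python
import numpy as np

def select_by_frequency(counts, threshold):
    """Indices of wavelength variables selected more than `threshold` times."""
    return np.flatnonzero(np.asarray(counts) > threshold)

def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted contents."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

A sweep would call `select_by_frequency(counts, t)` for each threshold `t`, fit a model on the selected columns of the training spectra, and compare `rmse` and `r2` on the training and test sets.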
Step 3: running the serial ICO-BOSS algorithm 100 times to select wavelength variables of the spectrum, i.e. coarse screening with the ICO algorithm followed by fine screening of the retained wavelength variables with the BOSS algorithm; then calculating the probability that each wavelength variable is selected, ranking the wavelength variables by this probability, selecting the wavelength variables with frequencies exceeding 50, 60, 70 and 80, predicting the heavy metal content, calculating the mean RMSECV of the partial least squares model, and increasing or decreasing the number of selected wavelength variables until the mean RMSECV is minimal, thereby obtaining the optimal wavelength variables for predicting the heavy metal content;
The number of principal components of the partial least squares model is 10. The BOSS algorithm uses the same parameters as in step 2.
The frequencies of the wavelength variables obtained by running the ICO-BOSS algorithm 100 times are shown in Table 3, where the wavelength variables are arranged in descending order of frequency.
Wavelength variables with a frequency exceeding 50 are selected, the Cr content is predicted with a partial least squares model, and RMSE and $R^2$ are calculated. Wavelength variables with frequencies exceeding 50, 60, 70 and 80 are selected in turn, and the prediction error of the partial least squares model is calculated for each, as shown in Table 4.
As can be seen from Table 4, the prediction model using the wavelength variables with a frequency exceeding 70 performs best and is the most stable; the relationship between its predictions and the true values is shown in Figs. 3a and 3b.
In this embodiment, the modeling performance of the serial ICO-BOSS algorithm of the invention is compared with that of the BOSS algorithm without the frequency selection strategy, the serial ICO-BOSS algorithm, and the BOSS algorithm with the frequency selection strategy, as shown in Table 5.
As can be seen from Table 5, compared with the prediction model built by the conventional BOSS algorithm without the frequency selection strategy, the partial least squares prediction models built by the BOSS and serial ICO-BOSS algorithms with the frequency selection strategy have a lower RMSE and a higher $R^2$, and the improvement is most pronounced for the serial ICO-BOSS algorithm with the frequency selection strategy. The comparison shows that the frequency selection strategy can improve the prediction performance and stability of the heavy metal concentration prediction model to a certain extent.
Table 3 Wavelength variable frequency data obtained by running the ICO-BOSS algorithm 100 times
Table 4 comparison of prediction errors of the prediction models of wavelength variables of different frequencies selected by ICO-BOSS algorithm
Table 5 comparison of PLS modeling performance for different wavelength selection algorithms
Step 4: acquiring the spectrum of the soil sample to be tested, building a partial least squares model with the wavelength variables obtained in step 3, and predicting the content of the heavy metal Cr.
The interval combination optimization algorithm ICO of the embodiment includes the following steps:
1) Determining the optimal number of intervals, the number of partial least squares sub-models and the proportion of sub-models to extract:
dividing the spectrum into a number of subintervals, building a partial least squares sub-model for each to predict the heavy metal content, comparing the test results for different numbers of intervals, and taking the number of intervals that gives the minimum root mean square error as the optimal number of subintervals;
2) Performing combination optimization on the wavelength intervals:
2.1) Generating partial least squares sub-models: weighted bootstrap sampling (WBS) is used to generate M spectral subsets, each formed by a random combination of different wavelength intervals, where M is the number of samplings. The initial sampling weight of each wavelength variable is 1, and the probability $p_i$ that wavelength variable i is selected in one sampling is:
$$p_i = \frac{w_i}{\sum_{j=1}^{n} w_j}$$
where n is the number of wavelength variables and $w_i$ is the sampling weight of wavelength variable i;
2.2) Calculating the RMSECV of the sub-model corresponding to each wavelength interval combination subset by partial least squares with 5-fold cross validation;
2.3) Extracting a subset of interval combinations, in proportion α, from all wavelength interval combinations, calculating the mean RMSECV of the corresponding sub-models, and denoting it $RMSECV_m$;
2.4) Counting the frequency with which each wavelength interval occurs in the interval combination subset determined in step 2.3); the sampling weight $w_i$ of the i-th wavelength interval in the next iteration is:
$$w_i = \frac{f_i}{k_{best}}$$
where $f_i$ is the frequency with which the i-th wavelength interval occurs in the extracted interval combination subset, and $k_{best}$ is the number of wavelength intervals contained in the extracted interval combination subset;
repeating steps 2.1) to 2.4) as an iterative loop until $RMSECV_m$ rises, then stopping the iteration and executing step 2.5);
2.5) Taking the group of wavelength intervals with the smallest $RMSECV_m$ in the last iteration as the finally selected wavelength intervals.
The weighted bootstrap sampling (WBS) method used in the embodiment is the one disclosed in the paper "Soil heavy metal content prediction based on X-ray fluorescence spectrum and multi-feature tandem strategy", published by Ren Shun et al. in 2020.
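The 5-fold RMSECV used in step 2.2) to score each sub-model can be sketched as below. To stay self-contained, the sketch swaps partial least squares for ordinary least squares via `numpy.linalg.lstsq`; this is a stand-in and not the model used in the embodiment. The name `rmsecv` and the `seed` parameter are hypothetical.

```python
import numpy as np

def rmsecv(X, y, n_folds=5, seed=0):
    """Root mean square error of k-fold cross validation.

    Stand-in model: ordinary least squares with an intercept
    (the embodiment uses partial least squares instead).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    sq_err = 0.0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False                      # hold out this fold
        A = np.column_stack([np.ones(train.sum()), X[train]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        sq_err += np.sum((y[test_idx] - A_test @ coef) ** 2)
    return float(np.sqrt(sq_err / len(y)))
```

Calling `rmsecv(X[:, cols], y)` for the columns of each sampled interval combination gives the score by which the combinations are ranked.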
The BOSS algorithm of the embodiment specifically includes:
S1: generating K subsets in the wavelength variable space by bootstrap random sampling, extracting the wavelength variables in each subset, removing repeated wavelength variables, and giving equal weights to the remaining wavelength variables;
S2: building a partial least squares sub-model for each wavelength variable subset obtained in step S1, calculating the RMSECV of each sub-model, and extracting the sub-models with the smaller RMSECV values as the best sub-models;
S3: calculating the regression coefficients of the sub-models, normalizing the regression vectors, and summing them to obtain the new weight of each wavelength variable:
$$w_i = \sum_{k=1}^{K} b_{i,k}$$
where $w_i$ is the weight of wavelength variable i, K is the number of sub-models, and $b_{i,k}$ is the absolute value of the normalized regression coefficient of variable i in the k-th sub-model;
S4: based on the new wavelength variable weights, generating new subsets by weighted bootstrap sampling, extracting the wavelength variables from each subset, removing repeated variables, and building sub-models by partial least squares, with variables having larger absolute regression coefficients given larger weights; repeating steps S2, S3 and S4 until the number of wavelength variables in a new subset is 1, then stopping, and taking the subset with the minimum RMSECV over the whole iterative process as the optimal wavelength variable set.
In the embodiment, the parameters of the BOSS algorithm are optimized with a simulated annealing algorithm to obtain the optimal BOSS algorithm parameters, specifically:
Step one: for each parameter, selecting an initial solution $x_0$, setting the current solution $x_i = x_0$, and initializing the iteration count $l = 0$ and the current temperature $t_l = t_0$, where $t_0$ denotes any admissible value of the parameter;
Step two: if the inner-loop stopping condition is reached at the current temperature, executing step three; otherwise, randomly selecting a neighbor $x_j$ from the neighborhood $N(x_i)$ of the current solution $x_i$ and calculating $\Delta f_{ij} = f(x_j) - f(x_i)$, where $\Delta f_{ij}$ is the difference in model RMSECV and $f(x_i)$, $f(x_j)$ are the RMSECV of the current solution and the new solution, respectively; if $\Delta f_{ij} \le 0$, accepting the new solution, setting $x_i = x_j$ and $l = l + 1$, and executing step three; otherwise calculating $\exp(-\Delta f_{ij}/t_k)$ at the current temperature $t_k$, and if $\exp(-\Delta f_{ij}/t_k) > \mathrm{random}(0, 1)$, accepting the new solution and setting $x_i = x_j$ and $l = l + 1$; otherwise reselecting a neighbor and executing step two again;
Step three: judging whether the number of iterations has reached the termination count; if so, executing step four, otherwise returning to step two for the next iteration;
Step four: judging whether the model RMSECV has reached the set threshold; if so, outputting the current solution; otherwise lowering the temperature, returning to step two, and starting a new round of iterative search until the termination condition is met.

Claims (2)

1. The method is characterized in that a series interval combination optimization algorithm ICO and a guided soft threshold algorithm BOSS, namely an ICO-BOSS algorithm are utilized as a spectrum wavelength variable selection algorithm, a wavelength frequency selection strategy is adopted, a wavelength variable selection algorithm is repeatedly operated, the probability that each wavelength variable is selected is calculated, the wavelength variable with high probability is selected, and the method is used for detecting the heavy metal content of the soil and comprises the following steps:
step 1: collecting soil samples and preparing soil samples covering a preset heavy metal concentration range; acquiring the X-ray fluorescence spectra of the soil samples, with the heavy metal content values calibrated by a chemical method, to form a sample spectral data set;
step 2: running the bootstrapping soft shrinkage (BOSS) algorithm repeatedly, calculating the probability that each wavelength variable is selected, selecting the wavelength variables with high selection probability, establishing a partial least squares (PLS) model to predict the heavy metal content, calculating the average cross-validation root mean square error (RMSECV), and increasing or decreasing the number of selected wavelength variables until the average RMSECV is minimal, thereby determining the optimal number N of wavelength variables selected by the BOSS algorithm;
step 3: running the serial ICO-BOSS algorithm repeatedly to select wavelength variables, calculating the probability that each wavelength variable is selected, sorting the wavelength variables by this probability, selecting N wavelength variables from the sorted variables, predicting the heavy metal content with a partial least squares model, calculating the average RMSECV of the model, and increasing or decreasing the number of selected wavelength variables until the average RMSECV is minimal, thereby obtaining the optimal wavelength variables for predicting the heavy metal content;
step 4: acquiring the spectrum of a soil sample to be tested, and predicting its heavy metal content using the wavelength variables obtained in step 3;
the interval combination optimization algorithm ICO comprises the following steps:
1) Determining the optimal number of interval divisions, the number of partial least squares sub-models and the proportion of sub-models:
dividing the spectrum into a number of subintervals, establishing a partial least squares sub-model on each subinterval to predict the heavy metal content, comparing the test results obtained with different numbers of interval divisions, and taking the number of divisions with the minimum root mean square error as the optimal number of subintervals;
2) Performing combination optimization on the wavelength intervals;
2.1) Partial least squares sub-model generation: weighted bootstrap sampling is used to generate M spectral subsets, each formed by a random combination of different wavelength intervals, where M represents the number of samplings; the initial sampling weight of each wavelength variable is 1, and the probability $p_i$ that wavelength variable i is selected in one sampling is:
$$p_i = \frac{w_i}{\sum_{j=1}^{n} w_j}$$
where n represents the number of wavelength variables and $w_i$ represents the sampling weight of wavelength variable i;
2.2) Calculating the cross-validation root mean square error (RMSECV) of the sub-model corresponding to each wavelength interval combination subset, using partial least squares with 5-fold cross-validation;
2.3) Extracting interval combination subsets in a proportion α from all wavelength interval combinations, calculating the average RMSECV of the sub-models corresponding to the extracted interval combination subsets, and recording this average as $\mathrm{RMSECV}_m$;
2.4) Counting the frequency with which the wavelength variables of each wavelength interval occur in the interval combination subsets determined in step 2.3); the sampling weight $w_i$ of the i-th wavelength interval in the next iteration is:
$$w_i = \frac{f_i}{k_{best}}$$
where $f_i$ represents the frequency with which the wavelength variables of the i-th wavelength interval occur in the extracted interval combination subsets, and $k_{best}$ represents the number of wavelength intervals contained in the extracted interval combination subsets;
repeating steps 2.1) to 2.4) as an iterative loop until $\mathrm{RMSECV}_m$ rises, then stopping the iteration and executing step 2.5);
2.5) In the last iteration, the group of wavelength intervals with the smallest $\mathrm{RMSECV}_m$ value is taken as the finally selected wavelength intervals;
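Steps 2.1) and 2.4) of the ICO procedure can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names are hypothetical, the partial least squares modelling of steps 2.2) and 2.3) is omitted, and $k_{best}$ is assumed here to be the number of retained interval combinations so that each weight stays between 0 and 1.

```python
import random
from collections import Counter


def sampling_probabilities(weights):
    """Return p_i = w_i / sum_j w_j for every interval weight w_i."""
    total = sum(weights)
    return [w / total for w in weights]


def draw_interval_combination(weights, n_draws, rng):
    """Weighted bootstrap: draw with replacement, keep each interval once."""
    drawn = rng.choices(range(len(weights)), weights=weights, k=n_draws)
    return sorted(set(drawn))


def update_weights(retained_combinations, n_intervals):
    """Return w_i = f_i / k_best over the retained interval combinations.

    Assumption: k_best is the number of retained combinations, and f_i
    counts how many of them contain interval i.
    """
    k_best = len(retained_combinations)
    freq = Counter(i for combo in retained_combinations for i in set(combo))
    return [freq[i] / k_best for i in range(n_intervals)]
```

With all weights initialised to 1, every interval is equally likely to be drawn in the first iteration; intervals that keep appearing in the retained combinations receive larger weights and are sampled more often afterwards.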
the bootstrapping soft shrinkage algorithm BOSS specifically comprises:
S1: generating subsets in the wavelength variable space by bootstrap random sampling, extracting the wavelength variables in each subset, removing repeated wavelength variables, and giving equal weights to the wavelength variables that remain;
S2: establishing a sub-model for each wavelength variable subset obtained in step S1 by partial least squares, calculating the cross-validation root mean square error (RMSECV) of each sub-model, and extracting the optimal sub-models, namely those with smaller RMSECV values;
S3: calculating the regression coefficients of the sub-models, normalizing all regression vectors and summing them to obtain the new weight of each wavelength variable:
$$w_i = \sum_{k=1}^{K} b_{i,k}$$
where $w_i$ represents the weight of wavelength variable i, K represents the number of sub-models, and $b_{i,k}$ represents the absolute value of the normalized regression coefficient of variable i in the k-th sub-model;
S4: based on the obtained wavelength variable weights, applying weighted bootstrap sampling to generate new subsets, extracting the wavelength variables from each subset, removing repeated variables, and establishing sub-models by partial least squares, wherein variables with larger absolute regression coefficients are given larger weights; repeating steps S2, S3 and S4 until the number of wavelength variables in the new subset is 1, then stopping; the subset with the minimum cross-validation root mean square error during the iterative process is taken as the optimal wavelength variable set;
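The weight update of step S3 can be sketched as follows. This is an illustration under stated assumptions rather than the claimed implementation: `boss_weights` is a hypothetical name, each sub-model is represented as a dictionary mapping a variable index to its regression coefficient, and "normalizing" is assumed to mean scaling each regression vector to unit Euclidean length.

```python
import math


def boss_weights(submodel_coefs, n_variables):
    """Return w_i = sum_k b_{i,k} over the retained sub-models.

    submodel_coefs: list of {variable_index: regression_coefficient}.
    Assumption: each regression vector is scaled to unit Euclidean length,
    and b_{i,k} is the absolute value of the scaled coefficient.
    Variables absent from a sub-model contribute 0 for that sub-model.
    """
    weights = [0.0] * n_variables
    for coefs in submodel_coefs:
        norm = math.sqrt(sum(b * b for b in coefs.values()))
        if norm == 0.0:
            continue  # a degenerate sub-model carries no information
        for i, b in coefs.items():
            weights[i] += abs(b) / norm
    return weights
```

The resulting weights can be passed directly to a weighted bootstrap sampler in step S4, so that variables with consistently large coefficients are drawn more often.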
step 2 further comprises optimizing the parameters of the BOSS algorithm with a simulated annealing algorithm to obtain the optimal BOSS algorithm parameters, and specifically comprises the following steps:
step one: for each parameter, selecting an initial solution $x_0$, letting the current iteration solution $x_i = x_0$, initializing the iteration step number to $l = 0$ and the current iteration temperature to $t_l = t_0$, where $t_0$ represents any admissible value of the parameter;
step two: if the current temperature reaches the inner-loop stopping condition, executing step three; otherwise, randomly selecting a neighbor $x_j$ from the neighborhood $N(x_i)$ of the current solution $x_i$ and calculating $\Delta f_{ij} = f(x_j) - f(x_i)$, where $\Delta f_{ij}$ represents the difference in cross-validation root mean square error between the neighbor $x_j$ and the current solution $x_i$, and $f(x_i)$ and $f(x_j)$ respectively represent the cross-validation root mean square error values of the prediction models with $x_i$ and $x_j$ as parameters; if $\Delta f_{ij} \le 0$, accepting the new solution, letting $x_i = x_j$ and $l = l + 1$, and executing step three; otherwise calculating $\exp(-\Delta f_{ij}/t_l)$; if $\exp(-\Delta f_{ij}/t_l) > \mathrm{random}(0, 1)$, accepting the new solution and letting $x_i = x_j$ and $l = l + 1$; otherwise reselecting a neighbor and executing step two;
step three: judging whether the termination number of iterations has been reached; if so, executing step four; otherwise executing step two to carry out the next iteration;
step four: judging whether the cross-validation root mean square error value of the prediction model has reached a set threshold; if so, outputting the current solution; otherwise lowering the temperature value, returning to step two, and starting a new round of iterative search until the termination condition is met.
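The acceptance rule in step two is the standard Metropolis criterion of simulated annealing. The sketch below shows it inside a small annealing loop. Everything beyond the acceptance test is an assumption made for illustration: the names `accept` and `anneal`, the geometric cooling factor, the minimum-temperature stop, and the use of a generic objective `f` in place of the RMSECV of a BOSS-based prediction model.

```python
import math
import random


def accept(delta_f, temperature, rng):
    """Metropolis rule: always accept improvements; accept a worse
    solution with probability exp(-delta_f / temperature)."""
    if delta_f <= 0:
        return True
    return math.exp(-delta_f / temperature) > rng.random()


def anneal(f, x0, neighbor, t0, cooling=0.9, inner_iters=50,
           t_min=1e-3, seed=0):
    """Minimise f starting from x0; return the best solution found.

    Assumptions: geometric cooling (t <- cooling * t) and a stop at t_min
    stand in for the threshold-based termination described in the text.
    """
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    t = t0
    while t > t_min:
        for _ in range(inner_iters):      # inner loop at fixed temperature
            xj = neighbor(x, rng)         # random neighbour of the current x
            fj = f(xj)
            if accept(fj - fx, t, rng):
                x, fx = xj, fj
                if fx < best_f:
                    best_x, best_f = x, fx
        t *= cooling                      # lower the temperature
    return best_x, best_f
```

In the setting of the claim, `x` would hold the BOSS parameters being tuned and `f(x)` would return the cross-validation root mean square error of the model built with those parameters.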
2. The soil heavy metal spectral feature extraction and optimization method based on wavelength frequency selection according to claim 1, characterized in that the optimal BOSS algorithm parameters obtained in step 2 comprise: number of iterations n = 50, number of samplings k = 1500, and model selection ratio δ = 5%.
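The wavelength frequency selection strategy recited in claim 1 (run the selector repeatedly, estimate how often each wavelength variable is chosen, then keep the N most frequently chosen variables) can be sketched as below. The function names are hypothetical, and `run_selector` stands for any variable selection routine, such as one run of BOSS or ICO-BOSS, that returns the indices of the selected wavelength variables.

```python
from collections import Counter


def selection_probabilities(run_selector, n_runs, n_variables):
    """Estimate, for each variable, the fraction of runs that selected it.

    run_selector(run_index) must return an iterable of selected variable
    indices; it is a placeholder for one run of the selection algorithm.
    """
    counts = Counter()
    for run in range(n_runs):
        counts.update(set(run_selector(run)))  # count each variable once per run
    return [counts[i] / n_runs for i in range(n_variables)]


def top_n(probabilities, n):
    """Return the indices of the n variables with the highest probability.

    Ties are broken in favour of the lower index (an arbitrary choice).
    """
    order = sorted(range(len(probabilities)),
                   key=lambda i: (-probabilities[i], i))
    return order[:n]
```

In practice N would then be adjusted up or down while monitoring the average cross-validation root mean square error of the partial least squares model, as described in steps 2 and 3 of claim 1.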
CN202111677903.XA 2021-12-31 2021-12-31 Soil heavy metal spectral feature extraction and optimization method based on wavelength frequency selection Active CN114354666B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111677903.XA CN114354666B (en) 2021-12-31 2021-12-31 Soil heavy metal spectral feature extraction and optimization method based on wavelength frequency selection
CN202311682639.8A CN117874480A (en) 2021-12-31 2021-12-31 ICO-BOSS algorithm-based soil heavy metal spectral feature extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111677903.XA CN114354666B (en) 2021-12-31 2021-12-31 Soil heavy metal spectral feature extraction and optimization method based on wavelength frequency selection

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202311682639.8A Division CN117874480A (en) 2021-12-31 2021-12-31 ICO-BOSS algorithm-based soil heavy metal spectral feature extraction method

Publications (2)

Publication Number Publication Date
CN114354666A CN114354666A (en) 2022-04-15
CN114354666B true CN114354666B (en) 2023-12-26

Family

ID=81105237

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202111677903.XA Active CN114354666B (en) 2021-12-31 2021-12-31 Soil heavy metal spectral feature extraction and optimization method based on wavelength frequency selection
CN202311682639.8A Pending CN117874480A (en) 2021-12-31 2021-12-31 ICO-BOSS algorithm-based soil heavy metal spectral feature extraction method

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202311682639.8A Pending CN117874480A (en) 2021-12-31 2021-12-31 ICO-BOSS algorithm-based soil heavy metal spectral feature extraction method

Country Status (1)

Country Link
CN (2) CN114354666B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115656074B (en) * 2022-12-28 2023-04-07 山东省科学院海洋仪器仪表研究所 Adaptive selection and estimation method for sea water COD (chemical oxygen demand) spectral variable characteristics

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107727676A (en) * 2017-09-14 2018-02-23 三峡大学 A kind of heavy metal content in soil modeling method based on to space before partial least squares algorithm
CN109902411A (en) * 2019-03-07 2019-06-18 三峡大学 Heavy metal content in soil detects modeling method and device, detection method and device
CN110361356A (en) * 2019-07-30 2019-10-22 长春理工大学 A kind of near infrared spectrum Variable Selection improving wheat water content precision of prediction
CN110991064A (en) * 2019-12-11 2020-04-10 广州城建职业学院 Soil heavy metal content inversion model generation method and system, storage medium and inversion method
CN111504942A (en) * 2020-04-26 2020-08-07 长春理工大学 Near infrared spectrum analysis method for improving prediction accuracy of protein in milk
CN113049507A (en) * 2021-03-09 2021-06-29 三峡大学 Multi-model fused spectral wavelength selection method

Also Published As

Publication number Publication date
CN114354666A (en) 2022-04-15
CN117874480A (en) 2024-04-12

Similar Documents

Publication Publication Date Title
CN101430276B (en) Wavelength variable optimization method in spectrum analysis
CN102735642B (en) Method for quickly and losslessly identifying virgin olive oil and olive-residue oil
CN103913432B (en) Based on the near-infrared spectrum wavelength system of selection of particle cluster algorithm
CN105630743A (en) Spectrum wave number selection method
CN107632010B (en) Method for quantifying steel sample by combining laser-induced breakdown spectroscopy
CN107958267B (en) Oil product property prediction method based on spectral linear representation
CN112462001B (en) Gas sensor array model calibration method with data augmentation based on a conditional generative adversarial network
CN114354666B (en) Soil heavy metal spectral feature extraction and optimization method based on wavelength frequency selection
CN110569566A (en) Method for predicting mechanical property of plate strip
CN113889198A (en) Transformer fault diagnosis method and equipment based on oil chromatogram time-frequency domain information and residual error attention network
CN115829157A (en) Chemical water quality index prediction method based on variational mode decomposition and the Autoformer model
CN113420795A (en) Mineral spectrum classification method based on a dilated convolutional neural network
CN114757413A (en) Bad data identification method based on time sequence series analysis coupling neural network prediction
CN116010884A (en) Fault diagnosis method of SSA-LightGBM oil-immersed transformer based on principal component analysis
CN116559110A (en) Self-adaptive near infrared spectrum transformation method based on correlation and Gaussian curve fitting
CN109283169A (en) A kind of Raman spectral peaks recognition methods of robust
CN111914490A (en) Pump station unit state evaluation method based on deep convolution random forest self-coding
CN116992362A (en) Transformer fault characterization feature quantity screening method and device based on the Shapley value
CN116610990A (en) Method and device for identifying hidden danger of breaker based on characteristic space differentiation
CN115130377A (en) Soil heavy metal prediction method using a BOSS-SAPSO optimized extreme learning machine
CN115982566A (en) Multi-channel fault diagnosis method for hydroelectric generating set
CN113011086B (en) Estimation method of forest biomass based on GA-SVR algorithm
CN112697745A (en) Method for measuring alcohol content of white spirit
CN113361209A (en) Quantitative analysis method for magnetic anomaly of surface defects of high-temperature alloy
CN113688895A (en) Method and system for detecting abnormal firing zone of ceramic roller kiln based on simplified KECA

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant