CN113378473A - Underground water arsenic risk prediction method based on machine learning model - Google Patents

Publication number
CN113378473A
Authority
CN
China
Prior art keywords
parameter
model
arsenic
data set
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110698696.XA
Other languages
Chinese (zh)
Other versions
CN113378473B (en)
Inventor
曹文庚
付宇
高媛媛
王小东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Hydrogeology and Environmental Geology CAGS
Original Assignee
Institute of Hydrogeology and Environmental Geology CAGS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Hydrogeology and Environmental Geology CAGS
Priority to CN202110698696.XA
Publication of CN113378473A
Application granted
Publication of CN113378473B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Evolutionary Computation (AREA)
  • Educational Administration (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Development Economics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Geometry (AREA)
  • Game Theory and Decision Science (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Geophysics And Detection Of Objects (AREA)

Abstract

The invention provides a groundwater arsenic risk prediction method based on a machine learning model, comprising the following steps: S1, collecting data; S2, defining the model tasks according to the spatial scale of the study area and the degree to which groundwater arsenic exceeds the standard; S3, evaluating how well each machine learning algorithm fits the data, using the algorithm's performance on the full-parameter data set and the spatial-parameter data set as the evaluation criterion; S4, based on the algorithms screened in step S3, carrying out a hyperparameter sensitivity test on the model tasks of step S2, determining the hyperparameter ranges to be tuned, optimizing the hyperparameter tuning process for all subsequent models, and completing the construction of the full-parameter or spatial-parameter model for every model task; and S5, predicting groundwater arsenic risk with the constructed probability estimation models. The method selects algorithms by comparing machine learning algorithms, optimizes the statistical modeling process, and constructs a high-precision statistical model of groundwater arsenic.

Description

Underground water arsenic risk prediction method based on machine learning model
Technical Field
The invention relates to the technical field of underground water safety monitoring, in particular to an underground water arsenic risk prediction method based on a machine learning model.
Background
Groundwater is a primary source of residential drinking water in many countries and regions, where the potential risk of arsenic exposure is a serious hazard to human health. Arsenic in groundwater is colorless and odorless and is therefore difficult to detect. At present, technology and equipment for remediating arsenic-contaminated groundwater are not yet widely available, and centralized water supply and water-source replacement projects still leave areas uncovered. Particularly in rural areas without centralized water supply, exposure to arsenic in groundwater has become one of the most difficult problems for rural drinking water safety. After entering the human body, arsenic can cause harm by denaturing proteins and enzymes, damaging cells and disrupting gene regulation, producing acute and chronic toxic symptoms. Long-term drinking of high-arsenic groundwater can damage multiple organs and systems, causing skin lesions, cardiovascular and cerebrovascular diseases and nervous system diseases, and can further cause cancers in multiple organs, with a latency period that can range from more than ten years to several decades.
Because arsenic in groundwater is spatially heterogeneous, governments and related institutions need a large number of monitoring samples and analytical measurements to implement water-supply safety policies, which consumes substantial manpower and material resources and incurs a large time cost. Therefore, where high-density groundwater quality surveys cannot be carried out comprehensively, it is of social importance to use statistical modeling to study and predict the distribution of high-arsenic groundwater in countries and regions with widespread groundwater arsenic contamination, to predict exceedance in unsampled areas, and to provide a reliable scientific basis for sampling surveys and water-use decisions. At the same time, statistical modeling based on big data can systematically analyze the spatial heterogeneity of groundwater arsenic at multiple scales and describe and infer its formation processes and key controlling factors at different scales, which is of scientific importance.
At present, most statistical-model studies of groundwater arsenic contamination follow a relatively fixed workflow: 1. build a model of the data using one or two statistical methods; 2. calculate model performance under different statistical indices; 3. calculate the predicted probability distribution from the model; 4. interpret the model and its results. In this workflow the choice of algorithm lacks prior evaluation and the necessary justification, so the selected algorithm may be unsuitable for the groundwater arsenic data of the study area. The resulting models then perform poorly, which makes both the risk predictions and the interpretation of the model results unreliable.
Disclosure of Invention
The invention aims to provide a groundwater arsenic risk prediction method based on a machine learning model, in which algorithms are selected by comparing machine learning algorithms, the statistical modeling process is optimized, and a high-precision statistical model of groundwater arsenic is constructed that captures the characteristics of the data more comprehensively, thereby providing more reliable predictions and analysis of results.
To achieve this purpose, the invention provides the following scheme:
A groundwater arsenic risk prediction method based on a machine learning model comprises the following steps:
S1, data collection: selecting predictor variables potentially related to groundwater arsenic exceedance, including hydrochemical, geological, geographic and hydrological parameters; collecting data for these predictor variables; and organizing the data into a full-parameter data set and a spatial-parameter data set;
S2, defining the model tasks: defining the model tasks according to the spatial scale of the study area and the degree to which groundwater arsenic exceeds the standard;
S3, establishing and evaluating an algorithm selection mechanism: using the performance of each machine learning algorithm on the full-parameter data set and the spatial-parameter data set as the evaluation criterion to evaluate how well the algorithm fits the data, with the following specific steps: selecting several candidate machine learning algorithms; splitting the data set several times to obtain different training set/test set combinations; training and testing on each combination and recording the test performance of each algorithm; considering both the mean and the range of the performance metrics to establish each algorithm's potential generalization capability for groundwater arsenic statistical modeling on the full-parameter and spatial-parameter data sets; and screening the algorithms with sufficient potential for the modeling of step S4;
S4, constructing the probability estimation models: based on the algorithms screened in step S3, carrying out a hyperparameter sensitivity test on the model tasks of step S2, determining the hyperparameter ranges to be tuned, optimizing the hyperparameter tuning process for all subsequent models, and completing the construction of the full-parameter or spatial-parameter model for every model task;
and S5, predicting groundwater arsenic risk with the probability estimation models constructed in step S4.
Further, in step S1, the spatial-parameter data set includes data for geological, geographic and hydrological parameters, and the full-parameter data set includes data for hydrochemical, geological, geographic and hydrological parameters.
Further, in step S2, the model tasks are defined according to the spatial scale of the study area and the degree to which groundwater arsenic exceeds the standard, specifically:
the spatial scales of the study area include: a national scale, a northwest regional scale, a northeast regional scale, a southern regional scale and a southern shallow-layer scale;
groundwater arsenic exceedance is judged against three thresholds: 5 μg/L, 10 μg/L and 50 μg/L;
these are combined with 2 different types of data set: the full-parameter data set and the spatial-parameter data set;
the model tasks so defined comprise 30 different models (5 spatial scales × 3 thresholds × 2 data sets), each to be built with the different algorithms.
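The count of 30 follows from combining the five spatial scales, three thresholds and two data set types. A minimal sketch of that enumeration is given below; the identifier strings are illustrative labels chosen here and do not appear in the patent.

```python
from itertools import product

# Illustrative labels (assumptions): the patent names the scales, thresholds
# and data set types, but not these identifiers.
SCALES = ["national", "northwest", "northeast", "south", "south_shallow"]
THRESHOLDS_UG_L = [5, 10, 50]
DATASETS = ["full_parameter", "spatial_parameter"]


def enumerate_model_tasks():
    """Return one (scale, threshold, dataset) tuple per model task."""
    return list(product(SCALES, THRESHOLDS_UG_L, DATASETS))


tasks = enumerate_model_tasks()
print(len(tasks))  # 5 scales x 3 thresholds x 2 data sets
```

Each tuple identifies one model to be built by every candidate algorithm.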
Further, in step S3, selecting a plurality of potential machine learning algorithms specifically includes: logistic regression, random forest, and boosted regression trees.
Further, in step S3, splitting the data set several times to obtain different training set/test set combinations, training and testing on each combination and recording the test performance of each algorithm, and considering both the mean and the range of the performance metrics to establish each algorithm's potential generalization capability for groundwater arsenic statistical modeling on the full-parameter and spatial-parameter data sets, specifically comprises:
randomly splitting the full-parameter data set and the spatial-parameter data set into a training set and a test set at a ratio of 7:3;
estimating, within the training set, the generalization capability for a given hyperparameter value using 10-fold cross-validation repeated 10 times; 10-fold cross-validation means dividing the training set into 10 parts of equal size, each time selecting one subset as the validation set and using the union of the remaining 9 subsets as the sub-training set;
setting a hyperparameter value, fitting the algorithm to the sub-training set to generate a model, and calculating the model's performance metric on the validation set; cycling through the 10 subsets so that each serves as the validation set in turn, and taking the mean of the performance metric over the 10 validation sets as the performance value of the model for that hyperparameter value in the 10-fold cross-validation; and evaluating the potential generalization capability of each algorithm from the bias and variance of the performance metric.
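The splitting procedure described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patented implementation: the function names, the toy labels and the majority-class "model" are all hypothetical stand-ins for the real data and algorithms.

```python
import random


def train_test_split_indices(n, train_fraction=0.7, seed=0):
    """Randomly split indices 0..n-1 into train and test lists (7:3 by default)."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    cut = round(n * train_fraction)
    return idx[:cut], idx[cut:]


def repeated_kfold(indices, k=10, repeats=10, seed=0):
    """Yield (sub_train, validation) index lists for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]  # k near-equal parts
        for i in range(k):
            sub_train = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield sub_train, folds[i]


def cross_validated_score(indices, fit, score, k=10, repeats=10, seed=0):
    """Average a validation metric over all repeats x folds."""
    scores = [score(fit(sub_train), validation)
              for sub_train, validation in repeated_kfold(indices, k, repeats, seed)]
    return sum(scores) / len(scores)


# Toy data (assumption): 200 samples, 25% labelled as exceeding the threshold.
labels = [1 if i % 4 == 0 else 0 for i in range(200)]
train_idx, test_idx = train_test_split_indices(len(labels))


def fit_majority(sub_train):
    """Stand-in 'model': predict the majority class of the sub-training set."""
    positives = sum(labels[i] for i in sub_train)
    return 1 if 2 * positives > len(sub_train) else 0


def accuracy(model, validation):
    return sum(labels[i] == model for i in validation) / len(validation)


cv_accuracy = cross_validated_score(train_idx, fit_majority, accuracy)
```

The test set produced by the 7:3 split is never touched during cross-validation; it is held back for the final performance estimate.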
Further, in step S3, the performance of the machine learning algorithm on the full-parameter data set and the spatial-parameter data set is used as the evaluation criterion, wherein the performance measures include accuracy, sensitivity, specificity and the mean ROC value (area under the ROC curve).
Further, in step S4, carrying out the hyperparameter sensitivity test on the model tasks of step S2 based on the algorithms screened in step S3, determining the hyperparameter ranges to be tuned, optimizing the hyperparameter tuning process for all subsequent models, and completing the construction of the full-parameter or spatial-parameter model for every model task, specifically comprises:
verifying, by combining grid search with cross-validation, the sensitivity of the hyperparameters of the screened algorithms on different data sets for three typical model tasks, so as to optimize the hyperparameter tuning process, the three typical model tasks being: groundwater arsenic statistical modeling on the full-parameter data set of the national-scale study area, groundwater arsenic statistical modeling on the spatial-parameter data set of the national-scale study area, and groundwater arsenic statistical modeling on the full-parameter data set of the northwest-scale study area;
for each hyperparameter to be tuned, selecting a limited number of representative values according to the characteristics of the study object, forming a grid in the multi-dimensional hyperparameter space from all combinations of these values, and trying every candidate combination by traversing all grid nodes, which is a compromise between feasibility and comprehensiveness;
and traversing all points of the grid in the hyperparameter space using grid search combined with 10-fold cross-validation repeated 10 times, then comparing the performance metrics of all hyperparameter combinations and selecting the hyperparameter values with the highest performance metric.
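A minimal sketch of the grid traversal is shown below. The hyperparameter names (`n_trees`, `max_depth`), their candidate values and the toy scoring function are assumptions made for illustration; in the method described above the score would be the mean metric from 10-fold cross-validation repeated 10 times.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Evaluate every node of the hyperparameter grid and return the best one.

    param_grid: dict mapping hyperparameter name -> list of candidate values.
    evaluate:   callable taking a dict of hyperparameter values and returning
                a performance metric where higher is better.
    """
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((evaluate(params), params))
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_params, best_score, results


# Hypothetical grid: a few representative values per hyperparameter.
param_grid = {"n_trees": [100, 500, 1000], "max_depth": [2, 4, 6]}


def toy_cv_score(params):
    """Stand-in for a cross-validated metric; peaks at n_trees=500, max_depth=4."""
    return (0.9
            - ((params["n_trees"] - 500) / 1000) ** 2
            - ((params["max_depth"] - 4) / 10) ** 2)


best_params, best_score, results = grid_search(param_grid, toy_cv_score)
```

Keeping the full `results` list makes it possible to inspect how sensitive the metric is to each hyperparameter, which is the purpose of the sensitivity test.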
According to the specific embodiments provided, the invention achieves the following technical effects: the groundwater arsenic risk prediction method based on a machine learning model draws on several state-of-the-art machine learning algorithms and a collected multidimensional data set with a large sample size. Starting from the practical purpose of modeling and using performance metrics as the reference, it focuses on how well each algorithm matches the actual modeling problem and establishes a modeling workflow suited to simulating the groundwater data of the study object. The workflow can be used to predict the spatial distribution of high-arsenic groundwater and to analyze the causes of high-arsenic groundwater and the mechanisms controlling its distribution at different scales, which is of importance for research on groundwater arsenic mechanisms, groundwater resource utilization and water-use safety.
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the invention; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a groundwater arsenic risk prediction method based on a machine learning model according to an embodiment of the invention;
FIG. 2 is a graph of BLR, RF and BRT performance on the training set of the national full-parameter model task with a threshold of 10 μg/L, evaluated in terms of accuracy, sensitivity, specificity and ROC;
FIG. 3 is a graph of BLR, RF and BRT performance on the test set of the national full-parameter model task with a threshold of 10 μg/L, evaluated in terms of accuracy, sensitivity, specificity and ROC.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a groundwater arsenic risk prediction method based on a machine learning model, in which algorithms are selected by comparing machine learning algorithms, the statistical modeling process is optimized, and a high-precision statistical model of groundwater arsenic is constructed that captures the characteristics of the data more comprehensively, thereby providing more reliable predictions and analysis of results.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a schematic flow chart of a groundwater arsenic risk prediction method based on a machine learning model in an embodiment of the present invention, and as shown in fig. 1, the groundwater arsenic risk prediction method based on the machine learning model in the embodiment of the present invention includes the following steps:
S1, data collection: selecting predictor variables potentially related to groundwater arsenic exceedance, including hydrochemical, geological, geographic and hydrological parameters; collecting data for these predictor variables; and organizing the data into a full-parameter data set and a spatial-parameter data set;
S2, defining the model tasks: defining the model tasks according to the spatial scale of the study area and the degree to which groundwater arsenic exceeds the standard;
S3, establishing and evaluating an algorithm selection mechanism: using the performance of each machine learning algorithm on the full-parameter data set and the spatial-parameter data set as the evaluation criterion to evaluate how well the algorithm fits the data, with the following specific steps: selecting several candidate machine learning algorithms; splitting the data set several times to obtain different training set/test set combinations; training and testing on each combination and recording the test performance of each algorithm; considering both the mean and the range of the performance metrics to establish each algorithm's potential generalization capability for groundwater arsenic statistical modeling on the full-parameter and spatial-parameter data sets; and screening the algorithms with sufficient potential for the modeling of step S4;
S4, constructing the probability estimation models: based on the algorithms screened in step S3, carrying out a hyperparameter sensitivity test on the model tasks of step S2, determining the hyperparameter ranges to be tuned, optimizing the hyperparameter tuning process for all subsequent models, and completing the construction of the full-parameter or spatial-parameter model for every model task;
and S5, predicting groundwater arsenic risk with the probability estimation models constructed in step S4.
In step S1, the spatial parameter data set includes data corresponding to geological parameters, geographic parameters, and hydrological parameters, such as soil, ion exchange, river network, elevation, potential evapotranspiration, precipitation, temperature, surface runoff, irrigation, gravity, terrain, structure, and sediment; the full parameter data set comprises data corresponding to a water chemistry parameter, a geological parameter, a geographic parameter and a hydrological parameter. It can be seen that the full parameter data set includes not only spatial parameters, but also water chemistry parameters.
In step S2, the model tasks are defined according to the spatial scale of the study area and the degree to which groundwater arsenic exceeds the standard, specifically:
the spatial scales of the study area include: a national scale, a northwest regional scale, a northeast regional scale, a southern regional scale and a southern shallow-layer scale;
groundwater arsenic exceedance is judged against three thresholds: 5 μg/L, 10 μg/L and 50 μg/L;
these are combined with 2 different types of data set: the full-parameter data set and the spatial-parameter data set;
the model tasks so defined comprise 30 different models (5 spatial scales × 3 thresholds × 2 data sets), each to be built with the different algorithms.
The results at the 10 μg/L threshold are taken as an example (Table 1), with the following table notes. a: the spatial scale of the model, where nation denotes the national scale, S the southern region, SS the southern shallow layer, NE the northeast region and NW the northwest region. b: Over denotes a model trained on an over-sampled training set; the parentheses after the model name give the result of hyperparameter tuning, wherein BRT is interaction.
Overall, as shown in Table 1, BRT and RF perform well and very similarly on every model. Both perform best on the training set: accuracy is generally about 95%, sensitivity and specificity fluctuate with the data set, and the ROC value is generally above 0.99. This shows that, after hyperparameter tuning in the modeling process of the invention, both algorithms fit the training data well and capture the differences in the natural system that underlie whether groundwater arsenic exceeds the standard. Therefore, to show the dominant, general numerical patterns the models have captured, the emphasis is placed on the performance of the corresponding models on the test set and on how performance varies across modeling tasks.
On the test set, the accuracy of the full-parameter models for the different tasks mostly falls between 85% and 95%, and the ROC value fluctuates around 0.95. Sensitivity and specificity fluctuate with the data: the stronger the class imbalance and the smaller the data set, the larger the gap between sensitivity and specificity, and the more the model's performance is concentrated on predicting the majority class. Compared with the full-parameter models, the spatial-parameter models show a marked drop on all four performance metrics (Table 1), which is not consistent with the relative performance of the two data-set models on the training set.
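The table notes refer to models trained on an over-sampled training set, but the text does not state which over-sampling method was used. As one common possibility, and purely as an assumption, random over-sampling of the minority class could look like the sketch below; the function name and toy labels are hypothetical.

```python
import random


def random_oversample(indices, labels, seed=0):
    """Duplicate minority-class samples at random until both classes are equal in size.

    indices: sample indices of the training set only (the test set is left as is).
    labels:  sequence of 0/1 labels indexed by sample index.
    """
    rng = random.Random(seed)
    indices = list(indices)
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return indices  # nothing to duplicate
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return indices + extra


# Toy training labels (assumption): 10 exceedances among 100 samples.
toy_labels = [1] * 10 + [0] * 90
balanced = random_oversample(range(100), toy_labels)
```

Over-sampling is applied to the training set only, so that test-set metrics still reflect the original class proportions.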
TABLE 1 BRT and RF model performance in each model at the 10 μg/L threshold (the table is provided as an image in the original publication)
In step S3, the performance measures used as evaluation criteria include accuracy, sensitivity, specificity and the mean ROC value. The candidate machine learning algorithms selected are logistic regression, random forest and boosted regression trees. Accuracy, sensitivity and specificity are used to evaluate the performance of the stepwise logistic regression, random forest and boosted regression tree models. Accuracy takes values in [0, 1]; the closer it is to 1, the better the model performs. The ROC curve gives a comprehensive estimate of model performance as the classification threshold varies over [0, 1]. It is plotted from the changes in sensitivity and specificity as the threshold is traversed over [0, 1], with the false positive rate (1 - specificity) as the abscissa and the true positive rate (sensitivity) as the ordinate. AUC is the area under the ROC curve and measures the model's ability to distinguish the two classes. For a useful model AUC lies in [0.5, 1]: the closer it is to 1.0, the better the model; at 0.5 the model is no better than random guessing and has little practical value. Groundwater arsenic data are strongly heterogeneous, so AUC (ROC), which evaluates the model's ability to distinguish the classes in terms of proportions within each class rather than raw counts, is better suited to an overall assessment of model performance.
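The four measures can be computed from predicted exceedance probabilities as in the following sketch. It uses the standard definitions (sensitivity as the true positive rate, specificity as the true negative rate, AUC as the probability that a randomly chosen positive is ranked above a randomly chosen negative); the function names, the 0.5 decision threshold and the toy data are assumptions for illustration.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives at a probability threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 0 and predicted == 1:
            fp += 1
        elif y == 0 and predicted == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy_sensitivity_specificity(y_true, y_prob, threshold=0.5):
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity


def roc_auc(y_true, y_prob):
    """Area under the ROC curve via pairwise ranking (ties count as 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (assumption): 3 exceedance samples and 5 non-exceedance samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
acc, sens, spec = accuracy_sensitivity_specificity(y_true, y_prob)
auc = roc_auc(y_true, y_prob)
```

Because sensitivity and specificity are rates within each class, AUC is unaffected by how many samples each class contains, which is why it suits imbalanced exceedance data.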
In addition, in step S3, splitting the data set several times to obtain different training set/test set combinations, training and testing on each combination and recording the test performance of each algorithm, and considering both the mean and the range of the performance metrics to establish each algorithm's potential generalization capability for groundwater arsenic statistical modeling on the full-parameter and spatial-parameter data sets, specifically comprises:
randomly splitting the full-parameter data set and the spatial-parameter data set into a training set and a test set at a ratio of 7:3;
estimating, within the training set, the generalization capability for a given hyperparameter value using 10-fold cross-validation repeated 10 times; 10-fold cross-validation means dividing the training set into 10 parts of equal size, each time selecting one subset as the validation set and using the union of the remaining 9 subsets as the sub-training set;
setting a hyperparameter value, fitting the algorithm to the sub-training set to generate a model, and calculating the model's performance metric on the validation set; cycling through the 10 subsets so that each serves as the validation set in turn, and taking the mean of the performance metric over the 10 validation sets as the performance value of the model for that hyperparameter value in the 10-fold cross-validation; and evaluating the potential generalization capability of each algorithm from the bias and variance of the performance metric.
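One way to turn the recorded metrics into a screening decision is to summarize each algorithm's scores by their mean and spread. The sketch below uses the interquartile range as the spread measure; the cut-off values, algorithm names and score lists are hypothetical and are not results from the patent.

```python
from statistics import mean, quantiles


def summarize_metric(values):
    """Return the mean and interquartile range of a metric over repeated splits."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {"mean": mean(values), "iqr": q3 - q1}


def select_algorithms(scores_by_algorithm, min_mean, max_iqr):
    """Keep algorithms whose metric is both high on average and stable."""
    selected = []
    for name, values in scores_by_algorithm.items():
        summary = summarize_metric(values)
        if summary["mean"] >= min_mean and summary["iqr"] <= max_iqr:
            selected.append(name)
    return selected


# Made-up test-set scores for two hypothetical algorithms.
scores = {
    "algo_a": [0.90, 0.91, 0.89, 0.92, 0.90],  # high and stable
    "algo_b": [0.60, 0.80, 0.70, 0.55, 0.85],  # lower and more variable
}
selected = select_algorithms(scores, min_mean=0.85, max_iqr=0.05)
```

An algorithm with a good mean but a wide spread is rejected here, reflecting the idea that variance as well as average performance determines generalization capability.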
In the embodiment of the invention, a plurality of models with different purposes are established in the research of underground water arsenic statistical models developed in the Bengal area, and the models can be divided into a regional model and a national model which show multi-scale differences. Therefore, in order to verify the potential matching degree of the model performance of the three machine learning algorithms under different research purposes as widely and efficiently as possible, the invention selects a scenario with the most complete consideration to data in the algorithm selection process, namely selects a data set with the most complete data quantity and parameters: a statistical model of groundwater arsenic on a national scale and taking into account all parameters. Wherein, 10 mug/L is selected as a threshold value for judging whether arsenic exceeds the standard or not in the divided underground water.
FIG. 2 shows the performance of stepwise logistic regression (BLR), Random Forest (RF) and Boosted Regression Trees (BRT) on a training set in a national full-parameter model task with a threshold of 10 μ g/L with accuracy, sensitivity, specificity and ROC assessment; FIG. 3 shows the performance of stepwise logistic regression (BLR), Random Forest (RF) and Boosted Regression Trees (BRT) on a test set of the national full-parameter model task with a threshold of 10 μ g/L, evaluated with accuracy, sensitivity, specificity and ROC.
As can be seen from fig. 2 and 3, in the groundwater arsenic statistical task under the national full parameter, the RF and BRT algorithms have better performance and lower error and variance, the performance metric in the training set is very close to the test set, and in the test set, the average values of the accuracy, sensitivity, specificity and ROC are all around 0.9, and the quartile range is within 0.05. The performance of BLR is all lower than both of these integrated regression tree models, and there is a significant difference in performance of the performance metrics across the training and test sets. The stepwise logistic regression test set was around 0.7 for accuracy, sensitivity, specificity and ROC mean, but around 0.2 for quartering as shown in figure 3. This indicates that for the Bengal data, the generalization error of the model may not ignore the variance effect. In particular, the generalization capability of the BLR bangladesh model is mainly affected by the model variance, which has not been considered in the existing stepwise logistic regression groundwater arsenic statistical modeling studies. Since the model variance is mainly influenced by the spatial distribution characteristics of the data, the popularization of the algorithm comparison process in the groundwater arsenic statistical modeling research in the region where the data quality (distribution uniformity and data density) is similar to or worse than Bengal is necessary.
In the algorithm comparison based on the Bangladesh groundwater arsenic data, the RF and BRT algorithms perform significantly better than stepwise logistic regression. Random forest and boosted regression trees are therefore more suitable than logistic regression for statistical modeling studies of high-arsenic groundwater in South and Southeast Asian regions such as Bangladesh and Cambodia. Accordingly, in this embodiment, step S3 selects the RF and BRT algorithms.
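The comparison above rests on two summaries of each metric across repeated training set-test set splits: the mean and the interquartile range. A minimal sketch of that summary follows, using only the Python standard library. The function name `summarize` and the two score lists are illustrative assumptions; the numbers are invented for demonstration and are not results from this patent.

```python
import statistics


def summarize(scores):
    """Return (mean, interquartile range) of a list of metric values."""
    q1, _, q3 = statistics.quantiles(scores, n=4)
    return statistics.mean(scores), q3 - q1


# Hypothetical test-set accuracies from repeated 7:3 splits (made-up values).
stable = [0.89, 0.90, 0.91, 0.90, 0.92, 0.88, 0.90, 0.91]
unstable = [0.55, 0.80, 0.62, 0.85, 0.70, 0.58, 0.78, 0.72]

for name, scores in (("stable", stable), ("unstable", unstable)):
    mean, iqr = summarize(scores)
    print(f"{name}: mean={mean:.3f} IQR={iqr:.3f}")
```

A narrow interquartile range indicates that the algorithm's performance depends little on which samples fall into the training set, which is the low-variance behaviour attributed to RF and BRT in the text.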
In step S4, based on the algorithms evaluated and selected in step S3, a hyperparameter sensitivity test is performed on the model tasks of step S2, the hyperparameter ranges to be tuned are determined, the hyperparameter tuning process of all subsequent models is optimized, and the full-parameter or spatial-parameter models of all model tasks are constructed, which specifically includes:
combining grid search with cross-validation, verifying the sensitivity of the hyperparameters of the selected algorithms under different data sets for three typical model tasks, so as to optimize the hyperparameter tuning process, wherein the three typical model tasks are: groundwater arsenic statistical modeling on the full-parameter data set of the national-scale research area, groundwater arsenic statistical modeling on the spatial-parameter data set of the national-scale research area, and groundwater arsenic statistical modeling on the full-parameter data set of the northwest regional-scale research area;
for each hyperparameter to be tuned, selecting a finite number of representative values according to the characteristics of the research object, forming a grid in the multi-dimensional hyperparameter space from all combinations of these values, and trying every candidate hyperparameter combination by traversing all grid nodes, thereby achieving a compromise between feasibility and comprehensiveness;
and traversing all grid points in the hyperparameter space using 10-fold cross-validation repeated 10 times in combination with grid search, then comparing the performance metrics of all hyperparameter combinations and selecting the hyperparameter values with the highest performance metric.
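The grid traversal described in the steps above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the embodiment: the hyperparameter names and values in `example_grid` are placeholders (the actual ranges are those of Table 2), and `toy_score` is a stand-in for the repeated cross-validation score of a fitted model.

```python
from itertools import product


def grid_points(grid):
    """Yield every hyperparameter combination in the grid as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid, score_fn):
    """Visit all grid nodes and return (best_params, best_score)."""
    best_params, best_score = None, float("-inf")
    for params in grid_points(grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid: names and values are placeholders, not those of Table 2.
example_grid = {
    "n_trees": [100, 500, 1000],
    "depth": [1, 3, 5],
    "rate": [0.01, 0.1],
}


def toy_score(p):
    """Stand-in for a cross-validated performance metric (assumption)."""
    return (-abs(p["n_trees"] - 500) / 1000
            - abs(p["depth"] - 3) / 10
            - abs(p["rate"] - 0.1))


best, score = grid_search(example_grid, toy_score)
print(len(list(grid_points(example_grid))), best)
```

The number of grid nodes is the product of the number of candidate values per hyperparameter, which is why only a limited set of representative values is chosen for each.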
In this embodiment, the two algorithms RF and BRT are selected as the modeling methods for the Bangladesh groundwater arsenic statistical modeling study. Hyperparameter tuning is the final step in adjusting model complexity once the algorithm and the parameter space have been determined, and it is a necessary step. The hyperparameters of the RF and BRT algorithms and their candidate values, selected by combining the existing literature with the characteristics of the research object, are shown in Table 2.
Table 2. Hyperparameters to be tuned and their tuning ranges for boosted regression trees and random forests
Figure BDA0003129520580000101
In a specific embodiment, 10-fold cross-validation repeated 10 times is used on the training set to estimate the generalization ability under a given set of hyperparameter values. 10-fold cross-validation divides the target set (here, the training set used in model tuning) into 10 subsets of equal size. Each time, one subset is selected as the validation set and the union of the remaining 9 subsets is used as the sub-training set; with the hyperparameter values fixed, the algorithm is fitted to the sub-training set to produce a model, and the performance metric of the model is computed on the validation set. Each of the 10 subsets serves as the validation set in turn, and the mean of the performance metric over the 10 validation sets is taken as the 10-fold cross-validation performance of the model for those hyperparameter values, that is, the estimate of its generalization ability. To avoid errors introduced by a particular partition, the target set is randomly partitioned ten times, each time into 10 equal-sized subsets for 10-fold cross-validation. The mean of the performance metrics obtained from the 10 rounds of 10-fold cross-validation is taken as the performance of the model for those hyperparameter values under 10-fold cross-validation repeated 10 times.
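The repeated cross-validation procedure described above can be sketched as below. The sketch only handles index bookkeeping and averaging; `fit_and_score` is a hypothetical callback that would fit the chosen algorithm on the sub-training indices and return a performance metric on the validation indices. The `dummy` callback is an assumption used purely to exercise the loop.

```python
import random


def k_fold_indices(n, k, rng):
    """Shuffle range(n) and split it into k folds of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def repeated_cv(n, fit_and_score, k=10, repeats=10, seed=0):
    """Mean validation score over `repeats` random k-fold partitions."""
    rng = random.Random(seed)
    repeat_means = []
    for _ in range(repeats):
        folds = k_fold_indices(n, k, rng)
        scores = []
        for i, val_idx in enumerate(folds):
            # Union of the remaining k-1 folds forms the sub-training set.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(fit_and_score(train_idx, val_idx))
        repeat_means.append(sum(scores) / k)
    return sum(repeat_means) / repeats


calls = []


def dummy(train_idx, val_idx):
    """Placeholder scorer: records split sizes, returns 1.0 if disjoint."""
    calls.append((len(train_idx), len(val_idx)))
    return 1.0 if not set(train_idx) & set(val_idx) else 0.0


result = repeated_cv(200, dummy)
print(result, len(calls))
```

With k = 10 and 10 repeats, each hyperparameter combination requires 100 model fits, which is the cost that the sensitivity analysis is meant to reduce by narrowing the grid.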
The algorithm selection process and the hyperparameter sensitivity analysis determine the method and the hyperparameter tuning process for the subsequent modeling tasks. Thus, in each modeling task, the data directly involved in the task is divided into a training set and a test set at a ratio of 7:3, the boosted regression tree and random forest algorithms are applied, the optimized hyperparameter tuning process is carried out by cross-validation to obtain the optimal model, and the performance metrics of the model on the training set and the test set are calculated. For 3 thresholds (5 μg/L, 10 μg/L and 50 μg/L), 5 research areas (the national-scale research area and the northwest, northeast, south and south shallow regional-scale research areas) and 2 types of data set (full-parameter and spatial-parameter), 30 different models are established for each of the boosted regression tree and the random forest. In addition, because the Bangladesh groundwater arsenic data is imbalanced to some degree, for data sets with a small sample size or severe imbalance, 26 further models are established for each of the random forest and the boosted regression tree using an oversampled training set. A total of 112 models are built. Throughout, the random forest and boosted regression tree algorithms are implemented with the randomForest and gbm packages in the R language.
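The model counts stated above follow from enumerating the task combinations. The short sketch below reproduces the arithmetic; the list contents mirror the thresholds, research areas, data set types and algorithms named in the text, while the variable names are illustrative.

```python
from itertools import product

thresholds = [5, 10, 50]  # arsenic thresholds in μg/L
areas = ["national", "northwest", "northeast", "south", "south shallow"]
datasets = ["full-parameter", "spatial-parameter"]
algorithms = ["BRT", "RF"]

# One model task per (threshold, research area, data set type) combination.
tasks = list(product(thresholds, areas, datasets))

base_models = len(tasks) * len(algorithms)      # models on the original training sets
oversampled_models = 26 * len(algorithms)       # extra models on oversampled training sets
total = base_models + oversampled_models

print(len(tasks), base_models, oversampled_models, total)
```

That is, 3 × 5 × 2 = 30 tasks per algorithm, 60 models for the two algorithms, plus 2 × 26 = 52 oversampled models, giving 112 in total.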
Model performance is evaluated by 4 different performance metrics (accuracy, sensitivity, specificity and area under the ROC curve). The two methods perform similarly on the training and test sets and show similar trends across data sets, but their performance on imbalanced data still differs. In areas where arsenic-exceeding groundwater predominates, random forest obtains higher sensitivity and lower specificity, whereas in areas where non-exceeding groundwater predominates, it obtains lower sensitivity and higher specificity. This may be caused by the parallel structure of the random forest: each tree is fitted to a bootstrap sample that has the same distribution as the regional data set, so every tree is affected by the imbalance. In the boosted regression tree, because of its serial structure, only the first tree learns from the imbalanced data directly; each added tree learns the errors of the existing model, and after many iterations the distribution of these errors is no longer influenced by the data distribution, so BRT gives a more balanced simulation on imbalanced data. Therefore, all subsequent parts involving probability simulation and prediction are based on the results of the boosted regression tree model.
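For reference, the four performance metrics named above can be computed as in the following sketch. It assumes binary labels (1 = arsenic exceeds the threshold, 0 = does not) and a probability cutoff of 0.5 for the confusion-matrix metrics; the cutoff, the function names and the example data are assumptions for illustration, not values from the embodiment. The area under the ROC curve is computed through its rank interpretation, the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one.

```python
def confusion_metrics(y_true, y_prob, cutoff=0.5):
    """Accuracy, sensitivity and specificity at a given probability cutoff."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= cutoff else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


def auc(y_true, y_prob):
    """Area under the ROC curve via pairwise ranking (ties count half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Made-up imbalanced example: 3 exceeding samples, 5 non-exceeding samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]

metrics = confusion_metrics(y_true, y_prob)
print(metrics, auc(y_true, y_prob))
```

Sensitivity and specificity are computed within the positive and negative classes separately, which is why they expose the effect of class imbalance that overall accuracy can hide.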
To reduce the influence of imbalanced data on the model, and thereby reduce the bias of model estimation and prediction, oversampling is applied as an additional preprocessing step that expands the original training set into a balanced data set. However, for models based on oversampled data, the output probabilities change with the resampled data set and therefore cannot be used for probability modeling. Models built on the oversampled data sets obtain accuracy and area under the ROC curve similar to those of the original-data models, but the gap between sensitivity and specificity is greatly reduced, and the random forest and the boosted regression tree obtain similar sensitivity and specificity.
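The text does not state which oversampling scheme is used. As one possible reading, the sketch below shows simple random oversampling with replacement, in which minority-class samples of the training set are duplicated until all classes have the same size. The function name `oversample` and the toy data are assumptions; other schemes would fit the description equally well.

```python
import random


def oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(y)
    return out_rows, out_labels


# Toy training set: 2 exceeding samples (label 1) and 8 non-exceeding (label 0).
rows = list(range(10))
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

bal_rows, bal_labels = oversample(rows, labels)
print(len(bal_rows), bal_labels.count(1), bal_labels.count(0))
```

Oversampling is applied to the training set only; the test set keeps its original class proportions so that the reported metrics reflect the real data distribution.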
Given the excellent performance of the random forest and boosted regression tree models across the modeling tasks, both can serve as the model basis for the subsequent discussion. The boosted regression tree and random forest models based on the oversampled training set have balanced performance and are used to calculate factor importance scores; the boosted regression tree model without oversampling is used to estimate probabilities and draw the probability prediction map because of its stable performance on imbalanced data.
The algorithm selection process provided by the invention takes the decomposition of generalization error as its theoretical basis, comprehensively considers the potential generalization ability of each algorithm on the groundwater arsenic data of the research area from the perspective of bias and variance, and selects algorithms suited to the research task before the formal model tasks are established. Compared with the traditional groundwater arsenic statistical modeling workflow, which compares methods only after the modeling tasks are complete, the algorithm selection process additionally takes the characteristics of spatial data modeling into account and estimates the fit between algorithm and data at the start of the modeling task. This avoids carrying out the modeling task with unsuitable algorithms, greatly reduces unnecessary computation, and allows more candidate algorithms to be considered. The embodiment provided by the invention shows that the boosted regression tree and the random forest perform better, that the influence of variance on model performance cannot be ignored, and that the generalization ability of the stepwise logistic regression model is mainly limited by variance. Therefore, for regions whose data quality is close to or worse than that of Bangladesh, the algorithm selection process should be used in groundwater arsenic statistical modeling to estimate the generalization ability of models from the perspective of variance and bias.
For the existing data, the sensitivity of the 4 hyperparameters of the boosted regression tree algorithm is analyzed on the basis of repeated k-fold cross-validation and grid search, with accuracy, sensitivity, specificity and area under the ROC curve as performance metrics. The analysis verifies that the boosted regression tree algorithm fits the Bangladesh groundwater arsenic data set very well and can describe the characteristics of the data and the natural system underlying them. From the perspective of interpreting the method, the hyperparameter tuning process of the boosted regression tree is optimized while model performance is preserved, which greatly improves modeling efficiency.
The probability estimation models have excellent performance, and the importance calculation models have more balanced sensitivity and specificity. The final risk prediction map not only captures the spatial distribution of groundwater arsenic at large and medium scales, as Kriging interpolation does, but also has higher resolution and can predict the fine-scale distribution at medium and small scales. The stable and excellent performance of the random forest and boosted regression tree algorithms across the different Bangladesh modeling tasks shows that both algorithms have the potential to be extended to groundwater arsenic statistical modeling studies in other affected regions.
The groundwater arsenic risk prediction method based on a machine learning model provided by the invention draws on several state-of-the-art machine learning algorithms and uses the collected multidimensional data set with a large sample size. Starting from the practical purpose of modeling and taking performance metrics as the reference, it focuses on the fit between the candidate algorithms and the actual modeling problem, and establishes a modeling workflow suited to simulating the groundwater data of the research object. The method can be used to predict the spatial distribution of high-arsenic groundwater and to analyze the causes of high-arsenic groundwater and the mechanisms controlling its distribution at different scales, and is of great significance for research on groundwater arsenic mechanisms, groundwater resource utilization and water safety.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (7)

1. A groundwater arsenic risk prediction method based on a machine learning model is characterized by comprising the following steps:
S1, data collection: selecting predictor variables potentially related to groundwater arsenic exceedance, collecting data for the relevant predictor variables, and organizing the data into a full-parameter data set and a spatial-parameter data set;
S2, defining the model tasks: defining the model tasks according to the spatial scale of the research area of the research target and the degree of groundwater arsenic exceedance;
S3, establishing an algorithm selection mechanism and evaluating algorithms: taking the performance of each machine learning algorithm on the full-parameter data set and the spatial-parameter data set as the evaluation criterion to assess the fit between the algorithm and the data, wherein the specific evaluation steps are: selecting a plurality of candidate machine learning algorithms; splitting the data set a plurality of times to obtain different training set-test set combinations; training and testing on each training set-test set combination and recording the test performance of each algorithm; considering both the mean and the range of the performance metrics to establish the potential generalization ability of each algorithm for groundwater arsenic statistical modeling based on the full-parameter data set and the spatial-parameter data set; and selecting the algorithms with sufficient potential to carry out the modeling of step S4;
S4, constructing a probability estimation model: based on the algorithms evaluated and selected in step S3, performing a hyperparameter sensitivity test on the model tasks of step S2, determining the hyperparameter ranges to be tuned, optimizing the hyperparameter tuning process of all subsequent models, and completing the construction of the full-parameter or spatial-parameter models of all model tasks;
S5, predicting the groundwater arsenic risk using the probability estimation model constructed in step S4.
2. The groundwater arsenic risk prediction method based on a machine learning model according to claim 1, wherein in step S1, the spatial-parameter data set comprises data for geological parameters, geographical parameters and hydrological parameters, and the full-parameter data set comprises data for hydrochemical parameters, geological parameters, geographical parameters and hydrological parameters.
3. The groundwater arsenic risk prediction method based on a machine learning model according to claim 1, wherein in step S2, defining the model tasks according to the spatial scale of the research area of the research target and the degree of groundwater arsenic exceedance specifically comprises:
the spatial scales of the research area include: a national-scale research area, a northwest regional-scale research area, a northeast regional-scale research area, a south regional-scale research area and a south shallow regional-scale research area;
the degree of groundwater arsenic exceedance includes three thresholds: 5 μg/L, 10 μg/L and 50 μg/L;
combining 2 different types of data set: a full-parameter data set and a spatial-parameter data set;
the model tasks are defined as establishing 30 different models with each of the different algorithms.
4. The groundwater arsenic risk prediction method based on a machine learning model as claimed in claim 1, wherein in step S3, selecting a plurality of potential machine learning algorithms specifically comprises: logistic regression, random forest, and boosted regression trees.
5. The groundwater arsenic risk prediction method based on a machine learning model according to claim 4, wherein in step S3, splitting the data set a plurality of times to obtain different training set-test set combinations, training and testing on each training set-test set combination and recording the test performance of each algorithm, and considering both the mean and the range of the performance metrics to establish the potential generalization ability of each algorithm for groundwater arsenic statistical modeling based on the full-parameter data set and the spatial-parameter data set specifically comprises:
randomly splitting the full-parameter data set and the spatial-parameter data set at a ratio of 7:3 to generate a training set and a test set;
estimating the generalization ability under a given set of hyperparameter values using 10-fold cross-validation repeated 10 times on the training set, wherein 10-fold cross-validation divides the training set into 10 subsets of equal size, selects one subset as the validation set each time, and uses the union of the remaining 9 subsets as the sub-training set;
with the hyperparameter values fixed, fitting the algorithm to the sub-training set to produce a model, computing the performance metric of the model on the validation set, using each of the 10 subsets as the validation set in turn, taking the mean of the performance metric over the 10 validation sets as the 10-fold cross-validation performance of the model for those hyperparameter values, and evaluating the potential generalization ability of each algorithm by the bias and variance of the performance metrics.
6. The groundwater arsenic risk prediction method based on a machine learning model according to claim 1, wherein in step S3, the performance of the machine learning algorithm on the full-parameter data set and the spatial-parameter data set is used as the evaluation criterion, and the performance used as the evaluation criterion includes the mean values of accuracy, sensitivity, specificity and area under the ROC curve.
7. The groundwater arsenic risk prediction method based on a machine learning model according to claim 3, wherein in step S4, performing, based on the algorithms evaluated and selected in step S3, a hyperparameter sensitivity test on the model tasks of step S2, determining the hyperparameter ranges to be tuned, optimizing the hyperparameter tuning process of all subsequent models, and completing the construction of the full-parameter or spatial-parameter models of all model tasks specifically comprises:
combining grid search with cross-validation, verifying the sensitivity of the hyperparameters of the selected algorithms under different data sets for three typical model tasks, so as to optimize the hyperparameter tuning process, wherein the three typical model tasks are: groundwater arsenic statistical modeling on the full-parameter data set of the national-scale research area, groundwater arsenic statistical modeling on the spatial-parameter data set of the national-scale research area, and groundwater arsenic statistical modeling on the full-parameter data set of the northwest regional-scale research area;
for each hyperparameter to be tuned, selecting a finite number of representative values according to the characteristics of the research object, forming a grid in the multi-dimensional hyperparameter space from all combinations of these values, and trying every candidate hyperparameter combination by traversing all grid nodes, thereby achieving a compromise between feasibility and comprehensiveness;
and traversing all grid points in the hyperparameter space using 10-fold cross-validation repeated 10 times in combination with grid search, then comparing the performance metrics of all hyperparameter combinations and selecting the hyperparameter values with the highest performance metric.
CN202110698696.XA 2021-06-23 2021-06-23 Groundwater arsenic risk prediction method based on machine learning model Active CN113378473B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110698696.XA CN113378473B (en) 2021-06-23 2021-06-23 Groundwater arsenic risk prediction method based on machine learning model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110698696.XA CN113378473B (en) 2021-06-23 2021-06-23 Groundwater arsenic risk prediction method based on machine learning model

Publications (2)

Publication Number Publication Date
CN113378473A true CN113378473A (en) 2021-09-10
CN113378473B CN113378473B (en) 2024-01-12

Family

ID=77578854

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110698696.XA Active CN113378473B (en) 2021-06-23 2021-06-23 Groundwater arsenic risk prediction method based on machine learning model

Country Status (1)

Country Link
CN (1) CN113378473B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108896996A (en) * 2018-05-11 2018-11-27 中南大学 A kind of Pb-Zn deposits absorbing well, absorption well water sludge interface ultrasonic echo signal classification method based on random forest
CN110765418A (en) * 2019-10-09 2020-02-07 清华大学 Intelligent set evaluation method and system for basin water and sand research model
CN112101796A (en) * 2020-09-16 2020-12-18 清华大学合肥公共安全研究院 Water environment pollution risk comprehensive perception and recognition system
JPWO2020255413A1 (en) * 2019-06-21 2020-12-24
CN112382352A (en) * 2020-10-30 2021-02-19 华南理工大学 Method for quickly evaluating structural characteristics of metal organic framework material based on machine learning

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114970307A (en) * 2022-02-25 2022-08-30 海仿(上海)科技有限公司 General reverse calculation method applied to high-end equipment material design optimization
CN114970307B (en) * 2022-02-25 2024-06-04 海仿(上海)科技有限公司 General reverse calculation method applied to material design optimization of high-end equipment
CN115481750A (en) * 2022-09-20 2022-12-16 云南省农业科学院农业环境资源研究所 On-line prediction method and system for nitrate nitrogen in underground water based on machine learning
CN115878599A (en) * 2022-10-26 2023-03-31 河北雄安睿天科技有限公司 Sewage industry data cleaning method
CN115982139A (en) * 2022-11-23 2023-04-18 中国地质大学(北京) Mining area topographic data cleaning method and device, electronic equipment and storage medium
CN117010274A (en) * 2023-07-11 2023-11-07 中国地质科学院水文地质环境地质研究所 Intelligent early warning method for harmful elements in underground water based on integrated incremental learning
CN117010274B (en) * 2023-07-11 2024-05-10 中国地质科学院水文地质环境地质研究所 Intelligent early warning method for harmful elements in underground water based on integrated incremental learning
CN117333321A (en) * 2023-09-27 2024-01-02 中山大学 Agricultural irrigation water consumption estimation method, system and medium based on machine learning
CN117333321B (en) * 2023-09-27 2024-07-09 中山大学 Agricultural irrigation water consumption estimation method, system and medium based on machine learning

Also Published As

Publication number Publication date
CN113378473B (en) 2024-01-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant