CN113378473B

CN113378473B - Groundwater arsenic risk prediction method based on machine learning model

Info

Publication number: CN113378473B
Application number: CN202110698696.XA
Authority: CN
Inventors: 曹文庚; 付宇; 高媛媛; 王小东
Original assignee: Institute of Hydrogeology and Environmental Geology CAGS
Current assignee: Institute of Hydrogeology and Environmental Geology CAGS
Priority date: 2021-06-23
Filing date: 2021-06-23
Publication date: 2024-01-12
Anticipated expiration: 2041-06-23
Also published as: CN113378473A

Abstract

The invention provides a groundwater arsenic risk prediction method based on machine learning models, comprising the following steps: S1, collecting data; S2, defining model tasks according to the spatial scale of the study area and the degree to which groundwater arsenic exceeds the standard; S3, evaluating how well each machine learning algorithm matches the data, using the algorithm's performance on the full-parameter data set and the spatial-parameter data set as the evaluation criterion; S4, using the algorithms evaluated and screened in step S3, performing a hyperparameter sensitivity test on the model tasks of step S2, determining the hyperparameters to be tuned and their ranges, optimizing the hyperparameter tuning procedure for all subsequent models, and completing construction of the full-parameter or spatial-parameter models for all model tasks; S5, predicting groundwater arsenic risk using the constructed probability estimation model. The invention selects algorithms through a machine-learning-based evaluation, optimizes the statistical modeling workflow, and constructs a high-precision statistical model of groundwater arsenic.

Description

Groundwater arsenic risk prediction method based on machine learning model

Technical Field

The invention relates to the technical field of groundwater safety monitoring, and in particular to a groundwater arsenic risk prediction method based on a machine learning model.

Background

Groundwater is a major source of domestic drinking water in many countries and regions, and the potential risk of arsenic exposure is a serious hazard to human health. Arsenic in groundwater is colorless and odorless and is therefore difficult to detect. At present, technologies and equipment for remediating arsenic-contaminated groundwater are not widely deployed, and centralized water supply projects still leave areas uncovered. Particularly in rural areas without centralized water supply, exposure to arsenic in groundwater has become one of the most intractable problems for rural drinking water safety. After entering the human body, arsenic in groundwater causes harm by denaturing proteins and enzymes, damaging cells and disrupting gene regulation, producing symptoms of acute and chronic poisoning. Long-term drinking of high-arsenic groundwater can damage multiple organs and systems, causing skin lesions, cardiovascular and cerebrovascular diseases and nervous system diseases, and can in turn cause cancer in multiple organs, with a latency period that can range from about ten years to several decades.

Because the spatial distribution of arsenic in groundwater is heterogeneous, governments and related institutions need a large number of monitoring samples and analytical measurements to implement policies that secure the water supply, which consumes substantial manpower and material resources and incurs a time cost that cannot be compressed. Therefore, where high-density groundwater quality surveys cannot be carried out comprehensively to ensure safe water use, it is of important social significance to use statistical modeling to study and predict the distribution of high-arsenic groundwater in countries and regions where arsenic contamination of groundwater is common and widespread, to anticipate where groundwater arsenic exceeds the standard in unsampled areas, and to provide a reliable scientific basis for sampling surveys and water-use decisions. Meanwhile, statistical modeling research based on big data can systematically analyze the spatial heterogeneity of groundwater arsenic at multiple scales, and describe and infer the formation processes and key controlling factors of groundwater arsenic at different scales, which is of important scientific significance.

At present, most statistical-model-based analyses of groundwater arsenic contamination follow a relatively fixed workflow, which mainly comprises the following steps: 1. building a model for the data using one or two statistical methods; 2. calculating model performance under different statistical metrics; 3. calculating a probability prediction distribution based on the model; 4. interpreting the model and the results. Because such analyses lack prior evaluation and the necessary justification when the algorithm is selected, the selected algorithm may be unsuitable for the groundwater arsenic data of the study area, so that the resulting model performs poorly, which in turn makes both the risk prediction and the interpretation of the model results unreliable.

Disclosure of Invention

The invention aims to provide a groundwater arsenic risk prediction method based on machine learning models, which selects algorithms through a machine-learning-based evaluation, optimizes the statistical modeling workflow, and builds a high-precision statistical model of groundwater arsenic that captures data characteristics more comprehensively, thereby providing more reliable prediction and analysis of results.

In order to achieve the above object, the present invention provides the following solutions:

a groundwater arsenic risk prediction method based on machine learning models, comprising the following steps:

S1, data collection: selecting predictor variables potentially associated with groundwater arsenic exceeding the standard, collecting data for these predictor variables, and organizing the data into a full-parameter data set and a spatial-parameter data set;

S2, defining model tasks: defining model tasks according to the spatial scale of the study area and the degree to which groundwater arsenic exceeds the standard;

S3, establishing and applying an algorithm selection mechanism: using the performance of each machine learning algorithm on the full-parameter data set and the spatial-parameter data set as the evaluation criterion to assess how well the algorithm matches the data, with the following specific steps: selecting several candidate machine learning algorithms; dividing the data set several times to obtain different training set-test set combinations; training and testing on each training set-test set combination and recording the test performance of each algorithm; comprehensively assessing, from the mean and range of the performance metrics, each algorithm's potential generalization ability for groundwater arsenic statistical modeling on the full-parameter data set and the spatial-parameter data set; and screening out the algorithms with sufficient potential for the modeling in step S4;

S4, constructing a probability estimation model: using the algorithms evaluated and screened in step S3, performing a hyperparameter sensitivity test on the model tasks of step S2, determining the hyperparameters to be tuned and their ranges, optimizing the hyperparameter tuning procedure for all subsequent models, and completing construction of the full-parameter or spatial-parameter models for all model tasks;

S5, predicting groundwater arsenic risk using the probability estimation model constructed in step S4.

Further, in step S1, the spatial-parameter data set includes data corresponding to geological, geographical and hydrological parameters, and the full-parameter data set includes data corresponding to water chemistry, geological, geographical and hydrological parameters.

Further, in step S2, model tasks are defined according to the spatial scale of the study area and the degree to which groundwater arsenic exceeds the standard, specifically:

the spatial scales of the study area include: a national-scale study area, a northwest-scale study area, a northeast-scale study area, a south-scale study area, and a south-shallow-area-scale study area;

the degree of groundwater arsenic exceedance is defined by three thresholds: 5 μg/L, 10 μg/L and 50 μg/L;

these are combined with 2 different types of data sets: a full-parameter data set and a spatial-parameter data set;

the model tasks are thus defined as 30 different models (5 spatial scales × 3 thresholds × 2 data set types), each to be built with the different algorithms.
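As an illustration of how the 30 model tasks arise, the three factors above can be crossed programmatically. The following is a minimal Python sketch; the label strings and the function name `enumerate_model_tasks` are illustrative assumptions, while the factor levels themselves come from the text.

```python
from itertools import product

# Factor levels taken from the text; the label strings are illustrative.
SCALES = ["national", "northwest", "northeast", "south", "south_shallow"]
THRESHOLDS_UG_PER_L = [5, 10, 50]
DATASETS = ["full_parameter", "spatial_parameter"]


def enumerate_model_tasks():
    """Return every (scale, threshold, data set) combination as a dict."""
    return [
        {"scale": scale, "threshold_ug_per_l": threshold, "dataset": dataset}
        for scale, threshold, dataset in product(SCALES, THRESHOLDS_UG_PER_L, DATASETS)
    ]


tasks = enumerate_model_tasks()
print(len(tasks))  # 5 scales x 3 thresholds x 2 data set types
```

Each algorithm retained after screening would then be fitted once per task.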

Further, in step S3, the candidate machine learning algorithms selected include: logistic regression, random forests, and boosted regression trees.

Further, in step S3, dividing the data set several times to obtain different training set-test set combinations, training and testing on each combination and recording the test performance of each algorithm, and comprehensively assessing from the mean and range of the performance metrics each algorithm's potential generalization ability for groundwater arsenic statistical modeling on the full-parameter data set and the spatial-parameter data set specifically comprises:

the full-parameter data set and the spatial-parameter data set are each randomly split into a training set and a test set at a ratio of 7:3;

within the training set, 10-fold cross-validation repeated 10 times is used to estimate the generalization ability under a given set of hyperparameter values; 10-fold cross-validation means dividing the training set into 10 subsets of equal size, selecting one subset at a time as the validation set, and using the union of the remaining 9 subsets as the sub-training set;

with the hyperparameter values set, the algorithm is fitted to the sub-training set to produce a model, and the model's performance metric on the validation set is calculated; each of the 10 subsets serves in turn as the validation set, and the mean of the performance metrics over the 10 validation sets is taken as the performance metric of the model for that set of hyperparameter values in the 10-fold cross-validation; the potential generalization ability of each algorithm is then estimated from the bias and variance of the performance metrics.
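The repeated 10-fold cross-validation described above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the embodiment: the function names, the `fit`/`score` callable interface and the majority-class stand-in model are all assumptions introduced for the example.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, rng):
    """Shuffle the sample indices and deal them into k folds of (nearly) equal size."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def repeated_cv_score(X, y, fit, score, k=10, repeats=10, seed=0):
    """Mean validation score over `repeats` independent random k-fold partitions."""
    rng = random.Random(seed)
    repeat_means = []
    for _ in range(repeats):
        folds = k_fold_indices(len(X), k, rng)
        fold_scores = []
        for i, val_idx in enumerate(folds):
            # The sub-training set is the union of the other k - 1 folds.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
            fold_scores.append(
                score(model, [X[j] for j in val_idx], [y[j] for j in val_idx])
            )
        repeat_means.append(mean(fold_scores))
    return mean(repeat_means)


# Hypothetical stand-in model: always predict the majority class of the sub-training set.
def fit_majority(X_train, y_train):
    return int(2 * sum(y_train) >= len(y_train))


def accuracy(model, X_val, y_val):
    return sum(1 for label in y_val if label == model) / len(y_val)


X = [[float(i)] for i in range(100)]
y = [1] * 70 + [0] * 30
cv_accuracy = repeated_cv_score(X, y, fit_majority, accuracy)
print(round(cv_accuracy, 3))
```

Passing `fit` and `score` as callables keeps the partitioning logic independent of the algorithm being assessed, so the same routine can serve every candidate algorithm.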

Further, in step S3, the performance metrics used as the evaluation criterion for the machine learning algorithms on the full-parameter data set and the spatial-parameter data set include accuracy, sensitivity, specificity and the mean ROC value.

Further, in step S4, using the algorithms evaluated and screened in step S3, performing a hyperparameter sensitivity test on the model tasks of step S2, determining the hyperparameters to be tuned and their ranges, optimizing the hyperparameter tuning procedure for all subsequent models, and completing construction of the full-parameter or spatial-parameter models for all model tasks specifically comprises:

by combining grid search with cross-validation, the sensitivity of the hyperparameters of the screened algorithms on different data sets is verified and evaluated for three typical model tasks, so as to optimize the hyperparameter tuning procedure, the three typical model tasks being: groundwater arsenic statistical modeling on the full-parameter data set of the national-scale study area, groundwater arsenic statistical modeling on the spatial-parameter data set of the national-scale study area, and groundwater arsenic statistical modeling on the full-parameter data set of the northwest-scale study area;

for each hyperparameter to be tuned, a limited number of representative values are selected according to the characteristics of the study object, a grid in the multidimensional hyperparameter space is formed from all combinations of these values, and every candidate hyperparameter combination is tried by traversing all nodes of the grid, so as to balance feasibility against comprehensiveness;

in combination with the grid search, 10-fold cross-validation repeated 10 times is applied at every node of the grid in the hyperparameter space, and the performance metrics of all hyperparameter combinations are compared to select the hyperparameter values that give the highest performance metric.
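The grid traversal can be sketched as below. This is a hedged illustration only: `grid_search`, the candidate values and `toy_cv_metric` are assumptions made for the example, and the toy metric merely stands in for the repeated 10-fold cross-validated performance that would be computed at each node. The hyperparameter names echo the BRT hyperparameters referred to in the notes to Table 1.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Evaluate every node of the hyperparameter grid and return the best one.

    grid: dict mapping a hyperparameter name to its list of candidate values.
    evaluate: callable taking a dict of hyperparameter values and returning a
        performance metric where higher is better (for example, the mean of a
        repeated 10-fold cross-validation).
    """
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((evaluate(params), params))
    best_score, best_params = max(results, key=lambda item: item[0])
    return best_params, best_score, results


# Hypothetical candidate values, chosen only for illustration.
example_grid = {"interaction_depth": [1, 3, 5], "n_minobsinnode": [5, 10, 20]}


def toy_cv_metric(params):
    """Made-up stand-in for a cross-validated metric; it peaks at (3, 10)."""
    return (
        1.0
        - 0.01 * (params["interaction_depth"] - 3) ** 2
        - 0.001 * (params["n_minobsinnode"] - 10) ** 2
    )


best_params, best_score, all_results = grid_search(example_grid, toy_cv_metric)
print(best_params, best_score)
```

Because the number of nodes is the product of the numbers of candidate values, restricting each hyperparameter to a few representative values is what keeps the exhaustive traversal feasible.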

According to the specific embodiments provided by the invention, the invention discloses the following technical effects: the machine-learning-based groundwater arsenic risk prediction method provided by the invention draws on several state-of-the-art machine learning algorithms and uses the collected multidimensional, large-sample data set. Starting from the practical purpose of modeling and taking performance metrics as the reference, it focuses on the degree of matching between the algorithm and the actual modeling problem, and establishes a modeling workflow suited to simulating the groundwater data of the study object. The workflow can be used to predict the spatial distribution of groundwater arsenic and to analyze the mechanisms controlling the genesis and distribution of high-arsenic groundwater at different scales, which is of great significance for research on groundwater arsenic mechanisms, groundwater resource utilization and water safety.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a machine learning model-based groundwater arsenic risk prediction method according to an embodiment of the invention;

FIG. 2 is a graph of the performance of BLR, RF and BRT on the training set in the national full-parameter model task with 10 μg/L as the threshold, assessed by accuracy, sensitivity, specificity and ROC;

FIG. 3 is a graph of the performance of BLR, RF and BRT on the test set in the national full-parameter model task with 10 μg/L as the threshold, assessed by accuracy, sensitivity, specificity and ROC.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.

Fig. 1 is a schematic flow chart of a machine learning model-based groundwater arsenic risk prediction method according to an embodiment of the invention, and as shown in fig. 1, the machine learning model-based groundwater arsenic risk prediction method according to the embodiment of the invention includes the following steps:

In step S1, the spatial-parameter data set includes data corresponding to geological, geographical and hydrological parameters, such as soil, ion exchange, river network, elevation, potential evaporation, precipitation, temperature, surface runoff, irrigation, gravity, topography, geological structure and sedimentation; the full-parameter data set includes data corresponding to water chemistry, geological, geographical and hydrological parameters. The full-parameter data set therefore contains the water chemistry parameters in addition to the spatial parameters.

In step S2, model tasks are defined according to the spatial scale of the study area and the degree to which groundwater arsenic exceeds the standard, specifically:

Take the results at a threshold of 10 μg/L as an example (Table 1). In the table, a: the model spatial scale type, where the national label refers to the national model scale; S refers to the southern region; SS refers to the southern shallow region; NE refers to the northeast region; NW refers to the northwest region. b: "over" denotes a model built on the oversampled training set. The parentheses following each model name give the result of hyperparameter tuning: for BRT, the interaction.depth and n.minobsinnode hyperparameters, respectively; for RF, the mtry hyperparameter.

Overall, as shown in Table 1, BRT and RF perform very similarly and well on every model. On the training set, both reach an accuracy of about 95%, sensitivity and specificity fluctuate with the data set, and ROC values are above 0.99. This shows that, after hyperparameter tuning with the modeling workflow of the method, both algorithms fit the training data well and capture the differences in the underlying natural system that distinguish whether groundwater arsenic exceeds the standard. Therefore, to show that the models have captured the main, general numerical patterns, the emphasis is placed on the performance of the two algorithms' models on the test set and on how that performance varies across modeling tasks.

On the test set, the accuracy of the full-parameter models for the different tasks falls mostly between 85% and 95%, and ROC values fluctuate around 0.95. Sensitivity and specificity vary with the data: the stronger the imbalance and the smaller the data set, the larger the gap between sensitivity and specificity, and the more the model's performance is biased toward predicting the majority class. Compared with the full-parameter models, the spatial-parameter models show a marked drop in all four performance metrics (Table 1), in contrast to the training set, where models built on the two data sets perform consistently.

TABLE 1 Performance of BRT and RF models in each model at a threshold of 10 μg/L

In step S3, the performance metrics used as the evaluation criterion include accuracy, sensitivity, specificity and the mean ROC value. The candidate machine learning algorithms selected are logistic regression, random forests and boosted regression trees. Accuracy, sensitivity and specificity are used to evaluate the performance of the stepwise logistic regression, random forest and boosted regression tree models. Accuracy takes values in [0, 1]; the closer it is to 1, the better the model performs. The ROC curve provides a comprehensive estimate of model performance as the classification threshold varies over [0, 1]. It is drawn by plotting the false positive rate (1 − specificity) on the abscissa against sensitivity (the true positive rate) on the ordinate as the threshold traverses [0, 1]. AUC is the area under the ROC curve and measures the model's ability to distinguish between the two classes. For a model no worse than random guessing, AUC lies in [0.5, 1]; the closer AUC is to 1.0, the better the model performs, while an AUC of 0.5 indicates the lowest performance and no practical value. Because groundwater arsenic data are strongly imbalanced, AUC (ROC), which assesses the model's ability to discriminate between the classes in terms of per-class proportions rather than counts, is better suited to a comprehensive assessment of model performance.
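To make the four metrics concrete, the sketch below computes them for a small made-up example. The function names and the toy labels and probabilities are assumptions for illustration only; AUC is computed through its rank interpretation, that is, the probability that a randomly chosen exceedance sample receives a higher predicted probability than a randomly chosen non-exceedance sample.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 marks exceedance."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def threshold_metrics(y_true, y_prob, cutoff=0.5):
    """Accuracy, sensitivity and specificity at one probability cut-off."""
    y_pred = [1 if p >= cutoff else 0 for p in y_prob]
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # 1 - false positive rate
    }


def auc(y_true, y_prob):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
metrics = threshold_metrics(y_true, y_prob)
roc_auc = auc(y_true, y_prob)
print(metrics, roc_auc)
```

Accuracy, sensitivity and specificity depend on the chosen cut-off, whereas AUC does not, which is why it serves as the threshold-independent summary in the comparison.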

In addition, in step S3, the data set is divided several times to obtain different training set-test set combinations; each combination is used for training and testing, and the test performance of each algorithm is recorded; each algorithm's potential generalization ability for groundwater arsenic statistical modeling on the full-parameter data set and the spatial-parameter data set is then assessed comprehensively from the mean and range of the performance metrics, specifically as follows:

In the study of groundwater arsenic statistical models carried out for Bangladesh, the embodiment of the invention builds multiple models for different purposes, which can be divided into regional models and a national model reflecting multi-scale differences. Therefore, to verify as broadly and efficiently as possible how well the three machine learning algorithms are likely to match the different research purposes, the invention uses the most comprehensive data scenario for algorithm selection, namely the data set with the largest amount of data and the most complete set of parameters: the national-scale groundwater arsenic statistical model considering all parameters. 10 μg/L is chosen as the threshold for deciding whether groundwater arsenic exceeds the standard.

FIG. 2 shows the performance of stepwise logistic regression (BLR), random forest (RF) and boosted regression trees (BRT) on the training set in the national full-parameter model task with 10 μg/L as the threshold, assessed by accuracy, sensitivity, specificity and ROC; FIG. 3 shows the performance of stepwise logistic regression (BLR), random forest (RF) and boosted regression trees (BRT) on the test set in the same task, assessed by the same metrics.

As can be seen from FIG. 2 and FIG. 3, in the national full-parameter groundwater arsenic modeling task, the RF and BRT algorithms perform better and have lower error and variance; their performance metrics on the training set and the test set are very close, and on the test set the mean accuracy, sensitivity, specificity and ROC are all about 0.9, with an interquartile range within 0.05. By contrast, BLR performs worse than the two ensemble regression-tree models on every metric, and its performance metrics differ markedly between the training set and the test set. As shown in FIG. 3, the mean accuracy, sensitivity, specificity and ROC of stepwise logistic regression on the test set are about 0.7, but the interquartile range is about 0.2. This shows that, for the Bangladesh data, the contribution of variance to the generalization error of the model cannot be neglected. In particular, the generalization ability of the BLR model for Bangladesh is affected mainly by model variance, which has not been considered in existing groundwater arsenic statistical modeling studies based on stepwise logistic regression. Since model variance is affected mainly by the spatial distribution characteristics of the data, it is necessary to extend this algorithm comparison procedure to groundwater arsenic statistical modeling studies in regions whose data quality (distribution uniformity and data density) is similar to or worse than that of Bangladesh.

In the algorithm comparison based on the Bangladesh groundwater arsenic data, the RF and BRT algorithms perform significantly better than stepwise logistic regression. Random forests and boosted regression trees are therefore more suitable than logistic regression for statistical model studies of high-arsenic groundwater in South and Southeast Asian regions such as Bangladesh and Cambodia. Accordingly, in this embodiment, step S3 screens out the RF and BRT algorithms.
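The comparison by mean and interquartile range over repeated 7:3 splits can be sketched as follows. This is an illustration under stated assumptions: the two stand-in "algorithms" are deliberately trivial and are not BLR, RF or BRT, the data are synthetic, and all function names are hypothetical.

```python
import random
from statistics import mean, quantiles


def split_70_30(n_samples, rng):
    """Random 7:3 split of the sample indices into (train, test)."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = round(0.7 * n_samples)
    return indices[:cut], indices[cut:]


def summarize(scores):
    """Mean and interquartile range of a list of test scores."""
    q1, _, q3 = quantiles(scores, n=4)
    return {"mean": mean(scores), "iqr": q3 - q1}


def compare_algorithms(algorithms, X, y, n_splits=20, seed=0):
    """Score every algorithm on the same sequence of repeated 7:3 splits."""
    rng = random.Random(seed)
    scores = {name: [] for name in algorithms}
    for _ in range(n_splits):
        train, test = split_70_30(len(X), rng)
        for name, (fit, predict) in algorithms.items():
            model = fit([X[i] for i in train], [y[i] for i in train])
            hits = sum(1 for i in test if predict(model, X[i]) == y[i])
            scores[name].append(hits / len(test))
    return {name: summarize(values) for name, values in scores.items()}


# Hypothetical stand-in algorithms, each given as a (fit, predict) pair.
def fit_midpoint(X_train, y_train):
    """Cut-off halfway between the two class means of the single feature."""
    ones = [x[0] for x, label in zip(X_train, y_train) if label == 1]
    zeros = [x[0] for x, label in zip(X_train, y_train) if label == 0]
    return (mean(ones) + mean(zeros)) / 2


def predict_midpoint(cutoff, x):
    return int(x[0] > cutoff)


def fit_majority(X_train, y_train):
    return int(2 * sum(y_train) >= len(y_train))


def predict_majority(label, x):
    return label


# Synthetic one-feature data: class 1 is centred at 4, class 0 at 0.
data_rng = random.Random(42)
y = [1] * 100 + [0] * 100
X = [[data_rng.gauss(4.0 * label, 1.0)] for label in y]

summary = compare_algorithms(
    {
        "midpoint_rule": (fit_midpoint, predict_midpoint),
        "majority_class": (fit_majority, predict_majority),
    },
    X,
    y,
)
for name, stats in summary.items():
    print(name, round(stats["mean"], 3), round(stats["iqr"], 3))
```

A high mean with a narrow interquartile range corresponds to the low-bias, low-variance behavior attributed to RF and BRT above, whereas a wide interquartile range signals the variance-dominated behavior attributed to BLR.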

In step S4, using the algorithms evaluated and screened in step S3, a hyperparameter sensitivity test is performed on the model tasks of step S2, the hyperparameters to be tuned and their ranges are determined, the hyperparameter tuning procedure for all subsequent models is optimized, and construction of the full-parameter or spatial-parameter models for all model tasks is completed, specifically as follows:

In this embodiment, the RF and BRT algorithms are selected as the modeling methods for the Bangladesh groundwater arsenic statistical model study. Hyperparameter tuning is the final step in adjusting model complexity, and it remains necessary once the algorithm and the parameter space have been determined. The hyperparameters to be considered and their candidate values are selected by combining the existing literature with the characteristics of the study object; the hyperparameters of the RF and BRT algorithms and their candidate values are shown in Table 2.

Table 2 Hyperparameters to be tuned and their tuning ranges for boosted regression trees and random forests

In a specific embodiment, 10-fold cross-validation repeated 10 times is used within the training set to estimate the generalization ability under a given set of hyperparameter values. In 10-fold cross-validation, the target set (namely the training set during model tuning) is divided into 10 subsets of equal size; one subset at a time is selected as the validation set and the union of the remaining 9 subsets is used as the sub-training set. With the hyperparameter values set, the algorithm is fitted to the sub-training set to produce a model, and the model's performance metric on the validation set is calculated. Each of the 10 subsets serves in turn as the validation set, and the mean of the performance metrics over the 10 validation sets is taken as the performance metric of the model for that set of hyperparameter values in the 10-fold cross-validation, that is, the estimate of generalization ability. To avoid errors introduced by a particular partition, the target set is randomly partitioned ten times, each time into 10 subsets of equal size, and a 10-fold cross-validation is carried out for each partition. The mean of the performance metrics obtained from the 10 runs of 10-fold cross-validation is taken as the performance metric of the model for that set of hyperparameter values under 10-times-repeated 10-fold cross-validation.

The methods for the subsequent modeling tasks and the hyperparameter tuning procedure have thus been determined by the preceding algorithm selection procedure and hyperparameter sensitivity analysis. In each modeling task, the data directly related to the task are therefore split into a training set and a test set at a ratio of 7:3, the boosted regression tree and random forest algorithms are applied, the optimized hyperparameter tuning procedure is carried out by cross-validation to obtain the optimal model, and the performance metrics of the model on the training set and the test set are calculated. For the 3 thresholds (5 μg/L, 10 μg/L, 50 μg/L), 5 study areas (the national, northwest, northeast, south and south-shallow-area-scale study areas) and 2 types of data set (the full-parameter data set and the spatial-parameter data set), 30 different models are built with boosted regression trees and 30 with random forests. In addition, because the Bangladesh groundwater arsenic data are somewhat imbalanced, for the data sets with few samples or severe imbalance, 26 further models are built on oversampled training sets with random forests and 26 with boosted regression trees. A total of 112 models are built. In all of the above, the random forest and boosted regression tree algorithms are implemented with the randomForest and gbm packages in the R language.
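The stated total can be checked with a short calculation. It assumes that the 26 oversampled models are counted per algorithm, which is the reading under which the figures in the text add up to 112:

```latex
\underbrace{2 \times (3 \times 5 \times 2)}_{\text{models on original training sets}}
\;+\;
\underbrace{2 \times 26}_{\text{models on oversampled training sets}}
\;=\; 60 + 52 \;=\; 112
```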

Model performance was evaluated with 4 different performance metrics (accuracy, sensitivity, specificity and area under the ROC curve). The two methods perform similarly on the training and test sets and show similar trends across data sets, but they still differ in how they handle imbalanced data. In regions where groundwater arsenic exceedance is the majority class, the random forest obtains higher sensitivity and lower specificity, whereas in regions where exceedance is the minority class, it obtains lower sensitivity and higher specificity. This may result from the parallel structure of the random forest itself: each tree is fitted to a bootstrap sample that follows the same distribution as the regional data set, so every tree is affected by the imbalance. The boosted regression tree, by contrast, has a serial structure: apart from the first tree, which learns from the imbalanced data directly, each added tree learns the errors of the existing model, and after a number of iterations the distribution of those errors is no longer governed by the data distribution, so BRT achieves a more balanced simulation on imbalanced data. Therefore, all subsequent probability simulation and prediction are based on the results of the boosted regression tree model.

To reduce the influence of imbalanced data on the model, and thereby reduce the bias of model estimation and prediction, oversampling is added as an additional preprocessing step that expands the original training set so as to balance the data set. However, the output probabilities of a model trained on oversampled data shift with the altered data set and therefore cannot be used for probability modeling. Models built on the oversampled data sets achieve accuracy and ROC values similar to those of the models built on the original data sets, but the gap between sensitivity and specificity is greatly reduced, and the random forest and the boosted regression tree obtain similar sensitivity and specificity.

Given the excellent performance of the random forest and boosted regression tree models across the modeling tasks, both can serve as the model basis for subsequent discussion. Because of its stability on imbalanced data, the boosted regression tree model built without oversampling is used to estimate probabilities and to draw the probability prediction map.
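The text does not state which oversampling scheme was used. As one possible reading, the sketch below applies simple random oversampling with replacement, duplicating minority-class rows of the training set until the classes are equally represented; the function name and the toy data are assumptions for illustration.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every class
    has as many rows as the largest class (assumed scheme, for illustration)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Toy training set: 2 exceedance samples (label 1) against 8 non-exceedance samples.
X_train = [[i] for i in range(10)]
y_train = [1, 1] + [0] * 8
X_bal, y_bal = random_oversample(X_train, y_train)
print(Counter(y_bal))
```

Only the training set is expanded; the test set keeps its original class proportions so that the reported metrics still reflect the real data distribution.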

The algorithm selection procedure provided by the invention takes the decomposition of generalization error as its theoretical basis, comprehensively considers the potential generalization ability of each algorithm on the groundwater arsenic data of the study area from the perspectives of bias and variance, and selects the algorithms suited to the research task before the formal modeling tasks are set up. Compared with the traditional groundwater arsenic statistical modeling workflow, in which methods are compared only after the modeling task has been completed, the algorithm selection procedure additionally takes the characteristics of spatial data modeling into account and estimates how well each algorithm matches the data at the outset of the modeling task. This avoids carrying out modeling tasks with unsuitable algorithms, greatly reduces unnecessary computation, and allows more candidate algorithms to be considered. The embodiment provided by the invention shows that boosted regression trees and random forests perform better, that the influence of variance on model performance cannot be ignored, and that the generalization ability of the stepwise logistic regression model is affected mainly by variance; for regions whose data quality is similar to or worse than that of Bangladesh, the algorithm selection procedure should therefore be used to estimate the generalization ability of the model from the perspectives of both variance and bias.
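The description does not write out the decomposition it relies on. For reference, the standard bias-variance decomposition under squared loss is given below, assuming $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, and with expectations taken over training sets and noise; decompositions for classification losses take a different form.

```latex
\mathbb{E}\!\left[\bigl(y - \hat{f}(x)\bigr)^{2}\right]
= \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^{2}}_{\text{bias}^{2}}
+ \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^{2}\right]}_{\text{variance}}
+ \underbrace{\sigma^{2}}_{\text{irreducible noise}}
```

In the procedure above, the mean of a performance metric over repeated splits speaks to the bias term, and its spread (range or interquartile range) speaks to the variance term.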

For the existing data, the invention analyzes the sensitivity of 4 hyperparameters of the boosted regression tree algorithm, using accuracy, sensitivity, specificity and area under the ROC curve as performance measures and relying on repeated k-fold cross-validation and grid search. The analysis verifies that the boosted regression tree algorithm has excellent fitting capability on the Bengal groundwater arsenic data set and can describe the data characteristics and the natural system underlying them. From the perspective of method interpretation, the hyperparameter tuning process of the boosted regression tree is optimized based on model performance, which greatly improves modeling efficiency.
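The combination of grid search with repeated k-fold cross-validation can be sketched as follows. This is a minimal, library-free illustration rather than the implementation used in the invention. The names `repeated_kfold`, `grid_search` and `score_fn` are hypothetical; `score_fn(params, train_idx, val_idx)` stands for any user-supplied routine that fits a model with the given hyperparameter values on the training indices and returns one performance metric on the validation indices. The four hyperparameters analyzed are not named in this section, so the grid keys in the usage example are placeholders.

```python
import itertools
import random
import statistics


def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation
    repeated `repeats` times, reshuffling the samples each repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]  # k disjoint subsets
        for i in range(k):
            val_idx = folds[i]
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train_idx, val_idx


def grid_search(param_grid, score_fn, n_samples, k=10, repeats=10, seed=0):
    """Evaluate every grid point with repeated k-fold cross-validation and
    return the hyperparameter values with the highest mean score, together
    with the mean score of every grid point."""
    names = sorted(param_grid)
    mean_scores = {}
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        # The same seed gives every grid point identical splits.
        scores = [
            score_fn(params, train_idx, val_idx)
            for train_idx, val_idx in repeated_kfold(n_samples, k, repeats, seed)
        ]
        mean_scores[values] = statistics.mean(scores)
    best_values = max(mean_scores, key=mean_scores.get)
    return dict(zip(names, best_values)), mean_scores


# Placeholder score function: it ignores the data and simply peaks at
# param_a = 2, param_b = 0.5, so that the sketch runs on its own.
def toy_score(params, train_idx, val_idx):
    return -(params["param_a"] - 2) ** 2 - (params["param_b"] - 0.5) ** 2


best, all_scores = grid_search(
    {"param_a": [1, 2, 3], "param_b": [0.1, 0.5]}, toy_score, n_samples=50
)
```

Inspecting `all_scores` across the grid, rather than only `best`, is what allows a sensitivity analysis: hyperparameters whose values barely change the mean score can be fixed, which narrows the grid for subsequent models.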

The probability estimation model has excellent performance, and the importance calculation model has more balanced sensitivity and specificity. The final risk prediction map not only captures the large- and medium-scale spatial distribution characteristics of groundwater arsenic, as the kriging interpolation method does, but also has higher resolution and can predict the fine distribution at medium and small scales. The stable and excellent performance of the random forest and boosted regression tree algorithms on the different modeling tasks for Bengal shows that both algorithms have the potential to be extended to groundwater arsenic statistical modeling in other affected areas.
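The four performance measures referred to in this description (accuracy, sensitivity, specificity and area under the ROC curve) can be computed from predicted probabilities as sketched below. The function names and the 0.5 probability cut-off are assumptions made for illustration; the area under the ROC curve is computed here through its rank interpretation, that is, the fraction of positive-negative pairs in which the positive sample receives the higher probability.

```python
def confusion_metrics(y_true, y_prob, threshold=0.5):
    """Return accuracy, sensitivity and specificity for binary labels
    (1 = arsenic above the threshold, 0 = below), classifying a sample
    as positive when its predicted probability reaches `threshold`."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def roc_auc(y_true, y_prob):
    """Area under the ROC curve via pairwise comparison: the share of
    positive/negative pairs ranked correctly, counting ties as half."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present in y_true")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Sensitivity and specificity depend on the chosen cut-off while the area under the ROC curve does not, which is why a model can keep a similar ROC value while the balance between sensitivity and specificity shifts.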

The groundwater arsenic risk prediction method based on a machine learning model provided by the invention draws on several state-of-the-art machine learning algorithms and uses the collected multidimensional, large-sample data set. Starting from the practical purpose of modeling and taking performance measures as the reference, it focuses on the matching degree between the algorithm and the practical modeling problem and establishes a modeling flow suited to simulating the groundwater data of the research object. The modeling flow can be used to predict the spatial distribution of groundwater arsenic and to analyze the genesis of high-arsenic groundwater and the mechanisms controlling its distribution at different scales. The method is therefore of significance for research on groundwater arsenic mechanisms, groundwater resource utilization and water safety.

The principles and embodiments of the present invention have been described herein with reference to specific examples. The description of these examples is intended only to assist in understanding the method of the present invention and its core ideas. Modifications made by those of ordinary skill in the art in light of the present teachings also fall within the scope of the present invention. In view of the foregoing, this description should not be construed as limiting the invention.

Claims

1. A groundwater arsenic risk prediction method based on a machine learning model, characterized by comprising the following steps:

S3, establishing and evaluating an algorithm selection mechanism: the performance of each machine learning algorithm on the full parameter data set and the spatial parameter data set is used as the evaluation standard to evaluate the matching degree between the algorithm and the data, and the specific evaluation steps are as follows:

selecting a plurality of potential machine learning algorithms, specifically: logistic regression, random forest and boosted regression tree;

dividing the data set several times to obtain different training set-test set combinations; training and testing on each training set-test set combination and recording the test performance of each algorithm; establishing the potential generalization capability of each algorithm for groundwater arsenic statistical modeling based on the full parameter data set and the spatial parameter data set by comprehensively considering the mean and the range of the performance metrics; specifically:

setting hyperparameter values and fitting the algorithm to a sub-training set to generate a model, calculating the performance metrics of the model on a validation set, traversing the 10 subsets so that each serves as the validation set in turn, taking the average of the performance metrics over the 10 validation sets as the performance metric value of the model for the set hyperparameter values under 10-fold cross-validation, and evaluating the potential generalization capability of each algorithm using the bias and variance of the performance metrics;

screening out the algorithms with sufficiently good potential to perform the modeling in step S4; specifically:

taking the performance of the machine learning algorithm on the full parameter data set and the spatial parameter data set as the evaluation standard, wherein the performance used as the evaluation standard comprises accuracy, sensitivity, specificity and the mean ROC value;

S4, constructing a probability estimation model: based on the algorithms evaluated and screened in step S3, performing a hyperparameter sensitivity test on the model tasks of step S2, determining the hyperparameters to be tuned and their ranges, optimizing the hyperparameter tuning flow of all subsequent models, and completing the construction of the full parameter or spatial parameter models of all model tasks; specifically:

combining grid search with 10-fold cross-validation repeated 10 times to traverse all grid points in the hyperparameter space, and comparing the performance metrics corresponding to all hyperparameter combinations to select the hyperparameter values corresponding to the highest performance metric;

2. The machine learning model-based groundwater arsenic risk prediction method according to claim 1, wherein in step S1, the spatial parameter dataset includes data corresponding to geological parameters, geographical parameters and hydrologic parameters, and the full parameter dataset includes data corresponding to hydro-chemical parameters, geological parameters, geographical parameters and hydrologic parameters.

3. The machine learning model-based groundwater arsenic risk prediction method according to claim 1, wherein in step S2, model tasks are defined according to the spatial scale of the study area of the study target and the degree to which groundwater arsenic exceeds the standard, specifically: