CN117010274A - Intelligent early warning method for harmful elements in underground water based on integrated incremental learning

Info

Publication number: CN117010274A
Application number: CN202310840526.XA
Authority: CN
Inventors: 曹文庚; 付宇; 潘登; 南天; 任宇; 张娟娟
Original assignee: Henan Provincial Natural Resources Monitoring And Land Improvement Institute; North China University of Water Resources and Electric Power; Institute of Hydrogeology and Environmental Geology CAGS
Current assignee: Henan Provincial Natural Resources Monitoring And Land Improvement Institute; North China University of Water Resources and Electric Power; Institute of Hydrogeology and Environmental Geology CAGS
Priority date: 2023-07-11
Filing date: 2023-07-11
Publication date: 2023-11-07
Anticipated expiration: 2043-07-11

Abstract

The invention provides an intelligent early warning method for harmful elements in groundwater based on integrated incremental learning, which comprises the following steps: collecting data; constructing and cleaning a data set; constructing candidate base learner models; establishing and evaluating an algorithm selection mechanism for the base learner and the meta-learner; constructing an integrated probability estimation model; constructing an integrated incremental probability estimation model by adding data of different time periods into the constructed integrated probability estimation model; and carrying out risk prediction on the harmful elements in the groundwater by using the constructed integrated incremental probability estimation model. The invention selects algorithms on the basis of machine learning performance, optimizes the workflow for building the statistical model, and constructs a high-precision integrated incremental statistical model of harmful elements in groundwater.

Description

Intelligent early warning method for harmful elements in underground water based on integrated incremental learning

Technical Field

The invention relates to the technical field of underground water safety monitoring, in particular to an intelligent early warning method for harmful elements in underground water based on integrated incremental learning.

Background

Groundwater is one of the important water resources and is important for human life and industrial and agricultural production. In many areas of our country, groundwater is still the most important water supply source, accounting for 1/5 of the total water supply in the country.

Balancing groundwater pollution against groundwater quality is a serious challenge in the rational development and use of groundwater. To protect the groundwater resources of a region, the potential groundwater pollution risk factors of the region and its natural capacity to resist contamination must first be evaluated, so that the places where groundwater pollution is likely to occur are known; reasonable groundwater protection measures can then be formulated according to the evaluation result.

At present, no technology exists that can accurately evaluate groundwater pollution and give early warning of it.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention aims to provide an intelligent early warning method for harmful elements in underground water based on integrated incremental learning.

In order to achieve the above object, the present invention provides the following solutions:

an intelligent early warning method for harmful elements in underground water based on integrated incremental learning comprises the following steps:

determining predictor variables potentially related to contamination of groundwater by harmful elements according to groundwater-related literature;

preprocessing the data of the predictor variables to obtain a data set;

defining a model task according to the degree to which harmful elements in the groundwater exceed the standard, and constructing an early warning model of the harmful elements in the groundwater by adopting a machine learning algorithm;

selecting a plurality of candidate base learner and meta-learner algorithms, dividing the data set to obtain training set-test set combinations, training and testing the base learner and meta-learner algorithms on the divided combinations, recording the test performance, and using the mean and the range of the performance metrics to assess the potential generalization ability of each algorithm for building a groundwater harmful element early warning model from the data, so as to obtain a screened base learner and a screened meta-learner;

performing a hyperparameter sensitivity test on the groundwater harmful element early warning model by using the screened base learner and meta-learner to determine the hyperparameters to be tuned and their ranges, optimizing the hyperparameter tuning workflow of all subsequent models, and completing construction of an integrated probability estimation model with a two-layer structure;

based on the integrated probability estimation model, continuously adding data from different time periods into the constructed integrated probability estimation model to obtain an incremental integrated probability estimation model;

and predicting the risk of harmful elements in the groundwater by using the incremental integrated probability estimation model.

Preferably, the data set includes air temperature, precipitation, actual evapotranspiration, distance to river, cumulative amplitude, annual change of water level, water level burial depth, hydraulic gradient, elevation, Yellow River breach, clay layer, clay-to-sand ratio, Quaternary landform, water enrichment, precipitation infiltration coefficient, permeability coefficient, specific yield, soil physicochemical characteristics, land use, vegetation index, and slope.

Preferably, preprocessing the data of the predictor variables to obtain a data set includes:

performing data cleaning and data normalization, wherein the data cleaning comprises outlier detection, missing value processing and duplicate data processing, and the data normalization includes normalization and standardization;

and performing multicollinearity analysis and recursive feature elimination on the feature data set of the predictor variables.

Preferably, defining the model task according to the degree to which harmful elements in the groundwater exceed the standard is specifically as follows:

the study area is not divided by spatial scale;

the exceedance standard for harmful elements follows the national standard for element content in drinking water;

the model task is defined as establishing 30 different models by adopting different algorithms.

Preferably, the potential machine learning algorithms include: random forest, extremely randomized trees, bagging algorithm, extreme gradient boosting, gradient boosting decision tree, adaptive boosting algorithm, support vector machine, linear discriminant analysis, k-nearest neighbor algorithm, logistic regression, and multi-layer perceptron.

Preferably, dividing the data set to obtain training set-test set combinations, training and testing the base learner and meta-learner algorithms on the divided combinations, recording the test performance, and using the mean and the range of the performance metrics to assess the potential generalization ability of each algorithm for building a groundwater harmful element early warning model from the data, so as to obtain the screened base learner and meta-learner, specifically comprises:

randomly splitting the data set at a ratio of 7:3 to generate a training set and a test set;

estimating, within the training set, the generalization ability of the hyperparameters at a given value by 5-fold cross-validation repeated 5 times; 5-fold cross-validation means dividing the training set into 5 parts of equal size, selecting one subset as the validation set each time, and taking the union of the remaining 4 subsets as the sub-training set;

fitting the algorithm to the sub-training set with the hyperparameter values set so as to generate a model, calculating the performance metrics of the model on the validation set, traversing the 5 subsets so that each serves as the validation set in turn, taking the average of the performance metrics over the 5 validation sets as the performance metric value of the model for that hyperparameter setting under 5-fold cross-validation, and estimating the potential generalization ability of each algorithm from the bias and variance of the performance metrics.

Preferably, the performance metrics used as evaluation criteria include AUC, accuracy, precision, sensitivity, specificity, and F1 value.

Preferably, performing the hyperparameter sensitivity test on the groundwater harmful element early warning model by using the base learner and the meta-learner to determine the hyperparameters to be tuned and their ranges, optimizing the hyperparameter tuning workflow of all subsequent models, and completing the construction of the integrated probability estimation model with a two-layer structure specifically comprises the following steps:

by combining a particle swarm algorithm, grid search and cross-validation, verifying and evaluating the sensitivity of the hyperparameters of the screened algorithms on the data set for each model task, so as to optimize the hyperparameter tuning workflow, specifically:

for each hyperparameter to be tuned, selecting a limited set of representative values according to the characteristics of the research object, building a grid in multidimensional space from all combinations of these values, and trying every candidate hyperparameter value by traversing all nodes, so as to strike a trade-off between feasibility and comprehensiveness;

and, in combination with grid search, traversing all points of the grid in the hyperparameter space using 5-fold cross-validation repeated 5 times, comparing the performance metrics of all hyperparameter combinations, selecting the hyperparameter values with the highest performance metric, and using the optimized model to construct the integrated probability estimation model.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

the invention provides an intelligent early warning method for harmful elements in groundwater based on integrated incremental learning. Drawing on a range of cutting-edge machine learning algorithms and on the collected multi-dimensional, large-sample data sets, and taking performance metrics as the reference in view of the practical purpose of modeling, the method builds a modeling workflow suited to simulating the groundwater data of the research object, chiefly by studying how well the constructed Stacking algorithm matches the practical modeling problem. The method can be used to predict the spatial distribution of high-arsenic groundwater and to analyze its causes and the mechanisms controlling its distribution at different scales, and is of great significance for research on groundwater mechanisms, the utilization of groundwater resources, and water safety.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a technical route provided by an embodiment of the present invention;

FIG. 3 is a diagram of feature selection results using recursive feature elimination provided by an embodiment of the present invention;

FIG. 4 is a hierarchical clustering lineage diagram according to various machine learning model performance metrics, provided by embodiments of the present invention.

FIG. 5 is a flow chart of Stacking model construction based on a base learner and a meta-learner according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.

Fig. 1 is a flowchart of a method provided by an embodiment of the present invention, and as shown in fig. 1, the present invention provides an intelligent early warning method for harmful elements in groundwater based on integrated incremental learning, including the following steps:

step 100: determining predictor variables potentially related to contamination of groundwater by harmful elements according to groundwater-related literature;

step 200: preprocessing the data of the predictor variables to obtain a data set;

step 300: defining a model task according to the degree to which harmful elements in the groundwater exceed the standard, and constructing an early warning model of the harmful elements in the groundwater by adopting a machine learning algorithm;

step 400: selecting a plurality of candidate base learner and meta-learner algorithms, dividing the data set to obtain training set-test set combinations, training and testing the base learner and meta-learner algorithms on the divided combinations, recording the test performance, and using the mean and the range of the performance metrics to assess the potential generalization ability of each algorithm for building a groundwater harmful element early warning model from the data, so as to obtain a screened base learner and a screened meta-learner;

step 500: performing a hyperparameter sensitivity test on the groundwater harmful element early warning model by using the screened base learner and meta-learner to determine the hyperparameters to be tuned and their ranges, optimizing the hyperparameter tuning workflow of all subsequent models, and completing construction of an integrated probability estimation model with a two-layer structure;

step 600: based on the integrated probability estimation model, continuously adding data from different time periods into the constructed integrated probability estimation model to obtain an incremental integrated probability estimation model;

step 700: predicting the risk of harmful elements in the groundwater by using the incremental integrated probability estimation model.

Fig. 2 is a schematic flow chart of an intelligent early warning method for harmful elements in groundwater based on integrated incremental learning according to an embodiment of the invention, and as shown in fig. 2, the method for predicting risk of arsenic in groundwater based on a machine learning model provided by the embodiment of the invention comprises the following steps:

S1, data collection: groundwater-related literature is collected and predictor variables potentially related to harmful elements in groundwater are selected, falling mainly into 7 categories such as human activities, climate, vadose zone characteristics, sedimentary environment, soil physicochemical characteristics and hydrogeology; the data of the relevant predictor variables are then collected, with data collection covering a plurality of time periods.

S2, constructing and cleaning the data set: data preprocessing, feature data set construction, class imbalance processing and the like are performed on the data collected in S1.

S3, constructing candidate base learner models: a model task is defined according to the degree to which harmful elements in the groundwater exceed the standard, an early warning model of the harmful elements in the groundwater is constructed by adopting common machine learning algorithms, and the hyperparameters to be tuned and their ranges are determined.

S4, establishing and evaluating the algorithm selection mechanism for the base learner and the meta-learner: the performance of a machine learning algorithm on the data set is used as the evaluation criterion for how well the algorithm matches the data. The specific evaluation steps are as follows: selecting a plurality of potential machine learning algorithms; dividing the data set to obtain training set-test set combinations; training and testing on the divided training set-test set combinations and recording the test performance; using the mean and the range of the performance metrics to assess the potential generalization ability of each algorithm for building a groundwater harmful element early warning model from the data; and screening out algorithms with sufficiently strong potential for the modeling in S5.

S5, constructing the integrated probability estimation model: using the base learner and the meta-learner evaluated and screened in step S4, a hyperparameter sensitivity test is performed on the model task of step S3 to determine the hyperparameters to be tuned and their ranges, the hyperparameter tuning workflow of all subsequent models is optimized, and the construction of the Stacking model with a two-layer structure is completed.

S6, using the integrated probability estimation model constructed in step S5, data of different time periods are continuously added into the constructed model.

S7, predicting the risk of harmful elements in the groundwater by using the incremental integrated probability estimation model built in step S6.

In step S1, the data set includes air temperature, precipitation, actual evapotranspiration (AET), distance to river, cumulative amplitude, annual change of water level, water level burial depth, hydraulic gradient, elevation, Yellow River breach, clay layer, clay-to-sand ratio, Quaternary landform, water enrichment, precipitation infiltration coefficient, permeability coefficient, specific yield, soil physicochemical characteristics, land use, vegetation index (NDVI), and slope.

In step S2, the data set is constructed and cleaned. The data preprocessing methods include data cleaning and data normalization: data cleaning includes outlier detection, missing value processing, duplicate data processing and the like, and common data normalization includes normalization, standardization and the like. The feature data set construction methods comprise multicollinearity analysis and recursive feature elimination. The class imbalance processing method is undersampling.

The feature data set construction mainly comprises multicollinearity analysis and recursive feature elimination. Multicollinearity is evaluated with the variance inflation factor (VIF); when VIF > 10, the environmental variable has a serious collinearity problem and needs to be examined and processed.

Table 1 shows the results of the multicollinearity analysis using VIF as the evaluation index
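The VIF screening described above can be sketched as follows. This is only an illustrative sketch, not the embodiment's implementation: it relies on the standard identity that the VIF of each feature equals the corresponding diagonal element of the inverse correlation matrix, and the three feature columns are hypothetical toy values (not data from Table 1). It assumes no feature column is constant.

```python
from statistics import mean, pstdev

def correlation_matrix(columns):
    """Pearson correlation matrix of equally long feature columns."""
    n = len(columns[0])
    standardized = []
    for col in columns:
        mu, sigma = mean(col), pstdev(col)  # assumes sigma > 0 (non-constant column)
        standardized.append([(v - mu) / sigma for v in col])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in standardized]
            for ci in standardized]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    size = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(size)]
           for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]

def vif(columns):
    """VIF of each feature: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]

# Hypothetical toy columns: burial depth is almost a linear function of elevation.
elevation = [10, 20, 30, 40, 50, 60]
burial_depth = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1]
air_temp = [14.2, 13.1, 15.0, 13.6, 14.8, 13.9]

scores = vif([elevation, burial_depth, air_temp])
flagged = [name for name, s in zip(["elevation", "burial_depth", "air_temp"], scores)
           if s > 10]  # VIF > 10 marks serious collinearity, as in the text
print(scores, flagged)
```

In this toy case the two nearly collinear columns are flagged while the independent one is not; which of the flagged variables to drop or merge remains a judgment for the analyst.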

The recursive feature selection procedure constructs a model from all features in the feature space, selects the best-performing feature, puts it into the feature subset and removes it from the feature space, and then repeats the previous step in the new feature space until the feature space is empty. The order in which features enter the feature subset is their order of importance, and features are selected according to this order.
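The ranking loop just described can be sketched as below. The sketch is an assumption-laden illustration: `importance_fn` is a hypothetical callback that would fit a model on the remaining features and return one importance value per feature, and the toy importances are made-up numbers used only so the loop can run.

```python
def rank_features(features, importance_fn):
    """Refit on the remaining features, move the best one into the ranking, repeat."""
    remaining = list(features)
    ranking = []
    while remaining:
        # Hypothetical callback: fit a model on `remaining`, return {feature: importance}.
        importances = importance_fn(remaining)
        best = max(remaining, key=lambda name: importances[name])
        ranking.append(best)      # order of entry = order of importance
        remaining.remove(best)    # shrink the feature space and continue
    return ranking

# Toy stand-in for a fitted model's importances (illustrative values only).
TOY_IMPORTANCE = {"elevation": 0.5, "AET": 0.3, "NDVI": 0.15, "slope": 0.05}

def toy_importance_fn(remaining):
    total = sum(TOY_IMPORTANCE[name] for name in remaining)
    return {name: TOY_IMPORTANCE[name] / total for name in remaining}

ranking = rank_features(TOY_IMPORTANCE, toy_importance_fn)
print(ranking)
```

The number of features to keep would then be chosen by evaluating model performance on the first k ranked features for increasing k.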

FIG. 3 shows the result of feature selection with recursive feature elimination. It can be seen from the figure that the model performs best when the number of environmental variables is 18. Based on the multicollinearity analysis and the feature selection results, elevation, AET, NDVI, surface silt fraction, Quaternary landform, specific yield, river buffer, Yellow River breach, precipitation infiltration coefficient, cumulative amplitude, sand-clay ratio, slope, air temperature, permeability coefficient, hydraulic gradient, water level burial depth and annual change of water level are taken as the final factor variables of the model.

The class imbalance processing mainly uses clustering within an undersampling method, and the indexes for evaluating the clustering effect comprise Calinski-Harabasz Score, inertia and Silhouette Score. The Calinski-Harabasz Score is computed from the between-cluster variance and the within-cluster variance; the larger the value, the better. Inertia is the sum of squared distances from all sample points in a cluster to its centroid, also called the within-cluster sum of squares; the smaller the value, the more similar the samples in each cluster and the better the clustering effect. The Silhouette Score is the silhouette coefficient, an index of clustering quality with a value range of [-1,1]; the larger the coefficient, the better the clustering effect. The clustering is run 10 times on the data set, and the run with the best effect is taken as the final class imbalance processing result.

Table 2 shows the clustering performance for class imbalance processing, evaluated with Calinski-Harabasz Score, inertia and Silhouette Score

In step S3, a model task is defined according to the degree to which harmful elements exceed the standard in the groundwater of the research target, specifically:

the study area is not divided by spatial scale.

The exceedance standard for harmful elements follows the national standard for element content in drinking water; for example, the arsenic concentration in groundwater cannot exceed 10 μg/L.

In step S4, the performance metrics used as evaluation criteria include AUC, accuracy, precision, sensitivity, specificity, and F1 value. The plurality of potential machine learning algorithms selected are specifically: Random Forest (RF), extremely randomized trees (ExtraTrees), bagging algorithm (TreeBag), extreme gradient boosting (XGBoost), Gradient Boosting Decision Tree (GBDT), adaptive boosting algorithm (AdaBoost), Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), k-nearest neighbor algorithm (KNN), Logistic Regression (LR), and multi-layer perceptron (MLP). Accuracy takes values in [0,1], and the closer it is to 1, the better the performance. Precision takes values in [0,1], and the closer it is to 1, the better the performance. Sensitivity takes values in [0,1], and the closer it is to 1, the better the performance. Specificity takes values in [0,1], and the closer it is to 1, the better the performance. Kappa is computed in the range -1 to 1, but in general falls between 0 and 1, and can be divided into five groups representing different levels of agreement: slight agreement from 0.0 to 0.20, fair agreement from 0.21 to 0.40, moderate agreement from 0.41 to 0.60, substantial agreement from 0.61 to 0.80, and almost perfect agreement from 0.81 to 1. The ROC curve gives a comprehensive estimate of model performance as the classification threshold varies within [0,1]. It is plotted with the false positive rate (1-specificity) on the abscissa and the sensitivity (true positive rate) on the ordinate, tracing the change in sensitivity and specificity as the threshold traverses [0,1]. AUC is the area under the ROC curve and measures the ability of the model to distinguish between the two classes. The AUC of a useful model lies in [0.5,1]; the closer the AUC is to 1.0, the higher the model performance, and when the AUC equals 0.5 the model performs no better than random guessing and has no application value. The arsenic data in groundwater are strongly imbalanced, so the AUC, which evaluates the model's ability to discriminate each class from the proportion of each class rather than from sample counts, is better suited to a comprehensive assessment of model performance.
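As a minimal sketch of how the metrics above relate to one another, the following computes them from a binary confusion matrix, with AUC obtained from the equivalent rank statistic (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one). The labels and scores are hypothetical toy values, with 1 standing for a sample that exceeds the standard; the sketch assumes both classes occur in the labels and in the predictions.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, sensitivity, specificity, F1 and kappa for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # 1 - false positive rate
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Agreement expected by chance, from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {"accuracy": accuracy, "precision": precision, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1, "kappa": kappa}

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy example: 3 exceeding samples, 5 non-exceeding samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold of 0.5 is an assumption

metrics = classification_metrics(y_true, y_pred)
area = auc(y_true, scores)
print(metrics, area)
```

Note that AUC is computed from the scores rather than from thresholded predictions, which is why it is insensitive to the particular threshold and to the class proportions.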

In addition, in step S4, the data set is divided several times to obtain different training set-test set combinations; each algorithm is trained and tested on each training set-test set combination and its test performance is recorded; and the potential generalization ability of each algorithm for building the intelligent early warning model of harmful elements in groundwater from the data set is assessed from the mean and the range of the performance metrics. This specifically comprises the following steps:

the data set is randomly split at a ratio of 7:3 to generate a training set and a test set;

the algorithm is fitted to the sub-training set with the hyperparameter values set so as to generate a model, the performance metrics of the model on the validation set are calculated, the 5 subsets are traversed so that each serves as the validation set in turn, the average of the performance metrics over the 5 validation sets is taken as the performance metric value of the model for that hyperparameter setting under 5-fold cross-validation, and the potential generalization ability of each algorithm is evaluated from the bias and variance of the performance metrics.
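The 7:3 split and the repeated k-fold traversal described above can be sketched with index lists alone. This is an illustrative sketch under assumptions: the function names are hypothetical, plain random shuffling is used (no stratification by class), and the sample count of 100 is a toy value.

```python
import random

def split_train_test(n_samples, train_fraction=0.7, seed=0):
    """Randomly split sample indices into a training set and a test set (7:3 by default)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = round(n_samples * train_fraction)
    return indices[:cut], indices[cut:]

def repeated_kfold(indices, n_splits=5, n_repeats=5, seed=0):
    """Yield (sub_training, validation) index lists for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        shuffled = list(indices)
        rng.shuffle(shuffled)  # a fresh shuffle for every repetition
        folds = [shuffled[k::n_splits] for k in range(n_splits)]
        for k, validation in enumerate(folds):
            # The union of the other folds forms the sub-training set.
            sub_train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield sub_train, validation

train_idx, test_idx = split_train_test(100)
splits = list(repeated_kfold(train_idx))
print(len(train_idx), len(test_idx), len(splits))
```

A model would be fitted on each `sub_train` list and scored on the matching `validation` list; the mean of the resulting scores estimates performance for one hyperparameter setting, and their spread indicates its stability. The test indices are held out of this loop entirely.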

In the study of the integrated incremental learning-based intelligent early warning model for groundwater arsenic carried out in the North Hemsl area, the embodiment of the invention establishes multiple models with different algorithms. Therefore, in order to verify as widely and efficiently as possible how well the model performance of the 11 machine learning algorithms matches different research purposes, the invention uses a single unified data set in the algorithm selection process: the final data set obtained after the construction and cleaning of step S2. 10 μg/L is selected as the threshold for deciding whether arsenic in groundwater exceeds the standard. By combining grid search with cross-validation, the sensitivity of the hyperparameters of the screened algorithms is verified and evaluated on the data sets for each model task, so as to optimize the hyperparameter tuning workflow and establish the statistical model of groundwater arsenic for each algorithm. For each hyperparameter to be tuned, a limited set of representative values is selected according to the characteristics of the research object, a grid in multidimensional space is built from all combinations of these values, and every candidate hyperparameter value is tried by traversing all nodes, striking a trade-off between feasibility and comprehensiveness;

and, in combination with grid search, all points of the grid in the hyperparameter space are traversed using 10-fold cross-validation repeated 10 times, and the performance metrics of all hyperparameter combinations are compared to select the hyperparameter values with the highest performance metric.
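The grid traversal described above can be sketched as follows. The hyperparameter names and candidate values are hypothetical, and `toy_cv_score` is a made-up stand-in for the mean cross-validated performance metric that a real run would compute for each grid node.

```python
from itertools import product

def grid_points(grid):
    """Enumerate every node of the hyperparameter grid as a dict."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

def grid_search(grid, cv_score):
    """Return the grid node with the highest cross-validated score."""
    return max(grid_points(grid), key=cv_score)

# Hypothetical grid: a few representative values per hyperparameter.
grid = {"n_estimators": [100, 300, 500], "max_depth": [4, 6, 8]}

def toy_cv_score(params):
    # Stand-in for the mean metric over repeated k-fold cross-validation;
    # this toy surface simply peaks at n_estimators=300, max_depth=6.
    return -abs(params["n_estimators"] - 300) / 100 - abs(params["max_depth"] - 6)

nodes = list(grid_points(grid))
best = grid_search(grid, toy_cv_score)
print(len(nodes), best)
```

The cost grows with the product of the candidate counts (9 nodes here, each evaluated with every cross-validation fold), which is why only a limited set of representative values per hyperparameter is practical.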

This shows that, after hyperparameter tuning with the modeling workflow of the method, the two algorithms fit the training data well and capture how the difference between groundwater with and without excessive arsenic is expressed in the natural system behind the training data. Therefore, to demonstrate the main, general numerical regularities captured, emphasis is placed on the performance of the models of the two algorithms on the test set and on how that performance varies across modeling tasks.

Table 3 summarizes the performance metrics (AUC, accuracy, precision, sensitivity, specificity and F1 value) of the different models obtained from multiple evaluations with 10 μg/L as the threshold.

From Table 3, it can be seen that the RF, ExtraTrees and TreeBag models, which are based on the Bagging ensemble algorithm, perform significantly better than the other models on every index, with AUC falling between 86% and 89% in most cases. Next come the Boosting-based models, whose AUC mostly falls between 82% and 85%. The MLP model performs poorly on all indexes and therefore has low application value for predicting groundwater pollution in the area.

The entropy method is a mathematical method for determining the weights of the model evaluation indexes and can reduce the influence of subjective human judgment. In the entropy method, the greater the degree of variation of an index, the smaller its information entropy and the more information it carries; conversely, the smaller the degree of variation of an index, the larger its information entropy and the less information it carries. Therefore, the entropy method is used to assign weights to the performance metrics according to their internal degree of variation. The results of the entropy calculation are shown in Table 4.

Table 4 Summary of the performance metric weights calculated by the entropy method
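A minimal sketch of one common formulation of the entropy weight method follows: each metric column is converted to proportions across models, its normalized information entropy is computed, and the weights are the normalized values of one minus the entropy. The source does not state which normalization it uses, so this formulation is an assumption, as are the model names and metric values (which are not taken from Table 4). The sketch assumes strictly positive metric values and at least one metric that varies across models.

```python
from math import log

def entropy_weights(matrix):
    """Entropy weights for the columns of a models-by-metrics matrix."""
    n_models = len(matrix)
    n_metrics = len(matrix[0])
    divergences = []
    for j in range(n_metrics):
        column = [row[j] for row in matrix]
        total = sum(column)
        proportions = [v / total for v in column]
        # Normalized information entropy of the metric, in [0, 1].
        entropy = -sum(p * log(p) for p in proportions if p > 0) / log(n_models)
        divergences.append(1.0 - entropy)  # more variation -> lower entropy -> larger value
    norm = sum(divergences)
    return [d / norm for d in divergences]

def composite_scores(matrix, weights):
    """Weighted sum of the metrics for each model."""
    return [sum(w * v for w, v in zip(weights, row)) for row in matrix]

# Hypothetical values: columns are (AUC, precision); AUC barely varies, precision varies a lot.
models = ["model_a", "model_b", "model_c"]
metric_matrix = [[0.88, 0.90],
                 [0.86, 0.60],
                 [0.84, 0.30]]

weights = entropy_weights(metric_matrix)
scores = composite_scores(metric_matrix, weights)
ranked = [name for _, name in sorted(zip(scores, models), reverse=True)]
print(weights, ranked)
```

In this toy case the strongly varying metric receives almost all of the weight, which mirrors the behavior described in the text for metrics with larger internal variation.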

AUC, accuracy, precision, recall, F1, kappa were analyzed for the 6 performance metrics variability. The Precision variation degree is larger, the information entropy is smaller, and therefore the provided information amount is larger and the weight is higher. And multiplying the weight coefficient calculated by the entropy method by each measurement index, and sequencing the weight coefficients according to the sequence from high to low. The calculation results are shown in Table 5.

Table 5 entropy weight composite score ranking for independent machine learning models

The tree-based model (RF, ET, treeBag, XGBoost, adaBoost, GBDT) performs better than other models in terms of rank. XGBoost is optimal in Boosting integration algorithm, RF is optimal in Bagging integration algorithm, SVM is optimal in other models, and MLP is worst.

By training the 11 models, statistics on the 6 performance metrics are obtained. Hierarchical cluster analysis is carried out on the model performance metrics, and the resulting dendrogram is shown in Fig. 4. The model performance metrics (Table 4) and the composite evaluation scores (Table 5) show that MLP performs poorly overall in predicting groundwater pollution in this area. The remaining 10 models are divided into 4 clusters, with small differences between models within a cluster and large differences between clusters. The first cluster (GBDT, XGBoost, AdaBoost) consists of Boosting-type ensemble models; the second cluster (ExtraTrees, RF, TreeBag) consists of Bagging-type ensemble models; the third cluster (KNN, LR, LDA) consists of simpler models; and the fourth cluster contains only SVM. Overall, the hierarchical clustering result agrees with the understanding of the model principles, and the clustering is effective.
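The clustering of models by their performance metrics can be sketched as follows. The text does not state the linkage method or distance measure, so Ward linkage on Euclidean distance is an assumption here, and the six models and two metric columns are hypothetical placeholder values rather than the data behind Fig. 4.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical metric table: one row per model, one column per metric
models = ["XGBoost", "GBDT", "RF", "ExtraTrees", "LR", "LDA"]
metrics = np.array([
    [0.84, 0.78],
    [0.83, 0.77],
    [0.88, 0.82],
    [0.87, 0.81],
    [0.75, 0.70],
    [0.74, 0.69],
])

# Agglomerative clustering (assumed: Ward linkage, Euclidean distance)
tree = linkage(metrics, method="ward")

# Cut the dendrogram into at most three clusters
labels = fcluster(tree, t=3, criterion="maxclust")

groups = {}
for name, label in zip(models, labels):
    groups.setdefault(int(label), []).append(name)
print(groups)
```

With these placeholder values the models with similar scores end up in the same cluster; `scipy.cluster.hierarchy.dendrogram(tree, labels=models)` would draw the corresponding tree.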

To meet the Stacking ensemble algorithm's requirement that the base learners perform well and differ in algorithm, this study selects the best-performing models from different clusters of the hierarchical clustering result for inclusion in the Stacking structure; XGBoost, RF and SVM are therefore selected as the base learners. The meta learner of the Stacking ensemble algorithm should meet two requirements, a simple algorithmic structure and good performance, so LDA is selected as the meta learner of the Stacking model.
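The selection rule can be illustrated with a few lines of code. The cluster membership follows the description above, but the composite scores are placeholders invented for illustration (they are not the Table 5 values), and taking the meta learner as the top-scoring model of the simple-model cluster is an assumption about how the two stated requirements are applied.

```python
# Cluster membership as described in the text
clusters = {
    "boosting": ["GBDT", "XGBoost", "AdaBoost"],
    "bagging": ["ExtraTrees", "RF", "TreeBag"],
    "simple": ["KNN", "LR", "LDA"],
    "svm": ["SVM"],
}

# Placeholder composite scores (hypothetical, NOT taken from Table 5)
score = {
    "GBDT": 0.80, "XGBoost": 0.82, "AdaBoost": 0.79,
    "ExtraTrees": 0.85, "RF": 0.86, "TreeBag": 0.84,
    "KNN": 0.72, "LR": 0.73, "LDA": 0.74,
    "SVM": 0.76,
}

# Best-scoring model in each cluster
best = {name: max(members, key=score.get) for name, members in clusters.items()}

# Base learners come from the ensemble and SVM clusters; the meta learner
# is taken from the cluster of structurally simple models (assumption)
base_learners = [best["boosting"], best["bagging"], best["svm"]]
meta_learner = best["simple"]
print(base_learners, meta_learner)
```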

In step S5, based on the algorithms evaluated and screened in step S4, the construction of all model tasks is completed, which specifically includes:

Stacking ensemble learning fuses the prediction results of the base learners using the Stacking ensemble strategy, so as to obtain better prediction results than a single learner. Each base learner in the Stacking model can give full play to its own strengths while the learners compensate for one another's weaknesses, which reduces the risk of poor generalization of a single-algorithm model and improves the prediction accuracy of the model. The operation flow of the Stacking model is shown in Fig. 5:

(1) The collected data are preprocessed and then divided into a training set and a test set. The training set is divided into 5 subsets, denoted P1-P5, and five-fold cross-validation is performed.

(2) In the first fold of cross-validation, P2-P5 are used as the training set to train the base learner, yielding model X1_1; P1 is used as the validation set, X1_1 predicts on it, and the prediction result is denoted a1. By analogy, models X1_1 to X1_5 and outputs a1 to a5 are obtained, and the outputs are stacked and combined, denoted A1.

(3) The test sets s1-s5 are input into the models X1_1 to X1_5 trained in each round for prediction, and the average of the five rounds of prediction results is taken as the new test set, denoted B1.

(4) The remaining two base learners X2 and X3 repeat the above steps. The training set yields the prediction results A1-A3, and the test set yields the results B1-B3. Both sets of results are input into the second-layer meta learner, which is trained on A1-A3 and predicts on B1-B3 to obtain the final result.
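Steps (1) to (4) can be sketched with scikit-learn as follows. This is an illustrative sketch under several assumptions: the data are synthetic, `GradientBoostingClassifier` stands in for XGBoost to avoid an extra dependency, class probabilities rather than hard labels are passed to the meta learner, and the helper name `stack_features` is hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import KFold, train_test_split
from sklearn.svm import SVC

def stack_features(base, X_train, y_train, X_test, n_splits=5, seed=0):
    """Return (A, B) for one base learner: out-of-fold predictions on the
    training set and fold-averaged predictions on the test set."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    A = np.zeros(len(X_train))
    B = np.zeros(len(X_test))
    for fit_idx, val_idx in folds.split(X_train):
        model = clone(base).fit(X_train[fit_idx], y_train[fit_idx])
        A[val_idx] = model.predict_proba(X_train[val_idx])[:, 1]  # a1..a5
        B += model.predict_proba(X_test)[:, 1] / n_splits         # mean of 5 rounds
    return A, B

# Synthetic stand-in for the preprocessed data set (step 1)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

base_learners = [
    GradientBoostingClassifier(random_state=0),  # stand-in for XGBoost
    RandomForestClassifier(n_estimators=100, random_state=0),
    SVC(probability=True, random_state=0),
]

# Steps 2-4: one (A, B) pair per base learner, stacked column-wise
pairs = [stack_features(b, X_train, y_train, X_test) for b in base_learners]
train_meta = np.column_stack([a for a, _ in pairs])  # A1-A3
test_meta = np.column_stack([b for _, b in pairs])   # B1-B3

# Second layer: LDA meta learner trained on A1-A3, predicting from B1-B3
meta = LinearDiscriminantAnalysis().fit(train_meta, y_train)
stack_pred = meta.predict(test_meta)
print("test accuracy:", (stack_pred == y_test).mean())
```

Because each entry of A1-A3 comes from a model that never saw that sample during training, the meta learner is fitted on predictions that resemble what it will receive at prediction time.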

In step S6, the Stacking ensemble model is compared with the three base learner models, and model performance is evaluated using AUC, accuracy, specificity and recall. The cross-validation results of the XGBoost, RF, SVM and Stacking models on the test set (Table 6) show that the Stacking model has the largest AUC, accuracy, specificity and recall values. Compared with XGBoost, Stacking improves AUC, accuracy, specificity and recall by 4.76%, 4.67%, 5.56% and 4.06%, respectively; compared with RF, by 1.01%, 0.09%, 4.00% and 0.41%; and compared with SVM, by 13.28%, 9.26%, 13.17% and 0.44%. Judged by the model evaluation metrics, the Stacking model achieves the best prediction accuracy, with specificity and sensitivity of 0.8056 and 0.8108, respectively, which shows that it performs well in predicting both groundwater-polluted areas and unpolluted areas.
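For reference, the sketch below shows how specificity and sensitivity (recall) follow from a binary confusion matrix. The counts are hypothetical: they were chosen only because their ratios round to the values reported above, and they are not taken from the patent's test set.

```python
def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity (recall) and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),  # share of polluted samples detected
        "specificity": tn / (tn + fp),  # share of unpolluted samples recognized
    }

# Hypothetical counts (assumption): 37 polluted and 36 unpolluted samples
m = binary_metrics(tp=30, fn=7, tn=29, fp=7)
print({k: round(v, 4) for k, v in m.items()})
```

Sensitivity measures performance on polluted samples and specificity on unpolluted ones, so similar values for the two indicate a model that is not biased toward either class.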

Table 6 Comparison of performance metrics for XGBoost, RF, SVM and Stacking

In step S7, the incremental effect is demonstrated by applying the constructed model to three cumulative data sets, namely 2010, 2010+2019 and 2010+2019+2020, and the performance metrics of the three are compared in Table 7. The model built on the 2010+2019+2020 data has the largest AUC, accuracy, specificity and recall values. Relative to the two models built on fewer years of data, it improves AUC, accuracy, specificity and recall by 2.85%, 4.11%, 2.19% and 6.08%, respectively, and by 2.40%, 4.11%, 6.15% and 2.03%, respectively. Judged by the model evaluation metrics, the model based on 2010+2019+2020 achieves the best prediction accuracy.
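The comparison across cumulative data sets can be sketched as below. The update mechanism is not spelled out in this passage, so the sketch assumes the simplest reading: the Stacking model is refitted from scratch each time a new period is appended, and every version is scored on the same hold-out set. The data are synthetic, the three batches merely play the role of the 2010, 2019 and 2020 samples, and `GradientBoostingClassifier` again stands in for XGBoost.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

def build_stacking():
    """Two-layer Stacking model: three base learners and an LDA meta learner."""
    return StackingClassifier(
        estimators=[
            ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
            ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
            ("svm", SVC(probability=True, random_state=0)),
        ],
        final_estimator=LinearDiscriminantAnalysis(),
        cv=5,
    )

# Synthetic data: three "yearly" batches plus a fixed hold-out set
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_hold, y_hold = X[300:], y[300:]
batches = {"2010": slice(0, 100), "2019": slice(100, 200), "2020": slice(200, 300)}

results = []
X_seen = np.empty((0, X.shape[1]))
y_seen = np.empty(0, dtype=int)
for year, rows in batches.items():
    # Append the new period, then refit on everything seen so far
    X_seen = np.vstack([X_seen, X[rows]])
    y_seen = np.concatenate([y_seen, y[rows]])
    model = build_stacking().fit(X_seen, y_seen)
    auc = roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1])
    results.append((year, len(y_seen), auc))

for year, n_train, auc in results:
    print(f"up to {year}: n_train={n_train}, AUC={auc:.3f}")
```

Scoring every version on one fixed hold-out set keeps the comparison between periods fair, since any change in the metric then reflects the added training data rather than a different evaluation sample.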

Table 7 Comparison of performance metrics for 2010, 2010+2019 and 2010+2019+2020

The algorithm selection procedure provided by the invention takes the decomposition of generalization error as its theoretical basis. It comprehensively considers, from the perspectives of bias and variance, the potential generalization ability of each algorithm on the groundwater harmful-element data of the study area, and selects algorithms suited to the research task before the formal modeling task is set up. The base learners and the meta learner are then screened by hierarchical clustering and the entropy method to construct a Stacking model, and model accuracy is further improved by adding data from different years. By contrast, the traditional statistical modeling procedure for harmful elements in groundwater models directly with existing algorithms and compares their performance only after the modeling task is completed. The algorithm selection procedure additionally takes the characteristics of spatial data modeling into account, estimates how well each algorithm matches the data at the start of the modeling task, and fuses multiple models. This avoids using unsuitable algorithms for the Stacking modeling task, greatly reduces unnecessary computation, and allows more candidate algorithms to be considered for the modeling task.
The embodiment provided by the invention shows that random forest, extreme gradient boosting, support vector machine and linear discriminant analysis perform better among all the learning models, and that the influence of variance on model performance cannot be ignored: the generalization ability of the multi-layer perceptron model is mainly affected by variance, and its performance is poor. For statistical modeling of harmful elements in groundwater in regions whose data quality is similar to that of northern Henan, it is therefore advisable to use the algorithm selection procedure to estimate the generalization ability of the models from the perspectives of variance and bias.

For the existing data, the invention takes the area under the ROC curve, accuracy, precision, sensitivity, specificity and F1 value as performance metrics and, based on the particle swarm optimization algorithm, repeated multi-fold cross-validation and grid search, analyzes the sensitivity of the 6 hyperparameters related to the 11 algorithms. The base learners and the meta learner are screened using the 6 performance metrics, and accuracy, sensitivity, specificity and the area under the ROC curve verify that the Stacking algorithm fits the groundwater arsenic data set of northern Henan very well and can describe the characteristics of the data and the natural system they represent. In terms of methodology, the invention optimizes the hyperparameter tuning procedure of Stacking based on model performance and greatly improves modeling efficiency.

The probability estimation model has excellent performance, and the importance calculation model has more balanced sensitivity and specificity. The final risk prediction map not only captures the spatial distribution characteristics of groundwater arsenic, as the random forest, support vector machine and extreme gradient boosting methods do, but also has higher resolution and can predict fine distribution at small and medium scales. The stable and excellent performance of the Stacking model on the modeling tasks shows that the Stacking algorithm has the potential to be extended to other affected areas in statistical modeling of groundwater arsenic.

The intelligent early warning method for harmful elements in groundwater based on integrated incremental learning provided by the invention is based on a variety of cutting-edge machine learning algorithms and uses the collected multi-dimensional, large-sample data set. Starting from the practical purpose of modeling and taking performance metrics as the reference, it focuses on the degree of match between the algorithms and the actual modeling problem and builds a modeling procedure suited to simulating the groundwater data of the study object. The method can be used to predict the spatial distribution of harmful elements in groundwater and to analyze the mechanisms controlling their genesis, and it is of great significance for research on the mechanisms of harmful elements in groundwater, for the utilization of groundwater resources and for water safety.

In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.

The principles and embodiments of the present invention have been described herein with reference to specific examples, and the description of the above embodiments is intended only to assist in understanding the method of the present invention and its core ideas; meanwhile, those of ordinary skill in the art may, in accordance with the ideas of the present invention, make changes to the specific embodiments and the scope of application. In view of the foregoing, the content of this specification should not be construed as limiting the invention.

Claims

1. An intelligent early warning method for harmful elements in underground water based on integrated incremental learning is characterized by comprising the following steps:

2. The intelligent early warning method for harmful elements in groundwater based on integrated incremental learning according to claim 1, wherein the data set includes air temperature, precipitation, actual evapotranspiration, distance from river, cumulative amplitude, annual change of water level, water level burial depth, hydraulic gradient, elevation, Yellow River breach, clay layer, clay-sand ratio, Quaternary geomorphology, water enrichment, precipitation infiltration coefficient, permeability coefficient, water supply, soil physicochemical characteristics, land use, vegetation index, and slope.

3. The intelligent early warning method for harmful elements in groundwater based on integrated incremental learning according to claim 1, wherein the data preprocessing is performed on the predicted variables to obtain a data set, and the method comprises the following steps:

4. The intelligent early warning method for the harmful elements in the underground water based on the integrated incremental learning, which is characterized in that the model task for defining the exceeding degree of the harmful elements in the underground water is specifically as follows:

the study area does not divide the spatial scale;

5. The intelligent early warning method for harmful elements in groundwater based on integrated incremental learning according to claim 1, wherein the potential machine learning algorithms comprise: random forest, extremely randomized trees, bagging algorithm, extreme gradient boosting, gradient boosting decision tree, adaptive boosting algorithm, support vector machine, linear discriminant analysis, k-nearest neighbor algorithm, logistic regression, and multi-layer perceptron.

6. The intelligent early warning method for harmful elements in groundwater based on integrated incremental learning according to claim 4, wherein the data set is divided to obtain a training set-test set combination, the basic learner and the meta learner algorithm are trained and tested according to the divided training set-test set combination, and test performance is recorded, the potential generalization ability of the basic learner and the meta learner algorithm to the building of an early warning model for harmful elements in groundwater based on data is comprehensively considered according to the mean value and the range of performance metrics, and the screened basic learner and meta learner are obtained specifically comprises:

7. The intelligent early warning method for harmful elements in groundwater based on integrated incremental learning according to claim 1, wherein the performance evaluation criteria include AUC, accuracy, precision, sensitivity, specificity and F1 value.

8. The intelligent early warning method for harmful elements in groundwater based on integrated incremental learning according to claim 1, wherein the base learners and the meta learner are used to perform a hyperparameter sensitivity test on the early warning model for harmful elements in groundwater, so as to determine the hyperparameters to be tuned and their ranges, optimize the hyperparameter tuning procedure of all subsequent models, and complete the construction of an integrated probability estimation model with a two-layer structure, which specifically comprises the following steps: