CN113807570B

CN113807570B - XGBoost-based reservoir dam risk level assessment method and system

Info

Publication number: CN113807570B
Application number: CN202110924472.6A
Authority: CN
Inventors: 丁炜; 金有杰; 高佳琦; 刘娜; 孙建庭; 林艳燕; 陈季; 牛睿平
Original assignee: Jiangsu Naiwch Cooperation; Nanjing Water Conservancy and Hydrology Automatization Institute Ministry of Water Resources
Current assignee: Jiangsu Naiwch Cooperation; Nanjing Water Conservancy and Hydrology Automatization Institute Ministry of Water Resources
Priority date: 2021-08-12
Filing date: 2021-08-12
Publication date: 2024-02-02
Anticipated expiration: 2041-08-12
Also published as: CN113807570A

Abstract

The invention provides an XGBoost-based reservoir dam risk level assessment method and system. Input features are preprocessed by feature engineering so that data content and format are normalized; the model is data-driven, which reduces the influence of subjective factors; the optimal model parameters are computed adaptively by GridSearch and Cross-validation, which improves working efficiency and saves manpower and material resources; and an assessment prediction model is built with machine learning at its core to mine deep features from massive data and automatically identify the severity of potential safety hazards of the reservoir dam. The method improves the efficiency and accuracy of reservoir dam risk level assessment and prediction, strengthens the risk early-warning capability for reservoir dams, and supports their long-term safe operation.

Description

XGBoost-based reservoir dam risk level assessment method and system

Technical Field

The invention belongs to the technical field of water conservancy and hydrology, relates to a safety risk assessment method based on reservoir dam data acquisition, and particularly relates to a reservoir dam risk level assessment method and system based on XGBoost.

Background

At present, China has more than 98,000 reservoirs, most of which were built in the 1950s to 1970s. For historical, economic and technical reasons, many reservoir dams have serious defects and a prominent risk of dam break. To reduce the probability of reservoir dam failure, risk assessment must be carried out regularly. Current reservoir dam safety risk assessment relies mainly on experienced researchers and technicians analyzing and evaluating the available information on existing reservoir dams, which has the following problems and disadvantages:

(1) The analysis and evaluation process has a low degree of automation. Because of the 'congenital deficiency' in the engineering quality of reservoir dams in China and their low level of informatization, reservoir dam assessment is currently carried out through manual analysis, manual modeling and manual evaluation, which consumes considerable manpower and material resources and increases the burden on technicians.

(2) Modeling and parameter optimization have a low degree of intelligence. Current methods of establishing a risk assessment system and a risk assessment model are subjective: the selection and setting of index factors depend on historical statistical data, expert experience and the like. As a result, a person with considerable experience must manually select and determine parameters whenever the established risk assessment system and model are used, which makes them difficult for reservoir management personnel to use.

(3) The modeling methods have low generality. To improve the accuracy of the evaluation model for a specific case, the input features and model parameters have in the past been tailored to the study area when the model was first constructed. Once the study area, or an influencing factor within it, changes, the risk assessment model no longer applies and must be readjusted by a technician.

Disclosure of Invention

To solve the above problems, the invention provides an XGBoost-based reservoir dam risk level assessment and prediction method that offers high generality, convenience, intelligence and accuracy across different types of potential risk influencing factors of a reservoir dam (such as dam building material and flood discharge gate width), and that improves the early-warning capability for reservoir dam safety risk.

The invention makes risk prediction and assessment of reservoir dams (and dam groups) 'intelligent': input features are preprocessed, and data content and format are normalized, through feature engineering techniques such as cleaning abnormal samples, sampling, normalization and discretization; the model is data-driven, which reduces the influence of subjective factors; the optimal model parameters are computed adaptively by GridSearch and Cross-validation, which improves working efficiency and saves manpower and material resources; and an assessment prediction model is built with machine learning at its core to mine deep features from massive data and improve model accuracy.

In order to achieve the above purpose, the present invention provides the following technical solutions:

A reservoir dam risk level assessment method based on XGBoost comprises the following steps:

Step one: acquire feature data (feature details are shown in Table 1) and a risk level table (Table 2) related to the risk influencing factors of a reservoir dam; review the main data types, data formats, missing data, variance, mean values and the like of the feature data; and apply feature engineering such as feature cleaning, data conversion and data filling to all the feature data to obtain a data set. The processing mainly comprises the following:

(1) Handling problems such as data content that does not match its column title, abnormal data formats and missing data content;

(2) Applying feature transformation to text features, converting text data such as dam type, impervious body type, dam building material and dam building purpose into numerical data using LabelEncoder.
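The text-to-number conversion in item (2) can be illustrated with a dependency-free sketch. It mimics the behaviour of scikit-learn's LabelEncoder, which assigns integer codes to the distinct values in sorted order; the helper name `fit_label_encoder` and the dam-type values are hypothetical and only serve the illustration.

```python
def fit_label_encoder(values):
    """Map each distinct text value to an integer code, in sorted order."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}

# Hypothetical dam-type column; the values are illustrative only.
dam_types = ["Earth", "Gravity", "Earth", "Rockfill", "Arch"]

mapping = fit_label_encoder(dam_types)
encoded = [mapping[value] for value in dam_types]

print(mapping)
print(encoded)
```

The same mapping must be reused when new data is encoded, otherwise the integer codes of the training and prediction data would not agree.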

Preferably, the preprocessing and feature engineering of the reservoir dam feature data set may further include the following, as shown in Fig. 5:

1. First, count the missing values of each feature and delete invalid features whose missing proportion exceeds 60%;

2. For a reservoir dam data set containing numerical and text data, remove duplicated feature information;

3. Determine whether the data set is a mixed data set containing both numerical and text data. If text data exists, choose, according to the number of features, either to map the text feature values to numerical values or to map them into a high-dimensional space, and then reduce the feature dimensionality using methods such as principal component analysis, association algorithms and correlation analysis. The reservoir dam data set used in this patent is a mixed data set containing both numerical and text data; the text data can be mapped and then subjected to dimension reduction. A schematic of the dimension reduction effect (taking 6-dimensional data as an example) is shown in Fig. 6; after dimension reduction the 6-dimensional features become 3-dimensional features, as shown in Fig. 7.

4. Fill in the missing feature values in the feature set. Because the reservoir dam features are discrete values, missing values can be filled with the mean or mode of that feature over the reservoir dams in the study area; alternatively, missing values can be uniformly set to a specific value, which improves the robustness of the model.
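Items 1 and 4 of the list above (dropping features with more than 60% missing values, then filling the remaining gaps with the mode) could look roughly like the following sketch. It uses plain Python dictionaries instead of a data frame; the column names, the `MISSING` marker and both helper functions are assumptions made for the example.

```python
from collections import Counter

MISSING = None  # marker for a missing cell in this sketch

def drop_sparse_features(rows, max_missing_ratio=0.6):
    """Remove columns whose share of missing cells exceeds the threshold."""
    n = len(rows)
    keep = [
        name for name in rows[0]
        if sum(r[name] is MISSING for r in rows) / n <= max_missing_ratio
    ]
    return [{name: r[name] for name in keep} for r in rows]

def fill_with_mode(rows):
    """Replace missing cells with the most frequent observed value per column."""
    filled = [dict(r) for r in rows]
    for name in rows[0]:
        observed = [r[name] for r in rows if r[name] is not MISSING]
        mode = Counter(observed).most_common(1)[0][0]
        for r in filled:
            if r[name] is MISSING:
                r[name] = mode
    return filled

# Hypothetical records: "remarks" is 80% missing and is therefore dropped.
rows = [
    {"gate_count": 2, "dam_type": 1, "remarks": None},
    {"gate_count": None, "dam_type": 1, "remarks": None},
    {"gate_count": 2, "dam_type": None, "remarks": None},
    {"gate_count": 3, "dam_type": 0, "remarks": None},
    {"gate_count": 2, "dam_type": 1, "remarks": 7},
]
cleaned = fill_with_mode(drop_sparse_features(rows))
print(cleaned)
```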

Table 1 Reservoir dam risk influencing factor table

Table 2 Reservoir dam risk level table

Sequence number	Potential risk level	Risk level definition
1	Low	Failure of the reservoir dam causes no casualties and has little impact on the economy and environment
2	Medium (Significant)	Failure of the reservoir dam causes no casualties but has a certain impact on the economy and environment
3	High	Failure of the reservoir dam is extremely likely to cause casualties

Step two: based on the data set obtained in step one, check the samples for class imbalance. In a real multi-class data set it is almost impossible for every class to have a similar number of samples, so the samples must be processed accordingly. At the data level, to address the extreme sample imbalance of the reservoir dam data set, a fused sampling scheme based on downsampling and class-balanced sampling ensures that the data set meets the model's requirements; at the algorithm level, the weight of each class can be set and adjusted according to the classification of reservoir dam risk levels, penalty terms can be added when building the algorithm, and so on. Fused sampling is applied to the reservoir dam risk level data set obtained in step one according to its sample distribution.
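The algorithm-level option, setting a weight for each class, can be sketched as follows. The patent does not state a weighting formula; this example assumes the common inverse-frequency heuristic, weight = N / (number of classes × class count), and the function name and label counts are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: rarer classes receive larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical risk levels: 6 low (1), 3 medium (2), 1 high (3).
labels = [1] * 6 + [2] * 3 + [3] * 1
weights = balanced_class_weights(labels)
print(weights)
```

With these weights the total weight of every class is equal, so the rare high-risk class contributes as much to the loss as the common low-risk class.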

Step three: divide the data set processed in step two into a training set and a test set. The training set is fed into the model, which continuously learns the deep information of each feature in the training data and thereby acquires the ability to assess and predict; the test set is used to evaluate the model's predictive ability. Data set division methods include the hold-out method, the bootstrap method, the leave-one-out method and the like. Here the data set is divided in a ratio of 8:2, and, to prevent data leakage and evaluate the model fairly, test set data must not appear in the training set.

Step four: establishing an XGBoost reservoir dam risk level assessment prediction model, setting general parameters, task parameters and base learner parameters as default parameters, adjusting a mode of calculating feature importance according to data set features, and setting a mode of outputting a result of each reservoir dam risk level.

Specifically, in the fourth step, an XGBoost model is constructed, all risk influence factors in a training set are input into the model for training, the importance degree of all risk influence factors and the risk level of each reservoir dam are output, and the method specifically comprises the following steps:

the risk level of the reservoir dam is assessed and predicted by the XGBoost algorithm, and the objective function is:

$$Obj = \sum_{t} l(y_t, \hat{y}_t) + \sum_{k=1}^{K} \Omega(f_k)$$

In the above formula, $\Omega$ is the regularization term that measures the complexity of the model, summed over the $K$ trees $f_k$; $l$ is the loss function that measures how well the model fits; $y_t$ is the true score of sample $t$; and $\hat{y}_t$ is the predicted score of sample $t$.

Feature importance can be calculated in one of three ways, "weight", "gain" or "cover": "weight" reflects the number of times a feature appears in the trees; "gain" reflects the average gain of the splits that use the feature; "cover" is the number of samples that the feature covers when it splits a node.
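To make the three importance measures concrete, the sketch below computes them from a hand-written list of split records rather than from a trained model. The records and feature names are invented for illustration. "cover" is averaged per split here, which is how XGBoost reports its "cover" importance type.

```python
from collections import defaultdict

def feature_importance(splits, kind="gain"):
    """Aggregate (feature, gain, cover) split records into an importance score."""
    count = defaultdict(int)
    gain_sum = defaultdict(float)
    cover_sum = defaultdict(float)
    for feature, gain, cover in splits:
        count[feature] += 1
        gain_sum[feature] += gain
        cover_sum[feature] += cover
    if kind == "weight":   # number of splits that use the feature
        return dict(count)
    if kind == "gain":     # average gain of those splits
        return {f: gain_sum[f] / count[f] for f in count}
    if kind == "cover":    # average number of samples covered by those splits
        return {f: cover_sum[f] / count[f] for f in count}
    raise ValueError(f"unknown importance kind: {kind}")

# Hypothetical split records of a small ensemble: (feature, gain, cover).
splits = [
    ("dam_material", 12.0, 100),
    ("gate_width", 4.0, 60),
    ("dam_material", 8.0, 40),
    ("inspection_frequency", 9.0, 50),
]
for kind in ("weight", "gain", "cover"):
    print(kind, feature_importance(splits, kind))
```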

Step five: based on the training set data and the established XGBoost reservoir dam risk level assessment prediction model, open the data set with Pandas, select the 25 feature columns with iloc and input them to the model as the X variable, and input the risk level as the Y variable, to obtain a preliminary XGBoost model. Here X is x_train in the training set data (the 25 features in the training set) and Y is y_train in the training set data (the risk level of each sample, as defined in Table 2).

Step six: for the XGBoost model established in step five, group the data set using Cross-validation, set the parameters, and determine the parameter range to be optimized and the parameter search step size. One part of the data set serves as the training set for training the classifier, the validation set is used to validate the model, and the final classification accuracy is recorded as the performance of the classifier.

Preferably, the cross-validation method used in step six can be selected according to the actual situation:

if the amount of data is not particularly large, KFold, i.e. K-fold cross-validation, can be selected; if computing resources are sufficient, LOO, i.e. leave-one-out validation, can be tried.
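The relationship between the two options can be seen in a small dependency-free sketch: leave-one-out is simply K-fold with K equal to the number of samples. The generator `k_fold_indices` is a hypothetical helper; it splits contiguous index ranges without shuffling, which a real workflow would normally add.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size

folds = list(k_fold_indices(10, 5))   # 5-fold: five validation sets of 2 samples
loo = list(k_fold_indices(10, 10))    # leave-one-out: ten validation sets of 1 sample

for train, val in folds:
    print("train:", train, "validate:", val)
```

Leave-one-out trains the model once per sample, which is why the text recommends it only when computing resources are sufficient.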

Step seven: for each training in Cross-validation, traversing all parameter combinations by GridSearch, determining the optimal parameters, and outputting the parameter combination with highest precision in k training results as the final parameters of the model.

Step eight: build the model with the optimal parameters obtained in step seven, and comprehensively evaluate its performance on the test set using a confusion matrix together with Precision, Recall, Accuracy and F-score.

Preferably, in step eight, the performance of the reservoir dam risk level evaluation prediction model is evaluated:

classification accuracy is displayed visually with a confusion matrix, whose basic structure is shown in Table 3:

Table 3 Confusion matrix

	Predicted Class 1	Predicted Class 2
Actual Class 1	TP	FN
Actual Class 2	FP	TN

where TP denotes Class 1 samples classified correctly; FP denotes Class 2 samples misclassified as Class 1; FN denotes Class 1 samples misclassified as Class 2; and TN denotes Class 2 samples classified correctly.

Accuracy, Precision, Recall and F-score are used to evaluate the performance of the model comprehensively. Accuracy reflects the overall prediction accuracy of the model; the higher it is, the better the model performs. Precision reflects the proportion of the samples predicted as a class that truly belong to it; the larger it is, the fewer the misclassifications. Recall reflects the proportion of the samples of the true class that the model detects correctly; the larger it is, the fewer the missed samples. The F-score is the harmonic mean of Precision and Recall; the larger it is, the better the model performs. The four indexes are expressed as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F\text{-score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
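As a worked example of the four indexes, the sketch below evaluates them from the TP, FP, FN and TN counts of a two-class confusion matrix, using the standard definitions (F-score as the harmonic mean of Precision and Recall). The counts are invented for illustration and are not results from the patent.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, Precision, Recall and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }

# Hypothetical counts for one class treated as "Class 1".
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For the three risk levels of Table 2, the same calculation can be repeated per class, treating that class as Class 1 and the other two as Class 2.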

Compared with the prior art, the invention has the following advantages and beneficial effects:

1. The invention uses machine learning to construct a reservoir dam risk level assessment prediction model, mines deep feature information from large amounts of data, and realizes intelligent multi-class classification of reservoir dam risk levels. Risk levels are identified and judged with high accuracy, intelligence and automation, which reduces the time and economic cost of manual data processing, manual data analysis and manual modeling. The severity of potential safety hazards of the reservoir dam is identified automatically, the efficiency and accuracy of risk level assessment and prediction are improved, the risk early-warning capability is strengthened, and long-term safe operation of the reservoir dam is supported.

2. The XGBoost multi-class algorithm is introduced into the reservoir dam risk level assessment prediction model. Compared with machine learning algorithms such as KNN and SVM, it has the following outstanding advantages: (1) the loss function of XGBoost is adjusted to the requirements of reservoir dam risk level assessment, so that XGBoost can output multiple risk levels; (2) XGBoost expands the loss function to second order in its Taylor series, which improves the speed of gradient descent and the error precision; (3) L1 and L2 norms are added to the objective function, which reduces model complexity and the risk of overfitting and improves generalization; (4) a handling strategy for missing values is learned automatically, which improves the robustness of the model; (5) the model's performance indexes are better and its risk level assessment and prediction accuracy is higher.

3. Based on the characteristics of reservoir dam data, the invention forms a feature engineering processing system for reservoir dam data, which mainly comprises a missing data filling method, a data encoding method and a data sampling method, and which can mitigate sample imbalance in the data set.

4. With the automatic optimal parameter calculation method based on GridSearch with Cross-validation, the optimal combination of key model parameters is searched automatically over different subsets, which reduces overfitting caused by an unreasonable data split, gives the model the highest accuracy and the best generalization ability, shortens the XGBoost parameter tuning time, and realizes rapid optimization of model performance.

5. The method breaks through the past limitation that only a single reservoir could be assessed at a time: besides risk assessment and prediction for a single reservoir, it realizes automatic and rapid assessment and prediction of the risk levels of a reservoir group, based on the weights obtained by model training and the missing-feature handling strategy.

Drawings

FIG. 1 is a schematic flow chart of a reservoir dam risk level assessment method based on XGBoost.

FIG. 2 is a Cross-validation with GridSearch flow chart.

Fig. 3 is a schematic diagram of feature importance.

Fig. 4 is a confusion matrix.

Fig. 5 is a schematic diagram of reservoir dam feature data set processing.

Fig. 6 is a 6-dimensional feature diagram.

Fig. 7 is a 3-dimensional feature diagram.

Detailed Description

The technical scheme provided by the present invention will be described in detail with reference to the following specific examples, and it should be understood that the following specific examples are only for illustrating the present invention and are not intended to limit the scope of the present invention.

The invention provides a risk assessment and prediction method for reservoir dams (and dam groups) that makes reservoir dam risk prediction and assessment 'intelligent'. The method preprocesses input features by feature engineering and normalizes data content and format; the model is data-driven, which reduces the influence of subjective factors; the optimal model parameters are computed adaptively by GridSearch and Cross-validation, which improves working efficiency and saves manpower and material resources; and an assessment prediction model is built with machine learning at its core to mine deep features from massive data and improve model accuracy.

Specifically, the invention provides a reservoir dam risk level assessment method based on XGBoost, which is shown in a figure 1 and comprises the following steps:

Step one: first, acquire the original reservoir dam feature data set (comprising the feature data and the risk level table). In Python, map and head() are used to preview the data, and info() is used to check the data types (shown in Table 4). The dam type, impervious body type, dam building material and dam building purpose are text with data format object, and this text data is mapped to numerical values using LabelEncoder; the repair time is numerical data whose data format is object, so its content and format are cleaned and normalized; missing content in the data is filled with zeros.

Table 4 data population case table

Step two: for the data set after preprocessing and feature engineering, the value_counts function is used in Python to check the distribution of samples, and the sample function is used to downsample the data according to that distribution. Random sampling is chosen as the sampling mode, and random_state (the random sampling seed, set to a fixed value so that the result can be reproduced) is set to 1.
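The downsampling in this step can be sketched without Pandas as follows. The sketch assumes the simplest variant, randomly reducing every class to the size of the smallest class with a fixed seed of 1; the fused sampling scheme actually used in the patent is not specified in detail, and the helper name and sample data are hypothetical. A seeded standard-library generator will not reproduce the exact rows that Pandas would select with random_state=1.

```python
import random
from collections import Counter, defaultdict

def downsample_to_minority(samples, label_of, seed=1):
    """Randomly keep as many samples per class as the smallest class has."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    by_class = defaultdict(list)
    for s in samples:
        by_class[label_of(s)].append(s)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for label in sorted(by_class):
        balanced.extend(rng.sample(by_class[label], target))
    return balanced

# Hypothetical (dam_id, risk_level) pairs: 6 low, 3 medium, 2 high.
dams = (
    [(i, 1) for i in range(6)]
    + [(i, 2) for i in range(6, 9)]
    + [(i, 3) for i in range(9, 11)]
)
print(Counter(level for _, level in dams))        # distribution before sampling
balanced = downsample_to_minority(dams, label_of=lambda d: d[1])
print(Counter(level for _, level in balanced))    # distribution after sampling
```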

Step three: import the train_test_split function from the sklearn.model_selection library and use it to divide the data set, where the feature data is the input variable X and the risk level is the predicted variable Y. The division ratio is determined by the test_size parameter, which is set to 0.2, so that 80% of the data is training data and 20% is test data; to ensure reproducibility of the split, random_state is set to 20.
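The effect of the 80/20 split can be illustrated with a standard-library sketch that shuffles sample indices with a fixed seed and checks that no test sample leaks into the training set. The function `holdout_split` is a hypothetical stand-in, not scikit-learn's implementation, so the same seed will not give the same partition as train_test_split.

```python
import random

def holdout_split(n_samples, test_size=0.2, seed=20):
    """Shuffle indices with a fixed seed and split them into train and test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx

train_idx, test_idx = holdout_split(100)
print(len(train_idx), len(test_idx))
# No index may appear in both sets, otherwise test data leaks into training.
print(set(train_idx).isdisjoint(test_idx))
```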

Step four: import XGBClassifier from XGBoost and build the XGBoost reservoir dam risk level assessment prediction model. In the general parameter configuration of the model, the base learner parameter "booster" is set to "gbtree" and the computing resource allocation parameter "tree_method" is set to "cpu_hist". Feature importance can be calculated by "weight", "gain" or "cover": "weight" reflects the number of times a feature appears in the trees; "gain" reflects the average gain of the splits that use the feature; "cover" is the number of samples that the feature covers when it splits a node. Here the feature importance is calculated by "gain".

The fourth step specifically comprises the steps of establishing a reservoir dam risk level assessment prediction model based on multi-classification XGBoost:

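Let $F$ denote the function space of trees, with each tree $f_k(\cdot) \in F$, where $k = 1, 2, 3, 4, \ldots, K$, i.e. formula (1):

$$\hat{y}_t = \sum_{k=1}^{K} f_k(x_t), \quad f_k \in F \tag{1}$$

where $x_t$ is the input and $\hat{y}_t$ is the predicted output.

For the model to perform better in classification, the objective function must be minimized, as shown in formula (2):

$$Obj = \sum_{t} l(y_t, \hat{y}_t) + \sum_{k=1}^{K} \Omega(f_k) \tag{2}$$

where $\Omega$ is the regularization term that measures the complexity of the model; the smaller its value, the lower the complexity and the stronger the generalization ability. Its specific formula is:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{m=1}^{T} w_m^2 \tag{3}$$

$T$ is the number of leaf nodes, $w_m$ is the weight of leaf node $m$, and $\gamma$ and $\lambda$ are adjustable parameters.

The cross-entropy loss function measures how well the model fits the training data, and its specific formula is:

$$l = -\sum_{i} \sum_{j} y_{ij} \log p_{ij} \tag{4}$$

In formula (4), $y_{ij}$ is the true value and $p_{ij}$ is the predicted value, where $i = 1, 2, 3, \ldots, I-1$, with a separate expression for item $I$.

The prediction output at the $k$-th iteration is calculated according to formula (1), namely:

$$\hat{y}_t^{(k)} = \hat{y}_t^{(k-1)} + f_k(x_t) \tag{5}$$

Substituting formula (5) into the objective function and dropping the terms that do not depend on $f_k$, the objective function simplifies to:

$$Obj^{(k)} = \sum_{t} l\left(y_t, \hat{y}_t^{(k-1)} + f_k(x_t)\right) + \Omega(f_k) \tag{6}$$

A second-order Taylor expansion of formula (6) gives:

$$Obj^{(k)} \approx \sum_{t} \left[ l\left(y_t, \hat{y}_t^{(k-1)}\right) + a_t f_k(x_t) + \frac{1}{2} b_t f_k^2(x_t) \right] + \Omega(f_k) \tag{7}$$

where $a_t$ is the first derivative of $l\left(y_t, \hat{y}_t^{(k-1)}\right)$ with respect to $\hat{y}_t^{(k-1)}$ and $b_t$ is its second derivative. Let $I_m = \{t : q(x_t) = m\}$ denote the instance set of leaf $m$, where $q(x_t)$ is the leaf to which $x_t$ is assigned. Dropping the constant term and substituting formula (3), formula (7) is expressed as:

$$Obj^{(k)} = \sum_{m=1}^{T} \left[ \left( \sum_{t \in I_m} a_t \right) w_m + \frac{1}{2} \left( \sum_{t \in I_m} b_t + \lambda \right) w_m^2 \right] + \gamma T \tag{8}$$

Minimizing formula (8) gives the optimal objective function:

$$Obj^{*} = -\frac{1}{2} \sum_{m=1}^{T} \frac{\left( \sum_{t \in I_m} a_t \right)^2}{\sum_{t \in I_m} b_t + \lambda} + \gamma T \tag{9}$$

Meanwhile, the leaf weight at which the objective function is optimal is:

$$w_m^{*} = -\frac{\sum_{t \in I_m} a_t}{\sum_{t \in I_m} b_t + \lambda} \tag{10}$$

That is, the model continuously fits the data during training so as to approach formula (9) as closely as possible.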
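The closing step of this derivation can be checked numerically. The sketch below assumes the standard closed form in which the optimal weight of a leaf is the negative sum of first derivatives divided by the sum of second derivatives plus λ, and it takes the derivatives of the softmax cross-entropy with respect to a class score (p − y and p(1 − p)) as the values of a_t and b_t. All names, the three-sample leaf and λ = 1 are assumptions for illustration; XGBoost's internal implementation may scale these terms differently.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def grad_hess(scores, true_class, cls):
    """First and second derivative of cross-entropy w.r.t. the score of `cls`."""
    p = softmax(scores)[cls]
    y = 1.0 if cls == true_class else 0.0
    return p - y, p * (1.0 - p)

def optimal_leaf_weight(a, b, lam):
    """Closed-form minimizer of the per-leaf quadratic objective."""
    return -sum(a) / (sum(b) + lam)

def leaf_objective(w, a, b, lam):
    """Per-leaf part of the second-order objective (without the gamma term)."""
    return sum(a) * w + 0.5 * (sum(b) + lam) * w * w

# Three hypothetical samples in one leaf, all starting from zero scores;
# the tree being fitted is the one for class 0.
true_classes = [0, 1, 0]
stats = [grad_hess([0.0, 0.0, 0.0], y, cls=0) for y in true_classes]
a = [g for g, _ in stats]
b = [h for _, h in stats]

w_star = optimal_leaf_weight(a, b, lam=1.0)
print(w_star, leaf_objective(w_star, a, b, lam=1.0))
```

Because two of the three samples belong to class 0, the optimal weight is positive, which raises the score of class 0 for samples that fall into this leaf.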

Because the reservoir dam risk level assessment prediction model outputs more than two risk levels, rather than a continuous value as before, the loss function in the XGBoost model must be adjusted, as in formula (4) above. Correspondingly, when the XGBoost model parameters are selected, the objective parameter is set to "multi:softmax" to output multi-class results; a different loss function can also be set according to the requirements of the reservoir dam risk level assessment prediction model. In addition, step four further comprises setting the initial parameters of the model:

the model parameters mainly comprise general parameters, training parameters and learning task parameters. Initializing the model determines booster and tree_method in the general parameters and objective and seed in the learning task parameters, where booster="gbtree", tree_method="gpu_hist" and seed=27.
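The initial settings named above could be collected in a parameter dictionary such as the following sketch. Two points are assumptions rather than statements from the patent: `num_class` is set to 3 because Table 2 defines three risk levels and XGBoost requires the class count for "multi:softmax", and the helper `to_zero_based` is added because that objective expects class labels numbered from 0. `booster` is XGBoost's name for the base learner parameter. Recent XGBoost releases deprecate "gpu_hist" in favour of tree_method="hist" with a separate device setting, so the value may need adapting to the installed version.

```python
# Initial parameters as listed in the text; values not stated there are assumptions.
xgb_params = {
    "booster": "gbtree",           # tree-based base learner
    "tree_method": "gpu_hist",     # histogram algorithm on GPU, as stated in the text
    "objective": "multi:softmax",  # output the predicted class directly
    "num_class": 3,                # assumption: the three risk levels of Table 2
    "seed": 27,                    # fixed seed for reproducibility
}

def to_zero_based(levels):
    """Shift risk levels 1..3 to class labels 0..2 (hypothetical helper)."""
    return [level - 1 for level in levels]

print(xgb_params)
print(to_zero_based([1, 2, 3, 3, 1]))
```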

Step five: based on the training set data divided in step three, open the data with pd.read_csv in Pandas and use the iloc function to select 25 reservoir dam features, such as distance, dam type, impervious body type, dam building material, dam building purpose and dam building time, as the independent variable X for model training; take the risk level as the dependent variable Y for model training; and input the X and Y data into the model to obtain a preliminary XGBoost model.
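The column selection in this step can be illustrated with the standard csv module instead of Pandas. The sketch assumes a file layout in which the 25 feature columns come first and the risk level is the last column; the layout, the column names and the in-memory data are all hypothetical.

```python
import csv
import io

N_FEATURES = 25

# Build a tiny in-memory CSV: 25 hypothetical feature columns plus the risk level.
header = [f"feature_{i}" for i in range(1, N_FEATURES + 1)] + ["risk_level"]
rows = [
    [str(i + j) for j in range(N_FEATURES)] + [str(i % 3 + 1)]
    for i in range(4)
]
buffer = io.StringIO()
csv.writer(buffer).writerows([header] + rows)
buffer.seek(0)

reader = csv.reader(buffer)
next(reader)  # skip the header row
X, Y = [], []
for record in reader:
    X.append([float(v) for v in record[:N_FEATURES]])  # positional slice, like iloc[:, :25]
    Y.append(int(record[N_FEATURES]))                  # last column, like iloc[:, 25]

print(len(X), len(X[0]), Y)
```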

Step six: for the XGBoost model established in step five, K-fold Cross-validation is adopted. KFold is imported from sklearn.model_selection and the data set is divided into K (K=5) subsets; each time, 1 of the K subsets is used as the test set and the other K-1 subsets as the training set, so the model is trained K times. The K results and their average are calculated to analyze the model's predictive ability on different data sets and to judge its generalization performance and degree of overfitting, as shown in the Cross-validation part of Fig. 2.

Step seven: for each training in Cross-validation, GridSearch traverses all parameter combinations. GridSearchCV is imported from sklearn.model_selection, the classifier parameter estimator is set to XGBClassifier, and parameters of XGBClassifier such as n_estimators, colsample_bytree, booster, objective, gamma, max_depth and min_child_weight are taken as the tuning range. Accuracy is selected as the output index, the optimal parameters are determined by calculation, and the parameter combination with the highest accuracy among the k (k=5) training results is output as the final parameters of the model.
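The exhaustive traversal that a grid search performs can be sketched with itertools.product. In this illustration the scoring function is a made-up stand-in for the mean cross-validated accuracy of a trained model, and the grid values are invented; only the parameter names gamma, max_depth and min_child_weight come from the text.

```python
from itertools import product

def grid_search(param_grid, cv_score):
    """Try every parameter combination and keep the one with the best score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical tuning ranges for three of the parameters named in the text.
param_grid = {
    "max_depth": [3, 5, 7],
    "min_child_weight": [1, 3],
    "gamma": [0.0, 0.1],
}

def toy_cv_score(params):
    """Stand-in for the mean accuracy over k folds; not a real model."""
    return (
        0.9
        - 0.01 * abs(params["max_depth"] - 5)
        - 0.005 * params["min_child_weight"]
        - 0.1 * params["gamma"]
    )

best_params, best_score = grid_search(param_grid, toy_cv_score)
print(best_params, round(best_score, 3))
```

The cost grows with the product of the grid sizes (here 3 × 2 × 2 = 12 combinations, each evaluated over all folds), which is why the tuning range and step size are fixed in advance.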

Step eight: build the model with the optimal parameters obtained in step seven and output the feature importance; eliminate unimportant influencing factors according to the feature importance and, as required, input the new feature set into the model for training; comprehensively evaluate the different models on the test set using a confusion matrix together with Precision, Recall, Accuracy and F-score; and select the scheme with the highest accuracy. The feature importance is shown in Fig. 3, the confusion matrix in Fig. 4, and the accuracy evaluation indexes in Table 5.

Table 5 precision evaluation index

The step eight specifically includes feature importance, confusion matrix and precision index evaluation analysis:

It can be seen from Fig. 3 that, among the 25 features, inspection frequency, dam building material, impervious body type, gate width and number of gates are significantly more important than the other features. Related research shows that, under actual engineering conditions, the dam building material and the impervious material have an important influence on the deformation and the seepage of the reservoir dam, respectively. The width and number of gates determine the flood discharge capacity of the reservoir, and in historical dam break events insufficient flood discharge capacity and the inability to release floods in time have been among the important causes of dam break. Dam safety inspection is an extremely important link in ensuring the safety of a reservoir dam after it is built: it is timely, comprehensive and intuitive, it can discover major hazards such as seepage, cracks and equipment faults in time, and increasing the inspection frequency can greatly improve dam safety. This shows that the model's prediction and assessment of the potential risk of reservoir dams has a certain rationality.

The prediction accuracy of the classification results is shown in the confusion matrix in Fig. 4: the prediction accuracy is 95% for high risk, 88% for medium risk and 91% for low risk. Using the Accuracy formula, the overall accuracy of the model's predictions is 91.3%. Because failure of a reservoir dam often has extremely serious consequences, engineering applications place higher demands on the prediction accuracy for the high risk level; the model's high-risk prediction accuracy reaches 95% and its overall accuracy exceeds 90%, which meets the requirements of practical application.

The invention also provides an XGBoost-based reservoir dam risk level assessment system for implementing the XGBoost-based reservoir dam risk level assessment method. Specifically, the system comprises a feature data acquisition module, a data processing module, a data set division module, a prediction model building module, a model training module, a model verification module, an optimal parameter determination module and a performance evaluation module. The feature data acquisition module acquires the feature data and risk level table related to reservoir dam risk influencing factors and applies preprocessing and feature engineering to obtain a data set, implementing step one of the method. The data processing module checks the sample balance of the data set and processes unbalanced samples to make them more balanced, implementing step two. The data set division module divides the data set processed by the data processing module into a training set and a test set, implementing step three. The prediction model building module builds the XGBoost-based reservoir dam risk level assessment prediction model, sets its parameters, adjusts the way feature importance is calculated and sets the way risk level results are output, implementing step four.
The model training module takes the feature data as the X variable and the risk level as the Y variable and trains the prediction model built by the prediction model building module, implementing step five. The model verification module groups the data set for the model obtained by the model training module, trains the classifier on one part of the data as the training set, validates the model on the validation set and records the final classification accuracy as the performance of the classifier, implementing step six. The optimal parameter determination module traverses all parameter combinations during training, determines the optimal parameter combination and outputs it, implementing step seven. The performance evaluation module builds the model with the optimal parameters obtained by the optimal parameter determination module and comprehensively evaluates its performance on the test set, implementing step eight.

The technical means of the invention are not limited to those disclosed in the embodiments above, and also include technical solutions formed by any combination of the above technical features. It should be noted that a person skilled in the art may make improvements and modifications without departing from the principles of the invention, and such improvements and modifications also fall within the protection scope of the invention.

Claims

1. An XGBoost-based reservoir dam risk level assessment method, characterized by comprising the following steps:

step one: acquiring characteristic data and a risk level table related to risk influence factors of a reservoir dam, preprocessing all the characteristic data, performing characteristic cleaning, data conversion and data filling aiming at data content and format, and normalizing the data to obtain a data set;

step two: based on the data set obtained in step one, checking whether the samples of the data set are unbalanced and, when they are, processing the samples to make them more balanced, as follows: at the data level, applying fusion sampling to the reservoir dam risk level data set obtained in step one according to its sample distribution, the fusion sampling combining downsampling and class-balanced sampling, so that the data set meets the requirements of the model; at the algorithm level, setting and adjusting the weight of each category according to the classification of the reservoir dam risk levels, and adding penalty terms when the algorithm is built;

step three: dividing the data set processed in step two into a training set and a test set; the training set is input into the model, which continuously learns the deep information of each feature in the training data and thereby acquires the ability to assess and predict; the test set is used to evaluate the predictive ability of the model;

step four: establishing an XGBoost-based reservoir dam risk level assessment prediction model, setting general parameters, task parameters and base learner parameters as default parameters, adjusting a mode of calculating feature importance according to data set features, and setting a mode of outputting a result of each reservoir dam risk level;

in the objective function of the model, $\Omega$ is the regularization term that evaluates the complexity of the model, $l$ is the loss function that evaluates how well the model fits, $y_t$ denotes the true score value of sample $t$, and $\hat{y}_t$ denotes the predicted score value of sample $t$;

the feature importance is calculated in one of the "weight", "gain" or "cover" modes: "weight" is the number of times a feature appears in the trees; "gain" is the average gain of the splits that use the feature; "cover" is the number of samples covered by the nodes that are split on the feature;

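the building of the XGBoost reservoir dam risk level assessment prediction model specifically comprises the following steps:

$$\hat{y}_t = \sum_{k=1}^{K} f_k(x_t) \tag{1}$$

wherein $x_t$ is the input, $\hat{y}_t$ is the predicted output, and $f_k$ is the $k$-th of the $K$ trees;

the objective function to be minimized is shown in (2):

$$Obj = \sum_{t} l(y_t, \hat{y}_t) + \sum_{k=1}^{K} \Omega(f_k) \tag{2}$$

wherein $\Omega$ is the regularization term that measures the complexity of the model, with the specific formula:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{m=1}^{T} w_m^2 \tag{3}$$

wherein $T$ is the number of leaf nodes, $w_m$ is the weight of leaf node $m$, and $\gamma$ and $\lambda$ are adjustable parameters;

the specific formula of the cross entropy loss function is:

$$L = -\sum_{i} \sum_{j} y_{ij} \log p_{ij} \tag{4}$$

in formula (4), $y_{ij}$ is the true value and $p_{ij}$ is the predicted value, where the index $i$ runs over $1, 2, 3, \ldots, I-1$ and $I$;

the $m$-th prediction output is calculated according to formula (1), namely:

$$\hat{y}_t^{(m)} = \sum_{k=1}^{m} f_k(x_t) = \hat{y}_t^{(m-1)} + f_m(x_t) \tag{5}$$

substituting formula (5) into the objective function and simplifying gives:

$$Obj = \sum_{t} l\left(y_t, \hat{y}_t^{(m-1)} + f_m(x_t)\right) + \Omega(f_m) + \text{constant} \tag{6}$$

a second-order Taylor expansion of formula (6) gives:

$$Obj \approx \sum_{t} \left[ l\left(y_t, \hat{y}_t^{(m-1)}\right) + a_t f_m(x_t) + \frac{1}{2} b_t f_m^2(x_t) \right] + \Omega(f_m) + \text{constant} \tag{7}$$

wherein $a_t$ is the first derivative and $b_t$ is the second derivative of $l\left(y_t, \hat{y}_t^{(m-1)}\right)$ with respect to $\hat{y}_t^{(m-1)}$; let $I_m = \{t : q(x_t) = m\}$ denote the instance set of leaf $m$ (here $m$ indexes the leaves of the new tree, whose output for $x_t$ is the leaf weight $w_{q(x_t)}$); after the constant terms are removed, formula (7) is expressed as:

$$Obj \approx \sum_{m=1}^{T} \left[ \left( \sum_{t \in I_m} a_t \right) w_m + \frac{1}{2} \left( \sum_{t \in I_m} b_t + \lambda \right) w_m^2 \right] + \gamma T \tag{8}$$

minimizing formula (8) with respect to each $w_m$ gives the optimal value of the objective function:

$$Obj^{*} = -\frac{1}{2} \sum_{m=1}^{T} \frac{\left( \sum_{t \in I_m} a_t \right)^2}{\sum_{t \in I_m} b_t + \lambda} + \gamma T \tag{9}$$

during training, the model continuously fits the data so as to approach formula (9) as closely as possible;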

step five: based on the training set data and the established XGBoost reservoir dam risk level assessment prediction model, opening the data set with Pandas, selecting the 25 characteristic data columns with iloc and inputting them into the model as the X variable, and inputting the risk level into the model as the Y variable, where Y is y_train of the training set data, to obtain a preliminary XGBoost model;

step six: for the XGBoost model established in step five, grouping the data set by cross-validation, setting the parameters, determining the parameter ranges to be optimized and the parameter search step, training the classifier on one part of the data set as the training set, verifying the model on the validation set, and recording the final classification accuracy as the performance of the classifier;

step seven: for each training run in cross-validation, traversing all parameter combinations with GridSearch to determine the optimal parameters, and outputting the parameter combination with the highest accuracy among the k training results as the final parameters of the model;

step eight: substituting the optimal parameters obtained in step seven into the model, comprehensively evaluating the performance of the model on the test set, and selecting the scheme with the highest accuracy as the optimal scheme.
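
The optimal objective of formula (9) in step four can be checked numerically on a toy tree. The sketch below assumes the standard XGBoost closed form, in which the optimal weight of a leaf is $-G/(H+\lambda)$ with $G$ and $H$ the sums of $a_t$ and $b_t$ over that leaf, and it assumes a squared-error loss $\frac{1}{2}(y - \hat{y})^2$ (so $a_t = \hat{y}_t - y_t$ and $b_t = 1$) instead of the cross entropy loss of the claim, purely to keep the arithmetic short. All names and values are hypothetical.

```python
def leaf_weight(grads, hessians, lam):
    """Optimal leaf weight -G / (H + lambda) for one leaf."""
    G, H = sum(grads), sum(hessians)
    return -G / (H + lam)

def structure_score(leaves, lam, gamma):
    """Optimal objective: -1/2 * sum(G^2 / (H + lambda)) + gamma * T."""
    total = 0.0
    for grads, hessians in leaves:
        G, H = sum(grads), sum(hessians)
        total += G * G / (H + lam)
    return -0.5 * total + gamma * len(leaves)

# Four samples, previous prediction 0.5 for all of them.
y = [1.0, 1.0, 0.0, 0.0]
y_prev = [0.5, 0.5, 0.5, 0.5]

# Squared-error assumption: first derivative a_t, second derivative b_t.
a = [p - t for p, t in zip(y_prev, y)]
b = [1.0] * len(y)

# A tree with T = 2 leaves: samples 0-1 in leaf 1, samples 2-3 in leaf 2.
leaves = [(a[:2], b[:2]), (a[2:], b[2:])]
w = [leaf_weight(g, h, lam=1.0) for g, h in leaves]
score = structure_score(leaves, lam=1.0, gamma=0.1)
```

With $\lambda = 1$ the two leaf weights come out as $+1/3$ and $-1/3$, pulling the predictions toward their targets, while $\gamma$ adds a fixed cost of 0.1 per leaf to the score.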

2. The XGBoost-based reservoir dam risk level assessment method according to claim 1, wherein the preprocessing in step one comprises at least one of the following means:

(1) resolving inconsistencies between data content and column titles, abnormal data formats, and missing data content;

(2) applying feature transformation to the text features, converting the text data for dam type, impervious body type, dam building material and dam building purpose into numerical data with LabelEncoder.
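
The text-to-number conversion of item (2) can be sketched without any library. The helper below is a simplified stand-in for LabelEncoder that assigns integer codes to the distinct strings in sorted order; the function names and the example dam types are hypothetical.

```python
def fit_label_encoder(values):
    """Map each distinct text value to an integer code, in sorted order."""
    classes = sorted(set(values))
    return {value: code for code, value in enumerate(classes)}

def transform(values, mapping):
    """Replace every text value by its integer code."""
    return [mapping[value] for value in values]

# Hypothetical text feature: dam type of four reservoirs.
dam_types = ["earth dam", "gravity dam", "arch dam", "earth dam"]

mapping = fit_label_encoder(dam_types)
codes = transform(dam_types, mapping)
```

The integer codes carry no ordering information of their own, which is acceptable for tree models that split on thresholds but should be kept in mind when the codes are interpreted.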

3. The XGBoost-based reservoir dam risk level assessment method according to claim 2, further comprising the process of:

(1) first counting the missing values of each feature and deleting invalid features whose missing proportion exceeds 60%;

(2) removing duplicated feature information from the reservoir dam data set, which contains both numerical and text data;

(3) judging whether the data set is a mixed data set containing both numerical and text data and, if text data is present, choosing according to the number of features either to map the text feature values to numerical values or to map them into a high-dimensional space and then reduce the dimensionality of the features;

(4) filling in the missing feature values in the feature set: because the features of reservoir dams are mostly discrete values, the missing values are filled with the mean or the mode of the feature over the reservoir dams in the study area; or the missing values are uniformly set to a specific value to improve the robustness of the model.
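
Items (1) and (4) can be sketched as follows, assuming a column-oriented table in which `None` marks a missing value. The column names and values are hypothetical; only the 60% threshold and the mode filling come from the claim.

```python
from collections import Counter

def drop_sparse_features(table, max_missing=0.6):
    """Drop columns whose share of missing values exceeds max_missing."""
    kept = {}
    for name, column in table.items():
        missing = sum(value is None for value in column) / len(column)
        if missing <= max_missing:
            kept[name] = column
    return kept

def fill_with_mode(column):
    """Fill missing entries with the most frequent observed value."""
    present = [value for value in column if value is not None]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if value is None else value for value in column]

# Hypothetical discrete features of five reservoir dams.
table = {
    "dam_height_class": [2, None, 2, 3, 2],
    "seepage_flag": [None, None, None, None, 1],
    "material_code": [0, 1, 1, None, 1],
}

kept = drop_sparse_features(table)
filled = {name: fill_with_mode(column) for name, column in kept.items()}
```

Here `seepage_flag` is 80% missing and is removed, while the single gaps in the other two columns are filled with the mode of each column.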

4. The XGBoost-based reservoir dam risk level assessment method according to claim 1, wherein in the fourth step, all risk influence factors in the training set are input into a model for training, and the importance degree of all risk influence factors and the risk level of each reservoir dam are output.

5. The XGBoost-based reservoir dam risk level assessment method according to claim 1, wherein in step four the loss function of the XGBoost model is adjusted:

the cross entropy loss function of formula (4) above is used and, correspondingly, the objective parameter among the XGBoost model parameters is set to "multi:softmax" to output multi-class results; or a corresponding loss function is set according to the different requirements of the reservoir dam risk level assessment prediction model.
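
How a softmax output relates to the cross entropy loss can be shown for a single sample. This is a minimal sketch with hypothetical raw scores for three risk levels; it illustrates the arithmetic only and is not the internal implementation of XGBoost.

```python
import math

def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_onehot, probs):
    """Cross entropy -sum(y * log p) for one sample with a one-hot label."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, probs) if y > 0)

# Hypothetical raw scores of one reservoir dam for risk levels 0, 1, 2.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)

# True risk level is 0, encoded one-hot.
loss = cross_entropy([1, 0, 0], probs)

# A "multi:softmax"-style output is the index of the largest probability.
predicted_level = max(range(len(probs)), key=probs.__getitem__)
```

In XGBoost, the "multi:softmax" objective returns the predicted class directly, whereas "multi:softprob" returns the per-class probabilities, which is the more convenient choice when the probabilities themselves are needed.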

6. The XGBoost-based reservoir dam risk level assessment method according to claim 1, wherein the model verification in step six uses cross-validation, either K-fold cross-validation or Leave One Out validation.
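
The interplay of K-fold cross-validation and an exhaustive parameter search can be sketched in plain Python. The sketch assumes contiguous folds and uses a deliberately tiny stand-in model (a cut point between two class centres on one feature) in place of XGBoost; the parameter names `offset` and `use_median`, the data and all function names are hypothetical.

```python
from itertools import product
from statistics import mean, median

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) for k contiguous folds; k == n is Leave One Out."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if not start <= i < start + size]
        yield train, val
        start += size

def fit_threshold(xs, ys, train, offset, use_median):
    """Toy stand-in for training: a cut point between the two class centres."""
    centre = median if use_median else mean
    c0 = centre([xs[i] for i in train if ys[i] == 0])
    c1 = centre([xs[i] for i in train if ys[i] == 1])
    return (c0 + c1) / 2 + offset

def grid_search_cv(xs, ys, grid, k):
    """Try every parameter combination and keep the best mean fold accuracy."""
    names = sorted(grid)
    best_params, best_score = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train, val in k_fold_indices(len(xs), k):
            cut = fit_threshold(xs, ys, train, **params)
            hits = sum((xs[i] > cut) == (ys[i] == 1) for i in val)
            fold_scores.append(hits / len(val))
        score = mean(fold_scores)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

xs = [1, 6, 2, 7, 3, 8, 4, 9]
ys = [0, 1, 0, 1, 0, 1, 0, 1]
grid = {"offset": [-3, 0, 3], "use_median": [False, True]}

best_params, best_score = grid_search_cv(xs, ys, grid, k=4)
loo_folds = list(k_fold_indices(len(xs), len(xs)))
```

Leave One Out is the special case in which the number of folds equals the number of samples, so every validation fold holds exactly one sample; it costs one training run per sample for each parameter combination.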

7. The XGBoost-based reservoir dam risk level assessment method according to claim 1, wherein step eight further outputs the feature importance of the model, removes unimportant influence factors according to the feature importance, and inputs the new feature set into the model for training as required.
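
The importance-based pruning can be illustrated on a hand-written list of splits. The sketch assumes each split of the trained trees is available as a (feature, gain, samples covered) record and that "gain" and "cover" are averaged over the splits of a feature; the feature names, the numbers and the threshold of 1.0 are hypothetical.

```python
from collections import defaultdict

def importance(splits, mode):
    """Aggregate per-feature importance from (feature, gain, cover) records."""
    count = defaultdict(int)
    gain_sum = defaultdict(float)
    cover_sum = defaultdict(float)
    for feature, gain, cover in splits:
        count[feature] += 1
        gain_sum[feature] += gain
        cover_sum[feature] += cover
    if mode == "weight":   # number of splits that use the feature
        return dict(count)
    if mode == "gain":     # average gain of those splits
        return {f: gain_sum[f] / count[f] for f in count}
    if mode == "cover":    # average number of samples at those splits
        return {f: cover_sum[f] / count[f] for f in count}
    raise ValueError(f"unknown mode: {mode}")

def prune(features, scores, threshold):
    """Keep only the features whose importance reaches the threshold."""
    return [f for f in features if scores.get(f, 0.0) >= threshold]

# Hypothetical split records of a small tree ensemble.
splits = [
    ("dam_height", 12.0, 100),
    ("dam_height", 4.0, 40),
    ("storage_capacity", 9.0, 60),
    ("dam_age", 0.5, 10),
]

by_weight = importance(splits, "weight")
by_gain = importance(splits, "gain")
by_cover = importance(splits, "cover")
kept = prune(["dam_height", "storage_capacity", "dam_age"], by_gain, threshold=1.0)
```

The three modes can rank features differently: `dam_height` leads by weight and cover, while `storage_capacity` leads by average gain, so the mode should be fixed before a pruning threshold is chosen.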

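8. The XGBoost-based reservoir dam risk level assessment method according to claim 1, wherein in step eight the classification accuracy is displayed visually with a confusion matrix, and the model performance is comprehensively assessed with the Accuracy, Precision, Recall and F-Score indexes, which are expressed by the following formulas:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

wherein $TP$, $TN$, $FP$ and $FN$ are the numbers of true positives, true negatives, false positives and false negatives, respectively.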
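
For a multi-class risk level problem the four indexes are usually read off the confusion matrix one class at a time. The sketch below assumes the standard definitions (precision TP/(TP+FP), recall TP/(TP+FN), and the F-Score as their harmonic mean) and uses hypothetical labels for three risk levels.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def per_class_metrics(matrix, c):
    """Precision, recall and F-Score of class c, treating c as positive."""
    tp = matrix[c][c]
    fp = sum(row[c] for row in matrix) - tp
    fn = sum(matrix[c]) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score

# Hypothetical true and predicted risk levels of eight reservoir dams.
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 2, 0, 1]

matrix = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy = sum(matrix[i][i] for i in range(3)) / len(y_true)
precision_1, recall_1, f_score_1 = per_class_metrics(matrix, 1)
```

Averaging the per-class values (a macro average) gives a single figure per index; whether to weight that average by class size depends on how much the rare high-risk levels should count.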

9. An XGBoost-based reservoir dam risk level assessment system for implementing the XGBoost-based reservoir dam risk level assessment method of any one of claims 1-8, comprising: a characteristic data acquisition module, a data processing module, a data set dividing module, a prediction model building module, a model training module, a model verification module, an optimal parameter determining module and a performance evaluation module; the characteristic data acquisition module is used for acquiring the characteristic data and the risk level table related to the risk influence factors of the reservoir dam, and for applying preprocessing and feature engineering to the data to obtain a data set; the data processing module is used for checking whether the samples in the data set are balanced and for processing unbalanced samples to make them more balanced; the data set dividing module is used for dividing the data set processed by the data processing module into a training set and a test set; the prediction model building module is used for building the XGBoost reservoir dam risk level assessment prediction model, setting its parameters, adjusting the way feature importance is calculated and setting the way the risk level is output; the model training module is used for taking the characteristic data as the X variable and the risk level as the Y variable, and for training the XGBoost reservoir dam risk level assessment prediction model built by the prediction model building module; the model verification module is used for grouping the data set according to the model obtained by the model training module, training the classifier on one part of the data set as the training set, verifying the model on the validation set, and recording the final classification accuracy as the performance of the classifier; the optimal parameter determining module is used for traversing all parameter combinations in training, determining the optimal parameters and outputting them; the performance evaluation module is used for putting the optimal parameters obtained by the optimal parameter determining module into the model and comprehensively evaluating the performance of the model on the test set.