CN113807570B - XGBoost-based reservoir dam risk level assessment method and system - Google Patents

XGBoost-based reservoir dam risk level assessment method and system

Info

Publication number
CN113807570B
Authority
CN
China
Prior art keywords
model
data
reservoir dam
risk level
xgboost
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110924472.6A
Other languages
Chinese (zh)
Other versions
CN113807570A (en
Inventor
丁炜
金有杰
高佳琦
刘娜
孙建庭
林艳燕
陈季
牛睿平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Naiwch Cooperation
Nanjing Water Conservancy and Hydrology Automatization Institute Ministry of Water Resources
Original Assignee
Jiangsu Naiwch Cooperation
Nanjing Water Conservancy and Hydrology Automatization Institute Ministry of Water Resources
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Naiwch Cooperation, Nanjing Water Conservancy and Hydrology Automatization Institute Ministry of Water Resources filed Critical Jiangsu Naiwch Cooperation
Priority to CN202110924472.6A priority Critical patent/CN113807570B/en
Publication of CN113807570A publication Critical patent/CN113807570A/en
Application granted granted Critical
Publication of CN113807570B publication Critical patent/CN113807570B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/10Pre-processing; Data cleansing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/151Transformation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0635Risk analysis of enterprise or organisation activities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06Electricity, gas or water supply

Abstract

The invention provides an XGBoost-based reservoir dam risk level assessment method and system. Input features are preprocessed with feature engineering techniques, and data content and format are normalized; the model is data-driven, which reduces the influence of subjective factors on the model; the optimal parameters of the model are computed adaptively with GridSearch and cross-validation, which improves working efficiency and saves manpower and material resources; and an assessment and prediction model is built with machine learning at its core to mine the deep features of massive data and automatically identify the severity of potential safety hazards of a reservoir dam. The efficiency and accuracy of reservoir dam risk level assessment and prediction are thereby improved, the risk early-warning capability for reservoir dams is comprehensively enhanced, and the long-term safe operation of reservoir dams is ensured.

Description

XGBoost-based reservoir dam risk level assessment method and system
Technical Field
The invention belongs to the technical field of water conservancy and hydrology, relates to a safety risk assessment method based on reservoir dam data acquisition, and particularly relates to a reservoir dam risk level assessment method and system based on XGBoost.
Background
At present, China has more than 98,000 reservoirs, most of which were built in the 1950s to 1970s. For historical, economic and technical reasons, many reservoir dams have serious defects and a prominent dam-break risk. To reduce the probability of reservoir dam failure, risk assessment must be carried out regularly. Current reservoir dam safety risk assessment relies mainly on experienced researchers and technicians analyzing and evaluating the available information on existing reservoir dams, which has the following problems and shortcomings:
(1) The degree of automation of the analysis and evaluation process is low. Owing to the "congenital deficiency" in the engineering quality of reservoir dams in China and their low level of informatization, current reservoir dam assessment is carried out through manual analysis, manual modeling and manual evaluation, which consumes a great deal of manpower and material resources and increases the burden on technicians.
(2) The degree of intelligence in modeling and parameter optimization is low. Current methods for establishing a risk assessment system and a risk assessment model are subjective: the selection and setting of index factors depend on historical statistical data, expert experience and the like. A person with considerable experience is therefore required to select and determine parameters manually when the established system and model are used, which makes them difficult for reservoir management personnel to use.
(3) The generality of the model and method is low. To improve the accuracy of the evaluation model in a targeted manner, the input features and model parameters have in the past been processed for a specific study area when the model was first constructed. Once the study area or an influencing factor within it changes, the risk assessment model is no longer applicable and must be readjusted by a technician.
Disclosure of Invention
In order to solve the above problems, the invention provides an XGBoost-based reservoir dam risk level assessment and prediction method that offers high generality, convenience, intelligence and accuracy for different types of potential risk influencing factors of a reservoir dam (such as dam building material, flood discharge gate width and the like), thereby improving the early-warning capability for reservoir dam safety risks.
The invention makes risk prediction and assessment of a reservoir dam (or group of dams) "intelligent": input features are preprocessed and data content and format are normalized through feature engineering techniques such as cleaning abnormal samples, sampling, normalization and discretization; the model is data-driven, which reduces the influence of subjective factors on the model; the optimal parameters of the model are computed adaptively with GridSearch and cross-validation, which improves working efficiency and saves manpower and material resources; and an assessment and prediction model is built with machine learning at its core to mine the deep features of massive data and improve the accuracy of the model.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a reservoir dam risk level assessment method based on XGBoost comprises the following steps:
step one: acquiring feature data related to the risk influence factors of a reservoir dam (feature details are shown in table 1) and a risk level table (table 2); reviewing the main data types, data formats, missing data, variance, mean values and the like of the feature data; and performing feature engineering such as feature cleaning, data conversion and data filling on all the feature data to obtain a data set, wherein the processing mainly comprises the following steps:
(1) Handling problems such as data content that does not match its column title, abnormal data formats and missing data content;
(2) Carrying out feature transformation on the text features, converting text data such as dam type, impervious body type, dam building material and dam building purpose into numerical data by using LabelEncoder.
Preferably, the preprocessing and feature engineering of the reservoir dam feature data set may further include the following steps, as shown in fig. 5:
1. Firstly, counting the missing values of each feature and deleting invalid features whose missing proportion is greater than 60%;
2. Removing repeated feature information from the reservoir dam data set, for both numerical and text data;
3. Judging whether the data set is a mixed data set containing both numerical and text data. If text data exists, depending on the number of features, either mapping the text feature values to numerical values or mapping them into a high-dimensional space, and then reducing the feature dimension by combining methods such as principal component analysis, association algorithms and correlation analysis. The reservoir dam data set used in this patent is a mixed data set containing both numerical and text data; the text data can be mapped before the subsequent dimension reduction. A schematic diagram of the feature dimension reduction effect (taking 6-dimensional data as an example) is shown in fig. 6; after dimension reduction the 6-dimensional features become 3-dimensional features, as shown in fig. 7.
4. Filling in the missing feature values in the feature set. Because the features of the reservoir dam are discrete values, missing values can be filled with the mean or mode of that feature over the reservoir dams in the study area; alternatively, the missing values can be uniformly set to a specific value, which improves the robustness of the model.
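As an illustration only, items 1, 2 and 4 of the processing above might be sketched with pandas as follows. The function name `clean_features`, the column names and the toy values are hypothetical and do not come from the patent; the 60% threshold and the mode filling follow the text.

```python
import pandas as pd

def clean_features(df: pd.DataFrame, max_missing: float = 0.6) -> pd.DataFrame:
    """Drop sparse features, remove duplicate rows, fill remaining gaps."""
    # Item 1: delete features whose missing proportion exceeds the threshold.
    missing_ratio = df.isna().mean()
    df = df.loc[:, missing_ratio <= max_missing]
    # Item 2: remove repeated records.
    df = df.drop_duplicates().reset_index(drop=True)
    # Item 4: fill remaining gaps with the column mode (features are discrete).
    for col in df.columns:
        if df[col].isna().any():
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df

# Hypothetical toy data; the column names are illustrative only.
raw = pd.DataFrame({
    "dam_type": ["earth", "earth", "gravity", None, "earth", "arch"],
    "gate_count": [2, 2, 4, 2, None, 2],
    "mostly_empty": [None, None, None, None, None, 1.0],
})
clean = clean_features(raw)
```

In this toy case the column `mostly_empty` is removed because five of its six values are missing, one duplicate row is dropped, and the two remaining gaps are filled with the most frequent value of their column.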
Table 1 reservoir dam risk impact factor table
Table 2 Reservoir dam risk level table
Sequence number | Potential risk rating | Risk level definition
1 | Low | A reservoir dam accident would cause no casualties and would have little impact on the economy and environment
2 | Medium (Significant) | A reservoir dam accident would cause no casualties but would have a certain impact on the economy and environment
3 | High | A reservoir dam accident is extremely likely to cause casualties
Step two: based on the data set completed in step one, checking for sample imbalance. In a real multi-class data set it is rarely possible to ensure that each class has a similar number of samples, so the samples need to be processed by appropriate technical means. At the data level, to address the extreme sample imbalance of the reservoir dam data set, a fusion sampling approach based on downsampling and class-balanced sampling ensures that the data set meets the requirements of the model; at the algorithm level, the weight of each class can be set and adjusted according to the classification of reservoir dam risk levels, penalty terms can be added when building the algorithm, and so on. Fusion sampling is then applied to the reservoir dam risk level data set obtained in step one according to its sample distribution.
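The algorithm-level option mentioned above (setting a weight for each class) can be illustrated with a short sketch. The patent does not state how the weights are chosen; the formula used here, n_samples / (n_classes * class_count), is a common "balanced" heuristic and is an assumption, as are the function name and the toy label counts.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}

# Hypothetical imbalanced label set using the three levels of table 2.
labels = ["Low"] * 6 + ["Significant"] * 3 + ["High"] * 1
weights = balanced_class_weights(labels)
```

Rare classes receive larger weights, so a misclassified "High" sample would be penalized more heavily than a misclassified "Low" sample if these weights were passed to the learner as per-sample weights.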
Step three: dividing the data set processed in step two into a training set and a test set: the training set is input to the model, which continuously learns the deep information of each feature in the training data and thereby acquires the ability to assess and predict; the test set is used to evaluate the predictive ability of the model. Data set division methods include the hold-out method, the bootstrap method, the leave-one-out method and the like. Here the data set is divided in a ratio of 8:2, and test set data must not appear in the training set, in order to prevent data leakage and to evaluate the quality of the model fairly.
Step four: establishing an XGBoost reservoir dam risk level assessment prediction model, setting general parameters, task parameters and base learner parameters as default parameters, adjusting a mode of calculating feature importance according to data set features, and setting a mode of outputting a result of each reservoir dam risk level.
Specifically, in the fourth step, an XGBoost model is constructed, all risk influence factors in a training set are input into the model for training, the importance degree of all risk influence factors and the risk level of each reservoir dam are output, and the method specifically comprises the following steps:
The risk level of the reservoir dam is assessed and predicted with the XGBoost algorithm, whose objective function is:

$$Obj = \sum_{t} l(y_t, \hat{y}_t) + \sum_{k} \Omega(f_k)$$

In the above formula, $\Omega$ is the regularization term that evaluates the complexity of the model (with $f_k$ the $k$-th tree), $l$ is the loss function that evaluates how well the model fits, $y_t$ represents the true score value on the sample, and $\hat{y}_t$ represents the predicted score value on the sample.
The feature importance calculation can adopt a mode of weight, gain and cover, wherein the weight reflects the occurrence times of the features in the tree; "gain" reflects the average gain at feature split; "cover" is the number of all samples that a feature covers when splitting a node.
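A minimal sketch of the three importance modes, computed from a hypothetical list of split records (feature name, split gain, number of samples covered). The record format and the feature names are assumptions made for illustration; "cover" is averaged over the splits of a feature here, which is how XGBoost documents its "cover" importance type.

```python
from collections import defaultdict

def importance(splits, kind="gain"):
    """Aggregate split records (feature, gain, cover) into an importance score."""
    count = defaultdict(int)
    gain_sum = defaultdict(float)
    cover_sum = defaultdict(float)
    for feature, gain, cover in splits:
        count[feature] += 1
        gain_sum[feature] += gain
        cover_sum[feature] += cover
    if kind == "weight":   # number of times the feature is used to split
        return dict(count)
    if kind == "gain":     # average gain over the splits using the feature
        return {f: gain_sum[f] / count[f] for f in count}
    if kind == "cover":    # average number of samples covered by those splits
        return {f: cover_sum[f] / count[f] for f in count}
    raise ValueError(f"unknown importance kind: {kind}")

# Hypothetical splits collected from a small ensemble.
splits = [("gate_width", 8.0, 100), ("gate_width", 4.0, 40), ("dam_material", 9.0, 60)]
by_weight = importance(splits, "weight")
by_gain = importance(splits, "gain")
by_cover = importance(splits, "cover")
```

The example shows why the mode matters: `gate_width` ranks first by weight and cover, while `dam_material` ranks first by gain.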
Step five: based on the training set data and the established XGBoost reservoir dam risk level assessment and prediction model, opening the data set with Pandas, selecting the 25 feature columns with iloc and inputting them to the model as the X variable, and inputting the risk level as the Y variable, to obtain a preliminary XGBoost model. Here X is x_train in the training set data (the 25 features of the training set) and Y is y_train in the training set data (the risk level of each sample, corresponding to table 2).
Step six: for the XGBoost model established in step five, grouping the data set by cross-validation, setting parameters, and determining the parameter ranges to optimize and the parameter search step size. Part of the data set is used as a training set to train the classifier, a validation set is used to validate the model, and the final classification accuracy is recorded as the performance of the classifier.
Preferably, in step six, the cross-validation method can be selected according to the actual situation:
if the amount of data is not particularly large, KFold, namely K-fold cross-validation, can be selected; if computing resources are sufficient, LOO, namely Leave-One-Out validation, can be tried.
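The relationship between the two options can be sketched without any library: Leave-One-Out is K-fold cross-validation with K equal to the number of samples, which is why it needs more computing resources. The helper `k_fold_indices` is a hypothetical name; it produces contiguous folds without shuffling, which is an assumption made to keep the example short.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    # The first n_samples % k folds receive one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        val = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, val
        start = stop

folds = list(k_fold_indices(10, 5))   # 5-fold: 5 models, 2 validation samples each
loo = list(k_fold_indices(4, 4))      # Leave-One-Out: one model per sample
```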
Step seven: for each training in Cross-validation, traversing all parameter combinations by GridSearch, determining the optimal parameters, and outputting the parameter combination with highest precision in k training results as the final parameters of the model.
Step eight: applying the optimal parameters obtained in step seven to the model, and comprehensively evaluating the performance of the model on the test set using a confusion matrix together with Precision, Recall, Accuracy and F-score.
Preferably, in step eight, the performance of the reservoir dam risk level evaluation prediction model is evaluated:
The classification accuracy is displayed visually with a confusion matrix, whose basic structure is shown in table 3:
Table 3 Confusion matrix
 | Predicted Class 1 | Predicted Class 2
Actual Class 1 | TP | FN
Actual Class 2 | FP | TN
Wherein TP is the number of Class 1 samples classified correctly; FP is the number of Class 2 samples misclassified as Class 1; FN is the number of Class 1 samples misclassified as Class 2; TN is the number of Class 2 samples classified correctly.
The Accuracy, Precision, Recall and F-score indexes are used to evaluate the performance of the model comprehensively. Accuracy reflects the overall prediction accuracy of the model; the higher it is, the better the model performs. Precision reflects the proportion of samples predicted as a class that truly belong to it; the larger its value, the fewer the misclassifications. Recall reflects the proportion of samples of the true class that the model detects correctly; the larger its value, the fewer samples the model misses. The F-score is the harmonic mean of Precision and Recall; the larger its value, the better the model performs. The four indexes are expressed as follows:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \quad Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}, \quad F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$
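As a worked example, the four indexes can be computed from the TP, FP, FN and TN counts defined for the confusion matrix. The function name and the toy labels below are hypothetical; Class 1 is treated as the positive class and the F-score is taken as the harmonic mean of Precision and Recall.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Count TP/FP/FN/TN for one class and derive the four indexes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn, "accuracy": accuracy,
            "precision": precision, "recall": recall, "f_score": f_score}

# Hypothetical true and predicted classes for eight samples.
y_true = [1, 1, 1, 2, 2, 2, 2, 1]
y_pred = [1, 1, 2, 2, 2, 1, 2, 1]
metrics = binary_metrics(y_true, y_pred)
```

For a multi-class problem such as the three risk levels, the same counts can be taken per class (one class against the rest) and then averaged.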
compared with the prior art, the invention has the following advantages and beneficial effects:
1. according to the invention, a machine learning technology is utilized to construct a reservoir dam risk level assessment prediction model, deep characteristic information of a large amount of data is mined, and intelligent multi-classification of the reservoir dam risk level is realized; the risk level of the reservoir dam is identified and judged with high precision, high intellectualization and high automation, the time cost and the economic cost of manual data processing, manual data analysis and manual modeling are reduced, the risk degree of potential safety hazards of the reservoir dam is automatically identified, the efficiency and the accuracy of the risk level assessment and prediction of the reservoir dam are improved, the risk early warning capability of the reservoir dam is comprehensively enhanced, and the long-term safe operation of the reservoir dam is ensured.
2. The XGBoost multi-classification algorithm is introduced into the reservoir dam risk level assessment and prediction model. Compared with machine learning algorithms such as KNN and SVM, it has the following outstanding advantages: (1) the loss function of XGBoost is adjusted according to the requirements of reservoir dam risk level assessment, so that XGBoost can output multiple risk level assessment results; (2) XGBoost expands the loss function as a Taylor series to second order, which improves the speed of gradient descent and the precision of the error; (3) L1 and L2 norms are added to the objective function, which reduces model complexity, lowers the risk of overfitting and improves generalization ability; (4) a handling strategy for missing values can be learned automatically, which improves the robustness of the model; (5) the performance indexes of the model are better and the risk level assessment and prediction accuracy is higher.
3. Based on the characteristics of reservoir dam data, the invention forms a feature engineering processing system for reservoir dam data, which mainly comprises a missing data filling method, a data encoding method and a data sampling method, and which can mitigate sample imbalance in the data set.
4. With the automatic optimal-parameter calculation method based on GridSearch with cross-validation, the optimal combination of key model parameters is searched automatically across different subsets, overfitting caused by an unreasonable division of the data set is reduced, the model is given the highest accuracy and best generalization ability, the parameter tuning time of the XGBoost model is shortened, and rapid optimization of model performance is achieved.
5. The method breaks through the past limitation that risk assessment and prediction could be performed on only a single reservoir at a time: it can perform risk assessment and prediction on a single reservoir and, based on the weights obtained from model training and the missing-feature handling strategy, it also achieves automatic and rapid assessment and prediction of the risk level of a reservoir group.
Drawings
FIG. 1 is a schematic flow chart of a reservoir dam risk level assessment method based on XGBoost.
FIG. 2 is a Cross-validation with GridSearch flow chart.
Fig. 3 is a schematic diagram of feature importance.
Fig. 4 is a confusion matrix.
Fig. 5 is a schematic diagram of reservoir dam feature data set processing.
Fig. 6 is a 6-dimensional feature diagram.
Fig. 7 is a 3-dimensional feature diagram.
Detailed Description
The technical scheme provided by the present invention will be described in detail with reference to the following specific examples, and it should be understood that the following specific examples are only for illustrating the present invention and are not intended to limit the scope of the present invention.
The invention provides a reservoir dam (group) risk assessment prediction method, which realizes reservoir dam risk prediction assessment 'intellectualization', and the method preprocesses input characteristics by a characteristic engineering technology and normalizes data content and format; driving the model by data, and reducing the influence of subjective factors on the model; the optimal parameters of the model are adaptively calculated by GridSearch and Cross-validation technologies, so that the working efficiency is improved, and meanwhile, the manpower and material resources are saved; and the machine learning technology is used as a core to construct an evaluation prediction model, deep features of mass data are deeply excavated, and the accuracy of the model is improved.
Specifically, the invention provides a reservoir dam risk level assessment method based on XGBoost, which is shown in a figure 1 and comprises the following steps:
step one: firstly, acquiring the original reservoir dam feature data set (comprising feature data and a risk level table). This includes using map and head() in Python to preview the data and using info() to view the data types (shown in table 4). Among the features, dam type, impervious body type, dam building material and dam building purpose are text with the data format object, and this text data is mapped to numerical values using a LabelEncoder; the repair time is numerical data but has the data format object, so its data content and format are cleaned and normalized; missing content in the data is filled with zeros.
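The text-to-number mapping and the zero filling described in this step can be sketched in plain Python. The helper names are hypothetical; `label_encode` imitates the behaviour of scikit-learn's LabelEncoder, which numbers the distinct classes in sorted order, and the dam-type values are illustrative only.

```python
def label_encode(values):
    """Map each distinct text value to an integer, with classes in sorted order."""
    classes = sorted(set(values))
    index = {cls: i for i, cls in enumerate(classes)}
    return [index[v] for v in values], classes

def zero_fill(values):
    """Replace missing entries (None) with 0."""
    return [0 if v is None else v for v in values]

# Hypothetical dam-type column and a numeric column with one gap.
codes, classes = label_encode(["earth", "gravity", "earth", "arch"])
filled = zero_fill([3, None, 5])
```

Note that the integer codes carry no ordering meaning; they only make the text features usable as numerical model inputs.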
Table 4 data population case table
Step two: for the data set after preprocessing and feature engineering, the value_counts function is used in Python to check the sample distribution, and the sample function is used to downsample the data according to that distribution. Random sampling is selected as the sampling mode, and random_state (the random sampling seed, set to a fixed value so that the result can be reproduced) is set to 1.
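A minimal pandas sketch of this step follows. The patent states only that value_counts and sample are used with random_state set to 1; downsampling every class to the size of the rarest class is an assumption made here for illustration, and the function and column names are hypothetical.

```python
import pandas as pd

def downsample(df: pd.DataFrame, label_col: str, random_state: int = 1) -> pd.DataFrame:
    """Randomly downsample every class to the size of the rarest class."""
    n_min = df[label_col].value_counts().min()
    parts = [group.sample(n=n_min, random_state=random_state)
             for _, group in df.groupby(label_col)]
    return pd.concat(parts).reset_index(drop=True)

# Hypothetical imbalanced data: 6, 4 and 2 samples for risk levels 1, 2 and 3.
data = pd.DataFrame({
    "gate_count": range(12),
    "risk_level": [1] * 6 + [2] * 4 + [3] * 2,
})
balanced = downsample(data, "risk_level")
```

Fixing `random_state` makes the sampled rows identical from run to run, which is the reproducibility property the step describes.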
Step three: importing the train_test_split function from the sklearn.model_selection library and dividing the data set with train_test_split, wherein the feature data is the input variable X and the risk level is the predicted variable Y. The division ratio is determined by the test_size parameter, which is set to 0.2, so that 80% of the data is training data and 20% is test data; to ensure reproducibility of the split, random_state is set to 20.
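The split described in this step might look as follows with scikit-learn. The toy X and Y are hypothetical stand-ins for the 25 features and the risk levels; test_size and random_state take the values given in the text.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 samples, two features each, three risk levels.
X = [[i, i % 3] for i in range(50)]
Y = [i % 3 for i in range(50)]

# 80% training data, 20% test data; a fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=20
)
```

Because every sample lands in exactly one of the two subsets, no test sample can leak into training.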
Step four: XGBClassifier is imported from XGBoost and an XGBoost reservoir dam risk level assessment and prediction model is established. In the general parameter configuration of the model, the base learner parameter "booster" is set to "gbtree" and the computing resource allocation parameter "tree_method" is set to "gpu_hist". The feature importance can be calculated in the "weight", "gain" or "cover" mode: "weight" reflects the number of times a feature appears in the trees; "gain" reflects the average gain when the feature is used to split; "cover" is the number of samples covered by the feature when it splits a node. Here the feature importance is calculated in the "gain" mode.
The fourth step specifically comprises the steps of establishing a reservoir dam risk level assessment prediction model based on multi-classification XGBoost:
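Assume $\mathcal{F}$ represents the function space of trees $f$, with each tree $f_k(\cdot) \in \mathcal{F}$, where $k = 1, 2, 3, 4, \dots, K$. The model is given by formula (1):

$$\hat{y}_t = \sum_{k=1}^{K} f_k(x_t), \quad f_k \in \mathcal{F} \tag{1}$$

where $x_t$ is the input and $\hat{y}_t$ is the predicted output.
For the model to perform better in classification, the objective function must be minimized. It is shown in formula (2):

$$Obj = \sum_{t} l(y_t, \hat{y}_t) + \sum_{k=1}^{K} \Omega(f_k) \tag{2}$$

where $\Omega(f_k)$ is the regularization term that measures the complexity of the model; the smaller its value, the lower the complexity and the stronger the generalization ability. The specific formula is as follows:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{m=1}^{T} w_m^2 \tag{3}$$

where $T$ is the number of leaf nodes, $w$ is the weight of the leaf nodes, and $\gamma$ and $\lambda$ are adjustable parameters.
The cross-entropy loss function $l$ measures how well the model fits the training data. The specific formula is as follows:

$$l = -\sum_{i} \sum_{j} y_{ij} \log p_{ij} \tag{4}$$

In formula (4), $y_{ij}$ is the true value and $p_{ij}$ is the predicted value, where $i = 1, 2, 3, \dots, I-1$, and item $I$ [...]
The $m$-th prediction output is calculated according to formula (1), namely:

$$\hat{y}_t^{(m)} = \sum_{k=1}^{m} f_k(x_t) = \hat{y}_t^{(m-1)} + f_m(x_t) \tag{5}$$

Let $C = \sum_{k=1}^{m-1} \Omega(f_k)$, which is constant at round $m$. Simplifying the objective function gives:

$$Obj^{(m)} = \sum_{t} l\left(y_t, \hat{y}_t^{(m-1)} + f_m(x_t)\right) + \Omega(f_m) + C \tag{6}$$

A second-order Taylor expansion of formula (6) gives:

$$Obj^{(m)} \approx \sum_{t} \left[ l\left(y_t, \hat{y}_t^{(m-1)}\right) + a_t f_m(x_t) + \frac{1}{2} b_t f_m^2(x_t) \right] + \Omega(f_m) + C \tag{7}$$

where $a_t$ is the first derivative of $l(y_t, \hat{y}_t^{(m-1)})$ with respect to $\hat{y}_t^{(m-1)}$ and $b_t$ is the second derivative. Let $I_m = \{ t : q(x_t) = m \}$ represent the instance set of leaf $m$ (in the following formulas the subscript $m$ indexes leaf nodes rather than training rounds). Dropping the constant terms, formula (7) is expressed as:

$$Obj \approx \sum_{m=1}^{T} \left[ \left( \sum_{t \in I_m} a_t \right) w_m + \frac{1}{2} \left( \sum_{t \in I_m} b_t + \lambda \right) w_m^2 \right] + \gamma T \tag{8}$$

The optimal value of the objective function obtained by minimizing formula (8) is:

$$Obj^{*} = -\frac{1}{2} \sum_{m=1}^{T} \frac{\left( \sum_{t \in I_m} a_t \right)^2}{\sum_{t \in I_m} b_t + \lambda} + \gamma T \tag{9}$$

Meanwhile, the leaf weight at which the objective function is optimal is:

$$w_m^{*} = -\frac{\sum_{t \in I_m} a_t}{\sum_{t \in I_m} b_t + \lambda} \tag{10}$$

That is, during training the model fits the data continuously so that the objective approaches formula (9) as closely as possible.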
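The closed-form results of this derivation can be checked numerically. The sketch below assumes the standard XGBoost expressions, with the optimal leaf weight equal to -G / (H + λ) and each leaf contributing -G² / (2(H + λ)) to the objective, where G and H are the sums of the first derivatives a_t and second derivatives b_t over the samples in a leaf. The function names and the toy derivative values are hypothetical.

```python
def leaf_weight(grads, hessians, lam):
    """Optimal weight of one leaf: -G / (H + lambda)."""
    return -sum(grads) / (sum(hessians) + lam)

def leaf_objective(grads, hessians, lam):
    """Contribution of one leaf to the optimal objective: -G^2 / (2 (H + lambda))."""
    g_sum = sum(grads)
    return -0.5 * g_sum * g_sum / (sum(hessians) + lam)

def structure_score(leaves, lam, gamma):
    """Optimal objective of a tree: sum over leaves plus gamma times the leaf count."""
    return sum(leaf_objective(g, h, lam) for g, h in leaves) + gamma * len(leaves)

# Hypothetical first and second derivatives for the samples in two leaves.
leaves = [([1.0, 1.0], [0.5, 0.5]), ([-3.0], [2.0])]
weights = [leaf_weight(g, h, lam=1.0) for g, h in leaves]
score = structure_score(leaves, lam=1.0, gamma=0.5)
```

A larger λ shrinks every leaf weight towards zero and a larger γ penalizes each additional leaf, which is how the two adjustable parameters control model complexity.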
Because the reservoir dam risk level output by the assessment and prediction model takes more than two values and differs from the continuous output values described above, the loss function in the XGBoost model must be adjusted. Corresponding to formula (4) above, the objective parameter is set to "multi:softmax" when selecting the XGBoost model parameters, so that multi-class results are output; in addition, a corresponding loss function can be set according to the different requirements of the reservoir dam risk level assessment and prediction model. The fourth step further comprises setting the initial parameters of the model:
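To make the multi-class softmax objective concrete, the following sketch computes the softmax probabilities, the cross-entropy loss and its analytical first and second derivatives with respect to the raw class scores for a single sample. It illustrates the mathematics only and is not XGBoost's internal implementation, which may scale these quantities differently; the scores and the one-hot label are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(scores)                      # subtract the maximum for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_onehot, probs):
    """Cross-entropy loss for one sample with a one-hot true label."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, probs) if y)

def grad_hess(y_onehot, probs):
    """First derivative p - y and diagonal second derivative p (1 - p)."""
    grads = [p - y for y, p in zip(y_onehot, probs)]
    hess = [p * (1.0 - p) for p in probs]
    return grads, hess

# Hypothetical raw scores for the three risk levels; the true level is the first.
probs = softmax([2.0, 1.0, 0.0])
loss = cross_entropy([1, 0, 0], probs)
grads, hess = grad_hess([1, 0, 0], probs)
```

With "multi:softmax" the predicted risk level is the class with the largest probability, here the first class.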
The model parameters mainly comprise general parameters, training parameters and learning task parameters. Initializing the model determines booster and tree_method among the general parameters and objective and seed among the learning task parameters, where booster = "gbtree", tree_method = "gpu_hist" and seed = 27.
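A hedged sketch of how these initial parameters might be passed to the scikit-learn style wrapper of XGBoost. Several choices here are assumptions rather than the patent's configuration: the seed is supplied as `random_state`, the CPU variant "hist" is used so that the sketch does not require a GPU (recent XGBoost releases are understood to prefer "hist" with a separate device setting over "gpu_hist"), and `build_model` is a hypothetical helper that imports xgboost only when called.

```python
# Values taken from the text; keyword spellings follow the xgboost
# scikit-learn wrapper and are an assumption about the intended API.
PARAMS = {
    "booster": "gbtree",           # tree-based base learner
    "tree_method": "hist",         # assumption: CPU stand-in for "gpu_hist"
    "objective": "multi:softmax",  # multi-class output
    "random_state": 27,            # the text calls this parameter "seed"
    "importance_type": "gain",     # feature importance by average gain
}

def build_model(params=None):
    """Create an untrained classifier; xgboost must be installed to call this."""
    from xgboost import XGBClassifier
    return XGBClassifier(**(params or PARAMS))
```

A model created this way would then be trained with `fit(X_train, y_train)`, after which `feature_importances_` reports importance in the selected "gain" mode.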
Step five: based on the training set data divided in step three, opening the data with pd.read_csv in Pandas, selecting with iloc the 25 reservoir dam features such as distance, dam type, impervious body type, dam material, dam building purpose and dam building time as the independent variable X for model training, taking the risk level as the dependent variable Y, and inputting the X and Y data into the model to obtain a preliminary XGBoost model.
Step six: for the XGBoost model established in step five, K-fold cross-validation is adopted. KFold is imported from sklearn.model_selection and the data set is divided into K (K=5) subsets; in turn, 1 of the K subsets is taken as the test set and the other K-1 subsets as the training set, so the model is trained K times. The K results and their average are calculated to analyze the assessment and prediction ability of the model on different data sets and to judge its generalization performance and degree of overfitting, as shown in the Cross-validation part of fig. 2.
Step seven: for each training run in cross-validation, GridSearch traverses all parameter combinations. GridSearchCV is imported from sklearn.model_selection, the estimator parameter is set to XGBClassifier, and parameters of XGBClassifier such as n_estimators, colsample_bytree, booster, objective, gamma, max_depth and min_child_weight are given tuning ranges. Accuracy is selected as the output metric, the optimal parameters are determined by calculation, and the parameter combination with the highest accuracy among the k (k=5) training results is output as the final parameters of the model.
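The mechanics of grid search combined with 5-fold cross-validation can be sketched with the standard library alone. A toy threshold classifier stands in for the XGBoost model, so the hyperparameters `threshold` and `invert`, the data and the interleaved fold assignment are all hypothetical; only the search procedure itself (try every combination, score each by mean validation accuracy, keep the best) mirrors the step above.

```python
from itertools import product
from statistics import mean

# Toy 1-D data set: the label is 1 when the feature is at least 10.
X = list(range(20))
Y = [int(x >= 10) for x in X]

def folds(n_samples, k):
    """Interleaved k-fold split: sample i is validated in fold i % k."""
    for f in range(k):
        train = [i for i in range(n_samples) if i % k != f]
        val = [i for i in range(n_samples) if i % k == f]
        yield train, val

def fit_and_score(params, train, val):
    """Learn the majority label on each side of a threshold, return accuracy."""
    thr = params["threshold"]

    def majority(idx):
        labels = [Y[i] for i in idx]
        return int(2 * sum(labels) >= len(labels)) if labels else 0

    left = majority([i for i in train if X[i] < thr])
    right = majority([i for i in train if X[i] >= thr])
    preds = [right if X[i] >= thr else left for i in val]
    if params["invert"]:
        preds = [1 - p for p in preds]
    return mean(int(p == Y[i]) for p, i in zip(preds, val))

def grid_search_cv(param_grid, k=5):
    """Try every parameter combination; keep the best mean validation accuracy."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = mean(fit_and_score(params, tr, va) for tr, va in folds(len(X), k))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = grid_search_cv(
    {"threshold": [5, 10, 15], "invert": [False, True]}
)
```

The cost grows with the product of the sizes of all parameter ranges times k, which is why the tuning ranges and step sizes are fixed before the search starts.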
Step eight: The optimal parameters obtained in step seven are combined into the model and the feature importance is output; non-important influence factors are eliminated according to the feature importance, and a new feature set is input into the model for training as required. The performance of the different models is evaluated on the test set using the confusion matrix together with Precision, Recall, Accuracy and F-Score, and the scheme with the highest accuracy is selected. The feature importance is shown in Fig. 3, the confusion matrix in Fig. 4 and the accuracy evaluation indexes in Table 5.
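One possible way to eliminate non-important influence factors is to drop features whose share of the total importance falls below a threshold. The text does not say which rule is used, so the threshold rule, the function name and the importance scores below are all assumptions made for illustration.

```python
def prune_features(importance, keep_ratio=0.01):
    """Keep features whose share of total importance is at least keep_ratio."""
    total = sum(importance.values())
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked if score / total >= keep_ratio]


# Made-up importance scores, not the values behind Fig. 3.
toy_importance = {
    "inspection_frequency": 40.0,
    "dam_material": 30.0,
    "gate_width": 29.5,
    "dam_length": 0.5,
}
selected = prune_features(toy_importance, keep_ratio=0.01)
```

The retained feature list would then be used to retrain the model, and the retrained model compared with the full-feature model on the test set.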
Table 5 precision evaluation index
Step eight specifically includes the evaluation and analysis of the feature importance, the confusion matrix and the accuracy indexes:
As can be seen from Fig. 3, among the 25 features, the importance of inspection frequency, dam construction material, impervious body type, gate width and number of gates is significantly higher than that of the other features. Related research shows that, under actual engineering conditions, the dam construction material and the impervious material have an important influence on the deformation and the seepage of the reservoir dam, respectively. The width and number of gates determine the flood discharge capacity of the reservoir, and in historical dam-break events, insufficient flood discharge capacity and the inability to discharge floods in time have been among the important causes of dam breaks. Dam safety inspection is an extremely important link in ensuring the safety of a reservoir dam after it is built; it is timely, comprehensive and intuitive, can promptly discover major safety hazards such as seepage, cracks and equipment faults, and increasing the inspection frequency can greatly improve dam safety. This shows that the model is reasonable to a certain extent in predicting and assessing the potential risks of the reservoir dam foundation.
The prediction accuracy of the classification results is shown in the confusion matrix in Fig. 4: the prediction accuracy is 95% for high risk, 88% for medium risk and 91% for low risk. Applying the accuracy formula gives an overall prediction accuracy of 91.3%. Because the failure of a reservoir dam often has extremely serious consequences, engineering applications place higher demands on the prediction accuracy of the high risk level; the model's high-risk prediction accuracy reaches 95% and its overall accuracy exceeds 90%, which meets the requirements of practical application.
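Per-class and overall accuracy of the kind quoted above can be read off a confusion matrix as sketched below. The matrix is a made-up toy example, not the data of Fig. 4, and the convention that rows are true classes and columns are predicted classes is an assumption.

```python
def per_class_accuracy(cm):
    """Fraction of each true class (row) that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]


def overall_accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Toy 3-class matrix (rows: true high/medium/low, columns: predicted).
toy_cm = [
    [9, 1, 0],
    [1, 8, 1],
    [0, 2, 8],
]
class_acc = per_class_accuracy(toy_cm)
total_acc = overall_accuracy(toy_cm)
```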
The invention also provides an XGBoost-based reservoir dam risk level assessment system for implementing the XGBoost-based reservoir dam risk level assessment method. Specifically, the system comprises a characteristic data acquisition module, a data processing module, a data set dividing module, a prediction model building module, a model training module, a model verification module, an optimal parameter determining module and a performance assessment module. The characteristic data acquisition module is used for acquiring characteristic data and a risk level table related to reservoir dam risk influence factors, and for preprocessing the data and performing feature engineering to obtain a data set; it implements the content of step one of the method. The data processing module is used for checking the sample balance of the data set and processing unbalanced samples to make them more balanced; it implements the content of step two of the method. The data set dividing module is used for dividing the data set processed by the data processing module into a training set and a test set; it implements the content of step three of the method. The prediction model building module is used for building the XGBoost-based reservoir dam risk level assessment prediction model, setting parameters, adjusting the way feature importance is calculated and setting the way risk level results are output; it implements the content of step four of the method.
The model training module is used for taking the characteristic data as the X variable and the risk level as the Y variable, and for training the XGBoost-based reservoir dam risk level assessment prediction model built by the prediction model building module; it implements the content of step five of the method. The model verification module is used for grouping the data set for the model obtained by the model training module, training the classifier with one part of the data set as the training set, then validating the model with the validation set, and recording the final classification accuracy as the performance of the classifier; it implements the content of step six of the method. The optimal parameter determining module is used for traversing all parameter combinations during training, determining the optimal parameter combination and outputting it; it implements the content of step seven of the method. The performance assessment module is used for combining the optimal parameters obtained by the optimal parameter determining module into the model and evaluating the performance of the model in all respects on the test set; it implements the content of step eight of the method.
The technical means disclosed in the solution of the present invention are not limited to those disclosed in the above embodiments, and also include technical solutions formed by any combination of the above technical features. It should be noted that those skilled in the art may make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications are also regarded as falling within the scope of protection of the present invention.

Claims (9)

1. An XGBoost-based reservoir dam risk level assessment method, characterized by comprising the following steps:
step one: acquiring characteristic data and a risk level table related to the risk influence factors of reservoir dams, preprocessing all the characteristic data, performing feature cleaning, data conversion and data filling with respect to data content and format, and normalizing the data to obtain a data set;
step two: based on the data set obtained in step one, checking the data set for sample imbalance and, when imbalance exists, processing the samples by the following technical means so that they are more balanced: at the data level, performing fusion sampling on the reservoir dam risk level data set obtained in step one according to its sample condition, using a fusion sampling mode based on downsampling and class-balanced sampling to ensure that the data set meets the model requirements; at the algorithm level, setting and adjusting the weight of each category according to the reservoir dam risk level classification, and adding penalty terms during algorithm construction;
step three: dividing the data set processed in step two into a training set and a test set; the training set is input into the model, which continuously learns the deep information of each feature in the training data and thereby acquires the ability to evaluate and predict; the test set is used to evaluate the model's predictive ability;
step four: establishing an XGBoost-based reservoir dam risk level assessment prediction model, setting the general parameters, task parameters and base learner parameters to default values, adjusting the way feature importance is calculated according to the characteristics of the data set, and setting the way the risk level result of each reservoir dam is output;
the risk level of the reservoir dam is estimated and predicted by adopting an XGBoost algorithm, and the specific formula of an objective function is as follows:
in the above formula, Ω(f_k) is the regularization term that evaluates the complexity of the model, l(y_t, ŷ_t) is the loss function that evaluates the model fit, y_t is the true score value on the sample, and ŷ_t is the predicted score value on the sample;
the feature importance is calculated in the "weight", "gain" and "cover" modes: "weight" reflects the number of times a feature appears in the trees; "gain" reflects the average gain when the feature is used for splitting; "cover" is the number of samples covered by the feature when it splits a node;
the building of the XGBoost reservoir dam risk level assessment prediction model specifically comprises the following steps:
let F denote the function space of the trees, with each tree f_k(·) ∈ F, where k = 1, 2, 3, 4, …, K, which gives formula (1):
where x_t is the input and ŷ_t is the predicted output;
the objective function to be minimized is shown in formula (2):
where Ω(f_k) is the regularization term that measures the complexity of the model, and the specific formula is as follows:
T is the number of leaf nodes, w is the weight of the leaf nodes, and γ and λ are adjustable parameters;
the specific formula of the cross-entropy loss function is as follows:
in formula (4), y ij Is the true value, p ij Is a predicted value in whichi=1, 2,3, …, I-1, item I->
The mth prediction output is calculated according to formula (1), namely:
order theSimplifying the objective function can obtain:
a second-order Taylor expansion of the objective function gives:
where a_t is the first derivative and b_t is the second derivative of the loss function with respect to the prediction of the previous round; let I_m = {t : q(x_t) = m} denote the instance set of leaf m, so that formula (7) is expressed as:
minimizing formula (8) above gives the optimal value of the objective function:
and the leaf weight at which the objective function is optimal is:
during training, the model continuously fits the data so as to approach formula (9) as closely as possible;
step five: based on the training set data and the established XGBoost reservoir dam risk level assessment prediction model, opening the data set with Pandas, selecting the 25 feature columns with iloc and inputting them into the model as the X variables to obtain a preliminary XGBoost model; the risk level is input into the model as the Y variable, where Y is y_train in the training set data;
step six: for the XGBoost model established in step five, grouping the data set using cross-validation, setting the parameters, determining the parameter ranges to be optimized and the parameter search step size, training the classifier with one part of the data set as the training set, validating the model with the validation set, and recording the final classification accuracy as the performance of the classifier;
step seven: for each training run in cross-validation, traversing all parameter combinations with GridSearch to determine the optimal parameters, and outputting the parameter combination with the highest accuracy among the k training results as the final parameters of the model;
step eight: combining the optimal parameters obtained in step seven into the model, evaluating the performance of the model in all respects on the test set, and selecting the scheme with the highest accuracy.
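The formula images referenced in step four of claim 1 are not reproduced in this text. As a reading aid, the standard XGBoost derivation that matches the symbol definitions given there (trees f_k, leaf count T, leaf weights w, derivatives a_t and b_t, leaf instance sets) can be sketched as follows. This is a reconstruction under stated assumptions, not the patent's own formulas (1) to (10): the superscript (m) for the boosting round and the leaf index j are notation introduced here to keep the two indices apart (the claim writes the leaf index as m), constant terms are dropped, and the cross-entropy loss of formula (4) is left out because its index convention cannot be recovered.

```latex
\begin{align*}
\hat{y}_t &= \sum_{k=1}^{K} f_k(x_t), \qquad f_k \in F
  && \text{(additive tree model)} \\
\mathrm{Obj} &= \sum_{t} l\bigl(y_t, \hat{y}_t\bigr) + \sum_{k=1}^{K} \Omega(f_k)
  && \text{(loss plus regularization)} \\
\Omega(f) &= \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}
  && \text{(complexity of one tree)} \\
\hat{y}_t^{(m)} &= \hat{y}_t^{(m-1)} + f_m(x_t)
  && \text{(prediction after round } m\text{)} \\
\mathrm{Obj}^{(m)} &\approx \sum_{t} \Bigl[ a_t f_m(x_t) + \tfrac{1}{2} b_t f_m(x_t)^{2} \Bigr] + \Omega(f_m)
  && \text{(second-order Taylor expansion)} \\
a_t &= \frac{\partial\, l\bigl(y_t, \hat{y}_t^{(m-1)}\bigr)}{\partial\, \hat{y}_t^{(m-1)}}, \qquad
b_t = \frac{\partial^{2} l\bigl(y_t, \hat{y}_t^{(m-1)}\bigr)}{\partial\, \bigl(\hat{y}_t^{(m-1)}\bigr)^{2}} \\
\mathrm{Obj}^{(m)} &\approx \sum_{j=1}^{T} \Bigl[ \Bigl(\sum_{t \in I_j} a_t\Bigr) w_j
  + \tfrac{1}{2} \Bigl(\sum_{t \in I_j} b_t + \lambda\Bigr) w_j^{2} \Bigr] + \gamma T,
  \qquad I_j = \{\, t : q(x_t) = j \,\} \\
w_j^{*} &= -\frac{\sum_{t \in I_j} a_t}{\sum_{t \in I_j} b_t + \lambda}, \qquad
\mathrm{Obj}^{*} = -\frac{1}{2} \sum_{j=1}^{T}
  \frac{\bigl(\sum_{t \in I_j} a_t\bigr)^{2}}{\sum_{t \in I_j} b_t + \lambda} + \gamma T
\end{align*}
```

Setting the derivative of the bracketed quadratic with respect to each leaf weight to zero gives the optimal leaf weight, and substituting that weight back gives the optimal objective value.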
2. The XGBoost-based reservoir dam risk level assessment method according to claim 1, wherein the preprocessing in step one comprises at least one of the following means:
(1) handling inconsistencies between data content and column titles, abnormal data formats and missing data content;
(2) performing feature transformation on the text features, converting the text data for dam type, impervious body type, dam construction material and dam construction purpose into numerical data using LabelEncoder.
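The text-to-number conversion in item (2) can be illustrated with a small pure-Python stand-in for LabelEncoder, which assigns integer codes to the sorted distinct labels. The function names and the sample dam types below are made up for illustration.

```python
def fit_label_encoder(values):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


def transform(values, mapping):
    """Replace every label with its integer code."""
    return [mapping[v] for v in values]


# Made-up dam types, not taken from the patent's data set.
dam_types = ["earth", "gravity", "arch", "earth"]
mapping = fit_label_encoder(dam_types)
encoded = transform(dam_types, mapping)
```

The integer codes carry no intrinsic order, which tree-based models such as XGBoost usually tolerate better than linear models do.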
3. The XGBoost-based reservoir dam risk level assessment method according to claim 2, further comprising the process of:
(1) first counting the missing values of each feature, and deleting invalid features whose missing proportion exceeds 60%;
(2) for a reservoir dam data set containing numerical and text data, removing duplicated feature information from the data set;
(3) judging whether the data set is a mixed data set containing numerical and text data; if text data exist, choosing, according to the number of feature values, either to map the text feature values to numerical values or to map them into a high-dimensional space and then perform dimension reduction on the features;
(4) filling the missing feature values in the feature set: since reservoir dam features are mostly discrete values, the feature mean or mode of the reservoir dams in the study area is used to fill the missing values; alternatively, the missing values are uniformly set to a specific value to improve the robustness of the model.
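Items (1) and (4) of this claim can be sketched in a few lines: features missing in more than 60% of the rows are dropped, and the remaining gaps are filled with the mode. The record layout (a list of dictionaries with None for missing values), the function names and the toy rows are assumptions made for illustration.

```python
from statistics import mode


def drop_sparse_features(rows, max_missing=0.6):
    """Drop features whose share of missing values (None) exceeds max_missing."""
    n = len(rows)
    keep = [
        name for name in rows[0]
        if sum(r[name] is None for r in rows) / n <= max_missing
    ]
    return [{name: r[name] for name in keep} for r in rows]


def fill_with_mode(rows):
    """Replace each missing value with the most frequent observed value."""
    filled = [dict(r) for r in rows]
    for name in filled[0]:
        observed = [r[name] for r in rows if r[name] is not None]
        fill = mode(observed)
        for r in filled:
            if r[name] is None:
                r[name] = fill
    return filled


# Made-up records: "remark" is missing in 80% of rows and is dropped.
toy_rows = [
    {"gate_count": 2, "dam_type": 1, "remark": None},
    {"gate_count": None, "dam_type": 1, "remark": None},
    {"gate_count": 2, "dam_type": 0, "remark": None},
    {"gate_count": 4, "dam_type": None, "remark": None},
    {"gate_count": 2, "dam_type": 1, "remark": 3},
]
cleaned = fill_with_mode(drop_sparse_features(toy_rows))
```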
4. The XGBoost-based reservoir dam risk level assessment method according to claim 1, wherein in the fourth step, all risk influence factors in the training set are input into a model for training, and the importance degree of all risk influence factors and the risk level of each reservoir dam are output.
5. The XGBoost-based reservoir dam risk level assessment method according to claim 1, wherein in step four the loss function in the XGBoost model is adjusted:
corresponding to formula (4) above, the objective parameter is set to "multi:softmax" in the XGBoost model parameters to output multi-class results; or a corresponding loss function is set according to the different requirements of the reservoir dam risk level assessment prediction model.
6. The XGBoost-based reservoir dam risk level assessment method according to claim 1, wherein the model validation in step six adopts cross-validation, selecting either K-fold cross-validation or leave-one-out cross-validation.
7. The XGBoost-based reservoir dam risk level assessment method according to claim 1, wherein the step eight further outputs the feature importance thereof, eliminates non-important influence factors according to the feature importance, and inputs a new feature set into a model for training according to requirements.
8. The XGBoost-based reservoir dam risk level assessment method according to claim 1, wherein in step eight the classification accuracy is displayed visually using a confusion matrix, and model performance is comprehensively assessed using the Accuracy, Precision, Recall and F-Score indexes; the basic formulas of the four indexes are as follows:
where TP is the number of Class 1 samples classified correctly; FP is the number of Class 2 samples misclassified as Class 1; FN is the number of Class 1 samples misclassified as Class 2; TN is the number of Class 2 samples classified correctly.
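The formula images for the four indexes are not reproduced in this text. Their standard definitions in terms of the TP, FP, FN and TN counts defined in the claim are sketched below; writing the F-Score in its common F1 form (the harmonic mean of Precision and Recall) is an assumption, since the claim does not state a weighting.

```latex
\begin{align*}
\mathrm{Accuracy}  &= \frac{TP + TN}{TP + FP + FN + TN} \\
\mathrm{Precision} &= \frac{TP}{TP + FP} \\
\mathrm{Recall}    &= \frac{TP}{TP + FN} \\
F\text{-}\mathrm{Score} &= \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\end{align*}
```

With illustrative counts TP = 8, FP = 2, FN = 1 and TN = 9, these give Accuracy = 17/20 = 0.85, Precision = 8/10 = 0.80, Recall = 8/9 ≈ 0.889 and F-Score = 16/19 ≈ 0.842.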
9. An XGBoost-based reservoir dam risk level assessment system for implementing the XGBoost-based reservoir dam risk level assessment method of any one of claims 1-8, comprising: a characteristic data acquisition module, a data processing module, a data set dividing module, a prediction model building module, a model training module, a model verification module, an optimal parameter determining module and a performance evaluation module; the characteristic data acquisition module is used for acquiring characteristic data and a risk level table related to the risk influence factors of reservoir dams, and for preprocessing the data and performing feature engineering to obtain a data set; the data processing module is used for checking the sample balance of the data set and processing unbalanced samples so as to make the samples more balanced; the data set dividing module is used for dividing the data set processed by the data processing module into a training set and a test set; the prediction model building module is used for building the XGBoost reservoir dam risk level assessment prediction model, setting parameters, adjusting the way feature importance is calculated and setting the way the risk level is output; the model training module is used for taking the characteristic data as the X variable and the risk level as the Y variable, and training the XGBoost reservoir dam risk level assessment prediction model built by the prediction model building module; the model verification module is used for grouping the data set for the model obtained by the model training module, training the classifier with one part of the data set as the training set, validating the model with the validation set, and recording the final classification accuracy as the performance of the classifier; the optimal parameter determining module is used for traversing all parameter combinations during training, determining the optimal parameters and outputting them; the performance evaluation module is used for combining the optimal parameters obtained by the optimal parameter determining module into the model and evaluating the performance of the model in all respects on the test set.
CN202110924472.6A 2021-08-12 2021-08-12 XGBoost-based reservoir dam risk level assessment method and system Active CN113807570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110924472.6A CN113807570B (en) 2021-08-12 2021-08-12 XGBoost-based reservoir dam risk level assessment method and system

Publications (2)

Publication Number Publication Date
CN113807570A CN113807570A (en) 2021-12-17
CN113807570B true CN113807570B (en) 2024-02-02

Family

ID=78942777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110924472.6A Active CN113807570B (en) 2021-08-12 2021-08-12 XGBoost-based reservoir dam risk level assessment method and system

Country Status (1)

Country Link
CN (1) CN113807570B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114331160B (en) * 2021-12-30 2023-04-28 四川大学 Dam blocking and dam bursting disaster chain mode identification method based on landslide river blocking form
CN115632845B (en) * 2022-10-12 2023-12-05 南京联创数字科技有限公司 Scenic spot algorithm application risk assessment method based on risk scoring card
CN115619078B (en) * 2022-10-25 2023-06-02 广东工业大学 Method and device for predicting hazard risk level of small animals in transformer substation
CN116362552B (en) * 2023-05-31 2023-09-05 江西省水利科学院(江西省大坝安全管理中心、江西省水资源管理中心) Method for evaluating safety risk level of small reservoir
CN116894588A (en) * 2023-06-05 2023-10-17 中国科学院地理科学与资源研究所 Big data-based intelligent grain supply and demand management method, system and medium
CN116975401A (en) * 2023-09-19 2023-10-31 杭州美创科技股份有限公司 Database field identification method, device, computer equipment and storage medium
CN117370827A (en) * 2023-12-07 2024-01-09 飞特质科(北京)计量检测技术有限公司 Fan quality grade assessment method based on deep clustering model

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480341A (en) * 2017-07-21 2017-12-15 河海大学 A kind of dam safety comprehensive method based on deep learning
CN108615114A (en) * 2018-04-28 2018-10-02 中国电建集团昆明勘测设计研究院有限公司 A kind of reservoir dam safety risk estimating method based on bowknot model
CN112381309A (en) * 2020-11-23 2021-02-19 珠江水利委员会珠江水利科学研究院 Reservoir dam safety monitoring and early warning method, device and system and storage medium
CN112396305A (en) * 2020-11-10 2021-02-23 中国电力建设股份有限公司 Method for determining risk level of dam of cascade reservoir group
CN112949181A (en) * 2021-03-02 2021-06-11 国能大渡河枕头坝发电有限公司 Early warning prediction method of multi-source associated data, storage medium and electronic equipment
CN112948932A (en) * 2021-03-05 2021-06-11 广西路桥工程集团有限公司 Surrounding rock grade prediction method based on TSP forecast data and XGboost algorithm
CN112949900A (en) * 2021-01-18 2021-06-11 水利部交通运输部国家能源局南京水利科学研究院 Reservoir dam safety information intelligent perception fusion early warning method and terminal equipment
CN113139570A (en) * 2021-03-05 2021-07-20 河海大学 Dam safety monitoring data completion method based on optimal hybrid valuation
WO2021148966A1 (en) * 2020-01-23 2021-07-29 Novartis Ag A computer-implemented system and method for outputting a prediction of an exacerbation and/or hospitalization of asthma

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
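Research on dam displacement prediction based on RFE-RF-XGBoost; Wang Xinyu et al.; Journal of Northeast Normal University (Natural Science Edition); Vol. 53, No. 2; pp. 60-66 *
XGBoost-based potential risk assessment and prediction for reservoir dam infrastructure; Ding Wei et al.; Yangtze River (人民长江); Vol. 54, No. 4; pp. 241-246 *
Research progress on key technologies for dam risk assessment and management; Sheng Jinbao et al.; Scientia Sinica Technologica; Vol. 48, No. 10; pp. 1057-1067 *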

Also Published As

Publication number Publication date
CN113807570A (en) 2021-12-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant