CN111950585A

CN111950585A - XGboost-based underground comprehensive pipe gallery safety condition assessment method

Info

Publication number: CN111950585A
Application number: CN202010604912.5A
Authority: CN
Inventors: 岑健; 胡联粤; 刘溪; 伍银波; 熊建斌
Original assignee: Guangdong Polytechnic Normal University
Current assignee: Guangdong Polytechnic Normal University
Priority date: 2020-06-29
Filing date: 2020-06-29
Publication date: 2020-11-17

Abstract

The invention relates to an XGBoost-based method for assessing the safety condition of an underground comprehensive pipe gallery, belonging to the field of underground comprehensive pipe gallery safety, and comprising the following steps: S1, acquiring original data samples of the underground pipe gallery; S2, preprocessing the original data samples; S3, performing feature construction on the preprocessed data to obtain a feature-constructed data set; S4, performing feature selection on the feature-constructed data set to obtain a data sample set; and S5, inputting the data sample set into the XGBoost algorithm model for model training to obtain a safety condition evaluation model, and determining the safety state of the underground pipe gallery according to the output result. The method preprocesses the collected data according to the characteristics of the underground pipe gallery and applies machine learning, so that the safety condition of the underground pipe gallery is judged more accurately and intelligently, management efficiency is improved, and management cost is reduced.

Description

XGboost-based underground comprehensive pipe gallery safety condition assessment method

Technical Field

The invention relates to the field of safety of underground comprehensive pipe galleries, in particular to an XGboost-based underground comprehensive pipe gallery safety condition assessment method.

Background

As an urban lifeline, the utility tunnel gathers many kinds of pipeline facilities. The system is large and complex and carries many potential safety hazards whose mechanisms of occurrence are complicated and changeable, and accidents have a large impact, so safety assurance is a problem that cannot be neglected in future construction and operation. A large amount of data generated during operation of the underground comprehensive pipe gallery is currently not utilized, which wastes management resources.

The safety condition of the underground comprehensive pipe gallery is very important. It is related to multiple factors such as the environment in the pipe gallery (CO₂ concentration, temperature, humidity, CH₄ concentration) and the operation and maintenance condition of the pipe gallery (equipment service life, equipment maintenance frequency, equipment purchase time), but at the present stage the safety condition assessment of the utility tunnel is too subjective and not sufficiently scientific.

At present, many commercial automatic monitoring and control systems for the safety equipment of underground comprehensive pipe galleries are simply industrial safety monitoring systems with some equipment upgraded, applied directly to the working environment of the underground comprehensive pipe gallery. As a result, many safety problems exist:

(1) most of the traditional distributed monitoring adopts a manual management mode, the operation and management cost is high, and the management level and quality can not be effectively ensured;

(2) the monitoring and processing system has large demand on data acquisition capacity, is easy to cause congestion in the information data transmission process, and has poor capability of analyzing and judging disaster situations;

(3) monitoring videos and monitoring data are managed in many different ways, and information silos exist between systems, which makes decision making difficult for the relevant departments.

Disclosure of Invention

Aiming at the defects, the invention establishes a model for evaluating the safety condition of the underground comprehensive pipe gallery and provides an XGboost-based underground comprehensive pipe gallery safety condition evaluation method.

In order to achieve the above purpose, the invention provides the following technical scheme:

An XGBoost-based underground comprehensive pipe gallery safety condition assessment method comprises the following steps:

S1, acquiring original data samples of the underground pipe gallery, wherein the original data samples comprise human factor data, equipment factor data and environmental factor data;

S2, preprocessing the original data samples, wherein the preprocessing comprises missing value processing and abnormal data processing;

S3, performing feature construction on the preprocessed data to obtain a feature-constructed data set;

S4, performing feature selection on the feature-constructed data set, wherein the feature data after feature selection form a data sample set;

and S5, inputting the data sample set into a pre-trained XGBoost algorithm model, and determining the safety state of the underground pipe gallery according to the output result.

The human factor data comprise operational compliance, health condition, handling of dangerous objects, and degree of damage to pipelines and equipment.

The equipment factor data comprises the reliability degree of a combined part, the safety coefficient of parts, the design rationality, the environmental adaptability of the equipment, equipment safety devices, the failure rate of the equipment, the periodic inspection rate and the maintenance rate.

The environmental factor data types include temperature, humidity, water level, oxygen concentration, hydrogen sulfide concentration, methane concentration, and carbon monoxide concentration.

The original data samples are divided into category feature data and numerical feature data, and missing value processing means that missing numerical features are filled with the mean and missing category features are filled with the mode.
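The filling rule can be illustrated with a minimal sketch. The function name `impute`, the column names and the sample values are hypothetical and are not taken from the patent; missing values are assumed to be represented as `None`.

```python
from statistics import mean, mode

def impute(rows, numeric_cols, category_cols):
    """Fill None entries: column mean for numerical features, column mode for category features."""
    filled = [dict(r) for r in rows]  # copy so the original samples stay untouched
    for col in numeric_cols:
        fill = mean(r[col] for r in rows if r[col] is not None)
        for r in filled:
            if r[col] is None:
                r[col] = fill
    for col in category_cols:
        fill = mode(r[col] for r in rows if r[col] is not None)
        for r in filled:
            if r[col] is None:
                r[col] = fill
    return filled

# Hypothetical samples: one numerical feature and one category feature.
samples = [
    {"temperature": 20.0, "damage": "no damage"},
    {"temperature": None, "damage": "slight damage"},
    {"temperature": 24.0, "damage": None},
    {"temperature": 22.0, "damage": "no damage"},
]
clean = impute(samples, ["temperature"], ["damage"])
```

The fill values are computed only from the observed entries of each column, so the result does not depend on the order in which the columns are processed.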

Abnormal data processing means that an isolation forest outlier detection algorithm is used to identify outliers in the original data samples, and the outliers are deleted.

The feature construction comprises the following steps:

A. performing numerical conversion on the category feature data;

B. performing a binning operation on the numerical feature data, and performing feature combination.

In step S4, a Filter method is used to perform feature selection on the feature-constructed data set to obtain the data sample set.

The method further comprises: acquiring the hyperparameters of the XGBoost algorithm model by using a Bayesian optimization algorithm, the hyperparameters being used to optimize the XGBoost algorithm model.

An XGboost-based underground comprehensive pipe gallery safety condition evaluation system comprises at least one processor and a memory which is in communication connection with the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described method.

Compared with the prior art, the invention has the beneficial effects that:

(1) A method for evaluating the safety condition of the underground comprehensive pipe gallery based on the isolation forest model is established. After the relevant data of the underground pipe gallery are collected, preprocessing fills missing numerical features with the mean and missing category features with the mode. After preprocessing, combined features and descriptive features are constructed, the numerical feature data are binned, and feature selection is performed. Finally, the selected features are input to the XGBoost algorithm model for training or actual measurement, and the data analysis determines whether each part of the underground pipe gallery is in a safe or an unsafe state. The method preprocesses the collected data according to the characteristics of the underground pipe gallery and applies machine learning, so that the safety data of the underground pipe gallery can be judged more accurately and intelligently, management efficiency is improved, and management cost is reduced.

(2) In the method, missing values in the acquired data are filled and category features are constructed by cross combination, so that the extracted feature data are finer. The method therefore assesses the safety data of the underground comprehensive pipe gallery with high accuracy: the AUC can reach 95.99% and the F1-score can reach 91.53%.

(3) In existing systems, monitoring videos and monitoring data are managed in many different ways and information silos exist between systems, which makes decision making difficult for the relevant departments. The method of the invention preprocesses the data of every part of the underground pipe gallery uniformly and thereby fuses the data of the subsystems, so that the data show the environmental condition of the underground pipe gallery as a whole. All the data are passed directly into the model, and because the model is trained on a large amount of data, its accuracy in actual measurement is higher. Uniform processing of the various data removes the information silos between systems; the model outputs whether the pipe gallery is in a safe state, and the result can provide data support and a theoretical basis for the decisions of the relevant departments.

Description of the drawings:

Fig. 1 is a flow chart of the XGBoost-based underground comprehensive pipe gallery safety condition evaluation method of the present invention.

Detailed Description

The present invention will be described in further detail with reference to test examples and specific embodiments. It should be understood that the scope of the above-described subject matter is not limited to the following examples, and any techniques implemented based on the disclosure of the present invention are within the scope of the present invention.

Example 1

An XGBoost-based underground comprehensive pipe gallery safety condition assessment method, whose flow chart is shown in Fig. 1, comprises the following steps:

S1, acquiring data samples.

Data samples are acquired in the underground comprehensive pipe gallery. The data comprise human factor data, equipment factor data and environmental factor data, as shown in Tables 1-3. Positive and negative samples are determined according to the safety condition in the pipe gallery: data recorded in a safe state are positive samples, and data recorded in an unsafe state are negative samples.

TABLE 1 human factor data types and corresponding results

TABLE 2 device factor data types and corresponding results

TABLE 3 environmental factor data types and corresponding results

S2, preprocessing the data.

The preprocessing of the data comprises missing value processing and abnormal data processing.

1. Missing value processing: missing values are filled with the mean or the mode of the feature column that contains them; missing numerical features are filled with the mean, and missing category features are filled with the mode. The numerical features include temperature, humidity, water level, oxygen concentration, hydrogen sulfide concentration, methane concentration, carbon monoxide concentration and the like; the mean of each feature is computed and used to fill its missing data. The category feature data include operational compliance, health condition, handling of dangerous objects, degree of damage to pipelines and equipment, reliability of combined parts, safety factor of parts, design rationality, environmental adaptability of the equipment and the like; the value that occurs most frequently in the data (the mode) is determined and used to fill the missing data. Missing value processing is not limited to mean or mode filling.

2. Abnormal data processing: outliers are identified with the isolation forest outlier detection algorithm and deleted.

a) Isolation Forest is an unsupervised anomaly detection method proposed by Fei Tony Liu et al. in 2008 and widely applied to continuous data. It does not rely on any distance or density measure, which removes much of the complexity and computational cost of distance-based and density-based methods. In an isolation forest, anomalies are defined as "outliers that are easily isolated", i.e., points that are sparsely distributed and lie far from high-density groups. Regions where points are sparse are regions in which data occur with low probability, so data falling in these regions can be considered abnormal. Isolation forest has linear time complexity with a low constant and low memory requirements, scales well, can quickly process large and high-dimensional data sets, and is widely used in industry.

b) The isolation forest algorithm is an unsupervised learning algorithm mainly used to detect abnormal data. Before the isolation forest (hereinafter iForest) can be understood, the isolation tree (hereinafter iTree) must be understood. An iTree is a random binary tree in which each node either has two child nodes (a left subtree and a right subtree) or is a leaf with no child nodes. To build an iTree for a multidimensional data set D, an attribute a and a split value v of that attribute are first selected at random. The samples are then partitioned by their value of attribute a: samples whose value is smaller than v are placed in the left subtree, and samples whose value is greater than or equal to v are placed in the right subtree. The subtrees are then built recursively by the same rule until the height of the tree reaches a threshold, or the data subset contains only one sample or only identical samples.

c) After the iTree is built, data can be evaluated: a sample is passed down from the root node of the iTree until it reaches a leaf node. Let h(x) be the path length from the root node to that leaf. Because outliers are sparse, they are quickly separated into leaf nodes, so h(x) is short for them. h(x) can therefore be used to judge whether a sample is an outlier. Before judging, the path length h(x) is normalized so that the scores of all samples are on the same scale, which facilitates comprehensive evaluation and comparison. For a data set of n samples, the average path length c(n) of a tree is:

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n} \qquad (1)$$

d) where H(i) is the harmonic number, which can be estimated as ln(i) + 0.5772156649 (the Euler-Mascheroni constant). c(n) is the average path length for the given number of samples and is used to normalize the path length h(x); E(h(x)) is the average of h(x) for sample x over all iTrees. The anomaly score of sample x is defined as:

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}} \qquad (2)$$

e) As can be seen from equation (2), when E(h(x)) approaches c(n), the anomaly score approaches 0.5 and it cannot be determined whether the sample is abnormal. When E(h(x)) approaches 0, the anomaly score approaches 1 and the sample is judged to be an outlier. When E(h(x)) approaches n-1, the anomaly score approaches 0 and the sample is judged to be normal. Because each tree is generated by randomly selecting attributes, a single tree is unstable and highly random; combining multiple iTrees into a forest gives more accurate and stable results. When the iForest is constructed, each new iTree is built on a randomly sampled subset of the data, which ensures that the iTrees differ from one another. In addition, if the data set is too large, regions where outliers are numerous and dense may be judged as normal, so the size of the random sample is limited when each tree is built. After the iForest is constructed, the anomaly score of each sample is calculated with equation (2), and the samples whose scores mark them as abnormal are deleted.
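The limiting behaviour of the anomaly score described in item e) can be checked numerically. The sketch below assumes the standard Isolation Forest definitions c(n) = 2H(n-1) - 2(n-1)/n and s(x, n) = 2^(-E(h(x))/c(n)); the function names and the guards for n <= 2 are illustrative assumptions, and the sketch only scores a given mean path length rather than building the trees.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def harmonic(i):
    """Approximate the harmonic number H(i) as ln(i) + gamma."""
    return math.log(i) + EULER_GAMMA

def average_path_length(n):
    """c(n): average path length used to normalize h(x) for n samples."""
    if n <= 1:
        return 0.0  # assumed guard: a single sample needs no split
    if n == 2:
        return 1.0  # assumed guard: two samples are separated by one split
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2 ** (-E(h(x)) / c(n)); values near 1 indicate outliers."""
    return 2.0 ** (-mean_path_length / average_path_length(n))

n = 256  # hypothetical sub-sample size per tree
c = average_path_length(n)
score_undecided = anomaly_score(c, n)    # E(h(x)) equal to c(n)
score_outlier = anomaly_score(0, n)      # E(h(x)) approaching 0
score_normal = anomaly_score(n - 1, n)   # E(h(x)) approaching n - 1
```

The three scores reproduce the cases in the text: 0.5 when the mean path length equals c(n), 1 for a path length of 0, and a value close to 0 for a path length of n-1.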

S3, feature construction.

The features are classified into category features and numerical features according to the types of the feature values.

Class characteristics: for example, the characteristic of the damage degree of the pipeline and the equipment includes no damage, slight damage, moderate damage, serious damage and the like. This non-numerical data cannot be directly put into the model, and needs to be converted into numerical data to be recognized by the computer. The number of feature values of such features is generally limited and can be enumerated one by one, so that such features can also be called discrete features. If the set of feature values of a feature is [1,2,3,4], such feature can be considered to be derived from the non-numeric feature after encoding, and thus classified as a class feature.

Numerical features: in the present embodiment, the numerical features include temperature, humidity, water level, oxygen concentration, hydrogen sulfide concentration, methane concentration, carbon monoxide concentration, and the like. The temperature feature, for example, is represented by a range and may take any value within it, so the number of possible values is infinite. Such a feature may also be called a continuous feature.

The feature construction comprises the following steps:

Step one: performing numerical conversion on the category feature data.

Common ways of processing category features include one-hot encoding, mean encoding, label encoding and the like. The method of the invention adopts one-hot encoding. One-hot encoding represents a categorical variable as a binary vector: each category value is first mapped to an integer, and each integer is then represented as a binary vector that is all zeros except at the index of the integer, which is set to 1. For example, the binary codes corresponding to ["no damage", "slight damage", "moderate damage", "slight damage"] are [[1,0,0], [0,1,0], [0,0,1], [0,1,0]]. The other category features in the collected data, for example operational compliance, health condition, handling of dangerous objects, reliability of combined parts, safety factor of parts, design rationality and maintenance convenience, are converted to numerical values in the same way in preparation for the subsequent feature selection.
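The damage-degree example can be reproduced with a short sketch. The helper name `one_hot` is hypothetical, and mapping categories to integers in order of first appearance is an assumption; the text does not state how the integer mapping is chosen.

```python
def one_hot(values):
    """Map each category to an integer (order of first appearance), then to a binary vector."""
    index = {}
    for v in values:
        index.setdefault(v, len(index))  # assign the next free integer to a new category
    width = len(index)
    return [[1 if index[v] == j else 0 for j in range(width)] for v in values]

damage = ["no damage", "slight damage", "moderate damage", "slight damage"]
encoded = one_hot(damage)
```

Each vector has exactly one entry set to 1, and repeated categories such as "slight damage" receive the same vector.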

Step two: feature combinations are performed on the continuous type features (numerical type features).

The method comprises the following specific steps:

1. Feature expansion is performed using statistical methods. Expansion here means that, in addition to the acquired data themselves, new descriptive feature values computed from the acquired data are added to the sample, for example the count, maximum, variance, mean, skewness or sum of the acquired numerical feature values, so that the data sample becomes richer. The feature combination and binning of numerical features are described in detail below, taking temperature data as an example.

For example, as shown in Table 4, the electric power cabin has three temperature sensors and three humidity sensors. The values of the three temperature sensors of the electric power cabin can be averaged, which yields new feature values for the electric power cabin: the average temperature and the number of temperature samples. The gas cabin and the heating power cabin can be treated in the same way. The resulting average temperature and average humidity of the three cabins are shown in Table 5.

TABLE 4 temperature and humidity values detected by sensors in electric power compartment, gas compartment and heating power compartment

TABLE 5 average temperature and average humidity of electric, gas and thermal compartments

Cabin	Average temperature	Average humidity
Electric power cabin	19	47
Gas cabin	20	46
Heating power cabin	21	45

2. Combined features can also be generated by performing logical operations on the features of the same cabin.

The utility tunnel contains a variety of cabins. Logical operations can be performed on features of the same kind within the same cabin to generate new features that reflect the trend of change or the state of the cabin. For example, the temperature features can be grouped by cabin, and the difference between the maximum and minimum temperature in each cabin can be calculated as a new combined feature.

The generated new descriptive characteristics and the generated combined characteristics increase data dimensionality, and the generated new combined characteristic data can reflect hidden information of the underground comprehensive pipe gallery.
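The two kinds of constructed features can be sketched together. The sensor readings below are hypothetical (the values of Table 4 are not reproduced here), and the function name `describe` and the choice of population variance are assumptions made for illustration.

```python
from statistics import mean, pvariance

def describe(readings):
    """Descriptive features (count, mean, max, variance) plus a combined feature (max - min)."""
    return {
        "count": len(readings),
        "mean": mean(readings),
        "max": max(readings),
        "variance": pvariance(readings),
        "range": max(readings) - min(readings),  # combined feature: temperature spread
    }

# Hypothetical readings of three temperature sensors per cabin.
temperatures = {
    "electric power cabin": [18.0, 19.0, 20.0],
    "gas cabin": [19.5, 20.0, 20.5],
}
features = {cabin: describe(values) for cabin, values in temperatures.items()}
```

Grouping by cabin before computing the statistics keeps each new feature tied to one cabin, which is what lets the spread reflect the state of that cabin.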

After numerical conversion of the category feature data and feature combination of the numerical features, the data samples can be used for model training or input into a model for actual measurement. However, although the added features can improve the accuracy of the analysis, the larger amount of data reduces processing efficiency, so a binning operation is performed. The binning operation comprises binning the category feature data and binning the numerical features.

S4, performing a binning operation on the continuous features.

To bin a continuous feature, all data of that feature in the data samples are collected and their value range (the interval from the minimum to the maximum) is found. The width of the range is divided by the number of bins to obtain the bin width, and the range is split into N equal-width bins, which gives the value range of each bin. Each value of the feature is then assigned to the bin whose range contains it and is replaced by the feature value of that bin. For example:

Suppose the temperature range of a cabin is [20, 35] and it is to be divided into three bins. The bin width is (35 - 20)/3 = 5, so the first bin covers [20, 25), the second [25, 30) and the third [30, 35]. The binned temperature feature is set to 1 for [20, 25), 2 for [25, 30) and 3 for [30, 35]. A temperature of 26.5482 falls into the second bin and is therefore converted to 2. Binning thus replaces nearby values with a common value, which amounts to a classification. The binning operation is not limited to the temperature feature. The number of bins determines how fine the classification is: with fewer bins there are fewer classes, the data are approximated over wider ranges, and the amount of calculation falls further. With more bins there are more classes, the data are approximated over narrower ranges, and the amount of calculation is still reduced, but by less. The binning can be designed according to the needs of the data analysis, balancing the information content of the data against the amount of calculation.
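The temperature example can be written as a small equal-width binning function. This is a sketch under the assumptions of the example (1-based bin values, last bin closed on the right); the function name `equal_width_bin` is hypothetical.

```python
def equal_width_bin(value, low, high, n_bins):
    """Return the 1-based index of the equal-width bin of [low, high] that contains value."""
    if not low <= value <= high:
        raise ValueError("value lies outside the binning range")
    width = (high - low) / n_bins
    index = int((value - low) // width) + 1
    return min(index, n_bins)  # the upper bound belongs to the last bin

binned = equal_width_bin(26.5482, 20, 35, 3)
```

With the range [20, 35] and three bins, the temperature 26.5482 is mapped to bin 2, as in the text; the boundary value 25 opens the second bin and 35 stays in the third.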

S5, feature selection.

The purpose of feature selection is to reduce the number of features and the dimensionality of the data, so that the generalization capability of the model is stronger, overfitting is reduced, the time of model training is reduced, the complexity of the model is reduced, and the effect of the model is improved. The invention uses a Filter to select the features.

Feature selection with a Filter: in a Filter method, each feature dimension is "scored", i.e., given a weight that represents the importance of the feature, and the features are then ranked by weight. Feature selection is performed first and the learner is trained afterwards, so the selection process is independent of the learner. This is equivalent to filtering the features and then training the classifier on the feature subset. One specific implementation is variance filtering, which screens features by their own variance. If the variance of a feature is small, the samples hardly differ in that feature; most of its values, or even all of them, may be identical, and the feature does nothing to distinguish samples. Whatever feature engineering follows, features with a variance of 0 should therefore be eliminated first. Given a variance threshold, all features whose variance does not exceed the threshold are discarded; with a threshold of 0, this deletes the features that have the same value in every record.
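Variance filtering can be sketched in a few lines. The feature names and values are hypothetical, and keeping only features whose population variance is strictly greater than the threshold is an assumption chosen so that a threshold of 0 removes constant features.

```python
from statistics import pvariance

def variance_filter(columns, threshold=0.0):
    """Keep only the features whose population variance exceeds the threshold."""
    return {name: values for name, values in columns.items()
            if pvariance(values) > threshold}

# Hypothetical feature columns; "constant_flag" has the same value in every record.
columns = {
    "temperature_bin": [1, 2, 3, 2],
    "constant_flag": [1, 1, 1, 1],
    "humidity_bin": [2, 2, 3, 2],
}
selected = variance_filter(columns)
```

Because the score of each feature is computed from the feature alone, the filter runs before any learner is trained, which matches the independence property described above.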

As can be seen from Table 7, the original feature data of the training set have dimension (6700, 48). After the constructed features are added, the dimension of the training set increases to (6700, 256), and after Filter feature selection it decreases to (6700, 106). The influence of the constructed features on the model is shown in Table 6. Although the trained algorithms differ, feature construction increases the time consumed and the memory occupied, while the AUC and F1-Score evaluation indexes both improve markedly.

TABLE 6 Classification Effect before and after feature construction

TABLE 7 feature sizes before and after feature engineering

TABLE 8 Classification Effect before and after feature selection Using Filter

Comparing the classification effect before and after Filter feature selection shows that the time consumed and the memory occupied are greatly reduced, and that the two evaluation indexes, AUC and F1-Score, are clearly improved. Feature engineering through feature construction, binning and feature selection therefore improves data processing efficiency and improves the model to a certain extent.

S6, training the model.

First, the data set is partitioned.

The data samples after feature selection are divided into a training set and a test set.

Suppose the original data set is a table of 10000 samples with several features, each sample carrying a label that indicates whether it is a positive or a negative example. Before training, the data are split into a training set and a test set at a ratio of about 7:3, giving 6700 training samples and 3300 test samples. The training set contains samples and their labels for model training, and the test set contains samples and their labels for model verification.
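The partition into 6700 training samples and 3300 test samples can be sketched as a shuffled split. The function name `split_samples`, the fixed seed and the placeholder data are assumptions; the text does not say whether the split is random or stratified.

```python
import random

def split_samples(samples, n_train, seed=0):
    """Shuffle a copy of the samples and cut it into a training set and a test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder data set: (sample id, label) pairs standing in for the 10000 samples.
data = [(i, i % 2) for i in range(10000)]
train_set, test_set = split_samples(data, n_train=6700)
```

Every sample lands in exactly one of the two sets, so the test set measures the model on data it has not seen during training.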

Next, model training is performed.

The feature variables obtained by feature selection are taken as input variables, and an XGBoost model is constructed and trained with the XGBoost algorithm.

XGBoost is short for Extreme Gradient Boosting and is one of the Boosting algorithms. The basic idea of Boosting is to combine many weak learners into one strong learner. The weak learners are trained one after another: each new learner is fitted to the errors left by the learners already trained, and the final prediction is obtained by adding up the outputs of all learners.

XGBoost is a boosted tree model in which a plurality of base models are fused into a classifier with the best effect. The base model of the XGBoost algorithm can be customized, and a CART tree is generally used.

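The XGBoost objective function is defined as:

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{t} \Omega(f_t), \qquad \Omega(f) = \gamma k + \frac{1}{2} \lambda \sum_{j=1}^{k} \omega_j^2 \qquad (3)$$

The objective function consists of two parts: the first is the loss function, and the second is the regularization term on the leaf nodes. Here k is the number of leaf nodes, ω_j is the score of leaf node j, and γ and λ are regularization coefficients. The regularization term keeps the leaf scores from becoming too large and thereby prevents overfitting, and each f(x) is one regression tree.

In the XGBoost model, each newly generated tree must fit the residual of the prediction of the previous round, i.e., when the t-th tree is generated, the prediction score can be written as:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) \qquad (4)$$

The objective function at this time is:

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + C \qquad (5)$$

where C collects the regularization terms of the earlier trees, which do not depend on f_t.

XGBoost performs a second-order Taylor expansion of the loss function as follows:

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + C \qquad (6)$$

where g_i is the first derivative and h_i is the second derivative of the loss with respect to the previous prediction:

$$g_i = \frac{\partial\, l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}} \qquad (7)$$

$$h_i = \frac{\partial^2 l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2} \qquad (8)$$

The formula sums the loss over all samples. Because every sample finally falls into exactly one leaf node, the samples of the same leaf node can be grouped together. Dropping the terms that do not depend on f_t and writing I_j for the set of samples in leaf node j gives:

$$Obj^{(t)} \approx \sum_{j=1}^{k} \left[ \left(\sum_{i \in I_j} g_i\right) \omega_j + \frac{1}{2} \left(\sum_{i \in I_j} h_i + \lambda\right) \omega_j^2 \right] + \gamma k \qquad (9)$$

Formula (9) shows that the objective function is a quadratic function of one variable in each leaf node score ω_j, so the optimal ω and the objective function value are easy to solve. With G_j = Σ_{i∈I_j} g_i and H_j = Σ_{i∈I_j} h_i, the optimal ω and objective function are:

$$\omega_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad Obj^{*} = -\frac{1}{2} \sum_{j=1}^{k} \frac{G_j^2}{H_j + \lambda} + \gamma k \qquad (10)$$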
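The closed-form solution for a single leaf can be checked with a small numerical sketch. It assumes the standard XGBoost result, in which the optimal leaf score is -G/(H + λ) and the leaf contributes -G²/(2(H + λ)) to the objective, where G and H are the sums of the first and second derivatives over the samples in the leaf and λ is the L2 regularization coefficient. The function names and the example values are hypothetical, and the squared-error loss l = (y - ŷ)²/2 (so that g = ŷ - y and h = 1) is chosen only to keep the arithmetic simple.

```python
def leaf_weight(grads, hessians, lam):
    """Optimal leaf score: -G / (H + lambda)."""
    G, H = sum(grads), sum(hessians)
    return -G / (H + lam)

def leaf_objective(grads, hessians, lam):
    """Contribution of one leaf to the minimized objective: -G^2 / (2 (H + lambda))."""
    G, H = sum(grads), sum(hessians)
    return -0.5 * G * G / (H + lam)

# Three samples in one leaf, previous prediction 0 for all of them.
targets = [1.0, 2.0, 3.0]
previous = [0.0, 0.0, 0.0]
grads = [p - y for p, y in zip(previous, targets)]  # g_i = y_hat - y for squared error
hessians = [1.0] * len(targets)                     # h_i = 1 for squared error

w_regularized = leaf_weight(grads, hessians, lam=1.0)
w_plain = leaf_weight(grads, hessians, lam=0.0)
obj_regularized = leaf_objective(grads, hessians, lam=1.0)
```

Without regularization the leaf score equals the mean residual (2.0); with λ = 1 it shrinks to 1.5, which illustrates how the regularization term keeps leaf scores from growing too large.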

The main XGBoost parameters that need to be adjusted are shown in Table 11 below:

Table 11 Main XGBoost parameters to be adjusted

As a specific embodiment of the invention, 10,000 data samples form a data set (each sample comprises 56 feature values). The data set is divided into a training set of 6,700 samples and a validation set of 3,300 samples. The feature values of the 6,700 training samples are input into the model together with manually preset labels, and the model is trained to obtain a trained model. The 3,300 validation samples are then input into the trained model, which outputs a label value for each sample. The output label values are scored by comparing them with the original labels of the validation set; the higher the score, the better the model is trained. A scoring threshold is set: if the score during validation is higher than the threshold, model training is finished; otherwise, model training continues.

In actual use, each piece of state data acquired in real time (comprising 56 feature values) is input into the trained model, which outputs the corresponding label value. If the result is 0, the underground pipe gallery is considered safe; if the result is 1, it is considered unsafe.
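The split-and-score procedure of this embodiment can be sketched as follows. This is a minimal illustration only: the data are synthetic, `stand_in_model` is a hypothetical placeholder for the trained XGBoost model, accuracy is assumed as the score, and the threshold of 0.9 is an assumed value that the document does not specify.

```python
import random

def split_dataset(samples, n_train, seed=0):
    """Shuffle a copy of the samples and split it into training and validation sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def accuracy(predicted, actual):
    """Fraction of predicted labels that match the original labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

def training_finished(score, threshold):
    """Training stops once the validation score exceeds the threshold."""
    return score > threshold

# Synthetic stand-in data (assumption): 10,000 samples, 56 features, label 0 or 1
rng = random.Random(42)
samples = []
for _ in range(10000):
    features = [rng.random() for _ in range(56)]
    label = 1 if features[0] > 0.5 else 0
    samples.append((features, label))

train_set, val_set = split_dataset(samples, n_train=6700)

def stand_in_model(features):
    """Hypothetical placeholder for the trained model's prediction."""
    return 1 if features[0] > 0.5 else 0

predicted = [stand_in_model(f) for f, _ in val_set]
actual = [y for _, y in val_set]
score = accuracy(predicted, actual)
done = training_finished(score, threshold=0.9)
print(len(train_set), len(val_set), score, done)
```

In a real pipeline the placeholder would be replaced by the model fitted on `train_set`, and training would be repeated with adjusted settings whenever `done` is false.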

And S5, optimizing the model.

For the model to perform well, a good set of hyperparameters must be found quickly. Searching for and tuning hyperparameters is tedious but important. A common approach is to tune manually first to get a rough idea of how the hyperparameters affect model performance, and then to use grid search or random search to find a better set of hyperparameters. However, grid search usually consumes a large amount of computing resources and spends much time and effort evaluating the search space, without quickly and accurately locating the region that contains the optimum. The hyperparameters are therefore optimized with a Bayesian optimization algorithm, an optimization method based on a Bayesian model that searches for the minimum of an objective function. The Bayesian optimization algorithm mainly comprises three steps:

(1) A prior function is selected to express the assumptions about the function to be optimized.

(2) The hyperparameters of the prior are obtained by maximum likelihood estimation, that is, by maximizing the marginal likelihood.

(3) An acquisition function is chosen based on the optimized hyperparameters. The acquisition function builds a utility function from the posterior model to determine the next sample point. It balances sampling at points where the modeled objective function is low against exploring regions that have not yet been sampled.
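The three steps above can be sketched in pure Python as follows. This is a simplified illustration, not the procedure actually used in this document: the Gaussian-process prior with a squared-exponential kernel, the expected-improvement acquisition function, the one-dimensional search over [0, 1] and the toy objective are all assumptions. Step (2) is also skipped, so the kernel length scale is fixed instead of being fitted by maximum likelihood.

```python
import math

LENGTH_SCALE = 0.2  # fixed kernel width (assumption; step 2 would fit this)
JITTER = 1e-6       # small diagonal term for numerical stability

def rbf(a, b):
    """Squared-exponential kernel: the Gaussian-process prior of step (1)."""
    return math.exp(-0.5 * ((a - b) / LENGTH_SCALE) ** 2)

def cholesky(A):
    """Lower-triangular L with A = L L^T for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def chol_solve(L, b):
    """Solve (L L^T) x = b by forward, then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def expected_improvement(mean, std, best):
    """Acquisition function of step (3), written for minimization."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf

def bayes_minimize(objective, n_iter=15, grid_size=201):
    """Minimize a 1-D objective on [0, 1] by picking the grid point with the highest EI."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    xs = [0.0, 0.5, 1.0]                       # initial sample points
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        K = [[rbf(a, b) + (JITTER if i == j else 0.0)
              for j, b in enumerate(xs)] for i, a in enumerate(xs)]
        L = cholesky(K)
        alpha = chol_solve(L, ys)              # K^-1 y, reused for every candidate
        best = min(ys)

        def acquisition(x):
            k_star = [rbf(a, x) for a in xs]
            mean = sum(k * a for k, a in zip(k_star, alpha))
            v = chol_solve(L, k_star)
            var = 1.0 - sum(k * vi for k, vi in zip(k_star, v))
            return expected_improvement(mean, math.sqrt(max(var, 1e-12)), best)

        candidates = [x for x in grid if x not in xs]
        x_next = max(candidates, key=acquisition)
        xs.append(x_next)
        ys.append(objective(x_next))
    i_best = ys.index(min(ys))
    return xs[i_best], ys[i_best]

def toy_validation_loss(x):
    """Hypothetical stand-in for the validation loss as a function of one hyperparameter."""
    return (x - 0.3) ** 2

best_x, best_loss = bayes_minimize(toy_validation_loss)
print(best_x, best_loss)
```

In practice the objective would be the validation loss of an XGBoost model trained with the candidate hyperparameters, and the search would cover several hyperparameters at once rather than a single normalized value.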

After the XGBoost hyperparameters were tuned with the Bayesian algorithm, the resulting model performed best. The optimal XGBoost hyperparameters finally obtained are shown in Table 12.

Table 12 Optimal parameters of the XGBoost model selected by the Bayesian algorithm

Models were constructed with the DT, LR, NC, SVM, GBDT and XGBoost algorithms; their classification results before and after optimization with the Bayesian algorithm are shown in Table 13.

Table 13 Classification results before and after optimization with the Bayesian algorithm

Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. This manner of description is adopted for clarity only. Those skilled in the art should treat the specification as a whole, and the technical solutions of the embodiments may be combined as appropriate to form other embodiments that those skilled in the art can understand.

Claims

1. An underground comprehensive pipe gallery safety condition assessment method based on XGboost is characterized by comprising the following steps:

2. The XGboost-based underground utility tunnel safety condition assessment method according to claim 1, wherein the human factor data comprise operational norms, health conditions, hazardous material handling conditions, and the degree of damage to pipes and equipment.

3. The XGboost-based underground comprehensive pipe gallery safety condition assessment method according to claim 1, wherein the equipment factor data comprises a combination part reliability degree, a part safety coefficient, design rationality and equipment environment adaptability, equipment safety devices, an equipment failure rate, a periodic inspection rate and a maintenance rate.

4. The XGboost-based underground utility tunnel safety condition assessment method according to claim 1, wherein the environmental factor data types comprise temperature, humidity, water level, oxygen concentration, hydrogen sulfide concentration, methane concentration and carbon monoxide concentration.

5. The XGboost-based underground comprehensive pipe gallery safety condition assessment method according to any one of claims 1 to 4, wherein the data original samples are divided into categorical feature data and numerical feature data, and the missing value processing comprises: filling missing numerical features with the mean and filling missing categorical features with the mode.

6. The XGboost-based underground utility tunnel safety condition assessment method according to claim 5, wherein the abnormal data processing comprises identifying abnormal values in the data original samples by using an isolation forest outlier detection algorithm and deleting the abnormal values.

7. The XGboost-based underground comprehensive pipe gallery safety condition assessment method according to claim 6, wherein the feature construction comprises the following steps:

A. performing numerical value conversion on the category characteristic data;

8. The XGboost-based underground comprehensive pipe gallery safety condition assessment method according to claim 7, wherein in step S4, a feature value data set is subjected to feature selection by using a Filter algorithm to obtain a data sample set.

9. The XGboost-based underground comprehensive pipe gallery safety condition assessment method according to claim 8, further comprising the steps of:

and acquiring the hyperparameter of the XGboost algorithm model by adopting a Bayesian optimization algorithm, wherein the hyperparameter is used for optimizing the XGboost algorithm model.

10. An XGboost-based underground comprehensive pipe gallery safety condition evaluation system is characterized by comprising at least one processor and a memory which is in communication connection with the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.