CN111950585A - XGBoost-based underground comprehensive pipe gallery safety condition assessment method

Publication number
CN111950585A
Authority
CN
China
Prior art keywords
data
xgboost
pipe gallery
feature
safety condition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010604912.5A
Other languages
Chinese (zh)
Inventor
岑健
胡联粤
刘溪
伍银波
熊建斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Polytechnic Normal University
Original Assignee
Guangdong Polytechnic Normal University
Application filed by Guangdong Polytechnic Normal University
Priority to CN202010604912.5A
Publication of CN111950585A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services
    • G06Q50/265 Personal security, identity or safety

Abstract

The invention relates to an XGBoost-based method for assessing the safety condition of an underground comprehensive pipe gallery, belonging to the field of underground comprehensive pipe gallery safety. The method comprises the following steps: S1, acquiring original data samples of the underground pipe gallery; S2, preprocessing the original data samples; S3, performing feature construction on the preprocessed data to obtain a feature-constructed data set; S4, performing feature selection on the feature-constructed data set to obtain a data sample set; and S5, inputting the data sample set into the XGBoost algorithm model for model training to obtain a safety condition evaluation model, and determining the safety state of the underground pipe gallery according to the output result. Because the collected data are preprocessed according to the characteristics of the underground pipe gallery and a machine learning method is adopted, the safety condition of the underground pipe gallery is judged more accurately and intelligently, management efficiency is improved, and management cost is reduced.

Description

XGboost-based underground comprehensive pipe gallery safety condition assessment method
Technical Field
The invention relates to the field of safety of underground comprehensive pipe galleries, in particular to an XGboost-based underground comprehensive pipe gallery safety condition assessment method.
Background
As a lifeline of the city, the utility tunnel gathers many kinds of pipeline facilities. The system is large and complex and carries many potential safety hazards whose mechanisms of occurrence are complicated and changeable. Accidents have a large impact, so safety assurance cannot be neglected in future construction and operation. A large amount of data generated during the operation of the underground comprehensive pipe gallery is currently not utilized, which wastes management resources.
The safety condition of the underground comprehensive pipe gallery is very important. It is related to multiple factors, such as the environment inside the pipe gallery (CO2 concentration, temperature, humidity, CH4 concentration) and the operation and maintenance condition of the pipe gallery (equipment service life, equipment maintenance frequency, equipment purchase time). However, safety condition assessment of the utility tunnel at the present stage is overly subjective and not sufficiently scientific.
At present, many commercial systems on the market for automatically monitoring and controlling the running condition of safety equipment in underground comprehensive pipe galleries are industrial safety monitoring systems with some equipment upgraded, applied directly to the working environment of the underground comprehensive pipe gallery. As a result, many safety problems exist:
(1) most traditional distributed monitoring relies on manual management, so operation and management costs are high, and the management level and quality cannot be effectively ensured;
(2) the monitoring and processing system places a large demand on data acquisition capacity, easily causes congestion during data transmission, and has a poor ability to analyze and judge disaster situations;
(3) monitoring videos and monitoring data are managed in many different ways, and information islands exist between systems, which makes decision-making difficult for the relevant departments.
Disclosure of Invention
In view of the above defects, the invention establishes a model for evaluating the safety condition of the underground comprehensive pipe gallery and provides an XGBoost-based underground comprehensive pipe gallery safety condition assessment method.
To achieve the above purpose, the invention provides the following technical solution:
An XGBoost-based underground comprehensive pipe gallery safety condition assessment method comprises the following steps:
S1, acquiring original data samples of the underground pipe gallery, wherein the original data samples comprise human factor data, equipment factor data and environmental factor data;
S2, preprocessing the original data samples, wherein the preprocessing comprises missing value processing and abnormal data processing;
S3, performing feature construction on the preprocessed data to obtain a feature-constructed data set;
S4, performing feature selection on the feature-constructed data set, wherein the feature data remaining after feature selection form a data sample set;
S5, inputting the data sample set into a pre-trained XGBoost algorithm model, and determining the safety state of the underground pipe gallery according to the output result.
The human factor data comprises operation normative, health condition, dangerous object processing condition and damage degree of pipelines and equipment.
The equipment factor data comprises the reliability degree of a combined part, the safety coefficient of parts, the design rationality, the environmental adaptability of the equipment, equipment safety devices, the failure rate of the equipment, the periodic inspection rate and the maintenance rate.
The environmental factor data types include temperature, humidity, water level, oxygen concentration, hydrogen sulfide concentration, methane concentration, and carbon monoxide concentration.
The original data samples are divided into category feature data and numerical feature data, and missing value processing means that missing numerical features are filled with the mean and missing category features are filled with the mode.
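The mean and mode filling described here can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `impute`, the column names and the sample records are hypothetical and are not taken from the patent.

```python
from statistics import mean, mode

def impute(records, numeric_cols, category_cols):
    """Return copies of the records with None replaced by the column mean or mode."""
    fill = {}
    for col in numeric_cols:
        observed = [r[col] for r in records if r[col] is not None]
        fill[col] = mean(observed)   # numerical feature: mean of observed values
    for col in category_cols:
        observed = [r[col] for r in records if r[col] is not None]
        fill[col] = mode(observed)   # category feature: most frequent value
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in records
    ]

# Hypothetical records; None marks a missing value.
records = [
    {"temperature": 20.0, "damage": "no damage"},
    {"temperature": None, "damage": "slight damage"},
    {"temperature": 22.0, "damage": None},
    {"temperature": 24.0, "damage": "no damage"},
]
filled = impute(records, ["temperature"], ["damage"])
print(filled[1]["temperature"])  # 22.0
print(filled[2]["damage"])       # no damage
```

The original records are left unchanged, so the filled and unfilled versions can be compared when checking how much of the data was imputed.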
Abnormal data processing means that an isolation forest outlier detection algorithm is used to identify abnormal values in the original data samples, and the abnormal values are deleted.
The feature construction comprises the following steps:
A. performing numerical conversion on the category feature data;
B. performing a binning operation on the numerical feature data, and performing feature combination.
In step S4, a Filter algorithm is used to perform feature selection on the feature-constructed data set to obtain the data sample set.
The method further comprises: obtaining the hyperparameters of the XGBoost algorithm model with a Bayesian optimization algorithm, the hyperparameters being used to optimize the XGBoost algorithm model.
An XGboost-based underground comprehensive pipe gallery safety condition evaluation system comprises at least one processor and a memory which is in communication connection with the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described method.
Compared with the prior art, the invention has the beneficial effects that:
(1) A method for evaluating the safety condition of the underground comprehensive pipe gallery based on the isolation forest model is established. After the relevant data of the underground pipe gallery are collected, preprocessing fills missing numerical features with the mean and missing category features with the mode. After preprocessing, combined features and descriptive features are constructed, the numerical feature data are binned, and feature selection is performed. Finally, the selected features are input into the XGBoost algorithm model for training or actual measurement, and the data analysis determines whether each part of the underground pipe gallery is in a safe state or an unsafe state. Because the collected data are preprocessed according to the characteristics of the underground pipe gallery and a machine learning method is adopted, the safety data of the underground pipe gallery can be judged more accurately and intelligently, management efficiency is improved, and management cost is reduced.
(2) In the method, missing values in the acquired data are filled and category features are constructed by cross combination, so the extracted feature data are finer. The method therefore assesses the safety data of the underground comprehensive pipe gallery with high accuracy: the AUC can reach 95.99% and the F1-score can reach 91.53%.
(3) Monitoring videos and monitoring data are managed in many different ways, and information islands exist between systems, which makes decision-making difficult for the relevant departments. The method of the invention preprocesses the data of each part of the underground pipe gallery in a uniform way and thereby fuses the data of each part of the system, so the data can show the environmental condition of the underground pipe gallery at a macroscopic level. All the data are passed directly into the model, and because the model is trained on a large amount of data, its accuracy in actual measurement is higher. Various kinds of data are processed uniformly, which solves the problem of information islands between systems. The model determines whether the pipe gallery is in a safe state, and the result can provide data support and a theoretical basis for the decisions of the relevant departments.
Description of the Drawings
Fig. 1 is a flow chart of the XGBoost-based underground comprehensive pipe gallery safety condition assessment method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to test examples and specific embodiments. It should be understood that the scope of the above-described subject matter is not limited to the following examples, and any techniques implemented based on the disclosure of the present invention are within the scope of the present invention.
Example 1
An XGBoost-based underground comprehensive pipe gallery safety condition assessment method, whose flow chart is shown in Fig. 1, comprises the following steps:
S1, acquiring data samples.
Data samples are acquired in the underground comprehensive pipe gallery. The data comprise human factor data, equipment factor data and environmental factor data, as shown in Tables 1-3. Positive and negative samples are determined according to the safety condition in the pipe gallery: data in the safe state are positive samples, and data in the unsafe state are negative samples.
Table 1. Human factor data types and corresponding results
[Table 1 image: RE-GDA0002733140520000051]
Table 2. Equipment factor data types and corresponding results
[Table 2 image: RE-GDA0002733140520000052]
Table 3. Environmental factor data types and corresponding results
[Table 3 image: RE-GDA0002733140520000061]
S2, preprocessing the data.
The preprocessing of the data comprises missing value processing and abnormal data processing.
1. Missing value processing fills each missing value with the mean or the mode of the feature column that contains it: missing numerical features are filled with the mean, and missing category features are filled with the mode. The numerical features include the temperature value, humidity value, water level value, oxygen concentration value, hydrogen sulfide concentration value, methane concentration value, carbon monoxide concentration value and the like; the mean of each feature is calculated and used to fill its missing data. The category feature data include operation normativity, health condition, dangerous object handling condition, pipeline and equipment damage degree, combined part reliability, part safety factor, design rationality, equipment environmental adaptability and the like; the value that occurs most frequently in the data (the mode) is determined and used to fill the missing data. Missing value processing is not limited to mean or mode filling.
2. Abnormal data processing: abnormal values are identified with an isolation forest outlier detection algorithm and deleted.
a) Isolation Forest is an unsupervised anomaly detection method proposed by Fei Tony Liu et al. in 2008 and widely applied to continuous data. It does not rely on any distance or density measure, which removes much of the computational cost of distance-based and density-based methods. In an isolation forest, anomalies are defined as "outliers that are easily isolated", that is, points that are sparsely distributed and far from high-density groups. Regions of the feature space where points are sparse are regions in which data occur with low probability, so data falling in those regions can be considered abnormal. Isolation Forest has linear time complexity with a small constant and a low memory requirement, scales well, can quickly process large and high-dimensional data sets, and is widely used in industry.
b) The isolation forest algorithm is an unsupervised learning algorithm mainly used to detect abnormal data. Before understanding an isolation forest (hereinafter iForest), an isolation tree (hereinafter iTree) needs to be understood. An iTree is a random binary tree in which each node has either two child nodes (a left subtree and a right subtree) or no child nodes. To build an iTree for a multidimensional data set D, an attribute a and a split value v for that attribute are first selected at random. Each sample in the data set is then classified by its value of attribute a: samples whose value of a is smaller than v are placed in the left subtree, and samples whose value of a is greater than or equal to v are placed in the right subtree. The subtrees are then constructed recursively by the same rule until the height of the tree reaches a threshold, or the data set contains only one sample or several identical samples.
c) After the iTree is constructed, data can be evaluated: a sample x is passed down from the root node of the iTree until it falls on a leaf node. Let h(x) be the path length from the root node to that leaf node. Because outliers are sparse, they are quickly assigned to leaf nodes, so h(x) is very short. Therefore h(x) can be used to judge whether a sample is an abnormal value. Before judging, the path length h(x) is normalized so that all sample scores are on the same scale, which facilitates comprehensive evaluation and comparison. For a data set of n samples, the average path length c(n) of the tree is:
$$c(n) = 2H(n-1) - \frac{2(n-1)}{n} \qquad (1)$$
d) where H(i) is the harmonic number, which can be estimated as ln(i) + 0.5772156649 (the Euler-Mascheroni constant). c(n) is the average path length for a given number of samples n and is used to normalize the path length; E(h(x)) is the average value of h(x) for sample x over all iTrees. The anomaly score of sample x is defined as:
$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}} \qquad (2)$$
e) As can be seen from equation (2), when E(h(x)) approaches c(n), the anomaly score approaches 0.5 and it cannot be determined whether the sample is abnormal. When E(h(x)) approaches 0, the anomaly score approaches 1 and the sample is judged to be an outlier. When E(h(x)) approaches n-1, the anomaly score approaches 0 and the sample is judged to be normal. Because each tree is generated by randomly selecting attributes, a single tree is unstable and highly random; combining multiple iTrees into a forest gives a more accurate and stable result. When the iForest is constructed, each new iTree is built from a randomly sampled part of the data set, which guarantees differences between iTrees. In addition, if the data set is too large, a dense region containing many abnormal values may be judged as normal, so the size of the random sample is limited when each tree is constructed. After the iForest is constructed, the anomaly score of each sample is calculated with equation (2), and the samples whose scores mark them as abnormal are deleted.
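The iTree construction, the average path length c(n) and the anomaly score can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the patent's implementation: the function names, the tree count (100), the sub-sample size (256), the height limit and the two-dimensional sample data are all assumptions. As in the usual formulation of the algorithm, c is evaluated at the sub-sample size used to build each tree, and at least two samples are assumed.

```python
import math
import random

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Average path length for n samples, with H(i) estimated as ln(i) + gamma."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_itree(rows, height, max_height, rng):
    """Split rows on a random attribute a at a random value v until isolated."""
    if height >= max_height or len(rows) <= 1:
        return {"size": len(rows)}
    a = rng.randrange(len(rows[0]))
    lo, hi = min(r[a] for r in rows), max(r[a] for r in rows)
    if lo == hi:  # simplification: stop if the chosen attribute cannot split
        return {"size": len(rows)}
    v = rng.uniform(lo, hi)
    left = [r for r in rows if r[a] < v]
    right = [r for r in rows if r[a] >= v]
    return {
        "a": a,
        "v": v,
        "left": build_itree(left, height + 1, max_height, rng),
        "right": build_itree(right, height + 1, max_height, rng),
    }

def path_length(x, node, depth=0):
    """h(x): depth of the leaf reached by x, plus c(size) for unsplit leaves."""
    if "size" in node:
        return depth + c(node["size"])
    child = node["left"] if x[node["a"]] < node["v"] else node["right"]
    return path_length(x, child, depth + 1)

def build_iforest(rows, n_trees=100, sample_size=256, seed=0):
    """Build n_trees iTrees, each from a random sub-sample of limited size."""
    rng = random.Random(seed)
    psi = min(sample_size, len(rows))
    max_height = math.ceil(math.log2(psi))
    trees = [build_itree(rng.sample(rows, psi), 0, max_height, rng)
             for _ in range(n_trees)]
    return trees, psi

def anomaly_score(x, trees, psi):
    """s = 2 ** (-E(h(x)) / c(psi)); values near 1 indicate an outlier."""
    mean_h = sum(path_length(x, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_h / c(psi))

# Hypothetical (temperature, humidity) readings with one far-away point.
data_rng = random.Random(1)
inliers = [(20 + data_rng.gauss(0, 1), 50 + data_rng.gauss(0, 1)) for _ in range(200)]
outlier = (80.0, 5.0)
trees, psi = build_iforest(inliers + [outlier])
outlier_score = anomaly_score(outlier, trees, psi)
max_inlier_score = max(anomaly_score(p, trees, psi) for p in inliers)
print(outlier_score, max_inlier_score)
```

In practice a score threshold (or a fixed contamination fraction) decides which samples are deleted; the patent does not state which threshold it uses.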
S3, constructing the features.
The features are classified into category features and numerical features according to the types of the feature values.
Category features: for example, the feature "damage degree of pipelines and equipment" takes values such as no damage, slight damage, moderate damage and serious damage. Such non-numerical data cannot be put into the model directly and must be converted into numerical data that the computer can recognize. The number of values of such a feature is generally limited and the values can be enumerated one by one, so such features are also called discrete features. If the set of values of a feature is [1,2,3,4], the feature can be considered to be derived from a non-numerical feature by encoding, and is therefore classified as a category feature.
Numerical features: in the present embodiment, the numerical features include temperature, humidity, water level, oxygen concentration, hydrogen sulfide concentration, methane concentration, carbon monoxide concentration and the like. For example, the temperature feature takes values in a range, and any value within that range is possible, so the number of possible values is infinite. Such features are also called continuous features.
The feature construction comprises the following steps:
Step one: performing numerical conversion on the category feature data.
Common ways of processing category features include one-hot encoding, mean encoding, label encoding and the like. The method of the invention adopts one-hot encoding. One-hot encoding represents a categorical variable as a binary vector: the category values are first mapped to integer values, and each integer value is then represented as a binary vector that is all zeros except at the index of the integer, which is marked 1. For example, the binary codes corresponding to ["no damage", "slight damage", "moderate damage", "slight damage"] are [[1,0,0], [0,1,0], [0,0,1], [0,1,0]], respectively. The other category features in the collected data, for example operation normativity, health condition, dangerous object handling condition, combined part reliability, part safety factor, design rationality and maintenance convenience, are converted to numerical values in the same way, in preparation for the subsequent feature selection.
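The one-hot example above can be reproduced with a short standard-library sketch. The function name `one_hot` and the choice to index categories in order of first appearance are assumptions made for illustration.

```python
def one_hot(values):
    """Map each category to an integer index, then to a binary vector."""
    categories = []
    for v in values:
        if v not in categories:
            categories.append(v)          # index = order of first appearance
    index = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0] * len(categories)       # all zeros ...
        vec[index[v]] = 1                 # ... except at the category's index
        vectors.append(vec)
    return categories, vectors

damage = ["no damage", "slight damage", "moderate damage", "slight damage"]
categories, encoded = one_hot(damage)
print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
```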
Step two: performing feature combination on the continuous features (numerical features).
The specific steps are as follows:
1. Feature value expansion using statistical methods. Expansion here means that, in addition to the acquired data itself, new descriptive feature values calculated from the acquired data are added to the sample. For example, the count, maximum, variance, mean, skewness and sum of the acquired numerical feature values are calculated to generate new descriptive feature values, which makes the data sample richer. The feature combination and binning of numerical features are described in detail below, taking temperature data as an example.
For example, as shown in Table 4, the electric power compartment has three temperature sensors and three humidity sensors. The values of the three temperature sensors of the electric power compartment can be averaged, which yields new feature values for the compartment: the average temperature and the number of temperature samples. The gas compartment and the heating power compartment can be treated in the same way. The resulting average temperature and humidity of the three compartments are shown in Table 5.
Table 4. Temperature and humidity values detected by sensors in the electric power, gas and heating power compartments
[Table 4 image: RE-GDA0002733140520000101]
Table 5. Average temperature and average humidity of the electric power, gas and heating power compartments
Compartment      Temperature  Humidity
Electric power   19           47
Gas              20           46
Heating power    21           45
2. Combined features can also be generated by performing logical operations on the features of the same compartment.
The utility tunnel contains a variety of compartments. A logical operation on the same feature within the same compartment can generate a new feature that reflects the trend or state of that compartment. For example, the temperature features can be grouped by compartment, and the difference between the maximum and minimum temperature can then be calculated to generate a new combined feature.
The new descriptive features and the new combined features increase the data dimensionality, and the new combined feature data can reflect hidden information about the underground comprehensive pipe gallery.
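The descriptive and combined features described in items 1 and 2 can be sketched as follows. The sensor readings are hypothetical (the Table 4 values are not reproduced here) and were chosen only so that the compartment means agree with the temperatures in Table 5; the function name `describe` and the feature names are likewise assumptions.

```python
from statistics import mean, pvariance

# Hypothetical temperature readings (three sensors per compartment).
temperature = {
    "electric power": [18.0, 19.0, 20.0],
    "gas": [19.5, 20.0, 20.5],
    "heating power": [20.0, 21.0, 22.0],
}

def describe(readings):
    """Derive descriptive and combined features from one compartment's readings."""
    return {
        "count": len(readings),
        "mean": mean(readings),
        "max": max(readings),
        "sum": sum(readings),
        "variance": pvariance(readings),
        "range": max(readings) - min(readings),  # max-min combined feature
    }

features = {cabin: describe(vals) for cabin, vals in temperature.items()}
for cabin, feats in features.items():
    print(cabin, feats["mean"], feats["range"])
```

Each derived value becomes an extra column of the sample, which is how feature construction raises the data dimensionality before feature selection reduces it again.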
After numerical conversion of the category feature data and feature combination of the numerical features, the data samples can be used for model training or input into a model for actual measurement. However, the additional features, although they can improve the accuracy of the analysis, increase the amount of data and reduce processing efficiency, so a binning operation is performed. The binning operation comprises binning the category feature data and binning the numerical features.
S4, performing a binning operation on the continuous features.
Binning a continuous feature proceeds as follows. All data of a given type in the data samples are collected, and their value range (the interval from the minimum value to the maximum value) is found. The width of the range is divided by the number of bins to obtain the bin width, and the range is divided equally into N bins of that width, which gives the value range of each bin. Each data item of that type is then assigned to the bin whose value range contains it, and its value is replaced by the feature value of that bin.
For example, suppose the temperature range of a compartment is [20, 35] and it is to be divided into three bins. The bin width is (35-20)/3 = 5, so the value range of the first bin is [20, 25), that of the second bin is [25, 30), and that of the third bin is [30, 35]. The binned feature value is set to 1 for [20, 25), 2 for [25, 30) and 3 for [30, 35]. If a temperature value is 26.5482, it falls into the value range of the second bin and is converted to 2. Binning thus approximates each value by the bin it falls in, which turns the continuous feature into classes. The binning operation is not limited to the temperature feature. The number of bins determines how fine the classification is. With fewer bins there are fewer classes, the data are approximated over a larger range, and the amount of calculation is reduced. With more bins there are more classes and the data are approximated over a smaller range; the amount of calculation is still reduced, but to a lesser degree. The binning operation can be designed according to the needs of the data analysis, balancing the information content of the data against the amount of calculation.
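The equal-width binning in this example can be written as a small function. This is a sketch of the scheme described above; the function name `equal_width_bin` and the decision to raise an error for out-of-range values are assumptions.

```python
def equal_width_bin(value, low, high, n_bins):
    """Return the 1-based bin index of value in n_bins equal-width bins over [low, high]."""
    if not low <= value <= high:
        raise ValueError("value outside the binning range")
    width = (high - low) / n_bins
    index = int((value - low) // width) + 1
    return min(index, n_bins)  # the upper bound belongs to the last bin

# Temperature range [20, 35] split into three bins of width 5.
print(equal_width_bin(26.5482, 20, 35, 3))  # 2
print(equal_width_bin(35, 20, 35, 3))       # 3
```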
S5, feature selection.
The purpose of feature selection is to reduce the number of features and the dimensionality, so that the model generalizes better, overfitting is reduced, model training time and model complexity are reduced, and the effect of the model is improved. The invention uses a Filter method to select the features.
Feature selection with a Filter: one kind of Filter operation "scores" each feature dimension, that is, assigns each feature a weight that represents its importance, and then ranks the features by weight. Feature selection is performed first and the learner is trained afterwards, so the feature selection process is independent of the learner. This is equivalent to filtering the features and then training the classifier with the resulting feature subset. One specific implementation is variance filtering, which screens features by their own variance. If the variance of a feature is very small, the samples hardly differ in that feature: most of its values, or even all of them, are the same, and the feature does nothing to distinguish samples. Therefore, whatever feature engineering is done next, features with a variance of 0 should be eliminated first. If the variance threshold is denoted threshold, all features with a variance less than threshold are discarded; with the smallest setting this deletes all features whose value is the same in every record.
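Variance filtering as described above can be sketched with the standard library. The function name `variance_filter`, the column names and the values are hypothetical; the sketch keeps a feature only when its population variance strictly exceeds the threshold, so a threshold of 0 removes exactly the constant features.

```python
from statistics import pvariance

def variance_filter(columns, threshold=0.0):
    """Keep only the features whose variance is greater than the threshold."""
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) > threshold
    }

# Hypothetical feature columns; "sensor_online" is identical in every record.
columns = {
    "temperature_bin": [1, 2, 3, 2],
    "humidity": [47, 46, 45, 46],
    "sensor_online": [1, 1, 1, 1],
}
selected = variance_filter(columns)
print(sorted(selected))  # ['humidity', 'temperature_bin']
```

Because the score depends only on the feature itself and not on the label or the learner, this step can run once before any model is trained.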
As can be seen from Table 7, the original feature data of the training set has dimension (6700, 48). After feature construction the dimension of the training set increases to (6700, 256), and after Filter feature selection it decreases to (6700, 106). The effect of the constructed features on the model is shown in Table 6. Across the different algorithms trained, feature construction increases the time consumed and the memory occupied, but the AUC and F1-Score evaluation indexes both improve markedly.
TABLE 6 Classification Effect before and after feature construction
Figure RE-GDA0002733140520000131
TABLE 7 feature sizes before and after feature engineering
Figure RE-GDA0002733140520000132
TABLE 8 Classification Effect before and after feature selection Using Filter
Figure RE-GDA0002733140520000133
As can be seen from the comparison of classification results before and after feature selection using the Filter, the time consumed and the memory occupied are greatly reduced, and both evaluation metrics, AUC and F1-Score, are markedly improved. Therefore, performing feature engineering on the data through feature construction, binning and feature selection improves data processing efficiency and, to a certain extent, the performance of the model.
S6, training the model.
First, data set partitioning is performed.
The data samples obtained after feature selection are divided into a training set and a test set.
Suppose the original data set is a table of 10000 samples with several features, where each sample has a label indicating whether it is a positive or a negative example. Before training, the data must be split (training set : test set of approximately 7:3) into 6700 training samples and 3300 test samples. The training set contains samples and their labels for model training, and the test set contains samples and their labels for model verification.
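The split described above can be sketched as follows. The helper name, the fixed random seed and the synthetic stand-in data are assumptions made for illustration; the description does not state whether the samples are shuffled before splitting.

```python
import random


def train_test_split(samples, labels, train_size, seed=0):
    """Shuffle the sample indices once, then cut them into training and test parts."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)

    def pick(idx):
        return [samples[i] for i in idx], [labels[i] for i in idx]

    return pick(indices[:train_size]), pick(indices[train_size:])


# Synthetic stand-in data (hypothetical): 10000 samples with a made-up binary label.
samples = [[float(i)] for i in range(10000)]
labels = [i % 2 for i in range(10000)]

(train_x, train_y), (test_x, test_y) = train_test_split(samples, labels, train_size=6700)
```

Each sample keeps its label through the shuffle, and no sample appears in both sets.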
Next, model training is performed.
The feature variables retained by feature selection are taken as input variables, and an XGBoost model is constructed and trained using the XGBoost algorithm.
XGBoost is the abbreviation of Extreme Gradient Boosting and is one of the Boosting algorithms. Its basic idea is to combine multiple weak learners into one strong learner. The learner is trained over multiple rounds; in each round the training data are drawn at random from the original training set, producing n trained classifiers, each of which yields a prediction function. For a classification problem, the final result is obtained by voting among the prediction functions; for a regression problem, it is obtained by taking their simple average.
XGBoost is a boosted tree model that fuses multiple base models into a classifier with better performance. The base model of the XGBoost algorithm can be customized; a CART tree is generally used.
The XGboost objective function is defined as:
Figure RE-GDA0002733140520000141
Figure RE-GDA0002733140520000151
The model objective function consists of two parts: the first, main part is the loss function, and the second is the regularization term over the leaf nodes. k represents the number of leaf nodes; the regularization term keeps the leaf node scores from becoming too large, which prevents overfitting; and f(x) is one regression tree.
For the XGBoost model, each newly generated tree must fit the residual of the prediction from the previous round; that is, when t trees have been generated, the prediction score can be written as:
Figure RE-GDA0002733140520000152
the objective function at this time is:
Figure RE-GDA0002733140520000153
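For reference, the standard XGBoost formulation that this description appears to follow can be sketched as below. The notation is taken from the general XGBoost literature and is an assumption rather than a transcription of the figures: $K$ is the number of trees, $T$ the number of leaf nodes of a tree, $\omega_j$ the score of leaf $j$, and $\gamma$, $\lambda$ the regularization coefficients.

```latex
\begin{aligned}
\mathrm{Obj} &= \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right),
\qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^{2} \\
\hat{y}_i^{(t)} &= \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i) \\
\mathrm{Obj}^{(t)} &= \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega\left(f_t\right) + \mathrm{const}
\end{aligned}
```

The constant collects the regularization terms of the first $t-1$ trees, which are fixed when tree $t$ is fitted.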
XGBoost performs a second-order Taylor expansion of the loss function as follows:
Figure RE-GDA0002733140520000154
where g_i is the first derivative and h_i is the second derivative:
Figure RE-GDA0002733140520000155
In this formula, the loss values of all samples are summed. Because every sample eventually falls into exactly one leaf node, the samples can be regrouped by the leaf node they fall into:
Figure RE-GDA0002733140520000156
By rewriting formula (9), the objective function becomes a univariate quadratic function of the leaf node score ω, so the optimal ω and the corresponding objective function value can be solved directly. The optimal ω and objective function value are:
Figure RE-GDA0002733140520000161
Figure RE-GDA0002733140520000162
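As a numeric illustration of the closed-form solution, the sketch below assumes the results commonly given in the XGBoost literature: for a leaf with summed first derivatives G and summed second derivatives H, the optimal score is ω* = -G / (H + λ), and the optimal objective value is -1/2 · Σ G² / (H + λ) + γT over the T leaves. These formulas, the logistic loss, the values of λ and γ, and the toy data are assumptions; they are not read from the figures.

```python
import math


def logistic_grad_hess(y_true, raw_score):
    """First and second derivatives of the logistic loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - y_true, p * (1.0 - p)


def leaf_solution(labels, raw_scores, lam=1.0):
    """Sum g_i and h_i over the samples in one leaf and return (w*, G, H)."""
    G = H = 0.0
    for y, s in zip(labels, raw_scores):
        g, h = logistic_grad_hess(y, s)
        G += g
        H += h
    return -G / (H + lam), G, H


def structure_score(leaves, lam=1.0, gamma=0.0):
    """Optimal objective value: -1/2 * sum_j G_j^2 / (H_j + lam) + gamma * T."""
    total = 0.0
    for labels, raw_scores in leaves:
        _, G, H = leaf_solution(labels, raw_scores, lam)
        total += G * G / (H + lam)
    return -0.5 * total + gamma * len(leaves)


# Toy leaf (hypothetical): three positive samples, one negative, previous raw score 0.
labels = [1, 1, 1, 0]
scores = [0.0, 0.0, 0.0, 0.0]
w_star, G, H = leaf_solution(labels, scores, lam=1.0)
obj_star = structure_score([(labels, scores)], lam=1.0, gamma=0.0)
```

In this toy leaf each sample has predicted probability 0.5, so G = -1 and H = 1; the positive optimal score pushes the prediction toward the majority class, and a larger λ shrinks it toward zero.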
The main XGBoost parameters to be tuned are shown in Table 11 below:
TABLE 11 Main XGBoost parameters to be adjusted
Figure RE-GDA0002733140520000163
As a specific embodiment of the invention, 10000 data samples form a data set (each sample comprises 56 feature values), which is divided into a training set of 6700 samples and a validation set of 3300 samples. The feature values of the 6700 training samples are input into the model together with manually preset labels, and the model is trained to obtain a trained model. The 3300 validation samples are then input into the trained model, which outputs label values. The output label values are scored by comparing them with the labels of the original validation set; the higher the score, the better the model is trained. A scoring threshold is set: if the score during validation is higher than the threshold, model training is finished; otherwise, model training continues.
During actual measurement, each piece of state data acquired in real time (comprising 56 feature values) is input into the trained model, which outputs the label value for that piece of data. If the result is 0, the underground pipe gallery is considered safe; if the result is 1, it is considered unsafe.
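The validation check and the interpretation of the output label can be sketched as follows. The description does not state which score is compared with the threshold, so plain accuracy is used here as an assumption; the function names and the toy labels are likewise hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of predicted labels that match the reference labels."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


def training_finished(predicted, actual, score_threshold):
    """Training stops once the validation score is higher than the threshold."""
    return accuracy(predicted, actual) > score_threshold


def interpret(label):
    """Map the model's output label to a safety verdict (0 = safe, 1 = unsafe)."""
    if label not in (0, 1):
        raise ValueError(f"unexpected label: {label!r}")
    return "safe" if label == 0 else "unsafe"


# Toy validation labels and model outputs (hypothetical).
validation_labels = [0, 0, 1, 1, 0]
model_output = [0, 0, 1, 0, 0]
score = accuracy(model_output, validation_labels)  # 4 of the 5 labels match
```

With a threshold of 0.75 this toy model would be accepted; with a threshold of 0.9 training would continue.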
S5, optimizing the model.
To obtain better model performance, a set of optimal hyperparameters must be found quickly. Searching for and tuning hyperparameters is tedious but important. Typically, manual tuning is first used to establish the approximate performance of the model under a given set of hyperparameters, and grid search and random search are then used to find better hyperparameters quickly. However, grid search usually requires a large amount of memory, spends considerable effort and time evaluating the search space, and cannot locate the region containing the optimum accurately and quickly. Therefore, the Bayesian algorithm is used to tune the model. The Bayesian optimization algorithm is a model-based optimization method that searches for the minimum of a function. It mainly comprises three steps:
(1) A prior function is selected to express the assumptions about the function to be optimized.
(2) The hyperparameters of the prior are obtained by maximum likelihood estimation, that is, by maximizing the marginal likelihood.
(3) An acquisition function is selected based on the optimized hyperparameters. The acquisition function constructs a utility function from the posterior model to determine the next sample point. It balances sampling at points where the modeled objective function is low against searching regions that have not yet been sampled.
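The three steps above can be sketched with a one-dimensional toy problem. This is a simplified illustration under several assumptions: a Gaussian process with a squared-exponential kernel serves as the prior, the kernel hyperparameters are fixed rather than fitted by marginal likelihood (so step (2) is skipped), a lower confidence bound serves as the acquisition function, and `toy_loss` stands in for the validation loss of a model at a given hyperparameter value. All names are hypothetical.

```python
import math


def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel, used here as the GP prior covariance."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def posterior(xs, ys, x_new, jitter=1e-4):
    """GP posterior mean and standard deviation at x_new (zero prior mean)."""
    gram = [[rbf(a, b) + (jitter if i == j else 0.0)
             for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * w for k, w in zip(k_star, solve(gram, ys)))
    reduction = sum(k * w for k, w in zip(k_star, solve(gram, k_star)))
    return mean, math.sqrt(max(1.0 - reduction, 0.0))


def bayes_minimize(objective, candidates, initial, n_iter=8, kappa=2.0):
    """Repeatedly evaluate the candidate with the lowest confidence bound."""
    xs = list(initial)
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        untried = [c for c in candidates if c not in xs]
        if not untried:
            break

        def lcb(c):
            # Low mean favours known good regions; high std favours unexplored ones.
            mean, std = posterior(xs, ys, c)
            return mean - kappa * std

        x_next = min(untried, key=lcb)
        xs.append(x_next)
        ys.append(objective(x_next))
    best = min(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], xs


def toy_loss(x):
    """Stand-in for a validation loss as a function of one hyperparameter."""
    return (x - 2.0) ** 2


grid = [i / 10 for i in range(51)]  # candidate values 0.0 .. 5.0
best_x, best_y, history = bayes_minimize(toy_loss, grid, initial=[0.0, 5.0])
```

Each iteration costs one evaluation of the objective, which is the point of the method when a single evaluation means training and validating a model. The parameter `kappa` controls how strongly unexplored regions are favoured over regions already known to be good.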
After the hyperparameters of the XGBoost algorithm are tuned with the Bayesian algorithm, the resulting model performs best; the optimal hyperparameters finally obtained are shown in Table 12.
TABLE 12 Optimal parameters of the XGBoost model selected by the Bayesian algorithm
Figure RE-GDA0002733140520000181
Models were constructed with the DT, LR, NC, SVM, GBDT and XGBoost algorithms; their classification results before and after optimization with the Bayesian algorithm are shown in Table 13.
TABLE 13 Classification Effect before and after optimization Using Bayesian Algorithm
Figure RE-GDA0002733140520000182
Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only a single independent technical solution; the description is written this way for clarity only. Those skilled in the art should treat the description as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments that can be understood by those skilled in the art.

Claims (10)

1. An underground comprehensive pipe gallery safety condition assessment method based on XGboost is characterized by comprising the following steps:
S1, acquiring a data original sample of the underground pipe gallery, wherein the data original sample comprises human factor data, equipment factor data and environment factor data;
S2, preprocessing the data original sample, wherein the preprocessing comprises missing value processing and abnormal data processing;
S3, performing feature construction on the preprocessed data to obtain a feature-constructed data set;
S4, performing feature selection on the feature-constructed data set, wherein the feature data after feature selection form a data sample set;
and S5, inputting the data sample set into a pre-trained XGboost algorithm model, and determining the safety state of the underground pipe gallery according to the output result.
2. The XGboost-based underground comprehensive pipe gallery safety condition assessment method according to claim 1, wherein the human factor data comprises operational norms, health conditions, hazardous material handling conditions, and pipe and equipment damage levels.
3. The XGboost-based underground comprehensive pipe gallery safety condition assessment method according to claim 1, wherein the equipment factor data comprises a combination part reliability degree, a part safety coefficient, design rationality and equipment environment adaptability, equipment safety devices, an equipment failure rate, a periodic inspection rate and a maintenance rate.
4. The XGboost-based underground comprehensive pipe gallery safety condition assessment method according to claim 1, wherein the environment factor data comprises temperature, humidity, water level, oxygen concentration, hydrogen sulfide concentration, methane concentration and carbon monoxide concentration.
5. The XGboost-based underground comprehensive pipe gallery safety condition assessment method according to any one of claims 1 to 4, wherein the data original samples are divided into category characteristic data and numerical characteristic data, and the missing value processing is that: the missing numerical features are filled using a mean and the missing class features are filled using a mode.
6. The XGboost-based underground comprehensive pipe gallery safety condition assessment method according to claim 5, wherein the abnormal data processing is to identify abnormal values in the data original sample by using an isolation forest abnormal value detection algorithm and to delete the abnormal values.
7. The XGboost-based underground comprehensive pipe gallery safety condition assessment method according to claim 6, wherein the feature construction comprises the following steps:
A. performing numerical value conversion on the category characteristic data;
B. performing a binning operation on the numerical characteristic data, and performing feature combination.
8. The XGboost-based underground comprehensive pipe gallery safety condition assessment method according to claim 7, wherein in step S4, the feature-constructed data set is subjected to feature selection by using a Filter algorithm to obtain the data sample set.
9. The XGboost-based underground comprehensive pipe gallery safety condition assessment method according to claim 8, further comprising the steps of:
and acquiring the hyperparameter of the XGboost algorithm model by adopting a Bayesian optimization algorithm, wherein the hyperparameter is used for optimizing the XGboost algorithm model.
10. An XGboost-based underground comprehensive pipe gallery safety condition evaluation system is characterized by comprising at least one processor and a memory which is in communication connection with the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
CN202010604912.5A 2020-06-29 2020-06-29 XGboost-based underground comprehensive pipe gallery safety condition assessment method Pending CN111950585A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010604912.5A CN111950585A (en) 2020-06-29 2020-06-29 XGboost-based underground comprehensive pipe gallery safety condition assessment method

Publications (1)

Publication Number Publication Date
CN111950585A true CN111950585A (en) 2020-11-17

Family

ID=73337580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010604912.5A Pending CN111950585A (en) 2020-06-29 2020-06-29 XGboost-based underground comprehensive pipe gallery safety condition assessment method

Country Status (1)

Country Link
CN (1) CN111950585A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201117