CN116502154A - Seismic classification method and system based on multidimensional feature extraction and XGBoost

Info

Publication number: CN116502154A
Application number: CN202310454650.2A
Authority: CN
Inventors: 王婷婷; 边银菊; 任梦依
Original assignee: INSTITUTE OF GEOPHYSICS CHINA EARTHQUAKE ADMINISTRATION
Current assignee: INSTITUTE OF GEOPHYSICS CHINA EARTHQUAKE ADMINISTRATION
Priority date: 2023-04-25
Filing date: 2023-04-25
Publication date: 2023-07-28

Abstract

The invention provides a seismic classification method and system based on multi-dimensional feature extraction and XGBoost. The method comprises a seismic property rechecking step, a multi-dimensional feature extraction step, a feature analysis and feature selection step, and an XGBoost algorithm training and classification step. By combining seismic property rechecking, multi-dimensional feature value extraction, feature analysis, feature selection and XGBoost training, a standardized feature data set and an XGBoost classification model are established, and the generalization capability of the model is tested to determine an optimal XGBoost classification model for intelligent seismic classification. The method effectively highlights the differences between records of different types of seismic events, trains quickly, yields a classification model with strong generalization capability, and therefore has wide popularization value.

Description

Seismic classification method and system based on multidimensional feature extraction and XGBoost

Technical Field

The invention relates to the technical field of seismic data processing, in particular to a seismic classification method and system based on multidimensional feature extraction and XGBoost.

Background

In recent years, a number of non-natural seismic events with considerable impact have occurred. With the development of science and technology, human industrial activities such as mineral extraction, natural gas extraction, oilfield wastewater reinjection and rock blasting can induce earthquakes. Such induced earthquakes can be felt by nearby residents, may cause damage and harm, and have become a topic of intense public discussion.

With the establishment of dense seismic networks worldwide and in individual countries, non-natural seismic events such as artificial explosions, mine earthquakes and landslides are also recorded and become mixed with natural earthquake records. If they are not removed in time, they contaminate the earthquake catalog and affect subsequent seismic hazard analysis. Classifying seismic events in regional and local records is therefore an important task for future seismic monitoring and has practical significance: (1) removing non-natural seismic events from the earthquake catalog and establishing a complete natural earthquake catalog is important for seismological studies such as active fault delineation and seismic hazard evaluation; (2) establishing a complete catalog of industrial explosions from rock blasting, quarries and the like facilitates supervision of the blasting practices of mining operators and the assessment of sudden explosion disasters; (3) establishing a mine earthquake catalog helps assess the earthquake risk associated with human mining activities.

Owing to the complexity of the shallow crustal medium and the diversity of non-natural event types, identifying local and regional seismic records is difficult. The recorded waveform of a non-natural seismic event is strongly affected by the source, the propagation path, the attenuation characteristics of the medium and so on, and has strong regional characteristics. Natural earthquakes generally have a double-couple source mechanism, a greater source depth and a four-quadrant amplitude distribution. Industrial explosions can be divided into single explosions, multi-hole instantaneous explosions and delayed (millisecond-delay) explosions; although they are not point sources in the strict sense, they lack the features of a double-couple source mechanism. The source mechanism of mining-induced earthquakes is more complex: the rock failure process may include a shear component, and the source mechanism may be an implosion, a double couple, or a combination of the two.

In theory, the source mechanisms of natural earthquakes and artificial explosions differ greatly. However, the station spacing of fixed networks is large, high-frequency seismic waves are strongly affected by the propagation path, and resolving the source parameters of low-magnitude events (M < 3.0) and the regional high-frequency attenuation model from local and regional records remains difficult. The common approach to classifying natural and non-natural events is therefore to extract discriminant features from the seismic records. A single feature has the drawback that the stability and universality of its classification capability are poor, whereas combining multiple criteria improves event classification to varying degrees, so artificial intelligence techniques such as pattern recognition and machine learning are widely applied to discriminating earthquakes from non-natural events. In the classification of local- and regional-distance seismic signals, most studies address the binary classification of earthquakes and explosions, and few address induced earthquakes or three types of seismic events. Tang et al. (2019) used an SVM for three-class classification of earthquakes, explosions and induced earthquakes, with P-wave and S-wave amplitude spectra as input features, and found that the classification model generalized poorly because of differences in attenuation structure and source depth between regions. The classification of multiple types of seismic events and the verification of feature universality therefore need further exploration.

Disclosure of Invention

To address the instability of single features and the multi-class problem of seismic events, the invention provides a seismic classification method based on multi-dimensional feature extraction and XGBoost. The invention also relates to a seismic classification system based on multi-dimensional feature extraction and XGBoost.

The technical scheme of the invention is as follows:

The seismic classification method based on multidimensional feature extraction and XGBoost is characterized by comprising the following steps:

a seismic property rechecking step: collecting seismic data, preprocessing the seismic data to establish a standard data set, performing waveform analysis on the original seismic catalog to complete the seismic property recheck and thereby determine earthquake, explosion and mine earthquake catalogs, and automatically segmenting the effective wave band based on the standard data set and the earthquake, explosion and mine earthquake catalogs;

a multidimensional feature extraction step: extracting, within the effective wave band, multidimensional feature values comprising the P/S amplitude ratio, high-low frequency energy ratio, corner frequency, waveform duration and wavelet packet energy ratio, and establishing a standardized feature data set;

a feature analysis and feature selection step: calculating feature importance scores with the XGBoost algorithm and the random forest algorithm, performing feature selection according to the feature importance scores, and screening out an optimal feature data set;

and an XGBoost algorithm training and classification step: splitting the screened optimal feature data set into a training set and a validation set, constructing an XGBoost classification model on the training set based on the XGBoost algorithm, performing model training and validation by 5-fold cross-validation, determining the parameters of the XGBoost classification model, and testing the generalization capability of the XGBoost classification model with test data sets from different regions to determine the optimal XGBoost classification model for seismic classification.

Preferably, in the seismic property rechecking step, format conversion and instrument response correction processing are performed after the seismic data are collected to establish a standard data set, and the seismic property rechecking is completed by performing waveform analysis and seismic position confirmation on the original seismic catalogue so as to determine the seismic, explosion and mine seismic catalogue.

Preferably, in the multi-dimensional feature extraction step, a standardized feature data set is established by extracting, within the effective wave band and from multiple dimensions, the corner frequency, waveform duration, high-low frequency energy ratio, waveform complexity, zero-crossing rate, instantaneous frequency complexity, cepstrum complexity, P/S amplitude ratio, autocorrelation coefficient, dominant frequency, third moment of frequency, wavelet packet energy ratio, and the average frequency, mean value and main peak feature value after the Hilbert-Huang transform.

Preferably, in the multi-dimensional feature extraction step: the P/S amplitude ratio is extracted by selecting seismic phase windows containing the P-wave and S-wave signals using a specific velocity window and, after Fourier transformation, calculating root-mean-square P/S amplitude ratios in different frequency bands, the P/S amplitude ratios comprising a time-domain maximum amplitude ratio and amplitude ratios in a plurality of different frequency bands; the corner frequency is the intersection of the high-frequency asymptote and the low-frequency asymptote of the amplitude spectrum, source spectrum parameters are obtained by spectral analysis, and the corner frequencies of the P wave and the S wave are calculated; the wavelet packet energy ratio is extracted by performing a four-level decomposition of the P wave and the S wave with the sym5 wavelet packet function and calculating the ratio of the energy of each decomposed signal to the total signal energy; the cepstrum complexity is extracted on the basis of the oscillatory character of the earthquake amplitude spectrum and the simplicity of the explosion source, cepstrum criterion values being extracted separately for the P waves and S waves of earthquakes and explosions; the instantaneous frequency complexity is extracted by first applying the Hilbert transform to the signal to obtain the analytic signal, then obtaining the instantaneous frequency by the Wigner distribution and empirical mode decomposition, and finally calculating the instantaneous frequency complexity.

Preferably, in the feature analysis and feature selection step, the Gini importance of each feature is calculated with the XGBoost algorithm and the random forest algorithm and normalized to obtain feature importance scores, and the number of features used for seismic classification is selected on the basis of the feature importance scores so as to screen out the optimal feature data set.

Preferably, the XGBoost algorithm training and classification step further comprises parameter optimization and evaluation of the XGBoost classification model, wherein the GridSearchCV function is used to optimize the XGBoost classification model parameters, which comprise the number of iterations, learning rate, minimum sample weight, minimum loss reduction and random sampling ratio; and the confusion matrix, ROC curve and AUC value are used to evaluate the performance of the XGBoost classification model.

Preferably, in the XGBoost algorithm training and classification step, the XGBoost classification models constructed on the training set based on the XGBoost algorithm comprise a binary classification model and a three-class classification model, and the generalization capability of both models is evaluated using accuracy, true positive rate and true negative rate.
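As a minimal sketch of the evaluation indices named above, the following computes accuracy, true positive rate and true negative rate from true and predicted labels. The function name `binary_metrics` and the label encoding are illustrative assumptions, not part of the disclosure; for the three-class model the same function could be applied one-vs-rest by choosing which class counts as positive.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, true positive rate and true negative rate for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Rates are undefined when a class is absent, so return NaN in that case.
        "tpr": tp / (tp + fn) if tp + fn else float("nan"),
        "tnr": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Hypothetical labels: 1 = earthquake, 0 = explosion.
metrics = binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```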

A seismic classification system based on multidimensional feature extraction and XGBoost is characterized by comprising a seismic property rechecking module, a multidimensional feature extraction module, a feature analysis and feature selection module and an XGBoost algorithm training classification module which are connected in sequence,

the seismic property rechecking module is used for collecting seismic data, preprocessing the seismic data to establish a standard data set, carrying out waveform analysis on an original seismic catalog to finish seismic property rechecking so as to determine seismic, explosion and mine seismic catalogues, and carrying out effective wave band automatic segmentation based on the standard data set and the seismic, explosion and mine seismic catalogues;

the multi-dimensional characteristic extraction module is used for extracting multi-dimensional characteristic values of a P/S amplitude ratio, a high-low frequency energy ratio, corner frequency, waveform duration and wavelet packet energy ratio in an effective wave band and establishing a standardized characteristic data set;

the feature analysis and feature selection module calculates feature importance scores through an XGBoost algorithm and a random forest algorithm, performs feature selection according to the feature importance scores, and screens out an optimal feature data set;

the XGBoost algorithm training classification module is used for splitting the screened optimal feature data set into a training set and a validation set, constructing an XGBoost classification model on the training set based on the XGBoost algorithm, performing model training and validation by 5-fold cross-validation, determining the parameters of the XGBoost classification model, and testing the generalization capability of the XGBoost classification model with test data sets from different regions so as to determine the optimal XGBoost classification model for seismic classification.

Preferably, the seismic property rechecking module acquires the seismic data, performs format conversion and instrument response correction processing to establish a standard data set, and performs waveform analysis and seismic position confirmation on the original seismic catalogue to complete seismic property rechecking so as to determine the seismic, explosion and mine seismic catalogue.

Preferably, the multi-dimensional feature extraction module extracts, from multiple dimensions, the corner frequency, waveform duration, high-low frequency energy ratio, waveform complexity, zero-crossing rate, instantaneous frequency complexity, cepstrum complexity, P/S amplitude ratio, autocorrelation coefficient, dominant frequency, third moment of frequency, wavelet packet energy ratio, and the average frequency, mean value and main peak feature value after the Hilbert-Huang transform, and establishes a standardized feature data set;

and/or the XGBoost algorithm training and classifying module also performs XGBoost classification model parameter optimization and model evaluation, optimizes XGBoost classification model parameters by adopting a GridSearchCV function, wherein the XGBoost classification model parameters comprise iteration times, learning rate, minimum sample weight, minimum loss function drop value and random sampling proportion; and uses the confusion matrix, ROC characteristic curve and AUC value to evaluate XGBoost classification model performance.

The beneficial effects of the invention are as follows:

The invention provides a seismic classification method based on multidimensional feature extraction and XGBoost. In the seismic property rechecking step, a standard data set is established and the seismic property recheck is completed to determine earthquake, explosion and mine earthquake catalogs, and the effective wave band is segmented automatically based on the standard data set and these catalogs. In the multi-dimensional feature extraction step, feature values such as the P/S amplitude ratio, high-low frequency energy ratio, corner frequency, waveform complexity and cepstrum complexity are extracted from multiple dimensions, effectively highlighting the differences between records of different types of seismic events; the feature extraction combines seismological theory with digital signal processing, extracts discriminant features from both the time domain and the frequency domain, and establishes a standardized feature data set. In the feature analysis and feature selection step, feature importance scores are calculated with the XGBoost algorithm and the random forest (RF) algorithm, features are selected according to their importance, and an optimal feature data set is screened out. In the XGBoost algorithm training and classification step, the standardized feature data set is split into a training set and a validation set, and XGBoost (Extreme Gradient Boosting) classification models, namely a binary classification model and a three-class classification model, are constructed on the training set. The classification capability is stable: the parameters of the XGBoost classification model are determined when the classification accuracy for earthquakes, explosions and mine earthquakes is above 90%, and the generalization capability of the model is then verified over a larger range to determine the optimal XGBoost classification model for intelligent seismic classification. The method trains quickly, yields a classification model with strong generalization capability, is simple and easy to use, has a low implementation cost, is suitable for the operational requirements of earthquake and microseismic classification, and has wide popularization value.

The invention also relates to a seismic classification system based on multi-dimensional feature extraction and XGBoost, which corresponds to the seismic classification method based on multi-dimensional feature extraction and XGBoost, and can be understood as a system for realizing the seismic classification method based on multi-dimensional feature extraction and XGBoost.

Drawings

FIG. 1 is a flow chart of a seismic classification method based on multi-dimensional feature extraction and XGBoost of the present invention.

FIG. 2 is a technical schematic diagram of a seismic classification method based on multi-dimensional feature extraction and XGBoost of the present invention.

Fig. 3 is a multi-dimensional profile of an earthquake, explosion, and mine earthquake.

Fig. 4a-4b are feature importance score profiles.

Fig. 5a-5b are feature selection and recognition accuracy graphs.

Fig. 6 is a schematic diagram of a 5-fold cross-validation process.

FIGS. 7a-7d are graphs comparing recognition accuracy of XGBoost classification model to RF.

FIG. 8 is a workflow diagram of a multi-dimensional feature extraction and XGBoost seismic classification system.

Detailed Description

The present invention will be described below with reference to the accompanying drawings.

The invention relates to a seismic classification method based on multidimensional feature extraction and XGBoost. The method rechecks seismic properties; in the multi-dimensional feature extraction it combines seismological theory with digital signal processing, extracts discriminant features from the time domain and the frequency domain, and establishes a training data set; it constructs binary and three-class classification models with the extreme gradient boosting algorithm (XGBoost, Extreme Gradient Boosting), evaluates model stability with different indices, and verifies the generalization capability of the model over a larger range. As shown in the flow chart of FIG. 1, the method comprises the following steps:

1. Seismic property rechecking step: seismic data are collected and preprocessed to establish a standard data set; waveform analysis of the original seismic catalog completes the seismic property recheck and thereby determines the earthquake, explosion and mine earthquake catalogs; and the effective wave band is segmented automatically based on the standard data set and these catalogs.

Specifically, as shown in the technical schematic diagram of fig. 2, which is also a preferred flowchart, format conversion and instrument response correction processing are performed after seismic data are collected to establish a standard data set, seismic property review is completed by performing waveform analysis and seismic position confirmation on an original seismic catalog to determine seismic, explosion and mine earthquake catalogues, and effective wave band automatic segmentation is performed based on the standard data set and the seismic, explosion and mine earthquake catalogues.

2. Multi-dimensional feature extraction step: within the effective wave band, feature values such as the P/S amplitude ratio, high-low frequency energy ratio, corner frequency, waveform duration and wavelet packet energy ratio are extracted from multiple dimensions, and a standardized feature data set is established. The multi-dimensional feature extraction may further involve feature values of other dimensions, such as waveform complexity, cepstrum complexity, zero-crossing rate, autocorrelation coefficients, dominant frequency, third moment of frequency, instantaneous frequency complexity, and the average frequency, mean value and main peak feature value after the Hilbert-Huang transform.

(1) P/S amplitude ratio

In theory, an explosion mainly generates P waves. Owing to the blasting method, complex propagation paths and other factors, shear (S) waves may also be generated, so an explosion shows a strong P wave train and a relatively weak S wave. During an earthquake the rock undergoes shear dislocation, so most earthquakes generate stronger S waves. The P/S amplitude ratio reflects the different development of the P and S phases for earthquakes and explosions, and reduces the influence of magnitude, seismometer magnification and frequency response. First, suitable phase windows are selected using fixed velocity windows so that the main P and S phases are included; second, after Fourier transformation, root-mean-square P/S amplitude ratio criteria are calculated in different frequency bands. The criteria comprise 1 time-domain maximum amplitude ratio and the amplitude ratios of 20 different frequency bands within 0-20 Hz.
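The calculation described above can be sketched as follows, assuming the P and S phase windows have already been cut from the record. The function name `ps_amplitude_ratios`, the equal 1 Hz band widths and the normalization by window length are assumptions made for illustration; the text only specifies one time-domain ratio plus 20 bands within 0-20 Hz.

```python
import numpy as np


def ps_amplitude_ratios(p_win, s_win, fs, n_bands=20, f_max=20.0):
    """Return 1 time-domain max amplitude ratio followed by n_bands RMS spectral ratios.

    p_win, s_win: samples of the P and S phase windows; fs: sampling rate in Hz.
    Windows must be long enough that every band contains at least one FFT bin.
    """
    p_win = np.asarray(p_win, dtype=float)
    s_win = np.asarray(s_win, dtype=float)
    ratios = [np.max(np.abs(p_win)) / np.max(np.abs(s_win))]

    # Amplitude spectra, divided by window length so windows of different size compare.
    p_spec = np.abs(np.fft.rfft(p_win)) / len(p_win)
    s_spec = np.abs(np.fft.rfft(s_win)) / len(s_win)
    p_freq = np.fft.rfftfreq(len(p_win), d=1.0 / fs)
    s_freq = np.fft.rfftfreq(len(s_win), d=1.0 / fs)

    edges = np.linspace(0.0, f_max, n_bands + 1)  # assumed equal-width bands
    for lo, hi in zip(edges[:-1], edges[1:]):
        p_rms = np.sqrt(np.mean(p_spec[(p_freq >= lo) & (p_freq < hi)] ** 2))
        s_rms = np.sqrt(np.mean(s_spec[(s_freq >= lo) & (s_freq < hi)] ** 2))
        ratios.append(p_rms / s_rms)
    return np.array(ratios)
```

With the default arguments the result has 21 entries, matching the 21 P/S amplitude ratios counted in the feature set.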

(2) High-low frequency energy ratio

The time-frequency spectra show that the P and S waves of earthquakes occupy wider frequency bands, whereas those of explosions and mine earthquakes occupy narrower bands. The high-to-low frequency energy ratios of Pg and Sg are extracted as quantitative features, with a high-frequency band of 5-18 Hz and a low-frequency band of 0.05-5 Hz.
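A minimal sketch of this feature, assuming the Pg or Sg window is already available and that energy is measured as the sum of squared FFT amplitudes in each band. The function name `high_low_energy_ratio` is hypothetical; the band limits are the ones quoted above.

```python
import numpy as np


def high_low_energy_ratio(window, fs, low_band=(0.05, 5.0), high_band=(5.0, 18.0)):
    """Ratio of spectral energy in the high band to that in the low band."""
    window = np.asarray(window, dtype=float)
    power = np.abs(np.fft.rfft(window)) ** 2
    freqs = np.fft.rfftfreq(len(window), d=1.0 / fs)
    low = power[(freqs >= low_band[0]) & (freqs < low_band[1])].sum()
    high = power[(freqs >= high_band[0]) & (freqs <= high_band[1])].sum()
    return high / low


# Synthetic check: a 10 Hz component of amplitude 1 over a 2 Hz component of amplitude 2.
t = np.arange(400) / 100.0
ratio = high_low_energy_ratio(np.sin(2 * np.pi * 10 * t) + 2 * np.sin(2 * np.pi * 2 * t), fs=100.0)
```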

(3) Corner frequency

The corner frequency is the intersection of the high-frequency asymptote and the low-frequency asymptote of the amplitude spectrum, and is a physical quantity reflecting the size of the source. In the frequency domain, the invention uses spectral analysis to obtain the source spectrum parameters and calculates the corner frequencies $f_c$ of the P wave and the S wave. The source spectrum describes, in the frequency domain, the seismic waves radiated by the source; it is closely related to the mechanical parameters of the source, and the theoretical spectra radiated by different source models differ. For small and moderate earthquakes, the source spectrum conforms to the Brune circular-disk model (Brune, 1970); neglecting inelastic attenuation, the source displacement spectrum $\Omega(f)$ can be expressed as:

$$\Omega(f) = \frac{\Omega_0}{1 + (f/f_c)^2}$$

The source spectrum model $\Omega(f)$ of Jimenez et al. (2005) can be expressed as:

where $\Omega_0$ is the zero-frequency limit of the displacement spectrum, $f$ is the frequency, and $f_c$ is the corner frequency. Once $\Omega_0$ and $f_c$ are determined, the source displacement spectrum is obtained. Frequencies $f_1$ and $f_2$ are selected in the flat low-frequency part of the source displacement spectrum and $f_3$ in the high-frequency decay part, and the integral of the square of the displacement spectrum is calculated as:

the distance between the calculated theoretical spectrum and the actual observed spectrum is:

$R_\Omega$ is a function of frequency. When $R_\Omega$ is at its minimum, the deviation between the theoretical and observed values is smallest, so the value of $f$ at which $R_\Omega$ is minimal is taken as the corner frequency $f_c$. The P- and S-wave corner frequencies are calculated from the Brune (1970) model and the Jimenez (2005) model, respectively.
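The search for the corner frequency can be illustrated with a simple grid search over the Brune spectrum. This is a sketch under stated assumptions: the exact definition of the misfit is not reproduced in this text, so a sum of squared log-spectral differences is used in its place, and the plateau band used to estimate the zero-frequency level, the candidate grid and the function names are all hypothetical.

```python
import numpy as np


def brune_spectrum(f, omega0, fc):
    """Brune displacement spectrum: omega0 / (1 + (f / fc)^2)."""
    return omega0 / (1.0 + (f / fc) ** 2)


def fit_corner_frequency(freqs, observed, plateau=(0.5, 2.0), candidates=None):
    """Grid search for the corner frequency that minimises the spectral misfit.

    freqs must be positive and observed must be a positive amplitude spectrum.
    The misfit (squared log difference) is an assumed stand-in for the distance
    between theoretical and observed spectra.
    """
    freqs = np.asarray(freqs, dtype=float)
    observed = np.asarray(observed, dtype=float)
    # Estimate the low-frequency level from the assumed flat part of the spectrum.
    in_plateau = (freqs >= plateau[0]) & (freqs <= plateau[1])
    omega0 = observed[in_plateau].mean()
    if candidates is None:
        candidates = np.linspace(freqs.min(), freqs.max(), 400)
    misfits = [
        np.sum((np.log(observed) - np.log(brune_spectrum(freqs, omega0, fc))) ** 2)
        for fc in candidates
    ]
    return omega0, float(candidates[int(np.argmin(misfits))])


# Synthetic spectrum with a known corner frequency of 8 Hz.
f = np.linspace(0.5, 30.0, 300)
omega0_est, fc_est = fit_corner_frequency(f, brune_spectrum(f, 1.0, 8.0))
```

Because the plateau average slightly underestimates the zero-frequency level, the recovered corner frequency on synthetic data is close to, but not exactly, the true value.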

(4) Waveform duration

In general, the larger the magnitude of an event, the longer the coda duration. For the same magnitude, a longer coda duration indicates slower attenuation of the seismic waves. In theory, the seismic waves of artificial explosions and mine earthquakes propagate near the surface, lose much high-frequency energy, and show well-developed surface waves and low-frequency coupled vibration signals; they therefore attenuate more slowly, and their coda lasts longer than that of a natural earthquake of the same magnitude. The waveform duration is defined as follows:

$$t = \frac{t_{coda} - t_p}{\Delta}$$

where $t$ is the waveform duration, $t_p$ is the P-wave arrival time, $t_{coda}$ is the time at which the waveform decays to the noise level, and $\Delta$ is the epicentral distance.
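One possible way to evaluate this definition is sketched below. How the coda end time is picked is not specified in the text, so this sketch assumes a moving-RMS envelope compared against a multiple of the pre-P noise level; the function name `waveform_duration`, the threshold factor and the smoothing length are illustrative assumptions.

```python
import numpy as np


def waveform_duration(trace, fs, t_p, epicentral_distance, noise_factor=2.0, smooth_s=1.0):
    """Distance-normalised duration (t_coda - t_p) / epicentral_distance.

    t_coda is taken as the last time after the P arrival at which a moving-RMS
    envelope exceeds noise_factor times the pre-P noise RMS (an assumed picking rule).
    """
    trace = np.asarray(trace, dtype=float)
    n_p = int(round(t_p * fs))
    win = max(1, int(round(smooth_s * fs)))
    envelope = np.sqrt(np.convolve(trace ** 2, np.ones(win) / win, mode="same"))
    noise_rms = np.sqrt(np.mean(trace[:n_p] ** 2))
    above = np.nonzero(envelope[n_p:] > noise_factor * noise_rms)[0]
    if above.size == 0:
        return 0.0  # signal never rises above the noise threshold
    t_coda = (n_p + above[-1]) / fs
    return (t_coda - t_p) / epicentral_distance
```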

(5) Wavelet packet energy ratio

Wavelet analysis is regarded as a breakthrough beyond Fourier analysis, and the discrete wavelet transform is widely used as a tool in fields such as signal processing, image processing and linguistic analysis. The Mallat dyadic wavelet algorithm is commonly used in wavelet analysis; in dyadic wavelet analysis the signal is decomposed dyadically in the frequency domain toward the low-frequency side only. The wavelet packet method extends the wavelet transform: it dyadically decomposes not only the low-frequency part of the signal but also the high-frequency part, so that the signal band is divided into finer sub-bands, the resolution over the whole band is improved, and more of the detail contained in the signal can be revealed.

Define the subspace $U_j^{n}$ as the closure space of the function $u_n(t)$ and $U_j^{2n}$ as the closure space of the function $u_{2n}(t)$, and let $u_n(t)$ satisfy the following two-scale equation:

$g(k) = (-1)^k h(1-k)$, i.e. the two sets of coefficients are orthogonal. When $n = 0$, the above equation yields

where $\varphi(t)$ and $\psi(t)$ are the scaling function and the wavelet basis function of the multi-resolution analysis, respectively. The set $\{u_n(t)\}$ (where $n \in Z_+$) is called the wavelet packet determined by the basis function $u_0(t) = \varphi(t)$. If a signal undergoes an $m$-level wavelet packet decomposition, $2^m$ decomposed signals are obtained. To obtain fine frequency components, the signal is decomposed to 4 levels with the sym5 wavelet basis function, giving $2^4 = 16$ node signals. The ratio of the energy of each node's decomposed signal to the total signal energy is taken as the wavelet packet energy ratio, and the 16 wavelet packet energy ratios of the P wave and of the S wave are each obtained as quantitative criteria.
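The energy-ratio calculation can be sketched with a self-contained wavelet packet decomposition. Note the substitution: the text uses the sym5 basis, which would normally come from a wavelet library, whereas this sketch swaps in the Haar filter pair so that it needs only NumPy. The function name is hypothetical, and the 16 nodes are returned in natural (filter-bank) order rather than sorted by frequency.

```python
import numpy as np


def haar_wavelet_packet_energy_ratio(signal, levels=4):
    """Energy ratios of the 2**levels terminal nodes of a Haar wavelet packet tree.

    Haar is used here only as a stand-in for sym5. The signal is truncated to a
    multiple of 2**levels samples and must not be identically zero.
    """
    x = np.asarray(signal, dtype=float)
    n = (len(x) // 2 ** levels) * 2 ** levels
    nodes = [x[:n]]
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            even, odd = node[0::2], node[1::2]
            next_nodes.append((even + odd) / np.sqrt(2.0))  # low-pass branch
            next_nodes.append((even - odd) / np.sqrt(2.0))  # high-pass branch
        nodes = next_nodes  # both branches are split again at the next level
    energies = np.array([np.sum(node ** 2) for node in nodes])
    return energies / energies.sum()


# 16 ratios for a P or S window; they sum to 1 because the transform is orthonormal.
ratios = haar_wavelet_packet_energy_ratio(np.random.default_rng(0).standard_normal(256))
```

Applying the function separately to the P-wave and S-wave windows gives 16 + 16 = 32 values, matching the 32 wavelet packet energy ratios counted in the feature set.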

In the above embodiment, the multi-dimensional feature extraction quantifies 88 features in total: 4 corner frequencies, 1 duration, 2 high-low frequency energy ratios, 2 waveform complexities, 2 zero-crossing rates, 2 instantaneous frequency complexities, 2 cepstrum complexities, 21 P/S amplitude ratios, 2 autocorrelation coefficients, 2 autocorrelation coefficient means, 2 autocorrelation coefficient variances, 2 dominant frequencies, 2 third moments of frequency, 32 wavelet packet energy ratios, 8 average frequencies after the Hilbert-Huang transform, 1 mean value after the Hilbert-Huang transform, and 1 main peak after the Hilbert-Huang transform. The standardized feature data set consists of these 88 feature values; FIG. 3 shows the multi-dimensional feature profiles of earthquakes, explosions and mine earthquakes. The 88 features are only an exemplary scheme and not a limitation: N quantified features may be extracted from any combination of features in multiple dimensions.

3. Feature analysis and feature selection step: feature importance calculation and feature selection are performed on the feature data set with the XGBoost algorithm and the random forest (RF) algorithm respectively, taking into account the correlation between features and classes while ensuring the importance of individual features, and the optimal feature data set is screened out.

Preferably, feature importance is analyzed with the XGBoost and random forest (RF) algorithms. The Gini importance, a measure of variable importance based on the Gini impurity index, is computed, and the importance scores of the 88 features are normalized so that they sum to 1. FIG. 4a shows the feature importance scores calculated by XGBoost, and FIG. 4b those calculated by RF. P/S amplitude ratios have long played an important role in earthquake/explosion classification; the importance scores that XGBoost and RF assign to the 88 features differ slightly, but the high-frequency P/S amplitude ratios rank high in both. The number of features used for classification is selected according to the feature importance scores; FIG. 5a shows the recognition accuracy of XGBoost for different numbers of features, and FIG. 5b that of RF. The figures show which of the 88 features contribute little, and that a high recognition rate is obtained when 60 to 88 features are selected. In this embodiment all 88 features contribute to the model to some extent, so no features are discarded; that is, the screened optimal feature data set consists of all the features. This embodiment is only an example and not the only solution: features are selected according to the feature importance scores, and 60, 66, 70 or another number of features may be chosen as the screened optimal feature data set.
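The normalization and ranking described above can be sketched as follows. The raw importances are assumed to come from already fitted models (for example, the `feature_importances_` attribute that both scikit-learn's random forest and the XGBoost scikit-learn wrapper expose); the function names and the toy values below are hypothetical.

```python
import numpy as np


def normalize_importance(raw_importance):
    """Scale raw importance values so that all scores sum to 1."""
    raw = np.asarray(raw_importance, dtype=float)
    return raw / raw.sum()


def select_top_features(scores, names, k):
    """Return the names of the k features with the highest importance scores."""
    order = np.argsort(scores)[::-1][:k]
    return [names[i] for i in order]


# Toy example with three hypothetical features.
scores = normalize_importance([2.0, 6.0, 2.0])
best = select_top_features(scores, ["duration", "ps_ratio_15hz", "corner_freq_p"], k=1)
```

Repeating the selection for several values of k and recording the validation accuracy for each would produce curves of the kind shown in FIGS. 5a-5b.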

4. XGBoost algorithm training and classification step: the screened optimal feature data set is split into a training set and a validation set, an XGBoost classification model is constructed on the training set based on the XGBoost algorithm, model training and validation are performed by 5-fold cross-validation, the parameters of the XGBoost classification model are determined, and the generalization capability of the XGBoost classification model is tested with test data sets from different regions to determine the optimal XGBoost classification model for seismic classification.

(1) Training set preprocessing

Preferably, the training feature data set is further standardized or normalized. This can be understood as preprocessing of the screened optimal feature data set, or equivalently as preprocessing of the optimal feature data set created by the feature analysis and feature selection of the third step. Some features in the feature data set span a large value range while others span a small one, so a feature may be regarded as important by a neural network algorithm merely because its values are large, and a large range of variation can also make the weights w difficult to train. In this embodiment, normalization is mainly used. The normalization formula is as follows:

$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$

where X represents a feature, X_min represents the minimum value of this feature, X_max represents the maximum value of this feature, and X' is the normalized value.
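
As a minimal sketch of this min-max normalization, the function below rescales one feature column to the interval [0, 1]. The function name, the sample values and the handling of a constant column are assumptions made for illustration only.

```python
def min_max_normalize(column):
    """Map one feature column to [0, 1] via (X - X_min) / (X_max - X_min)."""
    x_min, x_max = min(column), max(column)
    span = x_max - x_min
    if span == 0:
        # Constant feature: map every value to 0.0 (an assumption of this sketch).
        return [0.0 for _ in column]
    return [(x - x_min) / span for x in column]


corner_freq = [2.0, 4.0, 10.0]  # hypothetical corner frequencies in Hz
scaled = min_max_normalize(corner_freq)
```

When the model is later applied to validation or test data, the X_min and X_max obtained from the training set would normally be reused so that all data sets share the same scale.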

(2) Training set and validation set splitting

To verify the performance of the XGBoost classification model, the principle of 5-fold cross-validation is adopted, as shown in FIG. 6. The training data set is randomly divided into five subsets; in each of five iterations (Iteration 1 to Iteration 5), the model is trained on four of the subsets and validated on the remaining one (in the figure, white fill marks the training set and black fill marks the validation set), with a different subset serving as the validation set each time. This yields ACC1, ACC2, ACC3, ACC4 and ACC5, and the validation result ACC is their average. To test the stability of the five artificial intelligence methods, ten rounds of 5-fold cross-validation are performed for each classifier, with the training data randomly resampled in each round, giving a total of 50 validation results.
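
The splitting scheme can be sketched as follows. This is an illustrative index generator, not the implementation used in the embodiment; the function name, the sample count of 100 and the random seed are assumptions.

```python
import random


def repeated_kfold(n_samples, k=5, repeats=10, seed=0):
    """Yield (train_idx, val_idx) pairs for `repeats` rounds of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # new random partition in every round
        folds = [indices[i::k] for i in range(k)]
        for i in range(k):
            val_idx = folds[i]
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train_idx, val_idx


# Ten rounds of 5-fold cross-validation give 50 train/validation splits.
splits = list(repeated_kfold(n_samples=100))
```

Training a classifier on each `train_idx`, scoring it on the matching `val_idx`, and averaging the five scores of one round gives the ACC of that round.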

(3) Design XGBoost classification model

The gradient boosting decision tree algorithm is a boosting algorithm proposed by Friedman (2001). In each iteration it constructs a weak classifier such that the loss function decreases along the gradient direction, and it then combines the results of the weak classifiers with certain weights into a strong classifier that gives the final prediction. Extreme Gradient Boosting (XGBoost) is an improvement of the gradient boosting algorithm: it applies a second-order Taylor expansion to the loss function, using the second derivative in addition to the first derivative so that the model converges quickly, and it adds a regularization term to the loss function to prevent overfitting.

The regularized objective function of XGBoost (Chen and Guestrin, 2016) is:

$$L = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2$$

where ŷ_i is the model prediction, y_i is the true value, and l is a loss function that measures the difference between the predicted value and the true value; K is the number of classification and regression trees; Ω is the regularization penalty function; f_k is the model of the k-th tree. Regularization helps to smooth the learned weights and avoid overfitting. Intuitively, the regularized objective tends to select a simple predictive model, and it reduces to conventional gradient tree boosting when the regularization parameters are set to 0. γ and λ are the penalty coefficients in the regularization term and keep the decision tree model from becoming too complex; T is the number of leaf nodes, and w is the vector of leaf-node weights. Unlike a decision tree, each regression tree carries a weight on each leaf. For the t-th iteration, the objective function of the model is:

$$L^{(t)} = \sum_{i} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$

where L^(t) is the objective function of the t-th iteration; ŷ_i^(t-1) is the predicted value after the previous t-1 iterations; f_t is the classification and regression tree model of the t-th iteration; and Ω(f_t), the regularization term of the t-th iteration, helps reduce the effect of overfitting. The objective function can be optimized quickly by using a second-order expansion:

$$L^{(t)} \approx \sum_{i} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

where g_i and h_i are the first and second derivatives of the loss function with respect to ŷ_i^(t-1). Removing the constant term and defining I_j = {i | q(x_i) = j} as the instance set of leaf node j, the objective function after the t-th iteration is:

$$\tilde{L}^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T$$

where w_j is the weight of leaf node j. For a fixed decision tree structure, the optimal weight of leaf node j is:

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$

Substituting w_j^* into the objective function gives:

$$\tilde{L}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$

The above formula can serve as a scoring function that measures the quality of a tree structure q. It is generally impossible to enumerate all possible tree structures, however, so XGBoost uses a greedy algorithm that starts from a single leaf and iteratively adds branches to the tree. Suppose I_L and I_R are the instance sets of the left and right nodes after a split, i.e. I = I_L ∪ I_R; the loss reduction after the split is:

$$L_{split} = \frac{1}{2} \left[ \frac{\left( \sum_{i \in I_L} g_i \right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left( \sum_{i \in I_R} g_i \right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left( \sum_{i \in I} g_i \right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma$$
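
The leaf weight and the split score used by the greedy algorithm can be checked numerically with a short sketch following Chen and Guestrin (2016). Here G and H denote the sums of the first derivatives g_i and second derivatives h_i over the samples of a node, and λ and γ are the penalty coefficients of the regularization term. The function names and the gradient values are hypothetical.

```python
def leaf_weight(g, h, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -sum(g) / (sum(h) + lam)


def leaf_score(g, h, lam):
    """Structure-score term G^2 / (H + lambda) for one leaf."""
    return sum(g) ** 2 / (sum(h) + lam)


def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Loss reduction obtained by splitting one leaf into two children."""
    parent = leaf_score(g_left + g_right, h_left + h_right, lam)
    children = leaf_score(g_left, h_left, lam) + leaf_score(g_right, h_right, lam)
    return 0.5 * (children - parent) - gamma


# Hypothetical gradients: two samples fall into each child node.
g_l, h_l = [-1.0, -1.0], [1.0, 1.0]
g_r, h_r = [1.0, 1.0], [1.0, 1.0]
gain = split_gain(g_l, h_l, g_r, h_r, lam=1.0, gamma=0.1)
w_left = leaf_weight(g_l, h_l, lam=1.0)
```

A candidate split is worth keeping only when its gain is positive, which is how a larger γ makes the tree more conservative.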

(4) XGBoost classification model parameter optimization process

The XGBoost classification model mainly has the following parameters:

(1) The number of iterations t, which is the number of classification and regression trees generated. It has a large influence on model performance: if t is too large, the model tends to overfit and its generalization capability decreases;

(2) The learning rate eta, which controls the step size of each iteration and improves the stability of the model. If the learning rate is too high, the recognition accuracy of the model decreases; if it is too low, the running speed of the model suffers;

(3) The maximum tree depth Dmax and the minimum sample weight Wmin in a child node. If the sum of the sample weights in a leaf node is smaller than the set Wmin, splitting of that leaf node ends. These parameters control the complexity of the model: a tree that is too shallow reduces the accuracy of the model, while a tree that is too deep causes overfitting and reduces the generalization capability of the model;

(4) The minimum loss-function reduction r required to split a leaf node. The larger r is, the more conservative the algorithm and the longer the computation time;

(5) The random sampling proportion S, a parameter that increases the randomness of the model. If S is set to 0.8, 80% of the samples are randomly drawn to build each tree; tuning S can improve the stability of the model and increase the final correct recognition rate.

The present invention uses the GridSearchCV function in scikit-learn to optimize the main parameters of the XGBoost classification model. The resulting model parameters are as follows: the base learner is gbtree, the number of decision trees t = 500, the maximum tree depth Dmax = 6, the minimum sample weight Wmin = 1, the learning rate eta = 0.2, the minimum loss-function reduction r = 0.1, and the random sampling proportion of each tree S = 0.8.
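
For orientation, the reported values can be written with the parameter names of the xgboost scikit-learn interface. The mapping of the symbols t, Dmax, Wmin, eta, r and S onto these names is an assumption of this sketch, and the search grid shown is purely hypothetical, since the candidate values actually searched are not stated.

```python
from itertools import product

# Reported optimum, expressed with xgboost's scikit-learn parameter names
# (the symbol-to-name mapping is an assumption of this sketch).
best_params = {
    "booster": "gbtree",
    "n_estimators": 500,    # t, number of trees
    "max_depth": 6,         # Dmax
    "min_child_weight": 1,  # Wmin
    "learning_rate": 0.2,   # eta
    "gamma": 0.1,           # r, minimum loss reduction
    "subsample": 0.8,       # S
}

# Hypothetical search grid; the candidate values are illustrative only.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.1, 0.2],
}


def expand_grid(grid):
    """List every parameter combination an exhaustive grid search would try."""
    names = sorted(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


candidates = expand_grid(param_grid)
```

An exhaustive search such as `GridSearchCV(estimator, param_grid, cv=5)` evaluates each of these combinations by cross-validation and keeps the best-scoring one.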

(5) Model Performance evaluation

The performance of the XGBoost model is evaluated with several model performance measures: the confusion matrix, the receiver operating characteristic (ROC) curve, and the area under the ROC curve (AUC). Accuracy, the true positive rate (TPR) and the true negative rate (TNR) serve as the indices for evaluating model performance and are defined as follows:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \qquad TPR = \frac{TP}{TP + FN}, \qquad TNR = \frac{TN}{TN + FP}$$

where TP is the number of positive-class samples correctly identified as positive, FN is the number of positive-class samples misidentified as negative, FP is the number of negative-class samples misidentified as positive, and TN is the number of negative-class samples correctly identified as negative. The results are shown in FIGS. 7a-7d: FIG. 7a compares the accuracy of the earthquake and explosion classification model with that of RF; FIG. 7b compares the accuracy of the earthquake and mine earthquake classification model with that of RF; FIG. 7c compares the accuracy of the explosion and mine earthquake classification model with that of RF; FIG. 7d compares the accuracy of the earthquake, explosion and mine earthquake classification model with that of RF.
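
These three indices follow directly from the four confusion-matrix counts, where TP and TN are the correctly identified positive and negative samples and FN and FP are the misidentified ones. The sketch below uses hypothetical counts, not results from the embodiment.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute accuracy, TPR and TNR from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tp / (tp + fn),  # share of positives correctly identified
        "tnr": tn / (tn + fp),  # share of negatives correctly identified
    }


# Hypothetical counts for 50 positive and 50 negative events.
m = binary_metrics(tp=45, fn=5, fp=10, tn=40)
```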

(6) Model generalization capability verification

The generalization capability of the model is verified using test data sets from different regions of China. The first test data set consists of earthquakes, explosions and mine earthquakes that occurred within the study area during the last two years; the second test data set consists of earthquakes, explosions and mine earthquakes located in other regions of China. The generalization capability of the two-class and three-class models is evaluated by the accuracy, true positive rate (TPR) and true negative rate (TNR) indices.

FIG. 8 is a flow chart of establishing the XGBoost classification model with the XGBoost algorithm. In the seismic classification method based on multidimensional feature extraction and XGBoost, the seismic records of earthquake, explosion and mine earthquake events are first preprocessed, and waveforms with abnormal records or a low signal-to-noise ratio are removed. An 88-dimensional feature training set is then established by feature quantization, and missing values, abnormal values, standardization and the like are handled in the feature training set. Two-class and three-class models are constructed with the XGBoost algorithm, and the XGBoost hyperparameters are optimized with the GridSearchCV function in scikit-learn. The classification models are trained and validated by 5-fold cross-validation to determine the XGBoost model parameters, and model performance is evaluated with indices such as accuracy, ROC, AUC, TPR and TNR. Finally, the generalization capability of the model is tested on test data sets from different regions, and the optimal classification model is selected.

The invention also relates to a seismic classification system based on multi-dimensional feature extraction and XGBoost. The system corresponds to the seismic classification method based on multi-dimensional feature extraction and XGBoost described above and can be understood as a system for implementing that method. It comprises a seismic property rechecking module, a multi-dimensional feature extraction module, a feature analysis and feature selection module and an XGBoost algorithm training classification module, which are connected in sequence; the workflow is shown in FIG. 8. The seismic property rechecking module acquires seismic data and preprocesses it to establish a standard data set, performs waveform analysis on the original seismic catalog to complete the seismic property recheck and thereby determine the earthquake, explosion and mine earthquake catalogs, and performs automatic segmentation of the effective wave band based on the standard data set and the earthquake, explosion and mine earthquake catalogs. The multi-dimensional feature extraction module extracts multi-dimensional feature values, including the P/S amplitude ratio, high-low frequency energy ratio, corner frequency, waveform duration and wavelet packet energy ratio, within the effective wave band to establish a standardized feature data set. The feature analysis and feature selection module calculates feature importance scores with the XGBoost algorithm and the RF algorithm and performs feature selection according to these scores to screen out an optimal feature data set. The XGBoost algorithm training classification module splits the screened optimal feature data set into a training set and a test set, constructs an XGBoost classification model on the training set based on the XGBoost algorithm, handles missing values, abnormal values, standardization and the like in the feature training set, performs model training and validation by 5-fold cross-validation to determine the XGBoost classification model parameters, and tests the generalization capability of the XGBoost classification model on test data sets from different regions to determine the optimal XGBoost classification model for seismic classification.

Preferably, the multi-dimensional feature extraction module extracts, within the effective wave band, the corner frequency, waveform duration, high-low frequency energy ratio, waveform complexity, zero-crossing rate, instantaneous frequency complexity, cepstrum complexity, P/S amplitude ratio, autocorrelation coefficient, predominant frequency, frequency third moment, wavelet packet energy ratio, and the average frequency, mean value and main peak feature values after the Hilbert-Huang transform, and establishes a standardized feature data set.

Preferably, the feature analysis and feature selection module calculates the Gini importance of each feature with the XGBoost algorithm and the RF algorithm, normalizes it to obtain the feature importance scores, and selects the number of features used for seismic classification based on the feature importance scores to screen out an optimal feature data set.

Preferably, the XGBoost algorithm training classification module further performs XGBoost classification model parameter optimization and model evaluation. The XGBoost classification model parameters are optimized with the GridSearchCV function and comprise the number of iterations, learning rate, minimum sample weight, minimum loss-function reduction and random sampling proportion; the performance of the XGBoost classification model is evaluated with the confusion matrix, the ROC curve and the AUC value.

According to the seismic classification method and system based on multi-dimensional feature extraction and XGBoost, which combine seismic property rechecking, multi-dimensional feature value extraction and XGBoost algorithm training, the seismic records of earthquake, explosion and mine earthquake events are preprocessed and waveforms with abnormal records or a low signal-to-noise ratio are removed. 88 features, such as corner frequency, duration, high-low frequency energy ratio, P/S amplitude ratio and wavelet packet energy ratio, can be extracted, which effectively highlights the differences between the records of different types of seismic events; missing values, abnormal values, standardization and the like are handled in the feature training set, and a standardized feature data set is established. Feature importance scores are calculated and features are selected with the XGBoost algorithm and the RF algorithm to screen out an optimal feature data set. Two-class and three-class seismic classification models are established using feature extraction and the XGBoost algorithm, and the XGBoost hyperparameters are optimized with the GridSearchCV function in scikit-learn to determine the classification model parameters, so that the classification capability is stable and the classification accuracy for earthquakes, explosions and mine earthquakes is higher than 90%. The XGBoost classification model is trained and validated by 5-fold cross-validation to determine its parameters, model performance is evaluated with indices such as accuracy, ROC, AUC, TPR and TNR, the generalization capability of the model is tested on test data sets from different regions, and the optimal classification model is selected. The method is simple and easy to use, is suitable for the operational requirements of earthquake classification and microseism classification, has a low implementation cost, and has wide popularization value.

It should be noted that the above-described embodiments enable those skilled in the art to understand the invention more fully, but do not limit it in any way. Therefore, although the present invention has been described in detail with reference to the drawings and embodiments, those skilled in the art will understand that the present invention may still be modified or equivalently substituted, and all technical solutions and modifications that do not depart from the spirit and scope of the present invention are intended to be included in the scope of protection of the present invention.

Claims

1. The earthquake classification method based on multidimensional feature extraction and XGBoost is characterized by comprising the following steps of:

2. The method according to claim 1, wherein in the seismic property rechecking step, format conversion and instrument response correction are performed after the seismic data are collected to create a standard data set, and the seismic property recheck is completed by performing waveform analysis and event location confirmation on the original seismic catalog to determine the earthquake, explosion and mine earthquake catalogs.

3. The seismic classification method of claim 1, wherein in the multi-dimensional feature extraction step, a standardized feature data set is created from the corner frequency, waveform duration, high-low frequency energy ratio, waveform complexity, zero-crossing rate, instantaneous frequency complexity, cepstrum complexity, P/S amplitude ratio, autocorrelation coefficient, predominant frequency, frequency third moment, wavelet packet energy ratio, and the average frequency, mean value and main peak feature values after the Hilbert-Huang transform, all extracted within the effective wave band.

4. The seismic classification method according to claim 3, wherein in the multi-dimensional feature extraction step, the P/S amplitude ratio is extracted by selecting seismic phase windows containing the P-wave and S-wave signals using specific velocity windows and, after Fourier transformation, calculating root-mean-square P/S amplitude ratios in different frequency bands, the P/S amplitude ratios including a time-domain maximum amplitude ratio and amplitude ratios in a plurality of different frequency bands; the corner frequency is the intersection of the high-frequency and low-frequency asymptotic trends of the amplitude spectrum, and the source spectrum parameters are obtained by spectral analysis to calculate the corner frequencies of the P wave and the S wave; the wavelet packet energy ratio is extracted by performing a four-level decomposition of the P wave and the S wave with the sym5 wavelet packet function and calculating the ratio of the energy of each decomposed signal to the total signal energy; the cepstrum complexity is extracted based on the oscillation characteristics of the earthquake amplitude spectrum and the simple nature of the explosion source, with cepstrum criterion values extracted for the P waves and S waves of earthquakes and explosions respectively; the instantaneous frequency complexity is extracted by first applying the Hilbert transform to the frequency signal to obtain the analytic signal, then obtaining the instantaneous frequency through the Wigner distribution and empirical mode decomposition, and finally calculating the instantaneous frequency complexity.

5. The method according to any one of claims 1 to 4, wherein in the feature analysis and feature selection step, the Gini importance of each feature is calculated by the XGBoost algorithm and the random forest algorithm and normalized to obtain feature importance scores, and the number of features used for seismic classification is selected based on the feature importance scores to screen out an optimal feature data set.

6. The seismic classification method according to any one of claims 1 to 4, wherein the XGBoost algorithm training classification step further includes XGBoost classification model parameter optimization and model evaluation, the XGBoost classification model parameters being optimized with a GridSearchCV function and including the number of iterations, learning rate, minimum sample weight, minimum loss-function reduction and random sampling proportion; and the performance of the XGBoost classification model is evaluated with the confusion matrix, the ROC curve and the AUC value.

7. The seismic classification method according to claim 5, wherein in the XGBoost algorithm training classification step, the XGBoost classification models constructed on the training set based on the XGBoost algorithm include two-class models and three-class models, and the accuracy, true positive rate and true negative rate indices are used to evaluate the generalization capability of the two-class and three-class models.

8. A seismic classification system based on multidimensional feature extraction and XGBoost is characterized by comprising a seismic property rechecking module, a multidimensional feature extraction module, a feature analysis and feature selection module and an XGBoost algorithm training classification module which are connected in sequence,

9. The seismic classification system of claim 8, wherein the seismic property rechecking module performs format conversion and instrument response correction after acquiring the seismic data to create a standard data set, and completes the seismic property recheck by performing waveform analysis and event location confirmation on the original seismic catalog to determine the earthquake, explosion and mine earthquake catalogs.

10. The seismic classification system of claim 8 or 9, wherein the multi-dimensional feature extraction module is configured to extract the corner frequency, waveform duration, high-low frequency energy ratio, waveform complexity, zero-crossing rate, instantaneous frequency complexity, cepstrum complexity, P/S amplitude ratio, autocorrelation coefficient, predominant frequency, frequency third moment, wavelet packet energy ratio, and the average frequency, mean value and main peak feature values after the Hilbert-Huang transform, to create a standardized feature data set;