CN117497095A

CN117497095A - Prediction method of bond dissociation energy of energetic material based on feature fusion and data enhancement

Info

Publication number: CN117497095A
Application number: CN202311534351.6A
Authority: CN
Inventors: 蒲雪梅; 苟巧林; 刘静; 郭延芝; 徐司雨
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2023-11-17
Filing date: 2023-11-17
Publication date: 2024-02-02
Anticipated expiration: 2043-11-17

Abstract

The invention discloses a method for predicting the bond dissociation energy (BDE) of energetic materials based on feature fusion and data enhancement. The method constructs a BDE data set of energetic materials; constructs a fusion descriptor for each molecule in the data set; and divides the data set into an initial training set and an independent test set, repeating the division 20 times. After each division, the initial training set is expanded by a data enhancement method, an XGBoost model is optimized by grid search, and the trained XGBoost model predicts the independent test set to evaluate model performance. The average of the 20 test results is taken as the final performance of the model. The resulting BDE prediction model performs significantly better than previously reported BDE prediction models. By combining the fusion descriptor with the data enhancement strategy, the method overcomes the limitation that the small sample size of energetic data places on model performance, thereby improving the prediction accuracy of the bond dissociation energy of energetic materials.

Description

Prediction method of bond dissociation energy of energetic material based on feature fusion and data enhancement

Technical Field

The invention relates to the technical field of energetic material bond dissociation energy, in particular to a prediction method of energetic material bond dissociation energy based on feature fusion and data enhancement.

Background

An energetic material is a substance that can release a large amount of energy almost instantaneously through chemical reaction when stimulated by certain external conditions, and is widely used in fireworks, explosives, propellants and the like. Stability is of great importance for energetic compounds: for many years, accidental explosions caused by unintentional impact, accidental ignition or fire during transportation, storage and use have had disastrous consequences. While pursuing high detonation performance, researchers are therefore continuously striving to develop novel energetic materials with good stability to meet future demands in national defense, the military and other fields. The sensitivity of an energetic material is an indicator of its stability: the greater the sensitivity, the more readily the molecule reacts and explodes when subjected to external stimuli (e.g., impact, static electricity, friction, flame). However, sensitivity measurements tend to be poorly reproducible, which makes analysis based on experimental values difficult. Researchers have therefore linked impact sensitivity to the bond dissociation energy (BDE) of the weakest bond of the energetic molecule, which is typically an X-NO₂ bond (X = C, N, O). Because the BDE of this bond correlates well with the impact sensitivity of the explosive, it is more convenient and accurate to characterize the stability of an energetic molecule by a BDE obtained from quantum chemical calculation than by experimentally measured sensitivity. Generally, the greater the BDE of the weakest bond (i.e., the pyrolysis initiation bond) of the energetic molecule, the better its stability and the lower its sensitivity, so the bond dissociation energy of the pyrolysis initiation bond is of great importance in the study of energetic materials.

Although BDEs can be determined experimentally by various means, experimental determination is complex and time-consuming. The number of experimentally measured BDEs is only about one ten-millionth of the number of currently registered molecules, and data are available only for a few molecules with fewer than 10 heavy atoms. BDE calculation methods based on quantum chemistry can reach an accuracy comparable to experiment and have become the main means of obtaining bond dissociation energies. However, searching a large unexplored chemical space for highly stable energetic molecules in this way requires a great deal of time and computational cost, which is impractical for the design of novel energetic molecules. There is therefore an urgent need for an efficient and accurate method that can rapidly screen energetic materials with excellent stability across a wide search space. In recent years, data-driven machine learning (ML) has made significant progress in materials science and chemistry. Although some ML methods have been used to rapidly predict properties of energetic materials, predictive models for the BDE of energetic molecules are still lacking, and the accuracy of the few ML models that do predict the BDE of energetic molecules is still relatively low. In addition, because of long experimental development cycles and high risk, energetic materials have long suffered from data scarcity, and the lack of high-quality energetic data further limits their development.

Although some studies have been made to predict bond dissociation energy using machine learning methods, the following problems still remain:

A bond dissociation energy prediction model built on compounds that are not genuine energetic materials (such as nitro compounds or mono-benzene ring derivatives) has limited extrapolation capability in practical applications. Because such data lack the structural features of classical energetic backbones and of energetic substituents other than nitro groups, the model may perform poorly when predicting molecules in a search space built from real energetic backbones combined with substituents. Bond dissociation energy prediction models developed for other systems are highly accurate within those systems, such as common small organic molecules, but their accuracy on energetic molecules is low because their data sets contain few energetic molecules and their feature representations do not capture the unique structural characteristics of energetic compounds. In general, existing bond dissociation energy prediction models lack truly reliable energetic molecular data, and the descriptors they use lack features that fully reflect both the energetic characteristics and the dissociating bond.

Disclosure of Invention

The invention aims to provide a prediction method for the dissociation energy of an energetic material bond based on feature fusion and data enhancement, which is used for solving the problem of low accuracy of predicting an energetic molecule pyrolysis initiation bond BDE in the prior art.

The invention solves the problems by the following technical proposal:

a method of predicting bond dissociation energy of an energetic material based on feature fusion and data enhancement, comprising:

step S100, constructing an energetic material bond dissociation energy (BDE) data set;

step S200, constructing a bond dissociation energy fusion descriptor for each molecule in the energetic material BDE data set;

step S300, dividing the energetic material BDE data set into an initial training set and an independent test set in a set proportion, repeating the division M times, and executing the following steps for the initial training set and the independent test set obtained from each division:

expanding the initial training set by a data enhancement method; dividing the expanded initial training set again into a training set and a validation set in a set proportion; training an XGBoost model on the training set; searching for the optimal hyperparameters of the model on the validation set by grid search; and, after the optimal model is determined, predicting the bond dissociation energies of the molecules in the independent test set as the result of the independent test of the XGBoost model;

step S400, taking the average of the M model evaluation results as the final performance result of the model, with the optimal model taken as the final XGBoost model;

and step S500, predicting the bond dissociation energy of energetic materials using the trained XGBoost model.

Further, the step S100 specifically includes:

step S110, collecting synthesized energetic compounds composed of the elements C, H, O and N, and constructing an initial energetic material BDE data set;

step S120, optimizing the structure of each molecule in the data set, extracting the optimized structure file of each molecule, and obtaining the total energy of the molecule at 0 K;

step S130, calculating the Wiberg bond order of each bond in the molecule, wherein the smaller the bond order, the more likely the chemical bond is to be the pyrolysis initiation bond, so as to determine the initiation bond;

step S140, determining the cleavage position of the molecule according to the initiation bond, and homolytically cleaving each molecule into two free radicals;

step S150, optimizing the structures of the two free radicals of each molecule separately to obtain the total energies of the two free radicals at 0 K;

step S160, calculating the difference between the sum of the energies of the two free radicals generated by homolytic cleavage of each molecule and the energy of the original molecule to obtain the bond dissociation energy of each molecule;

step S170, compiling the SMILES of the energetic molecules and their bond dissociation energy values as the final energetic material BDE data set.
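Steps S130 and S160 can be illustrated with a minimal sketch. It assumes that total energies are given in Hartree and converted to kJ/mol; the bond orders and energies below are hypothetical values chosen only to show the arithmetic, and the function names are illustrative rather than part of the disclosed method.

```python
# Conversion factor from Hartree to kJ/mol (rounded).
HARTREE_TO_KJ_PER_MOL = 2625.4996


def pick_initiation_bond(wiberg_orders):
    """Return the bond (atom index pair) with the smallest Wiberg bond order."""
    return min(wiberg_orders, key=wiberg_orders.get)


def bond_dissociation_energy(e_molecule, e_radical_a, e_radical_b):
    """BDE = E(radical A) + E(radical B) - E(molecule), in kJ/mol.

    All inputs are assumed to be total energies at 0 K in Hartree.
    """
    return (e_radical_a + e_radical_b - e_molecule) * HARTREE_TO_KJ_PER_MOL


# Hypothetical bond orders for candidate bonds, keyed by atom-index pairs.
orders = {(1, 2): 1.32, (2, 5): 0.91, (5, 6): 1.48}
weakest = pick_initiation_bond(orders)

# Hypothetical energies (Hartree) for the molecule and its two radicals.
bde = bond_dissociation_energy(-885.000, -680.000, -204.920)
print(weakest, round(bde, 2))
```

In this toy input the bond between atoms 2 and 5 has the lowest bond order and would be taken as the initiation bond.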

Further, the step S200 specifically includes:

step S210, converting the SMILES of each molecule into a mol object, adding hydrogen atoms and generating a 3D conformation, and finally performing MMFF94 force field optimization to generate the corresponding sdf file;

step S220, generating a txt file from the sdf file of each molecule using chemical bond descriptor generation software, wherein the txt file of each molecule contains the chemical bond descriptors of all acyclic bonds in the molecule, each row of the txt file is the descriptor of one bond and has 100 dimensions in total, the first two dimensions being the indices of the atoms at the two ends of the bond and the last 98 dimensions being the chemical bond descriptor of the bond;

step S230, determining the indices of the atoms at the two ends of the broken bond of each molecule according to the cleavage position determined in step S140, and extracting the 98-dimensional descriptor from the corresponding row of the txt file of the molecule as the chemical environment descriptor;

step S240, calculating the energetic feature descriptor of each energetic molecule, which has 50 dimensions in total;

and step S250, concatenating the chemical environment descriptor and the energetic feature descriptor to obtain the 148-dimensional bond dissociation energy fusion descriptor of each molecule.
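Steps S230 and S250 amount to a row lookup followed by a concatenation. The sketch below assumes the txt file has already been parsed into rows of 100 numbers and that a bond may be listed with its two atom indices in either order; both are assumptions, as is the helper naming. The 50 energetic dimensions appear to correspond to the 22 SOB, 13 E-state and 15 CDS features described later in this document.

```python
def bond_row(rows, atom_a, atom_b):
    """Return the 98 descriptor values of the row whose first two entries match the bond."""
    for row in rows:
        if {int(row[0]), int(row[1])} == {atom_a, atom_b}:
            return row[2:]
    raise KeyError(f"no row for bond {atom_a}-{atom_b}")


def fuse(chemical_environment, energetic_features):
    """Concatenate the bond-level (98) and molecule-level (50) descriptors."""
    if len(chemical_environment) != 98 or len(energetic_features) != 50:
        raise ValueError("expected 98 + 50 dimensions")
    return list(chemical_environment) + list(energetic_features)


# Toy stand-in for a parsed txt file: two bonds, 100 values per row.
rows = [
    [3.0, 7.0] + [0.0] * 98,
    [7.0, 9.0] + [1.0] * 98,
]
energetic = [0.5] * 50  # placeholder for the 50 energetic feature values
fusion = fuse(bond_row(rows, 9, 7), energetic)
print(len(fusion))
```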

Further, the specific calculation method of the chemical environment descriptor comprises the following steps:

a. naming each atom according to its element type and substitution number for the molecules in the energetic material BDE data set, wherein the element types are limited to C, H, O and N;

b. selecting one chemical bond in the molecule as the target bond of the calculation, and encoding the distance between the target bond and the other atoms in the molecule: spheres are assigned according to the number of chemical bonds separating an atom from the target bond; the selected chemical bond is limited to acyclic bonds in the molecule;

c. after the spheres are defined for each acyclic bond, calculating the following 3 classes of descriptors: a point descriptor, recording the number of atoms of each atom type in each sphere; a pair descriptor, recording the number of atom-type pairs separated by a specified distance within a specified sphere; and a fragment point descriptor, counting the aromatic atoms and the atoms of conjugated pi systems in each fragment after the target bond is broken; these machine-readable numbers are used as the descriptor of each acyclic bond.

Further, the energetic feature descriptor is the combination of the sum over bonds (SOB), the electrotopological state fingerprint (E-state) and the custom descriptor set (CDS), wherein:

the sum over bonds (SOB) enumerates the types of all bonds in the data set and counts the number of times each bond type appears in each molecule as the SOB descriptor of the molecule;

the E-state fingerprint adds the intrinsic state of each atom to the perturbation exerted on it by the other non-hydrogen atoms to obtain the E-state index of the atom, and sums the indices of atoms of the same atom type;

the custom descriptor set (CDS) consists of the N element types and O element types found in the data set, the numbers of N, C and H atoms in the molecule, the carbon-nitrogen ratio and the oxygen balance.

Further, the step S300 specifically includes:

The samples in the entire energetic material BDE data set are randomly shuffled and divided into an initial training set and an independent test set in a set proportion using the Shuffle Split method, and the division is repeated M times (for example, M = 20). After each division, data enhancement is applied to the samples in the initial training set; the expanded initial training set is further divided into a training set and a validation set by nested five-fold cross-validation; and the optimal hyperparameters of the model are searched on the validation set by grid search. After the optimal XGBoost model is determined, the molecules of the independent test set are predicted: each test sample is paired one by one with the samples in the training set, the resulting predictions are treated as a distribution, and the mean of this distribution is added to the mean of the training set to give the final prediction for the unknown sample, which serves as the result on the independent test set.
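The repeated random division can be sketched in plain Python as follows. The 778 samples and M = 20 come from this document; the 10% test fraction, the seed and the function name are assumptions made for illustration, since the text only speaks of "a set proportion". A library splitter such as scikit-learn's `ShuffleSplit` could be used instead.

```python
import random


def shuffle_split(n_samples, test_fraction, n_repeats, seed=0):
    """Yield (train_indices, test_indices) for repeated random splits."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_fraction)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random permutation for every repeat
        yield indices[n_test:], indices[:n_test]


# 778 molecules and M = 20 follow the text; the test fraction is assumed.
splits = list(shuffle_split(n_samples=778, test_fraction=0.1, n_repeats=20))
train_idx, test_idx = splits[0]
print(len(splits), len(train_idx), len(test_idx))
```

Each of the M splits would then go through data enhancement, hyperparameter search and independent testing as described above.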

Further, the data enhancement of the samples in the initial training set specifically includes: combining the samples of the initial training set in pairs; calculating, for each pair, the difference between the one-dimensional descriptors of the two samples to obtain a difference descriptor; and concatenating the difference descriptor with the original descriptors of the two paired samples to obtain a concatenated descriptor, wherein the concatenated descriptors form the feature matrix.
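A minimal sketch of this pairwise augmentation, together with the averaging used at prediction time, is given below. Several details are assumptions rather than statements from the text: ordered pairs without self-pairs are used, the concatenation order is [x_i, x_j, x_i - x_j], the pair label is the difference y_i - y_j (as in pairwise difference regression), and `toy_model` is a stand-in for a trained regressor.

```python
from itertools import permutations


def padre_pairs(features, labels):
    """Build pairwise features [x_i, x_j, x_i - x_j] and labels y_i - y_j."""
    pair_features, pair_labels = [], []
    for i, j in permutations(range(len(features)), 2):
        diff = [a - b for a, b in zip(features[i], features[j])]
        pair_features.append(list(features[i]) + list(features[j]) + diff)
        pair_labels.append(labels[i] - labels[j])
    return pair_features, pair_labels


def padre_predict(model, x_new, train_features, train_labels):
    """Average (predicted difference + known label) over all training samples."""
    estimates = []
    for x_j, y_j in zip(train_features, train_labels):
        diff = [a - b for a, b in zip(x_new, x_j)]
        estimates.append(model(list(x_new) + list(x_j) + diff) + y_j)
    return sum(estimates) / len(estimates)


def toy_model(row):
    """Stand-in regressor: the toy labels obey y = 10 * x[0], so dy = 10 * dx[0]."""
    return 10.0 * row[4]


# Toy data: three samples with two-dimensional descriptors.
X = [[1.0, 0.0], [2.0, 1.0], [4.0, 3.0]]
y = [10.0, 20.0, 40.0]
pairs, targets = padre_pairs(X, y)
prediction = padre_predict(toy_model, [3.0, 2.0], X, y)
print(len(pairs), prediction)
```

With n training samples this produces n(n - 1) paired samples, which is how a small data set is expanded.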

Further, the step S400 specifically includes: calculating the mean absolute error (MAE), the root mean square error (RMSE) and the coefficient of determination (R²) from the results of each of the M independent tests of the XGBoost model to evaluate its performance, and averaging the M values of MAE, RMSE and R² respectively to give the final performance results of the XGBoost model.
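The three metrics and their averaging over M runs can be written out directly. The sketch below uses the standard definitions of MAE, RMSE and R²; the two toy runs and the function names are illustrative only.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def average_metrics(runs):
    """Average each metric over M (y_true, y_pred) independent test results."""
    m = len(runs)
    return {
        "MAE": sum(mae(t, p) for t, p in runs) / m,
        "RMSE": sum(rmse(t, p) for t, p in runs) / m,
        "R2": sum(r2(t, p) for t, p in runs) / m,
    }


# Two toy runs standing in for the M independent tests (values in kJ/mol).
runs = [
    ([100.0, 200.0, 300.0], [110.0, 190.0, 300.0]),
    ([150.0, 250.0], [150.0, 240.0]),
]
summary = average_metrics(runs)
print(summary)
```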

Compared with the prior art, the invention has the following advantages:

(1) The invention starts from two key aspects of machine learning, the data set and the feature representation. It constructs a representative, high-quality bond dissociation energy data set of energetic materials; proposes, based on the nature of bond dissociation energy and the energetic characteristics of energetic substances, a fused feature representation that combines a chemical bond descriptor with global descriptors of the energetic substance; and introduces a data enhancement strategy based on pairwise difference regression, which can reduce systematic errors. The method overcomes the limitation that the small sample size of energetic data places on model performance, thereby improving the prediction accuracy of the model. Combining the proposed fusion descriptor and data enhancement with the XGBoost algorithm yields a highly accurate bond dissociation energy prediction model for energetic materials, which reaches R² = 0.98 and MAE = 8.8 kJ·mol⁻¹ on the independent test set. Using this model to predict the bond dissociation energy of energetic materials improves prediction accuracy.

(2) The invention provides a rapid and accurate bond dissociation energy prediction tool for the research and development of novel high-performance energetic molecules, and is conducive to the development of high-performance energetic materials. In addition, the combination of fused feature descriptors and data enhancement offers methodological guidance for applying machine learning in other small-sample fields.

Drawings

FIG. 1 is a flow chart of the calculation of bond dissociation energy descriptors for energetic materials according to the present invention;

FIG. 2 is a schematic diagram of a PADRE data enhancement strategy according to the present invention;

FIG. 3 is a diagram comparing predicted and calculated BDE values without data enhancement and with the data enhancement method;

FIG. 4 is a schematic diagram showing the results of predicting bond dissociation energy of energetic materials from different combinations of fingerprint and Elastic Net models and models according to the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples, but embodiments of the present invention are not limited thereto.

SMILES, in full Simplified Molecular Input Line Entry System, is a line notation that represents molecular structures as ASCII strings;

MMFF: Merck Molecular Force Field.

Examples:

Data scarcity is an important factor limiting the development of the energetic materials field, and sufficient high-quality data are required to build a model suitable for predicting the bond dissociation energies of a broad range of candidate energetic molecules. To obtain a reliable data set, a total of 778 energetic compounds composed of the elements C, H, O and N, including many typical explosives (e.g., TNT, CL-20, HMX, RDX), were collected from the published literature. The complete list of compounds in the data set includes SMILES strings, names, and so on. To obtain bond dissociation energy labels for this data set and to evaluate the stability of the energetic compounds, the structures of the 778 molecules were optimized at the B3LYP/6-31G level of density functional theory (DFT). The bond dissociation energy of an energetic compound is usually taken to be the bond strength of the weakest bond of the molecule, and cleavage of this bond is considered a key factor in the decomposition process. The R-NO₂ bond (R = C, N or O) is typically the weakest bond in an energetic molecule, and its cleavage is the first step of decomposition, so the stability and impact sensitivity of an energetic material are closely tied to the R-NO₂ bond. The greater the BDE value of the energetic molecule, the better its stability and the lower its sensitivity, so the bond dissociation energy of the weakest bond is calculated as an index for evaluating stability.

The method specifically comprises the following steps:

When data samples are few, feature extraction becomes even more critical. The bond dissociation energy fusion descriptor is an effective representation of the molecular structure related to the bond dissociation energy of energetic materials. To highlight the advantages of this representation, the performance of several classical general-purpose descriptors is also compared, and a total of 7 different types of descriptors are used:

(1) Sum over bonds (Sum Over Bonds, SOB)

To generate the "sum over bonds" feature vector, the types of all bonds in the data set are first enumerated, and the number of times each bond type appears in each molecule is then counted. In total there are 22 different bond types in the energetic material data set, including N=N, N=O, C:N, C:O, C:C, C-O, C-N, C-H, C-C, C/C, N:O, N:N, C/O, C/N, N/N, O-O, C=O, C=N, H-N, N-O and C=C, wherein '-' represents a single bond, '=' represents a double bond, '/' represents a directional bond, and ':' represents an aromatic bond. The count of each bond type is recorded as the SOB descriptor of the molecule, which therefore contains the count information of every bond type in each molecule.
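The counting procedure can be sketched as follows. The input format (a list of bond-type labels per molecule) and the two toy molecules are hypothetical; in practice the labels would be derived from the molecular graph, for example with a cheminformatics toolkit.

```python
from collections import Counter


def sob_descriptor(molecule_bonds, bond_types):
    """Count how often each bond type of the data set vocabulary occurs in one molecule."""
    counts = Counter(molecule_bonds)
    return [counts.get(bond_type, 0) for bond_type in bond_types]


# Hypothetical toy input: each molecule is a list of bond-type labels.
dataset = {
    "nitromethane": ["C-H", "C-H", "C-H", "C-N", "N=O", "N-O"],
    "methyl nitrate": ["C-H", "C-H", "C-H", "C-O", "N-O", "N=O", "N-O"],
}

# Enumerate every bond type in the data set once, in a fixed (sorted) order.
vocabulary = sorted({bond for bonds in dataset.values() for bond in bonds})
sob = {name: sob_descriptor(bonds, vocabulary) for name, bonds in dataset.items()}
print(vocabulary)
print(sob)
```

Fixing the vocabulary order once for the whole data set keeps every molecule's vector aligned, so a bond type absent from a molecule simply contributes a zero.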

(2) Electrotopological state fingerprint (Electrotopological state Fingerprints, E-state)

A molecular fingerprint is a representation that converts a molecular structure into a numerical array, and is easy to compute and well suited to substructure search. The fingerprint used in the invention is the electrotopological state fingerprint: the intrinsic state of each atom is added to the perturbation exerted on it by the other non-hydrogen atoms to obtain the E-state index of the atom, and the indices of atoms of the same atom type are then summed. Using it as a descriptor allows the model to learn how strongly a particular constituent of the molecule affects the target property. The descriptor defines 79 atom types in total, of which the constructed data set involves only 13. To avoid an excess of meaningless zero values, only these 13 types are calculated in actual use.

(3) Custom descriptor set (Custom Descriptor Set, CDS)

Energetic compounds have a high energy density, and one of the important factors affecting energy density is the presence of explosophoric groups (typically containing N, O, etc.) in the molecular structure, which release a large amount of energy when they are cleaved during explosion. To capture this structural feature, we defined a custom set of descriptors relating to the element types of N and O and to the elemental composition of the molecule, with N and O classified according to how they are bonded within the molecule. Over all compounds in the data set there are 7 N element types and 3 O element types in total: C-NO₂, N-NO₂, O-N=O, O-NO₂, C-N=N, C=N-O, C-NH₂, N-O-C, N=O and C=O. In addition, the descriptor set includes the numbers of N, C and H atoms in the molecule, the carbon-nitrogen ratio and the oxygen balance, giving 15 dimensions in total.
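The five composition-based entries of the CDS can be computed from the molecular formula alone. The text does not state which oxygen balance definition is used; the sketch below assumes the common CO₂-based definition, OB% = -1600 × (2a + b/2 - d) / M for a compound CₐH_bN_cO_d with molar mass M, and the function name and output order are illustrative.

```python
# Standard atomic masses in g/mol.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def composition_descriptors(n_c, n_h, n_n, n_o):
    """Return [N count, C count, H count, C/N ratio, oxygen balance in %]."""
    molar_mass = (n_c * ATOMIC_MASS["C"] + n_h * ATOMIC_MASS["H"]
                  + n_n * ATOMIC_MASS["N"] + n_o * ATOMIC_MASS["O"])
    # Assumed CO2-based oxygen balance: oxygen needed to turn C into CO2 and H into H2O.
    oxygen_balance = -1600.0 * (2 * n_c + n_h / 2 - n_o) / molar_mass
    # Assumes at least one N atom, which holds for the nitro compounds considered here.
    carbon_nitrogen_ratio = n_c / n_n
    return [n_n, n_c, n_h, carbon_nitrogen_ratio, oxygen_balance]


# TNT has the molecular formula C7H5N3O6.
tnt = composition_descriptors(n_c=7, n_h=5, n_n=3, n_o=6)
print(tnt)
```

Together with the 10 element-type counts listed above, these five values would make up the 15 dimensions of the CDS.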

(4) Coulomb matrices (Coulomb matrices, CM)

The Coulomb matrix is a mathematical representation of molecular structure that is widely used in computational chemistry and materials science. From the atomic coordinates and nuclear charges of the molecule, a square matrix is generated that represents the Coulomb interaction between each pair of atoms. The diagonal elements correspond to a polynomial fit of the potential energy of the isolated atoms, while the off-diagonal elements correspond to the Coulomb repulsion energy between different atom pairs in the molecule. The Coulomb matrix is invariant under translation and rotation of the molecule, but it is not invariant under permutation of the atom indices. To avoid this problem, the eigenvalues of the Coulomb matrix (CMs eigs) may be used, because the eigenvalue spectrum does not change when rows and columns are permuted together. In this approach, the Coulomb matrix is replaced by the vector of its eigenvalues sorted in descending order; however, using only the eigenvalues means that part of the information in the complete matrix is lost. We therefore compared the Coulomb matrix eigenvalues (CMs eigs) with the original Coulomb matrix (CMs).
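The invariance properties described above can be checked on a small example. The sketch below uses the commonly cited form of the Coulomb matrix (0.5·Z^2.4 on the diagonal and Z_i·Z_j/r_ij off the diagonal); the three-atom geometry is illustrative, and the eigenvalue step is omitted because it would require a linear algebra library.

```python
import math


def coulomb_matrix(charges, coordinates):
    """Coulomb matrix: 0.5 * Z_i**2.4 on the diagonal, Z_i * Z_j / r_ij elsewhere."""
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                r_ij = math.dist(coordinates[i], coordinates[j])
                matrix[i][j] = charges[i] * charges[j] / r_ij
    return matrix


# Toy three-atom geometry (O, H, H); coordinates are illustrative, in atomic units.
charges = [8, 1, 1]
coords = [(0.0, 0.0, 0.0), (1.8, 0.0, 0.0), (-0.45, 1.75, 0.0)]
cm = coulomb_matrix(charges, coords)

# Translating the molecule leaves the matrix unchanged ...
shifted = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in coords]
cm_shifted = coulomb_matrix(charges, shifted)

# ... but relabelling the atoms permutes rows and columns.
order = [1, 0, 2]
cm_permuted = coulomb_matrix([charges[k] for k in order], [coords[k] for k in order])
print(cm[0][0], cm_permuted[0][0])
```

Sorting the eigenvalues of `cm`, as the text describes, would give the same vector for `cm` and `cm_permuted`, at the cost of discarding the rest of the matrix.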

(5) Bag of bonds (Bag of bonds, BOB)

BOB is mainly used to capture information about the different bonds in a molecule, in particular the type, length and strength of chemical bonds. First, the types of bonds that may be present in a molecule (e.g., single, double, triple) are defined. The atoms in the molecule are then analyzed to determine which bonds exist between them, using the interatomic distances and relative positions. For each bond type, its associated characteristics, such as bond length, bond angle and bond strength, are recorded. From this information a vector or matrix is constructed in which each element represents the count or the characteristics of one bond type. BOB is typically a high-dimensional vector, with each dimension corresponding to a bond type or bond property. Summed BOB is an improved bag-of-bonds representation that produces a compact representation by summing the characteristics of each bond type. Its advantage is that it integrates the information of the different bond types in the molecule into a single feature vector of lower dimension, which makes it better suited to machine learning and model training. We therefore also compared summed bag of bonds (Summed BOB) with the original bag of bonds (BOB).

(6) Atom-centered symmetry functions (Atom-centered Symmetry Functions, ACSF)

ACSF is a mathematical representation method used to characterize molecular and crystal structures. ACSF is generally used to describe the local structure of an atomic environment in order to better capture the chemical and physical properties of a molecular or crystal structure. These features may be used for input of machine learning models to perform tasks of property prediction, classification, clustering, etc. In constructing an ACSF descriptor, a central atom, typically a particular atom in a molecular or crystal structure, is first selected, and the atomic environment associated with the central atom, including those adjacent atoms that interact with the central atom, is determined. For a selected atomic environment, features such as distance, angle, bond length, bond angle, etc. between a series of atomic pairs are calculated. These features can be used to describe the geometry and chemistry of the atomic environment. Using the computed features, the atomic environment is represented as a mathematical vector or matrix. Different atomic centers may be selected and a corresponding feature vector calculated for each center. These feature vectors may be summarized or combined into a global feature representation to describe the entire molecular or crystal structure.

(7) Chemical environment descriptor (Chemical Environment Descriptor, CED)

Based on the chemical bond descriptor, we introduce a chemical environment descriptor suitable for characterizing the pyrolysis initiation bonds of energetic materials. For each sample in the data set we calculated the bond dissociation energy descriptor of its pyrolysis initiation bond. Each chemical bond has a feature dimension of 100: the first 2 dimensions record the actual index numbers of the two atoms of the bond, and the last 98 dimensions characterize the chemical environment. The descriptor characterizes the chemical environment around each bond in the molecule in terms of "spheres", so that the prediction of bond dissociation energy is specific to each chemical bond rather than being blurred into an assessment of the molecule as a whole. The specific calculation steps of the descriptor are as follows:

a. First, each atom is named according to its element type and substitution number; for example, C4 means a carbon atom with 4 substituents. Based on the characteristics of the previously constructed search space, we restrict the element types to C, H, O and N.

b. When the descriptor is calculated, one chemical bond is selected as the target bond, and the distances between the target bond and the other atoms in the molecule are encoded: spheres are assigned according to the number of chemical bonds separating an atom from the target bond. For example, the target bond itself is Sphere 0, and atoms one chemical bond away from the target bond belong to Sphere 1. Because energetic materials release energy in an explosion reaction mainly by breaking off substituents and rarely by ring opening, we limit the selected chemical bonds to the acyclic bonds in the molecule.

c. After the spheres are defined for each acyclic bond, the following 3 classes of descriptors are calculated: a point descriptor, recording the number of atoms of each atom type in each sphere; a pair descriptor, recording the number of atom-type pairs separated by a specified distance within a specified sphere; and a fragment point descriptor, counting the aromatic atoms and the atoms of conjugated pi systems in each fragment after the target bond is broken. These machine-readable numbers are used as the descriptor specific to each acyclic bond.

Three common descriptors that are currently excellent in the field of energetic material property prediction are considered for 778 finite energetic data sets constructed: bond-Sum (SOB), electrical topology state-fingerprint (E-state), and Custom Descriptor Set (CDS), where SOB characterizes different bond types, E-state is an atomic type count vector, CDS contains the nature of an explosive group. Considering that these several descriptors characterize the energetic molecule based on different structural angles, respectively, we combine the three descriptors to get a combined descriptor sob+e-state+cds (called "SEC") in order to get a more comprehensive sample characterization. Such combined descriptors have achieved good results in the prediction of some properties of the energetic material, such as density, heat of formation, heat of detonation, detonation velocity, and detonation pressure. Since these properties are all closely related to the overall structure of the molecule, the use of such global descriptors, which are equally focused on the overall molecular structure, allows for a more accurate characterization of the energetic molecule. However, these properties differ from bond dissociation energies, and since one typically uses bond strength of the weakest bond in an energetic molecule to represent its bond dissociation energy, characterization of the chemical environment surrounding the weakest bond of the energetic molecule is important. Thus, characterization with such global descriptors alone is not comprehensive, and there remains a need for a bond descriptor that focuses on the local environment of the broken bond, given the structural characteristics associated with the dissociation energy of the energetic material bond.

Based on this deficiency, we introduce chemical bond descriptors into the energetic material system, thereby constructing a chemical environment descriptor suitable for characterizing the thermally induced bonds of the energetic material. The descriptor is based on the concept of a sphere, and describes a structure centered on a designated bond in a molecule to characterize the structural characteristics of a sample, and the characterization encodes the target bond according to the atom type and the atom type pair existing in the neighborhood of the target bond, wherein the difference information between the descriptor pair before and after the target bond fracture is also included, so that the structural characteristics of bond dissociation energy should be more reflected.

To illustrate how the descriptor is constructed, Fig. 1 gives an example in which the pyrolysis-initiation bond of an energetic molecule is taken as the target bond, spheres are defined around it, and the bond descriptor is encoded. Fig. 1(a) is an example of defining atom types and spheres. In the figure, the balls numbered 2, 5, 7 and 10 represent C atoms; those numbered 1, 3, 4, 8, 9 and 12 represent N atoms; those numbered 6 and 11 represent O atoms; those numbered 13, 14, 15 and 16 represent H atoms; and "×" marks the location of target bond cleavage. Each atom is then named according to its element type and number of bonded neighbors, and spheres are assigned according to the number of chemical bonds between an atom and the target bond. Fig. 1(b) is an example of calculating the point descriptor and the pair descriptor. In constructing the chemical environment descriptor of energetic materials, we first number all atoms in the molecule and then define the type of each atom from its element type and the number of atoms bonded to it. For example, atom 5 in the figure is a carbon atom bonded to the three atoms 4, 6 and 7, so it is defined as C3. Our study uses 10 atom types, including C2, C3, C4, H1, N2, N3, N4, O1 and O2. After the atom types are defined, the spheres are divided according to the number of chemical bonds separating each atom from the target bond. For example, the target bond itself is Sphere 0, the range one chemical bond away from the target bond is Sphere 1, and so on. To balance descriptor length against model accuracy, at most 4 spheres are considered per molecule. After the atom types and spheres are defined, the following 3 classes of descriptors are calculated:

Point descriptor: records the number of atoms of each atom type in each sphere;

Pair descriptor: records the number of atom-type pairs separated by a specified distance in a specified sphere;

Fragment point descriptor: counts the aromatic atoms and the atoms belonging to conjugated π systems in each fragment formed after cleavage of the target bond.
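The atom typing, sphere assignment and point descriptor described above can be sketched in a few lines. This is only an illustrative sketch, not the implementation used in the invention: the function names, the edge-list input format, and the choice of nitromethane with its C–N bond as the target bond are all assumptions made for the example, and "at most 4 spheres" is interpreted here as spheres 0 to 3.

```python
from collections import Counter, deque

def atom_types(elements, bonds):
    """Label each atom as element symbol + number of bonded neighbours (e.g. 'C3')."""
    degree = Counter()
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    return {i: f"{el}{degree[i]}" for i, el in elements.items()}

def spheres(bonds, target, max_sphere=3):
    """Breadth-first search outward from the two atoms of the target bond.

    Sphere 0 holds the two target-bond atoms; sphere k holds the atoms that are
    k bonds away. Atoms beyond max_sphere are ignored.
    """
    neighbours = {}
    for a, b in bonds:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    dist = {target[0]: 0, target[1]: 0}
    queue = deque(target)
    while queue:
        atom = queue.popleft()
        if dist[atom] == max_sphere:
            continue
        for nb in neighbours[atom]:
            if nb not in dist:
                dist[nb] = dist[atom] + 1
                queue.append(nb)
    return dist

def point_descriptor(elements, bonds, target, max_sphere=3):
    """Count how many atoms of each type fall in each sphere."""
    types = atom_types(elements, bonds)
    counts = Counter()
    for atom, k in spheres(bonds, target, max_sphere).items():
        counts[(k, types[atom])] += 1
    return counts

# Hypothetical example molecule: nitromethane, CH3-NO2, target bond C1-N2.
elements = {1: "C", 2: "N", 3: "O", 4: "O", 5: "H", 6: "H", 7: "H"}
bonds = [(1, 2), (2, 3), (2, 4), (1, 5), (1, 6), (1, 7)]
desc = point_descriptor(elements, bonds, target=(1, 2))
for (sphere, atom_type), n in sorted(desc.items()):
    print(f"sphere {sphere}: {atom_type} x {n}")
```

In a full descriptor these counts would be written into fixed positions of a vector (one slot per sphere and atom type); the pair and fragment descriptors would be built on the same sphere assignment.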

The specific calculation method of the chemical environment descriptor is as follows:

For each sample in the energetic data set, the bond descriptor of the pyrolysis-initiation bond is calculated. The feature dimension of the chemical bond is 100: the first two dimensions record the serial numbers of the two atoms forming the bond, and the last 98 dimensions carry the information on the chemical environment. Because the chemical bond descriptor focuses on structural features of the local environment of the broken bond, it does not sufficiently characterize the overall structure of the energetic molecule or the information related to its energetic character. For some complex molecules, 4 spheres do not cover all bond types in the molecule, whereas SOB considers every bond type in the whole molecule and thus compensates for this. E-state contains information on the electronic state of atoms and characterizes them from a different angle than the atom types and neighbor counts defined above, which enriches the atomic information. CDS additionally characterizes the explosive groups in the molecule (typically containing N, O and similar elements) and describes energetic characteristics such as oxygen balance and carbon-to-nitrogen ratio, which are absent from bond descriptors. Fusing the chemical bond descriptor with the energetic global descriptor SOB+E-state+CDS therefore gives a more comprehensive characterization of the structural features governing the bond dissociation energy of the energetic molecule.

To explore the effect of fusing local and global descriptors on bond dissociation energy prediction, we tested three feature sets on the XGBoost model. The first is the chemical environment descriptor (CED) alone, which contains only structural information centered on the dissociated bond, 98 dimensions in total. The second is the energetic global descriptor (SEC) alone, which contains only descriptors of the overall structure of the energetic molecule, 50 dimensions in total. The third is our proposed fusion descriptor (CED+SEC), which combines the chemical environment descriptor with the description of the overall features of the energetic molecule, 148 dimensions in total. The first 3 rows of Table 1 list the accuracy of these three descriptors on the training set and the test set. On the test set, the model based on CED performs significantly better than the one based on SEC, with R² reaching 0.87 and MAE and RMSE as low as 15.6 kJ·mol⁻¹ and 30.3 kJ·mol⁻¹, respectively. This shows that predicting bond dissociation energy still requires a focus on the characteristics of the bonds in the molecule. With the fusion feature CED+SEC, the prediction accuracy improves further: R² on the test set rises from 0.87 to 0.92, MAE and RMSE fall to 14.7 kJ·mol⁻¹ and 24.3 kJ·mol⁻¹, and the prediction deviation is also reduced. This demonstrates that the proposed fusion feature characterizes the structural features relevant to bond dissociation energy more fully than either feature alone.
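The dimension bookkeeping of the fusion descriptor can be made concrete with a small sketch. It assumes, following the text, that the 100-dimensional bond descriptor carries two atom serial numbers followed by 98 chemical-environment values and that SEC has 50 dimensions; the function name `fuse_descriptors` and the placeholder values are hypothetical.

```python
def fuse_descriptors(bond_descriptor, sec_descriptor):
    """Concatenate the 98 chemical-environment values (CED) with the 50 SEC values."""
    if len(bond_descriptor) != 100:
        raise ValueError("expected a 100-dimensional bond descriptor")
    if len(sec_descriptor) != 50:
        raise ValueError("expected a 50-dimensional SEC descriptor")
    ced = list(bond_descriptor[2:])  # drop the two atom serial numbers
    return ced + list(sec_descriptor)

# Placeholder inputs: atoms 3 and 7 form the target bond; all values are dummies.
bond_descriptor = [3, 7] + [0.0] * 98
sec_descriptor = [1.0] * 50
fused = fuse_descriptors(bond_descriptor, sec_descriptor)
print(len(fused))
```

Dropping the atom serial numbers before fusion is an assumption consistent with the stated 98-dimensional CED; they identify the bond rather than describe its environment.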

Table 1. Bond dissociation energy prediction accuracy of different molecular descriptors combined with the XGBoost model

MAE and RMSE are given in kJ·mol⁻¹.

To further demonstrate that the fusion descriptor expresses the structural features related to the bond dissociation energy of energetic materials more accurately than other general descriptors, we predicted bond dissociation energy with the three descriptors SOB, E-state and CDS and their pairwise combinations, and also compared with some classical general molecular descriptors: Coulomb matrices (including CMs vec and CMs eigs), bag of bonds (BOB, including BOB and summed BOB), and atom-centered symmetry functions (ACSF). The Coulomb matrix characterizes the nuclear charges, the distances and the Coulomb interactions between atoms; the bag of bonds reflects the types and numbers of bonds in the molecule; and the atom-centered symmetry functions describe the structure surrounding each atom, such as the distances and angles between atoms, taking each atom as the center. The last 11 rows of Table 1 summarize the performance of these descriptors in combination with the XGBoost model for predicting the bond dissociation energy of energetic materials.

As shown in Table 1, the XGBoost models combined with these 11 descriptors all show high accuracy on the training set, with R² between 0.83 and 0.96; for the BOB descriptor, the MAE and RMSE on the training set are as low as 11.1 kJ·mol⁻¹ and 15.9 kJ·mol⁻¹, respectively. Compared with the training set, however, all of these models perform markedly worse on the independent test set: R² is only 0.32 to 0.79, and MAE and RMSE are at least 23.9 kJ·mol⁻¹ and 40.1 kJ·mol⁻¹. The large gap between training and test results is likely due in part to overfitting and to incomplete characterization. From Table 1 we conclude that the proposed fusion of the local chemical-bond environment descriptor with the global energetic descriptor performs far better than the other descriptors in predicting the bond dissociation energy of energetic materials. This further illustrates the advantage of the fusion descriptor: adding features that describe the overall structure of the energetic molecule significantly improves model accuracy when predicting the bond dissociation energy of pyrolysis-initiation bonds. We therefore carry out the subsequent work on the basis of the fusion descriptor.

The specific calculation method of the fusion descriptor is as follows:

dividing the BDE data set of the energetic material into an initial training set and an independent test set according to a set ratio, repeating the data division M times (M = 20 here), and expanding the initial training set by a data enhancement method after each division;

increasing the data size using pairwise difference regression (PADRE), in which the original n training points are converted into n² points according to pairwise information. Specifically, the feature vectors x_1, x_2, …, x_n are combined in pairs to form pair features (x_i, x_j), each of which is the concatenation of the two features and the difference between them. (x_i, x_j) is defined by equation (1); accordingly, the PADRE feature matrix X can be represented by equation (2), and the target Y of the pairwise difference regression by equation (3).

$$(x_i, x_j) = x_i \oplus x_j \oplus (x_i - x_j) \tag{1}$$

$$X = \{(x_i, x_j)\}, \quad i, j = 1, 2, \dots, n \tag{2}$$

$$Y = \{y_i - y_j\}, \quad i, j = 1, 2, \dots, n \tag{3}$$

where ⊕ is the concatenation operation and y_1, y_2, …, y_n are the target reference values of the feature vectors x_1, x_2, …, x_n in the training set.

The model is then fitted with X as the feature matrix and Y as the target. For an unknown sample u, its feature vector x_u is paired with each point in the training set, and the trained model gives a set of predicted differences between the unknown sample and the training points. Adding the known target value y_i of each training point to the corresponding predicted difference gives n estimates that can be regarded as a distribution, and the mean of this distribution is taken as the final prediction ŷ_u for the unknown sample, as in equation (4), where f denotes the trained model:

$$\hat{y}_u = \frac{1}{n} \sum_{i=1}^{n} \left[ f\big((x_u, x_i)\big) + y_i \right] \tag{4}$$

Because the data set contains only 778 unique energetic materials, it is still a small-scale data set. To increase data diversity, improve model performance and mitigate the risk of overfitting, a data enhancement strategy is introduced: pairwise difference regression (PADRE). In addition, PADRE yields an uncertainty measure when predicting unknown samples; this uncertainty reflects the error of the model and can be used to select candidates in active learning or in Bayesian property optimization of molecules and materials. With PADRE, the number of samples increases sharply from n training points to n², so the method serves as a data enhancement strategy and can also cancel part of the systematic error introduced by the calculations.

PADRE is introduced to expand the 778 energetic data points; the specific data enhancement procedure is shown in Fig. 2. Each data point in the training set is the fusion feature of the pyrolysis-initiation bond of a molecule, a vector of length 148. The steps are as follows:

(1) In the training phase, the samples in the training set are combined in pairs, the difference between the one-dimensional descriptors of the two samples is calculated, and the difference descriptor is concatenated with the original descriptors of the two samples. The concatenated descriptor is the input and the difference between the two labels is the output. As shown in Fig. 2(a), the model is trained on the pair feature vectors (x_i, x_j) to predict the differences in target values (y_i − y_j).

(2) An unknown sample u in the test set is paired with each sample in the training set, with the difference calculated by subtracting the known sample (the training-set sample) from the unknown sample u. The pairs are then predicted with the trained model, giving n predictions that can be regarded as a distribution. As shown in Fig. 2(b), the unknown feature vector x_u is paired with all training feature vectors x_i, and the model gives a set of predicted differences.

(3) The mean of this distribution is added to the mean of the training-set target values to obtain the final prediction for the unknown sample u. As shown in Fig. 2(c), for the data point x_u, the predicted differences are added to the known values y_i, and the mean (μ) of the resulting distribution is the predicted target property of sample u.
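Steps (1) to (3) can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the model of the invention: an ordinary least-squares fit stands in for the trained XGBoost model so that the example stays self-contained, the data are synthetic, and the names `padre_pairs` and `padre_predict` are hypothetical. The standard deviation of the n estimates is returned as one possible spread measure.

```python
import numpy as np

def padre_pairs(X, y):
    """Step (1): build the n^2 pair features x_i (+) x_j (+) (x_i - x_j) and targets y_i - y_j."""
    n = X.shape[0]
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    i, j = i.ravel(), j.ravel()
    pair_X = np.hstack([X[i], X[j], X[i] - X[j]])
    pair_y = y[i] - y[j]
    return pair_X, pair_y

def padre_predict(predict_diff, X_train, y_train, x_u):
    """Steps (2)-(3): pair x_u with every training point, add y_i, and average."""
    n = X_train.shape[0]
    U = np.repeat(x_u[None, :], n, axis=0)
    pair_X = np.hstack([U, X_train, U - X_train])  # unknown minus known
    estimates = predict_diff(pair_X) + y_train     # n estimates of y_u
    return estimates.mean(), estimates.std()

# Synthetic data (assumption): a linear target so the result can be checked by hand.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y_train = X_train @ w_true + 10.0

pair_X, pair_y = padre_pairs(X_train, y_train)

# Stand-in learner (assumption): least squares instead of XGBoost.
coef, *_ = np.linalg.lstsq(pair_X, pair_y, rcond=None)

def predict_diff(P):
    return P @ coef

x_u = rng.normal(size=4)
y_hat, spread = padre_predict(predict_diff, X_train, y_train, x_u)
print(pair_X.shape, y_hat, x_u @ w_true + 10.0)
```

Note that averaging the values (predicted difference + y_i) is the same as adding the mean predicted difference to the mean of the training targets, which is how step (3) is worded.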

After applying PADRE, the descriptor dimension triples (to 444 dimensions), while the number of training data points grows from 622 to 622², which corresponds to the number of pairwise combinations of the samples. The increase in sample count far exceeds the increase in descriptor dimension, which alleviates the data shortage to some extent.
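The size of the enhanced training set follows directly from these numbers. The short calculation below reproduces it; the memory estimate assumes 32-bit floats, which is an illustrative choice and not something stated in the description.

```python
n_train = 622      # training samples per split
n_features = 148   # fusion descriptor length

n_pairs = n_train ** 2          # ordered pairs, including each sample with itself
pair_dim = 3 * n_features       # x_i, x_j and their difference
n_values = n_pairs * pair_dim
approx_gib = n_values * 4 / 2**30  # assuming 4 bytes per value

print(n_pairs, pair_dim, n_values, round(approx_gib, 2))
```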

In the model training stage, the 622 samples of the training set are data-enhanced and then divided into a training set and a validation set; the validation set is used to search for the optimal hyperparameters and obtain the optimal trained model. In the prediction stage, each of the remaining 156 molecules is paired one by one with the 622 training molecules and predicted, giving the result on the independent test set. We then performed ablation experiments on the data enhancement to assess whether it improves model performance. Fig. 3 shows the prediction accuracy without data enhancement (referred to as "Xgb") and with the PADRE data enhancement strategy (referred to as "Xgb+PADRE"). Fig. 3(a) shows the prediction of the XGBoost model on the validation set without data enhancement; Fig. 3(b) shows it on the validation set with data enhancement; Fig. 3(c) shows it on the independent test set without data enhancement; and Fig. 3(d) shows it on the independent test set with data enhancement. The data enhancement clearly improves predictive performance on both the training set and the independent test set: with PADRE, R² on the test set increases from 0.92 to 0.98. "Xgb+PADRE" achieves the best results not only in R² but also in MAE and RMSE, and the test-set MAE is lower than that of the training set, which indicates that the generalization ability of the model is markedly improved by data enhancement. In addition, the prediction deviation of the model on the test set is markedly reduced, indicating improved stability.

Including feature differences in the descriptor captures structural information related to bond dissociation energy well, and the difference approach can offset some computational errors. The ablation results demonstrate the practical value of pairwise difference regression for model performance: the method not only alleviates the sample-size problem of small-sample learning through data enhancement, but also enriches the information contained in the original feature descriptors and improves the robustness of the model in predicting bond dissociation energy.

Before model training, the model type is first determined. Because the constructed data set is limited to 778 energetic compounds, traditional machine learning models are more suitable than deep learning. Six traditional machine learning algorithms that perform well in learning structure-property relationships on small data sets were tested: least absolute shrinkage and selection operator regression (LASSO), kernel ridge regression (KRR), support vector regression (SVR), Gaussian process regression (GPR), random forest regression (RF), and extreme gradient boosting regression (XGBoost). For each model, the data are divided with the Shuffle Split method. Unlike traditional K-Fold division, Shuffle Split randomly samples the whole data set in each iteration, so data selected for the test set in one iteration may be selected again in a later iteration. Through multiple selections and evaluations, the test sets are more diverse than with K-Fold division, and the robustness of the model can be assessed from the prediction deviation obtained when data points are divided and predicted multiple times.
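The repeated random division can be sketched with the standard library alone. The sketch below assumes an 8:2 ratio and 20 repetitions, as used elsewhere in the description; the generator name `shuffle_split` and the seed are hypothetical, and in practice a library routine such as scikit-learn's `ShuffleSplit` could be used instead.

```python
import random
from collections import Counter

def shuffle_split(n_samples, n_splits=20, test_fraction=0.2, seed=0):
    """Yield (train_indices, test_indices) from an independent reshuffle per split."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_fraction)
    for _ in range(n_splits):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]

splits = list(shuffle_split(778))
train_idx, test_idx = splits[0]

# Unlike K-Fold, a sample may land in the test set of several splits.
times_tested = Counter(i for _, test in splits for i in test)
print(len(train_idx), len(test_idx), max(times_tested.values()))
```

With 778 samples this gives 622 training and 156 test samples per split, matching the set sizes quoted in the description.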

Specifically, the samples of the data set are randomly shuffled, the data set is divided into a training set and an independent test set at a ratio of 8:2, and the division is repeated 20 times. After each division, the model hyperparameters are optimized by grid search with nested five-fold cross-validation to find the optimal hyperparameters of each model. After the twenty iterations, model performance is evaluated by the averaged evaluation indices. The mean absolute error (MAE), the root mean square error (RMSE) and the coefficient of determination (R²) are calculated to evaluate the performance of these machine learning models:

where N is the number of samples, y^True is the true value of a sample, y^Pred is its predicted value, ȳ^True is the average of the true values, and ȳ^Pred is the average of the predicted values. R² ranges from negative infinity to 1, and a value closer to 1 indicates a higher goodness of fit of the model on the samples.
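For reference, the three indices can be computed as in the sketch below. It uses the standard textbook definitions, with R² taken as 1 minus the ratio of residual to total sum of squares, which is consistent with the stated range from negative infinity to 1; whether this matches the exact formulas of the original drawings is an assumption, and the sample values are invented for illustration.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Invented values in kJ/mol, for illustration only.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 300.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```

The final performance reported in the description is the average of each index over the 20 independent test sets.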

The effect of combining the 6 machine learning models (LASSO, KRR, SVR, GPR, RF, XGBoost) and the fusion descriptor is shown in table 2.

Table 2. Comparison of the bond dissociation energy prediction accuracy of the 6 machine learning models combined with the fusion descriptor

MAE and RMSE are given in kJ·mol⁻¹.

Table 2 details the accuracy of all combinations on the training set and the test set. All six models show high accuracy on the training set, especially the two ensemble models, RF and XGBoost, whose R² reaches 0.99. These two models also achieve the best results on the independent test set: R² reaches 0.92 for both, and MAE and RMSE are the lowest among all models, at 14.7 kJ·mol⁻¹ and 24.3 kJ·mol⁻¹, respectively. This shows that, with suitable descriptors and models, a traditional machine learning model can establish the mapping between the input features and the properties of the molecules well. Although XGBoost and RF give equivalent prediction results, XGBoost trains about twenty times faster than RF over the whole training process, so we select XGBoost as the optimal model for efficient prediction of bond dissociation energy.

Because the lowest prediction error obtained so far is still relatively high for bond dissociation energy, we further combine a data enhancement method with the fusion descriptor and the best-performing XGBoost model, which helps improve the stability and generalization ability of the model.

S500: predicting the bond dissociation energy of the energetic material with the trained XGBoost model, comparing the result with other bond dissociation energy prediction models, and verifying the advantages and the necessity of the model.

To verify the advantages of the constructed prediction model for the bond dissociation energy of energetic materials, the invention is compared with some competitive models. One of them is the graph neural network model BonDNet. Our task of predicting the bond dissociation energy of the pyrolysis-initiation bonds of energetic materials differs from that of BonDNet, which maps the difference between the molecular graph representations of reactants and products to the bond dissociation energy of the corresponding reaction; it can predict the dissociation energy of each bond, excluding non-cyclic bonds, for both neutral and charged molecules. Applying the BonDNet model to our independent test set gives poor results, with R², MAE and RMSE of 0.35, 142.87 kJ·mol⁻¹ and 273.56 kJ·mol⁻¹, respectively. Table 3 further compares the BDE predictions of the BonDNet model and of the model of the invention with the quantum-chemically calculated values for ten representative molecules taken from the independent test set.

Table 3. Comparison of predicted and calculated BDE values for 10 energetic molecules from the independent test set with the BonDNet model and the XGBoost model

Predicted and calculated values are given in kJ·mol⁻¹.

The combination of the fusion descriptor, the XGBoost model and the PADRE data enhancement strategy of the invention predicts the bond dissociation energy of these 10 energetic molecules significantly better. The prediction error of the BonDNet model is abnormally high for most of the 10 molecules, especially those with more than 15 heavy atoms. The remaining molecules 4, 6 and 7 have fewer heavy atoms, and for them BonDNet gives predictions closer to the true values, although the deviation is still significantly higher than that of our predictions. The reason is that the molecules in the BonDNet database, although also composed of C, H, O and N atoms, are small molecules containing no more than 10 heavy atoms. Since most energetic molecules have more than 10 heavy atoms, BonDNet can predict only some energetic molecules with few heavy atoms reasonably accurately, and its accuracy is poor for molecules with more than 10 heavy atoms. This indicates that the BonDNet model is not suitable for predicting the bond dissociation energy of energetic materials, and it shows the need to develop a prediction model applicable to the bond dissociation energy of a wide range of energetic materials.

In the prior art, four different types of molecular fingerprints were generated from SMILES strings, and the BDEs of I-X bonds in 716 hypervalent iodine compounds were predicted with five machine learning methods, among which the Elastic Net (EN) model showed the highest accuracy for the BDEs of the I-X bonds: R² = 0.96 and MAE = 6.60 kJ·mol⁻¹. The invention calculates the 4 fingerprint descriptors (Morgan, RDK, MACCS and Avalon fingerprints) for the 778 energetic molecules and constructs an Elastic Net architecture. The constructed EN models are retrained and evaluated on the data set of the invention, with grid search used to find their optimal hyperparameters to ensure a fair comparison. Fig. 4 shows the prediction accuracy of EN combined with the various fingerprints on the independent test set, with the dashed line of the corresponding color representing our Xgb+PADRE results. As the figure shows, XGBoost combined with PADRE enhancement gives better results than EN with any of the fingerprints, demonstrating the superiority of the fusion descriptor and XGBoost model combined with the PADRE enhancement method for predicting the bond dissociation energy of energetic molecules.

Based on the development requirements of energetic materials and the limitations of existing models, the invention starts from the three elements of machine learning (data, feature characterization and model framework) and develops a machine learning model that can accurately and rapidly predict the BDE of the pyrolysis-initiation bond of energetic materials. First, regarding the data set, to address the current lack of energetic molecule data, 778 synthesized energetic molecules composed of the elements C, H, O and N were manually collected from the literature, and bond dissociation energy labels were obtained by high-precision quantum-chemical calculation. This yields a representative, high-quality bond dissociation energy data set of energetic materials, which can also serve as a data resource for research on other properties of energetic materials. Second, because the energetic material data set is limited, the feature characterization of the samples is particularly important for the performance of the machine learning prediction model. A descriptor that fuses the local chemical environment of the dissociated bond with the global features of the energetic structure is therefore proposed, based on the local features related to bond dissociation energy and the global features of energetic materials. On this basis, we tested and compared various machine learning models and further introduced pairwise difference regression for data enhancement, which can partially cancel systematic errors, thereby improving prediction accuracy, and can also assess the uncertainty of a prediction.

Finally, a high-accuracy prediction model for the bond dissociation energy of energetic materials is obtained from the proposed fusion features, the data enhancement strategy and the XGBoost algorithm. Its accuracy on the independent test set reaches R² = 0.98 and MAE = 8.8 kJ·mol⁻¹, far higher than that of existing models for the bond dissociation energy of energetic molecules, providing a reliable performance prediction tool for the research and development of novel, efficient energetic molecules. The data set enhancement strategy, the feature characterization combining local and global information, and the comparison of machine learning prediction models in this work also provide methodological guidance and reference for machine learning modeling in other small-sample fields.

Although the invention has been described herein with reference to illustrative embodiments, these are merely preferred embodiments and do not limit the invention. It should be understood that those skilled in the art can devise numerous other modifications and embodiments that fall within the scope and spirit of the principles of this disclosure.

Claims

1. A method for predicting bond dissociation energy of an energetic material based on feature fusion and data enhancement, comprising:

2. The method for predicting bond dissociation energy of energetic material based on feature fusion and data enhancement according to claim 1, wherein the step S100 specifically comprises:

3. The method for predicting bond dissociation energy of energetic material based on feature fusion and data enhancement according to claim 2, wherein the step S200 specifically comprises:

4. The method for predicting energy of dissociation of energetic material bonds based on feature fusion and data enhancement as claimed in claim 3, wherein the specific calculation method of chemical environment descriptor comprises:

5. The method of claim 4, wherein the energetic feature descriptor is sum over bonds SOB + electrotopological-state fingerprint E-state + custom descriptor set CDS, wherein:

6. The method for predicting energy of dissociation of energetic material bonds based on feature fusion and data enhancement as claimed in claim 3, wherein said step S300 comprises:

randomly shuffling the samples in the whole energetic material bond dissociation energy data set and dividing them into an initial training set and an independent test set according to a set ratio using the Shuffle Split method; repeating the division M times; after each division, performing data enhancement on the samples in the initial training set and dividing the expanded initial training set again into a training set and a validation set using nested five-fold cross-validation; searching for the optimal hyperparameters of the model on the validation set by grid search; after the optimal XGBoost model is determined, predicting the molecules of the independent test set by pairing each sample of the independent test set one by one with the samples in the training set to obtain predictions regarded as a distribution; and adding the mean of the distribution to the mean of the training set to obtain the final prediction for the unknown sample as the result on the independent test set.

7. The method for predicting energy of dissociation of energetic material bonds based on feature fusion and data enhancement as claimed in claim 6, wherein the data enhancement of the samples in the initial training set comprises: and combining samples of the initial training set in pairs, calculating differences between one-dimensional descriptors of the two samples combined in pairs to obtain difference descriptors, and then splicing the difference descriptors with original descriptors of the two samples combined in pairs to obtain spliced descriptors, wherein the spliced descriptors form a feature matrix.

8. The method for predicting bond dissociation energy of energetic material based on feature fusion and data enhancement as claimed in claim 7, wherein said step S400 comprises: calculating the mean absolute error MAE, the root mean square error RMSE and the coefficient of determination R² of the XGBoost model from the results of each of the M independent tests to evaluate the performance of the XGBoost model, and taking the averages of the M values of MAE, RMSE and R² obtained from the M calculations as the final performance results of the XGBoost model.