CN116312854A

CN116312854A - Method for predicting the n-octanol-water partition coefficient of sulfamethoxazole substances

Info

Publication number: CN116312854A
Application number: CN202310201931.7A
Authority: CN
Inventors: 宋敏; 刘羽晨
Original assignee: Hangzhou Yile Standard Technology Co ltd
Current assignee: Hangzhou Yile Standard Technology Co ltd
Priority date: 2023-03-06
Filing date: 2023-03-06
Publication date: 2023-06-23

Abstract

The invention relates to the technical field of ecological risk assessment, addresses the technical problem of poor prediction capability, and in particular relates to a method for predicting the n-octanol-water partition coefficient of sulfamethoxazole substances. The method comprises the following steps: screening compounds structurally similar to the target substance to be tested from experimentally measured data in the published literature to generate a sample data set; randomly dividing the sample data set into a training set and a validation set at a preset ratio; training the constructed n-octanol-water partition coefficient prediction model on the training set; verifying the external prediction capability of the model on the validation set; and predicting the n-octanol-water partition coefficient of the target substance with the model. Because the model is built from statistically significant descriptors of compounds structurally similar to the target substance, its prediction capability for sulfamethoxazole substances is improved.

Description

Method for predicting the n-octanol-water partition coefficient of sulfamethoxazole substances

Technical Field

The invention relates to the technical field of ecological risk assessment, and in particular to a method for predicting the n-octanol-water partition coefficient of sulfamethoxazole substances.

Background

Sulfamethoxazole is widely used as an antibacterial drug in humans, but most of the drug is discharged directly into the environment without being metabolized. It can also enter the aquatic environment during drug manufacturing and during the disposal of expired and unused drugs, so its average concentration in municipal sewage is relatively high. Research shows that the n-octanol-water partition coefficient of sulfamethoxazole substances is important for measuring the concentration of sulfamethoxazole in sewage.

The n-octanol-water partition coefficient is the ratio of the concentrations of a compound in the n-octanol phase and the aqueous phase at equilibrium, and reflects the ability of the compound to migrate between the aqueous and organic phases. In theory, directly measuring the n-octanol-water partition coefficient of sulfamethoxazole substances in the laboratory is the most effective approach. However, experimental measurement is tedious and complex to carry out, requires standard samples, and is subject to systematic errors between laboratories. Many researchers have therefore proposed prediction models that estimate the n-octanol-water partition coefficient of a compound from its molecular structure. Although such models are simpler to apply than experimental measurement, existing n-octanol-water partition coefficient prediction models predict poorly for sulfamethoxazole substances and therefore cannot meet the need to predict the n-octanol-water partition coefficients of sulfamethoxazole substances and their derivatives.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention provides a method for predicting the n-octanol-water partition coefficient of sulfamethoxazole substances, which solves the technical problem of poor prediction capability for such substances. The n-octanol-water partition coefficient prediction model is constructed from statistically significant descriptors of compounds structurally similar to the target substance to be tested, and the model is comprehensively verified in terms of goodness of fit, application domain and prediction capability, thereby improving the prediction capability of the model.

To solve the above technical problem, the invention provides the following technical solution: a method for predicting the n-octanol-water partition coefficient of sulfamethoxazole substances, comprising the following steps:

S1. screening compounds structurally similar to the target substance to be tested from experimentally measured data in the published literature, and generating a sample data set;

S2. randomly dividing the sample data set into a training set and a validation set according to the OECD modeling requirements;

S3. constructing an n-octanol-water partition coefficient prediction model from the training set by stepwise multiple linear regression analysis;

S4. verifying, on the validation set, the prediction capability of the n-octanol-water partition coefficient prediction model using the degree-of-freedom-adjusted coefficient of determination $R^2_{adj}$, the root mean square error RMSE and the square of the correlation coefficient;

S5. characterizing the application domain of the n-octanol-water partition coefficient prediction model by Euclidean distance;

S6. obtaining a plurality of target substances to be tested that belong to the sulfamethoxazole class of compounds;

S7. using the n-octanol-water partition coefficient prediction model to predict the n-octanol-water partition coefficient of each target substance separately, and obtaining the corresponding predicted value.

Further, in step S3, constructing the n-octanol-water partition coefficient prediction model by stepwise multiple linear regression analysis comprises the following steps:

S31. optimizing the molecular structures of the compounds in the training set by a semi-empirical molecular orbital method to obtain the optimized lowest-energy structures;

S32. importing the optimized lowest-energy structures into the PaDEL-Descriptor software and calculating a plurality of molecular structure descriptors;

S33. using multiple linear regression to select, from the plurality of molecular structure descriptors, the multiple linear regression model whose variance inflation factors are below a preset threshold and whose degree-of-freedom-adjusted coefficient of determination is largest; this multiple linear regression model is the n-octanol-water partition coefficient prediction model.

Further, in step S33, the degree-of-freedom-adjusted coefficient of determination $R^2_{adj}$ is calculated as follows:

$$R^2_{adj} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 / (n - p - 1)}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2 / (n - 1)}$$

where $y_i$ and $\hat{y}_i$ are the experimental and predicted values of the $i$-th compound, respectively, $\bar{y}$ is the mean of the experimental values of all data points in the training set, $n$ is the number of data points in the training set, and $p$ is the number of descriptors.

Further, the descriptors used to construct the n-octanol-water partition coefficient prediction model are nX and AATS1e, where the descriptor nX is the number of halogen atoms in the molecular structure, and the descriptor AATS1e is an autocorrelation parameter weighted by Sanderson electronegativity that describes the distribution of the extranuclear electrons of each atom in the molecule.

Further, stepwise regression analysis and verification are performed on the $\log K_{OW}$ values of all compounds in the sample data set, giving the following linear relation for the n-octanol-water partition coefficient prediction model:

$$\log K_{OW} = 1.504 \times \mathrm{nX} - 4.907 \times \mathrm{AATS1e} + 39.845$$

where nX is the number of halogen atoms in the molecular structure, and AATS1e is the Sanderson-electronegativity-weighted autocorrelation parameter.

By means of the above technical solution, the invention provides a method for predicting the n-octanol-water partition coefficient of sulfamethoxazole substances, which has at least the following beneficial effects:

1. The invention screens compounds structurally similar to the target substance to be tested from experimentally measured data in the published literature as the sample data set, optimizes the structures of the compounds in the sample data set by a semi-empirical molecular orbital method, imports the optimized structures into the PaDEL-Descriptor software to calculate a plurality of molecular structure descriptors, and uses stepwise multiple linear regression (MLR) analysis to select the two descriptors nX and AATS1e for constructing the n-octanol-water partition coefficient prediction model. The prediction model is comprehensively evaluated in terms of goodness of fit, application domain and mechanism, which improves the accuracy of the prediction results.

2. In accordance with the modeling requirements of the OECD guidelines, the invention randomly divides the screened sample data set, which is structurally similar to the target substance to be tested, into a training set and a validation set at a preset ratio of 3:1, and constructs and verifies the n-octanol-water partition coefficient prediction model, which enhances the stability of the model.

3. The invention uses the adjusted coefficient of determination $R^2_{adj}$ and the RMSE to characterize the goodness of fit of the model, uses the coefficient of determination between the experimental and predicted values of the validation-set compounds and the square of the correlation coefficient to represent the external validation result, and characterizes the application domain of the prediction model by Euclidean distance to ensure that the target substances to be tested all lie within the application range of the constructed prediction model, which further improves the credibility of the prediction results.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:

FIG. 1 is a flow chart of the method for predicting the n-octanol-water partition coefficient of sulfamethoxazole substances;

FIG. 2 is a flow chart of the construction of the n-octanol-water partition coefficient prediction model of the present invention;

FIG. 3 is a plot of the model descriptors against Euclidean distance according to the present invention;

FIG. 4 is a plot of the fit between the experimental and predicted $\log K_{OW}$ values of the training set of the present invention.

Detailed Description

In order that the above objects, features and advantages of the present invention may be more readily understood, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments, so that the way in which the technical means are applied to solve the technical problem and achieve the technical effects can be fully understood and implemented.

Summary of the application

To date, many researchers have used quantitative structure-activity relationship (QSAR) techniques to build prediction models for the n-octanol-water partition coefficients of organic compounds. For example, the EPI Suite software of the United States Environmental Protection Agency uses a molecular fragment method to predict $\log K_{OW}$; this approach works well for simpler compounds, but the prediction error may be larger for more complex compounds. Some patents also provide prediction schemes, for example patent application No. 201210181904.X, entitled "a method for predicting the n-octanol/water partition coefficient of an ionic liquid", which calculates the n-octanol/water partition coefficient from the van der Waals volumes of the atoms in the molecular structural formula of the ionic liquid to be predicted; however, that method is complex to operate and has poor prediction capability. In addition, there is little prior research on predicting the n-octanol-water partition coefficient of sulfamethoxazole substances, and no related patents exist. The application therefore provides a method that constructs the n-octanol-water partition coefficient prediction model from statistically significant descriptors of compounds structurally similar to the target substance to be tested, thereby improving the prediction capability and stability of the model, and that characterizes the application domain of the model, thereby improving the credibility of its predictions.

Examples

Referring to FIGS. 1-4, a specific implementation of this embodiment is shown. In this embodiment, the n-octanol-water partition coefficient prediction model is constructed from the selected descriptors nX and AATS1e, and the model is verified internally and externally using the coefficient of determination, the root mean square error and the square of the correlation coefficient, thereby improving the prediction capability of the model for sulfamethoxazole substances.

As shown in FIG. 1, the method for predicting the n-octanol-water partition coefficient of sulfamethoxazole substances comprises the following steps:

S1. Screening compounds structurally similar to the target substance to be tested from experimentally measured data in the published literature, and generating a sample data set.

To ensure data quality, all n-octanol-water partition coefficient values are taken from experimental measurements in the published scientific literature. In addition, the target substances to be tested belong to the sulfamethoxazole class of compounds, whose structure contains an isoxazole ring, a substituted amino group and a substituted sulfonyl (sulfonamide) group. All three groups contain N or O, and groups containing N or O generally form hydrogen bonds with water molecules easily, which affects the water solubility of the whole molecule. Second, the substituted amino group, halogens, the benzene ring and the isoxazole ring are groups of relatively high electron cloud density, which can change the electron cloud distribution and electronegativity of the whole molecule and thus affect its polarity. Finally, all target substances contain a bulky benzene ring or isoxazole ring, which hinders the interaction between the molecule and water molecules to some extent and therefore also affects water solubility.

Therefore, when experimental data in the published scientific literature are screened, the presence of a substituted amino group, a sulfonamide group, an isoxazole ring, a benzene ring and halogens is used as the criterion, so that the substances used for modeling are structurally closer to the target substances to be tested, which helps ensure the accuracy of the prediction model.

In this embodiment, the sample data set is generated as follows: first, it is determined whether the structure of a substance has the structural unit represented by formula (1) as its basic structure; if so, it is further determined whether a substituent of the substance is an amino group or an isoxazole ring; if so, the substance is added to the sample data set; otherwise, the substance is judged not to meet the requirements. Formula (1) is as follows:

[Structural formula (1) is not reproduced in this text.]

In this embodiment, 76 sulfamethoxazole compounds were collected and compiled according to the above screening procedure; their n-octanol-water partition coefficients $\log K_{OW}$ range from -0.62 to 4.39.
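The two-stage screening rule described above can be sketched in Python. This is only an illustration under stated assumptions: the `Candidate` record, its `has_core_unit` flag and the substituent labels are hypothetical names introduced here, and deciding whether a real structure contains the unit of formula (1) would need a cheminformatics toolkit, which is not shown. The example entries are placeholders, not data from the sample data set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical literature record for one compound."""
    name: str
    has_core_unit: bool        # True if the structure is built on the unit of formula (1)
    substituents: frozenset    # illustrative labels such as "amino" or "isoxazole"
    log_kow: float             # experimentally measured value from the literature


# Substituents that qualify a compound for the sample data set.
ACCEPTED_SUBSTITUENTS = frozenset({"amino", "isoxazole"})


def passes_screen(candidate: Candidate) -> bool:
    """Stage 1: core unit present. Stage 2: amino or isoxazole substituent."""
    if not candidate.has_core_unit:
        return False
    return bool(candidate.substituents & ACCEPTED_SUBSTITUENTS)


def build_sample_set(candidates):
    """Keep only the candidates that pass both screening stages."""
    return [c for c in candidates if passes_screen(c)]


# Placeholder records for illustration only.
records = [
    Candidate("compound_a", True, frozenset({"amino"}), 0.5),
    Candidate("compound_b", True, frozenset({"methyl"}), 1.0),
    Candidate("compound_c", False, frozenset({"isoxazole"}), 2.0),
]
sample_set = build_sample_set(records)
```

Only `compound_a` survives here: `compound_b` has the core unit but no accepted substituent, and `compound_c` lacks the core unit.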

Using compounds structurally similar to the target substance to be tested as the sample data set provides strong data support for the subsequent construction of the model for predicting the n-octanol-water partition coefficient of the target substance, and further improves the prediction accuracy of the model.

S2. Randomly dividing the sample data set into a training set and a validation set according to the OECD modeling requirements.

To meet the modeling requirements of the OECD guidelines, in this embodiment the $\log K_{OW}$ data of the 76 compounds in the sample data set obtained in step S1 are randomly divided into a training set and a validation set at a preset ratio of 3:1. The training set contains 57 data points and is used to build the model; the validation set contains 19 data points and is used for external validation of the model, which enhances stability and gives the model strong predictive capability and robustness.
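As a minimal sketch of the 3:1 random split, the following Python snippet shuffles 76 placeholder compound indices and cuts them into 57 training and 19 validation entries. The function name `split_dataset`, the fixed `seed` and the use of integer indices in place of real compound records are assumptions made for illustration; the patent does not state how the random division was implemented.

```python
import random


def split_dataset(records, train_fraction=0.75, seed=0):
    """Shuffle a copy of `records` and split it into (training, validation)."""
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


compound_ids = list(range(76))           # placeholder for the 76 compounds
training, validation = split_dataset(compound_ids)
print(len(training), len(validation))    # 57 19
```

Fixing the seed is a design choice that makes the division repeatable; a different seed gives a different but equally valid 57/19 split.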

S3. Constructing the n-octanol-water partition coefficient prediction model from the training set by stepwise multiple linear regression analysis. The construction of the model is further described below with reference to FIG. 2:

S31. Optimizing the molecular structures of the compounds in the training set by a semi-empirical molecular orbital method to obtain the optimized lowest-energy structures.

In this embodiment, the molecular structures of the compounds in the training set are optimized with the PM7 method in the MOPAC 2016 software package, with the keywords CHARGE=0 EF GNORM=0.0100 SHIFT=80 added, to obtain the optimized lowest-energy structures.

S32. Importing the optimized lowest-energy structures into the PaDEL-Descriptor software and calculating a plurality of molecular structure descriptors.

In this embodiment, the calculations of steps S31 and S32 yield a total of 1444 descriptors in 63 classes.

It should be noted that a descriptor is a symbolic representation of the structural information or experimental properties of a chemical molecule. Descriptors can be divided into three categories, 1D, 2D and 3D, which represent chemical composition, topology, and 3D shape and function, respectively. A descriptor may be a simple feature, such as molecular volume, or a complex one, such as 3D-MoRSE, which encodes various physicochemical and structural characteristics of the compound. Descriptors can be used to build quantitative structure-activity relationship (QSAR) models to predict the biological activity of new compounds.

In this embodiment, stepwise multiple linear regression (stepwise MLR) analysis in the SPSS 19.0 software is used to select and test the 1444 descriptors in turn and to calculate the significance of each descriptor for the model being built. When the test gives a significance value below 0.05, the descriptor is considered significant for model construction; it is then checked whether the variance inflation factor (VIF) of the descriptor is below the preset threshold of 10. If it is, the significant descriptor is retained and takes part in model construction.
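The retention rule above (significance below 0.05 and VIF below 10) can be sketched in Python for the two-descriptor case. For a model with exactly two descriptors, the VIF of either one reduces to $1/(1 - r^2)$, where $r$ is the Pearson correlation between the two descriptor columns. The function names and the sample columns are illustrative assumptions; the embodiment performs this step in SPSS, and the p-values passed to `keep_descriptor` would come from that regression output rather than from this snippet.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def vif_two_descriptors(x1, x2):
    """VIF shared by both descriptors of a two-descriptor model: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


def keep_descriptor(p_value, vif, alpha=0.05, vif_limit=10.0):
    """Retain a descriptor only if it is significant and not collinear."""
    return p_value < alpha and vif < vif_limit


# Illustrative columns: uncorrelated descriptors give VIF = 1,
# nearly collinear descriptors give a VIF far above the limit of 10.
uncorrelated = vif_two_descriptors([0, 1, 0, 1], [0, 0, 1, 1])
collinear = vif_two_descriptors([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.9])
```

With more than two descriptors, each VIF would instead require regressing one descriptor on all the others; that general case is not shown here.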

Finally, the two descriptors nX and AATS1e used to construct the n-octanol-water partition coefficient prediction model are obtained by screening the 1444 descriptors, and stepwise regression analysis and verification on the $\log K_{OW}$ data of all compounds in the sample data set give the following linear relation for the n-octanol-water partition coefficient prediction model:

$$\log K_{OW} = 1.504 \times \mathrm{nX} - 4.907 \times \mathrm{AATS1e} + 39.845$$

In the above formula, nX is the number of halogen atoms in the molecular structure, and AATS1e is the Sanderson-electronegativity-weighted autocorrelation parameter.
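The linear relation can be evaluated directly once the two descriptor values are known. The sketch below encodes the published coefficients; the function name `predict_log_kow` and the descriptor values in the usage lines are hypothetical and are not taken from the sample data set or from the six target substances.

```python
def predict_log_kow(n_x: int, aats1e: float) -> float:
    """Evaluate log K_OW = 1.504 * nX - 4.907 * AATS1e + 39.845."""
    return 1.504 * n_x - 4.907 * aats1e + 39.845


# Hypothetical descriptor values, for illustration only.
no_halogen = predict_log_kow(n_x=0, aats1e=8.0)
one_halogen = predict_log_kow(n_x=1, aats1e=8.0)
```

At a fixed AATS1e, each additional halogen atom raises the predicted $\log K_{OW}$ by 1.504, while a larger AATS1e lowers it.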

It should be noted that the descriptor nX (Number of halogen atoms (F, Cl, Br, I, At, Uus)) is the number of halogen atoms in the molecular structure. Because halogens are strongly electronegative, introducing a halogen, whether on the amino group or on the substituted aromatic ring, affects the electron cloud distribution and polarity of the whole molecule, and thus its water solubility and $K_{OW}$ value.

The other descriptor, AATS1e (Average Broto-Moreau autocorrelation - lag 1 / weighted by Sanderson electronegativities), is an autocorrelation parameter weighted by Sanderson electronegativity that describes the distribution of the extranuclear electrons of each atom in the molecule. The extranuclear electron distribution directly determines the overall polarity of the molecule; therefore, by the "like dissolves like" principle, electronegativity is considered to have a considerable effect on the $K_{OW}$ of halogen-substituted sulfamethoxazole substances.

S4. Verifying the prediction capability of the n-octanol-water partition coefficient prediction model.

In this embodiment, the internal and external prediction capabilities of the n-octanol-water partition coefficient prediction model are verified using the degree-of-freedom-adjusted coefficient of determination $R^2_{adj}$, the root mean square error RMSE and the square of the correlation coefficient.

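The coefficient of determination $R^2_{adj}$ is calculated as follows:

$$R^2_{adj} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 / (n - p - 1)}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2 / (n - 1)}$$

In the above formula, $y_i$ and $\hat{y}_i$ are the experimental and predicted values of the $i$-th compound, respectively, $\bar{y}$ is the mean of the experimental values of all data points in the training set, and $p$ is the number of descriptors.

The root mean square error RMSE and the square of the correlation coefficient $Q^2_{ext}$ are calculated as follows:

$$RMSE = \sqrt{\frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{n}}$$

$$Q^2_{ext} = 1 - \frac{\sum_{i=1}^{n_{ext}}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n_{ext}}\left(y_i - \bar{y}_{ext}\right)^2}$$

In the above formulas, $y_i$ and $\hat{y}_i$ are the experimental and predicted values of the $i$-th compound, respectively, $\bar{y}_{ext}$ is the mean of the experimental values of all data points in the validation set, $n$ is the number of data points in the training set, and $n_{ext}$ is the number of data points in the validation set.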
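The three validation statistics can be computed with a few lines of Python. This sketch assumes the standard textbook forms of the adjusted coefficient of determination, the RMSE and an external squared correlation coefficient referenced to the validation-set mean, consistent with the symbol definitions above; the function names and the small example vectors are illustrative and are not the patent's data.

```python
import math


def adjusted_r2(y_true, y_pred, n_descriptors):
    """Degree-of-freedom-adjusted coefficient of determination."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - (ss_res / (n - n_descriptors - 1)) / (ss_tot / (n - 1))


def rmse(y_true, y_pred):
    """Root mean square error between experimental and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)


def q2_ext(y_ext, y_ext_pred):
    """External squared correlation coefficient, referenced to the validation mean."""
    mean_ext = sum(y_ext) / len(y_ext)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_ext, y_ext_pred))
    ss_tot = sum((y - mean_ext) ** 2 for y in y_ext)
    return 1.0 - ss_res / ss_tot


# Illustrative values only.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
fit = adjusted_r2(observed, predicted, n_descriptors=2)
error = rmse(observed, predicted)
external = q2_ext(observed, predicted)
```

For these example vectors the residual sum of squares is 0.10 and the total sum of squares is 10, so the adjusted coefficient of determination is 0.98 with two descriptors.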

Specifically, calculation according to the above formulas gives the following results in this embodiment:

Training set ($n_{train} = 57$): coefficient of determination [value not reproduced in this text], root mean square error $RMSE_{tra} = 0.009$, square of the correlation coefficient [value not reproduced in this text].

Validation set ($n_{ext} = 19$): coefficient of determination [value not reproduced in this text], root mean square error $RMSE_{ext} = 0.187$, square of the correlation coefficient [value not reproduced in this text].

As can be seen from the above data, the values of [symbol not reproduced] and [symbol not reproduced] are both greater than 0.85, which indicates that the prediction model fits well and is stable; the value of [symbol not reproduced] is greater than 0.8, which indicates that the prediction model has good predictive capability; and the difference between [symbol not reproduced] and [symbol not reproduced] is much less than 0.3, which indicates that the prediction model is not overfitted. FIG. 4 shows the fit between the experimental and predicted $\log K_{OW}$ values.

S5. Characterizing the application domain of the n-octanol-water partition coefficient prediction model by Euclidean distance.

As shown in FIG. 3, the two axes are the two descriptors of the model, nX and AATS1e, and the background represents the Euclidean distance of a substance; according to the legend on the left of the figure, the application domain of the model corresponds to Euclidean distances between 0 and 1.2. All substances in the training and validation sets lie within the application domain, and the substances to be predicted also lie within the application range of the model, which shows that the prediction results are reliable. The Euclidean distance is calculated as follows:

[The Euclidean distance formula is not reproduced in this text.]

In the above formula, $x_i$ is the value of the molecular structure descriptor of the $i$-th compound, and $\bar{x}$ is the mean of the molecular structure descriptors.

According to the embodiment, the Euclidean distance is adopted to characterize the application domain of the model, so that the target substances to be detected are ensured to be in the application range of the constructed prediction model, and the credibility of the prediction result of the n-octanol water distribution coefficient is further improved.
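An application-domain check of this kind can be sketched as a distance-to-centroid test. The snippet below is an assumption-laden illustration, not the patent's formula, which is not legible in this text: it takes the distance of a compound to be the Euclidean distance between its descriptor vector and the mean descriptor vector of the training set, uses the 1.2 limit stated above, and assumes the descriptor values have already been put on a scale for which that limit is meaningful. The function names and the numeric rows are hypothetical.

```python
import math


def centroid(rows):
    """Column-wise mean of a list of descriptor vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def distance_to_centroid(row, center):
    """Euclidean distance between one descriptor vector and the centroid."""
    return math.dist(row, center)


def in_application_domain(row, center, limit=1.2):
    """True if the compound lies within the distance limit of the domain."""
    return distance_to_centroid(row, center) <= limit


# Hypothetical, already-scaled (nX, AATS1e) descriptor pairs.
training_rows = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
center = centroid(training_rows)
inside = in_application_domain([1.0, 2.0], center)
outside = in_application_domain([3.0, 3.0], center)
```

A target substance whose distance exceeds the limit would be flagged as outside the domain, and its predicted value would be treated as less reliable.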

S6. Obtaining a plurality of target substances to be tested that belong to the sulfamethoxazole class of compounds.

In this embodiment, there are six target substances to be tested, each of which belongs to the sulfamethoxazole class of compounds.

S7. Using the n-octanol-water partition coefficient prediction model to predict the n-octanol-water partition coefficient of each target substance separately, and obtaining the corresponding predicted value.

Substituting the descriptor values of a target substance into the linear relation of the n-octanol-water partition coefficient prediction model gives the corresponding predicted n-octanol-water partition coefficient. The whole process is simple to carry out and easy to interpret, which improves the efficiency and accuracy of the prediction and the credibility of the model's results.

In this embodiment, the model constructed above is used to predict the n-octanol-water partition coefficients of the six given target substances; that is, the descriptor values of the six target substances are substituted in turn into the linear relation of the model to obtain the corresponding predicted values. The detailed results are shown in the following table:

[The results table is not reproduced in this text.]

The Euclidean distances of the six target substances, calculated according to the formula, are all less than 1.2, which indicates that the target compounds lie within the application range of the model and that the prediction results are reliable.

In this embodiment, compounds structurally similar to the target substance to be tested are first screened from experimentally measured data in the published literature as the sample data set, which provides strong data support for the subsequent construction of the model for predicting the n-octanol-water partition coefficient of the target substance. The sample data set is randomly divided into a training set and a validation set according to the modeling requirements of the OECD guidelines, which enhances the stability of the prediction model. Next, the structures of the compounds in the sample data set are optimized by a semi-empirical molecular orbital method, the optimized structures are imported into the PaDEL-Descriptor software to calculate a plurality of molecular structure descriptors, and stepwise multiple linear regression (MLR) analysis is used to select the two significant descriptors nX and AATS1e and to construct the n-octanol-water partition coefficient prediction model; the resulting model is simpler and more robust, which improves its prediction accuracy and efficiency. Finally, the constructed prediction model is verified internally and externally using the coefficient of determination, the root mean square error and the square of the correlation coefficient, its application domain is characterized by Euclidean distance, and the model is comprehensively evaluated in terms of goodness of fit, application domain and mechanism, ensuring that the target substances to be tested lie within the application range of the constructed model and thereby improving the reliability of the prediction results.

Those of ordinary skill in the art will appreciate that all or a portion of the steps in a method of implementing an embodiment described above may be implemented by a program to instruct related hardware, and thus the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The foregoing embodiments describe the invention in detail, and specific examples are used herein to explain the principles and implementations of the invention; the description of the embodiments is intended only to facilitate understanding of the method of the invention and its core concepts. Meanwhile, those skilled in the art may, in accordance with the ideas of the invention, make changes to the specific embodiments and the scope of application. In view of the above, this description should not be construed as limiting the invention.

Claims

1. A method for predicting the n-octanol-water partition coefficient of sulfamethoxazole substances, characterized by comprising the following steps:

S4. based on the validation set, using the degree-of-freedom-adjusted coefficient of determination $R^2_{adj}$, the root mean square error RMSE and the square of the correlation coefficient

2. The method for predicting the n-octanol-water partition coefficient of sulfamethoxazole substances according to claim 1, characterized in that, in step S3, constructing the n-octanol-water partition coefficient prediction model by stepwise multiple linear regression analysis comprises the following steps:

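3. The method for predicting the n-octanol-water partition coefficient of sulfamethoxazole substances according to claim 2, characterized in that, in step S33, the degree-of-freedom-adjusted coefficient of determination $R^2_{adj}$ is calculated as follows:

$$R^2_{adj} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 / (n - p - 1)}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2 / (n - 1)}$$

where $y_i$ and $\hat{y}_i$ are the experimental and predicted values of the $i$-th compound, respectively, $\bar{y}$ is the mean of the experimental values of all data points in the training set, $n$ is the number of data points in the training set, and $p$ is the number of descriptors.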

4. The method for predicting the n-octanol-water partition coefficient of sulfamethoxazole substances according to claim 2, characterized in that the descriptors used to construct the n-octanol-water partition coefficient prediction model are nX and AATS1e, where the descriptor nX is the number of halogen atoms in the molecular structure, and the descriptor AATS1e is an autocorrelation parameter weighted by Sanderson electronegativity that describes the distribution of the extranuclear electrons of each atom in the molecule.

5. The method for predicting the n-octanol-water partition coefficient of sulfamethoxazole substances according to claim 4, characterized in that stepwise regression analysis and verification are performed on the $\log K_{OW}$ values of all compounds in the sample data set, giving the following linear relation for the n-octanol-water partition coefficient prediction model:

$$\log K_{OW} = 1.504 \times \mathrm{nX} - 4.907 \times \mathrm{AATS1e} + 39.845$$