CN116312854A - Method for predicting n-octanol water distribution coefficient of sulfamethoxazole substances - Google Patents


Info

Publication number
CN116312854A
Authority
CN
China
Prior art keywords
water distribution
octanol water
distribution coefficient
prediction model
octanol
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310201931.7A
Other languages
Chinese (zh)
Inventor
宋敏
刘羽晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yile Standard Technology Co ltd
Original Assignee
Hangzhou Yile Standard Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yile Standard Technology Co ltd
Priority to CN202310201931.7A
Publication of CN116312854A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C10/00 Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Tourism & Hospitality (AREA)
  • Mathematical Physics (AREA)
  • Educational Administration (AREA)
  • Health & Medical Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Chemical & Material Sciences (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Pure & Applied Mathematics (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Evolutionary Biology (AREA)
  • Primary Health Care (AREA)
  • Probability & Statistics with Applications (AREA)
  • Spectroscopy & Molecular Physics (AREA)

Abstract

The invention relates to the technical field of ecological risk assessment and addresses the technical problem of poor prediction capability. In particular, it relates to a method for predicting the n-octanol/water partition coefficient of sulfamethoxazole substances, comprising the following steps: screening compounds structurally similar to the target substance to be tested from experimental measurement data in published literature to generate a sample data set; randomly dividing the sample data set into a training set and a validation set according to a preset ratio; training the constructed n-octanol/water partition coefficient prediction model on the training set; verifying the external prediction capability of the model on the validation set; and predicting the n-octanol/water partition coefficient of the target substance with the model. Because the prediction model is constructed from statistically significant descriptors of compounds structurally similar to the target substance, its prediction capability for sulfamethoxazole substances is improved.

Description

Method for predicting n-octanol water distribution coefficient of sulfamethoxazole substances
Technical Field
The invention relates to the technical field of ecological risk assessment, in particular to a method for predicting the n-octanol/water partition coefficient of sulfamethoxazole substances.
Background
Sulfamethoxazole substances are widely used as antibacterial drugs in humans, but most of the drug is discharged directly into the environment without being metabolized. The drugs can also enter the aquatic environment during manufacturing and during the disposal of expired and unused drugs, so their average concentration in municipal sewage is relatively high. Research shows that the n-octanol/water partition coefficient of sulfamethoxazole substances is important for measuring the concentration of sulfamethoxazole in sewage.
The n-octanol/water partition coefficient is the ratio of the concentrations of a compound in the n-octanol phase and the aqueous phase at equilibrium, and reflects the ability of the compound to migrate between the aqueous and organic phases. In theory, direct laboratory measurement of the n-octanol/water partition coefficient of sulfamethoxazole substances is the most effective method. However, experimental measurement is tedious and complicated to perform, requires standard samples, and data from different laboratories carry systematic errors. Many researchers have therefore proposed prediction models that estimate the partition coefficient of a compound from its molecular structure information. Although such models simplify the procedure compared with experimental measurement, existing n-octanol/water partition coefficient prediction models predict poorly for sulfamethoxazole substances, and so cannot meet the demand for predicting the partition coefficients of sulfamethoxazole substances and their derivatives.
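The definition above can be written compactly as follows. The symbols $C_{\mathrm{octanol}}$ and $C_{\mathrm{water}}$ are introduced here only for illustration and denote the equilibrium concentrations of the compound in the two phases; the logarithmic form is the quantity the prediction model works with.

$$K_{OW} = \frac{C_{\mathrm{octanol}}}{C_{\mathrm{water}}}, \qquad \log K_{OW} = \log_{10}\!\left(\frac{C_{\mathrm{octanol}}}{C_{\mathrm{water}}}\right)$$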
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides a method for predicting the n-octanol/water partition coefficient of sulfamethoxazole substances, which solves the technical problem of poor prediction capability for such substances. The prediction model is constructed from statistically significant descriptors of compounds structurally similar to the target substance to be tested, and is comprehensively verified in terms of goodness of fit, application domain and prediction capability, thereby improving the prediction capability of the model.
To solve the above technical problem, the invention provides the following technical solution: a method for predicting the n-octanol/water partition coefficient of sulfamethoxazole substances, comprising the following steps:
S1, screening compounds structurally similar to the target substance to be tested from experimental measurement data in published literature, and generating a sample data set;
S2, randomly dividing the sample data set into a training set and a validation set according to the modeling requirements of the OECD;
S3, constructing an n-octanol/water partition coefficient prediction model from the training set by stepwise multiple linear regression analysis;
S4, verifying, according to the validation set, the prediction capability of the n-octanol/water partition coefficient prediction model by means of the degrees-of-freedom-adjusted coefficient of determination $R^2_{adj}$, the root mean square error RMSE and the squared correlation coefficient $Q^2_{ext}$;
S5, characterizing the application domain of the n-octanol/water partition coefficient prediction model according to the Euclidean distance;
S6, obtaining a plurality of target substances to be tested that belong to the sulfamethoxazole compounds;
S7, using the n-octanol/water partition coefficient prediction model to predict the n-octanol/water partition coefficient of each target substance separately, and obtaining the corresponding predicted values.
Further, in step S3, constructing the n-octanol/water partition coefficient prediction model by stepwise multiple linear regression analysis comprises the following steps:
S31, optimizing the molecular structures of the compounds in the training set by a semi-empirical molecular orbital method to obtain the optimized lowest-energy structures;
S32, importing the optimized lowest-energy structures into the PaDEL-Descriptor software and calculating a plurality of molecular structure descriptors;
S33, selecting from the plurality of molecular structure descriptors, by multiple linear regression, the multiple linear regression model whose variance inflation factor (VIF) is below a preset threshold and whose degrees-of-freedom-adjusted coefficient of determination is largest; this multiple linear regression model is the n-octanol/water partition coefficient prediction model.
Further, in step S33, the degrees-of-freedom-adjusted coefficient of determination $R^2_{adj}$ is calculated as:
$$R^2_{adj} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 / (n - p - 1)}{\sum_{i=1}^{n}(y_i - \bar{y})^2 / (n - 1)}$$
where $y_i$ and $\hat{y}_i$ are the experimental and predicted values of the i-th compound, respectively, $\bar{y}$ is the mean of the experimental values of all data points in the training set, $n$ is the number of data points in the training set, and $p$ is the number of descriptors.
Further, the descriptors used to construct the n-octanol/water partition coefficient prediction model are nX and AATS1e, where nX is the number of halogen atoms in the molecular structure and AATS1e is an autocorrelation parameter weighted by Sanderson electronegativity, which describes the extranuclear electron distribution of each atom in the molecule.
Further, stepwise regression analysis and verification are performed on the log K_OW values of all compounds in the sample data set, giving the following linear relation for the n-octanol/water partition coefficient prediction model:
$$\log K_{OW} = 1.504 \times \mathrm{nX} - 4.907 \times \mathrm{AATS1e} + 39.845$$
where nX is the number of halogen atoms in the molecular structure and AATS1e is the Sanderson-electronegativity-weighted autocorrelation parameter.
By means of the above technical solution, the invention provides a method for predicting the n-octanol/water partition coefficient of sulfamethoxazole substances, which has at least the following beneficial effects:
1. In the invention, compounds structurally similar to the target substance to be tested are screened from experimental measurement data in published literature to form the sample data set; the structures of these compounds are optimized by a semi-empirical molecular orbital method; the optimized structures are imported into the PaDEL-Descriptor software to calculate a plurality of molecular structure descriptors; and the two descriptors nX and AATS1e used to construct the n-octanol/water partition coefficient prediction model are selected by stepwise multiple linear regression (MLR) analysis. The prediction model is comprehensively evaluated in terms of goodness of fit, application domain and mechanism, which improves the accuracy of the prediction results.
2. In the invention, following the modeling requirements of the OECD guidelines, the screened sample data set of compounds structurally similar to the target substance is randomly divided into a training set and a validation set at a preset ratio of 3:1, and the n-octanol/water partition coefficient prediction model is constructed and verified on them, which enhances the stability of the model.
3. The invention characterizes the goodness of fit of the model with the adjusted coefficient of determination $R^2_{adj}$ and the RMSE, expresses the external validation result with the coefficient of determination between the experimental and predicted values of the validation-set compounds and the squared correlation coefficient $Q^2_{ext}$, and characterizes the application domain of the prediction model with the Euclidean distance. This ensures that all target substances to be tested lie within the application range of the constructed prediction model, and thereby improves the credibility of the prediction results.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flow chart of the method for predicting the n-octanol/water partition coefficient of sulfamethoxazole substances according to the invention;
FIG. 2 is a flow chart of the construction of the n-octanol/water partition coefficient prediction model according to the invention;
FIG. 3 is a plot of the Euclidean distance against the model descriptors according to the invention;
FIG. 4 is a schematic of the fit between the experimental and predicted log K_OW values of the training set according to the invention.
Detailed Description
In order that the above objects, features and advantages of the invention may be more readily understood, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments, so that the way in which the technical means are applied to solve the technical problem and achieve the technical effect can be fully understood and implemented.
Summary of the application
To date, many researchers have successfully used quantitative structure-activity relationship (QSAR) techniques to build prediction models for the n-octanol/water partition coefficients of organic compounds. For example, the EPI Suite software of the United States Environmental Protection Agency predicts the log K_OW of a compound by a molecular fragment method. Although this approach works well for simpler compounds, its prediction error may be larger for more complex ones. Some patents also provide prediction schemes, such as patent application No. 201210181904.X, a method for predicting the n-octanol/water partition coefficient of ionic liquids, which calculates the coefficient from the van der Waals volumes of the atoms in the molecular structural formula of the ionic liquid to be predicted; however, that method is complicated to operate and has poor prediction capability. In addition, the prior art contains little research on predicting the n-octanol/water partition coefficient of sulfamethoxazole substances, and there are no related patents. The application therefore provides a method that constructs the n-octanol/water partition coefficient prediction model from statistically significant descriptors of compounds structurally similar to the target substance to be tested, which improves the prediction capability and stability of the model, characterizes its application domain, and improves the credibility of its prediction results.
Examples
Referring to FIGS. 1-4, which show a specific implementation of this embodiment: the n-octanol/water partition coefficient prediction model is constructed from the selected descriptors nX and AATS1e, and is validated internally and externally with the coefficient of determination, the root mean square error and the squared correlation coefficient, which improves the prediction capability of the model for sulfamethoxazole substances.
As shown in FIG. 1, the method for predicting the n-octanol water partition coefficient of the sulfamethoxazole substance comprises the following steps:
S1, screening compounds structurally similar to the target substance to be tested from experimental measurement data in published literature, and generating a sample data set.
To ensure data quality, all n-octanol/water partition coefficient values are taken from experimental measurements in published scientific literature. The target substances to be tested are sulfamethoxazole compounds, whose structures contain an isoxazole ring, a substituted amino group and a substituted sulfonyl (sulfonamide) group. First, all three groups contain N or O, and groups containing N or O generally form hydrogen bonds with water molecules easily, which affects the water solubility of the whole molecule. Second, the substituted amino group, halogens, the benzene ring and the isoxazole ring are groups of high electron cloud density, which can change the electron cloud distribution and electronegativity of the whole molecule and thus affect its polarity. Finally, all target substances to be tested contain a bulky benzene ring or isoxazole ring, which hinders the interaction between the molecule and water molecules to some extent and so also affects water solubility.
Therefore, when screening experimental measurement data in published scientific literature, the presence of a substituted amino group, a sulfonamide group, an isoxazole ring, a benzene ring and halogen is used as the criterion, so that the substances used for modeling are structurally closer to the target substances to be tested, which helps ensure the accuracy of the prediction model.
In this embodiment, the sample data set is generated as follows: first, it is judged whether the structure of a substance has the structural unit represented by formula (1) as its basic structure; if so, it is then judged whether the substituent of the substance is an amino group or an isoxazole ring; if so, the substance is placed in the sample data set; otherwise, the substance is judged not to meet the requirement. Formula (1) is as follows:
Figure BDA0004109301430000061
In this embodiment, 76 sulfamethoxazole compounds were collected and collated according to the above screening procedure; their n-octanol/water partition coefficients, log K_OW, range from -0.62 to 4.39.
Using compounds structurally similar to the target substance to be tested as the sample data set provides strong data support for the subsequent construction of the model that predicts the n-octanol/water partition coefficient of the target substance, and thereby improves the prediction accuracy of the model.
S2, randomly dividing the sample data set into a training set and a verification set according to the modeling requirement of the OECD.
To meet the modeling requirements of the OECD guidelines, in this embodiment the log K_OW data of the 76 samples selected in step S1 are randomly divided into a training set and a validation set at a preset ratio of 3:1. The training set contains 57 data points and is used to build the model; the validation set contains 19 data points and is used for external validation of the model, which enhances stability so that the model has strong prediction capability and robustness.
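The 3:1 random split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used in the embodiment: the function name `split_dataset`, the fixed seed and the use of integer indices in place of compound records are all hypothetical.

```python
import random


def split_dataset(samples, train_parts=3, valid_parts=1, seed=0):
    """Randomly split samples into a training set and a validation set.

    The 3:1 ratio comes from the text; the seed is an assumption added
    only to make this illustration reproducible.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_parts // (train_parts + valid_parts)
    return shuffled[:n_train], shuffled[n_train:]


# 76 placeholder indices stand in for the 76 compound records.
training_set, validation_set = split_dataset(range(76))
print(len(training_set), len(validation_set))  # 57 19
```

With 76 samples the integer arithmetic gives 57 training and 19 validation entries, matching the counts stated in the text.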
S3, constructing an n-octanol/water partition coefficient prediction model from the training set by stepwise multiple linear regression analysis. The construction of the prediction model is further described below with reference to FIG. 2:
S31, optimizing the molecular structures of the compounds in the training set by a semi-empirical molecular orbital method to obtain the optimized lowest-energy structures.
In this embodiment, the molecular structures of the compounds in the training set are optimized with the PM7 method in the MOPAC 2016 software package, with the keywords CHARGE=0 EF GNORM=0.0100 SHIFT=80 added, to obtain the optimized lowest-energy structures.
S32, importing the optimized lowest-energy structures into the PaDEL-Descriptor software and calculating a plurality of molecular structure descriptors.
In this embodiment, the calculations of steps S31 and S32 yield a total of 1444 descriptors in 63 classes.
It should be noted that a descriptor is a symbolic representation of the structural information or experimental properties of a chemical molecule. Descriptors fall into three categories, 1D, 2D and 3D, which represent chemical composition, topology, and three-dimensional shape and functionality, respectively. A descriptor may be a simple feature, such as molecular volume, or a complex one, such as 3D-MoRSE, and covers various physicochemical and structural characteristics of the compound. Descriptors can be used to build quantitative structure-activity relationship (QSAR) models that predict the biological activity of new compounds.
S33, selecting from the plurality of molecular structure descriptors, by multiple linear regression, the multiple linear regression model whose variance inflation factor (VIF) is below a preset threshold and whose degrees-of-freedom-adjusted coefficient of determination is largest; this multiple linear regression model is the n-octanol/water partition coefficient prediction model.
In this embodiment, the stepwise multiple linear regression analysis (stepwise MLR analysis) in the SPSS 19.0 software is used to select and test the 1444 descriptors one by one and to calculate the significance of each descriptor for the model under construction. When the test gives a significance value below 0.05, the descriptor is considered significant for the model; it is then checked whether the variance inflation factor (VIF) of the descriptor is below the preset threshold of 10. If it is, the significant descriptor is retained and takes part in building the model.
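The retention rule above (significance below 0.05 and VIF below 10) can be illustrated with a short sketch. It is not the SPSS procedure: it only covers the two-descriptor case, where the VIF of either descriptor reduces to 1 / (1 - r^2) with r the Pearson correlation between the two descriptors. The function names and the idea of passing in a precomputed p-value are assumptions made for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_descriptors(x1, x2):
    """VIF shared by both descriptors of a two-descriptor model."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


def keep_descriptor(p_value, vif, alpha=0.05, vif_limit=10.0):
    """Retention rule from the text: significant and not collinear."""
    return p_value < alpha and vif < vif_limit


# Toy descriptor columns (not data from the patent).
vif = vif_two_descriptors([1, 2, 3, 4], [1, -1, 1, -1])
print(keep_descriptor(p_value=0.01, vif=vif))  # True
```

For models with more than two descriptors the VIF of each descriptor would instead come from regressing it on all the others, which this sketch does not attempt.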
Finally, screening the 1444 descriptors yields the two descriptors nX and AATS1e used to construct the n-octanol/water partition coefficient prediction model. Stepwise regression analysis and verification on the log K_OW data of all compounds in the sample data set give the following linear relation for the prediction model:
$$\log K_{OW} = 1.504 \times \mathrm{nX} - 4.907 \times \mathrm{AATS1e} + 39.845$$
In the above formula, nX is the number of halogen atoms in the molecular structure and AATS1e is the Sanderson-electronegativity-weighted autocorrelation parameter.
It should be noted that the descriptor nX (Number of halogen atoms (F, Cl, Br, I, At, Uus)) is the number of halogen atoms in the molecular structure. Whether a halogen is introduced on the amino group or on the substituted aromatic ring, its high electronegativity affects the electron cloud distribution and polarity of the whole molecule, and thus the water solubility and the K_OW value.
The other descriptor, AATS1e (Average Broto-Moreau autocorrelation - lag 1 / weighted by Sanderson electronegativities), is an autocorrelation parameter weighted by Sanderson electronegativity that describes the extranuclear electron distribution of each atom in the molecule. The extranuclear electron distribution directly affects the overall polarity of the molecule, so, by the "like dissolves like" principle, electronegativity is considered to have a large effect on the K_OW of halogen-substituted sulfamethoxazole substances.
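As a worked illustration of the linear relation, the sketch below evaluates it for one set of descriptor values. The coefficients are those given in the text; the function name `predict_log_kow` and the example descriptor values (nX = 1, AATS1e = 8.0) are hypothetical and do not correspond to any compound in the patent.

```python
def predict_log_kow(n_halogen, aats1e):
    """Evaluate log K_OW = 1.504*nX - 4.907*AATS1e + 39.845."""
    return 1.504 * n_halogen - 4.907 * aats1e + 39.845


# Hypothetical descriptor values, chosen only to show the arithmetic:
# 1.504*1 - 4.907*8.0 + 39.845 = 1.504 - 39.256 + 39.845 = 2.093
print(round(predict_log_kow(1, 8.0), 3))  # 2.093
```

Because the AATS1e coefficient is large relative to the intercept, small changes in AATS1e shift the prediction noticeably, so descriptor values should be computed with the same software and settings used to build the model.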
S4, verifying, according to the validation set, the prediction capability of the n-octanol/water partition coefficient prediction model by means of the degrees-of-freedom-adjusted coefficient of determination $R^2_{adj}$, the root mean square error RMSE and the squared correlation coefficient $Q^2_{ext}$.
In this embodiment, the degrees-of-freedom-adjusted coefficient of determination $R^2_{adj}$, the root mean square error RMSE and the squared correlation coefficient $Q^2_{ext}$ are used to verify the internal and external prediction capability of the n-octanol/water partition coefficient prediction model.
The coefficient of determination $R^2_{adj}$ is calculated as:
$$R^2_{adj} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 / (n - p - 1)}{\sum_{i=1}^{n}(y_i - \bar{y})^2 / (n - 1)}$$
In the above formula, $y_i$ and $\hat{y}_i$ are the experimental and predicted values of the i-th compound, respectively, $\bar{y}$ is the mean of the experimental values of all data points in the training set, $n$ is the number of data points in the training set, and $p$ is the number of descriptors.
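The adjusted coefficient of determination can be computed directly from its definition, as in the sketch below. It assumes the standard degrees-of-freedom-adjusted form, which is what the variable definitions in the text (n data points, p descriptors) imply; the function name and the toy numbers are illustrative only.

```python
def adjusted_r2(y_true, y_pred, n_descriptors):
    """Degrees-of-freedom-adjusted coefficient of determination.

    y_true: experimental values, y_pred: predicted values,
    n_descriptors: number of descriptors p in the model.
    """
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - (ss_res / (n - n_descriptors - 1)) / (ss_tot / (n - 1))


# Toy values (not patent data): residual sum 0.10, total sum 10, n = 5, p = 2.
print(round(adjusted_r2([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0], 2), 4))  # 0.98
```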
The root mean square error RMSE and the squared correlation coefficient $Q^2_{ext}$ are calculated as:
$$RMSE = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n}}$$
$$Q^2_{ext} = 1 - \frac{\sum_{i=1}^{n_{ext}}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n_{ext}}(y_i - \bar{y}_{ext})^2}$$
In the above formulas, $y_i$ and $\hat{y}_i$ are the experimental and predicted values of the i-th compound, respectively, $\bar{y}_{ext}$ is the mean of the experimental values of all data points in the validation set, $n$ is the number of data points in the training set, and $n_{ext}$ is the number of data points in the validation set.
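The two statistics can be sketched in a few lines. The exact form of the external statistic in the original formula images is not recoverable from the text, so the version below is an assumption: it uses the validation-set mean as the reference, following the variable definitions given above. The function names and the toy values are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between experimental and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(y_true, y_pred)) / n)


def q2_ext(y_true, y_pred):
    """External squared correlation statistic on the validation set.

    Assumed form: 1 - SS_res / SS_tot, with SS_tot taken about the
    mean of the validation-set experimental values.
    """
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Toy validation data (not patent data).
y_exp = [1, 2, 3, 4, 5]
y_hat = [1.1, 1.9, 3.2, 3.8, 5.0]
print(round(rmse(y_exp, y_hat), 4), round(q2_ext(y_exp, y_hat), 4))  # 0.1414 0.99
```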
Specifically, calculation according to the above formulas gives, for this embodiment:
training set (n_train = 57): coefficient of determination (value given in Figure BDA0004109301430000099), root mean square error RMSE_tra = 0.009, squared correlation coefficient (value given in Figure BDA00041093014300000910);
validation set (n_ext = 19): coefficient of determination (value given in Figure BDA00041093014300000911), root mean square error RMSE_ext = 0.187, squared correlation coefficient (value given in Figure BDA00041093014300000912).
As can be seen from the above data, the values of the statistics shown in Figure BDA00041093014300000913 and Figure BDA00041093014300000914 are both greater than 0.85, indicating that the prediction model has a good fit and strong stability; the value of the statistic shown in Figure BDA00041093014300000915 is greater than 0.8, indicating that the prediction model has good prediction capability; and the difference between the statistics shown in Figure BDA00041093014300000916 and Figure BDA00041093014300000917 is far less than 0.3, indicating that the prediction model is not overfitted. FIG. 4 shows the fit between the experimental and predicted log K_OW values.
S5, carrying out application domain characterization on the n-octanol water distribution coefficient prediction model according to Euclidean distance.
As shown in FIG. 3, the axes are the two descriptors nX and AATS1e of the n-octanol/water partition coefficient prediction model, and the background represents the Euclidean distance of a substance. According to the legend on the left of the figure, a Euclidean distance within 0 to 1.2 defines the application domain of the prediction model. All substances in the training and validation sets lie within this application domain, and the substances to be predicted also lie within the application range of the model, which shows that the prediction results are reliable. The Euclidean distance is calculated as follows:
[Formula image: Euclidean distance]
In the above formula, $x_i$ is the molecular structure descriptor variable of the i-th compound and $\bar{x}$ is the mean of the molecular structure descriptors.
In this embodiment, the Euclidean distance is used to characterize the application domain of the model, which ensures that the target substances to be tested lie within the application range of the constructed prediction model and further improves the credibility of the predicted n-octanol water partition coefficients.
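An application-domain check of this kind might be sketched as below. The exact distance formula of the embodiment is not reproduced in the text, so this sketch assumes the plain Euclidean distance between a compound's descriptor vector and the mean descriptor vector of the training set, with no descriptor scaling; whether the original method scales the descriptors first is not stated here. The 1.2 threshold is taken from the text, while the function names and descriptor values are hypothetical.

```python
import math

def centroid(rows):
    """Column-wise mean of a list of descriptor vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def euclidean_distance(x, center):
    """Euclidean distance between one descriptor vector and the centroid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, center)))

def in_domain(x, center, threshold=1.2):
    """True if the compound lies within the assumed application domain."""
    return euclidean_distance(x, center) <= threshold

# Hypothetical (nX, AATS1e) pairs, for illustration only.
train = [[0.0, 8.0], [1.0, 7.8], [2.0, 7.6], [1.0, 8.2]]
center = centroid(train)
candidate = [1.0, 7.5]
print(euclidean_distance(candidate, center), in_domain(candidate, center))
```

A compound that fails the check would be treated as outside the model's application range, so its predicted value would carry less confidence.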
S6, obtaining a plurality of target substances to be tested, all of which are sulfamethoxazole compounds.
In this embodiment, there are six target substances to be tested, each of which is a sulfamethoxazole compound.
S7, using the n-octanol water partition coefficient prediction model to predict the n-octanol water partition coefficient of each target substance separately, obtaining the corresponding predicted value.
Substituting the descriptor parameters of a target substance into the linear relation of the prediction model yields the corresponding predicted n-octanol water partition coefficient. The whole process is simple to carry out and easy to interpret, which improves the efficiency and accuracy of the prediction as well as the reliability of the model's results.
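The substitution step can be illustrated with the linear relation stated in claim 5. In this sketch the coefficients come from that claim, while the function name `predict_log_kow`, the compound labels, and the descriptor values are hypothetical and chosen only to show the arithmetic.

```python
def predict_log_kow(n_x, aats1e):
    """Evaluate log K_OW = 1.504*nX - 4.907*AATS1e + 39.845 (claim 5)."""
    return 1.504 * n_x - 4.907 * aats1e + 39.845

# Hypothetical descriptor values (nX, AATS1e), not taken from the patent's table.
candidates = {
    "compound_A": (0, 8.0),
    "compound_B": (1, 8.0),
}

for name, (n_x, aats1e) in candidates.items():
    print(name, round(predict_log_kow(n_x, aats1e), 3))
```

With these illustrative inputs, adding one halogen atom at a fixed AATS1e raises the predicted value by exactly the nX coefficient, 1.504.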
In this embodiment, the prediction model constructed above is used to predict the n-octanol water partition coefficients of the six given target substances. That is, the descriptor parameters of the six target substances are each substituted into the linear relation of the constructed prediction model to obtain the corresponding predicted values. The detailed results are shown in the following table:
[Table image: predicted n-octanol water partition coefficients of the six target substances]
Calculating the Euclidean distances of the six target substances according to the formula gives values that are all less than 1.2, indicating that the target compounds lie within the application range of the model and that the prediction results are reliable.
In this embodiment, compounds structurally similar to the target substances are first screened from experimental measurement data in the published literature to form a sample data set, which provides solid data support for the subsequent construction of the n-octanol water partition coefficient prediction model. The sample data set is randomly divided into a training set and a validation set according to the modeling requirements of the OECD guidelines, which strengthens the stability of the prediction model. Next, the structures of the compounds in the sample data set are optimized with a semi-empirical molecular orbital method, and the optimized structures are imported into the PaDEL-Descriptor software to calculate a plurality of molecular structure descriptors. Stepwise multiple linear regression (MLR) then selects the two significant descriptors nX and AATS1e, from which the n-octanol water partition coefficient prediction model is built; the resulting model is simple and robust, which improves prediction accuracy and efficiency. Finally, the constructed model is verified internally and externally using the coefficient of determination, the root mean square error, and the squared correlation coefficient, and its application domain is characterized by Euclidean distance. Evaluating the model in terms of goodness of fit, application domain, and mechanism ensures that the target substances lie within the application range of the constructed model, thereby improving the reliability of the prediction results.
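The random division into a training set of 57 compounds and a validation set of 19 compounds could be sketched as follows. This is only an illustration under stated assumptions: the function name `split_dataset`, the fixed seed, and the compound labels are hypothetical, and the source does not say how the random split was actually performed.

```python
import random

def split_dataset(items, n_train, seed=0):
    """Shuffle a copy of the items and split it into training and validation parts."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the split reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# 76 placeholder labels standing in for the 57 + 19 compounds of the sample data set.
compounds = [f"compound_{i:02d}" for i in range(76)]
train, valid = split_dataset(compounds, n_train=57)
print(len(train), len(valid))
```

Fixing the seed is a design choice for reproducibility; any compound appears in exactly one of the two sets.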
Those of ordinary skill in the art will appreciate that all or a portion of the steps in a method of implementing an embodiment described above may be implemented by a program to instruct related hardware, and thus the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The foregoing embodiments describe the invention in detail. Specific examples are used herein to explain the principles and implementations of the invention, and the description of the embodiments is intended only to aid understanding of the method of the invention and its core idea. Those skilled in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application. In view of the above, this description should not be construed as limiting the invention.

Claims (5)

1. A method for predicting the n-octanol water partition coefficient of sulfamethoxazole substances, characterized by comprising the following steps:
S1, screening compounds structurally similar to the target substance to be tested from experimental measurement data in the published literature, and generating a sample data set;
S2, randomly dividing the sample data set into a training set and a validation set according to the modeling requirements of the OECD;
S3, constructing an n-octanol water partition coefficient prediction model from the training set by stepwise multiple linear regression;
S4, verifying, on the validation set, the predictive ability of the n-octanol water partition coefficient prediction model using the degree-of-freedom-adjusted coefficient of determination $R^2$, the root mean square error RMSE, and the squared correlation coefficient [formula image];
S5, characterizing the application domain of the n-octanol water partition coefficient prediction model by Euclidean distance;
S6, obtaining a plurality of target substances to be tested, all of which are sulfamethoxazole compounds;
S7, using the n-octanol water partition coefficient prediction model to predict the n-octanol water partition coefficient of each target substance separately, obtaining the corresponding predicted value.
2. The method for predicting the n-octanol water partition coefficient of sulfamethoxazole substances according to claim 1, wherein in step S3, constructing the n-octanol water partition coefficient prediction model by stepwise multiple linear regression comprises:
S31, optimizing the molecular structures of the compounds in the training set with a semi-empirical molecular orbital method to obtain optimized lowest-energy structures;
S32, importing the optimized lowest-energy structures into the PaDEL-Descriptor software and calculating a plurality of molecular structure descriptors;
S33, using multiple linear regression, selecting from the plurality of molecular structure descriptors the multiple linear regression model whose variance inflation factor is below a preset threshold and whose degree-of-freedom-adjusted coefficient of determination is largest, this multiple linear regression model being the n-octanol water partition coefficient prediction model.
3. The method for predicting the n-octanol water partition coefficient of sulfamethoxazole substances according to claim 2, wherein in step S33, the degree-of-freedom-adjusted coefficient of determination $R^2_{adj}$ is calculated as:
$$R^2_{adj} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n - p - 1)}{\sum_{i=1}^{n} (y_i - \bar{y})^2 / (n - 1)}$$
wherein $y_i$ and $\hat{y}_i$ are the experimental and predicted values of the i-th compound, respectively, $\bar{y}$ is the mean of the experimental values of all data points in the training set, $n$ is the number of data points in the training set, and $p$ is the number of descriptors.
4. The method for predicting the n-octanol water partition coefficient of sulfamethoxazole substances according to claim 2, wherein the descriptors used to construct the n-octanol water partition coefficient prediction model are nX and AATS1e, wherein the descriptor nX represents the number of halogen atoms in the molecular structure, and the descriptor AATS1e is the Sanderson electronegativity weighted autocorrelation parameter, a parameter describing the extranuclear electron distribution of each atom in the molecule.
5. The method for predicting the n-octanol water partition coefficient of sulfamethoxazole substances according to claim 4, wherein stepwise regression analysis and verification are performed on the $\log K_{OW}$ values of all compounds in the sample data set to obtain the following linear relation of the n-octanol water partition coefficient prediction model:
$$\log K_{OW} = 1.504 \times nX - 4.907 \times AATS1e + 39.845$$
wherein nX represents the number of halogen atoms in the molecular structure, and AATS1e represents the Sanderson electronegativity weighted autocorrelation parameter.
CN202310201931.7A 2023-03-06 2023-03-06 Method for predicting n-octanol water distribution coefficient of sulfamethoxazole substances Pending CN116312854A (en)


Publications (1)

Publication Number Publication Date
CN116312854A true CN116312854A (en) 2023-06-23




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20230623)