CN116312854A - Method for predicting n-octanol water distribution coefficient of sulfamethoxazole substances - Google Patents
Method for predicting n-octanol water distribution coefficient of sulfamethoxazole substances
- Publication number
- CN116312854A CN202310201931.7A
- Authority
- CN
- China
- Prior art keywords
- water distribution
- octanol water
- distribution coefficient
- prediction model
- octanol
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/30—Prediction of properties of chemical compounds, compositions or mixtures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C10/00—Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
Abstract
The invention relates to the technical field of ecological risk assessment and addresses the poor predictive capability of existing models. Specifically, it provides a method for predicting the n-octanol/water partition coefficient of sulfamethoxazole substances, comprising the following steps: screening compounds structurally similar to the target substance from experimental measurement data in the published literature to generate a sample data set; randomly dividing the sample data set into a training set and a validation set at a preset ratio; training the constructed n-octanol/water partition coefficient prediction model on the training set; verifying the external predictive capability of the model on the validation set; and predicting the n-octanol/water partition coefficient of the target substance with the model. By building the prediction model from statistically significant descriptors of compounds structurally similar to the target substance, the invention improves predictive capability for sulfamethoxazole substances.
Description
Technical Field
The invention relates to the technical field of ecological risk assessment, and in particular to a method for predicting the n-octanol/water partition coefficient of sulfamethoxazole substances.
Background
Sulfamethoxazole compounds are widely used as antibacterial drugs in humans, but most of the administered drug is discharged into the environment without being metabolized. These drugs can also enter the aquatic environment during manufacturing and during the disposal of expired or unused medication, so their average concentration in municipal sewage is relatively high. Research shows that the n-octanol/water partition coefficient of sulfamethoxazole substances is important for determining sulfamethoxazole concentrations in sewage.
The n-octanol/water partition coefficient is the ratio of a compound's concentration in the n-octanol phase to its concentration in the aqueous phase at equilibrium; it reflects the compound's ability to migrate between the aqueous and organic phases. In principle, direct laboratory measurement of the n-octanol/water partition coefficient of sulfamethoxazole substances is the most effective method, but the experimental procedure is tedious and complex to operate, requires standard samples, and yields data with systematic errors between laboratories. Many researchers have therefore proposed prediction models that estimate the coefficient from a compound's molecular structure information. Although such models simplify the procedure compared with experimental measurement, existing n-octanol/water partition coefficient prediction models perform poorly on sulfamethoxazole substances and cannot meet the need to predict the coefficient for these substances and their derivatives.
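The definition above can be illustrated with a minimal Python sketch. The function name `log_kow` and the example concentrations are illustrative assumptions and do not come from the patent; the sketch only shows how the logarithm of the concentration ratio would be taken from two equilibrium concentrations expressed in the same units.

```python
import math


def log_kow(conc_octanol: float, conc_water: float) -> float:
    """Return log10 of the n-octanol/water concentration ratio at equilibrium.

    Both concentrations must use the same units (hypothetical example: mg/L).
    """
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_octanol / conc_water)


# Hypothetical values: 100 units in the octanol phase, 1 unit in the water phase.
print(log_kow(100.0, 1.0))  # 2.0
```

A positive value means the compound favors the organic phase; a negative value means it favors the aqueous phase.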
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a method for predicting the n-octanol/water partition coefficient of sulfamethoxazole substances, which solves the technical problem of poor predictive capability for these substances. The method builds an n-octanol/water partition coefficient prediction model from statistically significant descriptors of compounds structurally similar to the target substance and comprehensively validates the model in terms of goodness of fit, application domain, and predictive capability, thereby improving the predictive capability of the model.
To solve the above technical problem, the invention provides the following technical scheme: a method for predicting the n-octanol/water partition coefficient of sulfamethoxazole substances, comprising the following steps:
S1, screening compounds structurally similar to the target substance from experimental measurement data in the published literature to generate a sample data set;
S2, randomly dividing the sample data set into a training set and a validation set according to the modeling requirements of the OECD;
S3, constructing an n-octanol/water partition coefficient prediction model from the training set by stepwise multiple linear regression analysis;
S4, verifying the predictive capability of the n-octanol/water partition coefficient prediction model on the validation set using the degree-of-freedom-adjusted determination coefficient $R^2_{adj}$, the root mean square error RMSE, and the squared correlation coefficient;
S5, characterizing the application domain of the n-octanol/water partition coefficient prediction model by Euclidean distance;
S6, obtaining a plurality of target substances belonging to the sulfamethoxazole compounds;
S7, using the n-octanol/water partition coefficient prediction model to predict the n-octanol/water partition coefficient of each target substance independently, obtaining the corresponding predicted value.
Further, in step S3, constructing the n-octanol/water partition coefficient prediction model by stepwise multiple linear regression analysis comprises the following steps:
S31, optimizing the molecular structures of the compounds in the training set by a semi-empirical molecular orbital method to obtain optimized lowest-energy structures;
S32, importing the optimized lowest-energy structures into the PaDEL-Descriptor software and calculating a plurality of molecular structure descriptors;
S33, using multiple linear regression to select, from the plurality of molecular structure descriptors, the multiple linear regression model whose variance inflation factor is below a preset threshold and whose degree-of-freedom-adjusted determination coefficient is the largest; this multiple linear regression model is the n-octanol/water partition coefficient prediction model.
Further, in step S33, the degree-of-freedom-adjusted determination coefficient $R^2_{adj}$ is calculated as:
$$R^2_{adj} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 / (n - p - 1)}{\sum_{i=1}^{n}(y_i - \bar{y})^2 / (n - 1)}$$
where $y_i$ and $\hat{y}_i$ are the experimental and predicted values for the $i$-th compound, respectively, $\bar{y}$ is the mean of the experimental values of all training-set data points, $n$ is the number of training-set data points, and $p$ is the number of descriptors.
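As a minimal sketch of this calculation, assuming the standard definition of the adjusted determination coefficient (residual sum of squares divided by n − p − 1, over total sum of squares divided by n − 1), the statistic can be computed as below. The function name `adjusted_r2` and the sample values are illustrative and not taken from the patent.

```python
def adjusted_r2(y_true, y_pred, n_descriptors):
    """Degree-of-freedom-adjusted determination coefficient.

    y_true: experimental values; y_pred: predicted values;
    n_descriptors: number of descriptors p in the regression model.
    """
    n = len(y_true)
    if n != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if n - n_descriptors - 1 <= 0:
        raise ValueError("need more data points than descriptors + 1")
    mean_y = sum(y_true) / n
    ss_res = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - (ss_res / (n - n_descriptors - 1)) / (ss_tot / (n - 1))


# Hypothetical data: a perfect fit gives exactly 1.0.
print(adjusted_r2([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0, 5.0], 2))  # 1.0
```

Unlike the plain determination coefficient, this value drops when descriptors are added without a matching reduction in residual error, which is why it suits descriptor selection.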
Further, the descriptors used to construct the n-octanol/water partition coefficient prediction model are nX and AATS1e, where nX is the number of halogen atoms in the molecular structure and AATS1e is an autocorrelation parameter weighted by Sanderson electronegativity, which describes the distribution of the extranuclear electrons of each atom in the molecule.
Further, stepwise regression analysis and verification of the $\log K_{OW}$ values of all compounds in the sample data set give the following linear relation for the n-octanol/water partition coefficient prediction model:
$$\log K_{OW} = 1.504 \times nX - 4.907 \times \mathrm{AATS1e} + 39.845$$
where nX is the number of halogen atoms in the molecular structure and AATS1e is the Sanderson-electronegativity-weighted autocorrelation parameter.
With the above technical scheme, the invention provides a method for predicting the n-octanol/water partition coefficient of sulfamethoxazole substances that has at least the following beneficial effects:
1. The invention screens compounds structurally similar to the target substance from experimental measurement data in the published literature as the sample data set, optimizes their structures with a semi-empirical molecular orbital method, imports the optimized structures into the PaDEL-Descriptor software to calculate a plurality of molecular structure descriptors, and uses stepwise multiple linear regression (MLR) analysis to select the two descriptors nX and AATS1e for constructing the n-octanol/water partition coefficient prediction model. The model is comprehensively evaluated in terms of goodness of fit, application domain, and mechanism, which improves the accuracy of the prediction results.
2. Following the modeling requirements of the OECD guidelines, the invention randomly divides the screened sample data set, which is structurally similar to the target substance, into a training set and a validation set at a preset ratio of 3:1, and constructs and validates the n-octanol/water partition coefficient prediction model on them, which enhances the stability of the model.
3. The invention characterizes goodness of fit with the adjusted determination coefficient $R^2_{adj}$ and the RMSE, represents the external validation result with the determination coefficient between the experimental and predicted values of the validation-set compounds and the squared correlation coefficient, and characterizes the application domain of the prediction model by Euclidean distance. This ensures that all target substances fall within the applicability range of the constructed model and further improves the credibility of the prediction results.
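The text does not spell out how the Euclidean-distance application domain is computed, so the following Python sketch is only one plausible reading: each compound is a point in (nX, AATS1e) descriptor space, its distance to the training-set centroid is computed, and a query is considered in-domain when that distance does not exceed the largest training-set distance. The centroid reference, the threshold rule, the function names, and all descriptor values are assumptions made for illustration.

```python
import math


def centroid(rows):
    """Mean descriptor vector of the training set."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def in_domain(query, training_rows, threshold=None):
    """Assumed rule: in-domain if no farther from the centroid than the
    farthest training compound (or than an explicit threshold)."""
    center = centroid(training_rows)
    if threshold is None:
        threshold = max(euclidean(row, center) for row in training_rows)
    return euclidean(query, center) <= threshold


# Hypothetical (nX, AATS1e) descriptor pairs for three training compounds.
training = [(0, 8.0), (1, 8.2), (2, 8.4)]
print(in_domain((1, 8.3), training))  # True
print(in_domain((5, 8.2), training))  # False
```

In practice the descriptors would likely be scaled before distances are taken, since nX and AATS1e have different numeric ranges; the sketch omits scaling for brevity.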
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flow chart of the method for predicting the n-octanol/water partition coefficient of sulfamethoxazole substances;
FIG. 2 is a flow chart of the construction of the n-octanol/water partition coefficient prediction model of the present invention;
FIG. 3 is a graph of the model descriptor versus Euclidean distance of the present invention;
FIG. 4 is a schematic fit of the experimental and predicted $\log K_{OW}$ values for the training set of the present invention.
Detailed Description
To make the above objects, features, and advantages of the present invention more readily apparent, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments, so that the process of applying the technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented.
Summary of the application
To date, many researchers have successfully established prediction models for the n-octanol/water partition coefficient of organic compounds using quantitative structure-activity relationship (QSAR) techniques. For example, the EPI Suite software of the United States Environmental Protection Agency uses a molecular fragment method to predict a compound's $\log K_{OW}$; this approach works well for simpler compounds, but the prediction error may be larger for more complex ones. Some patents also provide prediction schemes, such as the patent application No. 201210181904.X, entitled "a method for predicting n-octanol/water distribution coefficient of ionic liquid", which calculates the n-octanol/water partition coefficient from the van der Waals volumes of the atoms in the molecular structural formula of the ionic liquid to be predicted; however, that method is complex to operate and has poor predictive capability. In addition, there is little prior research on predicting the n-octanol/water partition coefficient of sulfamethoxazole substances, and no related patents exist. The application therefore provides a method that constructs the n-octanol/water partition coefficient prediction model from statistically significant descriptors of compounds structurally similar to the target substance, which improves the predictive capability and stability of the model, and that characterizes the application domain of the model, which improves the credibility of its prediction results.
Examples
Referring to FIGS. 1-4, a specific implementation of this embodiment is shown. In this embodiment, the n-octanol/water partition coefficient prediction model is constructed from the screened descriptors nX and AATS1e, and the model is validated internally and externally using the determination coefficient, the root mean square error, and the squared correlation coefficient, which improves the predictive capability of the model for sulfamethoxazole substances.
As shown in FIG. 1, the method for predicting the n-octanol/water partition coefficient of sulfamethoxazole substances comprises the following steps:
S1, screening compounds structurally similar to the target substance from experimental measurement data in the published literature to generate a sample data set.
To ensure data quality, all n-octanol/water partition coefficient values are taken from experimental measurements in the published scientific literature. The target substance belongs to the sulfamethoxazole compounds, whose structure contains an isoxazole ring, a substituted amino group, and a substituted sulfonyl (sulfonamide) group. First, all three groups contain N or O, and groups containing N or O generally form hydrogen bonds with water molecules easily, which affects the water solubility of the whole molecule. Second, the substituted amino group, halogen, benzene ring, and isoxazole ring are groups with high electron cloud density, which can change the electron cloud distribution and electronegativity of the whole molecule and thus affect its polarity. Finally, all target substances contain a bulky benzene ring or isoxazole ring, which hinders interaction between the molecule and water molecules to some extent and therefore also affects water solubility.
Therefore, when screening experimental measurement data in the published scientific literature, the substituted amino group, sulfonamide group, isoxazole ring, benzene ring, and halogen are used as criteria, so that the substances used for modeling are structurally closer to the target substance, which further ensures the accuracy of the prediction model.
In this embodiment, the sample data set is generated as follows: first, determine whether the substance has the structural unit represented by formula (1) as its basic structure; if so, determine whether its substituent is an amino group or an isoxazole ring; if so, add the substance to the sample data set; otherwise, the substance does not meet the requirements. Formula (1) is as follows:
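The two-stage screening rule described above can be sketched in Python as follows. Because formula (1) is not reproduced in the text, the check for the basic structural unit is represented by a plain boolean field; the `Candidate` class, its field names, and the example records are hypothetical and serve only to show the decision logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    has_core_unit: bool      # True if built on the structural unit of formula (1)
    substituents: frozenset  # substituent labels, e.g. {"amino"}


# Substituent types accepted by the second check (labels are illustrative).
ALLOWED_SUBSTITUENTS = frozenset({"amino", "isoxazole"})


def passes_screen(candidate: Candidate) -> bool:
    """Stage 1: core structural unit present. Stage 2: allowed substituent."""
    if not candidate.has_core_unit:
        return False
    return bool(candidate.substituents & ALLOWED_SUBSTITUENTS)


# Hypothetical literature records.
records = [
    Candidate("compound A", True, frozenset({"amino"})),
    Candidate("compound B", True, frozenset({"methyl"})),
    Candidate("compound C", False, frozenset({"isoxazole"})),
]
sample_set = [c.name for c in records if passes_screen(c)]
print(sample_set)  # ['compound A']
```

In a real workflow the boolean flags would come from a substructure search against formula (1) rather than being entered by hand.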
In this example, 76 sulfamethoxazole compounds were collected and sorted according to the above screening procedure; their n-octanol/water partition coefficients $\log K_{OW}$ range from -0.62 to 4.39.
Using compounds structurally similar to the target substance as the sample data set provides strong data support for the subsequent construction of the model that predicts the n-octanol/water partition coefficient of the target substance, and further improves the prediction accuracy of the model.
S2, randomly dividing the sample data set into a training set and a validation set according to the modeling requirements of the OECD.
To meet the modeling requirements of the OECD guidelines, this embodiment randomly divides the $\log K_{OW}$ data of the 76 compounds selected in step S1 into a training set and a validation set at a preset ratio of 3:1. The training set contains 57 data points and is used to build the model; the validation set contains 19 data points and is used for external validation of the model, which enhances stability and gives the model strong predictive capability and robustness.
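A random 3:1 split of 76 records into 57 training and 19 validation items can be sketched as below. The function name `split_dataset`, the fixed seed, and the use of integer indices in place of real compound records are assumptions made for illustration; the patent does not specify how the randomization was carried out.

```python
import random


def split_dataset(records, train_fraction=0.75, seed=0):
    """Shuffle the records and split them into (training, validation) lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# 76 placeholder records stand in for the 76 screened compounds.
training_set, validation_set = split_dataset(range(76))
print(len(training_set), len(validation_set))  # 57 19
```

Fixing the seed keeps the split reproducible, which matters when the validation statistics are later compared across model variants.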
S3, constructing an n-octanol/water partition coefficient prediction model from the training set by stepwise multiple linear regression analysis. The process of establishing the n-octanol/water partition coefficient prediction model is further described with reference to FIG. 2:
S31, optimizing the molecular structures of the compounds in the training set by a semi-empirical molecular orbital method to obtain optimized lowest-energy structures.
In this embodiment, the molecular structures of the compounds in the training set are optimized with the PM7 method in the MOPAC 2016 software package, using the keywords CHARGE=0 EF GNORM=0.0100 SHIFT=80, to obtain the optimized lowest-energy structures.
S32, importing the optimized lowest-energy structures into the PaDEL-Descriptor software and calculating a plurality of molecular structure descriptors.
In this embodiment, the calculations of steps S31 and S32 yield a total of 1444 descriptors in 63 classes.
It should be noted that a descriptor is a quantity that encodes structural information or experimentally derived properties of a chemical molecule. Descriptors fall into three categories, 1D, 2D, and 3D, which represent chemical composition, topology, and 3D shape and function, respectively. A descriptor may be a simple feature, such as molecular volume, or a complex one, such as 3D-MoRSE, and can capture various physicochemical and structural characteristics of a compound. Descriptors can be used to build quantitative structure-activity relationship (QSAR) models that predict the biological activity of novel compounds.
S33, using multiple linear regression to select, from the plurality of molecular structure descriptors, the multiple linear regression model whose variance inflation factor is below a preset threshold and whose degree-of-freedom-adjusted determination coefficient is the largest; this multiple linear regression model is the n-octanol/water partition coefficient prediction model.
In this embodiment, stepwise multiple linear regression (stepwise MLR) analysis in the SPSS 19.0 software is used to select and test the 1444 descriptors in turn and to calculate the significance of each descriptor for the model under construction. A descriptor is considered significant for model construction when its significance level is less than 0.05. Its variance inflation factor (VIF) is then compared with the preset threshold of 10; if the VIF is less than 10, the significant descriptor is retained and participates in model construction.
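The retention rule (significance below 0.05 and VIF below 10) can be sketched in Python. For a model with exactly two descriptors, the VIF of each descriptor reduces to 1 / (1 − r²), where r is the Pearson correlation between the two; the sketch below uses that special case rather than reproducing the SPSS procedure. The function names and the descriptor values are hypothetical, and the p-values are assumed to be supplied by the regression software.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def vif_two_descriptors(x1, x2):
    """VIF of either descriptor in a two-descriptor model: 1 / (1 - r^2)."""
    r_squared = pearson_r(x1, x2) ** 2
    return math.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)


def keep_descriptor(p_value, vif, alpha=0.05, vif_limit=10.0):
    """Retain a descriptor only if it is significant and not collinear."""
    return p_value < alpha and vif < vif_limit


# Hypothetical descriptor columns for four compounds.
vif = vif_two_descriptors([0, 1, 2, 3], [1, 0, 1, 0])
print(round(vif, 3))                # 1.25
print(keep_descriptor(0.01, vif))   # True
```

With more than two descriptors, each VIF would instead come from regressing one descriptor on all the others, which typically calls for a linear algebra library.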
Finally, screening the 1444 descriptors yields the two descriptors nX and AATS1e used to construct the n-octanol/water partition coefficient prediction model. Stepwise regression analysis and verification of the $\log K_{OW}$ data of all compounds in the sample data set give the following linear relation for the model:
$$\log K_{OW} = 1.504 \times nX - 4.907 \times \mathrm{AATS1e} + 39.845$$
In the above formula, nX is the number of halogen atoms in the molecular structure and AATS1e is the Sanderson-electronegativity-weighted autocorrelation parameter.
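The linear relation can be applied directly once the two descriptors are known. The sketch below hard-codes the coefficients stated above; the function name `predict_log_kow` and the example descriptor values are hypothetical, and in practice nX and AATS1e would be taken from the PaDEL-Descriptor output for the optimized structure.

```python
def predict_log_kow(n_halogen: int, aats1e: float) -> float:
    """Predicted log K_OW from the two-descriptor linear model in the text."""
    return 1.504 * n_halogen - 4.907 * aats1e + 39.845


# Hypothetical descriptor values, chosen only to show the arithmetic.
print(round(predict_log_kow(0, 8.0), 3))  # 0.589
print(round(predict_log_kow(1, 8.0), 3))  # 2.093
```

Each additional halogen atom raises the predicted value by 1.504 log units at fixed AATS1e, while a higher AATS1e lowers it.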
It should be noted that the descriptor nX (Number of halogen atoms (F, Cl, Br, I, At, Uus)) is the number of halogen atoms in the molecular structure. Whether a halogen is introduced on an amino group or on a substituted aromatic ring, its high electronegativity affects the electron cloud distribution and polarity of the whole molecule and therefore its water solubility and $K_{OW}$ value.
The other descriptor, AATS1e (Average Broto-Moreau autocorrelation - lag 1 / weighted by Sanderson electronegativities), is an autocorrelation parameter weighted by Sanderson electronegativity that describes the distribution of the extranuclear electrons of each atom in the molecule. Because this electron distribution directly determines the overall polarity of the molecule, electronegativity is considered, by the like-dissolves-like principle, to have a considerable effect on the $K_{OW}$ of halogen-substituted sulfamethoxazole substances.
S4, according to the verification set, adopting the degree-of-freedom-adjusted determination coefficient $R^2_{adj}$, the root mean square error RMSE and the squared correlation coefficient to verify the prediction capability of the n-octanol water distribution coefficient prediction model.
In this embodiment, the internal and external prediction capabilities of the n-octanol water distribution coefficient prediction model are verified using the degree-of-freedom-adjusted determination coefficient $R^2_{adj}$, the root mean square error RMSE and the squared correlation coefficient. The degree-of-freedom-adjusted determination coefficient is calculated as follows:
$$R^2_{adj} = 1 - \frac{n-1}{n-p-1} \cdot \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
In the above formula, $y_i$ and $\hat{y}_i$ are the experimental and predicted values of the i-th compound, respectively, $\bar{y}$ is the average of the experimental values of all data points in the training set, n is the number of data points in the training set, and p is the number of descriptors.
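The root mean square error RMSE and the squared correlation coefficient are calculated as follows:
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}$$
$$\text{squared correlation coefficient} = 1 - \frac{\sum_{i=1}^{n_{ext}} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n_{ext}} (y_i - \bar{y})^2}$$
In the above formulas, $y_i$ and $\hat{y}_i$ are the experimental and predicted values of the i-th compound, respectively, $\bar{y}$ is the average of the experimental values of all data points in the verification set, n is the number of data points in the training set, and $n_{ext}$ is the number of data points in the verification set.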
Specifically, the following values are obtained in this embodiment by calculation according to the above formulas:
Training set ($n_{train}$ = 57): determination coefficient […], root mean square error $RMSE_{tra}$ = 0.009, squared correlation coefficient […].
Verification set ($n_{ext}$ = 19): determination coefficient […], root mean square error $RMSE_{ext}$ = 0.187, squared correlation coefficient […].
As can be seen from the above data, the values of […] and […] are both greater than 0.85, indicating that the prediction model has a good degree of fit and strong stability; the value of […] is greater than 0.8, indicating that the prediction model has good predictive ability; and the difference between […] and […] is much less than 0.3, indicating that the prediction model is not overfitted. FIG. 4 shows the fit between the experimental and predicted $\log K_{OW}$ values.
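The validation statistics named in this step can be sketched as follows. This is an illustration under stated assumptions, not the patent's own calculation: the squared correlation coefficient is assumed to take the common form 1 - SSE/SST, the function names are hypothetical, and the sample values are invented for demonstration.

```python
from math import sqrt

def sum_squared_error(y_true, y_pred):
    """Sum of squared differences between experimental and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred))

def total_sum_of_squares(y_true):
    """Sum of squared deviations of the experimental values from their mean."""
    mean = sum(y_true) / len(y_true)
    return sum((y - mean) ** 2 for y in y_true)

def adjusted_r_squared(y_true, y_pred, n_descriptors):
    """Determination coefficient adjusted for n data points and p descriptors."""
    n = len(y_true)
    ratio = sum_squared_error(y_true, y_pred) / total_sum_of_squares(y_true)
    return 1.0 - (n - 1) / (n - n_descriptors - 1) * ratio

def rmse(y_true, y_pred):
    """Root mean square error over the given data points."""
    return sqrt(sum_squared_error(y_true, y_pred) / len(y_true))

def squared_correlation_coefficient(y_true, y_pred):
    """Assumed form 1 - SSE/SST, using the mean of the supplied experimental values."""
    return 1.0 - sum_squared_error(y_true, y_pred) / total_sum_of_squares(y_true)

# Invented experimental and predicted values (not data from the patent).
y_exp = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
print(adjusted_r_squared(y_exp, y_hat, n_descriptors=2))
print(rmse(y_exp, y_hat))
print(squared_correlation_coefficient(y_exp, y_hat))
```

The same functions can be applied to the training set and to the verification set separately, which corresponds to the internal and external checks described in the text.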
S5, carrying out application domain characterization on the n-octanol water distribution coefficient prediction model according to Euclidean distance.
As shown in FIG. 3, the horizontal and vertical axes are the two descriptors of the n-octanol water partition coefficient prediction model, nX and AATS1e, respectively, and the background represents the Euclidean distance of a substance. According to the legend on the left of the figure, a Euclidean distance within 0-1.2 is the application domain of the n-octanol water partition coefficient prediction model. All substances in the training set and the verification set lie within the application domain, and the substances to be predicted also lie within the application domain of the n-octanol water distribution coefficient prediction model, which shows that the prediction results are reliable. The Euclidean distance is calculated as follows:
In the above formula, $x_i$ is the molecular structure descriptor value of the i-th compound, and $\bar{x}$ is the average value of the molecular structure descriptor.
According to the embodiment, the Euclidean distance is adopted to characterize the application domain of the model, so that the target substances to be detected are ensured to be in the application range of the constructed prediction model, and the credibility of the prediction result of the n-octanol water distribution coefficient is further improved.
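A minimal sketch of an application-domain check based on Euclidean distance is given below. It assumes that the distance is measured from a compound's descriptor vector to the mean descriptor vector of the training set and that the descriptors are used without any rescaling; the text does not state either point, so both are assumptions. The 1.2 cutoff is the value given in the text, while the function names and descriptor values are hypothetical.

```python
from math import sqrt

def centroid(rows):
    """Mean value of each descriptor column over the given compounds."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]

def euclidean_distance(x, center):
    """Euclidean distance between one descriptor vector and the centroid."""
    return sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, center)))

def in_application_domain(x, center, threshold=1.2):
    """True when the distance to the centroid does not exceed the cutoff from the text."""
    return euclidean_distance(x, center) <= threshold

# Hypothetical [nX, AATS1e] vectors for a training set (not data from the patent).
training_descriptors = [[0, 8.1], [1, 7.9], [2, 7.8], [1, 8.0]]
center = centroid(training_descriptors)
candidate = [1, 7.7]  # hypothetical target substance
print(euclidean_distance(candidate, center), in_application_domain(candidate, center))
```

If the descriptors were standardized before the distance calculation, the same check would apply to the standardized values instead.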
S6, obtaining a plurality of target substances to be detected, which belong to the sulfamethoxazole compounds.
In this embodiment, there are six target substances to be tested, and each target substance to be tested belongs to the sulfamethoxazole compound.
S7, adopting the n-octanol water distribution coefficient prediction model to predict the n-octanol water distribution coefficient of each target substance to be detected separately, and obtaining the corresponding n-octanol water distribution coefficient prediction value.
The corresponding predicted n-octanol water distribution coefficient is obtained by substituting the descriptor parameters of the target substance to be detected into the linear relation of the n-octanol water distribution coefficient prediction model. The whole process is simple to operate and easy to interpret, which improves the efficiency and accuracy of n-octanol water distribution coefficient prediction as well as the reliability of the model's prediction results.
In this embodiment, the n-octanol water distribution coefficient prediction model constructed above is used to predict the n-octanol water distribution coefficients of the six given target substances. That is, the descriptor parameters of the six target substances are each substituted into the linear relation of the constructed model to obtain the corresponding predicted n-octanol water distribution coefficients. The detailed results are shown in the following table:
The Euclidean distances of the six target substances to be detected, calculated according to the formula, are all less than 1.2, which indicates that the target compounds lie within the application domain of the model and that the prediction results are reliable.
Through this embodiment, compounds structurally similar to the target substance to be detected are first screened from experimental measurement data in the published literature to serve as the sample data set, which provides strong data support for the subsequent construction of the model for predicting the n-octanol water distribution coefficient of the target substance. The sample data set is randomly divided into a training set and a verification set according to the modeling requirements of the OECD guidelines, which enhances the stability of the prediction model. Next, the structures of the compounds in the sample data set are optimized by a semi-empirical molecular orbital method, the optimized structures are imported into PaDEL-Descriptor software to calculate a plurality of molecular structure descriptors, and the two significant descriptors nX and AATS1e are screened out by stepwise multiple linear regression (MLR) analysis to construct the n-octanol water distribution coefficient prediction model. The resulting model is simple and robust, which improves its prediction accuracy and efficiency. Finally, the constructed prediction model is verified internally and externally using the determination coefficient, the root mean square error and the squared correlation coefficient, and its application domain is characterized by Euclidean distance. The prediction model is thus evaluated comprehensively in terms of goodness of fit, application domain and mechanism, ensuring that the target substance to be detected lies within the application domain of the constructed prediction model and improving the reliability of the prediction results.
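The random division of the sample data set can be sketched as follows. The set sizes of 57 and 19 compounds are those reported in this embodiment; the function name, the fixed seed and the compound identifiers are hypothetical, and the sketch makes no claim about how the split was actually performed.

```python
import random

def split_dataset(compounds, n_validation, seed=0):
    """Randomly split compounds into (training set, verification set)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(compounds)
    rng.shuffle(shuffled)
    return shuffled[n_validation:], shuffled[:n_validation]

# Hypothetical identifiers for the 76 compounds (57 training + 19 verification).
compound_ids = [f"compound_{i:02d}" for i in range(76)]
training_set, verification_set = split_dataset(compound_ids, n_validation=19)
print(len(training_set), len(verification_set))
```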
Those of ordinary skill in the art will appreciate that all or a portion of the steps in a method of implementing an embodiment described above may be implemented by a program to instruct related hardware, and thus the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The invention has been described in detail above. Specific examples are used herein to explain the principles and embodiments of the invention, and the description of the embodiments is intended only to aid understanding of the method of the invention and its core concepts. Meanwhile, those skilled in the art may make changes to the specific embodiments and the scope of application in accordance with the ideas of the invention. In view of the above, this description should not be construed as limiting the invention.
Claims (5)
1. The method for predicting the n-octanol water partition coefficient of the sulfamethoxazole substance is characterized by comprising the following steps of:
S1, screening out compounds structurally similar to the target substance to be detected from experimental measurement data of published literature, and generating a sample data set;
S2, randomly dividing the sample data set into a training set and a verification set according to the modeling requirements of the OECD;
S3, constructing an n-octanol water distribution coefficient prediction model by a stepwise multiple linear regression analysis method according to the training set;
S4, according to the verification set, adopting the degree-of-freedom-adjusted determination coefficient $R^2_{adj}$, the root mean square error RMSE and the squared correlation coefficient to verify the prediction capability of the n-octanol water distribution coefficient prediction model;
S5, carrying out application domain characterization on the n-octanol water distribution coefficient prediction model according to Euclidean distance;
S6, obtaining a plurality of target substances to be detected, which belong to the sulfamethoxazole compounds;
S7, adopting the n-octanol water distribution coefficient prediction model to predict the n-octanol water distribution coefficient of each target substance to be detected separately, and obtaining the corresponding n-octanol water distribution coefficient prediction value.
2. The method for predicting the n-octanol water partition coefficient of the sulfamethoxazole substances according to claim 1, wherein in the step S3, the specific process of constructing the n-octanol water partition coefficient prediction model by adopting a multiple linear regression stepwise analysis method comprises the following steps:
S31, optimizing the molecular structures of the compounds in the training set by a semi-empirical molecular orbital method to obtain optimized lowest-energy structures;
S32, importing the optimized lowest-energy structures into PaDEL-Descriptor software, and calculating a plurality of molecular structure descriptors;
S33, selecting from the plurality of molecular structure descriptors, by a multiple linear regression method, the multiple linear regression model whose variance inflation factors are smaller than a preset threshold and whose degree-of-freedom-adjusted determination coefficient is the largest, this multiple linear regression model being the n-octanol water distribution coefficient prediction model.
3. The method for predicting the n-octanol water partition coefficient of the sulfamethoxazole substances according to claim 2, wherein in step S33, the degree-of-freedom-adjusted determination coefficient $R^2_{adj}$ is calculated as follows:
$$R^2_{adj} = 1 - \frac{n-1}{n-p-1} \cdot \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
wherein $y_i$ and $\hat{y}_i$ are the experimental and predicted values of the i-th compound, respectively, $\bar{y}$ is the average of the experimental values of all data points in the training set, n is the number of data points in the training set, and p is the number of descriptors.
4. The method for predicting the n-octanol water partition coefficient of the sulfamethoxazole substances according to claim 2, wherein the descriptors used for constructing the n-octanol water partition coefficient prediction model are nX and AATS1e, wherein the descriptor nX represents the number of halogen atoms in the molecular structure, and the descriptor AATS1e is a Sanderson-electronegativity-weighted autocorrelation parameter that describes the extranuclear electron distribution of each atom in the molecule.
5. The method for predicting the n-octanol water partition coefficient of the sulfamethoxazole substances according to claim 4, wherein stepwise regression analysis and verification are carried out on the $\log K_{OW}$ values of all compounds in the sample data set to obtain the following linear relation of the n-octanol water distribution coefficient prediction model:
$$\log K_{OW} = 1.504 \times nX - 4.907 \times AATS1e + 39.845$$
wherein nX represents the number of halogen atoms in the molecular structure, and AATS1e represents the Sanderson-electronegativity-weighted autocorrelation parameter.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310201931.7A CN116312854A (en) | 2023-03-06 | 2023-03-06 | Method for predicting n-octanol water distribution coefficient of sulfamethoxazole substances |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116312854A true CN116312854A (en) | 2023-06-23 |
Family
ID=86800725
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310201931.7A Pending CN116312854A (en) | 2023-03-06 | 2023-03-06 | Method for predicting n-octanol water distribution coefficient of sulfamethoxazole substances |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116312854A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003126689A (en) * | 2001-10-26 | 2003-05-07 | Tosoh Corp | Adsorbent for hydrocarbon and removal method by adsorbing hydrocarbon |
KR20120085144A (en) * | 2011-10-05 | 2012-07-31 | 주식회사 켐에쎈 | Multiple linear regression-artificial neural network hybrid model predicting water solubility of pure organic compound |
WO2012177108A2 (en) * | 2011-10-04 | 2012-12-27 | 주식회사 켐에쎈 | Model, method and system for predicting, processing and servicing online physicochemical and thermodynamic properties of pure compound |
CN102999705A (en) * | 2012-11-30 | 2013-03-27 | 大连理工大学 | Method for predicting n-octyl alcohol air distribution coefficient (KOA) at different temperatures through quantitative structure-activity relationship and solvent model |
CN104573863A (en) * | 2015-01-07 | 2015-04-29 | 大连理工大学 | Method for predicting organic compound and hydroxyl radical reaction rate constant in water phase |
CN113722988A (en) * | 2021-08-18 | 2021-11-30 | 扬州大学 | Method for predicting organic PDMS membrane-air distribution coefficient by quantitative structure-activity relationship model |
Non-Patent Citations (1)
Title |
---|
YUANYUAN ZHANG 等: "Toxicity of disinfection byproducts formed during the chlorination of sulfamethoxazole, norfloxacin, and 17β-estradiol in the presence of bromide", ENVIRONMENTAL SCIENCE AND POLLUTION RESEARCH, pages 50718 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20230623 |