CN112086141A - Method for predicting PA-water distribution coefficient of organic pollutant based on quantitative structure property relation - Google Patents
Method for predicting PA-water distribution coefficient of organic pollutant based on quantitative structure property relation
- Publication number
- CN112086141A
- Authority
- CN
- China
- Prior art keywords
- model
- compound
- coefficient
- predicting
- descriptor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/30—Prediction of properties of chemical compounds, compositions or mixtures
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C10/00—Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
Landscapes
- Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Theoretical Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Chemical & Material Sciences (AREA)
- Crystallography & Structural Chemistry (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Organic Low-Molecular-Weight Compounds And Preparation Thereof (AREA)
Abstract
The invention discloses a method for predicting the PA-water partition coefficient of organic pollutants based on a quantitative structure-property relationship. Molecular descriptors are calculated from the molecular structures of known compounds, and a quantitative structure-property relationship model is built by stepwise multiple linear regression analysis, so that the $K_{PA-w}$ values of organic compounds can be predicted quickly and efficiently. The method is simple, rapid and low-cost; it saves the manpower, materials and funds required for experimental testing, effectively predicts the PA membrane-water partition coefficient of organic compounds within the application domain, fills data gaps for other compounds, and provides necessary basic data for monitoring environmental compounds and for applying passive samplers.
Description
Technical Field
The invention relates to a method for predicting PA-water partition coefficients, and in particular to a method for predicting the PA-water partition coefficients of organic pollutants based on a quantitative structure-property relationship.
Background
Membrane passive sampling is now widely used to measure the freely dissolved concentration of organic compounds and to evaluate their environmental exposure risk. PA (polyacrylate) is a polar polymer containing heteroatoms, is more suitable for extracting hydrophobic organic compounds, and is widely used in passive sampling. The partition coefficient of an organic substance between the PA membrane and water ($K_{PA-w}$) is an important parameter for evaluating the environmental behavior of the compound, and an important index for measuring and optimizing the performance of passive samplers. Conventional experimental measurement is time-consuming and laborious, gives large errors for substances with unstable properties, and cannot keep up with the monitoring demands of the large and growing number of organic pollutants. A simpler, faster and more effective method for predicting the partition coefficient is therefore urgently needed.
The quantitative structure-property relationship (QSPR) is a computational modeling approach that relates the molecular structure of organic compounds to their physicochemical properties, environmental behavior and toxicological parameters. It can make up for missing environmental behavior and ecotoxicological data on organic compounds, greatly reduces experimental cost, and helps to reduce or replace related experiments. In 2004 the OECD proposed criteria for the construction and use of QSPR models, stating that QSPR models meeting the following requirements can be applied to the risk assessment and management of chemicals: (1) a well-defined environmental endpoint; (2) a clear and transparent algorithm that supports mechanistic interpretation; (3) a defined application domain; (4) appropriate goodness of fit, stability and predictive ability. These criteria set the direction for the development of QSPR models.
Many studies have reported simulation-based prediction of the $K_{PA-w}$ values of organic compounds, but most were carried out abroad and domestic research is relatively scarce. The literature "Toxicol. Mech. Method., 2005, 4(15), 307-" established a relationship model between the octanol-water partition coefficient ($K_{ow}$) and $\log K_{PA-w}$ with a very high correlation coefficient ($R^2$ reported as 1), but the relationship applies to only 3 chemical substances, so the compounds studied are of a single type and the application range is limited. The literature "Environ. Sci. Technol., 2017, 5(51), 3001-" established a relationship between $\log K_{PA-w}$ and the gas-stationary phase partition coefficients $L_1$ and $L_2$ with a high correlation coefficient ($R^2$ reported as 0.94), but it applies to only 14 chemical substances, so the compounds studied are again of a single type and the application range is limited.
Most models established in existing research cover a single type of compound, have a narrow application domain, and lack the necessary model characterization parameters. With the growing number of emerging pollutants, it is therefore necessary to develop a simple, rapid and efficient QSPR model for predicting the $K_{PA-w}$ of organic compounds.
Disclosure of Invention
The invention aims to provide a method for predicting the PA-water partition coefficient of organic pollutants based on a quantitative structure-property relationship, which can rapidly and effectively predict the $K_{PA-w}$ value of an organic pollutant from the molecular structure descriptors of the compound.
The purpose of the invention is achieved as follows: a method for predicting the PA-water partition coefficient of organic pollutants based on a quantitative structure-property relationship, comprising the following steps:
1) data collection: the $\log K_{PA-w}$ values of 198 organic compounds are collected from the literature to obtain a data set;
2) descriptor calculation: before the molecular structure descriptors are calculated, the initial molecular structure of each organic compound is generated in software and then optimized with the MM2 molecular mechanics method; the molecular structure descriptors of each compound are then calculated in software and preprocessed;
3) model construction: the data set is divided into a training set and a test set in a 4:1 ratio, stepwise multiple linear regression analysis is performed in SPSS software, and the optimal model is selected on the principle of fewer molecular descriptors together with a higher adjusted coefficient of determination ($R^2_{adj}$) and a higher external validation coefficient ($Q^2_{ext}$):
$$\log K_{PA-w} = 0.636\,\mathrm{CrippenLogP} - 1.274\,\mathrm{RNCG} - 12.849\,\mathrm{VE2\_Dzv} - 0.000439\,\mathrm{ATSC4v} + 1.823$$
where CrippenLogP is the Crippen octanol-water partition coefficient; RNCG is the relative negative charge; VE2_Dzv is the average coefficient sum of the last eigenvector of the Barysz matrix weighted by volume; ATSC4v is a 2D autocorrelation descriptor weighted by volume;
4) model validation: the model is validated, and step 5) is entered once the model is qualified;
5) application domain characterization: the application domain of the model is characterized by a Williams plot;
6) model application: the model is used to predict the PA-water partition coefficients of unknown compounds.
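As a minimal sketch of how the regression equation in step 3) could be applied, the following Python function evaluates it for one compound. The function name `predict_log_kpa_w` and the dictionary layout are illustrative assumptions, not part of the invention. The descriptor values are those listed for anthracene in Example 1; because they are rounded to three decimals, the computed result may differ slightly from the predicted value reported there.

```python
# Coefficients of the regression equation in step 3); the dict layout is an assumption.
COEFFICIENTS = {
    "CrippenLogP": 0.636,
    "RNCG": -1.274,
    "VE2_Dzv": -12.849,
    "ATSC4v": -0.000439,
}
INTERCEPT = 1.823


def predict_log_kpa_w(descriptors):
    """Evaluate the linear model for one compound (hypothetical helper)."""
    return INTERCEPT + sum(
        coef * descriptors[name] for name, coef in COEFFICIENTS.items()
    )


# Descriptor values listed for anthracene in Example 1.
anthracene = {"CrippenLogP": 3.993, "RNCG": 0.098, "VE2_Dzv": 0.000, "ATSC4v": -481.522}
log_k = predict_log_kpa_w(anthracene)
print(f"predicted log K_PA-w = {log_k:.2f}")
```

Keeping the coefficients in a named mapping makes it harder to pass descriptor values in the wrong order than with a positional argument list.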
As a further limitation of the present invention, the organic compounds in step 1) include hydrocarbons, halogenated hydrocarbons, oxygen-containing compounds, nitrogen- and sulfur-containing compounds, and pesticides.
As a further limitation of the invention, the preprocessing in step 2) includes removing descriptors that are constant, near-constant or missing, and descriptors with a correlation greater than 0.95.
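The descriptor filtering described above might be sketched as follows. This is an illustration under stated assumptions: the source does not define "near-constant", so the `near_constant_share` threshold is hypothetical, "correlation" is taken to mean the pairwise Pearson correlation between descriptors, and the function names and demo values are made up.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def preprocess(table, near_constant_share=0.8, max_corr=0.95):
    """Return the names of the descriptors that survive the filters.

    table maps descriptor name -> list of values (None marks a missing value).
    near_constant_share is an assumed threshold, not taken from the source.
    """
    kept = []
    for name, values in table.items():
        if any(v is None for v in values):
            continue  # descriptor has missing values
        most_common = max(values.count(v) for v in set(values))
        if most_common / len(values) >= near_constant_share:
            continue  # constant or near-constant descriptor
        # Drop a descriptor that is highly correlated with one already kept.
        if any(abs(pearson(values, table[k])) > max_corr for k in kept):
            continue
        kept.append(name)
    return kept


# Made-up descriptor table for illustration only.
demo = {
    "d_const": [1.0, 1.0, 1.0, 1.0, 1.0],
    "d_missing": [0.2, None, 0.4, 0.1, 0.9],
    "d_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d_a_scaled": [2.1, 4.0, 6.2, 8.1, 9.9],  # nearly collinear with d_a
    "d_b": [0.3, -1.2, 0.8, 0.1, -0.5],
}
print(preprocess(demo))
```

When two descriptors are highly correlated, this sketch keeps whichever appears first; other tie-breaking rules are equally plausible.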
As a further limitation of the present invention, in the validation of step 3) the bootstrap cross-validation coefficient $Q^2_{BOOT}$ and the leave-one-out cross-validation coefficient $Q^2_{LOO}$ are used to verify the robustness of the model; external validation uses the coefficient of determination between predicted and measured values ($R^2_{ext}$) and the test set root mean square error $RMSE_{ext}$ to characterize the external prediction capability of the model.
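The source names these validation statistics but does not give their formulas. The sketch below uses definitions that are common in QSPR practice, which is an assumption about what the inventors computed; the function names and sample numbers are made up for illustration.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_y = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Adjust R^2 for the number of descriptors p and the number of samples n."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def q2_ext(test_observed, test_predicted, train_mean):
    """External validation coefficient relative to the training-set mean."""
    ss_res = sum((y - p) ** 2 for y, p in zip(test_observed, test_predicted))
    ss_tot = sum((y - train_mean) ** 2 for y in test_observed)
    return 1.0 - ss_res / ss_tot


# Made-up numbers for illustration only; they are not data from the source.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(obs, pred)
print(rmse(obs, pred), r2, adjusted_r_squared(r2, n=len(obs), p=1))
```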
As a further limitation of the present invention, in step 1), for each compound, data that deviate markedly from the overall values are removed and the remaining values are averaged to build the data set; in step 3), the training set data are used for model construction and internal validation, and the test set data are used for external validation and performance evaluation of the model.
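A minimal sketch of the data set construction and the 4:1 split follows. The source does not say how deviating values were identified or how compounds were assigned to the two sets, so the random, seeded split and the helper names here are assumptions.

```python
import random
from statistics import mean


def build_dataset(measurements):
    """Average the retained log K values of each compound.

    measurements maps compound name -> list of literature values; values that
    deviate markedly are assumed to have been removed beforehand.
    """
    return {name: mean(values) for name, values in measurements.items()}


def split_4_to_1(names, seed=0):
    """Randomly split compound names into training and test sets (about 4:1)."""
    shuffled = sorted(names)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) / 5)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder compound names; only the set sizes matter in this illustration.
train, test = split_4_to_1([f"compound_{i}" for i in range(198)])
print(len(train), len(test))
```

Rounding one fifth of 198 gives 40 test compounds and 158 training compounds, the same set sizes the source reports in its validation parameters.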
As a further limitation of the present invention, step 5) specifically comprises: characterizing the application domain of the model with a Williams plot of standardized residuals against leverage values $h_i$. A compound whose absolute standardized residual is greater than 3.0 is an outlier; when the leverage value $h_i$ is greater than the warning value $h^*$, the structure of the compound differs markedly from those of the other compounds. $h_i$ and $h^*$ are calculated by the following formulas:
$$h_i = x_i^T (X^T X)^{-1} x_i$$
$$h^* = 3(p+1)/n$$
where $x_i$ is the descriptor vector of the i-th compound; $x_i^T$ is the transpose of $x_i$; $X$ is the descriptor matrix of all compounds; $X^T$ is the transpose of $X$; $(X^T X)^{-1}$ is the inverse of the matrix $X^T X$; $p$ is the number of variables in the model; and $n$ is the number of training set samples.
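The two formulas above can be sketched in plain Python as follows. The function names and the small training matrix are hypothetical. The sketch applies the formulas literally, with $X$ holding only descriptor columns; whether the inventors added an intercept column to $X$ is not stated in the source.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def leverage(x_i, x_train):
    """h_i = x_i^T (X^T X)^-1 x_i for one descriptor vector x_i."""
    k = len(x_i)
    xtx = [[sum(row[i] * row[j] for row in x_train) for j in range(k)] for i in range(k)]
    z = solve(xtx, x_i)  # z = (X^T X)^-1 x_i without forming the inverse
    return sum(u * v for u, v in zip(x_i, z))


def warning_leverage(p, n):
    """h* = 3(p + 1) / n."""
    return 3 * (p + 1) / n


# Made-up training matrix with two descriptors per compound (illustration only).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
h_values = [leverage(row, X) for row in X]
print(h_values, warning_leverage(p=2, n=len(X)))
```

Solving the linear system instead of inverting $X^T X$ explicitly gives the same leverage values and is the numerically safer habit. A useful sanity check is that the training-set leverages sum to the number of columns of $X$.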
Compared with the prior art, the invention has the following beneficial effects: the invention uses a simple and transparent stepwise multiple linear regression algorithm to construct a QSPR prediction model that covers organic compounds with a variety of structures and has good goodness of fit, robustness and predictive ability, providing an efficient tool for predicting the $\log K_{PA-w}$ values of organic compounds within the application domain. The method is low-cost, simple and rapid, and saves much of the manpower, materials and funds required for experimental testing. The establishment and validation of the $K_{PA-w}$ prediction method strictly follow the OECD guidelines for the development and use of QSPR models, so the $K_{PA-w}$ values of organic compounds predicted with the model are highly reliable, provide important basic data for chemical regulation, and offer important guidance for ecological risk assessment. The invention also has the following features:
1. a QSPR model with a transparent algorithm is established in accordance with the OECD guidelines on the construction and use of QSPR models, and it is easy to interpret mechanistically;
2. the model has appropriate goodness of fit, stability and predictive ability;
3. the model has a wide application range, covers organic compounds with a variety of structures, and can be used to predict the $K_{PA-w}$ values of different compounds, providing basic data for analysing the global environmental behavior and evaluating the ecological risk of organic compounds;
4. the model relies entirely on calculation, which makes up for missing environmental behavior and ecotoxicological data on organic compounds, greatly reduces experimental cost, helps to reduce or replace related experiments, and gives more efficient access to the $K_{PA-w}$ values of chemicals.
Drawings
FIG. 1 is a plot of the measured versus predicted $\log K_{PA-w}$ values in the present invention.
FIG. 2 is a Williams plot characterizing the application domain of the model in the present invention.
Detailed Description
A method for predicting PA-water distribution coefficient of organic pollutants based on quantitative structural property relation comprises the following steps:
1) data collection: the $\log K_{PA-w}$ values of 198 organic compounds are collected from the literature; for the same substance, data that deviate markedly from the overall values are removed and the remaining values are averaged for model construction; the organic compounds include hydrocarbons, halogenated hydrocarbons, oxygen-containing compounds, nitrogen- and sulfur-containing compounds, pesticides and other compounds; the data set is split into a training set and a test set in a 4:1 ratio;
2) descriptor calculation: before the molecular structure descriptors are calculated, the initial molecular structure of each organic compound is generated with ChemBio3D Ultra 12.0 software and then optimized with the MM2 (molecular mechanics) method; the molecular structure descriptors of each compound are then calculated in PaDEL-Descriptor software, and descriptors that are constant, near-constant or missing, or that have a correlation greater than 0.95, are removed;
3) model construction: stepwise multiple linear regression analysis is performed in SPSS software, and the optimal model is selected on the principle of fewer molecular descriptors together with a higher adjusted coefficient of determination ($R^2_{adj}$) and a higher external validation coefficient ($Q^2_{ext}$):
$$\log K_{PA-w} = 0.636\,\mathrm{CrippenLogP} - 1.274\,\mathrm{RNCG} - 12.849\,\mathrm{VE2\_Dzv} - 0.000439\,\mathrm{ATSC4v} + 1.823 \qquad (1)$$
where CrippenLogP is the Crippen octanol-water partition coefficient; RNCG is the relative negative charge (the charge of the most negative atom divided by the total negative charge); VE2_Dzv is the average coefficient sum of the last eigenvector of the Barysz matrix weighted by van der Waals volume; ATSC4v is a 2D autocorrelation descriptor (centered Broto-Moreau autocorrelation of lag 4, weighted by van der Waals volume);
4) model validation: the model is validated, and step 5) is entered once the model is qualified; the specific parameters are as follows:
$$n_{tra} = 158,\quad R^2_{adj} = 0.898,\quad Q^2_{LOO} = 0.858,\quad Q^2_{BOOT} = 0.793,\quad RMSE_{tra} = 0.162,\quad p < 0.001;$$
$$n_{ext} = 40,\quad R^2_{ext} = 0.797,\quad Q^2_{ext} = 0.741,\quad RMSE_{ext} = 0.586;$$
where $n_{tra}$ and $n_{ext}$ are the numbers of compounds in the training set and test set, respectively; $R^2_{adj}$ is the adjusted coefficient of determination; $Q^2_{LOO}$ is the leave-one-out cross-validation coefficient; $Q^2_{BOOT}$ is the bootstrap cross-validation coefficient; $RMSE_{tra}$ and $RMSE_{ext}$ are the root mean square errors of the training set and test set, respectively; $R^2_{ext}$ is the coefficient of determination in the test set; $Q^2_{ext}$ is the external validation coefficient;
the adjusted coefficient of determination $R^2_{adj} = 0.898$ and the training set root mean square error $RMSE_{tra} = 0.162$ show that the model fits well; the leave-one-out cross-validation coefficient $Q^2_{LOO} = 0.858$ and the bootstrap cross-validation coefficient $Q^2_{BOOT} = 0.793$ show that the model is robust; the external validation coefficient $Q^2_{ext} = 0.741$ and the test set root mean square error $RMSE_{ext} = 0.586$ show that the model has good external predictive ability; the goodness of fit and the validation results of the model are shown in FIG. 1;
5) application domain characterization: the application domain of the model is characterized by a Williams plot (FIG. 2).
The standardized residual is calculated as follows:
$$\delta = \frac{y_i - \hat{y}_i}{\sqrt{\sum_{j=1}^{n} (y_j - \hat{y}_j)^2 / (n - A - 1)}}$$
where $\delta$ is the standardized residual, $y_i$ and $\hat{y}_i$ are the experimental value and the predicted value of the i-th compound, respectively, $n$ is the number of compounds in the data set, and $A$ is the number of descriptors;
the leverage value ($h_i$) and the warning leverage value ($h^*$) are calculated by the following formulas:
$$h_i = x_i^T (X^T X)^{-1} x_i$$
$$h^* = 3(p+1)/n \qquad (4)$$
where $x_i$ is the descriptor vector of the i-th compound; $x_i^T$ is the transpose of $x_i$; $X$ is the descriptor matrix of all compounds; $X^T$ is the transpose of $X$; $(X^T X)^{-1}$ is the inverse of the matrix $X^T X$; $p$ is the number of variables in the model; and $n$ is the number of training set samples.
The $\delta$ and $h$ values of the model are calculated and plotted. When the absolute value of $\delta$ for a compound is greater than 3.0, the compound is considered a model outlier. When $h_i$ of a compound is greater than $h^*$, the structure of the compound differs markedly from those of the other compounds; here $h^* = 0.090$, so the model is suitable for predicting the $\log K_{PA-w}$ values of compounds with $h_i$ less than 0.090.
6) model application: the model is used to predict the PA-water partition coefficients of unknown compounds; the effect of the model is further illustrated below with reference to the examples.
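The Williams-plot decision rules in step 5) could be sketched as below. The function names are hypothetical. The standardized residual is assumed to be the residual divided by the residual standard deviation with $n - A - 1$ degrees of freedom, a common QSPR convention that matches the symbols defined in the text. The leverage values 0.020 and 0.060 are those reported in Examples 1 and 2, and 0.090 is the warning value stated in step 5); the residual of 0.5 is made up.

```python
import math


def standardized_residuals(observed, predicted, n_descriptors):
    """Residuals divided by the residual standard deviation (n - A - 1 degrees of freedom)."""
    residuals = [y - p for y, p in zip(observed, predicted)]
    dof = len(observed) - n_descriptors - 1
    scale = math.sqrt(sum(r ** 2 for r in residuals) / dof)
    return [r / scale for r in residuals]


def classify(delta, h, h_star, delta_limit=3.0):
    """Label one compound by its position in the Williams plot."""
    if abs(delta) > delta_limit:
        return "outlier"
    if h > h_star:
        return "outside application domain"
    return "inside application domain"


# Leverages from Examples 1 and 2 with the stated warning value; delta is made up.
print(classify(delta=0.5, h=0.020, h_star=0.090))
print(classify(delta=0.5, h=0.060, h_star=0.090))
```

Checking the residual first means a compound with both a large residual and a high leverage is reported as an outlier; reporting both conditions separately would be an equally reasonable design.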
Example 1
Given the compound anthracene, its $\log K_{PA-w}$ value is predicted. The molecular structure of anthracene is first optimized with the MM2 molecular mechanics method, and the values of the four molecular structure descriptors CrippenLogP, RNCG, VE2_Dzv and ATSC4v are then calculated in PaDEL-Descriptor software as 3.993, 0.098, 0.000 and -481.522, respectively. The $h_i$ value of the compound obtained according to calculation formula (2) is 0.020 < 0.090, so the compound is within the application domain of the model. Substituting the descriptor values into the model gives a predicted $\log K_{PA-w}$ of 4.46; the experimental value is 4.52, so the predicted value is very close to the experimental value.
Example 2
Given the compound nitrobenzene, its $\log K_{PA-w}$ value is predicted. The molecular structure of nitrobenzene is first optimized with the MM2 molecular mechanics method, and the values of the four molecular structure descriptors CrippenLogP, RNCG, VE2_Dzv and ATSC4v are then calculated in PaDEL-Descriptor software as 1.454, 0.255, 0.039 and -12.664, respectively. The $h_i$ value of the compound obtained according to calculation formula (2) is 0.060 < 0.090, so the compound is within the application domain of the model. Substituting the descriptor values into the model gives a predicted $\log K_{PA-w}$ of 1.92; the experimental value is 1.98, so the predicted value is very close to the experimental value.
Example 3
Given the compound dichlorvos, its $\log K_{PA-w}$ value is predicted. The molecular structure of dichlorvos is first optimized with the MM2 molecular mechanics method, and the values of the four molecular structure descriptors CrippenLogP, RNCG, VE2_Dzv and ATSC4v are then calculated in PaDEL-Descriptor software as 1.63, 0.228, 0.007 and 186.337, respectively. The $h_i$ value of the compound obtained according to calculation formula (2) is 0.008 < 0.090, so the compound is within the application domain of the model. Substituting the descriptor values into the model gives a predicted $\log K_{PA-w}$ of 2.47; the experimental value is 2.48, so the predicted value is very close to the experimental value.
Example 4
Given the compound 1-iodooctane, its $\log K_{PA-w}$ value is predicted. The molecular structure of 1-iodooctane is first optimized with the MM2 molecular mechanics method, and the values of the four molecular structure descriptors CrippenLogP, RNCG, VE2_Dzv and ATSC4v are then calculated in PaDEL-Descriptor software as 3.782, 0.183, 0.004 and 28.912, respectively. The $h_i$ value of the compound obtained according to calculation formula (2) is 0.014 < 0.090, so the compound is within the application domain of the model. Substituting the descriptor values into the model gives a predicted $\log K_{PA-w}$ of 3.92; the experimental value is 3.90, so the predicted value is very close to the experimental value.
Example 5
Given the compound cyclohexylbenzene, its $\log K_{PA-w}$ value is predicted. The molecular structure of cyclohexylbenzene is first optimized with the MM2 molecular mechanics method, and the values of the four molecular structure descriptors CrippenLogP, RNCG, VE2_Dzv and ATSC4v are then calculated in PaDEL-Descriptor software as 3.734, 0.000, 0.028 and -174.589, respectively. The $h_i$ value of the compound obtained according to calculation formula (2) is 0.028 < 0.090, so the compound is within the application domain of the model. Substituting the descriptor values into the model gives a predicted $\log K_{PA-w}$ of 4.08; the experimental value is 4.15, so the predicted value is very close to the experimental value.
The present invention is not limited to the above-mentioned embodiments, and based on the technical solutions disclosed in the present invention, those skilled in the art can make some substitutions and modifications to some technical features without creative efforts according to the disclosed technical contents, and these substitutions and modifications are all within the protection scope of the present invention.
Claims (6)
1. A method for predicting the PA-water partition coefficient of organic pollutants based on a quantitative structure-property relationship, characterized by comprising the following steps:
1) data collection: the $\log K_{PA-w}$ values of 198 organic compounds are collected from the literature to obtain a data set;
2) descriptor calculation: before the molecular structure descriptors are calculated, the initial molecular structure of each organic compound is generated in software and then optimized with the MM2 molecular mechanics method; the molecular structure descriptors of each compound are then calculated in software and preprocessed;
3) model construction: the data set is divided into a training set and a test set in a 4:1 ratio, stepwise multiple linear regression analysis is performed in SPSS software, and the optimal model is selected on the principle of fewer molecular descriptors together with a higher adjusted coefficient of determination ($R^2_{adj}$) and a higher external validation coefficient ($Q^2_{ext}$):
$$\log K_{PA-w} = 0.636\,\mathrm{CrippenLogP} - 1.274\,\mathrm{RNCG} - 12.849\,\mathrm{VE2\_Dzv} - 0.000439\,\mathrm{ATSC4v} + 1.823$$
where CrippenLogP is the Crippen octanol-water partition coefficient; RNCG is the relative negative charge; VE2_Dzv is the average coefficient sum of the last eigenvector of the Barysz matrix weighted by volume; ATSC4v is a 2D autocorrelation descriptor weighted by volume;
4) model validation: the model is validated, and step 5) is entered once the model is qualified;
5) application domain characterization: the application domain of the model is characterized by a Williams plot;
6) model application: the model is used to predict the PA-water partition coefficients of unknown compounds.
2. The method for predicting PA-water partition coefficient of organic pollutant according to claim 1, wherein the organic compound in step 1) comprises hydrocarbon, halogenated hydrocarbon, oxygen-containing compound, nitrogen-sulfur compound and pesticide.
3. The method for predicting the PA-water partition coefficient of organic pollutants according to claim 1, wherein the pretreatment in the step 2) comprises removing descriptors with constant, near constant, missing and correlation larger than 0.95.
4. The method for predicting the PA-water partition coefficient of organic pollutants according to claim 1, wherein in the validation of step 3) the bootstrap cross-validation coefficient $Q^2_{BOOT}$ and the leave-one-out cross-validation coefficient $Q^2_{LOO}$ are used to verify the robustness of the model; external validation uses the coefficient of determination between predicted and measured values ($R^2_{ext}$) and the test set root mean square error $RMSE_{ext}$ to characterize the external prediction capability of the model.
5. The method for predicting the PA-water partition coefficient of organic pollutants according to claim 1, wherein in step 1), for each compound, data that deviate markedly from the overall values are removed and the remaining values are averaged to build the data set; and in step 3), the training set data are used for model construction and internal validation, and the test set data are used for external validation and performance evaluation of the model.
6. The method for predicting the PA-water partition coefficient of the organic pollutant according to claim 1, wherein step 5) specifically comprises: characterizing the application domain of the model with a Williams diagram of standardized residuals versus leverage values $h_i$; when the absolute value of the standardized residual of a compound is greater than 3.0, the compound is an outlier; when the leverage value $h_i$ of a compound is greater than the warning value $h^*$, the structure of the compound differs significantly from the structures of the other compounds; $h_i$ and $h^*$ are calculated by the following formulas:
$$h_i = x_i^{T}\,(X^{T}X)^{-1}\,x_i$$
$$h^* = 3(p+1)/n$$
wherein $x_i$ is the descriptor matrix of the i-th compound; $x_i^{T}$ is the transpose of $x_i$; $X$ is the descriptor matrix of all compounds; $X^{T}$ is the transpose of $X$; $(X^{T}X)^{-1}$ is the inverse of the matrix $X^{T}X$; $p$ is the number of variables in the model; and $n$ is the number of training set samples.
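The leverage calculation of claim 6 can be illustrated with a minimal sketch in plain Python. It is not the patent's implementation: the function names (`solve`, `leverages`, `warning_leverage`, `williams_flags`) and the toy descriptor values are hypothetical. The claim does not state whether X carries a column of ones for the intercept, so the sketch uses the rows exactly as they are passed in.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular (collinear descriptors)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def leverages(X_train, X_query):
    """h_i = x_i^T (X^T X)^-1 x_i, with X^T X built from the training rows."""
    p = len(X_train[0])
    xtx = [[sum(row[a] * row[b] for row in X_train) for b in range(p)]
           for a in range(p)]
    result = []
    for x in X_query:
        z = solve(xtx, list(x))  # z = (X^T X)^-1 x_i, without forming the inverse
        result.append(sum(xi * zi for xi, zi in zip(x, z)))
    return result


def warning_leverage(p, n):
    """h* = 3(p + 1)/n, p model variables and n training compounds."""
    return 3 * (p + 1) / n


def williams_flags(h, std_residual, h_star):
    """Return (is_outlier, is_high_leverage) for one compound."""
    return abs(std_residual) > 3.0, h > h_star


# Toy data: 5 hypothetical training compounds, 2 hypothetical descriptors.
X_train = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
h_star = warning_leverage(p=2, n=len(X_train))
h_query = leverages(X_train, [[50.0, 1.0]])[0]
print("h* =", h_star, "query leverage =", h_query)
print("flags:", williams_flags(h_query, std_residual=0.4, h_star=h_star))
```

Solving the linear system for each compound avoids forming the inverse of $X^{T}X$ explicitly; for a model with only a few descriptors the cost difference is negligible, so this is a matter of numerical habit rather than necessity.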
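The validation statistics named in claim 4 can be sketched in the same way. The claim gives no formulas, so the definitions below are assumptions: the commonly used $Q^2_{LOO} = 1 - \mathrm{PRESS}/\mathrm{TSS}$ and the usual root mean square error. The helper names and the toy data are hypothetical, and the bootstrap coefficient $Q^2_{BOOT}$ is not covered.

```python
from math import sqrt


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def fit_mlr(X, y):
    """Ordinary least squares with intercept; returns [b0, b1, ..., bp]."""
    A = [[1.0] + list(row) for row in X]
    k = len(A[0])
    xtx = [[sum(r[a] * r[b] for r in A) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(A, y)) for a in range(k)]
    return solve(xtx, xty)


def predict(coef, row):
    """Predicted value for one compound from fitted coefficients."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def q2_loo(X, y):
    """Leave-one-out Q^2: refit without compound i, then predict compound i."""
    press = 0.0
    for i in range(len(y)):
        coef = fit_mlr(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef, X[i])) ** 2
    mean_y = sum(y) / len(y)
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / tss


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


# Toy data: hypothetical descriptor values and measured log K values.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
y = [1.1, 2.9, 0.1, 2.0, 4.1, 0.9]
coef = fit_mlr(X, y)
print("Q2_LOO =", q2_loo(X, y))
print("RMSE on the fitted data =", rmse(y, [predict(coef, r) for r in X]))
```

For the external statistic, `rmse` would be applied to the measured values of the held-out compounds and the predictions made for them by a model fitted on the training set only.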
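The descriptor pretreatment of claim 3 can be sketched as a simple filter. Only the 0.95 correlation limit comes from the claim. The definition of "near-constant" (one value shared by at least 80% of the compounds), the rule of keeping the first descriptor of a correlated pair, and all names and values below are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def filter_descriptors(table, near_constant_share=0.8, max_corr=0.95):
    """Return the names of the descriptors that survive pretreatment.

    table maps a descriptor name to its values, one per compound;
    None marks a missing value.
    """
    kept = []
    for name, values in table.items():
        if any(v is None for v in values):
            continue  # descriptor has missing values
        top_share = Counter(values).most_common(1)[0][1] / len(values)
        if top_share >= near_constant_share:
            continue  # constant (share 1.0) or near-constant (assumed rule)
        if any(abs(pearson(values, table[k])) > max_corr for k in kept):
            continue  # too strongly correlated with a descriptor already kept
        kept.append(name)
    return kept


# Hypothetical descriptor table for six compounds.
table = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "A_scaled": [2.0, 4.0, 6.0, 8.0, 10.0, 12.1],
    "const": [7.0, 7.0, 7.0, 7.0, 7.0, 7.0],
    "near_const": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    "with_gap": [1.0, None, 3.0, 2.0, 5.0, 4.0],
    "B": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
}
print(filter_descriptors(table))
```

Because the filter keeps the first descriptor of each correlated pair, the result depends on the order of the table; a real workflow might instead keep the descriptor that correlates better with the measured values.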
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010939877.2A CN112086141A (en) | 2020-09-09 | 2020-09-09 | Method for predicting PA-water distribution coefficient of organic pollutant based on quantitative structure property relation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112086141A (en) | 2020-12-15 |
Family
ID=73732950
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010939877.2A Pending CN112086141A (en) | 2020-09-09 | 2020-09-09 | Method for predicting PA-water distribution coefficient of organic pollutant based on quantitative structure property relation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112086141A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110534163A (en) * | 2019-08-22 | 2019-12-03 | Dalian University of Technology | Method for predicting octanol/water partition coefficients of organic compounds using a multi-parameter linear free energy relationship model |
Non-Patent Citations (2)
Title |
---|
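Tengyi Zhu, Yuanyuan Gu, Haomiao Cheng, Ming Chen: "Versatile modelling of polyoxymethylene-water partition coefficients for hydrophobic organic contaminants using linear and nonlinear approaches", Science of the Total Environment, vol. 728 * |
Zhu Tengyi, Jiang Yue: "Predicting the partition coefficients of organic pollutants between PDMS and water based on theoretical linear solvation energy relationships", Journal of Southeast University (Natural Science Edition), vol. 50, no. 1 * |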
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113160902A (en) * | 2021-04-09 | 2021-07-23 | Dalian University of Technology | Method for predicting enantioselectivity of chemical reaction product |
CN113160902B (en) * | 2021-04-09 | 2024-05-10 | Dalian University of Technology | Method for predicting enantioselectivity of chemical reaction product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||