CN111768813A - Prediction method of organic PDMS membrane-water partition coefficient based on quantitative structure-activity relationship model based on SW-SVM algorithm
- Publication number
- CN111768813A (application CN202010645135.9A)
- Authority
- CN
- China
- Prior art keywords
- model
- pdms
- value
- organic
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/30—Prediction of properties of chemical compounds, compositions or mixtures
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C10/00—Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
Description
Technical Field
The invention relates to a method for determining the PDMS membrane/water partition coefficient of organic compounds, and in particular to a method for predicting the PDMS membrane-water partition coefficient of organic compounds with a quantitative structure-activity relationship model based on the SW-SVM algorithm.
Background
Membrane passive sampling is a widely accepted technique for measuring the freely dissolved concentration of organic compounds and assessing their environmental exposure risk. PDMS (polydimethylsiloxane) membranes are among the most widely used passive sampling materials because of their good thermal stability and high affinity for hydrophobic compounds. The partition coefficient of an organic compound between the PDMS membrane and water (K_PDMS-w) is an important parameter for evaluating the environmental behavior of the compound and is key to the successful use of passive samplers. Conventional experimental measurement is time-consuming and labor-intensive, and cannot keep up with the large and growing number of organic pollutants that require environmental monitoring and management. It is therefore particularly important to develop a simple and accurate theoretical method for estimating the K_PDMS-w of organic compounds.
A quantitative structure-property relationship (QSPR) is a computational modeling approach that relates the molecular structure of organic compounds to their physicochemical properties, environmental behavior, and toxicological parameters. It can reduce or replace experiments, fill gaps in experimental data, and lower experimental costs. Few methods currently exist for predicting the K_PDMS-w of organic compounds, and nonlinear methods in particular have received little study. Existing models cover a narrow range and a small number of substances, and their prediction accuracy needs further improvement. Because the environmental partitioning of organic compounds is a complex process, the partition coefficient may involve nonlinear relationships. It is therefore necessary to build a nonlinear K_PDMS-w prediction model that covers many classes of compounds, has a clearly defined algorithm, is easy to apply and disseminate, and does not depend on experimental data, and to validate and characterize that model according to the OECD guidelines.
Given the high dimensionality of molecular descriptors, selecting the most useful subset of the original variables for modeling is increasingly important. To select more reasonable molecular descriptors for the QSPR model, stepwise linear regression is used as the screening method to reduce the number of variables. To build a reliable QSPR model, the nonlinear algorithm used is support vector machine (SVM) regression, which is simple to implement, robust, and generalizes well. Setting a random seed when calling the R package e1071 also makes the model reproducible.
Summary of the Invention
The object of the invention is to provide a method for predicting the PDMS membrane-water partition coefficient of organic compounds with a quantitative structure-activity relationship model based on the SW-SVM algorithm. The method predicts the K_PDMS-w value of an organic compound quickly and effectively, directly from its molecular structure descriptors. This supports pollutant risk assessment, helps decision makers and managers set standards for chemical emissions, and offers new approaches to environmental governance.
The object of the invention is achieved by a method for predicting the PDMS membrane-water partition coefficient of organic compounds with a quantitative structure-activity relationship model based on the SW-SVM algorithm, characterized by comprising the following steps:
Step 1) Data collection: collect log K_PDMS-w values for a number of organic compounds from the literature; from the resulting data set, select 1/5 of the compounds according to their log K_PDMS-w values as the validation set and use the rest as the training set;
Step 2) Descriptor calculation: optimize the initial molecular structure of each organic compound with the MM2 molecular mechanics method, compute its molecular structure descriptors with alvaDesc 1.0.0, and, after preprocessing, screen out the final descriptors by stepwise linear regression;
Step 3) Model construction: with the final descriptors as independent variables and the logarithm of the PDMS-water partition coefficient, log K_PDMS-w, as the dependent variable, build a QSPR prediction model on the training set with the support vector machine regression algorithm; select the optimal parameters by k-fold cross-validation and construct the QSPR model based on the optimal SW-SVM algorithm;
Step 4) Model validation: validate the model in two parts: a) evaluate the goodness of fit and robustness of the model; b) characterize the application domain of the model and evaluate its performance; once the model passes validation, proceed to step 5);
Step 5) Application domain characterization: characterize the application domain of the model with a Williams plot;
Step 6) Model application: use the model to predict the PDMS membrane/water partition coefficient of unknown compounds.
As a further limitation of the invention, in step 1) the organic compounds include polycyclic aromatic hydrocarbons, polychlorinated biphenyls, benzenes, pesticides, ethers, dioxins, esters, aliphatic compounds, hydrocarbons, and nitrogen-sulfur compounds.
As a further limitation of the invention, in step 1), where several values were collected for the same substance, values that clearly deviate from the rest are discarded and the mean of the remaining values is used for model construction.
As a further limitation of the invention, the organic compounds in the training set of step 1) are used to build the model and for internal validation, and the organic compounds in the validation set are used for external validation of the model.
As a further limitation of the invention, in step 2) the preprocessing removes descriptors that are constant, near-constant, or missing, and descriptors with a correlation greater than 0.95.
As a further limitation of the invention, in step 3) the QSPR model based on the optimal SW-SVM algorithm is built with an R package, as follows:
Step 3-1: divide the whole data set into k subsets. Each subset serves in turn as the test set while the remaining subsets form the training set. Training and testing are repeated k times, so that every subset is used as the test set exactly once;
Step 3-2: compute and compare the average cross-validation accuracy over the k rounds of training and select the parameter set with the highest cross-validation accuracy. This parameter set (cost, gamma) is used as the optimum from k-fold cross-validation in the support vector machine regression. The penalty factor cost controls the relative weight of structural risk and empirical risk in the model and thus strongly affects model performance. The gamma parameter determines the distribution of the data after mapping into the new feature space. The prediction model uses a radial basis function kernel with $\gamma = 1/(2\sigma^2)$, where σ is the width parameter of the function and controls its radial range of action;
Step 3-3: apply the selected parameters to the model to build the optimized model.
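The relation between gamma and the kernel width σ in step 3-2 can be illustrated with a minimal Python sketch. It assumes the common Gaussian form of the radial basis function kernel, exp(-gamma * ||x - z||^2); the function names are hypothetical and are not taken from the patent or from e1071.

```python
import math


def gamma_from_sigma(sigma):
    """Convert the kernel width sigma to gamma via gamma = 1 / (2 * sigma^2)."""
    return 1.0 / (2.0 * sigma ** 2)


def rbf_kernel(x, z, gamma):
    """Gaussian RBF kernel value exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


# A wider kernel (larger sigma, smaller gamma) treats distant points as more similar.
narrow = rbf_kernel([0.0], [1.0], gamma_from_sigma(0.5))
wide = rbf_kernel([0.0], [1.0], gamma_from_sigma(2.0))
```

A small σ (large gamma) makes the kernel decay quickly with distance, so each training compound influences only its near neighbors in descriptor space.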
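As a further limitation of the invention, in step 4) the goodness of fit and robustness of the model are evaluated with the following indicators: the degrees-of-freedom-adjusted coefficient of determination, the training-set root mean square error RMSE_tra, and the training-set mean absolute error MAE_tra.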
As a further limitation of the invention, step 5) specifically comprises: characterizing the application domain of the model with a Williams plot of the standardized residual δ against the leverage value h_i. When the absolute value of δ is greater than 3.0, the compound is an outlier. When the leverage value h_i is greater than the warning value h*, the structure of the compound differs significantly from that of the other compounds. h_i and h* are calculated as follows:

$$h_i = x_i^{T}(X^{T}X)^{-1}x_i$$

$$h^{*} = 3(p+1)/n$$

where $x_i$ is the descriptor vector of the i-th compound; $x_i^{T}$ is its transpose; $X$ is the descriptor matrix of all compounds; $X^{T}$ is the transpose of $X$; $(X^{T}X)^{-1}$ is the inverse of the matrix $X^{T}X$; p is the number of variables in the model; and n is the number of data points in the data set.
Compared with the prior art, the invention has the following beneficial effects:
1. The QSPR model is built according to the OECD guidelines on the construction and use of QSPR models and has good goodness of fit, robustness, and predictive ability;
2. The model has a wide application domain covering organic compounds of many structural types. It can be used to predict the K_PDMS-w values of different compounds and provides basic data for global environmental behavior analysis and ecological risk assessment of organic compounds;
3. The model is entirely computational. Unlike earlier methods that rely on experimental values, it greatly reduces experimental cost and obtains the K_PDMS-w values of chemicals more efficiently;
4. The invention predicts the PDMS membrane/water partition coefficients of many organic compounds quickly and effectively. The method is inexpensive, simple, and fast, and saves considerable manpower, material, and financial resources. The QSPR model is built and validated strictly according to the OECD guidelines on the construction and use of QSPR models. It is accurate and reliable, yields the K_PDMS-w values of substances effectively, provides important basic data for chemical regulation, and offers important guidance for ecological risk assessment.
Description of Drawings
FIG. 1 is a flow chart of the prediction method of the invention.
FIG. 2 is a plot of the experimental versus predicted log K_PDMS-w values for the data set of the invention.
FIG. 3 is the Williams plot of the invention.
Detailed Description
As shown in FIG. 1, a method for predicting the PDMS membrane-water partition coefficient of organic compounds with a quantitative structure-activity relationship model based on the SW-SVM algorithm comprises the following steps.
Step 1) The log K_PDMS-w values of 347 organic compounds were collected from the literature. Where several values were available for the same substance, values that clearly deviated from the rest were discarded and the mean of the remaining values was used for model construction. The organic compounds include polycyclic aromatic hydrocarbons, polychlorinated biphenyls, benzenes, pesticides, ethers, dioxins, esters, aliphatic compounds, hydrocarbons, and nitrogen-sulfur compounds. The data set was sorted by log K_PDMS-w value; the first 4/5 of the compounds were taken as the training set and the rest as the validation set, giving 277 training compounds and 70 validation compounds. The training-set compounds are used to build the model and for internal validation, and the validation-set compounds are used for external validation of the model.
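The split described in step 1) can be sketched as follows. This is a minimal illustration that reads the text literally (sort by log K_PDMS-w, keep the first 4/5 for training); an interleaved selection would be another plausible reading. The function name and the placeholder records are hypothetical and are not data from the patent.

```python
def split_by_log_k(records, numerator=4, denominator=5):
    """Sort (name, log_k) records by log_k and split them into training and validation sets.

    The first numerator/denominator share of the sorted records becomes the
    training set; the remainder becomes the validation set.
    """
    ordered = sorted(records, key=lambda record: record[1])
    n_train = len(ordered) * numerator // denominator
    return ordered[:n_train], ordered[n_train:]


# Placeholder records only: 347 synthetic (name, log_k) pairs, not real measurements.
placeholder = [(f"compound_{i}", 0.01 * i) for i in range(347)]
training_set, validation_set = split_by_log_k(placeholder)
```

With 347 records, integer division gives 277 training and 70 validation compounds, which agrees with the counts stated in the text.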
Step 2) The initial molecular structure of each organic compound is optimized with the MM2 molecular mechanics method, and its molecular structure descriptors are computed with alvaDesc. Descriptors that are constant, near-constant, or missing, and descriptors with a correlation greater than 0.95, are removed, and the final descriptors are then screened out by stepwise linear regression.
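The preprocessing in step 2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patent does not define "near-constant", so the 80 % share threshold below is an assumption, as are the function names and the rule of keeping the earlier descriptor of each highly correlated pair. The stepwise linear regression that follows the pre-filter is not shown.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def prefilter(descriptors, near_constant_share=0.8, max_corr=0.95):
    """Filter a dict of descriptor name -> list of values (None marks a missing value)."""
    kept = {}
    for name, values in descriptors.items():
        if any(v is None for v in values):
            continue  # descriptor has missing values
        most_common = max(values.count(v) for v in set(values))
        if most_common / len(values) >= near_constant_share:
            continue  # constant or near-constant (threshold is an assumption)
        if any(abs(pearson(values, other)) > max_corr for other in kept.values()):
            continue  # highly correlated with a descriptor already kept
        kept[name] = values
    return kept


# Toy descriptor table with hypothetical names and values.
toy = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.0, 4.0, 6.0, 8.0, 10.1],   # almost collinear with d1
    "d3": [1.0, 1.0, 1.0, 1.0, 1.0],    # constant
    "d4": [1.0, None, 3.0, 4.0, 5.0],   # missing value
    "d5": [5.0, 1.0, 4.0, 2.0, 3.0],
}
selected = prefilter(toy)
```

In the toy table only `d1` and `d5` survive: `d2` is dropped for correlation, `d3` for being constant, and `d4` for the missing value.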
Step 3) With the selected final descriptors as independent variables and the logarithm of the PDMS/water partition coefficient, log K_PDMS-w, as the dependent variable, the support vector machine regression algorithm of the R package e1071 is called to build the QSPR prediction model, and the optimal parameters (cost, gamma) are selected by k-fold cross-validation. The QSPR model based on the optimal SW-SVM algorithm is built with the R package e1071 as follows:
Step 3-1: divide the whole data set into k subsets. Each subset serves in turn as the test set while the remaining subsets form the training set. Training and testing are repeated k times, so that every subset is used as the test set exactly once;
Step 3-2: compute and compare the average cross-validation accuracy over the k rounds of training and select the parameter set with the highest cross-validation accuracy. This parameter set (cost, gamma) is used as the optimum from k-fold cross-validation in the support vector machine regression. The penalty factor cost controls the relative weight of structural risk and empirical risk in the model and thus strongly affects model performance. The gamma parameter determines the distribution of the data after mapping into the new feature space. The prediction model uses a radial basis function kernel with $\gamma = 1/(2\sigma^2)$, where σ is the width parameter of the function and controls its radial range of action;
Step 3-3: apply the selected parameters to the model to build the optimized model.
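Steps 3-1 to 3-3 can be sketched as a k-fold grid search over (cost, gamma). The sketch below uses only the Python standard library and makes several assumptions: cross-validated RMSE stands in for the "cross-validation accuracy" named in the text, and the model is a simple RBF-weighted smoother that is NOT a support vector machine and ignores `cost`. It is a stand-in so that the loop is runnable; the patent itself uses SVM regression from the R package e1071. All names are hypothetical.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cv_rmse(fit_predict, X, y, params, k=5, seed=0):
    """Cross-validated RMSE: each fold is held out exactly once."""
    squared_error = 0.0
    for test in k_fold_indices(len(X), k, seed):
        held_out = set(test)
        train = [i for i in range(len(X)) if i not in held_out]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test], **params)
        squared_error += sum((p - y[i]) ** 2 for p, i in zip(preds, test))
    return math.sqrt(squared_error / len(X))


def grid_search(fit_predict, X, y, grid, k=5):
    """Return (best RMSE, best parameter dict) over the candidate grid."""
    scored = [(cv_rmse(fit_predict, X, y, params, k), params) for params in grid]
    return min(scored, key=lambda item: item[0])


def rbf_smoother(X_train, y_train, X_test, cost, gamma):
    """Stand-in model, NOT an SVM: RBF-weighted average; `cost` is accepted but unused."""
    preds = []
    for x in X_test:
        weights = [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xt)))
                   for xt in X_train]
        total = sum(weights)
        if total == 0.0:
            preds.append(sum(y_train) / len(y_train))  # fall back to the mean
        else:
            preds.append(sum(w * t for w, t in zip(weights, y_train)) / total)
    return preds


# Synthetic one-descriptor data, for illustration only.
X = [[i / 20] for i in range(20)]
y = [math.sin(3 * row[0]) for row in X]
grid = [{"cost": c, "gamma": g} for c in (1, 10) for g in (0.1, 1, 10)]
best_rmse, best_params = grid_search(rbf_smoother, X, y, grid)
```

In practice the stand-in would be replaced by an SVM regression fit, and the selected (cost, gamma) pair would then be used to refit on the full training set, as in step 3-3.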
Step 4) The model is validated and characterized in two parts: 1) evaluation of the goodness of fit and robustness of the model; 2) characterization of the application domain of the model and evaluation of its performance.
The fitting ability of the model is characterized by the degrees-of-freedom-adjusted coefficient of determination, the training-set root mean square error RMSE_tra, and the training-set mean absolute error MAE_tra. In this embodiment, RMSE_tra = 0.457 and MAE_tra = 0.329. Smaller errors indicate a better fit, so the model has good goodness of fit and robustness. External validation uses the fitting coefficients between predicted and measured values (R² and Q² in the criteria below) and the concordance correlation coefficient CCC to express the external predictive ability of the model. The acceptance criteria are: R² > 0.7, Q² > 0.6, R² - Q² < 0.3, and CCC > 0.85. In this embodiment, the final model gives CCC = 0.930, which shows that the model has good external predictive ability. FIG. 2 shows the goodness of fit and the validation results of the model.
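The indicators named in step 4) can be computed as in the following sketch. It uses the standard textbook definitions (RMSE, MAE, coefficient of determination, adjusted R² with p descriptors, and Lin's concordance correlation coefficient); the patent does not spell out its exact formulas, so treating these definitions as the ones used is an assumption, and the function names are hypothetical. Q² is left out because several definitions are in use and the text does not say which one applies.

```python
import math


def rmse(observed, predicted):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Degrees-of-freedom-adjusted R^2 for n compounds and p descriptors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def ccc(observed, predicted):
    """Lin's concordance correlation coefficient."""
    n = len(observed)
    mean_o, mean_p = sum(observed) / n, sum(predicted) / n
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted)) / n
    return 2.0 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)


# Toy values for illustration only, not data from the patent.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.5, 3.5, 4.5]
toy_ccc = ccc(obs, pred)
```

The toy prediction is perfectly correlated with the observations but shifted by 0.5, so its CCC falls below 1; this sensitivity to systematic offsets is the reason CCC is used alongside correlation-type coefficients for external validation.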
Step 5) The application domain of the model is characterized with a Williams plot of the standardized residual δ against the leverage value h_i. It is generally accepted that a compound is an outlier when the absolute value of δ is greater than 3.0, and that the structure of a compound differs significantly from that of the other compounds when its leverage value h_i is greater than the warning value h*. h_i and h* are calculated as follows:

$$h_i = x_i^{T}(X^{T}X)^{-1}x_i$$

$$h^{*} = 3(p+1)/n$$

where $x_i$ is the descriptor vector of the i-th compound; $x_i^{T}$ is its transpose; $X$ is the descriptor matrix of all compounds; $X^{T}$ is the transpose of $X$; $(X^{T}X)^{-1}$ is the inverse of the matrix $X^{T}X$; p is the number of variables in the model; and n is the number of data points in the data set. As shown in FIG. 3, h* for the model is 0.054, so the model is suitable for predicting the log K_PDMS-w values of compounds with h_i below 0.054.
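The leverage and warning-value formulas can be checked with a short standard-library sketch. It follows the formula exactly as written (no intercept column is added to X, which is an assumption about the intended design matrix). With p = 4 descriptors and n = 277 training compounds the warning value evaluates to 0.054, which matches the reported h*; using the training-set size for n is an inference from that agreement. Function names are hypothetical.

```python
def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def leverage(X, x):
    """h = x^T (X^T X)^{-1} x for descriptor matrix X (list of rows) and vector x."""
    p = len(x)
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    z = solve(xtx, list(x))
    return sum(a * b for a, b in zip(x, z))


def warning_leverage(p, n):
    """h* = 3 (p + 1) / n."""
    return 3.0 * (p + 1) / n


def in_application_domain(h, std_residual, h_star, max_abs_residual=3.0):
    """True when the leverage is at most h* and |delta| is at most 3.0."""
    return h <= h_star and abs(std_residual) <= max_abs_residual


h_star = warning_leverage(p=4, n=277)

# Tiny toy descriptor matrix, for illustration only.
toy_X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_leverages = [leverage(toy_X, row) for row in toy_X]
```

For a full-rank descriptor matrix the leverages of the training compounds sum to the number of columns, which is a convenient sanity check on any implementation.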
Step 6) Model application: use the model to predict the PDMS membrane/water partition coefficient of unknown compounds.
Example 1: predict the log K_PDMS-w value of the compound 3-chlorophenol. First, the molecular structure of 3-chlorophenol is optimized with the MM2 molecular mechanics method. Next, the values of its four molecular descriptors BLTD48, Hy, SpMaxA_B(s), and SpMaxA_AEA(dm) are calculated from the optimized structure with alvaDesc 1.0.0, giving -3.341, -0.039, 0.789, and 0.374, respectively. The leverage formula gives h_i = 0.0353 < 0.054 for this substance, so the compound lies within the application domain of the model. Substituting the descriptor values into the model gives a predicted log K_PDMS-w of 0.22, close to the experimental value of 0.31.
Example 2: predict the log K_PDMS-w value of the compound pentachlorophenol. First, the molecular structure of pentachlorophenol is optimized with the MM2 molecular mechanics method. Next, the values of its four molecular descriptors BLTD48, Hy, SpMaxA_B(s), and SpMaxA_AEA(dm) are calculated from the optimized structure with alvaDesc 1.0.0, giving -5.033, 0.078, 0.526, and 0.313, respectively. The leverage formula gives h_i = 0.0178 < 0.054 for this substance, so the compound lies within the application domain of the model. Substituting the descriptor values into the model gives a predicted log K_PDMS-w of 2.68, close to the experimental value of 2.65.
The invention is not limited to the embodiments above. On the basis of the disclosed technical solution, those skilled in the art can substitute or modify some of its technical features without creative effort, and all such substitutions and modifications fall within the scope of protection of the invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010645135.9A CN111768813A (en) | 2020-07-07 | 2020-07-07 | Prediction method of organic PDMS membrane-water partition coefficient based on quantitative structure-activity relationship model based on SW-SVM algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010645135.9A CN111768813A (en) | 2020-07-07 | 2020-07-07 | Prediction method of organic PDMS membrane-water partition coefficient based on quantitative structure-activity relationship model based on SW-SVM algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111768813A (en) | 2020-10-13 |
Family
ID=72724655
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010645135.9A Pending CN111768813A (en) | 2020-07-07 | 2020-07-07 | Prediction method of organic PDMS membrane-water partition coefficient based on quantitative structure-activity relationship model based on SW-SVM algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111768813A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105868540A (en) * | 2016-03-25 | 2016-08-17 | 哈尔滨理工大学 | A polycyclic aromatic hydrocarbon property/toxicity prediction method using an intelligent support vector machine |
CN109212096A (en) * | 2018-11-02 | 2019-01-15 | 扬州大学 | Hydrophobic organic compound LDPE film/water partition coefficient rapid assay methods based on surfactant strengthening extraction |
Non-Patent Citations (7)
Title |
---|
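ANDREA MAURI: "alvaDesc: A Tool to Calculate and Analyze Molecular Descriptors and Fingerprints", 《ECOTOXICOLOGICAL QSARS》 * |
ZHU Tengyi et al.: "Prediction of partition coefficients of organic pollutants between PDMS and water based on theoretical linear solvation energy relationships", 《Journal of Southeast University (Natural Science Edition)》 * |
LI Meiping: "Application of QSAR/QSPR methods in environmental, pharmaceutical and materials chemistry", 《China Doctoral Dissertations Full-text Database, Engineering Science and Technology I》 * |
LI Yanwei: "Application of QSPR studies in materials chemistry and environmental chemistry", 《China Master's Theses Full-text Database, Engineering Science and Technology I》 * |
NIE Changming et al.: 《Computational Chemistry》, 31 January 2010, Beijing Institute of Technology Press * |
HU Guixiang et al.: "QSPR study of membrane-water partition coefficients of compounds and characterization by three-dimensional molecular parameters", 《Journal of Zhejiang University (Science Edition)》 * |
DONG Jihong et al.: 《Spectral Analysis and Migration Characteristics of Heavy Metals in Reclaimed Soil of Mining Areas》, 31 May 2018, China University of Mining and Technology Press * |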
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113722988A (en) * | 2021-08-18 | 2021-11-30 | 扬州大学 | Method for predicting organic PDMS membrane-air distribution coefficient by quantitative structure-activity relationship model |
CN113722988B (en) * | 2021-08-18 | 2024-01-26 | 扬州大学 | Method for predicting organic PDMS film-air distribution coefficient by quantitative structure-activity relationship model |
CN115470702A (en) * | 2022-09-14 | 2022-12-13 | 中山大学 | Sewage treatment water quality prediction method and system based on machine learning |
CN115470702B (en) * | 2022-09-14 | 2024-06-11 | 中山大学 | Sewage treatment water quality prediction method and system based on machine learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104820873B (en) | A kind of acute reference prediction method of fresh water based on metal quantitative structure activity relationship | |
Carrascal et al. | Partial least squares regression as an alternative to current regression methods used in ecology | |
CN110534163B (en) | Method for predicting octanol/water distribution coefficient of organic compound by adopting multi-parameter linear free energy relation model | |
Shokrollahi et al. | On accurate determination of PVT properties in crude oil systems: Committee machine intelligent system modeling approach | |
Akinpelu et al. | A support vector regression model for the prediction of total polyaromatic hydrocarbons in soil: an artificial intelligent system for mapping environmental pollution | |
CN111768813A (en) | Prediction method of organic PDMS membrane-water partition coefficient based on quantitative structure-activity relationship model based on SW-SVM algorithm | |
CN103488901B (en) | Adopt the soil of Quantitative structure-activity relationship model prediction organic compound or the method for sediment sorption coefficients | |
Song et al. | An efficient global sensitivity analysis approach for distributed hydrological model | |
Laoun et al. | Global sensitivity analysis of proton exchange membrane fuel cell model | |
CN107844870B (en) | Soil heavy metal content prediction method based on Elman neural network model | |
Qiao et al. | Development of pedotransfer functions for soil hydraulic properties in the critical zone on the Loess Plateau, China | |
CN1655082A (en) | Nonlinear Fault Diagnosis Method Based on Kernel Principal Component Analysis | |
Wang et al. | Novel adaptive sample space expansion approach of NIR model for in-situ measurement of gasoline octane number in online gasoline blending processes | |
CN113722988B (en) | Method for predicting organic PDMS film-air distribution coefficient by quantitative structure-activity relationship model | |
CN111768812A (en) | A method for predicting organic PDMS membrane-water partition coefficient | |
CN110853701A (en) | Method for predicting fish biological enrichment factor of organic compound by adopting multi-parameter linear free energy relation model | |
Liu et al. | Predicting the rate constants of volatile organic compounds (VOCs) with ozone reaction at different temperatures | |
CN104376221B (en) | A kind of method for the skin permeability coefficient for predicting organic chemicals | |
CN112086141A (en) | Method for predicting PA-water distribution coefficient of organic pollutant based on quantitative structure property relation | |
CN111768814A (en) | A method for predicting the POM-water partition coefficient of organic pollutants based on quantitative structure-activity relationship | |
Ren et al. | Parameter screening and optimized gaussian process for water dew point prediction of natural gas dehydration unit | |
Hong et al. | Spatiotemporal sensitivity analysis of vertical transport of pesticides in soil | |
Yuan et al. | Combining national and state data improves predictions of microcystin concentration | |
Wang et al. | Spatial variation of catchment-oriented extreme rainfall in England and Wales | |
CN108536992B (en) | Method for predicting reduction rate constant of nitro aromatic compound |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||