CN111768813A

CN111768813A - Method for predicting the PDMS membrane-water partition coefficient of organic compounds using a quantitative structure-activity relationship model based on the SW-SVM algorithm

Info

Publication number: CN111768813A
Application number: CN202010645135.9A
Authority: CN
Inventors: 朱腾义; 陈文瑄; 程浩淼; 李懿; 王坤; 吴晶
Original assignee: Yangzhou University
Current assignee: Yangzhou University
Priority date: 2020-07-07
Filing date: 2020-07-07
Publication date: 2020-10-13

Abstract

The invention discloses a method for predicting the PDMS membrane-water partition coefficient of organic compounds. Molecular descriptors are calculated from the molecular structures of existing compounds, and a combined stepwise linear regression and support vector machine (SW-SVM) analysis is used to construct a quantitative structure-property relationship model that can rapidly and efficiently predict the K_PDMS-w value of an organic compound. The method is simple, fast, and low in cost, and saves the manpower, material, and financial resources required by experimental testing. A nonlinear model with better generalization capability is developed in the R language, and it has good goodness of fit, robustness, and predictive ability. The invention can effectively predict the PDMS membrane/water partition coefficient of organic compounds within the application domain, fills data gaps for other compounds, and provides necessary basic data for environmental compound monitoring and the application of passive sampling devices.

Description

Method for predicting the PDMS membrane-water partition coefficient of organic compounds using a quantitative structure-activity relationship model based on the SW-SVM algorithm

Technical Field

The invention relates to a method for determining the PDMS membrane/water partition coefficient of organic compounds, and in particular to a method for predicting the PDMS membrane-water partition coefficient of organic compounds using a quantitative structure-activity relationship model based on the SW-SVM algorithm.

Background

Membrane passive sampling is a widely accepted technique for measuring the freely dissolved concentrations of organic compounds and assessing their environmental exposure risk. PDMS (polydimethylsiloxane) membranes are among the most widely used passive sampling materials because of their good thermal stability and high affinity for hydrophobic compounds. The partition coefficient of an organic compound between the PDMS membrane and water (K_PDMS-w) is an important parameter for evaluating the compound's environmental behavior and is key to the successful use of passive samplers. Conventional experimental measurement is time-consuming and labor-intensive, and cannot keep up with the large and growing demand for environmental monitoring and management of organic pollutants. It is therefore particularly important to develop a simple and accurate theoretical prediction method for estimating the K_PDMS-w of organic compounds.

Quantitative structure-property relationship (QSPR) modeling is a computational approach that correlates the molecular structure of organic compounds with their physicochemical properties, environmental behavior, and toxicological parameters. It can reduce or replace related experiments, compensate for missing experimental data, and lower experimental costs. At present there are few methods for predicting the K_PDMS-w of organic compounds, nonlinear methods in particular; the compounds covered by existing models are limited in type and number, and the prediction accuracy of those models needs further improvement. Because the environmental partitioning of the various classes of organic compounds is a complex process, the partition coefficient may involve nonlinear relationships. It is therefore necessary to construct a nonlinear K_PDMS-w prediction model that covers a variety of compounds, has a clearly defined algorithm, is easy to apply and disseminate, and does not depend on experimental data, and to validate and characterize the model according to the OECD guidelines.

Given the high dimensionality of the descriptor space, selecting the most useful subset of features from the original variables for modeling is increasingly important. To select more reasonable molecular descriptors for constructing the QSPR model, stepwise linear regression is used as a screening method to reduce the number of variables. To establish a reliable QSPR model, the nonlinear algorithm used is support vector machine (SVM) regression, which is simple and easy to implement and offers good robustness and excellent generalization ability. Setting the random seed when calling the R package e1071 also makes the model reproducible.

Summary of the Invention

The purpose of the present invention is to provide a method for predicting the PDMS membrane-water partition coefficient of organic compounds using a quantitative structure-activity relationship model based on the SW-SVM algorithm. The method predicts the K_PDMS-w value of an organic compound quickly and effectively, directly from its molecular structure descriptors. This benefits the risk assessment of pollutants, helps decision makers and managers formulate standards for chemical emissions, and offers new ideas for environmental governance.

The purpose of the present invention is achieved as follows: a method for predicting the PDMS membrane-water partition coefficient of organic compounds using a quantitative structure-activity relationship model based on the SW-SVM algorithm, characterized in that it comprises the following steps:

Step 1) Data collection: log K_PDMS-w values for a number of organic compounds are collected from the literature; 1/5 of the resulting data set is selected according to the magnitude of the log K_PDMS-w values as validation set data, and the rest is used as training set data;

Step 2) Descriptor calculation: the initial molecular structures of the organic compounds are optimized using the MM2 molecular mechanics method, molecular structure descriptors are obtained with alvaDesc 1.0.0, and, after preprocessing, the final descriptors are selected by stepwise linear regression;

Step 3) Model construction: with the final descriptors as independent variables and the logarithm of the PDMS-water partition coefficient, log K_PDMS-w, as the dependent variable, a QSPR prediction model is built on the training set using the support vector machine regression algorithm; the optimal parameters are selected by k-fold cross-validation, and the QSPR model based on the optimal SW-SVM algorithm is constructed;

Step 4) Model validation: the model is validated in two parts: a) evaluation of the model's goodness of fit and robustness; b) application domain characterization and performance evaluation of the model; once validation is passed, proceed to step 5);

Step 5) Application domain characterization: the application domain of the model is characterized using a Williams plot;

Step 6) Model application: the model is used to predict the PDMS membrane/water partition coefficient of unknown compounds.

As a further limitation of the present invention, in step 1), the organic compounds include polycyclic aromatic hydrocarbons, polychlorinated biphenyls, benzenes, pesticides, ethers, dioxins, esters, aliphatics, hydrocarbons, and nitrogen and sulfur compounds.

As a further limitation of the present invention, in step 1), where several values were collected for the same substance, data that clearly deviate from the rest are removed and the average of the remaining values is used for model construction.

As a further limitation of the present invention, the organic compounds in the training set of step 1) are used to construct the model and for internal validation, and the organic compounds in the validation set are used for external validation of the model.
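The training/validation split of step 1) can be sketched as follows. This is a minimal illustration of one reading of the text (sort by log K_PDMS-w and take the first 4/5 as the training set, as the embodiment describes); the function name `split_by_log_k` and the `(name, log_k)` record layout are hypothetical and do not come from the patent.

```python
def split_by_log_k(records):
    """Split (name, log_k) records into training and validation sets.

    Assumption: records are sorted by log K_PDMS-w and the first 4/5
    (rounded down) form the training set; the rest form the validation set.
    """
    ordered = sorted(records, key=lambda rec: rec[1])
    n_train = len(ordered) * 4 // 5  # integer arithmetic avoids float rounding
    return ordered[:n_train], ordered[n_train:]


if __name__ == "__main__":
    # Synthetic placeholder data, not values from the patent.
    demo = [("compound_%d" % i, 0.01 * i) for i in range(347)]
    train, validation = split_by_log_k(demo)
    print(len(train), len(validation))
```

With 347 records this rule yields 277 training and 70 validation compounds, the set sizes reported in the embodiment. A systematic pick of every fifth compound from the sorted list would be another plausible reading of "1/5 selected according to magnitude".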

As a further limitation of the present invention, in step 2), preprocessing consists of removing descriptors that are constant, near-constant, or missing, and descriptors with a correlation greater than 0.95.
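The preprocessing filter can be sketched as below. The text does not define "near-constant" or say which member of a correlated pair is dropped, so the `near_const_frac` threshold and the greedy keep-the-first rule are assumptions made only for illustration; `filter_descriptors` is a hypothetical name.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def filter_descriptors(columns, near_const_frac=0.8, corr_limit=0.95):
    """Return the descriptor columns that survive preprocessing.

    columns: dict mapping descriptor name -> list of values (None = missing).
    Assumption: a column is "near-constant" when one value accounts for at
    least `near_const_frac` of the entries.
    """
    kept = {}
    for name, values in columns.items():
        if any(v is None for v in values):
            continue  # missing values
        most_common = max(values.count(v) for v in set(values))
        if most_common / len(values) >= near_const_frac:
            continue  # constant or near-constant
        if any(abs(pearson(values, other)) > corr_limit for other in kept.values()):
            continue  # too strongly correlated with a descriptor already kept
        kept[name] = values
    return kept
```

The surviving columns would then be passed to stepwise linear regression for the final selection, which is not shown here.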

As a further limitation of the present invention, in step 3), an R language package is used to construct the QSPR model based on the optimal SW-SVM algorithm, specifically through the following process:

Step 3-1: the whole data set is first divided into k subsets; each subset serves in turn as the test set while the remaining subsets serve as the training set, so training and testing are repeated k times and every subset is validated exactly once as the test set;

Step 3-2: the average cross-validation accuracy over the k training runs is calculated and compared, and the parameter set with the highest cross-validation accuracy is selected. This parameter set (cost, gamma) is applied to the support vector machine regression as the optimum found by k-fold cross-validation. The penalty factor cost controls the relative weight of the model's structural risk and empirical risk and thus determines how well the model performs. The gamma parameter determines the distribution of the data after mapping to the new feature space; the prediction model uses the radial basis kernel function, for which gamma is given by $g = 1/(2\sigma^2)$, where σ is the width parameter of the function and controls its radial range of action;

Step 3-3: the parameters are applied to the model to construct the optimized model.
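Steps 3-1 to 3-3 amount to a k-fold cross-validated grid search over (cost, gamma). The sketch below shows only that search loop; it does not implement a support vector machine. The `fit_predict` callable is a hypothetical placeholder for whatever model is being tuned, and the selection criterion used here (lowest mean squared error, the usual counterpart of "highest accuracy" for regression) is an assumption.

```python
from itertools import product


def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k folds; each index is tested exactly once."""
    return [list(range(start, n, k)) for start in range(k)]


def cross_validated_mse(fit_predict, X, y, k, cost, gamma):
    """Mean squared error of held-out predictions over k folds."""
    total_sq_err = 0.0
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        predictions = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
            cost,
            gamma,
        )
        total_sq_err += sum((p - y[i]) ** 2 for p, i in zip(predictions, test_idx))
    return total_sq_err / len(y)


def grid_search(fit_predict, X, y, k, costs, gammas):
    """Return the (cost, gamma) pair with the lowest cross-validated error."""
    return min(
        product(costs, gammas),
        key=lambda pair: cross_validated_mse(fit_predict, X, y, k, *pair),
    )
```

In practice `fit_predict` would wrap an SVM regression with a radial basis kernel, train it on the training folds with the given cost and gamma, and return predictions for the held-out fold.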

As a further limitation of the present invention, in step 4), the goodness-of-fit and robustness indicators used in model validation are: the degrees-of-freedom-adjusted coefficient of determination [symbol shown as an image in the original], the training set root mean square error RMSE_tra, and the training set mean absolute error MAE_tra.
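The three training-set indicators can be computed as in the sketch below. The formula for the adjusted coefficient of determination is given only as an image in the original, so the usual definition, 1 - (1 - R²)(n - 1)/(n - p - 1), is assumed here; the function names are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean square error."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error."""
    n = len(observed)
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / n


def adjusted_r2(observed, predicted, n_descriptors):
    """Adjusted R^2, assuming the standard degrees-of-freedom correction."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_descriptors - 1)
```

Lower RMSE and MAE and a higher adjusted R² indicate a closer fit to the training data.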

As a further limitation of the present invention, step 5) specifically comprises: characterizing the application domain of the model with a Williams plot of the standardized residual δ against the leverage value h_i. When the absolute value of δ is greater than 3.0, the compound is an outlier; when the leverage value h_i is greater than the warning value h*, the structure of the compound differs significantly from those of the other compounds. h_i and h* are calculated by the following formulas:

$$h_i = x_i^T (X^T X)^{-1} x_i$$

$$h^* = \frac{3(p+1)}{n}$$

where x_i is the descriptor vector of the i-th compound; x_i^T is the transpose of x_i; X is the descriptor matrix of all compounds; X^T is the transpose of X; (X^T X)^{-1} is the inverse of the matrix X^T X; p is the number of variables in the model; and n is the number of data points in the data set.
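The two formulas can be evaluated without any numerical library, as in the sketch below. Whether X carries an intercept column or is centered is not stated in the text, so the code takes X exactly as given; `solve`, `leverage`, and `warning_leverage` are hypothetical helper names.

```python
def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def leverage(X, x):
    """h = x^T (X^T X)^-1 x for descriptor vector x and descriptor matrix X."""
    p = len(x)
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    z = solve(XtX, x)  # z = (X^T X)^-1 x, without forming the inverse explicitly
    return sum(xi * zi for xi, zi in zip(x, z))


def warning_leverage(p, n):
    """h* = 3(p + 1) / n."""
    return 3 * (p + 1) / n
```

A compound whose leverage exceeds `warning_leverage(p, n)` would fall to the right of the vertical warning line on the Williams plot.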

Compared with the prior art, the beneficial effects of the present invention are:

1. Following the OECD guidelines on the construction and use of QSPR models, the established QSPR model has good goodness of fit, robustness, and predictive ability;

2. The model has a broad application domain covering organic compounds of many structural types; it can be used to predict the K_PDMS-w values of different compounds and provides basic data for global environmental behavior analysis and ecological risk assessment of organic compounds;

3. The model is entirely computational; unlike earlier methods that rely on experimental values, it greatly reduces experimental costs and obtains the K_PDMS-w values of chemicals more efficiently;

4. The present invention can quickly and effectively predict the PDMS membrane/water partition coefficients of a variety of organic compounds. The method is inexpensive, simple, and fast, and saves considerable manpower, material, and financial resources. The QSPR model of the invention is established and validated strictly in accordance with the OECD guidelines for the construction and use of QSPR models and is accurate and reliable; it can effectively provide the K_PDMS-w values of substances, supply important basic data for chemical regulation, and guide ecological risk assessment.

Description of the Drawings

FIG. 1 is a flow chart of the prediction method of the present invention.

FIG. 2 is a plot of experimental versus predicted log K_PDMS-w values for the data set of the present invention.

FIG. 3 is the Williams plot of the present invention.

Detailed Description

As shown in FIG. 1, a method for predicting the PDMS membrane-water partition coefficient of organic compounds using a quantitative structure-activity relationship model based on the SW-SVM algorithm comprises the following steps.

Step 1) log K_PDMS-w values for 347 organic compounds are collected from the literature. Where several values exist for the same substance, data that clearly deviate from the rest are removed and the average of the remaining values is used for model construction. The organic compounds include polycyclic aromatic hydrocarbons, polychlorinated biphenyls, benzenes, pesticides, ethers, dioxins, esters, aliphatics, hydrocarbons, and nitrogen and sulfur compounds. The data set is sorted by log K_PDMS-w value; the first 4/5 of the organic compounds are taken as training set data and the remaining compounds as validation set data, giving 277 compounds in the training set and 70 in the validation set. The organic compounds in the training set are used to construct the model and for internal validation, and those in the validation set are used for external validation of the model.

Step 2) The initial molecular structures of the organic compounds are optimized using the MM2 molecular mechanics method, and molecular structure descriptors are obtained with alvaDesc. After descriptors that are constant, near-constant, or missing, and descriptors with a correlation greater than 0.95, have been removed, the final descriptors are selected by stepwise linear regression.

Step 3) With the selected final descriptors as independent variables and the logarithm of the PDMS/water partition coefficient, log K_PDMS-w, as the dependent variable, the support vector machine regression algorithm of the R package e1071 is called to build the QSPR prediction model, and the optimal parameters (cost, gamma) are selected by k-fold cross-validation. The QSPR model based on the optimal SW-SVM algorithm is constructed with the R package e1071 as follows:

Step 3-1: the whole data set is first divided into k subsets; each subset serves in turn as the test set while the remaining subsets serve as the training set, so training and testing are repeated k times and every subset is validated exactly once as the test set;

Step 3-2: the average cross-validation accuracy over the k training runs is calculated and compared, and the parameter set with the highest cross-validation accuracy is selected. This parameter set (cost, gamma) is applied to the support vector machine regression as the optimum found by k-fold cross-validation. The penalty factor cost controls the relative weight of the model's structural risk and empirical risk and thus determines how well the model performs. The gamma parameter determines the distribution of the data after mapping to the new feature space; the prediction model uses the radial basis kernel function, for which gamma is given by $g = 1/(2\sigma^2)$, where σ is the width parameter of the function and controls its radial range of action;

Step 4) Validation and characterization of the model are carried out in two parts: 1) evaluation of the model's goodness of fit and robustness; 2) application domain characterization and performance evaluation of the model;

The fitting ability of the model is characterized by the degrees-of-freedom-adjusted coefficient of determination, the training set root mean square error RMSE_tra, and the training set mean absolute error MAE_tra. In this embodiment, the adjusted coefficient of determination is [value shown as an image in the original], RMSE_tra = 0.457, and MAE_tra = 0.329; the smaller the error values, the better the fit. [Statistic shown as an image in the original] indicates that the model has good goodness of fit and robustness. External validation uses the fitting coefficient between predicted and measured values [symbol shown as an image in the original] and the concordance correlation coefficient CCC to express the external predictive ability of the model. Acceptance criteria: R² > 0.7, Q² > 0.6, R² - Q² < 0.3, CCC > 0.85. In this embodiment, the final model has [value shown as an image in the original] and CCC = 0.930, indicating that the model has good external predictive ability. FIG. 2 shows the fit of the model and the validation results;
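The acceptance criteria can be checked mechanically, as sketched below. The text does not give a formula for CCC, so Lin's concordance correlation coefficient with population (1/n) variances is assumed; `concordance_cc` and `passes_criteria` are hypothetical names.

```python
def concordance_cc(observed, predicted):
    """Concordance correlation coefficient (Lin's definition, assumed here)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted)) / n
    return 2 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)


def passes_criteria(r2, q2, ccc):
    """Thresholds as stated: R2 > 0.7, Q2 > 0.6, R2 - Q2 < 0.3, CCC > 0.85."""
    return r2 > 0.7 and q2 > 0.6 and (r2 - q2) < 0.3 and ccc > 0.85
```

Unlike a plain correlation coefficient, CCC also penalizes a systematic offset between predicted and measured values, which is why it is often reported for external validation.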

Step 5) The application domain of the model is characterized with a Williams plot of the standardized residual δ against the leverage value h_i. Specifically, a compound is generally regarded as an outlier when the absolute value of δ is greater than 3.0; when the leverage value h_i is greater than the warning value h*, the structure of the compound differs significantly from those of the other compounds. h_i and h* are calculated by the following formulas:

$$h_i = x_i^T (X^T X)^{-1} x_i$$

$$h^* = \frac{3(p+1)}{n}$$

where x_i is the descriptor vector of the i-th compound; x_i^T is the transpose of x_i; X is the descriptor matrix of all compounds; X^T is the transpose of X; (X^T X)^{-1} is the inverse of the matrix X^T X; p is the number of variables in the model; and n is the number of data points in the data set. As shown in FIG. 3, h* of the model is 0.054, so the model is suitable for predicting log K_PDMS-w for compounds with h_i less than 0.054.
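As a consistency check, the stated warning value follows from the formula if p = 4 (the four descriptors named in the examples) and n = 277 (the training set size) are assumed; the text gives both numbers, but not alongside the formula:

$$h^* = \frac{3(p+1)}{n} = \frac{3(4+1)}{277} = \frac{15}{277} \approx 0.054$$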

Step 6) Model application: the model is used to predict the PDMS membrane/water partition coefficient of unknown compounds.

Example 1: predict the log K_PDMS-w value of the compound 3-chlorophenol. First, the molecular structure of 3-chlorophenol is optimized using the MM2 molecular mechanics method. Then, based on the optimized structure, the values of its four molecular descriptors BLTD48, Hy, SpMaxA_B(s), and SpMaxA_AEA(dm) are calculated with alvaDesc 1.0.0, giving -3.341, -0.039, 0.789, and 0.374, respectively. The calculated h_i value of the substance is 0.0353 < 0.054, so the compound lies within the application domain of the model. Substituting the descriptor values into the model gives a predicted log K_PDMS-w of 0.22, close to the experimental value of 0.31.

Example 2: predict the log K_PDMS-w value of the compound pentachlorophenol. First, the molecular structure of pentachlorophenol is optimized using the MM2 molecular mechanics method. Then, based on the optimized structure, the values of its four molecular descriptors BLTD48, Hy, SpMaxA_B(s), and SpMaxA_AEA(dm) are calculated with alvaDesc 1.0.0, giving -5.033, 0.078, 0.526, and 0.313, respectively. The calculated h_i value of the substance is 0.0178 < 0.054, so the compound lies within the application domain of the model. Substituting the descriptor values into the model gives a predicted log K_PDMS-w of 2.68, close to the experimental value of 2.65.
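The application-domain check in the two examples can be restated as a short script. It uses only the numbers quoted above (h_i, predicted and experimental log K_PDMS-w, and h* = 0.054); the trained model itself is not reproduced, and the names `H_STAR`, `EXAMPLES`, and `summarize` are hypothetical.

```python
H_STAR = 0.054  # warning leverage quoted in the text

# Values copied from Example 1 and Example 2.
EXAMPLES = {
    "3-chlorophenol": {"h": 0.0353, "predicted": 0.22, "experimental": 0.31},
    "pentachlorophenol": {"h": 0.0178, "predicted": 2.68, "experimental": 2.65},
}


def summarize(name, record, h_star=H_STAR):
    """Report whether a compound is inside the domain and its absolute error."""
    return {
        "compound": name,
        "in_domain": record["h"] < h_star,
        "abs_error": round(abs(record["predicted"] - record["experimental"]), 2),
    }


report = [summarize(name, record) for name, record in EXAMPLES.items()]
for row in report:
    print(row)
```

Both compounds fall inside the application domain; a full check would also require the standardized residual to satisfy |δ| ≤ 3.0, which needs the residual scale of the fitted model.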

The present invention is not limited to the above embodiments. On the basis of the technical solutions disclosed herein, those skilled in the art can, without creative effort, make substitutions and modifications to some of the technical features according to the disclosed technical content, and all such substitutions and modifications fall within the protection scope of the present invention.

Claims

1. A method for predicting the PDMS membrane-water partition coefficient of organic compounds using a quantitative structure-activity relationship model based on the SW-SVM algorithm, characterized by comprising the following steps:

step 1) data collection: log K_PDMS-w values for a number of organic compounds are collected from the literature; 1/5 of the resulting data set is selected according to the magnitude of the log K_PDMS-w values as validation set data, and the rest is used as training set data;

step 2) descriptor calculation: optimizing the initial molecular structures of the organic compounds using the MM2 molecular mechanics method, obtaining molecular structure descriptors of the organic compounds with alvaDesc 1.0.0, and, after preprocessing, selecting the final descriptors by stepwise linear regression;

step 3) model construction: with the final descriptors as independent variables and the logarithm of the PDMS-water partition coefficient, log K_PDMS-w, as the dependent variable, building a QSPR prediction model on the training set using the support vector machine regression algorithm, selecting the optimal parameters by k-fold cross-validation, and constructing the QSPR model based on the optimal SW-SVM algorithm;

step 4) model validation: the model is validated in two parts: a) evaluating the goodness of fit and robustness of the model; b) application domain characterization and performance evaluation of the model; proceeding to step 5) once validation is passed;

step 5) application domain characterization: characterizing the application domain of the model using a Williams plot;

step 6) model application: using the model to predict the PDMS membrane/water partition coefficient of unknown compounds.

2. The prediction method according to claim 1, wherein in step 1), the organic compounds include polycyclic aromatic hydrocarbons, polychlorinated biphenyls, benzenes, pesticides, ethers, dioxins, esters, aliphatics, hydrocarbons, and nitrogen and sulfur compounds.

3. The prediction method according to claim 1, wherein in step 1), where several values were collected for the same substance, data that clearly deviate from the rest are removed and the average of the remaining values is used for model construction.

4. The prediction method according to claim 1, wherein the organic compounds in the training set of step 1) are used to construct the model and for internal validation, and the organic compounds in the validation set are used for external validation of the model.

5. The prediction method according to claim 1, wherein in step 2), preprocessing consists of removing descriptors that are constant, near-constant, or missing, and descriptors with a correlation greater than 0.95.

6. The prediction method according to claim 1, wherein in step 3), the QSPR model based on the optimal SW-SVM algorithm is constructed using an R language package, specifically through the following process:

step 3-1: the whole data set is first divided into k subsets; each subset serves in turn as the test set while the remaining subsets serve as the training set, and training and testing are repeated k times so that every subset is validated exactly once as the test set;

step 3-2: the average cross-validation accuracy over the k training runs is calculated and compared, and the parameter set with the highest cross-validation accuracy is selected; this parameter set (cost, gamma) is applied to the support vector machine regression as the optimum found by k-fold cross-validation, wherein the penalty factor cost controls the relative weight of the model's structural risk and empirical risk and thus determines how well the model performs, the gamma parameter determines the distribution of the data after mapping to the new feature space, and the prediction model uses the radial basis kernel function, for which gamma is given by $g = 1/(2\sigma^2)$, where σ is the width parameter of the function and controls its radial range of action;

step 3-3: the parameters are applied to the model to construct the optimized model.

7. The prediction method according to claim 1, wherein in step 4), the goodness-of-fit and robustness indicators used in model validation are: the degrees-of-freedom-adjusted coefficient of determination [symbol shown as an image in the original], the training set root mean square error RMSE_tra, and the training set mean absolute error MAE_tra.

8. The prediction method according to claim 1, wherein step 5) specifically comprises: characterizing the application domain of the model with a Williams plot of the standardized residual δ against the leverage value h_i; when the absolute value of δ is greater than 3.0, the compound is an outlier; when the leverage value h_i is greater than the warning value h*, the structure of the compound differs significantly from those of the other compounds; h_i and h* are calculated by the following formulas:

$$h_i = x_i^T (X^T X)^{-1} x_i$$

$$h^* = \frac{3(p+1)}{n}$$

wherein x_i is the descriptor vector of the i-th compound; x_i^T is the transpose of x_i; X is the descriptor matrix of all compounds; X^T is the transpose of X; (X^T X)^{-1} is the inverse of the matrix X^T X; p is the number of variables in the model; and n is the number of data points in the data set.