CN112086141A

CN112086141A - Method for predicting PA-water distribution coefficient of organic pollutant based on quantitative structure property relation

Info

Publication number: CN112086141A
Application number: CN202010939877.2A
Authority: CN
Inventors: 朱腾义; 陶翠翠; 严和婷; 王坤; 孙凤
Original assignee: Yangzhou University
Current assignee: Yangzhou University
Priority date: 2020-09-09
Filing date: 2020-09-09
Publication date: 2020-12-15

Abstract

The invention discloses a method for predicting the PA-water partition coefficient of organic pollutants based on a quantitative structure-property relationship (QSPR). Molecular descriptors are calculated from the molecular structures of known compounds, and a QSPR model is constructed by stepwise multiple linear regression analysis, so that the K_PA-w value of an organic compound can be predicted quickly and efficiently. The method is simple, rapid and low in cost, and saves the manpower, materials and funds required for experimental testing. It can effectively predict the PA membrane-water partition coefficient of organic compounds within the application domain, fills data gaps for other compounds, and provides necessary basic data for monitoring environmental compounds and for applying passive samplers.

Description

Method for predicting PA-water distribution coefficient of organic pollutant based on quantitative structure property relation

Technical Field

The invention relates to a PA-water distribution coefficient prediction method, in particular to a method for predicting PA-water distribution coefficients of organic pollutants based on quantitative structure property relation.

Background

Membrane passive sampling is now widely used to measure the freely dissolved concentration of organic compounds and to evaluate their environmental exposure risk. PA (polyacrylate) is a polar polymer containing heteroatoms, is more suitable for extracting hydrophobic organic compounds, and is widely used in passive sampling. The partition coefficient of an organic compound between the PA membrane and water (K_PA-w) is an important parameter for evaluating the environmental behavior of the compound, and is also an important index for measuring and optimizing the performance of passive samplers. Conventional experimental determination is time-consuming and laborious, gives large errors for unstable substances, and cannot meet the environmental monitoring demand created by the large and growing number of organic pollutants. A simpler, faster and more effective method for predicting the partition coefficient is therefore urgently needed.

The quantitative structure-property relationship (QSPR) is a computer modeling method that relates the molecular structure of an organic compound to its physicochemical properties, environmental behavior and toxicological parameters. It can fill gaps in environmental behavior and ecotoxicological data, greatly reduces experimental cost, and helps reduce or replace related experiments. In 2004 the OECD proposed criteria for the construction and use of QSPR models, stating that a QSPR model can be applied to the risk assessment and management of chemicals if it has: (1) a well-defined endpoint; (2) a clear and transparent algorithm that supports mechanistic interpretation; (3) a defined application domain; (4) appropriate goodness of fit, robustness and predictive ability. These criteria set the direction for the development of QSPR models.

At present, many studies on the simulation and prediction of the K_PA-w values of organic compounds have been reported, mostly abroad, with relatively few domestic studies. The literature "Toxicol. Mech. Method., 2005, 4(15), 307-" established a relationship model between the octanol-water partition coefficient (K_ow) and log K_PA-w with a very high correlation coefficient (R² = 1), but the relationship applies to only 3 chemical substances, so the compounds studied are of a single type and the application range is limited. The literature "Environ. Sci. Technol., 2017, 5(51), 3001-" established a relationship between log K_PA-w and the gas-stationary phase partition coefficients L₁ and L₂ with a high correlation coefficient (R² = 0.94), but the relationship applies to only 14 chemical substances, so the compounds studied are again of a single type and the application range is limited.

Most models established in existing studies cover compounds of a single type, so their application domain is narrow, and necessary model characterization parameters are lacking. With the growing number of emerging pollutants, it is therefore necessary to develop a simple, rapid and efficient QSPR model for predicting the K_PA-w of organic compounds.

Disclosure of Invention

The invention aims to provide a method for predicting the PA-water partition coefficient of organic pollutants based on a quantitative structure-property relationship, which can rapidly and effectively predict the K_PA-w value of an organic pollutant from the molecular structure descriptors of the compound.

The purpose of the invention is realized as follows: a method for predicting PA-water distribution coefficient of organic pollutants based on quantitative structural property relation comprises the following steps:

1) Data collection: log K_PA-w values of 198 organic compounds are collected from the literature to obtain a data set;

2) Descriptor calculation: before the molecular structure descriptors are calculated, the initial molecular structure of each organic compound is generated in software and then optimized with the MM2 molecular mechanics method; the molecular structure descriptors of each compound are then calculated in software, and the descriptors are preprocessed;

3) Model construction: the data set is divided into a training set and a test set at a ratio of 4:1, and stepwise multiple linear regression analysis is carried out in SPSS software; the optimal model, selected on the principle of fewer molecular descriptors and a higher adjusted coefficient of determination (R²_adj) and external validation coefficient (Q²_ext), is:

log K_PA-w = 0.636 CrippenLogP - 1.274 RNCG - 12.849 VE2_Dzv - 0.000439 ATSC4v + 1.823;

wherein CrippenLogP is the Crippen octanol-water partition coefficient; RNCG is the relative negative charge; VE2_Dzv is the average coefficient sum of the last eigenvector of the Barysz matrix weighted by van der Waals volume; ATSC4v is a 2D autocorrelation descriptor weighted by van der Waals volume;

4) Model validation: the model is validated, and the method proceeds to step 5) once the model passes validation;

5) Application domain characterization: the application domain of the model is characterized by a Williams plot;

6) Model application: the model is used to predict the PA-water partition coefficients of unknown compounds.
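As a minimal sketch of how the regression equation of step 3) might be applied in step 6), the following Python function evaluates the model for one compound. The function name `predict_log_kpaw` and the dictionary layout are illustrative assumptions, not part of the disclosed method; the coefficients are copied from the equation in step 3), and the descriptor values are assumed to come from the descriptor software of step 2).

```python
# Coefficients of the four-descriptor model, copied from the equation in step 3).
COEFFICIENTS = {
    "CrippenLogP": 0.636,
    "RNCG": -1.274,
    "VE2_Dzv": -12.849,
    "ATSC4v": -0.000439,
}
INTERCEPT = 1.823


def predict_log_kpaw(descriptors):
    """Return the predicted log K_PA-w for one compound.

    `descriptors` maps each descriptor name to its numeric value
    (hypothetical input layout chosen for this sketch).
    """
    missing = [name for name in COEFFICIENTS if name not in descriptors]
    if missing:
        raise KeyError(f"missing descriptors: {missing}")
    return INTERCEPT + sum(
        coef * descriptors[name] for name, coef in COEFFICIENTS.items()
    )
```

Called with the anthracene descriptor values given in Example 1 (3.993, 0.098, 0.000, -481.522), the function returns about 4.45, close to the 4.46 reported there.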

As a further limitation of the present invention, the organic compounds in step 1) include hydrocarbons, halogenated hydrocarbons, oxygen-containing compounds, nitrogen- and sulfur-containing compounds, and pesticides.

As a further limitation of the invention, the preprocessing in step 2) includes removing descriptors that are constant, nearly constant, or have missing values, as well as descriptors with a correlation greater than 0.95.
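The preprocessing rule above can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the text does not define "near constant" or say which member of a correlated pair is dropped, so the threshold `near_constant_fraction` and the keep-the-first-column policy below are hypothetical choices, as is the function name `filter_descriptors`.

```python
import numpy as np


def filter_descriptors(X, names, near_constant_fraction=0.95, max_abs_corr=0.95):
    """Drop descriptor columns that are missing, (near-)constant, or highly correlated.

    X is an (n_compounds, n_descriptors) array; `names` labels its columns.
    Assumption: a column counts as near constant when a single value
    accounts for at least `near_constant_fraction` of the compounds.
    """
    X = np.asarray(X, dtype=float)
    keep = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if np.isnan(col).any():  # descriptor has missing values
            continue
        _, counts = np.unique(col, return_counts=True)
        if counts.max() / col.size >= near_constant_fraction:  # constant or near constant
            continue
        # Assumption: of two correlated descriptors, the one seen first is kept.
        if all(abs(np.corrcoef(col, X[:, k])[0, 1]) <= max_abs_corr for k in keep):
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]
```

Constant columns are removed before any correlation is computed, so `np.corrcoef` is never evaluated on a zero-variance column.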

As a further limitation of the present invention, in the validation of step 3), the bootstrap cross-validation coefficient Q²_BOOT and the leave-one-out cross-validation coefficient Q²_LOO are used to verify the robustness of the model; for external validation, the coefficient of determination between predicted and measured values (R²_ext) and the test set root mean square error RMSE_ext represent the external predictive ability of the model.
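The text names these validation statistics without giving their formulas. The sketch below assumes the commonly used definitions, Q²_LOO = 1 - PRESS / SS_total and RMSE as the root of the mean squared residual, and uses ordinary least squares from NumPy in place of the SPSS regression; the helper names are hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients for y ~ intercept + X (X is 2-D)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def predict_ols(beta, X):
    """Predictions of the fitted linear model for the rows of X."""
    return np.column_stack([np.ones(len(X)), X]) @ beta


def q2_loo(X, y):
    """Leave-one-out Q^2: refit without compound i, then predict compound i."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    press = 0.0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        beta = fit_ols(X[mask], y[mask])
        press += (y[i] - predict_ols(beta, X[i:i + 1])[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

RMSE_ext would then be `rmse` applied to the test set predictions of a model fitted on the training set only.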

As a further limitation of the present invention, in step 1), for compounds with several reported values, data deviating significantly from the overall values are removed and the remaining values are averaged to build the data set; in step 3), the data in the training set are used for model building and internal validation, and the data in the test set are used for external validation and performance evaluation of the model.
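A minimal sketch of the data set construction and the 4:1 split follows. It assumes that deviating values have already been removed by hand, since the text gives no numerical criterion, and that the split is random with a fixed seed, which the text does not state; `build_dataset` and `split_4_to_1` are hypothetical names.

```python
import random
from collections import defaultdict
from statistics import mean


def build_dataset(records):
    """Average replicate log K_PA-w values per compound.

    `records` is an iterable of (compound_name, log_k) pairs from which
    clearly deviating values have already been removed.
    """
    grouped = defaultdict(list)
    for name, log_k in records:
        grouped[name].append(log_k)
    return {name: mean(values) for name, values in grouped.items()}


def split_4_to_1(names, seed=0):
    """Randomly split compound names into training and test sets at 4:1."""
    shuffled = sorted(names)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * 4 / 5)
    return shuffled[:n_train], shuffled[n_train:]
```

For 198 compounds this split yields 158 training and 40 test compounds, matching the set sizes reported in the detailed description.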

As a further limitation of the present invention, step 5) specifically comprises: characterizing the application domain of the model with a Williams plot of the standardized residual δ against the leverage value h_i; when |δ| of a compound is greater than 3.0, the compound is an outlier, and when its leverage value h_i is greater than the warning value h*, the structure of the compound differs markedly from those of the other compounds; h_i and h* are calculated by the following formulas:

h_i = x_i^T (X^T X)^-1 x_i

h* = 3(p+1)/n

wherein x_i is the descriptor vector of the i-th compound; x_i^T is the transpose of x_i; X is the descriptor matrix of all compounds; X^T is the transpose of X; (X^T X)^-1 is the inverse of the matrix X^T X; p is the number of variables in the model; n is the number of training set samples.
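The two formulas can be sketched with NumPy as follows. The sketch takes X literally as the training set descriptor matrix with one row per compound and no added intercept column, which is an assumption about how the matrix is laid out; the function names are illustrative.

```python
import numpy as np


def leverages(X_train, X_query):
    """h_i = x_i^T (X^T X)^-1 x_i for each row x_i of X_query."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    xtx_inv = np.linalg.inv(X_train.T @ X_train)
    # Row-wise quadratic form x_i^T (X^T X)^-1 x_i.
    return np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)


def warning_leverage(p, n):
    """h* = 3(p + 1) / n, with p model variables and n training compounds."""
    return 3.0 * (p + 1) / n
```

A new compound is checked by passing its descriptor row as `X_query` and comparing the result with `warning_leverage(p, n)`. With p = 4 and n = 158 the formula evaluates to about 0.095, whereas the detailed description quotes h* = 0.090.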

Compared with the prior art, the invention has the following beneficial effects. The invention uses a simple and transparent stepwise multiple linear regression algorithm to construct a QSPR prediction model; the model covers organic compounds with various structures, has good goodness of fit, robustness and predictive ability, and provides an efficient tool for predicting the log K_PA-w values of organic compounds within the application domain. The method is low in cost, simple and rapid, and saves a large amount of the manpower, materials and funds required for experimental testing. The establishment and validation of the K_PA-w prediction method strictly follow the OECD guidelines for the development and use of QSPR models; K_PA-w values of organic compounds predicted with the model established by the invention are therefore highly reliable, provide important basic data for chemical supervision, and offer important guidance for ecological risk assessment. The invention also has the following features:

1. a QSPR model with a transparent algorithm is established in accordance with the OECD guidelines for the construction and use of QSPR models, and is easy to interpret mechanistically;

2. the model has appropriate goodness of fit, robustness and predictive ability;

3. the model has a wide application range and covers organic compounds with various structures; it can be used to predict the K_PA-w values of different compounds, providing basic data for the analysis of the global environmental behavior and the ecological risk assessment of organic compounds;

4. the model is purely computational, so it can fill gaps in environmental behavior and ecotoxicological data for organic compounds, greatly reduces experimental cost, helps reduce or replace related experiments, and gives more efficient access to the K_PA-w values of chemicals.

Drawings

FIG. 1 is a plot of measured versus predicted log K_PA-w values in the present invention.

FIG. 2 is a Williams plot characterizing the application domain of the model in the present invention.

Detailed Description

A method for predicting PA-water distribution coefficient of organic pollutants based on quantitative structural property relation comprises the following steps:

1) Data collection: log K_PA-w values of 198 organic compounds are collected from the literature; for the same substance, data deviating significantly from the overall values are removed and the mean of the remaining values is used for model construction; the organic compounds include hydrocarbons, halogenated hydrocarbons, oxygen-containing compounds, nitrogen- and sulfur-containing compounds, pesticides and other compounds; the data set is split into a training set and a test set at a ratio of 4:1;

2) Descriptor calculation: before the molecular structure descriptors are calculated, the initial molecular structure of each organic compound is generated with ChemBio3D Ultra 12.0 software and then optimized with the MM2 (molecular mechanics) method; the molecular structure descriptors of each compound are then calculated in PaDEL-Descriptor software, and descriptors that are constant, nearly constant, have missing values, or have a correlation greater than 0.95 are removed;

3) Model construction: stepwise multiple linear regression analysis is carried out in SPSS software; the optimal model, selected on the principle of fewer molecular descriptors and a higher adjusted coefficient of determination (R²_adj) and external validation coefficient (Q²_ext), is:

log K_PA-w = 0.636 CrippenLogP - 1.274 RNCG - 12.849 VE2_Dzv - 0.000439 ATSC4v + 1.823; (1)

wherein CrippenLogP is the Crippen octanol-water partition coefficient; RNCG is the relative negative charge (the charge of the most negative atom divided by the total negative charge); VE2_Dzv is the average coefficient sum of the last eigenvector of the Barysz matrix weighted by van der Waals volume; ATSC4v is the centered Broto-Moreau autocorrelation descriptor of lag 4 weighted by van der Waals volume;

4) Model validation: the model is validated, and the method proceeds to step 5) once the model passes validation; the specific parameters are as follows:

n_tra = 158, R²_adj = 0.898, Q²_LOO = 0.858, Q²_BOOT = 0.793, RMSE_tra = 0.162, p < 0.001;

n_ext = 40, R²_ext = 0.797, Q²_ext = 0.741, RMSE_ext = 0.586;

wherein n_tra and n_ext are the numbers of compounds in the training set and the test set, respectively; R²_adj is the adjusted coefficient of determination; Q²_LOO is the leave-one-out cross-validation coefficient; Q²_BOOT is the bootstrap cross-validation coefficient; RMSE_tra and RMSE_ext are the root mean square errors of the training set and the test set, respectively; R²_ext is the coefficient of determination of the test set; Q²_ext is the external validation coefficient;

the adjusted coefficient of determination R²_adj = 0.898 and the training set root mean square error RMSE_tra = 0.162 indicate that the model has good fitting ability; the leave-one-out cross-validation coefficient Q²_LOO = 0.858 and the bootstrap cross-validation coefficient Q²_BOOT = 0.793 indicate that the model is robust; the external validation coefficient Q²_ext = 0.741 and the test set root mean square error RMSE_ext = 0.586 indicate that the model has good external predictive ability; the goodness of fit and the validation results of the model are shown in FIG. 1;

5) Application domain characterization: the application domain of the model is characterized by a Williams plot (FIG. 2).

The standardized residual is calculated as follows:

δ_i = (y_i - ŷ_i) / sqrt( Σ_{i=1}^{n} (y_i - ŷ_i)² / (n - A - 1) ) (2)

where δ_i is the standardized residual, y_i and ŷ_i are the experimental value and the predicted value of the i-th compound, respectively, n is the number of compounds in the data set, and A is the number of descriptors;

The leverage value (h_i) and the warning leverage value (h*) are calculated by the following formulas:

h_i = x_i^T (X^T X)^-1 x_i (3)

h* = 3(p+1)/n (4)

wherein x_i is the descriptor vector of the i-th compound; x_i^T is the transpose of x_i; X is the descriptor matrix of all compounds; X^T is the transpose of X; (X^T X)^-1 is the inverse of the matrix X^T X; p is the number of variables in the model; n is the number of training set samples.

δ and h are calculated for the model and plotted. When |δ| of a compound is greater than 3.0, the compound is considered a model outlier. When h of a compound is greater than h*, the structure of the compound differs markedly from those of the other compounds; here h* is 0.090, and the model is suitable for predicting the log K_PA-w values of compounds with h_i less than 0.090.

6) Model application: the model is used to predict the PA-water partition coefficients of unknown compounds. The effect of the model is further illustrated by the following examples.
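The application domain check of step 5) can be sketched as below. The sketch assumes the usual standardized residual, that is, the residual divided by the square root of the residual sum of squares over n - A - 1, together with the |δ| > 3.0 and h > h* rules stated in the text; the function names and the three labels are hypothetical.

```python
import numpy as np


def standardized_residuals(y_obs, y_pred, n_descriptors):
    """delta_i = (y_i - yhat_i) / sqrt(sum((y - yhat)^2) / (n - A - 1))."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_obs - y_pred
    dof = len(y_obs) - n_descriptors - 1  # n - A - 1
    return resid / np.sqrt(np.sum(resid ** 2) / dof)


def williams_flags(delta, h, h_star, delta_limit=3.0):
    """Label each compound by its position in the Williams plot."""
    flags = []
    for d, lev in zip(delta, h):
        if abs(d) > delta_limit:
            flags.append("outlier")          # response outlier
        elif lev > h_star:
            flags.append("high leverage")    # structurally distinct compound
        else:
            flags.append("in domain")
    return flags
```

Plotting `delta` against `h` with horizontal lines at ±3.0 and a vertical line at h* gives a plot of the kind shown in FIG. 2.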

Example 1

Given the compound anthracene, its log K_PA-w value is predicted. The molecular structure of anthracene is first optimized by the MM2 molecular mechanics method, and the values of the four molecular structure descriptors CrippenLogP, RNCG, VE2_Dzv and ATSC4v are then calculated in PaDEL-Descriptor software as 3.993, 0.098, 0.000 and -481.522, respectively. According to the leverage formula (3), the h_i value of the compound is 0.020 (< 0.090), so the compound is within the application domain of the model. Substituting the descriptor values into the model gives a predicted log K_PA-w of 4.46; the experimental value is 4.52, so the predicted value is very close to the experimental value.
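The substitution for anthracene can be written out term by term. Using the coefficients of formula (1) and the four descriptor values quoted above, the arithmetic gives about 4.45; the small difference from the reported 4.46 is presumably due to rounding of the published coefficients and descriptor values.

```latex
\begin{aligned}
\log K_{\mathrm{PA-w}} &= 0.636(3.993) - 1.274(0.098) - 12.849(0.000) - 0.000439(-481.522) + 1.823 \\
&\approx 2.540 - 0.125 - 0.000 + 0.211 + 1.823 \\
&\approx 4.45
\end{aligned}
```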

Example 2

Given the compound nitrobenzene, its log K_PA-w value is predicted. The molecular structure of nitrobenzene is first optimized by the MM2 molecular mechanics method, and the values of the four molecular structure descriptors CrippenLogP, RNCG, VE2_Dzv and ATSC4v are then calculated in PaDEL-Descriptor software as 1.454, 0.255, 0.039 and -12.664, respectively. According to the leverage formula (3), the h_i value of the compound is 0.060 (< 0.090), so the compound is within the application domain of the model. Substituting the descriptor values into the model gives a predicted log K_PA-w of 1.92; the experimental value is 1.98, so the predicted value is very close to the experimental value.

Example 3

Given the compound dichlorvos, its log K_PA-w value is predicted. The molecular structure of dichlorvos is first optimized by the MM2 molecular mechanics method, and the values of the four molecular structure descriptors CrippenLogP, RNCG, VE2_Dzv and ATSC4v are then calculated in PaDEL-Descriptor software as 1.63, 0.228, 0.007 and 186.337, respectively. According to the leverage formula (3), the h_i value of the compound is 0.008 (< 0.090), so the compound is within the application domain of the model. Substituting the descriptor values into the model gives a predicted log K_PA-w of 2.47; the experimental value is 2.48, so the predicted value is very close to the experimental value.

Example 4

Given the compound 1-iodooctane, its log K_PA-w value is predicted. The molecular structure of 1-iodooctane is first optimized by the MM2 molecular mechanics method, and the values of the four molecular structure descriptors CrippenLogP, RNCG, VE2_Dzv and ATSC4v are then calculated in PaDEL-Descriptor software as 3.782, 0.183, 0.004 and 28.912, respectively. According to the leverage formula (3), the h_i value of the compound is 0.014 (< 0.090), so the compound is within the application domain of the model. Substituting the descriptor values into the model gives a predicted log K_PA-w of 3.92; the experimental value is 3.90, so the predicted value is very close to the experimental value.

Example 5

Given the compound cyclohexylbenzene, its log K_PA-w value is predicted. The molecular structure of cyclohexylbenzene is first optimized by the MM2 molecular mechanics method, and the values of the four molecular structure descriptors CrippenLogP, RNCG, VE2_Dzv and ATSC4v are then calculated in PaDEL-Descriptor software as 3.734, 0.000, 0.028 and -174.589, respectively. According to the leverage formula (3), the h_i value of the compound is 0.028 (< 0.090), so the compound is within the application domain of the model. Substituting the descriptor values into the model gives a predicted log K_PA-w of 4.08; the experimental value is 4.15, so the predicted value is very close to the experimental value.
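To put "very close" in numbers, the short sketch below summarizes the five examples using only the predicted and experimental values reported in them; the variable names are illustrative, and the RMSE over five hand-picked compounds is not comparable to the test set RMSE_ext.

```python
from math import sqrt

# (compound, predicted log K_PA-w, experimental log K_PA-w) as reported in Examples 1-5.
EXAMPLES = [
    ("anthracene", 4.46, 4.52),
    ("nitrobenzene", 1.92, 1.98),
    ("dichlorvos", 2.47, 2.48),
    ("1-iodooctane", 3.92, 3.90),
    ("cyclohexylbenzene", 4.08, 4.15),
]

# Residual = predicted - experimental, rounded to the precision of the reported data.
residuals = {name: round(pred - exp, 2) for name, pred, exp in EXAMPLES}
rmse_examples = sqrt(sum(r ** 2 for r in residuals.values()) / len(residuals))

for name, r in residuals.items():
    print(f"{name:18s} residual = {r:+.2f}")
print(f"RMSE over the five examples = {rmse_examples:.3f}")
```

All five residuals are within 0.07 log units, and four of the five predictions fall slightly below the experimental value.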

The present invention is not limited to the above embodiments. Based on the technical solutions disclosed herein, those skilled in the art can substitute or modify some of the technical features without creative effort, and all such substitutions and modifications fall within the protection scope of the present invention.

Claims

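1. A method for predicting the PA-water partition coefficient of organic pollutants based on a quantitative structure-property relationship, characterized by comprising the following steps:

1) Data collection: log K_PA-w values of 198 organic compounds are collected from the literature to obtain a data set;

2) Descriptor calculation: before the molecular structure descriptors are calculated, the initial molecular structure of each organic compound is generated in software and then optimized with the MM2 molecular mechanics method; the molecular structure descriptors of each compound are then calculated in software, and the descriptors are preprocessed;

3) Model construction: the data set is divided into a training set and a test set at a ratio of 4:1, and stepwise multiple linear regression analysis is carried out in SPSS software; the optimal model, selected on the principle of fewer molecular descriptors and a higher adjusted coefficient of determination (R²_adj) and external validation coefficient (Q²_ext), is:

log K_PA-w = 0.636 CrippenLogP - 1.274 RNCG - 12.849 VE2_Dzv - 0.000439 ATSC4v + 1.823;

wherein CrippenLogP is the Crippen octanol-water partition coefficient; RNCG is the relative negative charge; VE2_Dzv is the average coefficient sum of the last eigenvector of the Barysz matrix weighted by van der Waals volume; ATSC4v is a 2D autocorrelation descriptor weighted by van der Waals volume;

4) Model validation: the model is validated, and the method proceeds to step 5) once the model passes validation;

5) Application domain characterization: the application domain of the model is characterized by a Williams plot;

6) Model application: the model is used to predict the PA-water partition coefficients of unknown compounds.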

2. The method for predicting the PA-water partition coefficient of organic pollutants according to claim 1, wherein the organic compounds in step 1) include hydrocarbons, halogenated hydrocarbons, oxygen-containing compounds, nitrogen- and sulfur-containing compounds, and pesticides.

3. The method for predicting the PA-water partition coefficient of organic pollutants according to claim 1, wherein the preprocessing in step 2) includes removing descriptors that are constant, nearly constant, or have missing values, as well as descriptors with a correlation greater than 0.95.

4. The method for predicting the PA-water partition coefficient of organic pollutants according to claim 1, wherein in the validation of step 3), the bootstrap cross-validation coefficient Q²_BOOT and the leave-one-out cross-validation coefficient Q²_LOO are used to verify the robustness of the model; for external validation, the coefficient of determination between predicted and measured values (R²_ext) and the test set root mean square error RMSE_ext represent the external predictive ability of the model.

5. The method for predicting the PA-water partition coefficient of organic pollutants according to claim 1, wherein in step 1), for compounds with several reported values, data deviating significantly from the overall values are removed and the remaining values are averaged to build the data set; and in step 3), the data in the training set are used for model building and internal validation, and the data in the test set are used for external validation and performance evaluation of the model.

6. The method for predicting the PA-water partition coefficient of organic pollutants according to claim 1, wherein step 5) specifically comprises: characterizing the application domain of the model with a Williams plot of the standardized residual δ against the leverage value h_i; when |δ| of a compound is greater than 3.0, the compound is an outlier, and when its leverage value h_i is greater than the warning value h*, the structure of the compound differs markedly from those of the other compounds; h_i and h* are calculated by the following formulas:

h_i = x_i^T (X^T X)^-1 x_i

h* = 3(p+1)/n