CN113591394A - Method for predicting organic compound n-hexadecane/air distribution coefficient - Google Patents


Publication number
CN113591394A
Authority
CN
China
Prior art keywords
organic compound
group
model
value
scbo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110920233.3A
Other languages
Chinese (zh)
Other versions
CN113591394B (en
Inventor
李俊华
王雅
彭悦
杨雯皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202110920233.3A priority Critical patent/CN113591394B/en
Publication of CN113591394A publication Critical patent/CN113591394A/en
Application granted granted Critical
Publication of CN113591394B publication Critical patent/CN113591394B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2119/00 Details relating to the type or aim of the analysis or the optimisation
    • G06F2119/02 Reliability analysis or reliability optimisation; Failure analysis, e.g. worst case scenario performance, failure mode and effects analysis [FMEA]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Organic Low-Molecular-Weight Compounds And Preparation Thereof (AREA)

Abstract

The invention discloses a method for predicting the n-hexadecane/air distribution coefficient of an organic compound. Descriptor parameters characterizing the molecular structure are calculated from the molecular structure of the organic compound, and the constructed quantitative structure-activity relationship (QSAR) model is then used to obtain the n-hexadecane/air distribution coefficient of the compound quickly and efficiently. The method can replace experimental testing, reduces testing cost, and is faster and more efficient than the experimental method.

Description

Method for predicting organic compound n-hexadecane/air distribution coefficient
Technical Field
The invention belongs to the field of ecological risk evaluation test strategies, and relates to a method for predicting an n-hexadecane/air distribution coefficient of an organic compound by adopting a quantitative structure-activity relation model.
Background
The n-hexadecane/air partition coefficient ($K_{\mathrm{hexadecane/air}}$) of an organic compound is a key parameter characterizing its ability to migrate between an organic phase and air, and is one of the basic data needed to understand the environmental behavior and fate of organic compounds. Furthermore, the logarithm of $K_{\mathrm{hexadecane/air}}$, denoted L, is an important descriptor of the non-specific interactions of organic compounds in different partitioning processes and plays an essential role in poly-parameter linear free energy relationships (pp-LFERs). L values are indispensable in particular for pp-LFER environmental models that predict the partitioning behavior of organic compounds.
Therefore, the method for acquiring the L value of the organic compound is particularly important for an environmental prediction model based on pp-LFERs, and has important significance for understanding the environmental behavior of the organic compound and evaluating the ecological risk of the organic compound.
As is well known, $L = \log K_{\mathrm{hexadecane/air}}$, where $K_{\mathrm{hexadecane/air}} = C_{\mathrm{hexadecane}}/C_{\mathrm{air}}$, $C_{\mathrm{hexadecane}}$ is the equilibrium concentration of the solute in the n-hexadecane phase, and $C_{\mathrm{air}}$ is the equilibrium concentration of the solute in air. In general, L values can be measured by gas/liquid chromatography experiments. However, conventional experimental methods can only measure volatile organic compounds whose L values fall in a narrow range. In recent years, new experimental methods have been developed, so that more and more compounds can be measured and the range of measurable L values has widened. To date, more than 1000 L values have been determined experimentally. Given that there are more than 140,000 commercial chemicals, determining them experimentally one by one is impractical. It is therefore necessary to develop a prediction method that obtains their L values quickly and efficiently.
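The definition above can be sketched in a few lines of Python. This is only an illustration: the function name and the example concentrations are hypothetical, and the base-10 logarithm is an assumption, since the text writes "log" without stating the base.

```python
import math


def log_partition_coefficient(c_hexadecane: float, c_air: float) -> float:
    """Return L = log10(C_hexadecane / C_air).

    Both equilibrium concentrations must be positive and in the same units.
    The base-10 logarithm is an assumption; the text only writes "log".
    """
    if c_hexadecane <= 0 or c_air <= 0:
        raise ValueError("equilibrium concentrations must be positive")
    return math.log10(c_hexadecane / c_air)


# Hypothetical equilibrium concentrations, for illustration only.
example_L = log_partition_coefficient(2.0e-3, 2.0e-6)
```

A solute that is a thousand times more concentrated in the n-hexadecane phase than in air therefore has an L value of about 3 under this assumption.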
Some prediction methods, including the commonly used quantitative structure-activity relationship (QSAR) approach, have been used to obtain the L values of organic compounds. The guidelines for the development and use of QSAR models issued by the Organisation for Economic Co-operation and Development (OECD) require that a model have (1) a well-defined endpoint; (2) an unambiguous algorithm; (3) a well-defined application domain; (4) appropriate goodness of fit, robustness and predictive ability; and (5) a mechanistic interpretation, as far as possible. The existing prediction methods are re-evaluated against these guidelines below.
The connectivity index (CI) prediction method described in cited document 1 selects 3 connectivity indexes to construct a prediction model based on 387 organic compounds with different structures. The model was evaluated only by its root mean square error; the coefficient of determination ($R^2$), the leave-one-out cross-validation coefficient ($Q^2_{\mathrm{LOO}}$) and the externally explained variance ($Q^2$), which relate to goodness of fit, robustness and predictive ability, were not calculated. The evaluation of this prediction method is therefore neither complete nor sufficiently reliable.
In addition, the application domain of the model in the above method has not been characterized. Similarly, the software packages SPARC, COSMOthermX and ABSOLV mentioned therein all suffer from incomplete model evaluation.
Cited document 2 selects 68 different fragments for the 610 organic compounds in its training set and constructs a new prediction model. The goodness of fit and predictive ability of that model were evaluated, but its robustness was not evaluated and its application domain was not characterized.
Furthermore, for the emerging environmental pollutant organosilicon compounds, which are receiving increasing attention from scholars, the prediction of their L-values is also important. The application domain of the previously constructed L-value prediction model covers a smaller number of types of organosilicon compounds, and it is necessary to widen the application domain to cover a larger number of types of organosilicon compounds.
In conclusion, the existing prediction methods suffer from incomplete model evaluation, uncharacterized application domains and unclear mechanisms, and they cover few types of organosilicon compounds. It is therefore necessary to construct a new prediction model with a wider application domain that covers more types of organosilicon compounds, and to evaluate the model comprehensively and characterize its application domain in accordance with the OECD guidelines on the construction and use of QSAR models.
Cited documents:
Cited document 1: J. Chromatogr. A 2012, 1220, 132-142.
Cited document 2: SAR QSAR Environ. Res. 2014, 25, 51-71.
Disclosure of Invention
Problems to be solved by the invention
In view of the above-described deficiencies of the prior art, it is a primary object of the present invention to provide a method for predicting the value of the n-hexadecane/air distribution coefficient L of an organic compound, which is convenient to use, efficient and widely applicable. The method can be used for predicting the L value of the organic compound directly on the basis of the molecular structure description parameters of the organic compound, is beneficial to understanding the migration capacity of the organic compound between an organic phase and air, understanding the environmental behavior and the fate of the organic compound, and provides data support for the ecological risk evaluation of the organic compound.
Means for solving the problems
The inventors have made extensive studies and found that the above-mentioned problems can be solved by the following means:
[1] The present invention first provides a method for predicting the partition coefficient L of an organic compound between n-hexadecane and air, wherein the method comprises calculating the L value of the organic compound using the following formula (1):
$$L = k_0 + k_1 S_A - k_2\,\mathrm{Mi} + k_3\,\mathrm{SCBO} - k_4\,\mathrm{nH} + k_5\,\mathrm{nCIC} + k_6\,\mathrm{Hy} \quad \text{(formula 1)}$$
wherein
$k_0$ represents the model constant, $k_0$ = 15.5~17.1;
$S_A$ represents the area of the organic compound obtained according to the conductor-like screening model (COSMO), with its unit given in the original by the inline image Figure BDA0003207188060000031, $k_1$ = 0.022~0.027;
Mi represents the mean first ionization potential of the organic compound (scaled on the carbon atom), $k_2$ = 13.92~15.38;
SCBO represents the sum of all bond orders in the hydrogen-depleted molecular graph of the organic compound, $k_3$ = 0.057~0.071;
nH represents the number of hydrogen atoms of the organic compound, $k_4$ = 0.03~0.04;
nCIC represents the number of rings of the organic compound, $k_5$ = 0.505~0.617;
Hy represents the hydrophilic factor of the organic compound, $k_6$ = 0.613~0.749.
[2] The method according to [1], wherein $k_0$ to $k_6$ in formula (1) independently of one another have the following numerical ranges:
$k_0$ = 15.8~16.8;
$k_1$ = 0.024~0.026;
$k_2$ = 14.21~15.09;
$k_3$ = 0.061~0.067;
$k_4$ = 0.035~0.039;
$k_5$ = 0.533~0.589;
$k_6$ = 0.647~0.715.
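[3] The method according to [1] or [2], wherein the general formula (1) satisfies the following general formula (1-1):
$$L = 16.309 \pm 1\% + (0.025 \pm 3\%)\,S_A - (14.651 \pm 1\%)\,\mathrm{Mi} + (0.064 \pm 3\%)\,\mathrm{SCBO} - (0.037 \pm 3\%)\,\mathrm{nH} + (0.561 \pm 3\%)\,\mathrm{nCIC} + (0.681 \pm 3\%)\,\mathrm{Hy} \quad \text{(formula 1-1)}$$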
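[4] The method according to any one of [1] to [3], wherein the general formula (1) satisfies the following general formula (1-2a) or (1-2b):
$$L = 16.309 \pm 1\% + (0.025 \pm 1\%)\,S_A - (14.651 \pm 1\%)\,\mathrm{Mi} + (0.064 \pm 1\%)\,\mathrm{SCBO} - (0.037 \pm 1\%)\,\mathrm{nH} + (0.561 \pm 1\%)\,\mathrm{nCIC} + (0.681 \pm 1\%)\,\mathrm{Hy} \quad \text{(formula 1-2a)}$$
$$L = 16.309 \pm 0.5\% + (0.025 \pm 1\%)\,S_A - (14.651 \pm 0.5\%)\,\mathrm{Mi} + (0.064 \pm 1\%)\,\mathrm{SCBO} - (0.037 \pm 1\%)\,\mathrm{nH} + (0.561 \pm 1\%)\,\mathrm{nCIC} + (0.681 \pm 1\%)\,\mathrm{Hy} \quad \text{(formula 1-2b)}$$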
[5] The method according to any one of [1] to [4], wherein the value of $S_A$ is obtained by quantum chemical computation software.
[6] The method according to any one of [1] to [5], wherein Mi, SCBO, nH, nCIC and Hy are obtained by the molecular structure descriptor computing software Dragon.
[7] The method according to any one of [1] to [6], wherein the organic compound has one or more functional groups selected from the group consisting of:
a hydrocarbon group, a halogen group, a hydroxyl group, an ether group, an ester group, an aldehyde group, a ketone group, a carboxyl group, a nitrogen-containing group, a phosphorus-containing group, a silicon-containing group, and a sulfur-containing group.
[8] Further, the present invention also provides an ecological early warning method, wherein the early warning method comprises assessing the ecological risk of a target compound using the method according to any one of the above [1] to [7].
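The coefficient ranges in [1] and [2] and the central values in formula (1-1) can be cross-checked mechanically. The following Python sketch is an illustration only: the dictionary and function names are hypothetical, while the numbers are copied from [1], [2] and formula (1-1).

```python
# Coefficient ranges from [1] and the narrower ranges from [2], as (low, high).
RANGES_ITEM_1 = {
    "k0": (15.5, 17.1), "k1": (0.022, 0.027), "k2": (13.92, 15.38),
    "k3": (0.057, 0.071), "k4": (0.03, 0.04), "k5": (0.505, 0.617),
    "k6": (0.613, 0.749),
}
RANGES_ITEM_2 = {
    "k0": (15.8, 16.8), "k1": (0.024, 0.026), "k2": (14.21, 15.09),
    "k3": (0.061, 0.067), "k4": (0.035, 0.039), "k5": (0.533, 0.589),
    "k6": (0.647, 0.715),
}
# Central coefficient values from formula (1-1).
CENTRAL = {
    "k0": 16.309, "k1": 0.025, "k2": 14.651, "k3": 0.064,
    "k4": 0.037, "k5": 0.561, "k6": 0.681,
}


def in_ranges(coeffs, ranges):
    """True if every coefficient lies inside its (low, high) range, end points included."""
    return all(ranges[name][0] <= value <= ranges[name][1]
               for name, value in coeffs.items())


def is_nested(inner, outer):
    """True if every inner range lies inside the corresponding outer range."""
    return all(outer[name][0] <= inner[name][0] and inner[name][1] <= outer[name][1]
               for name in inner)


central_ok = in_ranges(CENTRAL, RANGES_ITEM_1) and in_ranges(CENTRAL, RANGES_ITEM_2)
nested_ok = is_nested(RANGES_ITEM_2, RANGES_ITEM_1)
```

With the numbers as listed, the ranges of [2] nest inside those of [1], and the central values of formula (1-1) fall inside both.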
Advantageous effects of the invention
By implementing the above technical solution, the method has the beneficial effect that the L value can be predicted quickly from the molecular structure descriptors of the organic compound. The method is convenient and fast and saves the cost of experimental testing. The invention constructs and evaluates the L value prediction model, and characterizes its application domain, strictly in accordance with the QSAR model development and use guidelines issued by the OECD. The L values predicted by the invention can therefore provide basic data for the ecological risk evaluation of chemicals and are of significance for chemicals management.
More specifically, the method provided by the invention has the following characteristics:
1) the modeling method is multiple linear regression, so the algorithm is transparent; the model uses 1 quantum chemical parameter and 5 molecular structure descriptors that are available from existing software, so it is simple, mechanistically clear and easy to popularize and apply;
2) compared with the existing model, the application domain of the model with the general formula (formula 1) is greatly widened, and the model covers more kinds of new organic pollutants, especially organic silicon compounds, such as: silanes, silazanes, silicon sulfides, siloxanes, and chlorinated siloxanes.
3) The construction of the model of the general formula (formula 1) is strictly in accordance with the QSAR model development and use guide issued by OECD, and the QSAR model is further evaluated and characterized, and the built model has good goodness of fit, robustness and reliable prediction capability.
Drawings
FIG. 1 is a graph of the fit of the experimental values to the predicted values for L values in a training set of 1018 compounds in a particular embodiment of the present invention.
FIG. 2 is a plot of the experimental and predicted values for the validation set L values for 254 compounds in a particular embodiment of the invention.
FIG. 3 is a Williams plot in a specific embodiment of the present invention, wherein ● represents the training set compounds, o represents the validation set compounds, and the warning value $h^*$ is 0.021.
Detailed Description
The present invention will be described in detail below. The technical features described below are explained based on typical embodiments and specific examples of the present invention, but the present invention is not limited to these embodiments and specific examples. It should be noted that:
in the present specification, the numerical range represented by "numerical value a to numerical value B" means a range including the end point numerical value A, B.
In the present specification, the numerical ranges indicated by "above" or "below" mean the numerical ranges including the numbers.
In the present specification, the meaning of "may" includes both the meaning of performing a certain process and the meaning of not performing a certain process.
As used herein, the term "optional" or "optionally" indicates that certain substances, components, steps to be performed, application conditions, and the like may or may not be used.
In the present specification, X ± y% indicates the numerical range from X(1 - y%) to X(1 + y%).
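As a small illustration of the X ± y% convention, the Python sketch below converts a central value and a percentage into the corresponding closed interval. The function name is hypothetical; the example values 0.025 and 3% are taken from formula (1-1).

```python
def percent_interval(x: float, y_percent: float) -> tuple:
    """Return (X(1 - y%), X(1 + y%)) for the notation X ± y%."""
    return x * (1 - y_percent / 100.0), x * (1 + y_percent / 100.0)


# 0.025 ± 3%, the tolerance on the S_A coefficient in formula (1-1).
low, high = percent_interval(0.025, 3)
```

For 0.025 ± 3% this gives the interval from 0.02425 to 0.02575, end points included.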
In the present specification, the distribution coefficient L of the organic compound in n-hexadecane and air means the distribution coefficient L at room temperature and one atmosphere.
In the present specification, unless otherwise specified, the external or environmental conditions relating to the model are all room temperature conditions, and "room temperature" as used herein means an indoor ambient temperature of 25 ℃.
When software is described in this specification, the software name refers without limitation to all available versions of the software under that name, including original, extended or upgraded versions; preferably, the version is the most recent one available when this specification is read by the public.
In the present specification, reference to "some particular/preferred embodiments," "other particular/preferred embodiments," "embodiments," and the like, means that a particular element (e.g., feature, structure, property, and/or characteristic) described in connection with the embodiment is included in at least one embodiment described herein, and may or may not be present in other embodiments. In addition, it is to be understood that the described elements may be combined in any suitable manner in the various embodiments.
< first aspect >
In the first aspect of the present invention, a method for predicting the L value of an organic compound is provided, which is convenient, efficient, and has a wide application range.
The prediction method predicts the n-hexadecane/air partition coefficient of an organic compound based on a quantitative structure-activity relationship (QSAR) model. Specifically, descriptor parameters characterizing the molecular structure characteristics are calculated based on the molecular structure of the organic compound, and the constructed QSAR model is used for rapidly and efficiently acquiring the n-hexadecane/air distribution coefficient of the organic compound.
By QSAR method is meant the use of (known) physicochemical or theoretical parameters to predict molecular physical, chemical or other properties, such as pharmacology, biotoxicity, etc.
The basic assumption is that the molecular structure of a compound has a direct or indirect relationship with its physical, chemical properties, etc., and this relationship can be characterized by some functional formula. In particular, the properties of a compound depend on its structure and can be represented as:
p=f(s)
where p represents a measurable physical, chemical, pharmacological, etc. property, and s represents an empirical or non-empirical parameter such as the entire molecular structure or a partial substructure or fragment.
For these properties p mentioned above, there have been literature reports that can generally include: molar volume, melting point, vapor pressure, molecular surface area, partition coefficient (water/organic solvent, water/air), dissociation constant, hydrophobicity constant, quantum chemical parameters, topological parameters derived from chemical graph theory, and the like.
In addition, the QSAR method has two essential prerequisites: on the one hand, high-precision and reliable experimental (or highly reliable empirical) data or data sets must be available for known compounds; on the other hand, there must be parameters that adequately represent the entire molecule or its substructures.
Based on the principles of the QSAR method described above, in order to construct the model of the present invention, it is first necessary to determine or select a data set of organic compounds with known L values, to randomly select a portion of the organic compounds in the data set as a training set $L_1$, and to take another portion as a validation set $L_2$.
The organic compound data set with known L values is not particularly limited in the present invention and can be selected from an existing database, for example a database in any known software. In some preferred embodiments, the data may be collected and collated from the sub-database "UFZ-selected published values" of the database named "UFZ-LSER database" (v3.2.1).
In some embodiments of the present invention, a database can be formed from the data set of organic compounds with known L values, containing the L values of at least 500, preferably 800 or more, more preferably 1000 or more or 1500 or more organic compounds. As described above, these L values may be experimentally measured values or highly reliable empirical values. It will be appreciated that the more organic compounds with experimentally determined L values the data set contains, the more reliable the model finally established by the present invention will be.
In addition, in the present invention, the organic compound may have one or more functional groups selected from the following group:
a hydrocarbon group, a halogen group, a hydroxyl group, an ether group, an ester group, an aldehyde group, a ketone group, a carboxyl group, a nitrogen-containing group, a phosphorus-containing group, a silicon-containing group, and a sulfur-containing group.
The organic compound having a hydrocarbon group is not particularly limited, and may be selected from various saturated or unsaturated aliphatic hydrocarbons, alicyclic hydrocarbons, aromatic hydrocarbons, and the like. The unsaturated structure is not particularly limited, and may be an alkenyl structure or an alkynyl structure. Further, with respect to the compound having a hydrocarbon group of the present invention, a plurality of the structures described above may be present in the formula at the same time.
The organic compound having a hydroxyl group may be, for example, various alcohol compounds, phenol compounds, and the like, and the compound may have one or more hydroxyl groups.
Also, the organic compound having an ether group, an ester group, an aldehyde group, a ketone group or a carboxyl group structure is not particularly limited, and may be an organic compound having one or more of an oxygen ether group, an acyloxy group, a ketone carbonyl group or a carboxyl group in a molecular structure.
Examples of the organic compound having a nitrogen-containing group include a compound having a nitro group, a compound having an amino group (including primary, secondary or tertiary amines), a compound having an amide group, and a compound having a nitrile group. In addition, these organic compounds may include one or more of the nitrogen-containing groups described above.
As the organic compound having a phosphorus-containing group, an organic phosphine compound and a phosphoric acid (aliphatic, alicyclic or aromatic) ester compound can be exemplified.
Examples of the organic compound having a silicon-containing group include various organosilanes and organosiloxanes.
Examples of the organic compound having a sulfur-containing group include a compound having a thioether bond, a compound having a mercapto group, and a compound having a polysulfide bond.
The compound having a halogen group is not particularly limited, and compounds in which one or more hydrogen atoms are substituted with a halogen atom in the above-mentioned various compounds can be cited. The halogen atoms, which may be present at each occurrence, may be chosen independently of one another from fluorine, chlorine, bromine or iodine atoms.
It should be noted that, as to the various functional groups listed above, there are no limitations, and one or more of such functional groups may be present in the structure of the organic compound of the present invention.
Further, the training set $L_1$ used by the present invention to construct the model may, in particular embodiments, be selected from the previously determined organic compound data set. In principle, the number of organic compounds in the training set $L_1$ is not particularly limited; preferably, the training set $L_1$ contains more than 300, more than 400, more than 800 or more than 1000 compounds, and these compounds have different structures.
The validation set $L_2$ used by the present invention may, in some embodiments, be selected from the previously determined organic compound data set. In principle, the number of organic compounds in the validation set $L_2$ is not particularly limited; preferably, from the viewpoint of simplifying the calculation, the validation set $L_2$ contains 1000 or fewer, 800 or fewer, 600 or fewer, or 500 or fewer compounds.
In addition, for the ratio of the numbers of organic compounds in the training set $L_1$ and the validation set $L_2$, in some embodiments of the invention, $N_{L1}:N_{L2}$ may be 10:1 or less, preferably 3:1 to 6:1.
In constructing the model of the present invention, the L values and the selected molecular structure descriptor values of the organic compounds in the training set $L_1$ are used for model construction, and the L values and molecular structure descriptor values of the organic compounds in the validation set $L_2$ are used for external validation of the model. In some specific embodiments, the leave-one-out method can be used for internal validation.
Based on the training set $L_1$ and the validation set $L_2$, the present invention finds that the following molecular structure descriptors can be used to construct the model of the present invention:
$S_A$: represents the area of the organic compound obtained according to the conductor-like screening model (COSMO), with its unit given in the original by the inline image Figure BDA0003207188060000091;
Mi: represents the mean first ionization potential of the organic compound (scaled on the carbon atom), that is, the energy required for a neutral molecule to lose one electron and become a singly charged positive ion; it can be obtained from the ratio of the ionic charge number of the compound after losing the electron to the ionic radius (in nanometers or picometers);
SCBO: represents the sum of all bond orders in the hydrogen-depleted molecular graph of the organic compound and is dimensionless, where a single bond has bond order 1, a double bond 2, a triple bond 3, and so on. For example, the sum of bond orders of C6H6 is 6, and the sum of bond orders of CH3-CH=CH2 is 3;
nH: represents the number of hydrogen atoms of the organic compound;
nCIC: represents the number of rings of the organic compound, where the number of rings is the total number of basic ring units, that is, the total number of independent, non-overlapping rings; for example, the number of rings of naphthalene is 2;
Hy: represents the hydrophilic factor of the organic compound, which can be calculated from the number of hydrophilic groups (e.g., hydroxyl groups, carboxyl groups, nitrogen-containing groups) in the compound, the number of carbon atoms, and the number of all atoms other than hydrogen. For example, the specific calculation method may be as described in "3D-modelling and prediction by WHIM descriptors. Part 8. Toxicity and physico-chemical properties of environmental priority chemicals by 2D-TI and 3D-WHIM descriptors" (SAR QSAR Environ Res. 1997; 7(1-4): 173-93. doi: 10.1080/10629369708039130).
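Two of the descriptors above, SCBO and nCIC, are simple enough to compute by hand from a hydrogen-depleted molecular graph. The Python sketch below is only an illustration and is not how the Dragon software is implemented: the function names are hypothetical, and counting independent rings as bonds minus atoms plus connected components (the cyclomatic number of the graph) is an assumption consistent with the naphthalene example.

```python
def scbo(bond_orders):
    """Sum of bond orders over the bonds between non-hydrogen atoms."""
    return sum(bond_orders)


def ncic(n_heavy_atoms, n_heavy_bonds, n_components=1):
    """Number of independent rings of the hydrogen-depleted graph.

    Computed as bonds - atoms + connected components (assumption, see lead-in).
    """
    return n_heavy_bonds - n_heavy_atoms + n_components


# Propene, CH3-CH=CH2: one C-C single bond and one C=C double bond.
propene_scbo = scbo([1, 2])

# Naphthalene: 10 carbon atoms joined by 11 carbon-carbon bonds.
naphthalene_ncic = ncic(n_heavy_atoms=10, n_heavy_bonds=11)
```

These reproduce the propene value of 3 and the naphthalene ring count of 2 quoted in the descriptor list.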
In principle, the source of the descriptor value is not particularly limited, and may be obtained by performing calculation according to software existing in the field or by using a currently existing calculation method.
In some specific embodiments of the invention, the value of $S_A$ can be calculated by a conventional quantum chemical calculation method. In some preferred embodiments, the specific value of $S_A$ can be calculated after optimizing the structure of the compound using, for example, the quantum chemical computing software MOPAC.
Additionally, the values for Mi, SCBO, nH, nCIC and Hy, in some preferred embodiments of the present invention, can be calculated by the Dragon series software.
Further, the model for predicting the partition coefficient L of an organic compound in n-hexadecane and air established by the present invention can be expressed using a function represented by the following formula (1):
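$$L = k_0 + k_1 S_A - k_2\,\mathrm{Mi} + k_3\,\mathrm{SCBO} - k_4\,\mathrm{nH} + k_5\,\mathrm{nCIC} + k_6\,\mathrm{Hy} \quad \text{(formula 1)}$$
wherein $k_0$ represents the model constant, $k_0$ = 15.5~17.1; $k_1$ = 0.022~0.027; $k_2$ = 13.92~15.38; $k_3$ = 0.057~0.071; $k_4$ = 0.03~0.04; $k_5$ = 0.505~0.617; $k_6$ = 0.613~0.749.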
In some preferred embodiments of the present invention, $k_0$ to $k_6$ in the general formula (1) independently of one another have the following numerical ranges:
$k_0$ = 15.8~16.8; $k_1$ = 0.024~0.026; $k_2$ = 14.21~15.09; $k_3$ = 0.061~0.067; $k_4$ = 0.035~0.039; $k_5$ = 0.533~0.589; $k_6$ = 0.647~0.715.
In still other preferred embodiments of the present invention, the general formula (1) satisfies the following general formula (1-1):
$$L = 16.309 \pm 1\% + (0.025 \pm 3\%)\,S_A - (14.651 \pm 1\%)\,\mathrm{Mi} + (0.064 \pm 3\%)\,\mathrm{SCBO} - (0.037 \pm 3\%)\,\mathrm{nH} + (0.561 \pm 3\%)\,\mathrm{nCIC} + (0.681 \pm 3\%)\,\mathrm{Hy} \quad \text{(formula 1-1)}$$
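Further, more preferably, the general formula (1) satisfies the following general formula (1-2a) or (1-2b):
$$L = 16.309 \pm 1\% + (0.025 \pm 1\%)\,S_A - (14.651 \pm 1\%)\,\mathrm{Mi} + (0.064 \pm 1\%)\,\mathrm{SCBO} - (0.037 \pm 1\%)\,\mathrm{nH} + (0.561 \pm 1\%)\,\mathrm{nCIC} + (0.681 \pm 1\%)\,\mathrm{Hy} \quad \text{(formula 1-2a)}$$
$$L = 16.309 \pm 0.5\% + (0.025 \pm 1\%)\,S_A - (14.651 \pm 0.5\%)\,\mathrm{Mi} + (0.064 \pm 1\%)\,\mathrm{SCBO} - (0.037 \pm 1\%)\,\mathrm{nH} + (0.561 \pm 1\%)\,\mathrm{nCIC} + (0.681 \pm 1\%)\,\mathrm{Hy} \quad \text{(formula 1-2b)}$$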
For a typical embodiment of the present invention, a preferred mathematical model function of the present invention can be obtained by the following method:
First, the L values of organic compounds were collected and collated from the sub-database "UFZ-preselected published values" of the UFZ-LSER database v3.2.1. Inorganic substances, and organic substances whose molecular structures lie far from the structural domain, were removed, giving the L values of 1272 organic compounds, including various organosilicon compounds such as silanes, silazanes, silicon sulfides, siloxanes and chlorinated siloxanes. The 1272 compounds were randomly split into a training set and a validation set at a ratio of 4:1, giving 1018 compounds in the training set and 254 in the validation set. The L values and molecular structure descriptor values of the training-set compounds were used for model construction, and those of the validation-set compounds were used for external validation of the model; the model was validated internally using the leave-one-out method.
The candidate descriptors for the model comprise the quantum chemical descriptor $S_A$ and the molecular structure descriptors in the Dragon software; descriptor screening and model construction were performed by stepwise multiple linear regression analysis.
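The random 4:1 split described above can be sketched as follows. This is a minimal illustration, not the procedure actually used by the inventors: the function name, the fixed seed and the use of integer placeholders in place of compound records are all assumptions.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle the items and split them into (training, validation) lists.

    train_fraction=0.8 corresponds to a 4:1 ratio; the seed is a hypothetical
    choice that only makes the shuffle reproducible.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# 1272 placeholder compound identifiers stand in for the real data set.
training_set, validation_set = split_dataset(range(1272))
```

Rounding 1272 x 0.8 = 1017.6 to the nearest integer gives the 1018 training compounds and 254 validation compounds reported in the text.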
The model constructed was as follows:
$$L = 16.309 + 0.025\,S_A - 14.651\,\mathrm{Mi} + 0.064\,\mathrm{SCBO} - 0.037\,\mathrm{nH} + 0.561\,\mathrm{nCIC} + 0.681\,\mathrm{Hy}$$
Finally, the above 6 molecular structure descriptors were selected for model construction. The coefficient of determination ($R^2$) of the model is 0.958 and the root mean square error ($\mathrm{RMSE}_t$) is 0.620, indicating that the model has good goodness of fit.
The leave-one-out cross-validation coefficient ($Q^2_{\mathrm{LOO}}$) of the model is 0.957, showing that the model is robust; the external validation coefficient ($Q^2_V$) is 0.961 and the root mean square error of the validation set ($\mathrm{RMSE}_V$) is 0.604, showing that the model has good predictive ability. Unlike the existing L value prediction models, this model was comprehensively evaluated and its application domain characterized in accordance with the OECD guidelines on the construction and use of QSAR models. The model also covers a wider variety of organosilicon compounds, including silanes, silazanes, silicon sulfides, siloxanes, and chlorinated siloxanes.
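The statistics quoted above can be illustrated with a short Python sketch. The text does not give the formulas, so the definitions below are commonly used ones and should be read as assumptions; in particular, the external coefficient here uses the training-set mean as its reference. The leave-one-out example fits a single-descriptor straight line for brevity, whereas the model in the text has six descriptors, and all function names are hypothetical.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def q2_external(observed_val, predicted_val, training_mean):
    """External validation coefficient referenced to the training-set mean."""
    ss_res = sum((o - p) ** 2 for o, p in zip(observed_val, predicted_val))
    ss_ref = sum((o - training_mean) ** 2 for o in observed_val)
    return 1 - ss_res / ss_ref


def fit_line(x, y):
    """Least-squares intercept and slope for one descriptor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return mean_y - slope * mean_x, slope


def q2_loo(x, y):
    """Leave-one-out Q2: refit without each compound, then predict it."""
    press = 0.0
    for i in range(len(x)):
        intercept, slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (intercept + slope * x[i])) ** 2
    mean_y = sum(y) / len(y)
    return 1 - press / sum((yi - mean_y) ** 2 for yi in y)
```

A model whose leave-one-out coefficient stays close to its coefficient of determination, as with 0.957 against 0.958 here, changes little when any single training compound is removed.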
In addition, the model uses a small number of molecular structure descriptors, namely 1 quantum chemical parameter ($S_A$) and 5 Dragon molecular structure descriptors, so the model is simple, which facilitates its further popularization and application.
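To show how little is needed to apply the constructed model, the Python sketch below evaluates it for one set of descriptor values. The coefficients are those of the constructed model; the function name and the descriptor values in the example are hypothetical and do not correspond to any real compound, whose $S_A$ would come from a COSMO calculation and whose other descriptors would come from Dragon.

```python
def predict_L(S_A, Mi, SCBO, nH, nCIC, Hy):
    """Evaluate the constructed model for one compound's descriptor values."""
    return (16.309 + 0.025 * S_A - 14.651 * Mi + 0.064 * SCBO
            - 0.037 * nH + 0.561 * nCIC + 0.681 * Hy)


# Hypothetical descriptor values, chosen only to illustrate the arithmetic.
example_prediction = predict_L(S_A=100.0, Mi=1.0, SCBO=3, nH=6, nCIC=0, Hy=0.0)
```

For these illustrative inputs the terms are 16.309 + 2.5 - 14.651 + 0.192 - 0.222, giving a predicted L of 4.128.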
Further, a Williams plot was used to characterize the application domain of the constructed model. The standardized residual (δ) of each organic compound and its hat value (h_i), calculated from the molecular structure descriptor values, are computed. When the h_i value of an organic compound is greater than the warning value (h*), the compound is outside the application domain of the model. h_i and h* are calculated by the following formulas:
h_i = x_i^T (X^T X)^{-1} x_i    (2)
h* = 3(k + 1)/n    (3)
Here x_i is the descriptor matrix of the i-th compound; x_i^T is the transpose of x_i; X is the descriptor matrix of all compounds; X^T is the transpose of X; (X^T X)^{-1} is the inverse of the matrix X^T X; k is the number of variables in the model; and n is the number of compounds in the training set (with k = 6 and n = 1018, 3 × 7/1018 ≈ 0.021). The h* of the constructed model is 0.021; therefore, in some preferred embodiments of the present invention, the model is particularly suitable for predicting the L value of organic compounds whose h_i is less than 0.021.
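The hat value and warning value can be sketched in plain Python as below. Several points are assumptions made for illustration: the function names are hypothetical, n is taken to be the number of training compounds (1018), a leading column of ones is used for the intercept in the toy matrix, and the linear system is solved by Gaussian elimination rather than by forming the inverse explicitly. The toy matrix is not data from the patent.

```python
def warning_leverage(n_descriptors, n_training):
    """Warning value h* = 3(k + 1)/n."""
    return 3 * (n_descriptors + 1) / n_training

def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - known) / M[r][r]
    return z

def leverage(x_query, X_train):
    """Hat value h = x^T (X^T X)^{-1} x for one descriptor vector."""
    p = len(x_query)
    XtX = [[sum(row[a] * row[b] for row in X_train) for b in range(p)]
           for a in range(p)]
    z = solve_linear_system(XtX, list(x_query))  # z = (X^T X)^{-1} x
    return sum(xi * zi for xi, zi in zip(x_query, z))

# Warning value for a model with 6 descriptors and 1018 training compounds.
print(round(warning_leverage(6, 1018), 3))  # 0.021

# Toy training matrix: an intercept column of ones plus a single descriptor.
X_toy = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
print(round(leverage([1.0, 0.0], X_toy), 3))  # 0.7
```

A compound would then be treated as inside the application domain when its hat value is below the warning value.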
< second aspect >
In a second aspect of the invention, a method for performing ecological or environmental pre-warning using the model established by the invention is provided.
Specifically, the n-hexadecane/air partition coefficient L of a target compound can be calculated or predicted using the model provided in the < first aspect > of the present invention. The target compound may be selected from the organic compounds mentioned in the < first aspect >; specifically, the target compound has one or more functional groups selected from the following group, and its L value may be currently unknown:
a hydrocarbon group, a halogen group, a hydroxyl group, an ether group, an ester group, an aldehyde group, a ketone group, a carboxyl group, a nitrogen-containing group, a phosphorus-containing group, a silicon-containing group, and a sulfur-containing group.
In particular, the model of the present invention is especially suitable for predicting or calculating the L value of organic compounds that have a silicon-containing group and whose L value is unknown.
Examples
The technical solution of the present invention will be described below by specific examples.
Example 1
Given the organic compound 1-chloro-4-nitrobenzene, its L value is to be predicted. First, based on the structural information of 1-chloro-4-nitrobenzene, the structure was optimized with the MOPAC software package to obtain an S_A value of 165.85; then, based on the MOPAC-optimized structure, the Mi, SCBO, nH, nCIC and Hy values were calculated with the Dragon 6.0 software as 1.121, 15, 4, 1 and -0.576, respectively. The h value calculated according to formula (2) is 0.003 (< 0.021), so the compound is within the application domain of the model. Substituting the descriptor values into formula (1) gives a predicted L value of 5.01; the experimentally determined L value is 5.22, so the predicted and experimental values are in close agreement.
Example 2
Given the organic compound methyl phenyl ether, its L value is to be predicted. First, based on the structural information of methyl phenyl ether, the structure was optimized with the MOPAC software package to obtain an S_A value of 146.19; then, based on the MOPAC-optimized structure, the Mi, SCBO, nH, nCIC and Hy values were calculated with the Dragon 6.0 software as 1.117, 11, 8, 1 and -0.828, respectively. The h value calculated according to formula (2) is 0.003 (< 0.021), so the compound is within the application domain of the model. Substituting the descriptor values into formula (1) gives a predicted L value of 4.00; the experimentally determined L value is 3.89, so the predicted and experimental values are in good agreement.
Example 3
Given the organic compound ethyl propionate, its L value is to be predicted. First, based on the structural information of ethyl propionate, the structure was optimized with the MOPAC software package to obtain an S_A value of 149.15; then, based on the MOPAC-optimized structure, the Mi, SCBO, nH, nCIC and Hy values were calculated with the Dragon 6.0 software as 1.147, 7, 10, 0 and -0.668, respectively. The h value calculated according to formula (2) is 0.003 (< 0.021), so the compound is within the application domain of the model. Substituting the descriptor values into formula (1) gives a predicted L value of 2.86; the experimentally determined L value is 2.81, so the predicted and experimental values are in close agreement.
Example 4
Given the organic compound simetryn, its L value is to be predicted. First, based on the structural information of simetryn, the structure was optimized with the MOPAC software package to obtain an S_A value of 253.07; then, based on the MOPAC-optimized structure, the Mi, SCBO, nH, nCIC and Hy values were calculated with the Dragon 6.0 software as 1.155, 17, 15, 1 and 0.686, respectively. The h value calculated according to formula (2) is 0.008 (< 0.021), so the compound is within the application domain of the model. Substituting the descriptor values into formula (1) gives a predicted L value of 7.28; the experimentally determined L value is 8.23, so the predicted and experimental values are fairly close, differing by 0.95.
Example 5
Given the organic compound indeno[1,2,3-cd]pyrene, its L value is to be predicted. First, based on the structural information of indeno[1,2,3-cd]pyrene, the structure was optimized with the MOPAC software package to obtain an S_A value of 284.89; then, based on the MOPAC-optimized structure, the Mi, SCBO, nH, nCIC and Hy values were calculated as 1.073, 40.5, 12, 6 and -0.986, respectively. The h value calculated according to formula (2) is 0.04 (> 0.021), so the compound is outside the application domain of the model. Substituting the descriptor values into formula (1) gives a predicted L value of 12.55; the experimentally determined L value is 12.70. The predicted and experimental values are still very close, which shows that the model's predictions retain some reference value for compounds outside the application domain.
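The five examples can be reproduced in a single pass, as sketched below. The descriptor values, h values and experimental L values are those reported in Examples 1 to 5; the function and variable names are hypothetical, and the h values are taken as reported rather than recomputed, since the training descriptor matrix is not reproduced in the description.

```python
H_WARNING = 0.021  # warning value h* reported in the description

def predict_L(S_A, Mi, SCBO, nH, nCIC, Hy):
    """Evaluate the regression model of formula (1)."""
    return (16.309 + 0.025 * S_A - 14.651 * Mi + 0.064 * SCBO
            - 0.037 * nH + 0.561 * nCIC + 0.681 * Hy)

# (name, S_A, Mi, SCBO, nH, nCIC, Hy, reported h, experimental L)
EXAMPLES = [
    ("1-chloro-4-nitrobenzene", 165.85, 1.121, 15, 4, 1, -0.576, 0.003, 5.22),
    ("methyl phenyl ether", 146.19, 1.117, 11, 8, 1, -0.828, 0.003, 3.89),
    ("ethyl propionate", 149.15, 1.147, 7, 10, 0, -0.668, 0.003, 2.81),
    ("simetryn", 253.07, 1.155, 17, 15, 1, 0.686, 0.008, 8.23),
    ("indeno[1,2,3-cd]pyrene", 284.89, 1.073, 40.5, 12, 6, -0.986, 0.04, 12.70),
]

def evaluate(examples):
    """Return predicted L, residual and application-domain flag per compound."""
    rows = []
    for name, S_A, Mi, SCBO, nH, nCIC, Hy, h, L_exp in examples:
        L_pred = predict_L(S_A, Mi, SCBO, nH, nCIC, Hy)
        rows.append({
            "name": name,
            "L_pred": round(L_pred, 2),
            "residual": round(L_pred - L_exp, 2),
            "in_domain": h < H_WARNING,
        })
    return rows

for row in evaluate(EXAMPLES):
    print(row["name"], row["L_pred"], row["residual"], row["in_domain"])
```

The predicted values come out as 5.01, 4.0, 2.86, 7.28 and 12.55, matching the examples, and only indeno[1,2,3-cd]pyrene is flagged as outside the application domain.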
It should be noted that, although the technical solutions of the present invention are described by specific examples, those skilled in the art can understand that the present invention should not be limited thereto.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Industrial applicability
The method for predicting the partition coefficient L of an organic compound between n-hexadecane and air can be used industrially to predict or evaluate the L value of a specific target compound, and thereby provides a reliable basis for assessing the ecological risk of the compound and for controlling high-risk chemicals.

Claims (8)

1. A method of predicting the partition coefficient L of an organic compound between n-hexadecane and air, characterized in that the method comprises calculating the L value of the organic compound using the following formula (1):
L = k0 + k1·S_A - k2·Mi + k3·SCBO - k4·nH + k5·nCIC + k6·Hy
(formula 1)
wherein:
k0 represents the model constant, k0=15.5~17.1;
S_A represents the area of the organic compound obtained according to the conductor-like screening model (COSMO), and the unit is
Figure FDA0003207188050000011
k1=0.022~0.027;
Mi represents the mean first ionization potential of the organic compound (scaled on the carbon atom), k2=13.92~15.38;
SCBO represents the sum of all bond orders in the hydrogen-depleted molecular graph of the organic compound, k3=0.057~0.071;
nH represents the number of hydrogen atoms of the organic compound, k4=0.03~0.04;
nCIC represents the number of rings of the organic compound, k5=0.505~0.617;
Hy represents the hydrophilic factor of the organic compound, k6=0.613~0.749.
2. The method according to claim 1, wherein k0~k6 in general formula (1) independently have the following numerical ranges:
k0=15.8~16.8;
k1=0.024~0.026;
k2=14.21~15.09;
k3=0.061~0.067;
k4=0.035~0.039;
k5=0.533~0.589;
k6=0.647~0.715。
3. The method according to claim 1 or 2, wherein general formula (1) satisfies the following general formula (1-1):
L = 16.309±1% + (0.025±3%)·S_A - (14.651±1%)·Mi + (0.064±3%)·SCBO - (0.037±3%)·nH + (0.561±3%)·nCIC + (0.681±3%)·Hy
(formula 1-1).
4. The method according to any one of claims 1 to 3, wherein general formula (1) satisfies the following general formula (1-2a) or (1-2b):
L = 16.309±1% + (0.025±1%)·S_A - (14.651±1%)·Mi + (0.064±1%)·SCBO - (0.037±1%)·nH + (0.561±1%)·nCIC + (0.681±1%)·Hy
(formula 1-2a)
L = 16.309±0.5% + (0.025±1%)·S_A - (14.651±0.5%)·Mi + (0.064±1%)·SCBO - (0.037±1%)·nH + (0.561±1%)·nCIC + (0.681±1%)·Hy
(formula 1-2b).
5. The method according to any one of claims 1 to 4, wherein the value of S_A is obtained by quantum chemical calculation software.
6. The method according to any one of claims 1 to 5, wherein Mi, SCBO, nH, nCIC and Hy are obtained from the molecular structure descriptor calculation software Dragon.
7. The method according to any one of claims 1 to 6, wherein the organic compound has one or more functional groups selected from the group consisting of:
a hydrocarbon group, a halogen group, a hydroxyl group, an ether group, an ester group, an aldehyde group, a ketone group, a carboxyl group, a nitrogen-containing group, a phosphorus-containing group, a silicon-containing group, and a sulfur-containing group.
8. An ecological early warning method, characterized in that the early warning method comprises evaluating the ecological risk of a target compound using the method according to any one of claims 1 to 7.
CN202110920233.3A 2021-08-11 2021-08-11 Method for predicting n-hexadecane/air distribution coefficient of organic compound Active CN113591394B (en)


Publications (2)

Publication Number Publication Date
CN113591394A (en) 2021-11-02
CN113591394B (en) 2024-02-23

Family

ID=78257255






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant