CN113948156A - Multitask neural network method for predicting degradation half-life of chemicals in four environment media - Google Patents
- Publication number
- CN113948156A (application number CN202111388088.5A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C10/00—Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/30—Prediction of properties of chemical compounds, compositions or mixtures
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
Abstract
The invention belongs to the technical field of high-throughput prediction for chemical risk management, and discloses a multitask neural network method for predicting the degradation half-life of chemicals in four environmental media. Given the known molecular structure of a chemical, its degradation half-life in the four media is obtained by calculating its molecular fingerprint and applying the constructed model. The method is simple, efficient and low-cost, and reduces the resources spent on experimental testing. The method is constructed as follows: collecting degradation half-life data; calculating PubChem molecular fingerprints; training a multitask neural network model; evaluating model performance with indicators such as the coefficient of determination between measured/estimated and predicted values; and characterizing the model application domain with reference to the OECD guidance. The resulting prediction model shows good goodness of fit, robustness and predictive ability, can effectively predict the degradation half-life in the four environmental media of chemicals within its application domain, and provides a necessary tool for the sound management of chemicals, which gives it practical importance.
Description
Technical Field
The invention belongs to the technical field of high-throughput screening for chemical risk management, and discloses a method, based on a quantitative structure-activity relationship (QSAR) model, for predicting the degradation half-life of chemicals in four environmental media (atmosphere, water, soil and sediment).
Background
Evaluating the environmental persistence of chemicals is a core part of chemical risk management. Environmental persistence refers to the tendency of a chemical to resist degradation and transformation and to remain unchanged in the environment for a long time. The most common indicator of environmental persistence is the degradation half-life (t1/2) of the chemical in an environmental medium, i.e., the time required for degradation to remove half of the initial amount of the chemical from that medium. t1/2 is an important index of the environmental fate of chemicals and a key parameter used by regulations worldwide to assess and control persistent, bioaccumulative and toxic (PBT) chemicals.
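As a worked illustration of the half-life definition, the relation below assumes that degradation follows first-order kinetics with a rate constant k. That kinetic assumption and the symbols C, C_0 and k are introduced here for illustration only; the patent text does not state a kinetic model.

```latex
C(t) = C_0\, e^{-k t}, \qquad
C(t_{1/2}) = \tfrac{1}{2} C_0
\;\Longrightarrow\;
t_{1/2} = \frac{\ln 2}{k}
```

Under this assumption, a chemical with t1/2 = 100 h has a base-10 logarithm log t1/2 = 2, which is the scale (t1/2 in hours) on which the model described later is trained.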
The t1/2 of a chemical in an environmental medium is determined by various degradation reactions (such as biodegradation, hydrolysis, photolysis and atmospheric oxidative degradation). The Organisation for Economic Co-operation and Development (OECD) evaluates the environmental persistence of chemicals mainly on the basis of biodegradability tests, and has issued test guidelines for the biodegradation of chemicals in surface water (OECD guideline 309), sediment (OECD guideline 308) and soil (OECD guideline 307). The biodegradation half-life measured in each medium is compared with the persistence criteria in the relevant regulations (e.g., the REACH regulation) to determine whether a chemical is non-persistent (nP), persistent (P) or very persistent (vP).
Obtaining the t1/2 of chemicals in environmental media by experimental testing is inefficient, time-consuming and costly, so efficient (high-throughput, low-cost) computational prediction techniques are needed. Computational simulation based on quantitative structure-activity relationships (QSAR) can predict the t1/2 of chemicals in environmental media by establishing correlations between the molecular structural features of chemicals and their environmental behavior parameters. With the development of machine learning algorithms, machine-learning-based QSAR has shown strong advantages in mining the intrinsic relationship between the predicted endpoint and molecular features. In particular, multitask learning can learn the information shared among different endpoints through feature- and parameter-sharing mechanisms, so that prediction performance improves while several endpoints are predicted simultaneously. It is therefore expected to play an important role in predicting the t1/2 of chemicals and to help screen out persistent chemicals for priority control.
Some studies have already constructed QSAR models for predicting the t1/2 of chemicals in environmental media. The paper "Water Res., 2019, 157, 181-190" constructed a multiple linear regression model for the biodegradation half-life of aromatic hydrocarbons and their derivatives in water; the paper "J. Cheminformatics, 2018, 10, 10" constructed a k-nearest-neighbor regression model for the biodegradation half-life of hydrocarbons in water; and the paper "Ecotox. Environ. Safe., 2016, 129, 10-15" constructed a support vector machine model for predicting the biodegradation half-life of herbicides in soil. These existing models have small application domains and each predicts only a single endpoint, i.e., the t1/2 of chemicals in a single medium; because the correlation between different endpoints is ignored, their prediction performance is difficult to improve further. In addition, a model for predicting t1/2 in sediment is still lacking.
For these reasons, the invention compiles from the literature the t1/2 data of 250 chemicals in four media (atmosphere, water, soil and sediment). The data set covers organic acids, esters, ethers, ketones, alcohols, phenols, anilines, polycyclic aromatic hydrocarbons, heterocyclic compounds, halogenated hydrocarbons and other chemicals. A multilayer feedforward neural network combined with multitask learning is used to construct a quantitative multitask neural network model that simultaneously predicts the t1/2 of chemicals in the four media, and the application domain of the model is characterized to define its scope of use.
Disclosure of Invention
The invention provides a simple and efficient method for predicting the t1/2 of chemicals in four media. Starting from the SMILES code of a chemical, the method predicts its degradation half-life in the four media simultaneously and provides a basic tool for screening PBT chemicals. The model is built with reference to the OECD guidance on the construction and use of QSAR models, and its robustness and predictive ability are examined through internal and external validation.
The technical scheme of the invention is as follows:
A multitask neural network method for predicting the degradation half-life of chemicals in four environmental media comprises the following steps:
(1) data gathering
The t1/2 values of 250 chemicals in the four media are collected from the literature; the SMILES codes of the chemicals are generated with the RDKit package in Python 3.8.8;
(2) calculating molecular fingerprints of chemicals
Open Babel 2.3.2.2 is used to convert the CSV file containing the SMILES codes of the chemicals into an SDF file; the SDF file is input into PaDEL-Descriptor 2.21 to calculate the PubChem molecular fingerprints of the 250 chemicals;
(3) model training
The PubChem molecular fingerprints of the chemicals are merged with the log t1/2 data. The data set is randomly split into a training set and a validation set at a ratio of 4:1. With the log t1/2 values of the chemicals in the four media (t1/2 unit: h) as dependent variables and the PubChem fingerprints as independent variables, a multitask model is trained using a multilayer feedforward neural network combined with multitask learning. To avoid overfitting, batch processing and Dropout are used. The optimal hyperparameters of the algorithm are determined by grid search. A model is then built with the optimal hyperparameters and used to predict the log t1/2 of the validation-set compounds, which characterizes the external prediction performance of the model.
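The random 4:1 split can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `split_dataset`, the fixed seed and the placeholder chemical identifiers are assumptions made here, not details from the patent.

```python
import random


def split_dataset(records, train_fraction=0.8, seed=0):
    """Shuffle the records and split them into training and validation subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed only for reproducibility
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical placeholder identifiers standing in for the 250 chemicals.
chemicals = [f"chem_{i:03d}" for i in range(250)]
train_set, validation_set = split_dataset(chemicals)
print(len(train_set), len(validation_set))
```

With 250 records and a training fraction of 0.8, the split yields 200 training and 50 validation chemicals, which matches the set sizes reported for FIG. 2.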
The optimal hyperparameters of the model are as follows: the network has two hidden layers, the first with 100 neurons and the second with 10 neurons; a Dropout layer with a Dropout rate of 20% follows the first hidden layer; both hidden layers use the rectified linear unit (ReLU) as the activation function; 16 chemicals are used per training batch and training runs for 300 epochs, i.e., batch size = 16 and epochs = 300; the loss function is the mean squared error (MSE); the optimizer is adaptive moment estimation (Adam); the optimizer step size is 0.005, i.e., learning rate = 0.005; and the four tasks are given equal weight factors during optimization;
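The layer sizes and the equally weighted loss can be illustrated with a dependency-free sketch of the forward pass. Several points are assumptions made for this sketch and are not stated in the patent: the input length of 881 bits (the usual length of the PubChem fingerprint), a linear output layer with one unit per medium, and the random weight initialisation. Training with Adam and Dropout is not shown; Dropout is inactive at inference time.

```python
import math
import random

N_INPUT = 881  # assumption: usual length of the PubChem fingerprint
N_HIDDEN1, N_HIDDEN2, N_TASKS = 100, 10, 4  # layer sizes from the text; 4 media


def init_layer(n_in, n_out, rng):
    """Create small random weights and zero biases for one dense layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, relu):
    """Apply one dense layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


def forward(x, layers):
    """Inference-time forward pass: two ReLU hidden layers, assumed linear output."""
    h = dense(x, layers[0], relu=True)
    h = dense(h, layers[1], relu=True)
    return dense(h, layers[2], relu=False)


def multitask_mse(pred, target, task_weights=(1.0, 1.0, 1.0, 1.0)):
    """Squared error combined over the four tasks with equal weight factors."""
    total = sum(w * (p - t) ** 2 for w, p, t in zip(task_weights, pred, target))
    return total / sum(task_weights)


rng = random.Random(0)
layers = [
    init_layer(N_INPUT, N_HIDDEN1, rng),
    init_layer(N_HIDDEN1, N_HIDDEN2, rng),
    init_layer(N_HIDDEN2, N_TASKS, rng),
]
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
fingerprint = [rng.randint(0, 1) for _ in range(N_INPUT)]  # random bits, not a real molecule
prediction = forward(fingerprint, layers)
print(n_params, len(prediction))
```

Under these assumptions the network has 89,254 trainable parameters, most of them in the first hidden layer, which is one reason regularisation matters for a data set of only 250 chemicals.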
(4) model evaluation
The coefficient of determination (R²) between measured/estimated and predicted values, the root mean square error (RMSE) and the mean absolute error (MAE) of the training set characterize the goodness of fit of the model; the R², RMSE and MAE of the validation set characterize its predictive ability; and the ten-fold cross-validation coefficient (Q²10) of the training set characterizes its robustness.
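The three error statistics can be computed as in the sketch below. It assumes the common definition R² = 1 - SS_res/SS_tot; the patent does not spell out its formula, and the sample values are invented purely to exercise the function.

```python
import math


def regression_metrics(observed, predicted):
    """Return R2, RMSE and MAE for paired observed and predicted values."""
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(r) for r in residuals) / n,
    }


# Invented illustrative values, not data from the patent.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
metrics = regression_metrics(observed, predicted)
print({name: round(value, 3) for name, value in metrics.items()})
```

The same function applies to each of the four media separately, once on the training set and once on the validation set.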
The prediction performance of the final model for log t1/2 (t1/2 unit: h) in each medium is as follows (subscript train: training set; subscript test: validation set):
Medium | R²train | RMSEtrain | MAEtrain | Q²10 | R²test | RMSEtest | MAEtest |
---|---|---|---|---|---|---|---|
Atmosphere | 0.988 | 0.094 | 0.070 | 0.889 | 0.713 | 0.348 | 0.244 |
Water | 0.976 | 0.121 | 0.087 | 0.895 | 0.802 | 0.305 | 0.205 |
Soil | 0.981 | 0.112 | 0.084 | 0.941 | 0.883 | 0.261 | 0.204 |
Sediment | 0.979 | 0.107 | 0.079 | 0.924 | 0.870 | 0.261 | 0.199 |
(5) Application domain characterization
MACCS molecular fingerprints of the chemicals are generated with the RDKit package, and the Tanimoto similarity between a validation-set molecule A and a training-set molecule B is calculated as follows:

$$S_{AB} = \frac{\sum_{j=1}^{n} X_{jA} X_{jB}}{\sum_{j=1}^{n} X_{jA}^{2} + \sum_{j=1}^{n} X_{jB}^{2} - \sum_{j=1}^{n} X_{jA} X_{jB}}$$

where S_AB is the Tanimoto similarity of molecules A and B, X_jA is the jth fingerprint feature of molecule A, X_jB is the jth fingerprint feature of molecule B, and n is the number of feature bits in the fingerprint.
The application domain is defined by a similarity threshold (Scutoff) and a minimum number of similar molecules (Nmin): if the number of training-set chemicals whose Tanimoto similarity to the target molecule is greater than Scutoff exceeds Nmin, the molecule is judged to be within the application domain. The application domain of the invention is defined by Scutoff = 0.6 and Nmin = 5.
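The application-domain rule can be sketched in a few lines. Fingerprints are represented here as sets of on-bit indices, for which the Tanimoto similarity reduces to the size of the intersection divided by the size of the union. The toy fingerprints are invented, and reading "exceeds Nmin" as a strict inequality is an interpretation of the text.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two binary fingerprints given as sets of on-bit indices."""
    common = len(bits_a & bits_b)
    union = len(bits_a) + len(bits_b) - common
    return common / union if union else 0.0


def in_application_domain(target, training_fps, s_cutoff=0.6, n_min=5):
    """Count training molecules more similar than s_cutoff; in domain if the count exceeds n_min."""
    n_similar = sum(1 for fp in training_fps if tanimoto(target, fp) > s_cutoff)
    return n_similar > n_min  # strict reading of "exceeds"


# Invented toy fingerprints, not real MACCS keys.
target = {1, 2, 3, 4, 5}
similar = [{1, 2, 3, 4, 5, 6}] * 6   # similarity 5/6 to the target
dissimilar = [{7, 8, 9}] * 10        # similarity 0 to the target

print(in_application_domain(target, similar + dissimilar))      # six neighbours above the cutoff
print(in_application_domain(target, similar[:5] + dissimilar))  # only five neighbours
```

In practice the on-bit sets would come from MACCS keys generated with RDKit, as the text describes; only the thresholding logic is shown here.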
The advantages of the invention are as follows. The established model predicts the log t1/2 of chemicals in four media simultaneously; because the correlation between different endpoints is taken into account, the prediction performance of the model is greatly improved, and the model has a clearly characterized application domain. The method is simple, efficient and low-cost, is expected to support high-throughput prediction of chemical degradation half-life data, provides a basic tool for the sound management of chemicals, and serves major national needs in chemical risk management and control and in the treatment of emerging pollutants.
Drawings
FIG. 1 shows the overall process flow.
FIG. 2 shows linear fits of the measured/estimated versus predicted log t1/2 values (t1/2 unit: h) of chemicals in the four media; the training set and validation set contain 200 and 50 chemicals, respectively; (a) atmosphere; (b) water; (c) soil; (d) sediment.
Detailed Description
The following further describes a specific embodiment of the present invention with reference to the drawings and technical solutions.
Example 1
Given the chemical 2,2'-dichlorobiphenyl (CAS number: 13029-08-8), its log t1/2 (unit: h) in the four media is to be predicted. First, the MACCS molecular fingerprint is calculated from the SMILES code of 2,2'-dichlorobiphenyl with the RDKit package, and its Tanimoto similarity to each training-set molecule is calculated. Six training-set molecules have a Tanimoto similarity to it greater than 0.6 (Scutoff), which exceeds Nmin = 5, so 2,2'-dichlorobiphenyl is within the application domain of the model. Next, its PubChem molecular fingerprint is calculated with PaDEL-Descriptor, and the multitask neural network model constructed by the invention is used for prediction. The predicted values are:
log t1/2 (atmosphere) = 2.31, log t1/2 (water) = 3.79, log t1/2 (soil) = 4.15, log t1/2 (sediment) = 4.25. The corresponding measured/estimated values are: log t1/2 (atmosphere) = 2.23, log t1/2 (water) = 3.74, log t1/2 (soil) = 4.23, log t1/2 (sediment) = 4.23. The predicted values agree well with the measured/estimated values.
Example 2
Given the chemical cyclohexanol (CAS number: 108-93-0), its log t1/2 (unit: h) in the four media is to be predicted. First, the MACCS molecular fingerprint is calculated from the SMILES code of cyclohexanol with the RDKit package, and its Tanimoto similarity to each training-set molecule is calculated. Eight training-set molecules have a Tanimoto similarity to it greater than 0.6 (Scutoff), which exceeds Nmin = 5, so cyclohexanol is within the application domain of the model. Next, its PubChem molecular fingerprint is calculated with PaDEL-Descriptor, and the multitask neural network model constructed by the invention is used for prediction. The predicted values are:
log t1/2 (atmosphere) = 1.70, log t1/2 (water) = 1.77, log t1/2 (soil) = 1.60, log t1/2 (sediment) = 2.25. The corresponding measured/estimated values are: log t1/2 (atmosphere) = 1.74, log t1/2 (water) = 1.74, log t1/2 (soil) = 1.74, log t1/2 (sediment) = 2.23. The predicted values agree well with the measured/estimated values.
Example 3
Given the chemical n-pentane (CAS number: 109-66-0), its log t1/2 (unit: h) in the four media is to be predicted. First, the MACCS molecular fingerprint is calculated from the SMILES code of n-pentane with the RDKit package, and its Tanimoto similarity to each training-set molecule is calculated. Fifteen training-set molecules have a Tanimoto similarity to it greater than 0.6 (Scutoff), which exceeds Nmin = 5, so n-pentane is within the application domain of the model. Next, its PubChem molecular fingerprint is calculated with PaDEL-Descriptor, and the multitask neural network model constructed by the invention is used for prediction. The predicted values are:
log t1/2 (atmosphere) = 1.30, log t1/2 (water) = 2.75, log t1/2 (soil) = 3.24, log t1/2 (sediment) = 3.74. The corresponding measured/estimated values are: log t1/2 (atmosphere) = 1.23, log t1/2 (water) = 2.74, log t1/2 (soil) = 3.23, log t1/2 (sediment) = 3.74. The predicted values agree well with the measured/estimated values.
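To put the agreement in the three examples on a numerical footing, the short script below computes the mean absolute deviation between the predicted and measured/estimated log t1/2 values quoted above, and converts one predicted value back to hours. The values are copied from Examples 1 to 3; the helper name `mean_abs_deviation` is introduced here for illustration.

```python
# (predicted, measured/estimated) log t1/2 values in the order
# atmosphere, water, soil, sediment, copied from Examples 1 to 3.
examples = {
    "2,2'-dichlorobiphenyl": ([2.31, 3.79, 4.15, 4.25], [2.23, 3.74, 4.23, 4.23]),
    "cyclohexanol": ([1.70, 1.77, 1.60, 2.25], [1.74, 1.74, 1.74, 2.23]),
    "n-pentane": ([1.30, 2.75, 3.24, 3.74], [1.23, 2.74, 3.23, 3.74]),
}


def mean_abs_deviation(predicted, reference):
    """Mean absolute difference between paired values, in log units."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


deviations = {name: mean_abs_deviation(p, r) for name, (p, r) in examples.items()}
for name, value in deviations.items():
    print(f"{name}: {value:.4f} log units")

# Back-transform one prediction: log t1/2 = 2.31 corresponds to 10**2.31 hours.
half_life_hours = 10 ** examples["2,2'-dichlorobiphenyl"][0][0]
print(f"{half_life_hours:.0f} h")
```

The largest single deviation is for cyclohexanol in soil (1.60 predicted against 1.74), so the agreement is good but not uniform across media.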
Claims (2)
1. A multitask neural network method for predicting the degradation half-life of chemicals in four environmental media, comprising the following steps:
(1) data gathering
The degradation half-life (log t1/2) values of 250 chemicals in four media are collected from the literature, and the SMILES codes of the chemicals are generated;
(2) calculating molecular fingerprints of chemicals
The CSV file containing the SMILES codes of the chemicals is converted into an SDF file; the PubChem molecular fingerprints of the 250 chemicals are calculated from the SDF file;
(3) model training
The PubChem molecular fingerprints of the chemicals are merged with the log t1/2 data; the data set is randomly split into a training set and a validation set at a ratio of 4:1; with the log t1/2 values of the chemicals in the four media as dependent variables and the PubChem molecular fingerprints as independent variables, a multitask model is trained using a multilayer feedforward neural network combined with multitask learning; the optimal hyperparameters of the algorithm are determined by grid search; a model is built with the optimal hyperparameters and used to predict the degradation half-life data of the validation-set chemicals, which characterizes the external prediction performance of the model;
the optimal hyperparameters of the model are as follows: the network has two hidden layers, the first with 100 neurons and the second with 10 neurons; a Dropout layer with a Dropout rate of 20% follows the first hidden layer; both hidden layers use the rectified linear unit as the activation function; 16 chemicals are used per training batch and training runs for 300 epochs, i.e., batch size = 16 and epochs = 300; the loss function is the mean squared error (MSE); the optimizer is adaptive moment estimation; the optimizer step size is 0.005, i.e., learning rate = 0.005; and the four tasks are given equal weight factors during optimization;
(4) model performance assessment
The coefficient of determination R² between measured/estimated and predicted values, the root mean square error RMSE and the mean absolute error MAE of the training set characterize the goodness of fit of the model; the R², RMSE and MAE of the validation set characterize its predictive ability; the ten-fold cross-validation coefficient Q²10 of the training set characterizes its robustness;
(5) application domain characterization
MACCS molecular fingerprints of the chemicals are generated, and the Tanimoto similarity between a validation-set molecule A and a training-set molecule B is calculated as follows:

$$S_{AB} = \frac{\sum_{j=1}^{n} X_{jA} X_{jB}}{\sum_{j=1}^{n} X_{jA}^{2} + \sum_{j=1}^{n} X_{jB}^{2} - \sum_{j=1}^{n} X_{jA} X_{jB}}$$

where S_AB is the Tanimoto similarity of molecules A and B, X_jA is the jth fingerprint feature of molecule A, X_jB is the jth fingerprint feature of molecule B, and n is the number of feature bits in the fingerprint;
the application domain is defined by a user-defined similarity threshold Scutoff and a minimum number of similar molecules Nmin: if the number of training-set chemicals whose Tanimoto similarity to the target molecule is greater than Scutoff exceeds Nmin, the molecule is judged to be within the application domain.
2. The method of claim 1, wherein the application domain is defined by: Scutoff = 0.6, Nmin = 5.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111217996 | 2021-10-20 | ||
CN2021112179968 | 2021-10-20 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113948156A (en) | 2022-01-18 |
CN113948156B (en) | 2024-05-07 |
Family
ID=79338398
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111388088.5A Active CN113948156B (en) | 2021-10-20 | 2021-11-22 | Multitasking neural network method for predicting degradation half-life of chemicals in four environmental media |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113948156B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101027357A (en) * | 2004-07-27 | 2007-08-29 | 陶氏环球技术公司 | Thermoplastic vulcanizates and process to prepare them |
WO2013079016A1 (en) * | 2011-11-30 | 2013-06-06 | 大连理工大学 | Method for predicting oxidation reaction rate constant between chemicals and ozone based on molecular structure and ambient temperature |
CN107967542A (en) * | 2017-12-21 | 2018-04-27 | 国网浙江省电力公司丽水供电公司 | A kind of electricity sales amount Forecasting Methodology based on shot and long term memory network |
US20180268282A1 (en) * | 2017-03-17 | 2018-09-20 | Wipro Limited. | Method and system for predicting non-linear relationships |
CN112466399A (en) * | 2020-11-19 | 2021-03-09 | 大连理工大学 | Method for predicting mutagenicity of chemicals through machine learning algorithm |
CN112750510A (en) * | 2021-01-18 | 2021-05-04 | 合肥工业大学 | Method for predicting permeability of blood brain barrier of medicine |
Non-Patent Citations (2)
Title |
---|
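Zhang Wenhao; Chen Jingwen; Xu Tong; Wang Ya: "QSAR models for the biological half-lives of xenobiotic compounds in fish", 生态毒理学报 (Asian Journal of Ecotoxicology), no. 003, 31 December 2019 (2019-12-31) * |
Fan Deling; Song Bo; Liu Jining; Wang Lei; Zhou Linjun; Shi Lili: "Quantitative prediction model for the n-octanol/air partition coefficient of chemicals", 生态与农村环境学报 (Journal of Ecology and Rural Environment), vol. 31, no. 2, 25 March 2015 (2015-03-25) * |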
Also Published As
Publication number | Publication date |
---|---|
CN113948156B (en) | 2024-05-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||