CN112885415B - Quick screening method for estrogen activity based on molecular surface point cloud - Google Patents

Info

Publication number
CN112885415B
Authority
CN
China
Prior art keywords
chemical
point cloud
dimensional structure
neural network
activity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110092707.XA
Other languages
Chinese (zh)
Other versions
CN112885415A (en)
Inventor
刘娴
张爱茜
王理国
薛峤
潘文筱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Research Center for Eco Environmental Sciences of CAS
Original Assignee
Research Center for Eco Environmental Sciences of CAS

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

A method of constructing an estrogenic activity prediction model, a method of screening for estrogenic activity, an electronic device, and a computer-readable storage medium. The method of constructing the estrogenic activity prediction model comprises: obtaining chemical data of known estrogenic activity, the chemical data comprising initial three-dimensional structure information of each chemical; optimizing the initial three-dimensional structure information to obtain optimized three-dimensional structure information; obtaining a molecular surface point cloud matrix of the chemical based on the optimized three-dimensional structure information; and training a convolutional neural network model with the molecular surface point cloud matrix as input to obtain the estrogen activity prediction model. The deep artificial neural network model constructed by the invention does not need manually defined, quantifiable structural parameters as molecular descriptors, saves the time and computing resources spent on calculating and selecting molecular descriptors, and requires less computational chemistry background to apply.

Description

Quick screening method for estrogen activity based on molecular surface point cloud
Technical Field
The invention relates to the technical field of chemical environmental health risk evaluation, in particular to a rapid estrogen activity screening method based on molecular surface point cloud.
Background
A growing number of environmental chemicals have been found to have estrogenic activity: they can mimic the biological behavior of estrogens in the human body, thereby interfering with the normal function of the endocrine system and causing adverse health effects. The endocrine-disrupting effects of exogenous compounds, particularly pollutants, have attracted considerable public attention. To protect humans from such potential risks, governments must strictly assess the estrogen-like activity of chemicals that may come into contact with humans and regulate their production and use. However, of the many thousands of chemicals present in the environment, only a very small fraction have in vitro test results for estrogen-like activity, and the activity of a large number of chemicals remains to be evaluated. Most existing activity evaluation methods rely on in vivo or in vitro experiments, which often consume a great deal of time and experimental resources and are not suitable for evaluating a huge number of chemicals. The chemicals that the United States Environmental Protection Agency (USEPA) deems to pose a risk of human exposure alone already exceed 30,000.
Quantitative structure-activity relationships (QSAR) are therefore an important tool for chemical activity evaluation: they build qualitative or quantitative activity prediction models from molecular structure information, based on the relationship between the structures and the properties of known chemicals. This approach greatly improves the efficiency of chemical activity evaluation and is one of the important tools for chemical management. However, because molecular structure is hard to characterize and compute directly, a traditional QSAR prediction model needs a number of molecular descriptors to be defined and calculated in advance to describe the molecular structure; there are thousands of such descriptors, covering molecular constitution, molecular fingerprints, topological indices, three-dimensional structural features and so on. Limited by the modeling method itself, feeding in a large number of descriptors that are unrelated to the property of interest or that overlap in meaning can lead to multicollinearity, making the model less robust and increasing computational complexity. In practice, molecular descriptors often have to be pre-screened to cull redundant, highly correlated, and poorly representative descriptors, which can require a significant amount of effort. In addition, defining and calculating molecular descriptors from prior knowledge or experience often omits important molecular structure information, which limits the application and prediction performance of QSAR models to a certain extent. With the continued rise of deep learning, deep neural network models have achieved excellent results in numerous fields, especially computer vision and natural language processing, which suggests their potential for recognizing chemical molecules and hence predicting molecular properties.
Unlike traditional machine learning, a deep neural network model has a more flexible structure, so it can accept richer and more varied input information and is not limited to manually defined descriptive features; this reduces the data preparation required before the model is used and can greatly improve prediction performance. Many studies have therefore attempted to construct deep learning models that take one-dimensional or two-dimensional molecular structure representations as input, such as one-dimensional molecular structure codes and two-dimensional molecular structure diagrams. This approach still has some problems: first, such representations cannot describe the steric information of a molecule, such as the orientation of groups and bond lengths; second, the description of atoms in the molecule is too simple and ignores the influence of the surrounding environment on the properties of each atom. The absence of this structural information limits the predictive capability of the model.
In summary, although QSAR mathematical prediction models built on traditional machine learning algorithms have greatly advanced chemical evaluation and rapid property screening, they struggle to achieve adequate prediction performance in complex systems because of the limitations of the available descriptors; moreover, calculating and collecting descriptors requires time, computing resources and a certain disciplinary background, which also limits the application of such models. A deep learning model that can accept richer input information is therefore needed, one that maps chemical structure directly to properties, reduces the data preparation required before the model is used, and improves predictive capability.
Disclosure of Invention
Accordingly, it is a primary object of the present invention to provide a method for constructing an estrogen activity prediction model, a method for screening estrogen activity, an electronic device and a computer readable storage medium, so as to at least partially solve at least one of the above problems.
In order to achieve the above object, as one aspect of the present invention, there is provided a method for constructing an estrogen activity prediction model, comprising:
S1, acquiring chemical data of known estrogenic activity, wherein the chemical data comprises initial three-dimensional structure information of each chemical;
S2, optimizing the initial three-dimensional structure information to obtain optimized three-dimensional structure information;
S3, obtaining a molecular surface point cloud matrix of the chemical based on the optimized three-dimensional structure information;
S4, training a convolutional neural network model by taking the molecular surface point cloud matrix as input, to obtain the estrogen activity prediction model.
As another aspect of the present invention, there is also provided a method for screening estrogenic activity using the estrogen activity prediction model obtained by the construction method described above, comprising:
converting the initial three-dimensional structure information of the chemical to be evaluated into a molecular surface point cloud matrix, and inputting the molecular surface point cloud matrix into the estrogen activity prediction model to obtain a predicted estrogen activity value;
if the predicted value is greater than or equal to a preset threshold, the chemical is considered to have estrogenic activity; if the predicted value is less than the preset threshold, the chemical is considered to have no estrogenic activity.
As still another aspect of the present invention, there is also provided an electronic apparatus including:
one or more processors;
a memory for storing one or more instructions,
wherein the one or more instructions, when executed by the one or more processors, cause the one or more processors to implement the screening method as described above.
As yet another aspect of the present invention, there is also provided a computer readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to implement a screening method as described above.
Based on the above technical solutions, the method for constructing an estrogen activity prediction model, the method for screening estrogen activity, the electronic device and the computer-readable storage medium of the present invention have at least one or some of the following advantages over the prior art:
(1) Unlike a mathematical prediction model built with a traditional machine learning algorithm, the deep artificial neural network model constructed by the invention does not need manually defined, quantifiable structural parameters as molecular descriptors, which saves the time and computing resources spent on calculating and selecting molecular descriptors and lowers the computational chemistry background required to apply the model;
(2) The method represents three-dimensional molecular structure information with a molecular surface point cloud and builds a deep artificial neural network prediction model on it, raising the upper limit of the model's predictive capability; this approach has not previously been applied to the evaluation of the estrogenic activity of chemicals;
(3) Compared with existing methods, the method offers high-precision prediction and is suitable for accurate, rapid screening of the estrogenic activity of chemicals on a large scale; it has broad application prospects in fields such as chemical risk evaluation and environmental safety evaluation;
(4) By using the precise information in the surface point cloud of the three-dimensional molecular structure together with the flexible structure of a convolutional neural network, the invention reduces the information loss that traditional prediction models incur by using molecular descriptors, greatly improves the predictive capability of the model, and achieves high-precision prediction of the estrogenic activity of chemicals; in addition, because the invention does not depend on molecular descriptors and directly links three-dimensional molecular structure to activity, it helps guide the synthesis and design of chemical structures with specific properties, and has broad application prospects in fields such as rapid screening and design of chemicals.
Drawings
FIG. 1 is a flow chart of chemical evaluation using a molecular surface point cloud-based estrogen activity high-precision model prediction method in an embodiment of the invention;
FIG. 2 is a schematic diagram of a deep neural network according to embodiment 1 of the present invention;
FIG. 3 is a graphical representation of the surface point cloud of estradiol in embodiment 1 of the present invention.
Detailed Description
The present invention will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent.
According to a literature survey of existing models for predicting the estrogenic activity of chemicals, the related methods and technologies have shortcomings. QSAR models built with traditional machine learning rely on manually defined, quantifiable molecular structure parameters as molecular descriptors, which may lose molecular structure information, while redundant descriptors and multicollinearity can severely degrade prediction performance. The invention aims to provide a deep artificial neural network model based on molecular surface point clouds that saves the time and computing resources needed to collect and calculate molecular descriptors and that, compared with other similar methods in current research, has the best predictive capability.
The basic principle of the invention is as follows. The three-dimensional structure of a molecule is characterized by the three-dimensional coordinates and electrostatic potential of grid points on the molecular surface. The convolution operations in the convolutional network extract the molecular structure information contained in the coordinates and electrostatic potential of each grid point, and a global pooling operation extracts structural features at the scale of the whole molecule. Studies have shown that deep neural networks can approximate arbitrary mathematical functions, so the fully connected layers of the model can establish a functional relationship between the structure information extracted by the convolutions and a specific molecular property, thereby predicting that property.
The invention provides, for the first time, a high-precision estrogen activity prediction method based on molecular surface point clouds. Before the completion of the invention, no report had been found of predicting the estrogenic activity of chemicals with a deep convolutional neural network model that takes molecular surface point clouds directly as input.
The invention discloses a construction method of an estrogen activity prediction model, which comprises the following steps:
S1, acquiring chemical data of known estrogenic activity, wherein the chemical data comprises initial three-dimensional structure information of each chemical;
S2, optimizing the initial three-dimensional structure information to obtain optimized three-dimensional structure information;
S3, obtaining a molecular surface point cloud matrix of the chemical based on the optimized three-dimensional structure information;
S4, training a convolutional neural network model by taking the molecular surface point cloud matrix as input, to obtain the estrogen activity prediction model.
In some embodiments of the present invention, step S3 specifically includes: based on the optimized three-dimensional structure information, calculating the electrostatic potential and three-dimensional coordinates of grid points on the molecular surface; and randomly sampling M points from the grid points on the molecular surface as a point cloud representing the three-dimensional structure of the molecule, expressed as a 4 × M numerical matrix.
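The sampling in step S3 can be illustrated with a minimal sketch. It assumes the surface grid points are already available as (x, y, z, esp) tuples and that sampling is done without replacement; the source specifies neither, and the function name `sample_point_cloud` and the toy data are hypothetical.

```python
import random


def sample_point_cloud(surface_points, m, seed=None):
    """Randomly pick m surface grid points and return them as a 4 x m matrix.

    surface_points: list of (x, y, z, esp) tuples, one per surface grid point.
    Returns four rows [xs, ys, zs, esps], each of length m.
    """
    if m > len(surface_points):
        raise ValueError("m exceeds the number of available surface points")
    rng = random.Random(seed)
    # Assumption: sampling without replacement.
    chosen = rng.sample(surface_points, m)
    # Transpose the m x 4 list of points into a 4 x m matrix.
    return [list(row) for row in zip(*chosen)]


# Toy surface with made-up coordinates and ESP values (illustration only).
toy_surface = [(float(i), 0.5 * i, -float(i), 0.01 * i) for i in range(10)]
cloud = sample_point_cloud(toy_surface, m=4, seed=0)
```

Each column of `cloud` is one sampled grid point, so the matrix can be fed to a one-dimensional convolution with 4 input channels.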
In some embodiments of the present invention, in step S4, training the convolutional neural network model using the molecular surface point cloud matrix as an input specifically includes:
S4.1, randomly dividing the obtained chemical data into a training set and a validation set in a given proportion;
S4.2, training the convolutional neural network model with the training set, and determining the optimal hyperparameters of the convolutional neural network model with the validation set, to obtain the optimal convolutional neural network model, namely the estrogen activity prediction model.
In some embodiments of the invention, in step S4.1, the proportion of active molecules is the same in the training set and the validation set.
In some embodiments of the present invention, in step S4.2, the convolutional neural network model includes $n_{cv}$ convolutional layers and $n_{fc}$ fully connected layers.
In some embodiments of the present invention, the $i$-th convolutional layer comprises $\mathrm{channel}_i$ convolution kernels of size $\mathrm{channel}_{i-1} \times k_i$, the convolution stride is $\mathrm{stride}_i$, and the output is data of size $\mathrm{channel}_i \times L_i$, where $L_i$ is the output length of the $i$-th layer [the equation defining $L_i$ is missing from the source text].
In some embodiments of the present invention, the output of the last convolutional layer is a matrix of size $\mathrm{channel}_{n_{cv}} \times L_{n_{cv}}$, which is converted into a vector of length $\mathrm{channel}_{n_{cv}}$ as input to the fully connected layers.
In some embodiments of the present invention, each node of the current layer in the fully connected layers is connected to all nodes of the previous layer, and the layers have their respective numbers of nodes [symbols missing from the source text]; the last of the fully connected layers, the output layer, outputs the predicted value $s$ of the estrogen activity of the chemical.
In some embodiments of the present invention, in step S4.2, the method for determining the optimal hyperparameters of the convolutional neural network model includes:
S4.2.1, presetting a set of model hyperparameters {α, λ, batch size}, wherein α is the learning rate, λ is the weight-decay regularization parameter, and batch size is the number of samples per batch;
S4.2.2, iteratively training the convolutional neural network model based on the preset hyperparameters and the training set data;
S4.2.3, optimizing the hyperparameters {α, λ, batch size} using the compounds of the validation set to obtain the optimal hyperparameters.
In some embodiments of the present invention, the optimal hyperparameters in step S4.2.3 are the hyperparameters at which the statistical parameter is optimal.
In some embodiments of the invention, the statistical parameter comprises at least one of true positives, true negatives, false positives, false negatives, sensitivity, specificity, accuracy, and balanced accuracy.
The invention also discloses a method for screening estrogenic activity, which uses the estrogen activity prediction model obtained by the construction method described above and comprises the following steps:
converting the initial three-dimensional structure information of the chemical to be evaluated into a molecular surface point cloud matrix, and inputting the molecular surface point cloud matrix into the estrogen activity prediction model to obtain a predicted estrogen activity value;
if the predicted value is greater than or equal to a preset threshold, the chemical is considered to have estrogenic activity; if the predicted value is less than the preset threshold, the chemical is considered to have no estrogenic activity.
In some embodiments of the present invention, the method for determining the preset threshold includes:
converting the initial three-dimensional structure information of a plurality of known chemicals into molecular surface point cloud matrices, and inputting them into the prediction model to obtain a set S of predicted estrogen activity values;
sorting S from largest to smallest, and calculating the true positive rate and the false positive rate from the predicted value of each chemical and its corresponding activity label, the activity labels being obtained in step S1; plotting the false positive rate on the x axis and the true positive rate on the y axis to obtain a receiver operating characteristic (ROC) curve;
finding the point t on the ROC curve at which the true positive rate is largest relative to the false positive rate, and taking the corresponding predicted value $s_t$ as the preset threshold for determining estrogenic activity.
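The threshold selection can be sketched as follows. The sketch assumes that the point where the true positive rate is largest relative to the false positive rate means the ROC point with the largest difference TPR − FPR (Youden's J statistic); the source does not state the exact criterion. The function names and the toy scores are hypothetical.

```python
def roc_threshold(scores, labels):
    """Return (threshold, fpr, tpr) at the ROC point that maximizes TPR - FPR.

    scores: predicted activity values in [0, 1]; labels: 1 = active, 0 = inactive.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one active and one inactive chemical")
    # Sort predictions from largest to smallest, as described in the text.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    best_j, best = float("-inf"), None
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Chemicals with identical scores enter the positive set together.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        if tpr - fpr > best_j:  # assumed criterion: Youden's J
            best_j, best = tpr - fpr, (score, fpr, tpr)
    return best


def is_estrogenic(score, threshold):
    """Decision rule from the text: active if the score is >= the threshold."""
    return score >= threshold


# Made-up predictions for eight chemicals (illustration only).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
threshold, fpr, tpr = roc_threshold(scores, labels)
```

With these toy values the selected threshold is 0.6, where all three actives are recovered at a false positive rate of 0.2.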
The invention also discloses an electronic device, comprising:
one or more processors;
a memory for storing one or more instructions,
wherein the one or more instructions, when executed by the one or more processors, cause the one or more processors to implement the screening method as described above.
The invention also discloses a computer readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to implement a screening method as described above.
In one embodiment of the invention, a method for predicting the estrogenic activity of a chemical is disclosed, comprising the following steps: obtaining the molecular structure of each chemical and its estrogen activity class; optimizing the three-dimensional molecular structure of the chemical, and calculating the electrostatic potential and three-dimensional coordinates of grid points on the surface of the optimized structure; randomly sampling M grid points for each molecule, and building a convolutional neural network prediction model of estrogen activity based on the molecular surface point cloud, with the electrostatic potential and three-dimensional coordinates of the sampled grid points (namely the molecular surface point cloud) as input and the estrogen activity value of the molecule as output; determining a preset threshold for estrogen activity from the predicted activity values and the performance evaluation of the prediction model; and taking the surface point cloud of the chemical to be tested as input, and combining the model prediction with the preset threshold to judge whether the chemical has estrogen activity. By using the precise information in the surface point cloud of the three-dimensional molecular structure together with the flexible structure of the convolutional neural network, the invention reduces the information loss that traditional prediction models incur by using molecular descriptors, greatly improves the predictive capability of the model, and achieves high-precision prediction of estrogen activity.
In addition, because the invention does not depend on molecular descriptors and directly links three-dimensional molecular structure to activity, it helps guide the synthesis and design of chemicals with specific properties, and has broad application prospects in fields such as rapid screening and design of chemicals.
Specifically, in a preferred embodiment of the invention, a chemical estrogen activity high-precision model prediction method based on deep learning and chemical molecule surface point cloud is disclosed, which comprises the following steps:
step 1: chemical data for known estrogenic activity is obtained from public databases or literature, including binary classes of chemical estrogenic activity and initial three-dimensional structural information for the chemical. And calculating electrostatic potential and three-dimensional coordinate parameters of the molecular surface based on the optimized chemical structure, and randomly sampling M points to be used as point clouds for representing the three-dimensional structure of the molecule.
Specifically, the present step comprises the following sub-steps:
Substep 11, obtaining chemical data of known estrogenic activity from a public database or the literature, including the initial three-dimensional structure file and the binary activity class of each chemical (1 represents active and 0 represents inactive).
Substep 12, optimizing the three-dimensional structure of the molecule at the B3LYP/6-31G(d) level (B3LYP functional, 6-31G(d) basis set) in Gaussian 09 software.
Substep 13, using Multiwfn software to calculate, from the optimized chemical structure, the three-dimensional coordinates and electrostatic potential (x, y, z, esp) of molecular surface grid points spaced 0.1-0.5 Bohr apart. M points are randomly sampled from the grid points on the molecular surface as a point cloud representing the three-dimensional structure of the molecule, expressed as a 4 × M matrix.
Step 2: the obtained chemical data are randomly divided into a training set and a verification set, and a convolutional neural network model which takes a molecular surface point cloud consisting of M lattice points as input is constructed.
Specifically, the present step comprises the following sub-steps:
Substep 21, randomly dividing the chemical data into a training set and a validation set in a given proportion while ensuring that the proportion of active molecules is the same in both sets. t% of the data is used as the training set for training the convolutional neural network model; v% of the data is used as the validation set for the model hyperparameter search and the assessment of predictive ability.
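A split that keeps the proportion of active molecules equal in both sets is commonly called a stratified split. The following is a minimal sketch under the assumption that each class is shuffled and cut separately; the function name `stratified_split`, the 80/20 ratio and the toy labels are hypothetical.

```python
import random


def stratified_split(labels, train_fraction, seed=None):
    """Split sample indices so both sets keep the same active/inactive ratio.

    labels: list of 0/1 activity classes. Returns (train_idx, val_idx).
    The ratio is matched only as closely as rounding allows.
    """
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_fraction)
        train_idx += idx[:cut]
        val_idx += idx[cut:]
    return sorted(train_idx), sorted(val_idx)


# Toy data set: 20 actives and 80 inactives (illustration only).
labels = [1] * 20 + [0] * 80
train_idx, val_idx = stratified_split(labels, train_fraction=0.8, seed=42)
```

Here both the training set (80 chemicals) and the validation set (20 chemicals) contain 20% actives.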
Substep 22, the convolutional neural network model constructed in the present invention comprises $n_{cv}$ convolutional layers and $n_{fc}$ fully connected layers, where the values of $n_{cv}$ and $n_{fc}$ can each be set as needed, for example 3 to 8. The 4 × M numerical matrix of each chemical in the training set is used as the model input.
The first convolutional layer is a one-dimensional convolutional layer comprising $\mathrm{channel}_1$ convolution kernels of size $4 \times k_1$, where $k_1$ can be one of [1, 3, 5, 7, 9], with convolution stride $\mathrm{stride}_1$. Batch normalization is then applied so that the input of each layer of the neural network keeps the same distribution during training, and a rectified linear unit (ReLU) is used as the activation function to convert linear features in the neural network into nonlinear features. The output is data of size $\mathrm{channel}_1 \times L_1$ [the equation defining $L_1$ is missing from the source text].
The second convolutional layer is a one-dimensional convolutional layer comprising $\mathrm{channel}_2$ convolution kernels of size $\mathrm{channel}_1 \times k_2$, with convolution stride $\mathrm{stride}_2$. Batch normalization is then applied so that the input of each layer of the neural network keeps the same distribution during training, and the ReLU function is used as the activation function to convert linear features in the neural network into nonlinear features. The output is data of size $\mathrm{channel}_2 \times L_2$.
The third through the last ($n_{cv}$-th) convolutional layers use a convolution structure similar to the first two. That is, the $i$-th layer comprises $\mathrm{channel}_i$ convolution kernels of size $\mathrm{channel}_{i-1} \times k_i$, where $k_i$ can be one of [1, 3, 5, 7, 9], with convolution stride $\mathrm{stride}_i$, and its output is data of size $\mathrm{channel}_i \times L_i$. The output of the last convolutional layer is a matrix of size $\mathrm{channel}_{n_{cv}} \times L_{n_{cv}}$; it undergoes global max pooling and is converted into a vector of length $\mathrm{channel}_{n_{cv}}$ as input to the fully connected layers.
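The convolution, ReLU and global max pooling steps can be traced on a tiny input with the sketch below. It assumes an unpadded convolution, so the output length is (L_in − k) // stride + 1; the source's own length formula is not available, so this is an assumption. Batch normalization is omitted for brevity, and the kernel weights are made up rather than trained.

```python
def conv1d(x, kernels, stride=1):
    """Unpadded 1D convolution (cross-correlation) over a channels x length matrix.

    x: in_channels x length (list of rows).
    kernels: out_channels x in_channels x k weights.
    Returns out_channels x L_out with L_out = (length - k) // stride + 1.
    """
    length = len(x[0])
    k = len(kernels[0][0])
    l_out = (length - k) // stride + 1
    out = []
    for kernel in kernels:
        row = []
        for pos in range(l_out):
            start = pos * stride
            acc = 0.0
            for c, weights in enumerate(kernel):
                for j, w in enumerate(weights):
                    acc += w * x[c][start + j]
            row.append(acc)
        out.append(row)
    return out


def relu(matrix):
    """Element-wise max(0, v)."""
    return [[max(0.0, v) for v in row] for row in matrix]


def global_max_pool(matrix):
    """Collapse a channels x length matrix into one maximum per channel."""
    return [max(row) for row in matrix]


# Toy 4 x 6 point cloud: rows are x, y, z, esp (made-up values).
cloud = [
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [-1.0, 0.5, -0.5, 1.0, 0.0, -2.0],
]
# Two hypothetical kernels of size 4 x 3 (untrained, illustration only).
kernels = [
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
    [[-1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
]
hidden = relu(conv1d(cloud, kernels, stride=1))  # size 2 x 4
features = global_max_pool(hidden)               # length 2
```

Global max pooling makes the feature vector length depend only on the number of channels, not on the length of the convolution output.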
Each node of the current layer in the fully connected layers is connected to all nodes of the previous layer, and the layers have their respective numbers of nodes [symbols missing from the source text]. For every layer except the last, a ReLU activation function converts linear features in the neural network into nonlinear features.
The last of the fully connected layers, namely the output layer, has 1 node. A sigmoid activation function transforms its output into the range 0-1, giving the predicted value s of the estrogen activity of the chemical.
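The fully connected part can be sketched in a few lines: ReLU after every layer except the last, and a sigmoid on the single output node. The layer sizes and weights below are hypothetical and untrained; they only show how a pooled feature vector is mapped to a score between 0 and 1.

```python
import math


def dense(vec, weights, bias):
    """One fully connected layer: every output node sees every input node."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def sigmoid(z):
    """Numerically stable logistic function mapping any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_activity(features, layers):
    """layers: list of (weights, bias); the last layer must have exactly one node."""
    h = features
    for weights, bias in layers[:-1]:
        h = [max(0.0, v) for v in dense(h, weights, bias)]  # ReLU
    weights, bias = layers[-1]
    (z,) = dense(h, weights, bias)
    return sigmoid(z)


# Hypothetical two-layer head with made-up weights (illustration only).
layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),  # hidden layer, 2 nodes
    ([[2.0, 2.0]], [-2.0]),                    # output layer, 1 node
]
s = predict_activity([1.0, 0.0], layers)
```

For this toy input the pre-activation of the output node is 0, so the predicted value is 0.5.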
Step 3: and training the model by using the chemical data in the training set, performing prediction verification on the chemical data in the verification set, and searching and determining the optimal super-parameter combination of the convolutional neural network model.
Specifically, the present step comprises the following sub-steps:
in sub-step 31, a set of model superparameters { α, λ, batch size } is preset. Wherein the learning rate α and the batch size are used to control the progress of convergence to a local minimum, and the weight decay L2 regularization term parameter λ is used to reduce model complexity, preventing model overfitting. The batch size is the number of samples selected before each adjustment of the parameters.
In sub-step 32, the model is trained on the training-set data with the preset hyperparameters for E epochs, and the model parameters of each epoch are saved. The model from each epoch is used to predict the validation set, and the statistical parameters true positive (TP), true negative (TN), false positive (FP), false negative (FN), sensitivity (Se), specificity (Sp), accuracy (Acc) and balanced accuracy (BA) are calculated to evaluate the model.
TP: the number of samples in the validation set that are predicted to be positive and are actually positive
FP: the number of samples in the validation set that are predicted to be positive and are actually negative
FN: the number of samples in the validation set that are predicted to be negative and are actually positive
TN: the number of samples in the validation set that are predicted to be negative and are actually negative
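The statistics listed above can be computed from the four counts. The sketch below uses the standard textbook definitions (Se = TP / (TP + FN), Sp = TN / (TN + FP), Acc = (TP + TN) / N, BA = (Se + Sp) / 2); the patent text names these quantities but does not spell out the formulas, so the definitions and the function name are assumptions.

```python
def classification_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived statistics for 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Guard against empty classes so the ratios never divide by zero.
    se = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    acc = (tp + tn) / len(pairs) if pairs else 0.0
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn,
            "Se": se, "Sp": sp, "Acc": acc, "BA": (se + sp) / 2}


# Toy validation set: 3 active (1) and 5 inactive (0) chemicals.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 0]
print(classification_metrics(actual, predicted))
```

Balanced accuracy weights both classes equally, which is useful when, as in Example 1, inactive chemicals greatly outnumber active ones.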
To avoid overfitting and improve the generalization ability of the model, the model whose validation-set balanced accuracy BA is best among the E epochs is selected.
In sub-step 33, E epochs of training are performed on the compounds of the training set with each set of model hyperparameters, and the model is evaluated on the compounds of the validation set, so as to search for and optimize the hyperparameters {α, λ, batch size}. The set of hyperparameters {α_max, λ_max, batchsize_max} finally obtained is taken as the optimal solution.
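One way to picture sub-steps 31 to 33 is as a search loop that trains once per hyperparameter combination and keeps the combination and epoch with the best validation BA. The sketch below assumes an exhaustive grid; the patent does not state the search strategy beyond a range and a step size. The `train_and_validate` callback is hypothetical, and `fake_train_and_validate` is a stand-in scorer that trains nothing and exists only to make the example runnable.

```python
from itertools import product


def search_hyperparameters(train_and_validate, alphas, lambdas, batch_sizes, epochs):
    """Grid-search {alpha, lambda, batch size}, keeping the best validation BA.

    train_and_validate is a hypothetical callback that trains for `epochs`
    epochs and returns the validation BA measured after each epoch.
    """
    best = None
    for alpha, lam, batch_size in product(alphas, lambdas, batch_sizes):
        ba_per_epoch = train_and_validate(alpha, lam, batch_size, epochs)
        # Select the epoch with the highest validation balanced accuracy.
        best_index = max(range(len(ba_per_epoch)), key=ba_per_epoch.__getitem__)
        candidate = {"alpha": alpha, "lambda": lam, "batch_size": batch_size,
                     "epoch": best_index + 1, "BA": ba_per_epoch[best_index]}
        if best is None or candidate["BA"] > best["BA"]:
            best = candidate
    return best


def fake_train_and_validate(alpha, lam, batch_size, epochs):
    """Stand-in scorer for demonstration only; it does not train a model."""
    peak = (0.8 - abs(alpha - 0.001) * 100 - abs(lam - 0.001) * 100
            - abs(batch_size - 64) / 1000)
    return [peak - 0.01 * abs(epoch - 3) for epoch in range(1, epochs + 1)]


best = search_hyperparameters(fake_train_and_validate,
                              alphas=[0.01, 0.001, 0.0001],
                              lambdas=[0.01, 0.001],
                              batch_sizes=[32, 64, 128],
                              epochs=5)
print(best)
```

In a real run the callback would wrap model training and the validation-set evaluation, and each epoch's parameters would be saved so that the selected epoch can be restored.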
Step 4: The obtained model is used to calculate predicted scores of estrogen activation activity for the chemicals in the validation set, and a decision threshold (i.e., the preset threshold) for estrogen activity is determined.
Specifically, the present step comprises the following sub-steps:
In sub-step 41, the validation set is predicted using the optimal prediction model obtained through training, giving the predicted values of the chemicals in the validation set.
A set S of predicted values of estrogen activity for all chemicals in the validation set is obtained. S is sorted from largest to smallest; a predicted activity label is obtained from the predicted value of each compound and a threshold s, and is compared with the actual activity label (1 or 0) obtained in sub-step 11 to calculate the true positive rate (TPR) and the false positive rate (FPR). The receiver operating characteristic (ROC) curve is plotted with FPR on the x-axis and TPR on the y-axis. The value s_t corresponding to the point t at which the rate of change of TPR relative to FPR is greatest on the ROC curve is taken as the preset threshold for determining estrogen activation activity.
$$s_t = \arg\max\, \mathrm{TPR}''(\mathrm{FPR})$$
In sub-step 42, the predicted value s_t corresponding to the point t at which the rate of change of the true positive rate (TPR) relative to the false positive rate (FPR) is greatest on the receiver operating characteristic curve is taken as the preset threshold for determining estrogen activation activity.
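The ROC construction in sub-step 41 can be sketched as a sweep over candidate thresholds taken from the predicted values. The code below only builds the (threshold, FPR, TPR) points; it deliberately does not implement the selection of s_t, because the text does not say how the rate of change is evaluated on a discrete curve. The function name and the toy scores are assumptions for illustration.

```python
def roc_points(scores, labels):
    """Return (threshold, FPR, TPR) triples, sweeping thresholds high to low.

    A chemical is predicted active when its score is >= the threshold.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes are required to draw a ROC curve")
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


# Toy validation set: predicted values and actual activity labels.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
toy_labels = [1, 0, 1, 0, 0]
for threshold, fpr, tpr in roc_points(toy_scores, toy_labels):
    print(f"s = {threshold:.1f}  FPR = {fpr:.3f}  TPR = {tpr:.3f}")
```

Each point pairs a candidate threshold with the FPR and TPR it produces, so whichever selection rule is applied to the curve maps directly back to a predicted value.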
Step 5: The surface point cloud of the chemical to be evaluated is input into the obtained convolutional neural network model to obtain a predicted value of estrogen activity. If the predicted value is at or above the preset threshold, the chemical is judged to have estrogenic activity; otherwise, it is judged not to have estrogenic activity.
Specifically, the present step comprises the following sub-steps:
In sub-step 51, a 4 × M molecular surface point cloud matrix is calculated for the chemical to be evaluated according to the method described in sub-steps 12 to 13. This numerical matrix is used as the input of the convolutional neural network model to calculate the predicted value s_out of the estrogen activity of the chemical.
In sub-step 52, if the predicted activity value s_out ≥ s_t, the chemical is judged to have estrogenic activity; otherwise, it is judged not to have estrogenic activity.
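The decision rule of sub-step 52 is a single comparison, sketched below together with a small helper that applies it to several chemicals. The function names are hypothetical; the numeric values are the predicted value and threshold reported for β-estradiol in Example 1.

```python
def has_estrogenic_activity(s_out, s_t):
    """Return True when the predicted value reaches the preset threshold."""
    return s_out >= s_t


def screen(predicted_values, s_t):
    """Map each chemical name to its activity decision."""
    return {name: has_estrogenic_activity(value, s_t)
            for name, value in predicted_values.items()}


# Values reported in Example 1: prediction 0.958, preset threshold 0.119.
print(screen({"beta-estradiol": 0.958}, s_t=0.119))
```

Note that a predicted value exactly equal to the threshold counts as active, following the s_out ≥ s_t condition.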
The technical scheme of the invention is further described below by means of specific embodiments and with reference to the accompanying drawings. It should be noted that the following specific examples are given by way of illustration only and the scope of the present invention is not limited thereto.
Example 1
Referring to figs. 1-3, the method of this example for rapidly screening chemicals for estrogen activity based on a convolutional neural network comprises the following steps:
(1) Acquisition and pretreatment of chemical data
Data from 18 high-throughput assays related to estrogen receptor activity in ToxCast, the toxicity forecasting research program of the U.S. Environmental Protection Agency (EPA), were downloaded together with the three-dimensional structure files of the chemicals. The high-throughput experimental data of each chemical were converted into binary activity categories. The final dataset included 1317 chemicals, of which 144 had estrogen-activating activity and 1173 did not.
(2) Conversion of chemical structures into surface point cloud matrices
The three-dimensional structure of each molecule was optimized at the B3LYP/6-31G(d) level in Gaussian 09 software, and the optimized fchk file was obtained.
Based on the optimized chemical structure, the electrostatic potential and three-dimensional coordinate parameters (x, y, z, esp) of the molecular surface lattice points were calculated using Multiwfn software at a spacing of 0.25 Bohr, as shown in fig. 3. 4096 points were randomly sampled from the molecular surface lattice points as a point cloud representing the three-dimensional structure of the molecule, which can be represented as a 4 × 4096 matrix.
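The sampling step above can be sketched in a few lines of Python. The sketch assumes the surface lattice points have already been exported as (x, y, z, esp) tuples and that sampling is done without replacement; neither detail is stated in the text, and the function name and the synthetic points are illustrative only.

```python
import random


def sample_point_cloud(surface_points, m=4096, seed=None):
    """Randomly sample m surface points and return them as a 4 x m matrix.

    surface_points is a list of (x, y, z, esp) tuples.
    """
    if len(surface_points) < m:
        raise ValueError("fewer surface points than requested samples")
    rng = random.Random(seed)
    sampled = rng.sample(surface_points, m)  # sampling without replacement
    # Transpose m rows of 4 values into 4 rows (x, y, z, esp) of m values.
    return [list(row) for row in zip(*sampled)]


# Synthetic stand-in for the surface lattice points of one molecule.
fake_surface = [(float(i), float(i) + 0.5, -float(i), 0.01 * i)
                for i in range(10000)]
matrix = sample_point_cloud(fake_surface, m=4096, seed=0)
print(len(matrix), len(matrix[0]))
```

Each column of the resulting matrix is one surface point, so the rows hold the x, y, z and electrostatic potential channels that the first convolution layer reads.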
(3) Training of the deep neural network model and hyperparameter search
The chemical data were randomly divided into a training set and a validation set at a ratio of 4:1, the validation set being used for the model hyperparameter search and the assessment of predictive ability. Active chemicals are labeled "1" and inactive chemicals "0".
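A 4:1 split that keeps the proportion of active molecules equal in both sets, as claim 3 requires, can be sketched as a per-class shuffle. The function name, the fixed seed and the rounding rule are assumptions for illustration; the class counts are those of the dataset described in step (1).

```python
import random


def stratified_split(labels, train_parts=4, valid_parts=1, seed=0):
    """Split sample indices into training and validation sets, class by class."""
    rng = random.Random(seed)
    train, valid = [], []
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Round the validation share of this class to the nearest integer.
        n_valid = round(len(indices) * valid_parts / (train_parts + valid_parts))
        valid.extend(indices[:n_valid])
        train.extend(indices[n_valid:])
    return train, valid


# 144 active ("1") and 1173 inactive ("0") chemicals, as in the dataset above.
dataset_labels = [1] * 144 + [0] * 1173
train_idx, valid_idx = stratified_split(dataset_labels)
print(len(train_idx), len(valid_idx))
```

Splitting each class separately keeps the roughly 11% share of active chemicals the same in the training and validation sets, which a purely random split of a small active class would not guarantee.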
The constructed deep neural network model has 8 layers and comprises two kinds of structure, convolution layers and fully connected layers: the first 4 layers are convolution layers, which are followed by 4 fully connected layers (as shown in fig. 2):
The 4 × 4096 numerical matrix of each chemical in the training set is used as the model input.
The first convolution layer is a one-dimensional convolution layer comprising 64 convolution kernels, each of size 4 × 1, with a convolution stride of 1. Batch normalization is then applied so that the inputs to each layer of the neural network keep the same distribution during training, and a rectified linear unit (ReLU) is used as the activation function to convert linear features in the neural network into nonlinear features. The output is data of size 64 × 4096.
The second convolution layer is a one-dimensional convolution layer comprising 64 convolution kernels, each of size 64 × 1, with a convolution stride of 1. Batch normalization is then applied so that the inputs to each layer of the neural network keep the same distribution during training, and the ReLU function is used as the activation function to convert linear features in the neural network into nonlinear features. The output is data of size 64 × 4096.
The third and fourth convolution layers use a convolution structure similar to that of the first two layers. The third layer comprises 128 convolution kernels of size 64 × 1, with a convolution stride of 1, and outputs data of size 128 × 4096; the fourth layer comprises 1024 convolution kernels of size 128 × 1, with a convolution stride of 1, and outputs data of size 1024 × 4096. The 1024 × 4096 matrix output by the last convolution layer is subjected to global max pooling and converted into a vector of length 1024, which serves as the input to the fully connected layers.
In the fully connected layers, each node of the current layer is connected to all nodes of the previous layer, and the numbers of nodes are 1024, 256, 64, 8 and 1, respectively. The output of every layer except the last is passed through a ReLU activation function, which converts linear features in the neural network into nonlinear features.
The last fully connected layer, namely the output layer, has one node. A sigmoid activation function maps the output value into the range 0 to 1; this value is the predicted value s of the estrogen activity of the chemical.
During training, the adaptive moment estimation (Adam) optimizer is used to update the neural network parameters based on gradients, with a learning rate α of 0.001. In addition, to improve the generalization ability of the model and prevent overfitting, an L2 regularization term is added to the model, with its parameter set to 0.001. To alleviate the problems caused by data imbalance, the sampling weight of active chemicals is increased 8-fold during training, and the batch size is set to 64. Each training run lasts 60 epochs, and the model parameters of each epoch are saved. The model from each epoch is used to predict the validation set, and the statistical parameters true positive (TP), true negative (TN), false positive (FP), false negative (FN), sensitivity (Se), specificity (Sp), accuracy (Acc) and balanced accuracy (BA) are calculated. To avoid overfitting and improve the generalization ability of the model, the model with the best validation-set balanced accuracy BA among the E epochs is selected.
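The 8-fold sampling weight for active chemicals mentioned above can be sketched with weighted random draws. The sketch assumes that batches are drawn with replacement in proportion to the weights; the patent does not describe the sampling mechanism, and the function names are hypothetical. For simplicity the class counts of the full dataset are used, whereas the training subset would hold about four fifths of each class.

```python
import random


def sampling_weights(labels, active_weight=8.0):
    """Give active chemicals (label 1) a larger sampling weight."""
    return [active_weight if y == 1 else 1.0 for y in labels]


def expected_active_share(labels, active_weight=8.0):
    """Expected fraction of active chemicals in a weighted batch."""
    weights = sampling_weights(labels, active_weight)
    active = sum(w for w, y in zip(weights, labels) if y == 1)
    return active / sum(weights)


def draw_batch(labels, batch_size=64, active_weight=8.0, seed=None):
    """Draw one batch of sample indices (with replacement, an assumption)."""
    rng = random.Random(seed)
    weights = sampling_weights(labels, active_weight)
    return rng.choices(range(len(labels)), weights=weights, k=batch_size)


imbalanced_labels = [1] * 144 + [0] * 1173
print(round(expected_active_share(imbalanced_labels), 3))
batch = draw_batch(imbalanced_labels, seed=0)
print(sum(imbalanced_labels[i] for i in batch), "active chemicals in the batch")
```

With 144 active and 1173 inactive chemicals, a weight of 8 gives the two classes nearly equal total weight (1152 against 1173), so a batch is expected to be roughly half active.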
To further avoid overfitting and improve the generalization ability of the model, the hyperparameters are searched and optimized within a certain range and with a certain step size:
the learning rate α is varied to obtain the balanced accuracy BA for different learning rates, and α is chosen to be 0.001;
the L2 regularization parameter λ is varied to obtain the balanced accuracy BA for different regularization parameters, and λ is chosen to be 0.001;
the batch size is varied to obtain the balanced accuracy BA for different batch sizes, and the batch size is chosen to be 64.
(4) Determination of the preset threshold for activity prediction
The validation set is predicted using the optimal model trained in step (3) to obtain the predicted activity values of the chemicals in the validation set, and a receiver operating characteristic curve (ROC curve) is drawn in combination with the estrogen activity labels obtained in step (1). The predicted value s_t corresponding to the point t at which the rate of change of TPR relative to FPR is greatest on the ROC curve is taken as the preset threshold for determining estrogen activation activity. At point t, the average of the prediction sensitivity Se and the specificity Sp is 0.844, and the corresponding activity classification threshold s_t is 0.119.
(5) Determination of the estrogenic Activity of a chemical to be evaluated
β-Estradiol (CASRN: 50-28-2) is an estrogen drug with high estrogenic activity and can be used to treat functional uterine bleeding, primary amenorrhea, menopausal syndrome and prostate cancer. It is used as the chemical to be predicted in this example; its 3D structure file was obtained by searching for β-estradiol in the PubChem molecular database.
The three-dimensional structure of β-estradiol was optimized at the B3LYP/6-31G(d) level in Gaussian 09 software to obtain an fchk file; the electrostatic potential and three-dimensional coordinate parameters (x, y, z, esp) of the optimized molecular surface lattice points were then calculated using Multiwfn software at a spacing of 0.25 Bohr. 4096 points were randomly sampled from the molecular surface lattice points as a point cloud (see fig. 3) representing the three-dimensional structure of the molecule, which can be represented as a 4 × 4096 matrix.
The numerical matrix is used as the input of the trained deep neural network model, which gives a predicted estrogen activity value of 0.958 for β-estradiol. This value is greater than the preset threshold of 0.119, so β-estradiol is judged to have estrogenic activity. Because the predicted value is far above the preset threshold, the activity is considered strong, and the prediction is consistent with the facts.
(6) Comparison of predictive performance with other existing machine-learning-based methods
To better demonstrate the accuracy and performance of the deep-neural-network-based method for predicting estrogen activation activity, the method was compared with similar models from recent studies. The results are shown in Table 1 below: on the same dataset, the method of the invention generalizes better on the validation set, and its evaluation indexes such as sensitivity, specificity and accuracy are markedly better than those of the other similar methods.
TABLE 1
In conclusion, with the established deep artificial neural network model for predicting estrogen activity, the estrogen activity of a chemical can be predicted from its molecular surface point cloud alone.
The foregoing description of the embodiments is provided to illustrate the general principles of the invention and is not intended to limit the invention; any modifications, equivalents and improvements may be made without departing from the spirit and principles of the invention.

Claims (9)

1. A method for constructing an estrogen activity prediction model, comprising the following steps:
S1, acquiring chemical data with known estrogenic activity, wherein the chemical data comprises initial three-dimensional structure information of a chemical;
S2, optimizing the initial three-dimensional structure information to obtain optimized three-dimensional structure information;
S3, obtaining a molecular surface point cloud matrix of the chemical based on the optimized three-dimensional structure information;
S4, training a convolutional neural network model by taking the molecular surface point cloud matrix as input to obtain the estrogen activity prediction model;
wherein obtaining the molecular surface point cloud matrix of the chemical based on the optimized three-dimensional structure information comprises:
calculating, based on the optimized three-dimensional structure information, the electrostatic potential and three-dimensional coordinate parameters of lattice points on the molecular surface at intervals of 0.1-0.5 Bohr;
randomly sampling M points from the lattice points on the molecular surface as a point cloud representing the three-dimensional structure of the molecule, the point cloud being expressed as a 4 × M numerical matrix.
2. The construction method according to claim 1, wherein,
in step S4, training the convolutional neural network model by taking the molecular surface point cloud matrix as input specifically comprises:
S4.1, randomly dividing the obtained chemical data into a training set and a validation set according to a certain proportion;
S4.2, training the convolutional neural network model using the training set, and determining the optimal hyperparameters of the convolutional neural network model using the validation set, to obtain an optimal convolutional neural network model, namely the estrogen activity prediction model.
3. The construction method according to claim 2, wherein,
in step S4.1, the proportion of active molecules is the same in the training set and the validation set;
in step S4.2, the convolutional neural network model comprises n_cv convolution layers and n_fc fully connected layers.
4. The construction method according to claim 3, wherein,
the i-th convolution layer of the convolution layers comprises channel_i convolution kernels of size channel_{i-1} × k_i, wherein k_i is one of [1, 3, 5, 7, 9], the convolution stride is stride_i, and the output is data of size channel_i × L_i, wherein M is the number of points in the point cloud representing the three-dimensional structure of the molecule;
wherein the matrix output by the last convolution layer, of size channel_{n_cv} × L_{n_cv}, is subjected to global max pooling and converted into a vector of length channel_{n_cv} as the input to the fully connected layers;
wherein each node of the current layer in the fully connected layers is connected to all nodes of the previous layer, each layer having its own specified number of nodes, and wherein the last fully connected layer, namely the output layer, outputs the predicted value s of the estrogen activity of the chemical.
5. The construction method according to claim 2, wherein,
in step S4.2, the method for determining the optimal super parameter of the convolutional neural network model includes:
S4.2.1, presetting a set of model hyperparameters {α, λ, batch size}, wherein α is the learning rate, λ is the weight-decay regularization parameter, and batch size is the size of each batch;
S4.2.2, iteratively training the convolutional neural network model based on the preset hyperparameters and the training-set data;
S4.2.3, optimizing the hyperparameters {α, λ, batch size} using the compounds of the validation set to obtain the optimal hyperparameters;
the optimal hyperparameters in step S4.2.3 are the hyperparameters for which the statistical parameter is optimal;
wherein the statistical parameter comprises at least one of true positive, true negative, false positive, false negative, sensitivity, specificity, accuracy and balance accuracy.
6. A method of screening for estrogenic activity using an estrogen activity prediction model obtained by the construction method according to any one of claims 1 to 5, comprising:
converting the initial three-dimensional structure information of the chemical to be evaluated into a molecular surface point cloud matrix, and inputting the molecular surface point cloud matrix into the estrogen activity prediction model to obtain a predicted value of estrogen activity;
if the predicted value is greater than or equal to a preset threshold, the chemical is considered to have estrogenic activity, and if the predicted value is less than the preset threshold, the chemical is considered to have no estrogenic activity.
7. The screening method of claim 6, wherein,
the method for determining the preset threshold comprises the following steps:
converting the initial three-dimensional structure information of a plurality of known chemicals into molecular surface point cloud matrices, and inputting the molecular surface point cloud matrices into the prediction model to obtain a set S of predicted values of estrogen activity;
sorting S from largest to smallest, and calculating the true positive rate and the false positive rate from the predicted value of each chemical and the corresponding activity label; obtaining a receiver operating characteristic curve with the false positive rate on the x-axis and the true positive rate on the y-axis;
calculating the predicted value s_t corresponding to the maximum point t of the true positive rate relative to the false positive rate on the receiver operating characteristic curve, and taking s_t as the preset threshold for determining estrogenic activity.
8. An electronic device, comprising:
one or more processors;
a memory for storing one or more instructions,
wherein the one or more instructions, when executed by the one or more processors, cause the one or more processors to implement the screening method of claim 6 or 7.
9. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to implement the screening method of claim 6 or 7.
CN202110092707.XA 2021-01-22 2021-01-22 Quick screening method for estrogen activity based on molecular surface point cloud Active CN112885415B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110092707.XA CN112885415B (en) 2021-01-22 2021-01-22 Quick screening method for estrogen activity based on molecular surface point cloud

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110092707.XA CN112885415B (en) 2021-01-22 2021-01-22 Quick screening method for estrogen activity based on molecular surface point cloud

Publications (2)

Publication Number Publication Date
CN112885415A CN112885415A (en) 2021-06-01
CN112885415B true CN112885415B (en) 2024-02-06

Family

ID=76050692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110092707.XA Active CN112885415B (en) 2021-01-22 2021-01-22 Quick screening method for estrogen activity based on molecular surface point cloud

Country Status (1)

Country Link
CN (1) CN112885415B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI799269B (en) * 2022-05-16 2023-04-11 國立臺灣師範大學 Method for predicting activity of chemicals on estrogen receptors
CN115881212A (en) * 2022-10-26 2023-03-31 溪砾科技(深圳)有限公司 RNA target-based small molecule compound screening method and device

Also Published As

Publication number Publication date
CN112885415A (en) 2021-06-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant