CN117095767A - Unknown substance acute toxicity prediction method, system and equipment based on deep learning

Info

Publication number: CN117095767A
Application number: CN202311034419.4A
Authority: CN
Inventors: 张晓迪; 任晓婷; 刘瑞; 吴昊; 王钊; 海春旭
Original assignee: Air Force Medical University of PLA
Current assignee: Air Force Medical University of PLA
Priority date: 2023-08-16
Filing date: 2023-08-16
Publication date: 2023-11-21
Anticipated expiration: 2043-08-16
Also published as: CN117095767B

Abstract

The invention discloses a method, a system and equipment for predicting the acute toxicity of an unknown substance based on deep learning, and relates to the technical field of substance toxicity detection. The method comprises encoding the acquired CANONICAL SMILES code of the chemical substances in a target area into a corresponding DGL graph, and inputting the DGL graph into an unknown substance acute toxicity prediction model to obtain the types of the chemical substances in the target area and the acute toxicity degree of each chemical substance. The model is determined from an improved graph attention network model, which comprises an input layer, a feature extraction layer, a hidden layer, a pooling layer and a linear classification layer. The attention mechanism of the hidden layer is based on the SuperGAT attention mechanism for graph node classification tasks. The invention can predict the chemical substances in the target area and their acute toxicity degree with high precision and high efficiency.

Description

Unknown substance acute toxicity prediction method, system and equipment based on deep learning

Technical Field

The invention relates to the technical field of substance toxicity detection, in particular to a method, a system and equipment for predicting the acute toxicity of unknown substances based on deep learning.

Background

Predicting the toxicity of the poison source is an important reference for emergency rescue plans in sudden chemical poisoning accidents. Predicting the toxic source substances at the accident site supports both the clinical treatment of on-site casualties and the decontamination of the site environment.

Computational toxicology builds computer models based on the principles of computational chemistry, bioinformatics, systems biology and the like. By mining and analyzing existing data, it can predict the structures, properties, effects and the like of chemical substances in vitro, and it avoids the high cost, long duration and animal-ethics problems of obtaining such information by traditional experimental means. However, existing methods of this kind suffer from low prediction efficiency and low prediction accuracy.

Disclosure of Invention

The invention aims to provide a method, a system and equipment for predicting the acute toxicity of unknown substances based on deep learning, which can predict the chemical substances in a target area and the acute toxicity degree of the chemical substances with high precision and high efficiency.

In order to achieve the above object, the present invention provides the following solutions:

in a first aspect, the present invention provides a method for predicting acute toxicity of an unknown substance based on deep learning, comprising:

acquiring a CANONICAL SMILES code of the chemical substances in the target area, and encoding the CANONICAL SMILES code of the chemical substances in the target area into a DGL graph of the chemical substances in the target area;

inputting the DGL graph of the chemical substances in the target area into an unknown substance acute toxicity prediction model to obtain the types of the chemical substances in the target area and the acute toxicity degree of each chemical substance;

the unknown substance acute toxicity prediction model is determined according to an improved graph attention network model; the improved graph attention network model comprises an input layer, a feature extraction layer, a hidden layer, a pooling layer and a linear classification layer; the attention mechanism of the hidden layer is based on the SuperGAT attention mechanism for graph node classification tasks.

In a second aspect, the present invention provides a deep learning-based system for predicting acute toxicity of an unknown substance, comprising:

the target area parameter acquisition module is used for acquiring a CANONICAL SMILES code of the chemical substances in the target area and encoding the CANONICAL SMILES code of the chemical substances in the target area into a DGL graph of the chemical substances in the target area;

the chemical substance acute toxicity prediction module is used for inputting the DGL graph of the chemical substances in the target area into the unknown substance acute toxicity prediction model to obtain the types of the chemical substances in the target area and the acute toxicity degree of each chemical substance;

In a third aspect, the present invention provides an electronic device, comprising a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to perform the deep learning based unknown substance acute toxicity prediction method according to the first aspect.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

The invention combines computational toxicology with deep learning to develop an alternative method for predicting the toxicity of unknown substances, based on toxicity endpoint indicators determined in existing experiments. It can continuously enrich and improve existing toxicity databases, fill gaps in experimental data, and provide a reference for the development of new techniques and methods in the field of predictive toxicology.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a method for predicting acute toxicity of unknown substances based on deep learning according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of an unknown substance acute toxicity prediction system based on deep learning according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.

Example 1

As shown in fig. 1, the embodiment provides a method for predicting acute toxicity of unknown substances based on deep learning, which includes:

Step 101: Acquire the CANONICAL SMILES code of the chemical substances in the target area, and encode the CANONICAL SMILES code of the chemical substances in the target area into a DGL graph of the chemical substances in the target area. Here, SMILES is the Simplified Molecular Input Line Entry System, and CANONICAL SMILES is the normalized SMILES, i.e., the unique SMILES of a compound.
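The encoding in step 101 can be illustrated with a toy converter that turns a SMILES string into atom and bond lists, the raw material of a graph object. This is a minimal sketch under stated assumptions, not the encoder used in this embodiment: the function names `smiles_to_graph` and `to_directed_edge_lists` are hypothetical, only a small SMILES subset is supported (unbracketed B, C, N, O, P, S, F, I, Cl, Br atoms, the bonds `-`, `=`, `#`, branches, and single-digit ring closures), and a real pipeline would rely on a cheminformatics toolkit for parsing and canonicalization.

```python
ORGANIC_SUBSET = {"B", "C", "N", "O", "P", "S", "F", "I", "Cl", "Br"}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def smiles_to_graph(smiles):
    """Return (atoms, edges): atom symbols and undirected bonds (i, j, order)."""
    atoms, edges = [], []
    branch_stack = []   # atoms to return to when a branch closes
    ring_open = {}      # ring digit -> (atom index, bond order at opening)
    prev = None         # atom that the next atom will bond to
    order = 1           # order of the pending bond
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch in BOND_ORDER:
            order = BOND_ORDER[ch]
        elif ch.isdigit():
            if ch in ring_open:
                # Second occurrence of the digit closes the ring.
                j, ring_order = ring_open.pop(ch)
                edges.append((j, prev, max(order, ring_order)))
            else:
                ring_open[ch] = (prev, order)
            order = 1
        else:
            two = smiles[i:i + 2]
            symbol = two if two in ("Cl", "Br") else ch
            if symbol not in ORGANIC_SUBSET:
                raise ValueError(f"unsupported SMILES token: {symbol!r}")
            atoms.append(symbol)
            idx = len(atoms) - 1
            if prev is not None:
                edges.append((prev, idx, order))
            prev, order = idx, 1
            i += len(symbol) - 1
        i += 1
    return atoms, edges


def to_directed_edge_lists(edges):
    """Expand each undirected bond into two directed edges (src, dst lists)."""
    src, dst = [], []
    for i, j, _ in edges:
        src += [i, j]
        dst += [j, i]
    return src, dst


atoms, edges = smiles_to_graph("CC(=O)O")   # acetic acid
src, dst = to_directed_edge_lists(edges)
```

The paired `src` and `dst` lists follow the common convention of storing each chemical bond as two directed edges, so that messages can flow in both directions during graph attention.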

Step 201: Input the DGL graph of the chemical substances in the target area into the unknown substance acute toxicity prediction model to obtain the types of the chemical substances in the target area and the acute toxicity degree of each chemical substance.

The unknown substance acute toxicity prediction model is determined according to an improved graph attention network model; the improved graph attention network model comprises an input layer, a feature extraction layer, a hidden layer, a pooling layer and a linear classification layer.

The feature extraction layer is a graph isomorphism network (GIN).

The attention mechanism of the hidden layer is based on the SuperGAT attention mechanism for graph node classification tasks.

The pooling layer comprises a maximum pooling method and a weighted sum pooling method.

Further, the improved attention mechanism is obtained by fusing the dot-product attention function with the concatenation-based attention function.

Still further, the hidden layer comprises two layers of the improved graph attention network. The improved graph attention network is obtained from the original graph attention network by replacing its attention mechanism with the improved attention mechanism; the original graph attention network is the GATv2 version of the graph attention network.

As a preferred embodiment, the determining process of the acute toxicity prediction model of the unknown substance is as follows:

(1) Constructing a sample data set; the sample data set includes a plurality of sample data; the sample data comprises input data and label data corresponding to the input data; the input data is a DGL graph of the chemical substances in the sample area, and the label data is the types of the chemical substances in the sample area and the acute toxicity degree of each chemical substance.

(2) Constructing an improved graph attention network model.

(3) Training the improved graph attention network model with the sample data, and iteratively adjusting its network parameters by back propagation until the loss value of the trained improved graph attention network model is smaller than a set threshold, thereby obtaining a trained improved graph attention network model; the trained improved graph attention network model is the unknown substance acute toxicity prediction model.
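The stopping rule in step (3), iterating gradient updates until the loss falls below a set threshold, can be sketched on a deliberately tiny model. The example below is an illustration only: a one-feature logistic classifier stands in for the improved graph attention network, and the function name `train_until_threshold`, the learning rate, the threshold value and the toy samples are all assumptions rather than values from this embodiment.

```python
import math


def train_until_threshold(samples, loss_threshold=0.1, lr=0.5, max_epochs=10_000):
    """Gradient descent on a 1-feature logistic model.

    Stops as soon as the mean cross-entropy loss drops below loss_threshold,
    mirroring the 'loss smaller than a set threshold' criterion.
    Returns (w, b, loss, epochs_used).
    """
    w, b = 0.0, 0.0
    loss = float("inf")
    n = len(samples)
    for epoch in range(1, max_epochs + 1):
        grad_w = grad_b = loss = 0.0
        for x, y in samples:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))   # forward pass
            p = min(max(p, 1e-12), 1.0 - 1e-12)        # guard against log(0)
            loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            grad_w += (p - y) * x                      # backward pass (chain rule)
            grad_b += (p - y)
        loss /= n
        if loss < loss_threshold:
            return w, b, loss, epoch
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b, loss, max_epochs


# Toy, linearly separable data: (feature, binary label).
toy_samples = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b, final_loss, epochs_used = train_until_threshold(toy_samples)
```

A `max_epochs` cap is included because a threshold-only criterion would loop forever on data the model cannot fit; whether the embodiment uses such a cap is not stated.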

The method for constructing the sample data set specifically comprises the following steps:

1) Acquiring the CANONICAL SMILES codes of the chemical substances in a plurality of sample areas, together with the corresponding chemical substance types and acute toxicity degrees; 2) Encoding the CANONICAL SMILES code of the chemical substances in each sample area into a DGL graph of the chemical substances in that sample area; 3) Constructing the sample data set from the DGL graphs of the chemical substances in the plurality of sample areas and the corresponding chemical substance types and acute toxicity degrees.

The unknown substance acute toxicity prediction model of the present embodiment is further described below by way of an example.

Step 1: The CANONICAL SMILES code of the chemical substance is encoded into a DGL graph, and the DGL graph is used as the input of the improved graph attention network model.

Step 2: The original graph attention network is improved by replacing its attention mechanism with an improved attention mechanism; the resulting model is denoted GAT_MX. The improved attention mechanism is based on the SuperGAT attention mechanism for graph node classification tasks, that is, it is obtained by fusing the dot-product attention function with the concatenation-based attention function. The original graph attention network is the GATv2 version of the graph attention network.

The improvement formula is as follows:

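$$e_{ij} = \left(a^{T}\,\mathrm{LeakyReLU}\left(W \cdot [h_i \,\|\, h_j]\right)\right) \cdot \mathrm{Sigmoid}\left((W h_i)^{T}\,(W h_j)\right)$$

where $a^{T}$ is the transposed weight vector of the attention neural network; $e_{ij}$ is the attention coefficient; $W$ is a weight matrix; $h_i$ and $h_j$ are the feature vectors of node $i$ and node $j$, respectively; $\|$ denotes concatenation; and LeakyReLU is the activation function.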
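To make the fused score concrete, the sketch below evaluates the product of the concatenation-based term and the sigmoid-gated dot-product term for one node pair. It is a hedged illustration rather than the embodiment's implementation: the formula uses a single symbol W, but the concatenated input and the single-node input have different lengths, so the code assumes two separate matrices (`W_cat` and `W_dot`, both hypothetical names). The LeakyReLU slope of 0.2 and the softmax normalization over neighbors are likewise assumptions borrowed from common graph attention practice.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def leaky_relu(x, slope=0.2):
    return [v if v > 0 else slope * v for v in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fused_attention_score(h_i, h_j, W_cat, a, W_dot):
    """e_ij = (a^T LeakyReLU(W_cat [h_i || h_j])) * sigmoid((W_dot h_i)^T (W_dot h_j))."""
    concat_term = dot(a, leaky_relu(matvec(W_cat, h_i + h_j)))  # list + list = concatenation
    dot_term = sigmoid(dot(matvec(W_dot, h_i), matvec(W_dot, h_j)))
    return concat_term * dot_term


def softmax(scores):
    """Normalize raw scores over a node's neighbors (assumed, as in standard GAT)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


h_i, h_j = [1.0, 0.0], [0.0, 1.0]
W_cat = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 1.0]]       # maps the 4-dim concatenation to 2 dims
a = [0.5, 0.5]
W_dot = [[1.0, 0.0],
         [0.0, 1.0]]                 # identity, to keep the arithmetic visible
e_ij = fused_attention_score(h_i, h_j, W_cat, a, W_dot)
```

In this example the concatenation term equals 1.0 and the two projected vectors are orthogonal, so the sigmoid gate is 0.5 and the score is halved; a pair with strongly aligned projections would pass nearly the full concatenation score through.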

Step 3: By comparing the average scores of the GAT_MX model over 10 randomly partitioned datasets for different numbers of hidden layers, the hidden layer of the GAT_MX model is determined to be a two-layer improved graph attention network.
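The selection procedure in step 3 can be sketched as a loop over candidate depths, each scored on the same 10 random partitions. This is a minimal sketch under assumptions: `random_split`, `select_num_layers` and the `evaluate` callback are hypothetical names, the 80/20 split ratio is not given in the source, and `evaluate` stands for whatever routine trains a model of the given depth on the training indices and returns its score on the held-out indices.

```python
import random
from statistics import mean


def random_split(n_samples, train_fraction, seed):
    """Shuffle sample indices with a fixed seed and cut them into train/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


def select_num_layers(candidates, n_samples, evaluate, n_splits=10, train_fraction=0.8):
    """Return (best_depth, average_scores) over n_splits random partitions.

    evaluate(num_layers, train_idx, test_idx) -> float is supplied by the caller;
    higher scores are assumed to be better.
    """
    average_scores = {}
    for num_layers in candidates:
        scores = []
        for seed in range(n_splits):
            train_idx, test_idx = random_split(n_samples, train_fraction, seed)
            scores.append(evaluate(num_layers, train_idx, test_idx))
        average_scores[num_layers] = mean(scores)
    best = max(average_scores, key=average_scores.get)
    return best, average_scores
```

Reusing the same seeds for every candidate depth means all depths are compared on identical partitions, which removes split-to-split variance from the comparison.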

Step 4: Before the improved graph attention network, a graph isomorphism network is introduced to extract features. The aggregation formula of the graph isomorphism network is as follows:

where $l$ is the layer index of the graph isomorphism network; $h_i^{(l)}$ and $h_j^{(l)}$ are the features of node $i$ and node $j$ at layer $l$, respectively; $h_i^{(l+1)}$ is the feature of node $i$ at layer $l+1$; and $\epsilon$ is a learnable scalar parameter.
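For orientation, the widely used form of the graph isomorphism network update scales a node's own feature by (1 + ε), adds the sum of its neighbors' features, and optionally applies a learned transformation. The sketch below implements that standard form in plain Python; whether this embodiment's formula includes the transformation, or uses a different aggregator than the sum, cannot be confirmed from the text, so treat it as an assumption. The function name `gin_aggregate` is hypothetical.

```python
def gin_aggregate(features, neighbors, eps=0.0, transform=None):
    """One GIN layer: h_i' = transform((1 + eps) * h_i + sum_{j in N(i)} h_j).

    features  : list of per-node feature vectors (lists of floats)
    neighbors : neighbors[i] is the list of node indices adjacent to node i
    eps       : scalar that is learnable in a real model, fixed here
    transform : optional callable applied to the aggregated vector (e.g. an MLP)
    """
    updated = []
    for i, h_i in enumerate(features):
        aggregated = [(1.0 + eps) * v for v in h_i]
        for j in neighbors[i]:
            aggregated = [a + b for a, b in zip(aggregated, features[j])]
        updated.append(transform(aggregated) if transform else aggregated)
    return updated


# Path graph 0 - 1 - 2 with one scalar feature per node.
features = [[1.0], [2.0], [3.0]]
neighbors = [[1], [0, 2], [1]]
layer_out = gin_aggregate(features, neighbors, eps=0.0)
```

Because the sum aggregator keeps neighbor multiplicities, two nodes with different neighborhoods tend to receive different features, which is the structural information that later attention layers can build on.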

Step 5: A pooling layer is introduced after the improved graph attention network. The pooling layer uses max pooling and weighted sum pooling together. The two methods correspond to different frequency-domain information, so large-scale and small-scale features are used jointly: high-frequency (i.e., small-scale) information and low-frequency (i.e., large-scale) information are combined and fed into the linear classification layer.

The expression of the max pooling method is as follows:

where $N_i$ denotes all nodes adjacent to node $i$; the output is the max-pooled value; and $k$ indexes the $k$-th feature map.

The expression of the weighted sum pooling method is as follows:

where the output is the weighted sum of all feature values in the $k$-th feature map, each multiplied by its corresponding weight.

In this embodiment, the expression after concatenating the two pooling methods is as follows:

where $R$ is the concatenation of the max pooling output and the weighted sum pooling output.
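The combined readout can be sketched as follows: a feature-wise maximum over all nodes, a weighted sum over all nodes, and the concatenation of the two. This is an illustration under assumptions, not the embodiment's exact formulas: the per-node weight is computed here as a sigmoid of a linear score (`gate_w`, `gate_b`), which is one common way to learn such weights, and all function names are hypothetical.

```python
import math


def max_pool(node_feats):
    """Feature-wise maximum over all nodes (keeps sharp, small-scale signals)."""
    return [max(column) for column in zip(*node_feats)]


def weighted_sum_pool(node_feats, gate_w, gate_b=0.0):
    """Sum of node features, each scaled by sigmoid(gate_w . h + gate_b)."""
    pooled = [0.0] * len(node_feats[0])
    for h in node_feats:
        score = sum(w * v for w, v in zip(gate_w, h)) + gate_b
        weight = 1.0 / (1.0 + math.exp(-score))
        pooled = [p + weight * v for p, v in zip(pooled, h)]
    return pooled


def readout(node_feats, gate_w, gate_b=0.0):
    """R = [max pooling output || weighted sum pooling output]."""
    return max_pool(node_feats) + weighted_sum_pool(node_feats, gate_w, gate_b)


node_feats = [[1.0, 4.0],
              [3.0, 2.0]]
R = readout(node_feats, gate_w=[0.0, 0.0])   # zero gate -> every node weighted 0.5
```

The readout doubles the feature dimension, so the linear classification layer that consumes it must accept twice the hidden size.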

In summary, the best model finally obtained in this embodiment, GAT_MX_MWS, is a two-layer graph attention network with the improved attention mechanism combined with max pooling and weighted sum pooling; this is the unknown substance acute toxicity prediction model.

This embodiment designs a single multi-task learning model: all tasks share the same network for front-end feature extraction and differ only in their respective linear output layers. This improves prediction efficiency and brings the design closer to the requirements of toxicity prediction at accident sites.
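The shared-trunk design can be sketched as one graph embedding feeding several independent linear heads. The sketch below is illustrative only: the task names `substance_type` and `toxicity_level`, the head shapes and the function names are hypothetical, and the embedding is assumed to come from the shared feature-extraction, attention and pooling layers.

```python
def linear(x, weights, bias):
    """Apply a linear layer: one output per row of weights."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def multitask_predict(graph_embedding, heads):
    """Run every task-specific linear head on the same shared embedding.

    heads maps task name -> (weights, bias). The embedding is computed once,
    so adding a task costs only one extra linear layer.
    """
    return {task: linear(graph_embedding, W, b) for task, (W, b) in heads.items()}


def argmax(values):
    """Index of the largest output, read as the predicted class."""
    return max(range(len(values)), key=values.__getitem__)


embedding = [1.0, 2.0]                       # stand-in for the pooled graph feature
heads = {
    "substance_type": ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    "toxicity_level": ([[1.0, 1.0]], [0.5]),
}
outputs = multitask_predict(embedding, heads)
predicted_type = argmax(outputs["substance_type"])
```

Sharing the trunk means the expensive graph computation runs once per molecule regardless of how many outputs are requested, which is where the efficiency gain described above comes from.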

Researchers in the prior art have used single-task models. However, some molecular features may be shared across different toxicity tasks, and a single-task model discards this shared information, resulting in information loss. In a real emergency scenario, the multi-task model lets the user determine the type of toxicity more quickly. In addition, the basic graph attention network is improved, which addresses the insufficient feature extraction and limited generalization ability of traditional graph-neural-network-based toxicity prediction methods and demonstrates the better performance of the graph attention network in toxicity prediction tasks.

The present embodiment encodes the CANONICAL SMILES code as a DGL graph. A CANONICAL SMILES code is a linear notation that represents a molecule as a single line of ASCII text. It is unique, that is, a molecule's CANONICAL SMILES code is synonymous with its structure. Compared with molecular descriptors and molecular fingerprints, which may lose information, conversion to a DGL graph object preserves the molecular topology almost completely, which benefits subsequent feature extraction.

Dot-product attention is introduced into the mature GATv2 network to improve the attention mechanism, yielding the novel GAT_MX method. The method improves feature extraction on the acute toxicity dataset and, when reaching its best result, requires fewer training iterations, converges more stably and attains higher prediction accuracy.

Because the graph isomorphism network runs ahead of the graph attention layers, the node features already carry structural information when attention aggregation begins, which gives the model a head start over the original GATv2.

Previous researchers have recommended weighted sum pooling for molecular property prediction tasks. It resembles average pooling and targets the low-frequency information of the whole molecule; although it achieves good results, it may lose part of the small-scale information. The pooling layer of this embodiment handles molecular structure information better.

Example two

In order to execute the corresponding method of the above embodiment to achieve the corresponding functions and technical effects, a deep learning-based system for predicting acute toxicity of unknown substances is provided below.

As shown in fig. 2, the system for predicting acute toxicity of unknown substances based on deep learning provided in this embodiment includes:

The target area parameter acquisition module 201 is configured to acquire the CANONICAL SMILES code of the chemical substances in the target area, and encode the CANONICAL SMILES code of the chemical substances in the target area into a DGL graph of the chemical substances in the target area.

The chemical substance acute toxicity prediction module 202 is configured to input the DGL graph of the chemical substances in the target area into the unknown substance acute toxicity prediction model, and obtain the types of the chemical substances in the target area and the acute toxicity degree of each chemical substance.

Further, the feature extraction layer is a graph isomorphism network; the hidden layer comprises two layers of the improved graph attention network; the improved graph attention network is obtained from the original graph attention network by replacing its attention mechanism with the improved attention mechanism; the original graph attention network is the GATv2 version of the graph attention network; the improved attention mechanism is obtained by fusing the dot-product attention function with the concatenation-based attention function; the pooling layer comprises a max pooling method and a weighted sum pooling method.

Example III

The embodiment of the invention provides an electronic device which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor runs the computer program to enable the electronic device to execute the unknown substance acute toxicity prediction method based on deep learning.

Alternatively, the electronic device may be a server.

In addition, an embodiment of the present invention further provides a computer readable storage medium storing a computer program, where the computer program when executed by a processor implements a method for predicting acute toxicity of an unknown substance based on deep learning according to the first embodiment.

The invention provides a method, a system and equipment for predicting the acute toxicity of unknown substances based on deep learning. Building on GATv2, the updated version of the graph attention network, it improves the attention mechanism and the network structure to strengthen feature extraction, thereby raising prediction accuracy on the toxicity dataset and offering a new approach to predicting the acute toxicity of molecules.

In this specification, the embodiments are described progressively: each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments may be cross-referenced. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.

The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to assist in understanding the methods of the present invention and the core ideas thereof; also, it is within the scope of the present invention to be modified by those of ordinary skill in the art in light of the present teachings. In view of the foregoing, this description should not be construed as limiting the invention.

Claims

1. A method for predicting acute toxicity of an unknown substance based on deep learning, comprising the steps of:

acquiring a CANONICAL SMILES code of the chemical substances in the target area, and encoding the CANONICAL SMILES code of the chemical substances in the target area into a DGL graph of the chemical substances in the target area;

2. The method for predicting the acute toxicity of the unknown substance based on deep learning according to claim 1, wherein the determining process of the model for predicting the acute toxicity of the unknown substance is as follows:

constructing a sample data set; the sample data set includes a plurality of sample data; the sample data comprises input data and label data corresponding to the input data; the input data is a DGL graph of chemical substances in the sample area, and the label data is the types of the chemical substances in the sample area and the acute toxicity degree of each chemical substance;

constructing an improved graph attention network model;

training the improved graph attention network model with the sample data, and iteratively adjusting its network parameters by back propagation until the loss value of the trained improved graph attention network model is smaller than a set threshold, thereby obtaining a trained improved graph attention network model; the trained improved graph attention network model is the unknown substance acute toxicity prediction model.

3. The method for predicting the acute toxicity of an unknown substance based on deep learning according to claim 2, wherein the constructing a sample data set specifically comprises:

acquiring CANONICAL SMILES codes of chemical substances in a plurality of sample areas, and corresponding chemical substance types and acute toxicity degrees of the chemical substances;

encoding the CANONICAL SMILES code of the chemical substances in each sample area into a DGL graph of the chemical substances in that sample area;

constructing the sample data set from the DGL graphs of the chemical substances in the plurality of sample areas and the corresponding chemical substance types and acute toxicity degrees.

4. The method for predicting the acute toxicity of an unknown substance based on deep learning according to claim 1 or 2, wherein the feature extraction layer is a graph isomorphism network.

5. The method for predicting the acute toxicity of an unknown substance based on deep learning according to claim 1 or 2, wherein the hidden layer comprises two layers of the improved graph attention network; the improved graph attention network is obtained from the original graph attention network by replacing its attention mechanism with the improved attention mechanism; the original graph attention network is the GATv2 version of the graph attention network.

6. The method for predicting the acute toxicity of an unknown substance based on deep learning according to claim 5, wherein the improved attention mechanism is obtained by fusing a dot-product attention function with a concatenation-based attention function.

7. A method for predicting acute toxicity of unknown substances based on deep learning as claimed in claim 1 or 2, wherein said pooling layer comprises a maximum pooling method and a weighted sum pooling method.

8. A deep learning-based unknown substance acute toxicity prediction system, comprising:

the target area parameter acquisition module is used for acquiring the CANONICAL SMILES code of the chemical substances in the target area and encoding the CANONICAL SMILES code of the chemical substances in the target area into a DGL graph of the chemical substances in the target area;

9. The deep learning-based unknown substance acute toxicity prediction system according to claim 8, wherein the feature extraction layer is a graph isomorphism network;

the hidden layer comprises two layers of the improved graph attention network; the improved graph attention network is obtained from the original graph attention network by replacing its attention mechanism with the improved attention mechanism; the original graph attention network is the GATv2 version of the graph attention network; the improved attention mechanism is obtained by fusing the dot-product attention function with the concatenation-based attention function;

10. An electronic device comprising a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to perform a deep learning based method of predicting acute toxicity of an unknown substance according to any one of claims 1 to 7.