CN111986740B - Method for classifying compounds and related equipment - Google Patents
- Publication number: CN111986740B
- Application number: CN202010917059.2A
- Authority: CN (China)
- Prior art keywords: vector, atom, compound, atomic, model
- Legal status: Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/90—Programming languages; Computing architectures; Database systems; Data warehousing
Abstract
The invention relates to the technical field of artificial intelligence, and provides a compound classification method and related equipment. The compound classification method comprises the following steps: obtaining a first tag vector of a sample compound based on a compound property; converting a first atomic representation of the sample compound into a first atomic vector sequence, and converting the missing atom corresponding to the first atomic representation into a second tag vector of the first atomic representation; training a property classification model formed by a feature extraction model and a first classification model according to the first tag vector and a property feature vector, and training a missing atom prediction model formed by the feature extraction model and a second classification model according to the second tag vector and a missing atom vector; and classifying the target compound through the trained property classification model by taking a second atomic vector of the target compound as input. The invention improves the efficiency of classifying compounds.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a compound classification method, a device, computer equipment and a computer readable storage medium.
Background
Compound classification is the basis of much biological and chemical work. In conventional compound classification methods, biologists and chemists are required to classify compounds manually using expert knowledge.
How to classify compounds based on artificial intelligence to improve classification efficiency is a problem to be solved.
Disclosure of Invention
In view of the foregoing, there is a need for a compound classification method, apparatus, computer device, and computer-readable storage medium that can improve the efficiency of classifying compounds.
A first aspect of the present application provides a compound classification method comprising:
obtaining a first atomic representation of a sample compound, obtaining a first tag vector of the sample compound based on a compound property, and obtaining the missing atom corresponding to the first atomic representation;
converting the first atomic representation into a first atomic vector sequence, and converting the missing atom into a second tag vector of the first atomic representation;
Extracting the atomic features of the compound through a feature extraction model by taking the first atomic vector sequence as input, to obtain a feature vector sequence of the sample compound;
Calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
Training a property classification model formed by the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model formed by the feature extraction model and the second classification model according to the second tag vector and the missing atom vector;
Acquiring a second atomic representation of the target compound to be classified;
Converting the second atomic representation into a second atomic vector sequence;
And classifying the target compound by taking the second atomic vector as input through the trained property classification model.
In another possible implementation, the obtaining the first atomic representation of the sample compound includes:
obtaining a simplified molecular-input line-entry system (SMILES) representation of the sample compound; or
obtaining a molecular fingerprint representation of the sample compound; or
obtaining an International Chemical Identifier (InChI) representation of the sample compound.
In another possible implementation, the converting the first atomic representation into the first atomic vector sequence includes:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the first atom representation;
Splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the first atom representation to obtain a first atom vector of each atom in the first atom representation;
And combining first atom vectors of a plurality of atoms in the first atom representation to obtain the first atom vector sequence.
In another possible implementation, the feature extraction model includes a BERT model, an RNN model, or a Transformer model.
In another possible implementation manner, the calculating, by the first classification model, the property feature vector of the sample compound according to the feature vector sequence includes:
calculating a water-soluble feature vector of the sample compound from the sequence of feature vectors using a water-soluble classification sub-model in the first classification model;
And calculating a toxicity characteristic vector of the sample compound according to the characteristic vector sequence by using a toxicity classification sub-model in the first classification model.
In another possible implementation manner, the training a property classification model formed by the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model formed by the feature extraction model and the second classification model according to the second tag vector and the missing atom vector includes:
calculating a first difference vector between the first tag vector and the property feature vector;
Calculating a second difference vector between the second tag vector and the missing atom vector;
Splicing the first difference vector and the second difference vector to obtain a third difference vector;
and optimizing parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector and the third difference vector by adopting a back propagation algorithm.
In another possible implementation manner, the optimizing parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector and the third difference vector by adopting a back propagation algorithm includes:
synchronously optimizing parameters in the property classification model and the missing atom prediction model according to the third difference vector by adopting a back propagation algorithm; or
optimizing parameters in the property classification model according to the first difference vector by adopting a back propagation algorithm, and asynchronously optimizing parameters in the missing atom prediction model according to the second difference vector by adopting a back propagation algorithm.
A second aspect of the present application provides a compound classification device comprising:
a first obtaining module, configured to obtain a first atomic representation of a sample compound, obtain a first tag vector of the sample compound based on a compound property, and obtain the missing atom corresponding to the first atomic representation;
the first conversion module is used for converting the first atom representation into a first atom vector sequence and converting the missing atom into a second label vector of the first atom representation;
The extraction module is used for extracting the atomic characteristics of the compound by taking the first atomic vector sequence as input through a characteristic extraction model to obtain a characteristic vector sequence of the sample compound;
A calculation module, configured to calculate a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculate a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
The training module is used for training a property classification model formed by the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model formed by the feature extraction model and the second classification model according to the second tag vector and the missing atom vector;
A second acquisition module for acquiring a second atomic representation of the target compound to be classified;
A second conversion module for converting the second atomic representation into a second atomic vector sequence;
And the classification module is used for classifying the target compound by taking the second atomic vector as input through the trained property classification model.
A third aspect of the application provides a computer device comprising a processor for implementing the compound classification method when executing computer readable instructions stored in a memory.
A fourth aspect of the application provides a computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the method of compound classification.
The method and the device pretrain the feature extraction model shared in the two models through the missing atom prediction model and the property classification model so as to improve the extraction effect of the feature extraction model on the atomic features of the compound and further improve the accuracy of classifying the compound by the property classification model formed by the feature extraction model and the first classification model. Meanwhile, the trained property classification model takes the second atomic vector as input to classify the target compound, so that the classification of the target compound by an expert is avoided, and the efficiency of classifying the compound is improved.
Drawings
FIG. 1 is a flow chart of a method for classifying compounds according to an embodiment of the present invention.
Fig. 2 is a block diagram of a device for classifying compounds according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present application will be more clearly understood, a more particular description of the application will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. It should be noted that, without conflict, the embodiments of the present application and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, and the described embodiments are merely some, rather than all, embodiments of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Preferably, the compound classification method of the present invention is applied in one or more computer devices. The computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
Example 1
Fig. 1 is a flow chart of a method for classifying compounds according to an embodiment of the present invention. The compound classification method is applied to computer equipment and is used for classifying the compounds, so that the efficiency of classifying the compounds is improved.
As shown in fig. 1, the compound classification method includes:
101, obtaining a first atomic representation of a sample compound, obtaining a first tag vector of the sample compound based on a compound property, and obtaining the missing atom corresponding to the first atomic representation.
In a specific embodiment, the obtaining a first atomic representation of a sample compound comprises:
obtaining a simplified molecular input line entry specification (SMILES) representation of the sample compound; or
obtaining a molecular fingerprint (Extended Connectivity Fingerprint, ECFP) representation of the sample compound; or
obtaining an International Chemical Identifier (InChI) representation of the sample compound.
The sample compound is a compound with randomly missing atoms, which are represented by mask tags. For example, if the complete compound is "CCCC(=O)NC1=CC=C(OCC(O)CNC(C)C)C(=C1)C(C)=O", the SMILES representation of the sample compound is "[cls]CCCC(=[mask])NC1=CC=C(OCC(O)CNC(C)C)C(=C1)C(C)=O[sep]", and the missing atom is "O". Here, "[cls]" is a start identifier and "[sep]" is an end identifier.
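For ease of understanding only, the following Python sketch shows one possible way to generate such a masked training sample from a complete SMILES string. The regular-expression tokenizer and the "[cls]"/"[mask]"/"[sep]" markers are illustrative assumptions of this sketch and are not part of the claimed method; a full implementation would use a real SMILES parser (for example RDKit).

```python
import random
import re

# Matches two-letter halogens first, then common one-letter atom symbols;
# an illustrative simplification, not a complete SMILES tokenizer.
ATOM = re.compile(r"Cl|Br|[BCNOPSFIcnops]")

def mask_random_atom(smiles, seed=None):
    """Replace one randomly chosen atom symbol with [mask]; return the
    masked string (with start/end markers) and the missing atom label."""
    rng = random.Random(seed)
    matches = list(ATOM.finditer(smiles))
    if not matches:
        raise ValueError("no atom symbols found in SMILES")
    chosen = rng.choice(matches)
    masked = smiles[:chosen.start()] + "[mask]" + smiles[chosen.end():]
    return "[cls]" + masked + "[sep]", chosen.group()

masked, missing = mask_random_atom(
    "CCCC(=O)NC1=CC=C(OCC(O)CNC(C)C)C(=C1)C(C)=O", seed=7)
print(masked)   # e.g. "[cls]CCCC(=[mask])NC1=...[sep]", depending on the seed
print(missing)  # the atom symbol that was masked out
```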
102, Converting the first atomic representation into a first atomic vector sequence, and converting the missing atoms into a second tag vector of the first atomic representation.
The first atomic representation and the missing atoms are converted into a vector sequence for processing and feature extraction by vector conversion.
In a specific embodiment, the converting the first atomic representation into a first atomic vector sequence includes:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the first atom representation;
Splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the first atom representation to obtain a first atom vector of each atom in the first atom representation;
And combining first atom vectors of a plurality of atoms in the first atom representation to obtain the first atom vector sequence.
The coding sub-vector of each atom, which is a unique identification of the atom, may be queried from a preset coding table. The position sub-vector of each atom may be the position of the atom in the SMILES representation of the sample compound. The graph structure sub-vector of each atom may include atom structure information and/or connection information of the atoms in the sample compound.
And inquiring the coding sub-vector of the missing atom through the preset coding table, and determining the coding sub-vector of the missing atom as the second tag vector.
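As a minimal sketch of this conversion, the snippet below (written in Python with PyTorch, neither of which is prescribed by this application) builds the coding sub-vector, position sub-vector and graph structure sub-vector of each atom from learned embedding tables and splices them into a first atom vector sequence. The vocabulary size, maximum length, graph feature dimension and embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AtomVectorizer(nn.Module):
    """Builds an atom vector for every token of an atomic representation by
    splicing a coding sub-vector, a position sub-vector and a graph
    structure sub-vector (all sizes are illustrative assumptions)."""

    def __init__(self, vocab_size=64, max_len=256, graph_feat_dim=8, dim=32):
        super().__init__()
        self.code_emb = nn.Embedding(vocab_size, dim)      # coding sub-vector
        self.pos_emb = nn.Embedding(max_len, dim)          # position sub-vector
        self.graph_proj = nn.Linear(graph_feat_dim, dim)   # graph structure sub-vector

    def forward(self, token_ids, graph_feats):
        # token_ids: (seq_len,) atom codes looked up in a preset coding table
        # graph_feats: (seq_len, graph_feat_dim) structure/connection features
        positions = torch.arange(token_ids.size(0))
        code = self.code_emb(token_ids)
        pos = self.pos_emb(positions)
        graph = self.graph_proj(graph_feats)
        # Splice the three sub-vectors into the atom vector of each atom.
        return torch.cat([code, pos, graph], dim=-1)        # (seq_len, 3 * dim)

vectorizer = AtomVectorizer()
token_ids = torch.randint(0, 64, (12,))    # a toy compound with 12 tokens
graph_feats = torch.randn(12, 8)
atom_vectors = vectorizer(token_ids, graph_feats)
print(atom_vectors.shape)                  # torch.Size([12, 96])
```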
103, extracting the atomic features of the compound through a feature extraction model by taking the first atomic vector sequence as input, to obtain a feature vector sequence of the sample compound.
In a specific embodiment, the feature extraction model includes a BERT model, an RNN model, or a Transformer model.
BERT is a deep bidirectional language representation model based on the Transformer, which uses the Transformer structure to construct a multi-layer bidirectional Encoder network. The Transformer is a deep model based on the self-attention mechanism for processing NLP tasks. The Transformer processes NLP tasks more effectively than an RNN, and trains faster.
Specifically, the BERT model includes a plurality of neural network layers, each including a preset number of coding modules, and each coding module includes an encoder structure of a bidirectional Transformer. Each encoder structure includes a multi-head attention network, a first residual network, a first feedforward neural network, and a second residual network.
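As a rough, non-authoritative illustration of such a feature extraction model, the sketch below uses PyTorch's built-in Transformer encoder as a stand-in for the BERT-style encoder described above; the number of layers, hidden size and attention head count are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """A small Transformer-encoder stand-in for the BERT-style feature
    extraction model: it maps a first atom vector sequence to a feature
    vector sequence of the same length."""

    def __init__(self, atom_dim=96, model_dim=128, heads=4, layers=2):
        super().__init__()
        self.input_proj = nn.Linear(atom_dim, model_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=heads, dim_feedforward=4 * model_dim,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)

    def forward(self, atom_vectors):
        # atom_vectors: (batch, seq_len, atom_dim) -> (batch, seq_len, model_dim)
        return self.encoder(self.input_proj(atom_vectors))

extractor = FeatureExtractor()
features = extractor(torch.randn(2, 12, 96))
print(features.shape)   # torch.Size([2, 12, 128])
```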
104, calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model.
In another embodiment, the computing, by a first classification model, a property feature vector of the sample compound from the sequence of feature vectors comprises:
calculating a water-soluble feature vector of the sample compound from the sequence of feature vectors using a water-soluble classification sub-model in the first classification model;
And calculating a toxicity characteristic vector of the sample compound according to the characteristic vector sequence by using a toxicity classification sub-model in the first classification model.
The water-soluble classification sub-model comprises a first fully-connected network, and is used for fine-tuning the feature vector sequence based on parameters in the first fully-connected network on the basis that the feature vector sequence has the atomic characteristics of the compound, so as to obtain the characteristic feature vector, and the sample compound is classified by the characteristic feature vector. Parameters in the first fully-connected network may be optimized by supervised water solubility training to increase the accuracy of classification of the sample compounds by the property feature vectors. The water-soluble classification sub-model may also include a second feedforward neural network, a first convolutional neural network, and the like.
The toxicity classification sub-model comprises a second full-connection network, and is used for fine-tuning the characteristic vector sequence based on parameters in the second full-connection network on the basis that the characteristic vector sequence has the atomic characteristics of the compound, so as to obtain the toxicity characteristic vector, and the sample compound is classified by the toxicity characteristic vector. Parameters in the second fully-connected network may be optimized by supervised toxicity training to increase the accuracy of classification of the sample compounds by the toxicity profile. The toxicity classification sub-model may also include a third feedforward neural network, a second convolutional neural network, and so on.
The first classification model may also include a melting point classification sub-model, a half-maximal inhibitory concentration classification sub-model, and the like.
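The sketch below illustrates, under the same assumptions as the earlier snippets, how the property classification sub-models of the first classification model (fully connected networks over the feature vector sequence) and the second classification model (prediction of the masked atom) might be realized. Pooling on the "[cls]" position, the class counts and the atom vocabulary size are illustrative choices, not the claimed structure.

```python
import torch
import torch.nn as nn

class PropertyHead(nn.Module):
    """One property classification sub-model (e.g. water solubility or
    toxicity): pools the feature vector sequence and applies a fully
    connected network to obtain a property feature vector."""

    def __init__(self, model_dim=128, num_classes=2):
        super().__init__()
        self.fc = nn.Linear(model_dim, num_classes)

    def forward(self, features):                  # (batch, seq_len, model_dim)
        pooled = features[:, 0]                   # feature vector at the [cls] position
        return self.fc(pooled)                    # (batch, num_classes)

class MissingAtomHead(nn.Module):
    """The second classification model: predicts the code of the masked atom
    from the feature vector at the masked position."""

    def __init__(self, model_dim=128, vocab_size=64):
        super().__init__()
        self.fc = nn.Linear(model_dim, vocab_size)

    def forward(self, features, mask_positions):   # mask_positions: (batch,)
        idx = mask_positions.view(-1, 1, 1).expand(-1, 1, features.size(-1))
        masked = features.gather(1, idx).squeeze(1)
        return self.fc(masked)                     # (batch, vocab_size)

solubility_head, toxicity_head = PropertyHead(), PropertyHead()
atom_head = MissingAtomHead()
feats = torch.randn(2, 12, 128)
print(solubility_head(feats).shape, atom_head(feats, torch.tensor([3, 5])).shape)
```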
And 105, training a property classification model formed by the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model formed by the feature extraction model and the second classification model according to the second tag vector and the missing atom vector.
In a specific embodiment, the training a property classification model composed of the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second tag vector and the missing atom vector includes:
calculating a first difference vector between the first tag vector and the property feature vector;
Calculating a second difference vector between the second tag vector and the missing atom vector;
Splicing the first difference vector and the second difference vector to obtain a third difference vector;
and optimizing parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector and the third difference vector by adopting a back propagation algorithm.
A first difference vector between the first label vector and the property feature vector may be calculated using a cross-entropy loss function, and a second difference vector between the second label vector and the missing atom vector may be calculated using the same cross-entropy loss function.
In a specific embodiment, said optimizing parameters in said property classification model and said missing atom prediction model according to said first difference vector, said second difference vector, and said third difference vector using a back propagation algorithm comprises:
synchronously optimizing parameters in the property classification model and the missing atom prediction model according to the third difference vector by adopting a back propagation algorithm; or
optimizing parameters in the property classification model according to the first difference vector by adopting a back propagation algorithm, and asynchronously optimizing parameters in the missing atom prediction model according to the second difference vector by adopting a back propagation algorithm.
The first difference vector expresses the distance between the first label vector and the property feature vector, and the second difference vector expresses the distance between the second label vector and the missing atom vector.
When the property classification model and the missing atom prediction model are used as an integral model, the first label vector and the second label vector are spliced to be used as integral label vectors of the integral model; and splicing the property feature vector and the missing atom vector to be used as an integral output vector of the integral model. The third difference vector expresses a distance of the overall label vector from the overall output vector.
And after synchronously optimizing parameters in the property classification model and the missing atom prediction model according to the third difference vector by adopting a back propagation algorithm, recalculating the integral output vector of the integral model, wherein the distance between the recalculated integral output vector of the integral model and the integral label vector is smaller, namely the classification accuracy of the integral model is higher.
Alternatively, parameters in the feature extraction model and the first classification model are optimized according to the first difference vector by adopting a back propagation algorithm, and parameters in the feature extraction model and the second classification model are asynchronously optimized according to the second difference vector by adopting a back propagation algorithm; the two optimizations are performed asynchronously, which improves the speed of training the feature extraction model. Optimizing parameters in the property classification model according to the first difference vector makes the distance between the first label vector and the property feature vector recalculated by the property classification model based on the optimized parameters smaller, that is, the property classification model classifies the compound more accurately based on the physical properties of the compound. Optimizing parameters in the missing atom prediction model according to the second difference vector makes the distance between the second label vector and the missing atom vector recalculated by the missing atom prediction model based on the optimized parameters smaller, that is, the missing atom prediction model predicts the missing atom in an input compound more accurately.
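A minimal training sketch of the two optimization strategies described above is given below; it reuses the FeatureExtractor, PropertyHead and MissingAtomHead sketches defined earlier. The cross-entropy losses follow the description above, while the single Adam optimizer shared by both models and the use of one property head are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

# Assumes the FeatureExtractor, PropertyHead and MissingAtomHead sketches above.
extractor = FeatureExtractor()
property_head = PropertyHead()
atom_head = MissingAtomHead()
optimizer = torch.optim.Adam(
    list(extractor.parameters()) + list(property_head.parameters())
    + list(atom_head.parameters()), lr=1e-4)

def train_step(atom_vectors, property_labels, atom_labels, mask_positions,
               synchronous=True):
    """One parameter update of the shared feature extraction model and the
    two classification models by back propagation."""
    if synchronous:
        # Synchronous optimization: the two differences are combined and
        # drive a single backward pass through the shared extractor.
        features = extractor(atom_vectors)
        loss = (F.cross_entropy(property_head(features), property_labels)
                + F.cross_entropy(atom_head(features, mask_positions), atom_labels))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    else:
        # Asynchronous optimization: the property classification model and
        # the missing atom prediction model update the shared extractor in turn.
        optimizer.zero_grad()
        F.cross_entropy(property_head(extractor(atom_vectors)),
                        property_labels).backward()
        optimizer.step()
        optimizer.zero_grad()
        F.cross_entropy(atom_head(extractor(atom_vectors), mask_positions),
                        atom_labels).backward()
        optimizer.step()

train_step(torch.randn(2, 12, 96),   # first atom vector sequences
           torch.tensor([0, 1]),     # property labels (e.g. toxic or not)
           torch.tensor([4, 7]),     # codes of the masked atoms
           torch.tensor([3, 5]))     # positions of the masked atoms
```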
106, Obtaining a second atomic representation of the target compound to be classified.
A second atomic representation of the target compound may be obtained from a database, or may be received as input from a user.
The obtaining a second atomic representation of the target compound to be classified comprises:
obtaining a SMILES representation of the target compound; or
obtaining a molecular fingerprint representation of the target compound; or
obtaining an International Chemical Identifier (InChI) representation of the target compound.
The type of the second atomic representation is consistent with the type of the first atomic representation. For example, the type of the second atomic representation and the type of the first atomic representation are both SMILES representations.
107 Converting the second atomic representation into a second atomic vector sequence.
The second atomic representation is converted by vector conversion into a vector sequence that facilitates processing and extraction of features.
The converting the second atomic representation into a second atomic vector sequence includes:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the second atom representation;
Splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the second atom representation to obtain a second atom vector of each atom in the second atom representation;
and combining a second atom vector of a plurality of atoms in the second atom representation to obtain the second atom vector sequence.
108, Classifying the target compound by taking the second atomic vector as an input through the trained property classification model.
After optimizing parameters in the property classification model by training, the trained property classification model may classify the target compound based on compound property characteristics.
The trained property classification model may include a water-soluble classification sub-model, a toxicity classification sub-model, a melting point classification sub-model, a half-maximal inhibitory concentration classification sub-model, and the like.
The trained property classification model may classify the target compound according to one or more properties of the compound based on one or more sub-models. For example, the trained property classification model comprises a water-soluble classification sub-model and a toxicity classification sub-model, and the trained property classification model can classify the target compound according to the water solubility and toxicity of the compound, so that the type of the target compound is a water-soluble compound and/or a non-toxic compound.
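For completeness, a brief inference sketch is shown below; it reuses the AtomVectorizer, FeatureExtractor and PropertyHead objects sketched earlier and assumes the target compound has already been converted into token codes and graph features in the same way as the training compounds. Which class index means "water-soluble" or "toxic" is likewise an assumption of the sketch.

```python
import torch

def classify_target(token_ids, graph_feats, vectorizer, extractor,
                    solubility_head, toxicity_head):
    """Classify a target compound with the trained property classification model."""
    with torch.no_grad():
        atom_vectors = vectorizer(token_ids, graph_feats).unsqueeze(0)  # add batch dim
        features = extractor(atom_vectors)
        soluble = solubility_head(features).argmax(dim=-1).item()
        toxic = toxicity_head(features).argmax(dim=-1).item()
    return {"water_soluble": bool(soluble), "toxic": bool(toxic)}

result = classify_target(torch.randint(0, 64, (12,)), torch.randn(12, 8),
                         AtomVectorizer(), FeatureExtractor(),
                         PropertyHead(), PropertyHead())
print(result)   # e.g. {'water_soluble': True, 'toxic': False}
```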
The compound classification method of the first embodiment obtains a first atomic representation of a sample compound, obtains a first tag vector of the sample compound based on a compound property, and obtains the missing atom corresponding to the first atomic representation; converts the first atomic representation into a first atomic vector sequence, and converts the missing atom into a second tag vector of the first atomic representation; extracts the atomic features of the compound through a feature extraction model by taking the first atomic vector sequence as input, to obtain a feature vector sequence of the sample compound; calculates a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculates a missing atom vector of the sample compound according to the feature vector sequence through a second classification model; trains a property classification model formed by the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and trains a missing atom prediction model formed by the feature extraction model and the second classification model according to the second tag vector and the missing atom vector; acquires a second atomic representation of the target compound to be classified; converts the second atomic representation into a second atomic vector sequence; and classifies the target compound through the trained property classification model by taking the second atomic vector as input. In the first embodiment, the feature extraction model shared by the two models is pre-trained through the missing atom prediction model and the property classification model, so as to improve the extraction effect of the feature extraction model on the atomic features of the compound and thereby improve the accuracy with which the property classification model formed by the feature extraction model and the first classification model classifies compounds. Meanwhile, the trained property classification model classifies the target compound by taking the second atomic vector as input, which avoids classification of the target compound by an expert and improves the efficiency of classifying compounds.
Example two
Fig. 2 is a block diagram of a device for classifying compounds according to a second embodiment of the present invention. The compound classifying device 20 is applied to a computer apparatus. The compound classification device 20 is used for classifying the compounds, and improves the efficiency of classifying the compounds.
As shown in fig. 2, the compound classification device 20 may include a first acquisition module 201, a first conversion module 202, an extraction module 203, a calculation module 204, a training module 205, a second acquisition module 206, a second conversion module 207, and a classification module 208.
A first obtaining module 201, configured to obtain a first atomic representation of a sample compound, obtain a first tag vector of the sample compound based on a compound property, and obtain a corresponding missing atom of the first atomic representation.
In a specific embodiment, the obtaining a first atomic representation of a sample compound comprises:
obtaining a simplified molecular input line entry specification (SMILES) representation of the sample compound; or
obtaining a molecular fingerprint (Extended Connectivity Fingerprint, ECFP) representation of the sample compound; or
obtaining an International Chemical Identifier (InChI) representation of the sample compound.
The sample compound is a compound with randomly missing atoms, which are represented by mask tags. For example, if the complete compound is "CCCC(=O)NC1=CC=C(OCC(O)CNC(C)C)C(=C1)C(C)=O", the SMILES representation of the sample compound is "[cls]CCCC(=[mask])NC1=CC=C(OCC(O)CNC(C)C)C(=C1)C(C)=O[sep]", and the missing atom is "O". Here, "[cls]" is a start identifier and "[sep]" is an end identifier.
A first conversion module 202, configured to convert the first atomic representation into a first atomic vector sequence and convert the missing atom into a second tag vector of the first atomic representation.
The first atomic representation and the missing atoms are converted into a vector sequence for processing and feature extraction by vector conversion.
In a specific embodiment, the converting the first atomic representation into a first atomic vector sequence includes:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the first atom representation;
Splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the first atom representation to obtain a first atom vector of each atom in the first atom representation;
And combining first atom vectors of a plurality of atoms in the first atom representation to obtain the first atom vector sequence.
The coding sub-vector of each atom, which is a unique identification of the atom, may be queried from a preset coding table. The position sub-vector of each atom may be the position of the atom in the SMILES representation of the sample compound. The graph structure sub-vector of each atom may include atom structure information and/or connection information of the atoms in the sample compound.
And inquiring the coding sub-vector of the missing atom through the preset coding table, and determining the coding sub-vector of the missing atom as the second tag vector.
The extracting module 203 is configured to extract the atomic features of the compound through a feature extraction model by taking the first atomic vector sequence as input, so as to obtain a feature vector sequence of the sample compound.
In a specific embodiment, the feature extraction model includes a BERT model, an RNN model, or a Transformer model.
BERT is a deep bidirectional language representation model based on the Transformer, which uses the Transformer structure to construct a multi-layer bidirectional Encoder network. The Transformer is a deep model based on the self-attention mechanism for processing NLP tasks. The Transformer processes NLP tasks more effectively than an RNN, and trains faster.
Specifically, the BERT model includes a plurality of neural network layers, each including a preset number of coding modules, and each coding module includes an encoder structure of a bidirectional Transformer. Each encoder structure includes a multi-head attention network, a first residual network, a first feedforward neural network, and a second residual network.
A calculation module 204, configured to calculate, by using a first classification model, a property feature vector of the sample compound according to the feature vector sequence, and calculate, by using a second classification model, a missing atom vector of the sample compound according to the feature vector sequence.
In another embodiment, the computing, by a first classification model, a property feature vector of the sample compound from the sequence of feature vectors comprises:
calculating a water-soluble feature vector of the sample compound from the sequence of feature vectors using a water-soluble classification sub-model in the first classification model;
And calculating a toxicity characteristic vector of the sample compound according to the characteristic vector sequence by using a toxicity classification sub-model in the first classification model.
The water-soluble classification sub-model comprises a first fully-connected network, and is used for fine-tuning the feature vector sequence based on parameters in the first fully-connected network on the basis that the feature vector sequence has the atomic characteristics of the compound, so as to obtain the characteristic feature vector, and the sample compound is classified by the characteristic feature vector. Parameters in the first fully-connected network may be optimized by supervised water solubility training to increase the accuracy of classification of the sample compounds by the property feature vectors. The water-soluble classification sub-model may also include a second feedforward neural network, a first convolutional neural network, and the like.
The toxicity classification sub-model comprises a second full-connection network, and is used for fine-tuning the characteristic vector sequence based on parameters in the second full-connection network on the basis that the characteristic vector sequence has the atomic characteristics of the compound, so as to obtain the toxicity characteristic vector, and the sample compound is classified by the toxicity characteristic vector. Parameters in the second fully-connected network may be optimized by supervised toxicity training to increase the accuracy of classification of the sample compounds by the toxicity profile. The toxicity classification sub-model may also include a third feedforward neural network, a second convolutional neural network, and so on.
The first classification model may also include a melting point classification sub-model, a half-maximal inhibitory concentration classification sub-model, and the like.
A training module 205, configured to train a property classification model formed by the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and train a missing atom prediction model formed by the feature extraction model and the second classification model according to the second tag vector and the missing atom vector.
In a specific embodiment, the training a property classification model composed of the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second tag vector and the missing atom vector includes:
calculating a first difference vector between the first tag vector and the property feature vector;
Calculating a second difference vector between the second tag vector and the missing atom vector;
Splicing the first difference vector and the second difference vector to obtain a third difference vector;
and optimizing parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector and the third difference vector by adopting a back propagation algorithm.
A first difference vector between the first label vector and the property feature vector may be calculated using a cross-entropy loss function, and a second difference vector between the second label vector and the missing atom vector may be calculated using the same cross-entropy loss function.
In a specific embodiment, said optimizing parameters in said property classification model and said missing atom prediction model according to said first difference vector, said second difference vector, and said third difference vector using a back propagation algorithm comprises:
synchronously optimizing parameters in the property classification model and the missing atom prediction model according to the third difference vector by adopting a back propagation algorithm; or
optimizing parameters in the property classification model according to the first difference vector by adopting a back propagation algorithm, and asynchronously optimizing parameters in the missing atom prediction model according to the second difference vector by adopting a back propagation algorithm.
The first difference vector expresses the distance between the first label vector and the property feature vector, and the second difference vector expresses the distance between the second label vector and the missing atom vector.
When the property classification model and the missing atom prediction model are used as an integral model, the first label vector and the second label vector are spliced to be used as integral label vectors of the integral model; and splicing the property feature vector and the missing atom vector to be used as an integral output vector of the integral model. The third difference vector expresses a distance of the overall label vector from the overall output vector.
And after synchronously optimizing parameters in the property classification model and the missing atom prediction model according to the third difference vector by adopting a back propagation algorithm, recalculating the integral output vector of the integral model, wherein the distance between the recalculated integral output vector of the integral model and the integral label vector is smaller, namely the classification accuracy of the integral model is higher.
Alternatively, parameters in the feature extraction model and the first classification model are optimized according to the first difference vector by adopting a back propagation algorithm, and parameters in the feature extraction model and the second classification model are asynchronously optimized according to the second difference vector by adopting a back propagation algorithm; the two optimizations are performed asynchronously, which improves the speed of training the feature extraction model. Optimizing parameters in the property classification model according to the first difference vector makes the distance between the first label vector and the property feature vector recalculated by the property classification model based on the optimized parameters smaller, that is, the property classification model classifies the compound more accurately based on the physical properties of the compound. Optimizing parameters in the missing atom prediction model according to the second difference vector makes the distance between the second label vector and the missing atom vector recalculated by the missing atom prediction model based on the optimized parameters smaller, that is, the missing atom prediction model predicts the missing atom in an input compound more accurately.
A second obtaining module 206 is configured to obtain a second atomic representation of the target compound to be classified.
A second atomic representation of the target compound may be obtained from a database, or may be received as input from a user.
The obtaining a second atomic representation of the target compound to be classified comprises:
obtaining a SMILES representation of the target compound; or
obtaining a molecular fingerprint representation of the target compound; or
obtaining an International Chemical Identifier (InChI) representation of the target compound.
The type of the second atomic representation is consistent with the type of the first atomic representation. For example, the type of the second atomic representation and the type of the first atomic representation are both SMILES representations.
A second conversion module 207 for converting the second atomic representation into a second atomic vector sequence.
The second atomic representation is converted by vector conversion into a vector sequence that facilitates processing and extraction of features.
The converting the second atomic representation into a second atomic vector sequence includes:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the second atom representation;
Splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the second atom representation to obtain a second atom vector of each atom in the second atom representation;
and combining a second atom vector of a plurality of atoms in the second atom representation to obtain the second atom vector sequence.
And a classification module 208, configured to classify the target compound by using the trained property classification model and the second atomic vector as input.
After optimizing parameters in the property classification model by training, the trained property classification model may classify the target compound based on compound property characteristics.
The trained property classification model may include a water-soluble classification sub-model, a toxicity classification sub-model, a melting point classification sub-model, a half-maximal inhibitory concentration classification sub-model, and the like.
The trained property classification model may classify the target compound according to one or more properties of the compound based on one or more sub-models. For example, the trained property classification model comprises a water-soluble classification sub-model and a toxicity classification sub-model, and the trained property classification model can classify the target compound according to the water solubility and toxicity of the compound, so that the type of the target compound is a water-soluble compound and/or a non-toxic compound.
The compound classification device 20 of the second embodiment obtains a first atomic representation of a sample compound, obtains a first tag vector of the sample compound based on a compound property, and obtains the missing atom corresponding to the first atomic representation; converts the first atomic representation into a first atomic vector sequence, and converts the missing atom into a second tag vector of the first atomic representation; extracts the atomic features of the compound through a feature extraction model by taking the first atomic vector sequence as input, to obtain a feature vector sequence of the sample compound; calculates a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculates a missing atom vector of the sample compound according to the feature vector sequence through a second classification model; trains a property classification model formed by the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and trains a missing atom prediction model formed by the feature extraction model and the second classification model according to the second tag vector and the missing atom vector; acquires a second atomic representation of the target compound to be classified; converts the second atomic representation into a second atomic vector sequence; and classifies the target compound through the trained property classification model by taking the second atomic vector as input. In this embodiment, the feature extraction model shared by the two models is pre-trained through the missing atom prediction model and the property classification model, so as to improve the extraction effect of the feature extraction model on the atomic features of the compound and thereby improve the accuracy with which the property classification model formed by the feature extraction model and the first classification model classifies compounds. Meanwhile, the trained property classification model classifies the target compound by taking the second atomic vector as input, which avoids classification of the target compound by an expert and improves the efficiency of classifying compounds.
Example III
The present embodiment provides a computer readable storage medium having stored thereon computer readable instructions which when executed by a processor perform the steps of the above-described embodiment of a method for classifying compounds, for example, steps 101-108 shown in fig. 1:
101, acquiring a first atomic representation of a sample compound, acquiring a first tag vector of the sample compound based on a compound property, and acquiring the missing atom corresponding to the first atomic representation;
102, converting the first atom representation into a first atom vector sequence, and converting the missing atom into a second label vector of the first atom representation;
103, extracting the atomic features of the compound through a feature extraction model by taking the first atomic vector sequence as input, to obtain a feature vector sequence of the sample compound;
104, calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
Training a property classification model composed of the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second tag vector and the missing atom vector 105;
106, obtaining a second atomic representation of the target compound to be classified;
107 converting the second atomic representation into a second atomic vector sequence;
108, classifying the target compound by taking the second atomic vector as an input through the trained property classification model.
Alternatively, when the computer readable instructions are executed by a processor, the functions of the modules in the above apparatus embodiment are implemented, such as modules 201-208 in fig. 2:
A first obtaining module 201, configured to obtain a first atomic representation of a sample compound, obtain a first tag vector of the sample compound based on a compound property, and obtain a corresponding missing atom of the first atomic representation;
A first conversion module 202, configured to convert the first atomic representation into a first atomic vector sequence, and convert the missing atom into a second tag vector of the first atomic representation;
The extracting module 203 is configured to extract, by using the feature extraction model and using the first atomic vector sequence as input, an atomic feature of the compound, to obtain a feature vector sequence of the sample compound;
a calculation module 204, configured to calculate, by using a first classification model, a property feature vector of the sample compound according to the feature vector sequence, and calculate, by using a second classification model, a missing atom vector of the sample compound according to the feature vector sequence;
A training module 205, configured to train a property classification model formed by the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and train a missing atom prediction model formed by the feature extraction model and the second classification model according to the second tag vector and the missing atom vector;
a second obtaining module 206, configured to obtain a second atomic representation of the target compound to be classified;
a second conversion module 207 for converting the second atomic representation into a second atomic vector sequence;
And a classification module 208, configured to classify the target compound by using the trained property classification model and the second atomic vector as input.
Example IV
Fig. 3 is a schematic diagram of a computer device according to a third embodiment of the present invention. The computer device 30 includes a memory 301, a processor 302, and computer readable instructions, such as a compound classification program, stored in the memory 301 and executable on the processor 302. The processor 302, when executing the computer-readable instructions, implements the steps of the compound classification method embodiments described above, such as 101-108 shown in fig. 1:
101, acquiring a first atomic representation of a sample compound, acquiring a first tag vector of the sample compound based on a compound property, and acquiring the missing atom corresponding to the first atomic representation;
102, converting the first atom representation into a first atom vector sequence, and converting the missing atom into a second label vector of the first atom representation;
103, extracting the atomic features of the compound through a feature extraction model by taking the first atomic vector sequence as input, to obtain a feature vector sequence of the sample compound;
104, calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
Training a property classification model composed of the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second tag vector and the missing atom vector 105;
106, obtaining a second atomic representation of the target compound to be classified;
107, converting the second atomic representation into a second atomic vector sequence;
108, classifying the target compound through the trained property classification model, taking the second atomic vector sequence as input.
Alternatively, the computer readable instructions, when executed by the processor, implement the functions of the modules in the apparatus embodiment described above, such as modules 201 to 208 in Fig. 2:
A first obtaining module 201, configured to obtain a first atomic representation of a sample compound, obtain a first tag vector of the sample compound based on a compound property, and obtain a missing atom corresponding to the first atomic representation;
A first conversion module 202, configured to convert the first atomic representation into a first atomic vector sequence, and convert the missing atom into a second tag vector of the first atomic representation;
An extraction module 203, configured to extract atomic features of the compound through the feature extraction model, taking the first atomic vector sequence as input, to obtain a feature vector sequence of the sample compound;
a calculation module 204, configured to calculate, through a first classification model, a property feature vector of the sample compound according to the feature vector sequence, and calculate, through a second classification model, a missing atom vector of the sample compound according to the feature vector sequence;
A training module 205, configured to train a property classification model formed by the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and train a missing atom prediction model formed by the feature extraction model and the second classification model according to the second tag vector and the missing atom vector;
a second obtaining module 206, configured to obtain a second atomic representation of the target compound to be classified;
a second conversion module 207 for converting the second atomic representation into a second atomic vector sequence;
And a classification module 208, configured to classify the target compound through the trained property classification model, taking the second atomic vector sequence as input.
Illustratively, the computer readable instructions may be partitioned into one or more modules that are stored in the memory 301 and executed by the processor 302 to complete the present method. The one or more modules may be a series of computer readable instruction segments capable of performing a particular function, the segments describing the execution process of the computer readable instructions in the computer device 30. For example, the computer readable instructions may be divided into the first obtaining module 201, the first conversion module 202, the extraction module 203, the calculation module 204, the training module 205, the second obtaining module 206, the second conversion module 207, and the classification module 208 in Fig. 2; the specific function of each module is described in the second embodiment.
Those skilled in the art will appreciate that Fig. 3 is merely an example of the computer device 30 and does not constitute a limitation of the computer device 30; the computer device 30 may include more or fewer components than shown, may combine certain components, or may have different components. For example, the computer device 30 may also include input and output devices, network access devices, buses, and the like.
The processor 302 may be a central processing unit (CPU), or may be another general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general purpose processor may be a microprocessor, or the processor 302 may be any conventional processor. The processor 302 is the control center of the computer device 30 and connects the various parts of the entire computer device 30 through various interfaces and lines.
The memory 301 may be used to store the computer readable instructions; the processor 302 implements the various functions of the computer device 30 by running or executing the computer readable instructions or modules stored in the memory 301 and invoking the data stored in the memory 301. The memory 301 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the computer device 30. In addition, the memory 301 may include a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, a read-only memory (ROM), a random access memory (RAM), or another non-volatile/volatile storage device.
The modules integrated in the computer device 30 may be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product. Based on such understanding, the present invention may implement all or part of the processes in the methods of the embodiments described above by instructing the relevant hardware through computer readable instructions, which may be stored in a computer readable storage medium; when executed by a processor, the computer readable instructions implement the steps of the respective method embodiments described above. The computer readable instructions comprise computer readable instruction code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer readable medium may include any entity or device capable of carrying the computer readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), and so forth.
In the several embodiments provided in the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in hardware plus software functional modules.
The integrated modules, which are implemented in the form of software functional modules, may be stored in a computer readable storage medium. The software functional module is stored in a storage medium, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to perform part of the steps of the compound classification method according to the embodiments of the present invention.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other modules or steps, and that the singular does not exclude a plurality. A plurality of modules or means recited in the system claims can also be implemented by means of one module or means in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.
Claims (8)
1. A method of classifying a compound, the method comprising:
obtaining a first atomic representation of a sample compound, obtaining a first tag vector of the sample compound based on a compound property, and obtaining a missing atom corresponding to the first atomic representation;
Converting the first atom representation into a first atom vector sequence, wherein the coding sub-vector of each atom is obtained from a preset coding table, and the first atom vector of each atom comprises the coding sub-vector, the position sub-vector and the graph structure sub-vector of that atom; and converting the missing atom into a second tag vector of the first atom representation, wherein the coding sub-vector of the missing atom is obtained from the preset coding table and is determined to be the second tag vector;
Extracting the atomic features of the compound by taking the first atomic vector sequence as input through a feature extraction model to obtain a feature vector sequence of the sample compound;
Calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
Training a property classification model composed of the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second tag vector and the missing atom vector, comprising: calculating a first difference vector between the first tag vector and the property feature vector; calculating a second difference vector between the second tag vector and the missing atom vector; splicing the first difference vector and the second difference vector to obtain a third difference vector; and synchronously optimizing parameters in the property classification model and the missing atom prediction model according to the third difference vector by adopting a back-propagation algorithm; or, optimizing parameters in the property classification model according to the first difference vector by adopting a back-propagation algorithm, and asynchronously optimizing parameters in the missing atom prediction model according to the second difference vector by adopting a back-propagation algorithm;
Acquiring a second atomic representation of the target compound to be classified;
Converting the second atomic representation into a second atomic vector sequence;
and classifying the target compound through the trained property classification model and the trained missing atom prediction model, taking the second atomic vector sequence as input.
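Illustratively, the synchronous training alternative recited above may be sketched as follows; the use of PyTorch, the reduction of the spliced third difference vector to a mean-squared scalar objective, and a single shared optimizer are assumptions made for illustration only and are not recited in the claim:

```python
import torch

def synchronous_training_step(feature_extraction_model, first_classification_model,
                              second_classification_model, optimizer,
                              first_atom_vector_sequence, first_tag_vector, second_tag_vector):
    """One joint back-propagation update over the property classification model
    and the missing atom prediction model (illustrative sketch of claim 1)."""
    feature_vector_sequence = feature_extraction_model(first_atom_vector_sequence)
    property_feature_vector = first_classification_model(feature_vector_sequence)
    missing_atom_vector = second_classification_model(feature_vector_sequence)

    first_difference_vector = first_tag_vector - property_feature_vector
    second_difference_vector = second_tag_vector - missing_atom_vector
    third_difference_vector = torch.cat(
        [first_difference_vector, second_difference_vector], dim=-1)  # spliced third difference vector

    loss = third_difference_vector.pow(2).mean()  # scalar surrogate for back-propagation (assumption)
    optimizer.zero_grad()                         # optimizer is assumed to cover both models' parameters
    loss.backward()
    optimizer.step()
    return loss.item()
```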
2. The method of classifying compounds according to claim 1, wherein said obtaining a first atomic representation of a sample compound comprises:
Obtaining a simplified molecular-input line-entry system (SMILES) representation of the sample compound; or
Obtaining a molecular fingerprint representation of the sample compound; or
Obtaining an International Chemical Identifier (InChI) representation of the sample compound.
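Illustratively, the three representations of claim 2 may be obtained with a cheminformatics toolkit such as RDKit; the toolkit and the example molecule are assumptions for this sketch, since the claim does not name any specific software:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCO")  # hypothetical sample compound (ethanol)

smiles = Chem.MolToSmiles(mol)                                            # SMILES representation
fingerprint = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)   # molecular fingerprint representation
inchi = Chem.MolToInchi(mol)                                              # InChI representation
```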
3. The method of classifying a compound as recited in claim 1, wherein said converting said first atomic representation into a first sequence of atomic vectors comprises:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the first atom representation;
Splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the first atom representation to obtain a first atom vector of each atom in the first atom representation;
And combining first atom vectors of a plurality of atoms in the first atom representation to obtain the first atom vector sequence.
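Illustratively, the assembly of the first atom vector sequence recited in claim 3 may be sketched as follows; the sub-vector contents and dimensions are assumptions for illustration:

```python
import torch

def build_first_atom_vector_sequence(coding_subvectors, position_subvectors, graph_subvectors):
    """Splice the coding, position and graph-structure sub-vectors of each atom,
    then combine the per-atom vectors into the first atom vector sequence."""
    atom_vectors = [
        torch.cat([c, p, g], dim=-1)  # first atom vector of one atom
        for c, p, g in zip(coding_subvectors, position_subvectors, graph_subvectors)
    ]
    return torch.stack(atom_vectors, dim=0)  # first atom vector sequence, shape (num_atoms, dim)
```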
4. The method of classifying compounds according to claim 1, wherein the feature extraction model comprises a BERT model, an RNN model, or a Transformer model.
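Illustratively, a Transformer-based feature extraction model of the kind named in claim 4 could be instantiated as follows; the layer sizes are arbitrary assumptions:

```python
import torch.nn as nn

# Minimal Transformer-encoder feature extraction model (illustrative only).
feature_extraction_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=8, batch_first=True),
    num_layers=4,
)
# feature_vector_sequence = feature_extraction_model(first_atom_vector_sequence)  # (batch, atoms, 128)
```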
5. The method of classifying compounds according to claim 1, wherein said calculating, by a first classification model, a property feature vector of said sample compound from said feature vector sequence comprises:
calculating a water-solubility feature vector of the sample compound from the feature vector sequence using a water-solubility classification sub-model in the first classification model;
And calculating a toxicity feature vector of the sample compound from the feature vector sequence using a toxicity classification sub-model in the first classification model.
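Illustratively, the water-solubility and toxicity sub-models of claim 5 could share one first classification model as two output heads; the pooling over atoms and the dimensions are assumptions for this sketch:

```python
import torch.nn as nn

class FirstClassificationModel(nn.Module):
    """Illustrative two-head first classification model: a water-solubility
    classification sub-model and a toxicity classification sub-model."""
    def __init__(self, feature_dim=128, num_classes=2):
        super().__init__()
        self.water_solubility_head = nn.Linear(feature_dim, num_classes)
        self.toxicity_head = nn.Linear(feature_dim, num_classes)

    def forward(self, feature_vector_sequence):
        pooled = feature_vector_sequence.mean(dim=1)   # pool the feature vector sequence over atoms
        water_solubility_vector = self.water_solubility_head(pooled)
        toxicity_vector = self.toxicity_head(pooled)
        return water_solubility_vector, toxicity_vector
```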
6. A compound classification apparatus for implementing the compound classification method according to any one of claims 1 to 5, comprising:
a first obtaining module, configured to obtain a first atomic representation of a sample compound, obtain a first tag vector of the sample compound based on a compound property, and obtain a missing atom corresponding to the first atomic representation;
a first conversion module, configured to convert the first atomic representation into a first atomic vector sequence, and convert the missing atom into a second tag vector of the first atomic representation;
The extraction module is used for extracting the atomic characteristics of the compound by taking the first atomic vector sequence as input through a characteristic extraction model to obtain a characteristic vector sequence of the sample compound;
A calculation module, configured to calculate a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculate a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
The training module is used for training a property classification model formed by the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model formed by the feature extraction model and the second classification model according to the second tag vector and the missing atom vector;
A second acquisition module for acquiring a second atomic representation of the target compound to be classified;
A second conversion module for converting the second atomic representation into a second atomic vector sequence;
And a classification module, configured to classify the target compound through the trained property classification model, taking the second atomic vector sequence as input.
7. A computer device comprising a processor for executing computer readable instructions stored in a memory to implement the compound classification method of any one of claims 1 to 5.
8. A computer readable storage medium having computer readable instructions stored thereon, which when executed by a processor, implement the method of classifying a compound according to any one of claims 1 to 5.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202010917059.2A | 2020-09-03 | 2020-09-03 | Method for classifying compounds and related equipment |
Publications (2)
| Publication Number | Publication Date |
| --- | --- |
| CN111986740A | 2020-11-24 |
| CN111986740B | 2024-05-14 |
Family
ID=73448044
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202010917059.2A (Active) | Method for classifying compounds and related equipment | 2020-09-03 | 2020-09-03 |
Country Status (1)
| Country | Link |
| --- | --- |
| CN (1) | CN111986740B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018098588A1 (en) * | 2016-12-02 | 2018-06-07 | Lumiant Corporation | Computer systems for and methods of identifying non-elemental materials based on atomistic properties |
CN109493922A (en) * | 2018-11-19 | 2019-03-19 | 大连思利科环境科技有限公司 | Method for predicting molecular structure parameters of chemicals |
CN109658989A (en) * | 2018-11-14 | 2019-04-19 | 国网新疆电力有限公司信息通信公司 | Class drug compound toxicity prediction method based on deep learning |
CN110428864A (en) * | 2019-07-17 | 2019-11-08 | 大连大学 | Method for constructing the affinity prediction model of protein and small molecule |
CN110751230A (en) * | 2019-10-30 | 2020-02-04 | 深圳市太赫兹科技创新研究院有限公司 | Substance classification method, substance classification device, terminal device and storage medium |
CN110767271A (en) * | 2019-10-15 | 2020-02-07 | 腾讯科技(深圳)有限公司 | Compound property prediction method, device, computer device and readable storage medium |
CN110867254A (en) * | 2019-11-18 | 2020-03-06 | 北京市商汤科技开发有限公司 | Prediction method and device, electronic device and storage medium |
CN110957012A (en) * | 2019-11-28 | 2020-04-03 | 腾讯科技(深圳)有限公司 | Method, device, equipment and storage medium for analyzing properties of compound |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| JP2001513823A | 1997-10-07 | 2001-09-04 | New England Medical Center Hospitals Inc. | Rational design of papillomavirus infection-based compounds based on their structure |
Non-Patent Citations (3)
| Title |
| --- |
| Representation of compounds for machine-learning prediction of physical properties; Atsuto Seko et al.; Physical Review B; vol. 95, no. 14; 144110(1-11) * |
| Compound analysis based on machine learning (基于机器学习的化合物分析); An Qiangqiang; Contemporary Chemical Industry (当代化工); vol. 47, no. 1; pp. 38-40, 52 * |
| Support vector machine classification of aquatic toxicity modes of action of organic compounds (有机化合物水生毒性作用模式的支持向量机分类研究); Yi Zhongsheng et al.; Guangxi Sciences (广西科学); vol. 13, no. 1; pp. 31-34 * |
Legal Events
| Date | Code | Title | Description |
| --- | --- | --- | --- |
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| 2021-02-02 | TA01 | Transfer of patent application right | Applicant after: Shenzhen saiante Technology Service Co., Ltd., Room 201, Building A, No. 1 Qianwan Road, Qianhai Shenzhen-Hong Kong Cooperation Zone, Shenzhen, Guangdong 518000 (Shenzhen Qianhai Business Secretary Co., Ltd.). Applicant before: Ping An International Smart City Technology Co., Ltd., 1-34/F, Qianhai Free Trade Building, 3048 Xinghai Avenue, Mawan, Qianhai Shenzhen-Hong Kong Cooperation Zone, Shenzhen, Guangdong 518000. |
| | GR01 | Patent grant | |