CN111986740A - Compound classification method and related equipment - Google Patents
- Publication number
- CN111986740A (application CN202010917059.2A)
- Authority
- CN
- China
- Prior art keywords
- vector
- atom
- compound
- representation
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/90—Programming languages; Computing architectures; Database systems; Data warehousing
Abstract
The invention relates to the technical field of artificial intelligence and provides a compound classification method and related equipment. The compound classification method comprises the following steps: obtaining a first label vector of a sample compound based on a compound property; converting a first atomic representation of the sample compound into a first atom vector sequence, and converting the missing atom corresponding to the first atomic representation into a second label vector of the first atomic representation; training a property classification model composed of a feature extraction model and a first classification model according to the first label vector and a property feature vector, and training a missing atom prediction model composed of the feature extraction model and a second classification model according to the second label vector and a missing atom vector; and classifying a target compound through the trained property classification model with a second atom vector sequence of the target compound as input. The invention improves the efficiency of classifying compounds.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a compound classification method, a compound classification device, computer equipment and a computer readable storage medium.
Background
Compound classification is the basis of much work in biology and chemistry. In conventional compound classification methods, biologists and chemists must classify compounds manually using professional knowledge.
How to classify compounds based on artificial intelligence to improve classification efficiency is a problem to be solved.
Disclosure of Invention
In view of the above, there is a need for a compound classification method, apparatus, computer device and computer readable storage medium, which can classify compounds and improve the efficiency of classifying compounds.
A first aspect of the present application provides a compound classification method comprising:
obtaining a first atomic representation of a sample compound, a first label vector of the sample compound based on a compound property, and the missing atom corresponding to the first atomic representation;
converting the first atomic representation into a first atom vector sequence, and converting the missing atom into a second label vector of the first atomic representation;
extracting the atomic features of the compound by taking the first atomic vector sequence as input through a feature extraction model to obtain a feature vector sequence of the sample compound;
calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
training a property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector;
obtaining a second atomic representation of the target compound to be classified;
converting the second atomic representation into a second atom vector sequence;
and classifying the target compound by taking the second atom vector sequence as input through the trained property classification model.
In another possible implementation, the obtaining the first atomic representation of the sample compound includes:
obtaining a Simplified Molecular-Input Line-Entry System (SMILES) representation of the sample compound; or
Obtaining a molecular fingerprint representation of the sample compound; or
Obtaining an International Chemical Identifier (InChI)-based representation of the sample compound.
In another possible implementation, the converting the first atomic representation into a first atomic vector sequence includes:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the first atom representation;
splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the first atom representation to obtain a first atom vector of each atom in the first atom representation;
and combining the first atom vectors of the atoms in the first atom representation to obtain the first atom vector sequence.
In another possible implementation manner, the feature extraction model includes a BERT model, an RNN model, or a Transformer model.
In another possible implementation manner, the calculating, by the first classification model, a property feature vector of the sample compound according to the feature vector sequence includes:
calculating a water-soluble feature vector of the sample compound according to the feature vector sequence by using a water-soluble classification submodel in the first classification model;
and calculating the toxicity characteristic vector of the sample compound according to the characteristic vector sequence by using a toxicity classification submodel in the first classification model.
In another possible implementation manner, the training of the property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and the training of the missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector include:
calculating a first difference vector of the first label vector and the property feature vector;
calculating a second difference vector of the second label vector and the missing atom vector;
splicing the first difference vector and the second difference vector to obtain a third difference vector;
and optimizing parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector and the third difference vector by adopting a back propagation algorithm.
In another possible implementation manner, the optimizing, by using a back propagation algorithm, parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector, and the third difference vector includes:
synchronously optimizing parameters in the property classification model and the missing atom prediction model according to the third difference vector by adopting a back propagation algorithm; or
And optimizing parameters in the property classification model according to the first difference vector by adopting a back propagation algorithm, and asynchronously optimizing parameters in the missing atom prediction model according to the second difference vector by adopting the back propagation algorithm.
A second aspect of the present application provides a compound sorting device comprising:
a first obtaining module, configured to obtain a first atomic representation of a sample compound, a first label vector of the sample compound based on a compound property, and the missing atom corresponding to the first atomic representation;
a first conversion module, configured to convert the first atomic representation into a first atom vector sequence, and convert the missing atom into a second label vector of the first atomic representation;
the extraction module is used for extracting the atomic features of the compounds by taking the first atomic vector sequence as input through a feature extraction model to obtain a feature vector sequence of the sample compounds;
a calculation module, configured to calculate, by using a first classification model, a property feature vector of the sample compound according to the feature vector sequence, and calculate, by using a second classification model, a missing atom vector of the sample compound according to the feature vector sequence;
a training module, configured to train a property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and train a missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector;
a second obtaining module for obtaining a second atomic representation of the target compound to be classified;
a second conversion module for converting the second atomic representation into a second atomic vector sequence;
and the classification module is used for classifying the target compound by taking the second atom vector sequence as input through the trained property classification model.
A third aspect of the application provides a computer device comprising a processor for implementing the compound classification method when executing computer readable instructions stored in a memory.
A fourth aspect of the present application provides a computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor, implement the compound classification method.
In the method, the feature extraction model shared by the two models is pre-trained through both the missing atom prediction model and the property classification model, which improves how well the feature extraction model extracts compound atomic features and thus improves the accuracy with which the property classification model, composed of the feature extraction model and the first classification model, classifies compounds. Meanwhile, because the trained property classification model classifies the target compound with the second atom vector sequence as input, no expert is needed to classify the target compound, and the efficiency of classifying compounds is improved.
Drawings
FIG. 1 is a flow chart of a method of classifying compounds provided in an embodiment of the present invention.
FIG. 2 is a block diagram of a compound sorting apparatus according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention, and the described embodiments are merely a subset of the embodiments of the present invention, rather than a complete embodiment.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Preferably, the compound classification method of the present invention is applied in one or more computer devices. A computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
Example one
FIG. 1 is a flow chart of a method for classifying compounds according to an embodiment of the present invention. The compound classification method is applied to computer equipment and is used for classifying compounds and improving the efficiency of classifying the compounds.
As shown in fig. 1, the compound classification method includes:
101, obtaining a first atomic representation of a sample compound, a first label vector of the sample compound based on a compound property, and the missing atom corresponding to the first atomic representation.
In a specific embodiment, said obtaining a first atomic representation of a sample compound comprises:
obtaining a Simplified Molecular-Input Line-Entry System (SMILES) representation of the sample compound; or
Obtaining a molecular fingerprint (e.g., an Extended-Connectivity Fingerprint, ECFP) representation of the sample compound; or
An International Chemical Identifier (InChI)-based representation of the sample compound is obtained.
The sample compound is a compound with a randomly missing atom, where the missing atom is represented by a mask label. For example, if the intact compound is "CCCC(=O)NC1=CC=C(OCC(O)CNC(C)C)C(=C1)C(C)=O", the SMILES representation of the sample compound is "[cls]CCCC(=[mask])NC1=CC=C(OCC(O)CNC(C)C)C(=C1)C(C)=O[sep]", and the missing atom is "O". Here, "[cls]" is the start identifier and "[sep]" is the end identifier.
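The masking step above can be sketched in Python. This is an illustrative helper, not code from the patent; the atom set and character-level tokenization are simplifying assumptions (real SMILES tokenization must also handle multi-character atoms such as "Cl" and "Br"):

```python
import random

# Simplified sketch of building a masked training sample: pick one
# single-letter atom symbol in a SMILES string, replace it with the
# "[mask]" label, and wrap the result in the start/end identifiers.
ATOM_CHARS = set("BCNOPSFI")  # organic-subset atoms only, for brevity

def mask_random_atom(smiles, seed=0):
    rng = random.Random(seed)
    atom_positions = [i for i, ch in enumerate(smiles) if ch in ATOM_CHARS]
    i = rng.choice(atom_positions)
    masked = smiles[:i] + "[mask]" + smiles[i + 1:]
    # the removed symbol becomes the label for the missing atom prediction task
    return "[cls]" + masked + "[sep]", smiles[i]

masked, missing = mask_random_atom("CCO")  # ethanol
```

The pair (masked, missing) corresponds to the second label vector's source: the masked sequence is the model input and the removed atom is the prediction target.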
102, converting the first atomic representation into a first atom vector sequence, and converting the missing atom into a second label vector of the first atomic representation.
Through vector conversion, the first atomic representation and the missing atom are converted into vectors for convenient processing and feature extraction.
In a specific embodiment, said converting said first atomic representation into a first atom vector sequence comprises:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the first atom representation;
splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the first atom representation to obtain a first atom vector of each atom in the first atom representation;
and combining the first atom vectors of the atoms in the first atom representation to obtain the first atom vector sequence.
The coding sub-vector of each atom can be looked up in a preset coding table and is the unique identifier of the atom type. The position sub-vector of each atom may encode the position of the atom in the SMILES representation of the sample compound. The graph structure sub-vector of each atom may include atom structure information and/or atom connection information in the sample compound.
The coding sub-vector of the missing atom can be queried through the preset coding table, and the coding sub-vector of the missing atom is determined as the second tag vector.
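A minimal sketch of the sub-vector splicing described above, with an assumed toy coding table and toy graph-structure information (atom degree); neither the table nor the sub-vector dimensions are specified by the patent:

```python
# Build the first atom vector of each atom by concatenating its coding
# sub-vector, position sub-vector and graph structure sub-vector, then
# combine the per-atom vectors into the first atom vector sequence.
CODING_TABLE = {"C": [1.0, 0.0], "O": [0.0, 1.0]}  # assumed toy coding table

def atom_vector(symbol, position, degree):
    coding = CODING_TABLE[symbol]   # unique identifier of the atom type
    pos = [float(position)]         # index of the atom in the SMILES string
    graph = [float(degree)]         # toy graph info: number of bonded neighbours
    return coding + pos + graph     # splice the three sub-vectors

def atom_vector_sequence(atoms):
    # atoms: list of (symbol, position, degree) triples
    return [atom_vector(*a) for a in atoms]

seq = atom_vector_sequence([("C", 0, 1), ("C", 1, 2), ("O", 2, 1)])  # ethanol "CCO"
```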
And 103, extracting the atomic features of the compound by taking the first atomic vector sequence as input through a feature extraction model to obtain a feature vector sequence of the sample compound.
In one embodiment, the feature extraction model includes a BERT model, an RNN model, or a Transformer model.
BERT is a deep bidirectional language representation model based on the Transformer; it uses the Transformer structure to construct a multi-layer bidirectional encoder network. The Transformer is a deep model based on the self-attention mechanism for processing NLP tasks; it handles NLP tasks more effectively than an RNN and trains faster.
Specifically, the BERT model includes a plurality of neural network layers, each neural network layer includes a preset number of encoder modules, and each encoder module is a bidirectional Transformer encoder block. Each encoder block comprises a multi-head attention network, a first residual network, a first feedforward neural network, and a second residual network.
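The core operation of each encoder block is self-attention. A pure-Python, single-head sketch without learned projections follows; it is illustrative only and far simpler than BERT's multi-head attention:

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(x):
    # x: list of token vectors; queries = keys = values = x in this sketch
    d = len(x[0])
    out = []
    for q in x:
        # scaled dot-product scores of this query against every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)  # attention distribution over all tokens
        # weighted sum of value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

y = self_attention([[1.0, 0.0], [0.0, 1.0]])
# each output vector mixes both input vectors, weighted toward itself
```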
And 104, calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model.
In another embodiment, said calculating, by a first classification model, a property feature vector of said sample compound from said sequence of feature vectors comprises:
calculating a water-soluble feature vector of the sample compound according to the feature vector sequence by using a water-soluble classification submodel in the first classification model;
and calculating the toxicity characteristic vector of the sample compound according to the characteristic vector sequence by using a toxicity classification submodel in the first classification model.
The water-solubility classification submodel comprises a first fully connected network. On the basis that the feature vector sequence already carries the atomic features of the compound, the submodel fine-tunes the feature vector sequence with the parameters of the first fully connected network to obtain the property feature vector, so that the sample compound can be classified through the property feature vector. The parameters of the first fully connected network may be optimized by supervised water-solubility training to increase the accuracy of classifying the sample compound through the property feature vector. The water-solubility classification submodel may also include a second feedforward neural network, a first convolutional neural network, and the like.
The toxicity classification submodel comprises a second fully connected network. On the basis that the feature vector sequence already carries the atomic features of the compound, the submodel fine-tunes the feature vector sequence with the parameters of the second fully connected network to obtain the toxicity feature vector, so that the sample compound can be classified through the toxicity feature vector. The parameters of the second fully connected network may be optimized by supervised toxicity training to increase the accuracy of classifying the sample compound through the toxicity feature vector. The toxicity classification submodel may also include a third feedforward neural network, a second convolutional neural network, and the like.
The first classification model may also comprise a melting point classification submodel, a median inhibitory concentration (IC50) classification submodel, and the like.
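A property submodel's fully connected head could turn a pooled feature vector into a class probability as sketched below; the weights, threshold, and names are placeholder assumptions, since the real parameters come from supervised training:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def property_head(feature_vec, weights, bias):
    # one fully connected layer over the pooled feature vector,
    # followed by a sigmoid that yields P(property present)
    z = sum(w * f for w, f in zip(weights, feature_vec)) + bias
    return sigmoid(z)

p = property_head([0.2, -0.1, 0.4], weights=[1.0, 0.5, -0.3], bias=0.0)
label = "water-soluble" if p >= 0.5 else "insoluble"
```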
And 105, training a property classification model formed by the feature extraction model and the first classification model according to the first label vector and the property feature vector, and training a missing atom prediction model formed by the feature extraction model and the second classification model according to the second label vector and the missing atom vector.
In a specific embodiment, the training of the property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and the training of the missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector include:
calculating a first difference vector of the first label vector and the property feature vector;
calculating a second difference vector of the second label vector and the missing atom vector;
splicing the first difference vector and the second difference vector to obtain a third difference vector;
and optimizing parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector and the third difference vector by adopting a back propagation algorithm.
The first difference vector between the first label vector and the property feature vector, and the second difference vector between the second label vector and the missing atom vector, may each be calculated according to a cross-entropy loss function.
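As a worked illustration of the cross-entropy comparison (the exact loss formulation is not fixed by the patent, so this is one common choice):

```python
import math

def cross_entropy(label_vec, pred_vec, eps=1e-12):
    # distance between a one-hot label vector and a predicted
    # probability vector; smaller means the prediction is closer
    return -sum(l * math.log(p + eps) for l, p in zip(label_vec, pred_vec))

loss_close = cross_entropy([0.0, 1.0], [0.1, 0.9])  # prediction near the label
loss_far = cross_entropy([0.0, 1.0], [0.9, 0.1])    # prediction far from the label
```

The worse prediction yields the larger loss, which is what back propagation then drives down.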
In a specific embodiment, the optimizing parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector, and the third difference vector by using a back propagation algorithm includes:
synchronously optimizing parameters in the property classification model and the missing atom prediction model according to the third difference vector by adopting a back propagation algorithm; or
And optimizing parameters in the property classification model according to the first difference vector by adopting a back propagation algorithm, and asynchronously optimizing parameters in the missing atom prediction model according to the second difference vector by adopting the back propagation algorithm.
The first difference vector expresses the distance between the first label vector and the property feature vector, and the second difference vector expresses the distance between the second label vector and the missing atom vector.
When the property classification model and the missing atom prediction model are used as an integral model, splicing the first label vector and the second label vector to be used as an integral label vector of the integral model; and splicing the property feature vector and the missing atom vector to be used as an integral output vector of the integral model. The third difference vector expresses a distance of the global label vector from the global output vector.
And after parameters in the property classification model and the missing atom prediction model are synchronously optimized according to the third difference vector by adopting a back propagation algorithm, recalculating the overall output vector of the overall model, wherein the distance between the recalculated overall output vector of the overall model and the overall label vector is smaller, namely the classification accuracy of the overall model is higher.
Alternatively, a back propagation algorithm optimizes the parameters in the feature extraction model and the first classification model according to the first difference vector, and then asynchronously optimizes the parameters in the feature extraction model and the second classification model according to the second difference vector; performing the two optimizations asynchronously improves the speed of training the feature extraction model. Optimizing the parameters in the property classification model according to the first difference vector makes the distance between the property feature vector recalculated by the property classification model with the optimized parameters and the first label vector smaller, i.e., the property classification model classifies the compound based on the compound property more accurately. Optimizing the parameters in the missing atom prediction model according to the second difference vector makes the distance between the missing atom vector recalculated by the missing atom prediction model with the optimized parameters and the second label vector smaller, i.e., the missing atom prediction model predicts the missing atom in the input compound more accurately.
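The synchronous and asynchronous strategies can be contrasted schematically. In this pure-Python stand-in, "optimization" is reduced to recording scalar loss contributions per update step; a real implementation would backpropagate each step through the shared feature extraction model:

```python
def train_synchronous(batches, property_loss, missing_loss):
    # one joint update per batch, driven by the concatenated
    # (third) difference: both losses contribute to the same step
    return [property_loss(b) + missing_loss(b) for b in batches]

def train_asynchronous(batches, property_loss, missing_loss):
    # two separate passes: first updates on the property difference,
    # then independent updates on the missing-atom difference
    steps = [property_loss(b) for b in batches]
    steps += [missing_loss(b) for b in batches]
    return steps

# stand-in loss functions, purely for illustration
sync_steps = train_synchronous([1, 2], lambda b: b * 0.1, lambda b: b * 0.2)
async_steps = train_asynchronous([1, 2], lambda b: b * 0.1, lambda b: b * 0.2)
# synchronous: 2 joint steps; asynchronous: 4 smaller, separate steps
```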
106, obtaining a second atomic representation of the target compound to be classified.
A second atomic representation of the target compound may be obtained from a database, or received as user input.
The obtaining a second atomic representation of the target compound to be classified comprises:
obtaining a SMILES representation of the target compound; or
Obtaining a molecular fingerprint representation of the target compound; or
Obtaining an International Chemical Identifier (InChI)-based representation of the target compound.
The type of the second atomic representation is consistent with the type of the first atomic representation; for example, both are SMILES representations.
And 107, converting the second atom representation into a second atom vector sequence.
The second atomic representation is converted into a sequence of vectors for ease of processing and feature extraction by vector conversion.
Said converting said second atomic representation into a second atom vector sequence comprises:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the second atom representation;
splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the second atom representation to obtain a second atom vector of each atom in the second atom representation;
and combining second atom vectors of a plurality of atoms in the second atom representation to obtain the second atom vector sequence.
And 108, classifying the target compound by taking the second atom vector sequence as input through the trained property classification model.
After the parameters in the property classification model are optimized through training, the trained property classification model can classify the target compound based on compound property features.
The trained property classification model can comprise a water-solubility classification submodel, a toxicity classification submodel, a melting point classification submodel, a median inhibitory concentration classification submodel and the like.
The trained property classification model may classify the target compound according to one or more properties of the compound based on one or more sub-models. For example, the trained property classification model includes a water-solubility classification submodel and a toxicity classification submodel, and the trained property classification model can classify the target compound according to the water solubility and the toxicity of the compound, so as to obtain that the type of the target compound is a water-soluble compound and/or a non-toxic compound.
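At inference time the trained submodels could be applied together, as sketched below; the submodel names and stand-in scoring functions are illustrative assumptions, not code from the patent:

```python
def classify_target(feature_vec, submodels, threshold=0.5):
    # apply each trained property submodel to the shared feature
    # vector and report a positive/negative decision per property
    return {name: score(feature_vec) >= threshold
            for name, score in submodels.items()}

submodels = {
    "water_soluble": lambda f: f[0],  # stand-in for the water-solubility head
    "toxic": lambda f: f[1],          # stand-in for the toxicity head
}
result = classify_target([0.8, 0.2], submodels)
# a compound scoring high on solubility and low on toxicity is
# reported as water-soluble and non-toxic
```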
The compound classification method of example one obtains a first atomic representation of a sample compound and obtains a first tag vector of the sample compound based on a compound property and a corresponding missing atom of the first atomic representation; converts the first atom representation into a first atom vector sequence and converts the missing atom into a second tag vector of the first atom representation; extracts the atomic features of the compound by taking the first atom vector sequence as input through a feature extraction model to obtain a feature vector sequence of the sample compound; calculates a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculates a missing atom vector of the sample compound according to the feature vector sequence through a second classification model; trains a property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and trains a missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector; obtains a second atomic representation of the target compound to be classified; converts the second atomic representation into a second atom vector sequence; and classifies the target compound through the trained property classification model by taking the second atom vector sequence as input.
In the first embodiment, the feature extraction model shared by the missing atom prediction model and the property classification model is pre-trained through both models to improve the extraction effect of the feature extraction model on the atomic features of the compound, and thereby to improve the accuracy with which the property classification model composed of the feature extraction model and the first classification model classifies compounds. Meanwhile, the target compound is classified through the trained property classification model by taking the second atom vector sequence as input, so that manual classification by experts is not required and the efficiency of classifying compounds is improved.
Example two
FIG. 2 is a structural diagram of a compound classification apparatus according to a second embodiment of the present invention. The compound classification apparatus 20 is applied to a computer device and is used for classifying compounds, improving the efficiency of compound classification.
As shown in fig. 2, the compound classification apparatus 20 may include a first obtaining module 201, a first converting module 202, an extracting module 203, a calculating module 204, a training module 205, a second obtaining module 206, a second converting module 207, and a classifying module 208.
A first obtaining module 201, configured to obtain a first atomic representation of a sample compound, obtain a first tag vector of the sample compound based on a compound property and a corresponding missing atom of the first atomic representation.
In a specific embodiment, said obtaining a first atomic representation of a sample compound comprises:
obtaining a Simplified Molecular Input Line Entry System (SMILES) representation of the sample compound; or
obtaining a molecular fingerprint (e.g., an Extended-Connectivity Fingerprint, ECFP) representation of the sample compound; or
obtaining an International Chemical Identifier (InChI)-based representation of the sample compound.
The sample compounds are compounds with randomly missing atoms, and the missing atoms are represented by mask labels. For example, the intact compound is "CCCC(=O)NC1=CC=C(OCC(O)CNC(C)C)C(=C1)C(C)=O", the SMILES representation of the sample compound is "[cls]CCCC(=[mask])NC1=CC=C(OCC(O)CNC(C)C)C(=C1)C(C)=O[sep]", and the missing atom is "O". Here, "[cls]" is a start identifier and "[sep]" is an end identifier.
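The masking step can be sketched as follows. The single-character tokenizer is a deliberate simplification (a real SMILES tokenizer must also handle two-letter elements such as "Cl" and "Br" and bracketed atoms); the token names follow the patent's [cls]/[mask]/[sep] example.

```python
import random

def tokenize_smiles(smiles):
    """Naive tokenizer: one token per character (illustrative only)."""
    return list(smiles)

def mask_random_atom(smiles, rng=None):
    """Replace one randomly chosen atom token with [mask], wrap the
    sequence in [cls] ... [sep], and return (tokens, missing_atom)."""
    rng = rng or random
    tokens = tokenize_smiles(smiles)
    atom_positions = [i for i, t in enumerate(tokens) if t.isalpha()]
    pos = rng.choice(atom_positions)
    missing = tokens[pos]          # ground-truth label for pre-training
    tokens[pos] = "[mask]"
    return ["[cls]"] + tokens + ["[sep]"], missing
```

The returned `missing` atom is what the missing atom prediction model is later trained to recover.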
A first conversion module 202, configured to convert the first atom representation into a first atom vector sequence, and convert the missing atom into a second tag vector of the first atom representation.
Converting the first atom representation and the missing atom into a vector sequence for convenient processing and feature extraction through vector conversion.
In a specific embodiment, said converting said first atomic representation into a first atom vector sequence comprises:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the first atom representation;
splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the first atom representation to obtain a first atom vector of each atom in the first atom representation;
and combining the first atom vectors of the atoms in the first atom representation to obtain the first atom vector sequence.
The coding sub-vector of each atom can be looked up in a preset coding table, and the coding sub-vector of each atom is the unique identifier of that atom. The position sub-vector of each atom may be the position of the atom in the SMILES representation of the sample compound. The graph structure sub-vector of each atom may include atom structure information and/or atom connection information in the sample compound.
The coding sub-vector of the missing atom can be queried through the preset coding table, and the coding sub-vector of the missing atom is determined as the second tag vector.
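The splicing of the three sub-vectors might look like the following sketch; the coding table, its dimensions, and the contents of the graph structure sub-vector are hypothetical placeholders, not values from the text.

```python
import numpy as np

# Hypothetical preset coding table; the 4-dimensional codes are illustrative.
CODE_TABLE = {
    "C": np.array([1.0, 0.0, 0.0, 0.0]),
    "O": np.array([0.0, 1.0, 0.0, 0.0]),
    "N": np.array([0.0, 0.0, 1.0, 0.0]),
    "[mask]": np.array([0.0, 0.0, 0.0, 1.0]),
}

def atom_vector(token, position, graph_sub):
    """Splice the coding, position, and graph structure sub-vectors of
    one atom into a single atom vector."""
    pos_sub = np.array([float(position)])  # position in the SMILES string
    return np.concatenate([CODE_TABLE[token], pos_sub, graph_sub])

def atom_vector_sequence(tokens, graph_subs):
    """Combine the per-atom vectors into the atom vector sequence."""
    return np.stack([atom_vector(t, i, g)
                     for i, (t, g) in enumerate(zip(tokens, graph_subs))])
```

Under this sketch, the second tag vector is simply `CODE_TABLE[missing_atom]`, looked up in the same table.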
And the extracting module 203 is configured to extract the atomic features of the compound by using the first atomic vector sequence as an input through a feature extraction model, so as to obtain a feature vector sequence of the sample compound.
In one embodiment, the feature extraction model includes a BERT model, an RNN model, or a Transformer model.
BERT is a deep bidirectional language representation model based on the Transformer; it uses the Transformer structure to construct a multilayer bidirectional Encoder network. The Transformer is a deep model based on the self-attention mechanism for processing NLP tasks. The Transformer handles NLP tasks better than the RNN and trains faster.
Specifically, the BERT model includes a plurality of neural network layers, each neural network layer includes a preset number of Encoding modules, and each Encoding module includes an Encoding structure of a bidirectional Transformer. Each Encoding structure comprises a multi-head attention network, a first residual error network, a first feedforward neural network and a second residual error network.
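A minimal numpy sketch of one such Encoding structure follows; multi-head attention is reduced to a single head and layer normalization is omitted for brevity, and all weight shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over the atom vector sequence."""
    q, k, v = x @ wq, x @ wk, x @ wv
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

def encoding_block(x, wq, wk, wv, w1, w2):
    """Attention followed by the first residual connection, then a
    feed-forward network followed by the second residual connection."""
    x = x + self_attention(x, wq, wk, wv)   # first residual network
    x = x + np.maximum(x @ w1, 0.0) @ w2    # feed-forward + second residual
    return x
```

Stacking several such blocks gives the multilayer Encoder network described above.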
A calculating module 204, configured to calculate, by using a first classification model, a property feature vector of the sample compound according to the feature vector sequence, and calculate, by using a second classification model, a missing atom vector of the sample compound according to the feature vector sequence.
In another embodiment, said calculating, by a first classification model, a property feature vector of said sample compound from said sequence of feature vectors comprises:
calculating a water-soluble feature vector of the sample compound according to the feature vector sequence by using a water-soluble classification submodel in the first classification model;
and calculating the toxicity characteristic vector of the sample compound according to the characteristic vector sequence by using a toxicity classification submodel in the first classification model.
The water-soluble classification submodel comprises a first fully-connected network. On the basis that the feature vector sequence carries the atomic features of the compound, the submodel fine-tunes the feature vector sequence using the parameters of the first fully-connected network to obtain the property feature vector, so that the sample compound can be classified through the property feature vector. The parameters of the first fully-connected network may be optimized through supervised water-solubility training to improve the accuracy with which the property feature vector classifies the sample compound. The water-soluble classification submodel may also include a second feed-forward neural network, a first convolutional neural network, and the like.
The toxicity classification submodel comprises a second fully-connected network. On the basis that the feature vector sequence carries the atomic features of the compound, the submodel fine-tunes the feature vector sequence using the parameters of the second fully-connected network to obtain the toxicity feature vector, so that the sample compound can be classified through the toxicity feature vector. The parameters of the second fully-connected network may be optimized through supervised toxicity training to improve the accuracy with which the toxicity feature vector classifies the sample compound. The toxicity classification submodel may also include a third feed-forward neural network, a second convolutional neural network, and the like.
The first classification model may also comprise a melting point classification submodel, a median inhibitory concentration classification submodel, and the like.
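The fully-connected head of each sub-model could be sketched like this; the mean pooling over atoms and the binary sigmoid output are assumptions introduced for illustration, not details from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def property_head(feature_seq, w, b):
    """Hypothetical fully-connected sub-model: pool the feature vector
    sequence over atoms, then apply one linear layer and a sigmoid to
    produce a property probability (e.g. water-soluble or not)."""
    pooled = feature_seq.mean(axis=0)   # simple mean pooling over atoms
    return sigmoid(pooled @ w + b)
```

Each sub-model (water solubility, toxicity, melting point, median inhibitory concentration) would carry its own weights `w`, `b`, trained on the corresponding supervised labels.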
A training module 205, configured to train a property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and train a missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector.
In a specific embodiment, the training of the property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and the training of the missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector include:
calculating a first difference vector of the first label vector and the property feature vector;
calculating a second difference vector of the second tag vector and the missing atom vector;
splicing the first difference vector and the second difference vector to obtain a third difference vector;
and optimizing parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector and the third difference vector by adopting a back propagation algorithm.
A first difference vector between the first label vector and the property feature vector may be calculated according to a cross-entropy loss function, and a second difference vector between the second tag vector and the missing atom vector may be calculated according to a cross-entropy loss function.
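The two losses and their combination might be computed as follows; treating scalar cross-entropy losses as stand-ins for the patent's difference vectors (whose splicing corresponds to summing the losses) is a simplification made for illustration.

```python
import numpy as np

def cross_entropy(label, pred, eps=1e-12):
    """Cross-entropy between a one-hot label vector and a predicted
    probability vector."""
    return -float(np.sum(label * np.log(pred + eps)))

# first difference: first label vector vs. property feature vector
loss_property = cross_entropy(np.array([1.0, 0.0]), np.array([0.8, 0.2]))
# second difference: second tag vector vs. missing atom vector
loss_atom = cross_entropy(np.array([0.0, 1.0]), np.array([0.3, 0.7]))
# combined loss, playing the role of the spliced third difference vector
loss_joint = loss_property + loss_atom
```

Back-propagating `loss_joint` updates both models at once, while back-propagating the two losses separately corresponds to the asynchronous alternative described below.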
In a specific embodiment, the optimizing parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector, and the third difference vector by using a back propagation algorithm includes:
synchronously optimizing parameters in the property classification model and the missing atom prediction model according to the third difference vector by adopting a back propagation algorithm; or
And optimizing parameters in the property classification model according to the first difference vector by adopting a back propagation algorithm, and asynchronously optimizing parameters in the missing atom prediction model according to the second difference vector by adopting the back propagation algorithm.
The first difference vector expresses a distance between the first tag vector and the property feature vector, and the second difference vector expresses a distance between the second tag vector and the missing atom vector.
When the property classification model and the missing atom prediction model are used as an integral model, splicing the first label vector and the second label vector to be used as an integral label vector of the integral model; and splicing the property feature vector and the missing atom vector to be used as an integral output vector of the integral model. The third difference vector expresses a distance of the global label vector from the global output vector.
And after parameters in the property classification model and the missing atom prediction model are synchronously optimized according to the third difference vector by adopting a back propagation algorithm, recalculating the overall output vector of the overall model, wherein the distance between the recalculated overall output vector of the overall model and the overall label vector is smaller, namely the classification accuracy of the overall model is higher.
A back propagation algorithm is adopted to optimize parameters in the feature extraction model and the first classification model according to the first difference vector, and the back propagation algorithm is adopted to asynchronously optimize parameters in the feature extraction model and the second classification model according to the second difference vector; performing the two optimizations asynchronously improves the speed of training the feature extraction model. Optimizing the parameters in the property classification model according to the first difference vector makes the distance between the property feature vector recalculated by the property classification model based on the optimized parameters and the first label vector smaller, that is, the property classification model classifies the compound based on the compound property more accurately. Optimizing the parameters in the missing atom prediction model according to the second difference vector makes the distance between the missing atom vector recalculated by the missing atom prediction model based on the optimized parameters and the second tag vector smaller, that is, the missing atom prediction model predicts the missing atoms in the input compound more accurately.
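The two schedules can be contrasted on a toy problem in which a single shared parameter `w` stands in for the shared feature extraction model; the quadratic losses and the learning rate are purely illustrative assumptions.

```python
def grad_property(w):
    """Gradient of a toy property-classification loss (w - 1)^2."""
    return 2.0 * (w - 1.0)

def grad_atom(w):
    """Gradient of a toy missing-atom-prediction loss (w - 3)^2."""
    return 2.0 * (w - 3.0)

def synchronous_step(w, lr=0.1):
    """One joint update driven by the summed (third) difference."""
    return w - lr * (grad_property(w) + grad_atom(w))

def asynchronous_step(w, lr=0.1):
    """Two alternating updates, one per model, sharing the parameter."""
    w = w - lr * grad_property(w)
    return w - lr * grad_atom(w)
```

Both schedules pull the shared parameter toward a compromise between the two tasks; the asynchronous variant simply applies the two pulls one after the other instead of summing them.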
A second obtaining module 206 for obtaining a second atomic representation of the target compound to be classified.
A second atomic representation of the target compound may be obtained from a database; or
a second atomic representation of the target compound input by a user may be received.
The obtaining a second atomic representation of the target compound to be classified comprises:
obtaining a SMILES representation of the target compound; or
obtaining a molecular fingerprint representation of the target compound; or
obtaining an International Chemical Identifier (InChI)-based representation of the target compound.
The type of the second atomic representation is consistent with the type of the first atomic representation. For example, the type of the second atomic representation and the type of the first atomic representation are both SMILES representations.
A second conversion module 207 for converting the second atomic representation into a second atom vector sequence.
Through vector conversion, the second atomic representation is converted into a vector sequence that is convenient to process and from which features can be extracted.
Said converting said second atomic representation into a second atom vector sequence comprises:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the second atom representation;
splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the second atom representation to obtain a second atom vector of each atom in the second atom representation;
and combining second atom vectors of a plurality of atoms in the second atom representation to obtain the second atom vector sequence.
A classification module 208, configured to classify the target compound through the trained property classification model by taking the second atom vector sequence as input.
After the parameters in the property classification model are optimized through training, the trained property classification model can classify the target compound based on compound property features.
The trained property classification model can comprise a water-solubility classification submodel, a toxicity classification submodel, a melting point classification submodel, a median inhibitory concentration classification submodel and the like.
The trained property classification model may classify the target compound according to one or more properties of the compound based on one or more sub-models. For example, if the trained property classification model includes a water-solubility classification submodel and a toxicity classification submodel, the trained property classification model can classify the target compound according to the water solubility and the toxicity of the compound, so as to determine, for example, that the target compound is a water-soluble compound and/or a non-toxic compound.
The compound classification apparatus 20 of example two obtains a first atomic representation of a sample compound and obtains a first tag vector of the sample compound based on a compound property and a corresponding missing atom of the first atomic representation; converts the first atom representation into a first atom vector sequence and converts the missing atom into a second tag vector of the first atom representation; extracts the atomic features of the compound by taking the first atom vector sequence as input through a feature extraction model to obtain a feature vector sequence of the sample compound; calculates a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculates a missing atom vector of the sample compound according to the feature vector sequence through a second classification model; trains a property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and trains a missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector; obtains a second atomic representation of the target compound to be classified; converts the second atomic representation into a second atom vector sequence; and classifies the target compound through the trained property classification model by taking the second atom vector sequence as input. The embodiment pre-trains the feature extraction model shared by the missing atom prediction model and the property classification model to improve the extraction effect of the feature extraction model on the atomic features of the compound, and thereby improves the accuracy with which the property classification model composed of the feature extraction model and the first classification model classifies compounds.
Meanwhile, the target compound is classified through the trained property classification model by taking the second atom vector sequence as input, so that manual classification by experts is not required and the efficiency of classifying compounds is improved.
EXAMPLE III
The present embodiment provides a computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps in the above compound classification method embodiments, such as steps 101-108 shown in fig. 1:
101, obtaining a first atomic representation of a sample compound, and obtaining a first tag vector of the sample compound based on a compound property and a corresponding missing atom of the first atomic representation;
102, converting the first atom representation into a first atom vector sequence, and converting the missing atom into a second tag vector of the first atom representation;
103, extracting the atomic features of the compound by taking the first atomic vector sequence as input through a feature extraction model to obtain a feature vector sequence of the sample compound;
104, calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
105, training a property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector;
106, obtaining a second atomic representation of the target compound to be classified;
107, converting said second atomic representation into a second atom vector sequence;
and 108, classifying the target compound through the trained property classification model by taking the second atom vector sequence as input.
Alternatively, when the computer-readable instructions are executed by the processor, the functions of the modules in the above apparatus embodiment are implemented, for example, modules 201-208 in fig. 2:
a first obtaining module 201, configured to obtain a first atomic representation of a sample compound, obtain a first tag vector of the sample compound based on a compound property and a corresponding missing atom of the first atomic representation;
a first conversion module 202, configured to convert the first atom representation into a first atom vector sequence, and convert the missing atom into a second tag vector of the first atom representation;
an extracting module 203, configured to extract the atomic features of the compound by using the first atomic vector sequence as an input through a feature extraction model, so as to obtain a feature vector sequence of the sample compound;
a calculating module 204, configured to calculate a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculate a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
a training module 205, configured to train a property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and train a missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector;
a second obtaining module 206 for obtaining a second atomic representation of the target compound to be classified;
a second conversion module 207, configured to convert the second atomic representation into a second atom vector sequence;
a classification module 208, configured to classify the target compound through the trained property classification model by taking the second atom vector sequence as input.
Example four
Fig. 3 is a schematic diagram of a computer device according to a fourth embodiment of the present invention. The computer device 30 comprises a memory 301, a processor 302, and computer-readable instructions, such as a compound classification program, stored in the memory 301 and executable on the processor 302. The processor 302, when executing the computer-readable instructions, implements the steps in the above compound classification method embodiments, such as steps 101-108 shown in fig. 1:
101, obtaining a first atomic representation of a sample compound, and obtaining a first tag vector of the sample compound based on a compound property and a corresponding missing atom of the first atomic representation;
102, converting the first atom representation into a first atom vector sequence, and converting the missing atom into a second tag vector of the first atom representation;
103, extracting the atomic features of the compound by taking the first atomic vector sequence as input through a feature extraction model to obtain a feature vector sequence of the sample compound;
104, calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
105, training a property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector;
106, obtaining a second atomic representation of the target compound to be classified;
107, converting said second atomic representation into a second atom vector sequence;
and 108, classifying the target compound through the trained property classification model by taking the second atom vector sequence as input.
Alternatively, when the computer-readable instructions are executed by the processor, the functions of the modules in the above apparatus embodiment are implemented, for example, modules 201-208 in fig. 2:
a first obtaining module 201, configured to obtain a first atomic representation of a sample compound, obtain a first tag vector of the sample compound based on a compound property and a corresponding missing atom of the first atomic representation;
a first conversion module 202, configured to convert the first atom representation into a first atom vector sequence, and convert the missing atom into a second tag vector of the first atom representation;
an extracting module 203, configured to extract the atomic features of the compound by using the first atomic vector sequence as an input through a feature extraction model, so as to obtain a feature vector sequence of the sample compound;
a calculating module 204, configured to calculate a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculate a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
a training module 205, configured to train a property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and train a missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector;
a second obtaining module 206 for obtaining a second atomic representation of the target compound to be classified;
a second conversion module 207, configured to convert the second atomic representation into a second atom vector sequence;
a classification module 208, configured to classify the target compound through the trained property classification model by taking the second atom vector sequence as input.
Illustratively, the computer readable instructions may be partitioned into one or more modules that are stored in the memory 301 and executed by the processor 302 to perform the present method. The one or more modules may be a series of computer-readable instructions capable of performing certain functions and describing the execution of the computer-readable instructions in the computer device 30. For example, the computer readable instructions may be divided into a first obtaining module 201, a first transforming module 202, an extracting module 203, a calculating module 204, a training module 205, a second obtaining module 206, a second transforming module 207, and a classifying module 208 in fig. 2, and specific functions of each module are described in embodiment two.
Those skilled in the art will appreciate that fig. 3 is merely an example of the computer device 30 and does not constitute a limitation of the computer device 30, which may include more or fewer components than those shown, combine certain components, or have different components; for example, the computer device 30 may also include input and output devices, network access devices, buses, etc.
The Processor 302 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor 302 may be any conventional processor. The processor 302 is the control center of the computer device 30 and connects the various parts of the entire computer device 30 using various interfaces and lines.
The memory 301 may be used to store the computer-readable instructions, and the processor 302 implements the various functions of the computer device 30 by running the computer-readable instructions or modules stored in the memory 301 and invoking the data stored in the memory 301. The memory 301 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the computer device 30. In addition, the memory 301 may include a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, a Read-Only Memory (ROM), a Random Access Memory (RAM), or other non-volatile/volatile storage devices.
The modules integrated by the computer device 30 may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, all or part of the flow of the method according to the above embodiments may also be implemented by instructing related hardware through computer-readable instructions, which may be stored in a computer-readable storage medium; when the computer-readable instructions are executed by a processor, the steps of the method embodiments may be implemented. The computer-readable instructions comprise computer-readable instruction code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), etc.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the compound classification method according to various embodiments of the present invention.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned. Furthermore, it is to be understood that the word "comprising" does not exclude other modules or steps, and the singular does not exclude the plural. A plurality of modules or means recited in the system claims may also be implemented by one module or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.
Claims (10)
1. A compound classification method, comprising:
obtaining a first atomic representation of a sample compound, and obtaining a first label vector of the sample compound based on a compound property and a missing atom corresponding to the first atomic representation;
converting the first atomic representation into a first atom vector sequence, and converting the missing atom into a second label vector of the first atomic representation;
extracting atomic features of the sample compound by taking the first atom vector sequence as input to a feature extraction model, to obtain a feature vector sequence of the sample compound;
calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
training a property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector;
obtaining a second atomic representation of the target compound to be classified;
converting the second atomic representation into a second atom vector sequence;
and classifying the target compound by taking the second atom vector sequence as input to the trained property classification model.
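For illustration only (not part of the claims), the inference path of claim 1 can be sketched in Python under toy assumptions: the atom symbols, the 4-dimensional lookup embedding, the identity "feature extraction" stand-in, and the mean-pool threshold classifier below are all hypothetical choices, not the patented models.

```python
import random

random.seed(0)

# Toy vocabulary and embedding table; both are illustrative assumptions.
VOCAB = {"C": 0, "O": 1, "N": 2, "[MASK]": 3}
DIM = 4
EMBED = [[random.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in VOCAB]

def to_atom_vector_sequence(atoms):
    """Convert an atomic representation (atom symbols) to an atom vector sequence."""
    return [EMBED[VOCAB[a]] for a in atoms]

def extract_features(atom_vectors):
    """Stand-in for the feature extraction model (claim 4 names BERT/RNN/Transformer)."""
    return atom_vectors  # identity pass-through for the sketch

def classify(feature_seq):
    """Stand-in first classification model: mean-pool the sequence, then threshold."""
    pooled = [sum(col) / len(feature_seq) for col in zip(*feature_seq)]
    return 1 if sum(pooled) > 0.0 else 0

# Classify a target compound given its (toy) atomic representation.
label = classify(extract_features(to_atom_vector_sequence(["C", "O", "C"])))
```

In a real system the embedding, feature extractor and classifier would be trained jointly as described in the claims; here they only fix the data flow: atomic representation, then atom vector sequence, then feature vector sequence, then class label.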
2. The compound classification method of claim 1, wherein the obtaining a first atomic representation of a sample compound comprises:
obtaining a Simplified Molecular Input Line Entry System (SMILES) representation of the sample compound; or
obtaining a molecular fingerprint representation of the sample compound; or
obtaining an International Chemical Identifier (InChI) based representation of the sample compound.
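As a hedged illustration of the SMILES representation named in claim 2: the regular expression below is a simplified, assumed tokenization (it ignores stereochemistry marks and many element symbols); production systems would rely on a cheminformatics toolkit such as RDKit instead.

```python
import re

# Simplified SMILES atom pattern: bracket atoms first, then the two-letter
# halogens, then common single-letter organic-subset atoms (upper- and
# lowercase/aromatic). An assumption, not a complete SMILES grammar.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")

def smiles_atoms(smiles):
    """Return the atom tokens of a SMILES string, skipping bonds and ring digits."""
    return ATOM_PATTERN.findall(smiles)

print(smiles_atoms("CCO"))      # ethanol     -> ['C', 'C', 'O']
print(smiles_atoms("CC(=O)O"))  # acetic acid -> ['C', 'C', 'O', 'O']
```

The alternation order matters: `Cl` must be tried before the single-letter class so that chlorine is not split into carbon plus a stray `l`.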
3. The compound classification method of claim 1, wherein the converting the first atomic representation into a first atom vector sequence comprises:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the first atomic representation;
splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the first atomic representation to obtain a first atom vector of each atom in the first atomic representation;
and combining the first atom vectors of the atoms in the first atomic representation to obtain the first atom vector sequence.
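The "splicing" in claim 3 can be pictured as plain concatenation. In this sketch the sub-vector contents (2-dimensional coding and graph-structure sub-vectors, an index-based position sub-vector) are toy assumptions made only to show the mechanics.

```python
def first_atom_vector(coding, position, graph):
    """Splice (concatenate) the three sub-vectors into one atom vector."""
    return coding + position + graph

def first_atom_vector_sequence(atoms):
    """Combine the per-atom vectors into the first atom vector sequence."""
    sequence = []
    for index, (coding, graph) in enumerate(atoms):
        position = [float(index)]  # toy position sub-vector: the atom index
        sequence.append(first_atom_vector(coding, position, graph))
    return sequence

# Two atoms with 2-dim coding and graph sub-vectors -> 5-dim atom vectors.
atoms = [([1.0, 0.0], [0.5, 0.5]), ([0.0, 1.0], [0.25, 0.75])]
sequence = first_atom_vector_sequence(atoms)
print(sequence[0])  # [1.0, 0.0, 0.0, 0.5, 0.5]
```

The resulting per-atom vector dimension is simply the sum of the three sub-vector dimensions, which is why the sequence can feed a fixed-width feature extraction model.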
4. The compound classification method of claim 1, wherein the feature extraction model comprises a BERT model, an RNN model, or a Transformer model.
5. The compound classification method of claim 1, wherein the calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model comprises:
calculating a water-solubility feature vector of the sample compound according to the feature vector sequence by using a water-solubility classification submodel in the first classification model;
and calculating a toxicity feature vector of the sample compound according to the feature vector sequence by using a toxicity classification submodel in the first classification model.
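One way to read claim 5 is two submodels scoring the same shared feature vector sequence. The mean-pooling and the linear weights below are made-up assumptions for illustration, not the patented submodels.

```python
def mean_pool(feature_seq):
    """Collapse a feature vector sequence into one pooled vector."""
    return [sum(col) / len(feature_seq) for col in zip(*feature_seq)]

def linear_head(weights, pooled):
    """A minimal one-layer classification submodel: a dot product."""
    return sum(w * x for w, x in zip(weights, pooled))

W_SOLUBILITY = [0.8, -0.2]  # hypothetical water-solubility head weights
W_TOXICITY = [-0.5, 0.9]    # hypothetical toxicity head weights

features = [[1.0, 0.0], [0.0, 1.0]]  # toy feature vector sequence
pooled = mean_pool(features)          # -> [0.5, 0.5]
solubility_score = linear_head(W_SOLUBILITY, pooled)
toxicity_score = linear_head(W_TOXICITY, pooled)
```

Because both heads consume the same pooled features, the expensive feature extraction runs once per compound regardless of how many properties are scored.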
6. The compound classification method of claim 1, wherein the training a property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector comprises:
calculating a first difference vector of the first label vector and the property feature vector;
calculating a second difference vector of the second label vector and the missing atom vector;
splicing the first difference vector and the second difference vector to obtain a third difference vector;
and optimizing parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector and the third difference vector by adopting a back propagation algorithm.
7. The compound classification method according to claim 6, wherein the optimizing parameters in the property classification model and the missing atom prediction model from the first difference vector, the second difference vector, and the third difference vector using a back propagation algorithm comprises:
synchronously optimizing parameters in the property classification model and the missing atom prediction model according to the third difference vector by adopting a back propagation algorithm; or
and optimizing parameters in the property classification model according to the first difference vector by adopting a back propagation algorithm, and asynchronously optimizing parameters in the missing atom prediction model according to the second difference vector by adopting the back propagation algorithm.
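Claims 6 and 7 can be sketched as follows: form the first and second difference vectors, splice them into a third, then update parameters either synchronously (one step driven by the combined difference) or asynchronously (separate steps per model). The toy labels, predictions, and plain gradient-style update below are assumptions, not the patented optimization.

```python
def difference(label, prediction):
    """Elementwise difference vector between a label and a prediction."""
    return [l - p for l, p in zip(label, prediction)]

first_diff = difference([1.0, 0.0], [0.8, 0.1])   # property head error
second_diff = difference([0.0, 1.0], [0.2, 0.7])  # missing-atom head error
third_diff = first_diff + second_diff             # spliced (concatenated)

def step(params, diff, lr=0.1):
    """Toy parameter update: move each parameter along its error component."""
    return [p + lr * d for p, d in zip(params, diff)]

# Synchronous: one update driven by the combined third difference vector.
shared = step([0.0, 0.0, 0.0, 0.0], third_diff)

# Asynchronous: each model updated separately from its own difference vector.
prop_params = step([0.0, 0.0], first_diff)
atom_params = step([0.0, 0.0], second_diff)
```

In a real implementation the difference vectors would feed a loss function and back propagation through the shared feature extraction model; the sketch only shows how the two training signals are combined or kept separate.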
8. A compound sorting device, comprising:
a first obtaining module, configured to obtain a first atomic representation of a sample compound, and obtain a first label vector of the sample compound based on a compound property and a missing atom corresponding to the first atomic representation;
a first conversion module, configured to convert the first atomic representation into a first atom vector sequence, and convert the missing atom into a second label vector of the first atomic representation;
an extraction module, configured to extract atomic features of the sample compound by taking the first atom vector sequence as input to a feature extraction model, to obtain a feature vector sequence of the sample compound;
a calculation module, configured to calculate, by using a first classification model, a property feature vector of the sample compound according to the feature vector sequence, and calculate, by using a second classification model, a missing atom vector of the sample compound according to the feature vector sequence;
a training module, configured to train a property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and train a missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector;
a second obtaining module for obtaining a second atomic representation of the target compound to be classified;
a second conversion module, configured to convert the second atomic representation into a second atom vector sequence;
and a classification module, configured to classify the target compound by taking the second atom vector sequence as input to the trained property classification model.
9. A computer device comprising a processor for executing computer readable instructions stored in a memory to implement a compound classification method as claimed in any one of claims 1 to 7.
10. A computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, carry out a compound classification method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010917059.2A CN111986740B (en) | 2020-09-03 | 2020-09-03 | Method for classifying compounds and related equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111986740A true CN111986740A (en) | 2020-11-24 |
CN111986740B CN111986740B (en) | 2024-05-14 |
Family
ID=73448044
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010917059.2A Active CN111986740B (en) | 2020-09-03 | 2020-09-03 | Method for classifying compounds and related equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111986740B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU1067699A (en) * | 1997-10-07 | 1999-04-27 | New England Medical Center Hospitals, Inc., The | Structure-based rational design of compounds to inhibit papillomavirus infection
WO2018098588A1 (en) * | 2016-12-02 | 2018-06-07 | Lumiant Corporation | Computer systems for and methods of identifying non-elemental materials based on atomistic properties |
CN109493922A (en) * | 2018-11-19 | 2019-03-19 | 大连思利科环境科技有限公司 | Method for predicting molecular structure parameters of chemicals |
CN109658989A (en) * | 2018-11-14 | 2019-04-19 | 国网新疆电力有限公司信息通信公司 | Class drug compound toxicity prediction method based on deep learning |
CN110428864A (en) * | 2019-07-17 | 2019-11-08 | 大连大学 | Method for constructing the affinity prediction model of protein and small molecule |
CN110751230A (en) * | 2019-10-30 | 2020-02-04 | 深圳市太赫兹科技创新研究院有限公司 | Substance classification method, substance classification device, terminal device and storage medium |
CN110767271A (en) * | 2019-10-15 | 2020-02-07 | 腾讯科技(深圳)有限公司 | Compound property prediction method, device, computer device and readable storage medium |
CN110867254A (en) * | 2019-11-18 | 2020-03-06 | 北京市商汤科技开发有限公司 | Prediction method and device, electronic device and storage medium |
CN110957012A (en) * | 2019-11-28 | 2020-04-03 | 腾讯科技(深圳)有限公司 | Method, device, equipment and storage medium for analyzing properties of compound |
Non-Patent Citations (3)
Title |
---|
Atsuto Seko et al.: "Representation of compounds for machine-learning prediction of physical properties", Physical Review B, vol. 95, no. 14, pages 1-11 *
An Qiangqiang: "Compound analysis based on machine learning", Contemporary Chemical Industry, vol. 47, no. 1, pages 38-40 *
Yi Zhongsheng et al.: "Support vector machine classification study of aquatic toxicity modes of action of organic compounds", Guangxi Sciences, vol. 13, no. 1, pages 31-34 *
Also Published As
Publication number | Publication date |
---|---|
CN111986740B (en) | 2024-05-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7193252B2 (en) | Captioning image regions | |
CN108959482B (en) | Single-round dialogue data classification method and device based on deep learning and electronic equipment | |
Xie et al. | Generative VoxelNet: Learning energy-based models for 3D shape synthesis and analysis | |
CN111738016A (en) | Multi-intention recognition method and related equipment | |
CN112559784A (en) | Image classification method and system based on incremental learning | |
CN111461168A (en) | Training sample expansion method and device, electronic equipment and storage medium | |
US9378464B2 (en) | Discriminative learning via hierarchical transformations | |
CN110704621A (en) | Text processing method and device, storage medium and electronic equipment | |
CN111460812B (en) | Sentence emotion classification method and related equipment | |
JP7229345B2 (en) | Sentence processing method, sentence decoding method, device, program and device | |
CN113761197B (en) | Application form multi-label hierarchical classification method capable of utilizing expert knowledge | |
CN112086144A (en) | Molecule generation method, molecule generation device, electronic device, and storage medium | |
CN111639500A (en) | Semantic role labeling method and device, computer equipment and storage medium | |
CN110704543A (en) | Multi-type multi-platform information data self-adaptive fusion system and method | |
CN113111190A (en) | Knowledge-driven dialog generation method and device | |
CN111027681B (en) | Time sequence data processing model training method, data processing method, device and storage medium | |
CN115796182A (en) | Multi-modal named entity recognition method based on entity-level cross-modal interaction | |
CN114896067A (en) | Automatic generation method and device of task request information, computer equipment and medium | |
CN113239702A (en) | Intention recognition method and device and electronic equipment | |
CN113723077B (en) | Sentence vector generation method and device based on bidirectional characterization model and computer equipment | |
CN112036439B (en) | Dependency relationship classification method and related equipment | |
CN111767720B (en) | Title generation method, computer and readable storage medium | |
Tomer et al. | STV-BEATS: skip thought vector and bi-encoder based automatic text summarizer | |
US20230281826A1 (en) | Panoptic segmentation with multi-database training using mixed embedding | |
CN111259673A (en) | Feedback sequence multi-task learning-based law decision prediction method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
2021-02-02 | TA01 | Transfer of patent application right | Applicant after: Shenzhen saiante Technology Service Co., Ltd., Room 201, Building A, No. 1 Qianwan Road, Qianhai Shenzhen-Hong Kong Cooperation Zone, Shenzhen, Guangdong 518000 (Shenzhen Qianhai Business Secretary Co., Ltd.). Applicant before: Ping An International Smart City Technology Co., Ltd., 1-34/F, Qianhai Free Trade Building, 3048 Xinghai Avenue, Mawan, Qianhai Shenzhen-Hong Kong Cooperation Zone, Shenzhen, Guangdong 518000.
| GR01 | Patent grant | |