CN111986740A - Compound classification method and related equipment - Google Patents

Compound classification method and related equipment

Info

Publication number
CN111986740A
CN111986740A (application CN202010917059.2A)
Authority
CN
China
Prior art keywords
vector
atom
compound
representation
model
Prior art date
Legal status
Granted
Application number
CN202010917059.2A
Other languages
Chinese (zh)
Other versions
CN111986740B (en)
Inventor
李恬静 (Li Tianjing)
朱威 (Zhu Wei)
Current Assignee
Shenzhen Saiante Technology Service Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd
Priority to CN202010917059.2A
Publication of CN111986740A
Application granted
Publication of CN111986740B
Status: Active

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C 20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C 20/70: Machine learning, data mining or chemometrics
    • G16C 20/90: Programming languages; Computing architectures; Database systems; Data warehousing
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Abstract

The invention relates to the technical field of artificial intelligence and provides a compound classification method and related equipment. The compound classification method comprises the following steps: obtaining a first label vector of a sample compound based on a compound property; converting a first atomic representation of the sample compound into a first atom vector sequence, and converting the missing atom corresponding to the first atomic representation into a second label vector of the first atomic representation; training a property classification model formed by a feature extraction model and a first classification model according to the first label vector and the property feature vector, and training a missing atom prediction model formed by the feature extraction model and a second classification model according to the second label vector and the missing atom vector; and classifying the target compound through the trained property classification model with the second atom vector of the target compound as input. The invention improves the efficiency of classifying compounds.

Description

Compound classification method and related equipment
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a compound classification method, a compound classification device, computer equipment and a computer readable storage medium.
Background
Compound classification is the basis of much biological and chemical work. In conventional compound classification methods, biologists and chemists are required to classify compounds using their professional knowledge.
How to classify compounds based on artificial intelligence to improve classification efficiency is a problem to be solved.
Disclosure of Invention
In view of the above, there is a need for a compound classification method, apparatus, computer device and computer readable storage medium, which can classify compounds and improve the efficiency of classifying compounds.
A first aspect of the present application provides a compound classification method comprising:
obtaining a first atomic representation of a sample compound, obtaining a first tag vector of the sample compound based on a compound property and a corresponding missing atom of the first atomic representation;
converting the first atom representation to a first atom vector sequence, converting the missing atom to a second tag vector of the first atom representation;
extracting the atomic features of the compound by taking the first atomic vector sequence as input through a feature extraction model to obtain a feature vector sequence of the sample compound;
calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
training a property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector;
obtaining a second atomic representation of the target compound to be classified;
converting the second atomic representation into a second atom vector sequence;
and classifying the target compound by using the second atom vector as input through the trained property classification model.
In another possible implementation, the obtaining the first atomic representation of the sample compound includes:
obtaining a simplified molecular input line entry system (SMILES) representation of the sample compound; or
Obtaining a molecular fingerprint representation of the sample compound; or
Obtaining an International Chemical Identifier (InChI) based representation of the sample compound.
In another possible implementation, the converting the first atomic representation into a first atomic vector sequence includes:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the first atom representation;
splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the first atom representation to obtain a first atom vector of each atom in the first atom representation;
and combining the first atom vectors of the atoms in the first atom representation to obtain the first atom vector sequence.
In another possible implementation manner, the feature extraction model includes a BERT model, an RNN model, or a Transformer model.
In another possible implementation manner, the calculating, by the first classification model, a characteristic feature vector of the sample compound according to the feature vector sequence includes:
calculating a water-soluble feature vector of the sample compound according to the feature vector sequence by using a water-soluble classification submodel in the first classification model;
and calculating the toxicity characteristic vector of the sample compound according to the characteristic vector sequence by using a toxicity classification submodel in the first classification model.
In another possible implementation manner, the training of the property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and the training of the missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector include:
calculating a first difference vector of the first label vector and the property feature vector;
calculating a second difference vector of the second tag vector and the missing atom vector;
splicing the first difference vector and the second difference vector to obtain a third difference vector;
and optimizing parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector and the third difference vector by adopting a back propagation algorithm.
In another possible implementation manner, the optimizing, by using a back propagation algorithm, parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector, and the third difference vector includes:
synchronously optimizing parameters in the property classification model and the missing atom prediction model according to the third difference vector by adopting a back propagation algorithm; or
And optimizing parameters in the property classification model according to the first difference vector by adopting a back propagation algorithm, and asynchronously optimizing parameters in the missing atom prediction model according to the second difference vector by adopting the back propagation algorithm.
A second aspect of the present application provides a compound classification apparatus comprising:
a first obtaining module, configured to obtain a first atomic representation of a sample compound, obtain a first tag vector of the sample compound based on a compound property and a corresponding missing atom of the first atomic representation;
a first conversion module, configured to convert the first atom representation into a first atom vector sequence, and convert the missing atom into a second tag vector of the first atom representation;
the extraction module is used for extracting the atomic features of the compounds by taking the first atomic vector sequence as input through a feature extraction model to obtain a feature vector sequence of the sample compounds;
a calculation module, configured to calculate, by using a first classification model, a property feature vector of the sample compound according to the feature vector sequence, and calculate, by using a second classification model, a missing atom vector of the sample compound according to the feature vector sequence;
a training module, configured to train a property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and train a missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector;
a second obtaining module for obtaining a second atomic representation of the target compound to be classified;
a second conversion module for converting the second atomic representation into a second atomic vector sequence;
and the classification module is used for classifying the target compound by taking the second atom vector as input through the trained property classification model.
A third aspect of the application provides a computer device comprising a processor for implementing the compound classification method when executing computer readable instructions stored in a memory.
A fourth aspect of the present application provides a computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor, implement the compound classification method.
According to the method, the feature extraction model shared by the two models is pre-trained through the missing atom prediction model and the property classification model, which improves the extraction effect of the feature extraction model on compound atomic features and thereby improves the accuracy of classifying compounds through the property classification model formed by the feature extraction model and the first classification model. Meanwhile, the target compound is classified through the trained property classification model with the second atom vector as input, so that no expert is needed to classify the target compound, and the efficiency of classifying compounds is improved.
Drawings
FIG. 1 is a flow chart of a method of classifying compounds provided in an embodiment of the present invention.
FIG. 2 is a block diagram of a compound classification apparatus according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention; the described embodiments are merely a subset of the embodiments of the present invention rather than all of the embodiments.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Preferably, the compound classification method of the present invention is applied in one or more computer devices. The computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
Example one
FIG. 1 is a flow chart of a method for classifying compounds according to an embodiment of the present invention. The compound classification method is applied to computer equipment and is used for classifying compounds and improving the efficiency of classifying the compounds.
As shown in fig. 1, the compound classification method includes:
101, obtaining a first atomic representation of a sample compound, obtaining a first tag vector of the sample compound based on a compound property and the corresponding missing atom of the first atomic representation.
In a specific embodiment, said obtaining a first atomic representation of a sample compound comprises:
obtaining a Simplified Molecular Input Line Entry System (SMILES) representation of the sample compound; or
Obtaining a molecular fingerprint (FECP) representation of the sample compound; or
Obtaining an International Chemical Identifier (InChI) based representation of the sample compound.
The sample compounds are compounds with randomly missing atoms, and the missing atoms are represented by mask labels. For example, if the intact compound is "CCCC(=O)NC1=CC=C(OCC(O)CNC(C)C)C(=C1)C(C)=O", the SMILES representation of the sample compound is "[cls]CCCC(=[mask])NC1=CC=C(OCC(O)CNC(C)C)C(=C1)C(C)=O[sep]", and the missing atom is "O". Here, "[cls]" is a start identifier and "[sep]" is an end identifier.
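As a rough illustration of how such a masked training sample could be constructed, the Python sketch below replaces one randomly chosen atom of a SMILES string with the mask label and wraps the result in the start and end identifiers. The single-character tokenization and the helper name are assumptions made for illustration and are not taken from the patent.

    import random

    def make_masked_sample(smiles: str, atom_symbols=("C", "N", "O", "S", "P", "F")):
        """Randomly replace one atom in a SMILES string with a [mask] label.

        Returns the masked representation wrapped in [cls]/[sep] identifiers and
        the masked-out atom, which becomes the label of the missing atom
        prediction task. Two-letter elements and aromatic atoms are ignored
        here for brevity.
        """
        positions = [i for i, ch in enumerate(smiles) if ch in atom_symbols]
        pos = random.choice(positions)
        missing_atom = smiles[pos]
        masked = smiles[:pos] + "[mask]" + smiles[pos + 1:]
        return "[cls]" + masked + "[sep]", missing_atom

    # Example with the intact compound from the text above:
    sample, atom = make_masked_sample("CCCC(=O)NC1=CC=C(OCC(O)CNC(C)C)C(=C1)C(C)=O")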
102, converting the first atom representation into a first atom vector sequence, and converting the missing atom into a second tag vector of the first atom representation.
Converting the first atom representation and the missing atom into a vector sequence for convenient processing and feature extraction through vector conversion.
In a specific embodiment, said converting said first atomic representation into a first atom vector sequence comprises:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the first atom representation;
splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the first atom representation to obtain a first atom vector of each atom in the first atom representation;
and combining the first atom vectors of the atoms in the first atom representation to obtain the first atom vector sequence.
The coding sub-vector of each atom can be looked up in a preset coding table and is the unique identifier of that atom. The position sub-vector of each atom may be the position of the atom in the SMILES representation of the sample compound. The graph structure sub-vector of each atom may include atom structure information and/or atom connection information in the sample compound.
The coding sub-vector of the missing atom can be queried through the preset coding table, and the coding sub-vector of the missing atom is determined as the second tag vector.
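A minimal Python sketch of this splicing step is given below. The use of NumPy, the sub-vector dimensions, and the names coding_table and graph_info are assumptions, since the patent does not specify how the sub-vectors are produced.

    import numpy as np

    def build_atom_vector_sequence(atom_tokens, coding_table, graph_info, pos_dim=4):
        """Concatenate the coding, position and graph structure sub-vectors of
        each atom, then stack the per-atom vectors into the atom vector sequence."""
        rows = []
        for position, atom in enumerate(atom_tokens):
            coding = np.asarray(coding_table[atom], dtype=float)   # unique identifier of the atom
            pos = np.full(pos_dim, float(position))                # position in the SMILES representation
            graph = np.asarray(graph_info[position], dtype=float)  # structure / connection information
            rows.append(np.concatenate([coding, pos, graph]))
        return np.stack(rows)

    # Hypothetical usage: a 3-atom fragment with one-hot codes and 2-d graph features.
    table = {"C": [1, 0, 0], "N": [0, 1, 0], "O": [0, 0, 1]}
    seq = build_atom_vector_sequence(["C", "N", "O"], table, [[1, 0], [2, 1], [1, 2]])
    print(seq.shape)  # (3, 3 + 4 + 2) = (3, 9)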
And 103, extracting the atomic features of the compound by taking the first atomic vector sequence as input through a feature extraction model to obtain a feature vector sequence of the sample compound.
In one embodiment, the feature extraction model includes a BERT model, an RNN model, or a Transformer model.
BERT is a deep bidirectional language representation model based on the Transformer; it uses the Transformer structure to build a multilayer bidirectional Encoder network. The Transformer is a deep model based on the self-attention mechanism for processing NLP tasks. The Transformer handles NLP tasks better than the RNN and trains faster.
Specifically, the BERT model includes a plurality of neural network layers, each neural network layer includes a preset number of Encoding modules, and each Encoding module includes an Encoding structure of a bidirectional Transformer. Each Encoding structure comprises a multi-head attention network, a first residual network, a first feed-forward neural network and a second residual network.
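For concreteness, the following is a minimal PyTorch sketch of one such Encoding structure. The hidden sizes, the activation, and the placement of layer normalization are illustrative assumptions rather than values fixed by the patent.

    import torch
    import torch.nn as nn

    class EncodingStructure(nn.Module):
        """Multi-head attention network -> first residual network ->
        feed-forward neural network -> second residual network."""

        def __init__(self, d_model: int = 256, n_heads: int = 8, d_ff: int = 1024):
            super().__init__()
            self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(d_model)
            self.feed_forward = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, sequence length, d_model) atom vector sequence
            attn_out, _ = self.attention(x, x, x)
            x = self.norm1(x + attn_out)              # first residual connection
            x = self.norm2(x + self.feed_forward(x))  # second residual connection
            return x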
And 104, calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model.
In another embodiment, said calculating, by a first classification model, a property feature vector of said sample compound from said sequence of feature vectors comprises:
calculating a water-soluble feature vector of the sample compound according to the feature vector sequence by using a water-soluble classification submodel in the first classification model;
and calculating the toxicity feature vector of the sample compound according to the feature vector sequence by using a toxicity classification submodel in the first classification model.
The water-soluble classification submodel comprises a first full-connection network, and is used for finely adjusting the characteristic vector sequence based on parameters in the first full-connection network on the basis that the characteristic vector sequence has the atomic characteristics of the compound to obtain the property characteristic vector, so as to realize classification of the sample compound through the property characteristic vector. Parameters in the first fully-connected network may be optimized by supervised water solubility training to increase the accuracy of classification of the sample compounds by the property feature vectors. The water-soluble classification submodel may also include a second feed-forward neural network, a first convolutional neural network, and the like.
The toxicity classification submodel comprises a second full-connection network, and is used for finely adjusting the characteristic vector sequence based on parameters in the second full-connection network on the basis that the characteristic vector sequence has the atomic characteristics of the compound to obtain the toxicity characteristic vector, so that the sample compound is classified through the toxicity characteristic vector. Parameters in the second fully-connected network may be optimized by supervised toxicity training to increase the accuracy of classification of the sample compounds by the toxicity feature vectors. The toxicity classification submodel may also include a third feed-forward neural network, a second convolutional neural network, and the like.
The first classification model may also include a melting point classification submodel, a median inhibitory concentration (IC50) classification submodel, and the like.
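The sketch below shows what one such classification sub-model could look like in Python as a single fully connected layer over a pooled feature vector sequence. Pooling at the first ([cls]) position and the two-class output are assumptions; a fuller sub-model could add the feed-forward or convolutional layers mentioned above.

    import torch
    import torch.nn as nn

    class PropertyClassificationHead(nn.Module):
        """One classification sub-model (e.g. water solubility or toxicity):
        a fully connected network applied to the feature vector sequence."""

        def __init__(self, d_model: int = 256, n_classes: int = 2):
            super().__init__()
            self.fc = nn.Linear(d_model, n_classes)

        def forward(self, feature_sequence: torch.Tensor) -> torch.Tensor:
            # feature_sequence: (batch, sequence length, d_model)
            pooled = feature_sequence[:, 0]   # assume the [cls] position summarises the compound
            return self.fc(pooled)            # property feature vector (class logits)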
And 105, training a property classification model formed by the feature extraction model and the first classification model according to the first label vector and the property feature vector, and training a missing atom prediction model formed by the feature extraction model and the second classification model according to the second label vector and the missing atom vector.
In a specific embodiment, the training of the property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and the training of the missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector include:
calculating a first difference vector of the first label vector and the property feature vector;
calculating a second difference vector of the second tag vector and the missing atom vector;
splicing the first difference vector and the second difference vector to obtain a third difference vector;
and optimizing parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector and the third difference vector by adopting a back propagation algorithm.
A first difference vector of the first label vector and the property feature vector may be calculated according to a cross entropy loss function, and a second difference vector of the second label vector and the missing atom vector may be calculated according to the cross entropy loss function.
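One possible reading of these difference vectors in Python, keeping the per-sample cross-entropy losses as vectors and splicing them into the third difference vector, is sketched below; the unreduced loss is an assumption.

    import torch
    import torch.nn.functional as F

    def difference_vectors(property_logits, first_labels, missing_atom_logits, second_labels):
        """First and second difference vectors from a cross-entropy loss,
        spliced into the third difference vector."""
        first_diff = F.cross_entropy(property_logits, first_labels, reduction="none")
        second_diff = F.cross_entropy(missing_atom_logits, second_labels, reduction="none")
        third_diff = torch.cat([first_diff, second_diff])
        return first_diff, second_diff, third_diff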
In a specific embodiment, the optimizing parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector, and the third difference vector by using a back propagation algorithm includes:
synchronously optimizing parameters in the property classification model and the missing atom prediction model according to the third difference vector by adopting a back propagation algorithm; or
And optimizing parameters in the property classification model according to the first difference vector by adopting a back propagation algorithm, and asynchronously optimizing parameters in the missing atom prediction model according to the second difference vector by adopting the back propagation algorithm.
The first difference vector expresses a distance between the first tag vector and the property feature vector, and the second difference vector expresses a distance between the second tag vector and the missing atom vector.
When the property classification model and the missing atom prediction model are used as an integral model, splicing the first label vector and the second label vector to be used as an integral label vector of the integral model; and splicing the property feature vector and the missing atom vector to be used as an integral output vector of the integral model. The third difference vector expresses a distance of the global label vector from the global output vector.
And after parameters in the property classification model and the missing atom prediction model are synchronously optimized according to the third difference vector by adopting a back propagation algorithm, recalculating the overall output vector of the overall model, wherein the distance between the recalculated overall output vector of the overall model and the overall label vector is smaller, namely the classification accuracy of the overall model is higher.
And optimizing parameters in the feature extraction model and the first classification model according to the first difference by adopting a back propagation algorithm, asynchronously optimizing parameters in the feature extraction model and the second classification model according to the second difference by adopting the back propagation algorithm, and performing optimization twice asynchronously so as to improve the speed of training the feature extraction model. And optimizing parameters in the property classification model according to the first difference value, so that the distance between a property feature vector recalculated by the property classification model based on the optimized parameters and the first label vector is smaller, namely the property feature classification model is more accurate in classifying the compound based on the compound property. And optimizing the parameters in the missing atom prediction model according to the second difference value, so that the distance between the missing atom vector recalculated by the missing atom prediction model based on the optimized parameters and the second label vector is smaller, namely the missing atom prediction model can predict the missing atoms in the input compound more accurately.
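The two optimization strategies can be sketched in Python as a single training step as follows. The shared optimizer, the batch keys and the summing of the two task losses are illustrative assumptions, not details given by the patent.

    import torch
    import torch.nn.functional as F

    def train_step(batch, feature_extractor, property_head, atom_head, optimizer, synchronous=True):
        """One training step for the property classification model and the
        missing atom prediction model, which share the feature extraction model."""
        if synchronous:
            # Synchronous: one backward pass on the combined (spliced) difference.
            features = feature_extractor(batch["atom_vectors"])
            loss = (F.cross_entropy(property_head(features), batch["first_label"])
                    + F.cross_entropy(atom_head(features), batch["second_label"]))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        else:
            # Asynchronous: two separate passes, each updating the shared feature extractor.
            for head, label_key in ((property_head, "first_label"), (atom_head, "second_label")):
                features = feature_extractor(batch["atom_vectors"])
                loss = F.cross_entropy(head(features), batch[label_key])
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()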
106, obtaining a second atomic representation of the target compound to be classified.
A second atomic representation of the target compound may be obtained from a database; or
Receiving a user input of a second atomic representation of the target compound.
The obtaining a second atomic representation of the target compound to be classified comprises:
obtaining a SMILES representation of the target compound; or
Obtaining a molecular fingerprint representation of the target compound; or
Obtaining an International Chemical Identifier (InChI) based representation of the target compound.
The type of the second atomic representation is consistent with the type of the first atomic representation. For example, the type of the second atomic representation and the type of the first atomic representation are both SMILES representations.
And 107, converting the second atom representation into a second atom vector sequence.
The second atomic representation is converted into a sequence of vectors for ease of processing and feature extraction by vector conversion.
Said converting said second atomic representation into a second atom vector sequence comprises:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the second atom representation;
splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the second atom representation to obtain a second atom vector of each atom in the second atom representation;
and combining second atom vectors of a plurality of atoms in the second atom representation to obtain the second atom vector sequence.
And 108, classifying the target compound by using the second atom vector as an input through the trained property classification model.
After the parameters in the property classification model are optimized through training, the trained property classification model can classify the target compound based on compound property features.
The trained property classification model can comprise a water-solubility classification submodel, a toxicity classification submodel, a melting point classification submodel, a median inhibitory concentration classification submodel and the like.
The trained property classification model may classify the target compound according to one or more properties of the compound based on one or more sub-models. For example, the trained property classification model includes a water-solubility classification submodel and a toxicity classification submodel, and the trained property classification model can classify the target compound according to the water solubility and the toxicity of the compound, so as to obtain that the type of the target compound is a water-soluble compound and/or a non-toxic compound.
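Classification of a target compound with the trained model could then look roughly like the Python sketch below; to_atom_vectors and the dictionary of property sub-models are assumed helpers rather than names from the patent.

    import torch

    @torch.no_grad()
    def classify_target(smiles, to_atom_vectors, feature_extractor, property_heads):
        """Convert the second atomic representation of a target compound into a
        second atom vector sequence, extract features with the trained feature
        extraction model, and classify it with each trained property sub-model."""
        atom_vectors = torch.as_tensor(to_atom_vectors(smiles), dtype=torch.float32).unsqueeze(0)
        features = feature_extractor(atom_vectors)
        return {name: int(head(features).argmax(dim=-1))   # predicted class per property
                for name, head in property_heads.items()}

    # Hypothetical usage:
    # result = classify_target("CCO", to_atom_vectors, trained_extractor,
    #                          {"water_solubility": ws_head, "toxicity": tox_head})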
The compound classification method of example one obtains a first atomic representation of a sample compound, obtains a first tag vector based on a compound property of the sample compound and a corresponding missing atom of the first atomic representation; converting the first atom representation to a first atom vector sequence, converting the missing atom to a second tag vector of the first atom representation; extracting the atomic features of the compound by taking the first atomic vector sequence as input through a feature extraction model to obtain a feature vector sequence of the sample compound; calculating a characteristic feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model; training a property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector; obtaining a second atomic representation of the target compound to be classified; converting the second atomic representation into a second atom vector sequence; and classifying the target compound by using the second atom vector as input through the trained property classification model. In the first embodiment, a feature extraction model shared by the two models is pre-trained through the missing atom prediction model and the property classification model to improve the extraction effect of the feature extraction model on the atomic features of the compound, and further improve the accuracy of classifying the compound by the property classification model composed of the feature extraction model and the first classification model. Meanwhile, the target compounds are classified by taking the second atom vector as input through the trained property classification model, so that the target compounds are prevented from being classified by experts, and the efficiency of classifying the compounds is improved.
Example two
FIG. 2 is a structural diagram of a compound classification apparatus according to a second embodiment of the present invention. The compound classification apparatus 20 is applied to a computer device and is used for classifying compounds, improving the efficiency of classifying compounds.
As shown in fig. 2, the compound classification apparatus 20 may include a first obtaining module 201, a first converting module 202, an extracting module 203, a calculating module 204, a training module 205, a second obtaining module 206, a second converting module 207, and a classifying module 208.
A first obtaining module 201, configured to obtain a first atomic representation of a sample compound, obtain a first tag vector of the sample compound based on a compound property and a corresponding missing atom of the first atomic representation.
In a specific embodiment, said obtaining a first atomic representation of a sample compound comprises:
obtaining a Simplified Molecular Input Line Entry System (SMILES) representation of the sample compound; or
Obtaining a molecular fingerprint (FECP) representation of the sample compound; or
Obtaining an International Chemical Identifier (InChI) based representation of the sample compound.
The sample compounds are compounds with randomly missing atoms, and the missing atoms are represented by mask labels. For example, if the intact compound is "CCCC(=O)NC1=CC=C(OCC(O)CNC(C)C)C(=C1)C(C)=O", the SMILES representation of the sample compound is "[cls]CCCC(=[mask])NC1=CC=C(OCC(O)CNC(C)C)C(=C1)C(C)=O[sep]", and the missing atom is "O". Here, "[cls]" is a start identifier and "[sep]" is an end identifier.
A first conversion module 202, configured to convert the first atom representation into a first atom vector sequence, and convert the missing atom into a second tag vector of the first atom representation.
Converting the first atom representation and the missing atom into a vector sequence for convenient processing and feature extraction through vector conversion.
In a specific embodiment, said converting said first atomic representation into a first atom vector sequence comprises:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the first atom representation;
splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the first atom representation to obtain a first atom vector of each atom in the first atom representation;
and combining the first atom vectors of the atoms in the first atom representation to obtain the first atom vector sequence.
The coding sub-vector of each atom can be looked up in a preset coding table and is the unique identifier of that atom. The position sub-vector of each atom may be the position of the atom in the SMILES representation of the sample compound. The graph structure sub-vector of each atom may include atom structure information and/or atom connection information in the sample compound.
The coding sub-vector of the missing atom can be queried through the preset coding table, and the coding sub-vector of the missing atom is determined as the second tag vector.
And the extracting module 203 is configured to extract the atomic features of the compound by using the first atomic vector sequence as an input through a feature extraction model, so as to obtain a feature vector sequence of the sample compound.
In one embodiment, the feature extraction model includes a BERT model, an RNN model, or a Transformer model.
BERT is a deep bidirectional language representation model based on the Transformer; it uses the Transformer structure to build a multilayer bidirectional Encoder network. The Transformer is a deep model based on the self-attention mechanism for processing NLP tasks. The Transformer handles NLP tasks better than the RNN and trains faster.
Specifically, the BERT model includes a plurality of neural network layers, each neural network layer includes a preset number of Encoding modules, and each Encoding module includes an Encoding structure of a bidirectional Transformer. Each Encoding structure comprises a multi-head attention network, a first residual network, a first feed-forward neural network and a second residual network.
A calculating module 204, configured to calculate, by using a first classification model, a property feature vector of the sample compound according to the feature vector sequence, and calculate, by using a second classification model, a missing atom vector of the sample compound according to the feature vector sequence.
In another embodiment, said calculating, by a first classification model, a property feature vector of said sample compound from said sequence of feature vectors comprises:
calculating a water-soluble feature vector of the sample compound according to the feature vector sequence by using a water-soluble classification submodel in the first classification model;
and calculating the toxicity characteristic vector of the sample compound according to the characteristic vector sequence by using a toxicity classification submodel in the first classification model.
The water-soluble classification submodel comprises a first full-connection network, and is used for finely adjusting the characteristic vector sequence based on parameters in the first full-connection network on the basis that the characteristic vector sequence has the atomic characteristics of the compound to obtain the property characteristic vector, so as to realize classification of the sample compound through the property characteristic vector. Parameters in the first fully-connected network may be optimized by supervised water solubility training to increase the accuracy of classification of the sample compounds by the property feature vectors. The water-soluble classification submodel may also include a second feed-forward neural network, a first convolutional neural network, and the like.
The toxicity classification submodel comprises a second full-connection network, and is used for finely adjusting the characteristic vector sequence based on parameters in the second full-connection network on the basis that the characteristic vector sequence has the atomic characteristics of the compound to obtain the toxicity characteristic vector, so that the sample compound is classified through the toxicity characteristic vector. Parameters in the second fully-connected network may be optimized by supervised toxicity training to increase the accuracy of classification of the sample compounds by the toxicity feature vectors. The toxicity classification submodel may also include a third feed-forward neural network, a second convolutional neural network, and the like.
The first classification model may also include a melting point classification submodel, a median inhibitory concentration (IC50) classification submodel, and the like.
A training module 205, configured to train a property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and train a missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector.
In a specific embodiment, the training of the property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and the training of the missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector include:
calculating a first difference vector of the first label vector and the property feature vector;
calculating a second difference vector of the second tag vector and the missing atom vector;
splicing the first difference vector and the second difference vector to obtain a third difference vector;
and optimizing parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector and the third difference vector by adopting a back propagation algorithm.
A first difference vector of the first label vector and the property feature vector may be calculated according to a cross entropy loss function, and a second difference vector of the second label vector and the missing atom vector may be calculated according to the cross entropy loss function.
In a specific embodiment, the optimizing parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector, and the third difference vector by using a back propagation algorithm includes:
synchronously optimizing parameters in the property classification model and the missing atom prediction model according to the third difference vector by adopting a back propagation algorithm; or
And optimizing parameters in the property classification model according to the first difference vector by adopting a back propagation algorithm, and asynchronously optimizing parameters in the missing atom prediction model according to the second difference vector by adopting the back propagation algorithm.
The first difference vector expresses a distance between the first tag vector and the property feature vector, and the second difference vector expresses a distance between the second tag vector and the missing atom vector.
When the property classification model and the missing atom prediction model are used as an integral model, splicing the first label vector and the second label vector to be used as an integral label vector of the integral model; and splicing the property feature vector and the missing atom vector to be used as an integral output vector of the integral model. The third difference vector expresses a distance of the global label vector from the global output vector.
And after parameters in the property classification model and the missing atom prediction model are synchronously optimized according to the third difference vector by adopting a back propagation algorithm, recalculating the overall output vector of the overall model, wherein the distance between the recalculated overall output vector of the overall model and the overall label vector is smaller, namely the classification accuracy of the overall model is higher.
And optimizing parameters in the feature extraction model and the first classification model according to the first difference by adopting a back propagation algorithm, asynchronously optimizing parameters in the feature extraction model and the second classification model according to the second difference by adopting the back propagation algorithm, and performing optimization twice asynchronously so as to improve the speed of training the feature extraction model. And optimizing parameters in the property classification model according to the first difference value, so that the distance between a property feature vector recalculated by the property classification model based on the optimized parameters and the first label vector is smaller, namely the property feature classification model is more accurate in classifying the compound based on the compound property. And optimizing the parameters in the missing atom prediction model according to the second difference value, so that the distance between the missing atom vector recalculated by the missing atom prediction model based on the optimized parameters and the second label vector is smaller, namely the missing atom prediction model can predict the missing atoms in the input compound more accurately.
A second obtaining module 206 for obtaining a second atomic representation of the target compound to be classified.
A second atomic representation of the target compound may be obtained from a database; or
Receiving a user input of a second atomic representation of the target compound.
The obtaining a second atomic representation of the target compound to be classified comprises:
obtaining a SMILES representation of the target compound; or
Obtaining a molecular fingerprint representation of the target compound; or
Obtaining an International Chemical Identifier (InChI) based representation of the target compound.
The type of the second atomic representation is consistent with the type of the first atomic representation. For example, the type of the second atomic representation and the type of the first atomic representation are both SMILES representations.
A second conversion module 207 for converting the second atomic representation into a second atom vector sequence.
The second atomic representation is converted into a sequence of vectors for ease of processing and feature extraction by vector conversion.
Said converting said second atomic representation into a second atom vector sequence comprises:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the second atom representation;
splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the second atom representation to obtain a second atom vector of each atom in the second atom representation;
and combining second atom vectors of a plurality of atoms in the second atom representation to obtain the second atom vector sequence.
A classification module 208, configured to classify the target compound according to the trained property classification model by using the second atom vector as an input.
After the parameters in the property classification model are optimized through training, the trained property classification model can classify the target compound based on compound property features.
The trained property classification model can comprise a water-solubility classification submodel, a toxicity classification submodel, a melting point classification submodel, a median inhibitory concentration classification submodel and the like.
The trained property classification model may classify the target compound according to one or more properties of the compound based on one or more sub-models. For example, the trained property classification model includes a water-solubility classification submodel and a toxicity classification submodel, and the trained property classification model can classify the target compound according to the water solubility and the toxicity of the compound, so as to obtain that the type of the target compound is a water-soluble compound and/or a non-toxic compound.
The compound classification apparatus 20 of example two obtains a first atomic representation of a sample compound, obtains a first tag vector based on a compound property of the sample compound and a corresponding missing atom of the first atomic representation; converting the first atom representation to a first atom vector sequence, converting the missing atom to a second tag vector of the first atom representation; extracting the atomic features of the compound by taking the first atomic vector sequence as input through a feature extraction model to obtain a feature vector sequence of the sample compound; calculating a characteristic feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model; training a property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector; obtaining a second atomic representation of the target compound to be classified; converting the second atomic representation into a second atom vector sequence; and classifying the target compound by using the second atom vector as input through the trained property classification model. The embodiment pretrains a feature extraction model shared by the missing atom prediction model and the property classification model so as to improve the extraction effect of the feature extraction model on the compound atom features and further improve the accuracy of classifying the compound by the property classification model formed by the feature extraction model and the first classification model. Meanwhile, the target compounds are classified by taking the second atom vector as input through the trained property classification model, so that the target compounds are prevented from being classified by experts, and the efficiency of classifying the compounds is improved.
EXAMPLE III
The present embodiment provides a computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps in the above compound classification method embodiments, such as steps 101-108 shown in fig. 1:
101, obtaining a first atomic representation of a sample compound, obtaining a first tag vector of the sample compound based on a compound property and a corresponding missing atom of the first atomic representation;
102, converting the first atom representation into a first atom vector sequence, and converting the missing atom into a second tag vector of the first atom representation;
103, extracting the atomic features of the compound by taking the first atomic vector sequence as input through a feature extraction model to obtain a feature vector sequence of the sample compound;
104, calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
105, training a property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector;
106, obtaining a second atomic representation of the target compound to be classified;
107, converting said second atomic representation into a second atomic vector sequence;
and 108, classifying the target compound by using the second atom vector as an input through the trained property classification model.
Alternatively, the computer readable instructions, when executed by the processor, implement the functions of the modules in the above apparatus embodiment, for example, modules 201 to 208 in fig. 2:
a first obtaining module 201, configured to obtain a first atomic representation of a sample compound, obtain a first tag vector of the sample compound based on a compound property and a corresponding missing atom of the first atomic representation;
a first conversion module 202, configured to convert the first atom representation into a first atom vector sequence, and convert the missing atom into a second tag vector of the first atom representation;
an extracting module 203, configured to extract the atomic features of the compound by using the first atomic vector sequence as an input through a feature extraction model, so as to obtain a feature vector sequence of the sample compound;
a calculating module 204, configured to calculate a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculate a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
a training module 205, configured to train a property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and train a missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector;
a second obtaining module 206 for obtaining a second atomic representation of the target compound to be classified;
a second conversion module 207 for converting the second atomic representation into a second atomic vector sequence;
a classification module 208, configured to classify the target compound according to the trained property classification model by using the second atom vector as an input.
Example four
Fig. 3 is a schematic diagram of a computer device according to an embodiment of the present invention. The computer device 30 comprises a memory 301, a processor 302, and computer readable instructions, such as a compound classification program, stored in the memory 301 and executable on the processor 302. The processor 302, when executing the computer readable instructions, implements the steps in the above-described compound classification method embodiments, such as steps 101-108 shown in fig. 1:
101, obtaining a first atomic representation of a sample compound, obtaining a first tag vector of the sample compound based on a compound property and a corresponding missing atom of the first atomic representation;
102, converting the first atom representation into a first atom vector sequence, and converting the missing atom into a second tag vector of the first atom representation;
103, extracting the atomic features of the compound by taking the first atomic vector sequence as input through a feature extraction model to obtain a feature vector sequence of the sample compound;
104, calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
105, training a property classification model composed of the feature extraction model and the first classification model according to the first label vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second label vector and the missing atom vector;
106, obtaining a second atomic representation of the target compound to be classified;
107, converting said second atomic representation into a second atomic vector sequence;
and 108, classifying the target compound by using the second atom vector as an input through the trained property classification model.
Alternatively, the computer readable instructions, when executed by the processor, implement the functions of the modules in the above apparatus embodiment, for example, modules 201 to 208 in fig. 2:
a first obtaining module 201, configured to obtain a first atomic representation of a sample compound, obtain a first tag vector of the sample compound based on a compound property and a corresponding missing atom of the first atomic representation;
a first conversion module 202, configured to convert the first atom representation into a first atom vector sequence, and convert the missing atom into a second tag vector of the first atom representation;
an extracting module 203, configured to extract the atomic features of the compound by using the first atomic vector sequence as an input through a feature extraction model, so as to obtain a feature vector sequence of the sample compound;
a calculating module 204, configured to calculate a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculate a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
a training module 205, configured to train a property classification model composed of the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and train a missing atom prediction model composed of the feature extraction model and the second classification model according to the second tag vector and the missing atom vector;
a second obtaining module 206 for obtaining a second atomic representation of the target compound to be classified;
a second conversion module 207 for converting the second atomic representation into a second atomic vector sequence;
a classification module 208, configured to classify the target compound through the trained property classification model by using the second atom vector sequence as an input.
Illustratively, the computer readable instructions may be partitioned into one or more modules that are stored in the memory 301 and executed by the processor 302 to carry out the present method. The one or more modules may be a series of computer readable instruction segments capable of performing specific functions, the segments being used to describe the execution process of the computer readable instructions in the computer device 30. For example, the computer readable instructions may be divided into the first obtaining module 201, the first conversion module 202, the extracting module 203, the calculating module 204, the training module 205, the second obtaining module 206, the second conversion module 207, and the classification module 208 in fig. 2, and the specific functions of each module are described in embodiment two.
Those skilled in the art will appreciate that the schematic diagram in Fig. 3 is merely an example of the computer device 30 and does not constitute a limitation of the computer device 30, which may include more or fewer components than those shown, combine certain components, or use different components; for example, the computer device 30 may also include input and output devices, network access devices, buses, etc.
The Processor 302 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. A general purpose processor may be a microprocessor, or the processor 302 may be any conventional processor. The processor 302 is the control center of the computer device 30 and connects the various parts of the entire computer device 30 using various interfaces and lines.
The memory 301 may be used to store the computer readable instructions, and the processor 302 may implement the various functions of the computer device 30 by running or executing the computer readable instructions or modules stored in the memory 301 and invoking the data stored in the memory 301. The memory 301 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the storage data area may store data created according to the use of the computer device 30, and the like. In addition, the Memory 301 may include a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Memory Card (Flash Card), at least one disk storage device, a flash memory device, a Read-Only Memory (ROM), a Random Access Memory (RAM), or other non-volatile/volatile storage devices.
The modules integrated by the computer device 30 may be stored in a computer readable storage medium if they are implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, all or part of the flow of the method according to the above embodiments may be implemented by instructing related hardware through computer readable instructions, which may be stored in a computer readable storage medium; when the computer readable instructions are executed by a processor, the steps of the method embodiments may be implemented. The computer readable instructions comprise computer readable instruction code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read Only Memory (ROM), a Random Access Memory (RAM), etc.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules may be integrated into one module. The integrated module can be implemented in the form of hardware, or in the form of hardware plus software functional modules.
The integrated module implemented in the form of a software functional module may be stored in a computer readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the compound classification method according to various embodiments of the present invention.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned. Furthermore, it is to be understood that the word "comprising" does not exclude other modules or steps, and the singular does not exclude the plural. A plurality of modules or means recited in the system claims may also be implemented by one module or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them; although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A compound classification method, comprising:
obtaining a first atomic representation of a sample compound, obtaining a first tag vector of the sample compound based on a compound property of the sample compound, and obtaining a missing atom corresponding to the first atomic representation;
converting the first atom representation to a first atom vector sequence, converting the missing atom to a second tag vector of the first atom representation;
extracting the atomic features of the compound by taking the first atomic vector sequence as input through a feature extraction model to obtain a feature vector sequence of the sample compound;
calculating a property feature vector of the sample compound according to the feature vector sequence through a first classification model, and calculating a missing atom vector of the sample compound according to the feature vector sequence through a second classification model;
training a property classification model composed of the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and training a missing atom prediction model composed of the feature extraction model and the second classification model according to the second tag vector and the missing atom vector;
obtaining a second atomic representation of the target compound to be classified;
converting the second atomic representation into a second atom vector sequence;
and classifying the target compound by using the second atom vector sequence as input through the trained property classification model.
2. The compound classification method of claim 1, wherein the obtaining a first atomic representation of a sample compound comprises:
obtaining a simplified molecular-input line-entry system (SMILES) representation of the sample compound; or
obtaining a molecular fingerprint representation of the sample compound; or
obtaining an International Chemical Identifier (InChI) based representation of the sample compound.
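By way of a non-limiting illustration of these three representation options, the snippet below obtains each of them with the RDKit toolkit; the use of RDKit, the example molecule, and the Morgan-fingerprint parameters are assumptions of this sketch rather than requirements of the claim.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCO")  # a toy sample compound (ethanol)

smiles = Chem.MolToSmiles(mol)                                            # SMILES representation
fingerprint = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)   # molecular fingerprint representation
inchi = Chem.MolToInchi(mol)                                              # InChI-based representation

print(smiles, inchi, fingerprint.GetNumOnBits())
```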
3. The method of compound classification as claimed in claim 1 wherein said converting said first atomic representation into a first atom vector sequence comprises:
acquiring a coding sub-vector, a position sub-vector and a graph structure sub-vector of each atom in the first atom representation;
splicing the coding sub-vector, the position sub-vector and the graph structure sub-vector of each atom in the first atom representation to obtain a first atom vector of each atom in the first atom representation;
and combining the first atom vectors of the atoms in the first atom representation to obtain the first atom vector sequence.
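A minimal sketch of this splicing and combining step follows; the function name, the sub-vector dimensions, and the random stand-ins for the coding, position, and graph structure sub-vectors are hypothetical, since how each sub-vector is actually computed is defined elsewhere in the description.

```python
import torch

def build_first_atom_vector_sequence(num_atoms, enc_dim=32, pos_dim=16, graph_dim=16):
    atom_vectors = []
    for _ in range(num_atoms):
        coding_sub = torch.randn(enc_dim)      # coding sub-vector of the atom (stand-in)
        position_sub = torch.randn(pos_dim)    # position sub-vector of the atom (stand-in)
        graph_sub = torch.randn(graph_dim)     # graph structure sub-vector of the atom (stand-in)
        # splice the three sub-vectors to obtain the first atom vector of the atom
        atom_vectors.append(torch.cat([coding_sub, position_sub, graph_sub]))
    # combine the first atom vectors of all atoms to obtain the first atom vector sequence
    return torch.stack(atom_vectors)

print(build_first_atom_vector_sequence(5).shape)   # torch.Size([5, 64])
```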
4. The compound classification method of claim 1, wherein the feature extraction model comprises a BERT model, an RNN model, or a Transformer model.
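Any of the listed architectures can serve as the feature extraction model over the atom vector sequence; the sketch below instantiates two of them with illustrative, assumed dimensions.

```python
import torch.nn as nn

# Transformer option: a stack of self-attention encoder layers over the atom vector sequence.
transformer_extractor = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2)

# RNN option: a recurrent encoder over the same sequence.
rnn_extractor = nn.GRU(input_size=64, hidden_size=64, batch_first=True)

# A BERT-style encoder (e.g. transformers.BertModel fed via inputs_embeds) would be a third option.
```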
5. The method of compound classification of claim 1, wherein said calculating a property feature vector of the sample compound from the sequence of feature vectors by a first classification model comprises:
calculating a water-soluble feature vector of the sample compound according to the feature vector sequence by using a water-soluble classification submodel in the first classification model;
and calculating the toxicity characteristic vector of the sample compound according to the characteristic vector sequence by using a toxicity classification submodel in the first classification model.
6. The compound classification method of claim 1, wherein the training of the property classification model composed of the feature extraction model and the first classification model based on the first tag vector and the property feature vector, and the training of the missing atom prediction model composed of the feature extraction model and the second classification model based on the second tag vector and the missing atom vector comprises:
calculating a first difference vector of the first tag vector and the property feature vector;
calculating a second difference vector of the second tag vector and the missing atom vector;
splicing the first difference vector and the second difference vector to obtain a third difference vector;
and optimizing parameters in the property classification model and the missing atom prediction model according to the first difference vector, the second difference vector and the third difference vector by adopting a back propagation algorithm.
7. The compound classification method according to claim 6, wherein the optimizing parameters in the property classification model and the missing atom prediction model from the first difference vector, the second difference vector, and the third difference vector using a back propagation algorithm comprises:
synchronously optimizing parameters in the property classification model and the missing atom prediction model according to the third difference vector by adopting a back propagation algorithm; or
and optimizing parameters in the property classification model according to the first difference vector by adopting a back propagation algorithm, and asynchronously optimizing parameters in the missing atom prediction model according to the second difference vector by adopting the back propagation algorithm.
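The following sketch contrasts the two optimization strategies of claims 6 and 7. Scalar losses stand in for the difference vectors, and the tiny linear layers stand in for the feature extraction model and the two classification models; all names, dimensions, and loss choices are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

shared = nn.Linear(8, 8)        # stands in for the shared feature extraction model
prop_head = nn.Linear(8, 2)     # stands in for the first classification model
atom_head = nn.Linear(8, 4)     # stands in for the second classification model
optimizer = torch.optim.SGD(
    list(shared.parameters()) + list(prop_head.parameters()) + list(atom_head.parameters()),
    lr=0.01)

x = torch.randn(16, 8)
first_tag = torch.randn(16, 2)               # first tag vector (stand-in)
second_tag = torch.randint(0, 4, (16,))      # second tag vector (stand-in)

def differences():
    first_diff = F.mse_loss(prop_head(shared(x)), first_tag)         # stands in for the first difference vector
    second_diff = F.cross_entropy(atom_head(shared(x)), second_tag)  # stands in for the second difference vector
    return first_diff, second_diff

# Synchronous optimization: back-propagate the spliced (joint) difference through the
# property classification model and the missing atom prediction model in one pass.
first_diff, second_diff = differences()
third_diff = first_diff + second_diff
optimizer.zero_grad()
third_diff.backward()
optimizer.step()

# Asynchronous optimization: back-propagate each difference through its own model in turn.
first_diff, _ = differences()
optimizer.zero_grad()
first_diff.backward()
optimizer.step()

_, second_diff = differences()
optimizer.zero_grad()
second_diff.backward()
optimizer.step()
```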
8. A compound sorting device, comprising:
a first obtaining module, configured to obtain a first atomic representation of a sample compound, obtain a first tag vector of the sample compound based on a compound property of the sample compound, and obtain a missing atom corresponding to the first atomic representation;
a first conversion module, configured to convert the first atom representation into a first atom vector sequence, and convert the missing atom into a second tag vector of the first atom representation;
an extraction module, configured to extract the atomic features of the compound by taking the first atomic vector sequence as input through a feature extraction model, so as to obtain a feature vector sequence of the sample compound;
a calculation module, configured to calculate, by using a first classification model, a property feature vector of the sample compound according to the feature vector sequence, and calculate, by using a second classification model, a missing atom vector of the sample compound according to the feature vector sequence;
a training module, configured to train a property classification model composed of the feature extraction model and the first classification model according to the first tag vector and the property feature vector, and train a missing atom prediction model composed of the feature extraction model and the second classification model according to the second tag vector and the missing atom vector;
a second obtaining module for obtaining a second atomic representation of the target compound to be classified;
a second conversion module for converting the second atomic representation into a second atomic vector sequence;
and a classification module, configured to classify the target compound by taking the second atom vector sequence as input through the trained property classification model.
9. A computer device comprising a processor for executing computer readable instructions stored in a memory to implement a compound classification method as claimed in any one of claims 1 to 7.
10. A computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, carry out a compound classification method according to any one of claims 1 to 7.
CN202010917059.2A 2020-09-03 2020-09-03 Method for classifying compounds and related equipment Active CN111986740B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010917059.2A CN111986740B (en) 2020-09-03 2020-09-03 Method for classifying compounds and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010917059.2A CN111986740B (en) 2020-09-03 2020-09-03 Method for classifying compounds and related equipment

Publications (2)

Publication Number Publication Date
CN111986740A true CN111986740A (en) 2020-11-24
CN111986740B CN111986740B (en) 2024-05-14

Family

ID=73448044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010917059.2A Active CN111986740B (en) 2020-09-03 2020-09-03 Method for classifying compounds and related equipment

Country Status (1)

Country Link
CN (1) CN111986740B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU1067699A (en) * 1997-10-07 1999-04-27 New England Medical Center Hospitals, Inc., The Structure-based rational design of compounds to inhibit papillomavirus infection
WO2018098588A1 (en) * 2016-12-02 2018-06-07 Lumiant Corporation Computer systems for and methods of identifying non-elemental materials based on atomistic properties
CN109658989A (en) * 2018-11-14 2019-04-19 国网新疆电力有限公司信息通信公司 Class drug compound toxicity prediction method based on deep learning
CN109493922A (en) * 2018-11-19 2019-03-19 大连思利科环境科技有限公司 Method for predicting molecular structure parameters of chemicals
CN110428864A (en) * 2019-07-17 2019-11-08 大连大学 Method for constructing the affinity prediction model of protein and small molecule
CN110767271A (en) * 2019-10-15 2020-02-07 腾讯科技(深圳)有限公司 Compound property prediction method, device, computer device and readable storage medium
CN110751230A (en) * 2019-10-30 2020-02-04 深圳市太赫兹科技创新研究院有限公司 Substance classification method, substance classification device, terminal device and storage medium
CN110867254A (en) * 2019-11-18 2020-03-06 北京市商汤科技开发有限公司 Prediction method and device, electronic device and storage medium
CN110957012A (en) * 2019-11-28 2020-04-03 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for analyzing properties of compound

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ATSUTO SEKO ET AL.: "Representation of compounds for machine-learning prediction of physical properties", PHYSICAL REVIEW B, vol. 95, no. 14, pages 1 - 11 *
AN QIANGQIANG: "Compound analysis based on machine learning" (in Chinese), CONTEMPORARY CHEMICAL INDUSTRY (当代化工), vol. 47, no. 1, pages 38 - 40 *
YI ZHONGSHENG ET AL.: "Support vector machine classification study of aquatic toxicity modes of action of organic compounds" (in Chinese), GUANGXI SCIENCES (广西科学), vol. 13, no. 1, pages 31 - 34 *

Also Published As

Publication number Publication date
CN111986740B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
JP7193252B2 (en) Captioning image regions
CN108959482B (en) Single-round dialogue data classification method and device based on deep learning and electronic equipment
Xie et al. Generative VoxelNet: Learning energy-based models for 3D shape synthesis and analysis
CN111738016A (en) Multi-intention recognition method and related equipment
CN112559784A (en) Image classification method and system based on incremental learning
CN111461168A (en) Training sample expansion method and device, electronic equipment and storage medium
US9378464B2 (en) Discriminative learning via hierarchical transformations
CN110704621A (en) Text processing method and device, storage medium and electronic equipment
CN111460812B (en) Sentence emotion classification method and related equipment
JP7229345B2 (en) Sentence processing method, sentence decoding method, device, program and device
CN113761197B (en) Application form multi-label hierarchical classification method capable of utilizing expert knowledge
CN112086144A (en) Molecule generation method, molecule generation device, electronic device, and storage medium
CN111639500A (en) Semantic role labeling method and device, computer equipment and storage medium
CN110704543A (en) Multi-type multi-platform information data self-adaptive fusion system and method
CN113111190A (en) Knowledge-driven dialog generation method and device
CN111027681B (en) Time sequence data processing model training method, data processing method, device and storage medium
CN115796182A (en) Multi-modal named entity recognition method based on entity-level cross-modal interaction
CN114896067A (en) Automatic generation method and device of task request information, computer equipment and medium
CN113239702A (en) Intention recognition method and device and electronic equipment
CN113723077B (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN112036439B (en) Dependency relationship classification method and related equipment
CN111767720B (en) Title generation method, computer and readable storage medium
Tomer et al. STV-BEATS: skip thought vector and bi-encoder based automatic text summarizer
US20230281826A1 (en) Panoptic segmentation with multi-database training using mixed embedding
CN111259673A (en) Feedback sequence multi-task learning-based law decision prediction method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210202

Address after: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant after: Shenzhen saiante Technology Service Co.,Ltd.

Address before: 1-34 / F, Qianhai free trade building, 3048 Xinghai Avenue, Mawan, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong 518000

Applicant before: Ping An International Smart City Technology Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant