CN116758978A - Controllable attribute totally new active small molecule design method based on protein structure - Google Patents

Info

Publication number
CN116758978A
Authority
CN
China
Prior art keywords
protein
small molecule
node
amino acid
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310707583.0A
Other languages
Chinese (zh)
Inventor
施建宇
李嘉宁
杨光
赵鹏程
韦学鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202310707583.0A
Publication of CN116758978A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20 Protein or domain folding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Medicinal Chemistry (AREA)
  • Public Health (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Bioethics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a protein structure-based method for designing totally new active small molecules with controllable attributes, and provides a Transformer-based small molecule generation model, CProMG. By fusing a hierarchical view of the protein, in which amino acid residues are associated with their constituent atoms, the model significantly enhances the representation of the protein binding pocket. By jointly embedding the molecular sequence, its drug-like properties, and its binding affinity to the protein, the model autoregressively generates new molecules with the desired properties in a controllable manner, measuring the proximity of molecular tokens to protein residues and atoms.

Description

Controllable attribute totally new active small molecule design method based on protein structure
Technical Field
The application belongs to the technical field of computer-aided drug research and development, and particularly relates to a controllable attribute totally new active small molecule design method based on a protein structure.
Background
In drug design, it is important to screen or design candidate compounds that bind to a protein target. However, the chemical space of small molecules is vast, estimated to contain 10^23 to 10^60 compounds, so finding suitable small molecules in such a space is extremely difficult.
In the development of computer-aided drug design, high-throughput screening and virtual screening were proposed first; both obtain target molecules by filtering a large compound library. High-throughput screening is computer-aided and can test tens of millions of samples in a single experiment. Molecular docking and machine-learning-based quantitative structure-activity relationship (QSAR) methods are applied to virtual screening; they correspond to two kinds of virtual screening, based on small molecule structure and on the mechanism of drug action. With the development of artificial intelligence, deep-learning models that predict the biochemical properties of molecules have also been applied to virtual screening, bringing hope for the discovery of lead compounds. However, all of these methods screen known databases, which greatly limits the search range in chemical space, and the screened molecules are not novel.
The de novo design of small molecule drugs is also, in essence, a search for small molecules in chemical space, but it is not limited by existing databases and can explore the chemical space more fully. With the development of artificial intelligence, many deep generative models have emerged and have been applied successfully to natural language processing and image tasks. Inspired by this, generative models are now applied to small molecule generation: they learn the physicochemical properties and structural characteristics of small molecule data and finally generate ideal small molecules that satisfy specific conditions.
Current deep-learning-based molecular generation methods can be broadly divided into ligand structure-based methods and receptor structure-based methods. Ligand structure-based methods either ignore target information or are limited by target-specific ligand datasets, so they struggle to achieve high binding affinity to a new target. Receptor structure-based methods can solve these problems, but the biochemical and physicochemical properties of the generated molecules are difficult to control.
In view of this, it is necessary to devise a new generation method that allows the generated molecules to control properties on the basis of having a high binding force.
Disclosure of Invention
The application aims to solve the technical problem that existing deep-learning-based molecular generation methods cannot control the biochemical and physicochemical properties of designed molecules while maintaining high binding affinity, and provides a protein structure-based method for designing totally new active small molecules with controllable attributes.
In order to achieve the above purpose, the technical solution provided by the present application is:
the design method of the controllable attribute totally new active small molecule based on the protein structure is characterized by comprising the following steps:
1) Constructing a small molecule generation model CProMG:
the small molecule generation model CProMG comprises a protein embedding module, a double-view encoder module, a small molecule embedding module and a decoder module, and a beam search algorithm is used for gradually generating a complete SMILES sequence;
the protein embedding module is used for obtaining the amino acid graph features and atom graph features of a protein (that is, its input is the 3D structure of the protein and its output is the amino acid graph features and atom graph features), and comprises an amino acid graph embedding unit and an atom graph embedding unit;
the dual-view encoder module is used for fusing the amino acid graph features and atom graph features of the protein to obtain fused protein features (that is, its input is the feature representations of the amino acid graph and the atom graph, and its output is the fused protein features), and comprises a multi-head attention network, a feedforward neural network and an information cross fusion unit;
the small molecule embedding module is used for obtaining the initial features of small molecules (that is, its input is a small molecule sequence and its output is the small molecule embedding features), and comprises a small molecule SMILES and attribute embedding unit, a segment coding unit and a position coding unit; the segment coding unit is used for distinguishing the molecular sequence from the molecular attributes, and the position coding unit is used for acquiring position information;
the decoder module is used for generating a small molecule sequence (that is, its input is the small molecule embedding features and the protein features, and its output is the generated small molecule sequence), and comprises a masked multi-head attention network, an interactive multi-head attention network and a feedforward neural network;
2) Obtaining sample data, and training the small molecule generation model CProMG constructed in the step 1) to obtain a trained small molecule generation model; the specific training process is as follows:
2.1 Collecting sample data, constructing a training data set and a test data set
The sample data are protein-small molecule pairs whose binding-pose root mean square deviation (RMSD) is smaller than [value missing], screened from an existing dataset (small molecules are generated based on the protein structure, so the generated small molecules have targeting affinity); each pair comprises the three-dimensional structure information of the protein and the SMILES sequence information of the small molecule;
2.2 Obtaining protein characteristics and initial characteristics of small molecules
Protein characteristics were obtained by:
A1. characterizing the three-dimensional structure of the protein from step 2.1): constructing the protein amino acid graph and the protein atom graph using the K-nearest neighbor (KNN) algorithm; node information is one-hot encoded, the initial node features are obtained through Laplacian position encoding, and Gaussian kernel functions are used to convert edge lengths into edge features;
A2. fusing the initial features of the protein amino acid graph and the protein atom graph obtained in A1 with the dual-view encoder module to obtain the protein features;
the dual-view encoder module comprises a parallel amino acid view encoder En_r and atom view encoder En_a. Each encoder comprises t encoding layers; each encoding layer first uses the edge features to enhance the information of each node, then uses a multi-head attention mechanism to calculate attention scores between each node and its adjacent nodes, aggregates the adjacent nodes with the attention scores as weights to update the node information, and finally passes the node information into a feedforward neural network. The information cross fusion unit fuses the information of the two views (that is, the information of the atom view is aggregated into the amino acid view through attention calculation, and the node features of the amino acid view are updated). Finally, the outputs of En_r and En_a are concatenated to obtain the final protein feature representation;
the initial characteristics of the small molecules are obtained by:
characterizing the small molecule SMILES sequence information from step 2.1): the physicochemical properties of the small molecule are obtained using RDKit and prepended to the small molecule SMILES sequence as the generation condition, and the whole sequence is one-hot encoded to obtain the initial features of the small molecule;
2.3 Training on the initial features of the small molecules obtained in step 2.2) using the decoder module. The decoder is similar to that of the original Transformer and comprises t decoding layers; each decoding layer first learns the features of the molecule through a masked multi-head attention network, then uses an interactive multi-head attention network to calculate the proximity between the molecular tokens and the protein features obtained in step 2.2) and update the molecular features, and finally passes the molecular features into a feedforward neural network to predict the complete molecular output;
2.4 Calculating the model loss using the molecules predicted in step 2.3), adjusting the model parameters according to the loss by back-propagation, and obtaining the trained small molecule generation model CProMG;
3) Using the CProMG model trained in step 2) together with a beam search algorithm to generate a complete small molecule SMILES sequence step by step. Beam search is the strategy used to explore the search space when generating molecules after the model has been trained; each run of the model predicts only the next character from the known sequence, so the model must be run repeatedly to generate a complete SMILES sequence step by step.
Further, in step 2.2), the three-dimensional structure of the protein is represented as an amino acid graph and an atom graph, wherein each graph has a node set, v_i represents the feature of node i, each node also carries its three-dimensional coordinates, and ε = {e_ij, i, j = 1, 2, ..., n & i ≠ j} represents the edge features;
for the amino acid graph, the node feature v_i is the one-hot code of the residue type of the i-th residue; the protein amino acid graph is constructed using the K-nearest neighbor algorithm based on the three-dimensional coordinates of the amino acids; the edge length is expanded into an n-dimensional vector by a plurality of Gaussian kernel functions and used as the edge feature ε;
for the atom graph, the node feature v_i is the one-hot code of information including the atom type, the amino acid to which the atom belongs, and whether the atom is part of the backbone; the protein atom graph is constructed using the K-nearest neighbor algorithm based on the three-dimensional coordinates of the atoms; the edge length is expanded into an n-dimensional vector by a plurality of Gaussian kernel functions and used as the edge feature ε.
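The graph construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `knn_edges` and `gaussian_edge_feature`, the kernel centres and the kernel width are hypothetical and are not taken from the application, and the toy coordinates stand in for real residue or atom positions.

```python
import math

def knn_edges(coords, k):
    """Return directed edges (i, j, distance) linking each node to its k nearest neighbours."""
    edges = []
    for i, ci in enumerate(coords):
        dists = sorted(
            (math.dist(ci, cj), j) for j, cj in enumerate(coords) if j != i
        )
        edges.extend((i, j, d) for d, j in dists[:k])
    return edges

def gaussian_edge_feature(distance, centers, width):
    """Expand one edge length into a vector, one Gaussian kernel per centre."""
    return [math.exp(-((distance - mu) ** 2) / (2 * width ** 2)) for mu in centers]

# Toy coordinates standing in for residue (or atom) positions.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 4.0, 0.0), (10.0, 10.0, 10.0)]
centers = [0.0, 2.0, 4.0, 6.0]  # hypothetical kernel centres
edges = knn_edges(coords, k=2)
features = {(i, j): gaussian_edge_feature(d, centers, width=1.0) for i, j, d in edges}
```

Each edge length becomes a vector with one entry per kernel, so an edge whose length matches a kernel centre produces a value of 1 in that entry and smaller values elsewhere.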
Further, Laplacian position encoding generalizes the position encoding of the original Transformer to graphs and better encodes distance-aware information: nearby nodes have similar position features, and distant nodes have different position features. Therefore, in step 2.2), CProMG uses Laplacian eigenvectors as position codes; the eigenvectors are defined by the factorization of the graph Laplacian matrix, as follows:
where the factorization involves an identity matrix, the n×n diagonal matrix D, which is the degree matrix of the graph, and the adjacency matrix of the graph; it yields a set of eigenvectors corresponding to the set of eigenvalues {λ_k}. The position codes are added to the embedded features of the protein graph nodes to obtain initial node features that carry global spatial information (that is, position information is added to the graph nodes so that subsequent model modules can take it into account):
where the two projection matrices are weight matrices.
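The two formulas referred to in this passage are not present in the text. One reconstruction that is consistent with the definitions given (identity matrix, degree matrix D, adjacency matrix, eigenvectors and eigenvalues λ_k, two weight matrices) assumes the symmetric normalized Laplacian that is commonly used for graph position codes. All symbol names other than D, v_i and λ_k are assumptions made here for illustration.

```latex
% Assumed form: factorization of the symmetric normalized graph Laplacian
\Delta = I - D^{-1/2} A D^{-1/2} = U^{\top} \Lambda U

% Assumed form: initial node feature = projected node embedding + projected position code,
% where p_i collects the entries of the selected eigenvectors for node i
h_i^{(0)} = W_v \, v_i + W_p \, p_i
```

Here I is the identity matrix, A the adjacency matrix, U the matrix of eigenvectors, Λ the diagonal matrix of eigenvalues, and W_v and W_p the two weight matrices.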
Further, the architecture of the dual-view encoder module in step 2.2) is specifically:
the initial features of the amino acid graph and the atom graph (that is, those obtained in A1) are input into En_r and En_a of the dual-view encoder, respectively, to obtain the final representation of the protein binding pocket;
each encoder is composed of t encoding units in series, each containing an edge-enhanced encoding block and a multi-head attention block; the first block enhances the node representation, and the attention block further updates the node representation through a self-attention mechanism;
the edge-enhanced q, k, v are defined as follows:
where the matrices {W} are learnable weight matrices and e_ij represents the edge feature between node i and node j;
the node features are updated by the multi-head attention block:
where the aggregation runs over the nodes of the graph and d_k is a hyperparameter giving the feature dimension; a residual connection is designed between the node representation before and after the attention block, and η(·) represents a regularization function; the node features are then input to the feedforward neural network (FNN), again with a residual connection.
The outputs of the encoders En_r and En_a are defined as H^(t) and Z^(t), respectively; they are concatenated to obtain the final representation of the protein structure, H_P = [H^(t); Z^(t)].
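The attention step inside one encoding layer can be illustrated with a single-head Python sketch. The formulas for the edge-enhanced q, k, v are not present in the text, so this sketch makes a simplifying assumption that is not from the application: the edge feature is added to the neighbour feature to form both key and value, and the learnable projections, the function η(·) and the feedforward network are omitted. The function name `attend` and the toy data are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(h, neighbours, edge, i):
    """Update node i from its neighbours with scaled dot-product attention.

    Assumption: edge[(i, j)] is added to the feature of neighbour j to form
    its key and value; learned weight matrices are left out of this sketch.
    """
    d_k = len(h[i])
    keys = [[a + b for a, b in zip(h[j], edge[(i, j)])] for j in neighbours[i]]
    scores = [sum(q * k for q, k in zip(h[i], key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    message = [sum(w * key[c] for w, key in zip(weights, keys)) for c in range(d_k)]
    # Residual connection: add the aggregated message back onto the node feature.
    return [x + m for x, m in zip(h[i], message)], weights

# Toy graph: node 0 attends to nodes 1 and 2.
h = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
neighbours = {0: [1, 2]}
edge = {(0, 1): [0.0, 0.0], (0, 2): [0.0, 0.0]}
new_h0, weights = attend(h, neighbours, edge, 0)
```

In this toy case node 1 is more similar to node 0 than node 2 is, so it receives the larger attention weight.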
Further, the information cross fusion unit in the step 2.2) specifically includes:
the information cross fusion unit is realized through multi-head attention: the node features output by the multi-head attention of the atom view encoder are regarded as Keys and Values, and the node features output by the multi-head attention of the amino acid view encoder are regarded as Queries:
where the three W matrices each represent a linear layer;
the i-th node feature of the n-th attention head of the amino acid view is updated by the following formula:
where n represents the number of nodes of the atom graph and d_k is a hyperparameter giving the feature dimension; the node features of the multiple attention heads are concatenated and passed through a linear layer, and the nodes are updated using residual connections.
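The formulas of the cross fusion unit are not present in the text. A reconstruction consistent with the description (amino acid view as Queries, atom view as Keys and Values, scaling by d_k, summation over the n atom-graph nodes) would be standard scaled dot-product cross-attention. The symbol names below are assumptions, with H denoting amino acid view features and Z denoting atom view features as in the encoder outputs.

```latex
% Assumed projections: queries from the amino acid view, keys and values from the atom view
Q = H \, W^{Q}, \qquad K = Z \, W^{K}, \qquad V = Z \, W^{V}

% Assumed update of amino acid node i within one attention head
\tilde{h}_i = \sum_{j=1}^{n}
  \frac{\exp\!\left( q_i \cdot k_j / \sqrt{d_k} \right)}
       {\sum_{j'=1}^{n} \exp\!\left( q_i \cdot k_{j'} / \sqrt{d_k} \right)} \; v_j
```

Under this reading, each residue gathers information from all atoms in proportion to their attention weights, which is how atom-level detail is aggregated into the amino acid view.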
Further, the initial characteristics of the small molecules obtained in the step 2.2) are specifically:
given a small molecule SMILES sequence, its physicochemical properties, including the octanol-water partition coefficient (LogP), topological polar surface area (TPSA), drug-likeness (QED) and synthetic accessibility (SA), are calculated using the open-source cheminformatics toolkit RDKit; the four attribute values and the docking score of the protein-small molecule pair (obtained with the docking software AutoDock Vina) are concatenated into the generation condition vector y; the resulting molecule is expressed as:
h_m = [y W_p ; S W_s]
where ';' is the matrix stacking operation, S denotes the one-hot encoding of the SMILES sequence, and W_p and W_s represent two linear layers;
the position coding of the sequence is defined as follows:
where j = 1, 2, ..., n; if d is even, n = d/2, and if d is odd, n = (d+1)/2;
H_pos denotes the position representation of the molecule.
The molecular segment code is h_token = [t_1; t_0; ...; t_0]. The final small molecule embedding is defined as:
h_0 = h_m + h_token + H_pos
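The sum h_0 = h_m + h_token + H_pos can be illustrated with a short Python sketch. The position coding formula itself is not present in the text, so the sketch assumes the standard sinusoidal position code of the original Transformer; the function names, the toy dimension d = 4 and the toy values of t_1 and t_0 are hypothetical.

```python
import math

def sinusoidal_position(pos, d):
    """Sinusoidal position code for one position (assumed form, as in the original Transformer)."""
    code = []
    for c in range(d):
        angle = pos / (10000 ** (2 * (c // 2) / d))
        code.append(math.sin(angle) if c % 2 == 0 else math.cos(angle))
    return code

def embed_sequence(h_m, t_cond, t_tok):
    """Sum content, segment and position codes row by row: h_0 = h_m + h_token + H_pos."""
    d = len(h_m[0])
    out = []
    for pos, row in enumerate(h_m):
        segment = t_cond if pos == 0 else t_tok  # row 0 holds the property condition
        position = sinusoidal_position(pos, d)
        out.append([a + b + c for a, b, c in zip(row, segment, position)])
    return out

# Row 0 stands for the projected condition vector y W_p; the rest for SMILES tokens S W_s.
h_m = [[0.5, 0.5, 0.5, 0.5], [0.0] * 4, [0.0] * 4]
t_cond = [1.0] * 4  # stands for t_1, marking the attribute segment
t_tok = [0.0] * 4   # stands for t_0, marking the SMILES segment
h_0 = embed_sequence(h_m, t_cond, t_tok)
```

The segment code lets the decoder tell the attribute row apart from the SMILES tokens, while the position code distinguishes tokens that are otherwise identical.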
further, the interactive multi-head attention network in the step 2.3) is specifically as follows:
interactive attention of decoder uses the attention mechanism to learn the key dependency of small molecular substructures and proteins, characterizing protein H P As values and keys, the attention calculation is carried out by taking the small molecule characteristics as queries, and the updated ith token characteristic of the small molecule is as follows The expression of the r attention header of the layer I decoder is as follows:
wherein ,representing protein characteristic H P The number of nodes d k Is a superparameter representation->Is a feature dimension of (c).
Further, step 2.4) calculates the loss using the cross-entropy loss function, specifically as follows:
where x_0 = [p, b], p and b represent the attribute conditions and the start symbol, respectively, x_i represents a token in the SMILES sequence, and P represents the probability of generating x_i.
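The loss formula itself is not present in the text. A minimal Python sketch of a token-level cross-entropy, assuming the loss is the negative log-probability of each target token summed over the sequence, is given below; the function name and the toy probabilities are hypothetical.

```python
import math

def sequence_cross_entropy(step_probs, target_ids):
    """Negative log-likelihood of the target tokens, summed over the sequence."""
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))

# Toy example: two decoding steps over a three-token vocabulary.
step_probs = [[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]]  # predicted distributions P(x_i | ...)
target_ids = [0, 1]                                # indices of the true tokens x_i
loss = sequence_cross_entropy(step_probs, target_ids)
```

The loss falls to zero only when the model assigns probability 1 to every true token, so minimizing it pushes the decoder toward reproducing the training SMILES under the given conditions.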
Further, the step 3) of generating a molecular SMILES sequence based on the beam search algorithm specifically includes:
the beam search has one hyperparameter, the beam width k, which represents the width of the search; at time step 1, the desired molecular properties and the start symbol '$' are given as the first two tokens of the k candidate output sequences; at each subsequent time step, starting from the k candidate output sequences of the previous time step, the k candidate output sequences with the highest conditional probabilities are selected from all possible continuations; this is repeated until the end symbol '≡' is found, at which point the search ends. When k = 1, beam search degenerates into greedy search.
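The beam search procedure can be sketched in plain Python as follows. The trained decoder is replaced by a hypothetical lookup table `toy_model`, and the property placeholder `<props>` and the end marker `^` are illustrative choices for this sketch only, not symbols taken from the application.

```python
import math

def beam_search(next_probs, start, end, k, max_len=20):
    """Keep the k most probable partial sequences until all of them end with `end`."""
    beams = [(0.0, list(start))]  # (log-probability, tokens)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end:  # finished sequences are carried over unchanged
                candidates.append((score, seq))
                continue
            for token, p in next_probs(seq).items():
                candidates.append((score + math.log(p), seq + [token]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(seq[-1] == end for _, seq in beams):
            break
    return beams

def toy_model(seq):
    """Hypothetical stand-in for the trained decoder: a fixed next-token table."""
    table = {
        "$": {"C": 0.6, "N": 0.4},
        "C": {"C": 0.1, "O": 0.5, "^": 0.4},
        "N": {"^": 0.9, "C": 0.1},
        "O": {"^": 1.0},
    }
    return table[seq[-1]]

best = beam_search(toy_model, start=["<props>", "$"], end="^", k=2)
greedy = beam_search(toy_model, start=["<props>", "$"], end="^", k=1)
```

With this toy table, the width-2 search returns a sequence of probability 0.36, while the greedy search (k = 1) commits to the locally best first token and ends with a sequence of probability 0.30, which shows why a wider beam can find better sequences.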
Meanwhile, the application also provides an electronic device and a computer-readable storage medium on which a computer program is stored, characterized in that the computer program implements the steps of the above method when executed by a processor.
The principle of the application is as follows:
the application designs a molecule generation model, CProMG, based on a graph Transformer, which can generate molecules that bind strongly to proteins and have controllable properties. Because the application generates small molecules from the protein structure, the process can be regarded as "translating" a protein structure into a small molecule sequence, and the generated small molecules have high targeting affinity; during decoding, the desired attributes are used as the generation condition, which ensures that the generated molecules have the specified attributes, that is, the attributes are controllable. The generation model comprises a protein embedding module, a dual-view encoder module, a small molecule embedding module and a decoder module. First, CProMG enhances the features of the protein binding pocket by fusing hierarchical protein information; second, the interactive multi-head attention module in the decoder calculates attention scores between the small molecule and the protein residues and atoms, so that key interactions between the protein pocket and the small molecule can be captured; finally, by jointly embedding the small molecule and its drug-like properties, new molecules with the required properties are generated autoregressively in a controllable manner.
The application has the advantages that:
the application provides a protein structure-based method for designing totally new active small molecules with controllable attributes, CProMG, which learns the distribution of small molecules in known chemical space and explores small molecules in unknown space. First, through the multi-head attention mechanism in step 2.2), the model learns the amino acid view and atom view features of the protein and fuses them effectively, which significantly enhances the features of the protein pocket. Second, by embedding the drug-like attributes of the small molecule, the decoder can effectively control the attributes of the generated molecules. In addition, the interactive attention module of the decoder in step 2.3) learns the key interactions between the protein and the small molecule by calculating attention scores between small molecule features and protein features, so that the generated molecules bind strongly to the protein. Evaluation of CProMG on a dataset shows that the small molecules it generates have better binding affinity and drug-likeness. The application can therefore serve as a small molecule generation tool that promotes drug discovery and development, improving generation quality while also offering a degree of interpretability.
Drawings
Fig. 1 is a general architecture of a method CProMG proposed by the present application.
Detailed Description
The application is described in further detail below with reference to the attached drawings and specific examples:
this embodiment implements the protein structure-based controllable attribute totally new active small molecule design method described above, wherein:
this example uses a protein-ligand pair dataset containing about 180,000 protein-ligand pairs, each with a docking score calculated by AutoDock Vina. 100,000 pairs are selected for model training, of which 1,000 randomly selected pairs form the validation set and the rest form the training set; 100 pairs are selected to test the model. The sequence similarity between the training data and the test data is less than 30%.
For the protein three-dimensional structure information, a protein amino acid graph and a protein atom graph are constructed using the K-nearest neighbor (KNN) algorithm, and the initial features of the graph nodes and edges are obtained through one-hot encoding. Laplacian position encoding is then used to obtain position information between graph nodes;
the initial features of the protein atom graph and amino acid graph are processed by the dual-view encoder. The information of each node is first enhanced with the edge features, the attention scores between each node and its adjacent nodes are then calculated with a multi-head attention mechanism and the information is aggregated onto the nodes, and the result is finally passed into the feedforward neural network. The information fusion module fuses the information of the two views to obtain the features of each protein.
For the small molecule SMILES sequence information, the physicochemical properties of the small molecule are obtained with the RDKit tool and one-hot encoded together with the SMILES sequence to obtain the initial features of the small molecule.
From the initial features of the small molecule, the decoder predicts the complete small molecule SMILES sequence during training through the multi-head attention module, the interactive attention module and the feedforward neural network.
The model loss is calculated on the predicted small molecule SMILES sequence using a cross-entropy loss function.
After training, the small molecule generation model is combined with a beam search algorithm for generation: the next character is generated from the current sequence each time, and the process is repeated until a complete small molecule SMILES sequence is obtained.
To evaluate the generation quality of the model, the application selects the docking score (VS), drug-likeness (QED), synthetic accessibility (SA) and diversity (Diversity) as basic evaluation indexes; smaller VS and SA are better, and larger QED and Diversity are better. VS is calculated with AutoDock Vina, and the other three indexes are calculated with the RDKit tool.
The trained generation model is tested on the test set data, and the application also compares it with other baseline methods on the same dataset; the test results are shown in Table 1:
Table 1. Performance of CProMG
[1] Luo, S. et al. (2021) A 3D Generative Model for Structure-Based Drug Design. Advances in Neural Information Processing Systems, 34.
[2] Skalic, M. et al. (2019) From Target to Drug: Generative Modeling for the Multimodal Structure-Based Ligand Design. Mol. Pharm., 16, 4282–4291.
As can be seen from Table 1, the molecules generated by the method of the present application are significantly better than those of the other baseline methods in docking score (VS), drug-likeness (QED), synthetic accessibility (SA) and diversity (Diversity).
In summary, the present application can be used to generate small molecules with high binding affinity and specified properties; the implementation details and common general knowledge of the above schemes are not described further here. It should be noted that those skilled in the art can make modifications without departing from the scope of the application; such modifications are also to be regarded as falling within the scope of protection and do not affect the practice of the application or the utility of the patent. The scope of protection of the present application is defined by the claims, and the embodiments and other content of the specification may be used to interpret the claims.

Claims (10)

1. The design method of the controllable attribute totally new active small molecule based on the protein structure is characterized by comprising the following steps:
1) Constructing a small molecule generation model CProMG:
the small molecule generation model CProMG comprises a protein embedding module, a double-view encoder module, a small molecule embedding module and a decoder module, and a beam search algorithm is used for gradually generating a complete SMILES sequence;
the protein embedding module is used for obtaining the amino acid graph features and atom graph features of the protein, and comprises an amino acid graph embedding unit and an atom graph embedding unit;
the dual-view encoder module is used for fusing the amino acid graph features and atom graph features of the protein to obtain fused protein features, and comprises a multi-head attention network, a feedforward neural network and an information cross fusion unit;
the small molecule embedding module is used for obtaining the initial features of small molecules, and comprises a small molecule SMILES and attribute embedding unit, a segment coding unit and a position coding unit;
the decoder module is used for generating a small molecule sequence, and comprises a masked multi-head attention network, an interactive multi-head attention network and a feedforward neural network;
2) Obtaining sample data and training the small molecule generation model CProMG constructed in step 1) to obtain a trained small molecule generation model; the specific training process is as follows:
2.1) Collecting sample data and constructing a training data set and a test data set
The sample data are protein-small molecule pairs whose binding pose root-mean-square deviation (RMSD) is smaller than a threshold value, each pair comprising three-dimensional structural information of a protein and SMILES sequence information of a small molecule;
2.2) Obtaining protein features and initial features of small molecules
Protein features are obtained by:
A1. characterizing the three-dimensional structure of the protein from step 2.1): constructing the protein amino acid graph and the protein atom graph using the K-nearest neighbor algorithm, encoding node information by one-hot encoding, obtaining the initial features of the nodes through Laplacian position encoding, and converting edge lengths into edge features using Gaussian kernel functions;
A2. performing fusion training on the initial features of the protein amino acid graph and the protein atom graph obtained in A1 using the dual-view encoder module to obtain the protein features;
the dual-view encoder module comprises a parallel amino acid view encoder $En_r$ and atom view encoder $En_a$; each encoder comprises t encoding layers; each encoding layer first uses the edge features to enhance the information of each node, then calculates the attention scores between each node and its adjacent nodes using a multi-head attention mechanism, aggregates the adjacent nodes using the attention scores as weights to update the node information, and finally passes the node information into a feedforward neural network; the information cross-fusion unit fuses the information of the two views; finally, the outputs of $En_r$ and $En_a$ are concatenated to obtain the final protein feature representation;
the initial features of the small molecules are obtained by:
characterizing the small molecule SMILES sequence information from step 2.1): obtaining the physicochemical properties of the small molecule using RDKit, concatenating them in front of the small molecule SMILES sequence as the generation condition, and one-hot encoding the whole sequence to obtain the initial features of the small molecule;
2.3) Training on the initial features of the small molecules obtained in step 2.2) using the decoder module, wherein the decoder is similar to the decoder of the original Transformer and comprises t decoding layers; each decoding layer first learns the features of the molecule through a masked multi-head attention network, then uses the interactive multi-head attention network to calculate the relevance between the molecular tokens and the protein features obtained in step 2.2) in order to update the molecular features, and finally passes the molecular features into a feedforward neural network to predict the complete molecular output;
2.4) Calculating the model loss using the molecules predicted in step 2.3), adjusting the model parameters according to the loss by backpropagation, and obtaining the trained small molecule generation model CProMG once training is completed;
3) Using the CProMG model trained in step 2), in combination with a beam search algorithm, to generate a complete small molecule SMILES sequence step by step.
2. The method for designing de novo active small molecules with controllable properties based on protein structure according to claim 1, characterized in that:
in step 2.2), the three-dimensional structure of the protein is represented as an amino acid graph and an atom graph, wherein each graph has a node set in which $v_i$ represents the feature of node $i$, the three-dimensional coordinates of the nodes are recorded, and $\varepsilon = \{e_{ij},\ i, j = 1, 2, \ldots, n,\ i \neq j\}$ represents the edge features;
for the amino acid graph, the node feature $v_i$ is the one-hot encoding of the residue type of the i-th residue; the protein amino acid graph is constructed using the K-nearest neighbor algorithm based on the three-dimensional coordinates of the amino acids; each edge length is expanded into an n-dimensional vector by a plurality of Gaussian kernel functions and used as the edge feature $\varepsilon$;
for the atom graph, the node feature $v_i$ is the one-hot encoding of information including the atom type, the amino acid to which the atom belongs, and whether the atom is a backbone atom; the protein atom graph is constructed using the K-nearest neighbor algorithm based on the three-dimensional coordinates of the atoms; each edge length is expanded into an n-dimensional vector by a plurality of Gaussian kernel functions and used as the edge feature $\varepsilon$.
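The graph construction in claim 2 can be illustrated with a minimal Python sketch: each node is connected to its K nearest neighbours in three-dimensional space, and each edge length is expanded with several Gaussian kernels. The function names `knn_edges` and `gaussian_edge_features`, the kernel centers, and the width parameter `gamma` are assumptions made for this example; the application does not specify them.

```python
import math

def knn_edges(coords, k):
    """Return directed edges (i, j, distance) from each node i to its k nearest nodes j."""
    edges = []
    for i, ci in enumerate(coords):
        # Distances from node i to every other node, nearest first.
        dists = sorted((math.dist(ci, cj), j) for j, cj in enumerate(coords) if j != i)
        edges.extend((i, j, d) for d, j in dists[:k])
    return edges

def gaussian_edge_features(length, centers, gamma=1.0):
    """Expand a scalar edge length into one value per Gaussian kernel (assumed form)."""
    return [math.exp(-gamma * (length - c) ** 2) for c in centers]

# Toy example: three nodes on a line, one neighbour each, three hypothetical kernel centers.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
for i, j, d in knn_edges(coords, k=1):
    print(i, j, gaussian_edge_features(d, centers=[0.0, 1.0, 2.0]))
```

The same two functions would apply to both views, with residue coordinates for the amino acid graph and atom coordinates for the atom graph.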
3. The method for designing de novo active small molecules with controllable properties based on protein structure according to claim 1, characterized in that Laplacian eigenvectors are used as the position encoding in CProMG in step 2.2), wherein the eigenvectors are defined by the factorization of the graph Laplacian matrix:
$\Delta = I - D^{-1/2} A D^{-1/2} = U^{\top} \Lambda U$
wherein $I$ is the identity matrix, the $n \times n$ diagonal matrix $D$ is the degree matrix of the graph, and $A$ is the adjacency matrix of the graph; $U$ comprises a set of eigenvectors, which corresponds to the set of eigenvalues $\{\lambda_k\}$ on the diagonal of $\Lambda$; the position encodings and the embedded features of the protein graph nodes are added to obtain the initial features of the protein graph nodes with global spatial features,
wherein the position encodings and the embedded features are each multiplied by a weight matrix before the addition.
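A small Python sketch may help make the Laplacian position encoding of claim 3 concrete. It assumes the symmetric normalized Laplacian $I - D^{-1/2} A D^{-1/2}$ and, to stay dependency-free, finds the eigenvectors with a plain Jacobi rotation routine; a real implementation would normally call a numerical eigensolver instead. The function names and the choice to skip the first (trivial) eigenvector are assumptions for this example, and eigenvector signs are arbitrary.

```python
import math

def normalized_laplacian(adj):
    """I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix given as nested lists."""
    n = len(adj)
    inv_sqrt = [1.0 / math.sqrt(sum(row)) if sum(row) > 0 else 0.0 for row in adj]
    return [[(1.0 if i == j else 0.0) - inv_sqrt[i] * adj[i][j] * inv_sqrt[j]
             for j in range(n)] for i in range(n)]

def jacobi_eigh(mat, sweeps=100, tol=1e-20):
    """Eigenvalues (ascending) and eigenvectors of a symmetric matrix via Jacobi rotations."""
    n = len(mat)
    a = [list(map(float, row)) for row in mat]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                x = a[q][q] - a[p][p]
                sign = 1.0 if x >= 0 else -1.0
                # Rotation angle in [-pi/4, pi/4] that zeroes a[p][q].
                theta = 0.5 * math.atan2(2.0 * a[p][q] * sign, abs(x))
                c, s = math.cos(theta), math.sin(theta)
                for m in (a, v):  # rotate columns p and q of both matrices
                    for r in range(n):
                        mp, mq = m[r][p], m[r][q]
                        m[r][p], m[r][q] = c * mp - s * mq, s * mp + c * mq
                for r in range(n):  # rotate rows p and q of a
                    ap, aq = a[p][r], a[q][r]
                    a[p][r], a[q][r] = c * ap - s * aq, s * ap + c * aq
    order = sorted(range(n), key=lambda i: a[i][i])
    eigvals = [a[i][i] for i in order]
    eigvecs = [[v[r][i] for r in range(n)] for i in order]  # one list per eigenvector
    return eigvals, eigvecs

def laplacian_position_encoding(adj, dim):
    """Per-node encoding from the dim eigenvectors after the first (assumed choice)."""
    _, eigvecs = jacobi_eigh(normalized_laplacian(adj))
    chosen = eigvecs[1:dim + 1]
    return [[vec[node] for vec in chosen] for node in range(len(adj))]

# Toy example: a path graph with three nodes.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(laplacian_position_encoding(path, dim=2))
```

In the method of the claim, this encoding and the one-hot node embedding would each pass through a learned linear map before being summed; those learned weights are omitted here.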
4. The method for designing de novo active small molecules with controllable properties based on protein structure according to claim 1, characterized in that the architecture of the dual-view encoder module in step 2.2) is specifically:
the embeddings of the amino acid graph and the atom graph are input to the encoders $En_r$ and $En_a$ of the dual-view encoder, respectively, to obtain a final representation of the protein binding pocket;
each encoder is composed of t coding units connected in series, each coding unit containing an edge-enhanced encoding block and a multi-head attention block; the edge-enhanced encoding block first enhances the node representation, and the multi-head attention block then further updates the node representation by a self-attention mechanism;
the edge-enhanced q, k, v are computed from the node representations and the edge features,
wherein $\{W\}$ are learnable weight matrices and $e_{ij}$ represents the edge feature between node i and node j;
the node features are updated by the multi-head attention block,
wherein the attention is computed over the nodes of the graph and $d_k$ is a hyperparameter representing the feature dimension; a residual connection is designed between the node representations before and after the attention block, and $\eta(\cdot)$ represents a regularization function; the node features are then input to the FNN, again with a residual connection;
the outputs of the encoders $En_r$ and $En_a$ are defined as $H^{(t)}$ and $Z^{(t)}$, respectively, and they are concatenated to obtain the final representation feature of the protein structure, $H_P = [H^{(t)}; Z^{(t)}]$.
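The formulas of claim 4 did not survive in this text. One plausible form, offered only as a sketch under the assumption that the attention block follows standard scaled dot-product attention over the neighbours $\mathcal{N}(i)$ of each node, is given below. The symbols $\alpha_{ij}$, $\hat{h}_i$, $h_i'$, $h_i''$, and the edge-dependent $k_{ij}$, $v_{ij}$ are introduced here for illustration and are not taken from the application; only $d_k$, $\eta(\cdot)$, and FNN are named in the claim.

```latex
\alpha_{ij} = \frac{\exp\!\left(q_i^{\top} k_{ij} / \sqrt{d_k}\right)}
                   {\sum_{j' \in \mathcal{N}(i)} \exp\!\left(q_i^{\top} k_{ij'} / \sqrt{d_k}\right)},
\qquad
\hat{h}_i = \sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, v_{ij},
\qquad
h_i' = \eta\!\left(h_i + \hat{h}_i\right),
\qquad
h_i'' = \eta\!\left(h_i' + \mathrm{FNN}(h_i')\right)
```

Read this way, the edge features influence both which neighbours receive weight (through $k_{ij}$) and what is aggregated (through $v_{ij}$), while the two residual connections keep the original node signal available to later layers.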
5. The method for designing de novo active small molecules with controllable properties based on protein structure according to claim 1, characterized in that the information cross-fusion unit in step 2.2) is specifically:
the information cross-fusion unit is implemented through multi-head attention: the node features output by the multi-head attention of the atom view encoder are regarded as keys and values, and the node features output by the multi-head attention of the amino acid view encoder are regarded as queries,
wherein the three W matrices each represent a linear layer;
the i-th node feature of the n-th attention head of the amino acid view is updated by attention over the nodes of the atom graph,
wherein $N$ represents the number of nodes of the atom graph and $d_k$ is a hyperparameter representing the feature dimension; the node features of the multiple attention heads are concatenated and then passed through a linear layer, and the nodes are updated using residual connections.
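The cross-fusion step in claim 5 can be sketched as single-head cross-attention in plain Python: amino acid node features act as queries, atom node features act as keys and values, and the attended result is added back to the query through a residual connection. This is a simplified illustration that assumes standard scaled dot-product attention; the learned linear layers, multiple heads, and the final output projection mentioned in the claim are omitted, and the function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Update each query vector with a softmax-weighted sum of values, plus a residual."""
    d_k = len(keys[0])
    updated = []
    for q in queries:
        # Scaled dot product between this query and every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        context = [sum(w * v[c] for w, v in zip(weights, values))
                   for c in range(len(values[0]))]
        updated.append([x + y for x, y in zip(q, context)])  # residual connection
    return updated

# Toy example: one amino acid node attending over two atom nodes.
amino_nodes = [[0.0, 0.0]]
atom_keys = [[1.0, 0.0], [0.0, 1.0]]
atom_values = [[2.0, 0.0], [0.0, 4.0]]
print(cross_attention(amino_nodes, atom_keys, atom_values))
```

Because the queries come from one view and the keys and values from the other, each amino acid node can pull in atom-level detail without the two graphs needing the same number of nodes.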
6. The method for designing de novo active small molecules with controllable properties based on protein structure according to claim 1, characterized in that the initial features of the small molecule obtained in step 2.2) are specifically:
given a small molecule SMILES sequence, its physicochemical properties, including the octanol-water partition coefficient (LogP), topological polar surface area (TPSA), drug-likeness (QED), and synthetic accessibility (SA), are calculated using the open-source cheminformatics toolkit RDKit; the four property values and the docking score of the protein-small molecule pair are concatenated into a generation condition vector $y$; the resulting molecule is expressed as:
$h_m = [W_y\, y;\ W_s\, s]$
wherein ';' is the stacking operation of matrices, $s$ denotes the one-hot encoding of the SMILES sequence, and $W_y$ and $W_s$ respectively represent two linear layers;
the position encoding of the sequence is defined over the index $j = 1, 2, \ldots, n$,
where $n = d/2$ if $d$ is even and $n = (d+1)/2$ if $d$ is odd;
the position representation of the molecule is denoted $H_{pos}$;
the molecular segment encoding is $h_{token} = [t_1; t_0; \ldots; t_0]$, and the final small molecule embedding is defined as:
$h_0 = h_m + h_{token} + H_{pos}$
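The position-encoding formula of claim 6 is missing from this text. The sketch below assumes the sinusoidal encoding of the original Transformer, which is consistent with the surviving index definition ($j = 1, \ldots, n$ with $n = d/2$ for even $d$ and $n = (d+1)/2$ for odd $d$): each index $j$ contributes one sine and one cosine column, and for odd $d$ the surplus cosine column is dropped. The function name and the base 10000 are assumptions for this example.

```python
import math

def position_encoding(length, d):
    """Sinusoidal position table of shape (length, d), assumed Transformer-style."""
    n = d // 2 if d % 2 == 0 else (d + 1) // 2
    table = []
    for pos in range(length):
        row = []
        for j in range(1, n + 1):
            angle = pos / (10000 ** (2 * (j - 1) / d))
            row.append(math.sin(angle))  # column 2j-1
            row.append(math.cos(angle))  # column 2j
        table.append(row[:d])  # for odd d, drop the one surplus cosine column
    return table

# Toy example: 4 positions, feature dimension 5 (odd, so n = 3).
for row in position_encoding(4, 5):
    print([round(x, 4) for x in row])
```

A table like this would play the role of $H_{pos}$ and be added element-wise to $h_m$ and $h_{token}$ to form $h_0$.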
7. The method for designing de novo active small molecules with controllable properties based on protein structure according to claim 1, characterized in that the interactive multi-head attention network in step 2.3) is specifically:
the interactive attention of the decoder uses the attention mechanism to learn the key dependencies between small molecule substructures and the protein: the protein feature $H_P$ is taken as the values and keys, and the small molecule features are taken as the queries for the attention calculation; the updated i-th token feature of the small molecule is computed in the r-th attention head of the l-th decoder layer,
wherein the calculation involves the number of nodes of the protein feature $H_P$ and the hyperparameter $d_k$, which represents the feature dimension.
8. The method for designing de novo active small molecules with controllable properties based on protein structure according to claim 1, characterized in that step 2.4) calculates the loss using the cross-entropy loss function over the tokens of the SMILES sequence,
wherein $x_0 = [p, b]$, $p$ and $b$ represent the property conditions and the start symbol, respectively, $x_i$ represents a token in the SMILES sequence, and $P$ represents the probability of generating $x_i$.
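As an illustration of the loss in claim 8, the sketch below sums the negative log-probability that the model assigns to each target token, which is the usual autoregressive cross-entropy. Whether the application sums or averages over tokens is not recoverable from this text, so the summation here is an assumption, as are the function name and the dictionary representation of the per-step distributions.

```python
import math

def sequence_cross_entropy(step_probs, target_tokens):
    """Sum of -log P(x_i | earlier tokens) over the target SMILES tokens.

    step_probs[i] is the model's predicted distribution at step i, given as a
    dict mapping token -> probability (a hypothetical representation).
    """
    loss = 0.0
    for probs, token in zip(step_probs, target_tokens):
        loss -= math.log(probs[token])
    return loss

# Toy example: two decoding steps over a two-token vocabulary.
predicted = [{"C": 0.5, "O": 0.5}, {"C": 0.25, "O": 0.75}]
print(sequence_cross_entropy(predicted, ["C", "C"]))
```

The loss shrinks toward zero as the model puts more probability on each correct next token, which is what drives the parameter updates in step 2.4).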
9. The method for designing de novo active small molecules with controllable properties based on protein structure according to claim 1, characterized in that generating the molecular SMILES sequence based on the beam search algorithm in step 3) specifically comprises:
the beam search has one hyperparameter, the beam width k, which represents the width of the search; at time step 1, the desired molecular properties and the start symbol '$' are given as the first two tokens of each of the k candidate output sequences; at each subsequent time step, based on the k candidate output sequences of the previous time step, the k candidate output sequences with the highest conditional probabilities are selected from all possible continuations; this is repeated until the end symbol is reached, at which point the search ends.
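The beam search of claim 9 can be sketched in a few lines of Python. The scoring model here is a hand-written lookup table standing in for the trained decoder; the table `TOY`, the placeholder property token `"<props>"`, and the end token `"<end>"` are all hypothetical, while the start symbol `'$'` follows the claim.

```python
import math

def beam_search(next_log_probs, prefix, k, end_token, max_steps=50):
    """Keep the k highest-scoring partial sequences until all of them have ended."""
    beams = [(0.0, list(prefix))]
    for _ in range(max_steps):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end_token:
                candidates.append((score, seq))  # finished sequences carry over
                continue
            for token, logp in next_log_probs(seq).items():
                candidates.append((score + logp, seq + [token]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(seq[-1] == end_token for _, seq in beams):
            break
    return beams

# Hypothetical next-token probabilities, keyed by the last token only.
TOY = {
    "$": {"C": 0.6, "N": 0.4},
    "C": {"C": 0.1, "O": 0.5, "<end>": 0.4},
    "N": {"C": 0.9, "<end>": 0.1},
    "O": {"<end>": 1.0},
}

def toy_model(seq):
    return {tok: math.log(p) for tok, p in TOY[seq[-1]].items()}

for score, seq in beam_search(toy_model, ["<props>", "$"], k=2, end_token="<end>"):
    print(round(math.exp(score), 3), "".join(seq[2:-1]))
```

Scores are accumulated as log-probabilities so that long sequences do not underflow; a larger k explores more alternatives at proportionally higher cost.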
10. An electronic device and a computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202310707583.0A 2023-06-15 2023-06-15 Controllable attribute totally new active small molecule design method based on protein structure Pending CN116758978A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310707583.0A CN116758978A (en) 2023-06-15 2023-06-15 Controllable attribute totally new active small molecule design method based on protein structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310707583.0A CN116758978A (en) 2023-06-15 2023-06-15 Controllable attribute totally new active small molecule design method based on protein structure

Publications (1)

Publication Number Publication Date
CN116758978A (en) 2023-09-15

Family

ID=87960244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310707583.0A Pending CN116758978A (en) 2023-06-15 2023-06-15 Controllable attribute totally new active small molecule design method based on protein structure

Country Status (1)

Country Link
CN (1) CN116758978A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118098372A (en) * 2024-04-23 2024-05-28 华东交通大学 Virulence factor identification method and system based on self-attention coding and pooling mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination