CN117524353B - Molecular large model based on multidimensional molecular information, construction method and application - Google Patents
Molecular large model based on multidimensional molecular information, construction method and application Download PDFInfo
- Publication number
- CN117524353B CN117524353B CN202311574206.0A CN202311574206A CN117524353B CN 117524353 B CN117524353 B CN 117524353B CN 202311574206 A CN202311574206 A CN 202311574206A CN 117524353 B CN117524353 B CN 117524353B
- Authority
- CN
- China
- Prior art keywords
- molecular
- node
- information
- self
- dimensional
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000010276 construction Methods 0.000 title claims abstract description 16
- 238000012549 training Methods 0.000 claims abstract description 40
- 238000000034 method Methods 0.000 claims abstract description 33
- 238000010586 diagram Methods 0.000 claims abstract description 20
- 230000008569 process Effects 0.000 claims abstract description 17
- 238000007781 pre-processing Methods 0.000 claims abstract description 5
- 239000011159 matrix material Substances 0.000 claims description 29
- 238000005295 random walk Methods 0.000 claims description 23
- 230000006870 function Effects 0.000 claims description 19
- 238000013507 mapping Methods 0.000 claims description 15
- 125000004429 atom Chemical group 0.000 claims description 11
- 238000013528 artificial neural network Methods 0.000 claims description 6
- 239000012634 fragment Substances 0.000 claims description 4
- 125000004435 hydrogen atom Chemical group [H]* 0.000 claims description 4
- 230000000873 masking effect Effects 0.000 claims description 4
- NAWXUBYGYWOOIX-SFHVURJKSA-N (2s)-2-[[4-[2-(2,4-diaminoquinazolin-6-yl)ethyl]benzoyl]amino]-4-methylidenepentanedioic acid Chemical compound C1=CC2=NC(N)=NC(N)=C2C=C1CCC1=CC=C(C(=O)N[C@@H](CC(=C)C(O)=O)C(O)=O)C=C1 NAWXUBYGYWOOIX-SFHVURJKSA-N 0.000 claims description 3
- 238000004422 calculation algorithm Methods 0.000 claims description 3
- 230000003993 interaction Effects 0.000 claims description 3
- 238000010606 normalization Methods 0.000 claims description 3
- 238000012545 processing Methods 0.000 claims description 2
- 239000003814 drug Substances 0.000 abstract description 10
- 238000007877 drug screening Methods 0.000 abstract description 5
- 229940079593 drug Drugs 0.000 abstract description 4
- 238000012827 research and development Methods 0.000 abstract description 4
- 238000012512 characterization method Methods 0.000 description 7
- HCHKCACWOHOZIP-UHFFFAOYSA-N Zinc Chemical compound [Zn] HCHKCACWOHOZIP-UHFFFAOYSA-N 0.000 description 3
- 150000001875 compounds Chemical class 0.000 description 3
- 239000000126 substance Substances 0.000 description 3
- 239000011701 zinc Substances 0.000 description 3
- 238000013459 approach Methods 0.000 description 2
- 238000013473 artificial intelligence Methods 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 2
- 238000006243 chemical reaction Methods 0.000 description 2
- 238000004768 lowest unoccupied molecular orbital Methods 0.000 description 2
- 239000000463 material Substances 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000005381 potential energy Methods 0.000 description 2
- ORILYTVJVMAKLC-UHFFFAOYSA-N Adamantane Natural products C1C(C2)CC3CC1CC2C3 ORILYTVJVMAKLC-UHFFFAOYSA-N 0.000 description 1
- 238000003775 Density Functional Theory Methods 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 238000009412 basement excavation Methods 0.000 description 1
- 230000015572 biosynthetic process Effects 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000009510 drug design Methods 0.000 description 1
- 238000009509 drug development Methods 0.000 description 1
- 238000007876 drug discovery Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000005065 mining Methods 0.000 description 1
- 238000003058 natural language processing Methods 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000003786 synthesis reaction Methods 0.000 description 1
- 229940043263 traditional drug Drugs 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
- 229910052725 zinc Inorganic materials 0.000 description 1
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/50—Molecular design, e.g. of drugs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2155—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
Landscapes
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Theoretical Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Data Mining & Analysis (AREA)
- Chemical & Material Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Crystallography & Structural Chemistry (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Medicinal Chemistry (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Databases & Information Systems (AREA)
- Pharmacology & Pharmacy (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Evolutionary Biology (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides a molecular large model based on multidimensional molecular information, a construction method and application thereof, comprising the steps of constructing an unsupervised pre-training data set, preprocessing the unsupervised pre-training data set and generating a molecular conformation to obtain a molecular pre-training data set formed by a molecular map; carrying out structural coding on a molecular diagram in a molecular pre-training data set to obtain initialized atomic characteristics, and inputting the initialized atomic characteristics into a transducer; the shortest path structure code, the side information code and the three-dimensional distance pair code are integrated into the self-attention layer of the transducer, the three-dimensional distance pair code interacts with the self-attention layer of the transducer in the training process, and node pair characteristics of the self-attention layer of the transducer are updated iteratively; and defining a two-dimensional space and three-dimensional space combined molecular graph self-supervision learning task, and obtaining a molecular large model based on multidimensional molecular information after training. The invention can accelerate the drug screening speed and provide help for drug research and development.
Description
Technical Field
The invention belongs to the field of artificial intelligence, and particularly discloses a molecular large model construction method based on multidimensional molecular information.
Background
Traditional drug development is a complex and time-consuming process involving multiple links, such as potential target identification, compound optimization, bioactivity evaluation, etc., requiring significant manpower, material resources, and financial resources. The large model can utilize massive biomedical data to carry out excavation and analysis, and potential drug molecules can be rapidly screened out, so that the drug research and development speed is improved, the manpower and material resources and the input cost are reduced, and powerful support is provided for innovation and development of the intelligent medicine industry. For existing datasets, such as ZINC, which contain only two-dimensional molecular information, the ability of the model to learn spatial information is limited. In addition, there are some public datasets that contain three-dimensional spatial information, such as PCQM m 2 or QM9, but the limitation of the data size cannot meet the requirement of the current molecular large model, so that it is necessary to construct a large-scale molecular dataset that contains two-dimensional plane and three-dimensional stereo information at the same time. Successful application of large models is not supported by a transducer, but the existing transducer cannot fully characterize structural information in the graph, so that large model learning on large-scale graph data is difficult. 2021, ying et al proposed Graphormer to improve the modeling capability of structural information by introducing the structural coding information of the graph into the transducer, but this approach lacks learning of three-dimensional information, thereby limiting the application scope of the model. Therefore, luo introduces three-dimensional information learning based on Graphormer and introduces three-dimensional position coding, however, the method only adds additional information in the attention matrix and cannot improve the modeling capability of three-dimensional coordinates, thereby limiting the application range of the model.
In the existing pre-training model, such as SMILES character strings, structural information cannot be well captured due to the limitation of sequence conditions. For contrast pre-training tasks, improper data enhancement can result in false positive samples, and this learning approach can result in bias in model learning. Another method is to train the model by generating a pre-training task, and by masking a certain proportion of atoms and edges, and letting the model measure the attribute of the masked part, but the generating pre-training task may have a too simple problem, for example, for the chemical field, the nature only contains 118 elements, and there is a serious data imbalance problem, from this point of view, the molecular characterization learning may not fully utilize the prior knowledge of chemistry, and the performance is affected. From the three-dimensional perspective, GEM proposes to introduce spatial structure information of a compound into a pre-training encoding process, but only uses an edge-bond angle diagram as additional spatial information, and lacks complete utilization of three-dimensional coordinates; transformer-M first proposes to use two-dimensional information and three-dimensional information for characterization learning, but is limited by the number of three-dimensional equilibrium conformations, and cannot be widely popularized to a large-scale data set.
In summary, the existing methods have more or less certain limitations: (1) For the biomedical field, a molecular large model is needed to predict properties such as bioactivity and side effects of molecules to accelerate drug screening, or to generate molecules with specific properties and structures to provide candidate molecules for drug design and discovery; (2) The existing models are mostly dependent on small-scale data sets, are limited by the number of existing three-dimensional equilibrium conformations, and lack large-scale graph data sets which can be used for training a molecular large model; (3) The existing model has limited characterization capability on molecular information, cannot fully learn the three-dimensional space information, and needs to provide a model with high expression capability.
Disclosure of Invention
The invention provides a molecular large model based on multidimensional molecular information, a construction method and application thereof, and aims to solve the problems that a large-scale graph data set which can be used for molecular large model training is lacking in the existing biological medicine field, the existing model has limited representation capability on molecular information, and three-dimensional space information cannot be fully learned.
The invention provides a molecular large model construction method based on multi-dimensional molecular information, which comprises the following steps:
Constructing an unsupervised pre-training data set, and performing preprocessing and molecular conformation generation processing on the unsupervised pre-training data set to obtain a molecular pre-training data set formed by a molecular diagram;
Carrying out structural coding on a molecular diagram in the molecular pre-training data set to obtain initialized atomic characteristics, and inputting the initialized atomic characteristics into a transducer;
integrating shortest path structural codes, side information codes and three-dimensional distance pair codes into the self-attention layer of the transducer, interacting the three-dimensional distance pair codes with the self-attention layer of the transducer in the training process, and iteratively updating node pair characteristics of the self-attention layer of the transducer;
And defining a two-dimensional space and three-dimensional space combined molecular graph self-supervision learning task, and obtaining a molecular large model based on multidimensional molecular information after training.
A method of molecular large model construction based on multidimensional molecular information according to some embodiments of the present application, the pre-processing includes removing hydrogen atoms, removing charges, removing small fragments, removing chiralities and standardized tautomers, retaining a backbone structural representation of the molecule, the backbone structural representation including an ID number of the molecule in an original database and a one-dimensional molecular SMILES representation.
A method for constructing a molecular large model based on multidimensional molecular information according to some embodiments of the present application, the molecular conformation generation process includes performing the molecular conformation generation process by RDkit kits based on the following steps:
generating a preliminary molecular conformation based on the distance geometry;
modifying the molecular conformation based on ETKDG method;
The molecular conformation was optimized based on MMFF force field.
According to some embodiments of the application, the method for constructing the molecular large model based on multidimensional molecular information comprises a plurality of transducer blocks, wherein each transducer block consists of a self-focusing layer and a feedforward neural network layer, and the self-focusing layer and the feedforward neural network layer are subjected to standard normalization operation.
According to some embodiments of the application, a molecular large model construction method based on multidimensional molecular information is provided, wherein the molecular diagram is shown in a formula (1):
G={Xatom,A,E,R} (1)
Wherein, For the atomic node characteristic matrix, n represents the number of atoms, d represents the atomic characteristic dimension, X atom contains the inherent attribute of atoms, A represents the adjacent matrix of the molecular diagram and covers the 1 st order topology information of the molecular diagram, E represents the set of the upper edges of the molecular diagram,/>Representing the geometric space coordinates of the molecule in three dimensions.
According to some embodiments of the application, the initialized atomic characteristics comprise an atomic node characteristic matrix, node degree codes, random walk position codes and three-dimensional distance codes, and the initialized atomic characteristics are shown in a formula (2):
x0=[Xatom|Xdegree|XRW|X3D] (2)
Wherein, X degree represents node degree code, X RW represents random walk position code, and X 3D represents three-dimensional distance code;
The node degree code X degree is shown in formula (3):
xdegree=fα(D) (3)
Wherein D represents a degree matrix of the molecular graph, f α is a mapping function of the degree information,
The random walk position code X RW is shown in formula (4):
Wherein, The random walk position code of the node i is represented, m represents the dimension of the random walk position code, RW is a random walk operation result matrix, and the random walk operation result matrix is represented by a formula (5):
RW=AD-1 (5)
wherein D -1 represents the inverse of the degree,
The three-dimensional distance codeAs shown in formula (6):
Wherein, The three-dimensional distance coding of the node i is represented, U (i) represents a neighbor node set of the node i, U (i) represents the sum of the neighbor numbers of the node i, r i-rj represents the distance information of the node i and the node j, r i represents the coordinate information of the node i, and r j represents the coordinate information of the node j.
According to the molecular large model construction method based on multidimensional molecular information in some embodiments of the present application, the shortest path structure code, the side information code and the three-dimensional distance pair code are integrated into the self-attention layer of the transducer by means of bias terms, as shown in formula (7):
wherein Att (X) l+1 represents the l+1st layer self-attention layer, att (X) l represents the l layer self-attention layer, SPD represents the shortest path structure encoding, edge represents the side information encoding, Representing the three-dimensional distance pair code,
The shortest path structure code SPD is shown in a formula (8):
wherein F is the shortest path between points in the molecular graph obtained according to Floyd algorithm, As a mapping function of the shortest path,
The Edge information code Edge is shown in a formula (9):
Edge=gθ(E) (9)
where g θ is a mapping function for side information,
The three-dimensional distance pair code is shown in formula (10):
wherein r i represents coordinate information of a node i, r j represents coordinate information of a node j, alpha i,j,βi,j,μk,σk are learnable parameters, alpha i,j,βi,j is controlled by an atomic node element type, corresponding alpha i,j,βi,j of node pairs formed by different elements are different, mu k,σk is a parameter of Gaussian kernel mapping, and k represents the number of Gaussian kernels;
The three-dimensional distance is used for interacting the code with the self-attention layer of the transducer, and the interaction and the iterative updating process of the self-attention layer of the transducer are shown in the formula (11) -the formula (12):
Wherein, Features representing the initial node pair i-j,/>M represents a mapping matrix,/>Characteristic of node pair i-j representing the first self-attention layer, H is the number of heads of attention, d is the dimension of the hidden layer,/>Is the query of the h head of the first self-attention layer,/>Is the key of the h head of the self-attention l layer,
The characteristics of the updated node pairs represent bias terms as the next self-attention layer.
According to the molecular large model construction method based on multidimensional molecular information, the defined two-dimensional space and three-dimensional space joint molecular graph self-supervision learning task comprises a two-dimensional space covering node attribute prediction task and a three-dimensional space coordinate denoising task;
The two-dimensional space hidden node attribute prediction task adopts the node attribute which is predicted to be hidden as a pre-training task, model learning molecular structure information is enabled to predict the hidden attribute by masking part of the node characteristics of the graph in input, and a loss function L 2D of the two-dimensional space hidden node attribute prediction task is shown as a formula (13):
L2D=-∑i∈Mlogp(zi|GM) (13)
Wherein p is a conditional probability, z i represents the output of the last transducer block corresponding to node i, and G M represents the masked molecular map;
the three-dimensional space coordinate denoising task is based on atomic three-dimensional coordinate information when the three-dimensional space coordinate denoising task is input Added Gaussian noise/> Disturbing molecular geometry to minimize the difference between the model predicted noise value and the input noise value, and model predicted kth coordinate dimension noise output/>As shown in equation (13):
wherein att ij represents the attention score between node i and node j, And/>Representing a learnable parameter,/>The relative position information corresponding to the kth coordinate dimension of Δij is represented, and Δij represents the relative position information of the node i and the node j, as shown in formula (15):
the loss function of the three-dimensional space coordinate denoising task is shown in formula (16):
wherein V represents all node sets in the graph, |V| represents the number of nodes, ε i represents the true coordinate noise of the ith node, Coordinate noise representing the predicted ith node;
the loss function of the molecular large model based on the multidimensional molecular information is shown in a formula (17):
L=αL2p+βL3D (17)
wherein alpha represents the loss weight of the two-dimensional space mask node attribute prediction task, and beta represents the loss weight of the three-dimensional space coordinate denoising task.
The invention also provides a molecular large model based on the multi-dimensional molecular information, which is obtained by adopting the molecular large model construction method based on the multi-dimensional molecular information.
The invention also provides application of the model in the field of biological medicine, and the data set to be subjected to the downstream task is input into the multi-dimensional molecular large model for fine adjustment to obtain an output result corresponding to the downstream task.
The molecular large model based on multidimensional molecular information, the construction method and the application provided by the invention can assist the biological medicine field to better understand the molecular structure and the chemical principle from the artificial intelligence angle, thereby revealing the internal mechanism of the molecule, and can effectively learn the molecular characterization by fully utilizing two-dimensional and three-dimensional molecular information, widely assist various downstream tasks such as molecular property prediction, molecular generation and the like, and accelerate the research and development speed of the medicine, and the method specifically comprises the following steps:
(1) A large-scale graph pre-training data set is constructed, two-dimensional and three-dimensional information is contained, and the defect of single information mining of the existing model is overcome;
(2) The graph representation learning method with strong expression capability is designed, the topological structure of the graph is fully excavated, the potential energy conversion in the three-dimensional geometric space is simulated, interaction is carried out with the attention moment array, and the application range of downstream tasks can be enlarged;
(3) The multi-dimensional self-supervision learning task can fully utilize the two-dimensional and three-dimensional graph structure information, and further improve the representation capability of the model.
Drawings
FIG. 1 is a schematic flow diagram of a molecular large model construction method based on multidimensional molecular information according to an embodiment of the invention;
FIG. 2 is a schematic flow chart of a transducer according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are illustrative of the invention but are not intended to limit the scope of the invention.
The embodiment provides a molecular large model construction method based on multi-dimensional molecular information, as shown in fig. 1, comprising the following steps:
step 1: the method comprises the steps of constructing an unsupervised pre-training data set, wherein data of the constructed unsupervised pre-training data set are derived from a PubChem database and a ZINC database, each piece of molecular data in the unsupervised pre-training data set comprises an ID number of the molecular data in an original database and a one-dimensional molecular SMILES representation, 1.1 parts per million of molecular data is derived from a PubChem database, and 10 parts per million of molecular data is derived from the ZINC database.
SMILES is a linear symbol that contains only simple atoms and bonds, and few grammatical rules, but can represent molecular information, and is similar to text information, and by learning this sequence representation in a large language model, molecular properties can be predicted by learning this sequence representation in a natural language model, by reference to learning patterns in natural language processing. However, this method has some problems: firstly, SMILES cannot fully capture molecular structure information, such as similarity information of two molecules, and the like, so that a model cannot fully utilize the structure information to influence final performance; meanwhile, one molecule can be characterized as a plurality of SMILES forms, so that learning is deviated and performance is affected; finally, since the input data only has a SMILES form, the input format of downstream tasks such as molecular property prediction and the like is greatly limited, and the method cannot be directly applied to large-scale drug screening.
Considering that the same molecule can have a plurality of different SMILES forms, this results in the inability to use the SMILES forms as a matching and deduplication operation for the compound, thus guaranteeing a one-to-one correspondence of molecules to SMILES sequences by removing hydrogen atoms, removing charges, removing small fragments, removing chiralities and standardizing tautomers when the unsupervised pretrained dataset is pre-processed, preserving the backbone structural representation of the molecule, including the ID number of the molecule in the original database and the one-dimensional molecular SMILES representation.
Performing a conformational generation process on the unsupervised pre-training dataset includes performing a molecular conformational generation process by RDkit kit based on the following steps:
generating a preliminary molecular conformation based on the distance geometry;
modifying the molecular conformation based on ETKDG method;
The molecular conformation was optimized based on MMFF force field.
And finally, generating a 2D graph representation of the molecule according to the standardized SMILES sequence, uniformly storing the atomic characteristics, the chemical bond characteristics, the adjacent matrix of the molecular graph and the atomic three-dimensional coordinates generated by the molecular conformation generation process into a pt type file, and ensuring that the model is directly called.
In this embodiment, the pre-training stage adopts the constructed unsupervised pre-training data set, and performs pre-training according to the set multi-dimensional self-supervision task, so that the model has a certain molecular characterization capability, and can be effectively generalized to various downstream tasks.
Step 2: carrying out structural coding on a molecular diagram in a molecular pre-training data set to obtain initialized atomic characteristics, and inputting the initialized atomic characteristics into a transducer, as shown in fig. 2;
in molecular representation learning, a transducer can effectively capture global information, and the global information is particularly important in molecular characterization learning, and for this purpose, the transducer is used as a backbone network of a molecular large model framework, and comprises a plurality of transducer blocks, wherein each transducer block consists of a self-focusing layer and a feedforward neural network layer, and the self-focusing layer and the feedforward neural network layer perform standard normalization operation.
The molecular diagram is shown in formula (1):
Wherein, For the atomic node characteristic matrix, n represents the number of atoms, d represents the atomic characteristic dimension, X atom contains the inherent attribute of atoms, A represents the adjacent matrix of the molecular diagram and covers the 1 st order topology information of the molecular diagram, E represents the set of the upper edges of the molecular diagram,/>Representing the geometric space coordinates of the molecule in three dimensions.
The position coding is an indispensable component in the Transformer, and when the node characteristics are input, besides the atomic node characteristic matrix, the node degree coding, the random walk position coding and the three-dimensional distance coding are integrated, and for the graph structure, no fixed node sequence exists, so that the position coding is very difficult, compared with other position coding modes, the position coding based on the random walk has lower calculation complexity, the problem of characteristic value symbols is not needed to be considered, the characteristic position information is calculated mainly by virtue of the adjacent matrix and the degree, and the specific position coding can be provided for the nodes with different k-hop topological neighbors.
The initialized atomic characteristics comprise an atomic node characteristic matrix, node degree codes, random walk position codes and three-dimensional distance codes, and the initialized atomic characteristics are shown in a formula (2):
x0=[Xatom|Xdegree|XRW|X3D] (2)
Wherein, X degree represents node degree code, X RW represents random walk position code, and X 3D represents three-dimensional distance code;
The node degree code X degree is shown in formula (3):
xdegree=fα(D) (3)
Wherein D represents a degree matrix of the molecular graph, f α is a mapping function of the degree information,
The random walk position code X RW is shown in formula (4):
Wherein, The random walk position code of the node i is represented, m represents the dimension of the random walk position code, RW is a random walk operation result matrix, and the random walk operation result matrix is represented by a formula (5):
RW=AD-1 (5)
wherein D -1 represents the inverse of the degree,
Three-dimensional distance codingAs shown in formula (6):
Wherein, The three-dimensional distance coding of the node i is represented, U (i) represents a neighbor node set of the node i, U (i) represents the sum of the neighbor numbers of the node i, r i-rj represents the distance information of the node i and the node j, r i represents the coordinate information of the node i, and r j represents the coordinate information of the node j.
Step 3: the shortest path structure code, the side information code and the three-dimensional distance pair code are integrated into the self-attention layer of the transducer, the three-dimensional distance pair code interacts with the self-attention layer of the transducer in the training process, and node pair characteristics of the self-attention layer of the transducer are updated iteratively;
the shortest path structural code, the side information code and the three-dimensional distance pair code are integrated into the self-attention layer of the transducer by means of offset items, as shown in a formula (7):
wherein Att (X) l+1 represents the l+1st layer self-attention layer, att (r) l represents the l layer self-attention layer, SPD represents the shortest path structure encoding, edge represents the side information encoding, Representing the three-dimensional distance pair code,
The shortest path structure code SPD is shown in formula (8):
wherein F is the shortest path between points in the molecular graph obtained according to Floyd algorithm, As a mapping function of the shortest path,
The Edge information code Edge is shown in formula (9):
Edge=gθ(E) (9)
where g θ is a mapping function for side information,
The three-dimensional distance pair code is shown in formula (10):
wherein r i represents coordinate information of a node i, r j represents coordinate information of a node j, alpha i,j,βi,j,μk,σk are learnable parameters, alpha i,j,βi,j is controlled by an atomic node element type, corresponding alpha i,j,βi,j of node pairs formed by different elements are different, mu k,σk is a parameter of Gaussian kernel mapping, and k represents the number of Gaussian kernels;
the three-dimensional distance is used for interacting the coding and the self-attention layer of the transducer, and the iterative updating process of the self-attention layer of the transducer is shown in the formula (11) -the formula (12):
Wherein, Features representing the initial node pair i-j,/>M represents the mapping matrix and,Characteristic of node pair i-j representing the first self-attention layer, H is the number of heads of attention, d is the dimension of the hidden layer,/>Is the query of the h head of the first self-attention layer,/>Is the key of the h head of the self-attention l layer,
The updated node pair feature serves as a bias term for the next self-attention layer.
Step 4: and defining a two-dimensional space and three-dimensional space combined molecular graph self-supervision learning task, and obtaining a molecular large model based on multidimensional molecular information after training.
The self-supervision learning task of the joint sub-graph of the two-dimensional space and the three-dimensional space comprises a two-dimensional space covering node attribute prediction task and a three-dimensional space coordinate denoising task;
The two-dimensional space hidden node attribute prediction task adopts the node attribute which is predicted to be hidden as a pre-training task, and model learning molecular structure information is enabled to predict the hidden attribute by masking part of the node characteristics of the graph in input, and a loss function L 2D of the two-dimensional space hidden node attribute prediction task is shown as a formula (13):
L2D=-∑i∈Mlogp(zi|GM) (13)
Wherein p is a conditional probability, z i represents the output of the last transducer block corresponding to node i, and G M represents the masked molecular map;
atomic three-dimensional coordinate information when three-dimensional space coordinate denoising task is input through input Adding Gaussian noiseDisturbing molecular geometry to minimize the difference between the model predicted noise value and the input noise value, and model predicted kth coordinate dimension noise output/>As shown in equation (14):
wherein att ij represents the attention score between node i and node j, And/>Representing a learnable parameter,/>The relative position information corresponding to the kth coordinate dimension of Δij is represented, and Δij represents the relative position information of the node i and the node j, as shown in formula (15):
the loss function of the three-dimensional space coordinate denoising task is shown in formula (16):
wherein V represents all node sets in the graph, |V| represents the number of nodes, ε i represents the true coordinate noise of the ith node, Coordinate noise representing the predicted ith node;
The loss function of the molecular large model based on multidimensional molecular information is shown in formula (17):
L=αL2D+βL3D (17)
wherein alpha represents the loss weight of the two-dimensional space mask node attribute prediction task, and beta represents the loss weight of the three-dimensional space coordinate denoising task.
The model can be fused with molecular information of different visual angles and stored as a pt type file according to the design of the two-dimensional space and three-dimensional space combined molecular graph self-supervision learning task, so that the downstream task can be further fine-tuned, and the generalization performance is improved.
The embodiment also provides a molecular large model based on the multi-dimensional molecular information, which is obtained by adopting the molecular large model construction method based on the multi-dimensional molecular information.
The embodiment also provides application of the molecular large model based on the multi-dimensional molecular information in the biomedical field, and the data set to be subjected to the downstream task is input into the molecular large model based on the multi-dimensional molecular information for fine adjustment to obtain an output result corresponding to the downstream task, wherein the data set to be subjected to the downstream task comprises a molecular property prediction task data set, a three-dimensional coordinate generation task data set and a drug screening data set. The model can fully excavate molecular information in the field of biological medicine, learn molecular characterization by simulating two-dimensional structure information and potential energy conversion in a three-dimensional geometric space, improve the performance of the model on a plurality of downstream tasks such as molecular property prediction, target spot prediction, molecular synthesis and the like, and accelerate the drug screening speed, thereby providing important support and assistance for drug research and development.
The embodiment provides an application of a molecular large model based on multi-dimensional molecular information in a fine adjustment data set in the biomedical field, in general, the molecular tag information can be used for positively guiding a trained model in performance, and after the molecular large model in the embodiment completes a pre-training task, three types of output of a graph feature vector, a node feature matrix and a three-dimensional coordinate matrix can be finally generated to be connected with various downstream tasks, for example: and (3) completing a molecular property prediction task by using the graph feature vector, completing a molecular pose prediction task by using a three-dimensional coordinate matrix, and the like.
The dataset used in this embodiment is PCQM4mv2 dataset, and using the HOMO-LUMO values provided in PCQM4mv2 dataset, the supervised downstream task is defined as a regression task-predicting the quantum characteristics of the molecular graph to optimize model parameters in the self-supervised learning process, including:
Step 1: a supervised fine tuning dataset was constructed, the data of which was derived from the PCQM4mv2 dataset, and the PCQM4mv2 dataset contained 340 ten thousand organic molecules, recording the three-dimensional conformation and HOMO-LUMO energy gap under the molecular equilibrium state calculated using density functional theory.
Step 2: the supervision fine tuning data set is preprocessed, and the backbone structure representation of the molecules is reserved through removing hydrogen atoms, removing charges, removing small fragments, removing chirality and standardized tautomers, wherein the backbone structure representation comprises an ID number of the molecules in an original database and a one-dimensional molecular SMILES representation, so that one-to-one correspondence of the molecules and SMILES sequences is ensured.
Step 3: in this embodiment, MAE is used as the loss function, and an Adam optimizer is used to optimize the model learning parameters. And adjusting the super parameters according to the verification set.
MAE calculation is shown in equation (18):
where N represents the number of all molecular maps in the PCQM Mv2 dataset, Output result of nth molecule is expressed,/>Representing the true tag of the nth molecule.
The embodiments of the invention have been presented for purposes of illustration and description, and are not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
Claims (3)
1. The molecular large model construction method based on multidimensional molecular information is characterized by comprising the following steps:
Constructing an unsupervised pre-training data set, and performing preprocessing and molecular conformation generation processing on the unsupervised pre-training data set to obtain a molecular pre-training data set formed by a molecular diagram;
Carrying out structural coding on a molecular diagram in the molecular pre-training data set to obtain initialized atomic characteristics, and inputting the initialized atomic characteristics into a transducer;
integrating shortest path structural codes, side information codes and three-dimensional distance pair codes into the self-attention layer of the transducer, interacting the three-dimensional distance pair codes with the self-attention layer of the transducer in the training process, and iteratively updating node pair characteristics of the self-attention layer of the transducer;
Defining a two-dimensional space and three-dimensional space joint molecular graph self-supervision learning task, and obtaining a molecular large model based on multidimensional molecular information after training;
The transformers comprise a plurality of transformers, each Transformer consists of a self-focusing layer and a feedforward neural network layer, and the self-focusing layer and the feedforward neural network layer perform standard normalization operation;
the molecular diagram is shown in formula (1):
G={Xatom,A,E,R} (1)
Wherein, For the atomic node characteristic matrix, n represents the number of atoms, d represents the atomic characteristic dimension, X atom contains the inherent attribute of atoms, A represents the adjacent matrix of the molecular diagram and covers the 1 st order topology information of the molecular diagram, E represents the set of the upper edges of the molecular diagram,/>Representing geometric space coordinates of the molecules in three-dimensional space;
The initialized atomic characteristics comprise an atomic node characteristic matrix, node degree codes, random walk position codes and three-dimensional distance codes, and the initialized atomic characteristics are shown in a formula (2):
X0=[Xatom|Xdegree|XRW|X3D] (2)
Wherein, X degree represents node degree code, X RW represents random walk position code, and X 3D represents three-dimensional distance code;
The node degree code X degree is shown in formula (3):
Xdegree=fα(D) (3)
Wherein D represents a degree matrix of the molecular graph, f α is a mapping function of the degree information,
The random walk position code X RW is shown in formula (4):
Wherein, The random walk position code of the node i is represented, m represents the dimension of the random walk position code, RW is a random walk operation result matrix, and the random walk operation result matrix is represented by a formula (5):
RW=AD-1 (5)
wherein D -1 represents the inverse of the degree,
The three-dimensional distance codeAs shown in formula (6):
Wherein, The three-dimensional distance coding of the node i is represented, U (i) represents a neighbor node set of the node i, U (i) represents the sum of the neighbor numbers of the node i, r i-rj || represents distance information of the node i and the node j, r i represents coordinate information of the node i, and r j represents coordinate information of the node j;
And the shortest path structure code, the side information code and the three-dimensional distance pair code are merged into the self-attention layer of the transducer by means of offset items, as shown in a formula (7):
wherein Att (X) l+1 represents the l+1st layer self-attention layer, att (X) l represents the l layer self-attention layer, SPD represents the shortest path structure encoding, edge represents the side information encoding, Representing the three-dimensional distance pair code,
The shortest path structure code SPD is shown in a formula (8):
wherein F is the shortest path between points in the molecular graph obtained according to Floyd algorithm, As a mapping function of the shortest path,
The Edge information code Edge is shown in a formula (9):
Edge=gθ(E) (9)
where g θ is a mapping function for side information,
The three-dimensional distance pair code is shown in formula (10):
Wherein r i represents coordinate information of a node i, r j represents coordinate information of a node j, alpha i,jj,βi,j,μk,σk are learnable parameters, alpha i,j,βi,j is controlled by an atomic node element type, corresponding alpha i,j,βi,j of node pairs formed by different elements are different, mu k,σk is a parameter of Gaussian kernel mapping, and k represents the number of Gaussian kernels;
The three-dimensional distance pair code interacts with the self-attention layer of the transducer, and the interaction and the self-attention layer iteration updating edge node pair feature matrix and node feature updating process of the transducer are shown in the following formula (11) -formula (12):
Wherein, Features representing the initial node pair i-j,/>M represents a mapping matrix,/>Characteristic of node pair i-j representing the first self-attention layer, H is the number of heads of attention, d is the dimension of the hidden layer,/>Is the query of the h head of the first self-attention layer,/>Is the key of the h head of the self-attention l layer,
The updated node pair characteristics are used as bias items of the self-attention layer of the next layer;
The self-supervision learning task of the defined two-dimensional space and three-dimensional space joint molecular graph comprises a two-dimensional space covering node attribute prediction task and a three-dimensional space coordinate denoising task;
The two-dimensional space hidden node attribute prediction task adopts the node attribute which is predicted to be hidden as a pre-training task, model learning molecular structure information is enabled to predict the hidden attribute by masking part of the node characteristics of the graph in input, and a loss function L 2D of the two-dimensional space hidden node attribute prediction task is shown as a formula (13):
L2D=-∑i∈Mlogp(zi∣GM) (13)
Wherein p is a conditional probability, z i represents the output of the last transducer block corresponding to node i, and G M represents the masked molecular map;
the three-dimensional space coordinate denoising task is based on atomic three-dimensional coordinate information when the three-dimensional space coordinate denoising task is input Adding Gaussian noise Disturbing molecular geometry to minimize the difference between the model predicted noise value and the input noise value, and model predicted kth coordinate dimension noise output/>As shown in equation (14):
wherein att ij represents the attention score between node i and node j, And/>Representing a learnable parameter,/>The relative position information corresponding to the kth coordinate dimension of Δij is represented, and Δij represents the relative position information of the node i and the node j, as shown in formula (15):
the loss function of the three-dimensional space coordinate denoising task is shown in formula (16):
wherein V represents all node sets in the graph, |V| represents the number of nodes, ε i represents the true coordinate noise of the ith node, Coordinate noise representing the predicted ith node;
the loss function of the molecular large model based on the multidimensional molecular information is shown in a formula (17):
L=αL2D+βL3D (17)
wherein alpha represents the loss weight of the two-dimensional space mask node attribute prediction task, and beta represents the loss weight of the three-dimensional space coordinate denoising task.
2. The method of claim 1, wherein the preprocessing comprises removing hydrogen atoms, removing charges, removing small fragments, removing chiralities and standardized tautomers, and retaining a backbone structural representation of the molecules, wherein the backbone structural representation comprises an ID number of the molecules in an original database and a one-dimensional molecular SMILES representation.
3. The method for constructing a molecular large model based on multidimensional molecular information according to claim 1, wherein the molecular conformation generation process comprises the molecular conformation generation process by RDkit kits based on the following steps:
generating a preliminary molecular conformation based on the distance geometry;
modifying the molecular conformation based on ETKDG method;
The molecular conformation was optimized based on MMFF force field.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311574206.0A CN117524353B (en) | 2023-11-23 | 2023-11-23 | Molecular large model based on multidimensional molecular information, construction method and application |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311574206.0A CN117524353B (en) | 2023-11-23 | 2023-11-23 | Molecular large model based on multidimensional molecular information, construction method and application |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117524353A CN117524353A (en) | 2024-02-06 |
CN117524353B true CN117524353B (en) | 2024-05-10 |
Family
ID=89766059
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311574206.0A Active CN117524353B (en) | 2023-11-23 | 2023-11-23 | Molecular large model based on multidimensional molecular information, construction method and application |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117524353B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117912575B (en) * | 2024-03-19 | 2024-05-14 | 苏州大学 | Atomic importance analysis method based on multi-dimensional molecular pre-training model |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113241128A (en) * | 2021-04-29 | 2021-08-10 | 天津大学 | Molecular property prediction method based on molecular space position coding attention neural network model |
CN113299354A (en) * | 2021-05-14 | 2021-08-24 | 中山大学 | Small molecule representation learning method based on Transformer and enhanced interactive MPNN neural network |
CN114566232A (en) * | 2022-02-17 | 2022-05-31 | 北京百度网讯科技有限公司 | Molecular characterization model training method and device and electronic equipment |
WO2023029351A1 (en) * | 2021-08-30 | 2023-03-09 | 平安科技(深圳)有限公司 | Self-supervised learning-based method, apparatus and device for predicting properties of drug small molecules |
CN115831261A (en) * | 2022-11-14 | 2023-03-21 | 浙江大学杭州国际科创中心 | Three-dimensional space molecule generation method and device based on multi-task pre-training inverse reinforcement learning |
CN116052792A (en) * | 2023-01-31 | 2023-05-02 | 杭州碳硅智慧科技发展有限公司 | Training method and device for molecular optimal conformation prediction model |
WO2023153882A1 (en) * | 2022-02-11 | 2023-08-17 | Samsung Display Co., Ltd. | Method for optimizing properties of a molecule |
CN116978483A (en) * | 2023-07-31 | 2023-10-31 | 浙江大学 | Molecular property prediction method and system based on graphic neural network and three-dimensional encoder |
CN116978481A (en) * | 2023-03-23 | 2023-10-31 | 腾讯科技(深圳)有限公司 | Molecular attribute prediction method, device, electronic equipment and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111755078B (en) * | 2020-07-30 | 2022-09-23 | 腾讯科技(深圳)有限公司 | Drug molecule attribute determination method, device and storage medium |
-
2023
- 2023-11-23 CN CN202311574206.0A patent/CN117524353B/en active Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113241128A (en) * | 2021-04-29 | 2021-08-10 | 天津大学 | Molecular property prediction method based on molecular space position coding attention neural network model |
CN113299354A (en) * | 2021-05-14 | 2021-08-24 | 中山大学 | Small molecule representation learning method based on Transformer and enhanced interactive MPNN neural network |
WO2023029351A1 (en) * | 2021-08-30 | 2023-03-09 | 平安科技(深圳)有限公司 | Self-supervised learning-based method, apparatus and device for predicting properties of drug small molecules |
WO2023153882A1 (en) * | 2022-02-11 | 2023-08-17 | Samsung Display Co., Ltd. | Method for optimizing properties of a molecule |
CN114566232A (en) * | 2022-02-17 | 2022-05-31 | 北京百度网讯科技有限公司 | Molecular characterization model training method and device and electronic equipment |
CN115831261A (en) * | 2022-11-14 | 2023-03-21 | 浙江大学杭州国际科创中心 | Three-dimensional space molecule generation method and device based on multi-task pre-training inverse reinforcement learning |
CN116052792A (en) * | 2023-01-31 | 2023-05-02 | 杭州碳硅智慧科技发展有限公司 | Training method and device for molecular optimal conformation prediction model |
CN116978481A (en) * | 2023-03-23 | 2023-10-31 | 腾讯科技(深圳)有限公司 | Molecular attribute prediction method, device, electronic equipment and storage medium |
CN116978483A (en) * | 2023-07-31 | 2023-10-31 | 浙江大学 | Molecular property prediction method and system based on graphic neural network and three-dimensional encoder |
Non-Patent Citations (4)
Title |
---|
Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction;Philippe Schwaller ET AL;《ACS Cent. Sci.》;20190830;全文 * |
Unified 2D and 3D Pre-Training of Molecular Representations;Jinhua Zhu ET AL;《https:arxiv.org/abs/2207.08806》;20220818;全文 * |
基于深度学习的3D分子生成模型研究进展;20230209;《中国科学:化学》;20230209;全文 * |
方剂图论和拓扑学:方剂结构的化学图论和分子拓扑学原理及其研究方法(续);冯前进;刘润兰;;山西中医学院学报;20131028(第05期);全文 * |
Also Published As
Publication number | Publication date |
---|---|
CN117524353A (en) | 2024-02-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | Hierarchical graph pooling with structure learning | |
Tahmasebi et al. | Machine learning in geo-and environmental sciences: From small to large scale | |
Kampffmeyer et al. | Rethinking knowledge graph propagation for zero-shot learning | |
Diallo et al. | Deep embedding clustering based on contractive autoencoder | |
Li et al. | Deep learning methods for molecular representation and property prediction | |
JP2023082017A (en) | computer system | |
CN113707235A (en) | Method, device and equipment for predicting properties of small drug molecules based on self-supervision learning | |
CN112905801B (en) | Stroke prediction method, system, equipment and storage medium based on event map | |
CN117524353B (en) | Molecular large model based on multidimensional molecular information, construction method and application | |
CN109858015A (en) | A kind of semantic similarity calculation method and device based on CTW and KM algorithm | |
Song et al. | Geologist-level wireline log shape identification with recurrent neural networks | |
CN111476261A (en) | Community-enhanced graph convolution neural network method | |
Zhou et al. | M-evolve: structural-mapping-based data augmentation for graph classification | |
Herath et al. | Topologically optimal design and failure prediction using conditional generative adversarial networks | |
Rizvi et al. | Spectrum of advancements and developments in multidisciplinary domains for generative adversarial networks (GANs) | |
CN116741307A (en) | Three-dimensional molecular structure simulation method for synthesis and screening of lead compounds | |
Liu et al. | A systematic machine learning method for reservoir identification and production prediction | |
CN112270950A (en) | Fusion network drug target relation prediction method based on network enhancement and graph regularization | |
CN112529057A (en) | Graph similarity calculation method and device based on graph convolution network | |
CN116208399A (en) | Network malicious behavior detection method and device based on metagraph | |
Liu et al. | Object detection via inner-inter relational reasoning network | |
Peng et al. | Pocket-specific 3d molecule generation by fragment-based autoregressive diffusion models | |
Fan et al. | Gated graph pooling with self-loop for graph classification | |
Zhu et al. | Structural landmarking and interaction modelling: a “slim” network for graph classification | |
Rolon et al. | A multi-class structured dictionary learning method using discriminant atom selection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |